# Text Acquisition and Pre-processing

This component focuses on obtaining, cleaning, and preparing text from an online source. The goal is to extract content from a webpage and construct a **pandas** DataFrame containing segmented, tokenized text enriched with multiple linguistic annotations.

The workflow retrieves the raw HTML, extracts and cleans the article text, annotates it with spaCy, and converts the annotations into a DataFrame for structured analysis.

In [149]:
import re
import pandas as pd
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex  #Used by customize_tokenizer
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore")

## Text Extraction

The text used in this project is taken from the Food and Agriculture Organization of the United Nations article “World food prices dip in December.”
In a typical workflow, the HTML source would be downloaded directly from the website. For example:

```python
import requests
URL = "https://www.fao.org/newsroom/detail/world-food-prices-dip-in-december/en"
page = requests.get(URL)
html_content = page.content
```

For this project, the HTML file has already been downloaded. The file `world-food-prices.html` is included in the project directory and can be opened and processed as a standard text file.
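The choice between the local copy and a fresh download can be wrapped in one helper. The sketch below is only an illustration: `load_html` is a hypothetical name that does not appear in the project, and the `timeout` value is an assumption. It reads the local file when it exists and falls back to `requests` only when the file is missing.

```python
from pathlib import Path


def load_html(path, url=None, encoding="utf-8"):
    """Return HTML from a local file, downloading it first only if the file is missing."""
    file = Path(path)
    if not file.exists():
        if url is None:
            raise FileNotFoundError(f"{path} not found and no URL was given")
        # Third-party dependency, imported lazily so the offline path does not need it
        import requests
        response = requests.get(url, timeout=30)  # timeout chosen arbitrarily for this sketch
        response.raise_for_status()               # fail loudly on HTTP errors
        file.write_bytes(response.content)        # cache the page for later runs
    return file.read_text(encoding=encoding)
```

Calling `load_html("world-food-prices.html", URL)` would then behave like the file-based cell in this notebook whenever the file is already present.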


In [150]:
import requests
URL="https://www.fao.org/newsroom/detail/world-food-prices-dip-in-december/en"
page=requests.get(URL)
html_content=page.content

In [151]:
with open("world-food-prices.html",encoding="utf8") as html_file:
    html_content=html_file.read()
html_content[:1500]

' <!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <title>\n\tWorld food prices dip in December\n</title> <script src="/ScriptResource.axd?d=okuX3IVIBwfJlfEQK32K3hu4wA2qYZOscmtsXGLNMaT1SeSa2ByRKpPz9pkmicdQmLZjrfXbzQg-t-PYtREZ1mv-AHy-XqG8V1C8KEuJc1LwVjfZ2AWtsXusqOzwjxwAkWajaiTob5rdLJ_1Q_rhyISygdJ2WS4kb3-Mf0bSt_7dAdqZ2JnDovQKGlnv0vvH0&amp;t=ffffffffb0940fc0" type="text/javascript"></script><script src="/ScriptResource.axd?d=ePnjFy9PuY6CB3GWMX-b_9Fw4jG3rW51lh6cTRiQ1f_9YOhRVOpDf4gVRQwVzn4JRlDVp-Aj_GWhYCgMY8uVHBZj_w4a27EVOxonvJSMs3yERFILsgdOHu7up3GVU-jExdmK0YWhyY1E0W4ye5rzFrSYUigZQBN7nFt18-5XwfQs2ZTBZ5-Na5q3Phaw58Dx0&amp;t=ffffffffb0940fc0" type="text/javascript"></script><script src="https://cse.google.com/cse.js?cx=018170620143701104933%3Aqq82jsfba7w" type="text/javascript"></script><link href="/ResourcePackages/FAO/assets/dist/css/bootstrap.min.css?v=5.2.0&amp;package=FAO" rel="styleshe

The HTML document contained extensive markup, embedded scripts, and navigation elements that were not relevant to the article. The first stage focused on isolating only the textual content from the body of the post. This was handled by completing the `extract_text` function, which parsed the HTML using **BeautifulSoup**, located the element with the ID `"Contentplaceholder1_C011_Col00"`, and extracted its text content. The resulting output began with the article title and matched the expected opening excerpt; its first 580 characters are shown after the function.
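To make the idea of id-scoped extraction concrete without any third-party dependency, the sketch below uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup. It is not the code used in this project: `IdTextExtractor`, `extract_text_by_id`, and the sample HTML are hypothetical, and the parser makes the simplifying assumption that the markup is well nested.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextExtractor(HTMLParser):
    """Collect the text found inside the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means "outside the target element"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1                      # nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1                       # entering the target element

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())     # keep only non-empty text pieces


def extract_text_by_id(html, target_id):
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ('<html><body><nav>Menu</nav><div id="post"><h1>Title</h1>'
          '<p>Body <b>text</b>.<br>More</p></div><footer>Foot</footer></body></html>')
print(extract_text_by_id(sample, "post"))  # Title Body text . More
```

The navigation and footer text are dropped because they sit outside the target element, which is the same effect the project obtains by calling `find(id=...)` before `get_text`.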

In [152]:
def extract_text(html_content):
    soup=BeautifulSoup(html_content, 'html.parser')
    post_body=soup.find(id="Contentplaceholder1_C011_Col00")
    if post_body:
        return post_body.get_text(separator=' ', strip=True)
    return ""

#Read the HTML content from the file
with open('world-food-prices.html', 'r', encoding='utf-8') as file:
    html_content=file.read()

In [153]:
text=extract_text(html_content)
text[:580]

'World food prices dip in December FAO Food Price Index ends 2022 lower than a year earlier A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich 06/01/2023 Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier. However, for 2022 as a whole, the index, which tracks monthly changes in the interna'

## Text Cleanup

The raw output from `extract_text` remained noisy due to multiple newline characters and irregular spacing. The next stage involved completing the `clean_text` function to normalize the text. The function removed redundant newline characters, collapsed extra spaces, and appended missing periods to sentence-like units such as *World food prices dip in December* and *06/01/2023*.
Python string methods and regular expressions were used to perform these transformations.

The final output returned by `clean_text` produced a continuous, cleaned version of the article, with the first 499 characters matching the expected reference excerpt.
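The order of the cleanup steps matters: once all whitespace has been collapsed, the line breaks that mark sentence-like units are gone and there is nothing left to attach a period to. The following minimal sketch, which uses a made-up three-line sample and a hypothetical helper name, contrasts the two orders.

```python
import re

raw = "World food prices dip in December\n   06/01/2023\n\nRome reported today."

# Collapsing whitespace first erases the line breaks, so no periods can be added afterwards
collapsed_first = re.sub(r"\s+", " ", raw).strip()


def add_periods_then_collapse(text):
    # Append a period to every line whose last non-space character is not a period
    with_periods = re.sub(r"([^\s.])[ \t]*$", r"\1.", text, flags=re.MULTILINE)
    # Only then merge newlines and repeated spaces into single spaces
    return re.sub(r"\s+", " ", with_periods).strip()


print(collapsed_first)
# World food prices dip in December 06/01/2023 Rome reported today.
print(add_periods_then_collapse(raw))
# World food prices dip in December. 06/01/2023. Rome reported today.
```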

In [154]:
def clean_text(text):
    cleaned_lines=[]
    for line in text.splitlines():
        #Collapse runs of whitespace inside the line and trim both ends
        line=re.sub(r'\s+', ' ', line).strip()
        if not line:
            continue  #Skip blank lines
        #Add a period to sentence-like units that do not end with one
        if not line.endswith('.'):
            line += '.'
        cleaned_lines.append(line)
    #Join the units into one continuous string (always returned, even if unchanged)
    return ' '.join(cleaned_lines)

#The text
text='''
World food prices dip in December
FAO Food Price Index ends 2022 lower than a year earlier
                                A farmer in Sicily carrying wheat seeds.
                             \n\n©FAO/Giorgio Cosulich \n\n
06/01/2023\n\n
Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.
'''
cleaned_text=clean_text(text)
cleaned_text[:499]

'World food prices dip in December. FAO Food Price Index ends 2022 lower than a year earlier. A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich. 06/01/2023. Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.'

## Pre-processing

After extraction and cleanup, the text was prepared for linguistic pre-processing using the **spaCy** library. spaCy provides modular processing pipelines with built-in components for tasks such as sentence segmentation, tokenization, lemmatization, stopword identification, part-of-speech tagging, dependency parsing, named entity recognition, and word embeddings.

For this project, the English CPU-optimized pipeline (*en_core_web_sm*) was used. This pipeline was loaded and applied to the cleaned text to generate the linguistic annotations required for subsequent analysis.

In [155]:
nlp=spacy.load("en_core_web_sm")

In [156]:
#Define the process_text function
def process_text(text, nlp):
    #Process the text with spaCy
    doc=nlp(text)
    return doc

#With the data of World Food Prices
with open('world-food-prices.html', 'r', encoding='utf-8') as file:
    html_content=file.read()

text=extract_text(html_content)
cleaned_text=clean_text(text)

In [157]:
doc=process_text(cleaned_text, nlp)

#Check if the doc has the necessary annotations
print(all(map(doc.has_annotation, ["LEMMA", "POS", "ENT_TYPE"])))

True


## Creating a DataFrame

The final stage involved converting the linguistic annotations from the spaCy `Doc` object into a structured **pandas** DataFrame. The DataFrame was designed to store one row per token and included the following fields:

* `sent_id`: index of the sentence within the document
* `token_id`: index of the token within its sentence
* `text`: token text
* `lemma`: token lemma
* `pos`: part-of-speech tag
* `ent`: named entity label, if present

The `to_dataframe` function iterated over each sentence in the `Doc` and over each token within the sentence. For every token, the corresponding annotation values were extracted and appended as rows of the DataFrame.

The reference output for the sentence with `sent_id = 1` is shown in the table below. The actual run printed after the function reproduced it, with each token assigned its text, lemma, and POS tag according to spaCy’s annotations, except for the entity labels of `Price` and `Index`, which the model left empty.


|    |   sent_id |   token_id | text    | lemma   | pos   | ent   |
|---:|----------:|-----------:|:--------|:--------|:------|:------|
|  7 |         1 |          0 | FAO     | FAO     | PROPN | ORG   |
|  8 |         1 |          1 | Food    | Food    | PROPN | ORG   |
|  9 |         1 |          2 | Price   | Price   | PROPN | ORG   |
| 10 |         1 |          3 | Index   | Index   | PROPN | ORG   |
| 11 |         1 |          4 | ends    | end     | VERB  |       |
| 12 |         1 |          5 | 2022    | 2022    | NUM   | DATE  |
| 13 |         1 |          6 | lower   | low     | ADJ   |       |
| 14 |         1 |          7 | than    | than    | ADP   |       |
| 15 |         1 |          8 | a       | a       | DET   | DATE  |
| 16 |         1 |          9 | year    | year    | NOUN  | DATE  |
| 17 |         1 |         10 | earlier | early   | ADV   | DATE  |
| 18 |         1 |         11 | .       | .       | PUNCT |       |
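The table also shows why the row index (7 to 18) and `token_id` (0 to 11) differ: the row index counts tokens across the whole document, while `token_id` restarts in every sentence. The sketch below reproduces that numbering with plain Python lists. The hand-written token lists are an assumption standing in for `doc.sents`, and `to_rows` is a hypothetical helper, not the project's `to_dataframe`.

```python
def to_rows(sentences):
    """Build one record per token, numbering tokens from 0 within each sentence."""
    rows = []
    for sent_id, sentence in enumerate(sentences):
        for token_id, token in enumerate(sentence):
            rows.append({"sent_id": sent_id, "token_id": token_id, "text": token})
    return rows


# Hand-tokenized stand-ins for the first two sentences of the article
sentences = [
    ["World", "food", "prices", "dip", "in", "December", "."],
    ["FAO", "Food", "Price", "Index", "ends", "2022",
     "lower", "than", "a", "year", "earlier", "."],
]

rows = to_rows(sentences)
print(rows[7])   # {'sent_id': 1, 'token_id': 0, 'text': 'FAO'}
print(rows[18])  # {'sent_id': 1, 'token_id': 11, 'text': '.'}
```

Passing such a list of dictionaries to `pd.DataFrame` yields one column per key, and the position in the list becomes the default row index.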


In [158]:
def to_dataframe(doc):
    data=[]
    for sent_id, sent in enumerate(doc.sents):
        for token_id, token in enumerate(sent):
            data.append({
                "sent_id": sent_id,
                "token_id": token_id,
                "text": token.text,
                "lemma": token.lemma_,
                "pos": token.pos_,
                "ent": token.ent_type_ if token.ent_type_ else None
            })
    df=pd.DataFrame(data)
    return df

#Load the spaCy model
nlp=spacy.load("en_core_web_sm")

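#Apply the custom tokenizer (customize_tokenizer is defined in the
#"Customizing the Tokenizer" section below, so that cell must be run first)
nlp.tokenizer=customize_tokenizer(nlp)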

#Ensure sentence boundaries are set
if not nlp.has_pipe("parser"):
    nlp.add_pipe("sentencizer")

#Example text for processing
text="""World food prices dip in December. FAO Food Price Index ends 2022 lower than a year earlier. A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich. 06/01/2023. Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier."""

#Process the text
doc=nlp(text)

#Convert the doc to a DataFrame
df=to_dataframe(doc)

#Display the DataFrame rows where sent_id is 1 (as an example)
print(df[df.sent_id == 1])

    sent_id  token_id     text  lemma    pos   ent
7         1         0      FAO    FAO  PROPN   ORG
8         1         1     Food   Food  PROPN   ORG
9         1         2    Price  Price  PROPN  None
10        1         3    Index  Index  PROPN  None
11        1         4     ends    end   VERB  None
12        1         5     2022   2022    NUM  DATE
13        1         6    lower    low    ADJ  None
14        1         7     than   than    ADP  None
15        1         8        a      a    DET  DATE
16        1         9     year   year   NOUN  DATE
17        1        10  earlier  early    ADV  DATE
18        1        11        .      .  PUNCT  None


## Customizing the Tokenizer

The default tokenizer in the `en_core_web_sm` pipeline did not split slash-separated dates such as `06/01/2023`, resulting in the entire date being treated as a single token. To address this, the pipeline was updated with a custom tokenizer that enforced the separation of slashes and numeric segments within dates.

The `customize_tokenizer` function received the spaCy pipeline, preserved the default vocabulary along with all existing prefix, infix, and suffix rules, and extended the infix patterns by adding a regular expression to capture the slash (`/`) character. No special-case rules or URL-matching rules were introduced, maintaining the default behavior aside from the updated infix processing.

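The resulting tokenizer segmented dates such as `06/01/2023` into individual tokens (`06`, `/`, `01`, `/`, `2023`), matching the expected tokenization. The entity annotations differed from the reference: in the actual run printed at the end of this section, spaCy labeled every token of the date as `DATE`, whereas the reference table lists `CARDINAL` for `06` only.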
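What an infix rule does to a single whitespace-delimited chunk can be illustrated with the standard `re` module alone. This is a simplified model, not spaCy's implementation: `split_on_infixes` is a hypothetical helper, and the real tokenizer also applies prefix, suffix, and exception rules.

```python
import re


def split_on_infixes(chunk, infix_re):
    """Split a chunk at every infix match, keeping the matched text as its own token."""
    pieces = []
    start = 0
    for match in infix_re.finditer(chunk):
        if match.start() > start:
            pieces.append(chunk[start:match.start()])  # text before the infix
        if match.group():
            pieces.append(match.group())               # the infix itself
        start = match.end()
    if start < len(chunk):
        pieces.append(chunk[start:])                   # text after the last infix
    return pieces


SLASH = re.compile(r"/")
print(split_on_infixes("06/01/2023", SLASH))    # ['06', '/', '01', '/', '2023']
print(split_on_infixes("©FAO/Giorgio", SLASH))  # ['©FAO', '/', 'Giorgio']
```

In this simplified model the rule is not specific to dates: any chunk containing a slash is split, as the second call shows.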

**Before**

|    |   sent_id |   token_id | text       | lemma      | pos   | ent   |
|---:|----------:|-----------:|:-----------|:-----------|:------|:------|
| 32 |         4 |          0 | 06/01/2023 | 06/01/2023 | NUM   |       |
| 33 |         4 |          1 | .          | .          | PUNCT |       |


**After**

With the custom tokenizer applied to the `en_core_web_sm` pipeline, the same sentence was expected to look like this:

|    |   sent_id |   token_id | text   | lemma   | pos   | ent      |
|---:|----------:|-----------:|:-------|:--------|:------|:---------|
| 32 |         4 |          0 | 06     | 06      | NUM   | CARDINAL |
| 33 |         4 |          1 | /      | /       | SYM   |          |
| 34 |         4 |          2 | 01     | 01      | NUM   |          |
| 35 |         4 |          3 | /      | /       | SYM   |          |
| 36 |         4 |          4 | 2023   | 2023    | NUM   |          |
| 37 |         4 |          5 | .      | .       | PUNCT |          |



In [159]:
def customize_tokenizer(nlp):
    #Get the default infix patterns and add a pattern for '/'
    infix_patterns=list(nlp.Defaults.infixes) + [r'/']
    #Compile the infix regex with the added pattern
    infix_re=compile_infix_regex(infix_patterns)

    #Create a custom tokenizer that keeps every other default setting
    custom_tokenizer=Tokenizer(
        nlp.vocab,
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=nlp.tokenizer.prefix_search,
        suffix_search=nlp.tokenizer.suffix_search,
        infix_finditer=infix_re.finditer,
        token_match=nlp.tokenizer.token_match,
        url_match=nlp.tokenizer.url_match,
    )
    return custom_tokenizer

#Load the spaCy model
nlp=spacy.load("en_core_web_sm")

#Apply the custom tokenizer
nlp.tokenizer=customize_tokenizer(nlp)

#With the data of World Food Prices text for processing
text="""World food prices dip in December. FAO Food Price Index ends 2022 lower than a year earlier. A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich. 06/01/2023. Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier."""

#Process the text
doc=nlp(text)

#Convert the doc to a DataFrame with the to_dataframe function defined earlier
df=to_dataframe(doc)

#Display the DataFrame rows where sent_id is 4 (as an example)
print(df[df.sent_id == 4])

    sent_id  token_id  text lemma    pos   ent
33        4         0    06    06    NUM  DATE
34        4         1     /     /    SYM  DATE
35        4         2    01    01    NUM  DATE
36        4         3     /     /    SYM  DATE
37        4         4  2023  2023    NUM  DATE
38        4         5     .     .  PUNCT  None
