# 🧱 Foundations

We will be working with [spaCy](https://spacy.io/), a Python library for Natural Language Processing (NLP). It is fast and easy to use. It has pre-trained models for various languages and can be used for tasks like tokenization, part-of-speech tagging, named entity recognition, and more. [NLTK](https://www.nltk.org/) is another popular, slightly more traditional, text analysis / NLP library which we make brief use of.

Make sure to check out the spaCy documentation for more details on how to use it: [spaCy Documentation](https://spacy.io/usage).

In [None]:
# If spaCy or the English model is not installed, install them:
# !pip install spacy
# !python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")  # load English tokenizer, POS tagger, etc.

`"en_core_web_sm"` is one of spaCy's pre-trained model pipelines. It is a combination of different machine learning and rule-based components that can be used as tools to process text in different ways. The `"en_core_web_sm"` model is a small English model that is fast and efficient, but not as accurate as larger models like `"en_core_web_md"` or `"en_core_web_lg"`, or transformer-based models. The name breaks down as follows:

* en – English language
* core – General-purpose (not domain-specific)
* web – Trained primarily on web data (blogs, news, comments)
* sm – "Small" size model (~13MB)

You can use alternative models, provided by spaCy, other users, or create your own. We'll stick with this for some basic tasks for now, and scale up later in the day.

## 🧹 Preprocessing

Preprocessing is a crucial step in text analysis and NLP that involves cleaning and transforming raw text data into a format suitable for analysis. It matters across the board, from traditional statistical models to deep learning architectures.

### 🤔 Why Preprocessing Matters

Proper preprocessing can significantly improve the performance of text analysis and some NLP models:

* It reduces noise and irrelevant variation (e.g., unifying casing, removing filler words).
* It normalizes text so that different forms of the same word are treated equally (“running” vs “ran” → “run”).
* It can reduce dimensionality (fewer unique tokens after cleaning), which is important for traditional ML models.
 
However, be mindful: aggressive preprocessing (like removing too many words or context via stemming) can sometimes hurt performance, especially for deep learning models that might prefer raw text with context. So use preprocessing appropriately for your task.

### 🧰 Common Preprocessing Steps

The following are common preprocessing steps. Depending on the task, model, and data, you should choose the appropriate steps. Some steps may be more relevant for certain tasks than others, and some may not be necessary at all (or even counterproductive).

1. **Lowercasing**: Convert all text to lowercase to ensure uniformity.
2. **Tokenization**: Split text into individual words or tokens.
3. **Removing Punctuation**: Strip punctuation marks from the text to focus on the words.
4. **Removing Stop Words**: Eliminate common words (like "the", "is", "and") that may not add significant meaning to the text. Libraries like NLTK and spaCy provide lists of stop words.
5. **Stemming/Lemmatization**: Reduce words to their base or root form. Stemming is a more aggressive approach that may not always yield valid words, while lemmatization uses a dictionary to ensure the root word is valid.
6. **Removing Numbers**: Depending on the context, you may want to remove numbers from the text.
7. **Removing Extra Whitespace**: Clean up any extra spaces or newlines in the text.
8. **Handling Special Characters**: Depending on the dataset, you may need to remove or replace special characters (like emojis or HTML tags).
9. **Handling Contractions**: Expand contractions (e.g., "don't" to "do not") to ensure uniformity.
10. **Handling URLs and Emails**: Depending on the context, you may want to remove or replace URLs and email addresses.
11. **Handling Case Sensitivity**: Depending on the task, you may want to keep the original casing of words (e.g., for named entity recognition) or convert everything to lowercase.
12. **Handling Synonyms and Antonyms**: Depending on the task, you may want to replace synonyms or antonyms with a common term (e.g., "happy" and "joyful" to "happy").
13. **Handling Negations**: Depending on the task, you may want to handle negations (e.g., "not happy" to "unhappy") to ensure the model understands the sentiment.
14. **Handling Domain-Specific Terms**: Depending on the dataset, you may need to handle domain-specific terms (e.g., medical terms, technical jargon) to ensure the model understands the context.
15. **Handling Language-Specific Features**: Depending on the language, you may need to handle specific features (e.g., accents, diacritics) to ensure the model understands the text.
16. **Handling Multilingual Text**: If your dataset contains multiple languages, you may need to preprocess each language differently or use a multilingual model.
17. **Handling Code and Programming Language Syntax**: If your dataset contains code snippets or programming language syntax, you may need to preprocess it differently to ensure the model understands the context.
18. **Handling Text Length**: Depending on the model, you may need to truncate or pad text to a specific length to ensure uniformity.
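
A few of the rule-based steps above can be sketched with nothing but the standard library. This is a minimal illustration, not a recommended pipeline: the contraction map, the placeholder strings `[URL]` and `[EMAIL]`, the regular expressions, and the function name `basic_clean` are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny contraction map; a real project would use a fuller list.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

# Rough patterns, good enough for a demonstration but not for production use.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def basic_clean(raw_text):
    """Apply a handful of rule-based cleaning steps in a fixed order."""
    cleaned = raw_text.lower()                      # lowercasing
    cleaned = URL_PATTERN.sub("[URL]", cleaned)     # replace URLs with a placeholder
    cleaned = EMAIL_PATTERN.sub("[EMAIL]", cleaned) # replace email addresses
    for short, full in CONTRACTIONS.items():        # expand contractions
        cleaned = cleaned.replace(short, full)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse extra whitespace
    return cleaned

sample = "Don't   miss it!\nDetails at https://example.com or mail INFO@example.com"
print(basic_clean(sample))
# do not miss it! details at [URL] or mail [EMAIL]
```

The order matters: URLs and emails are replaced before whitespace is collapsed, and lowercasing happens first so the contraction map only needs lowercase keys.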

Let’s take a sample sentence and perform some basic preprocessing with spaCy. We’ll use a simple example sentence:

In [None]:
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text) # spaCy tokenises and processes the text
print("Original text:", text)
print("Tokens:", [token.text for token in doc])

We could have tokenized the text manually, e.g. by splitting on whitespace and punctuation, but this rapidly becomes a non-trivial task. spaCy's tokenizer handles edge cases like contractions, abbreviations, and special characters. Notice how spaCy handled “U.K.” as one token and split “\$1 billion” into "$", "1", and "billion" separately. It also keeps punctuation (.) as a token by default. The tokenizer can be configured with custom rules to handle specific cases, such as splitting on certain characters or keeping certain tokens together, or even replaced with a completely different tokenization algorithm. This makes it a powerful and flexible tool for NLP tasks.
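
To see why manual tokenization gets awkward, here is a small sketch of two naive approaches applied to the same sentence, using only the standard library. The regular expression is an assumption chosen for illustration, not something spaCy uses internally.

```python
import re

text = "Apple is looking at buying U.K. startup for $1 billion."

# Naive approach 1: split on whitespace. Punctuation stays glued to words.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', 'billion.']

# Naive approach 2: runs of word characters, or any single non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
# ['Apple', 'is', 'looking', 'at', 'buying', 'U', '.', 'K', '.', 'startup', 'for', '$', '1', 'billion', '.']
```

Whitespace splitting leaves "billion." and "$1" as single tokens, while the regex breaks the abbreviation "U.K." into four pieces. Neither simple rule gets both cases right, which is why a tokenizer with exception handling is useful.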

spaCy’s Doc object also contains rich linguistic annotations. We can iterate through doc to get **lemmas**, **part-of-speech (POS) tags**, and **other attributes**: 

In [None]:
for token in doc:
    print(token.text, "→ lemma:", token.lemma_, "| POS:", token.pos_, "| StopWord?", token.is_stop)

Here we see:

* Lemmatization: “is” became “be”; “buying” became “buy”. Lemmas are useful to reduce inflected forms to a common base.
* Part-of-speech tags: “Apple” is a proper noun (PROPN), “is” is an auxiliary verb (AUX), “buying” a verb, etc.
* Stop words: spaCy has a built-in list of stop words for English. It flagged “is”, “at”, “for” as stop words (common function words). “Apple” and “buying” are not stop words.

Using these annotations, we can easily perform typical preprocessing tasks. For instance, to remove stop words and punctuation and get a list of normalized tokens:

In [None]:
tokens = [token.lemma_.lower() for token in doc 
          if not token.is_stop and not token.is_punct]
print("Processed tokens:", tokens)

We lemmatised and lowercased all tokens and removed common stop words (“is”, “at”, “for”) and punctuation. The result is a crude but useful representation of the "content" words in the text.

### 🏋️ Exercise

1. Try changing the example text or adding another sentence and re-running the above code. For instance, what happens if you include a contraction or a numerical date? Examine how spaCy tokenizes and lemmatizes it. You can also try printing other token attributes like token.dep_ (dependency relation) to see syntactic dependencies.

2. The spaCy pipeline we used includes POS tagging, lemmatization, etc. Suppose you wanted to add a custom preprocessing step, say, replacing all numbers with a special token like \<NUM>. This is useful for tasks like text classification where the actual number may not be important but the fact that there is a number is. How might you do it?

*Hint: You could post-process the token list or use regex on the original text before sending to spaCy.*

## 📚 Corpora

Imagine we now have a corpus of text documents (e.g. news headlines). We can use spaCy to process each document in the corpus and extract useful features for analysis. For example, we can create a **bag-of-words** representation of the corpus, where each document is represented as a vector of word counts or term frequencies. This is a common preprocessing step for traditional machine learning models.

In [None]:
financial_headlines = [
    "Apple stocks soar to record highs as investors celebrate strong earnings",
    "Global markets crash amid fears of a deepening recession and economic turmoil",
    "Central bank raises interest rates, boosting confidence in economic recovery",
    "Tesco reports impressive quarterly profits as retail sales surge",
    "Oil prices collapse after oversupply announcement, sparking industry panic",
    "Unemployment rate drops to lowest level in a decade as job market thrives",
    "Debenhams files for bankruptcy protection as retail sector faces disaster",
    "Investors optimistic as trade tensions ease and markets rally strongly",
    "Housing market slows down as prices fall and buyers worry",
    "Novo Nordisk faces lawsuit over drug safety, shares plunge on negative news",
    "Consumer confidence reaches all-time high as retail sales boom",
    "Tesla recalls thousands of vehicles over safety concerns, shares tumble",
    "Earnings miss sends shares of Trump Inc plummeting as profits disappoint",
    "Mergers and acquisitions activity hits new peak as companies celebrate growth",
    "Cryptocurrency market rebounds after sharp decline, investors cheer recovery",
    "Retail sales disappoint during holiday season as consumer confidence drops",
    "Government stimulus package boosts economic outlook, markets surge",
    "Manufacturing sector contracts for third straight month as demand weakens",
    "Lab grown meat startup secures major funding round, industry hails innovation",
    "Layoffs announced as company restructures operations, employees face uncertainty"
]

corpus = list(nlp.pipe(financial_headlines))
print("Number of documents in the corpus:", len(corpus))
for doc in corpus:
    print("Tokens:", [token.text for token in doc])
    print()

### 🧮 Document-Feature Matrices (DFMs)

We can view this as a **document-feature matrix**:

In [None]:
# import count vectorizer and pandas to display the results
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Preprocess texts with spaCy (lemmatize, remove stopwords/punctuation)
def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]

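# Use CountVectorizer with the spaCy tokenizer.
# lowercase=False passes the original casing to spaCy (the tokenizer lowercases the lemmas itself);
# token_pattern=None silences the warning that the default pattern is unused with a custom tokenizer.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, token_pattern=None)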
X = vectorizer.fit_transform(financial_headlines)

# Convert to dense matrix and view as DataFrame
df = pd.DataFrame(X.toarray(), columns=list(vectorizer.get_feature_names_out()),
                  index=financial_headlines)

The dataframe displays a count of each token in the corpus for each document. The columns represent the words in the vocabulary, and the rows represent the documents. Each row is a numerical vector that can be used to describe that document.

The token count method is a simple way to create a document-feature matrix, but there are many other ways to represent text data, such as using term frequency-inverse document frequency (TF-IDF) or word embeddings. These methods can capture more complex relationships between words and documents, and are often used in more advanced NLP tasks.
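
As a rough illustration of how TF-IDF reweights raw counts, here is a minimal pure-Python sketch. The three toy documents are hypothetical, already-tokenised stand-ins, and the formula used (term frequency times the unsmoothed `ln(N / df)`) is just one common variant.

```python
import math
from collections import Counter

# Toy, already-tokenised documents (hypothetical, invented for this example).
toy_docs = [
    ["retail", "sale", "surge"],
    ["retail", "sale", "disappoint"],
    ["oil", "price", "collapse"],
]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for toy_doc in toy_docs for term in set(toy_doc))

def tf_idf(toy_doc):
    """Return {term: tf * idf} for one document, with idf = ln(N / df)."""
    counts = Counter(toy_doc)
    return {
        term: (count / len(toy_doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

for toy_doc in toy_docs:
    print({term: round(weight, 3) for term, weight in tf_idf(toy_doc).items()})
```

In the first document, "surge" receives a higher weight than "retail" because it appears in only one document, whereas "retail" appears in two. scikit-learn's `TfidfVectorizer` applies smoothing and normalisation by default, so its numbers would differ from this sketch.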

In [None]:
df

There are some repeated words (below), but overall the matrix is sparse. This is typical for text data, where most words are not present in most documents. We can also see that some words appear more frequently than others (e.g. "market" appears in several documents). This is common in text data, where some words are more informative than others.

In [None]:
df.sum().sort_values(ascending=False).head(10)

Visualise with a wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_tokens = [token.lemma_.lower() for doc in corpus for token in doc
              if not token.is_stop and not token.is_punct]  # take all tokens from all documents that we processed with spaCy

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(" ".join(all_tokens))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

I actually think wordclouds are a bit of a gimmick, but they can be useful for quickly visualising the most common words in a corpus. The larger the word, the more frequently it appears in the corpus. This can help you identify common themes or topics in the text data.

## 😀 Sentiment Analysis

We can do some simple dictionary-based sentiment analysis to assess how positive or negative the headlines are. This approach uses a predefined list of positive and negative words (or n-grams) to score the sentiment of each document. In its simplest form, the score is the difference between the number of positive and negative tokens in the document. This is very basic, but it can be useful for quickly assessing the sentiment of a document.
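
The simplest version of this idea can be sketched in a few lines. The word lists below are a hypothetical mini-lexicon made up for illustration, and the token lists are hand-lemmatised versions of two of the headlines above.

```python
# Hypothetical mini-lexicon, invented for illustration; real lexicons contain thousands of entries.
POSITIVE = {"soar", "strong", "celebrate", "surge", "rally"}
NEGATIVE = {"crash", "fear", "collapse", "panic", "plunge"}

def lexicon_score(token_list):
    """Count positive minus negative tokens; above 0 leans positive, below 0 leans negative."""
    positives = sum(token in POSITIVE for token in token_list)
    negatives = sum(token in NEGATIVE for token in token_list)
    return positives - negatives

# Hand-lemmatised tokens, assumed for this example rather than produced by spaCy.
headline_a = ["apple", "stock", "soar", "record", "high", "investor", "celebrate", "strong", "earning"]
headline_b = ["oil", "price", "collapse", "oversupply", "announcement", "spark", "industry", "panic"]

print(lexicon_score(headline_a))  # 3
print(lexicon_score(headline_b))  # -2
```

VADER, used in the next cell, is more elaborate than this: its lexicon assigns weighted scores to words and it adds rules for things like negation and intensifiers.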

In [None]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load VADER
vader = SentimentIntensityAnalyzer()

# Analyse each text
for doc in corpus:
    score = vader.polarity_scores(doc.text)
    print(f"Text: {doc.text}")
    print(f"VADER Scores: {score}")
    print()

This works "OK" on this data. Firstly, the documents are short, so the scores are very sensitive to the presence of a few positive or negative words. Secondly, the sentiment lexicon is not perfect: VADER is designed for social media text and conversational language, not financial news. This is a common problem with dictionary-based approaches, which rely on a predefined list of words and may not capture all the nuances of the text. We could use a better specified dictionary, or one of the more advanced methods available in spaCy (see later...).

## 👤 Named Entity Recognition

Named Entity Recognition (NER) is a common task in NLP that involves identifying and classifying "named entities" in text. Named entities can include people, organizations, locations, dates, and other specific terms. NER is useful for extracting structured information from unstructured text data. We don't have time today, but I'd recommend also looking at the docs for Entity Linking. We use the `en_core_web_sm` model, which has a built-in NER component that can recognize and classify named entities in text.

In [None]:
text = ("Barack Obama was the 44th President of the United States, serving from 2009 to 2017. "
        "A member of the Democratic Party, he previously served as a U.S. Senator from Illinois from 2005 to 2008. "
        "Obama was born on August 4, 1961, in Honolulu, Hawaii.")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "→", ent.label_)


The model correctly identified Barack Obama as a **PERSON**, United States as a **GPE** (geo-political entities), the years and the full date as **DATE** entities, Democratic Party as an **ORG** (organization), U.S. Senator as an **ORG** (some models label titles or certain political offices as **ORG**), and locations like Illinois, Honolulu, Hawaii as **GPE**. Notably, "44th" was labeled **ORDINAL** (as in 44th President).

Let's scale things up, considering a book about a popular wizard by some no-name author. You can download the data [here](https://www.kaggle.com/datasets/shubhammaindola/harry-potter-books/data). Move the file to the `data` folder in this project folder and rename it `wizard_book.txt`.

In [None]:
# Read the file
with open('data/wizard_book.txt', 'r', encoding='utf-8') as file:  # assumes the file is UTF-8 encoded
    booktext = file.read()

# Process the text with spaCy
bookdoc = nlp(booktext)
print("Number of tokens in the book:", len(bookdoc))
print("Number of sentences in the book:", len(list(bookdoc.sents)))
print("Number of named entities in the book:", len(bookdoc.ents))

In [None]:
# Display the entities mentioned most frequently
ents = pd.Series([ent.text for ent in bookdoc.ents])
display(ents.value_counts()) 

In [None]:
# Create a network of the entities based on their co-occurrence in the same sentence
entity_sets = []
for sent in bookdoc.sents:
    entities = set(sent.ents)
    if len(entities) > 1:
        entity_sets.append(entities) 

# Each item in entity_sets is the group of entities that co-occur in the same sentence
entity_sets

Let's create an edgelist and plot as a network. This is a simple way to visualize the relationships between entities in the text. One could also use social network analysis techniques to analyze the relationships between entities, such as centrality measures or community detection algorithms.

In [None]:
# Build a dictionary of co-occurring entity pairs and their counts
edges = {}
for entity_set in entity_sets:
    # For each unique pair of entities in the set
    for n, entity1 in enumerate(entity_set):
        for m, entity2 in enumerate(entity_set):
            # Only consider each pair once and skip self-pairs
            if (m > n) and (entity1.text != entity2.text):
                # If the pair already exists (in either order), increment the count
                if (entity1.text, entity2.text) in edges:
                    edges[(entity1.text, entity2.text)] += 1
                elif (entity2.text, entity1.text) in edges:
                    edges[(entity2.text, entity1.text)] += 1
                else:
                    # Otherwise, initialize the count for this pair
                    edges[(entity1.text, entity2.text)] = 1
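
The same pair-counting idea can be written more compactly with `itertools.combinations` and `collections.Counter`. The sketch below runs on hypothetical entity names (plain strings standing in for entity texts) so that it is self-contained; `sentence_entities` and `edge_counts` are names chosen for this example only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-sentence entity names, invented for illustration.
sentence_entities = [
    {"Alice", "Bob", "Carol"},
    {"Alice", "Bob"},
    {"Alice", "Dave"},
]

edge_counts = Counter()
for names in sentence_entities:
    # Sorting gives each unordered pair one canonical key, e.g. ("Alice", "Bob"),
    # so there is no need to check both orderings.
    for pair in combinations(sorted(names), 2):
        edge_counts[pair] += 1

print(edge_counts.most_common(1))
# [(('Alice', 'Bob'), 2)]
```

Working with sets of strings also means that an entity mentioned twice in one sentence contributes to each pair only once.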


In [None]:
# Convert to DataFrame
df = pd.DataFrame.from_dict(edges, orient='index', columns=['weight'])

# Reset index and expand tuple into separate columns
df = df.reset_index()
df[['entity1', 'entity2']] = pd.DataFrame(df['index'].tolist(), index=df.index)
df = df[['entity1', 'entity2', 'weight']]
df = df.sort_values(by='weight', ascending=False)
df

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
# Create a graph from the top 500 edges
G = nx.from_pandas_edgelist(df.head(500), 'entity1', 'entity2', ['weight'])

# Draw the graph with edge weights
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.5)

# Get edge weights for width scaling
edges_ = G.edges(data=True)
weights = [d['weight'] for (_, _, d) in edges_]
# Normalize weights for plotting (optional: adjust scaling as needed)
max_weight = max(weights) if weights else 1
widths = [0.1 + 6 * (w / max_weight) for w in weights]

nx.draw(
    G, pos,
    with_labels=True,
    node_size=10,
    font_size=10,
    font_weight='bold',
    edge_color='gray',
    width=widths
)
plt.title("Entity Co-occurrence Network (Edge Width = Weight)")
plt.show()

It's a little bit of a mess so far, but we can already see some interesting relationships between the entities and patterns in the network.

### 🏋️ Exercise

1. How could we improve this network? What are we actually interested in?
2. Implement these changes and re-run the code. What do you notice?

## 🌯 Wrap-up

We've introduced some basic concepts in text analysis and NLP, including preprocessing, document-feature matrices, sentiment analysis, and named entity recognition.

You might have realised by this stage that (pre-)processing is pervasive. It can also be part art, part science. There are different ways and points at which you can preprocess text data, and the right choice will be sensitive to the task, data, and model you are using.

Finally, a word on validation. So far we have mostly just been applying models, without rigorously validating that they are working well on our data (e.g. in sentiment analysis, NER). How do we know how well the sentiment of our headlines is being captured, or how many of the named entities are being correctly identified? We should always first sense-check our results, then validate more formally (e.g. compare against some manual human labelling). More on this later...
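
As a small taste of what comparing against human labels can look like, here is a minimal sketch that computes accuracy, precision, and recall by hand. Both label lists are hypothetical and invented purely for illustration; they are not results from the models used above.

```python
# Hypothetical labels, invented purely for illustration: 1 = positive, 0 = negative.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_labels = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(human_labels, model_labels))
true_pos = sum(h == 1 and m == 1 for h, m in pairs)   # model and human both say positive
false_pos = sum(h == 0 and m == 1 for h, m in pairs)  # model says positive, human disagrees
false_neg = sum(h == 1 and m == 0 for h, m in pairs)  # model misses a human positive

accuracy = sum(h == m for h, m in pairs) / len(pairs)
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.75 precision=0.75 recall=0.75
```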