# Introduction to Natural Language Processing

Within these notebooks, we will explore the difficulties and particularities of working with text data. As an example task, we use the SMS Spam Collection. It contains roughly 5'600 messages that have been manually classified as "spam" or "ham" (non-spam).

Trigger warning: Some of the text messages contain swear words or sexual content.

## Data Loading

Let's load the data and have a first look.

In [None]:
url = "https://raw.githubusercontent.com/mattminder/nlp_intro/refs/heads/main/data/sms_spam_collection/SMSSpamCollection"

In [None]:
import pandas as pd
import urllib.request
data = urllib.request.urlopen(url)

# parse the file loaded directly from GitHub (for compatibility with Colab)
lines_split = [
    line.decode().strip().split("\t")
    for line in data
]
df = pd.DataFrame(lines_split, columns=["label", "text"])

Let's look at five random messages.

In [None]:
df.sample(5, random_state=123)

Since we are dealing with text messages, our data is quite messy. 

What's the ratio of ham to spam?

In [None]:
df["label"].value_counts()
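Absolute counts answer the question only indirectly. A minimal sketch of how relative shares could be obtained, using a small made-up label series (`toy_labels` is a stand-in for `df["label"]`, so the numbers below are not those of the real dataset):

```python
import pandas as pd

# Toy labels standing in for df["label"]; the real counts will differ.
toy_labels = pd.Series(["ham", "ham", "ham", "spam"])

# normalize=True returns relative frequencies instead of absolute counts
shares = toy_labels.value_counts(normalize=True)
print(shares)
```

A strong class imbalance is worth keeping in mind later: it is the reason we look at precision and recall rather than plain accuracy.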

## Extracting Features from Text
We cannot feed raw text into a model - we need numerical values instead. In our very first model, we won't look at the words themselves, but at other features that we can extract from the text. In particular, we will calculate:
- The length of a text message
- The number of punctuation characters
- The number of upper-case letters
- The number of digits
- The number of occurrences of the letter X


In [None]:
simple_features = df.copy()

simple_features["length"] = df["text"].apply(len)
simple_features["number_punctuation"] = df["text"].apply(lambda x: sum(1 for letter in x if letter in '".,;:!?()_*'))
simple_features["number_uppercase"] = df["text"].apply(lambda x: sum(1 for letter in x if letter!=letter.lower()))
simple_features["number_numbers"] = df["text"].apply(lambda x: sum(1 for letter in x if letter in "0123456789"))
simple_features["number_x"] = df["text"].apply(lambda x: sum(1 for letter in x if letter in "xX"))

simple_features["is_spam"] = df["label"] == "spam"
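To make the five features concrete, here is a small sketch that computes them for a single message without pandas. The message and the helper name `extract_simple_features` are invented for illustration; the counting rules mirror the cell above.

```python
PUNCTUATION = '".,;:!?()_*'

def extract_simple_features(text):
    """Compute the five simple features for one message (illustrative helper)."""
    return {
        "length": len(text),
        "number_punctuation": sum(1 for letter in text if letter in PUNCTUATION),
        # a letter counts as upper-case if lower-casing changes it
        "number_uppercase": sum(1 for letter in text if letter != letter.lower()),
        "number_numbers": sum(1 for letter in text if letter in "0123456789"),
        "number_x": sum(1 for letter in text if letter in "xX"),
    }

# made-up example message, not taken from the dataset
example_features = extract_simple_features("WIN a FREE prize! Call 0800 now, txt X.")
print(example_features)
```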

Let's visualize the distribution of our new features for spam and ham as violin plots.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 5, figsize=(12, 4))
sns.violinplot(simple_features, x="is_spam", y="length", ax=axs[0])
sns.violinplot(simple_features, x="is_spam", y="number_punctuation", ax=axs[1])
sns.violinplot(simple_features, x="is_spam", y="number_uppercase", ax=axs[2])
sns.violinplot(simple_features, x="is_spam", y="number_numbers", ax=axs[3])
sns.violinplot(simple_features, x="is_spam", y="number_x", ax=axs[4])

fig.tight_layout()

How well do we perform with a simple logistic regression model?

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

train, test = train_test_split(simple_features, test_size=.2, random_state=123)

features = [
    "length",
    "number_punctuation",
    "number_uppercase",
    "number_numbers",
    "number_x",
]

logistic_regression = LogisticRegression()
logistic_regression.fit(train[features], train["is_spam"])

test_predictions = logistic_regression.predict(test[features])

Let's calculate precision and recall:

In [None]:
print("Precision:", precision_score(test["is_spam"], test_predictions))
print("Recall:", recall_score(test["is_spam"], test_predictions))

We see that we already get very good performance without even looking at the words in the data.
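As a reminder of what the two scores measure: precision is the share of messages flagged as spam that really are spam, and recall is the share of actual spam messages that were flagged. The following sketch computes both by hand on made-up labels (`toy_true` and `toy_pred` are illustrative and unrelated to the model above):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall from two lists of booleans."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # guard against division by zero when nothing was predicted / nothing is positive
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# made-up example: 3 spam messages, of which only 1 is caught, plus 1 false alarm
toy_true = [True, True, True, False, False, False]
toy_pred = [True, False, False, True, False, False]

precision, recall = precision_recall(toy_true, toy_pred)
print("Precision:", precision)
print("Recall:", recall)
```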

## Converting text into numerical values
Next, we want to create models that actually look at the content of our messages. To do this, we have to convert the content into a numerical representation that we can then pass on to a model.

Let's look at a single text message:

In [None]:
example_text = df.loc[4233, "text"]
example_text

How could we do that? Since our language is composed of words, it seems intuitive to split our messages into words. Then, we can assign a unique number to every word. Splitting a text document into smaller parts (typically words or parts of words) is called **tokenization**.

To split our messages into words, we first replace a fixed set of punctuation characters with spaces (the apostrophe ' is kept), and then split at every whitespace.

In [None]:
def remove_punctuation(text):
    for letter in '".,;:!?()_*':
        text = text.replace(letter, " ")  # replace with a space
    return text

def to_word_list(text):
    without_punctuation = remove_punctuation(text)
    return without_punctuation.split()  # splits at any whitespace

word_list = to_word_list(example_text)
word_list
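The same splitting can also be written with `str.translate`, which replaces all listed characters in one pass. This is only an alternative sketch; the function name `to_word_list_translate` and the example sentence are made up, and it is meant to behave like the loop-based version above.

```python
PUNCTUATION = '".,;:!?()_*'

# translation table that maps every punctuation character to a space
TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})

def to_word_list_translate(text):
    """Replace punctuation with spaces, then split at any whitespace."""
    return text.translate(TO_SPACE).split()

words = to_word_list_translate("Don't forget: call me (now)!")
print(words)
```

Note that the apostrophe is not in the list, so "Don't" stays a single token.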

Now we give a unique ID to every word:

In [None]:
def get_word_dictionary(word_list):
    """Create a mapping from every word to an integer."""
    return {
        word: i
        for i, word in enumerate(set(word_list))
    }


word_dict = get_word_dictionary(word_list)
number_list = [word_dict[word] for word in word_list]

# let's look at the word dictionary
word_dict

The text is now encoded as the following list of numbers:

In [None]:
number_list

Now we have obtained a first numerical representation of our sample data. However, passing it to a model like this doesn't make much sense: the numbers were assigned to the words arbitrarily, so they don't really mean anything. This is a problem, because a model will treat numbers that are close together as more similar than numbers that are far apart.

To make every word equally far apart, we can turn our sentence into a so-called one-hot encoding. We map every word to a vector whose length equals the number of distinct words in the corpus (the vocabulary size). This vector is zero everywhere except at one position: we put the value 1 at the index that was assigned to our word.

For example: If we have 5 words in total, we would create the following vectors: 
- Word 0: `[1, 0, 0, 0, 0]`
- Word 1: `[0, 1, 0, 0, 0]`
- Word 2: `[0, 0, 1, 0, 0]`
- Word 3: `[0, 0, 0, 1, 0]`
- Word 4: `[0, 0, 0, 0, 1]`

Let's implement this in Python:


In [None]:
def number_to_one_hot(number, vocabulary_size):
    """Function that takes a single integer and the vocabulary size, and creates a one-hot vector."""
    output = [0] * vocabulary_size

    # negative numbers will be reserved for when we encounter a new word
    if number >= 0:
        output[number] = 1
    return output

number_to_one_hot(2, 5)

In [None]:
# now we can encode the entire sentence
def number_list_to_one_hot(number_list, vocabulary_size):
    return [
        number_to_one_hot(number, vocabulary_size)
        for number in number_list
    ]

vocabulary_size = max(number_list) + 1

sentence_one_hot = number_list_to_one_hot(number_list, vocabulary_size)
sentence_one_hot

Now we have converted our text into a numerical representation in which every pair of distinct words is equally far apart.
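We can check this claim on a small vocabulary. The sketch below compares distances between raw integer IDs with Euclidean distances between the corresponding one-hot vectors; the helper names `one_hot` and `euclidean` are made up for this illustration.

```python
import math

def one_hot(index, size):
    """One-hot vector of the given size with a 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

size = 5

# with raw IDs, word 0 looks much closer to word 1 than to word 4
id_distance_near = abs(0 - 1)
id_distance_far = abs(0 - 4)

# with one-hot vectors, collect the distance of every pair of distinct words
one_hot_distances = {
    euclidean(one_hot(i, size), one_hot(j, size))
    for i in range(size)
    for j in range(size)
    if i != j
}
print(id_distance_near, id_distance_far, one_hot_distances)
```

The set contains a single value, the square root of 2, because two distinct one-hot vectors always differ in exactly two positions.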

## Encoding all text-messages

Of course we don't want to encode only one message, but all text messages at once. We can do this as follows:

In [None]:
df["word_lists"] = df["text"].apply(to_word_list)

def flatten_list_of_lists(list_of_lists):
    """Flattens the list of lists [[a], [b, c]] to [a, b, c]."""
    return [
        e
        for sublist in list_of_lists
        for e in sublist
    ]

# create a list of all words
list_of_all_words = flatten_list_of_lists(df["word_lists"].to_list())

# create the dictionary
full_dict = get_word_dictionary(list_of_all_words)

How big is our dictionary?

In [None]:
len(full_dict)

We have encountered over 11'000 different words in our 5'600 messages alone. Let's look at some of the words that we've found:

In [None]:
(
    pd.DataFrame(full_dict.items(), columns=["word", "index"])
    .sort_values("word")
    .sample(10, random_state=1)
)

What do we observe?
- Some words are proper names.
- Some words aren't really words, but just onomatopoeias.
- Some words are grammatical variations of a single word stem.
- Some words are misspellings. 
- Some words are upper-case.
- Some words are in another language.

How could we improve this?

1. We could remove words that are rare. This will remove misspelled words, but also rare names and places.
2. We could reduce every word to its stem (for example, by removing plural endings or verb conjugations). This is called "stemming". It would reduce the vocabulary size drastically, but we would also lose information (plural or not, tense, etc.).
3. We could convert every word to lower-case, since upper- and lower-case doesn't change the meaning of the word in English.

Or, alternatively:

4. We could split rare and conjugated words further: For example, split `killing` into `[kill, #ing]`, where `#` is a special character denoting that we split a word. This is more complex to implement, but it can still handle rare words and conjugation. This last approach is what is typically done in modern large language models. You can find a demo at this link: https://codesandbox.io/s/gpt-tokenizer-tjcjoz.
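To give a feeling for the fourth option, here is a deliberately naive sketch that splits off a few hard-coded suffixes. The function `split_suffix` and its suffix list are hypothetical; real tokenizers for language models learn their splits from data (for example with byte-pair encoding) rather than using a fixed list.

```python
def split_suffix(word, suffixes=("ing", "ed", "s")):
    """Split a word into [stem, #suffix] if it ends with a known suffix (toy example)."""
    for suffix in suffixes:
        # only split if something is left over as a stem
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], "#" + suffix]
    return [word]

subword_tokens = [
    piece
    for word in ["killing", "cats", "walk"]
    for piece in split_suffix(word)
]
print(subword_tokens)
```

Unlike stemming, nothing is thrown away here: the suffix survives as its own token, so the model can still tell "kill" and "killing" apart.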


## Stemming and removing rare words

Let's implement the first three points mentioned above. We do very rudimentary stemming by removing certain suffixes, and only keep words that occur at least 10 times.

In [None]:
def to_lower_case(word_list):
    return [
        word.lower() for word in word_list
    ]

to_lower_case(df.loc[4233, "word_lists"])


In [None]:
def rudimentary_stemming(word_list):
    suffixes_to_remove = [
        "s",  # plural suffix
        "ing",
        "ed",
    ]
    def remove_suffixes(word):
        for suffix in suffixes_to_remove:
            word = word.removesuffix(suffix)
        return word

    return [
        remove_suffixes(word)
        for word in word_list
    ]

to_show_index = 20

print("Original Sentence:  ", df.loc[to_show_index, "text"])
print("Stemmed Version:    ", rudimentary_stemming(df.loc[to_show_index, "word_lists"]))


We can see that our rudimentary stemming has a lot of flaws. More sophisticated algorithms are used in practice, for example the [Porter Stemming Algorithm](https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus).
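A few hand-picked words make the flaws easy to see. The sketch below re-implements the suffix removal in a self-contained way (using `endswith` and slicing, which should be equivalent to `removesuffix`) and applies it to made-up examples:

```python
def remove_suffixes(word, suffixes=("s", "ing", "ed")):
    """Strip each suffix in turn, in the same order as the rudimentary stemmer."""
    for suffix in suffixes:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    return word

# made-up examples that expose typical failure modes
examples = ["this", "sings", "red", "running", "bus"]
stemmed_examples = {word: remove_suffixes(word) for word in examples}
print(stemmed_examples)
```

Words that merely happen to end in one of the suffixes get truncated, and since the suffixes are removed one after the other, a word like "sings" loses both "s" and "ing".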

Now we can create a new column with the cleaned word lists.

In [None]:
df["cleaned_word_lists"] = df["word_lists"].apply(
    lambda x: rudimentary_stemming(to_lower_case(x))
)

Next, we build our dictionary - now with the criterion that every word needs to appear at least 10 times.

In [None]:
def get_frequent_word_dictionary(word_list, minimum_count=10):
    """Create a mapping from frequent words to an integer."""
    # create a dictionary with the number of occurrences of every word
    word_count = pd.Series(word_list).value_counts().to_dict()

    # identify the set of words that are frequent enough
    relevant_words = {
        word for word, count in word_count.items() if count >= minimum_count
    }

    # turn that set into a dictionary
    return {
        word: i
        for i, word in enumerate(relevant_words)
    }


# create a list of all clean words
list_of_clean_words = flatten_list_of_lists(df["cleaned_word_lists"].to_list())

# create the dictionary
frequent_dict = get_frequent_word_dictionary(list_of_clean_words)

How many different words do we have now?

In [None]:
len(frequent_dict)

This seems more sensible for a text corpus with 5'600 text messages.

We then apply the one-hot encoder to all of our text messages.

In [None]:
import numpy as np

df["word_number_list"] = df["cleaned_word_lists"].apply(
    lambda word_list: [
        frequent_dict.get(word, -1)
        for word in word_list 
    ]
)

df["one_hot_word_encoding"] = df["word_number_list"].apply(
    lambda number_list: np.array(
        number_list_to_one_hot(number_list, vocabulary_size=len(frequent_dict) + 1)
    )
)
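Words that are not in the dictionary receive the ID -1 and therefore an all-zero vector. The following self-contained sketch shows this on a made-up three-word dictionary (`toy_dict` and the message are illustrative only):

```python
def number_to_one_hot(number, vocabulary_size):
    """One-hot vector; negative numbers (unknown words) yield all zeros."""
    output = [0] * vocabulary_size
    # without this check, output[-1] would silently set the last entry
    if number >= 0:
        output[number] = 1
    return output

toy_dict = {"free": 0, "call": 1, "now": 2}
toy_message = ["call", "me", "now"]  # "me" is not in the dictionary

toy_ids = [toy_dict.get(word, -1) for word in toy_message]
toy_encoding = [number_to_one_hot(i, len(toy_dict)) for i in toy_ids]
print(toy_ids)
print(toy_encoding)
```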

Let's look at an example:

In [None]:
ix = 0
print("One-Hot Encoding:\n", df["one_hot_word_encoding"][ix])
print("Shape:", df["one_hot_word_encoding"][ix].shape)

The one-hot encoding has a shape equal to the number of words in the message times the `vocabulary_size` that we passed to the encoder (here, the size of the dictionary plus one).

Let's look at another example:

In [None]:
ix = 1
print("One-Hot Encoding:\n", df["one_hot_word_encoding"][ix])
print("Shape:", df["one_hot_word_encoding"][ix].shape)

Since this second text message is shorter, its encoding is shorter as well.

This is a problem: Standard models such as linear or logistic regression, but also feed-forward neural networks, require the input to always have the same size. How can we deal with the different lengths of our inputs?

## Bag of Words
The easiest way to deal with varying input lengths is to calculate the sum of the one-hot encodings of all words in a message. Conceptually, this just counts how many times each word occurs in the text, while ignoring the order of the words. This is why this method is called **bag of words**.

For example: The bag-of-words encoding of the sentence "hello, how are you and how are your parents?" is:

|I|you|hello|how|are|parents|your|and|
|-|---|-----|---|---|-------|----|---|
|0|1  |1    |2  |2  |1      |1   |1  |
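The table can be reproduced with a few lines of plain Python. This sketch counts words directly with `collections.Counter` instead of summing one-hot vectors, which should give the same result; the vocabulary order is copied from the table above.

```python
from collections import Counter

sentence = "hello, how are you and how are your parents?"
vocabulary = ["I", "you", "hello", "how", "are", "parents", "your", "and"]

# replace punctuation with spaces, as in our tokenizer
for letter in '".,;:!?()_*':
    sentence = sentence.replace(letter, " ")

# count every word; a Counter returns 0 for words it has never seen
counts = Counter(sentence.split())
bag = [counts[word] for word in vocabulary]
print(bag)
```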

Let's calculate the bag of words for all of our data:

In [None]:
df["bag_of_words"] = df["one_hot_word_encoding"].apply(lambda x: np.sum(x, axis=0))

df["bag_of_words"]

We can now use this to train our model:

In [None]:
bag_of_words_df = df["bag_of_words"].apply(pd.Series)
target = df["label"] == "spam"

# we have to use the same test_size and random state as above to ensure comparability
train_x, test_x, train_y, test_y = train_test_split(
    bag_of_words_df, target, test_size=.2, random_state=123
)

# let's look at our training data
train_x.head()

Let's train a linear model on that:

In [None]:
bow_model = LogisticRegression()
bow_model.fit(train_x.fillna(0), train_y)

bow_predictions = bow_model.predict(test_x.fillna(0))

In [None]:
print("Precision:", precision_score(test_y, bow_predictions))
print("Recall:", recall_score(test_y, bow_predictions))

We see that looking at the content of the messages greatly improves the performance of our model.

## What's next?
Our bag of words model has room for improvement:
- Our vectors are very sparse and don't contain a lot of information
- We don't take into account the order of the words
- We don't distinguish between lower- and upper-case words, punctuation, etc.
