# Natural Language Processing (NLP) (without Deep Learning for now)

A large amount of textual data can serve as a source of valuable information, and for this reason, **natural language processing** (NLP) has become an important subfield of machine learning in recent years. In the upcoming exercises, we will explore the basic principles of NLP and tasks specific to texts. In today's exercise, we will demonstrate an example of **sentiment analysis**, which involves categorizing text based on the author's perspective or opinion.

For now, we will not use neural networks for this task, as it is important for you to understand the sequence of steps involved in text processing. In the next exercise, we will show how a special type of neural network, a recurrent neural network (RNN), can help with this task. We will conclude our exploration of NLP by looking at **attention mechanisms** and **transformers**, which represent the state-of-the-art in NLP.


In today's exercise, we will work with the **IMDb dataset**, which contains 50,000 movie reviews categorized into two classes: **positive** (more than 6 stars) and **negative** (less than 5 stars). After downloading the dataset, we will tackle tasks such as:

1. **Data preprocessing**;
2. **Text data vectorization**;
3. **Training a machine learning model for classification**;
4. **Working with large text data**;
5. **Estimating the content of the text**.


## 1. Data Preprocessing

[You can download the dataset from this link](http://ai.stanford.edu/~amaas/data/sentiment/), then unzip the file (approximately 80 MB).

You might notice that the data is split into two directories for training and testing, and within these folders, you will find many individual files. For easier handling, we will combine these files into a single CSV file (this process might take a few minutes). If you prefer not to wait for the results, [you can download the pre-processed CSV file here](lab07/movie_data.csv) (approximately 64 MB).


You can unzip the files directly in Python, which might be faster:


In [None]:
import tarfile

# Extract into lab07/ so that the files end up in lab07/aclImdb (see BASEPATH below)
with tarfile.open("lab07/aclImdb_v1.tar.gz", 'r:gz') as tar:
    tar.extractall(path="lab07")

In [None]:
import os
import sys

import pandas as pd
# pip install pyprind
import pyprind

BASEPATH = "lab07/aclImdb"

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for subdir in ('test', 'train'):
    for cat in ('pos', 'neg'):
        path = os.path.join(BASEPATH, subdir, cat)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[cat]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

Before saving the loaded data, we will shuffle it randomly and then save it to a CSV file (if you downloaded the pre-made file, you can skip these steps).


In [None]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv("movie_data.csv", index=False, encoding='utf-8')

Next, we'll load the dataset again (this step is necessary), make some adjustments, and check the amount and content of the data.


In [None]:
df = pd.read_csv("movie_data.csv", encoding='utf-8')
# this step is needed on some computers
# df = df.rename(columns={"0": "review", "1": "sentiment"})
print(df.shape)
print(df.head(3))

## 2. Vectorization of Text Data

Neural networks, like other machine learning algorithms, operate on numerical data, which text is not. Text and other categorical data must therefore be transformed into a numerical representation. In this step, we will use the **bag-of-words** approach, which represents each document as a numerical feature vector. This process takes place in two steps:

1. We will create a collection of unique tokens—such as words—from all the documents.
2. We will construct a feature vector for each document, where the vector contains information about how many times a particular word appears in that document.

It is clear that most of the values in the vectors will be 0, i.e., the vectors will be **sparse**, which is exactly what we need.
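
Before turning to `scikit-learn`, the two steps can be sketched in plain Python. This is only a minimal illustration, not how `CountVectorizer` is implemented; the helper name `bag_of_words` and the simple whitespace tokenization are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors) for a list of documents (hypothetical helper)."""
    # Step 1: collect the unique tokens of all documents and index them alphabetically.
    tokenized = [doc.lower().split() for doc in docs]
    unique_tokens = sorted({tok for tokens in tokenized for tok in tokens})
    vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}
    # Step 2: count how often each vocabulary entry occurs in each document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["roses are red", "violets are blue"])
print(vocabulary)  # {'are': 0, 'blue': 1, 'red': 2, 'roses': 3, 'violets': 4}
print(vectors)     # [[1, 0, 1, 1, 0], [1, 1, 0, 0, 1]]
```

Even in this tiny example, each vector already contains zeros for the words that occur only in the other document; with a vocabulary of tens of thousands of words, almost all entries are zero.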


### 2.1. Generating Feature Vectors

To generate feature vectors, we will use the `scikit-learn` library, which is part of the Anaconda installation. We will demonstrate the process using simple data and later apply it to our dataset.


In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['Roses are red',
                 'Violets are blue',
                 'Roses are red, violets are blue, wine costs less than dinner for two'])
bag = count.fit_transform(docs)

Next, we can check and analyze the contents of the generated feature vectors.


In [None]:
print(count.vocabulary_)

In [None]:
print(bag.toarray())

The generated vocabulary maps each word to its index in the feature vectors, and each vector contains the number of times each word occurs in the corresponding document. These counts are also referred to as the **raw term frequencies**: *tf(t, d)*, which denotes the number of occurrences of term *t* in document *d*. The position of a term within the vector carries no meaning; it is simply the term's index in the vocabulary (assigned in alphabetical order).

**Note**: In our bag-of-words model, we used a 1-gram (unigram) model, but there are other representations where a single term consists of multiple tokens, such as bigrams: *roses are*, *are red*. Different tasks require different types of representation.
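
How n-grams are formed from a token sequence can be sketched in a few lines. The helper `ngrams` is a hypothetical name used only for this illustration; in `scikit-learn`, the `ngram_range` parameter of the vectorizers controls which n-gram sizes are generated.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token terms of a token list (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "roses are red".split()
print(ngrams(tokens, 1))  # ['roses', 'are', 'red']
print(ngrams(tokens, 2))  # ['roses are', 'are red']
```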


A disadvantage of the TF representation is that some words frequently appear in examples of both types (positive and negative), and therefore they typically don't provide significant value for classification. Instead of considering their raw frequency in the data, we can use the **term frequency-inverse document frequency** technique:

$$
tf{\text -}idf(t, d) = tf(t, d) \times idf(t, d),
$$

where

$$
idf(t,d) = \log \frac{n_{d}}{1 + df(d, t)}.
$$

Here, $n_{d}$ is the total number of documents, and $df(d, t)$ is the number of documents *d* containing term *t*. Specific implementations in the `scikit-learn` library work with minor adjustments, but that's beyond the scope for now.
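
To make the formula concrete, the following sketch evaluates it literally on three tiny, already tokenized documents. It follows the equations above, not the adjusted variant used inside `scikit-learn`, so the numbers will differ from the library output. The natural logarithm and all function names are assumptions of this example.

```python
import math

docs = [
    ["roses", "are", "red"],
    ["violets", "are", "blue"],
    ["roses", "are", "red", "violets", "are", "blue", "wine"],
]

def tf(term, doc):
    # raw term frequency: occurrences of the term in one document
    return doc.count(term)

def df(term, docs):
    # document frequency: number of documents that contain the term
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    # assumption: natural logarithm
    return math.log(len(docs) / (1 + df(term, docs)))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(idf("wine", docs), 3))              # 0.405  (rare term, 1 of 3 documents)
print(round(idf("are", docs), 3))               # -0.288 (term occurs in every document)
print(round(tf_idf("are", docs[2], docs), 3))   # -0.575 (tf = 2 in the third document)
```

The rare word *wine* receives a higher weight than *are*, which appears in every document and is therefore down-weighted.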

We can convert our representation into the TF-IDF form using the following code:


In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

### 2.2 Data Cleaning

Real-world text data often contain special characters that don't carry any meaningful information, and therefore, it is advisable to remove them. Take the example from a movie review:



In [None]:
print(df.loc[0, 'review'][-50:])

The text contains HTML markup as well as punctuation marks. While punctuation can be crucial for evaluating text in certain cases, it is unnecessary in ours. Therefore, we will remove both punctuation and HTML tags. As a bonus, we will keep emoticons such as *:)*, since they can carry sentiment.

Here’s an example of text before and after cleaning:


In [None]:
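import re

def preprocessor(text):
    # remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace non-word characters with spaces, append emoticons without the "nose"
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text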

In [None]:
print(preprocessor(df.loc[0, 'review'][-50:]))
print(preprocessor("</a>This :) is :( a test :-)!"))
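
To see what the emoticon pattern used in `preprocessor` captures, the expression can be tried in isolation. This is a small sketch on the test string from above; the constant name `EMOTICON_RE` is introduced only for this example.

```python
import re

# Same pattern as in preprocessor, written as a raw string:
#   (?::|;|=)      eyes:  ':', ';' or '='
#   (?:-)?         optional nose '-'
#   (?:\)|\(|D|P)  mouth: ')', '(', 'D' or 'P'
EMOTICON_RE = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

sample = "</a>This :) is :( a test :-)!"
emoticons = EMOTICON_RE.findall(sample)
print(emoticons)                                # [':)', ':(', ':-)']
# Removing the nose normalizes ':-)' and ':)' to the same token.
print([e.replace('-', '') for e in emoticons])  # [':)', ':(', ':)']
```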

Now, we can apply the preprocessing steps to our actual data:

In [None]:
df['review'] = df['review'].apply(preprocessor)

### 2.3. Tokenization of Documents

The first step in processing text is splitting it into smaller chunks, called tokens, which are typically words. The basic approach for splitting text into words is to divide sentences based on spaces:


In [None]:
def tokenizer(text):
    return text.split()

print(tokenizer("runners like running and thus they run"))

### 2.4. Stemming (and Lemmatization)

As you can see in the previous example, some words (such as *running* and *run*, and partially *runners*) represent similar concepts. Therefore, it is unnecessary to have them multiple times in the vocabulary. We also need to process plural forms and convert words into their singular form. This process is called **stemming**, and there are several algorithms to perform it. One of the most basic ones is **Porter stemming**, which is available in the Natural Language Toolkit (NLTK) library, a part of Anaconda.


In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

print(tokenizer_porter("runners like running and thus they run"))

As the example above shows, stemming is not always accurate (for example, *thus* was incorrectly reduced to *thu*). A related process, **lemmatization**, aims to return the base dictionary form (the lemma) of each word. While lemmatization is computationally more expensive, it typically produces better results.

#### Key Differences:
- **Stemming** is faster but may result in incorrect reductions (e.g., *thus* → *thu*).
- **Lemmatization** is slower but more accurate, as it returns the correct base form of the word based on its meaning and context.

In many NLP tasks, you can choose the method based on your requirements: if speed is more important and minor inaccuracies are acceptable, use stemming. If precision is critical, lemmatization is the better choice.
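
The contrast can be illustrated with a deliberately tiny toy example. Neither function below is a real algorithm: `toy_stem` is not the Porter stemmer, and `TOY_LEMMAS` is a hypothetical lookup table standing in for a real lemma dictionary. The sketch only shows why rule-based suffix stripping is cheap but error-prone, while a dictionary lookup returns valid words.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule (toy example, NOT the Porter algorithm)."""
    for suffix in ("ing", "ers", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemma dictionary.
TOY_LEMMAS = {"running": "run", "runs": "run", "runners": "runner"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "runs", "runners", "thus"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
# running: stem=runn, lemma=run
# runs: stem=run, lemma=run
# runners: stem=runn, lemma=runner
# thus: stem=thu, lemma=thus
```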


### 2.5. Removal of Stop Words

Stop words are words that occur frequently in all kinds of texts (for example *is*, *and*, *has*) and therefore carry little useful information for classification or other machine learning tasks. Although *tf-idf* partially eliminates the need to remove such words by down-weighting frequently occurring words, removing them can still be useful, especially when working with raw term frequencies. Predefined stop-word lists are available for many languages.


In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
print([w for w in tokenizer_porter("a runner likes running and runs a lot") if w not in stop])

## 3. Training a Classification Model

In today's exercise, we will not be using neural networks, but we will demonstrate the process of training a classification model using logistic regression. During the training, we will also use hyperparameter optimization with `sklearn`. First, we will prepare the training and test data:


In [None]:
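# .loc slicing is label-based and includes the end label, so .iloc is used
# here to obtain two disjoint halves of 25,000 reviews each
X_train = df['review'].iloc[:25000].values
y_train = df['sentiment'].iloc[:25000].values
X_test = df['review'].iloc[25000:].values
y_test = df['sentiment'].iloc[25000:].values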

Next, we will train the model:


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
small_param_grid = [
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [None],
        'vect__tokenizer': [tokenizer, tokenizer_porter],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words': [stop, None],
        'vect__tokenizer': [tokenizer],
        'vect__use_idf': [False],
        'vect__norm': [None],
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    }
]
lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy', cv=5,
                           verbose=2, n_jobs=1)  # n_jobs=-1 for parallel processing
gs_lr_tfidf.fit(X_train, y_train)
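
To get a feeling for why this search takes a while, it helps to count how many models are fitted. The sketch below counts the combinations in a grid of the same shape; `count_candidates` is a hypothetical helper, and placeholder strings stand in for the tokenizer functions and the stop-word list.

```python
from itertools import product

def count_candidates(param_grid):
    """Count the parameter combinations in a list of grids (hypothetical helper)."""
    return sum(len(list(product(*grid.values()))) for grid in param_grid)

# Placeholder strings stand in for the tokenizer functions and the stop-word list.
grid = [
    {'vect__ngram_range': [(1, 1)], 'vect__stop_words': [None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l2'], 'clf__C': [1.0, 10.0]},
    {'vect__ngram_range': [(1, 1)], 'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'], 'vect__use_idf': [False],
     'vect__norm': [None], 'clf__penalty': ['l2'], 'clf__C': [1.0, 10.0]},
]
n_candidates = count_candidates(grid)
cv = 5
print(n_candidates, n_candidates * cv)  # 8 40
```

With 5-fold cross-validation, the 8 candidates lead to 40 pipeline fits, plus one final refit of the best candidate on the whole training set (the default behavior of `GridSearchCV`). Each fit repeats the tokenization of the training documents.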

The most successful parameter settings and the corresponding results can be found using the following code:


In [None]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

We can also verify this on the test data:


In [None]:
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

## 4. Working with Large Text Datasets

In the previous step, you may have noticed that hyperparameter optimization takes quite a long time, because preprocessing steps such as tokenization and stemming are repeated for every parameter combination and cross-validation fold. With a large dataset, the data may not even fit into the computer's memory, which can cause such a search to fail.

Neural networks address similar problems with minibatch training. A comparable approach for other models is called **out-of-core learning**: the model is fitted incrementally on small batches of examples using the `partial_fit` method.

In the first step, we define a new tokenizer with stop word removal:


In [None]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
def new_tokenizer(text):
    # raw strings avoid invalid-escape warnings for \) \( and \W
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

Next, we define a generator that will return individual documents from the entire dataset:


In [None]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [None]:
next(stream_docs(path="movie_data.csv"))

The next function will provide a single minibatch of training data:


In [None]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
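
Note that `get_minibatch` returns `(None, None)` as soon as the stream runs out, even if some documents of the last batch were already collected. For our dataset this does not matter, because 50,000 documents divide evenly into the batch sizes used below. A possible variant that keeps a shorter final batch could look like the following sketch; the name `get_minibatch_keep_rest` is hypothetical, and a tiny in-memory iterator stands in for `stream_docs`.

```python
from itertools import islice

def get_minibatch_keep_rest(doc_stream, size):
    """Like get_minibatch, but returns a shorter final batch instead of dropping it."""
    batch = list(islice(doc_stream, size))  # takes up to `size` items
    if not batch:
        return None, None
    docs, y = zip(*batch)
    return list(docs), list(y)

# Tiny in-memory stream standing in for stream_docs(...)
stream = iter([("good", 1), ("bad", 0), ("fine", 1)])
print(get_minibatch_keep_rest(stream, 2))  # (['good', 'bad'], [1, 0])
print(get_minibatch_keep_rest(stream, 2))  # (['fine'], [1])
print(get_minibatch_keep_rest(stream, 2))  # (None, None)
```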

For out-of-core learning, we can no longer use `CountVectorizer` or `TfidfVectorizer`: the former has to hold the complete vocabulary in memory, and the latter has to see all documents to compute the inverse document frequencies. Instead, we will use a different type of vectorization, `HashingVectorizer`, which is data-independent:


In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,  # vector size - estimate
                         preprocessor=None,
                         tokenizer=new_tokenizer)
# 'log_loss' (called 'log' before scikit-learn 1.1) trains a logistic regression model
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path="movie_data.csv")
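
The idea behind hashing-based vectorization can be sketched without any library: a hash function maps every token directly to a column index, so no vocabulary has to be built or stored. The sketch below is illustrative only; it uses `hashlib.md5` for reproducibility, whereas `HashingVectorizer` relies on a different hash function and further refinements. The helper name `hashed_counts` and the small `n_features` are assumptions of this example.

```python
import hashlib
from collections import Counter

def hashed_counts(tokens, n_features=2**4):
    """Map tokens to column indices with a hash and count them (illustrative only)."""
    counts = Counter()
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features  # same token -> same column, no vocabulary needed
        counts[index] += 1
    return dict(counts)

vec = hashed_counts("roses are red roses".split())
print(vec)  # sparse representation: {column index: count}
```

Different tokens can end up in the same column (a hash collision). A large number of features, such as the `2**21` used above, keeps the probability of collisions low.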

Finally, we can begin the training:


In [None]:
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

After training, we can check the accuracy on the last 5,000 documents, which were not used for training:


In [None]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

The final accuracy may be slightly lower than in the previous case, but out-of-core training is much faster and less memory-intensive.

Since we used the "test" set for validation, if we are satisfied with the results, we can also use this data for training:


In [None]:
clf = clf.partial_fit(X_test, y_test)

## 5. Estimating Text Content

Estimating text content (topic modeling) helps when working with unlabeled text data, where we try to assign topic labels to individual texts based on their content. One possible algorithm for this type of problem is **Latent Dirichlet Allocation** (LDA), which is based on Bayesian inference. Its goal is to find groups of words that frequently occur together across multiple documents; each such group defines a topic or category. Its input is a bag-of-words matrix, which it decomposes into two matrices: document-to-topic and word-to-topic. Multiplying these two matrices should reproduce the input bag-of-words matrix with as little error as possible. The number of topics is a hyperparameter of LDA that must be specified in advance.


In [None]:
import pandas as pd

df = pd.read_csv("movie_data.csv", encoding='utf-8')
# this step is needed on some computers
# df = df.rename(columns={"0": "review", "1": "sentiment"})

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,  # ignore words that occur in more than 10% of the documents
                        max_features=5000)
X = count.fit_transform(df['review'].values)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)
lda.components_.shape

After training, we can print the most important words (those with the highest weights) for each topic:


In [None]:
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()\
                    [:-n_top_words - 1:-1]]))
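
The slice `[:-n_top_words - 1:-1]` walks the ascending order returned by `argsort()` backwards, so the indices of the largest weights come first. The following pure-Python sketch mimics this with made-up numbers; the lists `weights` and `names` are hypothetical and do not come from the trained model.

```python
weights = [0.2, 3.1, 0.7, 5.0, 1.4]  # hypothetical topic-word weights
names = ['plot', 'ghost', 'music', 'horror', 'scary']

# Equivalent of NumPy's argsort(): indices that sort the weights in ascending order.
order = sorted(range(len(weights)), key=weights.__getitem__)
print(order)                          # [0, 2, 4, 1, 3]

n_top_words = 3
top = order[:-n_top_words - 1:-1]     # last n_top_words indices, in reverse order
print(top)                            # [3, 1, 4]
print([names[i] for i in top])        # ['horror', 'ghost', 'scary']
```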

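Sample reviews for the *horror* topic (the code assumes that the topic at index 5 corresponds to horror; check the words printed above and adjust the index if necessary):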


In [None]:
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')

## Sources Used

* **IMDb dataset**: Maas, Andrew, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. "Learning word vectors for sentiment analysis." In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142-150. 2011.
* **Porter stemmer:** Porter, Martin F. "An algorithm for suffix stripping." Program 14, no. 3 (1980): 130-137.
* **Natural Language Toolkit:** https://www.nltk.org
* **Latent Dirichlet allocation:** Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." Journal of machine Learning research 3, no. Jan (2003): 993-1022.
* Raschka, Sebastian, Yuxi Hayden Liu, Vahid Mirjalili, and Dmytro Dzhulgakov. Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd, 2022.