# 04a - Basic NLP and Text Representations
Prepared by Jan Christian Blaise Cruz

DLSU Machine Learning Group 

# Preliminaries

First, we'll download the **IMDb Sentiments Dataset** to use for sentiment classification.

In [None]:
!wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/imdb/imdb.zip
!unzip imdb.zip && rm imdb.zip

--2020-08-07 02:18:56--  https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/imdb/imdb.zip
Resolving s3.us-east-2.amazonaws.com (s3.us-east-2.amazonaws.com)... 52.219.100.50
Connecting to s3.us-east-2.amazonaws.com (s3.us-east-2.amazonaws.com)|52.219.100.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13289973 (13M) [application/zip]
Saving to: ‘imdb.zip’


2020-08-07 02:18:57 (44.9 MB/s) - ‘imdb.zip’ saved [13289973/13289973]

Archive:  imdb.zip
   creating: imdb/
  inflating: imdb/train.csv          


Then import the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import spacy
import nltk
from tqdm import tqdm

from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')

np.random.seed(42)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Let's load the dataset. We'll use only 1,000 randomly sampled reviews, since the full dataset is large and would take a long time to process.

In [None]:
df = pd.read_csv('imdb/train.csv').sample(1000, random_state=42)
text, labels = list(df['text']), list(df['sentiment'])

Here's an example.

In [None]:
text[0]

"Great little thriller. I was expecting some type of silly horror movie but what I got was tight short thriller that waste none of our time. Mostof these movies we have to get into the back characters stories so we will either feel sympathy for them or hatred when people start getting killed. o such foolishness here. Yes you see a few characters but they really only interact with the principals. Such as the husband wife at the motel whose room was canceled. We saw them so we could just how efficient the Lisa character was and how inefficient the new Hotel clerk was. We see the little girl simply because she will have a very small but important role later in the movie when all heck breaks loose. THe Flight Atrendants because we need on in particular to move the plot ahead. The bad guy in particular needs her in the beginning of the flight. The rude guy in the airport was important to the movie too. The only 2 characters that were just 5 liners with no use to the plot were the two young 

In the labels, 0 means positive and 1 means negative.

In [None]:
labels[0]

0

Let's split them into training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = text[:700], text[700:], labels[:700], labels[700:]

# Tokenization

Our first job is to tokenize our data, that is, to split each text into **tokens**. There are many different definitions of what a token can be: a word, a phrase, a subword, a character, and so on.
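
To make these different notions of a token concrete, here is a minimal sketch that uses only Python's standard library. The helper names (`whitespace_tokens`, `regex_tokens`, `char_tokens`) are hypothetical and are not part of spaCy; the regex splitter only roughly approximates what a pretrained tokenizer does.

```python
import re

def whitespace_tokens(text):
    # Split on runs of whitespace; punctuation stays attached to words.
    return text.split()

def regex_tokens(text):
    # Runs of word characters and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Every non-space character is its own token.
    return [ch for ch in text if not ch.isspace()]

sample = "Great little thriller."
print(whitespace_tokens(sample))  # ['Great', 'little', 'thriller.']
print(regex_tokens(sample))       # ['Great', 'little', 'thriller', '.']
print(len(char_tokens(sample)))   # 20
```

The finer the granularity, the smaller the vocabulary but the longer each sequence becomes.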

In this example, we'll do simple tokenization using a pretrained tokenizer from spaCy. In the future, we'll look at building our own tokenizers.

In [None]:
# spaCy 3+ dropped the 'en' shortcut; load the small English pipeline by name.
en = spacy.load('en_core_web_sm')
def tokenize(t): return [str(token) for token in en(t)]

Let's test it out.

In [None]:
print(tokenize(X_train[0]))

['Great', 'little', 'thriller', '.', 'I', 'was', 'expecting', 'some', 'type', 'of', 'silly', 'horror', 'movie', 'but', 'what', 'I', 'got', 'was', 'tight', 'short', 'thriller', 'that', 'waste', 'none', 'of', 'our', 'time', '.', 'Mostof', 'these', 'movies', 'we', 'have', 'to', 'get', 'into', 'the', 'back', 'characters', 'stories', 'so', 'we', 'will', 'either', 'feel', 'sympathy', 'for', 'them', 'or', 'hatred', 'when', 'people', 'start', 'getting', 'killed', '.', 'o', 'such', 'foolishness', 'here', '.', 'Yes', 'you', 'see', 'a', 'few', 'characters', 'but', 'they', 'really', 'only', 'interact', 'with', 'the', 'principals', '.', 'Such', 'as', 'the', 'husband', 'wife', 'at', 'the', 'motel', 'whose', 'room', 'was', 'canceled', '.', 'We', 'saw', 'them', 'so', 'we', 'could', 'just', 'how', 'efficient', 'the', 'Lisa', 'character', 'was', 'and', 'how', 'inefficient', 'the', 'new', 'Hotel', 'clerk', 'was', '.', 'We', 'see', 'the', 'little', 'girl', 'simply', 'because', 'she', 'will', 'have', 'a', 

We'll tokenize our training and testing sets.

In [None]:
X_train = [tokenize(t) for t in tqdm(X_train)]
X_test = [tokenize(t) for t in tqdm(X_test)]

100%|██████████| 700/700 [00:35<00:00, 19.92it/s]
100%|██████████| 300/300 [00:14<00:00, 20.52it/s]


And check if the output matches our tests.

In [None]:
print(X_train[0])

['Great', 'little', 'thriller', '.', 'I', 'was', 'expecting', 'some', 'type', 'of', 'silly', 'horror', 'movie', 'but', 'what', 'I', 'got', 'was', 'tight', 'short', 'thriller', 'that', 'waste', 'none', 'of', 'our', 'time', '.', 'Mostof', 'these', 'movies', 'we', 'have', 'to', 'get', 'into', 'the', 'back', 'characters', 'stories', 'so', 'we', 'will', 'either', 'feel', 'sympathy', 'for', 'them', 'or', 'hatred', 'when', 'people', 'start', 'getting', 'killed', '.', 'o', 'such', 'foolishness', 'here', '.', 'Yes', 'you', 'see', 'a', 'few', 'characters', 'but', 'they', 'really', 'only', 'interact', 'with', 'the', 'principals', '.', 'Such', 'as', 'the', 'husband', 'wife', 'at', 'the', 'motel', 'whose', 'room', 'was', 'canceled', '.', 'We', 'saw', 'them', 'so', 'we', 'could', 'just', 'how', 'efficient', 'the', 'Lisa', 'character', 'was', 'and', 'how', 'inefficient', 'the', 'new', 'Hotel', 'clerk', 'was', '.', 'We', 'see', 'the', 'little', 'girl', 'simply', 'because', 'she', 'will', 'have', 'a', 

Great!

Next we'll have to build a **vocabulary**. We need three things:
1. A ``set`` of all unique tokens so we can check whether a specific token exists in $O(1)$ average time.
2. A ``list`` that converts numerical indices into their corresponding unique token.
3. A ``dictionary`` that converts tokens into their corresponding unique indices.

In [None]:
vocab = ['<unk>']
for sample in X_train: vocab.extend(sample)
vocab_set = set(vocab)

idx2word = list(vocab_set)
word2idx = {idx2word[i]: i for i in range(len(idx2word))}
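
As a small illustration of how the three structures fit together, here is a sketch on a toy corpus. The `toy_*` names and the data are made up for this example. Sorting the set is an addition that is not in the cell above: it makes the index assignment reproducible, whereas `list(set(...))` may order strings differently from run to run.

```python
# Toy corpus standing in for the tokenized training set (hypothetical data).
toy_train = [["good", "movie"], ["bad", "movie", "bad", "plot"]]

# 1. The set: every unique token, plus the special unknown marker.
toy_vocab_set = {"<unk>"}
for sample in toy_train:
    toy_vocab_set.update(sample)

# 2. The list: index -> token (sorted here for a stable ordering).
toy_idx2word = sorted(toy_vocab_set)

# 3. The dictionary: token -> index.
toy_word2idx = {word: i for i, word in enumerate(toy_idx2word)}

print(toy_idx2word)           # ['<unk>', 'bad', 'good', 'movie', 'plot']
print(toy_word2idx["movie"])  # 3
print("zombie" in toy_vocab_set)  # False
```

Looking up a word and then its index returns the original word, which is the same round trip the notebook checks next.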

Let's see the number of unique tokens in our dataset.

In [None]:
len(idx2word)

16839

Let's see if our list and dictionary agree on their entries.

In [None]:
idx2word[42]

'oi'

Seems about right.

In [None]:
word2idx['oi']

42

For the testing set, if a token doesn't exist in the training vocabulary, we'll have to mark that as an unknown token.

In [None]:
X_test = [[token if token in vocab_set else '<unk>' for token in line] for line in X_test]
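
The same replacement can be sketched on toy data, together with a quick measure of how many tokens fall outside the vocabulary. The helper names `replace_unknown` and `oov_rate` and the `known` set are assumptions made for this illustration only.

```python
def replace_unknown(tokens, known, unk="<unk>"):
    # Keep tokens seen during training; map everything else to the unknown marker.
    return [tok if tok in known else unk for tok in tokens]

def oov_rate(tokens, known):
    # Fraction of tokens that are out of vocabulary.
    unknown = sum(1 for tok in tokens if tok not in known)
    return unknown / len(tokens)

known = {"<unk>", "good", "movie"}
tokens = ["good", "zombie", "movie", "zombie"]

print(replace_unknown(tokens, known))  # ['good', '<unk>', 'movie', '<unk>']
print(oov_rate(tokens, known))         # 0.5
```

A high out-of-vocabulary rate on the test set is a hint that the training vocabulary is too small for the data.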

Here's a small check.

In [None]:
print(X_test[0])

['Although', 'at', 'one', 'point', 'I', 'thought', 'this', 'was', 'going', 'to', 'turn', 'into', 'The', '<unk>', ',', 'I', 'have', 'to', 'say', 'that', 'The', 'Mother', 'does', 'an', 'excellent', 'job', 'of', 'explaining', 'the', 'sexual', 'desires', 'of', 'an', 'older', 'woman', '.', ' ', 'I', "'m", 'so', 'glad', 'this', 'is', 'a', 'British', 'film', 'because', 'Hollywood', 'never', 'would', 'have', 'done', 'it', ',', 'and', 'even', 'if', 'they', 'had', ',', 'they', 'would', 'have', 'ruined', 'it', 'by', 'not', 'taking', 'the', 'time', 'to', 'develop', 'the', 'characters', '.', ' ', 'The', 'story', 'is', 'revealed', 'slowly', 'and', 'realistically', '.', 'The', 'acting', 'is', 'superb', ',', 'the', 'characters', 'are', '<unk>', 'flawed', ',', 'and', 'the', 'dialogue', 'is', 'sensitive', '.', 'I', 'tried', 'many', 'times', 'to', 'predict', 'what', 'was', 'going', 'to', 'happen', ',', 'and', 'I', 'was', 'always', 'wrong', ',', 'so', 'I', 'was', 'very', 'intrigued', 'by', 'the', 'story',

# Bag of Words

The simplest way to convert text to features is a **bag of words**. The binary version used here is sometimes loosely called a "one hot encoding" representation, although strictly speaking it is a multi-hot vector, since several entries can be 1 at once.

In this representation, we ignore the order of the words and record only whether each word occurs. Each sample is converted into a list of length ``vocab_length``: an entry is 1 if the corresponding word appears in the sample and 0 otherwise.

In [None]:
def make_bag(tokens, idx2word):
    bag = set(tokens)
    return [1 if word in bag else 0 for word in idx2word]
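
To see what this encoding keeps and what it throws away, here is a small self-contained sketch on a five-word toy vocabulary. `binary_bag` mirrors the function above; `count_bag` is a common variant that stores counts instead of presence and is shown only for comparison. The toy vocabulary and sentences are made up.

```python
from collections import Counter

toy_idx2word = ["<unk>", "bad", "good", "movie", "not"]

def binary_bag(tokens, idx2word):
    # 1 if the vocabulary word occurs anywhere in the sample, else 0.
    present = set(tokens)
    return [1 if word in present else 0 for word in idx2word]

def count_bag(tokens, idx2word):
    # Number of occurrences of each vocabulary word (Counter returns 0 for missing keys).
    counts = Counter(tokens)
    return [counts[word] for word in idx2word]

a = ["good", "movie", "not", "bad"]
b = ["bad", "movie", "not", "good"]

print(binary_bag(a, toy_idx2word))                                  # [0, 1, 1, 1, 1]
print(binary_bag(a, toy_idx2word) == binary_bag(b, toy_idx2word))   # True
print(count_bag(["bad", "bad", "movie"], toy_idx2word))             # [0, 2, 0, 1, 0]
```

The two sentences carry opposite sentiment but produce the same vector, because word order is discarded.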

Let's test it on the first sample in our training set.

In [None]:
s = make_bag(X_train[0], idx2word)
print(s)
print(sum(s))

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 

Then we convert our training and testing sets into bags of words.

In [None]:
X_train_bags = [make_bag(tokens, idx2word) for tokens in tqdm(X_train)]
X_test_bags = [make_bag(tokens, idx2word) for tokens in tqdm(X_test)]

100%|██████████| 700/700 [00:01<00:00, 645.37it/s]
100%|██████████| 300/300 [00:00<00:00, 609.84it/s]


Convert them into NumPy arrays.

In [None]:
X_train_bags, X_test_bags = np.array(X_train_bags), np.array(X_test_bags)
y_train, y_test = np.array(y_train), np.array(y_test)

Here's what our training set now looks like.

In [None]:
X_train_bags

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

And here are their shapes.

In [None]:
X_train_bags.shape, X_test_bags.shape

((700, 16839), (300, 16839))

We'll make a logistic regression model and train it on the training set.

In [None]:
model = LogisticRegression()
model.fit(X_train_bags, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Then test on the testing set.

In [None]:
y_pred = model.predict(X_test_bags)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))

Accuracy: 78.00%
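
The reported figure is simply the fraction of test reviews whose predicted label equals the true label. As a minimal sketch of that computation (the `accuracy` helper and the toy labels are illustrative, not scikit-learn's implementation):

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction agrees with the label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_true = [0, 1, 1, 0]
toy_pred = [0, 1, 0, 0]
print(accuracy(toy_true, toy_pred))  # 0.75

# With 300 test samples, 78% accuracy corresponds to this many correct predictions:
print(round(0.78 * 300))  # 234
```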


We got 78% accuracy on a subset of the IMDb dataset! That's pretty good for such a simple model. Let's see if we can push the accuracy up by a few points.

# Stemming, Lemmatization, and Stopwords

The first thing we can do is remove stopwords. These are very common words (such as *the*, *is*, and *of*) that usually carry little information for the task and mostly add noise. We'll use NLTK's stopword corpus to get the English stopwords.

In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

We'll make a set for lookups.

In [None]:
stops = set(stopwords.words('english'))

Stemming and lemmatization are two closely related techniques. Stemming reduces a word to its **stem**, typically by stripping affixes with heuristic rules, so the result need not be a dictionary word. Let's check this with NLTK.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer() 

Here's an example.

In [None]:
stemmer.stem('combining')

'combin'

Lemmatization, on the other hand, can be thought of as a smarter form of stemming. In this case, we reduce a word to its base **lemma**, that is, its dictionary form. Unlike stemming, lemmatization does not chop the word into a fragment: the result is itself a valid word.

In [None]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 

Here's an example.

In [None]:
lemmatizer.lemmatize('corpora')

'corpus'
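
The difference between the two can be sketched with toy stand-ins. `toy_stem` is a crude suffix stripper and `toy_lemmatize` is a tiny lookup table; both are hypothetical simplifications written for this illustration and are far simpler than NLTK's `PorterStemmer` and `WordNetLemmatizer`.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Rule-based: strip the first matching suffix, keeping at least 3 characters.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lookup-based: map known inflected forms to their dictionary form.
TOY_LEMMAS = {"corpora": "corpus", "movies": "movie"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("combining"))     # combin
print(toy_stem("movies"))        # movi
print(toy_lemmatize("movies"))   # movie
print(toy_stem("corpora"))       # corpora
print(toy_lemmatize("corpora"))  # corpus
```

The rule-based stemmer produces fragments such as `movi` and cannot handle the irregular plural `corpora`, while the lookup returns real words but only for forms it knows about.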

For our dataset, we'll remove all stopwords, then lemmatize all the words.

In [None]:
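def process(tokens):
    # Drop stopwords. NLTK's list is lowercase and tokens are not lowercased,
    # so capitalized forms such as 'The' or 'I' are kept.
    temp = [token for token in tokens if token not in stops]
    # Lemmatize what remains (WordNetLemmatizer treats words as nouns by default).
    return [lemmatizer.lemmatize(token) for token in temp]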

Let's test it out.

In [None]:
print(process(X_train[0]))

['Great', 'little', 'thriller', '.', 'I', 'expecting', 'type', 'silly', 'horror', 'movie', 'I', 'got', 'tight', 'short', 'thriller', 'waste', 'none', 'time', '.', 'Mostof', 'movie', 'get', 'back', 'character', 'story', 'either', 'feel', 'sympathy', 'hatred', 'people', 'start', 'getting', 'killed', '.', 'foolishness', '.', 'Yes', 'see', 'character', 'really', 'interact', 'principal', '.', 'Such', 'husband', 'wife', 'motel', 'whose', 'room', 'canceled', '.', 'We', 'saw', 'could', 'efficient', 'Lisa', 'character', 'inefficient', 'new', 'Hotel', 'clerk', '.', 'We', 'see', 'little', 'girl', 'simply', 'small', 'important', 'role', 'later', 'movie', 'heck', 'break', 'loose', '.', 'THe', 'Flight', 'Atrendants', 'need', 'particular', 'move', 'plot', 'ahead', '.', 'The', 'bad', 'guy', 'particular', 'need', 'beginning', 'flight', '.', 'The', 'rude', 'guy', 'airport', 'important', 'movie', '.', 'The', '2', 'character', '5', 'liner', 'use', 'plot', 'two', 'young', 'guy', 'plane', '.', 'THat', 'clev
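
The output above still contains capitalized stopwords such as 'I', 'The', and 'We', because the membership test is case-sensitive. One possible variant, sketched here with a hypothetical helper `drop_stopwords` and a made-up stopword set, compares the lowercased token while keeping the original casing of the tokens that survive.

```python
TOY_STOPS = {"i", "the", "we", "was", "a"}

def drop_stopwords(tokens, stops, ignore_case=True):
    kept = []
    for token in tokens:
        # Compare in lowercase when requested, but keep the token as written.
        key = token.lower() if ignore_case else token
        if key not in stops:
            kept.append(token)
    return kept

tokens = ["The", "movie", "was", "great", ".", "I", "agree"]
print(drop_stopwords(tokens, TOY_STOPS, ignore_case=False))
# ['The', 'movie', 'great', '.', 'I', 'agree']
print(drop_stopwords(tokens, TOY_STOPS))
# ['movie', 'great', '.', 'agree']
```

Whether this helps accuracy on the real dataset is not something the toy example can show; it only illustrates the behavior.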

Let's use it on the entire training and testing sets.

In [None]:
X_train_proc = [process(tokens) for tokens in tqdm(X_train)]
X_test_proc = [process(tokens) for tokens in tqdm(X_test)]

100%|██████████| 700/700 [00:00<00:00, 1380.96it/s]
100%|██████████| 300/300 [00:00<00:00, 1528.74it/s]


Then recreate our vocabulary.

In [None]:
vocab = ['<unk>']
for sample in X_train_proc: vocab.extend(sample)
vocab_set = set(vocab)

idx2word = list(vocab_set)
word2idx = {idx2word[i]: i for i in range(len(idx2word))}

We now have a smaller vocabulary because of this. This is good, as fewer feature columns tend to make the data less sparse.

In [None]:
len(idx2word)

15476

Let's check if the list corresponds with our dictionary.

In [None]:
idx2word[42]

'Harris'

Seems good.

In [None]:
word2idx['Harris']

42

We'll again turn all the tokens in the test set that don't exist in the training set into unknown tokens.

In [None]:
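# Caution: this iterates over X_test, not X_test_proc, so the preprocessing above is discarded for the test set.
X_test_proc = [[token if token in vocab_set else '<unk>' for token in line] for line in X_test]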

Here's a check. Stopwords now show up as `<unk>`, since they were removed from the vocabulary.

In [None]:
print(X_test_proc[0])

['Although', '<unk>', 'one', 'point', 'I', 'thought', '<unk>', '<unk>', 'going', '<unk>', 'turn', '<unk>', 'The', '<unk>', ',', 'I', '<unk>', '<unk>', 'say', '<unk>', 'The', 'Mother', '<unk>', '<unk>', 'excellent', 'job', '<unk>', 'explaining', '<unk>', 'sexual', '<unk>', '<unk>', '<unk>', 'older', 'woman', '.', ' ', 'I', "'m", '<unk>', 'glad', '<unk>', '<unk>', '<unk>', 'British', 'film', '<unk>', 'Hollywood', 'never', 'would', '<unk>', 'done', '<unk>', ',', '<unk>', 'even', '<unk>', '<unk>', '<unk>', ',', '<unk>', 'would', '<unk>', 'ruined', '<unk>', '<unk>', '<unk>', 'taking', '<unk>', 'time', '<unk>', 'develop', '<unk>', '<unk>', '.', ' ', 'The', 'story', '<unk>', 'revealed', 'slowly', '<unk>', 'realistically', '.', 'The', 'acting', '<unk>', 'superb', ',', '<unk>', '<unk>', '<unk>', '<unk>', 'flawed', ',', '<unk>', '<unk>', 'dialogue', '<unk>', 'sensitive', '.', 'I', 'tried', 'many', '<unk>', '<unk>', 'predict', '<unk>', '<unk>', 'going', '<unk>', 'happen', ',', '<unk>', 'I', '<unk

Then make our bags of words.

In [None]:
X_train_bags = [make_bag(tokens, idx2word) for tokens in tqdm(X_train_proc)]
X_test_bags = [make_bag(tokens, idx2word) for tokens in tqdm(X_test_proc)]

100%|██████████| 700/700 [00:01<00:00, 675.58it/s]
100%|██████████| 300/300 [00:00<00:00, 696.74it/s]


Convert them to NumPy arrays.

In [None]:
X_train_bags, X_test_bags = np.array(X_train_bags), np.array(X_test_bags)
y_train, y_test = np.array(y_train), np.array(y_test)

Then we'll finally create our model and train it.

In [None]:
model = LogisticRegression()
model.fit(X_train_bags, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Lastly, let's test on the testing set.

In [None]:
y_pred = model.predict(X_test_bags)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))

Accuracy: 80.00%


With a little more work, we were able to increase our accuracy to 80%!

# Problems with Bag-of-Words

There are some problems with bags-of-words. Let's see them by doing some of our own testing. Let's write a function to predict the sentiment of a custom sentence we give.

In [None]:
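def predict_model(text, model):
    # Whitespace split, not the spaCy tokenizer; words missing from idx2word are simply ignored.
    test_sample = text.split()
    # Add a leading batch axis so the model sees shape (1, vocab_size).
    X_sample = np.array(make_bag(test_sample, idx2word))[np.newaxis, ...]
    return model.predict(X_sample)[0]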

Let's test it on a positive sentiment.

In [None]:
predict_model("This is a good movie", model)

0

Now on a negative sentiment.

In [None]:
predict_model("This is not a good movie", model)

0

That doesn't look correct.

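One problem with bags-of-words is that they have no notion of **sequentiality** (word order). The model therefore has no way of knowing that *not* modifies the meaning of *good movie*. Here the situation is even worse: *not* is in the NLTK stopword list, so after our preprocessing it is not in the vocabulary at all, and both sentences map to the same feature vector.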

To get a better sentiment model, we need features or a model that can capture word order. Sadly, a linear model such as logistic regression cannot recover this from unordered bag-of-words features.
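
The loss of order can be shown directly. In the sketch below, two made-up sentences with opposite meaning contain exactly the same words, so their bags are identical. Adjacent word pairs (bigrams) are one common way to inject some local order into the features; they are shown only as an illustration and are not used in this notebook. The `bigrams` helper is a hypothetical name.

```python
def bigrams(tokens):
    # Pair each token with its successor.
    return list(zip(tokens, tokens[1:]))

a = "the movie is good not bad".split()
b = "the movie is bad not good".split()

# Same set of words, so a binary bag of words cannot tell them apart.
print(set(a) == set(b))                    # True

# The bigram sets differ, so order-aware features can.
print(set(bigrams(a)) == set(bigrams(b)))  # False
print(("not", "good") in bigrams(b))       # True
print(("not", "good") in bigrams(a))       # False
```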

Let's test on something from the test set this time.

In [None]:
s = ' '.join(X_test[23])
print(s)
predict_model(s, model)

You know you 're in trouble when the opening narration basically tells you who survives . It all goes <unk> from there . <unk> , " <unk> bullet - time camera work . <unk> cuts to video game footage . <unk> old sea <unk> and wacky <unk> . <unk> who become skilled <unk> in the <unk> of an eye . Even the zombies are boring .   I was hoping for at least a " so bad it 's good " zombie movie , but this one is " so bad those involved with its creation should be <unk> from ever making a movie again " .  


1

So far so good. Let's shorten the text.

In [None]:
s = ' '.join(X_test[23][:50])
print(s)
predict_model(s, model)

You know you 're in trouble when the opening narration basically tells you who survives . It all goes <unk> from there . <unk> , " <unk> bullet - time camera work . <unk> cuts to video game footage . <unk> old sea <unk> and wacky <unk> . <unk> who


1

Okay. Let's make it even shorter.

In [None]:
s = ' '.join(X_test[23][:30])
print(s)
predict_model(s, model)

You know you 're in trouble when the opening narration basically tells you who survives . It all goes <unk> from there . <unk> , " <unk> bullet - time


0

Now it starts to buckle. Let's see what happens when we use the sequence of tokens that we just removed (which contributed to the "negativity" of the sentiment).

In [None]:
s = ' '.join(X_test[23][30:50])
print(s)
predict_model(s, model)

camera work . <unk> cuts to video game footage . <unk> old sea <unk> and wacky <unk> . <unk> who


0

This is a problem related to **sparsity**. Since our binary bag-of-words features are sparse (in layman's terms, they contain too many zeros), the model fails to make correct decisions when there is a shortage of information.

In simple terms, the model struggles to assess sentiment properly when given shorter sentences. With far more features (about 15,000) than training samples (700), this sparsity also makes it prone to overfitting on the training set.
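
Sparsity can be quantified as the fraction of nonzero entries in the feature matrix. The sketch below computes it on a made-up toy matrix with a hypothetical helper `density`; for a binary NumPy array such as `X_train_bags`, the mean of the array gives the same quantity.

```python
def density(rows):
    # Fraction of entries that are nonzero.
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0)
    return nonzero / total

toy_bags = [
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
print(density(toy_bags))  # 0.1875
```

A short input sets only a handful of entries to 1, so very few learned weights contribute to the prediction.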