# Module 3: Machine Learning

## Sprint 3: Introduction to Natural Language Processing and Computer Vision

## Subproject 1: Introduction to Natural Language Processing

Welcome to the third sprint of the Machine Learning course! In this sprint we will learn how to process textual and visual data and how to build models on it. We will also look at some practical issues in machine learning: what the common pitfalls are and how to make sure that your model avoids them.

There is a lot of textual data in the world and it comes in many forms - articles, questions, their answers, dialogues, comments, item names and their descriptions, etc. What is more, there are many languages and each has its own characteristics.

While there are similarities in the way one should deal with various forms of textual data, there are also differences - sometimes a simple pattern matching algorithm can work very well, while sometimes you need complex architectures pre-trained on vast amounts of textual data to get a reasonably working model.

In any case, practical skills in dealing with textual data are very useful, and they are the topic of this notebook.

## Learning outcomes

- Regular expressions
- Converting text to a vector
- Building models on textual data

---

## Introduction

Let's begin learning about natural language processing from the Kaggle intro course:

- https://www.kaggle.com/learn/natural-language-processing

## Regular expressions

Regular expressions are a powerful and convenient way to find a variety of patterns in text. Here is a good introductory tutorial about them:

- https://scotch.io/tutorials/an-introduction-to-regex-in-python

Additionally, it might be useful to skim through the documentation of Python's built-in module for regular expressions:

- https://docs.python.org/3/library/re.html

The best way to learn regular expressions is to practice them a lot, so let's do the exercises at the following link:

- https://regexone.com
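As a warm-up before the exercises, here is a minimal sketch of three commonly used functions from Python's `re` module. The sample sentence and both patterns are made up for illustration, and the e-mail pattern is deliberately simplified rather than a full validator.

```python
import re

text = "Contact alice@example.com or bob@example.org before 2021-03-15."

# re.search returns the first match object, or None if nothing matches.
date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)
date = date_match.group(0) if date_match else None

# re.findall returns every non-overlapping match as a list of strings.
email_pattern = r"[\w.]+@[\w.]+\.\w+"  # simplified, illustrative pattern
emails = re.findall(email_pattern, text)

# re.sub replaces every match with the given replacement string.
masked = re.sub(email_pattern, "<EMAIL>", text)

print(date)
print(emails)
print(masked)
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.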

## Vectorization

Machine learning models usually work with vectors as the input, even though you will learn about some more complex models in the Deep Learning course. Converting textual data to a vector can be done in many ways, but one of the most common is using a bag of words approach - looking at the text as a collection of words and encoding that collection by a vector, where each vector coordinate corresponds to the existence (or count) of a specific word.
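The bag of words idea can be sketched in a few lines of plain Python. This is only an illustration on two made-up sentences with whitespace tokenization; library vectorizers handle tokenization and storage far more carefully.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Tokenize by whitespace (real tokenizers also deal with punctuation, case, etc.).
tokenized = [doc.split() for doc in docs]

# The vocabulary assigns one vector coordinate to each distinct word.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def count_vector(tokens):
    """Encode a document by how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def binary_vector(tokens):
    """Encode a document by whether each vocabulary word occurs at all."""
    present = set(tokens)
    return [int(word in present) for word in vocabulary]

count_vectors = [count_vector(tokens) for tokens in tokenized]
binary_vectors = [binary_vector(tokens) for tokens in tokenized]

print(vocabulary)
print(count_vectors)
print(binary_vectors)
```

Note that word order is lost: both encodings only record which words occur (and how often), not where.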

Tf-idf vectorization is a slightly more complex version of the count vectorization described above. The main idea of tf-idf is to encode very common words with lower values, as this might help the model not to put much weight on words that are generic, such as "the" or "is". To learn about count and tf-idf vectorization in more detail, read the following article:

- https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
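To make the weighting concrete, here is a minimal sketch of one textbook tf-idf variant, where tf is the relative frequency of a word in a document and idf is log(N / df). The three sentences are made up, and this is an assumption about the formula rather than a copy of any library: scikit-learn's TfidfVectorizer, for example, uses a smoothed idf and normalizes each vector by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the bird flew",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Inverse document frequency: rare words get larger values."""
    return math.log(n_docs / doc_freq[word])

def tfidf(tokens):
    """Map each word of a document to (relative term frequency) * idf."""
    counts = Counter(tokens)
    return {word: (count / len(tokens)) * idf(word) for word, count in counts.items()}

weights = [tfidf(tokens) for tokens in tokenized]
print(weights[0])
```

In this toy corpus "the" occurs in every document, so its idf is log(1) = 0 and it contributes nothing, while "cat" (one document) outweighs "sat" (two documents).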

Scikit-learn has some convenient objects for text vectorization. As usual, go through the parameters and try to understand as many of them as possible. Also, look at the examples:
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

## NLP in practice

Let's begin by importing the modules we will need and setting the random state:

In [3]:
!pip install scikit-learn==0.23

Collecting scikit-learn==0.23
  Downloading https://files.pythonhosted.org/packages/23/c3/5f6e7317246d39b1921d3b697b4e419657eb728a1f02f9df4a019a35ccaf/scikit_learn-0.23.0-cp37-cp37m-manylinux1_x86_64.whl (7.3MB)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.23.0 threadpoolctl-2.1.0


In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets, metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
import re

from sklearn import set_config
set_config(display='diagram')

RANDOM_STATE = 7

We will be using the 20 newsgroups dataset (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset). However, we will only use the 4 categories defined below:

In [2]:
categories = ['comp.sys.mac.hardware', 'comp.windows.x', 'sci.med', 'sci.space']

x_train = datasets.fetch_20newsgroups(subset='train', categories=categories)
x_val = datasets.fetch_20newsgroups(subset='test', categories=categories)

We can look at our target values as below:

In [3]:
x_val.target

array([3, 1, 3, ..., 1, 3, 2])

It is a bit inconvenient to see the targets as integers, as our targets are actually named and interpretable:

In [4]:
x_val.target_names

['comp.sys.mac.hardware', 'comp.windows.x', 'sci.med', 'sci.space']

So let's build a dictionary mapping our target ids to their respective names:

In [5]:
target_id_to_name = {idx: x_val.target_names[idx] for idx in range(len(categories))}

What is more, let's create a dataframe holding our input text and the target:

In [6]:
train_df = pd.DataFrame({'text': x_train.data, 'target': x_train.target})
val_df = pd.DataFrame({'text': x_val.data, 'target': x_val.target})
train_df.head()

|   | text | target |
|---|---|---|
| 0 | From: jbh55289@uxa.cso.uiuc.edu (Josh Hopkins)... | 3 |
| 1 | From: nodine@lcs.mit.edu (Mark H. Nodine)\nSub... | 0 |
| 2 | From: drisko@ics.com (Jason Drisko)\nSubject: ... | 1 |
| 3 | From: straw@cam.nist.gov (Mike_Strawbridge_x38... | 1 |
| 4 | From: petrack@vnet.IBM.COM\nSubject: disabling... | 0 |


We can now conveniently use our dictionary to map target ids to their names:

In [7]:
train_df['target'] = train_df['target'].map(target_id_to_name)
val_df['target'] = val_df['target'].map(target_id_to_name)
train_df.head()

|   | text | target |
|---|---|---|
| 0 | From: jbh55289@uxa.cso.uiuc.edu (Josh Hopkins)... | sci.space |
| 1 | From: nodine@lcs.mit.edu (Mark H. Nodine)\nSub... | comp.sys.mac.hardware |
| 2 | From: drisko@ics.com (Jason Drisko)\nSubject: ... | comp.windows.x |
| 3 | From: straw@cam.nist.gov (Mike_Strawbridge_x38... | comp.windows.x |
| 4 | From: petrack@vnet.IBM.COM\nSubject: disabling... | comp.sys.mac.hardware |


Let's see what the target distribution is:

In [8]:
train_df.groupby('target').size().sort_values(ascending=False)

target
sci.med                  594
sci.space                593
comp.windows.x           593
comp.sys.mac.hardware    578
dtype: int64

We see that the classes have approximately equal numbers of samples. When a dataset is imbalanced, we should be careful with the accuracy metric: if 90% of the dataset has the same target, it is trivial to build a model that always predicts that target and is 90% accurate.
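The 90% scenario can be checked with a small sketch. The labels below are hypothetical (not taken from the newsgroups data), and per-class recall is used here only as one possible way to expose what accuracy hides.

```python
from collections import Counter

# Hypothetical imbalanced labels: 90 samples of class "a", 10 of class "b".
y_true = ["a"] * 90 + ["b"] * 10

# A "model" that ignores its input and always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of the samples with this true label that were predicted correctly."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)

print("accuracy:", accuracy)
print("recall for a:", recall("a"))
print("recall for b:", recall("b"))
```

The accuracy looks respectable, yet the minority class is never predicted at all, which is why accuracy alone can be misleading on imbalanced data.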

To illustrate the use of regular expressions (check https://docs.python.org/3/library/re.html if something is unclear about the code below), let's use one that catches patterns that likely indicate a price, like 5\$ or \$105:

In [9]:
pattern = re.compile(r'(\$\d)|(\d\$)')  # raw string, so the backslashes reach the regex engine unchanged
train_df['contains_price'] = train_df['text'].apply(lambda x: pattern.search(x) is not None)
train_df.head()

|   | text | target | contains_price |
|---|---|---|---|
| 0 | From: jbh55289@uxa.cso.uiuc.edu (Josh Hopkins)... | sci.space | False |
| 1 | From: nodine@lcs.mit.edu (Mark H. Nodine)\nSub... | comp.sys.mac.hardware | False |
| 2 | From: drisko@ics.com (Jason Drisko)\nSubject: ... | comp.windows.x | False |
| 3 | From: straw@cam.nist.gov (Mike_Strawbridge_x38... | comp.windows.x | False |
| 4 | From: petrack@vnet.IBM.COM\nSubject: disabling... | comp.sys.mac.hardware | False |


We used the OR operator (|) to catch both variants: the dollar sign to the left of a digit and to the right of it. As the dollar sign (\$) is a special character in regular expressions (it matches the end of the string), we had to escape it with a backslash (\\). Finally, \d is the pattern for any digit (0-9).
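A quick way to convince yourself that the pattern behaves as described is to run it on a few hand-written strings. The sample sentences below are made up, and the expected value next to each one is what the pattern should return.

```python
import re

pattern = re.compile(r"(\$\d)|(\d\$)")

# Made-up sample sentences mapped to whether they should count as containing a price.
samples = {
    "It costs $105 new.": True,        # dollar sign followed by a digit
    "I paid 5$ for it.": True,         # digit followed by a dollar sign
    "Send $ to this address.": False,  # dollar sign without an adjacent digit
    "No digits, no dollars.": False,
}

results = {text: pattern.search(text) is not None for text in samples}
print(results == samples)

# Without the escape, $ means "end of the string", so "$\d" asks for a digit
# after the end of the text and cannot match these sentences.
unescaped = re.compile(r"$\d")
print(unescaped.search("It costs $105 new."))
```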



Let's look at the distribution of target values among those that have this feature as True, or, in other words, have a token that likely means a price in the text:

In [10]:
train_df.query('contains_price == True').groupby('target').size().sort_values(ascending=False)

target
sci.space                111
comp.sys.mac.hardware     95
sci.med                   36
comp.windows.x            29
dtype: int64

We see that prices are mentioned most often in the space category and least often in the X Window System category (comp.windows.x). This suggests that adding this feature to our machine learning model could help it distinguish between the categories. Various other regular expressions could be used to engineer useful new features too!
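As a sketch of how further regex features might be organized, the snippet below keeps several patterns in a dictionary and turns a text into a dictionary of boolean features. The feature names, the patterns, and the example message are all hypothetical; whether any of them helps on the newsgroups data would have to be checked.

```python
import re

# Hypothetical feature patterns; names and expressions are illustrative only.
feature_patterns = {
    "contains_email": re.compile(r"[\w.]+@[\w.]+\.\w+"),
    "contains_year": re.compile(r"\b(19|20)\d{2}\b"),
    "contains_question": re.compile(r"\?"),
}

def extract_features(text):
    """Return one boolean per pattern: does the pattern occur in the text?"""
    return {
        name: pattern.search(text) is not None
        for name, pattern in feature_patterns.items()
    }

example = "From: someone@example.com\nSubject: Was Apollo 11 launched in 1969?"
features = extract_features(example)
plain = extract_features("Nothing special here.")

print(features)
print(plain)
```

With the dataframe from this notebook, each pattern could be applied in the same way as `contains_price`, using `train_df['text'].apply(...)`.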

Let's build a CountVectorizer object (as with all scikit-learn objects, refer to the scikit-learn documentation if its use is unclear):

In [11]:
cv = CountVectorizer(
    lowercase=True,
    max_features=15000,
    min_df=50,
    binary=True,
    ngram_range=(1,2),
    strip_accents='ascii'
)
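To see what `ngram_range=(1,2)` and `min_df` mean in the cell above, here is a plain-Python sketch on three made-up documents. It uses whitespace tokenization and a hypothetical `min_df` of 2; scikit-learn's own tokenizer is more involved, so this only illustrates the idea.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    result = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

docs = ["the space shuttle", "the space station", "the mac"]
doc_ngrams = [ngrams(doc.split()) for doc in docs]

# Document frequency: the number of documents that contain each n-gram.
doc_freq = Counter(gram for grams in doc_ngrams for gram in set(grams))

min_df = 2  # hypothetical value for this toy corpus; the notebook uses 50
vocabulary = sorted(gram for gram, freq in doc_freq.items() if freq >= min_df)

print(doc_ngrams[0])
print(vocabulary)
```

Only n-grams that occur in at least `min_df` documents survive, which is how a large space of possible word pairs gets reduced to a manageable vocabulary.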

As seen above, we will be using word pairs as features in addition to single words. As the number of possible word pairs is large, we might end up with a feature vector with many dimensions (each dimension encoding a unique word or word pair). It is therefore useful to also use an object that selects the most important features:

In [12]:
feature_selector = SelectFromModel(
    LogisticRegression(max_iter=1000),
    threshold=0.05
)
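The thresholding idea behind the selector can be sketched without scikit-learn. The coefficients and feature names below are hypothetical, and the sketch assumes that a feature's importance is the sum of the absolute values of its coefficients across classes and that features with importance at or above the threshold are kept; treat this as an assumption about the default behaviour and consult the SelectFromModel documentation for the exact rule.

```python
# Hypothetical coefficients: one row per class, one column per feature.
coef = [
    [0.30, -0.01, 0.00],
    [-0.20, 0.02, 0.01],
    [-0.10, -0.01, -0.01],
]
feature_names = ["space", "the", "of"]  # hypothetical vocabulary
threshold = 0.05

# Assumed importance: sum of absolute coefficients across the classes.
importance = [
    sum(abs(row[j]) for row in coef)
    for j in range(len(feature_names))
]

# Keep a feature when its importance reaches the threshold.
support = [score >= threshold for score in importance]
selected = [name for name, keep in zip(feature_names, support) if keep]

print(importance)
print(selected)
```

Features with coefficients close to zero for every class barely influence the prediction, so dropping them should change the model little while shrinking the input.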

As a baseline model, let's pick logistic regression. It is also a good choice when you have to deal with sparse (few non-zero values) feature vectors of large dimensionality:

In [13]:
model = LogisticRegression(max_iter=1000)

Now, let's put our:

- preprocessor (CountVectorizer) that converts our text into a binary bag-of-words vector, each dimension encoding the presence (or absence) of a specific word or word pair
- feature selector that keeps the words and word pairs whose importance (derived from the logistic regression coefficients) is above a specific threshold
- logistic regression model

into a single pipeline object:

In [14]:
pipe = Pipeline(
    steps=[
        ('preprocessor', cv),
        ('feature_selector', feature_selector),
        ('model',  model)]
)
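Conceptually, a pipeline fits each intermediate step on the training data, passes the transformed data on to the next step, and replays the same transformations at prediction time. The sketch below mimics that flow with toy classes; `TinyPipeline`, `Lowercase`, and `KeywordModel` are hypothetical stand-ins written for this illustration, not scikit-learn's implementation.

```python
class TinyPipeline:
    """Illustrative stand-in for a pipeline of (name, step) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Every step except the last is fitted, then transforms the data.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the fitted transformations are only replayed.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class Lowercase:
    """Toy transformer: lowercases every text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordModel:
    """Toy model: predicts 'space' if the word 'orbit' occurs, else 'other'."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return ["space" if "orbit" in text else "other" for text in X]


toy_pipe = TinyPipeline([("preprocessor", Lowercase()), ("model", KeywordModel())])
toy_pipe.fit(["ORBIT data", "disk drive"], ["space", "other"])
predictions = toy_pipe.predict(["Low Earth ORBIT", "new keyboard"])
print(predictions)
```

Bundling the steps this way means the vocabulary and the selected features are learned from the training data only and then applied unchanged to the validation data.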

Now we can fit our pipeline:

In [15]:
%%time
pipe.fit(train_df['text'], train_df['target'])

CPU times: user 3.63 s, sys: 53.7 ms, total: 3.68 s
Wall time: 3.7 s


Let's look at the accuracy scores of our model:

In [16]:
%%time
train_df['target_prediction'] = pipe.predict(train_df['text'])
print('Training accuracy is', metrics.accuracy_score(train_df['target'], train_df['target_prediction']))

val_df['target_prediction'] = pipe.predict(val_df['text'])
print('Validation accuracy is', metrics.accuracy_score(val_df['target'], val_df['target_prediction']))

Training accuracy is 1.0
Validation accuracy is 0.8560509554140128
CPU times: user 1.8 s, sys: 11 ms, total: 1.81 s
Wall time: 1.82 s


We see that the training set was fit perfectly. Maybe we have too many features? Let's see how large our vocabulary is:

In [17]:
len(cv.vocabulary_)

1607

In total, 1607 words and word pairs were passed as features to the feature selector. Let's see how many it selected based on the given threshold:

In [18]:
sum(feature_selector.get_support())

1604

So 1604 features (words and word pairs) were selected out of 1607.

Let's try to be more selective by using a higher threshold:

In [26]:
feature_selector = SelectFromModel(
    LogisticRegression(max_iter=1000),
    threshold=0.4
)

pipe = Pipeline(
    steps=[
        ('preprocessor', cv),
        ('feature_selector', feature_selector),
        ('model',  model)]
)

pipe.fit(train_df['text'], train_df['target'])

In [27]:
%%time
train_df['target_prediction'] = pipe.predict(train_df['text'])
print('Training accuracy is', metrics.accuracy_score(train_df['target'], train_df['target_prediction']))

val_df['target_prediction'] = pipe.predict(val_df['text'])
print('Validation accuracy is', metrics.accuracy_score(val_df['target'], val_df['target_prediction']))

Training accuracy is 1.0
Validation accuracy is 0.8598726114649682
CPU times: user 1.8 s, sys: 10.1 ms, total: 1.81 s
Wall time: 1.84 s


In [28]:
sum(feature_selector.get_support())

917

With a more aggressive selection strategy we slightly improved the validation accuracy (from about 0.856 to about 0.860) while reducing the number of selected features from 1604 to 917, which makes the final model simpler.

## Exercise

Try to improve the above validation scores using a different model or different model hyper-parameters (you could also try using hyper-parameter optimization!).

In [76]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

rf_model = RandomForestClassifier(
    random_state=RANDOM_STATE,
    n_estimators=1500
)

In [77]:
cv = CountVectorizer(
    lowercase=True,
    max_features=5000,
    min_df=50,
    binary=True,
    ngram_range=(1,1),
    strip_accents='ascii'
)

In [78]:
rf_pipe = Pipeline(
    steps=[
        ('preprocessor', cv),
        ('model', rf_model)]
)

In [79]:
%%time
rf_pipe.fit(train_df['text'], train_df['target'])

train_df['target_prediction'] = rf_pipe.predict(train_df['text'])
print('Training accuracy is', metrics.accuracy_score(train_df['target'], train_df['target_prediction']))

val_df['target_prediction'] = rf_pipe.predict(val_df['text'])
print('Validation accuracy is', metrics.accuracy_score(val_df['target'], val_df['target_prediction']))

Training accuracy is 1.0
Validation accuracy is 0.8878980891719745
CPU times: user 30.7 s, sys: 98.8 ms, total: 30.8 s
Wall time: 30.9 s


---

## Summary

In this notebook we learned about natural language processing: how you can find patterns in text using regular expressions and how you can convert that text into a vector by tokenizing it and encoding the presence (or absence) of specific textual patterns (such as words and word pairs).

## Further research

- https://developers.google.com/edu/python/regular-expressions
- https://scikit-learn.org/stable/modules/naive_bayes.html
- https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html