# Processing Text With Python
## Preprocess Data For Classification and Regression
### Load Initial Dataset with Multiple Documents of Text

The data is based on the IMDb movie review dataset and has been stored locally. If you want to fetch the data yourself, you can find it here: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz . The data is organized in two subfolders, train and test, each of which contains folders with positive and negative reviews.

In [1]:
from sklearn.datasets import load_files
import numpy as np

reviews_train = load_files("../../P/datasets/imdb/aclImdb/train/")


In [2]:
type(reviews_train)

sklearn.utils.Bunch

A Bunch object is dictionary-like and consists of the keys
* data the raw content of each document to learn from
* target the associated class label of each document (an integer index into target_names)
* target_names the meaning of the labels
* feature_names the meaning of the features (optional, not present for `load_files`)
* DESCR the full description of the dataset
* filenames the physical location of each document file

Here's what the first entries look like:

In [3]:
print(reviews_train.keys())

print(reviews_train.data[0])
print(reviews_train.target[0])
print(reviews_train.target_names[0])
print(reviews_train.DESCR)
print(reviews_train.filenames)

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
b'Full of (then) unknown actors TSF is a great big cuddly romp of a film.<br /><br />The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.<br /><br />The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.<br /><br />And for anyone who lived in Glasgow it\'s a great "Oh I know where that is" film.'
2
neg
None
['../../P/datasets/imdb/aclImdb/train/unsup\\5384_0.txt'
 '../../P/datasets/imdb/aclImdb/train/unsup\\48493_0.txt'
 '../../P/datasets/imdb/aclImdb/train/unsup\\20575_0.txt' ...
 '../../P/datasets/imdb/aclImdb/train/unsup\\25853_0.txt'
 '../../P/datasets/imdb/aclImdb/train/unsup\\26711_0.txt'
 '../../P/datasets/imdb/aclImdb/train/unsup\\48943_0.txt']


Using the .data key and the .target key, we can create a list of training documents together with their class labels:

In [4]:
# load_files returns a bunch, containing training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))

print("type of y_train: {}".format(type(y_train)))
print("length of y_train: {}".format(len(y_train)))

type of text_train: <class 'list'>
length of text_train: 75000
type of y_train: <class 'numpy.ndarray'>
length of y_train: 75000


We preprocess the data by replacing the HTML line-break tags (`<br />`) with spaces:

In [5]:
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
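
The replacement above only targets `<br />`. If other tags turn up in the reviews, a regular expression can strip them more generally. The following is a minimal sketch, not part of the original notebook: `TAG_PATTERN`, `strip_tags` and the sample text are hypothetical names and data chosen for illustration, and the pattern is a heuristic rather than a full HTML parser.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br />, <i> or </i>.
# This is a heuristic for review text, not a real HTML parser.
TAG_PATTERN = re.compile(rb"<[^>]+>")


def strip_tags(doc: bytes) -> bytes:
    """Replace every tag-like substring with a single space."""
    return TAG_PATTERN.sub(b" ", doc)


sample = b"A real treat.<br /><br />The <i>comatose</i> van driver"
cleaned = strip_tags(sample)
print(cleaned)
```

The documents are kept as `bytes`, as in the cell above, so the pattern is a bytes pattern as well.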

Inspect which categories we have

In [6]:
print(reviews_train.target_names)

['neg', 'pos', 'unsup']


'unsup' marks unlabeled reviews (intended for unsupervised learning), for which no categorization exists. We should ignore these. Care must be taken when removing the respective entries from the data: we first need to identify the positions of the "unsup" class in the y_train array and then remove those positions from both y_train and text_train, so that documents and labels stay aligned.

In [7]:
idx = np.where(y_train == 2)
y_train = np.delete(y_train, idx[0])
text_train = np.delete(text_train, idx[0])
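
An alternative to `np.delete` is a boolean mask, which keeps the documents in a plain Python list instead of converting them to a NumPy array. This is a small sketch on toy data; `texts`, `labels`, `keep`, `labels_kept` and `texts_kept` are hypothetical stand-ins for the notebook's variables.

```python
import numpy as np

# Toy stand-ins for text_train / y_train (hypothetical data, not from IMDb)
texts = [b"great film", b"awful plot", b"no label here", b"loved it"]
labels = np.array([1, 0, 2, 1])

# Boolean mask: True for every labelled document (class 2 is "unsup")
keep = labels != 2

# Apply the same mask to both containers so they stay aligned
labels_kept = labels[keep]
texts_kept = [doc for doc, flag in zip(texts, keep) if flag]

print(labels_kept)
print(texts_kept)
```

Because the same mask filters both containers, the i-th remaining document still belongs to the i-th remaining label.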

In [8]:
len(text_train)


25000

In [9]:
print(len(text_train))
print(np.unique(y_train))

25000
[0 1]


# Representing the data as bag of words
Now we create a bag of words, i.e. for each word in the documents, we count its occurrences on a per-document level. Hence, each new word adds a dimension (column) to our feature vector, and each new document adds a new row to our feature matrix.
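
To make the idea concrete, here is a hand-rolled bag of words on three toy sentences. It is only a sketch: it splits on whitespace, whereas `CountVectorizer` uses its own tokenization rules, and the names `docs`, `vocabulary` and `matrix` are chosen for illustration.

```python
from collections import Counter

docs = ["the film was good", "the plot was bad", "good film good cast"]

# Vocabulary: every distinct word, sorted so that each word gets a fixed column
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny example, which is why scikit-learn stores the result as a sparse matrix.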

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
imdb_bag_of_words = vectorizer.fit_transform(text_train)

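The vectorizer now contains all the words that have been found. The matrix has the dimension (num_documents x num_words). Note that `vocabulary_` maps each word to its column index in that matrix, so the number printed as "count" below is an index, not a frequency.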

In [11]:
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))
print("bag_of_words: {}".format(repr(imdb_bag_of_words)))
for x in list(vectorizer.vocabulary_)[0:10]:
    print ("word: {}\t count {} ".format(x,  vectorizer.vocabulary_[x]))

Vocabulary size: 74849
bag_of_words: <25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3431196 stored elements in Compressed Sparse Row format>
word: dan	 count 16414 
word: katzir	 count 36171 
word: has	 count 29999 
word: produced	 count 51851 
word: wonderful	 count 73496 
word: film	 count 24536 
word: that	 count 66322 
word: takes	 count 65348 
word: us	 count 70492 
word: on	 count 46916 


If we look at the features, we see that some of them are unlikely to give us valuable information. These include numbers, time-of-day representations and stop words:

In [12]:
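# note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions provide get_feature_names_out() instead
feature_names = vectorizer.get_feature_names()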
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 74849
First 20 features:
['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']
Features 20010 to 20030:
['dratted', 'draub', 'draught', 'draughts', 'draughtswoman', 'draw', 'drawback', 'drawbacks', 'drawer', 'drawers', 'drawing', 'drawings', 'drawl', 'drawled', 'drawling', 'drawn', 'draws', 'draza', 'dre', 'drea']
Every 2000th feature:
['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']


The feature space is vast, with nearly 75000 dimensions. Hence we should try to reduce the number of dimensions by:
* using only words that occur in a minimum number of documents (minimum document frequency) min_df
* removing stop words (like 'a', 'and', 'the'), as they don't give valuable information for classification, and/or
* removing words that occur in many documents (maximum document frequency) max_df
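
The three filters can be sketched in plain Python on a toy corpus. This illustrates the idea only and is not how `CountVectorizer` is implemented; the stop list, the thresholds and the names `doc_freq` and `kept` are hypothetical.

```python
docs = [
    "the film was good",
    "the plot of the film was bad",
    "the cast was good",
    "a dull film",
]
stop_words = {"a", "the", "was"}  # tiny hypothetical stop list

# Document frequency: in how many documents does each word occur?
doc_freq = {}
for doc in docs:
    for word in set(doc.split()):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Absolute document counts, like the integer min_df / max_df values in the notebook
min_df, max_df = 2, 2
kept = sorted(
    word
    for word, df in doc_freq.items()
    if min_df <= df <= max_df and word not in stop_words
)
print(doc_freq)
print(kept)
```

Here "film" is dropped by `max_df` (it occurs in three documents), rare words such as "dull" are dropped by `min_df`, and "the" and "was" are dropped as stop words.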

In [13]:
vect = CountVectorizer(min_df=5, stop_words="english", max_df=1000).fit(text_train)
X_train = vect.transform(text_train)

In [14]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("X_train size: {}".format(repr(X_train)))
for x in list(vect.vocabulary_)[0:10]:
    print ("word: {}\t count {} ".format(x,  vect.vocabulary_[x]))
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Vocabulary size: 26630
X_train size: <25000x26630 sparse matrix of type '<class 'numpy.int64'>'
	with 1399128 stored elements in Compressed Sparse Row format>
word: dan	 count 5964 
word: produced	 count 18384 
word: roller	 count 20154 
word: coaster	 count 4572 
word: ride	 count 19969 
word: romance	 count 20163 
word: troubles	 count 24544 
word: surrounding	 count 23234 
word: modern	 count 15397 
word: israel	 count 12698 
Number of features: 26630
First 20 features:
['00', '000', '007', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '100', '1000', '100th', '101', '102', '103', '104']
Features 20010 to 20030:
['riley', 'ring', 'ringer', 'ringing', 'ringmaster', 'ringo', 'rings', 'ringu', 'ringwald', 'rio', 'rios', 'riot', 'riotous', 'riots', 'rip', 'ripe', 'ripley', 'ripoff', 'ripoffs', 'ripped']
Every 2000th feature:
['00', 'bananas', 'characterised', 'dapper', 'endurance', 'germany', 'inauthentic', 'livesey', 'negotiate', 'pos', 'righteousness', 'sod', 'ticking', 

# Rescaling the data using term frequency inverse document frequency
Here, term frequency is the number of occurrences of a term (word) $t$ in a document $d$.

$\operatorname{tf}(t, d) = f_{t, d}$ 

Sometimes tf is normalized by the length of $d$.

The inverse document frequency (idf) is a measure of the amount of information a term $t$ carries. A term that occurs in few documents carries a high amount of information, whereas a term that occurs in many documents carries a low amount. The idf is computed as

$\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1$

where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain the term $t$. Hence, the tf-idf is the product of the two terms:

$\text{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \text{idf}(t)$

scikit-learn implements this idf formula in the `TfidfTransformer` when the following parameters are used: `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`. With `norm='l2'`, each document vector is additionally scaled to unit Euclidean length. Refer to the scikit-learn documentation for the other parameter settings and how they change the formula.
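
The formulas can be checked by hand on a toy corpus. The sketch below uses the natural logarithm and applies no normalization, i.e. it corresponds to the `norm=None` setting; the helper names `df`, `idf` and `tf_idf` and the three toy documents are chosen for illustration.

```python
import math

# Three already tokenized toy documents
docs = [
    ["good", "film", "good"],
    ["bad", "film"],
    ["good", "cast"],
]
n = len(docs)


def df(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in docs)


def idf(term):
    """Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df(term))) + 1


def tf_idf(term, doc):
    """Raw term count in the document times the idf of the term."""
    return doc.count(term) * idf(term)


for term in ["good", "film", "bad"]:
    print(term, df(term), round(idf(term), 4), round(tf_idf(term, docs[0]), 4))
```

"bad" occurs in only one document and therefore gets a higher idf than "good" or "film", but its tf-idf in the first document is zero because it does not occur there.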

# Application of tf-idf to IMDb data

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(min_df=5, stop_words="english", max_df=1000, norm=None)
X_idf_train = tfidf_vect.fit_transform(text_train)

Each word now has a tf-idf value if it occurs in a document (row of the matrix), or zero if it does not.

In [16]:
# find maximum value for each of the features over dataset:
max_value = X_idf_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()

# note: in scikit-learn >= 1.0 use get_feature_names_out(), which already returns an array
feature_names = np.array(tfidf_vect.get_feature_names())

print("Features with lowest tfidf:\n{}".format(
      feature_names[sorted_by_tfidf[:20]]))
print("tf-idf values:\n{}".format(max_value[sorted_by_tfidf[:20]]))
print("Features with highest tfidf: \n{}".format(
      feature_names[sorted_by_tfidf[-20:]]))
print("tf-idf values:\n{}".format(max_value[sorted_by_tfidf[-20:]]))

Features with lowest tfidf:
['poignant' 'disagree' 'instantly' 'importantly' 'lacked' 'occurred'
 'currently' 'altogether' 'nearby' 'undoubtedly' 'directs' 'avoided'
 'fond' 'stinker' 'emphasis' 'commented' 'realizing' 'disappoint'
 'downhill' 'inane']
tf-idf values:
[6.0704253  6.24386918 6.2668587  6.27464084 6.32265006 6.38173897
 6.39047265 6.39928328 6.42619074 6.44453988 6.4822802  6.49194211
 6.49194211 6.49194211 6.50169829 6.51155059 6.53155125 6.53155125
 6.55196012 6.56232291]
Features with highest tfidf: 
['roy' 'coop' 'homer' 'dillinger' 'hackenstein' 'gadget' 'macarthur'
 'taker' 'vargas' 'jesse' 'basket' 'dominick' 'victor' 'bridget'
 'victoria' 'khouri' 'zizek' 'rob' 'timon' 'titanic']
tf-idf values:
[178.44626553 180.00910673 184.13677468 185.30580621 186.69823268
 188.84578956 190.11881797 190.11881797 190.74952846 192.19175433
 192.31709746 199.03905035 200.53489816 200.82037452 204.27548969
 205.36805594 210.46552255 216.97960963 220.32008782 247.74660926]


# Drawbacks of the current approach
While tf-idf (the `TfidfVectorizer` is a `CountVectorizer` followed by a `TfidfTransformer`) gives a proper numerical representation of a vocabulary and a set of documents, it has an important limitation. As the vectorization approach considers only one word at a time, word order, and with it much of the meaning of a sentence, is ignored. As a result,
"this is good, not bad" and "this is bad, not good" have the same representation in their vectorized form.

To cope with that, we can consider more than one word at a time. A sequence of n consecutive words is called an n-gram.
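
A few lines of plain Python show the effect on the two example sentences. The function `ngrams` is a hypothetical helper written for this illustration; it is not the scikit-learn implementation.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


a = "this is good not bad".split()
b = "this is bad not good".split()

# Unigrams only: both sentences consist of exactly the same words
same_unigrams = sorted(ngrams(a, 1, 1)) == sorted(ngrams(b, 1, 1))
# Unigrams plus bigrams: "not bad" and "not good" now tell them apart
same_with_bigrams = sorted(ngrams(a, 1, 2)) == sorted(ngrams(b, 1, 2))

print(same_unigrams, same_with_bigrams)
print(ngrams(a, 2, 2))
```

The price is a larger vocabulary, since every distinct word pair becomes a feature of its own.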

## Bag-of-Words and tf-idf with n-Grams


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(min_df=5, stop_words="english", max_df=1000, ngram_range = (1,3), norm=None)
X_idf_train = tfidf_vect.fit_transform(text_train)


In [18]:
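# read the shape from the sparse matrix directly; toarray() would build a dense
# 25000 x 82833 array that needs many gigabytes of memory
print(X_idf_train.shape)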
print("Vocabulary size: {}".format(len(tfidf_vect.vocabulary_)))

(25000, 82833)
Vocabulary size: 82833


## Alternative approach using lemmatization

In [19]:
import spacy
import re

# code partially taken from Müller & Guido "Introduction to Machine Learning with Python"
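# load spacy language model
# note: the 'en' shortcut only works with spaCy 2.x; with spaCy 3.x load an installed
# model by its full name instead, e.g. spacy.load('en_core_web_sm', ...)
en_nlp = spacy.load('en', disable=['parser', 'ner'])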

# create a custom tokenizer using the SpaCy document processing pipeline
# (now using our own tokenizer)
def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

# define a tf-idf vectorizer with the custom tokenizer
lemma_vect = TfidfVectorizer(min_df=5, stop_words="english", max_df=1000, norm=None, tokenizer=custom_tokenizer)


In [20]:
# transform text_train using the TfidfVectorizer with lemmatization
X_train_lemma = lemma_vect.fit_transform(text_train)
print("X_train_lemma.shape: {}".format(X_train_lemma.shape))


X_train_lemma.shape: (25000, 21395)


In [21]:
# standard TfidfVectorizer (without lemmatization) for reference
vect = TfidfVectorizer(min_df=5, stop_words="english", max_df=1000, norm=None).fit(text_train)
X_train = vect.transform(text_train)
print("X_train.shape: {}".format(X_train.shape))

X_train.shape: (25000, 26630)


Still, the dimensionality of the feature space is high (esp. when adding n-Grams to it). 

# Classification of text
## Classification with k-NN Classifier and tf-idf using n-Grams
`TfidfVectorizer` and `CountVectorizer` both support the parameter `ngram_range=(min_n, max_n)`. So we can try out a simple classifier (e.g. k-NN) and check the performance for different n-gram ranges.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(TfidfVectorizer(min_df=5, stop_words="english", max_df=1000), KNeighborsClassifier())
print(pipe.get_params().keys())
# running the grid-search takes a long time because of the
# relatively large grid and the inclusion of trigrams
param_grid = {'kneighborsclassifier__n_neighbors': [3, 5, 7, 9],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)]}

grid = GridSearchCV(pipe, param_grid, cv=5,  n_jobs=8, verbose=10)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

In [None]:
scores = grid.cv_results_['mean_test_score'].reshape(-1, 4).T
print(scores)
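
Why `reshape(-1, 4).T` yields one row per n-gram range and one column per `n_neighbors` value can be seen from the order in which the candidates are enumerated. The sketch below assumes that the grid search sorts the parameter names alphabetically and lets the last name vary fastest, as scikit-learn's `ParameterGrid` does; the names `flat`, `rows` and `transposed` are chosen for illustration.

```python
from itertools import product

n_neighbors = [3, 5, 7, 9]
ngram_ranges = [(1, 1), (1, 2), (1, 3), (1, 4)]

# Assumed candidate order: 'kneighborsclassifier__...' sorts before
# 'tfidfvectorizer__...', so n_neighbors is the outer loop and
# ngram_range varies fastest
flat = list(product(n_neighbors, ngram_ranges))

# Equivalent of reshape(-1, 4): one row per n_neighbors value
rows = [flat[i:i + 4] for i in range(0, len(flat), 4)]

# Equivalent of .T: one row per ngram_range, one column per n_neighbors value
transposed = [list(column) for column in zip(*rows)]

for row in transposed:
    print(row)
```

This matches the axis labels used for the heatmap: `n_neighbors` on the x-axis and `ngram_range` on the y-axis.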

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
ax = sns.heatmap(scores, annot=True, cmap="viridis", xticklabels=param_grid['kneighborsclassifier__n_neighbors'], yticklabels=param_grid["tfidfvectorizer__ngram_range"])

## Classification using Random Forest Classifier and n-Grams
Using a different classifier, here a random forest, leads to better results.

In [None]:
from sklearn.ensemble import RandomForestClassifier
pipe = make_pipeline(TfidfVectorizer(min_df=5, stop_words="english", max_df=1000), RandomForestClassifier())
print(pipe.get_params().keys())
# running the grid-search takes a long time because of the
# relatively large grid and the inclusion of trigrams
param_grid = {'randomforestclassifier__n_estimators': [47, 81, 101],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)]}

grid = GridSearchCV(pipe, param_grid, cv=5,  n_jobs=8, verbose=10)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))
scores = grid.cv_results_['mean_test_score'].reshape(-1, 4).T


In [None]:
print(scores)
sns.set()
ax = sns.heatmap(scores, annot=True, cmap="viridis", xticklabels=param_grid['randomforestclassifier__n_estimators'], yticklabels=param_grid["tfidfvectorizer__ngram_range"])


## Classification using Support Vector Classification and n-Grams
Note: This will take approx. 70-90 minutes to train on a quad-core machine.

In [None]:
from sklearn.svm import SVC
pipe = make_pipeline(TfidfVectorizer(min_df=5, stop_words="english", max_df=1000), SVC(gamma='auto'))
print(pipe.get_params().keys())
# running the grid-search takes a long time because of the
# relatively large grid and the inclusion of trigrams
param_grid = {'svc__C': [0.1, 1, 10],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)]}

grid = GridSearchCV(pipe, param_grid, cv=5,  n_jobs=8, verbose=10)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))
scores = grid.cv_results_['mean_test_score'].reshape(-1, 4).T

In [None]:
print(scores)
sns.set()
ax = sns.heatmap(scores, annot=True, cmap="viridis", xticklabels=param_grid['svc__C'], yticklabels=param_grid["tfidfvectorizer__ngram_range"])


Here we can see that more sophisticated classifiers, which are often considered superior to simple classifiers (like decision trees), do not perform too well for text classification. The reason is that the feature vector is very high-dimensional **and** sparse. This is known as the "curse of dimensionality". Alternative approaches are needed to reduce the sparsity and dimensionality; these are covered in Data Science II.
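
How sparse the feature matrices are can be estimated from the sparse-matrix summaries printed earlier in this notebook. The short calculation below copies the shapes and the numbers of stored elements from those outputs; the dictionary names are chosen for illustration.

```python
# (documents, features, stored elements) as reported in the outputs above
matrices = {
    "all words": (25000, 74849, 3431196),
    "min_df / max_df / stop words": (25000, 26630, 1399128),
}

density = {}
for name, (n_docs, n_features, stored) in matrices.items():
    # Fraction of matrix entries that are actually stored (non-zero)
    density[name] = stored / (n_docs * n_features)
    print("{}: {:.4%} of all entries are non-zero".format(name, density[name]))
```

In both cases only about 0.2 % of the entries are non-zero, so almost every document uses just a tiny fraction of the available dimensions.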

(c) 2020 Ingo Elsen (FH Aachen University of Applied Sciences)