# 11. Approaching Text Classification/Regression 1

Text data can be treated as tabular data whose entries carry additional structure. Unlike the previous examples, the explanatory variables (inputs) are textual rather than numerical. Since ML algorithms need numbers to work on, we must devise a process that turns text into numerical values. Let's look at the dataset used in this chapter.

**Concepts:**

- [Tokenization](#Tokenization)
- [Bag of words](#Bag-of-words)
- [Pre-tokenization](#Pre-tokenization)
- [Term frequency, inverse document frequency](#Term-Frequency-and-Inverse-Document-Frequency)
- [N-grams](#N-grams)
- [Stemming and lemmatization]()
- [Topic extraction]()
- [Latent Semantic Analysis](#Latent-Semantic-Analysis)


In [1]:
%matplotlib inline

import numpy as np
import nltk # Natural Language Toolkit
import matplotlib.pyplot as plt
import os
import pandas as pd

# from google.colab import drive
# drive.mount('/content/gdrive')

nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The main dataset is the **IMDB Review Dataset**. It has a single feature column named `review` and a target column named `sentiment`. The target distribution is 50/50, so the dataset is balanced.

In [None]:
df_raw = pd.read_csv('data/imdb.csv')
df_raw.sentiment = df_raw.sentiment.apply(lambda x: 1 if x == 'positive' else 0)
df = df_raw.copy(deep=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

## Tokenization

As a start, we can decompose a text into sentences and then into words. This process is called tokenization. After that, we define a mapping between tokens and numerical values. To tokenize a string, we use the `word_tokenize` function from the `nltk` library. Unlike the plain string `split` method, `word_tokenize` also separates punctuation marks into their own tokens.

In [None]:
from nltk.tokenize import word_tokenize

sentence = 'hi, how are you?'

print(sentence.split())

print(word_tokenize(sentence))
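
The mapping from tokens to numbers mentioned above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not what `nltk` or scikit-learn do internally: `simple_tokenize` and `build_vocabulary` are hypothetical helper names introduced here, and the regular expression only approximates `word_tokenize`.

```python
import re


def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks.

    Hypothetical stand-in for nltk's word_tokenize; it only approximates it.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocabulary(sentences):
    """Assign each distinct token an integer id in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for token in simple_tokenize(sentence):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


sentences = ["hi, how are you?", "how are things?"]
vocabulary = build_vocabulary(sentences)

# Encode a sentence as the list of its token ids.
encoded = [vocabulary[token] for token in simple_tokenize("hi, how are you?")]

print(vocabulary)
print(encoded)
```

Each distinct token gets exactly one id, so a sentence becomes a list of integers that later steps can count or weight.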

## Bag of words

`CountVectorizer` processes a collection of documents (here, strings), produces one vector per document, and stacks the vectors into a matrix. Each row vector is called a **bag of words**: it has one entry for every term found anywhere in the collection and stores how many times that term occurs in the row's document. Matrices produced by `CountVectorizer` and by `TfidfVectorizer`, introduced later, are called **term-document** matrices.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
          'hello, how are you?',
          'hello did you know about counts',
          'YES!!!'
]

ctv = CountVectorizer()

ctv.fit(corpus)

corpus_transformed = ctv.transform(corpus)
print('Sparse matrix')
print(corpus_transformed)
print('Vocabulary values')
print(sorted(ctv.vocabulary_.items(), key=lambda x: x[1]))
print('Array form')
print(corpus_transformed.toarray())
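
To make the counting explicit, here is a minimal sketch that builds the same kind of count matrix by hand. It assumes a simplified tokenizer (lowercase, tokens of two or more word characters, mirroring the default `token_pattern` of `CountVectorizer`); the helper name `bag_of_words` is hypothetical and the result is a plain list of lists rather than a sparse matrix.

```python
import re
from collections import Counter


def bag_of_words(corpus):
    """Return (vocabulary, count matrix) with one row per document."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    # A sorted vocabulary gives every term a fixed column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


corpus = ["hello, how are you?", "hello did you know about counts", "YES!!!"]
vocabulary, matrix = bag_of_words(corpus)

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries in each row are zero, which is why scikit-learn stores the result as a sparse matrix.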


## Pre-tokenization

The logistic regression example in the book is inefficient and slow because it tokenizes the dataset again in every fold. Tokenization gives the same result in every iteration, so it can be performed once in advance, which is called **pre-tokenization**. We therefore tokenize the dataset beforehand and only index into the result with the k-fold indices inside the loop. Note that the vocabulary is then built from the whole dataset, including the rows later used for validation.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

cvt = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)
df_tokenized = cvt.fit_transform(df.review)

For `LogisticRegression`, the default value of `max_iter` is 100, which is not sufficient for our example. If you run the code without raising the iteration limit, a non-convergence warning will appear. To reach convergence, a value such as `max_iter=10_000` can be used if runtime is not an issue.

In [None]:
from nltk.tokenize import word_tokenize
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer

y = df.sentiment.values

kf = model_selection.StratifiedKFold(n_splits=5)
accuracies = []


for fold_, (train_, validate_) in enumerate(kf.split(X=df, y=y)):


    X_train = df_tokenized[train_, :]
    y_train = df.sentiment[train_]
 
    X_test = df_tokenized[validate_, :]
    y_test = df.sentiment[validate_]

    model = linear_model.LogisticRegression(solver='sag')
    model.fit(X_train, y_train)

    preds = model.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, preds)
    accuracies.append(accuracy)


    print(f'Fold: {fold_}')
    print(f'Accuracy {accuracy}')

print(f'Mean accuracy {sum(accuracies)/len(accuracies)}')

Without pre-tokenization, `LogisticRegression` takes a long time; using `df_tokenized` reduces the runtime drastically. As a second method we use `MultinomialNB` from `naive_bayes`, which finishes almost instantaneously and, considering its accuracy, is very effective.

In [None]:
from nltk.tokenize import word_tokenize
from sklearn import metrics
from sklearn import model_selection
from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

y = df.sentiment.values

kf = model_selection.StratifiedKFold(n_splits=5)
accuracies = []


for fold_, (train_, validate_) in enumerate(kf.split(X=df, y=y)):


    X_train = df_tokenized[train_, :]
    y_train = df.sentiment[train_]
 
    X_test = df_tokenized[validate_, :]
    y_test = df.sentiment[validate_]

    model = naive_bayes.MultinomialNB()
    model.fit(X_train, y_train)

    preds = model.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, preds)
    accuracies.append(accuracy)


    print(f'Fold: {fold_}')
    print(f'Accuracy {accuracy}')

print(f'Mean accuracy {sum(accuracies)/len(accuracies)}')

## Term Frequency and Inverse Document Frequency

In addition to the word counts used in the bag of words approach, we can use the frequency of a word both within a single document and across the whole collection of documents. The following metrics are useful in this approach:

- Term frequency (**TF**) concerns a single document and is the number of times a term occurs in the document divided by the total number of terms in that document: 
$$TF(t) = \frac{\text{# of times t occurs}}{\text{# of terms}} \tag{11-1}$$ 

- Inverse document frequency (**IDF**) concerns a collection of documents and is the logarithm of the number of documents divided by the number of documents that contain the given term t:

$$IDF(t) = \log \left(\frac{\text{# of documents}}{\text{# of documents with t in it}} \right) \tag{11-2}$$

- Finally, **TF-IDF** combines the two metrics into one.

$$TF\text{-}IDF(t) = TF(t) \times IDF(t) \tag{11-3}$$

The strength of **TF-IDF** lies in its ability to combine information about the whole collection of documents. In the bag of words approach, the value assigned to a token depends only on the single document it appears in. In **TF-IDF**, the value also depends on how many documents in the collection contain the token, so terms that appear in nearly every document are weighted down.
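
Equations (11-1) to (11-3) can be checked on a toy corpus with a few lines of plain Python. This is a sketch of the textbook formulas only; the function names and the example documents are hypothetical. scikit-learn's `TfidfVectorizer` by default uses raw counts, a smoothed IDF, and L2 row normalization, so its numbers will not match these values exactly.

```python
import math
from collections import Counter


def tf(term, document):
    """Equation (11-1): occurrences of term divided by document length."""
    return Counter(document)[term] / len(document)


def idf(term, documents):
    """Equation (11-2): log of (number of documents / documents containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    """Equation (11-3): product of TF and IDF."""
    return tf(term, document) * idf(term, documents)


# Toy corpus, already tokenized (hypothetical example data).
documents = [
    ["hello", "how", "are", "you"],
    ["did", "you", "know", "about", "counts"],
    ["yes"],
]

# "you" occurs in 2 of 3 documents, "hello" in only 1, so "hello" scores higher.
print(tf_idf("you", documents[0], documents))
print(tf_idf("hello", documents[0], documents))
```

Both terms have the same TF in the first document, so the difference in score comes entirely from the IDF factor.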


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

corpus = [
          "hello, how are you?",
          "im getting bored at home. And you? What do you think?",
          "did you know about counts",
          "let's see if this works!",
          "YES!!!!"
]

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)

tfv.fit(corpus)
corpus_transformed = tfv.transform(corpus)
print(corpus_transformed)

We use the same pre-tokenization approach here with `TfidfVectorizer`.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

tvt = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
df_tfidf = tvt.fit_transform(df.review)

Now we can feed the TF-IDF features to `LogisticRegression`.

In [None]:
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

y = df.sentiment.values

kf = model_selection.StratifiedKFold(n_splits=5)
accuracies = []


for fold_, (train_, validate_) in enumerate(kf.split(X=df, y=y)):


    X_train = df_tfidf[train_, :]
    y_train = df.sentiment[train_]
 
    X_test = df_tfidf[validate_, :]
    y_test = df.sentiment[validate_]

    model = linear_model.LogisticRegression(solver='sag')
    model.fit(X_train, y_train)

    preds = model.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, preds)
    accuracies.append(accuracy)


    print(f'Fold: {fold_}')
    print(f'Accuracy {accuracy}')

print(f'Mean accuracy {sum(accuracies)/len(accuracies)}')

## N-grams

The next tokenization approach is **n-grams**. Instead of counting single words only, **n-grams** also take word order into account by treating every run of n consecutive tokens as one unit.

In [None]:
from nltk import ngrams
from nltk.tokenize import word_tokenize

N = 3

sentence = 'Hi, how are you?'

tokenized_sentence = word_tokenize(sentence)

n_grams = list(ngrams(tokenized_sentence, N))

print(n_grams)
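
What `ngrams` computes can be sketched without `nltk` by zipping shifted copies of the token list. The helper name `make_ngrams` is hypothetical, and the token list is written out by hand as an assumed tokenization of the sentence above.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted views of the list; it stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))


# Assumed tokenization of 'Hi, how are you?'
tokens = ["Hi", ",", "how", "are", "you", "?"]

print(make_ngrams(tokens, 3))
print(make_ngrams(tokens, 2))
```

A list of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.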

Once again we pre-tokenize the dataset. This time we add the `ngram_range` argument; `ngram_range=(1, 2)` keeps both single tokens and pairs of consecutive tokens.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

tvt_ngram = TfidfVectorizer(tokenizer=word_tokenize, 
                            token_pattern=None, 
                            ngram_range=(1,2))

df_tfidf_ngrams = tvt_ngram.fit_transform(df.review)


Note the difference in size between `df_tfidf` and `df_tfidf_ngrams`: the first has shape `(50000, 168707)` and the second `(50000, 2348110)`, roughly 14 times as many columns.

In [None]:
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

y = df.sentiment.values

kf = model_selection.StratifiedKFold(n_splits=5)
accuracies = []


for fold_, (train_, validate_) in enumerate(kf.split(X=df, y=y)):


    X_train = df_tfidf_ngrams[train_, :]
    y_train = df.sentiment[train_]
 
    X_test = df_tfidf_ngrams[validate_, :]
    y_test = df.sentiment[validate_]

    model = linear_model.LogisticRegression(solver='sag')
    model.fit(X_train, y_train)

    preds = model.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, preds)
    accuracies.append(accuracy)


    print(f'Fold: {fold_}')
    print(f'Accuracy {accuracy}')

print(f'Mean accuracy {sum(accuracies)/len(accuracies)}')

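## Stemming and lemmatization

Stemming and lemmatization both reduce a word to a base form. A stemmer strips suffixes by rule, so its output need not be a dictionary word. A lemmatizer returns the dictionary form (lemma) of the word; `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is given.

In [None]: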
from nltk import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

words = ['fishing', 'fishes', 'fished']

for word in words:
    print(f'word={word}')
    print(f'stemmed word={stemmer.stem(word)}')
    print(f'lemma={lemmatizer.lemmatize(word)}')


## Latent Semantic Analysis

This section of the book is a little terse if you are not familiar with the concepts, and it lacks motivation. Before diving into the code, some explanation of SVD and `TruncatedSVD` is necessary. Singular Value Decomposition (SVD) is a matrix decomposition method; keeping only its largest singular values gives a low-rank approximation of a data matrix $X$.

Assuming $m$ is the number of documents and $n$ is the number of unique terms, our data matrix $X$ (the term-document matrix produced by a vectorizer) can be approximated for a given `n_components=t` as follows:

$$\underbrace{X}_{m \times n} \approx X_{t} = \underbrace{U_{t}}_{m \times t} \, \underbrace{\Sigma_{t}}_{t \times t} \, \underbrace{V_{t}^{\intercal}}_{t \times n} \tag{11-4}$$

When SVD is applied to term-document matrices such as those produced by `CountVectorizer` or `TfidfVectorizer`, it is called latent semantic analysis. The purpose of this approach is to reduce the dimensionality of the data matrix. As we have seen, the matrices produced by the vectorizers have a large number of columns and take a long time to process. Another advantage is that `TruncatedSVD` can work with sparse matrices directly and can therefore deal with sizable datasets.
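
The shapes in equation (11-4) can be verified on a small dense matrix with NumPy. This is a sketch under stated assumptions: the random matrix `X` merely stands in for a term-document matrix, and the sizes `m`, `n`, `t` are arbitrary. `TruncatedSVD` uses approximate solvers suited to sparse input rather than the full decomposition computed here.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, t = 6, 8, 2          # documents, terms, kept components
X = rng.random((m, n))     # stand-in for a small term-document matrix

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to the t largest singular values.
U_t, S_t, Vt_t = U[:, :t], np.diag(s[:t]), Vt[:t, :]
X_t = U_t @ S_t @ Vt_t     # rank-t approximation, same shape as X

# Reduced representation of the documents: one row per document, t columns.
X_reduced = U_t @ S_t

print(U_t.shape, S_t.shape, Vt_t.shape)
print(X_t.shape, X_reduced.shape)
print(np.linalg.norm(X - X_t))
```

`X_reduced` has only `t` columns per document, which is the kind of reduced matrix a classifier can be trained on in place of the original `n` columns.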

In [None]:
from nltk.tokenize import word_tokenize
from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer


corpus = df.review.values[:10_000]

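tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
tfv.fit(corpus)

corpus_transformed = tfv.transform(corpus)

svd = decomposition.TruncatedSVD(n_components=10)

# fit returns the estimator itself; the topics are stored in svd.components_
svd.fit(corpus_transformed)

# get_feature_names() was removed in scikit-learn 1.2
feature_names = tfv.get_feature_names_out()

# print the six highest-scoring terms for the first five components
for component_index in range(5):
    feature_scores = dict(zip(feature_names,
                              svd.components_[component_index]))

    print(sorted(feature_scores, key=feature_scores.get, reverse=True)[:6])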


We can strip punctuation and collapse repeated whitespace by applying a cleaning function with the `apply` method of pandas.

In [None]:
import re
import string

from nltk.tokenize import word_tokenize
from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(s):
    # collapse runs of whitespace into single spaces
    s = s.split()
    s = ' '.join(s)
    # remove every ASCII punctuation character
    s = re.sub(f'[{re.escape(string.punctuation)}]', '', s)
    return s

corpus = df.copy(deep=True)

corpus.loc[:, 'review'] = corpus.review.apply(clean_text)

corpus = corpus.review.values

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
tfv.fit(corpus)

corpus_transformed = tfv.transform(corpus)

svd = decomposition.TruncatedSVD(n_components=1000)

df_reduced = svd.fit_transform(corpus_transformed)

After the transformation, we run `LogisticRegression` on the reduced dataset. Setting `n_components=1000` retains an accuracy of about 0.89.

In [None]:
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection

y = df.sentiment.values

kf = model_selection.StratifiedKFold(n_splits=5)
accuracies = []


for fold_, (train_, validate_) in enumerate(kf.split(X=df, y=y)):


    X_train = df_reduced[train_, :]
    y_train = df.sentiment[train_]
 
    X_test = df_reduced[validate_, :]
    y_test = df.sentiment[validate_]

    model = linear_model.LogisticRegression(solver='sag')
    model.fit(X_train, y_train)

    preds = model.predict(X_test)

    accuracy = metrics.accuracy_score(y_test, preds)
    accuracies.append(accuracy)


    print(f'Fold: {fold_}')
    print(f'Accuracy {accuracy}')

print(f'Mean accuracy {sum(accuracies)/len(accuracies)}')