[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut4_NLP_pipeline_teacher.ipynb) 

# Exercise NLP pipeline
You are already familiar with building predictive models on tabular data, where you have a feature matrix `X` and a target vector `Y` and apply learning algorithms such as neural networks or random forests to learn the relationship between `X` and `Y`. In this exercise, you are provided with a data set of movie reviews. Your goal is to build a classifier that predicts whether a review is positive or negative, a task called sentiment classification. A prediction problem with a binary target `Y` is nothing new for you. What is new is that you need to work with text data instead of tabular data: the text must first be processed to obtain the required feature matrix `X`. This processing is what we call the "NLP pipeline". 

In this exercise, you will need to set up an NLP pipeline. You are provided with a data set of movie reviews, where each sample contains a review (just a string cell). To obtain a feature matrix, each sample string cell needs to be transformed into a feature vector $x$. This process is called vectorization. There are multiple possible vectorization procedures. Today, you will implement a bag-of-words model for feature extraction. This feature extraction process involves two steps:
1. Vocabulary building
 * Tokenization: Transforming a review, which is a single string at the beginning, into a vector of strings (tokens).
 * Cleaning and compressing techniques: Reducing the number of distinct tokens. E.g., correcting misspelled words or lowercasing all letters prevents the same word from appearing in several spellings. Additionally, similar words (e.g. different forms of a verb) can be merged into a single token. 
 * Building a bag-of-words: a vector whose length corresponds exactly to the number of distinct tokens. Each token is assigned a fixed position within the vector. 
 
2. Feature creation based on term frequency: Each review gets transformed into a feature vector $x$. The length of the feature vector corresponds to the length of the bag-of-words vector, created in step 1. An element $x_{j}$ of the feature vector is calculated by a frequency measure, measuring how frequently token $j$ from the bag-of-words vector occurs in the review. 
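
The two steps above can be sketched in a few lines of plain Python. This is only an illustration on a made-up toy corpus that uses raw counts as the frequency measure; the names `reviews`, `vocabulary`, and `count_vector` are hypothetical and do not appear in the exercise. Note that the sketch assigns positions in order of first appearance, whereas a library implementation may order the vocabulary differently.

```python
# Toy corpus: each review is already tokenized (hypothetical example data).
reviews = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
]

# Step 1: build the vocabulary, assigning each distinct token a fixed position.
vocabulary = {}
for tokens in reviews:
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

# Step 2: turn each review into a vector of raw token counts.
def count_vector(tokens, vocabulary):
    x = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:  # tokens missing from the vocabulary are ignored
            x[vocabulary[token]] += 1
    return x

X = [count_vector(tokens, vocabulary) for tokens in reviews]
print(vocabulary)  # {'great': 0, 'movie': 1, 'cast': 2, 'bad': 3}
print(X)           # [[2, 1, 1, 0], [0, 1, 0, 1]]
```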

The first code cells import the required packages and load the review data set, which you will use for the exercise. In the exercise, you will build the simplest possible NLP pipeline: you go through steps 1 and 2, but you skip the "cleaning and compressing" part of step 1. This simple NLP pipeline provides you with a feature matrix `X` (possibly not ideal). You will use this feature matrix to build and evaluate a predictive model.

In the tutorial, we will extend your NLP pipeline by including the cleaning and compressing techniques (these techniques are also covered in detail in the demo notebook `nlp_foundations.ipynb`). That will lead to a second feature matrix `X`. Then, we will build another predictive model on this new feature matrix and compare its performance to that of the model built with the simplified NLP pipeline.

In [2]:
# required packages
import pandas as pd
import nltk
# nltk.download('punkt') If needed
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from bs4 import BeautifulSoup ## handles html
import re ## provides regular expressions functionality
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle

In [3]:
# Remember to adjust the path so that it matches your environment
df = pd.read_csv("IMDB-50K-Movie-Review.csv", sep=",", encoding="ISO-8859-1")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [4]:
## get to know the data
print(df)
df.head()

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
# Use only the first 10,001 observations to reduce run time (.loc slicing includes the end label 10000).
df = df.loc[0:10000,:]

df.reset_index(inplace=True, drop=True)  # drop=True discards the old index, so cases can no longer be traced back to the original data frame
df.sentiment.value_counts()

positive    5028
negative    4973
Name: sentiment, dtype: int64

In [6]:
# Map label
df['sentiment'] = df['sentiment'].map({'positive' : 1, 'negative': 0})
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


## Exercise (simple NLP pipeline):
You need to transform the text data, contained in the column `df["review"]`, such that it is suitable as a feature matrix `X`, which you need for predictive model building. This means in detail: 

a) Create a list "reviews_tokenized", where each element corresponds to a string vector, representing a review. Use NLTK's `word_tokenize()` function.

In [7]:
nltk.word_tokenize('Hello world')

['Hello', 'world']

In [8]:
reviews_tokenized = df.apply(lambda row: nltk.word_tokenize(row['review']), axis=1)
print(reviews_tokenized)

0        [One, of, the, other, reviewers, has, mentione...
1        [A, wonderful, little, production, ., <, br, /...
2        [I, thought, this, was, a, wonderful, way, to,...
3        [Basically, there, 's, a, family, where, a, li...
4        [Petter, Mattei, 's, ``, Love, in, the, Time, ...
                               ...                        
9996     [Give, me, a, break, ., How, can, anyone, say,...
9997     [This, movie, is, a, bad, movie, ., But, after...
9998     [This, is, a, movie, that, was, probably, made...
9999     [Smashing, film, about, film-making, ., Shows,...
10000    [``, While, sporadically, engrossing, (, inclu...
Length: 10001, dtype: object


b) Split the review data (`reviews_tokenized`) as well as the target `df['sentiment']` into training and test sets. Use 80% of the data for training. Use sklearn's `train_test_split()` function.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(reviews_tokenized, df['sentiment'], test_size = 0.2, random_state = 5)

c) Now, we need to set up a vocabulary for all tokens and apply this vocabulary to obtain feature vectors $x$. We do this using sklearn's `TfidfVectorizer`. We provide the code to set up the vectorizer below. You need to apply the vectorizer to the data.

In [10]:
def dummy_fun(doc):
    return doc       
vectorizer = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = dummy_fun,
    preprocessor = dummy_fun,
    token_pattern = None)

## Set up the dictionary and calculate the document frequency of each token on the training set.
## Then generate the features on the training set, using the document frequency table.
reviews_tr = vectorizer.fit_transform(X_train)

## Apply the document frequency table to the test set to generate feature vectors.
reviews_ts = vectorizer.transform(X_test)

The `TfidfVectorizer` did multiple steps at once. To better understand how it works, you should examine the results step by step.

d) Examine the vocabulary it created: How many tokens does it include? Which tokens are included? Would it maybe be better to leave some of these tokens out to reduce the dimension of the vocabulary and the derived feature matrix?

In [13]:
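# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = list(vectorizer.get_feature_names_out())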
print('The vocabulary contains ' + str(len(vocab)) + ' tokens.')
print('Now let us look at some examples of these tokens.')
print(vocab[0:50])

The vocabulary contains 74267 tokens.
Now let us look at some examples of these tokens.
['\x10own', '!', '#', '$', '%', '&', "'", "''", "''Headin", "''Scarface", "''Wallace", "'*name", "'.", "'007", "'00s", "'01", "'02", "'03", "'04", "'05", "'06", "'07", "'10", "'12", "'15", "'1st", "'20th", "'28", "'30", "'30s", "'30s-'40s", "'30s-Ray", "'30s/'40s", "'40", "'40s", "'42", "'43", "'45", "'50", "'50s", "'50s/early", "'51", "'54-'55", "'59", "'60", "'60s", "'60s-early", "'61", "'63", "'66"]
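
The first 50 tokens printed above are mostly punctuation marks and year fragments, which hints at one possible answer to the question in d): tokens that are not words, or that occur in very few reviews, could be left out. The following sketch shows the idea on a made-up toy corpus in plain Python. The threshold `min_docs` and all variable names are assumptions for illustration only; `TfidfVectorizer` offers a comparable `min_df` parameter.

```python
from collections import Counter

# Hypothetical tokenized reviews; punctuation tokens are kept on purpose.
toy_reviews = [
    ["great", "movie", "!"],
    ["bad", "movie", "!"],
    ["'30s", "movie", "!"],
]

# Document frequency: in how many reviews does each token occur at least once?
doc_freq = Counter()
for tokens in toy_reviews:
    doc_freq.update(set(tokens))

min_docs = 2  # assumed threshold: keep tokens that occur in at least two reviews
vocab_pruned = sorted(
    token for token, n_docs in doc_freq.items()
    if n_docs >= min_docs and token.isalpha()  # also drop non-alphabetic tokens
)

print(len(doc_freq))  # 5 distinct tokens before pruning
print(vocab_pruned)   # ['movie']
```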


e) Let's recap how feature vectors are generated from this vocabulary. The basic idea of bag-of-words feature extraction is to create one column in the feature matrix `X` for each token in the vocabulary. For an observation $i$ (corresponding to a single review), the entry $X_{i,j}$ of the feature matrix is 1 if the review contains the token of column $j$ and 0 otherwise. There are several variations of this approach. One of them is Tfidf (*term frequency-inverse document frequency*), which we apply in this exercise. It encodes $X_{i,j}$ not as 1 if the review contains token $j$ but as the number of times token $j$ occurs in review $i$ (term frequency) multiplied by a weight that shrinks the more reviews of the data set contain token $j$ (inverse document frequency). Tokens that appear in almost every review therefore receive low weights, whereas tokens that are frequent in a review but rare across the data set receive high weights. By default, `TfidfVectorizer` also scales each feature vector to unit length. Have a look at the matrix that the `TfidfVectorizer` created. 
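
To make the weighting concrete, here is a minimal sketch of the tf-idf computation in plain Python, assuming scikit-learn's default settings: raw counts as term frequency, the smoothed inverse document frequency $\text{idf}(j) = \ln\frac{1 + n}{1 + \text{df}(j)} + 1$ (with $n$ reviews and $\text{df}(j)$ reviews containing token $j$), and scaling of each row to unit Euclidean length. The toy corpus and all names are hypothetical; the sketch is meant to illustrate the formula, not to replace the vectorizer.

```python
import math

# Toy corpus of tokenized reviews (hypothetical example data).
docs = [
    ["great", "movie"],
    ["bad", "movie"],
    ["great", "great", "cast"],
]
toy_vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of reviews that contain each token.
doc_freq = {t: sum(t in doc for doc in docs) for t in toy_vocab}

# Smoothed inverse document frequency (assumed scikit-learn default formula).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in toy_vocab}

def tfidf_vector(doc):
    # Raw term count times idf, then scaled to unit Euclidean length.
    weights = [doc.count(t) * idf[t] for t in toy_vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm > 0 else 0.0 for w in weights]

X_toy = [tfidf_vector(doc) for doc in docs]
print(toy_vocab)
for row in X_toy:
    print([round(v, 3) for v in row])
```

In the second toy review, the token `bad` occurs in only one review and therefore gets a larger weight than `movie`, which occurs in two.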

In [14]:
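## print the first 100 feature entries for the first review in the training set
print(reviews_tr[0,0:100].todense())
## NOTE: X_train[0] selects the review with index label 0, which need not be the first
## training row after shuffling; X_train.iloc[0] returns the review that matches reviews_tr[0]
print(X_train[0])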

[[0.        0.        0.        0.        0.        0.        0.
  0.0742515 0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.       ]]
['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'th

f) Fit a ridge regression classifier (`RidgeClassifier`) and evaluate the accuracy of the predictions on the training and test sets. We provide the code below.

In [15]:
classifier = RidgeClassifier(random_state=42, alpha=0.8)
classifier.fit(reviews_tr, y_train)
pred_test = classifier.predict(reviews_ts)
pred_train = classifier.predict(reviews_tr)
print(metrics.accuracy_score(y_train, pred_train))
print(metrics.accuracy_score(y_test, pred_test))

0.996125
0.8690654672663668


## Tutorial exercise
In the previous exercise, we paid little attention to the cleaning and compressing part of the NLP pipeline. As a consequence, we obtained a vocabulary with many tokens that are most likely not very informative. The large number of tokens resulted in a very high-dimensional feature matrix `X`. In this exercise, we add the cleaning and compressing part to the NLP pipeline. We hope to obtain a feature matrix of lower dimension that yields more accurate predictions.
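
As a rough preview of what cleaning does to a single review, the sketch below uses only the standard library. It is a simplified stand-in, not the pipeline used in this notebook: the regular expression for HTML tags replaces `BeautifulSoup`, a whitespace split replaces `word_tokenize`, the stopword list `TOY_STOPWORDS` is a tiny made-up sample, and lemmatization is left out entirely. The function name and the example review are hypothetical.

```python
import re

# Tiny illustrative stopword list (assumption); NLTK's English list is much longer.
TOY_STOPWORDS = {"a", "the", "is", "was", "this", "of", "and"}

def clean_review_sketch(review):
    """Strip HTML tags, keep letters only, lowercase, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", review)  # crude HTML tag removal
    text = re.sub(r"[^a-zA-Z]", " ", text)  # replace non-alphabetic characters by spaces
    tokens = text.lower().split()           # simple whitespace tokenization
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(clean_review_sketch("This movie was GREAT!<br /><br />Loved the cast & the plot."))
# ['movie', 'great', 'loved', 'cast', 'plot']
```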

In [16]:
## download pre-learned NLP tools
nltk.download('stopwords') ## to identify stopwords 
nltk.download('averaged_perceptron_tagger') ## for part-of-speech tagging (used for lemmatization)
nltk.download('omw-1.4')
nltk.download('wordnet')

# Lemmatize with POS Tag (Parts of Speech tagging)
def get_wordnet_pos(word):
    """Map POS tag to first character for lemmatization"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

## function to clean text data
def clean_reviews(df):
    """ Standard NLP pre-processing chain including removal of html tags, non-alphabetic characters, and stopwords.
        Words are lemmatized using their POS tags, which are determined with NLTK's POS tagger and mapped to WordNet categories.
    """
    reviews = []

    lemmatizer = WordNetLemmatizer()
    # load the stopword list once; a set makes the membership test fast
    stop_words = set(stopwords.words("english"))

    print('*' * 40)
    print('Cleaning {} movie reviews.'.format(df.shape[0]))
    counter = 0
    for review in df:

        # remove html content (name the parser explicitly to avoid a BeautifulSoup warning)
        review_text = BeautifulSoup(review, "html.parser").get_text()

        # remove non-alphabetic characters
        review_text = re.sub("[^a-zA-Z]", " ", review_text)

        # lower-case the text and tokenize it
        words = word_tokenize(review_text.lower())

        # filter stopwords
        words = [w for w in words if w not in stop_words]
        
        # lemmatize each word to its lemma
        lemma_words =[lemmatizer.lemmatize(i, get_wordnet_pos(i)) for i in words]
    
        reviews.append(lemma_words)
              
        if (counter > 0 and counter % 500 == 0):
            print('Processed {} reviews'.format(counter))
            
        counter += 1
        
    print('DONE')
    print('*' * 40)

    return(reviews) 

[nltk_data] Downloading package stopwords to /Users/hawk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hawk/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /Users/hawk/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/hawk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [17]:
# Do the cleaning
# CAUTION: takes around 20 minutes 
reviews_clean = clean_reviews(df.review)

****************************************
Cleaning 10001 movie reviews.
Processed 500 reviews
Processed 1000 reviews
Processed 1500 reviews
Processed 2000 reviews
Processed 2500 reviews
Processed 3000 reviews
Processed 3500 reviews
Processed 4000 reviews
Processed 4500 reviews
Processed 5000 reviews
Processed 5500 reviews
Processed 6000 reviews
Processed 6500 reviews
Processed 7000 reviews
Processed 7500 reviews
Processed 8000 reviews
Processed 8500 reviews
Processed 9000 reviews
Processed 9500 reviews
Processed 10000 reviews
DONE
****************************************


In [20]:
## save cleaned data
with open('reviews_clean.pkl', "wb") as open_file:
    pickle.dump(reviews_clean, open_file)

In [21]:
## load cleaned data
with open('reviews_clean.pkl', "rb") as open_file:
    reviews_clean = pickle.load(open_file)

In [22]:
## While the text gets cleaned, we have a look at the part-of-speech-tagging and lemmatization part of the cleaning function
## part-of-speech tagging identifies the word category (whether a word is a verb, noun, adjective, or adverb)
print(get_wordnet_pos('running'))
print(get_wordnet_pos('runner'))

v
n


In [23]:
## the word category determines how to lemmatize the word
lemmatizer_example = WordNetLemmatizer()
print(lemmatizer_example.lemmatize('running',get_wordnet_pos('running')))
print(lemmatizer_example.lemmatize('runner',get_wordnet_pos('runner')))
print(lemmatizer_example.lemmatize('run',get_wordnet_pos('run')))

run
runner
run


In [24]:
## split reviews in training and test set
Xclean_train, Xclean_test, y_train, y_test = train_test_split(reviews_clean, df['sentiment'], test_size = 0.2, random_state = 5)

In [25]:
## apply tfidf feature extraction
vectorizer_clean = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = dummy_fun,
    preprocessor = dummy_fun,
    token_pattern = None)

## apply tfidf to training set and create vocabulary
reviews_clean_tr = vectorizer_clean.fit_transform(Xclean_train)

## Apply the document frequency table to the test set to generate feature vectors.
reviews_clean_ts = vectorizer_clean.transform(Xclean_test)

In [27]:
## analyze vocabulary
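# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab_clean = list(vectorizer_clean.get_feature_names_out())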
print('The vocabulary contains ' + str(len(vocab_clean)) + ' tokens.')
print('Now let us look at some examples of these tokens.')
# print(vocab_clean[10000:10050])
print(vocab_clean[0:50])

The vocabulary contains 38070 tokens.
Now let us look at some examples of these tokens.
['aa', 'aaa', 'aaaaahhhh', 'aaaarrgh', 'aaah', 'aaall', 'aaargh', 'aaaugh', 'aag', 'aah', 'aaip', 'aaliyah', 'aames', 'aamir', 'aamto', 'aankhen', 'aap', 'aardman', 'aaron', 'aarp', 'aashok', 'aatish', 'aavjo', 'aawip', 'ab', 'abandon', 'abandonment', 'abash', 'abba', 'abbey', 'abbie', 'abbot', 'abbott', 'abbreviate', 'abc', 'abdalla', 'abdic', 'abdomen', 'abduct', 'abductee', 'abduction', 'abductor', 'abducts', 'abdul', 'abe', 'abedded', 'abel', 'abemethie', 'abernathie', 'abernethie']


In [28]:
## apply and evaluate classifier on clean text data
classifier_clean = RidgeClassifier(random_state=42, alpha=0.8)
classifier_clean.fit(reviews_clean_tr, y_train)
pred_test = classifier_clean.predict(reviews_clean_ts)
pred_train = classifier_clean.predict(reviews_clean_tr)
print(metrics.accuracy_score(y_train, pred_train))
print(metrics.accuracy_score(y_test, pred_test))

0.993125
0.8690654672663668
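
To compare the two pipelines, the printed results can be set side by side. The short calculation below simply copies the vocabulary sizes and test accuracies from the outputs above and takes them at face value; the variable names are hypothetical.

```python
# Figures copied from the notebook outputs above.
vocab_size_simple, vocab_size_clean = 74267, 38070
acc_test_simple, acc_test_clean = 0.8690654672663668, 0.8690654672663668

# Relative reduction of the vocabulary (and thus of the feature dimension).
reduction = 1 - vocab_size_clean / vocab_size_simple
print(f"Vocabulary reduction: {reduction:.1%}")                             # 48.7%
print(f"Change in test accuracy: {acc_test_clean - acc_test_simple:+.4f}")  # +0.0000
```

Judging by these numbers, cleaning roughly halves the feature dimension, while the reported test accuracy does not change on this split.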
