## Lighthouse Labs
### W08D4 NLP II
Instructor: Socorro Dominguez  
February 25, 2020

**Agenda:**
* Introduction to NLP modeling

* Sentiment analysis
    * Supervised learning sentiment analysis

* Topic modeling
    * LDA (Latent-Dirichlet-Allocation)

In [1]:
import os.path
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt

import gensim 
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.models.wrappers import LdaMallet  # needs gensim < 4.0 (the wrappers module was removed in 4.0)

import gensim.corpora as corpora
from gensim.corpora import Dictionary

from gensim import matutils, models
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.x
import string

%matplotlib inline


from sklearn.model_selection import train_test_split



## Sentiment Analysis

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

## Using Supervised Learning Algorithms for Sentiment Analysis

Naive Bayes is popular in text classification tasks. 

You have used NB before. Today, we will use it for sentiment analysis, which is the problem of assigning a positive or negative label to a text based on the sentiment or attitude expressed in it. 

For this example, we will use the [IMDB movie review data set](https://www.kaggle.com/utathya/imdb-review-dataset). If you want to reproduce this example, you will need to download the data on your own.

### Loading data and preprocessing

1. We need to load the CSV data as a pandas DataFrame.

2. There are three possible labels in the dataset: `pos`, `neg`, and `unsup`. For now, let's discard rows with `unsup`.

In [2]:
imdb_df = pd.read_csv('data/imdb_master.csv', encoding = "ISO-8859-1")
imdb_df.head()

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [3]:
imdb_df['label'].value_counts()

unsup    50000
pos      25000
neg      25000
Name: label, dtype: int64

In [4]:
# only consider positive and negative reviews
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]

### Feature extraction

The current data is in the form of movie reviews (text paragraphs) and their targets (`pos` or `neg`). 
We need to encode movie reviews into feature vectors so that we can train supervised machine learning models with `scikit-learn`. 

How can we do this?



#### Create word frequency counts (`X_counts`)
Turn the text into a sparse vector of word frequency counts using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `scikit-learn`. 

When you reproduce this, explore the arguments of `CountVectorizer` (e.g., [`stop_words`](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words), `ngram_range`, `max_features`, `min_df`, and `tokenizer`).  

#### Create binarized representation of words (`X_binary`)
Create binarized encoding (`X_binary`) of `X_counts`, where you replace word frequencies $\geq$ 1 by 1.    
The intuition behind using binarized representation is that for sentiment analysis word occurrence may matter more than word frequency. For instance, the occurrence of the word _excellent_ tells us a lot and the fact that it occurs four times may not tell us much more. 
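
To make the difference between the two encodings concrete, here is a minimal sketch on two made-up sentences (the toy `docs` list is an assumption for illustration, not part of the IMDB data). It also shows that `CountVectorizer` accepts `binary=True`, which should give the same 0/1 encoding directly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
docs = ["excellent excellent excellent movie", "boring movie"]

count_vec = CountVectorizer()
X_counts_toy = count_vec.fit_transform(docs)

# Vocabulary words ordered by their column index.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# Binarize the counts: any frequency >= 1 becomes 1.
X_binary_toy = (X_counts_toy > 0).astype(int)

# Alternative: let the vectorizer binarize for us.
X_binary_direct = CountVectorizer(binary=True).fit_transform(docs)

print(vocab)
print(X_counts_toy.toarray())
print(X_binary_toy.toarray())
```

In the count encoding the word _excellent_ has the value 3 for the first sentence; in the binarized encoding it is just 1.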

In [5]:
# For tokenization
import nltk
# For converting words into frequency counts
from sklearn.feature_extraction.text import CountVectorizer



In [6]:
# initialize the CountVectorizer and turn all movie reviews into a sparse matrix of word counts
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, stop_words='english')

# use top 5000 words only
# movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features = 5000) 
X_counts = movie_vec.fit_transform(imdb_df['review'])

# Convert raw frequency counts into binarized representation. 
X_binary = X_counts > 0


## Split before vectorizing

### Train Naive Bayes classifier

1. Split (`X_counts`, `imdb_df.label`) into train (80%) and test (20%).
2. Train [multinomial Naive Bayes algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) on the train set. 
3. Report train and test accuracies.
4. Now repeat steps 1, 2, and 3 with (`X_binary`, `imdb_df.label`). 
5. Compare your results for `X_counts` and `X_binary` and note your observations. 

Note to the reader: I should have split my dataset before applying `CountVectorizer`. On the training split, you should apply `.fit_transform`; on the test split, you should apply only `.transform`.

As an exercise, fix this error in your forked copy.
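
One way to approach the exercise is sketched below on a handful of invented reviews (the `texts` and `labels` lists are placeholders, not the IMDB data). The vocabulary is learned from the training split only, and the test split is merely transformed with it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Invented mini data set, for illustration only.
texts = [
    "an excellent and moving film", "great acting and a great plot",
    "wonderful, funny and smart", "a truly excellent comedy",
    "boring and far too long", "the worst film of the year",
    "pathetic plot and terrible acting", "a dull, disappointing mess",
]
labels = ["pos"] * 4 + ["neg"] * 4

# 1. Split the raw text first.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=12, stratify=labels
)

# 2. Fit the vectorizer on the training text only.
vec = CountVectorizer()
X_train = vec.fit_transform(text_train)

# 3. Reuse the fitted vocabulary on the test text (no fitting here).
X_test = vec.transform(text_test)

model = MultinomialNB().fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```

Words that appear only in the test reviews are ignored by `transform`, which mirrors what happens when the model meets genuinely unseen reviews.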

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

In [8]:
def get_NB_train_test_accuracies(X, y, classifier = 'multinomial'):
    """
    Given X, y, and the classifier name ('multinomial' or 'bernoulli'), 
    this function splits the data into train and test splits, 
    prints the train and test accuracies, and returns the model.     
    """
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                        y, 
                                                        test_size = 0.20, 
                                                        random_state = 12)
    if classifier.startswith('multinomial'):
        model = MultinomialNB().fit(X_train, y_train)
    elif classifier.startswith('bernoulli'):
        model = BernoulliNB().fit(X_train, y_train)
    else:
        # Fail early instead of hitting a NameError on `model` below
        raise ValueError(f"Unknown classifier: {classifier!r}")
    print('Training accuracy:', model.score(X_train, y_train))
    print('Test accuracy: ', model.score(X_test, y_test))
    print('---------')
    return model

In [9]:
print('Evaluation on binarized encoding ')
model_binary = get_NB_train_test_accuracies(X_binary, imdb_df.label, classifier = 'bernoulli')

print('Evaluation on counts encoding ')
model_counts = get_NB_train_test_accuracies(X_counts, imdb_df.label)

Evaluation on binarized encoding 
Training accuracy: 0.90135
Test accuracy:  0.8567
---------
Evaluation on counts encoding 
Training accuracy: 0.89905
Test accuracy:  0.8558
---------



What do you observe?

### Let's play with fake reviews 

Let's see how the model performs on fake movie reviews. Some examples are given below.

In [10]:
fake_reviews = ['This movie was excellent! The performances were oscar-worthy!',
               'Unbelievably disappointing.', 
               'Full of zany characters and richly applied satire, and some great plot twists',
               'This is the greatest screwball comedy ever filmed',
               'It was pathetic. The worst part about it was the boxing scenes.', 
               '''It could have been a great movie. It could have been excellent, 
                and to all the people who have forgotten about the older, 
                greater movies before it, will think that as well. 
                It does have beautiful scenery, some of the best since Lord of the Rings. 
                The acting is well done, and I really liked the son of the leader of the Samurai.
                He was a likeable chap, and I hated to see him die...
                But, other than all that, this movie is nothing more than hidden rip-offs.
                '''
              ]
gold_labels = ['pos', 'neg', 'pos', 'pos', 'neg', 'neg']

In [11]:
# Create word count encoding of the reviews.  
fake_reviews_counts = movie_vec.transform(fake_reviews)
fake_reviews_binary = fake_reviews_counts > 0

In [12]:
# Predict using the Naive Bayes classifier
predictions = model_binary.predict(fake_reviews_binary)

In [13]:
print(predictions.tolist())

['pos', 'neg', 'pos', 'pos', 'neg', 'pos']


In [14]:
pd.set_option('display.max_colwidth', 0)
d = {'Review':fake_reviews, 'Gold labels':gold_labels, 'NB labels':predictions}
df = pd.DataFrame(d)
df

Unnamed: 0,Review,Gold labels,NB labels
0,This movie was excellent! The performances were oscar-worthy!,pos,pos
1,Unbelievably disappointing.,neg,neg
2,"Full of zany characters and richly applied satire, and some great plot twists",pos,pos
3,This is the greatest screwball comedy ever filmed,pos,pos
4,It was pathetic. The worst part about it was the boxing scenes.,neg,neg
5,"It could have been a great movie. It could have been excellent, \n and to all the people who have forgotten about the older, \n greater movies before it, will think that as well. \n It does have beautiful scenery, some of the best since Lord of the Rings. \n The acting is well done, and I really liked the son of the leader of the Samurai.\n He was a likeable chap, and I hated to see him die...\n But, other than all that, this movie is nothing more than hidden rip-offs.\n",neg,pos


1. Our model works well when there are clear words indicating whether the review is positive or negative, as the features we are using are word features.
2. It fails on more complex examples, where understanding the context and the overall text is essential to classify reviews correctly. The last example has many positive words at the beginning, but the last sentence negates all the positivity in the preceding text. We need to incorporate deeper linguistic knowledge to classify such cases correctly. 

### Sentiment Analysis with VADER

In [15]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/seiryu8808/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [16]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

The `polarity_scores()` method of VADER's `SentimentIntensityAnalyzer` takes in a string and returns a dictionary of scores in each of four categories:

* negative
* neutral
* positive
* compound (the sum of the lexicon valence scores of the words, adjusted by VADER's rules and normalized to lie between -1 and 1)
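
As a rough illustration of the normalization behind the compound score: the reference VADER implementation appears to map the summed word valences `x` to `x / sqrt(x^2 + alpha)`, with `alpha = 15` as the default. The sketch below re-implements only that step in plain Python. The function name `normalize_valence` and the example sums are assumptions made for illustration, and the real analyzer applies additional rules (negation, intensifiers, punctuation) before this step.

```python
import math

def normalize_valence(score, alpha=15):
    """Squash an unbounded summed valence into the interval [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clip defensively in case of floating-point overshoot.
    return max(-1.0, min(1.0, norm))

# Hypothetical raw valence sums, chosen only to show the shape of the mapping.
for raw_sum in (-6.0, -1.5, 0.0, 1.5, 6.0):
    print(raw_sum, round(normalize_valence(raw_sum), 4))
```

The mapping is symmetric around zero and flattens out for large sums, which is why long, strongly worded texts end up close to -1 or 1.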

In [17]:
a = 'The weather today is horrible. I dont feel like getting out'
sid.polarity_scores(a)

{'neg': 0.412, 'neu': 0.588, 'pos': 0.0, 'compound': -0.6818}

In [18]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

In [19]:
df['Vader_scores'] = df['Review'].apply(lambda review: sid.polarity_scores(review))

df['compound']  = df['Vader_scores'].apply(lambda score_dict: score_dict['compound'])

df['Vader labels'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

df_labels = df[['Review', 'Gold labels', 'NB labels', 'Vader labels']]

df_labels.head()

Unnamed: 0,Review,Gold labels,NB labels,Vader labels
0,This movie was excellent! The performances were oscar-worthy!,pos,pos,pos
1,Unbelievably disappointing.,neg,neg,neg
2,"Full of zany characters and richly applied satire, and some great plot twists",pos,pos,pos
3,This is the greatest screwball comedy ever filmed,pos,pos,pos
4,It was pathetic. The worst part about it was the boxing scenes.,neg,neg,neg
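
The cell above labels every review with `compound >= 0` as positive, so a perfectly neutral text (compound 0.0) also counts as `pos`. A common alternative, suggested in the VADER documentation, is to keep a small neutral band around zero. The sketch below assumes thresholds of ±0.05; the function name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Compound scores from the two examples earlier in this section, plus 0.0.
for c in (-0.6818, -0.8074, 0.0):
    print(c, label_from_compound(c))
```

For a binary task such as the IMDB reviews, you would still need to decide what to do with the `neu` cases, for example by falling back to the sign of the score.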


10 min Break

## Topic modeling 

- Suppose your company has a large collection of documents on a variety of topics

### Example: A corpus of food magazines 
<center>
<img src="images/00_TM_food_magazines.png" height="2000" width="2000"> 
</center>

### Example: A corpus of news articles 
<center>
<img src="images/01_TM_NYT_articles.png" height="2000" width="2000"> 
</center>

### Topic modeling 

- Suppose your company has a large collection of documents on a variety of topics
- Suppose they ask you to 
    - infer different topics in the documents
    - pull all documents about a certain topic    

### Topic modeling motivation

- Humans are pretty good at reading and understanding documents and answering questions such as 
    - What is it about?  
    - What is it related to in terms of content?     
- Labeling by hand? 
    - Probably not
- Use topic modeling, which automates the process of inferring the underlying structure in a large corpus of text documents

### Topic modeling: Input 

<center>
<img src="images/02_TM_science_articles.png" height="2000" width="2000"> 
</center>
(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output
<center>
<img src="files/images/TM_topics.png" height="900" width="900"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output with interpretation

- The labels are assigned manually.  
<center>
<img src="images/03_TM_topics_with_labels.png" height="800" width="800"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))


## Topic modeling pipeline 

- Feed knowledge into the machine; let it read a large amount of text
    * E.g., Wikipedia or News articles     
- Preprocess your corpus 
    - Be careful with the features (i.e., words)
- Train ML models
    - For now Latent Dirichlet Allocation (LDA)
- Interpret your topics     
- Evaluate
    - How well does your model do on unseen documents? 

### Bayesian approach: Latent Dirichlet Allocation (LDA)

- Developed by [David Blei](http://www.cs.columbia.edu/~blei/) and colleagues. 
    * One of the most cited papers in the last 15 years.
    
- Insight: 
    - Each document is a random mixture of corpus-wide topics
        - Every document is a discrete probability distribution of topics

    - Every topic is a mixture of words
        - Every topic is a discrete probability distribution of words 

### LDA: insight
- Each document is a random mixture of corpus-wide topics
- Every topic is a mixture of words
<center>
<img src="images/04_TM_dist_topics_words_blei.png" height="1000" width="1000"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Example: Every document is a discrete probability distribution of topics

- Assume two topics: Topic 1 (topic model) and Topic 2 (fashion model)
- Document 1: 100% topic models
- Document 4: 100% fashion models
- Document 7: 60% topic models + 40% fashion model

<blockquote>
Document 1: probabilistic topic model<br>
Document 2: probabilistic topic model<br>
Document 3: probabilistic topic model<br>
Document 4: famous fashion model<br>
Document 5: famous fashion model<br>
Document 6: famous fashion model<br>
Document 7: famous fashion model at probabilistic topic model conference<br>    
</blockquote>
    
(Credit: The example is adapted from [Topic models tutorial](http://topicmodels.info/))

### Example: Every topic is a discrete probability distribution of words

- Assume two topics: Topic 1 (topic model) and Topic 2 (fashion model)
- Topic 1: _model_ (0.33), _probabilistic_ (0.32), _topic_ (0.32), ...    
- Topic 2: _model_ (0.33), _famous_ (0.32), _fashion_ (0.32), ...    

<blockquote>
Document 1: probabilistic topic model<br>
Document 2: probabilistic topic model<br>
Document 3: probabilistic topic model<br>
Document 4: famous fashion model<br>
Document 5: famous fashion model<br>
Document 6: famous fashion model<br>
Document 7: famous fashion model at probabilistic topic model conference<br>    
</blockquote>
    
(Credit: The example is adapted from [Topic models tutorial](http://topicmodels.info/))
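
The two kinds of distributions above can be combined into a small generative sketch: to write a word, first draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The code below uses only the standard library. The probabilities are the rounded values from the example (the small remaining mass hidden behind the `...` is ignored, and `random.choices` renormalizes the weights), and the helper name `generate_document` is an assumption.

```python
import random

# Topic -> word weights, copied from the example above.
topic_word = {
    "topic model":   {"model": 0.33, "probabilistic": 0.32, "topic": 0.32},
    "fashion model": {"model": 0.33, "famous": 0.32, "fashion": 0.32},
}

# Document 7: 60% topic model + 40% fashion model.
doc7_topics = {"topic model": 0.6, "fashion model": 0.4}

def generate_document(doc_topics, n_words, rng):
    """Sample n_words words: pick a topic per word, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        dist = topic_word[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)
print(generate_document(doc7_topics, 8, rng))
```

LDA inference runs this story in reverse: it observes only the words and tries to recover both sets of distributions.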

# Intuition

What is a Dirichlet distribution?

What are our objectives?

![img](images/theparty.png)

![img](images/dangers.png)

![img](images/DirichletDistributions.png)

### LDA model

- Observable features: words
- All other parameters are hidden or latent

<center>
<img src="images/05_TM_topic_model_blei.png" height="700" width="700"> 
</center>

(Adapted from [David Blei's paper](http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf))

# LDA Machine

- We want to get the best settings
![img](images/Lda_machine.png)

![img](images/words_triangle.png)

![img](images/topics_triangle.png)

![img](images/2distributions.png)

### LDA: Hyperparameters

- $\alpha$
    - High alpha &rarr; every document contains a mixture of most of the topics
    - Low alpha &rarr; every document is representative of only a few topics
- $\beta$
    - High beta &rarr; every topic contains a mixture of most of the words
    - Low beta &rarr; every topic contains a mixture of only a few words

<center>
<img src="images/05_TM_topic_model_blei.png" height="600" width="600"> 
</center>

(Adapted from [David Blei's paper](http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf))
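
To get a feel for what high and low values of the concentration parameter do, the sketch below draws document-topic mixtures from a symmetric Dirichlet using only the standard library (a Dirichlet sample can be obtained by normalizing independent Gamma draws). The values 10 and 0.1 and the helper name `sample_dirichlet` are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha, ..., alpha) of dimension k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
for alpha in (10.0, 0.1):
    print("alpha =", alpha)
    for _ in range(3):
        mixture = sample_dirichlet(alpha, k=5, rng=rng)
        print("  ", [round(p, 2) for p in mixture])
```

With the high value the five proportions should come out roughly even; with the low value most of the mass should land on one or two topics. The same reasoning applies to $\beta$ and the topic-word distributions.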

![img](images/probas.png)

![img](images/the_blueprint_relationship.png)

### LDA learning: goals

Infer the underlying topic structure in the documents. In particular, 
- Learn the probability distribution of topics in each document
- Learn the discrete probability distribution of words in each topic

### LDA learning: intuition

Intuition: A word in a document is likely to belong to the same topic as the other words in that document. 

### LDA algorithm 

- Choose the number of topics you think there are in your corpus
    * Example: k = 2

### LDA algorithm

- Randomly assign each word in each document to one of the topics
    * Example: The word _probabilistic_ is randomly assigned to topic 2 (fashion).
- Repeat the following step until the topics make sense:
    - Go through every word and its topic assignment in each document, looking at
        * How often does the topic occur in the document?
        * How often does the word occur with the topic overall? 
        * Example: Topic 2 does not seem to occur in Document 1, and the word _probabilistic_ doesn't occur much in topic 2 (fashion). So the word _probabilistic_ should probably be reassigned to topic 1. 
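
The procedure above closely resembles collapsed Gibbs sampling, and can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not how Gensim trains its model (Gensim's `LdaModel` uses online variational Bayes instead). The function name `gibbs_lda`, the hyperparameter values, and the six-document corpus are all made up for the example.

```python
import random

def gibbs_lda(docs, k=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)

    # Step 1: randomly assign every word occurrence to a topic.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]

    # Count tables derived from the assignments.
    doc_topic = [[0] * k for _ in docs]                        # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(k)]   # word counts per topic
    topic_total = [0] * k                                      # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Step 2: repeatedly revisit each word and resample its topic.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight = (how often the topic occurs in this document)
                #        * (how often this word occurs with the topic overall).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + n_vocab * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

docs = [["probabilistic", "topic", "model"]] * 3 + [["famous", "fashion", "model"]] * 3
doc_topic, topic_word = gibbs_lda(docs)
for d, counts in enumerate(doc_topic, start=1):
    print("Document", d, "topic counts:", counts)
```

The small `alpha` and `beta` values act as smoothing, so that a topic with a zero count still has a small chance of being picked.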


### Training LDA with [Gensim](https://radimrehurek.com/gensim/models/ldamodel.html)

You need

- Document-term matrix 
- Pick number of topics: `num_topics`
- Pick number of passes: `passes`



* *Disclaimer: You can also check out scikit-learn's LDA model. However, Gensim is more widely used in NLP.*

In [20]:
toy_df = pd.read_csv('data/toy_lda_data.csv')
toy_df

Unnamed: 0,doc_id,text
0,1,famous fashion model
1,2,famous fashion model
2,3,famous fashion model
3,4,famous fashion model
4,5,famous fashion model
5,6,famous fashion model
6,7,famous fashion model
7,8,famous fashion model
8,9,famous fashion model
9,10,famous fashion model


In [21]:
corpus = [doc.split() for doc in toy_df['text'].tolist()]
corpus

[['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['famous', 'fashion', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model'],
 ['probabilistic', 'topic', 'model']]

In [22]:
# Create a vocabulary for the lda model and 
# convert our corpus into document-term matrix for Lda
dictionary = corpora.Dictionary(corpus)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)],
 [(2, 1), (3, 1), (4, 1)]]

In [23]:
lda = models.LdaModel(corpus=doc_term_matrix, 
                      id2word=dictionary, 
                      num_topics=2, 
                      passes=10)

In [24]:
lda.print_topics()

[(0,
  '0.326*"probabilistic" + 0.326*"topic" + 0.324*"model" + 0.012*"famous" + 0.012*"fashion"'),
 (1,
  '0.327*"model" + 0.323*"fashion" + 0.323*"famous" + 0.014*"topic" + 0.014*"probabilistic"')]

In [25]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)
vis


### Tips when you build an LDA model on a large corpus 

- Preprocessing is crucial!! 
    - Tokenize, remove punctuation, convert text to lower case
    - Discard words with length < threshold or word frequency < threshold        
    - Stoplist: Remove the most commonly used words in English 
    - Lemmatization: Consider the root form of the word. 
    - Restrict to specific parts of speech
        * Only consider nouns, verbs, and adjectives
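
A minimal sketch of the first few tips, using only the standard library, might look like the following. The stoplist here is a tiny hypothetical one, and the helper name `preprocess` and the length threshold of 3 are assumptions. Lemmatization and part-of-speech filtering are left out because they need a library such as NLTK or spaCy.

```python
import re

# Tiny hypothetical stoplist; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "at"}

def preprocess(text, min_len=3):
    """Lowercase, strip punctuation, tokenize, and drop stopwords and short tokens."""
    text = text.lower()
    # Keep alphabetic runs only; this drops punctuation and digits.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]

tokens = preprocess("The famous fashion model is at a Probabilistic Topic-Model conference!")
print(tokens)
```

The resulting token lists can be passed to `corpora.Dictionary` as in the toy example above. For the word-frequency threshold, Gensim's `Dictionary.filter_extremes` is one option worth exploring.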