## NLP Part 2:

**Agenda**

- Sentiment Analysis
    - Supervised learning sentiment anlaysis
    - Sentiment as feature engineering
    
- Topic Modeling with LDA (Latent-Dirichlet-Allocation)


### Quick Overview of the NLP lifecycle: From dataset to ouput
![workflowlda](images/sentiment_workflow.jpg)


![workflow](images/topic_workflow.jpg)


In [1]:
import pandas as pd
#pd.set_option('display.max_colwidth', 0) #code to wrap text for easy viewing
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = imdb_df = pd.read_csv("imdb_sentiment.csv")

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df.shape

(50000, 2)

In [5]:
df.sentiment.value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

In [6]:
#Replace sentiment column with 1 & 0s 
df.sentiment = df.sentiment.map({"positive":1, "negative":0})

### Preprocessing

This is a dataset of movie review from the IMBD website with already labeled sentiment of positive and negative. 

**Target Variable is sentiment**, we will be trying to classify if it's positive or negative.

In [7]:
#NLTK
import nltk #spacy 
nltk.download('punkt') #download if you have not done so
from nltk.tokenize import word_tokenize
nltk.download('stopwords') #download if you have not done so
from nltk.corpus import stopwords

# Import vectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
#instatiate tokenizer
tokenizer = word_tokenize

#get stop_words
stop_words = stopwords.words('english')
#stop_words.extend(['.',',',"'d", "'ll", "'re", "'s", "'ve", 'could', 'might', 'must', "n't", 'need', 'sha', 'wo', 'would'])

In [9]:
#instatiate vectorizer
tfidf = TfidfVectorizer(min_df = 2, tokenizer = tokenizer, stop_words = stop_words)

Now that we created our tokenizer function which will be fed into our TFIDF vectorizer down below, ideally I may want to optimize the min_df value of our vectorizer. The **min_df** used for removing terms that appear too infrequently.

For example:
* min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
* min_df = 5 means "ignore terms that appear in less than 5 documents".

### Modeling

In [10]:
from sklearn.naive_bayes import BernoulliNB #because binary 
from sklearn.pipeline import Pipeline

#build pipeline
nlp_pipeline = Pipeline([
    ('preprocessing', tfidf),
    ('model', BernoulliNB())
], verbose = True)

In [11]:
#train_test_split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.review, df.sentiment, 
                                                    test_size=0.20, random_state=42)

In [12]:
%%time
#fit pipeline
nlp_pipeline.fit(X_train, y_train)

train_accuracy = nlp_pipeline.score(X_train,y_train)
test_accuracy = nlp_pipeline.score(X_test,y_test)

[Pipeline] ..... (step 1 of 2) Processing preprocessing, total=  48.2s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.1s
Wall time: 1min 48s


In [13]:
print(f'Train accuracy:\t{train_accuracy}')
print(f'Test accuracy:\t{test_accuracy}')

Train accuracy:	0.9012
Test accuracy:	0.8601


In [26]:
#Test prediction
nlp_pipeline.predict(['Welcome to the terrible world of data science'])

array([0], dtype=int64)

### How do we find out the sentiment of a document if it isn't labeled?

There are python packages out there, such as textblob, nltk, & vaderSentiment, that makes it easy for us to determine a sentiment for any string.

- You may want to do this as part of a feature engineering step.


#### TextBlob Example

Note: Make sure you install textblob, it does not come with anaconda.
```
pip install textblob
```

In [27]:
from textblob import TextBlob

In [33]:
documents = ['Today is a good day!',
            'That is extraordinarily terrible!',
            'That is a waste of time.',
            'That is sick movie.']

In [34]:
for i in documents:
    print(TextBlob(i).sentiment)

Sentiment(polarity=0.875, subjectivity=0.6000000000000001)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=-0.2, subjectivity=0.0)
Sentiment(polarity=-0.7142857142857143, subjectivity=0.8571428571428571)


- Polarity ranges from -1 to 1, with -1 being most negative, and 1 being most positive
- Subjectivity is from 0 to 1, socre of 0 implying the statement is factual, score of 1 implies highly subejctive statement

#### Valence Aware Dictionary and sEntiment Reasoner (VADER)

- recenlty developed lexicon-based sentiment anlaysis tool, whose accuracy has been much greater than existing lexicon-based sentiment analyzers. It is better than other becauses it includes colloquial langauge terms, such as slang, emoticons, acronyms, and it also factors in the intesity of words.
- Model deeloped by georgia tech. Read the paper on it [here](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf), it's an easy to read paper if you're interested.

In [35]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as Vader

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [36]:
vader = Vader()

In [37]:
for i in documents:
    print(vader.polarity_scores(i))

{'neg': 0.0, 'neu': 0.484, 'pos': 0.516, 'compound': 0.4926}
{'neg': 0.531, 'neu': 0.469, 'pos': 0.0, 'compound': -0.5255}
{'neg': 0.412, 'neu': 0.588, 'pos': 0.0, 'compound': -0.4215}
{'neg': 0.524, 'neu': 0.476, 'pos': 0.0, 'compound': -0.5106}




VADER's SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:

    negative
    neutral
    positive
    compound (computed by normalizing the scores above)



_______________________

## LDA Topic Modelling

**What is topic modelling? What are the use cases?**




### Example: A corpus of food magazines 

![exampleA](images/00_TM_food_magazines.png)

### Example: A corpus of news articles
![exampleB](images/01_TM_NYT_articles.png)


### Topic modeling 

- Suppose your company has a large collection of documents on a variety of topics
- Suppose they ask you to 
    - infer different topics in the documents
    - pull all documents about a certain topic    
    
**Use Cases:**
- Understanding customer support chat logs
- Understanding customer reviews
- Categorizing knowledge databases
- Looking at the change of topics over time. 
- Any idea on combining topic modeling and sentiment analysis?

### Topic modeling motivation

- Humans are pretty good at reading and understanding documents and answering questions such as 
    - What is it about?  
    - What is it related to in terms of content?     
- Labeling by hand? 
    - Probably not
- Use topic modeling which automates this process of inferring underlying structure in a large corpus of text documents

### Topic modeling: Input 

<center>
<img src="images/02_TM_science_articles.png" height="2000" width="2000"> 
</center>
(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output
<center>
<img src="images/TM_topics.png" height="600" width="600"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Topic modeling: output with interpretation

- The labels are assigned manually.  
<center>
<img src="images/03_TM_topics_with_labels.png" height="800" width="800"> 
</center>

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

## Topic modeling pipeline 

- Feed knowlege into the machines; let it read large amount of text
    * E.g., Wikipedia or News articles     
- Preprocess your corpus 
    - Be careful with the features (i.e., words)
- Train ML models
    - For now Latent Dirichlet Allocation (LDA)
- Interpret your topics     
- Evaluate
    - How well your model does on unseen documents? 

### Baysian approach: Latent Dirichlet Allocation (LDA)

- Developed by [David Blei](http://www.cs.columbia.edu/~blei/) and colleagues. 
    * One of the most cited papers in the last 15 years.
    
- Main Idea:
    - Documents exhibit multiple topics
    - A word in a document is likely to belong to the same topics the other words of that document
    
- Insight: 
    - Each document is a random mixture of corpus-wide topics.
        - Every document is a discrete probability distribution of topics

    - Every topic is a mixture words, aka. vocabulary, that is equal to the length of number of words in corpus.
        - Every topic is a discrete probability distribution of words 
        



### LDA: insight
- Each document is a random mixture of corpus-wide topics
- Every topic is a mixture words


![lda](images/04_TM_dist_topics_words_blei.png)

(Credit: [Dave Blei's presentation](http://www.cs.columbia.edu/~blei/talks/Blei_Science_2008.pdf))

### Example: Every document is a discrete probability distribution of topics

- Assume two topics: Topic 1 (topic model) and Topic 2 (fashion model)
- Document 1: 100% topic models
- Document 4: 100% fashion models
- Document 7: 60% topic models + 40% fashion model

<blockquote>
Document 1: probabilistic topic model<br>
Document 2: probabilistic topic model<br>
Document 3: probabilistic topic model<br>
Document 4: famous fashion model<br>
Document 5: famous fashion model<br>
Document 6: famous fashion model<br>
Document 7: famous fashion model at probabilistic topic model conference<br>    
</blockquote>
    
(Credit: The example is adapted from [Topic models tutorial](http://topicmodels.info/))

### Example: Every topic is a discrete probability distribution of words

- Assume two topics: Topic 1 (topic model) and Topic 2 (fashion model)
- Topic 1: _model_ (0.33), _probabilistic_ (0.32), _topic_ (0.32), ...    
- Topic 2: _model_ (0.33), _famous_ (0.32), _fashion_ (0.32), ...    

<blockquote>
Document 1: probabilistic topic model<br>
Document 2: probabilistic topic model<br>
Document 3: probabilistic topic model<br>
Document 4: famous fashion model<br>
Document 5: famous fashion model<br>
Document 6: famous fashion model<br>
Document 7: famous fashion model at probabilistic topic model conference<br>    
</blockquote>
    
(Credit: The example is adapted from [Topic models tutorial](http://topicmodels.info/))


![topics](images/topics.png)

![topic_traingle](images/topics_triangle.png)

### LDA learning: goals

Infer the underlying topic structure in the documents. In particular, 
- Learn the probability distribution of topics in each document
- Learn the discrete probability distribution of words in each topic

### LDA learning: intuition

Intuition: A word in a document is likely to belong to the same topic as the other words in that document. 

### LDA algorithm 

- Choose the number of topics you think are there in your corpus
    * Example: k = 2
    
### LDA algorithm

- Repeat the following steps till the topics make sense:     
- Randomly assign each words in each document to one of the topics
    * Example: The word _probabilistic_ is randomly assigned to topic 2 (fashion).
- Go through every word and its topic assignment in each document, looking at
    * How often the topic occurs in the document?
    * How often the word occurs with the topic overall? 
    * Example: Seems like topic 2 does not occur in Document 1 and the word _probabilistic_ doesn't occur much in topic 2 (fashion). So the word _probabilistic_ should probably be assigned to topic 1. 


#### Training LDA with [Gensim](https://radimrehurek.com/gensim/index.html)

Tutorial from [here](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py)

In [38]:
import io
import os.path
import re
import tarfile

import smart_open

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            for member in tar.getmembers():
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())

In [39]:
print(len(docs))
print(docs[0][:500]) #look at the first document

1740
387 
Neural Net and Traditional Classifiers  
William Y. Huang and Richard P. Lippmann 
MIT Lincoln Laboratory 
Lexington, MA 02173, USA 
Abstract
Previous work on nets with continuous-valued inputs led to generative 
procedures to construct convex decision regions with two-layer percepttons (one hidden 
layer) and arbitrary decision regions with three-layer percepttons (two hidden layers). 
Here we demonstrate that two-layer perceptton classifiers trained with back propagation 
can form both c


In [40]:
#Pre Processing
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
doocs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

In [41]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [42]:
import gensim

In [43]:
%%time
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams and trigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

Wall time: 12.3 s


In [44]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [45]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [46]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 9738
Number of documents: 1740


In [47]:
%%time
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10 #number of topics, similar to K in k-means
chunksize = 2000 # Number of documents to be used in each training chunk.
passes = 20 #Number of passes through the corpus during training.
iterations = 400 #Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make a index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

Wall time: 1min 9s


In [48]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.1672.
[([(0.018794239, 'neuron'),
   (0.01863369, 'cell'),
   (0.0086115245, 'response'),
   (0.007806326, 'stimulus'),
   (0.0072809695, 'spike'),
   (0.006781008, 'activity'),
   (0.006232301, 'synaptic'),
   (0.0056042876, 'firing'),
   (0.004935798, 'cortex'),
   (0.004325682, 'connection'),
   (0.0042543025, 'visual'),
   (0.004132585, 'frequency'),
   (0.0040870407, 'cortical'),
   (0.0038517267, 'signal'),
   (0.0036974992, 'orientation'),
   (0.0035579538, 'potential'),
   (0.0033264365, 'field'),
   (0.0032710019, 'fig'),
   (0.00315523, 'layer'),
   (0.0030391028, 'temporal')],
  -0.8341374567087377),
 ([(0.0106591955, 'gaussian'),
   (0.009123859, 'mixture'),
   (0.007869442, 'component'),
   (0.007790583, 'density'),
   (0.0069111274, 'likelihood'),
   (0.006691491, 'matrix'),
   (0.0056680543, 'prior'),
   (0.0054988363, 'em'),
   (0.004901987, 'noise'),
   (0.004713703, 'bayesian'),
   (0.004685639, 'estimate'),
   (0.004622938, 'posterior'),
 

In [49]:
%%time
#import pyLDAvis.sklearn Sklearn version
import pyLDAvis
import pyLDAvis.gensim_models
 
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(model, corpus, dictionary, sort_topics=False)
vis

Wall time: 6.59 s
