# Assignment 4. Classification and Clustering on Text Data
### Due Date: May 8th (Wednesday) Midnight

This week we look at some classical tasks and learn how to apply supervised and unsupervised learning to text data.

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Rename your assignment notebook with "(Your Name)\_Assignment_X_ ..." (Not the title, the na)
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells).
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> **HTML(.html)**
6. Look at the **HTML** file and make sure all your solutions are there, displayed correctly.
7. If a file (e.g. xxx.json) is required, submit it together with the Jupyter notebook file in a single zip archive. Rename the zipped file with "(Your Name)\_Assignment_X_ ..."
8. Submit your zipped file/notebook on Canvas.

In [39]:
# packages you may need to import
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.decomposition import LatentDirichletAllocation
import nltk
from nltk.cluster import KMeansClusterer
import numpy as np
import pandas as pd
import warnings 
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline

## Problem 1: Text Data Classification

In this problem we will use a dataset of movie reviews from IMDb to learn how to perform classification on text data.

Download the dataset here: http://ai.stanford.edu/~amaas/data/sentiment/

The dataset contains the text of the reviews, together with a label that marks each review as "positive" or "negative". In Assignment 2 we learned how to use built-in tools (NLTK, TextBlob) for sentiment analysis. Here we will build a classifier from scratch, and you can easily extend the techniques you learn to more advanced tasks.

### Q1: Load and Preprocess the dataset (30 pts)

If you look at the unzipped dataset, there are two top-level folders named "train" and "test" for training and testing data, and the "train" folder has three subfolders: "pos(itive)", "neg(ative)" and "unsup(ervised)", the last of which holds unlabeled reviews. For the sake of simplicity, we will treat the training dataset as our whole dataset, resample it to half its size and discard the data in the "unsup" folder.

Please do the following:

1. Delete the folder "unsup" in "train".
2. Load data from "../aclImdb/train" and store the training data in `text_raw` and `y_raw` (samples and targets).
3. Print out the data type of `text_raw` (*list*, *dict* or *tuple*?) and its length.
4. Print out the first element of `text_raw`. Perform reasonable preprocessing on the text, and print out the first element of the preprocessed text data (named `text_ready`).
5. Split `text_ready` into `X_train_text`, `X_test_text`, `y_train`, `y_test`. The training set should be 0.4 of the size of `text_ready` and the test set 0.1. The ratio of `pos` to `neg` after the split should be the same as in the original data.
6. Print out the number of training samples per label (How many *positive*? How many *negative*?)

In [40]:
train_data = load_files('aclImdb/train')  # forward slashes also work on Windows
train_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [41]:
text_raw = train_data['data']
y_raw = train_data['target']

In [42]:
type(text_raw)

list

In [43]:
len(text_raw)

25000

In [44]:
text_raw[0]

b"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty."

In [45]:
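from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Split each decoded review into word tokens (runs of letters, digits, underscores).
tokenizer = RegexpTokenizer(r'\w+')

text_new = []
for i in text_raw:
    a = tokenizer.tokenize(i.decode())
    text_new.append(a)

# Drop English stop words from every token list.
# NOTE: removing items from a list while iterating over it skips the token that
# follows each removal, and the comparison is case-sensitive, so some stop words
# (e.g. "did", "It", "the") remain in the output below.
stop_words = stopwords.words('english')
for i in text_new:
    for j in i:
        if j in stop_words:
            i.remove(j)

# Join the tokens back into one string per review.
text_ready = []
for i in text_new:
    text_ready.append(' '.join(i))

print('------------------')
print('Preprocessed Text')
print('------------------')
print(text_ready[0])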

------------------
Preprocessed Text
------------------
Zero Day leads think even think two boys young men would did commit mutual suicide via slaughtering classmates It captures must beyond bizarre mode being two humans have decided withdraw common civility order define own mutual world via coupled destruction br br It not perfect movie given what money time filmmaker actors remarkable product In terms explaining motives actions the two young suicide murderers is better Elephant terms being film gets our rationalistic skin it is a far far better film almost anything are likely to see br br Flawed honest a terrible honesty
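
The preprocessed text above still contains stop words such as "did", "have", "the" and "is", and the `<br />` tags survive as `br` tokens. Calling `remove()` on a list while looping over it skips the element after each removal, and capitalized words like "It" never match the lowercase stop-word list. A minimal sketch of an alternative follows. The small `STOP_WORDS` set and the helper names `preprocess` and `buggy_filter` are illustrative assumptions (in the notebook you would pass `stopwords.words('english')`), not part of the original solution.

```python
import re

# Illustrative stand-in for set(stopwords.words('english')); an assumption of this sketch.
STOP_WORDS = {"it", "is", "not", "a", "the", "but", "what"}

def preprocess(raw, stop_words=STOP_WORDS):
    """Decode a review, drop <br /> tags, tokenize, and filter stop words."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    text = re.sub(r"<br\s*/?>", " ", text)   # remove HTML line breaks
    tokens = re.findall(r"\w+", text)        # same pattern as RegexpTokenizer(r'\w+')
    # Build a new list instead of removing items from the one being iterated.
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

def buggy_filter(tokens, stop_words=STOP_WORDS):
    """Reproduces the in-place removal pattern to show which words it misses."""
    tokens = list(tokens)
    for t in tokens:
        if t in stop_words:
            tokens.remove(t)
    return tokens

sample = b"It is not a perfect movie.<br /><br />Flawed but honest."
print(preprocess(sample))
print(buggy_filter(["it", "is", "not", "a", "perfect", "movie"]))
```

In the second call, "is" and "a" survive because each one directly follows a removed token. Whether to lowercase before matching is a design choice; here the original casing of kept words is preserved.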


In [46]:
df = pd.DataFrame()
df['text_ready'] = text_ready
df['y_raw'] = y_raw
df.head()

|   | text_ready | y_raw |
|:-:|:-----------|:-----:|
| 0 | Zero Day leads think even think two boys young... | 1 |
| 1 | Words t describe bad movie I t explain by writ... | 0 |
| 2 | Everyone plays part pretty well little nice mo... | 1 |
| 3 | There lot highly talented filmmakers actors Ge... | 0 |
| 4 | I evidence confirmed suspicions A bunch kids 1... | 0 |


In [47]:
X_train_text, X_test_text, y_train, y_test = train_test_split(df['text_ready'],
                                                             df['y_raw'],
                                                              train_size = 0.4,
                                                             test_size = 0.1,
                                                              stratify = df['y_raw'],
                                                             random_state=0)
print('The number of Positive samples:',sum(y_train))
print('The number of Negative samples:',len(y_train) - sum(y_train))

The number of Positive samples: 5000
The number of Negative samples: 5000
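
With `train_size=0.4` the 25000 reviews yield 10000 training samples, and `stratify` keeps the positive/negative ratio of the full data, which is why the counts above are 5000 and 5000. The same behavior can be checked on a toy example; the `texts` and `labels` lists below are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the reviews: 10 positive (1) and 10 negative (0) labels.
texts = ["review %d" % i for i in range(20)]
labels = [1] * 10 + [0] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    train_size=0.4,      # 40% of 20 -> 8 samples
    test_size=0.1,       # 10% of 20 -> 2 samples
    stratify=labels,     # keep the 50/50 class ratio in both parts
    random_state=0,
)

print(len(y_tr), sum(y_tr))  # training size and number of positives
print(len(y_te), sum(y_te))  # test size and number of positives
```

The remaining 50% of the samples are simply left out, which matches the "resample it to half size" instruction in Q1.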


### Q2: Count-Based Word Vectors (30 pts)

One of the challenges in text classification is how to represent text in a meaningful way. We are familiar with classification on "numbers", but how about "text"? Is there a way to capture the information in a text and store it as numbers (a vector, matrix or tensor)?

One of the simplest yet most effective ways is the *bag-of-words* representation. It is a count-based encoding method: we discard most of the structural information in the text (such as chapters, paragraphs and context) and use "counts" (how often each word appears in the text) as the representation.

In this problem, we will encode our text data into count-based word vectors. Then we can feed them into our classifier.

Please do the following:

1. Use `CountVectorizer` from scikit-learn to build a vectorizer based on the whole training text data.
2. Transform the training and test text data into vectors (numbers) with the vectorizer. Name them `X_train` and `X_test`.
3. Print out the number of rows, the vocabulary size (number of features) and the first 20 features.
4. Print out a DataFrame whose rows are training samples (sample 0, 1, 2 ...) and whose columns are the features generated by vectorization. (The DataFrame should be very sparse. You can just show `head()` for the HTML output.)

Here is an example:
    
For corpus: `["I love Dartmouth", "I love data science", "homework is not my love"]`

Your DataFrame output should look like this (three rows for three sentences; the columns are the vocabulary terms):
    
| * | dartmouth | data | homework | is | love | my | not | science |
|:-:|:---------:|:----:|:--------:|:--:|:----:|:--:|:---:|:-------:|
| 0 |     1     |   0  |     0    |  0 |   1  |  0 |  0  |    0    |
| 1 |     0     |   1  |     0    |  0 |   1  |  0 |  0  |    1    |
| 2 |     0     |   0  |     1    |  1 |   1  |  1 |  1  |    0    |
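
The table above can be reproduced with a few lines. This is a minimal sketch using `CountVectorizer` with its default settings; note that "I" does not appear as a column because the default token pattern only keeps tokens of two or more characters. The column names are read from `vocabulary_` so that the snippet does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love Dartmouth", "I love data science", "homework is not my love"]

vect = CountVectorizer()             # default settings: lowercase, tokens of 2+ characters
counts = vect.fit_transform(corpus)  # sparse matrix, one row per sentence

# Column names ordered by column index; vocabulary_ maps each term to its column.
columns = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
table = pd.DataFrame(counts.toarray(), columns=columns)
print(table)
```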

In [48]:
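# write your code here
# bigrams and trigrams that occur in at least 5 documents
vect = CountVectorizer(ngram_range=(2,3),min_df=5)

# NOTE: fit() returns the vectorizer itself, so X_tn and X_tt are the same object
# and the second call replaces the vocabulary learned from the training text
# with one learned from the test text.
X_tn = vect.fit(X_train_text)
X_tt = vect.fit(X_test_text)

X_train = X_tn.transform(X_train_text)
X_test = X_tt.transform(X_test_text)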

In [49]:
X_train.shape

(10000, 7799)
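
In the cell above, `vect.fit(X_train_text)` and `vect.fit(X_test_text)` are called on the same object. Because `fit` returns the vectorizer itself, `X_tn` and `X_tt` refer to one vectorizer whose vocabulary comes from the last call, that is, the test text, so the 7799 columns are features found in the test reviews. A sketch of the usual pattern (learn the vocabulary from the training text only, then transform both sets) on a toy corpus with made-up data:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_text = ["good movie", "bad movie", "good plot"]   # toy training data
test_text = ["bad acting", "good acting"]               # toy test data

# fit() returns the vectorizer itself, so a second fit() replaces the vocabulary.
vect = CountVectorizer()
first = vect.fit(train_text)
second = vect.fit(test_text)
print(first is second)                 # both names point to the same object
print(sorted(vect.vocabulary_))        # vocabulary now comes from test_text

# Usual pattern: learn the vocabulary from the training text only.
vect = CountVectorizer()
X_train = vect.fit_transform(train_text)
X_test = vect.transform(test_text)     # unseen words such as "acting" are ignored
print(sorted(vect.vocabulary_))
print(X_train.shape, X_test.shape)
```

Fitting only on the training text keeps the test set unseen until evaluation, so the reported scores are a fairer estimate of performance on new reviews.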

In [50]:
print(X_tn.get_feature_names()[0:20])

['10 10', '10 because', '10 br', '10 br br', '10 minutes', '10 of', '10 of 10', '10 stars', '10 the', '10 years', '100 years', '12 year', '12 year old', '14 year', '14 year old', '15 minutes', '15 years', '20 minutes', '20 years', '20th century']


In [51]:
df_train = pd.DataFrame(X_train.todense(),columns=X_tn.get_feature_names())
df_train.head()

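|   | 10 10 | 10 because | 10 br | 10 br br | 10 minutes | 10 of | 10 of 10 | 10 stars | 10 the | 10 years | ... | your heart | your life | your local | your mind | your money | your own | your seat | your time | zero mostel | zeta jones |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |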


### Q3: Classify the Text Data (30 pts)

Now we are ready to do the classification:

1. Use `LogisticRegression` (param `C` as 0.001) as our classifier.
2. Print out the cross-validation score (five folds, LR as classifier).
3. Use `GridSearchCV` to find the best `C` value among `[0.001, 0.01, 0.1, 1, 10]`. Print out the best cross-validation score.
4. Predict on the test set. Print out the **precision score**, **recall score** and **f1 score**.
5. Draw a **confusion matrix** to show the classification result.

In [52]:
# write your code here
LR = LogisticRegression(C=0.001)
LR.fit(X_train,y_train)

score = cross_val_score(LR, X_train,y_train,cv=5)
score

array([0.723 , 0.7   , 0.6965, 0.7075, 0.707 ])

In [53]:
param_grid = {'C':[0.001,0.01,0.1,1,10]}
grid = GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid.fit(X_train,y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [54]:
grid.best_params_

{'C': 0.1}

In [55]:
grid.best_score_

0.7896
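
The cross-validation and grid search above run on `X_train`, a matrix that was vectorized once before the folds were created. A common alternative is to wrap the vectorizer and the classifier in a pipeline, so that the vocabulary is relearned inside each fold. The sketch below shows this on a toy corpus; the pipeline, the toy data and `max_iter=1000` are assumptions of this sketch, not part of the original solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy corpus (an assumption of this sketch): 10 positive and 10 negative reviews.
texts = ["good great fun movie"] * 10 + ["bad awful boring movie"] * 10
labels = [1] * 10 + [0] * 10

# make_pipeline names each step after its lowercased class name,
# so the classifier's C is addressed as "logisticregression__C".
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5)  # vectorizer is refit inside every fold
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.best_score_)
```

With the pipeline, `grid.predict` accepts raw text directly, so the test reviews never need to be vectorized by hand.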

In [56]:
predictions = grid.predict(X_test)
predictions

array([1, 0, 1, ..., 0, 1, 1])

In [57]:
print('Precision score:',precision_score(y_test,predictions))
print('Recall score:',recall_score(y_test,predictions))
print('F1 score:',f1_score(y_test,predictions))
print(confusion_matrix(y_test,predictions))

Precision score: 0.8090062111801242
Recall score: 0.8336
F1 score: 0.8211189913317573
[[1004  246]
 [ 208 1042]]
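
The three scores printed above follow directly from the confusion matrix. scikit-learn puts true labels on the rows and predicted labels on the columns, so with labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. A quick check of the arithmetic (accuracy is added here for completeness; it is not printed in the cell above):

```python
# Confusion matrix printed above; rows are true labels, columns are predictions.
tn, fp = 1004, 246
fn, tp = 208, 1042

precision = tp / (tp + fp)   # of all reviews predicted positive, how many are positive
recall = tp / (tp + fn)      # of all positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)
```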


### Q4: Improve your Classifier (20 pts)

Let's improve the current classifier. This question is quite open: the grading will depend on how much improvement you achieve.

Leaving this question blank, or submitting a result that is no better, will earn zero points.

**Hints**: Context information? Numbers? Stop words? Form of words?

In [58]:
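# write your code here
# change ngram range from (2,3) to (1,2)
# change min_df from 5 to 2 to keep more features (a larger vocabulary)
vect = CountVectorizer(ngram_range=(1,2),min_df=2)

# NOTE: as in Q2, the second fit() call replaces the vocabulary learned from
# the training text with one learned from the test text.
X_tn = vect.fit(X_train_text)
X_tt = vect.fit(X_test_text)

X_train = X_tn.transform(X_train_text)
X_test = X_tt.transform(X_test_text)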

In [59]:
X_train.shape

(10000, 50588)

In [60]:
LR = LogisticRegression(C=0.001)
LR.fit(X_train,y_train)

score = cross_val_score(LR, X_train,y_train,cv=5)
score

array([0.84 , 0.835, 0.827, 0.825, 0.818])

In [61]:
param_grid = {'C':[0.001,0.01,0.1,1,10]}
grid = GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid.fit(X_train,y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [0.001, 0.01, 0.1, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [62]:
grid.best_params_

{'C': 0.1}

In [63]:
grid.best_score_

0.8732

In [64]:
predictions = grid.predict(X_test)
predictions

array([1, 0, 0, ..., 0, 1, 0])

In [65]:
print('Precision score:',precision_score(y_test,predictions))
print('Recall score:',recall_score(y_test,predictions))
print('F1 score:',f1_score(y_test,predictions))
print(confusion_matrix(y_test,predictions))

Precision score: 0.8752978554408261
Recall score: 0.8816
F1 score: 0.8784376245516142
[[1093  157]
 [ 148 1102]]


## Problem 2: Topic Modeling and Text Clustering with LDA and KMeans

In Problem 1 we saw a classical example of supervised learning (classification). In this problem we look at two unsupervised learning techniques, LDA and K-Means, which play key roles in finding similarities among documents.

### Q1: Latent Dirichlet Allocation (20 pts)

In some read-it-later or bookmarking apps (like Evernote Web Clipper, Pocket, Readability, etc.), one feature is that when you clip a webpage, the app automatically assigns a "smart category" or "smart keywords" to the page. How does it work?

One possible implementation is through Topic Modeling. Let's apply LDA to our movie review dataset to see whether we can find out some "topics" hidden behind the reviews.

Please do the following:

1. Configure the `CountVectorizer` so that it removes "common words" that appear in at least `15 percent` of the reviews, and limits the number of features to `10000`.
2. Vectorize the training data with the configured `CountVectorizer` .
3. Use `LatentDirichletAllocation` from scikit-learn to mine the topics. (`10` topics, `batch` as learning method, `30` as max iterations)
4. Print out LDA results. Make some comments about the result (Can you name a certain topic?).
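
Step 1 maps onto the `max_df` and `max_features` parameters of `CountVectorizer`. One detail worth knowing: a float `max_df` drops terms whose document frequency is strictly higher than the threshold, so a term that appears in exactly that share of the documents is kept. A small sketch on a made-up corpus, using `max_df=0.5` so the effect is visible with four documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["movie good", "movie bad", "movie plot twist", "movie plot"]

# max_df=0.5 drops terms found in MORE than 50% of the documents (here: more than 2 of 4).
vect = CountVectorizer(max_df=0.5, max_features=10000)
vect.fit(docs)

# "movie" (4 of 4 docs) is removed, "plot" (exactly 2 of 4) is kept.
print(sorted(vect.vocabulary_))
```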

**Note**: We provide a printing function named `print_LDA_results()`, where the parameter `lda_model` is your fitted LDA model object and `feature_names` is the list of feature names (vocabulary terms) from your vectorizer. The number of top words displayed per topic defaults to 15.

In [66]:
def print_LDA_results(lda_model, feature_names, n_top_words=15):
    for topic_idx, topic in enumerate(lda_model.components_):
        message = "Topic %d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
        print()
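
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the function above picks the indices of the `n_top_words` largest weights, largest first. A short illustration with a hypothetical four-word vocabulary and made-up weights:

```python
import numpy as np

feature_names = ["plot", "actor", "music", "horror"]   # hypothetical vocabulary
topic = np.array([0.1, 0.5, 0.3, 0.9])                 # hypothetical weights for one topic
n_top_words = 2

order = topic.argsort()                    # indices from smallest to largest weight
top = order[:-n_top_words - 1:-1]          # walk backwards from the end, n_top_words steps
top_names = [feature_names[i] for i in top]
print(top_names)
```

Here `order` is `[0, 2, 1, 3]`, so walking backwards two steps gives indices 3 and 1, that is, "horror" and then "actor".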

In [67]:
# write your code here
vect = CountVectorizer(max_df = 0.15, max_features=10000)

train_vect = vect.fit(X_train_text)
train_vect_t = train_vect.transform(X_train_text)

lda = LatentDirichletAllocation(n_components=10, learning_method='batch', max_iter=30)
lda.fit(train_vect_t)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=30, mean_change_tol=0.001,
             n_components=10, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [68]:
features = vect.get_feature_names()
print_LDA_results(lda,features)

Topic 0: worst 10 want horror nothing going actually guy don got pretty funny minutes give thought

Topic 1: action fight western james stewart batman john hero bruce martial lee fighting arts kong police

Topic 2: music show years funny comedy saw cast lot big always actors fun wonderful songs me

Topic 3: we us world war how its real family these things may human fact point other

Topic 4: black killer white south school serial high washington tarzan african woods smith jane africa teacher

Topic 5: script actors book director nothing poor enough read minutes work any look actually cast seems

Topic 6: performance role director cast actor play played performances wife actors work young screen most plays

Topic 7: show series episode tv funny original kids new episodes shows old years then children we

Topic 8: woman young wife girl gets old father husband new him mother house family goes home

Topic 9: horror its effects budget quite work director most dvd however rather more bit set
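
To turn topics like these into a "smart category" for each review, one option is to look at the per-document topic distribution and take the topic with the largest share. The sketch below does this on a toy corpus; the documents, the choice of 2 topics and `random_state=0` are assumptions made for illustration, and a corpus this small will not give meaningful topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption of this sketch), far too small for meaningful topics.
docs = [
    "war soldier battle army",
    "army battle war general",
    "song music dance singer",
    "singer music song band",
]

vect = CountVectorizer()
X = vect.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                max_iter=30, random_state=0)
doc_topics = lda.fit_transform(X)     # one row per document, one column per topic

print(doc_topics.round(2))            # each row sums to 1
print(doc_topics.argmax(axis=1))      # index of the dominant topic per document
```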

### Q2: K-Means (20 pts)

You may have heard of the K-Means algorithm in a Machine Learning course. It is a popular method for unsupervised learning tasks, partly because of its potential advantage on large-scale data (it is easy to parallelize). NLTK and scikit-learn both provide K-Means implementations, but NLTK is more flexible in the choice of distance metric. Here we will look at how to use the NLTK version of K-Means.

Please do the following:

1. Create a `KMeansClusterer` object with `10` as the number of target topics, `cosine_distance` from NLTK as the distance metric and `avoid_empty_clusters` set to True.
2. Use `CountVectorizer` to create a **one-hot encoding** vector representation of `X_train_text`; the other parameters should be the same as for LDA.
3. Apply K-Means to the vectorized text data (with the parameter `assign_clusters` set to True). Print out the topic assignments of the first ten reviews.

The printed output should look like this:

```
Review: 'b'I registered just to make this comment (which pretty much echos some of the ones here already) The acting is worse than subpar, it expounds on commonly held stereotypes, has some of the worst displays of tasteless female objectification (all bod no brain), and has some of the cheesiest lines known to man.  including but not limited to "allright lets see what these guys can do" I should also mention that when they show the crashes involving innocent civilians, you end up feeling bad for the innocent people and start to hate the characters themselves. Eddie Griffin\'s character is also one of the most stereotypical black guy personas that just rubs people the wrong way. He may or may not be a good actor but this movie doesn\'t allow for that kind of character exploration. You want a movie that leaves the audience on the side of the bad guys? Oceans 11. This movie just makes you hate the bad guys instead of capturing the audience.  Even the cars can\'t make up for this fluke of a movie. That Enzo that Griffin wrecked sums up this movie perfectly. It just sucks.'' assigned to cluster 5.
```
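
Two ingredients of the steps above can be illustrated in isolation: the one-hot style encoding (`binary=True` in `CountVectorizer`) and the cosine distance. The helper `cosine_dist` below is a NumPy re-implementation written for this sketch; it computes one minus the cosine similarity, which is the same quantity NLTK's `cosine_distance` returns. The three documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good movie", "good movie", "boring plot"]

# binary=True records presence (1) or absence (0) instead of raw counts.
vect = CountVectorizer(binary=True)
X = vect.fit_transform(docs).toarray()
print(X)

def cosine_dist(u, v):
    """1 - cosine similarity between two vectors."""
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_same = cosine_dist(X[0], X[1])   # identical word sets -> distance close to 0
d_diff = cosine_dist(X[0], X[2])   # no shared words -> distance 1
print(d_same, d_diff)
```

With binary vectors, repeating a word inside a review does not change its representation, so the first two documents are treated as identical.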

In [69]:
# write your code here
from nltk.cluster import cosine_distance

# 10 clusters, with cosine distance as the metric
kmeans = KMeansClusterer(num_means = 10, distance = cosine_distance, avoid_empty_clusters = True)

# binary=True gives a presence/absence (one-hot style) encoding instead of counts
vectorizer = CountVectorizer(max_df = 0.15, max_features=10000, binary = True)

train_vectorized = vectorizer.fit(X_train_text)
train_vect1 = train_vectorized.transform(X_train_text)

# convert the sparse matrix to a dense NumPy array before clustering
cluster = kmeans.cluster(train_vect1.toarray(), assign_clusters = True)

In [70]:
for i in range(0,10):
    print("Review:'{0}' is assigned to cluster {1}.\n".format(X_train_text.iloc[i],cluster[i]))

Review:'I registered make comment pretty much echos ones already The acting worse subpar expounds commonly held stereotypes worst displays tasteless female objectification bod brain has some cheesiest lines known man br br including limited allright lets see these guys do I also mention when show crashes involving innocent civilians end feeling bad innocent people start to hate characters Eddie Griffin character also one stereotypical black guy personas rubs people wrong way He may may not be good actor movie t allow kind character exploration You want movie leaves audience the side the bad guys Oceans 11 This movie makes hate the bad guys instead of capturing the audience br br Even the cars t make for this fluke of a movie That Enzo that Griffin wrecked sums this movie perfectly It just sucks' is assigned to cluster 1.

Review:'You using IMDb br br You given hefty votes some your favourite films br br It something enjoy br br And s because Fifty seconds One world ends another begins 