Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then use those features to build models that attempt to classify your texts by author. Try different combinations of unsupervised and supervised techniques to see which perform best.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.
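
One way to make "stable" concrete is to cluster the holdout group twice, once with the clusters learned on the main group and once from scratch, and then measure how well the two partitions agree. The sketch below does this with the adjusted Rand index on synthetic two-blob data; the data, the variable names, and the choice of `adjusted_rand_score` are illustrative assumptions, not part of the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the text features (an assumption, not the real data).
X_all, _ = make_blobs(n_samples=600, centers=[[-5, -5], [5, 5]],
                      cluster_std=1.0, random_state=0)
X_main, X_holdout = train_test_split(X_all, test_size=0.25, random_state=0)

# Clusters learned on the main group, then applied to the holdout group.
km_main = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_main)
labels_from_main = km_main.predict(X_holdout)

# Clusters learned from scratch on the holdout group alone.
km_holdout = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_holdout)
labels_from_holdout = km_holdout.labels_

# The adjusted Rand index ignores how clusters are numbered:
# 1.0 means identical partitions, values near 0 mean chance-level agreement.
stability = adjusted_rand_score(labels_from_main, labels_from_holdout)
print("Holdout cluster stability (ARI):", round(stability, 3))
```

A low score here, paired with a supervised model whose holdout accuracy barely moves, would be exactly the kind of divergence worth explaining.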

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [1]:
import numpy as np
import pandas as pd
from os import listdir
%matplotlib inline

import spacy
nlp = spacy.load('en')

import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
from os.path import isfile, join

from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. 

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning. 
In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.


In [339]:
# Save the lists of file names to loop through
neg_train_file_names = [f for f in listdir('aclImdb\\train\\neg') if isfile(join('aclImdb\\train\\neg', f))]
pos_train_file_names = [f for f in listdir('aclImdb\\train\\pos') if isfile(join('aclImdb\\train\\pos', f))]
neg_test_file_names = [f for f in listdir('aclImdb\\test\\neg') if isfile(join('aclImdb\\test\\neg', f))]
pos_test_file_names = [f for f in listdir('aclImdb\\test\\pos') if isfile(join('aclImdb\\test\\pos', f))]

In [341]:
review_df1 = []

for file in neg_train_file_names:
    # The context manager closes each file once it has been read.
    with open("aclImdb\\train\\neg\\{}".format(file), encoding="utf8") as f:
        review_df1.append([f.read(), 0])

In [340]:
review_df2 = []

for file in pos_train_file_names:
    # The context manager closes each file once it has been read.
    with open("aclImdb\\train\\pos\\{}".format(file), encoding="utf8") as f:
        review_df2.append([f.read(), 1])

In [342]:
review_df3 = []

for file in neg_test_file_names:
    # The context manager closes each file once it has been read.
    with open("aclImdb\\test\\neg\\{}".format(file), encoding="utf8") as f:
        review_df3.append([f.read(), 0])

In [343]:
review_df4 = []

for file in pos_test_file_names:
    # The context manager closes each file once it has been read.
    with open("aclImdb\\test\\pos\\{}".format(file), encoding="utf8") as f:
        review_df4.append([f.read(), 1])

In [344]:
review_df1 = pd.DataFrame(review_df1)
review_df2 = pd.DataFrame(review_df2)
review_df3 = pd.DataFrame(review_df3)
review_df4 = pd.DataFrame(review_df4)

In [345]:
review_df = pd.concat([review_df1, review_df2, review_df3, review_df4]).sample(frac=1).reset_index(drop=True)
review_df.columns = ['Review', 'Rating']
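
The four loading loops above differ only in the folder and the label, so they can be folded into one helper. The sketch below is one possible version using `pathlib`, which also avoids hard-coding Windows path separators; the function name `load_reviews` and its parameters are hypothetical, and it assumes the `aclImdb/<split>/<neg|pos>/*.txt` layout used above.

```python
from pathlib import Path

import pandas as pd


def load_reviews(root, split, label_dirs=(("neg", 0), ("pos", 1))):
    """Read every .txt review under root/split/<label_dir> into a DataFrame."""
    rows = []
    for dir_name, label in label_dirs:
        folder = Path(root) / split / dir_name
        # sorted() makes the row order reproducible across platforms.
        for path in sorted(folder.glob("*.txt")):
            rows.append((path.read_text(encoding="utf8"), label))
    return pd.DataFrame(rows, columns=["Review", "Rating"])


# Hypothetical usage, mirroring the concat-and-shuffle step above:
# review_df = (pd.concat([load_reviews("aclImdb", "train"),
#                         load_reviews("aclImdb", "test")])
#              .sample(frac=1)
#              .reset_index(drop=True))
```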

In [346]:
print(review_df.shape)
review_df.head(10)
#review_df.to_csv("movie_sentment_review")

(50000, 2)


Unnamed: 0,Review,Rating
0,WWII veterans return home and find it hard to ...,1
1,This is an incredible movie that begins slowly...,1
2,It worked! Director Christian Duguay created a...,1
3,"Spin-offs, for somebody who don't know, are no...",1
4,...about this film was the title song. After 3...,0
5,The film-school intellects can drool all they ...,0
6,In 1976 a mother named Norma Lewis (Cameron Di...,0
7,"While the soundtrack is a bit dated, this stor...",1
8,There is not much to say about this one except...,0
9,Wow...I don't know what to say. I just watched...,1


### Data Frame created above

# text file for making BOW

### Pick 2000 words and make BOW

In [291]:
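from nltk.corpus import stopwords

# Defined here because the filter below needs it before the later cell runs.
stopWords = set(stopwords.words('english'))

allwords = " "
Review = []
# Raw string so that \d reaches the regex engine unchanged.
pattern = r"[-*:!&$%',.?\"/<>()\d]"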
pattern2 = r"\bbr\b"
pattern3 = r'\bA\b'
pattern4 = r'--'
#pattern5 = '.+;$'

for review in review_df['Review']:
    # Clean each review by removing the patterns above
    mov_review = re.sub(pattern, "", review)
    mov_review = re.sub(pattern2, "", mov_review)
    mov_review = re.sub(pattern3, "", mov_review)
    mov_review = re.sub(pattern4, "", mov_review)
    #mov_review = re.sub(pattern5, "", mov_review)
    mov_review = mov_review.split()
    mov_review = [x for x in mov_review
                if not x == ' '
                and not x == 's'
                and not x == 'I'
                and not x == 'movie'
                and not x == '-PRON-'
                and not x == 'film'
                and not x == '\x96'
                and not x == '!'
                and x not in stopWords]

    # Collect all reviews as one string to build the BOW later
    #allwords = allwords + mov_review
    Review.append(mov_review)
    allwords = allwords + ' ' +' '.join(mov_review)
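
The cleaning steps in the loop above can also be packaged as a small function, which makes them easy to try on a single string. This is only a sketch: the names `clean_review`, `PUNCT_AND_DIGITS`, and `BR_TAG_RESIDUE` are made up for illustration, and the short stop-word set stands in for the full NLTK list used in the notebook.

```python
import re

# Same removals as the loop above, compiled once.
PUNCT_AND_DIGITS = re.compile(r"[-*:!&$%',.?\"/<>()\d]")
BR_TAG_RESIDUE = re.compile(r"\bbr\b")  # what is left of "<br />" after stripping


def clean_review(text, drop_words=frozenset({"movie", "film", "I", "s"})):
    """Strip punctuation, digits, and <br /> residue, then drop unwanted words."""
    text = PUNCT_AND_DIGITS.sub("", text)
    text = BR_TAG_RESIDUE.sub("", text)
    return [tok for tok in text.split() if tok not in drop_words]


tokens = clean_review("Great movie! <br /><br /> I watched it 3 times.")
print(tokens)  # ['Great', 'watched', 'it', 'times']
```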

### The text above has been cleaned to reduce the NLP processing work below

In [None]:
#nlp processing
allwords_doc1 = nlp(allwords[0:999999])
allwords_doc2 = nlp(allwords[999999:1999998])


In [295]:
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))
def bag_of_words(text):
    allwords = [token.lemma_
                    for token in text
                    if not token.is_punct
                    and not token.is_stop]
    allwords = [x for x in allwords
                if not x == ' '
                and not x == 's'
                and not x == 'movie'
                and not x == '-PRON-'
                and not x == 'film'
                and x not in stopWords]
   
    return allwords

word_count1 = [item[0] for item  in Counter(bag_of_words(allwords_doc1)).most_common(4000)]
word_count2 = [item[0] for item  in Counter(bag_of_words(allwords_doc2)).most_common(4000)]

word_count = set(word_count1 + word_count2)

# Create a DataFrame with one feature per word in the bag of words, and count how often each word occurs in each review. Those counts are the data points.

In [347]:
def bow_features(review_df, word_count):
    
    # Scaffold the data frame and initialize counts to zero.
    # pandas needs an ordered column list, so convert the set first.
    columns = list(word_count)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = review_df['Review'][0:20000]
    df['text_source'] = review_df['Rating'][0:20000]
    df.loc[:, columns] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        sentence = nlp(sentence)
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in word_count
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 500 == 0:
            print("Processing row {}".format(i))
            
    return df

In [348]:
# This takes too long, so I only included 20,000 reviews.
word_counts = bow_features(review_df, word_count)
word_counts

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
Processing row 8000
Processing row 8500
Processing row 9000
Processing row 9500
Processing row 10000
Processing row 10500
Processing row 11000
Processing row 11500
Processing row 12000
Processing row 12500
Processing row 13000
Processing row 13500
Processing row 14000
Processing row 14500
Processing row 15000
Processing row 15500
Processing row 16000
Processing row 16500
Processing row 17000
Processing row 17500
Processing row 18000
Processing row 18500
Processing row 19000
Processing row 19500


Unnamed: 0,relevant,compare,uplift,campus,shrink,metal,farscape,earlier,ultimate,vulnerable,...,loud,flavor,arc,assassination,commentator,contract,charming,underground,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,WWII veterans return home and find it hard to ...,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,This is an incredible movie that begins slowly...,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,It worked! Director Christian Duguay created a...,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"Spin-offs, for somebody who don't know, are no...",1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,...about this film was the title song. After 3...,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,The film-school intellects can drool all they ...,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,In 1976 a mother named Norma Lewis (Cameron Di...,0
7,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"While the soundtrack is a bit dated, this stor...",1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,There is not much to say about this one except...,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Wow...I don't know what to say. I just watched...,1
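
The row-by-row counting above is slow because every `df.loc[i, word] += 1` is a separate pandas update. A possible faster route, sketched below, is scikit-learn's `CountVectorizer` with a fixed vocabulary. Note the differences from the notebook: it counts raw lowercase tokens rather than spaCy lemmas, and the toy reviews and three-word vocabulary are placeholders for `review_df['Review']` and `word_count`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews standing in for review_df['Review'] (an assumption).
reviews = [
    "A charming story with a charming cast",
    "Loud, long and not charming at all",
]
# Hypothetical fixed vocabulary playing the role of word_count.
vocabulary = ["charming", "loud", "story"]

vectorizer = CountVectorizer(vocabulary=vocabulary, lowercase=True)
counts = vectorizer.fit_transform(reviews)  # sparse matrix, one row per review

# Columns follow the order of the vocabulary list.
print(counts.toarray())
```

The result stays sparse, which matters once the vocabulary grows to several thousand words.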


In [2]:
review_feature_df = pd.read_csv('Review_with_feature.csv', encoding="ISO-8859-1")
review_feature_df.head()

Unnamed: 0.1,Unnamed: 0,relevant,compare,uplift,campus,shrink,metal,farscape,earlier,ultimate,...,loud,flavor,arc,assassination,commentator,contract,charming,underground,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,WWII veterans return home and find it hard to ...,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,This is an incredible movie that begins slowly...,1
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,It worked! Director Christian Duguay created a...,1
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"Spin-offs, for somebody who don't know, are no...",1
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,...about this film was the title song. After 3...,0


In [15]:
review_feature_df2 = review_feature_df[0:1000]

In [23]:
X = review_feature_df.drop(['text_sentence', 'text_source'], axis=1)
y = review_feature_df['text_source']

X_pca = PCA(n_components=5).fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_pca, 
                                                    y,
                                                    test_size=0.4,
                                                    random_state=0)
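
In the cell above, PCA is fitted on all rows before the split, so the test rows influence the projection. A common alternative is to split first and let a pipeline fit PCA on the training rows only. The sketch below shows the idea on synthetic data; `make_classification` and the `_demo` names are stand-ins, not the review features.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the bag-of-words matrix (an assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=50,
                                     n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.4, random_state=0)

# PCA is fitted inside the pipeline, so it only ever sees the training rows.
model = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```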


# BoW with Logistic Regression

In [350]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)

print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

y_pred = lr.predict(X_train)
pd.crosstab(y_train, y_pred)

(12000, 5049) (12000,)
Training set score: 0.98375

Test set score: 0.84275


col_0,0,1
text_source,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5887,112
1,83,5918


In [351]:
y_pred = lr.predict(X_test)
pd.crosstab(y_test, y_pred)

col_0,0,1
text_source,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3383,618
1,640,3359


# Random forest

In [352]:
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier()

train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

y_pred = rfc.predict(X_train)
pd.crosstab(y_train, y_pred)

Training set score: 0.9933333333333333

Test set score: 0.75725


col_0,0,1
text_source,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5991,8
1,72,5929


In [353]:
y_pred = rfc.predict(X_test)
pd.crosstab(y_test, y_pred)

col_0,0,1
text_source,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3300,701
1,1241,2758


# GradientBoosting

In [354]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.8171666666666667

Test set score: 0.78925


# SVM

In [6]:
from sklearn.svm import SVC

svm = SVC(kernel = 'linear')
svm.fit(X_train, y_train)

print('Training set score:', svm.score(X_train, y_train))
y_pred1 = svm.predict(X_train)
print(pd.crosstab(y_train, y_pred1))

#print('\nTest set score:', svm.score(X_test, y_test))
#y_pred2 = svm.predict(X_test)
#pd.crosstab(y_test, y_pred2)

Training set score: 0.66725
col_0           0     1
text_source            
0            3861  2138
1            1855  4146


In [10]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_train, y_pred1))

             precision    recall  f1-score   support

          0       0.68      0.64      0.66      5999
          1       0.66      0.69      0.67      6001

avg / total       0.67      0.67      0.67     12000



In [27]:
from sklearn.svm import SVC

svm = SVC(C=10, gamma=1, kernel='linear')
svm.fit(X_train, y_train)

print('Training set score:', svm.score(X_train, y_train))
y_pred1 = svm.predict(X_train)
print(pd.crosstab(y_train, y_pred1))

#print('\nTest set score:', svm.score(X_test, y_test))
#y_pred2 = svm.predict(X_test)
#pd.crosstab(y_test, y_pred2)

Training set score: 0.6660833333333334
col_0           0     1
text_source            
0            3798  2201
1            1806  4195


In [32]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_train, y_pred1))
print('\n Testing set score:')
y_pred2 = svm.predict(X_test)
print(classification_report(y_test, y_pred2))

             precision    recall  f1-score   support

          0       0.68      0.63      0.65      5999
          1       0.66      0.70      0.68      6001

avg / total       0.67      0.67      0.67     12000


 Testing set score:
             precision    recall  f1-score   support

          0       0.67      0.63      0.65      4001
          1       0.65      0.69      0.67      3999

avg / total       0.66      0.66      0.66      8000



Use GridSearchCV from sklearn to search a given set of values for C, gamma, and kernel, looking for the combination that gives the best accuracy.

In [17]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C':[1,10,100,1000],'gamma':[1,0.1,0.001,0.0001], 'kernel':['linear']}

In [18]:
from sklearn.svm import SVC
grid = GridSearchCV(SVC(),param_grid,refit = True, verbose=2)

In [19]:
grid.fit(X_train,y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] C=1, gamma=1, kernel=linear .....................................
[CV] ............................ C=1, gamma=1, kernel=linear -   7.2s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.2s remaining:    0.0s


[CV] C=1, gamma=1, kernel=linear .....................................
[CV] ............................ C=1, gamma=1, kernel=linear -   3.6s
[CV] C=1, gamma=1, kernel=linear .....................................
[CV] ............................ C=1, gamma=1, kernel=linear -   4.0s
[CV] C=1, gamma=0.1, kernel=linear ...................................
[CV] .......................... C=1, gamma=0.1, kernel=linear -   7.0s
[CV] C=1, gamma=0.1, kernel=linear ...................................
[CV] .......................... C=1, gamma=0.1, kernel=linear -   3.7s
[CV] C=1, gamma=0.1, kernel=linear ...................................
[CV] .......................... C=1, gamma=0.1, kernel=linear -   4.0s
[CV] C=1, gamma=0.001, kernel=linear .................................
[CV] ........................ C=1, gamma=0.001, kernel=linear -   6.9s
[CV] C=1, gamma=0.001, kernel=linear .................................
[CV] ........................ C=1, gamma=0.001, kernel=linear -   3.7s
[CV] C

[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 81.0min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [1, 10, 100, 1000], 'gamma': [1, 0.1, 0.001, 0.0001], 'kernel': ['linear']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)

In [20]:
grid.best_params_

{'C': 10, 'gamma': 1, 'kernel': 'linear'}
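
In scikit-learn's `SVC`, `gamma` is only used by the `rbf`, `poly`, and `sigmoid` kernels, so with `kernel='linear'` the four gamma values in the grid above produce the same model four times. A grid given as a list of dictionaries avoids those redundant fits. The sketch below runs such a grid on small synthetic data; the data and the specific values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic problem so the search finishes quickly (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# gamma is only searched for the RBF kernel, where it has an effect.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 0.01]},
]
demo_grid = GridSearchCV(SVC(), param_grid, cv=3, refit=True)
demo_grid.fit(X_demo, y_demo)

n_candidates = len(demo_grid.cv_results_["params"])
print("Best parameters:", demo_grid.best_params_)
print("Candidates evaluated:", n_candidates)  # 2 linear + 4 rbf = 6
```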

In [24]:
from sklearn.metrics import classification_report, confusion_matrix
y_pred1 = grid.predict(X_train)
print(classification_report(y_train, y_pred1))

             precision    recall  f1-score   support

          0       0.51      0.52      0.51      5999
          1       0.51      0.50      0.51      6001

avg / total       0.51      0.51      0.51     12000



In [22]:
y_pred1 = grid.predict(X_test)
print(classification_report(y_test, y_pred1))
print(pd.crosstab(y_test, y_pred1))

             precision    recall  f1-score   support

          0       0.65      0.65      0.65       204
          1       0.64      0.64      0.64       196

avg / total       0.65      0.65      0.65       400

col_0          0    1
text_source          
0            132   72
1             70  126


# K-Means

In [392]:
from sklearn.cluster import KMeans

# Normalize the data.
X_train_norm = normalize(X_train)
X_test_norm = normalize(X_test)

# Keep enough components to explain 95% of the variance. Fit PCA on the
# training data only, then apply the same projection to the test data.
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_norm)
X_test_pca = pca.transform(X_test_norm)

KMean1 = KMeans(n_clusters=2, random_state=42)
KMean2 = KMeans(n_clusters=2, random_state=42)

KMean1.fit(X_train_pca)
KMean2.fit(X_train)

y_pred1 = KMean1.predict(X_train_pca)
y_pred2 = KMean2.predict(X_train)

# Check the solution against the data.
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred1, y_train))
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred2, y_train))

Comparing k-means clusters against the data:
text_source     0     1
row_0                  
0            3846  4301
1            2153  1700
Comparing k-means clusters against the data:
text_source     0     1
row_0                  
0            1052  1005
1            4947  4996


In [396]:
#y_pred1 = KMean1.predict(X_test_pca)
y_pred2 = KMean2.predict(X_test)

print('Comparing k-means clusters against the data:')
#print(pd.crosstab(y_pred1, y_test))
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred2, y_test))

Comparing k-means clusters against the data:
Comparing k-means clusters against the data:
text_source     0     1
row_0                  
0             656   699
1            3345  3300


# MiniBatchKMeans

In [None]:
minibatchkmeans
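
The cell above is only a placeholder. A minimal sketch of what a `MiniBatchKMeans` run could look like is shown below, using synthetic two-blob data in place of `X_train`; the batch size and other settings are arbitrary illustrative choices.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the review features (an assumption).
X_demo, _ = make_blobs(n_samples=1000, centers=[[-5, -5], [5, 5]],
                       random_state=0)

# Each update step uses a random batch of 200 rows instead of the full data,
# which trades a little accuracy for speed on large feature matrices.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=200, n_init=3, random_state=42)
cluster_ids = mbk.fit_predict(X_demo)

print("Cluster ids:", sorted(set(cluster_ids.tolist())))
print("Centers shape:", mbk.cluster_centers_.shape)
```

On the real data, the resulting `cluster_ids` can be compared with `y_train` through `pd.crosstab`, as in the K-Means section.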

In [6]:
from sklearn import ensemble

clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.8171666666666667

Test set score: 0.78925
