Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then, using those features, build models to classify your texts by author. Try different combinations of unsupervised and supervised techniques to see which perform best.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.
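One way to put a number on cluster stability is to fit the clusters on the training texts, assign the holdout texts to the same centroids, and compare both assignments against the known labels. The sketch below uses a handful of made-up reviews; `train_texts`, `holdout_texts`, and their labels are hypothetical stand-ins, and the adjusted Rand index is just one reasonable agreement measure, not the only option.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy stand-ins for the real reviews (hypothetical data, for illustration only).
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible plot awful acting", "awful boring terrible film"]
train_labels = [1, 1, 0, 0]
holdout_texts = ["loved the wonderful story", "great acting loved it",
                 "boring and awful plot", "terrible boring film"]
holdout_labels = [1, 1, 0, 0]

# Fit the vectorizer and the clusters on the training texts only.
vectorizer = TfidfVectorizer()
X_train_demo = vectorizer.fit_transform(train_texts)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train_demo)

# Reuse the fitted vocabulary and centroids on the holdout texts.
X_holdout_demo = vectorizer.transform(holdout_texts)

# Adjusted Rand index: 1.0 means perfect agreement with the labels, values near 0 mean chance level.
ari_train = adjusted_rand_score(train_labels, km.labels_)
ari_holdout = adjusted_rand_score(holdout_labels, km.predict(X_holdout_demo))
print(ari_train, ari_holdout)
```

A large gap between the two scores would suggest the clusters do not carry over to the holdout group.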

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [1]:
import numpy as np
import pandas as pd
from os import listdir
%matplotlib inline

import spacy
nlp = spacy.load('en')

import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
from os.path import isfile, join

from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

### This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. 
### The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning. 
### In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.


In [2]:
# Save the lists of file names to loop through
neg_train_file_names = [f for f in listdir('aclImdb\\train\\neg') if isfile(join('aclImdb\\train\\neg', f))]
pos_train_file_names = [f for f in listdir('aclImdb\\train\\pos') if isfile(join('aclImdb\\train\\pos', f))]
neg_test_file_names = [f for f in listdir('aclImdb\\test\\neg') if isfile(join('aclImdb\\test\\neg', f))]
pos_test_file_names = [f for f in listdir('aclImdb\\test\\pos') if isfile(join('aclImdb\\test\\pos', f))]

In [3]:
review_df = []

for file in neg_train_file_names:
    # Use a context manager so each file handle is closed after reading.
    with open("aclImdb\\train\\neg\\{}".format(file), encoding="utf8") as file1_open:
        file1_content = file1_open.read()
    review_df.append([file1_content, 0])

In [4]:
review_df2 = []

for file in pos_train_file_names:
    # Use a context manager so each file handle is closed after reading.
    with open("aclImdb\\train\\pos\\{}".format(file), encoding="utf8") as file1_open:
        file1_content = file1_open.read()
    review_df2.append([file1_content, 1])

In [5]:
review_df3 = []

for file in neg_test_file_names:
    # Use a context manager so each file handle is closed after reading.
    with open("aclImdb\\test\\neg\\{}".format(file), encoding="utf8") as file1_open:
        file1_content = file1_open.read()
    review_df3.append([file1_content, 0])

In [6]:
review_df4 = []

for file in pos_test_file_names:
    # Use a context manager so each file handle is closed after reading.
    with open("aclImdb\\test\\pos\\{}".format(file), encoding="utf8") as file1_open:
        file1_content = file1_open.read()
    review_df4.append([file1_content, 1])
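The four loading loops above repeat the same steps with a different folder and label. A minimal sketch of one way to fold them into a single helper follows; `load_reviews` is a hypothetical name, and `pathlib` is swapped in here so the paths are not tied to Windows separators. The demo runs against a throwaway directory rather than the real `aclImdb` folder.

```python
from pathlib import Path
import tempfile

def load_reviews(folder, label):
    """Return [text, label] rows for every file directly inside `folder`."""
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # The context manager closes each file handle after reading.
            with open(path, encoding="utf8") as handle:
                rows.append([handle.read(), label])
    return rows

# Demo on a temporary directory that mimics the aclImdb layout (hypothetical files).
with tempfile.TemporaryDirectory() as root:
    neg_dir = Path(root) / "train" / "neg"
    neg_dir.mkdir(parents=True)
    (neg_dir / "0_1.txt").write_text("dull and slow", encoding="utf8")
    (neg_dir / "1_2.txt").write_text("not worth watching", encoding="utf8")
    demo_rows = load_reviews(neg_dir, 0)

print(demo_rows)
```

With the real data, the same helper could be called once per folder, for example `load_reviews(Path("aclImdb") / "train" / "neg", 0)`.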

In [7]:
review_df = pd.DataFrame(review_df)
review_df2 = pd.DataFrame(review_df2)
review_df3 = pd.DataFrame(review_df3)
review_df4 = pd.DataFrame(review_df4)

In [8]:
review_df = pd.concat([review_df, review_df2, review_df3, review_df4]).sample(frac=1).reset_index(drop=True)
review_df.columns = ['Review', 'Rating']

In [9]:
print(review_df.shape)
review_df.head(10)
#review_df.to_csv("movie_sentment_review")

(50000, 2)


Unnamed: 0,Review,Rating
0,>>> Great News there is a BBC DVD release sche...,1
1,"I thought this movie would be dumb, but I real...",1
2,I used to love Sabrina The Teenage Witch and h...,1
3,Saw this again recently on Comedy Central. I'd...,1
4,"Its not Braveheart( thankfully),but it is fine...",1
5,"I have seen this film on countless occasions, ...",1
6,Good films cannot solely be based on a beautif...,0
7,I don't know why critics cal it bizarre and ma...,1
8,There is nothing not to like about Moonstruck....,1
9,This is a pretty good thriller at a nuclear po...,1


In [19]:
len(review_df)

50000

In [20]:
X_Review = review_df['Review'][0:10000]
y_Review = review_df['Rating'][0:10000]

In [21]:
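# Raw strings avoid invalid-escape warnings for sequences such as \d.
pattern = r"[',.?$\"/()\d]"  # punctuation and digits to delete
pattern4 = '[<>]'            # angle brackets from HTML tags, replaced by a space
pattern2 = r"\bbr\b"         # the "br" left over from <br /> tags
pattern3 = r'\bA\b'          # the standalone capital "A"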

review_paras=[]
for paragraph in X_Review:
    para=paragraph
    
    para = re.sub(pattern4, " ", para)
    para = re.sub(pattern, "", para)
    para = re.sub(pattern2, "", para)
    para = re.sub(pattern3, "", para)
    #review_paras = ' '.join(mov_review.split())

    #Forming each paragraph into a string and adding it to the list of strings.
    review_paras.append(para)
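To see what the four substitutions actually do, here is a small sketch that applies them to one made-up review string. The sample text and the `clean_review` helper are hypothetical, and the final whitespace collapse is an extra step that the cell above leaves commented out.

```python
import re

# Same four patterns as the cell above, written as raw strings.
punct_digits = r"[',.?$\"/()\d]"
angle_brackets = r"[<>]"
br_tag = r"\bbr\b"
capital_a = r"\bA\b"

def clean_review(text):
    text = re.sub(angle_brackets, " ", text)  # "<br />" becomes " br / "
    text = re.sub(punct_digits, "", text)     # drop listed punctuation and digits
    text = re.sub(br_tag, "", text)           # drop the leftover "br"
    text = re.sub(capital_a, "", text)        # drop the standalone capital "A"
    # Extra step, not applied in the original cell: collapse runs of whitespace.
    return " ".join(text.split())

sample = "A great film!<br /><br />It cost $20, didn't it?"
cleaned = clean_review(sample)
print(cleaned)  # great film! It cost didnt it
```

Note that characters outside the listed set, such as `!`, survive the cleaning, and contractions like "didn't" lose their apostrophe.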

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?


# K-Means

In [23]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the reviews
                             min_df=2, # only use words that appear in at least two reviews
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case so capitalized and lowercase forms count as the same word
                             use_idf=True, # weight terms by inverse document frequency
                             norm='l2', # scale each row to unit length so longer and shorter reviews get treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )


# Apply the vectorizer to the cleaned reviews
X_review_paras_tfidf=vectorizer.fit_transform(review_paras)

print("Number of features: %d" % X_review_paras_tfidf.get_shape()[1])

KMean = KMeans(n_clusters=2, random_state=42)
KMean.fit(X_review_paras_tfidf)

y_pred = KMean.predict(X_review_paras_tfidf)

# Check the solution against the data.
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred, y_Review))

Number of features: 30866
Comparing k-means clusters against the data:
Rating     0     1
row_0             
0       2548  4010
1       2420  1022


# MiniBatchKMeans


In [328]:
from sklearn.cluster import MiniBatchKMeans

# X_review_norm is not defined in the cells shown here; it is presumably the
# normalized feature matrix from an earlier run on the full set of reviews.

# Each batch will be made up of 200 data points.
minibatchkmeans = MiniBatchKMeans(
    init='random',
    n_clusters=2,
    batch_size=200)
minibatchkmeans.fit(X_review_norm)

# Get the predicted cluster memberships.
predict_mini = minibatchkmeans.predict(X_review_norm)

# Check the MiniBatch clusters against the sentiment labels.
print('Comparing k-means and mini batch k-means solutions:')
print(pd.crosstab(predict_mini, y_Review))

Comparing k-means and mini batch k-means solutions:
Rating      0      1
row_0               
0       14828   6773
1       10172  18227


# Mean-shift
It's not feasible with input data of this size.
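If mean-shift is still of interest, one possible workaround is to run it on a small sample after reducing the TF-IDF matrix to a few dense dimensions. This is a sketch under that assumption, not something the notebook does: `TruncatedSVD` is swapped in as the reduction step, and the texts are made up.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Small made-up sample; with real data this would be a random subset of reviews.
sample_texts = ["great film loved it", "wonderful acting great story",
                "loved the wonderful cast", "great story and a great cast",
                "a lovely and moving film", "terrible plot awful acting",
                "awful boring terrible film", "boring plot awful cast",
                "terrible boring story", "a dull and awful mess"]

X_sparse = TfidfVectorizer().fit_transform(sample_texts)

# Reduce the sparse TF-IDF matrix to two dense components.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_sparse)

# Estimate a kernel bandwidth from the data, then cluster.
bandwidth = estimate_bandwidth(X_reduced, quantile=0.5)
ms = MeanShift(bandwidth=bandwidth).fit(X_reduced)

n_found = len(ms.cluster_centers_)
print('Number of estimated clusters: {}'.format(n_found))
```

Two components are almost certainly too few for real reviews; the point is only that the dense, low-dimensional sample keeps the computation manageable.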

# SpectralClustering

In [25]:
from sklearn.cluster import SpectralClustering

#Divide into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_review_paras_tfidf,
    y_Review,
    test_size=0.25,
    random_state=42)


# We know we're looking for two clusters (positive and negative reviews).
n_clusters=2

# Declare the model, fit it, and get the predicted clusters in one step.
sc = SpectralClustering(n_clusters=n_clusters)
predict=sc.fit_predict(X_review_paras_tfidf)

#Graph results.
#plt.scatter(X_train[:, 0], X_train[:, 1], c=predict)
#plt.show()

print('Comparing the assigned categories to the ones in the data:')
print(pd.crosstab(y_Review,predict))


Comparing the assigned categories to the ones in the data:


ValueError: array length 10000 does not match index length 7500

In [26]:
print('Comparing the assigned categories to the ones in the data:')
print(pd.crosstab(y_Review,predict))

Comparing the assigned categories to the ones in the data:
col_0      0     1
Rating            
0       2740  2228
1       4230   802
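The k-means and spectral crosstabs above are easier to compare once each is reduced to a single number. A minimal sketch using cluster purity (the share of reviews that belong to the majority rating of their cluster) follows; `cluster_purity` is a hypothetical helper, and the counts are copied from the two printed tables.

```python
import numpy as np

def cluster_purity(table):
    """Share of points that fall in the majority class of their cluster.

    `table` has one row per cluster and one column per true class.
    """
    table = np.asarray(table)
    return table.max(axis=1).sum() / table.sum()

# Counts copied from the crosstabs above: rows = clusters, columns = Rating 0 / 1.
kmeans_table = [[2548, 4010], [2420, 1022]]
# The spectral crosstab prints Rating on the rows, so it is transposed here.
spectral_table = [[2740, 4230], [2228, 802]]

kmeans_purity = cluster_purity(kmeans_table)
spectral_purity = cluster_purity(spectral_table)
print(kmeans_purity, spectral_purity)  # 0.643 0.6458
```

Both methods land at roughly 64% purity on these 10,000 reviews, so neither separates positive from negative reviews much better than the other.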


# Affinity Propagation

In [30]:
from sklearn.cluster import AffinityPropagation
from sklearn import metrics

# Declare the model and fit it in one statement.
# Note that you can provide arguments to the model, but we didn't.
af = AffinityPropagation().fit(X_train)
print('Done')

# Pull the number of clusters and cluster assignments for each data point.
cluster_centers_indices = af.cluster_centers_indices_
n_clusters_ = len(cluster_centers_indices)
labels = af.labels_

print('Estimated number of clusters: {}'.format(n_clusters_))

Done
Estimated number of clusters: 1027
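Affinity propagation chooses the number of clusters itself, and with default arguments it found 1027 here. The `preference` argument is the main lever on that count: lower values generally yield fewer exemplars. The sketch below illustrates this on synthetic blobs rather than the review data; the value `-50` is borrowed from that toy setting and would need tuning for TF-IDF features.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three true groups (not the review data).
X_blobs, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                        cluster_std=0.5, random_state=0)

# Default preference (the median similarity) versus a much lower one.
default_model = AffinityPropagation(random_state=0).fit(X_blobs)
low_pref_model = AffinityPropagation(preference=-50, random_state=0).fit(X_blobs)

n_default = len(default_model.cluster_centers_indices_)
n_low_pref = len(low_pref_model.cluster_centers_indices_)
print('Default preference: {} clusters'.format(n_default))
print('preference=-50: {} clusters'.format(n_low_pref))
```

For the reviews, a similar sweep over a few `preference` values would show whether the cluster count can be brought down to something interpretable.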
