# Modeling with Clustering

### KMeans Clustering

In [2]:
from sklearn.cluster import KMeans
from gensim.models import word2vec, fasttext
import time
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

Loading the saved model and the cleaned review data from the previous notebook.

In [3]:
model = word2vec.Word2Vec.load('../data/300features_40minwords_10context')

In [4]:
train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)

In [5]:
test = pd.read_csv('../data/testData.tsv', header=0, delimiter='\t', quoting=3)

In [6]:
with open('../data/cleantrainreviews.txt', 'rb') as fp:
    clean_train_reviews = pickle.load(fp)

In [7]:
with open('../data/cleantestreviews.txt', 'rb') as fp:
    clean_test_reviews = pickle.load(fp)

I record a start time to see how long this takes, because clustering with KMeans can be slow. I set $k$ to one fifth of the vocabulary size, which gives an average of 5 words per cluster, and record an end time once the fit finishes.

In [8]:
start = time.time()

word_vectors = model.wv.vectors

num_clusters = word_vectors.shape[0] // 5

kmeans = KMeans(n_clusters=int(num_clusters))

idx = kmeans.fit_predict(word_vectors)

end = time.time()

elapsed = end - start

print("Time taken for K Means Clustering: ", elapsed, "seconds.")

Time taken for K Means Clustering:  345.2002639770508 seconds.
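
Because KMeans starts from randomly chosen centroids, cluster numbers differ between runs unless the seed is fixed. The following is a minimal sketch of the same steps on synthetic vectors, not the notebook's Word2Vec vectors; the array shape, `n_init=10`, and `random_state=42` are assumptions chosen for illustration.

```python
import time

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for model.wv.vectors: 200 "words", 10 dimensions (assumed sizes)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(200, 10))

# One fifth of the vocabulary size, so about 5 words per cluster on average
k = toy_vectors.shape[0] // 5

start = time.perf_counter()
# A fixed random_state makes the cluster assignments repeatable across runs
idx_a = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_vectors)
elapsed = time.perf_counter() - start

# Refit with the same seed to check that the labels match
idx_b = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_vectors)

print(f"{k} clusters fitted in {elapsed:.2f} seconds")
print("Same labels on both runs:", np.array_equal(idx_a, idx_b))
```

On the real vocabulary the fit takes minutes rather than a fraction of a second, but the calls are the same.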


In [9]:
# it took a little under 6 minutes on my machine
339 / 60

5.65

Creating a Word/Index dictionary, mapping each vocabulary word to a cluster number

In [10]:
# index2word is the gensim 3.x attribute name; gensim 4.0 renamed it to index_to_key
word_centroid_mapping = dict(zip(model.wv.index2word, idx))

Creating a loop to print out the words in the first 10 clusters

In [11]:
for cluster in range(10):
    print('Cluster', cluster)

    # Collect every word assigned to this cluster in a single pass over the mapping
    words = [word for word, c in word_centroid_mapping.items() if c == cluster]

    print(words)

Cluster 0
['scrap', 'piles']
Cluster 1
['andrea', 'carla', 'lena', 'ariel', 'teresa', 'pauline', 'iris', 'bliss', 'aileen', 'jen', 'su', 'rosa', 'mai', 'delilah', 'selena', 'flirt', 'anastasia', 'stacey', 'hagen', 'ilona', 'lilly', 'celine', 'lina', 'leila', 'celeste', 'miki', 'olga', 'alma', 'belial', 'kisna', 'auteuil', 'amira']
Cluster 2
['brigade']
Cluster 3
['altered', 'formed', 'expanded', 'discarded', 'omitted', 'unexplored']
Cluster 4
['retaining']
Cluster 5
['disgusting', 'outrageous', 'repulsive', 'vile', 'sickening', 'nauseating', 'revolting', 'distasteful', 'delirious', 'disrespectful', 'intolerable', 'repugnant']
Cluster 6
['owner', 'worker', 'district', 'employee', 'operator', 'escort', 'inmate', 'developer', 'irs']
Cluster 7
['kay', 'stevens', 'sue', 'carrie', 'collins', 'shields', 'banks', 'tyler', 'irving', 'katherine', 'carroll', 'carlos', 'palmer', 'thelma', 'fletcher', 'ava', 'eleanor', 'manson', 'roberta', 'tate', 'windsor', 'betsy', 'channing', 'lillard', 'jaime',

Some of the clusters make sense: cluster 1 is mostly female first names, and cluster 5 is adjectives expressing disgust. Others are one-off clusters with a single word, like clusters 2 and 4, which contain only 'brigade' and 'retaining', respectively. Because KMeans is randomly initialized, the cluster numbers change from run to run.
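
One way to check how many clusters are one-off is to group the words by cluster in a single pass and count the groups of size one. This is a sketch on a small hand-made mapping; `toy_mapping` and its cluster numbers are hypothetical stand-ins for `word_centroid_mapping`.

```python
from collections import defaultdict

# Hypothetical word -> cluster mapping standing in for word_centroid_mapping
toy_mapping = {
    "scrap": 0, "piles": 0,
    "andrea": 1, "carla": 1, "lena": 1,
    "brigade": 2,
    "retaining": 3,
}

# Invert the mapping: cluster number -> list of words in that cluster
clusters = defaultdict(list)
for word, cluster in toy_mapping.items():
    clusters[cluster].append(word)

# Clusters that contain exactly one word
singletons = sorted(c for c, words in clusters.items() if len(words) == 1)

print(dict(clusters))
print("Singleton clusters:", singletons)
```

The inverted dictionary also makes it cheap to look up the words of any single cluster without rescanning the whole vocabulary.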

---

Now that each word is assigned to a cluster (each cluster is represented by its centroid), I can try out a "bag-of-centroids" model, which is like Bag of Words but counts clusters instead of individual words. I start by writing a custom function that returns, for each review, a numpy array with one feature per cluster.

In [12]:
def bag_of_centroids(wordlist, word_centroid_mapping):

    # The number of clusters is one more than the highest cluster index in the word/centroid map
    num_centroids = max(word_centroid_mapping.values()) + 1

    # Pre-allocate the bag of centroids vector (for speed)
    bag_centroids = np.zeros(num_centroids, dtype="float32")

    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count by one
    for word in wordlist:
        if word in word_centroid_mapping:
            index = word_centroid_mapping[word]
            bag_centroids[index] += 1

    # Return the "bag of centroids"
    return bag_centroids
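
As a quick check of the function, here is a sketch that applies the same counting logic to a tiny hand-made mapping. The words, cluster numbers, and sample review are invented for illustration.

```python
import numpy as np

def bag_of_centroids(wordlist, word_centroid_mapping):
    # Same logic as above: one counter per cluster, incremented for each known word
    num_centroids = max(word_centroid_mapping.values()) + 1
    bag_centroids = np.zeros(num_centroids, dtype="float32")
    for word in wordlist:
        if word in word_centroid_mapping:
            bag_centroids[word_centroid_mapping[word]] += 1
    return bag_centroids

# Hypothetical mapping with three clusters (0: praise, 1: criticism, 2: film terms)
toy_mapping = {"great": 0, "awesome": 0, "terrible": 1, "awful": 1, "movie": 2}

# "plot" is not in the mapping, so it is skipped
toy_review = ["great", "movie", "awesome", "plot", "movie"]

vec = bag_of_centroids(toy_review, toy_mapping)
print(vec)  # counts per cluster: 2 for cluster 0, 0 for cluster 1, 2 for cluster 2
```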

Below, I use the custom function from above to create the bags of centroids for the review training and test sets.

In [13]:
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros((len(clean_train_reviews), int(num_clusters)), dtype="float32")

# Transform the training set reviews into bags of centroids
for i, review in enumerate(clean_train_reviews):
    train_centroids[i] = bag_of_centroids(review, word_centroid_mapping)

# Repeat for test reviews
test_centroids = np.zeros((len(clean_test_reviews), int(num_clusters)), dtype="float32")

for i, review in enumerate(clean_test_reviews):
    test_centroids[i] = bag_of_centroids(review, word_centroid_mapping)
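
`bag_of_centroids` recomputes `max(word_centroid_mapping.values())` for every review, which scans the whole vocabulary each time. A possible alternative, sketched here on toy data, computes the cluster count once and lets `np.bincount` do the counting. The helper name `reviews_to_centroids` and the toy inputs are hypothetical and not part of the original notebook.

```python
import numpy as np

def reviews_to_centroids(reviews, word_centroid_mapping, num_centroids):
    """Build a (n_reviews, num_centroids) matrix of cluster counts."""
    features = np.zeros((len(reviews), num_centroids), dtype="float32")
    for i, review in enumerate(reviews):
        # Cluster numbers of the words that are in the vocabulary
        indices = np.fromiter(
            (word_centroid_mapping[w] for w in review if w in word_centroid_mapping),
            dtype=np.intp,
        )
        # bincount tallies how often each cluster number occurs
        features[i] = np.bincount(indices, minlength=num_centroids)
    return features

toy_mapping = {"great": 0, "awesome": 0, "terrible": 1, "movie": 2}
toy_reviews = [["great", "movie", "awesome"], ["terrible", "movie", "movie"], []]

# Computed once, outside the loop
num_centroids = max(toy_mapping.values()) + 1

toy_features = reviews_to_centroids(toy_reviews, toy_mapping, num_centroids)
print(toy_features)
```

An empty review yields a row of zeros, matching what the loop-based function would return.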

Using the newly created bags of centroids to train a random forest model again to compare with the previous attempt.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(train_centroids, train.sentiment, random_state=42)

In [15]:
rf = RandomForestClassifier(n_estimators=100)

In [16]:
rfmodel = rf.fit(X_train, y_train)

In [17]:
rfmodel.score(X_train, y_train)

1.0

In [18]:
rfmodel.score(X_test, y_test)

0.84624

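The bag-of-centroids model did about as well as the previous models, scoring about 84.6% on the held-out split. The perfect training score suggests the random forest is overfitting.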

In [19]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=.3)
lr.fit(X_train, y_train)
print("Logistic Regression score on the training set:", lr.score(X_train, y_train))
print("Logistic Regression score on the test set:", lr.score(X_test, y_test))

Logistic Regression score on the training set: 0.9286933333333334
Logistic Regression score on the test set: 0.85888
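
To compare the two classifiers, it can help to look at the gap between training and held-out accuracy. The arithmetic below uses only the scores printed above; the dictionary layout is just for illustration.

```python
# Accuracy values copied from the cell outputs above
scores = {
    "random forest": {"train": 1.0, "test": 0.84624},
    "logistic regression": {"train": 0.9286933333333334, "test": 0.85888},
}

# Difference between training accuracy and held-out accuracy for each model
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train - test gap = {gap:.4f}")
```

On these numbers the random forest has the larger gap and the slightly lower held-out score, which suggests logistic regression generalizes a little better on the bag-of-centroids features.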
