# Clustering Exploration  

### Summary
This notebook records my attempts to cluster the clues in the dataset into more organic categories. I drew on the week 8 lessons on K-Means clustering and DBSCAN, but neither model captured much more than noise. I fit both to the full dataset after TF-IDF vectorization, to no avail. I also fit DBSCAN to the truncated dataset of the top 25 Jeopardy categories, again with little success. In the future I would like to revisit this to find a workable way to capture more signal in the clues, and then attempt classification models through transfer learning.  

Although this was a late-stage attempt in my project process, I have included it as the second notebook in the series, since the intention was to carry any successes forward into the modeling phase.

## Imports

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import KMeans, DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import silhouette_score

np.random.seed(42)

In [3]:
df = pd.read_csv('/Users/Josh/Desktop/jeopardy_processed.csv')

In [4]:
df.head()

Unnamed: 0,round,value,daily_double,category,answer,question,air_date,answer_length,answer_word_count,syllable_count,sentence_count,dale_chall_score,processed_answer
0,1,100,no,LAKES & RIVERS,River mentioned most often in the Bible,the Jordan,1984-09-10,39,7,10,1,6.24,"['river', 'mention', 'often', 'bibl']"
1,1,200,no,LAKES & RIVERS,Scottish word for lake,loch,1984-09-10,22,4,5,1,7.78,"['scottish', 'word', 'lake']"
2,1,400,no,LAKES & RIVERS,American river only 33 miles shorter than the ...,the Missouri,1984-09-10,57,9,17,1,7.59,"['american', 'river', 'onli', '33', 'mile', 's..."
3,1,500,no,LAKES & RIVERS,"World's largest lake, nearly 5 times as big as...",the Caspian Sea,1984-09-10,55,10,14,1,5.71,"['world', 'largest', 'lake', 'nearli', '5', 't..."
4,1,100,no,INVENTIONS,Marconi's wonderful wireless,the radio,1984-09-10,28,3,8,1,14.31,"['marconi', 'wonder', 'wireless']"


In [5]:
df = df[['category', 'answer', 'question']]

### Vectorization with TFIDF

In [8]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['answer'])

### K-Means Clustering

In [13]:
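%%time
# This code was adapted from Lesson 8.01: Intro to Clustering and K-Means
# Iterate through potential cluster amounts and return inertia and silhouette scores for evaluation
# (%%time has to be the first line of the cell to be recognized as a cell magic)

scores = []
for k in range(2, 31):
    # n_jobs was deprecated for KMeans in scikit-learn 0.23 and removed in 1.0, so it is omitted here
    cl = KMeans(n_clusters=k)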
    cl.fit(X)
    inertia = cl.inertia_
    sil = silhouette_score(X, cl.labels_)
    scores.append([k, inertia, sil])
    
score_df = pd.DataFrame(scores)
score_df.columns = ['k', 'inertia', 'silhouette']



CPU times: user 13h 33min 20s, sys: 2h 13min 2s, total: 15h 46min 23s
Wall time: 14h 20min 15s


In [14]:
score_df

Unnamed: 0,k,inertia,silhouette
0,2,348458.415152,0.00079
1,3,348092.089058,0.001145
2,4,347787.030448,0.001487
3,5,347416.217691,0.001826
4,6,347198.121874,0.001954
5,7,346766.717089,0.002437
6,8,346596.926129,0.002694
7,9,346229.470257,0.003169
8,10,345877.54293,0.003389
9,11,345854.314559,0.003455


Across 29 iterations of the loop, with k ranging from 2 to 30, the silhouette score never rose much above zero, which led me to pivot to another clustering method. I was hoping to see a score closer to 1 at some value of k, but it appears I would need a k much higher than the maximum of 30 I allowed. With more time I might have found a more workable number of clusters, but the training cost (over 14 hours of wall time for this loop alone) made a wider search impractical.  
As an alternative, I would consider clustering the categories themselves, since several of them are very similar and tended to confuse the classification algorithm during the modeling process, as will be evident in notebook 3: Modeling.
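
One possible way to shorten this search, sketched here and not run on the full dataset: `silhouette_score` accepts a `sample_size` argument, so the score can be estimated on a random subset instead of over every pair of clues, and `MiniBatchKMeans` is a faster, approximate variant of K-Means. The helper name `score_k_range`, the sample size, and the toy clues are hypothetical stand-ins; the `X` matrix from the TF-IDF cell would be passed in their place.

```python
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def score_k_range(X, k_values, sample_size=2000, seed=42):
    """Return a DataFrame of k, inertia, and a sampled silhouette score."""
    rows = []
    n_samples = X.shape[0]
    for k in k_values:
        model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed)
        labels = model.fit_predict(X)
        # Scoring a random subset avoids computing all pairwise distances.
        sil = silhouette_score(
            X,
            labels,
            sample_size=min(sample_size, n_samples),
            random_state=seed,
        )
        rows.append([k, model.inertia_, sil])
    return pd.DataFrame(rows, columns=['k', 'inertia', 'silhouette'])


# Hypothetical toy clues standing in for df['answer'].
toy_clues = [
    'river mentioned in the Bible',
    'longest river in Africa',
    'river flowing through Paris',
    'river that forms the Texas border',
    'composer of the Moonlight Sonata',
    'composer of the Four Seasons',
    'German composer born in Bonn',
    'opera composer from Italy',
    'planet nearest the sun',
    'largest planet in the solar system',
    'red planet named for a god',
    'planet with prominent rings',
]
X_toy = TfidfVectorizer(stop_words='english').fit_transform(toy_clues)
toy_scores = score_k_range(X_toy, range(2, 5))
print(toy_scores)
```

Sampled silhouette scores are estimates, so small differences between values of k should not be over-read.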

### DBSCAN

In [18]:
%%time
dbscan = DBSCAN(n_jobs=4)
dbscan.fit(X)

CPU times: user 1h 3min 42s, sys: 14min 43s, total: 1h 18min 25s
Wall time: 42min 49s


DBSCAN(n_jobs=4)

In [19]:
# Note: noise points (label -1) are scored here as if they formed one cluster
silhouette_score(X, dbscan.labels_)

-0.29133736653762277

In [35]:
pd.Series(dbscan.labels_).value_counts()

-1     349269
 1        150
 38        14
 29         8
 8          7
 18         7
 14         7
 11         6
 37         6
 33         6
 19         6
 0          6
 10         6
 2          6
 3          6
 6          6
 12         5
 28         5
 39         5
 36         5
 35         5
 34         5
 4          5
 32         5
 31         5
 30         5
 5          5
 26         5
 27         5
 13         5
 25         5
 24         5
 23         5
 22         5
 21         5
 7          5
 9          5
 17         5
 16         5
 15         5
 20         5
dtype: int64

I attempted DBSCAN even though it tends to perform better when clusters are clearly delineated, and again the results were mostly noise. Of the 349,641 clues, 349,269 were assigned the noise label (-1); the remaining 372 fell into 40 small clusters, most containing only 5 or 6 clues.  
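
One likely contributor, offered as a hypothesis and not something verified on this dataset: `DBSCAN()` was run with its defaults (`eps=0.5`, `min_samples=5`, Euclidean distance), while `TfidfVectorizer` L2-normalizes each row by default. For unit-length vectors, Euclidean distance d and cosine similarity s are related by d² = 2 − 2s, so a radius of 0.5 only links clues whose cosine similarity is at least 0.875. The sketch below checks that relation on three hypothetical clues; `eps_to_min_cosine` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances


def eps_to_min_cosine(eps):
    """Cosine similarity implied by a Euclidean radius on unit-length vectors."""
    return 1.0 - (eps ** 2) / 2.0


# Hypothetical clues, not rows taken from the dataset.
toy_clues = [
    'Scottish word for lake',
    'World largest lake by area',
    "Marconi's wonderful wireless",
]
X_toy = TfidfVectorizer(stop_words='english').fit_transform(toy_clues)

dist = euclidean_distances(X_toy)
sim = cosine_similarity(X_toy)

# For L2-normalized rows, d^2 == 2 - 2 * cos holds for every pair.
assert np.allclose(dist ** 2, 2 - 2 * sim, atol=1e-6)

print(eps_to_min_cosine(0.5))  # 0.875
print(dist.round(3))  # even the two lake clues are farther apart than 0.5
```

Short clues rarely share that much vocabulary, so a larger `eps` or a cosine metric may be worth trying before concluding that the data has no density structure.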

Below, I fit DBSCAN to the smaller, truncated dataset with no success at all: every one of the 15,009 clues was labeled as noise. Once again, I would love to come back to the clustering portion of this project later and look for more meaningful results to carry forward.

In [4]:
df2 = pd.read_csv('./datasets/top_25_processed.csv')

In [30]:
X2 = df2['answer']

vec = TfidfVectorizer(stop_words='english')
X2_vec = vec.fit_transform(X2)

In [31]:
dbs = DBSCAN(n_jobs=4)
dbs.fit(X2_vec)

DBSCAN(n_jobs=4)

In [37]:
pd.Series(dbs.labels_).value_counts()

-1    15009
dtype: int64
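
An all-noise result like this one is what DBSCAN returns when no point has `min_samples` neighbors within `eps`. A common heuristic for picking `eps` is to look at the sorted distance from each point to its k-th nearest neighbor, with k equal to `min_samples`, and choose a value near the bend in that curve. The sketch below is a minimal illustration under assumptions: the cosine metric, the `eps` and `min_samples` values, the helper `kth_neighbor_distances`, and the toy clues are all hypothetical and were not tested on `X2_vec`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def kth_neighbor_distances(X, k=5, metric='cosine'):
    """Sorted distance from each row to its k-th nearest neighbor.

    Each row counts as its own first neighbor, which matches how DBSCAN
    counts a point toward its own min_samples.
    """
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


# Hypothetical toy clues: three near-duplicates plus three unrelated clues.
toy_clues = [
    'longest river in Africa',
    'the longest river in Africa',
    'Africa longest river',
    'Marconi wonderful wireless',
    'Scottish word for lake',
    'planet nearest the sun',
]
X_toy = TfidfVectorizer(stop_words='english').fit_transform(toy_clues)

# A jump in this sorted curve suggests a candidate eps just below the jump.
k_dist = kth_neighbor_distances(X_toy, k=3)
print(k_dist.round(3))

# eps is chosen by hand here for the toy data; it is not a recommended value.
toy_labels = DBSCAN(eps=0.3, min_samples=3, metric='cosine').fit_predict(X_toy)
print(toy_labels)
```

On the real data, plotting `k_dist` with `plt.plot` would make the bend easier to spot than reading the printed array.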