A detailed discussion of TF-IDF and K-Means clustering can be found in the Appendix.

#### TF-IDF (refer to the Appendix as well)

To understand TF-IDF, we first need to understand two terms:

- Term Frequency (TF): The intuition is that documents of a particular type tend to contain similar words, so words that occur frequently in a document indicate its topic. Terms that occur extremely often, or that are very rare, are pruned, just as stop words are removed during text preprocessing.

- Inverse Document Frequency (IDF): The limitation of TF is that it is ineffective at term weighting: on its own, it weights a term the same way no matter how many documents in the corpus contain it. When some words have the discriminative power to distinguish a particular group of documents, TF alone cannot capture this, and IDF comes to the rescue. The IDF value is high for rare terms and low for highly frequent ones.

TF-IDF combines the advantages of both methods. It assigns higher values to terms that occur frequently within a small set of documents, since these have more discriminative power. The value is penalized as the term occurs in more documents, and the lowest value goes to terms that occur in all documents. Hence, TF-IDF typically yields better clustering than term frequency alone.
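To make the weighting concrete, here is a minimal sketch that computes a textbook TF-IDF score by hand. The toy corpus and the helper `tf_idf` are hypothetical and exist only for illustration; the formula used (relative term frequency times `log(N / df)`) is one common variant, and scikit-learn's `TfidfVectorizer` uses a smoothed version, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): "match" appears in every document, "cricket" in only one.
toy_docs = [
    "cricket match today".split(),
    "football match tonight".split(),
    "tennis match tomorrow".split(),
]

n_docs = len(toy_docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf(term, doc):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# Score every term of the first document.
first_doc = toy_docs[0]
scores = {term: tf_idf(term, first_doc) for term in first_doc}
print(scores)
```

In this sketch, "match" occurs in all three documents and therefore scores zero, while "cricket" and "today", which occur in a single document, receive the highest score.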

Let's follow this article to understand how k-means clustering is done:

https://medium.com/@rohithramesh1991/unsupervised-text-clustering-using-natural-language-processing-nlp-1a8bc18b048d

Let's note how to find the optimal k, referring to the above article.

Let's try implementing our own clustering on the social media updates by Mr. Jack. (Don't worry about n-grams yet; we can use them in our project, where we will have a larger corpus.)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

In [2]:
corpus = ["Today its raining so I can't go out to play",
"I like ice hockey, football and basketball and tennis is my favorite",
"I want to go to party tonight. Let me know who wants to join me",
"Who knows how to play cricket? Let's try some new sports as we have holidays now.",
"All work, no plays makes me a dull boy... time to chill",
"Who is watching English Premium League tonight  ?",
"I want to go to Lapland for my winter holidays.",
"Christmas holidays was fun. I enjoyed playing indoor sports with my grandfather.",
"Ice cream vs Ice hockey. Now its summer :D",
"Summer finished and I didn't play anything all the holidays. #CORONA"]

In [3]:
#Create a TF-IDF vectorizer that removes English stop words
vectorizer = TfidfVectorizer(stop_words='english')

#Because of the default token_pattern parameter of TfidfVectorizer, punctuation is treated as a token separator and dropped. (Check out the documentation)

X = vectorizer.fit_transform(corpus)  # X is the sparse TF-IDF matrix, one row per document
#X

In [4]:
#https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
assigned_k = 2  
kmeans_model = KMeans(n_clusters=assigned_k, init='k-means++', max_iter=100, n_init=1)
#Fit the unsupervised model (K-means clustering) to our data (the vectorized corpus)
kmeans_model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=2, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [5]:
#Now let's try various values of k using the elbow method

#Code goes here
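
One possible way to fill in the elbow-method cell is sketched below. It fits K-means for several values of k and records `inertia_`, the sum of squared distances from each sample to its closest cluster center. The stand-in corpus, the range of k, and the names `elbow_corpus`, `X_elbow`, and `inertias` are assumptions made so the block runs on its own; in the notebook you would reuse the TF-IDF matrix `X` instead.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Small stand-in corpus (hypothetical) so the block is self-contained.
elbow_corpus = [
    "I played football with my friends today",
    "The football match was exciting to watch",
    "We are going to a party tonight",
    "The party last night was great fun",
    "I love ice cream in the summer",
    "Summer holidays are the best time for ice cream",
    "It was raining so the cricket game was cancelled",
    "Cricket is my favorite sport to play",
]
X_elbow = TfidfVectorizer(stop_words='english').fit_transform(elbow_corpus)

# Fit one model per candidate k and record its inertia.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(X_elbow)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.3f}")
```

Inertia shrinks as k grows, so the elbow heuristic looks for the value of k after which the decrease flattens out. With a corpus this small the elbow may not be clearly visible.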

Let's check this documentation to understand what the following snippet does:

- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html (especially the 'cluster_centers_' attribute)


In [6]:
order_centroids = kmeans_model.cluster_centers_
terms = vectorizer.get_feature_names_out().tolist()  #get_feature_names() was removed in scikit-learn 1.2
#order_centroids

Now we may want to sort the terms from highest to lowest value. Let's also check the following documentation before proceeding:

- https://numpy.org/doc/stable/reference/generated/numpy.argsort.html

In [8]:
#What will the following snippet do?

# x = np.array([[3, 1, 2],[1,2,3]])
# x.argsort()[:, ::-1]  #Originally it was ascending
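
As a quick check of the question above, the snippet can be run on its own. This is a small sketch; the variable names `ascending` and `descending` are introduced here only for illustration.

```python
import numpy as np

x = np.array([[3, 1, 2], [1, 2, 3]])

# argsort() returns, for each row, the column indices that would sort
# that row in ascending order.
ascending = x.argsort()

# Reversing the columns turns the ascending order into a descending one.
descending = ascending[:, ::-1]

print(ascending)
print(descending)
```

For the first row `[3, 1, 2]`, the descending index order is `[0, 2, 1]`: the largest value sits at index 0, then index 2, then index 1.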

In [7]:
#argsort() gives the indices that sort each row in ascending order.
#The slice [:, ::-1] keeps all rows and reverses the columns, giving descending order.
order_centroids = order_centroids.argsort()[:, ::-1]

#Now let's check the centroids

In [9]:
#len(terms)
terms

['basketball',
 'boy',
 'chill',
 'christmas',
 'corona',
 'cream',
 'cricket',
 'didn',
 'dull',
 'english',
 'enjoyed',
 'favorite',
 'finished',
 'football',
 'fun',
 'grandfather',
 'hockey',
 'holidays',
 'ice',
 'indoor',
 'join',
 'know',
 'knows',
 'lapland',
 'league',
 'let',
 'like',
 'makes',
 'new',
 'party',
 'play',
 'playing',
 'plays',
 'premium',
 'raining',
 'sports',
 'summer',
 'tennis',
 'time',
 'today',
 'tonight',
 'try',
 'vs',
 'want',
 'wants',
 'watching',
 'winter',
 'work']

In [12]:
#Let's print the top words in each cluster
for i in range(assigned_k):
    print("Cluster %d:" % i)
    for index in order_centroids[i, :20]:  #We are checking only the first 20 terms of each cluster
        print('%s' % terms[index],'\n')

Cluster 0:
tonight 

watching 

league 

premium 

english 

join 

wants 

party 

know 

basketball 

like 

favorite 

football 

tennis 

dull 

chill 

boy 

makes 

work 

time 

Cluster 1:
holidays 

play 

summer 

ice 

sports 

today 

raining 

lapland 

winter 

want 

corona 

didn 

finished 

vs 

cream 

try 

cricket 

knows 

new 

enjoyed 



In [11]:
print('Prediction')
X_new = vectorizer.transform(['Nothing is easy in football.'])  #Vectorize a new test sentence with the fitted vectorizer

predicted = kmeans_model.predict(X_new)
print(predicted)  #The cluster that the test sentence belongs to

Prediction
[0]


Now it's time to analyze whether the clustering makes sense. As a further task, try increasing the size of your corpus: instead of taking the corpus from Mr. Jack's social media, try writing 15 sentences on your favorite sport. Use coherent sentences, because the corpus will still be quite small. Feel free to change the number of clusters and play around with the various parameters.

- Use another vectorization method, such as CountVectorizer (a bag-of-words representation), to vectorize the corpus. Follow all the text preprocessing steps from our earlier tasks. If you go with CountVectorizer, you won't have to bother with punctuation and lowercasing, since it handles both by default.