### References
https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275
https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import turicreate
import pandas as pd

In [4]:
people = turicreate.SFrame('./people_wiki.sframe')

In [5]:
df = people.to_dataframe()

In [6]:
df.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


### Create a Corpus

A "corpus" represents the entire body of articles / text documents. 

In [7]:
corpus = list(df['text'])

Let's check the first item in the corpus, which represents our first document and should be the article for "Digby Morrell".

In [9]:
corpus[0]

'digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vfla

### Instantiate a "Vectorizer"

Vectorisation is the process of turning a collection of text documents into numerical feature vectors. The specific way in which you 'vectorize' your documents depends on the type of "Vectorizer" you use.

Two common ones are `CountVectorizer`, which simply counts words in the documents, and `TfidfVectorizer`, which applies tf-idf weighting to upweight words that are rare in the corpus and downweight words that appear in many documents.

In [19]:
tfidfvectoriser = TfidfVectorizer(stop_words='english')

When we apply the tfidfvectoriser to our corpus of text documents, it returns a `scipy.sparse._csr.csr_matrix` (a compressed sparse row matrix). It represents an `(nsamples, nfeatures)` matrix whose values are the tf-idf scores, but it stores only the non-zero entries, which matters here because each document uses only a tiny fraction of the vocabulary.

The `nfeatures` columns are identified by 0-based indices, but with the code below we can recover the 'token' each index represents.

In [20]:
X = tfidfvectoriser.fit_transform(corpus)

In [21]:
tfidf_tokens = tfidfvectoriser.get_feature_names_out()
stop_words = tfidfvectoriser.get_stop_words()

In [22]:
tmp_df = pd.DataFrame(X[0].T.todense(), index=tfidf_tokens, columns=["TF-IDF"])

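Below are the top tokens for the first document (Digby Morrell), not for the corpus as a whole, since `X[0]` selects only the first row of the matrix. Run-together tokens such as `brisbaneafter` and `aflfrom` come from missing spaces in the source text.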

In [26]:
tmp_df = tmp_df.sort_values('TF-IDF', ascending=False)
tmp_df[:25]

Unnamed: 0,TF-IDF
morrell,0.55318
football,0.386163
kangaroos,0.256943
club,0.174097
carlton,0.137553
league,0.133941
australian,0.126203
brisbaneafter,0.122558
aflfrom,0.122558
edflhe,0.122558


In [13]:
tfidf_tokens

array(['00', '000', '0000', ..., 'zzran', 'zzt', 'zzts'], dtype=object)

In [33]:
X.shape

(59071, 547629)

In [34]:
type(X)

scipy.sparse._csr.csr_matrix

In [2]:
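# Densifying all of X would need about 32 billion float64 values (roughly 259 GB),
# so convert only a small slice. This cell must run after X has been created above.
my_array = X[:5].toarray()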

In [None]:
type(my_array)
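
Before converting a sparse matrix to a dense array, it helps to estimate how much memory the dense version would need. The sketch below does that arithmetic for the shape reported above, `(59071, 547629)`, assuming 8-byte `float64` values; the variable names are illustrative.

```python
# Shape of the tf-idf matrix as reported by X.shape above.
n_samples, n_features = 59071, 547629
bytes_per_value = 8  # float64

# A dense array stores every cell, including all the zeros.
dense_bytes = n_samples * n_features * bytes_per_value
dense_gb = dense_bytes / 1e9

print(f"dense array would need about {dense_gb:.0f} GB")
```

That is far more than a typical machine has available, so the matrix is best kept sparse, with only individual rows or small slices converted when needed.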

### Apply K-Means Clustering to the tf-idf matrix

In [33]:
kmeans = KMeans(n_clusters=5).fit(X)

In [46]:
set(kmeans.labels_)

{0, 1, 2, 3, 4}

In [47]:
df['tfidf_cluster'] = kmeans.labels_

In [48]:
df.head()

Unnamed: 0,URI,name,text,tfidf_cluster
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,4
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,0
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,2
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,0
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,2


In [49]:
df[df.tfidf_cluster == 3].head(20)

Unnamed: 0,URI,name,text,tfidf_cluster
7,<http://dbpedia.org/resource/Trevor_Ferguson>,Trevor Ferguson,trevor ferguson aka john farrow born 11 novemb...,3
15,<http://dbpedia.org/resource/Joerg_Steineck>,Joerg Steineck,joerg steineck is a german filmmaker editor an...,3
40,<http://dbpedia.org/resource/Timothy_Grucza>,Timothy Grucza,timothy grucza born 1 july 1976 melbourne aust...,3
69,<http://dbpedia.org/resource/Will_Tiao>,Will Tiao,will tiao is a taiwanese american actor and pr...,3
71,<http://dbpedia.org/resource/Geoffrey_Bayldon>,Geoffrey Bayldon,geoffrey bayldon born 7 january 1924 in leeds ...,3
80,<http://dbpedia.org/resource/Robin_MacPherson>,Robin MacPherson,robin macpherson born 1959 glasgow scotland is...,3
85,<http://dbpedia.org/resource/Zvonimir_Juri%C4%87>,Zvonimir Juri%C4%87,zvonimir juri born 4 june 1971 is a croatian f...,3
91,<http://dbpedia.org/resource/Katja_Herbers>,Katja Herbers,katja mira herbers dutch pronunciation ktja mi...,3
94,<http://dbpedia.org/resource/Eva_Habermann>,Eva Habermann,eva felicitas habermann born january 16 1976 i...,3
97,<http://dbpedia.org/resource/David_Shaughnessy>,David Shaughnessy,david james shaughnessy also spelled shaughnes...,3


In [50]:
df.head()

Unnamed: 0,URI,name,text,tfidf_cluster
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,4
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,0
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,2
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,0
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,2


In [115]:
df_digby = df.iloc[[0,]].copy()

In [116]:
df_digby

Unnamed: 0,URI,name,text,tfidf_cluster
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,4


In [110]:
tfidf_matrix_for_digby = pd.DataFrame(X[0].T.todense(), index=tfidf_tokens, columns=["TF-IDF"]).sort_values('TF-IDF', ascending=False)

In [111]:
tfidf_matrix_for_digby['TF-IDF']

morrell      0.553180
football     0.386163
kangaroos    0.256943
club         0.174097
carlton      0.137553
               ...   
epitomise    0.000000
epitomic     0.000000
epitome      0.000000
epithets     0.000000
zzts         0.000000
Name: TF-IDF, Length: 547629, dtype: float64

In [112]:
tfidf_matrix_for_digby = tfidf_matrix_for_digby['TF-IDF'].to_json()

In [117]:
df_digby.loc[:, 'tfidf_dict'] = tfidf_matrix_for_digby

In [118]:
df_digby

Unnamed: 0,URI,name,text,tfidf_cluster,tfidf_dict
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,4,"{""morrell"":0.5531803748,""football"":0.386163332..."


#TODO: Compute the `tfidf_dict` for every row in the corpus, not just for Digby Morrell. This will help with evaluation.