# Building Term-Document Matrices with Standard Libraries.

We introduce scikit-learn, a standard Python library, to create term-document matrices and to compute the cosine similarity between documents and queries.

In [51]:
# Load the necessary library classes. Execute this once per session.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
# Load the dataset: a collection of 208 documents. Each document is an episode of How I Met Your Mother.

docs = eval(open('eps.txt').read())
print 'Total number of documents - ',len(docs)

# docs is a list of strings, [ep1, ep2, ep3, ...]; ep1 contains all sentences in episode 1.

Total number of documents -  208
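
`eval` executes whatever expression the file contains, so it is only safe on files you trust. A minimal sketch of a stricter alternative, assuming `eps.txt` holds a plain Python list literal of strings (the file itself is not shown here): `ast.literal_eval` from the standard library parses literals only and raises an error on anything else. The sample text is made up for illustration, and the snippet uses Python 3 syntax, unlike the Python 2 cells in this sheet.

```python
import ast

# Hypothetical stand-in for the contents of eps.txt: a list literal of strings.
raw_text = "['Kids, I am going to tell you a story.', 'Have you met Ted?']"

# literal_eval accepts only Python literals (strings, numbers, lists, dicts, ...).
sample_docs = ast.literal_eval(raw_text)
print('Total number of documents -', len(sample_docs))

# Anything that is not a literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print('Non-literal input rejected:', rejected)
```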


## Create Term-document matrix for word counts.

`CountVectorizer` provides several options for building the term-document matrix. For the full list, see the [documentation](http://www.scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [26]:
Vcount = CountVectorizer(analyzer='word', ngram_range=(1,1), stop_words = 'english')
countMatrix = Vcount.fit_transform(docs)
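
To see what `fit_transform` produces, here is a small sketch on a made-up three-sentence corpus (not the episode data), written for Python 3. It uses the same vectorizer settings as the cell above; `toy_docs` and the other names are introduced only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus; each string plays the role of one document.
toy_docs = [
    "Ted met Robin at the bar",
    "Barney met Ted at the bar",
    "Robin likes dogs",
]

toy_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_terms)

# Rows are documents, columns are terms, entries are raw counts.
print(toy_counts.toarray())
```

Note that scikit-learn puts documents in rows and terms in columns, and that stop words such as "the" never get a column.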

In [52]:
feature_names = Vcount.get_feature_names() # The vocabulary, i.e. all unique words (only unigrams here, since ngram_range=(1,1))
# In scikit-learn 1.0 and later, use get_feature_names_out() instead.

print feature_names[10000:10020] #20 of the words

print len(feature_names)

query = ["have you met Ted?"]   #Query for which we have to find the most relevant documents from the dataset

newCounter = Vcount.transform(query)  # Not fit_transform. Why?

# Note the change in representation in the output: the library uses a sparse matrix representation.
# Do you think it's advantageous to use a sparse representation? Why?

print newCounter

for item in newCounter.indices:
    print feature_names[item],newCounter[0,item]

[u'lockbox', u'locked', u'locker', u'locket', u'locking', u'lockroom', u'locks', u'locksmith', u'loco', u'locusts', u'lodge', u'lodged', u'lodging', u'lofty', u'log', u'logarithms', u'logic', u'logically', u'logistical', u'logistics']
18972
  (0, 10681)	1
  (0, 16809)	1
met 1
ted 1
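
On the two questions in the comments above: `transform` reuses the vocabulary learned by `fit_transform`, so the query vector has the same columns as the document matrix, which is what a row-by-row comparison needs. Calling `fit_transform` on the query would instead learn a new, tiny vocabulary. The sparse format stores only the nonzero entries, which matters when nearly all of the 18972 columns are zero for any given text. The sketch below illustrates both points on a made-up corpus (Python 3; all names are illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["Ted met Robin", "Barney met Ted", "Robin likes dogs"]
toy_query = ["have you met Ted?"]

fitted = CountVectorizer(stop_words='english')
doc_matrix = fitted.fit_transform(toy_docs)

# transform maps the query onto the columns learned from the documents.
query_vec = fitted.transform(toy_query)
print(doc_matrix.shape, query_vec.shape)   # same number of columns

# Refitting on the query alone learns a different, smaller vocabulary,
# so its columns no longer line up with doc_matrix.
refit_vec = CountVectorizer(stop_words='english').fit_transform(toy_query)
print(refit_vec.shape)

# Sparse storage keeps only the nonzero entries.
total_cells = query_vec.shape[0] * query_vec.shape[1]
print('stored entries:', query_vec.nnz, 'of', total_cells)
```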


In [31]:
from sklearn.metrics.pairwise import cosine_similarity

cosMat = cosine_similarity(newCounter,countMatrix)
print cosMat

# Find the top 5 results: argsort sorts ascending, so [:-6:-1] takes the last five indices in reverse order
related_docs_indices = cosMat[0].argsort()[:-6:-1]


print '\n\nThe index numbers for top5 results are', related_docs_indices

print '\nEpisode Number, Cosine Similarity'
for item in related_docs_indices:
    print 'Episode', item+1, cosMat[0][item]


[[ 0.46982511  0.49490873  0.39776136  0.4080457   0.34918273  0.35217454
   0.45673164  0.39605804  0.3432839   0.57498765  0.3743505   0.48402476
   0.53727265  0.25685235  0.22027905  0.33133984  0.27294293  0.47338797
   0.44436048  0.20347623  0.44959571  0.42012999  0.42888626  0.21748043
   0.54513427  0.39208843  0.20109782  0.41930449  0.41930449  0.25832885
   0.37893153  0.17787032  0.48214366  0.35809444  0.39993708  0.33424685
   0.38547167  0.43573326  0.38956891  0.2409969   0.03082949  0.23641635
   0.16695486  0.41883081  0.45193151  0.27590232  0.49527825  0.36807145
   0.49022195  0.3265684   0.24629266  0.08973091  0.38857077  0.03464015
   0.26533928  0.16245673  0.52939212  0.16327821  0.1991428   0.1954712
   0.30918883  0.4381909   0.2822562   0.16645274  0.31591373  0.22198725
   0.38992356  0.39739954  0.4727522   0.43893594  0.30678827  0.28466895
   0.40460185  0.44730945  0.33989048  0.37179893  0.37430613  0.19912918
   0.1710311   0.33770107  0.41647925  
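
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the count vectors rather than on how long a document is. A minimal pure-Python sketch of that formula, using made-up count vectors (the helper name `cosine` is hypothetical and not part of scikit-learn; Python 3 syntax):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up counts over a four-term vocabulary.
query_counts = [1, 1, 0, 0]
short_doc = [2, 2, 0, 0]      # same direction as the query
long_doc = [20, 20, 0, 0]     # ten times longer, same direction
other_doc = [0, 0, 3, 1]      # shares no terms with the query

print(cosine(query_counts, short_doc))
print(cosine(query_counts, long_doc))
print(cosine(query_counts, other_doc))
```

The short and the long document score the same against the query, while the document with no shared terms scores zero.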

## Create Term-document matrix for tf-idf values.



In [48]:
Vtfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), stop_words = 'english')
tfidfMatrix = Vtfidf.fit_transform(docs)

tf_feature_names = Vtfidf.get_feature_names()  # Vocabulary of the tf-idf vectorizer (not Vcount)
print tf_feature_names[1000:1010]

query = ["have you met Ted?"]

newtfidf = Vtfidf.transform(query)

print newtfidf,'\n'

for item in newtfidf.indices:
    print tf_feature_names[item],newtfidf[0,item]



[u'apped', u'appelle', u'appendicitis', u'appetite', u'appetites', u'appetizer', u'appetizers', u'applaud', u'applauded', u'applauding']
  (0, 16809)	0.600384336013
  (0, 10681)	0.799711603686 

ted 0.600384336013
met 0.799711603686
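
Both query terms occur once, yet `met` receives a larger weight than `ted`. That is the idf factor at work: a term found in fewer documents is weighted more heavily. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes idf as ln((1 + n) / (1 + df)) + 1 and then scales each row to unit length. The sketch below applies that formula in plain Python 3. The document frequencies are invented for illustration and are not taken from the episode data, so the printed weights will not match the output above.

```python
import math

n_docs = 208                            # collection size from the notebook
doc_freq = {'ted': 200, 'met': 120}     # hypothetical document frequencies
query_tf = {'ted': 1, 'met': 1}         # each term occurs once in the query

def smooth_idf(df, n):
    """idf with the smoothing used by TfidfVectorizer's defaults."""
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

# Raw tf-idf weight for each query term.
raw = {t: query_tf[t] * smooth_idf(doc_freq[t], n_docs) for t in query_tf}

# L2-normalize so the query vector has unit length.
length = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / length for t, w in raw.items()}

for term, weight in sorted(weights.items()):
    print(term, round(weight, 4))
```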


In [50]:
cosMattf = cosine_similarity(newtfidf,tfidfMatrix)
print cosMattf
related_docs_indices = cosMattf[0].argsort()[:-6:-1]
print related_docs_indices
print cosMattf[0][related_docs_indices]

for item in related_docs_indices:
    print 'Episode', item+1, cosMattf[0][item]


[[ 0.29832037  0.31381361  0.22427591  0.16382004  0.16869673  0.18560001
   0.19852501  0.25572387  0.14795766  0.36719511  0.1483884   0.2141065
   0.26921911  0.17062134  0.10013598  0.1479135   0.15012545  0.30781399
   0.19952338  0.12073193  0.28791861  0.22473619  0.2939716   0.15485519
   0.38643182  0.24922804  0.09811546  0.17076153  0.17076153  0.16169259
   0.24786607  0.08206474  0.28162381  0.18044557  0.1865094   0.16960901
   0.22120904  0.28781123  0.22800576  0.12546763  0.01554137  0.12567691
   0.09328413  0.29383486  0.21320968  0.13038106  0.23258037  0.18986805
   0.24560645  0.1597784   0.15753429  0.04081194  0.21887841  0.0149117
   0.13497876  0.08644725  0.23433097  0.10485147  0.10050163  0.0854796
   0.1924596   0.19810858  0.11859001  0.077826    0.19036977  0.083194
   0.1933294   0.23956284  0.23063843  0.26001025  0.1997866   0.12382375
   0.19262909  0.2182133   0.15278339  0.23708279  0.22434157  0.11373675
   0.07323738  0.15990011  0.23315832  0.19
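
The slice `[:-6:-1]` used for ranking can look cryptic. `argsort` returns indices in ascending order of score, and `[:-6:-1]` reads the last five of them backwards, giving the five highest-scoring documents with the best first. A small sketch with made-up scores (Python 3, assuming NumPy is installed; `top_k` is a hypothetical helper):

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return np.argsort(scores)[:-(k + 1):-1]

# Made-up similarity scores for seven documents.
scores = np.array([0.10, 0.57, 0.03, 0.49, 0.54, 0.20, 0.47])

best = top_k(scores, 5)
print(best)            # document indices, best match first
print(scores[best])    # their scores, in descending order
```

Indexing the score array with the index array, as in `scores[best]`, is the same pattern as `cosMattf[0][related_docs_indices]` in the cell above.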

## Exercise

- Compare the results for different queries when we use word counts and when we use tf-idf:
    - "Actually, I think it's cute"
    - "So do you think you'll ever get married?"
- What happens when we consider bigrams and trigrams in the term-document matrix? (Check the `ngram_range` argument of the vectorizer objects.)
- Try setting a threshold on the minimum document frequency each word must have (the `min_df` argument) and see how that affects the queries. (Check the documentation link.)

- What happens when we use Euclidean distance instead of cosine similarity? Try different pairwise comparison [metrics](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) and see the results.
- Try applying the various preprocessing steps from the previous sheet and compare the results after each modification. Is there any preprocessing step that we are missing?
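
As a starting point for these exercises, the sketch below shows where the relevant options go. It uses a made-up corpus and Python 3 syntax; the parameter values (`ngram_range=(1, 3)`, `min_df=2`) are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

toy_docs = [
    "Ted met Robin at the bar",
    "Barney met Robin at the bar",
    "Ted met Barney",
]
toy_query = ["have you met Ted?"]

# Unigrams, bigrams and trigrams; keep only terms found in at least 2 documents.
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=2, stop_words='english')
doc_matrix = ngram_vectorizer.fit_transform(toy_docs)
query_vec = ngram_vectorizer.transform(toy_query)

print(sorted(ngram_vectorizer.vocabulary_))

# Higher cosine similarity means a better match ...
cos_scores = cosine_similarity(query_vec, doc_matrix)
print(cos_scores)

# ... whereas a smaller Euclidean distance means a better match.
euclid = pairwise_distances(query_vec, doc_matrix, metric='euclidean')
print(euclid)
```

When ranking by distance, remember to sort ascending rather than descending.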