## Latent Semantic Indexes

### Exercise 2:
- 2.1 Build the term-doc matrix and apply tf-idf to the counts (after preprocessing).
- 2.2 Decompose your matrix into its singular factors. (SVD)
- 2.3 Provide the MRR over the query “Nigel Farage leading new pro brexit party”.

In [4]:
! pip install nltk



In [6]:
import nltk

In [7]:
nltk.download("twitter_samples")

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [8]:
import text_transformer as tt

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
from nltk.corpus import twitter_samples
tweets = twitter_samples.docs()
docs = [t['text'] for t in tweets]

In [10]:
unique_tweets = list(set(docs)) # use just the unique tweets

In [11]:
unique_tweets_processed = [tt.tokenizer(tweet) for tweet in unique_tweets] # tokenize the tweets

In [12]:
unique_tweets_processed = [tt.normalizer(tweet) for tweet in unique_tweets_processed] # normalize the tweets

In [13]:
unique_tweets_processed = [tt.stemmer(tweet) for tweet in unique_tweets_processed] # apply stemming to the tweets

In [137]:
# randomly sample a fixed percentage of the tweets (5% here) to keep the processing time reasonable
import random
percentage = 5
reduced_tweets = random.sample(unique_tweets_processed, int(len(unique_tweets_processed)*(percentage/100)))

In text mining, a common first step is to build the document-term matrix (DTM) of the corpus. A DTM has one row per document and one column per word; each element is either the raw count of that word in that document or a weight, usually tf-idf. Subsequent analysis, such as the SVD used for latent semantic indexing, operates on this matrix.

https://datawarrior.wordpress.com/2018/01/22/document-term-matrix-text-mining-in-r-and-python/
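To make the layout concrete, here is a minimal sketch of a DTM built from raw counts. The two-document corpus `toy_docs` is made up for illustration and is not part of the tweet data.

```python
from collections import Counter

# Toy corpus: each document is already tokenized (hypothetical data).
toy_docs = [
    ["brexit", "party", "brexit"],
    ["party", "leader"],
]

# Fix a column order by sorting the vocabulary.
vocabulary = sorted({word for doc in toy_docs for word in doc})

# One row per document, one column per word, raw counts as values.
dtm = [[Counter(doc)[word] for word in vocabulary] for doc in toy_docs]

print(vocabulary)
for row in dtm:
    print(row)
```

Most entries of a real DTM are zero, because each tweet contains only a handful of the words in the vocabulary.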

In [64]:
# IDF (inverse document frequency) measures how rare a word is across the collection:
# the fewer documents contain the word, the higher its IDF
# formula: log10(total number of documents / number of documents containing the word)
# in our case a "document" is a tweet
def computeIDF(tweets):
    import math
    idfDict = {}
    N = len(tweets) # number of tweets in the collection

    for tweet in tweets:
        for index in range(0, len(tweets[tweet])):
            word = tweets[tweet][index][0]  # each entry holds a word and its count; [0] addresses the word
            if word in idfDict:
                idfDict[word] += 1   # word already seen: one more tweet contains it
            else:
                idfDict[word] = 1    # first tweet that contains this word

    for word, val in idfDict.items():
        idfDict[word] = math.log10(N/float(val)) # log10(N / document frequency)

    return idfDict
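As a quick sanity check of the formula, the sketch below computes the same quantity in a compact form on a two-tweet toy input. It assumes, as `computeIDF` does, that each tweet is stored as a list of (word, count) pairs with every word listed once; the name `compute_idf_compact` and the toy data are hypothetical.

```python
import math
from collections import Counter

def compute_idf_compact(tweet_counts):
    """IDF per word, assuming tweet_counts maps id -> list of (word, count) pairs."""
    n_tweets = len(tweet_counts)
    doc_freq = Counter()
    for pairs in tweet_counts.values():
        # Assumption: each word appears at most once in a tweet's pair list,
        # so counting pairs counts the tweets that contain the word.
        doc_freq.update(word for word, _ in pairs)
    return {word: math.log10(n_tweets / df) for word, df in doc_freq.items()}

toy_counts = {
    0: [("brexit", 2), ("party", 1)],
    1: [("party", 1), ("leader", 1)],
}
toy_idf = compute_idf_compact(toy_counts)
print(toy_idf)
```

A word that occurs in every tweet ("party" here) gets an IDF of 0, so it contributes nothing to the weighted matrix.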

In [15]:
# formula: (1 + log(TF)) * IDF
# note: math.log is the natural logarithm, while computeIDF uses log10
def computeTFIDF(tfDict, idfDict):
    import math
    tfidf = {}
    for word, val in tfDict.items():
        tfidf[word] = (1 + math.log(val))*idfDict[word]
        
    return tfidf
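The sketch below applies the same weighting, (1 + ln(tf)) * idf, to a single toy tweet so the effect of the log damping is visible. The function name `tfidf_weights` and the input values are made up for illustration.

```python
import math

def tfidf_weights(tf_counts, idf):
    """Weight each word by (1 + ln(tf)) * idf, the formula used in computeTFIDF."""
    return {word: (1 + math.log(tf)) * idf[word] for word, tf in tf_counts.items()}

# Hypothetical inputs: raw counts for one tweet and IDF values for its words.
toy_tf = {"brexit": 2, "party": 1}
toy_idf = {"brexit": math.log10(2), "party": 0.0}

weights = tfidf_weights(toy_tf, toy_idf)
print(weights)
```

Doubling the count of "brexit" raises its term-frequency factor from 1 to about 1.69 rather than to 2, which is the point of the logarithm: repeated words count for more, but not proportionally more.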

In [124]:
# create a dictionary using the position of each tweet as the key
# (enumerate avoids the quadratic cost of calling reduced_tweets.index for every tweet)
tweets = dict(enumerate(reduced_tweets))

In [125]:
# count frequencies for each tweet
tweet_count = {tweet : tt.count_frequencies(tweets[tweet]) for tweet in tweets}

In [126]:
test[1][0][0]

'home'

In [127]:
idf = computeIDF(tweet_count)

In [138]:
keys = (0,1,2,3,4)
test = { k: tweet_count[k] for k in keys}

In [143]:
test[0][0][0]

'nongarden'

### Term Doc Matrix

Following the Wikipedia approach: https://en.wikipedia.org/wiki/Document-term_matrix

In [130]:
import pandas as pd
import numpy as np

In [133]:
unique_words = set(tt.flatten(reduced_tweets))

In [134]:
td_matrix = pd.DataFrame(0, index=np.arange(0), columns=unique_words)

In [135]:
td_matrix

Unnamed: 0,papertownsmovi,nite,afang,diari,labour4free,yot,souththanet,liamthehobbit,dreamt,marco,...,includ,tournament,rt,hope,espio1,scottishfirst,famili,front,no…,alrd


In [136]:
# constructing the term-doc matrix with the counts as values
# This might take some time
for i in tweet_count:
    for j in range(0, len(tweet_count[i])):
        td_matrix.loc[i,tweet_count[i][j][0]] = tweet_count[i][j][1]
#        print(i)
#        print (test[i][j][0])

KeyboardInterrupt: 

In [114]:
td_matrix

Unnamed: 0,afang,jakeybhoy58,adf,mysightnott,plusmil,willwilson0208,muselshoux,junior,simon,1000,...,immov,christabellatr2,lagtc,charlesheff,millibrand,minecon,bunso,rt,lucyso,scottishfirst
0,,,,,,,,,,,...,,,,,,,,1.0,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# constructing the term-doc matrix with the tf-idf as values
td_idf_matrix = pd.DataFrame(0, index=np.arange(0), columns=unique_words)
for i in tweet_count:
    for j in range(0, len(tweet_count[i])):
        td_idf_matrix.loc[i,tweet_count[i][j][0]] = tweet_count[i][j][1] * idf[tweet_count[i][j][0]]
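Filling a DataFrame cell by cell with `.loc` is slow for a large vocabulary. One possible alternative, sketched here under the assumption that each tweet is a list of (word, count) pairs, is to fill a preallocated NumPy array through a word-to-column lookup. The helper `build_weighted_matrix` and the toy inputs are hypothetical, and absent words are stored as 0 rather than NaN.

```python
import numpy as np

def build_weighted_matrix(tweet_counts, idf):
    """Return (values, doc_ids, words) where values[row, col] = count * idf[word]."""
    words = sorted({word for pairs in tweet_counts.values() for word, _ in pairs})
    column_of = {word: col for col, word in enumerate(words)}
    doc_ids = list(tweet_counts)
    values = np.zeros((len(doc_ids), len(words)))  # words missing from a tweet stay 0
    for row, doc_id in enumerate(doc_ids):
        for word, count in tweet_counts[doc_id]:
            values[row, column_of[word]] = count * idf[word]
    return values, doc_ids, words

# Made-up counts and IDF values for illustration.
toy_counts = {0: [("brexit", 2), ("party", 1)], 1: [("party", 1), ("leader", 1)]}
toy_idf = {"brexit": 0.5, "party": 0.0, "leader": 0.5}

values, doc_ids, words = build_weighted_matrix(toy_counts, toy_idf)
print(words)
print(values)
```

If a labelled table is still wanted, the result can be wrapped afterwards with `pd.DataFrame(values, index=doc_ids, columns=words)`. Using zeros instead of NaN also matters for the decomposition step, since SVD routines do not accept missing values.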

### Singular Value Decomposition
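A minimal sketch of this step with `numpy.linalg.svd`, run on a small made-up matrix rather than the tweet data. The choice of `k = 2` and the helper `fold_in_query` are assumptions for illustration; the real tf-idf matrix (with missing entries filled with 0) would take the place of `toy_matrix`.

```python
import numpy as np

# Small documents-by-terms matrix standing in for the tf-idf matrix (made-up values).
toy_matrix = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.0],
])

# Thin SVD: toy_matrix == U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

# Keep the k largest singular values to obtain the latent space.
k = 2
doc_vectors = U[:, :k] * s[:k]  # each row is one document in k dimensions

def fold_in_query(query_vector, Vt, k):
    """Project a term-space query vector into the same k-dimensional space as doc_vectors."""
    return query_vector @ Vt[:k].T

# A query that uses exactly the terms of document 0 lands on that document's vector.
query_vector = toy_matrix[0]
print(fold_in_query(query_vector, Vt, k))
print(doc_vectors[0])
```

Ranking documents for a query then amounts to comparing the folded-in query with each row of `doc_vectors`, for example by cosine similarity.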