## Latent Semantic Indexing

### Exercise 2:
- 2.1 Build the term-doc matrix and apply tf-idf to the counts (after preprocessing).
- 2.2 Decompose your matrix into its singular factors. (SVD)
- 2.3 Provide the MRR over the query “Nigel Farage leading new pro brexit party”.

In [1]:
! pip install nltk



In [2]:
import nltk

In [3]:
nltk.download("twitter_samples")

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [4]:
import text_transformer as tt

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
from nltk.corpus import twitter_samples
tweets = twitter_samples.docs()
docs = [t['text'] for t in tweets]

In [6]:
unique_tweets = list(set(docs)) # use just the unique tweets

In [7]:
unique_tweets_processed = [tt.tokenizer(tweet) for tweet in unique_tweets] # tokenize the tweets

In [8]:
unique_tweets_processed = [tt.normalizer(tweet) for tweet in unique_tweets_processed] # normalize the tweets

In [9]:
unique_tweets_processed = [tt.stemmer(tweet) for tweet in unique_tweets_processed] # apply stemming to the tweets

In [10]:
# take a random sample (here 10%) of the tweets so that processing finishes in a reasonable time
import random
percentage = 10
reduced_tweets = random.sample(unique_tweets_processed, int(len(unique_tweets_processed)*(percentage/100)))

In text mining, a common first step is to build the document-term matrix (DTM) of the corpus of interest. A DTM has one row per document and one column per word; each element holds the count or the weight (usually tf-idf) of that word in that document. Most subsequent analysis operates on the DTM.

https://datawarrior.wordpress.com/2018/01/22/document-term-matrix-text-mining-in-r-and-python/
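To make the layout concrete, here is a minimal sketch of a DTM built from three already-tokenized documents. The toy corpus and the names `toy_docs` and `toy_dtm` are made up for illustration and are not part of the exercise data.

```python
from collections import Counter

import pandas as pd

# Toy corpus of already-tokenized documents (hypothetical data).
toy_docs = [
    ["brexit", "parti", "lead"],
    ["nigel", "farag", "brexit", "brexit"],
    ["new", "parti"],
]

# One {word: count} dict per document; pandas aligns the keys into columns.
toy_dtm = pd.DataFrame([dict(Counter(doc)) for doc in toy_docs])

# Words missing from a document show up as NaN, so replace them with 0.
toy_dtm = toy_dtm.fillna(0).astype(int)

# Sort the columns so the layout is deterministic.
toy_dtm = toy_dtm.sort_index(axis=1)
print(toy_dtm)
```

Each row is a document and each column a word, so `toy_dtm.loc[1, "brexit"]` holds how often "brexit" occurs in the second document.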

In [11]:
# the IDF (inverse document frequency) down-weights words that occur in many documents
# formula: log10(total number of documents / number of documents containing the word)
# in our case each document is a tweet
def computeIDF (tweets):
    import math
    idfDict = {}
    N = len(tweets) # get number of tweets in collection
    
    for tweet in tweets:
        for index in range(0,len(tweets[tweet])):
            word = tweets[tweet][index][0]  # entry [0] is the word, entry [1] is its count
            if word in idfDict:
                idfDict[word] += 1          # word already seen: one more tweet contains it
            else:
                idfDict[word] = 1           # first tweet that contains this word

    for word, val in idfDict.items():
        idfDict[word] = math.log10(N/float(val)) # IDF = log10(N / document frequency)
    
    return idfDict

In [12]:
# formula: (1 + log10(TF)) * IDF
# base-10 logarithm, to match the base used in computeIDF
def computeTFIDF(tfDict, idfDict):
    import math
    tfidf = {}
    for word, val in tfDict.items():
        tfidf[word] = (1 + math.log10(val))*idfDict[word]
        
    return tfidf
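As a worked example of the two formulas, the sketch below computes IDF and the log-scaled tf-idf weight on a tiny hand-made collection. It assumes base-10 logarithms throughout and a `{doc_id: [(word, count), ...]}` input, which is the shape the indexing in `computeIDF` suggests; the data and the `toy_*` names are hypothetical.

```python
import math

# Hypothetical toy input: {doc_id: [(word, count), ...]}
toy_counts = {
    0: [("brexit", 2), ("parti", 1)],
    1: [("brexit", 1), ("nigel", 1)],
    2: [("farag", 3)],
}

n_docs = len(toy_counts)

# Document frequency: in how many documents does each word occur?
doc_freq = {}
for pairs in toy_counts.values():
    for word, _ in pairs:
        doc_freq[word] = doc_freq.get(word, 0) + 1

# IDF = log10(N / document frequency)
toy_idf = {word: math.log10(n_docs / d) for word, d in doc_freq.items()}

# Weight = (1 + log10(tf)) * idf for every word of every document.
toy_tfidf = {
    doc_id: {word: (1 + math.log10(tf)) * toy_idf[word] for word, tf in pairs}
    for doc_id, pairs in toy_counts.items()
}

for doc_id, weights in toy_tfidf.items():
    print(doc_id, weights)
```

"brexit" occurs in two of the three documents, so its IDF is log10(3/2), while "farag" occurs in only one and gets the larger IDF log10(3).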

In [13]:
# create a dictionary using the index of each tweet as the key
tweets = dict(enumerate(reduced_tweets))

In [14]:
# count frequencies for each tweet
tweet_count = {tweet : tt.count_frequencies(tweets[tweet]) for tweet in tweets}

In [15]:
idf = computeIDF(tweet_count)

### Term Doc Matrix

Following the Wikipedia approach: https://en.wikipedia.org/wiki/Document-term_matrix

In [16]:
import pandas as pd
import numpy as np

In [17]:
unique_words = set(tt.flatten(reduced_tweets))

In [18]:
td_matrix = pd.DataFrame(0, index=np.arange(0), columns=unique_words)

In [19]:
td_matrix

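| | hardli | o… | street | grill | slay | privatis | exist | lordbaconfri | manila | trip | ... | 👏👏 | davidhf54 | engag | team | aranitsi | jackfostr | leagu | reynold | huh | uniteright |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|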


In [20]:
# constructing the term-doc matrix with the counts as values
# This might take some time, because the DataFrame grows one cell at a time
for i in tweet_count:
    for j in range(0, len(tweet_count[i])):
        td_matrix.loc[i,tweet_count[i][j][0]] = tweet_count[i][j][1]

In [21]:
td_matrix = td_matrix.fillna(0)
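Filling a DataFrame cell by cell is slow and the result is almost entirely zeros. One possible alternative, sketched here on made-up data, is to collect (row, column) index pairs and build a SciPy sparse matrix in a single call. The names `toy_tweets`, `vocab`, and `sparse_counts` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the processed tweets: a list of token lists.
toy_tweets = [
    ["nigel", "farag", "speak"],
    ["brexit", "brexit", "parti"],
    ["nigel", "brexit"],
]

# Map each word to a fixed column index (sorted for a deterministic layout).
vocab = {
    word: j
    for j, word in enumerate(sorted({w for tokens in toy_tweets for w in tokens}))
}

# Collect one (row, column) pair per token occurrence.
rows, cols = [], []
for i, tokens in enumerate(toy_tweets):
    for word in tokens:
        rows.append(i)
        cols.append(vocab[word])

# Duplicate (row, column) pairs are summed, which yields the term counts.
data = np.ones(len(rows), dtype=np.int64)
sparse_counts = csr_matrix(
    (data, (rows, cols)), shape=(len(toy_tweets), len(vocab))
)

print(sparse_counts.toarray())
```

`TruncatedSVD` from scikit-learn accepts sparse input, so a matrix built this way could be passed to it without converting to a dense array first.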

In [22]:
td_matrix.tail()

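| | hardli | o… | street | grill | slay | privatis | exist | lordbaconfri | manila | trip | ... | 👏👏 | davidhf54 | engag | team | aranitsi | jackfostr | leagu | reynold | huh | uniteright |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2036 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2037 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2038 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2039 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2040 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |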


In [24]:
# constructing the term-doc matrix with tf-idf values (raw term count * IDF)
td_idf_matrix = pd.DataFrame(0, index=np.arange(0), columns=unique_words)
for i in tweet_count:
    for j in range(0, len(tweet_count[i])):
        td_idf_matrix.loc[i,tweet_count[i][j][0]] = tweet_count[i][j][1] * idf[tweet_count[i][j][0]]
td_idf_matrix = td_idf_matrix.fillna(0)

In [25]:
td_idf_matrix.head()

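| | hardli | o… | street | grill | slay | privatis | exist | lordbaconfri | manila | trip | ... | 👏👏 | davidhf54 | engag | team | aranitsi | jackfostr | leagu | reynold | huh | uniteright |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.008813 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |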


### Singular Value Decomposition

In [26]:
import numpy as np

In [27]:
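# note: this factors the raw count matrix; pass td_idf_matrix instead to decompose the tf-idf weights
u, s, vh = np.linalg.svd(td_matrix, full_matrices=True)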

In [28]:
u

array([[ -3.20572170e-02,   3.16178092e-02,  -2.08045993e-02, ...,
          6.35542981e-17,  -4.31919040e-17,  -3.19893851e-16],
       [ -2.02931049e-02,   1.87509221e-02,  -1.44775436e-02, ...,
          8.30961539e-16,   1.85063963e-16,  -9.77894301e-16],
       [ -4.96179963e-03,  -3.53005544e-05,  -1.66062250e-03, ...,
         -3.14354348e-16,  -1.11797634e-16,   2.88156759e-16],
       ..., 
       [ -5.54745734e-03,  -2.98621159e-03,   4.73112718e-06, ...,
          3.96817995e-17,  -7.32920669e-17,  -2.25514052e-17],
       [ -3.88954891e-04,  -2.97824761e-04,  -8.78093935e-05, ...,
          3.98986399e-17,   6.82369742e-18,  -5.98479599e-17],
       [ -7.32328599e-03,   4.91526545e-03,  -4.01170003e-03, ...,
          5.72458747e-17,   0.00000000e+00,   0.00000000e+00]])

In [29]:
s

array([  3.28739744e+01,   2.72807114e+01,   2.59945052e+01, ...,
         1.40191769e-16,   1.25920513e-16,   4.89979653e-17])

In [30]:
vh

array([[ -1.67249912e-03,  -2.43661380e-03,  -1.22169493e-03, ...,
         -7.15150164e-05,  -1.18694214e-05,  -9.65257336e-04],
       [  1.72194024e-04,  -2.68004598e-03,  -6.13581519e-04, ...,
          4.59609600e-05,  -9.77374387e-06,   1.30170797e-03],
       [  9.47866451e-04,   2.44640776e-04,   1.55913820e-03, ...,
         -3.39622809e-05,  -2.06354227e-06,   8.46916522e-04],
       ..., 
       [ -2.83757450e-03,   4.24741952e-03,  -1.10995344e-03, ...,
          7.53224159e-01,  -9.45268368e-04,  -3.14792304e-03],
       [  1.33306622e-04,  -1.07940419e-03,   2.91742815e-04, ...,
         -8.98126431e-04,   6.00364456e-01,  -1.33642757e-03],
       [ -4.47830473e-03,   1.75126562e-03,   1.92186187e-03, ...,
         -1.77539234e-03,  -4.61289883e-04,   8.87913202e-01]])
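To see how the three factors relate to the original matrix, the following sketch runs `np.linalg.svd` on a small hand-made matrix, rebuilds the matrix from the factors, and forms a rank-k approximation by keeping only the largest singular values. The matrix `A` and the choice `k = 2` are assumptions for illustration.

```python
import numpy as np

# Small hypothetical count matrix: 4 documents x 3 terms.
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
])

# full_matrices=False returns the compact factors:
# U is 4x3, s holds 3 singular values (largest first), Vh is 3x3.
U, s, Vh = np.linalg.svd(A, full_matrices=False)

# The factors reproduce A up to floating-point error.
A_rebuilt = U @ np.diag(s) @ Vh

# Keep only the k largest singular values for a rank-k approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

print("singular values:", s)
print("reconstruction error:", np.linalg.norm(A - A_rebuilt))
print("rank-k error:", np.linalg.norm(A - A_k))
```

The near-zero singular values at the end of `s` in the output above carry almost no information, which is why truncating to a small number of dimensions is reasonable.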

### MRR for Query "Nigel Farage leading new pro brexit party"

1. Create a query vector over the same vocabulary as the matrix (e.g. column "Nigel" is 1, column "Sun" is 0, etc.)
2. Reduce the SVD output to 100 dimensions
3. Transform the query vector into this new space
4. Calculate the cosine similarity between each document (in our case, each tweet) and the query
5. Rank the tweets by similarity and calculate the MRR (Mean Reciprocal Rank)
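Steps 1 to 4 can be sketched with plain NumPy on a tiny made-up matrix. Projecting with the top-k right singular vectors (`X @ V_k`) is, as far as I understand it, what `TruncatedSVD.transform` computes; the matrix `X`, the query `q`, and `k = 2` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents x 5 terms.
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

# Step 1: the query as one row over the same vocabulary (terms 0 and 1 present).
q = np.array([[1.0, 1.0, 0.0, 0.0, 0.0]])

# Step 2: keep the k leading right singular vectors.
k = 2
_, _, Vh = np.linalg.svd(X, full_matrices=False)
V_k = Vh[:k, :].T  # shape: (n_terms, k)

# Step 3: project documents and query into the k-dimensional space.
docs_k = X @ V_k
q_k = q @ V_k


def cosine_to_query(docs, query):
    """Cosine similarity between every row of `docs` and a single query row."""
    numerator = docs @ query.ravel()
    denominator = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    return numerator / denominator


# Step 4: score every document against the query and sort best first.
scores = cosine_to_query(docs_k, q_k)
ranking = np.argsort(-scores)

print(scores)
print(ranking)
```

In this toy setup the second document scores as high as the first even though it shares only one term with the query, because both documents load on the same latent dimension.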

In [31]:
# Step 1
query = "Nigel Farage leading new pro brexit party"

In [32]:
query = tt.tokenizer(query)
query = tt.normalizer(query)
query = tt.stemmer(query)

In [33]:
query

['nigel', 'farag', 'lead', 'new', 'pro', 'brexit', 'parti']

In [34]:
query_matrix = pd.DataFrame(0, index=np.arange(1), columns=unique_words)
for elem in query:
    if elem in unique_words:
        query_matrix.loc[0,elem] = 1

In [35]:
# Step 2
from sklearn.decomposition import TruncatedSVD

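svd = TruncatedSVD(n_components=100, n_iter=7, random_state=0) # create a model
# note: fitted on the raw counts; use td_idf_matrix here and in transform to work with tf-idf weights
svd_model = svd.fit(td_matrix)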

In [36]:
svd_matrix = svd_model.transform(td_matrix)

In [37]:
# Step 3
svd_query = svd_model.transform(query_matrix)

In [38]:
# Step 4
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(svd_matrix, svd_query, dense_output=True)

In [39]:
# Step 5
similarity_matrix = pd.DataFrame(similarity_matrix)
similarity_matrix["doc"] = similarity_matrix.index
similarity_matrix = similarity_matrix.sort_values(0, ascending=False)

In [40]:
similarity_matrix.head()

| | 0 | doc |
|---|---|---|
| 1858 | 0.818583 | 1858 |
| 1411 | 0.814598 | 1411 |
| 428 | 0.813816 | 428 |
| 620 | 0.746506 | 620 |
| 1803 | 0.690455 | 1803 |


In [50]:
for index, row in similarity_matrix.head().iterrows():
    print(reduced_tweets[index])

['nigel', 'farag', 'speak', 'askfarag']
['sooo', 'nigel', 'farag', 'interview', 'birmingham']
['nigel', 'farag', 'entertain', '😂']
['isitok', 'paxman', 'give', 'thought', 'nigel', 'farag']
['whi', 'nigel', 'farag', 'stand', 'like']
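The ranking still has to be turned into an MRR score, which requires knowing which tweets count as relevant. The sketch below shows the computation under the assumption that such relevance judgments exist; the document ids are taken from the output above, while the relevance sets and the helper names are purely hypothetical.

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevance):
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)


# Ranked document ids as listed above; the relevance sets are made up.
ranked = [1858, 1411, 428, 620, 1803]

# Single query: if only tweet 428 were relevant, it sits at rank 3.
rr_single = reciprocal_rank(ranked, {428})

# Two queries with the same ranking but different (hypothetical) relevant tweets.
mrr = mean_reciprocal_rank([ranked, ranked], [{1858}, {620}])

print(rr_single, mrr)
```

With a single query, as in this exercise, the MRR reduces to the reciprocal rank of the first relevant tweet.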
