![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# Latent Semantic Analysis with Gensim

In [1]:
import os    
import nltk
import gensim
import pandas as pd
from TextPreprocessor import *
from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore
import json

In [2]:
with open("../rss/content_daily.json") as f:
    content = json.load(f)

text = list(content.values())

In [3]:
data_df = pd.DataFrame({'text': text})

## Data preprocessing

You will need first to preprocess the data through the following stages:
1. tokenization
2. stopword removal
2. POS-based filtering (optional)
3. lemmatization or stemming (optional)
4. addition of bigrams to each document (optional)
5. filtering of infrequent words
6. inspection and filtering of frequent words

You can use NLTK or our in-house `TextPreprocessor.py` file, as explained in Lab 1.

<font color='green'>Please state here which solution you use and list stages you implement.</font>

**ANSWER**

We used:
- tokenization
- removing of stopwords
- low frequency words removal

We finally inspected and filtered the frequent words

In [4]:
# Please write here the preprocessing instructions if you use TextPreprocessor.py

language = 'english'
stop_words = set(stopwords.words(language))
# Extend the list here:
for custom_sw in ['\"', '\'', '\'\'', '`', '``', '\'s']:
    stop_words.add(custom_sw)

processor = TextPreprocessor(
    language = language,
    pos_tags = {wordnet.ADJ, wordnet.NOUN},
    stopwords = stop_words,
)

In [5]:
data_df['processed'] = processor.transform(data_df['text'])

In [6]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [47]:
# Alternatively, please write here the preprocessing instructions if you use NLTK


In [7]:
print(data_df['tokenized'].iloc[120])

['whether', 'rickety', 'ramp', 'air', 'wheelies', 'ultimate', 'buzz', 'thrill-seekers', '1980s', 'young', 'bmx', 'rider', 'jumped', 'trendy', 'new', 'bike', 'friend', 'relation', 'brave', 'lie', 'ground', 'beneath', 'bonfire', 'zero', 'dry', 'land', 'pier', 'sea', 'photo', 'heady', 'day', 'show', 'new', 'book', 'story', 'bmx', 'stand', 'bicycle', 'motocross', 'craze', '1980s', 'bike', 'must-have', 'garden', 'park', 'without', 'rider', 'delighted', 'manner', 'stunt', 'off-road', 'shenanigan', 'whole', 'hog', 'helmet', 'fancy', 'suit', 'pre-health', 'safety', 'day', 'many', 'much', 'fun', 'bother', 'course', 'olympic', 'sport', 'one', 'beth', 'shriever', 'gold', 'kye', 'whyte', 'silver', 'tokyo', 'olympics', 'last', 'year', 'book', 'three', 'co-authors', 'old', 'day', 'thrill', 'spill', 'special', 'place', 'heart', 'antony', 'frascina', 'infant', 'school', 'teacher', 'wigan', "'in", '80', 'bmx', 'like', 'bolt', 'life', 'shiny', 'chrome', 'new', 'aspirational', 'dangerous', 'scary', 'half

Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  <font color='green'> Please justify these choices. What is now the size of your vocabulary?</font> 

**ANSWER**

We removed the words appearing only 2-3 times because they won't be relevant for the further analysis: they do not carry much informations, and are not frequent enough to be associated with other words.
Some small words are removed to, independantly from their frequency (like "u", "two", "mr").

In [8]:
words_to_filter = ["u","two","could","mr","one", "per", "cent","hrt"]

In [9]:
# Please write here all the necessary instructions.  You may use several cells.
vocabulary = data_df['tokenized'].sum()

most_common = [word for word, freq in nltk.FreqDist(vocabulary).items() if freq > 4 and word not in words_to_filter]

data_df['filtered']=data_df['tokenized']

voc_size_filtered=0
voc_size_tokenized=0

for del_words in words_to_filter:
    for data in data_df['filtered']:
        if del_words in data:
            data.remove(del_words)

print(f"Vocabulary length: {len(nltk.FreqDist(vocabulary))}")
print(f"Number of filtered words: {len(most_common)}")

data_df.head()

Vocabulary length: 10204
Number of filtered words: 2716


Unnamed: 0,text,processed,tokenized,filtered
0,A fearless badger is harassing passers-by at a...,fearless badger passer-by beauty spot -- rspca...,"[fearless, badger, passer-by, beauty, spot, --...","[fearless, badger, passer-by, beauty, spot, --..."
1,"Only disabled actors should play Richard III, ...",disabled actor play richard iii head royal sha...,"[disabled, actor, play, richard, iii, head, ro...","[disabled, actor, play, richard, iii, head, ro..."
2,Sir Keir Starmer's Labour have seized control ...,sir keir starmer labour control tory stronghol...,"[sir, keir, starmer, labour, control, tory, st...","[sir, keir, starmer, labour, control, tory, st..."
3,Labour will be 'somewhat disappointed' by thei...,labour 'somewhat disappointed local election r...,"[labour, 'somewhat, disappointed, local, elect...","[labour, 'somewhat, disappointed, local, elect..."
4,Follow MailOnline's live coverage for all the ...,mailonline live coverage update local election...,"[mailonline, live, coverage, update, local, el...","[mailonline, live, coverage, update, local, el..."


In [68]:
print(data_df['filtered'].iloc[10])

['brother', 'itv', 'reality', 'show', "'life", 'marbs', 'pr', 'bos', 'exclusive', 'mayfair', 'club', 'woman', 'private', 'booth', 'court', 'heard', 'old', 'harrovian', 'adam', 'graham', 'brother', 'jeffrey', 'member', 'public', 'private', 'area', 'celebrity', 'haunt', 'pair', 'area', 'maddox', 'club', 'count', 'prince', 'harry', 'among', 'past', 'famous', 'guest', 'friend', 'family', 'uninvited', 'guest', 'party', 'tom', 'heslop', 'told', 'westminster', 'magistrate', 'court', "'the", 'defendant', 'annoyed', 'member', 'public', 'space', 'pair', 'set', 'new', 'luxury', 'concierge', 'business', 'twelve-episode', 'reality', 'series', 'life', 'marbs', 'life', 'glamorous', 'resident', 'living', 'popular', 'spanish', 'resort', '2015.', 'heslop', 'victim', 'pr', 'manager', 'jacob', 'bohbot', 'sat', 'table', 'group', 'woman', 'graham', 'family', 'brother', 'thought', 'woman', 'space', 'security', 'area', 'court', 'heard', 'adam', 'claimed', 'ushered', 'woman', 'kept', 'area', 'bohbot', 'jeffrey

## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result.  To learn what the commands are, please follow the [Topics and Tranformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim. 

<font color="green">Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 

In [69]:
def train_lsa(filtered_texts, num_topics = 10):
    dictionary = corpora.Dictionary(filtered_texts)
    corpus = [dictionary.doc2bow(text) for text in filtered_texts]

    # transform the vectors to tf-idf representation
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    lsa = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)
    corpus_tfidf = lsa[corpus_tfidf]

    return lsa,dictionary,corpus,corpus_tfidf

<font color="green">Please fix a `number_of_topics`, on the lower side of the range mentioned in the course.  Then, execute the cell that performs `train_lsa`.</font>

In [72]:
number_of_topics = 10

In [73]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green">Please display several topics found by LSA using the Gensim `print_topics` function.  Please explain in your own words the meaning of what is displayed.  How do you relate it with what was explained in the course on LSA?</font>

**ANSWER**

`print_topics` returns the `n` (50 in our case) most relevant topics. Each topic consist of a list of words associated with a relevancy score. The higher the score the more relevant is the word according to the topic.

In [74]:
lsa_model.print_topics(number_of_topics)

[(0,
  '-0.156*"abortion" + -0.153*"ukraine" + -0.144*"russian" + -0.135*"cent" + -0.126*"per" + -0.124*"russia" + -0.117*"labour" + -0.115*"mr" + -0.106*"keir" + -0.103*"mp"'),
 (1,
  '-0.566*"abortion" + -0.321*"roe" + -0.206*"court" + -0.195*"supreme" + -0.182*"wade" + -0.162*"v." + -0.161*"draft" + -0.125*"opinion" + -0.124*"justice" + -0.112*"state"'),
 (2,
  '-0.285*"russian" + 0.282*"cent" + -0.277*"ukraine" + 0.251*"per" + 0.237*"rate" + -0.232*"russia" + -0.188*"putin" + -0.173*"ukrainian" + 0.154*"repayment" + -0.124*"war"'),
 (3,
  '-0.266*"keir" + 0.246*"cent" + 0.242*"rate" + -0.230*"sir" + -0.225*"labour" + 0.215*"per" + -0.204*"rayner" + -0.194*"mp" + -0.167*"durham" + 0.163*"repayment"'),
 (4,
  '-0.212*"keir" + -0.175*"sir" + -0.173*"labour" + 0.147*"police" + -0.143*"abortion" + -0.138*"ukraine" + -0.137*"rayner" + -0.135*"russian" + -0.122*"cent" + -0.121*"durham"'),
 (5,
  '0.402*"labor" + 0.349*"albanese" + 0.199*"clare" + 0.197*"mr" + -0.172*"rate" + -0.153*"repay

<font color="green">Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please exemplify its value on two different word pairs, one of which should be obviously more similar than the other, and comment the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [75]:
from sklearn.metrics.pairwise import cosine_similarity

In [76]:
def wordsim(word1, word2, model, dictionary):
    vec_w1=dictionary.doc2bow(word1.lower().split())
    vec_w2=dictionary.doc2bow(word2.lower().split())

    #get words in lsa space
    lsa_w1=model[vec_w1]
    lsa_w2=model[vec_w2]

    return cosine_similarity(lsa_w1,lsa_w2)

In [27]:
# print here the cosine similiarities of several pairs and comment the results.
print("First example")
print(wordsim("indian","detention",lsa_model,dictionary))

#second example
print("Second example")
print(wordsim("fire","australia",lsa_model,dictionary))

First example
[[ 1.00000000e+00 -1.95719097e-02  2.45908862e-03 ... -7.22326848e-05
   2.00836082e-04 -9.84526698e-06]
 [-3.37170374e-02  9.99899887e-01  9.99345484e-01 ...  9.99433852e-01
   9.99424627e-01  9.99431751e-01]
 [-1.11353343e-02  9.99964404e-01  9.99907594e-01 ...  9.99938802e-01
   9.99935744e-01  9.99938110e-01]
 ...
 [ 3.11510785e-05  9.99807842e-01  9.99997053e-01 ...  9.99999995e-01
   9.99999986e-01  9.99999999e-01]
 [-8.27244688e-05  9.99810067e-01  9.99996770e-01 ...  1.00000000e+00
   9.99999960e-01  9.99999997e-01]
 [-1.08781777e-04  9.99810575e-01  9.99996703e-01 ...  9.99999999e-01
   9.99999952e-01  9.99999995e-01]]
Second example
[[ 1.00000000e+00 -9.29780948e-02  8.56240775e-03 ...  1.12656676e-05
  -3.26948146e-05 -9.27223800e-05]
 [-3.78559337e-02  9.98474239e-01  9.98922437e-01 ...  9.99282781e-01
   9.99284444e-01  9.99286713e-01]
 [ 7.73324985e-03  9.94919359e-01  9.99999656e-01 ...  9.99970185e-01
   9.99969845e-01  9.99969377e-01]
 ...
 [ 2.41747284e-

<font color="green">Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word and showing the score too.  You don't have to use the cosine_similarity function here.  Please choose a "query" word and ten other words, apply your function, and comment the results.</font>

In [28]:
from gensim import similarities

In [29]:
def word_ranking(word0, word_list, model, dictionary):
    # transform corpus to LSI space and index it
    vec_w0=dictionary.doc2bow(word0.lower().split())
    vec_w_list=[dictionary.doc2bow(text.lower().split()) for text in word_list]
    index = similarities.MatrixSimilarity(model[vec_w_list])

    #get word in lsa space
    lsa_w0=model[vec_w0]

    sims_w0 = index[lsa_w0]
    sims = sorted(enumerate(sims_w0), key=lambda item: -item[1])
    for doc_position, doc_score in sims:
        print(doc_score, dictionary[doc_position])

In [30]:
# call here the function on your choice of words
word_list=["worker","warn","escalate","industrial","action","company","reject","srinagar","death","come"]
word_ranking("industry",word_list,lsa_model,dictionary)

0.017094733 around
0.01702085 blaze
0.0 burn
-0.0019819743 aedt
-0.029712332 authority
-0.037531205 bureau
-0.06593551 blue
-0.08245504 across
-0.09509794 available
-0.17063233 area


In [31]:
# Please write here your comments on the rankings

**ANSWER**

The ranking seams ok, but it is difficult to really to evaluate its quality as it really depends on the dictionnary used. Also the topic has to match more or less the words of the vocabulary, otherwise the scores would be all 0.0.

Also if the topics are not well balanced, the ranking won't be good.



<font color="green">Please select now a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better?</font>

## End of Lab 5
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).

In [11]:
data_df.head()

Unnamed: 0,text,processed,tokenized,filtered
0,A fearless badger is harassing passers-by at a...,fearless badger passer-by beauty spot -- rspca...,"[fearless, badger, passer-by, beauty, spot, --...","[fearless, badger, passer-by, beauty, spot, --..."
1,"Only disabled actors should play Richard III, ...",disabled actor play richard iii head royal sha...,"[disabled, actor, play, richard, iii, head, ro...","[disabled, actor, play, richard, iii, head, ro..."
2,Sir Keir Starmer's Labour have seized control ...,sir keir starmer labour control tory stronghol...,"[sir, keir, starmer, labour, control, tory, st...","[sir, keir, starmer, labour, control, tory, st..."
3,Labour will be 'somewhat disappointed' by thei...,labour 'somewhat disappointed local election r...,"[labour, 'somewhat, disappointed, local, elect...","[labour, 'somewhat, disappointed, local, elect..."
4,Follow MailOnline's live coverage for all the ...,mailonline live coverage update local election...,"[mailonline, live, coverage, update, local, el...","[mailonline, live, coverage, update, local, el..."


In [14]:
print(data_df['text'][0])
print(data_df['processed'][0])
print(data_df['tokenized'][0])
print(data_df['filtered'][0])


A fearless badger is harassing passers-by at a renowned beauty spot -- leading the RSPCA to warn the public about its behaviour. Dog walkers, joggers and families out enjoying the countryside have all fallen foul of the black and white menace. The badger has been recorded prowling during the day in Cannock Chase, Staffordshire, and following two barking dogs who were being pulled away by their owner. Despite the mammals' typical nocturnal habits, this particular individual has been seen taking leisurely strolls in broad daylight. It has run up towards barking dogs without flinching and chased a French bulldog, which sought refuge behind its owner. The RSPCA has warned the public to keep their distance from the badger and said its actions are not normal for the normally shy species. It is unusually approachable towards humans and followed one female jogger over stepping stones of a nearby stream, before getting bored and turning back. Ben Clay, 39, who filmed the badger while out walkin