# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *Your group letter.*

**Names:**

* *Guillem Pruñonosa Soler*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [115]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz, linalg

from utils import load_json, load_pkl
# imported libraries
import nltk
from collections import defaultdict
import matplotlib.pyplot as plt

nltk.download('wordnet') # lemmatization

%matplotlib inline
plt.style.use("ggplot")

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

[nltk_data] Downloading package wordnet to /home/prunonos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Exercise 4.1: Pre-processing

In [3]:
len(courses)

854

In [4]:
example = courses[2].get('description')
example

"This course provides students with the framework and decision tools needed for taking financial decisions and evaluating investment opportunities in a global economy. We use an integrated model of exchange rate and output determination to analyze the effects of monetary and fiscal policies. Content National Income Accounting and the Balance of Payments Exchange Rates and the Foreign Exchange Market: An Asset Approach Money, Interest Rates and Exchange Rates; Price Level and the Exchange Rate in the Long Run Output and the Exchange Rate in the Short Run with Fixed and Flexible Exchange Rates Fixed Exchange Rates and the Dynamics of Currency Crises Financial Crises and the Choice of Exchange Rate Regime Currency Unions and the European Experience The Global Capital Market Keywords International finance, open-economy macroeconomics Learning Outcomes By the end of the course, the student must be able to: Define the determinants of exchange rates in the short runIllustrate the role of mone

In [5]:
list(stopwords)[:5]

['', 'actually', 'was', 'better', 'how']

In [6]:
nltk.download('punkt')
example_tokenized = nltk.tokenize.word_tokenize(example)

[nltk_data] Downloading package punkt to /home/prunonos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
# example_lowercase = [word.lower() for word in example_tokenized if (word.isalpha() and word not in stopwords)]

Let's try first with **stemming**.

In [8]:
ps = nltk.stem.PorterStemmer()
example_stemmed = [ps.stem(word.lower()) for word in example_tokenized if (word.isalpha() and word.lower() not in stopwords)]
len(example_stemmed)

337

Now we try with **lemmatization**.

In [9]:
lm = nltk.stem.WordNetLemmatizer()
example_lemma = [lm.lemmatize(word.lower()) for word in example_tokenized if (word.isalpha() and word.lower() not in stopwords)]
len(example_lemma)

337

Here we print every tenth token side by side to show the difference between stemming and lemmatization.

In [10]:
print('stemming \t lemmatization')
print('------------------------------')
for i,(s,l) in enumerate(zip(example_stemmed,example_lemma)):
    if i%10==0: print(s,'    \t',l)

stemming 	 lemmatization
------------------------------
student     	 student
opportun     	 opportunity
effect     	 effect
exchang     	 exchange
exchang     	 exchange
rate     	 rate
dynam     	 dynamic
union     	 union
learn     	 learning
role     	 role
integr     	 integrated
output     	 output
impact     	 impact
account     	 account
runcompar     	 runcompare
regim     	 regime
critiqu     	 critique
class     	 class
understand     	 understand
set     	 set
appli     	 apply
present     	 present
case     	 case
set     	 set
solut     	 solution
question     	 question
complet     	 complete
case     	 case
typic     	 typically
group     	 group
case     	 case
contribut     	 contribution
class     	 class
close     	 closed


Once we have chosen lemmatization, we build n-grams to better capture the meaning of the text.
In this case we use n-grams with n=3, but *n* is a hyperparameter that we want to tune.

In [11]:
ngrams = nltk.ngrams(example_lemma, 3)
example_ngram3 = []
for n3 in ngrams:
    n3 = " ".join(n3) # we join each 3-gram into a single string. 
    example_ngram3.append(n3)
    
print(len(example_ngram3))
example_ngram3[:20]

335


['student framework decision',
 'framework decision tool',
 'decision tool needed',
 'tool needed taking',
 'needed taking financial',
 'taking financial decision',
 'financial decision evaluating',
 'decision evaluating investment',
 'evaluating investment opportunity',
 'investment opportunity global',
 'opportunity global economy',
 'global economy integrated',
 'economy integrated model',
 'integrated model exchange',
 'model exchange rate',
 'exchange rate output',
 'rate output determination',
 'output determination analyze',
 'determination analyze effect',
 'analyze effect monetary']
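
The sliding window behind these n-grams can be sketched without NLTK. This is a minimal illustration, not the notebook's implementation: `make_ngrams` is a hypothetical helper name, and the token list is a short excerpt of the lemmatized example above. It also shows why 337 tokens give 335 3-grams: a sequence of k tokens yields k - n + 1 n-grams.

```python
from typing import List


def make_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the space-joined n-grams of `tokens` (sliding window of size n)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A sequence of k tokens yields max(k - n + 1, 0) n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["student", "framework", "decision", "tool", "needed"]
trigrams = make_ngrams(tokens, 3)
print(trigrams)
# ['student framework decision', 'framework decision tool', 'decision tool needed']
```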

We check which words are the most and the least common in our corpus. We will train one model that keeps them and another that removes them, to see in which case the model performs better.

In [12]:
def getCountWords(dict_,corpus):
    for word in corpus:
        dict_[word] += 1
    return dict_

In [13]:
word_count = defaultdict(int)
for course in courses: 
    course_token = nltk.tokenize.word_tokenize(course.get('description'))
#     course_lemma = [lm.lemmatize(word.lower()) for word in course_token if (word.isalpha() and word.lower() not in stopwords)]
    for word in course_token:
        word = word.lower()
        if (word.isalpha() and word not in stopwords):
#             word_lemma = lm.lemmatize(word)
            word_count[word] += 1

In [14]:
# word_count
word_count_values             =   np.array(list(word_count.values()))
word_count_keys               =   np.array(list(word_count.keys()))

idxs_sorted                   =   np.argsort(word_count_values)[::-1]
sorted_word_count_values      =   word_count_values[idxs_sorted]
sorted_word_count_keys        =   word_count_keys[idxs_sorted]

These are the most and the least common words in our corpus:

In [15]:
for i,(w,c) in enumerate(zip(sorted_word_count_keys, sorted_word_count_values)):
    if i==0: print('Most common words in our corpus:\n')
    if i < 10:
        print(w,c)

    if i==11: print('\n-------------------------------\n\nLess common words in our corpus:\n')
    if i > (len(idxs_sorted)-10):
        print(w,c)

Most common words in our corpus:

methods 1617
learning 1470
student 1210
content 904
students 833
courses 758
design 751
systems 722
analysis 704
end 666

-------------------------------

Less common words in our corpus:

pin 1
novissima 1
productslist 1
involvedpresent 1
nanosystemsapply 1
nanosystemsdescribe 1
tsvs 1
hermetic 1
cite 1


In [16]:
print(f'We have a total of {len(word_count_values)} words in our corpus')

We have a total of 13982 words in our corpus


We filter to remove the words that appear only once in the whole corpus, and also the 20 most common words.

In [17]:
mask = (sorted_word_count_values <= 1) # we filter the sorted word count values array
mask[:20] = True # we also remove the top 20 most common words of our corpus
print(f'Number of words we are going to remove: {sum(mask)}')
print(f'We remove ~{int(sum(mask)/len(word_count_values)*100)}% of all the words.')

Number of words we are going to remove: 7016
We remove ~50% of all the words.


Now we proceed to remove them.

In [18]:
word_count_values             =   sorted_word_count_values[mask]
commonwords                   =   sorted_word_count_keys[mask]
# idxs_sorted                   =   np.argsort(word_count_values)[::-1]
# sorted_words_frequency        =   word_count.keys()[idxs_sorted]

In [19]:
commonwords

array(['methods', 'learning', 'student', ..., 'tsvs', 'hermetic', 'cite'],
      dtype='<U66')
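
The same filter (drop the most frequent words and the words that occur only once) can also be sketched with `collections.Counter`. This is an illustration on a toy token list, not the code used in this notebook; `words_to_drop`, `top_k` and `min_count` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, Set


def words_to_drop(tokens: Iterable[str], top_k: int = 20, min_count: int = 2) -> Set[str]:
    """Return the words to discard: the `top_k` most frequent ones and
    every word that occurs fewer than `min_count` times."""
    counts = Counter(tokens)
    most_common = {word for word, _ in counts.most_common(top_k)}
    rare = {word for word, c in counts.items() if c < min_count}
    return most_common | rare


corpus = ["data", "data", "data", "model", "model", "graph", "graph", "hermetic"]
drop = words_to_drop(corpus, top_k=1)
kept = [w for w in corpus if w not in drop]
print(sorted(drop))  # ['data', 'hermetic']
print(kept)          # ['model', 'model', 'graph', 'graph']
```

Returning a `set` keeps the later membership test (`w not in drop`) cheap, which may matter when it runs once per token of the corpus.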

We now combine all the previous steps to compute the n-gram bag-of-words of each course description: we tokenize, drop the stopwords as well as the most and least common words of our corpus, lemmatize, and build the n-grams (the cell below uses n = 2).

In [20]:
# Returns True if `w` is alphabetic and is neither a stopword nor one of the most/least common words (`commonwords`).
def filtering_words(w):
    w = w.lower()
    return (w.isalpha() and (w not in stopwords) and (w not in commonwords))

In [28]:
n = 2       # ngrams
lm = nltk.stem.WordNetLemmatizer()
nltk.download('wordnet')

words_doc = []
words = []
# word_count = defaultdict(int)

for course in courses:
    desc_proc = []
    
    desc_tokenized = nltk.tokenize.word_tokenize(course.get('description'))
    desc_lemma     = [lm.lemmatize(word.lower()) for word in desc_tokenized if filtering_words(word)]
        
    ngrams = map(lambda x: ' '.join(x) , nltk.ngrams(desc_lemma, n))
    for n3 in ngrams:
        desc_proc.append(n3)
        words.append(n3)
        
    desc_proc = sorted(desc_proc)
                 
    words_doc.append(desc_proc)

words = np.unique(words)

[nltk_data] Downloading package wordnet to /home/prunonos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In the end, we decided not to remove the most and least common words of our corpus, because we think they can carry important information.

The following output is the 2-gram representation of the course *Internet Analytics*.


In [29]:
for i,course in enumerate(courses):
    if course['courseId'] == 'COM-308': words_ix = words_doc[i]
        
words_ix

['acquired lecture',
 'activity lecture',
 'ad auction',
 'ad auction',
 'advertisement class',
 'algebra algorithm',
 'algebra markov',
 'algorithm data',
 'algorithm statistic',
 'analytics application',
 'analytics collection',
 'application inspired',
 'application social',
 'auction provide',
 'auction required',
 'balance foundational',
 'based hadoop',
 'based number',
 'cathedra homework',
 'chain java',
 'class balance',
 'class explores',
 'class lab',
 'cloud service',
 'clustering community',
 'clustering community',
 'collection modeling',
 'combination theoretical',
 'communication recommended',
 'community detection',
 'community detection',
 'computing ad',
 'computing online',
 'concrete problem',
 'coverage main',
 'current practice',
 'data mining',
 'data mining',
 'data mining',
 'data online',
 'data online',
 'data structure',
 'datasets class',
 'datasets real',
 'decade class',
 'dedicated infrastructure',
 'designed explore',
 'detection model',
 'detection to

We have now created the *2-gram bag-of-words* for our corpus of EPFL course descriptions.

In [37]:
dict_word_index = {word: i for i,word in enumerate(words)}

In [67]:
dict_index_word = {i:word for i,word in enumerate(words)}

In [35]:
words_doc_idx = [[dict_word_index[w] for w in doc] for doc in words_doc]

In [42]:
# Scratch cell: `aux2` is not defined in any of the cells shown here.
aux = np.zeros(69002)
aux[:len(aux2)] = aux2

In [45]:
len(aux)

69002

In [50]:
bagofwords = []

for doc in words_doc_idx:
    countwords = np.zeros(len(words))
    countwords_bincount = np.bincount(doc)
    countwords[:len(countwords_bincount)] = countwords_bincount
    bagofwords_doc = {word:count for word,count in zip(words,countwords)}
    bagofwords.append(bagofwords_doc)
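
The cell above stores a count for every vocabulary term in every document, zeros included. As a possible lighter alternative (a sketch only, with a different storage layout from the one saved by this notebook), each document can keep just the terms it actually contains. `bag_of_words` is a hypothetical helper name and the two toy documents are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def bag_of_words(doc_terms: List[str]) -> Dict[str, int]:
    """Count each term of one document, keeping only terms that occur."""
    return dict(Counter(doc_terms))


docs = [
    ["data mining", "data mining", "community detection"],
    ["markov chain"],
]
bags = [bag_of_words(doc) for doc in docs]
print(bags[0])  # {'data mining': 2, 'community detection': 1}
print(bags[1])  # {'markov chain': 1}
```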

In [66]:
from utils import save_json
save_json(bagofwords, 'bagofwords.txt')

In [149]:
save_json([dict_index_word], 'dict_index_word.txt')

## Exercise 4.2: Term-document matrix

In [21]:
# compute the IDF for every word -> same value across all documents (IDF always > 0)
# compute the TF -> if TF == 0 -> stop computing
# if TF > 0:
    # compute the TF-IDF of each term-document pair
    # (only add values > 0 to the sparse matrix)

In [22]:
words.shape

(69002,)

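First we compute the quantities needed for the TF-IDF weights: the maximum term count in each document (used to normalize the **TF**) and the number of documents that contain each term (used for the **IDF**).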

In [33]:
# we need the max frequency of a word in the doc

max_fij = []
for doc in words_doc:
    max_count = np.max(np.unique(doc, return_counts=True)[1])
    max_fij.append(max_count)

In [57]:
max_fij[0]

2

In [110]:
# ratio_idf[i] = number of documents that contain term i (document frequency)
ratio_idf = np.zeros(len(words))

for j,doc in enumerate(words_doc_idx):
    if j%100==0: print(j)
    wordcount = np.bincount(doc)
    for i,count in enumerate(wordcount):
        if count > 0:
            ratio_idf[i] += 1

0
100
200
300
400
500
600
700
800


In [None]:
values = []
pos_tf = []

for j,doc in enumerate(words_doc_idx):
    if j%100==0: print(j)  # progress indicator
    wordcount = np.bincount(doc)
    for i,count in enumerate(wordcount):
        if count > 0:
            idf = -np.log2((ratio_idf[i])/len(words_doc_idx))
            values.append(idf * (count / max_fij[j]))
            pos_tf.append((i,j))
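
The weighting computed in the cell above is the term count divided by the largest term count of the document, multiplied by `-log2(df / N)`, where `df` is the number of documents containing the term and `N` the number of documents. Below is a minimal pure-Python sketch of the same formula on a made-up toy corpus; `tf_idf_entries` is a hypothetical name, and the dictionary output is chosen for readability rather than matching the notebook's data structures.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tf_idf_entries(docs: List[List[str]]) -> Dict[Tuple[str, int], float]:
    """Map (term, document index) to its TF-IDF weight, for terms that occur."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    weights = {}
    for j, c in enumerate(counts):
        if not c:
            continue  # empty document: nothing to add
        max_count = max(c.values())
        for term, count in c.items():
            idf = -math.log2(doc_freq[term] / n_docs)
            weights[(term, j)] = (count / max_count) * idf
    return weights


docs = [["a", "a", "b"], ["a", "c"], ["c", "c", "c", "d"], ["a"]]
w = tf_idf_entries(docs)
print(w[("b", 0)])  # 1.0  (TF = 1/2, IDF = -log2(1/4) = 2)
print(w[("c", 2)])  # 1.0  (TF = 3/3, IDF = -log2(2/4) = 1)
```

With this formula a term that occurs in every document gets an IDF of 0, so its weight is 0 even though its count is positive.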

In [123]:
pos_tf = np.array(pos_tf)
pos_tf

array([[  528,     0],
       [  763,     0],
       [ 2479,     0],
       ...,
       [65496,   853],
       [68051,   853],
       [68778,   853]])

Now we build the sparse **TF-IDF** term-document matrix, with one row per term and one column per document.

In [124]:
m, n = len(words), len(words_doc)

tf_idf = csr_matrix((values, (pos_tf[:,0],pos_tf[:,1])), shape=(m, n))
print(tf_idf)

  (0, 295)	1.630625951779867
  (0, 753)	2.717709919633111
  (0, 818)	4.076564879449667
  (1, 430)	3.24603075320683
  (2, 97)	9.73809225962049
  (3, 562)	2.4345230649051226
  (4, 566)	1.623015376603415
  (5, 335)	9.73809225962049
  (6, 143)	2.9126974198734965
  (6, 146)	2.9126974198734965
  (7, 387)	1.9476184519240982
  (8, 326)	3.24603075320683
  (9, 61)	2.1845230649051226
  (9, 404)	2.9126974198734965
  (10, 404)	3.24603075320683
  (11, 61)	2.4345230649051226
  (12, 379)	1.9476184519240982
  (13, 683)	2.9126974198734965
  (13, 851)	2.9126974198734965
  (14, 113)	2.9126974198734965
  (14, 507)	2.9126974198734965
  (15, 384)	1.9476184519240982
  (16, 696)	4.869046129810245
  (17, 825)	1.9476184519240982
  (18, 118)	4.369046129810245
  :	:
  (68987, 807)	4.869046129810245
  (68988, 197)	2.9126974198734965
  (68988, 348)	2.9126974198734965
  (68989, 42)	4.076564879449667
  (68989, 535)	4.076564879449667
  (68989, 738)	4.076564879449667
  (68990, 362)	2.9126974198734965
  (68990, 393)	2.91

In [125]:
tf_idf.shape

# should not have 0 values

(69002, 854)
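
As a small illustration of the constructor used above, the sketch below builds a `csr_matrix` from a list of values and their (row, column) positions. The numbers are made up; only the call pattern mirrors the notebook.

```python
from scipy.sparse import csr_matrix

# Non-zero weights and their (term index, document index) positions.
values = [1.5, 0.5, 2.0]
rows = [0, 2, 2]   # term indices
cols = [1, 0, 1]   # document indices

# 3 terms x 2 documents; every cell not listed above is an implicit zero.
matrix = csr_matrix((values, (rows, cols)), shape=(3, 2))

print(matrix.shape)  # (3, 2)
print(matrix.nnz)    # 3
print(matrix.toarray())
```

Only the three listed entries are stored, which is why this format suits a 69002 x 854 matrix in which most cells are zero.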

In [127]:
save_npz('tf_idf.npz',tf_idf)

## Exercise 4.3: Document similarity search

In [70]:
tf_idf = load_npz('tf_idf.npz')

In [134]:
def sim(doc1,doc2):
    """Cosine similarity between two sparse column vectors (documents)."""
    dot_doc = doc1.T @ doc2
    norm1 = linalg.norm(doc1)
    norm2 = linalg.norm(doc2)
    return (dot_doc / (norm1 * norm2)).toarray()[0,0]
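
`sim` is the cosine similarity: the dot product of the two document vectors divided by the product of their norms. The sketch below shows the same formula on sparse vectors stored as plain dictionaries. It is an illustration only: `cosine_similarity` is a hypothetical name, the toy weights are made up, and returning 0.0 for an all-zero vector is a convention chosen here (the `sim` function above would divide by zero in that case).

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


doc_a = {"markov chain": 3.0, "random walk": 4.0}
doc_b = {"markov chain": 3.0, "random walk": 4.0}
doc_c = {"social medium": 2.0}
print(cosine_similarity(doc_a, doc_b))  # 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

Two documents that share no term always score 0, which matches the `0.0` values that appear in the comparisons further down.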

In [77]:
dict_word_index['markov chain']

35957

In [98]:
print(tf_idf[35957])

  (0, 43)	2.3102457791876283
  (0, 80)	6.930737337562886
  (0, 245)	6.930737337562886
  (0, 398)	6.930737337562886
  (0, 412)	3.465368668781443
  (0, 417)	1.7326843343907214
  (0, 555)	1.7326843343907214


In [114]:
top5_mc = np.argsort(tf_idf[35957].toarray()[0])[-5:][::-1]
top5_mc

array([398, 245,  80, 412,  43])
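
Picking the documents with the highest weight for a term is a top-k selection. A minimal sketch with `heapq.nlargest` follows; `top_k_documents` is a hypothetical name and the scores are toy values. When several documents share the same weight, as in the row printed above, this approach may order the ties differently from the `argsort` call.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_documents(scores: Dict[int, float], k: int = 5) -> List[Tuple[int, float]]:
    """Return the k (document index, score) pairs with the highest scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy weights of one term, keyed by document index.
scores = {3: 0.2, 7: 1.5, 9: 0.9}
print(top_k_documents(scores, k=2))  # [(7, 1.5), (9, 0.9)]
```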

In [139]:
print('The similarities between the top 5 Markov-Chain courses are:')
print()

for i in range(len(top5_mc)):
    doc1 = tf_idf[:,top5_mc[i]] 
    course1 = courses[top5_mc[i]]['name']
    
    for j in range(i+1,len(top5_mc)):
        doc2 = tf_idf[:,top5_mc[j]] 
        course2 = courses[top5_mc[j]]['name']
        
        sim_docs = sim(doc1,doc2)
#         print(type(sim_docs))
        print(f'"{course1}" and "{course2}" is {sim_docs}.')
        print()

The similarities between the top 5 Markov-Chain courses are:

"Applied probability & stochastic processes" and "Markov chains and algorithmic applications" is 0.0897082588764901.

"Applied probability & stochastic processes" and "Applied stochastic processes" is 0.10388701735460086.

"Applied probability & stochastic processes" and "Optimization and simulation" is 0.029896303540190697.

"Applied probability & stochastic processes" and "Internet analytics" is 0.017162629730451507.

"Markov chains and algorithmic applications" and "Applied stochastic processes" is 0.11561174226433334.

"Markov chains and algorithmic applications" and "Optimization and simulation" is 0.04614616705760811.

"Markov chains and algorithmic applications" and "Internet analytics" is 0.02431631585415846.

"Applied stochastic processes" and "Optimization and simulation" is 0.050484747799964265.

"Applied stochastic processes" and "Internet analytics" is 0.021609791757521223.

"Optimization and simulation" and "Int

In [151]:
words

array(['ab initio', 'ab power', 'ab variable', ..., 'énergétique master',
       'énergétique notion', 'être pris'], dtype='<U75')

In [165]:
facebook_idxs = [dict_word_index[w] for w in words 
                   if 'facebook' in w]

In [166]:
facebook_idxs

[22303, 22304, 22305, 54060, 57412, 66176]

In [167]:
print(tf_idf[facebook_idxs])

  (0, 798)	0.749084019970807
  (1, 798)	0.749084019970807
  (2, 798)	0.749084019970807
  (3, 798)	0.749084019970807
  (4, 798)	0.749084019970807
  (5, 798)	0.749084019970807


In [175]:
print('The similarities between the top 5 Markov-Chain courses and the only course that contains Facebook are:')
print()

doc_face = tf_idf[:,798] 
course_face = courses[798]['name']

for i in range(len(top5_mc)):
    doc1 = tf_idf[:,top5_mc[i]] 
    course1 = courses[top5_mc[i]]['name']
    sim_docs = sim(doc1,doc_face)

    print(f'"{course1}" and "{course_face}" is {sim_docs}.')
    print()

The similarities between the top 5 Markov-Chain courses and the only course that contains Facebook are:

"Applied probability & stochastic processes" and "Computational Social Media" is 0.0.

"Markov chains and algorithmic applications" and "Computational Social Media" is 0.008236951339467167.

"Applied stochastic processes" and "Computational Social Media" is 0.0.

"Optimization and simulation" and "Computational Social Media" is 0.0.

"Internet analytics" and "Computational Social Media" is 0.040662506491494395.

