# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Keijiro Tajima*
* *Mahammad Shirinov*
* *Stephen Zhao*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [54]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_pkl
import string
import re
from nltk.stem import PorterStemmer as PS
import math
from numpy import linalg as LA

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
courses_original = load_json('data/courses.txt')

## Exercise 4.1: Pre-processing

In [55]:
# Check the number of courses
print("The number of courses is {}.".format(len(courses)))

The number of courses is 854.


In [56]:
# Check the first course's description without pre-processing
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical properties modelling for composite mater

### First, remove the stopwords and punctuation.

In [57]:
# Function that removes every word in words_lst from all course descriptions
def remove(words_lst):
    for course in courses:
        desc_lst = course['description'].split()
        for word in words_lst:
            if word in desc_lst:
                desc_lst = list(filter(lambda x: x != word, desc_lst))
        course['description'] = ' '.join(desc_lst)

In [58]:
# Function that capitalizes all the words in a words_list
def capitalize_words(words_lst):
    a = set([])
    for i in words_lst:
        a.add(i.capitalize())
    return a

In [59]:
capitalized_stopwords = capitalize_words(list(stopwords))

In [60]:
# Remove the stopwords and capitalized_stopwords
remove(list(stopwords))
remove(list(capitalized_stopwords))
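
The `remove` function above filters each description once per stopword, and it is called twice to cover capitalised variants. A minimal sketch of a single-pass alternative is shown below. It assumes the stopwords are stored in lowercase and drops any token whose lowercase form is a stopword, which is slightly broader than the capitalised-variant approach (it also drops all-caps forms). The name `remove_stopwords` and the toy data are illustrative only.

```python
def remove_stopwords(description, stopword_set):
    """Drop every whitespace-separated token whose lowercase form is a stopword."""
    kept = [tok for tok in description.split() if tok.lower() not in stopword_set]
    return ' '.join(kept)


# Toy data (hypothetical, not taken from data/stopwords.pkl)
toy_stopwords = {'the', 'of', 'are', 'and'}
toy_description = 'The latest developments of organic composites are discussed'

cleaned = remove_stopwords(toy_description, toy_stopwords)
print(cleaned)
```

Using a set makes each membership test constant-time on average, so the cost grows with the length of the description rather than with the size of the stopword list.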

In [61]:
# Remove the punctuation
punctuation_table = str.maketrans({key: None for key in string.punctuation})

for course in courses:
    course['description'] = course['description'].translate(punctuation_table)
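
One side effect of deleting punctuation outright is that hyphenated words are fused into a single token, which is presumably how terms such as 'largescal' and 'handson' arise later in this notebook. The sketch below contrasts deletion with replacing punctuation by a space. The second variant is an alternative shown for comparison, not what the notebook does, and the sample sentence is made up.

```python
import string

# Variant used in the notebook: delete punctuation characters
delete_table = str.maketrans('', '', string.punctuation)

# Alternative (assumption): map every punctuation character to a space
space_table = str.maketrans({ch: ' ' for ch in string.punctuation})

text = 'hands-on labs, large-scale data.'

deleted = text.translate(delete_table)
# split() + join() collapses the extra spaces left by the replacement
spaced = ' '.join(text.translate(space_table).split())

print(deleted)
print(spaced)
```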

In [62]:
# Check the first course's description
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
latest developments processing generations organic composites discussed Nanocomposites adaptive composites biocomposites presented Product development cost analysis study markets practiced team work Content Basics composite materialsConstituentsProcessing compositesDesign composite structures Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites ApplicationsDriving forces marketsCost analysisAerospaceAutomotiveSport Keywords Composites  Applications  Nanocomposites  Biocomposites  Adaptive composites  Design  Cost Learning Prerequisites Required courses Notion polymers Recommended courses Polymer Composites Learning Outcomes end course student to Propose suitable design production performance criteria production composite partApply basic equations process mechanical properties modelling composite materialsDiscuss main types composite applications Transversal skills work methodology taskUse general domain specific IT resources toolsCommunicate effe

### Check the frequent and infrequent words

In [63]:
# Make frequent words list and infrequent words list
frequency_dict = {}
for course in courses:
    for word in course['description'].split():
            if word in frequency_dict:
                frequency_dict[word] += 1
            else:
                frequency_dict[word] = 1

In [64]:
frequency_sorted = sorted(frequency_dict.items(), key=lambda x:x[1], reverse = True)
infrequency_sorted = sorted(frequency_dict.items(), key=lambda x:x[1])

In [65]:
# Top 10 frequent words
frequency_sorted[:10]

[('methods', 1556),
 ('Learning', 1237),
 ('student', 1177),
 ('Content', 835),
 ('course', 762),
 ('courses', 755),
 ('end', 661),
 ('students', 652),
 ('systems', 621),
 ('to', 602)]

In [66]:
# Top 10 infrequent words
infrequency_sorted[:10]

[('biocomposites', 1),
 ('materialsConstituentsProcessing', 1),
 ('developmentNanocomposites', 1),
 ('Textile', 1),
 ('compositesBiocompositesAdaptive', 1),
 ('ApplicationsDriving', 1),
 ('marketsCost', 1),
 ('analysisAerospaceAutomotiveSport', 1),
 ('Biocomposites', 1),
 ('partApply', 1)]

### Frequent and infrequent words
- Frequent words, like 'methods' and 'Learning', are very general and do not help distinguish one course from another.
- Infrequent words, like 'biocomposites' and 'materialsConstituentsProcessing', are too specialised to be useful for characterising courses.

Let's remove these words. There are only a few very frequent words, but they dominate the counts, so we remove the 0.1% most frequent words and the 1% least frequent words.

In [67]:
# Function that extracts a certain percentage of words from a words_lst.
def extract(words_lst, percent):
    removed = 0
    taken_lst = []
    while removed <= len(words_lst)*percent*0.01:
        taken_lst.append(words_lst[removed][0])
        removed += 1
    return taken_lst

In [68]:
# Extract 0.1% of frequent words and 1% of infrequent words.
frequent_words = extract(frequency_sorted, 0.1)
infrequent_words = extract(infrequency_sorted, 1)
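
The counting and cut-off steps can also be sketched with `collections.Counter`. The helper `cutoff_words` below is hypothetical and runs on toy data. Unlike `extract`, whose `<=` loop condition always returns at least one word, it takes exactly `floor(n * percent / 100)` words.

```python
from collections import Counter


def cutoff_words(counts, percent, most_frequent=True):
    """Return floor(percent% of the vocabulary) words from one end of the ranking."""
    ranked = [word for word, _ in counts.most_common()]  # most frequent first
    k = int(len(ranked) * percent / 100)
    if k == 0:
        return []
    return ranked[:k] if most_frequent else ranked[-k:]


# Toy descriptions (hypothetical)
toy_descriptions = [
    'methods data methods graph',
    'methods data markov',
    'methods chain',
    'methods data',
]
counts = Counter(word for desc in toy_descriptions for word in desc.split())

frequent = cutoff_words(counts, 20)                         # 20% of 5 distinct words
infrequent = cutoff_words(counts, 40, most_frequent=False)  # 40% of 5 distinct words

print(frequent)
print(infrequent)
```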

In [69]:
remove(frequent_words)

In [70]:
remove(infrequent_words)

In [71]:
# Check the first course's description
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
latest developments processing generations organic composites discussed Nanocomposites adaptive composites presented Product development cost study markets practiced team Basics composite compositesDesign composite structures Current composites forces Composites Applications Nanocomposites Adaptive composites Design Cost Required Notion polymers Recommended Polymer Composites Propose suitable production performance criteria production composite basic equations process mechanical properties modelling composite main types composite applications Transversal methodology taskUse general domain specific IT resources toolsCommunicate effectively professionals disciplinesEvaluate ones performance team receive respond appropriately feedback cathedra invited speakers Group sessions exercises Expected Attendance lectures Design composite part bibliography search Written exam report oral presentation class


### Stem the words with nltk library.

In [72]:
ps = PS()

for course in courses:
    # Stem every token of the description
    stemmed = [ps.stem(word) for word in course['description'].split()]
    course['description'] = ' '.join(stemmed)

In [73]:
# Check the first course's description
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
latest develop process gener organ composit discuss nanocomposit adapt composit present product develop cost studi market practic team basic composit compositesdesign composit structur current composit forc composit applic nanocomposit adapt composit design cost requir notion polym recommend polym composit propos suitabl product perform criteria product composit basic equat process mechan properti model composit main type composit applic transvers methodolog taskus gener domain specif IT resourc toolscommun effect profession disciplinesevalu one perform team receiv respond appropri feedback cathedra invit speaker group session exercis expect attend lectur design composit part bibliographi search written exam report oral present class


### 1. Explain which ones you implemented and why.

- We removed stopwords, punctuation, the most frequent words and the least frequent words, and we stemmed the remaining words.
- Stopwords and punctuation carry no information that helps characterise a course.
- As mentioned in the "Frequent and infrequent words" part, frequent words are too general, and infrequent words like 'biocomposites' and 'materialsConstituentsProcessing' are too specialised to be useful for characterising courses. Therefore we removed them.
- Stemming and lemmatisation serve a similar purpose (mapping inflected forms of a word to a common form), so we applied only one of them and chose stemming for this project.

### 2. Print the terms in the pre-processed description of the IX class in alphabetical order.

In [74]:
# Print the sorted terms of the IX class (COM-308)
for course in courses:
    if course['courseId'] == 'COM-308':
        ix_desc = sorted(course['description'].split())
        print(ix_desc)

['20', '30', '50', 'acquir', 'ad', 'ad', 'advertis', 'algebra', 'algebra', 'algorithm', 'algorithm', 'analyt', 'analyt', 'apach', 'applic', 'applic', 'auction', 'auction', 'balanc', 'base', 'base', 'basic', 'basic', 'basic', 'cathedra', 'chain', 'class', 'class', 'class', 'cloud', 'cluster', 'cluster', 'collect', 'com300', 'combin', 'commun', 'commun', 'commun', 'comput', 'comput', 'concret', 'coverag', 'curat', 'current', 'data', 'data', 'data', 'data', 'data', 'data', 'dataset', 'dataset', 'decad', 'dedic', 'design', 'detect', 'detect', 'dimension', 'draw', 'ecommerc', 'ecommerc', 'effect', 'effici', 'etc', 'exam', 'expect', 'explor', 'explor', 'explor', 'explor', 'explor', 'field', 'final', 'foundat', 'framework', 'function', 'fundament', 'good', 'graph', 'graph', 'hadoop', 'hadoop', 'handson', 'homework', 'homework', 'import', 'inform', 'inform', 'infrastructur', 'inspir', 'internet', 'internet', 'java', 'key', 'knowledg', 'lab', 'lab', 'lab', 'laboratori', 'largescal', 'largescal'

## Exercise 4.2: Term-document matrix
- computation of TF-IDF
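
For reference, the weighting computed in the cells below can be written as follows. Here $n_{t,d}$ is the number of occurrences of term $t$ in description $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of courses, and $\mathrm{df}_t$ is the number of descriptions that contain $t$. The notation is introduced here for illustration and simply mirrors what the code does (relative term frequency and natural logarithm).

$$
\mathrm{tf}_{t,d} = \frac{n_{t,d}}{|d|}, \qquad
\mathrm{idf}_t = \ln\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t
$$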

In [75]:
# Make a set that contains all the words in the courses.
all_lst = []
for course in courses:
    for word in course['description'].split():
        all_lst.append(word)
all_set = set(all_lst)
print("The number of all words used in the courses' descriptions is {}.".format(len(all_set)))

The number of all words used in the courses' descriptions is 14513.


In [76]:
# Construct the TF matrix
TF_matrix = np.zeros((len(all_set),len(courses)))
TF_matrix.shape

(14513, 854)

In [77]:
# Make indices for column (courseIds) and row (words).
column_indices = [x['courseId'] for x in courses]
row_indices = list(all_set)

In [78]:
for i, word in enumerate(row_indices):
    for j, course in enumerate(courses):
        desc_lst = course['description'].split()
        occur_time = desc_lst.count(word)
        TF_matrix[i][j] = occur_time/len(desc_lst)
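
The nested loop above re-splits every description once per term. A sketch of a single-pass alternative is given below, on toy data and with plain Python lists instead of a NumPy array so that it runs standalone. `build_tf` and its arguments are hypothetical names. It uses the same normalisation as the notebook (count divided by description length).

```python
from collections import Counter


def build_tf(descriptions, terms):
    """Return a len(terms) x len(descriptions) matrix of relative term frequencies."""
    row_of = {term: i for i, term in enumerate(terms)}
    tf = [[0.0] * len(descriptions) for _ in terms]
    for j, desc in enumerate(descriptions):
        tokens = desc.split()
        # An empty description yields an empty Counter, so no division by zero occurs
        for term, count in Counter(tokens).items():
            tf[row_of[term]][j] = count / len(tokens)
    return tf


# Toy descriptions (hypothetical)
toy_descriptions = ['data graph data', 'markov chain', 'data chain chain chain']
toy_terms = sorted({w for d in toy_descriptions for w in d.split()})

tf = build_tf(toy_descriptions, toy_terms)
for term, row in zip(toy_terms, tf):
    print(term, row)
```

Sorting the vocabulary also makes the row order reproducible between runs, which `list(all_set)` does not guarantee for strings.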

In [79]:
IDF = np.zeros(len(all_set))

In [80]:
# Construct the IDF
for row in range(len(row_indices)):
    course_sum = np.sum(TF_matrix[row] != 0)
    IDF[row] = math.log(len(courses)/course_sum)

In [81]:
IDF = IDF.reshape(len(IDF),1)

In [82]:
IDF

array([[ 6.74993119],
       [ 2.9212898 ],
       [ 6.74993119],
       ..., 
       [ 6.74993119],
       [ 6.05678401],
       [ 6.74993119]])

In [83]:
# Construct the TFIDF_matrix
TFIDF_matrix = TF_matrix*IDF

In [84]:
TFIDF_matrix.shape

(14513, 854)

In [85]:
TFIDF_matrix

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [86]:
save_pkl(TFIDF_matrix, 'tfidx_matrix.pkl')
save_pkl(column_indices, 'courses.pkl')
save_pkl(row_indices, 'terms.pkl')

In [87]:
# Take the TF-IDF values of 'COM-308'
ix_index = column_indices.index('COM-308')
ix_word = TFIDF_matrix[:, ix_index]

In [88]:
ix_TFIDF_vals = {}
for i, val in enumerate(ix_word):
    if not val > 0.0:
        continue
    ix_TFIDF_vals[row_indices[i]] = val

### 1. Print the 15 terms in the description of the IX class with the highest TF-IDF scores

In [89]:
# Show the 15 terms in the description of the IX class with the highest TF-IDF scores.
ix_TFIDF_vals_sorted = sorted(ix_TFIDF_vals.items(), key=lambda x:x[1], reverse = True)
ix_TFIDF_vals_sorted[0:15]

[('onlin', 0.088542142062522422),
 ('realworld', 0.088394291067369318),
 ('social', 0.082031405578561131),
 ('explor', 0.078662153551405739),
 ('mine', 0.072598284747804151),
 ('hadoop', 0.062764601173353626),
 ('largescal', 0.06182397624169101),
 ('ecommerc', 0.05856289020850218),
 ('servic', 0.056182958205068703),
 ('auction', 0.055581728835944866),
 ('internet', 0.05138001787109342),
 ('network', 0.048563978166307087),
 ('data', 0.047383974717368979),
 ('dataset', 0.045098817834095334),
 ('ad', 0.043367687423078061)]

### 2. Explain where the difference between the large scores and the small ones comes from.

- Some words that occur in the description of the IX class, like 'model' and 'data', have smaller TF-IDF values than the top terms, like 'onlin' or 'realworld'. The reason is that words like 'model' and 'data' also occur in many other courses' descriptions, which lowers their IDF. The difference between the large scores and the small ones therefore comes from how often a term occurs in this description (TF) combined with how rare it is across all descriptions (IDF).
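
A small worked example makes this concrete. The term frequency and document frequencies below are made up for illustration. Only the number of courses (854) comes from this notebook. Two terms occur equally often in one description, but one of them also appears in many other descriptions.

```python
import math

n_courses = 854          # number of courses in the notebook
tf = 6 / 300             # assumed: the term occurs 6 times in a 300-token description

# Assumed document frequencies (hypothetical values)
df_common = 400          # a term found in many descriptions
df_rare = 10             # a term found in only a few descriptions

tfidf_common = tf * math.log(n_courses / df_common)
tfidf_rare = tf * math.log(n_courses / df_rare)

print(round(tfidf_common, 4))
print(round(tfidf_rare, 4))
```

With identical term frequency, the rarer term ends up with a score several times larger, purely because of the IDF factor.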

## Exercise 4.3: Document similarity search

In [90]:
# Construct query vectors for markov chain and facebook
markov_chain = np.zeros(len(row_indices)).reshape(len(row_indices),1)
markov_chain[row_indices.index('markov')] = 1/2
markov_chain[row_indices.index('chain')] = 1/2

facebook = np.zeros(len(row_indices))
facebook[row_indices.index('facebook')] = 1

In [91]:
# Function that calculates the cosine similarity of two vectors
def similarity(di, dj):
    return np.dot(di, dj) / (LA.norm(di) * LA.norm(dj))
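
Note that `similarity` divides by the product of the two norms, so with NumPy floats it would produce `nan` (and a runtime warning) if either vector were all zeros, for instance for a course whose description became empty after pre-processing. A guarded variant might look like the following sketch, written with plain lists so that it runs standalone. `safe_cosine` is a hypothetical name, and returning 0.0 for a zero vector is a design choice rather than part of the cosine definition.

```python
import math


def safe_cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = safe_cosine([1.0, 0.0], [1.0, 0.0])
orthogonal = safe_cosine([1.0, 0.0], [0.0, 2.0])
zero_vector = safe_cosine([0.0, 0.0], [1.0, 1.0])

print(same_direction, orthogonal, zero_vector)
```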

In [92]:
markov_similarity = []
for i in range(len(column_indices)):
    markov_similarity.append(similarity(TFIDF_matrix.T[i], markov_chain)[0])

In [93]:
# Make a dictionary of (course name: similarity value)
markov_sim_vals = {}
for i, sim in enumerate(markov_similarity):
    markov_sim_vals[courses[i]['name']] = sim
    
# Sort by similarity value
markov_sim_vals_sorted = sorted(markov_sim_vals.items(), key=lambda x:x[1], reverse = True)

In [95]:
facebook_similarity = []
for i in range(len(column_indices)):
    facebook_similarity.append(similarity(TFIDF_matrix.T[i], facebook))

In [96]:
# Make a dictionary of (course name: similarity value)
facebook_sim_vals = {}
for i, sim in enumerate(facebook_similarity):
    facebook_sim_vals[courses[i]['name']] = sim

# Sort by similarity value
facebook_sim_vals_sorted = sorted(facebook_sim_vals.items(), key=lambda x:x[1], reverse = True)

### 1. Display the top five courses together with their similarity score for each query.

In [98]:
# Print the top 5 courses
print("The top five courses together with their similarity score for query markov chain")
for i in range(5):
    print(markov_sim_vals_sorted[i])

The top five courses together with their similarity score for query markov chain
('Applied stochastic processes', 0.55381084208926346)
('Applied probability & stochastic processes', 0.54293465222562776)
('Markov chains and algorithmic applications', 0.38715760245681952)
('Supply chain management', 0.36275654756075915)
('Mathematical models in supply chain management', 0.2989216694626754)


In [97]:
# Print the top 5 courses
print("The top five courses together with their similarity score for query facebook")
for i in range(5):
    print(facebook_sim_vals_sorted[i])

The top five courses together with their similarity score for query facebook
('Computational Social Media', 0.1762569139361605)
('Composites technology', 0.0)
('Image Processing for Life Science', 0.0)
('Global business environment', 0.0)
('Electrochemical nano-bio-sensing and bio/CMOS interfaces', 0.0)


### 2. What do you think of the results? Give your intuition on what is happening.

In [100]:
count_markov = 0
count_chain = 0
count_facebook = 0
for course in courses:
    for word in course['description'].split():
        if word == 'markov':
            count_markov += 1
        if word == 'chain':
            count_chain += 1
        if word == 'facebook':
            count_facebook += 1

In [105]:
print("The number of markov in descriptions is {}".format(count_markov))
print("The number of chain in descriptions is {}".format(count_chain))
print("The number of facebook in descriptions is {}".format(count_facebook))

The number of markov in descriptions is 42
The number of chain in descriptions is 68
The number of facebook in descriptions is 3


- For markov chain, the top three courses are clearly related to Markov chains. The last two ('Supply chain management' and 'Mathematical models in supply chain management') most likely rank highly only because they contain the word 'chain'. For facebook, four of the top five similarity values are 0, so the model retrieves only one course related to facebook.
- According to the counts above, facebook occurs only 3 times in all the descriptions. The results therefore show that this TF-IDF model can retrieve courses whose descriptions literally contain the query terms, ideally several times. However, it cannot retrieve courses for a query term that is very specific and rarely or never occurs in the descriptions, even if a course is related to the topic.
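
This behaviour follows directly from the cosine formula: a course scores above zero only if it shares at least one term with the query. The toy example below uses hypothetical descriptions and raw term counts instead of TF-IDF weights. It shows a match on 'chain' alone and a query with no match at all.

```python
import math
from collections import Counter


def cosine(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy stemmed descriptions (hypothetical, not taken from data/courses.txt)
toy_courses = {
    'stochastic processes': Counter('markov chain markov process'.split()),
    'supply chain': Counter('suppli chain manag'.split()),
    'image processing': Counter('imag filter process'.split()),
}


def score_all(query):
    """Score every toy course against a whitespace-separated query."""
    q = Counter(query.split())
    return {name: cosine(q, vec) for name, vec in toy_courses.items()}


markov_scores = score_all('markov chain')
facebook_scores = score_all('facebook')

print(markov_scores)
print(facebook_scores)
```

The supply chain course gets a non-zero score for the markov chain query through the shared term 'chain', while every course scores exactly 0 for facebook because none of the toy descriptions contains that term.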