# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Keijiro Tajima*
* *Mahammad Shirinov*
* *Stephen Zhao*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_pkl
import string
import re
from nltk.stem import PorterStemmer as PS
import math
from numpy import linalg as LA

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')
courses_original = load_json('data/courses.txt')

## Exercise 4.1: Pre-processing

In [2]:
# Check the number of courses
print("The number of courses is {}.".format(len(courses)))

The number of courses is 854.


In [3]:
# Check the first course's description without pre-processing
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical properties modelling for composite mater

### Split camel case words

In [4]:
CAPITAL_LETTERS = ''.join(chr(ord('A') + i) for i in range(26))

for course in courses:
    desc_lst = []
    for word in course['description'].split():
        # Split up camelcase words like "materialsConstituentsProcessing" 
        # that were not parsed correctly and ended up as camel case.
        # But do not split up acronyms like "IT"
        words = []
        if len(word) == 1:
            words.append(word)
        else:
            for i in range(len(word)):
                if i == 0: # Start the first word
                    start = 0
                elif i == len(word)-1: # Add the final word
                    words.append(word[start:])
                else:
                    is_prev_capital = word[i-1] in CAPITAL_LETTERS
                    is_curr_capital = word[i] in CAPITAL_LETTERS
                    is_next_capital = word[i+1] in CAPITAL_LETTERS
                    if is_curr_capital:
                        if is_prev_capital:
                            if is_next_capital:
                                # part of acronym
                                continue
                            else:
                                # end of acronym and beginning of new word
                                words.append(word[start:i])
                                start = i
                        else:
                            # start of new word or acronym
                            start = i
                    else:
                        if is_next_capital:
                            # end of a word or acronym
                            words.append(word[start:i+1])
                        else:
                            # part of a word
                            continue       
                        #endif
                    #endif
                #endif
            #endfor
        #endif
        desc_lst += words
    course['description'] = ' '.join(desc_lst)

In [5]:
# Check the first course's description after camel case splitting
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materials Constituents Processing of composites Design of composite structures Current development Nanocomposites Textile composites Biocomposites Adaptive composites Applications Driving forces and markets Cost analysis Aerospace Automotive Sport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite part Apply the basic equations for process and mechanical properties modelling for com
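
The same kind of split can also be sketched with a regular expression. This is only an illustrative alternative, not the method used in the cell above: the helper name `split_camel_case` is hypothetical, and the pattern assumes plain ASCII letters. It cuts between a lowercase and an uppercase letter, and between an acronym and a following capitalised word, so acronyms like "IT" stay intact.

```python
import re

# Hypothetical helper, not the notebook's implementation.
# Boundary 1: lowercase letter followed by an uppercase one ("materials|Constituents").
# Boundary 2: end of an acronym followed by a capitalised word ("IT|Resources").
_CAMEL_BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def split_camel_case(word):
    """Return the list of sub-words of a camel-case token."""
    # Insert a space at every boundary, then split on whitespace.
    return _CAMEL_BOUNDARY.sub(' ', word).split()

print(split_camel_case("materialsConstituentsProcessing"))
print(split_camel_case("IT"))
```

A whole description could then be rebuilt with `' '.join(w for token in text.split() for w in split_camel_case(token))`.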

### Remove punctuation

In [6]:
# Remove the punctuation (except apostrophe which is used in stop words)
punctuation_table = "[{}]+".format(re.escape(string.punctuation.replace("\'", '')))

for course in courses:
    course['description'] = re.sub(punctuation_table, ' ', course['description'])

In [7]:
# Check the first course's description 
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
The latest developments in processing and the novel generations of organic composites are discussed  Nanocomposites  adaptive composites and biocomposites are presented  Product development  cost analysis and study of new markets are practiced in team work  Content Basics of composite materials Constituents Processing of composites Design of composite structures Current development Nanocomposites Textile composites Biocomposites Adaptive composites Applications Driving forces and markets Cost analysis Aerospace Automotive Sport Keywords Composites   Applications   Nanocomposites   Biocomposites   Adaptive composites   Design   Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course  the student must be able to  Propose suitable design  production and performance criteria for the production of a composite part Apply the basic equations for process and mechanical properties modelling for com

### Remove stop words, then remove leftover apostrophes

In [8]:
# Function that removes all the words in words_lst from the course descriptions
def remove(words_lst):
    words_to_remove = set(words_lst)  # set membership tests are fast
    for course in courses:
        desc_lst = course['description'].split()
        course['description'] = ' '.join(w for w in desc_lst if w not in words_to_remove)

In [9]:
# Function that capitalizes all the words in words_lst and returns them as a set
def capitalize_words(words_lst):
    return {word.capitalize() for word in words_lst}

In [10]:
capitalized_stopwords = capitalize_words(list(stopwords))

In [11]:
# Remove the stopwords and capitalized_stopwords
remove(list(stopwords))
remove(list(capitalized_stopwords))

In [12]:
# Remove hanging apostrophes to join possessive "s" with the word
for course in courses:
    course['description'] = course['description'].replace("'", "")

In [13]:
# Check the first course's description without stop words
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
latest developments processing generations organic composites discussed Nanocomposites adaptive composites biocomposites presented Product development cost analysis study markets practiced team work Content Basics composite materials Constituents Processing composites Design composite structures Current development Nanocomposites Textile composites Biocomposites Adaptive composites Applications Driving forces markets Cost analysis Aerospace Automotive Sport Keywords Composites Applications Nanocomposites Biocomposites Adaptive composites Design Cost Learning Prerequisites Required courses Notion polymers Recommended courses Polymer Composites Learning Outcomes end student Propose suitable design production performance criteria production composite part Apply basic equations process mechanical properties modelling composite materials Discuss main types composite applications Transversal skills work methodology task general domain specific IT resources tools Communicate effective

### Stem the words with nltk library.

In [14]:
ps = PS()

for course in courses:
    desc_lst = course['description'].split()
    for i in range(len(desc_lst)):
        desc_lst[i] = ps.stem(desc_lst[i])
    course['description'] = ' '.join(desc_lst)

In [15]:
# Check the first course's description
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
latest develop process gener organ composit discuss nanocomposit adapt composit biocomposit present product develop cost analysi studi market practic team work content basic composit materi constitu process composit design composit structur current develop nanocomposit textil composit biocomposit adapt composit applic drive forc market cost analysi aerospac automot sport keyword composit applic nanocomposit biocomposit adapt composit design cost learn prerequisit requir cours notion polym recommend cours polym composit learn outcom end student propos suitabl design product perform criteria product composit part appli basic equat process mechan properti model composit materi discuss main type composit applic transvers skill work methodolog task gener domain specif IT resourc tool commun effect profession disciplin evalu one perform team receiv respond appropri feedback teach method cathedra invit speaker group session exercis work project expect student activ attend lectur desig

### Check the frequent and infrequent words

In [16]:
# Count how often each word occurs over all course descriptions
frequency_dict = {}
for course in courses:
    for word in course['description'].split():
        frequency_dict[word] = frequency_dict.get(word, 0) + 1

In [17]:
# See the number of unique words we have
print('num of words:', len(frequency_dict))

num of words: 9758


In [18]:
frequency_sorted = sorted(frequency_dict.items(), key=lambda x:x[1], reverse = True)
infrequency_sorted = sorted(frequency_dict.items(), key=lambda x:x[1])

In [19]:
# Top 10 frequent words
frequency_sorted[:10]

[('student', 2053),
 ('method', 1814),
 ('learn', 1633),
 ('model', 1240),
 ('system', 1201),
 ('assess', 1017),
 ('design', 991),
 ('content', 923),
 ('present', 793),
 ('process', 792)]

In [20]:
# 10 of the least frequent words
infrequency_sorted[:10]

[('imagej', 1),
 ('fiji', 1),
 ('1h30', 1),
 ('metadata', 1),
 ('stitch', 1),
 ('fij', 1),
 ('nontrad', 1),
 ('begiven', 1),
 ('statiqu', 1),
 ('102', 1)]

### Frequent and infrequent words
- Frequent words, like 'student' and 'method', are very general: they appear throughout the descriptions and do not help to tell courses apart.
- Infrequent words, like 'imagej' and 'fiji', are too specialized: a word that occurs only once cannot relate one course to another.

Let's remove these words. The very frequent words are few but dominant, so we remove the top 0.1% of the vocabulary (the 10 most frequent words). We also remove all words that appear only once.
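
The filtering rule described above can be sketched in a self-contained way with `collections.Counter`. The function name `frequency_filter`, its parameters and the toy corpus are illustrative assumptions; the cells below implement the rule differently, on the `courses` list.

```python
import math
from collections import Counter

def frequency_filter(descriptions, top_percent=0.1, min_count=2):
    """Drop the most frequent `top_percent` % of the vocabulary and every word
    that occurs fewer than `min_count` times in the whole corpus.

    `descriptions` is a list of whitespace-separated strings (an assumption).
    """
    counts = Counter(word for desc in descriptions for word in desc.split())
    n_top = math.ceil(len(counts) * top_percent / 100)
    too_frequent = {word for word, _ in counts.most_common(n_top)}
    too_rare = {word for word, c in counts.items() if c < min_count}
    banned = too_frequent | too_rare
    return [' '.join(w for w in desc.split() if w not in banned)
            for desc in descriptions]

# Toy corpus: 'student' is the most frequent word, 'fiji' occurs only once.
toy = ["student graph graph fiji", "student markov markov", "student"]
print(frequency_filter(toy, top_percent=25))
```

Computing both sets before removing anything keeps the two thresholds independent of each other.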

In [21]:
# Function that extracts the first `percent` percent (rounded up) of the words
# from words_lst, a list of (word, frequency) pairs.
def extract(words_lst, percent):
    n_taken = min(len(words_lst), math.ceil(len(words_lst) * percent * 0.01))
    return [word for word, _ in words_lst[:n_taken]]

In [22]:
# Extract 0.1% of frequent words and all infrequent (freq=1) words.
frequent_words = extract(frequency_sorted, 0.1)
infrequent_words = [word for word, freq in infrequency_sorted if freq <= 1]
print("num of frequent words:   ", len(frequent_words))
print("num of infrequent words: ", len(infrequent_words))

num of frequent words:    10
num of infrequent words:  4282


In [23]:
remove(frequent_words)

In [24]:
remove(infrequent_words)

In [25]:
# Check the first course's description
print(courses[0]['courseId'])
print(courses[0]['description'])

MSE-440
latest develop gener organ composit discuss nanocomposit adapt composit biocomposit product develop cost analysi studi market practic team work basic composit materi constitu composit composit structur current develop nanocomposit textil composit biocomposit adapt composit applic drive forc market cost analysi aerospac automot sport keyword composit applic nanocomposit biocomposit adapt composit cost prerequisit requir cours notion polym recommend cours polym composit outcom end propos suitabl product perform criteria product composit part appli basic equat mechan properti composit materi discuss main type composit applic transvers skill work methodolog task gener domain specif IT resourc tool commun effect profession disciplin evalu one perform team receiv respond appropri feedback teach cathedra invit speaker group session exercis work project expect activ attend lectur composit part bibliographi search written exam report oral class


In [26]:
save_pkl(courses, 'courses_prepped.pkl')

### 1. Explain which ones you implemented and why.

- We implemented camel-case splitting, punctuation removal, stopword removal, stemming, and removal of the very frequent and the infrequent words.
- Camel-case splitting repairs words that were glued together when the descriptions were parsed, like "materialsConstituentsProcessing".
- Stopwords and punctuation carry no information about the topic of a course.
- As mentioned in the "Frequent and infrequent words" part, frequent words are too general, and infrequent words, like 'imagej' and 'fiji', are too specialized to be useful for characterising courses. Therefore we removed them.
- Stemming and lemmatisation serve a similar purpose, so we applied only one of them and chose stemming for this project.

### 2. Print the terms in the pre-processed description of the IX class in alphabetical order.

In [163]:
# Print the sorted terms of the IX class (COM-308)
for course in courses:
    if course['courseId'] == 'COM-308':
        ix_desc = sorted(course['description'].split())
        print(ix_desc)

['20', '30', '300', '50', 'CO', 'acquir', 'activ', 'ad', 'ad', 'advertis', 'algebra', 'algebra', 'algorithm', 'algorithm', 'analysi', 'analyt', 'analyt', 'analyz', 'applic', 'applic', 'auction', 'auction', 'balanc', 'base', 'base', 'basic', 'basic', 'basic', 'cathedra', 'chain', 'class', 'class', 'class', 'cloud', 'cluster', 'cluster', 'collect', 'combin', 'commerc', 'commerc', 'commun', 'commun', 'commun', 'comput', 'comput', 'concept', 'concept', 'concret', 'contain', 'cours', 'cours', 'coverag', 'current', 'data', 'data', 'data', 'data', 'data', 'data', 'dataset', 'dataset', 'decad', 'dedic', 'detect', 'detect', 'develop', 'dimension', 'draw', 'effect', 'effici', 'end', 'exam', 'expect', 'explor', 'explor', 'explor', 'explor', 'explor', 'field', 'final', 'foundat', 'framework', 'function', 'fundament', 'good', 'graph', 'graph', 'hadoop', 'hadoop', 'hand', 'homework', 'homework', 'import', 'inform', 'inform', 'infrastructur', 'inspir', 'internet', 'internet', 'java', 'key', 'keyword'

## Exercise 4.2: Term-document matrix
- computation of TF-IDF
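
For reference, the variant of TF-IDF computed in the cells of this exercise can be written as follows. The symbols are our own shorthand and an assumption of this note: $n_{t,d}$ is the number of occurrences of term $t$ in the pre-processed description $d$, and $N$ is the number of courses.

$$
\mathrm{tf}_{t,d} = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}_t = \ln \frac{N}{\left|\{d : n_{t,d} > 0\}\right|}, \qquad
\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t
$$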

In [164]:
# Make a set that contains all the words in the courses.
all_lst = []
for course in courses:
    for word in course['description'].split():
        all_lst.append(word)
all_set = set(all_lst)
print("The number of all words used in the courses' descriptions is {}.".format(len(all_set)))

The number of all words used in the courses' descriptions is 5466.


In [165]:
# Construct the TF matrix
TF_matrix = np.zeros((len(all_set),len(courses)))
TF_matrix.shape

(5466, 854)

In [166]:
# Make indices for column (courseIds) and row (words).
column_indices = [x['courseId'] for x in courses]
row_indices = list(all_set)

In [167]:
# Fill the TF matrix: relative frequency of each word in each course description
for i, word in enumerate(row_indices):
    for j, course in enumerate(courses):
        desc_lst = course['description'].split()
        occur_time = desc_lst.count(word)
        TF_matrix[i][j] = occur_time/len(desc_lst)
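
The double loop above splits and scans every description once per term. A possible faster variant, sketched here under the assumption that descriptions are whitespace-separated strings, counts each description once with `collections.Counter` and looks up row positions in a dictionary. The function name `term_frequency_matrix` and the toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def term_frequency_matrix(descriptions, vocabulary):
    """Build a (terms x documents) matrix of relative term frequencies.

    `descriptions` is a list of whitespace-separated strings and
    `vocabulary` a list of terms; both are illustrative assumptions.
    """
    row_of = {term: i for i, term in enumerate(vocabulary)}
    tf = np.zeros((len(vocabulary), len(descriptions)))
    for j, desc in enumerate(descriptions):
        words = desc.split()
        # Count every word of the description once
        for term, count in Counter(words).items():
            if term in row_of:
                tf[row_of[term], j] = count / len(words)
    return tf

tf = term_frequency_matrix(["graph graph data", "data markov"],
                           ["data", "graph", "markov"])
print(tf)
```

An empty description simply yields a column of zeros, because its counter has no entries.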

In [168]:
IDF = np.zeros(len(all_set))

In [169]:
# Construct the IDF
for row in range(len(row_indices)):
    course_sum = np.sum(TF_matrix[row] != 0)
    IDF[row] = math.log(len(courses)/course_sum)

In [170]:
IDF = IDF.reshape(len(IDF),1)

In [171]:
IDF

array([[ 2.89978359],
       [ 0.98787981],
       [ 3.38263536],
       ..., 
       [ 5.65131891],
       [ 6.74993119],
       [ 6.05678401]])

In [172]:
# Construct the TFIDF_matrix
TFIDF_matrix = TF_matrix*IDF

In [173]:
TFIDF_matrix.shape

(5466, 854)

In [174]:
TFIDF_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.01555716,  0.01829407,  0.        , ...,  0.        ,
         0.03186709,  0.03951519],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [175]:
save_pkl(TFIDF_matrix, 'tfidx_matrix.pkl')
save_pkl(column_indices, 'courses.pkl')
save_pkl(row_indices, 'terms.pkl')

In [176]:
# Take the TFIDF values of  'COM-308'
ix_index = column_indices.index('COM-308')
ix_word = TFIDF_matrix[:, ix_index]

In [177]:
# Keep only the terms with a positive TF-IDF score for the IX class
ix_TFIDF_vals = {}
for i, val in enumerate(ix_word):
    if val > 0.0:
        ix_TFIDF_vals[row_indices[i]] = val

### 1. Print the 15 terms in the description of the IX class with the highest TF-IDF scores

In [178]:
# Show the 15 terms in the description of the IX class with the highest TF-IDF scores.
ix_TFIDF_vals_sorted = sorted(ix_TFIDF_vals.items(), key=lambda x:x[1], reverse = True)
ix_TFIDF_vals_sorted[0:15]

[('mine', 0.091578228472720277),
 ('servic', 0.087546013809819231),
 ('onlin', 0.084659174700594128),
 ('social', 0.076936600028047683),
 ('world', 0.070557112325506835),
 ('explor', 0.069556573244128034),
 ('hadoop', 0.059380235423810046),
 ('real', 0.055980957723872203),
 ('auction', 0.052584674830085089),
 ('commerc', 0.052584674830085089),
 ('retriev', 0.047098245536600553),
 ('internet', 0.045789114236360139),
 ('network', 0.045240783084279479),
 ('dataset', 0.041029233689480714),
 ('stream', 0.039626284242023135)]

### 2. Explain where the difference between the large scores and the small ones comes from.

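- The score of a term is the product of its relative frequency in the IX description (TF) and the logarithm of the inverse fraction of courses that contain it (IDF). The top terms, like 'mine', 'onlin' or 'social', appear in the IX description but in comparatively few other courses, which gives them a large IDF. In contrast, a term like 'data' occurs six times in the IX description but does not reach the top 15, because it also occurs in many other courses' descriptions and its IDF is small. The difference between the large scores and the small ones therefore comes mainly from how widely a term is spread over all descriptions.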
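
A small worked example makes this effect concrete. The corpus size of 854 comes from this notebook, but the counts passed to the function are made up for illustration, and the helper name `tf_idf` is hypothetical.

```python
import math

N_COURSES = 854  # number of courses in the corpus (from this notebook)

def tf_idf(count_in_doc, doc_length, docs_with_term, n_docs=N_COURSES):
    """TF-IDF as computed in this notebook: relative frequency times ln(N / df)."""
    tf = count_in_doc / doc_length
    idf = math.log(n_docs / docs_with_term)
    return tf * idf

# Made-up counts for a 200-word description:
# a term used 6 times that appears in 400 courses,
# and a term used 2 times that appears in only 5 courses.
common = tf_idf(count_in_doc=6, doc_length=200, docs_with_term=400)
rare = tf_idf(count_in_doc=2, doc_length=200, docs_with_term=5)
print(common, rare)
```

Although the common term is three times more frequent in the description, the rare term ends up with the higher score because its IDF is much larger.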

## Exercise 4.3: Document similarity search

In [179]:
# Construct query vectors for markov chain and facebook
markov_chain = np.zeros(len(row_indices)).reshape(len(row_indices),1)
markov_chain[row_indices.index('markov')] = 1/2
markov_chain[row_indices.index('chain')] = 1/2

facebook = np.zeros(len(row_indices))
facebook[row_indices.index('facebook')] = 1

In [180]:
# Function that calculates the cosine similarity of two vectors
def similarity(di, dj):
    return np.dot(di, dj) / (LA.norm(di) * LA.norm(dj))
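
As a sketch of an alternative, the similarity of a query with all courses can be computed in one matrix product instead of one call per course. The function name `cosine_similarities` is hypothetical, and returning 0 for zero-norm vectors is a design choice of this sketch, not something the function above does.

```python
import numpy as np

def cosine_similarities(matrix, query):
    """Cosine similarity between a query vector and every column of `matrix`.

    `matrix` has shape (n_terms, n_docs) and `query` has n_terms entries.
    Columns (or a query) with zero norm get similarity 0 instead of NaN.
    """
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float).ravel()
    dots = matrix.T @ query
    norms = np.linalg.norm(matrix, axis=0) * np.linalg.norm(query)
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Toy matrix with 2 terms and 3 documents; the last document is empty.
toy = np.array([[1.0, 0.0, 0.0],
                [1.0, 2.0, 0.0]])
print(cosine_similarities(toy, [1.0, 0.0]))
```

Flattening the query with `ravel()` means that both a 1-D vector and a column vector are accepted.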

In [181]:
markov_similarity = []
for i in range(len(column_indices)):
    markov_similarity.append(similarity(TFIDF_matrix.T[i], markov_chain)[0])

In [182]:
# Make a dictionary of (course name: similarity value)
markov_sim_vals = {}
for i, sim in enumerate(markov_similarity):
    markov_sim_vals[courses[i]['name']] = sim
    
# Sort by similarity value
markov_sim_vals_sorted = sorted(markov_sim_vals.items(), key=lambda x:x[1], reverse = True)

In [183]:
facebook_similarity = []
for i in range(len(column_indices)):
    facebook_similarity.append(similarity(TFIDF_matrix.T[i], facebook))

In [184]:
# Make a dictionary of (course name: similarity value)
facebook_sim_vals = {}
for i, sim in enumerate(facebook_similarity):
    facebook_sim_vals[courses[i]['name']] = sim

# Sort by similarity value
facebook_sim_vals_sorted = sorted(facebook_sim_vals.items(), key=lambda x:x[1], reverse = True)

### 1. Display the top five courses together with their similarity score for each query.

In [185]:
# Print the top 5 courses
print("The top five courses together with their similarity score for query markov chain")
for i in range(5):
    print(markov_sim_vals_sorted[i])

The top five courses together with their similarity score for query markov chain
('Applied stochastic processes', 0.58755084846323447)
('Applied probability & stochastic processes', 0.53961015057965878)
('Markov chains and algorithmic applications', 0.45636978045262666)
('Supply chain management', 0.3903593823655549)
('Mathematical models in supply chain management', 0.31751041450037115)


In [186]:
# Print the top 5 courses
print("The top five courses together with their similarity score for query facebook")
for i in range(5):
    print(facebook_sim_vals_sorted[i])

The top five courses together with their similarity score for query facebook
('Computational Social Media', 0.18591143220755649)
('Composites technology', 0.0)
('Image Processing for Life Science', 0.0)
('Global business environment', 0.0)
('Electrochemical nano-bio-sensing and bio/CMOS interfaces', 0.0)


### 2. What do you think of the results? Give your intuition on what is happening.

In [187]:
count_markov = 0
count_chain = 0
count_facebook = 0
for course in courses:
    for word in course['description'].split():
        if word == 'markov':
            count_markov += 1
        if word == 'chain':
            count_chain += 1
        if word == 'facebook':
            count_facebook += 1

In [188]:
print("The number of markov in descriptions is {}".format(count_markov))
print("The number of chain in descriptions is {}".format(count_chain))
print("The number of facebook in descriptions is {}".format(count_facebook))

The number of markov in descriptions is 45
The number of chain in descriptions is 79
The number of facebook in descriptions is 3


- For "markov chain", the top three courses are clearly related to Markov chains. The two supply chain courses, however, only match the term 'chain'. For "facebook", only one course has a non-zero similarity; the other four courses in the top five have similarity 0, so their order carries no information. This means that the model cannot detect other courses that relate to Facebook.
- According to the counts above, 'facebook' occurs only 3 times in all descriptions, whereas 'markov' and 'chain' occur 45 and 79 times. The TF-IDF model can therefore only retrieve courses whose descriptions contain the exact query terms. It cannot detect courses that are related to a query but never use its words, which happens easily for very specific terms such as 'facebook'.
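
Because Python's `sorted` is stable, courses with similarity 0 keep their original order and fill up the top five even though they do not match the query at all. One possible way to avoid showing them is sketched below. The helper name `top_matches` is hypothetical, and the scores are rounded from the 'facebook' output above.

```python
def top_matches(scores, k=5):
    """Return up to k (name, score) pairs with a strictly positive score,
    sorted by decreasing score. `scores` maps course names to similarities."""
    positive = [(name, s) for name, s in scores.items() if s > 0]
    return sorted(positive, key=lambda pair: pair[1], reverse=True)[:k]

# Values rounded from the 'facebook' query output above.
facebook_scores = {
    'Composites technology': 0.0,
    'Computational Social Media': 0.186,
    'Global business environment': 0.0,
}
print(top_matches(facebook_scores))
```

With this filter the 'facebook' query returns a single course, which makes the limited coverage of the query visible at a glance.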