# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *B*

**Names:**

* *Amaury Combes*
* *Vincenzo Bazzucchi*
* *Alexis Montavon*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import scipy.sparse

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

In [2]:
courses[0]

{'courseId': 'MSE-440',
 'description': "The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures\xa0Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites\xa0ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical p

## Exercise 4.1: Pre-processing

In [3]:
import re
import nltk
from nltk.stem import WordNetLemmatizer

If the lemmatizer does not run on your machine, use the `nltk.download()` method to download the WordNet package.

In [4]:
#nltk.download('wordnet')

For the pre-processing part, we remove the stop words you gave us because they are too frequent to be informative. We also made a list of the punctuation we could find in the data and removed it, since it does not bring any information. Finally, we used the NLTK WordNet lemmatizer to group together the different forms of a word.
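The steps described above can be illustrated on a single sentence. This is a simplified sketch, not the `Cleaner` class used in this notebook: the stop-word set and the sample sentence are hypothetical, and lemmatization is left out so that the example runs without the WordNet data.

```python
import re

# Hypothetical toy inputs, not the stop-word list or data provided for the lab.
STOP_WORDS = {"the", "of", "and", "are"}

def preprocess(text):
    # Insert a space before every uppercase letter to split glued words
    # such as "compositesDesign".
    spaced = re.sub(r"([A-Z])", r" \1", text)
    tokens = []
    for raw in spaced.split():
        # Lowercase the token and keep alphabetic characters only,
        # which also drops punctuation.
        word = "".join(ch for ch in raw.lower() if ch.isalpha())
        # Discard single characters and stop words.
        if len(word) > 1 and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

sample = "Processing of compositesDesign of composite structures."
tokens = preprocess(sample)
print(tokens)
# ['processing', 'composites', 'design', 'composite', 'structures']
```

Without a lemmatizer, `composites` and `composite` stay distinct, which is the kind of duplication the lemmatization step is meant to reduce.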

In [5]:
import re
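class Cleaner:
    """Strips punctuation, lemmatizes and/or stems words, and drops stop words."""

    def __init__(self, punctuation, stop_words, lemmatizer=None, stemmer=None):
        self._punctuation = list(punctuation)
        self._lemmatizer = lemmatizer
        self._stemmer = stemmer
        self._stop_words = list(stop_words)

    def cleanWord(self, word):
        """Returns the cleaned word followed by a space, or '' for a stop word."""
        rWord = ""
        # Lowercase the word and replace every non-letter character with a space
        for l in word.lower():
            if not(l in self._punctuation) and l.isalpha():
                rWord += l
            else:
                rWord += ' '

        # Normalize the word form with whichever tools were provided
        if self._lemmatizer is not None:
            rWord = self._lemmatizer.lemmatize(rWord)
        if self._stemmer is not None:
            rWord = self._stemmer.stem(rWord)
        if rWord in self._stop_words:
            return ''
        else:
            return rWord + " "

    def cleanMessage(self, message):
        """Returns the cleaned words of message joined by spaces."""
        # Insert a space before each uppercase letter to split glued words
        separateAtUpper = re.sub('([A-Z]{1})', r' \1', message)
        words = separateAtUpper.split()
        rMessage = ""
        for w in words:
            tmp = self.cleanWord(w)
            # Skip tokens that were a single character before cleaning
            if len(w) > 1:
                rMessage += tmp
        return rMessage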

In [6]:
punctuation = [',', '.', '-', ':', '\xa0', '%', '_', '!', '?', ';', '(', ')', '\n']
lemmatizer = WordNetLemmatizer()
cleaner = Cleaner(punctuation, stopwords, lemmatizer)

In [7]:
cleaned = []
for course in courses:
    cleaned.append({
            'courseId': course['courseId'],
            'description': cleaner.cleanMessage(course['description']),
            'name': course['name']
        })

In [8]:
# Count word occurrences over all cleaned descriptions to find very frequent words
from collections import Counter
c = Counter()
for course in cleaned:
    for w in course['description'].split(' '):
        if w != '':
            c[w] += 1
counted = sorted([(w, c[w]) for w in c], key=lambda t: t[1], reverse=True)

In [9]:
counted[:50]

[('student', 1929),
 ('method', 1644),
 ('learning', 1482),
 ('design', 935),
 ('content', 917),
 ('system', 868),
 ('course', 811),
 ('analysis', 754),
 ('project', 738),
 ('model', 731),
 ('basic', 699),
 ('end', 675),
 ('assessment', 652),
 ('s', 647),
 ('to', 644),
 ('outcome', 626),
 ('concept', 622),
 ('data', 607),
 ('teaching', 597),
 ('prerequisite', 597),
 ('keywords', 573),
 ('work', 554),
 ('skill', 550),
 ('lecture', 526),
 ('introduction', 508),
 ('activity', 506),
 ('theory', 504),
 ('exercise', 471),
 ('problem', 458),
 ('application', 453),
 ('presentation', 446),
 ('exam', 442),
 ('report', 438),
 ('energy', 433),
 ('material', 428),
 ('expected', 426),
 ('process', 419),
 ('evaluate', 406),
 ('plan', 404),
 ('time', 402),
 ('transversal', 397),
 ('required', 394),
 ('recommended', 392),
 ('based', 386),
 ('research', 379),
 ('engineering', 375),
 ('structure', 375),
 ('information', 363),
 ('class', 355),
 ('written', 352)]

We delete the 50 most frequent words listed above, since words this common do little to distinguish one course from another.

In [10]:
too_frequent = set([w for w, c in counted[:50]])

In [11]:
clean_courses = []
for course in cleaned:
    new_words = ''
    for word in course['description'].split(' '):
        if word != '' and word not in too_frequent:
            new_words = new_words + ' ' + word
    clean_courses.append({
        'courseId': course['courseId'],
        'description': new_words,
        'name': course['name']
    })

In [12]:
#Fetch the index of Internet Analytics in the documents
ix_id = 0
for course in clean_courses:
    if course['courseId'] == 'COM-308':
        ix_id = clean_courses.index(course)
        break

In [13]:
ix_words = []
for word in sorted(clean_courses[ix_id]['description'].split(' ')):
    if word not in ix_words and word != '':
        ix_words.append(word)

Here are the terms in the pre-processed description of the IX class, in alphabetical order.

In [14]:
print(ix_words)

['acquired', 'ad', 'advertisement', 'algebra', 'algorithm', 'algorithms', 'analytics', 'analyze', 'apache', 'auction', 'auctions', 'balance', 'cathedra', 'chains', 'cloud', 'clustering', 'collection', 'combination', 'commerce', 'communication', 'community', 'computing', 'concepts', 'concrete', 'contained', 'coverage', 'curated', 'current', 'datasets', 'decade', 'dedicated', 'designed', 'detection', 'develop', 'dimensionality', 'draw', 'e', 'effectiveness', 'efficiency', 'etc', 'explore', 'explores', 'fields', 'final', 'foundational', 'framework', 'function', 'fundamental', 'good', 'graph', 'graphs', 'hadoop', 'hands', 'homework', 'important', 'infrastructure', 'inspired', 'internet', 'java', 'key', 'knowledge', 'lab', 'laboratory', 'large', 'lectures', 'linear', 'm', 'machine', 'main', 'map', 'markov', 'media', 'midterm', 'mining', 'modeling', 'models', 'modelsdata', 'networking', 'networks', 'number', 'on', 'online', 'past', 'practical', 'practice', 'provide', 'question', 'real', 'rec

## Exercise 4.2: Term-document matrix

In [15]:
import math
import operator

In [16]:
def get_all_words():
    all_words = []
    for course in clean_courses:
        for word in course['description'].split(' '):
            if word not in all_words and word != '':
                all_words.append(word)
    return all_words

In [17]:
def create_TD_matrix(terms, documents):
    rows = []
    columns = []
    data = []
    terms_inv = {t: i for i, t in enumerate(terms)}
    for doc_index, doc in enumerate(documents):
        for word in doc['description'].split(' '):
            if word != '':
                columns.append(doc_index)
                rows.append(terms_inv[word])
                data.append(1)
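    # Duplicate (row, column) entries are summed, which yields the term counts.
    # The builtin int replaces np.int, which was removed in NumPy 1.24.
    return csr_matrix((data, (rows, columns)), shape=(len(terms), len(documents)), dtype=int)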

In [18]:
def compute_TF(word_index, doc_index, M):
    """
    Computes the term frequency of term with index word_index in document with index doc_index
    using the data in the term document matrix M
    """
    doc_col = M.getcol(doc_index)
    return doc_col[word_index].toarray()[0, 0] / doc_col.max()

In [19]:
def compute_IDF(word_index, M):
    """
    Computes the inverse document frequency of the term with index word_index using
    the data in the term document matrix M
    """
    ni = M.getrow(word_index).count_nonzero()
    return -math.log(ni / M.shape[1] , 2)

In [20]:
terms = get_all_words()
TD = create_TD_matrix(terms, clean_courses)

In [21]:
ix_all_words = set(clean_courses[ix_id]['description'].split(' '))
ix_scores = {}

In [22]:
for word in ix_all_words:
    if word != '':
        w_id = terms.index(word)
        x = compute_TF(w_id, ix_id, TD)
        y = compute_IDF(w_id, TD)
        ix_scores.update({word : x*y})

In [23]:
sorted_scores = sorted(ix_scores, key=lambda x: ix_scores[x], reverse=True)
for k in sorted_scores[:15]:
    print("{} : {}".format(k, ix_scores[k]))

mining : 5.3904738076963925
online : 4.983204757457022
social : 4.568167258178178
world : 4.153129758899334
explore : 3.7904738076963924
networking : 3.767196384589916
hadoop : 3.4952369038481965
real : 3.295148763771762
service : 3.294098847706143
recommender : 3.261251903559734
commerce : 3.0952369038481966
ad : 2.9664656658932516
retrieval : 2.861251903559734
datasets : 2.7722949350251547
internet : 2.6952369038481963


These are the 15 terms with the highest TF-IDF score from the IX course description. A large TF-IDF score means that the term is rare across all documents but very prominent in this specific one.
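To make the interpretation of the score concrete, here is a minimal sketch on a toy count matrix. It follows the same definitions as `compute_TF` and `compute_IDF` (count divided by the largest count in the document, and minus the base-2 logarithm of the fraction of documents containing the term). The matrix and the helper names `tf` and `idf` are hypothetical and use a dense array for readability.

```python
import math
import numpy as np

# Hypothetical 3-term x 4-document count matrix (rows: terms, columns: documents).
counts = np.array([
    [2, 0, 0, 0],   # term 0 appears only in document 0
    [1, 1, 1, 1],   # term 1 appears in every document
    [4, 0, 2, 0],   # term 2 appears in documents 0 and 2
])

def tf(term, doc, M):
    # Count of the term, normalized by the largest count in the same document.
    return M[term, doc] / M[:, doc].max()

def idf(term, M):
    # Minus log2 of the fraction of documents that contain the term.
    docs_with_term = np.count_nonzero(M[term, :])
    return -math.log2(docs_with_term / M.shape[1])

# TF-IDF score of every term in document 0.
scores = [tf(t, 0, counts) * idf(t, counts) for t in range(counts.shape[0])]
print(scores)
```

Term 1 occurs in every document, so its IDF and its score are zero. Term 0 is less frequent in document 0 than term 2 (TF of 0.5 against 1), but it is rarer across documents (IDF of 2 against 1), so both end up with a score of 1.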

## Exercise 4.3: Document similarity search

In [24]:
def compute_sim(doc1, doc2_id, M):
    doc2 = M.getcol(doc2_id)
    num = doc1.transpose().dot(doc2)
    denom = np.sqrt(doc1.power(2).sum()) * np.sqrt(doc2.power(2).sum())
    return num / denom

In [25]:
facebook = csr_matrix(([1], ([terms.index('facebook')], [0])), shape=(len(terms), 1), dtype=int)

In [26]:
markov_chains = csr_matrix(([1, 1], ([terms.index('markov'), terms.index('chain')], [0, 0])), shape=(len(terms), 1), dtype=int)

In [27]:
similarities_fb = {}
for i in range(len(clean_courses)):
    similarities_fb.update({i : compute_sim(facebook, i, TD)})
    
similarities_mc = {}
for j in range(len(clean_courses)):
    similarities_mc.update({j : compute_sim(markov_chains, j, TD)})

len(similarities_fb), len(similarities_mc)

(854, 854)

In [28]:
sorted_sim_fb = sorted(similarities_fb, key=lambda x: similarities_fb[x], reverse=True)
for k in sorted_sim_fb[:5]:
    print("{} : {}".format(clean_courses[k]['name'], similarities_fb[k]))

Computational Social Media :   (0, 0)	0.103142124626
Composites technology : 
Image Processing for Life Science : 
Global business environment : 
Electrochemical nano-bio-sensing and bio/CMOS interfaces : 


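Here we can see that only one course has a non-zero similarity with the query 'Facebook', because it is the only course containing this term. The four other results are printed as empty values because `compute_sim` returns a 1x1 sparse matrix, which has no stored entry when the similarity is zero.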

In [29]:
sorted_sim_mc = sorted(similarities_mc, key=lambda x: similarities_mc[x], reverse=True)
for k in sorted_sim_mc[:5]:
    print("{} : {}".format(clean_courses[k]['name'], similarities_mc[k]))

Applied stochastic processes :   (0, 0)	0.542609516234
Applied probability & stochastic processes :   (0, 0)	0.437927872963
Markov chains and algorithmic applications :   (0, 0)	0.352864873113
Supply chain management :   (0, 0)	0.3478327965
Statistical Sequence Processing :   (0, 0)	0.243733339111


Here we see that at least 5 courses have a non-zero similarity with the query 'Markov chains', as multiple courses contain at least one of those terms.
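The cosine similarity used for these rankings can also be computed for all documents at once, returning plain floats instead of 1x1 sparse matrices. The following is only a sketch on a hypothetical toy matrix; `cosine_scores` is not a function from this notebook, and the query is assumed to be a dense vector.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-term x 3-document count matrix.
M = csr_matrix(np.array([
    [1, 0, 2],
    [0, 3, 0],
    [1, 0, 0],
], dtype=float))

# Query containing term 0 only, as a dense vector.
query = np.array([1.0, 0.0, 0.0])

def cosine_scores(query, M):
    # Dot product of the query with every document column at once.
    dots = M.T.dot(query)
    # Euclidean norm of each document column.
    doc_norms = np.sqrt(np.asarray(M.multiply(M).sum(axis=0)).ravel())
    denom = np.linalg.norm(query) * doc_norms
    # Documents with a zero norm get a score of 0 instead of a division error.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

scores = cosine_scores(query, M)
# Document indices sorted from most to least similar.
ranking = np.argsort(-scores)
print(scores, ranking)
```

Document 2 contains only the query term, so its similarity is 1. Document 0 contains it alongside another term and scores lower, and document 1 does not contain it at all and scores 0.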

## Appendix

The code below creates the TF-IDF matrix and stores it together with `clean_courses` and `terms`, so that they can be used in the other notebooks.

In [30]:
rows, cols, vals = scipy.sparse.find(TD)

In [31]:
IDF = np.array([compute_IDF(i, TD) for i in range(len(terms))])

In [32]:
data = np.zeros(len(vals))

for i in range(len(vals)):
    data[i] = compute_TF(rows[i], cols[i], TD) * IDF[rows[i]]

TFIDF = csr_matrix((data, (rows, cols)), shape=TD.shape, dtype=float)

In [33]:
!rm tfidf.npz
np.savez('tfidf', data = TFIDF.data ,indices=TFIDF.indices,
             indptr=TFIDF.indptr, shape=TFIDF.shape)

!rm term-document-matrix.npz
np.savez('term-document-matrix', data = TD.data ,indices=TD.indices,
             indptr=TD.indptr, shape=TD.shape)

!rm terms.txt
with open('terms.txt', 'w') as f:
    for term in terms:
        f.write(term + '\n')

import json
!rm courses.txt
with open('courses.txt', 'w') as f:
    for course in clean_courses:
        f.write(course['name'] + '\n')

rm: cannot remove ‘term-document-matrix.npz’: No such file or directory
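For reference, a matrix stored this way could be read back in another notebook roughly as follows. This is a sketch: `load_csr` is a hypothetical helper, and the round trip is shown on a small made-up matrix written to a temporary directory rather than on the files created above.

```python
import os
import tempfile
import numpy as np
from scipy.sparse import csr_matrix

def load_csr(path):
    # Rebuild the CSR matrix from the four arrays stored with np.savez.
    with np.load(path) as loader:
        return csr_matrix(
            (loader['data'], loader['indices'], loader['indptr']),
            shape=tuple(loader['shape']))

# Round trip on a small hypothetical matrix, saved with the same keys as above.
original = csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))
path = os.path.join(tempfile.mkdtemp(), 'example.npz')
np.savez(path, data=original.data, indices=original.indices,
         indptr=original.indptr, shape=original.shape)

restored = load_csr(path)
print(restored.toarray())
```

SciPy also provides `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which store and restore a sparse matrix in a single call.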
