# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *O*

**Names:**

* *Argelaguet Franquelo, Pau*
* *du Bois de Dunilac, Vivien*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
import string
import collections
import operator
import math

from functools import reduce
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_pkl

from nltk.stem import SnowballStemmer

In [2]:
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [3]:
# Initializing the stemmer for later use
stemmer = SnowballStemmer("english")

In [4]:
# Checks if given word is a number
def is_number(word):
    try:
        float(word)
        return True
    except ValueError:
        return False

    
# Returns True if the word passes all filters and should be kept, False otherwise
def filter_word(word):
    # Removing empty and single-character words
    if len(word) < 2:
        return False
    # Removing words in stopwords
    if word in stopwords:
        return False
    # Removing words consisting of a punctuation sign
    if word in string.punctuation:
        return False
    # Removing numbers
    if is_number(word):
        return False
    return True


# Normalizes the word into a common form that is easier to group
def clean_word(word):
    # Removing punctuation signs from word
    word = "".join(c for c in word if c not in string.punctuation)
    # Transforming word to lowercase
    word = word.lower()
    # Stemming
    word = stemmer.stem(word)
    return word
    

def get_bag_of_words(text):
    # Splitting the text on whitespace, cleaning each word and then filtering
    words = list(filter(filter_word, map(clean_word, text.split())))
    
    # Adding bigrams
    bigrams = list(map(lambda x: x[0] + " " + x[1], zip(words, words[1:])))
    
    # Creating bag-of-words counters
    bow = collections.defaultdict(int)
    for w in words + bigrams:
        bow[w] += 1
    
    # Removing terms that appear only once in the text
    bow = {k: v for k, v in bow.items() if v > 1}
            
    return dict(collections.OrderedDict(sorted(bow.items())))

In [5]:
# Creates a dictionary of dictionaries of the form 
# {'courseId' -> {name, description}}
dat = {
    x.get('courseId'): {
        'name': x.get('name'),
        'description': get_bag_of_words(x.get('description'))
    } for x in courses
}

In [6]:
# List of all bags of words
list_terms = [x.get('description') for x in dat.values()]

# Dict of all terms and their frequency
term_freqs = collections.defaultdict(int)
for l in list_terms:
    for k, v in l.items():
        term_freqs[k] += v
term_freqs = dict(collections.OrderedDict(
    sorted(term_freqs.items(), key=operator.itemgetter(1), reverse=True)))

# Set of terms which appear more than 400 times
top_terms = {k for k, v in term_freqs.items() if v > 400}

In [7]:
# Removing terms that are in top_terms from every document
for k, v in dat.items():
    dat[k]['description'] = {x: y for x, y in v.get('description').items() 
                             if x not in top_terms}
    
# Keeping only documents with a non-empty description
dat = {k: v for k, v in dat.items() if len(v.get('description')) > 0}

For the preprocessing task, the following steps have been implemented:

* Removal of stopwords, using the provided list, because they carry no information about the document.
* Removal of punctuation signs, both standalone and inside words, because they carry no meaning.
* Removal of terms that occur only once in a course description.
* Removal of very frequent terms (more than 400 occurrences in the corpus), because they appear in most documents and therefore do not help to differentiate one document from another.
* Removal of numbers, because they carry no special meaning by themselves.
* Removal of short words (length 1).
* Lowercasing and stemming, so that similar words are grouped and counted as one.
* Addition of bigrams, because some pairs of words have a relevant meaning together.
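
As an illustration of the bigram and counting steps listed above, the following minimal sketch builds a bag of words from an already cleaned token list. It is a simplified stand-in for `get_bag_of_words`: stemming and stopword removal are left out, and the names `toy_bag_of_words` and `min_count` are hypothetical.

```python
import collections


def toy_bag_of_words(tokens, min_count=2):
    """Count unigrams and adjacent-word bigrams, keeping frequent ones.

    `tokens` is assumed to be already lowercased and filtered.
    `min_count` is a hypothetical parameter; the notebook keeps terms
    that occur more than once, which corresponds to min_count=2.
    """
    # Pair every token with its successor to form bigrams
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    counts = collections.Counter(tokens + bigrams)
    # Keep only terms that reach the minimum count, sorted by term
    return {t: c for t, c in sorted(counts.items()) if c >= min_count}


# Made-up token list, for illustration only
tokens = ["social", "network", "data", "social", "network", "graph"]
bow = toy_bag_of_words(tokens)
print(bow)
```

Here only `social`, `network` and the bigram `social network` occur twice, so they are the only terms kept.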

In [8]:
print("Internet Analytics terms after preprocess:")
dat['COM-308']['description']

Internet Analytics terms after preprocess:


{'ad': 2,
 'ad auction': 2,
 'algebra': 2,
 'algorithm': 2,
 'analyt': 2,
 'applic': 2,
 'auction': 2,
 'base': 2,
 'class': 3,
 'cluster': 2,
 'cluster communiti': 2,
 'communiti': 2,
 'communiti detect': 2,
 'comput': 2,
 'data': 6,
 'data mine': 3,
 'dataset': 2,
 'detect': 2,
 'ecommerc': 2,
 'explor': 5,
 'graph': 2,
 'hadoop': 2,
 'homework': 2,
 'inform': 2,
 'internet': 2,
 'lab': 3,
 'largescal': 3,
 'linear': 2,
 'linear algebra': 2,
 'machin': 2,
 'machin learn': 2,
 'mine': 3,
 'network': 4,
 'network ecommerc': 2,
 'number': 2,
 'onlin': 5,
 'onlin servic': 2,
 'practic': 2,
 'problem': 2,
 'realworld': 4,
 'recommend': 3,
 'recommend system': 2,
 'relat': 2,
 'servic': 3,
 'session': 2,
 'social': 5,
 'social network': 3,
 'stream': 2,
 'stream comput': 2,
 'system cluster': 2}

## Exercise 4.2: Term-document matrix

In [9]:
# Creating a sorted list of all terms in the dataset.
terms = set()
for l in [list(x.get('description').keys()) for x in dat.values()]:
    terms = terms.union(set(l))
terms = sorted(terms)

In [10]:
# Creating a sorted list of all documents in the dataset
documents = sorted(list(dat.keys()))

In [11]:
M = len(terms)
N = len(documents)

In [12]:
# Mapping from term to index in terms list
term_idx = {x: i for i, x in enumerate(terms)}

# Mapping from document to index in documents list
doc_idx = {x: i for i, x in enumerate(documents)}

In [13]:
# Calculating IDF for all terms in dataset
occs = collections.defaultdict(float)
for k, v in dat.items():
    for x in v.get('description').keys():
        occs[x] += 1
idf = {k: math.log(N/v) for k, v in occs.items()}

In [14]:
# Constructing matrix with TF-IDF values
values = []
rows = []
columns = []

for k, v in dat.items():
    d = v.get('description')
    max_occur = float(max(d.values()))
    for x, y in d.items():
        tf = y / max_occur
        tfidf = tf * idf[x]
        values.append(tfidf)
        rows.append(term_idx[x])
        columns.append(doc_idx[k])
        
mat = csr_matrix((values, (rows, columns)), shape=(M, N))

In [15]:
c = mat.getcol(doc_idx['COM-308'])
ts = {terms[x]: v for (x, _), v in c.todok().items()}
ts = sorted(ts.items(), key=operator.itemgetter(1), reverse=True)

print("15 terms in the description of IX with highest TF-IDF:")
ts[:15]

15 terms in the description of IX with highest TF-IDF:


[('onlin', 3.9895764523183717),
 ('explor', 3.473710445313186),
 ('social', 3.473710445313186),
 ('realworld', 3.415975986268839),
 ('social network', 3.020127355638707),
 ('data mine', 2.8173948015846246),
 ('mine', 2.6735537653587342),
 ('largescal', 2.3269801750787615),
 ('ad auction', 2.2444672972791198),
 ('auction', 2.2444672972791198),
 ('cluster communiti', 2.2444672972791198),
 ('communiti detect', 2.2444672972791198),
 ('ecommerc', 2.2444672972791198),
 ('hadoop', 2.2444672972791198),
 ('network ecommerc', 2.2444672972791198)]

A large TF-IDF score means that the term is very representative of the document: it is frequent in the document itself but rare in the other documents of the corpus. In this case, *onlin* is strongly associated with IX and we could characterize the course using that term. *communiti* is related to the course as well, but on its own it might not be sufficient to characterize it.
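
To make the weighting concrete, here is a small sketch on a made-up three-document corpus. It assumes the same scheme as the cells above, that is $\text{tf} = \text{count} / \text{max count in the document}$ and $\text{idf} = \log(N / \text{df})$. The corpus and the helper name `toy_tfidf` are hypothetical.

```python
import math


def toy_tfidf(docs):
    """Compute TF-IDF per document for a dict {doc_id: {term: count}}.

    Mirrors the weighting used above: term frequency normalised by the
    most frequent term of the document, and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = {}
    for bow in docs.values():
        for term in bow:
            df[term] = df.get(term, 0) + 1
    idf = {term: math.log(n_docs / d) for term, d in df.items()}

    scores = {}
    for doc_id, bow in docs.items():
        max_count = max(bow.values())
        scores[doc_id] = {t: (c / max_count) * idf[t] for t, c in bow.items()}
    return scores


# Made-up corpus, for illustration only
docs = {
    "A": {"data": 4, "auction": 2},
    "B": {"data": 3, "graph": 3},
    "C": {"data": 1, "markov": 2},
}
scores = toy_tfidf(docs)
print(scores["A"])
```

A term such as `data`, which occurs in every document, gets an idf of 0 and therefore a score of 0 however frequent it is, while `auction` scores higher because it appears in a single document.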

## Exercise 4.3: Document similarity search

In [16]:
# Compute similarity between two vectors using the provided formula
def compute_sim(di, dj):
    res = (di.T).dot(dj) / (np.linalg.norm(di) * np.linalg.norm(dj))
    # The product is a 1x1 matrix, so extract its single entry explicitly
    return float(res[0, 0])


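# Get documents similar to a text query, sorted by decreasing similarity
def search(t):
    # Splitting and stemming input text into words
    search_terms = list(map(stemmer.stem, t.split()))
    bigrams = list(map(lambda x: x[0] + " " + x[1], zip(search_terms, search_terms[1:])))
    # Keeping only terms known to the corpus, so unseen words do not raise a KeyError
    search_terms = [x for x in search_terms + bigrams if x in term_idx]
    if not search_terms:
        return []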
    
    # Creating vector representing query
    v = [idf[x] for x in search_terms]
    r = [term_idx[x] for x in search_terms]
    c = [0 for i in search_terms]
    q = csr_matrix((v, (r, c)), shape=(M, 1)).todense()
    
    # Computing similarities of all columns (documents) with given query
    sims = [compute_sim(q, mat.getcol(i).todense()) for i in range(N)]
    sims = {dat[documents[i]]['name']: x for i, x in enumerate(sims) if x > 0}  
    sims = sorted(sims.items(), key=operator.itemgetter(1), reverse=True)
    return sims

In [17]:
print("Top 5 courses for markov chains:")
search('markov chains')[:5]

Top 5 courses for markov chains:


[('Applied probability & stochastic processes', 0.5511378297229665),
 ('Applied stochastic processes', 0.5452801476337338),
 ('Markov chains and algorithmic applications', 0.4693110839901864),
 ('Supply chain management', 0.2111261906979578),
 ('Mathematical models in supply chain management', 0.20031709341970322)]

In [18]:
print("Top 5 courses for facebook:")
search('facebook')[:5]

Top 5 courses for facebook:


[('Computational Social Media', 0.11732824630657128)]

For *markov chains*, we get five courses. Judging by their titles, the top three are directly related to the concept of a Markov chain, most likely because their descriptions explicitly contain both *markov* and *chain*. The two supply chain courses score much lower and probably match only on the stem *chain*. The problem with *facebook* is that we only get one course, which is the only one in the dataset whose description explicitly contains the word *facebook*. We do not get anything else, even though other courses might cover social media, because the current model does not take concepts into account, only explicit terms.
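
The limitation can be reproduced on a toy example. The sketch below uses plain Python dictionaries as sparse vectors, with made-up weights. `toy_cosine` is a hypothetical helper that follows the same cosine formula as `compute_sim`.

```python
import math


def toy_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Made-up weights, for illustration only
query = {"facebook": 1.0}
doc_with_term = {"facebook": 2.0, "social media": 2.0}
doc_without_term = {"social media": 3.0, "twitter": 1.0}

print(toy_cosine(query, doc_with_term))
print(toy_cosine(query, doc_without_term))
```

The second document is about the same topic but shares no literal term with the query, so its similarity is exactly 0 and it would never be returned by the search.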

## Exporting data for other exercises

In [19]:
save_pkl(dat, "data/preprocess.pckl")
save_pkl(mat, "data/mat.pckl")
save_pkl(terms, "data/terms.pckl")
save_pkl(documents, "data/documents.pckl")