# GloVe vectors (Tilo)

I used [this](https://jsomers.net/glove-codenames/) blog post as a guide. It's a fun and easy read about using GloVe vectors to play the popular board game Codenames, so if that sounds interesting, check it out. I use some code snippets from this blog post.
 
I downloaded my GloVe vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip. This file has vectors for more than a million words, so I just used the vectors from the top 100,000 words.

If you want to skip ahead and see the recommender in action, [click this link](#Testing-out-the-recommender). It seems to work well.

In [1]:
import pandas as pd
import numpy as np
from scipy import spatial
import string
from nltk.corpus import stopwords


#df =pd.read_csv("courses_raw.csv", error_bad_lines=False)
df = pd.read_json("data_5scheduler.json")

In [2]:
embeddings = {}
with open("./top_100000.txt", 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings[word] = vector
        
words_with_embeddings = set(embeddings)

In [3]:
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def calculate_description_embeding(description):
    '''Returns the mean GloVe vector of the words in a description, or None if no word has an embedding.'''
    
    # clean description
    description = remove_punctuation(description).lower().strip()
    words = description.split()  # split on any whitespace, not just single spaces
    stops = set(stopwords.words('english'))  # needs nltk.download('stopwords') to have been run once
    
    # filter out stop words and words we don't have embeddings for
    words = [w for w in words if w not in stops and w in words_with_embeddings]
    
    if len(words) == 0:
        return None
    
    # calculate embedding (mean of the word vectors) and return
    return sum(embeddings[w] for w in words) / len(words)
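
To make the averaging step concrete, here is a minimal sketch on made-up data. The 3-dimensional vectors and the helper name `mean_embedding` are assumptions for illustration only; real GloVe vectors have 300 dimensions, and the notebook's own function also removes punctuation and stop words first.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real GloVe embeddings (made-up values).
toy_embeddings = {
    "linear": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "algebra": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def mean_embedding(words, embeddings):
    """Average the vectors of the words that have an embedding; None if there are none."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)

# "zzz" has no vector, so it is ignored and only two vectors are averaged.
toy_vector = mean_embedding(["linear", "algebra", "zzz"], toy_embeddings)
print(toy_vector)  # [2. 1. 1.]
```

Words without a vector simply drop out of the average, which is why a description made up entirely of unknown words ends up with no embedding at all.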

In [4]:
df["description embeddings"] = [calculate_description_embeding(desc) for desc in df["description"]]

In [5]:
description_embeddings = df[["title","description embeddings"]].set_index("title").dropna().to_dict()['description embeddings']

In [6]:
len(description_embeddings)

3687

In [7]:
def recommend(title):
    '''Finds the 10 closest courses to a given course (the course itself included) by ranking description embeddings by cosine distance.'''
    
    def distance(title, reference):
        return spatial.distance.cosine(description_embeddings[title], description_embeddings[reference])

    def closest_courses(reference):
        return sorted(description_embeddings.keys(), key=lambda w: distance(w, reference))
    
    return closest_courses(title)[:10]
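
For reference, `spatial.distance.cosine` returns the cosine distance, that is, one minus the cosine similarity, so smaller values mean more similar descriptions. The sketch below spells that out with plain NumPy on made-up 2-dimensional vectors; `cosine_distance`, `rank_by_distance`, and the toy course names are hypothetical and not part of the notebook.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos(angle between u and v): 0 means same direction, 1 means orthogonal."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 2-dimensional course vectors (made-up values, not real embeddings).
toy_courses = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([2.0, 0.1]),
    "C": np.array([0.0, 1.0]),
}

def rank_by_distance(reference, vectors):
    """Sort all titles from closest to farthest from the reference course."""
    return sorted(vectors, key=lambda t: cosine_distance(vectors[t], vectors[reference]))

ranking = rank_by_distance("A", toy_courses)
print(ranking)  # ['A', 'B', 'C']
```

The reference course has distance 0 to itself, so it always comes first in the ranking. Vector length does not matter here: "B" is twice as long as "A" but points in almost the same direction, so it ranks as very close.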

# Testing out the recommender
Here I test the recommender on some classes I've taken or am taking. The first result is always the course itself, since its distance to itself is 0.

In [16]:
recommend('Linear Algebra')

['Linear Algebra',
 'Linear Algebra with Computing',
 'Mathematical Methods of Physics',
 'Precalculus',
 'Calculus with Precalculus',
 'Mathematical Analysis II',
 'Engineering Mathematics',
 'Single and Multivariable Calculus',
 'Fourier Series and Boundary Value Problems',
 'Scientific Computing']

In [14]:
recommend('Language and Gender')

['Language and Gender',
 'Language and Globalization',
 'Morphosyntax',
 'The Socialization of Gender: A Developmental Perspective',
 'Language and Society',
 'Language in Society',
 'Chinese Language in Society',
 'Language, Identity and Violence',
 'Language and Power',
 'Introduction to the Study of Language']

Even though Econometrics is in the econ department, it's really a stats class, and the recommender seems to pick up on this.

In [15]:
recommend('Econometrics')

['Econometrics',
 'Statistical Inference',
 'Applied Statistics',
 'Bayesian Statistics',
 'Calculus and Discrete Models for Applications',
 'Methods in Modern Modeling',
 'Representations of High-Dimensional Data',
 'Time Series',
 'Computational Statistics',
 'Differential Equations and Modeling']

# Some stuff I'd like to fix about the recommender
I think it's working pretty well. However, here's a list of things I'd like to fix about the recommender:
- Handle classes that have the same name but are different; currently, if two classes have the same name, only one of their descriptions is used. (Or at least I think that's what's going on with duplicate course names.)
- Handle hyphenated words better; when I calculate description embeddings I strip out all punctuation (e.g. "single-gender" becomes "singlegender").
- Investigate why some words don't have embeddings. For example, "singlegender" doesn't have an embedding, but that could potentially be fixed if it were treated as two words or if the hyphen were not removed.
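
One possible approach to the hyphen problem is to replace punctuation with spaces instead of deleting it, so that "single-gender" becomes the two words "single" and "gender". This is only a sketch of that idea, not what the notebook currently does, and `tokenize_keep_hyphen_parts` is a hypothetical name.

```python
import string

# Map every punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize_keep_hyphen_parts(text):
    """Lowercase the text, turn punctuation into spaces, and split on whitespace."""
    return text.translate(PUNCT_TO_SPACE).lower().split()

tokens = tokenize_keep_hyphen_parts("A single-gender classroom, revisited.")
print(tokens)  # ['a', 'single', 'gender', 'classroom', 'revisited']
```

The trade-off is that apostrophes are split too ("don't" becomes "don" and "t"), so contractions would need separate handling if they matter.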

# Go test it out yourself!


To get the code running you'll need a file called "top_100000.txt" with the top 100,000 GloVe vectors. You can get the GloVe vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip. Be warned though, the unzipped file is about 5 GB. After you've got the unzipped file, run this in your terminal to get the "top_100000.txt" file.
```
head -n 100000 glove.42B.300d.txt > top_100000.txt
```
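
If `head` isn't available (for example on Windows), something like the following Python sketch should produce the same file. The function name `write_top_lines` is made up for illustration, and the UTF-8 encoding is an assumption about the GloVe text file.

```python
from itertools import islice

def write_top_lines(src_path, dst_path, n=100_000):
    """Copy the first n lines of src_path to dst_path without loading the whole file."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice reads lazily, so only n lines of the large file are ever touched.
        dst.writelines(islice(src, n))

# Example call (paths assumed to match the ones used in this notebook):
# write_top_lines("glove.42B.300d.txt", "top_100000.txt")
```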
Feel free to play around with the recommender. If you run the cell below, it will give you a list of all the courses that have embeddings.

In [None]:
list(description_embeddings)  # titles of all courses that have an embedding