# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next few weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

# WEEK 5 - DATA CLEANING

## Step 1 - Data Retrieval

In [3]:
# using glob, find all the text files in the "Papers" folder
import glob
papers = glob.glob("./Papers/paper*.txt")
print(papers)

['./Papers/paper12.txt', './Papers/paper5.txt', './Papers/paper4.txt', './Papers/paper13.txt', './Papers/paper11.txt', './Papers/paper6.txt', './Papers/paper7.txt', './Papers/paper10.txt', './Papers/paper14.txt', './Papers/paper3.txt', './Papers/paper2.txt', './Papers/paper15.txt', './Papers/paper0.txt', './Papers/paper1.txt', './Papers/paper16.txt', './Papers/paper9.txt', './Papers/paper8.txt']


In [4]:
# get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

# "papers" is the list of file paths found with glob above
for filename in papers:
    with open(filename, "r", encoding="utf-8") as f:
        documents.append(f.read())

In [11]:
# print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
print(documents[0][:1000])

## Step 2 - Data Cleaning

In [12]:
# only select the text that's between the first occurrence of the
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

for i,doc in enumerate(documents):
    print(len(documents[i]), end=' ')
    # only keep the text between "abstract" and the last "reference"
    # (note: index() raises a ValueError if "abstract" is missing)
    doc = doc[doc.index('abstract'):doc.rfind('reference')]
    # save the result
    documents[i] = doc
    # print the length of the resulting string
    print(len(documents[i]))
    
# one liner:
# documents = [doc[doc.index('abstract'):doc.rfind('reference')] for doc in documents]
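`str.index` raises a `ValueError` when the marker is missing, and the search is case-sensitive. Here is a minimal sketch of a more forgiving version. The helper name `extract_body` is hypothetical (it is not part of the notebook), and it assumes that falling back to the full text is acceptable and that lowercasing does not change the length of the string (true for ASCII text):

```python
def extract_body(text, start_marker="abstract", end_marker="reference"):
    """Return the text between the first start_marker and the last end_marker.

    Hypothetical helper: matching is case-insensitive, and a missing marker
    falls back to the start/end of the text instead of raising an error.
    """
    lowered = text.lower()
    start = lowered.find(start_marker)   # -1 if the marker is absent
    end = lowered.rfind(end_marker)      # -1 if the marker is absent
    if start == -1:
        start = 0
    if end == -1 or end < start:
        end = len(text)
    return text[start:end]


sample = "Title page. Abstract: we study X. See reference [1]. References: [1] ..."
body = extract_body(sample)
```

Unlike `index`, `find` and `rfind` return -1 when nothing matches, which is why both cases are checked explicitly before slicing.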

In [13]:
# replace newline characters (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

documents = [doc.replace('\n', ' ') for doc in documents]
print(documents[0][:1000])


In [14]:
# replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']


# remove punctuation
for i,doc in enumerate(documents): 
    for punc in punctuation: 
        doc = doc.replace(punc, ' ')
    documents[i] = doc
    
print(documents[0][:1000])
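Calling `replace` once per punctuation mark scans the document many times. As a sketch of an alternative, `str.translate` can blank out all the characters in a single pass. The set of characters below (ASCII punctuation plus a few typographic marks from the list above) is an illustrative choice, and the names `chars_to_blank` and `table` are made up for this example:

```python
import string

# map every ASCII punctuation character, plus a few typographic ones,
# to a single space (the extra characters are an illustrative choice)
chars_to_blank = string.punctuation + "−”“’"
table = str.maketrans({ch: " " for ch in chars_to_blank})

sample = "mind-wandering (defined as 'shifts'), is ubiquitous."
no_punct = sample.translate(table)
```

Each punctuation character becomes exactly one space, so the string keeps its length and the extra spaces disappear later when we call `split()`.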


In [15]:
# replace numbers with either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing
for i,doc in enumerate(documents): 
    for num in range(10):
        # use a space (not an empty string) so neighboring words don't merge
        doc = doc.replace(str(num), ' ')
    documents[i] = doc

print(documents[0][:1000])
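As a sketch of another way to handle digits, a regular expression can replace each run of digits in one call. The sample sentence is invented for illustration:

```python
import re

sample = "we recruited 104 participants in 2016 (d = 0.47)"
# replace each run of consecutive digits with a single space
no_digits = re.sub(r"\d+", " ", sample)
```

To keep a trace of where numbers appeared, the replacement string could be `" number "` instead of a space.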


In [8]:
# Remove the stop words below from our documents
# print the first 1000 characters of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']


# remove stop words
for i,doc in enumerate(documents):
    for stop_word in stop_words:
        doc = doc.replace(' ' + stop_word + ' ', ' ')
    documents[i] = doc

print(documents[0][:1000])

abstract mind wandering  defined shifts attention task related processing task unrelated thoughts  ubiquitous phenomenon negative influence performance productivity many contexts  including learning  propose next generation learning technologies mechanism detect respond mind wandering real time  towards end  developed technology automatically detects mind wandering eye gaze learning instructional texts  mind wandering detected  technology intervenes posing time questions encouraging re reading needed  multiple rounds iterative refinement  summatively compared technology yoked control experiment  participants  key dependent variable performance post reading comprehension assessment  results suggest technology successful correcting comprehension deficits attributed mind wandering  d     sigma  specific conditions  thereby highlighting potential improve learning  attending attention    keywords mind wandering  gaze tracking  student modeling  attentionaware     introduction despite best e
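The `replace`-based loop only matches a stop word that has a space on both sides, so it misses a stop word at the very start or end of a document, and the second of two identical stop words in a row. A minimal alternative sketch works on tokens instead. The helper name `remove_stop_words` is hypothetical, and the stop-word list is shortened for illustration:

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that is a stop word."""
    stop_set = set(stop_words)  # set membership checks are fast
    kept = [token for token in text.split() if token not in stop_set]
    return " ".join(kept)


# shortened, illustrative stop-word list (not the full list above)
demo_stop_words = ["the", "of", "a", "is"]
demo = "the mind of the the reader is a wandering mind"
cleaned = remove_stop_words(demo, demo_stop_words)
```

A side effect of splitting and re-joining is that runs of several spaces collapse into one.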

In [16]:
# remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

for i,doc in enumerate(documents):  
    doc = [x for x in doc.split() if len(x) > 2]
    doc = " ".join(doc)
    documents[i] = doc

print(documents[0][:1000])



### Putting it all together

In [17]:
# package all of your work above into a function that cleans a given document

def clean_list_of_documents(documents):
    
    cleaned_docs = []

    for doc in documents:
        # only keep the text after the abstract
        doc = doc[doc.index('abstract'):]
        # only keep the text before the references
        doc = doc[:doc.rfind('reference')]
        # replace newlines with white space
        doc = doc.replace('\n', ' ')
        # remove punctuation
        for punc in punctuation: 
            doc = doc.replace(punc, ' ')
        # remove numbers
        for digit in range(10):
            doc = doc.replace(str(digit), ' ')
        # remove stop words
        for stop_word in stop_words:
            doc = doc.replace(' ' + stop_word + ' ', ' ')
        # remove words with one or two characters
        doc = [x for x in doc.split() if len(x) > 2]
        doc = " ".join(doc)
        # save the result to our list of documents
        cleaned_docs.append(doc)
        
    return cleaned_docs

In [11]:
# re-import your raw data
documents = []

# "papers" is the list of file paths found with glob in Step 1
for filename in papers: 
    with open(filename, "r", encoding='utf-8') as f:
        documents.append(f.read())
        
# clean your files using the function above
docs = clean_list_of_documents(documents)

# print the first 1000 characters of the first document
print(docs[0][:1000])

abstract mind wandering defined shifts attention task related processing task unrelated thoughts ubiquitous phenomenon negative influence performance productivity many contexts including learning propose next generation learning technologies mechanism detect respond mind wandering real time towards end developed technology automatically detects mind wandering eye gaze learning instructional texts mind wandering detected technology intervenes posing time questions encouraging reading needed multiple rounds iterative refinement summatively compared technology yoked control experiment participants key dependent variable performance post reading comprehension assessment results suggest technology successful correcting comprehension deficits attributed mind wandering sigma specific conditions thereby highlighting potential improve learning attending attention keywords mind wandering gaze tracking student modeling attentionaware introduction despite best efforts write clear engaging paper ch

## Step 3 - Build your vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.
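To make the link between vocabulary and columns concrete, here is a small sketch on two made-up documents. The names `toy_docs`, `toy_vocabulary`, and `column_index` are invented for this example:

```python
toy_docs = ["mind wandering detected", "gaze tracking detected mind"]

# collect the unique words across all documents, then sort alphabetically
toy_vocabulary = sorted({word for doc in toy_docs for word in doc.split()})

# each word's position in the sorted list is its column index in the matrix
column_index = {word: j for j, word in enumerate(toy_vocabulary)}
```

With five unique words, the matrix for these two documents would have two rows and five columns.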

In [18]:
import math
import numpy as np

In [19]:
# create a function that takes in a list of documents
# and returns a list of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(docs):
    # a set keeps only one copy of each word
    voc = set()
    for doc in docs:
        voc.update(doc.split())
    # sorted() returns an alphabetically ordered list
    return sorted(voc)

# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(docs)
print(len(vocabulary))


## Step 4 - Transform your documents into 100-word chunks

In [24]:
# create a function that takes in a list of documents
# and returns a list of 100-word chunks
# (with a 25-word overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap

def flatten_and_overlap(docs, window_size=100, overlap=25):
    
    # create the list of overlapping documents
    new_list_of_documents = []
    
    # flatten everything into one list of words; joining with a space
    # keeps the last word of one doc from merging with the next doc
    flat = " ".join(docs).split()

    # create chunks of window_size words; each new chunk starts
    # (window_size - overlap) words after the previous one, so that
    # consecutive chunks share `overlap` words
    high = window_size
    while high <= len(flat):
        low = high - window_size
        new_list_of_documents.append(flat[low:high])
        high += window_size - overlap
    return new_list_of_documents

chunks = flatten_and_overlap(docs)
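To see how the overlap relates to the distance between the starts of two windows, here is a sketch on a toy list of integers. The helper `sliding_chunks` is hypothetical; it assumes that consecutive chunks should share exactly `overlap` items and that incomplete windows at the end are dropped:

```python
def sliding_chunks(words, window_size, overlap):
    """Return full windows of window_size items that share `overlap` items."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be >= 0 and smaller than window_size")
    step = window_size - overlap          # distance between window starts
    last_start = len(words) - window_size
    return [words[low:low + window_size] for low in range(0, last_start + 1, step)]


toy_words = list(range(10))
toy_chunks = sliding_chunks(toy_words, window_size=4, overlap=1)
```

With a window of 4 and an overlap of 1, each window starts 3 items after the previous one, and the last item of one chunk is the first item of the next.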


In [25]:
# create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check
for chunk in chunks: 
    assert len(chunk) == 100


# WEEK 6 - VECTOR MANIPULATION

## Step 5 - Create a word by document matrix

In [26]:
# 1) create an empty dataframe using pandas
# the number of rows should be the number of documents we have
# the number of columns should be the size of the vocabulary
import numpy as np 
import pandas as pd
df = pd.DataFrame(0, index=np.arange(len(chunks)), columns=vocabulary)


In [23]:
# 2) fill out the dataframe with the count of words for each document
# (use two for loops to iterate through the documents and the vocabulary)
for i,chunk in enumerate(chunks): 
    for word in chunk: 
        if word in df.columns: 
            df.loc[i, word] += 1
    if i % 100 ==0: 
        print(i, end=' ')
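Updating one dataframe cell at a time can get slow with thousands of chunks. As a sketch of an alternative, we can count the words of each chunk with `collections.Counter` and build each row in one go. The toy vocabulary and chunks below are made up for illustration:

```python
from collections import Counter

toy_vocabulary = ["detected", "gaze", "mind", "wandering"]
toy_chunks = [["mind", "wandering", "mind"], ["gaze", "detected", "unknown"]]

# one row of counts per chunk, one column per vocabulary word
count_rows = []
for chunk in toy_chunks:
    counts = Counter(chunk)  # word -> number of occurrences in this chunk
    # words missing from the chunk count as 0; words outside the vocabulary are ignored
    count_rows.append([counts[word] for word in toy_vocabulary])
```

The resulting list of rows can be turned into a dataframe with `pd.DataFrame(count_rows, columns=toy_vocabulary)`.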


In [4]:
# 3) Sanity check: make sure that your counts are correct
# (e.g., if you know that a word appears often in a document, check that
# the number is also high in your dataframe; and vice-versa for low counts)
df.loc[0, 'defined']

In [5]:
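# 4) Putting it together: create a function that takes a list of documents
# and a vocabulary as arguments, and returns a dataframe with the counts
# of words: 
def docs_by_words_matrix(documents, voc): 
    # one row per document (chunk), one column per vocabulary word
    df = pd.DataFrame(0, index=np.arange(len(documents)), columns=voc)
    known_words = set(voc)
    for i,doc in enumerate(documents):
        # each document is a list of words (see flatten_and_overlap)
        for word in doc: 
            if word in known_words: 
                df.loc[i, word] += 1
    return df
    
# call the function and check that the resulting dataframe is correct
df = docs_by_words_matrix(chunks, vocabulary)
print(df.loc[0, 'defined'])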


## Step 6 - Weight word frequency

In [22]:
# 5) create a function that takes the log of the current cell and adds one
# (1 + log[count]) IF the value in the cell is not zero
def logadd(cell):
    if cell != 0: 
        # dampen non-zero counts
        return 1 + math.log(cell)
    # leave zero counts unchanged
    return cell
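To get a feel for how much the weighting dampens large counts, here is a small worked sketch. It assumes the natural logarithm (the roadmap does not specify a base), and the helper name `dampen` is made up for this example:

```python
import math


def dampen(count):
    """Return 1 + ln(count) for positive counts, and 0 for a count of 0."""
    return 1 + math.log(count) if count > 0 else 0.0


raw_counts = [0, 1, 2, 10, 100]
dampened = [round(dampen(c), 3) for c in raw_counts]
```

A word that appears 100 times ends up with a weight of about 5.6 rather than 100, so very frequent words no longer dominate the vector.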



In [7]:
# 6) use the "applymap" function of the dataframe to apply the function 
# above to each cell of the table
# (in pandas 2.1 and later, "applymap" is deprecated in favor of "map")
df_weight = df.applymap(logadd)

In [1]:
# 7) check that the numbers in the resulting matrix look accurate;
# print the value before and after applying the function above
print(df.head())
print(df_weight.head())



## Step 7 - Matrix normalization

In [70]:
# 8) look at the image below; why do you think that we need to normalize our 
# data before clustering in this particular case? 


# We need to normalize our data because cluster 1 and cluster 2 are fairly distinct,
# and we want to be able to compare them with each other

<img src="https://i.stack.imgur.com/N2unM.png" />

In general, it's common practice to normalize your data before clustering - so that variables are comparable.

In [54]:
# 9) describe how the min-max normalization works:

# It linearly rescales x to y so that y = 0 when x = min and y = 1 when x = max:
# y = (x - min) / (max - min). The resulting values range from 0 to 1.

<img src="https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/media/aml-normalization-minmax.png" />
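As a quick worked sketch of min-max normalization on a plain list of numbers (the helper name `min_max_scale` is made up for this example; sklearn's `MinMaxScaler` applies the same idea to each column):

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: there is no spread to rescale, return zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2, 4, 6, 10])
```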

In [56]:
# 10) describe how normalizing using a z-score works:

# It subtracts the mean from each value and divides by the standard deviation,
# so the result has a mean of 0 and an SD of 1

<img src="https://cdn-images-1.medium.com/max/1600/1*13XKCXQc7eabfZbRzkvGvA.gif"/>
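Here is a small worked sketch of z-score normalization using only the standard library. The helper name `z_scores` is made up for this example, and it uses the population standard deviation, which is also what sklearn's `StandardScaler` uses:

```python
import statistics


def z_scores(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # all values identical: every z-score is 0
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# this sample has a mean of 5 and a population SD of 2
standardized = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

The value 2 sits 1.5 standard deviations below the mean, and the value 9 sits 2 standard deviations above it.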

In [None]:
# 11) describe how normalizing to unit norm works

# To normalize to unit norm, you divide the vector by its length (its norm),
# so that the resulting vector has a length of 1
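As a short worked sketch with the Euclidean (L2) norm, which is the default norm of sklearn's `Normalizer` (the helper name `to_unit_norm` is made up for this example):

```python
import math


def to_unit_norm(vector):
    """Divide a vector by its Euclidean (L2) length."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)  # the zero vector has no direction to preserve
    return [x / length for x in vector]


# the vector (3, 4) has length 5
unit = to_unit_norm([3.0, 4.0])
```

The direction of the vector is unchanged; only its length is rescaled to 1.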

Resources: 
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer
* http://mathworld.wolfram.com/NormalVector.html

We are going to work with some pre-made normalization functions from sklearn (feel free to skim this page):
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

In [9]:
# 12) since we are working with vectors, apply the Normalizer from 
# sklearn.preprocessing to our dataframe. Print a few values 
# before and after to make sure you've applied the normalization
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
# Normalizer rescales each row (document vector) to unit length
df_norm = pd.DataFrame(scaler.fit_transform(df_weight), columns=vocabulary)
print(df_weight.loc[0, 'defined'], df_norm.loc[0, 'defined'])



In [10]:
# 13) create a function that takes a dataframe as its first argument and the
# type of normalization (MinMaxScaler, Normalizer, StandardScaler) as its 
# second argument, and returns the normalized dataframe
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

def normalize_df(df, type_of_norm): 
    # map each supported name to its sklearn transformer class
    scalers = {"MinMaxScaler": MinMaxScaler,
               "Normalizer": Normalizer,
               "StandardScaler": StandardScaler}
    if type_of_norm not in scalers:
        raise ValueError("unknown type of normalization: " + str(type_of_norm))
    transformer = scalers[type_of_norm]()
    # keep the original row and column labels in the result
    return pd.DataFrame(transformer.fit_transform(df),
                        index=df.index, columns=df.columns)


## Step 8 - Deviation Vectors

<img src="https://www.dropbox.com/s/9f73r7pk7bi7vh9/deviation_vectors.png?dl=1" />

In [11]:
# 14) compute the sum of the vectors
import math 
import numpy as np
# add up all the rows (document vectors), component by component
v_sum = df_norm.to_numpy().sum(axis=0)

In [12]:
# 15) normalize the vector (this gives the direction of the average vector)
v_avg = v_sum / np.linalg.norm(v_sum)


In [13]:
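# 16) take each vector and subtract its component along v_avg,
# then rescale what is left to unit length
dev_matrix = []
for row in df_norm.to_numpy():
    # remove the part of the row that points along v_avg
    dev_vec = row - np.dot(row, v_avg) * v_avg
    # normalize the deviation vector (all-zero vectors are left as is)
    norm = np.linalg.norm(dev_vec)
    dev_matrix.append(dev_vec / norm if norm > 0 else dev_vec)
dev_matrix = np.array(dev_matrix)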
    

In [14]:
# 17) put the code above in a function that takes in a dataframe as an argument
# and computes deviation vectors of each row (=document)
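One possible way to package these steps is sketched here. This is an illustration under assumptions, not a reference solution: the name `deviation_vectors` is hypothetical, it takes a 2-D NumPy array rather than a dataframe (a dataframe can be passed as `df_norm.to_numpy()`), and it assumes the rows do not sum to the zero vector:

```python
import numpy as np


def deviation_vectors(matrix):
    """Sketch: remove each row's component along the average direction.

    `matrix` is a 2-D array with one row per document; the rows are
    assumed not to sum to the zero vector.
    """
    matrix = np.asarray(matrix, dtype=float)
    v_sum = matrix.sum(axis=0)
    v_avg = v_sum / np.linalg.norm(v_sum)      # unit-length average direction
    projections = matrix @ v_avg               # one dot product per row
    deviations = matrix - np.outer(projections, v_avg)
    norms = np.linalg.norm(deviations, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                    # leave all-zero rows as zeros
    return deviations / norms, v_avg


toy = np.array([[1.0, 0.0], [0.0, 1.0]])
dev, avg = deviation_vectors(toy)
```

Each returned row has unit length and is perpendicular to the average direction, which is what subtracting the component along `v_avg` is meant to achieve.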


# WEEK 7 - CLUSTERING

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [17]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-word overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization