# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we've been working on last week! Here's our roadmap:

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

# WEEK 5 - DATA CLEANING

## Step 1 - Data Retrieval

In [3]:
##Setup
# plot the graphs inline
%matplotlib inline

import os
import re
import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import defaultdict

In [2]:
# using glob, find all the text files in the "Papers" folder
import glob

files = glob.glob('./Papers/*.txt')
print(files)

['./Papers/paper0.txt', './Papers/paper1.txt', './Papers/paper10.txt', './Papers/paper11.txt', './Papers/paper12.txt', './Papers/paper13.txt', './Papers/paper14.txt', './Papers/paper15.txt', './Papers/paper16.txt', './Papers/paper2.txt', './Papers/paper3.txt', './Papers/paper4.txt', './Papers/paper5.txt', './Papers/paper6.txt', './Papers/paper7.txt', './Papers/paper8.txt', './Papers/paper9.txt']


In [3]:
# get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

for filename in files: 
    with open (filename, "r", encoding='utf-8') as f:
        documents.append(f.read())

In [4]:
# print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
documents[0][:1500]

'\x0czone out no more: mitigating mind wandering during\ncomputerized reading\nsidney k. d’mello, caitlin mills, robert bixler, & nigel bosch\nuniversity of notre dame\n118 haggar hall\nnotre dame, in 46556, usa\nsdmello@nd.edu\n\nabstract\nmind wandering, defined as shifts in attention from task-related\nprocessing to task-unrelated thoughts, is a ubiquitous\nphenomenon that has a negative influence on performance and\nproductivity in many contexts, including learning. we propose\nthat next-generation learning technologies should have some\nmechanism to detect and respond to mind wandering in real-time.\ntowards this end, we developed a technology that automatically\ndetects mind wandering from eye-gaze during learning from\ninstructional texts. when mind wandering is detected, the\ntechnology intervenes by posing just-in-time questions and\nencouraging re-reading as needed. after multiple rounds of\niterative refinement, we summatively compared the technology to\na yoked-control in a

## Step 2 - Data Cleaning

In [5]:
# only select the text that's between the first occurence of the 
# the word "abstract" and the last occurence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

for i,doc in enumerate(documents):
    print(len(documents[i]), end=' ')
    # only keep the text after the abstract
    doc = doc[doc.index('abstract'):doc.rfind('reference')]
    # save the result
    documents[i] = doc
    # print the length of the resulting string
    print(len(documents[i]))
    
# one liner:
# documents = [doc[doc.index('abstract'):doc.rfind('reference')] for doc in documents]

50043 39318
41110 35514
49177 42621
32277 28206
40387 34778
45258 42251
40655 32734
31574 28134
42046 37649
46761 42253
47377 42978
44037 40032
37214 32762
47851 41302
42617 35102
45724 39947
47845 44059


In [6]:
# replace carriage returns (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

documents = [doc.replace('\n', ' ') for doc in documents]
print(documents[0][:1000])

abstract mind wandering, defined as shifts in attention from task-related processing to task-unrelated thoughts, is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts, including learning. we propose that next-generation learning technologies should have some mechanism to detect and respond to mind wandering in real-time. towards this end, we developed a technology that automatically detects mind wandering from eye-gaze during learning from instructional texts. when mind wandering is detected, the technology intervenes by posing just-in-time questions and encouraging re-reading as needed. after multiple rounds of iterative refinement, we summatively compared the technology to a yoked-control in an experiment with 104 participants. the key dependent variable was performance on a post-reading comprehension assessment. our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering 

In [7]:
# replace the punctation below by a white space
# check that the result looks okay 
# (e.g., by print the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']


# remove ponctuation
for i,doc in enumerate(documents): 
    for punc in punctuation: 
        doc = doc.replace(punc, ' ')
    documents[i] = doc
    
print(documents[0][:1000])

abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with 104 participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering 

In [8]:
# remove numbers by either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing
for i,doc in enumerate(documents): 
    for num in range(10):
        doc = doc.replace(str(num), '')
    documents[i] = doc

print(documents[1][:1000])

abstract educational systems typically contain a large pool of items  questions  problems   using data mining techniques we can group these items into knowledge components  detect duplicated items and outliers  and identify missing items  to these ends  it is useful to analyze item similarities  which can be used as input to clustering or visualization techniques  we describe and evaluate different measures of item similarity that are based only on learners  performance data  which makes them widely applicable  we provide evaluation using both simulated data and real data from several educational systems  the results show that pearson correlation is a suitable similarity measure and that response times are useful for improving stability of similarity measures when the scope of available data is small     introduction interactive educational systems offer learners items  problems  questions  for solving  realistic educational systems typically contain a large number of such items  this 

In [9]:
# Remove the stop words below from our documents
# print the first 1000 characters of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']


# remove stop words
for i,doc in enumerate(documents):
    for stop_word in stop_words:
        doc = doc.replace(' ' + stop_word + ' ', ' ')
    documents[i] = doc

print(documents[0][:1000])


abstract mind wandering  defined shifts attention task related processing task unrelated thoughts  ubiquitous phenomenon negative influence performance productivity many contexts  including learning  propose next generation learning technologies mechanism detect respond mind wandering real time  towards end  developed technology automatically detects mind wandering eye gaze learning instructional texts  mind wandering detected  technology intervenes posing time questions encouraging re reading needed  multiple rounds iterative refinement  summatively compared technology yoked control experiment  participants  key dependent variable performance post reading comprehension assessment  results suggest technology successful correcting comprehension deficits attributed mind wandering  d     sigma  specific conditions  thereby highlighting potential improve learning  attending attention    keywords mind wandering  gaze tracking  student modeling  attentionaware     introduction despite best e

In [10]:
# remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

for i,doc in enumerate(documents):  
    doc = [x for x in doc.split() if len(x) > 2]
    doc = " ".join(doc)
    documents[i] = doc


print(documents[0][:1000])

abstract mind wandering defined shifts attention task related processing task unrelated thoughts ubiquitous phenomenon negative influence performance productivity many contexts including learning propose next generation learning technologies mechanism detect respond mind wandering real time towards end developed technology automatically detects mind wandering eye gaze learning instructional texts mind wandering detected technology intervenes posing time questions encouraging reading needed multiple rounds iterative refinement summatively compared technology yoked control experiment participants key dependent variable performance post reading comprehension assessment results suggest technology successful correcting comprehension deficits attributed mind wandering sigma specific conditions thereby highlighting potential improve learning attending attention keywords mind wandering gaze tracking student modeling attentionaware introduction despite best efforts write clear engaging paper ch


### Putting it all together

In [11]:
# package all of your work above into a function that cleans a given document

def clean_list_of_documents(documents):
    
    cleaned_docs = []

    for i,doc in enumerate(documents):
        # only keep the text after the abstract
        doc = doc[doc.index('abstract'):]
        # only keep the text before the references
        doc = doc[:doc.rfind('reference')]
        # replace return carriage with white space
        doc = doc.replace('\n', ' ')
        # remove ponctuation
        for punc in punctuation: 
            doc = doc.replace(punc, ' ')
        # remove numbers
        for i in range(10):
            doc = doc.replace(str(i), ' ')
        # remove stop words
        for stop_word in stop_words:
            doc = doc.replace(' ' + stop_word + ' ', ' ')
        # remove single characters and stem the words 
        doc = [x for x in doc.split() if len(x) > 2]
        doc = " ".join(doc)
        # save the result to our list of documents
        cleaned_docs.append(doc)
        
    return cleaned_docs

In [12]:
# reimport your raw data
documents = []

for filename in files: 
    with open (filename, "r", encoding='utf-8') as f:
        documents.append(f.read())
        
# clean your files using the function above
docs = clean_list_of_documents(documents)

# print the first 1000 characters of the first document
print(docs[0][:1000])

abstract mind wandering defined shifts attention task related processing task unrelated thoughts ubiquitous phenomenon negative influence performance productivity many contexts including learning propose next generation learning technologies mechanism detect respond mind wandering real time towards end developed technology automatically detects mind wandering eye gaze learning instructional texts mind wandering detected technology intervenes posing time questions encouraging reading needed multiple rounds iterative refinement summatively compared technology yoked control experiment participants key dependent variable performance post reading comprehension assessment results suggest technology successful correcting comprehension deficits attributed mind wandering sigma specific conditions thereby highlighting potential improve learning attending attention keywords mind wandering gaze tracking student modeling attentionaware introduction despite best efforts write clear engaging paper ch

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [4]:
import math
import numpy as np

In [14]:
# create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(docs):
    voc = []
    for doc in docs:
        for word in doc.split():
            if word not in voc: 
                voc.append(word)
    voc = list(set(voc))
    voc.sort()
    return voc

# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(docs)
print(len(vocabulary))

5676


## Step 4 - transform your documents in to 100-words chunks

In [15]:
# create a function that takes in a list of documents
# and returns a list of 100-words chunk 
# (with a 25 words overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap

def flatten_and_overlap(docs, window_size=100, overlap=25):
    
    # create the list of overlapping documents
    new_list_of_documents = []
    
    # flatten everything into one string
    flat = ""
    for doc in docs:
        flat += doc
    
    # split into words
    flat = flat.split()

    # create chunks of 100 words
    high = window_size
    while high < len(flat):
        low = high - window_size
        new_list_of_documents.append(flat[low:high])
        high += overlap
    return new_list_of_documents

chunks = flatten_and_overlap(docs)

In [16]:
# create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check
for chunk in chunks: 
    assert(len(chunk) == 100)

# WEEK 6 - VECTOR MANIPULATION

## Step 5 - Create a word by document matrix

In [17]:
# 1) create an empty dataframe using pandas
# the number of rows should be the number of segments we have
# the number of columns should be size of the vocabulary
print(len(chunks))
print(len(vocabulary))

df_segments = pd.DataFrame(0, index=np.arange(len(chunks)), columns=vocabulary)
df_segments.head()

2219
5676


Unnamed: 0,��,���,����,����,����,����,����,����,�����,����,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [57]:
# 2) fill out the dataframe with the count of words for each chunk
# (use two for loops to iterate through the chunk and the vocabulary)

for i,chunk in enumerate(chunks):
    for word in chunk:
        if word in df_segments.columns:
            df_segments.loc[i, word] += 1

In [58]:
# 3) Sanity check: make sure that your counts are correct
# (e.g., if you know that a words appears often in a document, check that
# the number is also high in your dataframe; and vice-versa for low counts)
df_segments.data.describe()

count    2219.000000
mean        1.025687
std         1.670507
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        14.000000
Name: data, dtype: float64

In [39]:
# 4) Putting it together: create a function that takes a list of documents
# and a vocabulary as arguments, and returns a dataframe with the counts
# of words: 
def word_count(texts, vocab):
    df = pd.DataFrame(0, index=np.arange(len(texts)), columns=vocab)
    for i,text in enumerate(texts):
        for word in text:
            if word in df.columns:
                df.loc[i, word] += 1
    return df

dfseg = word_count(chunks, vocabulary)
dfseg.head()
dfseg.data.describe()

#the data description is a little bit off- max is 12 instead of 14, mean is a little lower. but I'm not sure why?

Unnamed: 0,��,���,����,����,����,����,����,����,�����,����,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Step 6 - Weight word frequency

In [5]:
#load data as a csv file
df_seg = pd.read_csv("./word-by-chunk.csv")
df_seg.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# 5) create a function that adds one to the current cell and takes its log
# IF the value in the cell is not zero
def log_transform(cell):
    if cell != 0:
        cell = np.log1p(cell)
    return cell

In [7]:
# 6) use the "applymap" function of the dataframe to apply the function 
# above to each cell of the table
df_log = df_seg.applymap(log_transform)

In [8]:
# 7) check that the numbers in the resulting matrix look accurate;
# print the value before and after applying the function above
print(df_seg['data'].describe())
print(df_seg['learning'].describe())
print(df_seg['education'].describe())
print(np.log1p(12))
print(np.log1p(3))

print(df_log['data'].describe())
print(df_log['learning'].describe())
print(df_log['education'].describe())




count    2219.000000
mean        0.998648
std         1.599002
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        12.000000
Name: data, dtype: float64
count    2219.000000
mean        1.104552
std         1.774194
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max        12.000000
Name: learning, dtype: float64
count    2219.000000
mean        0.070753
std         0.323342
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         3.000000
Name: education, dtype: float64
2.5649493574615367
1.3862943611198906
count    2219.000000
mean        0.471216
std         0.614158
min         0.000000
25%         0.000000
50%         0.000000
75%         0.693147
max         2.564949
Name: data, dtype: float64
count    2219.000000
mean        0.505046
std         0.638218
min         0.000000
25%         0.000000
50%         0.000000
75%         1.098612
max         2.564949
Name:

## Step 7 - Matrix normalization

### 8) look at the image below; why do you think that we need to normalize our data before clustering in this particular case? 
Because the axes aren't labeled it is hard to interpret this graph. But my best guess is that the x-axis represents one word's values and the y-axis represents another word. The values of these two variables are not comparable because they are different scales. Is moving from 0 to 1 on the x axis akin to moving from 0 to 1000 on the y axis? Do we want to represent the variation in terms of standard deviations instead o the units of measure? If so, we should standardize them. If they are substantively meaningfully different scales though I don't think you want to normalize/standardize them. But since we want to more easily compare them and as stated bewlow it's common practice, we are going to do it.

<img src="https://i.stack.imgur.com/N2unM.png" />

In general, it's common practice to normalize your data before clustering - so that variables are comparable.

### 9) describe how the min-max normalization works: 
Here we are transforming the values in order to represent them as their relative position along the range of possible values. We end up with a fraction that represents its position relative to other values. It's almost like thinking of the range as a number line, and the resulting transformed variable is the position on that line as a fraction (e.g. 6/10 if the numbers range from 1 to 11 and the value is a 7).

<img src="https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/media/aml-normalization-minmax.png" />

### 10) describe how normalizing using a z-score works:
Z-scores represent the values in standard deviation units. A 0 z-score indicates the value is the mean of the variable, and a 1 means it is 1-standard deviation above the mean. So you're comparing values based on their relative value in terms of standard deviation.

<img src="https://cdn-images-1.medium.com/max/1600/1*13XKCXQc7eabfZbRzkvGvA.gif"/>

### 11) describe how normalizing to unit norm works
From the mathworld page below: "The unit vector obtained by normalizing the normal 
vector (i.e., dividing a nonzero normal vector by its vector norm) is the unit normal vector, often known 
simply as the "unit normal." Care should be taken to not confuse the terms "vector norm" (length of vector), 
"normal vector" (perpendicular vector) and "normalized vector" (unit-length vector).

The normal vector is commonly denoted N or n, with a hat sometimes (but not always) added (i.e., N^^ and n^^) 
to explicitly indicate a unit normal vector."

Conceptually, what we are doing is standardizing each vector to have the same length, and so only comparing vectors to each other based on their angle, which can be compared to each other more easily using cosines when they are the same lengths. We're also reducing the amount of information contained in the vector, so we only look at the one parameter (angle) instead of comparing them on two parameters. As the description above describes, this is done on a vector-by-vector basis, not like other transformationt that may standardize a variable based on the entire sample. The vector is divided by its "vector norm" or magnitude.

Resources: 
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer
* http://mathworld.wolfram.com/NormalVector.html

We are going to work with some pre-made normalization functions from sklearn (feel free to skim this page):
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

In [8]:
# 12) since we are working with vectors, apply the Normalizer from 
# sklearn.preprocessing to our dataframe. Print a few values 
# before and after to make sure you've applied the normalization
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
scaled = normalizer.fit_transform(df_log)
df_normalized = pd.DataFrame(scaled,columns=df_log.columns)
print(df_log.learning.describe())
print(df_normalized.learning.describe())


count    2219.000000
mean        0.505046
std         0.638218
min         0.000000
25%         0.000000
50%         0.000000
75%         1.098612
max         2.564949
Name: learning, dtype: float64
count    2219.000000
mean        0.050257
std         0.063434
min         0.000000
25%         0.000000
50%         0.000000
75%         0.104009
max         0.252719
Name: learning, dtype: float64


In [9]:
# 13) create a function that takes a dataframe as argument and where a second
# argument is the type of normalization (MinMaxScaler, Normalizer, StandardScaler)
# and returns the normalized dataframe
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
def normed(scaler, dataframe):
    from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
    normalizer = scaler
    scaled = normalizer.fit_transform(dataframe)
    return pd.DataFrame(scaled,columns=dataframe.columns)


## Step 8 - Deviation Vectors

<img src="https://www.dropbox.com/s/9f73r7pk7bi7vh9/deviation_vectors.png?dl=1" />

In [10]:
# 14) compute the sum of the vectors

df_normalized = df_normalized.append(df_normalized.sum(), ignore_index=True) 
print(df_normalized.tail())


       Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  \
2215     0.721278    0.000000    0.000000    0.000000    0.000000    0.000000   
2216     0.720558    0.000000    0.000000    0.000000    0.000000    0.000000   
2217     0.722922    0.000000    0.000000    0.000000    0.000000    0.000000   
2218     0.722308    0.000000    0.000000    0.000000    0.000000    0.000000   
2219  1481.754891    0.614176    0.575591    0.575591    0.575591    0.575591   

      Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9    ...       𝑅𝑒𝑐𝑎𝑙𝑙  \
2215    0.000000    0.000000    0.000000    0.000000    ...     0.000000   
2216    0.000000    0.000000    0.000000    0.000000    ...     0.000000   
2217    0.000000    0.000000    0.000000    0.000000    ...     0.000000   
2218    0.000000    0.000000    0.000000    0.000000    ...     0.000000   
2219    0.575591    0.575591    0.267249    0.453378    ...     0.271974   

           𝑇𝐹𝑖      𝑔𝑎𝑖𝑛   𝑚𝑒𝑎𝑠𝑢𝑟𝑒  𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡  𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 

In [17]:
# 15) normalize the vector (find its average)
v_sum = pd.DataFrame(df_normalized.loc[df_normalized.index==2219])
#print(v_sum)
v_avg = normed(Normalizer(), v_sum)
#print(v_avg)
print(v_avg.loc[0])

Unnamed: 0     0.957048
Unnamed: 1     0.000397
Unnamed: 2     0.000372
Unnamed: 3     0.000372
Unnamed: 4     0.000372
Unnamed: 5     0.000372
Unnamed: 6     0.000372
Unnamed: 7     0.000372
Unnamed: 8     0.000173
Unnamed: 9     0.000293
Unnamed: 10    0.000173
Unnamed: 11    0.000293
Unnamed: 12    0.000173
Unnamed: 13    0.000293
Unnamed: 14    0.000174
Unnamed: 15    0.000174
Unnamed: 16    0.000397
Unnamed: 17    0.000174
Unnamed: 18    0.000397
Unnamed: 19    0.000174
Unnamed: 20    0.000372
Unnamed: 21    0.000397
Unnamed: 22    0.000174
Unnamed: 23    0.000370
abilities      0.000946
ability        0.006914
able           0.003354
abnormal       0.000178
absence        0.000385
absent         0.000351
                 ...   
ztransform     0.000357
zurich         0.000178
µθt            0.000279
école          0.000193
αjt            0.000190
αxt            0.000302
θlj            0.000354
λkak           0.000180
λkz            0.000342
σθt            0.000279
‘amount        0

In [24]:
print(len(df_normalized))

2220


In [26]:
# 16) take each vector and subtract its components along v_avg


#make a copy of the normalized dataframe without the total row
df_deviation=df_normalized[:2219].copy()
# for each row in the normalized dataframe, subtract it from the average vector which only has one row:
for row in range(2219):
    df_deviation.loc[row]= v_avg.loc[0] - df_deviation.loc[row]
    
print(df_deviation.head())
print(df_deviation.tail())

   Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  \
0    0.957048    0.000397    0.000372    0.000372    0.000372    0.000372   
1    0.861267    0.000397    0.000372    0.000372    0.000372    0.000372   
2    0.805303    0.000397    0.000372    0.000372    0.000372    0.000372   
3    0.767092    0.000397    0.000372    0.000372    0.000372    0.000372   
4    0.738476    0.000397    0.000372    0.000372    0.000372    0.000372   

   Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9    ...       𝑅𝑒𝑐𝑎𝑙𝑙  \
0    0.000372    0.000372    0.000173    0.000293    ...     0.000176   
1    0.000372    0.000372    0.000173    0.000293    ...     0.000176   
2    0.000372    0.000372    0.000173    0.000293    ...     0.000176   
3    0.000372    0.000372    0.000173    0.000293    ...     0.000176   
4    0.000372    0.000372    0.000173    0.000293    ...     0.000176   

        𝑇𝐹𝑖      𝑔𝑎𝑖𝑛   𝑚𝑒𝑎𝑠𝑢𝑟𝑒  𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡  𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛   𝑝𝑟𝑒𝑡𝑒𝑠𝑡    𝑟𝑒𝑐𝑎𝑙𝑙  \
0  0.000178  0.000169 

In [14]:
# 17) put the code above in a function that takes in a dataframe as an argument
# and computes deviation vectors of each row (=document)
def normed(scaler, dataframe):
    from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
    normalizer = scaler
    scaled = normalizer.fit_transform(dataframe)
    return pd.DataFrame(scaled,columns=dataframe.columns)

def deviation(dataframe):
    dataframe.append(dataframe.sum(), ignore_index=True) 
    v_sum = pd.DataFrame(df_normalized.loc[df_normalized.index==-1])
    v_avg = normed(Normalizer(), v_sum)
    for row in range(len(dataframe)):
        dataframe.loc[row]= v_avg.loc[0] - df_deviation.loc[row]
    return dataframe

# WEEK 7 - CLUSTERING

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [17]:
# in python code, our goal is to recreate the steps above as functions
# so that we can just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this functions takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization