# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next few weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

# WEEK 5 - DATA CLEANING

## Step 1 - Data Retrieval

In [26]:
# using glob, find all the text files in the "Papers" folder
import glob

files = glob.glob('Papers/*.txt')
print(files)

['Papers/paper12.txt', 'Papers/paper5.txt', 'Papers/paper4.txt', 'Papers/paper13.txt', 'Papers/paper11.txt', 'Papers/paper6.txt', 'Papers/paper7.txt', 'Papers/paper10.txt', 'Papers/paper14.txt', 'Papers/paper3.txt', 'Papers/paper2.txt', 'Papers/paper15.txt', 'Papers/paper0.txt', 'Papers/paper1.txt', 'Papers/paper16.txt', 'Papers/paper9.txt', 'Papers/paper8.txt']


In [27]:
# get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

for filename in files: 
    with open (filename, "r", encoding='utf-8') as f:
        documents.append(f.read())

In [28]:
# print the first 1500 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
documents[0][:1500]

'103\n\n\x0cepistemic network analysis and topic modeling for chat\ndata from collaborative learning environment\nzhiqiang cai\n\nbrendan eagan\n\nnia m. dowell\n\nthe university of memphis\n365 innovation drive, suite 410\nmemphis, tn, usa\n\nuniversity of wisconsin-madison\n1025 west johnson street\nmadison, wi, usa\n\nthe university of memphis\n365 innovation drive, suite 410\nmemphis, tn, usa\n\nzcai@memphis.edu\n\neaganb@gmail.com\n\nniadowell@gmail.com\n\njames w. pennebaker\n\ndavid w. shaffer\n\narthur c. graesser\n\nuniversity of texas-austin\n116 inner campus dr stop g6000\naustin, tx, usa\n\nuniversity of wisconsin-madison\n1025 west johnson street\nmadison, wi, usa\n\nthe university of memphis\n365 innovation drive, suite 403\nmemphis, tn, usa\n\npennebaker@utexas.edu\n\ndws@education.wisc.edu\n\nart.graesser@gmail.com\n\nabstract\nthis study investigates a possible way to analyze chat data from\ncollaborative learning environments using epistemic network\nanalysis and topi

## Step 2 - Data Cleaning

In [29]:
# only select the text that's between the first occurrence of the 
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

for i,doc in enumerate(documents):
    print(len(documents[i]), end=' ')
    # only keep the text between the abstract and the references
    doc = doc[doc.index('abstract'):doc.rfind('reference')]
    # save the result
    documents[i] = doc
    # print the length of the resulting string
    print(len(documents[i]))
    
# one liner:
# documents = [doc[doc.index('abstract'):doc.rfind('reference')] for doc in documents]

40387 34778
37214 32762
44037 40032
45258 42251
32277 28206
47851 41302
42617 35102
49177 42621
40655 32734
47377 42978
46761 42253
31574 28134
50043 39318
41110 35514
42046 37649
47845 44059
45724 39947
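
One thing to keep in mind with the cell above: `index` raises a `ValueError` when "abstract" is missing, and `rfind` returns -1 when "reference" is missing, which would silently cut off the last character of the document. Here is a minimal sketch of a more defensive version. The helper name `extract_body` and its fallback behavior are assumptions made for illustration, not part of the original notebook.

```python
def extract_body(text, start_marker="abstract", end_marker="reference"):
    """Return the text from the first start_marker up to the last end_marker.

    Hypothetical helper: a missing marker falls back to the start / end of
    the text instead of raising an error or silently trimming a character.
    """
    start = text.find(start_marker)   # -1 when the marker is absent
    if start == -1:
        start = 0
    end = text.rfind(end_marker)      # -1 when the marker is absent
    if end == -1 or end < start:
        end = len(text)
    return text[start:end]


sample = "title page abstract we study chat data references [1] a reference"
print(extract_body(sample))
print(extract_body("no markers here"))
```

Whether falling back to the full text is the right choice depends on your data; you may prefer to print a warning so that you can inspect the problematic file.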


In [30]:
# replace newlines (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

documents = [doc.replace('\n', ' ') for doc in documents]
print(documents[0][:1000])

abstract this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling. a 300-topic general topic model built from tasa (touchstone applied science associates) corpus was used in this study. 300 topic scores for each of the 15,670 utterances in our chat data were computed. seven relevant topics were selected based on the total document scores. while the aggregated topic scores had some power in predicting students’ learning, using epistemic network analysis enables assessing the data from a different angle. the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different (𝑡 = 2.00). overall, the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions.  keywords chat; collaborative learning; topic modelin

In [31]:
# replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']


# remove punctuation
for i,doc in enumerate(documents): 
    for punc in punctuation: 
        doc = doc.replace(punc, ' ')
    documents[i] = doc
    
print(documents[0][:1000])

abstract this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a 300 topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study  300 topic scores for each of the 15 670 utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡   2 00   overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modelin
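
The nested loop above calls `replace` once per punctuation mark. As an alternative, here is a sketch that does the same job in a single pass with `str.translate`. The names `EXTRA_PUNCTUATION`, `PUNCT_TABLE` and `strip_punctuation` are made up for this example; `string.punctuation` only covers ASCII characters, so the non-ASCII marks from our list are added by hand.

```python
import string

# Non-ASCII characters from the notebook's punctuation list that are not
# part of string.punctuation.
EXTRA_PUNCTUATION = "−”“’"

# Map every punctuation character to a single white space.
PUNCT_TABLE = str.maketrans(
    {ch: " " for ch in string.punctuation + EXTRA_PUNCTUATION}
)


def strip_punctuation(text):
    """Replace each punctuation character in `text` with a white space."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("topic-modeling (tasa), 15,670 utterances."))
```

Because every character is replaced by exactly one space, the length of the string does not change, which makes before/after comparisons easy.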

In [32]:
# replace numbers with either a white space or the word "number"
# (here each digit is simply deleted); again, print the first 1000 
# characters of a document to check that you're doing the right thing
for i,doc in enumerate(documents): 
    for num in range(10):
        doc = doc.replace(str(num), '')
    documents[i] = doc

print(documents[1][:1000])

abstract there is a critical need to develop new educational technology applications that analyze the data collected by universities to ensure that students graduate in a timely fashion   to  years   and they are well prepared for jobs in their respective fields of study  in this paper  we present a novel approach for analyzing historical educational records from a large  public university to perform next term grade prediction  i e   to estimate the grades that a student will get in a course that he she will enroll in the next term  accurate next term grade prediction holds the promise for better student degree planning  personalized advising and automated interventions to ensure that students stay on track in their chosen degree program and graduate on time  we present a factorization based approach called matrix factorization with temporal course wise influence that incorporates course wise influence effects and temporal effects for grade prediction  in this model  students and cours

In [33]:
# Remove the stop words below from our documents
# print the first 1000 characters of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']


# remove stop words
for i,doc in enumerate(documents):
    for stop_word in stop_words:
        doc = doc.replace(' ' + stop_word + ' ', ' ')
    documents[i] = doc

print(documents[0][:1000])

abstract study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling   topic general topic model built tasa  touchstone applied science associates  corpus used study   topic scores   utterances chat data computed  seven relevant topics selected based total document scores  aggregated topic scores power predicting students  learning  using epistemic network analysis enables assessing data different angle  results showed topic score based epistemic networks low gain students high gain students significantly different  𝑡       overall  results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epistemic network analysis    introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronousl
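
A caveat about the substring approach above: `' ' + stop_word + ' '` only matches a stop word that has a space on both sides, so a stop word at the very beginning or end of a document is kept. Here is a small sketch of a token-based alternative. The function name `remove_stop_words` and the tiny demo list are assumptions for illustration; the full `stop_words` list would be used the same way.

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that is a stop word."""
    stop_set = set(stop_words)  # set membership tests are fast
    return " ".join(tok for tok in text.split() if tok not in stop_set)


# Tiny illustrative stop list (the notebook's full list works the same way).
demo_stop_words = ["the", "of", "a"]
demo = "the the results of a study"

# Substring approach: the leading "the" has no space before it and survives.
print(demo.replace(" the ", " "))
# Token approach: every stop word is removed.
print(remove_stop_words(demo, demo_stop_words))
```

Note that the token-based version also collapses runs of white space into single spaces, since it splits and re-joins the text.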

In [34]:
# remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

for i,doc in enumerate(documents):  
    doc = [x for x in doc.split() if len(x) > 2]
    doc = " ".join(doc)
    documents[i] = doc

print(documents[0][:1000])

abstract study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate ta


### Putting it all together

In [35]:
# package all of your work above into a function that cleans a given document
# (it relies on the `punctuation` and `stop_words` lists defined above)

def clean_list_of_documents(documents):
    
    cleaned_docs = []

    for doc in documents:
        # only keep the text after the abstract
        doc = doc[doc.index('abstract'):]
        # only keep the text before the references
        doc = doc[:doc.rfind('reference')]
        # replace newlines with white space
        doc = doc.replace('\n', ' ')
        # remove punctuation
        for punc in punctuation: 
            doc = doc.replace(punc, ' ')
        # remove numbers
        for digit in range(10):
            doc = doc.replace(str(digit), ' ')
        # remove stop words
        for stop_word in stop_words:
            doc = doc.replace(' ' + stop_word + ' ', ' ')
        # remove words with one or two characters
        doc = [x for x in doc.split() if len(x) > 2]
        doc = " ".join(doc)
        # save the result to our list of documents
        cleaned_docs.append(doc)
        
    return cleaned_docs

In [36]:
# reimport your raw data
documents = []

for filename in files: 
    with open (filename, "r", encoding='utf-8') as f:
        documents.append(f.read())
        
# clean your files using the function above
docs = clean_list_of_documents(documents)

# print the first 1000 characters of the first document
print(docs[0][:1000])

abstract study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate ta

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [37]:
import math
import numpy as np

In [38]:
# create a function that takes in a list of documents
# and returns a list of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(docs):
    # a set keeps only one copy of each word
    voc = set()
    for doc in docs:
        voc.update(doc.split())
    # sorted() returns an alphabetically ordered list
    return sorted(voc)

# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(docs)
print(len(vocabulary))

5676


## Step 4 - transform your documents into 100-word chunks

In [39]:
# create a function that takes in a list of documents
# and returns a list of 100-word chunks 
# (with a 25 words overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap

def flatten_and_overlap(docs, window_size=100, overlap=25):
    # NOTE: as written, `overlap` is the number of words the window moves
    # forward at each step, so consecutive chunks share 
    # window_size - overlap words (75 with the default values)
    
    # create the list of overlapping documents
    new_list_of_documents = []
    
    # flatten everything into one list of words; joining with a space 
    # keeps the last word of a document from merging with the first word 
    # of the next one
    flat = " ".join(docs).split()

    # create chunks of window_size words
    high = window_size
    while high <= len(flat):
        low = high - window_size
        new_list_of_documents.append(flat[low:high])
        high += overlap
    return new_list_of_documents

chunks = flatten_and_overlap(docs)

In [19]:
# create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check
for chunk in chunks: 
    assert(len(chunk) == 100)
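
To see how the window size and the step between windows determine the overlap, here is a small self-contained sketch. The function `sliding_windows` and its `step` argument are illustrative names, not part of the notebook: with windows of 100 words that move 25 words at a time, two consecutive chunks share 75 words.

```python
def sliding_windows(words, window_size=100, step=25):
    """Return every full window of `window_size` words, moving `step` words
    at a time (consecutive windows share window_size - step words)."""
    return [
        words[low:low + window_size]
        for low in range(0, len(words) - window_size + 1, step)
    ]


# 200 distinct fake words, so shared words can be counted with sets.
words = [f"w{i}" for i in range(200)]
windows = sliding_windows(words)
shared = len(set(windows[0]) & set(windows[1]))
print(len(windows), shared)
```

If you want chunks that only share 25 words, the step would be `window_size - 25`, i.e. 75 words.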

# WEEK 6 - VECTOR MANIPULATION

## Step 5 - Create a word by document matrix

In [79]:
# 1) create an empty dataframe using pandas
# the number of rows should be the number of documents we have
# the number of columns should be size of the vocabulary
import pandas as pd
import numpy as np
df = pd.DataFrame(0,index=np.arange(len(chunks)), columns=(vocabulary))
df.head()

Unnamed: 0,��,���,����,����,����,����,����,����,�����,����,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [80]:
# 2) fill out the dataframe with the count of words for each document
# (use two for loops to iterate through the documents and the vocabulary)
for i,chunk in enumerate(chunks):
    for word in chunk:
        if word in vocabulary:
            df.loc[i,word]+=1
df.head()

Unnamed: 0,��,���,����,����,����,����,����,����,�����,����,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [81]:
df.describe()

Unnamed: 0,��,���,����,����,����,����,����,����,�����,����,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
count,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,...,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0,2219.0
mean,0.005408,0.005408,0.005408,0.005408,0.005408,0.005408,0.005408,0.001803,0.003605,0.001803,...,0.001803,0.001803,0.001803,0.001803,0.001803,0.003605,0.003605,0.003605,0.005408,0.001803
std,0.119992,0.119992,0.119992,0.119992,0.119992,0.119992,0.119992,0.042428,0.079366,0.042428,...,0.042428,0.042428,0.042428,0.042428,0.042428,0.084857,0.079366,0.084857,0.119992,0.042428
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,2.0,1.0,...,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,3.0,1.0


In [82]:
# 3) Sanity check: make sure that your counts are correct
# (e.g., if you know that a word appears often in a document, check that
# the number is also high in your dataframe; and vice-versa for low counts)
print(df['data'].describe())

count    2219.000000
mean        0.991438
std         1.604047
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        12.000000
Name: data, dtype: float64


In [83]:
# 4) Putting it together: create a function that takes a list of documents
# and a vocabulary as arguments, and returns a dataframe with the counts
# of words: 
def counts_of_words(documents,vocabulary):
    chunks = flatten_and_overlap(documents)
    df = pd.DataFrame(0,index=np.arange(len(chunks)), columns=(vocabulary))
    # a set makes the membership test much faster than scanning the list
    vocabulary_set = set(vocabulary)
    for i,chunk in enumerate(chunks):
        for word in chunk:
            if word in vocabulary_set:
                df.loc[i,word]+=1
    return df
# call the function and check that the resulting dataframe is correct
df2=counts_of_words(docs,vocabulary)

In [84]:
print(df2['data'].describe())

count    2219.000000
mean        0.991438
std         1.604047
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        12.000000
Name: data, dtype: float64
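
Updating the dataframe one cell at a time with `df.loc[i, word] += 1` can be slow on a few thousand chunks. A possible alternative, sketched below, is to count the words of each chunk with `collections.Counter` and build all the rows first. The function name `count_rows` is hypothetical.

```python
from collections import Counter


def count_rows(chunks, vocabulary):
    """Return one list of counts per chunk, ordered like `vocabulary`."""
    rows = []
    for chunk in chunks:
        counts = Counter(chunk)  # one pass over the words of the chunk
        # A Counter returns 0 for words that never appeared in the chunk.
        rows.append([counts[word] for word in vocabulary])
    return rows


demo_chunks = [["data", "topic", "data"], ["topic", "model"]]
demo_vocab = ["data", "model", "topic"]
print(count_rows(demo_chunks, demo_vocab))
```

The resulting list of rows can then be wrapped in a dataframe in one go, e.g. `pd.DataFrame(rows, columns=vocabulary)`.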


## Step 6 - Weight word frequency

In [2]:
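# 5) create a function that adds one to the current cell and takes its log
# IF the value in the cell is not zero
# (note: this computes log(1 + count), which dampens counts in the same spirit 
# as the 1 + log(count) weighting listed in the roadmap, but is not identical)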
import pandas as pd
word_by_chunk= pd.read_csv('word-by-chunk.csv',index_col=[0])



In [3]:
import math
def plus_one_log(cell):
    if cell!=0:
        cell = math.log(1+cell)
    return cell



In [4]:
# 6) use the "applymap" function of the dataframe to apply the function 
# above to each cell of the table
# (pandas 2.1 and later deprecate applymap in favor of DataFrame.map)
word_by_chunk_log = word_by_chunk.applymap(plus_one_log)



In [5]:
# 7) check that the numbers in the resulting matrix look accurate;
# print the value before and after applying the function above
print(word_by_chunk['student'].head())
print(word_by_chunk_log['student'].head())

0    0
1    1
2    1
3    1
4    1
Name: student, dtype: int64
0    0.000000
1    0.693147
2    0.693147
3    0.693147
4    0.693147
Name: student, dtype: float64
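
The two dampening formulas that appear in this notebook, log(1 + count) in the cell above and 1 + log(count) in the roadmap, give different numbers. The sketch below compares them on a few counts; the function names `log_one_plus` and `one_plus_log` are made up for this example.

```python
import math


def log_one_plus(count):
    """Dampening used in the cell above: log(1 + count); 0 stays 0."""
    return math.log1p(count)


def one_plus_log(count):
    """Dampening as written in the roadmap: 1 + log(count) for count > 0."""
    return 1 + math.log(count) if count > 0 else 0.0


for count in [0, 1, 2, 10]:
    print(count, round(log_one_plus(count), 4), round(one_plus_log(count), 4))
```

Since log(1 + 0) is already 0, the `if cell != 0` test is not strictly needed for the first formula, and `np.log1p(word_by_chunk)` should give the same table without calling a Python function on every cell.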


## Step 7 - Matrix normalization

In [6]:
# 8) look at the image below; why do you think that we need to normalize our 
# data before clustering in this particular case? 

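Because the values on the y axis span a much wider range than those on the x axis, the y axis would dominate the distances between points, and the clusters would be driven almost entirely by that one variable.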

<img src="https://i.stack.imgur.com/N2unM.png" />

In general, it's common practice to normalize your data before clustering - so that variables are comparable.

In [7]:
# 9) describe how the min-max normalization works:

For every feature, min-max normalization maps the minimum value to 0 and the maximum value to 1, and rescales every other value linearly to a decimal between 0 and 1: x' = (x - min) / (max - min). The downside is that it doesn't handle outliers very well: a single extreme value squeezes all the other values into a narrow range.
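
Here is a minimal NumPy sketch of min-max normalization, written by hand so that each step is visible. The function name `min_max_scale` and the demo values are assumptions for illustration; sklearn's `MinMaxScaler` is what we would use in practice.

```python
import numpy as np


def min_max_scale(matrix):
    """Rescale each column (feature) of `matrix` to the range [0, 1]."""
    matrix = np.asarray(matrix, dtype=float)
    col_min = matrix.min(axis=0)
    col_range = matrix.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid dividing by 0
    return (matrix - col_min) / col_range


# The second column contains an outlier (1000).
demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 1000.0]])
print(min_max_scale(demo))
```

In the second column, the outlier pushes the two other values very close to 0, which illustrates the weakness with outliers.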

<img src="https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/media/aml-normalization-minmax.png" />

In [8]:
# 10) describe how normalizing using a z-score works:

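Z-score normalization (standardization) subtracts the mean of a feature from each value and divides the result by the feature's standard deviation: z = (x - mean) / std. Each value is then expressed as a number of standard deviations from the mean, and the feature ends up with a mean of 0 and a standard deviation of 1.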
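
As a quick illustration, here is a hand-written NumPy sketch of the z-score. The function name `z_score` and the demo values are assumptions; it uses the population standard deviation (`ddof=0`), which is also what sklearn's `StandardScaler` uses.

```python
import numpy as np


def z_score(matrix):
    """Center each column on its mean and divide by its standard deviation."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0] = 1.0        # constant columns: avoid dividing by 0
    return (matrix - mean) / std


demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 1000.0]])
z = z_score(demo)
print(z)
print(z.mean(axis=0), z.std(axis=0))
```

After the transformation, every column has a mean of (approximately) 0 and a standard deviation of 1, whatever its original scale was.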

<img src="https://cdn-images-1.medium.com/max/1600/1*13XKCXQc7eabfZbRzkvGvA.gif"/>

In [9]:
# 11) describe how normalizing to unit norm works

Each vector (row) is rescaled so that, if you squared each of its elements and summed them, the result would equal 1. You do this by dividing every non-zero vector by its norm (its length); only the direction of the vector is kept.

Resources: 
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer
* http://mathworld.wolfram.com/NormalVector.html
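
To make the unit norm concrete, here is a short NumPy sketch that rescales each row to length 1. The function name `unit_norm_rows` is made up for this example; sklearn's `Normalizer` (with its default L2 norm) does the equivalent row by row.

```python
import numpy as np


def unit_norm_rows(matrix):
    """Divide each row of `matrix` by its Euclidean (L2) norm."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return matrix / norms


demo = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]])
unit = unit_norm_rows(demo)
print(unit)
print((unit ** 2).sum(axis=1))  # squared elements of each row, summed
```

Unlike min-max and z-score, which work column by column (per feature), this normalization works row by row (per document).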

We are going to work with some pre-made normalization functions from sklearn (feel free to skim this page):
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

In [10]:
print(word_by_chunk_log['student'].head())

0    0.000000
1    0.693147
2    0.693147
3    0.693147
4    0.693147
Name: student, dtype: float64


In [11]:
# 12) since we are working with vectors, apply the Normalizer from 
# sklearn.preprocessing to our dataframe. Print a few values 
# before and after to make sure you've applied the normalization
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
np_scaled = normalizer.fit_transform(word_by_chunk_log)
df_normalized = pd.DataFrame(np_scaled,columns=word_by_chunk_log.columns)
df_normalized['student'].head()



0    0.000000
1    0.096224
2    0.096862
3    0.096739
4    0.096466
Name: student, dtype: float64

In [12]:
# 13) create a function that takes a dataframe as argument and where a second
# argument is the type of normalization (MinMaxScaler, Normalizer, StandardScaler)
# and returns the normalized dataframe
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

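def apply_normalizer(df, scaler_class=Normalizer):
    # scaler_class is one of the sklearn classes imported above; naming the 
    # argument "Normalizer" would shadow the imported class inside the function
    scaler = scaler_class()
    np_scaled = scaler.fit_transform(df)
    # keep the original row labels and column names
    df_normalized = pd.DataFrame(np_scaled, index=df.index, columns=df.columns)
    return df_normalized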

## Step 8 - Deviation Vectors

<img src="https://www.dropbox.com/s/9f73r7pk7bi7vh9/deviation_vectors.png?dl=1" />

In [16]:
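# 14) compute the sum of the vectors
# (note: axis=1 adds up the components within each row, giving one number per 
# chunk; adding the row vectors together would use axis=0)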
import numpy as np
df_summed= np.sum(df_normalized,axis=1)
df_summed

0       8.382958
1       8.665337
2       9.114184
3       9.011595
4       8.860376
5       8.672300
6       8.536212
7       8.505834
8       8.873432
9       8.419118
10      8.565492
11      8.536871
12      8.007668
13      8.412935
14      8.184418
15      8.009800
16      8.043954
17      8.558377
18      9.340302
19      9.254563
20      8.609961
21      8.330454
22      8.447416
23      8.841692
24      8.970366
25      8.615061
26      8.548368
27      8.805986
28      9.025566
29      8.750449
          ...   
2189    8.211961
2190    7.729321
2191    7.716339
2192    7.968933
2193    8.473824
2194    8.424897
2195    8.469514
2196    8.592788
2197    8.343269
2198    8.350283
2199    7.917504
2200    7.993132
2201    8.410788
2202    8.772392
2203    8.985744
2204    8.970366
2205    8.283157
2206    8.028105
2207    8.026028
2208    8.008247
2209    8.101467
2210    7.947889
2211    8.394462
2212    8.482821
2213    8.531877
2214    8.146446
2215    8.104278
2216    8.0322

In [18]:
# 15) normalize the vector (find its average)
# (note: axis=1 gives the mean of the components of each row, i.e. one value per chunk)
df_mean = np.mean(df_normalized,axis=1)
df_mean

0       0.001477
1       0.001527
2       0.001606
3       0.001588
4       0.001561
5       0.001528
6       0.001504
7       0.001499
8       0.001563
9       0.001483
10      0.001509
11      0.001504
12      0.001411
13      0.001482
14      0.001442
15      0.001411
16      0.001417
17      0.001508
18      0.001646
19      0.001630
20      0.001517
21      0.001468
22      0.001488
23      0.001558
24      0.001580
25      0.001518
26      0.001506
27      0.001551
28      0.001590
29      0.001542
          ...   
2189    0.001447
2190    0.001362
2191    0.001359
2192    0.001404
2193    0.001493
2194    0.001484
2195    0.001492
2196    0.001514
2197    0.001470
2198    0.001471
2199    0.001395
2200    0.001408
2201    0.001482
2202    0.001546
2203    0.001583
2204    0.001580
2205    0.001459
2206    0.001414
2207    0.001414
2208    0.001411
2209    0.001427
2210    0.001400
2211    0.001479
2212    0.001495
2213    0.001503
2214    0.001435
2215    0.001428
2216    0.0014

In [26]:
df_normalized.loc[1]-df_mean[1]

Unnamed: 1    -0.001527
Unnamed: 2    -0.001527
Unnamed: 3    -0.001527
Unnamed: 4    -0.001527
Unnamed: 5    -0.001527
Unnamed: 6    -0.001527
Unnamed: 7    -0.001527
Unnamed: 8    -0.001527
Unnamed: 9    -0.001527
Unnamed: 10   -0.001527
Unnamed: 11   -0.001527
Unnamed: 12   -0.001527
Unnamed: 13   -0.001527
Unnamed: 14   -0.001527
Unnamed: 15   -0.001527
Unnamed: 16   -0.001527
Unnamed: 17   -0.001527
Unnamed: 18   -0.001527
Unnamed: 19   -0.001527
Unnamed: 20   -0.001527
Unnamed: 21   -0.001527
Unnamed: 22   -0.001527
Unnamed: 23   -0.001527
abilities     -0.001527
ability       -0.001527
able          -0.001527
abnormal      -0.001527
absence       -0.001527
absent        -0.001527
absolute      -0.001527
                 ...   
ztransform    -0.001527
zurich        -0.001527
µθt           -0.001527
école         -0.001527
αjt           -0.001527
αxt           -0.001527
θlj           -0.001527
λkak          -0.001527
λkz           -0.001527
σθt           -0.001527
‘amount       -0

In [29]:
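# 16) take each vector and subtract its components along v_avg
# .copy() matters here: without it, df_deviation is just another name for 
# df_normalized, and re-running the cell subtracts the mean a second time 
# (the table below shows twice the row mean, which is consistent with that)
df_deviation = df_normalized.copy()
# loop over every row (range(2218) would skip the last of the 2219 rows)
for row in range(len(df_normalized)):
    df_deviation.loc[row] = df_normalized.loc[row] - df_mean[row]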
    


Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,...,𝑅𝑒𝑐𝑎𝑙𝑙,𝑇𝐹𝑖,𝑔𝑎𝑖𝑛,𝑚𝑒𝑎𝑠𝑢𝑟𝑒,𝑝𝑜𝑠𝑡𝑡𝑒𝑠𝑡,𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛,𝑝𝑟𝑒𝑡𝑒𝑠𝑡,𝑟𝑒𝑐𝑎𝑙𝑙,𝑠𝑐𝑜𝑟𝑒,𝟎𝟒𝟕
0,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,...,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954,-0.002954
1,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,...,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053,-0.003053
2,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,...,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211,-0.003211
3,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,...,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175,-0.003175
4,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,...,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122,-0.003122


In [30]:
# 17) put the code above in a function that takes in a dataframe as an argument
# and computes deviation vectors of each row (=document)
def compute_deviation(document):
    # normalize each row to unit length
    normalized = apply_normalizer(document, Normalizer)
    # mean of the components of each row (one value per row)
    row_mean = normalized.mean(axis=1)
    # subtract each row's mean from every component of that row
    deviation = normalized.sub(row_mean, axis=0)
    return deviation
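
One possible reading of steps 14 to 16 (an assumption, since it depends on the figure above) is that v_avg is the sum of all the row vectors scaled to unit length, and that each deviation vector is the row minus its projection onto v_avg. Here is a NumPy sketch of that interpretation; the function name `deviation_vectors` is hypothetical, and it assumes the sum of the rows is not the zero vector.

```python
import numpy as np


def deviation_vectors(matrix):
    """Remove from each row its component along the average direction."""
    matrix = np.asarray(matrix, dtype=float)
    v_sum = matrix.sum(axis=0)               # sum of all the row vectors
    v_avg = v_sum / np.linalg.norm(v_sum)    # unit vector in that direction
    projections = matrix @ v_avg             # component of each row on v_avg
    return matrix - np.outer(projections, v_avg)


demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dev = deviation_vectors(demo)
print(np.round(dev, 3))
# Every deviation vector should be perpendicular to the average direction.
print(np.round(dev @ np.array([1.0, 1.0]), 6))
```

This differs from subtracting the mean of each row: the projection removes what all documents have in common (the average direction), so that what remains describes how each document deviates from it.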
    



# WEEK 7 - CLUSTERING

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [17]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    # (numTopics is passed along so that the requested number of clusters is used)
    results_clustering = cluster_matrix(matrix, numTopics)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization