# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next few weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remove stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, with a 25-word overlap
5. create a word count matrix, where each chunk is a row and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents
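
Steps 6 to 8 are only named in the roadmap, so here is a minimal sketch of what they could look like on a toy count matrix. The function names (`dampen`, `normalize_row`, `deviation_rows`) are hypothetical, and the last step assumes one reading of "deviation vectors": remove each row's component along the average direction, then rescale. Check that reading against Sherin's paper before relying on it.

```python
import math

def dampen(row):
    """Step 6: replace each nonzero count c with 1 + log(c); zeros stay zero."""
    return [1 + math.log(c) if c > 0 else 0.0 for c in row]

def normalize_row(row):
    """Step 7: divide a row by its Euclidean length (all-zero rows are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in row))
    return [x / length for x in row] if length > 0 else list(row)

def deviation_rows(unit_rows):
    """Step 8 (assumed reading): subtract each row's projection on the mean direction."""
    n = len(unit_rows)
    mean_direction = normalize_row([sum(col) / n for col in zip(*unit_rows)])
    deviations = []
    for row in unit_rows:
        projection = sum(x * m for x, m in zip(row, mean_direction))
        residual = [x - projection * m for x, m in zip(row, mean_direction)]
        deviations.append(normalize_row(residual))
    return deviations

# toy word count matrix: 3 chunks (rows) by 3 words (columns)
toy_counts = [[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]]
toy_unit = [normalize_row(dampen(row)) for row in toy_counts]
toy_deviations = deviation_rows(toy_unit)
for row in toy_deviations:
    print([round(x, 3) for x in row])
```

After the last step, every row is orthogonal to the average direction, so what remains is how each chunk differs from the "typical" chunk.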

In [18]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-word overlapping fragments
    documents = flatten_and_overlap(documents, 100, 25)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

## Step 1 - Data Retrieval

In [19]:
# 1) using glob, find all the text files in the "Papers" folder
# Hint: refer to last week's notebook
import glob
import os
#os.chdir('Papers')
paths = glob.glob('paper*.txt')
paths.sort()
print(paths, len(paths))

['paper0.txt', 'paper1.txt', 'paper10.txt', 'paper11.txt', 'paper12.txt', 'paper13.txt', 'paper14.txt', 'paper15.txt', 'paper16.txt', 'paper2.txt', 'paper3.txt', 'paper4.txt', 'paper5.txt', 'paper6.txt', 'paper7.txt', 'paper8.txt', 'paper9.txt'] 17


In [5]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
import os
documents = []
for file in paths:
    with open (file,'r',encoding='utf-8') as f:
        documents.append(f.read())

documents[0]

"\x0czone out no more: mitigating mind wandering during\ncomputerized reading\nsidney k. d’mello, caitlin mills, robert bixler, & nigel bosch\nuniversity of notre dame\n118 haggar hall\nnotre dame, in 46556, usa\nsdmello@nd.edu\n\nabstract\nmind wandering, defined as shifts in attention from task-related\nprocessing to task-unrelated thoughts, is a ubiquitous\nphenomenon that has a negative influence on performance and\nproductivity in many contexts, including learning. we propose\nthat next-generation learning technologies should have some\nmechanism to detect and respond to mind wandering in real-time.\ntowards this end, we developed a technology that automatically\ndetects mind wandering from eye-gaze during learning from\ninstructional texts. when mind wandering is detected, the\ntechnology intervenes by posing just-in-time questions and\nencouraging re-reading as needed. after multiple rounds of\niterative refinement, we summatively compared the technology to\na yoked-control in a

In [8]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
print(documents[0][:1000])

zone out no more: mitigating mind wandering during
computerized reading
sidney k. d’mello, caitlin mills, robert bixler, & nigel bosch
university of notre dame
118 haggar hall
notre dame, in 46556, usa
sdmello@nd.edu

abstract
mind wandering, defined as shifts in attention from task-related
processing to task-unrelated thoughts, is a ubiquitous
phenomenon that has a negative influence on performance and
productivity in many contexts, including learning. we propose
that next-generation learning technologies should have some
mechanism to detect and respond to mind wandering in real-time.
towards this end, we developed a technology that automatically
detects mind wandering from eye-gaze during learning from
instructional texts. when mind wandering is detected, the
technology intervenes by posing just-in-time questions and
encouraging re-reading as needed. after multiple rounds of
iterative refinement, we summatively compared the technology to
a yoked-control in an experiment with 104 par

## Step 2 - Data Cleaning

In [37]:
# 4) only select the text that's between the first occurrence of the 
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

select_doc = []

for document in documents:
    print("before:",len(document), end= ' ')
    document = document[document.find("abstract\n"):document.rfind("references\n")]
    print("after:",len(document))
    select_doc.append(document)

print(len(select_doc))
print(select_doc[0])

before: 50043 after: 39318
before: 41110 after: 35514
before: 49177 after: 42621
before: 32277 after: 28206
before: 40387 after: 34778
before: 45258 after: 42251
before: 40655 after: 32734
before: 31574 after: 28134
before: 42046 after: 37649
before: 46761 after: 42253
before: 47377 after: 42978
before: 44037 after: 40032
before: 37214 after: 30738
before: 47851 after: 41302
before: 42617 after: 35102
before: 45724 after: 45357
before: 47845 after: 44059
17
abstract
mind wandering, defined as shifts in attention from task-related
processing to task-unrelated thoughts, is a ubiquitous
phenomenon that has a negative influence on performance and
productivity in many contexts, including learning. we propose
that next-generation learning technologies should have some
mechanism to detect and respond to mind wandering in real-time.
towards this end, we developed a technology that automatically
detects mind wandering from eye-gaze during learning from
instructional texts. when mind wandering i
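
One caveat about slicing with `find` and `rfind`: both return -1 when the marker is missing, and a -1 used as a slice bound silently yields an empty or clipped string instead of an error. The sketch below is one way to guard against that; `slice_body` and its fallback behavior (keep everything when a marker is absent) are assumptions, not part of the assignment.

```python
def slice_body(text, start_marker="abstract\n", end_marker="references\n"):
    """Keep the text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)    # first occurrence, or -1
    end = text.rfind(end_marker)       # last occurrence, or -1
    if start == -1:
        start = 0                      # assumption: no marker means keep the beginning
    if end == -1 or end < start:
        end = len(text)                # assumption: no marker means keep the end
    return text[start:end]

toy = "title\nabstract\nsee references\nin the body\nreferences\n[1] a paper\n"
print(repr(slice_body(toy)))
print(repr(slice_body("no markers here")))
```

Because `rfind` searches from the right, the earlier "references" inside the body is kept and only the final reference list is cut.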

In [36]:
# Using a regex:
import re
d1 = documents[0]
body = re.findall(r'abstract(.*?)reference', d1, re.DOTALL)
print(body)
body = body[0]
print(body)

# Note: the non-greedy `.*?` stops at the first "reference" that follows "abstract", so this cuts the body short
# for any paper that mentions "reference" before its reference list

['\nmind wandering, defined as shifts in attention from task-related\nprocessing to task-unrelated thoughts, is a ubiquitous\nphenomenon that has a negative influence on performance and\nproductivity in many contexts, including learning. we propose\nthat next-generation learning technologies should have some\nmechanism to detect and respond to mind wandering in real-time.\ntowards this end, we developed a technology that automatically\ndetects mind wandering from eye-gaze during learning from\ninstructional texts. when mind wandering is detected, the\ntechnology intervenes by posing just-in-time questions and\nencouraging re-reading as needed. after multiple rounds of\niterative refinement, we summatively compared the technology to\na yoked-control in an experiment with 104 participants. the key\ndependent variable was performance on a post-reading\ncomprehension assessment. our results suggest that the\ntechnology was successful in correcting comprehension deficits\nattributed to mind
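
To see why the regex version behaves differently from `find`/`rfind`, here is a small sketch on a made-up string. It compares the non-greedy pattern used above with its greedy counterpart; the toy text is an assumption chosen only to make the difference visible.

```python
import re

toy = "abstract A reference B reference C"

# non-greedy: stop at the FIRST "reference" after "abstract"
lazy = re.findall(r"abstract(.*?)reference", toy, re.DOTALL)
# greedy: extend to the LAST "reference" in the string
greedy = re.findall(r"abstract(.*)reference", toy, re.DOTALL)

print(lazy)
print(greedy)
```

The greedy pattern is the closer match to the `find`/`rfind` slicing, since it also runs up to the last occurrence of the end marker.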

In [143]:
# 5) replace newline characters (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

select_doc_clean = [doc.replace("\n"," ") for doc in select_doc]

# Or, 
#for i,doc in enumerate(documents): 
#    documents[i] = doc.replace('\n', ' ')

select_doc_clean[0][:1000]

'abstract mind wandering, defined as shifts in attention from task-related processing to task-unrelated thoughts, is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts, including learning. we propose that next-generation learning technologies should have some mechanism to detect and respond to mind wandering in real-time. towards this end, we developed a technology that automatically detects mind wandering from eye-gaze during learning from instructional texts. when mind wandering is detected, the technology intervenes by posing just-in-time questions and encouraging re-reading as needed. after multiple rounds of iterative refinement, we summatively compared the technology to a yoked-control in an experiment with 104 participants. the key dependent variable was performance on a post-reading comprehension assessment. our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering

In [144]:
# 6) replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)


# To avoid running two nested loops, I used str.maketrans and str.translate, which turned out to be more
# time-efficient (see the time benchmark in the next cell).
punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']
del(punctuation[1])
str_punc = ''.join(punctuation)
trans_table = str_punc.maketrans(str_punc, ' '*len(str_punc))
select_doc_clean = [doc.translate(trans_table) for doc in select_doc_clean]
select_doc_clean = [doc.replace('...',' ') for doc in select_doc_clean]
select_doc_clean[0][:2000]

'abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with 104 participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering

In [94]:
# 6) replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)


# In this list, the only 
punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

for punc in punctuation:
    select_doc_clean = [doc.replace(punc," ") for doc in select_doc_clean]

select_doc_clean[0][:2000]

'abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with 104 participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering
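
The two cells above should produce the same text, and a small sketch can check that on a toy string. The helper names below are hypothetical. The check relies on one assumption that holds for the list above: every character of a multi-character entry (such as `'...'`) also appears as a single-character entry, and the single-character entries are replaced first.

```python
def strip_with_replace(text, symbols):
    """Loop version: one str.replace pass per symbol."""
    for symbol in symbols:
        text = text.replace(symbol, " ")
    return text

def strip_with_translate(text, symbols):
    """Table version: map every character found in symbols to a space in a single pass."""
    table = str.maketrans({ch: " " for ch in "".join(symbols)})
    return text.translate(table)

symbols = [".", ",", "-", '"', "..."]
sample = 'task-related, "real-time" and so on...'

print(repr(strip_with_replace(sample, symbols)))
print(repr(strip_with_translate(sample, symbols)))
```

Once the single characters have been replaced, the multi-character entries can no longer match anything in the loop version, so they are effectively redundant there.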

In [145]:
# 7) replace numbers with either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing

# Method 1: translate, 40 ms
from string import digits
remove_digits = str.maketrans('', '', digits)
select_doc_clean = [paper.translate(remove_digits) for paper in select_doc_clean]
print (select_doc_clean[0][:2000])

# Method 2: string join with filter, 100 ms
print([''.join(filter(lambda x: not x.isdigit(), paper)) for paper in select_doc_clean][0][:2000])

# Method 3: string join iterate, 66 ms
print([''.join([i for i in paper if not i.isdigit()]) for paper in select_doc_clean][0][:2000])

abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with  participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering  d 
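
The three methods above all delete digits outright. The exercise also allows replacing each number with the word "number", which keeps a trace of where quantities appeared. A minimal sketch of that variant, with a hypothetical helper name:

```python
import re

def replace_numbers(text, placeholder=" number "):
    """Replace every run of digits with the placeholder word."""
    return re.sub(r"\d+", placeholder, text)

sample = "an experiment with 104 participants in 2016"
# split/join collapses the extra spaces introduced by the padded placeholder
print(" ".join(replace_numbers(sample).split()))
```

Whether the placeholder helps or just adds a very frequent, uninformative word to the vocabulary is a judgment call for this corpus.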

In [146]:
# 8) Remove the stop words below from our documents
# print the first 1000 characters of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']

print (select_doc_clean[0][:2000])

for i, doc in enumerate (select_doc_clean):
    select_doc_clean[i] = [word for word in select_doc_clean[i].split() if word not in stop_words]
    select_doc_clean[i] = ' '.join(select_doc_clean[i])
    

print (select_doc_clean[0][:2000])



abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with  participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering  d 
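
A small design note on the stop-word filter: `word not in stop_words` scans the whole list for every word, while a `set` answers the same question with a hash lookup. The sketch below shows the set-based variant on a toy sentence; the helper name is hypothetical and the result should be identical to the list version.

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated word that appears in stop_words."""
    stop_set = set(stop_words)  # build once, reuse for every word
    return " ".join(word for word in text.split() if word not in stop_set)

toy_stop_words = ["we", "that", "the", "should"]
print(remove_stop_words("we propose that the technology should help", toy_stop_words))
```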

In [160]:
# 9) remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

# Testing
string = "one recent g study tracked mi s "
pattern = re.compile(r"\b[a-zA-Z]{1,2}\b")
new_string = re.sub(pattern, '', string)
print(new_string)

select_doc_clean = [re.sub(pattern, '', doc) for doc in select_doc_clean]
print (select_doc_clean[0][:1000])


one recent  study tracked   
abstract mind wandering defined shifts attention task related processing task unrelated thoughts ubiquitous phenomenon negative influence performance productivity many contexts including learning propose next generation learning technologies mechanism detect respond mind wandering real time towards end developed technology automatically detects mind wandering eye gaze learning instructional texts mind wandering detected technology intervenes posing time questions encouraging  reading needed multiple rounds iterative refinement summatively compared technology yoked control experiment participants key dependent variable performance post reading comprehension assessment results suggest technology successful correcting comprehension deficits attributed mind wandering  sigma specific conditions thereby highlighting potential improve learning attending attention keywords mind wandering gaze tracking student modeling attentionaware introduction despite best effort
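
The regex removes the short words but leaves their surrounding spaces behind, which is why the output above contains double spaces. An alternative sketch filters on word length after splitting. Note the assumption: unlike the `[a-zA-Z]` pattern, this drops any token shorter than three characters, including non-ASCII ones.

```python
def drop_short_words(text, min_length=3):
    """Keep only the words with at least min_length characters."""
    return " ".join(word for word in text.split() if len(word) >= min_length)

print(drop_short_words("one recent g study tracked mi s "))
```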


### Putting it all together

In [155]:
# 10) package all of your work above into a function that cleans a given document

# Required packages/functions: digits in string, re
def clean_list_of_documents(documents):
    
    # ensure the needed packages are imported
    import re
    from string import digits
    
    cleaned_docs = []

    ### your code ###
    # define constants
    punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']
    
    stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']
    
    # Step 1: select between abstract and references
    for document in documents:
        document = document[document.find("abstract\n"):document.rfind("references\n")]
        cleaned_docs.append(document)
    
    # Step 2: replace newlines with a white space
    cleaned_docs = [doc.replace("\n"," ") for doc in cleaned_docs]
    
    # Step 3: remove punctuation
    for punc in punctuation:
        cleaned_docs = [doc.replace(punc," ") for doc in cleaned_docs]
    
    # Step 4: remove numbers (using the most time efficient method)
    remove_digits = str.maketrans('', '', digits)
    cleaned_docs = [doc.translate(remove_digits) for doc in cleaned_docs]
    
    # Step 5: remove stop_words
    for i, doc in enumerate (cleaned_docs):
        cleaned_docs[i] = [word for word in cleaned_docs[i].split() if word not in stop_words]
        cleaned_docs[i] = ' '.join(cleaned_docs[i])
    
    # Step 6: remove words with one or two characters
    pattern = re.compile(r"\b[a-zA-Z]{1,2}\b")
    cleaned_docs = [re.sub(pattern, '', doc) for doc in cleaned_docs]
        
        
    return cleaned_docs

In [159]:
# 11a) reimport your raw data using the code in 2)
documents = []

for file in paths:
    with open (file,'r',encoding='utf-8') as f:
        documents.append(f.read())
        
# 11b) clean your files using the function above
cleaned_data = clean_list_of_documents(documents)

# 11c) print the first 1000 characters of the first document
print(cleaned_data[0][:1000])

abstract mind wandering defined shifts attention task related processing task unrelated thoughts ubiquitous phenomenon negative influence performance productivity many contexts including learning propose next generation learning technologies mechanism detect respond mind wandering real time towards end developed technology automatically detects mind wandering eye gaze learning instructional texts mind wandering detected technology intervenes posing time questions encouraging  reading needed multiple rounds iterative refinement summatively compared technology yoked control experiment participants key dependent variable performance post reading comprehension assessment results suggest technology successful correcting comprehension deficits attributed mind wandering  sigma specific conditions thereby highlighting potential improve learning attending attention keywords mind wandering gaze tracking student modeling attentionaware introduction despite best efforts write clear engaging paper 

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.
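
To make the link between vocabulary and matrix columns concrete, here is a minimal sketch on two toy documents. The names `build_vocabulary` and `column_of` are hypothetical; the point is only that sorting the unique words gives every word a stable column position.

```python
def build_vocabulary(docs):
    """Return the sorted list of unique whitespace-separated words in docs."""
    return sorted({word for doc in docs for word in doc.split()})

toy_docs = ["mind wandering reading", "reading comprehension mind"]
toy_vocab = build_vocabulary(toy_docs)

# each word's position in the sorted vocabulary is its column in the matrix
column_of = {word: j for j, word in enumerate(toy_vocab)}

print(toy_vocab)
print(column_of["reading"])
```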

In [13]:
import math
import numpy as np

In [229]:
filter(lambda x: x in string.printable, '\x01string')

'<filter object at 0x10e40e400>'

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): 

In [233]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

# test

import string
test_t = (cleaned_data[0]).split()
print(len(test_t))
set_t = set(test_t)
print(len(set_t))


# note: function does not store any non printable characters (e.g. "sigma", "\x00", etc) in the vocabulary
# they should probably be stripped already in pre-processing (i.e. adding a step 7 to the previous function), 
# as there is no need to analyze them.
def get_vocabulary(documents): 
    
    # import necessary libraries/functions:
    from string import printable
    
    ### your code ###
    long_doc_list = ' '.join(documents).split()
    # keep only the printable characters of each word (uses the `printable` name imported above)
    cleaned_word = [''.join(ch for ch in word if ch in printable) for word in long_doc_list]
    voc = set([word.strip() for word in cleaned_word])
    voc = sorted(voc)
    
    return voc

# Then print the length of your vocabulary (it should be 
# around 5500 words)

voc = get_vocabulary (cleaned_data)
print(len(voc))
print(voc[:100], voc[-200:])

3333
1248
5700


In [None]:
# 14) what was the size of Sherin's vocabulary? 

# 1,429 words

## Step 4 - transform your documents into 100-word chunks

In [260]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-word chunks 
# (with a 25-word overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap size
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!

# test
test_lst = list(range(0,(156 - 100),25))
test_word = cleaned_data[0].split()
test_o = [test_word[i:i+100] for i in test_lst]
print(test_o[0], "\n", len(test_o[0]), "\n", test_o[-1], "\n", len(test_o[-1]))

# note: this way of chunking will miss a few words at the end of the corpus when the windows do not line up
# with its length. I personally would prefer a chunk size that divides the word count without a remainder.
# note: `overlap` is used as the step between chunk starts, so consecutive chunks share num_word - overlap words.
def flatten_and_overlap(documents, num_word=100, overlap=25):
    long_doc = ' '.join(documents).split()
    # "+ 1" keeps the final full-length window when it ends exactly on the last word
    start_index = list(range(0, len(long_doc) - num_word + 1, overlap))
    chunks = [long_doc[i:i+num_word] for i in start_index]
    print(len(start_index), len(chunks))
    chunks = [' '.join(chunk) for chunk in chunks]
    print(chunks[-1])
    
    
    return chunks
chunks = flatten_and_overlap(cleaned_data, 100, 25)
print(cleaned_data[-1][-1500:])

['abstract', 'mind', 'wandering', 'defined', 'shifts', 'attention', 'task', 'related', 'processing', 'task', 'unrelated', 'thoughts', 'ubiquitous', 'phenomenon', 'negative', 'influence', 'performance', 'productivity', 'many', 'contexts', 'including', 'learning', 'propose', 'next', 'generation', 'learning', 'technologies', 'mechanism', 'detect', 'respond', 'mind', 'wandering', 'real', 'time', 'towards', 'end', 'developed', 'technology', 'automatically', 'detects', 'mind', 'wandering', 'eye', 'gaze', 'learning', 'instructional', 'texts', 'mind', 'wandering', 'detected', 'technology', 'intervenes', 'posing', 'time', 'questions', 'encouraging', 'reading', 'needed', 'multiple', 'rounds', 'iterative', 'refinement', 'summatively', 'compared', 'technology', 'yoked', 'control', 'experiment', 'participants', 'key', 'dependent', 'variable', 'performance', 'post', 'reading', 'comprehension', 'assessment', 'results', 'suggest', 'technology', 'successful', 'correcting', 'comprehension', 'deficits', 
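
The relationship between chunk size, step, and shared words is easier to see on a tiny example. The sketch below slides a 4-item window over ten letters with a step of 2; `sliding_chunks` is a hypothetical helper that takes the step explicitly, so the number of items shared by neighbouring chunks is `size - step`.

```python
def sliding_chunks(words, size, step):
    """Return every full window of `size` items, starting a new window every `step` items."""
    last_start = len(words) - size
    return [words[i:i + size] for i in range(0, last_start + 1, step)]

letters = list("abcdefghij")
chunks_demo = sliding_chunks(letters, size=4, step=2)
for chunk in chunks_demo:
    print("".join(chunk))
```

With 100-word chunks, a step of 25 means neighbouring chunks share 75 words, while a step of 75 means they share 25, so it is worth checking which of the two the paper intends.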

In [261]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check

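# asserting a bare generator always passes, so wrap the check in all();
# each chunk is a space-joined string, so count its words rather than its characters
assert all(len(chunk.split()) == 100 for chunk in chunks)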

In [268]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?

print (chunks[0])
print (documents[0][:1500])

# the chunks contain only the semantically meaningful words, similar to Sherin's

abstract mind wandering defined shifts attention task related processing task unrelated thoughts ubiquitous phenomenon negative influence performance productivity many contexts including learning propose next generation learning technologies mechanism detect respond mind wandering real time towards end developed technology automatically detects mind wandering eye gaze learning instructional texts mind wandering detected technology intervenes posing time questions encouraging reading needed multiple rounds iterative refinement summatively compared technology yoked control experiment participants key dependent variable performance post reading comprehension assessment results suggest technology successful correcting comprehension deficits attributed mind wandering sigma specific conditions thereby highlighting potential improve learning attending attention keywords mind wandering
zone out no more: mitigating mind wandering during
computerized reading
sidney k. d’mello, caitlin mills, ro

In [None]:
# 18) how many chunks did Sherin have? What does a chunk become 
# in the next step of our topic modeling algorithm? 

# 794, vectors

In [23]:
# 19) what are some other preprocessing steps we could do 
# to improve the quality of the text data? Mention at least 2.

# 1. as previously mentioned, add a step 7 that removes or converts all non-printable characters
# 2. use a chunk length that evenly divides the word count of the cleaned data

In [25]:
# 20) in your own words, describe the next steps of the 
# data modeling algorithms (listed below):

# Convert the chunks into vectors by counting the occurrence of vocab in each chunk and use vector and matrix
# operations to compare their similarities
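
As a preview of that counting step, here is a minimal sketch that turns a few toy chunks into a chunk-by-word count matrix. `count_matrix` is a hypothetical stand-in for the `docs_by_words_matrix` function named in the pipeline, written with plain lists so it runs without extra libraries.

```python
from collections import Counter

def count_matrix(chunks, vocabulary):
    """Return one row per chunk and one column per vocabulary word, filled with word counts."""
    column_of = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in chunks]
    for i, chunk in enumerate(chunks):
        for word, n in Counter(chunk.split()).items():
            if word in column_of:  # words outside the vocabulary are ignored
                matrix[i][column_of[word]] = n
    return matrix

toy_chunks = ["mind wandering mind", "reading mind"]
toy_vocabulary = ["mind", "reading", "wandering"]
for row in count_matrix(toy_chunks, toy_vocabulary):
    print(row)
```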

## Step 5 - Vector and Matrix operations

## Step 6 - Weight word frequency

## Step 7 - Matrix normalization

## Step 8 - Deviation Vectors

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [17]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-word overlapping fragments
    documents = flatten_and_overlap(documents, 100, 25)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization