# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remove stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, with a 25-word overlap
5. create a word count matrix, where each row represents a chunk and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

In [None]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of overlapping 100-word fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

## Step 1 - Data Retrieval

In [20]:
# 1) using glob, find all the text files in the "Papers" folder
# Hint: refer to last week's notebook
import os
import glob
os.chdir('/Users/peizhiwen/Documents/GitHub/week5-vsm-1-lukewpz/')
text_file = glob.glob('./Papers/paper*.txt')
print(len(text_file))

17


In [21]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []
for text in text_file:
    # the with-statement closes each file once it has been read
    with open(text, 'r', encoding='utf-8') as file:
        documents.append(file.read())
print(len(documents))

17


In [22]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
print(documents[0][:1000])

zone out no more: mitigating mind wandering during
computerized reading
sidney k. d’mello, caitlin mills, robert bixler, & nigel bosch
university of notre dame
118 haggar hall
notre dame, in 46556, usa
sdmello@nd.edu

abstract
mind wandering, defined as shifts in attention from task-related
processing to task-unrelated thoughts, is a ubiquitous
phenomenon that has a negative influence on performance and
productivity in many contexts, including learning. we propose
that next-generation learning technologies should have some
mechanism to detect and respond to mind wandering in real-time.
towards this end, we developed a technology that automatically
detects mind wandering from eye-gaze during learning from
instructional texts. when mind wandering is detected, the
technology intervenes by posing just-in-time questions and
encouraging re-reading as needed. after multiple rounds of
iterative refinement, we summatively compared the technology to
a yoked-control in an experiment with 104 par

## Step 2 - Data Cleaning

In [70]:
# 4) only select the text that's between the first occurrence of the 
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

docs = []
for document in documents:
    print(len(document), end=' ')
    # index() raises a ValueError if 'abstract' is missing from the first 8000 characters
    abstract = document.index('abstract', 0, 8000)
    # rfind() returns the position of the last occurrence (or -1 if there is none)
    reference = document.rfind('reference')
    newtext = document[abstract:reference]
    docs.append(newtext)
    # newtext is already the selected span, so we print its length directly
    print(len(newtext))



50043 39318
41110 35514
49177 42621
32277 28206
40387 34778
45258 42251
40655 32734
31574 28134
42046 37649
46761 42253
47377 42978
44037 40032
37214 32762
47851 41302
42617 35102
45724 39947
47845 44059


In [71]:
# using a regex:
#import re
#d1 = documents[0]
#body = re.findall(r'abstract(.*?)reference', d1, re.DOTALL)
#print(body)
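
The `index`/`rfind` slicing used in step 4 can also be wrapped in a small helper that does not crash when a marker is missing. The sketch below is only an illustration: `extract_body` and the toy `sample` string are made-up names that are not part of the assignment, and falling back to the full text is just one possible choice.

```python
def extract_body(text, start_marker='abstract', end_marker='reference'):
    """Return the text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)   # find() returns -1 when the marker is missing
    end = text.rfind(end_marker)      # rfind() searches from the right-hand side
    if start == -1 or end == -1 or end <= start:
        # assumption: keep the text unchanged if the markers cannot be used
        return text
    return text[start:end]


# hypothetical toy document, only used to illustrate the slicing
sample = 'some title abstract we cite a reference here. main text. references: [1] a paper'
print(extract_body(sample))
print(extract_body('no markers in this text'))
```

Unlike `index`, `find` does not raise an error, so a paper without an "abstract" heading would pass through untouched instead of stopping the loop.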

In [72]:
# 5) replace newline characters (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

#option 1:
#nocarriage = []
#for selecttext in selecttexts:
#    newselect = selecttext.replace('\n', ' ')
#    nocarriage.append(newselect)
#print(nocarriage[0][:1000])
#for i in nocarriage:
#    print(len(i))

#option 2:
#for i in range(len(documents)):
#    documents[i] = documents[i].replace('\n',' ')

#option 3:
for i, doc in enumerate(docs):
    docs[i] = doc.replace('\n', ' ')

In [73]:
# 6) replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

for i, doc in enumerate(docs):
    print(len(doc), end=' ')
    # replace each punctuation mark in turn, then store the result
    for punc in punctuation:
        doc = doc.replace(punc, ' ')
    docs[i] = doc
    print(len(docs[i]))
print(docs[0][:1000])

39318 39318
35514 35514
42621 42621
28206 28206
34778 34778
42251 42251
32734 32734
28134 28134
37649 37649
42253 42253
42978 42978
40032 40032
32762 32762
41302 41302
35102 35102
39947 39947
44059 44059
abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with 104 participants  the 
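
As a possible alternative to looping over the punctuation list, Python's built-in `str.translate` can replace many single characters in one pass. This is a sketch under assumptions: it uses `string.punctuation` and `string.digits` plus the four non-ASCII marks from the list above, `strip_punct_digits` is a hypothetical helper name, and the multi-character entries such as `'...'` are left out because a translation table only accepts single characters.

```python
import string

# every character in this string is mapped to a single white space
chars_to_blank = string.punctuation + string.digits + '−”“’'
blank_table = str.maketrans({char: ' ' for char in chars_to_blank})


def strip_punct_digits(text):
    """Replace punctuation marks and digits with white spaces."""
    return text.translate(blank_table)


example = 'just-in-time questions, 104 participants!'
result = strip_punct_digits(example)
print(result.split())
```

Because each character is swapped for exactly one space, the string keeps its length, which matches the identical before/after lengths printed in step 6.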

In [74]:
# 7) replace numbers with either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing
numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
for i, doc in enumerate(docs):
    # replace each digit in turn, then store the result
    for num in numbers:
        doc = doc.replace(num, ' ')
    docs[i] = doc
print(docs[0][:1000])
    


abstract mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with     participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering 

In [85]:
# 8) Remove the stop words below from our documents
# print the first 1000 words of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']

# a set makes each membership test fast
stop_set = set(stop_words)

nostop = []
for doc in docs:
    words = doc.split()
    print(len(words), end=' ')
    # keep only the words that are not stop words
    kept = [word for word in words if word not in stop_set]
    nostop.append(kept)
    print(len(kept))
print(len(nostop))
print(nostop[0][:1000])

5759 3428
5427 3348
6339 3913
4056 2503
5246 3230
6688 3991
5052 3018
4222 2656
5489 3443
6394 3931
6569 3972
6080 3848
4881 3228
6305 3842
5516 3327
6006 3967
6442 4182
17
['abstract', 'mind', 'wandering', 'defined', 'shifts', 'attention', 'task', 'related', 'processing', 'task', 'unrelated', 'thoughts', 'ubiquitous', 'phenomenon', 'negative', 'influence', 'performance', 'productivity', 'many', 'contexts', 'including', 'learning', 'propose', 'next', 'generation', 'learning', 'technologies', 'mechanism', 'detect', 'respond', 'mind', 'wandering', 'real', 'time', 'towards', 'end', 'developed', 'technology', 'automatically', 'detects', 'mind', 'wandering', 'eye', 'gaze', 'learning', 'instructional', 'texts', 'mind', 'wandering', 'detected', 'technology', 'intervenes', 'posing', 'time', 'questions', 'encouraging', 're', 'reading', 'needed', 'multiple', 'rounds', 'iterative', 'refinement', 'summatively', 'compared', 'technology', 'yoked', 'control', 'experiment', 'participants', 'key', 'dep

In [86]:
# 9) remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 words of the first document
cleaned = []
for doc in nostop:
    # keep only the words that are longer than two characters
    longwords = [word for word in doc if len(word) > 2]
    cleaned.append(longwords)
    print(len(longwords))
print(len(cleaned))
print(cleaned[0][:1000])

3331
3131
3653
2440
3132
3479
2894
2541
3308
3601
3884
3408
2648
3659
2957
3510
3988
17
['abstract', 'mind', 'wandering', 'defined', 'shifts', 'attention', 'task', 'related', 'processing', 'task', 'unrelated', 'thoughts', 'ubiquitous', 'phenomenon', 'negative', 'influence', 'performance', 'productivity', 'many', 'contexts', 'including', 'learning', 'propose', 'next', 'generation', 'learning', 'technologies', 'mechanism', 'detect', 'respond', 'mind', 'wandering', 'real', 'time', 'towards', 'end', 'developed', 'technology', 'automatically', 'detects', 'mind', 'wandering', 'eye', 'gaze', 'learning', 'instructional', 'texts', 'mind', 'wandering', 'detected', 'technology', 'intervenes', 'posing', 'time', 'questions', 'encouraging', 'reading', 'needed', 'multiple', 'rounds', 'iterative', 'refinement', 'summatively', 'compared', 'technology', 'yoked', 'control', 'experiment', 'participants', 'key', 'dependent', 'variable', 'performance', 'post', 'reading', 'comprehension', 'assessment', 'resu


### Putting it all together

In [89]:
# 10) package all of your work above into a function that cleans a given document

def clean_list_of_documents(documents):
    
    cleaned_docs = []

    # keep the text between the first "abstract" and the last "reference",
    # and replace newlines with a white space
    docs = []
    for document in documents:
        abstract = document.index('abstract', 0, 8000)
        reference = document.rfind('reference')
        newtext = document[abstract:reference]
        docs.append(newtext.replace('\n', ' '))

    punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']
    numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
    
    # replace every punctuation mark and every digit with a white space
    for i, doc in enumerate(docs):
        for char in punctuation + numbers:
            doc = doc.replace(char, ' ')
        docs[i] = doc

    stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']
    
    # split each document into words, then drop the stop words
    # and the words with one or two characters
    stop_set = set(stop_words)
    for doc in docs:
        words = [word for word in doc.split() if word not in stop_set]
        cleaned_docs.append([word for word in words if len(word) > 2])

    return cleaned_docs

In [117]:
# 11a) reimport your raw data using the code in 2)
documents = []
for text in text_file:
    # the with-statement closes each file once it has been read
    with open(text, 'r', encoding='utf-8') as file:
        documents.append(file.read())
print(len(documents))
        
# 11b) clean your files using the function above
cleaneddoc = clean_list_of_documents(documents)

# 11c) print the first 1000 words of the first document
print(cleaneddoc[0][:1000])

17
['abstract', 'mind', 'wandering', 'defined', 'shifts', 'attention', 'task', 'related', 'processing', 'task', 'unrelated', 'thoughts', 'ubiquitous', 'phenomenon', 'negative', 'influence', 'performance', 'productivity', 'many', 'contexts', 'including', 'learning', 'propose', 'next', 'generation', 'learning', 'technologies', 'mechanism', 'detect', 'respond', 'mind', 'wandering', 'real', 'time', 'towards', 'end', 'developed', 'technology', 'automatically', 'detects', 'mind', 'wandering', 'eye', 'gaze', 'learning', 'instructional', 'texts', 'mind', 'wandering', 'detected', 'technology', 'intervenes', 'posing', 'time', 'questions', 'encouraging', 'reading', 'needed', 'multiple', 'rounds', 'iterative', 'refinement', 'summatively', 'compared', 'technology', 'yoked', 'control', 'experiment', 'participants', 'key', 'dependent', 'variable', 'performance', 'post', 'reading', 'comprehension', 'assessment', 'results', 'suggest', 'technology', 'successful', 'correcting', 'comprehension', 'deficits

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.
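
To make the link between the vocabulary and the matrix columns concrete, here is a minimal sketch on toy data. `count_matrix`, `toy_chunks`, and `toy_vocabulary` are illustrative names (the roadmap's `docs_by_words_matrix` may well be implemented differently), and plain Python lists stand in for a numpy array.

```python
def count_matrix(chunks, vocabulary):
    """Build one row per chunk and one column per vocabulary word."""
    # map each word to the index of its column
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in chunks]
    for i, chunk in enumerate(chunks):
        for word in chunk:
            if word in column:   # words outside the vocabulary are ignored
                matrix[i][column[word]] += 1
    return matrix


# hypothetical toy chunks, already cleaned and split into words
toy_chunks = [['mind', 'wandering', 'mind'], ['eye', 'gaze', 'mind']]
toy_vocabulary = sorted({word for chunk in toy_chunks for word in chunk})
toy_matrix = count_matrix(toy_chunks, toy_vocabulary)

print(toy_vocabulary)
for row in toy_matrix:
    print(row)
```

Sorting the vocabulary keeps the column order stable, so the same word always lands in the same column.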

In [93]:
import math
import numpy as np

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): 

First, the vocabulary is pruned using a list of stop words, so that words which carry little of the text's meaning are not counted.

Second, knowing the vocabulary lets us map each passage of text to a vector, so that the similarity between two vectors reflects how similar the passages are in meaning.

In [135]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(documents):
    # collect every distinct word across all documents
    uniquewords = set()
    for doc in documents:
        uniquewords.update(doc)

    # return the vocabulary as an alphabetically sorted list
    return sorted(uniquewords)

# Then print the length of your vocabulary (it should be 
# around 5500 words)

lenclean = get_vocabulary(cleaneddoc)
print(len(lenclean))

5666


In [None]:
# 14) what was the size of Sherin's vocabulary? 
# Sherin's vocabulary contains 1429 words, and the pruned vocabulary contains 647

## Step 4 - transform your documents into 100-word chunks

In [178]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-word chunks 
# (with a 25-word overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap size
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!

def flatten_and_overlap(documents, chunk_size=100, overlap=25):
    
    # flatten the list of documents into one long list of words
    allvoc = [word for doc in documents for word in doc]
    
    wordschunk = []
    # each chunk starts (chunk_size - overlap) words after the previous one
    step = chunk_size - overlap
    # only full chunks are kept; a shorter leftover at the end is dropped
    for start in range(0, len(allvoc) - chunk_size + 1, step):
        alist = allvoc[start:start + chunk_size]
        wordschunk.append(alist)
        print('The number of words in chunk ' + str(len(wordschunk)) + " is " + str(len(alist)))
    print('There are ' + str(len(wordschunk)) + ' chunks in total')
    return wordschunk

In [179]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check
wordschunk = flatten_and_overlap(cleaneddoc)
for i in wordschunk:
    assert len(i) == 100

The number of words in chunk 1 is 100
The number of words in chunk 2 is 100
The number of words in chunk 3 is 100
The number of words in chunk 4 is 100
The number of words in chunk 5 is 100
The number of words in chunk 6 is 100
The number of words in chunk 7 is 100
The number of words in chunk 8 is 100
The number of words in chunk 9 is 100
The number of words in chunk 10 is 100
The number of words in chunk 11 is 100
The number of words in chunk 12 is 100
The number of words in chunk 13 is 100
The number of words in chunk 14 is 100
The number of words in chunk 15 is 100
The number of words in chunk 16 is 100
The number of words in chunk 17 is 100
The number of words in chunk 18 is 100
The number of words in chunk 19 is 100
The number of words in chunk 20 is 100
The number of words in chunk 21 is 100
The number of words in chunk 22 is 100
The number of words in chunk 23 is 100
The number of words in chunk 24 is 100
The number of words in chunk 25 is 100
The number of words in chunk 26 is
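
The overlap logic is easier to check by hand on a very small list. The sketch below uses hypothetical names (`overlapping_chunks`, `toy_words`) and tiny sizes (chunks of 4 words with an overlap of 1) instead of 100 and 25; like the asserted check in step 16, it assumes that only full chunks are kept.

```python
def overlapping_chunks(words, chunk_size, overlap):
    """Split a list of words into full chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    step = chunk_size - overlap
    return [words[start:start + chunk_size]
            for start in range(0, len(words) - chunk_size + 1, step)]


# integers stand in for words so that the positions are easy to read
toy_words = list(range(10))
toy_result = overlapping_chunks(toy_words, chunk_size=4, overlap=1)
for chunk in toy_result:
    print(chunk)
```

Each chunk starts `chunk_size - overlap` positions after the previous one, so the last word of one chunk is the first word of the next.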

In [180]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?
print(wordschunk[0])
#Yes

['abstract', 'mind', 'wandering', 'defined', 'shifts', 'attention', 'task', 'related', 'processing', 'task', 'unrelated', 'thoughts', 'ubiquitous', 'phenomenon', 'negative', 'influence', 'performance', 'productivity', 'many', 'contexts', 'including', 'learning', 'propose', 'next', 'generation', 'learning', 'technologies', 'mechanism', 'detect', 'respond', 'mind', 'wandering', 'real', 'time', 'towards', 'end', 'developed', 'technology', 'automatically', 'detects', 'mind', 'wandering', 'eye', 'gaze', 'learning', 'instructional', 'texts', 'mind', 'wandering', 'detected', 'technology', 'intervenes', 'posing', 'time', 'questions', 'encouraging', 'reading', 'needed', 'multiple', 'rounds', 'iterative', 'refinement', 'summatively', 'compared', 'technology', 'yoked', 'control', 'experiment', 'participants', 'key', 'dependent', 'variable', 'performance', 'post', 'reading', 'comprehension', 'assessment', 'results', 'suggest', 'technology', 'successful', 'correcting', 'comprehension', 'deficits', 

In [None]:
# 18) how many chunks did Sherin have? What does a chunk become 
# in the next step of our topic modeling algorithm? 

#Sherin has 794 segments of text
#Each segment becomes a vector

In [181]:
# 19) what are some other preprocessing steps we could do 
# to improve the quality of the text data? Mention at least 2.

# 1. In Sherin's paper, the interviewers' questions are removed from the text because they are not of interest to the research.
#    Similarly, we could remove the parts of each paper that are not of interest to our analysis.
# 2. Section titles such as "abstract", "method", or "conclusion" can mislead the analysis, because these words are
#    consistent across all papers and do not reflect what the authors want to express. Leaving those titles in the text data
#    may bias the direction of our vectors.

In [None]:
# 20) in your own words, describe the next steps of the 
# data modeling algorithms (listed below):

# Vector and Matrix operations: transform each passage into a vector of word counts
# Weight word frequency: dampen the raw counts with a weighting function such as a log transformation (1 + log(count)), so that very frequent words do not dominate the comparison
# Matrix normalization: rescale every row vector of the matrix to a length of 1
# Deviation Vectors: subtract the average normalized vector from each vector, then normalize the result to obtain the deviation vector
# Clustering: using hierarchical agglomerative clustering, we repeatedly merge clusters, reducing the total number of clusters by one at each step until a single cluster remains
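
The weighting, normalization, and deviation steps described above can be tried out on a tiny count matrix. This is a rough sketch rather than Sherin's exact implementation: the function names (`dampen`, `normalize_rows`, `deviation_vectors`) and the toy `counts` are hypothetical, plain Python lists are used instead of numpy, and a count of 0 is assumed to stay 0 because the logarithm of 0 is undefined.

```python
import math


def dampen(matrix):
    """Weight each count as 1 + log(count); zero counts stay at zero (assumption)."""
    return [[1 + math.log(count) if count > 0 else 0.0 for count in row]
            for row in matrix]


def normalize_rows(matrix):
    """Rescale every row so that its Euclidean length is 1."""
    normalized = []
    for row in matrix:
        length = math.sqrt(sum(value * value for value in row))
        # an all-zero row cannot be rescaled, so it is left unchanged
        normalized.append([value / length for value in row] if length > 0 else list(row))
    return normalized


def deviation_vectors(matrix):
    """Subtract the average row from every row, then normalize the result."""
    mean = [sum(column) / len(matrix) for column in zip(*matrix)]
    centered = [[value - m for value, m in zip(row, mean)] for row in matrix]
    return normalize_rows(centered)


# toy count matrix: 2 chunks (rows) by 4 vocabulary words (columns)
counts = [[0, 0, 2, 1], [1, 1, 1, 0]]
weighted = dampen(counts)
unit = normalize_rows(weighted)
deviations = deviation_vectors(unit)
for row in deviations:
    print([round(value, 3) for value in row])
```

With only two chunks, the two deviation vectors point in exactly opposite directions, which is a quick way to check that the average was subtracted correctly.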


## Step 5 - Vector and Matrix operations

## Step 6 - Weight word frequency

## Step 7 - Matrix normalization

## Step 8 - Deviation Vectors

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [None]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of overlapping 100-word fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization