# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we've been working on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remopve stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, with a 25-words overlap
5. create a word count matrix, where each chunk of a row and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

In [81]:
# in python code, our goal is to recreate the steps above as functions
# so that we can just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this functions takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

## Step 1 - Data Retrieval

In [85]:
# 1) using glob, find all the text files in the "Papers" folder
# Hint: refer to last week's notebook

import glob 

for filename in glob.glob(r'C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\*.txt'):
    print(filename)


C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper0.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper1.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper10.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper11.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper12.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper13.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper14.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper15.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper16.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper2.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\paper3.txt
C:\Users\Jazib Zahir\Documents\GitHub\S435-Week

In [86]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

for paper in glob.glob(r'C:\Users\Jazib Zahir\Documents\GitHub\S435-Week5\week5-vsm-1-jzahir\Papers\*.txt'):
    f = open(filename, "r", encoding= "utf-8")
    x = f.readlines()
    documents.append(x)
    
len(documents)


17

In [87]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)

print(documents[0][:1000])

['79\n', '\n', '\x0cpredicting short- and long-term vocabulary learning via\n', 'semantic features of partial word knowledge\n', 'sungjin nam\n', '\n', 'school of information\n', 'university of michigan\n', 'ann arbor, mi 48109\n', '\n', 'sjnam@umich.edu\n', '\n', 'gwen frishkoff\n', '\n', 'kevyn collins-thompson\n', '\n', 'gfrishkoff@gmail.com\n', '\n', 'kevynct@umich.edu\n', '\n', 'department of psychology\n', 'university of oregon\n', 'eugene, or 97403\n', '\n', 'abstract\n', '\n', 'we show how the novel use of a semantic representation\n', 'based on osgood’s semantic differential scales can lead to\n', 'effective features in predicting short- and long-term learning\n', 'in students using a vocabulary learning system. previous\n', 'studies in students’ intermediate knowledge states during\n', 'vocabulary acquisition did not provide much information\n', 'on which semantic knowledge students gained during word\n', 'learning practice. moreover, these studies relied on human\n', 'rating

## Step 2 - Data Cleaning

In [88]:
# 4) only select the text that's between the first occurence of the 
# the word "abstract" and the last occurence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

clean_documents = []

for document in documents:
 a = (str(document).index("abstract"))
 z = (str(document).rfind("reference"))
 clean_documents.append(document[a:z])
 
print(clean_documents[0])


['semantic score-based features\n', '\n', 'we now describe the semantic features tested in our\n', 'prediction models.\n', '\n', '3.2.1\n', '\n', 'semantic scales\n', '\n', 'for this study, we used semantic scales from osgood’s study\n', '[16]. ten scales were selected by a cognitive psychologist as\n', 'being considered semantic attributes that can be detected\n', 'during word learning (figure 2). each semantic scale\n', 'consists of pairs of semantic attributes. for example, the\n', 'bad–good scale can show how the meaning of a word can\n', 'be projected on a scale with bad and good located at either\n', '\n', 'basic semantic distance scores\n', '\n', 'to extract meaningful semantic information, we have\n', 'applied the following measures that can be used to explain\n', 'various characteristics of student responses for different\n', 'target words. in this study, we used a pre-trained model\n', 'for word2vec,1 built based on the google news corpus\n', '(100 billion tokens with 3 milli

In [89]:
# 5) replace carriage returns (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:


import re

spaced_documents = []

for document in clean_documents:
    spaced_document = (str(document)).replace(r'\n', ' ')
    spaced_documents.append(spaced_document)

spaced_documents[0]



"['semantic score-based features ', ' ', 'we now describe the semantic features tested in our ', 'prediction models. ', ' ', '3.2.1 ', ' ', 'semantic scales ', ' ', 'for this study, we used semantic scales from osgood’s study ', '[16]. ten scales were selected by a cognitive psychologist as ', 'being considered semantic attributes that can be detected ', 'during word learning (figure 2). each semantic scale ', 'consists of pairs of semantic attributes. for example, the ', 'bad–good scale can show how the meaning of a word can ', 'be projected on a scale with bad and good located at either ', ' ', 'basic semantic distance scores ', ' ', 'to extract meaningful semantic information, we have ', 'applied the following measures that can be used to explain ', 'various characteristics of student responses for different ', 'target words. in this study, we used a pre-trained model ', 'for word2vec,1 built based on the google news corpus ', '(100 billion tokens with 3 million unique vocabularies,

In [90]:
# 6) replace the punctation below by a white space
# check that the result looks okay 
# (e.g., by print the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

document_no_punctuation = []

for document in spaced_documents:
    for character in punctuation:
        document.replace(character,' ')
    document_no_punctuation.append(document)
        
print(document_no_punctuation[0][:1000])

['semantic score-based features ', ' ', 'we now describe the semantic features tested in our ', 'prediction models. ', ' ', '3.2.1 ', ' ', 'semantic scales ', ' ', 'for this study, we used semantic scales from osgood’s study ', '[16]. ten scales were selected by a cognitive psychologist as ', 'being considered semantic attributes that can be detected ', 'during word learning (figure 2). each semantic scale ', 'consists of pairs of semantic attributes. for example, the ', 'bad–good scale can show how the meaning of a word can ', 'be projected on a scale with bad and good located at either ', ' ', 'basic semantic distance scores ', ' ', 'to extract meaningful semantic information, we have ', 'applied the following measures that can be used to explain ', 'various characteristics of student responses for different ', 'target words. in this study, we used a pre-trained model ', 'for word2vec,1 built based on the google news corpus ', '(100 billion tokens with 3 million unique vocabularies, 

In [91]:
# 7) remove numbers by either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing

numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9']


document_no_numbers = []

for document in document_no_punctuation:
    for character in numbers:
        document.replace(character,' ')
    document_no_numbers.append(document)
        
print(document_no_numbers[0][:1000])
    
       
  

['semantic score-based features ', ' ', 'we now describe the semantic features tested in our ', 'prediction models. ', ' ', '3.2.1 ', ' ', 'semantic scales ', ' ', 'for this study, we used semantic scales from osgood’s study ', '[16]. ten scales were selected by a cognitive psychologist as ', 'being considered semantic attributes that can be detected ', 'during word learning (figure 2). each semantic scale ', 'consists of pairs of semantic attributes. for example, the ', 'bad–good scale can show how the meaning of a word can ', 'be projected on a scale with bad and good located at either ', ' ', 'basic semantic distance scores ', ' ', 'to extract meaningful semantic information, we have ', 'applied the following measures that can be used to explain ', 'various characteristics of student responses for different ', 'target words. in this study, we used a pre-trained model ', 'for word2vec,1 built based on the google news corpus ', '(100 billion tokens with 3 million unique vocabularies, 

In [92]:
# 8) Remove the stop words below from our documents
# print the first 1000 characters of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']


document_no_stop_words = []

for document in document_no_numbers:
    for word in stop_words:
        document.replace(word,'')
    document_no_stop_words.append(document)
        
print(document_no_stop_words[0][:1000])



['semantic score-based features ', ' ', 'we now describe the semantic features tested in our ', 'prediction models. ', ' ', '3.2.1 ', ' ', 'semantic scales ', ' ', 'for this study, we used semantic scales from osgood’s study ', '[16]. ten scales were selected by a cognitive psychologist as ', 'being considered semantic attributes that can be detected ', 'during word learning (figure 2). each semantic scale ', 'consists of pairs of semantic attributes. for example, the ', 'bad–good scale can show how the meaning of a word can ', 'be projected on a scale with bad and good located at either ', ' ', 'basic semantic distance scores ', ' ', 'to extract meaningful semantic information, we have ', 'applied the following measures that can be used to explain ', 'various characteristics of student responses for different ', 'target words. in this study, we used a pre-trained model ', 'for word2vec,1 built based on the google news corpus ', '(100 billion tokens with 3 million unique vocabularies, 

In [102]:
# 9) remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

import re

for document in document_no_stop_words:   
     document.join(word for word in document.split() if len(word)>2)

print(document_no_stop_words[0][:1000])

   

['semantic score-based features ', ' ', 'we now describe the semantic features tested in our ', 'prediction models. ', ' ', '3.2.1 ', ' ', 'semantic scales ', ' ', 'for this study, we used semantic scales from osgood’s study ', '[16]. ten scales were selected by a cognitive psychologist as ', 'being considered semantic attributes that can be detected ', 'during word learning (figure 2). each semantic scale ', 'consists of pairs of semantic attributes. for example, the ', 'bad–good scale can show how the meaning of a word can ', 'be projected on a scale with bad and good located at either ', ' ', 'basic semantic distance scores ', ' ', 'to extract meaningful semantic information, we have ', 'applied the following measures that can be used to explain ', 'various characteristics of student responses for different ', 'target words. in this study, we used a pre-trained model ', 'for word2vec,1 built based on the google news corpus ', '(100 billion tokens with 3 million unique vocabularies, 


### Putting it all together

In [13]:
# 10) package all of your work above into a function that cleans a given document

def clean_list_of_documents(documents):
    
    cleaned_docs = []

    ### your code ###
        
    return cleaned_docs

In [14]:
# 11a) reimport your raw data using the code in 2)
documents = []

        
# 11b) clean your files using the function above


# 11c) print the first 1000 characters of the first document


## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [13]:
import math
import numpy as np

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): To see what patterns of words people use to explain scientific phenomenon (which tells us about how these concepts are taught, how they are interpreted and how people ultimately reationalize them)

In [15]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(documents):
    voc = []
    
    ### your code ###
    
    return voc

# Then print the length of your vocabulary (it should be 
# around 5500 words)


In [None]:
# 14) what was the size of Sherin's vocabulary? 
647 words

## Step 4 - transform your documents into 100-words chunks

In [21]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-words chunk 
# (with a 25 words overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap size
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!



In [17]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check


In [19]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?


In [None]:
# 18) how many chunks did Sherin have? What does a chunk become 
# in the next step of our topic modeling algorithm? 

794 chunks of text
Each gets mapped on to a vector of 647 numbers (representing the unique words)


In [23]:
# 19) what are some other preprocessing steps we could do 
# to improve the quality of the text data? Mention at least 2.

1. Find some technique to give significance to the order of words
2. Make a second list of common words (beyond the most obvious ones) to identify 'unique words' even better


In [25]:
# 20) in your own words, describe the next steps of the 
# data modeling algorithms (listed below):

The 794 vectors are clustered based on similarity/deviance. Centroid clustering is used since it is simplest and brings
together vectors with the most in common (geometrically/content-wise). We play around with different clusters numbers and
sizes to get an optimal configuration. 

## Step 5 - Vector and Matrix operations

## Step 6 - Weight word frequency

## Step 7 - Matrix normalization

## Step 8 - Deviation Vectors

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [17]:
# in python code, our goal is to recreate the steps above as functions
# so that we can just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this functions takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization