# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we've been working on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remopve stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, with a 25-words overlap
5. create a word count matrix, where each chunk of a row and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

In [1]:
# in python code, our goal is to recreate the steps above as functions
# so that we can just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this functions takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

## Step 1 - Data Retrieval

In [2]:
# 1) using glob, find all the text files in the "Papers" folder
# Hint: refer to last week's notebook
import glob

txt_files = glob.glob('Papers/*.txt')
print(txt_files)

['Papers/paper12.txt', 'Papers/paper5.txt', 'Papers/paper4.txt', 'Papers/paper13.txt', 'Papers/paper11.txt', 'Papers/paper6.txt', 'Papers/paper7.txt', 'Papers/paper10.txt', 'Papers/paper14.txt', 'Papers/paper3.txt', 'Papers/paper2.txt', 'Papers/paper15.txt', 'Papers/paper0.txt', 'Papers/paper1.txt', 'Papers/paper16.txt', 'Papers/paper9.txt', 'Papers/paper8.txt']


In [3]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

for txt_file in txt_files:
    with open(txt_file) as filename: 
        contents = filename.read()
        documents.append(contents)

In [4]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)

print(documents[0][:1000])

103

epistemic network analysis and topic modeling for chat
data from collaborative learning environment
zhiqiang cai

brendan eagan

nia m. dowell

the university of memphis
365 innovation drive, suite 410
memphis, tn, usa

university of wisconsin-madison
1025 west johnson street
madison, wi, usa

the university of memphis
365 innovation drive, suite 410
memphis, tn, usa

zcai@memphis.edu

eaganb@gmail.com

niadowell@gmail.com

james w. pennebaker

david w. shaffer

arthur c. graesser

university of texas-austin
116 inner campus dr stop g6000
austin, tx, usa

university of wisconsin-madison
1025 west johnson street
madison, wi, usa

the university of memphis
365 innovation drive, suite 403
memphis, tn, usa

pennebaker@utexas.edu

dws@education.wisc.edu

art.graesser@gmail.com

abstract
this study investigates a possible way to analyze chat data from
collaborative learning environments using epistemic network
analysis and topic modeling. a 300-topic general topic model
built from tasa

## Step 2 - Data Cleaning

In [5]:
# 4) only select the text that's between the first occurence of the 
# the word "abstract" and the last occurence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

def abbreviate(documents):
    abbreviated_texts = []
    for document in documents:
        term1 = document.find('abstract')
        term2 = document.rfind('reference')
        substring = document[term1 + len('abstract'):term2]
        abbreviated_texts.append(substring)
    return abbreviated_texts

documents = abbreviate(documents)
print(documents[0][:1000])


this study investigates a possible way to analyze chat data from
collaborative learning environments using epistemic network
analysis and topic modeling. a 300-topic general topic model
built from tasa (touchstone applied science associates) corpus was used in this study. 300 topic scores for each of the 15,670
utterances in our chat data were computed. seven relevant topics
were selected based on the total document scores. while the aggregated topic scores had some power in predicting students’
learning, using epistemic network analysis enables assessing the
data from a different angle. the results showed that the topic
score based epistemic networks between low gain students and
high gain students were significantly different (𝑡 = 2.00). overall,
the results suggest these two analytical approaches provide complementary information and afford new insights into the processes
related to successful collaborative interactions.

keywords
chat; collaborative learning; topic modeling; epist

In [6]:
# 5) replace carriage returns (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

def strip_new_lines(documents):
    remove_returns = []
    for document in documents:
        substring = document.replace('\n', ' ')
        remove_returns.append(substring)
    return remove_returns 

documents = strip_new_lines(documents)
print(documents[0][:1000])


 this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling. a 300-topic general topic model built from tasa (touchstone applied science associates) corpus was used in this study. 300 topic scores for each of the 15,670 utterances in our chat data were computed. seven relevant topics were selected based on the total document scores. while the aggregated topic scores had some power in predicting students’ learning, using epistemic network analysis enables assessing the data from a different angle. the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different (𝑡 = 2.00). overall, the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions.  keywords chat; collaborative learning; topic modeling; epist

In [7]:
# 6) replace the punctation below by a white space
# check that the result looks okay 
# (e.g., by print the first 1000 characters of the 1st doc)

punctuation_marks = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

def strip_punctuation(documents):
    clean_punctuations = []
    for document in documents:
        for punctuation_mark in punctuation_marks:
            document = document.replace(punctuation_mark, " ")
        clean_punctuations.append(document)
    return clean_punctuations
    
documents = strip_punctuation(documents)
print(documents[0][:1000])

 this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a 300 topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study  300 topic scores for each of the 15 670 utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡   2 00   overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epist

In [8]:
# 7) remove numbers by either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing

def strip_numbers(documents):
    clean_numbers = []
    for document in documents:
        for char in document:
            if char.isdigit():
                document = document.replace(char, " ")
        clean_numbers.append(document)
    return clean_numbers

documents = strip_numbers(documents)
print(documents[0][:1000])

 this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a     topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study      topic scores for each of the        utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡          overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epist

In [10]:
# 8) Remove the stop words below from our documents
# print the first 1000 characters of the first document

def strip_stopwords(documents):
    stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']
    clean_stopwords = []
    for document in documents:
        words = document.split()
        resultwords  = [word for word in words if word.lower() not in stop_words]
        result = ' '.join(resultwords)
        clean_stopwords.append(result)
    return clean_stopwords

documents = strip_stopwords(documents)
print(documents[0][:1000])

study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different 𝑡 overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate tasks acc

In [11]:
# 9) remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document
import re

def strip_short_words(documents):
    clean_shortwords = []
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    for document in documents:
        document = shortword.sub('', document)
        clean_shortwords.append(document)
    return clean_shortwords

documents = strip_short_words(documents)
print(documents[0][:1000])

study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate tasks accom


### Putting it all together

In [12]:
# 10) package all of your work above into a function that cleans a given document

def clean_list_of_documents(documents):
    cleaned_docs = []
    
    # only looks at text between abstract and reference
    documents = abbreviate(documents)
    
    # replaces new lines with a space
    documents = strip_new_lines(documents)
    
    # replaces punctuation with a space
    documents = strip_punctuation(documents)

    # replaces numbers with a space
    documents = strip_numbers(documents)
   
    # removes stop words
    documents = strip_stopwords(documents)
  
    # removes words less than length 3
    documents = strip_short_words(documents)
    
    cleaned_docs = documents

    return cleaned_docs

In [13]:
# 11a) reimport your raw data using the code in 2)
documents = []

for txt_file in txt_files:
    with open(txt_file) as filename: 
        contents = filename.read()
        documents.append(contents)

        
# 11b) clean your files using the function above
clean = clean_list_of_documents(documents)

# 11c) print the first 1000 characters of the first document
print(clean[0][:1000])

study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate tasks accom

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [14]:
import math
import numpy as np

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): 

In [15]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(documents):
    voc = []
    for document in documents:
        words = document.split()
        for word in words:
            if word not in voc:
                voc.append(word)
    voc.sort()
    return voc

# Then print the length of your vocabulary (it should be 
# around 5500 words)

#cleaned
print(len(get_vocabulary(documents)))
#uncleaned
print(len(get_vocabulary(clean)))


15947
5649


In [None]:
# 14) what was the size of Sherin's vocabulary? 

# full vocabulary was 1429 words, pruned list was 647 after removing stop words

## Step 4 - transform your documents into 100-words chunks

In [44]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-words chunk 
# (with a 25 words overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap size
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!

all_documents = ''.join(clean)

def chunk(list_docs, chunk_size, overlap):
    words = all_documents.split()
    chunks = []
    for word in range(0, len(words), overlap):
        chunks.append(words[word:word + chunk_size])
    return chunks
    
chunks = chunk(all_documents, 100, 25)

In [45]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check

for chunk in chunks:
     assert (len(chunk) == 100),"length is not equal to 100, length is " + str(len(chunk))
#     if len(chunk) != 100:
#         print("length is not equal to 100, length is " + str(len(chunk)))

AssertionError: length is not equal to 100, length is 86

In [61]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?
print(len(chunks))
print(chunks[0])
print(all_documents[:811])

2229
['study', 'investigates', 'possible', 'way', 'analyze', 'chat', 'data', 'collaborative', 'learning', 'environments', 'using', 'epistemic', 'network', 'analysis', 'topic', 'modeling', 'topic', 'general', 'topic', 'model', 'built', 'tasa', 'touchstone', 'applied', 'science', 'associates', 'corpus', 'used', 'study', 'topic', 'scores', 'utterances', 'chat', 'data', 'computed', 'seven', 'relevant', 'topics', 'selected', 'based', 'total', 'document', 'scores', 'aggregated', 'topic', 'scores', 'power', 'predicting', 'students', 'learning', 'using', 'epistemic', 'network', 'analysis', 'enables', 'assessing', 'data', 'different', 'angle', 'results', 'showed', 'topic', 'score', 'based', 'epistemic', 'networks', 'low', 'gain', 'students', 'high', 'gain', 'students', 'significantly', 'different', 'overall', 'results', 'suggest', 'two', 'analytical', 'approaches', 'provide', 'complementary', 'information', 'afford', 'new', 'insights', 'processes', 'related', 'successful', 'collaborative', 'int

In [None]:
# 18) how many chunks did Sherin have? What does a chunk become 
# in the next step of our topic modeling algorithm? 

# 794 chunks, which later become vectors

In [None]:
# 19) what are some other preprocessing steps we could do 
# to improve the quality of the text data? Mention at least 2.

# Using https://pypi.org/project/stemming/1.0/ to group words of the same root
# Remove anything that is not strictly alphabetic 
# (there are still a couple of math symbols that were not included in punctuation stripping)

In [None]:
# 20) in your own words, describe the next steps of the 
# data modeling algorithms (listed below):


## Step 5 - Vector and Matrix operations

- Sherin says "At this stage of the analysis, each of the 794 segments
has been mapped to a vector that is understood to represent the meaning
of that segment. The next step is to identify common meanings among these
segments. To do that, I look for natural clusterings of the 794 vectors."

- In other words, map segments to vectors, where the vector represents the 
meaning of the segment

## Step 6 - Weight word frequency

- Sherin says "In most vector space analyses, the raw counts
are modified by a weighting function. In the analyses reported in this article,
each count is replaced with (1 + log[count]). This has the effect of dampening
the impact of very frequent words. (Raw counts of zero are just left as zero.)"

- In other words, scaling raw counts by way of a weighting function for word 
frequencies to account for the most frequent words 

## Step 7 - Matrix normalization

- Sherin says "Finally, all of the vectors used in the present work are normalized. This means
that every entry in the vector is divided by a constant (the length of the vector)
so that the resulting vector has a length of 1."

- In other words, making the length of the vectors the same so comparisons are easier across 
all vectors

## Step 8 - Deviation Vectors

- Sherin says "For that purpose, I compute what I call deviation vectors. To compute the deviation
vectors for two vectors V1 and V2, I first find their average and then break
each vector into two components, one that lies along the average and another that
is perpendicular to the average (refer to Figure 4). The perpendicular components,
V1’ and V2’, are the deviation vectors. If we use these deviation vectors in place
of the original vectors, the result is that V1 and V2 have each been replaced by the
component that defines its unique piece—a piece that characterizes how it differs
from the average."

- In other words, this step is sort of like calculating standard deviation. We are only concerned
with how far each vector is from the average as opposed to the actual vectors being compared

## Step 9 - Clustering

- Sherin says "To cluster the transcript vectors, I employed the very general technique called
hierarchical agglomerative clustering. In hierarchical agglomerative clustering,
the analysis begins with each of the items in its own cluster. Thus, we begin with
a number of clusters equal to the total number of items. Then two of those clusters
are combined into a single cluster containing two items, thus reducing the total
number of clusters by one. The process then iterates: Two clusters are combined
and the total number of clusters is decreased by one. This repeats until all of the
items are combined into a single cluster. The result is a list of candidate clusterings
of the data, with each candidate corresponding to one of the intermediate steps in
this process...The results I describe were obtained using a technique
called centroid clustering. At each step in the iteration, I first find the centroid
of each cluster (the average of all of the vectors currently in the cluster). Then
I find the pair of centroids that are closest to each other and merge the associated
clusters. An explanation of centroid clustering, including its application to
LSA-produced vectors, can be found in Manning, Raghavan, and Schütze (2008)."

- In other words, there are two methods for clustering "hierarchical agglomerative clustering"
and "centroid clustering". Centroid clustering was used in Sherin's example because it was deterministic,
"hierarchical agglomerative clustering" involves some heuristic for combining clusters so that it is done
systematically.

## Step 10 - Visualizing the results

- Create visuals to help others understand the data and findings


## Final Step - Putting it all together: 

In [None]:
# in python code, our goal is to recreate the steps above as functions
# so that we can just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this functions takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization