# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next few weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remove stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, with a 25-words overlap
5. create a word count matrix, where each row represents a chunk and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

In [None]:
# in python code, our goal is to recreate the steps above as functions
# so that we need just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

## Step 1 - Data Retrieval

In [1]:
# 1) using glob, find all the text files in the "Papers" folder
# Hint: refer to last week's notebook
import glob
files = glob.glob('./Papers/*.txt')

In [2]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

for file in files:
    with open(file,'r', encoding='utf-8') as f:
        text = f.read()
        documents.append(text)

In [3]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
print(documents[0][:1000])

103

epistemic network analysis and topic modeling for chat
data from collaborative learning environment
zhiqiang cai

brendan eagan

nia m. dowell

the university of memphis
365 innovation drive, suite 410
memphis, tn, usa

university of wisconsin-madison
1025 west johnson street
madison, wi, usa

the university of memphis
365 innovation drive, suite 410
memphis, tn, usa

zcai@memphis.edu

eaganb@gmail.com

niadowell@gmail.com

james w. pennebaker

david w. shaffer

arthur c. graesser

university of texas-austin
116 inner campus dr stop g6000
austin, tx, usa

university of wisconsin-madison
1025 west johnson street
madison, wi, usa

the university of memphis
365 innovation drive, suite 403
memphis, tn, usa

pennebaker@utexas.edu

dws@education.wisc.edu

art.graesser@gmail.com

abstract
this study investigates a possible way to analyze chat data from
collaborative learning environments using epistemic network
analysis and topic modeling. a 300-topic general topic model
built from tasa

## Step 2 - Data Cleaning

In [4]:
# 4) only select the text that's between the first occurrence of the 
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

contents = []

for doc in documents:
    start = doc.index('abstract') + len('abstract')
    end = doc.rfind('reference')
    contents.append(doc[start:end])
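
The slicing above assumes that every paper contains both markers: `index` raises a `ValueError` when "abstract" is missing, and `rfind` returns -1 when "reference" is missing, which silently drops the last character. Here is a minimal sketch of a more defensive variant. `extract_body` is a hypothetical helper name (it is not part of the original notebook), and falling back to the document boundaries is just one possible choice.

```python
def extract_body(doc, start_marker='abstract', end_marker='reference'):
    """Return the text between the first start_marker and the last end_marker.

    Hypothetical helper: when a marker is missing, fall back to the
    beginning or the end of the document instead of failing.
    """
    start = doc.find(start_marker)            # -1 if the marker is absent
    start = 0 if start == -1 else start + len(start_marker)

    end = doc.rfind(end_marker)               # -1 if the marker is absent
    if end == -1 or end < start:
        end = len(doc)

    return doc[start:end]


sample = 'title abstract some body text references [1] ...'
print(repr(extract_body(sample)))
print(repr(extract_body('no markers')))
```

Printing the length of each document before and after the slicing remains a useful sanity check either way.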

In [5]:
# 5) replace newlines (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

for i in range(0,len(contents)):
    contents[i] = contents[i].replace('\n',' ')

print(contents[0][:1000])


 this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling. a 300-topic general topic model built from tasa (touchstone applied science associates) corpus was used in this study. 300 topic scores for each of the 15,670 utterances in our chat data were computed. seven relevant topics were selected based on the total document scores. while the aggregated topic scores had some power in predicting students’ learning, using epistemic network analysis enables assessing the data from a different angle. the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different (𝑡 = 2.00). overall, the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions.  keywords chat; collaborative learning; topic modeling; epist

In [6]:
# 6) replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

for i in range(0,len(contents)):
    for p in punctuation:
        contents[i] = contents[i].replace(p,' ')

print(contents[0][:1000])


 this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a 300 topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study  300 topic scores for each of the 15 670 utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡   2 00   overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epist
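
The nested loop above calls `replace` once per punctuation mark and per document. As an alternative, here is a small sketch that does the same kind of substitution in a single pass with `str.maketrans` and `str.translate`. The character set and the sample sentence are illustrative assumptions, not the full list used in the cell above; multi-character entries such as `'...'` are already covered by their individual characters.

```python
# characters to blank out (a short, illustrative subset of the list above)
punctuation_chars = '.,;:!?()[]"\'-'

# map every punctuation character to a single space
table = str.maketrans(punctuation_chars, ' ' * len(punctuation_chars))

sample = 'low-gain students (n = 12), high-gain students.'
cleaned = sample.translate(table)

print(cleaned)
print(cleaned.split())
```

Because every character is swapped for exactly one space, the cleaned string keeps the same length as the original, which makes before/after comparisons easy.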

In [7]:
# 7) replace numbers with either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing

numbers = ['1', '2', '3', '4', '5', '6','7', '8', '9', '0' ]

for i in range(0,len(contents)):
    for n in numbers:
        contents[i] = contents[i].replace(n,' ')

print(contents[0][:1000])


 this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a     topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study      topic scores for each of the        utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡          overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epist

In [8]:
# 8) Remove the stop words below from our documents
# print the first 1000 characters of the first document
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']

for i in range(0,len(stop_words)):
    stop_words[i] = ' ' + stop_words[i] + ' '

for i in range(0,len(contents)):
    for s in stop_words:
        contents[i] = contents[i].replace(s,' ')

print(contents[0][:1000])

 study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling      topic general topic model built tasa  touchstone applied science associates  corpus used study      topic scores        utterances chat data computed  seven relevant topics selected based total document scores  aggregated topic scores power predicting students  learning  using epistemic network analysis enables assessing data different angle  results showed topic score based epistemic networks low gain students high gain students significantly different  𝑡          overall  results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epistemic network analysis     introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synch
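
Padding each stop word with spaces works on this corpus, but substring replacement can miss a stop word that sits at the very start or end of a string. The sketch below compares it with a token-based filter on a made-up sentence. The short stop word list and the sample text are assumptions chosen for illustration only.

```python
stop_words_sample = ['the', 'of', 'a', 'and']   # illustrative subset

text = 'the the cat and a dog of the house'

# substring approach, as in the cell above (stop words padded with spaces)
by_replace = text
for s in stop_words_sample:
    by_replace = by_replace.replace(' ' + s + ' ', ' ')

# token approach: split into words and keep those that are not stop words
stop_set = set(stop_words_sample)
by_tokens = ' '.join(w for w in text.split() if w not in stop_set)

print(by_replace)   # the leading 'the' has no space in front of it
print(by_tokens)
```

The token-based version also collapses runs of white space, since `split()` discards them.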

In [9]:
# 9) remove words with one or two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

import re

shortword = re.compile(r'\W*\b\w{1,2}\b')

for i in range(0,len(contents)):
    contents[i] = shortword.sub('', contents[i])

print(contents[0][:1000])


 study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling      topic general topic model built tasa  touchstone applied science associates  corpus used study      topic scores        utterances chat data computed  seven relevant topics selected based total document scores  aggregated topic scores power predicting students  learning  using epistemic network analysis enables assessing data different angle  results showed topic score based epistemic networks low gain students high gain students significantly different          overall  results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epistemic network analysis     introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchron
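
To see what the regular expression in the cell above does, it can help to run it on a tiny string. In the pattern, `\b\w{1,2}\b` matches a whole word of one or two characters, and the leading `\W*` also consumes the non-word characters (such as spaces) right before it. The sample sentence below is made up for illustration.

```python
import re

# same pattern as in the cell above
shortword = re.compile(r'\W*\b\w{1,2}\b')

sample = 'an analysis of it'
result = shortword.sub('', sample)

print(repr(result))
```

Note that `\w` also matches digits and underscores, so short numbers would be removed by the same pattern.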


### Putting it all together

In [10]:
# 10) package all of your work above into a function that cleans a list of documents

def clean_list_of_documents(documents):
    
    ### your code ###
    
    import re
    
    cleaned_docs = []
    remove = ['\n',
              '.', '...', '!', '#', '"', '%', '$', "'", '&', ')','(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
              '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',', '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’',
              '1', '2', '3', '4', '5', '6','7', '8', '9', '0',
              ' i ', ' me ', ' my ', ' myself ', ' we ', ' our ', ' ours ', 
              ' ourselves ', ' you ', ' your ', ' yours ', ' yourself ', 
              ' yourselves ', ' he ', ' him ', ' his ', ' himself ', ' she ', 
              ' her ', ' hers ', ' herself ', ' it ', ' its ', ' itself ', 
              ' they ', ' them ', ' their ', ' theirs ', ' themselves ', 
              ' what ', ' which ', ' who ', ' whom ', ' this ', ' that ', 
              ' these ', ' those ', ' am ', ' is ', ' are ', ' was ', ' were ', 
              ' be ', ' been ', ' being ', ' have ', ' has ', ' had ', ' having ', 
              ' do ', ' does ', ' did ', ' doing ', ' a ', ' an ', ' the ', ' and ', 
              ' but ', ' if ', ' or ', ' because ', ' as ', ' until ', ' while ', 
              ' of ', ' at ', ' by ', ' for ', ' with ', ' about ', ' against ', 
              ' between ', ' into ', ' through ', ' during ', ' before ', 
              ' after ', ' above ', ' below ', ' to ', ' from ', ' up ', ' down ', 
              ' in ', ' out ', ' on ', ' off ', ' over ', ' under ', ' again ', 
              ' further ', ' then ', ' once ', ' here ', ' there ', ' when ', 
              ' where ', ' why ', ' how ', ' all ', ' any ', ' both ', ' each ', 
              ' few ', ' more ', ' most ', ' other ', ' some ', ' such ', ' no ', 
              ' nor ', ' not ', ' only ', ' own ', ' same ', ' so ', ' than ', 
              ' too ', ' very ', ' s ', ' t ', ' can ', ' will ', 
              ' just ', ' don ', ' should ', ' now ']
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    
    for doc in documents:
        start = doc.index('abstract') + len('abstract')
        end = doc.rfind('reference')
        cleaned_docs.append(doc[start:end])
    
    for i in range(0,len(cleaned_docs)):
        for r in remove:
            cleaned_docs[i] = cleaned_docs[i].replace(r,' ')
    
        cleaned_docs[i] = shortword.sub('', cleaned_docs[i])
    
    return cleaned_docs

In [11]:
# 11a) reimport your raw data using the code in 2)
documents = []

for file in files:
    with open(file,'r', encoding='utf-8') as f:
        text = f.read()
        documents.append(text)
        
# 11b) clean your files using the function above

cleaned_documents = clean_list_of_documents(documents)

# 11c) print the first 1000 characters of the first document

print(cleaned_documents[0][:1000])

 study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling      topic general topic model built tasa  touchstone applied science associates  corpus used study      topic scores        utterances chat data computed  seven relevant topics selected based total document scores  aggregated topic scores power predicting students  learning  using epistemic network analysis enables assessing data different angle  results showed topic score based epistemic networks low gain students high gain students significantly different          overall  results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epistemic network analysis     introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchron

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [12]:
import math
import numpy as np

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): 

We need the vocabulary as a reference so that we can convert any given text into a vector (i.e., the vocabulary forms the basis of our vector space).
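
As a rough illustration of that idea, the sketch below turns two made-up texts into count vectors, with one column per vocabulary word. The helper name `to_count_vector` and the toy documents are assumptions for this example and do not come from Sherin's paper.

```python
docs = ['topic model topic', 'chat data model']

# sorted vocabulary: one column per unique word
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}


def to_count_vector(text):
    """Count how often each vocabulary word occurs in text."""
    vector = [0] * len(vocab)
    for word in text.split():
        if word in index:            # words outside the vocabulary are ignored
            vector[index[word]] += 1
    return vector


matrix = [to_count_vector(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Sorting the vocabulary keeps the column order stable from one run to the next, which is why the exercise below asks for an alphabetical sort.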

In [13]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(documents):
    voc = set()
    
    ### your code ###
    # collect every unique word across all documents
    for doc in documents:
        voc.update(doc.split())
    
    # sort alphabetically so that the column order is reproducible
    return sorted(voc)

# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(cleaned_documents)
print(len(vocabulary))

5660


In [14]:
# 14) what was the size of Sherin's vocabulary? 
print('Sherin\'s vocabulary consists of 647 words.')

Sherin's vocabulary consists of 647 words.


## Step 4 - transform your documents into 100-words chunks

In [15]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-words chunk 
# (with a 25 words overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap size
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!

def chunks(documents, chunk_size, overlap_size):
    list_of_chunk = []
    
    # join with a space so that words at document boundaries don't fuse
    giant_string = ' '.join(documents)
    list_of_words = giant_string.split()
    
    # note: the window advances by overlap_size words, so consecutive
    # chunks share (chunk_size - overlap_size) words
    for i in range(0, len(list_of_words), overlap_size):
        # only keep full chunks of chunk_size words
        if i + chunk_size <= len(list_of_words):
            # each chunk ends with a trailing space
            short_string = ' '.join(list_of_words[i:i + chunk_size]) + ' '
            list_of_chunk.append(short_string)
            
    return list_of_chunk

word_chunks = chunks(cleaned_documents,100,25)
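
When working with overlapping chunks, it is worth being explicit about the difference between the step of the sliding window and the overlap between neighbors: the number of shared words is the window size minus the step. The sketch below shows this on a toy list. `sliding_windows` is a hypothetical helper, and which of the two quantities should be 25 is something to check against Sherin's paper.

```python
def sliding_windows(words, window_size, step):
    """Hypothetical helper: full windows of window_size words, advancing by step."""
    last_start = len(words) - window_size
    return [words[i:i + window_size] for i in range(0, last_start + 1, step)]


words = [str(n) for n in range(10)]

for step in (1, 3):
    windows = sliding_windows(words, window_size=4, step=step)
    # number of words shared by the first two windows
    shared = len(set(windows[0]) & set(windows[1]))
    print('step:', step, 'windows:', len(windows), 'shared words:', shared)
```

With 100-word chunks, a step of 25 gives neighbors that share 75 words, while a step of 75 gives neighbors that share 25 words.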

In [16]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check

for chunk in word_chunks:
    assert len(chunk.split()) == 100
 

In [17]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?

original = ''

for i in range(0,100):
    original = original + cleaned_documents[0].split()[i] + ' '

print('---first chunk---\n'+ word_chunks[0] + '\n')
print('---original---\n'+ original)

---first chunk---
study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis 

---original---
study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic

In [18]:
# 18) how many chunks did Sherin have? What does a chunk become 
# in the next step of our topic modeling algorithm? 
print('Sherin had 794 segments of text.')
print('A chunk will be converted to a vector in the next step of the topic modeling algorithm.')

Sherin had 794 segments of text.
A chunk will be converted to a vector in the next step of the topic modeling algorithm.


In [19]:
# 19) what are some other preprocessing steps we could do 
# to improve the quality of the text data? Mention at least 2.
print('1) We could reduce our vocabulary by selecting high-frequency words or meaningful words based on the purpose of the study.')
print('2) We could reduce all the words to their base form so that words with the same or similar meaning are counted together.')

1) We could reduce our vocabulary by selecting high-frequency words or meaningful words based on the purpose of the study.
2) We could reduce all the words to their base form so that words with the same or similar meaning are counted together.


In [20]:
# 20) in your own words, describe the next steps of the 
# data modeling algorithms (listed below):
print('Step 5 - This step creates commonly used vector and matrix operations, such as the dot product, so that we can manipulate and analyze our text after it has been converted into vectors.')
print('Step 6 - This step dampens the raw frequency counts with the function (1+log(count)) so that the weight of high-frequency words is brought down.')
print('Step 7 - This step rescales every vector to unit length so that the vectors are standardized for analysis.')
print('Step 8 - This step calculates the deviation vector for each vector, which is a measure of how dissimilar a given text is from the average text.')
print('Step 9 - This step aims to group together the vectors (i.e., texts) that are similar in meaning.')
print('Step 10 - This step creates visualization tools, such as bar graphs, so that the human analyst can get a better overview and understanding of the results.')

Step 5 - This step creates commonly used vector and matrix operations, such as the dot product, so that we can manipulate and analyze our text after it has been converted into vectors.
Step 6 - This step dampens the raw frequency counts with the function (1+log(count)) so that the weight of high-frequency words is brought down.
Step 7 - This step rescales every vector to unit length so that the vectors are standardized for analysis.
Step 8 - This step calculates the deviation vector for each vector, which is a measure of how dissimilar a given text is from the average text.
Step 9 - This step aims to group together the vectors (i.e., texts) that are similar in meaning.
Step 10 - This step creates visualization tools, such as bar graphs, so that the human analyst can get a better overview and understanding of the results.
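
To make steps 6 to 8 more concrete, here is a rough sketch on a tiny, made-up count matrix, written in plain Python. Two points are assumptions rather than facts taken from the paper: the logarithm is the natural log (`math.log`), and the deviation vector is computed by removing from each row its component along the average direction and rescaling the rest to unit length. Simply subtracting the mean vector would be another possible reading, so check the definition in Sherin's paper before relying on this.

```python
import math

# toy word count matrix: 3 chunks (rows) x 3 words (columns)
counts = [[2, 0, 1],
          [0, 3, 1],
          [1, 1, 0]]

# step 6: dampen non-zero counts with 1 + log(count); zeros stay zero
weighted = [[1 + math.log(c) if c > 0 else 0.0 for c in row] for row in counts]


def unit(vector):
    """Rescale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length > 0 else list(vector)


# step 7: scale every row to unit length
normalized = [unit(row) for row in weighted]

# step 8 (assumed definition): average direction of all rows
n_rows, n_cols = len(normalized), len(normalized[0])
average = unit([sum(row[j] for row in normalized) / n_rows for j in range(n_cols)])


def deviation(row):
    """Part of row that is orthogonal to the average direction, rescaled."""
    projection = sum(r * a for r, a in zip(row, average))
    return unit([r - projection * a for r, a in zip(row, average)])


deviations = [deviation(row) for row in normalized]

for row in deviations:
    print([round(x, 3) for x in row])
```

Under this definition, every deviation vector is orthogonal to the average vector, so what remains is only what distinguishes a chunk from the "typical" chunk.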


## Step 5 - Vector and Matrix operations

## Step 6 - Weight word frequency

## Step 7 - Matrix normalization

## Step 8 - Deviation Vectors

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [None]:
# in python code, our goal is to recreate the steps above as functions
# so that we need just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization