# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remove stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, each starting 25 words after the previous one
5. create a word count matrix, where each row represents a chunk and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

In [1]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of overlapping 100-word fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

## Step 1 - Data Retrieval

In [2]:
# 1) using glob, find all the text files in the "papers" folder
# Hint: refer to last week's notebook

import glob 
txt_files = glob.glob('papers/*.txt')
# print(txt_files)

In [3]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []

for paper in txt_files:
    # pass the encoding explicitly; the default depends on the platform
    with open(paper, encoding='utf-8') as filename: 
        contents = filename.read()
        documents.append(contents)
        
# print(documents[3])


In [4]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)

print (documents[0][0:1000])

103

epistemic network analysis and topic modeling for chat
data from collaborative learning environment
zhiqiang cai

brendan eagan

nia m. dowell

the university of memphis
365 innovation drive, suite 410
memphis, tn, usa

university of wisconsin-madison
1025 west johnson street
madison, wi, usa

the university of memphis
365 innovation drive, suite 410
memphis, tn, usa

zcai@memphis.edu

eaganb@gmail.com

niadowell@gmail.com

james w. pennebaker

david w. shaffer

arthur c. graesser

university of texas-austin
116 inner campus dr stop g6000
austin, tx, usa

university of wisconsin-madison
1025 west johnson street
madison, wi, usa

the university of memphis
365 innovation drive, suite 403
memphis, tn, usa

pennebaker@utexas.edu

dws@education.wisc.edu

art.graesser@gmail.com

abstract
this study investigates a possible way to analyze chat data from
collaborative learning environments using epistemic network
analysis and topic modeling. a 300-topic general topic model
built from tasa

## Step 2 - Data Cleaning

In [5]:
# 4) only select the text that's between the first occurrence of the 
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

import re

documents_clean = []

for doc in documents:
    abstract = doc.find("abstract")
    ref = doc.rfind("reference")
    # start right after the word "abstract" (8 characters long)
    new_string = doc[abstract+len("abstract"):ref] 
#     print(new_string)
#     print("doc "+ str(len(doc)))
#     print("new_string " + str(len(new_string)))
    documents_clean.append(new_string)




In [6]:
# 5) replace newlines (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:

documents_clean_white = []

for doc in documents_clean:
    doc = doc.replace("\n"," ")
    documents_clean_white.append(doc)
#     print(doc)

documents_clean_white[0][0:1000]

print(len(documents_clean_white))



17


In [7]:
# 6) replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

docs_clean_punct= []

for doc in documents_clean_white:
    for i in punctuation:
        doc = doc.replace(i, " ")
    docs_clean_punct.append(doc)

docs_clean_punct[0][0:1000]


' this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a 300 topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study  300 topic scores for each of the 15 670 utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡   2 00   overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epis

In [8]:
# 7) replace numbers with either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing

docs_clean_num = []

for doc in docs_clean_punct:
    # raw string, so that \w reaches the regex engine unchanged
    doc = re.sub(r'\w*[0-9]\w*', " ", doc)
    docs_clean_num.append(doc)
    
docs_clean_num[0][0:1000]



' this study investigates a possible way to analyze chat data from collaborative learning environments using epistemic network analysis and topic modeling  a   topic general topic model built from tasa  touchstone applied science associates  corpus was used in this study    topic scores for each of the     utterances in our chat data were computed  seven relevant topics were selected based on the total document scores  while the aggregated topic scores had some power in predicting students  learning  using epistemic network analysis enables assessing the data from a different angle  the results showed that the topic score based epistemic networks between low gain students and high gain students were significantly different  𝑡         overall  the results suggest these two analytical approaches provide complementary information and afford new insights into the processes related to successful collaborative interactions   keywords chat  collaborative learning  topic modeling  epistemic ne

In [9]:
# 8) Remove the stop words below from our documents
# print the first 1000 characters of the first document

# https://stackoverflow.com/questions/25346058/removing-list-of-words-from-a-string

stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']

docs_clean_stop = []

for doc in docs_clean_num:
    docwords = doc.split()
    resultwords  = [word for word in docwords if word.lower() not in stop_words]
    result = ' '.join(resultwords)
    docs_clean_stop.append(result)

docs_clean_stop[0][0:1000]

print(len(docs_clean_stop))

17


In [10]:
# 9) remove words with one and two characters (e.g., 'd', 'er', etc.)
# print the first 1000 characters of the first document

docs_clean_short = []
shortword = re.compile(r'\W*\b\w{1,2}\b')

for doc in docs_clean_stop:
    doc = shortword.sub('',doc)
    docs_clean_short.append(doc)
    
docs_clean_short[0][0:1000]
    


'study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate tasks acco


### Putting it all together


In [11]:
# *** VERSION ONE***

# 10) package all of your work above into a function that cleans a given document

#store body of paper from abstract to reference
def paperbody(documents):
    docs_temp = []
    for doc in documents:
        abstract = doc.find("abstract")
        ref = doc.rfind("reference")
        new_string = doc[abstract+8:ref] 
        docs_temp.append(new_string)
#     print(docs_temp[0][0:1000])
    return docs_temp

# whitespace
def whitespace(documents):
    documents_clean_white = []
    for doc in documents:
        doc = doc.replace("\n"," ")
        documents_clean_white.append(doc)
    return documents_clean_white

#punctuation
def punctuation(documents):
    marks = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']
    docs_clean_punct= []

    for doc in documents:
        for i in marks:
            doc = doc.replace(i, " ")
        docs_clean_punct.append(doc)
    return docs_clean_punct

# numbers
def numbers(documents):
    docs_clean_num = []
    for doc in documents:
        doc = re.sub(r'\w*[0-9]\w*', " ", doc)
        docs_clean_num.append(doc)
    return docs_clean_num

#stopwords
def stopwords(documents):
    stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']
    docs_clean_stop = []
    for doc in documents:
        docwords = doc.split()
        resultwords  = [word for word in docwords if word.lower() not in stop_words]
        result = ' '.join(resultwords)
        docs_clean_stop.append(result)
    return docs_clean_stop

def shortwords(documents):
    docs_clean_short = []
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    for doc in documents:
        doc = shortword.sub('',doc)
        docs_clean_short.append(doc)
    return docs_clean_short

def clean_list_of_documents(documents):
    
    # each step takes the output of the previous step as its input
    cleaned_docs = paperbody(documents)
    cleaned_docs = whitespace(cleaned_docs)
    cleaned_docs = punctuation(cleaned_docs)
    cleaned_docs = numbers(cleaned_docs)
    cleaned_docs = stopwords(cleaned_docs)
    cleaned_docs = shortwords(cleaned_docs)
    
    return cleaned_docs



In [12]:
# *** Version TWO***
# Yes I know this is shorter but I wrote the first version first

def clean_list_of_documents2(documents):    
    cleaned_docs = []
    stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']
    punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    for doc in documents:
        abstract = doc.find("abstract")
        ref = doc.rfind("reference")
        new_string = doc[abstract+8:ref] 
        doc = new_string.replace("\n"," ")
        for i in punctuation:
            doc = doc.replace(i, " ")
        doc = re.sub(r'\w*[0-9]\w*', " ", doc)
        docwords = doc.split()
        resultwords  = [word for word in docwords if word.lower() not in stop_words]
        result = ' '.join(resultwords)   
        doc = shortword.sub('',result)
        cleaned_docs.append(doc)
    return cleaned_docs




In [13]:
# 11a) reimport your raw data using the code in 2)
documents = []

for paper in txt_files:
    with open(paper, encoding='utf-8') as filename: 
        contents = filename.read()
        documents.append(contents)

        
# 11b) clean your files using the function above
# clean_docs = clean_list_of_documents(documents)

clean_docs2 = clean_list_of_documents2(documents)


# 11c) print the first 1000 characters of the first document

# print(clean_docs[0][:1000])

print(clean_docs2[0][:1000])

study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis introduction collaborative learning special form learning interaction affords opportunities groups students combine cognitive resources synchronously asynchronously participate tasks accom

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [14]:
import math
import numpy as np

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): 

In [15]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(documents):
    # a set keeps each word only once
    voc = set()
    for doc in documents:
        voc.update(doc.split())
    # sorted() returns the unique words as an alphabetically ordered list
    return sorted(voc)


documents = []

for paper in txt_files:
    with open(paper, encoding='utf-8') as filename: 
        contents = filename.read()
        documents.append(contents)
        
vocabulary = get_vocabulary(clean_docs2)

print(len(vocabulary))

# print(vocabulary)

# Then print the length of your vocabulary (it should be 
# around 5500 words)

# ^^ assuming the cleaned dictionary


5639


In [16]:
# 14) what was the size of Sherin's vocabulary? 

# "For the corpus used in this work, the full vocabulary contained
# 1,429 words, the stop list consisted of 782 words, and the resulting pruned vocabulary
# contained 647 words." (p.617)

## Step 4 - transform your documents into 100-word chunks

In [24]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-word chunks 
# (each chunk starts 25 words after the previous one)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the step between chunk starts
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!


# join with a space so that the last word of one paper does not
# fuse with the first word of the next paper
docs_all = ' '.join(clean_docs2)

# print(docs_all)

def wordChunk(text, chunkSize, step):
    '''split text into chunks of chunkSize words; each chunk starts
       step words after the previous one, so the last few chunks
       (where the text runs out) are shorter than chunkSize'''
    chunks = []
    words = text.split()
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i+chunkSize]))
    return chunks

docs_all = wordChunk(docs_all, 100, 25)

# print(docs_all[0])
# print(docs_all[1])
# print(docs_all[2])



In [18]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check

for chunk in docs_all:
    words = chunk.split()
    # no chunk may exceed 100 words; only the last few chunks,
    # where the text runs out, can be shorter
    assert len(words) <= 100


In [19]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?

print(docs_all[0])

study investigates possible way analyze chat data collaborative learning environments using epistemic network analysis topic modeling topic general topic model built tasa touchstone applied science associates corpus used study topic scores utterances chat data computed seven relevant topics selected based total document scores aggregated topic scores power predicting students learning using epistemic network analysis enables assessing data different angle results showed topic score based epistemic networks low gain students high gain students significantly different overall results suggest two analytical approaches provide complementary information afford new insights processes related successful collaborative interactions keywords chat collaborative learning topic modeling epistemic network analysis


In [20]:
# 18) how many chunks did Sherin have? What does a chunk become 
# in the next step of our topic modeling algorithm? 

# p.618: "I chose to employ overlapping 100-word segments, with the start of each
# segment beginning 25 words after the start of the preceding segment. So the first
# segment of a transcript would include Words 1–100, the second Words 26–125,
# the third Words 51–150, and so on. When all of the 54 interview transcripts were
# segmented in this manner, I ended up with 794 segments of text."

# 794 segments (or chunks)
# these become vectors, next!
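
The windowing in the quote above can be checked on a toy word list. This is only a minimal sketch: `sliding_chunks`, `size` and `step` are hypothetical names chosen for illustration, and the "words" are simply the numbers 1 to 150 so that the window boundaries are easy to read.

```python
def sliding_chunks(words, size=100, step=25):
    """Return overlapping windows of `size` words, each starting `step`
    words after the previous one; the final windows may be shorter."""
    return [words[i:i + size] for i in range(0, len(words), step)]

# toy corpus: the "words" are just the numbers 1..150
words = list(range(1, 151))
chunks = sliding_chunks(words)

# first and last word of the first three windows
for chunk in chunks[:3]:
    print(chunk[0], "-", chunk[-1])
```

The first three windows cover words 1-100, 26-125 and 51-150, which matches the segments listed in the quote.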


In [21]:
# 19) what are some other preprocessing steps we could do 
# to improve the quality of the text data? Mention at least 2.

# * locating/correcting typos/misspellings
# * stemming/lemmatization (e.g., so that a plural counts as the same word as its singular)


# 20) in your own words, describe the next steps of the data modeling algorithms (listed below):

### Step 5
* map each segment to a vector with one entry per word in the vocabulary (see above); each entry holds the number of times that word occurs in the segment
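
A minimal sketch of this mapping, using plain Python lists. The function name `count_matrix` and the toy vocabulary and chunks are made up for illustration; they are not the `docs_by_words_matrix` function from the roadmap.

```python
def count_matrix(chunks, vocabulary):
    """One row per chunk, one column per vocabulary word; each cell holds
    how many times that word occurs in that chunk."""
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in chunks]
    for i, chunk in enumerate(chunks):
        for word in chunk.split():
            if word in column:          # ignore out-of-vocabulary words
                matrix[i][column[word]] += 1
    return matrix

vocabulary = ["chat", "data", "topic"]
chunks = ["topic data topic", "chat data"]
matrix = count_matrix(chunks, vocabulary)
print(matrix)   # [[0, 1, 2], [1, 1, 0]]
```

Looking up the column index in a dictionary avoids searching the whole vocabulary list for every word, which matters once the vocabulary has several thousand entries.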

### Step 6 Weight word frequency
* Sherin dampens the raw counts: a count of 0 stays 0, and any other count is replaced by 1 + log(count)
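
To see what this weighting does to a few counts, here is a small sketch. It assumes the natural logarithm (the base is an assumption here) and that a count of 0 stays 0, since log(0) is undefined; `dampen` is a hypothetical helper name.

```python
import math

def dampen(count):
    """Weight a raw count: 0 stays 0, otherwise 1 + ln(count)."""
    return 1 + math.log(count) if count > 0 else 0.0

counts = [0, 1, 2, 10]
weights = [dampen(c) for c in counts]
print([round(w, 2) for w in weights])   # [0.0, 1.0, 1.69, 3.3]
```

A word that appears 10 times ends up weighted about 3.3 times as much as a word that appears once, rather than 10 times as much.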

### Step 7 Matrix normalization
* divide each vector by its length (its Euclidean norm) so that every vector has a length of 1
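
A minimal sketch of this normalization on plain lists. `normalize_rows` is a hypothetical name, and leaving all-zero rows untouched is a choice made here to avoid dividing by zero.

```python
import math

def normalize_rows(matrix):
    """Scale every row to Euclidean length 1; all-zero rows are left as is."""
    normalized = []
    for row in matrix:
        length = math.sqrt(sum(x * x for x in row))
        normalized.append([x / length for x in row] if length else list(row))
    return normalized

unit = normalize_rows([[3.0, 4.0], [0.0, 0.0]])
print(unit)   # [[0.6, 0.8], [0.0, 0.0]]
```

The row [3, 4] has length 5, so it becomes [0.6, 0.8], which has length 1.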

### Step 8 Deviation Vectors
* find the average of the vectors (e.g., of V1 and V2)
* break each vector into two components (one along the average, the other perpendicular to the average)
* the perpendicular pieces are the deviation vectors
* V1 and V2 are replaced by their deviation vectors, which emphasizes the differences between the vectors
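
The decomposition described above can be sketched as follows. This is one reading of the steps, not necessarily Sherin's exact procedure: it assumes the average is taken over all rows, that at least one row is non-zero, and that each deviation vector is rescaled to length 1 afterwards. `deviation_vectors` is a hypothetical name.

```python
import math

def deviation_vectors(matrix):
    """Remove from each row its component along the average row,
    then rescale what is left to length 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    avg = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    avg_len = math.sqrt(sum(a * a for a in avg))   # assumed to be non-zero
    unit_avg = [a / avg_len for a in avg]

    result = []
    for row in matrix:
        # length of the row's component along the average direction
        along = sum(x * u for x, u in zip(row, unit_avg))
        # what remains is perpendicular to the average
        perp = [x - along * u for x, u in zip(row, unit_avg)]
        perp_len = math.sqrt(sum(p * p for p in perp))
        result.append([p / perp_len for p in perp] if perp_len > 1e-12 else perp)
    return result, unit_avg

rows = [[1.0, 0.0], [0.0, 1.0]]
deviations, unit_avg = deviation_vectors(rows)
print(deviations)
```

For these two toy vectors the average points along the diagonal, so the two deviation vectors point in exactly opposite directions: the shared part is gone and only the difference remains.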

### Step 9 Clustering
* cluster the vectors, e.g., with hierarchical agglomerative clustering: start with one cluster per item, then merge clusters step by step, building up a tree
* centroid clustering: compute the average of all the vectors in each cluster (its centroid), then merge the two clusters whose centroids are closest
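
A bare-bones sketch of the merge loop described in these notes. It is not the clustering code we will use in Week 7: the function names are hypothetical, squared Euclidean distance between centroids is an assumption, and `num_clusters` is assumed to be at least 1.

```python
def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_clustering(vectors, num_clusters):
    """Start with one cluster per vector, then repeatedly merge the two
    clusters whose centroids are closest until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]   # lists of row indices
    while len(clusters) > num_clusters:
        centroids = [centroid([vectors[i] for i in c]) for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = squared_distance(centroids[a], centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return clusters

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
clusters = centroid_clustering(points, 2)
print(clusters)   # [[0, 1], [2, 3]]
```

Each cluster is stored as a list of row indices, so the result can be mapped back to the original chunks. Recomputing every centroid on each pass is slow for large matrices, but it keeps the logic easy to follow.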

### Step 10 visualizing results
* graphs! Some py.plots!


## Step 5 - Vector and Matrix operations

## Step 6 - Weight word frequency

## Step 7 - Matrix normalization

## Step 8 - Deviation Vectors

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [22]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of overlapping 100-word fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization