# Week 5 - Vector Space Model (VSM) and Topic Modeling

Over the next few weeks, we are going to re-implement Sherin's algorithm and apply it to the text data we worked on last week! Here's our roadmap:

**Week 5 - data cleaning**
1. import the data
2. clean the data (e.g., remove stop words, punctuation, etc.)
3. build a vocabulary for the dataset
4. create chunks of 100 words, with a 25-word overlap
5. create a word count matrix, where each chunk is a row and each column represents a word

**Week 6 - vectorization and linear algebra**
6. Dampen: weight the frequency of words (1 + log[count])
7. Scale: Normalize weighted frequency of words
8. Direction: compute deviation vectors

**Week 7 - Clustering**
9. apply different unsupervised machine learning algorithms
    * figure out how many clusters we want to keep
    * inspect the results of the clustering algorithm

**Week 8 - Visualizing the results**
10. create visualizations to compare documents

In [14]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization

In [37]:
##Setup
# plot the graphs inline
%matplotlib inline

import os
import re
import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import defaultdict

## Step 1 - Data Retrieval

In [16]:
# 1) using glob, find all the text files in the "Papers" folder
# Hint: refer to last week's notebook
import glob
files = glob.glob('./Papers/*.txt')
print(files)

['./Papers/paper0.txt', './Papers/paper1.txt', './Papers/paper10.txt', './Papers/paper11.txt', './Papers/paper12.txt', './Papers/paper13.txt', './Papers/paper14.txt', './Papers/paper15.txt', './Papers/paper16.txt', './Papers/paper2.txt', './Papers/paper3.txt', './Papers/paper4.txt', './Papers/paper5.txt', './Papers/paper6.txt', './Papers/paper7.txt', './Papers/paper8.txt', './Papers/paper9.txt']
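
The list above is in lexicographic order (`paper10` comes before `paper2`), and `glob` itself makes no promise about ordering. If a numeric order matters, a natural-sort key is one option. This is only a sketch: `natural_key` and `example_files` are hypothetical names, not part of the assignment.

```python
import re


def natural_key(path):
    # Split into digit and non-digit runs so that 'paper2' sorts before 'paper10'.
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', path)]


example_files = ['./Papers/paper10.txt', './Papers/paper2.txt', './Papers/paper0.txt']
sorted_files = sorted(example_files, key=natural_key)
print(sorted_files)
```

The same key could be passed when sorting the real `files` list, assuming every path follows the same `paperN.txt` pattern.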


In [17]:
# 2) get all the data from the text files into the "documents" list
# P.S. make sure you use the 'utf-8' encoding
documents = []
for paper in files:
    # the with-block closes each file once it has been read
    with open(paper, "r", encoding='utf8') as f:
        documents.append(f.read())
    
len(documents)


17

In [18]:
# 3) print the first 1000 characters of the first document to see what it 
# looks like (we'll use this as a sanity check below)
print(documents[0][0:1000])

zone out no more: mitigating mind wandering during
computerized reading
sidney k. d’mello, caitlin mills, robert bixler, & nigel bosch
university of notre dame
118 haggar hall
notre dame, in 46556, usa
sdmello@nd.edu

abstract
mind wandering, defined as shifts in attention from task-related
processing to task-unrelated thoughts, is a ubiquitous
phenomenon that has a negative influence on performance and
productivity in many contexts, including learning. we propose
that next-generation learning technologies should have some
mechanism to detect and respond to mind wandering in real-time.
towards this end, we developed a technology that automatically
detects mind wandering from eye-gaze during learning from
instructional texts. when mind wandering is detected, the
technology intervenes by posing just-in-time questions and
encouraging re-reading as needed. after multiple rounds of
iterative refinement, we summatively compared the technology to
a yoked-control in an experiment with 104 par

## Step 2 - Data Cleaning

In [19]:
# 4) only select the text that's between the first occurrence of the 
# word "abstract" and the last occurrence of the word "reference"
# Optional: print the length of the string before and after, as a 
# sanity check
# HINT: https://stackoverflow.com/questions/14496006/finding-last-occurrence-of-substring-in-string-replacing-that
# read more about rfind: https://www.tutorialspoint.com/python/string_rfind.htm

d1 = documents[0]
print(len(d1))
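# note: the non-greedy (.*?) stops at the FIRST "reference" after "abstract";
# the rfind-based loop in the next cell slices up to the LAST one
body = re.findall(r'abstract(.*?)reference', d1, re.DOTALL)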
body = body[0]
print(len(body))
type(body)

50043
39310


str

In [20]:
bodies = []
print(len(documents))
for doc in documents:
    start = doc.index('abstract\n') + len('abstract')
    end = doc.rfind('\nreference')
    bodies.append(doc[start:end])
len(bodies)
print(bodies[0][0:1000])

17

mind wandering, defined as shifts in attention from task-related
processing to task-unrelated thoughts, is a ubiquitous
phenomenon that has a negative influence on performance and
productivity in many contexts, including learning. we propose
that next-generation learning technologies should have some
mechanism to detect and respond to mind wandering in real-time.
towards this end, we developed a technology that automatically
detects mind wandering from eye-gaze during learning from
instructional texts. when mind wandering is detected, the
technology intervenes by posing just-in-time questions and
encouraging re-reading as needed. after multiple rounds of
iterative refinement, we summatively compared the technology to
a yoked-control in an experiment with 104 participants. the key
dependent variable was performance on a post-reading
comprehension assessment. our results suggest that the
technology was successful in correcting comprehension deficits
attributed to mind wandering (d = 
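
The slicing in the loop above can be wrapped in a small helper. This is a sketch under assumptions: the name `extract_body`, the plain default markers, and the fallback (use the start or end of the text when a marker is missing) are not part of the assignment. Note that `str.index` raises a `ValueError` when the marker is absent, whereas `str.find` and `str.rfind` return -1.

```python
def extract_body(text, start_marker='abstract', end_marker='reference'):
    """Return the text between the first start_marker and the last end_marker.

    Hypothetical helper: falls back to the start/end of the text when a
    marker is missing instead of raising an error.
    """
    start = text.find(start_marker)           # first occurrence, -1 if absent
    start = 0 if start == -1 else start + len(start_marker)
    end = text.rfind(end_marker)              # last occurrence, -1 if absent
    if end == -1 or end < start:
        end = len(text)
    return text[start:end]


sample = "title\nabstract\nsome reference to prior work\nbody\nreferences\n[1] ..."
print(repr(extract_body(sample)))
```

Because `rfind` is used for the end marker, the word "reference" inside the body does not cut the text short.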

In [21]:
# 5) replace line breaks (i.e., "\n") with a white space
# check that the result looks okay by printing the 
# first 1000 characters of the 1st doc:
# body = body.replace('\n', ' ')
# print(body[0:1000])
# using enumerate: i is the index, doc is the element
for i, doc in enumerate(bodies):
    bodies[i] = doc.replace('\n', ' ')
bodies[0][:1000]

' mind wandering, defined as shifts in attention from task-related processing to task-unrelated thoughts, is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts, including learning. we propose that next-generation learning technologies should have some mechanism to detect and respond to mind wandering in real-time. towards this end, we developed a technology that automatically detects mind wandering from eye-gaze during learning from instructional texts. when mind wandering is detected, the technology intervenes by posing just-in-time questions and encouraging re-reading as needed. after multiple rounds of iterative refinement, we summatively compared the technology to a yoked-control in an experiment with 104 participants. the key dependent variable was performance on a post-reading comprehension assessment. our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering (d = .4

In [22]:
# 6) replace the punctuation below with a white space
# check that the result looks okay 
# (e.g., by printing the first 1000 characters of the 1st doc)

punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']

for punc in punctuation:
    bodies[0] = bodies[0].replace(punc, ' ')
print(bodies[0][0:1000])

 mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with 104 participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering  d    47
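
As an alternative to calling `replace` once per symbol, every single-character punctuation mark can be mapped to a space in one pass with `str.translate`. The sketch below is not the assignment's required approach; `PUNCT`, `TABLE` and `strip_punctuation` are hypothetical names.

```python
import string

# string.punctuation covers the ASCII symbols; the extra characters are the
# typographic ones that also appear in the list above.
PUNCT = string.punctuation + '−”“’'

# Translation table that maps every punctuation character to a single space.
TABLE = str.maketrans({ch: ' ' for ch in PUNCT})


def strip_punctuation(text):
    return text.translate(TABLE)


print(strip_punctuation('task-related thoughts, (d = .47)'))
```

Multi-character entries such as `'...'` need no special handling here, since each of their characters is replaced individually.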

In [23]:
# 7) replace numbers with either a white space or the word "number"
# again, print the first 1000 characters of the first document
# to check that you're doing the right thing


bodies[0] = re.sub(r'\d', ' ', bodies[0])
print(bodies[0][0:1000])


 mind wandering  defined as shifts in attention from task related processing to task unrelated thoughts  is a ubiquitous phenomenon that has a negative influence on performance and productivity in many contexts  including learning  we propose that next generation learning technologies should have some mechanism to detect and respond to mind wandering in real time  towards this end  we developed a technology that automatically detects mind wandering from eye gaze during learning from instructional texts  when mind wandering is detected  the technology intervenes by posing just in time questions and encouraging re reading as needed  after multiple rounds of iterative refinement  we summatively compared the technology to a yoked control in an experiment with     participants  the key dependent variable was performance on a post reading comprehension assessment  our results suggest that the technology was successful in correcting comprehension deficits attributed to mind wandering  d      
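
The prompt also allows replacing numbers with the word "number". A minimal sketch of both options, assuming a hypothetical helper named `replace_numbers`; using `\d+` treats a run of digits such as "104" as one number rather than three separate digits.

```python
import re


def replace_numbers(text, token=' '):
    # \d+ matches a whole run of digits, so each number is replaced once.
    return re.sub(r'\d+', token, text)


print(replace_numbers('an experiment with 104 participants'))
print(replace_numbers('an experiment with 104 participants', token='number'))
```

One thing to keep in mind with the second option: the token "number" would then be counted like any other word when the vocabulary is built.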

In [24]:
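# 8) Remove the stop words below from our documents
# print the first 1000 characters of the first document
# note: the entry 'we ' has a trailing space, so the padded pattern ' we  '
# rarely matches and "we" survives in the output below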
stop_words = ['i', 'me', 'my', 'myself', 'we ', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']

for word in stop_words:
    word = ' '+word+' '
    bodies[0] = bodies[0].replace(word, ' ')
print(bodies[0][0:1000])


 mind wandering  defined shifts attention task related processing task unrelated thoughts  ubiquitous phenomenon negative influence performance productivity many contexts  including learning  we propose next generation learning technologies mechanism detect respond mind wandering real time  towards end  we developed technology automatically detects mind wandering eye gaze learning instructional texts  mind wandering detected  technology intervenes posing time questions encouraging re reading needed  multiple rounds iterative refinement  we summatively compared technology yoked control experiment     participants  key dependent variable performance post reading comprehension assessment  results suggest technology successful correcting comprehension deficits attributed mind wandering  d       sigma  specific conditions  thereby highlighting potential improve learning  attending attention    keywords mind wandering  gaze tracking  student modeling  attentionaware      introduction despite
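
Padding each stop word with spaces depends on exact spacing (the entry `'we '` in the list above carries a trailing space, which is why "we" survives in the output). Filtering on tokens avoids that. The sketch below is one possible alternative, with a deliberately shortened stop-word set and hypothetical names.

```python
# Shortened stop-word set, for illustration only.
STOP_WORDS = {'we', 'that', 'the', 'to', 'in', 'a', 'is'}


def remove_stop_words(text, stop_words=STOP_WORDS, min_length=1):
    # Split on any whitespace, then keep tokens that are neither stop words
    # nor shorter than min_length.
    tokens = text.split()
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    return ' '.join(kept)


print(remove_stop_words('we propose that the  technology is useful'))
```

The optional `min_length` argument shows how a short-word filter could be folded into the same pass.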

In [25]:
# 9) remove very short words (e.g., 'd', 'er', etc.)
# note: the pattern below uses {1,3}, so three-letter words such as 'eye' and 'key' are removed too
# print the first 1000 characters of the first document
shortword = re.compile(r'\W*\b\w{1,3}\b')
bodies[0]= shortword.sub('', bodies[0])
print(bodies[0][0:1000])

 mind wandering  defined shifts attention task related processing task unrelated thoughts  ubiquitous phenomenon negative influence performance productivity many contexts  including learning propose next generation learning technologies mechanism detect respond mind wandering real time  towards developed technology automatically detects mind wandering gaze learning instructional texts  mind wandering detected  technology intervenes posing time questions encouraging reading needed  multiple rounds iterative refinement summatively compared technology yoked control experiment     participants dependent variable performance post reading comprehension assessment  results suggest technology successful correcting comprehension deficits attributed mind wandering       sigma  specific conditions  thereby highlighting potential improve learning  attending attention    keywords mind wandering  gaze tracking  student modeling  attentionaware      introduction despite best efforts write clear engag


### Putting it all together

In [51]:
# 10) package all of your work above into a function that cleans a given document

##Setup
# plot the graphs inline
%matplotlib inline

import os
import re
import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import defaultdict

##Define the function

def clean_list_of_documents(documents):
    # empty list to collect the cleaned documents
    cleaned_docs = []

    # define what to remove: punctuation, stop words and short words
    punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
                   '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
                   '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
                   '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']
    stop_words = ['i', 'me', 'my', 'myself', 'we ', 'our', 'ours', 
                  'ourselves', 'you', 'your', 'yours', 'yourself', 
                  'yourselves', 'he', 'him', 'his', 'himself', 'she', 
                  'her', 'hers', 'herself', 'it', 'its', 'itself', 
                  'they', 'them', 'their', 'theirs', 'themselves', 
                  'what', 'which', 'who', 'whom', 'this', 'that', 
                  'these', 'those', 'am', 'is', 'are', 'was', 'were', 
                  'be', 'been', 'being', 'have', 'has', 'had', 'having', 
                  'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
                  'but', 'if', 'or', 'because', 'as', 'until', 'while', 
                  'of', 'at', 'by', 'for', 'with', 'about', 'against', 
                  'between', 'into', 'through', 'during', 'before', 
                  'after', 'above', 'below', 'to', 'from', 'up', 'down', 
                  'in', 'out', 'on', 'off', 'over', 'under', 'again', 
                  'further', 'then', 'once', 'here', 'there', 'when', 
                  'where', 'why', 'how', 'all', 'any', 'both', 'each', 
                  'few', 'more', 'most', 'other', 'some', 'such', 'no', 
                  'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
                  'too', 'very', 's', 't', 'can', 'will', 
                  'just', 'don', 'should', 'now']

    # matches words of one to three characters, plus any non-word characters before them
    shortword = re.compile(r'\W*\b\w{1,3}\b')

    for document in documents:
        # select only the body of the text
        start = document.index('abstract\n') + len('abstract')
        end = document.rfind('\nreference')
        document = document[start:end]
        # remove line breaks
        document = document.replace('\n', ' ')
        # remove punctuation
        for punc in punctuation:
            document = document.replace(punc, ' ')
        # remove digits using a regex
        document = re.sub(r'\d', ' ', document)
        # remove stop words (padded with spaces to match whole words only)
        for word in stop_words:
            document = document.replace(' ' + word + ' ', ' ')
        # remove short words (one to three characters)
        document = shortword.sub('', document)
        # append the cleaned document to the list
        cleaned_docs.append(document)
    return cleaned_docs

In [52]:
# 11a) reimport your raw data using the code in 2)

files = glob.glob('./Papers/*.txt')

documents = []
for paper in files:
    # the with-block closes each file once it has been read
    with open(paper, "r", encoding='utf8') as f:
        documents.append(f.read())
print(len(documents))
    
        
# 11b) clean your files using the function above
documents = clean_list_of_documents(documents)


# 11c) print the first 1000 characters of the first document
print(documents[0][0:1000])
print(len(documents))

17
 mind wandering  defined shifts attention task related processing task unrelated thoughts  ubiquitous phenomenon negative influence performance productivity many contexts  including learning propose next generation learning technologies mechanism detect respond mind wandering real time  towards developed technology automatically detects mind wandering gaze learning instructional texts  mind wandering detected  technology intervenes posing time questions encouraging reading needed  multiple rounds iterative refinement summatively compared technology yoked control experiment     participants dependent variable performance post reading comprehension assessment  results suggest technology successful correcting comprehension deficits attributed mind wandering       sigma  specific conditions  thereby highlighting potential improve learning  attending attention    keywords mind wandering  gaze tracking  student modeling  attentionaware      introduction despite best efforts write clear en

## Step 3 - Build your list of vocabulary

This list of words (i.e., the vocabulary) is going to become the columns of your matrix.

In [41]:
import math
import numpy as np

12) Describe why we need to figure out the vocabulary used in our corpus (refer back to Sherin's paper, and explain in your own words): 

We need a list of all the unique words that might be important to count. By counting these unique words in each paper/passage, we can identify patterns in what the texts are about. With the vocabulary, we will be able to count the frequency of these words in each passage we identify.
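
A toy illustration of that idea, using two made-up passages (the passages and variable names are assumptions for illustration): the vocabulary is the sorted set of unique words, and a `Counter` per passage gives the frequency of each vocabulary word.

```python
from collections import Counter

passages = ['mind wandering detected mind wandering',
            'gaze tracking student modeling']

# Unique words across all passages, sorted alphabetically.
vocab = sorted({word for passage in passages for word in passage.split()})

# One Counter per passage; words that are missing simply count as 0.
counts = [Counter(passage.split()) for passage in passages]

print(vocab)
print(counts[0]['mind'], counts[1]['mind'])
```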

In [53]:
# 13) create a function that takes in a list of documents
# and returns a set of unique words. Make sure that you
# sort the list alphabetically before returning it. 

def get_vocabulary(documents):
    # a set keeps each word once; membership tests on a list are much slower
    voc = set()
    for document in documents:
        voc.update(document.split())
    # return the unique words as an alphabetically sorted list
    return sorted(voc)


# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(documents)
print(len(vocabulary))
print(vocabulary)

5876


14) what was the size of Sherin's vocabulary? 
647 words

## Step 4 - transform your documents into 100-word chunks

In [98]:
# 15) create a function that takes in a list of documents
# and returns a list of 100-word chunks 
# (with a 25-word overlap between them)
# Optional: add two arguments, one for the number of words
# in each chunk, and one for the overlap size
# Advice: combining all the documents into one giant string
# and splitting it into separate words will make your life easier!

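def text_segments(text, overlap, length):
    # combine all documents into one long list of words
    words = ' '.join(text).split()
    segments = []
    # note: `overlap` is used as the step between chunk starts, so
    # consecutive chunks share (length - overlap) words
    for i in range(0, len(words) - length, overlap):
        segments.append(words[i:i + length])
    return segments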

segments = text_segments(documents,25,100)
print(len(segments))

# We stop the range before the end of the text so that every segment has
# exactly 100 words; otherwise the last few segments would be shorter

2231
['study', 'capturing', 'semantic', 'attributes', 'student', 'responses', 'vocabulary', 'learning', 'also', 'limitations', 'current', 'study', 'areas', 'future', 'work', 'first', 'expanding', 'scope', 'analysis', 'full', 'experimental', 'conditions', 'used', 'study', 'reveal', 'complex', 'interactions', 'conditions', 'students', 'short', 'longterm', 'learning', 'second', 'study', 'used', 'fixed', 'threshold', 'investigating', 'false', 'prediction', 'results', 'however', 'optimal', 'threshold', 'participant', 'group', 'prediction', 'model', 'could', 'selected', 'especially', 'different', 'false', 'positive', 'negative', 'patterns', 'observed', 'different', 'groups', 'students', 'lastly', 'study', 'collected', 'data', 'single', 'vocabulary', 'tutoring', 'system', 'used', 'classroom', 'setting', 'applying', 'proposed', 'method', 'data', 'collected', 'classroom', 'setting', 'vocabulary', 'learning', 'system', 'would', 'useful', 'show', 'generalization', 'suggested', 'method', 'acknowle
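
In `text_segments` above, the second argument is passed to `range` as the step, so a call with 25 and 100 produces chunks whose starts are 25 words apart and that share 75 words. If the intent is instead a 25-word overlap, the step has to be `length - overlap`. A minimal sketch of that reading, with a hypothetical function name:

```python
def sliding_chunks(words, length=100, overlap=25):
    """Split a list of words into full-length chunks that share `overlap` words."""
    if not 0 <= overlap < length:
        raise ValueError('overlap must satisfy 0 <= overlap < length')
    step = length - overlap
    # The + 1 keeps the last window when it fits exactly at the end.
    return [words[i:i + length]
            for i in range(0, len(words) - length + 1, step)]


toy = list(range(10))
toy_chunks = sliding_chunks(toy, length=4, overlap=1)
print(toy_chunks)
```

Which of the two readings matches the paper is worth checking against Sherin's description before settling on one.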

In [101]:
# 16) create a for loop to double check that each chunk has 
# a length of 100
# Optional: use assert to do this check
def hundreds_true(text):
    # check every chunk that was passed in, not a leftover global variable
    for seg in text:
        assert len(seg) == 100
    return 'true'
hundreds_true(segments)

'true'

In [102]:
# 17) print the first chunk, and compare it to the original text.
# does that match what Sherin describes in his paper?
print(segments[0])

# Abstract from paper 1:
# Mind wandering, defined as shifts in attention from task-related
# processing to task-unrelated thoughts, is a ubiquitous
# phenomenon that has a negative influence on performance and
# productivity in many contexts, including learning. We propose
# that next-generation learning technologies should have some
# mechanism to detect and respond to mind wandering in real-time.
# Towards this end, we developed a technology that automatically
# detects mind wandering from eye-gaze during learning from
# instructional texts. When mind wandering is detected, the
# technology intervenes by posing just-in-time questions and
# encouraging re-reading as needed. After multiple rounds of
# iterative refinement, we summatively compared the technology to
# a yoked-control in an experiment with 104 participants. The key
# dependent variable was performance on a post-reading
# comprehension assessment. Our results suggest that the
# technology was successful in correcting comprehension deficits
# attributed to mind wandering (d = .47 sigma) under specific
# conditions, thereby highlighting the potential to improve learning
# by “attending to attention.”

##Yep, this looks right to me

['mind', 'wandering', 'defined', 'shifts', 'attention', 'task', 'related', 'processing', 'task', 'unrelated', 'thoughts', 'ubiquitous', 'phenomenon', 'negative', 'influence', 'performance', 'productivity', 'many', 'contexts', 'including', 'learning', 'propose', 'next', 'generation', 'learning', 'technologies', 'mechanism', 'detect', 'respond', 'mind', 'wandering', 'real', 'time', 'towards', 'developed', 'technology', 'automatically', 'detects', 'mind', 'wandering', 'gaze', 'learning', 'instructional', 'texts', 'mind', 'wandering', 'detected', 'technology', 'intervenes', 'posing', 'time', 'questions', 'encouraging', 'reading', 'needed', 'multiple', 'rounds', 'iterative', 'refinement', 'summatively', 'compared', 'technology', 'yoked', 'control', 'experiment', 'participants', 'dependent', 'variable', 'performance', 'post', 'reading', 'comprehension', 'assessment', 'results', 'suggest', 'technology', 'successful', 'correcting', 'comprehension', 'deficits', 'attributed', 'mind', 'wandering'

18) how many chunks did Sherin have? What does a chunk become in the next step of our topic modeling algorithm? 
Sherin had 794 segments of text. These segments were converted from a list of words to a count of the vocabulary. So for the segment printed above, instead of 100 words in a list, we might have a dictionary that holds a key for each word and a value that is the count of how often that word appears in this segment (e.g. 'mind': 4). This is then converted to a vector.
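
The conversion from chunks to count vectors could look roughly like this. It is a sketch with plain Python lists so that it runs without extra packages; the function name `count_matrix` and the toy data are assumptions, and with NumPy the result would normally be stored as an array instead.

```python
def count_matrix(chunks, vocabulary):
    """Return one row per chunk and one column per vocabulary word."""
    # Look up each word's column once instead of searching the list every time.
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in chunks]
    for i, chunk in enumerate(chunks):
        for word in chunk:
            j = column.get(word)
            if j is not None:      # ignore words outside the vocabulary
                matrix[i][j] += 1
    return matrix


toy_chunks = [['mind', 'wandering', 'mind'], ['gaze', 'tracking']]
toy_vocabulary = ['gaze', 'mind', 'tracking', 'wandering']
toy_matrix = count_matrix(toy_chunks, toy_vocabulary)
print(toy_matrix)
```

Each row is the vector for one chunk, and every row has the same length because all chunks share the same vocabulary.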

19) what are some other preprocessing steps we could do to improve the quality of the text data? Mention at least 2.
I don't think we should have made these segments cross papers. We probably should have made 100-word segments bounded by each paper so we can compare the papers to each other (like Sherin does for different students).

I think we might have wanted to keep compound words, so not removing all hyphens from the text. In the first paper "mind-wandering" is meaningfully different from "mind" and "wandering" separately.

We might also want to remove the keywords that appear in between the abstract and introduction.
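
The first suggestion (segments that never cross a paper boundary) could be sketched as follows. The function name, the `step` parameter and the toy documents are assumptions; each chunk is paired with the index of the paper it came from so that papers can be compared later.

```python
def chunk_per_document(documents, length=100, step=25):
    """Return (doc_index, chunk) pairs; no chunk spans two documents."""
    chunks = []
    for doc_index, doc in enumerate(documents):
        words = doc.split()
        # Windows are computed separately for each document.
        for i in range(0, len(words) - length + 1, step):
            chunks.append((doc_index, words[i:i + length]))
    return chunks


toy_docs = ['a b c d e f', 'g h i j']
toy_pairs = chunk_per_document(toy_docs, length=4, step=2)
print(toy_pairs)
```

A side effect of this design is that any document shorter than `length` contributes no chunks at all.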

20) in your own words, describe the next steps of the data modeling algorithms (listed below):
We will take the segments and convert them into vocabulary counts, so every segment becomes a row of the same matrix: one column per vocabulary word, holding how often that word appears in the segment.

We will "weight" the counts by replacing each count with 1 + log(count), which gives less weight to words that have very high frequencies.

We will create vectors out of these: which will have two properties, an angle and a length.

We will standardize the lengths to 1 so the vectors only have one property.

We will convert these into deviation vectors that tell us not the unique angle of the vector but its deviation from the average.

We will cluster the vectors by comparing them to each other: vectors that point in similar directions will iteratively be grouped together. We then have to choose how many clusters to keep.

We will visualize the results- how do the segments compare, and can we extract any meaning about what the EDM conference is about?
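
The weighting, scaling and deviation steps described above can be sketched with plain Python lists (with NumPy they would become array operations). All names here are hypothetical. The deviation step in particular is an assumption: it removes each vector's component along the average direction and rescales the remainder; subtracting the mean vector directly is another possible reading, so check it against Sherin's paper.

```python
import math


def dampen(row):
    # 1 + log(count) for non-zero counts; zero counts stay zero.
    return [1 + math.log(c) if c > 0 else 0.0 for c in row]


def unit_length(row):
    # Scale a vector to length 1 (all-zero vectors are returned unchanged).
    length = math.sqrt(sum(x * x for x in row))
    return [x / length for x in row] if length > 0 else list(row)


def deviation_vectors(rows):
    # Unit vector pointing in the direction of the average row.
    n = len(rows)
    mean_dir = unit_length([sum(col) / n for col in zip(*rows)])
    result = []
    for row in rows:
        dot = sum(x * m for x, m in zip(row, mean_dir))
        # Assumed definition: drop the component along the average direction.
        result.append(unit_length([x - dot * m for x, m in zip(row, mean_dir)]))
    return result


toy_counts = [[4, 1, 0], [1, 0, 2], [0, 3, 1]]
weighted = [unit_length(dampen(row)) for row in toy_counts]
deviations = deviation_vectors(weighted)
print(deviations)
```

After this step, what distinguishes one chunk from another is how it departs from the "average" chunk, which is what the clustering then works on.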

## Step 5 - Vector and Matrix operations

## Step 6 - Weight word frequency

## Step 7 - Matrix normalization

## Step 8 - Deviation Vectors

## Step 9 - Clustering

## Step 10 - Visualizing the results

## Final Step - Putting it all together: 

In [17]:
# in python code, our goal is to recreate the steps above as functions
# so that we can use just one line to run topic modeling on a list of 
# documents: 
def ExtractTopicsVSM(documents, numTopics):
    ''' this function takes in a list of documents (strings), 
        runs topic modeling (as implemented by Sherin, 2013)
        and returns the clustering results, the matrix used 
        for clustering, and a visualization '''
    
    # step 2: clean up the documents
    documents = clean_list_of_documents(documents)
    
    # step 3: let's build the vocabulary of these docs
    vocabulary = get_vocabulary(documents)
    
    # step 4: we build our list of 100-words overlapping fragments
    documents = flatten_and_overlap(documents)
    
    # step 5: we convert the chunks into a matrix
    matrix = docs_by_words_matrix(documents, vocabulary)
    
    # step 6: we weight the frequency of words (count = 1 + log(count))
    matrix = one_plus_log_mat(matrix, documents, vocabulary)
    
    # step 7: we normalize the matrix
    matrix = normalize(matrix)
    
    # step 8: we compute deviation vectors
    matrix = transform_deviation_vectors(matrix, documents)
    
    # step 9: we apply a clustering algorithm to find topics
    results_clustering = cluster_matrix(matrix)
    
    # step 10: we create a visualization of the topics
    visualization = visualize_clusters(results_clustering, vocabulary)
    
    # finally, we return the clustering results, the matrix, and a visualization
    return results_clustering, matrix, visualization