### Text cleaning and topic modeling

This notebook is an example of topic modeling adapted from [this writeup](https://medium.com/@sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06).

It performs the following tasks:

- the first part of the notebook loads texts from a spreadsheet and turns them into one large corpus
- then we walk through various ways in which we can analyze and clean our corpus using spaCy (this includes removing stop words, the words used most often in the English language, and lemmatizing our corpus)
- to better understand how a model works, this notebook also explores some functionalities of spaCy
- the last parts of this notebook then build a simple topic model from the cleaned language data

The libraries we will use are:
- `pandas`: for reading in and exporting spreadsheets
- `spacy`: a natural language processing library that provides trained pipelines for many languages
- `gensim`: a library for topic modeling, document indexing and similarity retrieval with large corpora. In this case we will use it for topic modeling, the process of clustering words that tend to be used together. The gensim algorithms this notebook uses include [Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) and [Latent Semantic Analysis (LSA)](https://blog.marketmuse.com/glossary/latent-semantic-analysis-definition/), which gensim calls Latent Semantic Indexing (LSI), as well as the Hierarchical Dirichlet Process (HDP).
- `pyLDAvis`: a library for visualizing your topic clusters.

Topic modeling is a form of unsupervised machine learning and can be really helpful in discovering topics in a large amount of text, especially if you're uncertain which topics might be buried in thousands or millions of documents. 

In [1]:
import os 
import pandas as pd

# for comprehension of language
import spacy 
from spacy import displacy

# for topic modeling
import gensim 
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel, LsiModel, HdpModel


### Load spaCy's English language trained pipeline

A trained pipeline in spaCy is a packaged set of processing components (such as a tokenizer, part-of-speech tagger, parser, lemmatizer and named entity recognizer) that has already been trained on annotated text, so we can load it and apply it directly to our own texts.

You will need to download one of spaCy's models. You can do so by running this in a cell:
```
!python3 -m spacy download en_core_web_sm
```

In [2]:
# !python3 -m spacy download en_core_web_sm

In [3]:
#load the English language model 
nlp = spacy.load('en_core_web_sm')

#### Stop words

Most languages contain 'stop words': words that are used so frequently that they tell us little when we count how often words appear. spaCy has nifty functions that allow us to designate stop words for our analysis.

For this purpose, we use a stop word list from [here](https://gist.github.com/sebleier/554280).

First we need to open the text file and then turn it into a list of words:

In [4]:

with open("../data/stopwords.txt", "r") as file:
    stop_words = file.read().split("\n")

print(
    len(stop_words), 
    stop_words)

430 ['a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'f', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'f
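
The printed list contains at least one repeated entry (`'down'` appears twice), and splitting on `"\n"` can leave an empty string behind if the file ends with a newline. Below is a minimal sketch of tidying such a list. It uses a small made-up sample string instead of the real file, and the helper name `clean_stopwords` is an assumption for illustration only.

```python
def clean_stopwords(raw_text):
    """Return unique, non-empty stop words in first-seen order."""
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        word = line.strip()
        # skip blank lines and words we have already kept
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


# hypothetical sample standing in for the contents of stopwords.txt
sample = "a\nabout\ndown\ndown\n\nduring\n"
stop_words_clean = clean_stopwords(sample)
print(len(stop_words_clean), stop_words_clean)
```

Duplicates are harmless when we only flag words as stop words, but a deduplicated list gives an accurate count.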

Next we flag each of these words as a stop word in the vocabulary of spaCy's model:

In [5]:

# loop through each word in the stop_words list
for stopword in stop_words:
    # look up the word's lexeme, its context-independent entry in spaCy's vocabulary
    lexeme = nlp.vocab[stopword]
    # flag the lexeme as a stop word, so tokens with exactly this text report is_stop == True
    lexeme.is_stop = True


## Loading your text and making it a corpus

#### First we need to load the text

In [6]:
#load a spreadsheet with the text you want to analyze
tiktok_influencer =  pd.read_csv("../data/transcripts_demure.csv")

print(len(tiktok_influencer))
tiktok_influencer.head()

7


Unnamed: 0,file_name,transcript
0,../data/@joolieannie_7404929915893681451.mp4,"Ladies, let's be mindful when we use our phon..."
1,../data/@joolieannie_1724362477829.mp4,give me one that's like the size of like a fi...
2,../data/@joolieannie_1723610748575.mp4,You see how I do my makeup for work? Very dem...
3,../data/@joolieannie_1724324953244.mp4,Divas I'm in Los Angeles and Zillow needs my ...
4,../data/@joolieannie_1724362572281.mp4,"Hi, Tvaz. Okay, so I've been going to the sam..."


The next lines take all content from the `transcript` column, turn it into a list and then join it all with a space between each text. This creates one large corpus:

In [7]:
transcript_list = tiktok_influencer["transcript"].tolist()
 
text = ' '.join(str(x) for x in transcript_list)
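
One thing to watch for in the join above: `str(x)` turns a missing transcript (which pandas reads in as `NaN`) into the literal text `nan`, which would then end up in the corpus as a word. The following is a small sketch of skipping non-string entries before joining. The sample list is made up and the `_demo` names are placeholders, not part of the notebook.

```python
import math

# hypothetical stand-in for tiktok_influencer["transcript"].tolist()
transcript_list_demo = ["Very demure.", math.nan, "Very mindful."]

# keep only real strings so a missing value does not become the word "nan"
text_demo = " ".join(x.strip() for x in transcript_list_demo if isinstance(x, str))
print(text_demo)
```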

In [8]:
len(text)

4104

In [9]:
doc = nlp(text)


In [10]:
len(doc)

1015

## Data cleaning
The next few cells normalize the text: they turn words into lemmas and remove stop words, punctuation and numbers.

In [11]:
# here's a demo of us cycling through the tokens in the doc and printing each one's lemma
for word in doc:
    print(f"the lemma for the word {word} is {word.lemma_}")

the lemma for the word   is  
the lemma for the word Ladies is Ladies
the lemma for the word , is ,
the lemma for the word let is let
the lemma for the word 's is us
the lemma for the word be is be
the lemma for the word mindful is mindful
the lemma for the word when is when
the lemma for the word we is we
the lemma for the word use is use
the lemma for the word our is our
the lemma for the word phones is phone
the lemma for the word . is .
the lemma for the word You is you
the lemma for the word know is know
the lemma for the word me is I
the lemma for the word , is ,
the lemma for the word I is I
the lemma for the word keep is keep
the lemma for the word it is it
the lemma for the word very is very
the lemma for the word cutesy is cutesy
the lemma for the word , is ,
the lemma for the word very is very
the lemma for the word demure is demure
the lemma for the word . is .
the lemma for the word I is I
the lemma for the word reply is reply
the lemma for the word to is to
the lemma for 

In [12]:
# keep only the meaningful words and store their lemmas

# texts will hold the documents that we will use for our topic model
texts = []
# article is a temporary list that holds the lemmas of the current document
article = []

for word in doc:
    # skip newlines, stop words, punctuation and number-like tokens
    if word.text != '\n' and not word.is_stop and not word.is_punct and not word.like_num:
        article.append(word.lemma_)
        # note: the list is stored and reset after every kept word,
        # so each "document" in texts contains exactly one lemma
        texts.append(article)
        article = []
        
print(texts[1], len(texts))

['Ladies'] 279
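
As the output shows, every entry in `texts` is a one-word list, so the topic models further down see 279 single-word documents. A possible alternative is to keep one list of lemmas per transcript, so words that occur together stay together. The sketch below illustrates that idea under stated assumptions: `keep`, `lemmas_per_document` and `fake_token` are hypothetical helpers, and the stand-in tokens only mimic the spaCy token attributes used here (`lemma_`, `is_stop`, `is_punct`, `like_num`, `is_space`) so the block runs without a model.

```python
from types import SimpleNamespace


def keep(token):
    """Same filter idea as the cell above, plus a whitespace check."""
    return not (token.is_stop or token.is_punct or token.like_num or token.is_space)


def lemmas_per_document(docs):
    """Build one list of lemmas per document instead of one list per word."""
    return [[tok.lemma_ for tok in doc if keep(tok)] for doc in docs]


def fake_token(lemma, **flags):
    """Hypothetical stand-in for a spaCy token, so the sketch runs without a model."""
    base = dict(is_stop=False, is_punct=False, like_num=False, is_space=False)
    base.update(flags)
    return SimpleNamespace(lemma_=lemma, **base)


docs_demo = [
    [fake_token("very", is_stop=True), fake_token("demure"), fake_token(".", is_punct=True)],
    [fake_token(" ", is_space=True), fake_token("mindful"), fake_token("phone")],
]
texts_demo = lemmas_per_document(docs_demo)
print(texts_demo)
```

With the real pipeline, something like `lemmas_per_document(nlp.pipe(transcript_list))` should give one list per transcript. Note that the word count section at the end of this notebook relies on the one-word-per-entry shape, so it would need to be adapted if you switch.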


In [13]:
texts

[[' '],
 ['Ladies'],
 ['mindful'],
 ['phone'],
 ['cutesy'],
 ['demure'],
 ['reply'],
 ['text'],
 ['email'],
 ['picky'],
 ['picky'],
 ['flicky'],
 ['flicky'],
 ['flicky'],
 ['Verizon'],
 ['Verizon'],
 ['trade'],
 ['musty'],
 ['diva'],
 ['demure'],
 ['diva'],
 ['Verizon'],
 ['trade'],
 ['crusty'],
 ['phone'],
 ['crunchy'],
 ['phone'],
 ['crack'],
 ['screen'],
 ['type'],
 ['phone'],
 ['get'],
 ['cut'],
 ['finger'],
 ['charge'],
 ['phone'],
 ['crazy'],
 ['thank'],
 ['bag'],
 ['bag'],
 ['cute'],
 ['respectful'],
 ['staff'],
 ['crazy'],
 ['walk'],
 ['nice'],
 ['phone'],
 ['partner'],
 ['Verizon'],
 ['elegant'],
 ['cutesy'],
 ['classy'],
 ['red'],
 ['hot'],
 ['pink'],
 ['crazy'],
 ['demure'],
 [' '],
 ['size'],
 ['calibrate'],
 ['bucket'],
 [' '],
 ['makeup'],
 ['demure'],
 ['mindful'],
 ['green'],
 ['cut'],
 ['crease'],
 ['look'],
 ['clown'],
 ['mindful'],
 ['look'],
 ['presentable'],
 ['interview'],
 ['job'],
 ['lot'],
 ['girl'],
 ['interview'],
 ['look'],
 ['Marge'],
 ['Simpson'],
 ['job']

In the next lines we turn these cleaned texts into a bag-of-words format.


In [14]:
# Dictionary() maps each unique word to a unique integer ID
dictionary = Dictionary(texts)
# convert each document (a list of words) into bag-of-words format: a list of (word ID, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus[1])

[(1, 1)]
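
To make the bag-of-words idea concrete, here is a plain-Python sketch of the same two steps: give every distinct word an integer ID, then describe each document as `(word ID, count)` pairs. This is an illustration only, not gensim's implementation, and the IDs gensim assigns may differ. The helper names and sample documents are made up.

```python
from collections import Counter


def build_vocab(documents):
    """Assign each distinct word an integer ID in first-seen order."""
    vocab = {}
    for document in documents:
        for word in document:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_bow(document, vocab):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(vocab[word] for word in document if word in vocab)
    return sorted(counts.items())


# hypothetical documents with more than one word each
bow_docs = [["phone", "demure", "phone"], ["mindful", "demure"]]
vocab_demo = build_vocab(bow_docs)
bows_demo = [to_bow(doc, vocab_demo) for doc in bow_docs]
print(vocab_demo)
print(bows_demo)
```

In this sample the first document becomes `[(0, 2), (1, 1)]`: word 0 (`phone`) appears twice and word 1 (`demure`) once. In the notebook's own corpus every document holds a single word, which is why `corpus[1]` is just `[(1, 1)]`.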


## Different Kinds of Topic Modeling 
Topic modeling refers to representing text documents in terms of a small number of topics, each characterized by words that tend to occur together. Gensim is one of the most popular libraries for this kind of modeling.

#### LSI — Latent Semantic Indexing
One of the methods available in gensim is called LSI, which stands for Latent Semantic Indexing. LSI aims to find hidden (latent) relationships between words and concepts in a collection of documents. The assumption here is that words used in similar contexts tend to have similar meanings.

In [15]:
lsi_model = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
lsi_model.show_topics(num_topics=15)

[(0,
  '1.000*"demure" + -0.000*"Um" + 0.000*"thank" + -0.000*"chocho" + 0.000*"hide" + 0.000*"episode" + -0.000*"clown" + -0.000*"reply" + -0.000*"mix" + -0.000*"text"'),
 (1,
  '1.000*"mindful" + 0.001*"color" + -0.001*"size" + -0.001*"staff" + 0.001*"omg" + -0.001*"OMG" + 0.001*"reply" + 0.001*"Los" + 0.001*"Friday" + -0.001*"reality"'),
 (2,
  '-0.687*"look" + 0.546*"watch" + -0.386*"house" + 0.285*" " + 0.001*"partner" + 0.001*"snack" + 0.001*"water" + 0.001*"Los" + 0.001*"come" + -0.001*"cheaty"'),
 (3,
  '-0.650*"look" + 0.612*"house" + -0.419*" " + -0.167*"watch" + -0.001*"hygienic" + 0.001*"Betty" + 0.001*"Selma" + 0.001*"obviously" + 0.001*"crusty" + -0.001*"little"'),
 (4,
  '-0.676*"house" + -0.656*" " + -0.306*"watch" + -0.135*"look" + 0.002*"respectful" + -0.002*"reality" + 0.002*"free" + 0.002*"Hi" + 0.002*"bed" + 0.002*"Zillow"'),
 (5,
  '0.762*"watch" + -0.559*" " + 0.296*"look" + 0.139*"house" + 0.002*"musty" + -0.002*"give" + -0.002*"mean" + 0.002*"Marge" + -0.002*"c

#### HDP — Hierarchical Dirichlet Process
HDP, the Hierarchical Dirichlet Process, is an unsupervised topic model that infers the number of topics on its own. HDP assumes that documents are mixtures of topics, and topics are mixtures of words, but it doesn't fix the number of topics in advance.

In [16]:
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
hdp_model.show_topics()[:5]

[(0,
  '0.039*green + 0.035*Patty + 0.025*go + 0.024*bring + 0.024*Verizon + 0.023*quality + 0.022*hygienic + 0.022*phone + 0.021*respectful + 0.021*get + 0.018*Wavuzva + 0.018*queen + 0.018*obviously + 0.017*Hotel + 0.017*sweet + 0.017*classy + 0.016*hide + 0.014*guidance + 0.013*tree + 0.013*post'),
 (1,
  '0.043*okay + 0.038*forever + 0.026*check + 0.023*Tvaz + 0.020*little + 0.019*help + 0.018*Los + 0.017*Simpson + 0.017*Hotel + 0.016*shirt + 0.015*yeah + 0.015*Selma + 0.015*partner + 0.014*salon + 0.014*email + 0.014*tile + 0.013*nail + 0.013*omg + 0.013*world + 0.013*cheesy'),
 (2,
  '0.028*guidance + 0.028*hire + 0.027*size + 0.027*Tvaz + 0.025*Simpson + 0.022*favorite + 0.020*crusty + 0.019*invest + 0.018*Hollywood + 0.018*red + 0.017*check + 0.016*Marge + 0.015*demure + 0.015*lie + 0.015*bed + 0.015*wild + 0.014*flicky + 0.013*bang + 0.013*omg + 0.013*staff'),
 (3,
  '0.035*color + 0.027*simple + 0.024*charge + 0.024*Netflix + 0.023*go + 0.020*tomorrow + 0.019*Betty + 0.019*ti

#### LDA — Latent Dirichlet Allocation
LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modeling algorithm out there. Here we create a simple topic model with 5 topics. The LDA algorithm assumes that each document is a mixture of topics, and each topic is a mixture of words.

In [17]:
lda_model = LdaModel(corpus=corpus, num_topics=5, id2word=dictionary)
lda_model.show_topics()

[(0,
  '0.051*"Verizon" + 0.039*"diva" + 0.039*"tech" + 0.039*"real" + 0.039*"elegant" + 0.039*"beautiful" + 0.039*"seey" + 0.027*"simple" + 0.027*"nice" + 0.015*"nail"'),
 (1,
  '0.091*"house" + 0.041*"demure" + 0.041*"watch" + 0.040*"beddy" + 0.028*"diva" + 0.028*"nail" + 0.028*"sweet" + 0.028*"picky" + 0.028*"glibadi" + 0.015*"to"'),
 (2,
  '0.091*"demure" + 0.071*" " + 0.052*"cute" + 0.032*"ugly" + 0.022*"cut" + 0.022*"let" + 0.022*"cozy" + 0.022*"Willa" + 0.022*"walk" + 0.022*"Diva"'),
 (3,
  '0.052*"watch" + 0.040*"flicky" + 0.040*"cutesy" + 0.028*"nail" + 0.028*"real" + 0.028*"invest" + 0.027*"wild" + 0.016*"look" + 0.015*"girl" + 0.015*"crazy"'),
 (4,
  '0.100*"mindful" + 0.061*"look" + 0.061*"phone" + 0.041*"crazy" + 0.032*"morning" + 0.031*"girl" + 0.022*"bring" + 0.022*"go" + 0.022*"bag" + 0.022*"tag"')]
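
The assumption that "each document is a mixture of topics, and each topic is a mixture of words" can be read as a recipe for generating text. The sketch below walks through that recipe with made-up probabilities: for every word, pick a topic according to the document's topic mixture, then pick a word according to that topic's word mixture. It only illustrates the generative story and leaves out the Dirichlet priors; it is not the inference that gensim's `LdaModel` performs, which works in the opposite direction, from observed words back to topics.

```python
import random

# made-up numbers for illustration; a fitted model would estimate these
topic_words = {
    "phones": {"phone": 0.5, "Verizon": 0.3, "screen": 0.2},
    "style": {"demure": 0.6, "mindful": 0.3, "cutesy": 0.1},
}
doc_topics = {"phones": 0.25, "style": 0.75}


def generate_document(n_words, doc_topics, topic_words, rng):
    """Draw each word by first picking a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        dist = topic_words[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
generated = generate_document(8, doc_topics, topic_words, rng)
print(generated)
```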

## Visualizing Topics with pyLDAvis
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

**Note: If you have issues running the `pyLDAvis.gensim_models.prepare()` function, you may need to walk yourself through [these fixes](https://docs.google.com/document/d/1XOz5fJdHR754SHkMIrqCxkgD2yGHhBlueK4EcbaNwII/edit?usp=sharing).**

In [18]:
#for visualizations
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()


In [19]:
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)


In [20]:
pyLDAvis.save_html(vis, "../output/topics_modeling_demure.html")

In [21]:
# vis

## Word count
For good measure, we can also count how often each cleaned word appears. This works because each entry in `texts` holds exactly one word.

In [22]:
words_influencer = pd.DataFrame(texts)
print(len(words_influencer))
words_influencer.columns = ["word"]
words_influencer.head()


279


Unnamed: 0,word
0,
1,Ladies
2,mindful
3,phone
4,cutesy


In [23]:
word_tally = words_influencer["word"].value_counts()
word_tally.head()

word
demure     12
mindful    10
            7
look        7
house       7
Name: count, dtype: int64
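
The tally above includes a whitespace-only entry with 7 occurrences, and the topic outputs earlier show case variants such as `Diva`/`diva` and `omg`/`OMG` being treated as different words. Here is a hedged sketch of a tally that drops whitespace entries and lowercases words before counting, using `collections.Counter` on a made-up sample shaped like `texts`. Lowercasing also changes proper nouns such as `Verizon`, so whether it is desirable depends on the analysis.

```python
from collections import Counter

# hypothetical sample shaped like `texts` above: one single-word list per token
texts_sample = [[" "], ["Diva"], ["diva"], ["demure"], ["demure"], [" "], ["omg"], ["OMG"]]

# flatten, drop whitespace-only entries, and lowercase so case variants merge
words = [w.strip().lower() for doc in texts_sample for w in doc if w.strip()]
tally = Counter(words)
print(tally.most_common(3))
```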

In [24]:
word_tally.to_csv("../output/word_tally_demure.csv")