### Text cleaning and topic modeling

This notebook is an example of topic modeling adapted from [this writeup](https://medium.com/@sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06).

It performs the following tasks:

- the first part of the notebook loads texts from a spreadsheet and turns them into one large corpus
- then we walk through various ways in which we can analyze and clean our corpus using spaCy (this includes removing `stopwords`, the words used most often in a language, and lemmatizing our corpus)
- to better understand how a model works, this notebook also explores some functionalities of spaCy
- the last parts of this notebook then build a simple topic model from the cleaned language data

The libraries we will use are:
- `pandas`: for reading in and exporting spreadsheets
- `spacy`: a natural language processing library that contains various models trained on various languages
- `gensim`: a library for topic modeling, document indexing and similarity retrieval with large corpora. In this case we will use it for topic modeling, the process of clustering words that tend to be used together. The algorithms built into gensim that this notebook uses include [Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) and [Latent Semantic Analysis (LSA)](https://blog.marketmuse.com/glossary/latent-semantic-analysis-definition/), which gensim calls Latent Semantic Indexing (LSI). Some newer computers in my classroom had issues installing gensim, which were resolved in [this manner](https://stackoverflow.com/questions/72672196/error-pips-dependency-resolver-does-not-currently-take-into-account-all-the-pa).
- `pyLDAvis`: a library for visualizing your topic clusters.

Topic modeling is a form of unsupervised machine learning and can be really helpful in discovering topics in a large amount of text, especially if you're uncertain which topics might be buried in thousands or millions of documents. 

In [1]:
import os 
import pandas as pd

# for comprehension of language
import spacy 
from spacy import displacy

# for topic modeling
import gensim 
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel, LsiModel, HdpModel


### Load spaCy's English language trained pipeline

A trained pipeline in spaCy is a packaged set of processing components (such as a tagger, parser, lemmatizer and named entity recognizer) that has already been trained on annotated text, so we only need to download and load it rather than train anything ourselves.

You will need to download one of spaCy's models and can do so by typing this into a cell here:
```
!python3 -m spacy download en_core_web_sm
```

In [2]:
#!python3 -m spacy download en_core_web_sm

In [3]:
#load the English language model 
nlp = spacy.load('en_core_web_sm')

#### Stop words

Most languages contain 'stop words', words that are used very frequently and may not be useful when we're evaluating how often certain words are used. spaCy has nifty functions that allow us to designate stop words for our analysis.

For this purpose, we took a stop word list from [here](https://gist.github.com/sebleier/554280).

First we need to open the text file and then turn it into a list of words:

In [4]:

with open("../data/stopwords.txt", "r") as file:
    stop_words = file.read().split("\n")

print(
    len(stop_words), 
    stop_words)

430 ['a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'f', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'f
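
The printed list above contains at least one repeated entry (`'down'` appears twice), and splitting on `"\n"` can also leave an empty string if the file ends with a newline. A minimal sketch of a slightly more defensive read, assuming a one-word-per-line file; the temporary file and the name `stop_words_clean` are made up for illustration:

```python
import os
import tempfile

# Write a tiny stand-in stop word file (hypothetical content) so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write("a\nabout\ndown\ndown\n\n")

    # Read one word per line and strip surrounding whitespace.
    with open(path, "r", encoding="utf-8") as file:
        raw_words = [line.strip() for line in file]

# Drop empty strings and duplicates while keeping the original order.
stop_words_clean = list(dict.fromkeys(word for word in raw_words if word))

print(len(stop_words_clean), stop_words_clean)
```

Duplicates are harmless for the spaCy loop that follows, but removing them makes the printed count reflect the number of distinct stop words.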

Next we use spaCy's model and define stopwords:

In [5]:

# Loop over each word in the stop_words list.
for stopword in stop_words:
    # Look up the word's lexeme, its context-independent entry in spaCy's vocabulary.
    lexeme = nlp.vocab[stopword]
    # Flag that entry as a stop word, so every word from our list is treated as a stop word by spaCy.
    lexeme.is_stop = True
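
The transcripts loaded later in this notebook look Spanish, while the stop word list and the `en_core_web_sm` model are English. That may be why words such as `de`, `la` and `que` survive the cleaning step. As a hedged sketch, spaCy ships a Spanish stop word set that could be checked against those words; the list `survivors` is a hand-picked sample taken from the outputs further down, not something the notebook computes:

```python
from spacy.lang.es.stop_words import STOP_WORDS as SPANISH_STOP_WORDS

# A hand-picked sample of words that appear in the cleaned output below.
survivors = ["de", "la", "que", "cintura", "boteo"]

# Split the sample into words spaCy's Spanish list would flag and words it would keep.
flagged = [word for word in survivors if word in SPANISH_STOP_WORDS]
kept = [word for word in survivors if word not in SPANISH_STOP_WORDS]

print("flagged:", flagged)
print("kept:", kept)
```

If the texts really are Spanish, one option would be to feed `SPANISH_STOP_WORDS` through the same `nlp.vocab[...]` loop, or to load a Spanish pipeline such as `es_core_news_sm` instead; either choice is an assumption about the data rather than part of the original notebook.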


## Loading your text and making it a corpus

#### First we need to load the text

In [6]:
#load a spreadsheet with the text you want to analyze
tiktok_influencer =  pd.read_csv("../output/transcripts.csv")

print(len(tiktok_influencer))
tiktok_influencer.head()

6


Unnamed: 0,file_name,transcript
0,../output/demure/Replying to @DanaLynn #fyp .mp4,"Ok, vamos a tener un honesto sobre estos híga..."
1,../output/demure/Sweatproof glam #fyp #demure ...,Es sword RW Fan. Buenas noches. El Ice Swap t...
2,../output/demure/#fyp #demure .mp4,¿Qué es lo que hace para trabajar? Muy dimulo...
3,../output/demure/Replying to @Bri G. #fyp @Ken...,"ANS FLES magia, porque yo también estaba pensa..."
4,../output/demure/#fyp.mp4,¿Dye aunque te decimos a primaria? Me pusebet...


The next lines take all content from the `transcript` column, turn it into a list and then join it all with a space between each text. This creates one large corpus:

In [7]:
transcript_list = tiktok_influencer["transcript"].tolist()
 
text = ' '.join(str(x) for x in transcript_list)
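
One thing to watch with `str(x)`: if a row has no transcript, pandas stores it as `NaN`, and the join would insert the literal string `nan` into the corpus. A small sketch of a join that skips missing rows, using a made-up three-row DataFrame in place of `transcripts.csv`:

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the middle row has no transcript.
demo_df = pd.DataFrame(
    {"transcript": ["Hola a todos.", float("nan"), "  Buenas noches.  "]}
)

# Drop missing rows, make sure every value is a string, and trim stray whitespace.
clean_transcripts = demo_df["transcript"].dropna().astype(str).str.strip()

# Join the remaining transcripts with a single space between them.
joined_text = " ".join(clean_transcripts)

print(joined_text)
```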

In [8]:
len(text)

2900

In [9]:
doc = nlp(text)


In [10]:
len(doc)

721

## Data cleaning
The next few lines 'normalize' the text: they turn words into lemmas, get rid of stopwords and punctuation marks, and collect the lemmatized words.

In [11]:
# here's a demo of us cycling through the tokens in the doc and printing each lemma
for word in doc:
    print(f"the lemma for the word {word} is {word.lemma_}")

the lemma for the word   is  
the lemma for the word Ok is Ok
the lemma for the word , is ,
the lemma for the word vamos is vamos
the lemma for the word a is a
the lemma for the word tener is tener
the lemma for the word un is un
the lemma for the word honesto is honesto
the lemma for the word sobre is sobre
the lemma for the word estos is estos
the lemma for the word hígamos is hígamos
the lemma for the word . is .
the lemma for the word Yo is Yo
the lemma for the word siempre is siempre
the lemma for the word he is he
the lemma for the word habido is habido
the lemma for the word la is la
the lemma for the word cintura is cintura
the lemma for the word de is de
the lemma for the word boteo is boteo
the lemma for the word . is .
the lemma for the word Por is Por
the lemma for the word 10 is 10
the lemma for the word años is año
the lemma for the word , is ,
the lemma for the word exclusivamente is exclusivamente
the lemma for the word , is ,
the lemma for the word no is no
the lemma f

In [12]:
# Build the list of cleaned tokens for the topic model.

# `texts` will hold the documents for our topic model; each document is a list of lemmas
texts = []
# `article` is a temporary list that stores the lemma version of the current word
article = []

for word in doc:
    # keep a token only if it is not a newline, stop word, punctuation mark or number
    if word.text != '\n' and not word.is_stop and not word.is_punct and not word.like_num:
        article.append(word.lemma_)
        # note: appending here makes every kept token its own one-word document
        texts.append(article)
        article = []
        
print(texts[1], len(texts))

['Ok'] 528
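
Because the loop above appends after every kept token, `texts` ends up as 528 one-word documents. Topic models look at which words occur together within a document, so a common alternative is one token list per transcript. The following is only a sketch of that idea: it uses a blank Spanish tokenizer (`spacy.blank("es")`, which needs no download but also has no lemmatizer, so lowercased text stands in for lemmas), and the two sentences in `demo_transcripts` are invented:

```python
import spacy

# Invented stand-ins for rows of the `transcript` column.
demo_transcripts = [
    "El color de la cintura es importante.",
    "Voy a trabajar en casa con los materiales.",
]

# A blank pipeline only tokenizes; stop word flags come from the language defaults.
nlp_blank = spacy.blank("es")

texts_by_doc = []
for transcript_doc in nlp_blank.pipe(demo_transcripts):
    # Same filters as the notebook's loop, applied per transcript instead of per token.
    tokens = [
        token.lower_
        for token in transcript_doc
        if not token.is_stop
        and not token.is_punct
        and not token.like_num
        and not token.is_space
    ]
    texts_by_doc.append(tokens)

print(len(texts_by_doc), texts_by_doc)
```

With only six transcripts the resulting corpus would still be very small, so the topics should be read as a demonstration rather than a finding.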


In [13]:
texts

[[' '],
 ['Ok'],
 ['vamos'],
 ['tener'],
 ['un'],
 ['honesto'],
 ['sobre'],
 ['estos'],
 ['hígamos'],
 ['Yo'],
 ['siempre'],
 ['habido'],
 ['la'],
 ['cintura'],
 ['de'],
 ['boteo'],
 ['Por'],
 ['año'],
 ['exclusivamente'],
 ['en'],
 ['realidad'],
 ['La'],
 ['más'],
 ['importante'],
 ['es'],
 ['que'],
 ['el'],
 ['color'],
 ['de'],
 ['la'],
 ['cintura'],
 ['es'],
 ['buitufla'],
 ['Si'],
 ['usted'],
 ['sabe'],
 ['como'],
 ['el'],
 ['color'],
 ['de'],
 ['la'],
 ['cintura'],
 ['el'],
 ['color'],
 ['de'],
 ['la'],
 ['cintura'],
 ['Baby'],
 ['la'],
 ['cintura'],
 ['de'],
 ['boteo'],
 ['es'],
 ['la'],
 ['forma'],
 ['de'],
 ['ir'],
 ['Ahora'],
 ['la'],
 ['cintura'],
 ['de'],
 ['la'],
 ['cintura'],
 ['es'],
 ['para'],
 ['mi'],
 ['cintura'],
 ['para'],
 ['cómo'],
 ['es'],
 ['la'],
 ['cintura'],
 ['cómo'],
 ['es'],
 ['rica'],
 ['Obviamente'],
 ['cómo'],
 ['hay'],
 ['que'],
 ['tener'],
 ['que'],
 ['tener'],
 ['que'],
 ['tener'],
 ['que'],
 ['tener'],
 ['en'],
 ['esa'],
 ['cintura'],
 ['Estamos'],
 

In the next lines we turn these cleaned texts into a bag-of-words format.


In [14]:
# Dictionary() maps each unique word to a unique integer ID
dictionary = Dictionary(texts)
# build the corpus by converting every document (a list of words) into
# bag-of-words format: a list of (word ID, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus[1])

[(1, 1)]


## Different Kinds of Topic Modeling 
Topic Modeling refers to the probabilistic modeling of text documents as topics. Gensim is one of the most popular libraries to perform such modeling.

#### LSI — Latent Semantic Indexing
One of the methods available in gensim is called LSI, which stands for Latent Semantic Indexing (also known as Latent Semantic Analysis, LSA). LSI aims to find hidden (latent) relationships between words and concepts in a collection of documents. The assumption here is that words used in similar contexts tend to have similar meanings.

In [15]:
lsi_model = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
lsi_model.show_topics(num_topics=15)

[(0,
  '1.000*"de" + 0.000*"buitufla" + -0.000*"oh" + 0.000*"te" + -0.000*"noche" + 0.000*"ya" + 0.000*"uno" + -0.000*"Ahora" + -0.000*"internet" + 0.000*"exclusivamente"'),
 (1,
  '-1.000*"cintura" + 0.000*"honesto" + 0.000*"exclusivamente" + 0.000*"tanto" + 0.000*"usted" + -0.000*"familia" + 0.000*"smoky" + 0.000*"estoy" + 0.000*"FLES" + 0.000*"todo"'),
 (2,
  '1.000*"es" + -0.000*"puede" + -0.000*"cuando" + 0.000*"buitufla" + 0.000*"LOS" + 0.000*"loMONY" + 0.000*"vas" + -0.000*"Que" + -0.000*"usted" + 0.000*"Kurdistica"'),
 (3,
  '-1.000*"la" + 0.000*"casi" + 0.000*"va" + 0.000*"pusebetween" + 0.000*"EN" + -0.000*"haber" + 0.000*"gran" + 0.000*"exclusivamente" + -0.000*"sword" + -0.000*"Korea"'),
 (4,
  '-1.000*"boteo" + -0.000*"buena" + -0.000*"vas" + -0.000*"ir" + -0.000*"estaba" + -0.000*"exclusivamente" + 0.000*"rica" + -0.000*"ASI" + -0.000*"Dye" + -0.000*"Una"'),
 (5,
  '-1.000*"que" + 0.000*"y" + 0.000*"buena" + 0.000*"poco" + 0.000*"Simpson" + -0.000*"vamos" + -0.000*"habido
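
In the output above, each topic puts essentially all of its weight on a single word. One plausible reading is that this follows from the one-word documents: LSI is based on a singular value decomposition of the term-document matrix, and when every document contains exactly one word, each "topic" is just one word, ordered by how often it occurs. The sketch below illustrates this with plain NumPy rather than gensim's own implementation, on invented data:

```python
from collections import Counter

import numpy as np

# Invented one-word documents: "de" occurs 3 times, "cintura" 2 times, "es" once.
one_word_docs = [["de"], ["cintura"], ["de"], ["es"], ["de"], ["cintura"]]

vocab = sorted({word for doc in one_word_docs for word in doc})
row_of = {word: row for row, word in enumerate(vocab)}

# Term-document matrix: rows are words, columns are documents.
term_doc = np.zeros((len(vocab), len(one_word_docs)))
for col, doc in enumerate(one_word_docs):
    for word in doc:
        term_doc[row_of[word], col] += 1

# Singular value decomposition, the core operation behind LSI/LSA.
U, singular_values, _ = np.linalg.svd(term_doc, full_matrices=False)

# The word with the largest absolute weight in each left singular vector.
top_terms = [vocab[int(np.argmax(np.abs(U[:, k])))] for k in range(len(singular_values))]

# For one-word documents the singular values equal the square roots of the word counts.
counts = Counter(word for doc in one_word_docs for word in doc)
expected = sorted((count ** 0.5 for count in counts.values()), reverse=True)

print(top_terms)
print(singular_values, expected)
```

This matches the pattern in the notebook, where the first LSI topics (`de`, `cintura`, `es`, `la`, `boteo`) follow the same order as the word tally at the end.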

#### HDP — Hierarchical Dirichlet Process
HDP, the Hierarchical Dirichlet Process, is an unsupervised topic model that figures out the number of topics on its own. HDP assumes that documents are mixtures of topics, and topics are mixtures of words, but it doesn't fix the number of topics in advance.

In [16]:
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
hdp_model.show_topics()[:5]

[(0,
  '0.028*materiales + 0.027*nueva + 0.022*estaba + 0.020*él + 0.020*buitufla + 0.018*Swap + 0.018*uno + 0.017*sword + 0.016*Fuéno + 0.015*estar + 0.015*si + 0.014*va + 0.012*Pero + 0.012*clon + 0.011*ser + 0.011*mirar + 0.011*V + 0.010*usted + 0.010*luego + 0.010*undeljase'),
 (1,
  '0.033*Baby + 0.023*Ok + 0.021*$ + 0.020*comiendo + 0.019*convince + 0.018*el + 0.018*V + 0.018*casi + 0.017*en + 0.015*buitufla + 0.014*försito + 0.014*muy + 0.014*hacer + 0.014*Ice + 0.013*haber + 0.013*usted + 0.013*tiene + 0.012*cuando + 0.012*voy + 0.012*trabajar'),
 (2,
  '0.026*holaen + 0.022*es + 0.021*forma + 0.020*que + 0.018*debía<|cy| + 0.018*importante + 0.017*Así + 0.017*en + 0.016*la + 0.016*estaba + 0.015*si + 0.014*realidad + 0.013*también + 0.013*decimos + 0.012*mariamos + 0.011*familia + 0.011*ven + 0.011*smoky + 0.011*y + 0.010*resolve'),
 (3,
  '0.032*Es + 0.026*mí + 0.025*siempre + 0.023*materiales + 0.021*trabajar + 0.021*videos + 0.018*tanto + 0.017*le + 0.016*pensando + 0.015*r

#### LDA — Latent Dirichlet Allocation
LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modeling algorithm out there. Here we create a simple topic model with 5 topics. The LDA algorithm assumes that each document is a mixture of topics, and each topic is a mixture of words.

In [17]:
lda_model = LdaModel(corpus=corpus, num_topics=5, id2word=dictionary)
lda_model.show_topics()

[(0,
  '0.048*"tener" + 0.048*"voy" + 0.032*"muy" + 0.032*"Qué" + 0.032*"se" + 0.025*"una" + 0.025*"puse" + 0.025*"forma" + 0.025*"coche" + 0.018*"que"'),
 (1,
  '0.159*"la" + 0.096*"boteo" + 0.039*"para" + 0.039*"el" + 0.027*"es" + 0.027*"Ah" + 0.020*"un" + 0.020*"cómo" + 0.014*"$" + 0.014*"estar"'),
 (2,
  '0.163*"de" + 0.040*"un" + 0.034*"gusta" + 0.034*"como" + 0.034*"lo" + 0.028*"en" + 0.027*"La" + 0.021*"el" + 0.021*"parece" + 0.021*"con"'),
 (3,
  '0.173*"cintura" + 0.143*"es" + 0.026*"color" + 0.020*"boteo" + 0.020*"Y" + 0.020*"mucho" + 0.014*"trabajar" + 0.014*"ser" + 0.014*"le" + 0.014*"look"'),
 (4,
  '0.110*"que" + 0.086*"de" + 0.066*"en" + 0.046*"mi" + 0.033*" " + 0.027*"casa" + 0.021*"boteo" + 0.021*"los" + 0.014*"puedo" + 0.014*"si"')]

## Visualizing Topics with pyLDAvis
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

**Note: If you have issues running the `pyLDAvis.gensim_models.prepare()` function, you may need to walk yourself through [these fixes](https://docs.google.com/document/d/1XOz5fJdHR754SHkMIrqCxkgD2yGHhBlueK4EcbaNwII/edit?usp=sharing).**

In [18]:
#for visualizations
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()


In [19]:
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)


In [20]:
pyLDAvis.save_html(vis, "../output/topics_modeling_demure.html")

In [21]:
# vis

## Word count
For good measure, we can also use this space to count how often each cleaned word occurs.

In [22]:
words_influencer = pd.DataFrame(texts)
print(len(words_influencer))
words_influencer.columns = ["word"]
words_influencer.head()


528


Unnamed: 0,word
0,
1,Ok
2,vamos
3,tener
4,un


In [23]:
word_tally = words_influencer["word"].value_counts()
word_tally.head()

word
de         39
cintura    31
es         27
la         25
boteo      21
Name: count, dtype: int64
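
The cleaned tokens keep their original capitalization (the list above contains both `La` and `la`) and include at least one whitespace-only token, so the tally counts those separately. A minimal sketch of a case-insensitive count using only the standard library, with an invented `demo_texts` in the same list-of-lists shape as `texts`:

```python
from collections import Counter

# Invented tokens in the same shape as `texts`: one single-word list per token.
demo_texts = [[" "], ["La"], ["cintura"], ["la"], ["de"], ["la"]]

# Lowercase every token and skip tokens that are only whitespace.
case_insensitive_tally = Counter(
    token.lower()
    for document in demo_texts
    for token in document
    if token.strip()
)

print(case_insensitive_tally.most_common(3))
```

Whether to merge capitalized and lowercase forms depends on the analysis; for proper nouns such as names it may be better to keep them apart.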

In [24]:
word_tally.to_csv("../output/word_tally_demure.csv")