### Text cleaning and topic modeling

This notebook is an example of topic modeling adapted from [this writeup](https://medium.com/@sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06).

It performs the following tasks:

- the first part of the notebook loads texts from a spreadsheet and turns them into one large corpus
- then we walk through various ways to analyze and clean our corpus using spaCy (this includes removing `stopwords`, the words used most often in the English language, and lemmatizing our corpus)
- to better understand how a model works, this notebook also explores some functionalities of spaCy
- the last parts of this notebook then build a simple topic model from the cleaned language data

The libraries we will use are:
- `pandas`: for reading in and exporting spreadsheets
- `spacy`: a natural language processing library that provides trained models for various languages
- `gensim`: a library for topic modeling, document indexing and similarity retrieval with large corpora. In this case we will use it for topic modeling, the process of clustering words that tend to be used together. The algorithms built into gensim that this notebook uses are called [Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) and [Latent Semantic Analysis (LSA)](https://blog.marketmuse.com/glossary/latent-semantic-analysis-definition/), which gensim calls Latent Semantic Indexing (LSI).
- `pyLDAvis`: a library for visualizing your topic clusters.

Topic modeling is a form of unsupervised machine learning and can be really helpful in discovering topics in a large amount of text, especially if you're uncertain which topics might be buried in thousands or millions of documents. 

In [None]:


# for comprehension of language

# for topic modeling



### Load spaCy's English language trained pipeline

`A training pipeline typically reads training data from a feature store, performs model-dependent transformations, trains the model, and evaluates the model before the model is saved to a model registry.`

You will need to download one of spaCy's models and can do so by typing this into a cell here:
```
!python3 -m spacy download en_core_web_sm
```

In [None]:
# !python3 -m spacy download en_core_web_sm

In [None]:
#load the English language model 


#### Stop words

Many languages contain 'stop words', words that are used so frequently that they are not useful when we evaluate how often certain words are used. spaCy has nifty functions that allow us to designate stop words for our analysis.

For this purpose, we took a list of stopwords from [here](https://gist.github.com/sebleier/554280).

First we need to open the text file and then turn it into a list of words:
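As a rough sketch of that step: the snippet below reads a stop-word file into a list. The file's name and layout are not given in this notebook, so it assumes one word per line, and it uses an in-memory `io.StringIO` object as a stand-in for the real file.

```python
import io

# Stand-in for the downloaded stop-word file. The real file name and layout
# are not given here, so one word per line is an assumption.
sample_file = io.StringIO("the\nand\n\nOf\n  but \n")

# Strip whitespace, lowercase each word, and skip blank lines.
stop_words = [line.strip().lower() for line in sample_file if line.strip()]

print(stop_words)
```

With a real file, `open(path, encoding="utf-8")` would take the place of the `io.StringIO` object, where `path` is wherever you saved the list.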

Next we use spaCy's model and define stopwords:

In [None]:
# start a loop that iterates through each word in the stop_words list

    # retrieve the word's lexeme (its context-independent entry in spaCy's vocabulary)

    # then set `lexeme.is_stop = True` so that spaCy treats each word in the list as a stop word



## Loading your text and making it a corpus

#### First we need to load the text

In [None]:
#load a spreadsheet with the text you want to analyze


The next lines take all content from the `transcript` column, turn it into a list and then join it all with a space between each text. This creates one large corpus:

In [None]:
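To illustrate the list-then-join step without the original spreadsheet, here is a minimal sketch that uses only the standard library. The two sample rows are made up; only the `transcript` column name comes from this notebook. With pandas, the same idea would look like `" ".join(df["transcript"].tolist())`, assuming `df` is the loaded spreadsheet.

```python
import csv
import io

# Hypothetical stand-in for the spreadsheet: two rows with a `transcript` column.
sample_csv = io.StringIO(
    "id,transcript\n"
    "1,Hello and welcome to the show.\n"
    "2,Today we talk about topic models.\n"
)

# Collect every value in the `transcript` column into a list...
transcripts = [row["transcript"] for row in csv.DictReader(sample_csv)]

# ...then join the texts with a single space to form one large corpus string.
corpus_text = " ".join(transcripts)

print(corpus_text)
```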

## Data cleaning
The next few lines 'normalize' the text: they turn words into lemmas, remove stopwords and punctuation marks, and add the lemmatized words to a list.

In [None]:
# here's a demo of us cycling through the 


In [None]:
# We add some words to the stop word list

# let's create some empty lists
# texts will hold all the words that we will use for our topic model

# a temporary list that we will use to store the lemma version of each word
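
The cleaning logic can be sketched in plain Python to show what ends up in `texts`. This is a toy illustration, not the notebook's spaCy code: the stop-word set and the lemma lookup table below are hypothetical stand-ins for what spaCy exposes through token attributes such as `token.is_stop`, `token.is_punct` and `token.lemma_`.

```python
import string

# Toy stand-ins (both hypothetical): a tiny stop-word set and a lemma lookup.
STOP_WORDS = {"the", "are", "on", "a"}
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}


def clean(text):
    """Lowercase the text, drop punctuation and stop words, map words to lemmas."""
    cleaned = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if not word or word in STOP_WORDS:
            continue
        cleaned.append(LEMMAS.get(word, word))
    return cleaned


# texts holds one list of cleaned words per document.
texts = [clean("The cats are sitting on the mats!")]

print(texts)
```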




In the next lines we turn these cleaned texts into a bag-of-words format.


In [None]:
# Dictionary() to map each unique word to a unique integer ID


# this line creates a corpus and converts a single document (a list of words) into a bag-of-words format
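
To make the bag-of-words idea concrete, here is a small pure-Python sketch of the two steps described in the comments above. In gensim these steps correspond to `corpora.Dictionary(texts)` and `dictionary.doc2bow(text)`; the helper names below are made up for illustration, and gensim may assign different integer IDs than this sketch does.

```python
from collections import Counter

# Two tiny, already-cleaned documents (made-up sample data).
texts = [["cat", "sit", "mat"], ["cat", "chase", "cat"]]

# Map each unique word to a unique integer ID, in order of first appearance.
token2id = {}
for doc in texts:
    for word in doc:
        token2id.setdefault(word, len(token2id))


def to_bow(doc):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())


# The corpus is one bag-of-words per document.
corpus = [to_bow(doc) for doc in texts]

print(token2id)
print(corpus)
```

Each document becomes a list of `(word_id, count)` pairs, so word order is discarded and only frequencies remain.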




## Different Kinds of Topic Modeling 
Topic Modeling refers to the probabilistic modeling of text documents as topics. Gensim is one of the most popular libraries to perform such modeling.

#### LSI — Latent Semantic Indexing
One of the methods available in gensim is called LSI, which stands for Latent Semantic Indexing. LSI aims to find hidden (latent) relationships between words and concepts in a collection of documents. The assumption here is that words used in similar contexts tend to have similar meanings.

#### HDP — Hierarchical Dirichlet Process
HDP, the Hierarchical Dirichlet Process, is an unsupervised Topic Model which figures out the number of topics on its own. HDP assumes that documents are mixtures of topics, and topics are mixtures of words, but it doesn't limit the number of topics.

#### LDA — Latent Dirichlet Allocation
LDA, or Latent Dirichlet Allocation, is arguably the most famous Topic Modeling algorithm out there. Here we create a simple Topic Model with 5 topics. The LDA algorithm assumes that each document is a mixture of topics, and each topic is a mixture of words.
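
The "mixture" assumption can be illustrated with a few lines of arithmetic. The sketch below does not train a model; it uses made-up probabilities for two hypothetical topics to show how a document's word probabilities follow from its topic mixture.

```python
# Hypothetical numbers: two topics, each a probability distribution over words.
topic_word = {
    "pets": {"cat": 0.5, "dog": 0.4, "vote": 0.1},
    "politics": {"cat": 0.1, "dog": 0.1, "vote": 0.8},
}

# One document described as a mixture of those topics.
doc_topic = {"pets": 0.75, "politics": 0.25}

# P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)
word_probs = {
    word: sum(doc_topic[topic] * topic_word[topic][word] for topic in doc_topic)
    for word in ["cat", "dog", "vote"]
}

print(word_probs)
```

Fitting LDA runs this reasoning in reverse: given only the observed words, it estimates topic-word and document-topic distributions that could plausibly have produced them.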

## Visualizing Topics with pyLDAvis
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

**Note: If you have issues running the `pyLDAvis.gensim_models.prepare()` function, you may need to walk yourself through [these fixes](https://docs.google.com/document/d/1XOz5fJdHR754SHkMIrqCxkgD2yGHhBlueK4EcbaNwII/edit?usp=sharing).**

In [None]:
#for visualizations



In [None]:
# vis

## Word count
For good measure, we can also use this space to make a word count.
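
One possible way to do this is with `collections.Counter` from the standard library. The sketch assumes the cleaned words are stored as a list of word lists, one per document; the sample data is made up.

```python
from collections import Counter

# Made-up sample of cleaned documents, one list of words per document.
texts = [["cat", "sit", "mat"], ["cat", "chase", "cat"]]

# Flatten the documents and count how often each word occurs.
word_counts = Counter(word for doc in texts for word in doc)

# The most frequent words, as (word, count) pairs.
top_words = word_counts.most_common(2)

print(top_words)
```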