## Loading your text and making it a corpus

#### First we need to load the text
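The loading cell itself is not shown here. A minimal sketch, assuming the text lives in a single UTF-8 file; the helper name `load_text` and the path in the usage comment are placeholders, not names from the original notebook:

```python
from pathlib import Path


def load_text(path):
    """Read a UTF-8 text file and return its contents as one string."""
    return Path(path).read_text(encoding="utf-8")


# Usage (placeholder path):
# text = load_text("my_text.txt")
```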

#### Then we can load spaCy's English language trained pipeline

A trained pipeline in spaCy is a packaged set of processing components (such as a tokenizer, tagger, parser, and named entity recognizer) whose weights have already been fitted to annotated text, so it can annotate new text out of the box.
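A minimal sketch of loading the pipeline. The model name `en_core_web_sm` is an assumption (the notebook does not say which English model it used), and the fallback to a blank pipeline is only there so the sketch runs when no model is installed:

```python
import spacy

try:
    # Assumed model name; install with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback: tokenizer only, with no tagger, parser, or NER component
    nlp = spacy.blank("en")

doc = nlp("New York is a busy city.")
print(nlp.pipe_names)                  # components available in this pipeline
print([token.text for token in doc])   # the tokens spaCy produced
```

Calling `nlp` on a string returns a `Doc`, which is the object the later steps iterate over.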

#### Stop words

Most languages contain 'stop words': words that occur very frequently and are usually not informative when we count how often words are used. spaCy has nifty functions that let us designate stop words for our analysis.

For this analysis, we took the stop word list from the [stopwords-iso/stopwords-es](https://github.com/stopwords-iso/stopwords-es) repository.
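One way to register extra stop words is sketched below. The words in `custom_stop_words` are hypothetical; in practice they would be read from the list linked above. A blank English pipeline is used because stop word flags do not need a trained model:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline with the default English stop list

# Hypothetical extra stop words, standing in for a list loaded from a file
custom_stop_words = {"corpus", "chapter"}

for word in custom_stop_words:
    nlp.Defaults.stop_words.add(word)  # extend the language defaults
    nlp.vocab[word].is_stop = True     # flag the vocabulary entry itself

doc = nlp("The corpus has a gigantic chapter")
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

Setting the flag on the vocabulary entry is what `token.is_stop` reads, so both steps are shown together.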

## Computational Linguistics

#### POS-Tagging — (Part Of Speech)
spaCy has a nifty way to look at how each word is used in a sentence, known as its part of speech (POS). Traditional grammar names eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.
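A minimal sketch of reading POS tags from a `Doc`. spaCy exposes the coarse-grained Universal POS tag as `token.pos_`. To keep the example runnable without downloading a model, the tags below are set by hand; with a trained pipeline they would come from calling `nlp(text)`:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated stand-in for the output of a trained tagger
words = ["She", "quickly", "read", "the", "book"]
pos = ["PRON", "ADV", "VERB", "DET", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

pos_tags = [(token.text, token.pos_) for token in doc]
for text, tag in pos_tags:
    print(f"{text:<10}{tag}")
```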

#### NER-Tagging — (Named Entity Recognition)
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
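A minimal sketch of listing entities and their categories. The entity spans are assigned by hand here as an assumption-free stand-in; a trained NER component would fill `doc.ents` automatically:

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Apple", "opened", "an", "office", "in", "New", "York"])

# Hand-set entities standing in for the output of a trained NER component
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 5, 7, label="GPE")]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```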



Let's run a slightly different version of this code to see what role these things play:

You can render these roles visually, too:
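A sketch using spaCy's built-in `displacy` visualizer, again with hand-set entities so it runs without a trained model. Passing `jupyter=False` returns the markup as a string; inside a notebook you would normally leave that argument out so the result is displayed inline:

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Apple", "opened", "an", "office", "in", "New", "York"])
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 5, 7, label="GPE")]

# style="ent" highlights entity spans; style="dep" would draw the dependency tree
html = displacy.render(doc, style="ent", jupyter=False)
```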

#### Dependency Parsing
Dependency Parsing (DP) examines the relationships between the words of a sentence in order to determine its grammatical structure: each word is linked to the head word it depends on, and the link is labeled with the type of relation.
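A minimal sketch of inspecting dependencies through `token.dep_` and `token.head`. The heads and labels are written by hand for illustration; a trained parser would supply them:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["She", "read", "the", "book"]
heads = [1, 1, 3, 1]  # absolute index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# (word, relation, head word) for every token
arcs = [(token.text, token.dep_, token.head.text) for token in doc]
for arc in arcs:
    print(arc)
```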

## Data cleaning
The next few lines remove stop words and punctuation marks, and keep the lemmatized form of each remaining word.
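The cleaning cell is not shown here, so the following is only a sketch of that step. The helper name `clean_doc` is hypothetical, and the lemmas are set by hand so the example does not depend on a trained lemmatizer:

```python
import spacy
from spacy.tokens import Doc


def clean_doc(doc):
    """Return lowercased lemmas, skipping stop words, punctuation, and whitespace."""
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


nlp = spacy.blank("en")
words = ["The", "cats", "were", "running", "!"]
lemmas = ["the", "cat", "be", "run", "!"]  # hand-set stand-in for a lemmatizer
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

cleaned = clean_doc(doc)
print(cleaned)
```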

Sometimes Topic Modeling makes more sense when `New` and `York` are treated as the single token `New York`. We can do this by training a bigram model and modifying our corpus accordingly.

## Different Kinds of Topic Modeling 
Topic Modeling refers to the probabilistic modeling of text documents as mixtures of topics. Gensim is a widely used library for this kind of modeling, and we will use it for our Topic Modeling.

#### LSI — Latent Semantic Indexing
LSI stands for Latent Semantic Indexing. It is a popular information retrieval (IR) method that decomposes the term-document matrix with a singular value decomposition and keeps only the strongest dimensions, which act as the key topics.

#### HDP — Hierarchical Dirichlet Process
HDP, the Hierarchical Dirichlet Process, is an unsupervised Topic Model that infers the number of topics from the data.

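```python
from gensim.models import HdpModel

# corpus: bag-of-words documents; dictionary: the gensim Dictionary they were built from
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
hdp_model.show_topics()[:5]
```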

#### LDA — Latent Dirichlet Allocation
LDA, or Latent Dirichlet Allocation, is arguably the most famous Topic Modeling algorithm. Here we create a simple Topic Model with 10 topics.

## Visualizing Topics with pyLDAvis
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

## Word count