<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Topic Modeling - What ARE they talking about?

### Learning Objectives
- Demonstrate an understanding of clustering/topic approaches with NLP
- Understand approaches to persisting the insights from unsupervised approaches
- Discuss keyword generation and classification using machine learning

In [None]:
# from IPython.display import display, HTML
# display(HTML("<style>.container { width:80% !important; }</style>"))
# display(HTML("<style>.output_result { max-width:80% !important; }</style>"))

In [None]:
import pandas as pd
import numpy as np

import re
import random
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

import spacy
from spacy import displacy
from spacy.util import minibatch, compounding
from spacy.tokens import Doc
from spacy.training import Example

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('stopwords')

from gensim import matutils, corpora
from gensim.models import Word2Vec
from gensim.corpora import MmCorpus, dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

#import pyLDAvis.gensim_models

pd.options.display.float_format = '{:,.8f}'.format
%matplotlib inline

In [None]:
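# spaCy 2.x is not compatible with the 3.x API used here, so the code below requires spaCy 3.x
# Compare the major version as an integer; comparing version strings is unreliable (e.g. "10.0.0" < "3.0.0")
assert int(spacy.__version__.split(".")[0]) >= 3, "Ensure spacy version is 3.x.x"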

# Topic Modeling

Topic modeling is an interesting problem in NLP applications where we want to get an idea of what subjects (topics) we have in our dataset. A topic is nothing more than a collection of words that describe the overall theme. For example, in the case of news articles, we might think of topics such as politics, sports, etc.

Topic modeling won’t directly give you the names of the topics but rather a set of the most probable words that might describe a topic. It is up to us to determine what topic the set of words might refer to.

Below we see a recipe for various topics - **What do you think they represent?**

![TopicModeling](assets/TopicModeling.png)

#### All topic models are based on the same basic assumption:
- each document consists of a mixture of topics, and
- each topic consists of a collection of words.


In other words, topic models are built around the idea that the semantics of our document are actually being governed by some hidden, or “latent,” variables that we are not observing. As a result, the goal of topic modeling is to uncover these latent variables — topics — that shape the meaning of our document and corpus. 

For example, the word bank when used together with mortgage, loans, and rates probably means a financial institution. However, the word bank when used together with lures, casting, and fish probably means a stream or river bank.

If every word mapped to exactly one topic, this would not be an issue.

![MachineLearning](assets/word2concept.png)

Unfortunately, once context is involved, the mapping tends to be more confusing.

![MachineLearning](assets/concept2words.png)

The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. To do that we use topic models to build topic 'recipes'. 



### Topic Recipes

When the model builds out topics, it assigns a weight to each unique word that defines the topic. However, the model does not tell us the "topic" of conversation. It's up to the Data Scientist to piece it together with context clues. There are a few ways to do this:

1. Look at the words and their weights to see if they tell a story
2. Investigate the records by pulling multiple records per topic or the most dominant topics per record

Oftentimes you may need a subject matter expert (SME) to assist!


For example, say we have three topics: sports, business and science. We ask our model to find 3 topics (a number we have to tune) and it returns the recipes below, where each word and topic pair includes the probability **P(word|topic)**:

- Topic 1: [football: 0.3, basketball: 0.2, baseball: 0.2, touchdown: 0.02 ... genetics: 0.0001] 
- Topic 2: [genetics: 0.2, drug: 0.2, ... baseball: 0.0001]
- Topic 3: [stocks: 0.1, ipo: 0.08,  ... baseball: 0.0001]

**How would you label these topics?**
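One way to read a recipe is to sort its words by probability and look at the top few. The sketch below reuses the illustrative numbers from the list above; the `top_words` helper is a hypothetical name for this example, not part of any library.

```python
# Toy topic "recipes": P(word | topic), using the illustrative numbers above
topics = {
    "Topic 1": {"football": 0.3, "basketball": 0.2, "baseball": 0.2,
                "touchdown": 0.02, "genetics": 0.0001},
    "Topic 2": {"genetics": 0.2, "drug": 0.2, "baseball": 0.0001},
    "Topic 3": {"stocks": 0.1, "ipo": 0.08, "baseball": 0.0001},
}

def top_words(recipe, n=3):
    """Return the n most probable words of one topic (hypothetical helper)."""
    ranked = sorted(recipe.items(), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]

for name, recipe in topics.items():
    print(name, top_words(recipe))
```

The model only returns these ranked words; the label ("sports", "science", "business") is still something we assign.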

#### With that in mind we'll investigate a few methods of topic modeling 

### Today's Topic Modeling Approaches
- Latent Semantic Analysis
- Latent Dirichlet Allocation

### Others
- Non-negative Matrix Factorization (NMF)
- Probabilistic Latent Semantic Analysis
- lda2vec

[Read up on a few of these here](https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05)

### Latent Semantic Analysis

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word only meant one concept, and each concept was only described by one word, then LSA would be easy since there would be a simple mapping from words to concepts.

Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.



#### Step 1 - Generating a Document Term Matrix

Given m documents and n words in our vocabulary, we can construct a large m × n matrix. Each row represents a document (an individual record) and each column represents a unique word from the corpus. Each cell could simply hold a raw word count, but raw counts overlook how significant a word is to a document, so LSA models typically replace them with tf-idf scores.
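To make the matrix concrete, here is a minimal hand-rolled sketch of a tf-idf document-term matrix. It assumes the plain textbook weighting (term frequency times `ln(N / df)`), so the numbers will differ from scikit-learn's `TfidfVectorizer`, which by default smooths the idf and L2-normalizes each row. The toy documents and the `tfidf_matrix` helper are made up for illustration.

```python
import math
from collections import Counter

# Three tiny, already-tokenized documents (illustrative only)
docs = [
    ["good", "dog"],
    ["lazy", "cat"],
    ["good", "cat", "dog"],
]
vocab = sorted({w for d in docs for w in d})  # one column per unique word

def tfidf_matrix(docs, vocab):
    """Build an m x n matrix of tf * ln(N / df) weights (textbook variant)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[w] / len(d) * math.log(n_docs / df[w]) for w in vocab])
    return rows

dtm = tfidf_matrix(docs, vocab)
for doc, row in zip(docs, dtm):
    print(doc, [round(x, 3) for x in row])
```

In the second document, the rare word "lazy" ends up with a larger weight than the more common "cat", which is the effect raw counts would miss.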

#### Step 2 - Dimensionality reduction

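With every unique word captured as a column, the resulting matrix is very likely sparse and noisy. A traditional approach is singular value decomposition (SVD). This linear algebra technique factorizes any matrix into a product of 3 separate matrices: $A = U S V^T$, where $S$ is a diagonal matrix of singular values. Keeping only the largest singular values (truncated SVD) yields low-dimensional document vectors (from $U$) and term vectors (from $V$).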
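The factorization can be sketched directly with NumPy. This is a small illustration on a made-up 4 × 3 matrix, not the randomized solver used later in the notebook; the variable names (`doc_vectors`, `term_vectors`) are chosen for this example.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 3 terms (columns)
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values sorted largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                 # number of latent dimensions ("topics") to keep
doc_vectors = U[:, :k] * s[:k]        # one k-dimensional row per document
term_vectors = Vt[:k, :].T            # one k-dimensional row per term
A_k = doc_vectors @ term_vectors.T    # rank-k approximation of A

print(doc_vectors.round(3))
print(np.abs(A - A_k).max().round(3))  # largest reconstruction error at rank k
```

Dropping the smallest singular values discards the directions that explain the least variation, which is where much of the noise tends to live.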


#### Step 3 - Uncover Topics

With these document vectors and term vectors, we can now easily apply measures such as cosine similarity to evaluate the similarity of different documents, words or passages. 
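Cosine similarity itself is only a few lines. The sketch below is a plain-Python version applied to made-up 2-dimensional document vectors; in practice you would use a library routine on the real SVD output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic-space vectors for three documents
doc_dog = [0.9, 0.1]
doc_puppy = [0.8, 0.2]
doc_cat = [0.1, 0.9]

print(cosine_similarity(doc_dog, doc_puppy))  # close to 1: similar documents
print(cosine_similarity(doc_dog, doc_cat))    # much lower: different topics
```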


In [None]:
# For simplicity's sake, let's start with a small dataset

documents = ["Rover is a good dog", 
             "Cats are lazy", 
             "dogs are a mans best friend" , 
             "There is the cat in the hat", 
             "Cats are easy to care for", 
             "I only want black cats and dogs", 
             "Have you seen my dog?"] 

dog_cat=pd.DataFrame(documents, columns=['documents'])
print (dog_cat.shape)

In [None]:
dog_cat.head()

### Start with some preprocessing

To learn more about regex, visit https://regexr.com/
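Before applying the steps to a whole dataframe column, it can help to see them on a single string. The sketch below chains the same three steps used in the cells that follow (strip non-letters, drop short words, lowercase); `clean_text` is a hypothetical helper name.

```python
import re

def clean_text(text):
    """Apply the three preprocessing steps to one string."""
    text = re.sub(r"[^a-zA-Z#]", " ", text)          # anything that is not a letter or '#' becomes a space
    words = [w for w in text.split() if len(w) > 2]  # drop words with fewer than 3 letters
    return " ".join(words).lower()                   # lowercase everything

print(clean_text("Have you seen my dog?"))
```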

In [None]:
#remove special characters
dog_cat['clean_documents'] = dog_cat['documents'].str.replace("[^a-zA-Z#]", " ", regex=True)

In [None]:
dog_cat.head()

In [None]:
#remove words with fewer than 3 letters
dog_cat['clean_documents'] = dog_cat['clean_documents'].fillna('').apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

In [None]:
#lowercase all characters
dog_cat['clean_documents'] = dog_cat['clean_documents'].fillna('').apply(lambda x: x.lower())

In [None]:
dog_cat.head()

### Building the model

In [None]:
#Build the TF-IDF DTM
vectorizer = TfidfVectorizer(stop_words='english', 
                             use_idf=True, 
                             smooth_idf=True)

# SVD to reduce dimensionality: 
svd_model = TruncatedSVD(n_components=2,         
                         algorithm='randomized',
                         n_iter=20)

# Using a pipeline for fun to combine tf-idf + SVD fitting against corpus
svd_transformer = Pipeline([('tfidf', vectorizer), 
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(dog_cat['clean_documents'])

In [None]:
# Let's capture the output of our topic model
topic_encoded_df = pd.DataFrame(svd_matrix, columns = ["topic_1", "topic_2"])
topic_encoded_df["documents"] = dog_cat['clean_documents']

In [None]:
# Using the below - which topic is for cats and which one is for dogs?
topic_encoded_df

## Exercise - Repeat the above using the below documents


Items for consideration
- How many topics are you looking to find?
- How can you best label the topics?
- Will this model work well on a new record? When and why?

In [None]:
titles =[ "The Neatest Little Guide to Stock Market Investing", 
    "Investing For Dummies, 4th Edition", 
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns", 
    "The Little Book of Value Investing", 
    "Value Investing: From Graham to Buffett and Beyond", 
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!", 
    "Investing in Real Estate, 5th Edition", 
    "Stock Investing For Dummies", 
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"]

In [None]:
# Workspace
df = pd.DataFrame(titles, columns=['titles'])
print (df.shape)
#df.head()

df['clean_documents'] = df['titles'].str.replace("[^a-zA-Z#]", " ", regex=True)
df['clean_documents'] = df['clean_documents'].fillna('').apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
df['clean_documents'] = df['clean_documents'].fillna('').apply(lambda x: x.lower())

df.head()

In [None]:
#Build the TF-IDF DTM
vectorizer = TfidfVectorizer(stop_words='english', 
                             use_idf=True, 
                             smooth_idf=True)

# SVD to reduce dimensionality: 
svd_model = TruncatedSVD(n_components=2,         
                         algorithm='randomized',
                         n_iter=20)

# Using a pipeline for fun to combine tf-idf + SVD fitting against corpus
svd_transformer = Pipeline([('tfidf', vectorizer), 
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(df['clean_documents'])

In [None]:
# Let's capture the output of our topic model
topic_encoded_df = pd.DataFrame(svd_matrix, columns = ["topic_1", "topic_2"])
topic_encoded_df["documents"] = df['clean_documents']

In [None]:
topic_encoded_df

### LSA Overview:

Pros:
- Quick and Efficient method    
    
Cons:
- lack of interpretable topics means we have to assign context ourselves 
- the components may be arbitrarily positive/negative
- need for really large set of documents and vocabulary to get accurate results
- less efficient representation

# Latent Dirichlet Allocation (LDA)

Latent Dirichlet allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set. We can think of the Dirichlet distribution as a “distribution over distributions.” In essence, it answers the question: “given this type of distribution, what are some actual probability distributions I am likely to see?”

![LDA](assets/Schematic-of-LDA.png)


Essentially, Latent Dirichlet Allocation assumes that text is generated this way (each document draws a mixture of topics, and each word is drawn from one of those topics) and then attempts to learn two things:

1. The word distribution of each topic

2. The topic distribution of each document.
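The generative story can be sketched in plain Python. This is only an illustration of the assumption LDA makes, not how gensim fits a model: the two topic-word distributions are invented, and the Dirichlet draw uses the standard trick of normalizing independent Gamma samples.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical word distribution for each topic (what LDA would have to learn)
topic_words = {
    0: {"dog": 0.5, "leash": 0.3, "bark": 0.2},
    1: {"cat": 0.6, "purr": 0.3, "nap": 0.1},
}

def generate_document(n_words, alpha, rng):
    """Generate one toy document following the LDA story."""
    # 1. Draw this document's topic mixture from a Dirichlet
    theta = sample_dirichlet([alpha] * len(topic_words), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution
        vocab, probs = zip(*topic_words[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

rng = random.Random(0)
theta, words = generate_document(8, alpha=0.5, rng=rng)
print([round(t, 3) for t in theta], words)
```

Training runs this story in reverse: given only the words, it infers the topic mixture of each document and the word distribution of each topic.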

In the graph below you'll notice how the Dirichlet distribution allows multiple words to be associated with various topics, and multiple documents to be assigned as well. It's an expanding inferred context that allows both dimensions to play a role in defining the creation of topics.

![ldamatching](assets/ldamatching.jpeg)


LDA is easily the most popular topic modeling technique out there. It's not a silver bullet but works well in a wide range of applications and is easy to use. The basic steps are:

1. Loading and cleaning data
2. Preparing data for LDA analysis
3. LDA model training
4. Analyzing LDA model results


### Data for this journey 

For the LDA model we're going to kick it up a notch. We will be using 1,000 papers from the Neural Information Processing Systems (NeurIPS) conferences from 1987 to 2016. As the full file is ~400 MB, it has been reduced to papers that include an abstract.


https://www.kaggle.com/benhamner/nips-papers?select=database.sqlite


### Step 1: Load and Clean data


In [None]:
papers=pd.read_csv('data/NIPS_Papers.csv')
print (papers.shape)
papers.head(1)

In [None]:
papers.iloc[0].abstract

In [None]:
# Remove punctuation
papers['abstract_processed'] = papers['abstract'].map(lambda x: re.sub(r'[^\w\s]', '', x))

# Convert the abstracts to lowercase
papers['abstract_processed'] = papers['abstract_processed'].map(lambda x: x.lower())

# Print out the first rows of papers
papers['abstract_processed'][0:4]

In [None]:
papers.iloc[0].abstract # Before processing

In [None]:
papers.iloc[0].abstract_processed # After processing

### Step 2: Preparing data for LDA analysis

We are going to step this up a bit from how we've lemmatized before. With the power of spaCy we can lemmatize much faster, guided by POS tagging. Only nouns, verbs, adjectives and adverbs are kept, and each is lemmatized against the associated spaCy lexicon.

In [None]:
#Lemmatize the data to reduce the feature space

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
papers['abstract_lemma'] = papers['abstract_processed'].map(lambda x: [token.lemma_ for token in nlp(x) if token.lemma_ != '-PRON-' and token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV'}])

# Final cleaning
papers['abstract_processed_lemma'] = papers['abstract_lemma'].map(lambda x: [t for t in x if len(t) > 1])

# Example
print(papers['abstract_processed_lemma'].iloc[0][:25], end='\n\n')

In [None]:
# Let's remove stopwords and see the difference
stop_en = stopwords.words('english')
papers['abstract_processed_lemma'] = papers['abstract_processed_lemma'].map(lambda x: [t for t in x if t not in stop_en]) 
print(papers['abstract_processed_lemma'].iloc[0][:25])

In [None]:
# Create a corpus from a list of texts
texts = papers.sample(n=500, random_state=43)['abstract_processed_lemma'].values
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [None]:
# Let's take a sneak peek inside the dictionary
print(f'No of words stored in dictionary: {len(dictionary)}') 
print ('10 most commonly used words in dictionary:')
print (dictionary.most_common()[:10])

### Step 3: Build our LDA model

More parameters [can be found here](https://radimrehurek.com/gensim/models/ldamodel.html)

`corpus`– Stream of document vectors or sparse matrix of shape (num_terms, num_documents). If not given, the model is left untrained (presumably because you want to call update() manually).

`num_topics`  – The number of requested latent topics to be extracted from the training corpus.

`id2word` – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.

`minimum_probability` – Topics with a probability lower than this threshold will be filtered out.

In [None]:
n_topics=9

my_lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, random_state=42, minimum_probability=0)

### Our first topics

With those few lines of code we've explored and built an NLP version of clustering to create our topics. The next step is to explore them.

We can limit what is returned by specifying the number of topics we're interested in and the number of words to return

In [None]:
# We can limit what is returned by specifying the number of topics we're interested in and the number of words to return
num_topics = 10
num_words = 5
for ti, topic in enumerate(my_lda.show_topics(num_topics = num_topics, num_words= num_words)):
    print("Topic: %d" % (ti))
    print (topic)
    print()

In [None]:
# How similar are topics? Remember cosine similarity from above. gensim has it built in, letting us compare the top terms of two topics

matutils.cossim(my_lda.get_topic_terms(2), my_lda.get_topic_terms(1))

## Analyzing LDA model results

A common method for reviewing a topic model is to investigate its key metrics. For LDA those tend to be

#### Perplexity 
 - Perplexity measures how probable new, unseen data is given the model that was learned earlier. That is to say, how well does the model represent or reproduce the statistics of the held-out data? Lower perplexity is better, but optimizing for perplexity may not yield human-interpretable topics.
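As a rough illustration, per-word perplexity is the exponential of the negative average log-probability the model assigns to held-out words. The sketch below uses made-up probabilities and a hypothetical `perplexity` helper; it is not how gensim computes its bound internally.

```python
import math

def perplexity(word_probs):
    """exp of the negative mean log-probability assigned to held-out words."""
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

# A model that spreads its belief evenly over 4 words is "4-ways confused"
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Assigning higher probability to the words that actually occur lowers perplexity
print(perplexity([0.5, 0.5]), perplexity([0.1, 0.1]))
```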


#### Coherence
 - Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference.
 
A good write-up on both can be found [here](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)
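To build intuition for what a coherence score rewards, here is a simplified UMass-style sketch: top words that tend to appear in the same documents score higher. This is deliberately not the C_v measure used in this notebook (C_v is more involved), and the toy documents and the `umass_coherence` helper are assumptions for illustration.

```python
import math

def umass_coherence(top_words, documents):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

documents = [
    ["dog", "leash", "bark"],
    ["dog", "bark", "park"],
    ["cat", "purr", "nap"],
    ["cat", "nap", "dog"],
]

print(umass_coherence(["dog", "bark"], documents))  # words that co-occur
print(umass_coherence(["dog", "purr"], documents))  # words that never co-occur
```

A topic whose top words never share a document is penalized, which matches the intuition that such a topic is hard for a human to label.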

For C_v coherence scores, a rule of thumb is:
- .3 is bad
- .4 is low
- .55 is okay
- .65 might be as good as it is going to get
- .7 is nice
- .8 is unlikely and
- .9 is probably wrong

My approach to finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and pick the one that gives the highest coherence value.

Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.
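Both selection rules can be written down in a few lines. The coherence values below are invented purely for illustration (they are not results from this notebook), and `end_of_rapid_growth` with its `min_gain` threshold is a hypothetical helper.

```python
# Made-up mean coherence per number of topics k (illustrative values only)
coherence_by_k = {3: 0.21, 5: 0.24, 7: 0.27, 9: 0.30, 11: 0.29, 13: 0.28}

# Rule 1: pick the k with the highest coherence
best_k = max(coherence_by_k, key=coherence_by_k.get)

def end_of_rapid_growth(scores, min_gain=0.02):
    """Return the last k whose gain over the previous k is still at least min_gain."""
    ks = sorted(scores)
    chosen = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        if scores[cur] - scores[prev] >= min_gain:
            chosen = cur
        else:
            break
    return chosen

# Rule 2: pick the k where the rapid growth in coherence ends
elbow_k = end_of_rapid_growth(coherence_by_k)

print(best_k, elbow_k)
```

On real curves the two rules can disagree, and the coherence estimate is noisy, so it is worth reading the topics at a couple of neighboring values of k before committing.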


In [None]:
# Compute Coherence Score on the same sampled texts the model was trained on
coherence_model_lda = CoherenceModel(model=my_lda, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

#### By our rules of thumb, not a great score so let's see how well we could do given our preprocessing strategy

In [None]:
# Takes a few minutes to train the models
max_topics = 30
coh_list = []
for n_topics in range(3,max_topics+1):
    # Train the model on the corpus
    my_lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, random_state=42, alpha=0.1)
    # Estimate coherence
    cm = CoherenceModel(model=my_lda, texts=texts, dictionary=dictionary, coherence='c_v', topn=20)
    coherence = cm.get_coherence_per_topic() # get coherence value
    coh_list.append(coherence)

In [None]:
# Coherence scores:
coh_means = np.array([np.mean(l) for l in coh_list])
coh_stds = np.array([np.std(l) for l in coh_list])

import matplotlib.pyplot as plt
%matplotlib inline
plt.xticks(np.arange(3, max_topics+1, 3.0));
plt.plot(range(3,max_topics+1), coh_means);
plt.fill_between(range(3,max_topics+1), coh_means-coh_stds, coh_means+coh_stds, color='g', alpha=0.05);
plt.vlines([8, 9], 0.24, 0.26, color='red', linestyles='dashed',  linewidth=1);
plt.hlines([0.253], 3, max_topics, color='black', linestyles='dotted',  linewidth=0.5);

#### Looks like the best we can do is ~.3. In that case we'll want to go back to our data and work on additional stopwords, vectorizing and other pre-processing strategies to make this stronger

### Visualizations

Now that we have a trained model, let’s visualize the topics for interpretability. To do so, we’ll use a popular visualization package, pyLDAvis, which is designed to help interactively with:
1. Better understanding and interpreting individual topics, and
2. Better understanding the relationships between the topics.


For (1), you can manually select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a human interpretable name or “meaning” to each topic.


For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

In [None]:
# Retraining our model with 9 topics (the tuning loop above overwrote my_lda)
n_topics=9
n_top_words = 25
my_lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, random_state=42, minimum_probability=0)

In [None]:
# Let's visualize our topics in our feature space to see how divergent they were
import pyLDAvis
import pyLDAvis.gensim_models  # the submodule must be imported explicitly before use
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(my_lda, corpus, dictionary)

### Finding the dominant topic in each sentence

One practical application of topic modeling is to determine what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.
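On a single document this boils down to taking the maximum over `(topic_id, probability)` pairs. The sketch below uses a made-up list in that shape and a hypothetical `dominant_topic` helper.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical per-document output in the (topic_id, probability) shape
doc_topics = [(0, 0.05), (1, 0.72), (2, 0.23)]

topic_id, share = dominant_topic(doc_topics)
print(f"Dominant topic: {topic_id} ({share:.0%} of the document)")
```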

The `format_topics_sentences()` function below nicely aggregates this information in a presentable table.

In [None]:
def format_topics_sentences(ldamodel, corpus, texts):
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the one with the highest contribution
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        # Get the keywords for the dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output (index reset so rows align by position)
    contents = pd.Series(texts).reset_index(drop=True)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


# The corpus was built from a 500-paper sample, so pair it with the abstracts of that same sample
sample_abstracts = papers.sample(n=500, random_state=43)['abstract']
df_topic_sents_keywords = format_topics_sentences(ldamodel=my_lda, corpus=corpus, texts=sample_abstracts)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
pd.set_option('max_colwidth', 400)
df_dominant_topic.head(10)

### Finding the prevalence of each topic in a document

The `.get_document_topics` method returns the probability of each topic for a particular document. Investigate documents individually or apply it against your original dataframe.



In [None]:
my_lda.get_document_topics(corpus[0])

### Contextualizing topics

Sometimes just the topic keywords may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents a given topic has contributed to the most and infer the topic by reading those documents. Whew!!

In [None]:
# Keep the most representative document under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values('Perc_Contribution', ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()

#### What do you think these topics are related to? Consider filtering on the df 'df_dominant_topic' to provide more context

## Document Tagging

![tagging](assets/tagging.png)


Over the course of this lesson we've looked at a few ways to generate potential keywords for articles.

1. Leveraging already labeled data
2. Named Entity Extraction
3. Topic Modeling


Once this information is created, it can be used selectively to apply keywords to your documents and improve information retrieval. Since it's important for tags to be governed to maximize interpretability and effectiveness, the challenge switches from a generation exercise to a text classification exercise.

Depending on the complexity of your model and the number of your tags, this could range from an easy to a challenging problem.

### Tags

Tags should be representative of the subject matter at hand. They begin at higher levels and can dig into more nuance as you gain the ability. For instance one might start simply with subjects

- Biology
- Chemistry
- Physics
- Mathematics


As you gain additional labeled data, you can expand your tags to flesh out any particular area. For instance:

- Biology
    - Molecular biology
    - Cell Biology
    - Genomics
    - Proteomics 
    - Virology
    - Ecology
    - etc 
    
    
### Let's start by trying out Named Entity Recognition (or extraction)

For this exercise we'll use a Seasonality of Transmission dataset from a larger COVID-19 repository on Kaggle

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

In [None]:
#Create the dataset

COVID=pd.read_csv('./data/Seasonalityoftransmission.csv')
COVID.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
COVID.head()

In [None]:
COVID.shape

In [None]:
# Notice the dataset has repeating values to allow it to categorize based on 'Study Type' and 'Factors'
len(COVID['Study'].unique())

In [None]:
#Let's see how this works for a record in our dataframe
COVID.Study[0]

In [None]:
#Let's see how this works for an individual record

nlp=spacy.load('en_core_web_sm')
doc=nlp(COVID.Study[0])

print (doc)
print([(ent.text, ent.label_) for ent in doc.ents])

In [None]:
#Let's see how this works for all the other records

for i, study in enumerate(COVID.Study):
    doc=nlp(study)
    print(i, doc)
    print([(ent.text, ent.label_) for ent in doc.ents])
    print('-----Break-----')

#### Interesting breakout of items but let's add some of our own words


In [None]:
# Start by creating a label
LABEL = "COVID"

# Add some training data associating the entries with beginning and ending character
# Offsets are (start, end, label) where end is exclusive, so text[start:end] is the entity
TRAIN_DATA = [
    ("Causal empirical estimates suggest COVID-19 transmission rates are highly seasonal",
        {"entities": [(35, 43, LABEL)]}),
    ("Meteorological factors correlate with transmission of 2019-nCoV: Proof of incidence of novel coronavirus pneumonia in Hubei Province, China", 
     {"entities": [(54, 63, LABEL)]}),
    ("Eco-epidemiological assessment of the COVID-19 epidemic in China, January-February 2020",
        {"entities": [(38, 46, LABEL)]}),
    
    ("Correlation between weather and Covid-19 pandemic in Jakarta, Indonesia",
        {"entities": [(32, 40, LABEL)]}),
    ("Effects of temperature and humidity on the spread of COVID-19: A systematic review",
        {"entities": [(53, 61, LABEL)]} ),
    ("Effect of Temperature on the Transmission of COVID-19: A Machine Learning Case Study in Spain", 
         {"entities": [(45, 53, LABEL)]}),
    ("COVID-19: Effects of weather conditions on the propagation of respiratory droplets", 
         {"entities": [(0, 8, LABEL)]}),
]
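Character offsets are easy to get wrong by one, so it is worth checking them with a slice. The sketch below assumes the usual convention that the end offset is exclusive (as in Python slicing); `find_span` is a hypothetical helper for this check.

```python
def find_span(text, word, label):
    """Locate the first case-insensitive occurrence of word and return (start, end, label)."""
    start = text.lower().find(word.lower())
    if start == -1:
        return None  # the word does not occur in this text
    return (start, start + len(word), label)  # end is exclusive, like Python slicing

text = "Correlation between weather and Covid-19 pandemic in Jakarta, Indonesia"
span = find_span(text, "COVID-19", "COVID")
start, end, label = span

# Slicing with the offsets should return exactly the entity text
print(span, repr(text[start:end]))
```

If the slice comes back with a missing or extra character, the annotation will not line up with token boundaries and the example is of little use for training.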

In [None]:
# It'll take too long to do that by hand - let's build a function that does this for us

def build_training(df_series, word, LABEL):
    """Takes the word of interest, assigned to a LABEL against a dataframe series/column to return training data. 
    Note - this is only designed for one tag per entry"""
    train_data=[]
    
    #Loops through each unique entry to find the word (case-insensitive)
    for entry in set(df_series):
        entry = str(entry)
        x = entry.lower().find(word.lower())
        
        if x == -1: # -1 means it wasn't found
            continue # so we'll skip that one
        
        #End offset is exclusive, so entry[x:end] is exactly the matched word
        end = x + len(word)
        print (f'Word extracted: {entry[x:end]}')
        train_data.append((entry,{"entities":[(x,end,LABEL)]}))
    return train_data    

In [None]:
#Create our training_data
train_data=build_training(COVID.Study, word='Coronavirus', LABEL='DISEASE')

In [None]:
#Not much data is there?
train_data[0:7]

In [None]:
train_data[0][0][40:51]

### Training a model

First, let’s understand the ideas involved before going to the code.

1. To train an NER model, the model has to loop over the examples for a sufficient number of iterations. If you train it for just 5 or 6 iterations, it may not be effective.

2. Before every iteration it’s good practice to shuffle the examples randomly through the `random.shuffle()` function. This ensures the model does not make generalizations based on the order of the examples.

3. The training data is usually passed in batches.

You can call the `minibatch()` function of spaCy on the training data to get it back in batches. The `minibatch()` function takes a `size` parameter to denote the batch size. You can use the utility function `compounding()` to generate an infinite series of compounding values.

The `compounding()` function takes three inputs: `start` (the first value), `stop` (the maximum value that can be generated) and finally `compound`, the compounding factor for the series.
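To see what a compounding batch-size schedule does, here is a pure-Python re-implementation of the idea. These are simplified stand-ins written for this explanation (`compounding_sketch` and `minibatch_sketch` are hypothetical names), not spaCy's own code, and they assume an increasing series (`start` below `stop`, `compound` above 1).

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, sizes):
    """Split items into consecutive batches whose sizes come from the sizes iterator."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch

# A large factor (2.0) is used so the growth is visible on a tiny dataset
sizes = compounding_sketch(1.0, 4.0, 2.0)
batches = list(minibatch_sketch(range(10), sizes))
print(batches)
```

Early batches are small, so the weights are updated often while the model is still far from a good solution; later batches are larger and give steadier gradient estimates.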

For each iteration, the model (here, the `ner` component) is updated through the `nlp.update()` command. In spaCy 3.x, the parameters of `nlp.update()` used here are:

- `examples`: A batch of `Example` objects. Each one is built from a text and its annotations, which you can separate out of a batch with `zip`.

- `sgd`: The optimizer, here the one returned by `nlp.resume_training()`.

- `drop`: This represents the dropout rate.

- `losses`: A dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.

At each word, `update()` makes a prediction. It then consults the annotations to check whether the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.

Finally, all of the training is done within the context of the nlp model with the other pipeline components disabled, to prevent them from being involved.

In [None]:
# Load the spacy model (trained pipeline for English). For more details, visit https://spacy.io/models/en#en_core_web_sm
nlp=spacy.load('en_core_web_sm')

def adding_entity(train_data,LABEL):
    """Takes in the training data and associated entity label to attempt to learn a new entity"""
    
    random.seed(1)
    
    # Add entity recognizer to model if it's not in the pipeline
    # In spaCy 3.x, nlp.add_pipe takes the registered component name and returns the component
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner")
    # otherwise, get the NER component, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    # Add the new label to ner
    ner.add_label(LABEL)

    # Resume training
    optimizer = nlp.resume_training()
    move_names = list(ner.move_names)

    # List of pipes you want to train
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

    # List of pipes which should remain unaffected in training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    
    with nlp.select_pipes(disable=other_pipes), warnings.catch_warnings():
    
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        sizes = compounding(1.0, 4.0, 1.001)
        # Training for 30 iterations     
        for itn in range(30):
            # shuffle examples before training
            random.shuffle(train_data)
            # batch up the examples using spaCy's minibatch
            batches = minibatch(train_data, size=sizes)
            # Dictionary to store losses
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                
                # Build an Example (unannotated doc plus gold annotations) for each text
                examples = [
                    Example.from_dict(nlp.make_doc(text), annotation)
                    for text, annotation in zip(texts, annotations)
                ]

                # Update the model on this batch
                nlp.update(examples, sgd=optimizer, drop=0.25, losses=losses)
                print("Losses", losses)

In [None]:
# Adding our training data to find the new word
adding_entity(train_data, LABEL='DISEASE')

In [None]:
# Testing the model
# nlp=spacy.load('en_core_web_sm')

test_text = "Eco-epidemiological assessment of the Coronavirus epidemic in China, January-February 2020"
#test_text = "Correlation between weather and Covid-19 pandemic in Jakarta, Indonesia"
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, ent.text)
    
displacy.render(doc, style='ent', jupyter=True)

#### Fantastic! A New Entity for Coronavirus!

spaCy was able to look at how the word was used in context, classify it, and come out with a new way to understand and track this entity. However, with just a few examples, we may have needed to run the training a few times before it returned the new entity.

If spaCy can't recognize the word, as in the COVID-19 case shown below, then you'll need to add it to the dictionary. We try that below to show you what might happen. Sometimes retraining even ends in "catastrophic forgetting", where the model loses the ability to find other named entities as well. This is always something to be aware of when retraining a model, and potentially a great use case for building a blank model against your target NER items (all your keywords).

One solution would be adding COVID-19 to the lexicon/vocabulary of the model. This is a great situation to dig into [the spaCy documentation](https://spacy.io/usage/spacy-101).


Some other food for thought: for spaCy to recognize a new entity, it often needs a few hundred observations. We used a random seed here to make sure it found the entity with only two observations. However, the more the merrier.
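
One commonly suggested way to reduce the forgetting described above is to mix the new `DISEASE` examples with texts annotated by the original, unmodified model, so the existing labels keep being rehearsed during training. The sketch below only illustrates that idea. `build_rehearsal_data` and `mix_training_data` are hypothetical helpers (not part of spaCy), and `predict_entities` is an assumed stand-in for a call to the original pipeline.

```python
import random

def build_rehearsal_data(texts, predict_entities):
    """Annotate raw texts with the ORIGINAL model's entity predictions.

    predict_entities is a hypothetical callable that returns a list of
    (start_char, end_char, label) tuples for a text.
    """
    return [(text, {"entities": list(predict_entities(text))}) for text in texts]

def mix_training_data(new_data, rehearsal_data, per_new=2, seed=1):
    """Combine new examples with up to per_new rehearsal examples per new one."""
    rng = random.Random(seed)
    k = min(len(rehearsal_data), per_new * len(new_data))
    mixed = list(new_data) + rng.sample(rehearsal_data, k)
    rng.shuffle(mixed)
    return mixed

# Stand-in for the original pipeline. With spaCy this could be something like
# lambda t: [(e.start_char, e.end_char, e.label_) for e in original_nlp(t).ents]
def fake_predict(text):
    start = text.find("China")
    return [(start, start + 5, "GPE")] if start >= 0 else []

new_examples = [("Coronavirus cases rise", {"entities": [(0, 11, "DISEASE")]})]
rehearsal = build_rehearsal_data(["Outbreak in China", "No places here"], fake_predict)
mixed = mix_training_data(new_examples, rehearsal)
```

The mixed list has the same `(text, annotations)` shape as `train_data`, so it could be passed to `adding_entity` in its place.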

In [None]:
# Update our training data (COVID-19 examples plus the earlier Coronavirus examples)
train_data=build_training(COVID.Study, 'COVID-19', 'DISEASE')
train_data = train_data + build_training(COVID.Study, 'Coronavirus', 'DISEASE')

In [None]:
print(f'No. of records in training data: {len(train_data)}')
train_data[:2]

In [None]:
train_data[0][0][39:47]
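
The slice above checks a single annotation by hand. Below is a minimal sketch for checking every span at once, assuming `train_data` holds `(text, {"entities": [(start, end, label)]})` tuples, as the slice and the `Example.from_dict` call suggest. `misaligned_spans` is a hypothetical helper, and the sample records are made up for illustration.

```python
def misaligned_spans(train_data, keywords):
    """Return (text, start, end) for spans whose text is not one of the keywords."""
    allowed = {keyword.lower() for keyword in keywords}
    problems = []
    for text, annotations in train_data:
        for start, end, _label in annotations.get("entities", []):
            if text[start:end].lower() not in allowed:
                problems.append((text, start, end))
    return problems

# Made-up records in the assumed training format
sample = [
    ("Spread of COVID-19 in Europe", {"entities": [(10, 18, "DISEASE")]}),
    ("COVID-19 and weather", {"entities": [(0, 7, "DISEASE")]}),  # end offset is one short
]
bad = misaligned_spans(sample, ["COVID-19", "Coronavirus"])
```

An empty result means every annotated span lines up with a keyword. Misaligned spans are worth fixing before training, since they can trigger spaCy's alignment warnings.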

In [None]:
# Build the model
adding_entity(train_data, 'DISEASE')

In [None]:
# Testing the model

test_text = "Eco-epidemiological assessment of the COVID-19 epidemic in China, January-February 2020"
# test_text = "Eco-epidemiological assessment of the Coronavirus epidemic in China, January-February 2020"
# test_text = "Correlation between weather and Covid-19 pandemic in Jakarta, Indonesia"
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, ent.text)
    
displacy.render(doc, style='ent', jupyter=True)

## Using the Classification method on your discovered topics

### Step 1 - Define your tags
- Avoid Overlapping
- Don't mix classifications
- Organize your tags around similar hierarchies


### Step 2 - Data gathering

Capture all the labeled data available for each tag. Consider what you have on hand, what you may want to build, and what data you might still need to create. Options range from crowdsourcing labels with something like [Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) to algorithms like [RAKE: Rapid Automatic Keyword Extraction](https://medium.com/datadriveninvestor/rake-rapid-automatic-keyword-extraction-algorithm-f4ec17b2886c).
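
To make the RAKE idea concrete, here is a simplified sketch written with the standard library only. It splits text into candidate phrases at stopwords and punctuation, scores each word as degree divided by frequency, and sums the word scores per phrase. The tiny `STOPWORDS` set and the sample sentence are assumptions for illustration. A real run would use a fuller stopword list such as the NLTK one imported earlier.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "and", "are", "as", "at", "by", "for", "from", "in",
             "is", "it", "of", "on", "or", "that", "the", "to", "we", "with"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stopwords; punctuation also ends a phrase."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\[\]\"\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake(text, top_n=5, stopwords=STOPWORDS):
    """Return the top_n (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

text = ("Topic modeling of research papers. "
        "Keyword extraction finds candidate tags for topic modeling.")
keywords = rake(text)
```

Longer phrases score higher under this scheme, so the output is best treated as a list of candidate tags to review rather than as final labels.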

### Step 3 - Text classifier

Determine the right text classification 

## Exercise: Tagging

Now that you have a dominant topic tagged to all 1,000 articles from the NIPS papers, let's see if you can find a way to classify 500 new documents against the top topic.
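
One way to approach the exercise is to treat each document's dominant topic as a class label and train a supervised classifier on it. The sketch below uses a small multinomial Naive Bayes written with the standard library only. The toy documents and topic labels are made up for illustration, and in the notebook a scikit-learn pipeline built on `TfidfVectorizer` would likely be the more practical choice.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train_nb(docs, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log-probability for doc."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in tokenize(doc):
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up stand-ins for the tagged articles and their dominant topics
train_docs = [
    "neural network training with gradient descent",
    "deep neural network layers",
    "bayesian inference with markov chain sampling",
    "posterior sampling and bayesian priors",
]
train_topics = ["topic_0", "topic_0", "topic_1", "topic_1"]

model = train_nb(train_docs, train_topics)
new_docs = ["gradient descent for deep network", "markov chain posterior"]
predictions = [predict_nb(model, doc) for doc in new_docs]
```

Holding back part of the tagged articles as a test set would show how well the topic labels can be recovered before classifying the 500 new documents.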



# Recap


In the world of NLP, context is king. Above, we learned to:

- Leverage POS and NER to provide a better understanding of words within the corpus
- Transpose the DTM to give additional context to words based on their usage
- Build models that identify the topics that exist within your corpus
- Apply topics against your dataframe to continue your project's design
