# Topic modeling v2

In this notebook, we will start again our work to get topics from posts.  
The reason of this change is simple : 
* There are a lot of pre-processing actions we could do to improve our initial dataset (lemmatization, group words together, filter words,...)
* Our LDA model we used before got us an error we couldn't fix. This time, based on a new course (https://www.youtube.com/watch?v=6zm9NC9uRkk&ab_channel=PyData), we will improve our code.

Let's begin

## Import the data

In [223]:
import pandas as pd

In [224]:
data = pd.read_pickle("contentCorpus.pkl")
data

Unnamed: 0,Name,#Reactions,#Comments,Location,Followers,Time_spent,Media_type,Content
0,Nicholas Wyman,12,1,Unknown,6484.0,1 day ago,article,robert lerman writes that achieving a healthy...
1,Nicholas Wyman,11,0,Unknown,6484.0,1 week ago,none,"national disability advocate sara hart weir, ..."
3,Nicholas Wyman,44,0,Unknown,6484.0,2 months ago,article,exploring in this months talent management & h...
4,Nicholas Wyman,22,2,Unknown,6484.0,2 months ago,article,i count myself fortunate to have spent time wi...
5,Nicholas Wyman,21,1,Unknown,6484.0,2 months ago,article,online job platforms are a different way of wo...
...,...,...,...,...,...,...,...,...
34007,Simon Sinek,4005,93,Unknown,4206024.0,4 years ago,image,igniter of the year 2016. well i know that i'm...
34008,Simon Sinek,1698,74,Unknown,4206024.0,4 years ago,video,executives who prioritize the shareholder are ...
34009,Simon Sinek,661,59,Unknown,4206024.0,4 years ago,video,"like many, i too have been reflecting as we ne..."
34010,Simon Sinek,766,35,Unknown,4206024.0,4 years ago,video,"if you say ""customer first"" that means your em..."


In [225]:
# Let's devide this dataset into two datasets : Corpus only & Informations about the post

# We create a specific ID for each row
data["ID"]=range(data.shape[0])

# We create the dataset containing content
corpus = data[['ID','Content']]

# And the one containing reactions & comments about each post
# We don't need other columns for this analysis
postReactions = data[['ID','#Reactions','#Comments']]

In [226]:
for c in corpus['Content'][[1,100,1000,10000,20000]]:
    print(c)
    print('--------------')

national disability advocate  sara hart weir, ms   shares how congress passed the able act
--------------
for those intersted in youth career pathways. great to read today about the expansion of citi foundation’s pathways to progress inititiave - new commitment to 500k young adults #jobready #jobs #training  joanne gedge   janet searle   louise martin lindsay   amy-lou cowdroy-ling   https://lnkd.in/g8ftr5w. 
--------------
community building has meant something dramatically different the past couple of months.  when 500 startups hosted an event with 2400+ rsvps, we had to pivot almost 8 times to accommodate changing restrictions along the way.  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.  all these things have made me ask the question, what's the best way to build community right now? how is this changing our perception of community? thus, i'm starting my letter to the community on the subject

As we can see, some words can be filtered :
* mentions (joanne gedge, janet searle,...)
* some hashtags that are involved in sentences (natgeotraveller , globalhealth), generally at the end of the post

## Texts Preprocessing : what we can do ?

For this preprocessing, we will use Spacy, which is a fast industrial-strength natural language processing (NLP) library for Python.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

* Tokenization
* Text normalization, such as lowercasing, stemming/lemmatization
* Part-of-speech tagging
* Syntactic dependency parsing
* Sentence boundary detection
* Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:

* Large English vocabulary, including stopword lists
* Token "probabilities"
* Word vectors

In [227]:
import spacy
import codecs

# We need to import spacy trained pipelines, which support many languages
# Let's use the English pipeline

# In CMD :
# $ python -m spacy download en_core_web_lg
# or
# In python :
# >>> import spacy
# >>> nlp = spacy.load("en_core_web_lg")

# We download the large version "lg" and not the small version "sm" here 
# This module is pretty large : ~631 MB
nlp = spacy.load("en_core_web_lg")

Let's create a sample of several posts to play with.

In [228]:
#Posts are separated by line breaks, and gather into one string
posts_sample = "\n\n".join(corpus.Content[[1,100,1000,10000,20000]])

print(posts_sample)

national disability advocate  sara hart weir, ms   shares how congress passed the able act

for those intersted in youth career pathways. great to read today about the expansion of citi foundation’s pathways to progress inititiave - new commitment to 500k young adults #jobready #jobs #training  joanne gedge   janet searle   louise martin lindsay   amy-lou cowdroy-ling   https://lnkd.in/g8ftr5w. 

community building has meant something dramatically different the past couple of months.  when 500 startups hosted an event with 2400+ rsvps, we had to pivot almost 8 times to accommodate changing restrictions along the way.  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.  all these things have made me ask the question, what's the best way to build community right now? how is this changing our perception of community? thus, i'm starting my letter to the community on the subject. 

where can we find  #case

In [229]:
parsed_posts_sample = nlp(posts_sample)

print(parsed_posts_sample)

national disability advocate  sara hart weir, ms   shares how congress passed the able act

for those intersted in youth career pathways. great to read today about the expansion of citi foundation’s pathways to progress inititiave - new commitment to 500k young adults #jobready #jobs #training  joanne gedge   janet searle   louise martin lindsay   amy-lou cowdroy-ling   https://lnkd.in/g8ftr5w. 

community building has meant something dramatically different the past couple of months.  when 500 startups hosted an event with 2400+ rsvps, we had to pivot almost 8 times to accommodate changing restrictions along the way.  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.  all these things have made me ask the question, what's the best way to build community right now? how is this changing our perception of community? thus, i'm starting my letter to the community on the subject. 

where can we find  #case

Looks the same ! So what happened ?  
Let's apply some functions !

### Sentences detection

Actually, we have already removed punctuation, but it is working very well !!!  
I keep that here for a later use. 

In [230]:
for num, sentence in enumerate(parsed_posts_sample.sents):
    print("Sentence {}".format(num+1))
    print(sentence)
    print('')

Sentence 1
national disability advocate  sara hart weir, ms   shares how congress passed the able act

for those intersted in youth career pathways.

Sentence 2
great to read today about the expansion of citi foundation’s pathways to progress inititiave - new commitment to 500k young adults #jobready #jobs #training  joanne gedge   janet searle   louise martin lindsay   amy-lou cowdroy-ling   https://lnkd.in/g8ftr5w.

Sentence 3


community building has meant something dramatically different the past couple of months.

Sentence 4
 when 500 startups hosted an event with 2400+ rsvps, we had to pivot almost 8 times to accommodate changing restrictions along the way.

Sentence 5
 people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.

Sentence 6
 all these things have made me ask the question, what's the best way to build community right now?

Sentence 7
how is this changing our perception of community?

Sent

In [231]:
# Same with the original content, containing ponctuations & uppercases
test = pd.read_pickle("cleaned_data.pkl")
test_content = test.content
sample_test_content = "\n\n".join(test_content[[100,1000]])
parsed_sample_test = nlp(sample_test_content)

for num, sentence in enumerate(parsed_sample_test.sents):
    print("Sentence {}".format(num+1))
    print(sentence)
    print('')

Sentence 1
For those intersted in youth career pathways.

Sentence 2
Great to read today about the expansion of Citi Foundation’s Pathways to Progress inititiave - new commitment to 500k young adults #JobReady #Jobs #Training  Joanne Gedge   Janet Searle   Louise Martin Lindsay   Amy-Lou Cowdroy-Ling   https://lnkd.in/g8FTr5w.

Sentence 3

 
 
 …see more

Community building has meant something dramatically different the past couple of months.

Sentence 4
 When 500 Startups hosted an event with 2400+ RSVPs, we had to pivot almost 8 times to accommodate changing restrictions along the way.

Sentence 5
 People are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.

Sentence 6
 All these things have made me ask the question, what's the best way to build community right now?

Sentence 7
How is this changing our perception of community?

Sentence 8
Thus, I'm starting my letter to the community on the subject.

Sentenc

### 

### Name entity detection

In [232]:
for num, entity in enumerate(parsed_posts_sample.ents):
    print("Entity {}".format(num+1), entity, '-', entity.label_)

Entity 1 congress - ORG
Entity 2 today - DATE
Entity 3 citi foundation’s - ORG
Entity 4 500k - ORG
Entity 5 joanne gedge   janet searle   louise martin lindsay   amy-lou cowdroy-ling - PERSON
Entity 6 the past couple of months - DATE
Entity 7 500 - CARDINAL
Entity 8 2400 - CARDINAL
Entity 9 8 - CARDINAL
Entity 10 months - DATE
Entity 11 u.s. - GPE
Entity 12 atlanta - GPE
Entity 13 one - CARDINAL
Entity 14 12th - ORDINAL
Entity 15 linkedin fam - PERSON


* ORG : organism 
* GPE : geopolitcal entity
* LOC : location

As we can see, we can know which entity represents each words or group of words ! 

For instance, spacy knows when a group of words is a name, a location, a date ...  
That's clearly amazing !

Also, it appears that some words are badly comprehend by the algorithm like "us", as USA, which is here "us" designating we.


### Speech tagging

We can also define if a word is an adjective, a noun, or other...

In [233]:
token_text = [token.orth_ for token in parsed_posts_sample]
token_pos = [token.pos_ for token in parsed_posts_sample]

pd.DataFrame({"Token_text" : token_text , "Token_pos" : token_pos})

Unnamed: 0,Token_text,Token_pos
0,national,ADJ
1,disability,NOUN
2,advocate,NOUN
3,,SPACE
4,sara,PROPN
...,...,...
355,!,PUNCT
356,#,SYM
357,natgeotraveller,NOUN
358,#,NOUN


### Text normalization (stemming/lemmatization and shape analysis)

Lemmatization consists in transforming a word into its root. 

For instance, "is" becomes "be" ; "me" becomes "I" ...

In [234]:
token_lemma = [token.lemma_ for token in parsed_posts_sample]
token_shape = [token.shape_ for token in parsed_posts_sample]

pd.DataFrame({'Token_text' : token_text, 'Token_lemma' : token_lemma , 'Token_shape' : token_shape})

Unnamed: 0,Token_text,Token_lemma,Token_shape
0,national,national,xxxx
1,disability,disability,xxxx
2,advocate,advocate,xxxx
3,,,
4,sara,sara,xxxx
...,...,...,...
355,!,!,!
356,#,#,#
357,natgeotraveller,natgeotraveller,xxxx
358,#,#,#


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?

* stopword
* punctuation
* whitespace
* represents a number
* whether or not the token is included in spaCy's default vocabulary?

In [235]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_posts_sample]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))

df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,national,-20.0,,,,,
1,disability,-20.0,,,,,
2,advocate,-20.0,,,,,
3,,-20.0,,,Yes,,Yes
4,sara,-20.0,,,,,
...,...,...,...,...,...,...,...
355,!,-20.0,,Yes,,,
356,#,-20.0,,Yes,,,
357,natgeotraveller,-20.0,,,,,Yes
358,#,-20.0,,Yes,,,


* Log_probability represents the frequency of a word apparation in the text : 
    *  ~ 0 if appears often
    * =! 0 if appears rarely 
    

* Stop ? : Is this word a stop word ?

* Out of vocab ? : Is this word out of the english dictionary proposed by Spacy


## NLP preprocessing : application

## First pre-processing

In this section, we will clean a little bit the content by removing hashtags. Indeed, it appears that some hastags have no sense for future models.

In [236]:
# Let's create a function to apply to the pd.Serie
def rm_hashtags(str):
    return " ".join(word for word in str.split(' ') if "#" not in word)

first_clean_corpus = corpus['Content'].apply(rm_hashtags)

print("Cleaned !:")
print("----------")

for post in first_clean_corpus[:4]:
    print(post)
    print("")

Cleaned !:
----------
robert lerman  writes that achieving a healthy future of work requires employees to build skills that help them attain productive and rewarding careers. he notes - "one of the most cost-effective ways to do this is through apprenticeship, which helps workers master occupations and gain professional identity and pride". coudlnt agree more!        read the article on    urban institute 

national disability advocate  sara hart weir, ms   shares how congress passed the able act

exploring in this months talent management & hr what a company should consider to get the most out of a modern apprenticeship program. thanks to employer & entrepreneuer  ankur gopal  for sharing insights on your it program.  why not start a program in 2021.. wishing you all a safe and happy festive season.  nick          urban institute   zach boren   robert lerman   lana gordon   andrew sezonov   simon w.   ervin dimeny 

i count myself fortunate to have spent time with brooklyn-born arnold

### Phrase modeling

Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the the words in our reviews and looking for words that co-occur (i.e., appear one after another) together much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens  and  constitute a phrase is:

$$ \frac{count (AB) - count_{min}}{count(A)*count(B)} * N > treshold $$

 
...where:

 * $count(A)$ is the number of times token $A$ appears in the corpus
 * $count(B)$ is the number of times token $B$ appears in the corpus
 * $count(AB)$ is the number of times the tokens $AB$ appear in the corpus in order
 * $N$ is the total size of the corpus vocabulary
 * $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
 * $treshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.

We turn to the indispensible **gensim** library to help us with phrase modeling — the Phrases class in particular.

In [237]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
import itertools as it

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

* Segment text of complete reviews into sentences & normalize text
* First-order phrase modeling  apply first-order phrase model to transform sentences
* Second-order phrase modeling  apply second-order phrase model to transform sentences
* Apply text normalization and second-order phrase model to text of complete reviews
* We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the **lemmatized_sentence_corpus** generator function will use spaCy to:

In [238]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """

    return token.is_punct or token.is_space

def rules(token):
    """
    conditions to select a specific token for the corpus cleaning
    used with all() function : return True if all True
    """

    return [not punct_space(token),
            token not in nlp.Defaults.stop_words,
            token.pos_ == 'NOUN' or token.pos_ =="ADJ"]
            
def corpus_cleaning(serie):
    """
    generator function to use spaCy to parse posts,
    lemmatize the text, remove punctuations, unconvenient whitespaces, stopwords, and
    keep only nouns and adjectives
    """
    
    for post in nlp.pipe(serie):
      yield ' '.join([ token.lemma_ for token in post if all(rules(token)) ])

*Update*

At first, we have decided to keep verbs. However it appears that verbs like "go" or "be" appeared significantly in the topic modeling later. Therefore we will just keep Nouns & Adjectives

**Note**

"yield" returns the value we ask him, but in opposite to "return", it saves the last value it returned. Therefore, in our case, it returns a "list" of posts, whereas "return" would return one string *''.join()* of words.

Let's see if the cleaning is good !

In [239]:
print("Original :")

for post in it.islice(corpus_cleaning(first_clean_corpus),4):
    print(post)
    print('')

print("Cleaned :")

for post in it.islice(corpus_cleaning(first_clean_corpus),4):
    print(post)
    print("")

Original :
write healthy future work employee skill productive rewarding career cost effective way apprenticeship worker master occupation professional identity pride article

national disability advocate ms share able act

month talent management hr company most modern apprenticeship program thank employer entrepreneuer insight program program safe happy festive season dimeny

fortunate time assistant secretary policy evaluation research labor administration insight innovative thinking economic employment training policy many people young people full potential aged last project colleage resume thank institute report lerman lopr wonderful insight topic love thought conversation safe

Cleaned :
write healthy future work employee skill productive rewarding career cost effective way apprenticeship worker master occupation professional identity pride article

national disability advocate ms share able act

month talent management hr company most modern apprenticeship program thank employer

Here we used *yield* instead of *return*.  
The reason is well explained here : https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do

Long story short, *yield* returns a generator, which is an iterable, as a list or a string for instance, but does not store all the values in the memory.

Generators recquire less memory than a list that is very useful if the dataset is very large !

Therefore, **we can only iterate over once**. 

To understand how to use spacy with a pandas, here a clearly explained article : https://towardsdatascience.com/structured-natural-language-processing-with-pandas-and-spacy-7089e66d2b10

We first need to apply the spaCy language model to the entire collection of posts. The easiest and most computationally efficient way to do this is to use the *nlp.pipe* function. This will iterate over each document and will apply the language model.

In [240]:
#Let's apply the function *CorpusCleaning* to create a parsed list
cleaned_posts = corpus_cleaning(first_clean_corpus)
print(type(cleaned_posts))

#We can print cleaned posts with 
#list(parsed_posts)
#But it's time consuming

<class 'generator'>


Let's see some examples.

To iterate on a generator, we can use the *itertools* package, which is a common practice with generator objects.

it.islice is an iterator designed to iterate over an object. Because we can't iterate directly over a generator (can't be subscriptable), this function is pretty useful !

In [241]:
for post in it.islice(cleaned_posts,5):
    print(post)
    print(type(post))
    print('----')

write healthy future work employee skill productive rewarding career cost effective way apprenticeship worker master occupation professional identity pride article
<class 'str'>
----
national disability advocate ms share able act
<class 'str'>
----
month talent management hr company most modern apprenticeship program thank employer entrepreneuer insight program program safe happy festive season dimeny
<class 'str'>
----
fortunate time assistant secretary policy evaluation research labor administration insight innovative thinking economic employment training policy many people young people full potential aged last project colleage resume thank institute report lerman lopr wonderful insight topic love thought conversation safe
<class 'str'>
----
online job platform different way time workplace school international border example future work conversation
<class 'str'>
----


**Important note**

When we iterate over a generator, the value saved in it is then deleted ! 

That confirms the definition of a generator : we can only iterate over it once !

**That also means if we want to use several times a generator, we have to recreate one for the purpose !**

For instance, here, we can't use anymore the 6 first values contained in the generator !

A common practice consists in iterating on a generator in this way :

```
for i in create_a_generator_function(y):   
    print(i)
```

******

Now, let's group words together with the **gensim.models.Phrases** library.

In [242]:
# This code was long to run
# Make the folloing statement true if you want to run it again
if 1 == 1:

    # Let's generate the cleaned content one more time as a generator
    # It can be used only one time, so we have to be careful to save the modified version
    cleaned_posts = corpus_cleaning(first_clean_corpus)

    # gensim.Phrases needs a sequence of sentences. (e.g. an iterable or a generator)
    # Each sentence has to be a list of string tokens
    streamed_posts = (post.split(' ') for post in cleaned_posts)

    # We train the model
    # min_count : the minimum of sentences (aka. coupled words) found in the doc. Under this value, the model doesn't take it into consideration.
    # treshold : (by default : 10). The higher, the fewer phrases.
    # This line return a model we can use later. 
    bigram_model = Phrases(streamed_posts,min_count=5,threshold=10)

In [243]:
# How many sentences were found ? / Which words were associated in the doc ?
# pd.DataFrame(data = bigram_model.export_phrases(), columns=[['Phrases',"Score"]])

# model.export_phrases() return a dict with phrases as keys and scores as values
# score represents the frequency of apparition in doc, following the previous formula :
phrases = bigram_model.export_phrases()

pd.DataFrame(data = {"Phrases" : phrases.keys() , "Score" : phrases.values()})

Unnamed: 0,Phrases,Score
0,cost_effective,31.947798
1,apprenticeship_program,57.727183
2,festive_season,149.195370
3,young_people,11.238823
4,full_potential,29.053028
...,...,...
2193,infinite_mindset,113.322901
2194,bit_optimism,233.523188
2195,optimism_podcast,42.723320
2196,worthy_rival,434.901484


$$ score(A,B) = \frac{count(AB)-min_{AB}}{count(A)*count(B)} $$

**Notes**

We see that some phrases have a huge score. 

Let's check these posts.

*Update : by removing hashtags at the begining, we avoid this issue*

In [244]:
# text ='startwithwhy'
# serie = corpus.Content

# for post in serie.loc[serie.str.contains(text, regex=False)]:
#     print(post)
#     print("....")


Indeed, some words are associated together because they are written together in the hashtags "section".  

It would be maybe useful to remove hashtags from all posts to avoid this issue.

*Update : that is what we have done*

**Now we have defined *bigram phrases*, we can apply the model to the corpus !**

In [245]:
# This code was long to run
# Make the folloing statement true if you want to run it again
if 1 == 1:

    # We re-create from the first_clean_corpus a sequence of posts, each of them tokenized (words are splitted)
    cleaned_posts = corpus_cleaning(first_clean_corpus)
    streamed_posts = (post.split(' ') for post in cleaned_posts)

    # We create a new list of cleaned posts we will use for future models
    bigram_corpus = []

    for streamed_post in streamed_posts:
        bigram_post = ' '.join(bigram_model[streamed_post])
        bigram_corpus.append(bigram_post)

    bigram_corpus   


**J'AI REUSSI !!!!!!!**

## Topic modeling with Latent Dirichlet Allocation (LDA)

Now we have succesfully cleaned the corpus, removed hashtags, selected only relevant words, and associated words together, we can now apply a model to define topics in all these posts. 

Topic modeling is family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using Latent Dirichlet Allocation or LDA, a popular approach to topic modeling.

In many conventional NLP applications, documents are represented a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model — documents and tokens — and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:

* Document vectors tend to be large (one dimension for each token  lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully indepedent from each other — there's no sense of connection between related tokens, such as knife and fork.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its *LdaModel* class.

In [246]:
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaModel

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's **Dictionary** class for this.

In [247]:
#Let's create a new corpus as a save
corpus = pd.Series(bigram_corpus)

In [248]:
# Dictionary recquired to split the sentences into a list of words. Let's use split() function
streamed_corpus = corpus.apply(lambda str : str.split())

# Then we learn the dictionary by iterating over all of the reviews
# It return a generator
corpus_dictionary = Dictionary(streamed_corpus)

Also, we take care to remove from the dictionary words that
 appears rarely, and too often.

In [249]:
# filter tokens that are very rare or too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
# We remove words that appears less than 10 times, and more than 40% of the time.
corpus_dictionary.filter_extremes(no_below=10, no_above=0.4)
corpus_dictionary.compactify()

**Dictionary** encapsulates the mapping between normalized words and their integer ids (https://radimrehurek.com/gensim/corpora/dictionary.html)

In [250]:
# Let's see some words as an example
corpus_dictionary.doc2bow(["entrepreneur","host","startup","guy",'event'])

[(536, 1), (630, 1), (1430, 1), (1618, 1), (2559, 1)]

**Note**

It seems that Dictionary() returns a generator. However, calling this generator with doc2bow doesn't seem to erase it. Maybe data present in a generator are deleted when we iterate over it only. We have to check that.

Anyway, let's resume the project.

Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.

In [251]:
bag_of_words = [corpus_dictionary.doc2bow(post) for post in streamed_corpus]

In [None]:
# This code was long to run
# Make the folloing statement true if you want to run it again
if 1 == 1: 

    # Train the model on the corpus.
    lda = LdaModel(corpus=bag_of_words,id2word=corpus_dictionary, num_topics=10,passes=100)

Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [None]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print ('{:20}{}'.format('term', 'frequency'))

    for term, frequency in lda.show_topic(topic_number, topn=25):
        print ("{:20}{:.03f}".format(term, round(frequency, 3)))

**Note**

Running a second time Lda changes topics' ID !
We have done that and 

In [None]:
explore_topic(topic_number=9)

term                frequency
company             0.049
business            0.035
brand               0.025
team                0.017
marketing           0.017
product             0.015
new                 0.015
customer            0.013
more                0.013
strategy            0.011
great               0.010
consumer            0.010
startup             0.009
employee            0.008
industry            0.008
sale                0.008
client              0.008
big                 0.007
opportunity         0.007
ceo                 0.007
global              0.006
growth              0.006
founder             0.006
investment          0.006
digital             0.005


Despite the power of topic modeling, a human view is necessary to assess each topic. We have selected 10 topics after several trials, and we will name each of them. 

In [None]:
topic_names = {0: 'People',
               1: 'Tech',
               2: 'World & Politics',
               3: 'Career',
               4: 'Content',
               5: 'Interview',
               6: 'Personal devlopment',
               7: 'Health',
               8: 'Money', 
               9: 'Business'} #Business, strategy, team, community

Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data — preferably in an interactive format. Fortunately, we have the fantastic pyLDAvis library to help with that!

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [None]:
LDAvis_prepared = gensimvis.prepare(lda, bag_of_words,
                                              corpus_dictionary)

  default_term_info = default_term_info.sort_values(


pyLDAvis.display(...) displays the topic model visualization in-line in the notebook.

In [None]:
pyLDAvis.display(LDAvis_prepared)

**Wait, what am I looking at again?**

There are a lot of moving parts in the visualization. Here's a brief summary:

* On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)
    * The plot is rendered in two dimensions according a multidimensional scaling (MDS) algorithm. Topics that are generally similar should be appear close together on the plot, while dissimilar topics should appear far apart.
    * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
    * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.

* On the right, there is a bar chart showing top terms.
    * When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
    * When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter , which can be adjusted with a slider above the bar chart.
        * Setting the  parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
        * Setting  close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic — i.e., terms that occur only in this topic, and do not occur in other topics.
        * Setting  to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
* Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.

A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.

### What about our us ?

Our first decision to name topics are pretty right, except for topics 5 & 10 (on pyLDAvis). Indeed, looking at it more specifically (on gensim ID's where we named manually each topic) it seems that topics 3 & 4 (World new & Health,Covid) are strongly similar. It seems that these topics are relative to the political & health situation (pandemic, health care, security, ...) in USA. 

But because Topics are well separated in global, we will keep things like that

In [None]:
# We have tested to create 9topics instead of 10, but it appears that topics are more relatalble than with 10.
# Therefore we are not going to use 9 topics later, but we keep that in this notebook
# If you want to run this code, make the following statement true
if 1 == 1:

    # Train the model on the corpus.
    lda = LdaModel(corpus=bag_of_words,id2word=corpus_dictionary, num_topics=9,passes=100)
    LDAvis_prepared = gensimvis.prepare(lda, bag_of_words,
                                                corpus_dictionary)
    pyLDAvis.display(LDAvis_prepared)

**Reactions & Topics**

Now we have defined our 10 topics, we will observe which topics get the most reactions !

To do so, we will create a function based on what we have done previously that takes as entry a raw post and return the topic distribution.  
We define a minimum topic frequency to not consider a certain topic if it is not relevant enough in the post.

In [None]:
def lda_description(post, min_topic_freq = 0.1):

    # First we remove hashtags with rm_hashtags
    rm_hash_post = rm_hashtags(post)

    # Lowercase the text
    lowcase_post = rm_hash_post.lower()

    # Clean it with rules we have defined previously
    # Rules are : not punct - not whitespace - not stopword - Nouns & Adjectives only
    # cleaned_post is here a list of tokens
    parsed_post = nlp(lowcase_post)
    cleaned_post = [token.lemma_ for token in parsed_post if all(rules(token))]

    # Define 2-words phrases
    # Return a list of phrases / tokens
    bigram_post = bigram_model[cleaned_post]

    # Create a bag_of_words representation
    bow_post = corpus_dictionary.doc2bow(bigram_post)

    # Create an LDA representation
    lda_post = lda[bow_post]

    # Sort with the most highly related topics first
    # lda_post = sorted(lda_post, key=lambda freq : -freq)

    for topic_number, freq in lda_post:
        if freq < min_topic_freq:
            continue
        print ('{:20}{:.02f}'.format(topic_names[topic_number],
                                round(freq, 3)))
    

In [None]:
corpus = data[['ID','Content']]
example_post = corpus["Content"][18]
example_post

"the imf predicts we will be entering the worst global recession since the 1930s, and are forecasting a downturn on par with the great depression. all of these negative predictions can lead to anxiety. what's going to be important is to manage that anxiety, and remember - there is hope. stay happy & healthy. "

In [None]:
lda_description(example_post)

Interview           0.22
Money               0.72


In [None]:
# For each cleaned post, we define which topics are present.
# It takes the form of a vector with the distribution of topics as values between 0 & 1

# As previously, we need to stream each post into individual words
# Then we have to transform these sequences/lists of words into bag_of_words (bow) thanks to the dictionary we have created previously
# Let's use bigram_corpus we have created previously with 2-words phrases

# We take posts after cleaning and modification, we ahve saved in bigram_corpus
posts = pd.Series(bigram_corpus)

#A list of posts, where each post is streamed into words
streamed_posts = posts.apply(lambda post : post.split(' '))

#Transform these sequences into bow.
bow_posts = [corpus_dictionary.doc2bow(post) for post in streamed_posts]

#Then we use the Lda model we trained previously
post_topics = lda[bow_posts]

In [None]:
ex = 'Hello, my name is'
nlp(ex)

Hello, my name is

In [None]:
nlp(ex)

Hello, my name is