# Topic modeling v2

In this notebook, we will start again our work to get topics from posts.  
The reason of this change is simple : 
* There are a lot of pre-processing actions we could do to improve our initial dataset (lemmatization, group words together, filter words,...)
* Our LDA model we used before got us an error we couldn't fix. This time, based on a new course (https://www.youtube.com/watch?v=6zm9NC9uRkk&ab_channel=PyData), we will improve our code.

Let's begin

## Import the data

In [1]:
import pandas as pd

In [2]:
data = pd.read_pickle("contentCorpus.pkl")
data

Unnamed: 0,Name,#Reactions,#Comments,Location,Followers,Time_spent,Media_type,Content
0,Nicholas Wyman,12,1,Unknown,6484.0,1 day ago,article,robert lerman writes that achieving a healthy...
1,Nicholas Wyman,11,0,Unknown,6484.0,1 week ago,none,national disability advocate sara hart weir m...
3,Nicholas Wyman,44,0,Unknown,6484.0,2 months ago,article,exploring in this months talent management hr...
4,Nicholas Wyman,22,2,Unknown,6484.0,2 months ago,article,i count myself fortunate to have spent time wi...
5,Nicholas Wyman,21,1,Unknown,6484.0,2 months ago,article,online job platforms are a different way of wo...
...,...,...,...,...,...,...,...,...
34007,Simon Sinek,4005,93,Unknown,4206024.0,4 years ago,image,igniter of the year well i know that im an op...
34008,Simon Sinek,1698,74,Unknown,4206024.0,4 years ago,video,executives who prioritize the shareholder are ...
34009,Simon Sinek,661,59,Unknown,4206024.0,4 years ago,video,like many i too have been reflecting as we nea...
34010,Simon Sinek,766,35,Unknown,4206024.0,4 years ago,video,if you say customer first that means your empl...


In [3]:
# Let's devide this dataset into two datasets : Corpus only & Informations about the post

# We create a specific ID for each row
data["ID"]=range(data.shape[0])

# We create the dataset containing content
corpus = data[['ID','Content']]

# And the one containing reactions & comments about each post
# We don't need other columns for this analysis
postReactions = data[['ID','#Reactions','#Comments']]

In [4]:
for c in corpus['Content'][[1,100,1000,10000,20000]]:
    print(c)
    print('--------------')

national disability advocate  sara hart weir ms   shares how congress passed the able act
--------------
for those intersted in youth career pathways great to read today about the expansion of citi foundation’s pathways to progress inititiave  new commitment to  young adults jobready jobs training  joanne gedge   janet searle   louise martin lindsay   amylou cowdroyling    
--------------
community building has meant something dramatically different the past couple of months  when  startups hosted an event with  rsvps we had to pivot almost  times to accommodate changing restrictions along the way  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months  all these things have made me ask the question whats the best way to build community right now how is this changing our perception of community thus im starting my letter to the community on the subject 
--------------
where can we find  casestudies  of  st

As we can see, some words can be filtered :
* mentions (joanne gedge, janet searle,...)
* some hashtags that are involved in sentences (natgeotraveller , globalhealth), generally at the end of the post

## Texts Preprocessing : what we can do ?

For this preprocessing, we will use Spacy, which is a fast industrial-strength natural language processing (NLP) library for Python.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

* Tokenization
* Text normalization, such as lowercasing, stemming/lemmatization
* Part-of-speech tagging
* Syntactic dependency parsing
* Sentence boundary detection
* Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:

* Large English vocabulary, including stopword lists
* Token "probabilities"
* Word vectors

In [5]:
import spacy
import codecs

# We need to import spacy trained pipelines, which support many languages
# Let's use the English pipeline

# In CMD :
# $ python -m spacy download en_core_web_lg
# or
# In python :
# >>> import spacy
# >>> nlp = spacy.load("en_core_web_lg")

# We download the large version "lg" and not the small version "sm" here 
# This module is pretty large : ~631 MB
nlp = spacy.load("en_core_web_lg")

Let's create a sample of several posts to play with.

In [6]:
#Posts are separated by line breaks, and gather into one string
posts_sample = "\n\n".join(corpus.Content[[1,100,1000,10000,20000]])

print(posts_sample)

national disability advocate  sara hart weir ms   shares how congress passed the able act

for those intersted in youth career pathways great to read today about the expansion of citi foundation’s pathways to progress inititiave  new commitment to  young adults jobready jobs training  joanne gedge   janet searle   louise martin lindsay   amylou cowdroyling    

community building has meant something dramatically different the past couple of months  when  startups hosted an event with  rsvps we had to pivot almost  times to accommodate changing restrictions along the way  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months  all these things have made me ask the question whats the best way to build community right now how is this changing our perception of community thus im starting my letter to the community on the subject 

where can we find  casestudies  of  startups  that were built by a  team  of  en

In [7]:
parsed_posts_sample = nlp(posts_sample)

print(parsed_posts_sample)

national disability advocate  sara hart weir ms   shares how congress passed the able act

for those intersted in youth career pathways great to read today about the expansion of citi foundation’s pathways to progress inititiave  new commitment to  young adults jobready jobs training  joanne gedge   janet searle   louise martin lindsay   amylou cowdroyling    

community building has meant something dramatically different the past couple of months  when  startups hosted an event with  rsvps we had to pivot almost  times to accommodate changing restrictions along the way  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months  all these things have made me ask the question whats the best way to build community right now how is this changing our perception of community thus im starting my letter to the community on the subject 

where can we find  casestudies  of  startups  that were built by a  team  of  en

Looks the same ! So what happened ?  
Let's apply some functions !

### Sentences detection

Actually, we have already removed punctuation, but it is working very well !!!  
I keep that here for a later use. 

In [8]:
for num, sentence in enumerate(parsed_posts_sample.sents):
    print("Sentence {}".format(num+1))
    print(sentence)
    print('')

Sentence 1
national disability advocate  sara hart weir ms   shares how congress passed the able act

for those intersted in youth career pathways great to read today about the expansion of citi foundation’s pathways to progress inititiave  new commitment to  young adults jobready jobs training  joanne gedge   janet searle   louise martin lindsay   amylou cowdroyling    

community building has meant something dramatically different the past couple of months  when  startups hosted an event with  rsvps we had to pivot almost  times to accommodate changing restrictions along the way

Sentence 2
 people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months  all these things have made me ask the question whats the best way to build community right now how is this changing our perception of community thus im starting my letter to the community on the subject 

where can we find  casestudies  of  startups  that were b

In [9]:
# Same with the original content, containing ponctuations & uppercases
test = pd.read_pickle("cleaned_data.pkl")
test_content = test.content
sample_test_content = "\n\n".join(test_content[[100,1000]])
parsed_sample_test = nlp(sample_test_content)

for num, sentence in enumerate(parsed_sample_test.sents):
    print("Sentence {}".format(num+1))
    print(sentence)
    print('')

Sentence 1
For those intersted in youth career pathways.

Sentence 2
Great to read today about the expansion of Citi Foundation’s Pathways to Progress inititiave - new commitment to 500k young adults #JobReady #Jobs #Training  Joanne Gedge   Janet Searle   Louise Martin Lindsay   Amy-Lou Cowdroy-Ling   https://lnkd.in/g8FTr5w.

Sentence 3

 
 
 …see more

Community building has meant something dramatically different the past couple of months.

Sentence 4
 When 500 Startups hosted an event with 2400+ RSVPs, we had to pivot almost 8 times to accommodate changing restrictions along the way.

Sentence 5
 People are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.

Sentence 6
 All these things have made me ask the question, what's the best way to build community right now?

Sentence 7
How is this changing our perception of community?

Sentence 8
Thus, I'm starting my letter to the community on the subject.

Sentenc

### 

### Name entity detection

In [10]:
for num, entity in enumerate(parsed_posts_sample.ents):
    print("Entity {}".format(num+1), entity, '-', entity.label_)

Entity 1 congress - ORG
Entity 2 today - DATE
Entity 3 citi foundation’s - ORG
Entity 4 joanne gedge   janet searle   louise martin lindsay    - PERSON
Entity 5 the past couple of months - DATE
Entity 6 months - DATE
Entity 7 atlanta - GPE
Entity 8 one - CARDINAL
Entity 9 africa - LOC


* ORG : organism 
* GPE : geopolitcal entity
* LOC : location

As we can see, we can know which entity represents each words or group of words ! 

For instance, spacy knows when a group of words is a name, a location, a date ...  
That's clearly amazing !

Also, it appears that some words are badly comprehend by the algorithm like "us", as USA, which is here "us" designating we.


### Speech tagging

We can also define if a word is an adjective, a noun, or other...

In [11]:
token_text = [token.orth_ for token in parsed_posts_sample]
token_pos = [token.pos_ for token in parsed_posts_sample]

pd.DataFrame({"Token_text" : token_text , "Token_pos" : token_pos})

Unnamed: 0,Token_text,Token_pos
0,national,ADJ
1,disability,NOUN
2,advocate,NOUN
3,,SPACE
4,sara,PROPN
...,...,...
299,me,PRON
300,inspired,VERB
301,too,ADV
302,natgeotraveller,NOUN


### Text normalization (stemming/lemmatization and shape analysis)

Lemmatization consists in transforming a word into its root. 

For instance, "is" becomes "be" ; "me" becomes "I" ...

In [12]:
token_lemma = [token.lemma_ for token in parsed_posts_sample]
token_shape = [token.shape_ for token in parsed_posts_sample]

pd.DataFrame({'Token_text' : token_text, 'Token_lemma' : token_lemma , 'Token_shape' : token_shape})

Unnamed: 0,Token_text,Token_lemma,Token_shape
0,national,national,xxxx
1,disability,disability,xxxx
2,advocate,advocate,xxxx
3,,,
4,sara,sara,xxxx
...,...,...,...
299,me,I,xx
300,inspired,inspire,xxxx
301,too,too,xxx
302,natgeotraveller,natgeotraveller,xxxx


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?

* stopword
* punctuation
* whitespace
* represents a number
* whether or not the token is included in spaCy's default vocabulary?

In [13]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_posts_sample]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))

df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,national,-20.0,,,,,
1,disability,-20.0,,,,,
2,advocate,-20.0,,,,,
3,,-20.0,,,Yes,,Yes
4,sara,-20.0,,,,,
...,...,...,...,...,...,...,...
299,me,-20.0,Yes,,,,
300,inspired,-20.0,,,,,
301,too,-20.0,Yes,,,,
302,natgeotraveller,-20.0,,,,,Yes


* Log_probability represents the frequency of a word apparation in the text : 
    *  ~ 0 if appears often
    * =! 0 if appears rarely 
    

* Stop ? : Is this word a stop word ?

* Out of vocab ? : Is this word out of the english dictionary proposed by Spacy


## NLP preprocessing : application

### Phrase modeling

Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the the words in our reviews and looking for words that co-occur (i.e., appear one after another) together much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens  and  constitute a phrase is:

$$ \frac{count (AB) - count_{min}}{count(A)*count(B)} * N > treshold $$

 
...where:

 * $count(A)$ is the number of times token $A$ appears in the corpus
 * $count(B)$ is the number of times token $B$ appears in the corpus
 * $count(AB)$ is the number of times the tokens $AB$ appear in the corpus in order
 * $N$ is the total size of the corpus vocabulary
 * $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
 * $treshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.

We turn to the indispensible **gensim** library to help us with phrase modeling — the Phrases class in particular.

In [14]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

* Segment text of complete reviews into sentences & normalize text
* First-order phrase modeling  apply first-order phrase model to transform sentences
* Second-order phrase modeling  apply second-order phrase model to transform sentences
* Apply text normalization and second-order phrase model to text of complete reviews
* We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the **lemmatized_sentence_corpus** generator function will use spaCy to:

In [33]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """

    return token.is_punct or token.is_space
            
def corpus_cleaning(serie):
    """
    generator function to use spaCy to parse posts,
    lemmatize the text, remove punctuations, unconvenient whitespaces, stopwords, and names
    """
    
    for parsed_post in nlp.pipe(serie):
      yield ' '.join([token.lemma_ for token in parsed_post
                             if not punct_space(token)
                             if not token in nlp.Defaults.stop_words #stopwords
                             if token.pos_ != "PROPN"]) #name & surname

Here we used *yield* instead of *return*.  
The reason is well explained here : https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do

Long story short, *yield* returns a generator, which is an iterable, as a list or a string for instance, but does not store all the values in the memory.

Generators recquire less memory than a list that is very useful if the dataset is very large !

Therefore, **we can only iterate over once**. 

To understand how to use spacy with a pandas, here a clearly explained article : https://towardsdatascience.com/structured-natural-language-processing-with-pandas-and-spacy-7089e66d2b10

We first need to apply the spaCy language model to the entire collection of posts. The easiest and most computationally efficient way to do this is to use the *nlp.pipe* function. This will iterate over each document and will apply the language model.

In [30]:
#Let's apply the function *CorpusCleaning* to create a parsed list
parsed_posts = corpus_cleaning(corpus.Content)
type(parsed_posts)

#We can print cleaned posts with 
#list(parsed_posts)
#But it's time consuming

generator

Let's see some examples.

To iterate on a generator, we can use the *itertools* package, which is a common practice with generator objects.

it.islice is an iterator designed to iterate over an object. Because we can't iterate directly over a generator (can't be subscriptable), this function is pretty useful !

In [31]:
import itertools as it

In [32]:
for post in it.islice(parsed_posts,5):
    print(post)
    print('----')

write that achieve a healthy future of work require employee to build skill that help they attain productive and rewarding career he note one of the most costeffective way to do this be through apprenticeship which help worker master occupation and gain professional identity and pride coudlnt agree more workbasedlearne apprenticeship read the article on urbanwire institute
----
national disability advocate ms share how pass the able act
----
explore in this month talent management hr what a company should consider to get the most out of a modern apprenticeship program thank to employer entrepreneuer for share insight on your it program why not start a program in wish you all a safe and happy festive season careerplanne apprenticeship workbasedlearne career institute dimeny
----
I count myself fortunate to have spend time with be the assistant secretary for policy evaluation and research in the department of labor during the his insight and innovative thinking around economic employment

**Important note**

When we iterate over a generator, the value saved in it is then deleted ! 

That confirms the definition of a generator : we can only iterate over it once !

**That also means if we want to use several times a generator, we have to recreate one for the purpose !**

For instance, here, we can't use anymore the 6 first values contained in the generator !

A common practice consists in iterating on a generator in this way :

```
for i in create_a_generator_function(y):   
    print(i)
```

Now, let's group word together with the **gensim** library 