# NLP Cleaning

In this notebook, we adapt the cleaned dataset so that Natural Language Processing (NLP) techniques can be applied to it.  
Several columns contain free text that we can transform into an exploitable dataset. 

## Libraries

In [57]:
import pickle
import pandas as pd
import warnings
warnings.filterwarnings('ignore') #Remove warnings

## Import the data

In [58]:
data = pd.read_pickle('cleaned_data.pkl')

## First look

### Overall

In [59]:
data.head(3)

Unnamed: 0,name,headline,about,content,content_links,media_url,num_hashtags,hashtags,reactions,comments,Locations,Followers,Time_spent,Media_type
0,Nicholas Wyman,CEO IWSI Group,Nicholas Wyman for the past 25 years has shone...,Robert Lerman writes that achieving a healthy...,[['https://www.linkedin.com/in/ACoAAACy1HkBviR...,['https://www.urban.org/urban-wire/its-time-mo...,4,"[['#workbasedlearning', 'https://www.linkedin....",12,1,Unknown,6484.0,1 day ago,article
1,Nicholas Wyman,CEO IWSI Group,Nicholas Wyman for the past 25 years has shone...,"National disability advocate Sara Hart Weir, ...",[['https://www.linkedin.com/in/ACoAAAHsfJgBb7_...,[],0,[],11,0,Unknown,6484.0,1 week ago,none
3,Nicholas Wyman,CEO IWSI Group,Nicholas Wyman for the past 25 years has shone...,Exploring in this months Talent Management & H...,[['https://www.linkedin.com/in/ACoAAAADlGIBLfn...,['https://www.tlnt.com/apprenticeships-that-br...,4,"[['#careerplanning', 'https://www.linkedin.com...",44,0,Unknown,6484.0,2 months ago,article


### Content

In [60]:
data.content[1000]

"Community building has meant something dramatically different the past couple of months.  When 500 Startups hosted an event with 2400+ RSVPs, we had to pivot almost 8 times to accommodate changing restrictions along the way.  People are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months.  All these things have made me ask the question, what's the best way to build community right now? How is this changing our perception of community? Thus, I'm starting my letter to the community on the subject. \n \n \n …see more"

**Comments**  
The content ends with "…see more", which means the post is truncated: we don't have the full text.  
Let's see whether the links provided in the dataset allow us to retrieve the entire content. 

### About

In [61]:
data.about[1000]

"I build communities, content, and products centered on entrepreneurship, social impact, and female empowerment. I am currently focused on leading marketing partnerships and content at 500 Startups (2019's most active early-stage VC), creating community hubs and content for diverse founders, and advising FLIK: a platform dedicated to connecting female founders with female apprentices. Prior to 500, I started my career by creating Linkedin content, working in digital entertainment, and even venturing into my own events startup. In 2015, I was the youngest recipient of Linkedin's Top Voice award and would go on to cultivate an audience over 400K.I'm always looking to connect with female founders, heads of diversity, community managers, VCs, and those looking to make a difference!PGPs: she/her/hersThoughts are my own."

### Headline

In [62]:
print(data.headline[1000])
print(data.headline[100])
print(data.headline[10000])

Marketing @ 500 Startups | 3x Linkedin Top Voice
CEO IWSI Group
Founder and CEO of One Million by One Million (1Mby1M)


**Comments**  
With headlines, we may be able to identify the position of the person.  
We can also analyse which words or expressions stand out ("help businesses", "Start-up", ...).

### Content links

In [63]:
print(data.content_links[100])
print("--------------")
print(data.content_links[10000])

[['https://www.linkedin.com/in/ACoAAAaojZkBD0OpJLI3LCDMMNVrCqhzr1ty4Wk', 'Joanne Gedge'], ['https://www.linkedin.com/in/ACoAAAMT1vIBSDvXwNANtuFumamRcEzb3AyQo8k', 'Janet Searle'], ['https://www.linkedin.com/in/ACoAAADZoYwBIARReJ9lqzB2Kwf9YhVyQUM7qSg', 'Louise Martin Lindsay'], ['https://www.linkedin.com/in/ACoAAAzDh6wBOeZqRnp9XtyoKDDOkM0pxfS6G5s', 'Amy-Lou Cowdroy-Ling'], ['https://lnkd.in/g8FTr5w', 'https://lnkd.in/g8FTr5w.']]
--------------
[['https://www.linkedin.com/feed/hashtag/?keywords=casestudies&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#casestudies'], ['https://www.linkedin.com/feed/hashtag/?keywords=startups&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#startups'], ['https://www.linkedin.com/feed/hashtag/?keywords=team&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#team'], ['https://www.linkedin.com/feed/hashtag/?keywords=entrepreneurs&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#entrep

**Comments**  
There are 3 types of links:
1. Mentions of other accounts/people
2. Hashtags
3. Links to websites

The data is organized as follows:
[[ "Link" , "mention/hashtag/website_link" ]]

We can easily differentiate these 3 types by looking at the second element of each pair. For instance:  
1. Capital letter at the beginning of the string
2. "#"
3. "http"
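
The rule above can be sketched as a small classifier. This is only an illustration: `parse_links` and `classify_link` are hypothetical helper names, the cell is assumed to be either a list or its string representation, and anything that is neither a hashtag nor a URL is treated as a mention (a looser variant of the "capital letter" test).

```python
import ast


def parse_links(cell):
    """Return a list of [url, label] pairs from a content_links cell."""
    # Assumption: the cell may be stored as a string such as "[['url', 'label']]".
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


def classify_link(pair):
    """Classify one [url, label] pair by looking at its label."""
    url, label = pair
    if label.startswith("#"):
        return "hashtag"
    if label.startswith("http"):
        return "website"
    # Fallback: any other label is assumed to be a person/company mention.
    return "mention"


sample = [
    ["https://www.linkedin.com/in/ACoAAAaojZkBD0OpJLI3LCDMMNVrCqhzr1ty4Wk", "Joanne Gedge"],
    ["https://www.linkedin.com/feed/hashtag/?keywords=startups&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104", "#startups"],
    ["https://lnkd.in/g8FTr5w", "https://lnkd.in/g8FTr5w."],
]
kinds = [classify_link(pair) for pair in parse_links(sample)]
print(kinds)  # ['mention', 'hashtag', 'website']
```

Counting each kind per post would then give the mention/hashtag/website features directly.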

### Media URL

In [64]:
print(data.media_url[10000])
print('-----------')
print(data.content_links[10000])

['https://www.sramanamitra.com/2020/10/16/where-can-i-find-case-studies-of-how-entrepreneurs-build-tech-companies/']
-----------
[['https://www.linkedin.com/feed/hashtag/?keywords=casestudies&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#casestudies'], ['https://www.linkedin.com/feed/hashtag/?keywords=startups&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#startups'], ['https://www.linkedin.com/feed/hashtag/?keywords=team&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#team'], ['https://www.linkedin.com/feed/hashtag/?keywords=entrepreneurs&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#entrepreneurs']]


In [65]:
#Let's see the composition of media_url
data.media_url.value_counts().head()

#We observe some [] in it, which means that some contents (a lot actually) don't have media


[]                                                                                           6893
['https://www.linkedin.com/newsletters/the-future-of-digital-health-6501324601757442048']     130
['https://www.sramanamitra.com/2020/04/29/bootstrapping-course-welcome/']                     110
['https://www.linkedin.com/newsletters/cloud-stock-analysis-6494194802798788608']             101
['https://1m1m.sramanamitra.com/investor-introduction/']                                       97
Name: media_url, dtype: int64

**Comments**  
When a media item is attached to a post, an option offered by LinkedIn, its link is stored in this column.   

We observe the difference between "content_links" & "media_url" here:  
* Content_links gathers all the links inside the written content. 
* Media_url gathers the links of the attached media, if there is any.


### Hashtags

In [66]:
print(data.hashtags[10000])
print("------------")
print(data.content_links[10000])

[['#casestudies', 'https://www.linkedin.com/feed/hashtag/?keywords=casestudies&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104'], ['#startups', 'https://www.linkedin.com/feed/hashtag/?keywords=startups&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104'], ['#team', 'https://www.linkedin.com/feed/hashtag/?keywords=team&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104'], ['#entrepreneurs', 'https://www.linkedin.com/feed/hashtag/?keywords=entrepreneurs&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104']]
------------
[['https://www.linkedin.com/feed/hashtag/?keywords=casestudies&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#casestudies'], ['https://www.linkedin.com/feed/hashtag/?keywords=startups&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#startups'], ['https://www.linkedin.com/feed/hashtag/?keywords=team&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6743766447648047104', '#team'], ['http

**Comments**  
We observe that the hashtag links in content_links are also present in the hashtags column, with the two elements of each pair in the opposite order.  
Thus, this information appears twice in the dataset. 
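
A quick way to check this duplication on one post is to rebuild the hashtag pairs from content_links and compare them with the hashtags column. The sketch below uses shortened, made-up URLs and a hypothetical helper name, and assumes both cells are already parsed into Python lists.

```python
def hashtags_from_links(content_links):
    """Rebuild [tag, url] pairs from the [url, label] pairs of content_links."""
    return [[label, url] for url, label in content_links if label.startswith("#")]


# Shortened, illustrative values (not the real URLs of the dataset).
content_links = [
    ["https://example.org/hashtag/casestudies", "#casestudies"],
    ["https://example.org/in/some-profile", "Some Person"],
    ["https://example.org/hashtag/startups", "#startups"],
]
hashtags = [
    ["#casestudies", "https://example.org/hashtag/casestudies"],
    ["#startups", "https://example.org/hashtag/startups"],
]

is_duplicated = hashtags_from_links(content_links) == hashtags
print(is_duplicated)  # True
```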

## Conclusions on the first look

To wrap it up:

* Contents are not complete. A new scraping would be required to gather the full text. **However**, we can use the first sentences to draw some insights. Indeed, the opening lines are usually more important than the core of the content. 
* Headlines contain a lot of information, like position or business.
* Content_links contains the links of the mentions/hashtags/websites used in the post. Hashtags are also stored in their own column. 
* Media_URL is filled when a media item is attached to the post. 
* The Time_spent column could be converted into days to make the analysis easier.
* The "About" section is complete and is a good way to get to know the person in detail.
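
As an illustration of the Time_spent conversion, here is a minimal sketch. The function name and the unit table are assumptions, and months and years are rounded to 30 and 365 days, so the result is only approximate.

```python
import re

# Approximate number of days per unit (assumption: 30-day months, 365-day years).
DAYS_PER_UNIT = {
    "minute": 1 / 1440,
    "hour": 1 / 24,
    "day": 1,
    "week": 7,
    "month": 30,
    "year": 365,
}


def time_spent_to_days(value):
    """Convert a string such as '2 months ago' into an approximate number of days."""
    match = re.match(r"(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago", value.strip())
    if match is None:
        return None  # unrecognized format: left for manual inspection
    count, unit = match.groups()
    return int(count) * DAYS_PER_UNIT[unit]


print(time_spent_to_days("1 week ago"))    # 7
print(time_spent_to_days("2 months ago"))  # 60
print(time_spent_to_days("4 years ago"))   # 1460
```

On the dataframe, the same function could be applied with `data['Time_spent'].map(time_spent_to_days)`.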

In [67]:
data.head(2)

Unnamed: 0,name,headline,about,content,content_links,media_url,num_hashtags,hashtags,reactions,comments,Locations,Followers,Time_spent,Media_type
0,Nicholas Wyman,CEO IWSI Group,Nicholas Wyman for the past 25 years has shone...,Robert Lerman writes that achieving a healthy...,[['https://www.linkedin.com/in/ACoAAACy1HkBviR...,['https://www.urban.org/urban-wire/its-time-mo...,4,"[['#workbasedlearning', 'https://www.linkedin....",12,1,Unknown,6484.0,1 day ago,article
1,Nicholas Wyman,CEO IWSI Group,Nicholas Wyman for the past 25 years has shone...,"National disability advocate Sara Hart Weir, ...",[['https://www.linkedin.com/in/ACoAAAHsfJgBb7_...,[],0,[],11,0,Unknown,6484.0,1 week ago,none


## Analysis ideas

1. Study the content as it is. Even if it is not complete, we can consider in our analysis the first sentences present in this dataset. NLP techniques can easily be applied here.
2. NLP techniques applied to headlines. 
3. Count mentions/hashtags in posts
4. Influence of media presence (media or not: 1-0)
5. NLP applied to the About section
6. Relation between number of followers & reactions/comments
7. Influence of the media type

Reactions & Comments will be the main indicators used to decide whether a post performed well or not. 

## Datasets creation

For each analysis, we are going to create adapted datasets and do some cleaning of the original one.

**Cleaning The Data**    

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**

* Make text all lower case  
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**

* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
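
To make a few of these steps concrete, here is a minimal sketch of tokenization, stop-word removal and bi-gram creation in plain Python. The stop-word set is a tiny illustrative list, not the one used by scikit-learn, and the function names are hypothetical.

```python
# Tiny illustrative stop-word list (assumption: a real list is much longer).
STOP_WORDS = {"a", "the", "to", "of", "is", "in"}


def tokenize(text):
    """Split an already cleaned text on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in STOP_WORDS]


def bigrams(tokens):
    """Return the list of consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


tokens = tokenize("the best way to build community")
content_tokens = remove_stop_words(tokens)
print(content_tokens)
# ['best', 'way', 'build', 'community']
print(bigrams(content_tokens))
# [('best', 'way'), ('way', 'build'), ('build', 'community')]
```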

### Content dataset

For the content, we will apply NLP techniques.  
To do so, we need a **Corpus** & a **Document-term matrix**

A **Corpus** is a collection of texts.

A **Document-term matrix** is a dataframe with one row per document and one column per word; each cell holds the number of times this word appears in the document.
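
To illustrate the idea on a toy corpus, the matrix can be built by hand with the standard library. This is only a sketch of the concept on two made-up documents, not the method used in the rest of the notebook.

```python
from collections import Counter

# Toy corpus of two already cleaned documents (made-up examples).
corpus = ["build community now", "community support community"]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per document: how many times each word of the vocabulary appears.
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in corpus]

print(vocabulary)  # ['build', 'community', 'now', 'support']
print(matrix)      # [[1, 1, 1, 0], [0, 2, 0, 1]]
```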

#### Corpus

In [68]:
contentCorpus = pd.DataFrame()
contentCorpus[['Name','Content','#Reactions','#Comments','Location','Followers','Time_spent','Media_type']] = data[['name','content','reactions', 'comments', 'Locations','Followers', 'Time_spent', 'Media_type']]
contentCorpus.head()

Unnamed: 0,Name,Content,#Reactions,#Comments,Location,Followers,Time_spent,Media_type
0,Nicholas Wyman,Robert Lerman writes that achieving a healthy...,12,1,Unknown,6484.0,1 day ago,article
1,Nicholas Wyman,"National disability advocate Sara Hart Weir, ...",11,0,Unknown,6484.0,1 week ago,none
3,Nicholas Wyman,Exploring in this months Talent Management & H...,44,0,Unknown,6484.0,2 months ago,article
4,Nicholas Wyman,I count myself fortunate to have spent time wi...,22,2,Unknown,6484.0,2 months ago,article
5,Nicholas Wyman,Online job platforms are a different way of wo...,21,1,Unknown,6484.0,2 months ago,article


**Cleaning**  
Let's apply the following steps:
* Convert the text to lowercase
* Remove punctuation
* Remove words containing numbers
* Remove # (hashtags) => we only remove the # symbol because some authors write #word inside their sentences
* Remove the trailing "\n \n \n …see more"

In [69]:
import re
import string

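def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation, 
        remove words containing numbers, remove # and remove "see more" at the end of the LinkedIn post'''
    text = text.lower()
    # Raw strings avoid invalid escape sequence warnings in the patterns
    text = re.sub(r'\[.*?\] ', '', text)  # text in square brackets followed by a space
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation, '#' included
    text = re.sub(r'\w*\d\w*', '', text)  # words containing digits
    text = re.sub('#', '', text)  # already covered by the punctuation step, kept for explicitness
    text = re.sub("\n \n \n …see more", '', text)  # truncation marker ('…' is not ASCII punctuation)
    return text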

cleaning_text = lambda x: clean_text(x)

In [70]:
#Clean the content and create a new column              
contentCorpus['CleanedContent'] = contentCorpus['Content'].apply(cleaning_text)   

In [71]:
#Drop the previous column "Content"
contentCorpus = contentCorpus.drop(['Content'],axis=1)
contentCorpus.head(2)

Unnamed: 0,Name,#Reactions,#Comments,Location,Followers,Time_spent,Media_type,CleanedContent
0,Nicholas Wyman,12,1,Unknown,6484.0,1 day ago,article,robert lerman writes that achieving a healthy...
1,Nicholas Wyman,11,0,Unknown,6484.0,1 week ago,none,national disability advocate sara hart weir m...


In [72]:
#Let's see some examples
print(contentCorpus.CleanedContent[100])
print('---------')
print(contentCorpus.CleanedContent[1000])
print('---------')
print(contentCorpus.CleanedContent[10000])

for those intersted in youth career pathways great to read today about the expansion of citi foundation’s pathways to progress inititiave  new commitment to  young adults jobready jobs training  joanne gedge   janet searle   louise martin lindsay   amylou cowdroyling    
---------
community building has meant something dramatically different the past couple of months  when  startups hosted an event with  rsvps we had to pivot almost  times to accommodate changing restrictions along the way  people are rallying around their communities to show support for groups like healthcare workers while staying in their homes for months  all these things have made me ask the question whats the best way to build community right now how is this changing our perception of community thus im starting my letter to the community on the subject 
---------
where can we find  casestudies  of  startups  that were built by a  team  of  entrepreneurs  virtually


We note that some names are present in the text.  
Nevertheless, the Corpus can be exploited right now for a *first attempt*

In [73]:
#Let's rename CleanedContent into Content for better understanding
contentCorpus = contentCorpus.rename({"CleanedContent":"Content"},axis=1)
contentCorpus

Unnamed: 0,Name,#Reactions,#Comments,Location,Followers,Time_spent,Media_type,Content
0,Nicholas Wyman,12,1,Unknown,6484.0,1 day ago,article,robert lerman writes that achieving a healthy...
1,Nicholas Wyman,11,0,Unknown,6484.0,1 week ago,none,national disability advocate sara hart weir m...
3,Nicholas Wyman,44,0,Unknown,6484.0,2 months ago,article,exploring in this months talent management hr...
4,Nicholas Wyman,22,2,Unknown,6484.0,2 months ago,article,i count myself fortunate to have spent time wi...
5,Nicholas Wyman,21,1,Unknown,6484.0,2 months ago,article,online job platforms are a different way of wo...
...,...,...,...,...,...,...,...,...
34007,Simon Sinek,4005,93,Unknown,4206024.0,4 years ago,image,igniter of the year well i know that im an op...
34008,Simon Sinek,1698,74,Unknown,4206024.0,4 years ago,video,executives who prioritize the shareholder are ...
34009,Simon Sinek,661,59,Unknown,4206024.0,4 years ago,video,like many i too have been reflecting as we nea...
34010,Simon Sinek,766,35,Unknown,4206024.0,4 years ago,video,if you say customer first that means your empl...


In [74]:
#Let's pickle the dataset for later use
contentCorpus.to_pickle("contentCorpus.pkl")

#### Document-term matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

*Memory issue*

It appears that creating a Document-term matrix of all the content is not feasible here because the resulting dataset is too large to fit in memory. To avoid this issue, we are going to build it for only a few authors.  
Moreover, in order to analyze what makes a LinkedIn post good, it is better to compare posts of the same author only.  
Let's take Simon Sinek, a well-known personal development author and public speaker.
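
A rough estimate shows why the dense matrix grows so quickly. The sketch below assumes 8 bytes per cell (64-bit integer counts); the first call uses the (245, 1530) shape reported further down for Simon Sinek, while the second uses purely hypothetical figures for a full corpus.

```python
def dense_size_gb(n_documents, n_terms, bytes_per_cell=8):
    """Approximate size in GB of a dense document-term matrix."""
    return n_documents * n_terms * bytes_per_cell / 1e9


# Single-author matrix: 245 documents x 1530 terms, about 3 MB.
print(dense_size_gb(245, 1530))

# Hypothetical full corpus: 30,000 documents x 50,000 terms (illustrative figures only).
print(dense_size_gb(30_000, 50_000))  # 12.0
```

Since most cells of such a matrix are zeros, keeping it in a sparse format instead of converting it to a dense dataframe is another way to limit memory use.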


In [103]:
#Corpus of Simon Sinek's content
sinekCorpus = contentCorpus.loc[contentCorpus["Name"] == "Simon Sinek"]
#Reset index to allow concat with dataframe
sinekCorpus = sinekCorpus.reset_index(drop=True) #drop to avoid current index into the dataframe 

print(sinekCorpus.shape)
sinekCorpus.head(2)

(245, 8)


Unnamed: 0,Name,#Reactions,#Comments,Location,Followers,Time_spent,Media_type,Content
0,Simon Sinek,12093,257,Unknown,4206024.0,23 hours ago,none,we are only in charge when we are willing to l...
1,Simon Sinek,5415,164,Unknown,4206024.0,23 hours ago,none,when the people have to manage dangers from in...


In [104]:
#Drop useless columns for this analysis too.
sinekCorpus.drop(['Name','Location','Time_spent','Media_type','Followers'],axis=1,inplace=True)

In [105]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
dataCv = cv.fit_transform(sinekCorpus.Content)
dataDtm = pd.DataFrame(dataCv.toarray(), columns=cv.get_feature_names_out())

print(dataDtm.shape)
dataDtm.head(3)

(245, 1530)


Unnamed: 0,ability,abitofoptimism,able,abnormal,absolutely,accept,accepting,accepts,achieve,act,...,years,yesterday,yin,york,youll,youre,youtube,youtubecomsimonsinek,youve,única
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [106]:
#Let's concat this Document-term matrix with data to perform EDA
sinekDtm = pd.concat([sinekCorpus,dataDtm],axis=1)

print(sinekDtm.shape)
sinekDtm.head(2)

(245, 1533)


Unnamed: 0,#Reactions,#Comments,Content,ability,abitofoptimism,able,abnormal,absolutely,accept,accepting,...,years,yesterday,yin,york,youll,youre,youtube,youtubecomsimonsinek,youve,única
0,12093,257,we are only in charge when we are willing to l...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5415,164,when the people have to manage dangers from in...,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [102]:
#Let's pickle the Document-term matrix
sinekDtm.to_pickle("sinekDtm.pkl")

**Corpus for each author**

We will also study each creator's content and compare it, for instance, with the number of followers.

We can create a corpus for every author.  
To do so, let's gather all the posts of each author into a single text.  
Then we will apply CountVectorizer to each author.  
Easy!

In [89]:
#Work on a copy so that the column assignment below does not target a view of contentCorpus
contentAuthorCorpus = contentCorpus[['Name','Followers','Content']].copy()

In [90]:
#Let's gather the content for each author
#We replace the 'Content' column
contentAuthorCorpus['Content'] = contentAuthorCorpus.groupby(['Name'])['Content'].transform(lambda x : ' '.join(x))

#groupby groups the rows by author, like a pivot_table
#transform applies the function to each group and returns a Series aligned with the original rows
#' '.join concatenates the posts of a group, separated by a single space

#Remove duplicates due to .transform
contentAuthorCorpus = contentAuthorCorpus.drop_duplicates() 
contentAuthorCorpus.reset_index(inplace=True,drop=True)
contentAuthorCorpus

Unnamed: 0,Name,Followers,Content
0,Nicholas Wyman,6484.0,robert lerman writes that achieving a healthy...
1,Jonathan Wolfer,2462.0,proud of this new feature at douglass this yea...
2,Karen Gross,88720.0,a piece worth reading i hope with suggestions ...
3,Kaia Niambi Shivers Ph.D.,3725.0,i remember native read went low but we’re ...
4,Daniel Cohen-I'm Flyering,28605.0,passion is one of those qualities you can’t sp...
...,...,...,...
63,Quentin Michael Allums,66387.0,comparing your career to someone else’s is lik...
64,AJ Wilcox,19101.0,i am so excited to be a part of this webinar c...
65,Kevin O'Leary,2826016.0,clearly not all crypto currencies are created ...
66,Amy Blaschka,64548.0,news flash what’s preventing your career progr...


In [91]:
#Export the dataset
contentAuthorCorpus.to_pickle('contentAuthorCorpus.pkl')

**Document-Term Matrix for headlines**