# LinkedIn Posts Analysis

## Question: what makes a post good?

Many businesses today depend on the relationships they build with prospects and clients.  
In addition, many "LinkedIn experts" offer services that help companies increase their visibility on this platform.  

Understanding what makes a post good on LinkedIn is essential to extend reach, increase interactions with prospects, build trust, and therefore grow business revenue.

In this notebook, we study a dataset provided on Kaggle to understand which criteria make a post really "good" on this social network.

We define a "good post" as content that generates interactions with the community.
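
One possible way to quantify "interactions" is to add up reactions and comments per post and, optionally, relate that total to the author's audience size. The sketch below uses a tiny made-up frame. The column names `reactions`, `comments`, and `followers` match the dataset, but the `engagement` and `engagement_rate` definitions are assumptions for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Toy data for illustration only (not taken from the dataset).
posts = pd.DataFrame({
    "reactions": [12, 300, 45],
    "comments": [1, 40, 5],
    "followers": [6484.0, 100000.0, 2500.0],
})

# Assumed definition: engagement = reactions + comments.
posts["engagement"] = posts["reactions"] + posts["comments"]

# Assumed normalisation: interactions per 1,000 followers.
posts["engagement_rate"] = posts["engagement"] / posts["followers"] * 1000

print(posts)
```

Normalising by followers is a design choice: it makes small and large accounts comparable, at the cost of favouring accounts with few followers.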

### Dataset description on Kaggle

This dataset contains LinkedIn influencers' post details and other details (post-dependent as well as post-independent) per post. It can be used to analyze LinkedIn reach based on post content and related account details.

This dataset is great for Exploratory Data Analysis and NLP tasks.

The data was scraped using BeautifulSoup and Selenium. Last updated on 15th Feb, 2021.

### Principal steps of this analysis
* Understand the dataset
* Cleaning
* EDA

# Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import requests
import io

# Downloading the dataset

In [2]:
# # Downloading the CSV file from my GitHub account

# # We copy the "raw" link
# url = "https://github.com/JeremyArancio/Linkedin_Post_Analysis/blob/main/influencers_data.csv?raw=true"
# download = requests.get(url).content

# # Reading the downloaded content and turning it into a pandas dataframe
# df = pd.read_csv(io.StringIO(download.decode('utf-8')))

In [3]:
df = pd.read_csv('influencers_data.csv')


# First look

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34012 entries, 0 to 34011
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         34012 non-null  int64  
 1   name               34012 non-null  object 
 2   headline           34012 non-null  object 
 3   location           31740 non-null  object 
 4   followers          33970 non-null  float64
 5   connections        25713 non-null  object 
 6   about              34012 non-null  object 
 7   time_spent         34011 non-null  object 
 8   content            31996 non-null  object 
 9   content_links      34012 non-null  object 
 10  media_type         26779 non-null  object 
 11  media_url          34012 non-null  object 
 12  num_hashtags       34012 non-null  int64  
 13  hashtag_followers  34012 non-null  int64  
 14  hashtags           34012 non-null  object 
 15  reactions          34012 non-null  int64  
 16  comments           340

In [5]:
# Percentage of NaN values per column
df.isna().sum() / df.shape[0] * 100

Unnamed: 0             0.000000
name                   0.000000
headline               0.000000
location               6.679995
followers              0.123486
connections           24.400212
about                  0.000000
time_spent             0.002940
content                5.927320
content_links          0.000000
media_type            21.266024
media_url              0.000000
num_hashtags           0.000000
hashtag_followers      0.000000
hashtags               0.000000
reactions              0.000000
comments               0.000000
views                100.000000
votes                 99.747148
dtype: float64
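
The same percentages can be used to shortlist columns that are almost empty. The sketch below uses a toy frame and an arbitrary 90% threshold; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "views": [np.nan, np.nan, np.nan, np.nan],
    "location": ["x", None, "y", "z"],
})

# Share of missing values per column, in percent
# (equivalent to isna().sum() / shape[0] * 100).
nan_pct = toy.isna().mean() * 100

# Hypothetical threshold: flag columns with more than 90% missing values.
threshold = 90
mostly_empty = nan_pct[nan_pct > threshold].index.tolist()
print(mostly_empty)
```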

In [6]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,name,headline,location,followers,connections,about,time_spent,content,content_links,media_type,media_url,num_hashtags,hashtag_followers,hashtags,reactions,comments,views,votes
0,0,Nicholas Wyman,CEO IWSI Group,,6484.0,500+,Nicholas Wyman for the past 25 years has shone...,1 day ago,Robert Lerman writes that achieving a healthy...,[['https://www.linkedin.com/in/ACoAAACy1HkBviR...,article,['https://www.urban.org/urban-wire/its-time-mo...,4,0,"[['#workbasedlearning', 'https://www.linkedin....",12,1,,


## Observations for the Comments section

In [7]:
#Let's check Time_spent
df.time_spent.value_counts().head(3)

1 year ago     7753
2 years ago    5728
3 years ago    3759
Name: time_spent, dtype: int64

In [8]:
#Let's check Media_type
df.media_type.value_counts()

article       15144
image          8708
video          2690
document        113
poll             86
entity           32
newsletter        4
view              2
Name: media_type, dtype: int64

In [9]:
#Content_links
df.content_links[0]

"[['https://www.linkedin.com/in/ACoAAACy1HkBviRGLfLG__Jk8FRH2JY2rGg3nTU', 'Robert Lerman'], ['https://www.linkedin.com/feed/hashtag/?keywords=workbasedlearning&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360', '#workbasedlearning'], ['https://www.linkedin.com/feed/hashtag/?keywords=usa&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360', '#USA'], ['https://www.linkedin.com/feed/hashtag/?keywords=apprenticeship&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360', '#apprenticeship'], ['https://www.linkedin.com/feed/hashtag/?keywords=urbanwire&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360', '#UrbanWire'], ['https://www.linkedin.com/company/urban-institute/', 'Urban Institute']]"

In [10]:
#Media_url
df.media_url.head()[0]

"['https://www.urban.org/urban-wire/its-time-modernize-american-apprenticeship-system']"

In [11]:
#Hashtags_followers
df.hashtag_followers.value_counts()

0    34012
Name: hashtag_followers, dtype: int64

In [12]:
#hashtags
df.hashtags[0]

"[['#workbasedlearning', 'https://www.linkedin.com/feed/hashtag/?keywords=workbasedlearning&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360'], ['#USA', 'https://www.linkedin.com/feed/hashtag/?keywords=usa&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360'], ['#apprenticeship', 'https://www.linkedin.com/feed/hashtag/?keywords=apprenticeship&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360'], ['#UrbanWire', 'https://www.linkedin.com/feed/hashtag/?keywords=urbanwire&highlightedUpdateUrns=urn%3Ali%3Aactivity%3A6765387069389967360']]"

In [13]:
df.votes.value_counts()

265.0    3
240.0    2
141.0    2
256.0    1
305.0    1
        ..
195      1
881.0    1
59       1
2,131    1
387.0    1
Name: votes, Length: 82, dtype: int64
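
The `votes` values above mix floats, plain digit strings, and strings with thousands separators. If the column were kept, it could be normalised to numbers roughly as follows. This is a minimal sketch that assumes the comma is always a thousands separator; the helper `parse_votes` and the toy values are hypothetical.

```python
import pandas as pd

# Toy values mimicking the mixed formats seen in the output above.
votes = pd.Series([265.0, "195", "2,131", None], dtype="object")

def parse_votes(value):
    """Convert one raw vote count to float, keeping missing values as NaN."""
    if pd.isna(value):
        return float("nan")
    # Remove thousands separators before converting.
    return float(str(value).replace(",", ""))

votes_clean = votes.map(parse_votes)
print(votes_clean.tolist())
```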

In [14]:
#Connections
df.connections.value_counts()

500+    25633
171        80
Name: connections, dtype: int64

In [15]:
#Location
df.location.value_counts().head()

['New', 'York,', 'New', 'York,', 'United', 'States']    4234
['Miami,', 'Florida,', 'United', 'States']              1959
['Hungary']                                             1846
['Cambridge,', 'England,', 'United', 'Kingdom']         1784
['New', 'York', 'City', 'Metropolitan', 'Area']         1738
Name: location, dtype: int64

In [16]:
#Location nan
df.location.isna().sum()

2272

In [17]:
#Media types
df.media_type.value_counts()

article       15144
image          8708
video          2690
document        113
poll             86
entity           32
newsletter        4
view              2
Name: media_type, dtype: int64

In [18]:
#Media types
df.media_type.isna().sum()

7233

## Comments

We have the following columns:
* Unnamed: leftover index column, to delete
* Name: LinkedIn profile name
* Headline: text shown below the name on LinkedIn
* Location: some NaN; stored as a list of tokens, for example ['Gloucester,', 'Massachusetts,', 'United', 'States']
* Followers: number of followers (connections + followers on LinkedIn); this dataset is composed of influencers with a lot of followers
* Connections: only 2 different values (500+ and 171), so it carries almost no information => delete it
* About: "About" section of the profile
* time_spent: time between the post date and the creation of this dataset (categorical values: 1 month ago, 5 years ago, ...)
* Content: text of the post
* Content_links: every link contained in the post: profiles, hashtags, media (website, picture, ...)
* Media_type: media attached to the post: article, image, video, ...; some NaN
* Media_url: link to the media
* Num_hashtags: number of hashtags in the post
* Hashtag_followers: always 0 => delete it
* Hashtags: list of [hashtag, link] pairs
* Reactions: number of likes, supports, loves, ... all counted together
* Comments: number of comments
* Views: only NaN => delete it
* Votes: about 99.7% NaN; the remaining rows hold 82 distinct values in mixed formats (265.0, 195, 2,131). Because the column is insignificant, we can delete it
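
Several columns (`content_links`, `media_url`, `hashtags`) hold Python-style lists stored as plain strings. One way to turn such a string back into a list is `ast.literal_eval` from the standard library. The sketch below uses a shortened, made-up value in the same shape as `hashtags`; the URLs are placeholders.

```python
import ast

# Shortened example in the same shape as the `hashtags` column
# (placeholder URLs, not real dataset values).
raw_hashtags = "[['#workbasedlearning', 'https://example.com/a'], ['#USA', 'https://example.com/b']]"

# literal_eval only evaluates Python literals (lists, strings, numbers),
# so it does not execute arbitrary code.
pairs = ast.literal_eval(raw_hashtags)

# Keep only the hashtag text from each [hashtag, link] pair.
tags = [tag for tag, _link in pairs]
print(tags)
```

Applied to a whole column, this would look like `df['hashtags'].apply(ast.literal_eval)`, assuming every cell is a valid Python literal.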

# Cleaning

## Drop useless columns

In [19]:
df_dropped = df.drop(['Unnamed: 0','connections','hashtag_followers','views','votes'],axis=1)

## Replace NaN

We work on a new copy of the modified dataset.

In [20]:
df_fillNa = df_dropped.copy()

In [21]:
df_fillNa.isna().sum()

name                0
headline            0
location         2272
followers          42
about               0
time_spent          1
content          2016
content_links       0
media_type       7233
media_url           0
num_hashtags        0
hashtags            0
reactions           0
comments            0
dtype: int64

### Location

Some locations are missing, so we replace them with "Unknown".

In [22]:
#We add a new column 'Locations'
df_fillNa["Locations"] = df_dropped['location'].fillna('Unknown')

In [23]:
#Then we drop the previous "location" column
df_fillNa = df_fillNa.drop(['location'],axis=1)
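
The location values are lists of tokens stored as strings, which is awkward to read or group by. A possible follow-up step, sketched below, joins the tokens back into a readable string and leaves the "Unknown" placeholder untouched. The helper `readable_location` is hypothetical and not part of the original notebook.

```python
import ast

def readable_location(value):
    """Join a stringified token list into one readable location string."""
    # The 'Unknown' placeholder (or any other non-list text) is returned unchanged.
    if not value.startswith("["):
        return value
    tokens = ast.literal_eval(value)
    return " ".join(tokens)

print(readable_location("['Miami,', 'Florida,', 'United', 'States']"))
print(readable_location("Unknown"))
```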

### Followers

Only 42 values are missing.  
We replace them with the column mean, which leaves the overall mean of the column unchanged.

In [24]:
Followers_mean = df_fillNa.followers.mean()
Followers_mean

1125922.2806005299

In [25]:
df_fillNa["Followers"] = df_fillNa['followers'].fillna(Followers_mean)

In [26]:
#Drop
df_fillNa = df_fillNa.drop(['followers'],axis=1)
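
If follower counts are heavily skewed by a few very large accounts, the mean can sit far above a typical value, and the median is a common alternative for imputation. The toy comparison below is only an illustration under that assumption; the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy follower counts with one very large account (illustrative values only).
followers = pd.Series([1000.0, 2000.0, 3000.0, 1000000.0, np.nan])

mean_filled = followers.fillna(followers.mean())
median_filled = followers.fillna(followers.median())

print(mean_filled.iloc[-1])    # pulled up by the single large account
print(median_filled.iloc[-1])  # closer to a typical account
```

With only 42 missing rows out of more than 30,000, either choice has a small effect on the analysis.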

### Time spent

There is only one missing value. 

In [27]:
df_fillNa.loc[df_fillNa['time_spent'].isna()]

Unnamed: 0,name,headline,about,time_spent,content,content_links,media_type,media_url,num_hashtags,hashtags,reactions,comments,Locations,Followers
14049,Gary Frisch,30-Year Public Relations Pro and Skilled Writer,I began my public relations careers when the S...,,"Amid Coronavirus, PR pros should tread careful...",[],article,['https://www.linkedin.com/pulse/public-relati...,0,[],8,2,"['Greater', 'Philadelphia']",30971.0


The post is about the coronavirus.  
We can assume it was published on the day the data was extracted.  
For this reason, let's replace NaN with "Current date".

In [28]:
df_fillNa['Time_spent'] = df_fillNa['time_spent'].fillna('Current date')

In [29]:
#Drop the previous column
df_fillNa = df_fillNa.drop(['time_spent'],axis=1)
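
Because `Time_spent` is categorical text, it cannot be sorted or plotted as a duration directly. The sketch below converts labels to an approximate number of days. It assumes every label follows the pattern "<number> <unit> ago" or is the "Current date" placeholder; the helper `approx_days` and the unit table are hypothetical.

```python
import re

# Approximate number of days per unit (months and years are rounded).
DAYS_PER_UNIT = {"hour": 1 / 24, "day": 1, "week": 7, "month": 30, "year": 365}

def approx_days(label):
    """Convert a label such as '2 years ago' to an approximate number of days."""
    if label == "Current date":
        return 0.0
    match = re.fullmatch(r"(\d+) (\w+?)s? ago", label)
    if match is None:
        return None  # unexpected format: leave it for manual inspection
    count, unit = int(match.group(1)), match.group(2)
    days = DAYS_PER_UNIT.get(unit)
    return None if days is None else count * days

for label in ["1 day ago", "2 years ago", "Current date"]:
    print(label, "->", approx_days(label))
```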

### Content

The question we chose to study concerns post content.  
Therefore, we have no reason to keep rows without content.  
Let's drop them.

In [30]:
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df_fillNa = df_fillNa.loc[df_fillNa['content'].notnull()].copy()

### Media type

We assume that NaN means the post contains no media.
Let's replace it with "none".

In [31]:
df_fillNa["Media_type"] = df_fillNa['media_type'].fillna("none")

In [32]:
df_fillNa = df_fillNa.drop(['media_type'],axis=1)

### Check if everything is ok

In [33]:
df_fillNa.isna().sum()

name             0
headline         0
about            0
content          0
content_links    0
media_url        0
num_hashtags     0
hashtags         0
reactions        0
comments         0
Locations        0
Followers        0
Time_spent       0
Media_type       0
dtype: int64

In [34]:
df_fillNa.shape

(31996, 14)

In [35]:
print('We dropped {} rows from the original dataset'.format(df.shape[0]-df_fillNa.shape[0]))

We dropped 2016 rows from the original dataset


The data is now ready to be explored.

# Export the cleaned Dataset

In [36]:
df_cleaned = df_fillNa.copy()

In [37]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31996 entries, 0 to 34011
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           31996 non-null  object 
 1   headline       31996 non-null  object 
 2   about          31996 non-null  object 
 3   content        31996 non-null  object 
 4   content_links  31996 non-null  object 
 5   media_url      31996 non-null  object 
 6   num_hashtags   31996 non-null  int64  
 7   hashtags       31996 non-null  object 
 8   reactions      31996 non-null  int64  
 9   comments       31996 non-null  int64  
 10  Locations      31996 non-null  object 
 11  Followers      31996 non-null  float64
 12  Time_spent     31996 non-null  object 
 13  Media_type     31996 non-null  object 
dtypes: float64(1), int64(3), object(10)
memory usage: 3.7+ MB


In [38]:
# DataFrame.to_pickle handles serialization itself, so no separate pickle import is needed
df_cleaned.to_pickle('cleaned_data.pkl')
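
A pickle file can be loaded back with `pd.read_pickle`, which restores dtypes and the index as they were saved. The round trip below uses a toy frame and a temporary directory so that it does not touch `cleaned_data.pkl`; the names `toy` and `toy.pkl` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({"content": ["hello"], "reactions": [12]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# True when values, dtypes, and index all match.
print(restored.equals(toy))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.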

Let's move on to the next notebook.