<img src="Assets/header.png" style="width: 800px;">

# `Contents`

- [Load Libraries](#load)
- [Engineering New Features / Adding Extra Layers of Data](#engineer)
	- [Contains Links / Videos](#links) 
	- [Contains Hashtags](#hash) 
	- [Integrating like data from secondary sources](#integrate) 
	- [Creating engagement rate](#engage)         
- [Next Steps](#next)

<a id="load"></a>
# `Load Libraries`
---

In [1]:
import numpy as np
import pandas as pd
import time
import ast
import re

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# `Import the Merged Clean Data`
---

In [2]:
df = pd.read_csv('./Clean_Data/Final_merged.csv')

<a id="engineer"></a>
# `Engineering New Features / Adding Extra Layers of Data`
---

Ideas for new features:

- Identify whether a post contains a video or a link
- Identify whether a post contains any hashtags
- Add page likes by date
- Aggregate extra fan data (likes) to create engagement rates by interaction type

<a id="links"></a>
## `Contains Links / Videos`
---

In [3]:
df.Date = pd.to_datetime(df.Date)

In [4]:
def link_tag(x):
    #substrings that suggest the post text contains a link
    link_identifiers = ['http','bit.ly','.co','po.st']
    return any(i in x for i in link_identifiers)

In [5]:
#create a column that is True if the corresponding 'Post_Content' value contains any of the link identifiers
df['Contains_Link'] = df['Post_Content'].apply(link_tag)
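
The substring check above is deliberately broad: `'.co'` also matches ordinary text such as an ellipsis followed by a word starting with "co". As a hedged alternative, the sketch below uses a stricter regular expression and treats missing text as "no link". The pattern, the `contains_link` name and the sample posts are all assumptions for illustration, not part of the original notebook, and the stricter pattern will miss bare domains written without `http` or `www`.

```python
import re

import pandas as pd

# Assumed pattern: explicit URLs plus the two link shorteners named in the original list.
LINK_PATTERN = re.compile(r"https?://|www\.|\bbit\.ly/|\bpo\.st/", flags=re.IGNORECASE)


def contains_link(text):
    """Return True if `text` looks like it contains a URL; non-strings count as False."""
    if not isinstance(text, str):
        return False
    return LINK_PATTERN.search(text) is not None


# Hypothetical posts, made up for this example.
sample = pd.Series([
    "New recipe: https://example.com/pies",
    "Short link bit.ly/abc123",
    "Mmm...cookies for everyone",  # contains the substring '.co' but no link
    None,                          # missing post text
])
flags = sample.apply(contains_link)
```

Whether the stricter or the broader rule is the better fit depends on how links are actually written in the scraped posts, so it is worth spot-checking a sample of rows either way.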

In [6]:
#create a column that is True if the post has any views (i.e. it is a video post)
df['Contains_Video'] = df['Views'] > 0

In [7]:
df.head(10)

| Unnamed: 0 | Date | Year | Brand | Post_Content | All_Responses | Comments | Shares | Views | Contains_Link | Contains_Video |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-12-08 | 2018 | Sainsburys | Bake a festive showstopper with Sainsbury’s ma... | 132 | 74 | 23 | 0 | True | False |
| 1 | 2018-12-05 | 2018 | Sainsburys | Get in the party spirit with Sainsbury’s magaz... | 142 | 33 | 5 | 18000 | True | True |
| 2 | 2018-12-01 | 2018 | Sainsburys | Harry and Meghan’s wedding cake maker Claire P... | 193 | 59 | 26 | 0 | True | False |
| 3 | 2018-11-29 | 2018 | Sainsburys | These cookie-cup mince pies are deliciously ch... | 187 | 34 | 22 | 0 | True | False |
| 4 | 2018-11-26 | 2018 | Sainsburys | We need your help to brighten 1 million Christ... | 2200 | 180 | 710 | 627000 | False | True |
| 5 | 2018-11-25 | 2018 | Sainsburys | Did you know you can make your Christmas puddi... | 208 | 80 | 58 | 0 | True | False |
| 6 | 2018-11-22 | 2018 | Sainsburys | Kids AND adults will love making these cute re... | 273 | 97 | 67 | 0 | True | False |
| 7 | 2018-11-19 | 2018 | Sainsburys | Calling all cheese lovers! Christmas really ha... | 898 | 1500 | 455 | 461000 | False | True |
| 8 | 2018-11-15 | 2018 | Sainsburys | \*Watches on repeat for the kid dressed as a plug\* | 26000 | 12000 | 17918 | 2200000 | False | True |
| 9 | 2018-11-14 | 2018 | Sainsburys | He didn't choose the plug life, the plug life ... | 29000 | 23000 | 14616 | 1800000 | False | True |


<a id="hash"></a>
## `Contains Hashtags`
---

In [8]:
#it might be useful to identify which posts have hashtags in them and how many - so let's create two new columns
hashtag = re.compile(r'#')

def hashtag_present(x):
    #True if the post text contains at least one '#'
    return bool(hashtag.search(x))

def hashtag_count(x):
    #number of '#' characters in the post text
    return len(hashtag.findall(x))

In [9]:
df['Has_Hashtag'] = df.Post_Content.apply(hashtag_present)

In [10]:
df['Hashtag_Count'] = df.Post_Content.apply(hashtag_count)
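
The functions above count every `#` character, so a bare `#` that is not followed by a tag still counts as a hashtag. A minimal sketch of a stricter variant is shown below, assuming a hashtag is a `#` immediately followed by one or more word characters. The pattern and the sample posts are illustrative assumptions, not the notebook's own method.

```python
import re

import pandas as pd

# Assumed definition of a hashtag: '#' immediately followed by word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")

# Hypothetical posts, made up for this example.
posts = pd.Series([
    "Get ready for #Christmas with our #festive range",
    "We are # 1 for value",  # bare '#', not a hashtag under this definition
    "No tags here",
])

# List of hashtags found in each post, then the number of hashtags per post.
hashtag_lists = posts.apply(HASHTAG_PATTERN.findall)
hashtag_counts = hashtag_lists.apply(len)
```

Keeping the extracted tags (rather than only the count) also makes it possible to look at which hashtags each brand uses most often.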

<a id="integrate"></a>
## `Integrating like data from secondary sources`
---

Facebook doesn't provide retrospective like data for brand pages, so I used a site called Fanpage Karma to acquire it. It may prove useful in getting some context about the brands.

<img src= "Assets/fp_karma.png">



In [11]:
asda_likes = pd.read_csv('./FanPage_Karma_Data/ASDA_likes.csv')
lidl_likes = pd.read_csv('./FanPage_Karma_Data/Lidl_likes.csv')
sains_likes = pd.read_csv('./FanPage_Karma_Data/Sainsburys_likes.csv')
mns_likes = pd.read_csv('./FanPage_Karma_Data/mns_likes.csv')
morrisons_likes = pd.read_csv('./FanPage_Karma_Data/Morrisons_likes.csv')
tesco_likes = pd.read_csv('./FanPage_Karma_Data/Tesco_likes.csv')
waitrose_likes = pd.read_csv('./FanPage_Karma_Data/Waitrose_likes.csv')

In [12]:
asda_likes.Date = pd.to_datetime(asda_likes['Date'],format='%d/%m/%y')
lidl_likes.Date = pd.to_datetime(lidl_likes['Date'],format='%d/%m/%y')
sains_likes.Date = pd.to_datetime(sains_likes['Date'],format='%d/%m/%y')
mns_likes.Date = pd.to_datetime(mns_likes['Date'],format='%d/%m/%y')
morrisons_likes.Date = pd.to_datetime(morrisons_likes['Date'],format='%d/%m/%y')
tesco_likes.Date = pd.to_datetime(tesco_likes['Date'],format='%d/%m/%y')
waitrose_likes.Date = pd.to_datetime(waitrose_likes['Date'],format='%d/%m/%y')
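
The explicit `format='%d/%m/%y'` matters because the dates are day-first with two-digit years, so `08/12/18` means 8 December 2018. As a sketch of the same conversion without seven near-identical lines, the frames can be held in a dictionary and converted in a loop. The `likes_frames` name and the like counts below are hypothetical stand-ins for the CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for two of the Fanpage Karma CSV files.
likes_frames = {
    "ASDA": pd.DataFrame({"Date": ["08/12/18", "05/12/18"], "Likes": [100, 90]}),
    "Lidl": pd.DataFrame({"Date": ["01/11/18", "02/11/18"], "Likes": [70, 75]}),
}

# Parse the day-first, two-digit-year strings into datetimes for every brand.
for frame in likes_frames.values():
    frame["Date"] = pd.to_datetime(frame["Date"], format="%d/%m/%y")
```

With an explicit format, a string that does not match raises an error instead of being silently parsed month-first.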

In [13]:
Lidl = df[df.Brand == 'Lidl']
Tesco = df[df.Brand == 'Tesco']
MnS = df[df.Brand == 'Marks and Spencer']
Morrisons = df[df.Brand == 'Morrisons']
Waitrose = df[df.Brand == 'Waitrose']
Sainsburys = df[df.Brand == 'Sainsburys']
Asda = df[df.Brand == 'ASDA']

In [14]:
#merging the like data with the individual brand df (left join on 'Date' keeps every post)
sains_m = pd.merge(Sainsburys, sains_likes, on='Date', how='left')
lidl_m = pd.merge(Lidl, lidl_likes, on='Date', how='left')
tesco_m = pd.merge(Tesco, tesco_likes, on='Date', how='left')
morrisons_m = pd.merge(Morrisons, morrisons_likes, on='Date', how='left')
mns_m = pd.merge(MnS, mns_likes, on='Date', how='left')
waitrose_m = pd.merge(Waitrose, waitrose_likes, on='Date', how='left')
asda_m = pd.merge(Asda, asda_likes, on='Date', how='left')
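
An alternative to splitting by brand and merging seven times is to stack the like tables with a `Brand` column and merge once on both keys. The sketch below assumes each like table has one row per date. It uses `validate` to check that assumption and `indicator` to show which posts found no like count. All frames and numbers here are made up for illustration.

```python
import pandas as pd

# Hypothetical posts for two brands.
posts = pd.DataFrame({
    "Brand": ["Tesco", "Tesco", "Lidl"],
    "Date": pd.to_datetime(["2018-12-01", "2018-12-02", "2018-12-01"]),
    "Shares": [5, 7, 3],
})

# Hypothetical per-brand like tables, stacked into one frame with a Brand column.
likes = pd.concat(
    [
        pd.DataFrame({"Date": pd.to_datetime(["2018-12-01"]), "Likes": [1000]}).assign(Brand="Tesco"),
        pd.DataFrame({"Date": pd.to_datetime(["2018-12-01"]), "Likes": [500]}).assign(Brand="Lidl"),
    ],
    ignore_index=True,
)

# One left merge on both keys; validate raises if a brand/date pair repeats in `likes`.
merged = posts.merge(
    likes, on=["Brand", "Date"], how="left", validate="many_to_one", indicator=True
)

# Posts with no matching like count are flagged as 'left_only'.
unmatched = merged[merged["_merge"] == "left_only"]
```

Inspecting `unmatched` before dropping anything shows how many posts would be lost for lack of like data.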

In [15]:
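#stack the brand frames back together; ignore_index avoids duplicate index labels
df = pd.concat([sains_m,lidl_m,tesco_m,morrisons_m,mns_m,waitrose_m,asda_m], ignore_index=True)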

In [16]:
#drop rows with a missing value in any column, e.g. posts with no matching like count
df.dropna(inplace=True)
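
Called without arguments, `dropna` removes a row if any column is missing, not only `Likes`. If the intent is just to discard posts that have no like count (an assumption about the goal here), the `subset` argument restricts the check to that column. The small frame below is hypothetical and only shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two different columns.
frame = pd.DataFrame({
    "Post_Content": ["a", "b", "c"],
    "Views": [10.0, np.nan, 5.0],
    "Likes": [1000.0, 2000.0, np.nan],
})

# Drops rows "b" and "c": each has a missing value somewhere.
dropped_any = frame.dropna()

# Drops only row "c": the check is limited to the Likes column.
dropped_likes_only = frame.dropna(subset=["Likes"])
```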

<a id="engage"></a>
## `Creating engagement rate`
---

In [17]:
df.head()

| Unnamed: 0 | Date | Year | Brand | Post_Content | All_Responses | Comments | Shares | Views | Contains_Link | Contains_Video | Has_Hashtag | Hashtag_Count | Likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-12-08 | 2018 | Sainsburys | Bake a festive showstopper with Sainsbury’s ma... | 132 | 74 | 23 | 0 | True | False | False | 0 | 1700448.0 |
| 1 | 2018-12-05 | 2018 | Sainsburys | Get in the party spirit with Sainsbury’s magaz... | 142 | 33 | 5 | 18000 | True | True | False | 0 | 1700328.0 |
| 2 | 2018-12-01 | 2018 | Sainsburys | Harry and Meghan’s wedding cake maker Claire P... | 193 | 59 | 26 | 0 | True | False | False | 0 | 1700220.0 |
| 3 | 2018-11-29 | 2018 | Sainsburys | These cookie-cup mince pies are deliciously ch... | 187 | 34 | 22 | 0 | True | False | False | 0 | 1700046.0 |
| 4 | 2018-11-26 | 2018 | Sainsburys | We need your help to brighten 1 million Christ... | 2200 | 180 | 710 | 627000 | False | True | False | 0 | 1699846.0 |


In [18]:
#each rate is an interaction count divided by the page's like count on the post date
df['Response_Rate'] = df['All_Responses'] / df['Likes']
df['Comments_Rate'] = df['Comments'] / df['Likes']
df['Shares_Rate'] = df['Shares'] / df['Likes']
df['Video_Rate'] = df['Views'] / df['Likes']
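
Each rate is an interaction count divided by the page's like count on the post date, so the first Sainsburys row above gives a response rate of 132 / 1,700,448. The sketch below wraps the same division in a helper that also turns a zero like count into a missing value rather than infinity. The `add_rates` name and the second sample row are hypothetical.

```python
import numpy as np
import pandas as pd


def add_rates(frame, columns, denominator="Likes"):
    """Return a copy of `frame` with one rate column per entry in `columns`.

    `columns` maps a source count column to the name of the new rate column.
    """
    out = frame.copy()
    # A zero like count would give inf, so treat it as missing instead.
    likes = out[denominator].replace(0, np.nan)
    for source, target in columns.items():
        out[target] = out[source] / likes
    return out


# First row uses values from the notebook output; the second row is made up.
sample = pd.DataFrame({
    "All_Responses": [132, 50],
    "Shares": [23, 5],
    "Likes": [1700448.0, 0.0],
})
rated = add_rates(sample, {"All_Responses": "Response_Rate", "Shares": "Shares_Rate"})
```

Returning a copy keeps the input frame unchanged, which makes the helper safe to reuse on per-brand slices.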

In [19]:
df.to_csv('./Clean_Data_Eng/Final_merged_Eng.csv',index=False)

In [20]:
df.Brand.value_counts()

Lidl                 1353
Tesco                1068
Marks and Spencer     870
Morrisons             798
Waitrose              777
ASDA                  770
Sainsburys            714
Name: Brand, dtype: int64

<a id="next"></a>
# `Next Steps`
---

In the next section I begin the exploratory data analysis (EDA) stage of the project. My main aim there is to dig into the metrics I have engineered here and see whether they indicate any differences in content between the brands. Time to start visualising!