# Machine Learning Work for Facebook Metrics Dataset

This dataset comes from a Facebook campaign run by a well-known cosmetics brand between 1 January 2014 and 31 December 2014.

In this notebook we visualize some of the data and build a regression model that predicts <code>Total Interactions</code>. Let's start with the required imports.

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from math import sqrt

print('Imports are successful.')

Imports are successful.


We are ready to load the dataset into the <code>df</code> variable.

After that, we drop the columns we do not need:
* <code>Unnamed: 0</code>, because it looks like an index column.
* <code>comment</code>, because it is one of the addends of <code>Total Interactions</code>.
* <code>like</code> and <code>share</code>, for the same reason as <code>comment</code>.
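
To illustrate why these component columns would leak the target, here is a minimal sketch on made-up rows. Only the column names come from the dataset; the values are hypothetical, and the sketch assumes, as stated above, that <code>Total Interactions</code> is the sum of <code>comment</code>, <code>like</code> and <code>share</code>.

```python
import pandas as pd

# Toy rows with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    'comment': [4, 5, 0],
    'like': [79, 130, 66],
    'share': [17, 29, 14],
})
toy['Total Interactions'] = toy['comment'] + toy['like'] + toy['share']

# If the three components stay in the feature set, the target can be rebuilt
# exactly from them, so a model would learn nothing beyond this sum.
components_sum = toy[['comment', 'like', 'share']].sum(axis=1)
is_exact_sum = bool((components_sum == toy['Total Interactions']).all())
print(is_exact_sum)
```

Running the same comparison on the real columns before dropping them would confirm whether the assumption holds for every row.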

Let's continue.

In [2]:
# read the dataset into the df variable
df = pd.read_csv('/kaggle/input/facebook-metrics-dataset-of-cosmetic-brand/Facebook Metrics of Cosmetic Brand.csv')
# we drop these columns here as they will not be needed for our operation
df.drop(['Unnamed: 0', 'comment', 'like', 'share'], axis=1, inplace=True)
df.head()

,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,393


Review the dataset's features and their types.

In [3]:
# info about the dataset, especially its columns
print('Facebook Metrics of Cosmetic Brand Dataset\n')
df.info()

Facebook Metrics of Cosmetic Brand Dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 16 columns):
 #   Column                                                               Non-Null Count  Dtype  
---  ------                                                               --------------  -----  
 0   Page total likes                                                     500 non-null    int64  
 1   Type                                                                 500 non-null    object 
 2   Category                                                             500 non-null    int64  
 3   Post Month                                                           500 non-null    int64  
 4   Post Weekday                                                         500 non-null    int64  
 5   Post Hour                                                            500 non-null    int64  
 6   Paid                                                          

Most of our columns are numerical, which is what we want for a regression model.

There is not much to do in the data preprocessing part. Before that, let's visualize the data. Afterwards, we will convert the <code>Type</code> column to numerical values.

In [4]:
df.describe()

,Page total likes,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,Total Interactions
count,500.0,500.0,500.0,500.0,500.0,499.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,123194.176,1.88,7.038,4.15,7.84,0.278557,13903.36,29585.95,920.344,798.772,1415.13,16766.38,6585.488,609.986,212.12
std,16272.813214,0.852675,3.307936,2.030701,4.368589,0.448739,22740.78789,76803.25,985.016636,882.505013,2000.594118,59791.02,7682.009405,612.725618,380.233118
min,81370.0,1.0,1.0,1.0,1.0,0.0,238.0,570.0,9.0,9.0,9.0,567.0,236.0,9.0,0.0
25%,112676.0,1.0,4.0,2.0,3.0,0.0,3315.0,5694.75,393.75,332.5,509.25,3969.75,2181.5,291.0,71.0
50%,129600.0,2.0,7.0,4.0,9.0,0.0,5281.0,9051.0,625.5,551.5,851.0,6255.5,3417.0,412.0,123.5
75%,136393.0,3.0,10.0,6.0,11.0,1.0,13168.0,22085.5,1062.0,955.5,1463.0,14860.5,7989.0,656.25,228.5
max,139441.0,3.0,12.0,7.0,23.0,1.0,180480.0,1110282.0,11452.0,11328.0,19779.0,1107833.0,51456.0,4376.0,6334.0


## Data Visualization

First, let's look at which kinds of posts people interact with most. Facebook has several post types, such as photos, statuses, links, and videos. Let's view this with a pie chart.

In [5]:
# total interactions by type of the facebook post
fig = px.pie(df, values='Total Interactions', names='Type', title='Total Interactions by Type of the Facebook Post')
fig.show()

As the chart shows, photo posts account for the largest share of total interactions. Note that the chart sums interactions over all posts of a type, so a type with many posts gets a large slice even if its individual posts are not the most engaging.

Next, we create some visualizations to see the share of <code>Paid</code> and unpaid posts for each post type. Let's start with photo posts and continue with the others.
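
A pie chart of summed interactions mixes two things: how many posts a type has and how well each post performs. The sketch below, on a made-up mini-dataset (the values are hypothetical; only the column names follow the notebook), shows one possible way to separate the two with a grouped summary.

```python
import pandas as pd

# Hypothetical mini-dataset; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    'Type': ['Photo', 'Photo', 'Photo', 'Photo', 'Video', 'Status'],
    'Total Interactions': [100, 80, 120, 100, 300, 150],
})

# Total, per-post average, and number of posts for each type.
summary = (
    toy.groupby('Type')['Total Interactions']
       .agg(total='sum', mean_per_post='mean', posts='count')
       .sort_values('total', ascending=False)
)
print(summary)
```

In this toy data, photos lead on the total only because there are more of them, while the single video post has the highest per-post average. Whether the real dataset behaves the same way would have to be checked on <code>df</code>.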

In [6]:
# paid-unpaid pie chart for photo posts
df_photo_posts_count = df[df['Type']=='Photo']['Paid'].value_counts()
df_photo_posts_count = pd.DataFrame({'Paid': df_photo_posts_count.index, 'Count': df_photo_posts_count.values})

total_number_of_photos = df_photo_posts_count['Count'].sum()

fig = px.pie(df_photo_posts_count, values='Count', names='Paid', title='Paid-Unpaid Pie Chart for Photo Posts')
fig.show()

In [7]:
# paid-unpaid pie chart for status posts
df_status_posts_count = df[df['Type']=='Status']['Paid'].value_counts()
df_status_posts_count = pd.DataFrame({'Paid': df_status_posts_count.index, 'Count': df_status_posts_count.values})

total_number_of_status = df_status_posts_count['Count'].sum()

fig = px.pie(df_status_posts_count, values='Count', names='Paid', title='Paid-Unpaid Pie Chart for Status Posts')
fig.show()

In [8]:
# paid-unpaid pie chart for link posts
df_link_posts_count = df[df['Type']=='Link']['Paid'].value_counts()
df_link_posts_count = pd.DataFrame({'Paid': df_link_posts_count.index, 'Count': df_link_posts_count.values})

total_number_of_link = df_link_posts_count['Count'].sum()

fig = px.pie(df_link_posts_count, values='Count', names='Paid', title='Paid-Unpaid Pie Chart for Link Posts')
fig.show()

In [9]:
# paid-unpaid pie chart for video posts
df_video_posts_count = df[df['Type']=='Video']['Paid'].value_counts()
df_video_posts_count = pd.DataFrame({'Paid': df_video_posts_count.index, 'Count': df_video_posts_count.values})

total_number_of_video = df_video_posts_count['Count'].sum()

fig = px.pie(df_video_posts_count, values='Count', names='Paid', title='Paid-Unpaid Pie Chart for Video Posts')
fig.show()

Let's combine them into one table with a <code>Paid Rate</code> and an <code>Unpaid Rate</code> for each post type.

In [10]:
# paid-unpaid rate of post types in order
# note: the iloc rows below assume value_counts() returns Photo, Status, Link, Video in this order
df_paid_posts = df[df['Paid']==1]['Type'].value_counts()
df_paid_posts = pd.DataFrame({'Type': df_paid_posts.index, 'Paid Rate': df_paid_posts.values})
df_paid_posts['Paid Rate'] = df_paid_posts['Paid Rate'].astype(float)

df_paid_posts.iloc[0, 1] = df_paid_posts.iloc[0, 1] / total_number_of_photos
df_paid_posts.iloc[1, 1] = df_paid_posts.iloc[1, 1] / total_number_of_status
df_paid_posts.iloc[2, 1] = df_paid_posts.iloc[2, 1] / total_number_of_link
df_paid_posts.iloc[3, 1] = df_paid_posts.iloc[3, 1] / total_number_of_video

df_paid_posts['Unpaid Rate'] = 1 - df_paid_posts['Paid Rate']
df_paid_posts.sort_values(by='Paid Rate', ascending=False, inplace=True)
df_paid_posts

,Type,Paid Rate,Unpaid Rate
3,Video,0.571429,0.428571
0,Photo,0.28,0.72
2,Link,0.272727,0.727273
1,Status,0.222222,0.777778


We sorted the table from the highest to the lowest <code>Paid Rate</code>. As you can see, video posts have the highest paid rate, while the most used post type (photos) is not in first place.
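
The positional <code>iloc</code> divisions used for this table depend on the order in which <code>value_counts()</code> happens to return the post types. As a hedged alternative, the sketch below computes the same kind of table with a group-wise mean, which works because <code>Paid</code> is coded as 0/1. The data is a made-up toy frame, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data; Paid is coded 0/1 as in the notebook.
toy = pd.DataFrame({
    'Type': ['Photo', 'Photo', 'Photo', 'Photo', 'Video', 'Video', 'Status', 'Link'],
    'Paid': [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})

# The mean of a 0/1 column within each type is that type's paid rate.
# mean() skips missing values, so rows without a Paid value are ignored.
rates = (
    toy.groupby('Type')['Paid']
       .mean()
       .rename('Paid Rate')
       .reset_index()
)
rates['Unpaid Rate'] = 1 - rates['Paid Rate']
rates = rates.sort_values(by='Paid Rate', ascending=False).reset_index(drop=True)
print(rates)
```

Because the rate is computed per group label, the result does not depend on row order or on how many post types exist.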

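From this, we can say that the brand paid to promote video posts more often than any other post type. The paid rate alone does not show which type sells more products, so treat this as a hint rather than a recommendation. Let's continue with data preprocessing.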

## Data Preprocessing

There is not much to do here. We check how many null values the dataset has; if there are only a few, we remove the rows that contain them and continue.

We will also <code>map</code> the values of the <code>Type</code> column to numbers and cast the column to <code>int</code>.
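
The mapping step can be sketched on a small toy series as follows. The dictionary is the one this notebook uses; the extra check for unmapped labels and the one-hot alternative at the end are assumptions added for illustration, not part of the original workflow.

```python
import pandas as pd

# Toy series with the four post types that appear in the dataset.
types = pd.Series(['Photo', 'Status', 'Link', 'Video', 'Photo'], name='Type')
type_codes = {'Photo': 1, 'Status': 2, 'Link': 3, 'Video': 4}

mapped = types.map(type_codes)

# map() returns NaN for labels missing from the dictionary, and NaN cannot
# be cast to int, so it is safer to check before casting.
unmapped = types[mapped.isna()].unique().tolist()
if unmapped:
    raise ValueError(f'Unmapped Type labels: {unmapped}')
mapped = mapped.astype(int)
print(mapped.tolist())

# Alternative (an assumption, not what this notebook does): one-hot encoding
# avoids implying an order such as Photo < Status < Link < Video.
one_hot = pd.get_dummies(types, prefix='Type')
print(list(one_hot.columns))
```

Integer codes are compact, but a linear model reads them as an ordered scale. One-hot columns avoid that at the cost of a few extra features.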

In [11]:
# drop the rows that contain empty cells
print(f'Before the process:\n{df.isnull().sum()}')
df.dropna(axis=0, inplace=True)
print(f'\nAfter the process:\n{df.isnull().sum()}')

Before the process:
Page total likes                                                       0
Type                                                                   0
Category                                                               0
Post Month                                                             0
Post Weekday                                                           0
Post Hour                                                              0
Paid                                                                   1
Lifetime Post Total Reach                                              0
Lifetime Post Total Impressions                                        0
Lifetime Engaged Users                                                 0
Lifetime Post Consumers                                                0
Lifetime Post Consumptions                                             0
Lifetime Post Impressions by people who have liked your Page           0
Lifetime Post reach by people w

After dropping the single row with a missing <code>Paid</code> value, there are no empty cells left in our dataset. Let's check how many rows and columns remain.

In [12]:
# gather some information about how many rows and columns we have
print(f'''The dataset has {df.shape[0]} rows and {df.shape[1]} columns.''')

The dataset has 499 rows and 16 columns.


Let's map the <code>Type</code> column.

In [13]:
# map type column
df['Type'] = df['Type'].map({'Photo': 1, 'Status': 2, 'Link': 3, 'Video': 4})
df['Type'] = df['Type'].astype(int)

df[['Type']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 499 entries, 0 to 498
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Type    499 non-null    int64
dtypes: int64(1)
memory usage: 7.8 KB


Success! The preprocessing is done. It is time for the machine learning part of this project.

## Machine Learning

Now all of our columns hold numerical values, so we are ready to create our prediction model.

In this part, we predict the continuous <code>Total Interactions</code> target, so we use multiple linear regression. Let's start by creating the <code>X</code> and <code>y</code> variables.

In [14]:
# create variables and split
X = df.drop(['Total Interactions'], axis=1)
y = df['Total Interactions']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we can create our multiple regression model, fit it, and compute the evaluation metrics.

In [15]:
# create multiple regression model and evaluate
lr = LinearRegression().fit(X_train, y_train)
y_hat = lr.predict(X_test)

mse = mean_squared_error(y_test, y_hat)
rmse = sqrt(mse)
mape = mean_absolute_percentage_error(y_test, y_hat)

Now, let's see how well the model performs.

In [16]:
# create table of the scores
dict_evaluation = {'MSE': mse, 'RMSE': rmse, 'MAPE': mape}

evaluation = pd.DataFrame(dict_evaluation, index=[0])
evaluation.head()

,MSE,RMSE,MAPE
0,6088.378511,78.028062,566887500000000.0


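The table shows the performance evaluation. Our multiple regression model reaches an MSE of about 6088, which is the average squared difference between the predicted and the actual <code>Total Interactions</code> on the test set. The RMSE is easier to read because it is in the same units as the target: a typical prediction is off by about 78 interactions, compared with a median of 123.5 and a standard deviation of about 380 in the data. The MAPE value is not meaningful here. MAPE divides each error by the actual value, and the target has a minimum of 0, so the huge number most likely comes from test posts with zero interactions.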
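
To make the three metrics concrete, here is a small worked example in plain Python on made-up numbers. The MAPE part follows the convention of clamping the denominator to machine epsilon, which, to my understanding, is how scikit-learn's <code>mean_absolute_percentage_error</code> handles zero targets; treat it as an illustrative re-implementation, not the library's code.

```python
import sys
from math import sqrt

y_true = [0, 100, 200]   # made-up targets; the first post has zero interactions
y_pred = [10, 110, 190]  # made-up predictions, each off by 10

errors = [p - t for t, p in zip(y_true, y_pred)]

# MSE is the mean of the squared errors; RMSE brings it back to target units.
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = sqrt(mse)

# MAPE divides each absolute error by the true value. Clamping the denominator
# to machine epsilon avoids a division by zero but produces an enormous term.
eps = sys.float_info.epsilon
mape = sum(abs(e) / max(abs(t), eps) for e, t in zip(errors, y_true)) / len(errors)

# The same ratio restricted to non-zero targets stays interpretable.
nonzero = [(e, t) for e, t in zip(errors, y_true) if t != 0]
mape_nonzero = sum(abs(e) / abs(t) for e, t in nonzero) / len(nonzero)

print(mse, rmse, mape, mape_nonzero)
```

Every prediction here misses by the same 10 interactions, yet the single zero target pushes MAPE to an astronomically large value. On the real split, counting the zeros with <code>(y_test == 0).sum()</code> would show whether this is what happened.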

This concludes the project. I hope it will be helpful for future projects.

### Thank you for reading.

#### Mert Kont