## Analysing the Facebook Metrics Dataset

In this article, I use Python and the pandas library to conduct a preliminary analysis of the "facebook-metrics" dataset for the Stage 0 task of the HNG12 Data Analysis Track, a challenging program designed to identify the best developers and analysts.

If you are an employer interested in finding out more or in hiring a data analyst, see [hng.tech/hire/data-analysts](https://hng.tech/hire/data-analysts).
Interested interns can find the HNG Internship Program here:
[hng.tech/internship](https://hng.tech/internship)

Now on to the main course.

Contents<br>
1. Preparation and Setup<br>
2. Dataset Familiarization<br>
3. Initial Data Exploration<br>
4. Insights Identified<br>

### 1. Preparation and Setup
The Facebook Metrics dataset contains performance metrics for posts published by a selected Facebook page.
##### Importing the dataset files.
The first step is to download the dataset. The "facebook-metrics" dataset is available on Kaggle here: [Facebook Metrics](https://www.kaggle.com/datasets/masoodanzar/facebook-metrics).
<br>I'll be accessing it using the kagglehub library.

##### Setting up the environment
The next step is to import the analysis libraries. For this analysis, I'll be using pandas for data processing and I/O, NumPy for calculations, and Matplotlib and Seaborn for visualization.


In [1]:
# Download the Facebook metrics dataset
import os
import kagglehub
facebook_metrics_path = kagglehub.dataset_download('masoodanzar/facebook-metrics')

# Collect the paths of the downloaded data files
paths_input_files = []
for dirname, _, filenames in os.walk(facebook_metrics_path):
    for filename in filenames:
        # record the full path of each file
        paths_input_files.append(os.path.join(dirname, filename))

print(paths_input_files)

# Import libraries for analysis
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # numerical calculations
import matplotlib.pyplot as plt # data visualisations
import seaborn as sns # more data visualisations

display("Setup Completed")

ModuleNotFoundError: No module named 'kagglehub'

In [2]:
# suppress warnings due to version changes 
import warnings
warnings.filterwarnings('ignore')

### 2. Dataset Familiarization

##### 2.1 Examining the Dataset Structure
To understand the dataset's structure and contents, I load it and call the pd.DataFrame.info() method. This is a quick way to get the number of columns, the column names, the column data types, and the number of non-null values in each column of a pandas DataFrame.

A cursory inspection shows 500 rows, one per published page post, and 19 feature columns. Referring to the dataset's Kaggle page, I identify seven prepublication features (columns 0 to 6) and twelve postpublication metrics (columns 7 to 18). A few columns contain null values. Most columns are typed as integers, with three float columns and a single object column, ['Type'], whose generic object dtype does not reveal the underlying data type.

In [3]:
# Load dataset csv file
df = pd.read_csv(paths_input_files[0], sep=';')

display(df.info(),
        "-------",
        df.Type.dtype)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
 #   Column                                                               Non-Null Count  Dtype  
---  ------                                                               --------------  -----  
 0   Page total likes                                                     500 non-null    int64  
 1   Type                                                                 500 non-null    object 
 2   Category                                                             500 non-null    int64  
 3   Post Month                                                           500 non-null    int64  
 4   Post Weekday                                                         500 non-null    int64  
 5   Post Hour                                                            500 non-null    int64  
 6   Paid                                                                 499 non-null    float64
 7   Lifetime

None

'-------'

dtype('O')
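
Because `df.info()` prints its report and returns `None` (hence the `None` in the output above), the same facts can also be gathered into a DataFrame that is easier to display or filter. The following is a minimal sketch on a toy frame with made-up values; only the column names mirror the dataset, and `toy` and `summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Type": ["Photo", "Status", "Photo", "Link"],
    "Paid": [0.0, 1.0, np.nan, 0.0],
    "like": [79.0, np.nan, 66.0, 130.0],
    "Total Interactions": [100, 164, 80, 393],
})

# One row per column: dtype, number of nulls, number of distinct non-null values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "nulls": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

On the real data, replacing `toy` with `df` should give one row for each of the 19 columns.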

##### 2.2 Examining the Dataset rows

Checking the first and last few rows of the dataset further suggests that some columns might be incorrectly typed.
For example, the ['like', 'share'] columns are typed as float but look like they should be integers, since likes and shares are counted in whole numbers.
The ['Paid'] column looks like it should be boolean rather than float, as a post is either paid or not.
The ['Page total likes'] column appears to represent the number of likes the page had at the time the post was published.
The ['Type', 'Category'] columns look like categorical data.
Lastly, ['Post Month', 'Post Weekday', 'Post Hour'] appear to describe when each post was published.

In [4]:
display(df.head(), 
        df.tail())

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,4,79.0,17.0,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,5,130.0,29.0,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,0,66.0,14.0,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,58,1572.0,147.0,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,19,325.0,49.0,393


Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
495,85093,Photo,3,1,7,2,0.0,4684,7536,733,708,985,4750,2876,392,5,53.0,26.0,84
496,81370,Photo,2,1,5,8,0.0,3480,6229,537,508,687,3961,2104,301,0,53.0,22.0,75
497,81370,Photo,1,1,5,2,0.0,3778,7216,625,572,795,4742,2388,363,4,93.0,18.0,115
498,81370,Photo,3,1,4,11,0.0,4156,7564,626,574,832,4534,2452,370,7,91.0,38.0,136
499,81370,Photo,2,1,4,4,,4188,7292,564,524,743,3861,2200,316,0,91.0,28.0,119


##### 2.3 Examining the Dataset statistical characteristics
Calling the pd.DataFrame.describe() method on the numerical columns shows summary statistics such as the mean, min and max values.
Doing the same for the non-numerical columns shows the number of unique values, the most frequent value, and how often it appears.
The results show that, except for ['Page total likes'], the prepublication columns have low maximum values, which I investigate alongside other hints in the next section.

In [5]:
display(df.select_dtypes(include=np.number).describe(),
        df.select_dtypes(exclude=np.number).describe())

Unnamed: 0,Page total likes,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
count,500.0,500.0,500.0,500.0,500.0,499.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,499.0,496.0,500.0
mean,123194.176,1.88,7.038,4.15,7.84,0.278557,13903.36,29585.95,920.344,798.772,1415.13,16766.38,6585.488,609.986,7.482,177.945892,27.266129,212.12
std,16272.813214,0.852675,3.307936,2.030701,4.368589,0.448739,22740.78789,76803.25,985.016636,882.505013,2000.594118,59791.02,7682.009405,612.725618,21.18091,323.398742,42.613292,380.233118
min,81370.0,1.0,1.0,1.0,1.0,0.0,238.0,570.0,9.0,9.0,9.0,567.0,236.0,9.0,0.0,0.0,0.0,0.0
25%,112676.0,1.0,4.0,2.0,3.0,0.0,3315.0,5694.75,393.75,332.5,509.25,3969.75,2181.5,291.0,1.0,56.5,10.0,71.0
50%,129600.0,2.0,7.0,4.0,9.0,0.0,5281.0,9051.0,625.5,551.5,851.0,6255.5,3417.0,412.0,3.0,101.0,19.0,123.5
75%,136393.0,3.0,10.0,6.0,11.0,1.0,13168.0,22085.5,1062.0,955.5,1463.0,14860.5,7989.0,656.25,7.0,187.5,32.25,228.5
max,139441.0,3.0,12.0,7.0,23.0,1.0,180480.0,1110282.0,11452.0,11328.0,19779.0,1107833.0,51456.0,4376.0,372.0,5172.0,790.0,6334.0


Unnamed: 0,Type
count,500
unique,4
top,Photo
freq,426


### 3. Initial Data Exploration

Conducting a quick review of the dataset without deep analysis, I look for obvious patterns, trends, or anomalies in the data.

##### 3.1 Checking for null values
Checking for null entries shows null values at row indexes [111, 120, 124, 164, 499] in the ['Paid', 'like', 'share'] columns.

##### 3.2 Checking for unique values in prepublication columns
Checking for prepublication columns with fewer than 25 unique non-null values points to ['Type'] (4 unique), ['Category'] (3 unique), ['Post Month'] (12 unique) and ['Post Weekday'] (7 unique) as categorical representations. The ['Paid'] (2 unique) column may be boolean or categorical. ['Post Hour'] (22 unique) also appears to be a categorical representation of the hour of day, although two hours never appear: '0' and '21'.
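
One general way to list the hours that never occur is a set difference against the full range 0 to 23, which works however many hours are missing. This is a minimal sketch on made-up posting hours, assuming hours are coded 0 to 23; `post_hour` stands in for `df['Post Hour']` and `missing_hours` is a hypothetical name.

```python
import pandas as pd

# Made-up posting hours for illustration; the real column is df['Post Hour'].
post_hour = pd.Series([1, 2, 3, 3, 5, 10, 23], name="Post Hour")

# Hours of the day (0-23) that never occur in the column.
observed = set(post_hour.dropna().astype(int).unique().tolist())
missing_hours = sorted(set(range(24)) - observed)
print(missing_hours)
```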

##### 3.3 Checking if Paid, like and share contain only whole numbers
Checking whether ['Paid', 'like', 'share'] contain only whole numbers confirms that they do, which indicates they can be stored as integers.
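
As an illustration of the whole-number test, a value is whole when its remainder modulo 1 is zero. The sketch below wraps that in a small helper, `is_whole`, which is a hypothetical name and not part of pandas; the series values are made up.

```python
import numpy as np
import pandas as pd

def is_whole(series: pd.Series) -> bool:
    """Return True if every non-null value has no fractional part."""
    values = series.dropna()
    return bool((values % 1 == 0).all())

# Made-up values for illustration only.
like = pd.Series([79.0, 130.0, np.nan, 66.0])
ratio = pd.Series([0.5, 1.0, 2.0])

print(is_whole(like), is_whole(ratio))
```

Null entries are dropped before the test, so a column can pass it while still needing its nulls handled before an integer cast.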

In [6]:
# check for null entries
null_values = df.loc[df.isnull().any(axis=1), df.isnull().any(axis=0)]

# check for unique values to identify categorical columns 
unique_values_categories = df.iloc[:, 1:7].dropna().apply(lambda x: (len(x.unique()), sorted(set(x.unique()))), axis=0)
unique_values_categories = unique_values_categories.transpose()
unique_values_categories.columns = ['Number of Unique Values', 'List of Unique Values']

# check if Paid, like and share contain only whole numbers
check_paid = np.array_equal(df['Paid'].dropna().astype('int64'), df['Paid'].dropna())
check_like = np.array_equal(df['like'].dropna().astype('int64'), df['like'].dropna())
check_share = np.array_equal(df['share'].dropna().astype('int64'), df['share'].dropna())

# display 
display(
        null_values,
        unique_values_categories,
        df.iloc[:, 1:7].describe(),
        "", "Hourly values absent in the Post Hour column: ",
        ("0", 23*(23+1)/2 - sum(unique_values_categories.iloc[4,1])),
        "", "Do Paid, like and share contain only whole number integers?:",
        (check_paid and check_like and check_share)
)

Unnamed: 0,Paid,like,share
111,0.0,,
120,0.0,2.0,
124,0.0,7.0,
164,0.0,18.0,
499,,91.0,28.0


Unnamed: 0,Number of Unique Values,List of Unique Values
Type,4,"[Link, Photo, Status, Video]"
Category,3,"[1, 2, 3]"
Post Month,12,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]"
Post Weekday,7,"[1, 2, 3, 4, 5, 6, 7]"
Post Hour,22,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
Paid,2,"[0.0, 1.0]"


Unnamed: 0,Category,Post Month,Post Weekday,Post Hour,Paid
count,500.0,500.0,500.0,500.0,499.0
mean,1.88,7.038,4.15,7.84,0.278557
std,0.852675,3.307936,2.030701,4.368589,0.448739
min,1.0,1.0,1.0,1.0,0.0
25%,1.0,4.0,2.0,3.0,0.0
50%,2.0,7.0,4.0,9.0,0.0
75%,3.0,10.0,6.0,11.0,1.0
max,3.0,12.0,7.0,23.0,1.0


''

'Hourly values absent in the Post Hour column: '

('0', 21.0)

''

'Do Paid, like and share contain only whole number integers?:'

True

##### 3.4 Checking for trends

Another interesting observation is that the 'Post Month' column is sorted in descending order.
The 'Page total likes' column is similarly sorted in descending order, with exceptions at row indexes [25] and [28].
This suggests the sampled data is from a single calendar year.

Visual inspection of the 'Total Interactions' column suggests it is the sum of the 'comment', 'like' and 'share' columns, and that the 'NaN' values in ['like', 'share'] should be zero. Testing confirms both: with the nulls set to zero, the sum matches 'Total Interactions' in every row. This in turn suggests that the single null entry in ['Paid'] is also zero.

In [7]:
# Check if any column is monotonically sorted
monotonic_cols = df.apply(lambda x: (x.is_monotonic_increasing, x.is_monotonic_decreasing), axis=0)
monotonic_cols.index = ['monotonic_increasing','monotonic_decreasing']
monotonic_cols = monotonic_cols.transpose()

display(monotonic_cols.loc[monotonic_cols.any(axis=1),monotonic_cols.any(axis=0)])

# Check where 'Page total likes' is not monotonically sorted
rows_page_likes_dropped_at = df.loc[(df['Page total likes'].diff() > 0), :].index
a, b = rows_page_likes_dropped_at[0], rows_page_likes_dropped_at[-1]

# Test if ['Total Interactions'] equals the sum of ['comment', 'like', 'share'] when NaNs are set to 0
comment_like_share = df.iloc[:, -4:].fillna(0).astype('int64')
comment_like_share['Sum with NaNs set to 0'] = comment_like_share.iloc[:, -4:-1].sum(axis=1)

display(df.iloc[a-1:a+1, 0:7],
        df.iloc[b-1:b+1, 0:7],
        comment_like_share.iloc[null_values.index],
        "Check if all Values are Equal:",
        (comment_like_share['Sum with NaNs set to 0'] == df['Total Interactions']).all()
)

Unnamed: 0,monotonic_decreasing
Post Month,True


Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid
24,138414,Status,2,12,6,10,0.0
25,138458,Status,2,12,6,3,0.0


Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid
27,138458,Photo,3,12,5,3,0.0
28,138895,Photo,2,12,5,3,0.0


Unnamed: 0,comment,like,share,Total Interactions,Sum with NaNs set to 0
111,0,0,0,0,0
120,0,2,0,2,2
124,0,7,0,7,7
164,0,18,0,18,18
499,0,91,28,119,119


'Check if all Values are Equal:'

True
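
The same check can be written without filling the nulls first, because `DataFrame.sum` skips `NaN` by default and so treats each null as contributing zero. This is a minimal sketch on made-up rows; the column names follow the dataset, while `toy`, `row_sum` and `matches` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "comment": [4, 0, 0],
    "like": [79.0, np.nan, 2.0],
    "share": [17.0, np.nan, np.nan],
    "Total Interactions": [100, 0, 2],
})

# DataFrame.sum skips NaN by default, so nulls contribute zero to the row sum.
row_sum = toy[["comment", "like", "share"]].sum(axis=1)
matches = bool((row_sum == toy["Total Interactions"]).all())
print(matches)
```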

### 4. Insights Identified

1. The dataset contains null values which appear to be zero values that were mistakenly left out during data entry.
2. The sorting of the 'Post Month' and 'Page total likes' columns suggests that the sampled data is from a single calendar year of page operation.
3. The page appears to have grown consistently over the year from 81k to 139k page likes, with the exception of two days in December.
4. The page posted on every day of the week and at every hour except midnight ('0') and 9pm ('21').
5. Several dataset columns are incorrectly typed: ['Paid', 'like', 'share'] are stored as floats although they hold only whole numbers.
6. Of the seven prepublication features (columns 0 to 6), six are categorical (columns 1 to 6) and only one, ['Page total likes'], is numerical. In comparison, all twelve postpublication feature columns are integers.

|#  |Columns               |Datatype |Likely Meaning |
|:--|:---------------------|:-------:|:--------------|
|   |**Prepublication Features** | | |
| 0 |Page total likes      |Integer  |Number of people who have liked the Facebook page??|
| 1 |Type                  |Categorical |Type of post: Link, Photo, Status, Video.  |
| 2 |Category              |Categorical |A categorical variable: (1, 2, 3)          |
| 3 |Post month            |Categorical |Month the post was published.|
| 4 |Post weekday          |Categorical |Weekday the post was published.|
| 5 |Post hour             |Categorical |Hour the post was published.|
| 6 |Paid                  |Categorical |Whether the publication was paid.|
|   |**Postpublication Features** | | |
| 7 |Lifetime post total reach  |Integer  |Total people reached??|
| 8 |Lifetime post total impressions |Integer  |Total number of impressions??|
| 9 |Lifetime engaged users     |Integer  |Number of people who engaged with a page post??|
|10 |Lifetime post consumers    |Integer  |??|
|11 |Lifetime post consumptions |Integer  |??|
|12 |Lifetime post impressions by <br>people who have liked a page|Integer  |Number of post impressions from people who liked the page.|
|13 |Lifetime post reach by people <br>who like a page|Integer  |Total number of page followers (likers) reached??|
|14 |Lifetime people who have liked <br>a page and engaged with a post|Integer  |Number of people who liked the page and engaged with a post.|
|15 |Comments              |Integer  |Number of comments on a post.|
|16 |Likes                 |Integer  |Number of likes on a page post.|
|17 |Shares                |Integer  |Number of times a post was shared.|
|18 |Total interactions    |Integer  |The sum of "likes", "comments" and "shares" for a post.|
|   |**Table: Datakey of Dataset Features** | | |
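
The data key above could be applied to the DataFrame with a single `astype` call once the nulls are handled. The following is a sketch only, on made-up rows: it assumes the nulls in ['like', 'share', 'Paid'] stand for zero, which the earlier checks suggest but do not prove for ['Paid'], and `toy` and `typed` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "Type": ["Photo", "Status", "Photo"],
    "Category": [2, 2, 3],
    "Paid": [0.0, 1.0, np.nan],
    "like": [79.0, 130.0, np.nan],
    "share": [17.0, 29.0, np.nan],
})

# Assumption: nulls in like/share/Paid stand for zero.
typed = toy.fillna({"like": 0, "share": 0, "Paid": 0}).astype({
    "Type": "category",
    "Category": "category",
    "Paid": "bool",
    "like": "int64",
    "share": "int64",
})
print(typed.dtypes)
```

If the zero-fill assumption is not acceptable, the pandas nullable dtypes `"Int64"` and `"boolean"` are an alternative that keeps missing values as `<NA>` instead of replacing them.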
