# Explore our dataset: Facebook

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

## Read data into a DataFrame

* It looks like the CSV is separated by semicolons instead of commas.
* Let's adjust our `pd.read_csv()` call.

In [None]:
df = pd.read_csv('../data/facebook.csv')
df

* Much better now

In [None]:
df = pd.read_csv('../data/facebook.csv', delimiter=';')
df
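
As a minimal sketch of how the delimiter could be detected rather than guessed, the standard library's `csv.Sniffer` can inspect a text sample. The `sample` string below is a made-up stand-in for the first few lines of the file, not the real data.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for the first lines of the CSV file.
sample = "Page total likes;Type;Category\n139441;Photo;2\n139441;Status;2\n"

# csv.Sniffer guesses the delimiter from the candidates we allow.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
detected_delimiter = dialect.delimiter

# Read the sample with the detected delimiter.
sniffed_df = pd.read_csv(io.StringIO(sample), sep=detected_delimiter)
print(repr(detected_delimiter), sniffed_df.shape)
```

For a real file, the sample would come from reading the first few kilobytes of the file before calling `pd.read_csv()`.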

## Helpful definitions related to social media marketing

Given that this dataset is focused on social media marketing and includes jargon specific to that field, I found it helpful to look up the following definitions. The research paper also included the figure below, which confirms these definitions.

* **Reach** is "a metric that represents the number of unique people that were exposed and saw a piece of social media content." ([source: buffer.com][1])
* **Impressions** "refer to the number of times your social media content is displayed on someone's screen." ([source: buffer.com][2])
* **Engagement** is "the total number of likes, comments, shares, and other interactions." ([source: buffer.com][3])
* **Consumers** and **Consumptions** refer to "the process by which individuals engage with various forms of media." ([source: fiveable.me][4])

![image](../docs/3-Table2-1.png)

[1]: https://buffer.com/social-media-terms/reach
[2]: https://buffer.com/social-media-terms/impressions
[3]: https://buffer.com/social-media-terms/engagement-rate
[4]: https://fiveable.me/key-terms/media-literacy/media-consumption

## Check for missing or incorrectly formatted data

### Missing data
* From `df.info()`, it looks like three columns have a tiny bit of missing data.
* Those columns are `Paid`, `like`, and `share`.
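
A quick way to count the missing values per column, sketched on a toy frame with made-up values (only the column names are taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Paid': [0.0, 1.0, np.nan, 0.0],
    'like': [10.0, np.nan, 3.0, 7.0],
    'share': [1.0, 2.0, np.nan, np.nan],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = toy.isna().sum()
# Keep only the columns that actually have missing values.
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Running the same two lines on `df` would give the exact counts that `df.info()` only shows indirectly through its non-null totals.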

### Incorrectly formatted data
* The 'Type' column is a string, which means it will have to be pre-processed into something numeric (for example, through `pd.get_dummies()`).
* The 'Category' column is an integer that is coded to mean one of three different post categories, which means we'll want to pre-process it too.

In [None]:
df.info()
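
The pre-processing mentioned above could look something like the following sketch. The toy values are made up, and one-hot encoding both columns is just one reasonable option, not necessarily the approach used later.

```python
import pandas as pd

# Toy frame with made-up values for the two categorical columns.
toy = pd.DataFrame({
    'Type': ['Photo', 'Status', 'Link', 'Photo'],
    'Category': [1, 2, 3, 2],
})

# One-hot encode both columns. Listing 'Category' explicitly matters:
# it is stored as an integer, so get_dummies() would otherwise skip it.
encoded = pd.get_dummies(toy, columns=['Type', 'Category'])
print(encoded.columns.tolist())
```

Each original column is replaced by one indicator column per distinct value, named `<column>_<value>`.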

## Examine descriptive statistics

In [None]:
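def print_and_plot_descriptive(df, cols, rotation=0):
    """Show summary statistics and a box plot for the given columns."""
    display(df[cols].describe())
    df[cols].plot.box()
    # Rotate the x tick labels so long column names stay readable.
    plt.xticks(rotation=rotation)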

### Column 0 - Page total likes

The `Page total likes` column tells us that we are dealing with posts on pages that have a minimum of 81K likes and up to 139K likes. That observation confirms that we're looking at data from a relatively popular cosmetics brand.

In [None]:
print_and_plot_descriptive(df, ['Page total likes'])

### Column 1 - Type

Because `Type` is a categorical variable, we can't compute the mean, standard deviation, or other descriptive statistics, but we can look at the percent breakdown of values using `value_counts()`.

In [None]:
# normalize=True returns fractions, so no hard-coded row count is needed.
type_breakdown = df['Type'].value_counts(normalize=True) * 100
type_breakdown

In [None]:
_ = type_breakdown.plot.pie(autopct='%1.1f%%')
_ = plt.title('Post type breakdown')

### Column 2 - Category

The `Category` column is another categorical variable, so we'll compute the value counts and plot the breakdown. As shown in the plot, there are three categories, which the research paper identifies as different types of advertising strategy: action, product, and inspiration. The following figure from the research paper provides more context.

![Figure](../docs/4-Table3-1.png)


In [None]:
category_breakdown = df['Category'].value_counts().sort_index()
category_breakdown

In [None]:
_ = category_breakdown.plot.pie(labels=['Action', 'Product', 'Inspiration'], autopct='%1.1f%%')
_ = plt.title('Post category breakdown')

### Columns 3-5 - Post month, post weekday, post hour

Of the datetime columns:

* `Post Month` and `Post Weekday` have the expected ranges (1-12 and 1-7).
* `Post Hour` looks a little strange because its range (1-23) is missing an hour.
* The `describe()` and `value_counts()` output suggests that there may be a data quality issue with `Post Hour`, although it could also simply mean that no posts were published during the missing hour.

In [None]:
print_and_plot_descriptive(df, ['Post Month', 'Post Weekday', 'Post Hour'])

In [None]:
df['Post Hour'].value_counts().sort_index()
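
To list exactly which hours never appear, a set difference is enough. This is a small sketch on made-up values, and it assumes hours are encoded as 0-23.

```python
import pandas as pd

# Made-up posting hours standing in for df['Post Hour'].
toy_hours = pd.Series([1, 3, 3, 10, 13, 23], name='Post Hour')

# Assumption: hours are encoded 0-23.
expected_hours = set(range(24))
missing_hours = sorted(expected_hours - set(toy_hours))
print(missing_hours)
```

Applied to `df['Post Hour']`, the same difference would show whether only one hour is absent or whether several hours simply had no posts.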

### Column 6 - Paid

The `Paid` column is a binary variable, where 0 means the company didn't pay Facebook to advertise the post and 1 means it did. The `describe()` and `value_counts()` output shows that 72% of the posts in the dataset were unpaid and 28% were paid.

In [None]:
df['Paid'].describe()

In [None]:
# normalize=True divides by the number of non-missing values.
df['Paid'].value_counts(normalize=True) * 100

### Columns 7, 8, 12, 13 - Impressions and reach

As shown in the output from `df.describe()`,

* There is the expected dropoff in magnitude of impressions/reach when comparing the whole population to just the subgroup of people who have liked the page.
* Reach ranges from 238 to 180,480 for the total group, whereas the reach by that subgroup is from 236 to 51,456.
* Impressions range from 5.7e+02 to 1.1e+06 for the total group, whereas for the subgroup the range is the same at this precision (5.7e+02 to 1.1e+06).

As far as variance goes,

* For reach, the standard deviation for the total group is much greater than that of the subgroup (22,740 vs 7,682).
* For impressions, the standard deviations for the total group and the subgroup are relatively similar (7.7e+04 vs 6.0e+04).

The box plots show that the impressions/reach data isn't normally distributed, so we'll try to visualize it differently.

In [None]:
print_and_plot_descriptive(df, [
    'Lifetime Post Total Reach',
    'Lifetime Post Total Impressions',
    'Lifetime Post Impressions by people who have liked your Page',
    'Lifetime Post reach by people who like your Page'
])

In [None]:
def plot_impress_reach_hist(df, col, num_of_bins=30):
    """Plot a histogram of `col`, annotated with the bin and record counts."""
    num_of_records = len(df[col])

    fig = plt.figure(figsize=(6, 4))
    ax = fig.add_subplot(111)
    # Equal-width bins spanning the column's minimum to its maximum.
    ax.hist(df[col].dropna(), bins=num_of_bins)
    ax.set_title(col)
    # Annotations use axes coordinates (0-1), independent of the data range.
    ax.text(
        0.7,
        0.7,
        s=f'# of bins: {num_of_bins}',
        transform=ax.transAxes,
        horizontalalignment='left',
        verticalalignment='center'
    )
    ax.text(
        0.7,
        0.6,
        s=f'# of records: {num_of_records}',
        transform=ax.transAxes,
        horizontalalignment='left',
        verticalalignment='center'
    )
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    

Plotting histograms of the reach/impression columns shows:

* A distribution that is heavily skewed to the right, with an overwhelming majority of posts in the smallest bins. After the smallest bin there is a steep dropoff in posts, followed by a long tail that drags on across a large share of the bins.
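* With more data, the distribution might turn out to be a right-skewed normal, but for now it looks closer to a heavy-tailed distribution such as a log-normal or exponential. A Poisson distribution is an unlikely fit, since a Poisson with a mean this large would look roughly symmetric.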

In [None]:
plot_impress_reach_hist(df, 'Lifetime Post Total Reach', num_of_bins=20)
plot_impress_reach_hist(df, 'Lifetime Post Total Impressions', num_of_bins=50)
plot_impress_reach_hist(df, 'Lifetime Post Impressions by people who have liked your Page', num_of_bins=50)
plot_impress_reach_hist(df, 'Lifetime Post reach by people who like your Page', num_of_bins=50)
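
One way to quantify the skew seen in these histograms is the sample skewness, before and after a log transform. The sketch below uses synthetic log-normal data as a stand-in for a reach column. The choice of log-normal and its parameters are assumptions for illustration, not a fitted model.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed data standing in for a reach column.
rng = np.random.default_rng(0)
toy_reach = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=500))

# Sample skewness: near 0 for symmetric data, large and positive for
# a long right tail.
raw_skew = toy_reach.skew()

# log1p compresses the long tail and handles zeros safely.
log_skew = np.log1p(toy_reach).skew()
print(f'raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}')
```

If the log-transformed column looks roughly symmetric, plotting the histogram of `np.log1p(df[col])` may be a more readable view of the same data.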

### Columns 9-11 - Engagement and consumption

From `df.describe()` we see that:

* The mean number of `Engaged Users` (920) is larger than the mean number of `Consumers` (798).
* The range for `Engaged Users` (9 to 11.4K) is similar in size to that of `Consumers` (9 to 11.3K). This is a bit odd if engaged users are understood as the number of unique consumers, whereas consumers can contain repetitions, because this dataset shows more engaged users than consumers. This observation may be a sign of a data quality issue, or a sign that these working definitions need revisiting.
* The means are ordered `Consumptions` > `Engaged Users` > `Consumers`. `Consumptions` being the largest is expected because one user can consume a post multiple times, each time counting as another consumption.

In [None]:
print_and_plot_descriptive(df, [
    'Lifetime Engaged Users',
    'Lifetime Post Consumers',
    'Lifetime Post Consumptions'
])
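
Comparing means only describes the columns in aggregate. A row-wise comparison shows whether the ordering holds for every post. This is a sketch on made-up numbers that reuses the dataset's column names.

```python
import pandas as pd

# Made-up values; the third row deliberately breaks the expected ordering.
toy = pd.DataFrame({
    'Lifetime Engaged Users': [120, 45, 300],
    'Lifetime Post Consumers': [100, 45, 310],
    'Lifetime Post Consumptions': [150, 60, 500],
})

# Flag posts where there are fewer engaged users than consumers.
fewer_engaged = toy['Lifetime Engaged Users'] < toy['Lifetime Post Consumers']
suspect_rows = toy[fewer_engaged]
print(suspect_rows)
```

An empty `suspect_rows` on the real data would mean the ordering seen in the means also holds post by post.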

### Column 14 - People who have liked your Page and engaged with your post

As shown in the `df.describe()` output here,

* The mean of `Engaged Users` is greater than the mean of `People who have liked your Page and engaged with your post`.
* This observation matches our understanding that column 14 is a subset of `Engaged Users`.

In [None]:
print_and_plot_descriptive(df, [
    'Lifetime Engaged Users',
    'Lifetime People who have liked your Page and engaged with your post',
])

### Columns 15-18 - comment, like, share, total interactions

The first three columns show a breakdown of the ways in which people interacted with the post, and the fourth column is their sum.

* The columns in order of increasing magnitude (judged by mean and max) are `comment`, `share`, and `like`.

In [None]:
print_and_plot_descriptive(df, [
    'comment',
    'like',
    'share',
    'Total Interactions',
])

* Summing the first three columns shows that the result is identical to the `Total Interactions` column.

In [None]:
df['New total interact'] = df[[
    'comment',
    'like',
    'share']].sum(axis=1)
df[[
    'Total Interactions',
    'New total interact',
]].describe()
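
Matching summary statistics are strong evidence but not proof that every row agrees. A direct element-wise comparison is a stricter check, sketched here on made-up values. Note that `sum(axis=1)` skips missing values by default, which is an assumption about how the original total was computed.

```python
import numpy as np
import pandas as pd

# Made-up values, including missing likes and shares.
toy = pd.DataFrame({
    'comment': [4, 0, 2],
    'like': [79.0, np.nan, 10.0],
    'share': [17.0, 1.0, np.nan],
    'Total Interactions': [100, 1, 12],
})

# sum(axis=1) ignores NaN by default, so missing values count as 0.
recomputed = toy[['comment', 'like', 'share']].sum(axis=1)

# Element-wise comparison against the provided total.
matches = recomputed.eq(toy['Total Interactions'])
print(matches.all())
```

On the real data, `df[~matches]` would list any posts where the two totals disagree.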

In [None]:
# Pairwise Pearson correlation between the three interaction columns.
df[['like', 'comment', 'share']].corr()

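## Correlation plot

Regularization methods such as L1 (Lasso) and L2 (Ridge) add a penalty term that shrinks the coefficients so that some coefficients have less sway. L1 can shrink coefficients all the way to zero, whereas L2 only shrinks them toward zero.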
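
For reference, the two penalty terms can be written out as follows. The notation is an assumption introduced here, not taken from the research paper: $y$ is the target, $X$ the feature matrix, $\beta$ the coefficient vector, and $\lambda \ge 0$ the regularization strength.

$$
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
\qquad
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
$$

The only difference is the penalty: the sum of absolute coefficient values for L1 versus the sum of squared values for L2. Setting $\lambda = 0$ recovers ordinary least squares in both cases.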