# Project 1: Facebook dataset (political ads)

**Team members: MD Abdullah & Ronald**<br>

### About the dataset
This database, updated daily, contains ads that ran on Facebook and were submitted by thousands of ProPublica users from around the world. ProPublica asked its readers to install browser extensions that automatically collected the advertisements shown on their Facebook pages and sent them to ProPublica's servers. A machine learning classifier then identified which ads were likely political, and those ads were included in this dataset.

In [1]:
# reference to the data
data_file = './data/fbpac-ads-en-US.csv'

## Data loading

In [4]:
# imports
import time

import pandas as pd

# derive the cache and preview paths from the csv path once
pickle_file = data_file.replace('.csv', '.pkl')
preview_file = data_file.replace('.csv', '_preview.xlsx')

# try to load the data from the pickle file; if it does not exist, fall back to the .csv file; if that also fails, show an error
loading_start_time = time.time()
try:
    # load pickle file
    print("loading data from pickle file...")
    df = pd.read_pickle(pickle_file)
    print(f"data loaded (took {time.time() - loading_start_time:.1f} seconds)")
except FileNotFoundError:
    try:
        print("  -> no pickle file found")
        print("loading data from csv file instead (can take a while)...")
        df = pd.read_csv(data_file)
        print(f"  -> data loaded (took {time.time() - loading_start_time:.1f} seconds)")
        # resave as .pkl file - this loads around 7 times faster on my machine
        print("  -> resaving it as .pkl file (for faster loading in the future)")
        df.to_pickle(pickle_file)
        # save a preview of the data in an excel file, with just 100 rows (to_excel needs an Excel writer such as openpyxl)
        print(f"  -> saving a preview of the data as {preview_file}")
        df.head(100).to_excel(preview_file, index=False)
    except FileNotFoundError:
        # df stays undefined in this case, so the cells below will fail until the path is fixed
        print(f"\nERROR: data '{data_file}' does not exist.\nCheck if the file exists and if Jupyter is in the correct working directory")

loading data from pickle file...
  -> no pickle file found
loading data from csv file instead (can take a while)...

ERROR: data './daata/fbpac-ads-en-US.csv' does not exist.
Check if the file exists and if Jupyter is in the correct working directory


## Descriptive statistics
* Build a summary of the data frame
* Calculate central tendency and dispersion measures
* Get the distribution of the data (each column)
* Analyze relationships between features

In [11]:
# Descriptive statistics

# Show a summary of the data frame
# (df.info() prints its report itself and returns None, so it is not wrapped in print)
print("\n### Data Summary ###")
df.info()

# Unique counts for each column
print("\n*** Unique Counts ***")
print(df.nunique())

# Show central tendency and dispersion measures (count, mean, std, min, quartiles, max) for the numeric columns
print("\n### Central Tendency Measures ###")
pd.options.display.float_format = "{:.2f}".format
print(df.describe().transpose())


### Data Summary ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162324 entries, 0 to 162323
Data columns (total 24 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              162324 non-null  object 
 1   html                            162324 non-null  object 
 2   political                       162324 non-null  int64  
 3   not_political                   162324 non-null  int64  
 4   title                           162306 non-null  object 
 5   message                         162324 non-null  object 
 6   thumbnail                       162324 non-null  object 
 7   created_at                      162324 non-null  object 
 8   updated_at                      162324 non-null  object 
 9   lang                            162324 non-null  object 
 10  images                          162324 non-null  object 
 11  impressions                     162324 non-null  int64  
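
The dispersion and relationship items listed for this section are not covered by `describe()` alone. A minimal sketch of how they could be computed is shown below. It runs on a small hypothetical `sample` frame with made-up values; only the column names `political`, `not_political` and `impressions` are taken from the summary above, and the helper name `dispersion_table` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data frame: the values are made up,
# only the column names follow the data summary.
sample = pd.DataFrame({
    "political":     [10, 0, 3, 7, 1, 5],
    "not_political": [0, 8, 2, 1, 6, 0],
    "impressions":   [4, 1, 2, 9, 1, 3],
})


def dispersion_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Return variance, standard deviation, range, IQR and skew per numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    return pd.DataFrame({
        "variance": numeric.var(),
        "std": numeric.std(),
        "range": numeric.max() - numeric.min(),
        "iqr": q3 - q1,
        "skew": numeric.skew(),  # rough indicator of the shape of each distribution
    })


dispersion = dispersion_table(sample)
print(dispersion)

# Pairwise Pearson correlation between the numeric columns
correlations = sample.select_dtypes(include="number").corr()
print(correlations)
```

On the real data, `dispersion_table(df)` and `df.select_dtypes(include="number").corr()` would be the corresponding calls, assuming `df` loaded successfully.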

## Exploratory Data Analysis (EDA)
* Identify the variables and their types 
* Clean the data (address errors, duplicates, missing values, outliers)
* Transformation (standardization, normalization, encoding categorical to numerical)
* Data Visualization (use the suitable visualization that you need)

In [None]:
# Insert code block for exploratory data analysis (EDA)
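
As a starting point for this cell, the cleaning and transformation steps listed above could look roughly like the sketch below. It is not the team's analysis. It works on a hypothetical `raw` frame with made-up rows, assumes that `political` and `not_political` are vote counts, and assumes that `created_at` holds timestamp strings in a format `pd.to_datetime` can parse. The helper `clean_ads` and the derived columns `political_share` and `impressions_z` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows: only the column names come from the data summary.
raw = pd.DataFrame({
    "id": ["a1", "a2", "a2", "a3"],
    "title": ["Vote now", None, None, "Join us"],
    "political": [12, 0, 0, 3],
    "not_political": [1, 9, 9, 3],
    "created_at": [
        "2018-10-01 12:00:00+00:00",
        "2018-10-02 08:30:00+00:00",
        "2018-10-02 08:30:00+00:00",
        "2018-11-05 17:45:00+00:00",
    ],
    "impressions": [5, 1, 1, 40],
})


def clean_ads(frame: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, fill missing titles, parse dates and add derived columns."""
    # Duplicates: keep the first row for each ad id
    cleaned = frame.drop_duplicates(subset="id").copy()

    # Missing values: treat a missing title as an empty string
    cleaned["title"] = cleaned["title"].fillna("")

    # Types: convert the timestamp strings to timezone-aware datetimes
    cleaned["created_at"] = pd.to_datetime(cleaned["created_at"], utc=True)

    # Transformation: share of "political" votes, left as NaN when there are no votes
    votes = cleaned["political"] + cleaned["not_political"]
    cleaned["political_share"] = (cleaned["political"] / votes).where(votes > 0)

    # Standardization: z-score of impressions
    impressions = cleaned["impressions"]
    cleaned["impressions_z"] = (impressions - impressions.mean()) / impressions.std()

    return cleaned.reset_index(drop=True)


cleaned = clean_ads(raw)
print(cleaned[["id", "title", "created_at", "political_share", "impressions_z"]])
```

Outlier handling and visualization are left out on purpose, since suitable thresholds and chart types depend on what the real distributions look like.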

## Define your research questions/objectives
* Perform hypothesis testing
* Interpret the findings in the context of your research question or objective. Draw conclusions and make recommendations based on your analysis.
* Communicate your results: Present your insights and conclusions in a clear and concise manner, using visualizations and descriptive statistics. Tailor your communication to your audience, whether it be technical or non-technical.

In [3]:
# Insert code block for answering the research questions / objectives
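
No research question has been fixed yet, so the following is only an illustration of what a hypothesis test in this cell could look like. It uses a hypothetical question ("do ads that most users rated as political have a different mean number of impressions?"), a made-up `ads` frame, and a two-sided permutation test written with NumPy. The function `permutation_test` and its parameters are assumptions introduced for this sketch, not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the values are made up, only the column names are real.
ads = pd.DataFrame({
    "political":     [9, 8, 7, 6, 1, 0, 2, 1],
    "not_political": [1, 0, 2, 1, 8, 9, 7, 6],
    "impressions":   [12, 15, 9, 14, 3, 2, 5, 4],
})


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean of a minus mean of b) and an
    estimated p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# H0: the mean impressions are equal in both groups
mostly_political = ads["political"] > ads["not_political"]
observed_diff, p_value = permutation_test(
    ads.loc[mostly_political, "impressions"],
    ads.loc[~mostly_political, "impressions"],
)
print(f"observed difference in means: {observed_diff:.2f}")
print(f"estimated p-value: {p_value:.4f}")
```

A permutation test is used here because it makes no normality assumption about the impression counts. A t-test or a rank-based test would be alternatives, depending on the distributions found during the EDA.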