# Snapchat Political Ads
This project uses political ads data from Snapchat, a popular social media app. Interesting questions to consider include:
- What are the most prevalent organizations, advertisers, and ballot candidates in the data? Do you recognize any?
- What are the characteristics of ads with a large reach, i.e., many views? What may a campaign consider when maximizing an ad's reach?
- What are the characteristics of ads with a smaller reach, i.e., fewer views? Aside from funding constraints, why might a campaign want to produce an ad with a smaller but more targeted reach?
- What are the characteristics of the most expensive ads? If a campaign is limited on advertising funds, what type of ad may the campaign consider?
- What groups or regions are targeted frequently? (For example, for single-gender campaigns, are men or women targeted more frequently?) What groups or regions are targeted less frequently? Why? Does this depend on the type of campaign?
- Have the characteristics of ads changed over time (e.g. over the past year)?
- When is the most common local time of day for an ad's start date? What about the most common day of week? (Make sure to account for time zones for both questions.)

### Getting the Data
The data and its corresponding data dictionary are downloadable [here](https://www.snap.com/en-US/political-ads/). Download both the 2018 CSV and the 2019 CSV.

The CSVs have the same filename; rename the CSVs as needed.

Note that the CSVs have the exact same columns and the exact same data dictionaries (`readme.txt`).

### Cleaning and EDA
- Concatenate the 2018 CSV and the 2019 CSV into one DataFrame so that we have data from both years.
- Clean the data.
    - Convert `StartDate` and `EndDate` into datetime. Make sure the datetimes are in the correct time zone. You can use whatever time zone you want (e.g. UTC) as long as you are consistent. However, if you want to answer a question like "When is the most common local time of day for an ad's start date?", you will need to convert time zones as needed. See Hint 2 below for more information.
- Understand the data in ways relevant to your question using univariate and bivariate analysis of the data as well as aggregations.

*Hint 1: What is the "Z" at the end of each timestamp?*

*Hint 2: `pd.to_datetime` will be useful here. `Series.dt.tz_convert` will be useful if a change in time zone is needed.*
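
As a minimal sketch of the two hints above: the trailing "Z" in an ISO 8601 timestamp denotes UTC, so the parsed values can be made timezone-aware and then shifted to a local zone. The sample strings below are made up for illustration, and `America/Los_Angeles` is only one possible target zone.

```python
import pandas as pd

# Hypothetical sample values written in the same ISO 8601 style as StartDate
raw = pd.Series(["2018-11-05T23:30:00Z", "2019-03-10T08:15:00Z"])

# The trailing "Z" marks UTC; utc=True makes the timezone-aware result explicit
start_utc = pd.to_datetime(raw, utc=True)

# Convert to a local zone (an arbitrary choice for this sketch)
start_local = start_utc.dt.tz_convert("America/Los_Angeles")

# Local hour of day and day of week can then be read off directly
local_hours = start_local.dt.hour.tolist()
local_days = start_local.dt.day_name().tolist()
print(local_hours, local_days)
```

Note that the conversion changes only how each instant is displayed, not the instant itself, so comparisons between converted and unconverted timestamps remain valid.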

*Tip: To visualize geospatial data, consider [Folium](https://python-visualization.github.io/folium/) or another geospatial plotting library.*

### Assessment of Missingness
Many columns which have `NaN` values may not actually have missing data. How come? In some cases, a null or empty value corresponds to an actual, meaningful value. For example, `readme.txt` states the following about `Gender`:

>  Gender - Gender targeting criteria used in the Ad. If empty, then it is targeting all genders

In this scenario, an empty `Gender` value (which is read in as `NaN` in pandas) corresponds to "all genders".
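
A small sketch of how such a meaningful empty value might be made explicit, assuming a toy DataFrame rather than the real data. The replacement label `"ALL"` is a hypothetical choice, not something defined in `readme.txt`.

```python
import numpy as np
import pandas as pd

# Toy data: NaN here stands for "targets all genders", not for unknown data
ads = pd.DataFrame({"Gender": ["FEMALE", np.nan, "MALE", np.nan]})

# Replace the empty values with an explicit (hypothetical) label
ads["Gender"] = ads["Gender"].fillna("ALL")

gender_counts = ads["Gender"].value_counts().to_dict()
print(gender_counts)
```

After this step the column no longer shows up as having nulls, which keeps it from being mistaken for a column with genuinely missing data.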

- Refer to the data dictionary to determine which columns do **not** belong to the scenario above. Assess the missingness of one of these columns.
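
One common way to assess whether the missingness of a column depends on another column is a permutation test on a missingness indicator. The sketch below assumes toy data and a hypothetical helper named `missingness_pvalue`; the choice of `EndDate` and `Spend` is only illustrative.

```python
import numpy as np
import pandas as pd


def missingness_pvalue(df, missing_col, other_col, n_repetitions=1000, seed=0):
    """Permutation p-value for whether `other_col` differs between rows where
    `missing_col` is missing and rows where it is present."""
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    values = df[other_col].to_numpy(dtype=float)

    def abs_diff_of_means(mask):
        # Absolute difference in mean between "missing" and "present" rows
        return abs(values[mask].mean() - values[~mask].mean())

    observed = abs_diff_of_means(is_missing)
    # Shuffle the missingness labels to simulate "missingness unrelated to other_col"
    simulated = np.array(
        [abs_diff_of_means(rng.permutation(is_missing)) for _ in range(n_repetitions)]
    )
    return float((simulated >= observed).mean())


# Toy data (made up): the two rows with a missing EndDate have the largest Spend
toy = pd.DataFrame({
    "EndDate": ["2019-01-05", None, "2019-02-01", None, "2019-03-09", "2019-04-02"],
    "Spend": [120, 900, 80, 1100, 60, 150],
})
p_value = missingness_pvalue(toy, "EndDate", "Spend")
print(p_value)
```

A small p-value would suggest that missingness is related to the other column; a large one is consistent with missingness that does not depend on it.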

### Hypothesis Test / Permutation Test
Find a hypothesis test or permutation test to perform. You can use the questions at the top of the notebook for inspiration.

# Summary of Findings

### Introduction
TODO

### Cleaning and EDA
TODO

### Assessment of Missingness
TODO

### Hypothesis Test
TODO

# Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from scipy import stats
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Cleaning and EDA

In [2]:
# Reading in the CSVs

pol18 = pd.read_csv('2018PoliticalAds.csv')
pol19 = pd.read_csv('2019PoliticalAds.csv')

In [3]:
# First, I check to see if all the columns match between the CSVs

pol19.columns == pol18.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])
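
The element-wise comparison above returns one boolean per column. As a sketch on toy frames (the column names are borrowed from the dataset, the frames themselves are empty placeholders), `Index.equals` collapses the check into a single boolean and also works when the two frames have a different number of columns.

```python
import pandas as pd

# Empty placeholder frames that only carry column labels
a = pd.DataFrame(columns=["ADID", "Spend", "StartDate"])
b = pd.DataFrame(columns=["ADID", "Spend", "StartDate"])
c = pd.DataFrame(columns=["ADID", "StartDate", "Spend"])

# True only if both indexes hold the same labels in the same order
same_ab = a.columns.equals(b.columns)
same_ac = a.columns.equals(c.columns)
print(same_ab, same_ac)
```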

In [4]:
# I check to see if the concatenation of the CSVs was successful

pol_comb = pd.concat([pol18, pol19], ignore_index = True)
pol_comb.columns == pol19.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [5]:
# I chose the columns that were relevant to my question

useful_cols = ['ADID', 'Spend', 'StartDate', 'EndDate', 'OrganizationName', 'CountryCode']
pol_comb = pol_comb[useful_cols]

In [6]:
# I create a table that contains the percentage of null values in each category - everything has some value except for EndDate

s_type = pol_comb.dtypes
s_null = pol_comb.isnull().mean().sort_values(ascending = False)
type_null = pd.concat([s_type, s_null], axis = 1)
type_null.columns = ['type', 'null %']
type_null.sort_values(by = 'null %', ascending = False)

                    type    null %
EndDate           object  0.182052
ADID              object  0.000000
CountryCode       object  0.000000
OrganizationName  object  0.000000
Spend              int64  0.000000
StartDate         object  0.000000


In [7]:
# It is difficult to see the distribution because there are outliers
# (us_pol_comb is not defined until a later cell, so the combined data is plotted here)

pol_comb['Spend'].plot(kind='hist', title='Distribution of Expenditures')
pol_comb['Spend'].describe()

In [None]:
# First we convert the date to DateTime Objects 

pol_comb["StartDate"] = pd.to_datetime(pol_comb["StartDate"])
pol_comb["EndDate"] = pd.to_datetime(pol_comb["EndDate"])

In [None]:
# Given the data, it is difficult to determine where each political ad originated, which makes it difficult to tell at what local time it was released
# For the purposes of this project, I identified the countries that most of the ads come from

pol_comb['CountryCode'].value_counts(normalize = True)

In [None]:
# Because the majority of ads come from the US (53.58%), I decided to focus only on the ads that originate from the US

us_pol_comb = pol_comb[pol_comb['CountryCode'] == 'united states'].reset_index(drop = True)

In [None]:
# Although there are still different time zones in the US, I decided to settle on Pacific Time in order to standardize all the times
# As a result, ads that were published after 9pm Pacific could already fall on the next day in other US regions, but we are forced to generalize

# Assign the converted columns directly so the new time zone is kept
us_pol_comb["StartDate"] = us_pol_comb["StartDate"].dt.tz_convert('US/Pacific')
us_pol_comb["EndDate"] = us_pol_comb["EndDate"].dt.tz_convert('US/Pacific')

In [None]:
# We can extract the day of week and the month in which the ads are released

# weekday is a method on Timestamp, so it must be called to get the integer (Monday = 0)
us_pol_comb['StartDOW'] = us_pol_comb['StartDate'].apply(lambda x: x.weekday())
us_pol_comb['StartMonth'] = us_pol_comb['StartDate'].apply(lambda x: x.month)

In [None]:
# Distribution of Weekdays - It is interesting to note that the ads are usually published on Tuesdays and Thursdays, and rarely on weekends
# This is an issue that we want to focus on

dayDict = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
us_pol_comb['StartDOW'] = us_pol_comb['StartDOW'].replace(dayDict)
us_pol_comb['StartDOW'].value_counts().plot(kind = 'bar', title = 'Number of Ads by Weekday')

In [None]:
# In order to focus on our question, it is important that we now group the ads into weekdays and weekends

us_pol_comb['isWeekday'] = ~us_pol_comb['StartDOW'].isin(['Saturday', 'Sunday'])
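
As an alternative sketch, the day name and the weekday flag can both be derived straight from a datetime column with the `.dt` accessor, without an intermediate lookup dictionary. The two timestamps below are made-up examples.

```python
import pandas as pd

# Hypothetical start times: 2019-06-01 is a Saturday, 2019-06-03 is a Monday
start = pd.Series(pd.to_datetime(["2019-06-01T10:00:00Z", "2019-06-03T10:00:00Z"], utc=True))

# Day names directly from the datetime column
day_names = start.dt.day_name()

# dayofweek runs from 0 (Monday) to 6 (Sunday), so values below 5 are weekdays
is_weekday = start.dt.dayofweek < 5
print(day_names.tolist(), is_weekday.tolist())
```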

In [None]:
# We can now visualize a bar chart of the number of ads aggregated by the part of week they are released
# We observe a high number of ads released on a weekday compared to the weekend

dow = us_pol_comb[['isWeekday', 'Spend']]
dow_counts = dow.groupby('isWeekday').count()
dow_counts.plot.bar(title = 'Number of Ads Released')

In [None]:
# Similarly, we can see that more money was spent on average on ads released on a weekday as compared to on a weekend
# However, this visualization can be biased due to outliers in the data, because the mean is sensitive to extreme values

dow_mean_spend = dow.groupby('isWeekday').mean()
dow_mean_spend.plot.bar(title = 'Average Amount of Money Spent on Ads')
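
To illustrate why the mean can mislead here, the sketch below compares mean and median on a toy table with one extreme weekday value. The numbers and the `part_of_week` column are invented for illustration.

```python
import pandas as pd

# Toy data: one extreme weekday expenditure among otherwise similar values
toy = pd.DataFrame({
    "part_of_week": ["weekday"] * 4 + ["weekend"] * 3,
    "Spend": [10, 20, 30, 100000, 15, 25, 35],
})

# Compare a statistic that is sensitive to outliers with one that is not
summary = toy.groupby("part_of_week")["Spend"].agg(["mean", "median"])
print(summary)
```

In this toy case the medians of the two groups agree while the weekday mean is dominated by the single large value.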

In [None]:
# We visualized the distributions of the expenditures on weekdays vs weekends and noticed that most of the data points are concentrated near zero
# The visualization is wide because it is being stretched out by outliers

us_pol_comb.groupby('isWeekday')['Spend'].plot(kind='kde', legend=True, title='Distribution of Expenditures')
plt.xlim(-12000, 15000)

In [None]:
# To emphasize the influence of the outliers, we created a box plot to see how the data is distributed
# The box itself is squished on the far left because there are so many data points around zero and as a result, there are many outliers (especially for weekdays)

weekday = dow[dow['isWeekday'] == True]
weekend = dow[dow['isWeekday'] == False]
sns.boxplot(data=[weekday['Spend'], weekend['Spend']], orient='h')

In [None]:
us_pol_comb.shape[0]

In [None]:
# To visualize our data without the outliers, we calculated the z-score of every expenditure
# We dropped the data points whose absolute z-score was greater than 3, which were 16 data points (their row positions are printed below)

no_out = us_pol_comb.copy()
z = np.abs(stats.zscore(no_out['Spend']))
print(np.where(z > 3))
no_out = no_out[z <= 3]
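
The same filtering rule can be sketched on a toy series with a single extreme value. The numbers are made up, and the threshold of 3 simply mirrors the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy expenditures: twenty modest values plus one extreme value
spend = pd.Series(list(range(50, 250, 10)) + [50000])

# Absolute z-score of each value relative to the mean and standard deviation of the series
z = np.abs(stats.zscore(spend.to_numpy()))

# Keep values within 3 standard deviations, set aside the rest
kept = spend[z <= 3]
removed = spend[z > 3]
print(removed.tolist(), len(kept))
```

One caveat with this rule is that the outliers themselves inflate the mean and standard deviation used to compute the z-scores, so moderately large values can remain after filtering.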

In [None]:
# Because we took out only 16 data points, we do not expect the distribution of counts to change significantly
# (StartDOW already holds the day names carried over from us_pol_comb)

no_out['StartDOW'].value_counts().plot(kind = 'bar', title = 'Number of Ads by Weekday')

In [None]:
# The distribution of expenditures around zero, now without the outliers

no_out.groupby('isWeekday')['Spend'].plot(kind='kde', legend=True, title='Distribution of Expenditures')
plt.xlim(-12000, 15000)

In [None]:
# Now we can go back to categorizing the ads by time of week without including the outliers

dow_no_out = no_out[['isWeekday', 'Spend']]
dow_no_out

In [None]:
# By taking out the outliers, we can see a drastic change in the average amount of money spent
# This is because all the outliers were on weekdays, which suggests that the organizations that invested heavily in their ads wanted them released on a weekday

dow_mean_spend_no_out = dow_no_out.groupby('isWeekday').mean()
dow_mean_spend_no_out.plot.bar(title = 'Average Amount of Money Spent on Ads')

In [None]:
weekday_outno = dow_no_out[dow_no_out['isWeekday'] == True]
weekend_outno = dow_no_out[dow_no_out['isWeekday'] == False]

In [None]:
# The boxplots are still not as readable as we would want because the expenditures are heavily skewed rather than evenly spread
# Several data points still count as outliers (though less extreme ones), which can be seen in the distributions

sns.boxplot(data=[weekday_outno['Spend'], weekend_outno['Spend']], orient='h')

In [None]:
# Total spend by day of week (rows) and month (columns)

pd.pivot_table(us_pol_comb, values = 'Spend', index = 'StartDOW', columns = 'StartMonth', aggfunc = 'sum')

### Assessment of Missingness

In [None]:
# TODO

### Hypothesis Test

## Permutation Test - Testing by Simulation
- **Null hypothesis**: The mean amount of money spent is the same for ads released on weekdays and on weekends; any observed difference is due to chance.
- **Alternate hypothesis**: The mean amount of money spent differs between ads released on weekdays and on weekends.
- **Test Statistic**: Absolute difference in means

We set a significance level of 0.05.

In [None]:
#observed means
means_table = dow.groupby('isWeekday').mean()
means_table

In [None]:
#observed test statistic (absolute difference in means, to match the simulated statistics)
observed_difference = abs(means_table.diff().iloc[-1,0])
observed_difference

In [None]:
#simulation

N = 1000
results = []

for _ in range(N):
    #create shuffled dataframe: permute the isWeekday labels while Spend stays in place
    shuffled = dow.assign(shuffled_weekday=np.random.permutation(dow['isWeekday'].to_numpy()))
    
    #calculate absolute difference of mean Spend and add to results list
    shuff_means = shuffled.groupby('shuffled_weekday')['Spend'].mean()
    results.append(abs(shuff_means.diff().iloc[-1]))

diffs_of_means = pd.Series(results)

In [None]:
diffs_of_means

In [None]:
pval = (diffs_of_means >= observed_difference).sum() / N
pval

### Conclusion

* At the 0.05 significance level, we cannot reject the null hypothesis that there is no difference between the amount of money spent on ads shown on weekdays and weekends.

## However
Our exploratory data analysis showed clear outliers - what would happen if these were removed?

In [None]:
#observed means
means_table_clean = dow_no_out.groupby('isWeekday').mean()
means_table_clean

In [None]:
#observed test statistic (absolute difference in means, to match the simulated statistics)
observed_difference_clean = abs(means_table_clean.diff().iloc[-1,0])
observed_difference_clean

In [None]:
#simulation

N = 1000
results = []

for _ in range(N):
    #create shuffled dataframe: permute the isWeekday labels while Spend stays in place
    #(a NumPy array is used so the gaps left in the index by the outlier filter cannot misalign rows)
    shuffled = dow_no_out.assign(shuffled_weekday=np.random.permutation(dow_no_out['isWeekday'].to_numpy()))
    
    #calculate absolute difference of mean Spend and add to results list
    shuff_means = shuffled.groupby('shuffled_weekday')['Spend'].mean()
    results.append(abs(shuff_means.diff().iloc[-1]))

diffs_of_means_clean = pd.Series(results)

In [None]:
pval = (diffs_of_means_clean >= observed_difference_clean).sum() / N
pval

## Conclusion Without Outliers
- With a p-value of less than 0.05, we reject the null hypothesis that there is no difference between the amount of money spent on ads shown on weekdays and weekends.
- The result favors the alternate hypothesis: the observed difference in mean spend is a **significant difference**, larger than we would expect from random chance alone.
- The outliers had a large effect on the outcome of the test, so the outliers themselves merit more analysis.