# Predictive Analytics using Lasso Regression Model
## By Nicole Haberer
## Created for APRD6342

This notebook uses a dataset of advertising campaign engagement from Facebook. Each row is an ad that ran on the platform, with the amount spent on the campaign and other outcome and engagement data.

Can we use this data to learn how to advertise more effectively?

Find the correlation between Amount Spent and these variables:

Reach <br>
Frequency <br>
Unique Clicks <br>
Page Likes <br>

Then run a regression where Unique Clicks is the dependent variable and Reach and Frequency are the independent (predictor) variables.

In [2]:
import pandas as pd
from pandas import DataFrame
import datetime
import dateutil.parser   
import numpy
import statsmodels.api as sm

In [3]:
#reads the CSV file into a pandas DataFrame
pony = pd.read_csv("Travel Pony Facebook.csv")

In [12]:
#create column 'cost per impression' by dividing Amount Spent / Impressions and save that as a column.
pony['costperimpression'] = pony['Amount Spent (USD)'] / pony['Impressions']
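
If any ad in the export recorded zero impressions, the division above would produce `inf` or `NaN` and could distort later averages. Whether such rows exist in this dataset is not shown, so the following is only a minimal sketch of a guard on a small made-up frame (the values are illustrative, not taken from the Travel Pony data):

```python
import pandas as pd

# Toy rows (hypothetical values) including one ad with zero impressions
toy = pd.DataFrame({
    'Amount Spent (USD)': [1.0, 2.0, 3.0],
    'Impressions': [500, 0, 1000],
})

# Keep impressions only where they are positive; zeros become NaN,
# so the ratio is NaN instead of inf for those rows
safe_impressions = toy['Impressions'].where(toy['Impressions'] > 0)
toy['costperimpression'] = toy['Amount Spent (USD)'] / safe_impressions

# Series.mean() skips NaN by default, so the zero-impression row is ignored
mean_cpi = toy['costperimpression'].mean()
print(toy)
print(mean_cpi)
```

Treating zero impressions as missing is one design choice among several; dropping those rows before the division would work just as well.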

In [8]:
# Create new variable to translate Start Date into day of the week format
weekdate = pd.to_datetime(pony['Start Date'], format = '%x')   
dayofweek = weekdate.dt.strftime('%A')
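
The `%x` directive means "the locale's date representation", so the parse above depends on the locale settings of the machine running it. A minimal sketch with an explicit format string, assuming the dates are month/day/two-digit-year strings (an assumption; the actual layout of 'Start Date' is not shown in this notebook):

```python
import pandas as pd

# Hypothetical sample dates in an assumed MM/DD/YY layout
sample = pd.Series(['10/16/18', '10/20/18'])

# An explicit format string removes the dependence on the system locale
parsed = pd.to_datetime(sample, format='%m/%d/%y')

# day_name() returns English weekday names by default,
# matching strftime('%A') in an English locale
day_names = parsed.dt.day_name()
print(day_names.tolist())
```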

In [30]:
# Use variable to add new column to dataframe
pony['Day of Week'] = dayofweek

# find mean cost per impression for each day of the week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
for day in days:
    # keep only the ads that started on this weekday
    day_rows = pony[pony['Day of Week'] == day]
    day_mean = numpy.mean(day_rows['costperimpression'])
    print(day + ' Cost Per Impression = ' + str(day_mean))

#Cheapest day to generate impressions = Saturday
#Most expensive day to generate impressions = Friday

Monday Cost Per Impression = 0.0029099769333124532
Tuesday Cost Per Impression = 0.003040039601778186
Wednesday Cost Per Impression = 0.002980820271366957
Thursday Cost Per Impression = 0.003398311623555868
Friday Cost Per Impression = 0.004096890719487211
Saturday Cost Per Impression = 0.0026286969333697923
Sunday Cost Per Impression = 0.003687913293955785
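
The per-day averages can also be computed in a single pass with `groupby`, which makes it easy to pick out the cheapest and most expensive days programmatically. A sketch on made-up rows (the column names follow the notebook; the values are illustrative only):

```python
import pandas as pd

# Toy data (hypothetical values) with two weekdays
toy = pd.DataFrame({
    'Day of Week': ['Friday', 'Saturday', 'Friday', 'Saturday'],
    'costperimpression': [0.004, 0.002, 0.006, 0.003],
})

# Mean cost per impression for each weekday
by_day = toy.groupby('Day of Week')['costperimpression'].mean()

# idxmin / idxmax return the index label (the day) of the extreme value
cheapest = by_day.idxmin()
priciest = by_day.idxmax()
print(by_day)
print('Cheapest:', cheapest, '| Most expensive:', priciest)
```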


In [31]:
#Compute correlation between Amount Spent and: Reach, Frequency, Unique Clicks, Page Likes
#Which correlation is the strongest? What does that mean practically? (respond in a tweet or less)

#Limit to only desired columns for correlation
slimpony = pony[['Amount Spent (USD)','Reach','Frequency','Unique Clicks','Page Likes']] 
print(slimpony.corr())

#The strongest correlation is between Amount Spent and Unique Clicks (r = 0.88).
#This means that ads with higher spend tend to earn more unique clicks, more so than they gain reach, frequency, or page likes.

                    Amount Spent (USD)     Reach  Frequency  Unique Clicks  \
Amount Spent (USD)            1.000000  0.703124   0.130201       0.882993   
Reach                         0.703124  1.000000   0.334101       0.722249   
Frequency                     0.130201  0.334101   1.000000       0.135103   
Unique Clicks                 0.882993  0.722249   0.135103       1.000000   
Page Likes                    0.757612  0.304388   0.000182       0.584614   

                    Page Likes  
Amount Spent (USD)    0.757612  
Reach                 0.304388  
Frequency             0.000182  
Unique Clicks         0.584614  
Page Likes            1.000000  
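
Picking the largest off-diagonal entry by eye is error-prone, so it can help to extract it programmatically. A small sketch that rebuilds the matrix from the values printed above and finds the strongest pair (the variable names `pairs` and `strongest_pair` are introduced here for illustration):

```python
import numpy as np
import pandas as pd

cols = ['Amount Spent (USD)', 'Reach', 'Frequency', 'Unique Clicks', 'Page Likes']
# Values copied from the printed correlation matrix
values = [
    [1.000000, 0.703124, 0.130201, 0.882993, 0.757612],
    [0.703124, 1.000000, 0.334101, 0.722249, 0.304388],
    [0.130201, 0.334101, 1.000000, 0.135103, 0.000182],
    [0.882993, 0.722249, 0.135103, 1.000000, 0.584614],
    [0.757612, 0.304388, 0.000182, 0.584614, 1.000000],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Keep only the upper triangle; k=1 also drops the diagonal of 1.0s
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair of variables with the largest absolute correlation
strongest_pair = pairs.abs().idxmax()
strongest_value = pairs.loc[strongest_pair]
print(strongest_pair, strongest_value)
```

In the notebook itself, `corr` would simply be `slimpony.corr()`.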


In [33]:
#Finally, perform a simple multiple regression analysis where Unique Clicks is the dependent variable 
#and Reach and Frequency are the independent (predictor) variables.
X = pony[['Reach', 'Frequency']]
y = pony['Unique Clicks']
model = sm.OLS(y, X).fit()  # no sm.add_constant(X) here, so the model is fit without an intercept
predictions = model.predict(X)
model.summary()

#What variable most strongly predicts unique clicks? What does that mean practically? (respond in a tweet or less)
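#Frequency has the larger raw coefficient (3.61 vs. 0.0019 for Reach), but the two predictors are on very different scales,
#so raw coefficients are not directly comparable. By t-statistic (62.49 vs. 12.11), Reach is the stronger predictor:
#ads that reach more people tend to earn more unique clicks.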

Dep. Variable:     Unique Clicks      R-squared:           0.557
Model:             OLS                Adj. R-squared:      0.556
Method:            Least Squares      F-statistic:         2325.0
Date:              Tue, 16 Oct 2018   Prob (F-statistic):  0.0
Time:              19:02:04           Log-Likelihood:      -15973.0
No. Observations:  3705               AIC:                 31950.0
Df Residuals:      3703               BIC:                 31960.0
Df Model:          2
Covariance Type:   nonrobust

             coef   std err       t  P>|t|  [0.025  0.975]
Reach      0.0019  3.12e-05  62.490  0.000   0.002   0.002
Frequency  3.6139     0.298  12.109  0.000   3.029   4.199

Omnibus:        5107.616   Durbin-Watson:     0.803
Prob(Omnibus):  0.0        Jarque-Bera (JB):  5128392.747
Skew:           7.331      Prob(JB):          0.0
Kurtosis:       184.674    Cond. No.          9840.0
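
Raw regression coefficients depend on the units of each predictor, so a large coefficient on a small-scale variable such as Frequency does not by itself mean a stronger effect. One common way to compare predictors is to standardize all variables first. The sketch below illustrates this on synthetic data (the data-generating values are assumptions chosen to roughly resemble the fitted coefficients above, not the real dataset), and it uses NumPy's least-squares solver rather than statsmodels to keep the example dependency-light:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors on very different scales (hypothetical ranges)
reach = rng.uniform(1_000, 50_000, n)
frequency = rng.uniform(1.0, 3.0, n)
clicks = 0.002 * reach + 3.0 * frequency + rng.normal(0, 5, n)

# Raw fit with an explicit intercept column
X = np.column_stack([np.ones(n), reach, frequency])
raw, *_ = np.linalg.lstsq(X, clicks, rcond=None)

def standardize(a):
    """Return a z-scored copy: mean 0, standard deviation 1."""
    return (a - a.mean()) / a.std()

# With z-scored variables the intercept is zero, so no constant column is needed
Z = np.column_stack([standardize(reach), standardize(frequency)])
beta, *_ = np.linalg.lstsq(Z, standardize(clicks), rcond=None)

print('raw coefficients (intercept, reach, frequency):', raw)
print('standardized coefficients (reach, frequency):', beta)
```

In this synthetic setup the raw coefficient on frequency is far larger than the one on reach, yet the standardized coefficient on reach is the larger of the two, which is why scale matters when ranking predictors.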
