<h1>Week 2 Challenge: A/B Hypothesis Testing: Ad campaign performance</h1>

<h2>Project Overview </h2>

* SmartAd is a mobile-first advertising agency. It designs intuitive, touch-enabled advertising. The company is based on the principle of voluntary participation, which is proven to increase brand engagement and memorability 10x more than static alternatives.
* SmartAd provides an additional service called Brand Impact Optimiser (BIO), a lightweight questionnaire, served with every campaign to determine the impact of the creative, the ad they design, on various upper funnel metrics, including memorability and brand sentiment. 

**Objective**

As a machine learning engineer at SmartAd, one of your tasks is to design a reliable hypothesis testing algorithm for the BIO service and to determine whether a recent advertising campaign resulted in a significant lift in brand awareness.

SmartAd ran this campaign from 3-10 July 2020. The users that were presented with the questionnaire above were chosen according to the following rule: 

***Control:*** users who have been shown a dummy ad

***Exposed:***  users who have been shown a creative (ad) that was designed by SmartAd for the client. 

In [1]:
#Loading useful packages
import numpy as np
import pandas as pd
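#use the full option name; the abbreviation 'max_column' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', None)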
pd.set_option('display.float_format',lambda x:'%5f'%x)

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import streamlit as st

import warnings
warnings.filterwarnings('ignore')

import scipy
from scipy import stats

In [7]:
#Loading data
url='https://media.githubusercontent.com/media/katenjoki/AdCampaign/main/OneDrive/Desktop/10Academy/AdCampaign/data/AdSmartABdata.csv'
data=pd.read_csv(url)
data.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 9


<h3>Dataset Dictionary</h3>

* auction_id: the unique id of the online user who has been presented the BIO. In standard terminologies this is called an impression id. The user may see the BIO questionnaire but choose not to respond. In that case both the yes and no columns are zero.
* experiment: which group the user belongs to - control or exposed.
* date: the date in YYYY-MM-DD format
* hour: the hour of the day in HH format.
* device_make: the name of the type of device the user has e.g. Samsung
* platform_os: the id of the OS the user has. 
* browser: the name of the browser the user uses to see the BIO questionnaire.
* yes: 1 if the user chooses the “Yes” radio button for the BIO questionnaire.
* no: 1 if the user chooses the “No” radio button for the BIO questionnaire.
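
As a rough illustration of how the `yes` and `no` columns encode a response, the sketch below builds a tiny made-up table with a few of the column names from the dictionary and flags which impressions were answered. The rows and the `responded` column are assumptions for illustration only; they are not taken from the dataset.

```python
import pandas as pd

# Made-up rows that follow the column names in the dictionary (illustrative values only)
sample = pd.DataFrame({
    "auction_id": ["a1", "a2", "a3", "a4"],
    "experiment": ["control", "exposed", "exposed", "control"],
    "yes": [0, 1, 0, 0],
    "no": [0, 0, 1, 0],
})

# An impression counts as a response when exactly one of yes/no is 1;
# both columns being 0 means the user saw the questionnaire but did not answer
sample["responded"] = (sample["yes"] + sample["no"]) == 1

response_rate = sample["responded"].mean()
print(sample[["auction_id", "experiment", "responded"]])
print("Response rate:", response_rate)
```

The same flag could be used on the real data to separate responders from non-responders before any testing.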


<h3>Data Preprocessing</h3>

In [3]:
#check for null values
#the data has no null values
data.isnull().any()

version https://git-lfs.github.com/spec/v1    False
dtype: bool

In [4]:
#check data types
data.dtypes

version https://git-lfs.github.com/spec/v1    object
dtype: object
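
The single column named `version https://git-lfs.github.com/spec/v1` in the two outputs above matches the first line of a Git LFS pointer file, which suggests that the load read a pointer rather than the CSV itself. A minimal sketch of a guard for this case follows; `is_lfs_pointer` and both sample strings are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: detect a Git LFS pointer before passing text to pandas
LFS_HEADER = "version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(text: str) -> bool:
    """Return True if the text starts with the Git LFS pointer header line."""
    return text.lstrip().startswith(LFS_HEADER)

# Made-up samples for illustration only
pointer_sample = LFS_HEADER + "\noid sha256:<hash>\nsize <bytes>\n"
csv_sample = "auction_id,experiment,date,hour,device_make,platform_os,browser,yes,no\n"

print(is_lfs_pointer(pointer_sample))
print(is_lfs_pointer(csv_sample))
```

Running such a check on the downloaded text before calling `pd.read_csv` would make this failure mode explicit instead of producing a one-column DataFrame.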

In [6]:
#convert date column to datetime object
data['date']=pd.to_datetime(data['date'])

KeyError: 'date'

In [None]:
print('Number of unique device types:',data['device_make'].nunique())
print('\nUnique browsers:\n',data['browser'].unique())
print('\nUnique platforms,os:\n',data['platform_os'].unique())

In [None]:
data['platform_os'].nunique()

<h2>Task 1.2</h2>
<h3>Data Exploration</h3>

In [None]:
#Histogram
def plot_hist(df,col1):
    plt.figure(figsize=(10,8))
    plt.hist(df[col1],bins=20,color='#B0C485',edgecolor='#64894B',linewidth=0.5)
    plt.title(f'Histogram of {col1}', size=16,fontweight='bold')
    plt.show()
#Scatter plot
def plot_scatter(df,col1,col2):
    plt.figure(figsize=(12, 7))
    sns.scatterplot(data = df, x=col1, y=col2, hue=col1, style=col1)
    plt.title(f'{col1} vs {col2}', size=16)
    plt.xticks(fontsize=14)
    plt.yticks( fontsize=14)
    plt.show()
    
def plot_count(df,col1,col2):
    plt.figure(figsize=(12,8))
    plt.subplot(1,2,1)
    sns.countplot(data=df, x=col1,palette='summer')
    plt.title(f'Distribution of {col1}', size=16, fontweight='bold')
    plt.xticks(rotation=70)
    
    plt.subplot(1,2,2)
    sns.countplot(data=df, x=col2,palette='summer_r')
    plt.title(f'Distribution of {col2}', size=16, fontweight='bold')
    plt.xticks(rotation=70)
    plt.show()

In [None]:
plot_count(data,'experiment','platform_os')

* The online users are split roughly evenly between the exposed and control groups
* Most users targeted by the campaign have a platform_os id of 6

In [None]:
plot_count(data,'yes','no')

**Most users who came across the questionnaire didn't fill it in**

In [None]:
plot_hist(data,'hour')

**Most of the users filled the form between 4 and 5pm**

**To focus on the users who interacted with the questionnaire, we drop the rows where both the yes and no column values are 0, i.e. the users who did not respond**

In [None]:
data = data.drop(data[(data['yes'] ==0) &(data['no']==0)].index)
print(data.shape)
print(data.sample(5))

* **The exposed group seems to have slightly more counts of yes than the control group.**
* **However, in both groups most people are still not aware of the brand**
* **We can't conclude from the difference in yes and no counts between the 2 groups alone that the SmartAd campaign increased brand awareness**

<h2>Perform hypothesis testing: apply the classical p-value based algorithm and the sequential A/B testing algorithm</h2>

**Null hypothesis:** there is no significant difference in brand awareness between the exposed and control groups in the current case

**Alternate hypothesis:** there is a significant difference in brand awareness between the exposed and control groups in the current case 

To reject the null hypothesis, the test must yield a p-value below the chosen significance level.

Given that the outcome of the questionnaire is binary, we use the chi-square test of independence to check for significance.
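
To make the chi-square test concrete, here is a minimal sketch that computes the statistic by hand for a 2x2 table and compares it with `scipy.stats.chi2_contingency`. The counts are hypothetical and chosen only for illustration; they are not the campaign's results.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = control, exposed; columns = yes, no
observed = np.array([[40, 60],
                     [50, 50]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic and its p-value with (2-1)*(2-1) = 1 degree of freedom
chi2_manual = ((observed - expected) ** 2 / expected).sum()
p_manual = stats.chi2.sf(chi2_manual, df=1)

# Same test via scipy, without Yates' continuity correction
chi2_scipy, p_scipy, dof, _ = stats.chi2_contingency(observed, correction=False)

print("manual:", chi2_manual, p_manual)
print("scipy: ", chi2_scipy, p_scipy, dof)
```

With `correction=False` the two computations should agree, which is a quick sanity check on how the contingency table is laid out.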

In [None]:
#Creating a new dataframe with a summary of the observed outcomes
#named aggregation avoids aggregating the grouping column itself, which fails in recent pandas versions
df=data.groupby('experiment').agg(yes=('yes','sum'),no=('no','sum'),total=('yes','count'))#total = number of rows per group
df=df.reset_index()
df.head()

In [None]:
#Chi-square test of independence of variables
control_yes=df['yes'][0]
control_no=df['no'][0]
exposed_yes=df['yes'][1]
exposed_no=df['no'][1]

#create np array
T = np.array([[control_yes,control_no],[exposed_yes,exposed_no]])

print(scipy.stats.chi2_contingency(T,correction=False)[1])

**The computed p-value is 51.8%. At a 5% significance level, we fail to reject the null hypothesis: the data show no statistically significant difference in brand awareness between the exposed and control groups, so there is no evidence that the campaign increased brand awareness for the exposed group**
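
A p-value alone does not show how large the difference might be. As a complement, the sketch below computes the difference in "yes" rates and an approximate 95% confidence interval (Wald interval). The counts are hypothetical and are not the notebook's results.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for illustration only
control_yes, control_n = 40, 100
exposed_yes, exposed_n = 50, 100

p_control = control_yes / control_n
p_exposed = exposed_yes / exposed_n
diff = p_exposed - p_control

# Unpooled standard error of the difference in proportions
se = np.sqrt(p_control * (1 - p_control) / control_n
             + p_exposed * (1 - p_exposed) / exposed_n)

# Two-sided 95% interval using the normal approximation
z = stats.norm.ppf(0.975)
ci_low, ci_high = diff - z * se, diff + z * se

print("difference in yes rate:", diff)
print("95% CI:", (ci_low, ci_high))
```

An interval that contains 0, as in this made-up example, is consistent with failing to reject the null hypothesis at the 5% level.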

**The number of data points may not be enough to make a reliable judgement: of the approximately 8000 users, only 1243 actually responded. The experiment could be extended to obtain a more representative sample of the population. However, the p-value is far above the 5% significance level, so the chances of it dropping below that level with a larger sample appear slim**
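
Whether extending the experiment would help can be framed as a sample-size estimate. The sketch below uses the standard normal-approximation formula for comparing two proportions; `required_n_per_group` is a hypothetical helper, and the baseline rates, significance level and power are assumptions rather than values taken from the data.

```python
import math
from scipy import stats

def required_n_per_group(p_control, p_exposed, alpha=0.05, power=0.8):
    """Approximate responders needed per group for a two-sided two-proportion test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = (p_control + p_exposed) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control)
                             + p_exposed * (1 - p_exposed))
    ) ** 2
    return math.ceil(numerator / (p_exposed - p_control) ** 2)

# Assumed rates: a small lift needs far more responders than a large one
n_small_lift = required_n_per_group(0.45, 0.47)
n_large_lift = required_n_per_group(0.45, 0.55)

print("small lift:", n_small_lift)
print("large lift:", n_large_lift)
```

Comparing the estimate for a plausible lift with the number of responders collected gives a rough sense of whether the experiment was large enough to detect that lift.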