<a href="https://www.kaggle.com/code/omersonmezsoy/marketing-a-b-testing?scriptVersionId=128689250" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="text-align:center"><img src="https://ik.imagekit.io/ccyubla1p9v/wp-content/uploads/2019/02/AB-Testing.png" /></div>


# **A/B Testing**

A marketing campaign raises two questions:

* Would the campaign be successful?
* If the campaign was successful, how much of that success could be attributed to the ads?

To answer the second question, we normally run an A/B test. The majority of people are exposed to ads (the experimental group), while a small portion (the control group) instead see a Public Service Announcement (PSA), or nothing, in the exact size and place the ad would normally occupy.

The goal with this dataset is to analyze the two groups, determine whether the ads were successful, estimate how much the company can make from the ads, and test whether the difference between the groups is statistically significant.

**Data dictionary:**

* Index: Row index
* User ID: User ID (unique)
* Test Group: "ad" if the person saw the advertisement, "psa" if they only saw the public service announcement
* Converted: True if the person bought the product, otherwise False
* Total Ads: Number of ads seen by the person
* Most Ads Day: Day on which the person saw the most ads
* Most Ads Hour: Hour of the day in which the person saw the most ads
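
The data dictionary above can be turned into a quick sanity check once the data is loaded. The following is a minimal sketch, not part of the original notebook: the helper name `validate_schema` and the toy rows are made up for illustration, the column names are those of the raw CSV, and the 0 to 23 range for the hour is an assumption.

```python
import pandas as pd

# Hypothetical helper (not in the original notebook): checks a frame
# against the data dictionary, using the raw CSV column names.
def validate_schema(frame: pd.DataFrame) -> list:
    expected = ["user id", "test group", "converted",
                "total ads", "most ads day", "most ads hour"]
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if not frame["user id"].is_unique:
        problems.append("user id is not unique")
    if not set(frame["test group"]).issubset({"ad", "psa"}):
        problems.append("unexpected test group label")
    if (frame["total ads"] < 0).any():
        problems.append("negative total ads")
    # Assumption: hours are recorded on a 0-23 scale.
    if not frame["most ads hour"].between(0, 23).all():
        problems.append("most ads hour outside 0-23")
    return problems

# Toy rows invented for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "user id": [1, 2, 3],
    "test group": ["ad", "psa", "ad"],
    "converted": [False, True, False],
    "total ads": [10, 3, 25],
    "most ads day": ["Monday", "Friday", "Tuesday"],
    "most ads hour": [20, 9, 14],
})
print(validate_schema(toy))  # an empty list means no problems were found
```

An empty list indicates that the frame is consistent with the dictionary; each string in a non-empty list names one violated expectation.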






In [1]:
import pandas as pd
import numpy as np
from scipy.stats import kstest, norm, mannwhitneyu

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df = pd.read_csv("/kaggle/input/marketing-ab-testing/marketing_AB.csv")
df.head()

,Unnamed: 0,user id,test group,converted,total ads,most ads day,most ads hour
0,0,1069124,ad,False,130,Monday,20
1,1,1119715,ad,False,93,Tuesday,22
2,2,1144181,ad,False,21,Tuesday,18
3,3,1435133,ad,False,355,Tuesday,10
4,4,1015700,ad,False,276,Friday,14


In [2]:
# The "Unnamed: 0" column is a leftover row index, so we drop it.

df = df.drop(columns=["Unnamed: 0"])
df.head()

,user id,test group,converted,total ads,most ads day,most ads hour
0,1069124,ad,False,130,Monday,20
1,1119715,ad,False,93,Tuesday,22
2,1144181,ad,False,21,Tuesday,18
3,1435133,ad,False,355,Tuesday,10
4,1015700,ad,False,276,Friday,14


In [3]:
# Rename the columns: strip surrounding whitespace and replace spaces with underscores.

df.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
df.head()


,user_id,test_group,converted,total_ads,most_ads_day,most_ads_hour
0,1069124,ad,False,130,Monday,20
1,1119715,ad,False,93,Tuesday,22
2,1144181,ad,False,21,Tuesday,18
3,1435133,ad,False,355,Tuesday,10
4,1015700,ad,False,276,Friday,14


In [4]:
# Data understanding

def check_df(dataframe):
    """Print shape, dtypes, missing values, duplicate rows, and summary statistics."""
    print(" SHAPE ".center(70, '-'))
    print('Rows: {}'.format(dataframe.shape[0]))
    print('Columns: {}'.format(dataframe.shape[1]))
    print(" TYPES ".center(70, '-'))
    print(dataframe.dtypes)
    print(" MISSING VALUES ".center(70, '-'))
    print(dataframe.isnull().sum())
    print(" DUPLICATED VALUES ".center(70, '-'))
    print(dataframe.duplicated().sum())
    print(" DESCRIBE ".center(70, '-'))
    print(dataframe.describe().T)

check_df(df)

------------------------------- SHAPE --------------------------------
Rows: 588101
Columns: 6
------------------------------- TYPES --------------------------------
user_id           int64
test_group       object
converted          bool
total_ads         int64
most_ads_day     object
most_ads_hour     int64
dtype: object
--------------------------- MISSING VALUES ---------------------------
user_id          0
test_group       0
converted        0
total_ads        0
most_ads_day     0
most_ads_hour    0
dtype: int64
------------------------- DUPLICATED VALUES --------------------------
0
------------------------------ DESCRIBE ------------------------------
                     count          mean          std          min           25%           50%           75%           max
user_id       588101.00000 1310692.21579 202225.98313 900000.00000 1143190.00000 1313725.00000 1484088.00000 1654483.00000
total_ads     588101.00000      24.82088     43.71518      1.00000       4.00000      13

In [5]:
# We convert the true/false values to 1 and 0.

df['converted'] = df['converted'].astype(int)
df.head()

,user_id,test_group,converted,total_ads,most_ads_day,most_ads_hour
0,1069124,ad,0,130,Monday,20
1,1119715,ad,0,93,Tuesday,22
2,1144181,ad,0,21,Tuesday,18
3,1435133,ad,0,355,Tuesday,10
4,1015700,ad,0,276,Friday,14


In [6]:
# Let's check the number of users who bought the product and who did not

df.groupby('converted').count()['test_group']

converted
0    573258
1     14843
Name: test_group, dtype: int64
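
As a quick worked check, the two counts above give the overall conversion rate directly. This small sketch only uses the numbers printed in the output; the variable names are chosen for illustration.

```python
# Counts copied from the output above.
not_converted = 573258
converted = 14843

total_users = not_converted + converted      # should match the 588101 rows
overall_rate = converted / total_users       # share of users who bought

print(f"Total users: {total_users}")
print(f"Overall conversion rate: {overall_rate:.2%}")
```

The total agrees with the row count reported earlier, and the overall rate comes out at roughly 2.5%.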

In [7]:
# Conversion rate per test group (the mean of a 0/1 column is the share of conversions)

conversion_group = df.groupby('test_group')['converted'].mean().to_frame()
conversion_group

test_group,converted
ad,0.02555
psa,0.01785


In [8]:
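# Normality check: one-sample Kolmogorov-Smirnov test against the standard normal distribution.
# Note that 'converted' is a binary (0/1) variable.

data = df['converted']

statistic, p_value = kstest(data, 'norm')

alpha = 0.05
if p_value < alpha:
    print("The data does not follow a normal distribution (H0 hypothesis rejected)")
else:
    print("No evidence against normality (H0 hypothesis not rejected)")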

The data does not follow a normal distribution (H0 hypothesis rejected)
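
This rejection is expected whatever the campaign did: `converted` only takes the values 0 and 1, and `kstest(data, 'norm')` compares the sample against the standard normal distribution. The sketch below illustrates this on simulated 0/1 data. The sample size, the seed, and the 2.5% conversion probability are assumptions chosen for illustration, not values computed from the dataset.

```python
import numpy as np
from scipy.stats import kstest

# Simulated binary outcomes (illustrative only, not the real data).
rng = np.random.default_rng(0)
simulated = rng.binomial(n=1, p=0.025, size=1000)

# One-sample KS test against the standard normal N(0, 1).
statistic, p_value = kstest(simulated, 'norm')

# The empirical CDF of 0/1 data is a step function with jumps at 0 and 1,
# far from the smooth normal CDF, so the KS statistic is large.
print(f"KS statistic: {statistic:.3f}")
print(f"p-value below 0.05: {p_value < 0.05}")
```

Because a binary variable can never look normal, the test mainly confirms that a method which does not rely on normality is needed.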


In [9]:
# Since the normality assumption is not satisfied, we use the non-parametric Mann-Whitney U test.
# H0: the 'ad' and 'psa' groups have the same distribution of 'converted'.

data1 = df.loc[df['test_group'] == 'ad', 'converted']
data2 = df.loc[df['test_group'] == 'psa', 'converted']

statistic, p_value = mannwhitneyu(data1, data2, alternative='two-sided')

alpha = 0.05
if p_value < alpha:
    print("There is a significant difference between the groups (H0 hypothesis rejected)")
else:
    print("There is no significant difference between the groups (H0 hypothesis not rejected)")

There is a significant difference between the groups (H0 hypothesis rejected)
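
For a binary outcome such as `converted`, a two-proportion z-test is a common alternative that also yields a confidence interval for the difference in conversion rates. The following is a minimal sketch of that swapped-in technique, not the notebook's own method: the function name `two_proportion_ztest` is hypothetical, and the counts passed to it are toy values invented for illustration.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for p_a - p_b plus a Wald confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_a - p_b

    # Pooled standard error under H0: p_a == p_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * norm.sf(abs(z))

    # Unpooled standard error for the confidence interval
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = norm.ppf(1 - alpha / 2) * se
    return z, p_value, (diff - margin, diff + margin)

# Toy counts invented for illustration; with the notebook's data they could be
# taken from df.groupby('test_group')['converted'].agg(['sum', 'count']).
z, p_value, ci = two_proportion_ztest(conv_a=120, n_a=4000, conv_b=20, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print(f"95% CI for the difference: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

The interval is useful alongside the p-value because it shows how large the effect plausibly is, not only whether it differs from zero.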


**According to the test, the difference between the two groups is statistically significant at the 5% level: the "ad" group converted at 2.555% versus 1.785% for the "psa" group. The advertisement therefore has a positive effect on purchases, and the campaign can be considered successful.**