* We want to check whether changing the enroll button for the loyalty program leads more customers to enroll in it.

### Import packages

In [4]:
import numpy as np
import pandas as pd

### Import Data

In [5]:
data = pd.read_csv('grocerywebsiteabtestdata.csv')

In [6]:
data.head()

Unnamed: 0,RecordID,IP Address,LoggedInFlag,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


* `RecordID` is a record identifier.
* `IP Address` is the IP address of the user visiting the website.
* `LoggedInFlag` is 1 if the user has an account and is logged in.
* `ServerID` is the one of three servers the user was connected to. Servers 2 and 3 serve the control group, and server 1 serves the treatment group.
* `VisitPageFlag` indicates whether the user clicked on the loyalty program link.

### Cleaning data

* Make sure each individual IP address appears only once by consolidating its records.
* Remove data for users who already have an account; we are only interested in new users.
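
The two cleaning steps above can be sketched on a small example. This is a minimal illustration, assuming hypothetical toy rows that only reuse the dataset's column names; the values are made up and the grouping key is simplified to `IP Address` and `ServerID`.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "IP Address": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "LoggedInFlag": [0, 0, 1, 0],
    "ServerID": [1, 1, 2, 3],
    "VisitPageFlag": [0, 1, 1, 0],
})

# Step 1: keep only visitors who are not logged in (new users).
new_visitors = toy[toy["LoggedInFlag"] == 0]

# Step 2: collapse repeated visits into one row per IP address and server,
# summing the click flags so that no click is lost.
consolidated = (
    new_visitors
    .groupby(["IP Address", "ServerID"], as_index=False)["VisitPageFlag"]
    .sum()
)
print(consolidated)
```

Summing rather than taking the first row keeps a click that happened on any of the repeated visits.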

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184588 entries, 0 to 184587
Data columns (total 5 columns):
RecordID         184588 non-null int64
IP Address       184588 non-null object
LoggedInFlag     184588 non-null int64
ServerID         184588 non-null int64
VisitPageFlag    184588 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.0+ MB


In [8]:
new_visitor_data = data[data["LoggedInFlag"] == 0]

In [15]:
new_visitor_data.groupby(by=["IP Address", "ServerID", "LoggedInFlag"])["VisitPageFlag"].sum().value_counts()

0    39535
1     9238
2      719
3       20
4        1
Name: VisitPageFlag, dtype: int64

In [16]:
consolidated_data = new_visitor_data.groupby(by=["IP Address", "ServerID", "LoggedInFlag"])["VisitPageFlag"].sum()

* Now we have one record per `IP Address` and `ServerID` combination
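
Because the grouping key also includes `ServerID`, an IP address that reached more than one server would still occupy more than one row. A minimal sketch of how this could be checked, using a hypothetical toy frame (`toy_consolidated` and its values are assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical consolidated frame in which one IP appears on two servers.
toy_consolidated = pd.DataFrame({
    "IP Address": ["1.1.1.1", "1.1.1.1", "2.2.2.2"],
    "ServerID": [1, 2, 3],
    "VisitPageFlag": [0, 1, 0],
})

# Count how many rows each IP address occupies after consolidation.
rows_per_ip = toy_consolidated.groupby("IP Address").size()

# IPs with more than one row were seen on more than one server.
repeated_ips = rows_per_ip[rows_per_ip > 1]
print(repeated_ips)
```

An empty result on the real data would confirm that each IP address appears exactly once.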

* Let's add a column called `visitFlag` with value 1 if `VisitPageFlag` is 1 or more, else 0

In [24]:
consolidated_data = consolidated_data.reset_index()

In [31]:
consolidated_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49513 entries, 0 to 49512
Data columns (total 4 columns):
IP Address       49513 non-null object
ServerID         49513 non-null int64
LoggedInFlag     49513 non-null int64
VisitPageFlag    49513 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [32]:
consolidated_data['visitFlag'] = np.where(consolidated_data["VisitPageFlag"] >= 1, 1, 0)

In [33]:
consolidated_data.visitFlag.value_counts()

0    39535
1     9978
Name: visitFlag, dtype: int64

In [34]:
consolidated_data["Group"] = np.where(consolidated_data["ServerID"] == 1, "Treatment", "Control")

In [35]:
consolidated_data.head()

Unnamed: 0,IP Address,ServerID,LoggedInFlag,VisitPageFlag,visitFlag,Group
0,0.0.108.2,1,0,0,0,Treatment
1,0.0.111.8,3,0,0,0,Control
2,0.0.163.1,2,0,0,0,Control
3,0.0.181.9,1,0,1,1,Treatment
4,0.0.20.3,1,0,0,0,Treatment
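
With `Group` and `visitFlag` in place, one natural summary is the share of visitors in each group who clicked the loyalty program link. A minimal sketch on hypothetical toy rows (the values below are made up and are not results from the dataset):

```python
import pandas as pd

# Hypothetical rows with the two derived columns (values are made up).
toy = pd.DataFrame({
    "Group": ["Treatment", "Treatment", "Control", "Control", "Control", "Control"],
    "visitFlag": [1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 flag is the fraction of visitors who clicked.
click_rate = toy.groupby("Group")["visitFlag"].mean()
print(click_rate)
```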
