# Python Task

> *We have data on page visits from tracking of one of the services over a certain period of time. During this time, the service was testing a new registration page. Can it be said that the new registration page is better than the old one?*
>
> *Determine which metrics can be used to assess the 'goodness' of the page and calculate them.*

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('example_data')

In [3]:
data

Unnamed: 0,user_id,timestamp,event
0,148870bfa84777898359aaa8e120a373,2021-01-01 00:00:01.000000000,landing
1,ac3948ea43cb39cdc4e739004d252d0b,2021-01-01 00:00:01.445020335,landing
2,48a0df50d7ed1fcaaddf742b828b85e5,2021-01-01 00:00:10.566157670,login
3,70fbdd335abb11a3d072b5de7b218048,2021-01-01 00:00:10.764937005,main
4,48a0df50d7ed1fcaaddf742b828b85e5,2021-01-01 00:00:10.764937005,login
...,...,...,...
1040166,a63250880822c619ecbbf9fa511d31cd,2021-04-07 13:20:16.775249492,login
1040167,7fa83873bead5c5a52d6805570aba31d,2021-04-07 13:20:18.688737570,registration
1040168,1d716213d6f611f80592391ac61b5a5c,2021-04-07 13:20:18.688737570,main
1040169,7fa83873bead5c5a52d6805570aba31d,2021-04-07 13:20:22.134689896,registration


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1040171 entries, 0 to 1040170
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   user_id    1040171 non-null  object
 1   timestamp  1040171 non-null  object
 2   event      1040171 non-null  object
dtypes: object(3)
memory usage: 23.8+ MB


In [5]:
data.describe()

Unnamed: 0,user_id,timestamp,event
count,1040171,1040171,1040171
unique,261130,938229,5
top,c9be7da3b5c975cc2795f15d45f0390a,2021-03-19 08:53:22.740489453,landing
freq,5130,12,389350


In [6]:
data.nunique()

user_id      261130
timestamp    938229
event             5
dtype: int64

In [7]:
data['event'].unique()

array(['landing', 'login', 'main', 'registration', 'registration_new'],
      dtype=object)

In [8]:
data.groupby('event').nunique()

Unnamed: 0_level_0,user_id,timestamp
event,Unnamed: 1_level_1,Unnamed: 2_level_1
landing,206953,355775
login,50485,244098
main,44635,327560
registration,24078,49812
registration_new,2540,5428


In [9]:
data['event'].value_counts()

landing             389350
main                336030
login               257227
registration         51943
registration_new      5621
Name: event, dtype: int64

In [10]:
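# Note: data_ts is another name for the same DataFrame (not a copy),
# so the datetime conversion below also applies to data
data_ts = data
data_ts['timestamp'] = pd.to_datetime(data_ts['timestamp'])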
data_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1040171 entries, 0 to 1040170
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   user_id    1040171 non-null  object        
 1   timestamp  1040171 non-null  datetime64[ns]
 2   event      1040171 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 23.8+ MB


In [11]:
data_ts.head(20)

Unnamed: 0,user_id,timestamp,event
0,148870bfa84777898359aaa8e120a373,2021-01-01 00:00:01.000000000,landing
1,ac3948ea43cb39cdc4e739004d252d0b,2021-01-01 00:00:01.445020335,landing
2,48a0df50d7ed1fcaaddf742b828b85e5,2021-01-01 00:00:10.566157670,login
3,70fbdd335abb11a3d072b5de7b218048,2021-01-01 00:00:10.764937005,main
4,48a0df50d7ed1fcaaddf742b828b85e5,2021-01-01 00:00:10.764937005,login
5,70fbdd335abb11a3d072b5de7b218048,2021-01-01 00:00:26.757563301,main
6,48a0df50d7ed1fcaaddf742b828b85e5,2021-01-01 00:00:30.111507967,main
7,7c4d355f03cfc54bd351b51fd6950bc8,2021-01-01 00:00:42.556088605,landing
8,0fd5fc803b51f27e8ab1dad44d12594e,2021-01-01 00:00:44.500571856,main
9,6fe9147ebeb41c766a6f649239cce40b,2021-01-01 00:01:00.565173574,main


In [12]:
sorted_data = data_ts.sort_values(by=['user_id', 'timestamp'])
sorted_data.head(50)


Unnamed: 0,user_id,timestamp,event
83898,00006e145e005308c5387dfbb3c9a490,2021-01-08 11:35:41.977195939,landing
536462,00008aab9f1597af45ad21aa141030aa,2021-02-17 00:36:00.319473015,login
537547,00008aab9f1597af45ad21aa141030aa,2021-02-17 04:32:35.786731688,registration
537597,00008aab9f1597af45ad21aa141030aa,2021-02-17 04:42:10.180987082,registration
537624,00008aab9f1597af45ad21aa141030aa,2021-02-17 04:45:25.626196034,main
537665,00008aab9f1597af45ad21aa141030aa,2021-02-17 04:50:38.128422496,login
537700,00008aab9f1597af45ad21aa141030aa,2021-02-17 04:57:24.844030486,main
537736,00008aab9f1597af45ad21aa141030aa,2021-02-17 05:03:17.211478741,main
537757,00008aab9f1597af45ad21aa141030aa,2021-02-17 05:06:09.468095542,login
537770,00008aab9f1597af45ad21aa141030aa,2021-02-17 05:08:58.010618286,login


## Research Design 

The data contain no ready-made success metric, so we use a simple funnel as a proxy:
**registration page &rarr; login**  

We assume that a user who completes registration will sooner or later log in.  

Therefore, we will do the following for both 'registration' and 'registration_new' events separately:  

1. Count how many users visited a registration page AND later logged in. Technically, we count the users who have both a 'registration' and a 'login' event where the registration timestamp is earlier than the login timestamp (i.e., **registration** came *before* **login**) 
<br>  

2. Count the number of users who have ever visited a registration page  
<br>  

3. Calculate the proportion (the ratio of users from step 1 to users from step 2)
<br>  

4. Formulate and test hypotheses
<br>  

5. Make conclusions
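
Steps 1 to 3 can be sketched as a single reusable function. This is only an illustration under stated assumptions: the function name `funnel_conversion` and the toy data are hypothetical and not part of the original analysis. It relies on the fact that "at least one login later than at least one registration visit" is the same as "the latest login is later than the earliest registration visit", which avoids a row-by-row merge.

```python
import pandas as pd

def funnel_conversion(events: pd.DataFrame, page_event: str,
                      target_event: str = "login") -> dict:
    """Count users who saw `page_event` and have a later `target_event`."""
    # Earliest visit to the registration page, per user
    first_page = (events.loc[events["event"] == page_event]
                  .groupby("user_id")["timestamp"].min())
    # Latest target event (login), per user
    last_target = (events.loc[events["event"] == target_event]
                   .groupby("user_id")["timestamp"].max())
    # Align on page visitors; users who never logged in get NaT
    last_target = last_target.reindex(first_page.index)
    # Comparisons with NaT evaluate to False, so those users are not counted
    converted = int((last_target > first_page).sum())
    visitors = int(first_page.size)
    ratio = converted / visitors if visitors else float("nan")
    return {"visitors": visitors, "converted": converted, "ratio": ratio}

# Hypothetical toy data: 'a' and 'd' log in after registering,
# 'b' logs in only before registering, 'c' never logs in.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "d", "d"],
    "timestamp": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:05",
        "2021-01-01 09:00", "2021-01-01 09:30",
        "2021-01-01 12:00",
        "2021-01-01 11:00", "2021-01-01 11:10",
    ]),
    "event": ["registration", "login", "login", "registration",
              "registration", "registration_new", "login"],
})

old_page = funnel_conversion(toy, "registration")
new_page = funnel_conversion(toy, "registration_new")
print(old_page)
print(new_page)
```

On the real data, the same call would be made with the full events table and each of the two page names.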

In [38]:
# Count conversion for 'registration'
# Filter for 'registration' and 'login' events
registration_data = data[data['event'] == 'registration']
login_data = data[data['event'] == 'login']

# Merge the two DataFrames on 'user_id'
merged_data = registration_data.merge(login_data, on='user_id', suffixes=('_registration', '_login'))

# Group by 'user_id' and check if there's at least one case where 'login' timestamp is greater than 'registration' timestamp
user_has_condition = merged_data.groupby('user_id').apply(lambda x: (x['timestamp_login'] > x['timestamp_registration']).any())

# Count the number of users meeting the condition
count_reg_log = user_has_condition.sum()
distinct_registration_users = data[data['event'] == 'registration']['user_id'].nunique()
# count_reg = (data['event'] == 'registration').sum()
ratio_reg = count_reg_log / distinct_registration_users

# Print the count
print(f"Number of users with at least one case of 'login' having timestamp greater than 'registration': {count_reg_log}\nNumber of users with 'registration' values: {distinct_registration_users} \nRatio: {ratio_reg}")


Number of users with at least one case of 'login' having timestamp greater than 'registration': 8768
Number of users with 'registration' values: 24078 
Ratio: 0.3641498463327519


In [39]:
# Count conversion for 'registration_new'
# Filter for 'registration_new' and 'login' events
registration_data_new = data[data['event'] == 'registration_new']
login_data_new = data[data['event'] == 'login']

# Merge the two DataFrames on 'user_id'
merged_data_new = registration_data_new.merge(login_data_new, on='user_id', suffixes=('_registration_new', '_login'))

# Group by 'user_id' and check if there's at least one case where 'login' timestamp is greater than 'registration_new' timestamp
user_has_condition_new = merged_data_new.groupby('user_id').apply(lambda x: (x['timestamp_login'] > x['timestamp_registration_new']).any())

# Count the number of users meeting the condition
count_reg_log_new = user_has_condition_new.sum()
distinct_registration_users_new = data[data['event'] == 'registration_new']['user_id'].nunique()
# count_reg_new = (data['event'] == 'registration_new').sum()
ratio_reg_new = count_reg_log_new / distinct_registration_users_new

# Print the count
print(f"Number of users with at least one case of 'login' timestamp greater than 'registration_new': {count_reg_log_new}\nNumber of users with 'registration_new' values: {distinct_registration_users_new} \nRatio: {ratio_reg_new}")


Number of users with at least one case of 'login' timestamp greater than 'registration_new': 1007
Number of users with 'registration_new' values: 2540 
Ratio: 0.3964566929133858


In [40]:
# Check if there are any user_id's with both 'registration' and 'registration_new' events 
# Filter for 'registration' and 'registration_new' events
filtered_data = data[data['event'].isin(['registration', 'registration_new'])]

# Group by 'user_id' and count the distinct values in the 'event' column
distinct_event_count = filtered_data.groupby('user_id')['event'].nunique()

# Count the number of users where both 'registration' and 'registration_new' events occurred
count = len(distinct_event_count[distinct_event_count == 2])

print(f"Number of distinct 'user_id's with both 'registration' and 'registration_new' events: {count}")


Number of distinct 'user_id's with both 'registration' and 'registration_new' events: 0


### Now, we will formulate hypotheses  
**H<sub>0</sub> (null hypothesis):** there is no difference in conversion between pages 'registration' and 'registration_new'  
**H<sub>A</sub> (alternative hypothesis):** there is a difference in conversion between pages 'registration' and 'registration_new'

For testing, we will use a *Chi-squared test* of independence between the registration page version and conversion, and a *two-sample proportion test*.

In [15]:
from scipy.stats import chi2_contingency

# Create the contingency table
# Rows: old page, new page; columns: [users who visited the page, users who later logged in]
contingency_table = [[24078, 8768], [2540, 1007]]

# Perform the Chi-squared test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
alpha = 0.05

# Make a decision
if p_value < alpha:
    print("There is a statistically significant difference between the pages.")
else:
    print("There is no statistically significant difference between the pages.")

# Output the test statistics and p-value
print(f"Chi-squared: {chi2}")
print(f"P-value: {p_value}")


There is a statistically significant difference between the pages.
Chi-squared: 4.600902133327493
P-value: 0.031955136977651455
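
A test of independence normally expects each cell to hold a mutually exclusive count, so for a conversion comparison the columns would be [converted, not converted] rather than [visitors, converted]. The sketch below builds such a table from the counts reported above; the variable names are illustrative and not part of the original notebook.

```python
from scipy.stats import chi2_contingency

# Counts reported in the cells above
visitors_old, converted_old = 24078, 8768
visitors_new, converted_new = 2540, 1007

# Rows: page version; columns: [converted, not converted]
table = [
    [converted_old, visitors_old - converted_old],
    [converted_new, visitors_new - converted_new],
]

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2_stat, p_val, dof, expected = chi2_contingency(table)
print(f"Chi-squared: {chi2_stat:.3f}, p-value: {p_val:.5f}, dof: {dof}")
```

With `correction=False`, the chi-squared statistic of a 2x2 table equals the square of the pooled two-sample z statistic, so the two tests can be cross-checked against each other.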


In [16]:
from statsmodels.stats.proportion import proportions_ztest

# Define the data
count = np.array([8768, 1007])  # Number of logins for control_page and test_page
nobs = np.array([24078, 2540])  # Number of registrations for control_page and test_page

# Perform the two-sample proportion test
z_stat, p_value = proportions_ztest(count, nobs)

# Set the significance level (alpha)
alpha = 0.05

# Make a decision
if p_value < alpha:
    print("There is a statistically significant difference between the pages.")
else:
    print("There is no statistically significant difference between the pages.")
print(f"z-value: {z_stat}")
print(f"P-value: {p_value}")


There is a statistically significant difference between the pages.
z-value: -3.212485940700336
P-value: 0.0013159158900053333
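
The z statistic reported above can be reproduced by hand, which makes it clearer what the test measures. The following is a minimal sketch using only the standard library, assuming the pooled-variance form of the two-sample proportion test; the variable names are illustrative.

```python
import math

# Counts reported in the cells above
count_old, n_old = 8768, 24078   # old page: converted users, visitors
count_new, n_new = 1007, 2540    # new page: converted users, visitors

p_old = count_old / n_old
p_new = count_new / n_new

# Pooled conversion under H0 (no difference between the pages)
p_pool = (count_old + count_new) / (n_old + n_new)

# Standard error of the difference under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))

# z statistic, old minus new (same order as the arrays passed to the test above)
z = (p_old - p_new) / se

# Two-sided p-value from the standard normal distribution
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.4f}, two-sided p = {p_two_sided:.6f}")
```

The negative sign only reflects the order of subtraction: the old page's conversion is lower than the new page's.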


### Conclusion  

We calculated that the conversions of the 'registration' and 'registration_new' pages are 0.36 and 0.40, respectively, so the new version of the registration page converts better.  
<br>
Hypothesis testing with a *Chi-squared test* and a *two-sample proportion test* showed that the difference between the conversions is **statistically significant** at the 0.05 level.   
<br>

Thus, returning to the question posed at the beginning, we can say that, by this conversion metric, **the new registration page is better than the old one**.
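
Statistical significance alone does not say how large the improvement is. As a rough complement, the sketch below computes an approximate 95% confidence interval for the difference in conversion, assuming a simple normal-approximation (Wald) interval; this interval is an addition for illustration and was not part of the original analysis.

```python
import math

# Counts reported in the cells above
count_old, n_old = 8768, 24078   # old page: converted users, visitors
count_new, n_new = 1007, 2540    # new page: converted users, visitors

p_old = count_old / n_old
p_new = count_new / n_new
diff = p_new - p_old             # absolute uplift of the new page

# Unpooled standard error of the difference
se = math.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)

z_crit = 1.959964                # two-sided 95% quantile of the standard normal
lower, upper = diff - z_crit * se, diff + z_crit * se

print(f"Difference: {diff:.4f}, 95% CI: ({lower:.4f}, {upper:.4f})")
```

An interval that lies entirely above zero is consistent with the conclusion that the new page converts better.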