## General description:
You are working as an analyst for a grocery delivery app. The team has implemented a smart product recommendation system in the app, with the expectation that this system will help users work more efficiently with the app and find the necessary products more easily.

To test the effectiveness of the recommendation system, an A/B test was conducted. Users in group 1 were exposed to the new recommendation system, while users in group 0 were using the older version of the app, which did not include product recommendations.

Your task is to evaluate whether the new recommendation system has been beneficial to both the business and the app's users. To do this, you need to select the metrics that reflect the quality of the service and statistically compare these metrics between the two groups.

The result of your work should be an analytical report answering the question of whether the new recommendation system should be rolled out to all users.

In the data, you will find logs of users' orders:
- **ab_users_data** – history of users' orders, which includes information about the orders that users created and canceled.
- **ab_orders** – detailed information about the contents of each order, including a list of product IDs that were included in each order.
- **ab_products** – detailed information about the products, including their names and prices.

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as ss

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize':(12,6)}, style="whitegrid")

**Performing EDA**

In [4]:
delievery_users = pd.read_csv('ab_users_data.csv', parse_dates=['date', 'time'])

In [144]:
delievery_users.dtypes

user_id              int64
order_id             int64
action              object
time        datetime64[ns]
date        datetime64[ns]
group                int64
dtype: object

In [145]:
delievery_users.shape

(4337, 6)

In [146]:
delievery_users.isna().sum()

user_id     0
order_id    0
action      0
time        0
date        0
group       0
dtype: int64

In [147]:
delievery_users.nunique()

user_id     1017
order_id    4123
action         2
time        4312
date          14
group          2
dtype: int64

In [5]:
delievery_users.head()

Unnamed: 0,user_id,order_id,action,time,date,group
0,964,1255,create_order,2022-08-26 00:00:19,2022-08-26,0
1,965,1256,create_order,2022-08-26 00:02:21,2022-08-26,1
2,964,1257,create_order,2022-08-26 00:02:27,2022-08-26,0
3,966,1258,create_order,2022-08-26 00:02:56,2022-08-26,0
4,967,1259,create_order,2022-08-26 00:03:37,2022-08-26,1


Calculate the total number of order events (creations and cancellations) in each group, then split them by action.

In [208]:
delievery_users.groupby(['group'], as_index=False).agg({'order_id': 'count'})

Unnamed: 0,group,order_id
0,0,1691
1,1,2646


In [217]:
orders_by_action = delievery_users.groupby(['group', 'action' ], as_index=False).agg({'order_id': 'count'})
orders_by_action

Unnamed: 0,group,action,order_id
0,0,cancel_order,82
1,0,create_order,1609
2,1,cancel_order,132
3,1,create_order,2514
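
The counts above also allow a quick comparison of cancellation rates between the groups. The following is a minimal sketch that hard-codes the four counts from the table rather than reading the CSV; the `cancel_rate` column name is introduced here purely for illustration.

```python
import pandas as pd

# Event counts copied from the table above (group, action, number of events).
counts = pd.DataFrame({
    'group': [0, 0, 1, 1],
    'action': ['cancel_order', 'create_order', 'cancel_order', 'create_order'],
    'order_id': [82, 1609, 132, 2514],
})

# One row per group, one column per action.
wide = counts.pivot(index='group', columns='action', values='order_id')

# Share of created orders that were later canceled (illustrative column name).
wide['cancel_rate'] = wide['cancel_order'] / wide['create_order']
print(wide.round(4))
```

With these counts, both groups cancel roughly 5% of created orders (about 5.1% in group 0 and 5.3% in group 1), so cancellations do not obviously explain the difference in order volume.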


Look at how orders are distributed across days.

In [219]:
grouped_data = delievery_users.groupby(['group', 'action', 'date']).agg({'order_id': 'count'}).reset_index()
fig = px.bar(grouped_data, x='date', y='order_id', color='action', facet_col='group',
             title='Distribution of the number of orders by date',
             labels={'order_id': 'Number of orders', 'date': 'Date'},
             color_continuous_scale='Viridis')
fig.show()

The number of orders spikes on 2022-08-26. The data covers only 14 days, so a longer test would give a clearer picture. Even so, group 1 generated more order events than group 0 (2,646 vs 1,691), which may reflect the impact of the new recommendation system. The statistical tests below check whether this difference holds at the user level.

In [151]:
delievery_orders = pd.read_csv('ab_orders.csv', parse_dates=['creation_time'])

In [152]:
delievery_orders.head()

Unnamed: 0,order_id,creation_time,product_ids
0,1255,2022-08-26 00:00:19,"{75, 22, 53, 84}"
1,1256,2022-08-26 00:02:21,"{56, 76, 39}"
2,1257,2022-08-26 00:02:27,"{76, 34, 41, 38}"
3,1258,2022-08-26 00:02:56,"{74, 6}"
4,1259,2022-08-26 00:03:37,"{20, 45, 67, 26}"


In [157]:
delievery_orders.dtypes

order_id                  int64
creation_time    datetime64[ns]
product_ids              object
dtype: object

In [156]:
delievery_orders.shape

(4123, 3)

In [159]:
delievery_orders.isna().sum()

order_id         0
creation_time    0
product_ids      0
dtype: int64

In [160]:
delievery_orders.nunique()

order_id         4123
creation_time    4098
product_ids      3877
dtype: int64

In [161]:
delievery_products = pd.read_csv('ab_products.csv')

In [167]:
delievery_products.head()

Unnamed: 0,product_id,name,price
0,1,сахар,150.0
1,2,чай зеленый в пакетиках,50.0
2,3,вода негазированная,80.4
3,4,леденцы,45.5
4,5,кофе 3 в 1,15.0


In [164]:
delievery_products.shape

(87, 3)

In [165]:
delievery_products.isna().sum()

product_id    0
name          0
price         0
dtype: int64

Merge the first two tables on `order_id`.

In [169]:
orders = delievery_users.merge(delievery_orders, on = 'order_id')


In [171]:
orders.head()

Unnamed: 0,user_id,order_id,action,time,date,group,creation_time,product_ids
0,964,1255,create_order,2022-08-26 00:00:19.000000,2022-08-26,0,2022-08-26 00:00:19,"{75, 22, 53, 84}"
1,965,1256,create_order,2022-08-26 00:02:21.000000,2022-08-26,1,2022-08-26 00:02:21,"{56, 76, 39}"
2,964,1257,create_order,2022-08-26 00:02:27.000000,2022-08-26,0,2022-08-26 00:02:27,"{76, 34, 41, 38}"
3,966,1258,create_order,2022-08-26 00:02:56.000000,2022-08-26,0,2022-08-26 00:02:56,"{74, 6}"
4,966,1258,cancel_order,2022-08-26 00:08:25.486419,2022-08-26,0,2022-08-26 00:02:56,"{74, 6}"


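Remove one of the time columns and convert `product_ids` from a string into a list of product IDs.

In [183]:
# Drop the event timestamp; creation_time is kept as the order timestamp.
orders = orders.drop(columns='time')
# regex=True is required in recent pandas versions, where str.replace defaults to literal matching.
orders['product_ids'] = orders.product_ids.str.replace(r'[{}]', '', regex=True)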

In [None]:
orders['product_ids'] = orders['product_ids'].str.split(',')
orders = orders.explode('product_ids')
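
The parsing steps above can be checked on a small toy frame before running them on the full table. This is a sketch under the assumption that every `product_ids` value has the form `{a, b, c}` seen in the `ab_orders` preview; the `toy` frame is illustrative and is not part of the notebook's data.

```python
import pandas as pd

# Toy frame mimicking the product_ids format shown in ab_orders.
toy = pd.DataFrame({
    'order_id': [1255, 1258],
    'product_ids': ['{75, 22, 53, 84}', '{74, 6}'],
})

toy['product_id'] = (
    toy['product_ids']
    .str.strip('{}')   # drop the surrounding braces
    .str.split(', ')   # split on comma plus space -> list of strings
)
toy = (
    toy.drop(columns='product_ids')
    .explode('product_id')             # one row per (order, product) pair
    .astype({'product_id': 'int64'})   # strings -> integers
    .reset_index(drop=True)
)
print(toy)
```

Resetting the index after `explode` avoids the repeated index labels visible in the `orders.head()` output, which can cause surprises in later label-based operations.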

In [195]:
orders.head()

Unnamed: 0,user_id,order_id,action,date,group,creation_time,product_ids
0,964,1255,create_order,2022-08-26,0,2022-08-26 00:00:19,75
0,964,1255,create_order,2022-08-26,0,2022-08-26 00:00:19,22
0,964,1255,create_order,2022-08-26,0,2022-08-26 00:00:19,53
0,964,1255,create_order,2022-08-26,0,2022-08-26 00:00:19,84
1,965,1256,create_order,2022-08-26,1,2022-08-26 00:02:21,56


In [189]:
orders['product_ids'] = orders.product_ids.astype('int64')


In [196]:
orders = orders.rename(columns = {'product_ids':'product_id'})

In [197]:
orders.dtypes

user_id                   int64
order_id                  int64
action                   object
date             datetime64[ns]
group                     int64
creation_time    datetime64[ns]
product_id                int64
dtype: object

In [198]:
orders.shape

(14569, 7)

Join the last table on `product_id`.

In [220]:
delievery = orders.merge(delievery_products, how = 'outer', on = 'product_id')

In [221]:
delievery.head()

Unnamed: 0,user_id,order_id,action,date,group,creation_time,product_id,name,price
0,964,1255,create_order,2022-08-26,0,2022-08-26 00:00:19,75,сок ананасовый,120.0
1,987,1287,create_order,2022-08-26,0,2022-08-26 00:31:36,75,сок ананасовый,120.0
2,1073,1403,create_order,2022-08-26,1,2022-08-26 03:01:40,75,сок ананасовый,120.0
3,1089,1424,create_order,2022-08-26,1,2022-08-26 04:01:22,75,сок ананасовый,120.0
4,1139,1495,create_order,2022-08-26,1,2022-08-26 06:04:05,75,сок ананасовый,120.0
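
One thing to keep in mind with the join above: `how='outer'` keeps products that were never ordered, producing rows with missing `order_id` and `group`. Whether that happens in the real data depends on whether every product in `ab_products` appears in at least one order, which is not shown here. The sketch below uses made-up toy frames (`orders_toy`, `products_toy`, and the product names are hypothetical) to show the difference from a left join, and uses the `validate` and `indicator` options of `DataFrame.merge` as sanity checks.

```python
import pandas as pd

# Toy order lines and a toy product catalogue (values are made up).
orders_toy = pd.DataFrame({'order_id': [1255, 1255, 1256],
                           'product_id': [75, 22, 56]})
products_toy = pd.DataFrame({'product_id': [22, 56, 75, 99],   # 99 is never ordered
                             'name': ['a', 'b', 'c', 'd'],
                             'price': [10.0, 20.0, 120.0, 5.0]})

# Left join: keeps exactly the order lines; validate checks product_id is unique on the right.
left = orders_toy.merge(products_toy, how='left', on='product_id',
                        validate='many_to_one', indicator=True)

# Outer join: additionally keeps the unordered product as a row with missing order_id.
outer = orders_toy.merge(products_toy, how='outer', on='product_id')

print(len(left), len(outer))
print(left['_merge'].value_counts())
```

If the outer join adds no rows in the real data, both variants give the same table; the left join simply makes that guarantee explicit.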


We need to examine how our metrics have changed with the updated product recommendation algorithm. Specifically, we are interested in understanding the changes in orders.

We will check if the average number of orders per user has changed between the groups.

In [237]:
created_orders = delievery.query('action == "create_order"')
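
Note that the filter above keeps every created order, including orders that were later canceled (order 1258 in the merged preview is one example). If canceled orders should not count toward the metrics, one possible approach is sketched below on a toy event log; the `events` frame and the variable names are illustrative and are not taken from the notebook.

```python
import pandas as pd

# Toy event log: order 1258 is created and then canceled.
events = pd.DataFrame({
    'order_id': [1255, 1258, 1258, 1259],
    'action': ['create_order', 'create_order', 'cancel_order', 'create_order'],
})

# IDs of all orders that have a cancel event.
canceled_ids = events.loc[events['action'] == 'cancel_order', 'order_id']

# Keep creation events only for orders that were never canceled.
completed = events[(events['action'] == 'create_order')
                   & ~events['order_id'].isin(canceled_ids)]
print(completed)
```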

In [238]:
orders_number = created_orders.groupby(['group', 'user_id'], as_index = False)\
        .agg({'order_id':'nunique'})\
        .rename(columns = {'order_id':'orders_number'}) \
        .sort_values(by='orders_number', ascending=False)

In [236]:
orders_number

Unnamed: 0,group,user_id,orders_number
794,1,1533,13
101,0,1170,13
855,1,1641,13
824,1,1583,12
796,1,1537,12
...,...,...,...
71,0,1100,1
279,0,1510,1
581,1,1103,1
575,1,1089,1


In [239]:
orders_number.groupby(['group'], as_index = False).orders_number.mean()

Unnamed: 0,group,orders_number
0,0,3.124272
1,1,5.007968


The average number of orders per user is higher in group 1 (5.01 vs 3.12). We will test whether this difference is statistically significant, using a two-sample t-test to compare the means.

- $H_0$: The average number of orders per user is the same in both groups.
- $H_1$: The average number of orders per user differs between the groups.

In [243]:
from scipy.stats import ttest_ind
ttest_ind(orders_number.query('group == 1').orders_number, orders_number.query('group == 0').orders_number)

Ttest_indResult(statistic=14.510868123433648, pvalue=1.6974865514796017e-43)

With a p-value < 0.05, we reject the null hypothesis: the average number of orders per user differs between the groups, and the sample means show that it is higher in group 1.
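
Per-user order counts are discrete and right-skewed, and `ttest_ind` assumes equal variances by default. As a robustness check, the result could be confirmed with Welch's t-test and the Mann-Whitney U test. The sketch below runs both on synthetic Poisson-distributed counts, not on the experiment data; the sample sizes and rates are arbitrary assumptions chosen only to make the example runnable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-user order counts (NOT the experiment data); +1 so every user has an order.
group0 = rng.poisson(lam=2.0, size=500) + 1
group1 = rng.poisson(lam=4.0, size=500) + 1

# Welch's t-test: does not assume equal variances in the two groups.
welch = stats.ttest_ind(group1, group0, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption.
mwu = stats.mannwhitneyu(group1, group0, alternative='two-sided')

print(f"Welch t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U = {mwu.statistic:.0f}, p = {mwu.pvalue:.3g}")
```

On the real data, the same two calls would take `orders_number.query('group == 1').orders_number` and its group 0 counterpart as inputs.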

Calculate the average order value in each group.

In [246]:
check = created_orders.groupby(['group', 'user_id', 'order_id'], as_index = False)\
        .agg({'price':'sum'})
check.groupby(['group'], as_index = False).price.mean()

Unnamed: 0,group,price
0,0,381.285768
1,1,369.622912


The average order value is slightly lower in group 1 (369.62 vs 381.29). We will check whether this difference is statistically significant using a t-test.

- $H_0$: The average order value in the groups is not different.
- $H_1$: The average order value in the groups is different.

In [248]:
ttest_ind(check.query('group == 1').price, check.query('group == 0').price)

Ttest_indResult(statistic=-1.4815692121713073, pvalue=0.13853141121218765)

The p-value (0.14) is greater than 0.05, so we cannot reject the null hypothesis: the data show no statistically significant difference in average order value between the groups.
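
The test above treats every order as an independent observation, although several orders can belong to the same user. A possible alternative, sketched here on made-up numbers, is to first average the order value per user and then compare the per-user averages. The `check_toy` frame mirrors the columns of `check` but its values are invented, and `avg_order_value` is an illustrative column name.

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the per-order table `check` (values are made up).
check_toy = pd.DataFrame({
    'group':    [0, 0, 0, 0, 1, 1, 1, 1, 1],
    'user_id':  [1, 1, 2, 3, 4, 4, 4, 5, 6],
    'order_id': [10, 11, 12, 13, 14, 15, 16, 17, 18],
    'price':    [300.0, 500.0, 350.0, 420.0, 380.0, 360.0, 400.0, 310.0, 450.0],
})

# One observation per user: that user's mean order value.
per_user = (check_toy.groupby(['group', 'user_id'], as_index=False)
            .agg(avg_order_value=('price', 'mean')))

# Welch's t-test on the per-user averages.
result = stats.ttest_ind(
    per_user.query('group == 1').avg_order_value,
    per_user.query('group == 0').avg_order_value,
    equal_var=False,
)
print(per_user)
print(result)
```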

Calculate total revenue in each group.

In [251]:
created_orders.groupby('group', as_index = False).agg({'price':'sum'})


Unnamed: 0,group,price
0,0,613488.8
1,1,929232.0
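
Total revenue is only comparable across groups if the groups are of similar size. The arithmetic below derives the number of ordering users per group from figures already reported above (created orders divided by average orders per user) and then computes revenue per user. The user counts are inferred rather than read from the data, so treat them as an assumption.

```python
# Figures copied from the outputs above.
created_orders_count = {0: 1609, 1: 2514}          # create_order events per group
mean_orders_per_user = {0: 3.124272, 1: 5.007968}  # average orders per user
revenue = {0: 613488.8, 1: 929232.0}               # summed prices per group

users = {}
revenue_per_user = {}
for g in (0, 1):
    # Users with at least one created order, inferred as orders / mean orders per user.
    users[g] = round(created_orders_count[g] / mean_orders_per_user[g])
    revenue_per_user[g] = revenue[g] / users[g]
    print(f"group {g}: {users[g]} users, revenue per user = {revenue_per_user[g]:.2f}")
```

Under these figures the groups are of similar size (515 and 502 users, matching the 1,017 unique users in the log), and revenue per user is about 1,191 in group 0 versus about 1,851 in group 1, so the revenue gap is not an artifact of group size.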


### Conclusion:

1. The average number of orders per user is significantly higher in the test group (group 1), and total revenue is higher as well.
2. The average order value does not differ significantly between the groups.

Users with the new recommendation system placed more orders and generated more revenue at a comparable average order value, so it is advisable to roll out the update to all users.