Task

Analyze the test between groups 0 and 3 using the linearized likes metric. Is a difference visible? Did the p-value become smaller?
Analyze the test between groups 1 and 2 using the linearized likes metric. Is a difference visible? Did the p-value become smaller?

In [1]:
import pandahouse
import pandas as pd
import seaborn as sns
from scipy import stats
import numpy as np

In [2]:
connection = {
    'host': 'https://clickhouse.lab.karpov.courses',
    'password': 'dpo_python_2020',
    'user': 'student',
    'database': 'simulator_20220820'
}

### Groups 1 and 2

Classical t-test on per-user CTR for groups 1 and 2

In [3]:
q = """
SELECT exp_group, 
        user_id, 
        sum(action = 'like') as likes,
        sum(action = 'view') as views,
        likes/views as ctr
FROM {db}.feed_actions
WHERE toDate(time) between '2022-07-13' and '2022-07-19' and exp_group in (1,2)
GROUP BY exp_group, user_id       
"""
      
df = pandahouse.read_clickhouse(q, connection=connection)

In [4]:
df.head()

Unnamed: 0,exp_group,user_id,likes,views,ctr
0,1,109963,3,15,0.2
1,1,26117,32,141,0.22695
2,1,138232,18,73,0.246575
3,1,26295,33,122,0.270492
4,1,18392,7,32,0.21875


In [5]:
stats.ttest_ind(df[df.exp_group == 1].ctr,
                df[df.exp_group == 2].ctr,
                equal_var=False  # variances are not assumed equal (Welch's t-test)
                )

Ttest_indResult(statistic=0.709439204127032, pvalue=0.47806231308750413)

According to this test (p-value ≈ 0.478), there is insufficient evidence to reject the null hypothesis that the mean CTR is equal in the two groups.
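The linearized likes metric used in the following query can be written out explicitly. This is a reading of the SQL (`likes - views * ctr_control`), with group 1 as the control; the symbols $L_u$ and $\mathrm{CTR}_{\text{control}}$ are notation introduced here for illustration, where $u$ indexes users:

$$
\mathrm{CTR}_{\text{control}} = \frac{\sum_{u \in \text{control}} \text{likes}_u}{\sum_{u \in \text{control}} \text{views}_u},
\qquad
L_u = \text{likes}_u - \mathrm{CTR}_{\text{control}} \cdot \text{views}_u
$$

By construction, $L_u$ sums to zero over the control group, so a nonzero mean in the other group reflects a shift in likes relative to what the control CTR would predict for the same number of views.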

In [6]:
q = """
SELECT user_id, exp_group, likes, views, ctr_control, likes - views * ctr_control AS linearized_likes FROM
    (SELECT    
        user_id, 
        exp_group,
        sum(action = 'like') as likes,
        sum(action = 'view') as views
    FROM {db}.feed_actions 
    WHERE toDate(time) between '2022-07-13' and '2022-07-19' and exp_group in (1,2)
    GROUP BY user_id, exp_group) query_in_1
    
    CROSS JOIN 
    
    (SELECT  
    sum(action = 'like') / sum(action = 'view') as ctr_control
    FROM {db}.feed_actions
    WHERE toDate(time) between '2022-07-13' and '2022-07-19' and exp_group == 1
    GROUP BY exp_group) query_in_2
"""

df = pandahouse.read_clickhouse(q, connection=connection)

In [7]:
df.head()

Unnamed: 0,user_id,exp_group,likes,views,ctr_control,linearized_likes
0,109963,1,3,15,0.208027,-0.120402
1,26117,1,32,141,0.208027,2.668221
2,138232,1,18,73,0.208027,2.814043
3,26295,1,33,122,0.208027,7.62073
4,18392,1,7,32,0.208027,0.343142


In [8]:
stats.ttest_ind(df[df.exp_group == 1].linearized_likes,
                df[df.exp_group == 2].linearized_likes,
                equal_var=False  # variances are not assumed equal (Welch's t-test)
                )

Ttest_indResult(statistic=6.122579994775972, pvalue=9.439432187037712e-10)

The t-test now shows a significant difference between the groups: the p-value is about 9.4e-10, far below 0.05, whereas the classical t-test on CTR gave a p-value of about 0.48.
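The same linearization can also be done in pandas rather than in SQL. The sketch below is only an illustration on toy data: the function name `add_linearized_likes` and the four toy rows are assumptions made here, not part of the original analysis.

```python
import pandas as pd


def add_linearized_likes(df: pd.DataFrame, control_group: int) -> pd.DataFrame:
    """Return a copy of df with ctr_control and linearized_likes columns.

    Hypothetical helper: expects columns exp_group, likes, views.
    """
    control = df[df["exp_group"] == control_group]
    # Global CTR of the control group: total likes over total views
    ctr_control = control["likes"].sum() / control["views"].sum()

    out = df.copy()
    out["ctr_control"] = ctr_control
    # Same expression as in the SQL: likes - views * ctr_control
    out["linearized_likes"] = out["likes"] - out["views"] * ctr_control
    return out


# Toy data (made up for illustration, not taken from feed_actions)
toy = pd.DataFrame({
    "exp_group": [1, 1, 2, 2],
    "user_id": [101, 102, 201, 202],
    "likes": [3, 7, 10, 2],
    "views": [20, 30, 40, 10],
})

result = add_linearized_likes(toy, control_group=1)
print(result)
```

Computing `ctr_control` once in Python avoids the `CROSS JOIN`, at the cost of pulling the per-user rows into memory first.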

### Groups 0 and 3

In [9]:
q = """
SELECT user_id, exp_group, likes, views, ctr_control, likes - views * ctr_control AS linearized_likes FROM
    (SELECT    
        user_id, 
        exp_group,
        sum(action = 'like') as likes,
        sum(action = 'view') as views
    FROM {db}.feed_actions 
    WHERE toDate(time) between '2022-07-13' and '2022-07-19' and exp_group in (0,3)
    GROUP BY user_id, exp_group) query_in_1
    
    CROSS JOIN 
    
    (SELECT  
    sum(action = 'like') / sum(action = 'view') as ctr_control
    FROM {db}.feed_actions
    WHERE toDate(time) between '2022-07-13' and '2022-07-19' and exp_group == 0
    GROUP BY exp_group) query_in_2
"""

df = pandahouse.read_clickhouse(q, connection=connection)

In [10]:
df.head()

Unnamed: 0,user_id,exp_group,likes,views,ctr_control,linearized_likes
0,115383,3,9,30,0.208236,2.752916
1,123580,3,13,48,0.208236,3.004666
2,4944,0,8,41,0.208236,-0.537681
3,4504,0,5,15,0.208236,1.876458
4,121508,0,18,88,0.208236,-0.324779


In [11]:
stats.ttest_ind(df[df.exp_group == 0].linearized_likes,
                df[df.exp_group == 3].linearized_likes,
                equal_var=False  # variances are not assumed equal (Welch's t-test)
                )

Ttest_indResult(statistic=-15.214995460903827, pvalue=5.4914249479690016e-52)

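For groups 0 and 3, the t-test on linearized likes gives a p-value of about 5.5e-52, so the difference between the groups is highly significant. The classical t-test on CTR for the same groups (its cell is not shown here) gave a larger p-value, so the sensitivity of the test increased.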
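To put the two p-values side by side for any pair of groups, both tests can be run from the same per-user table. The following is a minimal sketch, not the original notebook's code: `compare_tests` is a hypothetical helper, and the data is synthetic (binomial likes with assumed like probabilities of 0.20 and 0.22), so the printed numbers say nothing about the real experiment.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_tests(df: pd.DataFrame, control_group: int, test_group: int) -> dict:
    """Welch's t-test on per-user CTR and on linearized likes (hypothetical helper)."""
    control = df[df["exp_group"] == control_group]
    test = df[df["exp_group"] == test_group]

    # Classical metric: per-user CTR
    p_ctr = stats.ttest_ind(control["likes"] / control["views"],
                            test["likes"] / test["views"],
                            equal_var=False).pvalue

    # Linearized likes relative to the control group's global CTR
    ctr_control = control["likes"].sum() / control["views"].sum()
    lin_control = control["likes"] - ctr_control * control["views"]
    lin_test = test["likes"] - ctr_control * test["views"]
    p_lin = stats.ttest_ind(lin_control, lin_test, equal_var=False).pvalue

    return {"p_ctr": p_ctr, "p_linearized": p_lin}


# Synthetic users: group labels 0 and 3 mirror the notebook, the values do not
rng = np.random.default_rng(0)
n = 2000
frames = []
for group, like_prob in [(0, 0.20), (3, 0.22)]:
    views = rng.integers(10, 200, size=n)   # at least 10 views per user
    likes = rng.binomial(views, like_prob)  # likes drawn per user
    frames.append(pd.DataFrame({"exp_group": group, "likes": likes, "views": views}))
synthetic = pd.concat(frames, ignore_index=True)

pvalues = compare_tests(synthetic, control_group=0, test_group=3)
print(pvalues)
```

With the real data, `df` would be the per-user frame returned by the query for groups 0 and 3, which already has the `exp_group`, `likes` and `views` columns.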