Importing general-purpose libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Importing analysis libraries

In [2]:
from scipy import stats as st
import pingouin as pg

Loading the data

In [3]:
df_teste_ab = pd.read_csv('exampleDataABtest.csv')
df_teste_ab

Unnamed: 0,group,time,clickedTrue
0,A,2016-06-02 02:17:53,0
1,A,2016-06-02 03:03:54,0
2,A,2016-06-02 03:18:56,1
3,B,2016-06-02 03:23:43,0
4,A,2016-06-02 04:04:00,0
...,...,...,...
995,B,2016-06-10 00:21:15,0
996,B,2016-06-10 00:52:15,0
997,B,2016-06-10 00:55:36,0
998,A,2016-06-10 01:06:36,0


In [5]:
df_nps = pd.read_csv('nps_example.csv', sep = ';')
df_nps.head()

Unnamed: 0,id,response_status,how_long_listening,age,nps_score,gender
0,11706300,Complete,Less than 6 months,25-34,10.0,Female
1,11706302,Complete,1 year to less than 3 years,25-34,10.0,Female
2,11706307,Complete,6 months to less than a year,35-44,10.0,Female
3,11706312,Complete,Less than 6 months,35-44,10.0,Female
4,11706316,Complete,6 months to less than a year,25-34,10.0,Male


T-TEST

Group comparison

In [7]:
df_teste_ab.groupby('group') \
            .agg(media_cliques = pd.NamedAgg('clickedTrue', 'mean'),
                 dp_cliques = pd.NamedAgg('clickedTrue', 'std'),
                 n = pd.NamedAgg('clickedTrue', 'count'))

group,media_cliques,dp_cliques,n
A,0.04,0.196155,500
B,0.08,0.271565,500


Applying the t-test with SciPy

In [9]:
grA = df_teste_ab[df_teste_ab['group'] == 'A']['clickedTrue']
grB = df_teste_ab[df_teste_ab['group'] == 'B']['clickedTrue']
grA
grB

3      0
6      0
7      0
8      0
10     0
      ..
993    0
995    0
996    0
997    0
999    0
Name: clickedTrue, Length: 500, dtype: int64

Running the test

In [10]:
st.ttest_ind(a=grA, b=grB, alternative='two-sided')

TtestResult(statistic=np.float64(-2.669938469060931), pvalue=np.float64(0.007709783987515963), df=np.float64(998.0))

# If the p-value is below 0.05, we reject the null hypothesis that groups A and B have the same mean. Here p ≈ 0.0077 is below 0.05, so the difference between A and B is statistically significant.
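As a sanity check, the decision rule above can be sketched on data rebuilt from the summary table. This is only an illustration: it assumes that means of 0.04 and 0.08 with n = 500 correspond to 20 and 40 clicks, and the arrays `clicks_a` and `clicks_b` are reconstructed from those counts rather than read from the original file.

```python
import numpy as np
from scipy import stats as st

# Rebuild 0/1 click vectors from the summary table (assumption:
# mean 0.04 and 0.08 with n = 500 means 20 and 40 clicks).
clicks_a = np.array([1] * 20 + [0] * 480)
clicks_b = np.array([1] * 40 + [0] * 460)

# Same call as in the notebook: independent two-sample t-test, two-sided
result = st.ttest_ind(clicks_a, clicks_b, alternative='two-sided')

alpha = 0.05  # significance level used in this notebook
reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.4f}, p = {result.pvalue:.4f}, reject H0: {reject_null}")
```

Because the t-test depends only on the values in each group and not on their order, the rebuilt arrays should give the same statistic as the original columns.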

Applying the t-test with Pingouin

In [18]:
pg.ttest(x=grA,
          y=grB,
          alternative='two-sided',
          confidence = 0.95)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-2.669938,998,two-sided,0.00771,"[-0.07, -0.01]",0.168862,2.349,0.760344
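The `cohen-d` column can be checked by hand. The sketch below assumes it is the absolute mean difference divided by the pooled sample standard deviation, which is consistent with the value reported here. The helper `cohens_d` and the rebuilt arrays are illustrative, not part of Pingouin or of the original data file.

```python
import numpy as np

# 0/1 click vectors rebuilt from the summary table (assumption:
# 20 clicks out of 500 in group A, 40 out of 500 in group B).
clicks_a = np.array([1] * 20 + [0] * 480)
clicks_b = np.array([1] * 40 + [0] * 460)

def cohens_d(x, y):
    """Absolute standardized mean difference using the pooled sample variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return abs(np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

d = cohens_d(clicks_a, clicks_b)
print(f"Cohen's d = {d:.6f}")
```

A value this small suggests that, although the difference is statistically significant, the effect itself is modest.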


Next, we run the F-test (one-way ANOVA), which is designed for comparing more than two groups.

Checking who responded to the NPS survey

In [19]:
df_nps.groupby('response_status') \
      .size() \
      .to_frame('n') \
      .reset_index()

Unnamed: 0,response_status,n
0,Complete,2281
1,Incomplete,265
2,Terminated,33


Checking for null values

In [20]:
df_nps[df_nps['nps_score'].isnull()]

Unnamed: 0,id,response_status,how_long_listening,age,nps_score,gender
17,11706467,Incomplete,Less than 6 months,18-24,,
31,11706938,Incomplete,1 year to less than 3 years,25-34,,
32,11706979,Incomplete,6 months to less than a year,25-34,,
43,11707426,Incomplete,6 months to less than a year,25-34,,
48,11707719,Incomplete,3 years to less than 5 years,35-44,,
...,...,...,...,...,...,...
2546,13093216,Incomplete,6 months to less than a year,35-44,,
2556,13278063,Incomplete,3 years to less than 5 years,18-24,,
2570,13565327,Complete,1 year to less than 3 years,45-54,,Female
2572,13601847,Incomplete,3 years to less than 5 years,25-34,,


Filtering the data

In [22]:
df_nps_filtrado = df_nps[(df_nps['response_status']== 'Complete') & \
                        (df_nps['nps_score'].notna())]

df_nps_filtrado

Unnamed: 0,id,response_status,how_long_listening,age,nps_score,gender
0,11706300,Complete,Less than 6 months,25-34,10.0,Female
1,11706302,Complete,1 year to less than 3 years,25-34,10.0,Female
2,11706307,Complete,6 months to less than a year,35-44,10.0,Female
3,11706312,Complete,Less than 6 months,35-44,10.0,Female
4,11706316,Complete,6 months to less than a year,25-34,10.0,Male
...,...,...,...,...,...,...
2573,13610170,Complete,6 months to less than a year,25-34,10.0,Female
2574,13640772,Complete,3 years to less than 5 years,18-24,10.0,Female
2576,13732056,Complete,1 year to less than 3 years,18-24,10.0,Female
2577,13734055,Complete,Less than 6 months,25-34,10.0,Male


Inspecting the groups of interest

In [34]:
df_nps_filtrado.groupby('age') \
                .agg(media_nps = pd.NamedAgg ('nps_score', 'mean'),
                    dp_nps = pd.NamedAgg('nps_score', 'std'),
                    n = pd.NamedAgg('nps_score', 'size')) \
                .reset_index() 

Unnamed: 0,age,media_nps,dp_nps,n
0,18-24,9.464539,1.116275,282
1,25-34,9.694828,0.957639,580
2,35-44,9.707612,0.979501,578
3,45-54,9.719039,0.928254,541
4,55-64,9.733871,0.92302,248
5,65-74,9.423077,1.36156,26
6,75+,8.0,0.0,2


The 75+ age group has only 2 respondents, which is too few and could distort our analysis. We filter it out so it does not appear.

In [37]:
df_nps_filtrado_aj = df_nps_filtrado[df_nps_filtrado['age'] != '75+']
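The same idea can be expressed as a size threshold instead of a hard-coded label. This is a minimal sketch on toy data: the `toy` frame, its values, and the `min_n` cutoff are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Toy data (hypothetical values, not the NPS file)
toy = pd.DataFrame({
    'age': ['18-24', '18-24', '18-24', '25-34', '25-34', '25-34', '75+'],
    'nps_score': [9, 10, 8, 10, 9, 10, 8],
})

min_n = 3  # hypothetical minimum number of respondents per group

# Size of each row's own age group, aligned with the original index
group_sizes = toy.groupby('age')['nps_score'].transform('size')

# Keep only the rows whose group is large enough
toy_filtered = toy[group_sizes >= min_n]
print(toy_filtered['age'].unique())
```

A threshold like this keeps working if another tiny group shows up in a later data extract, whereas filtering by label has to be updated by hand.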


In [38]:
df_nps_filtrado_aj.groupby('age') \
                .agg(media_nps = pd.NamedAgg ('nps_score', 'mean'),
                    dp_nps = pd.NamedAgg('nps_score', 'std'),
                    n = pd.NamedAgg('nps_score', 'size')) \
                .reset_index() 

Unnamed: 0,age,media_nps,dp_nps,n
0,18-24,9.464539,1.116275,282
1,25-34,9.694828,0.957639,580
2,35-44,9.707612,0.979501,578
3,45-54,9.719039,0.928254,541
4,55-64,9.733871,0.92302,248
5,65-74,9.423077,1.36156,26


SciPy test: every group has to be separated first

In [50]:
dados_18_24 = df_nps_filtrado_aj[df_nps_filtrado_aj['age'] == '18-24']['nps_score']
dados_25_34 = df_nps_filtrado_aj[df_nps_filtrado_aj['age'] == '25-34']['nps_score']
dados_35_44 = df_nps_filtrado_aj[df_nps_filtrado_aj['age'] == '35-44']['nps_score']
dados_45_54 = df_nps_filtrado_aj[df_nps_filtrado_aj['age'] == '45-54']['nps_score']
dados_55_64 = df_nps_filtrado_aj[df_nps_filtrado_aj['age'] == '55-64']['nps_score']
dados_65_74 = df_nps_filtrado_aj[df_nps_filtrado_aj['age'] == '65-74']['nps_score']

In [51]:
st.f_oneway(dados_18_24,
            dados_25_34,
            dados_35_44,
            dados_45_54,
            dados_55_64,
            dados_65_74)

F_onewayResult(statistic=np.float64(3.522166098104082), pvalue=np.float64(0.0035606861304280546))

In [52]:
print([len(g) for g in [dados_18_24, dados_25_34, dados_35_44, dados_45_54, dados_55_64, dados_65_74]])

[282, 580, 578, 541, 248, 26]


Since the p-value (≈ 0.0036) is below 0.05, we reject the null hypothesis that all group means are equal: at least one mean differs from the others. In other words, THERE IS A DIFFERENCE, and that is a strong reason to study the data in depth. The hypothesis of a difference between age groups is supported by the data.
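The per-group selection needed for the SciPy test does not have to be written out one variable at a time. A minimal sketch, assuming a frame with the same `age` and `nps_score` columns; the `toy` data below is hypothetical and much smaller than the NPS file:

```python
import pandas as pd
from scipy import stats as st

# Toy data (hypothetical values) with the same column names as the NPS frame
toy = pd.DataFrame({
    'age': ['18-24'] * 4 + ['25-34'] * 4 + ['35-44'] * 4,
    'nps_score': [8, 9, 9, 10, 9, 10, 10, 10, 10, 10, 9, 10],
})

# One array of scores per age group, built in a single pass
groups = [scores.to_numpy() for _, scores in toy.groupby('age')['nps_score']]

# Unpack the list so each group becomes one positional argument
res = st.f_oneway(*groups)
print(f"F = {res.statistic:.4f}, p = {res.pvalue:.4f}")
```

This gives the same result as passing each group by hand, and it does not need editing when the set of age bands changes.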

Using Pingouin: much simpler

In [54]:
pg.anova(dv = 'nps_score',
         between = 'age',
         data = df_nps_filtrado_aj,
         detailed = True)

Unnamed: 0,Source,SS,DF,MS,F,p-unc,np2
0,age,16.888794,5,3.377759,3.522166,0.003561,0.00777
1,Within,2156.791916,2249,0.959,,,
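To see where the SS, DF, MS, F and np2 columns come from, the ANOVA table can be approximately rebuilt from the per-age summary (mean, standard deviation, n) computed earlier in the notebook. This is a sketch that uses the rounded values printed there, so small rounding differences are expected. The variable names are illustrative.

```python
import numpy as np
from scipy import stats as st

# Per-age summary statistics copied from the notebook output (rounded)
means = np.array([9.464539, 9.694828, 9.707612, 9.719039, 9.733871, 9.423077])
stds = np.array([1.116275, 0.957639, 0.979501, 0.928254, 0.92302, 1.36156])
ns = np.array([282, 580, 578, 541, 248, 26])

grand_mean = np.sum(ns * means) / np.sum(ns)

# Sums of squares: between groups and within groups
ss_between = np.sum(ns * (means - grand_mean) ** 2)
ss_within = np.sum((ns - 1) * stds ** 2)

# Degrees of freedom: k - 1 and N - k
df_between = len(ns) - 1
df_within = int(np.sum(ns)) - len(ns)

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = st.f.sf(f_stat, df_between, df_within)  # upper-tail probability

# For a one-way design, np2 is SS_between / (SS_between + SS_within)
eta_sq = ss_between / (ss_between + ss_within)

print(f"SS = {ss_between:.3f} / {ss_within:.3f}, DF = {df_between} / {df_within}")
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, np2 = {eta_sq:.5f}")
```

The very small np2 indicates that age explains less than 1% of the variance in `nps_score`, even though the F-test is significant.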

