# Statistics for Data Science IV

- Two Sample t-test
- One-way ANOVA

## Hypothesis Testing

#### Two sample t-test

`scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', permutations=None, random_state=None, alternative='two-sided', trim=0)`

---

Calculate the T-test for the means of two independent samples of scores.

This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
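To make the equal-variance assumption concrete, the following is a minimal sketch that computes the pooled-variance t statistic by hand and compares it with `st.ttest_ind`. The arrays `a` and `b` are hypothetical toy samples, not values from any dataset used in this notebook.

```python
import numpy as np
import scipy.stats as st

# Hypothetical toy samples (illustration only)
a = np.array([3.1, 2.5, 4.0, 3.3, 2.8])
b = np.array([2.9, 3.6, 2.2, 3.0, 3.4, 2.7])

n_a, n_b = len(a), len(b)

# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# t statistic and two-sided p-value with n_a + n_b - 2 degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * st.t.sf(abs(t_manual), df=n_a + n_b - 2)

# The same test through scipy, with the default equal_var=True
result = st.ttest_ind(a, b, equal_var=True)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Both printed pairs should agree up to floating-point error, which shows what `equal_var=True` does under the hood.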

**Stating the hypotheses**

H0 : the mean 'tip' of smokers **is equal to** the mean 'tip' of non-smokers

H1 : the mean 'tip' of smokers **is not equal to** the mean 'tip' of non-smokers

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = sns.load_dataset('tips')
data.head()

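| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |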


In [3]:
smokers_tip = data[data['smoker']=='Yes']['tip']
smokers_tip.mean()

3.008709677419355

In [4]:
nonsmokers_tip = data[data['smoker']=='No']['tip']
nonsmokers_tip.mean()

2.9918543046357624

In [5]:
ttest_2 = st.ttest_ind(a = smokers_tip, b = nonsmokers_tip, equal_var=True)

In [6]:
ttest_2.pvalue
# Not enough evidence to reject H0

0.9265931522244976
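The conclusion in the comment above comes from comparing the p-value with a significance level. A minimal sketch of that decision rule follows; the level `ALPHA = 0.05` and the helper `decide` are assumptions for illustration, since the notebook does not state a level.

```python
ALPHA = 0.05  # assumed significance level; the notebook does not state one


def decide(pvalue, alpha=ALPHA):
    """Hypothetical helper: turn a p-value into a decision about H0."""
    if pvalue < alpha:
        return "reject H0"
    return "fail to reject H0"


# p-value reported by the two sample t-test on the tips data
print(decide(0.9265931522244976))  # prints "fail to reject H0"
```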

In [7]:
smokers_tip.std()

1.4014675738128255

In [8]:
nonsmokers_tip.std()

1.37719008805297
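The two standard deviations are close, which informally supports `equal_var=True`. One possible way to check this more formally is Levene's test for equal variances, sketched below on synthetic data. The arrays `group_a` and `group_b` and the 0.05 threshold are assumptions for illustration, not the tips data.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples with the same spread (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=3.0, scale=1.4, size=90)
group_b = rng.normal(loc=3.0, scale=1.4, size=150)

# H0 of Levene's test: both groups have equal variances
levene_result = st.levene(group_a, group_b)

# Assumed rule of thumb: keep equal_var=True unless H0 is rejected at 0.05
use_equal_var = levene_result.pvalue >= 0.05
print(levene_result.pvalue, use_equal_var)
```

The resulting flag could then be passed as `equal_var=use_equal_var` to `st.ttest_ind`.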

#### Example 2: E-commerce Sample

In [None]:
df = pd.read_csv('e-commerce_example_dataset.csv')
df.head()

In [None]:
df.shape

In [None]:
df['discount'].value_counts()

In [None]:
disc = df[df['discount']=='discount']
non_disc = df[df['discount']=='not-discount']

In [None]:
disc['gmv'].mean()

In [None]:
non_disc['gmv'].mean()

#### Step 1

We want to analyze whether discount has a significant effect on gmv.

**Stating the hypotheses**

H0 : the mean gmv with discount **is equal to** the mean gmv without discount

H1 : the mean gmv with discount **is not equal to** the mean gmv without discount

If H0 is rejected in favor of H1, discount has a significant effect on gmv. The two-sided test only shows that the means differ; the direction (gmv discount > gmv not-discount) has to be confirmed by comparing the two sample means.
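If the directional claim (gmv discount > gmv not-discount) is the actual question, a one-sided test is one option. The sketch below uses the `alternative='greater'` argument of `st.ttest_ind` (available in SciPy 1.6 and later) on synthetic data. The arrays `gmv_disc` and `gmv_non_disc` are made-up stand-ins, not the e-commerce dataset.

```python
import numpy as np
import scipy.stats as st

# Synthetic gmv values (illustration only, not the real dataset)
rng = np.random.default_rng(42)
gmv_disc = rng.normal(loc=1_800_000, scale=300_000, size=200)
gmv_non_disc = rng.normal(loc=1_500_000, scale=200_000, size=200)

# Two-sided: H1 says the means differ
two_sided = st.ttest_ind(gmv_disc, gmv_non_disc, equal_var=False)

# One-sided: H1 says mean(gmv_disc) > mean(gmv_non_disc)
one_sided = st.ttest_ind(gmv_disc, gmv_non_disc, equal_var=False, alternative='greater')

print(two_sided.pvalue, one_sided.pvalue)
```

When the t statistic is positive, the one-sided p-value is half of the two-sided one, so the one-sided test is more sensitive in the hypothesized direction.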

In [None]:
disc['gmv'].hist()

In [None]:
non_disc['gmv'].hist()
plt.xlim([750000,2750000])

In [None]:
disc['gmv'].std()

In [None]:
non_disc['gmv'].std()

In [None]:
ttest_ecom = st.ttest_ind(a=disc['gmv'], b=non_disc['gmv'], equal_var=False)
ttest_ecom.pvalue

In [None]:
print('Enough evidence to reject H0')
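With `equal_var=False`, `st.ttest_ind` performs Welch's t-test, which does not pool the variances. A minimal sketch of the same computation by hand, on hypothetical toy arrays `x` and `y`:

```python
import numpy as np
import scipy.stats as st

# Hypothetical toy samples with different spreads (illustration only)
x = np.array([12.0, 15.5, 11.2, 14.8, 13.1, 16.0])
y = np.array([9.5, 10.1, 8.7, 12.3, 9.9])

# Variance of each sample mean
var_mean_x = x.var(ddof=1) / len(x)
var_mean_y = y.var(ddof=1) / len(y)

# Welch t statistic: no pooled variance
t_welch = (x.mean() - y.mean()) / np.sqrt(var_mean_x + var_mean_y)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (var_mean_x + var_mean_y) ** 2 / (
    var_mean_x ** 2 / (len(x) - 1) + var_mean_y ** 2 / (len(y) - 1)
)
p_welch = 2 * st.t.sf(abs(t_welch), df=df_welch)

result = st.ttest_ind(x, y, equal_var=False)
print(t_welch, result.statistic)
print(p_welch, result.pvalue)
```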

## Analysis of Variance (ANOVA)

In [None]:
iris = pd.read_csv('Iris.csv')
iris.head()

In [None]:
iris['Species'].unique()

![iris](https://miro.medium.com/max/1000/1*Hh53mOF4Xy4eORjLilKOwA.png)

One-way ANOVA test

- H0: The mean sepal width is **the same** for all three categories
- H1: The mean sepal width is **not the same** for all three categories (at least one category differs)
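The F statistic behind these hypotheses compares the variation between group means with the variation within groups. A minimal sketch on hypothetical toy groups (not the Iris measurements), checked against `st.f_oneway`:

```python
import numpy as np
import scipy.stats as st

# Hypothetical toy groups (illustration only)
groups = [
    np.array([3.4, 3.1, 3.6, 3.3, 3.8]),
    np.array([2.7, 3.0, 2.5, 2.9, 2.6]),
    np.array([3.0, 2.8, 3.2, 3.1, 2.9]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

result = st.f_oneway(*groups)
print(f_manual, result.statistic)
print(p_manual, result.pvalue)
```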

In [None]:
setosa = iris[iris['Species']=='Iris-setosa']
versicolor = iris[iris['Species']=='Iris-versicolor']
virginica = iris[iris['Species']=='Iris-virginica']

In [None]:
anova_iris = st.f_oneway(setosa['SepalWidthCm'], versicolor['SepalWidthCm'], virginica['SepalWidthCm'])

In [None]:
anova_iris

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('SepalWidthCm ~ Species', data=iris).fit()

In [None]:
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

In [None]:
pair_ttest = model.t_test_pairwise('Species')
pair_ttest.result_frame

In [None]:
setosa['SepalWidthCm'].mean()

In [None]:
versicolor['SepalWidthCm'].mean()

In [None]:
virginica['SepalWidthCm'].mean()

T-test of Mean Difference:

- H0: (mean_1 - mean_2) is equal to 0
- H1: (mean_1 - mean_2) is not equal to 0
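The pairwise hypotheses above can be sketched as a loop over all pairs of groups. The version below uses a simple Bonferroni correction and a separate two sample t-test per pair, so it is an illustration of the idea and not necessarily what `t_test_pairwise` computes internally. The `samples` dictionary holds hypothetical toy values.

```python
from itertools import combinations

import numpy as np
import scipy.stats as st

# Hypothetical toy groups (illustration only)
samples = {
    "group_a": np.array([3.4, 3.1, 3.6, 3.3, 3.8]),
    "group_b": np.array([2.7, 3.0, 2.5, 2.9, 2.6]),
    "group_c": np.array([3.0, 2.8, 3.2, 3.1, 2.9]),
}

pairs = list(combinations(samples, 2))
n_comparisons = len(pairs)

adjusted = {}
for name_1, name_2 in pairs:
    # H0: mean_1 - mean_2 == 0 for this pair
    result = st.ttest_ind(samples[name_1], samples[name_2], equal_var=True)
    # Bonferroni: multiply each p-value by the number of comparisons, cap at 1
    adjusted[(name_1, name_2)] = min(result.pvalue * n_comparisons, 1.0)

for pair, p_adj in adjusted.items():
    print(pair, p_adj)
```

The correction matters because running several tests at once raises the chance of at least one false rejection.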