# Student's t-test
1. One Sample t-test
2. Two Sample t-test
    1. Unpaired or Independent t-test
    2. Paired (dependent) t-test

# One-Sample Student's t-test
Tests whether the mean of a single sample differs from a known reference value.

**Assumptions**
- Observations in the sample are independent and identically distributed.
- Observations in the sample are normally distributed.

**Interpretation**
>**H0:** the mean of the sample is equal to the known value.\
>**H1:** the mean of the sample is not equal to the known value.
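
The statistic behind this test is `t = (mean - known_value) / (sd / sqrt(n))`, compared against a t distribution with `n - 1` degrees of freedom. The following is a minimal sketch that computes it by hand and checks the result against `ttest_1samp`. The `sample` values and `known_value` are made-up assumptions for illustration and are not taken from the Titanic data.

```python
import numpy as np
from scipy import stats

# Illustrative data only (assumption: not taken from the Titanic dataset)
sample = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])
known_value = 30.0  # hypothetical reference mean

n = sample.size
mean = sample.mean()
sd = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# one-sample t statistic
t_manual = (mean - known_value) / (sd / np.sqrt(n))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# the same test through SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, known_value)

print("manual: stat = %.3f, p = %.3f" % (t_manual, p_manual))
print("scipy:  stat = %.3f, p = %.3f" % (t_scipy, p_scipy))
```

Both lines should print the same numbers, which shows that `ttest_1samp` is a two-sided test by default.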

In [1]:
pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [2]:
# One-Sample_t-test

# import libraries
import seaborn as sns
import pandas as pd
from scipy.stats import ttest_1samp

# load dataset
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [7]:
# taking subsets

df1 = df[["sex","age","fare"]]
df1.head()

Unnamed: 0,sex,age,fare
0,male,22.0,7.25
1,female,38.0,71.2833
2,female,26.0,7.925
3,female,35.0,53.1
4,male,35.0,8.05


In [8]:
df1.describe()

Unnamed: 0,age,fare
count,714.0,891.0
mean,29.699118,32.204208
std,14.526497,49.693429
min,0.42,0.0
25%,20.125,7.9104
50%,28.0,14.4542
75%,38.0,31.0
max,80.0,512.3292


In [10]:
from scipy.stats import ttest_1samp

# Compare the mean age with a known value of 50 years

# ttest_1samp(df["age"],29)

stat,p = ttest_1samp(df["age"],50)

print('stat = %.3f,p = %.3f' % (stat,p))

# interpret the p-value at the 5% significance level
# (note: a nan p-value makes `p > 0.05` False, so the else branch runs)

if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably a different mean")

stat = nan,p = nan
Probably a different mean
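
The `nan` result above is what `ttest_1samp` returns by default when its input contains missing values, and `df1.describe()` reports only 714 non-missing ages out of 891 rows. Because any comparison with `nan` evaluates to false, the `else` branch runs and the printed conclusion is not meaningful. Below is a minimal sketch of two common remedies. It uses a small made-up `age` series (an assumption for illustration) so that it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# Illustrative ages with missing values (assumption: not the real Titanic column)
age = pd.Series([22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0])

# Default behaviour: a NaN in the input propagates to the result
stat_default, p_default = ttest_1samp(age.to_numpy(), 50)

# Remedy 1: drop missing values before testing
stat_drop, p_drop = ttest_1samp(age.dropna(), 50)

# Remedy 2: ask SciPy to ignore missing values
stat_omit, p_omit = ttest_1samp(age.to_numpy(), 50, nan_policy="omit")

print("default: stat = %.3f, p = %.3f" % (stat_default, p_default))
print("dropna:  stat = %.3f, p = %.3f" % (stat_drop, p_drop))
print("omit:    stat = %.3f, p = %.3f" % (stat_omit, p_omit))
```

On the notebook's data the equivalent call would be `ttest_1samp(df["age"].dropna(), 50)`.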


# Two-Sample t-test
## Independent Student's t-test

**Assumptions**

- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

**Interpretation**
>H0: the means of the two samples are equal.\
>H1: the means of the two samples are unequal.
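
Under the equal-variance assumption listed above, the two-sample statistic divides the difference in means by a standard error built from a pooled variance. The following is a minimal sketch of that calculation, checked against `ttest_ind`. The arrays `group_a` and `group_b` are made-up assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative groups (assumption: made-up numbers, not Titanic data)
group_a = np.array([30.0, 21.0, 29.0, 39.0, 45.0, 24.0])
group_b = np.array([27.0, 18.0, 37.0, 22.0, 31.0])

n_a, n_b = group_a.size, group_b.size
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

# t statistic and two-sided p-value with n_a + n_b - 2 degrees of freedom
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

# SciPy's independent t-test assumes equal variances by default (equal_var=True)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

print("manual: stat = %.3f, p = %.3f" % (t_manual, p_manual))
print("scipy:  stat = %.3f, p = %.3f" % (t_scipy, p_scipy))
```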

In [12]:
# Compare the ages of male and female passengers

## Splitting datasets

df_male = df1.loc[df1["sex"]== "male"]
df_female = df1.loc[df1["sex"]== "female"]

# library
from scipy.stats import ttest_ind
stat,p = ttest_ind(df_male["age"],df_female["age"])

print('stat = %.3f,p = %.3f' % (stat,p))

# interpret the p-value at the 5% significance level
# (note: a nan p-value makes `p > 0.05` False, so the else branch runs)

if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably a different mean")

stat = nan,p = nan
Probably a different mean
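
As in the one-sample case, the `nan` here comes from missing ages in both groups (453 of 577 male rows and 261 of 314 female rows have an age, according to the `describe()` outputs in this section). The following is a minimal sketch, on made-up arrays (an assumption for illustration), of running `ttest_ind` with `nan_policy="omit"`. It also shows `equal_var=False`, which runs Welch's t-test and is one option when the equal-variance assumption is in doubt.

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative ages with missing values (assumption: made-up numbers)
male_age = np.array([22.0, 35.0, np.nan, 54.0, 2.0, 20.0, 39.0])
female_age = np.array([38.0, 26.0, 35.0, np.nan, 27.0, 14.0])

# Student's t-test (equal variances assumed), ignoring missing values
stat, p = ttest_ind(male_age, female_age, nan_policy="omit")

# Welch's t-test (no equal-variance assumption), ignoring missing values
stat_w, p_w = ttest_ind(male_age, female_age, equal_var=False, nan_policy="omit")

print("Student: stat = %.3f, p = %.3f" % (stat, p))
print("Welch:   stat = %.3f, p = %.3f" % (stat_w, p_w))
```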


In [13]:
df_male.describe()

Unnamed: 0,age,fare
count,453.0,577.0
mean,30.726645,25.523893
std,14.678201,43.138263
min,0.42,0.0
25%,21.0,7.8958
50%,29.0,10.5
75%,39.0,26.55
max,80.0,512.3292


In [14]:
df_female.describe()

Unnamed: 0,age,fare
count,261.0,314.0
mean,27.915709,44.479818
std,14.110146,57.997698
min,0.75,6.75
25%,18.0,12.071875
50%,27.0,23.0
75%,37.0,55.0
max,63.0,512.3292


# Paired Student's t-test

Tests whether the means of two paired samples are significantly different.

**Assumptions**

- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across the samples are paired.

**Interpretation**
>H0: the means of the samples are equal.\
>H1: the means of the samples are unequal.
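
A paired test works on the per-pair differences, so it gives the same result as a one-sample t-test of those differences against zero. The following is a minimal sketch of that equivalence. The `before` and `after` arrays are hypothetical repeated measurements on the same subjects and are not from the Titanic data.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical before/after measurements on the same six subjects
before = np.array([72.0, 80.0, 65.0, 90.0, 77.0, 84.0])
after = np.array([70.0, 78.0, 66.0, 85.0, 75.0, 80.0])

# paired t-test on the two samples
stat_rel, p_rel = ttest_rel(before, after)

# one-sample t-test on the differences, against a mean difference of zero
stat_diff, p_diff = ttest_1samp(before - after, 0.0)

print("paired:      stat = %.3f, p = %.3f" % (stat_rel, p_rel))
print("differences: stat = %.3f, p = %.3f" % (stat_diff, p_diff))
```

This is also why the two inputs must have the same length and the same order: element `i` of one array is subtracted from element `i` of the other.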

In [15]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [16]:
#Select only male data

df_male = df.loc[df["sex"] == "male"]
df_male.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [19]:
# Split male passengers by class

df_male_first = df_male.loc[df_male["class"] == "First"]
df_male_second = df_male.loc[df_male["class"] == "Second"]
df_male_third = df_male.loc[df_male["class"] == "Third"]

In [20]:
df_male_first.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
23,1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True
27,0,1,male,19.0,3,2,263.0,S,First,man,True,C,Southampton,no,False
30,0,1,male,40.0,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
34,0,1,male,28.0,1,0,82.1708,C,First,man,True,,Cherbourg,no,False


In [21]:
df_male_third.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [22]:
# Import library

from scipy.stats import ttest_rel

# apply the test to compare first class and third class

ttest_rel(df_male_first["age"],df_male_third["age"])

ValueError: unequal length arrays

In [24]:
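# draw 100 rows per class so the arrays have equal length
# (random rows from different passengers are not true pairs)
df_1st = df_male_first.sample(100)
df_2nd = df_male_second.sample(100)
df_3rd = df_male_third.sample(100)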

In [27]:
print("Shape of the first class sample = ",df_1st.shape)
print("Shape of the second class sample = ",df_2nd.shape)
print("Shape of the third class sample = ",df_3rd.shape)

Shape of the first class sample =  (100, 15)
Shape of the second class sample =  (100, 15)
Shape of the third class sample =  (100, 15)


In [25]:
ttest_rel(df_1st["age"],df_3rd["age"])

Ttest_relResult(statistic=nan, pvalue=nan)

In [28]:
stat,p = ttest_rel(df_1st["age"],df_3rd["age"])

print('stat = %.3f,p = %.3f' % (stat,p))

# interpret the p-value at the 5% significance level
# (note: a nan p-value makes `p > 0.05` False, so the else branch runs)

if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably a different mean")

stat = nan,p = nan
Probably a different mean
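
The `nan` again traces back to missing ages in the sampled rows. Note also that equal-sized random subsamples of different passengers are not pairs in the statistical sense, so a paired test on them should be read with caution. For such groups an independent test may be the better fit. For data that really is paired, the following is a minimal sketch of dropping incomplete pairs before calling `ttest_rel`. The `pairs` frame and its column names `first` and `second` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# Hypothetical paired measurements; each row is one pair (assumption: made-up numbers)
pairs = pd.DataFrame({
    "first": [54.0, 28.0, np.nan, 40.0, 28.0, 45.0],
    "second": [22.0, 35.0, 20.0, np.nan, 2.0, 39.0],
})

# keep only rows where both members of the pair are present
complete = pairs.dropna()

stat, p = ttest_rel(complete["first"], complete["second"])

print("pairs used:", len(complete))
print("stat = %.3f, p = %.3f" % (stat, p))
```

Dropping whole rows keeps the remaining values aligned, which calling `dropna()` on each column separately would not guarantee.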
