# Basic Stats

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
df = pd.read_csv('EmployeeAttrition.csv')

In [3]:
df.head()

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Attrition
0,41,Travel_Rarely,1102,Sales,1,2,Life Sciences,2,Female,94,...,80,0,8,0,1,6,4,0,5,Yes
1,49,Travel_Frequently,279,Research & Development,8,1,Life Sciences,3,Male,61,...,80,1,10,3,3,10,7,1,7,No
2,37,Travel_Rarely,1373,Research & Development,2,2,Other,4,Male,92,...,80,0,7,3,3,0,0,0,0,Yes
3,33,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,4,Female,56,...,80,0,8,3,3,8,7,3,0,No
4,27,Travel_Rarely,591,Research & Development,2,1,Medical,1,Male,40,...,80,1,6,3,3,2,2,2,2,No


# Hypothesis Testing

### 1. One-Sample t-test
Example 1.1: One-tailed, one-sample t-test <br>
Question: Is the average employee age greater than 35? <br>
H0: The average age <= 35. <br>
Ha: The average age > 35. <br>

In [4]:
stats.ttest_1samp(df['Age'], 35, alternative="greater")

TtestResult(statistic=8.074105690924794, pvalue=7.011064450631255e-16, df=1469)

### Conclusion:
P-value < 0.05, so we reject the null hypothesis in favor of the alternative and conclude that the average employee age is greater than 35.
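
To show what `ttest_1samp` is doing, here is a minimal sketch that computes the same statistic by hand. The `ages` array is synthetic (randomly generated as a stand-in for `df['Age']`, not the attrition data), so its numbers will not match the output above; only the agreement between the manual and SciPy values matters.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['Age']; NOT the real attrition data.
rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=200)

mu0 = 35  # hypothesized mean under H0
n = ages.size

# t = (sample mean - mu0) / (sample std / sqrt(n)), using ddof=1
t_manual = (ages.mean() - mu0) / (ages.std(ddof=1) / np.sqrt(n))

# One-tailed p-value: right-tail area of a t distribution with n - 1 df
p_manual = stats.t.sf(t_manual, df=n - 1)

res = stats.ttest_1samp(ages, mu0, alternative="greater")
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```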

<b>You can also switch the direction of the alternative hypothesis</b> <br>
H0: The average age >= 35. <br>
Ha: The average age < 35.

In [5]:
stats.ttest_1samp(df['Age'], 35, alternative="less")

TtestResult(statistic=8.074105690924794, pvalue=0.9999999999999993, df=1469)

### Conclusion:
P-value > 0.05, so we fail to reject the null hypothesis: the data give no evidence that the average employee age is less than 35. This is consistent with the previous test, which concluded that the average age is greater than 35.
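
The statistic is identical in both runs; only the tail used for the p-value changes. As a small sketch, the two p-values can be recovered from the reported statistic and degrees of freedom with the t distribution, and they should sum to 1.

```python
from scipy import stats

t_stat = 8.074105690924794  # statistic reported by both runs above
dof = 1469                  # df reported above (n - 1)

p_greater = stats.t.sf(t_stat, dof)  # right-tail area, Ha: mean > 35
p_less = stats.t.cdf(t_stat, dof)    # left-tail area,  Ha: mean < 35

print(p_greater, p_less, p_greater + p_less)
```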

### 2. Two-Sample/Two Group t-test
Hypothesis tests for mean differences: Paired Data <br>
Hypothesis tests for two means: Independent Data <br><br>
Example 2.1: two-sample t-test (independent samples, equal variances assumed) <br><br>
H0: the average MonthlyIncome of male and female employees (age < 20) is the same. <br>
Ha: the average MonthlyIncome of male and female employees (age < 20) is different. <br>

In [6]:
male = df[(df['Gender']=='Male') & (df['Age']<20)]['MonthlyIncome']
female = df[(df['Gender']=='Female') & (df['Age']<20)]['MonthlyIncome']

In [7]:
male

127    1675
177    1102
296    1420
422    2564
457    1878
688    2121
727    1051
828    1904
853    2552
Name: MonthlyIncome, dtype: int64

In [8]:
female

149     1483
171     2325
301     1200
892     1859
909     2994
972     1611
1153    1569
1311    1514
Name: MonthlyIncome, dtype: int64

In [9]:
stats.ttest_ind(male, female)

Ttest_indResult(statistic=-0.04335872820959084, pvalue=0.9659874992613295)

### Conclusion:
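P-value > 0.05, so we cannot reject the null hypothesis H0. <br>
The data give no evidence that the average MonthlyIncome of male and female employees (age < 20) differs. With only 9 and 8 observations the test has little power, so this result does not prove that the averages are equal.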
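
`stats.ttest_ind` assumes equal variances by default (`equal_var=True`). As a hedged cross-check, the sketch below reruns the test on the values printed above and also runs Welch's t-test (`equal_var=False`), a common alternative when the equal-variance assumption is in doubt. The arrays are copied by hand from the `male` and `female` outputs rather than read from the CSV.

```python
import numpy as np
from scipy import stats

# MonthlyIncome values copied from the `male` and `female` outputs above
male = np.array([1675, 1102, 1420, 2564, 1878, 2121, 1051, 1904, 2552])
female = np.array([1483, 2325, 1200, 1859, 2994, 1611, 1569, 1514])

pooled = stats.ttest_ind(male, female)                  # default: equal_var=True
welch = stats.ttest_ind(male, female, equal_var=False)  # Welch's t-test

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

If both variants give a p-value well above 0.05, the conclusion does not hinge on the variance assumption.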

#### Example 2.2: one-tailed two-sample t-test
H0: the average MonthlyIncome of male employees is less than or equal to that of female employees (age < 20). <br>
Ha: the average MonthlyIncome of male employees is greater than that of female employees (age < 20). <br>

In [10]:
stats.ttest_ind(male, female, alternative="greater")

Ttest_indResult(statistic=-0.04335872820959084, pvalue=0.5170062503693352)

### Conclusion:
P-value > 0.05, so we cannot reject the null hypothesis H0. <br>
The data give no evidence that the average MonthlyIncome of male employees (age < 20) is greater than that of female employees.
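
The one-tailed p-value here is tied to the two-tailed p-value from Example 2.1. A small sketch, assuming the pooled test's degrees of freedom are n_male + n_female - 2 = 15 (from the 9 and 8 observations shown earlier):

```python
from scipy import stats

t_stat = -0.04335872820959084  # statistic reported above
dof = 9 + 8 - 2                # pooled t-test: n_male + n_female - 2

p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)  # Ha: means differ
p_greater = stats.t.sf(t_stat, dof)             # Ha: male mean > female mean

print(p_two_sided, p_greater)
```

Because the statistic is negative, the right-tail area is slightly above 0.5, which is why the one-tailed p-value is about 0.517.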

### 3. Chi-Square Test

#### In general, a Chi-Square test of independence tests the following hypotheses
H0 (Null Hypothesis) = The 2 variables to be compared are independent. <br>
Ha (Alternative Hypothesis) = The 2 variables are dependent. <br><br>
Example 3.1: <br>
H0: Department and Gender are independent and are not related to each other <br>
Ha: Department and Gender are dependent and related to each other <br>

In [17]:
# pd.crosstab(df['Department'],df['Gender'], margins=True)

In [15]:
chisqt = pd.crosstab(df['Department'],df['Gender'], margins=True)

In [16]:
chisqt

Department,Female,Male,All
Human Resources,20,43,63
Research & Development,379,582,961
Sales,189,257,446
All,588,882,1470


In [18]:
stats.chi2_contingency(chisqt)

Chi2ContingencyResult(statistic=2.9644916359463056, pvalue=0.8132901408044013, dof=6, expected_freq=array([[  25.2,   37.8,   63. ],
       [ 384.4,  576.6,  961. ],
       [ 178.4,  267.6,  446. ],
       [ 588. ,  882. , 1470. ]]))
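
The `expected_freq` values above follow from the row and column totals: under independence, each expected count is (row total × column total) / grand total. A minimal sketch using the observed counts from the crosstab, with the `All` margins left out:

```python
import numpy as np

# Observed counts from the crosstab above, without the `All` margins
observed = np.array([[20, 43],     # Human Resources
                     [379, 582],   # Research & Development
                     [189, 257]])  # Sales

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# E_ij = (row i total * column j total) / n
expected = np.outer(row_totals, col_totals) / n

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2_manual)
```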

In [19]:
chi2, p, dof, ex = stats.chi2_contingency(chisqt)

In [20]:
chi2

2.9644916359463056

In [21]:
p

0.8132901408044013

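<b>The reported p-value is 0.8132. As it is greater than 0.05, we fail to reject the null hypothesis: the data give no evidence that 'Department' and 'Gender' are related.</b>

Note that `chisqt` was built with `margins=True`, so the `All` row and column were treated as extra categories, which is why `dof=6` is reported. For a 3 × 2 table the degrees of freedom are (3 - 1)(2 - 1) = 2. The margin cells contribute nothing to the statistic, so it stays at 2.96, but with 2 degrees of freedom the p-value is about 0.23 rather than 0.81. The statistic is still below the 5.99 critical value at the 0.05 level, so the conclusion is unchanged.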
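
Since the table passed to `chi2_contingency` above includes the `All` margins, here is a sketch of the same test on the 3 × 2 table of counts alone. The counts are copied by hand from the crosstab output rather than read from the CSV.

```python
import pandas as pd
from scipy import stats

# Counts from the crosstab above, excluding the `All` row and column
table = pd.DataFrame(
    {"Female": [20, 379, 189], "Male": [43, 582, 257]},
    index=["Human Resources", "Research & Development", "Sales"],
)

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
```

With the notebook's `df`, calling `pd.crosstab(df['Department'], df['Gender'])` with the default `margins=False` should produce this table directly.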