# Probability & Hypothesis Testing
by [Mavis Wang](https://github.com/mavisw)

In [1]:
from scipy.stats import binom, poisson, expon

1. Joon takes a 40-question exam but did not study, so each answer is chosen at random. Each question has two possible answers (True or False). Find the probability that Joon guesses more than 30 questions correctly (number of correct answers > 30). (Please use the Binomial distribution) (20 points)

In [2]:
# X ~ B(n, p), n = 40, p = 0.5
# P(X > 30) = 1 - P(X ≤ 30)

x1 = binom(40, 0.5)

# sf(30) = 1 - cdf(30), computed without the explicit subtraction
print(f'P(X > 30) = {x1.sf(30):.4f}')

P(X > 30) = 0.0003
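
As a cross-check that does not depend on scipy, the tail probability can be summed directly from the binomial PMF. This is only a minimal sketch: the helper name `binom_tail_fair` is made up for illustration, and it assumes p = 0.5, so every one of the 2^n answer patterns is equally likely.

```python
from math import comb

def binom_tail_fair(n: int, k: int) -> float:
    """P(X > k) for X ~ Binomial(n, 0.5): count favourable patterns, divide by 2**n."""
    favourable = sum(comb(n, j) for j in range(k + 1, n + 1))
    return favourable / 2 ** n

p_more_than_30 = binom_tail_fair(40, 30)
print(f'P(X > 30) = {p_more_than_30:.4f}')  # prints P(X > 30) = 0.0003
```

The exact value is about 3.4e-4, which agrees with the scipy result after rounding to four decimals.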


2. The average number of pets our students have in their lifetime is 1.47. Suppose that one student is randomly chosen, and define the random variable X as the number of pets that student has. (Please use the Poisson distribution) (20 points)

    a. Find the probability that the student has no pets.<br>
    b. Find the probability that the student has fewer pets than the average.<br>
    c. Find the probability that the student has more pets than the average.

In [3]:
# λ = 1.47
# X = # of pets ~ Poisson(λ)

# a. P(X = 0)
# b. P(X < 1.47) = P(X ≤ 1), because X takes only integer values
# c. P(X > 1.47) = 1 - P(X ≤ 1)

x2 = poisson(1.47)

print(f'a. P(X=0)={x2.pmf(0):.4f}')
print(f'b. P(X<1.47) = P(X≤1)= {x2.cdf(1):.4f}')
print(f'c. P(X > 1.47)= {x2.sf(1):.4f}')

a. P(X=0)=0.2299
b. P(X<1.47) = P(X≤1)= 0.5679
c. P(X > 1.47)= 0.4321
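
The three answers can also be reproduced from the Poisson PMF, P(X = k) = e^(-λ) λ^k / k!, using only the standard library. The sketch below is an illustration, not part of the assignment; `poisson_pmf` is a made-up helper name.

```python
from math import exp, factorial

lam = 1.47  # average number of pets, taken from the question

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

p_none = poisson_pmf(0, lam)                         # a. P(X = 0)
p_fewer = poisson_pmf(0, lam) + poisson_pmf(1, lam)  # b. P(X <= 1)
p_more = 1 - p_fewer                                 # c. P(X >= 2)

print(f'a. {p_none:.4f}  b. {p_fewer:.4f}  c. {p_more:.4f}')
```

Because X is a count, "fewer than 1.47" means X is 0 or 1, and "more than 1.47" means X is at least 2, so parts b and c sum to 1.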


3. Suppose that the useful life of a particular car battery, measured in months, follows an Exponential distribution with decay parameter 0.025. We are interested in the life of the battery, so define the random variable X as the battery's lifetime in months. (20 points)

    a. On average, how long would you expect one car battery to last?<br>
    b. Find the probability that a car battery lasts less than 36 months.

In [4]:
# decay parameter = 0.025 = λ
# μ = 1/λ
# X ~ Expon(x;0.025)

# a. E(X) = μ = 1/λ
# b. P(X<36)

mu = 1/0.025
# scipy's expon is parameterized by scale = 1/λ, not by the rate λ itself
x3 = expon(scale=mu)

print(f'a. Average car battery life = {mu} months.')
print(f'b. P(X<36) = {x3.cdf(36):.4f}.')

a. Average car battery life = 40.0 months.
b. P(X<36) = 0.5934.
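
Both answers follow from closed-form expressions for the exponential distribution: E(X) = 1/λ and P(X < x) = 1 - e^(-λx). A minimal standard-library sketch, with `expon_cdf` as a made-up helper name:

```python
from math import exp

rate = 0.025  # decay parameter λ, per month, from the question

def expon_cdf(x: float, rate: float) -> float:
    """P(X < x) for X ~ Exponential(rate), valid for x >= 0."""
    return 1 - exp(-rate * x)

mean_life = 1 / rate             # a. expected battery life in months
p_under_36 = expon_cdf(36, rate)  # b. P(X < 36)

print(f'a. {mean_life} months  b. {p_under_36:.4f}')
```

This also makes the scipy parameterization explicit: `expon(scale=mu)` with `mu = 1/λ` describes the same distribution as the rate form used here.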


#### Practical
The bank wants to check who has a personal loan. They collect data and assume that three factors (age, income, education) are related to the personal loan. Please help them using hypothesis testing. (40 points)

   a. Find “Bank.csv” and load it using pandas.<br>
   b. Split it into two groups (PersonalLoan=0 and 1). (10 points)<br>
   c. Run three hypothesis tests (t-tests) using “ttest_ind” for (1) Age, (2) Income, and (3) Education. Report your results at the 95% confidence level (significance level α = 0.05). (Note: the null hypothesis (H0) is that the population means of the two groups are the same (no difference)). (30 points)

In [5]:
import numpy as np
import pandas as pd

### a. load dataset
- Bank.csv

In [15]:
bank_df = pd.read_csv('Bank.csv', usecols = ['Age', 'Income', 'Education', 'PersonalLoan'])

In [17]:
bank_df.head()

| | Age | Income | Education | PersonalLoan |
|---|---|---|---|---|
| 0 | 25 | 49 | 1 | 0 |
| 1 | 45 | 34 | 1 | 0 |
| 2 | 39 | 11 | 1 | 0 |
| 3 | 35 | 100 | 2 | 0 |
| 4 | 35 | 45 | 2 | 0 |


### b. split into groups 

- noLoan: PersonalLoan = 0
- withLoan: PersonalLoan = 1

In [18]:
# PersonalLoan = 0
noLoan = bank_df[bank_df['PersonalLoan']== 0]

# PersonalLoan = 1
withLoan = bank_df[bank_df['PersonalLoan']== 1]

In [19]:
noLoan.head()

| | Age | Income | Education | PersonalLoan |
|---|---|---|---|---|
| 0 | 25 | 49 | 1 | 0 |
| 1 | 45 | 34 | 1 | 0 |
| 2 | 39 | 11 | 1 | 0 |
| 3 | 35 | 100 | 2 | 0 |
| 4 | 35 | 45 | 2 | 0 |


In [20]:
withLoan.head()

| | Age | Income | Education | PersonalLoan |
|---|---|---|---|---|
| 9 | 34 | 180 | 3 | 1 |
| 16 | 38 | 130 | 3 | 1 |
| 18 | 46 | 193 | 3 | 1 |
| 29 | 38 | 119 | 2 | 1 |
| 38 | 42 | 141 | 3 | 1 |
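
Before testing, it can help to look at the size and mean of each group. The sketch below shows one way to do that with `value_counts` and `groupby`. It assumes pandas is installed and uses a toy frame (`toy_df`, a made-up name) built from six of the rows shown in the `head()` outputs, not the full Bank.csv, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data: three no-loan rows and three with-loan rows copied from the head() outputs.
toy_df = pd.DataFrame({
    'Age': [25, 45, 39, 34, 38, 46],
    'Income': [49, 34, 11, 180, 130, 193],
    'Education': [1, 1, 1, 3, 3, 3],
    'PersonalLoan': [0, 0, 0, 1, 1, 1],
})

# Number of rows in each group (unequal sizes matter when interpreting a t-test).
group_sizes = toy_df['PersonalLoan'].value_counts()

# Per-group sample means of the three factors under test.
group_means = toy_df.groupby('PersonalLoan')[['Age', 'Income', 'Education']].mean()

print(group_sizes)
print(group_means)
```

On the real data the same two calls would be applied to `bank_df`.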


### c. Hypothesis Test

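Working assumption: the three factors (Age, Income, and Education) are related to loan status. For each factor, the null hypothesis H0 is that the two groups have the same population mean, and the alternative H1 is that the means differ.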

In [21]:
from scipy.stats import ttest_ind # independent two-sample t-test
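
With its default `equal_var=True`, `ttest_ind` computes the pooled-variance (Student) t statistic. The sketch below reproduces that statistic with the standard library to show what the number means. The function name `pooled_t` is made up, and the inputs are only the five Age values visible in each `head()` output, so the result is a toy value, not the test result for the full dataset.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Student's two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(sample_a) - mean(sample_b)) / std_err
    return t_stat, n_a + n_b - 2

ages_no_loan = [25, 45, 39, 35, 35]    # toy input: first five noLoan ages
ages_with_loan = [34, 38, 46, 38, 42]  # toy input: first five withLoan ages

t_toy, dof = pooled_t(ages_no_loan, ages_with_loan)
print(f't = {t_toy:.3f}, degrees of freedom = {dof}')
```

The sign of t follows the argument order: a negative value means the first group has the smaller sample mean. If the two groups may have unequal variances, passing `equal_var=False` to `ttest_ind` switches to Welch's t-test.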

#### Age

Let μ0 = average Age of noLoan group and μ1 = average Age of withLoan group.
α = 0.05

    H0: μ0 - μ1 = 0
    H1: μ0 - μ1 ≠ 0

In [22]:
tstat, p = ttest_ind(noLoan['Age'], withLoan['Age'])
print(f'T-score = {tstat:.3f}, p-value = {p:.3f}')

if p < 0.05:
    print('Reject H0: the group means differ. Age is related to loan status.')
else:
    print('Fail to reject H0: no significant difference between the group means. Age is not significantly related to loan status.')

T-score = 0.546, p-value = 0.585
Fail to reject H0: no significant difference between the group means. Age is not significantly related to loan status.


#### Income

Let μ0 = average Income of noLoan group and μ1 = average Income of withLoan group.
α = 0.05

    H0: μ0 - μ1 = 0
    H1: μ0 - μ1 ≠ 0

In [23]:
tstat, p = ttest_ind(noLoan['Income'], withLoan['Income'])
print(f'T-score = {tstat:.3f}, p-value = {p:.3f}')

if p < 0.05:
    print('Reject H0: the group means differ. Income is related to loan status.')
else:
    print('Fail to reject H0: no significant difference between the group means. Income is not significantly related to loan status.')

T-score = -41.085, p-value = 0.000
Reject H0: the group means differ. Income is related to loan status.


#### Education

Let μ0 = average Education of noLoan group and μ1 = average Education of withLoan group.
α = 0.05

    H0: μ0 - μ1 = 0
    H1: μ0 - μ1 ≠ 0

In [24]:
tstat, p = ttest_ind(noLoan['Education'], withLoan['Education'])
print(f'T-score = {tstat:.3f}, p-value = {p:.3f}')

if p < 0.05:
    print('Reject H0: the group means differ. Education is related to loan status.')
else:
    print('Fail to reject H0: no significant difference between the group means. Education is not significantly related to loan status.')

T-score = -9.757, p-value = 0.000
Reject H0: the group means differ. Education is related to loan status.
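
The three tests apply the same decision rule, so it can be factored into one helper. The sketch below is one possible way to do that; `decide` is a made-up name, and the p-values are the rounded figures printed above (a printed 0.000 means the p-value rounds to zero at three decimals, not that it is exactly zero).

```python
ALPHA = 0.05  # significance level used for all three tests

def decide(factor: str, p_value: float, alpha: float = ALPHA) -> str:
    """Return the test decision for one factor at the given significance level."""
    if p_value < alpha:
        return f'{factor}: reject H0 (p = {p_value:.3f} < {alpha})'
    return f'{factor}: fail to reject H0 (p = {p_value:.3f} >= {alpha})'

# Rounded p-values as reported in the outputs of this section.
reported = {'Age': 0.585, 'Income': 0.000, 'Education': 0.000}

summary = [decide(name, p) for name, p in reported.items()]
for line in summary:
    print(line)
```

In the notebook itself, the helper would be called with the `p` returned by each `ttest_ind` call rather than with hard-coded values.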
