### Problem Statement 1:
Is gender independent of education level? A random sample of 395 people was
surveyed, and each person was asked to report the highest education level they
had obtained. The data from the survey are summarized in the following table:

| | High School | Bachelors | Masters | Ph.d. | Total |
|---|---|---|---|---|---|
| Female | 60 | 54 | 46 | 41 | 201 |
| Male | 40 | 44 | 53 | 57 | 194 |
| Total | 100 | 98 | 99 | 98 | 395 |


Question: Are gender and education level dependent at the 5% level of significance? In
other words, given the data collected above, is there a relationship between the
gender of an individual and the level of education that they have obtained?

ANS:
- H0: gender and education level are independent; H1: gender and education level are dependent (associated)
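
For reference, the quantities computed in the cells below can be written out as follows. The notation is introduced here for illustration only: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1)
```

For the 2 x 4 table above this gives $\mathrm{df} = (2-1)(4-1) = 3$.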

In [1]:
import pandas as pd
from scipy.stats import chi2
# observed counts; row 0: female, row 1: male
obs = pd.DataFrame({"High School":[60,40],"Bachelors":[54,44],"Masters":[46,53],"Ph.d.":[41,57]})
colSum = obs.sum(axis=0)
rowSum = obs.sum(axis=1)
total = sum(colSum)
obs

Unnamed: 0,High School,Bachelors,Masters,Ph.d.
0,60,54,46,41
1,40,44,53,57


In [2]:
exp = pd.DataFrame()
for i,colName in zip(colSum,obs.columns):
    exp[colName] = [i*j/total for j in rowSum]
exp

Unnamed: 0,High School,Bachelors,Masters,Ph.d.
0,50.886076,49.868354,50.377215,49.868354
1,49.113924,48.131646,48.622785,48.131646


In [3]:
chi2_cal = 0
df = (len(obs.columns)-1)*(len(obs.index)-1)
for lst1, lst2 in zip(obs.values,exp.values):
    for o,e in zip(lst1, lst2):
        chi2_cal += (o-e)**2/e
## assume alpha = 5% 
alpha = .05
pval = 1-chi2.cdf(chi2_cal, df, loc=0, scale=1)
pval

0.04588650089174717

In [4]:
pval<alpha
## reject H0

True
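
As a cross-check on the manual calculation, the same test can be sketched with `scipy.stats.chi2_contingency`. The variable names (`observed`, `chi2_stat`, `p_value`, `dof`, `expected`) are chosen here for illustration and are not part of the original notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the survey table (row 0: female, row 1: male)
observed = np.array([
    [60, 54, 46, 41],
    [40, 44, 53, 57],
])

# correction=False turns off Yates' continuity correction. SciPy only applies
# that correction when df == 1, so it would not matter for this 2 x 4 table,
# but passing it makes the intent explicit.
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"chi2 = {chi2_stat:.3f}, df = {dof}, p = {p_value:.4f}")
```

The returned `expected` array should match the `exp` table computed by hand above, and `p_value` should agree with `pval`.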

### Problem Statement 2:
Using the following data, perform a one-way analysis of variance using α=.05. Write
up the results in APA format.

[Group1: 51, 45, 33, 45, 67]
[Group2: 23, 43, 23, 43, 45]
[Group3: 56, 76, 74, 87, 56]

ANS:
- H0: mu1 = mu2 = mu3; H1: at least one group mean differs from the others
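
The cells below compute the following quantities for $k$ groups of $n$ observations each ($N = nk$). The symbols $\bar{x}_j$ (mean of group $j$), $\bar{x}$ (grand mean), and $s_j$ (sample standard deviation of group $j$) are notation introduced here for illustration.

```latex
SS_B = \sum_{j=1}^{k} n \, (\bar{x}_j - \bar{x})^2, \qquad
SS_W = \sum_{j=1}^{k} (n - 1) \, s_j^2, \qquad
F = \frac{SS_B / (k - 1)}{SS_W / (N - k)}
```

Under H0, $F$ follows an F distribution with $(k-1, N-k)$ degrees of freedom.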

In [5]:
from scipy.stats import f
data = pd.DataFrame({"Group1":[51, 45, 33, 45, 67],"Group2":[23, 43, 23, 43, 45],"Group3":[56, 76, 74, 87, 56]})
# number of groups
k = len(data.columns)
# number of observations in each group
n = len(data)
N = n*k

grant_mean = sum(data.sum())/N
gp_means = data.mean()
gp_sameple_sds = data.std()
grant_mean,gp_means,gp_sameple_sds

(51.13333333333333,
 Group1     48.2
 Group2     35.4
 Group3     69.8
 dtype: float64,
 Group1     12.377399
 Group2     11.349009
 Group3     13.535139
 dtype: float64)

In [6]:
## sum of squares between the groups
SSB = 0
for gp_mean in gp_means:
    SSB += n*(gp_mean - grant_mean)**2
## sum of squares within the groups
SSW = 0
for gp_sameple_sd in gp_sameple_sds:
    SSW += (n-1)*gp_sameple_sd**2
alpha = .05
dfb = k-1
dfw = N-k
MSB = SSB/dfb
MSW = SSW/dfw
f_cal = MSB/MSW
## store the p-value so the comparison in the next cell uses this test's result
pval = 1 - f.cdf(f_cal, k-1, N-k, loc=0, scale=1)
pval

0.003059754143443061

In [7]:
pval<alpha
## reject H0

True

In [8]:
print(f'''{'-'*80}
Source\tdf\tSS\t\t\tMS\t\t\tf
Between\t{dfb}\t{SSB}\t{MSB}\t{f_cal}
Within\t{dfw}\t{SSW}\t\t\t{MSW}
Total\t{dfb+dfw}\t{SSB+SSW}\t
{'-'*80}
''')

--------------------------------------------------------------------------------
Source	df	SS			MS			f
Between	2	3022.933333333333	1511.4666666666665	9.747205503009457
Within	12	1860.8			155.06666666666666
Total	14	4883.733333333333	
--------------------------------------------------------------------------------
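
The problem asks for an APA-style write-up. A minimal sketch, assuming `scipy.stats.f_oneway` as a cross-check of the manual ANOVA and a hypothetical formatting string `apa` (neither appears in the original notebook):

```python
from scipy.stats import f_oneway

group1 = [51, 45, 33, 45, 67]
group2 = [23, 43, 23, 43, 45]
group3 = [56, 76, 74, 87, 56]

# One-way ANOVA across the three groups
f_stat, p_value = f_oneway(group1, group2, group3)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = 3 - 1
df_within = 15 - 3

# APA style omits the leading zero of a p-value
p_text = f"{p_value:.3f}".lstrip("0")
apa = f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_text}"
print(apa)  # F(2, 12) = 9.75, p = .003
```

A full APA sentence would typically also report the group means and standard deviations computed earlier (for example, Group1: M = 48.20, SD = 12.38).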



### Problem Statement 3:
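Calculate the F test for the two samples 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.

ANS:
- The two samples are treated as two groups in a one-way ANOVA. H0: mu1 = mu2; H1: the two group means differ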

In [9]:
data = pd.DataFrame({"Group1":[10, 20, 30, 40, 50],"Group2":[5,10,15, 20, 25]})
# number of groups
k = len(data.columns)
# number of observations in each group
n = len(data)
N = n*k

grant_mean = sum(data.sum())/N
gp_means = data.mean()
gp_sameple_sds = data.std()
grant_mean,gp_means,gp_sameple_sds

(22.5,
 Group1     30.0
 Group2     15.0
 dtype: float64,
 Group1     15.811388
 Group2      7.905694
 dtype: float64)

In [10]:
## sum of squares between the groups
SSB = 0
for gp_mean in gp_means:
    SSB += n*(gp_mean - grant_mean)**2
## sum of squares within the groups
SSW = 0
for gp_sameple_sd in gp_sameple_sds:
    SSW += (n-1)*gp_sameple_sd**2
alpha = .05
dfb = k-1
dfw = N-k
MSB = SSB/dfb
MSW = SSW/dfw
f_cal = MSB/MSW
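## store the p-value so the comparison in the next cell uses this test's result
pval = 1 - f.cdf(f_cal, k-1, N-k, loc=0, scale=1)
pval

0.09434977284243773

In [11]:
pval<alpha
## fail to reject H0

False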

In [12]:
print(f'''{'-'*80}
Source\tdf\tSS\tMS\tf
Between\t{dfb}\t{SSB}\t{MSB}\t{f_cal}
Within\t{dfw}\t{SSW}\t{MSW}
Total\t{dfb+dfw}\t{SSB+SSW}\t
{'-'*80}
''')

--------------------------------------------------------------------------------
Source	df	SS	MS	f
Between	1	562.5	562.5	3.6
Within	8	1250.0	156.25
Total	9	1812.5	
--------------------------------------------------------------------------------
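
"F test" for two samples is often read as a test of equal variances (the ratio of the two sample variances) rather than a one-way ANOVA on the means. A minimal sketch of that alternative reading, with illustrative variable names that are not part of the original notebook:

```python
import numpy as np
from scipy.stats import f

sample1 = np.array([10, 20, 30, 40, 50])
sample2 = np.array([5, 10, 15, 20, 25])

# Sample variances (ddof=1 divides by n - 1)
var1 = sample1.var(ddof=1)  # 250.0
var2 = sample2.var(ddof=1)  # 62.5

# F statistic: larger sample variance over the smaller one
f_ratio = var1 / var2
df1 = len(sample1) - 1
df2 = len(sample2) - 1

# Upper-tail p-value, and a two-sided p-value obtained by doubling it
p_one_sided = f.sf(f_ratio, df1, df2)
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"F = {f_ratio:.2f}, df = ({df1}, {df2}), "
      f"one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
```

Under this reading the statistic is F = 250 / 62.5 = 4 with (4, 4) degrees of freedom, and the p-value is above .05, so the hypothesis of equal variances would not be rejected at the 5% level either.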

