<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# display up to 100 columns of a DataFrame
pd.options.display.max_columns = 100

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns


# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

from statsmodels.stats.proportion import proportions_ztest

<a id="t"></a>
# 3. t Test

<a id="1t"></a>
## 3.1 One Sample t Test

Let us perform a one sample t-test for the population mean. We compare the population mean with a specific value. 

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

The test statistic is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X} -  \mu_{0}}{\frac{s}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X}$: Sample mean<br>
$s$: Sample standard deviation<br>
$n$: Sample size
 
Under $H_{0}$ the test statistic follows a t-distribution with n-1 degrees of freedom.
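
The formula above can be sketched as a small helper that works from summary statistics. The function name `one_sample_t` and the numbers in the last line are illustrative assumptions, not part of the exercises that follow.

```python
import numpy as np
from scipy import stats

def one_sample_t(xbar, s, n, mu0, alternative="two-sided"):
    """Return (t, p) for a one-sample t-test from summary statistics.

    s must be the sample standard deviation (computed with ddof=1).
    """
    t = (xbar - mu0) / (s / np.sqrt(n))
    dof = n - 1
    if alternative == "greater":      # H1: mu > mu0
        p = stats.t.sf(t, dof)
    elif alternative == "less":       # H1: mu < mu0
        p = stats.t.cdf(t, dof)
    else:                             # H1: mu != mu0
        p = 2 * stats.t.sf(abs(t), dof)
    return t, p

# illustrative numbers only: t = (52 - 50) / (4 / sqrt(16)) = 2.0
t_demo, p_demo = one_sample_t(xbar=52, s=4, n=16, mu0=50, alternative="greater")
```

The `alternative` argument picks the tail of the t-distribution that matches $H_{1}$.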


#### 1. A survey claims that in a math test female students tend to score marks greater than 75. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

Use the dataset available in the CSV file `mathscore_1ttest.csv`.

In [3]:
df = pd.read_csv('mathscore_1ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group C,standard,none,60,72,74,206,Nature Learning
1,female,group C,standard,none,59,72,68,199,Nature Learning
2,female,group E,standard,none,100,100,100,300,Speak Global Learning
3,female,group D,standard,none,69,74,74,217,Speak Global Learning
4,female,group A,free/reduced,none,47,59,50,156,Speak Global Learning


Test of normality (Shapiro-Wilk):<br>
H<sub>0</sub>: the data is normal<br>
H<sub>1</sub>: the data is not normal

In [4]:
female_math=df['math score']

In [5]:
stats.shapiro(female_math)

ShapiroResult(statistic=0.9368310570716858, pvalue=0.13859796524047852)
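
The decision rule for the normality check can be sketched as a tiny helper. The name `looks_normal` and the synthetic sample are assumptions made for illustration; only `scipy.stats.shapiro` comes from the notebook.

```python
import numpy as np
from scipy import stats

def looks_normal(sample, alpha=0.05):
    """Return True when Shapiro-Wilk does not reject normality at level alpha."""
    statistic, pvalue = stats.shapiro(sample)
    return bool(pvalue > alpha)

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=200)   # synthetic, clearly non-normal data
verdict = looks_normal(skewed)                  # normality is rejected for this sample
```

For the math scores above, the p-value of about 0.139 exceeds 0.05, so normality is not rejected and the t-test can proceed.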

H<sub>0</sub>: $\mu \leq 75$<br>
H<sub>1</sub>: $\mu > 75$

In [7]:
mu=75
xbar=np.mean(female_math)
s=np.std(female_math)  # np.std defaults to ddof=0 (population formula); pass ddof=1 (delta degrees of freedom) for the sample standard deviation
n=24
s

11.357740263313339

In [None]:
female_math.std()

In [8]:
tstat=(xbar-mu)/(s/(n**0.5))
tstat

-3.6843112100134086

In [9]:
pval=stats.t.sf(tstat,df=n-1)
pval

0.9993861124656039

In [10]:
stats.ttest_1samp(female_math,popmean=75,alternative='greater')   # df = n - 1 = 24 - 1 = 23

TtestResult(statistic=-3.6067380757023204, pvalue=0.9992573386042322, df=23)
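
The manual statistic (-3.684) and the `ttest_1samp` statistic (-3.607) differ because `np.std` divides by $n$ by default, while the t-test uses the sample standard deviation, which divides by $n-1$. The two statistics differ by a factor of $\sqrt{n/(n-1)} = \sqrt{24/23} \approx 1.0215$. A minimal sketch on synthetic data (the `scores` array is an assumed stand-in, not the CSV data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = rng.normal(loc=70, scale=10, size=24)   # synthetic stand-in for the 24 scores
mu0 = 75
n = len(scores)

s_pop = np.std(scores)             # ddof=0: divides by n
s_samp = np.std(scores, ddof=1)    # ddof=1: divides by n - 1

t_ddof0 = (scores.mean() - mu0) / (s_pop / np.sqrt(n))
t_ddof1 = (scores.mean() - mu0) / (s_samp / np.sqrt(n))

# scipy uses the ddof=1 standard deviation internally
res = stats.ttest_1samp(scores, popmean=mu0, alternative="greater")
```

Only `t_ddof1` reproduces `res.statistic`. Either way the p-value here is close to 1, so $H_{0}: \mu \leq 75$ is not rejected.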

#### 2. A researcher is studying the growth of bacteria in waters of Lake Beach. The mean bacteria count of 100 per unit volume of water is within the safety level. The researcher collected 10 water samples of unit volume and found the mean bacteria count to be 94.8 with a sample variance of 72.66. Does the data indicate that the bacteria count is within the safety level? Test at the α = .05 level. Assume that the measurements constitute a sample from a normal population.

In [None]:
# Data is normal
# pop std is not known
# one sample t test(left tailed)

In [11]:
mu=100
n=10
xbar=94.8
s=np.sqrt(72.66)

In [12]:
tstat=(xbar-mu)/(s/(n**0.5))
tstat

-1.9291040236750068

In [13]:
pval=stats.t.cdf(tstat,df=n-1)
pval

0.04289782134327503
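
The same decision can be cross-checked with a critical value. This is a sketch that reuses the numbers given in the problem; the names `t_crit` and `reject_h0` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

mu, n, xbar = 100, 10, 94.8
s = np.sqrt(72.66)                 # sample standard deviation from the sample variance
alpha = 0.05

tstat = (xbar - mu) / (s / np.sqrt(n))
t_crit = stats.t.ppf(alpha, df=n - 1)   # lower-tail critical value, about -1.833
reject_h0 = tstat < t_crit              # left-tailed test: reject when t falls below t_crit
```

Since the statistic (about -1.929) lies below the critical value, and equivalently the p-value 0.0429 is below 0.05, $H_{0}$ is rejected: the data indicates that the mean bacteria count is below 100, that is, within the safety level.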

<a id="2t"></a>
## 3.2 Two Sample t Test (Unpaired)

The two sample t-test is used to compare the means of two independent populations. This test assumes that the populations from which the samples are taken are normally distributed.

The null and alternative hypotheses are given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>

Let us take a sample of size $n_{1}$ from the first population and a sample of size $n_{2}$ from a second, independent population. If both $n_{1}$ and $n_{2}$ are less than 30 and the population standard deviations are unknown, we use the two-sample t-test.

Consider the equal variance for both the populations. The test statistic for two sample t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$: Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s$: Pooled standard deviation<br>
$n_{1}, n_{2}$: Size of samples from both the populations

The pooled standard deviation is defined as:
$s = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}}$ $\hspace{2cm}$  Where, $s_{1}, s_{2}$: Standard deviation of both the samples

Under $H_{0}$, the test statistic follows a t-distribution with $(n_{1}+n_{2}-2)$ degrees of freedom.

If the population variances are equal and also the sample size is the same for both the samples then the test statistic is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{2}{n}}}$</strong></p>

Where the pooled standard deviation $s = \sqrt{\frac{s_{1}^{2} + s_{2}^{2}}{2}}$

Under $H_{0}$, the test statistic follows a t-distribution with $(2n-2)$ degrees of freedom.

If the population variances are not equal, Welch's t-test is used instead; it does not pool the variances and remains valid when the sample sizes also differ.
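
A minimal sketch of Welch's test on synthetic data follows. The two groups are made up for illustration; `equal_var=False` is the `scipy.stats.ttest_ind` option that switches from the pooled test to Welch's test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=50, scale=5, size=12)    # synthetic data
group_b = rng.normal(loc=55, scale=15, size=20)   # larger spread, different size

welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# the same statistic by hand: each sample keeps its own variance
v_a = group_a.var(ddof=1) / len(group_a)
v_b = group_b.var(ddof=1) / len(group_b)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(v_a + v_b)
```

The degrees of freedom are then approximated from the two variances rather than taken as $n_{1}+n_{2}-2$.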

### Example: 

#### 1. The teachers' association claims that the total score of the students who completed the test preparation course is different than the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05.

The total scores of the students who have and have not completed the preparation course are given in the CSV file `totalmarks_2ttest.csv`.

In [14]:
df = pd.read_csv('totalmarks_2ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning
2,male,group A,standard,none,91,96,92,279,Nature Learning
3,female,group B,free/reduced,completed,76,94,87,257,Speak Global Learning
4,male,group A,standard,completed,46,41,43,130,Nature Learning


In [15]:
df['test preparation course'].unique()

array(['completed', 'none'], dtype=object)

In [None]:
# Test of Normality - Shapiro test (unpaired question)
# Ho: data is normal
# Ha: data is not normal

In [16]:
completed =df[df['test preparation course']=='completed']['total score']
notcompleted =df[df['test preparation course']=='none']['total score']

In [17]:
stats.shapiro(completed)

ShapiroResult(statistic=0.9055536389350891, pvalue=0.11574102193117142)

In [None]:
# completed data is normal

In [None]:
# pval = 0.116
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# data is normal

In [18]:
stats.shapiro(notcompleted)   # check normality of the not-completed group

ShapiroResult(statistic=0.9481862187385559, pvalue=0.3972780704498291)

In [None]:
# pval = 0.397
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# data is normal

In [None]:
# Hypothesis:
# h0: Average of completed = Average of not completed
# ha: Average of completed != Average of not completed

In [None]:
# Data is normal
# pop std is not known
# samples are independent
# Unpaired two sample t test (two tailed)

In [19]:
# ind means independent
stats.ttest_ind(completed,notcompleted)

Ttest_indResult(statistic=1.4385323319823262, pvalue=0.16030339806989594)

In [20]:
n1=15
n2=18
df1=n1-1
df2=n2-1
x1bar=completed.mean()
x2bar=notcompleted.mean()
s1=completed.std()
s2=notcompleted.std()
num=(x1bar-x2bar)
# pooled variance: weight each sample variance by its degrees of freedom
sp2=(df1*(s1**2)+df2*(s2**2))/(df1+df2)
k=((1/n1)+(1/n2))
tstat=num/np.sqrt(k*sp2)
print(round(tstat,4))

1.4385


In [21]:
# two-tailed p-value with n1+n2-2 = 31 degrees of freedom
round(stats.t.sf(abs(tstat),df1+df2)*2,4)

0.1603

In [22]:
stats.ttest_ind(completed,notcompleted)

Ttest_indResult(statistic=1.4385323319823262, pvalue=0.16030339806989594)

<a id="paired"></a>
## 3.3 Paired t Test

A paired t-test is used to compare the mean of the population for two dependent samples. The dependent samples can be the scores before and after a specific treatment. 

Let $X_{i}$ be the sample before the treatment and $Y_{i}$ be the sample after the treatment. Let $\mu_{X}$, $\mu_{Y}$ be the mean of the data X and Y respectively. The mean difference $\mu_{d} = \mu_{Y} - \mu_{X}$.

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{d} = \mu_{0}$ or $\mu_{d} \geq \mu_{0}$ or $\mu_{d} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{d} \neq \mu_{0}$ or $\mu_{d} < \mu_{0}$ or $\mu_{d} > \mu_{0}$</strong></p>

The test statistic for paired t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X_{D}} - \mu_{0}} {\frac{s_{D}}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X_{D}}$: Mean difference between the pairs<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s_{D}$: Standard deviation of differences between the pairs<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a t-distribution with (n-1) degrees of freedom.
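
The paired test is a one-sample t-test on the pairwise differences. The sketch below checks this on synthetic before/after scores; the arrays `x_before` and `y_after` are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x_before = rng.normal(loc=60, scale=8, size=15)            # synthetic scores
y_after = x_before + rng.normal(loc=3, scale=5, size=15)   # same subjects after treatment

d = y_after - x_before                      # differences d = Y - X
paired = stats.ttest_rel(y_after, x_before)
one_sample = stats.ttest_1samp(d, popmean=0)

# the formula above with mu_0 = 0
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

Note that `ttest_rel(a, b)` works with the differences `a - b`, so the argument order decides the sign of the statistic and which one-sided `alternative` is appropriate.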

### Example:

#### 1. A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.

The writing scores before and after training are provided in the CSV file `WritingScores.csv`. 

In [23]:
df = pd.read_csv('WritingScores.csv')
df.head()

Unnamed: 0,score_before,score_after
0,59,50
1,62,67
2,76,92
3,32,75
4,61,98


In [24]:
before=df['score_before']
after=df['score_after']

In [25]:
stats.shapiro(before)

ShapiroResult(statistic=0.947382390499115, pvalue=0.41645893454551697)

In [None]:
pval=0.41

In [None]:
# Test of Normality - Shapiro test
# Ho: data is normal
# Ha: data is not normal

In [26]:
stats.shapiro(after)

ShapiroResult(statistic=0.9686525464057922, pvalue=0.7944169044494629)

Data is normal<br>
Population standard deviation is not known<br>
Samples are dependent<br>
Paired two-sample t-test<br>
One-tailed (left-tailed)

In [27]:
stats.ttest_rel(before,after,alternative='less')

TtestResult(statistic=-1.4394882729049499, pvalue=0.08464506448139923, df=16)

#### 2. An energy drink distributor claims that a new advertisement poster, featuring a life-size picture of a well-known athlete, will increase the product sales in outlets by an average of 50 bottles in a week. For a random sample of 10 outlets, the following data was collected. Test whether the advertisement was effective in increasing sales. Test the hypothesis using the critical region technique. Use α = 0.05.

Given data:

        sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
        sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

In [28]:
sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

In [29]:
print(shapiro(sales_before )[1],shapiro(sales_after)[1])

0.3817565143108368 0.3293103277683258


In [30]:
# Test of Normality - Shapiro test
# Ho: data is normal
# Ha: data is not normal

In [31]:
# Hypothesis (mu_d = mean of sales_after - sales_before):

# Ho : mu_d <= 0 (the advertisement was not effective in increasing sales)
# Ha : mu_d > 0 (the advertisement was effective in increasing sales)

In [32]:
# Data is normal
# pop std is not known
# pop is dependent
# Paired two sample t test (left tailed)

In [33]:
stats.ttest_rel(sales_before,sales_after,alternative='less')

TtestResult(statistic=-0.10085458113185983, pvalue=0.46107385734626494, df=8)

In [None]:
# pval = 0.46
# sig lvl = 0.05
# pval > sig lvl, hence we fail to reject the null hypothesis
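
The problem asks for the critical region technique, which can be sketched as follows. This follows the `ttest_rel` call above, which tests a mean difference of 0 on `sales_before - sales_after`; the names `t_crit` and `reject_h0` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

sales_before = np.array([33, 32, 38, 45, 37, 47, 48, 41, 45])
sales_after = np.array([42, 35, 31, 41, 37, 36, 49, 49, 48])

d = sales_before - sales_after            # same order as ttest_rel(sales_before, sales_after)
n = len(d)
tstat = d.mean() / (d.std(ddof=1) / np.sqrt(n))

alpha = 0.05
t_crit = stats.t.ppf(alpha, df=n - 1)     # left-tail critical value, about -1.860
reject_h0 = tstat < t_crit                # critical region: t < t_crit
```

The statistic (about -0.101) does not fall in the critical region, which agrees with the p-value decision: there is not enough evidence that the advertisement increased sales.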

<a id="prop"></a>
# 4. Z Proportion Test

<a id="1_p"></a>
## 4.1 One Sample Test

Perform one sample Z test for the population proportion. We compare the population proportion ($P$) with a specific value ($P_{0}$).

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: P = P_{0}$ or $P \geq P_{0}$ or $P \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P \neq P_{0}$ or $P < P_{0}$ or $P > P_{0}$</strong></p>

The test statistic for proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{p -  P_{0}}{\sqrt{\frac{P_{0}(1-P_{0})}{n}}}$</strong></p>

Where, <br>
$p$: Sample proportion<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a standard normal distribution.

#### A claim states that 70% of people opt for LED TVs. A sample of 100 people is chosen, and 65% of them are found to opt for LED TVs. Test the claim at the 95% confidence level.

In [None]:
# h0: Proportion of LED TV (prop_population) = 70% = 0.7
# ha: Proportion of LED TV (prop_population) != 70% (0.7)

In [None]:
# Large-sample check: n*P0 = 70 and n*(1-P0) = 30 are both at least 5, so the normal approximation is reasonable

In [35]:
psamp=0.65
prop_pop=0.7
n=100
num=psamp-prop_pop
den=np.sqrt((prop_pop*(1-prop_pop))/n)
zstat=num/den
zstat

-1.0910894511799603

In [36]:
pval=stats.norm.sf(abs(zstat))*2
pval

0.27523352407483503

In [37]:
from statsmodels.stats.proportion import proportions_ztest

In [38]:
proportions_ztest(count=65,nobs=100,value=0.7,prop_var=0.7)   # 2 tailed; prop_var=0.7 uses P0 in the standard error

(-1.0910894511799603, 0.27523352407483503)
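
Note that `proportions_ztest` estimates the standard error from the sample proportion by default, whereas the formula above uses $P_{0}$. A minimal sketch of the difference follows, passing the observed count (65 of 100) as `count` and the claimed proportion (0.7) as `value`; `prop_var` is the statsmodels parameter that fixes the proportion used in the variance.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count, nobs, p0 = 65, 100, 0.7     # 65 of 100 sampled people opt for LED TV

# default: the standard error uses the sample proportion 0.65
z_default, p_default = proportions_ztest(count=count, nobs=nobs, value=p0)

# prop_var=p0: the standard error uses the hypothesized proportion, as in the formula
z_null, p_null = proportions_ztest(count=count, nobs=nobs, value=p0, prop_var=p0)
```

Only the second call reproduces the manual statistic of about -1.091; the default gives about -1.048. In both cases the two-sided p-value is well above 0.05, so the claim of 70% is not rejected.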

### Example:

#### 1. In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05.

Consider the sample of math scores of male students available in the CSV file `StudentsPerformance.csv`.

In [42]:
df =pd.read_csv('StudentsPerformance.csv')
df.head(1)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning


In [None]:
# keep only the male students
male=df[df['gender']=='male']

In [None]:
# male students who scored more than 50 in math
male_gr_50=male[male['math score']>50]

In [None]:
pop=0.8
n=len(male)
psamp=len(male_gr_50)/n
psamp

In [None]:
num=psamp-pop
den=np.sqrt((pop*(1-pop))/n)
zstat=num/den
zstat

In [None]:
# right-tailed test: H0: P <= 0.8, H1: P > 0.8
pval=stats.norm.sf(zstat)
pval

In [None]:
# sig lvl = 0.05
# if pval < sig lvl, reject H0: the proportion of male students scoring more than 50 is greater than 0.8
# otherwise, fail to reject H0

#### 2. A sample of 361 business owners had gone into bankruptcy due to recession. A survey found that 105 of them had not consulted any professional for managing their finances before opening the business. Test the null hypothesis that at most 25% of all such business owners had not consulted a professional before opening the business. Test the claim using the p-value technique. Use α = 0.05.

The null and alternative hypothesis is:

H<sub>0</sub>: $P \leq 0.25$<br>
H<sub>1</sub>: $P > 0.25$ 

In [51]:
pop=0.25
psamp=105/361
n=361
num=psamp-pop
den=np.sqrt((pop*(1-pop))/n)
zstat=num/den
zstat

1.7928245201151534

In [52]:
stats.norm.sf(zstat)

0.03650049373124949

In [53]:
# pval = 0.0365
# sig lvl = 0.05
# pval < sig lvl
# reject H0
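
As a cross-check on the p-value decision, the same test can be sketched with a critical value. The numbers are those of the problem; the names `z_crit` and `reject_h0` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

p0, n = 0.25, 361
psamp = 105 / n
alpha = 0.05

zstat = (psamp - p0) / np.sqrt(p0 * (1 - p0) / n)
z_crit = stats.norm.ppf(1 - alpha)     # right-tail critical value, about 1.645
reject_h0 = zstat > z_crit             # right-tailed test: H1 is P > 0.25
```

The statistic (about 1.793) exceeds the critical value, so both techniques reject $H_{0}$: the data suggests that more than 25% had not consulted a professional.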

<a id="2_p"></a>
## 4.2 Two Sample Test

Perform two sample Z test for the population proportion. We check the equality of population proportions $P_{1}$ and $P_{2}$.

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: P_{1} - P_{2} = P_{0}$ or $P_{1} - P_{2} \geq P_{0}$ or $P_{1} - P_{2} \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P_{1} - P_{2} \neq P_{0}$ or $P_{1} - P_{2} < P_{0}$ or $P_{1} - P_{2} > P_{0}$</strong></p>

The test statistic for two sample proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{(p_{1} -  p_{2}) - P_{0}}{\sqrt{\bar{P}(1-\bar{P})(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}$   $\hspace{2 cm} \bar{P} = \frac{n_{1}p_{1} + n_{2}p_{2}}{n_{1} + n_{2}}$ </strong></p>

Where, <br>
$p_{1}, p_{2}$: Samples proportions<br>
$P_{0}$: Hypothesized difference between the population proportions<br>
$\bar{P}$: Proportion of pooled sample<br>
$n_{1}, n_{2}$: Samples sizes

### Example:

#### 1. A team of nutritionists believes that each institute provides 'standard' lunch to an equal proportion of students. A sample of students from institutes <i>Nature Learning</i> and <i>Speak Global Learning</i> is given. Consider the null hypothesis as equality of proportion with 0.1 level of significance.

Consider the sample data available in the CSV file `StudentsPerformance.csv`.

In [54]:
# read the students performance data 
df = pd.read_csv('StudentsPerformance.csv')

# display the first ten observations
df.head(10)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning
5,female,group B,standard,none,74,85,55,214,Nature Learning
6,female,group B,standard,completed,55,51,51,157,Nature Learning
7,male,group B,free/reduced,none,74,49,77,200,Nature Learning
8,male,group D,free/reduced,completed,43,87,77,207,Nature Learning
9,female,group B,free/reduced,none,76,70,71,217,Nature Learning


In [55]:
import statsmodels
from statsmodels.stats import proportion

In [56]:
nl=df[df['training institute']=='Nature Learning']
nl_std=nl[nl['lunch']=='standard']
nl_count=len(nl)
nl_std_count=len(nl_std)

In [57]:
sgl=df[df['training institute']=='Speak Global Learning']
sgl_std=sgl[sgl['lunch']=='standard']
sgl_count=len(sgl)
sgl_std_count=len(sgl_std)

In [58]:
zstat,pval=proportions_ztest(count=[nl_std_count,sgl_std_count],
                 nobs=[nl_count,sgl_count])
zstat,pval

(0.7935300106078008, 0.4274690915859791)

#### 2. Steve owns a kiosk where he sells two magazines, A and B, in a month. He buys 100 copies of magazine A out of which 78 were sold and 70 copies of magazine B out of which 65 were sold. Is there enough evidence to say that magazine B is more popular? Test the claim using the p-value technique with α = 0.05.

In [59]:
a_count=100
a_sold_count=78
prop_a=a_sold_count/a_count

In [60]:
b_count=70
b_sold_count=65
prop_b=b_sold_count/b_count

In [61]:
prop_a,prop_b

(0.78, 0.9285714285714286)

In [62]:
proportions_ztest(count=[a_sold_count,b_sold_count],
                 nobs=[a_count,b_count],alternative='smaller')

(-2.60830803458311, 0.004549551600547303)
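
The result can be reproduced by hand from the two-sample formula of section 4.2 with $P_{0} = 0$. This is a sketch using the counts given in the problem; the names `p_pool` and `se` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

a_sold, a_total = 78, 100
b_sold, b_total = 65, 70

p_a, p_b = a_sold / a_total, b_sold / b_total
p_pool = (a_sold + b_sold) / (a_total + b_total)          # pooled proportion
se = np.sqrt(p_pool * (1 - p_pool) * (1 / a_total + 1 / b_total))

zstat = (p_a - p_b) / se
pval = stats.norm.cdf(zstat)     # left tail: H1 is P_A < P_B
```

The p-value (about 0.0045) is below 0.05, so $H_{0}$ is rejected: there is enough evidence that magazine B is more popular.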