## **Statistical Inference**


---


Statistical inference is the process of drawing conclusions about populations or scientific truths from data.

Inferences about a population are made based on certain statistics calculated from a *sample of data drawn from that population*.

**Descriptive statistics** focus on describing the visible characteristics of a dataset (a population or sample). Meanwhile, **inferential statistics** focus on making predictions or generalizations about a larger dataset, based on a sample of those data.

# Question 1

Suppose we want to test whether the mean weight of apples in a grocery store is 150 grams. We randomly sample 20 apples from the store and measure their weights, getting the following data:

Apple_weights = [145, 155, 160, 146, 142, 152, 150, 147, 148, 149, 148, 152, 153, 155, 154, 148, 151, 147, 153, 146]

In [1]:
Apple_weights = [145, 155, 160, 146, 142, 152, 150, 147, 148, 149, 148, 152, 153, 155, 154, 148, 151, 147, 153, 146]

In [2]:
Apple_weights

[145,
 155,
 160,
 146,
 142,
 152,
 150,
 147,
 148,
 149,
 148,
 152,
 153,
 155,
 154,
 148,
 151,
 147,
 153,
 146]

<ul><li><b>What test should we use and why?</b></li></ul>

<b>Hypothesis testing: one-sample t-test</b>
<br>
A one-sample test compares the mean of a single group against an external standard value (here, 150 grams). <br>
The <b>one-sample t-test</b> is the appropriate choice because the population standard deviation is unknown and must be estimated from a small sample (n = 20).


---



<ul><li><b>State the null and alternative hypotheses.</b></li></ul>

Null Hypothesis: H<sub>0</sub>: The mean weight of apples in the grocery store is equal to 150 grams (the default assumption). <br />
Alternative Hypothesis: H<sub>a</sub>: The mean weight of apples in the grocery store is not equal to 150 grams (two-tailed or two-sided). This is the alternative tested here.<br />
A one-tailed (one-sided) alternative would instead state a single direction chosen in advance: either that the mean weight is less than 150 grams, or that it is greater than 150 grams.


---



<ul><li><b>Choose a significance level (α) (the probability of rejecting the null hypothesis
when it is actually true).</b></li></ul><br>
<b>5%</b> (α = 0.05), which is the conventional choice
<br><br>
Note: <i>The significance level, also denoted as alpha, is the threshold that the p-value is compared against: the null hypothesis is rejected when the p-value falls below α. It measures the strength of the evidence that must be present in the sample before rejecting the null, so a smaller α demands stronger evidence.</i>


<ul><li><b>Determine the degrees of freedom (df) of the sample. Df = sample size -1</b></li></ul>


---



Given sample size = 20 <br>
Degrees of freedom (df) = 20 - 1 = 19


---



<ul><li><b>Determine the critical value of t based on the significance level and degrees of
freedom. For a two-tailed test with α = 0.05 and df = 19, the critical values are
-2.093 and 2.093.</b></li></ul>
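The critical values quoted above can be reproduced from the inverse CDF of the t-distribution rather than read from a table. The following is a minimal sketch using `scipy.stats.t.ppf`; the variable names are illustrative and not part of the original exercise.

```python
from scipy import stats

alpha = 0.05  # significance level
df = 19       # degrees of freedom (n - 1 for n = 20)

# Two-tailed test: alpha / 2 of the probability mass sits in each tail
t_critical = stats.t.ppf(1 - alpha / 2, df)

print("Critical values:", round(-t_critical, 3), "and", round(t_critical, 3))
```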

In [3]:
#given mean defined by the variable mu
mu = 150

In [4]:
#including the libraries
import numpy as np
from scipy import stats

#calculate the mean of the given sample data
sample_mean = np.mean(Apple_weights)

#calculate the standard deviation of sample data
sample_std = np.std(Apple_weights,ddof=1)

# Calculate the t-statistic and p-value
t_stat = (sample_mean - mu) / (sample_std / np.sqrt(len(Apple_weights)))
p_value = 2 * stats.t.sf(np.abs(t_stat), df=len(Apple_weights)-1)

print("Sample mean :",sample_mean)
print("Sample standard deviation :", round(sample_std,3))
print("Test statistic value :",round(t_stat,3))
print("P value :", round(p_value,3))


Sample mean : 150.05
Sample standard deviation : 4.261
Test statistic value : 0.052
P value : 0.959


<ul><li><b>Compare and interpret the results of the test to the critical value</b></li></ul>

In [5]:
t_critical = 2.093

if t_stat < -t_critical or t_stat > t_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


Fail to reject the null hypothesis
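As a cross-check on the manual calculation, the same two-sided test can be run with `scipy.stats.ttest_1samp`. This is a sketch only; the `reject` variable is an illustrative name, and the data are repeated so the snippet runs on its own.

```python
from scipy import stats

Apple_weights = [145, 155, 160, 146, 142, 152, 150, 147, 148, 149,
                 148, 152, 153, 155, 154, 148, 151, 147, 153, 146]

# Two-sided one-sample t-test against the hypothesized mean of 150 grams
t_stat, p_value = stats.ttest_1samp(Apple_weights, popmean=150)

# Decision by p-value: equivalent to comparing |t| with the critical value
reject = p_value < 0.05

print("t statistic:", round(t_stat, 3))
print("p value:", round(p_value, 3))
print("Reject the null hypothesis" if reject else "Fail to reject the null hypothesis")
```

Comparing the p-value with α and comparing the t-statistic with the critical value lead to the same decision for a two-sided test.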


# Question 2

Suppose we want to test whether the mean height of all men in a population is 180 cm, assuming that the population standard deviation = 2. We randomly sample 50 men from the population and measure their heights, getting the following data:

Men_height = [177, 180, 182, 179, 178, 181, 176, 183, 179, 180, 178, 181, 177, 178, 180, 179, 182, 180, 183, 181, 179, 177, 180, 181, 178, 180, 182, 179, 177, 182, 178, 181, 183, 179, 180, 181, 183, 178, 177, 181, 179, 182, 180, 181, 178, 180, 179, 181, 183, 179]


<ul><li><b>What test should we use and why? </b></li></ul><br>
<b>One-sample Z-test (single sample Z-test)</b><br>
A one-sample Z-test compares the sample mean with a specific or hypothesized value of the population mean. It applies when the population standard deviation (σ) is known and the sample is large; here σ = 2 is given and n = 50.
<br>

1.  Z-test is a parametric statistical method for comparing the means of the two populations (two-sample independent and paired Z-test) and for comparing the mean of a sample to a specific value (one sample Z-test).
2.  Z-test requires a large sample size (n ≥ 30) and known population variance (or deviation). If the population variance is not known, use a t-test, which uses sample standard deviation (s).
3.  In Z-test, the test statistic follows the standard normal distribution (Z-distribution) (type of continuous probability distribution) under the null hypothesis.
4.  Z-test has three main types: One Sample Z-test, two sample Z-test (unpaired or independent), and paired Z-test.

<br> Reference : https://www.reneshbedre.com/blog/z-test-in-python.html?utm_content=cmp-true <br>
https://www.educba.com/python-z-test/


<ul><li><b>State the null and alternative hypotheses. </b></li></ul><br>
<b>Assumptions</b>
<br>
<ul><li>The dependent variable should be approximately normally distributed (this can be checked with the Shapiro-Wilk test); under the null hypothesis the test statistic then follows the standard normal distribution N(0, 1)</li>
<li>Population standard deviation should be known</li>
<li>The dependent variable should be a continuous variable</li>
<li>Observations are independent of each other and randomly drawn from a population</li>
<li>The sample size should be large (n ≥ 30)</li></ul>

<table><tr><td>Null hypothesis (H0)</td><td>Alternative hypothesis (Ha)</td><td>Type</td><td>Reject H0 when (at 5% significance level)</td></tr>
<tr><td>µ = µ0</td><td>µ ≠ µ0</td><td>two-tailed</td><td>|Z| &gt; 1.96</td></tr>
<tr><td>µ ≥ µ0</td><td>µ &lt; µ0</td><td>one-tailed (lesser)</td><td>Z &lt; -1.645</td></tr>
<tr><td>µ ≤ µ0</td><td>µ &gt; µ0</td><td>one-tailed (greater)</td><td>Z &gt; 1.645</td></tr></table>
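The rejection thresholds in the table come from the inverse CDF of the standard normal distribution. A minimal sketch using Python's standard-library `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-tailed test: alpha / 2 in each tail
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)

# One-tailed test: all of alpha in a single tail
z_one_tailed = std_normal.inv_cdf(1 - alpha)

print("Two-tailed critical value: +/-", round(z_two_tailed, 3))
print("One-tailed critical value:", round(z_one_tailed, 3))
```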

Null hypothesis H<sub>0</sub>: The population mean is equal to the hypothesized value (the mean height of all men is 180 cm)<br />
Alternative hypothesis H<sub>a</sub>: The population mean is not equal to the hypothesized value (two-tailed) (the mean height of all men is not 180 cm)<br />
Alternative hypothesis H<sub>a</sub>: The population mean differs from the hypothesized value in one direction chosen in advance (one-tailed) (the mean height of all men is less than 180 cm, or the mean height of all men is greater than 180 cm)

<ul><li><b>Choose a significance level (α) (the probability of rejecting the null hypothesis
when it is actually true).</b></li></ul><br>
Significance level (corresponding to a 95% confidence level) <br />
alpha = 0.05

<ul><li><b>Determine the degrees of freedom (df) of the sample. Df = sample size -1 </b></li></ul><br>
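Given sample size = 50 <br>
Degrees of freedom (df) = 50 - 1 = 49 <br>
Note: degrees of freedom apply to the t-distribution. The Z-test statistic is compared against the standard normal distribution, which has no degrees-of-freedom parameter; its two-tailed critical values at α = 0.05 are ±1.96.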

<ul><li><b>Determine the critical value of t based on the significance level and degrees of
freedom. For a two-tailed test with α = 0.05 and df = 49, the critical values are
-2.0096 and 2.0096. </b></li></ul><br>



statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)

In [6]:
import math
import numpy as np

from statsmodels.stats.weightstats import ztest


Men_height = [177, 180, 182, 179, 178, 181, 176, 183, 179, 180, 178, 181, 177, 178, 180, 179, 182, 180, 183, 181, 179, 177, 180, 181, 178, 180, 182, 179, 177, 182, 178, 181, 183, 179, 
               180, 181, 183, 178, 177, 181, 179, 182, 180, 181, 178, 180, 179, 181, 183, 179]

samp_height_mean = np.array(Men_height).mean()
# standard error under the known population standard deviation (sigma = 2);
# note that statsmodels' ztest does not use it and instead estimates the
# standard deviation from the sample
sd_samp_height = 2 / math.sqrt(50)
alpha = 0.05
pop_mean = 180

print("The sample mean is ", samp_height_mean)

# Perform the test: pass the sample, the mean under the null hypothesis (value),
# and alternative='larger' to test whether the population mean is greater than 180
ztest_Score, p_value = ztest(Men_height, value=pop_mean, alternative='larger')
# ztest returns the z-score and the corresponding p-value. If the p-value is
# less than alpha we reject the null hypothesis; otherwise we fail to reject it.

print("z testscore", round(ztest_Score, 4), "P Value ", round(p_value, 4))
print("alpha value ", alpha)
if p_value < alpha:
  print("Reject Null Hypothesis")
else:
  print("Fail to Reject Null Hypothesis")

The sample mean is  179.84
z testscore -0.6061 P Value  0.7278
alpha value  0.05
Fail to Reject Null Hypothesis


In [7]:
# alternative='smaller' tests whether the population mean is less than 180
ztest_Score, p_value = ztest(Men_height, value=pop_mean, alternative='smaller')
# ztest returns the z-score and the corresponding p-value. If the p-value is
# less than alpha we reject the null hypothesis; otherwise we fail to reject it.

print("z testscore", round(ztest_Score, 4), "P Value ", round(p_value, 4))

if p_value < alpha:
  print("Reject Null Hypothesis")
else:
  print("Fail to Reject Null Hypothesis")

z testscore -0.6061 P Value  0.2722
Fail to Reject Null Hypothesis


In [8]:
# alternative='two-sided' tests whether the population mean differs from 180
ztest_Score, p_value = ztest(Men_height, value=pop_mean, alternative='two-sided')
# ztest returns the z-score and the corresponding p-value. If the p-value is
# less than alpha we reject the null hypothesis; otherwise we fail to reject it.

print("z testscore", round(ztest_Score, 4), "P Value ", round(p_value, 4))

if p_value < alpha:
  print("Reject Null Hypothesis")
else:
  print("Fail to Reject Null Hypothesis")

z testscore -0.6061 P Value  0.5444
Fail to Reject Null Hypothesis


<ul><li><b>Compare and interpret the results of the test to the critical value </b></li></ul><br>
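Since the p-value (0.5444 for the two-sided test) is not less than 0.05, we do not have sufficient evidence to reject the null hypothesis. In other words, the data are consistent with a mean height of 180 cm; this does not prove that the mean is exactly 180 cm.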
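The `ztest` function from statsmodels estimates the standard deviation from the sample, so the calls above do not use the population standard deviation of 2 given in the question. The sketch below computes the z-score directly from the known σ, using only the standard library; the variable names are illustrative, and the conclusion is the same even though the z-score differs slightly from the -0.6061 reported above.

```python
import math
from statistics import NormalDist

Men_height = [177, 180, 182, 179, 178, 181, 176, 183, 179, 180, 178, 181, 177,
              178, 180, 179, 182, 180, 183, 181, 179, 177, 180, 181, 178, 180,
              182, 179, 177, 182, 178, 181, 183, 179, 180, 181, 183, 178, 177,
              181, 179, 182, 180, 181, 178, 180, 179, 181, 183, 179]

pop_mean = 180  # hypothesized population mean (cm)
pop_std = 2     # known population standard deviation, given in the question
alpha = 0.05

n = len(Men_height)
sample_mean = sum(Men_height) / n

# Standard error of the mean under the known sigma
std_error = pop_std / math.sqrt(n)
z_score = (sample_mean - pop_mean) / std_error

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))

print("z score:", round(z_score, 4))
print("p value:", round(p_value, 4))
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```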

# Question 3

Suppose we want to test whether the mean weight of a population of cats is different from 4 kg. We randomly sample 50 cats from the population and measure their weights, getting the following data:

Cats_weights = [3.9, 4.2, 4.5, 4.1, 4.3, 3.8, 4.6, 4.2, 3.7, 4.3, 3.9, 4.0, 4.1, 4.5, 4.2, 3.8, 3.9, 4.3, 4.1, 4.0, 4.4, 4.2, 4.1, 4.6, 4.4, 4.2, 4.1, 4.3, 4.0, 4.4, 4.3, 3.8, 4.1, 4.5, 4.2, 4.3, 4.0, 4.1, 4.2, 3.9, 4.3, 3.7, 4.1, 4.5, 4.2, 4.0, 4.2, 4.4, 4.1, 4.5]


<ul><li><b>Perform one sample two tailed Z-Test to determine whether the mean weight of
the sampled cats is significantly different from 4 kg</b></li></ul>

In [9]:
import math
import numpy as np

from statsmodels.stats.weightstats import ztest

Cats_weights = [3.9, 4.2, 4.5, 4.1, 4.3, 3.8, 4.6, 4.2, 3.7, 4.3, 3.9, 4.0, 4.1, 4.5, 4.2, 3.8, 3.9, 4.3, 4.1, 4.0, 4.4, 4.2, 4.1, 4.6, 4.4, 4.2, 4.1, 4.3, 4.0, 4.4, 4.3, 3.8, 4.1, 4.5, 
                4.2, 4.3, 4.0, 4.1, 4.2, 3.9, 4.3, 3.7, 4.1, 4.5, 4.2, 4.0, 4.2, 4.4, 4.1, 4.5]

# hypothesized population mean (kg)
pop_mean = 4
# significance level
alpha = 0.05

# mean of the sample (sample size 50)
samp_cat_weight_mean = np.array(Cats_weights).mean()

print("The sample mean is ", round(samp_cat_weight_mean, 1))

# Two-sided one-sample Z-test; ztest estimates the standard deviation from the sample
ztest_Score, p_value = ztest(Cats_weights, value=pop_mean, alternative='two-sided')
# ztest returns the z-score and the corresponding p-value. If the p-value is
# less than alpha we reject the null hypothesis; otherwise we fail to reject it.

print("z testscore", round(ztest_Score, 4), "P Value ", round(p_value, 4))

if p_value < alpha:
  print("Reject Null Hypothesis")
else:
  print("Fail to Reject Null Hypothesis")


The sample mean is  4.2
z testscore 5.2336 P Value  0.0
Reject Null Hypothesis


<ul><li><b>State the null and alternative hypotheses.</b></li></ul> <br>
Null Hypothesis: H<sub>0</sub>: The mean weight of the population of cats is equal to 4 kg <br />
Alternative Hypothesis: H<sub>a</sub>: The mean weight of the population of cats is not equal to 4 kg



<h2>Choose a significance level (α) (the probability of rejecting the null hypothesis
when it is actually true).</h2>

Significance level (alpha) = 0.05


<h2>Calculate the z-score using the formula:</h2>
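A standard form of the one-sample z statistic, consistent with the calculation in the code cells of this section, is given here. The symbols are notation introduced for illustration: $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean (4 kg), $\sigma$ is the population standard deviation (which this exercise assumes to be equal to the sample mean), and $n$ is the sample size.

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
$$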


In [10]:
sample_mean = sum(Cats_weights) / len(Cats_weights)  # sample mean
pop_mean = 4  # hypothesized population mean
# The exercise assumes the population standard deviation equals the sample mean,
# so this quantity is the standard error sigma / sqrt(n), not sigma itself
std_error = sample_mean / math.sqrt(len(Cats_weights))

z_score = (sample_mean - pop_mean) / std_error  # calculating z score with the formula

print("Sample Mean:", round(sample_mean, 3))
print("Standard Error:", round(std_error, 4))
print("The z-score is:", round(z_score, 4))

Sample Mean: 4.17
Standard Error: 0.5897
The z-score is: 0.2883


<h2>Assuming that the standard deviation is equal to the sample mean</h2>

In [11]:
pop_std = sample_mean # Assuming that the standard deviation is equal to the sample mean
# Sample size
n = len(Cats_weights)
# Calculate the z-score
z_score1 = (sample_mean - pop_mean)/(pop_std/math.sqrt(n))

print("Sample Mean:",round(sample_mean,3))
print("Population Standard Deviation:",round(pop_std,4))
print("The z-score is:", round(z_score1,4))


Sample Mean: 4.17
Population Standard Deviation: 4.17
The z-score is: 0.2883


<h2>Look up the critical z-value at the chosen significance level (α) using a z-table.</h2>

For a two-tailed test at alpha = 0.05, the critical z-values from the z-table are +/- 1.96


<h2>Compare the calculated z-score to the critical z-values. If the calculated z-score
falls outside the range between the critical z-values, we reject the null hypothesis
in favor of the alternative hypothesis</h2>

In [13]:
print(z_score)
if z_score > 1.96 or z_score < -1.96:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


0.28826895156285953
Fail to reject the null hypothesis
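This result differs from the earlier `ztest` call (z = 5.2336) because of the standard deviation used. Assuming σ equals the sample mean gives a far larger spread than the data show. The sketch below, with illustrative variable names and the standard library only, puts the two calculations side by side.

```python
import math
from statistics import mean, stdev

Cats_weights = [3.9, 4.2, 4.5, 4.1, 4.3, 3.8, 4.6, 4.2, 3.7, 4.3, 3.9, 4.0, 4.1,
                4.5, 4.2, 3.8, 3.9, 4.3, 4.1, 4.0, 4.4, 4.2, 4.1, 4.6, 4.4, 4.2,
                4.1, 4.3, 4.0, 4.4, 4.3, 3.8, 4.1, 4.5, 4.2, 4.3, 4.0, 4.1, 4.2,
                3.9, 4.3, 3.7, 4.1, 4.5, 4.2, 4.0, 4.2, 4.4, 4.1, 4.5]

pop_mean = 4
n = len(Cats_weights)
sample_mean = mean(Cats_weights)

# Exercise assumption: population standard deviation equals the sample mean
z_assumed = (sample_mean - pop_mean) / (sample_mean / math.sqrt(n))

# Alternative: estimate the standard deviation from the sample (n - 1 denominator)
z_sample = (sample_mean - pop_mean) / (stdev(Cats_weights) / math.sqrt(n))

print("z with sigma = sample mean:", round(z_assumed, 4))
print("z with sample standard deviation:", round(z_sample, 4))
```

With the sample standard deviation the z-score falls well outside ±1.96, so the conclusion depends on which standard deviation is taken as given.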
