# Inferential Statistics

Inferential statistics use your sample to make reasonable guesses about the larger population.



Inferential statistics relies on various statistical techniques, such as
- hypothesis testing,
- confidence intervals, and
- regression analysis.

These techniques enable researchers to assess the reliability and significance of their findings and make informed decisions based on the data.

**It is important to note that inferential statistics involves uncertainty,** as the sample data may not perfectly reflect the population characteristics. However, by using appropriate statistical methods and ensuring the sample is representative, researchers can make valid inferences and draw meaningful conclusions about the population of interest.


`Inferential statistics have two main uses:`

**1. making estimates about populations (for example, the mean of math scores in a class).**



**2. testing hypotheses to draw conclusions about populations (for example, the relationship between SAT scores and family income).**



![j.png](attachment:j.png)






# Common terminology 

- **Hypothesis Testing:** `A statistical method used to test a hypothesis about a population parameter based on sample data.`


________________________________________________



- **Null hypothesis:** `The default assumption that there is no effect or no significant difference between the groups being studied.`

________________________________________________________________________

- **Alternative hypothesis:** `The claim that there is an effect or a difference between the groups being studied.`

_________________________________________



_____________________________________________________________

- **Confidence Interval:** a range of values within which the true population parameter is likely to lie.
`The range of values is derived from the data and is believed to contain the true value of the parameter being estimated.` For example, a 95% confidence interval means that if we repeated the experiment or sampling process many times and built an interval each time, about 95% of those intervals would contain the true parameter.
________________________________________________________________________

- **p-value:** `The probability of obtaining a test statistic as extreme as, or more extreme than, the observed value, assuming the null hypothesis is true.`

___________________________________________________________________________

- **Type I error:** `False positive: mistakenly rejecting the null hypothesis when it is true.`

_____________________________________________________________________________

- **Type II error:** `False negative: failing to reject the null hypothesis when it is false.`

_________________________________________________________________________________

- **Significance level (alpha):** `The probability of making a Type I error that we are willing to accept, chosen before running the test (commonly 0.05).`
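
To make the confidence interval and p-value definitions concrete, here is a minimal sketch that assumes a hypothetical sample of 40 math scores (the data, the seed, and the variable names are illustrative and not part of the course datasets). It builds a 95% confidence interval for the mean and runs a one-sample t-test against a hypothesized mean of 70.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 40 math scores (illustrative data only)
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=40)

sample_mean = scores.mean()
standard_error = stats.sem(scores)   # s / sqrt(n), using the sample standard deviation
dof = len(scores) - 1                # degrees of freedom for the t distribution

# 95% confidence interval for the population mean, based on the t distribution
ci_low, ci_high = stats.t.interval(0.95, dof, loc=sample_mean, scale=standard_error)

# One-sample t-test of H0: the population mean equals 70
t_stat, p_value = stats.ttest_1samp(scores, popmean=70)

print(f"mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The two results agree by construction: the two-sided test rejects H0 at alpha = 0.05 exactly when 70 falls outside the 95% interval.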




#  Hypothesis testing
Hypothesis testing is a formal process of statistical analysis using inferential statistics. The goal of hypothesis testing is to compare populations or assess relationships between variables using samples.

Hypotheses, or predictions, are tested using statistical tests. Statistical tests also estimate sampling errors so that valid inferences can be made.
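
One way to see what the significance level means in practice is to simulate many experiments in which the null hypothesis is true and count how often a test rejects it anyway. The sketch below assumes hypothetical normally distributed groups with identical means; the group sizes, seed, and number of repetitions are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 2000

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same population, so H0 is true by construction
    group_a = rng.normal(loc=50, scale=10, size=30)
    group_b = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < alpha:
        false_positives += 1   # a Type I error: rejecting a true null

type_i_rate = false_positives / n_experiments
print(f"Share of true nulls rejected: {type_i_rate:.3f}")
```

The rejection rate should land close to alpha, which is the Type I error rate the test is designed to have.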

Statistical tests can be `parametric or non-parametric.` Parametric tests are generally more statistically powerful, meaning they are more likely to detect an effect if one exists, provided their assumptions hold.

**Parametric tests make assumptions that include the following:**

•	the population that the sample comes from follows a normal distribution of scores

•	the sample size is large enough to represent the population

•	the variances, a measure of spread, of each group being compared are similar

When your data violate any of these assumptions, **non-parametric tests** are more suitable. Non-parametric tests are called `distribution-free tests` because they don't assume a specific distribution for the population data.
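
The choice between the two families can be sketched in code. The example below assumes two hypothetical groups (one roughly normal, one right-skewed) and uses the Shapiro-Wilk test for normality and Levene's test for equal variances as rough screening tools. The decision rule is a simplification for illustration; in practice, plots and sample size also matter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical groups: one roughly normal, one right-skewed
group_a = rng.normal(loc=10, scale=2, size=50)
group_b = rng.exponential(scale=10, size=50)

alpha = 0.05

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
normal_a = stats.shapiro(group_a).pvalue >= alpha
normal_b = stats.shapiro(group_b).pvalue >= alpha

# Levene: H0 is that the groups have equal variances
equal_var = stats.levene(group_a, group_b).pvalue >= alpha

if normal_a and normal_b and equal_var:
    test_name = "independent samples t-test"
    result = stats.ttest_ind(group_a, group_b)
else:
    # Rank-based alternative that does not assume normality
    test_name = "Mann-Whitney U test"
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"{test_name}: p = {result.pvalue:.4f}")
```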

**Statistical tests come in three forms:**

**a.** `Tests of comparison`

**b.** `Correlation `

**c.** `Regression.`


# a. Comparison tests
`Comparison tests` assess whether there are differences in means, medians or rankings of scores of two or more groups.



**To decide which test suits your aim,** consider whether your data meets the conditions necessary for parametric tests, the number of samples, and the levels of measurement of your variables.

Means can only be found for `interval or ratio data,` while medians and rankings are more appropriate measures for `ordinal data.`

![Screenshot%20%28270%29.png](attachment:Screenshot%20%28270%29.png)

# ANOVA
ANOVA (Analysis of Variance) is a statistical test used to determine whether there are any significant differences between the means of two or more groups. It compares the mean of a continuous dependent variable across the levels of one or more categorical independent variables.

- `The null hypothesis in ANOVA assumes that there is no significant difference between the means of the groups,` while 

- `the alternative hypothesis suggests that there is at least one group with a different mean.`



ANOVA tests can be classified into one-way ANOVA and two-way ANOVA based on the number of factors or independent variables being considered.

### One-way ANOVA:
One-way ANOVA is used when there is only one factor or independent variable being studied. It analyzes the differences between the means of two or more groups. 

`The null hypothesis assumes that all group means are equal, while the alternative hypothesis suggests that at least one group mean is different.`
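
To show where the F-statistic comes from, the sketch below computes it by hand for three small hypothetical groups and compares the result with `scipy.stats.f_oneway`. The numbers are made up for illustration; the idea is that F is the ratio of between-group variability to within-group variability.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups
groups = [
    np.array([4.1, 3.8, 4.6, 5.0, 3.9]),
    np.array([5.9, 6.3, 5.5, 6.8, 6.1]),
    np.array([7.7, 8.4, 8.0, 7.2, 8.9]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n - k)   # upper tail of the F distribution

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.3f}, p = {p_manual:.3g}")
print(f"scipy:  F = {f_scipy:.3f}, p = {p_scipy:.3g}")
```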



# Task on One-way ANOVA Test
#### Problem Statement:
We want to determine whether the amount of sunlight exposure differs between plants in different health categories. We have three categories of plant health (Poor, Fair, Good) and measured the amount of sunlight exposure (in hours) for each plant.

#### Null Hypothesis (H0):
The null hypothesis states that there is no significant difference in sunlight exposure between the plant health categories. In other words, the mean sunlight exposure is the same for the Poor, Fair, and Good groups.

#### Alternative Hypothesis (H1):
The alternative hypothesis states that there is a significant difference in sunlight exposure between the plant health categories. In other words, at least one group has a different mean sunlight exposure compared to the others.

In [4]:
import numpy as np
import pandas as pd



In [3]:
path = "C:/Users/pc/Documents/Lasop/statistic/part two/part two -/"
data= pd.read_csv(path + "one_way.csv")
data

Unnamed: 0,Plant_Health,Sunlight_Exposure
0,Poor,4.496714
1,Poor,3.861736
2,Poor,4.647689
3,Poor,5.523030
4,Poor,3.765847
...,...,...
85,Good,7.498243
86,Good,8.915402
87,Good,8.328751
88,Good,7.470240


In [7]:
data[data["Plant_Health"]=="Poor"]["Sunlight_Exposure"]

0     4.496714
1     3.861736
2     4.647689
3     5.523030
4     3.765847
5     3.765863
6     5.579213
7     4.767435
8     3.530526
9     4.542560
10    3.536582
11    3.534270
12    4.241962
13    2.086720
14    2.275082
15    3.437712
16    2.987169
17    4.314247
18    3.091976
19    2.587696
20    5.465649
21    3.774224
22    4.067528
23    2.575252
24    3.455617
25    4.110923
26    2.849006
27    4.375698
28    3.399361
29    3.708306
Name: Sunlight_Exposure, dtype: float64

In [10]:
import scipy.stats as stats
# Perform one-way ANOVA test
f_statistic, p_value = stats.f_oneway(
    data[data['Plant_Health'] == 'Poor']['Sunlight_Exposure'],
    data[data['Plant_Health'] == 'Fair']['Sunlight_Exposure'],
    data[data['Plant_Health'] == 'Good']['Sunlight_Exposure']
)


In [11]:
print(p_value)

7.54969223483566e-29


In [12]:
# Set significance level
alpha = 0.05

# Interpret the results
print("One-Way ANOVA Results:")
print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.4f}")

if p_value < alpha:
    print("Conclusion: There is a significant difference in plant health categories based on the amount of sunlight exposure.")
    print("reject the null hypothesis")
else:
    print("Conclusion: There is no significant difference in plant health categories based on the amount of sunlight exposure.")
    print("Fail to reject null hypothesis")

One-Way ANOVA Results:
F-statistic: 149.24
P-value: 0.0000
Conclusion: There is a significant difference in plant health categories based on the amount of sunlight exposure.
reject the null hypothesis
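
The output above shows `P-value: 0.0000` because the `.4f` format rounds a very small number to zero; the p-value itself is not zero. A small sketch of two alternative ways to report it, using the value printed earlier in this notebook:

```python
p_value = 7.54969223483566e-29  # value printed by the ANOVA cell above

print(f"fixed-point: {p_value:.4f}")   # rounds to zero
print(f"scientific:  {p_value:.3e}")   # keeps the order of magnitude

# A common reporting convention for very small p-values
report = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
print(report)
```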


# Task on Two-way ANOVA Test


**Problem Statement:**
We want to determine if both the type of fertilizer (Factor A: A, B, C, D) and the amount of water (Factor B: Low, Medium, High) have a significant impact on plant growth (the dependent variable). We have collected data for each combination of fertilizer and water levels.

**Null Hypotheses (H0):**
1. Null Hypothesis for Factor A (Fertilizer): There is no significant difference in plant growth among the different types of fertilizers.
2. Null Hypothesis for Factor B (Water): There is no significant difference in plant growth among the different levels of water.
3. Null Hypothesis for Interaction: There is no interaction effect between Factor A (Fertilizer) and Factor B (Water) on plant growth.

**Alternative Hypotheses (H1):**
1. Alternative Hypothesis for Factor A (Fertilizer): There is a significant difference in plant growth among the different types of fertilizers.
2. Alternative Hypothesis for Factor B (Water): There is a significant difference in plant growth among the different levels of water.
3. Alternative Hypothesis for Interaction: There is an interaction effect between Factor A (Fertilizer) and Factor B (Water) on plant growth.


In [13]:
df = pd.read_csv( path + "two_way.csv")
df

Unnamed: 0,Fertilizer,Water,Plant_Growth
0,A,Low,22.483571
1,A,Low,19.308678
2,A,Low,23.238443
3,A,Low,27.615149
4,A,Low,18.829233
...,...,...,...
355,D,High,14.987353
356,D,High,19.907434
357,D,High,18.556707
358,D,High,21.613593


In [15]:
import statsmodels.api as sm
from statsmodels.formula.api import ols


In [16]:
# Perform two-way ANOVA test
model = ols('Plant_Growth ~ C(Fertilizer) + C(Water) + C(Fertilizer):C(Water)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Set significance level
alpha = 0.05

# Print the ANOVA table
print("Two-Way ANOVA Results:")
print(anova_table)



Two-Way ANOVA Results:
                             sum_sq     df         F    PR(>F)
C(Fertilizer)             37.179603    3.0  0.547687  0.650007
C(Water)                   3.279334    2.0  0.072461  0.930116
C(Fertilizer):C(Water)   183.154887    6.0  1.349012  0.234667
Residual                7874.639485  348.0       NaN       NaN


In [17]:
# Interpret the results
p_value_A = anova_table['PR(>F)']['C(Fertilizer)']
p_value_B = anova_table['PR(>F)']['C(Water)']
p_value_interaction = anova_table['PR(>F)']['C(Fertilizer):C(Water)']


In [18]:
if p_value_A < alpha:
    print("Factor A (Fertilizer) has a significant effect on plant growth.")
else:
    print("Factor A (Fertilizer) does not have a significant effect on plant growth.")


Factor A (Fertilizer) does not have a significant effect on plant growth.


In [19]:
if p_value_B < alpha:
    print("Factor B (Water) has a significant effect on plant growth.")
else:
    print("Factor B (Water) does not have a significant effect on plant growth.")


Factor B (Water) does not have a significant effect on plant growth.


In [20]:
if p_value_interaction < alpha:
    print("There is a significant interaction effect between Factor A and Factor B.")
else:
    print("There is no significant interaction effect between Factor A and Factor B.")


There is no significant interaction effect between Factor A and Factor B.


# t-test
The t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. It compares the means of the two groups and assesses whether the observed difference is larger than what would be expected by chance.

There are different types of t-tests depending on the characteristics of the data and the research question:

- Independent Samples t-test:
The independent samples t-test is used when the two groups being compared are independent of each other. It compares the means of two separate groups to determine if they are significantly different from each other.

- Paired Samples t-test:
The paired samples t-test, also known as the dependent samples t-test, is used when the two groups being compared are related or matched in some way. It compares the means of the same group under two different conditions or at two different time points.
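
The two variants map onto different SciPy functions. The sketch below assumes hypothetical data: the same 25 students measured before and after a course (paired), and two unrelated groups (independent). It also shows Welch's variant, which is not discussed above but is a common choice when the two groups have unequal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical paired data: the same 25 students tested before and after a course
before = rng.normal(loc=60, scale=8, size=25)
after = before + rng.normal(loc=3, scale=4, size=25)

# Paired samples t-test: works on the per-student differences
t_paired, p_paired = stats.ttest_rel(before, after)

# Hypothetical independent groups with unequal spread
group_a = rng.normal(loc=60, scale=5, size=30)
group_b = rng.normal(loc=62, scale=15, size=30)

# Independent samples t-test (assumes equal variances by default)
t_ind, p_ind = stats.ttest_ind(group_a, group_b)
# Welch's variant drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"paired:      t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"Welch:       t = {t_welch:.2f}, p = {p_welch:.4f}")
```

The paired test is equivalent to a one-sample t-test on the differences `before - after` against zero, which is why it requires the two columns to be matched row by row.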

# Task on t-test


**Problem Statement:**
We want to determine if there is a significant difference in test scores between two groups of students, Group A and Group B.

**Null Hypothesis (H0):**
The null hypothesis states that there is no significant difference in test scores between Group A and Group B. In other words, the mean test score for Group A is equal to the mean test score for Group B.

**Alternative Hypothesis (H1):**
The alternative hypothesis states that there is a significant difference in test scores between Group A and Group B. In other words, the mean test score for Group A is not equal to the mean test score for Group B.


**Interpretation of Results:**

- If the p-value is less than the chosen significance level (alpha), we reject the null hypothesis.

- If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis.


In [22]:
import scipy.stats as stats

data = pd.read_csv( path + "ttest.csv")
# Perform an independent two-sample t-test
t_statistic, p_value = stats.ttest_ind(data["group_A_scores"], data["group_B_scores"])

# Set significance level
alpha = 0.05



In [23]:
# Interpret the results
print("Independent Two-Sample T-Test Results:")
print(f"T-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.4f}")



Independent Two-Sample T-Test Results:
T-statistic: -0.28
P-value: 0.7779


In [24]:
if p_value < alpha:
    print("Conclusion: There is a significant difference in test scores between Group A and Group B.")
else:
    print("Conclusion: There is no significant difference in test scores between Group A and Group B.")


Conclusion: There is no significant difference in test scores between Group A and Group B.


# b. Correlation tests

`Correlation tests determine the extent to which two variables are associated.`


![Screenshot%20%28271%29.png](attachment:Screenshot%20%28271%29.png)
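
As a brief illustration of a correlation test on numeric data, the sketch below assumes hypothetical hours-studied and exam-score values (generated here, not taken from the course files). It computes Pearson's r, which measures linear association, and Spearman's rho, which is rank-based.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical data: hours studied and exam score for 50 students
hours = rng.uniform(0, 10, size=50)
score = 50 + 4 * hours + rng.normal(scale=5, size=50)

# Pearson's r: linear association between two interval or ratio variables
r, p_pearson = stats.pearsonr(hours, score)
# Spearman's rho: rank-based, suitable for ordinal or non-normal data
rho, p_spearman = stats.spearmanr(hours, score)

print(f"Pearson r = {r:.3f} (p = {p_pearson:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {p_spearman:.3g})")
```

In both cases the null hypothesis is that there is no association between the two variables.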

# Chi-square test:

The chi-square test is a statistical test used to determine whether there is a significant association or dependency between two categorical variables. It compares the observed frequencies in a contingency table to the expected frequencies under the assumption of independence.

The test is based on the chi-square statistic, which measures the discrepancy between the observed and expected frequencies. 


`The null hypothesis assumes that there is no association between the variables, while the alternative hypothesis suggests that there is a significant association.`
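
The expected frequencies and the chi-square statistic can be computed by hand, which makes the comparison explicit. The sketch below assumes a hypothetical 2x3 contingency table and checks the manual result against `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 contingency table (rows: two groups, columns: three categories)
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)
print(f"manual: chi2 = {chi2_manual:.4f}, p = {p_manual:.4f}, dof = {dof}")
print(f"scipy:  chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}, dof = {dof_scipy}")
```

Note that `chi2_contingency` applies a continuity correction by default only for 2x2 tables (one degree of freedom), so the manual and library results match here.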




### Problem Statement:
We want to investigate if there's an association between the gender of individuals and their preference for types of food (e.g., pizza, burger, salad). 

### Null Hypothesis (H0):
There is no significant association between gender and food preference among the group of individuals. In other words, the distribution of food preferences is the same for both genders.

### Alternative Hypothesis (H1):
There is a significant association between gender and food preference among the group of individuals. The distribution of food preferences differs significantly between genders.




In [25]:
import pandas as pd
from scipy.stats import chi2_contingency

# Example data
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Food_Preference': ['Pizza', 'Pizza', 'Burger', 'Burger', 'Salad', 'Salad', 'Pizza', 'Burger']
}

# Creating a DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,Gender,Food_Preference
0,Male,Pizza
1,Female,Pizza
2,Male,Burger
3,Female,Burger
4,Male,Salad
5,Female,Salad
6,Male,Pizza
7,Female,Burger


In [26]:
# Creating a contingency table
contingency_table = pd.crosstab(df['Gender'], df['Food_Preference'])
contingency_table

Food_Preference,Burger,Pizza,Salad
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,2,1,1
Male,1,2,1


In [27]:
# Performing the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

Chi-Square Statistic: 0.6666666666666666
P-Value: 0.7165313105737892


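This example sets up data for the food preferences and genders of eight individuals. The chi-square test is applied to the contingency table created from this data to assess if there's a significant association between gender and food preference among these individuals. With a p-value of about 0.72, well above 0.05, we fail to reject the null hypothesis. Keep in mind that with only eight individuals the expected counts are far below 5, so the chi-square approximation is unreliable here; the example only illustrates the mechanics.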

A chi-square test for goodness-of-fit, on the other hand, compares the observed frequencies of one categorical variable against the frequencies you would expect under a specific distribution.

In the following scenario, we perform a chi-square goodness-of-fit test to explore whether the observed food preferences match an expected distribution in which individuals have an equal preference for pizza, burger, and salad.

### Problem Statement:
We want to investigate whether the observed food preferences among a group of individuals match an expected distribution, assuming an equal preference for pizza, burger, and salad.

### Null Hypothesis (H0):
The observed food preferences among the group of individuals match the expected distribution of an equal preference for pizza, burger, and salad.

### Alternative Hypothesis (H1):
The observed food preferences among the group of individuals do not match the expected distribution of an equal preference for pizza, burger, and salad.




In [32]:
import pandas as pd
from scipy.stats import chisquare

# Example data: observed frequencies for six categories
# (the problem statement names three foods, but these lists hold six counts)
observed = [16, 18, 16, 14, 12, 12]

# Expected frequencies under the null hypothesis; they must sum to the same total as observed
# (these are not all equal: the last category is expected half as often as the others)
expected = [16, 16, 16, 16, 16, 8]
# Performing the chi-square goodness-of-fit test
chi2, p = chisquare(observed, f_exp=expected)

print("Chi-Square Statistic:", chi2)
print("P-Value:", p)


Chi-Square Statistic: 3.5
P-Value: 0.6233876277495822
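
For the three-food scenario described in the problem statement, a minimal sketch could look like the following. The counts are hypothetical (48 respondents), and leaving out `f_exp` makes `chisquare` assume that all categories are equally likely.

```python
from scipy.stats import chisquare

# Hypothetical counts for Pizza, Burger, Salad among 48 respondents
observed = [16, 18, 14]

# With f_exp omitted, chisquare tests against equal expected frequencies (16 each here)
chi2_stat, p = chisquare(observed)

print("Chi-Square Statistic:", chi2_stat)
print("P-Value:", p)

alpha = 0.05
if p < alpha:
    print("Reject H0: preferences are not equally distributed.")
else:
    print("Fail to reject H0: the data are consistent with equal preference.")
```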


# c. Regression tests
`Regression tests assess whether changes in predictor variables are associated with changes in an outcome variable.` On their own they do not prove causation. You can decide which regression test to use based on the number and types of variables you have as predictors and outcomes.

Most of the commonly used regression tests are parametric. `If your data is not normally distributed, you can perform data transformations`.

Data transformations help you make your data normally distributed using mathematical operations, like taking the square root of each value.
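
A short sketch of this idea, assuming a hypothetical right-skewed variable generated from a log-normal distribution. It compares the skewness of the raw values with the skewness after a square-root and a log transformation (values near zero indicate a more symmetric distribution).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical right-skewed variable (for example, incomes)
raw = rng.lognormal(mean=3.0, sigma=0.8, size=500)

sqrt_transformed = np.sqrt(raw)
log_transformed = np.log(raw)   # requires strictly positive values

for name, values in [("raw", raw), ("sqrt", sqrt_transformed), ("log", log_transformed)]:
    print(f"{name:>4}: skewness = {stats.skew(values):.2f}")
```

Which transformation works best depends on the data; the log transform is a natural fit here only because the example data were generated as log-normal.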

![Screenshot%20%28272%29.png](attachment:Screenshot%20%28272%29.png)

### Regression Test:
To test the relationship between the independent variables and the dependent variable, you can perform a regression analysis. This involves fitting a regression model (linear regression, in this case) to the data and examining the coefficients of the independent variables.


The output will provide information about the coefficients, their significance, and the overall model fit. This analysis helps to determine if the independent variables (x1, x2, x3) have a significant impact on the target variable (Target).
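
To show what "examining the coefficients" involves, the sketch below fits a linear regression by least squares with NumPy and computes each coefficient's standard error, t-value, and p-value by hand. The data are hypothetical and generated inside the block; this is an illustration of the standard OLS formulas, not a replacement for a full regression summary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 100
# Hypothetical predictors and target (true coefficients 5, 3, -2 plus noise)
X = rng.random((n, 3))
y = 5 * X[:, 0] + 3 * X[:, 1] - 2 * X[:, 2] + rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept
design = np.column_stack([np.ones(n), X])

# Least-squares coefficient estimates
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ beta
dof = n - design.shape[1]                 # residual degrees of freedom
sigma2 = residuals @ residuals / dof      # residual variance estimate

# Standard errors from the diagonal of sigma^2 * (X'X)^-1
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))

# t-test of H0: coefficient == 0, for each coefficient
t_values = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), dof)

for name, b, se, p in zip(["const", "x1", "x2", "x3"], beta, std_err, p_values):
    print(f"{name:>5}: coef = {b:6.3f}, std err = {se:.3f}, p = {p:.4g}")
```

A small p-value for a coefficient suggests that the corresponding predictor is associated with the target after accounting for the other predictors.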


### Problem Statement:
We want to investigate the relationship between multiple independent variables (x1, x2, x3) and a dependent variable (Target).

### Null Hypothesis (H0):
There is no significant relationship between the independent variables (x1, x2, x3) and the dependent variable (Target). In other words, the coefficients of the independent variables in the regression model are all equal to zero.

### Alternative Hypothesis (H1):
There is a significant relationship between at least one of the independent variables (x1, x2, x3) and the dependent variable (Target). In other words, the coefficients of the independent variables in the regression model are not all equal to zero.



In [33]:
import numpy as np
import pandas as pd

# Generating predictor variables (features)
np.random.seed(42)
num_samples = 100

# Three predictor variables
x1 = np.random.rand(num_samples)  # Random values for x1
x2 = np.random.rand(num_samples) * 3  # Random values for x2 (with more spread)
x3 = np.random.rand(num_samples) - 0.5  # Random values for x3 (centered around 0)

# Generating a target variable (dependent variable) as a linear combination of predictors with some noise
# Target variable y = 5*x1 + 3*x2 - 2*x3 + noise
target = 5 * x1 + 3 * x2 - 2 * x3 + np.random.randn(num_samples)

# Creating a DataFrame to organize the data
data = pd.DataFrame({'X1': x1, 'X2': x2, 'X3': x3, 'Target': target})
display(data.head())


Unnamed: 0,X1,X2,X3,Target
0,0.37454,0.094288,0.142032,1.917072
1,0.950714,1.909231,-0.41586,10.661385
2,0.731994,0.943068,-0.338371,9.30986
3,0.598658,1.525712,0.398554,7.407239
4,0.156019,2.722699,0.106429,6.710191


In [36]:

import statsmodels.api as sm

# Separate the predictors (independent variables) and the target variable
X = data[['X1', 'X2', 'X3']]
y = data['Target']

In [37]:
X

Unnamed: 0,X1,X2,X3
0,0.374540,0.094288,0.142032
1,0.950714,1.909231,-0.415860
2,0.731994,0.943068,-0.338371
3,0.598658,1.525712,0.398554
4,0.156019,2.722699,0.106429
...,...,...,...
95,0.493796,1.047629,0.022243
96,0.522733,2.177867,0.269994
97,0.427541,2.691331,-0.284179
98,0.025419,2.661259,0.122890


In [38]:
# Add a constant term to the predictors for the intercept
X = sm.add_constant(X)
X

Unnamed: 0,const,X1,X2,X3
0,1.0,0.374540,0.094288,0.142032
1,1.0,0.950714,1.909231,-0.415860
2,1.0,0.731994,0.943068,-0.338371
3,1.0,0.598658,1.525712,0.398554
4,1.0,0.156019,2.722699,0.106429
...,...,...,...,...
95,1.0,0.493796,1.047629,0.022243
96,1.0,0.522733,2.177867,0.269994
97,1.0,0.427541,2.691331,-0.284179
98,1.0,0.025419,2.661259,0.122890


In [39]:
# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the summary of the regression results
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                 Target   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.914
Method:                 Least Squares   F-statistic:                     350.6
Date:                Tue, 14 Nov 2023   Prob (F-statistic):           1.43e-51
Time:                        16:48:32   Log-Likelihood:                -138.17
No. Observations:                 100   AIC:                             284.3
Df Residuals:                      96   BIC:                             294.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2354      0.256     -0.919      0.3


This code performs a multiple linear regression using the statsmodels library in Python. The summary 