In [1]:
import numpy       as np
import pandas      as pd
import scipy.stats as stats

# Chi Square

A chi-square distribution with $k$ degrees of freedom is the distribution of the sum of squares of $k$ independent standard normal random variables $Z_1$, $Z_2$, ... $Z_k$. Each $Z_i$ is obtained by standardizing a normal random variable $X_i$ with mean $\mu_i$ and standard deviation $\sigma_i$, that is, $Z_i = \frac{X_i - \mu_i}{\sigma_i}$.

${\chi_k}^2$ = ${Z_1}^2$ + ${Z_2}^2$ + … + ${Z_k}^2$ 
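The definition can be checked by simulation. The sketch below is illustrative only: the means, standard deviations, sample size and seed are made-up values, not taken from any data set in this notebook. It standardizes normal draws, sums their squares, and compares the sample mean and variance with the theoretical values $k$ and $2k$.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so the run is repeatable

k = 4                                        # hypothetical degrees of freedom
mu = np.array([10.0, -3.0, 0.5, 100.0])      # hypothetical means
sigma = np.array([2.0, 0.5, 1.0, 15.0])      # hypothetical standard deviations

# Draw X_i ~ Normal(mu_i, sigma_i), one column per variable
x = rng.normal(loc=mu, scale=sigma, size=(100_000, k))

z = (x - mu) / sigma                         # standardize each column
chi_sq_samples = (z ** 2).sum(axis=1)        # sum of k squared standard normals

print('sample mean     %.3f (theory %d)' % (chi_sq_samples.mean(), k))
print('sample variance %.3f (theory %d)' % (chi_sq_samples.var(), 2 * k))
```

The original means and standard deviations drop out after standardization, which is why only $k$ appears in the result.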


The probability density function is

$f(x) = \frac{x^{\frac{k}{2}-1}e^{-\frac{x}{2}}}{2^{\frac{k}{2}} \Gamma {\bigg(\frac{k}{2}\bigg)}}$ if $x > 0$, else 0

where $\Gamma\big(\frac{k}{2}\big)$ is the gamma function evaluated at $\frac{k}{2}$:

$\Gamma\bigg(\frac{k}{2}\bigg) = \int_0^\infty t^{\frac{k}{2}-1} e^{-t} dt$
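As a quick sanity check, the density can be evaluated directly and compared with `scipy.stats.chi2.pdf`. This is a minimal sketch: the helper name `chi2_pdf_manual`, the degrees of freedom and the evaluation points are all illustrative choices, and `scipy.special.gamma` supplies the gamma function.

```python
import numpy as np
from scipy import special, stats

def chi2_pdf_manual(x, k):
    """Evaluate the chi-square density formula directly (intended for x > 0)."""
    x = np.asarray(x, dtype=float)
    numerator = x ** (k / 2 - 1) * np.exp(-x / 2)
    denominator = 2 ** (k / 2) * special.gamma(k / 2)
    return np.where(x > 0, numerator / denominator, 0.0)

k = 4                                    # hypothetical degrees of freedom
x = np.array([0.5, 1.0, 2.0, 5.0])       # hypothetical evaluation points

manual = chi2_pdf_manual(x, k)
library = stats.chi2.pdf(x, df=k)

print('manual  ', np.round(manual, 5))
print('library ', np.round(library, 5))
print('match   ', np.allclose(manual, library))
```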

### Properties of Chi Square distribution

##### 1. The mean and standard deviation of a chi-square distribution are $k$ and $\sqrt{2k}$ respectively, where $k$ is the degrees of freedom.
##### 2. As the degrees of freedom increase, the probability density function of a chi-square distribution approaches that of a normal distribution.
##### 3. Chi-square goodness of fit is one of the popular tests for checking whether data follow a specific probability distribution.
##### 4. The chi-square test is a right-tailed test: large values of the statistic are evidence against the null hypothesis.
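These properties can be inspected with `scipy.stats.chi2`. The sketch below uses arbitrarily chosen degrees of freedom. It prints the mean, standard deviation and skewness (which shrinks toward 0, the skewness of a normal distribution, as $k$ grows) and then a right-tail critical value.

```python
import numpy as np
import scipy.stats as stats

for k in (2, 10, 100):                   # hypothetical degrees of freedom
    mean, var, skew = stats.chi2.stats(k, moments='mvs')
    print('k=%3d  mean=%6.1f  sd=%6.3f  skewness=%.3f'
          % (k, float(mean), np.sqrt(float(var)), float(skew)))

# Right-tailed test: the critical value leaves 5% of the area to its right
critical_value = stats.chi2.ppf(0.95, df=3)
print('critical value (alpha=0.05, df=3): %.3f' % critical_value)
```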

### Chi-square Goodness of fit tests

Goodness of fit tests are hypothesis tests that compare the observed distribution of the data with a theoretical distribution (for example, normal, exponential, etc.). The decision is based on comparing the observed frequencies in the data with the frequencies expected if the data followed the specified theoretical distribution.

| Hypothesis | Description                                                           |
| ----------- | -------------------------------------------------------------------- |
| Null hypothesis | There is no statistically significant difference between the observed frequencies and the expected frequencies from a hypothesized distribution |
| Alternative hypothesis | There is statistically significant difference between the observed frequencies and the expected frequencies from a hypothesized distribution |


The chi-square statistic for goodness of fit is given by

$\chi^2$ = $\sum_{i=1}^{n}\frac{({O_{i}-E_{i}})^2}{E_{i}}$

where $O_i$ and $E_i$ are the observed and expected frequencies in category $i$, and $n$ is the number of categories.

The test is unreliable when the observed or expected frequencies in any category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5.


### Chi-square tests of independence

The chi-square test of independence is a hypothesis test of whether two categorical variables (the groups that define the rows and columns of a contingency table) are statistically independent.

| Hypothesis | Description |
| --------------------- | ----------------------- |
| Null Hypothesis | Two or more groups are independent |
| Alternative Hypothesis | Two or more groups are dependent |

$\chi^2$ = $\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{({O_{ij}-E_{ij}})^2}{E_{ij}}$

The corresponding degrees of freedom is $(r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns in the contingency table.

`scipy.stats.chi2_contingency` performs the chi-square test of independence of variables in a contingency table.

This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table `observed`. The expected frequencies are computed based on the marginal sums under the assumption of independence.
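To make "computed based on the marginal sums" concrete, here is a small sketch on a made-up 2 x 3 table (the counts are purely illustrative). Each expected count is (row total × column total) / grand total, and the manual result is compared with what `chi2_contingency` returns.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2 x 3 contingency table (illustrative counts only)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# E_ij = (row i total) * (column j total) / (grand total)
expected_manual = np.outer(row_totals, col_totals) / grand_total

chi_sq_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi_sq, p_value, dof, expected = stats.chi2_contingency(observed)

print('expected counts match :', np.allclose(expected, expected_manual))
print('statistic match       :', np.isclose(chi_sq, chi_sq_manual))
print('degrees of freedom    :', dof, dof_manual)
```

Note that for a 2 x 2 table `chi2_contingency` applies Yates' continuity correction by default, so a manual calculation would only match there with `correction=False`.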

Load the Migraine data set and perform the following analysis:

1. Is Headache dependent on Gender?
2. Does Gender affect Headache_type (Aura, Mixed and No-Aura)?


In [None]:
M=pd.read_csv('/content/drive/My Drive/Statistics Mahesh Anand/Migraine.csv',index_col=0)

$H_0$: The proportion of males having a headache equals the proportion of females having a headache

$H_A$: The two proportions are not equal

In [None]:
from statsmodels.stats.proportion import proportions_ztest

In [None]:
proportions_ztest([209,225],[489,473])
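For two groups, the two-proportion z-test and the chi-square test of independence on the corresponding 2 x 2 table are equivalent: the squared z statistic equals the chi-square statistic when no continuity correction is applied. The sketch below reuses the counts from the call above and computes the z statistic by hand with the pooled proportion, on the assumption that this is the variant intended. Arranging the counts as a table with headache / no-headache columns is likewise an assumption about their meaning.

```python
import numpy as np
import scipy.stats as stats

count = np.array([209, 225])     # counts passed to proportions_ztest above
nobs = np.array([489, 473])      # group sizes passed to proportions_ztest above

# Pooled two-proportion z statistic, computed by hand
p_hat = count / nobs
p_pool = count.sum() / nobs.sum()
std_err = np.sqrt(p_pool * (1 - p_pool) * (1 / nobs).sum())
z_stat = (p_hat[0] - p_hat[1]) / std_err
p_value_z = 2 * stats.norm.sf(abs(z_stat))       # two-sided p-value

# Same data as a 2 x 2 table: rows are groups, columns are headache / no headache
table = np.column_stack([count, nobs - count])
chi_sq, p_value_chi, dof, _ = stats.chi2_contingency(table, correction=False)

print('z^2 = %.4f, chi-square = %.4f' % (z_stat ** 2, chi_sq))
print('p (z-test) = %.4f, p (chi-square) = %.4f' % (p_value_z, p_value_chi))
```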

$H_0$: For each headache type (Aura, No-Aura and Mixed), the proportion of males having that type equals the proportion of females having it

$H_A$: The proportions are not all equal

In [None]:
f=pd.crosstab(M['Gender'],M['hatype'])
f

Calculate the expected counts and the chi-square value manually and cross-check with the Python function.

### Example:

The table below contains the number of perfect, satisfactory and defective products manufactured by male and female workers.

| Gender  | Perfect | Satisfactory | Defective |
| ------- | ---- | --------- | -------- |
| Male    | 138 | 83 | 64 |
| Female  | 64 | 67 | 84 |


Do these data provide sufficient evidence at the 5% significance level to infer that there are differences in quality among genders (Male and Female)?

### Step 1: State the null and alternative hypothesis:

Null hypothesis: $H_0$: There is no difference in the quality of the products manufactured by male and female workers

Alternative hypothesis: $H_A$: There is a significant difference in the quality of the products manufactured by male and female workers

### Step 2: Decide the significance level

Here we select α = 0.05

### Step 3: Identify the test statistic

We use the chi-square test of independence to check whether two categorical variables, here gender and product quality, are related.

### Step 4: Calculate p value or chi-square statistic value

In [None]:
import pandas      as pd
import numpy       as np
import scipy.stats as stats

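# Rows: Male, Female; columns: Perfect, Satisfactory, Defective
quality_array = np.array([[138, 83, 64],[64, 67, 84]])

# Returns the statistic, the p-value, the degrees of freedom and the expected frequencies
chi_sq_Stat, p_value, deg_freedom, exp_freq = stats.chi2_contingency(quality_array)

print('Chi-square statistic %3.5f P value %1.6f Degrees of freedom %d' %(chi_sq_Stat, p_value,deg_freedom))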

Chi-square statistic 22.15247 P value 0.000015 Degrees of freedom 2


### Step 5: Decide whether to reject or fail to reject the null hypothesis

###### In this example, the p value is 0.000015, which is less than 0.05, so we reject the null hypothesis.
###### So, we conclude that there is a significant difference in the quality of the products manufactured by male and female workers.
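The same decision can be reached by comparing the statistic with a critical value instead of using the p value. The sketch below repeats the test on the table from this example, looks up the right-tail critical value with `scipy.stats.chi2.ppf`, and also prints the expected frequencies so the "at least 5" rule can be checked.

```python
import numpy as np
import scipy.stats as stats

# Rows: Male, Female; columns: Perfect, Satisfactory, Defective
quality_array = np.array([[138, 83, 64], [64, 67, 84]])
chi_sq_stat, p_value, deg_freedom, exp_freq = stats.chi2_contingency(quality_array)

alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df=deg_freedom)

print('Expected frequencies:')
print(np.round(exp_freq, 2))
print('Chi-square statistic %.5f, critical value %.5f' % (chi_sq_stat, critical_value))
print('Reject H0:', chi_sq_stat > critical_value)
```

With 2 degrees of freedom the critical value is about 5.991, so a statistic of 22.15 falls well inside the rejection region.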

## Chi-Square - One factor

###  Example 2

A1 Airlines operates daily flights to several Indian cities. The operations manager believes that 30% of their passengers prefer vegan food, 45% prefer vegetarian food, 20% prefer non-veg food and 5% request Jain food.

A sample of 500 passengers was chosen to analyse the food preferences and the data is shown in the following table:

| Food type | Vegan | Vegetarian | Non-Vegetarian | Jain |
| ------------------------- | ---- | ----- | ---- | ---- |
| Number of passengers | 190 | 185 | 90 | 35 |

At 5% level of significance, can you confirm that the meal preference is as per the belief of the operations manager?

### Step 1: State the null and alternative hypothesis:

Null hypothesis: $H_0$: Meal preference is as per the perceived ratios of the operations manager
                        
Alternative hypothesis: $H_A$: Meal preference is different from the perceived ratios of the operations manager

### Step 2: Decide the significance level

Here we select α = 0.05

### Step 3: Identify the test statistic

Since we have observed frequencies of meal preference and we can calculate the expected frequencies, we can use chi-square goodness of fit for this problem.

### Step 4: Calculate p value or chi-square statistic value

Use the `scipy.stats.chisquare` function to compute the chi-square goodness of fit by giving the observed values and expected values as input.

The first value in the returned tuple is the $\chi^2$ value itself, while the second value is the p-value computed using ν = k − 1 degrees of freedom, where k is the number of values in each array.

We can calculate the expected frequencies as follows:
1. Compute the total number of passengers. It is 500.
2. We expect 30% of them to prefer Vegan food, so the expected frequency for Vegan food is 0.3 * 500 = 150.
3. Similarly, the expected frequencies for the rest are 0.45 * 500 = 225 (Vegetarian), 0.2 * 500 = 100 (Non-Vegetarian) and 0.05 * 500 = 25 (Jain).

In [None]:
import numpy       as np
import scipy.stats as stats

observed_values    = np.array([190, 185, 90, 35])
n                  = observed_values.sum()

# Expected counts under the manager's believed ratios
expected_values    = np.array([n*0.3, n*.45, n*0.2, n*0.05])

chi_square_stat, p_value = stats.chisquare(observed_values, f_exp=expected_values)

print('At 5 %s level of significance, the p-value is %1.7f' %('%', p_value))

print('At 5 %s level of significance, the chi observed is %1.7f' %('%', chi_square_stat))

## chi-square critical value at alpha = 0.05 with 3 degrees of freedom is 7.815


At 5 % level of significance, the p-value is 0.0000449
At 5 % level of significance, the chi observed is 22.7777778
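To see where these numbers come from, the statistic and p-value can also be computed by hand. This sketch applies the goodness-of-fit formula to the same observed counts and takes the right-tail area of a chi-square distribution with k − 1 = 3 degrees of freedom. The variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

observed = np.array([190, 185, 90, 35])
ratios = np.array([0.30, 0.45, 0.20, 0.05])      # manager's believed proportions
expected = observed.sum() * ratios               # 150, 225, 100, 25

# Sum of (O - E)^2 / E over the four food types
chi_sq_manual = ((observed - expected) ** 2 / expected).sum()

dof = len(observed) - 1                          # k - 1 degrees of freedom
p_manual = stats.chi2.sf(chi_sq_manual, df=dof)  # right-tail area
critical_value = stats.chi2.ppf(0.95, df=dof)

print('chi-square statistic %.7f' % chi_sq_manual)
print('p-value              %.7f' % p_manual)
print('critical value       %.3f' % critical_value)
```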


### Step 5: Decide whether to reject or fail to reject the null hypothesis

### In this example, the p value is 0.0000449, which is less than 0.05, so we reject the null hypothesis.
### So, we conclude that the meal preference differs from the ratios stated in the null hypothesis.

Refer to Example 2 above. Suppose the operations manager revises his belief and now believes that 28% of their passengers prefer vegan food, 42% prefer vegetarian food, 25% prefer non-veg food and 5% request Jain food.

At 5% level of significance, can you confirm that the meal preference is as per the belief of the operations manager?
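One possible way to approach this is to rerun the goodness-of-fit test with the revised ratios, keeping the observed counts from Example 2 unchanged. The sketch below follows the same pattern as before. Only the expected proportions differ, and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

observed_values = np.array([190, 185, 90, 35])
revised_ratios = np.array([0.28, 0.42, 0.25, 0.05])         # revised belief
expected_values = observed_values.sum() * revised_ratios    # 140, 210, 125, 25

chi_square_stat, p_value = stats.chisquare(observed_values, f_exp=expected_values)

print('Chi-square statistic %.5f' % chi_square_stat)
print('P value %.10f' % p_value)
print('Reject H0 at the 5% level:', p_value < 0.05)
```

The statistic comes out at about 34.63, above the critical value of 7.815 for 3 degrees of freedom, so the revised ratios are also rejected at the 5% level.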


## End