
# MSS482 - GRAPHING TECHNOLOGY IN MATHEMATICS AND SCIENCE

**SEMESTER 1 2023/2024**


>R.U.Gobithaasan (2023). School of Mathematical Sciences, Universiti Sains Malaysia.
[Official Website](https://math.usm.my/academic-profile/705-gobithaasan-rudrusamy) 


<p align="center">
     © 2023 R.U. Gobithaasan All Rights Reserved.
</p>

# Analysis of Categorical Data
- https://www.pythonfordatascience.org/chi-square-test-of-independence-python/

6.1 Introduction to Categorical Data Analysis

6.2 Introduction to Chi-squared $\chi^2$ goodness of fit test

6.3 Chi-squared $\chi^2$ test of independence and homogeneity.


### requirements

> Install each of the following packages, for example `!python -m pip install pandas`:
1. pandas
2. researchpy
3. statsmodels
4. matplotlib
5. seaborn
6. scipy
7. numpy

### Dataset: Online Dataset sources

**Online Sources:** 
-  Kühne, R. (2017). Categorical Data Analysis. The International Encyclopedia of Communication Research Methods, 1–10. doi:10.1002/9781118901731.iecrm0021


### Tips

In [None]:
# Magic command to display Matplotlib plots inline :https://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline
# Suppress warnings to keep the notebook output tidy.
# Note: this also hides deprecation warnings, so comment it out when debugging.
import warnings
warnings.filterwarnings("ignore")

# 6.1 Introduction to Categorical Data Analysis


**Categorical data** consists of non-numeric information, such as gender, colors, categories, etc. 

- Analyzing **categorical data** involves using various statistical techniques to understand patterns, relationships, and distributions within the data. Here's a step-by-step guide on how to analyze categorical data:

1. **Data Preparation**: Organize your data into a tabular format, making it easier to work with.
2. **Frequency Distribution**: Create a frequency table, or use visualization tools such as bar charts or pie charts, to represent the frequency distribution. (Histograms are meant for numeric data, so prefer bar charts for categories.)
3. **Central Tendency Measures**: For categorical data, **mode** is the measure of central tendency that indicates the most frequently occurring category. Calculate the mode to identify the most common category in your dataset.
4. **Cross-tabulation (Contingency Tables)**: When you want to understand relationships between two or more categorical variables, create cross-tabulations or contingency tables. Use tools like `pd.crosstab()` in Python if working with `Pandas` to create these tables.
5. **Chi-Square $\chi^2$ Test**: to assess the association or independence between two categorical variables, use a Chi-Square test. The test measures how expected frequencies differ from observed frequencies. It determines whether there is a significant relationship between the variables.
6. **Clustering**: For more complex analyses, consider techniques like clustering to identify patterns and groupings within categorical data.
7. **Interpretation and Conclusion**: After performing the analysis, interpret the results and draw conclusions based on the findings. Explain any relationships, patterns, or insights revealed by the analysis.
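
Steps 2 and 3 above can be sketched with pandas as follows. The `colors` series is a hypothetical sample invented for illustration; it is not part of the course data.

```python
import pandas as pd

# Hypothetical categorical sample (illustrative only)
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

# Step 2: frequency table (counts) and relative frequencies (proportions)
freq = colors.value_counts()
rel_freq = colors.value_counts(normalize=True)

# Step 3: the mode is the most frequently occurring category
mode_color = colors.mode()[0]

print(freq)
print(rel_freq.round(3))
print(f"Mode: {mode_color}")
```

`value_counts(normalize=True)` returns proportions that sum to 1, which is convenient when comparing samples of different sizes.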

# 6.2 Introduction to Chi-squared test
https://www.pythonfordatascience.org/chi-square-test-of-independence-python/

The Chi-Square $\chi^2$ test is a statistical method used to **determine the association or independence between categorical variables**. 
> It is particularly useful for analyzing the **relationship between two categorical variables** in a contingency table (also known as a cross-tabulation or a two-way frequency table).

- The test evaluates whether there is a difference between the observed frequencies and the frequencies that would be expected if the variables were independent of each other.

- The Chi-Square $\chi^2$ test is an **omnibus test**: a significant result indicates that some association exists, but not which groups differ. To test for differences between specific groups, post-hoc testing is needed; typically, a proportions test is used as the follow-up.
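
A post-hoc proportions test can be sketched as a pooled two-proportion z-test. This is a minimal hand-written version using `scipy.stats.norm`; the helper name `two_proportion_ztest` and the counts are assumptions made for illustration, not something defined in these notes.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided tail probability
    return z, p_value

# Hypothetical counts: successes out of group size for two groups
z, p_value = two_proportion_ztest(x1=30, n1=40, x2=20, n2=60)
print(f"z = {z:.3f}, p = {p_value:.5f}")
```

When several pairs of groups are compared this way, the p-values should be adjusted for multiple comparisons (for example with a Bonferroni correction).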



### How the Chi-Square Test Works

1. **Hypotheses:**
   - H0 (Null Hypothesis): Assumes independence between variables.
   - H1 (Alternative Hypothesis): Assumes association between variables.

2. **Contingency Table:**
   - Organize categorical data into a table format.

3. **Expected Frequencies:**
   - Calculate expected frequencies under the assumption of independence.

4. **Chi-Square Statistic:**
   - Compute the Chi-Square statistic $\chi^2 = \sum (O - E)^2 / E$, summing over all cells, where $O$ is the observed and $E$ the expected frequency.

5. **Degrees of Freedom (df):**
   - Calculate the degrees of freedom from the table dimensions: $df = (\text{rows} - 1) \times (\text{columns} - 1)$.

6. **Chi-Square Distribution:**
   - Compare calculated Chi-Square value to critical value.

7. **Interpretation:**
   - Determine significance and make conclusions based on results.
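
The steps above can be worked through by hand with NumPy. This is a sketch on a hypothetical 2x3 table (the counts are invented for illustration), cross-checked against `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 table of observed counts (rows: groups, columns: categories)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# Step 3: expected counts under independence = row total * column total / n
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals * col_totals / n

# Step 4: chi-square statistic = sum over cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Step 5: degrees of freedom = (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Step 6: p-value from the upper tail of the chi-square distribution
p_value = chi2.sf(chi2_stat, dof)
print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")

# Cross-check against SciPy (the continuity correction only applies when dof == 1)
ref_stat, ref_p, ref_dof, ref_expected = chi2_contingency(observed)
```

For 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default, so a hand calculation like this one matches only when `correction=False` is passed.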



> $\chi^2$ test of independence assumptions:
1. The two samples are independent: the occurrence of an event in one cell should not influence the occurrence of an event in another cell.
2. No expected cell count equals 0. There should be no empty cells in the contingency table.
3. Expected frequencies should be at least 5 for most cells (a common rule of thumb: no more than 20% of the cells have an expected count < 5). When expected frequencies are too small, the chi-square test may not be reliable.
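
Assumptions 2 and 3 can be checked programmatically. The helper below is a sketch: the function name `check_expected_counts` and its thresholds (the common rule of thumb that no more than 20% of cells have an expected count below 5) are assumptions chosen for illustration, and both tables are hypothetical.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def check_expected_counts(observed, min_count=5, max_share_below=0.20):
    """Report whether the expected counts satisfy the usual rule of thumb."""
    expected = expected_freq(np.asarray(observed))
    share_below = float((expected < min_count).mean())
    return {
        "min_expected": float(expected.min()),
        "share_below_min": share_below,
        "no_zero_cells": bool((expected > 0).all()),
        "rule_of_thumb_ok": bool(share_below <= max_share_below),
    }

# Hypothetical tables: one with small counts, one with larger counts
small = [[3, 2], [1, 3]]
large = [[30, 10], [20, 40]]
print(check_expected_counts(small))
print(check_expected_counts(large))
```

The independence assumption (1) cannot be verified from the table itself; it depends on how the data were collected.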


---

> Toy example with two categorical variables:
1. `Team`: {'A', 'B'}
2. `Outcome`: {'Success', 'Failure'}

> Setting up data for categorical data analysis: the contingency table

> In Python's pandas library, `pd.crosstab` is a convenient function for creating **contingency tables (also known as cross-tabulations; in effect, a pivot table of counts)**. It is particularly useful for analyzing the relationship between two or more categorical variables.
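
As a small illustration of `pd.crosstab`, the sketch below also requests row and column totals with `margins=True`. The `survey` DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical survey responses (illustrative only)
survey = pd.DataFrame({
    "Group": ["X", "X", "Y", "Y", "Y", "X"],
    "Answer": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
})

# margins=True appends row and column totals under the label "All"
table = pd.crosstab(index=survey["Group"], columns=survey["Answer"], margins=True)
print(table)
```

The totals are handy for reading the table, but they should be left out (the default `margins=False`) when the table is passed to a chi-square test, because the test expects only the observed cell counts.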


In [None]:
import pandas as pd

# Sample DataFrame
data = {'Team': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B'],
        'Outcome': ['Success', 'Failure', 'Success', 'Success', 'Failure', 'Failure', 'Success', 'Failure','Failure']}
df = pd.DataFrame(data)
df


> Example of a contingency table built from the DataFrame

In [None]:
# Create a contingency table using pd.crosstab
contingency_table = pd.crosstab(index=df['Team'], columns=df['Outcome'])

# Display the contingency table
print(contingency_table)



In [None]:
import matplotlib.pyplot as plt
# Plotting
contingency_table.plot(kind='bar', stacked=True, figsize=(4,3))

# Adding labels and title
plt.xlabel('Team')
plt.ylabel('Count')
plt.title('Bar chart from contingency table')

# Show the plot
plt.show()

In [None]:
#Descriptive statistics
row_percentages = pd.crosstab(index=df['Team'], columns=df['Outcome'], normalize='index') * 100
column_percentages = pd.crosstab(index=df['Team'], columns=df['Outcome'], normalize='columns') * 100
mode_team = df['Team'].mode()[0]
mode_outcome = df['Outcome'].mode()[0]

# Display descriptive statistics
print("\nRow Percentages:")
print(row_percentages)
print("\nColumn Percentages:")
print(column_percentages)
print(f"\nMode of Team: {mode_team}")
print(f"Mode of Outcome: {mode_outcome}")

<div class="alert alert-block alert-danger">
<b>Question:</b> Determine whether the distribution of `Outcome` is similar or different between the two `Team` groups.
</div>

        H0: There is no association between the two categorical variables (independence).
        H1: There is an association between the two categorical variables.

In [None]:
from scipy.stats import chi2_contingency

# Perform the chi-square test
# (for a 2x2 table, SciPy applies Yates' continuity correction by default)
chi2, p, dof, expected_freq = chi2_contingency(contingency_table)

# Display the test statistics and p-value
print(f"\nChi-Square Value: {chi2}")
print(f"P-value: {p}")

# Check the significance level (e.g., 0.05)
alpha = 0.05
print("\nSignificance Test:")
if p < alpha:
    print("The association between Team and Outcome is statistically significant.")
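else:
    # Failing to reject H0 does not prove independence; it only means no association was detected.
    print("No statistically significant association between Team and Outcome was detected (fail to reject H0).")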

In [None]:
expected_freq

> In summary, the test gives **no evidence of an association between 'Team' and 'Outcome'**, so we fail to reject the hypothesis that they are independent in this dataset.

In [None]:
from scipy.stats import chi2_contingency

# Create DataFrames for observed and expected frequencies
observed_df = pd.DataFrame(contingency_table.values, index=contingency_table.index, columns=contingency_table.columns)
expected_df = pd.DataFrame(expected_freq, index=contingency_table.index, columns=contingency_table.columns)

# Rename columns with prefixes
observed_df.columns = [f'Observed_{col}' for col in observed_df.columns]
expected_df.columns = [f'Expected_{col}' for col in expected_df.columns]

# Display the combined DataFrame
result_df = pd.concat([observed_df, expected_df], axis=1)
print("Contingency Table with Observed and Expected Frequencies:")
print(result_df)


> Assumption (3) is not met: **all** of the cells have expected frequencies < 5.

The $\chi^2$ test result may therefore be unreliable!
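
When expected counts are this small in a 2x2 table, Fisher's exact test is a commonly used alternative. Note that this is a different test from the $\chi^2$ test used in these notes. The sketch below enters the counts of the toy `Team`/`Outcome` data by hand and uses `scipy.stats.fisher_exact`.

```python
from scipy.stats import fisher_exact

# Counts from the toy Team/Outcome data
# rows: Team A, Team B; columns: Failure, Success
table = [[2, 3],
         [3, 1]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Odds ratio: {odds_ratio:.3f}, p-value: {p_value:.4f}")
```

Fisher's exact test computes the p-value from the exact hypergeometric distribution of the table, so it does not rely on the large-sample approximation behind the $\chi^2$ test.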

---

> Interpreting the Chi-Square Statistic

The Chi-Square statistic obtained from a Chi-Square test is a measure used **to determine whether there is a significant association or independence between categorical variables**. Interpreting the Chi-Square statistic involves understanding its magnitude and comparing it to a critical value from a Chi-Square distribution table with appropriate degrees of freedom.

> How to Interpret the Chi-Square Statistic

**Magnitude of the Chi-Square Statistic:**
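- A large Chi-Square value, relative to the degrees of freedom, means the observed frequencies deviate strongly from those expected under independence, which indicates that **there is a notable association between the categorical variables being analyzed**.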

**Comparison with Critical Value:**
- Compare the calculated Chi-Square value to the critical value from the Chi-Square distribution table.
- The critical value depends on the chosen significance level (often 0.05 or 0.01) and degrees of freedom.

**P-Value Interpretation:**
- Often, the Chi-Square test also provides a p-value.
- If the p-value is less than the chosen significance level (e.g., 0.05), it indicates a significant association between the variables.

**Conclusion based on Comparison:**
- If the calculated Chi-Square statistic is greater than the critical value, you reject the null hypothesis.
  - **Result:** Indicates a **significant association** between the variables.
- If the calculated Chi-Square statistic is less than the critical value, you fail to reject the null hypothesis.
  - **Result:** No significant association was detected; the data are consistent with independence.
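
The critical-value rule and the p-value rule lead to the same decision. The sketch below shows both with `scipy.stats.chi2`; the test statistic `chi2_stat = 4.5` is a hypothetical value chosen for illustration.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1            # e.g. a 2x2 table: (2 - 1) * (2 - 1)
chi2_stat = 4.5    # hypothetical test statistic, for illustration only

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value = chi2.ppf(1 - alpha, dof)

# p-value: probability of a statistic at least this large under H0
p_value = chi2.sf(chi2_stat, dof)

reject_by_critical = chi2_stat > critical_value
reject_by_p = p_value < alpha
print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}")
print(f"Reject H0? {reject_by_critical} (critical value), {reject_by_p} (p-value)")
```

Using `chi2.ppf` replaces the lookup in a printed Chi-Square distribution table.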



> Examining the residuals of your data can provide additional insights after finding a significant association in a Chi-squared test.

- The standardized residuals help identify the cells in a contingency table that contribute the most to the observed association.

**How to interpret the standardized residuals:**

1. **Positive Residuals**: If a standardized residual is positive, it indicates that the observed count is higher than what would be expected by chance. In the context of the Chi-squared test, positive residuals suggest an over-representation of cases in that cell.
2. **Negative Residuals**: Conversely, if a standardized residual is negative, it indicates that the observed count is lower than expected. Negative residuals suggest an under-representation of cases in that cell.
3. **Magnitude of Residuals**: The larger the magnitude of the standardized residual, the more the observed count deviates from what would be expected. A larger magnitude suggests a greater contribution to the overall Chi-squared statistic.
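
Terminology varies between textbooks: $(O - E)/\sqrt{E}$ is often called the Pearson residual, while some texts reserve "standardized" or "adjusted" for a version that also accounts for the row and column proportions. The sketch below computes both on a hypothetical 2x2 table; the counts are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts
observed = np.array([[30, 10],
                     [20, 40]])

# correction=False gives the plain Pearson chi-square statistic
chi2_stat, p, dof, expected = chi2_contingency(observed, correction=False)

# Pearson residuals: (O - E) / sqrt(E); their squares sum to the chi-square statistic
pearson = (observed - expected) / np.sqrt(expected)

# Adjusted residuals additionally divide by sqrt((1 - row share) * (1 - column share))
n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n
adjusted = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

print("Pearson residuals:\n", pearson.round(3))
print("Adjusted residuals:\n", adjusted.round(3))
```

Under the null hypothesis the adjusted residuals are approximately standard normal, so absolute values above roughly 2 are commonly read as cells that deviate noticeably from independence.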


In [None]:
import numpy as np
import scipy.stats as stats


# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
expected = stats.contingency.expected_freq(observed_df)
residuals = (observed_df - expected) / np.sqrt(expected)

# Print standardized residuals
print("Standardized Residuals:")
print(residuals)

> Comparing the residual table with the stacked bar chart gives a clearer picture.

In [None]:
import matplotlib.pyplot as plt
# Plotting
contingency_table.plot(kind='bar', stacked=True, figsize=(4,3))

# Adding labels and title
plt.xlabel('Team')
plt.ylabel('Count')
plt.title('Bar chart from contingency table')

# Show the plot
plt.show()

---

> Example from https://www.pythonfordatascience.org/chi-square-test-of-independence-python/

> The data used in this example come from Stata: 1980 U.S. census data from 956 cities.

<div class="alert alert-block alert-warning">
<b> The research question: </b>  
Is there a relationship between region and age category?
</div>

- Before testing this relationship, let's look at some basic univariate statistics.
- The chi-square test for independence is generally robust to imbalances in the levels of categorical variables, especially when the sample size is large. 

In [None]:
import pandas as pd
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")

In [None]:
df.info()

In [None]:
df['region'].describe()

In [None]:
df['region'].value_counts()

In [None]:
df['agecat'].describe()

In [None]:
df['agecat'].value_counts()

In [None]:
contingency_table = pd.crosstab(df["region"], df["agecat"])
contingency_table

In [None]:
import matplotlib.pyplot as plt
# Plotting
contingency_table.plot(kind='bar', stacked=True, figsize=(4,3))

# Adding labels and title
plt.xlabel('Region')
plt.ylabel('Count')
plt.title('Bar chart from contingency table')

# Show the plot
plt.show()

In [None]:
from scipy.stats import chi2_contingency

# Perform the chi-square test
chi2, p, dof, expected_freq = chi2_contingency(contingency_table)

# Display the test statistics and p-value
print(f"\nChi-Square Value: {chi2}")
print(f"P-value: {p}")

# Check the significance level (e.g., 0.05)
alpha = 0.05
print("\nSignificance Test:")
if p < alpha:
    print("The association between region and age is statistically significant.")
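else:
    # Failing to reject H0 does not prove independence; it only means no association was detected.
    print("No statistically significant association between region and age was detected (fail to reject H0).")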

> There is a relationship between region and the age distribution,
$\chi^2(6) = 61.29, p < 0.0001$.

> ASSUMPTION CHECK

1. The two samples are independent: the variables were collected independently of each other, i.e. the answer for one variable did not depend on the answer for the other.
2. No expected cell count is equal to 0.
3. No more than 20% of the cells have an expected cell count < 5.



In [None]:
from scipy.stats import chi2_contingency

# Create DataFrames for observed and expected frequencies
observed_df = pd.DataFrame(contingency_table.values, index=contingency_table.index, columns=contingency_table.columns)
expected_df = pd.DataFrame(expected_freq, index=contingency_table.index, columns=contingency_table.columns)

# Rename columns with prefixes
observed_df.columns = [f'Observed_{col}' for col in observed_df.columns]
expected_df.columns = [f'Expected_{col}' for col in expected_df.columns]

# Display the combined DataFrame
result_df = pd.concat([observed_df, expected_df], axis=1)
print("Contingency Table with Observed and Expected Frequencies:")
print(result_df)


In [None]:
import numpy as np
# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
expected = stats.contingency.expected_freq(observed_df)
residuals = (observed_df - expected) / np.sqrt(expected)

# Print standardized residuals
print("Standardized Residuals:")
print(residuals)

> Comparing the residual table with the stacked bar chart gives a clearer picture.

In [None]:
import matplotlib.pyplot as plt
# Plotting
contingency_table.plot(kind='bar', stacked=True, figsize=(4,3))

# Adding labels and title
plt.xlabel('Region')
plt.ylabel('Count')
plt.title('Bar chart from contingency table')

# Show the plot
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Exercise:</b> Is there a relationship between the division and age?

</div>

In [None]:
df['division'].value_counts()

---