## Introduction to Data Science


***Data Mining and Discovery***

**Chi-Squared test**

**Hello Data**



## Why We Use the Chi-Squared Test (χ² Test) in Data Mining

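The **Chi-squared test** is a statistical method for **categorical data**. It is mainly used to test for association between two categorical variables and to check goodness of fit. The sections below cover both uses, along with the formula and typical applications: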

---

### 1. Association Between Two Categorical Variables

* Example: In the `stent30` dataset, we might want to see if the **30-day outcome (stroke/no event)** depends on the **group (treatment with a stent vs. control without one)**.
* The test checks whether the difference in outcomes is **due to chance** or whether there is a **real relationship** between the variables.

---

### 2. Goodness-of-Fit Test

* Sometimes we have **expected proportions** (e.g., 50% Male, 50% Female), but the observed data might be different (e.g., 40% Male, 60% Female).
* The Chi-squared test tells us whether the observed difference is **statistically significant** or just **random variation**.
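
A goodness-of-fit check like the one above can be sketched with `scipy.stats.chisquare`. The sample size of 100 people is an assumption made purely for illustration, since the text gives only percentages.

```python
from scipy.stats import chisquare

# Hypothetical sample of 100 people (the sample size is an assumption)
observed = [40, 60]  # observed counts: Male, Female
expected = [50, 50]  # counts expected under a 50/50 split

# Goodness-of-fit test: do the observed counts match the expected ones?
statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-squared: {statistic:.2f}, p-value: {p_value:.4f}")
```

With a different sample size the same percentages would give a different statistic, so the result depends on the counts, not only on the proportions.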

---

### 3. Independence Testing Using Contingency Tables

A contingency table shows the counts for combinations of two variables. Example:

|          | Success | Failure | Total |
| -------- | ------- | ------- | ----- |
| Stent    | 40      | 10      | 50    |
| No Stent | 30      | 20      | 50    |

The Chi-squared test checks whether the two variables (Stent and Outcome) are **independent** or **related**.
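
As a rough sketch, the counts expected under independence can be derived from the row and column totals of the example table and compared with what `chi2_contingency` reports. The variable names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the example table (rows: Stent, No Stent; columns: Success, Failure)
observed = np.array([[40, 10],
                     [30, 20]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected_by_hand = np.outer(row_totals, col_totals) / observed.sum()

# chi2_contingency returns the same expected counts as its fourth value
_, _, dof, expected = chi2_contingency(observed)
print(expected_by_hand)
print(f"Degrees of freedom: {dof}")
```

If the two variables were independent, both rows would be expected to show the same split between Success and Failure.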

---

### 4. Why Not Just Look at Percentages?

* Percentages might look different, but the difference could be due to **random chance**.
* The Chi-squared formula measures how far the observed counts deviate from what we would expect if there were **no relationship**:

  $$
  \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
  $$
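
The formula can be applied by hand to the example contingency table from section 3. This is a minimal sketch: the expected counts are taken as row total times column total divided by the grand total, and `correction=False` is passed so that SciPy evaluates the plain formula for comparison.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: Stent, No Stent; columns: Success, Failure)
observed = np.array([[40, 10], [30, 20]], dtype=float)
# Expected counts if the two variables were independent
expected = np.array([[35, 15], [35, 15]], dtype=float)

# Sum of (Observed - Expected)^2 / Expected over all four cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Same statistic from SciPy, without the continuity correction
chi2_scipy, _, _, _ = chi2_contingency(observed, correction=False)
print(f"Manual: {chi2_manual:.4f}, SciPy: {chi2_scipy:.4f}")
```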

---

### 5. Real-World Applications

* **Healthcare**: Check whether one treatment works better than another.
* **Marketing**: See if buying habits depend on customer age groups.
* **Education**: Analyze whether test results depend on teaching methods.

---




In [9]:
# Setup
# Importing libraries
import pandas as pd  # for data manipulation and analysis
import numpy as np  # for numerical operations
import matplotlib.pyplot as plt  # for plotting graphs
import seaborn as sns  # for data visualization
from scipy.stats import chi2_contingency  # for the Chi-squared test

*Today we will work through a case study called "Using Stents to Prevent Strokes".*

**Source:** Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011.

In [2]:
# Loading the dataset
stent30 = pd.read_csv('stent30.csv')
stent30.head(10)

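|     | group     | outcome |
| --- | --------- | ------- |
| 0   | treatment | stroke  |
| 1   | treatment | stroke  |
| 2   | treatment | stroke  |
| 3   | treatment | stroke  |
| 4   | treatment | stroke  |
| 5   | treatment | stroke  |
| 6   | treatment | stroke  |
| 7   | treatment | stroke  |
| 8   | treatment | stroke  |
| 9   | treatment | stroke  |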


**Treatment group (N = 224): received stent and medical management (medications, management of risk factors, lifestyle modification)**

**Control group (N = 227): same medical management as the treatment group, but no stent**

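**What are the hypotheses?**

* Stents alone prevent strokes
* Medical management alone prevents strokes
* Both stents and medical management prevent strokes

For the Chi-squared test itself, the null hypothesis is that the outcome is independent of the group.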


In [None]:
# Create a frequency table for the "group" and "outcome" columns
frequency_table = pd.crosstab(index = stent30['group'], columns = stent30['outcome'])
frequency_table


| group     | no event | stroke |
| --------- | -------- | ------ |
| control   | 214      | 13     |
| treatment | 191      | 33     |
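
Before running the test, it can help to turn these counts into per-group proportions. The sketch below rebuilds the table from the counts shown above rather than from `stent30.csv`; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the frequency table above
counts = pd.DataFrame(
    {"no event": [214, 191], "stroke": [13, 33]},
    index=pd.Index(["control", "treatment"], name="group"),
)

# Row-wise proportions: share of each outcome within each group
proportions = counts.div(counts.sum(axis=1), axis=0)

# Ratio of stroke rates (treatment relative to control)
rate_ratio = proportions.loc["treatment", "stroke"] / proportions.loc["control", "stroke"]
print(proportions.round(3))
print(f"Stroke rate ratio: {rate_ratio:.2f}")
```

With the full DataFrame loaded, `pd.crosstab(stent30['group'], stent30['outcome'], normalize='index')` should give the same proportions directly.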


**Chi-squared Test**

We need to test the relationship statistically.

In [8]:
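# Performing the Chi-squared test
# Returns the test statistic, the p-value, the degrees of freedom,
# and the expected counts under independence
chi2, p, dof, expected = chi2_contingency(frequency_table)
print(f"Chi-squared: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)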

Chi-squared: 9.023331421738945
P-value: 0.0026655513536403995
Degrees of freedom: 1
Expected frequencies:
[[203.84700665  23.15299335]
 [201.15299335  22.84700665]]
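
One detail worth knowing when reading this output: for a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so the reported statistic is somewhat smaller than the plain sum of (Observed - Expected)² / Expected. The sketch below compares both variants using the counts from the frequency table; the 0.05 significance level is a conventional choice assumed here, not something stated in the source.

```python
from scipy.stats import chi2_contingency

# Counts from the frequency table (rows: control, treatment; columns: no event, stroke)
table = [[214, 13],
         [191, 33]]

# Default behaviour for a 2x2 table: Yates' continuity correction is applied
chi2_corrected, p_corrected, _, _ = chi2_contingency(table)

# Same test without the correction, matching the plain formula
chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

alpha = 0.05  # conventional significance level (an assumption)
print(f"With correction:    chi2 = {chi2_corrected:.3f}, p = {p_corrected:.4f}")
print(f"Without correction: chi2 = {chi2_plain:.3f}, p = {p_plain:.4f}")
print("Reject independence at alpha = 0.05:", p_corrected < alpha)
```

Either way the p-value stays well below 0.05, so the choice of correction does not change the conclusion for this dataset.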


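**Conclusions**

The stroke rate in the treatment group (33/224, about 14.7%) was more than 2.5 times the rate in the control group (13/227, about 5.7%).

With a p-value of about 0.0027, well below the conventional 0.05 threshold, the difference in outcomes between the control and treatment groups is statistically significant.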

