# Hospital Readmission Prediction - Exploratory Data Analysis
**Author:** Jiayi Lyu  
**Major:** Mathematics, Minor in Statistics  

## Objective
Analyze patient readmission patterns using statistical methods and probability theory.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load data
df = pd.read_csv('../data/raw/patient_data.csv')
print(f"Dataset shape: {df.shape}")
df.head()

## 1. Statistical Summary
First, let's examine the distribution of key variables.

In [None]:
# Descriptive statistics
df.describe()

## 2. Distribution Analysis
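Examining the shape of the length-of-stay distribution with a histogram, a Gamma Q-Q plot, skewness, and kurtosis. A formal normality check such as the Shapiro-Wilk test can complement these diagnostics.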

In [None]:
# Check distribution of length of stay
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(df['time_in_hospital'], bins=30, edgecolor='black')
plt.title('Distribution of Length of Stay')
plt.xlabel('Days')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
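# probplot needs the gamma shape parameter, so fit the distribution first
shape, loc, scale = stats.gamma.fit(df['time_in_hospital'])
stats.probplot(df['time_in_hospital'], dist="gamma", sparams=(shape, loc, scale), plot=plt)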
plt.title('Q-Q Plot (Gamma Distribution)')

plt.tight_layout()
plt.show()

print(f"Skewness: {df['time_in_hospital'].skew():.3f}")
print(f"Kurtosis: {df['time_in_hospital'].kurtosis():.3f}")
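
A formal normality check can accompany the plots above. The sketch below runs the Shapiro-Wilk test on a random subsample, since SciPy warns that the p-value may be inaccurate for N > 5000. The helper `shapiro_on_sample`, its `max_n` and `seed` parameters, and the synthetic gamma data are illustrative assumptions rather than part of the original analysis; in this notebook the call would take `df['time_in_hospital']` instead.

```python
import numpy as np
from scipy import stats

def shapiro_on_sample(values, max_n=5000, seed=0):
    """Run Shapiro-Wilk on at most max_n randomly chosen non-missing values."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    if values.size > max_n:
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=max_n, replace=False)
    return stats.shapiro(values)

# Synthetic right-skewed stand-in for df['time_in_hospital'] (hypothetical data)
demo_stay = np.random.default_rng(42).gamma(shape=2.0, scale=2.0, size=10_000)
result = shapiro_on_sample(demo_stay)
print(f"Shapiro-Wilk W: {result.statistic:.4f}, p-value: {result.pvalue:.4e}")
```

A small p-value here rejects normality, which is the expected outcome for a right-skewed variable. With large samples the test flags even minor deviations, so the Q-Q plot remains the more informative diagnostic.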

## 3. Hypothesis Testing
### H0: Readmission rate is independent of age group
Using Chi-square test of independence.

In [None]:
# Create age groups
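# right=False gives [0, 50), [50, 65), [65, inf), so the intervals match the labels
df['age_group'] = pd.cut(df['age'], bins=[0, 50, 65, np.inf], right=False,
                         labels=['<50', '50-64', '65+'])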

# Contingency table
contingency_table = pd.crosstab(df['age_group'], df['readmitted'])
print("Contingency Table:")
print(contingency_table)

# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4e}")
print(f"Degrees of freedom: {dof}")

if p_value < 0.05:
    print("\nReject H0: Readmission rate is significantly different across age groups")
else:
    print("\nFail to reject H0")
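
With a large sample, the chi-square test can reject H0 even when the association is weak, so an effect size is a useful companion to the p-value. The sketch below computes Cramér's V, defined as sqrt(chi2 / (n * (min(rows, cols) - 1))). The helper `cramers_v` and the demo counts are hypothetical; in this notebook the function would be called on `contingency_table`.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for an r x c contingency table (no continuity correction)."""
    observed = np.asarray(table, dtype=float)
    chi2 = stats.chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical 3 x 2 counts (age group x readmitted) standing in for contingency_table
demo_table = np.array([[300, 100],
                       [350, 150],
                       [400, 250]])
print(f"Cramér's V: {cramers_v(demo_table):.3f}")
```

Cramér's V ranges from 0 (no association) to 1 (perfect association), which makes it comparable across tables of different sizes.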

## 4. Correlation Analysis
Examining linear relationships among the numerical predictors and the readmission target.

In [None]:
# Select numerical features
numerical_features = ['age', 'time_in_hospital', 'num_medications', 
                     'number_diagnoses', 'number_inpatient', 'readmitted']

# Correlation matrix
corr_matrix = df[numerical_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Key Variables')
plt.tight_layout()
plt.show()
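
To read the heatmap against the target, it can help to rank features by the absolute value of their correlation with `readmitted`. For a binary 0/1 target, the Pearson coefficient is the point-biserial correlation. The helper `rank_by_target_correlation` and the demo values below are hypothetical; in this notebook the function would be called as `rank_by_target_correlation(df[numerical_features], 'readmitted')`.

```python
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Return each feature's Pearson correlation with target, strongest first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Hypothetical values; only the column names come from the notebook
demo = pd.DataFrame({
    'number_inpatient': [0, 0, 1, 2, 3, 4],
    'time_in_hospital': [3, 1, 4, 2, 5, 3],
    'readmitted':       [0, 0, 0, 1, 1, 1],
})
ranking = rank_by_target_correlation(demo, 'readmitted')
print(ranking)
```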

## 5. Class Imbalance Analysis
The readmission problem has imbalanced classes - important for model selection.

In [None]:
# Class distribution
class_dist = df['readmitted'].value_counts().sort_index()  # order classes as 0, 1
print("Class Distribution:")
print(class_dist)
print(f"\nImbalance ratio: {class_dist.loc[0] / class_dist.loc[1]:.2f}:1")

# Visualization
plt.figure(figsize=(8, 6))
class_dist.plot(kind='bar', color=['#2ecc71', '#e74c3c'])
plt.title('Class Distribution: Readmission')
plt.xlabel('Readmitted (0=No, 1=Yes)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## 6. Statistical Insights

### Key Findings:
1. **Length of Stay** is right-skewed and is modeled here with a Gamma distribution
2. **Age and readmission** show significant association (p < 0.05)
3. **Class imbalance** (~70:30) requires special handling in modeling
4. **Prior inpatient admissions** (`number_inpatient`) show the strongest correlation with readmission

### Statistical Considerations for Modeling:
- Use **stratified sampling** to preserve class distribution
- Consider **ROC-AUC** over accuracy due to imbalance
- Apply **class weights** in tree-based models
- Build **interaction terms** (products of predictors) during feature engineering
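
The first three considerations above can be sketched together as follows. This is a minimal illustration, not the notebook's modeling pipeline: scikit-learn is an added dependency, the data come from `make_classification` with roughly 70:30 classes, and the random forest settings are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic ~70:30 data standing in for the patient features (hypothetical)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# Stratified split keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight='balanced' weights classes inversely to their frequency
model = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                               random_state=0)
model.fit(X_train, y_train)

# ROC-AUC is computed from predicted probabilities of the positive class
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Train positive rate: {y_train.mean():.3f}")
print(f"Test positive rate:  {y_test.mean():.3f}")
print(f"ROC-AUC: {auc:.3f}")
```

ROC-AUC scores the ranking of predicted probabilities rather than hard labels, so it is not inflated by always predicting the majority class the way accuracy can be.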