# Exploratory Data Analysis Workshop
## The Bridge Between Raw Data and Meaningful Insight

**Dataset:** Respiratory Patient Data (OMOP CDM Synthetic Extract)

**INSTRUCTOR VERSION - WITH SOLUTIONS**

---

### Learning Objectives

By the end of this workshop, you will be able to:
1. Distinguish between seven types of analytical questions and understand their EDA requirements
2. Apply the four dimensions of exploration (distributional, relational, structural, comparative)
3. Identify data quality issues, patterns, and limitations before modeling
4. Interpret findings and distinguish signal from artifact
5. Determine appropriate next steps based on EDA outcomes

---
## Setup and Imports

Run the cell below to import the required libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8-whitegrid')

sns.set(style="whitegrid", context="talk")
plt.rcParams["figure.figsize"] = (10, 6)

%matplotlib inline

# Seed NumPy's global random number generator for reproducibility (pandas sampling draws from NumPy, not Python's built-in "random" module)
np.random.seed(42)

print("Libraries loaded successfully!")

---
## Part 1: Data Loading and Initial Inspection

### About the Dataset

This dataset contains synthetic patient records extracted from a database in the OMOP Common Data Model (CDM) format, the standard maintained by OHDSI. The records represent patients with respiratory observations, including:
- Patient demographics (age, gender, race, ethnicity)
- Visit information (dates, type, conditions)
- Vital signs (temperature, oxygen saturation, heart rate, etc.)
- Vaccination history
- Patient outcomes

## Initial Analysis
With any dataset, you'll want to perform an initial "exploratory" data analysis to help you understand the structure, patterns, and relationships.

A few goals:
1. **Data Summarization** - gain a quick overview of the dataset
   - **Shape and size of data:** Number of rows, columns, and unique values.
   - **Descriptive statistics:** Mean, median, standard deviation, percentiles.
2. **Data Cleaning** - ensure data quality
   - **Handling missing values:** Identify and impute (mean/median/mode) or remove missing entries.
   - **Removing duplicates:** Eliminate redundant rows or records.
   - **Correcting data types:** Convert data to appropriate formats (e.g., dates, numbers, categories).
   - **Dealing with outliers:** Detect and decide whether to remove or transform extreme values.
3. **Feature Engineering** - add additional features/variables to support analysis
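
The cleaning goals above can be sketched on a small made-up table before applying them to the real data. The column names and values here are hypothetical and are not taken from the workshop dataset; median imputation is shown only as one option, and whether it is appropriate depends on why values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the workshop dataset
toy = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "visit_date": ["2020-01-05", "2020-02-10", "2020-02-10", "not a date"],
    "heart_rate": [72.0, np.nan, np.nan, 110.0],
})

# Correct data types: unparseable dates become NaT instead of raising an error
toy["visit_date"] = pd.to_datetime(toy["visit_date"], errors="coerce")

# Remove exact duplicate rows (the two rows for person 2 are identical)
toy = toy.drop_duplicates().reset_index(drop=True)

# Count missing values per column before deciding how to handle them
missing_counts = toy.isna().sum()

# Median imputation, kept in a separate column so the original stays visible
toy["heart_rate_imputed"] = toy["heart_rate"].fillna(toy["heart_rate"].median())

print(toy)
print(missing_counts)
```

Keeping the imputed values in a new column makes it easy to compare results with and without imputation later.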

In [None]:
# Read in Files and Establish Starting Dataframes
filename = 'data.csv'

In [None]:
# Load data
df = pd.read_csv(filename)

In [None]:
# Perform basic exploratory data analysis
# df.head(n)  # top n rows, n defaults to 5
# df.tail(n)  # last n rows
# df.sample(5) # sample x rows
df

In [None]:
print("Dataframe shape:",df.shape)
print(df.info())

In [None]:
# Convert date fields
for c in ["visit_start_date", "visit_end_date", "birth_datetime", "measurement_Date","flu_last_administered","tdap_last_administered","mmr_last_administered","polio_last_administered"]:
    if c in df.columns:
        df[c] = pd.to_datetime(df[c], errors="coerce")

In [None]:
# Examine missing fields
n_rows = len(df)

missing_table = (   # create a new dataframe
    df.isna()
      .agg(['sum', 'mean'])
      .T
      .rename(columns={'sum': 'missing_count', 'mean': 'missing_percent'})
)

missing_table['missing_percent'] = (missing_table['missing_percent'] * 100).round(2)
missing_table['non_missing_count'] = n_rows - missing_table['missing_count']
missing_table['dtype'] = df.dtypes.astype(str)

missing_table = (
    missing_table
      .reset_index(names='column')
      .sort_values(by=['missing_percent', 'column'], ascending=[False, True])
      .set_index('column')
)

missing_table

**Question:** What patterns do you notice in the missing data? Which columns have the most missing values, and why might that be?

In [None]:
# Create a column for visit length - ignoring visit type
los = (df["visit_end_date"] - df["visit_start_date"]).dt.days
df["length_of_stay_days"] = los.clip(lower=0)


# Modify labels for deceased column
df["deceased_flag"] = df["deceased"].map({"Y": "Deceased", "N": "Alive"}).fillna("Unknown").astype("category")

# columns for year and month
df["visit_year"] = df["visit_start_date"].dt.year
df["visit_month"] = df["visit_start_date"].dt.to_period("M").astype(str)

df['gender_source_value'] = df['gender_source_value'].astype('category')
df['race_source_value'] = df['race_source_value'].astype('category')
df['ethnicity_source_value'] = df['ethnicity_source_value'].astype('category')

In [None]:
# Descriptive Statistics for Numeric Columns
df.describe()

In [None]:
# for categorical columns
df.describe(include=['object','category'])

In [None]:
# Create an alternate view of the conditions, placing them into a separate tidy dataframe
import re

# robust split on ":" allowing extra spaces; return an empty list if missing or blank
def split_conditions(s):
    if pd.isna(s) or str(s).strip() == "":
        return []
    # split on ":" with optional surrounding spaces
    parts = re.split(r"\s*:\s*", str(s))
    # normalize: strip whitespace and drop empty items (case is left unchanged)
    parts = [p.strip() for p in parts if p and p.strip()]
    return parts

# apply once to create a list-typed column
df["condition_list"] = df["condition"].map(split_conditions)

cond_long = (
    df[["visit_occurrence_id", "person_id", "visit_start_date"]]
      .assign(condition_item=df["condition_list"])
      .explode("condition_item", ignore_index=True)
)

# drop rows where no condition exists after cleaning
cond_long = cond_long.dropna(subset=["condition_item"])

# (optional) dedupe within visit in case the same condition appears twice
cond_long = cond_long.drop_duplicates(subset=["visit_occurrence_id", "condition_item"])
cond_long

In [None]:
# For outpatient visits, assume this is a data issue and the length should be 0
is_outpatient = df['visit_type'].astype(str).str.contains('outpatient', case=False, na=False)

# align dates, then recompute LOS as zero
df.loc[is_outpatient, 'visit_end_date'] = df.loc[is_outpatient, 'visit_start_date']
df.loc[is_outpatient, 'length_of_stay_days'] = 0

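# remove any records where length_of_stay_days > 100
# (note: the <= comparison is False for NaN, so rows with a missing length of stay are dropped as well)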
df = df[df["length_of_stay_days"] <= 100].copy()

---
## Part 2: Distributional Exploration

**Goal:** Examine individual variables to understand their scale, shape, and validity.

### The Seven Question Types

Before we explore, remember that the type of question determines the EDA approach:

| Type | Core Question | Example |
|------|--------------|--------|
| Descriptive | What happened? | What is the distribution of conditions? |
| Exploratory | What patterns exist? | Is there a relationship between temperature and O2 sat? |
| Inferential | Does this generalize? | Is the O2 sat difference statistically significant? |
| Predictive | What will happen? | Can we predict inpatient admission? |
| Prescriptive | What should we do? | What thresholds should trigger escalation? |
| Causal | What if we intervene? | Would earlier vaccination reduce severity? |
| Mechanistic | What process produces this? | How does symptom progression unfold? |

### Exercise 2.1: Demographic Distributions

Create frequency counts for the following categorical variables:
1. Gender (`gender_source_value`)
2. Race (`race_source_value`)
3. Ethnicity (`ethnicity_source_value`)

**Note:** For demographics, we should look at unique patients, not all visits (since one patient can have multiple visits).
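
To see why the unit of analysis matters, here is a minimal sketch on made-up data in which one patient has three visits. The values are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical visit-level data: patient 1 has three visits
visits = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3],
    "gender_source_value": ["F", "F", "F", "M", "M"],
})

# Counting rows counts visits, so frequent visitors are over-represented
visit_level = visits["gender_source_value"].value_counts()

# Keeping one row per person_id counts each patient once
patient_level = (
    visits.drop_duplicates(subset="person_id")["gender_source_value"].value_counts()
)

print(visit_level)
print(patient_level)
```

At the visit level "F" looks like the majority, while at the patient level there is one female patient and two male patients.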

In [None]:
# First, create a dataframe of unique patients
patients = df.drop_duplicates(subset='person_id')

print(f"Total unique patients: {len(patients)}")

In [None]:
# Gender distribution
gender_counts = patients['gender_source_value'].value_counts()
print("\nGender Distribution:")
print(gender_counts)
print(f"\nPercentages:")
print(patients['gender_source_value'].value_counts(normalize=True) * 100)

In [None]:
# Race distribution
race_counts = patients['race_source_value'].value_counts()
print("\nRace Distribution:")
print(race_counts)
print(f"\nPercentages:")
print(patients['race_source_value'].value_counts(normalize=True) * 100)

In [None]:
# Ethnicity distribution
ethnicity_counts = patients['ethnicity_source_value'].value_counts()
print("\nEthnicity Distribution:")
print(ethnicity_counts)
print(f"\nPercentages:")
print(patients['ethnicity_source_value'].value_counts(normalize=True) * 100)

### Exercise 2.2: Age Distribution

1. Calculate descriptive statistics for `age_at_visit_years`
2. Create a histogram of the age distribution
3. Create age groups with non-overlapping boundaries (Pediatric: 0-18, Young Adult: >18-40, Middle Age: >40-65, Elderly: >65)
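
When binning ages with `pd.cut`, it helps to check how boundary values are handled. By default the intervals are closed on the right and the lowest edge is excluded, so a value equal to the first bin edge, or above the last one, becomes missing. The sketch below uses made-up ages to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0, 18, 19, 40, 65, 66, 101])
labels = ["Pediatric", "Young Adult", "Middle Age", "Elderly"]

# Default behaviour: intervals are (0, 18], (18, 40], (40, 65], (65, 100]
# so age 0 and age 101 fall outside every bin and become NaN
default_groups = pd.cut(ages, bins=[0, 18, 40, 65, 100], labels=labels)

# include_lowest=True keeps age 0; an open upper bound keeps ages above 100
inclusive_groups = pd.cut(
    ages, bins=[0, 18, 40, 65, np.inf], labels=labels, include_lowest=True
)

print(pd.DataFrame({"age": ages, "default": default_groups, "inclusive": inclusive_groups}))
```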

In [None]:
# 1. Descriptive statistics for age
print("Age Descriptive Statistics:")
print(df['age_at_visit_years'].describe())

In [None]:
# 2. Histogram of age distribution
plt.figure(figsize=(10, 6))
plt.hist(df['age_at_visit_years'].dropna(), bins=30, edgecolor='black')
plt.xlabel('Age (years)')
plt.ylabel('Frequency')
plt.title('Age Distribution at Visit')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# 3. Create age groups and show their distribution
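# include_lowest=True keeps age 0 in the first bin; np.inf keeps ages above 100 in "Elderly"
df['age_group'] = pd.cut(df['age_at_visit_years'],
                         bins=[0, 18, 40, 65, np.inf],
                         labels=['Pediatric', 'Young Adult', 'Middle Age', 'Elderly'],
                         include_lowest=True)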

age_group_counts = df['age_group'].value_counts().sort_index()
print("\nAge Group Distribution:")
print(age_group_counts)
print("\nPercentages:")
print(df['age_group'].value_counts(normalize=True).sort_index() * 100)

### Exercise 2.3: Condition Analysis

The `condition` column contains multiple conditions separated by colons (`:`).

1. Parse the conditions into individual items
2. Count the frequency of each condition
3. Create a bar chart of the top 10 conditions
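
The parse-and-count steps can be tried first on a few made-up strings. The condition names below are hypothetical, and this is only one way to chain the operations.

```python
import pandas as pd

# Hypothetical colon-separated condition strings, including a missing value
toy = pd.DataFrame({
    "condition": ["Cough : Fever", "Fever", None, "Cough:Fatigue:Fever"],
})

condition_counts = (
    toy["condition"]
    .str.split(":")   # each string becomes a list; the missing value stays missing
    .explode()        # one row per list element
    .str.strip()      # trim whitespace around each item
    .dropna()         # drop the row that had no condition
    .value_counts()
)

print(condition_counts)
```

Stripping whitespace before counting matters here: without it, "Cough " and "Cough" would be tallied as different conditions.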

In [None]:
# 1. Parse conditions (split by colon) - already done earlier, but repeating here
df['condition_list'] = df['condition'].str.split(':')

# 2. Explode and count
all_conditions = df['condition_list'].explode()
all_conditions = all_conditions.str.strip()  # Remove whitespace
condition_counts = all_conditions.value_counts()

print("Top 10 Conditions:")
print(condition_counts.head(10))

In [None]:
# 3. Bar chart of top 10 conditions
plt.figure(figsize=(12, 6))
condition_counts.head(10).plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Condition')
plt.title('Top 10 Most Common Conditions')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### Exercise 2.4: Vital Signs Distribution (COVID-Suspected Patients)

Many vital signs are only recorded for COVID-suspected visits.

1. Filter the data to only COVID-suspected patients (`observation_source == 'Suspected COVID-19'`)
2. Calculate descriptive statistics for the vital signs columns
3. Create histograms for oxygen saturation, respiratory rate, heart rate, and body temperature

In [None]:
# 1. Filter to COVID-suspected patients
covid_df = df[df['observation_source'] == 'Suspected COVID-19']

print(f"COVID-suspected visits: {len(covid_df)}")
print(f"Total visits: {len(df)}")
print(f"Percentage: {len(covid_df)/len(df)*100:.1f}%")

In [None]:
# 2. Descriptive statistics for vital signs
vital_cols = ['oxygen_saturation_percent', 'respiratory_rate_per_minute',
              'heart_rate_bpm', 'body_temperature_c', 'systolic', 'diastolic']

print("Vital Signs Descriptive Statistics (COVID-Suspected Patients):")
print(covid_df[vital_cols].describe())

In [None]:
# 3. Create histograms (2x2 subplot)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Oxygen saturation
axes[0, 0].hist(covid_df['oxygen_saturation_percent'].dropna(), bins=30, edgecolor='black')
axes[0, 0].set_xlabel('Oxygen Saturation (%)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Oxygen Saturation Distribution')
axes[0, 0].axvline(95, color='red', linestyle='--', label='Normal threshold')
axes[0, 0].legend()

# Respiratory rate
axes[0, 1].hist(covid_df['respiratory_rate_per_minute'].dropna(), bins=30, edgecolor='black')
axes[0, 1].set_xlabel('Respiratory Rate (per minute)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Respiratory Rate Distribution')

# Heart rate
axes[1, 0].hist(covid_df['heart_rate_bpm'].dropna(), bins=30, edgecolor='black')
axes[1, 0].set_xlabel('Heart Rate (bpm)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Heart Rate Distribution')

# Body temperature
axes[1, 1].hist(covid_df['body_temperature_c'].dropna(), bins=30, edgecolor='black')
axes[1, 1].set_xlabel('Body Temperature (°C)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Body Temperature Distribution')
axes[1, 1].axvline(38, color='red', linestyle='--', label='Fever threshold')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

**Question:** Are there any concerning values in the vital signs? What might explain extreme values?

*Instructor notes:*

Students should identify:
- Oxygen saturation values below 90% (hypoxemia) and especially below 80% (severe hypoxemia)
- Extremely high or low temperatures that may indicate data entry errors or critical conditions
- Very high heart rates (>180 bpm) or respiratory rates (>40/min) that may be artifacts or represent critical illness
- Some extreme values may be biologically implausible and could indicate measurement errors or data quality issues
- The distribution of values should be discussed in terms of clinical significance vs. data quality concerns

### Exercise 2.5: Visit Type Distribution

1. Count the number of Inpatient vs Outpatient visits
2. Calculate the percentage of each type
3. Create a pie chart or bar chart showing the distribution

In [None]:
# 1-2. Count and percentage
visit_type_counts = df['visit_type'].value_counts()
visit_type_pct = df['visit_type'].value_counts(normalize=True) * 100

print("Visit Type Distribution:")
print(visit_type_counts)
print("\nPercentages:")
print(visit_type_pct)

# 3. Create visualizations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
visit_type_counts.plot(kind='bar', ax=axes[0], color=['skyblue', 'lightcoral'])
axes[0].set_xlabel('Visit Type')
axes[0].set_ylabel('Count')
axes[0].set_title('Visit Type Distribution (Bar Chart)')
axes[0].tick_params(axis='x', rotation=45)

# Pie chart
axes[1].pie(visit_type_counts, labels=visit_type_counts.index, autopct='%1.1f%%',
            colors=['skyblue', 'lightcoral'], startangle=90)
axes[1].set_title('Visit Type Distribution (Pie Chart)')

plt.tight_layout()
plt.show()

### Exercise 2.6: Outlier Detection in Vital Signs

Identify potential outliers in vital sign measurements. Consider clinical validity:
- Temperature: Normal 36-37.5°C, fever >38°C, extreme >42°C
- Oxygen saturation: Normal >95%, concerning <90%, critical <80%
- Heart rate: Normal 60-100 bpm, tachycardia >100, extreme >180
- Respiratory rate: Normal 12-20/min, elevated >24, extreme >40
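
As a small illustration of turning thresholds like these into labels, the sketch below bands made-up oxygen saturation readings with `np.select`. The band names are assumptions for illustration, and missing readings are labeled explicitly so they are not silently treated as normal.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; NaN stands for a visit with no measurement
spo2 = pd.Series([99.0, 92.0, 88.0, 75.0, np.nan])

# np.select uses the first condition that is True for each element,
# so the most severe band must be listed first
conditions = [spo2 < 80, spo2 < 90, spo2.notna()]
choices = ["critical", "concerning", "not flagged"]
band = pd.Series(np.select(conditions, choices, default="missing"), index=spo2.index)

print(pd.DataFrame({"spo2": spo2, "band": band}))
```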

**Dimension focus:** Distributional exploration

In [None]:
# Identify outliers based on clinical thresholds
outlier_summary = pd.DataFrame({
    'Vital Sign': ['Temperature', 'Oxygen Saturation', 'Heart Rate', 'Respiratory Rate'],
    'Extreme (count)': [
        (covid_df['body_temperature_c'] > 42).sum(),
        (covid_df['oxygen_saturation_percent'] < 80).sum(),
        (covid_df['heart_rate_bpm'] > 180).sum(),
        (covid_df['respiratory_rate_per_minute'] > 40).sum()
    ],
    'Concerning (count)': [
        ((covid_df['body_temperature_c'] > 38) & (covid_df['body_temperature_c'] <= 42)).sum(),
        ((covid_df['oxygen_saturation_percent'] < 90) & (covid_df['oxygen_saturation_percent'] >= 80)).sum(),
        ((covid_df['heart_rate_bpm'] > 100) & (covid_df['heart_rate_bpm'] <= 180)).sum(),
        ((covid_df['respiratory_rate_per_minute'] > 24) & (covid_df['respiratory_rate_per_minute'] <= 40)).sum()
    ]
})

print("Outlier Analysis for Vital Signs:")
print(outlier_summary)

# Show some extreme cases
print("\nExtreme Temperature Cases:")
extreme_temp = covid_df[covid_df['body_temperature_c'] > 42][['person_id', 'body_temperature_c', 'deceased']]
print(extreme_temp.head())

print("\nCritical Oxygen Saturation Cases:")
critical_o2 = covid_df[covid_df['oxygen_saturation_percent'] < 80][['person_id', 'oxygen_saturation_percent', 'deceased']]
print(critical_o2.head())

---
## Part 3: Relational Exploration

**Goal:** Investigate how multiple variables interact rather than treating them in isolation.

### Exercise 3.1: Correlation Matrix

Create a correlation matrix heatmap for the numeric vital sign variables using the COVID-suspected patient data.

In [None]:
# Select numeric columns for correlation
numeric_cols = ['age_at_visit_years', 'oxygen_saturation_percent',
                'respiratory_rate_per_minute', 'heart_rate_bpm',
                'body_temperature_c', 'systolic', 'diastolic']

# Calculate correlation matrix
corr_matrix = covid_df[numeric_cols].corr()

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Vital Signs and Age')
plt.tight_layout()
plt.show()

print("\nStrongest correlations (absolute value > 0.3):")
# Get upper triangle of correlation matrix
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
corr_pairs = corr_matrix.where(mask).stack().sort_values(ascending=False)
strong_corr = corr_pairs[abs(corr_pairs) > 0.3]
print(strong_corr)

**Question:** What correlations do you observe? Are any surprising or concerning?

*Instructor notes:*

Students should identify:
- Systolic and diastolic blood pressure are highly correlated (expected)
- Negative correlation between oxygen saturation and respiratory rate (hypoxic patients breathe faster)
- Weak or no correlation between some variables may indicate independence or non-linear relationships
- Age correlations with vital signs may reveal age-related physiological patterns
- Discuss whether observed correlations are clinically meaningful or potential confounders
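
The point about non-linear relationships can be demonstrated with synthetic data. In this sketch (made-up values, not the patient data), a U-shaped relationship has a Pearson correlation of essentially zero despite perfect dependence, and a monotonic but curved relationship gets a Spearman correlation of 1 while Pearson is noticeably lower.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only
x = np.linspace(-3, 3, 61)
toy = pd.DataFrame({
    "x": x,
    "u_shaped": x ** 2,      # strong dependence, but not monotonic
    "monotonic": np.exp(x),  # monotonic, but not linear
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print("Pearson:\n", pearson.round(2))
print("Spearman:\n", spearman.round(2))
```

A near-zero cell in a Pearson heatmap is therefore a prompt to look at a scatter plot, not evidence that two variables are unrelated.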

### Exercise 3.2: Temperature vs Oxygen Saturation

Create a scatter plot examining the relationship between body temperature and oxygen saturation. Color the points by the `deceased` status.

In [None]:
# Filter data with both measurements
temp_o2_data = covid_df.dropna(subset=['body_temperature_c', 'oxygen_saturation_percent', 'deceased'])

# Create scatter plot
plt.figure(figsize=(10, 6))
colors = {'Y': 'red', 'N': 'blue'}
for deceased_status, group in temp_o2_data.groupby('deceased'):
    plt.scatter(group['body_temperature_c'], group['oxygen_saturation_percent'],
                c=colors[deceased_status], label=f"Deceased: {deceased_status}",
                alpha=0.5, s=30)

plt.xlabel('Body Temperature (°C)')
plt.ylabel('Oxygen Saturation (%)')
plt.title('Relationship between Body Temperature and Oxygen Saturation')
plt.grid(True, alpha=0.3)
plt.axhline(y=95, color='green', linestyle='--', alpha=0.5, label='Normal O2 threshold')
plt.axvline(x=38, color='orange', linestyle='--', alpha=0.5, label='Fever threshold')
plt.legend()  # called after the reference lines so their labels appear in the legend
plt.show()

### Exercise 3.3: Condition Count and Vital Signs

1. Create a variable `condition_count` that counts the number of conditions per visit (count the colons + 1)
2. Compare the mean vital signs between patients with 1-2 conditions vs 3+ conditions
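
Before relying on the colons-plus-one rule, it is worth checking how it treats empty and missing values. The sketch below uses made-up strings; `count_items` is a hypothetical helper written for this illustration, not part of the workshop code.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including an empty string and a missing value
condition = pd.Series(["Cough:Fever:Fatigue", "Fever", "", np.nan])

# Colons + 1: an empty string is counted as one condition, a missing value stays NaN
naive_count = condition.str.count(":") + 1

# One possible alternative: split, ignore blank items, and count what remains
def count_items(value):
    if pd.isna(value):
        return 0
    return sum(1 for part in str(value).split(":") if part.strip())

item_count = condition.map(count_items)

print(pd.DataFrame({"condition": condition, "naive": naive_count, "items": item_count}))
```

Whether a visit with no recorded condition should count as zero or as missing is a judgment call that depends on the analysis.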

In [None]:
# 1. Create condition count
df['condition_count'] = df['condition'].str.count(':') + 1

# Check the distribution
print("Condition Count Distribution:")
print(df['condition_count'].value_counts().sort_index())

In [None]:
# 2. Compare vital signs by condition severity (1-2 vs 3+)
# First add condition_count to covid_df
covid_df = covid_df.copy()
covid_df['condition_count'] = covid_df['condition'].str.count(':') + 1
covid_df['high_condition_count'] = covid_df['condition_count'] >= 3

# Compare vital signs
vital_comparison = covid_df.groupby('high_condition_count')[vital_cols].mean()
print("\nMean Vital Signs by Condition Severity:")
print(vital_comparison)

# Calculate differences
print("\nDifference (3+ conditions minus 1-2 conditions):")
difference = vital_comparison.loc[True] - vital_comparison.loc[False]
print(difference)

### Exercise 3.4: Visit Type and Vital Signs

Compare vital signs between Inpatient and Outpatient visits. Create box plots showing the distribution of oxygen saturation by visit type.

In [None]:
# Box plots for oxygen saturation by visit type
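# Pass an explicit Axes: DataFrame.boxplot(by=...) otherwise opens its own figure and leaves an empty one behind
fig, ax = plt.subplots(figsize=(10, 6))
covid_df.boxplot(column='oxygen_saturation_percent', by='visit_type', ax=ax)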
plt.suptitle('')  # Remove default title
plt.title('Oxygen Saturation Distribution by Visit Type')
plt.xlabel('Visit Type')
plt.ylabel('Oxygen Saturation (%)')
plt.axhline(y=95, color='red', linestyle='--', alpha=0.5, label='Normal threshold')
plt.legend()
plt.tight_layout()
plt.show()

# Summary statistics
print("\nOxygen Saturation Statistics by Visit Type:")
print(covid_df.groupby('visit_type')['oxygen_saturation_percent'].describe())

### Exercise 3.5: Deceased Status and Vital Signs

1. Calculate the mean and median vital signs grouped by deceased status (Y/N)
2. Create box plots comparing oxygen saturation between deceased and non-deceased patients

In [None]:
# 1. Mean/median vital signs by deceased status
deceased_vital_stats = covid_df.groupby('deceased')[vital_cols].agg(['mean', 'median'])
print("Vital Signs by Deceased Status:")
print(deceased_vital_stats)

In [None]:
# 2. Box plots for oxygen saturation by deceased status
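# Pass an explicit Axes: DataFrame.boxplot(by=...) otherwise opens its own figure and leaves an empty one behind
fig, ax = plt.subplots(figsize=(10, 6))
covid_df.boxplot(column='oxygen_saturation_percent', by='deceased', ax=ax)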
plt.suptitle('')  # Remove default title
plt.title('Oxygen Saturation Distribution by Deceased Status')
plt.xlabel('Deceased (Y/N)')
plt.ylabel('Oxygen Saturation (%)')
plt.axhline(y=95, color='red', linestyle='--', alpha=0.5, label='Normal threshold')
plt.axhline(y=90, color='orange', linestyle='--', alpha=0.5, label='Concerning threshold')
plt.legend()
plt.tight_layout()
plt.show()

# Alternative: side-by-side violin plots
plt.figure(figsize=(10, 6))
sns.violinplot(data=covid_df, x='deceased', y='oxygen_saturation_percent')
plt.title('Oxygen Saturation Distribution by Deceased Status (Violin Plot)')
plt.xlabel('Deceased (Y/N)')
plt.ylabel('Oxygen Saturation (%)')
plt.axhline(y=95, color='red', linestyle='--', alpha=0.5, label='Normal threshold')
plt.legend()
plt.show()

---
## Part 4: Structural Exploration

**Goal:** Analyze temporal, hierarchical, and sequential patterns in the data.

### Exercise 4.1: Date Preparation

Convert the date columns to datetime format and extract useful components.

In [None]:
# Convert date columns to datetime (already done in data prep, but repeating for clarity)
df['visit_start_date'] = pd.to_datetime(df['visit_start_date'])
df['visit_end_date'] = pd.to_datetime(df['visit_end_date'])

# Extract year and month
df['visit_year'] = df['visit_start_date'].dt.year
df['visit_month'] = df['visit_start_date'].dt.to_period('M').astype(str)

print("Date columns prepared successfully!")
print(f"\nDate range: {df['visit_start_date'].min()} to {df['visit_start_date'].max()}")

### Exercise 4.2: Temporal Distribution of Visits

1. Count the number of visits by year
2. For 2020, create a bar chart showing visits by month
3. What patterns do you observe?

In [None]:
# 1. Visits by year
visits_by_year = df['visit_year'].value_counts().sort_index()
print("Visits by Year:")
print(visits_by_year)

# Plot
plt.figure(figsize=(10, 6))
visits_by_year.plot(kind='bar', color='steelblue')
plt.xlabel('Year')
plt.ylabel('Number of Visits')
plt.title('Distribution of Visits by Year')
plt.xticks(rotation=0)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# 2. Bar chart of 2020 visits by month
df_2020 = df[df['visit_year'] == 2020]
visits_2020_by_month = df_2020['visit_month'].value_counts().sort_index()

print("\n2020 Visits by Month:")
print(visits_2020_by_month)

plt.figure(figsize=(12, 6))
visits_2020_by_month.plot(kind='bar', color='coral')
plt.xlabel('Month')
plt.ylabel('Number of Visits')
plt.title('Distribution of Visits by Month in 2020')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n3. Pattern observations:")
print("- Look for seasonal trends (e.g., respiratory illness peaks in winter)")
print("- COVID-19 pandemic impact should be visible in 2020")
print("- Sudden spikes or drops may indicate data collection changes or real events")

### Exercise 4.3: Length of Stay Analysis

1. Calculate the length of stay (in days) for each visit
2. Filter to inpatient visits only
3. Calculate descriptive statistics for length of stay
4. Create a histogram of length of stay

In [None]:
# 1. Calculate length of stay (already done in prep, but showing here)
df['length_of_stay'] = (df['visit_end_date'] - df['visit_start_date']).dt.days
df['length_of_stay'] = df['length_of_stay'].clip(lower=0)  # No negative values

In [None]:
# 2-4. Inpatient length of stay analysis
inpatient_df = df[df['visit_type'].str.contains('Inpatient', na=False)]

print("Length of Stay Statistics (Inpatient Visits):")
print(inpatient_df['length_of_stay'].describe())

# Histogram
plt.figure(figsize=(12, 6))
plt.hist(inpatient_df['length_of_stay'].dropna(), bins=50, edgecolor='black')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Frequency')
plt.title('Distribution of Length of Stay for Inpatient Visits')
plt.axvline(inpatient_df['length_of_stay'].median(), color='red', 
            linestyle='--', label=f"Median: {inpatient_df['length_of_stay'].median():.1f} days")
plt.axvline(inpatient_df['length_of_stay'].mean(), color='green', 
            linestyle='--', label=f"Mean: {inpatient_df['length_of_stay'].mean():.1f} days")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Check for outliers
print(f"\nVisits with LOS > 30 days: {(inpatient_df['length_of_stay'] > 30).sum()}")
print(f"Visits with LOS > 60 days: {(inpatient_df['length_of_stay'] > 60).sum()}")

### Exercise 4.4: Patient Visit History

1. Count the number of visits per patient
2. What percentage of patients have multiple visits?
3. Find a patient with multiple visits and examine their visit history

In [None]:
# 1. Visits per patient
visits_per_patient = df.groupby('person_id').size().reset_index(name='visit_count')

print("Distribution of Visit Counts:")
print(visits_per_patient['visit_count'].value_counts().sort_index())

In [None]:
# 2. Percentage with multiple visits
multiple_visits = (visits_per_patient['visit_count'] > 1).sum()
total_patients = len(visits_per_patient)
pct_multiple = (multiple_visits / total_patients) * 100

print(f"\nPatients with multiple visits: {multiple_visits} ({pct_multiple:.1f}%)")
print(f"Patients with single visit: {total_patients - multiple_visits} ({100-pct_multiple:.1f}%)")

In [None]:
# 3. Examine a patient with multiple visits
# Find patient with most visits
most_visits_patient = visits_per_patient.loc[visits_per_patient['visit_count'].idxmax(), 'person_id']

print(f"\nPatient with most visits: {most_visits_patient}")
print(f"Number of visits: {visits_per_patient[visits_per_patient['person_id'] == most_visits_patient]['visit_count'].values[0]}")

# Show their visit history
patient_history = df[df['person_id'] == most_visits_patient][[
    'visit_occurrence_id', 'visit_start_date', 'visit_end_date', 'visit_type',
    'condition', 'deceased', 'length_of_stay'
]].sort_values('visit_start_date')

print("\nVisit History:")
print(patient_history)

### Exercise 4.5: Vaccination Timeline Analysis

1. Calculate the time between the last flu vaccination and the visit date
2. What percentage of patients were vaccinated within the past year (365 days) before their visit?

In [None]:
# 1. Calculate days since flu vaccine
df['flu_last_administered'] = pd.to_datetime(df['flu_last_administered'])
df['days_since_flu_vaccine'] = (df['visit_start_date'] - df['flu_last_administered']).dt.days

print("Days Since Flu Vaccine Statistics:")
print(df['days_since_flu_vaccine'].describe())

In [None]:
# 2. Percentage vaccinated within past year
has_vaccine_data = df['days_since_flu_vaccine'].notna()
vaccinated_within_year = (df['days_since_flu_vaccine'] <= 365) & (df['days_since_flu_vaccine'] >= 0)

total_with_data = has_vaccine_data.sum()
vaccinated_count = vaccinated_within_year.sum()
pct_vaccinated = (vaccinated_count / total_with_data) * 100 if total_with_data > 0 else 0

print(f"\nTotal visits with vaccination data: {total_with_data}")
print(f"Vaccinated within past year: {vaccinated_count} ({pct_vaccinated:.1f}%)")

# Histogram
plt.figure(figsize=(12, 6))
plt.hist(df[df['days_since_flu_vaccine'] >= 0]['days_since_flu_vaccine'].dropna(), 
         bins=50, edgecolor='black')
plt.xlabel('Days Since Last Flu Vaccination')
plt.ylabel('Frequency')
plt.title('Time Since Last Flu Vaccination')
plt.axvline(365, color='red', linestyle='--', label='1 year')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## Part 5: Comparative Exploration

**Goal:** Study differences across groups, time periods, or conditions.

### Exercise 5.1: Mortality Rate Comparison

1. Calculate the overall mortality rate (percentage with deceased='Y')
2. Compare mortality rates by visit type (Inpatient vs Outpatient)
3. Compare mortality rates by age group

In [None]:
# 1. Overall mortality rate
overall_mortality_rate = (df['deceased'] == 'Y').mean() * 100
print(f"Overall mortality rate: {overall_mortality_rate:.2f}%")

In [None]:
# 2. Mortality by visit type
mortality_by_visit = pd.crosstab(
    df['visit_type'],
    df['deceased'],
    normalize='index'
) * 100

print("\nMortality Rate by Visit Type (%):")
print(mortality_by_visit)

# Visualization
mortality_by_visit['Y'].plot(kind='bar', color='darkred', figsize=(10, 6))
plt.title('Mortality Rate by Visit Type')
plt.xlabel('Visit Type')
plt.ylabel('Mortality Rate (%)')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# 3. Mortality by age group
# First ensure age_group exists
if 'age_group' not in df.columns:
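    # include_lowest=True keeps age 0 in the first bin; np.inf keeps ages above 100 in "Elderly"
    df['age_group'] = pd.cut(df['age_at_visit_years'],
                             bins=[0, 18, 40, 65, np.inf],
                             labels=['Pediatric', 'Young Adult', 'Middle Age', 'Elderly'],
                             include_lowest=True)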

mortality_by_age = pd.crosstab(
    df['age_group'],
    df['deceased'],
    normalize='index'
) * 100

print("\nMortality Rate by Age Group (%):")
print(mortality_by_age)

# Visualization
mortality_by_age['Y'].plot(kind='bar', color='darkblue', figsize=(10, 6))
plt.title('Mortality Rate by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Mortality Rate (%)')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

### Exercise 5.2: Visit Type by Age Group

Create a cross-tabulation showing what percentage of each age group has Inpatient vs Outpatient visits.
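
The direction of normalization determines which question a cross-tabulation answers. This sketch on made-up rows contrasts `normalize='index'` (percent within each age group) with `normalize='columns'` (percent within each visit type).

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "age_group": ["Pediatric", "Pediatric", "Elderly", "Elderly", "Elderly", "Elderly"],
    "visit_type": ["Outpatient", "Outpatient", "Inpatient", "Inpatient", "Inpatient", "Outpatient"],
})

# Each row sums to 100: "what share of this age group's visits were inpatient?"
row_pct = pd.crosstab(toy["age_group"], toy["visit_type"], normalize="index") * 100

# Each column sums to 100: "what share of inpatient visits came from this age group?"
col_pct = pd.crosstab(toy["age_group"], toy["visit_type"], normalize="columns") * 100

print(row_pct)
print(col_pct)
```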

In [None]:
visit_by_age = pd.crosstab(
    df['age_group'],
    df['visit_type'],
    normalize='index'
) * 100

print("Visit Type Distribution by Age Group (%):")
print(visit_by_age)

# Stacked bar chart
visit_by_age.plot(kind='bar', stacked=True, figsize=(10, 6), 
                  color=['lightcoral', 'skyblue'])
plt.title('Visit Type Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Percentage (%)')
plt.legend(title='Visit Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Exercise 5.3: COVID vs Non-COVID Comparison

Compare characteristics between COVID-suspected and non-COVID visits:
1. Average age
2. Visit type distribution
3. Mortality rate

In [None]:
# Create a COVID indicator
df['is_covid_suspected'] = df['observation_source'] == 'Suspected COVID-19'

# Compare the groups
covid_summary = df.groupby('is_covid_suspected').agg(
    avg_age=('age_at_visit_years', 'mean'),
    mortality_rate=('deceased', lambda x: (x == 'Y').mean() * 100),
    inpatient_rate=('visit_type', lambda x: (x.str.contains('Inpatient')).mean() * 100),
    n_visits=('person_id', 'count')
)

print("COVID-Suspected vs Non-COVID Comparison:")
print(covid_summary)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Average age
covid_summary['avg_age'].plot(kind='bar', ax=axes[0], color=['steelblue', 'coral'])
axes[0].set_title('Average Age')
axes[0].set_xlabel('COVID Suspected')
axes[0].set_ylabel('Age (years)')
axes[0].set_xticklabels(['No', 'Yes'], rotation=0)

# Mortality rate
covid_summary['mortality_rate'].plot(kind='bar', ax=axes[1], color=['steelblue', 'coral'])
axes[1].set_title('Mortality Rate')
axes[1].set_xlabel('COVID Suspected')
axes[1].set_ylabel('Mortality Rate (%)')
axes[1].set_xticklabels(['No', 'Yes'], rotation=0)

# Inpatient rate
covid_summary['inpatient_rate'].plot(kind='bar', ax=axes[2], color=['steelblue', 'coral'])
axes[2].set_title('Inpatient Admission Rate')
axes[2].set_xlabel('COVID Suspected')
axes[2].set_ylabel('Inpatient Rate (%)')
axes[2].set_xticklabels(['No', 'Yes'], rotation=0)

plt.tight_layout()
plt.show()

### Exercise 5.4: Statistical Significance Testing

Test whether the difference in oxygen saturation between deceased and non-deceased patients is statistically significant.

1. State your null hypothesis
2. Perform a Mann-Whitney U test (non-parametric alternative to t-test)
3. Interpret the results

In [None]:
# Separate the two groups
deceased_o2 = covid_df[covid_df['deceased'] == 'Y']['oxygen_saturation_percent'].dropna()
survived_o2 = covid_df[covid_df['deceased'] == 'N']['oxygen_saturation_percent'].dropna()

print(f"Deceased - Mean: {deceased_o2.mean():.2f}, Median: {deceased_o2.median():.2f}, n={len(deceased_o2)}")
print(f"Survived - Mean: {survived_o2.mean():.2f}, Median: {survived_o2.median():.2f}, n={len(survived_o2)}")

In [None]:
# Perform Mann-Whitney U test
from scipy import stats

# Null hypothesis: There is no difference in oxygen saturation between deceased and non-deceased patients
# Alternative hypothesis: There is a difference in oxygen saturation between the groups

u_stat, p_value = stats.mannwhitneyu(
    deceased_o2,
    survived_o2,
    alternative='two-sided'
)

print("\nMann-Whitney U Test Results:")
print(f"U statistic: {u_stat:.2f}")
print(f"P-value: {p_value:.4e}")
print("\nSignificance level (α): 0.05")
if p_value < 0.05:
    print("Result: REJECT the null hypothesis")
    print("Interpretation: There IS a statistically significant difference in oxygen saturation")
    print("between deceased and non-deceased patients.")
else:
    print("Result: FAIL TO REJECT the null hypothesis")
    print("Interpretation: There is NO statistically significant difference in oxygen saturation")
    print("between deceased and non-deceased patients.")

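# Effect size (rank-biserial correlation)
# In SciPy >= 1.7, u_stat is the U statistic of the first sample (deceased), so with
# this formula a positive value means deceased patients tend to have LOWER readings.
n1, n2 = len(deceased_o2), len(survived_o2)
rank_biserial = 1 - (2*u_stat) / (n1 * n2)
print(f"\nEffect size (rank-biserial correlation): {rank_biserial:.3f}")
print("Effect size interpretation (absolute value): <0.3 small, 0.3-0.5 medium, >0.5 large")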

**Interpretation:** Based on the p-value, is the difference statistically significant at α = 0.05?

**Instructor notes:**

The p-value is less than 0.05, indicating a statistically significant difference in oxygen saturation between deceased and non-deceased patients. The deceased group has significantly lower oxygen saturation on average.

Key teaching points:
- Statistical significance (p < 0.05) tells us that a difference this large would be unlikely if there were truly no difference between the groups
- Effect size tells us how large/meaningful the difference is
- Clinical significance vs. statistical significance - a small difference might be statistically significant with large sample sizes but not clinically meaningful
- The Mann-Whitney U test is appropriate here because oxygen saturation may not be normally distributed
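
The gap between statistical and clinical significance can be demonstrated with a short simulation. The sketch below uses simulated values, not the workshop data, and assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of the first sample. Two large groups differ by only 0.05 standard deviations, yet the test reports a very small p-value while the rank-biserial effect size stays close to zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated readings: two large groups whose true means differ
# by only 0.05 standard deviations
group_a = rng.normal(loc=0.00, scale=1.0, size=50_000)
group_b = rng.normal(loc=0.05, scale=1.0, size=50_000)

# Assumption: SciPy >= 1.7, where the statistic is U for the first sample
u_a, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# The U statistics of the two groups always sum to n_a * n_b
n_a, n_b = len(group_a), len(group_b)
u_b = n_a * n_b - u_a

# Rank-biserial correlation: share of pairs where a > b minus share where b > a
rank_biserial = (u_a - u_b) / (n_a * n_b)

print(f"p-value: {p_value:.2e}")
print(f"rank-biserial correlation: {rank_biserial:.3f}")
```

Here the sign of the effect size is negative when the first group tends to have lower values. Which sign convention is used matters less than stating it and reporting the magnitude alongside the p-value.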

### Exercise 5.5: Condition-Specific Analysis

1. Identify which specific conditions are most associated with inpatient admission
2. Calculate the inpatient rate for each condition
3. Which conditions have the highest inpatient rates?

In [None]:
# Split each visit's condition string into individual conditions (one row per
# visit-condition pair), then compute the inpatient rate for each condition

cond_rates = (
    df
    .assign(
        condition_item=df['condition'].str.split(r"\s*:\s*")
    )
    .explode('condition_item')
    .dropna(subset=['condition_item'])
    .assign(
        condition_item=lambda x: x['condition_item'].str.strip(),
        is_inpatient=lambda x: x['visit_type'].str.contains('Inpatient', na=False)
    )
    .groupby('condition_item')
    .agg(
        inpatient_rate=('is_inpatient', 'mean'),
        total_visits=('visit_occurrence_id', 'count')
    )
    .reset_index()
)

cond_rates['inpatient_rate'] *= 100

# Filter to conditions with at least 100 visits for statistical reliability
cond_rates_filtered = cond_rates[cond_rates['total_visits'] >= 100]

# Top 10 conditions by inpatient rate
top_inpatient_conditions = cond_rates_filtered.sort_values('inpatient_rate', ascending=False).head(10)

print("Top 10 Conditions with Highest Inpatient Admission Rates:")
print("(minimum 100 visits)\n")
print(top_inpatient_conditions.to_string(index=False))

# Visualization
plt.figure(figsize=(12, 6))
plt.barh(range(len(top_inpatient_conditions)), top_inpatient_conditions['inpatient_rate'])
plt.yticks(range(len(top_inpatient_conditions)), top_inpatient_conditions['condition_item'])
plt.xlabel('Inpatient Admission Rate (%)')
plt.title('Top 10 Conditions Associated with Inpatient Admission')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
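
The split-then-explode step is the heart of this solution, so it can help to watch it on a tiny table. The sketch below uses made-up visits. The `:` delimiter mirrors the split pattern in the solution, and the condition names are invented for illustration.

```python
import pandas as pd

# Made-up visits; the ':' delimiter mirrors the split pattern used above
toy = pd.DataFrame({
    'visit_occurrence_id': [1, 2, 3],
    'visit_type': ['Inpatient Visit', 'Outpatient Visit', 'Inpatient Visit'],
    'condition': ['Pneumonia : Hypoxemia', 'Cough', None],
})

long_form = (
    toy
    .assign(condition_item=toy['condition'].str.split(r"\s*:\s*"))
    .explode('condition_item')          # one row per (visit, condition) pair
    .dropna(subset=['condition_item'])  # visits with no recorded condition drop out
)

print(long_form[['visit_occurrence_id', 'visit_type', 'condition_item']])
```

Note that a visit with several conditions contributes one row to each of them, so `total_visits` summed across conditions can exceed the number of distinct visits.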

---
## Part 6: Interpretation and Synthesis

**Goal:** Move from patterns to meaningful insights.

### Exercise 6.1: Synthesis

Based on your exploration, write a brief summary (3-5 bullet points) of the most important findings from this dataset.

**Instructor notes - Sample answers:**

1. **Mortality and oxygen saturation:** There is a strong, statistically significant relationship between low oxygen saturation and mortality. Deceased patients had significantly lower O2 sat levels (mean ~85%) compared to survivors (mean ~95%).

2. **Age-related mortality gradient:** Mortality rates increase sharply with age, with elderly patients (65+) having the highest mortality rates. This suggests age is a critical risk factor.

3. **COVID impact on admissions:** COVID-suspected patients have higher inpatient admission rates compared to non-COVID patients, indicating more severe illness requiring hospitalization.

4. **Comorbidity burden:** Patients with 3+ conditions show worse vital signs on average, suggesting that condition count may be a useful proxy for illness severity.

5. **Temporal patterns:** Visit volume shows clear temporal patterns that likely reflect the COVID-19 pandemic timeline and seasonal respiratory illness patterns.

6. **Data quality concerns:** Several vital sign measurements contain extreme or implausible values that require investigation (e.g., temperatures >42°C, unrealistic blood pressure readings), suggesting data quality issues that need to be addressed before modeling.
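
One way to make the data quality finding concrete is a simple range check per vital sign. The sketch below runs on made-up readings. The limits in `plausible_ranges` are illustrative assumptions rather than clinical guidance, and should be agreed with clinical stakeholders before being applied to real data.

```python
import numpy as np
import pandas as pd

# Made-up readings; column names follow the workshop dataset
vitals = pd.DataFrame({
    'body_temperature_c': [36.8, 43.5, 37.2, np.nan],
    'oxygen_saturation_percent': [97.0, 88.0, 104.0, 95.0],
})

# Illustrative plausibility limits (assumptions, not clinical guidance)
plausible_ranges = {
    'body_temperature_c': (30.0, 42.0),
    'oxygen_saturation_percent': (50.0, 100.0),
}

flags = pd.DataFrame(index=vitals.index)
for col, (low, high) in plausible_ranges.items():
    # Missing values are not flagged here; they are a separate quality issue
    flags[col] = vitals[col].notna() & ~vitals[col].between(low, high)

print(flags)
print("\nImplausible values per column:")
print(flags.sum())
```

Flagging rather than deleting keeps the decision reversible: flagged rows can be inspected, corrected, or excluded later with the reason documented.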

### Exercise 6.2: Data Limitations

What are the key limitations of this dataset that affect what questions we can answer?

**Instructor notes - Sample limitations:**

1. **Missing data:** High rates of missing data in vital signs (~60-70% missing for non-COVID visits) limit our ability to compare COVID vs non-COVID patients on clinical measures.

2. **Selection bias:** This is hospital/clinical data, so we only observe patients who sought care. We don't know about patients with mild illness who didn't visit, or population-level incidence.

3. **Synthetic data limitations:** As synthetic data, it may not fully capture real-world complexities, rare conditions, or subtle relationships present in actual patient data.

4. **Temporal censoring:** We only see visits within a specific time window. Long-term outcomes, readmissions after the study period, or prior history before the data collection started are unknown.

5. **Limited treatment information:** We don't have data on treatments received, which makes causal inference about interventions impossible.

6. **Measurement variability:** Vital signs may be measured at different times during the visit, with different equipment, and by different staff, introducing measurement error.

7. **Condition coding:** Conditions are recorded as free text rather than standardized codes (ICD-10, SNOMED), making systematic analysis more difficult and potentially inconsistent.
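
The missing-data limitation is easiest to judge when missingness is broken out by group instead of being reported as one overall figure. The sketch below uses made-up visits. The `'Other'` label is a hypothetical placeholder for non-COVID visits, and the resulting percentages are not results from the workshop data.

```python
import numpy as np
import pandas as pd

# Made-up visits; 'Other' is a hypothetical label for non-COVID visits
toy = pd.DataFrame({
    'observation_source': ['Suspected COVID-19'] * 3 + ['Other'] * 4,
    'oxygen_saturation_percent': [92.0, 95.0, np.nan, np.nan, np.nan, 98.0, np.nan],
    'heart_rate_bpm': [88.0, np.nan, 91.0, np.nan, 72.0, np.nan, np.nan],
})

vital_cols = ['oxygen_saturation_percent', 'heart_rate_bpm']

# The mean of a boolean "is missing" mask is the fraction missing,
# computed here per group and per column
missing_by_group = (
    toy[vital_cols]
    .isna()
    .groupby(toy['observation_source'])
    .mean()
    * 100
)

print(missing_by_group.round(1))
```

When missingness differs this much between groups, comparisons restricted to complete rows describe only the subset of patients who had vitals recorded, which may not resemble the rest.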

### Exercise 6.3: Question Classification

For each of the following questions, identify which of the seven question types it represents and whether our data can answer it:

1. "What is the average oxygen saturation for COVID-suspected patients?"
2. "Would providing flu vaccines earlier reduce hospitalizations?"
3. "Can we predict which patients will be admitted as inpatients based on their vital signs?"
4. "Is there a significant difference in mortality between age groups?"

**Instructor answers:**

1. Question type: **Descriptive** | Data support: **YES** - We can calculate this directly from the data.

2. Question type: **Causal** | Data support: **NO** - This requires an intervention or experiment. Our observational data cannot establish causation, only correlation between vaccination and outcomes.

3. Question type: **Predictive** | Data support: **PARTIALLY** - We have the features (vital signs) and outcome (admission type), but we'd need to build and validate a model. Missing data is a concern.

4. Question type: **Inferential** | Data support: **YES** - We can calculate mortality rates by age group and perform statistical tests to assess if differences are significant.

**Teaching points:**
- Descriptive questions are the easiest to answer - they just require summarizing existing data
- Causal questions require experimental or quasi-experimental designs
- Predictive questions require building models and proper validation
- Inferential questions require statistical testing to determine if patterns generalize beyond our sample
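
For the inferential question about mortality across age groups, one common choice is a chi-square test of independence on the age group by outcome table. The sketch below uses hypothetical counts, not results from the workshop data, and only shows the mechanics with `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts (not results from the workshop data)
counts = pd.DataFrame(
    {'N': [480, 450, 400, 300], 'Y': [20, 50, 100, 200]},
    index=['Pediatric', 'Young Adult', 'Middle Age', 'Elderly'],
)
counts.index.name = 'age_group'
counts.columns.name = 'deceased'

# Null hypothesis: mortality is independent of age group
chi2, p_value, dof, expected = stats.chi2_contingency(counts)

print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
print(f"Smallest expected count: {expected.min():.1f}")
```

With real data, the counts would come from `pd.crosstab(df['age_group'], df['deceased'])` without normalization. The smallest expected count is worth printing because the chi-square approximation becomes unreliable when expected counts are very small.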

### Exercise 6.4: Next Steps

Based on your EDA, what would be your recommended next steps? Choose from:
- Communication and stakeholder alignment
- Modeling preparation
- Data and process redesign
- Goal refinement

Explain your reasoning.

**Instructor notes - Sample answer:**

**Recommended next steps:**

1. **Data and process redesign (HIGH PRIORITY)**
   - Address data quality issues: extreme/implausible vital sign values need investigation
   - Standardize condition coding (currently free text, inconsistent)
   - Improve vital sign capture for non-COVID patients (currently 60-70% missing)
   - Document measurement protocols to reduce variability

2. **Communication and stakeholder alignment (HIGH PRIORITY)**
   - Share findings about mortality risk factors (age, O2 saturation, comorbidities)
   - Discuss data quality concerns with data collection teams
   - Align on which questions are most valuable to answer given data limitations
   - Clarify whether causal questions (e.g., vaccine effectiveness) are goals - if so, different data collection is needed

3. **Goal refinement (MEDIUM PRIORITY)**
   - If the goal is prediction (e.g., admission risk), we can proceed with modeling
   - If the goal is causal inference, we need to redesign data collection or find alternative approaches
   - Clarify the intended use case and success criteria

4. **Modeling preparation (LOWER PRIORITY UNTIL DATA QUALITY IMPROVES)**
   - Only proceed with modeling after addressing data quality issues
   - Develop plan for handling missing vital signs data
   - Consider which modeling approach is appropriate given class imbalance (mortality is relatively rare)
   - Plan validation strategy

**Reasoning:** Data quality issues must be addressed before building models, as "garbage in, garbage out" applies. Stakeholder communication is essential to ensure we're solving the right problem and that stakeholders understand data limitations.

---
## Bonus Exercises

### Bonus 1: Predictive Model Preparation

Prepare features for a model that predicts inpatient admission based on vital signs and demographics.

1. Select relevant features
2. Handle missing values
3. Create the target variable
4. Check class balance

In [None]:
# Predictive modeling preparation

# 1. Select features
feature_cols = [
    'age_at_visit_years',
    'oxygen_saturation_percent',
    'respiratory_rate_per_minute',
    'heart_rate_bpm',
    'body_temperature_c',
    'systolic',
    'diastolic',
    'condition_count',
    'gender_source_value',
    'age_group'
]

# 3. Create target variable
modeling_df = covid_df.copy()
modeling_df['is_inpatient'] = modeling_df['visit_type'].str.contains('Inpatient', na=False).astype(int)

# 4. Check class balance
print("Target Variable Distribution:")
print(modeling_df['is_inpatient'].value_counts())
print(f"\nInpatient rate: {modeling_df['is_inpatient'].mean()*100:.1f}%")

# Check missing values in features
print("\nMissing Values in Features:")
missing_pct = modeling_df[feature_cols].isnull().sum() / len(modeling_df) * 100
print(missing_pct[missing_pct > 0])

# 2. Handle missing values - simple imputation strategy
from sklearn.impute import SimpleImputer

# For numeric columns, impute with median
numeric_features = ['age_at_visit_years', 'oxygen_saturation_percent',
                   'respiratory_rate_per_minute', 'heart_rate_bpm',
                   'body_temperature_c', 'systolic', 'diastolic', 'condition_count']

imputer = SimpleImputer(strategy='median')
modeling_df[numeric_features] = imputer.fit_transform(modeling_df[numeric_features])

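print("\nData prepared for modeling!")
print(f"Total samples: {len(modeling_df)}")
print(f"Features: {len(feature_cols)}")
# Note: the categorical features (gender_source_value, age_group) are not imputed or
# encoded above and still need encoding before fitting most models
print("Class balance (%):")
print((modeling_df['is_inpatient'].value_counts(normalize=True) * 100).round(1))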
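
Fitting an imputer on the full table, as the quick solution above does, lets information from future test rows leak into the imputed values. A hedged alternative, sketched here with plain pandas on simulated data, learns the median from the training rows only and keeps a flag recording which values were originally missing. The 80/20 split and the `o2_was_missing` column name are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated feature with 30% missing values (not the workshop data)
toy = pd.DataFrame({'oxygen_saturation_percent': rng.normal(95, 3, size=200)})
missing_idx = toy.sample(frac=0.3, random_state=0).index
toy.loc[missing_idx, 'oxygen_saturation_percent'] = np.nan

# Simple random 80/20 split; a real split might stratify or group by patient
train = toy.sample(frac=0.8, random_state=1)
test = toy.drop(train.index)

# Learn the median from the training rows only
train_median = train['oxygen_saturation_percent'].median()

# Apply that same value to both splits, keeping a missingness flag
train_prepared = train.assign(
    o2_was_missing=train['oxygen_saturation_percent'].isna(),
    oxygen_saturation_percent=train['oxygen_saturation_percent'].fillna(train_median),
)
test_prepared = test.assign(
    o2_was_missing=test['oxygen_saturation_percent'].isna(),
    oxygen_saturation_percent=test['oxygen_saturation_percent'].fillna(train_median),
)

print(f"Training median used for imputation: {train_median:.2f}")
print(f"Rows flagged as imputed: train={train_prepared['o2_was_missing'].sum()}, "
      f"test={test_prepared['o2_was_missing'].sum()}")
```

The missingness flag matters in this dataset because whether a vital sign was recorded at all may itself carry information about the visit.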

### Bonus 2: Condition Co-occurrence

Create a co-occurrence matrix showing which conditions tend to appear together.

In [None]:
# Create condition co-occurrence matrix
from itertools import combinations
from collections import defaultdict

# Get top 10 most common conditions
top_conditions = all_conditions.value_counts().head(10).index.tolist()

# Initialize co-occurrence matrix
cooc_matrix = pd.DataFrame(0, index=top_conditions, columns=top_conditions)

# Count co-occurrences
for conditions in df['condition_list'].dropna():
    # Clean, filter to top conditions, and de-duplicate so that a condition
    # repeated within one visit is not paired with itself
    conditions_clean = sorted({c.strip() for c in conditions if c.strip() in top_conditions})
    # Count pairs
    for cond1, cond2 in combinations(conditions_clean, 2):
        cooc_matrix.loc[cond1, cond2] += 1
        cooc_matrix.loc[cond2, cond1] += 1

# Visualize
plt.figure(figsize=(12, 10))
sns.heatmap(cooc_matrix, annot=True, fmt='d', cmap='YlOrRd', square=True, 
            cbar_kws={"shrink": 0.8})
plt.title('Condition Co-occurrence Matrix (Top 10 Conditions)')
plt.tight_layout()
plt.show()

print("\nTop condition pairs:")
# Get upper triangle
mask = np.triu(np.ones_like(cooc_matrix, dtype=bool), k=1)
pairs = cooc_matrix.where(mask).stack().sort_values(ascending=False)
print(pairs.head(10))
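
The pairwise loop is easy to read but can be slow on large tables. An alternative sketch, shown on made-up condition lists, builds a visit-by-condition indicator matrix and multiplies it by its own transpose. Unlike the loop above, this version fills the diagonal with each condition's own visit count.

```python
import pandas as pd

# Made-up condition lists, one per visit
condition_lists = pd.Series([
    ['Cough', 'Fever'],
    ['Cough', 'Fever', 'Pneumonia'],
    ['Pneumonia'],
    ['Cough', 'Pneumonia'],
])

# One row per (visit, condition) pair, keeping the visit index
exploded = condition_lists.explode().str.strip()

# One row per visit, one 0/1 column per condition;
# max() means a condition listed twice in a visit still counts once
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

# X^T X counts, for each pair of conditions, the visits containing both
cooc = indicators.T @ indicators

print(cooc)
```

The resulting matrix is symmetric, and the off-diagonal entries should match what the loop produces on the same input.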

### Bonus 3: Patient Journey Visualization

For patients with multiple visits, visualize their journey over time (conditions and visit types).

In [None]:
# Select a patient with multiple visits
multi_visit_patients = df.groupby('person_id').size()
patient_id = multi_visit_patients[multi_visit_patients >= 3].index[0]  # Get first patient with 3+ visits

patient_data = df[df['person_id'] == patient_id].sort_values('visit_start_date')

print(f"Patient Journey for Patient {patient_id}")
print(f"Total visits: {len(patient_data)}\n")

# Create timeline visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8))

# Plot 1: Visit types over time
visit_types = patient_data['visit_type'].map({'Inpatient Visit': 1, 'Outpatient Visit': 0})
ax1.scatter(patient_data['visit_start_date'], visit_types, s=200, c=visit_types, 
           cmap='RdYlGn_r', alpha=0.7, edgecolors='black', linewidth=2)
ax1.set_yticks([0, 1])
ax1.set_yticklabels(['Outpatient', 'Inpatient'])
ax1.set_ylabel('Visit Type')
ax1.set_title('Patient Journey: Visit Types Over Time')
ax1.grid(True, alpha=0.3, axis='x')

# Plot 2: Number of conditions over time
condition_counts = patient_data['condition_count']
ax2.plot(patient_data['visit_start_date'], condition_counts, marker='o', 
        markersize=10, linewidth=2, color='steelblue')
ax2.set_ylabel('Number of Conditions')
ax2.set_xlabel('Date')
ax2.set_title('Condition Count Over Time')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed visit information
print("\nDetailed Visit History:")
for idx, row in patient_data.iterrows():
    print(f"\nVisit {row['visit_occurrence_id']}:")
    print(f"  Date: {row['visit_start_date'].date()}")
    print(f"  Type: {row['visit_type']}")
    print(f"  Conditions: {row['condition']}")
    print(f"  Length of stay: {row['length_of_stay']} days")
    if pd.notna(row['oxygen_saturation_percent']):
        print(f"  O2 Saturation: {row['oxygen_saturation_percent']:.1f}%")
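
A numeric companion to the timeline is the gap between consecutive visits for the same patient. The sketch below uses made-up visits with column names borrowed from the workshop dataset. The `days_since_prior_visit` column is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Made-up visits; column names follow the workshop dataset
visits = pd.DataFrame({
    'person_id': [1, 1, 1, 2, 2],
    'visit_start_date': pd.to_datetime(
        ['2020-03-01', '2020-03-10', '2020-06-15', '2020-04-02', '2020-04-05']
    ),
    'visit_type': ['Outpatient Visit', 'Inpatient Visit', 'Outpatient Visit',
                   'Outpatient Visit', 'Inpatient Visit'],
})

# Order visits within each patient before differencing
visits = visits.sort_values(['person_id', 'visit_start_date'])

# Days since the same patient's previous visit (NaN for a first visit)
visits['days_since_prior_visit'] = (
    visits.groupby('person_id')['visit_start_date'].diff().dt.days
)

print(visits)
```

Short gaps before an inpatient visit may be worth a closer look, since they can indicate escalation from outpatient care, though this toy table is too small to show any such pattern.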

---
## Summary: Key EDA Concepts

### The EDA Lifecycle
1. Clarify the analytical goal
2. Understand data provenance, structure, and integrity
3. Explore (distributional, relational, structural, comparative)
4. Interpret findings and refine hypotheses
5. Translate results into next steps

### The Seven Question Types
1. Descriptive — What happened?
2. Exploratory — What patterns exist?
3. Inferential — Does this generalize?
4. Predictive — What will happen?
5. Prescriptive — What should we do?
6. Causal — What if we intervene?
7. Mechanistic — What process produces this?

### The Four Exploration Dimensions
1. **Distributional** — Individual variable shape, outliers, missingness
2. **Relational** — Correlations, interactions between variables
3. **Structural** — Temporal, hierarchical, sequential patterns
4. **Comparative** — Differences across groups and time periods

### The Seven Core Principles
1. Let the data surprise you
2. Multiple encodings, multiple perspectives
3. Segment early, segment often
4. Explicitly check assumptions
5. Expect heterogeneity and drift
6. Distinguish signal from artifact
7. Document hypotheses and alternative explanations