In [1]:
import pandas as pd
import numpy as np

# Read every column as text so IDs and coded categories are not parsed as numbers
df = pd.read_csv("diabetic_data.csv", dtype=str, low_memory=False)


### Load Raw Dataset (diabetic_data.csv)

- **Purpose:** Load the original diabetes hospitalization dataset into memory for cleaning and exploration.
- **What this cell does:**
  - Imports core libraries: pandas and numpy.
  - Reads the CSV file using:
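    - `dtype=str`: loads every column as a string, which avoids mixed-type warnings and keeps ID columns from being parsed as numbers.
    - `low_memory=False`: parses the file in one chunk instead of several. With `dtype=str` no type inference takes place, so the flag is a harmless safeguard here rather than a requirement.
- **Output:** A raw DataFrame called **df** with 101,766 rows and 50 columns.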
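
The effect of `dtype=str` can be checked on a small in-memory sample. This is a minimal sketch, not a cell from the original notebook: the two-row CSV text is made up, and `keep_default_na=False` is an optional `read_csv` argument shown only for comparison. Depending on the pandas version, the default parser may treat the literal text `None` as missing, which matters for columns such as `A1Cresult`.

```python
import io
import pandas as pd

# Made-up sample shaped like a few diabetic_data.csv columns (illustrative only)
csv_text = "encounter_id,A1Cresult,weight\n0012,None,?\n0345,>8,[75-100)\n"

# Same settings as the cell above: every column is read as text
default_na = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Optional variant: disable the default NA markers so "None" stays a literal category
literal = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

# Leading zeros survive because nothing is parsed as a number
print(default_na["encounter_id"].tolist())
print(default_na["A1Cresult"].isna().sum(), "value(s) parsed as missing by default")
print(literal["A1Cresult"].tolist())
```

If the `None` category should stay distinct from genuinely missing data, the second variant is one way to keep it, at the cost of handling empty strings explicitly afterwards.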


In [4]:
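# Treat the dataset's "?" placeholder as a missing value
df = df.replace("?", pd.NA)
print("Raw shape:", df.shape)
# Class distribution of the original three-level target
df["readmitted"].value_counts(dropna=False)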


Raw shape: (101766, 50)


readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64

### Basic Cleaning: Replace Missing Values & Inspect Target

- **Replaces** all "?" placeholders in the dataset with proper missing values (`pd.NA`), allowing for consistent handling in analysis.
- **Prints** the raw shape of the dataset (rows × columns) for verification.
- **Displays** value counts for the readmitted column to inspect class distribution:
  - **NO** – not readmitted
  - **>30** – readmitted more than 30 days after discharge
  - **<30** – readmitted within 30 days (positive class for ML)
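
To see what the replacement does before running it on the full file, the same two steps can be tried on a toy frame. This is a sketch with made-up rows; the names `toy`, `missing_per_column`, and `class_share` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Made-up rows using the same "?" placeholder as the raw file
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "?"],
    "readmitted": ["NO", ">30", "<30", "NO"],
})

# Swap the placeholder for a real missing value
toy = toy.replace("?", pd.NA)

# Count missing entries per column to see where "?" was used
missing_per_column = toy.isna().sum()

# Class distribution as proportions rather than raw counts
class_share = toy["readmitted"].value_counts(normalize=True, dropna=False)

print(missing_per_column)
print(class_share)
```

Running `df.isna().sum()` on the real DataFrame in the same way would show which of the 50 columns relied on the `"?"` placeholder.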


In [5]:
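# Binary target: 1 if readmitted within 30 days, else 0
df["readmit_30d"] = (df["readmitted"] == "<30").astype("int8")
# Share of encounters in the positive class
df["readmit_30d"].mean()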


np.float64(0.11159915885462728)
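
The mean shown above is the share of encounters in the positive class. As a cross-check (an added sketch, not a cell from the notebook), the same figure can be recomputed from the class counts printed earlier. The majority-class baseline is included only as a reminder of how imbalanced the target is; it is not a result reported in the source.

```python
# Counts copied from the value_counts() output shown earlier
counts = {"NO": 54864, ">30": 35545, "<30": 11357}

total = sum(counts.values())
positive_rate = counts["<30"] / total

# Accuracy of a model that always predicts "no 30-day readmission"
majority_baseline = 1 - positive_rate

print(f"total encounters : {total:,}")
print(f"positive rate    : {positive_rate:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Because only about 11% of encounters are positive, plain accuracy is a weak yardstick for any model trained on this target.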

In [6]:
num_cols = [
    "time_in_hospital","num_lab_procedures","num_procedures","num_medications",
    "number_outpatient","number_emergency","number_inpatient","number_diagnoses"
]
# Convert from string; anything that cannot be parsed becomes NaN
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")


### Convert Numeric Columns to Proper Data Types

- **Purpose:** Ensure that key clinical and utilization features are stored as numeric types for analysis and modeling.
- **What this cell does:**
  - Defines a list of columns that are expected to be numeric (e.g., length of stay, procedure counts).
  - Iterates through each column and converts it to a numeric data type using `pd.to_numeric()`.
    - Invalid entries (e.g., non-numeric strings or missing values) are coerced to `NaN`.
- **Why it's important:** This conversion enables correct statistical summaries, visualizations, and ML model input.
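
Since `errors="coerce"` silently turns bad values into `NaN`, it can be worth counting how many values were lost. The following is a small sketch on a made-up Series; `raw`, `numeric`, and `newly_missing` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Made-up column: two valid numbers, a placeholder, a true missing value, and junk text
raw = pd.Series(["3", "12", "?", None, "abc"], dtype="object")

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present as text but could not be converted
newly_missing = numeric.isna() & raw.notna()

print(numeric.tolist())
print("values lost to coercion:", int(newly_missing.sum()))
```

Applying the same comparison to each column in `num_cols` would confirm whether the conversion dropped anything in the real data.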


In [7]:
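# Pull the lower and upper bound out of labels such as "[60-70)"
age_bins = df["age"].str.extract(r"\[(\d+)-(\d+)\)").astype(float)
# Midpoint of the bracket, e.g. 65.0 for "[60-70)"
df["age_mid"] = (age_bins[0] + age_bins[1]) / 2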


### Convert Age Ranges to Numeric Midpoints

- **Purpose:** Transform the `age` column (which contains string ranges like `[60-70)`) into a numeric feature for modeling.
- **What this cell does:**
  - Uses **regex** to extract the lower and upper bounds from the age range string.
  - Converts those bounds to `float`.
  - Computes the **midpoint** of each age bin and stores it in a new column: `age_mid`.
    - Example: `[60-70)` → `(60 + 70) / 2 = 65.0`
- **Why it's useful:** Most ML estimators need numeric inputs, and the midpoint keeps the natural ordering of the age brackets that a raw string label would lose.
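
The extraction can be tried on a few labels in isolation. This is a minimal sketch using the same regular expression as the cell above; the sample values, including the non-matching `"unknown"` entry, are made up to show what happens when a label does not fit the pattern.

```python
import pandas as pd

# Made-up labels in the dataset's bracket format, plus one that does not match
age = pd.Series(["[0-10)", "[60-70)", "[90-100)", "unknown"], dtype="object")

# Two capture groups give two columns: lower bound (0) and upper bound (1)
bounds = age.str.extract(r"\[(\d+)-(\d+)\)").astype(float)

# Midpoint of each bracket; rows that did not match stay NaN
age_mid = (bounds[0] + bounds[1]) / 2

print(age_mid.tolist())
```

A label that fails to match produces `NaN` rather than an error, so checking `age_mid.isna().sum()` after the conversion is a cheap safeguard.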


In [8]:
simple = df[[
    "encounter_id","patient_nbr","readmit_30d",
    "age_mid","gender","race",
    "time_in_hospital","num_lab_procedures","num_procedures","num_medications",
    "number_outpatient","number_emergency","number_inpatient","number_diagnoses",
    "A1Cresult","max_glu_serum",
    "admission_type_id","admission_source_id","discharge_disposition_id"
]].rename(columns={
    "encounter_id":"admission_id",
    "patient_nbr":"patient_id",
    "time_in_hospital":"los_days"
})
simple.shape


(101766, 19)

### Create Simplified Modeling Dataset (`simple`)

- **Purpose:** Build a clean, compact DataFrame for modeling that includes essential predictors and the target label.
- **What this cell does:**
  - Selects a curated list of relevant columns from the raw DataFrame `df`.
  - Renames key columns for clarity:
    - `encounter_id` → `admission_id`
    - `patient_nbr` → `patient_id`
    - `time_in_hospital` → `los_days`
  - Stores the result in a new DataFrame called `simple`.
- **Columns include:**
  - Patient and admission IDs
  - Target variable: `readmit_30d`
  - Demographics: `age_mid`, `gender`, `race`
  - Encounter info: length of stay, lab/procedure counts, utilization history
  - Clinical markers: `A1Cresult`, `max_glu_serum`
  - Admission context: type, source, discharge disposition
- **Output:** Displays the shape of the simplified DataFrame, `(101766, 19)`, confirming that every row was kept and 19 columns were selected.
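
The select-and-rename pattern is easy to verify on a toy frame. The sketch below is an illustration under assumptions: the rows are made up, and the explicit check for missing columns is an added safeguard that the notebook cell does not include.

```python
import pandas as pd

# Made-up frame with a subset of the raw column names plus one unused column
toy = pd.DataFrame({
    "encounter_id": ["10", "11", "12"],
    "patient_nbr": ["7", "7", "8"],
    "time_in_hospital": [3, 5, 1],
    "weight": [pd.NA, pd.NA, pd.NA],
})

keep = ["encounter_id", "patient_nbr", "time_in_hospital"]
rename_map = {
    "encounter_id": "admission_id",
    "patient_nbr": "patient_id",
    "time_in_hospital": "los_days",
}

# Fail early with a readable message if an expected column is absent
missing = [c for c in keep if c not in toy.columns]
if missing:
    raise KeyError(f"columns not found: {missing}")

simple_toy = toy[keep].rename(columns=rename_map)

print(simple_toy.columns.tolist())
print("admission_id unique:", simple_toy["admission_id"].is_unique)
print("distinct patients  :", simple_toy["patient_id"].nunique())
```

The last two checks are a reminder that one patient can have several encounters, which is relevant when splitting data for modeling.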


In [9]:

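out_path = "readmit_simple.csv"
display(simple.head())  # display() is provided by Jupyter/IPython

# Write without the index so the file holds exactly the selected columns
simple.to_csv(out_path, index=False)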
print(f"Saved: {out_path}  | rows={len(simple):,}  cols={simple.shape[1]}")


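|   | admission_id | patient_id | readmit_30d | age_mid | gender | race | los_days | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | A1Cresult | max_glu_serum | admission_type_id | admission_source_id | discharge_disposition_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | 0 | 5.0 | Female | Caucasian | 1 | 41 | 0 | 1 | 0 | 0 | 0 | 1 | NaN | NaN | 6 | 1 | 25 |
| 1 | 149190 | 55629189 | 0 | 15.0 | Female | Caucasian | 3 | 59 | 0 | 18 | 0 | 0 | 0 | 9 | NaN | NaN | 1 | 7 | 1 |
| 2 | 64410 | 86047875 | 0 | 25.0 | Female | AfricanAmerican | 2 | 11 | 5 | 13 | 2 | 0 | 1 | 6 | NaN | NaN | 1 | 7 | 1 |
| 3 | 500364 | 82442376 | 0 | 35.0 | Male | Caucasian | 2 | 44 | 1 | 16 | 0 | 0 | 0 | 7 | NaN | NaN | 1 | 7 | 1 |
| 4 | 16680 | 42519267 | 0 | 45.0 | Male | Caucasian | 1 | 51 | 0 | 8 | 0 | 0 | 0 | 5 | NaN | NaN | 1 | 7 | 1 |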


Saved: readmit_simple.csv  | rows=101,766  cols=19


### Preview and Save the Cleaned Dataset (`readmit_simple.csv`)

- **Purpose:** Save the cleaned and simplified dataset for future analysis or modeling steps.
- **What this cell does:**
  - Displays the first few rows of the `simple` DataFrame using `display(simple.head())` for verification.
  - Saves the `simple` DataFrame to a CSV file named `readmit_simple.csv` in the current working directory.
  - Prints a confirmation message showing:
    - File name
    - Number of rows (formatted with commas)
    - Number of columns
- **Why this matters:** Ensures that you have a ready-to-use version of the dataset stored on disk for reproducibility and backup.
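
A saved CSV does not store column types, so a later `read_csv` will infer them again. The sketch below is an added illustration that writes a made-up frame to an in-memory buffer instead of disk and reads it back, passing `dtype` for the ID column so it stays a string.

```python
import io
import pandas as pd

# Made-up rows with a string ID, an integer target, and one missing category
toy = pd.DataFrame({
    "admission_id": ["2278392", "149190"],
    "readmit_30d": [0, 1],
    "A1Cresult": [pd.NA, ">8"],
})

# Write to an in-memory buffer; index=False mirrors the cell above
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without the dtype hint, admission_id would be inferred as an integer column
reloaded = pd.read_csv(buffer, dtype={"admission_id": str})

print(reloaded.dtypes)
print(reloaded.shape)
```

Comparing `reloaded.shape` with the original shape is a quick way to confirm that the file on disk matches what was in memory.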


## Dataset Overview – Predicting Hospital Readmissions

### Data Source
- The dataset is sourced from the **UCI Machine Learning Repository**:  
  [Diabetes 130-US hospitals for years 1999–2008 Data Set](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008)
- It contains de-identified patient records related to diabetes admissions.

### File Information
- **File name:** `diabetic_data.csv`
- **Data size:** ~18 MB
- **Shape:** 101,766 rows × 50 columns
- **Time Period:** 1999–2008
- **Each row represents:** a single hospital encounter for a diabetic patient.

---

### Data Dictionary (Simplified Modeling Subset)

| Column Name               | Data Type | Description                                            | Sample / Categories                     |
|---------------------------|-----------|--------------------------------------------------------|------------------------------------------|
| `admission_id`            | string    | Unique hospital encounter ID                           | e.g., 100006                             |
| `patient_id`              | string    | De-identified patient number                           | e.g., 112186                             |
| `readmit_30d`             | int       | Target: 1 if readmitted within 30 days, else 0         | {0, 1}                                   |
| `age_mid`                 | float     | Midpoint of patient age bracket (from `age`)           | e.g., 65.0 for `[60-70)`                 |
| `gender`                  | category  | Patient gender                                         | `Male`, `Female`, `Unknown/Invalid`      |
| `race`                    | category  | Patient race                                           | `Caucasian`, `AfricanAmerican`, etc.     |
| `los_days`                | int       | Length of hospital stay (days)                         | e.g., 3, 5                                |
| `num_lab_procedures`      | int       | Number of lab tests during encounter                   | e.g., 40                                 |
| `num_procedures`          | int       | Number of procedures performed                         | e.g., 1, 0                                |
| `num_medications`         | int       | Number of medications prescribed                       | e.g., 13                                 |
| `number_outpatient`       | int       | Outpatient visits in prior year                        | e.g., 0, 2                                |
| `number_emergency`        | int       | Emergency visits in prior year                         | e.g., 0, 1                                |
| `number_inpatient`        | int       | Inpatient visits in prior year                         | e.g., 0, 3                                |
| `number_diagnoses`        | int       | Number of diagnoses recorded                           | e.g., 9                                  |
| `A1Cresult`               | category  | A1C test result (`None` = not measured)                | `>8`, `Norm`, `None`, `>7`               |
| `max_glu_serum`           | category  | Glucose serum test result (`None` = not measured)      | `>200`, `Norm`, `None`, `>300`           |
| `admission_type_id`       | category  | Type of hospital admission                             | `1`, `2`, ..., `8` (mapped via lookup)   |
| `admission_source_id`     | category  | Source of patient admission                            | `1`, `2`, ..., `25` (mapped via lookup)  |
| `discharge_disposition_id`| category  | Discharge destination or status                        | `1`, `2`, ..., `30` (mapped via lookup)  |

---

### Target Variable
- **`readmit_30d`** — A binary indicator of whether the patient was readmitted within 30 days of discharge.

---

### Feature Variables (Predictors)
The following columns are selected as features for machine learning:
- **Demographics:** `age_mid`, `gender`, `race`
- **Utilization:** `los_days`, `num_lab_procedures`, `num_procedures`, `num_medications`
- **Visit History:** `number_outpatient`, `number_emergency`, `number_inpatient`
- **Clinical Markers:** `number_diagnoses`, `A1Cresult`, `max_glu_serum`
- **Admission Details:** `admission_type_id`, `admission_source_id`, `discharge_disposition_id`
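
One way to turn these groups into model inputs is sketched below. It is an assumption-laden illustration, not the notebook's method: the rows are made up, only a few of the listed columns are included, and one-hot encoding with `pd.get_dummies` is just one possible choice for the categorical features.

```python
import pandas as pd

# Made-up rows with a subset of the columns in the simplified dataset
toy = pd.DataFrame({
    "admission_id": ["1", "2", "3"],
    "patient_id": ["9", "9", "8"],
    "readmit_30d": [0, 1, 0],
    "age_mid": [65.0, 75.0, 45.0],
    "los_days": [3, 5, 1],
    "gender": ["Male", "Female", "Female"],
})

id_cols = ["admission_id", "patient_id"]  # identifiers, not predictors
target = "readmit_30d"

# Drop identifiers and the target, then one-hot encode the categorical column
X = pd.get_dummies(toy.drop(columns=id_cols + [target]), columns=["gender"])
y = toy[target]

print(X.columns.tolist())
print(y.tolist())
```

Keeping the ID columns out of `X` avoids leaking row identity into the model while still leaving them available for grouping by patient.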

---

### Notes
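- Placeholder values of `"?"` were replaced with missing values (`pd.NA`).
- In the preview above, `A1Cresult` and `max_glu_serum` are empty in all five rows, which suggests the literal `None` category was parsed as missing on load. If that category should stay distinct, `read_csv` accepts `keep_default_na=False`.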
- Age ranges like `[60-70)` were converted to numeric midpoints.
- Categorical ID columns can be mapped to readable labels using the `IDS_mapping.csv` file.
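
Mapping an ID column to labels can be sketched with a plain dictionary. The label dictionary here is hypothetical and covers only three codes for illustration; the real labels should be read from the mapping file, and `"Other/unmapped"` is a made-up fallback value.

```python
import pandas as pd

# Hypothetical subset of admission type labels; verify against the mapping file
admission_type_labels = {"1": "Emergency", "2": "Urgent", "3": "Elective"}

# IDs are strings because the raw file was loaded with dtype=str
ids = pd.Series(["1", "3", "6"], name="admission_type_id")

# Codes missing from the dictionary become NaN, then get the fallback label
labels = ids.map(admission_type_labels).fillna("Other/unmapped")

print(labels.tolist())
```

Because the IDs are stored as strings, the dictionary keys must be strings as well; integer keys would leave every row unmapped.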
