# 🏗️ Phase 2: Feature Engineering – MedAdhereAI

In this phase, we engineer meaningful features from patient-level refill data.  
We aim to transform transactional claim data into aggregated, interpretable metrics that can improve adherence prediction models.


### 🔹 Step 1: Load Cleaned and Sorted Dataset

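We reload the raw claims file and repeat the Phase 1 preparation (date parsing, sorting, and refill-gap calculation) so that this notebook runs on its own.  
After this step and the label step that follows, the working dataset includes: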
- Binary target (`ADHERENT_BINARY`)
- Date features (`SERVICE DATE`, `BIRTHDATE`, etc.)
- Time gaps (`DAYS_SINCE_LAST_REFILL`)


In [22]:
import pandas as pd

# Load dataset
df = pd.read_csv('../dataset/raw/Diabetes Adherence Data.csv')

# Convert service date just to be safe
df['SERVICE DATE'] = pd.to_datetime(df['SERVICE DATE'], errors='coerce')

# Sort and calculate refill gap
df = df.sort_values(by=['MEMBER', 'SERVICE DATE'])
df['DAYS_SINCE_LAST_REFILL'] = df.groupby('MEMBER')['SERVICE DATE'].diff().dt.days

# Preview
df[['MEMBER', 'SERVICE DATE', 'DAYS_SINCE_LAST_REFILL']].head()


| | MEMBER | SERVICE DATE | DAYS_SINCE_LAST_REFILL |
|---|---|---|---|
| 47110 | 92222888 | 2022-06-02 | NaN |
| 47109 | 92222888 | 2022-09-01 | 91.0 |
| 20218 | 92222888 | 2022-09-07 | 6.0 |
| 47111 | 92222888 | NaT | NaN |
| 47112 | 92222888 | NaT | NaN |


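Refill gaps are computed per member. A patient's first visit has no predecessor, so its gap is `NaN`. Rows whose `SERVICE DATE` is missing or could not be parsed (`NaT`) sort last within each member and also get a `NaN` gap. The dataset is now ready for patient-level feature aggregation.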
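
To make the gap logic concrete, here is a minimal sketch on a hypothetical toy frame (the member IDs and dates are made up for illustration and are not from the real dataset). It uses the same `to_datetime`, `sort_values`, and `groupby(...).diff()` calls as the cell above.

```python
import pandas as pd

# Hypothetical toy claims, not taken from the real dataset
toy = pd.DataFrame({
    "MEMBER": [1, 1, 1, 2, 2],
    "SERVICE DATE": ["2022-06-02", "2022-09-01", "unknown", "2022-01-10", "2022-01-10"],
})

# errors="coerce" turns the unparseable string into NaT instead of raising
toy["SERVICE DATE"] = pd.to_datetime(toy["SERVICE DATE"], errors="coerce")

# Sort within each member (NaT sorts last), then take the row-to-row difference
toy = toy.sort_values(by=["MEMBER", "SERVICE DATE"])
toy["DAYS_SINCE_LAST_REFILL"] = toy.groupby("MEMBER")["SERVICE DATE"].diff().dt.days

gaps = toy["DAYS_SINCE_LAST_REFILL"].tolist()
print(gaps)  # [nan, 91.0, nan, nan, 0.0]
```

Member 1 gets `NaN` for the first visit and for the undated row. Member 2 has two claims on the same day, which gives a real gap of `0.0`.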

### Create Adherence Label
We recreate the binary adherence label since this notebook starts from raw data.  
This ensures we can merge it later with the patient-level feature set.


In [23]:
# Recreate adherence label: 1 if ADHERENCE >= 8, else 0 (missing values map to 0)
df['ADHERENT_BINARY'] = (df['ADHERENCE'] >= 8).astype(int)


### 🔹 Step 2: Aggregate Refill Features Per Patient

We now generate patient-level refill metrics using the `DAYS_SINCE_LAST_REFILL` and `SERVICE DATE` columns.

These features include:
- Average gap between refills
- Maximum gap (longest break)
- Total number of visits
- First and last refill date
- Refill duration (in days)

These engineered features capture refill regularity, frequency, and duration for each patient.


In [24]:
# Group by MEMBER and calculate refill stats
agg_refill = df.groupby('MEMBER').agg(
    avg_refill_gap=('DAYS_SINCE_LAST_REFILL', 'mean'),
    max_refill_gap=('DAYS_SINCE_LAST_REFILL', 'max'),
    total_visits=('SERVICE DATE', 'count'),
    first_refill_date=('SERVICE DATE', 'min'),
    last_refill_date=('SERVICE DATE', 'max')
)

# Duration of observed refill behavior (in days)
agg_refill['refill_duration_days'] = (agg_refill['last_refill_date'] - agg_refill['first_refill_date']).dt.days

# Preview
agg_refill.head()


| MEMBER | avg_refill_gap | max_refill_gap | total_visits | first_refill_date | last_refill_date | refill_duration_days |
|---|---|---|---|---|---|---|
| 92222888 | 48.5 | 91.0 | 3 | 2022-06-02 | 2022-09-07 | 97.0 |
| 92222969 | 26.444444 | 149.0 | 10 | 2022-01-08 | 2022-09-03 | 238.0 |
| 92223895 | 0.0 | 0.0 | 2 | 2022-07-12 | 2022-07-12 | 0.0 |
| 92224675 | 63.5 | 244.0 | 5 | 2022-01-01 | 2022-09-12 | 254.0 |
| 92225985 | 78.5 | 95.0 | 3 | 2022-06-04 | 2022-11-08 | 157.0 |


### Aggregated Refill Metrics

Each patient now has key refill behavior metrics:

- `avg_refill_gap`: Average days between consecutive dated visits (e.g., 48.5, 63.5)
- `max_refill_gap`: Longest gap between refills (up to 244 days in this preview)
- `total_visits`: Number of records with a valid `SERVICE DATE` (2 to 10 in this preview)
- `refill_duration_days`: Days between the first and last dated refill (0 to 254 in this preview)

These engineered features capture **refill frequency, regularity, and range**, and will feed directly into our model to predict medication adherence.
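
As a worked check, the sketch below rebuilds the aggregated row for member 92222888 from the five service dates shown in the Step 1 preview. It is a standalone illustration rather than part of the pipeline; the variable names `one_member`, `stats`, and `row` are introduced here only for the example.

```python
import pandas as pd

# Service dates for member 92222888 as shown in the Step 1 preview
one_member = pd.DataFrame({
    "MEMBER": [92222888] * 5,
    "SERVICE DATE": pd.to_datetime(
        ["2022-06-02", "2022-09-01", "2022-09-07", None, None]
    ),
})
one_member["DAYS_SINCE_LAST_REFILL"] = (
    one_member.groupby("MEMBER")["SERVICE DATE"].diff().dt.days
)

stats = one_member.groupby("MEMBER").agg(
    avg_refill_gap=("DAYS_SINCE_LAST_REFILL", "mean"),  # NaN gaps are skipped
    max_refill_gap=("DAYS_SINCE_LAST_REFILL", "max"),
    total_visits=("SERVICE DATE", "count"),             # counts non-missing dates only
    first_refill_date=("SERVICE DATE", "min"),
    last_refill_date=("SERVICE DATE", "max"),
)
stats["refill_duration_days"] = (
    stats["last_refill_date"] - stats["first_refill_date"]
).dt.days

row = stats.loc[92222888]
print(row["avg_refill_gap"], row["max_refill_gap"], row["total_visits"])
```

The two gaps are 91 and 6 days, so the mean is 48.5 and the maximum is 91. `count` ignores the two undated rows, which is why `total_visits` is 3 and not 5.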


### 🔹 Step 3: Merge Adherence Labels

To prepare the dataset for classification, we attach the binary adherence label (`ADHERENT_BINARY`) to each patient’s aggregated refill features.

We’ll take the **most recent label** per patient, assuming it's based on their final known behavior.


In [25]:
# For each patient, get their latest adherence label
latest_labels = (
    df.sort_values(by=['MEMBER', 'SERVICE DATE'])
      .groupby('MEMBER')['ADHERENT_BINARY']
      .last()
)

# Merge into the aggregated refill dataset
agg_refill = agg_refill.merge(latest_labels, left_index=True, right_index=True)

# Preview result
agg_refill.head()


| MEMBER | avg_refill_gap | max_refill_gap | total_visits | first_refill_date | last_refill_date | refill_duration_days | ADHERENT_BINARY |
|---|---|---|---|---|---|---|---|
| 92222888 | 48.5 | 91.0 | 3 | 2022-06-02 | 2022-09-07 | 97.0 | 0 |
| 92222969 | 26.444444 | 149.0 | 10 | 2022-01-08 | 2022-09-03 | 238.0 | 1 |
| 92223895 | 0.0 | 0.0 | 2 | 2022-07-12 | 2022-07-12 | 0.0 | 1 |
| 92224675 | 63.5 | 244.0 | 5 | 2022-01-01 | 2022-09-12 | 254.0 | 1 |
| 92225985 | 78.5 | 95.0 | 3 | 2022-06-04 | 2022-11-08 | 157.0 | 1 |


We now have a complete patient-level dataset containing engineered refill features and the binary adherence label.

This dataset is ready for further enrichment (e.g., demographics) or direct use in training classification models.
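
One caveat on the "most recent label" in Step 3: `sort_values` places rows with a missing `SERVICE DATE` last, so `.last()` can return the label from an undated record. The sketch below shows this on hypothetical toy rows. The `dropna` variant is one possible alternative, not what the notebook does.

```python
import pandas as pd

# Hypothetical records: the undated row carries a different label
toy = pd.DataFrame({
    "MEMBER": [1, 1, 1],
    "SERVICE DATE": pd.to_datetime(["2022-03-01", "2022-06-01", None]),
    "ADHERENT_BINARY": [0, 1, 0],
})

ordered = toy.sort_values(by=["MEMBER", "SERVICE DATE"])

# As in Step 3: the last row per member, which here is the undated one
label_all_rows = ordered.groupby("MEMBER")["ADHERENT_BINARY"].last()

# Alternative: only consider rows with a known service date
label_dated_only = (
    ordered.dropna(subset=["SERVICE DATE"])
           .groupby("MEMBER")["ADHERENT_BINARY"]
           .last()
)

print(label_all_rows.loc[1], label_dated_only.loc[1])  # 0 1
```

With the alternative, patients who have no dated records at all would get no label and would drop out of an inner merge, so the choice depends on whether those patients should stay in the modeling set.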


### 🔹 Step 4: Merge Demographic Features (GENDER and AGE)

We enrich the aggregated dataset with patient demographics.  
`GENDER` and `BIRTHDATE` are taken from the first known record for each patient.  
We then compute `AGE` and merge both into the main feature table.


In [26]:
demo = (
    df.sort_values(by=['MEMBER', 'SERVICE DATE'])
      .groupby('MEMBER')[['GENDER', 'BIRTHDATE']]
      .first()
)

# Fix: Convert BIRTHDATE to datetime
demo['BIRTHDATE'] = pd.to_datetime(demo['BIRTHDATE'], errors='coerce')

# Calculate age
demo['AGE'] = pd.to_datetime('today').year - demo['BIRTHDATE'].dt.year

# Merge into agg_refill
agg_refill = agg_refill.merge(demo[['GENDER', 'AGE']], left_index=True, right_index=True)

# Preview
agg_refill.head()



| MEMBER | avg_refill_gap | max_refill_gap | total_visits | first_refill_date | last_refill_date | refill_duration_days | ADHERENT_BINARY | GENDER | AGE |
|---|---|---|---|---|---|---|---|---|---|
| 92222888 | 48.5 | 91.0 | 3 | 2022-06-02 | 2022-09-07 | 97.0 | 0 | M | 62 |
| 92222969 | 26.444444 | 149.0 | 10 | 2022-01-08 | 2022-09-03 | 238.0 | 1 | F | 65 |
| 92223895 | 0.0 | 0.0 | 2 | 2022-07-12 | 2022-07-12 | 0.0 | 1 | F | 72 |
| 92224675 | 63.5 | 244.0 | 5 | 2022-01-01 | 2022-09-12 | 254.0 | 1 | F | 39 |
| 92225985 | 78.5 | 95.0 | 3 | 2022-06-04 | 2022-11-08 | 157.0 | 1 | M | 59 |


GENDER and AGE have been successfully merged into the dataset.  
This completes the enrichment phase, and the dataset is now ready for missing value handling and export for modeling.
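
The `AGE` calculation above subtracts birth year from the current year, so the result depends on when the notebook is run and ignores whether the birthday has passed. A minimal sketch of an alternative follows, assuming a fixed reference date is acceptable. The helper `age_on` and the reference date are hypothetical and not part of the original notebook.

```python
import pandas as pd

def age_on(birthdates: pd.Series, reference: pd.Timestamp) -> pd.Series:
    """Age in completed years on a fixed reference date."""
    years = reference.year - birthdates.dt.year
    # True where the birthday has already occurred in the reference year
    had_birthday = (birthdates.dt.month < reference.month) | (
        (birthdates.dt.month == reference.month)
        & (birthdates.dt.day <= reference.day)
    )
    return years - (~had_birthday).astype(int)

# Hypothetical birthdates and reference date, for illustration only
births = pd.Series(pd.to_datetime(["1960-03-15", "1960-12-31"]))
ages = age_on(births, pd.Timestamp("2022-12-30"))
print(ages.tolist())  # [62, 61]
```

A natural reference date would be the end of the observation window, so that ages stay the same no matter when the notebook is re-run.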


### 🔹 Step 5: Handle Missing Values

We check for and handle any missing values to ensure that the final dataset is clean for modeling.

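- `avg_refill_gap` and `max_refill_gap`: `NaN` when the patient has fewer than two dated refills; we fill these with 0
- `first_refill_date`, `last_refill_date`, and `refill_duration_days`: missing when the patient has no valid `SERVICE DATE`; we drop these intermediate columns
- `AGE`: would be `NaN` for missing or invalid birthdates; the check below finds none, so no rows are dropped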


In [27]:
# Check missing values
agg_refill.isnull().sum()


avg_refill_gap          1455
max_refill_gap          1455
total_visits               0
first_refill_date        866
last_refill_date         866
refill_duration_days     866
ADHERENT_BINARY            0
GENDER                     0
AGE                        0
dtype: int64

The null check shows two groups of missing values:

- `avg_refill_gap` and `max_refill_gap` are missing for 1,455 patients with fewer than two dated visits, so no gap can be computed.  
  We fill these with `0.0`. Note that `0.0` is also a real value (same-day refills, as for member 92223895), so after filling, the two cases look identical to the model.

- `first_refill_date`, `last_refill_date`, and `refill_duration_days` are missing for 866 patients. `min` and `max` return `NaT` only when every `SERVICE DATE` for a patient is missing, so these patients have no dated visits at all and are a subset of the 1,455 above; the remaining 589 have exactly one dated visit.  
  These were intermediate fields used to derive refill metrics, so we drop them from the final dataset.

All other fields, including `total_visits`, `ADHERENT_BINARY`, `GENDER`, and `AGE`, are complete and require no action.
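
The link between the null counts and the number of dated visits can be checked on a small example. The sketch below uses three hypothetical members with zero, one, and two dated visits. The data is made up, and only the aggregation calls mirror Step 2.

```python
import pandas as pd

# Hypothetical members with zero, one, and two dated visits
toy = pd.DataFrame({
    "MEMBER": [1, 1, 2, 2, 3, 3],
    "SERVICE DATE": pd.to_datetime(
        [None, None, "2022-02-01", None, "2022-02-01", "2022-03-03"]
    ),
})
toy = toy.sort_values(by=["MEMBER", "SERVICE DATE"])
toy["DAYS_SINCE_LAST_REFILL"] = toy.groupby("MEMBER")["SERVICE DATE"].diff().dt.days

summary = toy.groupby("MEMBER").agg(
    avg_refill_gap=("DAYS_SINCE_LAST_REFILL", "mean"),
    total_visits=("SERVICE DATE", "count"),
    first_refill_date=("SERVICE DATE", "min"),
)

# total_visits separates the two reasons for a missing gap
no_dated_visits = summary["total_visits"] == 0
one_dated_visit = summary["total_visits"] == 1
print(summary)
```

Member 1 is missing both the gap and the first refill date, member 2 is missing only the gap, and member 3 has both. On the real table, `agg_refill['total_visits'].value_counts()` would give the same split without any extra computation.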


We apply both fixes in the next cell (fill the undefined gaps with `0.0`, drop the three intermediate date columns) and then re-run the null check:



In [28]:
# Fill missing gap features with 0
agg_refill['avg_refill_gap'] = agg_refill['avg_refill_gap'].fillna(0)
agg_refill['max_refill_gap'] = agg_refill['max_refill_gap'].fillna(0)

# Drop columns we no longer need
agg_refill = agg_refill.drop(columns=['first_refill_date', 'last_refill_date', 'refill_duration_days'])

# Final null check
agg_refill.isnull().sum()


avg_refill_gap     0
max_refill_gap     0
total_visits       0
ADHERENT_BINARY    0
GENDER             0
AGE                0
dtype: int64

The final null check confirms that all fields in the dataset are now complete:

- `avg_refill_gap` and `max_refill_gap` have been filled with `0.0` for one-visit patients
- Temporary date-based fields have been dropped
- No missing values remain in key fields like `AGE`, `GENDER`, `ADHERENT_BINARY`, or `total_visits`

The dataset is now fully cleaned and ready to export for modeling.
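
Because `0.0` is also a legitimate gap value (same-day refills), one optional refinement is to record which gaps were undefined before filling them. This is a sketch only: the column name `gap_undefined` and the toy rows are hypothetical, and the notebook itself does not add this flag.

```python
import pandas as pd

# Hypothetical patient-level rows: one true same-day gap, one undefined gap
features = pd.DataFrame(
    {"avg_refill_gap": [0.0, None, 48.5], "max_refill_gap": [0.0, None, 91.0]},
    index=pd.Index([101, 102, 103], name="MEMBER"),
)

# Record which gaps were undefined before overwriting them with 0
features["gap_undefined"] = features["avg_refill_gap"].isna().astype(int)
features[["avg_refill_gap", "max_refill_gap"]] = (
    features[["avg_refill_gap", "max_refill_gap"]].fillna(0)
)

print(features)
```

With the flag in place, a model can still distinguish member 101 (a real gap of zero days) from member 102 (no gap could be computed).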


### 🔹 Step 6: Export Final Modeling Dataset

Now that the dataset is fully cleaned, enriched, and validated, we export it for use in the next phase (model training).

We’ll save it both as a `.csv` for readability and a `.pkl` file for faster loading in Python.


In [30]:
# Save as CSV
agg_refill.to_csv('../dataset/processed/final_model_data.csv', index=True)

# Save as Pickle (faster for loading in notebooks)
agg_refill.to_pickle('../dataset/processed/final_model_data.pkl')

# Confirm shape
agg_refill.shape


(4444, 6)

The final dataset contains all the features needed for model training, including:

- Refill behavior metrics (`avg_refill_gap`, `max_refill_gap`, `total_visits`)
- Demographics (`AGE`, `GENDER`)
- Target variable (`ADHERENT_BINARY`)

Exporting it here allows us to keep a clear separation between Phase 2 (feature engineering) and Phase 3 (model building).  
It also lets Phase 3 load a saved file instead of re-running earlier notebooks. Note that `AGE` is derived from the run date, so the exported values change if this notebook is re-run in a later calendar year.
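
A quick round-trip check can confirm that both export formats reload to the same table. The sketch below writes to a temporary directory instead of `../dataset/processed/`, and it uses a two-row stand-in built from the values for members 92222888 and 92223895 shown in the earlier previews.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for the final feature table
final = pd.DataFrame(
    {
        "avg_refill_gap": [48.5, 0.0],
        "max_refill_gap": [91.0, 0.0],
        "total_visits": [3, 2],
        "ADHERENT_BINARY": [0, 1],
        "GENDER": ["M", "F"],
        "AGE": [62, 72],
    },
    index=pd.Index([92222888, 92223895], name="MEMBER"),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "final_model_data.csv"
    pkl_path = Path(tmp) / "final_model_data.pkl"

    final.to_csv(csv_path, index=True)
    final.to_pickle(pkl_path)

    # index_col restores MEMBER as the index instead of a regular column
    from_csv = pd.read_csv(csv_path, index_col="MEMBER")
    from_pkl = pd.read_pickle(pkl_path)

# Both reloads should match the original table
pd.testing.assert_frame_equal(from_csv, final)
pd.testing.assert_frame_equal(from_pkl, final)
```

When reading the CSV in Phase 3, passing `index_col="MEMBER"` matters: without it, `MEMBER` comes back as an ordinary column and could be picked up as a feature by mistake.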


## ✅ Phase 2 Summary: Feature Engineering Complete

In this phase, we transformed raw claim-level data into a structured, patient-level dataset suitable for modeling.

### 🔧 Key Actions Completed:
- Recalculated `DAYS_SINCE_LAST_REFILL` for each patient using `SERVICE DATE`
- Aggregated refill behavior metrics per patient:
  - `avg_refill_gap`, `max_refill_gap`, `total_visits`, and refill duration
- Merged the final adherence label (`ADHERENT_BINARY`) based on each patient’s most recent refill behavior
- Enriched the dataset with demographic features:
  - Extracted `GENDER` and calculated `AGE` from `BIRTHDATE`
- Handled missing values:
  - Filled undefined refill gaps with `0.0`
  - Dropped intermediate date fields not needed for modeling
- Exported the final dataset in both `.csv` and `.pkl` formats for downstream modeling

### 📦 Final Dataset Ready for Modeling:
The resulting dataset is clean, complete, and contains both predictive features and the supervised target.  
We are now ready to begin Phase 3: Model Training and Evaluation.
