# Data Cleaning

This notebook loads the raw encounter data and cleans it up before we do any revenue or KPI analysis.

The goal here is simple:
- Remove non-billable or incomplete encounters
- Convert dates and durations to proper datetime and numeric types
- Save a clean dataset for the next step

## Load the data

Read in the raw CSV file and take a quick look at the structure.

In [1]:
import pandas as pd
from pathlib import Path

DATA_PATH = Path("..") / "data" / "sample_monthly_data.csv"
df_raw = pd.read_csv(DATA_PATH)

df_raw.head()

FileNotFoundError: [Errno 2] No such file or directory: '..\\data\\sample_monthly_data.csv'
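
The `FileNotFoundError` above means the relative path did not resolve from the notebook's working directory. One way to make that easier to diagnose is to check the path before reading. This is a minimal sketch, not part of the original notebook: `load_raw` is a hypothetical helper name, and the message format is just one option.

```python
from pathlib import Path

import pandas as pd


def load_raw(path: Path) -> pd.DataFrame:
    """Read the raw CSV, failing early with the resolved path if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not path.exists():
        # Show the absolute path and the current working directory so a
        # wrong relative path is easy to spot.
        raise FileNotFoundError(
            f"Raw data not found at {path.resolve()} (cwd: {Path.cwd()})"
        )
    return pd.read_csv(path)
```

Calling `load_raw(DATA_PATH)` in place of `pd.read_csv(DATA_PATH)` would report where the notebook actually looked for the file.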

## Clean the data

Filter out:
- Encounters that aren't complete
- Encounters marked as non-billable

Also standardize dates and duration fields so everything is consistent.

In [2]:
df = df_raw.copy()

# Standardize types; values that cannot be parsed become NaT/NaN
df["encounter_date"] = pd.to_datetime(df["encounter_date"], errors="coerce")
df["duration_min"] = pd.to_numeric(df["duration_min"], errors="coerce")

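# Basic cleaning: drop rows missing required fields, then keep only
# complete, billable encounters (case-insensitive match)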
df = df.dropna(subset=["encounter_date", "cpt_code", "is_billable", "encounter_status"])
df = df[df["encounter_status"].str.lower().eq("complete")]
df = df[df["is_billable"].str.lower().eq("yes")]

df = df.reset_index(drop=True)
df.head()

NameError: name 'df_raw' is not defined
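
To see what the cleaning steps do without the real file, the same logic can be run on a small synthetic frame. This is only a sketch: the rows are made up, and `clean_encounters` is a hypothetical wrapper around the cell above. It assumes `is_billable` and `encounter_status` are stored as text, as the `.str` calls in the notebook imply.

```python
import pandas as pd

# Made-up rows for illustration only; the real file's values may differ.
sample = pd.DataFrame(
    {
        "encounter_date": ["2024-01-05", "not a date", "2024-01-07", "2024-01-08", "2024-01-09"],
        "cpt_code": ["99213", "99214", "99213", "99215", "99214"],
        "duration_min": ["30", "45", "abc", "20", "60"],
        "is_billable": ["Yes", "yes", "YES", "yes", "No"],
        "encounter_status": ["Complete", "complete", "COMPLETE", "Cancelled", "complete"],
    }
)

REQUIRED = ["encounter_date", "cpt_code", "is_billable", "encounter_status"]


def clean_encounters(df_in: pd.DataFrame) -> pd.DataFrame:
    """Apply the same steps as the cleaning cell (hypothetical wrapper)."""
    df = df_in.copy()
    # Unparseable dates become NaT, unparseable durations become NaN.
    df["encounter_date"] = pd.to_datetime(df["encounter_date"], errors="coerce")
    df["duration_min"] = pd.to_numeric(df["duration_min"], errors="coerce")
    # Rows missing any required field are dropped (duration_min is not required).
    df = df.dropna(subset=REQUIRED)
    df = df[df["encounter_status"].str.lower().eq("complete")]
    df = df[df["is_billable"].str.lower().eq("yes")]
    return df.reset_index(drop=True)


cleaned = clean_encounters(sample)
print(cleaned)
```

Because `duration_min` is not in the `dropna` subset, the row with duration `"abc"` is kept with a missing duration. Whether that is intended depends on how the next notebook uses durations.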

## Quick checks

Compare the number of rows before and after cleaning.

Also check the distribution of CPT codes (cpt_code) to make sure nothing unexpected was removed.

In [3]:
# Quick sanity checks!
print("Raw rows:", len(df_raw))
print("Clean rows:", len(df))

df["cpt_code"].value_counts()

NameError: name 'df_raw' is not defined
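
Row counts alone do not show why rows were removed. A possible extension is to count how many raw rows fail each rule. This is a sketch under assumptions: `removal_summary` is a hypothetical helper, the sample rows are made up, and a row that fails several rules is counted once per rule, so the counts can add up to more than the number of removed rows.

```python
import pandas as pd


def removal_summary(raw: pd.DataFrame) -> pd.Series:
    """Count raw rows failing each cleaning rule (hypothetical helper)."""
    dates = pd.to_datetime(raw["encounter_date"], errors="coerce")
    reasons = {
        "missing or invalid date": dates.isna(),
        "missing cpt_code": raw["cpt_code"].isna(),
        "not complete": ~raw["encounter_status"].str.lower().eq("complete"),
        "not billable": ~raw["is_billable"].str.lower().eq("yes"),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in reasons.items()})


# Made-up rows for illustration only.
raw_sample = pd.DataFrame(
    {
        "encounter_date": ["2024-01-05", "bad", "2024-01-07", "2024-01-08"],
        "cpt_code": ["99213", "99213", None, "99214"],
        "encounter_status": ["Complete", "complete", "complete", "Cancelled"],
        "is_billable": ["Yes", "yes", "yes", "No"],
    }
)

summary = removal_summary(raw_sample)
print(summary)
```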

## Save cleaned dataset

Write the cleaned data to:

data/clean_encounters.csv

This will be used in the next notebook for revenue calculations.

In [None]:
OUT_PATH = Path("..") / "data" / "clean_encounters.csv"
df.to_csv(OUT_PATH, index=False)
print("Wrote:", OUT_PATH)
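
CSV does not store column types, so the next notebook has to parse dates again when it reads the file. The sketch below shows a round trip on made-up data in a temporary directory. The `dtype={"cpt_code": str}` argument assumes CPT codes should stay as text; without it, all-numeric codes would be read back as integers.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up cleaned rows for illustration only.
cleaned_sample = pd.DataFrame(
    {
        "encounter_date": pd.to_datetime(["2024-01-05", "2024-01-07"]),
        "cpt_code": ["99213", "99214"],
        "duration_min": [30.0, 45.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean_encounters.csv"
    cleaned_sample.to_csv(out_path, index=False)

    # Restore the types that the CSV format does not preserve.
    reloaded = pd.read_csv(
        out_path,
        parse_dates=["encounter_date"],
        dtype={"cpt_code": str},
    )

print(reloaded.dtypes)
```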