***Assignment 2: Data Analytics***

**Clustering Using the Sessa Empirical Estimator**

- Laurenz Mesiah A. Palanas
- Emily Rose Escartin

<br>

\begin{gather}
\Large \textbf{Introduction}
\end{gather}



The Sessa Empirical Estimator (SEE) is a data-driven method for estimating the duration of pharmacological prescriptions from electronic health records. It clusters the time intervals between successive refills and uses the resulting groups to characterize medication adherence behaviors. This assignment implements SEE in Python and compares two clustering algorithms, K-Means and DBSCAN.





<br>


\begin{gather}
\Large \textbf{Methodology}
\end{gather}

1. **Data Preprocessing**
   - Load and clean the dataset.
   - Convert prescription dates to datetime format.
   - Compute event intervals between refills.

<br>

2. **Empirical Cumulative Distribution Function (ECDF) Analysis**
   - Visualize the distribution of event intervals to understand refill patterns.

<br>

3. **Clustering Techniques**
   - Apply **K-Means clustering** to classify adherence behaviors.
   - Use **DBSCAN clustering** as an alternative method.
   - Compare performance using the **Silhouette Score**.

<br>

4. **Evaluation and Comparison**
   - Compare results from K-Means and DBSCAN.
   - Assess clustering quality and interpret adherence patterns.
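
The clustering and evaluation steps above can be sketched on synthetic data. This is a minimal sketch, not the assignment's actual pipeline: the intervals are made-up draws around 30 and 90 days, and `n_clusters=2`, `eps=5`, and `min_samples=5` are illustrative assumptions. The DBSCAN silhouette score is computed on non-noise points only, because the label -1 marks noise rather than a cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic refill intervals (days): two made-up groups around 30 and 90 days.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "event_interval": np.concatenate([rng.normal(30, 3, 100), rng.normal(90, 5, 100)])
})
X = df[["event_interval"]].to_numpy()  # scikit-learn expects a 2-D array

# K-Means: the number of clusters is fixed up front (assumed to be 2 here).
df["kmeans"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
kmeans_score = silhouette_score(X, df["kmeans"])

# DBSCAN: clusters come from density; the label -1 marks noise points.
df["dbscan"] = DBSCAN(eps=5, min_samples=5).fit_predict(X)
mask = (df["dbscan"] != -1).to_numpy()
n_dbscan_clusters = df.loc[mask, "dbscan"].nunique()

# The silhouette score needs at least two clusters, so guard the call.
dbscan_score = (
    silhouette_score(X[mask], df.loc[mask, "dbscan"])
    if n_dbscan_clusters >= 2 else float("nan")
)

# Per-cluster summary to help interpret the groups as refill patterns.
summary = (
    df.groupby("kmeans")["event_interval"]
      .agg(n="count", median="median")
      .sort_values("median")
)

print(f"K-Means silhouette: {kmeans_score:.3f}")
print(f"DBSCAN clusters: {n_dbscan_clusters}, silhouette: {dbscan_score:.3f}")
print(summary)
```

A silhouette score close to 1 indicates compact, well-separated clusters, while the per-cluster median interval gives a rough reading of the typical refill gap in each group. On real data the two algorithms may disagree on the number of clusters, since DBSCAN infers it from density instead of taking it as an input.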

<br>



\begin{gather}
\Large \textbf{Python Implementation}
\end{gather}


### **Steps in the Code**:

**Step 1: Import Necessary Libraries**




In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

**Step 2: Load and Preprocess the Data**

In [5]:
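def preprocess_data(data):
    """Renames columns, parses dates, and computes per-patient intervals between refills."""
    data = data.copy()  # work on a copy so the caller's DataFrame is not modified
    data.columns = ['pnr', 'eksd', 'perday', 'ATC', 'dur_original']
    data['eksd'] = pd.to_datetime(data['eksd'])
    data = data.sort_values(by=['pnr', 'eksd'])
    # Previous refill date for the same patient (NaT for each patient's first refill)
    data['prev_eksd'] = data.groupby('pnr')['eksd'].shift(1)
    data['event_interval'] = (data['eksd'] - data['prev_eksd']).dt.days
    # Drops each patient's first refill (no interval) and any row with other missing values
    return data.dropna()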
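
To make the event-interval step concrete, here is a small self-contained sketch that repeats the same shift-and-subtract logic on a toy frame. The column names `pnr` and `eksd` follow the function above; the patient IDs and dates are hypothetical and are not taken from the assignment's dataset.

```python
import pandas as pd

# Hypothetical refill records for two patients (values are made up).
toy = pd.DataFrame({
    "pnr":  [1, 1, 1, 2, 2],
    "eksd": ["2020-01-01", "2020-01-31", "2020-03-01", "2020-02-10", "2020-03-26"],
})
toy["eksd"] = pd.to_datetime(toy["eksd"])
toy = toy.sort_values(["pnr", "eksd"])

# Previous refill date within each patient; the first refill has no predecessor (NaT).
toy["prev_eksd"] = toy.groupby("pnr")["eksd"].shift(1)
toy["event_interval"] = (toy["eksd"] - toy["prev_eksd"]).dt.days

# Keep only rows where an interval could be computed.
intervals = toy.dropna(subset=["event_interval"])
print(intervals[["pnr", "eksd", "event_interval"]])
```

Each patient's first refill drops out because it has no earlier date to compare against, so five records yield three intervals.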

**Step 3: ECDF Analysis**

In [10]:
def plot_ecdf(data):
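    """Plots the Empirical Cumulative Distribution Function (ECDF) of event intervals."""
    # Sorted intervals go on the x-axis; y is the fraction of observations <= each x
    x = np.sort(data['event_interval'].dropna())
    y = np.arange(1, len(x) + 1) / len(x)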
    plt.figure(figsize=(8, 6))
    plt.plot(x, y, marker='.', linestyle='none')
    plt.xlabel('Event Interval (Days)')
    plt.ylabel('ECDF')
    plt.title('Empirical Cumulative Distribution Function')
    plt.grid(True)
    plt.show()
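
The ECDF computation inside `plot_ecdf` can also be checked numerically without plotting. The sketch below uses a hypothetical helper `ecdf` and made-up interval values. The trimming step at the end is an assumption added for illustration: the 0.8 cut-off is not stated in this document and should be treated as a tunable choice.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and their ECDF heights (hypothetical helper)."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up event intervals in days, including one unusually long gap.
x, y = ecdf([10, 30, 20, 30, 120])
print(x)  # sorted intervals
print(y)  # fraction of observations <= each sorted interval

# Assumed trimming step: keep only intervals up to a chosen ECDF level,
# which discards the longest gaps before clustering.
cutoff = 0.8  # illustrative value, not taken from this document
x_trimmed = x[y <= cutoff]
print(x_trimmed)
```

Trimming on the ECDF rather than on a fixed number of days adapts the threshold to the data, at the cost of always discarding a fixed share of observations.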