# Healthcare Drug Persistence ML Project

## Pedram Doroudchi

### Problem Statement

Drug persistency may be defined as the extent to which a patient acts in accordance with the prescribed interval and dose of a dosing regimen. One of the challenges for pharmaceutical companies is understanding whether patients persist with a drug as prescribed by their physician. ABC Pharma Company would like us to automate the identification of persistent and non-persistent patients. To gather insights into the factors that affect persistency, we will build a classification model on the given dataset, which contains patient demographics, provider attributes, clinical factors, and disease/treatment factors, along with a target variable indicating whether the patient was persistent in their medication usage.

### Business Understanding

Many recent medical research papers highlight the importance of medication adherence in the healthcare industry [[1]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8124987/) [[2]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6103301/) [[3]](https://www.nature.com/articles/s41579-019-0196-3). These papers often cite persistence as a key indicator of adherence-related quality and performance. It is therefore essential to accurately identify patients who are less likely to adhere to their prescription so that their care can be tailored accordingly. Doing so may improve medication persistence, health outcomes, and healthcare efficiency worldwide.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('healthcare_dataset.csv')
print(df)

       Ptid Persistency_Flag  Gender           Race     Ethnicity   Region  \
0        P1       Persistent    Male      Caucasian  Not Hispanic     West   
1        P2   Non-Persistent    Male          Asian  Not Hispanic     West   
2        P3   Non-Persistent  Female  Other/Unknown      Hispanic  Midwest   
3        P4   Non-Persistent  Female      Caucasian  Not Hispanic  Midwest   
4        P5   Non-Persistent  Female      Caucasian  Not Hispanic  Midwest   
...     ...              ...     ...            ...           ...      ...   
3419  P3420       Persistent  Female      Caucasian  Not Hispanic    South   
3420  P3421       Persistent  Female      Caucasian  Not Hispanic    South   
3421  P3422       Persistent  Female      Caucasian  Not Hispanic    South   
3422  P3423   Non-Persistent  Female      Caucasian  Not Hispanic    South   
3423  P3424   Non-Persistent  Female      Caucasian  Not Hispanic    South   

     Age_Bucket        Ntm_Speciality Ntm_Specialist_Flag  \
0 

In [3]:
# descriptive statistics for each feature column
df.describe(include = 'all')

| | Ptid | Persistency_Flag | Gender | Race | Ethnicity | Region | Age_Bucket | Ntm_Speciality | Ntm_Specialist_Flag | Ntm_Speciality_Bucket | ... | Risk_Family_History_Of_Osteoporosis | Risk_Low_Calcium_Intake | Risk_Vitamin_D_Insufficiency | Risk_Poor_Health_Frailty | Risk_Excessive_Thinness | Risk_Hysterectomy_Oophorectomy | Risk_Estrogen_Deficiency | Risk_Immobilization | Risk_Recurring_Falls | Count_Of_Risks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | ... | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424 | 3424.0 |
| unique | 3424 | 2 | 2 | 4 | 3 | 5 | 4 | 36 | 2 | 3 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | |
| top | P1 | Non-Persistent | Female | Caucasian | Not Hispanic | Midwest | >75 | GENERAL PRACTITIONER | Others | OB/GYN/Others/PCP/Unknown | ... | N | N | N | N | N | N | N | N | N | |
| freq | 1 | 2135 | 3230 | 3148 | 3235 | 1383 | 1439 | 1535 | 2013 | 2104 | ... | 3066 | 3382 | 1788 | 3232 | 3357 | 3370 | 3413 | 3410 | 3355 | |
| mean | | | | | | | | | | | ... | | | | | | | | | | 1.239486 |
| std | | | | | | | | | | | ... | | | | | | | | | | 1.094914 |
| min | | | | | | | | | | | ... | | | | | | | | | | 0.0 |
| 25% | | | | | | | | | | | ... | | | | | | | | | | 0.0 |
| 50% | | | | | | | | | | | ... | | | | | | | | | | 1.0 |
| 75% | | | | | | | | | | | ... | | | | | | | | | | 2.0 |
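
The summary above lists `Non-Persistent` as the most frequent value of `Persistency_Flag`, with 2135 of 3424 rows (roughly 62%). A classifier that always predicts the majority class would therefore already reach about 62% accuracy, which is a useful baseline to keep in mind. The sketch below shows one way to tabulate the class balance. The helper name `class_balance` and the four-row sample frame are hypothetical and used only for illustration; on the real data the helper would be called on `df`.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "Persistency_Flag") -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = frame[target].value_counts()
    shares = frame[target].value_counts(normalize=True)
    # Both Series are indexed by class label, so they align when combined.
    return pd.DataFrame({"count": counts, "share": shares})


# Made-up sample for illustration only (not taken from the real dataset).
sample = pd.DataFrame(
    {
        "Persistency_Flag": [
            "Persistent",
            "Non-Persistent",
            "Non-Persistent",
            "Non-Persistent",
        ]
    }
)

balance = class_balance(sample)
print(balance)
```

On the notebook's data, `class_balance(df)` would report the same counts that `describe` summarizes, with the share of each class alongside.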


In [6]:
# find type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424 entries, 0 to 3423
Data columns (total 69 columns):
 #   Column                                                              Non-Null Count  Dtype 
---  ------                                                              --------------  ----- 
 0   Ptid                                                                3424 non-null   object
 1   Persistency_Flag                                                    3424 non-null   object
 2   Gender                                                              3424 non-null   object
 3   Race                                                                3424 non-null   object
 4   Ethnicity                                                           3424 non-null   object
 5   Region                                                              3424 non-null   object
 6   Age_Bucket                                                          3424 non-null   object
 7   Ntm_Speciality          
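
With 69 columns, the `df.info()` listing is long to scan by eye. As a rough sketch, the columns can be split into numeric and non-numeric groups programmatically. The helper name `split_columns_by_dtype` and the small sample frame are assumptions made for illustration; only the column names are taken from the output above.

```python
import pandas as pd


def split_columns_by_dtype(frame: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (non_numeric_columns, numeric_columns) for a DataFrame."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    # Everything that is not numeric (strings, objects) is treated as categorical here.
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    return non_numeric, numeric


# Made-up sample for illustration only.
sample = pd.DataFrame(
    {
        "Ptid": ["P1", "P2", "P3"],
        "Gender": ["Male", "Male", "Female"],
        "Count_Of_Risks": [0, 1, 2],
    }
)

categorical_cols, numeric_cols = split_columns_by_dtype(sample)
print("non-numeric:", categorical_cols)
print("numeric:", numeric_cols)
```

The non-numeric group is the set of columns that would typically need encoding before being passed to most classifiers.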

In [None]:
# find number of duplicate rows
df.duplicated().sum()

In [None]:
# find number of missing values
df.isnull().sum().sum()
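
The two checks above can be gathered into a single report, sketched below under a few assumptions. The helper name `quality_report`, its `placeholders` parameter, and the sample frame are hypothetical. The placeholder count is included because categories such as `Other/Unknown` appear in the data, and string placeholders like these are not counted by `isnull()`.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame, placeholders=("Unknown",)) -> dict:
    """Summarize duplicates, missing values, and placeholder strings."""
    missing_per_column = frame.isnull().sum()
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_values": int(missing_per_column.sum()),
        "columns_with_missing": missing_per_column[missing_per_column > 0].index.tolist(),
        # Cells whose value exactly matches one of the placeholder strings.
        "placeholder_values": int(frame.isin(list(placeholders)).sum().sum()),
    }


# Made-up sample for illustration only: one repeated row, one missing value.
sample = pd.DataFrame(
    {
        "Ptid": ["P1", "P2", "P2", "P4"],
        "Gender": ["Male", "Female", "Female", None],
        "Race": ["Caucasian", "Unknown", "Unknown", "Asian"],
    }
)

report = quality_report(sample)
print(report)
```

The exact-match comparison is a deliberate simplification: combined labels such as `Other/Unknown` would need to be added to `placeholders` explicitly to be counted.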