# DSC410: Project Milestone 3 - Feature Engineering
---

**Name**: Joseph Choi <br>
**Class**: DSC410-T301 Predictive Analytics (2243-1)

**Instructions**: Perform some feature engineering on your data

In [3]:
# Setup

import numpy as np
import pandas as pd

In [6]:
# Loading healthcare dataset and displaying results

healthcare_dataset = pd.read_csv('healthcare_dataset_v2.csv')
healthcare_dataset.head(3)

| | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Admission Type | Discharge Date | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tiffany Ramirez | 61 | Female | AB+ | Breast Cancer | 11/17/2022 | Patrick Parker | Wallace-Hamilton | Blue Cross | 97088 | Elective | 12/1/2022 | Aspirin | Inconclusive |
| 1 | Ruben Burns | 45 | Male | B+ | Chronic Bronchitis | 6/1/2023 | Diane Jackson | Burke, Griffin and Cooper | UnitedHealthcare | 1405 | Emergency | 6/15/2023 | Lipitor | Normal |
| 2 | Chad Byrd | 77 | Male | B- | Bacterial Pneumonia | 1/9/2019 | Paul Baker | Walton LLC | Medicare | 9188 | Emergency | 2/8/2019 | Lipitor | Normal |


## Data Source and Description of Data:
**About Dataset** <br>
The data source I have chosen for my term project is a healthcare dataset of patient records. Each entry represents one patient's admission to a healthcare facility. The columns in the dataset are:
- **Name**: Patient's name
- **Age**: Age of the patient 
- **Gender**: Patient's gender
- **Blood Type**: Patient's blood type
- **Medical Condition**: Primary diagnosis or health condition
- **Date of Admission**: Admission date to the healthcare facility
- **Doctor**: Attending doctor's name
- **Hospital**: Healthcare facility or hospital of admission
- **Insurance Provider**: Patient's insurance provider
- **Billing Amount**: Total billed amount for healthcare services
- **Admission Type**: Type of admission (Emergency, Elective, Urgent)
- **Discharge Date**: Date of patient discharge
- **Medication**: Prescribed or administered medication
- **Test Results**: Outcome of medical tests (Normal, Abnormal, Inconclusive)

**Project Objective**: <br>
The goal of the project is to build a model that predicts **Billing Amount** from engineered features such as:
- **Age Group**
- **Grouped Medical Condition**

## Feature Engineering:
1. Removing unnecessary columns
2. Converting related attributes into a single category
3. Binning numerical data to make categorical
4. Encoding categorical variables

In [15]:
# Creating a copy of dataset to perform feature engineering procedures

healthcare_dataset_copy1 = healthcare_dataset.copy()

### 1. Removing Unnecessary Columns

In [16]:
# Specifying columns that need to be removed
unnecessary_columns = ['Name', 'Gender', 'Blood Type', 'Date of Admission', 'Doctor', 'Hospital', 'Insurance Provider', 'Admission Type', 'Discharge Date', 'Medication', 'Test Results']

# Dropping columns that are not needed for the project
healthcare_dataset_copy1 = healthcare_dataset_copy1.drop(columns=unnecessary_columns)

# Printing updated summary:
healthcare_dataset_copy1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Age                99 non-null     int64 
 1   Medical Condition  99 non-null     object
 2   Billing Amount     99 non-null     object
dtypes: int64(1), object(2)
memory usage: 2.4+ KB
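
The summary above reports **Billing Amount** as `object` rather than a numeric dtype, so it would likely need converting before it can serve as a prediction target. The sketch below is one possible approach, assuming the values are strings that may contain currency symbols or thousands separators. The raw file's formatting is not shown here, so that assumption, the toy data, and the name `billing_numeric` are all illustrative.

```python
import pandas as pd

# Toy stand-in for the 'Billing Amount' column; the string formats are assumed.
toy = pd.DataFrame({'Billing Amount': ['97088', '1,405', '$9188', 'n/a']})

# Remove everything except digits, decimal points, and minus signs,
# then coerce to numbers; entries that cannot be parsed become NaN.
cleaned = toy['Billing Amount'].astype(str).str.replace(r'[^0-9.\-]', '', regex=True)
billing_numeric = pd.to_numeric(cleaned, errors='coerce')

print(billing_numeric.tolist())
print(billing_numeric.isna().sum(), 'value(s) could not be parsed')
```

Counting the `NaN` values after coercion shows how many rows need manual inspection before modeling.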


### 2. Converting Related Attributes Into a Single Category
Grouping the values in the **Medical Condition** column into the categories below:
- **Cancer**: Breast Cancer, Leukemia, Colon Cancer, Lung Cancer
- **Bronchitis**: Chronic Bronchitis, Acute Bronchitis 
- **Pneumonia**: Bacterial Pneumonia, Viral Pneumonia, Mycoplasma Pneumonia
- **Asthma**: Childhood Asthma, Non-Allergic Asthma, Occupational Asthma, Severe Asthma
- **Diabetes**: Type 1 Diabetes, Secondary Diabetes, Monogenic Diabetes, Type 2 Diabetes, Gestational Diabetes
- **Hypertension**: Primary Hypertension, Secondary Hypertension

In [19]:
# Creating a dictionary that maps specific medical conditions to their corresponding grouped categories
conditions_mapping = {
    'Breast Cancer': 'Cancer',
    'Leukemia': 'Cancer',
    'Colon Cancer': 'Cancer',
    'Lung Cancer': 'Cancer',
    'Chronic Bronchitis': 'Bronchitis',
    'Acute Bronchitis': 'Bronchitis',
    'Bacterial Pneumonia': 'Pneumonia',
    'Viral Pneumonia': 'Pneumonia',
    'Mycoplasma Pneumonia': 'Pneumonia',
    'Childhood Asthma': 'Asthma',
    'Non-Allergic Asthma': 'Asthma',
    'Occupational Asthma': 'Asthma',
    'Severe Asthma': 'Asthma',
    'Type 1 Diabetes': 'Diabetes',
    'Secondary Diabetes': 'Diabetes',
    'Monogenic Diabetes': 'Diabetes',
    'Type 2 Diabetes': 'Diabetes',
    'Gestational Diabetes': 'Diabetes',
    'Primary Hypertension': 'Hypertension',
    'Secondary Hypertension': 'Hypertension'
}

# Creating a new column 'Grouped Medical Condition' by replacing the original medical conditions with their corresponding grouped categories
healthcare_dataset_copy1['Grouped Medical Condition'] = healthcare_dataset_copy1['Medical Condition'].replace(conditions_mapping)

# Printing updated dataframe
healthcare_dataset_copy1.head(3)

| | Age | Medical Condition | Billing Amount | Grouped Medical Condition |
|---|---|---|---|---|
| 0 | 61 | Breast Cancer | 97088 | Cancer |
| 1 | 45 | Chronic Bronchitis | 1405 | Bronchitis |
| 2 | 77 | Bacterial Pneumonia | 9188 | Pneumonia |
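
`Series.replace` leaves any value that is missing from the mapping unchanged, so an unmapped condition would silently become its own category (by contrast, `Series.map` would turn it into `NaN`). The sketch below shows one way to check for this. It uses a partial copy of the mapping and a made-up label, `'Migraine'`, purely for illustration.

```python
import pandas as pd

# Partial copy of the mapping, plus one hypothetical unmapped label.
mapping = {'Breast Cancer': 'Cancer', 'Acute Bronchitis': 'Bronchitis'}
conditions = pd.Series(['Breast Cancer', 'Acute Bronchitis', 'Migraine'])

grouped = conditions.replace(mapping)

# Any original value that is not a key in the mapping passes through as-is.
unmapped = sorted(set(conditions) - set(mapping))

print(grouped.tolist())
print('Unmapped conditions:', unmapped)
```

An empty `unmapped` list on the real column would confirm that every condition falls into one of the six groups.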


### 3. Binning Numerical Data to Make Categorical
Binning the numeric values in the **Age** column into the groups below (each bin includes its lower bound and excludes its upper bound):
- **Young**: 0-29
- **Adult**: 30-49
- **Senior**: 50 and over

In [31]:
# Setting the bins and labels for the new binned column
bins = [0, 30, 50, float('inf')]
labels = ['Young', 'Adult', 'Senior']

# Creating a new column 'Age Group' based on binning
healthcare_dataset_copy1['Age Group'] = pd.cut(healthcare_dataset_copy1['Age'], bins=bins, labels=labels, right=False)

# Printing updated dataframe
healthcare_dataset_copy1.head(3)

| | Age | Medical Condition | Billing Amount | Grouped Medical Condition | Age Group |
|---|---|---|---|---|---|
| 0 | 61 | Breast Cancer | 97088 | Cancer | Senior |
| 1 | 45 | Chronic Bronchitis | 1405 | Bronchitis | Adult |
| 2 | 77 | Bacterial Pneumonia | 9188 | Pneumonia | Senior |
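
Because `right=False` makes each bin closed on the left and open on the right, an age exactly equal to a bin edge falls into the higher group. The short check below uses the same bins and labels on a few hand-picked boundary ages (the ages themselves are illustrative, not taken from the dataset).

```python
import pandas as pd

# Boundary ages chosen to sit on either side of each bin edge.
ages = pd.Series([29, 30, 49, 50])
bins = [0, 30, 50, float('inf')]
labels = ['Young', 'Adult', 'Senior']

# With right=False the intervals are [0, 30), [30, 50), [50, inf).
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(groups.tolist())  # ['Young', 'Adult', 'Adult', 'Senior']
```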


### 4. Encoding Categorical Variables
Using one-hot encoding on the columns below to convert the categorical variables into numerical format:
- **Grouped Medical Condition**
- **Age Group**

In [30]:
# Creating one-hot encoded columns for 'Age Group' and 'Grouped Medical Condition'
healthcare_dataset_encoded = pd.get_dummies(healthcare_dataset_copy1, columns=['Age Group', 'Grouped Medical Condition'], drop_first=False)

# Printing updated dataframe
healthcare_dataset_encoded.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Age | 61 | 45 | 77 | 84 | 56 |
| Medical Condition | Breast Cancer | Chronic Bronchitis | Bacterial Pneumonia | Childhood Asthma | Viral Pneumonia |
| Billing Amount | 97088 | 1405 | 9188 | 3773 | 9880 |
| Age Group_Young | False | False | False | False | False |
| Age Group_Adult | False | True | False | False | False |
| Age Group_Senior | True | False | True | True | True |
| Grouped Medical Condition_Asthma | False | False | False | True | False |
| Grouped Medical Condition_Bronchitis | False | True | False | False | False |
| Grouped Medical Condition_Cancer | True | False | False | False | False |
| Grouped Medical Condition_Diabetes | False | False | False | False | False |
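
With `drop_first=False`, every category gets its own indicator column, so the indicators within each feature always sum to one. For a linear model with an intercept, that redundancy is commonly removed by dropping one level per feature. The project does not yet name a model, so the sketch below is only a comparison on toy data, not a recommendation for this dataset. The `dtype=int` argument is an optional choice that yields 0/1 values instead of booleans.

```python
import pandas as pd

# Toy rows using the same two engineered column names.
toy = pd.DataFrame({
    'Age Group': ['Senior', 'Adult', 'Young'],
    'Grouped Medical Condition': ['Cancer', 'Bronchitis', 'Cancer'],
})
cols = ['Age Group', 'Grouped Medical Condition']

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=cols, dtype=int)

# Reduced encoding: the first category of each feature becomes the baseline.
reduced = pd.get_dummies(toy, columns=cols, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the reduced frame, a row with zeros in every `Age Group_*` column belongs to the dropped baseline category.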


## Quick EDA:

In [32]:
# Showing value counts for selected one-hot encoded columns

selected_columns = [
    'Age Group_Young', 
    'Age Group_Adult', 
    'Age Group_Senior', 
    'Grouped Medical Condition_Asthma', 
    'Grouped Medical Condition_Bronchitis', 
    'Grouped Medical Condition_Cancer', 
    'Grouped Medical Condition_Diabetes', 
    'Grouped Medical Condition_Hypertension', 
    'Grouped Medical Condition_Pneumonia' 
]

for col in selected_columns:
    print(healthcare_dataset_encoded[col].value_counts())
    print('\n')

Age Group_Young
False    86
True     13
Name: count, dtype: int64


Age Group_Adult
False    69
True     30
Name: count, dtype: int64


Age Group_Senior
True     56
False    43
Name: count, dtype: int64


Grouped Medical Condition_Asthma
False    88
True     11
Name: count, dtype: int64


Grouped Medical Condition_Bronchitis
False    87
True     12
Name: count, dtype: int64


Grouped Medical Condition_Cancer
False    80
True     19
Name: count, dtype: int64


Grouped Medical Condition_Diabetes
False    80
True     19
Name: count, dtype: int64


Grouped Medical Condition_Hypertension
False    84
True     15
Name: count, dtype: int64


Grouped Medical Condition_Pneumonia
False    76
True     23
Name: count, dtype: int64
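
The per-column counts above can also be collected into a single table, since summing a boolean column counts its `True` values and taking its mean gives the share of rows. This is a minimal sketch on a toy frame; the column names match the encoded dataset, but the values are made up.

```python
import pandas as pd

# Toy one-hot frame with the same column naming pattern.
encoded = pd.DataFrame({
    'Age Group_Young': [False, False, True],
    'Age Group_Adult': [False, True, False],
    'Age Group_Senior': [True, False, False],
})

# True counts as 1 and False as 0 when summing or averaging.
counts = encoded.sum()
shares = encoded.mean().round(2)

summary = pd.DataFrame({'count': counts, 'share': shares})
print(summary)
```

Applied to `healthcare_dataset_encoded[selected_columns]`, the same two calls would give one row per indicator column.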


