
# D209 Task 2

## Part I: Research Question

### A. Describe the purpose of this data mining report by doing the following:


#### 1. Propose one question relevant to a real-world organizational situation that you will answer using one of the following prediction methods:

- decision trees
- random forests
- advanced regression (i.e., lasso or ridge regression)

> How do various patient demographics and medical factors (like age, gender, BMI, etc.) affect the total medical charges incurred by the patient?

This question seeks to understand the determinants of medical costs, which is of great significance to healthcare providers, insurance companies, and patients.

#### 2. Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.


- To identify the significant predictors of medical charges.
- To quantify the relationship between these predictors and medical charges.
- To develop a predictive model that can be used to forecast medical charges based on specific patient demographics and attributes.

## Part II: Method Justification

### B. Explain the reasons for your chosen prediction method from part A1 by doing the following:


#### 1. Explain how the prediction method you chose analyzes the selected data set. Include expected outcomes.

#### 2. Summarize one assumption of the chosen prediction method.

#### 3. List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

Python Libraries:

- pandas: For data handling and manipulation.
- numpy: For numerical operations.
- scikit-learn: For preprocessing data, splitting it into training and test sets, fitting the prediction model, and computing evaluation metrics.
- matplotlib, seaborn: For visualizing data.

## Part III: Data Preparation

### C. Perform data preparation for the chosen data set by doing the following:

#### 1. Describe one data preprocessing goal relevant to the prediction method from part A1.

#### 2. Identify the initial data set variables that you will use to perform the analysis for the prediction question from part A1, and group each variable as continuous or categorical.

## Data Dictionary


The data set includes the following information:
- Patients who are readmitted to the hospital within a month of release (the "ReAdmis" column)
- Patient medical conditions (high blood pressure, stroke, obesity, arthritis, diabetes, etc.)
- Patient information (service they received while hospitalized, days in hospital, type of initial admission, etc.)
- Patient demographic information (gender, age, job, education level, etc.)

The data set consists of 10,000 patient records and 50 columns/variables. I will use only 29 of the 50 variables.

Note: integers are discrete, not continuous, but this question asks the student to classify each variable as either continuous or categorical. It would be inappropriate to treat every integer in this dataset as a nominal category, so some integer variables are marked as continuous for this reason.


| Column | Description | Type | Example |
|--------|-------------|------|---------|
| Population | Population within a mile radius of the patient, based on census data | Continuous | 11303 |
| Area | Area type (rural, urban, suburban) based on unofficial census data | Categorical | Urban |
| Children | Number of children in the patient's household as reported in the admissions information | Continuous | 1.0 |
| Age | Age of the patient as reported in admissions information | Continuous | 53.0 |
| Education | Highest earned degree of the patient as reported in admissions information | Categorical | Some College, Less than 1 Year |
| Income | Annual income of the patient (or primary insurance holder) as reported at the time of admission | Continuous | 86575.93 |
| Marital | Marital status of the patient (or primary insurance holder) as reported on admission information | Categorical | Divorced |
| Gender | Patient self-identification as male, female, or nonbinary | Categorical | Male |
| ReAdmis | Whether the patient was readmitted within a month of release or not | Categorical | No |
| VitD_levels | The patient's vitamin D levels as measured in ng/mL | Continuous | 17.80233 |
| Doc_visits | Number of times the primary physician visited the patient during the initial hospitalization | Continuous | 5 |
| Full_meals_eaten | Number of full meals the patient ate while hospitalized | Continuous | 3 |
| VitD_supp | The number of times that vitamin D supplements were administered to the patient | Continuous | 2 |
| Soft_drink | Whether the patient habitually drinks three or more sodas in a day | Categorical | No |
| Initial_admin | The means by which the patient was admitted into the hospital initially | Categorical | Emergency |
| HighBlood | Whether the patient has high blood pressure | Categorical | Yes |
| Stroke | Whether the patient has had a stroke | Categorical | No |
| Overweight | Whether the patient is considered overweight based on age, gender, and height | Categorical | No |
| Arthritis | Whether the patient has arthritis | Categorical | Yes |
| Diabetes | Whether the patient has diabetes | Categorical | Yes |
| Hyperlipidemia | Whether the patient has hyperlipidemia | Categorical | Yes |
| BackPain | Whether the patient has chronic back pain | Categorical | Yes |
| Anxiety | Whether the patient has an anxiety disorder | Categorical | Yes |
| Allergic_rhinitis | Whether the patient has allergic rhinitis | Categorical | Yes |
| Reflux_esophagitis | Whether the patient has reflux esophagitis | Categorical | No |
| Asthma | Whether the patient has asthma | Categorical | No |
| Services | Primary service the patient received while hospitalized | Categorical | Blood Work |
| Initial_days | The number of days the patient stayed in the hospital during the initial visit | Continuous | 9.058210335 |
| TotalCharge | The amount charged to the patient daily. This value reflects an average per patient based on the total charge divided by the number of days hospitalized. | Continuous | 3533.292197 |


#### 3. Explain the steps used to prepare the data for the analysis. Identify the code segment for each step.

**Preprocessing Data for Analysis**

**Data Transformation Goals**:

1. **Ensure Data Completeness and Quality**: Ensure that the dataset is free from missing values and duplicates which can skew the analysis.
2. **Feature Selection for Relevance**: Exclude variables that do not provide meaningful information in predicting medical charges.
3. **Convert Categorical Variables**: Transform categorical variables into a format suitable for regression modeling.
4. **Standardize Numerical Variables**: Ensure that numerical variables are on a comparable scale, making coefficients in the regression model interpretable.

**Steps Taken**:

1. **Data Cleaning**:
   - Checked and confirmed that there were no missing values in the dataset.
   - Checked for duplicate rows and found none.
   - Examined the data types of each column to ensure they align with their expected types.
   - Dropped irrelevant columns that were not pertinent to the research question, such as identifiers (`CaseOrder`, `Customer_id`, etc.), geographic details (`City`, `State`, etc.), and other columns like `Additional_charges` and survey items (`Item1` to `Item8`).

2. **Encoding Categorical Variables**:
   - Identified categorical variables in the dataset, including `Gender`, `Marital`, and others.
   - Used one-hot encoding to transform these categorical variables into binary columns, where each category is represented by a separate column.

3. **Scaling Numerical Variables**:
   - Identified numerical independent variables such as `Age`, `Income`, and `VitD_levels`.
   - Used Standard Scaling to transform these numerical variables, resulting in values with a mean of 0 and a standard deviation of 1. This ensures that all variables are on a comparable scale, which is crucial for interpreting the coefficients in a regression model.


In [None]:
#Import statements
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Set the display option to show all columns
pd.set_option('display.max_columns', None)

In [None]:
#Load dataset into pandas dataframe
#Reloading dataframe at this point to make the cleaning process easier to roll back without having to rerun any previous analysis
medical_data = pd.read_csv('/content/medical_clean.csv')

In [None]:
#Visually inspecting the first 10 rows
medical_data.head(10)

In [None]:
# 1. Check for missing values
missing_values = medical_data.isnull().sum()
missing_values

In [None]:
# 2. Check for duplicates
duplicate_rows = medical_data.duplicated().sum()
duplicate_rows

In [None]:
# 3. Check data types of columns
data_types = medical_data.dtypes
data_types

In [None]:
# Compute the correlation matrix for the original dataset
correlation_matrix_original = medical_data.corr(numeric_only=True)

# Extract correlations of all features with 'TotalCharge' from the original dataset
total_charge_correlations_original = correlation_matrix_original['TotalCharge'].sort_values(ascending=False)

# Set up the matplotlib figure
plt.figure(figsize=(8, 18))

# Generate a focused heatmap for the original dataset
sns.heatmap(total_charge_correlations_original.to_frame(), cmap='coolwarm', vmin=-1, vmax=1, annot=True, fmt=".2f", cbar_kws={'label': 'Correlation with TotalCharge'})

# Adjust layout for better visualization
plt.title("Correlation with TotalCharge (Original Dataset)")
plt.tight_layout()
plt.show()

- There are no missing values in the dataset.
- There are no duplicate rows in the dataset.
- The data types seem appropriate for each column, with a mix of numerical (int64 and float64) and categorical (object) data types.

- TotalCharge has a very strong positive correlation with **Initial_days** and **CaseOrder**.

A few columns are not relevant to the research question, so they will be dropped. These include **CaseOrder, Customer_id, Interaction, UID, City, State, County, Zip, Lat, Lng, TimeZone**, and **Job**.

I will also remove **Additional_charges**, since this information is already included in **TotalCharge**; a relationship between higher additional charges and a higher total charge is therefore already known.

I am also dropping the **Initial_days** column. Based on prior healthcare knowledge, I assume the total hospital bill includes a per-day charge along with other charges, so the bill naturally grows as a patient's stay lengthens.

In addition, **Initial_days** has a very strong positive correlation with **TotalCharge** (0.99). A predictor that nearly duplicates the target would dominate the model and mask the effect of every other variable, so it is best to drop this column.

**Item1, Item2, Item3, Item4, Item5, Item6, Item7**, and **Item8** will also be removed. These are survey items that relate to patient satisfaction. While there might be some correlation between patient satisfaction scores and total charges, the survey scores cannot logically affect the amount a patient is charged.
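
Dropping these columns does not guarantee that the remaining predictors are independent of one another. One common way to check for multicollinearity among predictors is the variance inflation factor (VIF). The sketch below is illustrative only: the helper `vif_table`, the synthetic data, and the `Age_months` column are hypothetical and not part of this notebook's analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return VIF_j = 1 / (1 - R^2_j) for each column of X.

    R^2_j comes from regressing column j on all other columns.
    """
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic stand-in data (hypothetical values, not the medical dataset).
rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 15, n)
income = rng.normal(40000, 10000, n)
age_months = age * 12 + rng.normal(0, 1, n)  # nearly a copy of Age

demo = pd.DataFrame({"Age": age, "Income": income, "Age_months": age_months})
vifs = vif_table(demo)
print(vifs)
```

In this toy frame, `Age` and `Age_months` receive very large VIF values because each is almost fully explained by the other, while `Income` stays close to 1. A frequently cited rule of thumb treats values above roughly 5 to 10 as a warning sign.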


In [None]:
# Dropping irrelevant columns
columns_to_drop = ['Initial_days','CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State', 'County',
                   'Zip', 'Lat', 'Lng', 'TimeZone', 'Job','Additional_charges','Item1','Item2','Item3','Item4','Item5','Item6','Item7','Item8']

medical_data_cleaned = medical_data.drop(columns=columns_to_drop)

In [None]:

# Display the first few rows of the cleaned dataset
medical_data_cleaned.head()

In [None]:
# One-hot encoding the categorical variables
medical_data_encoded = pd.get_dummies(medical_data_cleaned, drop_first=True)  # drop_first=True to avoid the dummy variable trap

# Display the first few rows of the encoded dataset
medical_data_encoded.head()
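
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up three-row frame. The column names mirror the data dictionary, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [53, 41, 67],
    "Gender": ["Male", "Female", "Nonbinary"],
    "HighBlood": ["Yes", "No", "Yes"],
})

# drop_first=True removes the first (alphabetically sorted) level of each
# categorical column; that level becomes the implicit baseline.
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(list(toy_encoded.columns))
# → ['Age', 'Gender_Male', 'Gender_Nonbinary', 'HighBlood_Yes']
```

`Gender_Female` and `HighBlood_No` are absent because they are the baseline levels: a row with zeros (or `False`) in every `Gender_*` column is a female patient.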

In [None]:
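# Assumption: the numerical predictors are every numeric column except the target, TotalCharge
numerical_independent_vars = medical_data_cleaned.select_dtypes(include='number').columns.drop('TotalCharge')

# Initialize the standard scaler
scaler = StandardScaler()

# Scaling the numerical variables
medical_data_encoded[numerical_independent_vars] = scaler.fit_transform(medical_data_encoded[numerical_independent_vars])

# Display the first few rows of the scaled dataset
medical_data_encoded.head()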
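
A quick way to confirm that standard scaling behaved as described is to check the mean and standard deviation of the transformed columns. The sketch below uses a hypothetical toy frame rather than the medical data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values for two numerical predictors.
toy = pd.DataFrame({
    "Age": [30.0, 50.0, 70.0, 90.0],
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler divides by the population standard deviation (ddof=0),
# so the check uses ddof=0 rather than the pandas default of ddof=1.
print(np.allclose(scaled.mean(), 0.0))
print(np.allclose(scaled.std(ddof=0), 1.0))
```

Note that fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. Fitting it on the training portion only and reusing it on the test portion is one way to avoid that.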


#### 4. Provide a copy of the cleaned data set.

In [None]:
# Save the DataFrame as a CSV file
medical_data_encoded.to_csv('medical_data_encoded.csv', index=False)


## Part IV: Analysis

### D. Perform the data analysis and report on the results by doing the following:

#### 1. Split the data into training and test data sets and provide the file(s).
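
A minimal sketch of this step, assuming an 80/20 split with `TotalCharge` as the target. The synthetic frame stands in for `medical_data_encoded`, and the split ratio, random seed, output folder, and file names are illustrative assumptions rather than choices documented in this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for medical_data_encoded (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 90, size=100),
    "Income": rng.normal(40000, 15000, size=100),
    "HighBlood_Yes": rng.integers(0, 2, size=100),
    "TotalCharge": rng.normal(5000, 2000, size=100),
})

# Separate the predictors from the target.
X = demo.drop(columns="TotalCharge")
y = demo["TotalCharge"]

# test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Placeholder output folder; replace with the project folder as needed.
out_dir = Path(tempfile.mkdtemp())
X_train.join(y_train).to_csv(out_dir / "train.csv", index=False)
X_test.join(y_test).to_csv(out_dir / "test.csv", index=False)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so the saved files match the rows used to fit and evaluate the model.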

#### 2. Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

#### 3. Provide the code used to perform the prediction analysis from part D2.

## Part V: Data Summary and Implications

### E. Summarize your data analysis and explain the implications of your findings by doing the following:

#### 1. Provide the answer to the prediction question from part A1.

#### 2. Provide the final data output (e.g., data table, data visualization) that supports the answer to the prediction question from part A1.

#### 3. Explain how the data output from part E2 supports the answer to the prediction question from part A1.


#### 4. Discuss the implications of the findings from the data analysis in relation to the real-world organizational situation from part A1.

#### 5. Provide one recommendation based on the findings from the data analysis.