
# Part I: Research Question

## A.  Describe the purpose of this data mining report by doing the following:


### 1.  Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods:

- k-nearest neighbor (KNN)

- Naive Bayes

> What factors are most predictive of a patient being readmitted to the hospital within a month after their initial discharge?

### 2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.


- To develop a predictive model to identify key factors influencing hospital readmission within one month of discharge.

- To identify and quantify the influence of various factors, such as medical history, demographic details, and hospital services received, on the risk of readmission.


# Part II: Method Justification

## B.  Explain the reasons for your chosen classification method from part A1 by doing the following:



### 1.  Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.

K-Nearest Neighbors (KNN) is a classification method well-suited for this dataset, which includes various patient health and demographic variables.

1. **Functionality:**
   - KNN works by finding the 'k' closest data points (neighbors) to a new data point based on similarity in features. In our dataset, this means identifying the most similar patients in terms of demographics, medical conditions, and hospital stay characteristics.

2. **Classification Decision:**
   - The algorithm assigns the new data point to the most common class among its 'k' nearest neighbors. For predicting hospital readmissions, it looks at the 'ReAdmis' status of the nearest neighbors and predicts whether a new patient will be readmitted based on these.

3. **Expected Outcomes:**
   - The model's effectiveness will largely depend on features like age, medical history, and length of hospital stay. I anticipate that the KNN model will be able to highlight key factors that correlate with higher readmission rates.

4. **Evaluation:**
   - The performance of the KNN model on our dataset will be evaluated using accuracy and AUC metrics, providing insight into its reliability for this specific application.

In this context, KNN's ability to capture the nuanced relationships in multi-feature medical data makes it a promising tool for identifying patterns leading to hospital readmissions.
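
The functionality and classification decision described above can be sketched in a few lines of plain Python. This is a minimal illustration only: the feature values, the labels, and the helper names `euclidean` and `knn_predict` are assumptions made up for the example, not part of the data set or of this notebook's code. In practice, scikit-learn's `KNeighborsClassifier` would do this work.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical, already-scaled features: (age, length of stay).
train_X = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.9),
           (1.0, 1.2), (0.8, 0.9), (1.1, 0.7)]
train_y = ["No", "No", "No", "Yes", "Yes", "Yes"]

new_patient = (0.9, 1.0)
prediction = knn_predict(train_X, train_y, new_patient, k=3)
print(prediction)  # Yes
```

The three closest training points to `new_patient` all carry the label "Yes", so the majority vote returns "Yes". An odd `k` avoids tied votes when there are only two classes.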



### 2.  Summarize one assumption of the chosen classification method.

A key assumption of the K-Nearest Neighbors (KNN) algorithm is that similar things exist in close proximity. In other words, KNN assumes that data points that are near each other are likely to be in the same category.
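
Because this assumption is expressed through distance, the scale of each feature matters: a feature measured in large units can dominate the Euclidean distance. The sketch below uses four made-up patients (the values and the `nearest` and `standardize` helpers are illustrative assumptions, not rows or functions from this analysis) to show how the nearest neighbor can change once features are standardized.

```python
import math
import statistics

# Hypothetical patients described by (Age in years, Income in dollars).
patients = {
    "query": (30, 50_000),
    "a": (80, 50_500),  # very different age, similar income
    "b": (32, 53_000),  # similar age, somewhat different income
    "c": (55, 58_000),
}

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def nearest(points, target="query"):
    """Name of the point closest to the target point."""
    others = {name: vec for name, vec in points.items() if name != target}
    return min(others, key=lambda name: euclidean(points[target], others[name]))

def standardize(points):
    """Z-score each feature using the sample mean and population std."""
    columns = list(zip(*points.values()))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(vec, means, stds))
        for name, vec in points.items()
    }

raw_nearest = nearest(patients)
scaled_nearest = nearest(standardize(patients))
print(raw_nearest, scaled_nearest)  # a b
```

On the raw values, income differences swamp the 50-year age gap, so patient `a` looks closest. After standardization both features contribute comparably and patient `b` becomes the nearest neighbor.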

### 3.  List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

Python Libraries:

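- pandas: For loading, inspecting, and manipulating the data set.
- numpy: For numerical operations.
- scikit-learn: For scaling features (`StandardScaler`), implementing KNN, and computing evaluation metrics such as accuracy and AUC.
- statsmodels: For calculating variance inflation factors (VIF) to check for multicollinearity.
- matplotlib, seaborn: For visualizing data.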

# Part III: Data Preparation

## C.  Perform data preparation for the chosen data set by doing the following:

### 1.  Describe one data preprocessing goal relevant to the classification method from part A1.


### 2.  Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.


### 3.  Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.


In [None]:
#Import statements
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import warnings
from tabulate import tabulate



# Set the display option to show all columns
pd.set_option('display.max_columns', None)

# Suppress all warnings
warnings.filterwarnings("ignore")

-  Describe your data cleaning goals and the steps used to clean the data to achieve the goals that align with your research question including the annotated code.

**Goals:**

**Identify and Handle Missing Values:** Ensure that the dataset does not have any gaps or missing data that could compromise the analysis.

**Check for Duplicates:** Ensure there's no redundancy in the data. Remove any duplicate rows if found.

**Ensure Appropriate Data Types:** Confirm that each column's data type aligns with the nature of the data it contains.

**Remove Irrelevant Columns:** Eliminate columns that don't contribute value to the research question to make the dataset and subsequent analysis more focused.

In [None]:
#Load dataset into pandas dataframe
#Reloading dataframe at this point to make the cleaning process easier to roll back without having to rerun any previous analysis
medical_data = pd.read_csv('/content/medical_clean.csv')

In [None]:
#Visually inspecting the first 10 rows
medical_data.head(10)

In [None]:
medical_data.info()

#### Data Cleaning Steps

In [None]:
# 1. Check for missing values
missing_values = medical_data.isnull().sum()
missing_values

In [None]:
# 2. Check for duplicates
duplicate_rows = medical_data.duplicated().sum()
duplicate_rows

In [None]:
# 3. Check data types of columns
data_types = medical_data.dtypes
data_types

**Findings**
- There are no missing values in the dataset.
- There are no duplicate rows in the dataset.
- The data types seem appropriate for each column, with a mix of numerical (int64 and float64) and categorical (object) data types.

There are a few columns that are not relevant to the research question, so they will be dropped. These include **CaseOrder, Customer_id, Interaction, UID, City, State, County, Zip, Lat, Lng,** and **TimeZone**.

I am also removing the **Complication_risk** column from my analysis. It does not make sense to include a precomputed risk score in an analysis meant to identify readmission risk, especially without knowing *how* the score was calculated, which factors it used, and at what point in a patient's stay it was assessed. A high complication risk assessed at the time of admission would have a different meaning from a high complication risk assessed at the time of discharge. I am using my domain knowledge as a nurse to make this decision.

**Item1, Item2, Item3, Item4, Item5, Item6, Item7,** and **Item8** will also be removed. These are survey items related to patient satisfaction, and logically they cannot affect whether a patient gets readmitted or not. While there might be some correlation between patient satisfaction scores and readmission risk, including these scores may skew the analysis. If I were asking a research question such as "Do patients who receive better service, as measured by a customer satisfaction survey, have better outcomes with fewer readmissions?", it might make sense to leave them in, but that is outside the scope of my research question.

Instinctively, I want to remove **Job** from my analysis for similar reasons, but I suspect there might be some relationship between a patient's job and their access to preventative medical care, which may reduce readmissions, so I am leaving it in at this stage.


In [None]:
# Columns to be dropped from the dataset
columns_to_drop = [
    'CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Zip', 'Lat', 'Lng', 'TimeZone', 'Complication_risk',
    'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'
]

# Dropping the irrelevant columns
medical_data_cleaned = medical_data.drop(columns=columns_to_drop)

# Display the shape of the dataset after dropping the columns
medical_data_cleaned.shape

In [None]:
# Identify the original numerical variables
original_numerical_vars = ['Population', 'Children', 'Age', 'Income', 'VitD_levels', 'Doc_visits',
                           'Full_meals_eaten', 'vitD_supp', 'Initial_days', 'TotalCharge', 'Additional_charges']

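# One-hot encoding the categorical variables
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool),
# which the VIF calculation below requires
categorical_data = medical_data_cleaned.drop(original_numerical_vars, axis=1)
medical_data_encoded = pd.get_dummies(categorical_data, drop_first=True, dtype=int)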

# Initialize the standard scaler
scaler = StandardScaler()

# Scale the numerical variables
scaled_numerical_data = scaler.fit_transform(medical_data_cleaned[original_numerical_vars])

# Convert the scaled numerical data back to a DataFrame
scaled_numerical_df = pd.DataFrame(scaled_numerical_data, columns=original_numerical_vars)

# Concatenate scaled numerical variables with one-hot encoded variables
medical_transformed = pd.concat([scaled_numerical_df, medical_data_encoded], axis=1)

# Display the first few rows of the final dataset
medical_transformed.head()

In [None]:
def calculate_vif(data):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = data.columns

    # Calculating VIF for each feature
    vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(len(data.columns))]

    return vif_data

features = medical_transformed.drop('ReAdmis_Yes', axis=1)
vif_data = calculate_vif(features)

print(vif_data)

In [None]:
# Drop columns with high VIF
high_vif_columns = ['Initial_days', 'TotalCharge', 'Additional_charges']
medical_logistic_prepared = medical_transformed.drop(high_vif_columns, axis=1)
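
For reference, the variance inflation factor used in the two cells above is conventionally defined as follows, where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all of the other features:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

A value of 1 means feature $j$ is not linearly explained by the others, and the value grows without bound as $R_j^2$ approaches 1. Common rules of thumb treat values above 5 or 10 as high; the exact cutoff applied in this notebook is not stated, so that range is an assumption rather than the author's threshold.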

### 4.  Provide a copy of the cleaned data set.


In [None]:
# Save the DataFrame as a CSV file
medical_logistic_prepared.to_csv('medical_logistic_prepared.csv', index=False)



# Part IV: Analysis

## D.  Perform the data analysis and report on the results by doing the following:

### 1.  Split the data into training and test data sets and provide the file(s).
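
As a minimal sketch of how such a split could be done with scikit-learn's `train_test_split`: the small frame below is a synthetic stand-in, and the 80/20 ratio, `random_state`, and stratification are assumptions rather than choices documented in this notebook. With the real data, `prepared` would be the `medical_logistic_prepared` frame and its `ReAdmis_Yes` target column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data (values are made up).
prepared = pd.DataFrame({
    "Age": [-1.2, -0.5, 0.0, 0.4, 0.9, 1.3, -0.8, 0.2, 0.7, -0.3],
    "Doc_visits": [0.1, -0.4, 1.1, -1.0, 0.3, 0.8, -0.6, 0.5, -0.2, 0.0],
    "ReAdmis_Yes": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
})

# Separate the features from the target column.
X = prepared.drop(columns="ReAdmis_Yes")
y = prepared["ReAdmis_Yes"]

# Hold out 20% of the rows for testing; stratify keeps the share of
# readmitted patients the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Each of the four pieces could then be written out with `to_csv`, in the same way the cleaned data set is saved earlier in the notebook.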


### 2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.


# Part V: Data Summary and Implications

## E.  Summarize your data analysis by doing the following:

### 1.  Explain the accuracy and the area under the curve (AUC) of your classification model.
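
As a reference for the two metrics named in this heading, the sketch below computes them by hand. The labels and scores are invented for illustration and are not results from this analysis; the helper names `accuracy` and `auc` are likewise hypothetical. scikit-learn's `accuracy_score` and `roc_auc_score`, which the notebook already imports, compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = readmitted) and scores, e.g. the share of a
# patient's k nearest neighbors that were readmitted.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
area = auc(y_true, scores)
print(round(acc, 3), round(area, 3))  # 0.833 0.889
```

Accuracy depends on the 0.5 cutoff used to turn scores into class labels, while AUC summarizes how well the scores rank readmitted patients above non-readmitted ones across all cutoffs.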

### 2.  Discuss the results and implications of your classification analysis.

### 3.  Discuss one limitation of your data analysis.

### 4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.