# D209 - Data Mining I
___
## Performance Assessment - Task 1: Classification Analysis
### Medical Readmission Data Set (Clean)
---
<br></br>

## Part I - Research Question

### A1: Proposal of Question

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods:
 - k-nearest neighbor (KNN)
 - Naive Bayes

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission proposes 1 question that is relevant to a real-world organizational situation, and the proposal includes 1 of the given classification methods.

</font>

---

The central research question addressed by this analysis is to determine:

>Can a patient's readmission (`ReAdmis`) status (`Yes/No`) be accurately classified given their age (`Age`) and number of days initially hospitalized (`Initial_days`)? 

In terms of hypothesis testing, our null hypothesis ($H_0$) is:

>The age (`Age`) and length of initial hospitalization (`Initial_days`) features from the medical readmission dataset have *no* statistically significant predictive power to classify a given patient's readmission status (`ReAdmis`).

Additionally, our alternate hypothesis ($H_1$) is:

>The age (`Age`) and length of initial hospitalization (`Initial_days`) features from the medical readmission dataset *do* classify a given patient's readmission status (`ReAdmis`) in a statistically significant way.

### A2: Defined Goal

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission defines 1 reasonable goal for data analysis that is within the scope of the scenario and is represented in the available data.

</font>

---

The primary goal of the following analysis is to discover which predictors help classify whether a patient will be readmitted after initial hospitalization (`ReAdmis`). This will be assessed in Python using k-nearest neighbor (KNN) classification, which predicts a binary target variable from one or more predictor variables. KNN identifies patterns of association in the data, not causal relationships.

## Part II - Method Justification

### B1: Explanation of Classification Method

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission logically explains how the chosen classification method analyzes the selected data set and includes accurate expected outcomes.

</font>

---
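As a rough illustration of how KNN arrives at a prediction, the sketch below classifies a new observation by majority vote among its nearest training points. It is a hand-written toy version, not the scikit-learn implementation used later, and the six observations and the helper name `knn_predict` are made up for the example.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote of its k nearest training points."""
    # Euclidean distance from the query to every training observation.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Made-up observations: [Initial_days, VitD_levels] with a readmission label.
train_X = np.array([[5.0, 18.0], [8.0, 17.5], [12.0, 19.0],
                    [60.0, 18.2], [65.0, 17.9], [70.0, 18.8]])
train_y = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])

short_stay = knn_predict(train_X, train_y, np.array([7.0, 18.1]))
long_stay = knn_predict(train_X, train_y, np.array([66.0, 18.0]))
print(short_stay, long_stay)  # No Yes
```

The expected outcome of the method is therefore a predicted label (`Yes` or `No`) for each patient, determined entirely by the labels of the most similar patients in the training data.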

### B2: Summary of Method Assumptions

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Summarize one assumption of the chosen classification method.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission adequately summarizes 1 assumption of the chosen classification method.

</font>

---

KNN classification rests on the following basic assumptions:

-   Observations that lie close together in feature space tend to share the same class label
-   The chosen distance metric (Euclidean by default in scikit-learn) is a meaningful measure of similarity between patients
-   The predictor variables are on comparable scales, since a feature with a larger range dominates the distance calculation

### B3: Packages/Libraries List

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission lists the packages or libraries chosen for Python or R and justifies how each item on the list supports the analysis.

</font>

---

In [1]:
import numpy as np
import pandas as pd

## Part III - Data Preparation

### C1: Data Preprocessing

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Describe one data preprocessing goal relevant to the classification method from part A1.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission describes 1 data preprocessing goal that is relevant to the classification method from part A1.

</font>

---

Preparing the data for modeling requires relatively little work, because the raw dataset used in this project was already cleaned in a prior project (see project D206 - Data Cleaning). Using the pre-cleaned dataset, we will first subset the data to include only those variables we intend to feed into our initial model. Because the first model selection process we will use is backward-oriented, this initial model will include all features that could plausibly be related to the binary target variable of readmission (`ReAdmis`). These features will be a mix of numeric and categorical variables.

Next, we will ensure that the data type of each variable is appropriate for that kind of feature. For example, we will determine which variables are categorical and should be stored with the pandas `category` dtype in Python. Once the dataset for the initial model has been subset and converted to the right types, we will inspect it to confirm that the process has not created new problems, such as silently introducing null values.
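The preparation steps described above could be sketched roughly as follows. This runs on a small made-up frame, not the project's data: the column names mirror the data set, but the values, the variable names `raw` and `prepared`, and the choice of `Complication_risk` as the categorical example are assumptions for illustration.

```python
import pandas as pd

# Small synthetic stand-in for medical_clean.csv (values are made up).
raw = pd.DataFrame({
    "Initial_days": [10.6, 15.1, 4.8, 68.7],
    "VitD_levels": [19.1, 18.9, 18.1, 18.2],
    "Complication_risk": ["Medium", "High", "Medium", "Low"],
    "ReAdmis": ["No", "No", "No", "Yes"],
    "City": ["Eva", "Marianna", "Sioux Falls", "West Point"],
})

# 1. Keep only the columns intended for the model.
keep = ["Initial_days", "VitD_levels", "Complication_risk", "ReAdmis"]
prepared = raw[keep].copy()

# 2. Store categorical variables with the pandas 'category' dtype.
for col in ["Complication_risk", "ReAdmis"]:
    prepared[col] = prepared[col].astype("category")

# 3. Confirm the conversion did not introduce null values.
null_counts = prepared.isna().sum()
print(prepared.dtypes)
print(null_counts)
```

Taking a `.copy()` of the subset keeps the type conversions from touching the original frame and avoids pandas chained-assignment warnings.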

### C2: Dataset Variables

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission identifies the data set variables used to perform the analysis for the classification question from part A1, and the submission accurately classifies each variable as continuous or categorical.

</font>

---

### C3: Steps for Analysis

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.

</font>

</br>

---

 <font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission accurately explains each step used to prepare the data for analysis, and the submission identifies an accurate code segment for each step.

</font>

---

### C4: Cleaned Dataset

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Provide a copy of the cleaned data set.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission includes an accurate copy of the cleaned data set.


</font>

---

In [None]:
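# Export the prepared subset (df1, defined in section D3); the output path is a placeholder
df1.to_csv('./data/medical_prepared.csv', index=False)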

## Part IV - Analysis

### D1: Splitting the Data

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Split the data into training and test data sets and provide the file(s).

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission provides reasonably proportioned training and test data sets.

</font>

---
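One way to produce and save the training and test sets is sketched below. It uses synthetic data in place of the prepared data set, and the output directory and file names (`train.csv`, `test.csv`) are placeholders chosen for the example.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data (values are made up).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Initial_days": rng.uniform(1, 72, size=100),
    "VitD_levels": rng.normal(18, 2, size=100),
})
demo["ReAdmis"] = np.where(demo["Initial_days"] > 55, "Yes", "No")

X = demo.drop(columns="ReAdmis")
y = demo["ReAdmis"]

# 80/20 split; stratify keeps the Yes/No ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Write each split to CSV; the directory and file names are placeholders.
out_dir = Path(tempfile.mkdtemp())
X_train.join(y_train).to_csv(out_dir / "train.csv", index=False)
X_test.join(y_test).to_csv(out_dir / "test.csv", index=False)
```

Fixing `random_state` makes the split reproducible, so the saved files match the data the model was actually trained and evaluated on.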

### D2: Output & Intermediate Calculations

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission accurately describes the analysis technique used to appropriately analyze the data, and the submission includes accurate screenshots of the intermediate calculations performed.


</font>

---

### D3: Code Execution

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Provide the code used to perform the classification analysis from part D2.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission provides the code used to perform the classification analysis from part D2 and the code executes without errors.

</font>

---

In [3]:
df = pd.read_csv('./data/medical_clean.csv')

df.head(5)

Unnamed: 0,CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,...,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
0,1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,...,3726.70286,17939.40342,3,3,2,2,4,3,3,4
1,2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,...,4193.190458,17612.99812,3,4,3,4,4,4,3,3
2,3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,...,2434.234222,17505.19246,2,4,4,4,3,4,3,3
3,4,A879973,1dec528d-eb34-4079-adce-0d7a40e82205,cd17d7b6d152cb6f23957346d11c3f07,New Richland,MN,Waseca,56072,43.89744,-93.51479,...,2127.830423,12993.43735,3,5,5,3,4,5,5,5
4,5,C544523,5885f56b-d6da-43a3-8760-83583af94266,d2f0425877b10ed6bb381f3e2579424a,West Point,VA,King William,23181,37.59894,-76.88958,...,2113.073274,3716.525786,2,1,3,3,5,3,4,3


In [4]:
df.columns

Index(['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State',
       'County', 'Zip', 'Lat', 'Lng', 'Population', 'Area', 'TimeZone', 'Job',
       'Children', 'Age', 'Income', 'Marital', 'Gender', 'ReAdmis',
       'VitD_levels', 'Doc_visits', 'Full_meals_eaten', 'vitD_supp',
       'Soft_drink', 'Initial_admin', 'HighBlood', 'Stroke',
       'Complication_risk', 'Overweight', 'Arthritis', 'Diabetes',
       'Hyperlipidemia', 'BackPain', 'Anxiety', 'Allergic_rhinitis',
       'Reflux_esophagitis', 'Asthma', 'Services', 'Initial_days',
       'TotalCharge', 'Additional_charges', 'Item1', 'Item2', 'Item3', 'Item4',
       'Item5', 'Item6', 'Item7', 'Item8'],
      dtype='object')

In [72]:
df1 = df[['Initial_days',
          'VitD_levels',
          'ReAdmis']]


In [19]:
df1.sample(n=100,
           random_state=42)

Unnamed: 0,Initial_days,Age,Complication_risk
6252,48.634250,67,Medium
4684,12.062901,58,Medium
1731,3.766619,75,High
4742,12.612046,76,Low
4521,16.738161,52,Low
...,...,...,...
3787,6.505432,74,Medium
9189,65.345430,45,Medium
7825,71.229110,36,High
7539,67.077730,38,Low


In [71]:
import plotly.express as px

fig = px.scatter(df.sample(n=100,
                           random_state=42),
                 x='Initial_days',
                 y='VitD_levels',
                 color='ReAdmis',
                 template='seaborn')

fig.show()

In [74]:
df1

Unnamed: 0,Initial_days,VitD_levels,ReAdmis
0,10.585770,19.141466,No
1,15.129562,18.940352,No
2,4.772177,18.057507,No
3,1.714879,16.576858,No
4,1.254807,17.439069,No
...,...,...,...
9995,51.561220,16.980860,No
9996,68.668240,18.177020,Yes
9997,70.154180,17.129070,Yes
9998,63.356900,19.910430,Yes


In [86]:
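from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Features are every column except the target
X = df1.drop(['ReAdmis'],
             axis=1)

# Target: readmission status ('Yes'/'No')
y = df1.ReAdmis

# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Baseline model: classify each patient by its single nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,
        y_train)

# Mean accuracy on the held-out test set
knn.score(X_test,
          y_test)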

0.971
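Because KNN relies on distances, features measured on different scales can contribute unevenly to the result. The sketch below shows one common way to standardize features inside a pipeline before fitting. It runs on synthetic data with made-up variable names, and whether scaling changes the score on the real data is not something this notebook reports.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two predictors and the target (values are made up).
rng = np.random.default_rng(0)
n = 400
initial_days = rng.uniform(1, 72, size=n)
vitd = rng.normal(18, 2, size=n)
X_demo = np.column_stack([initial_days, vitd])
y_demo = np.where(initial_days > 55, "Yes", "No")

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(Xtr, ytr)

scaled_accuracy = scaled_knn.score(Xte, yte)
print(scaled_accuracy)
```

Placing the scaler in the pipeline keeps test-set statistics out of the fitting step.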

In [93]:
for i in range(1,21, 2):
    print(i)

1
3
5
7
9
11
13
15
17
19


In [104]:
# Fit a KNN model for each k from 1 to 20 and record its test-set accuracy
num_k = []
knnscore = []
for i in range(1,21):
    num_k.append(i)
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    knnscore.append(knn.score(X_test,
                              y_test))

In [105]:
pltscore = pd.DataFrame({'num_k': num_k,
                         'knnscore': knnscore})

In [108]:
fig = px.line(pltscore,
              x='num_k',
              y='knnscore')

fig.show()
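The loop above scores every candidate k on the test set, so the same data both chooses k and evaluates it, which can make the reported accuracy optimistic. A commonly used alternative is to choose k by cross-validation on the training data and touch the test set only once at the end. The sketch below illustrates this with `GridSearchCV` on synthetic data; the variable names and the 5-fold setting are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the two predictors and the target (values are made up).
rng = np.random.default_rng(1)
n = 300
initial_days = rng.uniform(1, 72, size=n)
vitd = rng.normal(18, 2, size=n)
X_demo = np.column_stack([initial_days, vitd])
y_demo = np.where(initial_days > 55, "Yes", "No")

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Odd k from 1 to 19 avoids tied votes in a two-class problem.
param_grid = {"n_neighbors": list(range(1, 21, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(Xtr, ytr)

best_k = search.best_params_["n_neighbors"]
# The test set is used once, after k has been chosen.
held_out_accuracy = search.score(Xte, yte)
print(best_k, held_out_accuracy)
```

By default `GridSearchCV` refits the best model on the full training set, so `search.score` evaluates that refitted model.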

## Part V - Data Summary & Implications

### E1: Accuracy & AUC

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Explain the accuracy and the area under the curve (AUC) of your classification model.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission logically explains both the accuracy and the AUC of the classification model.

</font>

---
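Accuracy is the share of test observations whose predicted label is correct. AUC is the probability that the model scores a randomly chosen readmitted patient higher than a randomly chosen non-readmitted one. The sketch below shows how both could be computed with scikit-learn. It uses synthetic data and made-up variable names, so the printed values say nothing about the real model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the two predictors and the target (values are made up).
rng = np.random.default_rng(2)
n = 400
initial_days = rng.uniform(1, 72, size=n)
vitd = rng.normal(18, 2, size=n)
X_demo = np.column_stack([initial_days, vitd])
y_demo = np.where(initial_days > 55, "Yes", "No")

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr)

# Accuracy: fraction of test labels predicted correctly.
accuracy = accuracy_score(yte, knn_demo.predict(Xte))

# AUC needs a score for the positive class rather than a hard label.
yes_index = list(knn_demo.classes_).index("Yes")
yes_proba = knn_demo.predict_proba(Xte)[:, yes_index]
auc = roc_auc_score(yte == "Yes", yes_proba)

print(accuracy, auc)
```

For KNN, `predict_proba` returns the fraction of the k neighbors belonging to each class, so with a small k the scores take only a few distinct values and the ROC curve has few steps.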

### E2: Results & Implications

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Discuss the results and implications of your classification analysis.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission adequately discusses both the results and implications of the classification analysis.

</font>

---

### E3: Limitations

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Discuss **one** limitation of your data analysis.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission logically discusses 1 limitation of the data analysis with adequate detail.

</font>

---

### E4: Course of Action

---

<font size=3 color='yellow'><b><i>Requirements</b></i>

Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.

</font>

</br>

---

<font size=3 color='green'><b><i>Rubric: Competent</b></i>

The submission recommends a reasonable course of action for the real-world organizational situation from part A1 based on the results and implications discussed in part E2.

</font>

---

### H: Sources


Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
