# Assignment for Week 2 - KNN
- Get to know your data; start with data exploration. Summarize your findings. 
- Divide the data randomly into a training set and a test set with an 80:20 ratio. Make predictions based on 1-nearest neighbor. What is the error rate of this approach? Report your results in a confusion matrix. 
- Try different values of K. What is the optimal value of K from your experiments? Report the error rate of the optimal K value and its confusion matrix. Is there any improvement (and by how much) over 1-nearest neighbor? 
- Is there anything else you can do to improve your model? If yes, demonstrate your approach. (Hint: there is always something that you can try, unless your accuracy score is 100%) 
 
**Deliverables:**
Upload your notebook's .ipynb file. (Also, if you decide to use your heart_disease data set, I'll need a copy of that too. I can't validate your notebook without your dataset.) 
 
> Important: Make sure you provide complete and thorough explanations for all of your analysis. You need to defend your thought processes and reasoning.

In [4]:
import pandas as pd
# import numpy as np
# import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, r2_score
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
df = pd.read_csv('./heart.disease.data.clean.csv')
df.info()
df_clean = df.copy()
df_clean.loc[df_clean['num'] > 0, 'num'] = 1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282 entries, 0 to 281
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       282 non-null    int64  
 1   sex       282 non-null    int64  
 2   cp        282 non-null    int64  
 3   trestbps  282 non-null    int64  
 4   chol      282 non-null    int64  
 5   cigs      282 non-null    float64
 6   years     282 non-null    float64
 7   fbs       282 non-null    int64  
 8   famhist   282 non-null    int64  
 9   restecg   282 non-null    int64  
 10  thalach   282 non-null    int64  
 11  exang     282 non-null    int64  
 12  thal      282 non-null    int64  
 13  num       282 non-null    int64  
dtypes: float64(2), int64(12)
memory usage: 31.0 KB


In [6]:
# Selected predictors with abs corr >= .15
target_col = 'num'
feature_cols = ['age', 'sex', 'cp', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'thal']
X = df_clean[feature_cols].values
y = df_clean[target_col].values
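
The comment in this cell says predictors were kept when their absolute correlation (presumably with the target `num`) is at least 0.15, but the notebook does not show that step. Below is a minimal sketch of one way to do it. It uses a small synthetic frame (`demo`, with made-up columns `a`, `b`, `c`) instead of the heart disease data, and the helper name `select_by_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd

def select_by_correlation(frame, target, threshold=0.15):
    """Return columns whose absolute Pearson correlation with `target` is >= threshold."""
    corr = frame.corr()[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()

# Synthetic demo data (an assumption, not the heart disease file).
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
demo = pd.DataFrame({
    'a': a,  # drives the target upward
    'b': b,  # not used to build the target
    'c': c,  # drives the target downward
    'target': (a - c + rng.normal(scale=0.5, size=n) > 0).astype(int),
})

selected = select_by_correlation(demo, 'target')
print(selected)
```

On the notebook's data the equivalent call would be `select_by_correlation(df_clean, 'num')`.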

In [7]:
# Hold out 20% of the data as the test set (80:20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

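# Standardize the features to improve accuracy
sc_X = StandardScaler()
# Note: the scaler is fit on the entire dataset, so test-set statistics
# influence the scaling; fitting on X_train alone would avoid this leakage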
sc_X.fit(X)
X_train = sc_X.transform(X_train)
X_test = sc_X.transform(X_test)
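
Fitting the scaler on the full dataset lets test-set means and variances influence the transformation. A common alternative is to fit the scaler on the training rows only, for example by wrapping the scaler and classifier in a `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart disease features (an assumption), so its score is not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features (assumption).
X_demo, y_demo = make_classification(n_samples=282, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics to transform the test rows at prediction time.
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=5)),
])
leak_free.fit(X_tr, y_tr)

test_accuracy = leak_free.score(X_te, y_te)
print(f'test accuracy: {test_accuracy:.2f}')
```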

In [8]:
# Results of best model
k = 5
classifier = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred))

# Confusion matrix of the best model (rows: true class, columns: predicted class)
cf_matrix = confusion_matrix(y_test, y_pred)

              precision    recall  f1-score   support

           0       0.77      0.83      0.80        29
           1       0.81      0.75      0.78        28

    accuracy                           0.79        57
   macro avg       0.79      0.79      0.79        57
weighted avg       0.79      0.79      0.79        57
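
The assignment asks for an error rate and a confusion matrix, which the report above does not print directly. The counts below are inferred from the rounded report (support 29 and 28, recall 0.83 and 0.75), not values printed by the notebook. They are the only whole-number counts that reproduce those figures. A short check in plain Python:

```python
# Counts inferred from the rounded classification report (an assumption,
# not printed output): rows are true classes, columns are predictions.
tn, fp = 24, 5   # true class 0: 24 correct, 5 predicted as 1
fn, tp = 7, 21   # true class 1: 7 predicted as 0, 21 correct

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
error_rate = 1 - accuracy

recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

print(f'accuracy   = {accuracy:.2f}')
print(f'error rate = {error_rate:.2f}')
print(f'recall     = {recall_0:.2f}, {recall_1:.2f}')
print(f'precision  = {precision_0:.2f}, {precision_1:.2f}')
```

Under these counts the k=5 model misclassifies 12 of 57 test rows, an error rate of about 21%.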



# Summary

- Features were selected from the correlation matrix, keeping only predictors with an absolute correlation of at least 0.15.
- As requested, 20% of the dataset was held out as the test set (80:20 split).
- As requested, the baseline model used k=1.
- The final k was chosen as the best model among k=1..40.
- The selection criterion was higher accuracy score first, then fewer false negatives.
- Normalizing the predictor values improved the model's accuracy score by 16.05%.
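
The selection rule described in this list (try k = 1..40, prefer higher accuracy, break ties with fewer false negatives) is not shown in the notebook. One possible sketch follows. It uses synthetic data from `make_classification` as a stand-in (an assumption). It also scores candidates by cross-validation on the training split instead of on the test set, which is a substitution and may differ from how the original search was run.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data (assumption).
X_demo, y_demo = make_classification(n_samples=282, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = []
for k in range(1, 41):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Out-of-fold predictions on the training split only.
    oof_pred = cross_val_predict(model, X_tr, y_tr, cv=5)
    acc = accuracy_score(y_tr, oof_pred)
    false_negatives = confusion_matrix(y_tr, oof_pred)[1, 0]
    results.append((k, acc, false_negatives))

# Highest accuracy first; among ties, fewest false negatives.
best_k, best_acc, best_fn = max(results, key=lambda r: (r[1], -r[2]))
baseline_acc = results[0][1]  # k = 1
print(f'best k = {best_k}, accuracy = {best_acc:.3f}, false negatives = {best_fn}')
print(f'improvement over k = 1: {best_acc - baseline_acc:+.3f}')
```

Keeping the test set out of the search means the final test score remains an honest estimate for the chosen k.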

In [10]:
from sklearn.pipeline import Pipeline
import joblib
pipeline = Pipeline([('scaler', sc_X), ('classifier', classifier)])

# save the model to disk
joblib.dump(pipeline, 'heart-disease.pkl')

['heart-disease.pkl']
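
To reuse the saved model, the file can be loaded back with `joblib.load` and called on unscaled feature rows, since the pipeline applies the scaler itself. Here is a self-contained sketch, assuming synthetic data and a temporary file (the file name `demo-model.pkl` is hypothetical):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features (assumption).
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=5)),
]).fit(X_demo, y_demo)

# Round-trip the fitted pipeline through a file on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-model.pkl')
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline scales raw rows before classifying them.
same_predictions = (restored.predict(X_demo) == demo_pipeline.predict(X_demo)).all()
print(same_predictions)
```

In the notebook, the equivalent call would be `joblib.load('heart-disease.pkl')`, with input columns in the same order as `feature_cols`.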