# Creating a Cardiovascular Disease Classification App
## Using Logistic Regression and an AWS EC2 Instance

_Machine Learning Program | Deployment_

---

Now that we have a basic understanding of how AWS works, we will use it to build a complete project from end to end. Our goal is a simple web page where a user can enter key cardiovascular disease indicators. The web page will then send that information to our deployed model, which will predict whether the person has cardiovascular disease.

## General Outline

Recall the general outline for SageMaker projects using a notebook instance.

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Step 1: Setting up the notebook

We start by importing the necessary libraries.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

## Step 2: Preparing and Processing the data

We will be doing some initial data processing. Then, we will split the dataset into a training set and a testing set.

In [2]:
df=pd.read_csv('cardio_train.csv',sep=';',index_col='id')

In [3]:
X=df.drop('cardio',axis=1)
y=df['cardio']

In [4]:
# convert age from days to years
age_year=X['age'].apply(lambda x:x/365)
X['age']=np.ceil(age_year)

In [5]:
X.head()

| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 988 | 62.0 | 1 | 155 | 69.0 | 130 | 80 | 2 | 2 | 0 | 0 | 1 |
| 989 | 41.0 | 1 | 163 | 71.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 |
| 990 | 61.0 | 1 | 165 | 70.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 |
| 991 | 40.0 | 2 | 165 | 85.0 | 120 | 80 | 1 | 1 | 1 | 1 | 1 |
| 992 | 65.0 | 1 | 155 | 62.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 |
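
The cells so far separate the features from the `cardio` target but do not show a train/test split. A minimal sketch of that step, assuming scikit-learn's `train_test_split` and a small synthetic table standing in for `X` and `y`. The `X_demo`/`y_demo` names, the 80/20 ratio, and the `random_state` are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and target (NOT the real data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(30, 65, size=100).astype(float),
    "ap_hi": rng.integers(90, 180, size=100),
    "ap_lo": rng.integers(60, 110, size=100),
})
y_demo = pd.Series(rng.integers(0, 2, size=100), name="cardio")

# Hold out 20% of the rows for testing (assumed ratio).
# stratify keeps the share of positive cases similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_train.shape, X_test.shape)
```

With a held-out test set, the final model can be scored on rows it never saw during feature selection or hyperparameter search.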


Now that we've transformed the data, we will select the best features for our model: the five with the highest importance scores from an `ExtraTreesClassifier`.

In [6]:
from sklearn.ensemble import ExtraTreesClassifier
model=ExtraTreesClassifier()
model.fit(X,y)



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [7]:
FI=pd.Series(model.feature_importances_,index=X.columns)
best_feat=FI.nlargest(5)
feat_index=best_feat.index

In [8]:
X=X.filter(items=list(feat_index),axis=1)
X.head()

| id | weight | height | ap_hi | age | ap_lo |
|---|---|---|---|---|---|
| 988 | 69.0 | 155 | 130 | 62.0 | 80 |
| 989 | 71.0 | 163 | 110 | 41.0 | 70 |
| 990 | 70.0 | 165 | 120 | 61.0 | 80 |
| 991 | 85.0 | 165 | 120 | 40.0 | 80 |
| 992 | 62.0 | 155 | 120 | 65.0 | 80 |
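
The selection step above can be tried in isolation. The sketch below ranks features by `feature_importances_` and keeps the top few, using synthetic data from `make_classification` instead of the cardio dataset. The `f0`...`f7` column names, `n_estimators=100`, and `random_state=0` are assumptions for illustration. Because `ExtraTreesClassifier` is randomized, an unseeded run can rank features differently from one run to the next.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: 8 features, of which 3 carry signal (illustrative only).
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Fit a seeded forest so the importance ranking is reproducible.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; a higher value means more useful splits.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top = importances.nlargest(3)

# Keep only the selected columns, in order of importance.
X_selected = X_demo[top.index]

print(importances.sort_values(ascending=False).round(3))
print(list(X_selected.columns))
```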


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

In [10]:
param_grid = {'C': [0.1,1, 10, 100, 1000]}
n_folds=10

In [11]:
clf=GridSearchCV(LogisticRegression(),param_grid=param_grid,cv=n_folds,refit=True,verbose=3)

In [12]:
clf.fit(X,y)

Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] C=0.1 ...........................................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] .................. C=0.1, score=0.7068244120617515, total=   0.4s
[CV] C=0.1 ...........................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s
[CV] .................. C=0.1, score=0.6973019766267494, total=   0.4s
[CV] C=0.1 ...........................................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
[CV] .................. C=0.1, score=0.7158730158730159, total=   0.4s
[CV] C=0.1 ...........................................................
[CV] .................. C=0.1, score=0.7103896103896103, total=   0.4s
[CV] C=0.1 ...........................................................
[CV] .................. C=0.1, score=0.7109668109668109, total=   0.4s
[CV] C=0.1 ...........................................................
[CV] .................. C=0.1, score=0.7115440115440116, total=   0.5s
[CV] C=0.1 ...........................................................
[CV] .................. C=0.1, score=0.7124098124098124, total=   0.4s
[CV] C=0.1 ...........................................................
[CV] ................... C=0.1, score=0.712987012987013, total=   0.4s
[CV] C=0.1 ...........................................................
[CV] .................. C=0.1, score=0.7054834054834055, total=   0.4s
[CV] C=0.1 ...........................................................
[CV] .................. C=0.1, score=0.7107807764468177, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7074015293608426, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7042273842158419, total=   0.3s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7157287157287158, total=   0.3s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7113997113997114, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7105339105339106, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7142857142857143, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7113997113997114, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7154401154401154, total=   0.5s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7041847041847041, total=   0.4s
[CV] C=1 .............................................................
[CV] .................... C=1, score=0.7087602828691009, total=   0.4s
[CV] C=10 ............................................................
[CV] .................... C=10, score=0.706535853412206, total=   0.4s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7046602221901601, total=   0.4s
[CV] C=10 ............................................................
[CV] .................... C=10, score=0.716017316017316, total=   0.3s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7112554112554113, total=   0.6s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7096681096681097, total=   0.4s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7154401154401154, total=   0.4s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7113997113997114, total=   0.4s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7147186147186148, total=   0.5s
[CV] C=10 ............................................................
[CV] .................... C=10, score=0.704040404040404, total=   0.3s
[CV] C=10 ............................................................
[CV] ................... C=10, score=0.7091932457786116, total=   0.3s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7063915740874332, total=   0.4s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7045159428653874, total=   0.4s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7163059163059163, total=   0.4s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7112554112554113, total=   0.4s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7096681096681097, total=   0.5s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7154401154401154, total=   0.4s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7113997113997114, total=   0.5s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7145743145743145, total=   0.5s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7041847041847041, total=   0.5s
[CV] C=100 ...........................................................
[CV] .................. C=100, score=0.7089046038389378, total=   0.4s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7063915740874332, total=   0.5s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7048045015149329, total=   0.4s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7164502164502164, total=   0.6s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7112554112554113, total=   0.4s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7096681096681097, total=   0.4s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7154401154401154, total=   0.3s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7112554112554113, total=   0.4s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7145743145743145, total=   0.4s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7044733044733045, total=   0.3s
[CV] C=1000 ..........................................................
[CV] ................. C=1000, score=0.7089046038389378, total=   0.3s
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   25.9s finished


GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [0.1, 1, 10, 100, 1000]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=3)
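
The per-fold scores in the log all sit close to 0.71, so it is easier to compare candidates by their mean score than by reading the log. A minimal sketch of how a fitted `GridSearchCV` can be inspected, run here on synthetic data rather than the cardio features. The `search`, `X_demo`, and `y_demo` names and `max_iter=1000` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data with 5 features (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Same grid and fold count as the notebook cell; max_iter raised to avoid
# convergence warnings on this toy data (an assumption, not the original setting).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10, 100, 1000]},
    cv=10,
    refit=True,
)
search.fit(X_demo, y_demo)

# Best candidate and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 4))

# Mean score per candidate, in grid order.
for C, mean in zip(search.cv_results_["param_C"], search.cv_results_["mean_test_score"]):
    print(C, round(float(mean), 4))
```

Because `refit=True`, `search.best_estimator_` is already retrained on all the data passed to `fit`, and calling `search.predict(...)` delegates to it.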

In [13]:
# pickle.dump(clf,open('clf_model.pkl','wb'))
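
The commented-out line above would persist the fitted search object. A hedged sketch of the full round trip follows: save a model, load it back, and score one set of user inputs as the web page might send them. The stand-in model is trained on synthetic data, `predict_from_form` is a hypothetical helper, and the temporary file location is an assumption. The feature order follows the `X.head()` output shown earlier.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Column order taken from the selected-feature table earlier in the notebook.
FEATURES = ["weight", "height", "ap_hi", "age", "ap_lo"]

# Stand-in model trained on synthetic data (NOT the real cardio classifier).
X_arr, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=FEATURES)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save to a temporary directory so the example leaves the working folder untouched.
model_path = Path(tempfile.mkdtemp()) / "clf_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load the model back, as the serving code would do at startup.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)


def predict_from_form(form_values):
    """Hypothetical helper: map a dict of form fields to a 0/1 prediction."""
    row = pd.DataFrame(
        [[float(form_values[name]) for name in FEATURES]], columns=FEATURES
    )
    return int(loaded.predict(row)[0])


# Values copied from the first row of the table above, for illustration.
example = {"weight": 69, "height": 155, "ap_hi": 130, "age": 62, "ap_lo": 80}
print(predict_from_form(example))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code. The scikit-learn version used for loading should also match the one used for saving.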