### V. Algorithm Selection

In this part, we create a model, try different algorithms, and see which one delivers the best results. We then choose the best algorithm and fine-tune it.

This notebook presents the following parts:

    1) Model creation
    2) Algorithm testing: Linear Regression (simple, lasso, ridge), Decision tree regressor, Random forest regressor, Bayesian linear regressor, XGBoost regressor
    3) Choosing best algorithm
    4) Save model


     

In [1]:
# import libraries
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics


%matplotlib inline

In [2]:
#import data set
df=pd.read_csv('dfformodeling.csv')
df.shape

(3198, 44)

In [3]:
df.head(2)

Unnamed: 0,demo__birth_rate_per_1k,demo__death_rate_per_1k,demo__pct_adults_bachelors_or_higher,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_below_18_years_of_age,demo__pct_female,...,Economic_typo:_Federal/State government-dependent,Economic_typo:_Manufacturing-dependent,Economic_typo:_Mining_farming,Economic_typo:_Nonspecialized,Economic_typo:_Recreation,Area_rucc:_Metro,Area_rucc:_NonMetro,Age_Group:_old,Age_Group:_young,heart_disease_mortality_per_100k
0,0.117909,0.609758,-0.504912,0.665733,1.045102,-1.415408,-0.244441,-0.082609,0.212671,0.704488,...,-0.37262,2.339225,-0.546672,-0.809367,-0.328747,1.354449,-1.354449,-0.511328,0.511328,312
1,2.673105,-1.184837,0.670671,0.22459,-1.651557,0.78957,-0.197126,0.074713,1.292277,0.171536,...,-0.37262,-0.427358,1.828678,-0.809367,-0.328747,1.354449,-1.354449,-0.511328,0.511328,257
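
The `Unnamed: 0` column in the preview above looks like a row index that was written out when the CSV was saved. That is an assumption, since the source does not say where the column comes from. If it is an index, the split below passes it to every model as a feature. A minimal sketch of two ways to drop it, using a small made-up frame rather than `dfformodeling.csv`:

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV: the first, unnamed column is a saved row index.
csv_text = """,feature_a,heart_disease_mortality_per_100k
0,0.12,312
1,2.67,257
"""

# Reading without index_col reproduces the stray 'Unnamed: 0' column.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop it after loading (errors='ignore' makes this a no-op if it is absent).
clean = raw.drop(columns=['Unnamed: 0'], errors='ignore')

# Option 2: tell pandas at load time that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Applying either option to the real data would change the reported shapes and possibly the scores, so the numbers in this notebook correspond to the data as loaded, with the column included.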


**1) Model Creation**

In [4]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['heart_disease_mortality_per_100k'], axis=1),
    df['heart_disease_mortality_per_100k'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((2238, 43), (960, 43))

**2) Algorithm testing**

2.1 Algorithm: **Linear Regression**

2.1.1 Linear Regression Simple

In [5]:
#Train the Model and predict
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)

In [6]:
#print RMSLE
print ('Simple Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, lm_predictions)))

Simple Regression RMSLE is 0.12793763949775402
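
For reference, the metric used throughout is the root mean squared logarithmic error, RMSLE = sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2)). The sketch below spells it out in NumPy; the helper name `rmsle` and the demo values are illustrative and not part of the original notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is not defined here for negative values")
    # log1p(x) = log(1 + x), the same transform mean_squared_log_error applies
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up values, for illustration only
y_true_demo = [312, 257, 300]
y_pred_demo = [300, 270, 300]
print(rmsle(y_true_demo, y_pred_demo))
```

Because the error is taken on a log scale, RMSLE penalizes relative rather than absolute differences. A linear model can in principle predict negative values, in which case the metric cannot be computed without clipping or transforming the predictions first.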


2.1.2 Lasso Linear Regression (l1)

In [7]:
#Train the Model and predict
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_log_error
lasso_model = Lasso()  # separate instance name so the Lasso class is not shadowed
lasso_model.fit(X_train,y_train)
Lasso_predictions = lasso_model.predict(X_test)

In [8]:
#print RMSLE
print ('Lasso Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Lasso_predictions)))

Lasso Regression RMSLE is 0.1264713654708869


2.1.3 Ridge Regression (l2)

In [9]:
#Train the Model and predict
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error
ridge_model = Ridge()  # separate instance name so the Ridge class is not shadowed
ridge_model.fit(X_train,y_train)
Ridge_predictions = ridge_model.predict(X_test)

In [10]:
#print RMSLE
print ('Ridge Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Ridge_predictions)))

Ridge Regression RMSLE is 0.12469258919157813
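
Both `Lasso()` and `Ridge()` above use the default regularization strength `alpha=1.0`. As a sketch of how that strength could be tuned instead, the example below uses `RidgeCV` on synthetic data. The data, the alpha grid, and the `_demo` names are assumptions made for illustration, and `LassoCV` plays the same role for the lasso.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the real features and target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([20.0, -10.0, 0.0, 5.0, 0.0])
y_demo = 250 + X_demo @ true_coef + rng.normal(scale=5, size=200)

# Candidate regularization strengths on a log grid (an arbitrary choice)
alphas = np.logspace(-3, 3, 13)

# 5-fold cross-validation picks the alpha with the best validation score
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_demo, y_demo)

print('selected alpha:', ridge_cv.alpha_)
```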


2.2 Algorithm: **Decision Tree Regressor**

In [11]:
#Train the Model and predict
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_log_error
Tree = DecisionTreeRegressor()
Tree.fit(X_train,y_train)
Tree_predictions = Tree.predict(X_test)

In [12]:
#print RMSLE
print ('Decision Tree Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Tree_predictions)))

Decision Tree Regression RMSLE is 0.16514766057063374
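
Note that `DecisionTreeRegressor` fits a single tree, with no boosting involved. For comparison, a boosted ensemble of trees could be sketched with scikit-learn's `GradientBoostingRegressor`, as below. The synthetic data and the `_demo` names are assumptions for illustration, and this is not the model that produced the score above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, strictly positive target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = (250 + 30 * np.sin(X_demo[:, 0]) + 10 * X_demo[:, 1]
          + rng.normal(scale=5, size=400))

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# One unpruned tree versus a sequence of shallow trees fitted to residuals
single_tree = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
boosted = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)

scores_demo = {}
for name, model in [('single tree', single_tree), ('gradient boosting', boosted)]:
    scores_demo[name] = np.sqrt(mean_squared_log_error(yte, model.predict(Xte)))
    print(name, 'RMSLE is', scores_demo[name])
```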


2.3 Algorithm: **Random Forest Regressor**

In [13]:
#Train the Model and predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
Forest = RandomForestRegressor()
Forest.fit(X_train,y_train)
Forest_predictions = Forest.predict(X_test)



In [14]:
#print RMSLE
print ('Random Forest Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Forest_predictions)))

Random Forest Regression RMSLE is 0.11890527806414833
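
`RandomForestRegressor()` is called above without a `random_state`, so rerunning the cell can give a slightly different score. A minimal sketch of pinning the seed, on synthetic data (the data, the seed value, and `n_estimators=50` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 250 + 15 * X_demo[:, 0] + rng.normal(scale=5, size=150)

# Two forests with the same seed draw the same bootstrap samples and feature subsets
forest_a = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
forest_b = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

same = np.allclose(forest_a.predict(X_demo), forest_b.predict(X_demo))
print('identical predictions:', same)
```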


2.4 Algorithm: **Bayesian Linear Regressor**

In [15]:
#Train the Model and predict
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_squared_log_error
Bayesian = BayesianRidge()
Bayesian.fit(X_train,y_train)
Bayesian_predictions = Bayesian.predict(X_test)

In [16]:
#print RMSLE
print ('Bayesian Ridge Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Bayesian_predictions)))

Bayesian Ridge Regression RMSLE is 0.12500121290260072


2.5 Algorithm: **XGBoost Regressor**

In [18]:
import xgboost as xgb

xgb_model = xgb.XGBRegressor()  # separate name so the xgboost module is not shadowed
xgb_model.fit(X_train,y_train)
xgb_predictions = xgb_model.predict(X_test)


In [19]:
#print RMSLE
print ('XGB Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, xgb_predictions)))

XGB Regression RMSLE is 0.1161786309567886


**3) Compare and choose the best model**

In [20]:
print ('Simple Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, lm_predictions)))
print ('Lasso Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Lasso_predictions)))
print ('Ridge Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Ridge_predictions)))
print ('Decision Tree Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Tree_predictions)))
print ('Random Forest Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Forest_predictions)))
print ('Bayesian Ridge Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, Bayesian_predictions)))
print ('XGB Regression RMSLE is', np.sqrt(mean_squared_log_error(y_test, xgb_predictions)))

Simple Regression RMSLE is 0.12793763949775402
Lasso Regression RMSLE is 0.1264713654708869
Ridge Regression RMSLE is 0.12469258919157813
Decision Tree Regression RMSLE is 0.16514766057063374
Random Forest Regression RMSLE is 0.11890527806414833
Bayesian Ridge Regression RMSLE is 0.12500121290260072
XGB Regression RMSLE is 0.1161786309567886
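
The scores above come from a single 70/30 split, and several of them differ only in the third decimal place. One way to check whether such gaps are stable is k-fold cross-validation. The sketch below runs it for three of the scikit-learn models on synthetic data; the data, the fold count, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, strictly positive target (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 250 + 20 * X_demo[:, 0] - 10 * X_demo[:, 1] + rng.normal(scale=5, size=300)

models = {
    'Simple Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Random Forest Regression': RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmsle = {}
for name, model in models.items():
    # scikit-learn returns the negated MSLE so that higher is always better
    neg_msle = cross_val_score(model, X_demo, y_demo, cv=cv,
                               scoring='neg_mean_squared_log_error')
    fold_rmsle = np.sqrt(-neg_msle)
    cv_rmsle[name] = (fold_rmsle.mean(), fold_rmsle.std())
    print(f'{name} RMSLE is {fold_rmsle.mean():.4f} +/- {fold_rmsle.std():.4f}')
```

If the spread across folds is larger than the gap between two models, the ranking from a single split should be read with caution.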


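**XGB Regressor** has the lowest RMSLE on the test set (0.1162), narrowly ahead of Random Forest (0.1189), so we select it as the best model. All models were run with default hyperparameters on a single train/test split.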
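
The outline lists a "Save model" step that this section does not show. A minimal sketch of persisting a fitted estimator with `joblib` follows. It uses a `Ridge` model on synthetic data as a stand-in, and the file name `model.joblib` is hypothetical. The same `dump`/`load` calls are commonly used with other scikit-learn-style estimators.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data and model (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 250 + 10 * X_demo[:, 0] + rng.normal(scale=5, size=100)
model = Ridge().fit(X_demo, y_demo)

# Write the fitted model to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.joblib')  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
round_trip_ok = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print('round trip ok:', round_trip_ok)
```

A saved model should be loaded with the same library versions that were used to write it.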

