# ML model selection

**Model selection** in machine learning is the process of choosing the best-suited model for a particular problem. The choice depends on factors such as the **dataset**, the **task**, and the **nature of the model**.


#### Factors to consider when choosing a model
1. A logical reason for selecting the model (for example, the type of data or the task)
2. A comparison of the performance of the candidate models

Models can be selected based on:

1. **Type of data available**
   - Images & videos - CNN
   - Text data or speech data - RNN
   - Numerical data - SVM, Logistic Regression, Decision trees, etc.

2. **Task to be carried out**
   - Classification tasks - SVM, Logistic regression, Decision trees, etc.
   - Regression tasks - Linear regression, Random forest, Polynomial regression, etc.
   - Clustering tasks - K-means clustering, Hierarchical clustering
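
As a rough illustration of the task-based shortlist above, the mapping below pairs each task with scikit-learn estimators for some of the models named in the list. `CANDIDATES` and `shortlist` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical lookup table: task name -> candidate estimator classes
CANDIDATES = {
    "classification": [SVC, LogisticRegression, DecisionTreeClassifier],
    "regression": [LinearRegression, RandomForestRegressor],
    # AgglomerativeClustering is scikit-learn's hierarchical clustering estimator
    "clustering": [KMeans, AgglomerativeClustering],
}


def shortlist(task):
    """Return the class names of the candidate estimators for a task."""
    return [cls.__name__ for cls in CANDIDATES[task]]


print(shortlist("clustering"))
```

A shortlist like this only narrows the field; the candidates still have to be compared on the actual data.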


### ML models
1. Linear Regression
   
   **Advantages**
   - Very simple to implement
   - Performs well on data with a linear relationship
   
   **Disadvantages**
   - Not suitable for data with a non-linear relationship
   - Prone to underfitting when the true relationship is more complex
   - Sensitive to outliers

2. Logistic Regression

   **Advantages**
   - Easy to implement
   - Performs well when the classes are linearly separable
   - Less prone to overfitting on low-dimensional datasets

   **Disadvantages**
   - Can overfit on high-dimensional datasets
   - Struggles to capture complex relationships in a dataset
   - Sensitive to outliers
   - Needs a large dataset

3. Decision trees
   
   **Advantages**
   - Can be used for both classification & regression
   - Easy to interpret
   - No need for normalization or scaling
   - Not sensitive to outliers

   **Disadvantages**
   - Prone to overfitting
   - Small changes in the data can alter the tree structure, causing instability
   - Training time is relatively high
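
Two of the trade-offs listed above, underfitting for linear regression and overfitting for decision trees, can be seen in a small sketch. The data here is synthetic (a noisy quadratic generated with NumPy), which is an assumption made purely for illustration and has nothing to do with the heart dataset used below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): y = x^2 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(x_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

# score() returns R^2; a straight line cannot follow the curve (underfitting)
linear_train_r2 = linear.score(x_train, y_train)

# An unrestricted tree memorises the training points, noise included (overfitting)
tree_train_r2 = tree.score(x_train, y_train)
tree_test_r2 = tree.score(x_test, y_test)

print(f"linear train R^2: {linear_train_r2:.3f}")
print(f"tree   train R^2: {tree_train_r2:.3f}, test R^2: {tree_test_r2:.3f}")
```

The gap between the tree's training and test scores is the usual symptom of overfitting; limiting `max_depth` is one common way to narrow it.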



In [30]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../../data/heart.csv")

In [3]:
df.head(n=10)

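|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 5 | 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
| 6 | 56 | 0 | 1 | 140 | 294 | 0 | 0 | 153 | 0 | 1.3 | 1 | 0 | 2 | 1 |
| 7 | 44 | 1 | 1 | 120 | 263 | 0 | 1 | 173 | 0 | 0.0 | 2 | 0 | 3 | 1 |
| 8 | 52 | 1 | 2 | 172 | 199 | 1 | 1 | 162 | 0 | 0.5 | 2 | 0 | 3 | 1 |
| 9 | 57 | 1 | 2 | 150 | 168 | 0 | 1 | 174 | 0 | 1.6 | 2 | 0 | 2 | 1 |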


In [4]:
df.shape

(303, 14)

#### Statistical measures

In [5]:
df.describe()

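|       | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |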


#### Check for missing values

In [6]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Distribution of target variable

In [7]:
df["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

**0** represents a healthy heart and **1** represents a defective heart.
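
The two classes are fairly balanced, which makes plain accuracy a reasonable metric here. As a quick sanity check, the arithmetic below works out the accuracy of a trivial classifier that always predicts the majority class; any model worth selecting should beat this floor. The counts are copied from the `value_counts()` output above, and the variable names are introduced here for illustration.

```python
# Class counts copied from the value_counts() output above
counts = {1: 165, 0: 138}

total = sum(counts.values())

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(counts.values()) / total

print(f"{majority_baseline:.4f}")  # prints 0.5446
```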

### Split features and target

In [8]:
X = df.drop(["target"], axis=1)
y = df["target"]

In [9]:
print(X)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  
0        0   0     1  
1        0   0     2  
2        2   0    

In [10]:
X = np.asarray(X)
y = np.asarray(y)
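
The notebook imports `train_test_split` and `accuracy_score` but evaluates everything with cross-validation on the full data. One optional variation, sketched below, is to keep a stratified hold-out set aside so that the finally selected model can be scored on rows it never saw. The data is a synthetic stand-in generated with `make_classification` (an assumption, so the block runs without `heart.csv`), and the 80/20 split ratio is likewise only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the heart data (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# stratify=y_demo keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit on the training part only, then score on the untouched test part
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(X_train.shape, X_test.shape)
print(f"hold-out accuracy: {test_accuracy:.3f}")
```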

### Model selection

#### 1. Comparing models with default hyperparameter values using cross validation

In [11]:
# list of models
models = [LogisticRegression(max_iter=1000), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

In [18]:
def models_comparison_using_cross_validation():
    # evaluate every model in the global `models` list with 5-fold cross validation
    for model in models:
        cv_score = cross_val_score(model, X, y, cv=5)
        mean_accuracy = round(sum(cv_score) / len(cv_score) * 100, 2)
        print(f'Cross validation for {model}: \n{cv_score}')
        print('---'*20)
        print(f'Accuracy score for {model}: \n{mean_accuracy}%')
        print('==='*20)

In [19]:
models_comparison_using_cross_validation()

Cross validation for LogisticRegression(max_iter=1000): 
[0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
------------------------------------------------------------
Accuracy score for LogisticRegression(max_iter=1000): 
82.83%
Cross validation for SVC(kernel='linear'): 
[0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
------------------------------------------------------------
Accuracy score for SVC(kernel='linear'): 
82.83%
Cross validation for KNeighborsClassifier(): 
[0.60655738 0.6557377  0.57377049 0.73333333 0.65      ]
------------------------------------------------------------
Accuracy score for KNeighborsClassifier(): 
64.39%
Cross validation for RandomForestClassifier(random_state=0): 
[0.85245902 0.90163934 0.81967213 0.81666667 0.8       ]
------------------------------------------------------------
Accuracy score for RandomForestClassifier(random_state=0): 
83.81%


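**Inference**: For the heart disease dataset used, **RandomForestClassifier** has the highest mean cross-validation accuracy (83.81%) with default parameters. **LogisticRegression** and **SVC** follow closely at 82.83% each, while **KNeighborsClassifier** trails at 64.39%.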
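
KNN scores noticeably lower than the other models. One possible reason, not verified on this dataset, is that KNN is distance-based and the features have very different ranges (for example, `chol` runs from 126 to 564 while `sex` is 0 or 1), so the large-valued columns dominate the distance. The sketch below shows the effect on synthetic data, which is an assumption for illustration: wrapping KNN in a pipeline with `StandardScaler` rescales the features inside each cross-validation fold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): one informative small-scale
# feature and one pure-noise feature with a much larger scale
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
informative = labels + rng.normal(scale=0.3, size=300)
noise = rng.normal(scale=1000.0, size=300)
features = np.column_stack([informative, noise])

raw_knn = KNeighborsClassifier()
# The scaler is fitted on the training folds only, so no test-fold statistics leak in
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_score = cross_val_score(raw_knn, features, labels, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, features, labels, cv=5).mean()

print(f"raw: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```

The same pipeline could be passed to `cross_val_score` in place of `KNeighborsClassifier()` to check whether scaling helps on the heart data.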

#### 2. Comparing models with different hyperparameter values using **GridSearchCV**

In [47]:
# list of models
models_ = [LogisticRegression(max_iter=2000), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

In [21]:
# dictionary with hyperparameter values
model_hyperparams = {
    'logistic_reg_params': {
        'C': (1, 5, 10, 20),
    },
    'svc_params': {
        'kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
        'C': (1, 5, 10, 20),
    },
    'kNN_params': {
        'n_neighbors': (3, 5, 10),
    },
    'random_forest_params': {
        'n_estimators': (10, 20, 50, 100),
    },
}

In [25]:
print(model_hyperparams.values())

dict_values([{'C': (1, 5, 10, 20)}, {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': (1, 5, 10, 20)}, {'n_neighbors': (3, 5, 10)}, {'n_estimators': (10, 20, 50, 100)}])


In [26]:
model_keys = list(model_hyperparams.keys())

In [27]:
model_keys

['logistic_reg_params', 'svc_params', 'kNN_params', 'random_forest_params']
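
Before running the search it can help to estimate its cost. The sketch below uses scikit-learn's `ParameterGrid` to count the hyperparameter combinations in each grid and multiplies by the number of folds. The grids are copied from the dictionary above; `n_folds` and `fit_counts` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the model_hyperparams dictionary defined above
model_hyperparams = {
    'logistic_reg_params': {'C': (1, 5, 10, 20)},
    'svc_params': {
        'kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
        'C': (1, 5, 10, 20),
    },
    'kNN_params': {'n_neighbors': (3, 5, 10)},
    'random_forest_params': {'n_estimators': (10, 20, 50, 100)},
}

n_folds = 5  # matches cv=5 used in this notebook

# One fit per (combination, fold); GridSearchCV adds a final refit on all data
fit_counts = {
    name: len(ParameterGrid(grid)) * n_folds
    for name, grid in model_hyperparams.items()
}

for name, n_fits in fit_counts.items():
    print(f"{name}: {n_fits // n_folds} combinations, {n_fits} fits")
```

The SVC grid is the most expensive because its two hyperparameters multiply: 4 kernels times 4 values of `C` gives 16 combinations.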

#### Applying GridSearchCV

In [48]:
def model_selection(models, hyperparams_dict):
    """Run a 5-fold grid search for each model and tabulate the best results."""
    results = []
    # models and hyperparameter grids are paired by position
    for model, params in zip(models, hyperparams_dict.values()):
        classifier = GridSearchCV(model, param_grid=params, cv=5)
        classifier.fit(X, y)   # fit data to classifier

        results.append({
            "model_used": model,
            "highest_score": classifier.best_score_,
            "best_parameters": classifier.best_params_,
        })

    # build the summary table once, after all searches have finished
    result_df = pd.DataFrame(results, columns=['model_used', 'highest_score', 'best_parameters'])
    return result_df


In [49]:
model_selection(models=models_, hyperparams_dict=model_hyperparams)

|   | model_used | highest_score | best_parameters |
|---|------------|---------------|-----------------|
| 0 | LogisticRegression(max_iter=2000) | 0.831585 | {'C': 5} |
| 1 | SVC(kernel='linear') | 0.828306 | {'C': 1, 'kernel': 'linear'} |
| 2 | KNeighborsClassifier() | 0.64388 | {'n_neighbors': 5} |
| 3 | RandomForestClassifier(random_state=0) | 0.838087 | {'n_estimators': 100} |


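**Inference**: **Random Forest Classifier** with `n_estimators=100` again has the highest mean cross-validation accuracy (0.838). Since 100 is also the default number of trees in recent scikit-learn versions, this matches the score obtained with default parameters. Logistic Regression with `C=5` comes second at 0.832.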
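
One caveat: `best_score_` is the score that was used to pick the hyperparameters, so it tends to be slightly optimistic. A common remedy, sketched below, is nested cross-validation, where the whole grid search is itself evaluated by an outer `cross_val_score`. The data is a synthetic stand-in from `make_classification`, and the smaller grid and fold counts are assumptions chosen to keep the example quick; none of this is taken from the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data so the block runs without heart.csv
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# Inner loop: choose n_estimators with 3-fold cross validation
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": (10, 20, 50)},
    cv=3,
)

# Outer loop: each of the 5 folds reruns the search on its own training part
# and scores the selected model on data the search never saw
outer_scores = cross_val_score(inner_search, X_demo, y_demo, cv=5)

print(f"nested CV accuracy: {outer_scores.mean():.3f}")
```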