### **MODEL SELECTION**

MODEL SELECTION IS THE PROCESS OF CHOOSING THE BEST MODEL FOR A PARTICULAR PROBLEM. THE CHOICE DEPENDS ON SEVERAL FACTORS, SUCH AS THE DATASET, THE TASK, AND THE NATURE OF THE MODEL.

TWO FACTORS ARE USUALLY CONSIDERED:
1. A logical reason to select a model.

2. A comparison of the models' performance.

### **LOGICAL REASON** :

BASED ON THE TYPE OF DATA :

1. Images or videos - CNN
2. Text or speech data - RNN
3. Numerical (tabular) data - Logistic Regression, SVM, Decision Trees

BASED ON THE TASK :

1. Classification tasks - SVM, Logistic Regression, Decision Trees, etc.
2. Regression tasks - Linear Regression, Random Forest, Polynomial Regression, etc.
3. Clustering tasks - K-Means Clustering, Hierarchical Clustering, etc.

These pairings are common starting points rather than fixed rules.
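The task-based shortlist above can be written down as a small lookup table. This is only an illustrative sketch: `CANDIDATES_BY_TASK` and `shortlist` are hypothetical names, and the entries are plain strings copied from the lists above, not estimator objects.

```python
# Hypothetical lookup table: task name -> candidate model families.
# The entries mirror the task-based list above.
CANDIDATES_BY_TASK = {
    "classification": ["SVM", "Logistic Regression", "Decision Trees"],
    "regression": ["Linear Regression", "Random Forest", "Polynomial Regression"],
    "clustering": ["K-Means Clustering", "Hierarchical Clustering"],
}


def shortlist(task):
    """Return the candidate model families for a task (case-insensitive)."""
    key = task.strip().lower()
    if key not in CANDIDATES_BY_TASK:
        raise ValueError(f"Unknown task: {task!r}")
    return CANDIDATES_BY_TASK[key]


print(shortlist("Classification"))
```

A shortlist like this only narrows the field; the final choice still comes from comparing performance on the actual data.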

### **IMPLEMENTING MODEL_SELECTION**

In [1]:
# IMPORTING THE DEPENDENCIES
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
heart_data = pd.read_csv(r"C:\Users\Subham Pathak\Desktop\ML\DATA_SETS\heart.csv")
heart_data.head(10)

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [3]:
heart_data.shape

(303, 14)

In [4]:
heart_data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [5]:
heart_data['target'].value_counts()

target
1    165
0    138
Name: count, dtype: int64
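The class counts give a useful reference point for the accuracy scores that follow: a model that always predicts the more frequent class already reaches a certain accuracy. A minimal sketch, with the counts copied by hand from the output above (`class_counts` and `baseline_accuracy` are illustrative names):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 165, 0: 138}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = round(proportions[majority_label] * 100, 2)

print(f"Total samples: {total}")  # 303
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy} %")  # 54.46 %
```

Any model worth keeping should score clearly above this baseline.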

In [6]:
# Features: every column except the label (columns= already implies axis=1)
x = heart_data.drop(columns='target')
# Label column
y = heart_data['target']
print(x)
print(y)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0     63    1   3       145   233    1        0      150      0      2.3      0   0     1
1     37    1   2       130   250    0        1      187      0      3.5      0   0     2
2     41    0   1       130   204    0        0      172      0      1.4      2   0     2
3     56    1   1       120   236    0        1      178      0      0.8      2   0     2
4     57    0   0       120   354    0        1      163      1      0.6      2   0     2
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...    ...  ..   ...
298   57    0   0       140   241    0        1      123      1      0.2      1   0     3
299   45    1   3       110   264    0        1      132      0      1.2      1   0     3
300   68    1   0       144   193    1        1      141      0      3.4      1   2     3
301   57    1   0       130   131    0        1      115      1      1.2      1   1     3
302   57  

In [7]:
x = np.asarray(x)
y = np.asarray(y)

MODEL SELECTION 

Comparing the models by 5-fold cross-validated accuracy, using default hyperparameters apart from the few set explicitly below.

In [8]:
# List of models to compare
models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='linear'),
    RandomForestClassifier(random_state=0),
    KNeighborsClassifier(),
]

In [9]:
def compare_models_crossval():
    # Score every model in the global `models` list with 5-fold cross-validation
    for model in models:
        cross_val = cross_val_score(model, x, y, cv=5)
        # Mean accuracy across the folds, as a percentage rounded to 2 decimals
        mean_acc = round(cross_val.mean() * 100, 2)
        print(f"The accuracy score of the {model} is : {mean_acc} %.")
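To make `cv = 5` concrete, here is a rough sketch of how 303 samples could be cut into 5 folds, with each fold used once as the test set. `kfold_indices` is a hypothetical helper written for illustration. It performs a plain, unshuffled, unstratified split. For classifiers, scikit-learn uses a stratified variant when `cv` is an integer, so the folds it actually builds will differ from these.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


fold_sizes = [len(test) for _, test in kfold_indices(303, n_splits=5)]
print(fold_sizes)  # [61, 61, 61, 60, 60]
```

Each sample lands in exactly one test fold, so every row of the data is used for evaluation once and for training four times.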

In [10]:
compare_models_crossval()

The accuracy score of the LogisticRegression(max_iter=1000) is : 82.83 %.
The accuracy score of the SVC(kernel='linear') is : 82.83 %.
The accuracy score of the RandomForestClassifier(random_state=0) is : 83.81 %.
The accuracy score of the KNeighborsClassifier() is : 64.39 %.
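A mean accuracy alone hides how much the score varies from fold to fold. One possible extension, sketched here on synthetic stand-in data rather than the heart dataset, reports the standard deviation next to the mean. `summarize_cv`, `X_demo`, and `y_demo` are illustrative names, and the numbers this prints say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this sketch; not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)


def summarize_cv(model, X, y, cv=5):
    """Return (mean, std) of the cross-validated accuracy, both in percent."""
    scores = cross_val_score(model, X, y, cv=cv)
    return round(scores.mean() * 100, 2), round(scores.std() * 100, 2)


for model in [LogisticRegression(max_iter=1000), KNeighborsClassifier()]:
    mean_acc, std_acc = summarize_cv(model, X_demo, y_demo)
    print(f"{model}: {mean_acc} % (+/- {std_acc})")
```

When two models differ by less than the fold-to-fold spread, as the top three scores above might, the ranking between them should be read with caution.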


In [11]:
# LET'S APPLY HYPERPARAMETER TUNING TO SEARCH FOR THE BEST PARAMETERS FOR EACH MODEL
# Creating a dictionary that contains the hyperparameter values to try.
# The entries are listed in the same order as the `models` list, because
# ModelSelect pairs each model with its parameter grid by position.
model_hyperparameter = {

    'logistic_hyperparamter' : {
        'C' : [1, 5, 10, 20]
    },

    'svc_hyperparamter' : {
        'kernel' : ['linear', 'poly', 'rbf', 'sigmoid'],
        'C' : [1, 5, 10, 20],
    },

    'random_forest_hyperparameter' : {
        'n_estimators' : [10, 20, 50, 100]
    },

    'KNN_hyperparameter' : {
        'n_neighbors' : [3, 5, 10]
    },

}

In [12]:
model_hyperparameter.keys()

dict_keys(['logistic_hyperparamter', 'svc_hyperparamter', 'random_forest_hyperparameter', 'KNN_hyperparameter'])

In [13]:
model_keys = list(model_hyperparameter.keys())
print(model_keys)

['logistic_hyperparamter', 'svc_hyperparamter', 'random_forest_hyperparameter', 'KNN_hyperparameter']


In [14]:
model_hyperparameter['logistic_hyperparamter']

{'C': [1, 5, 10, 20]}
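A grid search tries every combination of the listed values, so the size of a grid is the product of the lengths of its lists. The sketch below expands the `'svc_hyperparamter'` grid by hand with `itertools.product` to show what that means. `expand_grid` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

# Grid copied from the 'svc_hyperparamter' entry above.
svc_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [1, 5, 10, 20],
}


def expand_grid(grid):
    """Return one dict per combination of the values in `grid`."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = expand_grid(svc_grid)
print(len(combos))  # 16
print(combos[0])    # {'kernel': 'linear', 'C': 1}
```

Assuming 5-fold cross-validation, these 16 candidates would cost 80 model fits for the SVC alone, plus one final refit on the full data if `refit` is left at its default, so grids are best kept small.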

APPLYING GridSearchCV

In [15]:
def ModelSelect(list_of_models, hyperparamter_dictionary):
    result = []
    i=0
    for MODEL in list_of_models:
        key = model_keys[i]
        params = hyperparamter_dictionary[key]
        i+=1
        print(MODEL)
        print(params)