## Choosing the Best Model for an ML Problem
**Choosing the best model for your machine learning problem** depends on several factors, including the type of problem you are trying to solve, the complexity of the data, and the computing resources available. It is important to understand the strengths and weaknesses of each model and determine which one best fits the task. It is also important to consider the various hyperparameters of each model and determine which ones will produce the most accurate results. Finally, it is important to assess the performance of each model using a validation set or cross-validation.
* Here we mostly discuss model evaluation, that is, how to choose the training and test samples.

In [12]:
# To compare different models on a given problem, let's start with a simple example.
from sklearn import svm, datasets
iris = datasets.load_iris()

In [13]:
# Import pandas and create a DataFrame from the sklearn Iris dataset:
import pandas as pd
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


**We will try three different approaches for choosing train and test samples.**

<h2 style="color:blue"> Approach 1: </h2> Use the train_test_split method and tune parameters manually by trial and error.

In [18]:
# The traditional approach is the train_test_split method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

In [19]:
# Here the parameters are picked by hand (manual parameter tuning), so there is no way to know which values are best.
# The main issue with train_test_split is that the score depends on which samples land in the train and test sets.
# Re-running the split cell above two or three times gives a different score each time, so a single split
# is not a reliable basis for comparing parameters.
# Let's first try an SVM model:
svm_model = svm.SVC(kernel='rbf', C = 30, gamma = 'auto')
svm_model.fit(X_train, y_train)
svm_model.score(X_test, y_test)

1.0
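
The dependence of the score on the split can be made visible by repeating the hold-out evaluation with several fixed seeds. This is a minimal sketch: the seed values `0` to `4` and the name `split_scores` are assumptions chosen for illustration, not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Repeat the hold-out evaluation on five different random splits.
split_scores = []
for seed in range(5):  # arbitrary seeds, used only to make each split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed
    )
    model = svm.SVC(kernel='rbf', C=30, gamma='auto')
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

# One accuracy per split; the values need not agree with each other.
print(split_scores)
```

Any spread in the printed values comes only from which 30 samples ended up in the test set, since the model and its parameters are identical in every run.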

* To reduce this dependence on a single split, we use K-fold cross-validation, which scores the model on several different splits.

<h2 style="color:blue"> Approach 2: </h2> Using K-Fold Cross-Validation.

* Here we try different kernel and C settings, scoring each one with K-fold cross-validation:

In [20]:
# Import the K-fold cross-validation helper:
from sklearn.model_selection import cross_val_score

In [22]:
# Apply cross-validation to an SVM with kernel='linear' and C=10:
cross_val_score(svm.SVC(kernel = 'linear', C = 10, gamma = 'auto'), iris.data, iris.target, cv=5)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])
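
What `cross_val_score` does with `cv=5` can be sketched as an explicit loop. This sketch assumes that, for a classifier and an integer `cv`, scikit-learn uses unshuffled stratified folds, as its documentation describes; the names `manual_scores` and `helper_scores` are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = svm.SVC(kernel='linear', C=10, gamma='auto')

# Split the data into 5 stratified folds; each fold serves as the test set once.
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other 4 folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(model, X, y, cv=5)

print(manual_scores)
print(helper_scores)
```

Each of the five numbers is the accuracy on one held-out fold, which is why the result is an array rather than a single score.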

In [23]:
# Apply cross-validation to an SVM with kernel='rbf' and C=10:
cross_val_score(svm.SVC(kernel = 'rbf', C = 10, gamma = 'auto'), iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [24]:
# Apply cross-validation to an SVM with kernel='rbf' and C=20:
cross_val_score(svm.SVC(kernel='rbf',C=20,gamma='auto'),iris.data, iris.target, cv=5)

array([0.96666667, 1.        , 0.9       , 0.96666667, 1.        ])

* From the three results above we can average the fold scores and pick the best parameters (parameter tuning).

The main problem with this approach is that there are many parameter values to try, and changing them by hand and averaging the scores each time quickly becomes tedious.

In [26]:
# A for loop over the candidate values runs the same cross-validation for every combination and finds the best parameters quickly:
import numpy as np
kernels = ['rbf', 'linear']
C = [1, 10, 20]
avg_scores = {}
for kval in kernels:
    for cval in C:
        cv_scores = cross_val_score(svm.SVC(kernel = kval, C = cval, gamma = 'auto'), iris.data, iris.target, cv = 5)
        avg_scores[kval + '_' + str(cval)] = np.average(cv_scores)
avg_scores

{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}

* From the results above, rbf with C=1 or C=10, or linear with C=1, gives the best performance.

- The loop approach has its own issue: with four parameters instead of two (kernel and C), we would need four nested loops, which means more code and many more iterations.
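
The nesting itself, though not the number of fits, can be avoided by iterating over all combinations at once. A minimal sketch using scikit-learn's `ParameterGrid` helper; the key format and the name `best_key` are assumptions made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import ParameterGrid, cross_val_score

iris = datasets.load_iris()

# ParameterGrid yields one dict per combination, so one loop replaces the nested loops.
param_grid = ParameterGrid({'kernel': ['rbf', 'linear'], 'C': [1, 10, 20]})

avg_scores = {}
for params in param_grid:
    model = svm.SVC(gamma='auto', **params)
    cv_scores = cross_val_score(model, iris.data, iris.target, cv=5)
    avg_scores[f"{params['kernel']}_{params['C']}"] = cv_scores.mean()

# Pick the combination with the highest mean accuracy (ties resolve to the first one seen).
best_key = max(avg_scores, key=avg_scores.get)
print(best_key, avg_scores[best_key])
```

Adding a third or fourth parameter only means adding another entry to the dictionary; the loop body stays the same.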

**To avoid these issues, sklearn provides an API called GridSearchCV, which does exactly the same thing as the loop in [26].**

In [27]:
# First, import GridSearchCV:
from sklearn.model_selection import GridSearchCV

In [28]:
# Then we define the classifier: the first argument is the model and the second is the parameter grid:
clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],
    'kernel': ['rbf','linear']
}, cv=5, return_train_score=False)
clf.fit(iris.data, iris.target)
clf.cv_results_

{'mean_fit_time': array([0.00200138, 0.00099974, 0.00080228, 0.00097823, 0.00120091,
        0.00040059]),
 'std_fit_time': array([0.00063333, 0.00062683, 0.00040117, 0.00059905, 0.00040209,
        0.00049062]),
 'mean_score_time': array([0.00160179, 0.00080171, 0.00060096, 0.00039997, 0.00019994,
        0.00060415]),
 'std_score_time': array([0.00049098, 0.00075041, 0.00049068, 0.00048986, 0.00039988,
        0.00049334]),
 'param_C': masked_array(data=[1, 1, 10, 10, 20, 20],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'linear', 'rbf', 'linear', 'rbf', 'linear'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 20, 'kernel': 'rbf'},
  {'C': 20, 'kernel': 'linear'}],


* The raw output is difficult to read, so we load it into a DataFrame.

In [29]:
# Load the results into a DataFrame:
df = pd.DataFrame(clf.cv_results_)
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002001,0.000633,0.001602,0.000491,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.001,0.000627,0.000802,0.00075,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.000802,0.000401,0.000601,0.000491,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
3,0.000978,0.000599,0.0004,0.00049,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,4
4,0.001201,0.000402,0.0002,0.0004,20,rbf,"{'C': 20, 'kernel': 'rbf'}",0.966667,1.0,0.9,0.966667,1.0,0.966667,0.036515,5
5,0.000401,0.000491,0.000604,0.000493,20,linear,"{'C': 20, 'kernel': 'linear'}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,6


* The table above is much easier to read: it shows the C and kernel parameters and the score for each split (we ran 5-fold validation, so there are columns split0 to split4), along with mean_test_score and rank_test_score.

In [31]:
# Some columns in the DataFrame above are not needed, so we select a subset for a quick view:
df[['param_C','param_kernel','mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667


* Here we can clearly see each parameter combination against its score, which tells us which parameters produce a better result.
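
When several combinations score similarly, it can help to sort by rank and look at the spread across folds as well. A small sketch under that assumption; the variable names `search` and `summary` are illustrative and not from the original notebook.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
search = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(iris.data, iris.target)

# Keep the parameter columns plus mean, standard deviation and rank, best rank first.
columns = ['param_C', 'param_kernel', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

A low `std_test_score` alongside a high mean suggests the combination performs consistently across folds rather than doing well on only some of them.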

In [35]:
# Let's call dir() on the classifier to see which other attributes this object has:
dir(clf)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_estimator_type',
 '_format_results',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_pairwise',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run_search',
 '_select_best_index',
 '_validate_data',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'fit',
 'get_params',
 'inverse_transform',
 'multimetric_',
 'n_features_

In [36]:
# Let's try the 'best_score_' property:
clf.best_score_

0.9800000000000001

In [37]:
# To see best parameters:
clf.best_params_

{'C': 1, 'kernel': 'rbf'}
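
Besides `best_score_` and `best_params_`, the `dir()` listing above also shows `best_estimator_`. With the default `refit=True`, this is the model refitted on all the data using the best parameters, so it can be used for prediction directly. A minimal sketch; taking the first three rows as `sample` is an arbitrary choice for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
search = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(iris.data, iris.target)

# best_estimator_ is an ordinary fitted SVC configured with best_params_.
best_model = search.best_estimator_
sample = iris.data[:3]  # three rows picked only for demonstration
predicted = best_model.predict(sample)

print(search.best_params_)
print(iris.target_names[predicted])
```

Calling `search.predict(sample)` gives the same result, because the search object delegates prediction to `best_estimator_`.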

* One issue with GridSearchCV is computation cost. Our dataset here is very small, but with millions of data points the search takes much longer.
* GridSearchCV tries every combination of the values listed for each parameter (here C and kernel).

* To cope with this many combinations, sklearn offers another class called RandomizedSearchCV. It does not try every combination; instead it samples combinations at random, and you choose how many to try with n_iter.
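
The cost of an exhaustive search can be estimated before running it: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below counts this for the small grid used above and for the larger SVM grid from the exercise at the end of this notebook; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

n_folds = 5

# Grid used so far: 3 values of C x 2 kernels.
small_grid = {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}
n_small = len(ParameterGrid(small_grid))
print(n_small, n_small * n_folds)    # 6 candidates, 30 fits

# SVM grid from the exercise: 5 values of C x 4 kernels x 6 degrees.
larger_grid = {
    'C': [1, 5, 10, 15, 20],
    'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
    'degree': [1, 2, 3, 4, 5, 6],
}
n_larger = len(ParameterGrid(larger_grid))
print(n_larger, n_larger * n_folds)  # 120 candidates, 600 fits
```

With the default `refit=True`, GridSearchCV performs one additional fit on the full dataset after the search finishes.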

In [38]:
# Let's see how RandomizedSearchCV works.
# The API is very similar to GridSearchCV.
# The interesting parameter here is n_iter=2: only two combinations are tried, whereas the grid search in [31] tried all six.
# After calling fit, we load the results into a DataFrame.
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(svm.SVC(gamma='auto'), {
        'C': [1,10,20],
        'kernel': ['rbf','linear']
    }, 
    cv=5, 
    return_train_score=False, 
    n_iter=2
)
rs.fit(iris.data, iris.target)
pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']]

Unnamed: 0,param_C,param_kernel,mean_test_score
0,10,linear,0.973333
1,20,linear,0.966667


* We see that only two combinations were evaluated. Each time you run it again, a different pair of parameter values may be sampled.
* This works well in practice.
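
Two options can make a randomized search more useful: fixing `random_state` so the same candidates are sampled on every run, and drawing a continuous parameter such as `C` from a distribution instead of a short list. A minimal sketch, assuming SciPy's `loguniform` distribution over the range 0.1 to 100; the range, `n_iter=4` and the helper `run_search` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

def run_search(seed):
    """Run a small randomized search whose sampling is controlled by `seed`."""
    search = RandomizedSearchCV(
        svm.SVC(gamma='auto'),
        {
            'C': loguniform(0.1, 100),    # sampled on a log scale instead of from a fixed list
            'kernel': ['rbf', 'linear'],  # lists are still sampled uniformly
        },
        n_iter=4,
        cv=5,
        random_state=seed,
    )
    search.fit(iris.data, iris.target)
    return search

first = run_search(0)
second = run_search(0)  # same seed, so the same four candidates are drawn

print(pd.DataFrame(first.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']])
```

Without `random_state`, each run draws new candidates, which is why the table above changes between executions.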

### So far we have been talking about hyperparameter tuning. Now let's talk about how to choose the best model.

In [40]:
# For the sklearn Iris dataset, let's try three algorithms: 1) random forest classifier, 2) logistic regression, 3) SVM.
# We want to figure out which one gives the best performance.
# The models and their parameter grids are defined in a JSON-like structure using a plain Python dictionary.
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}

In [41]:
# A simple for loop iterates over the dictionary and runs GridSearchCV for each entry.
# The first argument of GridSearchCV is the model and the second is its parameter grid.
# After each search is fitted, its best score and best parameters are appended to the scores list.
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })


In [42]:
# Convert the results into a pandas DataFrame:
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

Unnamed: 0,model,best_score,best_params
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.953333,{'n_estimators': 5}
2,logistic_regression,0.966667,{'C': 5}


* Based on the results above, we can conclude that SVM with C=1 and kernel='rbf' is the best model for the iris flower classification problem.

* <h3 style = "color:green">So from this DataFrame we can conclude which classifier, with which parameters, performs best on the sklearn Iris dataset.</h3>
* Here we had three classifiers, but we could have 10 or 20 and decide which one does better for a specific dataset.
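
Because `best_score_` is computed on the same data that was used to pick the parameters, it can be slightly optimistic. One common check is to set a test set aside before the search and score the selected model on it afterwards. This is a sketch of that idea and not part of the original workflow; the 80/20 split, `stratify` and `random_state=0` are assumptions made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()

# Hold out 20% of the data; the search never sees these samples.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

search = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(X_train, y_train)  # cross-validation happens inside the training portion only

# score() evaluates the refitted best model on the untouched test set.
test_score = search.score(X_test, y_test)
print(search.best_params_, search.best_score_, test_score)
```

The same pattern applies inside the model-comparison loop: fit each GridSearchCV on the training portion and compare the candidates on the held-out portion.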

### Exercise
For the digits dataset in sklearn.datasets, try the following classifiers and find the one that gives the best performance. Also find the optimal parameters for that classifier.

In [5]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Let's first import the dataset:
from sklearn.datasets import load_digits
digits = load_digits()

In [3]:
# Let's again see what we have in the dataset:
dir(digits)

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

In [9]:
# Instantiate a model so its available parameters can be inspected:
model = LogisticRegression()

In [16]:
# Now let's define the models and their parameter grids:
model_parameters = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1, 5, 10, 15, 20],
            'kernel': ['rbf','linear', 'poly', 'sigmoid'],
            'degree': [1, 2, 3, 4, 5, 6]     
        }  
    },
    'naive_bayes_gaussian': {
        'model': GaussianNB(),
        'params': {}
    },
    'naive_bayes_multinomial': {
        'model': MultinomialNB(),
        'params': {}
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini','entropy'],
            'splitter' : ["best", "random"],
            'max_features' : ["auto", "sqrt", "log2"]
            
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,3, 5, 8, 12, 15],
            'criterion' : ["gini", "entropy"]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10],
            'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
            'multi_class' : ['auto', 'ovr', 'multinomial']
        }
    }
}

In [17]:
# Call for the GridSearchCV and define the classifier:
from sklearn.model_selection import GridSearchCV

scores = []

for model_name, mp in model_parameters.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(digits.data, digits.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

120 fits failed out of a total of 180.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Habib\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Habib\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\Habib\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 457, in _check_solver
    raise ValueError(
ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.

-------------------------
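
The failures reported above come from grid combinations the estimator cannot accept, such as the elasticnet penalty with the liblinear solver. One way to avoid invalid or redundant combinations is to pass GridSearchCV a list of dictionaries, so that each sub-grid contains only values that belong together. The sketch below shows the pattern for the SVM, where `degree` only matters for the `poly` kernel. The reduced value lists and `cv=3` are assumptions chosen to keep the run short.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, ParameterGrid

digits = load_digits()

# Each dict is searched separately, so degree is varied only for the poly kernel.
param_grid = [
    {'kernel': ['rbf', 'linear'], 'C': [1, 10]},
    {'kernel': ['poly'], 'degree': [2, 3], 'C': [1, 10]},
]
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 8 candidates: 4 from the first sub-grid, 4 from the second

# error_score='raise' stops at the first failing fit instead of recording nan.
search = GridSearchCV(svm.SVC(gamma='auto'), param_grid, cv=3, error_score='raise')
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)
```

The same structure could be used for the logistic-regression entry by grouping each solver with the penalties it supports.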

In [18]:
# Call to pandas and print the result in a DataFrame:
import pandas as pd

df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

Unnamed: 0,model,best_score,best_params
0,svm,0.968842,"{'C': 1, 'degree': 3, 'kernel': 'poly'}"
1,naive_bayes_gaussian,0.806928,{}
2,naive_bayes_multinomial,0.87035,{}
3,decision_tree,0.755758,"{'criterion': 'entropy', 'max_features': 'sqrt..."
4,random_forest,0.915446,"{'criterion': 'entropy', 'n_estimators': 15}"
5,logistic_regression,0.92879,"{'C': 1, 'multi_class': 'ovr', 'penalty': 'l1'}"


That was all about parameter tuning and choosing the best model for your ML problem.