# **`Best Model Selection` with Best Hyperparameters**

### **1. Import Required Libraries:**

In [4]:
#1 Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#2 Import to Train Test Split the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

#3 Import for Regression Algorithms/Models 
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

# Import Metrics for Regression 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#4 Import Grid Search CV for Cross-Validation
from sklearn.model_selection import GridSearchCV

#5 Import Preprocessors
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

### **2. Load the `Tips` Dataset:**

In [5]:
#6 Load the Dataset
df = sns.load_dataset('tips')

In [6]:
df.head()

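|  | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |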


In [7]:
df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

## **Regression Tasks:**

### **3. Separate the Features (X) and Targets (y):**

In [8]:
# select the features (X) and the target (y)
X = df.drop('tip', axis=1)
y = df['tip']

### **4. Encode Categorical Features using `Label Encoder`:**

In [9]:
# label encode categorical variables
le = LabelEncoder()
X['sex'] = le.fit_transform(X['sex'])
X['smoker'] = le.fit_transform(X['smoker'])
X['day'] = le.fit_transform(X['day'])
X['time'] = le.fit_transform(X['time'])
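
`LabelEncoder` numbers the labels in sorted order, so a column such as `day` ends up with codes that imply an ordering the days do not really have. Because the cell above reuses one `le` object, only the classes of the last column (`time`) remain available on it afterwards. The following is a minimal sketch, on a small hand-made frame rather than the real Tips rows (an assumption for illustration), of how to inspect the mapping and how one-hot encoding with `pd.get_dummies` could be used instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative sample (assumption: not the real Tips rows)
sample = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sun"]})

# LabelEncoder sorts the unique labels and numbers them from 0
le_day = LabelEncoder()
codes = le_day.fit_transform(sample["day"])
mapping = {label: code for code, label in enumerate(le_day.classes_)}
print(mapping)          # {'Fri': 0, 'Sat': 1, 'Sun': 2, 'Thur': 3}
print(codes.tolist())   # [2, 1, 3, 0, 2]

# One-hot encoding creates one column per label and implies no order
one_hot = pd.get_dummies(sample["day"], prefix="day")
print(one_hot.columns.tolist())
```

Keeping a separate encoder per column (for example in a dictionary) makes it possible to call `inverse_transform` later for any of the encoded columns.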

### **5. Split the Dataset into Training and Testing Sets:**

In [10]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

CPU times: total: 0 ns
Wall time: 21 ms


## **6. Testing Different Models:**

### **6.1 Build, Train, Predict and Evaluate the Models using `Mean Absolute Error (MAE)`:**

In [11]:
#1 Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

#2 Loop over the models: train each one, predict on the test set, and record its metric

model_scores = []
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)

    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = mean_absolute_error(y_test, y_pred)
    model_scores.append((name, metric))
    
    # # print the performing metric
    # print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    # print(name, 'R2: ', r2_score(y_test, y_pred))
    # print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    # print('\n')
# sort the models by MAE in descending order (worst first) and print each score
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print('Mean Absolute error for', f"{model[0]} is {model[1]: .2f}") 

Mean Absolute error for DecisionTreeRegressor is  0.93
Mean Absolute error for RandomForestRegressor is  0.77
Mean Absolute error for KNeighborsRegressor is  0.73
Mean Absolute error for GradientBoostingRegressor is  0.72
Mean Absolute error for XGBRegressor is  0.67
Mean Absolute error for LinearRegression is  0.67
Mean Absolute error for SVR is  0.57


The code above compares several models for a `regression task`. It stores the models in a `dictionary`, fits each one on the training data, and makes predictions on the test data. Each model is scored with the `Mean Absolute Error (MAE)` metric, where a lower value is better. The scores are then sorted in descending order, so the worst model is printed first and the best model last.

**Observations from the Output:**
| Model | Mean Absolute Error | Remarks |
|--------|---------------------|---------|
| `DecisionTreeRegressor` | `0.93` | The model has the `highest error`, indicating it performed the `worst` among all models. |
| RandomForestRegressor | 0.77 | The model performed better than the DecisionTreeRegressor but still has a relatively high error. |
| KNeighborsRegressor | 0.73 | The model performed better than the RandomForestRegressor. |
| GradientBoostingRegressor | 0.72 | The model performed slightly better than the KNeighborsRegressor. |
| XGBRegressor | 0.67 | The model performed better than the GradientBoostingRegressor, indicating it has a relatively lower error. |
| LinearRegression | 0.67 | The model matches the XGBRegressor to two decimal places. |
| `SVR` | `0.57` | The model has the `lowest error`, indicating it performed the `best` among all models. |

In summary, the `Support Vector Regressor` (SVR) model `performed the best` on the given dataset, as it has the lowest Mean Absolute Error. The `Decision Tree Regressor performed the worst`. These results are specific to this dataset and this train/test split; a different dataset or split might lead to different results. The tree-based models are also created without a `random_state`, so their scores can change slightly from one run to the next. Hyperparameter tuning could potentially improve the performance of these models.
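
To make the metric concrete, here is a minimal sketch that computes MAE by hand on three toy values (chosen for easy arithmetic, not taken from the Tips data) and checks it against `mean_absolute_error`. It also shows that sorting an error metric in ascending order puts the best model first; the three scores in the list are copied from the output above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values chosen for easy arithmetic (assumption: not from the Tips data)
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

# MAE is the mean of the absolute differences: (0.5 + 0.0 + 2.0) / 3
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(round(float(mae_manual), 4), round(float(mae_sklearn), 4))

# For an error metric, ascending order puts the best (lowest) score first
scores = [("DecisionTreeRegressor", 0.93), ("SVR", 0.57), ("LinearRegression", 0.67)]
best_first = sorted(scores, key=lambda item: item[1])
print(best_first[0])  # ('SVR', 0.57)
```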

### **6.2 Build, Train, Predict and Evaluate the Models using `R_Squared Score`:**

In [12]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

# Loop over the models: train each one, predict on the test set, and record its metric

model_scores = []
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = r2_score(y_test, y_pred)
    model_scores.append((name, metric))
    
    # # print the performing metric
    # print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    # print(name, 'R2: ', r2_score(y_test, y_pred))
    # print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    # print('\n')
# sort the models by R-squared in descending order (best first) and print each score
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print('R_squared Score', f"{model[0]} is {model[1]: .2f}") 

R_squared Score SVR is  0.57
R_squared Score LinearRegression is  0.44
R_squared Score XGBRegressor is  0.41
R_squared Score GradientBoostingRegressor is  0.35
R_squared Score KNeighborsRegressor is  0.33
R_squared Score RandomForestRegressor is  0.23
R_squared Score DecisionTreeRegressor is -0.05
CPU times: total: 1.5 s
Wall time: 1.33 s


The code above repeats the comparison with a different metric. It stores the models in a dictionary, fits each one on the training data, and makes predictions on the test data. Each model is scored with the `R-squared score`, where a higher value is better. The scores are then sorted in descending order, so the best model is printed first.

**Observations from the Output**
| Model | R-squared Score | Remarks |
|--------|-----------------|---------|
| `SVR` | `0.57` | The model has the highest R-squared score, indicating it performed the best among all models. |
| LinearRegression | 0.44 | The model performed better than most models but has a lower score than SVR. |
| XGBRegressor | 0.41 | The model performed better than most models but has a lower score than LinearRegression. |
| GradientBoostingRegressor | 0.35 | The model performed better than RandomForestRegressor and DecisionTreeRegressor but has a lower score than XGBRegressor. |
| KNeighborsRegressor | 0.33 | The model performed better than RandomForestRegressor and DecisionTreeRegressor but has a lower score than GradientBoostingRegressor. |
| RandomForestRegressor | 0.23 | The model performed better than DecisionTreeRegressor but has a lower score than KNeighborsRegressor. |
| `DecisionTreeRegressor` | `-0.05` | The model has the lowest R-squared score, indicating it performed the worst among all models. A negative score means it did worse than always predicting the mean tip. |

In summary, the Support Vector Regressor (SVR) model performed the best on the given dataset, as it has the highest R-squared score. The Decision Tree Regressor performed the worst. These results are specific to this dataset and this train/test split; a different dataset or split might lead to different results. Hyperparameter tuning could potentially improve the performance of these models. The total CPU time reported for this cell was 1.5 seconds.
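
The negative score for the Decision Tree may look surprising. R-squared is defined as `1 - SS_res / SS_tot`, so it is 0 for a model that always predicts the mean of `y` and drops below 0 for anything worse than that. The sketch below works this out on three toy values (an assumption for illustration) and compares the result with `r2_score`; `r2_manual` is a helper written for this example only.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

def r2_manual(y_true, y_pred):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

mean_pred = np.full(3, y_true.mean())   # always predict the mean of y
bad_pred = np.array([3.0, 2.0, 1.0])    # predictions in reversed order

r2_mean = r2_manual(y_true, mean_pred)  # SS_res = 2, SS_tot = 2
r2_bad = r2_manual(y_true, bad_pred)    # SS_res = 8, SS_tot = 2
print(r2_mean, r2_bad)                  # 0.0 -3.0
print(r2_score(y_true, mean_pred), r2_score(y_true, bad_pred))
```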

### **6.3 Build, Train, Predict and Evaluate the Models using `Mean Squared Error`:**

In [13]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

# Loop over the models: train each one, predict on the test set, and record its metric

model_scores = []
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = mean_squared_error(y_test, y_pred)
    model_scores.append((name, metric))
    
    # # print the performing metric
    # print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    # print(name, 'R2: ', r2_score(y_test, y_pred))
    # print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    # print('\n')
# sort the models by MSE in ascending order (best first) and print each score
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=False)
for model in sorted_models:
    print('Mean Squared error for', f"{model[0]} is {model[1]: .2f}") 

Mean Squared error for SVR is  0.54
Mean Squared error for LinearRegression is  0.69
Mean Squared error for XGBRegressor is  0.74
Mean Squared error for GradientBoostingRegressor is  0.80
Mean Squared error for KNeighborsRegressor is  0.84
Mean Squared error for RandomForestRegressor is  0.92
Mean Squared error for DecisionTreeRegressor is  1.11
CPU times: total: 1.86 s
Wall time: 2.56 s


The code above repeats the comparison once more. It stores the models in a dictionary, fits each one on the training data, and makes predictions on the test data. Each model is scored with the `Mean Squared Error (MSE)` metric, where a lower value is better. The scores are then sorted in ascending order, so the best model is printed first.

**Observations from the Output**
| Model | Mean Squared Error | Remarks |
|--------|---------------------|---------|
| `SVR` | `0.54` | The model has the `lowest error`, indicating it performed the `best` among all models. |
| LinearRegression | 0.69 | The model performed better than most models but has a higher error than SVR. |
| XGBRegressor | 0.74 | The model performed better than most models but has a higher error than LinearRegression. |
| GradientBoostingRegressor | 0.80 | The model performed better than RandomForestRegressor and DecisionTreeRegressor but has a higher error than XGBRegressor. |
| KNeighborsRegressor | 0.84 | The model performed better than RandomForestRegressor and DecisionTreeRegressor but has a higher error than GradientBoostingRegressor. |
| RandomForestRegressor | 0.92 | The model performed better than DecisionTreeRegressor but has a higher error than KNeighborsRegressor. |
| `DecisionTreeRegressor` | `1.11` | The model has the `highest error`, indicating it performed the `worst` among all models. |

**In summary**, the `Support Vector Regressor` (SVR) model `performed the best` on the given dataset, as it has the lowest Mean Squared Error. The `Decision Tree Regressor performed the worst`. These results are specific to this dataset and this train/test split; a different dataset or split might lead to different results. Hyperparameter tuning could potentially improve the performance of these models. The total CPU time reported for this cell was 1.86 seconds.
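
Sections 6.1 to 6.3 refit every model once per metric. A possible alternative, sketched below, fits each model once and collects all three metrics in a single table. To keep the sketch runnable without downloads it uses synthetic data from `make_regression` and only three scikit-learn models; both choices, and names such as `demo_models` and `results`, are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption) so the sketch runs without downloads
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "KNeighborsRegressor": KNeighborsRegressor(),
}

rows = []
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)           # fit once
    y_hat = model.predict(X_te)     # predict once
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_te, y_hat),
        "MSE": mean_squared_error(y_te, y_hat),
        "R2": r2_score(y_te, y_hat),
    })

# One table with every metric, best (lowest) MAE first
results = pd.DataFrame(rows).sort_values("MAE").reset_index(drop=True)
print(results.round(2))
```

On a fixed test set, MSE and R-squared always rank the models identically, because R-squared is `1 - MSE / variance of y_test`. MAE can rank them differently, since it penalizes large errors less heavily.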

## **Assignment:** Find the best model for each of the metrics above using the Diamonds dataset

In [16]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

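|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|-------|-----|-------|---------|-------|-------|-------|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |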


---

## **Hyperparameter tuning:**

In [17]:
%%time
# Create a dictionary that maps each model name to a (model, hyperparameter grid) pair
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid']}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10]}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2)}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'n_estimators': [10, 100]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100]}),          
          }

# Loop over the models: tune each one with grid search, then evaluate it on the test set

for name, (model, params) in models.items():
    # wrap the model in a grid search with 5-fold cross-validation
    pipeline = GridSearchCV(model, params, cv=5)
    
    # fit the grid search (the best parameter combination is refit on all training data)
    pipeline.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = pipeline.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')

LinearRegression MSE:  0.6948129686287711
LinearRegression R2:  0.4441368826121931
LinearRegression MAE:  0.6703807496461158


SVR MSE:  1.460718141299992
SVR R2:  -0.1686013018011976
SVR MAE:  0.8935334948775431


DecisionTreeRegressor MSE:  0.8774153020453994
DecisionTreeRegressor R2:  0.2980516670532909
DecisionTreeRegressor MAE:  0.718948162948163


RandomForestRegressor MSE:  0.9318394751020421
RandomForestRegressor R2:  0.2545113304988039
RandomForestRegressor MAE:  0.7715061224489798


KNeighborsRegressor MSE:  0.6640950568462677
KNeighborsRegressor R2:  0.4687117753876745
KNeighborsRegressor MAE:  0.6203721488595437


GradientBoostingRegressor MSE:  0.8106801524004932
GradientBoostingRegressor R2:  0.35144101065487676
GradientBoostingRegressor MAE:  0.7657809818712309


XGBRegressor MSE:  0.6624107100882575
XGBRegressor R2:  0.4700592836840687
XGBRegressor MAE:  0.6549163442728472


CPU times: total: 11.7 s
Wall time: 13.7 s
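
In the output above, the tuned SVR scores worse on the test set (R2 of about -0.17) than the default SVR did earlier (0.57). One plausible reason is that `GridSearchCV` chooses parameters by their mean cross-validation score on the training folds, using the estimator's default score (R-squared for regressors) unless `scoring` is set, and the winner of that comparison is not guaranteed to do best on the held-out test set. The sketch below shows how the per-candidate scores could be inspected through `cv_results_`; the synthetic data, the small KNN grid, and the choice of `neg_mean_absolute_error` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption) so the sketch runs without downloads
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

param_grid = {"n_neighbors": [3, 5, 7, 9]}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
)
search.fit(X_tr, y_tr)

# One row per candidate: its mean score across the 5 folds and its rank
cv_table = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "rank_test_score"]
]
print(cv_table.sort_values("rank_test_score"))
print("Best params:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
print("Test-set MAE:", -search.score(X_te, y_te))
```

Comparing the last two lines shows how far the cross-validation estimate is from the test-set result for the selected candidate.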


In [None]:
# Create a dictionary that maps each model name to a (model, hyperparameter grid) pair
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'epsilon': [0.1, 0.01, 0.001]}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10], 'splitter': ['best', 'random']}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2), 'weights': ['uniform', 'distance']}),
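          # 'ls' and 'lad' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0 and removed in 1.2
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'], 'n_estimators': [10, 100, 1000]}),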
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100, 1000], 'learning_rate': [0.1, 0.01, 0.001]}),          
          }

# Loop over the models: tune each one with grid search, then evaluate it on the test set

for name, (model, params) in models.items():
    # wrap the model in a grid search with 5-fold cross-validation
    pipeline = GridSearchCV(model, params, cv=5)
    
    # fit the grid search (the best parameter combination is refit on all training data)
    pipeline.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = pipeline.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')

# Assignment: Extend the for loop above to print the best parameters of each model. How would you then pick the best model out of all of them?

## Solution

In [None]:
# Write your Code here

##############################################################################################################

In [None]:
# How to get best parameters of each model using GridSearchCV, write the for loop to iterate over the models
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
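
The cell above stops after its imports. One possible way to finish the idea is sketched here: after fitting, a `GridSearchCV` object exposes `best_params_`, `best_score_` (the mean cross-validation score of the winning combination) and `best_estimator_`. The loop stores these for every model and then selects the model with the highest cross-validation score. The synthetic data, the reduced grids, and the names `demo_models` and `summary` are assumptions for illustration, not the course solution.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption) so the sketch runs without downloads
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Reduced grids (assumption) to keep the run short
demo_models = {
    "LinearRegression": (LinearRegression(), {}),
    "SVR": (SVR(), {"kernel": ["rbf", "linear"], "C": [1, 10]}),
    "DecisionTreeRegressor": (DecisionTreeRegressor(random_state=1), {"max_depth": [None, 5, 10]}),
}

summary = {}
for name, (model, params) in demo_models.items():
    search = GridSearchCV(model, params, cv=5)  # default scoring for regressors is R-squared
    search.fit(X_tr, y_tr)
    summary[name] = {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        "estimator": search.best_estimator_,
    }
    print(name, search.best_params_, round(float(search.best_score_), 3))

# Pick the model with the highest mean cross-validation score
best_name = max(summary, key=lambda n: summary[n]["cv_score"])
best_model = summary[best_name]["estimator"]
print("Selected:", best_name, "| test R2:", round(float(best_model.score(X_te, y_te)), 3))
```

Selecting on the cross-validation score rather than on the test score keeps the test set as an unbiased final check.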

---


# **Add preprocessor inside the pipeline**

## Assignment: Find the errors

In [None]:
# make a preprocessor

preprocessor = ColumnTransformer(
    transformers=['numeric_scaling', StandardScaler(), ['total_bill', 'size']], remainder='passthrough')


# Create a dictionary that maps each model name to a (model, hyperparameter grid) pair
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'epsilon': [0.1, 0.01, 0.001]}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10], 'splitter': ['best', 'random']}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2), 'weights': ['uniform', 'distance']}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'loss': ['ls', 'lad', 'huber', 'quantile'], 'n_estimators': [10, 100, 1000]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100, 1000], 'learning_rate': [0.1, 0.01, 0.001]}),          
          }

# Loop over the models: tune each one with grid search, then evaluate it on the test set

for name, (model, params) in models.items():
    # create a pipeline with the preprocessor
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])   
    
    # make a grid search cv to tune the hyperparameter
    grid_search = GridSearchCV(pipeline, params, cv=5)
    
    
    # fit the pipeline
    grid_search.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = grid_search.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')
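
For checking your answer after attempting the exercise, here is a minimal sketch of a preprocessor inside a tuned pipeline. It relies on two scikit-learn conventions: `ColumnTransformer` takes a list of `(name, transformer, columns)` tuples, and grid keys for a `Pipeline` step use the form `<step name>__<parameter>`. The small random frame, which only mimics a few Tips column names, and the reduced SVR grid are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Random stand-in data that mimics a few Tips columns (assumption)
rng = np.random.default_rng(0)
n = 60
X_demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "size": rng.integers(1, 7, n),
    "sex": rng.integers(0, 2, n),  # stands in for a label-encoded column
})
y_demo = 0.15 * X_demo["total_bill"] + rng.normal(0, 0.5, n)

# transformers is a list of (name, transformer, columns) tuples
preprocessor = ColumnTransformer(
    transformers=[("numeric_scaling", StandardScaler(), ["total_bill", "size"])],
    remainder="passthrough",
)
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVR())])

# Keys are prefixed with the step name so the grid reaches the SVR inside the pipeline
param_grid = {"model__kernel": ["rbf", "linear"], "model__C": [0.1, 1, 10]}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_demo, y_demo)
print(grid_search.best_params_)
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, which avoids leaking information from the validation part.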

# Classifiers:

In [27]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# don't show warnings
import warnings
warnings.filterwarnings('ignore')

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a dictionary of classifiers to evaluate
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

# Perform k-fold cross-validation and calculate the mean accuracy
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X, y, cv=kfold)
    accuracy = np.mean(scores)
    print("Classifier:", name)
    print("Mean Accuracy:", accuracy)
    print()

Classifier: Logistic Regression
Mean Accuracy: 0.9733333333333334

Classifier: Decision Tree
Mean Accuracy: 0.9533333333333335

Classifier: Random Forest
Mean Accuracy: 0.9600000000000002

Classifier: SVM
Mean Accuracy: 0.9666666666666668

Classifier: KNN
Mean Accuracy: 0.9733333333333334
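
The mean accuracies above are close together, so the spread across folds is worth reporting as well. The sketch below, for one classifier only, reports mean and standard deviation, uses `StratifiedKFold` so every fold keeps the class proportions, and scales the features inside a pipeline, which may help `LogisticRegression` converge without the warnings having to be silenced. The choice of classifier and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling happens inside the pipeline, so it is refit on each training fold
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the share of each iris species in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_iris, y_iris, cv=skf)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```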



# **Main Assignment:**

## Write the complete code to select the best regressor and classifier for the diamonds dataset `(if you have a high-end machine, you can use the whole dataset; otherwise use the sample dataset provided in the link)`. Alternatively, you can use the Tips dataset for the regression task and the Iris dataset for the classification task.

## Try all suitable models with a range of hyperparameters, compare them with each other, and select the best model for the given dataset.

## Your code should be complete and clearly explained. Comment each and every step so that a layman can follow it.

## Your code should also save the best model in a pickle file.

## In the last snippet of the code, load the pickle file and use the loaded model for prediction.

## Submit your assignment to the discord inbox. (Do not share the link of your notebook, just upload the notebook in the discord inbox). Do not share the notebook in public channels on our discord server.
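
The save-and-load requirement above can be sketched with the standard `pickle` module. This is a minimal example under stated assumptions: a tiny `LinearRegression` stands in for whatever best model your comparison selects, and the file name `best_model.pkl` in the system temporary directory is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the selected best model (assumption): y = 2 * x
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
best_model = LinearRegression().fit(X_demo, y_demo)

# Hypothetical file location for the saved model
model_path = os.path.join(tempfile.gettempdir(), "best_model.pkl")

# Save the fitted model to disk
with open(model_path, "wb") as f:
    pickle.dump(best_model, f)

# Load it back and use it for prediction
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

prediction = loaded_model.predict(np.array([[5.0]]))
print(prediction)
```

Only unpickle files you trust, and load the model with the same scikit-learn version that saved it where possible.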


# **Deadline for Submission:**

## `29th December before 09:30 pm Pakistan time. (No late submission will be accepted).`
