### Tutorial 8 - Tips and Tricks

Packages to install: 

* `scikit-learn`
* `imbalanced-learn`

### Question 1: Hyperparameter Tuning
Hyperparameter tuning is an essential step in the machine learning pipeline to optimize model performance. One common method is `GridSearchCV` from the `sklearn.model_selection` module, which evaluates every combination in a parameter grid using cross-validation and keeps the best-scoring one.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

1. Load the Iris dataset from sklearn.

2. Split the dataset into training and testing sets using `train_test_split`.

3. Define the SVC model.

4. Define the parameter grid to search over. In this example, we tune the `C`, `gamma`, and `kernel` parameters of the SVC.

5. Create a `GridSearchCV` object with the SVC model, the parameter grid, 5-fold cross-validation (`cv=5`), and verbose logging (`verbose=2`).

6. Fit the `GridSearchCV` object to the training data. This will perform cross-validation and find the best parameters.

7. Print the best parameters (`best_params_`) and the best cross-validation score (`best_score_`).

8. Use the best model (with the optimal hyperparameters found) to make predictions on the test set.

9. Print the classification report and accuracy score to evaluate the performance of the model on the test set.   
Use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html


In [120]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [106]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

In [134]:
model = SVC()

In [135]:
param_grid = {'C': [1, 10, 2.5, 4],
              'kernel': ['linear', 'rbf', 'poly'],
              'gamma': ['scale', 'auto']}
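
As a quick sanity check on the size of the search, the grid can be expanded with `ParameterGrid` from `sklearn.model_selection`. This is a minimal sketch that assumes the same `param_grid` as in the cell above. The names `candidates`, `n_candidates`, and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [1, 10, 2.5, 4],
              'kernel': ['linear', 'rbf', 'poly'],
              'gamma': ['scale', 'auto']}

# Every combination the grid search will evaluate:
# 4 values of C x 3 kernels x 2 gamma settings
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)

# With 5-fold cross-validation each candidate is fitted 5 times
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 24 120
```

These numbers agree with the "24 candidates, totalling 120 fits" line that `GridSearchCV` logs when it is fitted.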

In [136]:
gs = GridSearchCV(model,
                  param_grid,
                  cv=5,
                  verbose=2)

In [137]:
gs.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV] END ....................C=1, gamma=scale, kernel=linear; total time=   0.0s
[CV] END ....................C=1, gamma=scale, kernel=linear; total time=   0.0s
[CV] END ....................C=1, gamma=scale, kernel=linear; total time=   0.0s
[CV] END ....................C=1, gamma=scale, kernel=linear; total time=   0.0s
[CV] END ....................C=1, gamma=scale, kernel=linear; total time=   0.0s
[CV] END .......................C=1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END .......................C=1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END .......................C=1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END .......................C=1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END .......................C=1, gamma=scale, kernel=rbf; total time=   0.0s
[CV] END ......................C=1, gamma=scale, kernel=poly; total time=   0.0s
[CV] END ......................C=1, gamma=scale

In [138]:
gs.best_params_

{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}

In [139]:
gs.best_score_

0.980952380952381
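
Beyond the single best score, the fitted search object keeps per-candidate results in `cv_results_`. The sketch below shows one way to rank them with pandas. It refits the search so that it runs on its own. The `random_state=0` in the split is an assumption added for reproducibility and is not part of the original cells, so the scores it prints will differ from those shown above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
# random_state is an assumption, added only to make this sketch repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

param_grid = {'C': [1, 10, 2.5, 4],
              'kernel': ['linear', 'rbf', 'poly'],
              'gamma': ['scale', 'auto']}
gs = GridSearchCV(SVC(), param_grid, cv=5)
gs.fit(X_train, y_train)

# One row per candidate; keep the columns that matter for comparison
results = pd.DataFrame(gs.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results.sort_values('rank_test_score')[cols].head()
print(top)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top candidates are meaningfully different or effectively tied.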

In [140]:
best_model = gs.best_estimator_

In [141]:
pred = best_model.predict(X_test)

In [142]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.79      1.00      0.88        15
           2       1.00      0.73      0.85        15

    accuracy                           0.91        45
   macro avg       0.93      0.91      0.91        45
weighted avg       0.93      0.91      0.91        45
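
Step 9 also asks for the accuracy score, which the cells above do not compute separately. A minimal sketch using `accuracy_score` from `sklearn.metrics` follows. To stay self-contained it builds an `SVC` directly from the best parameters reported above instead of reusing `gs.best_estimator_`, and it fixes `random_state=0` in the split. Both are assumptions for illustration, so the printed value need not match the report above.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
# random_state is an assumption, added only to make this sketch repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stand-in for gs.best_estimator_, using the best parameters found above
best_model = SVC(C=1, gamma='scale', kernel='linear').fit(X_train, y_train)
pred = best_model.predict(X_test)

# Fraction of test samples whose predicted label equals the true label
acc = accuracy_score(y_test, pred)
print(f"Test accuracy: {acc:.2f}")
```

In the notebook itself, `accuracy_score(y_test, pred)` can be called directly on the existing `pred`.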



### Question 2 - Oversampling using SMOTE

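SMOTE (Synthetic Minority Over-sampling Technique) handles imbalanced datasets by generating synthetic minority-class samples, interpolating between existing minority samples and their nearest neighbours. The `imbalanced-learn` package provides an easy way to apply SMOTE in Python. Install it with `pip install imbalanced-learn` and use the link below for reference: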

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
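
To make the idea concrete, the sketch below illustrates the interpolation step behind SMOTE with NumPy and `NearestNeighbors`. It is a simplified illustration, not the implementation used by imbalanced-learn. The function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)

    # Pick a random base sample and one of its k neighbours (columns 1..k)
    base = rng.integers(0, len(X_minority), size=n_new)
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]

    # Place the new point at a random position on the connecting segment
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])


# Toy minority class: 20 points in 2 dimensions
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.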

1. Import the necessary libraries, including NumPy, pandas, sklearn's `datasets`, `model_selection`, `linear_model`, and `metrics`, and `SMOTE` from imbalanced-learn.

2. Create an imbalanced synthetic dataset using make_classification. In this example, we set the weights parameter to `[0.95, 0.05]` to create an imbalance between two classes.

```
# Create a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
                           n_clusters_per_class=1, weights=[0.95, 0.05], flip_y=0, random_state=42)
```

3. Print the original class distribution to confirm the imbalance.

4. Split the dataset into training and testing sets using `train_test_split`.

5. Instantiate a `SMOTE` object and apply it to the training set using `fit_resample`. This generates synthetic samples for the minority class to balance the class distribution. Resample only the training set, so the test set keeps the original distribution.

6. Print the class distribution after applying SMOTE to confirm that the training set is now balanced.

7. Train a `LogisticRegression` on the resampled training set.

8. Use the trained classifier to make predictions on the test set.

9. Print the classification report and accuracy score to evaluate the model's performance on the test set.

10. Now train the same model on the original (non-resampled) training set and compare the results.




In [159]:
from sklearn.datasets import make_classification
import pandas as pd
from imblearn.over_sampling import SMOTE

In [150]:
# Create a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
                           n_clusters_per_class=1, weights=[0.95, 0.05], flip_y=0, random_state=42)

In [151]:
X.shape

(1000, 20)

In [155]:
pd.Series(y).value_counts()

0    950
1     50
Name: count, dtype: int64

In [156]:
950/50

19.0

In [157]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

In [158]:
pd.Series(y_train).value_counts()

0    665
1     35
Name: count, dtype: int64

In [160]:
sm = SMOTE()

In [161]:
X_res, y_res = sm.fit_resample(X_train,y_train)

In [162]:
pd.Series(y_res).value_counts()

0    665
1    665
Name: count, dtype: int64

In [163]:
from sklearn.linear_model import LogisticRegression

In [164]:
model = LogisticRegression()

In [165]:
model.fit(X_res, y_res)

In [166]:
pred = model.predict(X_test)

In [168]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.99      0.94      0.97       285
           1       0.43      0.87      0.58        15

    accuracy                           0.94       300
   macro avg       0.71      0.90      0.77       300
weighted avg       0.96      0.94      0.95       300



In [169]:
model2 = LogisticRegression()

In [170]:
model2.fit(X_train, y_train)

In [171]:
pred2 = model2.predict(X_test)

In [172]:
print(classification_report(y_test, pred2))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       285
           1       1.00      0.67      0.80        15

    accuracy                           0.98       300
   macro avg       0.99      0.83      0.90       300
weighted avg       0.98      0.98      0.98       300

