# Hyperparameter Optimization
Hyperparameter optimization, or tuning, is the problem of choosing an optimal set of hyperparameters for a learning algorithm.  
* A **hyperparameter** is a parameter whose value is used to control the learning process.

### What is the difference between a parameter and a hyperparameter ❓ 

![image info](https://miro.medium.com/max/1400/1*9qdlqKa4dtjwXNDNaWe9UQ.jpeg)
<img src = "https://i.postimg.cc/5tCrG5Md/hyperparameters.png" />

* **Model parameters:** These are the parameters that are estimated by the model from the given data. For example, the coefficients in a linear regression or logistic regression.
* **Model hyperparameters:** These are the parameters that cannot be estimated by the model from the given data. They are set before training and control how the model parameters are estimated. For example, the learning rate in deep neural networks.
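The distinction can be made concrete with a minimal sketch. It assumes a toy one-feature regression fitted by gradient descent; the function name `fit_line` and the noiseless data are illustrative and not part of the original notebook. The learning rate and the number of steps are hyperparameters chosen by us, while `w` and `b` are parameters estimated from the data.

```python
import numpy as np

def fit_line(x, y, learning_rate=0.1, n_steps=5000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error.

    learning_rate and n_steps are hyperparameters: we choose them up front.
    w and b are model parameters: they are estimated from the data.
    """
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        error = (w * x + b) - y
        w -= learning_rate * 2 * np.mean(error * x)  # gradient of the MSE w.r.t. w
        b -= learning_rate * 2 * np.mean(error)      # gradient of the MSE w.r.t. b
    return w, b

# Toy, noiseless data generated from y = 2x + 1 (an assumption for illustration)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0

w, b = fit_line(x, y, learning_rate=0.1, n_steps=5000)
print(f"estimated parameters: w={w:.4f}, b={b:.4f}")
```

Changing `learning_rate` or `n_steps` changes how (and whether) the estimates converge, which is exactly why such values need tuning.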

### How To Fine-Tune Your Models

We have already covered <span style="color:red">**Overfitting**</span>.

One of the most important aspects of training a machine learning model is avoiding overfitting. Overfitting happens when the model tries too hard to capture the noise in the training dataset. By noise, we mean data points that reflect random chance rather than the actual properties of the data. A more flexible model can fit such points, which increases the risk of `overfitting`.
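As a rough illustration of how flexibility lets a model chase noise, the sketch below fits polynomials of increasing degree to a small synthetic dataset. The sine-shaped target, the noise level, and the chosen degrees are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth signal plus random noise (illustrative assumption)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.03, 0.97, 15)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Whether the test error follows is what tells a useful fit from an overfitted one.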

### Regularization

Regularization is a way to avoid ``overfitting`` by penalizing large regression coefficients. In simple terms, it shrinks the coefficients and thereby `simplifies` the model. In other words, this technique discourages us from learning a more complex or `flexible model`, to avoid the problem of overfitting.
It does so by adding a penalty to the loss function, which limits the complexity of the model (we will see what we mean by that).

>**❗ NOTE:** With regularization, you add a penalty to your model for its weights; in other words, you evaluate the model both by **how well it performs and how complex it is.**

Two commonly used methods for finding the best balance between a simple and a complicated model are: 
    <span style="color:blue">**Regularization L1**</span> and <span style="color:green">**Regularization L2**</span>

## L2 Ridge Regression

In a multiple `LinearRegression`, there are many variables at play. When some of them are correlated or carry little information, the model can end up relying on the wrong variables, which gives undesirable output as a result. Ridge regression is used to overcome this. It is a `Regularization` technique in which a penalty term, weighted by a tuning parameter, is added to the loss and tuned to dampen the effect of such variables in LinearRegression (in the statistical context, this effect is referred to as `noise`).
The main idea of Ridge Regression is to fit a new line that `doesn’t` fit the training data quite as closely. In other words, we introduce a certain amount of bias into the new trend line in exchange for lower variance.

Please look at the following image:
<img src="https://i.postimg.cc/0ymd1Vrf/ridge.png" width='600'/>

* Adds a penalty term to the least squares loss function:

$$\large\mathcal{L}_{Ridge} = \sum_{n=1}^{N} (y_n-(\mathbf{w}\mathbf{x_n} + w_0))^2 + \alpha \sum_{i=1}^{p} w_i^2$$ 

* Model is penalized if it uses large coefficients ($w$)
    * Each feature should have as little effect on the outcome as possible 
    * We don't want to penalize $w_0$, so we leave it out
    * Called L2 regularization because it uses the squared L2 norm: $\sum w_i^2$
* The strength of the regularization can be controlled with the $\alpha$ hyperparameter.
    * Increasing $\alpha$ causes more regularization (or shrinkage). The default in scikit-learn's `Ridge` is 1.0.

> **❗ NOTE:** By **changing the value of alpha**, we control the penalty term. The greater the value of alpha, the higher the penalty, and therefore the smaller the magnitude of the coefficients.
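The shrinkage effect of $\alpha$ can be checked with a minimal NumPy sketch of the closed-form ridge solution. The helper `ridge_fit` and the synthetic data are assumptions for illustration; centering the data is one common way to leave $w_0$ out of the penalty, and scikit-learn's `Ridge` may use a different solver internally.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept w_0."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering keeps w_0 out of the penalty
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    w0 = y_mean - x_mean @ w
    return w, w0

# Synthetic data with known coefficients (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=100)

norms = []
for alpha in (0.0, 1.0, 10.0, 100.0):
    w, w0 = ridge_fit(X_demo, y_demo, alpha)
    norms.append(float(np.sum(w ** 2)))
    print(f"alpha={alpha:>6}: w={np.round(w, 3)}, sum of squared weights={norms[-1]:.3f}")
```

With `alpha=0.0` this reduces to ordinary least squares, and the sum of squared weights decreases as `alpha` grows.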

## L1 Lasso Regression

### Lasso (Least Absolute Shrinkage and Selection Operator)

* Adds a different penalty term to the least squares sum:
$$\large\mathcal{L}_{Lasso} = \sum_{n=1}^{N} (y_n-(\mathbf{w}\mathbf{x_n} + w_0))^2 + \alpha \sum_{i=1}^{p} |w_i|$$ 
* Called L1 regularization because it uses the L1 norm: $\sum |w_i|$
    * Will cause many weights to be exactly 0
* Same parameter $\alpha$ to control the strength of regularization. 

>**❗ NOTE:**  
>* L1 prefers coefficients to be exactly zero (sparse models)
>* Some features are ignored entirely: automatic feature selection
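Why L1 produces exact zeros while L2 does not can be seen in a simplified special case. The sketch below assumes orthonormal features ($X^TX = I$), for which minimizing the Lasso loss above reduces to soft-thresholding the least-squares coefficients at $\alpha/2$, and minimizing the Ridge loss reduces to dividing them by $1 + \alpha$. The coefficient values are made up for illustration.

```python
import numpy as np

def soft_threshold(w, threshold):
    """Shrink each coefficient toward zero and clip small ones to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

# Hypothetical least-squares coefficients for an orthonormal design
w_ols = np.array([3.0, -0.2, 0.5, -2.0])
alpha = 1.0

w_lasso = soft_threshold(w_ols, alpha / 2)  # L1: subtract a constant, clip at zero
w_ridge = w_ols / (1 + alpha)               # L2: rescale, never exactly zero

print("OLS:  ", w_ols)
print("Lasso:", w_lasso)
print("Ridge:", w_ridge)
```

Coefficients whose magnitude is at most $\alpha/2$ are removed entirely by the L1 penalty, whereas the L2 penalty only scales every coefficient down.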
 

In [62]:
import pandas as pd
import numpy as np
import plotly.express as px
import warnings 
warnings.filterwarnings('ignore')


In [61]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

In [63]:
mpg_df = pd.read_csv("./dataset/cleaned_mpg.csv")  
mpg_df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | name | origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | chevrolet chevelle malibu | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | buick skylark 320 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | plymouth satellite | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | amc rebel sst | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | ford torino | 1 |


In [64]:
mpg_df = mpg_df.drop("name", axis=1)

In [65]:
mpg_df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

In [66]:
mpg_df = mpg_df.apply(lambda x: x.fillna(x.median()),axis=0)

In [67]:
mpg_df.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.30402 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.222625 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 76.0 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 125.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


In [68]:
mpg_df

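| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | 1 |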


In [54]:
px.imshow(mpg_df.corr())

>**NOTE:** Looking at the correlations between the features, we can see that there is some collinearity between the features. In particular, cylinders and displacement are highly correlated, and displacement and horsepower are highly correlated. 

LASSO is **more likely to remove features in sets of correlated features**, so it won’t be surprising if LASSO removes one or two of the features cylinders, displacement, weight and horsepower. How many features are removed depends on the regularization strength (the strength of the penalty), which is usually selected with cross-validated metrics, for example with scikit-learn's `LassoCV`. So, if there is some additional explanatory power behind a feature, it likely won’t be removed even if it is closely correlated to another feature.
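A small sketch of choosing the penalty strength by cross-validation, using scikit-learn's `LassoCV` on synthetic data. The data (two nearly identical features plus one independent feature) and the variable names are assumptions for illustration; which of the correlated features is kept can vary with the data and the seed.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1 (highly correlated)
x3 = rng.normal(size=n)                   # independent feature
X_demo = np.column_stack([x1, x2, x3])
y_demo = 3.0 * x1 + 2.0 * x3 + rng.normal(scale=0.5, size=n)

# LassoCV tries a path of alpha values and keeps the one with the best CV error
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)

print("alpha chosen by cross-validation:", lasso_cv.alpha_)
print("coefficients:", lasso_cv.coef_)
```

Inspecting `coef_` shows how the weight is distributed across the correlated pair for the selected `alpha_`.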

### Separate independent and dependent variables

In [69]:
X = mpg_df.drop('mpg', axis=1)
y = mpg_df[['mpg']]

In [71]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [72]:
columns = X_train.columns

In [73]:
columns

Index(['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
       'model_year', 'origin'],
      dtype='object')

### ❗ Scaling is Important to LASSO and RIDGE

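If we want to apply **LASSO** or **Ridge** properly in scikit-learn, we need to **scale our data first.** In unpenalized linear regression, rescaling a feature only rescales its coefficient and leaves the predictions unchanged. In LASSO and Ridge, however, the penalty is computed from the coefficient magnitudes (absolute values for LASSO, squares for Ridge), so a feature measured in large units gets a small coefficient and is penalized less than a feature measured in small units.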
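The dependence of coefficient size on units can be verified with a short NumPy sketch. The `weight_kg` feature and the numbers used to generate the data are hypothetical; the point is only that expressing the same feature in grams instead of kilograms divides its least-squares coefficient by 1000, which changes how much any coefficient-based penalty would charge for it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature: vehicle weight in kilograms, with a noisy linear target
weight_kg = rng.uniform(700, 2300, size=50)
mpg_demo = 45.0 - 0.012 * weight_kg + rng.normal(scale=1.0, size=50)

def ols_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    A = np.column_stack([x, np.ones_like(x)])
    slope, _intercept = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope

slope_kg = ols_slope(weight_kg, mpg_demo)
slope_g = ols_slope(weight_kg * 1000.0, mpg_demo)  # same feature, now in grams

print("coefficient per kg:", slope_kg)
print("coefficient per g: ", slope_g)
```

The predictions are identical in both cases, but the coefficient in grams is 1000 times smaller, so an L1 or L2 penalty would treat the two versions of the same feature very differently. Scaling all features to a common range removes this arbitrariness.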

In [74]:
from sklearn.preprocessing import MinMaxScaler

In [75]:
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Step 1) Fit a simple linear model

In [76]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

for idx, col_name in enumerate(columns):
    print(f"The coefficient for {col_name:<20} {regression_model.coef_[0][idx]}")

The coefficient for cylinders            -1.9669573630929515
The coefficient for displacement         8.785295196631541
The coefficient for horsepower           -3.6566107110937605
The coefficient for weight               -24.833250984499376
The coefficient for acceleration         0.9852412163203974
The coefficient for model_year           9.510526882841905
The coefficient for origin               2.397771540513039


In [77]:
importance_reg = np.abs(regression_model.coef_[0])
feature_names = np.array(columns)
imp_reg = pd.DataFrame({'feature_names':feature_names, 'importance':importance_reg}).sort_values(by='importance', 
                                                                                           ascending=False)
px.bar(x=imp_reg.importance, y=imp_reg.feature_names)

In [78]:
print(f"regression model score on training set : {regression_model.score(X_train, y_train)}")
print(f"regression model score on test set : {regression_model.score(X_test, y_test)}")

regression model score on training set : 0.8081802739111359
regression model score on test set : 0.8472274567567305


### Step 2) Create a regularized Ridge model and note the coefficients

In [79]:
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
print ("Ridge model:", (ridge.coef_))

Ridge model: [[ -1.24399402   4.05713757  -4.68521885 -20.26479315  -0.16899841
    9.19734458   2.27741192]]


In [19]:
importance_ridge = np.abs(ridge.coef_[0])
feature_names = np.array(columns)
imp_ridge = pd.DataFrame({'feature_names':feature_names, 'importance':importance_ridge}).sort_values(by='importance', 
                                                                                           ascending=False)
px.bar(x=imp_ridge.importance, y=imp_ridge.feature_names)

In [80]:
print(f"ridge regression model score on training set : {ridge.score(X_train, y_train)}")
print(f"ridge regression model score on test set : {ridge.score(X_test, y_test)}")

ridge regression model score on training set : 0.8062510255121134
ridge regression model score on test set : 0.8488332780929787


### Step 3) Create a regularized Lasso model and note the coefficients

In [81]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train,y_train)
print ("Lasso model:", (lasso.coef_))

Lasso model: [ -0.64927043  -0.          -0.         -19.72497562   0.
   8.64099519   1.57737775]


In [82]:
importance_lasso = np.abs(lasso.coef_)
feature_names = np.array(X.columns)
imp_lasso = pd.DataFrame({'feature_names':feature_names, 'importance':importance_lasso}).sort_values(by='importance', 
                                                                                           ascending=False)
px.bar(x=imp_lasso.importance, y=imp_lasso.feature_names)

In [83]:
print(f"lasso regression model score on training set : {lasso.score(X_train, y_train)}")
print(f"lasso regression model score on test set : {lasso.score(X_test, y_test)}")

lasso regression model score on training set : 0.7991106979955522
lasso regression model score on test set : 0.8484750550046531


### Performance of Lasso and Ridge regularization when facing overfitting

### Step 1) Generate a polynomial model

In [84]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

In [85]:
poly = PolynomialFeatures(degree = 2)

In [86]:
X_poly = poly.fit_transform(X)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_poly, y, test_size=0.30, random_state=1)
X_train2.shape

(278, 36)

In [87]:
scaler = MinMaxScaler()
scaler.fit(X_train2)
X_train2 = scaler.transform(X_train2)
X_test2 = scaler.transform(X_test2)

In [92]:
regression_model2 = LinearRegression()
regression_model2.fit(X_train2, y_train2)
print(regression_model2.coef_[0])

[ 3.93061121e+13  8.04981909e+01 -2.60213149e+02  9.16412704e+01
  1.65452088e+01 -9.60397216e+01 -1.12625039e+02 -5.79765781e+01
 -5.13299422e+01  7.29480623e+01 -2.90933932e+01  5.55386549e+01
  3.28297773e+01 -9.62736719e+01  9.56268292e+00 -6.71891049e+01
  4.90030978e+01  3.34300163e+01 -4.17236800e+01  2.51026569e+02
 -3.25422301e+00  9.23395579e-01 -2.72313114e+01 -1.24804400e+01
 -8.22565213e+01  1.04151423e+01  1.05823600e+01  2.80468643e+01
 -9.51497800e+01  2.48876389e+00 -9.17292156e+00  9.22864128e+01
  2.35794797e+01  1.19428714e+02  4.82581624e+01 -8.11623189e+00]


In [95]:
print(f"polynomial regression model score on training set : {regression_model2.score(X_train2, y_train2)}")
print(f"polynomial regression model score on test set : {regression_model2.score(X_test2, y_test2)}")

polynomial regression model score on training set : 0.9013269003482024
polynomial regression model score on test set : 0.8512777336441706


### Step 2) Create a regularized Ridge model and note the coefficients

In [101]:
ridge2 = Ridge(alpha=.3)
ridge2.fit(X_train2,y_train2)
print ("Ridge model:", (ridge2.coef_))

Ridge model: [[  0.          -0.77837453  -3.26553512  -3.28092971  -5.9792872
   -1.06881541   3.61593494   0.85635149  -0.23087224   1.32476827
    2.12897881   4.52378381   2.69257699  -3.90467015   3.10979676
    3.68790298   3.49817087   5.85488967  -3.20159957  -6.4622073
    0.8133432    3.82419428   3.80867009  -5.73961123  -7.55955555
   -4.79854812   4.13600404  -6.71404627 -10.08973985  -4.22673912
    1.91514419   2.01892227   6.15342641   8.37656862   5.64333591
   -5.6154968 ]]


In [97]:
print(f"ridge regression model score on training set : {ridge2.score(X_train2, y_train2)}")
print(f"ridge regression model score on test set : {ridge2.score(X_test2, y_test2)}")

ridge regression model score on training set : 0.8735934187825043
ridge regression model score on test set : 0.8716583937266741


### Step 3) Create a regularized LASSO model and note the coefficients

In [99]:
lasso2 = Lasso(alpha=0.1)
lasso2.fit(X_train2,y_train2)
print ("Lasso model:", (lasso2.coef_))

Lasso model: [  0.          -0.20814011  -0.          -0.          -0.
   0.           0.           0.          -0.          -0.
  -0.          -0.          -0.          -0.           0.
  -0.          -0.          -0.          -0.          -0.
  -0.          -0.          -0.          -1.89359789  -0.
  -0.          -0.          -0.         -18.00938217  -0.
   0.           0.           2.89618843  11.05236595   0.
   0.        ]


In [100]:
print(f"lasso regression model score on training set : {lasso2.score(X_train2, y_train2)}")
print(f"lasso regression model score on test set : {lasso2.score(X_test2, y_test2)}")

lasso regression model score on training set : 0.8164385295948247
lasso regression model score on test set : 0.8611836424628936


### How does Alpha affect ridge regression?

In [104]:
alpha = np.logspace(-6, 1, num=20)
print(f"alpha values : {alpha}")
ai = list(range(len(alpha)))  # position of each alpha in the array, used as the x-axis
test_score = []
train_score = []
for a in alpha:
    # Fit one Ridge model per alpha and record R^2 on both splits
    r = Ridge(alpha=a).fit(X_train2, y_train2)
    test_score.append(r.score(X_test2, y_test2))
    train_score.append(r.score(X_train2, y_train2))


# The x-axis shows the index i of alpha[i], not the alpha value itself
fig = px.line()
fig.add_scatter(x=ai, y=train_score, name="train score")
fig.add_scatter(x=ai, y=test_score, name="test score")
fig.update_layout(xaxis_title="index of alpha", yaxis_title="score", yaxis_range=[0.8, 1])
fig.show()


alpha values : [1.00000000e-06 2.33572147e-06 5.45559478e-06 1.27427499e-05
 2.97635144e-05 6.95192796e-05 1.62377674e-04 3.79269019e-04
 8.85866790e-04 2.06913808e-03 4.83293024e-03 1.12883789e-02
 2.63665090e-02 6.15848211e-02 1.43844989e-01 3.35981829e-01
 7.84759970e-01 1.83298071e+00 4.28133240e+00 1.00000000e+01]


In [105]:
print(alpha[10])

0.004832930238571752


### Hyperparameter tuning

There are several approaches to hyperparameter tuning, although two of the simplest and most common methods are random search and grid search.

* **Grid Search:**  set up a grid of hyperparameter values and, for each combination, train a model and score it on the validation data. Every single combination of hyperparameter values is tried, which can be very inefficient when there are many hyperparameters.    
* **Random search:**  set up a grid (or a distribution) of hyperparameter values and select random combinations to train and score the model. The number of search iterations is set based on the available time and resources.

<img src="https://i.postimg.cc/hP72yt5Z/gridvsrandom1.png" />
<img src="https://i.postimg.cc/5NcrMzLW/gridvsrandom2.png" />


* The curves on the left and on the top denote model accuracy


### Grid Search

- For each hyperparameter, create a list of interesting/possible values
- Evaluate all possible combinations of hyperparameter values
    - Split the training data into a training and a validation set, or use cross-validation to repeat this split several times
- Select the hyperparameter values yielding the best results on the validation data

In [106]:
from sklearn.model_selection import GridSearchCV


alpha_range =np.logspace(-6,1,num=20)
print(alpha_range)

param_grid = [{"alpha": alpha_range}]

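# Evaluate every alpha in the grid with 10-fold cross-validation on the training data
gs = GridSearchCV(estimator=Ridge(), 
                  param_grid=param_grid, 
                  scoring='neg_mean_squared_error', 
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train2, y_train2)
print(gs.best_score_)      # negative mean squared error: closer to 0 is better
print(gs.best_params_)
print(gs.best_estimator_)  # model refit on the full training data with the best alpha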

[1.00000000e-06 2.33572147e-06 5.45559478e-06 1.27427499e-05
 2.97635144e-05 6.95192796e-05 1.62377674e-04 3.79269019e-04
 8.85866790e-04 2.06913808e-03 4.83293024e-03 1.12883789e-02
 2.63665090e-02 6.15848211e-02 1.43844989e-01 3.35981829e-01
 7.84759970e-01 1.83298071e+00 4.28133240e+00 1.00000000e+01]
-8.29237924227358
{'alpha': 0.002069138081114788}
Ridge(alpha=0.002069138081114788)
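To make the mechanics explicit, here is a hand-written sketch of a grid search over two hyperparameters with a single validation split. The synthetic data, the helper `fit_predict`, and the grid values are assumptions for illustration; unlike `GridSearchCV`, it uses one hold-out split rather than cross-validation, and for simplicity it also penalizes the intercept column.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)
x_tr, y_tr = x[:80], y[:80]      # training part
x_val, y_val = x[80:], y[80:]    # validation part

def fit_predict(degree, alpha, x_fit, y_fit, x_new):
    """Polynomial ridge regression via the closed-form solution."""
    A = np.vander(x_fit, degree + 1)  # columns: x^degree, ..., x^1, x^0
    w = np.linalg.solve(A.T @ A + alpha * np.eye(degree + 1), A.T @ y_fit)
    return np.vander(x_new, degree + 1) @ w

grid = {"degree": [1, 3, 5], "alpha": [1e-3, 1e-1, 10.0]}

results = []
# itertools.product enumerates every combination: 3 degrees x 3 alphas = 9 fits
for degree, alpha in itertools.product(grid["degree"], grid["alpha"]):
    pred = fit_predict(degree, alpha, x_tr, y_tr, x_val)
    val_mse = float(np.mean((pred - y_val) ** 2))
    results.append((val_mse, degree, alpha))

best_mse, best_degree, best_alpha = min(results)
print(f"best validation MSE={best_mse:.4f} with degree={best_degree}, alpha={best_alpha}")
```

The number of fits is the product of the list lengths, which is the combinatorial growth that makes exhaustive search expensive as more hyperparameters are added.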


### Random Search

- Grid Search has a few downsides:
    - Optimizing many hyperparameters creates a combinatorial explosion
    - You have to predefine a grid, hence you may jump over optimal values
- Random Search:
    - Picks `n_iter` random parameter values
    - Scales better, you control the number of iterations
    - Often works better in practice, too
        - not all hyperparameters interact strongly
        - you don't need to explore all combinations

In [107]:
from sklearn.model_selection import RandomizedSearchCV

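# Candidate values to sample from (the same 20 alphas used for the grid search)
param_grid = [{"alpha": alpha_range}]

# n_iter equals the number of candidates here, so every alpha ends up being tried;
# set n_iter below 20 to evaluate only a random subset
random_search = RandomizedSearchCV(estimator=Lasso(), 
                  param_distributions=param_grid, n_iter=20)
random_search.fit(X_train2, y_train2)

print(random_search.best_score_)   # mean cross-validated R^2 (default scorer for regressors)
print(random_search.best_params_)
print(random_search.best_estimator_)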

0.8506631886411137
{'alpha': 0.004832930238571752}
Lasso(alpha=0.004832930238571752)
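Random search is most useful when it samples from a continuous distribution instead of a fixed list, because it can then land between grid points. The sketch below assumes SciPy's `loguniform` distribution and a synthetic regression problem from `make_regression`; the bounds mirror the `1e-6` to `10` range used above, and the dataset is illustrative only.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": loguniform(1e-6, 1e1)},  # sample alpha on a log scale
    n_iter=20,        # number of random draws, independent of any grid size
    cv=5,
    random_state=0,   # makes the sampled alphas reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Here `n_iter` directly controls the cost of the search, regardless of how many hyperparameters or how wide the ranges are.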


# Resampling method

A resampling method is a tool that repeatedly draws samples from a dataset and calculates statistics and metrics on each of those samples in order to learn more about some quantity of interest. In the machine learning setting, that quantity is the performance of a model.  
  
Two commonly used resampling methods that you may encounter are k-fold cross-validation and the bootstrap.

* **Bootstrap.** Samples are drawn from the dataset with replacement (allowing the same sample to appear more than once in the sample), where those instances not drawn into the data sample may be used for the test set.
* **k-fold Cross-Validation.** A dataset is partitioned into k groups, where each group is given the opportunity of being used as a held out test set leaving the remaining groups as the training set.


## Cross-validation

You’re always ensuring your model performs well on your validation set, but what if your validation set is a little biased? Or worse, your training set is biased? To counter these possibilities, you can use cross validation.

### K Fold Cross Validation  
- _k-fold cross-validation_ (CV): split (randomized) data into _k_ equal-sized parts, called _folds_. For example, with _k_ = 5:
    - First, fold 1 is the test set, and folds 2-5 comprise the training set
    - Then, fold 2 is the test set, and folds 1, 3, 4, 5 comprise the training set, and so on until every fold has been the test set once
    - Compute _k_ evaluation scores, aggregate afterwards (e.g. take the mean)
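The fold bookkeeping can be sketched in a few lines of NumPy. The generator `k_fold_indices` is a hypothetical helper written for this example; in practice scikit-learn's `KFold` provides the same kind of index splits.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the indices once, then cut them into k (nearly) equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: test={np.sort(test_idx)}, train={np.sort(train_idx)}")
```

Every sample lands in the test set exactly once and in the training set the remaining `k - 1` times.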

<img src="https://i.postimg.cc/FzLwL2h6/cv2.png" width="600"/>
<img src="https://i.postimg.cc/Kzx88DFB/cv.png" width="600"/>

In [108]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
lg = LogisticRegression()


scores = cross_val_score(lg, iris.data, iris.target, cv=5 )

print("Cross-validation scores: {} ".format(scores) )
print("Average cross-validation score: {:.2f}".format(scores.mean()))
print("Variance in cross-validation score: {:.6f}".format(np.var(scores)))

Cross-validation scores: [0.96666667 1.         0.93333333 0.96666667 1.        ] 
Average cross-validation score: 0.97
Variance in cross-validation score: 0.000622


## The Bootstrap
- Sample _n_ (dataset size) data points, with replacement, as training set (the bootstrap)
    - On average, a bootstrap sample includes about 63.2% of the distinct data points (the rest of its _n_ draws are duplicates)
- Use the unsampled (out-of-bootstrap) samples as the test set
- Repeat $k$ times to obtain $k$ scores
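A single bootstrap draw and its out-of-bootstrap set can be sketched with NumPy as follows. The dataset size and the seed are arbitrary choices for illustration; the expected share of distinct points in a bootstrap sample is $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # dataset size (illustrative)

# Draw n indices with replacement: this is one bootstrap training sample
boot_idx = rng.integers(0, n, size=n)

in_bag = np.unique(boot_idx)                      # distinct points that were drawn
oob_idx = np.setdiff1d(np.arange(n), in_bag)      # never drawn: usable as the test set

in_bag_fraction = in_bag.size / n
print(f"distinct points in the bootstrap: {in_bag_fraction:.3f}")
print(f"out-of-bootstrap points:          {oob_idx.size / n:.3f}")
print(f"theoretical limit 1 - 1/e:        {1 - np.exp(-1):.3f}")
```

Repeating the draw with different seeds and scoring a model on each out-of-bootstrap set gives the $k$ scores described above.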


<img src="https://i.postimg.cc/PJcHYWG6/bootstrap.png" width="500"/>

>**❗ NOTE:** Bootstrap and CV are both resampling methods, but they are used in different scenarios. Cross-validation assesses the model's performance. In contrast, the Bootstrap method estimates statistics on a population by sampling a dataset with replacement.

<img src="https://i.postimg.cc/c4w1wzHw/bootstrap2.png" width="500"/>
<img src="https://i.postimg.cc/8PQNpbH5/bootstrap3.png" width="500"/>

### Dealing with class imbalance
Here is how the class imbalance in the dataset can be visualized:




<img src='https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png' width="800"/>



In [109]:
# Load the dataset; the target column is `term_deposit_subscribed`
dataa = pd.read_csv('https://raw.githubusercontent.com/CheeAnt/Custormer-Segmentation/main/dataset/train.csv')

# Separate the target from the features
y_imbalanced = dataa['term_deposit_subscribed']
X_imbalanced = dataa.drop('term_deposit_subscribed', axis=1)

In [111]:
y_imbalanced.value_counts()

0    28253
1     3394
Name: term_deposit_subscribed, dtype: int64

The dataset is imbalanced: 28,253 records belong to class 0 and only 3,394 belong to class 1 (`term_deposit_subscribed` = 1), a ratio of roughly 8 to 1.

The next step is to use the **resample** method to **oversample the minority class** (class 1 in this example) and **undersample the majority class** (class 0).

### Resample method for Over Sampling Minority Class

In [115]:
from sklearn.utils import resample

X_oversampled, y_oversampled = resample(X_imbalanced[y_imbalanced == 1],
                                        y_imbalanced[y_imbalanced == 1],
                                        replace=True,
                                        n_samples=X_imbalanced[y_imbalanced == 0].shape[0],
                                        random_state=123)

# Combine the untouched majority class with the oversampled minority class
X_balanced = pd.concat([X_imbalanced[y_imbalanced == 0], X_oversampled])
y_balanced = pd.concat([y_imbalanced[y_imbalanced == 0], y_oversampled])

In [117]:
X_oversampled

| | id | customer_age | job_type | marital | education | default | balance | housing_loan | personal_loan | communication_type | day_of_month | month | last_contact_duration | num_contacts_in_campaign | days_since_prev_campaign_contact | num_contacts_prev_campaign | prev_campaign_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12419 | id_5630 | 59.0 | technician | married | secondary | no | 593.0 | yes | no | cellular | 10 | feb | 491.0 | 3.0 | 183.0 | 6 | success |
| 10342 | id_2297 | 50.0 | technician | married | secondary | no | 193.0 | yes | yes | cellular | 17 | feb | 90.0 | 1.0 | | 0 | unknown |
| 16365 | id_35320 | 55.0 | management | married | tertiary | no | 3635.0 | no | no | cellular | 7 | may | 215.0 | 4.0 | 88.0 | 8 | success |
| 28790 | id_7941 | 37.0 | blue-collar | married | primary | no | 193.0 | yes | no | telephone | 19 | nov | 605.0 | 1.0 | 175.0 | 1 | other |
| 19944 | id_32736 | 50.0 | management | married | tertiary | no | 933.0 | no | no | cellular | 30 | apr | 786.0 | 1.0 | | 0 | unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16052 | id_41091 | 44.0 | management | married | unknown | no | | no | no | cellular | 19 | jul | 255.0 | 1.0 | | 0 | unknown |
| 115 | id_34013 | 62.0 | admin. | married | secondary | no | | no | no | cellular | 22 | dec | 170.0 | 1.0 | 188.0 | 9 | success |
| 20776 | id_27372 | 30.0 | self-employed | divorced | tertiary | no | 3267.0 | no | no | cellular | 4 | may | 968.0 | 3.0 | | 0 | unknown |
| 4979 | id_5686 | 56.0 | blue-collar | married | primary | no | 2110.0 | no | no | cellular | 12 | aug | 1136.0 | 2.0 | | 0 | unknown |


In [43]:
print("Number of class 1 samples before:", X_imbalanced[y_imbalanced == 1].shape)
print("Number of class 1 samples after: ",X_balanced[y_balanced == 1].shape)

Number of class 1 samples before: (3394, 17)
Number of class 1 samples after:  (28253, 17)


### Resample method for Under Sampling Majority Class

In [118]:
X_undersampled, y_undersampled = resample(X_imbalanced[y_imbalanced == 0],
                                          y_imbalanced[y_imbalanced == 0],
                                          replace=False,  # sample without replacement so no majority row is duplicated
                                          n_samples=X_imbalanced[y_imbalanced == 1].shape[0],
                                          random_state=123)

# Combine the untouched minority class with the undersampled majority class
X_balanced = pd.concat([X_imbalanced[y_imbalanced == 1], X_undersampled])
y_balanced = pd.concat([y_imbalanced[y_imbalanced == 1], y_undersampled])

In [119]:
print("Number of class 0 samples before:", X_imbalanced[y_imbalanced == 0].shape)
print("Number of class 0 samples after: ",X_balanced[y_balanced == 0].shape)

Number of class 0 samples before: (28253, 17)
Number of class 0 samples after:  (3394, 17)


>**❗ NOTE:** In a machine learning problem, make sure to upsample/downsample **ONLY AFTER** you split into train and test sets (and a validation set if you use one), and resample only the training set. If you upsample your dataset before you split it, copies of the same record can end up in both the training and the test set, so there is a high possibility that your model is exposed to data leakage.
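A minimal sketch of that order of operations, assuming a synthetic dataset with a 9-to-1 class ratio (the arrays `X_demo` and `y_demo` are made up for this example). The data is split first, with stratification so the test set keeps the original class ratio, and only the training part is oversampled.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)  # 10% minority class (illustrative)

# 1) Split first; the test set is never resampled
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# 2) Oversample the minority class inside the training set only
minority = y_tr == 1
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True,
    n_samples=int((~minority).sum()),  # match the majority count
    random_state=0,
)
X_tr_bal = np.vstack([X_tr[~minority], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_min_up])

print("train class counts after resampling:", np.bincount(y_tr_bal))
print("test class counts (untouched):      ", np.bincount(y_te))
```

The model is then trained on `X_tr_bal`, `y_tr_bal` and evaluated on the untouched `X_te`, `y_te`, so no duplicated record can appear on both sides of the split.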