SMOTE (Synthetic Minority Over-sampling Technique) is a widely used algorithm for addressing class imbalance in machine learning datasets. It generates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, which balances the class distribution. SMOTE is particularly useful when the minority class is so underrepresented that a model would otherwise be biased towards the majority class.
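To make the idea concrete, here is a minimal NumPy sketch of the interpolation step: pick a minority point, choose one of its k nearest minority neighbors, and place a new point at a random position on the segment between them. The function name `smote_like_samples` and its parameters are hypothetical and chosen for illustration; this is not the imbalanced-learn implementation, which handles more cases.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Illustrative only: interpolate between minority points and their neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # a point cannot have more neighbors than there are other points

    # Pairwise Euclidean distances between all minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)  # random minority sample for each new point
    pick = rng.integers(0, k, size=n_new)  # which of its k neighbors to use
    nbr = neighbors[base, pick]
    gap = rng.random((n_new, 1))           # random position along the segment, in [0, 1)
    return X_minority[base] + gap * (X_minority[nbr] - X_minority[base])

# Four minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_min, n_new=6, k=2)
print(X_new.shape)
```

Because every new point lies on a segment between two existing minority points, the synthetic samples stay inside the region already occupied by the minority class.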

Here is a step-by-step explanation of how to apply SMOTE:

### Import the necessary libraries:

In [14]:
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# check version number
import imblearn
print(imblearn.__version__)

0.10.1


### Generate a synthetic imbalanced dataset:
In this example, we create a dataset with 10,000 samples and 2 features, where 99% of the samples belong to the majority class and 1% to the minority class.

In [15]:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
 n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)

### Summarize the class distribution:
The Counter class is used to count the occurrences of each class. This step helps us understand the initial class distribution.

In [16]:
from collections import Counter
# summarize class distribution
counter = Counter(y)
print(counter)

Counter({0: 9900, 1: 100})


### Split the dataset into training and testing sets:
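This step is essential for a fair evaluation of the model's performance on unseen data. The split is done before resampling so that the test set contains only real samples and no synthetic ones.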

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
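With only 100 minority samples, a purely random split can leave the test set with slightly more or fewer minority examples than expected. One option, shown here as a sketch rather than as part of the original walkthrough, is to pass `stratify=y` so that both splits keep roughly the same class ratio. The variable names `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical and chosen to avoid overwriting the variables used in the other cells.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same dataset definition as in the earlier cell
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)

# stratify=y keeps the class proportions (about) equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

Note that switching to a stratified split would change the class counts and the metrics reported in the later cells.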

### Apply SMOTE to the training data:
The fit_resample method is used to apply SMOTE to the training data. It generates synthetic samples for the minority class to balance the class distribution.

In [18]:
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

### Summarize the resampled class distribution:
After applying SMOTE, check the new class distribution to verify that the two classes are now balanced.

In [19]:
counter_resampled = Counter(y_train_resampled)
print(counter_resampled)

Counter({0: 7917, 1: 7917})


### Train a classifier on the resampled data:
In this example, we use logistic regression as the classifier, but you can use any other algorithm suitable for your problem.

In [20]:
classifier = LogisticRegression()
classifier.fit(X_train_resampled, y_train_resampled)

### Evaluate the classifier on the original testing data:
Use the trained classifier to make predictions on the original testing data.

In [21]:
y_pred = classifier.predict(X_test)

### Print the classification report:
The classification report provides metrics such as precision, recall, and F1-score for each class, giving insights into the model's performance on both classes.

In [22]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.92      0.96      1983
           1       0.09      0.88      0.16        17

    accuracy                           0.92      2000
   macro avg       0.54      0.90      0.56      2000
weighted avg       0.99      0.92      0.95      2000
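To judge whether resampling actually helped, it is useful to compare against a baseline trained on the original, imbalanced training data. The sketch below assumes the same dataset and split parameters as the cells above; the `_base` suffixes and the name `baseline` are hypothetical, used only to keep this comparison separate from the main variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Same dataset and split as in the earlier cells
X_base, y_base = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                     n_clusters_per_class=1, weights=[0.99],
                                     flip_y=0, random_state=1)
X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(
    X_base, y_base, test_size=0.2, random_state=42)

# Baseline: fit on the imbalanced training data, with no resampling
baseline = LogisticRegression()
baseline.fit(X_train_base, y_train_base)

# zero_division=0 avoids warnings if the model never predicts the minority class
report = classification_report(y_test_base, baseline.predict(X_test_base),
                               zero_division=0, output_dict=True)
print("minority class (1):", report["1"])
```

Comparing the minority-class precision and recall of this baseline with the SMOTE report above shows what the resampling trades off.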



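By following these steps, you can apply SMOTE to an imbalanced dataset and train a classifier on the balanced data. In the report above, the classifier reaches a minority-class recall of 0.88 but a precision of only 0.09, so SMOTE here improves detection of the minority class at the cost of many false positives. Check both metrics when deciding whether the resampling helps for your problem.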