

```
Name : Sudarsun S
RegNo: 20BCE1699
Topic: Class Imbalance SMOTE & ADASYN
Machine Learning Embedded Lab
```




### 1. GENERATE DATASET HAVING CLASS IMBALANCE (BINARY DATA)

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
np.random.seed(0)

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    weights=[0.95, 0.05],
    n_clusters_per_class=1,
    random_state=0
)

data = pd.DataFrame(data=np.c_[X, y], columns=['Feature1', 'Feature2', 'Target'])
print(data.head())


   Feature1  Feature2  Target
0 -0.400228 -0.926880     0.0
1 -1.073630  1.199259     0.0
2 -0.922953  0.306167     0.0
3 -0.748422  2.103053     0.0
4 -1.454758  2.645131     0.0
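
The first five rows all belong to class 0, so `head()` alone does not show the imbalance. A minimal sketch of a class-count check follows; `imbalance_summary` is a hypothetical helper, and the label list is a stand-in with the nominal 95/5 split, not the notebook's actual `y`.

```python
from collections import Counter

def imbalance_summary(labels):
    """Return per-class counts and the majority-to-minority ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return dict(counts), ratio

# Stand-in labels with the nominal 95/5 split requested via `weights`.
labels = [0] * 950 + [1] * 50
counts, ratio = imbalance_summary(labels)
print(counts, ratio)
```

With the notebook's data, `imbalance_summary(y.tolist())` would report the realized counts. These can deviate slightly from 950/50 because `make_classification` flips a small fraction of labels by default (`flip_y=0.01`).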


### 2. MAKE A DECISION TREE TO LEARN THE CLASSIFICATION AND GENERATE THE ACCURACY, PRECISION AND RECALL

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.97
Precision: 0.7777777777777778
Recall: 0.6363636363636364


### 3. Show the problem in the above metrics



*   Accuracy: Accuracy is not a reliable metric on imbalanced datasets. Roughly 95% of the samples here belong to class 0, so a model that always predicts class 0 would already score about 0.95. The reported 0.97 therefore says little about how well the minority class (class 1) is classified.
*   Precision: Precision is the fraction of predicted class-1 instances that really are class 1. Here it is 0.78, so roughly one in five class-1 predictions is a false positive. This matters when false positives for the minority class are costly.
*   Recall: Recall is the fraction of actual class-1 instances that the model finds. Here it is 0.64, so the tree misses more than a third of the true minority instances. This matters when false negatives for the minority class are costly, which is the usual situation in imbalanced problems.
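
The baseline figures can be reproduced by hand. Assuming the 200-sample test split implied by `test_size=0.2`, the confusion-matrix counts below are the ones consistent with the printed accuracy, precision and recall. They are inferred from those metrics, not taken from the notebook's output.

```python
# Counts inferred from the reported baseline metrics (an assumption,
# based on a 200-sample test split): 11 true class-1 samples in total.
tp, fp, fn, tn = 7, 2, 4, 187

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy of a trivial model that predicts class 0 for every sample.
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"always-predict-0 accuracy={majority_baseline:.4f}")
```

Under these counts the trivial model reaches 0.945 accuracy while finding none of the class-1 samples, which is why accuracy alone is misleading here.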












### 4. Use SMOTE to increase the number of points in the minority class

In [None]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.11.0-py3-none-any.whl (235 kB)
Installing collected packages: imbalanced-learn
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.10.1
    Uninstalling imbalanced-learn-0.10.1:
      Successfully uninstalled imbalanced-learn-0.10.1
Successfully installed imbalanced-learn-0.11.0


In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=0)
X_resampled, y_resampled = smote.fit_resample(X, y)
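
SMOTE creates each synthetic point by picking a minority sample, choosing one of its k nearest minority-class neighbors, and interpolating at a random position between the two. The sketch below illustrates that idea in plain Python. `smote_sketch` is a hypothetical helper written for this note, not imbalanced-learn's implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_sketch(points, n_new=3, k=2))
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.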

### 5. Again train the decision tree and test it to get the precision, recall and accuracy

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.9021164021164021
Precision: 0.8854166666666666
Recall: 0.918918918918919


### 6. Check if there is any improvement in the metrics

Comparing the metrics before and after SMOTE, we can observe the following changes:



*   Accuracy decreased from 0.97 to 0.9021.
*   Precision improved from 0.7778 to 0.8854.
*   Recall improved substantially, from 0.6364 to 0.9189.

After SMOTE, the tree identifies the minority class (class 1) far more reliably: recall rises by about 28 points and precision by about 11. The drop in accuracy is expected, because on a balanced test set the model can no longer score well simply by favoring class 0. Note that the two runs are evaluated on different test sets. SMOTE was applied before `train_test_split`, so the second test set is balanced and contains synthetic points, which makes the comparison with the original imbalanced test set indicative rather than exact.
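
In sections 4 and 5 the data is resampled before `train_test_split`, so synthetic points also end up in the test set. A common alternative is to hold out the test set first and oversample only the training part. The sketch below shows that ordering in plain Python. Random duplication stands in for SMOTE so that the block needs no third-party packages, and `split_then_oversample` is a hypothetical helper.

```python
import random
from collections import Counter

def split_then_oversample(labels, test_size=0.2, seed=0):
    """Return (balanced training indices, untouched test indices)."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = round(len(idx) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Balance the training part only; the test part is never resampled.
    counts = Counter(labels[i] for i in train_idx)
    target = max(counts.values())
    balanced = list(train_idx)
    for cls, n in counts.items():
        members = [i for i in train_idx if labels[i] == cls]
        balanced += rng.choices(members, k=target - n)  # stand-in for SMOTE
    return balanced, test_idx

labels = [0] * 95 + [1] * 5
train_idx, test_idx = split_then_oversample(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With imbalanced-learn the same ordering would be `train_test_split(X, y, ...)` followed by `smote.fit_resample(X_train, y_train)`, leaving `X_test` and `y_test` as drawn from the original data.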

### 7. Repeat the same steps with ADASYN

In [None]:
from imblearn.over_sampling import ADASYN

adasyn = ADASYN(sampling_strategy='auto', random_state=0)
X_resampled_adasyn, y_resampled_adasyn = adasyn.fit_resample(X, y)

X_train_adasyn, X_test_adasyn, y_train_adasyn, y_test_adasyn = train_test_split(X_resampled_adasyn, y_resampled_adasyn, test_size=0.2, random_state=0)
clf_adasyn = DecisionTreeClassifier(random_state=0)
clf_adasyn.fit(X_train_adasyn, y_train_adasyn)
y_pred_adasyn = clf_adasyn.predict(X_test_adasyn)

accuracy_after_adasyn = accuracy_score(y_test_adasyn, y_pred_adasyn)
precision_after_adasyn = precision_score(y_test_adasyn, y_pred_adasyn)
recall_after_adasyn = recall_score(y_test_adasyn, y_pred_adasyn)

print("Metrics after ADASYN:")
print("Accuracy:", accuracy_after_adasyn)
print("Precision:", precision_after_adasyn)
print("Recall:", recall_after_adasyn)


Metrics after ADASYN:
Accuracy: 0.8601583113456465
Precision: 0.8497409326424871
Recall: 0.8723404255319149
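
ADASYN differs from SMOTE in where it places synthetic points: minority samples whose neighborhoods contain more majority-class points receive a larger share of the new samples. The sketch below computes only those per-point shares. `adasyn_weights` is a hypothetical helper, and the uniform fallback when no minority point has majority neighbors is a choice made for this sketch.

```python
import math

def adasyn_weights(minority, majority, k=5):
    """Share of synthetic samples assigned to each minority point."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for point in minority:
        # k nearest neighbors of the point among all other samples.
        others = [(q, lab) for q, lab in labelled if q is not point]
        others.sort(key=lambda item: math.dist(point, item[0]))
        nearest = others[:k]
        # Fraction of those neighbors that belong to the majority class.
        ratios.append(sum(1 for _, lab in nearest if lab == 0) / len(nearest))
    total = sum(ratios)
    if total == 0:
        return [1 / len(minority)] * len(minority)  # sketch-only fallback
    return [r / total for r in ratios]

minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9)]
print(adasyn_weights(minority, majority, k=2))  # [0.0, 0.0, 0.0, 1.0]
```

In this toy example the three clustered minority points have only minority neighbors, so all synthetic samples would be generated around the one point that sits among majority samples. Concentrating new points near the class boundary is one plausible reason ADASYN can score differently from SMOTE on the same data.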


### 8. Compare the results using SMOTE and ADASYN



*   Accuracy: SMOTE gave higher accuracy (0.9021) than ADASYN (0.8602). Both are below the 0.97 measured before resampling, so SMOTE lost less accuracy rather than gaining more.
*   Precision: SMOTE also achieved higher precision (0.8854) than ADASYN (0.8497).
*   Recall: Recall after SMOTE (0.9189) is about five points higher than after ADASYN (0.8723).

In this specific scenario, SMOTE outperforms ADASYN on accuracy, precision, and recall. This ranking comes from one dataset, one split, and one random seed, so it should not be generalized. The choice between SMOTE and ADASYN depends on the characteristics of the dataset and on which errors matter most, and different datasets may yield different results.

# Using The Wine Dataset

In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [21]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from imblearn.over_sampling import SMOTE, ADASYN

winequality = pd.read_csv('winequality.csv')

# Step 2: Make a Decision Tree for classification and generate metrics
X = winequality.drop("quality", axis=1)
y = winequality["quality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

# Step 3: Show problems in the metrics

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print("Step 2:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)


# Step 4: Use SMOTE to oversample the minority classes (training split only)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Step 5: Train the decision tree on the resampled data
dt_classifier.fit(X_resampled, y_resampled)
y_pred_resampled = dt_classifier.predict(X_test)

accuracy_resampled = accuracy_score(y_test, y_pred_resampled)
precision_resampled = precision_score(y_test, y_pred_resampled, average='weighted')
recall_resampled = recall_score(y_test, y_pred_resampled, average='weighted')

print("Step 5 (SMOTE):")
print("Accuracy:", accuracy_resampled)
print("Precision:", precision_resampled)
print("Recall:", recall_resampled)

# Step 7: Perform the same steps with ADASYN
sampling_strategy = {3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000}

adasyn = ADASYN(sampling_strategy=sampling_strategy, random_state=42)
X_resampled_adasyn, y_resampled_adasyn = adasyn.fit_resample(X_train, y_train)

dt_classifier.fit(X_resampled_adasyn, y_resampled_adasyn)
y_pred_resampled_adasyn = dt_classifier.predict(X_test)

accuracy_resampled_adasyn = accuracy_score(y_test, y_pred_resampled_adasyn)
precision_resampled_adasyn = precision_score(y_test, y_pred_resampled_adasyn, average='weighted')
recall_resampled_adasyn = recall_score(y_test, y_pred_resampled_adasyn, average='weighted')

print("Step 7 (ADASYN):")
print("Accuracy:", accuracy_resampled_adasyn)
print("Precision:", precision_resampled_adasyn)
print("Recall:", recall_resampled_adasyn)

# Step 8: Compare the results between SMOTE and ADASYN
print("Step 8 (Comparison):")
print("SMOTE vs. ADASYN - Accuracy:")
print("SMOTE:", accuracy_resampled)
print("ADASYN:", accuracy_resampled_adasyn)
print("SMOTE vs. ADASYN - Precision:")
print("SMOTE:", precision_resampled)
print("ADASYN:", precision_resampled_adasyn)
print("SMOTE vs. ADASYN - Recall:")
print("SMOTE:", recall_resampled)
print("ADASYN:", recall_resampled_adasyn)


Step 2:
Accuracy: 0.559375
Precision: 0.550005791860414
Recall: 0.559375
Step 5 (SMOTE):
Accuracy: 0.546875
Precision: 0.6010336492551567
Recall: 0.546875




Step 7 (ADASYN):
Accuracy: 0.575
Precision: 0.6114680012950572
Recall: 0.575
Step 8 (Comparison):
SMOTE vs. ADASYN - Accuracy:
SMOTE: 0.546875
ADASYN: 0.575
SMOTE vs. ADASYN - Precision:
SMOTE: 0.6010336492551567
ADASYN: 0.6114680012950572
SMOTE vs. ADASYN - Recall:
SMOTE: 0.546875
ADASYN: 0.575
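
In every block of the output above, the weighted recall is identical to the accuracy. This is expected rather than a coincidence: with `average='weighted'`, each class's recall is weighted by its share of the true labels, and the weighted sum collapses to the overall fraction of correct predictions. The sketch below checks this on made-up quality labels. Both helper functions are hypothetical and written in plain Python for illustration.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        score += (n_cls / total) * (hits / n_cls)
    return score

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels, not taken from the wine data.
y_true = [5, 5, 5, 6, 6, 7, 3, 5, 6, 5]
y_pred = [5, 6, 5, 6, 5, 7, 5, 5, 6, 6]
print(weighted_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

The weighted recall column therefore adds no information beyond accuracy. To see whether SMOTE or ADASYN helped the rare quality classes, a per-class view such as `recall_score(..., average=None)` or `classification_report` would be more informative.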
