The deadline for this homework is **25.10.2023 08:59** (right before the practice session). After completing the exercises, you should:

1. Download this file to your computer (`File` $\to$ `Download .ipynb`)

2. Name the file in the following way: *HWx_NameSurname* (for example, `HW3_NshanPotikyan.ipynb`)

3. Send the file to this email address `nshan.potikyan@gmail.com` with the subject **ML3**

**Note**

* If you do not follow all of the above instructions, your homework will not be graded.

* You do not need to send any dataset files or helper scripts that I provide with your homework (since I already have them).

* You need to write the code for the exercises yourself; you can use ``built-in functions``, ``numpy``, ``pandas``, ``sklearn``
and ``matplotlib``.

**Problem.** During the practice session, we tried to build a binary classifier on the Adult dataset, which is highly imbalanced.

* In this homework, you need to take the same dataset, but this time you need to

 * use more features from the original data
 * try different sampling techniques from [imblearn](https://imbalanced-learn.org/stable/references/index.html) to tackle the class imbalance problem
 * experiment with different ensemble methods (the ones that we have discussed so far) to beat the score we got during the practice session
 * split the training dataset into train/val/test parts, so that you can evaluate which approach/algorithm performs better (use `random_state=0` and train=70%, val=10%, test=20% splits).


* Evaluate the model performance in terms of the accuracy score.

* Use the best data processing method to train a final model on the train+val dataset and report the accuracy score on the test dataset.
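
To make the idea behind the sampling techniques concrete, here is a minimal pure-Python sketch of random oversampling. It is an illustration only and not how imblearn implements it: the helper `random_oversample` and the toy rows are hypothetical. The key point is that only the training part is rebalanced, by duplicating randomly chosen minority-class rows until the classes have equal counts.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # add (target - n) random copies; nothing is added for the majority class
        for _ in range(target - n):
            out_rows.append(list(rng.choice(pool)))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy training split with made-up values: 6 majority rows, 2 minority rows.
X_toy = [[25], [31], [40], [22], [37], [29], [52], [48]]
y_toy = ["<=50K"] * 6 + [">50K"] * 2

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal))
```

The validation and test parts should keep their original class ratio, so that the reported accuracy reflects the real distribution.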

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Adult dataset

In [2]:
!wget https://archive.ics.uci.edu/static/public/2/adult.zip
!unzip adult.zip

--2023-10-24 23:14:20--  https://archive.ics.uci.edu/static/public/2/adult.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘adult.zip’

adult.zip               [  <=>               ] 605.70K  2.02MB/s    in 0.3s    

2023-10-24 23:14:20 (2.02 MB/s) - ‘adult.zip’ saved [620237]

Archive:  adult.zip
  inflating: Index                   
  inflating: adult.data              
  inflating: adult.names             
  inflating: adult.test              
  inflating: old.adult.names         


In [3]:
data = pd.read_csv('adult.data', header=None, na_values=[" ?", ""])
data.columns = ["age", "workclass", "fnlwgt", "education", "education_num",
                "marital_status", "occupation", "relationship", "race", "sex",
                "capital_gain", "capital_loss", "hours_per_week", "country",
                "income"]

In [4]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Data Processing

In [5]:
# remove the rows with missing values
data.dropna(inplace=True)

In [6]:
categorical_features = ["workclass", "education", "marital_status",
                        "occupation", "relationship", "race", "sex", "country"]

In [7]:
X = data.drop("income", axis=1)
y = data["income"]

In [8]:
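# hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# 0.125 of the remaining 80% is 10% of the full data, which leaves 70% for training
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.125, random_state=0)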
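
The `test_size=0.125` in the second split may look surprising next to the required 10% validation share. The short sketch below works through the arithmetic with exact fractions. The variable names are illustrative only and not part of the notebook.

```python
from fractions import Fraction

# Required shares of the full dataset.
train, val, test = Fraction(70, 100), Fraction(10, 100), Fraction(20, 100)

# First split: the test share is taken from the full dataset.
first_test_size = test
remaining = 1 - first_test_size  # share left for train + val

# Second split: the validation share must be expressed relative to what remains.
second_test_size = val / remaining

print(float(first_test_size), float(second_test_size))
```

The same reasoning applies to any other three-way split: divide the desired validation share by the share that is left after the first split.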

In [9]:
print(y_train.value_counts())
print(y_test.value_counts())

 <=50K    15876
 >50K      5236
Name: income, dtype: int64
 <=50K    4532
 >50K     1501
Name: income, dtype: int64
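
Since the models are compared by accuracy, it helps to know what a trivial classifier that always predicts the majority class would score. The sketch below computes that baseline from the class counts printed above. The helper `majority_baseline` is a hypothetical name introduced here for illustration.

```python
# Class counts copied from the output above.
train_counts = {"<=50K": 15876, ">50K": 5236}
test_counts = {"<=50K": 4532, ">50K": 1501}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


for name, counts in [("train", train_counts), ("test", test_counts)]:
    print(f"{name}: majority-class accuracy = {majority_baseline(counts):.4f}")
```

Any model worth keeping should score clearly above this baseline on the validation and test sets.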


In [10]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country
16538,22,Private,133833,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,40,United-States
23023,24,Private,117779,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,10,Hungary
6586,33,Self-emp-inc,40444,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States
10806,46,Private,331776,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States
2089,37,Private,123211,HS-grad,9,Married-civ-spouse,Other-service,Husband,White,Male,0,0,44,United-States


In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical columns and pass the numeric ones through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_features)],
    remainder='passthrough')

# fit the encoder on the training part only, then reuse it for val and test
X_train = preprocessor.fit_transform(X_train)
X_val = preprocessor.transform(X_val)
X_test = preprocessor.transform(X_test)

In [12]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Initialize the ensemble models


In [13]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler


samplers = [SMOTE(random_state=0),
            RandomOverSampler(random_state=0),
            RandomUnderSampler(random_state=0)]
names = ["SMOTE", "Oversampling", "Undersampling"]

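models = [
    # the base tree is passed positionally because the keyword was renamed
    # from `base_estimator` to `estimator` in scikit-learn 1.2
    BaggingClassifier(DecisionTreeClassifier(),
                      n_estimators=10, random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(n_estimators=100, random_state=0)
]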

best_score = 0
best_name = ""
best_model_name = ""

for name, sampler in zip(names, samplers):
    # resample only the training part; the validation set keeps its original class ratio
    X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

    for model in models:
        model_name = model.__class__.__name__
        model.fit(X_resampled, y_resampled)
        preds = model.predict(X_val)
        score = accuracy_score(y_val, preds)

        # keep track of the best sampler/model pair on the validation set
        if score > best_score:
            best_score = score
            best_name = name
            best_model_name = model_name

        print(f"Model: {model_name} with {name} has accuracy: {score:.4f}")

print(f"Best model is {best_model_name} with {best_name} having accuracy: {best_score:.4f}")



Model: BaggingClassifier with SMOTE has accuracy: 0.8313
Model: RandomForestClassifier with SMOTE has accuracy: 0.8426
Model: GradientBoostingClassifier with SMOTE has accuracy: 0.8585
Model: BaggingClassifier with Oversampling has accuracy: 0.8429
Model: RandomForestClassifier with Oversampling has accuracy: 0.8416
Model: GradientBoostingClassifier with Oversampling has accuracy: 0.8270
Model: BaggingClassifier with Undersampling has accuracy: 0.8114
Model: RandomForestClassifier with Undersampling has accuracy: 0.8167
Model: GradientBoostingClassifier with Undersampling has accuracy: 0.8240
Best model is GradientBoostingClassifier with SMOTE having accuracy: 0.8585
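
The printed scores are easier to compare when grouped by sampler. The sketch below copies the validation accuracies from the output above into a dictionary and picks the best model for each sampling technique. The names `val_scores` and `best_per_sampler` are introduced here for illustration.

```python
# Validation accuracies copied from the output above.
val_scores = {
    ("SMOTE", "BaggingClassifier"): 0.8313,
    ("SMOTE", "RandomForestClassifier"): 0.8426,
    ("SMOTE", "GradientBoostingClassifier"): 0.8585,
    ("Oversampling", "BaggingClassifier"): 0.8429,
    ("Oversampling", "RandomForestClassifier"): 0.8416,
    ("Oversampling", "GradientBoostingClassifier"): 0.8270,
    ("Undersampling", "BaggingClassifier"): 0.8114,
    ("Undersampling", "RandomForestClassifier"): 0.8167,
    ("Undersampling", "GradientBoostingClassifier"): 0.8240,
}

# Best (model, score) pair for each sampler.
best_per_sampler = {}
for (sampler, model), score in val_scores.items():
    if sampler not in best_per_sampler or score > best_per_sampler[sampler][1]:
        best_per_sampler[sampler] = (model, score)

for sampler, (model, score) in best_per_sampler.items():
    print(f"{sampler}: {model} ({score:.4f})")

# Overall winner across all combinations.
overall_best = max(val_scores, key=val_scores.get)
print("Overall best:", overall_best, val_scores[overall_best])
```

Grouping the scores this way shows that the best model differs between samplers, so the sampler and the model should be chosen jointly and not one at a time.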


In [14]:
# encode train+val with the already fitted preprocessor and resample it with the best sampler
X_temp = preprocessor.transform(X_temp)
X_full_train, y_full_train = samplers[names.index(best_name)].fit_resample(X_temp, y_temp)

# create a fresh, unfitted copy of the best model type
final_model = None
if best_model_name == "BaggingClassifier":
    # base tree passed positionally (keyword renamed to `estimator` in scikit-learn 1.2)
    final_model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
elif best_model_name == "RandomForestClassifier":
    final_model = RandomForestClassifier(n_estimators=100, random_state=0)
elif best_model_name == "GradientBoostingClassifier":
    final_model = GradientBoostingClassifier(n_estimators=100, random_state=0)

final_model.fit(X_full_train, y_full_train)
final_preds = final_model.predict(X_test)
final_score = accuracy_score(y_test, final_preds)
print(f"Final Accuracy on Test set: {final_score:.4f}")

Final Accuracy on Test set: 0.8551
