In [2]:
!wget https://github.com/adrianstando/imbalanced-benchmarking-set/raw/main/datasets/wine_quality.csv

--2023-10-12 19:44:55--  https://github.com/adrianstando/imbalanced-benchmarking-set/raw/main/datasets/wine_quality.csv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/wine_quality.csv [following]
--2023-10-12 19:44:55--  https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/wine_quality.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 315368 (308K) [text/plain]
Saving to: ‘wine_quality.csv.2’


2023-10-12 19:44:55 (6.21 MB/s) - ‘wine_quality.csv.2’ saved [315368/315368]



In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [4]:
dataset = pd.read_csv('wine_quality.csv', index_col=0)

X = dataset.drop('TARGET', axis=1)
y = dataset['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to tensors. TARGET is encoded as -1/1, so y / 2 + 0.5 maps it to 0/1 as BCELoss expects.
X_train = torch.tensor(X_train.values, dtype=torch.float32)
X_test = torch.tensor(X_test.values, dtype=torch.float32)
y_train = torch.tensor(y_train.values, dtype=torch.float32) / 2 + 0.5
y_test = torch.tensor(y_test.values, dtype=torch.float32) / 2 + 0.5

In [5]:
EPOCHS = 1000

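# Despite its name, this is a small feed-forward network (MLP) used as a binary classifier.
class LinearRegressionModel(nn.Module):
  def __init__(self):
    super(LinearRegressionModel, self).__init__()
    # 11 input features -> 32 -> 64 -> 32 -> 1 output
    self.fc1 = nn.Linear(11, 32)
    self.fc2 = nn.Linear(32, 64)
    self.fc3 = nn.Linear(64, 32)
    self.fc4 = nn.Linear(32, 1)

  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = F.relu(self.fc3(x))
    # Sigmoid gives the probability of class 1; torch.sigmoid replaces the deprecated F.sigmoid
    x = torch.sigmoid(self.fc4(x))
    return x.flatten()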

model = LinearRegressionModel()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

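# Full-batch training: every epoch runs one optimizer step on the whole training set
for epoch in range(EPOCHS):
    y_pred = model(X_train)

    loss = criterion(y_pred, y_train)

    # Clear old gradients, backpropagate, then update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()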

linearRegressionAccuracy = 0.0
linearRegressionReport = ""
with torch.no_grad():
  y_eval = model(X_test)
  y_eval = y_eval > 0.5

  linearRegressionAccuracy = accuracy_score(y_test, y_eval)
  print(f'Accuracy: {linearRegressionAccuracy:%}')
  linearRegressionReport = classification_report(y_test, y_eval, digits=4)

Accuracy: 96.428571%


In [6]:
randomForest = RandomForestClassifier(n_estimators=100, random_state=42)
randomForest.fit(X_train, y_train)

y_eval = randomForest.predict(X_test)

randomForestAccuracy = accuracy_score(y_test, y_eval)
print(f'Accuracy: {randomForestAccuracy:%}')
randomForestReport = classification_report(y_test, y_eval, digits=4)

Accuracy: 97.448980%


In [7]:
gradientBoosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gradientBoosting.fit(X_train, y_train)

y_eval = gradientBoosting.predict(X_test)

gradientBoostingAccuracy = accuracy_score(y_test, y_eval)
print(f'Accuracy: {gradientBoostingAccuracy:%}')
gradientBoostingReport = classification_report(y_test, y_eval, digits=4)

Accuracy: 97.142857%


In [8]:
!pip install tabpfn



In [9]:
from tabpfn import TabPFNClassifier

classifier = TabPFNClassifier(device='cpu', N_ensemble_configurations=32)

classifier.fit(X_train[:1000], y_train[:1000], overwrite_warning=True)
y_eval = classifier.predict(X_test)

tabPFNAccuracy = accuracy_score(y_test, y_eval)
print(f'Accuracy: {tabPFNAccuracy:%}')
tabPFNReport = classification_report(y_test, y_eval, digits=4)

Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Accuracy: 96.938776%


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [12]:
print("Linear Regression:")
print(linearRegressionReport)
print("Random Forest:")
print(randomForestReport)
print("Gradient Boosting:")
print(gradientBoostingReport)
print("TabPFN:")
print(tabPFNReport)

Linear Regression:
              precision    recall  f1-score   support

         0.0     0.9721    0.9916    0.9818       950
         1.0     0.2727    0.1000    0.1463        30

    accuracy                         0.9643       980
   macro avg     0.6224    0.5458    0.5641       980
weighted avg     0.9507    0.9643    0.9562       980

Random Forest:
              precision    recall  f1-score   support

         0.0     0.9763    0.9979    0.9870       950
         1.0     0.7778    0.2333    0.3590        30

    accuracy                         0.9745       980
   macro avg     0.8770    0.6156    0.6730       980
weighted avg     0.9702    0.9745    0.9678       980

Gradient Boosting:
              precision    recall  f1-score   support

         0.0     0.9772    0.9937    0.9854       950
         1.0     0.5714    0.2667    0.3636        30

    accuracy                         0.9714       980
   macro avg     0.7743    0.6302    0.6745       980
weighted avg     0.96

The dataset I chose for this task was wine_quality, which has an imbalance ratio of 25.77 and 11 feature columns.
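As a rough illustration of what the imbalance ratio measures, the sketch below assumes it is defined as the majority-class count divided by the minority-class count. The helper `imbalance_ratio` and the toy labels are hypothetical and are not taken from the benchmark set.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels only: nine negatives and three positives
toy_labels = [-1] * 9 + [1] * 3
toy_ratio = imbalance_ratio(toy_labels)
print(toy_ratio)
```

Under the same assumption, calling `imbalance_ratio(dataset['TARGET'])` in the notebook should give the ratio for wine_quality.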

Of the models I tested, Random Forest and Gradient Boosting perform best: they have the highest overall accuracy (97.45% and 97.14%) and the best results on the samples whose target is "1". Even so, their F1-score on that class is only about 0.36, with recall of 0.23 and 0.27, so neither is good at finding the minority class. Of the two, Random Forest is marginally ahead on overall accuracy and on class "1" precision (0.78 vs. 0.57), while Gradient Boosting has slightly higher class "1" recall and F1-score.
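To make the gap between overall accuracy and minority-class performance concrete, here is a small worked example. The confusion counts are an assumption: they are reconstructed from the Random Forest report above (support 950 and 30, class "1" recall 0.2333 and precision 0.7778), not read from the model directly.

```python
# Confusion counts for class "1", reconstructed from the Random Forest report
# (an assumption: 7 of 30 positives found, 2 false alarms among 950 negatives)
tp, fp, fn, tn = 7, 2, 23, 948

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)  # recall of class "0"
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2  # mean of the per-class recalls

for name, value in [
    ("accuracy", accuracy),
    ("precision (1)", precision),
    ("recall (1)", recall),
    ("f1 (1)", f1),
    ("balanced accuracy", balanced_accuracy),
]:
    print(f"{name:<18} {value:.4f}")
```

With these counts the accuracy is high because the 948 correct negatives dominate the total, while balanced accuracy, which weights both classes equally, matches the much lower macro-average recall in the report.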

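The model labelled "Linear Regression" (in fact a small feed-forward neural network with a sigmoid output) did very poorly on the "1" targets, with recall of 0.10 and an F1-score of 0.15, but otherwise performed decently, reaching 96.43% overall accuracy.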

TabPFN always predicted "0": its 96.94% accuracy is exactly the share of "0" samples in the test set (950 of 980). This is probably due to the very high imbalance ratio of the dataset and to the fact that training had to be limited to the first 1000 entries because of Colab performance constraints.
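One option that was not tried in this notebook is to draw the 1000 training rows as a stratified sample instead of taking the first 1000, so that the subsample keeps the class proportions of the full training set. The sketch below uses synthetic data as a stand-in (the sizes and the 4% positive rate are made up), and whether this would change TabPFN's predictions here is untested.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training split: 11 features, 4% positives (made-up numbers)
rng = np.random.default_rng(0)
n_samples, n_positive = 4000, 160
X = rng.normal(size=(n_samples, 11))
y = np.zeros(n_samples, dtype=int)
y[:n_positive] = 1
rng.shuffle(y)

# Keep 1000 rows while preserving the class proportions of the full set
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=1000, stratify=y, random_state=42
)

print("positives in the first 1000 rows:  ", int(y[:1000].sum()))
print("positives in the stratified sample:", int(y_sub.sum()))
```

In the notebook, the same call on `X_train` and `y_train` would produce the arrays to pass to `classifier.fit` in place of the `[:1000]` slices.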