# Binary classification using yeast

We'll be using the yeast dataset that is made publicly available by [openml.org](https://www.openml.org). You can read all about this set [here](https://www.openml.org/search?type=data&sort=runs&id=40597&status=active).

## Data import and exploration

Let's explore it ourselves. First, we import the data. We could load it with the openml-python package, but sklearn provides an easier way through "fetch_openml". We'll load it that way and check the size of what we have imported.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml
import pandas as pd

X, Y = fetch_openml("yeast", version=4, return_X_y=True)

print(X.shape) 
print(Y.shape) 

We have 2417 rows. X, the data, has 103 attributes, and Y, which holds what we should be predicting, has 14 classes. On the [description page](https://www.openml.org/search?type=data&sort=runs&id=40597&status=active) of this dataset we read that only 13 of those are actually used because of label sparsity (very few examples are available).

Nice bonus: X and Y are already pandas DataFrames, without any extra effort on our part. Good!

Look at Y to see what we are predicting.

In [None]:
# Up to you!



Check the datatypes of the attributes (in X). Only show the different types.
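If you get stuck, here is a minimal sketch on a small made-up DataFrame. The name "toy_X" and its columns are illustrative stand-ins, not the yeast data, and the sketch assumes that "only the different types" means listing each dtype once.

```python
import pandas as pd

# Hypothetical stand-in for X: two float columns and one integer column.
toy_X = pd.DataFrame({"Att1": [0.1, 0.2], "Att2": [0.3, 0.4], "Att3": [1, 2]})

# dtypes has one entry per column; value_counts() groups them per distinct type.
type_counts = toy_X.dtypes.value_counts()
print(type_counts)

# If only the distinct types are needed, unique() is enough.
distinct_types = toy_X.dtypes.unique()
print(distinct_types)
```

The same two calls should work unchanged on the real X.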

In [None]:
# Up to you!



OK, so this tells us we are working strictly with numbers. Let's look at some plots of those numbers. Boxplots, for example!

In [None]:
# Up to you!



It all seems pretty normal. There are some outliers, but we'll leave it at that. Every row can carry multiple labels (in Y). Show for every column in Y how many rows have that label. Unfortunately the labels are not stored as booleans but as string values.
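One possible approach, sketched on a made-up frame: compare against the string and sum the resulting booleans. "toy_Y" is a hypothetical stand-in for Y, and the sketch assumes the positive label is stored as the string "TRUE".

```python
import pandas as pd

# Hypothetical stand-in for Y: label columns holding the strings "TRUE" / "FALSE".
toy_Y = pd.DataFrame({
    "Class1": ["TRUE", "FALSE", "FALSE", "TRUE"],
    "Class2": ["TRUE", "TRUE", "FALSE", "TRUE"],
})

# Comparing against the string gives booleans; summing booleans counts the True values.
label_counts = (toy_Y == "TRUE").sum()
print(label_counts)
```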

In [None]:
# Up to you!



With about 2400 rows in total, Class2 comes pretty close to an even split between rows with and without the label. Let's predict that one!

## Binary classifier

Let's build a binary classifier for Class2. We'll be using a random forest classifier. (Or you can use another model if you'd like...)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

# X stays as it is; the target is the single label column "Class2".
y = Y["Class2"]

Now split the dataset into a training set and a test set. Use a random state of [42](https://en.wikipedia.org/wiki/42_(number)). Use 20% of the data as the test set.
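As a hint, this is roughly what such a split looks like on made-up data. The arrays "toy_X" and "toy_y" are assumptions for illustration; in the notebook you would pass X and y instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data standing in for X and y: 50 rows, 3 features, binary target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 3))
toy_y = rng.integers(0, 2, size=50)

# test_size=0.2 keeps 20% for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```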

In [None]:
# Up to you!



Next, create the model and train it on the training data. Use 100 estimators and the same random state.
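A minimal sketch of this step, run on synthetic data from "make_classification" rather than on the yeast set (the data and the "toy_" names are assumptions, so the scores will not match the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the yeast features and the Class2 target.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# 100 trees, with a fixed seed so repeated runs build the same forest.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(len(rf.estimators_))
```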

In [None]:
# Up to you!



Evaluate the model by letting it predict on the test set. Also create a confusion matrix and calculate all the scores.
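On synthetic stand-in data (an assumption, not the yeast set), the evaluation could look roughly like this. In sklearn's confusion matrix the rows are the true classes and the columns the predicted ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a forest trained on it.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Predict labels for the held-out rows and compare them with the truth.
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, F1 and support per class, plus overall accuracy.
print(classification_report(y_test, y_pred))
```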

In [None]:
# Up to you!



Precision and recall are about .67 and .83, which isn't great. Note that we cannot draw an ROC curve yet, because we only predicted labels. Luckily, a random forest can also return probabilities through "rf.predict_proba(X_test)[:, 1]". Try it and display the first 10 probabilities.
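A small sketch on synthetic stand-in data (an assumption) shows what comes back. The columns of "predict_proba" follow the order of "rf.classes_", so index 1 is the second class listed there.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a forest trained on it.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# One probability per test row: the share of the forest's vote for class rf.classes_[1].
proba = rf.predict_proba(X_test)[:, 1]
print(rf.classes_)
print(proba[:10])
```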

In [None]:
# Up to you!



Now draw the ROC curve.
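If you need a starting point, here is a sketch on synthetic stand-in data (an assumption). With string labels such as "TRUE" and "FALSE", I believe "roc_curve" additionally needs "pos_label" set to the positive string; the numeric toy labels here do not need it.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a forest trained on it.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]

# False and true positive rates for every probability threshold.
fpr, tpr, _ = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

plt.plot(fpr, tpr, label=f"Random forest (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "k--")  # diagonal = random guessing
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()
```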

In [None]:
# Up to you!



Now you can see how bad the model is.

## Bigger forest

Using the same train/test split, retrain your model with 500 estimators. Draw the ROC-curves again.
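To compare both forest sizes side by side, a loop like the one below may help. It runs on synthetic stand-in data (an assumption) and also times the training; the names "results" and "forest" are illustrative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the split is made once and reused for both forests.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

results = {}
for n_trees in (100, 500):
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    forest.fit(X_train, y_train)
    elapsed = time.perf_counter() - start

    # AUC on the same test set, so the two numbers are comparable.
    auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
    results[n_trees] = (auc, elapsed)
    print(f"{n_trees} trees: AUC = {auc:.3f}, fit time = {elapsed:.2f}s")
```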


In [None]:
# Up to you!



The curve may look smoother, but the model is not better. You did notice that training took longer, right? Try another model next!

## XGBoost

Now use XGBoost. Use the same train/test-split.

One problem though: XGBoost expects numeric class labels (1 and 0), not the strings "TRUE" and "FALSE". Create "y_train_new" and "y_test_new" based on the existing ones, but with numeric values.
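One way to do the conversion, sketched on two tiny made-up Series (the contents of "y_train" and "y_test" below are illustrative, and the sketch assumes the positive label is the string "TRUE"):

```python
import pandas as pd

# Hypothetical stand-ins for y_train and y_test with string labels.
y_train = pd.Series(["TRUE", "FALSE", "TRUE"])
y_test = pd.Series(["FALSE", "TRUE"])

# The comparison gives booleans; astype(int) turns True/False into 1/0.
y_train_new = (y_train == "TRUE").astype(int)
y_test_new = (y_test == "TRUE").astype(int)
print(y_train_new.tolist(), y_test_new.tolist())
```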


In [None]:
# Up to you!



Now you're ready to train XGBoost. (You have to install it first.)

In [None]:
# !pip install xgboost

In [None]:
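import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,  # only relevant on older XGBoost releases; newer ones may warn that it is unused
    random_state=42
)

# Train on the numeric labels created above.
model.fit(X_train, y_train_new)

# Hard class predictions and the probability of the positive class.
y_3_pred = model.predict(X_test)
y_3_proba = model.predict_proba(X_test)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_new, y_3_pred))

# Uncomment for the full report and the AUC as a number.
# print("Classification Report:\n", classification_report(y_test_new, y_3_pred))
# print("AUC Score:", roc_auc_score(y_test_new, y_3_proba))

# ROC curve with the AUC in the legend; the dashed diagonal is random guessing.
fpr, tpr, _ = roc_curve(y_test_new, y_3_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'XGBoost (AUC = {roc_auc_score(y_test_new, y_3_proba):.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()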


This model is even worse than the random forest! That is because hyperparameter tuning is essential when using XGBoost. We're also treating all 103 input features equally, which isn't a good idea. Check out "rf.feature_importances_" and "model.get_booster().get_score()" to see which features are the more interesting ones.

Start with "rf.feature_importances_".

In [None]:
# Up to you!



You'll see a list of numbers, one for every column. The numbers add up to 1, and bigger numbers mean the feature is more important (roughly, it contributes more to the splits in the decision trees, but that's diving a bit deep before we have looked into how a random forest works).
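A bare array is hard to read, so it can help to attach the column names and sort. The sketch below does that on synthetic stand-in data; the feature names "Att1" to "Att10" are made up for illustration, and in the notebook you would use "X.columns" as the index.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
toy_X, toy_y = make_classification(n_samples=300, n_features=10, random_state=42)
feature_names = [f"Att{i}" for i in range(1, 11)]
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(toy_X, toy_y)

# Pair every importance with its feature name, then show the largest ones first.
importances = pd.Series(rf.feature_importances_, index=feature_names)
top = importances.sort_values(ascending=False).head(5)
print(top)

# The importances are normalised, so they sum to 1 (up to rounding).
print(importances.sum())
```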

What does "model.get_booster().get_score()" tell you?

You can pass an argument between the last pair of parentheses to choose the importance_type. The options are:

* 'weight' (default): number of times a feature is used to split across all trees.
* 'gain': average gain (improvement in the training objective) brought by the feature when it is used in a split.
* 'cover': average number of samples affected by the splits using that feature.
* 'total_gain': total gain (sum across all splits).
* 'total_cover': total cover (sum across all splits).

In [None]:
# Up to you!



## Summary

We'll stop here, as we've already ventured too far. Why? Because we're tuning a model using only a two-way split. That is not sound practice: we should have used a three-way split (more on that in the next chapter).
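As a preview only: a three-way split can be sketched by calling "train_test_split" twice. The 60/20/20 proportions, the toy data and the variable names are my assumptions, and the next chapter may well do it differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 4 features, binary target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 4))
toy_y = rng.integers(0, 2, size=100)

# First set aside the test set (20% of all rows).
X_rest, X_test, y_rest, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Then split the remainder: 0.25 of the remaining 80% is 20% of the full data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

The validation set is then used while tuning, and the test set is kept aside for the final check.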

But what have we learned? We've used a couple of models on a large dataset and noticed that results aren't always great from the start. We've also used the predictions of our models to calculate all the scores we went over in the PowerPoint.

We haven't started on data augmentation or parameter tuning yet. But we'll get there!