# PREPROCESSING (Part2)

# Imbalanced data

Imbalanced-Learn is a powerful Python package for dealing with imbalanced datasets, developed in part by Scikit-Learn developers.

It is developed through GitHub (see https://github.com/scikit-learn-contrib/imbalanced-learn), and there is also an official website (see http://imbalanced-learn.org/en/stable/) where you can find all the info you might need.

I strongly recommend reading the user guide (see http://imbalanced-learn.org/en/stable/user_guide.html), with the general examples as a complement to it (see http://imbalanced-learn.org/en/stable/auto_examples/index.html).

The package is not available through Anaconda Navigator, but you can install it from the prompt by entering

conda install -c conda-forge imbalanced-learn

## Undersampling

We will try the NearMiss undersampling technique on the Iris dataset. Since Iris is perfectly balanced, we will first imbalance it artificially.

In [None]:
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

from imblearn.datasets import make_imbalance
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced

RANDOM_STATE = 0

# Load dataset and create an artificial imbalance
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target,
                      sampling_strategy={0: 25, 1: 50, 2: 50},
                      random_state=RANDOM_STATE)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

print('Training statistics: {}'.format(Counter(y_train)))
print('Testing statistics: {}'.format(Counter(y_test)))

# Creation of a pipeline, i.e. concatenation of steps in a composed process (see documentation for further details)
pipeline = make_pipeline(NearMiss(version=2),
                         LinearSVC(random_state=RANDOM_STATE, max_iter=10000))
pipeline.fit(X_train, y_train)

# Classification and results presentation
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))
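
To get an intuition of what NearMiss does, here is a minimal pure-Python sketch of its selection rule. It is a simplification for illustration, not the imbalanced-learn implementation: the function `near_miss` and the toy points are made up, and only the ideas behind versions 1 and 2 are covered (keep the majority samples that are closest, on average, to their k nearest or k farthest minority samples, respectively).

```python
import math


def near_miss(majority, minority, n_keep, k=1, version=1):
    """Return the n_keep majority points chosen by a simplified NearMiss rule.

    version=1: rank by mean distance to the k nearest minority points.
    version=2: rank by mean distance to the k farthest minority points.
    """
    def score(point):
        distances = sorted(math.dist(point, m) for m in minority)
        chosen = distances[:k] if version == 1 else distances[-k:]
        return sum(chosen) / len(chosen)

    # Majority points with the smallest score are kept, the rest are discarded.
    return sorted(majority, key=score)[:n_keep]


# Made-up toy data on a line: two minority points on the left, one on the right.
minority = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
majority = [(0.5, 0.0), (4.0, 0.0), (6.0, 0.0), (9.5, 0.0)]

kept_v1 = near_miss(majority, minority, n_keep=2, k=1, version=1)
kept_v2 = near_miss(majority, minority, n_keep=2, k=1, version=2)
print("Version 1 keeps:", kept_v1)
print("Version 2 keeps:", kept_v2)
```

In this toy case version 1 keeps the majority points sitting right next to a minority point, while version 2 keeps the central ones, which are the closest to even the farthest minority points.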

## Oversampling

We now try the SMOTE oversampling technique on a dataset about thyroid disease. It has 3772 samples and 52 independent variables, and it is imbalanced at a ratio of about 15 to 1.
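
Before using the library, it may help to see the core idea of SMOTE in a few lines: a synthetic minority sample is placed at a random position on the segment joining a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration under that assumption, not the imbalanced-learn code; `smote_sketch` and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        # k nearest minority neighbours of the base point
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sketch(minority, n_new=5)
print(synthetic)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region already occupied by the minority class.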

In [None]:
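from collections import Counter

import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

RANDOM_STATE = 0

# Load dataset and cast every column to float
# (astype returns a new object, so the result has to be assigned)
tiroides = pd.read_csv('Thyroids.csv')
tiroides = tiroides.astype(float)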

# Separate inputs and target
X = tiroides.values[:,:-1]
y = tiroides.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

print('Training statistics: {}'.format(Counter(y_train)))
print('Testing statistics: {}'.format(Counter(y_test)))

# Pipeline creation
pipeline = make_pipeline(SMOTE(random_state=RANDOM_STATE),
                         RandomForestClassifier(n_estimators=1000, random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train)

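y_pred = pipeline.predict(X_test)
# Predicted probability of the class with the larger label (assumed here to be the positive class)
y_score = pipeline.predict_proba(X_test)[:, 1]

# Classification and results presentation
print(classification_report_imbalanced(y_test, y_pred))
# The AUC is a ranking metric, so it is computed from probabilities rather than from hard labels
print('AUC = ' + str(roc_auc_score(y_test, y_score)))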
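
The AUC is a ranking metric, so its value depends on whether it receives hard class labels or continuous scores. The following pure-Python sketch (not scikit-learn's implementation; the function `roc_auc` and the data are made up) computes the AUC as the fraction of positive/negative pairs that are ranked correctly, counting ties as one half.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: four healthy (0) and two sick (1) patients.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.45, 0.4, 0.9]     # hypothetical predicted probabilities
labels = [int(p >= 0.5) for p in proba]     # hard predictions at threshold 0.5

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)
print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

With hard labels all the ranking information inside each predicted class is lost, and for a binary problem the value reduces to the average of sensitivity and specificity.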

#### Exercise 4:

(i) Try a different NearMiss version from the one in the example, on the thyroid dataset and with a random forest classifier. Does it get better if we increase the number of trees in the forest (n_estimators) to 100? And from 100 to 1000?

(ii) Plan a mixed strategy for the thyroid dataset and check its performance with random forests. Play with the n_estimators parameter to increase the average f1 score. Is the order of the mixed sampling strategies relevant?

(iii) Combine PCA with the mixed strategy. Quantify the percentage of data compression and get the number of PCs used when capturing 95% and 99% of the total cumulative variance. Compare the performance with the one in (ii). If there are big differences, what could be the reason?
Try to specify the number of PCs, instead of the percentage of variance, by choosing n_components to be an integer, e.g. n_components=15. Play with the number to try to get a good solution with the maximum possible compression. Do you want to change/qualify your answer to the previous question?

(iv) Compare the results of all strategies with the case of not correcting the imbalance. Is it beneficial to correct the lack of balance?

(v) Use the ADASYN oversampling technique combined with an undersampling technique different from NearMiss. Explain the reason for your choice. See the imbalanced-learn documentation to find out which functions to use and how to use them.

#### Solution:

In [None]:
# Your solution here

# In general, and for your future revisions of the material, it is better to provide complete code here.
# Define the imports and functions in this cell, so that it can be executed on its own.

# Exercises solutions

#### Exercise 4:

(i) Try a different NearMiss version from the one in the example, on the thyroid dataset and with a random forest classifier. Does it get better if we increase the number of trees in the forest (n_estimators) to 100? And from 100 to 1000?

(ii) Plan a mixed strategy for the thyroid dataset and check its performance with random forests. Play with the n_estimators parameter to increase the average f1 score. Is the order of the mixed sampling strategies relevant?

(iii) Combine PCA with the mixed strategy. Quantify the percentage of data compression and get the number of PCs used when capturing 95% and 99% of the total cumulative variance. Compare the performance with the one in (ii). If there are big differences, what could be the reason?
Try to specify the number of PCs, instead of the percentage of variance, by choosing n_components to be an integer, e.g. n_components=15. Play with the number to try to get a good solution with the maximum possible compression. Do you want to change/qualify your answer to the previous question?

(iv) Compare the results of all strategies with the case of not correcting the imbalance. Is it beneficial to correct the lack of balance?

(v) Use the ADASYN oversampling technique combined with an undersampling technique different from NearMiss. Explain the reason for your choice. See the imbalanced-learn documentation to find out which functions to use and how to use them.

#### Solution:

In [None]:
# (i)
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomizing seed
RANDOM_STATE = 0

# Load dataset
tiroides = pd.read_csv('Thyroids.csv')
tiroides = tiroides.astype(float)  # astype returns a new object, so assign the result

# Separate inputs and target
X = tiroides.values[:,:-1]
y = tiroides.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

# Pipelines creation (NearMiss is deterministic and takes no random_state in current imbalanced-learn releases)
pipeline10 = make_pipeline(NearMiss(version=3),
                           RandomForestClassifier(n_estimators=10, random_state=RANDOM_STATE))
pipeline10.fit(X_train, y_train)

pipeline100 = make_pipeline(NearMiss(version=3),
                            RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE))
pipeline100.fit(X_train, y_train)

pipeline1000 = make_pipeline(NearMiss(version=3),
                             RandomForestClassifier(n_estimators=1000, random_state=RANDOM_STATE))
pipeline1000.fit(X_train, y_train)

# Classification and results presentation
print("Results for n_estimators = 10")
print(classification_report_imbalanced(y_test, pipeline10.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))

print("Results for n_estimators = 100")
print(classification_report_imbalanced(y_test, pipeline100.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))

print("Results for n_estimators = 1000")
print(classification_report_imbalanced(y_test, pipeline1000.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))

Increasing the number of trees to 100 improves the results considerably. However, the improvement from 100 to 1000 is small, while the computation time is about 10 times longer.

In [None]:
# (ii)
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd


# Randomizing seed
RANDOM_STATE = 0

# Load dataset (same file as in the previous cells)
tiroides = pd.read_csv('Thyroids.csv')
tiroides = tiroides.astype(float)  # astype returns a new object, so assign the result

# Separate inputs and target
X = tiroides.values[:,:-1]
y = tiroides.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

# Pipelines creation (NearMiss is deterministic and takes no random_state in current imbalanced-learn releases)
pipeline_under_over_10 = make_pipeline(NearMiss(version=2),
                                       SMOTE(random_state=RANDOM_STATE),
                                       RandomForestClassifier(n_estimators=10, random_state=RANDOM_STATE))
pipeline_under_over_10.fit(X_train, y_train)

pipeline_under_over_100 = make_pipeline(NearMiss(version=2),
                                        SMOTE(random_state=RANDOM_STATE),
                                        RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE))
pipeline_under_over_100.fit(X_train, y_train)

pipeline_over_under_10 = make_pipeline(SMOTE(random_state=RANDOM_STATE),
                                       NearMiss(version=2),
                                       RandomForestClassifier(n_estimators=10, random_state=RANDOM_STATE))
pipeline_over_under_10.fit(X_train, y_train)

pipeline_over_under_100 = make_pipeline(SMOTE(random_state=RANDOM_STATE),
                                        NearMiss(version=2),
                                        RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE))
pipeline_over_under_100.fit(X_train, y_train)

# Classification and results presentation
print("Results for first under- then oversampling")
print("Results for n_estimators = 10")
print(classification_report_imbalanced(y_test, pipeline_under_over_10.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))
print("Results for n_estimators = 100")
print(classification_report_imbalanced(y_test, pipeline_under_over_100.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))

print("Results for first over- then undersampling")
print("Results for n_estimators = 10")
print(classification_report_imbalanced(y_test, pipeline_over_under_10.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))
print("Results for n_estimators = 100")
print(classification_report_imbalanced(y_test, pipeline_over_under_100.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))

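The order of the strategies matters: it is clearly necessary to oversample first and undersample afterwards. Note that, with the default sampling_strategy, the first sampler already leaves the classes balanced, so the second one has nothing left to do; the comparison is effectively NearMiss alone against SMOTE alone. The benefit of using 100 trees instead of 10 is not big in that case, but in both cases the computation time is short.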
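
To see why the order has such an effect, it is enough to follow the class counts through both pipelines. The sketch below assumes the default `sampling_strategy='auto'` of imbalanced-learn, under which an undersampler shrinks every class to the size of the minority class and an oversampler grows every class to the size of the majority class. The helper functions and the training counts are hypothetical; only the counts are simulated, not the samples.

```python
from collections import Counter


def undersample_auto(counts):
    """Default undersampling target: every class is reduced to the minority size."""
    smallest = min(counts.values())
    return Counter({label: smallest for label in counts})


def oversample_auto(counts):
    """Default oversampling target: every class is grown to the majority size."""
    largest = max(counts.values())
    return Counter({label: largest for label in counts})


# Hypothetical training counts with a ratio of roughly 15 to 1.
train_counts = Counter({0: 2650, 1: 180})

under_then_over = oversample_auto(undersample_auto(train_counts))
over_then_under = undersample_auto(oversample_auto(train_counts))
print("Under- then oversampling:", dict(under_then_over))
print("Over- then undersampling:", dict(over_then_under))
```

With these defaults, undersampling first leaves the forest with only a few hundred training samples, whereas oversampling first keeps all the original majority samples. To get a truly mixed strategy, each sampler would need an explicit `sampling_strategy` so that neither of them balances the classes completely on its own.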

In [None]:
# (iii)
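from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import pandas as pd


def pca_projections(df, n_components=0.95):
    """Project the inputs of df onto their principal components, keeping the class column."""
    pca = PCA(n_components)
    X = df[df.columns[:-1]]  # Assuming the class is in the last column
    X = pca.fit_transform(X)
    proj_df = pd.DataFrame(data=X, columns=['PC' + str(i) for i in range(1, X.shape[1] + 1)])
    proj_df = pd.concat([proj_df, df[df.columns[-1]]], axis=1)
    return proj_df


# Randomizing seed
RANDOM_STATE = 0

# Load dataset (same file as in the previous cells)
tiroides = pd.read_csv('Thyroids.csv')
tiroides = tiroides.astype(float)  # astype returns a new object, so assign the result

# Data compression by PCA (a float below 1 is a fraction of variance, an integer is a number of PCs)
# Note: PCA is fitted on the whole dataset here, i.e. before the train/test split
n_components = 0.999999
n_original = tiroides.shape[1] - 1  # independent variables before projecting
tiroides = pca_projections(tiroides, n_components)
n_comp = tiroides.shape[1] - 1      # PCs actually kept (read from the data already projected)
print("Data compression: {:.1f}%".format(100 * (1 - n_comp / n_original)))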

# Separate inputs and target
X = tiroides.values[:,:-1]
y = tiroides.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

# Pipeline creation
pipeline_pca_over_under_100 = make_pipeline(SMOTE(random_state=RANDOM_STATE),
                                            NearMiss(version=2),  # no random_state in current imbalanced-learn releases
                                            RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE))
pipeline_pca_over_under_100.fit(X_train, y_train)

print("Results for pca projected data, using " + str(n_comp) + " PCs , and for first over- then undersampling with n_estimators = 100")
print(classification_report_imbalanced(y_test, pipeline_pca_over_under_100.predict(X_test), digits=4, target_names=["Healthy", "Sick"]))

With 95%, 99% and even 99.9% of the variance, the number of PCs is 4, with poor performance. Raising the threshold to 99.99% gives 11 PCs, with an acceptable performance that is comparable to the previous results, although a bit worse. The next jump comes at 99.999%, with 24 PCs but almost the same result as with 11 PCs. No further improvement is possible, even when the number of PCs gets close to the maximum (52, the total number of independent variables).

The reason could be that by projecting we lose information that is non-linear, even when we capture almost 100% of the linear variability. It is also worth keeping in mind that PCA is unsupervised and sensitive to the scale of the variables: if they are not on comparable scales, a few of them can dominate the variance, while low-variance directions that help to separate the classes are discarded.
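
The relation between a variance threshold, the number of PCs and the compression can be sketched without any library. The code below assumes that the number of PCs is the smallest one whose cumulative explained variance ratio exceeds the threshold, and that the "percentage of data compression" means the percentage of columns removed. The variance ratios are made up; only the figures 52 and 4 come from the results above.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of PCs whose cumulative variance ratio exceeds the threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative > threshold:
            return k
    return len(explained_variance_ratio)


def compression_percentage(n_original, n_kept):
    """Percentage of columns removed when keeping n_kept out of n_original."""
    return 100.0 * (1.0 - n_kept / n_original)


# Made-up explained variance ratios, sorted in decreasing order as PCA returns them.
ratios = [0.80, 0.12, 0.05, 0.025, 0.005]
print(components_for_variance(ratios, 0.95))
print(components_for_variance(ratios, 0.99))

# Keeping 4 PCs out of the 52 independent variables of the thyroid dataset:
print(round(compression_percentage(52, 4), 1))
```

The same function gives about 78.8% for 11 PCs and 53.8% for 24 PCs, which makes the trade-off between compression and performance explicit.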

(iv)

In general, the correction is slightly beneficial even when we have not used all the potential of the sampling algorithms.
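
When comparing against the uncorrected case, it matters which metric is used, because plain accuracy can look excellent on imbalanced data. The sketch below, with made-up labels at the 15 to 1 ratio of the thyroid dataset, computes accuracy, sensitivity, specificity and their geometric mean for a hypothetical classifier that always predicts the majority class. The function `binary_rates` is not part of any library; as far as I know, the geometric mean is the quantity reported in the `geo` column of classification_report_imbalanced.

```python
import math


def binary_rates(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, specificity and their geometric mean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Made-up test set: 15 healthy (0) for each sick (1) patient.
y_true = [0] * 15 + [1]
y_pred = [0] * 16  # hypothetical classifier that never predicts "sick"

accuracy, sensitivity, specificity, g_mean = binary_rates(y_true, y_pred)
print("accuracy:", accuracy)
print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("geometric mean:", g_mean)
```

A classifier that ignores the minority class reaches about 94% accuracy here but a geometric mean of zero, so the per-class columns of the report are a fairer basis for judging whether correcting the imbalance pays off.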