# Database study
### Project developer: 
Mar Balibrea Rull, marbr@kth.se

### Instructions
For each of the theoretical tasks 1a and 1c in the examination (see below), you are requested to answer them by means of simulations/tests using a Jupyter notebook. You may employ real datasets and learning algorithms, e.g., as implemented in Scikit-learn, or use synthetic classifiers/predictions/data, e.g., output by some random functions. You may use Numpy, pandas, Scikit-learn and SciPy (send me a request in case you would like to use any other package).

You are expected to submit one notebook (by email to me), clearly separating the two tasks, with extensive comments explaining the assumptions and conclusions.

The deadline for submission is March 3.

## Load general libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

## Add complementary functions

In [3]:
def create_normalization(df, avoid, normalizationtype = 'minmax'):
    # Learn per-column normalization parameters from the numeric columns of df,
    # skipping the columns listed in `avoid` (e.g. IDs and class labels).
    _df = df.select_dtypes(include = ['float', 'int']).drop(columns = avoid)
    
    # strings must be compared with ==, not with the identity operator `is`
    if normalizationtype == 'minmax':
        
        normalization = {col: (normalizationtype, _df[col].min(), _df[col].max()) for col in _df.columns}
    elif normalizationtype == 'zscore':
        
        normalization = {col: (normalizationtype, _df[col].mean(), _df[col].std()) for col in _df.columns}
    else:
        
        raise ValueError('Unknown normalization type: ' + str(normalizationtype))
    
    df_out = apply_normalization(_df, normalization)
    return df_out, normalization

def apply_normalization(df, normalization):
    # Apply previously learned parameters, so that test data is scaled
    # with the statistics of the model-building data.
    df_out = df.copy()
    for col in normalization:
        a = normalization[col][1]
        b = normalization[col][2]
    
        if normalization[col][0] == 'minmax':
            
            df_out[col] = (df_out[col] - a)/(b - a) # using broadcasting
            # df_out[col] = df_out[col].apply(lambda x: (x - a)/(b - a)) # using lambda
            df_out[col] = df_out[col].clip(0, 1) # values outside the learned range are clipped
        elif normalization[col][0] == 'zscore':
    
            df_out[col] = (df_out[col] - a)/b # using broadcasting
    return df_out
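
As a quick check of the min-max logic above, the following sketch reproduces the same arithmetic on plain Python lists. The helper name `minmax_apply` and the feature values are made up for illustration and are not taken from "glass.txt". The point is that the range is learned on the model-building values only, so test values outside that range end up clipped to 0 or 1.

```python
def minmax_apply(values, lo, hi):
    """Scale values with a previously learned (lo, hi) range and clip to [0, 1]."""
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [min(1.0, max(0.0, s)) for s in scaled]

# Hypothetical feature values (assumption, for illustration only)
build_values = [2.0, 4.0, 6.0]
test_values = [1.0, 5.0, 8.0]

# The range is learned on the model-building half only
lo, hi = min(build_values), max(build_values)

build_scaled = minmax_apply(build_values, lo, hi)
test_scaled = minmax_apply(test_values, lo, hi)

print(build_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.0, 0.75, 1.0]
```

The test values 1.0 and 8.0 fall outside the learned range [2.0, 6.0], which is why they are mapped to 0.0 and 1.0 instead of -0.25 and 1.5.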

## 1a. Methodology
Assume that we want to compare a new algorithm to a baseline
algorithm, for some classification task. As we are not sure what
hyper-parameter settings to use for the new algorithm, we will
investigate 100 different settings for that, while we use a standard
hyper-parameter setting for the baseline algorithm. We first randomly
split a given dataset into two equal-sized halves; one for model
building and one for testing. We then employ 10-fold cross-validation
using the first half of the data, measuring the accuracy of each model
generated from an algorithm and hyper-parameter setting. Assume that
the best performing hyper-parameter setting for the new algorithm
results in a higher (cross-validation) accuracy than the baseline
algorithm. Should we expect to see the same relative performance,
i.e., the new algorithm (with the best-performing hyper-parameter
setting) outperforming the baseline (with the standard hyper-parameter
setting), when the two models (trained on the entire first half) are
evaluated on the second half of the data? Explain your reasoning.

**Interpretation of the assignment**:

I need:

- A database
- One algorithm with one hyper-parameter configuration
- Another algorithm with 100 hyper-parameter configurations

I will use the "glass.txt" dataset, as in past assignments. The split into two equal-sized halves is already provided as "glass_train.txt" (for model building) and "glass_test.txt" (for testing). As for the algorithms, I will use MLPClassifier from sklearn as the baseline algorithm and RandomForestClassifier from sklearn as the new algorithm.

Steps:

1. Train and validate the baseline algorithm using 10-fold cross-validation on the first half of the dataset.
2. Train and validate the new algorithm (100 configurations) using 10-fold cross-validation on the first half of the dataset.
3. Make sure that the best performing configuration of the new algorithm outperforms the baseline algorithm in terms of accuracy.
4. Evaluate the best performing configuration of the new algorithm and the baseline algorithm on the second half of the dataset.

In [129]:
from sklearn.model_selection import KFold
from warnings import filterwarnings
filterwarnings('ignore')

# as both have the same length, we will use this partition:
build = pd.read_csv('glass_train.txt')
test = pd.read_csv('glass_test.txt')

_, normalization = create_normalization(build, avoid = ['ID', 'CLASS'])
build = apply_normalization(build, normalization)
test = apply_normalization(test, normalization)

kf = KFold(n_splits = 10)

In [130]:
# FIRST STEP

build_labels = build['CLASS']
build_df = build.drop(columns = ['ID', 'CLASS'])

# baseline algorithm

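accuracies = []
for i_train, i_val in kf.split(build_df):

    # kf.split yields positional indices, hence iloc
    data_train = build_df.iloc[i_train]
    label_train = build_labels.iloc[i_train]

    data_val = build_df.iloc[i_val]
    label_val = build_labels.iloc[i_val]

    baseline = MLPClassifier()
    baseline.fit(data_train, label_train)

    p = baseline.predict(data_val)
    accuracies.append(metrics.accuracy_score(label_val, p))

# mean cross-validation accuracy; note that `baseline` now holds the model fitted
# in the last fold (9/10 of the first half), which is the one reused in the fourth step
baseline_acc = np.mean(accuracies)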

In [131]:
# SECOND STEP

# new algorithm

n_estimators_values = [10, 50, 100, 150, 200] # number of trees
max_depth_values = [50, 70] # max depth trees
min_samples_split_values = [4, 6, 8, 10, 12] # min samples to split
class_weight_values = [None, 'balanced'] # class weight mode

parameters = [(n_estimators, max_depth, min_samples, class_weight)
              for n_estimators in n_estimators_values
              for max_depth in max_depth_values
              for min_samples in min_samples_split_values
              for class_weight in class_weight_values]
best_new_acc = 0
best_new_acc_model = None
models_better_baseline = 0

for i in range(len(parameters)):

    accuracies = []
    for i_train, i_val in kf.split(build_df):

        # kf.split yields positional indices, hence iloc
        data_train = build_df.iloc[i_train]
        label_train = build_labels.iloc[i_train]

        data_val = build_df.iloc[i_val]
        label_val = build_labels.iloc[i_val]

        new = RandomForestClassifier(n_estimators = parameters[i][0],
                                     max_depth = parameters[i][1],
                                     min_samples_split = parameters[i][2],
                                     class_weight = parameters[i][3],
                                     random_state = 8)
        new.fit(data_train, label_train)

        p = new.predict(data_val)
        accuracies.append(metrics.accuracy_score(label_val, p))

    aux_new_acc = np.mean(accuracies)
    
    if aux_new_acc > best_new_acc:
        
        best_new_acc = aux_new_acc
        # keeps the model fitted in the last fold of the best configuration so far
        best_new_acc_model = new
        
    # count the configurations that beat the baseline in cross-validation
    models_better_baseline += aux_new_acc > baseline_acc


In [132]:
# THIRD STEP

if baseline_acc >= best_new_acc:
    
    raise ValueError('Best new algorithm does not work better than the baseline')

In [134]:
# FOURTH STEP

test_labels = test['CLASS']
test_df = test.drop(columns = ['ID', 'CLASS'])

baseline_test_p = baseline.predict(test_df)
baseline_test_acc = metrics.accuracy_score(test_labels, baseline_test_p)

new_test_p = best_new_acc_model.predict(test_df)
new_test_acc = metrics.accuracy_score(test_labels, new_test_p)

print('Model building accuracy for baseline algorithm: ' + '{0:.3f}'.format(baseline_acc))
print('Model building accuracy for best new algorithm: ' + '{0:.3f}'.format(best_new_acc))
print('Amount of new algorithms with higher accuracy: ' + str(models_better_baseline))
print('Test accuracy for baseline algorithm: ' + '{0:.3f}'.format(baseline_test_acc))
print('Test accuracy for best new algorithm: ' + '{0:.3f}'.format(new_test_acc))


Model building accuracy for baseline algorithm: 0.516
Model building accuracy for best new algorithm: 0.685
Amount of new algorithms with higher accuracy: 100
Test accuracy for baseline algorithm: 0.598
Test accuracy for best new algorithm: 0.729


### Analysis

For these algorithms, the advantage of the best new configuration is preserved on the test set. This is plausible here because all 100 configurations outperform the baseline algorithm during model building, so the gap is unlikely to be explained only by having picked the luckiest of the 100 configurations.

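If we instead try *worse* configurations for the new algorithm, so that fewer of them perform better than the baseline, we should expect the ranking to be less stable. The cross-validation accuracy of the best configuration is the maximum of 100 noisy estimates and therefore tends to over-estimate its true accuracy, whereas the single estimate for the baseline is not inflated by selection in this way.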
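
The over-estimation can be stated a little more precisely. As a sketch, write $\hat{a}_1, \dots, \hat{a}_{100}$ for the cross-validation accuracies of the 100 configurations and $a_i = \mathbb{E}[\hat{a}_i]$ for their expected values (this notation is introduced here only for illustration). Because the maximum is a convex function, the expected value of the best observed accuracy is at least the accuracy of the truly best configuration:

```latex
\mathbb{E}\Big[\max_{1 \le i \le 100} \hat{a}_i\Big] \;\ge\; \max_{1 \le i \le 100} \mathbb{E}[\hat{a}_i] \;=\; \max_{1 \le i \le 100} a_i
```

The inequality is typically strict when the estimates are noisy, which is the case for a dataset as small as this one. No such selection takes place for the baseline, so only the new algorithm's reported accuracy is inflated.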

Let's try it: the code below repeats the second to fourth steps from above (the baseline does not need to be re-run), changing only the hyper-parameter values for the configurations of the new algorithm.

In [139]:
# SECOND STEP (2nd experiment)

# new algorithm

n_estimators_values = [1, 3, 5, 7, 9] # number of trees
max_depth_values = [20, 25] # max depth trees
min_samples_split_values = [20, 25, 30, 35, 40] # min samples to split
class_weight_values = [None, 'balanced'] # class weight mode

parameters = [(n_estimators, max_depth, min_samples, class_weight)
              for n_estimators in n_estimators_values
              for max_depth in max_depth_values
              for min_samples in min_samples_split_values
              for class_weight in class_weight_values]
best_new_acc_2 = 0
best_new_acc_model_2 = None
models_better_baseline_2 = 0

for i in range(len(parameters)):

    accuracies = []
    for i_train, i_val in kf.split(build_df):

        # kf.split yields positional indices, hence iloc
        data_train = build_df.iloc[i_train]
        label_train = build_labels.iloc[i_train]

        data_val = build_df.iloc[i_val]
        label_val = build_labels.iloc[i_val]

        new = RandomForestClassifier(n_estimators = parameters[i][0],
                                     max_depth = parameters[i][1],
                                     min_samples_split = parameters[i][2],
                                     class_weight = parameters[i][3],
                                     random_state = 8)
        new.fit(data_train, label_train)

        p = new.predict(data_val)
        accuracies.append(metrics.accuracy_score(label_val, p))

    aux_new_acc = np.mean(accuracies)
    
    if aux_new_acc > best_new_acc_2:
        
        best_new_acc_2 = aux_new_acc
        # keeps the model fitted in the last fold of the best configuration so far
        best_new_acc_model_2 = new
        
    # count the configurations that beat the baseline in cross-validation
    models_better_baseline_2 += aux_new_acc > baseline_acc

In [140]:
# THIRD STEP (2nd experiment)

if baseline_acc >= best_new_acc_2:
    
    raise ValueError('Best new algorithm does not work better than the baseline')
    
    
# FOURTH STEP (2nd experiment)

# baseline done in 1st experiment

new_test_p_2 = best_new_acc_model_2.predict(test_df)
new_test_acc_2 = metrics.accuracy_score(test_labels, new_test_p_2)

print('Model building accuracy for baseline algorithm: ' + '{0:.3f}'.format(baseline_acc))
print('Model building accuracy for best new algorithm: ' + '{0:.3f}'.format(best_new_acc_2) + ' (was ' + '{0:.3f}'.format(best_new_acc) + ')')
print('Amount of new algorithms with higher accuracy: ' + str(models_better_baseline_2) + ' (was ' + str(models_better_baseline) + ')')
print('Test accuracy for baseline algorithm: ' + '{0:.3f}'.format(baseline_test_acc))
print('Test accuracy for best new algorithm: ' + '{0:.3f}'.format(new_test_acc_2) + ' (was ' + '{0:.3f}'.format(new_test_acc) + ')')


Model building accuracy for baseline algorithm: 0.516
Model building accuracy for best new algorithm: 0.597 (was 0.685)
Amount of new algorithms with higher accuracy: 38 (was 100)
Test accuracy for baseline algorithm: 0.598
Test accuracy for best new algorithm: 0.561 (was 0.729)


### Analysis (part 2)

As we can see, the test accuracy of the best new configuration (0.561) is now lower than that of the baseline (0.598), even though its model building accuracy is still higher (0.597 vs. 0.516). The number of configurations that beat the baseline during model building is also much lower (38 instead of 100). To push this further, we could increase the minimum value in `min_samples_split_values`, which is the parameter that I have seen to have the largest effect on the result.
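
The over-estimation discussed in this section can also be isolated with purely synthetic predictions, which the assignment allows. The sketch below is a simplification and not a re-run of the experiment: it assumes 100 configurations that all have the same true accuracy (0.6) and whose validation estimates are independent, whereas in the notebook the configurations share the same folds and are therefore correlated. The function name `simulate` and all numbers are chosen for illustration.

```python
import random

def simulate(n_configs, true_acc, n_val, n_test, rng):
    """One repetition: pick the configuration with the best validation accuracy."""
    # Every configuration has the SAME true accuracy, so any difference
    # between the validation accuracies is pure noise.
    val_accs = [sum(rng.random() < true_acc for _ in range(n_val)) / n_val
                for _ in range(n_configs)]
    best_val_acc = max(val_accs)
    # Accuracy of the selected configuration on fresh test instances
    test_acc = sum(rng.random() < true_acc for _ in range(n_test)) / n_test
    return best_val_acc, test_acc

rng = random.Random(0)
n_repetitions = 200
results = [simulate(n_configs=100, true_acc=0.6, n_val=100, n_test=100, rng=rng)
           for _ in range(n_repetitions)]

mean_best_val = sum(r[0] for r in results) / n_repetitions
mean_test = sum(r[1] for r in results) / n_repetitions

print('Mean validation accuracy of the selected configuration: {0:.3f}'.format(mean_best_val))
print('Mean test accuracy of the selected configuration:       {0:.3f}'.format(mean_test))
```

Under these assumptions the selected configuration looks clearly better than 0.6 on the validation data, while its test accuracy stays close to 0.6 on average. A baseline evaluated only once does not get this boost, which is why a small cross-validation advantage of the new algorithm may disappear on the second half of the data.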

---

## 1c. Performance metrics


Assume that we have evaluated a binary classification model on a test
set with 5000 instances; 4000 belonging to the majority class and 1000
to the minority class. Assume that we have measured the accuracy and
AUC, and also observed a much higher precision for the majority class
than for the minority class. If we would evaluate the model on a
class-balanced test set, which has been obtained from the first by
keeping all instances from the minority class and sampling (without
replacement) 1000 instances from the majority class, should we expect
to see about the same accuracy and AUC as previously observed? Explain
your reasoning.

**Interpretation of the assignment**:

I need:

- A database
- A model for binary classification

As the datasets we had for this course didn't have enough instances for the experiment, I looked for another one on the internet. I found [these](https://www.kaggle.com/hackerrank/developer-survey-2018) results from a developer survey conducted in 2018. However, it had too many features, so I first removed those that were strings or that had a lot of NULL values. I use one of the remaining features as the class (whether the developer is a student or not).
For the model, I will use the scikit-learn MLPClassifier with default parameters.

Steps:

1. First dataset creation: divide the dataset into two parts (training and testing), making sure that the test set has 5000 instances (4000 belonging to the majority class). To make sure that the model performs better on the majority class, I will build a class-imbalanced training set. For that purpose, I may not use the complete dataset.
2. Train the model and test it by calculating accuracy and AUC.
3. Second dataset creation: keep the previous division, but undersample the test set so that it is balanced at 1000 instances per class.
4. Test the model by calculating accuracy and AUC.

In [4]:
from warnings import filterwarnings
filterwarnings('ignore')

# FIRST STEP

df = pd.read_csv('developer_2.txt')
# preprocessing: 'q1AgeBeginCoding' and 'q3Gender' contain '#NULL!' markers, which are
# replaced by NaN here and imputed with the column mean below
df['q1AgeBeginCoding'] = df['q1AgeBeginCoding'].replace('#NULL!', np.nan)
df['q3Gender'] = df['q3Gender'].replace('#NULL!', np.nan)
df = df.astype('float')
df = df.apply(lambda x: x.fillna(x.mean()), axis = 0)
_, normalization = create_normalization(df, avoid = ['RespondentID', 'q8Student'])
df = apply_normalization(df, normalization)

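# divide dataset in training and testing (no separate validation set is needed, since
# MLPClassifier is used with default parameters and no hyper-parameters are tuned)
# majorityclass is the label (0 or 1) of the class with more instances
majorityclass = int(len(df.loc[df['q8Student'] == 0]) < len(df.loc[df['q8Student'] == 1]))
maj_df = df.loc[df['q8Student'] == majorityclass]
min_df = df.loc[df['q8Student'] != majorityclass]

# number of majority/minority instances in the test (te) and training (tr) sets
maj_te, min_te = 4000, 1000
maj_tr, min_tr = 10000, 1000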
if len(maj_df) < maj_te+maj_tr or len(min_df) < min_te+min_tr:
    
    raise ValueError('There are NOT enough instances for this distribution of classes')

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
test_df = pd.concat([maj_df[:maj_te], min_df[:min_te]]).sample(frac = 1, random_state = 2)
train_df = pd.concat([maj_df[-maj_tr:], min_df[-min_tr:]]).sample(frac = 1, random_state = 2)

In [10]:
# SECOND STEP

model = MLPClassifier()
train_labels = train_df['q8Student']
model.fit(train_df.drop(columns = ['RespondentID', 'q8Student']), train_labels)

test_labels = test_df['q8Student']
p = model.predict(test_df.drop(columns = ['RespondentID', 'q8Student']))
pp = model.predict_proba(test_df.drop(columns = ['RespondentID', 'q8Student']))
acc = metrics.accuracy_score(test_labels, p)
auc = metrics.roc_auc_score(test_labels, pp[:, 1])
print('ACC:', acc, '; AUC:', auc)

ACC: 0.8008 ; AUC: 0.67156525


In [9]:
# THIRD STEP

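# balanced test set: all minority instances plus the first 1000 majority instances
# of the original test set (a fixed subset rather than a random sample)
ttest_df = pd.concat([maj_df[:min_te], min_df[:min_te]]).sample(frac = 1, random_state = 2)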

In [7]:
# FOURTH STEP

ttest_labels = ttest_df['q8Student']
p = model.predict(ttest_df.drop(columns = ['RespondentID', 'q8Student']))
pp = model.predict_proba(ttest_df.drop(columns = ['RespondentID', 'q8Student']))
acc = metrics.accuracy_score(ttest_labels, p)
auc = metrics.roc_auc_score(ttest_labels, pp[:, 1])
print('ACC:', acc, '; AUC:', auc)

ACC: 0.511 ; AUC: 0.6501625


### Analysis

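The assignment assumes that precision is much higher for the majority class than for the minority class; the training set was made class-imbalanced to reproduce this situation. To understand what happens to accuracy and AUC once the test set has been reduced to 1000 instances per class, we have to look at how each of them is calculated.

Accuracy is the fraction of instances that have been labelled correctly, i.e., the average of the per-class recalls weighted by the class proportions. The model labels majority-class instances correctly far more often than minority-class ones, so removing 3000 majority-class instances lowers the share of correctly labelled instances and the accuracy decreases (from 0.801 to 0.511). In fact, if the recall on the majority class is about the same on the subsample as on the full test set, the two accuracies together imply a recall of about 0.99 for the majority class and about 0.03 for the minority class: the model predicts the majority class almost always.

The AUC, in contrast, is the probability that the model ranks a randomly chosen positive instance ahead of a randomly chosen negative one. This depends on how the scores are distributed within each class, not on the class proportions in the test set, so it is determined by the training rather than by the composition of the test set. Apart from sample size effects, it should stay about the same (0.672 vs. 0.650 here).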
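
The reasoning above can be checked on synthetic scores, independently of the survey data. The sketch below is only an illustration under assumed score distributions: the Gaussian parameters, the class sizes (2000 and 500 instead of 4000 and 1000, to keep it fast) and the helper names `accuracy` and `auc` are all hypothetical. Scores represent the predicted probability of the minority class (label 1), and the model is assumed to score most instances of both classes below the 0.5 threshold.

```python
import random

def accuracy(labels, scores, threshold=0.5):
    """Fraction of instances whose thresholded score matches the label."""
    return sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(1)

# Hypothetical (label, score) pairs; the distributions are assumptions
majority = [(0, rng.gauss(0.2, 0.15)) for _ in range(2000)]
minority = [(1, rng.gauss(0.35, 0.15)) for _ in range(500)]

test_sets = {
    'imbalanced': majority + minority,
    # undersample the majority class without replacement
    'balanced': rng.sample(majority, 500) + minority,
}

results = {}
for name, data in test_sets.items():
    labels = [y for y, _ in data]
    scores = [s for _, s in data]
    results[name] = (accuracy(labels, scores), auc(labels, scores))
    print(name, 'ACC: {0:.3f}'.format(results[name][0]),
          'AUC: {0:.3f}'.format(results[name][1]))
```

With these assumed distributions, the accuracy drops markedly on the balanced set while the AUC changes only slightly, which mirrors the pattern observed with the real model. The accuracy is computed at a fixed threshold, so it depends on the class mix, whereas the AUC only compares scores of positives against scores of negatives.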

### Comments

The files that can be downloaded from the website linked above are:

- Country-Code-Mapping.csv: mapping of countries to their country codes.
- HackerRank-Developer-Survey-2018-Codebook.csv: mapping of each feature name to its question.
- HackerRank-Developer-Survey-2018-Numeric-Mapping.csv: mapping of each feature name to all its possible numeric values, with an explanation of each value.
- HackerRank-Developer-Survey-2018-Numeric.csv: data with numeric values.
- HackerRank-Developer-Survey-2018-Values.csv: data with qualitative values.