# From a case study: XGBoost without feature engineering

We implement an XGBoost model like the one in the solution that placed second, but without doing any feature engineering first. We also fill missing data with a simple mean. The goal is to check on the website how much better the full solution scores, and thus gauge whether the extra effort is worth it in this case.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier



We set the path to the training and test data and we define the features and the target variables.

In [2]:
data_dir = '../Datasets/killer-shrimp-invasion/'

train = pd.read_csv(data_dir + 'train.csv')
test = pd.read_csv(data_dir + 'test.csv')

# split data
Y_train = train['Presence']
ID_train = train['pointid']
X_train = train.drop(['Presence', 'pointid'], axis=1)
ID_test = test['pointid']
X_test = test.drop(['pointid'], axis=1)

### Missing data
We fill the missing values (which I explored in another notebook) with the column mean:

In [3]:
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())
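The cell above fills each set with its own column means. A common alternative, sketched here on made-up data, is to compute the means on the training set only and reuse them for the test set, so that both sets are imputed with the same values. The frames `toy_train` and `toy_test` and their contents are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for X_train / X_test (values are made up).
toy_train = pd.DataFrame({'Depth': [10.0, np.nan, 30.0],
                          'Salinity_today': [35.0, 33.0, np.nan]})
toy_test = pd.DataFrame({'Depth': [np.nan, 50.0],
                         'Salinity_today': [34.0, np.nan]})

# Compute the column means once, on the training data only.
train_means = toy_train.mean()

# Reuse the same means for both sets so they are imputed consistently.
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

Whether this changes the leaderboard score here is not something this notebook tests; with few missing values the difference is likely small.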

## Models
Here I follow the function written by Curtis Thompson (2nd place!).

In the next few steps, we define a function that performs stratified cross-validation, that is, cross-validation in which each fold is sampled so that the proportion of each class is the same as in the full dataset. We use 5 folds. The function also prints the AUROC score of each fold. The ROC (Receiver Operating Characteristic) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity), and AUROC is the area under that curve. If the area is 1, the model separates the two classes perfectly. If AUROC = 0.5, it performs no better than a random model.
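Before the function itself, here is a small illustration of the two ideas on made-up labels: stratified folds keep the class proportion constant, and AUROC depends only on how the scores rank positives against negatives. All variable names and values below are assumptions for the sake of the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Made-up imbalanced labels: 20 negatives and 5 positives (20% positives).
y_toy = np.array([0] * 20 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# With stratification, every test fold keeps the same share of positives.
skf = StratifiedKFold(n_splits=5)
fold_positive_rates = [y_toy[test_idx].mean()
                       for _, test_idx in skf.split(X_toy, y_toy)]
print(fold_positive_rates)

# AUROC on three hand-written score vectors.
y_true = [0, 0, 1, 1]
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])   # positives always ranked higher
inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])  # positives always ranked lower
constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])  # uninformative scores (all ties)
print(perfect, inverted, constant)
```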

In [4]:
def five_fold_cv(model, X_train, Y_train, verbose = True):
    skf = StratifiedKFold(n_splits = 5)
    fold = 1
    scores = []

    for train_index, test_index in skf.split(X_train, Y_train):
        X_train_fold, X_test_fold = X_train.iloc[train_index], X_train.iloc[test_index]
        Y_train_fold, Y_test_fold = Y_train.iloc[train_index], Y_train.iloc[test_index]

        model.fit(X_train_fold, Y_train_fold)
        
        preds = model.predict_proba(X_test_fold)
        preds = [x[1] for x in preds]

        score = roc_auc_score(Y_test_fold, preds)
        scores.append(score)
        if verbose:
            print('Fold', fold, '    ', score)
        fold += 1

    avg = np.mean(scores)
    if verbose:
        print()
        print('Average', avg)
    return avg
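For comparison, scikit-learn's `cross_val_score` can produce the same kind of per-fold AUROC values in a single call. The sketch below uses synthetic data and a `LogisticRegression` stand-in (both assumptions, chosen only to keep the example free of the xgboost dependency); in this notebook one would pass the XGBoost model, `X_train[features]` and `Y_train` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the killer shrimp dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Stand-in estimator; any classifier exposing predict_proba or
# decision_function can be scored with 'roc_auc'.
demo_model = LogisticRegression(max_iter=1000)

# scoring='roc_auc' computes the AUROC on each held-out fold.
cv_scores = cross_val_score(demo_model, X_demo, y_demo,
                            cv=StratifiedKFold(n_splits=5), scoring='roc_auc')
print(cv_scores)
print('Average', np.mean(cv_scores))
```

The hand-written loop remains useful when per-fold predictions or custom printing are needed.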

Let's also define the list of features:

In [5]:
features = ['Temperature_today', 'Salinity_today', 'Depth', 'Substrate', 'Exposure']


Time to build an XGBoost (eXtreme Gradient Boosting) model. Gradient boosting is a machine learning technique used for regression and classification tasks. It produces a prediction model in the form of an ensemble of weak prediction models (typically decision trees).
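To make the "ensemble of weak models" idea concrete, here is a minimal sketch of gradient boosting for squared-error regression, built from shallow scikit-learn trees on made-up data. It is not how XGBoost is implemented (XGBoost adds regularisation and other refinements); the names and settings below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression problem.
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3   # shrinkage applied to each tree's contribution
n_rounds = 50

# Start from a constant prediction, then repeatedly fit a shallow tree
# to the residuals (the negative gradient of the squared-error loss).
prediction = np.full_like(y_gb, y_gb.mean())
mse_history = [np.mean((y_gb - prediction) ** 2)]
for _ in range(n_rounds):
    residuals = y_gb - prediction
    weak_tree = DecisionTreeRegressor(max_depth=2).fit(X_gb, residuals)
    prediction += learning_rate * weak_tree.predict(X_gb)
    mse_history.append(np.mean((y_gb - prediction) ** 2))

print('Training MSE before boosting:', mse_history[0])
print('Training MSE after boosting: ', mse_history[-1])
```

Each tree on its own is a poor predictor, but the sum of many small corrections fits the training data progressively better.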

In [6]:
model = XGBClassifier(eval_metric = 'auc', objective = 'binary:logistic',
                      learning_rate = 0.3, max_depth = 5, subsample = 1, reg_lambda = 0.5)

score = five_fold_cv(model, X_train[features], Y_train, verbose=True)

Fold 1      0.9999257424328684
Fold 2      0.9940346421070869
Fold 3      0.9994256462788391
Fold 4      0.998547407744493
Fold 5      0.9997140126200024

Average 0.9983294902366581


## Predictions
Time for predictions!

In [7]:
predictions = pd.DataFrame(ID_test, columns=['pointid'])

model.fit(X_train[features], Y_train)
predictions['Presence'] = model.predict_proba(X_test[features])[:,1]

predictions[['pointid', 'Presence']].to_csv('xgb_solution.csv', index=False)

