This notebook shows how to compute a few simple baselines.

In particular, it shows how to:
1. Use the provided validation set
2. Compute the top-30 metric
3. Save the predictions on the test set in the right format for submission

In [1]:
%pylab inline --no-import-all

import os
from pathlib import Path

import pandas as pd


# Change this path to adapt to where you downloaded the data
DATA_PATH = Path("data")

# Create the path to save submission files
SUBMISSION_PATH = Path("submissions")
os.makedirs(SUBMISSION_PATH, exist_ok=True)

Populating the interactive namespace from numpy and matplotlib


We also load the official metric, the top-30 error rate, for which we provide efficient implementations:

In [2]:
from GLC.metrics import top_30_error_rate
help(top_30_error_rate)

Help on function top_30_error_rate in module GLC.metrics:

top_30_error_rate(y_true, y_score)
    Computes the top-30 error rate.
    
    Parameters
    ----------
    y_true: 1d array, [n_samples]
        True labels.
    y_score: 2d array, [n_samples, n_classes]
        Scores for each label.
    
    Returns
    -------
    float:
        Top-30 error rate value.
    
    Notes
    -----
    Complexity: :math:`O( n_\text{samples} \times n_\text{classes} )`.
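
As a rough illustration of what this metric measures (a sketch, not the `GLC.metrics` implementation), the top-k error rate can be computed by counting, for each sample, how many classes score strictly higher than the true class. The name `top_k_error_rate_sketch` is hypothetical, and the sketch assumes that labels are integer column indices into `y_score`.

```python
import numpy as np


def top_k_error_rate_sketch(y_true, y_score, k=30):
    """Fraction of samples whose true label is not among the k best scores.

    Assumes y_true holds integer column indices into y_score.
    Ties are resolved in favour of the true label.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Score assigned to the true class of each sample
    true_score = y_score[np.arange(len(y_true)), y_true]
    # Number of classes ranked strictly above the true class
    n_better = (y_score > true_score[:, None]).sum(axis=1)
    # A sample is an error when at least k classes beat its true class
    return float(np.mean(n_better >= k))


# Toy example with 3 samples, 4 classes and k=2
y_score = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.2],
])
y_true = np.array([2, 3, 3])
print(top_k_error_rate_sketch(y_true, y_score, k=2))
```

In this toy example only the third sample is an error, since two classes score strictly higher than its true class.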



In [3]:
from GLC.metrics import top_k_error_rate_from_sets
help(top_k_error_rate_from_sets)

Help on function top_k_error_rate_from_sets in module GLC.metrics:

top_k_error_rate_from_sets(y_true, s_pred)
    Computes the top-k error rate from predicted sets.
    
    Parameters
    ----------
    y_true: 1d array, [n_samples]
        True labels.
    s_pred: 2d array, [n_samples, k]
        Previously computed top-k sets for each sample.
    
    Returns
    -------
    float:
        Error rate value.
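
For intuition, the computation from predicted sets reduces to a membership test per sample. The following is a minimal sketch under that reading, not the `GLC.metrics` code; the name `error_rate_from_sets_sketch` is hypothetical.

```python
import numpy as np


def error_rate_from_sets_sketch(y_true, s_pred):
    """Fraction of samples whose true label is absent from its predicted set."""
    y_true = np.asarray(y_true)
    s_pred = np.asarray(s_pred)
    # Compare each label with every member of its set (broadcast over columns)
    hit = (s_pred == y_true[:, None]).any(axis=1)
    return float(1.0 - hit.mean())


# Toy example: three samples with predicted sets of size 2
s_pred = np.array([[4, 7], [1, 2], [0, 9]])
y_true = np.array([7, 3, 0])
print(error_rate_from_sets_sketch(y_true, s_pred))
```

Here the second sample is the only miss, because label 3 is not in its set `[1, 2]`.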



For submissions, we also need to predict the top-30 sets, for which we provide an efficient implementation as well:

In [4]:
from GLC.metrics import predict_top_30_set
help(predict_top_30_set)

Help on function predict_top_30_set in module GLC.metrics:

predict_top_30_set(y_score)
    Predicts the top-30 sets from scores.
    
    Parameters
    ----------
    y_score: 2d array, [n_samples, n_classes]
        Scores for each sample and label.
    
    Returns
    -------
    2d array, [n_samples, 30]:
        Predicted top-30 sets for each sample.
    
    Notes
    -----
    Complexity: :math:`O( n_\text{samples} \times n_\text{classes} )`.
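
One way to obtain such sets without fully sorting every row is a partial selection with `np.argpartition`. This is a sketch of that idea and may differ from what `GLC.metrics` does internally; the name `predict_top_k_set_sketch` is hypothetical, and the function assumes `k` does not exceed the number of classes.

```python
import numpy as np


def predict_top_k_set_sketch(y_score, k=30):
    """Return, for each row, the column indices of the k largest scores.

    The indices within each set are in no particular order.
    Assumes k <= number of columns.
    """
    y_score = np.asarray(y_score)
    n_classes = y_score.shape[1]
    # argpartition moves the k largest entries of each row to the last
    # k positions, which is typically cheaper than a full sort
    idx = np.argpartition(y_score, n_classes - k, axis=1)
    return idx[:, n_classes - k:]


# Toy example with 2 samples, 4 classes and k=2
y_score = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.15, 0.05],
])
top = predict_top_k_set_sketch(y_score, k=2)
print(np.sort(top, axis=1))
```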



We also provide a utility function to generate submission files in the right format:

In [5]:
from GLC.submission import generate_submission_file
help(generate_submission_file)

Help on function generate_submission_file in module GLC.submission:

generate_submission_file(filename, observation_ids, s_pred)
    Generate submission file for Kaggle
    
    Parameters
    ----------
    filename : string
        Submission filename.
    observation_ids : 1d array-like
        Test observations ids
    s_pred : list of 1d array-like
        Set predictions for test observations.
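
To give an idea of what such a helper might do, here is a minimal sketch that writes one row per observation, with its id and its predicted set as a space-separated list. The layout, the header names `Id` and `Predicted`, and the function name `write_submission_sketch` are all assumptions made for illustration; the actual format is whatever `generate_submission_file` produces.

```python
import csv
import io


def write_submission_sketch(fileobj, observation_ids, s_pred):
    """Write one row per observation: its id and a space-separated set.

    The header names "Id" and "Predicted" are assumptions for illustration.
    """
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    for obs_id, pred_set in zip(observation_ids, s_pred):
        writer.writerow([obs_id, " ".join(str(s) for s in pred_set)])


# Write two toy observations to an in-memory buffer instead of a file
buffer = io.StringIO()
write_submission_sketch(buffer, [10782781, 10364138], [[0, 1, 2], [5, 3, 8]])
print(buffer.getvalue())
```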



# Observation data loading

We first need to load the observation data:

In [6]:
df_obs_fr = pd.read_csv(DATA_PATH / "observations" / "observations_fr_train.csv", sep=";", index_col="observation_id")
df_obs_us = pd.read_csv(DATA_PATH / "observations" / "observations_us_train.csv", sep=";", index_col="observation_id")
df_obs = pd.concat((df_obs_fr, df_obs_us))

Then, we retrieve the train/val split provided:

In [7]:
obs_id_train = df_obs.index[df_obs["subset"] == "train"].values
obs_id_val = df_obs.index[df_obs["subset"] == "val"].values

y_train = df_obs.loc[obs_id_train]["species_id"].values
y_val = df_obs.loc[obs_id_val]["species_id"].values

n_val = len(obs_id_val)
print("Validation set size: {} ({:.1%} of train observations)".format(n_val, n_val / len(df_obs)))

Validation set size: 40080 (2.5% of train observations)
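
With a provided split, it can be worth checking how many validation observations belong to species that never occur in the training subset, since a model fitted on that subset alone has no examples of them. A minimal sketch on made-up labels (in the notebook, one would pass `y_train` and `y_val` instead of the toy arrays):

```python
import numpy as np

# Made-up labels standing in for y_train and y_val
y_train_toy = np.array([0, 0, 1, 2, 2, 2])
y_val_toy = np.array([0, 2, 3, 3])

# Boolean mask: True where the validation label also occurs in training
seen = np.isin(y_val_toy, np.unique(y_train_toy))
unseen_fraction = 1.0 - seen.mean()
print("Validation observations of unseen species: {:.1%}".format(unseen_fraction))
```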


We also load the observation data for the test set:

In [8]:
df_obs_fr_test = pd.read_csv(DATA_PATH / "observations" / "observations_fr_test.csv", sep=";", index_col="observation_id")
df_obs_us_test = pd.read_csv(DATA_PATH / "observations" / "observations_us_test.csv", sep=";", index_col="observation_id")

df_obs_test = pd.concat((df_obs_fr_test, df_obs_us_test))

obs_id_test = df_obs_test.index.values

print("Number of observations for testing: {}".format(len(df_obs_test)))

df_obs_test.head()

Number of observations for testing: 36421


observation_id,latitude,longitude
10782781,43.601788,6.940195
10364138,46.241711,0.683586
10692017,45.181095,1.533459
10222322,46.93845,5.298678
10241950,45.017433,0.960736


# Sample submission file

In this section, we will demonstrate how to generate the sample submission file provided.

To do so, we will use the function `generate_submission_file` from `GLC.submission`.

The sample submission consists of always predicting the first 30 species for all the test observations:

In [9]:
first_30_species = np.arange(30)
s_pred = np.tile(first_30_species[None], (len(df_obs_test), 1))
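
The `np.tile` call repeats the same row once per observation. A small sketch with made-up sizes (3 species and 2 observations instead of 30 species and the full test set) shows the resulting shape:

```python
import numpy as np

species = np.array([0, 1, 2])   # stand-in for the 30 species ids
n_obs = 2                       # stand-in for len(df_obs_test)

# species[None] has shape (1, 3); tiling it (n_obs, 1) times stacks
# n_obs copies along the first axis and keeps the second axis unchanged
s_pred_toy = np.tile(species[None], (n_obs, 1))
print(s_pred_toy.shape)
print(s_pred_toy)
```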

We can then generate the associated submission file using:

In [10]:
generate_submission_file(SUBMISSION_PATH / "sample_submission.csv", df_obs_test.index, s_pred)

# Constant baseline: 30 most observed species

The first baseline predicts the 30 most observed species of the train set for every observation, which corresponds exactly to the "Top-30 most present species":

In [11]:
species_distribution = df_obs.loc[obs_id_train]["species_id"].value_counts(normalize=True)
top_30_most_observed = species_distribution.index.values[:30]
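
This selection relies on `value_counts` returning the species sorted by decreasing frequency, so that the first index entries are the most observed ones. A small sketch on made-up species ids, which also computes the share of observations covered by the selected species:

```python
import pandas as pd

# Made-up species ids standing in for the training labels
species_ids = pd.Series([7, 3, 7, 5, 7, 3, 9, 7, 3, 5])

# value_counts sorts by decreasing frequency, so the first index
# entries are the most observed species
distribution = species_ids.value_counts(normalize=True)
top_2 = distribution.index.values[:2]
coverage = distribution.iloc[:2].sum()

print(top_2)      # the two most frequent ids
print(coverage)   # share of observations they account for
```

One minus this coverage is the error rate of the constant prediction on the data it was computed from.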

As expected, it does not perform very well on the validation set:

In [12]:
s_pred = np.tile(top_30_most_observed[None], (n_val, 1))
score = top_k_error_rate_from_sets(y_val, s_pred)
print("Top-30 error rate: {:.1%}".format(score))

Top-30 error rate: 93.5%


We will nevertheless generate the associated submission file on the test set:

In [13]:
# Compute baseline on the test set
n_test = len(df_obs_test)
s_pred = np.tile(top_30_most_observed[None], (n_test, 1))

# Generate the submission file
generate_submission_file(SUBMISSION_PATH / "constant_top_30_most_present_species_baseline.csv", df_obs_test.index, s_pred)

# Random forest on environmental vectors

A classical approach in ecology is to train Random Forests on environmental vectors.

We show here how to do so using [scikit-learn](https://scikit-learn.org/).

We start by loading the environmental vectors:

In [14]:
df_env = pd.read_csv(DATA_PATH / "pre-extracted" / "environmental_vectors.csv", sep=";", index_col="observation_id")

X_train = df_env.loc[obs_id_train].values
X_val = df_env.loc[obs_id_val].values
X_test = df_env.loc[obs_id_test].values

Then, we need to properly handle the missing values.

For instance, using `SimpleImputer`:

In [15]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(
    missing_values=np.nan,
    strategy="constant",
    fill_value=np.finfo(np.float32).min,
)
imp.fit(X_train)

X_train = imp.transform(X_train)
X_val = imp.transform(X_val)
X_test = imp.transform(X_test)
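
In effect, this imputation replaces every NaN with a single, very negative constant. A plausible rationale, which is our reading rather than something established here, is that tree-based models can then isolate missing entries with a single split. The NumPy-only sketch below mimics the replacement on a made-up array without using scikit-learn:

```python
import numpy as np

# Sentinel far below any plausible feature value (same constant as above)
fill_value = np.finfo(np.float32).min

# Made-up feature matrix with two missing entries
X_toy = np.array([
    [12.5, np.nan],
    [np.nan, 3.0],
])

# Replace every NaN with the sentinel, leaving other entries untouched
X_filled = np.where(np.isnan(X_toy), fill_value, X_toy)
print(X_filled)
```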

We can now start training our Random Forest (as there are a lot of observations, over 1.8M, this can take a while):

In [16]:
from sklearn.ensemble import RandomForestClassifier
est = RandomForestClassifier(n_estimators=16, max_depth=10, n_jobs=-1)
est.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, n_estimators=16, n_jobs=-1)

As there are a lot of classes (over 17K), we need to be cautious when predicting the scores of the model.

Storing the full score matrix can easily take more than 5 GB of memory on the validation set.

For this reason, we will predict the top-30 sets in batches using the following generic function:

In [17]:
def batch_predict(predict_func, X, batch_size=1024):
    # Run predict_func on a single sample to infer the output width and dtype
    res = predict_func(X[:1])
    n_samples, n_outputs, dtype = X.shape[0], res.shape[1], res.dtype

    # Preallocate the result; only one batch of intermediate values is alive at a time
    preds = np.empty((n_samples, n_outputs), dtype=dtype)

    for i in range(0, len(X), batch_size):
        X_batch = X[i:i+batch_size]
        preds[i:i+batch_size] = predict_func(X_batch)

    return preds
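
To see why batching matters, here is a back-of-the-envelope estimate of the memory needed for the score matrix. It assumes 8-byte `float64` scores and uses 17,000 classes as a lower bound for the "over 17K" mentioned above:

```python
n_samples = 40080        # validation set size reported above
n_classes = 17_000       # lower bound, "over 17K" classes
bytes_per_score = 8      # assuming float64 scores

# Memory for scoring the whole validation set at once
full_bytes = n_samples * n_classes * bytes_per_score
# Memory for scoring a single batch of 1024 samples
batch_bytes = 1024 * n_classes * bytes_per_score

print("Full score matrix: {:.2f} GB".format(full_bytes / 1e9))
print("One batch of 1024: {:.2f} GB".format(batch_bytes / 1e9))
```

Under these assumptions the full matrix needs about 5.45 GB, against about 0.14 GB for one batch, while the stored top-30 sets only take 30 values per sample.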

We can now compute the top-30 error rate on the validation set:

In [18]:
def predict_func(X):
    y_score = est.predict_proba(X)
    s_pred = predict_top_30_set(y_score)
    return s_pred

s_val = batch_predict(predict_func, X_val, batch_size=1024)
score_val = top_k_error_rate_from_sets(y_val, s_val)
print("Top-30 error rate: {:.1%}".format(score_val))

Top-30 error rate: 80.4%
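
One point worth checking: in scikit-learn, the columns of `predict_proba` follow the order of `est.classes_`. If the top-30 sets hold column indices, they are positions in `est.classes_` and coincide with species ids only when the training labels are exactly `0, 1, ..., n_classes - 1`. A minimal sketch of the mapping, with made-up values standing in for `est.classes_` and for the predicted index sets:

```python
import numpy as np

# Made-up stand-in for est.classes_: species ids 1 and 4 are absent from training
classes = np.array([0, 2, 3, 5])

# Made-up top-2 column indices for two samples
top_idx = np.array([[1, 3], [0, 2]])

# Fancy indexing maps column positions back to the original labels
top_species = classes[top_idx]
print(top_species)
```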


We now predict the top-30 sets on the test data and save them in a submission file:

In [19]:
# Compute baseline on the test set
s_pred = batch_predict(predict_func, X_test, batch_size=1024)

# Generate the submission file
generate_submission_file(SUBMISSION_PATH / "random_forest_on_environmental_vectors.csv", df_obs_test.index, s_pred)