# Introduction to the RAMP platform and its interaction with scikit-learn

RAMP is a Kaggle-like platform used to run data science challenges. A challenge is organized around a specific problem for which the data and the evaluation are already defined, so participants only have to focus on developing the machine learning pipeline.

Here, we present how the RAMP platform works. RAMP relies on a data science problem formulated in `problem.py`, which defines both the data and the evaluation. If you are interested, you can open this file. Otherwise, we will only use a couple of the functions defined there.

In [1]:
import problem

We will load the training and testing datasets available for the challenge.

In [2]:
X_train, y_train = problem.get_train_data()

In [3]:
X_test, y_test = problem.get_test_data()

Briefly, we can check the type of features in `X` and the target `y` that we would like to predict.

In [4]:
X_train.head()

|   | DateOfDeparture | Departure | Arrival | WeeksToDeparture | std_wtd |
|---|---|---|---|---|---|
| 0 | 2012-06-19 | ORD | DFW | 12.875 | 9.812647 |
| 1 | 2012-09-10 | LAS | DEN | 14.285714 | 9.466734 |
| 2 | 2012-10-05 | DEN | LAX | 10.863636 | 9.035883 |
| 3 | 2011-10-09 | ATL | ORD | 11.48 | 7.990202 |
| 4 | 2012-02-21 | DEN | SFO | 11.45 | 9.517159 |


In [5]:
y_train[:5]

array([12.33129622, 10.77518151, 11.08317675, 11.16926784, 11.26936373])

The target `y` corresponds to a number of passengers (transformed using a `log` function). Associated with each target, we have information in `X` related to the date (`DateOfDeparture`) and the airports of departure (`Departure`) and arrival (`Arrival`). In addition, we have the mean (`WeeksToDeparture`) and standard deviation (`std_wtd`) of the time in weeks between the booking and the departure.

So we try to answer the following question: **Given some flight information between airports, can we predict the (log) flow of passengers?**

Let's make a basic scikit-learn model that uses some of the data in `X` to answer this question. We will create a factory function `get_estimator()` to return the model.

In [6]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

def get_estimator():
    cat_processor = OrdinalEncoder()
    cat_columns = ["Departure", "Arrival"]

    num_processor = "passthrough"
    num_columns = ["WeeksToDeparture", "std_wtd"]

    preprocessor = make_column_transformer(
        (cat_processor, cat_columns),
        (num_processor, num_columns),
        remainder="drop",  # drop the unused columns
    )

    return make_pipeline(preprocessor, RandomForestRegressor())
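To see what the preprocessing step does before the random forest receives the data, here is a minimal sketch applying the same column transformer to a tiny hand-made table. The `toy` DataFrame and its values are made up for illustration; only the column names come from the challenge data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows mimicking the columns of X_train (values are made up).
toy = pd.DataFrame({
    "DateOfDeparture": ["2012-06-19", "2012-09-10", "2012-10-05"],
    "Departure": ["ORD", "LAS", "DEN"],
    "Arrival": ["DFW", "DEN", "LAX"],
    "WeeksToDeparture": [12.0, 14.0, 10.0],
    "std_wtd": [9.0, 9.5, 9.0],
})

# Same preprocessing as in get_estimator(): encode the airports as integers,
# keep the two numerical columns untouched, and drop everything else.
preprocessor = make_column_transformer(
    (OrdinalEncoder(), ["Departure", "Arrival"]),
    ("passthrough", ["WeeksToDeparture", "std_wtd"]),
    remainder="drop",
)

encoded = preprocessor.fit_transform(toy)
print(encoded.shape)  # (3, 4): DateOfDeparture has been dropped
print(encoded)
```

Each airport code is replaced by the rank of its category in sorted order, so `DateOfDeparture` is the only column that the model never sees. Using the date would require an additional feature-engineering step.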

At this stage, we could train and test our model using scikit-learn:

In [7]:
import numpy as np
from sklearn.model_selection import cross_val_score

model = get_estimator()
cv = problem.get_cv(X_train, y_train)

scores = cross_val_score(
    model, X_train, y_train, cv=cv,
    scoring="neg_mean_squared_error",
)
rmse_scores = np.sqrt(-scores)
for cv_idx, score in enumerate(rmse_scores):
    print(f"CV Fold #{cv_idx}: {score:.3f}")
print(
    f"RMSE = {rmse_scores.mean():.3f} "
    f"+/- {rmse_scores.std():.3f}"
)

CV Fold #0: 0.854
CV Fold #1: 0.862
CV Fold #2: 0.851
CV Fold #3: 0.875
CV Fold #4: 0.864
CV Fold #5: 0.850
CV Fold #6: 0.846
CV Fold #7: 0.839
RMSE = 0.855 +/- 0.011
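The cell above takes the square root of `-scores` because scikit-learn scorers follow a "higher is better" convention, so the mean squared error is reported with a negative sign. The arithmetic can be checked on a few made-up numbers (the arrays below are illustrative and unrelated to the challenge data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, only to illustrate the arithmetic.
y_true = np.array([11.0, 12.0, 10.0, 13.0])
y_pred = np.array([11.5, 11.0, 10.0, 12.5])

mse = mean_squared_error(y_true, y_pred)
neg_mse = -mse            # what scoring="neg_mean_squared_error" would report
rmse = np.sqrt(-neg_mse)  # flip the sign back, then take the square root
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")  # MSE = 0.375, RMSE = 0.612
```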


RAMP was developed so that you do not have to run the last cell yourself. Instead, the idea is to store the content of the cell where `get_estimator` is defined in a file. For this challenge, the file is named `estimator.py`. You can check the content of the `submissions/starting_kit` directory.

Let's make our first kit. Alongside the three existing kits, create a folder `my_first_kit` in the folder `submissions`. The `%%writefile` cell below will create an `estimator.py` file in this directory.
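To give an intuition of how a tool can pick up `get_estimator` from a file, here is a minimal sketch using the standard-library `importlib`. This is not RAMP's actual loading code: `load_get_estimator` is a hypothetical helper, and the stand-in submission uses a `DummyRegressor` written to a temporary folder so that the example runs anywhere.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_get_estimator(path):
    """Import the Python file at `path` and return its `get_estimator` function."""
    spec = importlib.util.spec_from_file_location("estimator", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.get_estimator


# Stand-in submission (hypothetical content, not the starting kit).
source = (
    "from sklearn.dummy import DummyRegressor\n"
    "\n"
    "def get_estimator():\n"
    "    return DummyRegressor()\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "estimator.py"
    path.write_text(source)
    loaded_factory = load_get_estimator(path)
    model = loaded_factory()  # build the model exactly as the file defines it

print(type(model).__name__)  # DummyRegressor
```

The point of this design is that the submission only has to expose one function with a fixed name; training and scoring can then be driven from outside the file.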
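The folder can also be created from Python rather than by hand, which matters because `%%writefile` is expected to fail when the target folder does not exist. Below is a minimal sketch; `make_submission_dir` is a hypothetical helper, not part of RAMP, and it is demonstrated in a temporary directory so that nothing is written to the project.

```python
import tempfile
from pathlib import Path


def make_submission_dir(name, root="submissions"):
    """Create the folder `root/name` if it does not exist and return its path."""
    path = Path(root) / name
    path.mkdir(parents=True, exist_ok=True)  # no error if it is already there
    return path


# Demonstration in a temporary directory; in the notebook, the call would
# simply be make_submission_dir("my_first_kit").
with tempfile.TemporaryDirectory() as tmp:
    kit_dir = make_submission_dir("my_first_kit", root=Path(tmp) / "submissions")
    created = kit_dir.is_dir()

print(kit_dir.name, created)  # my_first_kit True
```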

In [8]:
%%writefile submissions/my_first_kit/estimator.py

from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

def get_estimator():
    cat_processor = OrdinalEncoder()
    cat_columns = ["Departure", "Arrival"]

    num_processor = "passthrough"
    num_columns = ["WeeksToDeparture", "std_wtd"]

    preprocessor = make_column_transformer(
        (cat_processor, cat_columns),
        (num_processor, num_columns),
        remainder="drop",  # drop the unused columns
    )

    return make_pipeline(preprocessor, RandomForestRegressor())

Overwriting submissions/my_first_kit/estimator.py


Once `estimator.py` has been created, we can use the `ramp-test` command to automatically run the evaluation on our newly created estimator.

In [9]:
!ramp-test --submission my_first_kit

[38;5;178m[1mTesting Number of air passengers prediction[0m
[38;5;178m[1mReading train and test files from ./data ...[0m
[38;5;178m[1mReading cv ...[0m
[38;5;178m[1mTraining submissions/my_first_kit ...[0m
[38;5;178m[1mCV fold 0[0m
	[38;5;178m[1mscore   rmse      time[0m
	[38;5;10m[1mtrain[0m  [38;5;10m[1m0.323[0m  [38;5;150m1.898484[0m
	[38;5;12m[1mvalid[0m  [38;5;12m[1m0.854[0m  [38;5;105m0.174880[0m
	[38;5;1m[1mtest[0m   [38;5;1m[1m0.886[0m  [38;5;218m0.055788[0m
[38;5;178m[1mCV fold 1[0m
	[38;5;178m[1mscore   rmse      time[0m
	[38;5;10m[1mtrain[0m  [38;5;10m[1m0.316[0m  [38;5;150m1.230795[0m
	[38;5;12m[1mvalid[0m  [38;5;12m[1m0.863[0m  [38;5;105m0.164316[0m
	[38;5;1m[1mtest[0m   [38;5;1m[1m0.879[0m  [38;5;218m0.054035[0m
[38;5;178m[1mCV fold 2[0m
	[38;5;178m[1mscore   rmse      time[0m
	[38;5;10m[1mtrain[0m  [38;5;10m[1m0.321[0m  [38;5;150m1.292139[0m
	[38;5;12m[1mvalid[0m  [38;5;12m[1m0.847

This command replicates exactly what happens on our server when you submit your `estimator.py` file on https://ramp.studio.

We can now demonstrate how to submit this kit on the RAMP platform.