## Only for Colab users

To run this notebook on Colab, click the badge below.


{{ badge }}

The setup cells also fetch the needed files from the repo and install the notebook's [dependencies](https://github.com/mnm-rnd/competitions/blob/master/zindi/airqo-ugandan-air-quality-forecast-challenge/requirements.txt).

In [3]:
!git clone https://github.com/mnm-rnd/competitions.git
!mv ./competitions/zindi/airqo-ugandan-air-quality-forecast-challenge/mlod ./mlod
!mv ./competitions/zindi/airqo-ugandan-air-quality-forecast-challenge/requirements.txt ./requirements.txt
!rm -rf ./competitions

# AirQo Ugandan Air Quality Forecast Challenge

This notebook contains a reformatted version of our code, organized so that it is easy to set up and run.
Most of the code abstractions live in our `mlod` package, which should be included with this notebook.


In [None]:
# Installing the requirements
!pip install -r ./requirements.txt

## Init Steps

This section sets up the data from `Zindi` used in the competition.

### Setting up the data

Please upload the train and test files to the `./data` path inside the workspace folder, then rerun the cell below until it raises no errors.
Make sure the uploaded files are the `Train.csv` and `Test.csv` used in the competition.

In [5]:
from pathlib import Path

train_file_csv = Path('./data/Train.csv')
test_file_csv = Path('./data/Test.csv')

# make sure the Train file exists
assert train_file_csv.exists(), 'Make sure the Train csv file exists at the path "%s"' % train_file_csv

# make sure the Test file exists
assert test_file_csv.exists(), 'Make sure the Test csv file exists at the path "%s"' % test_file_csv

## Actual sequence of processes

### Initiating different processes

Necessary steps before training

In [6]:
import random
import numpy as np

# using our chosen seed number for reproducibility
from mlod import SEED_NUMBER as MLOD_SEED_NUMBER

# Setting the seed
random.seed(MLOD_SEED_NUMBER)
np.random.seed(MLOD_SEED_NUMBER)

### Load and preprocess data

In [7]:
import pandas as pd

## Fetching the data
train_df = pd.read_csv(train_file_csv)
test_df = pd.read_csv(test_file_csv)

TEST_IDS = test_df['ID']


### Preprocessing the data

[Low-level Preprocessing]<br />
The `mlod.preprocessors.*` classes preprocess the data in the following ways:
- Expanding the data so that each row holds atomic values, which makes the data **grow** in size
- Performing **special** feature engineering, including:
    - Computing a cyclic representation of selected index (idx) features
    - Using wind speed (`wind_spd`) and direction (`wind_dir`) to obtain the
        Cartesian components (`u` and `v`) of the wind variable
    - Adding lag features
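
The steps listed above can be sketched on a toy frame. This is only an illustration, not the `mlod` implementation: it assumes each raw cell holds a comma-separated series, that `wind_dir` is in degrees measured with the mathematical convention (`u = speed * cos`, `v = speed * sin`), and the helper names and the `hour_idx` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy input: one row per ID, each cell a comma-separated series (assumption).
raw = pd.DataFrame({
    "ID": ["a", "b"],
    "wind_spd": ["1.0,2.0,3.0", "4.0,5.0,6.0"],
    "wind_dir": ["0,90,180", "270,0,90"],
})

def explode_rows(df, cols):
    """Expand each series cell into one row per value (the data grows)."""
    parts = []
    for _, row in df.iterrows():
        values = {c: [float(v) for v in row[c].split(",")] for c in cols}
        part = pd.DataFrame(values)
        part.insert(0, "ID", row["ID"])
        # position of the value inside its original series (hypothetical name)
        part["hour_idx"] = np.arange(len(part))
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

def add_cyclic(df, col, period):
    """Encode a periodic index as sine/cosine so the ends of a cycle meet."""
    angle = 2 * np.pi * df[col] / period
    df[col + "_sin"] = np.sin(angle)
    df[col + "_cos"] = np.cos(angle)
    return df

def add_wind_components(df):
    """Convert speed and direction (degrees) into Cartesian u/v components."""
    theta = np.deg2rad(df["wind_dir"])
    df["u"] = df["wind_spd"] * np.cos(theta)
    df["v"] = df["wind_spd"] * np.sin(theta)
    return df

def add_lag(df, col, lag):
    """Add the value observed `lag` steps earlier within the same ID."""
    df[f"{col}_lag{lag}"] = df.groupby("ID")[col].shift(lag)
    return df

long_df = explode_rows(raw, ["wind_spd", "wind_dir"])
long_df = add_cyclic(long_df, "hour_idx", period=24)
long_df = add_wind_components(long_df)
long_df = add_lag(long_df, "wind_spd", lag=1)
```

The lag is computed per `ID` so that values never leak from one original row into another; the first value of every series therefore has no lag and is left as `NaN`.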

### Preprocess + Model Training

Because we use an ensemble, the data is preprocessed differently for each model. The code below therefore pairs each `Model` with its appropriate `PreProcessor`.

Predictions from the first model are fed back into the data before training the second model.
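
As a rough sketch of this two-stage idea, the example below produces out-of-fold (OOF) predictions with `GroupKFold` and appends them as a feature for a second stage. It uses synthetic data and scikit-learn's `LinearRegression` as a stand-in for the first model; the actual notebook uses the `mlod` wrappers around LightGBM and CatBoost, whose internals are not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# Synthetic data (assumption): 12 groups of 5 rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)
groups = np.repeat(np.arange(12), 5)

# Every row gets a prediction from a model that never saw its group.
oof = np.full(len(y), np.nan)
for train_idx, valid_idx in GroupKFold(n_splits=3).split(X, y, groups):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict(X[valid_idx])

# RMSE of the OOF predictions, an estimate of first-stage generalization.
oof_rmse = float(np.sqrt(np.mean((y - oof) ** 2)))

# The second stage sees the original features plus the OOF column.
X_stage2 = np.column_stack([X, oof])
```

Feeding OOF predictions (rather than in-sample predictions) to the second model keeps the new feature from simply memorizing the training target.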

#### 1: LightGBM + Version 1 Pre Processing

This first approach uses our `MlodPreProcessor` with a `LightGBM` model.

In [8]:
from mlod.preprocessors import MlodPreProcessor

mlod_preprocessor = MlodPreProcessor()
mlod_pp_opts = dict(cols_to_retain=['ID'])
x_train, y_train = mlod_preprocessor.process(train_df, **mlod_pp_opts)

100%|██████████████████████████████████████████████████████████████████████████| 15539/15539 [00:10<00:00, 1415.11it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:11<00:00,  3.88s/it]


In [9]:
# Training the LightGBM
# ------------------------------
import pandas as pd
from mlod.models import LGBModel
from sklearn.model_selection import GroupKFold

lgb_model = LGBModel('airqo')

In [10]:
x_train_ids = x_train.pop('ID')
fold_group = x_train['day_idx'].astype(str) + '_' + x_train['24hr_idx'].astype(str)

In [11]:
assert 'ID' not in x_train.columns, 'Make sure ID is NOT in the columns'

In [12]:
# performing evaluation using cross validation
lgb_eval_out = lgb_model.train(x_train, y_train, cv=True, kfold=GroupKFold, group=fold_group, n_splits=3)

# Save the OOF predictions to feed to the next model
df_to_feed = pd.DataFrame.from_dict({ 'ID': x_train_ids, 'oof': lgb_eval_out['oof'] })

save_path = './lgb_eval.csv'
df_to_feed.to_csv(save_path)
print('Saving the OOF values to path: {}'.format(save_path))

Training until validation scores don't improve for 1000 rounds
[2000]	training's rmse: 19.4028	valid_1's rmse: 24.4098
[4000]	training's rmse: 15.919	valid_1's rmse: 23.4244
[6000]	training's rmse: 13.8079	valid_1's rmse: 23.0922
[8000]	training's rmse: 12.3078	valid_1's rmse: 22.9697
[10000]	training's rmse: 11.2047	valid_1's rmse: 22.9154
Early stopping, best iteration is:
[10836]	training's rmse: 10.8278	valid_1's rmse: 22.909
Training until validation scores don't improve for 1000 rounds
[2000]	training's rmse: 19.4194	valid_1's rmse: 24.1375
[4000]	training's rmse: 15.9905	valid_1's rmse: 23.1524
[6000]	training's rmse: 13.8764	valid_1's rmse: 22.8295
[8000]	training's rmse: 12.3868	valid_1's rmse: 22.7106
[10000]	training's rmse: 11.2997	valid_1's rmse: 22.674
Early stopping, best iteration is:
[10080]	training's rmse: 11.2594	valid_1's rmse: 22.6723
Training until validation scores don't improve for 1000 rounds
[2000]	training's rmse: 19.5012	valid_1's rmse: 24.0629
[4000]	train

In [13]:
import lightgbm as lgb

# training the model on the full set
lgb_model.train(x_train, y_train, cv=False)

# save the model
lgb_model.model.save_model('./lgb-airqo')

<lightgbm.basic.Booster at 0x22c3ad41c40>

#### 2: CatBoost + Version 2 Pre Processing

This approach uses our `AirQoPreProcessor` with a `CatBoostModel`.

In [14]:
## Training the CatBoost Model
# ------------------------------
from mlod.preprocessors import AirQoPreProcessor

airqo_preprocessor = AirQoPreProcessor()

airqo_pp_opts = dict(cols_to_retain=['ID'])
x_train, y_train = airqo_preprocessor.process(train_df, **airqo_pp_opts)

100%|████████████████████████████████████████████████████████████████████████████████| 121/121 [00:05<00:00, 21.63it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  8.68it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:06<00:00,  1.08s/it]


In [15]:
# Averaging predictions from LightGBM to obtain the mean value of the target
train_feed = df_to_feed.groupby('ID').mean()

# Add the averaged values to the data before training the new model
x_train = x_train.join(train_feed, on='ID')

# drop ID column after joining
del x_train['ID']
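
A small illustration of the average-then-join step, on made-up values: when the first model predicts several rows per `ID`, the predictions are averaged per `ID` and then attached to a frame keyed by `ID`. The frames and the `feat` column below are hypothetical.

```python
import pandas as pd

# Several first-stage predictions per ID (made-up numbers).
df_oof = pd.DataFrame({
    "ID": ["a", "a", "b", "b"],
    "oof": [10.0, 20.0, 30.0, 50.0],
})

# One averaged value per ID; the result is indexed by ID.
per_id = df_oof.groupby("ID").mean()

# Second-stage features; row order does not need to match.
features = pd.DataFrame({"ID": ["b", "a"], "feat": [1.0, 2.0]})

# `on="ID"` matches the column in `features` against the index of `per_id`.
joined = features.join(per_id, on="ID")
```

Any `ID` missing from the averaged frame would end up with `NaN` in the `oof` column, so it is worth checking for missing values after the join.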

In [16]:
from mlod.models import CatBoostModel
from sklearn.model_selection import KFold

cb_model = CatBoostModel('airqo')

In [17]:
# performing cross validation training
cb_eval_out = cb_model.train(x_train, y_train, cv=True, store_cv_models=True, kfold=KFold, n_splits=50)

0:	learn: 37.0652628	test: 37.0652628	test1: 40.7618000	best: 40.7618000 (0)	total: 148ms	remaining: 24m 42s
20:	learn: 20.1991491	test: 20.1991491	test1: 21.4047649	best: 21.4047649 (20)	total: 1.05s	remaining: 8m 20s
40:	learn: 19.3355316	test: 19.3355316	test1: 20.9389828	best: 20.9325021 (35)	total: 1.93s	remaining: 7m 48s
60:	learn: 18.5942429	test: 18.5942429	test1: 20.8765409	best: 20.8628185 (50)	total: 2.8s	remaining: 7m 35s
Stopped by overfitting detector  (10 iterations wait)

bestTest = 20.86281845
bestIteration = 50

Shrink model to first 51 iterations.
0:	learn: 37.1597171	test: 37.1597171	test1: 35.5353855	best: 35.5353855 (0)	total: 48.3ms	remaining: 8m 2s
20:	learn: 20.2752787	test: 20.2752787	test1: 18.9412098	best: 18.8592769 (18)	total: 932ms	remaining: 7m 22s
Stopped by overfitting detector  (10 iterations wait)

bestTest = 18.85927694
bestIteration = 18

Shrink model to first 19 iterations.
0:	learn: 37.1441632	test: 37.1441632	test1: 35.9132467	best: 35.9132467 (

### Ensemble Prediction

Because the models are stacked, prediction also has to happen in two stages:
the output of `lgb_model` is used as an input to `cb_model`.

Below is a class to facilitate this process.

In [18]:
import pandas as pd
import numpy as np
from tqdm import tqdm

from mlod.models import Model
from mlod.preprocessors import PreProcessor
from typing import Tuple

import logging
logger = logging.getLogger('mlod')

class EnsemblePredictor:
    def __init__(self, 
                 trained_lgb_model: Model, 
                 cv_trained_cb_model: Model, 
                 lgb_pp_opts: Tuple[PreProcessor, dict], 
                 cb_pp_opts: Tuple[PreProcessor, dict]):
        
        # Checks if the models are trained
        assert trained_lgb_model.model is not None, "the lgb model is not trained"
        assert cv_trained_cb_model.is_cv_trained, "the cb model needs to be trained by cross validation"
        
        self.lgb = trained_lgb_model
        self.cb = cv_trained_cb_model
        
        # load up the preprocessor and config used in LGB model
        lgb_pp, lgp_opts = lgb_pp_opts
        self.lgb_pp = lgb_pp
        self.lgp_opts = lgp_opts
        
        # load up the preprocessor and config used in CatBoost model
        cb_pp, cb_opts = cb_pp_opts
        self.cb_pp = cb_pp
        self.cb_opts = cb_opts
        
    def predict(self, x: pd.DataFrame) -> np.ndarray:
        
        # pre-process like lgb
        x_out_lgb = self.lgb_pp.process(x.copy(), test=True, **self.lgp_opts)
        x_ids = x_out_lgb.pop('ID')
        
        # pre-process like cb
        x_out_cb = self.cb_pp.process(x.copy(), test=True, **self.cb_opts)
        
        logger.info('Making prediction using base model')
        # output for the lgb + merge with x_out_cb
        to_merge = pd.DataFrame.from_dict({ 'ID': x_ids, 'oof': self.lgb.predict(x_out_lgb) })
        
        # mean merge the values
        to_merge = to_merge.groupby('ID').mean()
        x_out_cb = x_out_cb.join(to_merge, on='ID')
        
        # remove ID col + empty unneeded data
        del x_out_cb['ID']
        del to_merge
        
        # store the list of predictions
        ls_preds = []
        
        logger.info('Making prediction using each %d cv models' % len(self.cb.cv_models))
        # get the models used in the cross validations
        for cv_model in tqdm(self.cb.get_cv_models()):
            # make prediction using combined values with the cb model
            pred = cv_model.predict(x_out_cb)
            ls_preds.append(pred)
        
        # compute the mean of the predictions of 
        #  the cross validation models
        return np.mean(ls_preds, 0)
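
The final averaging step can be checked in isolation. The sketch below uses a hypothetical `OffsetModel` class in place of the fitted fold models; only the `np.mean(..., axis=0)` call mirrors the code above.

```python
import numpy as np

class OffsetModel:
    """Hypothetical stand-in for one fitted cross-validation model."""

    def __init__(self, offset):
        self.offset = offset

    def predict(self, x):
        # Row sum plus a per-model offset, just to get differing outputs.
        return np.asarray(x).sum(axis=1) + self.offset

cv_models = [OffsetModel(o) for o in (-1.0, 0.0, 1.0)]
x = np.array([[1.0, 2.0], [3.0, 4.0]])

# One prediction vector per fold model, each of shape (n_rows,).
fold_preds = [m.predict(x) for m in cv_models]

# Averaging over axis 0 collapses the model axis and keeps one value per row.
mean_pred = np.mean(fold_preds, axis=0)
```

Averaging over axis 0 matters here: with the default `axis=None`, `np.mean` would return a single scalar instead of one prediction per row.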

Using this `EnsemblePredictor` and saving predictions

In [20]:
import numpy as np
from mlod.file_utils import PredictionStorage

# Building the ensemble predictor
predictor = EnsemblePredictor(
                    lgb_model, 
                    cb_model, 
                    (mlod_preprocessor, mlod_pp_opts),
                    (airqo_preprocessor, airqo_pp_opts)
                )

y_test = predictor.predict(test_df)

100%|████████████████████████████████████████████████████████████████████████████| 5035/5035 [00:03<00:00, 1531.09it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.23s/it]
100%|████████████████████████████████████████████████████████████████████████████████| 121/121 [00:02<00:00, 55.02it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 28.83it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:02<00:00,  2.71it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 113.42it/s]


In [21]:
# Store the results for submission
mean_rmse = np.mean([cb_eval_out['rmse'], lgb_eval_out['rmse']])

out_df = pd.DataFrame.from_dict(dict(ID=TEST_IDS.values, target=y_test)).set_index('ID')
out_df.to_csv(f'./airqo_sub{mean_rmse}.csv')

The file to upload should be named `airqo_subXXX.csv`, where XXX is the mean RMSE of the two models' OOF predictions (`mean_rmse` above).
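
For reference, a minimal sketch of how an RMSE and a matching file name could be computed by hand. The values are made up, and rounding to four decimals is a choice made here for readability; the cell above writes the unrounded value.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and OOF predictions.
score = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])

# Rounded score embedded in the submission file name.
file_name = f"airqo_sub{score:.4f}.csv"
```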