# Driverless AI Timeseries & NLP Demo
## See Click Predict Fix Kaggle Competition

In this notebook, we will see how to use the Driverless AI Python client to submit a baseline model to the See Click Predict Fix Kaggle competition.

Our very first model should score in the silver zone. With some additional tweaks you could reach the gold zone.

In [1]:
%matplotlib inline
import os
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import datetime
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import getpass

In [2]:
# !pip install driverlessai==1.9.0.2
import driverlessai

## Download the Data from Kaggle
https://www.kaggle.com/c/see-click-predict-fix

## Quick Overview
The purpose of the competition was to quantify and predict how people will react to a specific 311 issue. What makes an issue urgent? What do citizens really care about? How much does location matter? Being able to predict the most pressing topics will allow governments to focus their efforts on fixing the most important problems.

The competition dataset contains several hundred thousand issues from four US cities.

![](imgs/mapbox.png)

In [3]:
start = datetime.datetime.now()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sampleSubmission.csv')
train.shape, test.shape, sample_submission.shape
targets = ['num_views', 'num_votes', 'num_comments']

((223129, 11), (149575, 8), (149575, 4))

In [4]:
'-------- TRAIN ----------'
train.head(2)
'-------- TEST ----------'
test.head(2)
'-------- SAMPLE SUBMISSION ----------'
sample_submission.head(2)

'-------- TRAIN ----------'

| | id | latitude | longitude | summary | description | num_votes | num_comments | num_views | source | created_time | tag_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 368683 | 37.590139 | -77.456841 | Alleyway light out. | There is a streetlight lamp out in the alleywa... | 4 | 0 | 62 | New Map Widget | 2012-01-01 01:20:08 | street_light |
| 1 | 77642 | 37.541534 | -77.451985 | brick side walk has sink hole | bricks are falling into deep hole. please rep... | 2 | 2 | 28 | | 2012-01-01 03:18:40 | pothole |


'-------- TEST ----------'

| | id | latitude | longitude | summary | description | source | created_time | tag_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 21523 | 41.913652 | -87.70605 | Graffiti Removal | | remote_api_created | 2013-05-01 00:13:47 | |
| 1 | 87152 | 41.913646 | -87.706479 | Graffiti Removal | | remote_api_created | 2013-05-01 00:14:57 | |


'-------- SAMPLE SUBMISSION ----------'

| | id | num_views | num_votes | num_comments |
|---|---|---|---|---|
| 0 | 21523 | 0 | 0 | 0 |
| 1 | 87152 | 0 | 0 | 0 |


In [5]:
# To reproduce the scatter maps you may need a Mapbox account and a public Mapbox Access Token.
# See more at https://plotly.com/python/scattermapbox/
if os.path.exists('.mapbox_token'):
    px.set_mapbox_access_token(open('.mapbox_token').read())
    geo = train.round(2).groupby(['latitude', 'longitude']).count().reset_index()
    fig = px.scatter_mapbox(
        geo, lat="latitude", lon="longitude",
        color_continuous_scale=px.colors.cyclical.IceFire, zoom=3,
        title='Four US Cities'
    )
    fig.show()
    fig = px.scatter_mapbox(
        geo, lat="latitude", lon="longitude", color="id", opacity=0.7,
        center=go.layout.mapbox.Center(lat=41.81, lon=-87.6),
        color_continuous_scale=px.colors.cyclical.IceFire, zoom=8,
        title='Issues in Chicago'
    )
    fig.show()

This compact dataset is actually quite complex. We have
* **numeric** features (*latitude*, *longitude*) for the location of the issue,
* raw **text** *description* and *summary*,
* an important **time** (*created_time*) dimension,
* **categorical** (*source*, *tag_type*) features as well.

Besides the mix of data types, the records can also have missing values.
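A quick way to see which columns are affected is a per-column missing-value report. The snippet below is a minimal sketch on a toy frame: the `missing_report` helper is a hypothetical name, and the toy values are only illustrative, not statistics from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few competition columns (illustrative values only).
toy = pd.DataFrame({
    "latitude": [37.59, 37.54, 41.91],
    "summary": ["Alleyway light out.", "brick side walk has sink hole", "Graffiti Removal"],
    "description": ["There is a streetlight lamp out", "bricks are falling", np.nan],
    "source": ["New Map Widget", np.nan, "remote_api_created"],
    "tag_type": ["street_light", "pothole", np.nan],
})


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing share for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean().round(3),
    })


report = missing_report(toy)
print(report)
```

Running `missing_report(train)` and `missing_report(test)` on the real files would show whether the same columns are sparse in both.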

#### Evaluation

For each issue in the test set, we need to predict the **number of views, votes, and comments**.

The competition used Root Mean Squared Logarithmic Error (**RMSLE**) to measure the accuracy.
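As a reminder, RMSLE is the square root of the mean squared difference between `log(prediction + 1)` and `log(actual + 1)`. The function below is a minimal sketch of that generic definition in NumPy; it is not taken from the competition's scoring code, and `rmsle` is just an illustrative name.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(x + 1) and stays accurate for small x
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


print(rmsle([0, 1, 10], [0, 1, 10]))  # perfect predictions
print(rmsle([0, 9, 99], [1, 9, 99]))  # one prediction off by a single view
```

Because the error is measured on a log scale, being off by one on a rarely viewed issue costs far more than being off by one on a popular issue.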

### Log-transform target variables

With an **RMSLE** objective, a common trick is to log-transform the target variables with log(x + 1). Minimizing **RMSE** on the transformed targets is then equivalent to minimizing RMSLE on the original ones.

In [6]:
log_train = train.copy()
for t in targets:
    # log1p(x) = log(x + 1); inverted later with exp(x) - 1
    log_train[t] = np.log1p(train[t])
log_train.to_csv('log_train.csv', index=False)
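To see why the transform matters, we can compare two constant predictions on a skewed synthetic target: the plain mean (optimal for RMSE in the original space) and the mean of the log-transformed values mapped back (optimal for RMSE in log space). This is only a sketch on randomly generated data, not on the competition targets.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """RMSLE; y_pred may be a scalar and is broadcast against y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


rng = np.random.default_rng(0)
y = rng.lognormal(mean=1.0, sigma=1.0, size=5000)  # skewed synthetic target

const_raw = y.mean()                      # best constant for RMSE on raw values
const_log = np.expm1(np.log1p(y).mean())  # best constant for RMSE on log1p values

print(f"raw-space constant: RMSLE = {rmsle(y, const_raw):.4f}")
print(f"log-space constant: RMSLE = {rmsle(y, const_log):.4f}")
```

The log-space constant always scores at least as well under RMSLE, which is the same effect we rely on when training the models on `log_train.csv` with the RMSE scorer.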

## Connect to Driverless AI

Make sure to use the correct `address` and `username`.

In [7]:
address = 'http://52.87.241.250:12345'
username = 'h2oai'

In [8]:
dai = driverlessai.Client(
    address=address,
    username=username,
    password=getpass.getpass("Enter Driverless AI password: "))

Enter Driverless AI password: ········


## Create the Datasets in DAI

In [9]:
datasets = dai.datasets.list()
dataset_names = [d.name for d in datasets]

In [10]:
# Upload each file only if a dataset with that name does not exist yet
if 'scpf_train_log' not in dataset_names:
    _ = dai.datasets.create('log_train.csv', name='scpf_train_log')
if 'scpf_test' not in dataset_names:
    _ = dai.datasets.create('test.csv', name='scpf_test')

# Refresh the list and pick up the dataset handles by name
datasets = dai.datasets.list()
train_dataset = [d for d in datasets if d.name == 'scpf_train_log'][0]
test_dataset = [d for d in datasets if d.name == 'scpf_test'][0]

In [11]:
train_dataset.shape, test_dataset.shape

((223129, 11), (149575, 8))

## Create Experiments with GUI

We could use the UI and select
* the previously uploaded 'scpf_train_log' and 'scpf_test' datasets
* `RMSE` as the main loss function to optimize for
* `created_time` as the time column
* `num_views` as the target
* and give a name to the experiment

![](imgs/create_experiment.png)


## Create Experiments with Python Client

Since we need to train three models, one per target, it is more convenient to create the experiments with the Python client. This also makes it easier to collect the predictions.


In [12]:
settings = {
    'task': 'regression',
    'scorer': 'RMSE',
    'time_column': 'created_time',
    'train_dataset': train_dataset,
    'test_dataset': test_dataset,
}

In [13]:
dai.experiments.preview(target_column='num_views', **settings)

ACCURACY [7/10]:
- Training data size: *223,129 rows, 11 cols*
- Feature evolution: *[Constant, LightGBM, XGBoostGBM, ZeroInflatedLightGBM, ZeroInflatedXGBoost]*, *up to 4 time-based validation split(s)*
- Final pipeline: *One of [Constant, LightGBM, XGBoostGBM, ZeroInflatedLightGBM, ZeroInflatedXGBoost], single final model, validated during feature evolution with up to 4 time-based back-testing windows*

TIME [7/10]:
- Feature evolution: *8 individuals*, up to *192 iterations*
- Early stopping: After *15* iterations of no improvement

INTERPRETABILITY [5/10]:
- Feature pre-pruning strategy: None
- Monotonicity constraints: disabled
- Feature engineering search space: [CVCatNumEncode, CatOriginal, Cat, DateOriginal, DateTimeOriginal, Dates, EwmaLags, Frequent, Interactions, IsHoliday, LagsAggregates, LagsInteraction, Lags, OneHotEncoding, Original, TextLinModel, TextOriginal, Text]

[Constant, LightGBM, XGBoostGBM, ZeroInflatedLightGBM, ZeroInflatedXGBoost] models to train:
- Target tr

In [14]:
launch_experiments = False
if launch_experiments:
    for t in targets:
        _ = dai.experiments.create_async(
            target_column=t,
            name=f'log_{t}',
            force=True,
            **settings
        )

In [15]:
all_experiments = dai.experiments.list()
all_experiments[:3]

[<class 'driverlessai._experiments.Experiment'> 53aa2dd4-fd89-11ea-9704-0242ac110002 log_num_comments,
 <class 'driverlessai._experiments.Experiment'> 51e113c8-fd89-11ea-9704-0242ac110002 log_num_votes,
 <class 'driverlessai._experiments.Experiment'> 50863abc-fd89-11ea-9704-0242ac110002 log_num_views]

## Let's wait for the experiments


![](imgs/coffee.gif)

## Download Predictions
Training the models took me 1-2 hours. For quicker results we could set a lower `time` value in `settings`.
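The `settings` dictionary used above does not set the dials explicitly. The sketch below shows one way a faster variant could look. It assumes the client accepts the dial values as `accuracy`, `time` and `interpretability` keyword settings (integers from 1 to 10), so please check the client documentation for your version. The dataset entries are placeholders so the snippet runs without a Driverless AI connection, and `quick_settings` is a hypothetical name.

```python
# Base settings from the notebook; the dataset handles are replaced by
# placeholders so this snippet runs without a Driverless AI connection.
base_settings = {
    "task": "regression",
    "scorer": "RMSE",
    "time_column": "created_time",
    "train_dataset": None,  # placeholder for train_dataset
    "test_dataset": None,   # placeholder for test_dataset
}

# Assumption: the dials are passed as `accuracy`, `time` and
# `interpretability` keyword settings with values from 1 to 10.
quick_settings = {**base_settings, "accuracy": 5, "time": 2, "interpretability": 5}

print(quick_settings)
```

Passing such a dictionary to `dai.experiments.preview(target_column='num_views', **quick_settings)` first should show how the plan changes before any experiment is launched.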

In [16]:
experiments = {}
for t in targets:
    experiments[t] = [ex for ex in all_experiments if ex.name == f'log_{t}'][0]
    f'{t} completed successfully: {experiments[t].is_complete()}'


'num_views completed successfully: True'

'num_votes completed successfully: True'

'num_comments completed successfully: True'

In [17]:
test_prediction_paths = []
for t in targets:
    prediction = experiments[t].predict(test_dataset, include_columns=['id'])
    path = prediction.download(dst_dir='predictions')
    test_prediction_paths.append(path)

Complete
Downloaded 'predictions/50863abc-fd89-11ea-9704-0242ac110002_preds_a1b2976e.csv'
Complete
Downloaded 'predictions/51e113c8-fd89-11ea-9704-0242ac110002_preds_33f09b77.csv'
Complete
Downloaded 'predictions/53aa2dd4-fd89-11ea-9704-0242ac110002_preds_6826f0e2.csv'


In [18]:
predictions = pd.concat([
    pd.read_csv(path).set_index('id') for path in test_prediction_paths
], axis=1)[targets]
predictions.head()

| id | num_views | num_votes | num_comments |
|---|---|---|---|
| 21523 | 1.1e-05 | 0.697567 | 0.000925 |
| 87152 | 7.1e-05 | 0.694944 | 0.000925 |
| 182789 | 9.5e-05 | 0.697885 | 0.000925 |
| 312571 | 4e-06 | 0.692584 | 0.008114 |
| 246776 | 4e-06 | 0.695687 | 0.000925 |


## Create Submission

Since we log-transformed the targets, we need to transform the predictions back to the original space.

In [19]:
for t in targets:
    # expm1(x) = exp(x) - 1 inverts the log(x + 1) transform of the targets
    predictions[t] = np.expm1(predictions[t])
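A model trained in log space can occasionally predict slightly negative values, which would turn into negative counts after the back-transform. The sketch below shows the inverse transform with an optional clip at zero on toy predictions. The values are made up for illustration, and the clipping is an extra safeguard that the notebook itself does not apply.

```python
import numpy as np
import pandas as pd

targets = ["num_views", "num_votes", "num_comments"]

# Toy predictions in log1p space (illustrative values, one slightly negative).
log_preds = pd.DataFrame(
    {"num_views": [0.0, 1.5], "num_votes": [0.69, -0.01], "num_comments": [0.001, 0.3]},
    index=pd.Index([1, 2], name="id"),
)

# expm1 inverts log1p; clipping keeps the predicted counts non-negative.
counts = np.expm1(log_preds[targets]).clip(lower=0)
print(counts)
```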

Our `predictions` DataFrame is ready for submission.

In [20]:
sample_submission.head(2)
predictions.head(2)

| | id | num_views | num_votes | num_comments |
|---|---|---|---|---|
| 0 | 21523 | 0 | 0 | 0 |
| 1 | 87152 | 0 | 0 | 0 |


| id | num_views | num_votes | num_comments |
|---|---|---|---|
| 21523 | 1.1e-05 | 1.008859 | 0.000925 |
| 87152 | 7.1e-05 | 1.003596 | 0.000925 |


In [21]:
predictions.to_csv('first_submission.csv')
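Before uploading, it can be worth checking that the file layout matches the sample submission. The helper below is a minimal sketch with a hypothetical name (`check_submission`). It assumes, as in this notebook, that the predictions are indexed by `id` and that the sample submission has `id` as its first column. The toy frames reuse the two rows shown above.

```python
import pandas as pd


def check_submission(predictions: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if predictions do not match the sample layout."""
    out = predictions.reset_index()  # bring the `id` index back as a column
    assert list(out.columns) == list(sample.columns), "column names or order differ"
    assert len(out) == len(sample), "row count differs"
    assert set(out["id"]) == set(sample["id"]), "ids differ"
    assert not out.isna().any().any(), "missing predictions"
    assert (out.drop(columns="id") >= 0).all().all(), "negative predictions"


sample = pd.DataFrame(
    {"id": [21523, 87152], "num_views": [0, 0], "num_votes": [0, 0], "num_comments": [0, 0]}
)
predictions = pd.DataFrame(
    {
        "num_views": [1.1e-05, 7.1e-05],
        "num_votes": [1.008859, 1.003596],
        "num_comments": [0.000925, 0.000925],
    },
    index=pd.Index([21523, 87152], name="id"),
)

check_submission(predictions, sample)
print("submission layout looks fine")
```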

## Submit to Kaggle

My very first submission scored 0.30897 on the private leaderboard, which would place it in the silver zone. It actually beat my original submission!

![](imgs/scpf_lb_progress.png)

Of course, I could not stop here and wanted to improve the model further. Within a day (90% computation time, 10% tweaking the expert settings) I was able to reach the gold zone.

## Hints for Further Experiments
* With a suitable GPU we could try advanced transformers or NLP models (e.g. BERT) in the expert settings
* Increasing `Time` and `Accuracy` could lead to better final models
* The top teams reported better results when they trained only on the most recent 3-4 months
* Blending different experiments usually helps
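Two of these hints are easy to sketch with pandas and NumPy: restricting the training data to the most recent months, and blending predictions from several experiments. The helper names (`keep_recent`, `blend_log_space`) are hypothetical, the toy values are illustrative, and averaging in log space is just one possible blending choice that happens to match the RMSLE metric.

```python
import numpy as np
import pandas as pd


def keep_recent(df, time_col="created_time", months=4):
    """Keep rows from the last `months` months before the latest timestamp."""
    times = pd.to_datetime(df[time_col])
    cutoff = times.max() - pd.DateOffset(months=months)
    return df[times > cutoff]


def blend_log_space(preds, weights=None):
    """Weighted average of count predictions, computed in log1p space."""
    stacked = np.stack([np.log1p(np.asarray(p, dtype=float)) for p in preds])
    return np.expm1(np.average(stacked, axis=0, weights=weights))


toy = pd.DataFrame({
    "created_time": ["2012-01-01 01:20:08", "2013-01-15 00:00:00", "2013-04-30 23:59:00"],
    "num_views": [62, 10, 3],
})
recent = keep_recent(toy, months=4)
print(recent)

# Predictions of two hypothetical experiments for the same two issues
blended = blend_log_space([[0, 3], [0, 15]])
print(blended)
```

The filtered frame could be log-transformed and uploaded as a separate dataset, and the blend could be applied per target before writing the submission file.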

In [22]:
end = datetime.datetime.now()
# total_seconds() also counts full days, unlike the .seconds attribute
print(f'{end}\nFinished in {int((end - start).total_seconds())} seconds')

2020-09-23 17:28:26.516391
Finished in 831 seconds
