# Model creation

The aim of this phase is to test different algorithms and check which one predicts 'delayInSeconds' with the highest accuracy. The competition metric is RMSE. I will start with a DummyRegressor that predicts the mean value for all observations, so that I can see how much the more complex models improve on that baseline. I will also test the newly created features and remove those that add little value but increase model complexity. Finally, I will decide which algorithm and features to use for prediction.
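
To make the baseline concrete: on the data it was fitted on, a mean predictor has an RMSE equal to the standard deviation of the target. A minimal sketch on synthetic data (the `rmse` helper and the `*_demo` arrays are illustrative assumptions, not part of the notebook or the competition data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic delays in seconds, for illustration only.
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=120.0, scale=300.0, size=1000)
X_demo = np.zeros((len(y_demo), 1))  # DummyRegressor ignores the features

baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
baseline_rmse = rmse(y_demo, baseline.predict(X_demo))

# The two printed numbers agree: RMSE of the mean predictor == std of the target.
print(round(baseline_rmse, 3), round(float(np.std(y_demo)), 3))
```

Any model worth keeping should therefore score clearly below the spread of the target itself.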

In [1]:
# import libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb 
from sklearn.model_selection import cross_val_score, cross_validate

In [2]:
# set column types to reduce DataFrame memory usage
column_types= {'agencyId': 'int16',
 'clouds_all': 'int8',
 'delayInSeconds': 'int64',
 'humidity': 'int8',
 'id': 'int32',
 'pressure': 'int16',
 'routeId': 'int16',
 'stopId': 'int32',
 'stopLat': 'float64',
 'stopLon': 'float64',
 'stopSequence': 'int32',
 'temp': 'float64',
 'tripId': 'int16',
 'vehicleId': 'int32',
 'wind_deg': 'int16',
 'wind_speed': 'int8'}

In [3]:
# create DataFrame from cleaned data
df=pd.read_csv('df_cleaned', dtype=column_types, parse_dates=['delayPredictionTimestamp','scheduleTime'])
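
The effect of narrower dtypes can be checked with `memory_usage`. A small sketch on a synthetic frame (the values are made up; only a few column names mirror the real ones):

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration only.
n = 10_000
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "humidity": rng.integers(0, 101, size=n),     # 0..100 fits in int8
    "pressure": rng.integers(950, 1051, size=n),  # fits in int16
    "stopId": rng.integers(0, 100_000, size=n),   # needs int32
}).astype("int64")  # what pandas would typically infer for integers

compact = demo.astype({"humidity": "int8", "pressure": "int16", "stopId": "int32"})

bytes_before = int(demo.memory_usage(index=False).sum())
bytes_after = int(compact.memory_usage(index=False).sum())
print(bytes_before, bytes_after)
```

A narrower type is only safe when every value fits its range, for example -128 to 127 for `int8`.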

I will define a features function to quickly exclude columns I do not want to use as features. I will exclude 'id', which is a table key, 'delayInSeconds', which is the dependent variable, and the datetime columns, which the models cannot use directly.

In [4]:
# define features function to exclude some columns from features matrix
def features(df):
    feats = df.columns.values
    black_list = ['id', 'delayInSeconds', 'delayPredictionTimestamp', 'scheduleTime']
    return [feat for feat in feats if feat not in black_list]

In [5]:
# have a look at the remaining feature columns
df[features(df)].head()

Unnamed: 0,stopId,routeId,vehicleId,tripId,stopSequence,agencyId,stopLat,stopLon,temp,pressure,humidity,wind_speed,wind_deg,clouds_all,weather_id
0,204,11,377,22,37,2,54.39919,18.59498,291.15,1009,59,5,250,75,803
1,2096,11,377,22,34,2,54.41262,18.58775,291.15,1009,59,5,250,75,803
2,2008,11,377,11,29,2,54.36391,18.63838,291.15,1009,59,4,230,40,802
3,2024,11,377,11,21,2,54.3822,18.59866,291.15,1009,59,4,230,40,802
4,2028,11,377,11,19,2,54.3869,18.58521,291.15,1009,59,4,230,40,802


I will now create a features matrix and an array with the dependent variable. Then I'll create models using different algorithms and compare their results using cross-validation.

In [6]:
# create features matrix and dependent variable array
# features is a function, so it has to be called to get the list of column names
X=df[features(df)].values
y=df['delayInSeconds'].values

In [7]:
# create models to test
def create_models():
     return [
        ('dr', DummyRegressor()),
        ('lr', LinearRegression()),
        ('dt', DecisionTreeRegressor(max_depth=8)),
        ('rf', RandomForestRegressor(max_depth=8, n_estimators=20, min_samples_leaf=10, n_jobs=-1, random_state=2018)),
        ('xgbr', xgb.XGBRegressor(max_depth=3, n_estimators=100, n_jobs=-1))]

In [8]:
# compare results
for name, model in create_models():
    scores=cross_validate(model, X, y,cv=5, scoring='neg_mean_squared_error', return_train_score=True)
    print ( name + ' train_score:' + str(np.round(np.sqrt(-scores['train_score'].mean()),3))+ ' test_score:' + str(np.round(np.sqrt(-scores['test_score'].mean()),3)))

dr train_score:892.977 test_score:893.053
lr train_score:889.306 test_score:889.819
dt train_score:782.339 test_score:1164.022
rf train_score:833.226 test_score:911.121
xgbr train_score:853.72 test_score:920.063
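
The comparison loop can also be wrapped in a small helper that turns the negated MSE returned by `cross_validate` into RMSE. A sketch on synthetic data (`cv_rmse` and the demo arrays are hypothetical names; the RMSE is computed the same way as in the loop, as the square root of the mean fold MSE):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def cv_rmse(model, X, y, cv=5):
    """Return (train RMSE, test RMSE) from k-fold cross-validation."""
    scores = cross_validate(model, X, y, cv=cv,
                            scoring="neg_mean_squared_error",
                            return_train_score=True)
    # Scores are negated MSE values, so flip the sign before the square root.
    train = float(np.sqrt(-scores["train_score"].mean()))
    test = float(np.sqrt(-scores["test_score"].mean()))
    return train, test


# Synthetic linear data, for illustration only.
rng = np.random.default_rng(2018)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([50.0, -20.0, 10.0]) + rng.normal(scale=5.0, size=500)

for name, model in [("dr", DummyRegressor()), ("lr", LinearRegression())]:
    train_rmse, test_rmse = cv_rmse(model, X_demo, y_demo)
    print(f"{name} train_score:{train_rmse:.3f} test_score:{test_rmse:.3f}")
```

With an integer `cv` and a regressor, scikit-learn uses unshuffled `KFold`, so the folds follow the row order of the data.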


The test_score for DummyRegressor is 893.053. The other models achieve similar or even worse results. The model parameters are not tuned yet, but it seems that the current features do not predict the delay well. I will now use the feats_engineering function defined in the previous phase and test whether the newly created features improve the result.

In [9]:
# define function with new features

def feats_engineering(df):  
    
    # mean 'delayInSeconds'
    stopIdMeanDelay = df[ ['stopId', 'delayInSeconds'] ].groupby(['stopId']).mean().to_dict()['delayInSeconds']
    df['stopIdMeanDelay'] = df['stopId'].map(lambda x: stopIdMeanDelay[x])
    agencyMeanDelay = df[ ['agencyId', 'delayInSeconds'] ].groupby(['agencyId']).mean().to_dict()['delayInSeconds']
    df['agencyMeanDelay'] = df['agencyId'].map(lambda x: agencyMeanDelay[x])
    routeMeanDelay = df[ ['routeId', 'delayInSeconds'] ].groupby(['routeId']).mean().to_dict()['delayInSeconds']
    df['routeMeanDelay'] = df['routeId'].map(lambda x: routeMeanDelay[x])
    vehicleIdMeanDelay = df[ ['vehicleId', 'delayInSeconds'] ].groupby(['vehicleId']).mean().to_dict()['delayInSeconds']
    df['vehicleIdMeanDelay'] = df['vehicleId'].map(lambda x: vehicleIdMeanDelay[x])
    tripIdMeanDelay = df[ ['tripId', 'delayInSeconds'] ].groupby(['tripId']).mean().to_dict()['delayInSeconds']
    df['tripIdMeanDelay'] = df['tripId'].map(lambda x: tripIdMeanDelay[x])
    
    # clusters of stopId localization
    from sklearn.cluster import KMeans
    X = df[['stopLat','stopLon']].values
    kmeans = KMeans(n_clusters=8, random_state=2018).fit(X)
    df['clusters']=kmeans.predict(X)
    
    # difference between delayPredictionTimestamp and scheduleTime in seconds; the date is wrong for times close to midnight, so 24h is added or subtracted below
    df['time_diff']=(df['delayPredictionTimestamp']-df['scheduleTime']).dt.total_seconds()
    df['time_diff']=pd.to_numeric(df['time_diff'], downcast='signed')
    df['time_diff']=df['time_diff'].apply(lambda x: x+86400 if x <-50000 else x)
    df['time_diff']=df['time_diff'].apply(lambda x: x-86400 if x >50000 else x)
    
    # time features
    df['hour'] = df['scheduleTime'].dt.hour
    df['dayofweek'] = df['scheduleTime'].dt.dayofweek
    df['weekend'] = df['dayofweek'].map(lambda x: int(x in [5,6]))
    df['holidays'] = df['scheduleTime'].map(lambda x: int(x > pd.Timestamp(2018, 6, 22, 23, 59, 59)))
    df['time']=df['delayPredictionTimestamp'].dt.second+df['delayPredictionTimestamp'].dt.minute*60+df['delayPredictionTimestamp'].dt.hour*3600
    df['peak_hours']=(df[['hour','weekend']].apply(lambda x: (x['weekend'] == 1 and x['hour'] in (0,1,2,12,13,14)) or (x['weekend'] == 0 and  9 <= x['hour'] <= 18),axis=1))
    
    return df
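
The midnight correction applied to 'time_diff' is easier to follow on a worked example. A sketch with three made-up rows, using the same 50000-second threshold as the function and a vectorised `where` in place of `apply`:

```python
import pandas as pd

# Made-up timestamps, for illustration only.
demo = pd.DataFrame({
    "delayPredictionTimestamp": pd.to_datetime(
        ["2018-06-01 00:01:00", "2018-06-01 23:58:00", "2018-06-01 12:05:00"]),
    "scheduleTime": pd.to_datetime(
        ["2018-06-01 23:59:00", "2018-06-01 00:02:00", "2018-06-01 12:00:00"]),
})

# Raw differences in seconds: -86280, 86160 and 300.
diff = (demo["delayPredictionTimestamp"] - demo["scheduleTime"]).dt.total_seconds()

# Differences beyond +/-50000 s can only come from a wrong date, so shift them by 24h.
diff = diff.where(diff >= -50000, diff + 86400)
diff = diff.where(diff <= 50000, diff - 86400)

demo["time_diff"] = diff.astype("int64")
print(demo["time_diff"].tolist())  # [120, -240, 300]
```

The first row is a prediction made two minutes after a 23:59 departure and the second one four minutes before a 00:02 departure; both were stored with the wrong calendar day.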

In [10]:
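# create features matrix and dependent variable array
# feats_engineering adds the new columns to df; features(df) then returns the column names to use
df=feats_engineering(df)
X=df[features(df)].values
y=df['delayInSeconds'].values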

In [11]:
# compare results
for name, model in create_models():
    scores=cross_validate(model, X, y, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
    print ( name + ' train_score:' + str(np.round(np.sqrt(-scores['train_score'].mean()),3))+ ' test_score:' + str(np.round(np.sqrt(-scores['test_score'].mean()),3)))

dr train_score:892.977 test_score:893.053
lr train_score:817.834 test_score:819.238
dt train_score:554.122 test_score:1370.636
rf train_score:646.716 test_score:740.212
xgbr train_score:413.278 test_score:595.364


Feature engineering definitely helped. The result of linear regression is somewhat better, but a linear model does not seem suitable for these data. The Decision Tree model with its current parameters is clearly overfitted (train 554.122 vs. test 1370.636). Random Forest partially resolves this problem: its RMSE decreased significantly, to 740.212. The best score was achieved by XGBoost: 595.364. I will continue with Random Forest and XGBoost and check the feature importances for those two models.
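
One caveat about these scores: the mean-delay features are computed from 'delayInSeconds' on the whole DataFrame before cross-validation, so each validation row contributes to its own feature value, which may make the test scores optimistic. A common way to check this is out-of-fold mean encoding. The sketch below is an assumption about how that could look, not what the notebook does; `oof_mean_encode` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_encode(df, key, target, n_splits=5, seed=2018):
    """Hypothetical helper: mean of `target` per `key`, computed without the row's own fold."""
    encoded = pd.Series(np.nan, index=df.index, dtype="float64")
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(key)[target].mean()
        fallback = fit_part[target].mean()  # for keys unseen in the fitting part
        values = df.iloc[enc_idx][key].map(means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Synthetic data where the target is pure noise, unrelated to stopId.
rng = np.random.default_rng(2018)
demo = pd.DataFrame({
    "stopId": rng.integers(0, 400, size=2000),
    "delayInSeconds": rng.normal(scale=300.0, size=2000),
})

in_sample = demo.groupby("stopId")["delayInSeconds"].transform("mean")
out_of_fold = oof_mean_encode(demo, "stopId", "delayInSeconds")

# The in-sample encoding correlates with the target although there is no real signal.
in_sample_corr = float(np.corrcoef(in_sample, demo["delayInSeconds"])[0, 1])
oof_corr = float(np.corrcoef(out_of_fold, demo["delayInSeconds"])[0, 1])
print(round(in_sample_corr, 3), round(oof_corr, 3))
```

If the real features show a similar gap, part of the improvement reported here would come from leakage rather than from the features themselves.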

In [12]:
# Check features importances
rf=RandomForestRegressor(max_depth=8, n_estimators=20, min_samples_leaf=10, n_jobs=-1, random_state=2018)
rf.fit(X,y)
xgbr= xgb.XGBRegressor(max_depth=3, n_estimators=100, n_jobs=-1)
xgbr.fit(X,y)
# features(df) returns the column names in the same order as the columns of X
Feature_importancies=pd.DataFrame({'features':features(df),'rf':np.round(rf.feature_importances_,3),
                                   'xgbr':np.round(xgbr.feature_importances_,3)},columns=['features','rf','xgbr'])
Feature_importancies

Unnamed: 0,features,rf,xgbr
0,stopId,0.002,0.044
1,routeId,0.006,0.017
2,vehicleId,0.002,0.012
3,tripId,0.006,0.006
4,stopSequence,0.011,0.082
5,agencyId,0.0,0.0
6,stopLat,0.002,0.006
7,stopLon,0.001,0.006
8,temp,0.004,0.002
9,pressure,0.01,0.011


There are a few features with very low importance. I removed the one with the lowest importance and checked whether this affected model accuracy. I repeated this a few times until I arrived at the final set of features. The accuracy remained the same, but the model became a bit simpler. I will now create my final functions that drop the columns I do not want to use and create the new features.
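
The elimination procedure described above was done by hand. It could be automated roughly as follows; this is a sketch on synthetic data, and `backward_eliminate`, its `tolerance` parameter, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def backward_eliminate(model, X, y, names, tolerance=0.0, cv=5):
    """Drop the least important feature while the CV RMSE does not get worse."""
    def cv_rmse(cols):
        mse = -cross_val_score(model, X[:, cols], y, cv=cv,
                               scoring="neg_mean_squared_error").mean()
        return float(np.sqrt(mse))

    kept = list(range(X.shape[1]))
    best = cv_rmse(kept)
    while len(kept) > 1:
        # Fit on the current feature set and find the weakest feature.
        importances = model.fit(X[:, kept], y).feature_importances_
        weakest = int(np.argmin(importances))
        candidate = [c for i, c in enumerate(kept) if i != weakest]
        score = cv_rmse(candidate)
        if score > best + tolerance:
            break  # removing this feature hurts accuracy, so stop
        kept, best = candidate, min(best, score)
    return [names[c] for c in kept], best


# Synthetic data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(2018)
X_demo = rng.normal(size=(300, 4))
y_demo = 100.0 * X_demo[:, 0] + 50.0 * X_demo[:, 1] + rng.normal(scale=5.0, size=300)
names_demo = ["signal_a", "signal_b", "noise_a", "noise_b"]

rf_demo = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=2018)
kept_names, kept_rmse = backward_eliminate(rf_demo, X_demo, y_demo, names_demo)
print(kept_names, round(kept_rmse, 3))
```

A small positive `tolerance` trades a little accuracy for a simpler model, which matches the goal stated above.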

In [13]:
# define features function to exclude some columns from features matrix
def features(df):
    feats = df.columns.values  
    black_list = ['id', 'delayInSeconds', 'delayPredictionTimestamp', 'scheduleTime', 'weather_id', 'clouds_all', 'pressure']
    return [feat for feat in feats if feat not in black_list]

In [14]:
# define final function with new features

def feats_engineering(df):  
    
    # mean 'delayInSeconds'
    agencyMeanDelay = df[ ['agencyId', 'delayInSeconds'] ].groupby(['agencyId']).mean().to_dict()['delayInSeconds']
    df['agencyMeanDelay'] = df['agencyId'].map(lambda x: agencyMeanDelay[x])
    stopIdMeanDelay = df[ ['stopId', 'delayInSeconds'] ].groupby(['stopId']).mean().to_dict()['delayInSeconds']
    df['stopIdMeanDelay'] = df['stopId'].map(lambda x: stopIdMeanDelay[x])
    routeMeanDelay = df[ ['routeId', 'delayInSeconds'] ].groupby(['routeId']).mean().to_dict()['delayInSeconds']
    df['routeMeanDelay'] = df['routeId'].map(lambda x: routeMeanDelay[x])
    vehicleIdMeanDelay = df[ ['vehicleId', 'delayInSeconds'] ].groupby(['vehicleId']).mean().to_dict()['delayInSeconds']
    df['vehicleIdMeanDelay'] = df['vehicleId'].map(lambda x: vehicleIdMeanDelay[x])
    tripIdMeanDelay = df[ ['tripId', 'delayInSeconds'] ].groupby(['tripId']).mean().to_dict()['delayInSeconds']
    df['tripIdMeanDelay'] = df['tripId'].map(lambda x: tripIdMeanDelay[x])
     
    # difference between delayPredictionTimestamp and scheduleTime in seconds; the date is wrong for times close to midnight, so 24h is added or subtracted below
    df['time_diff']=(df['delayPredictionTimestamp']-df['scheduleTime']).dt.total_seconds()
    df['time_diff']=pd.to_numeric(df['time_diff'], downcast='signed')
    df['time_diff']=df['time_diff'].apply(lambda x: x+86400 if x <-50000 else x)
    df['time_diff']=df['time_diff'].apply(lambda x: x-86400 if x >50000 else x)
    
    # time features
    df['hour'] = df['scheduleTime'].dt.hour
    df['dayofweek'] = df['scheduleTime'].dt.dayofweek
    df['time']=df['delayPredictionTimestamp'].dt.second+df['delayPredictionTimestamp'].dt.minute*60+df['delayPredictionTimestamp'].dt.hour*3600
    return df

In [15]:
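# create features matrix and dependent variable array
# feats_engineering adds the new columns to df; features(df) then returns the column names to use
df=feats_engineering(df)
X=df[features(df)].values
y=df['delayInSeconds'].values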

In [16]:
# create models to test
def create_models():
     return [
        ('rf', RandomForestRegressor(max_depth=8, n_estimators=20, min_samples_leaf=10, n_jobs=-1, random_state=2018)),
        ('xgbr', xgb.XGBRegressor(max_depth=3, n_estimators=100, n_jobs=-1))]

In [17]:
# compare results
for name, model in create_models():
    scores=cross_validate(model, X, y, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
    print ( name + ' train_score:' + str(np.round(np.sqrt(-scores['train_score'].mean()),3))+ ' test_score:' + str(np.round(np.sqrt(-scores['test_score'].mean()),3)))

rf train_score:647.108 test_score:740.299
xgbr train_score:416.269 test_score:594.677


So far I have obtained the following RMSE: rf: 740.299, xgbr: 594.677. In the next phase I will look for the best parameters for these two models and train my best model.