# Simple predictions fitted on hour and distance

This notebook creates two new features for the taxi database set, culls outliers, fits a gradient boosted regressor to the data. 

With 200 trees, it generates a prediction in the to 62%. Note, I have trees set to 50 for a quicker first run of the notebook.

This is my first kernel and and so thank you for checking it out. Constructive criticism is welcome.

# I. Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBRegressor
%matplotlib inline

In [2]:
train = pd.read_csv('train.csv') #change to '../input/train.csv' for kaggle upload
test = pd.read_csv('test.csv') #change to '../input/test.csv' for kaggle upload
sample_submission = pd.read_csv('sample_submission.csv') #change to '../input/sample_submission.csv' for kaggle upload

# II. Data Overview

In [3]:
train.head(1)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455


In [4]:
test.head(1)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag
0,id3004672,1,2016-06-30 23:59:58,1,-73.988129,40.732029,-73.990173,40.75668,N


In [5]:
sample_submission.head(1)

Unnamed: 0,id,trip_duration
0,id3004672,959


In [6]:
train.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration'],
      dtype='object')

In [7]:
test.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'passenger_count',
       'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'store_and_fwd_flag'],
      dtype='object')

In [8]:
len(train)

1458644

In [9]:
len(test)

625134

In [10]:
# Extra features in train (which are note present in test)
[column for column in train.columns if column not in test.columns]

['dropoff_datetime', 'trip_duration']

# III. Cleaning

In [11]:
train.isnull().sum()

id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

In [12]:
train[train.isnull().any(axis=1)] #No missing values

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration


In [13]:
def remove_outliers(old_df,number_of_std,columns="All",skip="None"):
    """
    Removes outliers from a dataframe.
    
    Parameters:
    old_df: Series or dataframe
    
    number_of_std: Number of standard deviations for threshhold. 
                   Function will remove all outliers beyond this many standard deviations.
                   
    columns: The columns upon which the operation will be performed. (List of column names)
    
    skip: List of columns to be skipped.
    
    Returns:
    A dataframe with the outliers removed.
    
    """
    
    if isinstance(old_df,pd.core.series.Series): #If series passed, then only 
        current_series = old_df #set current series
        
        mean = np.mean(current_series)    #Mean
        std = np.std(current_series)      #Std
        threshold = number_of_std*std     #Threshhold = number of std * std
        
        new_df = old_df[np.abs(current_series-mean)<threshold] #Remove outliers from series
    else:
        if columns=="All": #Set columns
            columns=old_df.columns
            
        if skip!="None": #Skip any columns to be skipped
            columns = [name for name in list(old_df.columns) if name not in skip]
        
        for column in columns:
            current_series = old_df[column] #Iterate through each column

            mean = np.mean(current_series) #Set up threshold for which x should be within
            std = np.std(current_series)
            threshold = number_of_std*std

            new_df = old_df[np.abs(current_series-mean)<threshold] #Remove outliers from this column
    
    return new_df

In [14]:
train = remove_outliers(train,3,columns=['trip_duration','pickup_longitude','pickup_latitude', 'dropoff_longitude','dropoff_latitude']) 

***
# IV. Feature Engineering

Feature engineering:
- Direct Distance (as the crow flies)
- Manhattan distance
- Hour, month, and day of the week of departure
- Weekend boolean variable

### FE 1: 'dist' (distance travelled)

In [15]:
train['dist'] = np.sqrt((train['pickup_latitude']-train['dropoff_latitude'])**2 
                         + (train['pickup_longitude']-train['dropoff_longitude'])**2) 

test['dist'] = np.sqrt((test['pickup_latitude']-test['dropoff_latitude'])**2 
                         + (test['pickup_longitude']-test['dropoff_longitude'])**2) 

### FE 2: 'manh' (manhattan distance)

In [16]:
train['manh'] = abs(train['pickup_latitude']-train['dropoff_latitude']) + abs(train['pickup_longitude']-train['dropoff_longitude'])

test['manh'] = abs(test['pickup_latitude']-test['dropoff_latitude']) + abs(test['pickup_longitude']-test['dropoff_longitude'])

### FE 3: 'hour' (hour picked up), 'month, and 'dayofweek' (day of the week picked up)

In [17]:
#Convert to datetime
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])

In [18]:
#Hour
train['hour'] = train['pickup_datetime'].dt.hour
test['hour'] = test['pickup_datetime'].dt.hour

#Day of the week
train['dayofweek'] = train['pickup_datetime'].dt.dayofweek
test['dayofweek'] = test['pickup_datetime'].dt.dayofweek

#Month
train['month'] = train['pickup_datetime'].dt.month
test['month'] = test['pickup_datetime'].dt.month

### FE 4: 'weekend'

In [1]:
train['weekend']=train['dayofweek'].apply(lambda x: 1 if x>5 else 0)
test['weekend']=test['dayofweek'].apply(lambda x: 1 if x>5 else 0)

NameError: name 'train' is not defined

### FE 5: PCA of longtitude and latitude

In [20]:
#To complete for a later attempt

***
# V. Visualization

In [21]:
"""
plt.figure()
train[train['vendor_id']==1]['trip_duration'].hist(bins=40)
plt.xlim(0,5000)
"""

"\nplt.figure()\ntrain[train['vendor_id']==1]['trip_duration'].hist(bins=40)\nplt.xlim(0,5000)\n"

In [22]:
"""
plt.figure()
train[train['vendor_id']==2]['trip_duration'].hist(bins=40)
plt.xlim(0,5000)
"""

"\nplt.figure()\ntrain[train['vendor_id']==2]['trip_duration'].hist(bins=40)\nplt.xlim(0,5000)\n"

In [23]:
"""
#plt.scatter(x=train['passenger_count'],y=train['trip_duration'])
sns.boxplot(x='passenger_count',y='trip_duration',data=train)
"""

"\n#plt.scatter(x=train['passenger_count'],y=train['trip_duration'])\nsns.boxplot(x='passenger_count',y='trip_duration',data=train)\n"

In [24]:
#train[['hour','dist','trip_duration']].corr()

In [25]:
#sns.barplot(x='hour',y='trip_duration',data=train)

# VI. Feature Selection

In [26]:
train.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration', 'dist', 'manh', 'hour', 'dayofweek', 'month',
       'weekend'],
      dtype='object')

In [27]:
features= ['dist', 'manh', 'hour', 'dayofweek', 'month', 'weekend','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']

In [28]:
X = train[features]
y = train['trip_duration']

X_final = test[features]

***
# VI. Fitting and predicting

#### Grid Search XGBoost

In [29]:
#param_grid = [{'min_child_weight':[1,10],'max_depth':[3,6,10]}]

#xgbr = XGBRegressor(n_jobs=-1)

#grid_search = GridSearchCV(xgbr, param_grid, cv=3,verbose=3)

#grid_search.fit(X,y)

print("Best was max_child_weight=10 and max_depth=10")

Best was max_child_weight=10 and max_depth=10


In [30]:
#grid_search.best_params_

In [31]:
parameters = {'max_depth': 10, 'min_child_weight': 10}

In [32]:
"""
param_grid = [{'min_child_weight':[10],'max_depth':[10]}]

xgbr = XGBRegressor(n_jobs=-1)

grid_search = GridSearchCV(xgbr, param_grid, cv=2,verbose=3)

grid_search.fit(X,y)
"""

"\nparam_grid = [{'min_child_weight':[10],'max_depth':[10]}]\n\nxgbr = XGBRegressor(n_jobs=-1)\n\ngrid_search = GridSearchCV(xgbr, param_grid, cv=2,verbose=3)\n\ngrid_search.fit(X,y)\n"

#### Training XGBoost

In [33]:
y_log = np.log(y + 1)

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X.values,y_log, test_size=0.2, random_state=42)

dtrain = xgb.DMatrix(X_train,label=y_train)
dtest = xgb.DMatrix(X_test,label=y_test)

dfinal = xgb.DMatrix(X_final.values)
watchlist = [(dtrain,'dtrain'),(dtest,'dtest')]

In [35]:
xgbr = xgb.train(params=parameters,
                 dtrain=dtrain,
                 num_boost_round=100,
                 evals=watchlist,
                 early_stopping_rounds=30,
                 maximize=False,
                verbose_eval=10)

[0]	dtrain-rmse:4.21792	dtest-rmse:4.21946
Multiple eval metrics have been passed: 'dtest-rmse' will be used for early stopping.

Will train until dtest-rmse hasn't improved in 30 rounds.
[10]	dtrain-rmse:0.425774	dtest-rmse:0.437313
[20]	dtrain-rmse:0.393458	dtest-rmse:0.411911
[30]	dtrain-rmse:0.384995	dtest-rmse:0.409151
[40]	dtrain-rmse:0.376482	dtest-rmse:0.406025
[50]	dtrain-rmse:0.373081	dtest-rmse:0.405307
[60]	dtrain-rmse:0.366403	dtest-rmse:0.403254
[70]	dtrain-rmse:0.36406	dtest-rmse:0.403054
[80]	dtrain-rmse:0.361805	dtest-rmse:0.403189
[90]	dtrain-rmse:0.358572	dtest-rmse:0.402681
[99]	dtrain-rmse:0.355876	dtest-rmse:0.402458


#### Feature Importance XGB

In [36]:
feature_importance_dict = xgbr.get_fscore()
fs = ['f%i' % i for i in range(len(features))]
f1 = pd.DataFrame({'f': list(feature_importance_dict.keys()), 'importance': list(feature_importance_dict.values())})
feature_importance = f1
feature_importance = feature_importance.fillna(0)

feature_importance[['f', 'importance']].sort_values(by='importance', ascending=False)

Unnamed: 0,f,importance
2,f6,6422
1,f7,6144
6,f9,5860
7,f8,5591
0,f0,5329
5,f1,3505
4,f2,3346
3,f3,1811
8,f4,1228


In [37]:
xgbr.best_score

0.402458

#### XGBoost Predction

In [38]:
xgb_pred = xgbr.predict(dfinal)

In [39]:
xgb_pred = np.exp(xgb_pred) - 1 #Convert back from log

***
# X. Exporting submission_df

In [40]:
submission_pred = xgb_pred
submission_name = 'XGBoost from 6th look v2.csv' 

In [41]:
#submission_pred[submission_pred<0] #Check for invalid submission values

In [42]:
submission_df = pd.DataFrame(submission_pred,index=test['id'],columns=['trip_duration']).reset_index()

submission_df.to_csv(submission_name,index=False)

***
# Thank you for checking out this kernel.