<a href="https://colab.research.google.com/github/numustafa/ML-Projects-/blob/main/Google%20Cloud%20Kaggle%20Competition/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-End ML Project
In this project, I use data analysis and machine learning to build a model that predicts the taxi fare for a ride in NYC. The dataset comes from a [Kaggle](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) competition hosted by Google Cloud. My main goal is to attain a respectable score in the competition using just a fraction of the data. The project follows this sequence:
1. Download Dataset
2. Data Exploration
3. Data Modelling
4. Train & Evaluate Hardcoded & Baseline Models
5. Generate Predictions
6. Feature Engineering
7. Train & Evaluate Different Models
8. Hyperparameter Tuning
9. Train on GPU
10. Documentation & Publishing


## 1. Dataset
* Install the necessary libraries
* Download the data from Kaggle
* Load the data into a workable DataFrame

In [1]:
# lib
!pip install opendatasets --upgrade --quiet   # Lib by Jovian to download the public datasets from Kaggle
import opendatasets as od
od.version()

dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'     # url for competition
od.download(dataset_url)

Downloading new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


100%|██████████| 1.56G/1.56G [00:15<00:00, 109MB/s]



Extracting archive ./new-york-city-taxi-fare-prediction/new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


In [2]:
data_dir = "new-york-city-taxi-fare-prediction"

In [3]:
# list the downloaded files
!ls -lh {data_dir}        # the leading ! runs this line as a shell command

total 5.4G
-rw-r--r-- 1 root root  486 Nov 30 18:44 GCP-Coupons-Instructions.rtf
-rw-r--r-- 1 root root 336K Nov 30 18:44 sample_submission.csv
-rw-r--r-- 1 root root 960K Nov 30 18:44 test.csv
-rw-r--r-- 1 root root 5.4G Nov 30 18:45 train.csv


The download is about 5.4GB, almost all of which is the training set. The rest is a small test file, a sample submission and an information file.

In [4]:
# count the lines in the train and test files
!wc -l {data_dir}/train.csv
!wc -l {data_dir}/test.csv

55423856 new-york-city-taxi-fare-prediction/train.csv
9914 new-york-city-taxi-fare-prediction/test.csv


The training set has almost 55.4 million observations, while the test set has only about 9,900.

In [5]:
# check the head of train data
!head {data_dir}/train.csv

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.7267

These are the first 10 lines, including the header. Each row has a unique identifier (`key`) built from the pickup date, time and a serial number.

In [6]:
!head {data_dir}/test.csv

key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320007324219,40.7638053894043,-73.981430053710938,40.74383544921875,1
2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862182617188,40.719383239746094,-73.998886108398438,40.739200592041016,1
2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1
2012-12-01 21:12:12.0000005,2012-12-01 21:12:12 UTC,-73.960983,40.765547,-73.979177,40.740053,1
2011-10-06 12:10:20.0000001,2011-10-06 12:10:20 UTC,-73.949013,40.773204,-73.959622,40.770893,1
2011-10-06 12:10:20.0000003,2011-10-06 12:10:20 UTC,-73.777282,40.646636,-73.985083,40.759368,1
2011-10-06 12:10:20.0000002,2011-10-06 12:10:20 UTC,

The test data has the same `key` column and the same other columns as the training set, except for the `fare_amount` column.

In [7]:
# submission file
!head {data_dir}/sample_submission.csv

key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
2012-12-01 21:12:12.0000002,11.35
2012-12-01 21:12:12.0000003,11.35
2012-12-01 21:12:12.0000005,11.35
2011-10-06 12:10:20.0000001,11.35
2011-10-06 12:10:20.0000003,11.35
2011-10-06 12:10:20.0000002,11.35


The sample submission contains exactly the same keys as the test set, plus a sample fare amount (which will later be replaced with the predicted fare amount).

#### Observations:
* This is a supervised learning regression problem.
* The training data is 5.4GB.
* The download contains train, test, sample submission and info files.
* The test set is much smaller (< 10,000 rows).
* The training set has 8 columns:
  * `key`,   (unique identifier)
  * `fare_amount`,   (Target variable)
  * `pickup_datetime`,
  * `pickup_longitude`,
  * `pickup_latitude`,
  * `dropoff_longitude`,
  * `dropoff_latitude`,
  * `passenger_count`
* The test set has every column except the target column
* The submission file has the key (same as the test set) and a sample target column

To iterate faster, I start with a 1% sample of the training set, about 550k rows, which is itself large.

In [8]:
# lib
import pandas as pd
import random
import numpy as np

sample_frac = 0.01

In [9]:
# taking all cols except key col
selected_cols = "fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")
selected_cols

['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [10]:
# set the dtypes of data
dtypes = {
  'fare_amount' : "float32",
  'pickup_longitude' : "float32",
  'pickup_latitude' : "float32",
  'dropoff_longitude' : "float32",
  'dropoff_latitude' : "float32",
  'passenger_count' : "uint8"
}

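# Decide whether to skip a row while reading the CSV.
# pandas passes the row index: row 0 is the header and is always kept,
# every other row is kept with probability sample_frac.
def skip_row(row_idx):
  if row_idx == 0:
    return False
  else:
    return random.random() > sample_frac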

random.seed(42)
df = pd.read_csv(data_dir+"/train.csv",
                 usecols= selected_cols,
                 dtype= dtypes,
                 parse_dates= ['pickup_datetime'],
                 skiprows= skip_row)
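
To see the sampling trick in isolation, here is a minimal sketch on a small in-memory CSV. The CSV contents, the names `demo_frac` and `keep_header_and_sample`, and the 10% fraction are assumptions made for this demo only; the notebook itself samples 1% of `train.csv`.

```python
import io
import random

import pandas as pd

# Hypothetical in-memory CSV standing in for train.csv: a header plus 1,000 data rows.
csv_text = "row_id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

demo_frac = 0.1  # assumed fraction for this demo (the notebook uses 0.01)


def keep_header_and_sample(row_idx):
    """Return True to skip a row; row 0 (the header) is never skipped."""
    if row_idx == 0:
        return False
    return random.random() > demo_frac


random.seed(42)
demo_df = pd.read_csv(io.StringIO(csv_text), skiprows=keep_header_and_sample)

print(list(demo_df.columns))  # the header survives the sampling
print(len(demo_df))           # roughly demo_frac * 1000 rows
```

Because rows are dropped while the file is being parsed, the full 5.4GB file never has to fit in memory as a DataFrame.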

In [11]:
# test data
test_df = pd.read_csv(data_dir+"/test.csv", dtype= dtypes)


In [12]:
df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.0,2014-12-06 20:36:22+00:00,-73.979813,40.751904,-73.979446,40.755482,1
1,8.0,2013-01-17 17:22:00+00:00,0.000000,0.000000,0.000000,0.000000,2
2,8.9,2011-06-15 18:07:00+00:00,-73.996330,40.753223,-73.978897,40.766964,3
3,6.9,2009-12-14 12:33:00+00:00,-73.982430,40.745747,-73.982430,40.745747,1
4,7.0,2013-11-06 11:26:54+00:00,-73.959061,40.781059,-73.962059,40.768604,1
...,...,...,...,...,...,...,...
552445,45.0,2014-02-06 23:59:45+00:00,-73.973587,40.747669,-73.999916,40.602894,1
552446,22.5,2015-01-05 15:29:08+00:00,-73.935928,40.799656,-73.985710,40.726952,2
552447,4.5,2013-02-17 22:27:00+00:00,-73.992531,40.748619,-73.998436,40.740143,1
552448,14.5,2013-01-27 12:41:00+00:00,-74.012115,40.706635,-73.988724,40.756218,1


In [13]:
test_df

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320,40.763805,-73.981430,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982521,40.751259,-73.979652,40.746140,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.981163,40.767807,-73.990448,40.751637,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51 UTC,-73.968124,40.796997,-73.955643,40.780388,6
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51 UTC,-73.945511,40.803600,-73.960213,40.776371,6
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15 UTC,-73.991600,40.726608,-73.789742,40.647011,6
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19 UTC,-73.985573,40.735432,-73.939178,40.801731,6


## 2. Dataset Exploration
* Basic info about training set
* Basic info about test set
* Basic Exploratory Analysis


In [14]:
# Training set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552450 entries, 0 to 552449
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        552450 non-null  float32            
 1   pickup_datetime    552450 non-null  datetime64[ns, UTC]
 2   pickup_longitude   552450 non-null  float32            
 3   pickup_latitude    552450 non-null  float32            
 4   dropoff_longitude  552450 non-null  float32            
 5   dropoff_latitude   552450 non-null  float32            
 6   passenger_count    552450 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 15.3 MB


No missing values

In [15]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,552450.0,552450.0,552450.0,552450.0,552450.0,552450.0
mean,11.354059,-72.497063,39.9105,-72.504326,39.934265,1.684983
std,9.811924,11.618246,8.061114,12.074346,9.255057,1.337664
min,-52.0,-1183.362793,-3084.490234,-3356.729736,-2073.150635,0.0
25%,6.0,-73.99202,40.734875,-73.991425,40.73399,1.0
50%,8.5,-73.981819,40.752621,-73.980179,40.753101,1.0
75%,12.5,-73.967155,40.767036,-73.963737,40.768059,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


In [16]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   key                9914 non-null   object 
 1   pickup_datetime    9914 non-null   object 
 2   pickup_longitude   9914 non-null   float32
 3   pickup_latitude    9914 non-null   float32
 4   dropoff_longitude  9914 non-null   float32
 5   dropoff_latitude   9914 non-null   float32
 6   passenger_count    9914 non-null   uint8  
dtypes: float32(4), object(2), uint8(1)
memory usage: 319.6+ KB


In [17]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568974,1.0
25%,-73.9925,40.736125,-73.991249,40.735253,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696682,6.0


In [18]:
df["pickup_datetime"].min(), df["pickup_datetime"].max()

(Timestamp('2009-01-01 00:11:46+0000', tz='UTC'),
 Timestamp('2015-06-30 23:59:54+0000', tz='UTC'))

Observations:
* As expected, there are 552k rows in the training sample & 9.9k rows in the test set.
* No missing values in either dataset.
* `fare_amount` in the train set ranges from -$52.0 to $499.0. The negative fare seems odd.
* `passenger_count` in the train set ranges from 0 to 208, although 75% of rides have at most 2 passengers, so the extreme values are either fleet bookings or errors. In the test set the range is quite reasonable, from 1 to 6.
* In the train set, longitude values range from -3356.7 to 2467.8 & latitude values range from -3084.5 to 3351.4. In the test set, longitude values range from -74.3 to -73.0 & latitude values range from 40.6 to 41.7. The train set clearly contains invalid coordinates, which need fixing and should be brought into the range of the test set.
* The training data spans 1 Jan 2009 to 30 June 2015; the test rows shown above fall within the same period.
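
One way to bring the training data into the range of the test set is a simple range filter. The sketch below is an assumption about how this could be done, not the cleaning step used later: the helper name `remove_outliers` is hypothetical, the coordinate and passenger bounds are rounded outward from the test-set summary above, and the fare bounds are my own guess.

```python
import pandas as pd

# Bounds rounded outward from the test-set summary; FARE_RANGE is an assumption.
LON_RANGE = (-74.3, -72.9)
LAT_RANGE = (40.5, 41.8)
PASSENGER_RANGE = (1, 6)
FARE_RANGE = (0.0, 500.0)


def remove_outliers(frame):
    """Hypothetical helper: keep only rows whose values fall inside the bounds."""
    mask = (
        frame["fare_amount"].between(*FARE_RANGE)
        & frame["pickup_longitude"].between(*LON_RANGE)
        & frame["dropoff_longitude"].between(*LON_RANGE)
        & frame["pickup_latitude"].between(*LAT_RANGE)
        & frame["dropoff_latitude"].between(*LAT_RANGE)
        & frame["passenger_count"].between(*PASSENGER_RANGE)
    )
    return frame[mask]


# Tiny example: one plausible ride, one all-zero coordinate row, one negative fare.
rides = pd.DataFrame({
    "fare_amount": [4.0, 8.0, -52.0],
    "pickup_longitude": [-73.979813, 0.0, -73.98],
    "pickup_latitude": [40.751904, 0.0, 40.75],
    "dropoff_longitude": [-73.979446, 0.0, -73.97],
    "dropoff_latitude": [40.755482, 0.0, 40.76],
    "passenger_count": [1, 2, 1],
})
clean_rides = remove_outliers(rides)
print(len(rides), "->", len(clean_rides))
```

Filtering rows (rather than clipping values) avoids inventing coordinates for rides whose location is simply unknown.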


### 2.1 Exploratory Data Analysis

In this section, I explore the following questions on the sampled data, and later compare the answers with those from the full dataset:
1. What is the busiest day of the week?
2. What is the busiest time of the day?
3. In which month are fares the highest?
4. Which pickup locations have the highest fares?
5. Which drop locations have the highest fares?
6. What is the average ride distance?
7. how passanger
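
The first three questions can be approached with the pandas `.dt` accessor. This is only a sketch on four rows copied from the training sample shown earlier; the variable names are placeholders, and the real analysis would run on the full `df`.

```python
import pandas as pd

# Four rows copied from the training sample displayed earlier.
sample = pd.DataFrame({
    "fare_amount": [4.0, 8.9, 6.9, 7.0],
    "pickup_datetime": pd.to_datetime(
        [
            "2014-12-06 20:36:22",
            "2011-06-15 18:07:00",
            "2009-12-14 12:33:00",
            "2013-11-06 11:26:54",
        ],
        utc=True,
    ),
})

# 1. Rides per day of the week (busiest day first).
rides_per_weekday = sample["pickup_datetime"].dt.day_name().value_counts()

# 2. Rides per hour of the day.
rides_per_hour = sample["pickup_datetime"].dt.hour.value_counts().sort_index()

# 3. Mean fare per calendar month.
mean_fare_per_month = sample.groupby(sample["pickup_datetime"].dt.month)["fare_amount"].mean()

print(rides_per_weekday)
print(rides_per_hour)
print(mean_fare_per_month)
```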

## 3. Data Modelling - Prepare Data for Training

* Split into training & validation sets
* Extract inputs & outputs

In [19]:
from sklearn.model_selection import train_test_split


train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

len(train_df), len(val_df)


(441960, 110490)

In [20]:
# Drop any na values
train_df = train_df.dropna()
val_df = val_df.dropna()

In [21]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [22]:
# input & output cols
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_col = 'fare_amount'

In [23]:
# Train inputs & outputs
X_train = train_df[input_cols]
y_train = train_df[target_col]


In [24]:
X_train

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
353352,-73.993652,40.741543,-73.977974,40.742352,4
360070,-73.993805,40.724579,-73.993805,40.724579,1
372609,-73.959160,40.780750,-73.969116,40.761230,1
550895,-73.952187,40.783951,-73.978645,40.772602,1
444151,-73.977112,40.746834,-73.991104,40.750404,2
...,...,...,...,...,...
110268,-73.987152,40.750633,-73.979073,40.763168,1
259178,-73.972656,40.764042,-74.013176,40.707840,2
365838,-73.991982,40.749767,-73.989845,40.720551,3
131932,-73.969055,40.761398,-73.990814,40.751328,1


In [25]:
y_train

353352     6.0
360070     3.7
372609    10.0
550895     8.9
444151     7.3
          ... 
110268     9.3
259178    18.5
365838    10.1
131932    10.9
121958     9.5
Name: fare_amount, Length: 441960, dtype: float32

In [26]:
# validation input & target
X_val = val_df[input_cols]
y_val = val_df[target_col]

In [27]:
X_val

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
15971,-73.995834,40.759190,-73.973679,40.739086,1
149839,-73.977386,40.738335,-73.976143,40.751205,1
515867,-73.983910,40.749470,-73.787170,40.646645,1
90307,-73.790794,40.643463,-73.972252,40.690182,1
287032,-73.976593,40.761944,-73.991463,40.750309,2
...,...,...,...,...,...
467556,-73.968567,40.761238,-73.983406,40.750019,3
19482,-73.986725,40.755920,-73.985855,40.731171,1
186063,0.000000,0.000000,0.000000,0.000000,1
382260,-73.980057,40.760334,-73.872589,40.774300,1


In [28]:
y_val

15971     14.000000
149839     6.500000
515867    49.570000
90307     49.700001
287032     8.500000
            ...    
467556     6.100000
19482      7.300000
186063     4.500000
382260    32.900002
18838     11.500000
Name: fare_amount, Length: 110490, dtype: float32

#### Test Data


In [45]:
test_inputs = test_df[input_cols]
test_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,-73.973320,40.763805,-73.981430,40.743835,1
1,-73.986862,40.719383,-73.998886,40.739201,1
2,-73.982521,40.751259,-73.979652,40.746140,1
3,-73.981163,40.767807,-73.990448,40.751637,1
4,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...
9909,-73.968124,40.796997,-73.955643,40.780388,6
9910,-73.945511,40.803600,-73.960213,40.776371,6
9911,-73.991600,40.726608,-73.789742,40.647011,6
9912,-73.985573,40.735432,-73.939178,40.801731,6


## 4. Train & Evaluate Hardcoded (Simple) Model

* Hardcoded Model: Always predicts the avg fare
* Baseline Model: Linear Regression

To evaluate model predictions, we use root mean squared error (RMSE), the metric specified in the [Kaggle](https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/overview) competition brief.
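
RMSE is the square root of the average squared difference between targets and predictions. As a minimal sketch of the definition, it can be written directly in NumPy; the function name `rmse_np` is a placeholder chosen for this example.

```python
import numpy as np


def rmse_np(targets, predictions):
    """Root mean squared error computed straight from its definition."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean((targets - predictions) ** 2)))


# Perfect predictions give an error of zero.
perfect = rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

# Errors of 3 and 4 give sqrt((9 + 16) / 2).
off = rmse_np([0.0, 0.0], [3.0, 4.0])
print(perfect, off)
```

Because the errors are squared before averaging, a few very wrong predictions raise RMSE more than many slightly wrong ones.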

### 4.1 Train & Evaluate Hardcoded Model

In [46]:
import numpy as np

a = np.full(10, 3)        # returns an array of 10 values, all equal to 3
np.full(a.shape[0], 2)

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [30]:
class MeanRegressor:

  # Takes inputs & targets for training and calc the avg of targets
  def fit(self, inputs, targets):
    self.mean_= targets.mean()

  # Takes inputs and creates targets for prediction
  def predict(self, inputs):
    preds_ = np.full(inputs.shape[0], self.mean_)    # takes the shape of input (val or test inputs) and fill the whole shape with avg
    return preds_

In [31]:
# model
mean_model = MeanRegressor()

# fit the model
mean_model.fit(X_train, y_train)

In [32]:
# check the mean (average fare over the training targets)
mean_model.mean_

11.354714

In [33]:
# prediction from mean_model (training set)
train_preds = mean_model.predict(X_train)

# check the predictions (mean)
train_preds, len(train_preds)

(array([11.354714, 11.354714, 11.354714, ..., 11.354714, 11.354714,
        11.354714], dtype=float32),
 441960)

In [34]:
y_train

353352     6.0
360070     3.7
372609    10.0
550895     8.9
444151     7.3
          ... 
110268     9.3
259178    18.5
365838    10.1
131932    10.9
121958     9.5
Name: fare_amount, Length: 441960, dtype: float32

In [35]:
# prediction from mean_model (validation set)
val_preds = mean_model.predict(X_val)
val_preds, len(val_preds)

(array([11.354714, 11.354714, 11.354714, ..., 11.354714, 11.354714,
        11.354714], dtype=float32),
 110490)

In [36]:
y_val

15971     14.000000
149839     6.500000
515867    49.570000
90307     49.700001
287032     8.500000
            ...    
467556     6.100000
19482      7.300000
186063     4.500000
382260    32.900002
18838     11.500000
Name: fare_amount, Length: 110490, dtype: float32

#### Evaluation
* y_train with train_preds
* y_val with val_preds

Evaluation criterion: RMSE

In [37]:
from sklearn.metrics import mean_squared_error


In [38]:
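# rmse: square root of the mean squared error
# (np.sqrt is used because the `squared=False` argument is deprecated in recent scikit-learn releases)
def rmse(targets, predictions):
  root_mse = np.sqrt(mean_squared_error(targets, predictions))
  return root_mse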

In [39]:
train_rmse = rmse(y_train, train_preds)
val_rmse = rmse(y_val, val_preds)

train_rmse, val_rmse

(9.789782, 9.899954)

The mean model's RMSE is about $9.79 on the training targets (`y_train`) and $9.90 on the validation targets (`y_val`), so its predictions are typically off by almost $10 per fare.

Any useful model should do better than this basic hardcoded model.
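
For comparison, scikit-learn ships a ready-made mean predictor, `DummyRegressor(strategy="mean")`, which should behave like the hand-written `MeanRegressor`. The sketch below uses five fares copied from the `y_train` output above and placeholder inputs, since a mean predictor ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Five fares copied from the training targets shown above.
fares = np.array([6.0, 3.7, 10.0, 8.9, 7.3])

# Placeholder inputs: a mean predictor never looks at them.
placeholder_inputs = np.zeros((len(fares), 1))

dummy = DummyRegressor(strategy="mean")
dummy.fit(placeholder_inputs, fares)

new_inputs = np.zeros((3, 1))
dummy_preds = dummy.predict(new_inputs)
print(dummy_preds)  # every entry equals fares.mean()
```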

### 4.2 Train & Evaluate baseline Model

In [40]:
from sklearn.linear_model import LinearRegression


In [41]:
# model
model = LinearRegression()

#### Train Data

In [42]:
# fit the model
model.fit(X_train, y_train)

# generate predictions from the model
train_preds = model.predict(X_train)
train_preds

array([11.546237, 11.28461 , 11.28414 , ..., 11.458918, 11.284281,
       11.284448], dtype=float32)

#### Validation Data

In [43]:
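# NOTE: this refits the model on the validation set, so the validation targets leak into the model.
# The outputs below come from this refit; to evaluate the model trained on X_train,
# drop this call and only run predict().
model.fit(X_val, y_val)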

# generate predictions from the model
val_preds = model.predict(X_val)
val_preds

array([11.260979, 11.260483, 11.265015, ..., 12.472838, 11.260638,
       11.259938], dtype=float32)

Apparently, the predictions differ a bit between the two datasets, but one thing is certain: they are way off from the actual targets.

#### Evaluation: rmse

In [44]:
rmse(y_train, train_preds), rmse(y_val, val_preds)

(9.788632, 9.897244)

The linear model behaves much like our hardcoded mean model: its RMSE (9.79 on training, 9.90 on validation) is almost identical.
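
One possible explanation, which is my assumption and not something tested above, is that raw coordinates do not tell a linear model how long a trip is. A trip-distance feature could be derived with the haversine formula. The sketch below is illustrative only; the function name `haversine_km` and the Earth radius of 6371 km are choices made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Identical pickup and dropoff points are zero km apart.
same_point = haversine_km(-73.98, 40.75, -73.98, 40.75)

# One degree of longitude along the equator is about 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(same_point, one_degree)
```

The function also accepts NumPy arrays or pandas columns, so it could be applied to the four coordinate columns in one call.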

## 5. Kaggle Submission

In [47]:
# test dataset
test_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,-73.973320,40.763805,-73.981430,40.743835,1
1,-73.986862,40.719383,-73.998886,40.739201,1
2,-73.982521,40.751259,-73.979652,40.746140,1
3,-73.981163,40.767807,-73.990448,40.751637,1
4,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...
9909,-73.968124,40.796997,-73.955643,40.780388,6
9910,-73.945511,40.803600,-73.960213,40.776371,6
9911,-73.991600,40.726608,-73.789742,40.647011,6
9912,-73.985573,40.735432,-73.939178,40.801731,6


In [48]:
# prediction on test set
test_preds = model.predict(test_inputs)        # we don't have any targets for the test set

### 5.1 Create a Submission File
Since the test set has no targets, unlike the training and validation sets, the only way to evaluate the test predictions is to create a submission file:
* Read in the submission DataFrame (the sample submission file)
* Replace the sample targets with the test predictions (the rows in the submission file are the same as the rows in the test set)
* Save the amended DataFrame (as CSV or any other required format)



In [49]:
# read the submission dataframe
sub_df = pd.read_csv(data_dir+"/sample_submission.csv")
sub_df

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.35
1,2015-01-27 13:08:24.0000003,11.35
2,2011-10-08 11:53:44.0000002,11.35
3,2012-12-01 21:12:12.0000002,11.35
4,2012-12-01 21:12:12.0000003,11.35
...,...,...
9909,2015-05-10 12:37:51.0000002,11.35
9910,2015-01-12 17:05:51.0000001,11.35
9911,2015-04-19 20:44:15.0000001,11.35
9912,2015-01-31 01:05:19.0000005,11.35


In [50]:
# Replace the fareamount with our test predictions
sub_df["fare_amount"] = test_preds
sub_df

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.260607
1,2015-01-27 13:08:24.0000003,11.260734
2,2011-10-08 11:53:44.0000002,11.260633
3,2012-12-01 21:12:12.0000002,11.260369
4,2012-12-01 21:12:12.0000003,11.260462
...,...,...
9909,2015-05-10 12:37:51.0000002,11.741593
9910,2015-01-12 17:05:51.0000001,11.741496
9911,2015-04-19 20:44:15.0000001,11.746951
9912,2015-01-31 01:05:19.0000005,11.741280


In [51]:
# create a submission file
sub_df.to_csv("linear_model_submission.csv", index=None)


Kaggle reports a score of 9.4, which is reasonably close to our validation RMSE.

#### Reusable function

In [52]:
def predict_and_submit(model, test_inputs, file_name):
  # 1- create test predictions
  test_preds = model.predict(test_inputs)
  # 2- read the submission file
  sub_df = pd.read_csv(data_dir+"/sample_submission.csv")
  # 3- replace the sample fares with the test predictions
  sub_df["fare_amount"] = test_preds
  # 4- export it to csv
  sub_df.to_csv(file_name, index=None)
  return sub_df




It is important to track ideas systematically to avoid becoming overwhelmed by dozens of models.
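
A lightweight way to do this is to keep one record per experiment and sort by validation score. The sketch below is just one possible layout: the `Experiment` class and its fields are hypothetical, while the RMSE values are the ones reported above for the two models trained so far.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical record describing one modelling attempt."""
    name: str
    train_rmse: float
    val_rmse: float
    notes: str = ""


# RMSE values copied from the results reported earlier in this notebook.
experiments = [
    Experiment("mean_regressor", 9.789782, 9.899954, "always predicts the mean fare"),
    Experiment("linear_regression", 9.788632, 9.897244, "raw coordinates + passenger_count"),
]

# Lower validation RMSE is better.
best = min(experiments, key=lambda e: e.val_rmse)

for e in sorted(experiments, key=lambda e: e.val_rmse):
    print(f"{e.name:<20} train={e.train_rmse:.3f} val={e.val_rmse:.3f}  {e.notes}")
```

Appending one entry after each new model keeps the comparison in a single place as the list of ideas grows.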