In this Python workflow, we explore the Montreal Bixi bike-share data set for the year 2017: https://www.kaggle.com/aubertsigouin/biximtl/data

We have additionally enriched this data set with the biking distance and duration obtained from the Google Maps API, stored as gmdata2017.

Our objective is to predict the "trip duration", given the distance between two stations.

Import AIDA components

In [1]:
from aida.aida import *;

Connection information to AIDA's server

In [2]:
host='cerberus'; dbname='bixi'; user='bixi'; passwd='bixi'; jobName='bixiLinear'; port=55660;

Establish a connection and get a handle to the database workspace.

In [3]:
dw = AIDA.connect(host, dbname, user, passwd, jobName, port);

Let us see what tables we have in the database

In [4]:
print(dw._tables());

      tablename
0  stations2017
1  tripdata2017
2    gmdata2017


Let us take a peek into tripdata2017.

We can see the attributes and explore some sample data. This can be accomplished via the head() or tail() functions (similar to the pandas API) provided by TabularData, which send a sample of data from the server to the client side.

Further, we can use the describe() function to look at the distribution characteristics of the data. It behaves much like the describe() method of a pandas DataFrame, which summarizes the overall distribution of each attribute.

In [5]:
print(dw.tripdata2017.head());
print(dw.tripdata2017.describe());

   id                  starttm  stscode                    endtm  endscode  \
0   0  2017-04-15 00:00:00.000     7060  2017-04-15 00:31:00.000      7060   
1   1  2017-04-15 00:01:00.000     6173  2017-04-15 00:10:00.000      6173   
2   2  2017-04-15 00:01:00.000     6203  2017-04-15 00:04:00.000      6204   
3   3  2017-04-15 00:01:00.000     6104  2017-04-15 00:06:00.000      6114   
4   4  2017-04-15 00:01:00.000     6174  2017-04-15 00:11:00.000      6174   

   duration  ismember  
0      1841         1  
1       553         1  
2       195         1  
3       285         1  
4       569         1  
                            id                 starttm  \
count   [          4018721.00]  [          4018721.00]   
unique  [          4018721.00]  [           227660.00]   
nulls   [                0.00]  [                0.00]   
max     [          4018721.00]  [ 2017-09-30 23:59:00]   
min     [                0.00]  [ 2017-04-15 00:00:00]   
avg     [          2009360.21]  [      

So we have more than 4 million records in tripdata2017. Also, the station codes are just labels, so we may have to enrich this information.
Let us take a look at the contents of stations2017.

In [6]:
print(dw.stations2017.head());
print(dw.stations2017.describe());

   scode                      sname  slatitude  slongitude  sispublic
0   7060  "de l'Église / de Verdun"  45.463001  -73.571569          1
1   6173         "Berri / Cherrier"  45.519088  -73.569509          1
2   6203   "Hutchison / Sherbrooke"  45.507810  -73.572080          1
3   6204        "Milton / Durocher"  45.508144  -73.574772          1
4   6104    "Wolfe / René-Lévesque"  45.516818  -73.554188          1
                         scode                   sname  \
count   [              546.00]  [              546.00]   
unique  [              546.00]  [              546.00]   
nulls   [                0.00]  [                0.00]   
max     [            10002.00]   [Émile-Journault / d]   
min     [             5002.00]  ["10e Avenue / Rosemo]   
avg     [             6412.74]  [                    ]   
median  [             6305.00]  [                    ]   
25%     [             6143.00]  [                    ]   
50%     [             6305.00]  [                    ]   


This is good: we have the longitude and latitude associated with each station, which can be used to enrich the trip data.

Since we have 546 stations, there are 546 x 546 = 298,116 possible station combinations for trips. However, we need not be concerned with trips that started and ended at the same station, as those tell us nothing about the distance travelled. Also, to reduce noise in the input data set, we will limit ourselves to station combinations that have at least 50 trips.

We can use AIDA's powerful relational API to accomplish this.

In [7]:
freqStations = dw.tripdata2017.filter(Q('stscode', 'endscode', CMP.NE)) \
    .aggregate(('stscode','endscode',{COUNT('*'):'numtrips'}), ('stscode','endscode')) \
    .filter(Q('numtrips',C(50), CMP.GTE));
print(freqStations.head());
print(freqStations.describe());

   stscode  endscode  numtrips
0     6203      6204       101
1     6104      6114       308
2     6719      6354        91
3     6175      6118        81
4     6280      6160        50
                       stscode                endscode                numtrips
count   [            19300.00]  [            19300.00]  [            19300.00]
unique  [              544.00]  [              544.00]  [              684.00]
nulls   [                0.00]  [                0.00]  [                0.00]
max     [            10002.00]  [            10002.00]  [             2200.00]
min     [             5002.00]  [             5002.00]  [               50.00]
avg     [             6302.13]  [             6291.78]  [              116.91]
median  [             6194.00]  [             6180.00]  [               81.00]
25%     [             6100.00]  [             6078.00]  [               62.00]
50%     [             6194.00]  [             6180.00]  [               81.00]
75%     [             63
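
For readers without access to an AIDA server, the same filter-aggregate-filter logic can be sketched in plain Python. This is only an illustration on a made-up handful of trips (the `trips` list and the lowered `MIN_TRIPS` threshold are assumptions for the example), not a substitute for the server-side query above.

```python
from collections import Counter

# Hypothetical sample: one (start station, end station) pair per trip.
trips = [(6203, 6204)] * 3 + [(6104, 6114)] * 2 + [(7060, 7060)] * 4 + [(6175, 6118)]

MIN_TRIPS = 2  # the notebook uses 50; lowered here to suit the tiny sample

# Drop trips that start and end at the same station, then count per ordered pair.
pair_counts = Counter((start, end) for start, end in trips if start != end)

# Keep only the station combinations with enough trips.
freq_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_TRIPS}
print(freq_pairs)
```

Note that the pairs are ordered, so a trip from A to B is counted separately from a trip from B to A, as in the aggregate above.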

We can see that there are 19,300 station combinations that are of interest to us.
Next, we need to include the longitude and latitude information of the start and end stations.

This can be done by joining with the station information using AIDA's relational join operator.

In [8]:
freqStationsCord = freqStations \
    .join(dw.stations2017, ('stscode',), ('scode',), COL.ALL, ({'slatitude':'stlat'}, {'slongitude':'stlong'})) \
    .join(dw.stations2017, ('endscode',), ('scode',), COL.ALL, ({'slatitude':'enlat'}, {'slongitude':'enlong'}));
print(freqStationsCord.head());

   stscode  endscode  numtrips      stlat     stlong      enlat     enlong
0     6203      6204       101  45.507810 -73.572080  45.508144 -73.574772
1     6104      6114       308  45.516818 -73.554188  45.523530 -73.551990
2     6719      6354        91  45.460729 -73.634073  45.471743 -73.613924
3     6175      6118        81  45.520541 -73.567751  45.525048 -73.560036
4     6280      6160        50  45.524505 -73.594142  45.532977 -73.581222
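
Conceptually, each of the two joins above is a lookup of a station code in the stations table. A minimal plain-Python sketch of that idea, using coordinates copied from the sample output (the dictionary layout and variable names are assumptions for illustration):

```python
# Station coordinates keyed by station code: (latitude, longitude).
stations = {
    6203: (45.507810, -73.572080),
    6204: (45.508144, -73.574772),
    6104: (45.516818, -73.554188),
    6114: (45.523530, -73.551990),
}

# (start code, end code, number of trips) for two frequent station pairs.
freq_pairs = [(6203, 6204, 101), (6104, 6114, 308)]

rows = []
for stscode, endscode, numtrips in freq_pairs:
    stlat, stlong = stations[stscode]    # first join: start station
    enlat, enlong = stations[endscode]   # second join: end station
    rows.append({
        "stscode": stscode, "endscode": endscode, "numtrips": numtrips,
        "stlat": stlat, "stlong": stlong, "enlat": enlat, "enlong": enlong,
    })

for row in rows:
    print(row)
```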


It would be easier to work with a single distance value than with raw coordinates. Python's geopy module supports this computation using Vincenty's formula. (In recent geopy releases, geopy.distance.distance is implemented with a geodesic algorithm instead; either way the result is a distance on an ellipsoidal model of the Earth.) This gives us the distance as the crow flies between two coordinates, which might be a reasonable approximation of the actual distance travelled in a trip.

Using TabularData's user transform operator, we can generate a data set that also includes this distance metric.

In [9]:
def computeDist(tblrData):
    import geopy.distance;     #We will use this module to compute distance.
    import copy, numpy as np;
    #We are going to keep all the columns of the source tabularData object.
    data = copy.copy(tblrData.rows); #This only makes a copy of the metadata, but retains original column data
    vdistm = data['vdistm'] = np.empty(tblrData.numRows, dtype=int); #add a new empty column to hold distance.
    #These are the inputs to Vincenty's formula.
    stlat = data['stlat']; stlong = data['stlong']; enlat = data['enlat']; enlong = data['enlong'];
    for i in range(0, tblrData.numRows): #populate the distance metric using longitude/latitude of coordinates.
        vdistm[i] = int(geopy.distance.distance((stlat[i],stlong[i]), (enlat[i],enlong[i])).meters);
    return data;

freqStationsDist = freqStationsCord._U(computeDist); #Execute the user transform
print(freqStationsDist.head());                      #Take a peek at a sample data.

   stscode  endscode  numtrips      stlat     stlong      enlat     enlong  \
0     6203      6204       101  45.507810 -73.572080  45.508144 -73.574772   
1     6104      6114       308  45.516818 -73.554188  45.523530 -73.551990   
2     6719      6354        91  45.460729 -73.634073  45.471743 -73.613924   
3     6175      6118        81  45.520541 -73.567751  45.525048 -73.560036   
4     6280      6160        50  45.524505 -73.594142  45.532977 -73.581222   

   vdistm  
0     213  
1     765  
2    1995  
3     783  
4    1380  
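
As a rough cross-check that does not need geopy, the haversine formula gives the great-circle distance on a sphere. It is a simpler approximation than the ellipsoidal computation used above, so small differences are expected; the function name and the mean Earth radius below are choices made for this sketch.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Stations 6203 -> 6204, the first row of the sample output.
d = haversine_m(45.507810, -73.572080, 45.508144, -73.574772)
print(round(d))
```

For this station pair the spherical result lands within a few metres of the vdistm value of 213 shown above.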


We can next enrich our trip data set with the distance information by joining these computed distances with each trip.

In [10]:
tripData = dw.tripdata2017.join(freqStationsDist, ('stscode','endscode'), ('stscode', 'endscode')
                                             , ('id', 'duration'), ('vdistm',));
print(tripData.head());
print(tripData.describe());

   id  duration  vdistm
0   2       195     213
1   3       285     765
2   5       620    1995
3  12       395     783
4  13      1085    1380
                            id                duration                  vdistm
count   [          2256283.00]  [          2256283.00]  [          2256283.00]
unique  [          2256283.00]  [             6663.00]  [             3652.00]
nulls   [                0.00]  [                0.00]  [                0.00]
max     [          4018721.00]  [             7199.00]  [             9074.00]
min     [                2.00]  [               61.00]  [               47.00]
avg     [          2003455.03]  [              630.45]  [             1369.89]
median  [          2005024.00]  [              482.00]  [             1117.00]
25%     [           994510.00]  [              298.00]  [              711.00]
50%     [          2005024.00]  [              482.00]  [             1117.00]
75%     [          3010571.00]  [              793.00]  [         

So we have the duration of each trip and the distance, as the crow flies, between the two stations involved in the trip.

Also, we have about 2.26 million trips for which we know the distance between stations.
Given that there are only a few thousand unique distance values, we can set some of those values aside for testing.
For this purpose, we first get the distinct distance values and then sort them.

In [11]:
uniqueTripDist = tripData[:,['vdistm']].distinct().order('vdistm');
print(uniqueTripDist.head());
print(uniqueTripDist.tail());
print(uniqueTripDist.describe());

   vdistm
0      47
1      71
2      81
3      85
4      88
   vdistm
0    8529
1    8752
2    8860
3    9031
4    9074
                        vdistm
count   [             3652.00]
unique  [             3652.00]
nulls   [                0.00]
max     [             9074.00]
min     [               47.00]
avg     [             2203.77]
median  [             2013.00]
25%     [             1096.00]
50%     [             2013.00]
75%     [             3068.00]
stddev  [             1386.05]


We will keep some data apart for testing. A rule of thumb is 30%. The slice below sets apart every third value (about 33%), spread across the entire range of distance values, which is close enough.

In [12]:
testTripDist = uniqueTripDist[::3];
print(testTripDist.head());
print(testTripDist.tail());
print(testTripDist.describe());

   vdistm
0      47
1      85
2     110
3     126
4     148
   vdistm
0    6796
1    7057
2    7530
3    8752
4    9074
                        vdistm
count   [             1218.00]
unique  [             1218.00]
nulls   [                0.00]
max     [             9074.00]
min     [               47.00]
avg     [             2205.06]
median  [             2012.00]
25%     [             1095.00]
50%     [             2012.00]
75%     [             3069.00]
stddev  [             1390.92]


Now let us get the remaining values for distances to be used for training.

In [13]:
trainTripDist = uniqueTripDist.filter(Q('vdistm', testTripDist, CMP.NOTIN));
print(trainTripDist.head());
print(trainTripDist.tail());
print(trainTripDist.describe());

   vdistm
0      71
1      81
2      88
3      94
4     120
   vdistm
0    7307
1    7650
2    8529
3    8860
4    9031
                        vdistm
count   [             2434.00]
unique  [             2434.00]
nulls   [                0.00]
max     [             9031.00]
min     [               71.00]
avg     [             2203.13]
median  [             2013.00]
25%     [             1096.00]
50%     [             2013.00]
75%     [             3068.00]
stddev  [             1383.60]
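
The split can be reproduced on any sorted list. The sketch below uses the integers 0..3651 as a stand-in for the 3,652 sorted distinct distances (an assumption for illustration) and shows why the counts come out as 1,218 and 2,434.

```python
# Stand-in for the 3,652 distinct distance values, already sorted.
distinct_sorted = list(range(3652))

# Every third value, starting with the smallest, goes to the test set.
test_vals = distinct_sorted[::3]

# Everything else is used for training (the NOTIN filter above).
test_set = set(test_vals)
train_vals = [v for v in distinct_sorted if v not in test_set]

print(len(test_vals), len(train_vals))
print(len(test_vals) / len(distinct_sorted))
```

Because 3,652 leaves a remainder of 1 when divided by 3, the slice picks up both the smallest and the largest value, which matches the test set above containing 47 and 9074.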


Let us now extract the fields of interest to us for the training data, which are just the distance of each trip and its duration.

In [14]:
trainData = tripData.project(('vdistm', 'duration')).filter(Q('vdistm', trainTripDist, CMP.IN));
print(trainData.head());
print(trainData.tail());
print(trainData.describe());

   vdistm  duration
0     213       195
1    1995       620
2     495       565
3     499       280
4     802       252
   vdistm  duration
0    1010       364
1    1318       288
2    1300       692
3     668       179
4     956       306
                        vdistm                duration
count   [          1503408.00]  [          1503408.00]
unique  [             2434.00]  [             6267.00]
nulls   [                0.00]  [                0.00]
max     [             9031.00]  [             7199.00]
min     [               71.00]  [               61.00]
avg     [             1370.76]  [              629.42]
median  [             1109.00]  [              480.00]
25%     [              706.00]  [              295.00]
50%     [             1109.00]  [              480.00]
75%     [             1774.00]  [              793.00]
stddev  [              926.28]  [              536.08]


As the raw values are large (distances in the thousands of metres, durations in the thousands of seconds), we should normalize the attributes. First, get the maximum values of these attributes.

In [15]:
maxdist = uniqueTripDist.max('vdistm');
print(maxdist);
maxduration = tripData.max('duration');
print(maxduration);

9074
7199


Now let us normalize the training data by dividing each attribute by its maximum. As we are working with integer data, we will also have to convert it to float, which can be accomplished by multiplying by 1.0.

In [16]:
trainData = trainData.project((1.0*F('vdistm')/maxdist, 1.0*F('duration')/maxduration));

print(trainData.head());
print(trainData.tail());

     vdistm  duration
0  0.023474  0.027087
1  0.219859  0.086123
2  0.054551  0.078483
3  0.054992  0.038894
4  0.088384  0.035005
     vdistm  duration
0  0.111307  0.050563
1  0.145250  0.040006
2  0.143266  0.096124
3  0.073617  0.024865
4  0.105356  0.042506
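
The scaling is a plain division by the maximum. A short sketch with the first three training rows and the two maxima printed earlier reproduces the normalized values (the list layout is an assumption for illustration):

```python
max_dist, max_duration = 9074, 7199   # maxima printed by the previous cell

# (vdistm, duration) for the first three training rows.
trips = [(213, 195), (1995, 620), (495, 565)]

# Divide each attribute by its maximum so both fall in the range (0, 1].
normalized = [(d / max_dist, t / max_duration) for d, t in trips]

for dist, dur in normalized:
    print(f"{dist:.6f}  {dur:.6f}")
```

In plain Python 3 the `/` operator already returns a float, so no explicit `1.0 *` is needed in this sketch.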


Our linear regression equation is of the form:

dur = a + b*dist

We will reorganize the training data set to fit this format and also set up our initial parameters for a and b.

In [17]:
trainDataSet = dw._ones((trainData.numRows, 1), ("x0",)).hstack(trainData[:,['vdistm']]);
print(trainDataSet.head());
trainDataSetDuration = trainData[:,['duration']];
print(trainDataSetDuration.head());
params = dw._ones((1,2), ("a","b"));
print(params.rows);

    x0    vdistm
0  1.0  0.023474
1  1.0  0.219859
2  1.0  0.054551
3  1.0  0.054992
4  1.0  0.088384
   duration
0  0.027087
1  0.086123
2  0.078483
3  0.038894
4  0.035005
OrderedDict([('a', array([1.])), ('b', array([1.]))])


Let us try to run a prediction using these parameters.

In [18]:
pred = trainDataSet @ params.T;
print(pred.columns);
print(pred.head());

VirtualOrderedColumnsDict([(0, <aidas.dborm.DBTable.Column object at 0x7ff0197e8400>)])
   r_0000000000
0      1.023474
1      1.219859
2      1.054551
3      1.054992
4      1.088384
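
The matrix product `trainDataSet @ params.T` evaluates a + b*dist for every row at once, because the x0 column of ones multiplies a. A minimal list-based sketch of the same computation (variable names are assumptions for illustration):

```python
# Normalized distances of the first three training rows.
x = [0.023474, 0.219859, 0.054551]

# Design matrix: a leading 1.0 per row so that the intercept a is picked up.
X = [[1.0, xi] for xi in x]

params = [1.0, 1.0]  # initial a and b

# Row-by-row dot product, equivalent to X @ params.T.
pred = [sum(feature * p for feature, p in zip(row, params)) for row in X]
print(pred)
```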


We need to compute the squared error of the predictions. Since we will be reusing this computation, we might as well store it as a function.

In [19]:
def squaredErr(actual, predicted):
    return ((predicted-actual)**2).sum()/(2*(actual.shape[0]));

Let us see what the error is for the first iteration.

In [20]:
sqerr = squaredErr(trainDataSetDuration, pred);
print(sqerr);

0.5694865536695626


We need to perform gradient descent on the squared error. We will write another function that computes the gradient.

In [21]:
def gradDesc(actual, predicted, indata):
    return (predicted-actual).T @ indata / actual.shape[0];
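
For reference, squaredErr and gradDesc can be read as the usual half mean squared error and its gradient. The notation below is introduced only for this note: m is the number of training rows, x_i and y_i are the normalized distance and duration of row i, X is the design matrix with the column of ones, and theta = (a, b) is the parameter row vector.

```latex
J(a, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( a + b\,x_i - y_i \right)^2

\frac{\partial J}{\partial a} = \frac{1}{m} \sum_{i=1}^{m} \left( a + b\,x_i - y_i \right),
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( a + b\,x_i - y_i \right) x_i

\nabla J = \frac{1}{m} \left( X \theta^{\mathsf{T}} - y \right)^{\mathsf{T}} X
```

The factor of 2 in the denominator of J cancels when differentiating, which is why it appears in squaredErr but not in gradDesc.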

Let us update our params by taking one gradient descent step with this gradient. We also need a learning rate, alpha (arbitrarily chosen).

In [22]:
alpha = 0.1;

params = params - alpha * gradDesc(trainDataSetDuration, pred, trainDataSet);
print(params.rows);

OrderedDict([('a', array([0.89363664])), ('b', array([0.98330567]))])
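
The updated value of a can be sanity-checked by hand. With a = b = 1, the gradient with respect to a is the mean of (1 + x - y), which only needs the training averages reported by describe() earlier. This is a back-of-the-envelope check, so it agrees only up to the rounding of those averages.

```python
alpha = 0.1
max_dist, max_duration = 9074, 7199
avg_dist, avg_duration = 1370.76, 629.42   # training-set averages from describe()

mean_x = avg_dist / max_dist           # mean normalized distance
mean_y = avg_duration / max_duration   # mean normalized duration

# Gradient w.r.t. a: mean of (a + b*x - y) with a = b = 1.
grad_a = 1.0 + mean_x - mean_y
a_new = 1.0 - alpha * grad_a
print(round(a_new, 4))
```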


Now let us use the updated params to run the prediction again and see if the error is decreasing.

In [23]:
pred = trainDataSet @ params.T;
print(pred.head());
sqerr = squaredErr(trainDataSetDuration, pred);
print(sqerr);

   r_0000000000
0      0.916718
1      1.109825
2      0.947277
3      0.947711
4      0.980546
0.4594973598186901


Before we proceed, maybe we should check whether the Google Maps API's distance metric gives a lower error. Let us see what fields we can use from Google.

In [24]:
print(dw.gmdata2017.head());
print(dw.gmdata2017.describe());

   stscode  endscode  gdistm  gduration
0     6406      6052    3568        596
1     6050      6406    3821        704
2     6148      6173    1078        293
3     6110      6114    1319        337
4     6123      6114     725        177
                       stscode                endscode  \
count   [            19516.00]  [            19516.00]   
unique  [              544.00]  [              544.00]   
nulls   [                0.00]  [                0.00]   
max     [            10002.00]  [            10002.00]   
min     [             5002.00]  [             5002.00]   
avg     [             6302.62]  [             6291.74]   
median  [             6194.00]  [             6180.00]   
25%     [             6100.00]  [             6078.00]   
50%     [             6194.00]  [             6180.00]   
75%     [             6350.00]  [             6362.00]   
stddev  [              350.30]  [              350.46]   

                        gdistm               gduration  
count 

We can build a new data set for the trips between frequently used station combinations that includes Google's distance.

In [25]:
gtripData = dw.gmdata2017 \
    .join(dw.tripdata2017, ('stscode','endscode'), ('stscode', 'endscode'), COL.ALL, COL.ALL) \
    .join(freqStations, ('stscode','endscode'), ('stscode', 'endscode') \
                          , ('id', 'duration', 'gdistm', 'gduration') );
print(gtripData.head());
print(gtripData.describe());

   id  duration  gdistm  gduration
0   2       195     288        218
1   3       285    1007        296
2   5       620    2587        538
3  12       395    1615        322
4  13      1085    1710        352
                            id                duration  \
count   [          2256283.00]  [          2256283.00]   
unique  [          2256283.00]  [             6663.00]   
nulls   [                0.00]  [                0.00]   
max     [          4018721.00]  [             7199.00]   
min     [                2.00]  [               61.00]   
avg     [          2003455.03]  [              630.45]   
median  [          2005024.00]  [              482.00]   
25%     [           994510.00]  [              298.00]   
50%     [          2005024.00]  [              482.00]   
75%     [          3010571.00]  [              793.00]   
stddev  [          1162662.25]  [              533.44]   

                        gdistm               gduration  
count   [          2256283.00]  [   

Google also provides its estimated duration for the trip. We will have to see in the end if our trained model is able to predict the trip duration better than google's estimate. So we will also save Google's estimate for the trip duration for that comparison.

Next up, we need to format this dataset the same way we did the first one.

In [26]:
guniqueTripDist = gtripData[:,['gdistm']].distinct().order('gdistm');
gtestTripDist = guniqueTripDist[::3];
gtrainTripDist = guniqueTripDist.filter(Q('gdistm', gtestTripDist, CMP.NOTIN));
gtrainData = gtripData.project(('gdistm', 'duration')).filter(Q('gdistm', gtrainTripDist, CMP.IN));

gmaxdist = guniqueTripDist.max('gdistm');
print(gmaxdist);
gmaxduration = gtripData.max('duration');
print(gmaxduration);
gtrainData = gtrainData.project((1.0*F('gdistm')/gmaxdist, 1.0*F('duration')/gmaxduration));

gtrainDataSet = dw._ones((gtrainData.numRows, 1), ("x0",)).hstack(gtrainData[:,['gdistm']]);
gtrainDataSetDuration = gtrainData[:,['duration']];
gparams = dw._ones((1,2), ("a","b"));

14530
7199


Let us see how the error rate is progressing for the new dataset.

In [27]:
gpred = gtrainDataSet @ gparams.T;
gsqerr = squaredErr(gtrainDataSetDuration, gpred);
print(gsqerr);
gparams = gparams - alpha * gradDesc(gtrainDataSetDuration, gpred, gtrainDataSet);
gpred = gtrainDataSet @ gparams.T;
gsqerr = squaredErr(gtrainDataSetDuration, gpred);
print(gsqerr);

0.5419675497480458
0.4379084758530774


It looks like using Google Maps' distance gives us a slight advantage: the error after one update is about 0.438, versus about 0.459 with the crow-flies distance. That makes sense, since Vincenty's formula computes distances as the crow flies, whereas Google Maps' distance metric is based on the actual road network. Better data gives better prediction results!

We are done with the feature selection and feature engineering phase for now.

Next we will proceed to train our linear regression model using the training data set.

Meanwhile, we will also print the error at regular intervals so that we can confirm it is decreasing.

In [28]:
for i in range(0, 1000):
    gpred = gtrainDataSet @ gparams.T;
    gparams = gparams - alpha*gradDesc(gtrainDataSetDuration, gpred, gtrainDataSet);
    if((i+1)%100 == 0):
        print("Error rate after {} iterations is {}".format(i+1, squaredErr(gtrainDataSetDuration, gpred)))
    
print(gparams.rows);
gsqerr = squaredErr(gtrainDataSetDuration, gpred);
print(gsqerr);

Error rate after 100 iterations is 0.002414885788154281
Error rate after 200 iterations is 0.0023518940367528206
Error rate after 300 iterations is 0.0022972381907587756
Error rate after 400 iterations is 0.0022498149319043433
Error rate after 500 iterations is 0.0022086671747070705
Error rate after 600 iterations is 0.0021729644844393505
Error rate after 700 iterations is 0.002141986317482605
Error rate after 800 iterations is 0.0021151074794789697
Error rate after 900 iterations is 0.002091785507800193
Error rate after 1000 iterations is 0.00207154972368958
OrderedDict([('a', array([0.00295944])), ('b', array([0.67162878]))])
0.00207154972368958
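
The same training loop can be tried without AIDA on a tiny synthetic data set. The sketch below is an illustration only: the data is noise-free and generated from hypothetical true parameters (0.003 and 0.67), the helper names are made up, and it runs 5,000 iterations so that the slowly converging slope settles fully.

```python
def predict(params, xs):
    a, b = params
    return [a + b * x for x in xs]

def half_mse(actual, predicted):
    # Same quantity as squaredErr: sum of squared residuals over 2m.
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / (2 * len(actual))

def gradient(actual, predicted, xs):
    # Same quantity as gradDesc: mean residual, and mean residual times x.
    m = len(actual)
    residuals = [p - y for p, y in zip(predicted, actual)]
    return sum(residuals) / m, sum(r * x for r, x in zip(residuals, xs)) / m

# Synthetic, noise-free data on [0, 1] (hypothetical true parameters).
xs = [i / 10 for i in range(11)]
ys = [0.003 + 0.67 * x for x in xs]

params = (1.0, 1.0)   # same starting point as the notebook
alpha = 0.1           # same learning rate as the notebook
for _ in range(5000):
    grad_a, grad_b = gradient(ys, predict(params, xs), xs)
    params = (params[0] - alpha * grad_a, params[1] - alpha * grad_b)

final_err = half_mse(ys, predict(params, xs))
print(params, final_err)
```

Here the final error is evaluated with predictions from the updated parameters, whereas the loop above reports the error of the predictions made just before each update.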


Let us see how our model performs in predictions against the test data set we had kept apart.

In [29]:
gtestData = gtripData.project(('gdistm', 'duration', 'gduration')).filter(Q('gdistm', gtestTripDist, CMP.IN));
gtestData = gtestData.project((1.0*F('gdistm')/gmaxdist, 1.0*F('duration')/gmaxduration, 'gduration'));
gtestDataSet = dw._ones((gtestData.numRows, 1), ("x0",)).hstack(gtestData[:,['gdistm']]);
gtestDataSetDuration = gtestData[:,['duration']];

gtestpred = gtestDataSet @ gparams.T;

gtestsqerr1 = squaredErr(gtestDataSetDuration*gmaxduration, gtestpred*gmaxduration);
print(gtestsqerr1);

99215.98424571125


We would also like to check how the duration provided by the Google Maps API holds up against the test data set.

In [30]:
gtestsqerr2 = squaredErr(gtestDataSetDuration*gmaxduration, gtestData[:,['gduration']]);
print(gtestsqerr2);

111763.37983591038


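So yes, our model does a better job: its squared error on the test set (about 99,216) is roughly 11% lower than that of Google's estimate (about 111,763).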
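
To make the two test errors easier to interpret, they can be converted to a root mean squared error. Since squaredErr returns half of the mean squared error, the RMSE is the square root of twice that value; the durations in this data set are in seconds (the first sample trip lasts 31 minutes and has a duration of 1841). The variable names below are chosen for this sketch.

```python
import math

model_err = 99215.98424571125     # gtestsqerr1: our linear model
google_err = 111763.37983591038   # gtestsqerr2: Google's duration estimate

# squaredErr is half the mean squared error, so RMSE = sqrt(2 * err).
model_rmse = math.sqrt(2 * model_err)
google_rmse = math.sqrt(2 * google_err)

# Relative reduction in squared error achieved by the model.
improvement = 1 - model_err / google_err

print(f"model RMSE:  {model_rmse:.1f} s")
print(f"Google RMSE: {google_rmse:.1f} s")
print(f"squared error reduced by {improvement:.1%}")
```

Both errors are still several minutes per trip, which suggests that distance alone explains only part of the variation in trip duration.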