Step 1: Import Libraries and Load Data

There are two Beijing datasets: beijing.Rdata and beijing_clus.Rdata. beijing_clus.Rdata clusters the 315 regions into 26 regions and takes significantly less time to load.

The path below may need to be changed, since the directory layout in the Dropbox folder might not match the one I used locally.

In [1]:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
import pandas as pd
import numpy as np
import math

pandas2ri.activate()
robjects.r['load']("./CrowdFlow/data/beijing_clus.RData")

0
'alldata'
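
Since the path can differ between machines, one option is to check a few candidate folders before loading. This is only a sketch: the helper name find_data_file and the candidate folders in the usage comment are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def find_data_file(filename, candidate_dirs):
    """Return the first existing path to `filename` among `candidate_dirs`."""
    for directory in candidate_dirs:
        path = Path(directory) / filename
        if path.is_file():
            return path
    raise FileNotFoundError(f"{filename} not found in any of: {list(candidate_dirs)}")

# Hypothetical usage (the folder names are assumptions, adjust them to your setup):
# data_path = find_data_file("beijing_clus.RData", ["./CrowdFlow/data", "./data"])
# robjects.r['load'](str(data_path))
```

Failing early with a FileNotFoundError that lists the folders tried makes a wrong path easier to spot than an error raised from inside R.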


Step 2: Process data

The original data consists of several tables. I separated them into individual dataframes so I could use them in Python.

Data_flow holds the time index, region, inflow and outflow. I added one extra column called traffic_count, which is the sum of inflow and outflow.

Data_timemap maps the 'time' column of the data_flow dataframe to the actual time in the day, day in the week, hour in the day, hour in the week and week number.

Data_weather holds the weather information for each time index: temperature, pressure, humidity, windspeed, winddirection and weather (a categorical code from 0 to 16 for conditions such as sunny, rainy or snowy). I focused on just the 'Weather' column of the data_weather dataframe.

Data_flow and data_weather are merged together for easier processing. 

I also defined good weather as weather types 0 to 2 and bad weather as every type above 2. Rows without a weather code keep the placeholder value -1.

I then separated the data_flow dataset into a training set and a testing set. The data came with a predetermined testing range starting at time index 5087, which is what I used. Thus the training set consists of data up to time index 5086 and the testing set consists of data from time index 5087 to 5857.

In [2]:
data = robjects.r['alldata']

# Flow table: time index, region, inflow, outflow, plus the derived traffic_count
data_flow = robjects.r['alldata'][0]
data_flow['traffic_count'] = data_flow['outflow'] + data_flow['inflow']

# Time-index mapping and weather tables
data_timemap = robjects.r['alldata'][1]
data_weather = robjects.r['alldata'][2]

# Attach the weather columns to every flow row
data_flow = data_flow.merge(data_weather, how='left', on='time')
testtime = robjects.r['alldata'][-1]

# Binary weather flag: 0 = good (types 0 to 2), 1 = bad (types above 2), -1 = no weather code
good_weather = data_weather.loc[data_weather['Weather'] <= 2]['time'].values.tolist()
bad_weather = data_weather.loc[data_weather['Weather'] > 2]['time'].values.tolist()
data_flow['Weather_T'] = -1
data_flow.loc[data_flow['time'].isin(good_weather), 'Weather_T'] = 0
data_flow.loc[data_flow['time'].isin(bad_weather), 'Weather_T'] = 1

# Split at the predefined test boundary (time index 5087)
test_data = data_flow.loc[data_flow['time'] >= 5087].reset_index()
train_data = data_flow.loc[data_flow['time'] < 5087].reset_index()
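
The weather flag and the train/test split described above can also be written directly on the merged table, without building intermediate lists of time indices. The sketch below uses a made-up toy table (toy_flow and its values are assumptions for illustration) and assumes each time index has a single weather row, so the row-wise 'Weather' value after the merge matches the lookup by time.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged data_flow table (values are made up)
toy_flow = pd.DataFrame({
    'time':    [5085, 5086, 5087, 5088, 5089],
    'region':  [1, 1, 1, 1, 1],
    'Weather': [0.0, 4.0, 2.0, np.nan, 14.0],
})

# 0 = good (codes 0 to 2), 1 = bad (codes above 2), -1 = no weather code.
# NaN fails both comparisons, so it falls through to the default.
toy_flow['Weather_T'] = np.select(
    [toy_flow['Weather'] <= 2, toy_flow['Weather'] > 2],
    [0, 1],
    default=-1,
)

TEST_START = 5087  # first time index of the predefined test range
# drop=True discards the old index; plain reset_index() would keep it as an 'index' column
toy_train = toy_flow.loc[toy_flow['time'] < TEST_START].reset_index(drop=True)
toy_test = toy_flow.loc[toy_flow['time'] >= TEST_START].reset_index(drop=True)

print(toy_flow['Weather_T'].tolist())  # [0, 1, 0, -1, 1]
print(len(toy_train), len(toy_test))   # 2 3
```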

This function returns all rows of data_timemap that have the same dayinweek and hourinday as a given time index.

In [4]:
def newGetTimeFromTime(time, data_timemap):
    # Row of data_timemap for the requested time index
    row = data_timemap.loc[data_timemap['time'] == time]
    dayinweek = row['dayinweek'].values[0]
    hourinday = row['hourinday'].values[0]
    # All rows in the same weekly slot (same day of the week and hour of the day)
    return data_timemap.loc[(data_timemap['dayinweek'] == dayinweek) & (data_timemap['hourinday'] == hourinday)]
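
To make the lookup concrete, here is a small self-contained sketch on a made-up time map. The table toy_timemap (three weeks of hourly, 1-based time indices) and the helper name same_slot_times are assumptions for illustration; the real data_timemap may be laid out differently.

```python
import pandas as pd

# Toy time map: three weeks of hourly time indices, starting at 1 (made up)
hours = range(1, 3 * 7 * 24 + 1)
toy_timemap = pd.DataFrame({
    'time': list(hours),
    'dayinweek': [((t - 1) // 24) % 7 for t in hours],
    'hourinday': [(t - 1) % 24 for t in hours],
})

def same_slot_times(time, timemap):
    """Time indices that share day-of-week and hour-of-day with `time`."""
    row = timemap.loc[timemap['time'] == time].iloc[0]
    mask = (timemap['dayinweek'] == row['dayinweek']) & (timemap['hourinday'] == row['hourinday'])
    return timemap.loc[mask, 'time'].tolist()

# Index 30 is day 1, hour 5; the same slot recurs every 168 hours
print(same_slot_times(30, toy_timemap))  # [30, 198, 366]
```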

Step 3: Run the prediction

I traverse the entire testing set. For each row, I select the training rows with the same region, the same weather class (Weather_T) and the same time slot (dayinweek and hourinday), then take the average of inflow, outflow and traffic_count and append it to a list. Repeating this for every row of the test data gives the predicted inflow, outflow and traffic_count for the whole testing set.

In [15]:
predicted_outflow = []
predicted_inflow = []
predicted_trafficcount = []

for i in range (len(test_data.index)):
    time = test_data.iloc[[i]]['time'].values[0]
    region = test_data.iloc[[i]]['region'].values[0]
    weather = test_data.iloc[[i]]['Weather_T'].values[0]
    dataofsameregion = train_data.loc[train_data['region'] == region]
    sameweathersameregion = dataofsameregion.loc[dataofsameregion['Weather_T'] == weather]
    time_array = newGetTimeFromTime(time, data_timemap)

    
    df_new = sameweathersameregion[sameweathersameregion['time'].isin(time_array['time'].values.tolist())]

    predicted_outflow.append(round(df_new['outflow'].mean(), 3))
    predicted_inflow.append(round(df_new['inflow'].mean(), 3))
    predicted_trafficcount.append(round(df_new['traffic_count'].mean(), 3))

# print ("end of loop")
# print (predicted_inflow)
# print (predicted_outflow)
test_data['Pred_Outflow'] = predicted_outflow
test_data['Pred_Inflow'] = predicted_inflow
test_data['Pred_TrafficCount'] = predicted_trafficcount

KeyboardInterrupt: 
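
The row-by-row loop filters the training set once per test row, which is slow on a large test set. The same historical averages can be computed once per group and joined back. This is an alternative sketch, not the code that was run above: the function name predict_historical_average and the toy tables are assumptions, and it expects data_timemap to hold one row per time index.

```python
import pandas as pd

KEYS = ['region', 'Weather_T', 'dayinweek', 'hourinday']
PRED_NAMES = {'outflow': 'Pred_Outflow', 'inflow': 'Pred_Inflow',
              'traffic_count': 'Pred_TrafficCount'}

def predict_historical_average(train, test, timemap):
    """Mean of the training targets per (region, weather class, weekly slot)."""
    slots = timemap[['time', 'dayinweek', 'hourinday']]
    # Attach the weekly slot to every row
    train_s = train.merge(slots, on='time', how='left')
    test_s = test.merge(slots, on='time', how='left')
    # One mean per group instead of one filter per test row
    means = (train_s.groupby(KEYS)[list(PRED_NAMES)].mean().round(3)
             .rename(columns=PRED_NAMES).reset_index())
    # Groups that never occur in the training data give NaN predictions
    return test_s.merge(means, on=KEYS, how='left')

# Toy data (made up): times 1, 3 and 4 share one slot, times 2 and 5 another
toy_timemap = pd.DataFrame({'time': [1, 2, 3, 4, 5],
                            'dayinweek': [0, 0, 0, 0, 0],
                            'hourinday': [8, 9, 8, 8, 9]})
toy_train = pd.DataFrame({'time': [1, 2, 3], 'region': [1, 1, 1], 'Weather_T': [0, 0, 0],
                          'outflow': [2, 5, 4], 'inflow': [1, 0, 3],
                          'traffic_count': [3, 5, 7]})
toy_test = pd.DataFrame({'time': [4, 5, 5], 'region': [1, 1, 2], 'Weather_T': [0, 0, 0]})

toy_pred = predict_historical_average(toy_train, toy_test, toy_timemap)
print(toy_pred[['time', 'region', 'Pred_Outflow', 'Pred_Inflow', 'Pred_TrafficCount']])
```

In the toy result, the first test row averages training times 1 and 3, the second reuses training time 2, and the third (region 2) has no matching training rows, so its predictions are NaN.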

Step 4: Generate squared errors for inflow, outflow and traffic_count

In [9]:
# Per-row squared errors; the columns are named MSE_*, but the mean is only taken in Step 5
test_data['MSE_Out'] = (test_data['outflow'] - test_data['Pred_Outflow']) ** 2
test_data['MSE_In'] = (test_data['inflow'] - test_data['Pred_Inflow']) ** 2
test_data['MSE_Traffic_Count'] = (test_data['traffic_count'] - test_data['Pred_TrafficCount']) ** 2


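Step 5: Print RMSE for inflow, outflow and traffic_count on the testing set

First I print the RMSE for each of the 17 weather types (0 to 16), then the RMSE for good and bad weather, and lastly the overall RMSE without splitting by weather. The labels in the output say "MSE", but every printed value is the square root of a mean squared error, i.e. an RMSE. A value of nan means the test set has no rows with a squared error for that weather type.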

In [11]:
print ('-------- Outflow MSE for all weather type --------')

for i in range (0, 17):

    print ('Weather = ', i, ' ', math.sqrt(test_data.loc[test_data['Weather'] == i]['MSE_Out'].mean()))

print ('-------- Inflow MSE for all weather type --------')

for i in range (0, 17):
    
    print ('Weather = ', i, ' ', math.sqrt(test_data.loc[test_data['Weather'] == i]['MSE_In'].mean()))

print ('-------- Traffic_Count MSE for all weather type --------')

for i in range (0, 17):
    
    print ('Weather = ', i, ' ', math.sqrt(test_data.loc[test_data['Weather'] == i]['MSE_Traffic_Count'].mean()))


print ('-------- Traffic_Count MSE for good and bad weather type --------')

print ('good weather inflow: ', math.sqrt(test_data.loc[test_data['Weather_T'] == 0]['MSE_In'].mean()))
print ('bad weather inflow: ', math.sqrt(test_data.loc[test_data['Weather_T'] == 1]['MSE_In'].mean()))
print ('good weather outflow: ', math.sqrt(test_data.loc[test_data['Weather_T'] == 0]['MSE_Out'].mean()))
print ('bad weather outflow: ', math.sqrt(test_data.loc[test_data['Weather_T'] == 1]['MSE_Out'].mean()))
print ('good weather TC: ', math.sqrt(test_data.loc[test_data['Weather_T'] == 0]['MSE_Traffic_Count'].mean()))
print ('bad weather TC: ', math.sqrt(test_data.loc[test_data['Weather_T'] == 1]['MSE_Traffic_Count'].mean()))
print ('MSE_TC: ', math.sqrt(test_data['MSE_Traffic_Count'].mean()))
print ('MSE_Inflow: ', math.sqrt(test_data['MSE_In'].mean()))
print ('MSE_Outflow: ', math.sqrt(test_data['MSE_Out'].mean()))

-------- Outflow MSE for all weather type --------
Weather =  0   4.076764977292716
Weather =  1   6.285187891571001
Weather =  2   5.092411854060477
Weather =  3   nan
Weather =  4   5.417985187508682
Weather =  5   nan
Weather =  6   nan
Weather =  7   nan
Weather =  8   4.928654430129434
Weather =  9   nan
Weather =  10   nan
Weather =  11   nan
Weather =  12   nan
Weather =  13   nan
Weather =  14   5.38231814918106
Weather =  15   nan
Weather =  16   nan
-------- Inflow MSE for all weather type --------
Weather =  0   5.950600337906161
Weather =  1   6.255922086248066
Weather =  2   4.466630821521653
Weather =  3   nan
Weather =  4   4.501972671674934
Weather =  5   nan
Weather =  6   nan
Weather =  7   nan
Weather =  8   4.386210891971109
Weather =  9   nan
Weather =  10   nan
Weather =  11   nan
Weather =  12   nan
Weather =  13   nan
Weather =  14   4.4627019631900735
Weather =  15   nan
Weather =  16   nan
-------- Traffic_Count MSE for all weather type --------
Weather =  0  
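
The per-weather RMSE can also be computed in one pass with a groupby, which only lists the weather types that actually occur in the test set. A minimal sketch on made-up numbers; the table toy and the column name SE_Out are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy test rows (made up): observed and predicted outflow per weather type
toy = pd.DataFrame({
    'Weather':      [0, 0, 1, 1, 4],
    'outflow':      [4, 4, 2, 2, 0],
    'Pred_Outflow': [1.0, 7.0, 4.0, 0.0, 1.0],
})
toy['SE_Out'] = (toy['outflow'] - toy['Pred_Outflow']) ** 2

# RMSE = square root of the mean squared error, per weather type and overall
rmse_by_weather = np.sqrt(toy.groupby('Weather')['SE_Out'].mean())
rmse_overall = np.sqrt(toy['SE_Out'].mean())

print(rmse_by_weather.to_dict())  # {0: 3.0, 1: 2.0, 4: 1.0}
print(rmse_overall)
```

Note that the overall RMSE is not the average of the per-type RMSEs, because the mean is taken over all squared errors before the square root.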

In [10]:
test_data

Unnamed: 0,index,time,region,outflow,inflow,traffic_count,Temperature,Pressure,Humidity,WindSpeed,WindDirection,Weather,Weather_T,Pred_Outflow,Pred_Inflow,Pred_TrafficCount
0,4844,5087,1,4,0,4,25.5,1004.0,54.0,2.4,23.0,0.0,0,1.786,0.214,2.000
1,4845,5088,1,1,0,1,24.7,1004.0,57.0,2.2,23.0,0.0,0,1.571,0.071,1.643
2,4846,5089,1,2,0,2,24.3,1005.0,58.0,2.1,23.0,0.0,0,1.571,0.071,1.643
3,4847,5090,1,2,0,2,23.4,1005.0,62.0,0.8,23.0,0.0,0,1.357,0.071,1.429
4,4848,5091,1,0,0,0,23.3,1005.0,62.0,1.0,23.0,0.0,0,1.286,0.071,1.357
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213565,1739425,5852,315,0,0,0,23.2,994.0,55.0,3.6,23.0,0.0,0,0.000,0.071,0.071
213566,1739426,5853,315,0,1,1,23.2,994.0,55.0,3.6,23.0,0.0,0,0.214,0.000,0.214
213567,1739427,5854,315,0,1,1,23.2,994.0,55.0,3.6,23.0,0.0,0,0.000,0.000,0.000
213568,1739428,5855,315,0,0,0,23.2,994.0,55.0,3.6,23.0,0.0,0,0.000,0.143,0.143


In [19]:
# Per-row errors, computed on whole columns at once
err_out = test_data['outflow'] - test_data['Pred_Outflow']
err_in = test_data['inflow'] - test_data['Pred_Inflow']
err_tc = test_data['traffic_count'] - test_data['Pred_TrafficCount']

# Squared errors (averaged into MSE in the next cell)
test_data['MSE_Out'] = err_out ** 2
test_data['MSE_In'] = err_in ** 2
test_data['MSE_Traffic'] = err_tc ** 2

# Absolute errors (averaged into MAE in the next cell)
test_data['MAE_Out'] = err_out.abs()
test_data['MAE_In'] = err_in.abs()
# Caution: this reuses the 'MAE_In' key, so the inflow errors are overwritten by the
# traffic-count errors; a separate name such as 'MAE_Traffic' would keep both.
test_data['MAE_In'] = err_tc.abs()

In [20]:
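# Note: these filters compare the raw 'Weather' code (0 vs. 1), not the binary 'Weather_T' flag from Step 2
goodweather_testdata = test_data.loc[test_data['Weather'] == 0]
badweather_testdata = test_data.loc[test_data['Weather'] == 1]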
print (goodweather_testdata['MSE_Out'].mean())
print (badweather_testdata['MSE_Out'].mean())
print (goodweather_testdata['MSE_In'].mean())
print (badweather_testdata['MSE_In'].mean())
print ('good_weather_testdata_MSE_Traffic', goodweather_testdata['MSE_Traffic'].mean())
print ('badweather_testdata_MSE_Traffic', badweather_testdata['MSE_Traffic'].mean())
print ('testdata_MSE_Traffic', test_data['MSE_Traffic'].mean())
print ('good_weather_testdata_MSE_Traffic', goodweather_testdata['MSE_Traffic'].mean())
print ('badweather_testdata_MSE_Traffic', badweather_testdata['MSE_Traffic'].mean())
print ('testdata_RMSE_Traffic', math.sqrt(test_data['MSE_Traffic'].mean()))
print (test_data['MSE_Out'].mean())
print (test_data['MSE_In'].mean())
print (math.sqrt(test_data['MSE_Out'].mean()))
print (math.sqrt(test_data['MSE_In'].mean()))
print (test_data['MAE_Out'].mean())
print (test_data['MAE_In'].mean())

26.686417186227256
34.58230915298701
29.25453369422183
21.142584596527172
good_weather_testdata_MSE_Traffic 63.299969946573746
badweather_testdata_MSE_Traffic 63.33752620795575
testdata_MSE_Traffic 63.318805153649826
good_weather_testdata_MSE_Traffic 63.299969946573746
badweather_testdata_MSE_Traffic 63.33752620795575
testdata_RMSE_Traffic 7.957311427464042
30.64636300542288
25.186230955256423
5.535915733229949
5.018588542135769
2.4130561875814163
3.7875033772374205


In [41]:
good_weather = data_weather.loc[data_weather['Weather'] <= 5]['time'].values.tolist()
bad_weather = data_weather.loc[data_weather['Weather'] > 5]['time'].values.tolist()
invalid = data_weather.loc[data_weather['Weather']  == 16]['time'].values.tolist()
print (len(good_weather))
print (len(bad_weather))
print (len(invalid))
print (len(data_weather.index))

4429
939
34
5857


In [54]:
w0 = data_weather.loc[data_weather['Weather'] == 0.0]['time'].values.tolist()
w1 = data_weather.loc[data_weather['Weather'] == 1.0]['time'].values.tolist()
w2 = data_weather.loc[data_weather['Weather'] == 2.0]['time'].values.tolist()
w3 = data_weather.loc[data_weather['Weather'] == 3.0]['time'].values.tolist()
w4 = data_weather.loc[data_weather['Weather'] == 4.0]['time'].values.tolist()
w5 = data_weather.loc[data_weather['Weather'] == 5.0]['time'].values.tolist()
w6 = data_weather.loc[data_weather['Weather'] == 6.0]['time'].values.tolist()
w7 = data_weather.loc[data_weather['Weather'] == 7.0]['time'].values.tolist()
w8 = data_weather.loc[data_weather['Weather'] == 8.0]['time'].values.tolist()
w9 = data_weather.loc[data_weather['Weather'] == 9.0]['time'].values.tolist()
w10 = data_weather.loc[data_weather['Weather'] == 10.0]['time'].values.tolist()
w11 = data_weather.loc[data_weather['Weather'] == 11.0]['time'].values.tolist()
w12 = data_weather.loc[data_weather['Weather'] == 12.0]['time'].values.tolist()
w13 = data_weather.loc[data_weather['Weather'] == 13.0]['time'].values.tolist()
w14 = data_weather.loc[data_weather['Weather'] == 14.0]['time'].values.tolist()
w15 = data_weather.loc[data_weather['Weather'] == 15.0]['time'].values.tolist()
w16 = data_weather.loc[data_weather['Weather'] == 16.0]['time'].values.tolist()
print (len(w0) + len(w1) + len(w2) + len(w3) + len(w4) + len(w5) + len(w6) + len(w7) + len(w8) + len(w9) + len(w10) + len(w11) + len(w12) + len(w13) + len(w14) + len(w15) + len(w16))

5368


In [47]:
len(data_weather)

5857

In [57]:
len(data_weather.loc[data_weather['Weather'] >= 0]['time'].values.tolist())

5368
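
The counts above show that only 5368 of the 5857 rows in data_weather match a weather code, which leaves 5857 - 5368 = 489 rows without one. These are presumably NaN, since NaN fails every comparison, and such rows keep Weather_T = -1 in Step 2. A shorter way to inspect this is sketched below on a made-up table; toy_weather and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy weather table (made up): two rows have no weather code
toy_weather = pd.DataFrame({
    'time': [1, 2, 3, 4, 5, 6],
    'Weather': [0.0, 0.0, 2.0, np.nan, 16.0, np.nan],
})

# Rows per weather code, with missing codes counted as their own entry
counts = toy_weather['Weather'].value_counts(dropna=False)
n_missing = int(toy_weather['Weather'].isna().sum())
n_coded = int(toy_weather['Weather'].notna().sum())

print(counts)
print(n_coded, n_missing)  # 4 2
```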