## Business Understanding
Our task for this lab is to build our own classification model that predicts how many Uber pickups there will be (low, medium, or high) from the information in our dataset. The dataset is a collection of information about Uber pickups, such as time and location, joined with other data: the weather for that time and location, the borough, and whether or not the day was a NYC public holiday. We split our predictions up by borough because boroughs like Manhattan almost always have a higher volume of pickups than boroughs like the Bronx, so aggregate predictions over all of NYC would not be very insightful. Instead, we make predictions specific to each borough, with the exception of EWR and Staten Island, which we threw out because they did not contain enough data to make accurate predictions. We denote a "high" amount of pickups as more than half a standard deviation above the mean for that borough. A "low" amount is more than half a standard deviation below the mean for that borough. A "medium" amount is anything in between.

Our prediction task is valuable because it gives Uber insight into the time periods where they can be most profitable, and the time periods where they can save money. For example, on New Year's Eve there is most likely an extreme surge in the number of rides requested. If there are not enough drivers to satisfy all of these requests, people will go to Lyft or just hail a yellow cab. However, if Uber prepares for this surge by incentivizing drivers with an extra percentage of the fare, there will be more drivers to satisfy the extra ride requests. Our model's insights would help pull in more profit and increase market share compared to treating every day and location as equally profitable. In production, this model would provide the best results if it were deployed to run continuously and react to changing weather conditions, social movements, etc.

In [161]:
import numpy as np

## Data Prep

In [162]:
import pandas as pd
data = pd.read_csv("./data/uber_nyc_enriched.csv")

In [163]:
data.describe()

Unnamed: 0,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd
count,29101.0,29101.0,29101.0,29101.0,29101.0,29101.0,29101.0,29101.0,29101.0,29101.0
mean,490.215903,5.984924,8.818125,47.669042,30.823065,1017.817938,0.00383,0.026129,0.090464,2.529169
std,995.649536,3.699007,2.442897,19.814969,21.283444,7.768796,0.018933,0.093125,0.219402,4.520325
min,0.0,0.0,0.0,2.0,-16.0,991.4,0.0,0.0,0.0,0.0
25%,1.0,3.0,9.1,32.0,14.0,1012.5,0.0,0.0,0.0,0.0
50%,54.0,6.0,10.0,46.0,30.0,1018.2,0.0,0.0,0.0,0.0
75%,449.0,8.0,10.0,64.5,50.0,1022.9,0.0,0.0,0.05,2.958333
max,7883.0,21.0,10.0,89.0,73.0,1043.4,0.28,1.24,2.1,19.0


<p> Checking for NaN or null values in the dataset. </p>

In [164]:
data.isnull().values.any()

True

<p> Only the borough column has NaN values, so we remove the rows where it is missing. </p>

In [165]:
data.isnull().any()

pickup_dt    False
borough       True
pickups      False
spd          False
vsb          False
temp         False
dewp         False
slp          False
pcp01        False
pcp06        False
pcp24        False
sd           False
hday         False
dtype: bool

In [166]:
data = data.dropna()

In [167]:
data.isnull().any()

pickup_dt    False
borough      False
pickups      False
spd          False
vsb          False
temp         False
dewp         False
slp          False
pcp01        False
pcp06        False
pcp24        False
sd           False
hday         False
dtype: bool

<p> We found that most of our features had little correlation with one another, except temperature and dew point temperature (about 0.90). We decided to drop the dew point because it is very similar to temperature and we did not think it would add information for the model. </p>

In [168]:
data.corr()

Unnamed: 0,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd
pickups,1.0,0.009741,-0.008429,0.063692,0.040082,-0.015708,0.005007,-0.002821,-0.022935,-0.009676
spd,0.009741,1.0,0.086177,-0.296126,-0.321606,-0.092761,-0.000357,0.016668,-0.010412,0.097041
vsb,-0.008429,0.086177,1.0,0.025214,-0.231294,0.167039,-0.488407,-0.118346,0.000895,-0.047834
temp,0.063692,-0.296126,0.025214,1.0,0.896544,-0.224537,-0.013343,-0.037295,-0.014408,-0.545558
dewp,0.040082,-0.321606,-0.231294,0.896544,1.0,-0.311156,0.115399,0.013293,0.001519,-0.489372
slp,-0.015708,-0.092761,0.167039,-0.224537,-0.311156,1.0,-0.089752,-0.10494,-0.134689,0.121508
pcp01,0.005007,-0.000357,-0.488407,-0.013343,0.115399,-0.089752,1.0,0.128064,0.000997,0.00031
pcp06,-0.002821,0.016668,-0.118346,-0.037295,0.013293,-0.10494,0.128064,1.0,0.251166,0.039943
pcp24,-0.022935,-0.010412,0.000895,-0.014408,0.001519,-0.134689,0.000997,0.251166,1.0,0.069664
sd,-0.009676,0.097041,-0.047834,-0.545558,-0.489372,0.121508,0.00031,0.039943,0.069664,1.0


In [169]:
del data['dewp']

<p> We made the holiday column count 1 for yes and 0 for no. </p>

In [170]:
data['hday'] = data['hday'].apply(lambda x: 0 if x=='N' else 1)

<p> We label-encoded our boroughs as integers because they were string values. The one-hot version is left commented out in the cell below. </p>

In [171]:
from sklearn.preprocessing import LabelEncoder
encoders = dict() 
categorical_headers = ['borough']
data['borough'] = data['borough'].str.strip()
# integer encoded variables
encoders['borough'] = LabelEncoder() # save the encoder
data['borough'+'_int'] = encoders['borough'].fit_transform(data['borough'])
# oneHotCols = pd.get_dummies(data['borough'])
# data = data.join(oneHotCols)
data

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,slp,pcp01,pcp06,pcp24,sd,hday,borough_int
0,2015-01-01 01:00:00,Bronx,152,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,0
1,2015-01-01 01:00:00,Brooklyn,1519,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,1
2,2015-01-01 01:00:00,EWR,0,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,2
3,2015-01-01 01:00:00,Manhattan,5258,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,3
4,2015-01-01 01:00:00,Queens,405,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,4
5,2015-01-01 01:00:00,Staten Island,6,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,5
7,2015-01-01 02:00:00,Bronx,120,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,0
8,2015-01-01 02:00:00,Brooklyn,1229,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,1
9,2015-01-01 02:00:00,EWR,0,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,2
10,2015-01-01 02:00:00,Manhattan,4345,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,3
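
Later cells refer to boroughs only by their integer code, so it can help to keep the code-to-name mapping at hand. The sketch below is an illustration on a toy frame, not one of the notebook's own cells. It uses `pd.Categorical`, which, like `LabelEncoder`, assigns codes in sorted order of the labels. The names `toy` and `code_to_name` are made up for the example.

```python
import pandas as pd

# Toy stand-in for the 'borough' column (hypothetical rows, same labels as above).
toy = pd.DataFrame({"borough": ["Bronx", "Brooklyn", "EWR", "Manhattan",
                                "Queens", "Staten Island", "Bronx"]})

# Categories are inferred in sorted order, so codes follow alphabetical order.
cat = pd.Categorical(toy["borough"].str.strip())
toy["borough_int"] = cat.codes

# Lookup table from integer code back to borough name.
code_to_name = dict(enumerate(cat.categories))
print(code_to_name)
```

With this mapping, codes 2 and 5 are EWR and Staten Island, and code 3 is Manhattan.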


In [172]:
del data['borough']

### Label encoding the time of day

<p> We based our hour groups on sunrise and sunset. Night is the time when the sun is down, which on average is from 8pm to 6am. Morning is from 6am until noon. Afternoon is from noon until 5pm. Evening is from 5pm until 8pm, which is around when the sun sets. Note that the code below uses hour 21 (9pm) rather than 8pm as the boundary between evening and night. </p>

In [173]:
dateTest = data['pickup_dt'][0]
print(int(dateTest[11:13]))
data['time_of_day_int'] = data['pickup_dt'].apply(lambda x: 0 if (int(x[11:13]) >= 6 and int(x[11:13]) < 12) else (
                                              1 if(int(x[11:13]) >= 12 and int(x[11:13]) < 17)
                                                  else (
                                                  2 if (int(x[11:13]) >= 17 and int(x[11:13]) < 21)
                                                    else (
                                                    3 if (int(x[11:13]) >= 21 or int(x[11:13]) < 6) else -1))))
# data['time_of_day'] = data['pickup_dt'].apply(lambda x: 1 if (int(x[11:13]) >= 12 and int(x[11:13]) < 17))
# data['time_of_day'] = data['pickup_dt'].apply(lambda x: 1 if (int(x[11:13]) >= 17 and int(x[11:13]) < 21))
# data['time_of_day']  = data['pickup_dt'].apply(lambda x: 1 if (int(x[11:13]) >= 21 or int(x[11:13]) < 6))
data

1


Unnamed: 0,pickup_dt,pickups,spd,vsb,temp,slp,pcp01,pcp06,pcp24,sd,hday,borough_int,time_of_day_int
0,2015-01-01 01:00:00,152,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,0,3
1,2015-01-01 01:00:00,1519,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,1,3
2,2015-01-01 01:00:00,0,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,2,3
3,2015-01-01 01:00:00,5258,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,3,3
4,2015-01-01 01:00:00,405,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,4,3
5,2015-01-01 01:00:00,6,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,5,3
7,2015-01-01 02:00:00,120,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,0,3
8,2015-01-01 02:00:00,1229,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,1,3
9,2015-01-01 02:00:00,0,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,2,3
10,2015-01-01 02:00:00,4345,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,3,3
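
The nested conditional above can also be written as a small function applied to the parsed hour. The following is a minimal sketch on a few hand-made timestamps, assuming the same cutoffs as the code (6, 12, 17, 21). `time_of_day` and `stamps` are hypothetical names.

```python
import pandas as pd

def time_of_day(hour):
    """Map an hour (0-23) to 0=morning, 1=afternoon, 2=evening, 3=night."""
    if 6 <= hour < 12:
        return 0
    if 12 <= hour < 17:
        return 1
    if 17 <= hour < 21:
        return 2
    return 3  # 21:00 through 05:59

# Made-up timestamps in the same string format as pickup_dt.
stamps = pd.Series(["2015-01-01 01:00:00", "2015-01-01 06:00:00",
                    "2015-01-01 12:00:00", "2015-01-01 17:00:00",
                    "2015-01-01 21:00:00"])

# Parse once, then map the hour to its time-of-day code.
hours = pd.to_datetime(stamps).dt.hour
codes = hours.map(time_of_day)
print(codes.tolist())
```

Parsing the timestamp avoids slicing the string by character position, so the mapping does not depend on the exact string layout.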


### Label encoding the weekday

The weekday is extracted from the pickup_dt feature and integer-encoded as 0 (Monday) through 6 (Sunday). The one-hot version is left commented out below. We believe knowing the day of the week will help classify and predict the number of Ubers necessary at a future date.

In [174]:
import datetime 
# Integer-encode the weekday: Monday=0 ... Sunday=6,
# the same convention as datetime.date.weekday().
# The column is stored as float, as it was when initialised with NaN.
data["day_int"] = pd.to_datetime(data["pickup_dt"]).dt.weekday.astype(float)
# data['is_monday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 0 else 0)
# data['is_tuesday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 1 else 0)
# data['is_wednesday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 2 else 0)
# data['is_thursday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 3 else 0)
# data['is_friday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 4 else 0)
# data['is_saturday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 5 else 0)
# data['is_sunday'] = data['pickup_dt'].apply(lambda x: 1 if datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:10])).weekday() == 6 else 0)
data

Unnamed: 0,pickup_dt,pickups,spd,vsb,temp,slp,pcp01,pcp06,pcp24,sd,hday,borough_int,time_of_day_int,day_int
0,2015-01-01 01:00:00,152,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,0,3,3.0
1,2015-01-01 01:00:00,1519,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,1,3,3.0
2,2015-01-01 01:00:00,0,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,2,3,3.0
3,2015-01-01 01:00:00,5258,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,3,3,3.0
4,2015-01-01 01:00:00,405,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,4,3,3.0
5,2015-01-01 01:00:00,6,5.0,10.0,30.0,1023.5,0.0,0.0,0.0,0.0,1,5,3,3.0
7,2015-01-01 02:00:00,120,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,0,3,3.0
8,2015-01-01 02:00:00,1229,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,1,3,3.0
9,2015-01-01 02:00:00,0,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,2,3,3.0
10,2015-01-01 02:00:00,4345,3.0,10.0,30.0,1023.0,0.0,0.0,0.0,0.0,1,3,3,3.0


In [175]:
del data['pickup_dt']

<p> We found that the borough EWR averages about 2.4 pickups every 100 hours, so we removed EWR from our dataset. Staten Island averages 1.6 pickups an hour and had a maximum of 13 pickups in an hour over 6 months, so we removed it from our dataset as well. </p>

In [176]:
d1 = data.where(data['borough_int']==2)[['pickups']]
print(d1.describe())
d1 = d1.dropna()
data = data[data.borough_int != 2]
d1 = data.where(data['borough_int']==5)[['pickups']]
print(d1.describe())
d1 = d1.dropna()
data = data[data.borough_int != 5]

           pickups
count  4343.000000
mean      0.024177
std       0.160937
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max       2.000000
           pickups
count  4343.000000
mean      1.601888
std       1.640451
min       0.000000
25%       0.000000
50%       1.000000
75%       2.000000
max      13.000000


## Making our Categories

<p> We have three categories of pickup traffic: low, medium, and high. We found these by computing the mean and standard deviation of pickups for each borough. Low is more than half a standard deviation below the mean, and high is more than half a standard deviation above the mean. Anything else counts as a medium amount of pickups. This means that the category of low, medium, or high depends on the borough. If we did not base our categories on the borough, then Manhattan would always be in the high category and Queens would always be in the low category. </p>

In [177]:
Man = data[data.borough_int == 3]
Bronx = data[data.borough_int == 0]
Queens = data[data.borough_int == 4]
Brooklyn = data[data.borough_int == 1]

print(Man['pickups'].describe())
print(Bronx['pickups'].describe())
print(Queens['pickups'].describe())
print(Brooklyn['pickups'].describe())

count    4343.000000
mean     2387.253281
std      1434.724668
min         0.000000
25%      1223.500000
50%      2269.000000
75%      3293.500000
max      7883.000000
Name: pickups, dtype: float64
count    4343.000000
mean       50.667050
std        31.029223
min         0.000000
25%        29.000000
50%        46.000000
75%        66.000000
max       262.000000
Name: pickups, dtype: float64
count    4343.000000
mean      309.354824
std       154.368300
min         0.000000
25%       196.000000
50%       308.000000
75%       410.000000
max       831.000000
Name: pickups, dtype: float64
count    4343.000000
mean      534.431269
std       294.810182
min         0.000000
25%       331.500000
50%       493.000000
75%       675.000000
max      2009.000000
Name: pickups, dtype: float64


<p> We use the mean and standard deviation of each borough to place each row of data into a category. The counts printed below are over all four boroughs combined: low (0) and high (2) each have roughly 7,000 rows, while medium (1) has about 3,300. </p>

In [178]:
mstd = Man['pickups'].std()
mmean = Man['pickups'].mean()
bstd = Bronx['pickups'].std()
bmean = Bronx['pickups'].mean()
qstd = Queens['pickups'].std()
qmean = Queens['pickups'].mean()
brstd = Brooklyn['pickups'].std()
brmean = Brooklyn['pickups'].mean()
data['pickupPrediction'] = 0
for index, row in data.iterrows():
    if(row['borough_int'] == 3):
        if(row['pickups']  < (mmean - mstd/2)):
            data.set_value(index,'pickupPrediction',0)
        elif(row['pickups'] > (mmean + mstd/2)):
            data.set_value(index,'pickupPrediction',2)
        else:
            data.set_value(index,'pickupPrediction',1)
    if(row['borough_int'] == 1):
        if(row['pickups']  < (bmean - bstd/2)):
            data.set_value(index,'pickupPrediction',0)
        elif(row['pickups'] > (bmean + bstd/2)):
            data.set_value(index,'pickupPrediction',2)
        else:
            data.set_value(index,'pickupPrediction',1)
    if(row['borough_int'] == 4):
        if(row['pickups']  < (qmean - qstd/2)):
            data.set_value(index,'pickupPrediction',0)
        elif(row['pickups'] > (qmean + qstd/2)):
            data.set_value(index,'pickupPrediction',2)
        else:
            data.set_value(index,'pickupPrediction',1)
    if(row['borough_int'] == 0):
        if(row['pickups']  < (brmean - brstd/2)):
            data.set_value(index,'pickupPrediction',0)
        elif(row['pickups'] > (brmean + brstd/2)):
            data.set_value(index,'pickupPrediction',2)
        else:
            data.set_value(index,'pickupPrediction',1)
print(data['pickupPrediction'].describe())
print(data['pickupPrediction'].value_counts())

count    17372.000000
mean         0.981292
std          0.898477
min          0.000000
25%          0.000000
50%          1.000000
75%          2.000000
max          2.000000
Name: pickupPrediction, dtype: float64
0    7177
2    6852
1    3343
Name: pickupPrediction, dtype: int64
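
The thresholds can also be computed with a grouped transform, which keeps each borough paired with its own mean and standard deviation by construction. This matters here: in the loop above, the branch for `borough_int == 1` (Brooklyn) compares against `bmean`/`bstd`, which were computed from the `Bronx` frame, while the branch for `borough_int == 0` (Bronx) uses `brmean`/`brstd` from `Brooklyn`. The sketch below runs on a small made-up frame; `toy`, `mean` and `std` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up pickups for two boroughs with very different volumes.
toy = pd.DataFrame({
    "borough_int": [0, 0, 0, 0, 0, 3, 3, 3, 3, 3],
    "pickups":     [10, 20, 30, 40, 50, 1000, 2000, 3000, 4000, 5000],
})

# Per-row copies of each borough's own mean and sample standard deviation.
grouped = toy.groupby("borough_int")["pickups"]
mean = grouped.transform("mean")
std = grouped.transform("std")

# 0 = low, 1 = medium, 2 = high, using half a standard deviation around the mean.
toy["pickupPrediction"] = np.select(
    [toy["pickups"] < mean - std / 2, toy["pickups"] > mean + std / 2],
    [0, 2],
    default=1,
)
print(toy)
```

Both toy boroughs receive the same pattern of labels even though their volumes differ by a factor of 100, which is the point of per-borough thresholds.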


In [179]:
data.reset_index(inplace=True, drop=True)
for index, row in data.iterrows():
    if(index < 12):
        continue
    if(data['pickupPrediction'][index-4] == 0):        
        data.set_value(index,'1hrAgo',0)
    elif(data['pickupPrediction'][index-4] == 1):        
        data.set_value(index,'1hrAgo',1)
    else:
        data.set_value(index,'1hrAgo',2)
        
    if(data['pickupPrediction'][index-8] == 0):        
        data.set_value(index,'2hrAgo',0)
    elif(data['pickupPrediction'][index-8] == 1):        
        data.set_value(index,'2hrAgo',1)
    else:
        data.set_value(index,'2hrAgo',2)
        
    if(data['pickupPrediction'][index-12] == 0):        
        data.set_value(index,'3hrAgo',0)
    elif(data['pickupPrediction'][index-12] == 1):        
        data.set_value(index,'3hrAgo',1)
    else:
        data.set_value(index,'3hrAgo',2)
data = data.iloc[12:]
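
The loop above relies on every hour contributing exactly four consecutive rows, so that `index-4` is the same borough one hour earlier. A sketch of the same idea that does not depend on the row layout, assuming only that rows are in time order within each borough, uses a grouped `shift`. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Two boroughs, four consecutive hours each, interleaved by hour (made-up labels).
toy = pd.DataFrame({
    "borough_int":      [0, 3, 0, 3, 0, 3, 0, 3],
    "pickupPrediction": [0, 2, 1, 2, 1, 1, 2, 0],
})

# shift(k) within each borough gives that borough's label from k hours earlier.
for k in (1, 2, 3):
    toy[f"{k}hrAgo"] = toy.groupby("borough_int")["pickupPrediction"].shift(k)

# Rows without a full three-hour history have NaN lags and are dropped.
lagged = toy.dropna().astype(int)
print(lagged)
```

Only the last hour of each toy borough survives, because it is the only one with three earlier hours available.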

In [180]:
x = data
y = data['pickupPrediction']

In [181]:
del x['pickupPrediction']
del x['pickups']

## Normalizing data

In [182]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
print(x)
x_scaled = min_max_scaler.fit_transform(x.iloc[:,0:8].values)
x_not_scaled = x.iloc[:,8:].values
x = np.concatenate((x_scaled, x_not_scaled), axis=1)
x = pd.DataFrame(x, columns=['spd','vsb','temp','slp','pcp01','pcp06','pcp24','sd','hday','borough_int',
                            'timeOfDay_int','day_int','1hrAgo','2hrAgo','3hrAgo'])

             spd   vsb  temp     slp  pcp01  pcp06  pcp24   sd  hday  \
12      5.000000  10.0  29.0  1022.0    0.0    0.0    0.0  0.0     1   
13      5.000000  10.0  29.0  1022.0    0.0    0.0    0.0  0.0     1   
14      5.000000  10.0  29.0  1022.0    0.0    0.0    0.0  0.0     1   
15      5.000000  10.0  29.0  1022.0    0.0    0.0    0.0  0.0     1   
16      5.000000  10.0  28.0  1021.8    0.0    0.0    0.0  0.0     1   
17      5.000000  10.0  28.0  1021.8    0.0    0.0    0.0  0.0     1   
18      5.000000  10.0  28.0  1021.8    0.0    0.0    0.0  0.0     1   
19      5.000000  10.0  28.0  1021.8    0.0    0.0    0.0  0.0     1   
20     10.000000  10.0  28.0  1020.7    0.0    0.0    0.0  0.0     1   
21     10.000000  10.0  28.0  1020.7    0.0    0.0    0.0  0.0     1   
22     10.000000  10.0  28.0  1020.7    0.0    0.0    0.0  0.0     1   
23     10.000000  10.0  28.0  1020.7    0.0    0.0    0.0  0.0     1   
24      9.000000  10.0  28.0  1020.5    0.0    0.0    0.0  0.0  

In [183]:
x

Unnamed: 0,spd,vsb,temp,slp,pcp01,pcp06,pcp24,sd,hday,borough_int,timeOfDay_int,day_int,1hrAgo,2hrAgo,3hrAgo
0,0.238095,1.0,0.310345,0.588462,0.0,0.0,0.0,0.0,1.0,0.0,3.0,3.0,0.0,0.0,0.0
1,0.238095,1.0,0.310345,0.588462,0.0,0.0,0.0,0.0,1.0,1.0,3.0,3.0,2.0,2.0,2.0
2,0.238095,1.0,0.310345,0.588462,0.0,0.0,0.0,0.0,1.0,3.0,3.0,3.0,2.0,2.0,2.0
3,0.238095,1.0,0.310345,0.588462,0.0,0.0,0.0,0.0,1.0,4.0,3.0,3.0,2.0,1.0,2.0
4,0.238095,1.0,0.298851,0.584615,0.0,0.0,0.0,0.0,1.0,0.0,3.0,3.0,0.0,0.0,0.0
5,0.238095,1.0,0.298851,0.584615,0.0,0.0,0.0,0.0,1.0,1.0,3.0,3.0,2.0,2.0,2.0
6,0.238095,1.0,0.298851,0.584615,0.0,0.0,0.0,0.0,1.0,3.0,3.0,3.0,1.0,2.0,2.0
7,0.238095,1.0,0.298851,0.584615,0.0,0.0,0.0,0.0,1.0,4.0,3.0,3.0,1.0,2.0,1.0
8,0.476190,1.0,0.298851,0.563462,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0
9,0.476190,1.0,0.298851,0.563462,0.0,0.0,0.0,0.0,1.0,1.0,0.0,3.0,2.0,2.0,2.0


## Cross Product On Features

We crossed the weekday feature with the time-of-day feature. Because both describe time, crossing them lets our neural network learn effects that are specific to a combination of the two, such as a Friday evening. Our other categorical features do not share any important relationships that would make them worthwhile to cross.
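
The crossing cell itself is not shown in this section. One way such a cross could be built, as a sketch under the assumption that the two integer columns are `day_int` (0-6) and `timeOfDay_int` (0-3), is to combine them into a single categorical id with 7 × 4 = 28 possible values. The column name `day_x_time` is hypothetical.

```python
import pandas as pd

# Made-up rows with the two integer-encoded time features.
toy = pd.DataFrame({
    "day_int":       [3, 3, 4, 6],
    "timeOfDay_int": [3, 0, 2, 1],
})

# Each (weekday, time-of-day) pair gets its own id: day * 4 + time_of_day.
toy["day_x_time"] = toy["day_int"] * 4 + toy["timeOfDay_int"]

# One-hot encode the crossed id so each pair becomes a separate input column.
crossed = pd.get_dummies(toy["day_x_time"], prefix="day_x_time")
print(toy["day_x_time"].tolist())
print(crossed.columns.tolist())
```

The one-hot columns of the crossed id are what a wide branch would typically consume.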

## Metric Used
<p> We believe the F1-score is the best evaluation metric in our case. We care about misclassification because predicting the wrong volume of Uber pickups for an area would lead to there being either too many or too few drivers. This would result in either wasting drivers' time or missing out on rides from Uber users. The F1-score takes into account both false positives and false negatives, which correspond to the two cases mentioned. We will compute the F1-score of each class and then average the per-class scores. Earlier, we used this dataset for logistic regression and found that it frequently guessed only one class. Averaging the F1-scores of each class punishes a classifier for guessing only one class, capping its average F1-score at about 33%. </p>
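
The averaging described here corresponds to what scikit-learn calls the macro average. The sketch below, on made-up labels, shows how a classifier that always predicts one class is penalised: with three equally frequent classes it scores 0.5 on the class it guesses and 0 on the other two.

```python
from sklearn.metrics import f1_score

# Made-up, perfectly balanced labels for the three classes.
y_true = [0, 1, 2, 0, 1, 2]
# A degenerate classifier that always answers "low" (class 0).
y_pred = [0, 0, 0, 0, 0, 0]

# zero_division=0 scores the never-predicted classes as 0 without a warning.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(per_class, macro)
```

The macro score here is 0.5 / 3, about 0.17, which is well under the one-third ceiling.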

## Splitting Data
We chose to use continuous, time-ordered training and test sets because our Uber pickups are given hourly over a 6-month span. In the real world our algorithm would receive new data every hour and would have to predict from this new hourly data. We split our data into 6 folds by the time at which it occurred, which represents retraining our algorithm every month on new data. We believe this avoids having to retrain the model too frequently while still keeping up with each new season's data. This is crucial because we do not have a full year of data. Once we had a full year of data, we would use a 12-fold split to represent each month of the year. We also believe it would be best for Uber to keep only the last year's data for the model, because the average number of Uber trips could change from year to year and keeping old data would drag down the means and standard deviations we computed above. Overall, we think a continuous time series split represents how the model would be trained and used in the real world.

<p> The data was already in order by hour. Each group of four rows holds one hour of Uber and weather data, one row per borough. We split our data into 6 folds for training. </p>

In [184]:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=6)
for train_index, test_index in tscv.split(x.values,y.values):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

TRAIN: [   0    1    2 ... 2477 2478 2479] TEST: [2480 2481 2482 ... 4957 4958 4959]
TRAIN: [   0    1    2 ... 4957 4958 4959] TEST: [4960 4961 4962 ... 7437 7438 7439]
TRAIN: [   0    1    2 ... 7437 7438 7439] TEST: [7440 7441 7442 ... 9917 9918 9919]
TRAIN: [   0    1    2 ... 9917 9918 9919] TEST: [ 9920  9921  9922 ... 12397 12398 12399]
TRAIN: [    0     1     2 ... 12397 12398 12399] TEST: [12400 12401 12402 ... 14877 14878 14879]
TRAIN: [    0     1     2 ... 14877 14878 14879] TEST: [14880 14881 14882 ... 17357 17358 17359]


## Modeling With Keras

In [185]:
import keras

keras.__version__

'2.2.4'

In [186]:
from keras.layers import Dense, Activation, Input
from keras.layers import Embedding, Flatten, Concatenate
from keras.models import Model
from sklearn import metrics as mt
# This returns a tensor
for train_index, test_index in tscv.split(x.values,y.values):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    inputs = Input(shape=(X_train.shape[1],))

TRAIN: [   0    1    2 ... 2477 2478 2479] TEST: [2480 2481 2482 ... 4957 4958 4959]
TRAIN: [   0    1    2 ... 4957 4958 4959] TEST: [4960 4961 4962 ... 7437 7438 7439]
TRAIN: [   0    1    2 ... 7437 7438 7439] TEST: [7440 7441 7442 ... 9917 9918 9919]
TRAIN: [   0    1    2 ... 9917 9918 9919] TEST: [ 9920  9921  9922 ... 12397 12398 12399]
TRAIN: [    0     1     2 ... 12397 12398 12399] TEST: [12400 12401 12402 ... 14877 14878 14879]
TRAIN: [    0     1     2 ... 14877 14878 14879] TEST: [14880 14881 14882 ... 17357 17358 17359]


## Deep

## Wide setup

In [189]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
X_train_ohe = ohe.fit_transform(x.iloc[:,9:].values)
inputs = Input(shape=(X_train_ohe.shape[1],), sparse=True)

# a layer instance is callable on a tensor, and returns a tensor
x2 = Dense(units=10, activation='relu')(inputs)
predictions = Dense(1,activation='sigmoid')(x2)

# This creates a model that includes
# the Input layer and two Dense layers
model = Model(inputs=inputs, outputs=predictions)

for train_index, test_index in tscv.split(x.values,y.values):
    model = Model(inputs=inputs, outputs=predictions)
    model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=['accuracy'])

    model.summary()
    X_train, X_test =  x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    ohe = OneHotEncoder()
    X_train_ohe = ohe.fit_transform(X_train.iloc[:,9:])
    X_test_ohe = ohe.transform(X_test.iloc[:,9:])
    %%time
    model.fit(X_train_ohe,y_train, epochs=10, batch_size=50, verbose=0)

    # test on the data
    yhat = np.round(model.predict(X_test_ohe))
    print('\n \n \n')
    print(mt.confusion_matrix(y_test,yhat),'accuracy:',mt.accuracy_score(y_test,yhat))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_39 (InputLayer)        (None, 24)                0         
_________________________________________________________________
dense_41 (Dense)             (None, 10)                250       
_________________________________________________________________
dense_42 (Dense)             (None, 1)                 11        
Total params: 261
Trainable params: 261
Non-trainable params: 0
_________________________________________________________________
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.87 µs

 
 

[[978 101   0]
 [129 365   0]
 [  6 901   0]] accuracy: 0.5415322580645161
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_39 (InputLayer)        (None, 24)                0         
_________________________________________________________________

In [188]:
# combine the features with two branches
from keras.layers import concatenate