# Tableau Assignment - Citi Bike Analytics
----

### Summary
This notebook cleans, explores, merges and analyzes the `2018` trip data sets
from the `New York Citi Bike Program`.


__Note__: The CSV files are not stored in this repository because they are very large,
but they can be downloaded from https://s3.amazonaws.com/tripdata/index.html

----

In [1]:
# Dependencies
import pandas as pd
import datetime
import math
import numpy as np

In [48]:
# Haversine distance between two (latitude, longitude) pairs - haversine.py
# Available at https://gist.github.com/rochacbruno/2883505

# origin and destination are (latitude, longitude) tuples in degrees; the result is in miles
def distance(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 3956 # for miles

    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d
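
Calling `distance` once per row through `DataFrame.apply` can be slow on several hundred thousand trips. The following is a minimal sketch of the same haversine formula written with NumPy so that it works on whole columns at once. The name `distance_vectorized` and the constant `EARTH_RADIUS_MILES` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956  # same radius as in distance() above


def distance_vectorized(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_MILES):
    """Haversine distance in miles for scalars or equal-length arrays of degrees."""
    # Convert every input to radians as float arrays
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return radius * c
```

With a trip data frame, the four coordinate columns could be passed directly, for example `distance_vectorized(df['start station latitude'], df['start station longitude'], df['end station latitude'], df['end station longitude'])`.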

### __Explore data sets from 2018__

January is used to define the variables for Tableau.

In [2]:
# Read file and print the number of records and data types
csv_name = "./2018/201801.csv"
jan_df = pd.read_csv(csv_name, low_memory=False) 
print("Records Jan2018 : " + str(jan_df.count()))

Records Jan2018 : tripduration               718994
starttime                  718994
stoptime                   718994
start station id           718994
start station name         718994
start station latitude     718994
start station longitude    718994
end station id             718994
end station name           718994
end station latitude       718994
end station longitude      718994
bikeid                     718994
usertype                   718994
birth year                 718994
gender                     718994
dtype: int64


In [3]:
# Add columns to make transformations
jan_df.insert(0, 'year', 2018)
jan_df.insert(1, 'month', 'Jan')
jan_df.insert(3, 'tripdurmin', 0)
jan_df.insert(5, 'starthour', 0)
jan_df.insert(6, 'weekday', 0)
jan_df.insert(16, 'distance', 0)
jan_df.insert(21, 'age', 0)
jan_df.insert(22, 'sgender', '')
jan_df.insert(23, 'season', 'Winter')
jan_df.insert(24, 'mileage', 0)

In [4]:
# Transform the values for gender
jan_df.loc[jan_df['gender'] == 0, 'sgender'] = 'Unknown'
jan_df.loc[jan_df['gender'] == 1, 'sgender'] = 'Male'
jan_df.loc[jan_df['gender'] == 2, 'sgender'] = 'Female'
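
The three `.loc` assignments above could also be written as a single dictionary lookup. This is only a sketch: `GENDER_LABELS` and `label_gender` are hypothetical names, and unlike the cell above, a code outside 0, 1 and 2 would become `NaN` rather than an empty string.

```python
import pandas as pd

# Codes documented for the Citi Bike data: 0 = unknown, 1 = male, 2 = female
GENDER_LABELS = {0: 'Unknown', 1: 'Male', 2: 'Female'}


def label_gender(codes):
    """Map a Series of numeric gender codes to their text labels."""
    return codes.map(GENDER_LABELS)
```

For example, `label_gender(pd.Series([0, 1, 2]))` gives `Unknown`, `Male`, `Female`.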

In [5]:
# Calculate the age of the person considering that the year was 2018
jan_df['age'] = 2018 - jan_df['birth year']

In [6]:
# Transform the duration of the trip from seconds to minutes
jan_df['tripdurmin'] = jan_df['tripduration'] / 60

In [7]:
# Extract the hour from the start time
jan_df['starthour'] = pd.to_datetime(jan_df["starttime"]).dt.strftime('%H')

In [8]:
# Calculate the weekday from the start time
jan_df['weekday'] = pd.to_datetime(jan_df["starttime"]).dt.strftime('%A')
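
The two cells above parse `starttime` twice. A possible variant, sketched here under the assumption that the column holds timestamps in one consistent format, parses it once and derives both values from the result. `add_time_parts` is a hypothetical helper; `dt.day_name()` returns English weekday names by default, whereas `strftime('%A')` follows the current locale.

```python
import pandas as pd


def add_time_parts(df, time_col='starttime'):
    """Return a copy of df with 'starthour' and 'weekday' derived from time_col."""
    start = pd.to_datetime(df[time_col])  # parse the timestamps only once
    out = df.copy()
    out['starthour'] = start.dt.strftime('%H')  # zero-padded hour as text, e.g. '08'
    out['weekday'] = start.dt.day_name()        # e.g. 'Monday'
    return out
```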

In [176]:
# Calculate the distance in miles from start station to end station
jan_df['distance'] = jan_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [9]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
jan_df.loc[jan_df['tripdurmin'] <= 120, 'mileage'] = (jan_df['tripdurmin'] / 60) * 7.456

# Trips over two hours were meant to max out at 14.9 miles; as written, this line multiplies the duration in hours by 14.9
jan_df.loc[jan_df['tripdurmin'] > 120, 'mileage'] = (jan_df['tripdurmin'] / 60) * 14.9
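
The comments in this cell describe a cap of about 14.9 miles for trips longer than two hours (2 h × 7.456 mph = 14.912 miles). One way such a cap could be expressed is to clip the duration before multiplying by the speed. This is a sketch of that idea, not the code that produced the figures in this notebook, and its results for trips over two hours differ from them. `estimate_mileage`, `ASSUMED_SPEED_MPH` and `MAX_HOURS` are hypothetical names.

```python
import pandas as pd

ASSUMED_SPEED_MPH = 7.456  # speed assumed in the cell above
MAX_HOURS = 2              # durations beyond this no longer add mileage


def estimate_mileage(tripdurmin):
    """Estimate miles ridden from a Series of trip durations in minutes."""
    hours = (tripdurmin / 60).clip(upper=MAX_HOURS)  # cap at two hours
    return hours * ASSUMED_SPEED_MPH
```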

In [10]:
# View results
jan_df.head()

Unnamed: 0,year,month,tripduration,tripdurmin,starttime,starthour,weekday,stoptime,start station id,start station name,...,end station longitude,distance,bikeid,usertype,birth year,gender,age,sgender,season,mileage
0,2018,Jan,970,16.166667,2018-01-01 13:50:57.4340,13,Monday,2018-01-01 14:07:08.1860,72,W 52 St & 11 Ave,...,-73.988484,0,31956,Subscriber,1992,1,26,Male,Winter,2.008978
1,2018,Jan,723,12.05,2018-01-01 15:33:30.1820,15,Monday,2018-01-01 15:45:33.3410,72,W 52 St & 11 Ave,...,-73.994685,0,32536,Subscriber,1969,1,49,Male,Winter,1.497413
2,2018,Jan,496,8.266667,2018-01-01 15:39:18.3370,15,Monday,2018-01-01 15:47:35.1720,72,W 52 St & 11 Ave,...,-74.002116,0,16069,Subscriber,1956,1,62,Male,Winter,1.027271
3,2018,Jan,306,5.1,2018-01-01 15:40:13.3720,15,Monday,2018-01-01 15:45:20.1910,72,W 52 St & 11 Ave,...,-73.985162,0,31781,Subscriber,1974,1,44,Male,Winter,0.63376
4,2018,Jan,306,5.1,2018-01-01 18:14:51.5680,18,Monday,2018-01-01 18:19:57.6420,72,W 52 St & 11 Ave,...,-73.984706,0,30319,Subscriber,1992,1,26,Male,Winter,0.63376


### Explore the data to find obvious outliers or false data

Only for January

In [179]:
# View data frame information - How many trips have been recorded - Percentage of ridership growth
jan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 718994 entries, 0 to 718993
Data columns (total 25 columns):
year                       718994 non-null int64
month                      718994 non-null object
tripduration               718994 non-null int64
tripdurmin                 718994 non-null float64
starttime                  718994 non-null object
starthour                  718994 non-null object
weekday                    718994 non-null object
stoptime                   718994 non-null object
start station id           718994 non-null int64
start station name         718994 non-null object
start station latitude     718994 non-null float64
start station longitude    718994 non-null float64
end station id             718994 non-null int64
end station name           718994 non-null object
end station latitude       718994 non-null float64
end station longitude      718994 non-null float64
distance                   718994 non-null float64
bikeid                     718994 non

In [65]:
# Number of trips by short-term customers and annual subscribers
jan_df.usertype.value_counts()

Subscriber    696886
Customer       22108
Name: usertype, dtype: int64
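
`value_counts()` returns trip counts. From the counts above, subscribers account for roughly 96.9% of January trips (696886 of 718994). If the shares themselves are wanted, `normalize=True` returns fractions instead; the helper name `usertype_share` below is hypothetical.

```python
import pandas as pd


def usertype_share(usertype):
    """Return the fraction of trips per user type (fractions sum to 1)."""
    return usertype.value_counts(normalize=True)
```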

In [33]:
# Peak hours in which bikes are used during the month
jan_df.starthour.value_counts()

17    71400
18    67558
08    64041
16    51919
09    49092
15    44969
19    44519
14    42309
13    39309
12    36183
07    36156
10    31659
11    31528
20    29593
21    20314
06    17018
22    15010
23     9163
05     5505
00     5163
01     2753
02     1585
04     1259
03      989
Name: starthour, dtype: int64

In [37]:
# Number of unique start stations
jan_df['start station id'].nunique()

763

In [34]:
# Number of trips from start stations
jan_df['start station id'].value_counts()

519     8080
435     5093
3255    4852
402     4526
497     4503
490     4281
523     4278
477     4256
285     4223
459     4057
359     3773
379     3652
293     3616
284     3561
492     3487
432     3454
3263    3351
229     3306
505     3305
168     3287
382     3181
504     3147
465     3128
446     3117
499     3112
368     3091
358     3068
3641    3062
457     3031
537     2941
        ... 
2005      65
3629      63
3340      63
3395      57
3525      52
3394      47
3564      45
3554      44
3661      41
3492      39
3468      39
3590      37
3650      35
3333      30
3649      29
3559      29
3631      27
3330      26
3620      25
3596      24
3557      23
3239      11
3512      11
3432      10
3594       5
3485       2
428        1
3662       1
3488       1
3250       1
Name: start station id, Length: 763, dtype: int64

In [38]:
# Number of unique end stations
jan_df['end station id'].nunique()

768

In [35]:
# Number of arrivals to end stations
jan_df['end station id'].value_counts()

519     8026
435     5132
3255    4956
497     4924
402     4784
459     4441
490     4368
523     4321
285     4296
477     4267
492     3733
379     3721
293     3701
284     3661
3263    3484
359     3473
229     3417
168     3390
382     3342
505     3303
368     3244
504     3200
358     3160
465     3132
446     3127
499     3088
3443    3079
3641    3060
457     2957
482     2940
        ... 
3468      56
3554      52
3492      50
3661      49
3394      47
3333      39
3330      39
3564      35
3620      35
3650      35
3649      35
3590      34
3512      33
3631      28
3596      28
3557      27
3559      24
3432      22
3245      12
3239      11
3594       9
428        4
3485       2
3184       2
3652       1
3662       1
3488       1
3240       1
3428       1
3250       1
Name: end station id, Length: 768, dtype: int64

In [31]:
# Gender breakdown of active participants - Gender (Zero=unknown; 1=male; 2=female) 
jan_df.sgender.value_counts()

Male       537589
Female     151806
Unknown     29599
Name: sgender, dtype: int64

In [78]:
# Average trip duration by age - Get the max and the min
# Min and Max Age = 2018 - birth year
print(f'Max age: {jan_df.age.max()}')
print(f'Min age: {jan_df.age.min()}')
print(f'Uniques: {jan_df.age.nunique()}')

Max age: 133
Min age: 16
Uniques: 88
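
A maximum age of 133 suggests that some birth years are placeholders or entry errors. A simple way to set such rows aside before averaging by age is sketched below. The threshold of 90 is an assumption chosen only for illustration, and `split_by_plausible_age` is a hypothetical helper.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 90  # assumption: riders older than this are treated as suspect


def split_by_plausible_age(df, max_age=MAX_PLAUSIBLE_AGE):
    """Split df into (plausible, suspect) rows based on the 'age' column."""
    plausible = df['age'] <= max_age
    return df[plausible], df[~plausible]
```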


In [77]:
# Average trip duration by age - Distinct values for age
# Age = 2018 - birth year 
jan_df.age.value_counts()

49     39851
29     26774
28     26741
33     26133
30     25817
31     25462
32     24962
27     24616
34     23916
26     22930
35     22169
36     22133
37     20133
25     18780
39     18637
38     18467
40     16652
48     16389
41     15605
42     15265
47     14736
43     14432
45     13990
46     13960
44     13851
50     13139
51     12750
24     12270
54     12115
52     11800
       ...  
75       441
77       393
74       345
118      282
78       211
81       181
79        98
80        78
82        54
131       53
84        50
95        48
133       33
119       30
106       29
85        24
117       17
108       16
97        14
89        12
124       11
86         9
101        9
132        5
16         4
83         4
100        3
99         2
87         2
102        1
Name: age, Length: 88, dtype: int64

In [11]:
# Number of trips by age
jan_df.groupby('age')['tripdurmin'].count()

age
16         4
17       613
18      1180
19      2164
20      3101
21      4114
22      4931
23      8328
24     12270
25     18780
26     22930
27     24616
28     26741
29     26774
30     25817
31     25462
32     24962
33     26133
34     23916
35     22169
36     22133
37     20133
38     18467
39     18637
40     16652
41     15605
42     15265
43     14432
44     13851
45     13990
       ...  
74       345
75       441
76       505
77       393
78       211
79        98
80        78
81       181
82        54
83         4
84        50
85        24
86         9
87         2
89        12
95        48
97        14
99         2
100        3
101        9
102        1
106       29
108       16
117       17
118      282
119       30
124       11
131       53
132        5
133       33
Name: tripdurmin, Length: 88, dtype: int64

In [127]:
# Average trip duration by age
jan_df.groupby('age')['tripdurmin'].mean()

age
16      7.166667
17     12.047716
18     10.949237
19     10.200493
20     11.011201
21     10.127390
22     10.306574
23     11.789371
24     11.285030
25     11.468579
26     11.113303
27     11.242637
28     12.849242
29     11.826700
30     11.431055
31     11.393095
32     10.891832
33     11.840094
34     10.380009
35     15.467853
36     10.873013
37     11.237665
38     11.356871
39     11.847639
40     11.169677
41     16.112838
42     11.444746
43     14.663241
44     11.253198
45     12.973532
         ...    
74     12.458937
75     15.063719
76     15.469769
77     13.218914
78     10.034597
79     13.435714
80     10.173077
81      8.743002
82      8.356173
83      7.279167
84     13.657667
85      7.191667
86     11.881481
87      4.600000
89      7.575000
95     78.111111
97      3.728571
99     36.133333
100     9.200000
101     7.877778
102    12.933333
106    11.726437
108     3.405208
117    21.448039
118     6.358570
119    23.570000
124    12.548485
131    13.

In [79]:
# Average distance in miles that a bike is ridden - Uniques
print(f'Uniques: {jan_df.bikeid.nunique()}')

Uniques: 10449


In [69]:
# Average distance in miles that a bike is ridden
jan_df.groupby(['bikeid']).agg({'distance': 'mean'}).sort_values(['distance'],ascending=False).head(10)

bikeid,distance
25062,3.228474
18815,3.041296
32850,3.036479
29858,2.256642
27005,2.146544
19159,2.135479
19561,2.1279
16099,2.064585
27718,2.064042
33229,2.05547


In [106]:
# Which bikes (by ID) are most likely due for repair or inspection in the timespan?
jan_df.groupby(['bikeid']).agg({'distance': 'sum'}).sort_values(['distance'],ascending=False).head(10)

bikeid,distance
31453,224.228734
31962,215.509393
33133,215.50633
30167,214.199713
32090,213.669306
29984,213.648404
33057,212.72713
30245,209.930314
30203,208.377209
31623,208.341975


In [74]:
# Variability by bike ID
jan_df.groupby(['bikeid']).agg({'distance': 'std'}).sort_values(['distance'],ascending=False).head(10)

bikeid,distance
18815,2.571948
27718,2.429265
21715,2.206381
19003,1.989564
25633,1.968165
30075,1.902171
21607,1.876201
15413,1.851931
21519,1.827602
26705,1.804082


### __Make final data frames__

In [180]:
# Usertype + Gender
user_df = jan_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in jan_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Jan')
user_df

Unnamed: 0,year,month,usertype,gender,trips
0,2018,Jan,Customer,Female,2061
1,2018,Jan,Customer,Male,2906
2,2018,Jan,Customer,Unknown,17141
3,2018,Jan,Subscriber,Female,149745
4,2018,Jan,Subscriber,Male,534683
5,2018,Jan,Subscriber,Unknown,12458
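
The trip counts above are obtained by counting one column selected by position (`jan_df.columns[19:20]`), which depends on the column order staying fixed. A sketch of an alternative that does not depend on column positions uses `groupby(...).size()`. The helper name `trips_by` is hypothetical. Note that `size()` counts all rows in a group, while `count` skips missing values; the two agree here because the January file has no missing values.

```python
import pandas as pd


def trips_by(df, keys):
    """Count rows per combination of keys and return them in a 'trips' column."""
    return df.groupby(keys).size().reset_index(name='trips')
```

For instance, `trips_by(jan_df, ['usertype', 'sgender'])` would give one row per user type and gender with the number of trips.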


In [181]:
# Trips by hour, season and weekday
season_df = jan_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in jan_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Jan')
season_df

Unnamed: 0,year,month,starthour,season,weekday,trips
0,2018,Jan,00,Winter,Friday,614
1,2018,Jan,00,Winter,Monday,634
2,2018,Jan,00,Winter,Saturday,926
3,2018,Jan,00,Winter,Sunday,1153
4,2018,Jan,00,Winter,Thursday,581
5,2018,Jan,00,Winter,Tuesday,556
6,2018,Jan,00,Winter,Wednesday,699
7,2018,Jan,01,Winter,Friday,315
8,2018,Jan,01,Winter,Monday,408
9,2018,Jan,01,Winter,Saturday,603


In [5]:
# Start Stations
stat_df = jan_df.groupby(['start station id']).agg({i:'count' for i in jan_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Jan')
stat_df

Unnamed: 0,year,month,stationid,startrips
0,2018,Jan,72,1324
1,2018,Jan,79,1106
2,2018,Jan,82,436
3,2018,Jan,83,685
4,2018,Jan,119,239
5,2018,Jan,120,393
6,2018,Jan,127,2376
7,2018,Jan,128,2208
8,2018,Jan,143,1041
9,2018,Jan,144,205


In [6]:
# End Stations
statend_df = jan_df.groupby(['end station id']).agg({i:'count' for i in jan_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)
statend_df

Unnamed: 0,stationid,endtrips
0,72,1322
1,79,1155
2,82,464
3,83,680
4,119,269
5,120,339
6,127,2400
7,128,2281
8,143,1034
9,144,236


In [7]:
# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))
stat_df

Unnamed: 0,year,month,stationid,startrips,endtrips
0,2018.0,Jan,72,1324.0,1322
1,2018.0,Jan,79,1106.0,1155
2,2018.0,Jan,82,436.0,464
3,2018.0,Jan,83,685.0,680
4,2018.0,Jan,119,239.0,269
5,2018.0,Jan,120,393.0,339
6,2018.0,Jan,127,2376.0,2400
7,2018.0,Jan,128,2208.0,2281
8,2018.0,Jan,143,1041.0,1034
9,2018.0,Jan,144,205.0,236


In [12]:
# Age - Trip duration
#agedur_df = jan_df.groupby(['age','usertype'])['tripdurmin'].mean().to_frame(name='avgtripdur')
agedur_df = jan_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Jan')
agedur_df

Unnamed: 0,year,month,age,usertype,count,mean
0,2018,Jan,16,Subscriber,4,7.166667
1,2018,Jan,17,Customer,22,24.976515
2,2018,Jan,17,Subscriber,591,11.566441
3,2018,Jan,18,Customer,59,25.710452
4,2018,Jan,18,Subscriber,1121,10.172331
5,2018,Jan,19,Customer,111,27.331832
6,2018,Jan,19,Subscriber,2053,9.274249
7,2018,Jan,20,Customer,95,19.227193
8,2018,Jan,20,Subscriber,3006,10.751547
9,2018,Jan,21,Customer,96,22.894271


In [186]:
# Bikes considering distance from start station to end station
bike_df = jan_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Jan')
bike_df

Unnamed: 0,year,month,bikeid,trips,totmiles,avgmiles,stdmiles
0,2018,Jan,14529,69,60.887367,0.882426,0.831832
1,2018,Jan,14530,56,54.760654,0.977869,0.749576
2,2018,Jan,14532,7,4.963338,0.709048,0.235022
3,2018,Jan,14533,17,11.717877,0.689287,0.583706
4,2018,Jan,14534,83,80.509785,0.969997,0.727214
5,2018,Jan,14535,2,1.532518,0.766259,0.330207
6,2018,Jan,14536,55,42.699128,0.776348,0.675804
7,2018,Jan,14537,80,89.386814,1.117335,0.936547
8,2018,Jan,14539,82,75.162186,0.916612,0.657427
9,2018,Jan,14540,42,43.142563,1.027204,0.691892


In [187]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = jan_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Jan')
bikemil_df

Unnamed: 0,year,month,bikeid,trips,totmileage,avgmileage,stdmileage
0,2018,Jan,14529,69,140.835160,2.041089,5.892852
1,2018,Jan,14530,56,77.356000,1.381357,1.033084
2,2018,Jan,14532,7,6.683476,0.954782,0.471089
3,2018,Jan,14533,17,14.574409,0.857318,0.739630
4,2018,Jan,14534,83,121.787547,1.467320,1.133812
5,2018,Jan,14535,2,2.980329,1.490164,0.678062
6,2018,Jan,14536,55,65.476107,1.190475,1.240433
7,2018,Jan,14537,80,513.586006,6.419825,44.684734
8,2018,Jan,14539,82,113.935964,1.389463,1.241047
9,2018,Jan,14540,42,54.961076,1.308597,0.950058


### __Export final data frames to csv files for Tableau__

In [188]:
user_df.to_csv("./2018/user2018.csv", encoding="utf-8", index=False, header=True) 

season_df.to_csv("./2018/season2018.csv", encoding="utf-8", index=False, header=True)

stat_df.to_csv("./2018/station2018.csv", encoding="utf-8", index=False, header=True)

agedur_df.to_csv("./2018/agedur2018.csv", encoding="utf-8", index=False, header=True)

bike_df.to_csv("./2018/bike2018.csv", encoding="utf-8", index=False, header=True)

bikemil_df.to_csv("./2018/mileage2018.csv", encoding="utf-8", index=False, header=True)

### Make the same transformations from February to December
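
The cells for each month below repeat the January steps with a different file name, month label and season. One way to avoid the repetition is to generate those three values in a loop. This is a sketch only: `MONTH_INFO` and `month_inputs` are hypothetical names, the seasons from May to December are an assumption (only January to April appear in this notebook), and the file naming pattern is assumed to continue as `./2018/2018MM.csv`.

```python
# Month number -> (label, season). January to April follow the notebook;
# the remaining seasons are an assumption.
MONTH_INFO = {
    1: ('Jan', 'Winter'), 2: ('Feb', 'Winter'), 3: ('Mar', 'Spring'),
    4: ('Apr', 'Spring'), 5: ('May', 'Spring'), 6: ('Jun', 'Summer'),
    7: ('Jul', 'Summer'), 8: ('Aug', 'Summer'), 9: ('Sep', 'Fall'),
    10: ('Oct', 'Fall'), 11: ('Nov', 'Fall'), 12: ('Dec', 'Winter'),
}


def month_inputs(year=2018, folder='./2018'):
    """Yield (csv_path, month_label, season) for each month of the year."""
    for number, (label, season) in MONTH_INFO.items():
        yield f'{folder}/{year}{number:02d}.csv', label, season
```

Each tuple could then drive one pass of the read, transform, aggregate and export steps shown for the individual months.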

`February`

In [14]:
# Read file and print the number of records and data types
csv_name = "./2018/201802.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Feb')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Winter')
month_df.insert(24, 'mileage', 0)
print("Records Feb2018 : " + str(month_df.count()))

Records Feb2018 : year                       843114
month                      843114
tripduration               843114
tripdurmin                 843114
starttime                  843114
starthour                  843114
weekday                    843114
stoptime                   843114
start station id           843114
start station name         843114
start station latitude     843114
start station longitude    843114
end station id             843114
end station name           843114
end station latitude       843114
end station longitude      843114
distance                   843114
bikeid                     843114
usertype                   843114
birth year                 843114
gender                     843114
age                        843114
sgender                    843114
season                     843114
mileage                    843114
dtype: int64


In [15]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [191]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [16]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours were meant to max out at 14.9 miles; as written, this line multiplies the duration in hours by 14.9
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 14.9

In [193]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Feb')

In [194]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Feb')

In [10]:
# Start Stations
stat_df = month_df.groupby(['start station id']).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Feb')

In [11]:
# End Stations
statend_df = month_df.groupby(['end station id']).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [17]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Feb')

In [198]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Feb')

In [199]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Feb')

Unnamed: 0,year,month,bikeid,trips,totmileage,avgmileage,stdmileage
0,2018,Feb,14529,72,97.845502,1.358965,0.949129
1,2018,Feb,14530,53,73.570009,1.388113,0.871532
2,2018,Feb,14532,29,41.063920,1.415997,1.043441
3,2018,Feb,14533,54,62.673893,1.160628,0.875837
4,2018,Feb,14534,21,375.185523,17.865977,75.574381
5,2018,Feb,14535,49,66.004240,1.347025,1.074608
6,2018,Feb,14536,75,159.879216,2.131723,7.063206
7,2018,Feb,14537,33,48.764311,1.477706,1.042015
8,2018,Feb,14539,71,104.702951,1.474689,1.181816
9,2018,Feb,14540,52,383.635254,7.377601,43.112113


In [200]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)
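
The append step above relies on the January export having already written each file with its header row. A sketch of a helper that handles both cases is shown below: it writes the header only when the file does not exist yet and appends otherwise. `append_csv` is a hypothetical name; `mode='a'` is a standard `DataFrame.to_csv` parameter.

```python
import os

import pandas as pd


def append_csv(df, path):
    """Append df to a CSV file, writing the header only if the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode='a', encoding='utf-8', index=False, header=write_header)
```

With this helper, the same call, for example `append_csv(user_df, './2018/user2018.csv')`, could be used for January and for every later month.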

`March`

In [19]:
# Read file and print the number of records and data types
csv_name = "./2018/201803.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Mar')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Spring')
month_df.insert(24, 'mileage', 0)
print("Records Mar2018 : " + str(month_df.count()))

Records Mar2018 : year                       976672
month                      976672
tripduration               976672
tripdurmin                 976672
starttime                  976672
starthour                  976672
weekday                    976672
stoptime                   976672
start station id           976672
start station name         976672
start station latitude     976672
start station longitude    976672
end station id             976672
end station name           976672
end station latitude       976672
end station longitude      976672
distance                   976672
bikeid                     976672
usertype                   976672
birth year                 976672
gender                     976672
age                        976672
sgender                    976672
season                     976672
mileage                    976672
dtype: int64


In [20]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [203]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [21]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours were meant to max out at 14.9 miles; as written, this line multiplies the duration in hours by 14.9
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 14.9

In [205]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Mar')

In [206]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Mar')

In [14]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Mar')

In [15]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [22]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Mar')

In [210]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Mar')

In [211]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Mar')

In [212]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)
    
with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

`April`

In [24]:
# Read file and print the number of records and data types
csv_name = "./2018/201804.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Apr')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Spring')
month_df.insert(24, 'mileage', 0)
print("Records Apr2018 : " + str(month_df.count()))

Records Apr2018 : year                       1307543
month                      1307543
tripduration               1307543
tripdurmin                 1307543
starttime                  1307543
starthour                  1307543
weekday                    1307543
stoptime                   1307543
start station id           1307543
start station name         1307543
start station latitude     1307543
start station longitude    1307543
end station id             1307543
end station name           1307543
end station latitude       1307543
end station longitude      1307543
distance                   1307543
bikeid                     1307543
usertype                   1307543
birth year                 1307543
gender                     1307543
age                        1307543
sgender                    1307543
season                     1307543
mileage                    1307543
dtype: int64


In [25]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')
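
The gender recoding and the two date extractions above can also be written with a lookup dictionary and a single datetime parse. The following is a minimal sketch on a made-up three-row sample; `sample` and `GENDER_LABELS` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row sample standing in for month_df
sample = pd.DataFrame({
    "gender": [0, 1, 2],
    "starttime": [
        "2018-04-01 08:15:00.0000",
        "2018-04-02 17:40:10.5000",
        "2018-04-07 23:05:59.1230",
    ],
})

# Same coding as the three .loc assignments: 0 = Unknown, 1 = Male, 2 = Female
GENDER_LABELS = {0: "Unknown", 1: "Male", 2: "Female"}
sample["sgender"] = sample["gender"].map(GENDER_LABELS)

# Parse the timestamps once and reuse the result for both derived columns
start = pd.to_datetime(sample["starttime"])
sample["starthour"] = start.dt.strftime("%H")
sample["weekday"] = start.dt.day_name()
```

Parsing `starttime` once avoids converting more than a million strings twice, and `day_name()` returns English weekday names by default, whereas `strftime('%A')` follows the system locale.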

In [215]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)
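
Applying `distance` row by row is slow on files with more than a million trips. As a possible alternative, the same haversine formula can be evaluated on whole columns with NumPy. This is a sketch, not the notebook's method; `haversine_miles` is a hypothetical helper that reuses the 3956-mile radius from `distance`.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2, radius=3956.0):
    """Great-circle distance in miles between arrays of coordinates given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 2 * radius * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Two station pairs taken from rows shown in this notebook's outputs
example = haversine_miles(
    [40.799484, 40.728401], [-73.955613, -73.999688],
    [40.709651, 40.728401], [-74.068601, -73.999688],
)
```

With the real data, the four coordinate columns of `month_df` would be passed in the same order and the result assigned to `month_df['distance']`.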

In [26]:
# Transform the values for mileage estimates - assumed speed of 7.456 miles per hour, up to two hours. 
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours max out at 14.9 miles (about two hours at 7.456 miles per hour)
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9
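
The comments above describe a speed-based estimate with a ceiling of 14.9 miles. One compact way to express such a ceiling is `Series.clip`. This is only a sketch on made-up durations; `SPEED_MPH` and `MAX_MILES` are illustrative names for the two constants quoted in the comments.

```python
import pandas as pd

SPEED_MPH = 7.456   # assumed average speed, from the comment above
MAX_MILES = 14.9    # ceiling for long trips, from the comment above

# Hypothetical trip durations in minutes: 30 min, 2 h, 3 h
tripdurmin = pd.Series([30.0, 120.0, 180.0])

# Estimate the mileage from the duration, then cap it at the ceiling
mileage = (tripdurmin / 60 * SPEED_MPH).clip(upper=MAX_MILES)
```

Note that a trip of exactly 120 minutes works out to 14.912 miles before the cap, so `clip` trims that boundary case to 14.9 as well.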

In [217]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Apr')

In [218]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Apr')
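
The trip counts above are obtained by counting the `birth year` column, selected by position with `month_df.columns[19:20]`. A sketch of an equivalent count that does not depend on column positions, using a made-up sample (`sample` and `season_counts` are illustrative names):

```python
import pandas as pd

# Hypothetical sample standing in for month_df
sample = pd.DataFrame({
    "starthour": ["08", "08", "17"],
    "season": ["Spring", "Spring", "Spring"],
    "weekday": ["Monday", "Monday", "Friday"],
    "birth year": [1985, 1990, 1969],
})

# size() counts rows per group, so no helper column is needed
season_counts = (
    sample.groupby(["starthour", "season", "weekday"])
    .size()
    .reset_index(name="trips")
)
```

`size()` counts every row, while `count` on `birth year` skips missing values; the two agree here because the record counts printed above show no missing birth years.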

In [18]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Apr')

In [19]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))
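
Because the merge is an outer join, a station that appears only as an end station gets no `startrips` value, and in `stat_df` it also gets no `year` or `month`, since those columns exist only on the start side. The sketch below shows one way to inspect and fill such rows on made-up counts; `starts`, `ends` and `merged` are illustrative names.

```python
import pandas as pd

# Hypothetical per-station counts; 3374 only starts trips, 3426 only ends them
starts = pd.DataFrame({"stationid": [72, 79, 3374], "startrips": [10, 5, 1]})
ends = pd.DataFrame({"stationid": [72, 79, 3426], "endtrips": [8, 6, 1]})

# indicator=True adds a _merge column saying which side each row came from
merged = starts.merge(ends, how="outer", on="stationid", indicator=True)

# Treat a missing count as zero trips
merged[["startrips", "endtrips"]] = (
    merged[["startrips", "endtrips"]].fillna(0).astype(int)
)
```

Filtering on `merged["_merge"] != "both"` then lists the stations that appear on one side only.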

In [27]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Apr')

In [222]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Apr')

In [223]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Apr')

In [224]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)
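
Every append cell passes `header=False`, so the header row of each yearly file has to be written somewhere else. A small helper could decide this automatically. This is a sketch under that assumption; `append_csv` is a hypothetical function, not part of the notebook.

```python
import os
import pandas as pd

def append_csv(df, path):
    """Append df to path, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode="a", encoding="utf-8", index=False, header=write_header)
```

A call such as `append_csv(user_df, './2018/user2018.csv')` would then replace one `with open(...)` block. Running a month's cell twice still appends its rows twice, so duplicates remain possible.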

`May`

In [58]:
# Read the file and print the non-null record count for each column
csv_name = "./2018/201805.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'May')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Spring')
month_df.insert(24, 'mileage', 0)
print("Records May2018 : " + str(month_df.count()))

Records May2018 : year                       1824710
month                      1824710
tripduration               1824710
tripdurmin                 1824710
starttime                  1824710
starthour                  1824710
weekday                    1824710
stoptime                   1824710
start station id           1824710
start station name         1824710
start station latitude     1824710
start station longitude    1824710
end station id             1824710
end station name           1824710
end station latitude       1824710
end station longitude      1824710
distance                   1824710
bikeid                     1824710
usertype                   1824710
birth year                 1824710
gender                     1824710
age                        1824710
sgender                    1824710
season                     1824710
mileage                    1824710
dtype: int64


In [68]:
# Example of a trip that ends at a station missing from the station catalog (id 3426)
# and whose recorded duration in seconds corresponds to a trip of almost a month
month_df.loc[(month_df['end station id'] == 3426), :'distance']

Unnamed: 0,year,month,tripduration,tripdurmin,starttime,starthour,weekday,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,distance
1544616,2018,May,2587767,0,2018-05-28 11:15:06.0410,0,0,2018-06-27 10:04:33.4550,3374,Central Park North & Adam Clayton Powell Blvd,40.799484,-73.955613,3426,JCBS Depot,40.709651,-74.068601,0
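
The trip above lasts 2,587,767 seconds, which is close to 30 days. Records like this can be flagged before aggregating. The sketch below uses three durations taken from the outputs in this notebook and a one-day cutoff; `MAX_TRIP_SECONDS` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

MAX_TRIP_SECONDS = 24 * 60 * 60  # hypothetical cutoff: one day

# Durations (seconds) and end stations copied from rows shown in the outputs
trips = pd.DataFrame({
    "tripduration": [4281, 52407, 2587767],
    "end station id": [3705, 3705, 3426],
})

# Express the duration in days and flag anything longer than the cutoff
trips["days"] = trips["tripduration"] / 86400
trips["suspect"] = trips["tripduration"] > MAX_TRIP_SECONDS
```

Whether flagged trips should be dropped or kept depends on the analysis; the notebook as written keeps them.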


In [30]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [227]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [31]:
# Transform the values for mileage estimates - assumed speed of 7.456 miles per hour, up to two hours. 
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours max out at 14.9 miles (about two hours at 7.456 miles per hour)
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [229]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'May')

In [230]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'May')

In [22]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'May')

In [23]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [32]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'May')

In [234]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'May')

In [235]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'May')

In [236]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

`June`

In [69]:
# Read the file and print the non-null record count for each column
csv_name = "./2018/201806.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Jun')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Summer')
month_df.insert(24, 'mileage', 0)
print("Records Jun2018 : " + str(month_df.count()))

Records Jun2018 : year                       1953103
month                      1953103
tripduration               1953103
tripdurmin                 1953103
starttime                  1953103
starthour                  1953103
weekday                    1953103
stoptime                   1953103
start station id           1953103
start station name         1953103
start station latitude     1953103
start station longitude    1953103
end station id             1953103
end station name           1953103
end station latitude       1953103
end station longitude      1953103
distance                   1953103
bikeid                     1953103
usertype                   1953103
birth year                 1953103
gender                     1953103
age                        1953103
sgender                    1953103
season                     1953103
mileage                    1953103
dtype: int64


In [34]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [239]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [35]:
# Transform the values for mileage estimates - assumed speed of 7.456 miles per hour, up to two hours. 
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours max out at 14.9 miles (about two hours at 7.456 miles per hour)
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [241]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Jun')

In [242]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Jun')

In [26]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Jun')

In [27]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [36]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Jun')

In [246]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Jun')

In [247]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Jun')

In [248]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

`July`

In [37]:
# Read the file and print the non-null record count for each column
csv_name = "./2018/201807.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Jul')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Summer')
month_df.insert(24, 'mileage', 0)
print("Records Jul2018 : " + str(month_df.count()))

Records Jul2018 : year                       1913625
month                      1913625
tripduration               1913625
tripdurmin                 1913625
starttime                  1913625
starthour                  1913625
weekday                    1913625
stoptime                   1913625
start station id           1913625
start station name         1913625
start station latitude     1913625
start station longitude    1913625
end station id             1913625
end station name           1913625
end station latitude       1913625
end station longitude      1913625
distance                   1913625
bikeid                     1913625
usertype                   1913625
birth year                 1913625
gender                     1913625
age                        1913625
sgender                    1913625
season                     1913625
mileage                    1913625
dtype: int64


In [38]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [251]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [39]:
# Transform the values for mileage estimates - assumed speed of 7.456 miles per hour, up to two hours. 
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours max out at 14.9 miles (about two hours at 7.456 miles per hour)
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [253]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Jul')

In [254]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Jul')

In [30]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Jul')

In [31]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [40]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Jul')

In [258]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Jul')

In [259]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', 'sum', 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Jul')

In [260]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)
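
Each month repeats the same setup, with only the month label and the season changing. A small helper could carry that mapping. The sketch below uses the labels and seasons that appear in the cells for April through August; `SEASON_BY_MONTH` and `add_month_columns` are hypothetical names. Unlike the `insert` calls, it appends the columns at the end, so positional lookups such as `month_df.columns[19:20]` would need to be replaced by column names.

```python
import pandas as pd

# Month labels and seasons as used in this notebook's cells
SEASON_BY_MONTH = {
    "Apr": "Spring", "May": "Spring",
    "Jun": "Summer", "Jul": "Summer", "Aug": "Summer",
}

def add_month_columns(df, month, year=2018):
    """Return a copy of df with year, month and season columns filled in."""
    out = df.copy()
    out["year"] = year
    out["month"] = month
    out["season"] = SEASON_BY_MONTH[month]
    return out

# Hypothetical two-trip frame standing in for a freshly read monthly file
raw = pd.DataFrame({"tripduration": [300, 450]})
june = add_month_columns(raw, "Jun")
```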

`August`

In [48]:
# Read the file and print the non-null record count for each column
csv_name = "./2018/201808.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
print("Records Aug2018 : " + str(month_df.count()))

Records Aug2018 : tripduration               1977177
starttime                  1977177
stoptime                   1977177
start station id           1975789
start station name         1975789
start station latitude     1977177
start station longitude    1977177
end station id             1975789
end station name           1975789
end station latitude       1977177
end station longitude      1977177
bikeid                     1977177
usertype                   1977177
birth year                 1977177
gender                     1977177
dtype: int64


In [49]:
# Find missing values to define what to do with them
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station id,0.070201
start station name,0.070201
end station id,0.070201
end station name,0.070201
tripduration,0.0
starttime,0.0
stoptime,0.0
start station latitude,0.0
start station longitude,0.0
end station latitude,0.0
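
The 0.070201 figure matches the counts printed earlier: 1,977,177 rows minus 1,975,789 non-null station ids leaves 1,388 missing values, and 1,388 / 1,977,177 is about 0.0702 %. The same percentages can be computed a little more directly with `isna().mean()`; the following is a sketch on a made-up four-row sample (`sample` is an illustrative name).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing station id out of four rows
sample = pd.DataFrame({
    "start station id": [72.0, np.nan, 79.0, 82.0],
    "tripduration": [300, 450, 600, 120],
})

# The mean of a boolean mask is the fraction of missing values per column
percent_missing = sample.isna().mean().mul(100).sort_values(ascending=False)
```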


In [50]:
# Replace missing station ids with placeholder ids (8000 for start stations, 9000 for end stations)
month_df['start station id'] = month_df['start station id'].fillna(8000)
month_df['end station id'] = month_df['end station id'].fillna(9000)

In [51]:
# Verify that only the station names still have missing values
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station name,0.070201
end station name,0.070201
tripduration,0.0
starttime,0.0
stoptime,0.0
start station id,0.0
start station latitude,0.0
start station longitude,0.0
end station id,0.0
end station latitude,0.0


In [52]:
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Aug')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Summer')
month_df.insert(24, 'mileage', 0)

In [72]:
# Station not in catalog
month_df.loc[(month_df['end station id'] == 3705), :'distance']

Unnamed: 0,year,month,tripduration,tripdurmin,starttime,starthour,weekday,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,distance
1331669,2018,Aug,4281,0,2018-08-22 12:42:47.0810,0,0,2018-08-22 13:54:08.3270,539.0,Metropolitan Ave & Bedford Ave,40.715348,-73.960241,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1332119,2018,Aug,1615,0,2018-08-22 12:51:04.2130,0,0,2018-08-22 13:17:59.6220,379.0,W 31 St & 7 Ave,40.749156,-73.991600,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1333837,2018,Aug,237,0,2018-08-22 13:23:43.7200,0,0,2018-08-22 13:27:41.3440,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1335073,2018,Aug,2562,0,2018-08-22 13:46:06.6440,0,0,2018-08-22 14:28:49.2890,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1337508,2018,Aug,2052,0,2018-08-22 14:30:14.8850,0,0,2018-08-22 15:04:27.3610,474.0,5 Ave & E 29 St,40.745168,-73.986831,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1337538,2018,Aug,2002,0,2018-08-22 14:30:47.8540,0,0,2018-08-22 15:04:10.6550,474.0,5 Ave & E 29 St,40.745168,-73.986831,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1340220,2018,Aug,1496,0,2018-08-22 15:18:25.8580,0,0,2018-08-22 15:43:22.5670,267.0,Broadway & W 36 St,40.750977,-73.987654,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1368851,2018,Aug,52407,0,2018-08-22 19:57:40.3070,0,0,2018-08-23 10:31:07.5900,388.0,W 26 St & 10 Ave,40.749718,-74.002950,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1406880,2018,Aug,1705,0,2018-08-23 13:24:45.0730,0,0,2018-08-23 13:53:10.7920,514.0,12 Ave & W 40 St,40.760875,-74.002777,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0
1406944,2018,Aug,1619,0,2018-08-23 13:26:03.9050,0,0,2018-08-23 13:53:03.8910,514.0,12 Ave & W 40 St,40.760875,-74.002777,3705.0,Thompson St & Bleecker St,40.728401,-73.999688,0


In [53]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [263]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [54]:
# Transform the values for mileage estimates - assumed speed of 7.456 miles per hour, up to two hours. 
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours max out at 14.9 miles (about two hours at 7.456 miles per hour)
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [265]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Aug')

In [266]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Aug')

In [55]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Aug')

In [56]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))
with open('./2018/station18.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

In [44]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Aug')

In [270]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', 'sum', 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Aug')

In [271]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', sum, 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Aug')

In [272]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

`September`

In [12]:
# Read the file and print the non-null count for each column
csv_name = "./2018/201809.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
print("Records Sep2018 : " + str(month_df.count()))

Records Sep2018 : tripduration               1877884
starttime                  1877884
stoptime                   1877884
start station id           1877168
start station name         1877168
start station latitude     1877884
start station longitude    1877884
end station id             1877168
end station name           1877168
end station latitude       1877884
end station longitude      1877884
bikeid                     1877884
usertype                   1877884
birth year                 1877884
gender                     1877884
dtype: int64


In [13]:
# Find missing values to define what to do with them
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station id,0.038128
start station name,0.038128
end station id,0.038128
end station name,0.038128
tripduration,0.0
starttime,0.0
stoptime,0.0
start station latitude,0.0
start station longitude,0.0
end station latitude,0.0


In [14]:
# Replace missing station ids with sentinel values (8000 for start, 9000 for end)
month_df['start station id'] = month_df['start station id'].fillna(8000)
month_df['end station id'] = month_df['end station id'].fillna(9000)
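
A minimal sketch of the same replacement done in one call, assuming the sentinel ids 8000 and 9000 used in the cell above. `fillna` accepts a per-column mapping, so each column is filled from its own values. The two-row frame `trips` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing start id and one missing end id
trips = pd.DataFrame({
    'start station id': [72.0, np.nan],
    'end station id': [np.nan, 79.0],
})

# Fill each column with its own sentinel value
trips = trips.fillna({'start station id': 8000, 'end station id': 9000})
```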

In [15]:
# Verify that only the station names still have missing values
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station name,0.038128
end station name,0.038128
tripduration,0.0
starttime,0.0
stoptime,0.0
start station id,0.0
start station latitude,0.0
start station longitude,0.0
end station id,0.0
end station latitude,0.0


In [16]:
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Sep')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Autumn')
month_df.insert(24, 'mileage', 0)

In [17]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')
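
The cell above parses `starttime` twice. A sketch that parses it once and reuses the result is shown below. The two timestamps are hypothetical values in the same format as the trip files, and `day_name()` is used here as an alternative to `strftime('%A')`.

```python
import pandas as pd

# Hypothetical sample using the notebook's column name
trips = pd.DataFrame({'starttime': ['2018-09-01 00:00:07.3210',
                                    '2018-09-03 17:45:00.0000']})

# Parse the timestamps once, then derive both columns from the parsed series
start = pd.to_datetime(trips['starttime'])
trips['starthour'] = start.dt.strftime('%H')   # zero-padded hour as text
trips['weekday'] = start.dt.day_name()         # English weekday name
```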

In [275]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)
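
The row-wise `apply` above calls `distance` once per trip, which can be slow for close to two million rows. A sketch of the same haversine formula vectorized with NumPy follows. The name `haversine_miles` and the two-trip sample are hypothetical; the radius of 3956 miles is the one used by the `distance` function defined at the top of the notebook.

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2, radius=3956.0):
    """Great-circle distance in miles between arrays of coordinates in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return radius * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Hypothetical sample: a round trip (same station) and a short one-way trip
sample_df = pd.DataFrame({
    'start station latitude': [40.7673, 40.7673],
    'start station longitude': [-73.9939, -73.9939],
    'end station latitude': [40.7673, 40.7420],
    'end station longitude': [-73.9939, -73.9800],
})
sample_df['distance'] = haversine_miles(
    sample_df['start station latitude'], sample_df['start station longitude'],
    sample_df['end station latitude'], sample_df['end station longitude'],
)
```

Trips that start and end at the same station get a distance of zero under either approach, which is one reason the notebook also estimates mileage from trip duration.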

In [18]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours are capped at 14.9 miles
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9
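
The mileage rule stated in the comments above (7.456 miles per hour for trips of up to two hours, and a flat 14.9 miles beyond that) can be sketched as a single `np.where` expression. The constant names and the three-row frame are hypothetical; the numbers come from the notebook.

```python
import numpy as np
import pandas as pd

SPEED_MPH = 7.456   # assumed average speed, from the notebook
MAX_MILES = 14.9    # cap for trips longer than two hours, from the notebook

# Hypothetical durations in minutes: short trip, exactly two hours, five hours
trips = pd.DataFrame({'tripdurmin': [30.0, 120.0, 300.0]})

trips['mileage'] = np.where(
    trips['tripdurmin'] <= 120,
    trips['tripdurmin'] / 60 * SPEED_MPH,  # duration in hours times speed
    MAX_MILES,                             # capped value for long trips
)
```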

In [277]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Sep')
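
The aggregations in this notebook pick the counted column by position (`month_df.columns[19:20]`, which resolves to `birth year` after the inserts) and then rename it to `trips`. A sketch of a count that does not depend on column positions, using `groupby(...).size()`, is shown below. It matches the `count` result only when the counted column has no missing values, as is the case for `birth year` in the record counts printed above. The four-row frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample using the notebook's column names
trips = pd.DataFrame({
    'usertype': ['Subscriber', 'Subscriber', 'Customer', 'Subscriber'],
    'sgender': ['Male', 'Male', 'Unknown', 'Female'],
})

# Count rows per group and name the result column directly
user_df = (
    trips.groupby(['usertype', 'sgender'])
    .size()
    .reset_index(name='trips')
    .rename(columns={'sgender': 'gender'})
)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Sep')
```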

In [278]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Sep')

In [19]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Sep')

In [20]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [48]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Sep')

In [282]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', sum, 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Sep')

In [283]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', sum, 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Sep')

In [284]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)
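
Every month is appended with `header=False`, so the output files only have a header row if an earlier cell wrote one. A sketch of a helper that writes the header only when the file does not exist yet is shown below. `append_csv` is a hypothetical name, and the temporary directory stands in for `./2018/`.

```python
import os
import tempfile
import pandas as pd

def append_csv(df, path):
    """Append df to path, writing the header only if the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode='a', encoding='utf-8', index=False, header=write_header)

# Hypothetical monthly summaries appended to the same file
out_path = os.path.join(tempfile.mkdtemp(), 'user2018.csv')
append_csv(pd.DataFrame({'month': ['Aug'], 'trips': [10]}), out_path)
append_csv(pd.DataFrame({'month': ['Sep'], 'trips': [12]}), out_path)

# Reading the file back gives one header row and two data rows
combined = pd.read_csv(out_path)
```

Note that re-running a month's cells appends its rows a second time under either approach, so the output files should be deleted before a full re-run.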

`October`

In [21]:
# Read the file and print the non-null count for each column
csv_name = "./2018/201810.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
print("Records Oct2018 : " + str(month_df.count()))

Records Oct2018 : tripduration               1878657
starttime                  1878657
stoptime                   1878657
start station id           1878433
start station name         1878433
start station latitude     1878657
start station longitude    1878657
end station id             1878433
end station name           1878433
end station latitude       1878657
end station longitude      1878657
bikeid                     1878657
usertype                   1878657
birth year                 1878657
gender                     1878657
dtype: int64


In [22]:
# Find missing values to define what to do with them
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station id,0.011923
start station name,0.011923
end station id,0.011923
end station name,0.011923
tripduration,0.0
starttime,0.0
stoptime,0.0
start station latitude,0.0
start station longitude,0.0
end station latitude,0.0


In [23]:
# Replace missing station ids with sentinel values (8000 for start, 9000 for end)
month_df['start station id'] = month_df['start station id'].fillna(8000)
month_df['end station id'] = month_df['end station id'].fillna(9000)

In [24]:
# Verify that only the station names still have missing values
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station name,0.011923
end station name,0.011923
tripduration,0.0
starttime,0.0
stoptime,0.0
start station id,0.0
start station latitude,0.0
start station longitude,0.0
end station id,0.0
end station latitude,0.0


In [25]:
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Oct')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Autumn')
month_df.insert(24, 'mileage', 0)

In [26]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [287]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [27]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours are capped at 14.9 miles
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [289]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Oct')

In [290]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Oct')

In [28]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Oct')

In [29]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [52]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Oct')

In [294]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', sum, 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Oct')

In [295]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', sum, 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Oct')

In [296]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

`November`

In [30]:
# Read the file and print the non-null count for each column
csv_name = "./2018/201811.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
print("Records Nov2018 : " + str(month_df.count()))

Records Nov2018 : tripduration               1260355
starttime                  1260355
stoptime                   1260355
start station id           1260275
start station name         1260275
start station latitude     1260355
start station longitude    1260355
end station id             1260275
end station name           1260275
end station latitude       1260355
end station longitude      1260355
bikeid                     1260355
usertype                   1260355
birth year                 1260355
gender                     1260355
dtype: int64


In [31]:
# Find missing values to define what to do with them
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station id,0.006347
start station name,0.006347
end station id,0.006347
end station name,0.006347
tripduration,0.0
starttime,0.0
stoptime,0.0
start station latitude,0.0
start station longitude,0.0
end station latitude,0.0


In [32]:
# Replace missing station ids with sentinel values (8000 for start, 9000 for end)
month_df['start station id'] = month_df['start station id'].fillna(8000)
month_df['end station id'] = month_df['end station id'].fillna(9000)

In [33]:
# Verify that only the station names still have missing values
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station name,0.006347
end station name,0.006347
tripduration,0.0
starttime,0.0
stoptime,0.0
start station id,0.0
start station latitude,0.0
start station longitude,0.0
end station id,0.0
end station latitude,0.0


In [34]:
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Nov')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Autumn')
month_df.insert(24, 'mileage', 0)

In [35]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [299]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [36]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours are capped at 14.9 miles
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [301]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Nov')

In [302]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Nov')

In [37]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Nov')

In [38]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [56]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Nov')

In [306]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', sum, 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Nov')

In [307]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', sum, 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Nov')

In [308]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

`December`

In [39]:
# Read the file and print the non-null count for each column
csv_name = "./2018/201812.csv"
month_df = pd.read_csv(csv_name, low_memory=False) 
print("Records Dec2018 : " + str(month_df.count()))

Records Dec2018 : tripduration               1016505
starttime                  1016505
stoptime                   1016505
start station id           1016416
start station name         1016416
start station latitude     1016505
start station longitude    1016505
end station id             1016416
end station name           1016416
end station latitude       1016505
end station longitude      1016505
bikeid                     1016505
usertype                   1016505
birth year                 1016505
gender                     1016505
dtype: int64


In [40]:
# Find missing values to define what to do with them
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station id,0.008755
start station name,0.008755
end station id,0.008755
end station name,0.008755
tripduration,0.0
starttime,0.0
stoptime,0.0
start station latitude,0.0
start station longitude,0.0
end station latitude,0.0


In [41]:
# Replace missing station ids with sentinel values (8000 for start, 9000 for end)
month_df['start station id'] = month_df['start station id'].fillna(8000)
month_df['end station id'] = month_df['end station id'].fillna(9000)

In [42]:
# Verify that only the station names still have missing values
percent_missing = month_df.isnull().sum() * 100 / len(month_df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True, ascending=False)
missing_value_df

Unnamed: 0,percent_missing
start station name,0.008755
end station name,0.008755
tripduration,0.0
starttime,0.0
stoptime,0.0
start station id,0.0
start station latitude,0.0
start station longitude,0.0
end station id,0.0
end station latitude,0.0


In [43]:
# Add columns to make transformations
month_df.insert(0, 'year', 2018)
month_df.insert(1, 'month', 'Dec')
month_df.insert(3, 'tripdurmin', 0)
month_df.insert(5, 'starthour', 0)
month_df.insert(6, 'weekday', 0)
month_df.insert(16, 'distance', 0)
month_df.insert(21, 'age', 0)
month_df.insert(22, 'sgender', '')
month_df.insert(23, 'season', 'Winter')
month_df.insert(24, 'mileage', 0)

In [44]:
# Transform the values for gender
month_df.loc[month_df['gender'] == 0, 'sgender'] = 'Unknown'
month_df.loc[month_df['gender'] == 1, 'sgender'] = 'Male'
month_df.loc[month_df['gender'] == 2, 'sgender'] = 'Female'

# Calculate the age of the person considering that the year was 2018
month_df['age'] = 2018 - month_df['birth year']

# Transform the duration of the trip from seconds to minutes
month_df['tripdurmin'] = month_df['tripduration'] / 60

# Extract the hour from the start time
month_df['starthour'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%H')

# Calculate the weekday from the start time
month_df['weekday'] = pd.to_datetime(month_df["starttime"]).dt.strftime('%A')

In [311]:
# Calculate the distance in miles from start station to end station
month_df['distance'] = month_df.apply(lambda row: distance((row['start station latitude'],row['start station longitude']), 
                         (row['end station latitude'], row['end station longitude'])),
                         axis=1)

In [45]:
# Estimate mileage from trip duration, assuming a speed of 7.456 miles per hour for trips of up to two hours
month_df.loc[month_df['tripdurmin'] <= 120, 'mileage'] = (month_df['tripdurmin'] / 60) * 7.456

# Trips over two hours are capped at 14.9 miles
month_df.loc[month_df['tripdurmin'] > 120, 'mileage'] = 14.9

In [313]:
# Usertype + Gender
user_df = month_df.groupby(['usertype', 'sgender']).agg({i:'count' for i in month_df.columns[19:20]})
user_df.reset_index(inplace=True)
user_df.rename(columns={"sgender":"gender", "birth year":"trips"}, inplace=True)
user_df.insert(0, 'year', 2018)
user_df.insert(1, 'month', 'Dec')

In [314]:
# Trips by hour, season and weekday
season_df = month_df.groupby(['starthour', 'season', 'weekday']).agg({i:'count' for i in month_df.columns[19:20]})
season_df.reset_index(inplace=True)
season_df.rename(columns={"birth year":"trips"}, inplace=True)
season_df.insert(0, 'year', 2018)
season_df.insert(1, 'month', 'Dec')

In [46]:
# Start Stations
stat_df = month_df.groupby(['start station id',]).agg({i:'count' for i in month_df.columns[19:20]})
stat_df.reset_index(inplace=True)
stat_df.rename(columns={"start station id" : "stationid", "birth year":"startrips"}, inplace=True)
stat_df.insert(0, 'year', 2018)
stat_df.insert(1, 'month', 'Dec')

In [47]:
# End Stations
statend_df = month_df.groupby(['end station id',]).agg({i:'count' for i in month_df.columns[19:20]})
statend_df.reset_index(inplace=True)
statend_df.rename(columns={"end station id" : "stationid", "birth year":"endtrips"}, inplace=True)

# Merge the start trips with the end trips by station id
stat_df = stat_df.merge(statend_df, how="outer", left_on='stationid', right_on='stationid', suffixes=('_left', '_right'))

In [60]:
# Age - Trip duration
agedur_df = month_df.groupby(by=['age','usertype'])['tripdurmin'].agg(['count', 'mean'])
agedur_df.reset_index(inplace=True)
agedur_df.insert(0, 'year', 2018)
agedur_df.insert(1, 'month', 'Dec')

In [318]:
# Bikes
bike_df = month_df.groupby(by=['bikeid'])['distance'].agg(['count', sum, 'mean', 'std'])
bike_df.reset_index(inplace=True)
bike_df.rename(columns={"count":"trips", "sum" : "totmiles", "mean" : "avgmiles", "std" : "stdmiles"}, inplace=True)
bike_df.insert(0, 'year', 2018)
bike_df.insert(1, 'month', 'Dec')

In [319]:
# Bikes considering trip duration to calculate the mileage 
bikemil_df = month_df.groupby(by=['bikeid'])['mileage'].agg(['count', sum, 'mean', 'std'])
bikemil_df.reset_index(inplace=True)
bikemil_df.rename(columns={"count":"trips", "sum" : "totmileage", "mean" : "avgmileage", "std" : "stdmileage"}, inplace=True)
bikemil_df.insert(0, 'year', 2018)
bikemil_df.insert(1, 'month', 'Dec')

In [320]:
with open('./2018/user2018.csv', 'a') as f:
    user_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/season2018.csv', 'a') as f:
    season_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/station2018.csv', 'a') as f:
    stat_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/agedur2018.csv', 'a') as f:
    agedur_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/bike2018.csv', 'a') as f:
    bike_df.to_csv(f, encoding="utf-8", index=False, header=False)

with open('./2018/mileage2018.csv', 'a') as f:
    bikemil_df.to_csv(f, encoding="utf-8", index=False, header=False)

### __Stations with trips but not in the JSON file (catalog)__

In [47]:
# Read the aggregated station trips and the station catalog
csv_name = "./2018/station2018.csv"
stations_df = pd.read_csv(csv_name, low_memory=False) 
csv_name = "./2018/stationsny.csv"
catalogue_df = pd.read_csv(csv_name, low_memory=False)

In [48]:
# Merge data 
stations_df = stations_df.merge(catalogue_df, how="outer", left_on='stationid', right_on='Id', suffixes=('_left', '_right'))

In [49]:
# View results
stations_df.head()

Unnamed: 0,year,month,stationid,startrips,endtrips,Número de registros,Número de registros por stations.json,Altitude,Available Bikes,Available Docks,...,Postal Code,stAddress1,stAddress2,stationBeanList Índice (generado),Station Name,Status Key,Status Value,Test Station,Total Docks,Índice Del Documento (Generado)
0,2018.0,Jan,72.0,1324.0,1322.0,1.0,1.0,,9.0,44.0,...,,W 52 St & 11 Ave,,21.0,W 52 St & 11 Ave,1.0,In Service,Falso,55.0,1.0
1,2018.0,Feb,72.0,1504.0,1526.0,1.0,1.0,,9.0,44.0,...,,W 52 St & 11 Ave,,21.0,W 52 St & 11 Ave,1.0,In Service,Falso,55.0,1.0
2,2018.0,Mar,72.0,2127.0,2142.0,1.0,1.0,,9.0,44.0,...,,W 52 St & 11 Ave,,21.0,W 52 St & 11 Ave,1.0,In Service,Falso,55.0,1.0
3,2018.0,Apr,72.0,2797.0,2807.0,1.0,1.0,,9.0,44.0,...,,W 52 St & 11 Ave,,21.0,W 52 St & 11 Ave,1.0,In Service,Falso,55.0,1.0
4,2018.0,May,72.0,4602.0,4614.0,1.0,1.0,,9.0,44.0,...,,W 52 St & 11 Ave,,21.0,W 52 St & 11 Ave,1.0,In Service,Falso,55.0,1.0


In [52]:
# Create an array with the station ids from the trips
stat_trip = stations_df['stationid'].unique()

In [53]:
# Create an array with the station ids from the catalog
stat_cat = stations_df['Id'].unique()

In [57]:
print("Unique id of stations that are not in the catalog:")
print(np.setdiff1d(stat_trip, stat_cat))

Unique id of stations that are not in the catalog:
[ 152.  232.  253.  306.  322.  345.  407.  409.  428.  430.  433.  449.
  457.  527.  537. 2001. 3036. 3040. 3073. 3090. 3091. 3103. 3120. 3147.
 3153. 3176. 3180. 3183. 3224. 3238. 3239. 3240. 3245. 3250. 3258. 3316.
 3325. 3371. 3401. 3421. 3426. 3428. 3431. 3432. 3438. 3441. 3447. 3455.
 3462. 3466. 3468. 3474. 3485. 3487. 3488. 3489. 3594. 3625. 3632. 3635.
 3642. 3643. 3644. 3645. 3650. 3651. 3652. 3653. 3658. 3660. 3663. 3666.
 3669. 3672. 3683. 3684. 3685. 3688. 3695. 3700. 3701. 3705. 3719.   nan]
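
An alternative sketch for the same check: `merge` accepts `indicator=True`, which adds a `_merge` column marking rows found only on the left side, that is, stations with trips but no catalog entry. This also avoids the `nan` entry that appears in the output above. The two small frames are hypothetical stand-ins for the trip and catalog data, and the second station name is a placeholder.

```python
import pandas as pd

# Hypothetical stand-ins for the station trips and the catalog
trips_stations = pd.DataFrame({'stationid': [72, 152, 232],
                               'startrips': [1324, 40, 15]})
catalogue = pd.DataFrame({'Id': [72, 79],
                          'Station Name': ['W 52 St & 11 Ave', 'Placeholder station']})

# The _merge column is 'left_only' for stations missing from the catalog
merged = trips_stations.merge(catalogue, how='outer',
                              left_on='stationid', right_on='Id', indicator=True)
left_only = merged.loc[merged['_merge'] == 'left_only', 'stationid']
missing_ids = sorted(int(s) for s in left_only.unique())
```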
