# Executive Summary
***

This notebook pre-processes bike share trip data from Citi Bike, the bike share system in the New York area (the JC-prefixed trip files for 2016).

The original dataset is a single table in which each record is an individual ride (the columns can be seen below).
The dataset is of good quality: there are no duplicated rows and few missing data points. There is, however, one data integrity issue: individual users can't be identified, so the columns holding user information (User Type, Birth Year, Gender) are technically attributes of the rides and not of the users themselves.

The dataset is sanity-checked and validated; missing data is identified and handled so that only insignificant bias is introduced.

New columns are added: the distance between the start and end points of each ride is calculated, and the city, neighbourhood and postal code of each station are determined using the API from https://opencagedata.com/.
Some minor changes are made to the dataset to speed up future analysis, and the dataset is normalised to 2NF. The result is exported to .csv files to be uploaded to a PostgreSQL database.

# Importing Data and Packages
***

In [1]:
import pandas as pd
import glob
import numpy as np
import geopy
from geopy.distance import geodesic 
from opencage.geocoder import OpenCageGeocode
from pprint import pprint

In [2]:
files = glob.glob('C:/Users/Dell/OneDrive/Dokumenty/Data Engineering/Codecademy Course/Bike Rental Project/bike-rental-starter-kit/data/JC-201***-citibike-tripdata.csv')
df_lst = []
for file in files:
    data = pd.read_csv(file)
    df_lst.append(data)

bike_data_df = pd.concat(df_lst)
bike_data_df.head(5)

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,2
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,1
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,2
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,1
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,0
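
One detail of the loading step is worth illustrating: `pd.concat` keeps each file's own row index by default, so index labels repeat across the monthly files. The sketch below uses two made-up frames (not the real files) to show the difference that `ignore_index=True` makes; whether the notebook needs a unique index is a judgment call.

```python
import pandas as pd

# Two toy "monthly files" (hypothetical values)
january = pd.DataFrame({'Trip Duration': [362, 200]})
february = pd.DataFrame({'Trip Duration': [248, 903]})

# Default behaviour: each frame keeps its own 0..n-1 labels
kept = pd.concat([january, february])

# ignore_index=True renumbers the rows of the combined frame
renumbered = pd.concat([january, february], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```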


## Initial Exploration of the Data
***

Exploring the time span

In [3]:
print('Start of dataset: ' + str(bike_data_df['Start Time'].min()))
print('End of dataset: ' + str(bike_data_df['Start Time'].max()))
print('Number of entries: ' + str(bike_data_df['Trip Duration'].count()))

Start of dataset: 2016-01-01 00:02:52
End of dataset: 2016-12-31 23:44:50
Number of entries: 247584


Exploring the span of the gender column

In [4]:
print('The unique values of the gender column are: ' + str(bike_data_df['Gender'].unique()))
print('0 - unknown')
print('1 - male')
print('2 - female')

The unique values of the gender column are: [2 1 0]
0 - unknown
1 - male
2 - female


## Investigating Missing Data and Duplicates
***

In [5]:
bike_data_missing = bike_data_df.isna().sum().reset_index()
bike_data_missing = bike_data_missing.rename(columns = {'index': 'Column', 0 : 'No_missing_values'})
bike_data_missing['Percentage_missing'] = 100.0 * bike_data_missing['No_missing_values'] / bike_data_df['Trip Duration'].count()
print(bike_data_missing)
print('\nThe number of duplicates in the dataset is: ' + str(bike_data_df.duplicated().sum()))

                     Column  No_missing_values  Percentage_missing
0             Trip Duration                  0            0.000000
1                Start Time                  0            0.000000
2                 Stop Time                  0            0.000000
3          Start Station ID                  0            0.000000
4        Start Station Name                  0            0.000000
5    Start Station Latitude                  0            0.000000
6   Start Station Longitude                  0            0.000000
7            End Station ID                  0            0.000000
8          End Station Name                  0            0.000000
9      End Station Latitude                  0            0.000000
10    End Station Longitude                  0            0.000000
11                  Bike ID                  0            0.000000
12                User Type                380            0.153483
13               Birth Year              18999            7.67

Missing data in the User Type and Birth Year columns, and the unknown values (code 0) in the Gender column, will be investigated next.
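
The same kind of summary can be built a little more compactly, since the mean of a boolean `isna()` mask is the fraction of missing rows. This is a minimal sketch on made-up data, not a replacement for the cell above; the column values are illustrative.

```python
import pandas as pd

# Toy frame with some missing entries (hypothetical values)
sample = pd.DataFrame({
    'User Type': ['Subscriber', None, 'Customer', 'Subscriber'],
    'Birth Year': [1964.0, None, None, 1984.0],
})

# isna().sum() counts missing rows; isna().mean() gives their fraction
summary = pd.DataFrame({
    'No_missing_values': sample.isna().sum(),
    'Percentage_missing': 100.0 * sample.isna().mean(),
})
print(summary)
```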

## Investigating Missing Data
***

### User Type Missing Data

The working hypothesis is that this data is missing not at random (MNAR); if the hypothesis is falsified, the data will be treated as missing completely at random (MCAR).
In the absence of domain knowledge, it is assumed that missing User Type might correlate with Bike ID: perhaps the user logs into the system through the bike, and the IDs of some bikes are lost or corrupted.

In [6]:
bike_data_missing_user_type = bike_data_df[bike_data_df['User Type'].isnull()]
new_df = bike_data_missing_user_type.groupby('Bike ID')['Trip Duration'].count().reset_index()
new_df = new_df.rename(columns = {'Trip Duration' : 'sum_missing_values'})
new_df = new_df.sort_values(by = 'sum_missing_values', ascending = False)
print(new_df)

     Bike ID  sum_missing_values
200    26287                   7
33     24467                   6
49     24512                   6
124    24675                   6
27     24450                   5
..       ...                 ...
122    24671                   1
120    24668                   1
119    24667                   1
48     24511                   1
216    26891                   1

[217 rows x 2 columns]


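The hypothesis is falsified: the 380 missing User Type values are spread over 217 different bikes, with no bike accounting for more than 7 of them.
In the absence of other candidate explanations, the missing data is therefore treated as MCAR.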
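
The reasoning can be made concrete with a little arithmetic on the figures printed above (380 missing rows, 217 affected bikes, at most 7 per bike). If missingness were tied to a handful of faulty bikes, one would expect a few bikes to account for a large share of the missing rows, which is not what these numbers show.

```python
# Figures taken from the outputs above
total_missing = 380    # rows with a missing User Type
bikes_affected = 217   # distinct Bike IDs among those rows
max_per_bike = 7       # largest count for any single bike

# Share of all missing rows contributed by the most affected bike
share_of_worst_bike = max_per_bike / total_missing

# Average number of missing rows per affected bike
mean_per_bike = total_missing / bikes_affected

print(f'Worst bike share: {share_of_worst_bike:.1%}')
print(f'Mean per affected bike: {mean_per_bike:.2f}')
```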

### Gender Unknown Data
Although not apparent at first sight, the Gender column contains unknown values: it has three unique values [0, 1, 2], where 0 stands for unknown.

In [7]:
gender_unknown = bike_data_df[bike_data_df['Gender'] == 0]
gender_unknown_number = gender_unknown['Gender'].count()
gender_total_number = bike_data_df['Gender'].count()
gender_percentage_unknown = 100 * gender_unknown_number / gender_total_number
print('Percentage unknown: ' + str(gender_percentage_unknown))

Percentage unknown: 8.038080005169963


Given the lack of domain knowledge and the somewhat sensitive nature of gender, it is assumed that this data is missing because people simply didn't enter their gender.

### Birth Year Missing Data
Given the lack of domain knowledge and the sensitive nature of birth dates, it is assumed that this data is missing because people simply didn't enter their year of birth.

### Actions on Missing Data
Rows with an empty value in the User Type column are deleted so that no null value takes part in any calculation that aggregates over User Type. Since the data is assumed to be MCAR and accounts for only 0.15% of the rows, the bias introduced is negligible.
The unknown values in the Gender column and the missing values in the Birth Year column are left as they are. It is assumed that they are missing because users did not enter them. It is also important to highlight that the table has no user_id column, so different users can't be distinguished; the information in the User Type, Birth Year and Gender columns can be used to analyse the rides, not the users. No values are filled in, as any imputation method might introduce more bias to the dataset.

In [8]:
bike_data_df = bike_data_df.dropna(subset = ['User Type'])

# Sanity Check on the whole dataset:
***

In [9]:
bike_data_df.describe()

Unnamed: 0,Trip Duration,Start Station ID,Start Station Latitude,Start Station Longitude,End Station ID,End Station Latitude,End Station Longitude,Bike ID,Birth Year,Gender
count,247204.0,247204.0,247204.0,247204.0,247204.0,247204.0,247204.0,247204.0,228205.0,247204.0
mean,884.9792,3207.063721,40.723127,-74.046444,3203.573,40.722599,-74.04586,24935.036169,1979.328547,1.123246
std,35965.33,26.950888,0.008198,0.011211,61.618127,0.007957,0.011284,748.39163,9.595552,0.518716
min,61.0,3183.0,40.69264,-74.096937,147.0,40.692216,-74.096937,14552.0,1900.0,0.0
25%,248.0,3186.0,40.717732,-74.050656,3186.0,40.71654,-74.050444,24491.0,1974.0,1.0
50%,389.0,3201.0,40.721525,-74.044247,3199.0,40.721124,-74.043117,24609.0,1981.0,1.0
75%,665.0,3211.0,40.727596,-74.038051,3211.0,40.727224,-74.036486,24718.25,1986.0,1.0
max,16329810.0,3426.0,40.752559,-74.032108,3426.0,40.801343,-73.95739,27274.0,2000.0,2.0


The range of values differs between Start Station ID and End Station ID. This is unexpected and will be investigated. Apart from that, the sanity check is passed.
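
Two extremes in the `describe()` output may still deserve a second look: the maximum Trip Duration of 16,329,810 seconds (roughly 189 days) and the minimum Birth Year of 1900. The sketch below shows one way such rows could be flagged on toy data. The two cut-offs are illustrative assumptions, not values used anywhere in this notebook.

```python
import pandas as pd

# Toy rows mixing ordinary and extreme values
sample = pd.DataFrame({
    'Trip Duration': [362, 200, 16329810],   # seconds
    'Birth Year': [1964.0, 1900.0, None],
})

MAX_DURATION_S = 24 * 60 * 60   # assumed cut-off: one day
MIN_BIRTH_YEAR = 1916           # assumed cut-off: at most 100 years old in 2016

long_trips = sample['Trip Duration'] > MAX_DURATION_S
# Comparisons with NaN evaluate to False, so missing years are not flagged
old_riders = sample['Birth Year'] < MIN_BIRTH_YEAR

print(sample[long_trips | old_riders])
```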

# Investigating the Difference between Start and End Stations:
***

In [10]:
print('Start Stations:')
start_stations = bike_data_df['Start Station ID']
start_stations_unique = list(start_stations.unique())
print(start_stations_unique)

print('End Stations:')
end_stations = bike_data_df['End Station ID']
end_stations_unique = list(end_stations.unique())
print(end_stations_unique)


Start Stations:
[3186, 3209, 3195, 3211, 3187, 3183, 3213, 3193, 3194, 3202, 3196, 3214, 3207, 3199, 3203, 3210, 3190, 3185, 3197, 3192, 3212, 3225, 3215, 3206, 3184, 3205, 3198, 3220, 3191, 3201, 3217, 3188, 3200, 3216, 3189, 3270, 3267, 3272, 3268, 3278, 3279, 3276, 3273, 3275, 3274, 3281, 3271, 3269, 3280, 3426, 3277]
End Stations:
[3209, 3213, 3203, 3210, 3214, 3187, 3211, 3202, 3199, 3183, 3212, 3193, 3194, 3195, 3186, 3185, 3184, 3220, 3192, 3196, 3215, 3207, 3205, 3197, 3206, 3189, 3225, 3201, 3190, 3191, 3198, 3217, 534, 3200, 3188, 3216, 249, 225, 276, 403, 295, 430, 2008, 3267, 405, 147, 3178, 173, 252, 484, 505, 327, 363, 329, 457, 358, 315, 477, 428, 3002, 514, 390, 501, 351, 3265, 304, 284, 3270, 3272, 3268, 3279, 3278, 3273, 3276, 3271, 3274, 3281, 3275, 3314, 498, 3169, 380, 3331, 386, 3426, 427, 347, 359, 426, 3085, 360, 328, 3269, 446, 520, 3280, 2004, 393, 3277, 401, 376, 224]


As can be seen, the lists of start and end stations differ: the end-station list is clearly longer than the 51-entry start-station list. The cell below tests whether any start station never appears as an end station.

In [11]:
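# Check whether any start station never appears as an end station.
# Iterating over the unique IDs and testing against a set avoids
# scanning a list once per ride.
end_station_set = set(end_stations_unique)
for station in start_stations_unique:
    if station not in end_station_set:
        print('Start station not end station')
        break
print('done')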

done


Every station serves as an end station, and only some of them also serve as start stations. The question is whether the remaining stations were never used as start stations in 2016 because they are designed to be end stations only, or because people simply didn't start trips there. The following cell counts the started trips per station ID.
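
The end-only stations can also be listed directly with a set difference. The sketch uses a handful of IDs copied from the lists printed above; in the notebook the same expression would presumably be applied to `start_stations_unique` and `end_stations_unique`.

```python
# A few station IDs copied from the printed lists
start_ids = [3186, 3209, 3195, 3426]
end_ids = [3209, 3186, 3195, 3426, 534, 249, 225]

# Stations that appear only as a destination
end_only = sorted(set(end_ids) - set(start_ids))

# Stations that appear only as an origin (expected to be empty here)
start_only = sorted(set(start_ids) - set(end_ids))

print('End-only stations:', end_only)
print('Start-only stations:', start_only)
```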

In [12]:
bike_data_df.groupby('Start Station ID')['Trip Duration'].count()

Start Station ID
3183    18961
3184     7785
3185     7779
3186    28712
3187     8326
3188      518
3189      240
3190      924
3191      356
3192     6545
3193     3183
3194     7272
3195    17130
3196     1772
3197     1661
3198     1788
3199     9012
3200      243
3201     2190
3202    13336
3203    15293
3205     4678
3206     2588
3207     3979
3209     9570
3210     3311
3211     9166
3212     2365
3213     8579
3214     9413
3215     2234
3216      192
3217      345
3220     2728
3225     4515
3267     5935
3268      829
3269      857
3270     3847
3271       60
3272     3392
3273     1758
3274       29
3275     2766
3276     4152
3277       34
3278     3386
3279     2735
3280      176
3281      558
3426        1
Name: Trip Duration, dtype: int64

Some stations have very few started trips (station 3426 has only one). The possibility that the end-only stations were simply not used as starting points therefore can't be ruled out. This will be investigated further with geographic data.

# Normalizing and enriching the Dataset
***

##  Stations Table
The stations table will hold the station data from the original dataset plus a flag indicating whether each station is used as a start station. It will also include reverse-geocoded information: city, neighbourhood and postcode.

### Table Creation

In [13]:
start_station_df = bike_data_df[['Start Station ID', 'Start Station Name', 'Start Station Latitude', 'Start Station Longitude']]
end_station_df = bike_data_df[['End Station ID', 'End Station Name', 'End Station Latitude', 'End Station Longitude']]
start_station_df = start_station_df.rename(columns = {'Start Station ID': 'station_id',
                                                      'Start Station Name': 'station_name',
                                                      'Start Station Latitude': 'station_latitude',
                                                      'Start Station Longitude': 'station_longitude'})
end_station_df = end_station_df.rename(columns = {'End Station ID': 'station_id',
                                                  'End Station Name': 'station_name',
                                                  'End Station Latitude': 'station_latitude',
                                                  'End Station Longitude': 'station_longitude'})

stations_concat = pd.concat([start_station_df, end_station_df])
stations = stations_concat.drop_duplicates().reset_index(drop = True)

# Flag stations that appear at least once as a start station
stations['start_station'] = stations['station_id'].isin(start_stations_unique)

stations.head(150)

Unnamed: 0,station_id,station_name,station_latitude,station_longitude,start_station
0,3186,Grove St PATH,40.719586,-74.043117,True
1,3209,Brunswick St,40.724176,-74.050656,True
2,3195,Sip Ave,40.730743,-74.063784,True
3,3211,Newark Ave,40.721525,-74.046305,True
4,3187,Warren St,40.721124,-74.038051,True
...,...,...,...,...,...
97,2004,6 Ave & Broome St,40.724399,-74.004704,False
98,393,E 5 St & Avenue C,40.722992,-73.979955,False
99,401,Allen St & Rivington St,40.720196,-73.989978,False
100,376,John St & William St,40.708621,-74.007222,False
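
Because `drop_duplicates()` compares whole rows, a station recorded with two slightly different names or coordinates would survive as two rows, and a later merge on `station_id` would then duplicate rides. The sketch below shows how that could be checked. The second row for station 3186 is made up for illustration and is not taken from the data.

```python
import pandas as pd

# Toy table; the second 3186 row has a deliberately altered latitude
stations_sample = pd.DataFrame({
    'station_id': [3186, 3186, 3209],
    'station_name': ['Grove St PATH', 'Grove St PATH', 'Brunswick St'],
    'station_latitude': [40.719586, 40.719600, 40.724176],
    'station_longitude': [-74.043117, -74.043117, -74.050656],
})

# Whole-row comparison: both 3186 rows are kept
deduplicated = stations_sample.drop_duplicates()

# IDs that still occur more than once after de-duplication
is_repeated = deduplicated['station_id'].duplicated(keep=False)
repeated_ids = deduplicated.loc[is_repeated, 'station_id'].unique()

print(deduplicated['station_id'].is_unique)
print(repeated_ids.tolist())
```

If the check finds repeated IDs in the real table, one option would be to keep a single row per ID before merging.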


### Geocoding the latitude and longitude info

The OpenCage API (https://opencagedata.com/) is used to reverse-geocode each station's coordinates.

In [14]:
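import os

# Read the API key from the environment rather than hard-coding it in the notebook
# (the variable name OPENCAGE_API_KEY is an assumption; use whatever name you export)
key = os.environ['OPENCAGE_API_KEY']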
geocoder = OpenCageGeocode(key)

station_latitude_lst = stations['station_latitude'].tolist()
station_longitude_lst = stations['station_longitude'].tolist()

station_city_lst = []
station_neighbourhood_lst = []
station_postcode_lst = []

for latitude, longitude in zip(station_latitude_lst, station_longitude_lst):
    # One API call per station; the response is reused for all three fields
    components = geocoder.reverse_geocode(latitude, longitude)[0]['components']

    station_city_lst.append(components['city'])
    station_postcode_lst.append(components['postcode'])

    # Not every location has a neighbourhood component
    station_neighbourhood_lst.append(components.get('neighbourhood', ''))
        
stations['city'] = station_city_lst
stations['neighbourhood'] = station_neighbourhood_lst
stations['postcode'] = station_postcode_lst

stations.head(5)

Unnamed: 0,station_id,station_name,station_latitude,station_longitude,start_station,city,neighbourhood,postcode
0,3186,Grove St PATH,40.719586,-74.043117,True,Jersey City,Downtown Jersey City,7302
1,3209,Brunswick St,40.724176,-74.050656,True,Jersey City,,7302
2,3195,Sip Ave,40.730743,-74.063784,True,Jersey City,Indian Square,7306
3,3211,Newark Ave,40.721525,-74.046305,True,Jersey City,,7302
4,3187,Warren St,40.721124,-74.038051,True,Jersey City,Newport,7302


There is a lot of missing data in the neighbourhood column. The dataset can't be analysed reliably along this dimension, but the information can still be useful.
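
Missing components can be handled uniformly with `dict.get`. The sketch below works on a hand-written response in the shape the geocoding cell relies on (a list of results, each with a `components` dictionary), so it runs without an API key. `extract_location` is a hypothetical helper and the response content is illustrative.

```python
# Hand-written stand-in for a reverse-geocoding response (no neighbourhood)
mock_results = [{'components': {'city': 'Jersey City', 'postcode': '07302'}}]


def extract_location(results):
    """Return (city, neighbourhood, postcode), with '' for absent fields."""
    # An empty result list is treated as "no components at all"
    components = results[0]['components'] if results else {}
    return (
        components.get('city', ''),
        components.get('neighbourhood', ''),
        components.get('postcode', ''),
    )


print(extract_location(mock_results))
print(extract_location([]))
```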

# Creating the Rides Table
***
Some columns of the main rides table are adjusted and some are added to speed up the subsequent analysis of the dataset. A new column, geo_distance_km, holds the geodesic distance between the start and end stations of each ride.

In [15]:
bike_data_df.head(5)
rides_df = bike_data_df[['Start Time', 'Stop Time', 'Start Station ID', 'End Station ID', 'Bike ID', 'User Type', 'Birth Year', 'Gender']]
rides_df = rides_df.rename(columns = { 'Start Time' : 'start_time', 'Stop Time' : 'stop_time',
                                       'Start Station ID' : 'start_station_id',
                                       'End Station ID' : 'end_station_id',
                                       'Bike ID' : 'bike_id', 'User Type' : 'user_type',
                                       'Birth Year' : 'birth_year', 'Gender' : 'gender'})

rides_df['ride_id'] = 'NaN'
rides_df['date'] = 'NaN'
rides_df['day'] = 'NaN'
rides_df.head(5)

Unnamed: 0,start_time,stop_time,start_station_id,end_station_id,bike_id,user_type,birth_year,gender,ride_id,date,day
0,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,24647,Subscriber,1964.0,2,,,
1,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,24605,Subscriber,1962.0,1,,,
2,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,24689,Subscriber,1962.0,2,,,
3,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,24693,Subscriber,1984.0,1,,,
4,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,24573,Customer,,0,,,


## Date column

In [16]:
split_start_date = rides_df['start_time'].str.split(" ")
rides_df['date'] = split_start_date.str.get(0)

rides_df.head(5)

Unnamed: 0,start_time,stop_time,start_station_id,end_station_id,bike_id,user_type,birth_year,gender,ride_id,date,day
0,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,24647,Subscriber,1964.0,2,,2016-01-01,
1,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,24605,Subscriber,1962.0,1,,2016-01-01,
2,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,24689,Subscriber,1962.0,2,,2016-01-01,
3,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,24693,Subscriber,1984.0,1,,2016-01-01,
4,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,24573,Customer,,0,,2016-01-01,


Note that the date column is derived from start_time only; a ride that ends after midnight is still assigned to the day it started.

## ride_id column

In [17]:
# Sequential ride identifier: 0, 1, 2, ... in the current row order
rides_df['ride_id'] = np.arange(len(rides_df))
rides_df.head(5)

Unnamed: 0,start_time,stop_time,start_station_id,end_station_id,bike_id,user_type,birth_year,gender,ride_id,date,day
0,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,24647,Subscriber,1964.0,2,0,2016-01-01,
1,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,24605,Subscriber,1962.0,1,1,2016-01-01,
2,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,24689,Subscriber,1962.0,2,2,2016-01-01,
3,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,24693,Subscriber,1984.0,1,3,2016-01-01,
4,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,24573,Customer,,0,4,2016-01-01,


## Gender column

In [18]:
# Replace the numeric codes with labels; unknown (0) becomes an empty string
rides_df['gender'] = rides_df['gender'].map({0: '', 1: 'Male', 2: 'Female'})
rides_df.head(5)

Unnamed: 0,start_time,stop_time,start_station_id,end_station_id,bike_id,user_type,birth_year,gender,ride_id,date,day
0,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,24647,Subscriber,1964.0,Female,0,2016-01-01,
1,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,24605,Subscriber,1962.0,Male,1,2016-01-01,
2,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,24689,Subscriber,1962.0,Female,2,2016-01-01,
3,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,24693,Subscriber,1984.0,Male,3,2016-01-01,
4,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,24573,Customer,,,4,2016-01-01,


## Day column

In [19]:
print(rides_df['date'].nunique())

362


Some days appear to be missing: 2016 was a leap year with 366 days, but only 362 distinct dates are present. Inspection of the data shows that the missing dates fall in January. This is a small gap and will introduce insignificant bias to the dataset.
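
The missing dates can be identified by comparing the observed dates against a complete calendar. This is a minimal sketch on a toy series standing in for `rides_df['date']`; for the real data the range would run from 2016-01-01 to 2016-12-31.

```python
import pandas as pd

# Toy stand-in for the date column (3 and 4 January are absent)
dates = pd.Series(['2016-01-01', '2016-01-01', '2016-01-02', '2016-01-05'])

# Distinct dates that actually occur
observed = pd.DatetimeIndex(pd.to_datetime(dates).unique())

# Every calendar day in the period of interest
full_range = pd.date_range('2016-01-01', '2016-01-05', freq='D')

# Calendar days with no rides
missing = full_range.difference(observed)
print(missing.strftime('%Y-%m-%d').tolist())

# 2016 is a leap year, so a full calendar has 366 days
days_in_2016 = len(pd.date_range('2016-01-01', '2016-12-31', freq='D'))
print(days_in_2016)
```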

In [20]:
# Vectorised weekday lookup (e.g. 'Friday') instead of a per-row Python loop
rides_df['day'] = pd.to_datetime(rides_df['date']).dt.day_name()

In [21]:
rides_df.head(5)

Unnamed: 0,start_time,stop_time,start_station_id,end_station_id,bike_id,user_type,birth_year,gender,ride_id,date,day
0,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,24647,Subscriber,1964.0,Female,0,2016-01-01,Friday
1,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,24605,Subscriber,1962.0,Male,1,2016-01-01,Friday
2,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,24689,Subscriber,1962.0,Female,2,2016-01-01,Friday
3,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,24693,Subscriber,1984.0,Male,3,2016-01-01,Friday
4,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,24573,Customer,,,4,2016-01-01,Friday


## Adding the distance column

In the following cells, the distance between the start and end stations is calculated. 

In [22]:
merged_df = pd.merge(rides_df, stations[['station_id', 'station_latitude', 'station_longitude']], left_on = 'start_station_id', right_on = 'station_id')
merged_df = merged_df.rename(columns = {'station_latitude' : 'station_latitude_start', 'station_longitude' : 'station_longitude_start'})
merged_df.drop('station_id', axis = 1, inplace = True)
merged_df = pd.merge(merged_df, stations[['station_id', 'station_latitude', 'station_longitude']], left_on = 'end_station_id', right_on = 'station_id')
merged_df = merged_df.rename(columns = {'station_latitude' : 'station_latitude_end', 'station_longitude' : 'station_longitude_end'})
merged_df.drop('station_id', axis = 1, inplace = True)

def calculate_distance(row):
    start_coordinates = (row['station_latitude_start'], row['station_longitude_start'])
    end_coordinates = (row['station_latitude_end'], row['station_longitude_end'])
    return (geodesic(start_coordinates, end_coordinates).kilometers)

merged_df['distance'] = merged_df.apply(calculate_distance, axis = 1)
merged_df['distance'] = merged_df['distance'].round(2)
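
For readers without geopy, the distance can be approximated with the haversine formula. Note that this is a different technique from the `geodesic` call above: it assumes a spherical Earth (mean radius 6371.0088 km is the assumption used here), so results can differ slightly from the ellipsoidal geodesic distance. `haversine_km` is a hypothetical helper, shown with the coordinates of Grove St PATH and Brunswick St from the table.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(start, end):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


grove_st = (40.719586, -74.043117)
brunswick_st = (40.724176, -74.050656)

distance_km = haversine_km(grove_st, brunswick_st)
print(distance_km)
```

For this pair the result should land close to the geodesic value of 0.82 km shown in the rides table further down.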


In [23]:
merged_df = merged_df.rename(columns = {'distance' : 'geo_distance_km'})
columns_to_drop = ['station_latitude_start', 'station_longitude_start', 'station_latitude_end', 'station_longitude_end']
merged_df.drop(columns = columns_to_drop, inplace = True)


## Cosmetic adjustments and formatting of the final table

In [24]:
merged_df = merged_df.sort_values(by = 'start_time', ascending = True).reset_index(drop = True)
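# .copy() makes rides_df an independent frame, so later column assignments don't act on a slice of merged_df
rides_df = merged_df[['ride_id', 'date', 'day', 'start_time', 'stop_time', 'start_station_id', 'end_station_id', 'geo_distance_km', 'bike_id', 'user_type', 'birth_year', 'gender']].copy()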
rides_df.head(5)

Unnamed: 0,ride_id,date,day,start_time,stop_time,start_station_id,end_station_id,geo_distance_km,bike_id,user_type,birth_year,gender
0,0,2016-01-01,Friday,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,0.82,24647,Subscriber,1964.0,Female
1,1,2016-01-01,Friday,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,0.41,24605,Subscriber,1962.0,Male
2,2,2016-01-01,Friday,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,0.41,24689,Subscriber,1962.0,Female
3,3,2016-01-01,Friday,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,0.66,24693,Subscriber,1984.0,Male
4,4,2016-01-01,Friday,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,1.67,24573,Customer,,


## Converting birth_year column from float to integer

In [25]:
rides_df['birth_year'] = rides_df['birth_year'].fillna(0)
rides_df['birth_year'] = rides_df['birth_year'].replace('', 0)
rides_df['birth_year'] = rides_df['birth_year'].astype(int)
rides_df['birth_year'] = rides_df['birth_year'].replace(0, '')

This is done to satisfy the integer constraint on this column in the future database.
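
A similar result could likely be reached with pandas' nullable integer dtype `'Int64'`, which stores whole numbers alongside missing values without mixing integers and strings in one column. The sketch below uses toy data and writes the CSV to an in-memory buffer to show how the missing year is exported.

```python
import io

import pandas as pd

# Toy rides with one missing birth year (hypothetical values)
rides_sample = pd.DataFrame({
    'ride_id': [0, 1, 2],
    'birth_year': [1964.0, None, 1984.0],
})

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
rides_sample['birth_year'] = rides_sample['birth_year'].astype('Int64')

# Missing values are written as empty fields by default
buffer = io.StringIO()
rides_sample.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```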

# Exporting the DataFrame
***

In [26]:
rides_df.to_csv('C:/Users/Dell/rides.csv', index=False)
stations.to_csv('C:/Users/Dell/stations.csv', index=False)

# Final Inspection
***

In [27]:
rides_df.head(5)

Unnamed: 0,ride_id,date,day,start_time,stop_time,start_station_id,end_station_id,geo_distance_km,bike_id,user_type,birth_year,gender
0,0,2016-01-01,Friday,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,3209,0.82,24647,Subscriber,1964.0,Female
1,1,2016-01-01,Friday,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,3213,0.41,24605,Subscriber,1962.0,Male
2,2,2016-01-01,Friday,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,3213,0.41,24689,Subscriber,1962.0,Female
3,3,2016-01-01,Friday,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,3203,0.66,24693,Subscriber,1984.0,Male
4,4,2016-01-01,Friday,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,3210,1.67,24573,Customer,,


In [28]:
stations.head(5)

Unnamed: 0,station_id,station_name,station_latitude,station_longitude,start_station,city,neighbourhood,postcode
0,3186,Grove St PATH,40.719586,-74.043117,True,Jersey City,Downtown Jersey City,7302
1,3209,Brunswick St,40.724176,-74.050656,True,Jersey City,,7302
2,3195,Sip Ave,40.730743,-74.063784,True,Jersey City,Indian Square,7306
3,3211,Newark Ave,40.721525,-74.046305,True,Jersey City,,7302
4,3187,Warren St,40.721124,-74.038051,True,Jersey City,Newport,7302
