## Citi Bike NYC Data Engineering and Analysis Project

In this project I'll explore, clean, and merge Citi Bike ridership and NOAA weather data to produce a PostgreSQL database with analytics-ready views. 

### Scenario

The scenario is as follows:
A bike rental company has asked me to create a database to help their analysts understand the effects of weather on bike rentals. I've been given a year of bike rental data from the company and I'll source weather data from the government. 

### Project Objectives

- Use Jupyter notebooks and pandas to explore, clean, and transform datasets
- Design and implement a relational PostgreSQL database
- Use SQL to develop analytics-ready database views

Let's start with the analysis.

In [2]:
# Importing modules

import pandas as pd
import numpy as np
from glob import glob

# Reading the Citi Bike 2016 data files (sorted so the monthly files load in a stable order)
citi_bike_files = sorted(glob("./data/JC-2016*.csv"))

# Creating dataframes
ride = pd.concat((pd.read_csv(file) for file in citi_bike_files), ignore_index=True)
weather = pd.read_csv("./data/newark_airport_2016.csv")

Inspecting ride data

In [3]:
ride.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247584 entries, 0 to 247583
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Trip Duration            247584 non-null  int64  
 1   Start Time               247584 non-null  object 
 2   Stop Time                247584 non-null  object 
 3   Start Station ID         247584 non-null  int64  
 4   Start Station Name       247584 non-null  object 
 5   Start Station Latitude   247584 non-null  float64
 6   Start Station Longitude  247584 non-null  float64
 7   End Station ID           247584 non-null  int64  
 8   End Station Name         247584 non-null  object 
 9   End Station Latitude     247584 non-null  float64
 10  End Station Longitude    247584 non-null  float64
 11  Bike ID                  247584 non-null  int64  
 12  User Type                247204 non-null  object 
 13  Birth Year               228585 non-null  float64
 14  Gender                   247584 non-null  int64  

In [4]:
ride.head(10)

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
0,362,2016-01-01 00:02:52,2016-01-01 00:08:54,3186,Grove St PATH,40.719586,-74.043117,3209,Brunswick St,40.724176,-74.050656,24647,Subscriber,1964.0,2
1,200,2016-01-01 00:18:22,2016-01-01 00:21:42,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24605,Subscriber,1962.0,1
2,202,2016-01-01 00:18:25,2016-01-01 00:21:47,3186,Grove St PATH,40.719586,-74.043117,3213,Van Vorst Park,40.718489,-74.047727,24689,Subscriber,1962.0,2
3,248,2016-01-01 00:23:13,2016-01-01 00:27:21,3209,Brunswick St,40.724176,-74.050656,3203,Hamilton Park,40.727596,-74.044247,24693,Subscriber,1984.0,1
4,903,2016-01-01 01:03:20,2016-01-01 01:18:24,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24573,Customer,,0
5,883,2016-01-01 01:03:28,2016-01-01 01:18:11,3195,Sip Ave,40.730743,-74.063784,3210,Pershing Field,40.742677,-74.051789,24442,Customer,,0
6,445,2016-01-01 01:07:45,2016-01-01 01:15:11,3186,Grove St PATH,40.719586,-74.043117,3203,Hamilton Park,40.727596,-74.044247,24510,Subscriber,1988.0,2
7,192,2016-01-01 01:18:51,2016-01-01 01:22:03,3211,Newark Ave,40.721525,-74.046305,3203,Hamilton Park,40.727596,-74.044247,24625,Subscriber,1980.0,1
8,409,2016-01-01 01:23:44,2016-01-01 01:30:34,3187,Warren St,40.721124,-74.038051,3214,Essex Light Rail,40.712774,-74.036486,24429,Subscriber,1990.0,1
9,285,2016-01-01 01:25:12,2016-01-01 01:29:57,3187,Warren St,40.721124,-74.038051,3214,Essex Light Rail,40.712774,-74.036486,24407,Subscriber,1988.0,2


In [10]:
ride.describe(include="all")

Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender
count,247584.0,247584,247584,247584.0,247584,247584.0,247584.0,247584.0,247584,247584.0,247584.0,247584.0,247204,228585.0,247584.0
unique,,244407,244137,,51,,,,102,,,,2,,
top,,2016-09-28 08:24:23,2016-04-17 17:33:34,,Grove St PATH,,,,Grove St PATH,,,,Subscriber,,
freq,,3,4,,28736,,,,38295,,,,231683,,
mean,885.6305,,,3207.065206,,40.723121,-74.046438,3203.572553,,40.722594,-74.045855,24935.260481,,1979.335276,1.123534
std,35937.98,,,26.955103,,0.008199,0.011211,61.579494,,0.007958,0.011283,748.469712,,9.596809,0.518687
min,61.0,,,3183.0,,40.69264,-74.096937,147.0,,40.692216,-74.096937,14552.0,,1900.0,0.0
25%,248.0,,,3186.0,,40.717732,-74.050656,3186.0,,40.71654,-74.050444,24491.0,,1974.0,1.0
50%,390.0,,,3201.0,,40.721525,-74.044247,3199.0,,40.721124,-74.043117,24609.0,,1981.0,1.0
75%,666.0,,,3211.0,,40.727596,-74.038051,3211.0,,40.727224,-74.036486,24719.0,,1986.0,1.0


In [None]:
# Counting null values per column
ride.isna().sum()

Trip Duration                  0
Start Time                     0
Stop Time                      0
Start Station ID               0
Start Station Name             0
Start Station Latitude         0
Start Station Longitude        0
End Station ID                 0
End Station Name               0
End Station Latitude           0
End Station Longitude          0
Bike ID                        0
User Type                    380
Birth Year                 18999
Gender                         0
dtype: int64

In [52]:
# Counting null values in the rows where User Type is null
ride[ride["User Type"].isna()].isna().sum()

Trip Duration                0
Start Time                   0
Stop Time                    0
Start Station ID             0
Start Station Name           0
Start Station Latitude       0
Start Station Longitude      0
End Station ID               0
End Station Name             0
End Station Latitude         0
End Station Longitude        0
Bike ID                      0
User Type                  380
Birth Year                   0
Gender                       0
dtype: int64

Inspecting the ride data shows that station details (ID, name, latitude, and longitude) are repeated on every trip row. To remove that redundancy, I'll split the dataframe in two: one holding the trips and one holding the stations.

The **Birth Year** column is missing 18999 values and the **User Type** column is missing 380. The check above shows that every row with a missing User Type has a Birth Year, so the two sets of rows do not overlap. In total, 18999 + 380 = 19379 rows have a missing value.
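
The same cross-check can be done in one step. This is a minimal sketch on a small made-up frame (the values are invented, not taken from the Citi Bike files) that confirms the two groups of missing values are disjoint and counts the rows with at least one gap:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `ride`; the values are invented for illustration only.
toy = pd.DataFrame({
    "User Type": ["Subscriber", None, "Customer", "Customer", None],
    "Birth Year": [1964.0, 1980.0, np.nan, np.nan, 1990.0],
})

missing_user = toy["User Type"].isna()
missing_birth = toy["Birth Year"].isna()

# Rows missing both values at once; 0 means the two groups are disjoint.
overlap = int((missing_user & missing_birth).sum())

# Rows with at least one missing value in any column.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(overlap, rows_with_missing)
```

When `overlap` is 0, `rows_with_missing` equals the sum of the per-column counts, which is the arithmetic used above.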

In [58]:
# Extracting trips data (copied so later changes don't touch `ride`)
trip = ride.loc[:, ["Trip Duration", "Start Time", "Stop Time", "Start Station ID", "End Station ID",
                    "Bike ID", "User Type", "Birth Year", "Gender"]].copy()

# Renaming the columns to snake_case for easier handling
old_columns = trip.columns
new_columns = old_columns.str.replace(" ", "_").str.lower()

trip.rename(columns=dict(zip(old_columns, new_columns)), inplace=True)
trip.head(10)


| | trip_duration | start_time | stop_time | start_station_id | end_station_id | bike_id | user_type | birth_year | gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 362 | 2016-01-01 00:02:52 | 2016-01-01 00:08:54 | 3186 | 3209 | 24647 | Subscriber | 1964.0 | 2 |
| 1 | 200 | 2016-01-01 00:18:22 | 2016-01-01 00:21:42 | 3186 | 3213 | 24605 | Subscriber | 1962.0 | 1 |
| 2 | 202 | 2016-01-01 00:18:25 | 2016-01-01 00:21:47 | 3186 | 3213 | 24689 | Subscriber | 1962.0 | 2 |
| 3 | 248 | 2016-01-01 00:23:13 | 2016-01-01 00:27:21 | 3209 | 3203 | 24693 | Subscriber | 1984.0 | 1 |
| 4 | 903 | 2016-01-01 01:03:20 | 2016-01-01 01:18:24 | 3195 | 3210 | 24573 | Customer | NaN | 0 |
| 5 | 883 | 2016-01-01 01:03:28 | 2016-01-01 01:18:11 | 3195 | 3210 | 24442 | Customer | NaN | 0 |
| 6 | 445 | 2016-01-01 01:07:45 | 2016-01-01 01:15:11 | 3186 | 3203 | 24510 | Subscriber | 1988.0 | 2 |
| 7 | 192 | 2016-01-01 01:18:51 | 2016-01-01 01:22:03 | 3211 | 3203 | 24625 | Subscriber | 1980.0 | 1 |
| 8 | 409 | 2016-01-01 01:23:44 | 2016-01-01 01:30:34 | 3187 | 3214 | 24429 | Subscriber | 1990.0 | 1 |
| 9 | 285 | 2016-01-01 01:25:12 | 2016-01-01 01:29:57 | 3187 | 3214 | 24407 | Subscriber | 1988.0 | 2 |
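
`ride.info()` reports the timestamps as `object` and the birth year as `float64`. One possible follow-up, sketched here on a two-row toy frame built from values shown in the table, is to parse the timestamps and store the birth year in pandas' nullable `Int64` type so missing years stay missing. The frame name `toy_trip` and the choice of `Int64` are my assumptions, not steps taken in the notebook so far.

```python
import numpy as np
import pandas as pd

# Two rows copied from the trip table; `toy_trip` is a hypothetical name.
toy_trip = pd.DataFrame({
    "start_time": ["2016-01-01 00:02:52", "2016-01-01 01:03:20"],
    "stop_time": ["2016-01-01 00:08:54", "2016-01-01 01:18:24"],
    "birth_year": [1964.0, np.nan],
})

# Parse the timestamp strings using the format seen in the data.
for col in ["start_time", "stop_time"]:
    toy_trip[col] = pd.to_datetime(toy_trip[col], format="%Y-%m-%d %H:%M:%S")

# Nullable integer type: whole years without the trailing ".0", NaN becomes <NA>.
toy_trip["birth_year"] = toy_trip["birth_year"].astype("Int64")

# Elapsed seconds derived from the timestamps, useful for sanity-checking
# the recorded trip_duration column.
elapsed_seconds = (toy_trip["stop_time"] - toy_trip["start_time"]).dt.total_seconds()
print(toy_trip.dtypes)
print(elapsed_seconds)
```

Comparing `elapsed_seconds` with `trip_duration` on the full data is a cheap consistency check before loading the table into PostgreSQL.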


In [None]:
# Extracting stations' data
start_station = ride.loc[:, ["Start Station ID", "Start Station Name", "Start Station Latitude", "Start Station Longitude"]]
end_station = ride.loc[:, ["End Station ID", "End Station Name", "End Station Latitude", "End Station Longitude"]]

In [59]:
# Renaming the columns
start_station_columns = start_station.columns
end_station_columns = end_station.columns 
station_new_columns = ["station_id", "station_name", "station_latitude", "station_longitude"]

start_station.rename(columns=dict(zip(start_station_columns, station_new_columns)), inplace=True)
end_station.rename(columns=dict(zip(end_station_columns, station_new_columns)), inplace=True)

# Stack the station data and remove any duplicates
station = pd.concat([start_station, end_station]).drop_duplicates().reset_index(drop=True)
station.head(10)

| | station_id | station_name | station_latitude | station_longitude |
|---|---|---|---|---|
| 0 | 3186 | Grove St PATH | 40.719586 | -74.043117 |
| 1 | 3209 | Brunswick St | 40.724176 | -74.050656 |
| 2 | 3195 | Sip Ave | 40.730743 | -74.063784 |
| 3 | 3211 | Newark Ave | 40.721525 | -74.046305 |
| 4 | 3187 | Warren St | 40.721124 | -74.038051 |
| 5 | 3183 | Exchange Place | 40.716247 | -74.033459 |
| 6 | 3213 | Van Vorst Park | 40.718489 | -74.047727 |
| 7 | 3193 | Lincoln Park | 40.724605 | -74.078406 |
| 8 | 3194 | McGinley Square | 40.72534 | -74.067622 |
| 9 | 3202 | Newport PATH | 40.727224 | -74.033759 |
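
Because `drop_duplicates()` compares whole rows, a station whose name or coordinates were recorded slightly differently would still appear more than once. I have not checked whether that happens in the 2016 files, so the sketch below uses an invented frame in which one station has two nearly identical latitudes. It shows how to detect repeated IDs and, as one possible policy, keep the first row per ID so `station_id` can serve as a primary key.

```python
import pandas as pd

# Invented example: the third row repeats station 3186 with a slightly
# different latitude, so a whole-row drop_duplicates() cannot remove it.
toy_station = pd.DataFrame({
    "station_id": [3186, 3209, 3186],
    "station_name": ["Grove St PATH", "Brunswick St", "Grove St PATH"],
    "station_latitude": [40.719586, 40.724176, 40.719587],
    "station_longitude": [-74.043117, -74.050656, -74.043117],
})

deduped = toy_station.drop_duplicates().reset_index(drop=True)

# IDs that still occur more than once after whole-row de-duplication.
repeated_ids = deduped.loc[
    deduped["station_id"].duplicated(keep=False), "station_id"
].unique()

# One possible policy (an assumption): keep the first row seen for each ID.
one_per_id = deduped.drop_duplicates(subset="station_id", keep="first")

print(repeated_ids)
print(one_per_id)
```

If `repeated_ids` comes back empty on the real data, the station table is already keyed by `station_id` and the last step is unnecessary.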


Inspecting weather data

In [7]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 16 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   STATION  366 non-null    object 
 1   NAME     366 non-null    object 
 2   DATE     366 non-null    object 
 3   AWND     366 non-null    float64
 4   PGTM     0 non-null      float64
 5   PRCP     366 non-null    float64
 6   SNOW     366 non-null    float64
 7   SNWD     366 non-null    float64
 8   TAVG     366 non-null    int64  
 9   TMAX     366 non-null    int64  
 10  TMIN     366 non-null    int64  
 11  TSUN     0 non-null      float64
 12  WDF2     366 non-null    int64  
 13  WDF5     364 non-null    float64
 14  WSF2     366 non-null    float64
 15  WSF5     364 non-null    float64
dtypes: float64(9), int64(4), object(3)
memory usage: 45.9+ KB
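
The summary shows that `PGTM` and `TSUN` contain no values at all, while `WDF5` and `WSF5` are missing only two each. A minimal sketch of one way to handle the fully empty columns follows. It runs on a toy frame, the missing `WSF5` value is placed there purely for illustration, and dropping the columns is my assumption about a reasonable cleaning step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `weather`: two columns are entirely empty and one
# column has a single gap (the gap is invented for illustration).
toy_weather = pd.DataFrame({
    "DATE": ["2016-01-01", "2016-01-02"],
    "AWND": [12.75, 9.40],
    "PGTM": [np.nan, np.nan],
    "TSUN": [np.nan, np.nan],
    "WSF5": [35.1, np.nan],
})

# Names of columns in which every value is missing.
empty_columns = toy_weather.columns[toy_weather.isna().all()].tolist()

# Drop only the fully empty columns; partially filled ones are kept.
trimmed = toy_weather.dropna(axis="columns", how="all")

print(empty_columns)
print(trimmed.columns.tolist())
```

Using `how="all"` keeps partially populated columns such as `WSF5`, which can then be handled separately.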


In [8]:
weather.head(10)

Unnamed: 0,STATION,NAME,DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TSUN,WDF2,WDF5,WSF2,WSF5
0,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-01,12.75,,0.0,0.0,0.0,41,43,34,,270,280.0,25.9,35.1
1,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-02,9.4,,0.0,0.0,0.0,36,42,30,,260,260.0,21.0,25.1
2,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-03,10.29,,0.0,0.0,0.0,37,47,28,,270,250.0,23.9,30.0
3,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-04,17.22,,0.0,0.0,0.0,32,35,14,,330,330.0,25.9,33.1
4,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-05,9.84,,0.0,0.0,0.0,19,31,10,,360,350.0,25.1,31.1
5,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-06,5.37,,0.0,0.0,0.0,28,42,15,,230,250.0,12.1,16.1
6,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-07,3.36,,0.0,0.0,0.0,35,46,24,,20,360.0,8.9,10.1
7,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-08,8.05,,0.0,0.0,0.0,38,45,31,,20,30.0,14.1,16.1
8,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-09,6.71,,0.01,0.0,0.0,44,48,38,,60,70.0,13.0,17.0
9,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-10,15.43,,1.77,0.0,0.0,53,65,39,,260,270.0,36.0,42.9


In [9]:
weather.describe(include="all")

Unnamed: 0,STATION,NAME,DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TSUN,WDF2,WDF5,WSF2,WSF5
count,366,366,366,366.0,0.0,366.0,366.0,366.0,366.0,366.0,366.0,0.0,366.0,364.0,366.0,364.0
unique,1,1,366,,,,,,,,,,,,,
top,USW00014734,"NEWARK LIBERTY INTERNATIONAL AIRPORT, NJ US",2016-01-01,,,,,,,,,,,,,
freq,366,366,1,,,,,,,,,,,,,
mean,,,,9.429973,,0.104945,0.098087,0.342623,57.196721,65.991803,48.459016,,217.84153,228.269231,20.484426,26.801648
std,,,,3.748174,,0.307496,1.276498,2.07851,17.466981,18.606301,17.13579,,102.548282,97.415777,6.84839,8.88261
min,,,,2.46,,0.0,0.0,0.0,8.0,18.0,0.0,,10.0,10.0,6.9,10.1
25%,,,,6.765,,0.0,0.0,0.0,43.0,51.25,35.0,,150.0,150.0,15.0,19.9
50%,,,,8.72,,0.0,0.0,0.0,56.0,66.0,47.0,,240.0,260.0,19.9,25.1
75%,,,,11.41,,0.03,0.0,0.0,74.0,83.0,64.0,,300.0,300.0,23.9,31.1
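
The weather file has one row per day (366 rows for 2016), so relating it to ridership means bringing trips down to daily granularity first. The sketch below illustrates one way to do that on toy frames. The third trip is invented, and the column names `date`, `trips`, and `mean_duration` are hypothetical choices rather than part of the original data.

```python
import pandas as pd

# Toy trips: the first two rows echo the ride data, the third is invented.
toy_trip = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2016-01-01 00:02:52", "2016-01-01 01:03:20", "2016-01-10 09:15:00"]
    ),
    "trip_duration": [362, 903, 500],
})

# Toy weather: two days with values taken from the weather table.
toy_weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2016-01-01", "2016-01-10"]),
    "TAVG": [41, 53],
    "PRCP": [0.0, 1.77],
})

# Truncate each start timestamp to midnight so it matches the daily DATE.
toy_trip["date"] = toy_trip["start_time"].dt.normalize()

# Aggregate trips per day.
daily = toy_trip.groupby("date", as_index=False).agg(
    trips=("trip_duration", "size"),
    mean_duration=("trip_duration", "mean"),
)

# Attach the weather; validate guards against duplicated dates on either side.
daily_weather = daily.merge(
    toy_weather, left_on="date", right_on="DATE", how="left", validate="one_to_one"
)

print(daily_weather)
```

A left join keeps every day that has trips even if a weather row were missing, which makes gaps visible as nulls instead of silently dropping days.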
