<a href="https://colab.research.google.com/github/ongks-useR/united_states_bike_share/blob/main/data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Bike Sharing in New York, U.S 2019***

This is part of my journey towards mastering Python for Data Science.

I use United States bike sharing data (particularly New York City) for data cleaning; please visit [bikeshare.com](https://www.bikeshare.com/data/) for data.

In [None]:
# install haversine to calculate the distance between geographic coordinates
# https://pypi.org/project/haversine/

%pip install haversine

## Python Library

In [1]:
# data analysis
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sb

# functional loop
from functools import reduce

# Operating System
from glob import glob
from os import path
from pathlib import Path

# calculate distance between geo coordinates
from haversine import haversine_vector, Unit

## ***Lesson 01: Import CSV, efficiently***

After some trial and error, I found several ways to import CSV data, specifically from multiple CSV files. Most courses teach the *Pandas read_csv()* method on a single file, and that file is usually small for teaching purposes.

In practice, however, an analyst may have to combine multiple CSV files, each of which can be large. In this case, each monthly bike sharing file is 100 MB+, and we need to analyze 12 months of data.

In [2]:
# list files available in the 'New York Bike Share' directory

!ls /content/drive/MyDrive/'New York Bike Share'

201901-citibike-tripdata.csv  201908-citibike-tripdata.csv
201902-citibike-tripdata.csv  201909-citibike-tripdata.csv
201903-citibike-tripdata.csv  201910-citibike-tripdata.csv
201904-citibike-tripdata.csv  201911-citibike-tripdata.csv
201905-citibike-tripdata.csv  201912-citibike-tripdata.csv
201906-citibike-tripdata.csv  new_york_bikeshare_2019.csv
201907-citibike-tripdata.csv


In [25]:
'''
For demo only...

Method 1: Inefficient way ~ ~

Result: long, repetitive code that is error-prone.
Note: imagine having 30 files in the same directory.

'''

file_01 = pd.read_csv('/content/drive/MyDrive/New York Bike Share/201901-citibike-tripdata.csv')
file_02 = pd.read_csv('/content/drive/MyDrive/New York Bike Share/201902-citibike-tripdata.csv')
file_03 = pd.read_csv('/content/drive/MyDrive/New York Bike Share/201903-citibike-tripdata.csv')
file_04 = pd.read_csv('/content/drive/MyDrive/New York Bike Share/201904-citibike-tripdata.csv')
file_05 = pd.read_csv('/content/drive/MyDrive/New York Bike Share/201905-citibike-tripdata.csv')
file_06 = pd.read_csv('/content/drive/MyDrive/New York Bike Share/201906-citibike-tripdata.csv')

# combine every file that was read above, not just the first three
df = pd.concat([file_01, file_02, file_03, file_04, file_05, file_06], ignore_index=True)

In [3]:
'''
Method 2: Efficient way ~ ~

Step 1: Getting file names within directory that are required for analysis

'''

# current path of working directory for jupyter notebook and CSV files in Google Colab
file_dir = '/content/drive/MyDrive/New York Bike Share'

# get the matching file names in the directory and sort them (glob itself does not guarantee any order)
file_names = sorted(glob(path.join(file_dir, '*-citibike-tripdata.csv')))
file_names

['/content/drive/MyDrive/New York Bike Share/201901-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201902-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201903-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201904-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201905-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201906-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201907-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201908-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201909-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201910-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201911-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201912-citibike-tripdata.csv']
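
The same file listing can also be sketched with `pathlib`, which is already imported at the top of the notebook. This is only an illustration under my own assumptions: `list_monthly_files` is a hypothetical helper name, and the temporary directory with empty files stands in for the real Google Drive folder.

```python
import tempfile
from pathlib import Path


def list_monthly_files(directory, pattern="*-citibike-tripdata.csv"):
    """Return the paths matching `pattern`, sorted by name (hypothetical helper)."""
    return sorted(Path(directory).glob(pattern))


# build a throwaway directory with empty files that mimic the naming scheme
with tempfile.TemporaryDirectory() as tmp:
    for name in ["201903-citibike-tripdata.csv",
                 "201901-citibike-tripdata.csv",
                 "new_york_bikeshare_2019.csv"]:
        (Path(tmp) / name).touch()

    # only the monthly files match the pattern; the combined file is skipped
    demo_files = [p.name for p in list_monthly_files(tmp)]

print(demo_files)
```

Sorting explicitly matters because the order returned by a glob depends on the file system, and the monthly files should be read in calendar order.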

In [4]:
# import only the first 100 rows for a quick view of column names and data types

pd.read_csv('/content/drive/MyDrive/New York Bike Share/201901-citibike-tripdata.csv', nrows=100).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tripduration             100 non-null    int64  
 1   starttime                100 non-null    object 
 2   stoptime                 100 non-null    object 
 3   start station id         100 non-null    int64  
 4   start station name       100 non-null    object 
 5   start station latitude   100 non-null    float64
 6   start station longitude  100 non-null    float64
 7   end station id           100 non-null    int64  
 8   end station name         100 non-null    object 
 9   end station latitude     100 non-null    float64
 10  end station longitude    100 non-null    float64
 11  bikeid                   100 non-null    int64  
 12  usertype                 100 non-null    object 
 13  birth year               100 non-null    int64  
 14  gender                   10
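
The preview shows that pandas defaults to 64-bit numbers and `object` for text. As a rough sketch of why narrower types are worth declaring, the snippet below compares memory usage before and after downcasting. The data is synthetic and made up for illustration; only the column names are borrowed from the preview.

```python
import numpy as np
import pandas as pd

n = 10_000

# synthetic stand-in for two of the columns in the trip data
wide = pd.DataFrame({
    "tripduration": np.arange(n, dtype=np.int64),
    "usertype": ["Subscriber", "Customer"] * (n // 2),
})

# same values, stored as 32-bit integers and as a categorical
narrow = wide.astype({"tripduration": np.int32, "usertype": "category"})

wide_bytes = wide.memory_usage(deep=True).sum()
narrow_bytes = narrow.memory_usage(deep=True).sum()

print(f"default dtypes:  {wide_bytes:,} bytes")
print(f"narrowed dtypes: {narrow_bytes:,} bytes")
```

The saving on a column with only a few distinct text values is usually the largest, because a categorical stores each distinct value once plus a small integer code per row.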

In [5]:
'''
Method 2: Efficient way ~ ~

Step 2: Define function to create Pandas DataFrame

'''

col_index = [0, 1, 5, 6, 9, 10, 12, 13, 14]

col_name = ['duration', 
            'time_start', 
            'station_latitude_start', 
            'station_longitude_start',
            'station_latitude_end', 
            'station_longitude_end', 
            'user_type', 
            'birth_year', 
            'gender']

col_type = {
    'duration': np.int32,
    'station_latitude_start': np.float32,
    'station_longitude_start': np.float32,
    'station_latitude_end': np.float32,
    'station_longitude_end': np.float32,
    'user_type': 'category',
    'birth_year': 'object',
    'gender': 'category'
}

# user-defined function that creates a chunked reader for one CSV file
def create_df(f, size = 100_000):

    # read the file lazily, `size` rows per chunk (100K by default)
    # the result is an iterable (TextFileReader) that yields DataFrames
    result = pd.read_csv(f, chunksize=size, usecols=col_index, names=col_name, dtype=col_type, parse_dates=['time_start'], header=0)

    return result
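
To see what `chunksize` does without loading a 100 MB file, here is a minimal sketch on a tiny in-memory CSV. The five rows are invented for the demo, and I use a chunk size of 2 instead of 100,000 so the behaviour is visible.

```python
import io

import numpy as np
import pandas as pd

# five made-up rows with a header line, standing in for a monthly file
csv_text = "tripduration,starttime,usertype\n" + "\n".join(
    f"{60 * i},2019-01-01 00:0{i}:00,Subscriber" for i in range(1, 6)
)

reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=2,                      # tiny chunks, for illustration only
    header=0,                         # the first line is the original header...
    names=["duration", "time_start", "user_type"],  # ...replaced by these names
    dtype={"duration": np.int32, "user_type": "category"},
    parse_dates=["time_start"],
)

chunk_sizes = []
chunk_starts = []
for chunk in reader:                  # each chunk is an ordinary DataFrame
    chunk_sizes.append(len(chunk))
    chunk_starts.append(int(chunk.index[0]))

print(chunk_sizes)
print(chunk_starts)
```

Nothing is parsed until the reader is iterated, and the last chunk simply holds whatever rows are left over.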

In [6]:
'''
Method 2: Efficient way ~ ~

Step 3: Use map() to apply the 'create_df' function to each file in 'file_names'

'''

# map() applies the 'create_df' function to each file in file_names
# the result is a list of iterables; each iterable yields DataFrames of up to 100,000 rows

df = list(map(create_df, file_names))
df

[<pandas.io.parsers.TextFileReader at 0x7fbb63678090>,
 <pandas.io.parsers.TextFileReader at 0x7fbb636757d0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb63675890>,
 <pandas.io.parsers.TextFileReader at 0x7fbb636784d0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb636786d0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb636788d0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb63678ad0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb63678cd0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb63678ed0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb63678fd0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb63678dd0>,
 <pandas.io.parsers.TextFileReader at 0x7fbb747a9910>]

In [7]:
'''
Method 2: Efficient way ~ ~

Step 4: Apply a Python list comprehension to get a list of DataFrames

'''

# loop through each iterable and store every dataframe in a list with a list comprehension

df = [chunk for ls in df for chunk in ls]

# let's check one of the dataframe
# each dataframe contains up to 100,000 line items
df[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration                 100000 non-null  int32         
 1   time_start               100000 non-null  datetime64[ns]
 2   station_latitude_start   100000 non-null  float32       
 3   station_longitude_start  100000 non-null  float32       
 4   station_latitude_end     100000 non-null  float32       
 5   station_longitude_end    100000 non-null  float32       
 6   user_type                100000 non-null  category      
 7   birth_year               100000 non-null  object        
 8   gender                   100000 non-null  category      
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(1)
memory usage: 3.6+ MB
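
The nested list comprehension in Step 4 flattens a list of iterables into one list. An equivalent way to write it, shown here as an alternative sketch rather than what the notebook uses, is `itertools.chain.from_iterable`. Plain integers stand in for the DataFrame chunks so the example runs without any files.

```python
from itertools import chain

# three stand-in "readers"; each one yields a few "chunks"
readers = [[1, 2], [3], [4, 5]]

# the same pattern as in Step 4: outer loop over readers, inner loop over chunks
flat_comprehension = [chunk for reader in readers for chunk in reader]

# chain.from_iterable walks the readers one after another
flat_chain = list(chain.from_iterable(readers))

print(flat_comprehension)
print(flat_chain)
```

Both forms consume each reader exactly once, which matters for the real readers because they cannot be iterated a second time.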


In [8]:
# the index of df[1] ranges from 100,000 to 199,999

df[1].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 100000 to 199999
Data columns (total 9 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration                 100000 non-null  int32         
 1   time_start               100000 non-null  datetime64[ns]
 2   station_latitude_start   100000 non-null  float32       
 3   station_longitude_start  100000 non-null  float32       
 4   station_latitude_end     100000 non-null  float32       
 5   station_longitude_end    100000 non-null  float32       
 6   user_type                100000 non-null  category      
 7   birth_year               100000 non-null  object        
 8   gender                   100000 non-null  category      
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(1)
memory usage: 3.6+ MB


In [9]:
'''
Method 2: Efficient way ~ ~

Step 5: use pd.concat() to merge list of dataframes

'''

# pd.concat() >> stack the list of dataframes on top of each other to produce one master dataframe
# note: each dataframe carries its own index range. With 'ignore_index' set to 'True', pd.concat() resets the index after the merge.

df = pd.concat(df, ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20551697 entries, 0 to 20551696
Data columns (total 9 columns):
 #   Column                   Dtype         
---  ------                   -----         
 0   duration                 int32         
 1   time_start               datetime64[ns]
 2   station_latitude_start   float32       
 3   station_longitude_start  float32       
 4   station_latitude_end     float32       
 5   station_longitude_end    float32       
 6   user_type                category      
 7   birth_year               object        
 8   gender                   category      
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(1)
memory usage: 744.8+ MB
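
The effect of `ignore_index` is easier to see on two tiny frames. This is a small sketch with made-up values; the index numbers only mimic the kind of gap that appears when a new file starts.

```python
import pandas as pd

# two small frames whose indexes do not follow on from each other
a = pd.DataFrame({"duration": [10, 20]}, index=[0, 1])
b = pd.DataFrame({"duration": [30, 40]}, index=[100000, 100001])

kept = pd.concat([a, b])                      # original index labels are kept
reset = pd.concat([a, b], ignore_index=True)  # index is rebuilt as 0..n-1

print(kept.index.tolist())
print(reset.index.tolist())
```

Without the reset, the chunks from different files would share index labels, because each file's reader starts counting from 0 again.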
