# To eBike or Not to eBike?

## Combine .csv Files to Export to .parquet

<a id=toc></a>
## Table of Contents

<ul>
    <li><a href=#01-import-packages>Import Packages</a></li>
    <li><a href=#02-load-dataset>Load Datasets</a></li>
    <li><a href=#03-drop-columns>Drop Columns</a></li>
    <li><a href=#04-convert-pandas>Convert DataFrame Format</a></li>
    <li><a href=#05-write-parquet>Write .parquet File</a></li>
</ul>

<a id=01-import-packages></a>
## Import Packages

Import pandas and NumPy for data handling, and pyarrow for writing Parquet files.

In [1]:
# Numerical and data
import pandas as pd
import numpy as np

# Apache Parquet files (to save space)
import pyarrow as pa
import pyarrow.parquet as pq

<a href=#toc>Back to the top</a>

<a id=02-load-dataset></a>
## Load Datasets

Load each monthly CSV file (June 2021 through May 2022) into its own DataFrame, then concatenate them into a single DataFrame.

In [2]:
CB_01 = pd.read_csv('CitiBike_data/Raw/New_Format/202106-citibike-tripdata.csv', low_memory=False)
CB_02 = pd.read_csv('CitiBike_data/Raw/New_Format/202107-citibike-tripdata.csv', low_memory=False)
CB_03 = pd.read_csv('CitiBike_data/Raw/New_Format/202108-citibike-tripdata.csv', low_memory=False)
CB_04 = pd.read_csv('CitiBike_data/Raw/New_Format/202109-citibike-tripdata.csv', low_memory=False)
CB_05 = pd.read_csv('CitiBike_data/Raw/New_Format/202110-citibike-tripdata.csv', low_memory=False)
CB_06 = pd.read_csv('CitiBike_data/Raw/New_Format/202111-citibike-tripdata.csv', low_memory=False)
CB_07 = pd.read_csv('CitiBike_data/Raw/New_Format/202112-citibike-tripdata.csv', low_memory=False)
CB_08 = pd.read_csv('CitiBike_data/Raw/New_Format/202201-citibike-tripdata.csv', low_memory=False)
CB_09 = pd.read_csv('CitiBike_data/Raw/New_Format/202202-citibike-tripdata.csv', low_memory=False)
CB_10 = pd.read_csv('CitiBike_data/Raw/New_Format/202203-citibike-tripdata.csv', low_memory=False)
CB_11 = pd.read_csv('CitiBike_data/Raw/New_Format/202204-citibike-tripdata.csv', low_memory=False)
CB_12 = pd.read_csv('CitiBike_data/Raw/New_Format/202205-citibike-tripdata.csv', low_memory=False)

months = [CB_01, CB_02, CB_03, CB_04, CB_05, CB_06, CB_07, CB_08, CB_09, CB_10, CB_11, CB_12]

CB_Data = pd.concat(months, ignore_index=True, sort=False)
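
The twelve near-identical `read_csv` calls can also be written as a loop. The following is a minimal sketch rather than the notebook's own code: the helper names `monthly_paths` and `load_months` are hypothetical, and it assumes every file follows the `YYYYMM-citibike-tripdata.csv` naming pattern seen above.

```python
import pandas as pd

# Directory taken from the paths used above.
RAW_DIR = 'CitiBike_data/Raw/New_Format'


def monthly_paths(start='2021-06', end='2022-05', raw_dir=RAW_DIR):
    """Build one CSV path per month from start to end, inclusive (hypothetical helper)."""
    months = pd.period_range(start=start, end=end, freq='M')
    return [f'{raw_dir}/{m.strftime("%Y%m")}-citibike-tripdata.csv' for m in months]


def load_months(paths):
    """Read each CSV into its own DataFrame and stack them row-wise (hypothetical helper)."""
    frames = [pd.read_csv(path, low_memory=False) for path in paths]
    return pd.concat(frames, ignore_index=True, sort=False)
```

Assuming all twelve files are present, `load_months(monthly_paths())` should yield the same combined frame as `CB_Data`, and changing the date range only requires new `start` and `end` values.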

<a href=#toc>Back to the top</a>

<a id=03-drop-columns></a>
## Drop Columns

Drop the identifier columns (`ride_id`, `start_station_id`, `end_station_id`), which are not relevant here.

In [7]:
drop_col = ['ride_id', 'start_station_id', 'end_station_id']
CB_Data_clean = CB_Data.drop(columns=drop_col)
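
To illustrate what the drop does, here is a small sketch on a toy frame. The rows are invented, and `rideable_type` is an assumed column name used only for illustration; only the three dropped names come from the notebook.

```python
import pandas as pd

# Toy stand-in for CB_Data; values are invented and the non-dropped
# column name is assumed for illustration only.
toy = pd.DataFrame({
    'ride_id': ['A1', 'B2'],
    'rideable_type': ['classic_bike', 'electric_bike'],
    'start_station_id': ['1001', '1002'],
    'end_station_id': ['2001', '2002'],
})

drop_col = ['ride_id', 'start_station_id', 'end_station_id']

# drop() returns a new DataFrame; the original is left unchanged.
toy_clean = toy.drop(columns=drop_col)

print(list(toy_clean.columns))  # ['rideable_type']
```

Because `drop` returns a copy, the original frame stays available if the identifiers are needed later.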

<a href=#toc>Back to the top</a>

<a id=04-convert-pandas></a>
## Convert DataFrame Format

Convert the combined pandas DataFrame into a `pyarrow.Table`.

In [8]:
CB_Data_arrow = pa.Table.from_pandas(CB_Data_clean)

<a href=#toc>Back to the top</a>

<a id=05-write-parquet></a>
## Write .parquet File

Write the combined table to a single Parquet file covering June 2021 through May 2022.

In [9]:
pq.write_table(CB_Data_arrow, 'CitiBike_data/202106-202205-citibike-tripdata.parquet')

<a href=#toc>Back to the top</a>