# Data Preparation

### Ian Heung

In this notebook, I use Pandas to select the data files and combine them into a single dataframe, then store the data on my local MySQL server using the `SQLAlchemy` package.

## Data Sourcing

The data was obtained from an [online database](https://divvy-tripdata.s3.amazonaws.com/index.html) and is made available by Motivate International Inc. under this [license](https://divvybikes.com/data-license-agreement). I downloaded the rideshare data for the past 12 months (August 2023 - July 2024). The code in this notebook also works with data from past or future time frames.

## Data Organization

Once the data has been downloaded, unzip it and save it to a folder. From there, the code below combines the .csv files into a single Pandas dataframe. Because each file name begins with its month in `YYYYMM` format, we use the `datetime` package to build a list of month strings and locate each .csv file.

In [None]:
# if you prefer pip to anaconda
!pip install pandas --quiet
# restart kernel after installation

In [None]:
# import packages
import os
from datetime import datetime
import pandas as pd

In [None]:
# data time frame YYYYMM, change to your needs
start = 202308
end = 202407

start_date = datetime.strptime(str(start), "%Y%m")
end_date = datetime.strptime(str(end), "%Y%m")

dates_array = []

while start_date <= end_date:
    dates_array.append(start_date.strftime("%Y%m"))
    start_date = start_date.replace(month = start_date.month % 12 + 1, year = start_date.year + (start_date.month // 12))

# data range
print(dates_array)
print("Number of datasets: ", len(dates_array))
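
As an alternative to the loop above, the same list of month strings can be built with plain integer arithmetic and wrapped in a reusable function. This is only a sketch: `month_range` is a hypothetical helper name that does not appear elsewhere in this notebook, and it assumes both arguments are integers in `YYYYMM` form.

```python
def month_range(start: int, end: int) -> list:
    """Return every month from start to end (inclusive) as 'YYYYMM' strings."""
    start_year, start_month = divmod(start, 100)
    end_year, end_month = divmod(end, 100)
    for month in (start_month, end_month):
        if not 1 <= month <= 12:
            raise ValueError(f"month must be 01-12, got {month:02d}")
    # Count months since year 0 so the year rollover needs no special case
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    return [f"{i // 12:04d}{i % 12 + 1:02d}" for i in range(first, last + 1)]


# Same time frame as above: August 2023 through July 2024
print(month_range(202308, 202407))
```

If `end` is earlier than `start`, the function returns an empty list rather than raising an error.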

In [None]:
# iterate through dates_array to read each csv file as a pandas dataframe

# file path to data, change to your needs
filepath = "data"
# read each monthly file, then concatenate them vertically in one step
monthly_dfs = []
for date in dates_array:
    csv_path = os.path.join(filepath, f"{date}-divvy-tripdata", f"{date}-divvy-tripdata.csv")
    monthly_dfs.append(pd.read_csv(csv_path))

# ignore_index=True gives the combined dataframe one continuous index
combined_df = pd.concat(monthly_dfs, ignore_index=True)
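
Before reading the files, it can help to confirm that every expected .csv is actually in place, since a missing month would stop the loop partway through. The sketch below assumes the same folder layout used above (`<month>-divvy-tripdata/<month>-divvy-tripdata.csv`); `missing_tripdata_files` is a hypothetical helper name, and the two month strings are only sample input.

```python
from pathlib import Path


def missing_tripdata_files(base_dir, months):
    """Return the expected CSV paths that do not exist under base_dir."""
    base = Path(base_dir)
    expected = [
        base / f"{month}-divvy-tripdata" / f"{month}-divvy-tripdata.csv"
        for month in months
    ]
    return [path for path in expected if not path.is_file()]


# Sample check for two months inside the "data" folder
missing = missing_tripdata_files("data", ["202308", "202309"])
if missing:
    print("Missing files:")
    for path in missing:
        print("  ", path)
```

In this notebook the call would presumably be `missing_tripdata_files(filepath, dates_array)`.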


In [None]:
# verify head and tail of the new df
print(combined_df.head())

In [None]:
print(combined_df.tail())

In [None]:
# save a backup csv in the data folder in case there is a problem with the data on the MySQL server
combined_df.to_csv(os.path.join(filepath, "combined-divvy-tripdata.csv"))

## Saving to SQL

We will store the data in a MySQL database. This lets us securely place our data in an environment where we can perform additional processing and queries, and SQL is well suited to handling large datasets. SQL is not strictly necessary for this Google Data Analytics Course project; I included it because I wanted to explore integrating SQL within a Python environment.

In [None]:
!pip install sqlalchemy PyMySQL ipython-sql --quiet

In [None]:
# imports
from getpass import getpass
from sqlalchemy import create_engine

In [None]:
# enter your login info for your SQL server
user = "root"
password = getpass() # used to hide your password

conn_str = f"mysql+pymysql://{user}:{password}@localhost:3306/"
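
One caveat with building the connection string by hand: if the password contains characters such as `@` or `/`, the URL may be parsed incorrectly. A common workaround is to percent-encode the credentials with `urllib.parse.quote_plus` from the standard library. The sketch below is illustrative only: `build_conn_str` is a hypothetical helper, and the password shown is made up.

```python
from urllib.parse import quote_plus


def build_conn_str(user, password, host="localhost", port=3306, database=""):
    """Build a mysql+pymysql URL with the user name and password percent-encoded."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up password containing reserved URL characters
print(build_conn_str("root", "p@ss/word"))
# prints mysql+pymysql://root:p%40ss%2Fword@localhost:3306/
```

Passwords made up only of letters and digits are unaffected by the encoding, so the plain f-string above behaves the same for them.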

The `ipython-sql` library allows us to communicate directly with MySQL in the Jupyter Notebook environment.

In [None]:
# load SQL session
%load_ext sql

In [None]:
%sql {conn_str}

In [None]:
# create a new database for our data
%sql CREATE DATABASE IF NOT EXISTS CyclisticDatabase;

The newly created database should show up.

In [None]:
%sql SHOW DATABASES;

To upload data to our newly created database, we will use a `SQLAlchemy` engine.

In [None]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@localhost:3306/CyclisticDatabase')

In [None]:
# upload data onto our SQL database, this will take a while
combined_df.to_sql('combined_tripdata', con=engine, if_exists='replace')
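
For a dataframe this large, `to_sql` also accepts a `chunksize` argument that writes the rows in batches, and `index=False` to skip storing the dataframe index as an extra column. The sketch below shows both options on a small made-up dataframe. To keep it runnable without a MySQL server, it uses an in-memory SQLite connection as a stand-in for the engine above, so this is an assumption for illustration and not the setup used in this notebook.

```python
import sqlite3

import pandas as pd

# Small made-up frame standing in for combined_df
sample_df = pd.DataFrame(
    {
        "ride_id": ["a1", "b2", "c3", "d4", "e5"],
        "rideable_type": ["classic_bike"] * 5,
    }
)

# In-memory SQLite stands in for the MySQL engine so the sketch runs anywhere
conn = sqlite3.connect(":memory:")

# chunksize writes the rows in batches instead of all at once;
# index=False skips writing the dataframe index as an extra column
sample_df.to_sql(
    "combined_tripdata", con=conn, if_exists="replace", index=False, chunksize=2
)

# Compare the number of rows in the table with the dataframe length
row_count = conn.execute("SELECT COUNT(*) FROM combined_tripdata").fetchone()[0]
print(row_count == len(sample_df))
conn.close()
```

With the MySQL engine, the same two keyword arguments would be passed to the `to_sql` call above; a suitable `chunksize` depends on the server and available memory.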

Let's verify that our newly created dataframe has been uploaded to our SQL database.

In [None]:
%sql USE CyclisticDatabase;

In [None]:
%sql SHOW TABLES;

In [None]:
# Preview of the first 5 entries
%sql SELECT * FROM combined_tripdata LIMIT 5;

With that, the data is prepared for cleaning and processing. We will move on to cleaning the data to ensure its accuracy and validity for further analysis of the business problem.