# Data Transformation and Analysis

### Ian Heung

In this notebook, we will take the data table we cleaned in SQL and use Pandas to add some additional columns, then use grouping functions and data filters to analyze the main differences between members and casual riders. I chose to do the data transformation in Pandas because custom functions can be applied to a column of data with ease, which allows more flexibility for complex data transformations and manipulations. Its seamless integration with Python also allows the use of data visualization packages like matplotlib.

## Data Transformation

Let's first load our cleaned table from SQL into Pandas, and add some columns for more descriptive analysis.

In [None]:
!pip install sqlalchemy PyMySQL --quiet

In [4]:
# imports
from getpass import getpass
from sqlalchemy import create_engine
import pandas as pd
 

In [2]:
# enter your login info for your SQL server
user = "root"
password = getpass() # used to hide your password

#conn_str = f"mysql+pymysql://{user}:{password}@localhost:3306/"

In [3]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@localhost:3306/CyclisticDatabase')
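
One caveat with building the connection URL from an f-string: a password containing characters such as `@`, `:` or `/` can be misread as part of the URL structure. The sketch below shows one way to guard against that by URL-encoding the password with the standard library first. The password value is a made-up example, and `safe_password` is a name introduced here for illustration only.

```python
from urllib.parse import quote_plus

user = "root"
password = "p@ss:word/1"  # made-up example; in the notebook this comes from getpass()

# Percent-encode characters that have a special meaning inside a URL
safe_password = quote_plus(password)

conn_str = f"mysql+pymysql://{user}:{safe_password}@localhost:3306/CyclisticDatabase"
print(conn_str)
```

The resulting string can be passed to `create_engine` in place of the unencoded URL.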

In [5]:
# read in the SQL table; this will take a few minutes
df = pd.read_sql_table('cleaned_tripdata', con=engine)

Now that we have loaded our data, let's check the column data types and preview the first few rows of our dataset.

In [10]:
# strings in SQL are stored as VARCHAR(255), and Pandas reads them in as the generic "object" datatype. We will convert them to the dedicated "string" datatype so the text columns are explicitly typed (any speed or memory gain depends on the storage backend).
df.dtypes

index                          int64
ride_id                       object
rideable_type                 object
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name            object
start_station_id              object
end_station_name              object
end_station_id                object
start_lat                    float64
start_lng                    float64
end_lat                      float64
end_lng                      float64
member_casual                 object
dtype: object

In [8]:
df.head()

Unnamed: 0,index,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,0,903C30C2D810A53B,electric_bike,2023-08-19 15:41:53,2023-08-19 15:53:36,LaSalle St & Illinois St,13430,Clark St & Elm St,TA1307000039,41.890721,-87.631477,41.902973,-87.63128,member
1,1,F2FB18A98E110A2B,electric_bike,2023-08-18 15:30:18,2023-08-18 15:45:25,Clark St & Randolph St,TA1305000030,,,41.884511,-87.63155,41.93,-87.64,member
2,2,D0DEC7C94E4663DA,electric_bike,2023-08-30 16:15:08,2023-08-30 16:27:37,Clark St & Randolph St,TA1305000030,,,41.884981,-87.630793,41.91,-87.63,member
3,3,E0DDDC5F84747ED9,electric_bike,2023-08-30 16:24:07,2023-08-30 16:33:34,Wells St & Elm St,KA1504000135,,,41.903105,-87.634667,41.9,-87.62,member
4,4,7797A4874BA260CA,electric_bike,2023-08-22 15:59:44,2023-08-22 16:20:38,Clark St & Randolph St,TA1305000030,,,41.885548,-87.632019,41.89,-87.68,member


Let's convert the necessary columns to the string datatype.

In [11]:
columns_to_convert = ["ride_id", "rideable_type", "start_station_name", "start_station_id", "end_station_name", "end_station_id", "member_casual"]
df[columns_to_convert] = df[columns_to_convert].astype("string")

In [12]:
# verify our columns are now the appropriate datatypes
df.dtypes

index                          int64
ride_id               string[python]
rideable_type         string[python]
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name    string[python]
start_station_id      string[python]
end_station_name      string[python]
end_station_id        string[python]
start_lat                    float64
start_lng                    float64
end_lat                      float64
end_lng                      float64
member_casual         string[python]
dtype: object
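
The conversion above can be tried on a small frame to see what it does to the dtype and to missing values. This is a minimal sketch on toy data, not the real table; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one text column (including a missing value) and one numeric column
toy = pd.DataFrame({
    "member_casual": ["member", "casual", None],
    "start_lat": [41.89, 41.88, 41.90],
})

# Same pattern as the notebook: select the text columns and cast them to "string"
text_columns = ["member_casual"]
toy[text_columns] = toy[text_columns].astype("string")

print(toy.dtypes)
print(toy["member_casual"].isna().sum())  # the missing value is still counted as missing
```

Only the listed columns are cast, so numeric columns such as `start_lat` keep their original dtype.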

Now that our data is reformatted, let's add some new columns. We will start with temporal data: the start and end times give us the total trip time elapsed, as well as the day and month of the ride. We will use the ride start time to record the month and day of the week, as there are cases where a ride overlaps into the next day.

In [17]:
# first, let's take the month and day of the week from the ride start time
df["month"] = df["started_at"].dt.month_name()
df["day_of_week"] = df["started_at"].dt.day_name()
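
To illustrate why the start time is used, here is a small sketch with two rides. The first uses timestamps from the preview above; the second is a made-up ride that crosses midnight. The `end_day` series is introduced only for comparison.

```python
import pandas as pd

rides = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-08-19 15:41:53", "2023-08-19 23:50:00"]),
    "ended_at": pd.to_datetime(["2023-08-19 15:53:36", "2023-08-20 00:10:00"]),
})

# Derive the labels from the start time, as in the notebook
rides["month"] = rides["started_at"].dt.month_name()
rides["day_of_week"] = rides["started_at"].dt.day_name()

# For comparison only: the day name taken from the end time
end_day = rides["ended_at"].dt.day_name()

print(rides[["month", "day_of_week"]])
print(end_day)
```

For the overnight ride, the start-based and end-based day names differ, so picking one convention keeps each ride in a single day bucket.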

In [22]:
# now let's add a column with the ride time in minutes, found by subtracting the start time from the end time
df["trip_time"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60 # there is no total_minutes() function, so we divide the total seconds by 60
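
As a quick check of the calculation, the sketch below applies it to two rides whose timestamps are taken from the preview above. It also shows an equivalent form, dividing the timedelta by a one-minute `pd.Timedelta`, as an alternative rather than the notebook's method; `trip_time_alt` is a name introduced here for illustration.

```python
import pandas as pd

rides = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-08-19 15:41:53", "2023-08-22 15:59:44"]),
    "ended_at": pd.to_datetime(["2023-08-19 15:53:36", "2023-08-22 16:20:38"]),
})

# Subtracting two datetime columns yields a timedelta column
duration = rides["ended_at"] - rides["started_at"]

# Notebook approach: total seconds divided by 60
rides["trip_time"] = duration.dt.total_seconds() / 60

# Alternative: divide the timedelta by a one-minute Timedelta
rides["trip_time_alt"] = duration / pd.Timedelta(minutes=1)

print(rides[["trip_time", "trip_time_alt"]])
```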

Now let's add a distance traveled column. We can't get the exact path of travel from the coordinates, but we can get a rough idea of how far a rider traveled.
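
One common way to estimate a straight-line distance from two latitude/longitude pairs is the haversine formula. The sketch below is an assumption about how this could be done, not necessarily the method used in this notebook. The function `haversine_km`, the column name `distance_km`, and the Earth radius constant are introduced here for illustration; the coordinates come from the preview above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

sample = pd.DataFrame({
    "start_lat": [41.890721, 41.884511],
    "start_lng": [-87.631477, -87.63155],
    "end_lat": [41.902973, 41.93],
    "end_lng": [-87.63128, -87.64],
})

# Vectorized over whole columns, so no Python-level loop is needed
sample["distance_km"] = haversine_km(
    sample["start_lat"], sample["start_lng"], sample["end_lat"], sample["end_lng"]
)
print(sample["distance_km"])
```

Because this measures the distance between the start and end points only, a round trip that returns to the same station would show a distance near zero regardless of how far the rider actually went.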

In [24]:
df.head()

Unnamed: 0,index,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,month,day_of_week,trip_time
0,0,903C30C2D810A53B,electric_bike,2023-08-19 15:41:53,2023-08-19 15:53:36,LaSalle St & Illinois St,13430,Clark St & Elm St,TA1307000039,41.890721,-87.631477,41.902973,-87.63128,member,August,Saturday,11.716667
1,1,F2FB18A98E110A2B,electric_bike,2023-08-18 15:30:18,2023-08-18 15:45:25,Clark St & Randolph St,TA1305000030,,,41.884511,-87.63155,41.93,-87.64,member,August,Friday,15.116667
2,2,D0DEC7C94E4663DA,electric_bike,2023-08-30 16:15:08,2023-08-30 16:27:37,Clark St & Randolph St,TA1305000030,,,41.884981,-87.630793,41.91,-87.63,member,August,Wednesday,12.483333
3,3,E0DDDC5F84747ED9,electric_bike,2023-08-30 16:24:07,2023-08-30 16:33:34,Wells St & Elm St,KA1504000135,,,41.903105,-87.634667,41.9,-87.62,member,August,Wednesday,9.45
4,4,7797A4874BA260CA,electric_bike,2023-08-22 15:59:44,2023-08-22 16:20:38,Clark St & Randolph St,TA1305000030,,,41.885548,-87.632019,41.89,-87.68,member,August,Tuesday,20.9
