# May Mobility (Data Scientist)

## Rough Idea of how the route looks

I don't know if this is the actual route.

As I comb through more of the data, I'll get a better understanding of the order of the stops.

I just plotted the lat/lon pairs in the order they appear in the appendix to give myself a visual aid.

![rough_image](./resources/pics/rough_idea_route.png)


Points of Interest (PoI)

| Stop       | Description                      | Latitude | Longitude |
|:-----------|:--------------------------------:|----------|-----------|
| Bus        | Bus stop on a major transit line | 39.77285 | -86.16168 |
| Dentist    | School of Dentistry              | 39.77467 | -86.17895 |
| Doctor     | Pediatrician’s office            | 39.77926 | -86.17496 |
| Admin      | Administrative building          | 39.77459 | -86.17433 |
| Hospital   | Campus hospital                  | 39.77567 | -86.17557 |
| Lime       | Bus stop on campus               | 39.77473 | -86.18376 |
| Parking    | Campus parking lot               | 39.77882 | -86.18121 |
| School     | School of Art and Design         | 39.77148 | -86.17148 |
| University | University lecture hall          | 39.77271 | -86.17575 |

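To get a feel for the distances involved, the coordinates in the table can be turned into leg lengths. The sketch below is only illustrative: `STOPS`, `haversine_km`, and `leg_lengths` are hypothetical names, the stop order is simply the table order (not a confirmed route), and the Earth is treated as a sphere of radius 6371 km.

```python
import math

# Stop coordinates copied from the table above: (latitude, longitude)
STOPS = {
    "Bus": (39.77285, -86.16168),
    "Dentist": (39.77467, -86.17895),
    "Doctor": (39.77926, -86.17496),
    "Admin": (39.77459, -86.17433),
    "Hospital": (39.77567, -86.17557),
    "Lime": (39.77473, -86.18376),
    "Parking": (39.77882, -86.18121),
    "School": (39.77148, -86.17148),
    "University": (39.77271, -86.17575),
}

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def leg_lengths(order):
    """Distance of each consecutive leg for a given stop order."""
    return [(s, t, haversine_km(STOPS[s], STOPS[t])) for s, t in zip(order, order[1:])]


# Assumption: table order stands in for the (unknown) real stop order
for s, t, d in leg_lengths(list(STOPS)):
    print(f"{s:>10} -> {t:<10} {d:5.2f} km")
```

Once the real stop order is known, passing it to `leg_lengths` in place of `list(STOPS)` would give the per-leg distances for the actual route.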

# Data Wrangling

## Read Data and Import modules

In [45]:
import pandas as pd
import numpy as np
import datetime
# QoL for viewing df output
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.options.display.float_format = "{:,.2f}".format

# read-in *.csv files
pickups_df   = pd.read_csv("resources/csv/Data_Science_pickups.csv", na_values = np.nan)
ridership_df = pd.read_csv("resources/csv/Data_Science_site_ridership.csv", na_values = np.nan)

## Quick Check of DataFrames

In [46]:
pickups_df.head()

Unnamed: 0,row_id,timestamp,pickup,dropoff,stop,vehicle,time,date,name
0,1,2021-11-01 07:10:54,1,0,Bus,Marble,07:00:00,2021-11-01,ES
1,2,2021-11-01 07:51:13,1,0,Bus,Marble,07:50:00,2021-11-01,ES
2,3,2021-11-01 08:02:13,1,0,Lime,Marble,08:01:00,2021-11-01,ES
3,4,2021-11-01 08:41:16,1,0,Doctor,Motto,08:41:00,2021-11-01,CM
4,5,2021-11-01 09:24:10,1,0,Bus,Myao,09:22:00,2021-11-01,CM


## Inspect and convert data types

In [47]:
print('pickups data frame\n')
pickups_df.info()

pickups data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   row_id     363 non-null    int64 
 1   timestamp  363 non-null    object
 2   pickup     363 non-null    int64 
 3   dropoff    363 non-null    int64 
 4   stop       363 non-null    object
 5   vehicle    363 non-null    object
 6   time       363 non-null    object
 7   date       363 non-null    object
 8   name       363 non-null    object
dtypes: int64(3), object(6)
memory usage: 25.6+ KB


In [48]:
# Data type changes in pickups_df
# timestamp    object
pickups_df["timestamp"] = pd.to_datetime(pickups_df["timestamp"], format="%Y-%m-%d %H:%M:%S")

# stop         object
pickups_df["stop"] = pickups_df["stop"].astype("category")

# vehicle      object
pickups_df["vehicle"] = pickups_df["vehicle"].astype("category")

# date         object
pickups_df["date"] = pd.to_datetime(pickups_df["date"], format="%Y-%m-%d")

# name         object
pickups_df["name"] = pickups_df["name"].astype("category")

pickups_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   row_id     363 non-null    int64         
 1   timestamp  363 non-null    datetime64[ns]
 2   pickup     363 non-null    int64         
 3   dropoff    363 non-null    int64         
 4   stop       363 non-null    category      
 5   vehicle    363 non-null    category      
 6   time       363 non-null    object        
 7   date       363 non-null    datetime64[ns]
 8   name       363 non-null    category      
dtypes: category(3), datetime64[ns](2), int64(3), object(1)
memory usage: 19.1+ KB

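Both files share the same `timestamp`, `date`, `stop`, `vehicle`, and `name` columns, so the conversions can be wrapped in one function. This is a minimal sketch, not the notebook's actual code: `convert_common_dtypes` is a hypothetical helper, and the two sample rows are copied from the `pickups_df.head()` output above.

```python
import pandas as pd


def convert_common_dtypes(df):
    """Return a copy with timestamp/date parsed and the label columns as categories."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"], format="%Y-%m-%d %H:%M:%S")
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    for col in ("stop", "vehicle", "name"):
        out[col] = out[col].astype("category")
    return out


# Two rows taken from the pickups preview, still as raw strings
sample = pd.DataFrame({
    "timestamp": ["2021-11-01 07:10:54", "2021-11-01 08:41:16"],
    "stop": ["Bus", "Doctor"],
    "vehicle": ["Marble", "Motto"],
    "date": ["2021-11-01", "2021-11-01"],
    "name": ["ES", "CM"],
})

converted = convert_common_dtypes(sample)
print(converted.dtypes)
```

Because the helper works on a copy, the raw frame stays available for comparison while the types are being checked.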

In [49]:
print('ridership data frame\n')
ridership_df.info()

ridership data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4352 entries, 0 to 4351
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  4352 non-null   object 
 1   pickup     4340 non-null   float64
 2   dropoff    4337 non-null   float64
 3   stop       4352 non-null   object 
 4   vehicle    4352 non-null   object 
 5   time       4352 non-null   object 
 6   date       4352 non-null   object 
 7   name       4352 non-null   object 
dtypes: float64(2), object(6)
memory usage: 272.1+ KB


In [50]:
# Data type changes in ridership_df
# timestamp     object
ridership_df["timestamp"] = pd.to_datetime(ridership_df["timestamp"], format="%Y-%m-%d %H:%M:%S")

# pickup        float64
# Should be int64, but the column still contains NaN, so the cast waits until after cleaning

# dropoff       float64
# Should be int64, same caveat as pickup

# stop          object
ridership_df["stop"] = ridership_df["stop"].astype("category")

# vehicle       object
ridership_df["vehicle"] = ridership_df["vehicle"].astype("category")

# date          object
ridership_df["date"] = pd.to_datetime(ridership_df["date"], format="%Y-%m-%d")

# name          object
ridership_df["name"] = ridership_df["name"].astype("category")

ridership_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4352 entries, 0 to 4351
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  4352 non-null   datetime64[ns]
 1   pickup     4340 non-null   float64       
 2   dropoff    4337 non-null   float64       
 3   stop       4352 non-null   category      
 4   vehicle    4352 non-null   category      
 5   time       4352 non-null   object        
 6   date       4352 non-null   datetime64[ns]
 7   name       4352 non-null   category      
dtypes: category(3), datetime64[ns](2), float64(2), object(1)
memory usage: 184.8+ KB


In [51]:
# add days of the week for QoL
day_number = ridership_df["timestamp"].dt.weekday  # Monday=0, ..., Sunday=6
day_name = ridership_df["timestamp"].dt.day_name()
weekend = day_number >= 5 # is weekend? (Saturday=5, Sunday=6)

if not weekend.any():
    print(f"since any(weekend) = {any(weekend)}, There are no rides on the weekends.")
else:
    print(f"since any(weekend) = {any(weekend)}, There are rides on the weekends.")

# nice to check side-by-side
op_days = pd.DataFrame({"day_num":day_number,"day":day_name,"is_weekend":weekend})

ridership_df.insert(ridership_df.columns.get_loc("name"),"day",day_name.astype("category"))
ridership_df.head()

since any(weekend) = False, There are no rides on the weekends.


Unnamed: 0,timestamp,pickup,dropoff,stop,vehicle,time,date,day,name
0,2021-06-03 13:08:10,1.0,0.0,Bus,Motto,13:05:00,2021-06-03,Thursday,JR
1,2021-06-03 13:31:41,0.0,1.0,Bus,Motto,13:31:43,2021-06-03,Thursday,JR
2,2021-06-04 11:06:02,1.0,0.0,School,Motto,11:03:00,2021-06-04,Friday,CM
3,2021-06-04 11:07:48,0.0,1.0,Bus,Motto,11:07:00,2021-06-04,Friday,CM
4,2021-06-04 12:43:54,1.0,0.0,Bus,Motto,12:40:00,2021-06-04,Friday,MN

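`Series.dt.weekday` numbers Monday as 0 and Sunday as 6, which makes Saturday 5, so a weekend test needs `>= 5`; a strict `> 5` would only catch Sundays. The toy dates below (a Friday through a Monday in June 2021) are illustrative and not taken from the ridership file.

```python
import pandas as pd

# Friday, Saturday, Sunday, Monday
dates = pd.Series(pd.to_datetime(["2021-06-04", "2021-06-05", "2021-06-06", "2021-06-07"]))

is_weekend = dates.dt.weekday >= 5      # Saturday (5) and Sunday (6)
sunday_only = dates.dt.weekday > 5      # misses Saturday

print(pd.DataFrame({
    "day": dates.dt.day_name(),
    "day_num": dates.dt.weekday,
    "is_weekend": is_weekend,
    "sunday_only": sunday_only,
}))
```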

In [52]:
days_of_the_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
ridership_stats_days = ridership_df[["pickup","dropoff"]].groupby(ridership_df["day"]).describe().reindex(days_of_the_week).dropna().transpose()
ridership_stats_days

Unnamed: 0,day,Monday,Tuesday,Wednesday,Thursday,Friday
pickup,count,716.0,817.0,874.0,1003.0,930.0
pickup,mean,0.6,0.61,0.63,0.62,0.64
pickup,std,0.66,0.7,0.79,0.71,0.75
pickup,min,0.0,0.0,0.0,0.0,0.0
pickup,25%,0.0,0.0,0.0,0.0,0.0
pickup,50%,1.0,1.0,1.0,1.0,1.0
pickup,75%,1.0,1.0,1.0,1.0,1.0
pickup,max,3.0,4.0,8.0,3.0,4.0
dropoff,count,715.0,815.0,874.0,1003.0,930.0
dropoff,mean,0.61,0.61,0.62,0.62,0.65


In [53]:
# inspect missing data from pickups
null_ridership_pickup = ridership_df[ridership_df["pickup"].isnull()]
null_ridership_pickup

Unnamed: 0,timestamp,pickup,dropoff,stop,vehicle,time,date,day,name
79,2021-06-17 14:09:37,,1.0,Hospital,Mette,14:06:00,2021-06-17,Thursday,MV
81,2021-06-17 16:52:03,,1.0,Doctor,Motto,16:51:00,2021-06-17,Thursday,JR
83,2021-06-17 20:25:00,,1.0,Bus,Mette,17:03:00,2021-06-17,Thursday,JR
85,2021-06-18 11:23:45,,1.0,Parking,Myao,11:16:00,2021-06-18,Friday,CM
87,2021-06-18 14:16:28,,1.0,School,Mette,14:16:00,2021-06-18,Friday,MN
89,2021-06-18 17:08:13,,1.0,University,Motto,17:07:00,2021-06-18,Friday,JR
91,2021-06-21 11:50:34,,1.0,University,Myao,11:49:00,2021-06-21,Monday,CM
93,2021-06-21 11:51:33,,1.0,Bus,Mette,09:15:00,2021-06-21,Monday,CM
95,2021-06-21 11:52:30,,1.0,Bus,Mette,09:39:00,2021-06-21,Monday,CM
97,2021-06-21 11:53:31,,1.0,Doctor,Myao,10:33:00,2021-06-21,Monday,CM


In [54]:
# inspect missing data from dropoffs
null_ridership_dropoff = ridership_df[ridership_df["dropoff"].isnull()]
null_ridership_dropoff

Unnamed: 0,timestamp,pickup,dropoff,stop,vehicle,time,date,day,name
78,2021-06-17 14:02:53,1.0,,Lime,Mette,14:00:00,2021-06-17,Thursday,MV
80,2021-06-17 16:49:04,1.0,,Bus,Motto,16:40:00,2021-06-17,Thursday,JR
82,2021-06-17 16:53:10,1.0,,Doctor,Mette,16:52:00,2021-06-17,Thursday,JR
84,2021-06-18 11:05:49,1.0,,Bus,Myao,11:05:00,2021-06-18,Friday,CM
86,2021-06-18 14:12:59,1.0,,Lime,Mette,14:10:00,2021-06-18,Friday,MN
88,2021-06-18 17:05:47,1.0,,Doctor,Motto,17:04:00,2021-06-18,Friday,JR
90,2021-06-21 11:26:12,1.0,,University,Myao,11:25:00,2021-06-21,Monday,CM
92,2021-06-21 11:51:08,1.0,,Lime,Mette,08:56:00,2021-06-21,Monday,CM
94,2021-06-21 11:52:02,1.0,,Doctor,Mette,09:29:00,2021-06-21,Monday,CM
96,2021-06-21 11:53:04,1.0,,Bus,Myao,10:21:00,2021-06-21,Monday,CM


In [55]:
# find records with more than 4 pickups or dropoffs (treated as entry errors)
entry_error_ridership_df = ridership_df[(ridership_df[["pickup","dropoff"]] > 4).any(axis=1)]
entry_error_ridership_df

Unnamed: 0,timestamp,pickup,dropoff,stop,vehicle,time,date,day,name
346,2021-07-14 16:45:21,8.0,0.0,Parking,Myao,16:44:00,2021-07-14,Wednesday,LA
348,2021-07-14 16:58:55,0.0,8.0,Admin,Myao,16:54:00,2021-07-14,Wednesday,LA
456,2021-07-21 14:18:18,8.0,0.0,Admin,Mette,14:18:00,2021-07-21,Wednesday,CM
457,2021-07-21 14:19:58,0.0,8.0,University,Mette,14:19:00,2021-07-21,Wednesday,CM


In [56]:
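# entry errors per name; observed=False keeps every name category (zero where none)
entry_error = entry_error_ridership_df["name"].groupby(entry_error_ridership_df["name"], observed=False).count()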

In [57]:
# Find out why some values are null
#    RangeIndex: 4352 entries, 0 to 4351
# 1   pickup     4340 non-null   float64
# 2   dropoff    4337 non-null   float64

print(f'Entry error data {entry_error_ridership_df.shape[0]} out of {ridership_df.shape[0]} records, {entry_error_ridership_df.shape[0]/ridership_df.shape[0]:.2%}.')
print(f'Missing pickup data {ridership_df["pickup"].isnull().sum()} out of {ridership_df.shape[0]} records, {ridership_df["pickup"].isnull().sum()/ridership_df.shape[0]:.2%}.')
print(f'Missing dropoff data {ridership_df["dropoff"].isnull().sum()} out of {ridership_df.shape[0]} records, {ridership_df["dropoff"].isnull().sum()/ridership_df.shape[0]:.2%}.')
print(f'Total errors is {ridership_df[["pickup","dropoff"]].isnull().sum().sum() + entry_error_ridership_df.shape[0]} out of {ridership_df.shape[0]} records, {(ridership_df[["pickup","dropoff"]].isnull().sum().sum() + entry_error_ridership_df.shape[0])/ridership_df.shape[0]:.2%}.')

Entry error data 4 out of 4352 records, 0.09%.
Missing pickup data 12 out of 4352 records, 0.28%.
Missing dropoff data 15 out of 4352 records, 0.34%.
Total errors is 31 out of 4352 records, 0.71%.

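The per-column counts and percentages can also be read off `isna()` directly: `sum()` gives the number of missing cells and `mean()` gives their share. The frame below is a made-up toy example, not the ridership data, and `summary` and `rows_with_gap` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data: one missing pickup, two missing dropoffs
toy = pd.DataFrame({
    "pickup": [1.0, np.nan, 0.0, 1.0],
    "dropoff": [0.0, 1.0, np.nan, np.nan],
})

summary = pd.DataFrame({
    "missing": toy.isna().sum(),   # missing cells per column
    "share": toy.isna().mean(),    # fraction of rows missing per column
})
print(summary)

# Rows with at least one missing value (can be smaller than the cell count)
rows_with_gap = int(toy.isna().any(axis=1).sum())
print(f"rows with a gap: {rows_with_gap} of {len(toy)}")
```

Counting rows and counting cells only agree when no row is missing both values, which is worth checking before adding the two column totals together.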

In [58]:
# investigate errors/nulls/missing data.
record_counts_per = ridership_df["name"].value_counts()
total_null_per =  ridership_df[["pickup","dropoff"]].isnull().groupby(ridership_df["name"]).sum().sum(axis=1)
total_errors_per = total_null_per + entry_error
percent_error_per = total_errors_per/record_counts_per

print(f"Total errors or missing data in ridership_df is {total_errors_per.sum()}.")

errors_df = pd.DataFrame({"percent_error":percent_error_per,
                          "errors":total_errors_per,
                          "records":record_counts_per})

errors_df["percent_error"] = errors_df["percent_error"].map('{:.2%}'.format)
errors_df[errors_df["errors"] != 0].sort_values("records",ascending=False)

Total errors or missing data in ridership_df is 31.


Unnamed: 0,percent_error,errors,records
JR,0.39%,6,1557
CM,1.20%,18,1497
MV,4.17%,2,48
LA,6.06%,2,33
MN,25.00%,2,8
HK,25.00%,1,4


In [59]:
# assume missing values are supposed to be zero,
# most likely because the Google Form wasn't defaulting to zero.
# The affected time period is printed below.
null_idx = ridership_df[["pickup","dropoff"]].isnull().any(axis=1)
print(f'The messed-up days range, could include from: {ridership_df[null_idx]["timestamp"].min():%b %d %Y} to {ridership_df[null_idx]["timestamp"].max():%b %d %Y}.')

The messed-up days range, could include from: Jun 17 2021 to Jun 22 2021.

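One way to sanity-check the "missing means zero" assumption is to confirm that every row with a missing value still records a count in the other column, which is the pattern visible in the two null listings above. The sketch uses a made-up toy frame; `has_null` and `other_is_positive` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the null rows: one side missing, the other side recorded
toy = pd.DataFrame({
    "pickup": [1.0, np.nan, 1.0, np.nan, 0.0],
    "dropoff": [np.nan, 1.0, np.nan, 1.0, 1.0],
})

has_null = toy.isna().any(axis=1)

# For rows with a gap, the non-missing column should carry a positive count
other_is_positive = toy[has_null].fillna(0).sum(axis=1) > 0
print(f"every gap sits next to a recorded count: {other_is_positive.all()}")

# If the check holds, filling with zero leaves the recorded counts untouched
filled = toy.fillna(0)
print(filled)
```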

In [60]:
# check to see if there is any simultaneous pickups and dropoffs
# subset removing NaN's
NaN_df = ridership_df[~null_idx]

# sanity check!
print(f"OG record count: {ridership_df.shape[0]}")
print(f"Reduced record count: {NaN_df.shape[0]}")
print(f"difference: {ridership_df.shape[0] - NaN_df.shape[0]}")

sim_check = NaN_df[["pickup","dropoff"]].astype(bool)
and_check = sim_check["pickup"] & sim_check["dropoff"]

OG record count: 4352
Reduced record count: 4325
difference: 27


In [61]:
# .loc, not .iloc: the index values are labels carried over from ridership_df
sim_ridership_df = ridership_df.loc[sim_check[and_check].index]
print(f"Freq of both pickup and dropoff: {sim_ridership_df.shape[0]} out of {ridership_df.shape[0]}, {sim_ridership_df.shape[0]/ridership_df.shape[0]:.2%}")
sim_ridership_df

Freq of both pickup and dropoff: 74 out of 4352, 1.70%


Unnamed: 0,timestamp,pickup,dropoff,stop,vehicle,time,date,day,name
50,2021-06-14 10:55:34,1.0,1.0,Dentist,Mette,11:48:00,2021-06-14,Monday,JW
66,2021-06-16 17:24:18,1.0,1.0,Hospital,Mette,17:00:00,2021-06-16,Wednesday,JW
155,2021-06-25 13:16:00,1.0,1.0,Dentist,Motto,13:15:00,2021-06-25,Friday,MV
679,2021-08-05 14:26:24,1.0,1.0,Bus,Mette,14:26:00,2021-08-05,Thursday,JP
921,2021-08-19 16:46:09,1.0,1.0,Bus,Motto,16:45:00,2021-08-19,Thursday,JR
1287,2021-08-30 16:26:23,1.0,1.0,University,Marble,16:26:00,2021-08-30,Monday,JR
1288,2021-08-30 16:38:24,1.0,1.0,Bus,Marble,16:38:00,2021-08-30,Monday,JR
1350,2021-08-31 16:39:43,1.0,1.0,Bus,Myao,16:39:00,2021-08-31,Tuesday,JR
1382,2021-09-01 10:34:27,1.0,1.0,Hospital,Marble,10:33:00,2021-09-01,Wednesday,JR
1463,2021-09-02 15:18:36,1.0,1.0,University,Motto,15:18:00,2021-09-02,Thursday,VG


In [62]:
print(f'There were {sim_ridership_df.shape[0]} records of a pickup and a dropoff at the same time.')
sim_ridership_df[["pickup"]].groupby(sim_ridership_df["day"]).count().rename(columns={"pickup":"freq"})

There were 74 records of a pickup and a dropoff at the same time.


day,freq
Friday,13
Monday,18
Thursday,19
Tuesday,13
Wednesday,11

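The simultaneous pickup/dropoff check can also be written as a single boolean mask and applied with `.loc`, which selects by label and so stays correct even after rows have been dropped. This is a sketch on made-up data with a deliberately non-default index; `both` and `simultaneous` are illustrative names.

```python
import pandas as pd

# Toy counts with an index that does not start at 0
toy = pd.DataFrame(
    {"pickup": [1, 0, 1, 2, 0], "dropoff": [0, 1, 1, 1, 0]},
    index=[10, 11, 12, 13, 14],
)

# True where a single record logs at least one pickup AND one dropoff
both = (toy["pickup"] > 0) & (toy["dropoff"] > 0)

simultaneous = toy.loc[both]
print(f"{both.sum()} of {len(toy)} rows ({both.mean():.2%})")
print(simultaneous)
```

Comparing against zero is equivalent to casting to `bool` for non-negative counts, but it states the intent directly and needs no intermediate frame.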

In [69]:
# remove error inputs
ridership_df_clean = ridership_df.drop(entry_error_ridership_df.index)

# replace NaN with zeros: only 74 out of 4352 records (1.70%) log a pickup
# and a dropoff together, so a missing value is most likely a zero
ridership_df_clean[["pickup","dropoff"]] = ridership_df_clean[["pickup","dropoff"]].fillna(value=0)

# check nulls
ridership_df_clean.isnull().sum()

timestamp    0
pickup       0
dropoff      0
stop         0
vehicle      0
time         0
date         0
day          0
name         0
dtype: int64

In [84]:
# finally convert pickup and dropoff to int64
ridership_df_clean["pickup"] = ridership_df_clean["pickup"].astype("int64")
ridership_df_clean["dropoff"] = ridership_df_clean["dropoff"].astype("int64")

# cleaned data
ridership_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4348 entries, 0 to 4351
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  4348 non-null   datetime64[ns]
 1   pickup     4348 non-null   int64         
 2   dropoff    4348 non-null   int64         
 3   stop       4348 non-null   category      
 4   vehicle    4348 non-null   category      
 5   time       4348 non-null   object        
 6   date       4348 non-null   datetime64[ns]
 7   day        4348 non-null   category      
 8   name       4348 non-null   category      
dtypes: category(4), datetime64[ns](2), int64(2), object(1)
memory usage: 222.9+ KB

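The float columns could not be cast to `int64` earlier because NumPy integers cannot hold NaN. An alternative, not used in this notebook, is pandas' nullable `Int64` dtype (capital "I"), which keeps missing values as `<NA>`. The series below is a toy example.

```python
import numpy as np
import pandas as pd

counts = pd.Series([1.0, np.nan, 0.0, 3.0], name="pickup")

# Nullable integer: the gap survives as <NA>
nullable = counts.astype("Int64")
print(nullable)

# Approach taken above: fill first, then cast to a plain NumPy integer
filled = counts.fillna(0).astype("int64")
print(filled)
```

The nullable route would be useful if the zero-fill assumption ever needs revisiting, since it keeps the gaps visible instead of overwriting them.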

# **Note!**
To-do: split time intervals by $n$ values.   
Time should be bound from 7am to 7pm.

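One possible reading of this to-do is to keep only timestamps between 7am and 7pm and floor them to fixed-width bins. The sketch assumes that "$n$ values" means $n$-minute bins; `bin_service_hours` and its parameters are hypothetical, and the sample timestamps are copied from the outputs above.

```python
import pandas as pd


def bin_service_hours(timestamps, n_minutes=30, start_hour=7, end_hour=19):
    """Keep timestamps in [start_hour, end_hour) and floor them to n-minute bins."""
    hours = timestamps.dt.hour
    in_service = (hours >= start_hour) & (hours < end_hour)
    return timestamps[in_service].dt.floor(f"{n_minutes}min")


ts = pd.Series(pd.to_datetime([
    "2021-06-03 13:08:10",
    "2021-06-03 13:31:41",
    "2021-06-04 11:06:02",
    "2021-06-17 20:25:00",  # outside 7am-7pm, so it is filtered out
]))

bins = bin_service_hours(ts, n_minutes=30)
print(bins)
```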
In [105]:
%matplotlib inline
import matplotlib.pyplot as plt
time_array = ridership_df_clean["timestamp"]

time_array.dt.hour

0      2021-06-03 13:08:10
1      2021-06-03 13:31:41
2      2021-06-04 11:06:02
3      2021-06-04 11:07:48
4      2021-06-04 12:43:54
               ...        
4347   2021-10-29 17:17:14
4348   2021-10-29 17:25:58
4349   2021-10-29 17:26:25
4350   2021-10-29 17:48:39
4351   2021-10-29 17:55:39
Name: timestamp, Length: 4348, dtype: datetime64[ns]

## Quick stats check-in

In [83]:
ridership_df_clean.groupby("day")[["pickup","dropoff"]].describe()

,pickup,pickup,pickup,pickup,pickup,pickup,pickup,pickup,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff
day,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Friday,933.0,0.64,0.75,0.0,0.0,0.0,1.0,4.0,933.0,0.65,0.73,0.0,0.0,1.0,1.0,3.0
Monday,722.0,0.6,0.66,0.0,0.0,1.0,1.0,3.0,722.0,0.6,0.65,0.0,0.0,1.0,1.0,3.0
Thursday,1006.0,0.62,0.71,0.0,0.0,1.0,1.0,3.0,1006.0,0.62,0.7,0.0,0.0,1.0,1.0,3.0
Tuesday,817.0,0.61,0.7,0.0,0.0,1.0,1.0,4.0,817.0,0.61,0.69,0.0,0.0,1.0,1.0,4.0
Wednesday,870.0,0.61,0.7,0.0,0.0,1.0,1.0,3.0,870.0,0.61,0.69,0.0,0.0,1.0,1.0,3.0


In [82]:
ridership_df_clean.groupby("stop")[["pickup","dropoff"]].describe()

,pickup,pickup,pickup,pickup,pickup,pickup,pickup,pickup,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff
stop,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Admin,443.0,0.8,0.8,0.0,0.0,1.0,1.0,3.0,443.0,0.47,0.62,0.0,0.0,0.0,1.0,3.0
Bus,1148.0,0.6,0.7,0.0,0.0,0.0,1.0,4.0,1148.0,0.62,0.67,0.0,0.0,1.0,1.0,3.0
Dentist,337.0,0.53,0.6,0.0,0.0,0.0,1.0,3.0,337.0,0.66,0.73,0.0,0.0,1.0,1.0,3.0
Doctor,429.0,0.7,0.75,0.0,0.0,1.0,1.0,3.0,429.0,0.57,0.72,0.0,0.0,0.0,1.0,3.0
Hospital,225.0,0.61,0.65,0.0,0.0,1.0,1.0,2.0,225.0,0.53,0.58,0.0,0.0,0.0,1.0,2.0
Lime,784.0,0.74,0.7,0.0,0.0,1.0,1.0,4.0,784.0,0.5,0.67,0.0,0.0,0.0,1.0,3.0
Parking,86.0,0.64,0.72,0.0,0.0,1.0,1.0,3.0,86.0,0.63,0.78,0.0,0.0,0.0,1.0,3.0
School,221.0,0.65,0.69,0.0,0.0,1.0,1.0,3.0,221.0,0.58,0.69,0.0,0.0,0.0,1.0,3.0
University,675.0,0.36,0.63,0.0,0.0,0.0,1.0,3.0,675.0,0.91,0.72,0.0,0.0,1.0,1.0,4.0


In [92]:
ridership_df_clean.groupby(["day","stop"]).describe()

,,pickup,pickup,pickup,pickup,pickup,pickup,pickup,pickup,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff,dropoff
day,stop,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Friday,Admin,87.0,0.66,0.7,0.0,0.0,1.0,1.0,2.0,87.0,0.59,0.67,0.0,0.0,0.0,1.0,2.0
Friday,Bus,267.0,0.66,0.78,0.0,0.0,1.0,1.0,4.0,267.0,0.66,0.74,0.0,0.0,1.0,1.0,3.0
Friday,Dentist,65.0,0.58,0.68,0.0,0.0,1.0,1.0,3.0,65.0,0.66,0.71,0.0,0.0,1.0,1.0,2.0
Friday,Doctor,111.0,0.75,0.77,0.0,0.0,1.0,1.0,3.0,111.0,0.58,0.78,0.0,0.0,0.0,1.0,3.0
Friday,Hospital,61.0,0.72,0.78,0.0,0.0,1.0,1.0,2.0,61.0,0.57,0.67,0.0,0.0,0.0,1.0,2.0
Friday,Lime,143.0,0.69,0.71,0.0,0.0,1.0,1.0,3.0,143.0,0.54,0.66,0.0,0.0,0.0,1.0,3.0
Friday,Parking,28.0,0.64,0.78,0.0,0.0,0.5,1.0,3.0,28.0,0.64,0.78,0.0,0.0,0.5,1.0,3.0
Friday,School,53.0,0.66,0.73,0.0,0.0,1.0,1.0,2.0,53.0,0.62,0.74,0.0,0.0,0.0,1.0,3.0
Friday,University,118.0,0.41,0.71,0.0,0.0,0.0,1.0,3.0,118.0,0.91,0.75,0.0,0.0,1.0,1.0,3.0
Monday,Admin,64.0,0.81,0.75,0.0,0.0,1.0,1.0,3.0,64.0,0.39,0.55,0.0,0.0,0.0,1.0,2.0


## Converted Variables per `*.csv` File

<table>
<tr><th>Pickup Data Types </th><th>Ridership Data Types</th></tr>
<tr><td>

| Var         |       Before  |       After     |
|:------------|:--------------|:----------------|
|row_id       | int64         | int64           |
|timestamp    | object        | datetime64[ns]  |   
|pickup       | int64         | int64           |
|dropoff      | int64         | int64           |
|stop         | object        | category        |
|vehicle      | object        | category        |
|time         | object        | object          |
|date         | object        | datetime64[ns]  |
|day          |               | category        |   
|name         | object        | category        |

</td><td>

| Var         |       Before  |       After    |
|:------------|:--------------|:---------------|
|timestamp    | object        | datetime64[ns] |
|pickup       | float64       | int64          |
|dropoff      | float64       | int64          |
|stop         | object        | category       |
|vehicle      | object        | category       |
|time         | object        | object         |
|date         | object        | datetime64[ns] |
|day          |               | category       |
|name         | object        | category       |

</td></tr> </table>

# Data Visualization

In [85]:
#plots go brrrr

# Prep Data for Machine Learning

## One-Hot Encoding
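As a starting point, the categorical columns can be expanded with `pd.get_dummies`. This is a minimal sketch on a toy frame, not the final feature set: the choice of `stop` as the encoded column and the `stop_` prefix are assumptions for illustration.

```python
import pandas as pd

# Toy frame; "Doctor" is a declared category that happens not to appear
toy = pd.DataFrame({
    "stop": pd.Categorical(["Bus", "Lime", "Bus"], categories=["Bus", "Doctor", "Lime"]),
    "pickup": [1, 0, 2],
})

# One indicator column per category, stored as 0/1 integers
encoded = pd.get_dummies(toy, columns=["stop"], prefix="stop", dtype="int64")
print(encoded)
```

Because `stop` is already a `category` column, unused categories still get their own (all-zero) indicator column, which keeps the feature layout stable across subsets of the data.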