# May Mobility (Data Scientist)

## Rough Idea of how the route looks

I don't believe this is the actual route.

As I comb through more of the data, I'll get a better understanding of the order in which the stops are visited.

For now, I just plotted the lat/lon pairs in the order they appear in the appendix to give myself a visual aid.

![rough_image](./resources/pics/rough_idea_route.png)


### Points of Interest (PoI)

| Stop       | Description                      | Latitude | Longitude |
|:-----------|:--------------------------------:|----------|-----------|
| Bus        | Bus stop on a major transit line | 39.77285 | -86.16168 |
| Dentist    | School of Dentistry              | 39.77467 | -86.17895 |
| Doctor     | Pediatrician’s office            | 39.77926 | -86.17496 |
| Admin      | Administrative building          | 39.77459 | -86.17433 |
| Hospital   | Campus hospital                  | 39.77567 | -86.17557 |
| Lime       | Bus stop on campus               | 39.77473 | -86.18376 |
| Parking    | Campus parking lot               | 39.77882 | -86.18121 |
| School     | School of Art and Design         | 39.77148 | -86.17148 |
| University | University lecture hall          | 39.77271 | -86.17575 |
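
A rough plot like the one above can be reproduced by connecting the stops in the order they are listed. The sketch below is only an illustration: it assumes the table order matches the appendix order, it uses matplotlib (which the notebook itself does not import), and the names `stops`, `fig`, `ax`, and `line` are introduced here for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt

# (name, latitude, longitude) copied from the PoI table, in table order.
stops = [
    ("Bus",        39.77285, -86.16168),
    ("Dentist",    39.77467, -86.17895),
    ("Doctor",     39.77926, -86.17496),
    ("Admin",      39.77459, -86.17433),
    ("Hospital",   39.77567, -86.17557),
    ("Lime",       39.77473, -86.18376),
    ("Parking",    39.77882, -86.18121),
    ("School",     39.77148, -86.17148),
    ("University", 39.77271, -86.17575),
]

lats = [lat for _, lat, _ in stops]
lons = [lon for _, _, lon in stops]

fig, ax = plt.subplots(figsize=(8, 6))

# Longitude on x, latitude on y; the line simply joins consecutive table rows.
(line,) = ax.plot(lons, lats, marker="o", linestyle="-")

# Label each stop slightly offset from its marker.
for name, lat, lon in stops:
    ax.annotate(name, (lon, lat), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Stops joined in table order (not the confirmed route)")
```

Calling `fig.savefig(...)` with a path of your choice writes the figure to disk. The connecting line only reflects row order, so it should not be read as the vehicle's actual path.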


## Read Data in

In [206]:
import pandas as pd
# Quality-of-life display settings so wide DataFrame output is not truncated
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# read-in *.csv files
pickups_df   = pd.read_csv("resources/csv/Data_Science_pickups.csv")
ridership_df = pd.read_csv("resources/csv/Data_Science_site_ridership.csv")

# display head of dataframes
print(pickups_df.head())
print(ridership_df.head())

   row_id            timestamp  pickup  dropoff    stop vehicle      time        date name
0       1  2021-11-01 07:10:54       1        0     Bus  Marble  07:00:00  2021-11-01   ES
1       2  2021-11-01 07:51:13       1        0     Bus  Marble  07:50:00  2021-11-01   ES
2       3  2021-11-01 08:02:13       1        0    Lime  Marble  08:01:00  2021-11-01   ES
3       4  2021-11-01 08:41:16       1        0  Doctor   Motto  08:41:00  2021-11-01   CM
4       5  2021-11-01 09:24:10       1        0     Bus    Myao  09:22:00  2021-11-01   CM
             timestamp  pickup  dropoff    stop vehicle      time        date name
0  2021-06-03 13:08:10     1.0      0.0     Bus   Motto  13:05:00  2021-06-03   JR
1  2021-06-03 13:31:41     0.0      1.0     Bus   Motto  13:31:43  2021-06-03   JR
2  2021-06-04 11:06:02     1.0      0.0  School   Motto  11:03:00  2021-06-04   CM
3  2021-06-04 11:07:48     0.0      1.0     Bus   Motto  11:07:00  2021-06-04   CM
4  2021-06-04 12:43:54     1.0      0.0

## Inspect and convert data types

In [207]:
print('pickups data frame\n')
pickups_df.info()

pickups data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   row_id     363 non-null    int64 
 1   timestamp  363 non-null    object
 2   pickup     363 non-null    int64 
 3   dropoff    363 non-null    int64 
 4   stop       363 non-null    object
 5   vehicle    363 non-null    object
 6   time       363 non-null    object
 7   date       363 non-null    object
 8   name       363 non-null    object
dtypes: int64(3), object(6)
memory usage: 25.6+ KB


In [208]:
# timestamp: object -> datetime64[ns]
pickups_df["timestamp"] = pd.to_datetime(pickups_df["timestamp"], format="%Y-%m-%d %H:%M:%S")

# stop: object -> category
pickups_df["stop"] = pickups_df["stop"].astype("category")

# vehicle: object -> category
pickups_df["vehicle"] = pickups_df["vehicle"].astype("category")

# date: object -> datetime64[ns]
pickups_df["date"] = pd.to_datetime(pickups_df["date"], format="%Y-%m-%d")

# name: object -> category
pickups_df["name"] = pickups_df["name"].astype("category")

pickups_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   row_id     363 non-null    int64         
 1   timestamp  363 non-null    datetime64[ns]
 2   pickup     363 non-null    int64         
 3   dropoff    363 non-null    int64         
 4   stop       363 non-null    category      
 5   vehicle    363 non-null    category      
 6   time       363 non-null    object        
 7   date       363 non-null    datetime64[ns]
 8   name       363 non-null    category      
dtypes: category(3), datetime64[ns](2), int64(3), object(1)
memory usage: 19.1+ KB
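
The `time` column is the only one left as `object` here. If time-of-day arithmetic is needed later, one option is `pd.to_timedelta`, which parses `HH:MM:SS` strings as offsets from midnight. The sketch below runs on a small stand-in frame built from the first three rows of the `head()` output. `demo_df`, `time_td`, `time_dt`, and `gap` are hypothetical names, and how `time` relates to `timestamp` has not been established at this point.

```python
import pandas as pd

# Stand-in for pickups_df: first three rows from the head() output.
demo_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-11-01 07:10:54",
        "2021-11-01 07:51:13",
        "2021-11-01 08:02:13",
    ]),
    "time": ["07:00:00", "07:50:00", "08:01:00"],
    "date": pd.to_datetime(["2021-11-01"] * 3),
})

# Parse "HH:MM:SS" strings into timedeltas (offset from midnight).
demo_df["time_td"] = pd.to_timedelta(demo_df["time"])

# date + offset gives a full datetime that can be compared with `timestamp`.
demo_df["time_dt"] = demo_df["date"] + demo_df["time_td"]
demo_df["gap"] = demo_df["timestamp"] - demo_df["time_dt"]

print(demo_df[["timestamp", "time_dt", "gap"]])
```

Keeping the parsed values in new columns leaves the original `time` strings untouched, so the step is easy to drop if it turns out not to be useful.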


In [209]:
print('ridership data frame\n')
ridership_df.info()

ridership data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4352 entries, 0 to 4351
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  4352 non-null   object 
 1   pickup     4340 non-null   float64
 2   dropoff    4337 non-null   float64
 3   stop       4352 non-null   object 
 4   vehicle    4352 non-null   object 
 5   time       4352 non-null   object 
 6   date       4352 non-null   object 
 7   name       4352 non-null   object 
dtypes: float64(2), object(6)
memory usage: 272.1+ KB


In [216]:
# timestamp: object -> datetime64[ns]
ridership_df["timestamp"] = pd.to_datetime(ridership_df["timestamp"], format="%Y-%m-%d %H:%M:%S")

# pickup / dropoff: float64 (left unchanged for now)
# These should be int64, but both columns contain nulls,
# and plain int64 cannot represent missing values.

# stop: object -> category
ridership_df["stop"] = ridership_df["stop"].astype("category")

# vehicle: object -> category
ridership_df["vehicle"] = ridership_df["vehicle"].astype("category")

# date: object -> datetime64[ns]
ridership_df["date"] = pd.to_datetime(ridership_df["date"], format="%Y-%m-%d")

# name: object -> category
ridership_df["name"] = ridership_df["name"].astype("category")

ridership_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4352 entries, 0 to 4351
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  4352 non-null   datetime64[ns]
 1   pickup     4340 non-null   float64       
 2   dropoff    4337 non-null   float64       
 3   stop       4352 non-null   category      
 4   vehicle    4352 non-null   category      
 5   time       4352 non-null   object        
 6   date       4352 non-null   datetime64[ns]
 7   name       4352 non-null   category      
dtypes: category(3), datetime64[ns](2), float64(2), object(1)
memory usage: 184.8+ KB
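
`pickup` and `dropoff` are still `float64` after this cell because both contain missing values, and NumPy's `int64` cannot store `NaN`. One possible way to get integer counts while keeping the gaps is pandas' nullable integer dtype `"Int64"` (capital `I`). A minimal sketch on a stand-in frame follows. `demo_df` is hypothetical and only mimics the two columns of `ridership_df`.

```python
import numpy as np
import pandas as pd

# Stand-in for the two count columns, including one missing value each.
demo_df = pd.DataFrame({
    "pickup":  [1.0, 0.0, np.nan, 1.0],
    "dropoff": [0.0, 1.0, np.nan, 0.0],
})

# "Int64" is the nullable extension dtype; missing entries become <NA>.
for col in ("pickup", "dropoff"):
    demo_df[col] = demo_df[col].astype("Int64")

print(demo_df.dtypes)
print(demo_df)
```

The cast raises an error if a column holds non-integral floats such as `0.5`, which doubles as a cheap check that the counts really are whole numbers.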


In [236]:
# Start finding out why it's null: info() reports fewer non-null values than rows
#    RangeIndex: 4352 entries, 0 to 4351
# 1   pickup     4340 non-null   float64
# 2   dropoff    4337 non-null   float64

n_rows = ridership_df.shape[0]
for col in ("pickup", "dropoff"):
    n_null = ridership_df[col].isnull().sum()
    print(f'Null {col} values {n_null} out of {n_rows} records, {n_null/n_rows:.2%} missing')

Null pickup values 12 out of 4352 records, 0.28% missing
Null dropoff values 15 out of 4352 records, 0.34% missing
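
The counts show how many values are missing but not where. A natural next step is to pull out the affected rows and check whether they cluster by stop or vehicle. The sketch below runs on a small made-up frame: `demo_df` and its rows are illustrative, not taken from the real file. With the real data, the same mask would be applied to `ridership_df`.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the values are not from the real CSV.
demo_df = pd.DataFrame({
    "pickup":  [1.0, np.nan, 0.0, np.nan],
    "dropoff": [0.0, np.nan, np.nan, 1.0],
    "stop":    ["Bus", "Bus", "School", "Lime"],
    "vehicle": ["Motto", "Motto", "Myao", "Marble"],
})

# True for rows where pickup or dropoff (or both) is missing.
null_mask = demo_df[["pickup", "dropoff"]].isna().any(axis=1)
null_rows = demo_df[null_mask]
print(null_rows)

# How the incomplete rows are distributed across stops.
null_by_stop = null_rows["stop"].value_counts()
print(null_by_stop)
```

Grouping the incomplete rows by `vehicle`, `name`, or `date` in the same way may help tell a logging problem from genuinely unknown counts.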


### Convert variables to proper dtypes

<table>
<tr><th>Pickup Data Types </th><th>Ridership Data Types</th></tr>
<tr><td>

| Var         |       Before  |       After     |
|:------------|:--------------|:----------------|
|row_id       | int64         | int64           |
|timestamp    | object        | datetime64[ns]  |   
|pickup       | int64         | int64           |
|dropoff      | int64         | int64           |
|stop         | object        | category        |
|vehicle      | object        | category        |
|time         | object        | object          |
|date         | object        | datetime64[ns]  |   
|name         | object        | category        |

</td><td>

| Var         |       Before  |       After    |
|:------------|:--------------|:---------------|
|timestamp    | object        | datetime64[ns] |
|pickup       | float64       | float64        |
|dropoff      | float64       | float64        |
|stop         | object        | category       |
|vehicle      | object        | category       |
|time         | object        | object         |
|date         | object        | datetime64[ns] |
|name         | object        | category       |

</td></tr> </table>
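
Most of the conversions in these tables could also be requested when the file is read, using the `parse_dates` and `dtype` arguments of `pd.read_csv`. The sketch below reads from an in-memory string rather than the real CSV so that it runs on its own. `SAMPLE_CSV` and `sample_df` are hypothetical names, the rows are copied from the ridership `head()` output, and a comma delimiter is assumed.

```python
import io

import pandas as pd

# Stand-in for Data_Science_site_ridership.csv (comma-delimited is an assumption).
SAMPLE_CSV = """timestamp,pickup,dropoff,stop,vehicle,time,date,name
2021-06-03 13:08:10,1.0,0.0,Bus,Motto,13:05:00,2021-06-03,JR
2021-06-03 13:31:41,0.0,1.0,Bus,Motto,13:31:43,2021-06-03,JR
2021-06-04 11:06:02,1.0,0.0,School,Motto,11:03:00,2021-06-04,CM
"""

sample_df = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    parse_dates=["timestamp", "date"],  # parsed to datetime while reading
    dtype={"stop": "category", "vehicle": "category", "name": "category"},
)

print(sample_df.dtypes)
```

Doing the conversion at load time keeps the dtype decisions in one place, though the explicit per-column casts used earlier make the before/after comparison easier to follow.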