# Milestone 1 - Chicago, U.S. bikesharing

`Source`: 
https://data.cityofchicago.org/Transportation/Divvy-Trips/fg6s-gzvg/about_data 

`Mobility domain`:
https://data.cityofchicago.org/

`Github repository`:
https://github.com/kbui-03/ChicagoBikeProject


# Target feature of the model: `trip duration` 

By predicting the duration of each trip, we can:
+ Optimise the charging times of electric bikes
+ Present customers with a cost estimate
+ Reduce bike shortages in rush hour
+ and other benefits.


# Data preparation

Since the full dataset has roughly **21 million rows**, we needed to reduce its size before downloading. To do that, we used the provided query function to filter the rows:

`TRIP_ID` is greater than or equal to 22,000,000 AND

`TRIP_ID` is less than or equal to 22,200,000

This left us with roughly 170,000 unique bike rides to work with, all from March 2019.
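The same inclusive range filter can be sketched in pandas, for the case where the full export is already available locally. This is only an illustration on a toy frame with made-up durations; we applied the actual filter in the data portal, not in code.

```python
import pandas as pd

# Toy stand-in for the full export; the values are hypothetical.
trips = pd.DataFrame({
    "TRIP ID": [21_999_999, 22_000_000, 22_100_000, 22_200_000, 22_200_001],
    "TRIP DURATION": [300, 420, 610, 95, 1200],
})

# Series.between is inclusive on both ends by default, matching the
# "greater than or equal" / "less than or equal" conditions above.
in_range = trips["TRIP ID"].between(22_000_000, 22_200_000)
subset = trips[in_range]

print(subset["TRIP ID"].tolist())  # [22000000, 22100000, 22200000]
```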

In [1]:
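import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the plotting interface, not the top-level package
import seaborn as sns

# Load the filtered export and inspect column types and non-null counts
bike_set = pd.read_csv("Divvy_Trips_Chicago.csv", sep=",")
bike_set.info()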

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169540 entries, 0 to 169539
Data columns (total 18 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   TRIP ID            169540 non-null  int64  
 1   START TIME         169540 non-null  object 
 2   STOP TIME          169540 non-null  object 
 3   BIKE ID            169540 non-null  int64  
 4   TRIP DURATION      169540 non-null  int64  
 5   FROM STATION ID    169540 non-null  int64  
 6   FROM STATION NAME  169540 non-null  object 
 7   TO STATION ID      169540 non-null  int64  
 8   TO STATION NAME    169540 non-null  object 
 9   USER TYPE          169540 non-null  object 
 10  GENDER             156307 non-null  object 
 11  BIRTH YEAR         157020 non-null  float64
 12  FROM LATITUDE      169536 non-null  float64
 13  FROM LONGITUDE     169536 non-null  float64
 14  FROM LOCATION      169536 non-null  object 
 15  TO LATITUDE        169530 non-null  float64
 16  TO


We can observe that a few rows are missing `TO LATITUDE`, `TO LONGITUDE`, `TO LOCATION`, `FROM LATITUDE`, `FROM LONGITUDE` and `FROM LOCATION`. These rows (four to ten per column) can be deleted without a significant impact on the overall dataset.

The columns `GENDER` and `BIRTH YEAR` have far more missing values: roughly 12,500 to 13,200 rows each, or about 7 to 8% of the data. One suggested solution was to analyze the ratio of male to female riders and distribute the rows with missing genders accordingly. Because of the number of affected rows, however, we decided as a group to delete them instead, in order to preserve the integrity of the dataset and its correlations with other features.

Even after deleting those rows, the dataset still contains over 100,000 observations, which fulfills the given minimum observation requirement.
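Before dropping anything, it can help to quantify the missing values per column. The following is a minimal sketch on a toy frame; only the column names come from the dataset, and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
sample = pd.DataFrame({
    "TRIP DURATION": [300, 420, 610, 95, 1200],
    "GENDER": ["Male", None, "Female", "Male", None],
    "BIRTH YEAR": [1985.0, np.nan, 1992.0, 1978.0, 1990.0],
    "TO LATITUDE": [41.88, 41.90, np.nan, 41.87, 41.89],
})

# Missing values per column, as a count and as a share of all rows.
missing = sample.isna().sum()
share = sample.isna().mean()
print(pd.DataFrame({"missing": missing, "share": share}))

# dropna() removes every row that has at least one missing value.
cleaned = sample.dropna()
print(len(sample), "->", len(cleaned))  # 5 -> 2
```

Note that a row is dropped if any single column is missing, so the number of removed rows can exceed the largest per-column count.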


In [2]:
# Drop every row that has at least one missing value
bike_clean = bike_set.dropna()
bike_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 156296 entries, 0 to 169539
Data columns (total 18 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   TRIP ID            156296 non-null  int64  
 1   START TIME         156296 non-null  object 
 2   STOP TIME          156296 non-null  object 
 3   BIKE ID            156296 non-null  int64  
 4   TRIP DURATION      156296 non-null  int64  
 5   FROM STATION ID    156296 non-null  int64  
 6   FROM STATION NAME  156296 non-null  object 
 7   TO STATION ID      156296 non-null  int64  
 8   TO STATION NAME    156296 non-null  object 
 9   USER TYPE          156296 non-null  object 
 10  GENDER             156296 non-null  object 
 11  BIRTH YEAR         156296 non-null  float64
 12  FROM LATITUDE      156296 non-null  float64
 13  FROM LONGITUDE     156296 non-null  float64
 14  FROM LOCATION      156296 non-null  object 
 15  TO LATITUDE        156296 non-null  float64
 16  TO LONG

# Data Engineering Ideas

Features to **KEEP**:
+ START TIME: Essential
+ BIKE ID: Keep for now. While individual bike IDs might not seem predictive, they could capture effects of bike age, maintenance, or type if certain ID ranges correspond to different bike characteristics. Such details are not included in the dataset, so this needs further consideration.
+ FROM STATION ID: Crucial. Trip duration is heavily dependent on its origin.
+ USER TYPE: Important. Subscribers and casual customers often exhibit different usage patterns.
+ GENDER: Potentially useful demographic information.
+ BIRTH YEAR: Useful for deriving rider age.
+ FROM LATITUDE, FROM LONGITUDE: Maybe. Since Divvy bikes can only be picked up from and returned to authorised stations, it is unclear whether the specific coordinates add anything beyond the station ID.

Features to **REMOVE**:
+ TRIP ID: This is a unique identifier for each row and generally holds no predictive power for the duration of other trips.
+ STOP TIME: Using this directly would be a data leak, as it inherently defines the trip duration when combined with START TIME.
+ TO STATION ID, TO LATITUDE, TO LONGITUDE: After further research on Divvy services, it seems the user pays only for the rental time and not for the distance, and the app does not ask for a destination beforehand (unlike Uber). The destination is therefore unknown at the start of a trip and cannot be used to predict the duration.
+ FROM STATION NAME and TO STATION NAME: These are likely redundant with the corresponding station IDs, provided the IDs are clean.
+ FROM LOCATION and TO LOCATION: These appear to be string representations of the latitude/longitude point data. 
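The removal list above, and the reason STOP TIME counts as a leak, can be sketched as follows. The single row is hypothetical, and the timestamp format string is an assumption about how the export stores START TIME and STOP TIME.

```python
import pandas as pd

# One hypothetical row using some of the dataset's column names.
trips = pd.DataFrame({
    "TRIP ID": [22000001],
    "START TIME": ["03/01/2019 08:15:00 AM"],
    "STOP TIME": ["03/01/2019 08:27:00 AM"],
    "TRIP DURATION": [720],
    "FROM STATION ID": [35],
    "FROM STATION NAME": ["Station A"],
    "TO STATION ID": [91],
    "TO STATION NAME": ["Station B"],
    "USER TYPE": ["Subscriber"],
})

# Leak check: STOP TIME minus START TIME reproduces the target exactly.
fmt = "%m/%d/%Y %I:%M:%S %p"  # assumed timestamp format
elapsed = (pd.to_datetime(trips["STOP TIME"], format=fmt)
           - pd.to_datetime(trips["START TIME"], format=fmt))
print(elapsed.dt.total_seconds().tolist())  # [720.0]

columns_to_remove = [
    "TRIP ID", "STOP TIME",
    "TO STATION ID", "TO LATITUDE", "TO LONGITUDE",
    "FROM STATION NAME", "TO STATION NAME",
    "FROM LOCATION", "TO LOCATION",
]

# errors="ignore" skips names that are absent from this toy frame.
features = trips.drop(columns=columns_to_remove, errors="ignore")
print(features.columns.tolist())
```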

Features to **ADD**:
+ Temporal Features (from START TIME):
    + HourOfDay
    + DayOfWeek
    + PartOfDay: Categorical feature (e.g., morning, afternoon, evening, night) based on HourOfDay.
+ AGE, calculated from BIRTH YEAR
+ Weather Data:
    + Temperature
    + WindSpeed
    + Precipitation
+ Location-based Features:
    + ZoningDesignation (e.g. residential, business)
    + IncomeLevel
    + Other interesting demographic details (if found)
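The temporal features and AGE could be derived roughly as follows. This is a sketch under assumptions: the timestamp format, the PartOfDay boundaries, and the use of the trip year as the reference for AGE are our choices for illustration, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
trips = pd.DataFrame({
    "START TIME": ["03/01/2019 07:45:00 AM", "03/02/2019 01:10:00 PM",
                   "03/03/2019 11:30:00 PM"],
    "BIRTH YEAR": [1985.0, 1992.0, 1978.0],
})

# Assumed timestamp format (12-hour clock with AM/PM).
start = pd.to_datetime(trips["START TIME"], format="%m/%d/%Y %I:%M:%S %p")

trips["HourOfDay"] = start.dt.hour
trips["DayOfWeek"] = start.dt.dayofweek  # Monday=0 ... Sunday=6

# Assumed boundaries: night 0-5, morning 6-11, afternoon 12-17, evening 18-23.
trips["PartOfDay"] = pd.cut(
    trips["HourOfDay"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,  # intervals are closed on the left: [0, 6), [6, 12), ...
)

# Age in the year of the trip, using the trip's own year as the reference.
trips["AGE"] = (start.dt.year - trips["BIRTH YEAR"]).astype(int)

print(trips[["HourOfDay", "DayOfWeek", "PartOfDay", "AGE"]])
```

The weather and location-based features would require joining external data sources, which are not yet chosen, so they are left out of this sketch.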

Todo list:
- Handle null values
- Decide which features are important
- Decide Python environment
- Decide which new columns to add
- Choose appropriate visualisation of the cleaned dataset
- Separate dataset into training, validation and test data
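For the last item, one possible approach is a shuffled split using pandas alone. The 70/15/15 ratios and the random seed are assumptions for illustration, not decisions we have made yet.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset.
data = pd.DataFrame({"TRIP DURATION": range(100)})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Assumed 70/15/15 split, computed with integer arithmetic.
n = len(shuffled)
train_end = n * 70 // 100
val_end = n * 85 // 100

train_set = shuffled.iloc[:train_end]
validation_set = shuffled.iloc[train_end:val_end]
test_set = shuffled.iloc[val_end:]

print(len(train_set), len(validation_set), len(test_set))  # 70 15 15
```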