# 'Milestone 1 - Chicago, U.S. bikesharing'

`source`: 
https://data.cityofchicago.org/Transportation/Divvy-Trips/fg6s-gzvg/about_data 

`mobility domain`:
https://data.cityofchicago.org/


# 'What variable are we trying to predict?'

`trip duration` 

# 'Data preparation'

Since the dataset has roughly **21 million rows**, we reduce its size before downloading it. To do so, we used the query function to filter the rows.
The following filters were applied:

`TRIP_ID` is greater than or equal to 22,000,000 AND

`TRIP_ID` is less than or equal to 22,200,000
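
For reference, the same inclusive range filter can be sketched in pandas. This is only an illustration on a made-up toy frame, not the query that was run on the data portal; the column name `TRIP ID` follows the downloaded CSV, and the variable names `trips` and `subset` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the full table; only the ID column matters here (values are made up).
trips = pd.DataFrame(
    {"TRIP ID": [21_999_999, 22_000_000, 22_100_000, 22_200_000, 22_200_001]}
)

# Series.between is inclusive on both ends by default,
# which matches ">= 22,000,000 AND <= 22,200,000".
subset = trips[trips["TRIP ID"].between(22_000_000, 22_200_000)]

print(subset["TRIP ID"].tolist())
```

Both boundary IDs are kept, while the IDs just outside the range are dropped.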

todo list:
- null values -> how to handle them
- which features are important, which ones to drop and why?
- in which python environment are we working?
- which new columns to add? (potentially new data)
- choose appropriate model to visualize the cleaned dataset

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns

bike_set = pd.read_csv("Divvy_Trips_20250516.csv", sep=",")
bike_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169540 entries, 0 to 169539
Data columns (total 18 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   TRIP ID            169540 non-null  int64  
 1   START TIME         169540 non-null  object 
 2   STOP TIME          169540 non-null  object 
 3   BIKE ID            169540 non-null  int64  
 4   TRIP DURATION      169540 non-null  int64  
 5   FROM STATION ID    169540 non-null  int64  
 6   FROM STATION NAME  169540 non-null  object 
 7   TO STATION ID      169540 non-null  int64  
 8   TO STATION NAME    169540 non-null  object 
 9   USER TYPE          169540 non-null  object 
 10  GENDER             156307 non-null  object 
 11  BIRTH YEAR         157020 non-null  float64
 12  FROM LATITUDE      169536 non-null  float64
 13  FROM LONGITUDE     169536 non-null  float64
 14  FROM LOCATION      169536 non-null  object 
 15  TO LATITUDE        169530 non-null  float64
 16  TO


The output shows that only a few rows lack coordinates: 4 rows are missing `FROM LATITUDE`, `FROM LONGITUDE` and `FROM LOCATION`, and 10 rows are missing `TO LATITUDE` (the counts for `TO LONGITUDE` and `TO LOCATION` are cut off in the output above). These rows can be deleted without a significant impact on the dataset.
Many more values are missing in the columns `GENDER` (13,233 rows) and `BIRTH YEAR` (12,520 rows). One suggested solution was to compute the ratio of male to female riders and assign genders to the incomplete rows in that proportion. Because of the number of affected rows, we decided as a group to delete them instead, in order to preserve the integrity of the dataset and its correlations with other features. Even after deleting those rows, the dataset still contains sufficient observations, with over 100,000 rows.
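
A minimal sketch of how the missing values could be counted and the affected rows dropped is given here. It runs on a small made-up frame that reuses some of the real column names; the list `required` is an assumption about which columns must be complete and is not a decision taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the Divvy export (values are made up).
df = pd.DataFrame({
    "TRIP ID": [1, 2, 3, 4, 5],
    "GENDER": ["Male", None, "Female", "Male", "Female"],
    "BIRTH YEAR": [1982.0, np.nan, 1990.0, np.nan, 1975.0],
    "FROM LATITUDE": [41.88, 41.86, np.nan, 41.90, 41.89],
    "TO LATITUDE": [41.87, 41.86, 41.88, 41.89, 41.90],
})

# Missing values per column: the complement of the non-null counts from info().
missing = df.isna().sum()
print(missing)

# Assumed set of columns that must be present for a row to be kept.
required = ["GENDER", "BIRTH YEAR", "FROM LATITUDE", "TO LATITUDE"]

# Drop every row with a null in any required column and renumber the index.
cleaned = df.dropna(subset=required).reset_index(drop=True)
print(len(df), "->", len(cleaned), "rows")
```

Passing `subset` keeps the deletion explicit, so a null in a column that is later dropped anyway does not remove an otherwise usable row.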


In [None]:
# to do -> delete null rows
# March to April

Unnamed: 0,TRIP ID,START TIME,STOP TIME,BIKE ID,TRIP DURATION,FROM STATION ID,FROM STATION NAME,TO STATION ID,TO STATION NAME,USER TYPE,GENDER,BIRTH YEAR,FROM LATITUDE,FROM LONGITUDE,FROM LOCATION,TO LATITUDE,TO LONGITUDE,TO LOCATION
0,22200000,04/03/2019 08:29:54 AM,04/03/2019 08:38:51 AM,5313,537,18,Wacker Dr & Washington St,50,Clark St & Ida B Wells Dr,Subscriber,Male,1982.0,41.883132,-87.637321,POINT (-87.637321 41.883132),41.875933,-87.630585,POINT (-87.6305845355 41.8759326655)
1,22199999,04/03/2019 08:29:52 AM,04/03/2019 08:33:36 AM,5884,224,137,Morgan Ave & 14th Pl,55,Halsted St & Roosevelt Rd,Subscriber,Male,1965.0,41.862378,-87.651062,POINT (-87.651062 41.862378),41.867324,-87.648625,POINT (-87.648625 41.867324)
2,22199998,04/03/2019 08:29:52 AM,04/03/2019 08:47:57 AM,4048,1085,210,Ashland Ave & Division St,47,State St & Kinzie St,Subscriber,Male,1987.0,41.903450,-87.667747,POINT (-87.667747 41.90345),41.889187,-87.627754,POINT (-87.627754 41.889187)
3,22199997,04/03/2019 08:29:50 AM,04/03/2019 08:34:59 AM,2638,309,96,Desplaines St & Randolph St,192,Canal St & Adams St,Subscriber,Male,1988.0,41.884616,-87.644571,POINT (-87.6445705849 41.88461618962),41.879255,-87.639904,POINT (-87.639904 41.879255)
4,22199996,04/03/2019 08:29:45 AM,04/03/2019 08:38:06 AM,3179,501,77,Clinton St & Madison St,43,Michigan Ave & Washington St,Subscriber,Male,1978.0,41.882242,-87.641066,POINT (-87.641066 41.882242),41.883893,-87.624649,POINT (-87.6246491409 41.8838927658)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169535,22000004,03/05/2019 05:18:57 PM,03/05/2019 05:22:51 PM,4667,234,145,Mies van der Rohe Way & Chestnut St,106,State St & Pearson St,Subscriber,Male,1951.0,41.898587,-87.621915,POINT (-87.6219152258 41.8985866514),41.897448,-87.628722,POINT (-87.628722 41.897448)
169536,22000003,03/05/2019 05:18:52 PM,03/05/2019 05:25:19 PM,1667,387,638,Clinton St & Jackson Blvd (*),175,Wells St & Polk St,Subscriber,Male,1983.0,41.878419,-87.640977,POINT (-87.640977 41.878419),41.872596,-87.633502,POINT (-87.633502 41.872596)
169537,22000002,03/05/2019 05:18:51 PM,03/05/2019 05:28:42 PM,1694,591,110,Dearborn St & Erie St,66,Clinton St & Lake St,Subscriber,Male,1971.0,41.893992,-87.629318,POINT (-87.629318 41.893992),41.885637,-87.641823,POINT (-87.641823 41.885637)
169538,22000001,03/05/2019 05:18:42 PM,03/05/2019 05:24:17 PM,1464,335,44,State St & Randolph St,174,Canal St & Madison St,Subscriber,Male,1979.0,41.884730,-87.627734,POINT (-87.6277335692 41.8847302006),41.882091,-87.639833,POINT (-87.639833 41.882091)
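
The timestamps in the preview are stored as text (`object` dtype), so they need to be parsed before any time-based feature can be derived. The sketch below uses two rows copied from the preview, assumes the format is month/day/year with a 12-hour clock, and checks whether `TRIP DURATION` matches the elapsed seconds between start and stop. The column name `START HOUR` is a hypothetical example of a derived feature, not one chosen in the notebook.

```python
import pandas as pd

# Two rows copied from the preview above.
sample = pd.DataFrame({
    "START TIME": ["04/03/2019 08:29:54 AM", "04/03/2019 08:29:52 AM"],
    "STOP TIME": ["04/03/2019 08:38:51 AM", "04/03/2019 08:33:36 AM"],
    "TRIP DURATION": [537, 224],
})

# Assumed format: month/day/year, 12-hour clock with AM/PM.
fmt = "%m/%d/%Y %I:%M:%S %p"
start = pd.to_datetime(sample["START TIME"], format=fmt)
stop = pd.to_datetime(sample["STOP TIME"], format=fmt)

# Elapsed time in whole seconds, compared with the recorded duration.
elapsed = (stop - start).dt.total_seconds().astype(int)
matches = elapsed.eq(sample["TRIP DURATION"])
print(matches.all())

# Hypothetical derived feature: hour of day in which the trip started.
sample["START HOUR"] = start.dt.hour
```

For these two rows the recorded duration equals the elapsed seconds, which suggests `TRIP DURATION` is measured in seconds; this should be confirmed on the full dataset before relying on it.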


In [None]:
# drop irrelevant columns/features
# backup
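
One possible shape for this step is sketched below on a two-row toy frame. The choice of columns in `to_drop` is an assumption for illustration only: `Unnamed: 0` looks like a leftover index column, and `FROM LOCATION` repeats the latitude and longitude as a `POINT` string. Which features are actually irrelevant is still an open item on the todo list.

```python
import pandas as pd

# Toy frame with a few of the real columns (values copied from the preview).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "TRIP ID": [22200000, 22199999],
    "TRIP DURATION": [537, 224],
    "FROM LATITUDE": [41.883132, 41.862378],
    "FROM LONGITUDE": [-87.637321, -87.651062],
    "FROM LOCATION": [
        "POINT (-87.637321 41.883132)",
        "POINT (-87.651062 41.862378)",
    ],
})

# Keep an untouched copy so that dropped columns can be restored later.
backup = df.copy()

# Assumed candidates for removal; adjust once the feature selection is decided.
to_drop = ["Unnamed: 0", "FROM LOCATION"]
df = df.drop(columns=to_drop)

print(list(df.columns))
```

Taking the copy before calling `drop` means the backup is independent of any later changes to `df`.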
