# Case Study: How Does a Bike-Share Navigate Speedy Success? 

## Import data and add new variable (from previous step)

In [5]:
# import necessary libraries
import pandas as pd
import numpy as np
import datetime as dt
import glob

In [7]:
# list of csv files
files = glob.glob('trip_data' + "/*.csv")

# joining files with concat and read_csv
df = pd.concat(map(pd.read_csv, files), ignore_index=True)

In [8]:
# convert started_at and ended_at to date time
df['date'] = pd.to_datetime(df['started_at'])
df['ended_at'] = pd.to_datetime(df['ended_at'])

In [9]:
# add ride_length (in minutes)
df['ride_length'] = (df['ended_at'] - df['date']).dt.total_seconds() / 60
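
To make the unit conversion concrete, here is a minimal sketch on two made-up trips (the timestamps are illustrative, not taken from the data). It computes the duration in minutes the same way and then applies a one-minute cutoff like the one used in the next cell.

```python
import pandas as pd

# Two made-up trips: one of 28 min 4 s, one of 30 s
trips = pd.DataFrame({
    "started_at": ["2021-07-07 10:17:26", "2021-07-18 17:45:04"],
    "ended_at": ["2021-07-07 10:45:30", "2021-07-18 17:45:34"],
})

start = pd.to_datetime(trips["started_at"])
end = pd.to_datetime(trips["ended_at"])

# Timedelta -> seconds -> minutes
trips["ride_length"] = (end - start).dt.total_seconds() / 60

# Keep only rides of at least one minute
long_enough = trips[trips["ride_length"] >= 1]
print(long_enough)
```

Only the first trip survives the filter; the 30-second one is dropped.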

In [10]:
# create a new dataframe bike_df with only the relevant rows and columns
bike_df = df[(df['ride_length'] >= 1) & (df['start_lat'] < 44) 
    & (~df['start_station_id'].str.contains('test', case=False, regex=False, na=False)) 
    & (~df['end_station_id'].str.contains('test', case=False, regex=False, na=False))].drop(['ride_id', 'started_at', 'ended_at'], axis=1)
bike_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5799562 entries, 0 to 5900384
Data columns (total 12 columns):
 #   Column              Dtype         
---  ------              -----         
 0   rideable_type       object        
 1   start_station_name  object        
 2   start_station_id    object        
 3   end_station_name    object        
 4   end_station_id      object        
 5   start_lat           float64       
 6   start_lng           float64       
 7   end_lat             float64       
 8   end_lng             float64       
 9   member_casual       object        
 10  date                datetime64[ns]
 11  ride_length         float64       
dtypes: datetime64[ns](1), float64(5), object(6)
memory usage: 575.2+ MB


## Get station info from GBFS feed

The [data source page](https://ride.divvybikes.com/system-data) mentions that live station information is available in the station [GBFS JSON feed](https://gbfs.divvybikes.com/gbfs/gbfs.json). I also noticed many inconsistencies in the station fields (name, id, lat, lng). Moreover, our data has 5.9M entries, so repeating the start/end station details on every row makes the data much heavier. It is therefore better to download and store a separate station info file and keep only the station ID fields in the `bike_df` dataframe.

First, we get the station information from the GBFS feed.

In [11]:
# get station information from GBFS
station_information = pd.read_json('https://gbfs.divvybikes.com/gbfs/en/station_information.json')
station_information

Unnamed: 0,data,last_updated,ttl
stations,[{'station_id': 'a3a3b731-a135-11e9-9cda-0a87a...,1663327106,5


It seems the data we need is stored under `data.stations`. Let's use `pandas.json_normalize` to convert the JSON data into a flat table.
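
As a self-contained illustration of the flattening step, the sketch below applies `pd.json_normalize` to a small hand-written payload shaped like the response above. All station values and the `nested` field are made up for the example; the real feed has many more fields.

```python
import pandas as pd

# Hand-written payload with the same outer shape as the feed (data -> stations)
payload = {
    "data": {
        "stations": [
            {"station_id": "id-1", "name": "Example St & 1st Ave",
             "lat": 41.90, "lon": -87.67, "legacy_id": "17"},
            {"station_id": "id-2", "name": "Example St & 2nd Ave",
             "lat": 41.91, "lon": -87.68, "legacy_id": "18",
             "nested": {"key": "value"}},  # hypothetical nested field
        ]
    },
    "last_updated": 0,
    "ttl": 5,
}

# One row per station; nested dicts become dotted column names
flat = pd.json_normalize(payload["data"]["stations"])
print(flat.columns.tolist())
```

Fields that are missing from a record (here `nested.key` for the first station) simply come out as `NaN`, which is also why some columns in the real table have fewer non-null values than rows.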

In [12]:
# normalize JSON data into a flat table
stations = pd.json_normalize(station_information.data.stations)
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1419 entries, 0 to 1418
Data columns (total 24 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   station_id                            1419 non-null   object 
 1   lon                                   1419 non-null   float64
 2   eightd_station_services               1419 non-null   object 
 3   eightd_has_key_dispenser              1419 non-null   bool   
 4   electric_bike_surcharge_waiver        1419 non-null   bool   
 5   rental_methods                        681 non-null    object 
 6   station_type                          1419 non-null   object 
 7   name                                  1419 non-null   object 
 8   short_name                            681 non-null    object 
 9   external_id                           1419 non-null   object 
 10  region_id                             676 non-null    object 
 11  legacy_id        

There are 24 columns in the `stations` dataframe. Let's look at one entry to understand what they mean.

In [13]:
stations.iloc[0]

station_id                              a3a3b731-a135-11e9-9cda-0a87ae2ba916
lon                                                               -87.673935
eightd_station_services                                                   []
eightd_has_key_dispenser                                               False
electric_bike_surcharge_waiver                                         False
rental_methods                                [KEY, CREDITCARD, TRANSITCARD]
station_type                                                         classic
name                                                 Honore St & Division St
short_name                                                      TA1305000034
external_id                             a3a3b731-a135-11e9-9cda-0a87ae2ba916
region_id                                                                  0
legacy_id                                                                 17
has_kiosk                                                               True

The data page mentions that stations have updated IDs in the GBFS feed, but the previous station identifiers can be found in the `legacy_id` field. I also noticed that some `station_id` values in `bike_df` seem to match the `short_name` field. We will keep only the relevant columns and save the data as a CSV file to use in Tableau.

In [None]:
# keep relevant columns, change column names and save to CSV
# (.copy() so later column assignments do not trigger SettingWithCopyWarning)
stations = stations[['legacy_id', 'short_name', 'name', 'lat', 'lon']].copy()
stations.columns = ['station_id', 'short_name', 'station_name', 'lat', 'lon']
stations.to_csv(r'./station.csv', index=False)

## Update station_id

As mentioned before, the station IDs in `bike_df` are inconsistent. To use them to create a relationship with the stations info file in Tableau later, we will update them to match the `station_id` in the stations file. Overall, we will:
* Replace IDs that match a `short_name` with the corresponding `station_id`
* Where there is no match, use the `station_name` field to find the corresponding ID

First, let's check the `start_station_id` values that match neither `station_id` nor `short_name`.

In [16]:
# check start_station_id values that match neither station_id nor short_name
bike_df[(~bike_df['start_station_id'].isin(stations['station_id'])) 
    & (~bike_df['start_station_id'].isin(stations['short_name']))]['start_station_id'].value_counts()

KA1504000152    10297
20246.0          4004
20252.0          2291
20.0             1863
20254.0          1760
                ...  
786                 1
885                 1
811                 1
779                 1
787                 1
Name: start_station_id, Length: 231, dtype: int64

It seems some IDs picked up a wrong format (a trailing '.0') before they were converted to strings. We will remove the '.0' part before updating the station IDs.

In [17]:
# remove a trailing '.0' in start_station_id (anchored so a '.0' elsewhere in an ID is left alone)
bike_df['start_station_id'] = bike_df['start_station_id'].str.replace(r'\.0$', '', regex=True)
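
One thing to watch with this cleanup: a literal `'.0'` replacement removes that substring wherever it occurs, not only at the end of the ID. The sketch below compares a literal replacement with an anchored regular expression on a few toy IDs. The first three follow the formats seen above; `'A.05'` is a hypothetical ID added only to show the difference.

```python
import pandas as pd

ids = pd.Series(["20246.0", "20.0", "KA1504000152", "A.05"])

# Literal replacement: removes '.0' anywhere in the string
literal = ids.str.replace(".0", "", regex=False)

# Anchored regex: removes '.0' only at the very end of the string
anchored = ids.str.replace(r"\.0$", "", regex=True)

print(pd.DataFrame({"original": ids, "literal": literal, "anchored": anchored}))
```

Both versions agree on the float-looking IDs, but only the anchored one leaves `'A.05'` untouched. Whether the real data contains such IDs is not checked here.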

Now, let's replace the IDs that are actually `short_name` values with the true `station_id`.

In [18]:
# keep only stations with a non-null short_name
stations_short_name = stations[stations['short_name'].notnull()]

# replace short_name with id (the mapping is not named `dict`, which would shadow the built-in)
rows_shortname = bike_df['start_station_id'].isin(stations_short_name['short_name'])
short_name_to_id = stations_short_name.set_index('short_name')['station_id'].to_dict()
bike_df.loc[rows_shortname, 'start_station_id'] = bike_df.loc[rows_shortname, 'start_station_id'].map(short_name_to_id)
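
The lookup-and-replace pattern above can be tried on a tiny example. In this sketch, the first pair (short name `TA1305000034`, legacy ID `17`) comes from the sample entry shown earlier; the other rows are made up. It passes the lookup `Series` straight to `map`, which works the same way as passing a dictionary.

```python
import pandas as pd

stations = pd.DataFrame({
    "station_id": ["17", "20"],
    "short_name": ["TA1305000034", None],  # second station has no short name
})
trips = pd.DataFrame({"start_station_id": ["TA1305000034", "20", "unknown"]})

# Lookup indexed by short_name, with stations lacking a short name removed
lookup = stations.dropna(subset=["short_name"]).set_index("short_name")["station_id"]

# Only touch rows whose ID is actually a short name
is_short = trips["start_station_id"].isin(lookup.index)
trips.loc[is_short, "start_station_id"] = trips.loc[is_short, "start_station_id"].map(lookup)

print(trips)
```

Restricting the update to `is_short` rows matters: mapping the whole column would turn every ID without a short-name match into `NaN`.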

For rows that cannot be matched by `short_name`, we use the station name instead. First, let's check the station names that do not match the station database.

In [19]:
# check the names that do not match
bike_df[(~bike_df['start_station_id'].isin(stations['station_id'])) 
    & (~bike_df['start_station_name'].isin(stations['station_name']))]['start_station_name'].value_counts()

Hyde Park Blvd & 55th St          107
Kenton Ave & Palmer St             99
Kostner Ave & Armitage Ave         93
Hyde Park Blvd & 53rd St           75
Midway Orange Line                 66
                                 ... 
Michigan Ave & 96th St              1
Calumet Ave & 103rd St              1
Keeler Ave & 55th St                1
Lake Park Ave & 44th St             1
Newcastle Ave & Wellington Ave      1
Name: start_station_name, Length: 152, dtype: int64

In [21]:
stations[~stations['station_name'].isin(bike_df['start_station_name'])]['station_name'].value_counts()

Wood St & Webster Ave                         1
Public Rack - Muskegon Ave & 89th St          1
Public Rack - Crandon Ave & 80th St           1
Public Rack - Lawrence Ave & 103rd St         1
Public Rack - 111th St - Morgan Park Metra    1
                                             ..
Public Rack - Homan Ave & 62nd Pl             1
Public Rack - Houston Ave & 91st St           1
Public Rack - East End Ave & 75th St          1
Public Rack - Cottage Grove & 84th St         1
Public Rack - Torrence Ave & 98th St          1
Name: station_name, Length: 472, dtype: int64

It seems some of the names have the prefix 'Public Rack - '. To avoid inconsistency, we will remove this prefix from the name fields of both datasets.

In [22]:
# remove 'Public Rack - ' in name
bike_df['start_station_name'] = bike_df['start_station_name'].str.replace('Public Rack - ', '', regex=False)
stations['station_name'] = stations['station_name'].str.replace('Public Rack - ', '', regex=False)
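
Stripping a prefix can make two previously distinct names identical, and a name-to-ID dictionary can only keep one ID per name. The sketch below shows a quick way to look for such collisions; the station names and IDs are made up, and whether the real feed contains collisions is not checked here.

```python
import pandas as pd

stations = pd.DataFrame({
    "station_id": ["1", "2", "3"],
    "station_name": [
        "Public Rack - Example Ave & 1st St",
        "Example Ave & 1st St",
        "Other St & 2nd Ave",
    ],
})

stations["station_name"] = stations["station_name"].str.replace(
    "Public Rack - ", "", regex=False
)

# Rows whose name now appears more than once
dupes = stations[stations["station_name"].duplicated(keep=False)]
print(dupes)

# A dict built from these names silently keeps the last ID for a repeated name
name_to_id = stations.set_index("station_name")["station_id"].to_dict()
print(name_to_id["Example Ave & 1st St"])
```

If `dupes` is not empty on the real data, it may be worth deciding explicitly which ID should win before building the mapping.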

Now let's update the station id by station name.

In [23]:
# replace start_station_id with the station_id of the station with a matching name
rows_name = bike_df['start_station_name'].isin(stations['station_name']) & (~bike_df['start_station_id'].isin(stations['station_id']))
name_to_id = stations.set_index('station_name')['station_id'].to_dict()
bike_df.loc[rows_name, 'start_station_id'] = bike_df.loc[rows_name, 'start_station_name'].map(name_to_id)

For the remaining unmatched rows, I considered using the `thefuzz` package to fuzzy match them against the station data. But first, let's filter these rows and take a look at them.

In [24]:
fuzzy_name = bike_df['start_station_name'].notnull() & (~bike_df['start_station_id'].isin(stations['station_id']))
bike_df[fuzzy_name]['start_station_name'].value_counts()

Woodlawn Ave & 63rd St - NE               36
Talman Ave & 51st St - midblock           24
Wilton Ave & Diversey Pkwy - Charging     18
Bissell St & Armitage Ave - Charging      17
WEST CHI-WATSON                           16
Kedzie Ave & 61st Pl                      14
Prairie Ave & 47th St - midblock          13
Albany Ave & 63rd St                      12
Pulaski Rd & 41st                          6
Keeler Ave & Madison St                    5
Kedvale Ave & 63rd St                      5
Ashland Ave & 45th St - midblock south     5
Woodlawn Ave & 63rd St - SE                5
DIVVY CASSETTE REPAIR MOBILE STATION       4
NewHastings                                4
Hastings WH 2                              2
Lamon Ave & Archer Ave                     1
Throop/Hastings Mobile Station             1
Name: start_station_name, dtype: int64

In [28]:
# import thefuzz package
# !pip install thefuzz
# !pip install python-Levenshtein -- to remove alert
from thefuzz import process
from thefuzz import fuzz

We will fuzzy match using the `thefuzz` package. I found that the `token_set_ratio` scorer is best suited to our purpose.

In [27]:
# try fuzzy match on one example
process.extractOne('Woodlawn Ave & 63rd St - NE', stations['station_name'], scorer=fuzz.token_set_ratio)

('Woodlawn Ave & 63rd St N', 98, 967)

In [25]:
# loop over the unmatched rows, fuzzy match each name, print the match and update the station_id accordingly
unmatched_names = bike_df.loc[fuzzy_name, 'start_station_name']  # filter once instead of on every iteration
for i, name in unmatched_names.items():
    # get the single closest match
    match = process.extractOne(name, stations['station_name'], scorer=fuzz.token_set_ratio)

    if match[1] > 90:
        print(name, ' | ', match[0])
        bike_df.loc[i, 'start_station_id'] = stations.loc[stations['station_name'] == match[0], 'station_id'].values[0]

Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |

All matches seem legitimate except for the 'Ashland Ave & 45th St - midblock south' entry. Let's take a closer look at this one.

In [26]:
process.extract('Ashland Ave & 45th St - midblock south', stations['station_name'], limit=3, scorer=fuzz.token_set_ratio)

[('Ashland Ave & 45th St', 100, 1000),
 ('Ashland Ave & 45th St  S', 95, 1250),
 ('Ashland Ave & Lake St', 85, 109)]

'Ashland Ave & 45th St  S' seems to be the more precise match. Let's update the station ID for this one accordingly.
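
The perfect score for the shorter name is not surprising: a token-set scorer compares the sets of words, so a candidate whose words all appear in the query can reach 100 even though the query carries extra information. The sketch below is a simplified standard-library illustration of that subset effect, not the actual `thefuzz` implementation; the `tokens` helper is hypothetical.

```python
import re


def tokens(name):
    """Lower-case a name and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


query = "Ashland Ave & 45th St - midblock south"
short = "Ashland Ave & 45th St"
longer = "Ashland Ave & 45th St  S"

# True: every word of the short name also appears in the query
short_is_subset = tokens(short) <= tokens(query)

# False: the extra 's' token is not a word of the query ('south' is)
longer_is_subset = tokens(longer) <= tokens(query)

print(short_is_subset, longer_is_subset)
```

This is why a top score from a token-based scorer still deserves a manual look when the query has a qualifier such as "midblock south".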

In [27]:
bike_df.loc[bike_df['start_station_name'] == 'Ashland Ave & 45th St - midblock south', 'start_station_id'] \
    = stations.loc[stations['station_name'] == 'Ashland Ave & 45th St  S', 'station_id'].item()

Now, let's recheck whether any unmatched rows are left.

In [28]:
fuzzy_name = bike_df['start_station_name'].notnull() & (~bike_df['start_station_id'].isin(stations['station_id']))
bike_df[fuzzy_name]['start_station_name'].value_counts()

WEST CHI-WATSON                         16
DIVVY CASSETTE REPAIR MOBILE STATION     4
NewHastings                              4
Hastings WH 2                            2
Throop/Hastings Mobile Station           1
Lamon Ave & Archer Ave                   1
Name: start_station_name, dtype: int64

Judging from the names, the stations above seem to be used for testing or repair rather than normal rides (except for the last one, which has only one entry). We will remove these entries from our data.

In [29]:
# drop unmatched rows
bike_df.drop(bike_df[fuzzy_name].index, inplace=True)
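
The fuzzy pass above scores every unmatched row, although only a handful of distinct names are involved. Since the same steps are repeated for the end stations, here is a sketch that scores each distinct name once and maps the result back to the rows. To keep it runnable without extra packages it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `thefuzz`, so the scores differ from `token_set_ratio`; the `best_match` helper and the threshold of 80 are assumptions for this example only.

```python
from difflib import SequenceMatcher

import pandas as pd


def best_match(name, candidates):
    """Return (candidate, score 0-100) with the highest similarity to name."""
    scored = [
        (c, 100 * SequenceMatcher(None, name.lower(), c.lower()).ratio())
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


station_names = ["Wilton Ave & Diversey Pkwy", "Bissell St & Armitage Ave*"]
unmatched = pd.Series([
    "Wilton Ave & Diversey Pkwy - Charging",
    "WEST CHI-WATSON",
    "Wilton Ave & Diversey Pkwy - Charging",
])

# Score each distinct name once
lookup = {}
for name in unmatched.unique():
    candidate, score = best_match(name, station_names)
    if score > 80:  # hypothetical threshold for this stand-in scorer
        lookup[name] = candidate

# Map back to rows; names without an accepted match become NaN
resolved = unmatched.map(lookup)
print(resolved)
```

With `thefuzz` installed, `best_match` could be swapped for `process.extractOne` with the scorer and threshold used above.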

### Do the same with end_station_id

In [30]:
# remove a trailing '.0' in end_station_id
bike_df['end_station_id'] = bike_df['end_station_id'].str.replace(r'\.0$', '', regex=True)

# replace short_name with id
rows_shortname = bike_df['end_station_id'].isin(stations_short_name['short_name'])
short_name_to_id = stations_short_name.set_index('short_name')['station_id'].to_dict()
bike_df.loc[rows_shortname, 'end_station_id'] = bike_df.loc[rows_shortname, 'end_station_id'].map(short_name_to_id)

# remove 'Public Rack - ' in name
bike_df['end_station_name'] = bike_df['end_station_name'].str.replace('Public Rack - ', '', regex=False)

# replace end_station_id with the station_id of the station with a matching name
rows_name = bike_df['end_station_name'].isin(stations['station_name']) & (~bike_df['end_station_id'].isin(stations['station_id']))
name_to_id = stations.set_index('station_name')['station_id'].to_dict()
bike_df.loc[rows_name, 'end_station_id'] = bike_df.loc[rows_name, 'end_station_name'].map(name_to_id)

In [31]:
fuzzy_name = bike_df['end_station_name'].notnull() & (~bike_df['end_station_id'].isin(stations['station_id']))
bike_df[fuzzy_name]['end_station_name'].value_counts()

Woodlawn Ave & 63rd St - NE               36
Wilton Ave & Diversey Pkwy - Charging     20
Bissell St & Armitage Ave - Charging      17
Kedzie Ave & 61st Pl                      15
Talman Ave & 51st St - midblock           15
Prairie Ave & 47th St - midblock          13
Albany Ave & 63rd St                       8
WEST CHI-WATSON                            7
DIVVY CASSETTE REPAIR MOBILE STATION       6
Keeler Ave & Madison St                    6
Pulaski Rd & 41st                          6
Kedvale Ave & 63rd St                      5
Ashland Ave & 45th St - midblock south     5
Woodlawn Ave & 63rd St - SE                5
NewHastings                                2
Linder Ave & Archer Ave                    1
Name: end_station_name, dtype: int64

In [32]:
unmatched_names = bike_df.loc[fuzzy_name, 'end_station_name']  # filter once instead of on every iteration
for i, name in unmatched_names.items():
    # get the single closest match
    match = process.extractOne(name, stations['station_name'], scorer=fuzz.token_set_ratio)

    if match[1] > 90:
        print(name, ' | ', match[0])
        bike_df.loc[i, 'end_station_id'] = stations.loc[stations['station_name'] == match[0], 'station_id'].values[0]

Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Wilton Ave & Diversey Pkwy - Charging  |  Wilton Ave & Diversey Pkwy
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bissell St & Armitage Ave*
Bissell St & Armitage Ave - Charging  |  Bis

In [33]:
bike_df.loc[bike_df['end_station_name'] == 'Ashland Ave & 45th St - midblock south', 'end_station_id'] \
    = stations.loc[stations['station_name'] == 'Ashland Ave & 45th St  S', 'station_id'].item()

In [34]:
fuzzy_name = bike_df['end_station_name'].notnull() & (~bike_df['end_station_id'].isin(stations['station_id']))
bike_df[fuzzy_name]['end_station_name'].value_counts()

WEST CHI-WATSON                         7
DIVVY CASSETTE REPAIR MOBILE STATION    6
NewHastings                             2
Linder Ave & Archer Ave                 1
Name: end_station_name, dtype: int64

In [35]:
# drop unmatched rows
bike_df.drop(bike_df[fuzzy_name].index, inplace=True)

## Filling missing station_id

In [36]:
# check the number of missing rows
bike_df.isnull().sum()

rideable_type              0
start_station_name    811138
start_station_id      811135
end_station_name      860418
end_station_id        860418
start_lat                  0
start_lng                  0
end_lat                 5353
end_lng                 5353
member_casual              0
date                       0
ride_length                0
dtype: int64

There are about 811K missing entries in `start_station_id` but none in `start_lat` and `start_lng`. Let's see whether we can recover some `start_station_id` values from the coordinates.

In [37]:
# take a look at missing station id rows
bike_df[bike_df['start_station_id'].isnull()].head()

Unnamed: 0,rideable_type,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,date,ride_length
1180,electric_bike,,,Clark St & Schreiber Ave,453,41.96,-87.69,41.999221,-87.671354,member,2021-07-07 10:17:26,28.066667
1614,electric_bike,,,Lake Shore Dr & Monroe St,76,41.89,-87.63,41.880955,-87.616731,casual,2021-07-18 17:45:04,106.4
1616,electric_bike,,,Southport Ave & Clybourn Ave,307,41.91,-87.64,41.920755,-87.663708,member,2021-07-24 16:23:31,8.383333
1617,electric_bike,,,Southport Ave & Clybourn Ave,307,41.92,-87.67,41.92069,-87.663757,member,2021-07-01 11:53:56,84.933333
1620,electric_bike,,,Lakefront Trail & Bryn Mawr Ave,760,41.9,-87.69,41.984036,-87.652265,casual,2021-07-04 16:12:43,50.333333


The `start_lat` and `start_lng` of the entries with missing IDs have only two decimal places, unlike the other records. Since our data is quite large and we only need the single closest match, `merge_asof` from the pandas package seems suitable.
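
To make the mechanics concrete, here is a toy sketch of a nearest-match join with `pd.merge_asof`: rows are grouped by an integer latitude key (`by`) and matched on the nearest longitude within a tolerance. All coordinates and station IDs are made up, and `lat_key` is just an illustrative name for the integer key.

```python
import pandas as pd

rides = pd.DataFrame({
    "ride": [0, 1, 2],
    "start_lat": [41.94, 41.89, 41.69],
    "start_lng": [-87.65, -87.63, -87.52],
})
stations = pd.DataFrame({
    "station_id": ["A", "B", "C"],
    "lat": [41.9412, 41.8876, 41.9391],
    "lon": [-87.6534, -87.6262, -87.7000],
})

# Integer key: latitude at two decimals, times 100 (rounded before casting)
rides["lat_key"] = (rides["start_lat"] * 100).round().astype(int)
stations["lat_key"] = (stations["lat"].round(2) * 100).round().astype(int)

# Both sides must be sorted on their longitude key
matched = pd.merge_asof(
    rides.sort_values("start_lng"),
    stations.sort_values("lon"),
    left_on="start_lng", right_on="lon",
    by="lat_key",
    direction="nearest",
    tolerance=0.01,
)
print(matched[["ride", "start_lat", "start_lng", "station_id"]])
```

Ride 0 has two candidates in its latitude band and gets the closer one (`A`), ride 1 gets `B`, and ride 2 stays unmatched because no station shares its latitude key. One limitation of this approach is that a station just across a latitude-band boundary is never considered, even if it is the closest one.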

In [29]:
# create an integer key (latitude x 100) to use with `by`; round before casting so float error cannot truncate the key
stations['100xlat'] = (stations['lat'].round(2)*100).round().astype(int)
missing_start = bike_df[bike_df['start_station_id'].isnull()].reset_index()[['index', 'start_lat', 'start_lng']]
missing_start['100xlat'] = (missing_start['start_lat']*100).round().astype(int)

# merge_asof grouped by 100xlat and matched on longitude; tolerance=0.01 limits matches to the two-decimal precision of the longitude
full_start = pd.merge_asof(
    missing_start.sort_values('start_lng'), 
    stations.sort_values('lon'), 
    left_on='start_lng', right_on='lon', 
    by='100xlat', 
    allow_exact_matches=True, 
    direction='nearest', 
    tolerance=0.01)

# check how many missing left
full_start[full_start['station_id'].isnull()]

Unnamed: 0,index,start_lat,start_lng,100xlat,station_id,short_name,station_name,lat,lon
0,5699569,41.94,-87.84,4194,,,,,
2,1870901,41.94,-87.84,4194,,,,,
6,1748303,41.94,-87.84,4194,,,,,
10,1451668,41.94,-87.84,4194,,,,,
13,5735518,41.94,-87.84,4194,,,,,
...,...,...,...,...,...,...,...,...,...
811137,4144733,41.69,-87.52,4169,,,,,
811138,2882816,41.69,-87.52,4169,,,,,
811139,5069943,41.69,-87.52,4169,,,,,
811141,4401258,41.69,-87.52,4169,,,,,


This is much better. Let's fill in the missing IDs in `bike_df`.

In [30]:
full_start.set_index('index', inplace=True)
bike_df['new_start_station_id'] = bike_df['start_station_id'].fillna(full_start['station_id'])
bike_df.isnull().sum()

rideable_type                0
start_station_name      811146
start_station_id        811143
end_station_name        860434
end_station_id          860434
start_lat                    0
start_lng                    0
end_lat                   5353
end_lng                   5353
member_casual                0
date                         0
ride_length                  0
new_start_station_id      6325
dtype: int64
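
The fill step works because `fillna` with a `Series` aligns on index labels, which is why the original row index was carried through the merge and restored with `set_index('index')`. A minimal sketch with made-up IDs and index labels:

```python
import pandas as pd

# Existing IDs with gaps, indexed by original row labels
ids = pd.Series(["43", None, "72", None], index=[0, 5, 7, 9])

# Recovered IDs, in a different order but with matching labels
recovered = pd.Series(["A", "B"], index=[9, 5])

# Each gap is filled from the entry with the same index label
filled = ids.fillna(recovered)
print(filled)
```

Labels that are missing from `recovered`, or that hold `NaN` there, simply stay missing, which matches the small number of nulls left in the new column above.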

In [41]:
# recheck the new start id
bike_df[['start_station_id', 'new_start_station_id']]

Unnamed: 0,start_station_id,new_start_station_id
0,43,43
1,622,622
2,72,72
3,622,622
4,622,622
...,...,...
5900379,245,245
5900380,20,20
5900381,20,20
5900383,51,51


### Do the same with end_station_id

In [42]:
missing_end = bike_df[bike_df['end_station_id'].isnull() & bike_df['end_lat'].notnull()].reset_index()[['index', 'end_lat', 'end_lng']]
missing_end['100xlat'] = (missing_end['end_lat']*100).round().astype(int)  # round before casting to avoid float truncation

full_end = pd.merge_asof(
    missing_end.sort_values('end_lng'), 
    stations.sort_values('lon'), 
    left_on='end_lng', right_on='lon', 
    by='100xlat', 
    allow_exact_matches=True, 
    direction='nearest', 
    tolerance=0.01)

full_end[full_end['station_id'].isnull()]

Unnamed: 0,index,end_lat,end_lng,100xlat,station_id,short_name,station_name,lat,lon
0,3316440,41.39,-88.97,4139,,,,,
1,5084304,41.87,-88.14,4187,,,,,
2,3369513,41.90,-87.98,4190,,,,,
3,3333779,41.90,-87.98,4190,,,,,
4,2883764,41.93,-87.96,4193,,,,,
...,...,...,...,...,...,...,...,...,...
855060,529301,41.68,-87.50,4168,,,,,
855061,593316,41.69,-87.50,4169,,,,,
855062,797295,41.69,-87.50,4169,,,,,
855063,820623,41.68,-87.49,4168,,,,,
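
One detail about the integer key used in these joins: `astype(int)` truncates rather than rounds, so a product that lands just below a whole number because of floating-point error ends up one unit too low. The sketch below shows the effect on two values known to behave this way; they are unrelated to the coordinates, and whether any latitude in the data is affected is not checked here.

```python
import pandas as pd

values = pd.Series([0.29, 0.57])

# 0.29 * 100 is stored as 28.999999999999996, so truncation gives 28
truncated = (values * 100).astype(int)

# Rounding first gives the intended integers
rounded = (values * 100).round().astype(int)

print(pd.DataFrame({"value": values, "truncated": truncated, "rounded": rounded}))
```

Adding `.round()` before the cast is a cheap safeguard when building keys like `100xlat`.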


In [43]:
full_end.set_index('index', inplace=True)
bike_df['new_end_station_id'] = bike_df['end_station_id'].fillna(full_end['station_id'])
bike_df.isnull().sum()

rideable_type                0
start_station_name      811138
start_station_id        811135
end_station_name        860418
end_station_id          860418
start_lat                    0
start_lng                    0
end_lat                   5353
end_lng                   5353
member_casual                0
date                         0
ride_length                  0
new_start_station_id      6324
new_end_station_id       13512
dtype: int64

In [44]:
bike_df[['end_station_id', 'new_end_station_id']]

Unnamed: 0,end_station_id,new_end_station_id
0,365,365
1,285,285
2,125,125
3,92,92
4,217,217
...,...,...
5900379,245,245
5900380,20,20
5900381,20,20
5900383,51,51


## Drop old columns and save data to CSV file

In [45]:
# drop old station columns
drop_cols = ['start_station_id', 'start_station_name', 'start_lat', 'start_lng', 'end_station_id', 'end_station_name', 'end_lat', 'end_lng']
bike_df.drop(drop_cols, axis=1, inplace=True)

In [46]:
# rename columns
bike_df.rename(columns={'new_start_station_id':'start_station_id', 'new_end_station_id':'end_station_id'}, inplace=True)

In [47]:
# check final missing stations values
bike_df.isnull().sum()

rideable_type           0
member_casual           0
date                    0
ride_length             0
start_station_id     6324
end_station_id      13512
dtype: int64

In [48]:
bike_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5799518 entries, 0 to 5900384
Data columns (total 6 columns):
 #   Column            Dtype         
---  ------            -----         
 0   rideable_type     object        
 1   member_casual     object        
 2   date              datetime64[ns]
 3   ride_length       float64       
 4   start_station_id  object        
 5   end_station_id    object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 309.7+ MB


In [49]:
# save bike_df to csv file
bike_df.to_csv(r'/content/drive/MyDrive/Projects/Case study 1/bike_df_map.csv', index=False)

***
This is the end of the station info cleaning step. Next, let's create a Tableau dashboard to present our findings. You can find my dashboard [here](https://public.tableau.com/views/CyclisticBikeShare_16611333780870/Overview?:language=en-US&:display_count=n&:origin=viz_share_link).