# Analyzing Bay Wheels December 2019 Data
## by Ibrahim Olayiwola

## Introduction
Bay Wheels is a regional public bicycle sharing system in California's San Francisco Bay Area. It is operated by Motivate in partnership with the Metropolitan Transportation Commission and the Bay Area Air Quality Management District. Bay Wheels is the first regional and large-scale bicycle sharing system deployed in California and on the West Coast of the United States. It was established as Bay Area Bike Share in August 2013. As of January 2018, the Bay Wheels system had over 2,600 bicycles in 262 stations across San Francisco, East Bay and San Jose.

The data used for this analysis covers **December 2019** and can be found [here](https://s3.amazonaws.com/baywheels-data/index.html).

### Preliminary Wrangling

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [2]:
# Load dataset
df = pd.read_csv('201912-baywheels-tripdata.csv')
df.head()


Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,rental_access_method
0,66600,2019-12-31 14:28:50.2860,2020-01-01 08:58:51.2500,364.0,China Basin St at 3rd St,37.772,-122.38997,349.0,Howard St at Mary St,37.78101,-122.405666,12085,Customer,
1,36526,2019-12-31 21:52:47.7620,2020-01-01 08:01:33.9320,38.0,The Embarcadero at Pier 38,37.782926,-122.387921,410.0,Illinois St at Cesar Chavez St,37.7502,-122.386567,9477,Customer,
2,8164,2019-12-31 23:50:04.8770,2020-01-01 02:06:09.4140,14.0,Clay St at Battery St,37.795001,-122.39997,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,10420,Customer,
3,8163,2019-12-31 23:49:21.4000,2020-01-01 02:05:24.6670,14.0,Clay St at Battery St,37.795001,-122.39997,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,2065,Customer,
4,6847,2019-12-31 22:51:05.6850,2020-01-01 00:45:13.4860,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,363.0,Salesforce Transit Center (Natoma St at 2nd St),37.787492,-122.398285,10219,Customer,


In [3]:
# Shape of dataset
df.shape

(150102, 14)

### Structure of dataset
The dataset consists of 150102 rows and 14 columns.

- **duration_sec**: This is the trip duration in seconds.
- **start_time**: This column tells us the start time and date of the trip.
- **end_time**: This is the end time and date of the trip.
- **start_station_id**: The identification number of the station where the trip started.
- **end_station_id**: The identification number of the station where the trip ended.
- **start_station_name**: The name of the station where the trip started.
- **end_station_name**: The name of the station where the trip ended.
- **start_station_latitude and start_station_longitude**: GPS coordinates of the start station.
- **end_station_latitude and end_station_longitude**: GPS coordinates of the end station.
- **bike_id**: Identification number of the bike.
- **user_type**: A user's status (either Subscriber or Customer).
- **rental_access_method**: The method a user used to rent a bike. The website where this dataset was obtained gives no information about this column.

### Assessing Data

#### Visual Assessment
This dataset was also assessed using Google Sheets.


In [4]:
# Viewing random sample of the dataset
df.sample(20)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,rental_access_method
91096,135,2019-12-07 09:28:56.8410,2019-12-07 09:31:12.1470,196.0,Grand Ave at Perkins St,37.808894,-122.25646,195.0,Bay Pl at Vernon St,37.812314,-122.260779,358,Subscriber,
5318,968,2019-12-29 10:04:05.5860,2019-12-29 10:20:13.7390,171.0,Rockridge BART Station,37.844279,-122.2519,181.0,Grand Ave at Webster St,37.811377,-122.265192,12062,Subscriber,
127546,216,2019-12-05 21:40:12,2019-12-05 21:43:49,278.0,The Alameda at Bush St,37.331932,-121.904888,276.0,Julian St at The Alameda,37.332233,-121.912517,173812,Subscriber,app
27156,661,2019-12-19 11:08:28.6750,2019-12-19 11:19:29.9350,14.0,Clay St at Battery St,37.795001,-122.39997,81.0,Berry St at 4th St,37.77588,-122.39317,10784,Customer,
116492,833,2019-12-03 05:00:05.9660,2019-12-03 05:13:59.6590,189.0,Genoa St at 55th St,37.839649,-122.271756,181.0,Grand Ave at Webster St,37.811377,-122.265192,10000,Subscriber,
10778,135,2019-12-25 13:51:14.2210,2019-12-25 13:53:29.3200,188.0,Dover St at 57th St,37.84263,-122.267738,189.0,Genoa St at 55th St,37.839649,-122.271756,461,Customer,
127717,861,2019-12-09 20:52:03,2019-12-09 21:06:25,,,37.332128,-121.884991,,,37.338766,-121.859844,342510,Subscriber,app
119269,394,2019-12-02 13:12:13.8340,2019-12-02 13:18:48.1420,215.0,34th St at Telegraph Ave,37.822547,-122.266318,181.0,Grand Ave at Webster St,37.811377,-122.265192,2779,Subscriber,
95840,416,2019-12-06 08:47:10.5500,2019-12-06 08:54:07.0450,356.0,Valencia St at Clinton Park,37.769188,-122.422285,89.0,Division St at Potrero Ave,37.769218,-122.407646,11900,Customer,
17625,1585,2019-12-21 12:33:03.8190,2019-12-21 12:59:29.1460,483.0,7th St at King St,37.771229,-122.400886,377.0,Fell St at Stanyan St,37.771951,-122.453705,9587,Customer,


#### Programmatic Assessment


In [5]:
# Check basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150102 entries, 0 to 150101
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             150102 non-null  int64  
 1   start_time               150102 non-null  object 
 2   end_time                 150102 non-null  object 
 3   start_station_id         129083 non-null  float64
 4   start_station_name       129087 non-null  object 
 5   start_station_latitude   150102 non-null  float64
 6   start_station_longitude  150102 non-null  float64
 7   end_station_id           128755 non-null  float64
 8   end_station_name         128757 non-null  object 
 9   end_station_latitude     150102 non-null  float64
 10  end_station_longitude    150102 non-null  float64
 11  bike_id                  150102 non-null  int64  
 12  user_type                150102 non-null  object 
 13  rental_access_method     27681 non-null   object 
dtypes: f
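The non-null counts above imply missing values: for example, 150102 - 129083 = 21019 trips have no start_station_id, and 150102 - 27681 = 122421 have no rental_access_method. A minimal sketch of counting missing values directly is shown below. It uses a small made-up frame named `toy` (an assumption for illustration, not part of the notebook) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the trip data (values are illustrative only)
toy = pd.DataFrame({
    'start_station_id': [364.0, np.nan, 14.0, np.nan],
    'start_station_name': ['China Basin St at 3rd St', None,
                           'Clay St at Battery St', None],
    'rental_access_method': [None, 'app', None, 'clipper'],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Share of rows with a missing value per column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Running the same two calls on `df` should reproduce the gaps implied by `df.info()`.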

In [6]:
# Number of unique bike ids
df.bike_id.nunique()

5905

In [7]:
# Number of unique start stations
df.start_station_name.nunique()

422

In [8]:
# Number of unique end stations
df.end_station_name.nunique()

425

In [9]:
# Summary statistics for trip duration
df.duration_sec.describe()

count    150102.000000
mean        790.649752
std        2925.944647
min          60.000000
25%         359.000000
50%         570.000000
75%         886.000000
max      912110.000000
Name: duration_sec, dtype: float64
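Durations in seconds are hard to read at this scale. As a quick worked conversion, the sketch below takes the maximum and median values from the output above and expresses them in more familiar units using only the standard library. The variable names are chosen here for illustration.

```python
from datetime import timedelta

# Values copied from the describe() output above
max_sec = 912110
median_sec = 570

# The longest trip, expressed as days and hh:mm:ss
print(timedelta(seconds=max_sec))   # 10 days, 13:21:50

# The median trip, in minutes
print(median_sec / 60)              # 9.5
```

The longest recorded trip lasts more than ten days, while the median trip is under ten minutes, which suggests the distribution of duration_sec is strongly right-skewed.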

In [10]:
# Unique modes of bike rental access
df.rental_access_method.unique()

array([nan, 'app', 'clipper'], dtype=object)

In [11]:
# Unique user types
df.user_type.unique()

array(['Customer', 'Subscriber'], dtype=object)

From the cells above, we can see that several improvements can be made to the dataset. These include:

- Change the data types of some columns (start_time, end_time, start_station_id, end_station_id, bike_id, user_type)
- Extract the days and hours from the start_time and end_time columns
- Calculate minutes from the duration_sec column
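The last two items can be sketched as follows. This is a minimal example on a two-row toy frame with values taken from the rows shown earlier. The new column names (`start_day`, `start_hour`, `duration_min`) are assumptions made for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Two trips copied from the outputs above
toy = pd.DataFrame({
    'duration_sec': [66600, 135],
    'start_time': ['2019-12-31 14:28:50.2860', '2019-12-07 09:28:56.8410'],
})

# Parse the strings into timestamps so the .dt accessor is available
toy['start_time'] = pd.to_datetime(toy['start_time'])

# Day of week and hour of day at which the trip started
toy['start_day'] = toy['start_time'].dt.day_name()
toy['start_hour'] = toy['start_time'].dt.hour

# Trip duration in minutes
toy['duration_min'] = toy['duration_sec'] / 60

print(toy)
```

The same pattern would apply to end_time.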

### Cleaning Data

In [12]:
# Copying dataset to preserve original data
df_copy = df.copy()
df_copy.head(2)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,rental_access_method
0,66600,2019-12-31 14:28:50.2860,2020-01-01 08:58:51.2500,364.0,China Basin St at 3rd St,37.772,-122.38997,349.0,Howard St at Mary St,37.78101,-122.405666,12085,Customer,
1,36526,2019-12-31 21:52:47.7620,2020-01-01 08:01:33.9320,38.0,The Embarcadero at Pier 38,37.782926,-122.387921,410.0,Illinois St at Cesar Chavez St,37.7502,-122.386567,9477,Customer,


In [13]:
# Drop the latitude, longitude and rental_access_method columns from the copy
df_copy = df_copy.drop(columns=['start_station_latitude', 'start_station_longitude', 'end_station_latitude',
                                'end_station_longitude', 'rental_access_method'])
df_copy.head(2)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,end_station_id,end_station_name,bike_id,user_type
0,66600,2019-12-31 14:28:50.2860,2020-01-01 08:58:51.2500,364.0,China Basin St at 3rd St,349.0,Howard St at Mary St,12085,Customer
1,36526,2019-12-31 21:52:47.7620,2020-01-01 08:01:33.9320,38.0,The Embarcadero at Pier 38,410.0,Illinois St at Cesar Chavez St,9477,Customer


**Define**

1. start_time has data type object. | Change to timestamps.
2. end_time has data type object. | Change to timestamps.
3. start_station_id has data type float. | Change to object.
4. end_station_id has data type float. | Change to object.
5. bike_id has data type int. | Change to object.
6. user_type has data type object. | Change to category.

**Code**

In [14]:
# Change dates to timestamps
df_copy.start_time = pd.to_datetime(df_copy.start_time)
df_copy.end_time = pd.to_datetime(df_copy.end_time)

In [15]:
# Change start_station_id, end_station_id and bike_id to object
df_copy.bike_id = df_copy.bike_id.astype(str)
df_copy.start_station_id = df_copy.start_station_id.astype(str)
df_copy.end_station_id = df_copy.end_station_id.astype(str)

In [16]:
# Change user_type to category
df_copy.user_type = df_copy.user_type.astype('category')

**Test**

In [17]:
# Test by checking datatypes of columns
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150102 entries, 0 to 150101
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   duration_sec        150102 non-null  int64         
 1   start_time          150102 non-null  datetime64[ns]
 2   end_time            150102 non-null  datetime64[ns]
 3   start_station_id    150102 non-null  object        
 4   start_station_name  129087 non-null  object        
 5   end_station_id      150102 non-null  object        
 6   end_station_name    128757 non-null  object        
 7   bike_id             150102 non-null  object        
 8   user_type           150102 non-null  category      
dtypes: category(1), datetime64[ns](2), int64(1), object(5)
memory usage: 9.3+ MB
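One detail in this output is worth noting: start_station_id and end_station_id now report 150102 non-null values, although the original columns had 129083 and 128755. This is because `astype(str)` turns a missing value into the literal string `'nan'` and keeps the trailing `.0` from the float representation. The sketch below illustrates the effect on a toy Series and shows one possible alternative that preserves missing values. It assumes pandas' nullable `Int64` and `string` dtypes are available and is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy station ids, including one missing value
ids = pd.Series([364.0, np.nan, 14.0])

# Plain string cast: NaN becomes the text 'nan', ids keep the '.0' suffix
as_str = ids.astype(str)
print(as_str.tolist())           # ['364.0', 'nan', '14.0']
print(as_str.isna().sum())       # 0

# Alternative: nullable integer first, then nullable string
as_nullable = ids.astype('Int64').astype('string')
print(as_nullable.isna().sum())  # 1
```

With the second approach the ids read as `'364'` and `'14'`, and missing ids remain detectable with `isna()`.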
