# Bike Sharing: Data Visualization Project
## by Maria Cambalova

## Introduction
In this project, I explore the [data on rides made in a bike-sharing system](https://www.lyft.com/bikes/bay-wheels/system-data) using univariate, bivariate and multivariate visualizations. The available data span four years (2017-2020), but I'll use only the data from 2017 and 2018. Each row (record) is one bike ride. The data is anonymized.<br><br>
The data contains the following features:
- __trip duration__ (`duration_sec`): total duration of one ride in seconds
- __start date and time__ (`start_time`): time and date when the ride started
- __end date and time__ (`end_time`): time and date when the ride ended
- __start station ID__ (`start_station_id`): start station identifier
- __start station name__ (`start_station_name`): name of the start station
- __start station latitude__ (`start_station_latitude`): the latitude coordinate of the start station
- __start station longitude__ (`start_station_longitude`): the longitude coordinate of the start station
- __end station ID__ (`end_station_id`): end station identifier
- __end station name__ (`end_station_name`): name of the end station
- __end station latitude__ (`end_station_latitude`): the latitude coordinate of the end station
- __end station longitude__ (`end_station_longitude`): the longitude coordinate of the end station
- __bike ID__ (`bike_id`): the bike identifier
- __user type__ (`user_type`): whether the user is a regular one, i.e. member ('Subscriber') or a casual one ('Customer')
- __Bike Share for All__ (`bike_share_for_all_trip`): whether the ride was within the [Bike Share for All](https://www.lyft.com/bikes/bay-wheels/bike-share-for-all) program, not available for 2017 rides

## Preliminary Wrangling

I'll divide this section into two parts: 
- [Data Gathering](#data_gathering) - download the data and load it into a dataframe
- [Data Wrangling](#data_wrangling) - quickly explore the data, clean if necessary and prepare it for further exploratory visualizations    

But first, import all packages used in data wrangling and subsequent visualizations:

In [1]:
# NumPy and Pandas
import numpy as np
import pandas as pd

# Packages to gather and manipulate files
import glob
import os
import requests
import zipfile

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sb

# Set matplotlib backend
%matplotlib inline

<a name='data_gathering'></a>
### Data Gathering

I'd like to analyze bike-sharing data from the years 2017 and 2018. While there is only one file for 2017, the 2018 data is stored in separate month files (there are 12 files containing bike-sharing data for the year 2018). Therefore, I'll download the data programmatically: 
1. construct the file names using year, month and string '-fordgobike-tripdata.csv.zip' common for all the files
2. construct the full file URLs
3. download and store the files in the folder called 'data'  

When creating the file names using months, the month numbers have to be padded with a leading zero. This can be done with NumPy's [zfill](https://numpy.org/doc/stable/reference/generated/numpy.char.zfill.html) function, as shown here: [Adding leading zeros to strings in NumPy array](https://stackoverflow.com/questions/55376333/adding-leading-zeros-to-strings-in-numpy-array).
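
To make the naming scheme concrete, here is a minimal sketch that only builds the file names and URLs, without downloading anything. The suffix and URL prefix are the ones used in this notebook; the plain-Python `str.zfill` variant is added purely for comparison and is not part of the original approach.

```python
import numpy as np

# Zero-padded month strings '01'..'12' via NumPy, as described in the text
months_np = np.char.zfill(np.arange(1, 13).astype(str), 2)

# Plain-Python equivalent using str.zfill (shown only for comparison)
months_py = [str(m).zfill(2) for m in range(1, 13)]

file_sfx = '-fordgobike-tripdata.csv.zip'
url_pfx = 'https://s3.amazonaws.com/baywheels-data/'

# One file for 2017, twelve monthly files for 2018
file_names = ['2017' + file_sfx] + ['2018' + m + file_sfx for m in months_py]
urls = [url_pfx + name for name in file_names]

print([str(m) for m in months_np] == months_py)
print(len(urls))
print(urls[1])
```

Both variants give the same padded strings, so either could be used to build the 13 file names.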

In [2]:
# Create folder data to store csv files if the folder does not exist
folder_name = 'data'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [3]:
# Create lists of years and months
years = ['2017', '2018']
months = np.char.zfill(np.arange(1, 13).astype(str), 2)

# Testing
# years = ['2017', '2018']
# months = np.char.zfill(np.arange(1, 2).astype(str), 2)

# Every file of interest ends with the following string - keep it in file_sfx
file_sfx = '-fordgobike-tripdata.csv.zip'

# The url address (without the file name)
url_pfx = 'https://s3.amazonaws.com/baywheels-data/'

# Loop over files - start with the year
for year in years:
    # There is only one file collecting all data in case of the year 2017
    if year == '2017':
        file_names = [year + file_sfx]
    # Data for the year 2018 is in separate files according to months
    else:
        file_names = [year + month + file_sfx for month in months]

    for file_name in file_names:
        print('Downloading {}'.format(file_name))
        # Download the file; stop early if the server returns an error status
        response = requests.get(url_pfx + file_name)
        response.raise_for_status()
        # Store the file in the data folder (out_file no longer shadows the file name)
        with open(os.path.join(folder_name, file_name), mode = 'wb') as out_file:
            out_file.write(response.content)

Downloading 2017-fordgobike-tripdata.csv.zip
Downloading 201801-fordgobike-tripdata.csv.zip
Downloading 201802-fordgobike-tripdata.csv.zip
Downloading 201803-fordgobike-tripdata.csv.zip
Downloading 201804-fordgobike-tripdata.csv.zip
Downloading 201805-fordgobike-tripdata.csv.zip
Downloading 201806-fordgobike-tripdata.csv.zip
Downloading 201807-fordgobike-tripdata.csv.zip
Downloading 201808-fordgobike-tripdata.csv.zip
Downloading 201809-fordgobike-tripdata.csv.zip
Downloading 201810-fordgobike-tripdata.csv.zip
Downloading 201811-fordgobike-tripdata.csv.zip
Downloading 201812-fordgobike-tripdata.csv.zip


Next, unzip the compressed files after the data has been successfully downloaded using the [zipfile](https://docs.python.org/3/library/zipfile.html) module; see also [Unzipping files in Python](https://stackoverflow.com/questions/3451111/unzipping-files-in-python). Also, use the [glob](https://docs.python.org/3/library/glob.html) library to retrieve the names of zip files, as learnt in the Data Wrangling part of Data Analyst Nanodegree Program at Udacity.

In [4]:
# Loop over all downloaded files and extract csv files
for zip_file in glob.glob(folder_name + '/*.zip'):
    with zipfile.ZipFile(zip_file, mode = 'r') as file:
        file.extractall(folder_name)
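
The same glob-and-extract pattern can be tried without the real downloads. The sketch below builds a tiny archive in a temporary folder and extracts it; the archive name `sample-tripdata.csv.zip` and its contents are made up for illustration.

```python
import glob
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small archive standing in for a downloaded file (hypothetical name)
    zip_path = os.path.join(tmp_dir, 'sample-tripdata.csv.zip')
    with zipfile.ZipFile(zip_path, mode='w') as archive:
        archive.writestr('sample-tripdata.csv', 'duration_sec,bike_id\n473,3035\n')

    # Extract every archive matched by the glob pattern into the same folder
    for path in glob.glob(os.path.join(tmp_dir, '*.zip')):
        with zipfile.ZipFile(path, mode='r') as archive:
            archive.extractall(tmp_dir)

    # The '*.csv' pattern matches only the extracted file, not the '.csv.zip' archive
    extracted = sorted(os.path.basename(p)
                       for p in glob.glob(os.path.join(tmp_dir, '*.csv')))

print(extracted)
```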

Finally, load the data into a Pandas dataframe `bikes`:

In [5]:
# Read each csv file into its own dataframe
frames = [pd.read_csv(csv_file) for csv_file in glob.glob(folder_name + '/*.csv')]

# Concatenate all dataframes into the target dataframe in one step
bikes = pd.concat(frames, ignore_index = True, axis = 0, sort = False)
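
Since the 2017 file lacks the `bike_share_for_all_trip` column, it is worth seeing how concatenation treats files with different columns. This is a small sketch on two in-memory CSVs with made-up, heavily shortened contents, not the real files.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs standing in for the yearly and monthly files;
# the first one lacks bike_share_for_all_trip, like the 2017 data
csv_2017 = io.StringIO('duration_sec,user_type\n80110,Customer\n')
csv_2018 = io.StringIO('duration_sec,user_type,bike_share_for_all_trip\n'
                       '473,Subscriber,No\n')

frames = [pd.read_csv(f) for f in (csv_2017, csv_2018)]
sample = pd.concat(frames, ignore_index=True, sort=False)

print(sample.shape)
# Rows from the file without the column are filled with NaN
print(sample['bike_share_for_all_trip'].isna().tolist())
```

The missing values in the 2017 rows are what shows up as empty `bike_share_for_all_trip` cells when viewing the combined dataframe.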

<a name='data_wrangling'></a>
### Data Wrangling
Let's examine the bike-sharing data and decide which features would be interesting to look at:

In [6]:
# View the first few lines
bikes.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,bike_share_for_all_trip
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74.0,Laguna St at Hayes St,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96,Customer,
1,78800,2017-12-31 15:56:34.8420,2018-01-01 13:49:55.6170,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,96.0,Dolores St at 15th St,37.76621,-122.426614,88,Customer,
2,45768,2017-12-31 22:45:48.4110,2018-01-01 11:28:36.8830,245.0,Downtown Berkeley BART,37.870348,-122.267764,245.0,Downtown Berkeley BART,37.870348,-122.267764,1094,Customer,
3,62172,2017-12-31 17:31:10.6360,2018-01-01 10:47:23.5310,60.0,8th St at Ringold St,37.77452,-122.409449,5.0,Powell St BART Station (Market St at 5th St),37.783899,-122.408445,2831,Customer,
4,43603,2017-12-31 14:23:14.0010,2018-01-01 02:29:57.5710,239.0,Bancroft Way at Telegraph Ave,37.868813,-122.258764,247.0,Fulton St at Bancroft Way,37.867789,-122.265896,3167,Subscriber,


In [7]:
# View the last few lines
bikes.tail()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,bike_share_for_all_trip
2383416,473,2018-12-01 00:11:54.8110,2018-12-01 00:19:48.5470,345.0,Hubbell St at 16th St,37.766474,-122.398295,81.0,Berry St at 4th St,37.77588,-122.39317,3035,Subscriber,No
2383417,841,2018-12-01 00:02:48.7260,2018-12-01 00:16:49.7660,10.0,Washington St at Kearny St,37.795393,-122.40477,58.0,Market St at 10th St,37.776619,-122.417385,2034,Subscriber,No
2383418,260,2018-12-01 00:05:27.6150,2018-12-01 00:09:47.9560,245.0,Downtown Berkeley BART,37.870139,-122.268422,255.0,Virginia St at Shattuck Ave,37.876573,-122.269528,2243,Subscriber,No
2383419,292,2018-12-01 00:03:06.5490,2018-12-01 00:07:59.0800,93.0,4th St at Mission Bay Blvd S,37.770407,-122.391198,126.0,Esprit Park,37.761634,-122.390648,545,Subscriber,No
2383420,150,2018-12-01 00:03:05.7420,2018-12-01 00:05:36.0260,107.0,17th St at Dolores St,37.763015,-122.426497,119.0,18th St at Noe St,37.761047,-122.432642,4319,Subscriber,No


There are 14 features, all of them described in the introductory section. The last one, `bike_share_for_all_trip`, is present only for the data from 2018, so I won't use it for further exploration.<br><br>
The full data covers bike rides over a period of two years, so it might get pretty big. Let's check it out:

In [8]:
# How big is the dataframe?
bikes.shape

(2383421, 14)

The `bikes` dataframe contains over two million records. Are there any duplicates?

In [9]:
# Check for duplicates
bikes[bikes.duplicated()].shape

(0, 14)

There are no duplicates, so nothing needs fixing.<br><br>
Examine data types:

In [10]:
# Print basic information - datatypes and null values
# (null_counts was replaced by show_counts in pandas 1.2 and removed in pandas 2.0)
bikes.info(null_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2383421 entries, 0 to 2383420
Data columns (total 14 columns):
duration_sec               2383421 non-null int64
start_time                 2383421 non-null object
end_time                   2383421 non-null object
start_station_id           2371650 non-null float64
start_station_name         2371650 non-null object
start_station_latitude     2383421 non-null float64
start_station_longitude    2383421 non-null float64
end_station_id             2371650 non-null float64
end_station_name           2371650 non-null object
end_station_latitude       2383421 non-null float64
end_station_longitude      2383421 non-null float64
bike_id                    2383421 non-null int64
user_type                  2383421 non-null object
bike_share_for_all_trip    1863721 non-null object
dtypes: float64(6), int64(2), object(6)
memory usage: 200.0+ MB


Columns `start_time` and `end_time` are stored as objects (strings). However, I'd like to have information about the year, month, day, day of the week, and hour. These can be retrieved using Pandas Series' [Datetimelike properties](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties). Therefore, I'll convert those two variables to the datetime data type and create the respective columns:

In [11]:
# Convert start time and end time to datetime format
bikes['start_time'] = pd.to_datetime(bikes['start_time'])
bikes['end_time'] = pd.to_datetime(bikes['end_time'])

In [12]:
# Create additional columns - year, month, day, day of week (integer, 0 = Monday, ..., 6 = Sunday), hour
bikes['start_year'] = bikes['start_time'].dt.year
bikes['start_month'] = bikes['start_time'].dt.month
bikes['start_day'] = bikes['start_time'].dt.day
bikes['start_weekday'] = bikes['start_time'].dt.weekday
bikes['start_hour'] = bikes['start_time'].dt.hour
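
As a quick check of what the datetime properties return, here is a small sketch on two start times copied from the rows shown earlier in this notebook. The `weekday_names` list and the `weekday_name` column are my own additions for readability and are not used elsewhere.

```python
import pandas as pd

# Two start times taken from the first and last rows displayed earlier
start = pd.to_datetime(pd.Series(['2017-12-31 16:57:39.6540',
                                  '2018-12-01 00:11:54.8110']))

# .dt.weekday is an integer with Monday = 0 and Sunday = 6
weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
parts = pd.DataFrame({
    'year': start.dt.year,
    'month': start.dt.month,
    'hour': start.dt.hour,
    'weekday': start.dt.weekday,
    'weekday_name': start.dt.weekday.map(lambda d: weekday_names[d]),
})

print(parts)
```

Note that `start_weekday` holds integers rather than day names, so a lookup like the one above is needed wherever labels are wanted.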

The `user_type` column would work better as a categorical variable for visualization purposes:

In [13]:
# View user type variable values and their occurrence
bikes['user_type'].value_counts()

Subscriber    1992784
Customer       390637
Name: user_type, dtype: int64

In [14]:
# Convert user type to categorical variable
level_order = ['Customer', 'Subscriber']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikes['user_type'] = bikes['user_type'].astype(ordered_cat)
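
To illustrate what the conversion changes, here is a minimal sketch on a three-row sample (the sample values are made up). After the conversion, each value is stored as an integer code that points into the fixed, ordered list of categories.

```python
import pandas as pd

# Small made-up sample of user types
user_type = pd.Series(['Subscriber', 'Customer', 'Subscriber'])

level_order = ['Customer', 'Subscriber']
ordered_cat = pd.api.types.CategoricalDtype(categories=level_order, ordered=True)
as_cat = user_type.astype(ordered_cat)

# The categories keep the order given in level_order
print(as_cat.cat.categories.tolist())
# Each value is stored as the position of its category
print(as_cat.cat.codes.tolist())
# With ordered=True, comparisons such as min() follow the category order
print(as_cat.min())
```

A fixed category order is handy for plots, since the levels should then appear in the same order in every chart.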

The dataframe is quite big, so remove the columns that will not be used in the explorations:

In [15]:
# Remove unnecessary columns
bikes = bikes.drop(['start_time', 'end_time', 'start_station_latitude', 'start_station_longitude', 
                    'end_station_latitude', 'end_station_longitude', 'bike_share_for_all_trip'], axis = 1)

Let's verify that the `bikes` dataframe contains the desired columns and that the variables are of the proper data types:

In [16]:
# Check the result - correct datatypes and removal of selected columns
bikes.dtypes

duration_sec             int64
start_station_id       float64
start_station_name      object
end_station_id         float64
end_station_name        object
bike_id                  int64
user_type             category
start_year               int64
start_month              int64
start_day                int64
start_weekday            int64
start_hour               int64
dtype: object

Finally, look at the summaries of numeric variables:

In [18]:
# View basic summary information for numeric variables
bikes[['duration_sec', 'start_year', 'start_month', 'start_day', 'start_weekday', 'start_hour']].describe()

Unnamed: 0,duration_sec,start_year,start_month,start_day,start_weekday,start_hour
count,2383421.0,2383421.0,2383421.0,2383421.0,2383421.0,2383421.0
mean,910.0063,2017.782,7.539704,15.74647,2.611557,13.50274
std,2643.865,0.4129202,3.07904,8.791526,1.84611,4.714829
min,61.0,2017.0,1.0,1.0,0.0,0.0
25%,357.0,2018.0,5.0,8.0,1.0,9.0
50%,564.0,2018.0,8.0,16.0,3.0,14.0
75%,885.0,2018.0,10.0,23.0,4.0,17.0
max,86369.0,2018.0,12.0,31.0,6.0,23.0
