# Bay Wheels Bike Data Exploration
## by Mrunal Karkhanis

## Preliminary Wrangling

> Briefly introduce your dataset here.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

> Load in your dataset and describe its properties through the questions below.
Try and motivate your exploration goals through this section.

In [2]:
import urllib
import requests
import os
import io
import xml.etree.ElementTree as ET 
import zipfile
from glob import glob

def loadXML(): 
  
    # url of data
    url = 'https://s3.amazonaws.com/baywheels-data'
  
    # creating HTTP response object from given url 
    response = requests.get(url) 
  
    # saving the xml file 
    with open('baywheels-data.xml', 'wb') as f: 
        f.write(response.content) 

In [3]:
def parse_XML(data): 
    # create Element Tree object
    tree = ET.parse(data)
    # get root element 
    root = tree.getroot() 
    # create list to store file names
    filenames = [] 
    # obtain child nodes of Element Tree
    children = root.getchildren()
    for child in children:
        for element in child:
            if (element.tag) == '{http://s3.amazonaws.com/doc/2006-03-01/}Key':
                name = element.text.encode('utf8') 
                # We use find method to exclude the index.html file
                if name.find('.html') == -1:
                    filenames.append(name)
    return filenames

In [4]:
def createURL(name):
    url_list = []
    for i in name:
        url_list.append("https://s3.amazonaws.com/baywheels-data/"+i)
    return url_list

In [5]:
# Make directory if it doesn't already exist
folder_name = 'baywheels_ride_data'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [6]:
def downloadfile(urllist):
    for url in urllist:
        if url.find('.zip') == -1:
            response = requests.get(url)
            with open(os.path.join(folder_name,url.split('/')[-1]),mode ='wb') as file:
                file.write(response.content)
        else:
            response = requests.get(url)
            csv = zipfile.ZipFile(io.BytesIO(response.content))
            csv.extractall(path = '/Users/mkarkhan/anaconda2/baywheels_ride_data')

In [7]:
def filetodf():
    listitem = []
    for filename in glob('/Users/mkarkhan/anaconda2/baywheels_ride_data/'+'*.csv'):
        with open(filename, 'r') as f:
            df = pd.read_csv(f)
            listitem.append(df)
    bikedata_df = pd.concat(listitem, axis=0, ignore_index=True, sort=True)
    return bikedata_df

In [14]:
def main():
    # load xml from web to a file
    loadXML()
    # parse xml file
    filenames = parse_XML('baywheels-data.xml')
    # build urls from file names
    filenames_url = createURL(filenames)
    # doanload csv files to folder using urls
    downloadfile(filenames_url)
    # imports data from csv files in folder to dataframe
    bikedata_df = filetodf()
    return bikedata_df

In [18]:
bikedata_df = main()

  if __name__ == '__main__':


In [20]:
bikedata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3436488 entries, 0 to 3436487
Data columns (total 16 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             float64
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   object
member_birth_year          float64
member_gender              object
start_station_id           float64
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 object
user_type                  object
dtypes: float64(7), int64(2), object(7)
memory usage: 419.5+ MB


In [21]:
# Check if any duplicates exist
bikedata_df.duplicated().sum()

0

In [22]:
# Check if missing values exist
bikedata_df.isna().sum()

bike_id                         0
bike_share_for_all_trip    519700
duration_sec                    0
end_station_id              12516
end_station_latitude            0
end_station_longitude           0
end_station_name            12516
end_time                        0
member_birth_year          226635
member_gender              226199
start_station_id            12516
start_station_latitude          0
start_station_longitude         0
start_station_name          12516
start_time                      0
user_type                       0
dtype: int64

### Data Assessing:

Quality Issues:
1. start_time and end_time should be datetime datatype
2. start_station_id and end_station_id should be object datatype
3. member_birth_year should be integer datatype
4. Calculate age of member and create new column for it using member_birth_year
5. Calculate distance between stations using latitude and longitude points.

### Data Cleaning

In [23]:
# Create copy of the data before we clean it
bikedata_df_copy = bikedata_df.copy()

#### Define:

Convert start_time and endtime to datetime datatype using to_datetime() function

#### Code:

In [25]:
bikedata_df_copy['start_time'] = pd.to_datetime(bikedata_df_copy.start_time)
bikedata_df_copy['end_time'] = pd.to_datetime(bikedata_df_copy.end_time)

#### Test:

In [26]:
bikedata_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3436488 entries, 0 to 3436487
Data columns (total 16 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             float64
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   datetime64[ns]
member_birth_year          float64
member_gender              object
start_station_id           float64
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 datetime64[ns]
user_type                  object
dtypes: datetime64[ns](2), float64(7), int64(2), object(5)
memory usage: 419.5+ MB


#### Define:

Convert start_station_id and end_station_id to object datatype using astype() function

#### Code:

In [32]:
bikedata_df_copy['start_station_id'] = bikedata_df_copy['start_station_id'].astype('object')
bikedata_df_copy['end_station_id'] = bikedata_df_copy['end_station_id'].astype('object')

#### Test

In [33]:
bikedata_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3436488 entries, 0 to 3436487
Data columns (total 16 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             object
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   datetime64[ns]
member_birth_year          float64
member_gender              object
start_station_id           object
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 datetime64[ns]
user_type                  object
dtypes: datetime64[ns](2), float64(5), int64(2), object(7)
memory usage: 419.5+ MB


#### Define:

Fill missing values with 0000 using fillna() and convert member_birth_year to int using astype() function

#### Code:

In [39]:
# We fill missing values for member birth year with '0000'
bikedata_df_copy['member_birth_year'] = bikedata_df_copy['member_birth_year'].fillna('0000')
# Convert member birth year to int datatype
bikedata_df_copy['member_birth_year'] = bikedata_df_copy['member_birth_year'].astype('int64')

#### Test:

In [40]:
bikedata_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3436488 entries, 0 to 3436487
Data columns (total 16 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             object
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   datetime64[ns]
member_birth_year          int64
member_gender              object
start_station_id           object
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 datetime64[ns]
user_type                  object
dtypes: datetime64[ns](2), float64(4), int64(3), object(7)
memory usage: 419.5+ MB


#### Define:

Calculate age of member by subtracting birth year from current year

#### Code:

In [44]:
bikedata_df_copy['age'] = 2019 - bikedata_df_copy['member_birth_year']

# We replace the age for members whose birth years we do not have with 0
bikedata_df_copy['age'] = bikedata_df_copy['age'].replace(2019,0)

#### Test:

In [45]:
bikedata_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3436488 entries, 0 to 3436487
Data columns (total 17 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             object
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   datetime64[ns]
member_birth_year          int64
member_gender              object
start_station_id           object
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 datetime64[ns]
user_type                  object
age                        int64
dtypes: datetime64[ns](2), float64(4), int64(4), object(7)
memory usage: 445.7+ MB


#### Define:

Defined a function to calculate distance between start and end stations using latitude and longitude with the Haversine formula.

Assigned calculated distance to distance column in dataframe

#### Code:

In [55]:
def distfromlatlong(lat_start,long_start,lat_end,long_end):
    
    # approximate radius of earth in km
    R = 6373.0

    dlon = long_end - long_start
    dlat = lat_end - lat_start

    a = np.sin(dlat / 2)**2 + np.cos(lat_start) * np.cos(lat_end) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    distance = R * c

    return distance
    
bikedata_df_copy['distance'] = distfromlatlong(bikedata_df_copy['start_station_latitude'], bikedata_df_copy['start_station_longitude'], bikedata_df_copy['end_station_latitude'], bikedata_df_copy['end_station_longitude'])

#### Test:

In [56]:
bikedata_df_copy['distance']

0          130.414848
1          110.208813
2          160.366026
3          160.366026
4           95.392913
5          135.320658
6           36.467759
7           51.371323
8           79.441646
9           35.075744
10          35.075744
11          50.968033
12          76.722537
13         176.167459
14          98.494331
15         109.054650
16          33.108054
17          51.131087
18          67.797547
19         282.900648
20          71.226313
21         140.534380
22         110.805746
23         127.568539
24          81.460361
25          46.229203
26          62.509302
27           0.000000
28          69.244730
29         101.294841
              ...    
3436458    200.727509
3436459    102.828609
3436460     58.457084
3436461     51.511821
3436462     61.245524
3436463     61.245524
3436464    115.308631
3436465     46.699158
3436466     55.422154
3436467     61.705591
3436468    187.524348
3436469     42.333251
3436470    244.931663
3436471     25.090531
3436472   

In [57]:
bikedata_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3436488 entries, 0 to 3436487
Data columns (total 18 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             object
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   datetime64[ns]
member_birth_year          int64
member_gender              object
start_station_id           object
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 datetime64[ns]
user_type                  object
age                        int64
distance                   float64
dtypes: datetime64[ns](2), float64(5), int64(4), object(7)
memory usage: 471.9+ MB


### What is the structure of your dataset?

> 

### What is/are the main feature(s) of interest in your dataset?

In [None]:
### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, that you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!