> **Tip**: Welcome to the Investigate a Dataset project! You will find tips in quoted sections like this to help organize your approach to your investigation. Before submitting your project, it will be a good idea to go back through your report and remove these sections to make the presentation of your work as tidy as possible. First things first, you might want to double-click this Markdown cell and change the title so that it reflects your dataset and investigation.

# Project: Investigate a Dataset (Replace this with something more specific!)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a>
    <ul>
        <li><a href="#uni">Univariate exploration</a></li>
        <li><a href="#bi">Bivariate exploration</a></li>
        <li><a href="#multi">Multivariate exploration</a></li>
    </ul>
</li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Tip**: In this section of the report, provide a brief introduction to the dataset you've selected for analysis. At the end of this section, describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables.
>
> If you haven't yet selected and downloaded your data, make sure you do that first before coming back here. If you're not sure what questions to ask right now, then make sure you familiarize yourself with the variables and the dataset context for ideas of what to explore.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# hide warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

In [2]:
# Load the dataset into a pandas DataFrame.
data = pd.read_csv('201902-fordgobike-tripdata.csv')

In [3]:
# Print the number of rows and columns in the dataset
data.shape

(183412, 16)

In [4]:
# Print the first 5 rows of the dataset
data.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


In [5]:
# Basic info about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
duration_sec               183412 non-null int64
start_time                 183412 non-null object
end_time                   183412 non-null object
start_station_id           183215 non-null float64
start_station_name         183215 non-null object
start_station_latitude     183412 non-null float64
start_station_longitude    183412 non-null float64
end_station_id             183215 non-null float64
end_station_name           183215 non-null object
end_station_latitude       183412 non-null float64
end_station_longitude      183412 non-null float64
bike_id                    183412 non-null int64
user_type                  183412 non-null object
member_birth_year          175147 non-null float64
member_gender              175147 non-null object
bike_share_for_all_trip    183412 non-null object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
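
The `info()` output above shows that the station ID and name columns have 183215 non-null values and that `member_birth_year` and `member_gender` have 175147, out of 183412 rows. A per-column count of missing values makes this easier to read before deciding how to clean. The snippet below is a minimal sketch on a small made-up frame that only borrows a few column names from the dataset; the name `sample` and all of its values are illustrative assumptions, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (values are made up; only the column names
# mirror the dataset).
sample = pd.DataFrame({
    "start_station_id": [21.0, np.nan, 86.0, 375.0],
    "member_birth_year": [1984.0, np.nan, 1972.0, np.nan],
    "member_gender": ["Male", np.nan, "Male", np.nan],
    "user_type": ["Customer", "Customer", "Customer", "Subscriber"],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Fraction of missing values in each column
missing_share = sample.isna().mean().round(2)

# Number of rows that a blanket dropna() would remove
rows_with_any_missing = int(sample.isna().any(axis=1).sum())

print(missing_counts)
print(missing_share)
print(rows_with_any_missing)
```

On the real data the same three expressions could be applied to `data` directly, which shows how many rows a blanket `dropna()` would discard before it is run.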


> **Tip**: You should _not_ perform too many operations in each cell. Create cells freely to explore your data. One option that you can take with this project is to do a lot of explorations in an initial notebook. These don't have to be organized, but make sure you use enough comments to understand the purpose of each code cell. Then, after you're done with your analysis, create a duplicate notebook where you will trim the excess and organize your steps so that you have a flowing, cohesive report.

> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).

### Data Cleaning (Replace this with more specific notes!)

In [6]:
# Drop every row that contains at least one null value
data.dropna(axis=0, inplace=True)

In [7]:
# Convert the "start_time" and "end_time" columns to datetime
for x in ["start_time", "end_time"]:
    data[x] = pd.to_datetime(data[x])
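
Once both timestamp columns are datetimes, one possible sanity check is to compare `duration_sec` with the elapsed time between `start_time` and `end_time`. The sketch below uses two rows copied from the `data.head()` output above. The frame name `check` and the idea that `duration_sec` is the elapsed time rounded down to whole seconds are assumptions made for illustration, although they do hold for these two rows.

```python
import numpy as np
import pandas as pd

# Two rows taken from the data.head() output (index 0 and index 4)
check = pd.DataFrame({
    "duration_sec": [52185, 1585],
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 23:54:18.5490"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-03-01 00:20:44.0740"],
})

for col in ["start_time", "end_time"]:
    check[col] = pd.to_datetime(check[col])

# Elapsed time in seconds, rounded down to whole seconds
elapsed = (check["end_time"] - check["start_time"]).dt.total_seconds()
check["elapsed_floor"] = np.floor(elapsed).astype("int64")

# Rows where the recorded duration disagrees with the timestamps
mismatch = check[check["elapsed_floor"] != check["duration_sec"]]
print(len(mismatch))
```

If the same comparison on the full dataset returned a non-empty `mismatch`, those rows would be worth inspecting before using `duration_sec` in the analysis.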

In [8]:
# Convert "user_type" and "member_gender" to nominal (unordered) categorical dtypes.
nom_var_dict = {'user_type': ['Customer', 'Subscriber'],
                'member_gender': ['Male', 'Other', 'Female']}

for var in nom_var_dict:
    nom_var = pd.api.types.CategoricalDtype(categories=nom_var_dict[var], ordered=False)
    data[var] = data[var].astype(nom_var)
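
One thing to keep in mind with this conversion: any value that is not in the listed categories becomes `NaN` without a warning. The following sketch shows the effect on a small made-up series and one way to find the labels that were lost. The names `raw`, `converted`, and `unmatched` and the misspelled label are illustrative assumptions.

```python
import pandas as pd

gender_dtype = pd.api.types.CategoricalDtype(
    categories=["Male", "Other", "Female"], ordered=False
)

# "male" is a deliberately mismatched label for illustration
raw = pd.Series(["Male", "Female", "male", "Other"])
converted = raw.astype(gender_dtype)

# Values that were present before the conversion but are NaN after it
unmatched = raw[converted.isna() & raw.notna()]
print(unmatched.tolist())
```

Running the same check on each converted column of `data` would confirm that the category lists cover every label present in the file.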

In [9]:
# Convert "member_birth_year" to integer (possible now that rows with nulls have been dropped).
data['member_birth_year'] = data['member_birth_year'].astype('int64')

In [10]:
# Data types of variables included in the dataset after conversion
data.dtypes

duration_sec                        int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id                  float64
start_station_name                 object
start_station_latitude            float64
start_station_longitude           float64
end_station_id                    float64
end_station_name                   object
end_station_latitude              float64
end_station_longitude             float64
bike_id                             int64
user_type                        category
member_birth_year                   int64
member_gender                    category
bike_share_for_all_trip            object
dtype: object

In [11]:
# Summary statistics for numeric variables
data.select_dtypes(['int64','float64']).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
duration_sec,174952.0,704.002744,1642.204905,61.0,323.0,510.0,789.0,84548.0
start_station_id,174952.0,139.002126,111.648819,3.0,47.0,104.0,239.0,398.0
start_station_latitude,174952.0,37.77122,0.100391,37.317298,37.770407,37.78076,37.79732,37.880222
start_station_longitude,174952.0,-122.35176,0.117732,-122.453704,-122.411901,-122.398279,-122.283093,-121.874119
end_station_id,174952.0,136.604486,111.335635,3.0,44.0,101.0,238.0,398.0
end_station_latitude,174952.0,37.771414,0.100295,37.317298,37.770407,37.78101,37.797673,37.880222
end_station_longitude,174952.0,-122.351335,0.117294,-122.453704,-122.411647,-122.397437,-122.286533,-121.874119
bike_id,174952.0,4482.587555,1659.195937,11.0,3799.0,4960.0,5505.0,6645.0
member_birth_year,174952.0,1984.803135,10.118731,1878.0,1980.0,1987.0,1992.0,2001.0
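
The minimum `member_birth_year` in the summary above is 1878, which would make that rider about 141 years old in 2019, so the column probably contains entry errors. A minimal sketch of one way to flag such rows follows. The cutoff of 100 years, the names `riders`, `REFERENCE_YEAR`, and `MAX_PLAUSIBLE_AGE`, and the sample values are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Illustrative birth years; 1878 and 2001 are the min and max reported above
riders = pd.DataFrame({"member_birth_year": [1878, 1984, 1972, 1989, 2001]})

REFERENCE_YEAR = 2019      # year the trips were recorded
MAX_PLAUSIBLE_AGE = 100    # assumed cutoff, adjust as needed

riders["age"] = REFERENCE_YEAR - riders["member_birth_year"]

implausible = riders[riders["age"] > MAX_PLAUSIBLE_AGE]
plausible = riders[riders["age"] <= MAX_PLAUSIBLE_AGE]

print(implausible)
```

Whether to drop such rows or only exclude them from age-based plots is a judgment call that should be documented alongside the other cleaning decisions.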


In [12]:
# Summary statistics for variables with object datatype.
data.select_dtypes(['object']).describe().T

Unnamed: 0,count,unique,top,freq
start_station_name,174952,329,Market St at 10th St,3649
end_station_name,174952,329,San Francisco Caltrain Station 2 (Townsend St...,4624
bike_share_for_all_trip,174952,2,No,157606


In [13]:
# Summary statistics for datetime variables
data.select_dtypes(['datetime64[ns]']).describe().T

Unnamed: 0,count,unique,top,freq,first,last
start_time,174952,174941,2019-02-07 17:56:08.897,2,2019-02-01 00:00:20.636,2019-02-28 23:59:18.548
end_time,174952,174939,2019-02-28 17:40:37.328,2,2019-02-01 00:04:52.058,2019-03-01 08:01:55.975


In [14]:
# Summary statistics for variables with category datatype.
data.select_dtypes(['category']).describe().T

Unnamed: 0,count,unique,top,freq
user_type,174952,2,Subscriber,158386
member_gender,174952,3,Male,130500
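
The category summary reports only the most frequent level of each variable. Because `user_type` has two levels and no missing values after cleaning, the remaining count and the shares can be derived from the figures above. The sketch below does that arithmetic; the series name `user_type_counts` is an assumption, and on the real data `data['user_type'].value_counts(normalize=True)` would give the shares directly.

```python
import pandas as pd

TOTAL_ROWS = 174952          # row count after cleaning, from the summary
SUBSCRIBER_ROWS = 158386     # frequency of the top level, from the summary

user_type_counts = pd.Series({
    "Subscriber": SUBSCRIBER_ROWS,
    "Customer": TOTAL_ROWS - SUBSCRIBER_ROWS,
})

# Share of trips made by each user type
user_type_shares = (user_type_counts / user_type_counts.sum()).round(3)
print(user_type_counts)
print(user_type_shares)
```

The imbalance between the two groups is worth remembering when comparing them later, since the smaller group will produce noisier summaries.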


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### What is the structure of your dataset?

> Your answer here!

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

### Research Question 1 (Replace this header name!)

### Research Question 2 (Replace this header name!)

<a id='uni'></a>
### Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, you
include a Markdown cell with comments about what you observed and what
you plan on investigating next.
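
For a strongly right-skewed variable such as `duration_sec` (median 510 seconds against a maximum of 84548 in the summary statistics), equal-width bins put almost every trip into the first bar. One common option is to bin on a logarithmic scale. The sketch below counts a handful of durations in log-spaced bins; the sample values are taken from the summary statistics and the first rows of the data, while the bin edges and the names `durations`, `bin_edges`, and `counts` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A few durations in seconds (min, quartiles, max, and two rows from head())
durations = pd.Series([61, 323, 510, 789, 1585, 36490, 84548], name="duration_sec")

# Bin edges spaced evenly in log10 units: 10^1.5, 10^2.0, ..., 10^5.0
bin_edges = 10 ** np.arange(1.5, 5.0 + 0.5, 0.5)

# Number of trips falling into each log-spaced bin
counts, edges = np.histogram(durations, bins=bin_edges)
print(counts)
```

The same `bin_edges` could be passed to `plt.hist(data['duration_sec'], bins=bin_edges)` followed by `plt.xscale('log')` to draw the histogram on a log axis.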

#### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

#### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

<a id='bi'></a>
### Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).
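
A numeric summary by group is a simple starting point for a bivariate look at one categorical and one numeric variable, for example `user_type` against `duration_sec`. The following is a minimal sketch on made-up numbers; the frame `toy_trips` and its values are illustrative assumptions and say nothing about the real dataset.

```python
import pandas as pd

# Made-up trips, for illustration only
toy_trips = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber", "Subscriber"],
    "duration_sec": [1200, 800, 400, 600, 500],
})

# Count, median, and mean duration for each user type
by_user_type = toy_trips.groupby("user_type")["duration_sec"].agg(
    ["count", "median", "mean"]
)
print(by_user_type)
```

The median is included next to the mean because a skewed variable can have a mean that is pulled far from the typical value by a few long trips.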

#### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

#### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

<a id='multi'></a>
### Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.
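
A pivot table is one way to look at three variables at once, for example the median `duration_sec` broken down by `user_type` and by whether the trip started on a weekend. The sketch below uses made-up trips; the derived column `day_type`, the frame `toy_trips`, and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up trips; 2019-02-02 and 2019-02-03 fall on a weekend
toy_trips = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "start_time": pd.to_datetime([
        "2019-02-02 10:00:00", "2019-02-03 11:00:00", "2019-02-04 08:00:00",
        "2019-02-04 08:30:00", "2019-02-05 17:45:00", "2019-02-02 12:00:00",
    ]),
    "duration_sec": [1500, 1300, 700, 500, 450, 600],
})

# dayofweek is 0 for Monday through 6 for Sunday
toy_trips["day_type"] = np.where(
    toy_trips["start_time"].dt.dayofweek >= 5, "weekend", "weekday"
)

# Median duration for every combination of user type and day type
median_table = toy_trips.pivot_table(
    index="user_type", columns="day_type", values="duration_sec", aggfunc="median"
)
print(median_table)
```

A table like this can be passed to `sns.heatmap(median_table, annot=True)` to display the same numbers as a colored grid.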

#### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

#### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
HTML file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!

<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the analyses you performed. Make sure that you are clear about the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!