## Wrangling bikeshare data for Capstone 1

## Read the data
### Let's start by reading in a test file of bikeshare data: Quarter 1 of 2016. We need to make sure the start dates and times are parsed as datetimes.  I also want them used as the index for the file, since I'll want to ask time-related questions later on.  What does the head of this data set look like?

In [5]:
import pandas as pd
# Read the trips file with 'Start date' as the index, then parse that index
# (strings such as '3/31/2016 23:59') into datetimes with an explicit format.
bikeshare = pd.read_csv('../Data Wrangling/data/2016-Q1-Trips-History-Data.csv',
                        index_col='Start date')
bikeshare.index = pd.to_datetime(bikeshare.index, format='%m/%d/%Y %H:%M')
print(bikeshare.head())

                     Duration (ms)         End date  Start station number  \
Start date                                                                  
2016-03-31 23:59:00         301295    4/1/2016 0:04                 31280   
2016-03-31 23:59:00         557887    4/1/2016 0:08                 31275   
2016-03-31 23:59:00         555944    4/1/2016 0:08                 31101   
2016-03-31 23:57:00         766916    4/1/2016 0:09                 31226   
2016-03-31 23:57:00         139656  3/31/2016 23:59                 31011   

                                      Start station  End station number  \
Start date                                                                
2016-03-31 23:59:00                  11th & S St NW               31506   
2016-03-31 23:59:00  New Hampshire Ave & 24th St NW               31114   
2016-03-31 23:59:00                  14th & V St NW               31221   
2016-03-31 23:57:00      34th St & Wisconsin Ave NW               31214   
2016-03-31
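To try the datetime index without the full file, here is a minimal sketch on a three-row inline sample. The rows are copied from the head above, but the trimmed column set and the names `csv_text`, `sample`, and `late_rides` are my own assumptions for illustration.

```python
import io
import pandas as pd

# Three rows in the same layout as the trip file (columns trimmed for brevity).
csv_text = """Duration (ms),Start date,End date
301295,3/31/2016 23:59,4/1/2016 0:04
766916,3/31/2016 23:57,4/1/2016 0:09
139656,3/31/2016 23:57,3/31/2016 23:59
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col='Start date')
# Convert the string index into a DatetimeIndex with an explicit format.
sample.index = pd.to_datetime(sample.index, format='%m/%d/%Y %H:%M')

# With a datetime index, rows can be filtered by time.
late_rides = sample[sample.index >= pd.Timestamp('2016-03-31 23:58')]
print(late_rides)
```

Only the 23:59 ride passes the filter, which is the kind of time-related question the index is meant to support.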

### What do the data look like? How many rows? Columns? What types of data are we dealing with?

In [2]:
bikeshare.shape

(552399, 8)

In [3]:
bikeshare.count()

Duration (ms)           552399
End date                552399
Start station number    552399
Start station           552399
End station number      552399
End station             552399
Bike number             552399
Member Type             552399
dtype: int64

In [4]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 552399 entries, 2016-03-31 23:59:00 to 2016-01-01 00:06:00
Data columns (total 8 columns):
Duration (ms)           552399 non-null int64
End date                552399 non-null object
Start station number    552399 non-null int64
Start station           552399 non-null object
End station number      552399 non-null int64
End station             552399 non-null object
Bike number             552399 non-null object
Member Type             552399 non-null object
dtypes: int64(3), object(5)
memory usage: 37.9+ MB
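Note that `End date` is still listed as `object`, meaning it holds plain strings rather than datetimes. If end times are needed later, one possible approach is sketched below on a hypothetical two-row frame (`trips`) that reuses the string format seen in the head; this step is not part of the wrangling above.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the 'End date' strings shown in the head.
trips = pd.DataFrame({'End date': ['4/1/2016 0:04', '3/31/2016 23:59']})
print(trips['End date'].dtype)  # still a string column at this point

# Parse the strings with the same format used for the start dates.
trips['End date'] = pd.to_datetime(trips['End date'], format='%m/%d/%Y %H:%M')
print(trips['End date'].dtype)  # now a datetime column
```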


## Tidy the data
### There are 552,399 rows of data in this sample data set, and 8 columns, plus the index I created from 'Start date'.  Let's tidy up a few things next.
### First, replace column names with something easier to handle in our code.

In [34]:
bikeshare.columns = ['duration','enddate','startlocID','startloc','endlocID','endloc','bikeID','memtype']

In [35]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 552399 entries, 2016-03-31 23:59:00 to 2016-01-01 00:06:00
Data columns (total 8 columns):
duration      552399 non-null float64
enddate       552399 non-null object
startlocID    552399 non-null int64
startloc      552399 non-null object
endlocID      552399 non-null int64
endloc        552399 non-null object
bikeID        552399 non-null object
memtype       552399 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 57.9+ MB
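Assigning a list to `columns` relies on the names being in exactly the right order. As an alternative sketch, a dictionary passed to `rename` ties each new name to its original header explicitly. The empty `demo` frame here is only a stand-in for the real data.

```python
import pandas as pd

# Mapping from the original headers to the short names used in this notebook.
new_names = {
    'Duration (ms)': 'duration',
    'End date': 'enddate',
    'Start station number': 'startlocID',
    'Start station': 'startloc',
    'End station number': 'endlocID',
    'End station': 'endloc',
    'Bike number': 'bikeID',
    'Member Type': 'memtype',
}

# Empty frame with the original headers, standing in for the real data.
demo = pd.DataFrame(columns=list(new_names))
demo = demo.rename(columns=new_names)
print(list(demo.columns))
```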


### That's better! 

### Notice, those durations are in MILLISECONDS (really?!).  

### Let's convert to minutes.

In [36]:
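# 1 ms = 1/60,000 min, approximately 1.66667e-5 min.
# Run this cell only once: re-running it rescales the column again.
bikeshare.duration = bikeshare.duration * 1.66667e-5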

In [37]:
bikeshare.duration.head()

Start date
2016-03-31 23:59:00     5.021593
2016-03-31 23:59:00     9.298135
2016-03-31 23:59:00     9.265752
2016-03-31 23:57:00    12.781959
2016-03-31 23:57:00     2.327605
Name: duration, dtype: float64
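Because an in-place multiplication rescales the column every time the cell runs, a safer pattern is to keep the raw values and write the minutes to a separate column. This is a minimal sketch; the column names `duration_ms` and `duration_min` are hypothetical, and the three durations are copied from the head of the file.

```python
import pandas as pd

MS_PER_MINUTE = 60_000

# Raw durations in milliseconds, taken from the head of the file.
demo = pd.DataFrame({'duration_ms': [301295, 557887, 139656]})

# Writing to a new column leaves the raw values untouched,
# so repeating the conversion gives the same result every time.
demo['duration_min'] = demo['duration_ms'] / MS_PER_MINUTE
demo['duration_min'] = demo['duration_ms'] / MS_PER_MINUTE  # safe to repeat
print(demo.round(2))
```

Dividing by 60,000 is also exact, whereas 1.66667e-5 is a rounded constant that is off by about two parts per million.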

### Are there missing values? No!

In [22]:
bikeshare.isnull().values.any()

False
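A single `True`/`False` answers the question for the whole table. If it had come back `True`, a per-column count would show where the gaps are. The sketch below uses a small made-up frame (`demo`, with invented values) because the real data has no missing values to show.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
demo = pd.DataFrame({'duration': [5.0, np.nan, 2.3],
                     'memtype': ['Registered', 'Casual', None]})

# Count missing values column by column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# The whole-table check used above.
print(demo.isnull().values.any())
```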

### Are there outliers? What is the distribution of ride duration?

In [38]:
bikeshare.describe()

            duration     startlocID       endlocID
count  552399.000000  552399.000000  552399.000000
mean       16.556230   31306.364765   31307.738463
std        34.566415     206.645237     203.722765
min         1.000419   31000.000000   31000.000000
25%         6.192737   31202.000000   31204.000000
50%        10.367454   31246.000000   31246.000000
75%        17.470568   31408.000000   31405.000000
max      1438.439627   32053.000000   32053.000000
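One common way to turn these summary statistics into an outlier check is Tukey's interquartile-range rule, which flags values above Q3 + 1.5 × IQR. The sketch below applies it to the duration quartiles printed above. The 1.5 multiplier is a rule-of-thumb assumption rather than something this notebook prescribes.

```python
# Quartiles of ride duration (minutes), copied from the describe() output.
q1, q3 = 6.192737, 17.470568
max_duration = 1438.439627

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule of thumb for high outliers

print(f"IQR: {iqr:.2f} min, upper fence: {upper_fence:.2f} min")
print(f"Max exceeds fence: {max_duration > upper_fence}")
```

On the full data, the same fence could be used as a filter, for example `bikeshare[bikeshare.duration > upper_fence]`, to inspect the long rides directly.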


## Summary:
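### So far, we have read in the bikeshare test dataset, used a datetime index, and replaced the column headers.  We have then converted milliseconds to minutes to improve interpretability.  Finally, we confirmed there are no missing values and summarized the distribution of ride durations.  The maximum duration (about 1,438 minutes, roughly 24 hours) sits far above the 75th percentile (about 17 minutes), so the longest rides deserve a closer look as possible outliers.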