### Explore Bike Share Data

In [None]:
ny = read.csv('new_york_city.csv')
wash = read.csv('washington.csv')
chi = read.csv('chicago.csv')
library(ggplot2)

In [None]:
dim(ny)
head(ny,2)

In [None]:
dim(wash)
head(wash, 2)

In [None]:
dim(chi)
head(chi, 2)

In [None]:
## It will be easier to merge our dataframes into one. Store the column names of Chicago, plus a new City column, in a variable.
## Create an empty dataframe with those column names.
x <- c(names(chi), 'City')
cities <- data.frame(matrix(ncol = length(x), nrow = 0))
colnames(cities) <- x
head(cities)

In [None]:
## This function adds any missing columns, fills in the City value, and appends the rows to the global `cities` dataframe.
dataMerge <- function(df_city, city){
    ## Washington needs three columns added (Gender, Birth.Year, City) instead of one.
    if (city == 'Washington DC'){
        df_city <- cbind(df_city, Gender='', Birth.Year='', City=city)
    } else {
        ## Every other city only needs the City column.
        df_city <- cbind(df_city, City = city)
    }
    ## `<<-` updates the `cities` dataframe defined outside the function.
    cities <<- rbind(cities, df_city)
    return (dim(cities))
}
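
> A minimal sketch of the same merge written without the global assignment, in case a side-effect-free version is easier to test. The helper name `add_city`, the `all_cols` argument, and the toy dataframes are assumptions for illustration only; they are not part of the notebook.

```r
# Hypothetical helper: returns the padded dataframe instead of modifying a global.
add_city <- function(df_city, city, all_cols) {
    # Add any columns this city lacks, filled with empty strings as in dataMerge.
    for (col in setdiff(all_cols, names(df_city))) {
        df_city[[col]] <- ''
    }
    df_city$City <- city
    # Return the columns in one consistent order so rbind lines them up.
    df_city[, c(all_cols, 'City')]
}

# Toy stand-ins for the real CSV files.
toy_ny <- data.frame(
    Trip.Duration = c(300, 450),
    Gender = c('Male', 'Female'),
    Birth.Year = c(1985, 1990)
)
toy_wash <- data.frame(Trip.Duration = c(600))
all_cols <- names(toy_ny)

combined <- rbind(
    add_city(toy_ny, 'New York City', all_cols),
    add_city(toy_wash, 'Washington DC', all_cols)
)
dim(combined)
```

> Because each call returns a dataframe, the cities can be bound in a single `rbind` and re-running a cell does not append the same rows twice.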

In [None]:
dataMerge(ny, 'New York City')

In [None]:
dataMerge(wash, 'Washington DC')

In [None]:
dataMerge(chi, 'Chicago')

In [None]:
## Remove rows that contain NA values from the combined dataset
cities <- na.omit(cities)
dim(cities)
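
> One thing worth checking here: `na.omit` only drops rows that contain `NA`, and an empty string is not `NA`. The sketch below uses toy dataframes (assumed for illustration, not the real data) to show that rows padded with `''`, as `dataMerge` does for Washington, are kept.

```r
# Toy frame with a genuine missing value in Birth.Year.
a <- data.frame(Trip.Duration = c(300, 450), Birth.Year = c(1985, NA))

# Toy frame that lacks Birth.Year, padded with an empty string.
b <- data.frame(Trip.Duration = c(600, 900))
b <- cbind(b, Birth.Year = '')

combined <- rbind(a, b)
cleaned <- na.omit(combined)

# Only the row with a real NA is removed; the '' rows remain.
nrow(combined)
nrow(cleaned)
```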

### Question 1


What is the most popular day of the week for people to use a bike share across all three cities combined?

In [None]:
# First convert Start.Time and End.Time to datetime (POSIXct) values.
cities$Start.Time <- as.POSIXct(cities$Start.Time, format='%Y-%m-%d %H:%M:%S', tz='EST')
cities$End.Time <- as.POSIXct(cities$End.Time, format='%Y-%m-%d %H:%M:%S', tz='EST')

## Extract the numeric day of the week and the name of the day from the Start.Time column
## Concatenate the two values into a new Weekday column
cities$Weekday <- paste((as.POSIXlt(cities$Start.Time)$wday + 1), (weekdays(cities$Start.Time)), sep = ' - ')
head(cities, 2)
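
> A small worked example of the weekday label on a single known timestamp (the date is chosen for illustration). `as.POSIXlt(...)$wday` counts from 0 for Sunday, so adding 1 gives 1 through 7; the numeric prefix presumably exists so that the labels sort in calendar order on a plot axis rather than alphabetically.

```r
# 21 June 2017 fell on a Wednesday.
start <- as.POSIXct('2017-06-21 08:15:00', format = '%Y-%m-%d %H:%M:%S', tz = 'EST')

# wday is 0 for Sunday, so Wednesday is 3 and the shifted value is 4.
day_number <- as.POSIXlt(start)$wday + 1

# weekdays() returns the day name in the current locale.
label <- paste(day_number, weekdays(start), sep = ' - ')
label
```

> In an English locale the label reads `4 - Wednesday`.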

In [None]:
## Calculate the counts per weekday
xtabs(~Weekday, data = cities)

In [None]:
# Plot the number of bike checkouts per weekday.
ggplot(cities, aes(Weekday)) +
    geom_bar(color = 'black', fill = '#099DD9', alpha = 0.75) +
    ggtitle('Distribution of the Number of Trips per Weekday') +
    stat_count(aes(y = after_stat(count) + 1000, label = after_stat(count)), geom = "text") +
    labs(x='Weekdays', y='Number of trips in all cities')

**Summary**

> After combining the data into one master dataframe, a frequency count for each weekday is the best representation of this data. It is not too surprising that Saturday and Sunday, being the weekend, had very similar values that were lower than traditional working days. The counts are roughly bell-shaped across the week, with Wednesday having the most bike share checkouts at 24,370.
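
> The busiest day can also be read off programmatically rather than from the plot. A minimal sketch on toy labels in the same `number - name` form as the Weekday column; the vector below is made up for illustration.

```r
# Toy weekday labels, not the real trip data.
toy_days <- c('4 - Wednesday', '2 - Monday', '4 - Wednesday',
              '7 - Saturday', '2 - Monday', '4 - Wednesday')

# Count trips per label, then pick the label with the highest count.
counts <- table(toy_days)
busiest <- names(counts)[which.max(counts)]
busiest
```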

### Question 2

What is the average trip duration in each city?

In [None]:
## Display the summary statistics for trip duration in each city.  Trip.Duration is measured in seconds.
## Dividing that column by 60 shows the values in minutes, which are easier to interpret.

by(cities$Trip.Duration/60, cities$City, summary)

> There are certainly outliers in the data from New York City and Washington.  18K minutes is about 12.5 days, which would make it more of a rent-to-own system than a bike share.  It seems unlikely that a person would do a bike share for longer than 12 hours, or 720 minutes, on any given day.
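
> As a quick unit check on these figures, the sketch below converts minutes to hours and days. The values 720 and 18,000 are round stand-ins for the 12-hour cutoff and the largest durations in the summary output.

```r
# Durations in minutes: the 12-hour cutoff and a roughly 18K-minute outlier.
minutes <- c(720, 18000)

hours <- minutes / 60
days <- hours / 24

data.frame(minutes, hours, days)
```

> So 720 minutes is 12 hours (half a day), and 18,000 minutes is 300 hours, or 12.5 days.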

In [None]:
## Plot the data on separate box plots. Set the upper limit of the y axis to 720 minutes (12 hours) to exclude the extreme outliers.
## Zoom into the portion of the graph that shows the boxes by setting the coord_cartesian limit to 50.
ggplot(cities, aes(x=City, y=Trip.Duration/60, color=City)) +
    geom_boxplot() +
    labs(x='City', y='Length of Trip in Minutes') +
    ggtitle('Boxplots of the Summary Statistics of Trips per City') +
    scale_y_continuous(limits = c(0, 720)) +
    coord_cartesian(ylim = c(0,50)) +
    stat_summary(fun = mean, geom = 'point', shape = 23, size = 10)

**Summary**

> The data is clearly right-skewed for all three cities, with the mean falling in the top half of each box.  Some people rent a bike for several hours at a time; however, the majority of trips last only about 5 to 20 minutes.  Washington DC has the longest trips of the three cities, with the bulk ranging between 6.848 and 20.554 minutes.  New York City and Chicago are quite similar, with the bulk of trips ranging from approximately 5.8 minutes up to 15.9 and 14.9 minutes respectively.
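
> The box edges and the mean can be computed directly as well as read from the plot. A minimal sketch on toy data, assuming the box edges are the first and third quartiles; the city names `A` and `B` and the durations are made up for illustration.

```r
# Toy durations in seconds; city A contains one long outlier.
toy <- data.frame(
    City = rep(c('A', 'B'), each = 5),
    Trip.Duration = c(300, 360, 600, 900, 7200, 240, 300, 480, 600, 660)
)

# Split the durations (in minutes) by city.
by_city <- split(toy$Trip.Duration / 60, toy$City)

# First and third quartiles per city: the edges of each boxplot box.
box_edges <- sapply(by_city, quantile, probs = c(0.25, 0.75))

# Mean and median per city; a mean well above the median signals right skew.
city_means <- sapply(by_city, mean)
city_medians <- sapply(by_city, median)

box_edges
city_means
city_medians
```

> In city A the single two-hour trip pulls the mean (31.2 minutes) above the third quartile (15 minutes), while the median stays at 10 minutes.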

### Question 3

In New York City, what impact does age have on bike share trips?  Additionally, is there a difference between men and women in the age groups?

In [None]:
## Remove rows that contain NA values
ny <- na.omit(ny)

In [None]:
## Create an Age column as 2022 minus Birth.Year
ny$Age <- 2022 - ny$Birth.Year
head(ny, 2)

In [None]:
## Get a feel for this new column
summary(ny$Age)

> The minimum age of 21 seems reasonable; however, the maximum age is too high to be believable.  To err on the side of caution, cap the age at 80.

In [None]:
## Remove records of individuals with an age over 80.
ny <- ny[which(ny$Age <= 80),]

## Summarize the Age and Gender columns
summary(ny$Age)
aggregate(ny$Age, list(ny$Gender), FUN=mean)

> The means for the gender groups are quite close. Plotting the histogram will show if each group has the same distribution.

In [None]:
## Plot a Histogram of the ages of riders broken out by gender. 
ggplot(ny, aes(x=Age, fill = Gender)) +
    geom_histogram(binwidth = 5, alpha = 0.5, position = 'identity') +
    labs(x='Age in Years', y='Number of Bike Share Trips') +
    ggtitle('Histogram of Bike Shares in New York by Age and Gender') +
    scale_fill_manual(labels = c('Undisclosed', 'Female', 'Male'), values = c("#F79420", "#eb17e7", "#099DD9")) + 
    facet_wrap(~Gender)

> This clearly shows that, by sheer volume, men use the bike share system more than women or riders of undisclosed gender.

In [None]:
## Plot Histogram of the proportion or density of the genders
ggplot(ny, aes(x=Age, fill = Gender)) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 5, alpha = 0.5, position = 'identity') +
    labs(x='Age in Years', y='Proportion of Bike Share Trips') +
    ggtitle('Density of Bike Shares in New York by Age and Gender') +
    scale_fill_manual(labels = c('Undisclosed', 'Female', 'Male'), values = c("#F79420", "#eb17e7", "#099DD9")) 


**Summary**

> Age has a marked effect on the number of bike share trips.  People aged 30 to 50 use the system far more than any other age group.  As one might expect, usage drops off sharply as riders get older.  In contrast, there is no clear difference between men, women, and riders of undisclosed gender from a proportional standpoint, meaning gender itself does not alter the age distribution.  In New York City, though, men make far more trips than women, by about 300%.
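
> A percentage like this can be derived from the trip counts per gender. The sketch below uses a toy dataframe (values made up for illustration) and assumes the figure means "percent more trips", so a ratio of 4 to 1 corresponds to 300% more.

```r
# Toy riders; the empty string stands for an undisclosed gender.
toy_ny <- data.frame(
    Gender = c('Male', 'Male', 'Male', 'Female', '', 'Male'),
    Age = c(34, 41, 29, 38, 45, 52)
)

# Trips per gender and each gender's share of all trips.
trip_counts <- table(toy_ny$Gender)
trip_share <- prop.table(trip_counts)

# Ratio of male to female trips, and the same figure as "percent more".
male <- as.numeric(trip_counts['Male'])
female <- as.numeric(trip_counts['Female'])
ratio <- male / female
percent_more <- (ratio - 1) * 100

ratio
percent_more
```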

In [None]:
system('python -m nbconvert Explore_bikeshare_data.ipynb')