# Exploring Bike Share Data

The goal of this project is to search for usage patterns in New York, Washington DC, and Chicago. We will explore which generations are most likely to use Bike Share, the most common stops for subscribing and non-subscribing customers, and whether there is a relationship between customers' age and trip duration.

In [1]:
# open libraries to access useful data transforming tools
library(dplyr)
library(tidyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [None]:
# read in the 3 table files, 1 for each city
ny = read.csv('new_york_city.csv')
wash = read.csv('washington.csv')
chi = read.csv('chicago.csv')

In [None]:
# explore columns & data available for each table/city
head(ny)

In [None]:
# explore columns & data available for each table/city
head(wash)

In [None]:
# explore columns & data available for each table/city
head(chi)

In [None]:
# change all column names to lowercase
names(ny) <- tolower(names(ny))
names(wash) <- tolower(names(wash))
names(chi) <- tolower(names(chi))

In [None]:
# create views w/ columns dropped to clean up data
ny2 <- subset(ny, select = -c(start.time, end.time, trip.duration))
wash2 <- subset(wash, select = -c(start.time, end.time, trip.duration))
chi2 <- subset(chi, select = -c(start.time, end.time, trip.duration))

In [None]:
# create city "name" column to hold city name
ny$name <- 'New York'
wash$name <- 'DC'
chi$name <- 'Chicago'

In [None]:
# calculate the total rows in each table
nytotal <- nrow(ny)
washtotal <- nrow(wash)
chitotal <- nrow(chi)
print(paste("Total Rows in New York:", nytotal))
print(paste("Total Rows in DC:", washtotal))
print(paste("Total Rows in Chicago:", chitotal))

### First Thoughts:
There appears to be no way to track individual users (no user ID), so some results may be skewed by repeat customers such as commuters. Judging by the row counts, Bike Share does not seem to be as popular in Chicago as in New York and DC.

# Question 1


### **What generations use Bike Share the most?**

## New York

In [None]:
# find statistics for New York w/o nulls
min_age_ny <- min(ny2$birth.year, na.rm = TRUE)
max_age_ny <- max(ny2$birth.year, na.rm = TRUE)
median_age_ny <- median(ny2$birth.year, na.rm = TRUE)

print(paste("The oldest New York customer was born in:", min_age_ny))
print(paste("The youngest New York customer was born in:", max_age_ny))
print(paste("The median year a New York customer was born in was:", median_age_ny))

## Outliers / Errors Causing Discrepancies

The oldest New York customer being born in 1885 does not seem accurate: that would be a 140-year-old riding a bike through the streets of New York. The data must be cleaned to represent our customer base more accurately. Nulls were already excluded from the calculation above via `na.rm = TRUE`.

In [None]:
# create a box plot to see current data spread
boxplot(ny2$birth.year)$out

In [None]:
#find Q1, Q3, and IQR range to appropriately remove outliers
Q1ny <- quantile(ny2$birth.year, .25, na.rm = TRUE)
Q3ny <- quantile(ny2$birth.year, .75, na.rm = TRUE)
IQRny <- IQR(ny2$birth.year, na.rm = TRUE)

# keep only rows whose birth.year lies within 1.5*IQR of Q1 and Q3 (rows with a missing birth.year are dropped too)
ny2.birthyear.outliers <- subset(ny2, birth.year > (Q1ny - 1.5*IQRny) & birth.year < (Q3ny + 1.5*IQRny))

# count the remaining rows and the rows removed
ny2total <- nrow(ny2.birthyear.outliers)
ny2complete <- nytotal - ny2total

print(paste("New Total Rows in New York:", ny2total))
print(paste("Total Rows Removed from New York Dataframe:", ny2complete))

New York started with a total of 54,470 rows. After nulls and outliers have been removed, the new total is 49,398 rows. The following visualizations are based on the 49,398 rows that accurately represent the majority of the New York customer base.
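The 1.5 × IQR rule used above can also be wrapped in a small reusable function. The following is only a sketch: the helper name `iqr_bounds` and the vector `years` are made up for illustration and are not part of the project data.

```r
# hypothetical helper: lower and upper 1.5*IQR fences for a numeric vector
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# made-up birth years, including one implausible value and one missing value
years <- c(1885, 1975, 1980, 1982, 1985, 1988, 1990, 1995, NA)

b <- iqr_bounds(years)

# keep values strictly inside the fences; which() also drops the NA
kept <- years[which(years > b["lower"] & years < b["upper"])]
print(kept)
```

The same function could then be applied to `birth.year` and `trip.duration` for each city instead of repeating the quantile calculations by hand.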

In [None]:
# new box plot to see the spread of birth years after the data has been cleaned up. 
boxplot(ny2.birthyear.outliers$birth.year)$out

In [None]:
# find statistics for New York w/o nulls & outliers
min_age_ny2 <- min(ny2.birthyear.outliers$birth.year, na.rm = TRUE)
max_age_ny2 <- max(ny2.birthyear.outliers$birth.year, na.rm = TRUE)
median_age_ny2 <- median(ny2.birthyear.outliers$birth.year, na.rm = TRUE)

print(paste("The oldest New York customer was born in:", min_age_ny2))
print(paste("The youngest New York customer was born in:", max_age_ny2))
print(paste("The median year a New York customer was born in was:", median_age_ny2))

## Chicago

In [None]:
# find statistics for Chicago w/o nulls
min_age_chi <- min(chi2$birth.year, na.rm = TRUE)
max_age_chi <- max(chi2$birth.year, na.rm = TRUE)
median_age_chi <- median(chi2$birth.year, na.rm = TRUE)

print(paste("The oldest Chicago customer was born in:", min_age_chi))
print(paste("The youngest Chicago customer was born in:", max_age_chi))
print(paste("The median year a Chicago customer was born in was:", median_age_chi))

In [None]:
#find Q1, Q3, and IQR range to appropriately remove outliers
Q1chi <- quantile(chi2$birth.year, .25, na.rm = TRUE)
Q3chi <- quantile(chi2$birth.year, .75, na.rm = TRUE)
IQRchi <- IQR(chi2$birth.year, na.rm = TRUE)

# keep only rows whose birth.year lies within 1.5*IQR of Q1 and Q3 (rows with a missing birth.year are dropped too)
chi2.birthyear.outliers <- subset(chi2, birth.year > (Q1chi - 1.5*IQRchi) & birth.year < (Q3chi + 1.5*IQRchi))

# count the remaining rows and the rows removed
chi2total <- nrow(chi2.birthyear.outliers)
chi2complete <- chitotal - chi2total

print(paste("New Total Rows in Chicago:", chi2total))
print(paste("Total Rows Removed from Chicago Dataframe:", chi2complete))

In [None]:
boxplot(chi2.birthyear.outliers$birth.year)$out

In [None]:
# find statistics for Chicago w/o nulls & outliers
min_age_chi2 <- min(chi2.birthyear.outliers$birth.year, na.rm = TRUE)
max_age_chi2 <- max(chi2.birthyear.outliers$birth.year, na.rm = TRUE)
median_age_chi2 <- median(chi2.birthyear.outliers$birth.year, na.rm = TRUE)

print(paste("The oldest Chicago customer was born in:", min_age_chi2))
print(paste("The youngest Chicago customer was born in:", max_age_chi2))
print(paste("The median year a Chicago customer was born in was:", median_age_chi2))

In [None]:
# create histogram to visualize age group spreads in new york & chicago

nyvis <- ny2.birthyear.outliers$birth.year
chivis <- chi2.birthyear.outliers$birth.year
hist(nyvis, col="blue", xlab='Birth Year', ylab='# of Customers', main='Birth Years of New York & Chicago Customers')
hist(chivis, col="red", add=TRUE)
legend('topleft', c('New York Customers', 'Chicago Customers'),
       fill=c("blue", "red"))

In [None]:
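# count customers born in each 5-year window (1985-1989 and 1980-1984), then convert to percentages
ny1985 <- sum(ny2.birthyear.outliers$birth.year > 1984 & ny2.birthyear.outliers$birth.year < 1990)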
ny1980 <- sum(ny2.birthyear.outliers$birth.year > 1979 & ny2.birthyear.outliers$birth.year < 1985)

chi1985 <- sum(chi2.birthyear.outliers$birth.year > 1984 & chi2.birthyear.outliers$birth.year < 1990)
chi1980 <- sum(chi2.birthyear.outliers$birth.year > 1979 & chi2.birthyear.outliers$birth.year < 1985)

ny85p <- (ny1985 / ny2total) * 100
ny80p <- (ny1980 / ny2total) * 100

ny80r <- format(round(ny80p, 2), nsmall = 2)
ny85r <- format(round(ny85p, 2), nsmall = 2)

print(paste("New York customers born 1980-1984:", ny1980, "making up", ny80r, "% of the population."))
print(paste("New York customers born 1985-1989:", ny1985, "making up", ny85r, "% of the population."))


chi80p <- (chi1980 / chi2total) * 100
chi85p <- (chi1985 / chi2total) * 100

chi80r <- format(round(chi80p, 2), nsmall = 2)
chi85r <- format(round(chi85p, 2), nsmall = 2)

print(paste("Chicago customers born 1980-1984:", chi1980, "making up", chi80r, "% of the population."))
print(paste("Chicago customers born 1985-1989:", chi1985, "making up", chi85r, "% of the population."))

## **Summary - What generations use Bike Share the most?**

The median birth year of New York customers was 1981, and the median for Chicago customers was 1984. The histogram shows that in both cities the largest group of customers was born between 1985 and 1989, with the 1980-1984 group being the second largest. New York's oldest customers were born in 1944 and the youngest in 2001, a 57-year spread. The years 1980-1989 cover only 10 years of that spread, yet customers born in those years make up ~36% of the customer base. Chicago's oldest customers were born in 1955 and the youngest in 2002, a 47-year spread. The years 1980-1989 again cover 10 years of that spread, and customers born in those years make up ~42% of the customer base. Bike Share is most popular with customers born in the 1980s.
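The cohort shares quoted above were counted one window at a time. As an alternative sketch, `cut()` can bin every birth year into 5-year cohorts in one step. The vector `birth_years` below is made-up data standing in for the cleaned `birth.year` column, so the percentages it prints are illustrative only.

```r
# made-up birth years standing in for the cleaned birth.year column
birth_years <- c(1978, 1981, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1996)

# 5-year cohorts [1975,1980), [1980,1985), ...; right = FALSE makes the left edge inclusive
breaks <- seq(1975, 2000, by = 5)
cohort <- cut(birth_years, breaks = breaks, right = FALSE, dig.lab = 4)

# share of riders per cohort, in percent
share <- round(prop.table(table(cohort)) * 100, 1)
print(share)
```

Binning all cohorts at once makes it easier to confirm which window is the largest rather than comparing only the two windows chosen by eye from the histogram.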

# Question 2

### **Are there destination similarities for Non-Subscribing vs Subscribing Customers?**

In [None]:
nysub <- sum(ny$user.type == "Subscriber")
nynonsub <- sum(ny$user.type == "Customer")

washsub <- sum(wash$user.type == "Subscriber")
washnonsub <- sum(wash$user.type == "Customer")

chisub <- sum(chi$user.type == "Subscriber")
chinonsub <- sum(chi$user.type == "Customer")

print(paste("Total New York Non-Subscribers:", nynonsub))
print(paste("Total New York Subscribers:", nysub))
            
print(paste("Total DC Non-Subscribers:", washnonsub))
print(paste("Total DC Subscribers:", washsub))

print(paste("Total Chicago Non-Subscribers:", chinonsub))
print(paste("Total Chicago Subscribers:", chisub))


In [None]:
# percentage of trips in each city that were taken by subscribers
nysubp <- (nysub / nytotal) * 100
washsubp <- (washsub / washtotal) * 100
chisubp <- (chisub / chitotal) * 100

print(paste("New York Customers that are subscribers:", round(nysubp, 2), "%"))
print(paste("DC Customers that are subscribers:", round(washsubp, 2), "%"))
print(paste("Chicago Customers that are subscribers:", round(chisubp, 2), "%"))

### First Thoughts:
The majority of customers are subscribers. This makes sense, as these are densely populated cities with a lot of vehicle traffic.
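The subscriber shares can also be obtained for all user types at once with `table()` and `prop.table()`. This is a sketch on a made-up `user_type` vector, and the missing value in it is an assumption about what the real column might contain. One reason to prefer this form is that `sum(x == "Subscriber")` returns `NA` if `x` contains any `NA`.

```r
# made-up user types; NA mimics a row where the type is missing
user_type <- c("Subscriber", "Subscriber", "Customer", "Subscriber",
               NA, "Customer", "Subscriber", "Subscriber")

# counts per type; useNA = "ifany" keeps missing values visible instead of dropping them
counts <- table(user_type, useNA = "ifany")

# percentage of all rows per type, rounded to 1 decimal place
pct <- round(prop.table(counts) * 100, 1)
print(pct)
```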

## New York

In [None]:
# see the frequency of each end destination

as.data.frame(table(ny$end.station))

In [None]:
# create a subset for non-subscribing customers
nyc2 <- subset(ny, ny$user.type == "Customer")

# create data frame of end.station + count that is limited to non-subscribing customers
nydf <- as.data.frame(table(nyc2$end.station))

# rename columns
colnames(nydf)[1] = "end.station"
colnames(nydf)[2] = "count"

# reorder by highest #s first
nydf <- nydf[order(nydf$count, decreasing=TRUE),]

# drop stations that have no trips (count of 0)
nydf <- subset(nydf, nydf$count > 0)

In [None]:
#create a variable that stores the top 10 visited end.stations
nydf2 <- top_n(nydf,10,count)

#display the top 10 visited end.stations
nydf2

In [None]:
#display a piechart of most frequent end.stations for non-subscribers
pie(nydf2$count, nydf2$end.station, radius=0.5, main="Top 10 Stops of Non-Subscribing New York Customers")

In [None]:
# create a subset for subscribing customers
nys2 <- subset(ny, ny$user.type == "Subscriber")

# create data frame of end.station + count that is limited to subscribing customers
nysf <- as.data.frame(table(nys2$end.station))

# rename columns
colnames(nysf)[1] = "end.station"
colnames(nysf)[2] = "count"

# reorder by highest #s first
nysf <- nysf[order(nysf$count, decreasing=TRUE),]

# drop stations that have no trips (count of 0)
nysf <- subset(nysf, nysf$count > 0)

In [None]:
#create a variable that stores the top 10 visited end.stations of subscribers
nysf2 <- top_n(nysf,10,count)

#display the top 10 visited end.stations
nysf2

In [None]:
#display a piechart of most frequent end.stations for subscribers
pie(nysf2$count, nysf2$end.station, radius=0.5, main="Top 10 Stops of Subscribing New York Customers")

## DC

In [None]:
# see the frequency of each end destination

as.data.frame(table(wash$end.station))

In [None]:
# create a subset for non-subscribing customers
washc2 <- subset(wash, wash$user.type == "Customer")

# create data frame of end.station + count that is limited to non-subscribing customers
washdf <- as.data.frame(table(washc2$end.station))

# rename columns
colnames(washdf)[1] = "end.station"
colnames(washdf)[2] = "count"

# reorder by highest #s first
washdf <- washdf[order(washdf$count, decreasing=TRUE),]

# drop stations that have no trips (count of 0)
washdf <- subset(washdf, washdf$count > 0)

In [None]:
#create a variable that stores the top 10 visited end.stations
washdf2 <- top_n(washdf,10,count)

#display the top 10 visited end.stations
washdf2

In [None]:
#display a piechart of most frequent end.stations for non-subscribers
pie(washdf2$count, washdf2$end.station, radius=0.5, main="Top 10 Stops of Non-Subscribing DC Customers")

In [None]:
# create a subset for subscribing customers
washs2 <- subset(wash, wash$user.type == "Subscriber")

# create data frame of end.station + count that is limited to subscribing customers
washsf <- as.data.frame(table(washs2$end.station))

# rename columns
colnames(washsf)[1] = "end.station"
colnames(washsf)[2] = "count"

# reorder by highest #s first
washsf <- washsf[order(washsf$count, decreasing=TRUE),]

# drop stations that have no trips (count of 0)
washsf <- subset(washsf, washsf$count > 0)

In [None]:
#create a variable that stores the top 10 visited end.stations of subscribers
washsf2 <- top_n(washsf,10,count)

#display the top 10 visited end.stations
washsf2

In [None]:
#display a piechart of most frequent end.stations for subscribers
pie(washsf2$count, washsf2$end.station, radius=0.5, main="Top 10 Stops of Subscribing DC Customers")

## Chicago

In [None]:
# see the frequency of each end destination

as.data.frame(table(chi$end.station))

In [None]:
# create a subset for non-subscribing customers
chic2 <- subset(chi, chi$user.type == "Customer")

# create data frame of end.station + count that is limited to non-subscribing customers
chidf <- as.data.frame(table(chic2$end.station))

# rename columns
colnames(chidf)[1] = "end.station"
colnames(chidf)[2] = "count"

# reorder by highest #s first
chidf <- chidf[order(chidf$count, decreasing=TRUE),]

# drop stations that have no trips (count of 0)
chidf <- subset(chidf, chidf$count > 0)

In [None]:
#create a variable that stores the top 10 visited end.stations
chidf2 <- top_n(chidf,10,count)

#display the top 10 visited end.stations
chidf2

In [None]:
#display a piechart of most frequent end.stations for non-subscribers
pie(chidf2$count, chidf2$end.station, radius=0.5, main="Top 10 Stops of Non-Subscribing Chicago Customers")

In [None]:
# create a subset for subscribing customers
chis2 <- subset(chi, chi$user.type == "Subscriber")

# create data frame of end.station + count that is limited to subscribing customers
chisf <- as.data.frame(table(chis2$end.station))

# rename columns
colnames(chisf)[1] = "end.station"
colnames(chisf)[2] = "count"

# reorder by highest #s first
chisf <- chisf[order(chisf$count, decreasing=TRUE),]

# drop stations that have no trips (count of 0)
chisf <- subset(chisf, chisf$count > 0)

In [None]:
#create a variable that stores the top 10 visited end.stations of subscribers
chisf2 <- top_n(chisf,10,count)

#display the top 10 visited end.stations
chisf2

In [None]:
#display a piechart of most frequent end.stations for subscribers
pie(chisf2$count, chisf2$end.station, radius=0.5, main="Top 10 Stops of Subscribing Chicago Customers")

## **Summary - Are there destination similarities for Non-Subscribing vs Subscribing Customers?**
This question asked whether non-subscribers and subscribers share ending destinations. After comparing the top 10 end stations of each group in each city, no overlap was observed. This suggests that subscribers are typically commuting to work, as seen in the larger numbers of trips to the same destinations, while non-subscribers use Bike Share to visit tourist destinations such as parks and memorials.
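The top-10 lists were compared by eye. One way to check for overlap programmatically is `intersect()`. The station names below are placeholders rather than stations from the data. In the notebook the inputs would be the `end.station` columns of the two top-10 data frames, converted with `as.character()` in case they are factors.

```r
# placeholder station names standing in for the two top-10 end.station columns
top_customers   <- c("Station A", "Station B", "Station C", "Station D")
top_subscribers <- c("Station E", "Station D", "Station F", "Station G")

# stations that appear in both lists
shared <- intersect(top_customers, top_subscribers)

# fraction of the non-subscriber list that also appears in the subscriber list
overlap_share <- length(shared) / length(top_customers)

print(shared)
print(overlap_share)
```

An empty result from `intersect()` would support the claim that the two groups favor different destinations.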

# Question 3

### **Is there any relationship between trip duration and age?**

## New York

In [None]:
min_trip_ny <- min(ny$trip.duration, na.rm = TRUE)
max_trip_ny <- max(ny$trip.duration, na.rm = TRUE)
median_trip_ny <- median(ny$trip.duration, na.rm = TRUE)

# original trip duration is listed in seconds; convert to minutes
min_trip_ny <- format(round((min_trip_ny / 60), 2), nsmall = 2)
max_trip_ny <- format(round((max_trip_ny / 60), 2), nsmall = 2)
median_trip_ny <- format(round((median_trip_ny / 60), 2), nsmall = 2)

print(paste("Shortest trip duration:", min_trip_ny, "minutes"))
print(paste("Longest trip duration:", max_trip_ny, "minutes"))
print(paste("Median trip duration:", median_trip_ny, "minutes"))

In [None]:
# create box plot to see overview of duration data
boxplot(ny$trip.duration / 60)

In [None]:
# outlier discrepancies are present in both birth year and trip duration.

In [None]:
#find Q1, Q3, and IQR range to appropriately remove outliers
Q1ny2 <- quantile(ny$trip.duration, .25, na.rm = TRUE)
Q3ny2 <- quantile(ny$trip.duration, .75, na.rm = TRUE)
IQRny2 <- IQR(ny$trip.duration, na.rm = TRUE)

# keep rows where both trip.duration and birth.year fall within 1.5*IQR of their Q1 and Q3
# (columns are referenced by name so the filter does not depend on ny2 staying row-aligned with ny)
ny.trip.outliers <- subset(ny, trip.duration > (Q1ny2 - 1.5*IQRny2) & trip.duration < (Q3ny2 + 1.5*IQRny2) &
                           birth.year > (Q1ny - 1.5*IQRny) & birth.year < (Q3ny + 1.5*IQRny))

In [None]:
min_trip_ny2 <- min(ny.trip.outliers$trip.duration, na.rm = TRUE)
max_trip_ny2 <- max(ny.trip.outliers$trip.duration, na.rm = TRUE)
median_trip_ny2 <- median(ny.trip.outliers$trip.duration, na.rm = TRUE)

# original trip duration is listed in seconds; convert to minutes
min_trip_ny2 <- format(round((min_trip_ny2 / 60), 2), nsmall = 2)
max_trip_ny2 <- format(round((max_trip_ny2 / 60), 2), nsmall = 2)
median_trip_ny2 <- format(round((median_trip_ny2 / 60), 2), nsmall = 2)

print(paste("Shortest trip duration:", min_trip_ny2, "minutes"))
print(paste("Longest trip duration:", max_trip_ny2, "minutes"))
print(paste("Median trip duration:", median_trip_ny2, "minutes"))

In [None]:
# create box plot to see overview of duration data
boxplot(ny.trip.outliers$trip.duration / 60)

In [None]:
nyx = ny.trip.outliers$birth.year
nyy = ny.trip.outliers$trip.duration / 60

smoothScatter(nyx, nyy, xlab="Birth Year", ylab="Trip Duration in Minutes", main="New York Trip Duration by Age",
              transformation = function(x) x ^ 0.4,
              colramp = colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                                           "#FCFF00", "#FF9400", "#FF3100")))

New York's heatmap covers birth years from 1944 to 2001. While some customers born between 1950 and 1960 took rides of 30+ minutes, most of their trips were around 10 minutes long. Customers born around 1980 to 1990 tend to take longer rides. There appears to be a slight relationship between age and trip duration, as the heat map gets greener around the mid-80s. However, the relationship is most visible at shorter trip durations, around 15 minutes.

## Chicago

In [None]:
min_trip_chi <- min(chi$trip.duration, na.rm = TRUE)
max_trip_chi <- max(chi$trip.duration, na.rm = TRUE)
median_trip_chi <- median(chi$trip.duration, na.rm = TRUE)

# original trip duration is listed in seconds; convert to minutes
min_trip_chi <- format(round((min_trip_chi / 60), 2), nsmall = 2)
max_trip_chi <- format(round((max_trip_chi / 60), 2), nsmall = 2)
median_trip_chi <- format(round((median_trip_chi / 60), 2), nsmall = 2)

print(paste("Shortest trip duration:", min_trip_chi, "minutes"))
print(paste("Longest trip duration:", max_trip_chi, "minutes"))
print(paste("Median trip duration:", median_trip_chi, "minutes"))

In [None]:
# create box plot to see overview of duration data
boxplot(chi$trip.duration / 60)

In [None]:
#find Q1, Q3, and IQR range to appropriately remove outliers
Q1chi2 <- quantile(chi$trip.duration, .25, na.rm = TRUE)
Q3chi2 <- quantile(chi$trip.duration, .75, na.rm = TRUE)
IQRchi2 <- IQR(chi$trip.duration, na.rm = TRUE)

# keep rows where both trip.duration and birth.year fall within 1.5*IQR of their Q1 and Q3
# (columns are referenced by name so the filter does not depend on chi2 staying row-aligned with chi)
chi.trip.outliers <- subset(chi, trip.duration > (Q1chi2 - 1.5*IQRchi2) & trip.duration < (Q3chi2 + 1.5*IQRchi2) &
                            birth.year > (Q1chi - 1.5*IQRchi) & birth.year < (Q3chi + 1.5*IQRchi))

In [None]:
min_trip_chi2 <- min(chi.trip.outliers$trip.duration, na.rm = TRUE)
max_trip_chi2 <- max(chi.trip.outliers$trip.duration, na.rm = TRUE)
median_trip_chi2 <- median(chi.trip.outliers$trip.duration, na.rm = TRUE)

# original trip duration is listed in seconds; convert to minutes
min_trip_chi2 <- format(round((min_trip_chi2 / 60), 2), nsmall = 2)
max_trip_chi2 <- format(round((max_trip_chi2 / 60), 2), nsmall = 2)
median_trip_chi2 <- format(round((median_trip_chi2 / 60), 2), nsmall = 2)

print(paste("Shortest trip duration:", min_trip_chi2, "minutes"))
print(paste("Longest trip duration:", max_trip_chi2, "minutes"))
print(paste("Median trip duration:", median_trip_chi2, "minutes"))

In [None]:
# create box plot to see overview of duration data
boxplot(chi.trip.outliers$trip.duration / 60)

In [None]:
chix = chi.trip.outliers$birth.year
chiy = chi.trip.outliers$trip.duration / 60

smoothScatter(chix, chiy, xlab="Birth Year", ylab="Trip Duration in Minutes", main="Chicago Trip Duration by Age",
              transformation = function(x) x ^ 0.4,
              colramp = colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                                           "#FCFF00", "#FF9400", "#FF3100")))

Chicago's heatmap covers birth years from 1955 to 2002. While some customers born between 1955 and 1970 took rides of 30+ minutes, most of their trips were around 5 minutes long. Customers born around 1990 tend to take longer rides. There appears to be a relationship between age and trip duration, with younger customers riding longer, as seen in the solidity and shape of the heat map.
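The pattern read off the heat map could also be summarized numerically, for instance as the median trip duration per birth decade. The data frame `rides` below is made up for illustration. In the notebook it would be replaced by `chi.trip.outliers` or `ny.trip.outliers`.

```r
# made-up rides: birth year and trip duration in seconds
rides <- data.frame(
  birth.year    = c(1958, 1962, 1975, 1979, 1983, 1986, 1988, 1992, 1995),
  trip.duration = c(480, 600, 540, 660, 720, 900, 780, 840, 600)
)

# label each ride with its birth decade, e.g. 1986 -> 1980
rides$decade <- floor(rides$birth.year / 10) * 10

# median trip duration in minutes per birth decade
median_min <- tapply(rides$trip.duration / 60, rides$decade, median)
print(median_min)
```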

## **Summary - Is there any relationship between trip duration and age?**
There does appear to be a relationship between trip duration and age, with customers born around 1980-1990 taking longer bike trips than older customers. It is more easily seen in Chicago's data, potentially because the dataset is smaller. The steady rise of the heatmap from left to right reflects longer trip durations as birth years approach the 1990s. The minimum, maximum, and median trip durations of the two cities are very close once outliers are removed.
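The relationship described here is judged visually. A rank correlation is one way to put a number on its direction and strength. The sketch below uses made-up vectors, so the resulting coefficient says nothing about the real data. A positive value between birth year and duration would mean that younger customers tend to ride longer.

```r
# made-up data: birth years and trip durations in minutes
birth_year   <- c(1955, 1960, 1968, 1975, 1980, 1984, 1988, 1990)
duration_min <- c(6, 8, 7, 10, 12, 11, 14, 15)

# Spearman's rho only assumes a monotonic trend and is less sensitive to extreme durations
rho <- cor(birth_year, duration_min, method = "spearman", use = "complete.obs")
print(round(rho, 2))
```

In the notebook, the same call could be made on `birth.year` and `trip.duration / 60` from the cleaned data frame of each city.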


In [None]:
system('python -m nbconvert Explore_bikeshare_data.ipynb')