In [1]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Introduction

> **This notebook contains my capstone case study for the Google Data Analytics Professional Certificate. I chose the R programming language because it can process large data sets quickly and because I want to learn more about its syntax and implementation.**

**Scenario**

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

**Characters and Teams**

* Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. 

* Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels. 

* Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

* Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

**About the company**

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

**Ask**

**Three questions will guide the future marketing program:**

1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

**You will produce a report with the following deliverables:**

* A clear statement of the business task
* A description of all data sources used
* Documentation of any cleaning or manipulation of data
* A summary of your analysis
* Supporting visualizations and key findings
* Your top three recommendations based on your analysis

# ASK

The stakeholders have decided it is advantageous to convert casual riders into annual members. In order to do so, it is important to understand the similarities and differences between how these two groups use the service. 
My business task is as follows: to analyze the ride data from the last year to describe how casual riders and annual members, as user groups, utilize the Cyclistic service.

**Key Context:**

* Cyclistic founded in 2016
* 5,824 bicycles and 692 stations across Chicago
* 8% use assistive bike options
* 30% use bikes for commute
* Casual users are made up of single-ride and full-day passes
* Annual members more profitable than casual riders

# PREPARE

In this case, I have access to internal Cyclistic ride data from the previous 12 months. Each file in the data set contains one month's worth of data. The data set includes unique ride IDs, bike types, start and end times, start and end locations, and member types. Privacy concerns are minimal because the data set does not include riders' personal or financial information. Licensing is not an issue because it is internal data.

**Let's load the libraries that will be useful for our analysis.**

In [2]:
library(tidyverse)
library(lubridate)
library(tidyr)
library(scales)

In [3]:
# Suppress readr's column-type messages for every read_csv() call below
options(readr.show_col_types = FALSE)
april_2022 <- read_csv('../input/cyclistic-capstone/202204-divvy-tripdata.csv')
march_2022 <- read_csv('../input/cyclistic-capstone/202203-divvy-tripdata.csv')
february_2022 <- read_csv('../input/cyclistic-capstone/202202-divvy-tripdata.csv')
january_2022 <- read_csv('../input/cyclistic-capstone/202201-divvy-tripdata.csv')
december_2021 <- read_csv('../input/cyclistic-capstone/202112-divvy-tripdata.csv')
november_2021 <- read_csv('../input/cyclistic-capstone/202111-divvy-tripdata.csv')
october_2021 <- read_csv('../input/cyclistic-capstone/202110-divvy-tripdata.csv')
september_2021 <- read_csv('../input/cyclistic-capstone/202109-divvy-tripdata.csv')
august_2021 <- read_csv('../input/cyclistic-capstone/202108-divvy-tripdata.csv')
july_2021 <- read_csv('../input/cyclistic-capstone/202107-divvy-tripdata.csv')
june_2021 <- read_csv('../input/cyclistic-capstone/202106-divvy-tripdata.csv')
may_2021 <- read_csv('../input/cyclistic-capstone/202105-divvy-tripdata.csv')

The twelve individual files have the same columns and formatting, so they can be combined without any reshaping. We will concatenate the twelve monthly dataframes into one, for ease of analysis.


In [4]:
MAY2021_to_APR2022 <- do.call('rbind', list(
    april_2022, 
    march_2022,
    february_2022,
    january_2022,
    december_2021,
    november_2021,
    october_2021,
    september_2021,
    august_2021,
    july_2021,
    june_2021,
    may_2021))
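
The twelve separate reads and the manual `rbind` could also be collapsed into a loop over file names. The following is a minimal base R sketch of that idea, not the code used in this notebook: it writes two tiny stand-in CSV files to a temporary directory (their contents are made up for illustration) and then reads and stacks whatever matches the monthly naming pattern.

```r
# Illustrative stand-in for "../input/cyclistic-capstone": two tiny monthly files
# written to a temporary directory (hypothetical contents, not the real data).
data_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(data_dir, "202105-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(data_dir, "202106-divvy-tripdata.csv"), row.names = FALSE)

# Find every monthly file by its naming pattern, read each one, then stack them.
files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
all_rides <- do.call(rbind, monthly)

nrow(all_rides)
```

On the real files, the same pattern should work with `read_csv` in place of `read.csv`, provided every file shares the same columns.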

In [5]:
glimpse(MAY2021_to_APR2022)

In [6]:
head(MAY2021_to_APR2022)

# PROCESS

**If there are any redundant rows, removing them can be helpful. Let's verify.**

In [7]:
Duplicates <- nrow(MAY2021_to_APR2022) - length(unique(MAY2021_to_APR2022[["ride_id"]]))
Duplicates 
sprintf("no. of duplicate rows is %d", Duplicates)

We can continue since there are no duplicate rows to remove.
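
Had the check found duplicates, they could be located and removed with `duplicated()`. The sketch below uses a made-up four-row data frame (the IDs are hypothetical) to show the idea.

```r
# Toy ride IDs (hypothetical); "r2" appears twice.
rides <- data.frame(ride_id = c("r1", "r2", "r2", "r3"),
                    member_casual = c("member", "casual", "casual", "member"),
                    stringsAsFactors = FALSE)

# duplicated() flags every repeat after the first occurrence of an ID.
n_duplicates <- sum(duplicated(rides$ride_id))

# Keep only the first row for each ride_id.
rides_unique <- rides[!duplicated(rides$ride_id), ]

sprintf("no. of duplicate rows is %d", n_duplicates)
```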

It is good practice to check the dataset for NA values. Here, we can count them per column:

In [8]:
colSums(is.na(MAY2021_to_APR2022))

Six of our columns contain NA values. The end station ID column alone has 843,361 of them. The total number of rows containing at least one NA value can be found by dropping those rows and comparing row counts:

In [9]:
cyclistic_droped_na <- drop_na(MAY2021_to_APR2022)
nrow(MAY2021_to_APR2022) - nrow(cyclistic_droped_na)

At least one NA value appears in 1,141,803 rows! In some studies it might be desirable to eliminate all rows with null values as part of the data cleaning procedure. In this scenario, we would lose about 20% of our data and possibly introduce an unidentified bias as a result.

A fuller situational analysis could investigate the reasons behind these null values. Since they are all connected to start and end locations, are they caused by how or when a ride is started or ended? Because these questions cannot be fully answered here, rows with null values will be kept: dropping roughly 20% of the data over a few missing details could reduce the accuracy and quality of the analysis.
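
One way to start investigating the null values is to check whether they cluster in a particular category, such as bike type. The sketch below runs on made-up rows, so the pattern it prints says nothing about the real data; it only shows the mechanics of the check.

```r
# Toy data (hypothetical values): station names are missing for some rides.
rides <- data.frame(
  rideable_type = c("electric_bike", "electric_bike", "classic_bike",
                    "classic_bike", "electric_bike", "docked_bike"),
  start_station_name = c(NA, "Clark St", "Clark St", "Wells St", NA, "Wells St"),
  end_station_name   = c(NA, NA, "Wells St", "Clark St", "Clark St", "Clark St"),
  stringsAsFactors = FALSE
)

# Flag rows that are missing either station name.
rides$missing_station <- is.na(rides$start_station_name) | is.na(rides$end_station_name)

# Share of rides with a missing station, per bike type.
na_share <- tapply(rides$missing_station, rides$rideable_type, mean)
round(na_share * 100)
```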

In [10]:
# Keep the rows with NA values, as decided above
cyclistic <- MAY2021_to_APR2022

In [11]:
head(cyclistic)

We will find it useful to modify the dataframe to incorporate specific features for the analysis. Separate columns for the hour, weekday, month, year, and ride duration will be useful for our analysis. That can be produced using the following code.

In [12]:
#easily usable date column
cyclistic$date <- as.Date(cyclistic$started_at)

#the hour of the day a trip is started
# format() uses the time zone stored with started_at; strftime() would use the session's
cyclistic$start_hour <- format(cyclistic$started_at, "%H")

#the day of the week (numerically and an abbreviation)
cyclistic$weekday <- paste(strftime(cyclistic$date,"%u"), '-', strftime(cyclistic$date, "%a"))

#whether the trip is on a weekday or weekend
cyclistic$day_type <- ifelse(wday(cyclistic$date) %in% c(1, 7),  # 1 = Sunday, 7 = Saturday (lubridate default)
                                   'weekend',
                                   'weekday')
#the month and year of the trip
cyclistic$month_year <- paste(strftime(cyclistic$date, "%m"), '-', strftime(cyclistic$date, "%Y"))

#length of the ride in seconds
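# request seconds explicitly; plain subtraction lets R pick the unit automatically
cyclistic$duration_s <- as.numeric(difftime(cyclistic$ended_at, cyclistic$started_at, units = "secs"))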

#length of the ride in minutes
cyclistic$duration_min <- cyclistic$duration_s/60
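
Subtracting two date-time vectors returns a `difftime` whose unit R chooses from the data, so it is worth pinning the unit whenever a column is meant to hold seconds. The sketch below uses two made-up rides to show the difference.

```r
# Two hypothetical rides lasting 10 minutes and 2 hours.
started_at <- as.POSIXct(c("2022-04-01 08:00:00", "2022-04-01 09:00:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2022-04-01 08:10:00", "2022-04-01 11:00:00"), tz = "UTC")

# Plain subtraction lets R choose the unit; here the shortest gap is 10 minutes,
# so the result is expressed in minutes rather than seconds.
auto_diff <- ended_at - started_at
units(auto_diff)

# Requesting seconds explicitly gives the same unit no matter what the data holds.
duration_s <- as.numeric(difftime(ended_at, started_at, units = "secs"))
duration_min <- duration_s / 60
duration_s
```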

In [13]:
print(unique(cyclistic$duration_s < 0))

Our dataframe contains rides with a negative duration. This is impossible, so let's drop these rows. The filter below keeps only rides with a duration greater than zero, so zero-length rides are removed as well.


In [14]:
print(paste('before removing negative durations, our dataframe had', nrow(cyclistic), 'rows.'))
cyclistic <- cyclistic[cyclistic$duration_s > 0,]
print(paste('after removing negative durations, our dataframe has', nrow(cyclistic), 'rows.'))

Let's get a final view of what our data looks like as we go into the analysis stage.



In [15]:
head(cyclistic)

In [16]:
print(unique(is.na(cyclistic)))

Let's save a version of our cleaned data as a CSV file.

In [None]:
cyclistic %>%
    write.csv("cyclistic_cleaned.csv")

# ANALYSIS

Let's begin to reshape and analyze the data so that we can answer the questions at the core of our business task. We can start with the big picture.

**How are member rides and casual rides distributed across the full dataset?**

In [None]:
# "dist" holds the number of casual and member rides in the dataset and their percentages
dist <- cyclistic %>%
            group_by(member_casual) %>%
            summarize(count = length(ride_id),
                     percent = round((length(ride_id) / nrow(cyclistic)) * 100 ))
dist
            

In [None]:
#A function we use to resize our visualizations
display <- function(width, height){options(repr.plot.width = width, repr.plot.height = height)}

In [None]:
display(8,8)
# Get the positions
df2 <- dist %>% 
  mutate(csum = rev(cumsum(rev(count))), 
         pos = count/2 + lead(csum, 1),
         pos = if_else(is.na(pos), count/2, pos))

ggplot(df2, aes(x="", y=count, fill= member_casual)) +
  geom_bar(stat="identity", width=1, color= 'white') +
  coord_polar("y", start=0) +
theme_void(base_size = 20) +
labs(title= "Fig. 1: Total Rides: Member vs Casual")+
geom_text(aes(y = pos, label = paste0(percent, "%") ), color = "black", size=6) 


So we can see that a majority of our data (and therefore a majority of the rides in the 12 months of data) are member rides.

**How are the rides of annual members and casual riders distributed across the days of the week?**

In [None]:
by_days <- cyclistic %>% 
    group_by(weekday) %>% 
    summarise(count = length(ride_id),
              percent_of_all_rides = round((length(ride_id) / nrow(cyclistic)) * 100),
              percent_members = round((sum(member_casual == "member") / length(ride_id)) * 100),
              percent_casual = round((sum(member_casual == "casual") / length(ride_id)) * 100))
by_days

In [None]:
display(20, 10)
ggplot(cyclistic, aes(weekday, fill=member_casual)) +
    geom_bar() +
    labs(x="Weekday", title="Fig. 2: Distribution by Weekday") +
    theme_grey(base_size = 20)

In [None]:
by_day_type <- cyclistic %>% 
    group_by(day_type) %>% 
    summarise(count = length(ride_id),
              percent_of_all_rides = (length(ride_id) / nrow(cyclistic)) * 100,
              percent_members = (sum(member_casual == "member") / length(ride_id)) * 100,
             percent_casual = (sum(member_casual == "casual") / length(ride_id)) * 100)
by_day_type

In [24]:
display(12, 8)
ggplot(cyclistic, aes(day_type, fill=member_casual)) +
    geom_bar() +
    labs(x="Day Type", title="Fig. 2b: Distribution by Day Type") +
    theme_grey(base_size = 12)+
     scale_y_continuous(labels = scales::comma)

This analysis shows that ***most rides on weekdays are taken by members***. The situation is reversed on ***weekends, when a majority of rides are taken by casual riders***. This highlights a crucial distinction between the user groups: annual members may use the bikes more regularly for commuting, whereas casual riders may be more likely to use them for leisure. Additionally, if the objective is to convert casual riders, it will be helpful to track the times and locations where casual riders are most numerous in order to target promotions. Among the days of the week, weekends are the best time to reach the most casual riders.
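
The member and casual shares within each day type can be read directly from a row-wise percentage table. The sketch below uses eight made-up rides, so the percentages it produces are illustrative only.

```r
# Toy rides (hypothetical): day type and rider type for eight rides.
rides <- data.frame(
  day_type = c("weekday", "weekday", "weekday", "weekday", "weekday",
               "weekend", "weekend", "weekend"),
  member_casual = c("member", "member", "member", "casual", "member",
                    "casual", "casual", "member"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert each row (day type) to percentages.
counts <- table(rides$day_type, rides$member_casual)
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
row_pct
```

Using `margin = 1` makes each row sum to 100, which answers "who rides on this kind of day?" rather than "what share of all rides is this?".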

**How are the two user groups distributed across the months of the year?**

In [25]:
by_month <- cyclistic %>% 
    group_by(month_year) %>% 
    summarise(count = length(ride_id),
              'percent_of_all_rides' = (length(ride_id) / nrow(cyclistic)) * 100,
              'percent_members' = (sum(member_casual == "member") / length(ride_id)) * 100,
             'percent_casual' = (sum(member_casual == "casual") / length(ride_id)) * 100)
by_month

In [26]:
display(20,10)
ggplot(cyclistic, aes(month_year, fill=member_casual)) +
    geom_bar() +
    labs(x="Month", title="Fig. 3: Distribution by Month") +
    theme_grey(base_size = 20) +
    scale_y_continuous(labels = scales::comma)

The monthly breakdown reveals that total ridership ***declines precipitously in the winter*** compared with the warmer months. This makes sense and should be expected for any outdoor activity during Chicago's bitterly cold winters. Annual members are more likely than casual riders to keep using Cyclistic during the winter, which is consistent with our picture of casual riders using the bikes for leisure. The best time to aim promotions at casual riders is from mid-spring to mid-fall, when casual ridership is highest.
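
Because `month_year` is stored as text such as "05 - 2021", sorting it alphabetically does not follow the calendar when the data spans two years. A minimal sketch of one way to keep the labels but order them chronologically, using four made-up dates:

```r
# Hypothetical ride dates spanning a year boundary.
dates <- as.Date(c("2021-05-14", "2021-12-03", "2022-01-20", "2022-04-09"))

# "month - year" text labels sort by month number first, so 2022 months
# land ahead of 2021 months on an axis.
month_year <- paste(format(dates, "%m"), "-", format(dates, "%Y"))
sort(unique(month_year))

# Turning the labels into a factor whose levels follow the calendar keeps
# the same text but fixes the order used by plots and group summaries.
chronological <- unique(month_year[order(dates)])
month_year_f <- factor(month_year, levels = chronological)
levels(month_year_f)
```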

**How do the two rider groups compare when we look at the hour the ride is initiated?**

In [27]:
by_start_hour <- cyclistic %>% 
    group_by(start_hour) %>% 
    summarise(count = length(ride_id),
              '%_of_all_rides' = (length(ride_id) / nrow(cyclistic)) * 100,
              '%_members' = (sum(member_casual == "member") / length(ride_id)) * 100,
             '%_casual' = (sum(member_casual == "casual") / length(ride_id)) * 100)
by_start_hour

In [28]:
ggplot(cyclistic, aes(start_hour, fill=member_casual)) +
    geom_bar() +
    labs(x="Start Hour", title="Fig. 4: Distribution by Start Hour") +
    theme_grey(base_size = 20) +
    scale_y_continuous(labels = scales::comma)

Let's look at the distribution of ride start hours on a weekday against that of the weekend.

In [29]:
display(20,10)
ggplot(cyclistic, aes(start_hour, fill=member_casual)) +
    geom_bar() +
    labs(x="Start Hour", title="Fig. 5: Distribution by Start Hour on Weekdays vs. Weekends") +
    theme_grey(base_size = 20) +
    facet_wrap(~ day_type) +
    scale_y_continuous(labels = scales::comma)

The majority of all rides occur during the day, as would be expected. Annual member rides peak during both rush hours. Interestingly, between the ***hours of 9 pm and 4 am***, there are more casual riders than annual members. Even so, the majority of casual rides still take place during the day. This reinforces the commuting-versus-leisure distinction that we have been building.

On weekdays, annual member rides peak around the morning and afternoon/evening commutes, while casual rides peak in the afternoon/evening. On weekends, both user groups increase steadily, reaching a plateau from late morning to early evening.

Once more, we can see that casual riders outnumber annual members during the majority of the weekend hours.
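
The hours in which casual rides outnumber member rides can also be listed programmatically rather than read off the chart. The sketch below uses a handful of made-up rides to show the approach.

```r
# Toy counts (hypothetical): start hour and rider type for a handful of rides.
rides <- data.frame(
  start_hour = c("08", "08", "08", "17", "17", "17", "23", "23", "23"),
  member_casual = c("member", "member", "casual", "member", "member", "casual",
                    "casual", "casual", "member"),
  stringsAsFactors = FALSE
)

# Rides per hour for each rider type.
hourly <- table(rides$start_hour, rides$member_casual)

# Hours in which casual rides outnumber member rides.
casual_led_hours <- rownames(hourly)[hourly[, "casual"] > hourly[, "member"]]
casual_led_hours
```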

**Next, we can examine which stations are used most frequently by each group. Let's start by taking a look at each user group's Top 5 Start Stations.**

In [30]:
#Top 5 start stations for annual members
mem_start_stations <- cyclistic[cyclistic$member_casual == 'member',] %>% 
    drop_na(start_station_name) %>% 
    group_by(start_station_name) %>% 
    summarise(count = length(ride_id),
              '%' = (length(ride_id) / nrow(cyclistic)) * 100) 
head(mem_start_stations[order(mem_start_stations$count, decreasing = TRUE),], 5)

In [31]:
#Top 5 start stations for casual riders
cas_start_stations <- cyclistic[cyclistic$member_casual == 'casual',] %>%
    drop_na(start_station_name) %>% 
    group_by(start_station_name) %>% 
    summarise(count = length(ride_id),
              '% of total rides' = (length(ride_id) / nrow(cyclistic)) * 100) 
head(cas_start_stations[order(cas_start_stations$count, decreasing = TRUE),], 5)

It is striking that the two user groups have no overlap in their top five start stations. Also noteworthy is how many tourist sites appear on the casual riders' list. Although I don't know much about Chicago, the words "Park," "Aquarium," and "Theater" on the list suggest that these are leisure destinations. These stations could offer a good opportunity to reach casual riders.
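
The same top-stations logic is repeated for each rider type, so it could be wrapped in a small helper that also makes the overlap check explicit. This is a sketch on made-up rides: `top_start_stations` is a hypothetical helper, and unlike the cells above it reports each station's share within its own rider group rather than of all rides.

```r
# Toy rides (hypothetical station names and counts).
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "casual",
                    "member", "member", "member"),
  start_station_name = c("Park A", "Park A", "Aquarium B", NA,
                         "Office C", "Office C", "Park A"),
  stringsAsFactors = FALSE
)

# Return the n most-used start stations for one rider type, with each station's
# share of that group's rides that have a known start station.
top_start_stations <- function(data, rider_type, n = 5) {
  stations <- data$start_station_name[data$member_casual == rider_type]
  counts <- sort(table(stations), decreasing = TRUE)  # table() drops NA by default
  top <- head(counts, n)
  data.frame(start_station_name = names(top),
             count = as.integer(top),
             percent_of_group = round(as.integer(top) / sum(counts) * 100, 1),
             stringsAsFactors = FALSE)
}

top_casual <- top_start_stations(rides, "casual", n = 2)
top_member <- top_start_stations(rides, "member", n = 2)

# Stations that appear in both top lists.
intersect(top_casual$start_station_name, top_member$start_station_name)
```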

**We can apply the same analysis to the most frequently used end stations for each user type, as well.**

In [32]:
mem_end_stations <- cyclistic[cyclistic$member_casual == 'member',] %>% 
    drop_na(end_station_name) %>% 
    group_by(end_station_name) %>% 
    summarise(count = length(ride_id),
              '%' = (length(ride_id) / nrow(cyclistic)) * 100) 
head(mem_end_stations[order(mem_end_stations$count, decreasing = TRUE),], 5)

In [33]:
cas_end_stations <- cyclistic[cyclistic$member_casual == 'casual',] %>% 
    drop_na(end_station_name) %>% 
    group_by(end_station_name) %>% 
    summarise(count = length(ride_id),
              '%' = (length(ride_id) / nrow(cyclistic)) * 100) 
head(cas_end_stations[order(cas_end_stations$count, decreasing = TRUE),], 5)

The lists of most popular end stations are nearly identical to the lists of most popular start stations, so little new insight is gained.

**There is a column in our dataset for 'rideable_type' which contains three different types of bikes. We can analyze the bike types used by members vs casual riders.**

In [34]:
by_bike_type <- cyclistic %>% 
    group_by(rideable_type) %>% 
    summarise(count = length(ride_id),
              '%_of_all_rides' = (length(ride_id) / nrow(cyclistic)) * 100,
              '%_members' = (sum(member_casual == "member") / length(ride_id)) * 100,
             '%_casual' = (sum(member_casual == "casual") / length(ride_id)) * 100)
by_bike_type

In [35]:
ggplot(cyclistic, aes(rideable_type, fill=member_casual)) +
    geom_bar() +
    labs(x="Rideable Type", title="Fig. 6: Distribution by Rideable Type") +
    theme_grey(base_size = 20) +
    scale_y_continuous(labels = scales::comma)

We can see that both user groups use classic bikes the most. Notably, only casual riders use the docked bikes. Further research would be necessary to determine what constitutes a "docked bike" and whether it is available only to casual riders. It also appears that casual riders are slightly more likely than annual members to use electric bikes.

Let's look at the lengths of the rides for the two user groups. In the data, we found some duration outliers (even after removing the negative durations previously). The following code is used to produce a subset of the ride data that excludes the outliers with the longest (top 5%) and shortest (bottom 5%) durations.

In [36]:
tiles = quantile(cyclistic$duration_min, seq(0, 1, by=0.05))
cyclistic_no_time_outliers <- cyclistic %>% 
    filter(duration_min > as.numeric(tiles['5%'])) %>%
    filter(duration_min < as.numeric(tiles['95%']))

In [37]:
display(10,10)
ggplot(cyclistic_no_time_outliers, aes(x=member_casual, y=duration_min, fill=member_casual)) +
    labs(x="User Type", y="Ride Duration (min)", title="Fig. 7: Ride Duration by User Type") +
    geom_boxplot() +
    theme_grey(base_size = 20)

We can see that casual rides typically last longer: a casual ride typically lasts five minutes longer than a member ride. This supports our earlier assumptions about casual riders.
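
The comparison in the boxplot can be backed by a number, for example the median duration per rider type after trimming. The sketch below uses twenty made-up durations, so its medians are illustrative and are not the notebook's results.

```r
# Toy ride durations in minutes (hypothetical values, including extreme ones).
rides <- data.frame(
  member_casual = rep(c("member", "casual"), each = 10),
  duration_min = c(0.1, 5, 6, 7, 8, 9, 10, 11, 12, 600,
                   0.2, 10, 12, 14, 16, 18, 20, 22, 24, 1500)
)

# Trim the pooled 5% shortest and 5% longest rides.
bounds <- quantile(rides$duration_min, probs = c(0.05, 0.95))
trimmed <- rides[rides$duration_min > bounds[["5%"]] &
                 rides$duration_min < bounds[["95%"]], ]

# Median ride length per rider type after trimming.
medians <- tapply(trimmed$duration_min, trimmed$member_casual, median)
medians
```

The median is a reasonable summary here because it stays stable even when a few very long rides survive the trimming.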

# SHARE

# Conclusions

There is strong evidence for the following profiles:
* The majority of Cyclistic rides are taken by annual members, who mostly ride on weekdays during business hours, with peaks at times that suggest commuting to work.

* Most casual rides during the week take place in the daytime, and casual ridership rises noticeably on weekends.

* In the late evening and on weekends, casual riders outnumber annual members.

* Casual ridership declines over the long, cold winter, while annual members are more likely to keep riding.

* Trips taken by casual riders typically last longer than those taken by annual members and are more likely to begin and end at tourist and leisure locations.

# ACT

### Top Three Proposals

1. Partner on promotions with the establishments and locations on the list of the Top Five Casual Rider Start Stations.
2. Create a campaign to inform casual riders about the advantages of using bicycles for commuting.
3. Make annual membership more appealing by combining surge pricing and incentives at times when casual riders are most likely to ride.

### Potential for Improvement

* Investigate the NA values in the start and end station columns.

* Examine why docked bikes are ridden exclusively by casual riders.

* Gather more accurate distance information based on the actual distance traveled rather than just the starting and ending points.