## Client  
**Cyclistic** is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.    
Until now, Cyclistic has offered three flexible pricing plans:   
  1. Single-ride passes   
  2. Full-day passes   
  3. Annual memberships  

Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are annual members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. The director of marketing believes the company’s future success depends on maximising the number of annual members. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently.   
  
### Stakeholders:    
1. Lily Moreno, the director of marketing  
2. Marketing Analytics Team   
3. Executive Team   
  
### Business Task  
  To identify trends and patterns that differentiate casual riders from annual members, and to use those insights to encourage casual riders to become annual members.   

   1. How do annual members and casual riders use Cyclistic bikes differently?  
   2. Why would casual riders buy Cyclistic annual memberships?  
   3. How can Cyclistic use digital media to influence casual riders to become members?  
  
##### Data Source Information:   
   This study uses trip data from April 2022 to March 2023. The data has been made available by Motivate International Inc. under this [license](https://ride.divvybikes.com/data-license-agreement). (Note: The datasets have a different name because Cyclistic is a fictional company.)    
The data is **ROCCC** (reliable, original, comprehensive, current and cited) because it is provided by an authentic source.  
  
##### Privacy Check:  
  The data does not contain riders’ personally identifiable information such as geotracking or credit card information.  
  
##### Data Source Location:   
  The data was obtained from a public data source and can be accessed [here](https://divvy-tripdata.s3.amazonaws.com/index.html).  

##### Tools used:    
  This project was completed using RStudio.  
  
## Processing the data  


In [None]:
# Load libraries   
library("tidyverse")
library("tidyr")
library("dplyr")
library("lubridate")
library("janitor")

In [None]:
# Load all the datasets 
trips_03_23 <- read_csv("/kaggle/input/cyclistic0322-423/202303-divvy-tripdata.csv")
trips_02_23 <- read_csv("/kaggle/input/cyclistic0322-423/202302-divvy-tripdata.csv")
trips_01_23 <- read_csv("/kaggle/input/cyclistic0322-423/202301-divvy-tripdata.csv")
trips_12_22 <- read_csv("/kaggle/input/cyclistic0322-423/202212-divvy-tripdata.csv")
trips_11_22 <- read_csv("/kaggle/input/cyclistic0322-423/202211-divvy-tripdata.csv")
trips_10_22 <- read_csv("/kaggle/input/cyclistic0322-423/202210-divvy-tripdata.csv")
trips_9_22 <- read_csv("/kaggle/input/cyclistic0322-423/202209-divvy-publictripdata.csv")
trips_8_22 <- read_csv("/kaggle/input/cyclistic0322-423/202208-divvy-tripdata.csv")
trips_7_22 <- read_csv("/kaggle/input/cyclistic0322-423/202207-divvy-tripdata.csv")
trips_6_22 <- read_csv("/kaggle/input/cyclistic0322-423/202206-divvy-tripdata.csv")
trips_5_22 <- read_csv("/kaggle/input/cyclistic0322-423/202205-divvy-tripdata.csv")
trips_4_22 <- read_csv("/kaggle/input/cyclistic0322-423/202204-divvy-tripdata.csv") 
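
The twelve separate reads above could also be written as a loop over every CSV file in a folder. The following is a minimal sketch of that idea, not the approach used in this notebook: it relies on base R `read.csv()` and on two tiny made-up files in a temporary folder so that it runs anywhere, and the helper name `read_trip_files` is hypothetical.

```r
# Hypothetical helper: read every CSV in a folder and stack the rows into one data frame
read_trip_files <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  # Keep text columns as character so the column types match across files
  monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, monthly)
}

# Toy demonstration with two made-up monthly files in a temporary folder
demo_dir <- file.path(tempdir(), "trip_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202204-demo.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "member"),
          file.path(demo_dir, "202205-demo.csv"), row.names = FALSE)

all_trips <- read_trip_files(demo_dir)
nrow(all_trips)
```

A loop like this only works when every file in the folder has the same columns, which is what the column comparison in the next cells checks.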

In [None]:
# Compare the columns of all the datasets  
compare_df_cols(trips_03_23, trips_02_23, trips_01_23, trips_12_22, trips_11_22, trips_10_22, 
                     trips_9_22, trips_8_22, trips_7_22, trips_6_22, trips_5_22, trips_4_22)

In [None]:
# Confirm that the datasets are combinable (yes if it returns TRUE)
compare_df_cols_same(trips_03_23, trips_02_23, trips_01_23, trips_12_22, trips_11_22, trips_10_22,trips_9_22, trips_8_22, trips_7_22, trips_6_22, trips_5_22, trips_4_22)
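
To illustrate what "combinable" means here, the sketch below performs a similar check in base R: two data frames can be row-bound safely when their columns have the same names and classes. The helper `same_columns` and the toy data frames are hypothetical, and the check is deliberately stricter than necessary because it also requires the same column order.

```r
# Hypothetical helper: TRUE when all data frames share column names, order and classes
same_columns <- function(dfs) {
  # Describe each data frame as a named vector holding the first class of each column
  sigs <- lapply(dfs, function(df) vapply(df, function(col) class(col)[1], character(1)))
  all(vapply(sigs, identical, logical(1), sigs[[1]]))
}

april <- data.frame(ride_id = "a1",
                    started_at = as.POSIXct("2022-04-01 08:00:00", tz = "UTC"))
may   <- data.frame(ride_id = "b1",
                    started_at = as.POSIXct("2022-05-01 09:30:00", tz = "UTC"))
# Made-up problem case: the timestamp was left as text
bad   <- data.frame(ride_id = "c1", started_at = "2022-06-01 09:30:00")

same_columns(list(april, may))   # both timestamp columns are date-times
same_columns(list(april, bad))   # classes differ, so binding would mix types
```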

In [None]:
# Combine all datasets into one dataframe
tripdata <- rbind(trips_03_23, trips_02_23, trips_01_23, trips_12_22, trips_11_22, trips_10_22, trips_9_22, trips_8_22, trips_7_22, trips_6_22, trips_5_22, trips_4_22)

In [None]:
# Check the structure of the dataframe 
str(tripdata)

In [None]:
# Rename the columns for clarity 
tripdata2 <- tripdata %>% 
  rename(bike_type = rideable_type,
         start_time = started_at,
         end_time = ended_at,
         rider_type = member_casual) 

In [None]:
# Calculate the trip length in seconds (stored as a plain number for ease of analysis), plus the weekday and month 
tripdata2 <- tripdata2 %>% mutate(ride_length_secs = as.numeric(difftime(end_time, start_time, units = "secs")), 
                                 weekday = weekdays(start_time),
                                 month = months(start_time), .before = start_station_name)
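
Subtracting two date-times in R returns a `difftime` whose unit is picked automatically from the smallest difference, so it helps to state the unit explicitly whenever a length in seconds is wanted. The sketch below shows the difference on two made-up rides.

```r
start_time <- as.POSIXct("2022-04-01 10:00:00", tz = "UTC")
end_time   <- start_time + c(120, 3600)   # made-up rides of 2 minutes and 1 hour

# Plain subtraction: the unit is chosen automatically
auto_units <- end_time - start_time
units(auto_units)        # "mins", because the shortest ride is at least 60 seconds
as.numeric(auto_units)   # 2 60 (minutes, not seconds)

# Explicit unit: always seconds, whatever the data looks like
ride_length_secs <- as.numeric(difftime(end_time, start_time, units = "secs"))
ride_length_secs         # 120 3600
```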

In [None]:
# Check for bad data (trip lengths that are zero or negative)
filter(tripdata2, ride_length_secs <= 0)  

In [None]:
# Filter and remove the bad data 
tripdata2 <- tripdata2 %>% filter(ride_length_secs > 0) 

In [None]:
# Check if there are missing values in the dataframe
colSums(is.na(tripdata2))  

In [None]:
# Remove the remaining bad data, i.e., rows with any missing values 
tripdata_cleaned <- tripdata2 %>%
  na.omit()

In [None]:
# Recheck to confirm there are no missing values 
colSums(is.na(tripdata_cleaned))
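
It can be useful to record how many rows each cleaning step removes. The following base R sketch applies the same two rules (positive ride length, no missing values) to a small made-up data frame and reports the row count after each step; the values are invented for illustration only.

```r
# Made-up trips: one zero-length ride, one negative ride, one missing end station
trips <- data.frame(
  ride_id          = c("r1", "r2", "r3", "r4", "r5"),
  ride_length_secs = c(300, 0, -45, 900, 620),
  end_station_name = c("A", "B", "C", NA, "D"),
  stringsAsFactors = FALSE
)

# Step 1: keep rides with a positive length
valid_length <- trips[trips$ride_length_secs > 0, ]
# Step 2: keep rows without any missing value
cleaned <- valid_length[complete.cases(valid_length), ]

c(original = nrow(trips), positive_length = nrow(valid_length), complete = nrow(cleaned))
```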

## Statistical Analysis 


In [None]:
# Create a new dataset with only the data required for analysis 
td_calc <- tripdata_cleaned %>% select(ride_id, bike_type, start_time, end_time
                                       , ride_length_secs, weekday, month, rider_type
                                       , start_station_name, end_station_name )

In [None]:
# Number of rides by rider type 
td_calc %>% group_by(rider_type) %>% 
  summarise(number_of_rides = n())  %>% arrange(-number_of_rides)  

#### *Annual members take more rides than casual riders.*


In [None]:
# Distribution of bike type usage
td_calc %>% group_by(rider_type, bike_type) %>%  
  summarise(number_of_rides = n()) %>% 
  arrange(-number_of_rides)

#### *Classic bikes are the most popular, followed by electric bikes, for both rider types. Docked bikes are used only by casual riders and not by annual members at all.*

In [None]:
# Descriptive analysis of ride length
 summary(td_calc$ride_length_secs)

In [None]:
# Compare different descriptive measures of ride length and rider type 

# 1. Mean ride length
aggregate(ride_length_secs~rider_type, data =td_calc, FUN = mean)
  

In [None]:
# 2. Median ride length
aggregate(ride_length_secs~rider_type, data =td_calc, FUN = median)


In [None]:
# 3. Max ride length
aggregate(ride_length_secs~rider_type, data =td_calc, FUN = max) 

In [None]:
#4. Min ride length
aggregate(ride_length_secs~rider_type, data =td_calc, FUN = min)

#### *Casual riders take longer rides than annual members.*
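
The four `aggregate()` calls above can also be collected into a single table, which makes it easier to compare the mean with the median (the mean is pulled upwards by a few very long rides). This is a minimal base R sketch on made-up ride lengths, not output from the Cyclistic data.

```r
# Made-up ride lengths in seconds (illustration only)
rides <- data.frame(
  rider_type       = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length_secs = c(600, 900, 3000, 300, 420, 480)
)

# One row per rider type, one column per descriptive measure
stats <- do.call(rbind, lapply(
  split(rides$ride_length_secs, rides$rider_type),
  function(x) c(mean = mean(x), median = median(x), min = min(x), max = max(x))
))
stats
```

In this toy example the single 3,000-second casual ride lifts the casual mean well above the casual median.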


In [None]:
# Usage by Days
# 1. Mean ride length per day 
td_calc %>% group_by(rider_type, weekday)%>%  
  summarise(mean_ride_length = mean(ride_length_secs)) %>% arrange(-mean_ride_length)

#### *Longest rides by casual riders are on Sundays, Saturdays and Mondays.*
#### *Longest rides by annual members are on Saturdays, Sundays and Fridays.*
#### *Overall, the longest rides for both groups are during the weekend.*

In [None]:
# 2. Busiest days for each rider type 
td_calc %>%
  group_by(rider_type, weekday) %>%  
  summarise(number_of_rides = n()) %>% 
  arrange(rider_type,-number_of_rides)

#### *Casual riders ride the most during weekends.*
#### *Annual members ride the most during the week.*
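
One way to put a number on the weekend pattern is the share of rides that start on a Saturday or Sunday. The sketch below is an illustration on made-up start times; it uses `format(x, "%u")`, which returns the ISO weekday number (Monday is 1) and therefore does not depend on the locale-specific day names that `weekdays()` produces.

```r
# Made-up start times (2022-07-04 was a Monday)
rides <- data.frame(
  rider_type = rep(c("casual", "member"), each = 4),
  start_time = as.POSIXct(c(
    "2022-07-02 11:00:00", "2022-07-03 15:30:00", "2022-07-09 10:15:00", "2022-07-06 18:00:00",
    "2022-07-04 08:00:00", "2022-07-05 08:10:00", "2022-07-06 17:45:00", "2022-07-09 12:00:00"
  ), tz = "UTC")
)

# ISO weekday numbers 6 and 7 are Saturday and Sunday
is_weekend <- format(rides$start_time, "%u") %in% c("6", "7")

# Share of weekend rides per rider type
weekend_share <- tapply(is_weekend, rides$rider_type, mean)
weekend_share
```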

In [None]:
# Usage by Month 
# 1. Mean ride length per month 
td_calc %>% group_by(rider_type, month)%>%  
  summarise(mean_ride_length = mean(ride_length_secs)) %>% arrange(-mean_ride_length)

#### *Casual riders take the longest rides during May, April, July and June.* 
#### *Annual members take the longest rides during June, July, May and August.*
  

In [None]:
# 2. Busiest months 
td_calc %>% group_by(month) %>%  summarise(number_of_rides = n()) %>% arrange(-number_of_rides)  

#### *July, June and August are the busiest months and December, January and February are the slowest months.*

In [None]:
# 3. Busiest month for each group 
td_calc %>% 
  group_by(rider_type, month) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(rider_type, -number_of_rides)

#### *June, July and August are the busiest months and December, January and February are the slowest for both groups.*
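
Because the study period runs from April 2022 to March 2023, month names alone do not sort in the order the data was collected. If a chronological view is wanted, one option is a year-month key, sketched here in base R on made-up timestamps; the variable names are illustrative and not part of the analysis above.

```r
# Made-up start times spanning both calendar years of the study period
start_time <- as.POSIXct(c(
  "2023-01-15 09:00:00", "2022-04-10 14:00:00",
  "2022-12-24 16:00:00", "2022-04-22 08:30:00"
), tz = "UTC")

# A "YYYY-MM" key sorts chronologically even across the year boundary
year_month <- format(start_time, "%Y-%m")
rides_per_month <- table(year_month)
rides_per_month
```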

In [None]:
#Popular stations 
# 1. Most frequented start stations
td_calc %>% group_by(start_station_name) %>% 
  summarise(popularity= n())%>% 
  arrange(-popularity) %>% slice_head(n=5)

In [None]:
# 2. Most frequented end stations
td_calc %>% group_by(end_station_name) %>% 
  summarise(popularity= n()) %>% 
  arrange(-popularity) %>% slice_head(n=5)

#### *The five most frequented stations are Streeter Dr & Grand Ave, DuSable Lake Shore Dr & North Blvd, DuSable Lake Shore Dr & Monroe St, Michigan Ave & Oak St and Wells St & Concord Ln.*


In [None]:
#  Annual Members: most frequented stations  
# 1. Most frequented start stations
td_calc %>% group_by(start_station_name) %>% 
  filter(rider_type == "member") %>% summarise (popularity = n()) %>% 
  arrange(-popularity) %>% slice_head(n=5)
  

In [None]:
# 2. Most frequented end stations
td_calc %>% group_by(end_station_name) %>% 
  filter (rider_type == "member") %>% summarise(popularity = n()) %>% 
  arrange(-popularity) %>% slice_head(n=5)

#### *The most frequented stations by annual members are Kingsbury St & Kinzie St, Clark St & Elm St, Clinton St & Washington Blvd and Wells St & Concord Ln.*


In [None]:
  # Casual Riders: most frequented stations  
# 1. Most frequented start stations
td_calc %>% group_by(start_station_name) %>% 
  filter(rider_type== "casual") %>% summarise(popularity =n()) %>% 
  arrange(-popularity) %>% slice_head(n=5)

In [None]:
# 2. Most frequented end stations  
td_calc %>% group_by(end_station_name) %>% 
  filter(rider_type== "casual") %>% summarise(popularity =n()) %>% 
  arrange(-popularity) %>% slice_head(n=5)

#### *The most frequented stations by casual riders are Streeter Dr & Grand Ave, DuSable Lake Shore Dr & Monroe St, Millennium Park, Michigan Ave & Oak St and DuSable Lake Shore Dr & North Blvd.*
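
The station rankings above repeat the same count-and-sort step for each rider type. As a sketch of how that step could be wrapped once and applied to every group, the base R example below uses a hypothetical helper `top_stations` on made-up station names.

```r
# Hypothetical helper: the n most frequent values, most frequent first
top_stations <- function(stations, n = 5) {
  counts <- sort(table(stations), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

# Made-up rides (station names are placeholders)
rides <- data.frame(
  rider_type         = c(rep("casual", 6), rep("member", 4)),
  start_station_name = c("Pier", "Pier", "Park", "Pier", "Park", "Lake",
                         "Office", "Office", "Home", "Office"),
  stringsAsFactors = FALSE
)

# Apply the helper separately to each rider type
by_type <- lapply(split(rides$start_station_name, rides$rider_type), top_stations, n = 2)
by_type
```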


## Key Findings
1. Casual riders take fewer but longer rides than annual members.  
2. Casual riders take more and longer rides during the weekends.  
3. Classic bikes are the most popular, followed by electric bikes, in both rider segments. However, docked bikes are used only by casual riders, and for the longest rides.  
4. Summers are the busiest and winters are the slowest for both rider groups.  
5. Both groups take the longest rides during late spring and summer.  
6. The most frequented stations by casual riders are Streeter Dr & Grand Ave, DuSable Lake Shore Dr & Monroe St, Millennium Park, Michigan Ave & Oak St and DuSable Lake Shore Dr & North Blvd.

## Visualisations 

In [None]:
library(scales)
library(ggplot2)

In [None]:
##### 1 Ridership according to Rider Type
# Capitalise the values for better labels   

td_calc <- td_calc %>% mutate(rider_type= recode(rider_type, 
                                                 member = "Member", 
                                                 casual = "Casual"))

In [None]:
#####  1.1 Number of Rides per Rider Type 

 td_calc %>% group_by(rider_type) %>% 
  summarise(no_of_rides = n()) %>% 
ggplot(aes(rider_type,  no_of_rides, fill = rider_type)) + 
  geom_col() +   
  scale_fill_manual(values = c("orange", "cyan")) +
 geom_text(aes(label = scales::percent(no_of_rides/sum(no_of_rides))),
             size = 5,
             position = position_stack(vjust = 0.5)) + 
   labs(title = "Number of Rides per Rider Type", fill = "Rider Type") +     
   theme(plot.title = element_text(hjust = 0.5)) +
   scale_x_discrete(name = "Rider Type") +
  scale_y_continuous(name = "Number of Rides", labels = comma, limits = c(0, 3000000))

In [None]:
#####  1.2 Average Ride Duration per Rider Type 

 td_calc %>%  group_by(rider_type) %>% 
   summarise (count = mean(ride_length_secs)/60) %>% 
   ggplot(aes(rider_type, count, fill = rider_type)) +
   geom_col() +
   scale_fill_manual(values = c("orange", "cyan")) +
   labs(title = "Average Ride Duration per Rider Type", fill = "Rider Type" ) +
   theme(plot.title = element_text(hjust = 0.5)) +
   scale_x_discrete(name = "Rider Type") +
   scale_y_continuous(name = "Average Ride Length (mins)", labels = comma, limits = c(0,25))

In [None]:
##### 2 Bike Type and Usage 
# Recode bike types for better labels in plots 

td_calc <- td_calc %>% mutate(bike_type= recode(bike_type,
                                                electric_bike = "Electric",
                                                classic_bike = "Classic",
                                                docked_bike = "Docked"))

In [None]:
#####  2.1 Number of Rides per Bike Type

 td_calc %>%
   group_by(bike_type, rider_type)%>%
   summarise(count = n())%>%
   ggplot(aes(bike_type, count, fill=rider_type))+
   geom_col( position = "dodge") +
  scale_fill_manual(values = c("orange", "cyan")) +
   labs(title = "Number of Rides per Bike Type", fill = "Rider Type") +
   theme(plot.title = element_text(hjust = 0.5)) +
   scale_x_discrete(name = "Bike Type") +
   scale_y_continuous(name = "Number of Rides", labels = comma, limits = c(0, 2000000))

In [None]:
#####  2.2 Average Ride Duration per Bike Type

td_calc %>%  group_by(bike_type,rider_type) %>% 
   summarise (count = mean(ride_length_secs)/60) %>% 
ggplot(aes(bike_type, count, fill=rider_type))+
   geom_col(position = "dodge") +
  scale_fill_manual(values = c("orange", "cyan")) +
   labs(title = "Average Ride Duration per Bike Type", fill = "Rider Type") +
   theme(plot.title = element_text(hjust = 0.5)) +
   scale_x_discrete(name = "Bike Type") +
   scale_y_continuous(name = "Average Ride Duration (mins)", labels = comma)

In [None]:
##### 3 Days of the Week  
# Order the weekdays

td_calc$weekday <- ordered(td_calc$weekday, 
                                          levels=c( "Monday","Tuesday", "Wednesday",
                                                   "Thursday", "Friday", "Saturday", "Sunday" ))  

In [None]:
# Abbreviate the weekday names (so the labels do not overlap in the bar plot)

td_calc <- td_calc %>%  mutate(weekday = recode(weekday,
                                                 Monday = "Mon",
                                                 Tuesday = "Tue", 
                                                 Wednesday = "Wed",
                                                 Thursday = "Thu",
                                                 Friday = "Fri",
                                                 Saturday = "Sat",
                                                 Sunday = "Sun"))
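
Ordering the weekdays and shortening their names can also be done in one step, because `factor()` accepts both `levels` (the values as they appear in the data, in the wanted order) and `labels` (the names to display). A minimal sketch on a made-up vector, assuming English day names as produced by `weekdays()` in an English locale:

```r
# Made-up weekday values, as weekdays() would return them in an English locale
weekday <- c("Saturday", "Monday", "Wednesday", "Monday", "Sunday")

# levels fixes the order, labels supplies the abbreviations
weekday_f <- factor(
  weekday,
  levels  = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
  labels  = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  ordered = TRUE
)

levels(weekday_f)         # Monday-first order, abbreviated
as.character(weekday_f)   # "Sat" "Mon" "Wed" "Mon" "Sun"
```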

In [None]:
#####  3.1 Number of Rides per Day 

td_calc %>% 
   group_by(rider_type, weekday, bike_type) %>% 
   summarise(count = n()) %>% 
   ggplot(aes(weekday, count, fill = bike_type))+ facet_wrap(~rider_type) +
   geom_col()  + 
   labs(title = "Number of Rides per Day", fill = "Bike Type" ) +
   theme(plot.title = element_text(hjust = 0.5))  +
   scale_x_discrete(name = "Day") + 
   scale_y_continuous(name = "Number of Rides", labels = comma, limits =c(0, 450000) ) 

In [None]:
#####  3.2 Average Ride Duration per Day 

 td_calc %>% 
   group_by(rider_type, weekday) %>% 
   summarise(count = mean(ride_length_secs)/60) %>% 
   ggplot(aes(weekday, count, fill = rider_type))+
   geom_col( position = "dodge")  +
    scale_fill_manual(values = c("orange", "cyan")) +
   labs(title = "Average Ride Duration per Day", fill = "Rider Type" ) +
   theme(plot.title = element_text(hjust = 0.5))  +
   scale_x_discrete(name = "Day") + 
   scale_y_continuous(name = "Average Ride Duration (mins)", labels = comma, limits = c(0,30))

In [None]:
##### 3.3 Average Ride Duration of Bike Types per Day 

td_calc %>% 
   group_by(rider_type, weekday, bike_type) %>% 
   summarise(count = mean(ride_length_secs)/60) %>% 
   ggplot(aes(weekday, count, fill = bike_type))+ facet_wrap(~rider_type) +
   geom_col( position = "dodge")  + 
   labs(title = "Average Ride Duration of Bike Types per Day", fill = "Bike Type" ) +
   theme(plot.title = element_text(hjust = 0.5))  +
   scale_x_discrete(name = "Day") + 
   scale_y_continuous(name = "Average Ride Duration (mins)", labels = comma) 

In [None]:
##### 4 Months
# Order the months 

td_calc$month <- ordered(td_calc$month, 
                            levels=c( "January","February", "March", "April", "May", "June", "July", 
                                      "August", "September", "October", "November", "December" ))

In [None]:
# Abbreviate the month names (so the labels do not overlap in the bar plot)

td_calc <- td_calc %>% mutate(month = recode(month,
                                  January = "Jan", February = "Feb", March = "Mar", April = "Apr",
                                  May = "May", June = "Jun", July = "Jul", August = "Aug",
                                  September = "Sep", October = "Oct", November = "Nov",December = "Dec"))

In [None]:
#####  4.1 Number of Rides per Month 

 td_calc %>% 
   group_by(rider_type, month) %>% 
   summarise(count = n()) %>% 
   ggplot(aes(month, count, fill = rider_type))+
   geom_col( position = "dodge")  +
  scale_fill_manual(values = c("orange", "cyan")) +
   labs(title = "Number of Rides per Month", fill = "Rider Type" ) +
   theme(plot.title = element_text(hjust = 0.5))  +
   scale_x_discrete(name = "Month") + 
   scale_y_continuous(name = "Number of Rides", labels = comma, limits = c(0,350000) )

In [None]:
#####  4.2 Average Ride Duration per Month 

  td_calc %>% 
   group_by(rider_type, month) %>% 
   summarise(count = mean(ride_length_secs)/60) %>% 
   ggplot(aes(month, count, fill = rider_type))+
   geom_col(position = "dodge")  +
  scale_fill_manual(values = c("orange", "cyan")) + 
   labs(title = "Average Ride Duration per Month", fill = "Rider Type" ) +
   theme(plot.title = element_text(hjust = 0.5))  +
   scale_x_discrete(name = "Month") + 
   scale_y_continuous(name = "Average Ride Duration (mins)", labels = comma, limits = c(0,30)) 

In [None]:
##### 5 Stations Most Frequented by Casual Riders 
#####  5.1 Most Popular Start Stations for Casual Riders

td_calc %>% group_by(start_station_name) %>% 
  filter(rider_type== "Casual") %>% summarise(popularity =n()) %>% 
  arrange(-popularity) %>% slice_head(n=5) %>% ggplot(aes(start_station_name, popularity, fill = popularity)) + 
  geom_col(fill = "darkgreen") + 
   labs(title = "Popular Start Stations for Casual Riders", fill = "Popularity" ) + theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Station Name") + scale_y_continuous(name = "Popularity", labels = comma ) +
coord_flip()

In [None]:
#####  5.2 Most Popular End Stations for Casual Riders

td_calc %>% group_by(end_station_name) %>% 
  filter(rider_type== "Casual") %>% summarise(popularity =n()) %>% 
  arrange(-popularity) %>% slice_head(n=5) %>% 
  ggplot(aes(end_station_name, popularity, fill = popularity)) + 
  geom_col( fill = "darkgreen") + 
   labs(title = "Popular End Stations for Casual Riders", fill = "Popularity" ) + 
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Station Name") + scale_y_continuous(name = "Popularity", labels = comma ) +
coord_flip()
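
In bar charts like the two above, a discrete axis follows the factor levels (alphabetical for plain text), not the bar heights. If bars sorted by popularity are preferred, one option is `reorder()`, sketched here in base R; the station names come from the findings above, but the counts are made up for illustration.

```r
# Made-up counts for three stations (illustration only, not results from the data)
stations <- data.frame(
  start_station_name = c("Streeter Dr & Grand Ave", "Millennium Park", "Michigan Ave & Oak St"),
  popularity         = c(500, 200, 300),
  stringsAsFactors   = FALSE
)

# reorder() returns a factor whose levels follow the counts in ascending order
stations$start_station_name <- reorder(stations$start_station_name, stations$popularity)
levels(stations$start_station_name)

# In a ggplot2 call the same idea would read:
#   aes(reorder(start_station_name, popularity), popularity)
```

With `coord_flip()`, the first level is drawn at the bottom, so the most popular station ends up at the top of the chart.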

## Executive Summary  
1. Casual riders take fewer but longer rides than annual members.  
2. Casual riders take more and longer rides during the weekends.  
3. Classic bikes are the most popular, followed by electric bikes, in both rider segments. However, docked bikes are used only by casual riders, and for the longest rides.  
4. Summers are the busiest and winters are the slowest for both rider groups.  
5. Both groups take the longest rides during late spring and summer.  
6. The most frequented stations by casual riders are Streeter Dr & Grand Ave, DuSable Lake Shore Dr & Monroe St, Millennium Park, Michigan Ave & Oak St and DuSable Lake Shore Dr & North Blvd.  
  
    
## Recommendations
1. Provide special offers tied to annual memberships at the busiest stations, on weekends and during the summer.  
2. Create competitions based on distance covered, and either  
     a) award the top contestants a full or discounted annual membership, or  
     b) open these competitions only to annual members, to encourage casual riders to convert to members.   
     Ideally, these competitions would run on an app, with points awarded for distance.  
3. Promote all campaigns and strategies, especially during peak times such as weekends and the summer months.   
  