# Cyclistic | Divvy Bike Sharing Data Analysis Using R
  by Mangesh Tayade
  

## Introduction

This analysis was prepared by Mangesh Tayade as part of the Google Data Analytics Certification offered by Google on Coursera, with the Divvy bike-sharing company as the source of data. Throughout this project you will see real-world data kindly provided by Motivate International Inc. for public use under its data license.

In the Data Analytics Certification the dataset is attributed to a fictional bike-sharing company called Cyclistic, so let's keep that name; you will see it at different stages of the analysis. You can follow along through all the steps of data cleaning and processing, but if you are interested only in the conclusions drawn from the data, you can find them at the end of the report.

## About Company

Divvy is Chicagoland's bike share system across Chicago and Evanston. Divvy provides residents and visitors with a convenient, fun and affordable transportation option for getting around and exploring Chicago.

Divvy, like other bike share systems, consists of a fleet of specially-designed, sturdy and durable bikes that are locked into a network of docking stations throughout the region. The bikes can be unlocked from one station and returned to any other station in the system. People use bike share to explore Chicago, commute to work or school, run errands, get to appointments or social engagements, and more.

Divvy is available for use 24 hours/day, 7 days/week, 365 days/year, and riders have access to all bikes and stations across the system.

## Data sources used

<a href = "https://divvy-tripdata.s3.amazonaws.com/index.html" target = "https://divvy-tripdata.s3.amazonaws.com/index.html">Divvy Data (July 2021 to June 2022)</a> - The data has been processed to remove trips taken by staff as they service and inspect the system, and any trips below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).


## Business Task


1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

## Data Import

In [4]:
#add necessary libraries
library(tidyverse)
library(lubridate)


In [6]:
#import all monthly csv files (July 2021 to June 2022) and bind them into one dataframe
july_2021 <- read_csv("../input/bikeshare/db/bike01.csv")
aug_2021 <- read_csv("../input/bikeshare/db/bike02.csv")
sept_2021 <- read_csv("../input/bikeshare/db/bike03.csv")
oct_2021 <- read_csv("../input/bikeshare/db/bike04.csv")
nov_2021 <- read_csv("../input/bikeshare/db/bike05.csv")
dec_2021 <- read_csv("../input/bikeshare/db/bike06.csv")
jan_2022 <- read_csv("../input/bikeshare/db/bike07.csv")
feb_2022 <- read_csv("../input/bikeshare/db/bike08.csv")
march_2022 <- read_csv("../input/bikeshare/db/bike09.csv")
april_2022 <- read_csv("../input/bikeshare/db/bike10.csv")
may_2022 <- read_csv("../input/bikeshare/db/bike11.csv")
june_2022 <- read_csv("../input/bikeshare/db/bike12.csv")


cyclistic_2022 = bind_rows(july_2021, aug_2021, sept_2021, oct_2021, nov_2021, dec_2021, jan_2022, feb_2022, march_2022, april_2022, may_2022, june_2022)

rm("july_2021", "aug_2021", "sept_2021", "oct_2021", "nov_2021", "dec_2021", "jan_2022", "feb_2022", "march_2022", "april_2022", "may_2022", "june_2022")
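
The twelve separate `read_csv()` calls can also be written as one loop over the folder. This is only a minimal sketch, assuming every monthly file has the same columns; the temporary folder and the two toy files are hypothetical stand-ins for `../input/bikeshare/db/` and are not part of the original notebook.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the input folder: write two tiny monthly files.
data_dir <- file.path(tempdir(), "bikeshare_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(data_dir, "bike01.csv"))
write_csv(tibble(ride_id = "b1", member_casual = "member"),
          file.path(data_dir, "bike02.csv"))

# List the files in name order, read each one, and stack the results.
files <- sort(list.files(data_dir, pattern = "^bike[0-9]+\\.csv$", full.names = TRUE))
trips <- bind_rows(lapply(files, read_csv, show_col_types = FALSE))

nrow(trips)
```

With this pattern no intermediate monthly dataframes are left behind, so the `rm()` call is not needed.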

In [7]:
#check our dataframe, names of columns, types of data
summary(cyclistic_2022)

In [8]:
#format data types of columns and add more columns that we will use for the analysis
cyclistic_2022$date <- as.Date(cyclistic_2022$started_at)
cyclistic_2022$month <- format(as.Date(cyclistic_2022$date), "%m")
cyclistic_2022$day <- format(as.Date(cyclistic_2022$date), "%d")
cyclistic_2022$hour <- format(as.POSIXct(cyclistic_2022$started_at), "%H")
cyclistic_2022$year <- format(as.Date(cyclistic_2022$date), "%Y")
cyclistic_2022$day_of_week <- format(as.Date(cyclistic_2022$date), "%A")
#add ride length column as a difference between end time and start time of a trip in minutes
cyclistic_2022$ride_length <- as.numeric(difftime(cyclistic_2022$ended_at, cyclistic_2022$started_at, units = "mins"))
summary(cyclistic_2022)
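
To make the derived columns easier to check, here is a small sketch of the same idea on two made-up trips (the timestamps are invented for illustration). It uses base R only and extracts the hour without minutes.

```r
# Toy trips; in the notebook these columns are parsed as date-times on import.
trips <- data.frame(
  started_at = as.POSIXct(c("2021-07-04 08:15:00", "2021-12-31 23:50:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-07-04 08:45:30", "2022-01-01 00:05:00"), tz = "UTC")
)

# difftime() with explicit units returns minutes; as.numeric() drops the difftime class.
trips$ride_length <- as.numeric(difftime(trips$ended_at, trips$started_at, units = "mins"))

# format() extracts calendar parts as zero-padded character strings.
trips$month <- format(trips$started_at, "%m")
trips$hour  <- format(trips$started_at, "%H")

trips[, c("ride_length", "month", "hour")]
```

The first trip lasts 30.5 minutes and the second, which crosses midnight and the year boundary, lasts 15 minutes.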

In [9]:
#How many unique values do we have in each column?
cyclistic_2022 %>%
  summarise_all(n_distinct)

# Exploratory Data Analysis
For the year 2021-22 (July 2021 to June 2022) there are 5900385 rows of data. From the 2020 analysis I already know that one category has changed: Cyclistic now refers to all pedal bikes as "classic".

In [10]:
#Replace the docked_bike rideable type with classic_bike to match the categories the company currently uses,
#then arrange the dataframe by start time
cyclistic_2022 <- cyclistic_2022 %>%
  mutate(rideable_type = replace(rideable_type, rideable_type == "docked_bike", "classic_bike")) %>%
  arrange(started_at)

Let's look for inconsistencies in the data, starting with the NA values.

In [11]:
#How many observations are missing in each column?
colSums(is.na(cyclistic_2022))

We see that quite a lot of station names and IDs are missing. This could be due to new stations opening or to ID changes made for easier work with the database. We don't know the exact reason, but we'll try to find the missing values and fill the columns.

In [12]:
#Group by station name and fill missing station IDs from other rows of the same station
fill_missing_start_id <- cyclistic_2022 %>%
  group_by(start_station_name) %>%
  select(start_station_name, start_station_id) %>%
  fill(start_station_id) %>%
  ungroup()
sum(is.na(fill_missing_start_id$start_station_id))
colSums(is.na(fill_missing_start_id))

In [13]:
#We see that there are fewer NA start IDs now, update the dataframe
cyclistic_2022 <- cyclistic_2022 %>%
  mutate(start_station_id = fill_missing_start_id$start_station_id)
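
The group-and-fill idea is easier to see on a toy table. The sketch below is a variant rather than the notebook's exact call: it uses `.direction = "downup"`, so a missing ID is taken from an earlier row of the same station first and from a later row otherwise. The station names and IDs are made up.

```r
library(dplyr)
library(tidyr)

# Toy data: the same station appears with and without its ID.
trips <- tibble(
  start_station_name = c("Clark St", "Clark St", "State St", "State St", NA),
  start_station_id   = c(NA, "S1", "S2", NA, NA)
)

filled <- trips %>%
  group_by(start_station_name) %>%
  # "downup" fills from earlier rows first, then from later rows.
  fill(start_station_id, .direction = "downup") %>%
  ungroup()

filled$start_station_id
```

Rows where both the name and the ID are missing stay NA, which is why the notebook later falls back on coordinates.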

In [16]:
#Group by station ID and fill missing station names from other rows of the same station
fill_missing_start_name <- cyclistic_2022 %>%
  group_by(start_station_id) %>%
  select(start_station_name, start_station_id) %>%
  fill(start_station_name) %>%
  ungroup()
sum(is.na(fill_missing_start_name$start_station_name))
colSums(is.na(fill_missing_start_name))


#We see that there are fewer NA start station names now, update the dataframe
cyclistic_2022 <- cyclistic_2022 %>%
  mutate(start_station_name = fill_missing_start_name$start_station_name) %>%
  arrange(started_at)


In [20]:
#Fill missing start station IDs using the coordinates that every trip has
#Dataframe with missing values
df_missing <- cyclistic_2022 %>%
  filter(is.na(start_station_id))

#Dataframe without NA in that column
df_nomissings <- cyclistic_2022 %>%
  filter(!is.na(start_station_id))

#I want to use the closest coordinates to find station IDs - s2_closest_feature
library(s2)

#s2_lnglat() takes longitude first, then latitude
missings_s2 <- s2_lnglat(df_missing$start_lng, df_missing$start_lat)
nomissings_s2 <- s2_lnglat(df_nomissings$start_lng, df_nomissings$start_lat)
#The returned indices point into nomissings_s2, i.e. rows of df_nomissings
df_missing$start_station_id <- df_nomissings$start_station_id[s2_closest_feature(missings_s2, nomissings_s2)]

#Check how many nulls are left in the column
sum(is.na(df_missing$start_station_id))
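
Here is a small sketch of the nearest-coordinate lookup on made-up points, to show what `s2_closest_feature()` returns. The two stations and their IDs are hypothetical; the important details are that `s2_lnglat()` takes longitude before latitude and that the result is an index into the second argument.

```r
library(s2)

# Hypothetical reference stations with known IDs.
stations <- data.frame(
  id  = c("S1", "S2"),
  lng = c(-87.62, -87.65),
  lat = c(41.88, 41.95)
)

# Trips whose station ID is missing but whose coordinates are known.
unknown <- data.frame(lng = c(-87.621, -87.649), lat = c(41.881, 41.949))

# For each point in the first argument, get the row index of the nearest point in the second.
idx <- s2_closest_feature(
  s2_lnglat(unknown$lng, unknown$lat),
  s2_lnglat(stations$lng, stations$lat)
)

# Index into the same table that was used to build the second argument.
unknown$id <- stations$id[idx]
unknown
```

Note that this assigns the nearest known station even when it is far away, so a distance check may be worth adding before trusting the filled IDs.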

In [21]:
df_missing%>%
head()

In [23]:
#We bind dataframes with existing and filled values and arrange it by started_at again
cyclistic_2022_filled <- bind_rows(df_missing, df_nomissings)
cyclistic_2022_filled <- cyclistic_2022_filled %>%
arrange(started_at)

In [24]:
colSums(is.na(cyclistic_2022_filled))

In [25]:
#Fill missing end station IDs
#Dataframe with missing values
df_missing <- cyclistic_2022_filled %>%
  filter(is.na(end_station_id))

#dataframe without NA in that column
df_nomissings <- cyclistic_2022_filled %>%
  filter(!is.na(end_station_id))

In [27]:
#s2_lnglat() takes longitude first, then latitude
missings_s2 <- s2_lnglat(df_missing$end_lng, df_missing$end_lat)
nomissings_s2 <- s2_lnglat(df_nomissings$end_lng, df_nomissings$end_lat)
#The returned indices point into nomissings_s2, i.e. rows of df_nomissings
df_missing$end_station_id <- df_nomissings$end_station_id[s2_closest_feature(missings_s2, nomissings_s2)]

In [28]:
cyclistic_2022_filled <- bind_rows(df_missing, df_nomissings)
cyclistic_2022_filled <- cyclistic_2022_filled%>%
arrange(started_at)

colSums(is.na(cyclistic_2022_filled))

In [29]:
#Let's switch the grouping and fill the name column as well.
fill_missing_start_name <- cyclistic_2022_filled %>%
  group_by(start_station_id) %>%
  select(start_station_name, start_station_id) %>%
  fill(start_station_name) %>%
  ungroup()
sum(is.na(fill_missing_start_name$start_station_name))
colSums(is.na(fill_missing_start_name))

#We keep in mind that v2 is about filtering and filling missing station IDs and names
cyclistic_2022_v2 <- cyclistic_2022_filled %>%
  mutate(start_station_name = fill_missing_start_name$start_station_name) %>%
  arrange(started_at)

colSums(is.na(cyclistic_2022_v2))

In [30]:
#Strip a trailing ".0" from station IDs
cyclistic_2022_v2$start_station_id <- sub("\\.0$", "", cyclistic_2022_v2$start_station_id)
cyclistic_2022_v2$end_station_id <- sub("\\.0$", "", cyclistic_2022_v2$end_station_id)
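
A quick check of an end-anchored pattern on a few made-up IDs shows which values change. This is a sketch, assuming the goal is only to remove a decimal suffix left over from IDs that were stored as numbers.

```r
ids <- c("13001.0", "TA1307000039", "523", NA)

# "\\.0$" matches a literal ".0" only at the end of the string.
clean <- sub("\\.0$", "", ids)
clean
```

Only the first value is altered; alphanumeric IDs, plain integers, and NA pass through unchanged.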

In [31]:
#Group by station name and fill missing station IDs
fill_missing_start_id <- cyclistic_2022_v2 %>%
  group_by(start_station_name) %>%
  select(start_station_name, start_station_id) %>%
  fill(start_station_id) %>%
  ungroup()
sum(is.na(fill_missing_start_id$start_station_id))
colSums(is.na(fill_missing_start_id))

#We see that there are fewer NA start IDs now, update the dataframe
cyclistic_2022_v2 <- cyclistic_2022_v2 %>%
  mutate(start_station_id = fill_missing_start_id$start_station_id)

In [32]:
#Let's run the fill for names one more time.
fill_missing_start_name <- cyclistic_2022_v2 %>%
  group_by(start_station_id) %>%
  select(start_station_name, start_station_id) %>%
  fill(start_station_name) %>%
  ungroup()
sum(is.na(fill_missing_start_name$start_station_name))
colSums(is.na(fill_missing_start_name))

#We keep in mind that v2 is about filtering and filling missing station IDs and names
cyclistic_2022_v2 <- cyclistic_2022_v2 %>%
  mutate(start_station_name = fill_missing_start_name$start_station_name) %>%
  arrange(started_at)

colSums(is.na(cyclistic_2022_v2))

In [35]:
rm("fill_missing_start_id", "fill_missing_start_name", "cyclistic_2022", "missings_s2", "nomissings_s2")

In [36]:
#Group by end station name and fill missing end station IDs
fill_missing_end_id <- cyclistic_2022_v2 %>%
  group_by(end_station_name) %>%
  select(end_station_name, end_station_id) %>%
  fill(end_station_id) %>%
  ungroup()
sum(is.na(fill_missing_end_id$end_station_id))

#Mutate our dataframe with filled columns
cyclistic_2022_v2 <- cyclistic_2022_v2 %>%
  mutate(end_station_id = fill_missing_end_id$end_station_id)

In [37]:
colSums(is.na(cyclistic_2022_v2))

In [38]:
#Let's switch the grouping and fill the name column as well.
fill_missing_end_name <- cyclistic_2022_v2 %>%
  group_by(end_station_id) %>%
  select(end_station_name, end_station_id) %>%
  fill(end_station_name) %>%
  ungroup()
sum(is.na(fill_missing_end_name$end_station_name))

#We keep in mind that v2 is about filtering and filling missing station IDs and names
cyclistic_2022_v2 <- cyclistic_2022_v2 %>%
  mutate(end_station_name = fill_missing_end_name$end_station_name)

For the moment the NA values for end stations are not critical. There are fewer NA values (5374) in the coordinate columns, so most likely it is a matter of matching those coordinates to the right station name and ID. Let's see if they are eliminated by further manipulations.

Now let's look at ride length: there are some negative values to be filtered out. Furthermore, trips of less than 1 minute are most likely mistakes, system tests, or a change of mind rather than normal customer behavior.

In [39]:
sum(cyclistic_2022_v2$ride_length < 1)

In [40]:
cyclistic_2022_v3 <- subset(cyclistic_2022_v2, ride_length >= 1)

cyclistic_2022_v3 %>%
summarise_all(n_distinct)

colSums(is.na(cyclistic_2022_v3))

We created a ride_length column and filled missing station names using the NA filters. Now we can start checking the differences between casual and member patterns: the number of rides and the min, max, average, and median ride_length. What we find is that some stations have few rides but include long-duration trips that can skew the results of our analysis.

From the website we can find the maximum ride length without additional fees that the company expects users to have, so we'll treat all extreme values as outliers caused by, e.g., technical maintenance, bike theft, or technical issues with bike locking that can lead to customer complaints. If I were part of the Divvy team, I would have found out the threshold beyond which the company stops counting a trip as a regular ride.

As we can calculate, every ride that runs close to 12 hours in a row is worth the price of an annual subscription. The only type of customer that could come close to that amount of time is a casual rider with a day pass. But from the pricing plans on the website we know that a day pass does not mean 12 consecutive hours on the same bike: the bike should be docked every 3 hours, otherwise the customer pays additional fees.

Let's define 8 hours (480 mins) as the maximum ride_length and see how many observations we lose.

In [41]:
sum(cyclistic_2022_v3$ride_length > 480)

Then we filter out the rows with those trips.

In [42]:
cyclistic_2022_v4 <- subset(cyclistic_2022_v3, ride_length <= 480)
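
The lower and upper duration bounds can be applied and counted in one step. Below is a minimal sketch on made-up ride lengths; the variable names `min_minutes` and `max_minutes` are introduced here for illustration, and this version keeps rides of exactly 1 and exactly 480 minutes, which is a choice rather than a rule taken from the data.

```r
ride_length <- c(-3, 0.4, 1, 12.5, 480, 1500)  # minutes, toy values

min_minutes <- 1    # shortest ride kept
max_minutes <- 480  # 8-hour ceiling from the text

# TRUE for rides inside the accepted range, endpoints included.
keep <- ride_length >= min_minutes & ride_length <= max_minutes

sum(!keep)         # rows that would be dropped
ride_length[keep]  # rows that remain
```

Counting `!keep` before subsetting makes it clear how many observations each threshold costs.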

In [43]:
ls()

In [44]:
#Get rid of the intermediate objects we don't need anymore
rm("cyclistic_2022_v2", "cyclistic_2022_filled", "fill_missing_end_id", "fill_missing_end_name", "df_nomissings")

In [45]:
#Check for missing station names and IDs in start and end columns throughout the dataframe
start_station_name_missing_data <- filter(cyclistic_2022_v4, (is.na(start_station_name)))
start_station_id_missing_data <- filter(cyclistic_2022_v4, (is.na(start_station_id)))
end_station_name_missing_data <- filter(cyclistic_2022_v4, (is.na(end_station_name)))
end_station_id_missing_data <- filter(cyclistic_2022_v4, (is.na(end_station_id)))

In [46]:
start_station_name_missing_data%>%
tail()

In [47]:
#Filter those stations without names and try to find their values in other rows.
filter(cyclistic_2022_v4, start_station_id %in% unique(start_station_name_missing_data$start_station_id)) %>%
select(start_station_name, start_station_id) %>%
unique()%>%
arrange(start_station_id)
filter(cyclistic_2022_v4, end_station_id %in% unique(start_station_name_missing_data$start_station_id)) %>%
select(end_station_name, end_station_id) %>%
unique()%>%
arrange(end_station_id)


In [48]:
#Let's run the fill for names one more time, this time filling upward.
fill_missing_start_name <- cyclistic_2022_v4 %>%
  group_by(start_station_id) %>%
  select(start_station_name, start_station_id) %>%
  fill(start_station_name, .direction = "up") %>%
  ungroup()
sum(is.na(fill_missing_start_name$start_station_name))
colSums(is.na(fill_missing_start_name))

#Update our dataframe
cyclistic_2022_v4 <- cyclistic_2022_v4 %>%
  mutate(start_station_name = fill_missing_start_name$start_station_name) %>%
  arrange(started_at)

#Let's switch to the end station columns and fill the names there as well.
fill_missing_end_name <- cyclistic_2022_v4 %>%
  group_by(end_station_id) %>%
  select(end_station_name, end_station_id) %>%
  fill(end_station_name, .direction = "up") %>%
  ungroup()
sum(is.na(fill_missing_end_name$end_station_name))

cyclistic_2022_v4 <- cyclistic_2022_v4 %>%
  mutate(end_station_name = fill_missing_end_name$end_station_name)

In [49]:
colSums(is.na(cyclistic_2022_v4))

We filled all the station names, and it's time to summarize how members and casual riders use the Cyclistic service.

In [50]:
cyclistic_2022_v4 %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n(), min_by=min(ride_length), max_by=max(ride_length), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(month, member_casual, number_of_rides, average_duration)
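
To see why both the mean and the median are reported, here is a small sketch on made-up ride lengths. One long casual ride is enough to pull the average well above the median.

```r
library(dplyr)

trips <- tibble(
  member_casual = c("member", "member", "casual", "casual", "casual"),
  ride_length   = c(10, 14, 20, 30, 70)
)

by_type <- trips %>%
  group_by(member_casual) %>%
  summarise(
    number_of_rides  = n(),
    average_duration = mean(ride_length),
    median_duration  = median(ride_length),
    .groups = "drop"  # return an ungrouped table
  )
by_type
```

For the casual group the average is 40 minutes while the median is 30, which is the kind of gap that long trips create in the real data.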

In [51]:
aggregate(cyclistic_2022_v4$ride_length ~ cyclistic_2022_v4$member_casual, FUN = mean)
aggregate(cyclistic_2022_v4$ride_length ~ cyclistic_2022_v4$member_casual, FUN = median)
cyclistic_2022_v4$day_of_week <- ordered(cyclistic_2022_v4$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

# Plots and analysis

We need to segment the data further for better understanding, because from the pricing plans we assume that e-bike and classic bike riders use the service differently. Casual riders with day passes and single-ride users versus members form another segmentation.

## General group

In [52]:
#Group data by month for further visualization
plot_cyclistic_2022_month <- cyclistic_2022_v4 %>%
group_by(member_casual, month, rideable_type) %>%
summarise(number_of_rides = n(), min_by=min(ride_length), max_by=max(ride_length), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(month, rideable_type, member_casual, number_of_rides, average_duration)

In [53]:
#Group data by day of week for further visualization
plot_cyclistic_2022_day_of_week <- cyclistic_2022_v4 %>%
group_by(member_casual, day_of_week, rideable_type) %>%
summarise(min_by=min(ride_length), max_by=max(ride_length), number_of_rides = n(), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(day_of_week, member_casual, rideable_type, number_of_rides, average_duration)

In [54]:
#Group data by time of the day for further visualization
plot_cyclistic_2022_time <- cyclistic_2022_v4 %>%
group_by(member_casual, hour, rideable_type) %>%
summarise(number_of_rides = n(), min_by=min(ride_length), max_by=max(ride_length), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(hour, rideable_type, member_casual, number_of_rides, average_duration)

In [55]:
#Average duration of trips by months
plot_cyclistic_2022_month %>%
ggplot(aes(x = month, y = average_duration, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

In [56]:
#Number of rides by month
plot_cyclistic_2022_month %>%
ggplot(aes(x = month, y = number_of_rides, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

The two plots above show that members ride e-bikes for about as long as classic bikes: trips are short in both categories. The number-of-rides trend shows that e-bike use doesn't decline much even as the cold season approaches, for both casual riders and annual members.
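
One detail that affects how the monthly plots read: `month` is a zero-padded string, so the x-axis sorts from "01" to "12" even though the reporting period runs from July to June. A possible sketch for putting the months in period order is shown below; the toy counts and the axis labels are made up for illustration.

```r
library(ggplot2)

# Toy monthly counts; "month" is a zero-padded string as in the notebook.
monthly <- data.frame(
  month = c("01", "06", "07", "12"),
  number_of_rides = c(100, 700, 800, 200)
)

# The reporting period runs July to June, so list the factor levels in that order.
period_order <- sprintf("%02d", c(7:12, 1:6))
monthly$month <- factor(monthly$month, levels = period_order)

p <- ggplot(monthly, aes(x = month, y = number_of_rides)) +
  geom_col() +
  labs(x = "Month (July 2021 to June 2022)", y = "Number of rides")
p
```

With the factor in place, the bars follow the season from summer through winter and back, which makes the decline and recovery easier to read.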

In [57]:
plot_cyclistic_2022_day_of_week %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

In [58]:
plot_cyclistic_2022_day_of_week %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

Trip duration and number of rides, broken down by day of the week, show that casual riders use classic bikes more on weekends. Members and casual riders use e-bikes in much the same way during the week, with slightly longer trips on weekends.

In [59]:
plot_cyclistic_2022_time %>%
ggplot(aes(x = hour, y = number_of_rides, fill = member_casual))+
geom_col(position = "dodge")+
facet_wrap(~rideable_type)
options(repr.plot.width = 20, repr.plot.height = 5)

The hourly plot shows two rush-hour peaks, when members use all kinds of bikes before and after work. We will plot the most used stations for both categories further on.

To segment riders with day passes, we simply make another subset with ride_length of more than 50 minutes. As we remember, the pricing plans are designed so that members and single-trip users ride up to ~45 and ~30 min respectively. On average, for e-bikes it will be even less. Let's see what the ratio is.

In [60]:
#Approximate number of rides longer than 50 minutes, used as a proxy for day pass rides
sum(cyclistic_2022_v4$ride_length > 50)
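
The count alone does not give a ratio, so here is a minimal sketch of how the share of long rides could be computed per rider type. The toy values are made up; the 50-minute cutoff is the one used in this notebook.

```r
library(dplyr)

trips <- tibble(
  member_casual = c("member", "member", "member", "member",
                    "casual", "casual", "casual", "casual"),
  ride_length   = c(8, 12, 25, 55, 15, 40, 65, 120)
)

long_share <- trips %>%
  group_by(member_casual) %>%
  summarise(
    rides      = n(),
    long_rides = sum(ride_length > 50),   # rides above the cutoff
    share_long = mean(ride_length > 50),  # same count as a fraction of all rides
    .groups = "drop"
  )
long_share
```

Taking `mean()` of a logical vector gives the fraction of TRUE values, so `share_long` is directly the ratio of long rides in each group.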

## Day pass users

In [61]:
cyclistic_2022_v5 <- subset(cyclistic_2022_v4, ride_length > 50)

In [62]:
plot_cyclistic_2022_month <- cyclistic_2022_v5 %>%
group_by(member_casual, month, rideable_type) %>%
summarise(number_of_rides = n(), min_by=min(ride_length), max_by=max(ride_length), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(month, rideable_type, member_casual, number_of_rides, average_duration)

In [63]:
plot_cyclistic_2022_day_of_week <- cyclistic_2022_v5 %>%
group_by(member_casual, day_of_week, rideable_type) %>%
summarise(min_by=min(ride_length), max_by=max(ride_length), number_of_rides = n(), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(day_of_week, member_casual, rideable_type)

In [64]:
plot_cyclistic_2022_month %>%
ggplot(aes(x = month, y = number_of_rides, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 8)

In [65]:
plot_cyclistic_2022_day_of_week %>%
arrange(day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

We see more clearly that people with day passes use bikes mostly on weekends in warm months for leisure and hardly ride in winter, so it may be harder to engage them with annual passes.

## Single ride users vs Members

In [66]:
cyclistic_2022_v6 <- subset(cyclistic_2022_v4, ride_length < 50)

In [67]:
plot_cyclistic_2022_month <- cyclistic_2022_v6 %>%
group_by(member_casual, month, rideable_type) %>%
summarise(number_of_rides = n(), min_by=min(ride_length), max_by=max(ride_length), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(month, rideable_type, member_casual, number_of_rides, average_duration)

In [68]:
plot_cyclistic_2022_day_of_week <- cyclistic_2022_v6 %>%
group_by(member_casual, day_of_week, rideable_type) %>%
summarise(min_by=min(ride_length), max_by=max(ride_length), number_of_rides = n(), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(day_of_week, member_casual, rideable_type)

In [69]:
plot_cyclistic_2022_time <- cyclistic_2022_v6 %>%
group_by(member_casual, hour, rideable_type) %>%
summarise(number_of_rides = n(), min_by=min(ride_length), max_by=max(ride_length), average_duration=mean(ride_length), median_duration=median(ride_length))%>%
arrange(hour, rideable_type, member_casual, number_of_rides, average_duration)

In [70]:
plot_cyclistic_2022_month %>%
ggplot(aes(x = month, y = average_duration, fill = member_casual))+
geom_col(position = "dodge")+
facet_wrap(~ rideable_type)
options(repr.plot.width = 20, repr.plot.height = 5)

In [71]:
plot_cyclistic_2022_day_of_week %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

These plots for users who ride less than 50 min show a similar pattern for both single-ride users and annual members. They use bicycles and e-bikes mostly to commute, and this segment of casual riders needs to see the benefits of membership, which can be articulated and delivered via push messages in the Cyclistic app.

In [72]:
plot_cyclistic_2022_month %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual))+
geom_col(position = "dodge")+
facet_wrap(~ rideable_type)
options(repr.plot.width = 20, repr.plot.height = 5)

Here we see a more precise picture: casual users and members who ride for less than 50 min use e-bikes in the same way and will be more interested in using e-bikes for their daily needs in winter. So if Cyclistic offers these e-bike users an easy onboarding plan without the unlock fee, that would be a no-brainer for them.

In [73]:
plot_cyclistic_2022_time %>%
ggplot(aes(x = hour, y = average_duration, fill = rideable_type))+
geom_col(position = "dodge")+
facet_wrap(~member_casual)
options(repr.plot.width = 20, repr.plot.height = 5)

Here we see that some casual riders, like members, use e-bikes at rush hour, so this segment of users can be converted into members more easily with a proper offer.

# Conclusions

Two datasets with ~8 million trips were checked and cleaned. It took time to find missing station names, which would be helpful in a geotargeted ad campaign for casual users. I won't include the whole combined dataset for 2021 and 2022 here.

We had to check the current pricing plans on the company's website to see what can be offered to casual riders to turn them into members. The plots show the patterns of bike use for different groups, which is why we had to segment users for our analysis.

Let's start with the day pass users. If they ride once a month on a day off, or are tourists on a short trip to Chicago, they might not sign up for a whole year but may do so for a month. That lets them roam around the city easily if they decide to take one more trip during the month.

## Summary
* Casual riders spend more time on bikes than members
* A popular spot is Lake Shore Dr & Monroe St
* Classic bikes are the most rented
* Docked bikes have the longest ride times
* Saturday has the highest count of rented bikes
* Members favor classic and electric bikes, while casual riders prefer docked bikes
* Members ride consistently across all days of the week, as do casual riders