**Setting up environment.**

In [None]:
install.packages('tidyverse')

In [None]:
install.packages('lubridate')
install.packages('geosphere')

In [None]:
library('tidyverse')
library("ggplot2")
library("lubridate")
library("geosphere")

# Loading dataset

In [7]:
df1 <- read.csv("../input/cyclist/data/202009-divvy-tripdata.csv")
df2 <- read.csv("../input/cyclist/data/202010-divvy-tripdata.csv")
df3 <- read.csv("../input/cyclist/data/202011-divvy-tripdata.csv")
df4 <- read.csv("../input/cyclist/data/202012-divvy-tripdata.csv")
df5 <- read.csv("../input/cyclist/data/202101-divvy-tripdata.csv")
df6 <- read.csv("../input/cyclist/data/202102-divvy-tripdata.csv")
df7 <- read.csv("../input/cyclist/data/202103-divvy-tripdata.csv")
df8 <- read.csv("../input/cyclist/data/202104-divvy-tripdata.csv")
df9 <- read.csv("../input/cyclist/data/202105-divvy-tripdata.csv")
df10 <- read.csv("../input/cyclist/data/202106-divvy-tripdata.csv")
df11 <- read.csv("../input/cyclist/data/202107-divvy-tripdata.csv")
df12 <- read.csv("../input/cyclist/data/202108-divvy-tripdata.csv")


Since we loaded the monthly datasets separately, let's combine them into one data frame using the rbind function in R.

In [8]:
bike_rides <- rbind(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
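
The twelve `read.csv` calls and the `rbind` above could also be written as a loop over the files in a folder. The sketch below is one possible way to do that, not the method used in this notebook: `read_trip_files` is a hypothetical helper name, and the demo writes two tiny stand-in CSVs to a temporary folder so that it runs on its own.

```r
# Hypothetical helper: read every CSV in a folder whose name matches the
# monthly trip-data pattern and stack the results into one data frame.
read_trip_files <- function(data_dir, pattern = "-divvy-tripdata\\.csv$") {
  paths <- sort(list.files(data_dir, pattern = pattern, full.names = TRUE))
  frames <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, frames)
}

# Self-contained demo: two tiny made-up files in a temporary folder
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202009-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202010-divvy-tripdata.csv"), row.names = FALSE)

demo_rides <- read_trip_files(demo_dir)
dim(demo_rides)
```

With the real data, the call would presumably be `read_trip_files("../input/cyclist/data")`, assuming that folder holds only the twelve monthly files.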

In [9]:
dim(bike_rides)

Before dropping any duplicates, I wanted to check the size of the dataset: it has approximately 4.9 million rows and 13 columns. This is the first time I am working with such large data. Okay, let's dive into the project.

# Preprocessing and cleaning the data

In [10]:
summary(bike_rides)

We can see that started_at is stored as a string. To do calculations with it, we convert it (and ended_at) to a date-time format. I have also added a column for the date of the trip, plus start and end hour columns, which hold the hour part of started_at and ended_at.

In [11]:
bike_rides$Ymd  <- as.Date(bike_rides$started_at)

bike_rides$started_at <- lubridate::ymd_hms(bike_rides$started_at)
bike_rides$ended_at <- lubridate::ymd_hms(bike_rides$ended_at)

bike_rides$start_hour <- lubridate::hour(bike_rides$started_at)
bike_rides$end_hour <- lubridate::hour(bike_rides$ended_at)

In [12]:
cyclistic <- bike_rides[!duplicated(bike_rides$ride_id), ]
print(paste("Removed", nrow(bike_rides) - nrow(cyclistic), "duplicated rows"))
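
As a small illustration of how `duplicated()` behaves, the sketch below uses a made-up four-row data frame (the ride IDs are invented). It flags the second and later occurrences of a value, so negating it keeps the first copy of each ride.

```r
# Made-up rides; "A1" appears twice
rides <- data.frame(
  ride_id = c("A1", "A2", "A1", "A3"),
  member_casual = c("member", "casual", "member", "casual"),
  stringsAsFactors = FALSE
)

# TRUE only for repeats of an ID that was already seen
dup_flags <- duplicated(rides$ride_id)
dup_flags

# Keep the first occurrence of every ride_id
deduped <- rides[!dup_flags, ]
n_removed <- nrow(rides) - nrow(deduped)
print(paste("Removed", n_removed, "duplicated rows"))
```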

We have removed the duplicated rows. Now let's remove the rows containing NA values and compute the ride time in minutes.

In [13]:
cyclistic <- drop_na(cyclistic)
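# Ride duration in minutes; explicit units avoid relying on difftime's automatic choice
cyclistic$ride_time <- as.numeric(difftime(cyclistic$ended_at, cyclistic$started_at, units = "mins"))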
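
One thing worth checking before relying on `drop_na()`: by default `read.csv()` turns blank numeric fields into NA, but blank text fields (such as a missing station name) come through as empty strings and are not dropped. The sketch below shows this on a made-up three-row CSV; the station names and values are invented, and passing `na.strings` is just one possible way to treat blanks as missing.

```r
# Made-up CSV: row A2 has a blank station name, row A3 a blank latitude
csv_text <- "ride_id,start_station_name,end_lat
A1,Clark St,41.9
A2,,41.8
A3,Wells St,
"

# Default behaviour: the blank name is "", only the blank number is NA
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
colSums(is.na(raw))

# Treat empty strings as NA as well
with_na <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))
colSums(is.na(with_na))

# Rows that would survive dropping NAs in each case
sum(complete.cases(raw))
sum(complete.cases(with_na))
```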

In [14]:
head(cyclistic)

In [15]:
summary(cyclistic$ride_time)
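
Since ride_time is meant to be in minutes, it can be worth confirming the units. Subtracting two date-times lets R choose the units automatically from the smallest gap, so the same expression can return seconds on one dataset and minutes on another. The sketch below uses two invented rides to show the difference between the automatic choice and explicit units.

```r
# Two made-up rides lasting 30 minutes and 2 hours
start <- as.POSIXct(c("2021-01-01 08:00:00", "2021-01-01 09:00:00"), tz = "UTC")
end   <- as.POSIXct(c("2021-01-01 08:30:00", "2021-01-01 11:00:00"), tz = "UTC")

# Plain subtraction: R picks the units itself (minutes here, because the
# smallest gap is at least one minute but under an hour)
auto_diff <- end - start
units(auto_diff)

# Explicit units give the same scale regardless of the data
ride_minutes <- as.numeric(difftime(end, start, units = "mins"))
ride_minutes
```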

There are some negative values in the ride time, which logically doesn't make sense, so we will drop them (along with zero-length rides) to avoid confusion.

In [17]:
cyclistic <- cyclistic %>% filter(ride_time > 0)

In [18]:
cyclistic$day_of_week <- format(as.Date(cyclistic$Ymd), "%A")

Now let's calculate ride distance and ride speed. We already have the time, so to calculate speed we first need the distance, which we get from the start and end latitude and longitude coordinates. After finding the distance, we apply the formula speed = distance / time.

In [20]:
# distGeo() takes points as (longitude, latitude) and returns metres
cyclistic$ride_distance <- distGeo(matrix(c(cyclistic$start_lng, cyclistic$start_lat), ncol = 2), matrix(c(cyclistic$end_lng, cyclistic$end_lat), ncol = 2))
cyclistic$ride_distance <- cyclistic$ride_distance / 1000  # kilometres

# ride_time is in minutes, so multiply by 60 to get speed in km/h
cyclistic$ride_speed <- cyclistic$ride_distance / cyclistic$ride_time * 60

cyclistic$month <- strftime(cyclistic$started_at, "%m")
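
To make the units in the distance and speed calculation concrete, here is a small worked example in base R. It is only a sketch: `haversine_km` is a hypothetical helper that uses the haversine formula on a sphere of radius 6371 km, which is a simpler model than the ellipsoid used by `distGeo`, so the two will differ slightly. The coordinates and ride time are made up.

```r
# Hypothetical helper: great-circle distance in km on a spherical Earth
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Made-up ride: 0.01 degrees of latitude due north, taking 6 minutes
distance_km <- haversine_km(-87.63, 41.88, -87.63, 41.89)
ride_minutes <- 6

# Speed in km/h = distance in km / time in hours
speed_kmh <- distance_km / (ride_minutes / 60)

round(distance_km, 2)
round(speed_kmh, 1)
```

Note that this is the straight-line distance between the start and end points, so a round trip that ends where it started has a distance (and speed) of zero.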

In [22]:
ride_count_start_station <- cyclistic %>%
    group_by(start_station_name) %>% 
    summarise(ride_count = n())
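
To actually see which start stations are the most popular, the counts can be sorted. The sketch below does this on a made-up data frame (the station names are invented) and assumes that rides without a recorded station have an empty string as the name, which is how `read.csv` loads blank text fields by default.

```r
library(dplyr)

# Made-up rides; one has no recorded start station
rides <- data.frame(
  start_station_name = c("Clark St", "", "Wells St", "Clark St", "Wells St", "Clark St"),
  stringsAsFactors = FALSE
)

top_stations <- rides %>%
  filter(start_station_name != "") %>%   # skip rides with no station name
  count(start_station_name, name = "ride_count", sort = TRUE)

head(top_stations, 10)
```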

In [23]:
# row.names = FALSE keeps the row index out of the saved file
cyclistic %>%
  write.csv("cyclistic_clean.csv", row.names = FALSE)

# Analysing the data

In [24]:
ggplot(cyclistic, aes(member_casual, fill=member_casual)) +
    geom_bar() +
    labs(x="Casuals x Members", title="Casuals Vs Members distribution")

Members dominate in the count. To get more clarity, let's drill down using group_by and look at the actual numbers.

In [25]:
cyclistic %>% 
    group_by(member_casual) %>% 
    summarise(count = length(ride_id))
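
Raw counts are easier to compare as percentages. A minimal base R sketch on a made-up vector of rider types:

```r
# Made-up rider types for five rides
member_casual <- c("member", "casual", "member", "member", "casual")

counts <- table(member_casual)
shares <- round(100 * prop.table(counts), 1)   # percentage of all rides

counts
shares
```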

In [26]:
# monthly report
ggplot(cyclistic, aes(month, fill=member_casual)) +
    geom_bar(position=position_dodge()) +
    labs(x="months", title="No of rides per month") +
    coord_flip()

In [27]:
# weekday report
ggplot(cyclistic, aes(day_of_week, fill=member_casual)) +
    geom_bar(position=position_dodge()) +
    labs(x="weekdays", title="No of rides on weekdays")

We can see that there are more casual riders on Saturday and Sunday.
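
Because day_of_week is plain text, ggplot orders the bars alphabetically (Friday, Monday, ...). One way to get calendar order is to turn the column into a factor with explicit levels. The sketch below assumes English day names, since `format(..., "%A")` depends on the locale, and uses a made-up vector of days.

```r
# Calendar order for the axis (assumes English day names)
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Made-up sample of day names
day_of_week <- c("Sunday", "Monday", "Saturday", "Friday")

ordered_days <- factor(day_of_week, levels = day_levels)
table(ordered_days)   # counts now appear Monday to Sunday

# A simple weekend flag for comparing weekend and weekday rides
is_weekend <- ordered_days %in% c("Saturday", "Sunday")
is_weekend
```

Applied to the real data, this would be `cyclistic$day_of_week <- factor(cyclistic$day_of_week, levels = day_levels)` before plotting.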

In [28]:
cyclistic %>%
    ggplot(aes(start_hour, fill=member_casual)) +
    labs(x="Hour of the day", title="No of rides by start hour") +
    geom_bar(position=position_dodge())

From the graph above we can conclude that most rides start around 5 in the evening, and the afternoon hours dominate for the most part.
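
The peak hour can also be read off numerically instead of from the chart. The sketch below counts rides per hour for each rider type on a made-up data frame and keeps the busiest hour per group; it assumes dplyr 1.0 or later for `slice_max`.

```r
library(dplyr)

# Made-up start hours for six rides
rides <- data.frame(
  member_casual = c("member", "member", "member", "casual", "casual", "casual"),
  start_hour = c(8, 17, 17, 14, 17, 17),
  stringsAsFactors = FALSE
)

peak_hours <- rides %>%
  count(member_casual, start_hour, name = "ride_count") %>%
  group_by(member_casual) %>%
  slice_max(ride_count, n = 1, with_ties = FALSE) %>%   # busiest hour per group
  ungroup()

peak_hours
```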

In [29]:
names(cyclistic)

In [30]:
new_df <- cyclistic %>% 
    group_by(member_casual) %>% 
    summarise(mean_time = mean(ride_time),mean_distance = mean(ride_distance))
new_df

In [31]:
new_df1 <- cyclistic %>% 
    group_by(member_casual) %>% 
    summarise(median_time = median(ride_time),median_distance = median(ride_distance))
new_df1

There is a significant difference between the mean and median ride time, which may be due to some outliers. For simplicity, let's stick with the mean for now.
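
A tiny example shows how a single very long ride can pull the mean far away from the median. The ride times below are made up.

```r
# Made-up ride times in minutes; the last ride lasted a full day
ride_time <- c(8, 10, 12, 15, 1440)

mean_time <- mean(ride_time)       # pulled up by the one long ride
median_time <- median(ride_time)   # barely affected by it

# Mean without the long ride, for comparison
mean_without_outlier <- mean(ride_time[ride_time < 1440])

c(mean = mean_time, median = median_time, trimmed = mean_without_outlier)
```

Here the mean is 297 minutes while the median is 12, which is why the median is often the safer summary when a few extreme rides are present.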

In [32]:
ggplot(new_df, aes(x=member_casual, y = mean_time, fill = member_casual)) +
    geom_col(position=position_dodge()) +
    labs(x="member_casual", title="Mean Time members vs casual")

In [33]:
ggplot(new_df, aes(x=member_casual, y = mean_distance, fill = member_casual)) +
    geom_col(position=position_dodge()) +
    labs(x="member_casual", title="Mean distance members vs casual")

In [34]:
names(cyclistic)

In [35]:
cyclistic %>%
    ggplot(aes(rideable_type, fill=member_casual)) +
    labs(x="rideable type", title="Distribution of rideable_type") +
    geom_bar(position=position_dodge())

From the visualizations above we gained some good insights that will help us answer the questions posed earlier.

Both casual riders and members use our service heavily on weekends. Focusing on casual riders: they appear to use our bikes for leisure, and they use the service more in terms of ride time, even though members account for noticeably more rides. Casual riders tend to use classic and electric bikes more.

Annual members generally use our service for work-related trips, which makes sense: for someone who rides to work regularly, an annual subscription is the natural choice. They also use classic and electric bikes to a good extent.

So we can attract casual riders and convert them to members by giving good offers on classic and electric bikes and on weekend rides. We can also persuade casual riders to take an annual membership by giving coupons or offers tied to their ride time; since we know from the analysis that they ride for longer, this may draw some of them to membership.

We also know the popular start stations and the routes riders take, so we can place banners along those routes. We could also increase bike prices on weekends, since we know casual riders tend to use our service more then. Finally, we can give special perks to members, which would also help convert casual riders to members.
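
For the banner idea, a route table for casual riders would be the natural input. The sketch below counts start and end station pairs on a made-up data frame; the station names are invented and the column names are assumed to match those in the trip data.

```r
library(dplyr)

# Made-up rides with start and end stations
rides <- data.frame(
  start_station_name = c("Clark St", "Clark St", "Wells St", "Clark St"),
  end_station_name   = c("Wells St", "Wells St", "Clark St", "State St"),
  member_casual      = c("casual", "casual", "member", "casual"),
  stringsAsFactors = FALSE
)

top_routes <- rides %>%
  filter(member_casual == "casual") %>%   # focus on riders we want to convert
  count(start_station_name, end_station_name, name = "ride_count", sort = TRUE)

head(top_routes, 10)
```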