# Google Data Analytics Professional Certificate Capstone #

### Author: Serkan TOKGÖZ


### Briefing: #

Hi, I am Serkan :) This is my capstone project for the Google Data Analytics Certificate programme. I developed some marketing strategies for Cyclistic, a company that provides rental bikes to casual customers and members. In this self-study, I share these strategies, which are based on an analysis done in the R programming language.

**This case study uses tidyr, lubridate, plots, bar charts, and some date-time shortcuts. I wrote all of the code myself.**

# Cyclistic Marketing Strategy - Self Case Study

**Goal:** Design marketing strategies aimed at converting casual riders into annual members. To do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

## Step 1: Ask

How do annual members and casual riders use Cyclistic bikes differently?

* What is the problem to be solved?
Identify the main differences in how annual members and casual riders use Cyclistic bikes.

## Step 2: Prepare

For the purposes of this case study, the datasets are assumed to be appropriate and sufficient for a data analyst to answer the business questions. The data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Cyclistic bikes. However, data-privacy rules prohibit the use of riders' personally identifiable information.

There are 12 datasets covering the usage of casual riders and annual members over the past 12 months (May 2021 through April 2022).


**Let's load the datasets into data frames and see what we have.**

In [1]:
library(lubridate)

In [2]:
library(tidyverse)
library(dplyr)
path <- '../input/cyclistic-members-usage/'
secondname <- '-divvy-tripdata.csv'

## I chose to import the data directly in R because this case study has only 12 datasets.
month1 <- read_csv(paste(path,"202105",secondname, sep=""))
month2 <- read_csv(paste(path,"202106",secondname, sep=""))
month3 <- read_csv(paste(path,"202107",secondname, sep=""))
month4 <- read_csv(paste(path,"202108",secondname, sep=""))
month5 <- read_csv(paste(path,"202109",secondname, sep=""))
month6 <- read_csv(paste(path,"202110",secondname, sep=""))
month7 <- read_csv(paste(path,"202111",secondname, sep=""))
month8 <- read_csv(paste(path,"202112",secondname, sep=""))
month9 <- read_csv(paste(path,"202201",secondname, sep=""))
month10 <- read_csv(paste(path,"202202",secondname, sep=""))
month11 <- read_csv(paste(path,"202203",secondname, sep=""))
month12 <- read_csv(paste(path,"202204",secondname, sep=""))
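The twelve near-identical `read_csv()` calls can also be written as one loop over the month tags. Here is a minimal sketch of that idea. It does not touch the Kaggle input folder: `demo_path`, the two tiny demo files, and their columns are made up for illustration so the code runs anywhere.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the input folder: two tiny monthly files
# written to a temporary directory.
demo_path <- file.path(tempdir(), "cyclistic-demo")
dir.create(demo_path, showWarnings = FALSE)
suffix <- "-divvy-tripdata.csv"
months <- c("202105", "202106")

for (m in months) {
  demo_month <- data.frame(
    ride_id = paste0(m, "-", 1:2),
    member_casual = c("member", "casual")
  )
  write_csv(demo_month, file.path(demo_path, paste0(m, suffix)))
}

# Build every file name from the month tags, read each file,
# and stack the results into one data frame.
files <- file.path(demo_path, paste0(months, suffix))
all_months <- bind_rows(lapply(files, read_csv, col_types = cols()))

nrow(all_months)
```

With the real data, `months` would hold all twelve tags from `"202105"` to `"202204"` and `demo_path` would be replaced by `path`.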


**Combining all the data into one data frame**

In [3]:
allData <- rbind(month1, month2, month3, month4, month5, month6, month7, month8, month9, month10, month11, month12)
head(allData)

## Step 3: Process

This step covers the following tasks:

**1.** Check the data for errors.

**2.** Choose your tools.

**3.** Transform the data so you can work with it effectively.

In [4]:
## Drop unnecessary columns

cleanedData <- allData %>%
select(ride_id, rideable_type, started_at, ended_at, member_casual)

glimpse(cleanedData)

cleanedData <- cleanedData %>%
drop_na()

glimpse(cleanedData)

I wanted to compare the row counts before and after calling **drop_na()**. There appear to be no rows containing NA values in the selected columns, because the number of rows was the same after **drop_na()** was applied.
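Besides comparing the two `glimpse()` outputs by eye, the same check can be done with numbers. This is a small sketch on made-up data (the `rides` data frame below is hypothetical and has one missing end time on purpose), not on the real trip data.

```r
library(tidyr)

# Hypothetical toy data with one missing end time.
rides <- data.frame(
  ride_id = c("a", "b", "c"),
  ended_at = c("2021-05-01 10:00:00", NA, "2021-05-01 12:00:00"),
  stringsAsFactors = FALSE
)

# Count missing values per column before dropping anything.
na_counts <- colSums(is.na(rides))
na_counts

# Compare row counts before and after drop_na().
rows_before <- nrow(rides)
rows_after <- nrow(drop_na(rides))
rows_before - rows_after
```

A difference of zero would match what I saw on `cleanedData`.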

## Step 4: Analyze

This step covers the following tasks:

**1.** Aggregate your data so it’s useful and accessible.

**2.** Organize and format your data.

**3.** Perform calculations.

**4.** Identify trends and relationships.

In [5]:
head(cleanedData)
## Add duration_time and split the start datetime into date, month, and year columns.
## The piped data frame is already the first argument of mutate(), so it is not repeated.

cleanedData <- cleanedData %>%
mutate(duration_time = (ended_at - started_at)) %>%
mutate(start_date = date(started_at)) %>%
mutate(month_value = month(start_date)) %>%
mutate(year_value = year(start_date))

head(cleanedData)
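Subtracting two datetimes lets R pick the time unit automatically, so it can help to fix the unit explicitly. The sketch below shows one possible way with `difftime()`, and also flags rides that end before they start, which fits the "check the data for errors" task. The `rides` data frame and the `duration_mins` column are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rides; the second one ends before it starts.
rides <- data.frame(
  started_at = ymd_hms(c("2021-05-01 10:00:00", "2021-05-01 11:00:00")),
  ended_at   = ymd_hms(c("2021-05-01 10:30:00", "2021-05-01 10:59:00"))
)

# Duration as a plain number of minutes, with the unit stated explicitly.
rides <- rides %>%
  mutate(duration_mins = as.numeric(difftime(ended_at, started_at, units = "mins")))

# Rides with a zero or negative duration are likely data errors.
bad_rides <- filter(rides, duration_mins <= 0)
nrow(bad_rides)
```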


In [6]:
## To group by year-month later, we need a year-month string column.

cleanedData <- cleanedData %>%
mutate(string_ym_value = paste(year_value, month_value,"1", sep="-")) %>%
mutate(day_of_week = wday(start_date))
head(cleanedData)
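As an alternative to pasting year and month into a string and converting it back to a date later, lubridate's `floor_date()` can produce the first day of the month directly. This is only a sketch on two made-up timestamps; `month_start` is a hypothetical name, not a column from the analysis above.

```r
library(lubridate)

# Two hypothetical ride start times.
started_at <- ymd_hms(c("2021-05-15 08:00:00", "2022-01-02 17:30:00"))

# First day of each ride's month, kept as a Date instead of a string.
month_start <- floor_date(as_date(started_at), unit = "month")
month_start

# A year-month label for printing, if one is needed.
format(month_start, "%Y-%m")
```

Because `month_start` is already a date, it can be used for grouping and as a plot axis without another conversion.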

For the calculations, I am going to use several intermediate data frames.

## Step 5: Share

In [7]:
## Average ride duration by bike type

result1 <- cleanedData %>%
select(rideable_type, duration_time)

result1 <- result1 %>%
group_by(rideable_type) %>%
summarize(mean_duration_time = mean(duration_time))
result1
library(ggplot2)


In [8]:
ggplot(data = result1, aes (x =rideable_type, y = mean_duration_time, fill = mean_duration_time)) + 
geom_bar(stat="identity", width = 0.5) + labs(title = "Average Usage of Bike Types")

Docked bikes have a longer average ride duration than the other bike types.
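An average can be pulled up by a few very long rides, so it may be worth looking at the median and the ride count next to the mean. The sketch below uses made-up numbers (the `rides` data frame and the `duration_mins` column are hypothetical) just to show the pattern.

```r
library(dplyr)

# Hypothetical toy data: one very long docked-bike ride.
rides <- data.frame(
  rideable_type = c("classic_bike", "classic_bike",
                    "docked_bike", "docked_bike", "docked_bike"),
  duration_mins = c(10, 20, 15, 25, 500)
)

# Mean, median, and number of rides per bike type.
summary_by_type <- rides %>%
  group_by(rideable_type) %>%
  summarize(
    mean_mins = mean(duration_mins),
    median_mins = median(duration_mins),
    rides = n(),
    .groups = "drop"
  )

summary_by_type
```

In this toy example the single 500-minute ride lifts the docked-bike mean far above its median, which is the kind of effect to watch for in the real data.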

In [9]:
result2 <- cleanedData %>%
select(rideable_type, duration_time, member_casual)

result2 <- result2 %>%
group_by(member_casual, rideable_type) %>%
summarize(mean_duration_time = mean(duration_time)) %>%
arrange(rideable_type, -mean_duration_time, member_casual)
result2

In [10]:
ggplot(data = result2, aes (x =member_casual, y = mean_duration_time, fill = mean_duration_time)) + 
geom_bar(stat="identity", width = 0.5) + labs(title = "Average Usage of Bike Types") + facet_grid(~rideable_type)

It seems that members do not use docked bikes. However, docked bikes have a longer average ride duration, and those rides belong to casual riders. This is an opportunity to convert casual docked-bike riders into members.
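The observation that members do not ride docked bikes can be checked by counting rides per rider type and bike type. Here is a minimal sketch with `count()` on made-up data; the `rides` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy rides.
rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "casual"),
  rideable_type = c("docked_bike", "classic_bike", "classic_bike",
                    "electric_bike", "docked_bike")
)

# Number of rides for every rider type and bike type combination that occurs.
ride_counts <- rides %>%
  count(member_casual, rideable_type, name = "rides")

ride_counts
```

A combination that never occurs, such as member plus docked bike in this toy data, simply has no row in the result.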

In [11]:
result3 <- cleanedData %>%
select(rideable_type, duration_time, member_casual, string_ym_value)

result3$string_ym_value <- as.POSIXct(result3$string_ym_value)


result3 <- result3 %>%
group_by(string_ym_value, member_casual) %>%
summarize(mean_duration_time = mean(duration_time)) %>%
arrange(string_ym_value, -mean_duration_time, member_casual)

result3

In [12]:
ggplot(data = result3, aes (x =string_ym_value, y = mean_duration_time, fill = mean_duration_time)) + 
geom_bar(stat="identity", width = 0.5) + labs(title = "Average Usage Per Month") + facet_grid(~member_casual)

It seems that casual riders' ride durations are longer in summer than in other seasons. Special offers could be presented to casual riders during the summer months.
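The monthly chart can be easier to read with month labels on the x axis and a plain numeric duration on the y axis. The sketch below is one possible variant, built on made-up numbers; the `monthly` data frame and the `mean_mins` column are hypothetical.

```r
library(ggplot2)

# Hypothetical monthly averages in minutes.
monthly <- data.frame(
  month_start = as.Date(c("2021-06-01", "2021-07-01", "2021-06-01", "2021-07-01")),
  member_casual = c("casual", "casual", "member", "member"),
  mean_mins = c(35, 32, 14, 13)
)

# geom_col() draws bars whose heights are the values in the data;
# scale_x_date() formats the axis as abbreviated month and year.
p <- ggplot(monthly, aes(x = month_start, y = mean_mins)) +
  geom_col(width = 15) +
  scale_x_date(date_labels = "%b %Y") +
  facet_wrap(~member_casual) +
  labs(title = "Average Usage Per Month", x = "Month", y = "Mean duration (minutes)")

p
```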

In [13]:
head(cleanedData)

In [14]:
result4 <- cleanedData %>%
select(rideable_type, duration_time, member_casual, string_ym_value, day_of_week)

result4 <- result4 %>%
group_by(day_of_week, member_casual) %>%
summarize(average_duration_of_week_day = mean(duration_time)) %>%
mutate (day =  case_when(day_of_week == 1 ~ "Sunday",
                         day_of_week == 2 ~ "Monday",
                         day_of_week == 3 ~ "Tuesday",
                         day_of_week == 4 ~ "Wednesday",
                         day_of_week == 5 ~ "Thursday",
                         day_of_week == 6 ~ "Friday",
                         day_of_week == 7 ~ "Saturday")) %>%
arrange(-day_of_week)
                        
                    

result4
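The day names can also come straight from lubridate instead of a `case_when()` lookup. This is a small sketch on two made-up dates; the printed names depend on the locale of the R session.

```r
library(lubridate)

# A Saturday and a Sunday.
start_date <- as.Date(c("2021-05-01", "2021-05-02"))

# label = TRUE returns an ordered factor of day names; abbr = FALSE spells them out.
# week_start = 7 makes Sunday the first level, matching the 1 = Sunday numbering.
day <- wday(start_date, label = TRUE, abbr = FALSE, week_start = 7)

levels(day)
```

Because the result is an ordered factor, a plot that uses it keeps the days in calendar order without extra sorting.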

In [15]:
ggplot(data = result4, aes (x = reorder(day,-average_duration_of_week_day), y = average_duration_of_week_day, fill = average_duration_of_week_day)) + 
geom_bar(stat="identity", width = 0.5) + labs(title = "Favorite Day of Biking") + facet_grid(~member_casual)

Casual riders' average ride duration on Saturday and Sunday is greater than on the other days. It seems that special offers should be made on weekends to convert these casual riders into members.
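To put a single number on the weekend effect, the days can be collapsed into a weekend flag before averaging. The sketch below shows the idea on made-up data; the `rides` data frame and the `is_weekend` and `duration_mins` columns are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rides: Saturday, Sunday, Monday, Tuesday.
rides <- data.frame(
  start_date = as.Date(c("2021-05-01", "2021-05-02", "2021-05-03", "2021-05-04")),
  member_casual = "casual",
  duration_mins = c(40, 30, 10, 20)
)

# With week_start = 7, Sunday is 1 and Saturday is 7.
weekend_summary <- rides %>%
  mutate(is_weekend = wday(start_date, week_start = 7) %in% c(1, 7)) %>%
  group_by(member_casual, is_weekend) %>%
  summarize(mean_mins = mean(duration_mins), .groups = "drop")

weekend_summary
```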