# R: Cyclistic Bike Share Analysis 


## Connecting MySQL and R
I work on macOS, so I installed iODBC Administrator64 and then installed the R packages needed to connect MySQL with R.

In [None]:
## Install Packages and Libraries
> install.packages(c("DBI", "RODBC", "odbc", "dplyr", "dbplyr"))
> library(DBI)
> library(RODBC)
> library(odbc)
> library(dplyr)
> library(dbplyr)


In [None]:
## install MySQL package and library
> install.packages("RMySQL")
> library(RMySQL)

In [None]:
## Set up the connection
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, username="root", password="***", dbname ="bike_share", host="localhost")
> dbListTables(con)

## Install Packages
I used these packages to perform the analysis.

In [None]:
## Install packages
> install.packages("tidyverse")
> install.packages("janitor")
> install.packages("plyr")
> install.packages("mice")
> install.packages("modeest")

## Load libraries
> library(tidyverse)
> library(janitor)
> library(scales)
> library(modeest)
> library(plyr) 
> library(mice)

## Importing Data

In [None]:
## Listing the tables in the database
> dbListTables(con)
[1] "Bike_Rides"      "Bike_share_2023"

In [None]:
## Retrieve all rows from the Bike_rides table and store them in the bike_rides data frame

> bike_rides <- dbGetQuery(con, "SELECT * FROM Bike_rides")
> glimpse(bike_rides)
Rows: 5,718,608
Columns: 13
$ ride_id            <chr> "00000065B3150FF2", "0000085FE82E5429", "0000089D3…
$ rideable_type      <chr> "electric_bike", "electric_bike", "docked_bike", "…
$ started_at         <chr> "2023-09-24 12:56:50", "2023-04-13 17:52:05", "202…
$ ended_at           <chr> "2023-09-24 13:00:43", "2023-04-13 18:10:33", "202…
$ start_station_name <chr> "Sheridan Rd & Noyes St (NU)", "", "Clark St & Lak…
$ start_station_id   <chr> "604", "", "KA1503000012", "", "", "TA1309000033",…
$ end_station_name   <chr> "University Library (NU)", "Halsted St & 18th St",…
$ end_station_id     <chr> "605", "13099", "TA1307000062", "", "13061", "1324…
$ start_lat          <dbl> 42.05820, 41.88000, 41.88602, 41.90000, 41.88000, …
$ start_lng          <dbl> -87.67743, -87.63000, -87.63088, -87.75000, -87.65…
$ end_lat            <dbl> 42.05294, 41.85751, 41.89467, 41.90000, 41.90345, …
$ end_lng            <dbl> -87.67345, -87.64599, -87.63844, -87.76000, -87.66…
$ member_casual      <chr> "casual", "member", "casual", "casual", "member", …


In [None]:
We have 5,718,608 records. 

## Cleaning Data

* To ensure data integrity, we’ll check the dataset for duplicate records. Since ride_id is unique for each ride, we’ll use it as the key to identify duplicates.
* Remove empty rows and columns if needed. I found empty values only in the start station name and end station name columns, which we don't use in the main analysis, so I decided to keep these records.
* Check the unique values of bike type and member type.

In [None]:
## Checking for duplicates
> get_dupes(bike_rides, ride_id)
No duplicate combinations found of: ride_id
 [1] ride_id            dupe_count         rideable_type     
 [4] started_at         ended_at           start_station_name
 [7] start_station_id   end_station_name   end_station_id    
[10] start_lat          start_lng          end_lat           
[13] end_lng            member_casual     
<0 rows> (or 0-length row.names)

## Removing empty rows and columns
> bike_rides <- remove_empty(bike_rides, which = c("rows", "cols"))

## Checking unique values in selected columns 
> unique(bike_rides$rideable_type)
[1] "electric_bike" "docked_bike"   "classic_bike" 

> unique(bike_rides$member_casual)
[1] "casual" "member"
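The same checks can be sketched in base R on a toy data frame. The values below are made up for illustration, and `rides`, `n_dupes`, and `empty_counts` are hypothetical names. The sketch also counts empty strings per column, since the blank station names in the `glimpse()` output are `""` rather than `NA`.

```r
# Toy data frame standing in for bike_rides (values are made up for illustration)
rides <- data.frame(
  ride_id       = c("A1", "B2", "A1", "C3"),
  rideable_type = c("electric_bike", "classic_bike", "electric_bike", "docked_bike"),
  member_casual = c("casual", "member", "casual", "casual"),
  stringsAsFactors = FALSE
)

# Count ride_id values that repeat an earlier row
n_dupes <- sum(duplicated(rides$ride_id))

# Keep only the first occurrence of each ride_id
rides_unique <- rides[!duplicated(rides$ride_id), ]

# Count empty strings per column
empty_counts <- colSums(rides_unique == "")

print(n_dupes)
print(empty_counts)
print(sort(unique(rides_unique$rideable_type)))
```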

## Adding new calculated column
### Ride length column 
One of the most important metrics is ride length. We have the start time and the end time, so we can calculate the ride duration, but first we need to convert both columns from character to date-time (POSIXct) format.

In [None]:
# Convert 'started_at' and 'ended_at' from character to date-time (POSIXct)
> bike_rides$started_at <- as.POSIXct(bike_rides$started_at)
> bike_rides$ended_at <- as.POSIXct(bike_rides$ended_at)

# Find the length of each ride in minutes
> bike_rides$ride_length <- difftime(bike_rides$ended_at, bike_rides$started_at, units = "mins")
> bike_rides$ride_length_rounded <- round(bike_rides$ride_length)
> glimpse(bike_rides)
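As a small worked example, the conversion and duration calculation can be tried on two start/end pairs taken from the `glimpse()` output earlier in this notebook. The explicit `format` and the `America/Chicago` time zone are assumptions made for this sketch; the notebook itself relies on the session defaults.

```r
# Two start/end pairs copied from the glimpse() output
started <- c("2023-09-24 12:56:50", "2023-04-13 17:52:05")
ended   <- c("2023-09-24 13:00:43", "2023-04-13 18:10:33")

# Assumption: the timestamps are local Chicago time
tz <- "America/Chicago"
started_at <- as.POSIXct(started, format = "%Y-%m-%d %H:%M:%S", tz = tz)
ended_at   <- as.POSIXct(ended,   format = "%Y-%m-%d %H:%M:%S", tz = tz)

# Ride length in minutes as a plain number, plus the rounded version
ride_length <- as.numeric(difftime(ended_at, started_at, units = "mins"))
ride_length_rounded <- round(ride_length)

print(ride_length)
print(ride_length_rounded)
```

Stating the format and time zone explicitly makes the result independent of the machine's locale settings.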

### Season column 
To create this column, we classify each ride by the season of its start date: winter, spring, summer, or fall.

In [None]:
 # Extract the month number (1-12) from 'started_at'
> ride_month <- as.integer(format(bike_rides$started_at, "%m"))

 # Assign season based on month (vectorized, so no row-by-row loop is needed)
> bike_rides$season <- case_when(
    ride_month %in% 6:8  ~ "Summer",
    ride_month %in% 9:11 ~ "Fall",
    ride_month %in% 3:5  ~ "Spring",
    TRUE                 ~ "Winter"   # December, January, February
  )
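Before trusting the new column, the month-to-season mapping can be checked on a few hand-picked dates, including one late on the last day of August. This is a minimal base-R sketch; `season_by_month` and `season_of()` are hypothetical helpers, and the dates use UTC only to keep the example reproducible.

```r
# Month number (1-12) -> season, using the same month boundaries as this section
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")

# Hypothetical helper: look up the season from the month of a date-time
season_of <- function(x) {
  season_by_month[as.integer(format(x, "%m"))]
}

started_at <- as.POSIXct(
  c("2023-09-24 12:56:50", "2023-04-13 17:52:05",
    "2023-12-08 17:52:37", "2023-08-31 23:59:00"),
  tz = "UTC"
)

print(season_of(started_at))
```

Working from the month number avoids comparing date-time values against `Date` objects, so a ride late in the evening of August 31 is still classified as summer.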


In [None]:
Output:
Rows: 5,718,608
Columns: 16
$ ride_id             <chr> "00000065B3150FF2", "0000085FE82E5429", "0000089D36728778", "00000B15294F9057", "000010D58FFC4A2B", "0000…
$ rideable_type       <chr> "electric_bike", "electric_bike", "docked_bike", "electric_bike", "electric_bike", "classic_bike", "elect…
$ started_at          <dttm> 2023-09-24 12:56:50, 2023-04-13 17:52:05, 2023-05-08 13:22:50, 2023-12-08 17:52:37, 2023-11-10 17:46:50,…
$ ended_at            <dttm> 2023-09-24 13:00:43, 2023-04-13 18:10:33, 2023-05-08 13:28:53, 2023-12-08 18:00:54, 2023-11-10 17:59:41,…
$ start_station_name  <chr> "Sheridan Rd & Noyes St (NU)", "", "Clark St & Lake St", "", "", "Sheffield Ave & Webster Ave", "N Green …
$ start_station_id    <chr> "604", "", "KA1503000012", "", "", "TA1309000033", "20246.0", "443", "KA1503000043", "13022", "KA15030000…
$ end_station_name    <chr> "University Library (NU)", "Halsted St & 18th St", "Sedgwick St & Huron St", "", "Ashland Ave & Division …
$ end_station_id      <chr> "605", "13099", "TA1307000062", "", "13061", "13243", "13193", "KA17018068", "WL-012", "15541", "TA130900…
$ start_lat           <dbl> 42.05820, 41.88000, 41.88602, 41.90000, 41.88000, 41.92154, 41.88558, 41.97478, 41.88925, 41.89224, 41.78…
$ start_lng           <dbl> -87.67743, -87.63000, -87.63088, -87.75000, -87.65000, -87.65382, -87.64843, -87.69781, -87.63855, -87.61…
$ end_lat             <dbl> 42.05294, 41.85751, 41.89467, 41.90000, 41.90345, 41.91262, 41.92182, 41.93935, 41.88338, 41.86829, 41.78…
$ end_lng             <dbl> -87.67345, -87.64599, -87.63844, -87.76000, -87.66775, -87.68139, -87.64414, -87.68328, -87.64117, -87.62…
$ member_casual       <chr> "casual", "member", "casual", "casual", "member", "member", "casual", "member", "member", "casual", "memb…
$ ride_length         <drtn> 3.883333 mins, 18.466667 mins, 6.050000 mins, 8.283333 mins, 12.850000 mins, 12.350000 mins, 18.866667 m…
$ ride_length_rounded <drtn> 4 mins, 18 mins, 6 mins, 8 mins, 13 mins, 12 mins, 19 mins, 16 mins, 3 mins, 37 mins, 2 mins, 14 mins, 8…
$ season              <chr> "Fall", "Spring", "Spring", "Winter", "Fall", "Summer", "Fall", "Fall", "Winter", "Summer", "Winter", "Wi…

### Month column 

In [None]:
## Adding column for month
> bike_rides$month <- format(bike_rides$started_at, "%B")

### Day of week column 

In [None]:
## Adding column for day_of_week
> bike_rides$day_of_week <- format(as.Date(bike_rides$started_at), "%A")
## Checking new columns
> head(bike_rides)
  season     month day_of_week
1   Fall September      Sunday
2 Spring     April    Thursday
3 Spring       May      Monday
4 Winter  December      Friday
5   Fall  November      Friday
6 Summer      June    Thursday
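Because `day_of_week` is stored as character, tables and plots sort it alphabetically (Friday, Monday, Saturday, ...). One way to get calendar order is an ordered set of factor levels. The sketch below is an assumption-laden example rather than part of the original analysis: `day_levels` is a hypothetical name, and `%u` (ISO weekday number, 1 = Monday) is used so the result does not depend on the locale.

```r
# Three dates taken from the first rows shown above
dates <- as.Date(c("2023-09-24", "2023-04-13", "2023-05-08"))

# Calendar order for the factor levels
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# %u is the ISO weekday number: 1 = Monday ... 7 = Sunday
day_index <- as.integer(format(dates, "%u"))
day_of_week <- factor(day_levels[day_index], levels = day_levels)

# Sorting now follows the week, not the alphabet
print(sort(day_of_week))
```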

### Start Hour column

In [None]:
bike_rides$start_hour <- hour(bike_rides$started_at)

## Identification of Bad Data
Checking and removing records, where ride length is negative or equal to 0. 

In [None]:
## Converting ride_length into numeric
bike_rides$ride_length <- as.numeric(bike_rides$ride_length)

# Filter records whose rounded ride length is 0
> zero_ride_length <- bike_rides[bike_rides$ride_length_rounded == 0, ]

# Found 96,729 records with a rounded ride duration equal to 0.

In [None]:
# Removing records whose rounded ride length is 0
> bike_rides <- bike_rides[bike_rides$ride_length_rounded != 0, ]
> glimpse(bike_rides)

Rows: 5,621,879

In [None]:
## Handling short durations: removing records with a rounded ride length <= 1 minute
> bike_rides <- bike_rides[bike_rides$ride_length_rounded > 1, ]
> glimpse(bike_rides)
Rows: 5,568,808

# Removed 53,071 records.

In [None]:
# Filter records with negative ride_length 
> negative_ride_length <- bike_rides[bike_rides$ride_length_rounded < 0, ]

# Found 0 records with negative ride duration.
# Total number of deleted rows: 149,800

In [None]:
# Output: 
Rows: 5,568,808
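The filtering logic above can be illustrated on a handful of made-up durations. This is a sketch only: the values are invented, and `keep` and `n_removed` are hypothetical names. A single condition on the rounded length covers zero, very short, and negative rides at once.

```r
# Made-up ride lengths in minutes, including zero-ish, short, and negative values
ride_length <- c(0.2, 0.9, 1.6, 12.4, -3.0, 45.0)
rounded <- round(ride_length)

# Keep rides whose rounded length is more than 1 minute; NA values are dropped too
keep <- !is.na(rounded) & rounded > 1

n_removed <- sum(!keep)
cleaned <- ride_length[keep]

print(n_removed)
print(cleaned)
```

Counting `sum(!keep)` before subsetting gives the number of deleted rows directly, which makes totals such as the 149,800 reported above easy to reproduce.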

## Detecting Outliers

In [None]:
> summary(bike_rides$ride_length)
    Min.    1st Qu.   Median   Mean   3rd Qu.      Max. 
    1.02     5.70     9.8    18.67    17.23     98489.07 

The summary statistics for ride_length reveal some extreme values:<br>
Minimum: 1.02 minutes<br>
Maximum: 98489.07 minutes (approximately 68.4 days)<br>
Let’s create a boxplot to visualize the distribution of ride lengths by membership type (casual vs. member).

In [None]:
# Create a boxplot
library(ggplot2)
ggplot(bike_rides, aes(x = member_casual, y = ride_length, fill = member_casual)) +
  geom_boxplot() +
  labs(title = "Ride Length Distribution by Membership Type", x = "Membership Type", y = "Ride Length")

In [None]:
<img src='Images/R_Project/Graph before z-score.png' />

## Outlier Detection and Impact

The boxplot visually confirms the presence of outliers in our data. These extreme values can distort statistical analyses and affect the reliability of our results. It's essential to handle outliers appropriately to ensure accurate insights and model performance.


## Ride Length Distribution Analysis

The summary of the `ride_length` variable in the `bike_rides` dataset provides valuable insights about the distribution of ride lengths. Let's break down the key points:

- **Minimum (Min.) Ride Length**: The shortest ride length observed in the dataset is 1.02 minutes (which suggests that someone took the bike by mistake or changed their mind shortly after starting the ride).

- **First Quartile (1st Qu.)**: The 25th percentile (Q1) corresponds to a ride length of 5.7 minutes. Approximately 25% of rides are shorter than this value.

- **Median (Midpoint)**: The median (50th percentile) ride length is 9.80 minutes. Roughly half of the rides fall below this duration.

- **Mean (Average) Ride Length**: The average ride length across all rides is 18.67 minutes. However, this value can be influenced by extreme outliers.

- **Third Quartile (3rd Qu.)**: The 75th percentile (Q3) represents a ride length of 17.23 minutes. Approximately 75% of rides fall below this duration.

- **Maximum (Max.) Ride Length**: The longest ride observed in the dataset is an astonishing 98489.07 minutes (which seems highly unusual and warrants further investigation).

### Conclusions:

- The dataset exhibits a wide range of ride lengths, spanning from very short to extremely long rides.
- The mean ride length is higher than the median, suggesting the presence of outliers (very long rides) that impact the average.
- Investigate extreme values (e.g., the ride with a length of 98489.07 minutes) to determine their validity and potential anomalies.

These insights can guide decision-making related to bike-sharing program management and resource allocation.


### Calculating outliers using Z-Score Method:
Define an observation as an outlier if its z-score is less than -3 or greater than 3.
Calculate the z-score for each value in the ride_length column:

In [None]:
> bike_rides$z <- (bike_rides$ride_length - mean(bike_rides$ride_length)) / sd(bike_rides$ride_length)

In [None]:
> outliers <- bike_rides[abs(bike_rides$z) > 3, ]

In [None]:
## Removing Outliers
> bike_rides <- bike_rides[abs(bike_rides$z) <= 3, ]
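To see how the z-score rule behaves, here is a small sketch on made-up data: twenty ordinary ride lengths plus one extreme value. The numbers are invented for illustration, and `is_outlier` and `cleaned` are hypothetical names.

```r
# Made-up ride lengths in minutes: 20 ordinary rides plus one extreme value
ride_length <- c(seq(5, 24), 1000)

# z-score: distance from the mean in units of standard deviation
z <- (ride_length - mean(ride_length)) / sd(ride_length)

# Flag observations more than 3 standard deviations from the mean
is_outlier <- abs(z) > 3
cleaned <- ride_length[!is_outlier]

print(which(is_outlier))
print(summary(cleaned))
```

Because the mean and standard deviation are themselves inflated by extreme rides, this rule removes only the most extreme values, which is consistent with rides of several hundred minutes remaining in the data after filtering.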

In [None]:
## Boxplot after removing outliers

> library(ggplot2)
> ggplot(bike_rides, aes(x = member_casual, y = ride_length, fill = member_casual)) +
+     geom_boxplot() +
+     labs(title = "Ride Length Distribution by Membership Type", x = "Membership Type", y = "Ride Length")

In [None]:
> summary(bike_rides$ride_length)
   Min.  1st Qu.  Median    Mean   3rd Qu.    Max. 
  1.500   5.783   9.850    15.046  17.267    568.633 

In [None]:
JPEG AFTER z-score

### Grouping data for visualization
#### by member_casual status
At the start of this section, group_by() did not work as expected: I got an overall summary instead of a grouped one. The cause was that I had loaded plyr after dplyr, so plyr's summarise() masked the dplyr version.
So, I removed plyr, tried again, and got the grouped summary.

In [None]:
# Calculate average ride length by membership type
avg_ride_length <- aggregate(ride_length ~ member_casual, data = bike_rides, FUN = mean)

# Create a bar plot
library(ggplot2)
ggplot(avg_ride_length, aes(x = member_casual, y = ride_length, fill = member_casual)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Ride Length by Membership Type", x = "Membership Type", y = "Average Ride Length")

In [None]:
# A tibble: 2 × 7
  member_casual ride_count ride_percentage average_ride_duration median   max   min
  <chr>              <int>           <dbl>                 <dbl>  <dbl> <dbl> <dbl>
1 casual           1997117            35.9                  19.9  12.2   566.  1.02
2 member           3561550            64.1                  12.1   8.75  566.  1.02
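The code that produced the summary table above is not shown in this notebook. One way such a table could be computed is sketched below in base R on a toy data frame; the values are made up, and `rides`, `by_group`, and `summary_tbl` are hypothetical names.

```r
# Toy data standing in for bike_rides (ride_length in minutes, made-up values)
rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_length   = c(10, 30, 5, 10, 15),
  stringsAsFactors = FALSE
)

# Split ride lengths by membership type
by_group <- split(rides$ride_length, rides$member_casual)

# One row per group: count, mean, and median
summary_tbl <- data.frame(
  member_casual         = names(by_group),
  ride_count            = vapply(by_group, length, integer(1)),
  average_ride_duration = vapply(by_group, mean, numeric(1)),
  median                = vapply(by_group, median, numeric(1)),
  row.names = NULL
)

# Share of all rides, in percent
summary_tbl$ride_percentage <- round(100 * summary_tbl$ride_count / sum(summary_tbl$ride_count), 1)

print(summary_tbl)
```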

In [None]:
JPEG Avg ride depending from member type


#### by member_casual status and day_of_week 

In [None]:
## Comparing by each day of week for member vs casual
bike_rides %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_ride_duration = mean(ride_length),
            median = median(ride_length),
            max = max(ride_length),
            min = min(ride_length),
           .groups="drop") %>%
  arrange(member_casual, day_of_week) %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders Vs. Day of the week") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

In [None]:
# A tibble: 14 × 7
   member_casual day_of_week number_of_rides average_ride_duration median   max   min
   <chr>         <chr>                 <int>                 <dbl>  <dbl> <dbl> <dbl>
 1 casual        Friday               297617                  19.1  11.8   561.  1.02
 2 casual        Monday               230337                  19.9  11.8   563.  1.02
 3 casual        Saturday             389241                  22.5  14.1   566.  1.02
 4 casual        Sunday               351339                  22.9  14.2   564.  1.02
 5 casual        Thursday             256482                  17.3  10.8   566.  1.02
 6 casual        Tuesday              232945                  18.0  10.9   565.  1.02
 7 casual        Wednesday            239156                  17.0  10.6   563.  1.02
 8 member        Friday               524124                  12.0   8.62  563.  1.02
 9 member        Monday               470709                  11.6   8.35  566.  1.02
10 member        Saturday             462840                  13.4   9.52  565.  1.02
11 member        Sunday               416478                  13.6   9.52  560.  1.02
12 member        Thursday             567941                  11.7   8.58  559.  1.02
13 member        Tuesday              551274                  11.7   8.5   562.  1.02
14 member        Wednesday            568184                  11.6   8.53  558.  1.02


In [None]:
JPEG Total rides by Members and Casual riders Vs. Day of the week

Based on the bar graph:<br>

**Weekend Peaks:** Casual riders show a significant increase in rides during the weekends, especially on Saturdays.<br>
**Member Consistency:** Members’ ride numbers are relatively consistent throughout the week, with a slight uptick on weekdays.<br>
**Monday Dip:** Casual riders take the fewest rides on Mondays, while members take the fewest on Sundays.<br>
**Distinct Patterns:** There is a clear difference in riding patterns between members and casual riders, with casual riders peaking on weekends and members showing steadier usage.<br>

#### by Day_of_week & Member_casual Status VS Avg_ride_duration

In [None]:
bike_rides %>%
  group_by(member_casual, day_of_week) %>% 
  summarise(average_ride_duration = mean(ride_length)) %>% 
  ggplot(aes(x = day_of_week, y = average_ride_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = format(round(average_ride_duration, 2), nsmall = 2), vjust = - 0.5)) +
  labs(title = "Grouping Average Duration by Day of Week")

In [None]:
# A tibble: 14 × 3
# Groups:   member_casual [2]
member_casual    day_of_week        average_ride_duration
   <chr>         <chr>                       <dbl>
 1 casual        Friday                       19.2
 2 casual        Monday                       20.0
 3 casual        Saturday                     22.7
 4 casual        Sunday                       23.0
 5 casual        Thursday                     17.4
 6 casual        Tuesday                      18.1
 7 casual        Wednesday                    17.1
 8 member        Friday                       12.1
 9 member        Monday                       11.7
10 member        Saturday                     13.5
11 member        Sunday                       13.7
12 member        Thursday                     11.8
13 member        Tuesday                      11.7
14 member        Wednesday                    11.7

In [None]:
JPEG Avg duration by day_of_week.jpg

Based on the bar graph, here are the key findings: <br>

**Casual vs. Member:** Casual riders consistently have higher average trip durations than members across all days of the week. <br>
**Peak Days:**
For casual riders, the longest average trip duration occurs on Sunday, approximately 23.0 minutes.<br>
For members, the peak is also on Sunday, about 13.7 minutes.<br>
**Lowest Durations:**
The shortest average trip duration for casual riders is on Wednesday, about 17.15 minutes.<br>
For members, it’s on Tuesday, roughly 11.65 minutes.

#### by Member_casual Status & Season

In [None]:
bike_rides %>%
  group_by(member_casual, season) %>%
  summarise(number_of_rides = n(),
            average_ride_duration = mean(ride_length),
            median = median(ride_length),
           .groups="drop") %>%
  arrange(season) %>% 
  ggplot(aes(x = season, y = average_ride_duration, fill = member_casual)) +
  labs(title ="Average ride duration by Members and Casual riders Vs. Season") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

In [None]:
# A tibble: 8 × 5
  member_casual season number_of_rides average_ride_duration median
  <chr>         <chr>            <int>                 <dbl>  <dbl>
1 casual        Fall            519523                  19.0  11.4 
2 member        Fall            998935                  11.9   8.63
3 casual        Spring          425321                  20.0  12.0 
4 member        Spring          809917                  11.9   8.52
5 casual        Summer          908727                  21.6  13.5 
6 member        Summer         1270922                  13.3   9.75
7 casual        Winter          129853                  14.0   8.53
8 member        Winter          451516                  10.6   7.47

In [None]:
JPEG Graph by Member_casual & Season

Based on the bar graph, here are the key findings: <br>

**Seasonal Trends:** Both groups show a clear seasonal pattern: ride counts and average ride durations are highest in summer and lowest in winter. <br>
**Summer Peak:** Casual riders average 21.6 minutes per ride in summer compared with 13.3 minutes for members, although members still take more rides (1,270,922 vs. 908,727).<br>
**Winter Drop:** Both groups ride less and for shorter durations in winter, with members still taking far more rides than casual riders (451,516 vs. 129,853).<br>
**Spring:** In spring, members take almost twice as many rides as casual riders (809,917 vs. 425,321), while casual rides last longer on average (20.0 vs. 11.9 minutes).<br>

#### by number of rides per month by day_of_week for Members and Casuals

In [None]:
ggplot(data = bike_rides) + geom_bar(mapping = aes(x = day_of_week, fill = member_casual)) +
  facet_wrap(~ month) + 
  ylab("number_of_rides") + 
  labs(title = "Number of Rides per Month") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, vjust=0.5, size = 6)) 

In [None]:
JPEG number of rides per month by day_of_week for Members and Casuals

**Peak Season:** The peak season for bike rides is during the summer months. During this time, almost all days are popular for riding. <br>
**Least Popular Months:** The least popular months for bike rides are December, January, February, and March.<br>
**Member vs. Casual Riders:**
Summer Months: Casual riders come closest to members in the summer, although members still take more rides overall (1,270,922 vs. 908,727).<br>
Winter Season: During the winter, members account for most rides (451,516 vs. 129,853 casual rides).<br>

#### by Rideable Type & Member_casual Status VS Avg_ride_duration

In [None]:
bike_rides %>%
  group_by(member_casual, rideable_type) %>%
  summarise(number_of_rides = n(),
            average_ride_duration = mean(ride_length),
            median = median(ride_length),
           .groups="drop") %>%
  arrange(rideable_type) %>% 
  ggplot(aes(x = rideable_type, y = average_ride_duration, fill = member_casual)) +
  labs(title ="Average ride duration by Members and Casual riders Vs. Bike") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

In [None]:
A tibble: 5 × 5
member_casual   rideable_type     number_of_rides    average_ride_duration   median
  <chr>         <chr>                   <int>               <dbl>           <dbl>
1 casual        classic_bike           854545                24.1           14.6 
2 member        classic_bike          1775578                12.8           9.13
3 casual        docked_bike             74958                45.8            28   
4 casual        electric_bike         1053921                14.9           10.2 
5 member        electric_bike         1755712                11.7           8.53

In [None]:
JPEG Ride duration_member_bike_type.jpg

Here are the key findings from the bar graph:<br>

**Docked Bikes:** Docked bikes are used only by casual riders and have by far the longest average ride duration (45.8 minutes).<br>
**Classic Bikes:** Casual riders average 24.1 minutes on classic bikes, almost twice the member average of 12.8 minutes.<br>
**Electric Bikes:** Electric bikes have the shortest average ride durations for both groups (14.9 minutes for casual riders, 11.7 minutes for members).<br>
**Casual vs. Member:** Casual riders take longer rides than members on every bike type.<br>

#### by number of rides by day of week for Members  and Casuals

In [None]:
## Rides by day of week
bike_rides %>%
    group_by(member_casual, day_of_week) %>%
    summarise(number_of_rides = n(),
.groups="drop") %>%
  arrange(day_of_week) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Rides by Day of Week") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))


In [None]:
# A tibble: 14 × 3
   member_casual day_of_week number_of_rides
   <chr>         <chr>                 <int>
 1 casual        Friday               295557
 2 member        Friday               519450
 3 casual        Monday               228818
 4 member        Monday               466641
 5 casual        Saturday             386438
 6 member        Saturday             459067
 7 casual        Sunday               348826
 8 member        Sunday               413020
 9 casual        Thursday             254849
10 member        Thursday             563053
11 casual        Tuesday              231359
12 member        Tuesday              546608
13 casual        Wednesday            237577
14 member        Wednesday            563451

In [None]:
JPEG Number of rides by day of week

**Peak Day:** Casual riders take the most rides on Saturday (386,438), while members peak midweek on Wednesday (563,451). <br>
**Member Dominance:** Members take more rides than casual riders on every day of the week.<br>
**Weekend Patterns:** The smallest difference in the number of rides between members and casual riders is observed on weekends.<br>
**Weekly Trend:** There is a noticeable increase in rides from Monday to Saturday, with a slight dip on Sunday.<br>
**Lowest Rides:** Monday has the fewest rides for casual riders (228,818), while Sunday has the fewest for members (413,020).<br>

#### by number of rides by Month for Members  and Casuals

In [None]:
 ## Rides by month
  bike_rides %>%
          group_by(member_casual, month) %>%
          summarise(number_of_rides = n(),
        .groups="drop") %>%
          arrange(month) %>%
          ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
          geom_col(position = "dodge") + labs(title = "Rides by Month") +
          scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+ theme(axis.text.x = element_text(angle = 45, vjust=0.5, size = 6)) 

In [None]:
JPEG Number of rides by month

**Seasonal Trends:** Ride frequency increases during warmer months, peaking in August, and decreases during colder months, with the lowest point in February.<br>
**Member Consistency:** Members consistently take more rides than casual riders throughout the year.<br>
**Casual vs. Member Peak:** Both casual and member rides peak in August, indicating it’s the most popular month for bike-sharing.<br>
**Off-Peak Observations:** There is a notable decline in rides from October to November for both user types.<br>

#### by number of rides by Day_of_week depending from bike type

In [None]:
 bike_rides %>%
                   group_by(rideable_type, day_of_week) %>%
                   summarise(number_of_rides = n(),
                 .groups="drop") %>%
                   arrange(day_of_week) %>%
                   ggplot(aes(x = day_of_week, y = number_of_rides, fill = rideable_type)) +
                   geom_col(position = "dodge") + labs(title = "Number of rides depending from bike type") +
                   scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+ theme(axis.text.x = element_text(angle = 45, vjust=0.5, size = 6)) 

In [None]:
JPEG Number of rides depending from bike type

**Classic and Electric Bikes:** Classic and electric bikes account for most rides; in total, electric bikes (2,809,633 rides) slightly outnumber classic bikes (2,630,123). <br>
**Docked Bike Dip:** Docked bikes see a significant drop in usage on weekends.<br>
**Electric Bike Consistency:** Electric bikes have the least fluctuation in rides between weekdays and weekends.<br>
**Monday Slump:** All bike types experience a decrease in rides on Monday.<br>

#### by average ride duration by Day_of_week depending from bike type

In [None]:
bike_rides %>%
     group_by(rideable_type, day_of_week) %>%
     summarise(avg_duration = mean(ride_length), .groups = "drop") %>%
     arrange(day_of_week) %>%
     ggplot(aes(x = day_of_week, y = avg_duration, fill = rideable_type)) +
     geom_col(position = "dodge") +
     labs(title = "Average ride duration with different bike types by each day") +
     scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
     theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 6))

In [None]:
JPEG Average ride duration with different bike types by each day

**Classic Bikes:** Have average durations between electric and docked bikes, in line with the overall averages by bike type (12.8 minutes for members and 24.1 minutes for casual riders).<br>
**Electric Bikes:** Generally have the lowest average duration (11.7 minutes for members and 14.9 minutes for casual riders overall), suggesting they are used for shorter, quicker trips.<br>
**Docked Bikes:** Show the highest average duration, which may imply they are preferred for longer rides or leisure activities.<br>

In [None]:
# Calculate the counts for each category in the 'member_casual' column
number_of_rides <- table(bike_rides$member_casual)

# Calculate the percentages
ride_percents <- prop.table(number_of_rides) * 100

# Create the pie chart
pie(number_of_rides,
    labels = paste0(number_of_rides, " (", round(ride_percents, 1), "%)"),  # Include counts and percentages in labels
    col = c("skyblue", "lightgreen"),  # Colors for each slice
    main = "Bike Rides by Member Type"  # Title of the chart
    )

# Add a legend
legend("topright", legend = c("Casual", "Member"), fill = c("skyblue", "lightgreen"))

In [None]:
JPEG Bike rides by member type

**Member:** Represents 64% of bike rides, totaling 3,531,290. <br>
**Casual:** Accounts for 36% of bike rides, with a total of 1,983,424.<br>

#### Top 10 start stations 

In [None]:
## Created summary for top 10 stations
## Rows with a missing or empty station name are excluded before counting
top_station <- bike_rides %>%
  filter(!is.na(start_station_name), start_station_name != "") %>%
  group_by(start_station_name) %>%
  summarise(total_count = n()) %>%
  slice_max(total_count, n = 10)

top_station

In [None]:
A tibble: 10 × 2
   start_station_name                 total_count
   <chr>                                    <int>
 1 Streeter Dr & Grand Ave                  61195
 2 DuSable Lake Shore Dr & Monroe St        38948
 3 Michigan Ave & Oak St                    36262
 4 Clark St & Elm St                        34993
 5 DuSable Lake Shore Dr & North Blvd       34838
 6 Kingsbury St & Kinzie St                 34112
 7 Wells St & Concord Ln                    32764
 8 Clinton St & Washington Blvd             31633
 9 Wells St & Elm St                        29718
10 Theater on the Lake                      29217
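The same top-N logic can be sketched in base R on a toy vector of station names. The data below is made up (only the station names are borrowed from the table above), and `named`, `counts`, and `top` are hypothetical names. The key point is that empty and missing names are dropped before counting, so they cannot take one of the top slots.

```r
# Toy start-station column with empty and missing entries (counts are made up)
start_station_name <- c("Streeter Dr & Grand Ave", "", "Michigan Ave & Oak St",
                        "Streeter Dr & Grand Ave", NA, "Clark St & Elm St",
                        "Streeter Dr & Grand Ave", "Michigan Ave & Oak St", "")

# Drop missing and empty names first
named <- start_station_name[!is.na(start_station_name) & start_station_name != ""]

# Count rides per station and keep the two busiest
counts <- sort(table(named), decreasing = TRUE)
top <- head(counts, 2)

top_station <- data.frame(
  start_station_name = names(top),
  total_count        = as.integer(top),
  stringsAsFactors   = FALSE
)

print(top_station)
```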

In [None]:
> ggplot(data = top_station) +
     geom_col(mapping = aes(x = reorder(start_station_name, -total_count), y = total_count, fill = start_station_name),
              position = "dodge") +
     labs(title = "Top Ten Start Stations") +
     theme(axis.text.x = element_blank(), axis.title.x = element_blank())

In [None]:
JPEG Top Ten Start Stations

**Most Popular:** “Streeter Dr & Grand Ave” is the most popular start station with 61,195 starts. <br>
**High Usage:** “DuSable Lake Shore Dr & Monroe St” follows in second place with 38,948 starts.<br>

#### Top 10 start stations for Members

In [None]:
## Top Ten Stations for Members
top_station1 <- bike_rides %>%
  filter(member_casual == "member") 
## excluding station name with empty cells
top_station1 <- top_station1[!(is.na(top_station1$start_station_name) | top_station1$start_station_name == ""), ]

top_station_member <- top_station1 %>%
  group_by(start_station_name) %>%
  summarise(total_count = n()) %>%
  slice_max(total_count, n = 10)

top_station_member

In [None]:
# A tibble: 10 × 2
   start_station_name           total_count
   <chr>                              <int>
 1 Kingsbury St & Kinzie St           25515
 2 Clinton St & Washington Blvd       25297
 3 Clark St & Elm St                  24417
 4 Wells St & Concord Ln              20871
 5 Wells St & Elm St                  19925
 6 Clinton St & Madison St            19781
 7 University Ave & 57th St           19382
 8 Broadway & Barry Ave               18523
 9 Loomis St & Lexington St           18268
10 Ellis Ave & 60th St                17637

In [None]:
ggplot(data = top_station_member) +
     geom_col(mapping = aes(x = reorder(start_station_name, -total_count), y = total_count, fill = start_station_name),
     position = "dodge") +
     labs(title = "Top Ten Start Stations by Members") +
     theme(axis.text.x = element_blank(), axis.title.x = element_blank())

In [None]:
JPEG Top ten start stations for members

**Most Popular Station:** Kingsbury St & Kinzie St has the highest total count of starts by members (25,515).<br>
**Close Second:** Clinton St & Washington Blvd follows closely behind with 25,297 starts.<br>
**Consistent Usage:** The top ten stations show a relatively even distribution of member usage, ranging from 17,637 to 25,515 starts.<br>
**Least Utilized:** Ellis Ave & 60th St has the lowest total count (17,637) among the top ten stations.<br>
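
The steps above (filter by rider type, drop empty station names, count starts, keep the busiest stations) can be wrapped in a small function so the same logic serves both rider types. This is a minimal sketch, not the notebook's original code: `top_start_stations()` and its `how_many` argument are hypothetical names, and the toy data frame is made up purely to show the call. The `dplyr::` prefixes are there because `plyr` is loaded after `tidyverse` earlier in this notebook and masks several dplyr verbs.

```r
library(dplyr)

# Hypothetical helper (not part of the original notebook): count starts per
# station for one rider type and return the busiest stations.
top_start_stations <- function(rides, rider_type, how_many = 10) {
  rides %>%
    dplyr::filter(member_casual == rider_type,
                  !is.na(start_station_name),
                  start_station_name != "") %>%
    dplyr::group_by(start_station_name) %>%
    dplyr::summarise(total_count = dplyr::n(), .groups = "drop") %>%
    dplyr::arrange(dplyr::desc(total_count)) %>%
    head(how_many)  # exactly how_many rows; ties at the cutoff are dropped
}

# Made-up toy rides, only to demonstrate the call; this is not Cyclistic data.
toy_rides <- data.frame(
  member_casual      = c("member", "member", "member", "casual", "casual", "casual"),
  start_station_name = c("A St", "A St", "", "B St", "B St", "A St"),
  stringsAsFactors   = FALSE
)

top_members <- top_start_stations(toy_rides, "member", how_many = 1)
top_casuals <- top_start_stations(toy_rides, "casual", how_many = 1)
```

On the real data the calls would be `top_start_stations(bike_rides, "member")` and `top_start_stations(bike_rides, "casual")`, which avoids copying the filter line between the two sections.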

#### Top 10 start stations for Casual Riders

In [None]:
## Top Ten Stations for Casual
top_station2 <- bike_rides %>%
  filter(member_casual == "casual")
## Exclude rows with a missing or empty station name
## (the NA check must use top_station2, not the member data frame)
top_station2 <- top_station2[!(is.na(top_station2$start_station_name) | top_station2$start_station_name == ""), ]

## Count starts per station and keep the ten busiest
top_station_casual <- top_station2 %>%
  group_by(start_station_name) %>%
  summarise(total_count = n()) %>%
  arrange(desc(total_count)) %>%
  top_n(10, total_count)

top_station_casual

In [None]:
# A tibble: 10 × 2
   start_station_name                 total_count
   <chr>                                    <int>
 1 Streeter Dr & Grand Ave                  44500
 2 DuSable Lake Shore Dr & Monroe St        29492
 3 Michigan Ave & Oak St                    21989
 4 DuSable Lake Shore Dr & North Blvd       19729
 5 Millennium Park                          19451
 6 Shedd Aquarium                           17209
 7 Theater on the Lake                      15874
 8 Dusable Harbor                           14966
 9 Wells St & Concord Ln                    11893
10 Adler Planetarium                        11547

In [None]:
ggplot(data = top_station_casual) +
     geom_col(mapping = aes(x = reorder(start_station_name, -total_count), y = total_count, fill = start_station_name),
     position = "dodge") +
     labs(title = "Top Ten Start Stations by Casual") +
     theme(axis.text.x = element_blank(), axis.title.x = element_blank())

In [None]:
JPEG Top ten start stations for Casual

**Most Popular:** Streeter Dr & Grand Ave is the most frequented start station for casual riders with 44,500 starts, about 15,000 more than any other station.<br>
**High Activity:** DuSable Lake Shore Dr & Monroe St (29,492) and Michigan Ave & Oak St (21,989) also show high casual usage.<br>
**Varied Usage:** The remaining stations have progressively smaller counts, from 19,729 down to 11,547, indicating varied preferences among casual riders.<br>
**Least Popular:** Adler Planetarium has the lowest total count (11,547) among the top ten stations for casual starts.<br>
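
With both top-ten tables printed, the overlap between member and casual start stations can be checked directly. The sketch below copies the station names from the two tables above into plain character vectors (the vector names are illustrative) and intersects them using base R only.

```r
# Station names copied from the member top-ten table above
member_top10 <- c(
  "Kingsbury St & Kinzie St", "Clinton St & Washington Blvd",
  "Clark St & Elm St", "Wells St & Concord Ln", "Wells St & Elm St",
  "Clinton St & Madison St", "University Ave & 57th St",
  "Broadway & Barry Ave", "Loomis St & Lexington St", "Ellis Ave & 60th St"
)

# Station names copied from the casual top-ten table above
casual_top10 <- c(
  "Streeter Dr & Grand Ave", "DuSable Lake Shore Dr & Monroe St",
  "Michigan Ave & Oak St", "DuSable Lake Shore Dr & North Blvd",
  "Millennium Park", "Shedd Aquarium", "Theater on the Lake",
  "Dusable Harbor", "Wells St & Concord Ln", "Adler Planetarium"
)

# Stations that appear in both rankings
shared_stations <- intersect(member_top10, casual_top10)
shared_stations
# [1] "Wells St & Concord Ln"
```

Only one station appears in both lists, which suggests that members and casual riders largely start their rides in different places.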

#### The Most Popular Start Time by Member Type

In [None]:
# Load the scales package for label_number()
library(scales)

# group = member_casual is needed so geom_line() connects the hours for each
# rider type; with a discrete x axis each hour would otherwise be its own group.
bike_rides %>%
  group_by(start_hour, day_of_week, member_casual) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(start_hour), y = number_of_rides,
             color = member_casual, group = member_casual)) +
  geom_line(size = 1) +
  geom_point(size = 5, alpha = 0.3) +
  facet_wrap(~day_of_week) +
  labs(title = "Number of Rides by Start Hour and Member Type",
       x = "Start Hour", y = "Number of Rides") +
  scale_x_discrete(limits = as.character(0:23)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 6))

In [None]:
JPEG Number of Rides by Start Hour and Member Type

**Dual Peaks:** Both ‘casual’ and ‘member’ riders exhibit two prominent peaks in ride numbers daily, likely corresponding to rush hours. <br>
**Evening Preference:** The evening peak generally sees more rides than the morning peak.<br>
**Weekday Consistency:** ‘Member’ riders show more consistent ride numbers throughout weekdays.<br>
**Weekend Casual Rise:** On weekends, ‘casual’ riders increase rides during midday, especially on Saturday.<br>
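
The peak observations above are read off the plot; they can also be checked numerically by finding the busiest start hour for each rider type and day. The sketch below assumes the same `start_hour`, `day_of_week`, and `member_casual` columns used in the plot and runs on a small made-up data frame, so its numbers say nothing about the real data. The `dplyr::` prefixes guard against `plyr` masking dplyr verbs.

```r
library(dplyr)

# Made-up toy rides, only to illustrate the computation; not Cyclistic data.
toy_rides <- data.frame(
  member_casual = c(rep("member", 6), rep("casual", 4)),
  day_of_week   = "Monday",
  start_hour    = c(8, 8, 17, 17, 17, 12, 12, 15, 15, 15)
)

# Rides per rider type, day, and hour
hourly_counts <- toy_rides %>%
  dplyr::group_by(member_casual, day_of_week, start_hour) %>%
  dplyr::summarise(number_of_rides = dplyr::n(), .groups = "drop")

# Busiest hour for each rider type and day (first row kept if hours tie)
peak_hours <- hourly_counts %>%
  dplyr::group_by(member_casual, day_of_week) %>%
  dplyr::slice_max(number_of_rides, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()

peak_hours
```

Replacing `toy_rides` with `bike_rides` gives one row per rider type and weekday. Raising `n` in `slice_max()` to 2 would list the two busiest hours, which is one way to test the dual-peak reading.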

# Common Key Findings for Bike-Sharing Programs:

## Bike-Sharing Usage Patterns


- The highest number of rides for both casual riders and members occurs on Saturday.
- Members take more rides than casual riders on every day of the week.
- The smallest difference in the number of rides between members and casual riders is observed on weekends.
- There is a noticeable increase in rides from Monday to Saturday, with a slight dip on Sunday.
- Tuesday has the fewest rides for casual riders, while Monday has the fewest for members.

## Seasonal Trends in Bike-Sharing

- Ride frequency increases during the warmer months, and both casual and member rides peak in August.
- Members consistently take more rides throughout the year.
- There is a notable decline in rides from October to November for both user types.

## Bike Type Preferences

- Classic bikes have the highest number of rides across all days.
- Electric bikes have average durations between those of classic and docked bikes, indicating a balance between convenience and travel time.
- Docked bikes show the highest average duration, which may imply they are preferred for longer rides or leisure activities.

## Start Station Popularity


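- Among members, "Kingsbury St & Kinzie St" is the most popular start station (25,515 starts), closely followed by "Clinton St & Washington Blvd" (25,297).
- Member usage is fairly evenly spread across the member top ten, ranging from 17,637 to 25,515 starts, with "Ellis Ave & 60th St" the lowest.
- Among casual riders, "Streeter Dr & Grand Ave" leads by a wide margin (44,500 starts), ahead of "DuSable Lake Shore Dr & Monroe St" (29,492).
- "Adler Planetarium" has the lowest total count (11,547) among the casual top ten.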

## Casual vs. Member Riding Patterns

- Both casual and member riders exhibit two prominent peaks in ride numbers daily, likely corresponding to rush hours.
- The evening peak generally sees more rides than the morning peak.
- Members show more consistent ride numbers throughout weekdays.
- On weekends, casual riders increase rides during midday, especially on Saturdays.


# Recommendations to Increase Bike-Share Membership:

## Bike-Sharing Program Recommendations

### Targeted Marketing
- **Focus on Casual Riders**: Tailor marketing efforts specifically to casual riders.
- **Weekend and Summer Focus**: Highlight membership benefits during weekends and summer months when casual usage peaks.
- **Cost Savings and Convenience**: Emphasize the cost savings and convenience of bike sharing for casual riders.

### Membership Incentives
- **Promotions for Conversion**: Offer promotions or discounts to casual riders who convert to members.
- **Ride Threshold**: Consider offering incentives after a certain number of rides to encourage membership.
- **Seasonal Offers**: Introduce special offers during specific seasons, such as summer.

### Station Optimization
- **High-Traffic Stations**: Improve bike availability and visibility at top casual rider stations.
- **Strategic Placement**: Focus on stations like Streeter Dr & Grand Ave and DuSable Lake Shore Dr & Monroe St.
- **User-Friendly Stations**: Ensure stations are user-friendly and well-maintained.

### Ride Experience Enhancement
- **Guided Tours**: Offer guided bike tours starting from popular stations.
- **Suggested Routes**: Provide suggested routes for casual riders to explore the city.
- **Safety and Comfort**: Enhance the overall riding experience to encourage repeat usage.

### Community Engagement
- Collaborate with local organizations, schools, and businesses to raise awareness.
- Host community events, workshops, or bike safety classes to engage residents.

### Integration with Public Transit
- Integrate bike sharing with existing public transportation systems (e.g., buses, trains).
- Offer joint memberships or seamless transfers between bike share and transit.
