# **Project Name**    - Ford GoBike Trips EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### Team Member 1 - Meghashyam Parab


# **Project Summary -**

🚴‍♂️ Ford GoBike Trips Analysis

Welcome to an exciting journey into the world of urban mobility! 🌎✨
This project dives deep into the Ford GoBike (now Bay Wheels) trip data, uncovering how people move across San Francisco and neighboring cities.

📊 What's Inside

🚲 Trip Trends: Peak hours, popular routes, rider demographics.

📅 Time Travel: Patterns across weekdays, weekends, and seasons.

🌍 Geospatial Magic: Mapping ride start and end points.

🔎 User Behavior Insights: Subscribers vs casual riders — who rides when?

📈 Predictive Modeling (optional): Can we predict the next trip?

🛠️ Tech Stack


*   Python (Pandas, Matplotlib, Seaborn, Plotly)

*   Tableau/Power BI for dashboards (optional)









# **GitHub Link -**

https://github.com/meghashyam123/Ford-GoBikes-Trips-Analysis

# **Problem Statement**


Analyze and derive insights from the January 2018 Ford GoBike System trip data to understand user behaviors, trip patterns, and operational factors that can help improve the bike-sharing system's efficiency and user experience.

#### **Define Your Business Objective?**

To optimize the operations, improve customer satisfaction, and boost the overall growth of the Ford GoBike bike-sharing service by leveraging insights from user trip data.

Specific Business Goals:



1.   Increase Ridership: Understand peak usage times and user preferences to launch targeted marketing campaigns and loyalty programs, especially to convert casual customers into long-term subscribers.

2.   Optimize Fleet Management: Predict high-demand stations and times to ensure bikes are adequately distributed across stations, minimizing the chance of empty or full docks.

3.   Improve Customer Experience: Identify pain points such as frequent long wait times, crowded or empty stations, and act proactively (e.g., station expansions, real-time availability updates).




# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset

df = pd.read_csv('/content/201801-fordgobike-tripdata.csv')
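
Since the guidelines ask for exception handling, the load step could be wrapped in a small helper. The following is a minimal sketch, not the notebook's actual loader: `load_trips` is a hypothetical name, and the two-row in-memory CSV is made-up sample data standing in for the real file.

```python
import io

import pandas as pd


def load_trips(source):
    """Read trip data and parse the timestamp columns while loading.

    `source` may be a file path or an open buffer. Hypothetical helper,
    not part of the original notebook.
    """
    try:
        return pd.read_csv(source, parse_dates=["start_time", "end_time"])
    except FileNotFoundError as exc:
        # Re-raise with a message that tells the reader what to fix
        raise FileNotFoundError(
            f"Trip data not found at {source!r}; upload the CSV or fix the path."
        ) from exc


# Made-up rows standing in for the real January 2018 file
sample_csv = io.StringIO(
    "duration_sec,start_time,end_time,user_type\n"
    "600,2018-01-31 22:52:35,2018-01-31 23:02:35,Subscriber\n"
    "1500,2018-01-31 16:13:34,2018-01-31 16:38:34,Customer\n"
)
sample = load_trips(sample_csv)
print(sample.dtypes)
```

Parsing the timestamps at load time would make the later `pd.to_datetime` conversion unnecessary, though keeping both is harmless.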


### Dataset First View

In [None]:
# Dataset First Look

df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

df.shape

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = df.duplicated().sum()
print(f"Number of duplicate values: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_values = df.isnull().sum()
print(missing_values)

In [None]:
# Visualizing the missing values

!pip install missingno==0.5.2

import missingno as msno
import matplotlib.pyplot as plt


msno.matrix(df)
plt.show()
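
The matrix plot shows where values are missing but not how many. As a numeric complement, here is a minimal sketch of a per-column summary; `missing_summary` is a hypothetical helper and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame(
        {
            "missing_count": frame.isnull().sum(),
            "missing_pct": (frame.isnull().mean() * 100).round(2),
        }
    )
    return summary.sort_values("missing_count", ascending=False)


# Invented rows; the real call would be missing_summary(df)
toy = pd.DataFrame(
    {
        "member_birth_year": [1985.0, np.nan, 1990.0, np.nan],
        "member_gender": ["Male", None, "Female", "Other"],
        "duration_sec": [600, 720, 300, 1500],
    }
)
report = missing_summary(toy)
print(report)
```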

### What did you know about your dataset?

🔹 Columns:

duration_sec: Trip duration in seconds

start_time, end_time: Start and end timestamps

start_station_id, end_station_id: IDs of start and end stations

start_station_name, end_station_name: Names of start and end stations

start_station_latitude, start_station_longitude: Start station coordinates

end_station_latitude, end_station_longitude: End station coordinates

bike_id: Unique ID of the bike

user_type: Whether the user is a "Subscriber" or "Customer"

member_birth_year: Birth year of the rider

member_gender: Gender of the rider ("Male", "Female", "Other")

bike_share_for_all_trip: Whether the trip was taken by a rider enrolled in the Bike Share for All program, the discounted membership for low-income residents ("Yes"/"No")



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns


In [None]:
# Dataset Describe

df.describe()

### Variables Description

duration_sec: Trip duration in seconds

start_time, end_time: Start and end timestamps

start_station_id, end_station_id: IDs of start and end stations

start_station_name, end_station_name: Names of start and end stations

start_station_latitude, start_station_longitude: Start station coordinates

end_station_latitude, end_station_longitude: End station coordinates

bike_id: Unique ID of the bike

user_type: Whether the user is a "Subscriber" or "Customer"

member_birth_year: Birth year of the rider

member_gender: Gender of the rider ("Male", "Female", "Other")

bike_share_for_all_trip: Whether the trip was taken by a rider enrolled in the Bike Share for All program, the discounted membership for low-income residents ("Yes"/"No")

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column: {column}")
    print(f"Unique Values: {unique_values}")
    print("-" * 20)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Data Wrangling Code

# 1. Handling Missing Values:

# For 'start_station_id' and 'end_station_id', replace NaN with -1 (assuming -1 represents unknown station)
# Assign the result back instead of calling fillna(inplace=True) on a column, which newer pandas versions warn about
df['start_station_id'] = df['start_station_id'].fillna(-1)
df['end_station_id'] = df['end_station_id'].fillna(-1)

# For 'member_birth_year', fill NaN with the median (less sensitive to outliers than mean)
df['member_birth_year'] = df['member_birth_year'].fillna(df['member_birth_year'].median())

# 2. Converting Data Types:

# Convert 'start_time' and 'end_time' to datetime objects
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

# Convert 'start_station_id' and 'end_station_id' to integers (if they're not already)
df['start_station_id'] = df['start_station_id'].astype(int)
df['end_station_id'] = df['end_station_id'].astype(int)

# Convert 'member_birth_year' to integers
df['member_birth_year'] = df['member_birth_year'].astype(int)

# 3. Feature Engineering:

# Create a new column 'trip_duration_minutes' from 'duration_sec'
df['trip_duration_minutes'] = df['duration_sec'] / 60

# Create columns for day of the week, hour of the day, and month
df['start_dayofweek'] = df['start_time'].dt.dayofweek
df['start_hourofday'] = df['start_time'].dt.hour
df['start_month'] = df['start_time'].dt.month

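# Create an age column from 'member_birth_year' (approximate)
# Note: 2023 is a hard-coded reference year, so this is the rider's age in 2023,
# roughly 5 years more than their age when the January 2018 trip was taken
df['member_age'] = 2023 - df['member_birth_year']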

### What all manipulations have you done and insights you found?

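1. Data Cleaning & Preparation
Loaded the January 2018 trip CSV into a single DataFrame.
Filled missing start_station_id and end_station_id values with -1 (unknown station) and missing member_birth_year values with the median.
Converted start_time and end_time to datetime, and the station IDs and birth year to integers.

2. Feature Engineering
Trip Duration (minutes): Computed as duration_sec / 60.
Temporal Features: Extracted start_dayofweek, start_hourofday, and start_month from start_time.
Rider Age: Computed member_age as 2023 - member_birth_year (approximate).

3. Not Done in the Cell Above
Dropping duplicates, filtering out trips with zero or negative duration, an is_weekend flag, and Haversine distance estimation are not part of the wrangling code and remain possible extensions.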
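
The wrangling cell does not compute a distance or a weekend flag, so the following is only a sketch of how those two features could be added. `haversine_km` is a hypothetical helper, the Earth radius of 6371 km is an assumed constant, and the two rows are invented coordinates rather than real stations.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as-the-crow-flies") distance in km between coordinate pairs."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Invented rows: one trip spanning one degree of latitude, one round trip
trips = pd.DataFrame(
    {
        "start_time": pd.to_datetime(["2018-01-06 09:00:00", "2018-01-08 17:30:00"]),
        "start_station_latitude": [37.0, 37.77],
        "start_station_longitude": [-122.0, -122.40],
        "end_station_latitude": [38.0, 37.77],
        "end_station_longitude": [-122.0, -122.40],
    }
)

trips["distance_km"] = haversine_km(
    trips["start_station_latitude"],
    trips["start_station_longitude"],
    trips["end_station_latitude"],
    trips["end_station_longitude"],
)
# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
trips["is_weekend"] = trips["start_time"].dt.dayofweek >= 5
print(trips[["distance_km", "is_weekend"]])
```

A round trip that starts and ends at the same station gets a distance of 0, so this measure understates how far such riders actually travelled.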




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Keep trips of 60 minutes or less so the 50 bins and the KDE cover the plotted
# range instead of being stretched across a few very long trips
short_trips = df.loc[df['trip_duration_minutes'] <= 60, 'trip_duration_minutes']

# Create the histogram with a KDE overlay
plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.histplot(short_trips, bins=50, kde=True)
plt.title('Distribution of Trip Duration (trips up to 60 minutes)')
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency')
plt.xlim(0, 60)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE overlay because it gives an immediate, intuitive sense of:


1.   Frequency breakdown (via the bars): You can see how many rides fall into each duration bin (e.g. 0–5 min, 5–10 min, etc.).
That helped reveal that the vast majority of rides are under 30 minutes, with a heavy concentration in the 5–15 minute range.

2.   Overall shape and skew (via the smooth density curve): The KDE line makes the peak ride length clear and shows the long right tail of occasional very long trips.
It avoids the artificial steppiness you sometimes see if you only eyeball the histogram bars.


3.   Outlier detection & threshold setting:
The long right tail shows where to cap or trim extreme values (for example at 60 minutes, the limit used in the plot) for downstream modeling or summary stats.
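
The capping idea in point 3 can be made data-driven by using a quantile rather than a fixed number. This is a minimal sketch under that assumption; `cap_at_quantile` is a hypothetical helper and the durations are invented values.

```python
import pandas as pd


def cap_at_quantile(values, q=0.99):
    """Clip values above the q-th quantile and return (capped values, threshold)."""
    threshold = values.quantile(q)
    return values.clip(upper=threshold), threshold


# Invented trip durations in minutes, with one extreme trip
durations = pd.Series([4.0, 6.5, 8.0, 9.5, 11.0, 12.5, 14.0, 22.0, 35.0, 900.0])
capped, threshold = cap_at_quantile(durations, q=0.9)
print(f"Threshold: {threshold:.1f} min, max after capping: {capped.max():.1f} min")
```

Clipping keeps every row and only limits the extreme values; dropping those rows instead is a separate choice that changes the trip counts.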

##### 2. What is/are the insight(s) found from the chart?

🔍 Insights:

1. Most Trips Are Short:
The vast majority of rides last under 30 minutes, with the heaviest concentration in the 5–15 minute range.

2. Right-Skewed Distribution:
The distribution has a long right tail: a small number of trips run far longer than the typical ride.

3. The Plot Covers Only Trips up to 60 Minutes:
Longer trips exist in the data but fall outside the plotted range, so they need a separate look before any cap is applied.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

💼 Positive Business Impact:

1. Short Trips Suggest High Bike Turnover:
With most rides in the 5–15 minute range, a bike is free again soon after each trip, so keeping bikes and docks available at busy stations is likely to matter more than supporting long rentals.

2. Pricing Aligned with Typical Use:
Plans built around short rides match how the service is actually used.



⚠️ Potential Risks for Negative Growth:

1. Long-Tail Trips:
The few very long trips could be bikes that were kept out for hours or not docked properly; the chart cannot tell which.
➔ If they are common, they reduce availability for other riders.

2. Hidden Data:
Limiting the view to 60 minutes hides those trips.
➔ Their share should be quantified before drawing conclusions from the histogram alone.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

user_type_counts = df['user_type'].value_counts()

# Create the pie chart
plt.figure(figsize=(8, 8))
plt.pie(user_type_counts, labels=user_type_counts.index,
        autopct='%1.1f%%', startangle=90,
        colors=['skyblue', 'lightcoral'],
        explode=[0.05, 0],  # Explode the first slice (Subscriber)
        shadow=True, wedgeprops={'edgecolor': 'black'})

plt.title('User Type Distribution', fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pie chart for the User Type Distribution for these reasons:

1. Proportions Are Clear:
Pie charts are excellent when you want to visually emphasize proportions between categories — here, "Subscriber" vs "Customer".

2. Immediate Visual Impact:
With 87% of users being "Subscribers," the pie chart instantly shows that one group is dominant, without needing complex interpretation.

3. Simple Two-Category Comparison:
Since there are only two groups ("Subscriber" and "Customer"), a pie chart makes it clean and easy to understand at a glance.

##### 2. What is/are the insight(s) found from the chart?

The chart titled "User Type Distribution" reveals the following insights:


1. Subscribers Dominate: Subscribers make up the vast majority (87%) of the user base, indicating a strong reliance on recurring users or a subscription-based model.

2. Minority Customer Segment: Only 13% of users are categorized as "Customers," suggesting that one-time or non-subscription users are a smaller portion of the audience.
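
The percentages quoted above can be checked numerically instead of being read off the pie labels. A minimal sketch, where `user_type_share` is a hypothetical helper and the eight rows are invented:

```python
import pandas as pd


def user_type_share(frame):
    """Percentage of trips per user type, rounded to one decimal place."""
    return (frame["user_type"].value_counts(normalize=True) * 100).round(1)


# Invented rows; the real call would be user_type_share(df)
toy = pd.DataFrame({"user_type": ["Subscriber"] * 7 + ["Customer"]})
shares = user_type_share(toy)
print(shares)
```

Each row in the dataset is a trip, so these shares describe trips taken by each user type rather than the number of distinct riders.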

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact


1. Revenue Stability: A dominant subscriber base (87%) ensures predictable recurring revenue, which is critical for long-term financial planning and sustainability.

2. Customer Retention Focus: High subscriber numbers suggest strong retention strategies, reducing acquisition costs and fostering loyalty.


Potential Risks Leading to Negative Growth

1. Over-Reliance on Subscribers:Risk: If subscriber churn increases (e.g., due to price sensitivity or competition), the business could face significant revenue loss.

2. Limited Growth from New Customers:Risk: A small customer segment (13%) may indicate difficulty attracting new users, stifling growth.


#### Chart - 3

In [None]:
# Chart - 3 visualization code

gender_counts = df['member_gender'].value_counts()

# Create the horizontal bar chart
plt.figure(figsize=(10, 6))
ax = sns.countplot(y='member_gender', data=df, order=gender_counts.index, palette='viridis')
plt.title('Gender Distribution', fontsize=16)
plt.xlabel('Number of Users', fontsize=12)
plt.ylabel('Gender', fontsize=12)

# Add annotations to the bars
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 50, p.get_y() + p.get_height() / 2, '{:.0f}'.format(width),
             ha='left', va='center', fontsize=12)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart for the Gender Distribution because:

1. Simple and Clear Comparison: A horizontal bar chart makes it very easy to compare the numbers between different gender groups at a glance.

2. Space for Labels: Since category names like "Male", "Female", and "Other" can be longer, horizontal bars give enough space for clean labeling without cramping the view.

3. Highlighting Disparity: The difference between male, female, and other users is very obvious and easy to see here — the much longer bar for "Male" immediately shows male dominance in the user base.



##### 2. What is/are the insight(s) found from the chart?

🔍 Insights:

1. Male Riders Dominate:
There are 65,508 trips by male riders, far more than by female or other-gender riders. (Each row is a trip, so these are trip counts rather than distinct users.)
→ Males account for most recorded trips.

2. Relatively Fewer Female Riders:
There are 20,298 trips by female riders, a little under one-third of the male count.
→ This shows a significant gender imbalance.

3. Very Small Representation for Other Genders:
Only 1,195 trips are by riders who identify as Other, meaning gender diversity is very limited.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

💼 Positive Business Impact:

1. Focused Marketing to Males:
Since males are the majority (about 75% of trips with a recorded gender: 65,508 of 87,001), businesses can target male interests, habits, and behaviors more effectively.

2. Clear Audience Profiling:
Having a dominant segment (males) can simplify decision-making in product design, branding, and advertising strategies, leading to higher engagement and ROI.

⚠️ Potential Negative Growth Risks:

1. Limited Female Representation:
With only 20K female users, huge market potential is untapped.
➔ Businesses that ignore female-centric services/products might miss a massive growth opportunity.

2. Poor Inclusivity Perception:
Having very few users identifying as "Other" can make the brand appear less inclusive or non-welcoming to diverse groups, especially if social responsibility is important to the brand image.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10, 6))
sns.kdeplot(df['member_age'], fill=True, color='skyblue')
sns.rugplot(df['member_age'], color='darkblue', height=0.05)
plt.title('Age Distribution', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.xlim(18, 80)  # Set appropriate limits for age range
plt.show()





##### 1. Why did you pick the specific chart?

I chose a KDE (Kernel Density Estimate) plot combined with a rug plot at the bottom because:

1. Smooth View of Distribution: A KDE plot gives a smooth estimate of the data distribution (in this case, age) rather than showing it in rigid bins like a histogram.

2. Highlights Peaks and Patterns: It helps in clearly identifying where most data points cluster (like the noticeable peak around 40 years).

3. Rug Plot for Details: The small vertical lines (rug plot) at the bottom show the individual data points, giving a sense of data density.

##### 2. What is/are the insight(s) found from the chart?

Here are the key insights from the chart:

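1. Age Cluster Around Late 30s to 40:
The highest density peak is around ages 38–40, meaning most individuals in the dataset are in that age range. Two wrangling choices shape this peak: ages are computed relative to 2023 rather than the 2018 trip year, and missing birth years were filled with the median, which piles all imputed rows onto a single age.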

2. Young Adults (20s) are Fewer:
There are relatively fewer people in their 20s compared to 30s and 40s.

3. Gradual Decline After 40:
After 40 years, the density drops significantly, indicating fewer people in older age groups.
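
Because the age column depends on the reference year and on how missing birth years are handled, it can help to compare against an age computed at trip time with no imputation. This is a sketch of that alternative, not the notebook's method; the column name `age_at_trip` and the three rows are invented.

```python
import numpy as np
import pandas as pd

# Invented rows; the middle rider has no recorded birth year
toy = pd.DataFrame(
    {
        "start_time": pd.to_datetime(["2018-01-05 08:00:00"] * 3),
        "member_birth_year": [1985.0, np.nan, 1990.0],
    }
)

# Age in the year the trip was taken; missing birth years stay missing (<NA>)
# thanks to the nullable Int64 dtype, instead of being filled with the median
toy["age_at_trip"] = (
    toy["start_time"].dt.year - toy["member_birth_year"]
).astype("Int64")
print(toy)
```

Plotting this column would shift the whole distribution about five years younger and remove the spike created by median-filled rows.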

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

💼 Positive Business Impact:

1. Targeted Offerings:
Since most users are around 38–40 years old, businesses can design products, services, or marketing campaigns specifically aimed at the needs of this demographic — for example, financial planning services, health insurance, family-oriented products, career development programs, etc.

2. Stability and Purchasing Power:
Individuals in their late 30s and early 40s typically have higher purchasing power and more stable lifestyles, which is very beneficial for businesses offering premium services, long-term contracts, or loyalty programs.


⚠️ Potential Negative Growth Risks:

1. Aging Customer Base:
If the majority of customers are in their late 30s or older and there’s a lack of younger (20s) customers, then over time the customer base could shrink as the older population moves out of the target market (due to changing needs, retirement, health issues, etc.).

2. Limited Future Pipeline:
The small number of 20–30-year-olds might signal that the business is not appealing enough to younger generations. Without continuous new customer acquisition among young adults, future revenue streams could dry up.

#### Chart - 5

In [None]:
# Chart - 5 visualization

# Get ride counts for each day of the week
day_counts = df['start_dayofweek'].value_counts().sort_index()

# Define day names
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Create the circular bar plot
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

# Angle for each bar
angles = np.linspace(0, 2 * np.pi, len(day_names), endpoint=False)

# Bar width
bar_width = 2 * np.pi / len(day_names)

# Create bars
ax.bar(angles, day_counts, width=bar_width, color='skyblue', edgecolor='black')

# Set labels and title
ax.set_xticks(angles)
ax.set_xticklabels(day_names)
ax.set_title('Rides by Day of Week', fontsize=16)

plt.show()

##### 1. Why did you pick the specific chart?

✅ Perfect for cyclical data like days of the week.

✅ Highlights patterns and contrasts very clearly.

✅ More engaging and intuitive than a normal bar chart for this case.



##### 2. What is/are the insight(s) found from the chart?

1. Weekdays have more rides than weekends.
Tuesday has the highest number of rides overall, followed closely by Wednesday.
Monday, Thursday, and Friday also have relatively high ride counts, but slightly lower than Tuesday and Wednesday.
Saturday and Sunday show a big drop in ride volume compared to weekdays.

2. Peak activity is early in the week.
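The peak happens on Tuesday, suggesting that most users are active early in the workweek. One caveat: January 2018 began on a Monday, so the month contains five Mondays, Tuesdays, and Wednesdays but only four of every other day, which inflates the raw totals for early-week days.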
This could reflect commuting patterns — people using bikes to go to work or for regular weekly routines.
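
Raw totals per weekday depend on how many times each weekday occurs in the month, so an average per calendar date is a fairer comparison. A minimal sketch, assuming a datetime `start_time` column; `mean_rides_per_weekday` is a hypothetical helper and the timestamps are invented.

```python
import pandas as pd

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def mean_rides_per_weekday(start_times):
    """Average number of rides per calendar date, grouped by day of week.

    Dates with no rides at all do not appear in the data, so they are not
    counted as zero-ride days.
    """
    daily = start_times.dt.normalize().value_counts()  # rides per calendar date
    weekday = daily.index.dayofweek.to_numpy()  # 0 = Monday
    by_day = daily.groupby(weekday).mean().reindex(range(7))
    by_day.index = DAY_NAMES
    return by_day


# Invented timestamps: two Mondays, one Tuesday, one Saturday
starts = pd.Series(
    pd.to_datetime(
        ["2018-01-01 08:00"] * 2
        + ["2018-01-08 08:00"] * 4
        + ["2018-01-02 09:00"] * 3
        + ["2018-01-06 11:00"]
    )
)
by_day = mean_rides_per_weekday(starts)
print(by_day)
```

The `reindex(range(7))` call keeps all seven days in the result, which also avoids a length mismatch if the output is passed to the polar bar chart.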



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Will the gained insights help create a positive business impact?
Yes, definitely!
This insight can help businesses optimize resources, pricing, and marketing:

1. Operational efficiency:
You can deploy more bikes and staff on weekdays, especially on Tuesdays and Wednesdays, when demand is highest.

2. Weekend promotions:
Since weekend ridership is low, you can launch special discounts or events on Saturdays and Sundays to attract more casual riders.


⚠️ Are there any insights that could lead to negative growth?
Potentially yes — if not addressed properly.

1. Under-utilization risk on weekends:
Bikes may sit idle during Saturdays and Sundays, which can lead to revenue loss and wastage of operational costs (maintenance, docking fees, etc.).

2. Customer dissatisfaction risk on peak weekdays:
If bike availability is not high enough on busy days like Tuesday or Wednesday, customers may face shortages.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(10, 6))
sns.violinplot(x='user_type', y='trip_duration_minutes', data=df, inner="box", palette="Set3")
plt.title('User Type vs Trip Duration', fontsize=16)
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Trip Duration (minutes)', fontsize=12)
plt.ylim(0, 60)  # Limit y-axis for better visualization
plt.show()

##### 1. Why did you pick the specific chart?

✅ Better for comparing distributions

✅ Good for spotting patterns or irregularities

✅ Helps in understanding user behavior deeply



##### 2. What is/are the insight(s) found from the chart?

1. Customers take longer trips on average than Subscribers.
The center (median) of the Customer distribution is higher — around 16–17 minutes.
Subscribers have a lower center — around 8–9 minutes.

2. Customers have a much wider range of trip durations.
Their trips spread from very short (almost 0 minutes) to over 50 minutes.
Subscribers' trips are more tightly packed, typically between 5 and 20 minutes, with fewer extreme long trips.


🔹 Subscribers = shorter, more predictable trips (maybe daily commuters).

🔹 Customers = longer, more variable trips (maybe tourists or occasional riders).
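
The medians read off the violin plot can be confirmed with a grouped summary. A minimal sketch, with `duration_by_user_type` as a hypothetical helper and invented durations:

```python
import pandas as pd


def duration_by_user_type(frame):
    """Median, mean, and trip count of trip duration for each user type."""
    return frame.groupby("user_type")["trip_duration_minutes"].agg(
        ["median", "mean", "count"]
    )


# Invented rows; the real call would be duration_by_user_type(df)
toy = pd.DataFrame(
    {
        "user_type": ["Subscriber"] * 3 + ["Customer"] * 3,
        "trip_duration_minutes": [8.0, 9.0, 10.0, 15.0, 17.0, 40.0],
    }
)
summary = duration_by_user_type(toy)
print(summary)
```

Reporting the median next to the mean matters here because a few very long trips pull the mean upward, especially for Customers.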

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, absolutely!
The insights can help design targeted strategies for different user types:

1. Subscriber-focused:
Since subscribers take shorter and more consistent trips, you can offer monthly or yearly plans focused on commuters (maybe office goers), encouraging long-term loyalty.

2. Customer-focused:
Since customers take longer, varied trips, you can create flexible pricing for tourists or occasional users — maybe hourly rental packages or group discounts.


⚠️ Are there any insights that could lead to negative growth?
Potentially, yes — if not handled carefully.

1. Subscriber dissatisfaction risk:
Subscribers are taking shorter trips. If the service is not optimized for fast access (for example, if bikes aren’t always available quickly), subscribers may get frustrated and cancel subscriptions.
Reason: Commuters need reliability — even small issues can push them to alternative options (like e-scooters or cabs).

2. Customer retention risk:
Customers (non-subscribers) are taking longer trips. If pricing isn't competitive for long rides, they may feel it's too expensive and switch to competitors.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(8, 5))
# Group by 'start_month' (created during data wrangling) instead of 'month'
monthly_avg = df.groupby('start_month')['trip_duration_minutes'].mean().reset_index()
sns.barplot(x='start_month', y='trip_duration_minutes', data=monthly_avg, palette='Blues')
plt.title('Avg Trip Duration by Month')
plt.ylabel('Average Duration (min)')
plt.xticks(ticks=[0], labels=['Jan'])  # This dataset is only January
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot because it shows the average trip duration for each month clearly and simply. Because this dataset covers only January 2018, the chart has a single bar.





##### 2. What is/are the insight(s) found from the chart?

The main insight from the chart is:

*   The average trip duration in January is about 14.5 minutes.

Since there’s only one month shown (January), we can’t compare across months yet. But it tells us that for January, users typically took trips lasting a little under 15 minutes on average.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✅ Yes, even a simple insight like "average trip duration in January is 14.5 minutes" can help a business positively by:

*   Operational planning: If you know trip durations, you can better predict fleet availability and schedule maintenance (for example, bikes/scooters won't stay out too long).

*   Pricing strategies: If most trips are short, the company could introduce short trip discounts or subscriptions to encourage even more frequent use.


⚠️ No major negative insight is seen yet, but...
Because only one month's data is shown (January), it’s too early to detect problems like:

*   Seasonality drops: If trip durations or usage sharply fall in the next months (Feb, Mar...), that would hint at a seasonal problem (like fewer people riding in bad weather).

*   Lack of engagement: If 14.5 minutes is much lower than expected (say competitors see 25-minute averages), it might show lower user satisfaction or short trip limitations, which could hurt long-term revenue.





#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Get the top 10 start stations
top_10_stations = df['start_station_name'].value_counts().head(10).index

# Filter the DataFrame for the top 10 stations
filtered_df = df[df['start_station_name'].isin(top_10_stations)]

# Create the horizontal bar chart
plt.figure(figsize=(12, 8))  # Adjust figure size as needed
ax = sns.countplot(y='start_station_name', data=filtered_df,
                   order=top_10_stations, palette='viridis')

# Customize the plot
plt.title('Top 10 Start Stations', fontsize=18)
plt.xlabel('Number of Trips', fontsize=14)
plt.ylabel('Start Station Name', fontsize=14)

# Add data labels to the bars
for p in ax.patches:
    width = p.get_width()
    plt.text(width + 50, p.get_y() + p.get_height() / 2,
             '{:.0f}'.format(width), ha='left', va='center',
             fontsize=12, color='black')

# Add a background grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Remove spines for a cleaner look
sns.despine()

plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart for the “Top 10 Start Stations” because:

1. Ranked comparison at a glance
– Sorting bars from longest to shortest immediately shows which stations have the highest usage.

2. Accommodates lengthy station names
– The horizontal layout prevents label overlap and keeps text legible without awkward rotation.

3. Clear numeric annotation
– Showing the trip counts at the end of each bar reinforces the exact values while still emphasizing the visual lengths.

##### 2. What is/are the insight(s) found from the chart?

Here are the key insights from the “Top 10 Start Stations” chart:

1. Ridership is heavily concentrated at a few major hubs

*   San Francisco Caltrain (Townsend & 4th) is the busiest origin (2,194 trips), followed by the Ferry Building (Harry Bridges Plaza) and Berry St at 4th St.

*   The top 3 stations alone account for over 25 % of all rides from these ten, highlighting how much demand is centered on transit and waterfront nodes.

2. Transit connections drive usage

*   All of the top stations are either Caltrain/BART stops or major ferry/transit-adjacent locations, confirming that riders primarily use the bikes as “last-mile” links to public transportation.
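
The concentration claim above can be computed directly from the station counts. A minimal sketch: `top_station_share` is a hypothetical helper, and the station letters are invented placeholders rather than real station names.

```python
import pandas as pd


def top_station_share(station_names, top_n=3, within=10):
    """Share of trips from the `top_n` busiest stations among the `within` busiest."""
    counts = station_names.value_counts()
    return counts.head(top_n).sum() / counts.head(within).sum()


# Invented start stations: A has 5 trips, B has 3, C has 2, D has 1
stations = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"])
share = top_station_share(stations, top_n=2, within=4)
print(f"Top 2 of top 4 stations: {share:.1%} of trips")
```

For the real data the call would use `df['start_station_name']` with the default `top_n=3` and `within=10`. Because the counts are sorted in descending order, the top 3 of 10 always hold at least 30% of those trips, so the interesting question is how far above 30% the share sits.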





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the Top 10 Start Stations chart can lead to significant positive business impact. Here's how:

1. Optimized Bike Distribution:
Knowing that the top stations (especially transit hubs) see the highest usage, the business can focus on redistributing bikes to these areas during peak times (e.g., morning rush hour or after office hours). This ensures higher bike availability at locations with the most demand, which can increase customer satisfaction and ride frequency.

2. Strategic Marketing Campaigns:
Given that these stations are key starting points for rides, the business can design targeted promotions and advertising campaigns at these high-traffic hubs. These could include partnering with local businesses, offering discounts for riders, or creating loyalty programs that encourage people to start their rides from these popular locations.


While the data shows mostly positive trends, there are a few risks that could lead to negative growth:


1. Imbalanced Service Distribution:
If bike distribution heavily favors the top stations, lower-demand stations might experience stockouts (i.e., no bikes available). This could frustrate customers, especially if they are unable to start their ride due to insufficient bikes, leading to a negative brand experience and potential loss of customers.

2. Potential for Market Saturation:
The high concentration of trips in a few locations might indicate that the service is saturating certain high-demand areas. If the service doesn’t expand or diversify into other regions, it might face diminishing returns in already dense areas, slowing down overall growth.



#### Chart - 9

In [None]:
# Chart - 9 visualization


# Get the top 10 end stations
top_10_end_stations = df['end_station_name'].value_counts().head(10)

# Create the lollipop chart
plt.figure(figsize=(12, 8))
plt.hlines(y=top_10_end_stations.index, xmin=0, xmax=top_10_end_stations.values, color='skyblue', linewidth=3)
plt.plot(top_10_end_stations.values, top_10_end_stations.index, 'o', markersize=10, color='darkblue')

# Customize the plot
plt.title('Top 10 End Stations', fontsize=18)
plt.xlabel('Number of Trips', fontsize=14)
plt.ylabel('End Station Name', fontsize=14)

# Add data labels to the lollipops
for i, v in enumerate(top_10_end_stations.values):
    plt.text(v + 20, i, str(v), color='black', fontsize=12, va='center')

# Add a background grid for better readability
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Remove spines for a cleaner look
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal lollipop chart for the “Top 10 End Stations” because:

1.   Clear ranking: Stations are ordered by trip count, so the most popular ones stand out at a glance.

2.  Long category names: The horizontal orientation accommodates verbose station labels without crowding or rotating text.

3.  Easy comparison: Trip counts can be compared by the length of each stem, and the labels next to each marker give the exact numbers.




##### 2. What is/are the insight(s) found from the chart?

Here are the key insights from the "Top 10 End Stations" chart:

1. High Demand at Specific Stations:
The chart clearly shows that a few stations (the top 2 or 3) have significantly higher usage compared to the others, indicating high demand. These stations are likely key hubs for both starting and ending trips.

2. Station Popularity Distribution:
The difference in stem lengths suggests that arrivals are concentrated at a few locations. The remaining stations in the top 10 see noticeably fewer trips, which shows that stations are not used equally and can inform resource distribution.

3. Operational Planning:
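Because this chart counts trip ends, the stations at the top are where bikes accumulate. During peak times (rush hours or weekends) they may need more free docks or more frequent bike pickups, and they also need enough bikes if they rank among the top start stations. Stations further down the list likely need less inventory and attention.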
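
The concentration described in these insights can be expressed as a single number: the share of all trips that end at the top N stations. This is a small sketch under assumptions. The helper `top_n_share` and the toy station names are hypothetical; with the real data it would be called on `df['end_station_name']`.

```python
import pandas as pd


def top_n_share(station_names: pd.Series, n: int = 10) -> float:
    """Return the fraction of trips that belong to the n most frequent stations."""
    counts = station_names.value_counts()  # sorted from most to least frequent
    return counts.head(n).sum() / counts.sum()


# Toy data for illustration only
toy_ends = pd.Series(
    ["Ferry", "Ferry", "Ferry", "Caltrain", "Caltrain", "Market", "Mission", "Park"]
)

share = top_n_share(toy_ends, n=2)
print(f"Top 2 stations receive {share:.1%} of trips")
```

A high share for a small N would support the claim that demand is concentrated at a few hubs; a low share would suggest usage is spread more evenly than the chart implies.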

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the Top 10 End Stations chart can definitely lead to a positive business impact:

1. Resource Optimization:
Knowing which stations are most popular allows the business to optimize bike distribution by ensuring more bikes are available at high-demand stations. This can lead to better user satisfaction, as riders will have a higher chance of finding a bike at their preferred station, especially during peak times. This can increase ride frequency and user retention.


2. Targeted Marketing and Partnerships:
The top stations can be used to attract local businesses for sponsorships or advertisements, capitalizing on the high visibility and foot traffic. By targeting these popular locations, the business can increase revenue through local partnerships.


While the data mostly suggests positive growth, there are a few risks:

1. Limited Geographic Coverage:
If only a few stations are heavily used, the service may be seen as limited to certain neighborhoods, reducing its potential to expand into new, untapped areas. This could slow down growth if not addressed.

2. Missed Revenue Opportunities in Less Popular Stations:
The lower-usage stations could be an area where the company is missing out on potential customers. If these stations don’t get enough visibility, the company might fail to capitalize on potential demand in less populated areas.

#### Chart - 10

In [None]:
# Chart - 10 visualization code


# Create a clustered bar chart
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='user_type', hue='member_gender', data=df, palette='Set2')

# Customize the plot
plt.title('Gender Distribution by User Type', fontsize=16)
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)  # each row of df is one trip, not one unique user

# Add data labels to the bars
for p in ax.patches:
    height = p.get_height()
    # Skip zero-height or NaN patches so that no stray labels are drawn
    if height > 0:
        ax.text(p.get_x() + p.get_width() / 2., height + 3, '{:.0f}'.format(height), ha="center")

# Add a background grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Remove spines for a cleaner look
sns.despine()

plt.legend(title='Gender')

plt.show()

##### 1. Why did you pick the specific chart?

This bar chart was picked because it clearly visualizes the gender distribution across different user types (Subscriber vs Customer). It makes it very easy to compare the number of male, female, and other gender users in each category at a glance.
By showing both user type and gender breakdown together, it helps quickly spot patterns that could guide targeted marketing strategies or user engagement initiatives.

##### 2. What is/are the insight(s) found from the chart?

Here are the key insights from the chart:

*   Male riders account for most trips in both the Subscriber and Customer categories, well ahead of female riders and riders who selected Other.

*   Subscribers take far more trips than Customers across all genders.

*   Female representation is noticeably lower than male representation, but it is more balanced within the Customer group than within the Subscriber group.

*   The number of trips by riders identifying as Other is very small in both the Subscriber and Customer groups.
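
Raw counts make it hard to judge whether the gender mix is "more balanced" in one group, because the groups differ so much in size. One way to compare them is to normalize within each user type. The sketch below uses `pd.crosstab` on toy data; the column names `user_type` and `member_gender` come from the chart code above, and the values are made up.

```python
import pandas as pd

# Toy data for illustration only (not the real distribution)
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 6 + ["Customer"] * 4,
    "member_gender": [
        "Male", "Male", "Male", "Male", "Female", "Other",  # Subscribers
        "Male", "Male", "Female", "Female",                   # Customers
    ],
})

# normalize="index" turns each row (user type) into shares that sum to 1
gender_share = pd.crosstab(toy["user_type"], toy["member_gender"], normalize="index")
print(gender_share.round(2))
```

Note that every row of the trip table is a trip, so these shares describe trips by gender rather than unique riders.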




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact
Yes, the insights gained can definitely lead to positive business impact. Here's how:

1.   Targeted Marketing to Underrepresented Genders: Because male Subscribers dominate the user base, targeted marketing campaigns for female riders and riders identifying as Other could help attract a more diverse and balanced user base. Such campaigns could focus on the safety, convenience, or community-building aspects of the service.

2.   Subscriber Loyalty Programs:
Since Subscribers are the larger, more loyal group, businesses can focus on rewarding their loyalty through special discounts, early access to promotions, or exclusive membership perks. By continuing to engage and cater to this group, you can drive retention and repeat usage.



Potential Negative Growth Insights
While there are opportunities, there are some areas of concern:

1.   Underrepresentation of Female and Non-Binary Users: If the business continues to cater heavily to male Subscribers without improving gender diversity, it may face gender-based disparities in usage, which could affect brand image or limit potential market share. Companies today are focusing more on inclusivity, and a noticeable gender imbalance could alienate potential users from these groups.

2.  Over-reliance on Male Subscribers:
The heavy reliance on Male Subscribers poses a risk because it could create a bottleneck in user base growth if efforts aren't made to tap into underrepresented gender segments. A gender-diverse user base generally leads to broader market opportunities and better brand perception.




#### Chart - 11

In [None]:
# Chart - 11 visualization code

# Create the box plot
plt.figure(figsize=(12, 8))  # Adjust figure size as needed
sns.boxplot(x='start_dayofweek', y='trip_duration_minutes', hue='user_type', data=df, palette="Set3")
plt.title('Trip Duration vs User Type & Day of Week', fontsize=16)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Trip Duration (minutes)', fontsize=12)
plt.xticks(ticks=range(7), labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])  # Set day names
plt.ylim(0, 60)  # Limit y-axis for better visualization
plt.legend(title='User Type')
plt.show()

##### 1. Why did you pick the specific chart?

The reason for picking the box plot of Trip Duration vs User Type & Day of Week is:

✅ It clearly shows the distribution, spread, median, and outliers of trip durations across different user types (Customer vs Subscriber) and days of the week.

✅ A box plot is perfect here because:

*  It summarizes the central tendency (median), spread (IQR), and extreme values (outliers) compactly.

*   It easily compares multiple categories side-by-side (days and user types).

*   It shows skewness in trip durations, if any (for example, long right tails).


##### 2. What is/are the insight(s) found from the chart?

🔵 Customers have longer trip durations than Subscribers across all days of the week.


*   Their median trip time is consistently higher.

*   Their variability (spread) is also much wider, meaning Customers sometimes take very long trips.

🟡 Subscribers tend to have shorter, more consistent trip durations.

*   Their trip times are tightly packed with fewer extreme values.

*   Likely using bikes for commuting (work/school) rather than leisure.
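
The medians and spreads read off the box plot can be confirmed numerically. The following sketch computes the quartiles and the interquartile range (IQR) per user type; the helper `duration_summary` is hypothetical and the toy durations are made up, while the column names match those used in the chart code.

```python
import pandas as pd


def duration_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Quartiles and IQR of trip duration (minutes) for each user type."""
    grouped = trips.groupby("user_type")["trip_duration_minutes"]
    summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary.columns = ["q1", "median", "q3"]
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary


# Toy data for illustration only
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 5,
    "trip_duration_minutes": [5, 8, 10, 12, 15, 10, 20, 30, 40, 50],
})

summary = duration_summary(toy)
print(summary)
```

Adding `start_dayofweek` to the `groupby` keys would give the same summary for each day, mirroring the layout of the box plot.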





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained reveal clear patterns between user types, trip durations, and days of the week, which can drive positive business impact. By targeting Customers with leisure-focused offers on weekends and optimizing services for Subscribers during weekdays, the business can boost customer satisfaction and increase ride frequency. Efficient bike distribution based on demand patterns, customized ride plans, and strategic partnerships with tourism and corporate sectors can further enhance revenue growth and brand loyalty. These data-driven actions will help improve both operational efficiency and profitability.



#### Chart - 12 - Correlation Heatmap

In [None]:
plt.figure(figsize=(12, 8))
# Select only numerical features for correlation analysis
numerical_df = df.select_dtypes(include=np.number)
correlation_matrix = numerical_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a correlation matrix heatmap because it is the most efficient way to:

1.   Quickly spot relationships between numerical features: The color gradient (red to blue) instantly shows strong positive or negative correlations without needing to read every number.

2.   Identify redundant features: For example, start_station_latitude and end_station_latitude are almost perfectly correlated (0.99), suggesting they may not add much new information separately.



##### 2. What is/are the insight(s) found from the chart?

1.   Trip duration in seconds and in minutes is perfectly correlated: duration_sec and trip_duration_minutes have a correlation of 1.00.
➔ Logical, because one is just a scaled version of the other (seconds vs minutes).
➔ You can drop one of them to avoid redundancy in analysis or modeling.



2.   Start and end station locations are highly correlated :
Start station latitude and end station latitude show a very strong positive correlation (0.99).
Same for start and end station longitude (also ~0.99).
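➔ This suggests that trips tend to start and end in the same area. Because the stations are spread across several cities, part of this correlation may simply reflect trips staying within one city rather than round trips or very short rides.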
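
A high correlation between start and end coordinates does not by itself show how far apart the stations are. A more direct check is to compute the straight-line distance of each trip with the haversine formula. This is a sketch under assumptions: the helper `haversine_km` is hypothetical, the longitude column names are assumed to follow the same pattern as the latitude columns, and the coordinates below are toy values.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy data for illustration only: a round trip and a trip spanning one degree of latitude
toy = pd.DataFrame({
    "start_station_latitude":  [37.78, 37.00],
    "start_station_longitude": [-122.40, -122.00],
    "end_station_latitude":    [37.78, 38.00],
    "end_station_longitude":   [-122.40, -122.00],
})

toy["distance_km"] = haversine_km(
    toy["start_station_latitude"], toy["start_station_longitude"],
    toy["end_station_latitude"], toy["end_station_longitude"],
)
print(toy["distance_km"].round(2))
```

On the real data, the distribution of `distance_km` (for example its median and the share of zero-distance trips) would show whether trips are genuinely short or merely confined to one city.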


#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select the numerical variables for the pair plot
numerical_vars = ['trip_duration_minutes', 'member_age', 'start_dayofweek', 'start_hourofday']

# Include 'user_type' in the DataFrame passed to pairplot
sns.pairplot(df[['user_type'] + numerical_vars], hue='user_type', palette="Set3")  # 'user_type' is used for color-coding
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot (also called a scatterplot matrix) because:

*   Purpose: A pair plot is excellent when you want to quickly explore relationships between multiple numerical variables at once. Each scatterplot shows the relationship between a pair of variables, and the diagonal shows the distribution (like a histogram or KDE plot) of each individual variable.



##### 2. What is/are the insight(s) found from the chart?

Here are the key insights from the pair plot:

Trip duration is heavily right-skewed

*   Almost all rides are very short (under 50 min), with a long tail of occasional very long trips.

Subscribers vs. customers differ in trip length

*   Subscribers (teal) cluster at even shorter durations—most under 20 min—whereas customers (yellow) have a slightly heavier tail of longer trips.


Different time-of-day patterns

*   Subscribers show the classic commute profile, with two clear peaks in ride starts (morning and evening rush).

*   Customers ride more midday and afternoon, with a much flatter hourly distribution.
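
The right skew of trip duration noted above can be quantified rather than judged by eye. The sketch below compares the mean with the median and computes the sample skewness with pandas, before and after a log transform. The durations are toy values; with the real data the same calls would be applied to `df['trip_duration_minutes']`.

```python
import numpy as np
import pandas as pd

# Toy durations in minutes: mostly short trips plus one very long one
durations = pd.Series([4, 6, 7, 8, 9, 11, 95], name="trip_duration_minutes")

raw_skew = durations.skew()            # positive value indicates a right tail
log_skew = np.log1p(durations).skew()  # log transform compresses the long tail

print(f"mean={durations.mean():.1f}, median={durations.median():.1f}")
print(f"skew (raw)={raw_skew:.2f}, skew (log1p)={log_skew:.2f}")
```

A mean well above the median together with a positive skewness is consistent with a long right tail. Plotting the log-transformed durations is one option if the tail hides the structure of the short trips.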






## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Predict & Rebalance Bikes

*   Use data to move bikes between stations based on peak demand.

Personalized Offers

*   Target casual customers with discounts to turn them into subscribers.

Maintenance Optimization

*   Track and fix underperforming bikes before they break.

Peak-Time Strategy

*   Provide more bikes during rush hours and adjust pricing to balance demand.

Promote Eco-Friendly Image

*   Market GoBike as a green transport option to attract more users.


Real-Time Updates

*   Show live bike and dock availability on apps.
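
As a starting point for the rebalancing and peak-time suggestions above, the busiest departure hour of each station can be derived from the trip table. This is an illustrative sketch, not a forecasting model: the helper `peak_hour_per_station` is hypothetical, the `start_station_name` column is assumed, and the data is made up.

```python
import pandas as pd


def peak_hour_per_station(trips: pd.DataFrame) -> pd.DataFrame:
    """Return, for each start station, the hour with the most departures."""
    counts = (
        trips.groupby(["start_station_name", "start_hourofday"])
        .size()
        .rename("trips")
        .reset_index()
    )
    # Row label of the busiest hour within each station (first one wins on ties)
    busiest = counts.groupby("start_station_name")["trips"].idxmax()
    return counts.loc[busiest].set_index("start_station_name")


# Toy data for illustration only
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "B", "B"],
    "start_hourofday":    [8, 8, 17, 17, 17, 17, 9],
})

peaks = peak_hour_per_station(toy)
print(peaks)
```

With one month of data, these peaks describe past demand only; a real rebalancing plan would also need dock capacity and arrival counts, which are outside the scope of this sketch.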



# **Conclusion**

By leveraging the trip data smartly, Ford GoBike can optimize bike distribution, enhance customer experience, boost ridership, and improve operational efficiency.
Through predictive analytics, targeted marketing, real-time updates, and continuous feedback, the service can grow sustainably, better meet user needs, and solidify its position as a leader in urban mobility.


