# Introduction


## About the Company
Cyclistic, a fictional company, introduced a successful bike-share program in 2016. Since its inception, it has expanded to include about 5,824 geotracked bicycles, which can be locked into any of the 692 stations spread throughout Chicago. Riders can unlock a bike from one station and return it to any other station.

## Business Task 
Discover how members and casual riders use Cyclistic bikes differently and design marketing strategies aimed at converting casual riders into annual members.

## Data Sources 
The data used in this analysis were sourced from Kaggle's Cyclistic dataset, covering the period from June 2023 to May 2024. The dataset comprised approximately 5,743,278 rows and 13 columns. The columns include:

* Ride_id: Unique identifier for each ride.
* Rideable_type: Type of bike used (electric, classic, or docked).
* Started_at: Date and time when the ride began.
* Ended_at: Date and time when the ride ended.
* Start_station_name: Name of the station where the ride began.
* Start_station_id: ID of the station where the ride began.
* End_station_name: Name of the station where the ride ended.
* End_station_id: ID of the station where the ride ended.
* Start_lat: Starting latitude.
* Start_lng: Starting longitude.
* End_lat: Ending latitude.
* End_lng: Ending longitude.
* Member_casual: Rider type (member or casual).

# Loading Dataset

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import os

#________________________Data Extraction________________________________
# Create an empty list to store dataframes
df_list = []

# Walk through the input directory to find all CSV files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # Skip anything that is not a CSV file
        if not filename.lower().endswith('.csv'):
            continue
        # Create the full path to the file
        file_path = os.path.join(dirname, filename)
        # Read the file into a dataframe
        df = pd.read_csv(file_path)
        # Append the dataframe to the list
        df_list.append(df)

# Concatenate all dataframes in the list
df = pd.concat(df_list, ignore_index=True)
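
As a minimal sketch of the same load step, assuming only `.csv` files should be read, the snippet below uses `pathlib` on a throwaway directory. The `load_csvs` helper and the demo files are hypothetical and exist only for illustration; on Kaggle the root would be `/kaggle/input`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs(root):
    """Read every .csv file under root (recursively) into one DataFrame."""
    paths = sorted(Path(root).rglob('*.csv'))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on a temporary directory with two small CSVs and one non-CSV file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.csv').write_text('ride_id,member_casual\nA1,member\n')
    Path(tmp, 'b.csv').write_text('ride_id,member_casual\nB1,casual\nB2,member\n')
    Path(tmp, 'notes.txt').write_text('not a CSV')
    demo_df = load_csvs(tmp)

print(demo_df.shape)
```

Sorting the paths keeps the row order reproducible between runs, which helps when comparing row counts across sessions.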

# Exploring Data

In [None]:
# Display a sample dataframe
df.sample(5)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
categorical = df.dtypes[df.dtypes == "object"].index
df[categorical].describe()

Checking for nulls in all columns. It appears we have 905,237 nulls in "start_station_name" and "start_station_id", 956,579 nulls in "end_station_name" and "end_station_id", and 7,684 nulls in "end_lat" and "end_lng".

In [None]:
#Checking for nulls
print(df.isnull().sum())
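
Raw null counts are easier to judge next to the share of rows they represent. The sketch below shows one way to build such a summary on a small made-up frame; the `demo` data and the `null_summary` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the ride table
demo = pd.DataFrame({
    'ride_id': ['A1', 'A2', 'A3', 'A4'],
    'start_station_name': ['Clark St & Elm St', None, None, 'Millennium Park'],
    'end_lat': [41.9, 41.8, np.nan, 41.7],
})

# Null count and null percentage per column
null_summary = pd.DataFrame({
    'null_count': demo.isnull().sum(),
    'null_pct': (demo.isnull().mean() * 100).round(1),
})
print(null_summary)
```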

Checking for duplicates in the identifier column.

In [None]:
#Checking for duplicate values in primary key
dup_list = df[df['ride_id'].duplicated(keep=False)]

if not dup_list.empty:
    print('duplicates in ride_id \n', dup_list['ride_id'].count())
else:
    print('No duplicates in Ride_id\n\n')

# Data Cleaning

## Removing Inconsistencies

There are 1,806 rows where the "started_at" value is equal to or later than the "ended_at" value. This compromises data integrity, as it is impossible for a ride to start after it ends. Additionally, when "started_at" equals "ended_at", the ride duration is 0 hours, 0 minutes, and 0 seconds, which skews the dataset. To address these issues, we will remove the affected rows.

In [None]:
# Converting started_at and ended_at columns to datetime
df['started_at'] = pd.to_datetime(df['started_at'])
df['ended_at'] = pd.to_datetime(df['ended_at'])

neg_df = df[df['started_at'] >= df['ended_at']].copy()

#inspecting rows with a zero or negative ride duration
neg_df[['ride_id', 'started_at', 'ended_at']]

In [None]:
#dropping rows with inconsistent data in 'started_at' and 'ended_at' column
clean_df = df[~df['ride_id'].isin(neg_df['ride_id'])].copy()

# Reset the index
clean_df.reset_index(drop=True, inplace=True)

clean_df 
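
The same filter can also be written as a single boolean mask, which avoids the lookup on "ride_id". This is a minimal sketch on made-up timestamps; the `demo` frame and the `valid` mask are illustrative names only.

```python
import pandas as pd

# Toy rides: one valid, one with zero duration, one that ends before it starts
demo = pd.DataFrame({
    'ride_id': ['A1', 'A2', 'A3'],
    'started_at': pd.to_datetime(['2023-06-01 10:00:00', '2023-06-01 11:00:00', '2023-06-01 12:00:00']),
    'ended_at': pd.to_datetime(['2023-06-01 10:15:00', '2023-06-01 11:00:00', '2023-06-01 11:50:00']),
})

# Keep only rides that start strictly before they end
valid = demo['started_at'] < demo['ended_at']
demo_clean = demo[valid].reset_index(drop=True)
print(demo_clean)
```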

## Handling Missing Values

Now that we have addressed the inconsistent data, we can focus on the 905,237 missing values in the "start_station_name" column identified during data exploration. To fill these values and uncover trends in rider behavior, we will use a machine learning model to predict the missing station names from the available features. I chose the K-Nearest Neighbors model because it classifies each ride by the station names of the rides closest to it in location ("start_lat" and "start_lng"), and rides that start at nearby coordinates are likely to share the same start station name.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [None]:
# Separate rows with and without missing start_station_name
start_station_known = clean_df[clean_df['start_station_name'].notna()].copy()
start_station_missing = clean_df[clean_df['start_station_name'].isna()].copy()

# Create a LabelEncoder instance and encode the known 'start_station_name'
le = LabelEncoder()
start_station_known["start_station_name"] = le.fit_transform(start_station_known["start_station_name"])

# making the known 'start_lat' and 'start_lng' columns our features and known 'start_station_name' our target
X = start_station_known[['start_lat', 'start_lng']]
y = start_station_known['start_station_name']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate the model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

We want an accuracy of 0.8 or above to use the model to fill in the missing values in the "start_station_name".

In [None]:
if accuracy >= 0.8: 
    X_missing = start_station_missing[['start_lat', 'start_lng']]
    predicted_start_station_names = knn.predict(X_missing)
    start_station_missing['start_station_name'] = le.inverse_transform(predicted_start_station_names)

    # Reconstruct the clean_df with the predicted values
    clean_df.loc[clean_df['start_station_name'].isna(), 'start_station_name'] = start_station_missing['start_station_name']
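
To make the nearest-neighbor idea concrete, here is a from-scratch sketch with one neighbor, written in NumPy rather than with `KNeighborsClassifier`. It uses Euclidean distance on the raw coordinates, which is also the classifier's default metric. The coordinates and the "Station A" / "Station B" names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy rides: the last row has no station name
demo = pd.DataFrame({
    'start_lat': [41.8923, 41.8924, 41.8810, 41.8811, 41.8922],
    'start_lng': [-87.6120, -87.6121, -87.6167, -87.6168, -87.6119],
    'start_station_name': ['Station A', 'Station A', 'Station B', 'Station B', None],
})

known = demo[demo['start_station_name'].notna()]
missing = demo[demo['start_station_name'].isna()]

known_xy = known[['start_lat', 'start_lng']].to_numpy()
missing_xy = missing[['start_lat', 'start_lng']].to_numpy()

# Squared Euclidean distance from every missing row to every known row
dists = ((missing_xy[:, None, :] - known_xy[None, :, :]) ** 2).sum(axis=2)

# Position of the closest known row for each missing row
nearest = dists.argmin(axis=1)

# Copy the closest known station name into the missing rows
demo.loc[missing.index, 'start_station_name'] = known['start_station_name'].to_numpy()[nearest]
print(demo)
```

The full distance matrix is fine for a toy frame but would not fit in memory for millions of rides, which is one reason to rely on a library implementation for the real data.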

The missing values in "start_station_name" have successfully been filled. Now let’s check if "end_station_name" has any nulls, as we will use this column to discover behavioral trends between member and casual riders as well.

In [None]:
print("end_station_name total nulls: ", clean_df['end_station_name'].isnull().sum())

In [None]:
print('Total of nulls where end_station_name, end_lat, and end_lng are null: ', clean_df[((clean_df['end_lat'].isna()) | (clean_df['end_lng'].isna())) & (clean_df['end_station_name'].isna())].shape[0] )

There are 955,579 rows with missing values in the "end_station_name" column. To fill these gaps, we will use the K-Nearest Neighbors (KNN) model, as we previously did for the "start_station_name" values.

However, before proceeding, it's important to note that 7,568 of these rows also have missing values in the "end_lat" and "end_lng" columns. This poses a challenge, as the KNN model requires non-null "end_lat" and "end_lng" values to predict the missing "end_station_name". Since it’s not feasible to deduce the missing "end_lat" and "end_lng" values, we will leave these fields as null and fill the "end_station_name" with "unknown" for these specific cases.

In [None]:
#filling the null 'end_station_name' with 'unknown' where 'end_lat' or 'end_lng' is also null
clean_df.loc[((clean_df['end_lat'].isna()) | (clean_df['end_lng'].isna())) & (clean_df['end_station_name'].isna()), 'end_station_name'] = 'unknown'

clean_df[clean_df['end_station_name'] == 'unknown']

Next, we will focus on filling the "end_lat" and "end_lng" for the rows that have a non-null "end_station_name". To do this, we will calculate the average "end_lat" and "end_lng" for each end station and use these averages to fill in only the null "end_lat" and "end_lng" values.

In [None]:
#filtering to create a df with only rows that have a known end_station_name and a zero or null end_lat/end_lng
bad_coords = (clean_df['end_lat'] == 0) | (clean_df['end_lng'] == 0) | (clean_df['end_lat'].isnull())
df_inconsistent = clean_df[clean_df['end_station_name'].notna() & (clean_df['end_station_name'] != 'unknown') & bad_coords].copy()

#unique station names whose coordinates need to be filled
station_list = df_inconsistent['end_station_name'].unique()


for i in station_list:
    
    # Calculate the average end_lat and end_lng for the station (mean skips nulls)
    avg_end_lat = clean_df[clean_df['end_station_name'] == i]['end_lat'].mean()
    avg_end_lng = clean_df[clean_df['end_station_name'] == i]['end_lng'].mean()
    
    # Update rows where end_lat or end_lng is zero or null for the specific station
    condition = (clean_df['end_station_name'] == i) & bad_coords

    clean_df.loc[condition, 'end_lat'] = avg_end_lat
    clean_df.loc[condition, 'end_lng'] = avg_end_lng
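
The per-station averaging can also be done without an explicit loop. The sketch below uses `groupby(...).transform('mean')` on a made-up frame. It assumes zeros should be treated as missing before averaging, which differs slightly from the cell above, where zeros are still counted in the mean. Station names and coordinates are invented.

```python
import numpy as np
import pandas as pd

# Toy rides: one null pair for Station A, one zero pair for Station B
demo = pd.DataFrame({
    'end_station_name': ['Station A', 'Station A', 'Station A', 'Station B', 'Station B'],
    'end_lat': [41.90, 41.92, np.nan, 41.80, 0.0],
    'end_lng': [-87.60, -87.62, np.nan, -87.70, 0.0],
})

coords = ['end_lat', 'end_lng']

# Treat zeros as missing so they do not distort the station averages
demo[coords] = demo[coords].replace(0, np.nan)

# Per-station mean, broadcast back to every row of that station
station_means = demo.groupby('end_station_name')[coords].transform('mean')

# Fill only the missing coordinates with their station's mean
demo[coords] = demo[coords].fillna(station_means)
print(demo)
```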

Now that we have filled in the missing values for "end_lat" and "end_lng" we are ready to use the K-Nearest Neighbors model to fill in the 948,011 nulls in "end_station_name".

In [None]:
print("end_station_name total nulls: ", clean_df['end_station_name'].isnull().sum())

In [None]:
# Separate rows with and without missing end_station_name
end_station_known = clean_df[(clean_df['end_station_name'].notna()) & (clean_df['end_station_name'] != 'unknown') & (clean_df['end_lat'].notnull()) & (clean_df['end_lng'].notnull())].copy()
end_station_missing = clean_df[clean_df['end_station_name'].isna()].copy()

# Create a LabelEncoder instance and encode the known 'end_station_name'
le = LabelEncoder()
end_station_known["end_station_name"] = le.fit_transform(end_station_known["end_station_name"])

# Features and target
X = end_station_known[['end_lat', 'end_lng']]
y = end_station_known['end_station_name']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate the model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

We want an accuracy of 0.8 or above to use the model to fill in the missing values in the "end_station_name".

In [None]:
# If the accuracy is satisfactory, predict missing end_station_name values
if accuracy >= 0.8: 
    X_missing = end_station_missing[['end_lat', 'end_lng']]
    predicted_end_station_names = knn.predict(X_missing)
    end_station_missing['end_station_name'] = le.inverse_transform(predicted_end_station_names)

    # Reconstruct the clean_df with the predicted values
    clean_df.loc[clean_df['end_station_name'].isna(), 'end_station_name'] = end_station_missing['end_station_name']

No nulls should remain in "start_station_name" or "end_station_name", which the check below confirms.

In [None]:
print(clean_df[['start_station_name', 'end_station_name']].isnull().sum())

## Removing Irrelevant Columns

The next step in our cleaning process is to remove irrelevant columns.

Since we now have "start_station_name" and "end_station_name" without nulls, we no longer need the following columns in our analysis: "start_station_id", "end_station_id", "start_lat", "start_lng", "end_lat", and "end_lng".

In [None]:
#filtering out columns start_station_id, end_station_id, start_lat, start_lng, end_lat, end_lng
clean_df = clean_df[['ride_id','rideable_type','started_at','ended_at','start_station_name', 'end_station_name', 'member_casual']].copy()
clean_df

# Data Manipulation and Feature Engineering
After converting the "started_at" and "ended_at" columns to the datetime datatype during the data cleaning process, I will engineer a new feature called "ride_length". This feature will provide deeper insights into rider behavior by "member_casual" type. The "ride_length" column represents the duration of each ride, calculated as the difference between the "started_at" and "ended_at" columns.

In [None]:
# Calculating the ride_length from ended_at and started_at column for analysis
clean_df['ride_length'] = pd.to_timedelta(clean_df['ended_at'] - clean_df['started_at'])

I also engineered a "day_of_week" column, which indicates the day of the week on which each ride began.

In [None]:
#calculating day of week and storing it in a column for analysis
clean_df['day_of_week'] = clean_df['started_at'].dt.day_name()

Furthermore, I extracted the month and year from the "started_at" column to create the "month_year" column, which displays the month and year of each ride.

In [None]:
#calculating month and year and storing them in a column for analysis
clean_df['month_year'] = clean_df['started_at'].dt.to_period('M')
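
To show what the three engineered features look like, here is a small sketch on two made-up rides. The `demo` frame and the extra `ride_minutes` column are illustrative assumptions and not part of the notebook's pipeline.

```python
import pandas as pd

# Two toy rides with invented timestamps
demo = pd.DataFrame({
    'started_at': pd.to_datetime(['2023-06-03 09:00:00', '2024-01-15 17:30:00']),
    'ended_at': pd.to_datetime(['2023-06-03 09:28:12', '2024-01-15 17:42:55']),
})

# Duration as a timedelta, plus the same value in minutes for readability
demo['ride_length'] = demo['ended_at'] - demo['started_at']
demo['ride_minutes'] = demo['ride_length'].dt.total_seconds() / 60

# Weekday name and month period of the ride start
demo['day_of_week'] = demo['started_at'].dt.day_name()
demo['month_year'] = demo['started_at'].dt.to_period('M')
print(demo)
```

Subtracting two datetime columns already yields a timedelta column, so no extra conversion is needed in this sketch.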

# Analyze

## DF Analysis

In [None]:
clean_df.describe()

In [None]:
categorical = clean_df.dtypes[clean_df.dtypes == "object"].index
clean_df[categorical].describe()

In [None]:
print('Average ride length for all member types: ', clean_df['ride_length'].mean())

In [None]:
clean_df.groupby('rideable_type')['ride_id'].count().reset_index(name='ride count')

In [None]:
# Grouping by 'start_station_name' and counting 'ride_id'
df_start_station = clean_df.groupby('start_station_name')['ride_id'].count().reset_index(name='ride count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 start stations
df_start_station.sort_values(by='ride count', ascending=False).head(5)

In [None]:
# Grouping by 'end_station_name' and counting 'ride_id'
df_end_station = clean_df.groupby('end_station_name')['ride_id'].count().reset_index(name='count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 end stations
df_end_station.sort_values(by='count', ascending=False).head(5)

## Casual DF Analysis

In [None]:
# creating a dataframe for casual riders
casual_df = clean_df[clean_df['member_casual'] == 'casual'].copy()

In [None]:
casual_df.describe()

In [None]:
casual_df[categorical].describe()

In [None]:
print('Average ride length for Casual riders: ', casual_df['ride_length'].mean())

In [None]:
casual_df.groupby('rideable_type')['ride_id'].count().reset_index(name='casual_rider_count')

In [None]:
casual_df.groupby('month_year')['ride_id'].count().reset_index(name='casual_rider_count')

In [None]:
# Grouping by 'start_station_name' and counting 'ride_id'
casual_df_start_station = casual_df.groupby('start_station_name')['ride_id'].count().reset_index(name='casual_rider_count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 start stations
casual_df_start_station.sort_values(by='casual_rider_count', ascending=False).head(5)


In [None]:
# Grouping by 'end_station_name' and counting 'ride_id'
casual_df_end_station = casual_df.groupby('end_station_name')['ride_id'].count().reset_index(name='casual_rider_count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 end stations
casual_df_end_station.sort_values(by='casual_rider_count', ascending=False).head(5)

## Member DF Analysis

In [None]:
# creating a dataframe for member riders
member_df = clean_df[clean_df['member_casual'] == 'member'].copy()

In [None]:
member_df.describe()

In [None]:
member_df[categorical].describe()

In [None]:
print('Average ride length for Member riders: ', member_df['ride_length'].mean())

In [None]:
member_df.groupby('rideable_type')['ride_id'].count().reset_index(name='Member_rider_count')

In [None]:
member_df.groupby('month_year')['ride_id'].count().reset_index(name='Member_rider_count')

In [None]:
# Grouping by 'start_station_name' and counting 'ride_id'
member_df_start_station = member_df.groupby('start_station_name')['ride_id'].count().reset_index(name='Member_rider_count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 start stations
member_df_start_station.sort_values(by='Member_rider_count', ascending=False).head(5)

In [None]:
# Grouping by 'end_station_name' and counting 'ride_id'
member_df_end_station = member_df.groupby('end_station_name')['ride_id'].count().reset_index(name='Member_rider_count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 end stations
member_df_end_station.sort_values(by='Member_rider_count', ascending=False).head(5)

# Visualization
To observe the differences between casual riders and members in the data, we will visualize the data.

In the graph below (Figure 1), I have visualized the ride count by "member_casual" type. Members account for significantly more rides than casual riders, with a difference of 1,643,974 rides.

In [None]:
def addlabels(x,y):
    for i in range(len(x)):
        plt.text(i,y[i],y[i])
        
rideid_analysis = clean_df.groupby('member_casual')['ride_id'].count().reset_index(name='count').copy()

plt.figure(figsize=(8, 4))
plt.bar(rideid_analysis['member_casual'], rideid_analysis['count'], color=['skyblue', 'lightgreen'])
plt.xlabel('Member Type')
plt.ylabel('Count')
plt.yticks([0, 1000000, 2000000, 3000000, 4000000], ['0','1,000,000', '2,000,000', '3,000,000', '4,000,000'])
addlabels(rideid_analysis['member_casual'], rideid_analysis['count'])
plt.title('Rider Count by Type of Member (June 2023 - May 2024)')
plt.show()
print('Figure 1')

In these pie charts (Figure 2.1 and Figure 2.2), I visualized the distribution of rideable types by "member_casual" type. It can be observed that casual riders use electric bikes slightly more than members, while members prefer classic bikes more than casual riders. A notable difference is that 2.4% of the bikes ridden by casual riders are docked bikes, whereas in the members' chart, 0% of the bikes ridden are docked bikes.

In [None]:
# Creating plot
casual_df_rideabletype = casual_df.groupby('rideable_type')['ride_id'].count().reset_index(name='count').copy()
fig = plt.figure(figsize=(8, 5))
plt.pie(casual_df_rideabletype['count'], labels=casual_df_rideabletype['rideable_type'], autopct='%1.1f%%', colors =['purple', 'orange', 'green'])

# Show plot
plt.title('Rideable Types for Casual riders')
plt.show()
print('Figure 2.1')

member_df_rideabletype = member_df.groupby('rideable_type')['ride_id'].count().reset_index(name='count').copy()
# Creating plot
fig = plt.figure(figsize=(8, 5))
plt.pie(member_df_rideabletype['count'], labels=member_df_rideabletype['rideable_type'], autopct='%1.1f%%', colors =['purple', 'green', 'orange'])

# Show plot
plt.title('Rideable Types for Members')
plt.show()
print('Figure 2.2')

To explore the seasonal behavior of member and casual riders, I generated a line chart (Figure 3). It shows that both casual and member ride counts are highest during spring and summer. However, casual riders' ride counts peak in July, whereas members' ride counts peak in August. Both ride counts plummet in the fall and winter.

In [None]:
#_________________________________Plotting Graphs____________________________________

# To find the ride_id count for each member type, group by 'month_year' and calculate the count for 'ride_id'
casual_counts = casual_df.groupby('month_year')['ride_id'].count().reset_index(name='casual_count').copy()
member_counts = member_df.groupby('month_year')['ride_id'].count().reset_index(name='member_count').copy()


# Extract month names for line chart's x-axis labels
casual_counts['month_name'] = casual_counts['month_year'].dt.strftime('%B %Y')
member_counts['month_name'] = member_counts['month_year'].dt.strftime('%B %Y')



# Plotting Ride Count by Month for Casual and Member Riders
plt.figure(figsize=(12, 6))
plt.plot(casual_counts['month_name'], casual_counts['casual_count'], label='Casual Riders', color='green', marker='o')
plt.plot(member_counts['month_name'], member_counts['member_count'], label='Member Riders', color='blue', marker='o')
plt.xlabel('Months (June 2023 - May 2024)')
plt.ylabel('Ride Count')
plt.title('Ride Count by Month for Casual and Member Riders')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
print('Figure 3')

To highlight the differences between casual riders and members, I created two bar charts. Figure 4.1 shows that casual riders have an average ride time of 28.20 minutes, while members average 12.92 minutes. Figure 4.2 shows that casual riders generally have ride lengths over 20 minutes throughout the week, whereas members average below 13 minutes throughout the week. These charts demonstrate that casual riders typically have longer ride times than members.

In [None]:
#finding the mean of 'ride_length' column by member type
ride_length_df = clean_df.groupby('member_casual')['ride_length'].mean().reset_index(name='mean_ride_length').copy()


#turn mean ride length into minutes 
ride_length_df['mean_ride_length'] = ride_length_df['mean_ride_length'].dt.total_seconds() / 60


# Plot the bar chart Average Ride Length by Type of Member 
plt.figure(figsize=(10, 6))
plt.bar(ride_length_df['member_casual'], ride_length_df['mean_ride_length'], color=['green', 'blue'])
plt.xlabel('Member Type')
plt.ylabel('Average Ride Length (minutes)')
addlabels(ride_length_df['member_casual'], round(ride_length_df['mean_ride_length'], 2))
plt.title('Average Ride Length by Type of Member (June 2023 - May 2024)')
plt.show()
print('Figure 4.1\n\n')

In [None]:
#find the mean of casual and members df
casual_analysis = casual_df.groupby('day_of_week')['ride_length'].mean().reset_index(name='mean_ride_length').copy()
member_analysis = member_df.groupby('day_of_week')['ride_length'].mean().reset_index(name='mean_ride_length').copy()

#find the count of casual and members df
casual_analysis['ride_id_count'] = casual_df.groupby('day_of_week')['ride_id'].count().values.copy()
member_analysis['ride_id_count'] = member_df.groupby('day_of_week')['ride_id'].count().values.copy()

#turn mean ride length into minutes 
casual_analysis['mean_ride_length'] = casual_analysis['mean_ride_length'].dt.total_seconds() / 60
member_analysis['mean_ride_length'] = member_analysis['mean_ride_length'].dt.total_seconds() / 60

#sorting 'day_of_week' to display correctly in the x-axis
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

casual_analysis['day_of_week'] = pd.Categorical(casual_analysis['day_of_week'], categories=day_order, ordered=True)
member_analysis['day_of_week'] = pd.Categorical(member_analysis['day_of_week'], categories=day_order, ordered=True)

casual_analysis = casual_analysis.sort_values('day_of_week')
member_analysis = member_analysis.sort_values('day_of_week')


# plotting Average Ride Length by Membership Type and Day of the Week (June 2023 - May 2024)
plt.figure(figsize=(12, 6))

# Positions for the bars
bar_width = 0.35
index = np.arange(len(member_analysis['day_of_week']))

# Plot bars
plt.bar(index - bar_width/2, casual_analysis['mean_ride_length'], bar_width, color='green', label='Casual Rider')
plt.bar(index + bar_width/2, member_analysis['mean_ride_length'], bar_width, color='blue', label='Member Rider')
plt.xlabel('Day of the Week')
plt.ylabel('Average Ride Length (minutes)')
plt.title('Average Ride Length by Membership Type and Day of the Week (June 2023 - May 2024)')
plt.xticks(index, member_analysis['day_of_week'])
plt.legend()
plt.tight_layout()
plt.show()
print('Figure 4.2')
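
The per-day averages for both rider types can also be computed in one step with a pivot table. This is a minimal sketch on made-up rides; the `demo` frame, the `ride_minutes` column, and the `mean_minutes` name are illustrative assumptions.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Toy rides with invented durations
demo = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'member', 'member', 'member'],
    'day_of_week': ['Saturday', 'Saturday', 'Monday', 'Monday', 'Saturday'],
    'ride_length': pd.to_timedelta(['00:30:00', '00:20:00', '00:10:00', '00:14:00', '00:12:00']),
})
demo['ride_minutes'] = demo['ride_length'].dt.total_seconds() / 60

# Rows are weekdays, columns are rider types, values are mean minutes
mean_minutes = (
    demo.pivot_table(index='day_of_week', columns='member_casual',
                     values='ride_minutes', aggfunc='mean')
        .reindex(day_order)
)
print(mean_minutes)
```

Days with no rides for a rider type show up as NaN after the reindex, which makes gaps in the data visible instead of silently dropping them.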

Further analysis showed distinct patterns in ride activity between casual riders and members. A bar chart (Figure 5) revealed that member ride counts start high on Monday, peak on Thursday, and decrease over the weekend. Conversely, casual rider counts are low during weekdays, increase on Friday, peak on Saturday, and slightly decrease on Sunday. 

In [None]:
# Plotting Rider Count by Type of Member and Day of the Week (June 2023 - May 2024)
plt.figure(figsize=(12, 6))
plt.bar(index - bar_width/2, casual_analysis['ride_id_count'], bar_width, color = 'green', label='Casual Rider')
plt.bar(index + bar_width/2, member_analysis['ride_id_count'], bar_width, color = 'blue', label='Member Rider')
plt.xlabel('Day of The Week')
plt.ylabel('Rider count')
plt.title('Rider Count by Type of Member and Day of the Week (June 2023 - May 2024)')
plt.xticks(index, casual_analysis['day_of_week'])
plt.legend()
plt.show() 
print('Figure 5')
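
The peak day for each rider type can be read off programmatically as well as from the chart. The sketch below counts rides per weekday with `pd.crosstab` on made-up data and then uses `idxmax` to find each group's busiest day; the `demo` frame, `ride_counts`, and `peak_day` are illustrative names.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Toy rides: one row per ride
demo = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'casual', 'member', 'member', 'member'],
    'day_of_week': ['Saturday', 'Saturday', 'Sunday', 'Thursday', 'Thursday', 'Monday'],
})

# Ride counts per weekday and rider type, in calendar order, with 0 for empty days
ride_counts = (
    pd.crosstab(demo['day_of_week'], demo['member_casual'])
      .reindex(day_order, fill_value=0)
)

# Busiest weekday for each rider type
peak_day = ride_counts.idxmax()
print(ride_counts)
print(peak_day)
```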

I created two bar charts (Figure 6.1 and Figure 6.2), each highlighting the top 5 start stations for casual riders and members, respectively.

For casual riders, the chart reveals that Streeter Dr & Grand Ave is the most frequented starting point, followed by Dusable Lake Shore Dr & Monroe St, Millennium Park, Theater on the Lake, and Michigan Ave & Oak St.

In contrast, the chart for member riders shows that Clark St & Elm St is the preferred starting station, followed by Clinton St & Washington Blvd, Wabash Ave & Grand Ave, Canal St & Adams St, and Dearborn St & Monroe St, respectively.

In [None]:
# Grouping by 'start_station_name' and counting 'ride_id'
casual_df_start_station = casual_df.groupby('start_station_name')['ride_id'].count().reset_index(name='count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 start stations for casual riders
casual_df_start_station_top5 = casual_df_start_station.sort_values(by='count', ascending=False).head(5)

#create bar chart
plt.figure(figsize=(8, 4))
plt.bar(casual_df_start_station_top5['start_station_name'], casual_df_start_station_top5['count'], color = 'green')
plt.xlabel('Top Five Start Stations')
plt.ylabel('Count')
plt.yticks([0, 10000, 20000, 30000, 40000, 50000, 60000], ['0','10,000', '20,000', '30,000', '40,000', '50,000', '60,000'])
plt.title('Top 5 Start Stations for Casual Riders')
plt.xticks(rotation=45)
plt.show()
print('Figure 6.1')

# Grouping by 'start_station_name' and counting 'ride_id'
member_df_start_station = member_df.groupby('start_station_name')['ride_id'].count().reset_index(name='count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 start stations for members
member_df_start_station_top5 = member_df_start_station.sort_values(by='count', ascending=False).head(5)

#create bar chart
plt.figure(figsize=(8, 4))
plt.bar(member_df_start_station_top5['start_station_name'], member_df_start_station_top5['count'], color = 'blue')
plt.xlabel('Top Five Start Stations')
plt.ylabel('Count')
plt.yticks([0, 10000, 20000, 30000, 40000], ['0','10,000', '20,000', '30,000', '40,000'])
plt.title('Top 5 Start Stations for Member Riders')
plt.xticks(rotation=45)
plt.show()
print('Figure 6.2')

I created two bar charts (Figure 7.1 and Figure 7.2), each showcasing the top 5 end stations by "member_casual" type.

In Figure 7.1, which focuses on casual riders, Streeter Dr & Grand Ave emerges as the most popular end station, followed by Dusable Lake Shore Dr & Monroe St, Dusable Lake Shore Dr & North Blvd, Michigan Ave & Oak St, and Theater on the Lake.

Figure 7.2, which highlights member riders, shows that Kingsbury St & Kinzie St is the most frequented end station, followed by Wilton Ave & Belmont Ave, Canal St & Adams St, LaSalle St & Illinois St, and Dearborn Pkwy & Delaware Pl.

In [None]:
# Grouping by 'end_station_name' and counting 'ride_id'
casual_df_end_station = casual_df.groupby('end_station_name')['ride_id'].count().reset_index(name='count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 end stations for casual riders
casual_df_end_station_top5 = casual_df_end_station.sort_values(by='count', ascending=False).head(5)

#create bar chart
plt.figure(figsize=(8, 4))
plt.bar(casual_df_end_station_top5['end_station_name'], casual_df_end_station_top5['count'], color = 'green')
plt.xlabel('Top Five End Stations')
plt.ylabel('Count')
plt.yticks([0, 10000, 20000, 30000, 40000, 50000, 60000], ['0','10,000', '20,000', '30,000', '40,000', '50,000', '60,000'])
plt.title('Top 5 End Stations for Casual Riders')
plt.xticks(rotation=45)
plt.show()
print('Figure 7.1')
      
# Grouping by 'end_station_name' and counting 'ride_id'
member_df_end_station = member_df.groupby('end_station_name')['ride_id'].count().reset_index(name='count').copy()

# Sorting the DataFrame by count in descending order and selecting the top 5 end stations for members
member_df_end_station_top5 = member_df_end_station.sort_values(by='count', ascending=False).head(5)

#create bar chart
plt.figure(figsize=(8, 4))
plt.bar(member_df_end_station_top5['end_station_name'], member_df_end_station_top5['count'], color = 'blue')
plt.xlabel('Top Five End Stations')
plt.ylabel('Count')
plt.yticks([0, 10000, 20000, 30000, 40000], ['0','10,000', '20,000', '30,000', '40,000'])
plt.title('Top 5 End Stations for Member Riders')
plt.xticks(rotation=45)
plt.show()
print('Figure 7.2')

# Findings

The analysis reveals several key differences between casual riders and members:

Rider Distribution:

* The data shows a significant difference between the number of rides taken by casual riders and by members, with member rides outnumbering casual rides by 1,643,974 (Figure 1).

Seasonal Riding Patterns:

* Both casual riders and members exhibit increased activity during spring and summer. However, casual riders peak in July, while members peak in August. Both groups see a decline in ride counts during the fall and winter months (Figure 3).

Ride Duration:

* Casual riders tend to have longer ride durations compared to members. The average ride time for casual riders is 28.20 minutes, whereas members average 12.92 minutes. Throughout the week, casual riders generally have rides exceeding 20 minutes, while members average below 13 minutes (Figures 4.1 and 4.2).

Weekly Ride Activity:

* There are distinct patterns in ride activity between the two groups. Members’ ride counts are highest on weekdays, peaking on Thursday and tapering off over the weekend. Conversely, casual riders are more active on weekends, with their counts peaking on Saturday (Figure 5).

Popular Start and End Stations:

* For casual riders, the most frequented start station is Streeter Dr & Grand Ave, followed by Dusable Lake Shore Dr & Monroe St, Millennium Park, Theater on the Lake, and Michigan Ave & Oak St (Figure 6.1). The most popular end station is also Streeter Dr & Grand Ave, followed by Dusable Lake Shore Dr & Monroe St, Dusable Lake Shore Dr & North Blvd, Michigan Ave & Oak St, and Theater on the Lake (Figure 7.1).

* Members, on the other hand, most often start at Clark St & Elm St, followed by Clinton St & Washington Blvd, Wabash Ave & Grand Ave, Canal St & Adams St, and Dearborn St & Monroe St (Figure 6.2). For end stations, Kingsbury St & Kinzie St is the most popular, followed by Wilton Ave & Belmont Ave, Canal St & Adams St, LaSalle St & Illinois St, and Dearborn Pkwy & Delaware Pl (Figure 7.2).

# Recommendations 
To address the question of how to convert casual riders into annual members, I have three recommendations. First, I suggest running membership ads during the months when casual riders are most active, specifically June, July, and August, so that the ads reach the largest number of casual riders and encourage them to sign up. Second, placing membership advertisements at the top 3 most popular start stations (Streeter Dr & Grand Ave, Dusable Lake Shore Dr & Monroe St, Millennium Park) and top 3 most popular end stations (Streeter Dr & Grand Ave, Dusable Lake Shore Dr & Monroe St, Dusable Lake Shore Dr & North Blvd) for casual riders could effectively encourage more of them to become members. Lastly, casual riders ride for about 28 minutes per trip on average, so highlighting the cost savings of a membership compared to day passes for longer rides could further incentivize them to sign up for memberships.