# Ford GoBike
## Mariam Ahmed

## Preliminary Wrangling

> This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area with the following fields: 
 - Trip Duration (in seconds)
 - Start Time and Date
 - End Time and Date
 - Start Station ID
 - Start Station Name
 - Start Station Latitude
 - Start Station Longitude
 - End Station ID
 - End Station Name
 - End Station Latitude
 - End Station Longitude
 - Bike ID
 - User Type ("Subscriber" = Member or "Customer" = Casual)
 - Member Year of Birth
 - Member Gender

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import datetime

%matplotlib inline

In [2]:
# Load the dataset into a dataframe
df = pd.read_csv("201902-fordgobike-tripdata.csv")
# Visually check first 5 records
df.head()

FileNotFoundError: [Errno 2] File 201902-fordgobike-tripdata.csv does not exist: '201902-fordgobike-tripdata.csv'

In [None]:
# View info of the dataframe
df.info()

In [None]:
# Check for duplicates 
df.duplicated().sum()

In [None]:
# View descriptive statistics of the dataframe
df.describe()

### Cleaning Data

In [None]:
# Work on a copy so the original dataframe stays untouched
df_clean = df.copy()

# Convert the start and end times from strings to datetime
df_clean.start_time = pd.to_datetime(df_clean.start_time)
df_clean.end_time = pd.to_datetime(df_clean.end_time)

In [None]:
# Categorize users by their user_type (customers or subscribers), gender, and sharing.
df_clean.user_type = df_clean.user_type.astype('category')
df_clean.member_gender = df_clean.member_gender.astype('category')
df_clean.bike_share_for_all_trip = df_clean.bike_share_for_all_trip.astype('category')

In [None]:
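# Cast all the ID columns to strings, since they are labels rather than quantities
df_clean.bike_id = df_clean.bike_id.astype(str)
# Each station ID is cast from its own column (not from bike_id)
df_clean.start_station_id = df_clean.start_station_id.astype(str)
df_clean.end_station_id = df_clean.end_station_id.astype(str)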
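
One thing to watch when casting IDs to strings is how missing values are handled. The sketch below uses a hypothetical `station_id` series; the float dtype and the missing entry are assumptions about the raw column, not something confirmed by the output above. It contrasts a plain `astype(str)` with a route through pandas' nullable integer type, which drops the trailing `.0` and keeps the gap as a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical station IDs; the float dtype and the NaN are assumptions
station_id = pd.Series([13.0, 81.0, np.nan], name="start_station_id")

# Plain cast: the float formatting leaks into the label
as_plain_str = station_id.astype(str)

# Nullable route: float -> Int64 -> string keeps the gap as <NA>
as_nullable_str = station_id.astype("Int64").astype("string")

print(as_plain_str.tolist())
print(as_nullable_str.tolist())
```

Whether the extra step is worth it depends on whether the station columns really contain gaps, which `df.info()` would show.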

In [None]:
# View info of the edited dataframe 
df_clean.info()

Adding new data *fields* to enhance the dataset

In [None]:
# Add the users' age to the dataset fields
# Calculate each user's age by subtracting the birth year from the current year (2021)
df_clean['member_age'] = 2021 - df_clean['member_birth_year']
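
As an alternative to a fixed year, age could be measured at the time of the ride. This is only a sketch on hypothetical toy rows, and `age_at_ride` is an illustrative column name that does not appear in the notebook. Compared with the constant 2021, it would shift every age in this 2019 dataset down by two years.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real trip data
toy = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-28 17:32:10",
        "2019-02-01 08:05:00",
        "2019-02-10 12:00:00",
    ]),
    "member_birth_year": [1984.0, np.nan, 1900.0],
})

# Age relative to the year the ride took place; a missing birth year stays missing
toy["age_at_ride"] = toy["start_time"].dt.year - toy["member_birth_year"]

print(toy)
```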

In [None]:
# Adding the starting time month to the dataset fields
df_clean['start_time_month'] = df_clean['start_time'].dt.month.astype(int)

# Adding the starting time weekday to the dataset fields
df_clean['start_time_weekday'] = df_clean['start_time'].dt.strftime('%a')

# Adding the starting time hour to the dataset fields
df_clean['start_time_hour'] = df_clean['start_time'].dt.hour
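
To make the three derived fields concrete, here is a minimal sketch on two hypothetical timestamps (the values are made up for illustration). It applies the same `.dt` accessors as the cell above to a tiny frame so the results can be checked by eye.

```python
import pandas as pd

# Hypothetical timestamps in the same string format as the trip file
toy = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-01 08:05:00.0000"],
})
toy["start_time"] = pd.to_datetime(toy["start_time"])

# Same derivations as in the notebook cell
toy["start_time_month"] = toy["start_time"].dt.month
toy["start_time_weekday"] = toy["start_time"].dt.strftime("%a")
toy["start_time_hour"] = toy["start_time"].dt.hour

print(toy)
```

Note that `%a` produces locale-dependent abbreviations; the `Sat` to `Fri` ordering list used in the later plots assumes English day names.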

In [None]:
df_clean.head()

In [None]:
df_clean.info()

To show how the users' ages are distributed, we use a box plot.

In [None]:
# Visually represent the users age distribution using boxplot
sb.boxplot(data=df_clean, x='member_age');
print("The mean of ages:",df_clean.member_age.mean())

In [None]:
df_clean.member_age.describe()

As seen in the plot, there are age outliers. We remove the records of users older than 60.

In [None]:
# Keep only the records of users aged 60 or younger
df_clean = df_clean.query('member_age <= 60')
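
A side effect of this filter is that rows with a missing age are dropped too, because a comparison against a missing value evaluates to false. The sketch below shows this on a hypothetical toy column; the ages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value and two outliers
toy = pd.DataFrame({"member_age": [25.0, 61.0, np.nan, 60.0, 121.0]})

# Same filter as in the notebook cell
kept = toy.query("member_age <= 60")

# Count how many rows were dropped and how many of those had no age at all
n_dropped = len(toy) - len(kept)
n_missing = int(toy["member_age"].isna().sum())

print(kept["member_age"].tolist(), n_dropped, n_missing)
```

If riders without a recorded birth year should stay in the usage plots, the condition would need an explicit `isna()` branch.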

In [None]:
df_clean.info()
# Save cleaned data
df_clean.to_csv('cleaned_file.csv', index=False)

### What is the structure of your dataset?

> The dataset contains 183,411 bike ride entries recorded in February 2019 in the San Francisco Bay Area.

> This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area with the following fields: 
 - Trip Duration (in seconds)
 - Start Time and Date
 - End Time and Date
 - Start Station ID
 - Start Station Name
 - Start Station Latitude
 - Start Station Longitude
 - End Station ID
 - End Station Name
 - End Station Latitude
 - End Station Longitude
 - Bike ID
 - User Type ("Subscriber" = Member or "Customer" = Casual)
 - Member Year of Birth
 - Member Gender

> The dataset was further enhanced with derived features to make it more informative:
- User age
- Usage month, weekday, and hour

### What is/are the main feature(s) of interest in your dataset?

>  
  - The age range and gender of the users.
  - The variation in demand throughout the year and across the weekdays.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I think the user-related fields (age, gender, and user type) can help identify the target customers, and the start time reveals the usage patterns.

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

In [None]:
# Weekday usage of the bike sharing system

weekday = ['Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri']
p = sb.catplot(data=df_clean, x='start_time_weekday', kind='count', order = weekday)
p.set_axis_labels("Weekdays", "Bike Trips")
p.fig.suptitle('Weekly usage of the bike share system', y=1.03, fontsize=14, fontweight='semibold');



We can see that usage varies between the weekdays: Thursdays have the most trips, while Saturdays have the fewest.
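
The same ranking can be read off numerically instead of from the bar heights. A minimal sketch, assuming a hypothetical handful of weekday labels in place of the real `start_time_weekday` column:

```python
import pandas as pd

weekday = ["Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]

# Hypothetical weekday labels standing in for df_clean['start_time_weekday']
toy = pd.Series(["Thu", "Thu", "Mon", "Sat", "Thu", "Mon"], name="start_time_weekday")

# Count trips per day and lay them out in the same order as the plot,
# filling days that never occur with zero
counts = toy.value_counts().reindex(weekday, fill_value=0)

busiest = counts.idxmax()
print(counts)
print("Busiest day:", busiest)
```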

In [None]:
# Hourly usage of the bike sharing system

g = sb.catplot(data=df_clean, x='start_time_hour', kind='count')
g.set_axis_labels("Hours", "Bike Trips")
g.fig.suptitle('Hourly usage of the bike share system', y=1.03, fontsize=14, fontweight='semibold');

We can observe from the plot above that usage peaks around 8 am and 5 pm.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Thursdays have the highest bike usage, while Saturdays and Sundays have the lowest, which makes sense because they are weekend days.
Usage peaks around 8 am and 5 pm, which matches the times people travel to and from work.
What is unusual is that the dataset covers only February 2019.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Yes. The age outliers reached over 120 years. Since most users are between 20 and 57 years old, with a mean of about 35, I removed the records of users over 60.

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

In [None]:
# Find the amount of the users who are customers and those who are subscribers
customer = df_clean.query('user_type == "Customer"')['bike_id'].count()
subscriber = df_clean.query('user_type == "Subscriber"')['bike_id'].count()

# Normalize the counts to proportions of all trips
n = df_clean['bike_id'].count()
ncustomer = customer / n
nsubscriber = subscriber / n

# Plot for comparison 
p = sb.countplot(data=df_clean, x="user_type", order=df_clean.user_type.value_counts().index)
plt.suptitle('User type split for GoBike sharing system', y=1.03, fontsize=14, fontweight='semibold');


We can observe that the bike system is mainly used by subscribers.
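
To put a number on that split, the proportions can be computed directly with `value_counts(normalize=True)`. This is a sketch on hypothetical labels (nine subscribers and one customer, chosen only for illustration), not the actual shares in the dataset.

```python
import pandas as pd

# Hypothetical user types standing in for df_clean['user_type']
toy = pd.Series(["Subscriber"] * 9 + ["Customer"], name="user_type")

# Share of trips per user type, as fractions that sum to 1
shares = toy.value_counts(normalize=True)

print(shares)
```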

In [None]:
# Explore the usage of each user category in the weekdays
weekday = ['Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri']
p = sb.catplot(data=df_clean, x='start_time_weekday', col='user_type', kind='count', order = weekday)
p.set_axis_labels("Weekdays", "Bike Trips")
p.fig.suptitle('Daily usage of the bike share system per user type', y=1.03, fontsize=14, fontweight='semibold');


We can see different trends in the two user categories, though in both categories bike trips are highest on Thursdays.

In [None]:
# Hourly usage for both user categories
p = sb.catplot(data=df_clean, x='start_time_hour', col='user_type', kind='count')
p.set_axis_labels("Hour", "Bike Trips")
p.fig.suptitle('Hourly usage of the bike share system per user type', y=1.03, fontsize=14, fontweight='semibold');



Both user categories seem to follow the same pattern; the difference is in the number of trips.

In [None]:
# Usage duration per user category

# First, drop the duration_sec outliers by keeping only trips shorter than one hour (60 x 60 seconds)
d = df_clean.query('duration_sec < 3600')
p = sb.catplot(data=d, y='duration_sec', col='user_type', kind='box')
p.set_axis_labels("", "Trips duration")
p.fig.suptitle('The duration of using the bike share system per user type by seconds', y=1.03, fontsize=14, fontweight='semibold');

We can observe from the box plots above that customers' trips last longer than subscribers' trips.
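
The center line of each box is the median, so the comparison can also be stated as numbers. A minimal sketch on hypothetical durations (made up for illustration), using the same one-hour cutoff as the cell above:

```python
import pandas as pd

# Hypothetical trips; the 7200-second ride is removed by the one-hour cutoff
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber",
                  "Subscriber", "Subscriber", "Customer"],
    "duration_sec": [900, 1500, 480, 600, 540, 7200],
})

trimmed = toy.query("duration_sec < 3600")

# Median trip duration per user type, in seconds
medians = trimmed.groupby("user_type")["duration_sec"].median()

print(medians)
```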

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> First, the bike system is mainly used by subscribers, although subscribers and customers follow almost the same weekday and hourly patterns of bike trips. Moreover, customers' trips last longer than subscribers' trips.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We observed a difference in trip duration between customers and subscribers: customers' trips are longer, probably because many of them are tourists, while subscribers use the system mainly for daily purposes and quick rides to and from work or school.

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

In [None]:
# Plot the gender difference in each user category
p = sb.countplot(data=df_clean, x='user_type', hue='member_gender', order=df_clean.user_type.value_counts().index)


We can see that males make up the majority of users in both user categories.

In [None]:
# Explore the usage by gender for each user category in the weekdays
weekday = ['Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri']
p = sb.catplot(data=df_clean, x='start_time_weekday', col='user_type', hue='member_gender', kind='count', order = weekday)
p.set_axis_labels("Weekdays", "Bike Trips")
p.fig.suptitle('Weekly usage of the bike share system per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

Although there are more male users in both user categories, the genders follow the same pattern throughout the week.

In [None]:
# Hourly usage by gender for both users categories
p = sb.catplot(data=df_clean, x='start_time_hour', col='user_type', hue='member_gender', kind='count')
p.set_axis_labels("Hour", "Bike Trips")
p.fig.suptitle('Hourly usage of the bike share system per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

We can observe that both genders in both user categories follow the same pattern throughout the hours of the day.

In [None]:
# Usage duration for both genders per user category
p = sb.catplot(data=df_clean ,x='user_type', y='duration_sec', hue='member_gender', kind='bar')
p.set_axis_labels("", "Trips duration")
p.fig.suptitle('Trip duration per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

We can observe from the plot above that females take longer bike rides than males in both user types.
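
Because a seaborn bar plot shows the mean of each group by default, a few very long rides can pull a bar upward. The sketch below uses hypothetical durations, made up so that one long ride dominates, to show how the mean and the median can disagree. It illustrates the caveat only and does not claim that this happens in the real data.

```python
import pandas as pd

# Hypothetical subscriber trips; one 8500-second ride skews the female mean
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 6,
    "member_gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "duration_sec": [500, 600, 8500, 700, 800, 900],
})

# Mean and median duration per user type and gender
summary = (
    toy.groupby(["user_type", "member_gender"])["duration_sec"]
    .agg(["mean", "median"])
)

print(summary)
```

Running the same aggregation on `df_clean`, possibly with the one-hour cutoff from the bivariate section, would be one way to check that the gender difference holds for typical rides as well.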

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> We observed that males make up the majority in both user categories, yet the genders follow the same pattern throughout the week as well as throughout the hours of the day. Females, however, take longer trips than males in both user types.

### Were there any interesting or surprising interactions between features?

> I found it interesting that males and females, regardless of their numbers, follow almost the same usage pattern.