# Ford GoBike
## by Ahmed Omran
## Preliminary Wrangling

Ford GoBike is a regional public bike sharing system in the San Francisco Bay Area, California, with nearly 500,000 rides since its launch in 2017 and about 10,000 annual subscribers as of January 2018. The dataset used for this exploratory analysis consists of [monthly individual trip data](https://www.lyft.com/bikes/bay-wheels/system-data) from January 2018 to December 2018 in CSV format.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import os
import glob

%matplotlib inline


### Gathering and Assessing Data

In [None]:
# only ran once to append all monthly trip data together

folder_name = 'Data'
frames = [pd.read_csv(f) for f in glob.glob(os.path.join(folder_name, '*.csv'))]
result = pd.concat(frames, ignore_index=True)
print(result.shape)
result.sample(5)

In [None]:
# save the appended result to a .csv for further usage

result.to_csv('fordgobike_trips_2018.csv', index=False)

In [None]:
biketrips18 = pd.read_csv('fordgobike_trips_2018.csv')
biketrips18.head()

In [None]:
biketrips18.info()

In [None]:
biketrips18.isnull().sum()

In [None]:
biketrips18.duplicated().sum()

In [None]:
biketrips18.member_gender.value_counts()

### Data Cleaning 


In [None]:
# make a copy of the dataframe 
# problem 1: fixing multiple fields that are not in the correct dtype

trips18 = biketrips18.copy()
trips18['start_time'] = pd.to_datetime(trips18['start_time'])
trips18['end_time'] = pd.to_datetime(trips18['end_time'])

trips18['start_station_id'] = trips18['start_station_id'].astype('str')
trips18['end_station_id'] = trips18['end_station_id'].astype('str')
trips18['bike_id'] = trips18['bike_id'].astype('str')

trips18['user_type'] = trips18['user_type'].astype('category')
trips18['member_gender'] = trips18['member_gender'].astype('category')

# show_counts replaces the null_counts argument, which was deprecated in pandas 1.2
trips18.info(show_counts=True)

In [None]:
# problem 2: adding new columns for trip duration in minute, trip start date in yyyy-mm-dd format, trip start hour of the day, day of week and month

trips18['duration_minute'] = trips18['duration_sec']/60

trips18['start_date'] = trips18.start_time.dt.strftime('%Y-%m-%d')
trips18['start_hourofday'] = trips18.start_time.dt.strftime('%H')
trips18['start_dayofweek'] = trips18.start_time.dt.strftime('%A')
trips18['start_month'] = trips18.start_time.dt.strftime('%B')

trips18.head()

In [None]:
# problem 3: adding a new column calculating riders' age from 'member_birth_year'

trips18['member_age'] = 2019 - trips18['member_birth_year']
trips18.describe()

In [None]:
# plotting the distribution of members' age

plt.figure(figsize=[8, 6])
bins = np.arange(0, trips18['member_age'].max()+5, 5)
plt.hist(trips18['member_age'].dropna(), bins=bins);

In [None]:
# problem 4: filtering out outlier ages based on visual examination of the distribution above
# problem 5: casting 'member_birth_year' and 'member_age' to integer instead of float type

# .copy() avoids a SettingWithCopyWarning on the assignments below;
# the query also drops rows with a missing age, since NaN <= 70 is False
trips18 = trips18.query('member_age <= 70').copy()
trips18['member_birth_year'] = trips18['member_birth_year'].astype('int')
trips18['member_age'] = trips18['member_age'].astype('int')
trips18.info(show_counts=True)

### What is the structure of your dataset?

The original combined data contains approximately 1,860,000 individual trip records with 16 variables collected. The variables can be divided into 3 major categories:
- trip duration: `duration_sec`, `start_time`, `end_time`


- station info: `start_station_id`, `start_station_name`, `start_station_latitude`, `start_station_longitude`, `end_station_id`, `end_station_name`, `end_station_latitude`, `end_station_longitude`


- member info (anonymized): `bike_id`, `user_type`, `member_birth_year`, `member_gender`, `bike_share_for_all_trip`

Derived features/variables to assist exploration and analysis:
- trip info: `duration_minute`, `start_date`, `start_hourofday`, `start_dayofweek`, `start_month`


- member: `member_age`
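
As a hedged alternative sketch for the derived trip features, the pandas `.dt` accessors can return integer hours and day/month names directly instead of formatted strings. The `toy` frame and its values are made up for illustration and are not part of the 2018 data.

```python
import pandas as pd

# Toy trips (made-up values) standing in for the real 2018 data.
toy = pd.DataFrame({
    "duration_sec": [600, 1500, 4200],
    "start_time": pd.to_datetime([
        "2018-01-01 08:15:00",  # a Monday
        "2018-07-07 17:40:00",  # a Saturday
        "2018-10-10 23:05:00",  # a Wednesday
    ]),
})

toy["duration_minute"] = toy["duration_sec"] / 60
toy["start_hourofday"] = toy["start_time"].dt.hour          # integer 0-23
toy["start_dayofweek"] = toy["start_time"].dt.day_name()    # e.g. 'Monday'
toy["start_month"] = toy["start_time"].dt.month_name()      # e.g. 'January'

print(toy[["duration_minute", "start_hourofday", "start_dayofweek", "start_month"]])
```

Integer hours sort numerically on a plot axis, whereas the `'%H'` strings used in the cleaning step sort correctly only because they are zero-padded.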

### What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out how trip duration depends on other features in the dataset, such as trip start time, user type, gender and age.



### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Each trip's start date/time and duration will help in understanding how long a trip usually takes and when. Member information such as user type, gender and age will help identify the main target customer groups, and summarizing bike usage by those groups will show whether any special pattern is associated with a specific group of riders.

## Univariate Exploration

A series of plots to first explore the trips distribution over hour-of-day, day-of-week and month.

In [None]:
# trip distribution over day hours

plt.rcParams['figure.figsize'] = 8, 6
base_color = sb.color_palette('colorblind')[0]
sb.set_style('darkgrid')

sb.countplot(data=trips18, x='start_hourofday', color=base_color);
plt.xlabel('Trip Start Hour of Day');
plt.ylabel('Count');

In [None]:
# trip distribution over weekdays
# problem 6: casting 'start_dayofweek' to category dtype

weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekdaycat = pd.api.types.CategoricalDtype(ordered=True, categories=weekday)
trips18['start_dayofweek'] = trips18['start_dayofweek'].astype(weekdaycat)

sb.countplot(data=trips18, x='start_dayofweek', color=base_color);
plt.xlabel('Trip Start Day of Week');
plt.ylabel('Count');

In [None]:
# trip distribution over months
# problem 7: casting 'start_month' to category dtype for easy plotting

month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
monthcat = pd.api.types.CategoricalDtype(ordered=True, categories=month)
trips18['start_month'] = trips18['start_month'].astype(monthcat)

sb.countplot(data=trips18, x='start_month', color=base_color);
plt.xticks(rotation=30);
plt.xlabel('Trip Start Month');
plt.ylabel('Count');

The trip distribution over hours of the day peaks around two timeframes, 8am-9am and 5pm-6pm, the typical rush hours. Combined with the distribution over day of week, it is clear that the majority of rides happened on work days (Mon-Fri), so the primary usage is probably commuting. Across the 12 months of 2018, October had the most trips, but overall the system was most popular during the summer (May-Sept), probably due to the weather in the area.
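
The rush-hour and work-day patterns read off the plots can also be checked numerically. The following is a minimal sketch on made-up start times (not the real trips), showing one way to compute the share of trips per hour and the share that fall on work days.

```python
import pandas as pd

# Toy start times (made-up): mostly weekday commute hours.
start = pd.Series(pd.to_datetime([
    "2018-10-01 08:05",  # Monday
    "2018-10-01 17:30",  # Monday
    "2018-10-02 08:45",  # Tuesday
    "2018-10-03 17:10",  # Wednesday
    "2018-10-04 08:20",  # Thursday
    "2018-10-06 13:00",  # Saturday
]))

# Share of trips starting in each hour of the day.
hour_share = start.dt.hour.value_counts(normalize=True).sort_index()

# dayofweek runs Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
is_weekend = start.dt.dayofweek >= 5
weekday_share = 1 - is_weekend.mean()

print(hour_share)
print(f"share of trips on work days: {weekday_share:.2f}")
```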

### The next several plots look at members/users to see who makes up the riders.

In [None]:
sb.countplot(data=trips18, x='user_type', color=base_color);
plt.xlabel('User Type');
plt.ylabel('Count');

In [None]:
sb.countplot(data=trips18, x='member_gender', color=base_color);
plt.xlabel('Gender');
plt.ylabel('Count');

In [None]:
sb.countplot(data=trips18, x='bike_share_for_all_trip', color=base_color);
plt.xlabel('Bike Share for All Trip');
plt.ylabel('Count');

In [None]:
plt.hist(data=trips18, x='duration_minute');
plt.xlabel('Trip Duration in Minute');

#### As we can see, the majority of trips fall within the first 200 minutes, so the next cell zooms in.

In [None]:
bins = np.arange(0, 66, 1)
ticks = np.arange(0, 66, 5)
plt.hist(data=trips18, x='duration_minute', bins=bins);
plt.xticks(ticks, ticks);
plt.xlabel('Trip Duration in Minute');

#### It looks like the majority of trips were shorter than one hour, so longer durations will be filtered out for more accurate and relevant results.

In [None]:
# problem 8: filtering out outlier trip records where the duration was very long

trips18 = trips18.query('duration_minute <= 66')
trips18.info(show_counts=True)

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Trip duration takes a wide range of values and is concentrated around a peak near 600 seconds (10 minutes). The distribution rises from 0 to that peak, then tapers off with no further peak.
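
One way to check how concentrated the durations are, and how many trips a cutoff such as 66 minutes keeps, is to look at quantiles. This is a minimal sketch on made-up durations, and the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy trip durations in minutes (made-up values, one long outlier).
duration_minute = pd.Series([3, 5, 8, 10, 10, 12, 15, 20, 45, 300], dtype=float)

# Median and upper quantiles show where the bulk of trips sits.
summary = duration_minute.quantile([0.5, 0.9, 0.99])

# Fraction of trips that a 66-minute cutoff would keep.
share_within_cutoff = (duration_minute <= 66).mean()

print(summary)
print(f"share of trips kept by the cutoff: {share_within_cutoff:.2f}")
```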

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When investigating the birth year variable, birth year was converted to age by subtracting it from the current year, which gives an age distribution that is easier to interpret. In addition, start station and end station were plotted in a larger plot to give better insight into bike traffic at particular stations.
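
The birth-year-to-age conversion can be sketched on a toy column as follows. The values are made up; 2019 is the reference year used in the cleaning step, and the nullable `Int64` cast is an assumption added here as one option for keeping missing ages as integers rather than floats.

```python
import numpy as np
import pandas as pd

# Toy birth years (made-up), including an implausible one and a missing one.
toy = pd.DataFrame({"member_birth_year": [1990.0, 1900.0, np.nan, 1975.0]})

# Age relative to the reference year used in the cleaning step.
toy["member_age"] = 2019 - toy["member_birth_year"]

# A comparison against NaN is False, so this filter drops both the
# 119-year-old outlier and the row with a missing birth year.
kept = toy.query("member_age <= 70")

# Optional: pandas' nullable integer dtype keeps missing ages as <NA>.
ages_nullable = toy["member_age"].astype("Int64")

print(kept)
print(ages_nullable)
```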



In [None]:
# save the clean data to a .csv file
# we will need it for the slide deck

trips18.to_csv('fordgobike_trips_2018_clean.csv', index=False)

## Bivariate Exploration

How does the trip duration distribution vary between customers and subscribers?

In [None]:
sb.violinplot(data=trips18, x='user_type', y='duration_minute', color=base_color, inner='quartile');
plt.xlabel('User Type');
plt.ylabel('Trip Duration in Minute');

The trip duration distribution is much narrower for subscribers than for casual riders, and is concentrated toward shorter trips overall.

### How does the trip duration distribution vary by gender?

In [None]:
sb.boxplot(data=trips18, x='member_gender', y='duration_minute', color=base_color);
plt.xlabel('Gender');
plt.ylabel('Trip Duration in Minute');

Though the difference is not huge, male riders tend to have shorter trips than female riders, as indicated by both a smaller median and a narrower IQR.
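
The median and IQR that a box plot draws can also be computed directly. The sketch below uses made-up durations for two gender groups, so the numbers illustrate the calculation only, not the real data.

```python
import pandas as pd

# Toy durations in minutes (made-up values).
toy = pd.DataFrame({
    "member_gender": ["Male"] * 5 + ["Female"] * 5,
    "duration_minute": [5.0, 8.0, 10.0, 12.0, 15.0,
                        6.0, 10.0, 13.0, 16.0, 22.0],
})

# Quartiles per group: one row per gender, one column per quantile.
quartiles = (
    toy.groupby("member_gender")["duration_minute"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)

# Interquartile range = 75th percentile minus 25th percentile.
iqr = quartiles[0.75] - quartiles[0.25]

print(quartiles)
print(iqr)
```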

#### Average Trip Duration on Weekdays

In [None]:
sb.barplot(data=trips18, x='start_dayofweek', y='duration_minute', color=base_color);
plt.xlabel('Day of Week');
plt.ylabel('Avg. Trip Duration in Minute');

Trips are much shorter Monday through Friday than on weekends.

####  Average trip duration by month

In [None]:
sb.barplot(data=trips18, x='start_month', y='duration_minute', color=base_color);
plt.xticks(rotation=30);
plt.xlabel('Month');
plt.ylabel('Avg. Trip Duration in Minute');

In [None]:
sb.boxplot(data=trips18, x='start_dayofweek', y='member_age', color=base_color);
plt.xlabel('Day of Week');
plt.ylabel('Member Age');

Users who rented bikes Monday to Friday are older than those who ride on weekends.

#### Yearly usage between customers and subscribers

In [None]:
sb.countplot(data=trips18, x='start_month', hue='user_type');
plt.xticks(rotation=30);
plt.xlabel('Month');
plt.ylabel('Count');

Both subscribers and customers ride the most during the summer months, with subscriber trips peaking in October and customer trips peaking in July. Usage was clearly lower during the winter months (November, December and January), likely due to the weather.
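
The peak month per user type can be read from a cross-tabulation instead of the chart. This is a minimal sketch with made-up rows, arranged so that the toy result mirrors the pattern described above.

```python
import pandas as pd

# Toy trips (made-up): one row per trip.
toy = pd.DataFrame({
    "start_month": ["July", "July", "July", "October",
                    "October", "October", "October", "December"],
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber",
                  "Subscriber", "Subscriber", "Customer", "Subscriber"],
})

# Rows are months, columns are user types, cells are trip counts.
counts = pd.crosstab(toy["start_month"], toy["user_type"])

# For each user type, the month with the largest count.
peak_month = counts.idxmax()

print(counts)
print(peak_month)
```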

#### Member age between customers and subscribers

In [None]:
sb.boxplot(data=trips18, x='user_type', y='member_age', color=base_color);
plt.xlabel('User Type');
plt.ylabel('Member Age');

Similar to the member age by weekday plot, subscribers, who ride most often Monday through Friday, are slightly older than customers and span a wider range of ages.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Trip duration depends strongly on member age: riders aged 20 to 45 take longer trips than older riders.

However, the start and end stations do not strongly determine trip duration. Some stations are associated with longer trips as a start station and others as an end station, so we can see which stations tend to start longer trips and which tend to end them.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I expected that the user type and gender categories with more trips would also have longer trip durations, but the opposite is true. For gender, male members account for far more trips but have shorter trip durations. For user type, subscribers account for far more trips but have shorter trip durations than customers.



## Multivariate Exploration

How does the average trip duration vary in weekdays between customers and subscribers?

In [None]:
sb.pointplot(data=trips18, x='start_dayofweek', y='duration_minute', hue='user_type', dodge=0.3, linestyles="");
plt.xlabel('Day of Week');
plt.ylabel('Avg. Trip Duration in Minute');

Subscriber usage seems to be more efficient than customer usage overall, with a very consistent average duration Monday through Friday.
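
The averages that the point plot draws can be tabulated with a pivot table. The following is a minimal sketch on made-up trips, with only two days shown for brevity.

```python
import pandas as pd

# Toy trips (made-up durations in minutes).
toy = pd.DataFrame({
    "start_dayofweek": ["Monday", "Monday", "Saturday",
                        "Saturday", "Monday", "Saturday"],
    "user_type": ["Subscriber", "Customer", "Subscriber",
                  "Customer", "Subscriber", "Customer"],
    "duration_minute": [10.0, 18.0, 12.0, 30.0, 12.0, 26.0],
})

# Mean duration for each (day of week, user type) combination.
avg = toy.pivot_table(
    index="start_dayofweek",
    columns="user_type",
    values="duration_minute",
    aggfunc="mean",
)

print(avg)
```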



### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Although male riders account for more trips, female riders and riders in the other-gender group tend to have longer trip durations. The other-gender group shows a jump at an older age (around 60 years), peaking at a trip duration of 3000. Also, among older riders, subscribers have longer trip durations than customers.


### Were there any interesting or surprising interactions between features?

Looking back on the plots, the jump for the other-gender group at an older age is a surprise, as is the fact that subscribers have longer trip durations than customers at older ages.


