# Solutions: Pandas Practice 3 (Bike Share Data)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/bike_share_201402_trip_data.csv')

How many observations are there?

In [None]:
len(df)  # or df.shape[0]

Change the column names to be Pythonic: lowercase, replace ' ' with '_', and replace '#' with 'num'.

In [None]:
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('#', 'num')
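The renaming chain can be checked on a small stand-in frame before applying it to the real data. The sketch below uses hypothetical column names (not taken from the CSV) and passes `regex=False` so that `' '` and `'#'` are treated as literal strings regardless of the default in the installed pandas version.

```python
import pandas as pd

# Hypothetical headers, chosen only to resemble raw bike share columns.
toy = pd.DataFrame(columns=['Trip ID', 'Start Station', 'Bike #'])

# regex=False makes both replacements literal string substitutions.
toy.columns = (
    toy.columns.str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('#', 'num', regex=False)
)

clean_columns = list(toy.columns)
print(clean_columns)
```

Wrapping the chain in parentheses keeps one transformation per line, which is easier to extend if further clean-up rules are needed.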

How many types of subscription options are there? What are the different subscription types?

In [None]:
print(df['subscription_type'].nunique())  # Number of unique types
print(df['subscription_type'].unique())  # The types

What is the frequency of each subscription option?

In [None]:
df['subscription_type'].value_counts()

Please plot the frequency of each subscription option with a pie chart:

In [None]:
df['subscription_type'].value_counts().plot(kind='pie', autopct='%1.1f%%')

Please plot the frequency of each subscription option with a bar chart:

In [None]:
df['subscription_type'].value_counts().plot(kind='bar')

Have a look at the start_station column: Which 10 stations occur most frequently?

In [None]:
df['start_station'].value_counts().head(10)

Now look at the end_station column: Which 10 stations occur the least often?

In [None]:
df['end_station'].value_counts().tail(10)

Create a table of start_station segmented by subscription_type, and also include the row/column margins (subtotals).

In [None]:
pd.crosstab(df['start_station'], df['subscription_type'], margins=True)
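To see how `margins=True` behaves, and how the same table can be expressed as per-station shares, here is a minimal sketch on made-up data. The station names `A` and `B` and the five rows are assumptions for illustration only.

```python
import pandas as pd

# Made-up trips: station names and counts are illustrative, not from the dataset.
toy = pd.DataFrame({
    'start_station': ['A', 'A', 'A', 'B', 'B'],
    'subscription_type': ['Subscriber', 'Customer', 'Subscriber',
                          'Subscriber', 'Subscriber'],
})

# Counts with an extra 'All' row and column holding the subtotals.
counts = pd.crosstab(toy['start_station'], toy['subscription_type'], margins=True)

# Same table as row-wise shares: each station's row sums to 1.
shares = pd.crosstab(toy['start_station'], toy['subscription_type'],
                     normalize='index')

print(counts)
print(shares)
```

Row-normalised shares make stations of very different sizes comparable, which raw counts do not.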

Let's look at the duration... Which unit do you think is used here? How long is the shortest trip? How many trips are that short?

In [None]:
# Unit is seconds (based on typical bike share durations).
print(df['duration'].min())  # Shortest trip
print((df['duration'] == df['duration'].min()).sum())  # How many that short

What do you think is going on with the short trips?

In [None]:
# Likely false starts, errors, or quick returns (e.g., bike issues). Filter to see:
df[df['duration'] == df['duration'].min()]

What is the longest trip?

In [None]:
df['duration'].max()

How would you define a "long" trip? How many trips are "long" according to your definition?

In [None]:
# Define long as > 1 hour (3600 seconds)
(df['duration'] > 3600).sum()
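The one-hour cutoff is only one possible definition. A data-driven alternative is to call a trip "long" when it exceeds a high quantile of the durations. The sketch below compares both on a made-up series of durations in seconds. The values and the choice of the 90th percentile are assumptions, not results from the dataset.

```python
import pandas as pd

# Made-up durations in seconds, including one extreme outlier.
duration = pd.Series([60, 300, 600, 900, 1200, 1800, 2400, 3000, 4000, 90000])

# Fixed definition: longer than one hour.
n_long_fixed = (duration > 3600).sum()
share_long_fixed = (duration > 3600).mean()

# Quantile definition: longer than the 90th percentile of all trips.
threshold = duration.quantile(0.9)
n_long_quantile = (duration > threshold).sum()

print(n_long_fixed, share_long_fixed, threshold, n_long_quantile)
```

Taking `.mean()` of the boolean mask gives the fraction of long trips directly, which is often easier to interpret than the raw count.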

Do the long durations seem reasonable? Why are they so long? What could it tell us about the users?

In [None]:
# Possibly forgotten returns or system errors. Users might be tourists/explorers.
df[df['duration'] > 3600]

Plot the duration column.

In [None]:
df['duration'].plot(kind='hist', bins=50)

Does this plot give any insights?

In [None]:
# Most trips are short (<30 min), with a long tail of outliers.

Select subsections of the data to make plots that provide more insights.

In [None]:
# Example: Short trips only (<30 min)
df.loc[df['duration'] < 1800, 'duration'].plot(kind='hist', bins=50)
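Another way to look at a heavily skewed column without plotting is to bin it into labelled ranges with `pd.cut`. This is a sketch on made-up durations, assuming the unit is seconds. The bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up durations in seconds.
duration = pd.Series([120, 400, 800, 1000, 2000, 5000, 90000])
minutes = duration / 60

# Arbitrary bucket edges in minutes; intervals are right-inclusive by default.
bins = [0, 5, 15, 30, 60, float('inf')]
labels = ['<=5', '5-15', '15-30', '30-60', '>60']
buckets = pd.cut(minutes, bins=bins, labels=labels)

# sort=False keeps the buckets in their natural order instead of by frequency.
bucket_counts = buckets.value_counts(sort=False)
print(bucket_counts)
```

Because every outlier lands in the last bucket, the extreme values no longer compress the rest of the distribution the way they do in a plain histogram.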

The Product Team would like all station names to be lowercase, with `_` as the separator.

In [None]:
df['start_station'] = df['start_station'].str.lower().str.replace(' ', '_')
df['end_station'] = df['end_station'].str.lower().str.replace(' ', '_')

Now take a timer and set it to 15 minutes. Take this time to explore the data guided by your own intuition or hypotheses…

In [None]:
# Example exploration: Average duration by subscription type
df.groupby('subscription_type')['duration'].mean().plot(kind='bar')
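Since the duration column has a long tail, the mean per group can be dominated by a few extreme trips. A possible follow-up is to compare mean and median side by side. The sketch below does this on made-up data. The values are assumptions chosen to include one outlier.

```python
import pandas as pd

# Made-up trips; the 90000-second trip stands in for an extreme outlier.
toy = pd.DataFrame({
    'subscription_type': ['Subscriber', 'Subscriber', 'Subscriber',
                          'Customer', 'Customer', 'Customer'],
    'duration': [300, 400, 500, 600, 900, 90000],
})

# Several aggregates at once: one row per subscription type.
summary = toy.groupby('subscription_type')['duration'].agg(['mean', 'median', 'count'])
print(summary)
```

A large gap between mean and median within a group is a hint that outliers, rather than typical behaviour, drive the average.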