<a href="https://colab.research.google.com/github/pathstream-curriculum/Statistics/blob/master/Rideshare_Project2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rideshare Project Part 2
<img src="https://data.cityofchicago.org/api/assets/73F1665C-0FE6-4183-8AD1-E91DB8EFAFA4?7CB02402-8E06-48B0-8C9A-3890182D58C7.png" width=400 alt="Drawing" style="width: 200px;"/>

The city of Chicago has hired you as an analyst to dive into its recently published dataset containing detailed information about all rides taken with rideshare providers like Uber and Lyft in Chicago and surrounding areas from November 2018 through March 2019.

The dataset you'll be investigating in this notebook is the same one you looked at in the Google Sheets lab, comprising two weeks' worth of data from Dec. 21, 2018 to Jan. 3, 2019. This dataset has been downsampled by a factor of 500 to reduce the size, meaning that only one of every 500 records (selected randomly) from this time period in the original dataset is included here.

*Note: Some columns of unnecessary or redundant information have been removed from the original dataset, and columns for Year, Month, Weekday and Hour of Day have been added for convenience. The original published data was anonymized by rounding off dollar amounts and times of day. To make the data more realistic-looking, we have added random noise to the Fare, Tip, Latitude and Longitude columns.*

***The first few steps of this project (reading in the data, removing null values and investigating descriptive statistics) are the same as for part 1 of this project.***


## Saving Material from this Lab for your Final Report
As you go through this lab, you'll encounter questions that are meant to get you thinking about the interpretation of each step of analysis. You can double click on any text cell to edit directly as a way of taking notes, or you can take notes elsewhere. To save visuals or other outputs of the code you should take a screenshot (Holding down "Command-Shift-4" on Mac or using the "Snipping Tool" in Windows).


Ultimately, the final report you compile in a Google doc should include your interpretations (text) and screenshots of the relevant code outputs. You can always come back to this notebook later to take more screenshots or remind yourself of the interpretation questions.

## Objectives for this Python lab:
Some of the steps you will complete in this lab are effectively the same as you did in the Sheets lab, but now in a Python environment. Other steps are completely new. The overlap between this lab and the previous Sheets lab is intended to show you how some things work in Python that you're already familiar with in Sheets, and the new steps are to demonstrate where Python allows you to easily do things that would be very difficult or impossible in Sheets. Here are your objectives for this lab:

1. Read in the dataset and compute summary statistics.

2. Compute and investigate new columns of data.

3. Compute and interpret confidence intervals for driver income.

The code for all these objectives is already written for you. All you need to do is press "Shift+Enter" on your keyboard after selecting each cell to run the code. <mark>Be sure to run all the cells in order because some of the steps need to happen in a sequence</mark>. At a few points we give suggestions for making simple modifications to the code. Feel free to experiment with more if you're feeling curious!

## Objective 1: Read in the dataset and compute summary statistics.

**1.1 Investigate the data source (if you haven't already).**

As always, a great first step before you jump into your analysis is to check out the source of your data. You can find out more about this exciting dataset [here](https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p/data).

**1.2 Import all the relevant Python packages.**

Run the next cell to import the Python packages.

In [0]:
# Import the pandas library for reading and manipulating your data
# Anywhere you see "pd" in this notebook it's a reference to the pandas library
import pandas as pd
# Extra step to ensure that pandas plays nice with matplotlib
pd.plotting.register_matplotlib_converters()
# Import the numpy library for running calculations on your data
# Anywhere you see "np" in this notebook it's a reference to the numpy library
import numpy as np
# Import some components of the matplotlib library for plotting your data
# Anywhere you see "plt" or "mpimg" in this notebook it's a reference to the "pyplot" and "image" packages from matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# Import seaborn library for making your plots pretty!
# Anywhere you see "sns" in this notebook it's a reference to the seaborn library
import seaborn as sns
# Set some default plotting parameters using seaborn
sns.set()


**1.3 Read in the dataset.**

With the code below, you will use the `pandas` code library to read a csv (comma separated value) file containing your data into an object called "`df`". In this case, `df` is short for dataframe, which is a type of object used for storing rows and columns of data.

Run the following cell (select and press play or "Shift+Enter" on the keyboard) to complete this step. 



In [0]:
# Read the data into a dataframe called "df".
url = "https://raw.githubusercontent.com/pathstream-curriculum/Statistics/master/rideshare_holidays.csv"
df = pd.read_csv(url, parse_dates=['Trip Start Timestamp', 'Trip End Timestamp'])
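
To see what the `parse_dates` argument does, here is a minimal sketch that reads a tiny made-up CSV from a string instead of from the project URL. The two rows and their values are invented for illustration; only the column names are taken from the real dataset.

```python
import io
import pandas as pd

# Two made-up rows that mimic the timestamp columns of the real file
csv_text = (
    "Trip Start Timestamp,Trip End Timestamp,Fare\n"
    "2018-12-21 08:00:00,2018-12-21 08:15:00,10.0\n"
    "2018-12-21 09:30:00,2018-12-21 10:00:00,17.5\n"
)

# Without parse_dates, the timestamps are read in as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, they become datetime values you can do arithmetic on
parsed = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Trip Start Timestamp', 'Trip End Timestamp'])

print(raw.dtypes)
print(parsed.dtypes)

# Subtracting two datetime columns gives the duration of each trip
duration = parsed['Trip End Timestamp'] - parsed['Trip Start Timestamp']
print(duration)
```

Comparing the two `dtypes` printouts shows the difference: only the parsed version stores the timestamp columns as datetimes.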

**1.4 Inspect the dataset.**

Run the following cell to look at the column names and first few rows of your data. Be sure to scroll to the right to see all the columns. 

You can explore the [city of Chicago website for this dataset](https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p/data) further to learn more about each column.

In [0]:
# Print the column names and first five rows of the dataset contained in df
# Note: to look at more than 5 rows just enter a number in the parentheses e.g., "df.head(10)"
df.head()

In [0]:
df.tail()

**1.4 Explore column data types and the presence of null / missing values.**

In the cells below, running `df.shape` prints out the shape of the dataset and `df.info()` prints out each column name along with the total number of non-null values in that column and its data type. Investigate the output and see what you find!

In [0]:
# Display the shape of this dataset
df.shape

In [0]:
# Print information about the total number of non-null values and data types in each column of your dataframe
df.info()

**1.5 Handle missing values**

 If there are null/missing values in your data, you will sometimes want to eliminate those rows from the dataset, or change them to an acceptable value. In some cases, null values may be interesting to explore further. For this project, you can choose to simply leave the null values alone or remove records with null values. 

---

If you want to remove all records containing null values, make a note of why you've decided to do this and then run the next cell after removing the "#" from the beginning of the line that says `df.dropna(inplace=True)` to drop all rows containing null values. If you choose not to remove missing values you can skip the next cell. 

In [0]:
# Remove all rows that contain any null values
#df.dropna(inplace=True) 
# Print information about the total number of non-null values and data types in each column of your dataframe
#df.info()
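
Before deciding whether to drop rows, it can help to count how many values are missing in each column. The sketch below does this on a small made-up table (the `toy` dataframe and its values are invented, not taken from the rideshare data), and shows that `dropna()` without `inplace=True` returns a new dataframe rather than changing the original.

```python
import numpy as np
import pandas as pd

# A small made-up table with a few missing values (NaN)
toy = pd.DataFrame({
    'Fare': [10.0, np.nan, 7.5, 12.5],
    'Tip': [2.0, 0.0, np.nan, 3.0],
    'Trip Miles': [3.1, 1.2, 0.8, 4.4],
})

# Count the missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# dropna() returns a new dataframe with every row containing a null removed;
# the original "toy" dataframe is left untouched
toy_clean = toy.dropna()
print(f'Rows before: {len(toy)}, rows after: {len(toy_clean)}')
```

You could run `df.isna().sum()` on the project dataframe in the same way to see which columns contain the nulls.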

**1.6 Investigate summary statistics.**

Run the cell below to compute and display summary statistics for your dataset. The output will be a table containing the count, mean, standard deviation, min, max, and the 25% (Q1), 50% (median) and 75% (Q3) quartiles for every numeric column.

In [0]:
# Print out summary statistics for all columns
df.describe()

## Objective 2: Compute and investigate new columns of data.
In the first part of this project (the previous Colab notebook), you looked at summary statistics and interpreted a range of different visual representations of your dataset to uncover insights and anomalies. In this part of the project, you'll calculate new columns for your dataframe to derive further insights into the data.

While the data contained in the original dataset around how long a trip took, how many miles were covered or how much the fare was are interesting in their own right, you'll now combine these columns to compute metrics that are much more relevant to drivers, namely, how fast a trip was in terms of miles per hour and how much the driver made in terms of dollars per hour. 



### 2.1 Calculate a new column called "Hourly Income".
You've looked at things like how geographical location might be related to tips, which days of the week and hours of the day are more or less popular than others for riders, but surely one of the most fundamental metrics for drivers is how much money they can expect to make in a given timeframe. 

While you don't have information that would allow you to track individual drivers over time, you do have information on total fare and trip time so you can calculate a basic hourly income rate per ride and look at how that varies against other metrics. You'll calculate hourly income like this:
```
trip_hours = df['Trip Seconds'] / 3600
fare_fraction = 0.75
hourly_fraction = 0.5
df['Hourly Income'] = hourly_fraction*(fare_fraction*df['Fare'] + df['Tip']) / trip_hours
```
The `fare_fraction` of 0.75 is to account for the fact that rideshare companies typically keep ~25% of the fare. Multiplying by an `hourly_fraction` of 0.5 is to account for the fact that drivers are typically only completing a ride 50% of the time they spend in the car (i.e., half the time they are working they don't have someone in the car).
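
As a quick worked example of the formula, the sketch below applies it to a single ride. The fare, tip and trip duration are made-up values chosen to keep the arithmetic easy to follow.

```python
# Made-up values for one ride
fare = 10.00        # dollars
tip = 2.00          # dollars
trip_seconds = 900  # a 15-minute ride

fare_fraction = 0.75   # the driver keeps ~75% of the fare
hourly_fraction = 0.5  # a rider is in the car ~50% of the driver's working time

trip_hours = trip_seconds / 3600  # 900 seconds = 0.25 hours
hourly_income = hourly_fraction * (fare_fraction * fare + tip) / trip_hours
print(f'Hourly income for this ride: ${hourly_income:.2f}')
```

Here the driver keeps \$7.50 of the fare plus the \$2.00 tip, or \$9.50 for a quarter of an hour. That is \$38.00 per hour of driving with a passenger, which halves to \$19.00 per hour once idle time is accounted for.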

Run the following code cell to add the "Hourly Income" column to your dataframe.

Note: this cell will not produce any output.

In [0]:
# Compute an hourly income rate per trip and add the column to your dataframe
trip_hours = df['Trip Seconds']/3600 # convert seconds to hours
fare_fraction = 0.75 # to account for rideshare companies keeping 25% of fare
hourly_fraction = 0.5 # to account for an average of 50% time without riders in the car
df['Hourly Income'] = round(hourly_fraction*(fare_fraction*df['Fare'] + df['Tip'])/trip_hours, 2)

### 2.2 Investigate Hourly Income.
Run the next cell to look at the first few rows of the dataframe and scroll all the way to the right to see the new "Hourly Income" column you created.

In [0]:
# Print the column names and first five rows of the dataset contained in df
# Note: to look at more than 5 rows just enter a number in the parentheses e.g., "df.head(10)"
df.head()

Run the next cell to compute summary statistics on the "Hourly Income" column.

In [0]:
# Have a quick look at the summary statistics for this new column
df['Hourly Income'].describe()

The median hourly income of $17/hour is actually higher than most estimates from larger studies that have been done. If you're curious, have a look at [this report](https://www.ridester.com/how-much-do-uber-drivers-make/) to learn more about how much Uber drivers actually make in various locations around the U.S.

Run the following cell to generate boxplots showing the 5-number summary for "Hourly Income" vs. "Weekday" and "Hour of Day".

In [0]:
# Plot boxplots for hourly income rate as a function of day of week and time of day
plt.figure(figsize=(20,5))
# Indicate you want the weekday boxplot on the left with the plt.subplot() routine
plt.subplot(1, 2, 1)
# Create boxplots for hourly income vs. day of the week
sns.boxplot(x='Weekday', 
            y='Hourly Income', 
            data=df, 
            color='r') # Set color to pale red just for fun!
# Indicate you want the hourly boxplot on the right with the plt.subplot() routine
plt.subplot(1, 2, 2)
# Create boxplots of hourly income vs. hour of the day.
sns.boxplot(x='Hour of Day', 
            y='Hourly Income', 
            data=df,
            color='r')
plt.show()

### Questions:
1. Do you find anything surprising about the summary statistics or boxplot displays of "Hourly Income"?
2. Does it look like hourly income rate changes much as a function of day of the week or time of day?
3. It looks like some rides generate a very high hourly income rate... is there any indication that these higher hourly income rides are more common in some days of the week or hours of the day?

### 2.3 Dig deeper into high hourly income rates.
While it looks like drivers are typically making about \$17 an hour, there are some cases where they're making a lot more. Next, you'll take a look at rides where the hourly income rate is greater than \$50/hour.

In [0]:
# Isolate rides where hourly income was high
df_high = df[df['Hourly Income'] > 50]
# Compute summary statistics for high hourly income rides
df_high.describe()

Run the next cell to again print out the summary statistics for the full dataframe so you can compare with the high hourly income records above.

In [0]:
# Reprint summary statistics for the entire dataset for comparison
df.describe()
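
Scrolling between two large `describe()` tables can make differences hard to spot. One possible alternative, sketched below on a small made-up dataframe (the values are invented for illustration), is to place a single statistic such as the median for both groups side by side.

```python
import pandas as pd

# Made-up rides, not taken from the rideshare dataset
toy = pd.DataFrame({
    'Trip Miles': [2.0, 12.0, 3.5, 15.0, 1.0, 4.0],
    'Fare': [7.5, 30.0, 10.0, 35.0, 5.0, 10.0],
    'Hourly Income': [15.0, 60.0, 18.0, 55.0, 12.0, 20.0],
})

# Same filter as in the cell above: keep rides with hourly income over $50
toy_high = toy[toy['Hourly Income'] > 50]

# Put the medians of both groups next to each other, one row per column
comparison = pd.concat(
    {'all rides': toy.median(), 'high income': toy_high.median()},
    axis=1,
)
print(comparison)
```

The same pattern should work with `df` and `df_high`, for example by passing `numeric_only=True` to `median()` so that non-numeric columns are skipped.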

### Questions 
1. What columns look significantly different in terms of the summary statistics for high hourly income rides vs. the entire dataset?
2. In what ways are high income rides different from the typical rides in the full data set?
3. Can you think of other ways in which you could explore the higher hourly income rides?

### 2.4 Compute and investigate the speed of rides in miles per hour.
Some drivers prefer to work when traffic is light, while others don't mind driving at rush hour. In this step, you'll use the "Trip Miles" and "Trip Seconds" columns to compute "Trip Speed" in miles per hour. 

Run the following cell to perform this step.

In [0]:
# Compute trip speed in miles per hour and add the column to your dataframe
df['Trip Speed'] = round(df['Trip Miles'] / (df['Trip Seconds']/3600), 1)

Run the next cell to print out the first few rows of the dataframe and scroll all the way to the right to see your new "Trip Speed" column.

In [0]:
# Print the column names and first five rows of the dataset contained in df
# Note: to look at more than 5 rows just enter a number in the parentheses e.g., "df.head(10)"
df.head()

Run the next cell to compute summary statistics for the new "Trip Speed" column

In [0]:
# Have a quick look at the summary statistics for this new column
df['Trip Speed'].describe()

Run the following cell to generate boxplots showing the 5-number summary for "Trip Speed" vs. day of the week and hour of the day.

In [0]:
# Plot boxplots for Trip Speed as a function of day of week and time of day
plt.figure(figsize=(20,5))
# Indicate you want the weekday boxplot on the left with the plt.subplot() routine
plt.subplot(1, 2, 1)
# Create boxplots for trip speed vs. day of the week
sns.boxplot(x='Weekday', 
            y='Trip Speed', 
            data=df,
            color='orange') #Set color to orange just for fun!
# Limit the y-axis to a fixed range
plt.ylim(0, 60)
# Indicate you want the hourly boxplot on the right with the plt.subplot() routine
plt.subplot(1, 2, 2)
# Create boxplots of trip speed vs. hour of the day.
sns.boxplot(x='Hour of Day', 
            y='Trip Speed', 
            data=df,
            color='orange')
# Limit the y-axis to a fixed range
plt.ylim(0, 60)
plt.show()
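
Boxplots give a visual impression, but you can also compute the median for each group directly. The sketch below shows the idea on a handful of made-up rides (the hours and speeds are invented for illustration).

```python
import pandas as pd

# Made-up rides: two at 3am, two at 8am, two at 5pm
toy = pd.DataFrame({
    'Hour of Day': [3, 3, 8, 8, 17, 17],
    'Trip Speed': [30.0, 34.0, 12.0, 16.0, 10.0, 14.0],
})

# Median trip speed for each hour of the day
median_speed = toy.groupby('Hour of Day')['Trip Speed'].median()
print(median_speed)

# The hour whose median speed is highest
fastest_hour = median_speed.idxmax()
print(f'Hour with the highest median speed: {fastest_hour}')
```

Replacing `toy` with `df`, or `'Hour of Day'` with `'Weekday'`, would give the corresponding numbers for the real data.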

### Questions:
1. What insights can you derive from the summary statistics or boxplot displays of "Trip Speed"?
2. What time of day are rides the fastest? What about day of the week?

### 2.5 Compare Trip Speed and Hourly Income.
It might seem logical that if a trip is taken at a higher speed, then all things being equal, it should take less time and generate a higher hourly income rate, but is it really the case?

Run the next cell to plot Trip Speed vs. Hourly Income and find out!

In [0]:
# Plot Hourly Income vs. Trip Speed
plt.figure(figsize=(10,7))
# Create a scatterplot to display the data
sns.scatterplot(x='Trip Speed', 
                y='Hourly Income', 
                data=df, 
                alpha=0.4, # Make points semi-transparent
                s=20)      # Set size=20 to make points smaller than default
# Limit the x and y-axes to a fixed range
plt.ylim(0, 75)
plt.xlim(0, 60)
plt.show()
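
A scatterplot shows the relationship visually; a correlation coefficient summarizes it in one number between -1 and 1. The sketch below computes the Pearson correlation on a few made-up rides (the values are invented and deliberately strongly related, so they say nothing about the real data).

```python
import pandas as pd

# Made-up rides in which faster trips also happen to earn more
toy = pd.DataFrame({
    'Trip Speed': [8.0, 12.0, 15.0, 22.0, 30.0, 41.0],
    'Hourly Income': [11.0, 14.0, 13.0, 19.0, 24.0, 33.0],
})

# Pearson correlation: close to 1 means a strong positive linear relationship,
# close to 0 means little or no linear relationship
pearson = toy['Trip Speed'].corr(toy['Hourly Income'])
print(f'Pearson correlation: {pearson:.2f}')
```

Running `df['Trip Speed'].corr(df['Hourly Income'])` on the project data is one way to check your visual impression, keeping in mind that extreme outliers can strongly influence this number.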

### Questions: 
1. Does it look to you like Trip Speed and Hourly Income are related? Or in other words, do higher trip speeds typically yield higher hourly income?


### 2.6 Isolate rides which are high-speed and high-income and map them geographically.

<img src="https://pathstream-data-analytics.s3-us-west-2.amazonaws.com/high_speed_high_income2.png" width = "400" />

Run the following cell to isolate high-speed, high-income rides.

In [0]:
# Create a new dataframe (called df_hshi for high-speed, high-income)
df_hshi = df[(df['Trip Speed'] > 30) & (df['Hourly Income'] > 20)]

Run the next cell to visualize six dimensions of your data at once! In this cell, you are plotting selected high-speed, high-income rides as a function of geographical location (latitude and longitude), where the size of each point indicates the "Trip Total" (total fare) for each ride and the color indicates the "Hour of Day" the ride was taken.

In [0]:
# Plot the location of high-speed, high-income rides, where point size represents Fare and color indicates Hour of Day
plt.figure(figsize=(14, 8))
# Indicate you want the first plot on the left with the plt.subplot() routine
plt.subplot(1, 2, 1)
# Create a scatter plot of longitude vs. latitude from the df_hshi dataframe
sns.scatterplot(x = 'Pickup Centroid Longitude', 
                y = 'Pickup Centroid Latitude', 
                data = df_hshi,
                s = df_hshi['Trip Total']*7, # Set the size of points to correspond to the "Trip Total" value
                alpha=0.4,                   # Make points semi-transparent
                hue = df_hshi['Hour of Day'],# Set color to correspond to "Hour of Day"
                palette = 'copper_r')        # Choose a color palette (see https://seaborn.pydata.org/tutorial/color_palettes.html)

# Add a plot title
plt.title('High-Speed (mph > 30) & High-Income ($/hr > 20) rides\n (point size indicates total fare, color shows time of day)', fontsize=15)

# Read in a map image of Chicago and display next to the plots
plt.subplot(1, 2, 2)
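url = "https://raw.githubusercontent.com/pathstream-curriculum/Statistics/master/600px-Integrated_Chicago_districts_map.png"
# Download the image bytes first, because recent matplotlib versions no longer accept a URL in imread()
import io
import urllib.request
img = mpimg.imread(io.BytesIO(urllib.request.urlopen(url).read()), format='png')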
plt.imshow(img)
plt.title('Map of Chicago')
plt.show()

### Questions: 
1. What can you learn about high-speed, high-income rides from the figure above?
2. Does it look like any location or time of day might be better than another for these more desirable rides?


## Objective 3: Compute confidence intervals for driver income
Thus far in your exploratory data analysis you've uncovered lots of insights and reported on everything from tips and fares to hourly usage and the speed of rides. Now it's time to explore the uncertainty in those numbers.



### 3.1 Compute the mean of driver hourly income and convert to annual income for full-time drivers.

Suppose you are asked to report the typical hourly income drivers can expect to be making. You begin by writing the code in the following cell to compute this value. 

Run the cell below to compute the average hourly income.


In [0]:
# Compute the mean hourly income and round to the nearest penny
mean_income = df['Hourly Income'].mean()
print(f'The average hourly income for drivers is: ${mean_income:.2f}')


### Questions:
1. Does the number above for average hourly income seem like a reasonable estimate? 
2. In other words, would you be confident in telling a new driver that this is what they should expect to make on a typical ride? Why or why not?
3. What is your uncertainty in this value for average hourly income? 

Your task is to report driver annual income to estimate the long-term economic impact of rideshare activity in the city of Chicago. If you assume that a full-time driver completes rides totalling 40 hours of delivering passengers to their destination each week and takes two weeks off per year, what would be their annual income?

To answer this question you code up the following cell. 

Run the next cell to compute annual income for drivers based on the assumptions listed above.

In [0]:
# Compute average annual income as hourly income multiplied by 40 hours per week and 50 weeks per year (assuming two weeks off)
annual_income = df['Hourly Income'].mean()*40*50
print(f'Average annual income for drivers is: ${annual_income:.2f}')

### Questions:
1. What is your uncertainty in the number you computed above for estimated annual salary?
2. What are the potential sources of error that could be represented in this estimate?

### 3.2 Compute the Standard Error of the Mean.

In this case, you are estimating the average income for drivers based on two weeks of sample data. In other words, you are estimating the average income for the entire annual population of rides from a relatively small sample. If you can assume your sample is unbiased, the error associated with your estimate is known as the standard error of the mean (SEM), which is computed as follows.

First, the standard deviation is defined as:
$$ \sigma = \sqrt{\frac{\Sigma(x_i - \mu)^2}{n}} $$

where each $x_i$ is a data value, $\mu$ is the mean of all values, and $n$ is the number of elements in your sample.

The SEM is then: 

$$SEM = \frac{\sigma_{population}}{\sqrt{n}}$$

Where $\sigma_{population}$ is the standard deviation of the entire population. 

### Question: 
1. Do you know the standard deviation of the population?

Typically, you do not know the standard deviation of the population (and if you did, you'd probably know the mean too and wouldn't have to be worrying about calculating standard error!). When you don't know the population standard deviation, you can use the standard deviation of your sample as a rough approximation, so the SEM becomes:

$$SEM = \frac{\sigma_{sample}}{\sqrt{n}}$$

where $\sigma_{sample}$ is now the sample standard deviation you calculated above. 
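
To connect the formula to the code, the sketch below computes the SEM by hand on a few made-up income values and compares it with the built-in pandas `sem()` method. One detail worth knowing: by default, pandas computes the sample standard deviation with $n - 1$ in the denominator (`ddof=1`) rather than $n$. For a sample as large as the one in this project, the difference is negligible.

```python
import numpy as np
import pandas as pd

# Made-up hourly incomes, not taken from the rideshare dataset
income = pd.Series([12.0, 15.5, 17.0, 18.5, 21.0, 24.0, 40.0])
n = len(income)

# Sample standard deviation (pandas divides by n - 1 by default)
sample_std = income.std()

# SEM computed by hand: standard deviation divided by the square root of n
sem_manual = sample_std / np.sqrt(n)

# SEM computed with the built-in pandas method
sem_pandas = income.sem()

print(f'SEM by hand:    {sem_manual:.4f}')
print(f'SEM via pandas: {sem_pandas:.4f}')

# For comparison: the standard deviation with n in the denominator is slightly smaller
std_n = income.std(ddof=0)
print(f'Std with n - 1: {sample_std:.4f}, std with n: {std_n:.4f}')
```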

Run the next cell to calculate the SEM for your estimate of annual driver income.

In [0]:
# Compute the standard error on average annual income
annual_income_sem = df['Hourly Income'].sem()*40*50
print(f'Average annual income for drivers is: ${annual_income:.2f} +/- {2*annual_income_sem:.2f}')

### Questions:
1. What percent confidence is represented by this estimate of SEM for driver income?


According to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the mean values computed from different sufficiently large samples drawn from a population will be approximately normally distributed, and the bigger your sample, the narrower that normal distribution will be (and the smaller the SEM and the resulting confidence interval).
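
The central limit theorem can be checked with a small simulation. The sketch below uses a made-up, strongly skewed population (exponentially distributed values, not the rideshare data), draws many random samples from it, and compares the spread of the sample means with the SEM formula. The population size, sample size and number of samples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A made-up, strongly skewed "population" of values
population = rng.exponential(scale=20.0, size=200_000)

n = 400           # number of values in each sample
n_samples = 2000  # number of repeated samples to draw

# Draw many random samples and record the mean of each one
sample_means = np.array([
    rng.choice(population, size=n).mean() for _ in range(n_samples)
])

# The SEM formula predicts how widely the sample means should be spread
predicted_sem = population.std() / np.sqrt(n)
observed_spread = sample_means.std()
print(f'Predicted SEM:                  {predicted_sem:.3f}')
print(f'Observed spread of sample means: {observed_spread:.3f}')

# Fraction of sample means that land within 2 SEM of the true population mean
within_2_sem = np.mean(np.abs(sample_means - population.mean()) < 2 * predicted_sem)
print(f'Fraction within 2 SEM: {within_2_sem:.3f}')
```

Even though the population itself is far from normal, the sample means cluster around the true mean with a spread close to the predicted SEM, and roughly 95% of them fall within two SEM of it.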


### Questions:
The central limit theorem requires that the sample you choose to measure the mean and SEM be a random unbiased sample of the entire population.
1. What is the 95% confidence interval implied by your measurement of annual driver income reported above?
2. Given the requirements of the central limit theorem, would you be comfortable reporting this result with 95% confidence?

Run the cell below to read in a larger data set (90k records) that consists of a random sample drawn from November through March and compute annual driver income and SEM.

### Gathering more data

Run the following cell to make the same hourly income calculation on a much larger dataset, this time containing roughly 90k records randomly selected from November to March.

In [0]:
# Read in a new dataset of roughly 90,000 randomly selected records
url = "https://raw.githubusercontent.com/pathstream-curriculum/Statistics/master/rideshare_random90k.csv"
df_90k = pd.read_csv(url, parse_dates=['Trip Start Timestamp', 'Trip End Timestamp'])
# Compute an hourly income rate per trip and add the column to your dataframe
trip_hours_90k = df_90k['Trip Seconds']/3600 # convert seconds to hours
df_90k['Hourly Income'] = hourly_fraction*(fare_fraction*df_90k['Fare'] + df_90k['Tip'])/trip_hours_90k
# Compute annual driver income and the standard error on average annual income
annual_income_90k = df_90k['Hourly Income'].mean()*40*50
annual_income_sem_90k = df_90k['Hourly Income'].sem()*40*50
print(f'Annual income for drivers is: ${annual_income_90k:.2f} +/- {2*annual_income_sem_90k:.2f} dollars per year (at 95% confidence)')

### Questions:
1. How does the annual driver income you calculated with the larger sample compare with the estimate you calculated above for the holiday sample?
2. What can you say about the statistical agreement or disagreement between the results from the holiday sample and the larger random 90k sample?
3. What does this tell you about the two-week holiday dataset in terms of being a representative (random, unbiased) sample of the overall population?
4. Turning this on its head, what can you now infer with high confidence about driver hourly income over the holidays? What is the z-value associated with this conclusion?
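
One common way to quantify agreement between two estimates is a z-value: the difference between the two means divided by the combined standard error. The sketch below shows the calculation with hypothetical numbers (they are not results from this notebook) and assumes the two samples are independent.

```python
import numpy as np

# Hypothetical means and standard errors for two independent samples
mean_a, sem_a = 36000.0, 600.0  # e.g., a smaller sample with a larger SEM
mean_b, sem_b = 34000.0, 150.0  # e.g., a larger sample with a smaller SEM

# Standard errors of independent estimates add in quadrature
sem_diff = np.sqrt(sem_a**2 + sem_b**2)

# The z-value is the difference measured in units of the combined standard error
z = (mean_a - mean_b) / sem_diff
print(f'Difference: {mean_a - mean_b:.0f} +/- {sem_diff:.0f}')
print(f'z-value: {z:.2f}')
```

As a rough guide, a z-value with magnitude above about 2 means the two estimates differ by more than you would expect from random sampling alone at the 95% confidence level.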

# Congratulations! You've come to the end of this notebook!

Great job carrying out this exploratory analysis of the Chicago rideshare dataset. You now have a wealth of insights to present regarding the statistics of rides in Chicago and key metrics that affect drivers. Write up your results along with any supporting figures you would like to use from this notebook or your work in Google Sheets.