<a href="https://colab.research.google.com/github/pathstream-curriculum/Statistics/blob/master/Rideshare_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rideshare Project Part 1
<img src="https://data.cityofchicago.org/api/assets/73F1665C-0FE6-4183-8AD1-E91DB8EFAFA4?7CB02402-8E06-48B0-8C9A-3890182D58C7.png" width=400 alt="Drawing" style="width: 200px;"/>

The city of Chicago has hired you as an analyst to dive into their recently published dataset containing detailed information about all rides taken with rideshare providers like Uber and Lyft in Chicago and surrounding areas from November 2018 through March 2019.

The dataset you'll be investigating in this notebook is the same one you looked at in the Google Sheets lab, comprising two weeks' worth of data from Dec. 21, 2018 to Jan. 3, 2019. This dataset has been downsampled by a factor of 500 to reduce the size, meaning that only one of every 500 records (selected randomly) from this time period in the original dataset is included here.
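
If you're curious what downsampling by a factor of 500 looks like in code, here is a minimal sketch. We don't know exactly how the sample for this lab was drawn, so treat this as one possible approach. The `full` table and its `ride_id` column are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the full dataset: 10,000 numbered ride records
full = pd.DataFrame({"ride_id": range(10_000)})

# Keep a random 1-in-500 sample of the rows.
# random_state fixes the random draw so the result is repeatable.
sample = full.sample(frac=1 / 500, random_state=0)

print(len(full), "records before,", len(sample), "records after")
```

Because the rows are chosen at random, the smaller table should still resemble the original in its overall statistics, just with fewer records.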

*Note: Some columns of unnecessary or redundant information have been removed from the original data set, and columns for Year, Month, Weekday and Hour of Day have been added for convenience. The original published data was anonymized by rounding off dollar amounts and times of day. To make the data look more realistic, we have added random noise to the Fare, Tip, Latitude and Longitude columns.*
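
For the curious, convenience columns like these can be derived from a timestamp column with pandas. The sketch below is an assumption about how it could be done, not the actual preparation code for this dataset. The two timestamps are made up, and the real "Month" column may be stored differently (for example as a name rather than a number).

```python
import pandas as pd

# Two made-up trip start times inside the date range covered by this lab
rides = pd.DataFrame({
    "Trip Start Timestamp": pd.to_datetime(
        ["2018-12-21 08:15:00", "2019-01-03 23:45:00"]
    )
})

# The .dt accessor exposes date and time parts of a datetime column
start = rides["Trip Start Timestamp"]
rides["Year"] = start.dt.year
rides["Month"] = start.dt.month
rides["Weekday"] = start.dt.day_name()
rides["Hour of Day"] = start.dt.hour

print(rides)
```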



## Saving Material from this Lab for your Final Report
Feel free to go through this notebook from top to bottom the first time without worrying about notes or screenshots if you like. Once you're ready to start compiling material for your report you can come back and make notes or take screenshots.

As you go through this lab, you'll encounter questions that are meant to get you thinking about the interpretation of each step of analysis. You can double click on any text cell to edit directly as a way of taking notes, or you can take notes elsewhere. To save visuals or other outputs of the code you should take a screenshot (hold down "Command-Shift-4" on a Mac or use the "Snipping Tool" on Windows).


Ultimately, the final report you compile in a Google doc should include your interpretations (text) and screenshots of the relevant code outputs. You can always come back to this notebook later to take more screenshots or remind yourself of the interpretation questions.

## Objectives for this Python lab:
Some of the steps you will complete in this lab are effectively the same as you did in the Sheets lab, but now in a Python environment. Other steps around your detailed investigation of fares and tips are completely new. The overlap between this lab and the previous Sheets lab is intended to show you how some things work in Python that you're already familiar with in Sheets, and the new steps are to demonstrate where Python allows you to easily do things that would be very difficult or impossible in Sheets. Here are your objectives for this lab:

1. Explore and prepare the dataset.

2. Compute summary statistics and visualize individual columns of data.

3. Drill down on the Tip column.

4. Compare tips against fares and pickup location. 

The code for all these objectives is already written for you. All you need to do is press "Shift+Enter" on your keyboard after selecting each cell to run the code. <mark>Be sure to run all the cells in order because some of the steps need to happen in a sequence</mark>. At a few points we give suggestions for making simple modifications to the code. Feel free to experiment with more if you're feeling curious!

### Objective 1: Explore and prepare the dataset.

**1.1 Investigate the data source (if you haven't already).**

As always, a great first step before you jump into your analysis is to check out the source of your data. You can find out more about this exciting dataset [here](https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p/data).

**1.2 Import the relevant Python packages.**

To run the code in this notebook, you first need to import some Python libraries to read and work with the data. Comments in the code (lines that begin with #) indicate what each package will be used for. 

Run the following cell to perform these imports: select it and either click the play (triangle) button on the left or press "Shift+Enter".

Note, this cell will not produce any output. If it runs without complaining you've been successful!

In [0]:
# Import the pandas library for reading and manipulating your data
# Anywhere you see "pd" in this notebook it's a reference to the pandas library
import pandas as pd
# Extra step to ensure that pandas plays nice with matplotlib
pd.plotting.register_matplotlib_converters()
# Import the numpy library for running calculations on your data
# Anywhere you see "np" in this notebook it's a reference to the numpy library
import numpy as np
# Import some components of the matplotlib library for plotting your data
# Anywhere you see "plt" or "mpimg" in this notebook it's a reference to the "pyplot" and "image" packages from matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# Import seaborn library for making your plots pretty!
# Anywhere you see "sns" in this notebook it's a reference to the seaborn library
import seaborn as sns
# Set some default plotting parameters using seaborn
sns.set()


**1.3 Read in the dataset.**

With the code below, you will use the `pandas` code library to read a csv (comma separated value) file containing your data into an object called "`df`". In this case, `df` is short for dataframe, which is a type of object used for storing rows and columns of data.

Run the following cell (select and press play or "Shift+Enter" on the keyboard) to complete this step. There will again be no output if this cell runs successfully.



In [0]:
# Read the data into a dataframe called "df".
url = "https://raw.githubusercontent.com/pathstream-curriculum/Statistics/master/rideshare_holidays.csv"
df = pd.read_csv(url, parse_dates=['Trip Start Timestamp', 'Trip End Timestamp'])

**1.4 Inspect the dataset.**

Run the following cell to look at the column names and first few rows of your data. Be sure to scroll to the right to see all the columns. 

In [0]:
# Print the column names and first five rows of the dataset contained in df
# Note: to look at more than 5 rows just enter a number in the parentheses e.g., "df.head(10)"
df.head()

Run the next cell to look at the last few rows of your data. 

In [0]:
# Print the column names and last five rows of the dataset contained in df
# Note: to look at more than 5 rows just enter a number in the parentheses e.g., "df.tail(10)"
df.tail()

### Answer the following questions after looking at the data:
To learn more about each column, return to the lesson page where you launched this lab and click the link to the data source on the City of Chicago website.

***Double click on this cell (or select and click the pencil icon in the upper right of the cell) to input responses to these questions as a means of taking notes. You can edit any of the text cells in this notebook in the same manner***

***Throughout this lab you'll encounter questions like the two below. These are just meant to get you thinking and give you prompts to take notes for your final report. There is no preferred format for answering and these will not be graded***

1. What timeframe (range of dates/times) does this dataset cover?

2. Which columns look like the most interesting ones to explore?

**1.4 Explore column data types and the presence of null / missing values.**

In the cells below, running `df.shape` prints out the shape of the dataset and `df.info()` prints out each column name along with the total number of non-null values in that column and its data type. Investigate the output and see what you find!

In [0]:
# Display the shape of this dataset
df.shape

In [0]:
# Print information about the total number of non-null values and data types in each column of your dataframe
df.info()

### Questions:
The output of `df.info()` is a table containing the column names, number of non-null elements and data type. Above the printed table `RangeIndex:` indicates the total number of rows in the table.
1. What is the shape of this data set in terms of (# of rows, # of columns)? 
2. How many columns contain null values? Do these columns have anything in common?

***Optional Extra Questions:***
3. In the output of `df.info()` above, you'll see that the different data types are listed as things like "datetime64", "bool" and "float64"... what do these names actually mean? In other words, what does it mean to be an "int64" vs. a "bool" vs. something else? (note: this isn't from the lessons, might have to look it up!)
4. How many different data types are there? 
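
If you'd like to see data types and null counts on something smaller first, here is a minimal sketch using a made-up three-row table. The column names `fare`, `shared` and `passengers` are hypothetical and are not columns from the rideshare data.

```python
import numpy as np
import pandas as pd

# A tiny made-up table with one missing value (np.nan) in the "fare" column
toy = pd.DataFrame({
    "fare": [7.5, 12.0, np.nan],      # decimal numbers
    "shared": [True, False, False],   # True/False values
    "passengers": [1, 2, 1],          # whole numbers
})

# The data type pandas chose for each column
print(toy.dtypes)

# The number of null (missing) values in each column
null_counts = toy.isnull().sum()
print(null_counts)
```

You can run the same two calls, `df.dtypes` and `df.isnull().sum()`, on the rideshare dataframe to cross-check what you read off the `df.info()` output.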
  

**1.5 Handle missing values**

 If there are null/missing values in your data, you will sometimes want to eliminate those rows from the dataset, or change them to an acceptable value. In some cases, null values may be interesting to explore further. For this project, you can choose to simply leave the null values alone or remove records with null values. 
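
To see what these options do before you decide, here is a minimal sketch on a made-up table. Filling nulls with 0 is only one example of "an acceptable value" and is not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# A tiny made-up table in which two of the three rows contain a null value
toy = pd.DataFrame({
    "Fare": [10.0, np.nan, 7.5],
    "Tip": [2.0, 0.0, np.nan],
})

# Option 1: drop every row that contains at least one null value
dropped = toy.dropna()

# Option 2: keep all rows and replace nulls with a chosen value (here 0)
filled = toy.fillna(0)

print(len(toy), "rows originally")
print(len(dropped), "rows after dropna()")
print(len(filled), "rows after fillna(0)")
```

Without `inplace=True`, both calls return a new dataframe and leave `toy` unchanged, which makes it easy to compare the options side by side.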

---

If you want to remove all records containing null values, make a note of why you've decided to do this and then run the next cell after removing the "#" from the beginning of the line that says `df.dropna(inplace=True)` to drop all rows containing null values. If you choose not to remove missing values you can skip the next cell. 

In [0]:
# Remove all rows that contain any null values
#df.dropna(inplace=True) 

If you removed records with null values, you can now run `df.info()` again to confirm that you have the same number of non-null values in all columns of your dataset. Otherwise the next cell will produce the same output as you saw above.

In [0]:
# Print information about the total number of non-null values and data types in each column of your dataframe
df.info()

### Objective 2: Compute summary statistics and visualize individual columns of data.

**2.1 Investigate summary statistics.**

Run the cell below to compute and display summary statistics for your dataset. The output will be a table containing the count, mean, standard deviation, min, max, and the 25% (Q1), 50% (Q2, the median) and 75% (Q3) quartiles for the numeric columns.

In [0]:
# Print out summary statistics for all columns
df.describe()

### Answer these questions based on summary statistics
The table you printed using `df.describe()` is full of interesting information! Answer the following questions based on the summary statistics presented in the table.
1. What is the median tip (Q2 or 50% quartile) given for rides?
2. What is the longest distance ride taken?
3. What is the average trip time in minutes?
4. What was the most expensive ride?
5. Are there more rides in the AM or PM hours of the day?
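
If you want to pull a single number out of a summary like this rather than reading it off the table, the sketch below shows one way, using five made-up fare values (not values from the rideshare data).

```python
import pandas as pd

# Five made-up fares
fares = pd.Series([5.0, 7.5, 10.0, 12.5, 40.0], name="Fare")

# describe() returns the summary statistics as a labeled Series
summary = fares.describe()
print(summary)

# The same quartiles computed directly with quantile()
q1, median, q3 = fares.quantile([0.25, 0.5, 0.75])
print("Q1:", q1, "Median:", median, "Q3:", q3)

# Individual entries can be looked up by their label
print("Mean fare:", summary["mean"])
```

Notice that the mean is pulled well above the median by the single large fare, which is the kind of pattern worth looking for in the real summary table.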

## Visualizing Data
<img src="https://pathstream-data-analytics.s3-us-west-2.amazonaws.com/datavis_example.png" width=400 alt="Drawing" style="width: 200px;"/>


**2.2 Visualize the 5-number summary as a boxplot.**

Using `df.describe()` above, you printed out the summary statistics for each column. Often it can be really helpful to look at summary statistics graphically, and for that you'll use a boxplot, where you can easily see the range, quartiles and any outliers in your data.

When you run the next cell, you'll generate a boxplot using the [seaborn `boxplot()` routine](https://seaborn.pydata.org/generated/seaborn.boxplot.html), where the area shaded in blue shows the interquartile range (IQR), with vertical lines showing the Q1, Q2(median), and Q3 quartiles.  

In [0]:
# Set the figure size for large display
plt.figure(figsize=(10,4))
# Display Fare boxplot
sns.boxplot(df['Fare'])
plt.show()

In the figure you generated above, the area shaded in blue shows the interquartile range (IQR), with vertical lines showing the Q1, Q2(median), and Q3 quartiles. The lines (whiskers) extending from the shaded box reach to the most extreme data points that lie within 1.5\*IQR below Q1 and above Q3, which is the default range used to identify outliers. Any data points beyond the whiskers are plotted individually as outliers.
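
The 1.5\*IQR rule can also be worked out by hand. Here is a minimal sketch using five made-up fare values (not values from the rideshare data). The names `lower_fence` and `upper_fence` are just labels chosen for this example.

```python
import pandas as pd

# Five made-up fares, one of which is much larger than the rest
fares = pd.Series([5.0, 7.5, 10.0, 12.5, 40.0])

# Quartiles and interquartile range
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5*IQR below Q1 or above Q3 are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]

print("IQR:", iqr)
print("Outlier range: below", lower_fence, "or above", upper_fence)
print("Outliers:", outliers.tolist())
```

Swapping in `df['Fare']` for the made-up series should give you the cutoffs that correspond to the boxplot you just generated.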

### Exercise + Questions:
1. Refer back to the table you printed out using `df.describe()` above and verify that the Q1, Q2(median), and Q3 quartiles for the "Fare" column agree with the plot you generated running the previous cell.
2. In this case, what values are considered "outliers" for the "Fare" column?
3. In the code cell above, where it says `sns.boxplot(df['Fare'])`, change "Fare" to "Trip Miles" and run the cell again. (Note: the column name you put in this line of code has to match exactly with what's in the data, i.e., none of the following will work: "trip miles", "TripMiles" or "TRIP MILES")
4. Change the column name to "Tip" and run the code again. What do you find?

**2.3 Visualize a column of data as a histogram.**

Another useful way to visualize a column of data is using a histogram. In this case, you'll be using the [seaborn `distplot()`](https://seaborn.pydata.org/generated/seaborn.distplot.html) routine to generate bins across the entire range of your data and count the number of data points that fall into each bin. 

---

Run the cell below to look at the "Fare" column in histogram form.

In [0]:
# Set the figure size for large display
plt.figure(figsize=(8,6))
# Display Fare histogram. 
# The "kde=False" flag indicates you don't want to plot a line over the histogram (set kde=True to see what happens!)
# The "rug=True" flag indicates you want to put a tick mark along the x-axis for each datapoint (set rug=False to see what happens!)
sns.distplot(df['Fare'], 
             kde=False, 
             rug=True)
# Add a label to the vertical axis
plt.ylabel('Count')
plt.show()

### Exercise + Questions:
1. It might look like there are no histogram bars on the right-hand side of the graph, but they're there, just small! The tick marks along the x-axis show the location of each of the data points. Refer back to the table you printed out using `df.describe()` above and verify that the range (min and max values) for the "Fare" column agrees with the plot you generated running the previous cell.
2. In the code cell above, where it says `sns.distplot(df['Fare'])`, change "Fare" to "Trip Seconds" and run the cell again. 
3. Change the column name to "Tip" and run the code again. What do you find?
4. Change the column name to something else and explore a different column!
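
To see what "generating bins and counting the data points in each bin" means in numbers, here is a minimal sketch using numpy's `histogram()` function on five made-up fares. The choice of 4 bins is arbitrary for this example; the plotting routine picks the number of bins automatically.

```python
import numpy as np

# Five made-up fares
fares = np.array([5.0, 7.5, 10.0, 12.5, 40.0])

# Split the range from min to max into 4 equal-width bins
# and count how many values fall into each one
counts, edges = np.histogram(fares, bins=4)

# Print each bin's range alongside its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count} value(s)")
```

The two empty bins in the middle are the numerical version of the "missing" bars you see between the bulk of the data and a few large values.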

### What's going on with the "Tip" Column??
<img src="https://pathstream-data-analytics.s3-us-west-2.amazonaws.com/boxplot_tips.png" width = "600" />



In your investigation thus far, you've encountered some interesting results looking at the "Tip" column. In the 5-number summary and the boxplot, you saw that the Q1, Q2(median), and Q3 quartiles were all equal to zero! In looking at the histogram, you saw that the values in the "Tip" column are dominated by zeros. This warrants further investigation!

Next, you'll isolate the data for those who tipped and those who didn't and investigate these groups independently. To begin, you'll isolate the tippers and recreate the boxplot and histogram for just this group. 

### Objective 3: Drill down on the Tip column.

 **3.1 Isolate tippers and non-tippers and compute the fraction of people who tip.**

 Run the following code cell to create two new dataframes; one that only contains rides that tipped and one containing rides that didn't tip.

In [0]:
# Create dataframe of just the data where tips are greater than 0 and another for tips equal to zero.
df_tip = df[df['Tip'] > 0]
df_notip = df[df['Tip'] == 0]
# Compute the "length" of the new dataframe (number of rows) and divide by the length of the original to calculate the fraction of people who tip
number_of_tippers = len(df_tip)
total_number = len(df)
fraction_of_tippers = number_of_tippers/total_number
print('Two new dataframes successfully created!')
print('The fraction of people who tip is {} or {}% of all riders'.format(round(fraction_of_tippers, 2), round(fraction_of_tippers*100)))

**3.2 Visualize tip distribution for tippers.**
 
 Run the following code cell to create a boxplot and histogram where tips were greater than 0.

In [0]:
# Plot boxplot and histogram of tips for riders who tipped (df_tip)
plt.figure(figsize=(20,5))
# Indicate you want the boxplot on the left with the plt.subplot() routine
plt.subplot(1, 2, 1)
sns.boxplot(df_tip['Tip'])
# Indicate you want the histogram on the right
plt.subplot(1, 2, 2)
sns.distplot(df_tip['Tip'], 
             kde=False, 
             rug=True)
plt.show()

### Questions:
1. What fraction of riders tip?
2. Based on the plots above, what is the approximate IQR for tips when tips are given?
3. What's the difference between a "typical" tip and the maximum tip?
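
As an aside, the fraction of tippers from step 3.1 can be computed in more than one way. The sketch below uses five made-up tip values and shows that the average of a True/False column gives the same answer as dividing the row counts.

```python
import pandas as pd

# Five made-up rides, two of which include a tip
toy = pd.DataFrame({"Tip": [0.0, 2.0, 0.0, 0.0, 5.0]})

# Comparing a column to a number gives one True/False value per row
tipped = toy["Tip"] > 0

# True counts as 1 and False as 0, so the mean is the fraction of True values
fraction_of_tippers = tipped.mean()

# The same result from row counts, as in step 3.1
same_fraction = len(toy[tipped]) / len(toy)

print(fraction_of_tippers, same_fraction)
```

One thing to keep in mind: a row with a missing tip is neither greater than zero nor equal to zero, so it would land in neither the tipper nor the non-tipper group while still counting toward the total number of rows.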

**3.3 Investigate summary statistics for tippers vs. non-tippers.**

Run the cell below to perform this step.


In [0]:
# Print summary statistics for df_tip
df_tip.describe()

In [0]:
# Print summary statistics for df_notip
df_notip.describe()

### Questions:
1. What is the difference between the median fare for tippers vs. non-tippers?
2. Based on the summary statistics for tippers compared with non-tippers, are there any obvious differences that stand out between the two (apart from the fact that non-tippers don't tip!)?

**3.4 Visualize tips by day of the week and hour of day.** 

The next logical step when looking to identify insights in your data is to investigate subsets of the data. In this case, the summary statistics for tippers and non-tippers don't look much different across the entire data set, so we can start by looking at subsets grouped by day of the week and hour of the day.

Run the next cell to generate boxplots for tips given grouped by day of the week and hour of day.

In [0]:
# Plot boxplots of tips grouped by weekday and by hour of day (using df_tip)
plt.figure(figsize=(20,5))
# Indicate you want the weekday boxplot on the left with the plt.subplot() routine
plt.subplot(1, 2, 1)
column = "Tip"
sns.boxplot(x='Weekday', 
            y=column, 
            data=df_tip)
# Indicate you want the hour-of-day boxplot on the right with the plt.subplot() routine
plt.subplot(1, 2, 2)
sns.boxplot(x='Hour of Day', 
            y=column, 
            data=df_tip)
plt.show()

The graphs you created above show multiple boxplots at once. Now the boxplots are oriented vertically instead of horizontally, but the information they show is the same thing, namely, a representation of the quartiles, IQR, and outliers for tips grouped by day of the week on the left and hour of the day on the right (note: the hour starts at 0 for midnight up to 23 for 11PM).


### Questions and Exercise:
1. Based on the plots above, does it look like there is any difference in tips between different days of the week or hours of the day? 
2. Which hour of the day has the highest median tip?

Change the code above to create these plots for the "Fare" column instead of "Tip". To do this, change the line that says "Tip" to say "Fare" instead. 

3. Do you see any similarities between the boxplots you created for "Tip" and those for "Fare"?
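
Reading medians off a row of small boxplots can be tricky, so you may want to check your answer numerically. Here is a minimal sketch on a made-up table that computes the median tip for each hour and then picks out the hour with the highest median.

```python
import pandas as pd

# Six made-up tipped rides spread across three hours of the day
toy = pd.DataFrame({
    "Hour of Day": [0, 0, 8, 8, 8, 17],
    "Tip": [4.0, 6.0, 1.0, 2.0, 3.0, 2.5],
})

# Group the rows by hour and compute the median tip within each group
median_by_hour = toy.groupby("Hour of Day")["Tip"].median()
print(median_by_hour)

# idxmax() returns the group label (the hour) that has the largest median
top_hour = median_by_hour.idxmax()
print("Hour with the highest median tip:", top_hour)
```

To try this on the real data, replace `toy` with `df_tip`.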

***Bonus Exercise***: The colors shown in the plots above are not adding any information. For some crazy reason they are just the defaults for the `sns.boxplot()` routine in this case. Visual design best practices would suggest you should change to a single color for these plots. You can do so with a small change to the code. 

Change these lines above:
```python
sns.boxplot(x='Weekday', 
            y="Tip", 
            data=df_tip)
sns.boxplot(x='Hour of Day', 
            y="Tip", 
            data=df_tip)
```
by adding `color='b'` to set the color to blue, like this:
```python
sns.boxplot(x='Weekday',
            y='Tip', 
            data=df_tip, 
            color='b')
sns.boxplot(x='Hour of Day', 
            y='Tip', 
            data=df_tip, 
            color='b')
```
Try it and see what you get!

You can also try switching to `color='g'`  for green, `'r'` for red, and so on.

***Note: Throughout the rest of this notebook, colors for plotting have been chosen at random. Feel free to change them!***

**3.5 Compute the fraction of people who tip by day of week and hour of day.**

Run the cell below to compute the fraction of people who tip each day of the week.

In [0]:
# Print out the fraction of tippers by day of the week
# Note: the days print out in a strange order because the list is arranged alphabetically
print('Fraction of tippers by day of week:\n', round(df_tip.groupby(['Weekday']).count()['Tip']/df.groupby(['Weekday']).count()['Tip'], 2))

Run the next cell to compute the fraction of people who tip at each hour of the day.

In [0]:
# Print out the fraction of tippers by hour of day
# Note: the hour starts at 0 for midnight up to 23 for 11PM.
print('Fraction of tippers by hour of day:\n', round(df_tip.groupby(["Hour of Day"]).count()["Tip"]/df.groupby(["Hour of Day"]).count()["Tip"], 2))

### Question:

1. Is any one day of the week or hour of the day better than another in terms of fraction of people who tip?
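
If the alphabetical ordering of the days makes the output hard to read, here is a minimal sketch of one way to compute the same fraction and display it in calendar order. It uses a made-up table and assumes the "Weekday" column holds full English day names. The `tipped` column and `day_order` list are names introduced just for this example.

```python
import pandas as pd

# Six made-up rides on three different days
toy = pd.DataFrame({
    "Weekday": ["Monday", "Monday", "Friday", "Friday", "Friday", "Sunday"],
    "Tip": [0.0, 3.0, 0.0, 0.0, 2.0, 1.0],
})

# Mark each ride as tipped (True) or not (False)
toy["tipped"] = toy["Tip"] > 0

# The mean of a True/False column within each group is the fraction of tippers
fraction_by_day = toy.groupby("Weekday")["tipped"].mean()

# Rearrange the rows into calendar order and drop days with no rides
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
ordered = fraction_by_day.reindex(day_order).dropna()

print(ordered.round(2))
```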

### Objective 4. Compare tips against fares and pickup location. 

**4.1 Create scatter plots of Fare vs. Tip.**

In the previous steps, you may have noticed that both tips and fares seemed a bit higher for some reason in the early AM hours. Next you'll plot these two columns against each other in a scatter plot to see if they appear to be related.

Run the cell below to create a scatterplot of Fare vs. Tip for all riders who tipped using the [seaborn scatterplot() routine](https://seaborn.pydata.org/generated/seaborn.scatterplot.html).

In [0]:
# Plot a scatterplot of fare vs tip
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_tip,
                x = 'Fare', 
                y = 'Tip',
                alpha = 0.7,
                edgecolors='w',
                color='red')
plt.show()

### Question: 
1. Does it look to you like there is a relationship between Fare and Tip? Or in other words, do riders paying a high fare tend to tip more?
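
If you'd like a number to go along with your visual impression, one optional check is the correlation coefficient between the two columns. This goes slightly beyond what the lab asks for, so treat it as a sketch. The five fare and tip values below are made up.

```python
import pandas as pd

# Five made-up rides in which larger fares tend to come with larger tips
toy = pd.DataFrame({
    "Fare": [5.0, 10.0, 15.0, 20.0, 25.0],
    "Tip": [1.0, 2.0, 2.0, 4.0, 5.0],
})

# corr() computes the Pearson correlation coefficient by default.
# Values near +1 mean the columns rise together, values near 0 mean
# there is little linear relationship, and values near -1 mean one
# falls as the other rises.
r = toy["Fare"].corr(toy["Tip"])
print("Correlation between Fare and Tip:", round(r, 2))
```

To try it on the real data, you could run `df_tip['Fare'].corr(df_tip['Tip'])` and compare the result with what the scatterplot suggests.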

**4.2 Investigate Tip with regard to pickup location.**

Your data set includes location data for the pickup and dropoff location of each ride, listed in the fields "Pickup Centroid Latitude", "Pickup Centroid Longitude", "Dropoff Centroid Latitude", and "Dropoff Centroid Longitude". Run the cell below to plot the geographical pickup location of rides that tipped and those that didn't.

In [0]:
# Plot pickup location for tippers and non-tippers
plt.figure(figsize=(18,6))

# Indicate you want the non-tipper scatterplot on the left with the plt.subplot() routine
plt.subplot(1, 3, 1)
sns.scatterplot(x = 'Pickup Centroid Longitude', 
                y = 'Pickup Centroid Latitude', 
                data = df_notip,
                alpha=0.2, 
                edgecolors='w')
plt.title('Pickup Location of Non-Tippers')

# Indicate you want the tipper scatterplot in the center with the plt.subplot() routine
plt.subplot(1, 3, 2)
sns.scatterplot(x = 'Pickup Centroid Longitude', 
                y = 'Pickup Centroid Latitude', 
                data = df_tip,
                s = df_tip['Tip']*25, # Use the s parameter to set the size of data points to tip size!
                alpha=0.2, 
                edgecolors='w')
plt.title('Pickup Location and Size of Tips')

# Read in a map image of Chicago and display next to the plots
plt.subplot(1, 3, 3)
url = "https://raw.githubusercontent.com/pathstream-curriculum/Statistics/master/600px-Integrated_Chicago_districts_map.png"
img = mpimg.imread(url)
plt.imshow(img)
plt.title('Map of Chicago')
plt.show()

The plot on the left above shows the pickup location of riders who didn't tip and the plot in the center shows the pickup location of riders who tipped. Compare the plots above with the image on the right... can you make out the city of Chicago in the distribution of points above?



### Questions
1. Do you notice any differences about the geographical distribution of pickup location for riders who tip vs. those who don't?

2. There appear to be two concentrations of riders who tip from pickup locations that are fairly far out west (left) of the city. Looking at the figure below and the larger map (scroll down to see), can you guess which locations these are?
<img src="https://pathstream-data-analytics.s3-us-west-2.amazonaws.com/tips_west_of_chicago.png" width = "300" />


Check out the website this map came from to learn more about the neighborhoods of Chicago and think about how location might factor in to rideshare usage: https://wikitravel.org/en/Chicago

<img src="https://wikitravel.org/upload/shared//thumb/1/1f/Integrated_Chicago_districts_map.png/600px-Integrated_Chicago_districts_map.png" width = 600 alt="Drawing" style="width: 100px;"/>

# Congratulations! You've come to the end of this notebook!

Great job carrying out this exploratory analysis of the Chicago rideshare dataset. You now have a wealth of insights to present regarding the statistics of rides in Chicago and some key metrics that affect drivers. Proceed to the "Chicago Rideshare Project: Part 1 Report" module in the Pathstream platform to begin writing up your report. You can leave this lab open or return to it later to gather materials after previewing the Google Doc report template. 