In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw01.ipynb")

# Homework 01: Exploratory Data Analysis and Visualization

## Introduction

Bike sharing systems are a new generation of traditional bike rentals where the process of signing up, renting, and returning is automated. Through these systems, users are able to easily rent a bike from one location and return it to another. We will be analyzing bike sharing data from Washington D.C. 

In this assignment, you will perform tasks to clean, visualize, and explore the bike sharing data. You will also investigate open-ended questions. These open-ended questions ask you to think critically about how the plots you have created provide insight into the data.

After completing this assignment, you should be comfortable with:

* reading plaintext delimited data into `pandas`
* wrangling data for analysis
* using EDA to learn about your data 
* making informative plots

To receive credit for a homework assignment, answer all questions correctly and submit before the deadline.

**Due Date:** Wednesday, March 10, 2021 at 7:00 p.m.

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the assignments, we ask that you **write your solutions individually**. If you do discuss the assignments with others, **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

**Notes:**

- Your plots should be **similar** to the given examples. Small variations, such as color differences or slight variations in scale, are acceptable. However, it is in your best interest to make the plots as similar as possible, since similarity is judged at the instructor's discretion.

- For all plotting questions from here on out, your plots are expected to have appropriate titles, axis labels, legends, etc. The following question serves as a good guideline on what is "enough": **If I directly downloaded the plot and viewed it, would I be able to tell what was being visualized without knowing the question?**

- In this notebook a custom figure size has been configured. Click [here](https://matplotlib.org/users/customizing.html) to read the documentation about customizing aspects of matplotlib.

Run the cell below.

In [1]:
import numpy as np
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10,10)
plt.rcParams['figure.dpi'] = 150
sns.set()

---
## 1. Loading Bike Sharing Data
The data we are exploring was collected from a bike sharing system in Washington D.C.

The variables in this data frame are defined as:

Variable  |Description
--------- |--------------------------------------------------------------
instant | record index
dteday | date
season | 1. spring <br> 2. summer <br> 3. fall <br> 4. winter
yr | year (0: 2011, 1: 2012)
mnth | month (1 to 12)
hr | hour (0 to 23)
holiday | whether day is holiday or not
weekday | day of the week
workingday | if day is neither weekend nor holiday
weathersit | 1. clear or partly cloudy <br> 2. mist and clouds <br> 3. light snow or rain <br> 4. heavy rain or snow
temp | normalized temperature in Celsius (divided by 41)
atemp | normalized "feels-like" temperature in Celsius (divided by 50)
hum | normalized percent humidity (divided by 100)
windspeed| normalized wind speed (divided by 67)
casual | count of casual users
registered | count of registered users
cnt | count of total rental bikes including casual and registered  

### Examining the File Contents and the Metadata

Run the cell below to print out the first 5 lines of the `bikeshare.txt` file.

In [2]:
from itertools import islice

lines_to_print = 5
with open('bikeshare.txt', "r") as f:
    for line in list(islice(f, lines_to_print)):
        print(line, end = "")
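
If the delimiter is hard to tell by eye, Python's standard-library `csv.Sniffer` can guess it from a text sample. The sketch below uses two made-up in-memory samples (not the contents of `bikeshare.txt`) and restricts the candidates to commas and tabs. The `guess_delimiter` helper is a hypothetical name introduced only for illustration.

```python
import csv

def guess_delimiter(sample: str) -> str:
    """Guess the delimiter of a small text sample (candidates: comma, tab)."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",\t")
    return dialect.delimiter

# Made-up samples for illustration; these are not lines from bikeshare.txt.
comma_sample = "id,city,count\n1,Boston,10\n2,Denver,25\n3,Austin,7\n"
tab_sample = "id\tcity\tcount\n1\tBoston\t10\n2\tDenver\t25\n3\tAustin\t7\n"

print(repr(guess_delimiter(comma_sample)))
print(repr(guess_delimiter(tab_sample)))
```

Sniffing is a heuristic, so it is still worth looking at the printed lines yourself before trusting the guess.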

**Question 1.1.** Identify the file format. Then choose an item from the list below and assign its corresponding number to the variable `q1_1`.

<!--
BEGIN QUESTION
name: q1_1
manual: false
-->

1. Excel spreadsheet
2. CSV (comma-separated values)
3. TSV (tab-separated values)

In [3]:
q1_1 = ...

In [None]:
grader.check("q1_1")

Metadata is "data that provides information about other data". In other words, it is "data about data". [[1]](https://en.wikipedia.org/wiki/Metadata)

Run the cell below to view some metadata about the `bikeshare.txt` file.

In [6]:
size = os.stat("bikeshare.txt").st_size
# Use a context manager so the file handle is closed after counting lines
with open("bikeshare.txt") as f:
    line_count = len(f.readlines())

print("Size:", size, "bytes")
print("Line Count:", line_count, "lines")
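
For much larger files, reading every line into a list just to count them can be wasteful. A minimal sketch of a streaming count follows. It writes a small temporary file so that it runs on its own, and the `count_lines` helper is a hypothetical name, not part of the assignment.

```python
import os
import tempfile

def count_lines(path: str) -> int:
    """Count lines by iterating over the file instead of loading it all at once."""
    with open(path) as f:
        return sum(1 for _ in f)

# Write a tiny throwaway file so the example does not depend on bikeshare.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("header\nrow 1\nrow 2\n")
    tmp_path = tmp.name

n_lines = count_lines(tmp_path)
n_bytes = os.stat(tmp_path).st_size
print("Size:", n_bytes, "bytes")
print("Line Count:", n_lines, "lines")

os.remove(tmp_path)
```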

**Question 1.2.** Identify the size and the line count. Then put the values in a list and assign it to the variable `q1_2`.

<!--
BEGIN QUESTION
name: q1_2
manual: false
-->

In [7]:
q1_2 = ...

In [None]:
grader.check("q1_2")

### Loading the Data

Run the cell below to load the dataset as a DataFrame.

In [15]:
bike = pd.read_csv('bikeshare.txt')
bike.head()

Below, we show the shape of the DataFrame. You should see that the number of rows in the DataFrame matches the number of lines in the file, minus the header row.

In [16]:
bike.shape
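
The relationship between line count and row count can be checked on a toy example. The sketch below uses made-up CSV text held in memory (hypothetical columns, not the bikeshare data) and assumes the text has exactly one header line.

```python
import io
import pandas as pd

# Made-up CSV text: one header line followed by three data lines.
text = "city,count\nBoston,10\nDenver,25\nAustin,7\n"

toy = pd.read_csv(io.StringIO(text))
n_lines = len(text.splitlines())

# read_csv consumes the header line as column names, so rows = lines - 1.
print(toy.shape)
print(toy.shape[0] == n_lines - 1)
```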

---
## 2. Data Preparation
A few of the variables that are numeric/integer actually encode categorical data. These include `holiday`, `weekday`, `workingday`, and `weathersit`. In the following question, we will convert these four variables to strings specifying the categories. In particular, we will use 3-letter labels (`Sun`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, and `Sat`) for `weekday`. For `holiday` and `workingday`, simply use `yes`/`no`.

In this exercise we will **mutate** the data frame (i.e., **overwrite the corresponding variables in the data frame**). However, our notebook will effectively document this in-place data transformation for future readers. Make sure to leave the underlying data file `bikeshare.txt` unmodified.

**Question 2.1.** Decode the `holiday`, `weekday`, `workingday`, and `weathersit` fields:

- `holiday`: Convert to `yes` and `no`.  

- `weekday`: Mutate the `'weekday'` column to use the 3-letter label (`'Sun'`, `'Mon'`, `'Tue'`, `'Wed'`, `'Thu'`, `'Fri'`, and `'Sat'`) instead of its current numerical values. Assume `0` corresponds to `Sun`, `1` to `Mon` and so on.

- `workingday`: Convert to `yes` and `no`.

- `weathersit`: You should replace each value with one of `Clear`, `Mist`, `Light`, or `Heavy`.

**Note:** If you want to revert changes, run the cell that reloads the data file.

**Hint:** One approach is to use the [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method of the pandas DataFrame class. We haven't discussed how to do this, so you'll need to look at the documentation. The most concise way is with the approach described in the documentation as ["nested dictionaries"](https://www.geeksforgeeks.org/python-nested-dictionary/), though there are many possible solutions.
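
To show the shape of a nested-dictionary replacement without giving away the answer, here is a minimal sketch on a toy frame. The column names `size` and `in_stock` and their codes are invented for illustration and are not part of the bikeshare data.

```python
import pandas as pd

# Toy frame with hypothetical columns; these are not the bikeshare columns.
toy = pd.DataFrame({"size": [0, 1, 2, 1], "in_stock": [1, 0, 1, 1]})

# Outer keys are column names; inner dictionaries map old values to new ones.
decoded = toy.replace({
    "size": {0: "S", 1: "M", 2: "L"},
    "in_stock": {0: "no", 1: "yes"},
})

print(decoded)
```

Note that `replace` returns a new DataFrame by default, so the result has to be assigned to a name for the change to persist.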

<!--
BEGIN QUESTION
name: q2_1
manual: false
-->

In [17]:
bike = ...
bike.head()

In [None]:
grader.check("q2_1")

**Question 2.2.** How many entries in the data correspond to holidays? Set the variable `num_holidays` to this value.

<!--
BEGIN QUESTION
name: q2_2
manual: false
-->

In [23]:
num_holidays = ...

In [None]:
grader.check("q2_2")

The granularity of this data is at the hourly level. However, for some of the analysis we will also want to compute daily statistics. In particular, in the next few questions we will be analyzing the daily number of registered and casual users.

**Question 2.3.** Construct a data frame named `daily_counts` indexed by `dteday` with the following columns:

* `casual`: total number of casual riders for each day
* `registered`: total number of registered riders for each day
* `workingday`: whether that day is a working day or not (`yes` or `no`)

**Hint**: Consider using the `groupby` and `agg` functions. For `agg`, you can check the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) for examples on applying different aggregations per column. If you use the capability to do different aggregations by column, you can do this task with a single call to `groupby` and `agg`. For the `workingday` column we can take any of the values since we are grouping by the day, thus the value will be the same within each group. It may also be helpful to take a look at the `first` or `last` aggregation functions.
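
As a rough illustration of per-column aggregation, the sketch below groups a toy frame and applies a different aggregation to each column in one call. The column names (`day`, `sales`, `is_open`) are hypothetical and chosen so that the example does not mirror the bikeshare schema.

```python
import pandas as pd

# Toy hourly records with hypothetical columns (not the bikeshare schema).
toy = pd.DataFrame({
    "day": ["d1", "d1", "d2", "d2", "d2"],
    "sales": [3, 4, 10, 1, 2],
    "is_open": ["yes", "yes", "no", "no", "no"],
})

# One groupby/agg call: sum the numeric column, keep the first label per group.
per_day = toy.groupby("day").agg({"sales": "sum", "is_open": "first"})

print(per_day)
```

The grouping column becomes the index of the result, which is why no separate indexing step is needed here.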

<!--
BEGIN QUESTION
name: q2_3
manual: false
-->

In [25]:
daily_counts = ...
daily_counts.head()

In [None]:
grader.check("q2_3")

<!-- BEGIN QUESTION -->

---
## 3. Exploring the Distribution of Riders

Let's begin by comparing the distribution of the daily counts of casual and registered riders.

**Question 3.1.** Use the `sns.histplot` function to create a plot that overlays the distribution of the daily counts of `casual` and `registered` users. 

- The temporal granularity of the records should be daily counts, which you should have after completing **Question 2.3.**

- After creating the plot, look at it and make sure you understand what the plot is actually telling us (e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000).

**Note:** Click [here](https://seaborn.pydata.org/generated/seaborn.histplot.html) to read the `sns.histplot` documentation.

**Hint:** `matplotlib.pyplot` has functions to customize the features of a plot. For example, the syntax for the `xlabel`, `ylabel`, `title`, and `legend` functions can be found in the [documentation](https://matplotlib.org/stable/index.html).

<img src='casual_v_registered.png' width = "800px" class = "center"/>

<!--
BEGIN QUESTION
name: q3_1
manual: true
-->

In [38]:
sns.histplot(...)
plt.title(...)
plt.legend(...)
plt.xlabel(...)
plt.ylabel(...);

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.2.** In the cell below, describe the differences you notice between the histograms for casual and registered riders. Consider each of the following concepts:

- mode 

- symmetry

- skewness 

- spread of the distributions 
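
These concepts can also be checked numerically. The sketch below computes a few summary statistics on a small made-up sample (not the ridership data) using standard pandas `Series` methods. A mean well above the median and a positive skewness value both suggest a longer right tail.

```python
import pandas as pd

# Made-up, right-skewed sample: most values are small, one is very large.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

summary = {
    "mode": values.mode().iloc[0],   # most frequent value
    "mean": values.mean(),
    "median": values.median(),
    "std": values.std(),             # one measure of spread
    "skew": values.skew(),           # > 0 suggests a longer right tail
}
print(summary)
```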

<!--
BEGIN QUESTION
name: q3_2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Ethical Considerations

City planners, transportation agencies, and policy makers have started to collaborate with bike sharing companies in order to reduce congestion and transportation costs. Recently city planners and policy makers have also been trying to make transportation more equitable. 

Equity in transportation includes: finding ways to make transportation more accessible to people in all neighborhoods within a given region, making the costs of transportation affordable to people across all income levels, and assessing how inclusive transportation systems are over time. Data about city residents may shed light on how to better assess transportation cost and equity impacts on transportation users. 

Keeping this in mind, answer the following two questions on the nature of the data, their possible shortcomings, and ethical considerations associated with how we as data scientists use, manipulate, and share this data.

<!-- BEGIN QUESTION -->

**Question 3.3.** In addition to the type of rider (casual vs. registered) and the overall count of each, what other kinds of demographic data would be useful (e.g. identity, neighborhood, monetary expenses, etc.)? 

<!--
BEGIN QUESTION
name: q3_3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.4.** What is an example of a privacy or consent issue that could occur when accessing the demographic data you mentioned in the previous question?

<!--
BEGIN QUESTION
name: q3_4
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.5.** The histogram plots do not show us how the counts for registered and casual riders **vary together**. Use [`sns.lmplot`](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to make a scatter plot to investigate the relationship between casual and registered counts. This time, let's use the `bike` DataFrame to plot hourly counts instead of daily counts.

The `lmplot` function will also try to draw a linear regression line (just as you saw in Foundations of Data Science). Color the points in the scatter plot according to whether or not the day is a working day. There are many points in the scatter plot, so make them small to help reduce overplotting. Also make sure to set `fit_reg = True` to generate the linear regression line. You can also set the `height` parameter if you want to adjust the size of the `lmplot`.

<img src='casual_registered_working_nonworking.png' width = "800px" class = "center"/>

**Hints:**
- Click [here](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to view the documentation on `sns.lmplot`.

- Check out this helpful [tutorial on `lmplot`](https://seaborn.pydata.org/tutorial/regression.html).

- You will need to set `x`, `y`, `hue`, and `scatter_kws`.



<!--
BEGIN QUESTION
name: q3_5
manual: true
-->

In [39]:
sns.lmplot(...)
plt.title(...)
plt.legend(...)
plt.xlabel(...)
plt.ylabel(...);

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.6.** What does this scatterplot seem to reveal about the relationship (if any) between casual and registered riders and whether or not the day is on the weekend?

<!--
BEGIN QUESTION
name: q3_6
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.7.** Let's examine the behavior of riders by plotting the average number of riders for each hour of the day over the **entire dataset**, stratified by rider type. Your plot should look like the following:

<img src="diurnal_bikes.png" width = "800px" class = "center"/>

<!--
BEGIN QUESTION
name: q3_7
manual: true
-->

In [41]:
...
plt.title(...)
plt.legend(...)
plt.xlabel(...)
plt.ylabel(...);

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.8.** What do you observe about the bike use of the different categories of riders from the plot? When does each group have the most use? How does the bike use change throughout the day?

<!--
BEGIN QUESTION
name: q3_8
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.9.** What can you say about the meaning of the peaks in the registered riders' distribution?

<!--
BEGIN QUESTION
name: q3_9
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

---
## 4. Exploring Ride Sharing and Weather

Now let's examine how the weather is affecting riders' behaviors. First let's look at how the proportion of casual riders changes as weather changes.

**Question 4.1.** Create a new column `prop_casual` in the `bike` DataFrame representing the proportion of casual riders out of all riders.

<!--
BEGIN QUESTION
name: q4_1
manual: true
-->

In [43]:
...

In [None]:
grader.check("q4_1")

<!-- END QUESTION -->



In order to examine the relationship between proportion of casual riders and temperature, we can create a scatterplot using `sns.scatterplot`. We can even use color/hue to encode the information about day of week. 

**Example 4.1.** Run the cell below, and you'll see we end up with a big mess that is impossible to interpret.

In [53]:
sns.scatterplot(data = bike, x = 'temp', y = 'prop_casual', hue = 'weekday');

We could attempt linear regression using `sns.lmplot`, which may hint at some relationships between temperature and proportion of casual riders.  

**Example 4.2.** Run the cell below, and you'll see that the plot is still fairly unconvincing.

In [52]:
sns.lmplot(data = bike, x = 'temp', y = 'prop_casual', hue = 'weekday', scatter_kws = {'s': 20}, height = 10, legend = False)
plt.title('Proportion of Casual Riders by Weekday', fontsize = 15)
plt.legend(title = 'Weekday', prop = {'size':15});
plt.xlabel('Temperature', fontsize = 15)
plt.ylabel('Proportion of Casual Riders', fontsize = 15);

A better approach is to use **local smoothing**. The basic idea is that for each $x$-value, we compute some sort of representative $y$-value that captures the data close to that $x$-value.
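
To make the idea concrete, here is a minimal sketch of one very simple local smoother: a plain windowed average. This is not LOWESS (which fits weighted local regressions), only an illustration of the idea of summarizing nearby points. The `window_mean_smooth` helper, the `half_width` parameter, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def window_mean_smooth(x, y, half_width):
    """For each x value, average the y values of points within half_width of it."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    smoothed = np.empty_like(y)
    for i, xi in enumerate(x):
        nearby = np.abs(x - xi) <= half_width   # points close to xi
        smoothed[i] = y[nearby].mean()
    return smoothed

# Synthetic data: a straight line plus noise (seeded for reproducibility).
rng = np.random.default_rng(0)
x_toy = np.linspace(0, 10, 200)
y_toy = 2 * x_toy + rng.normal(scale=3.0, size=x_toy.size)

y_smooth = window_mean_smooth(x_toy, y_toy, half_width=1.0)

# The smoothed values should vary less from point to point than the raw ones.
print(np.abs(np.diff(y_toy)).mean(), np.abs(np.diff(y_smooth)).mean())
```

A wider window gives a smoother but less responsive curve, and a narrower one follows the noise more closely. LOWESS exposes a similar trade-off through the fraction of the data used for each local fit.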

One technique for local smoothing is "Locally Weighted Scatterplot Smoothing" or LOWESS. Watch the video below to see an explanation of the basic idea behind LOWESS.

In [54]:
from IPython.display import YouTubeVideo

# fitting a curve to data using lowess or loess
# channel: StatQuest
# video credit: Josh Starmer

YouTubeVideo('Vf7oJ6z2LCc', width = 750, height = 315)

**Example 4.3.** An example of what this technique looks like is below. The red curve shown is a smoothed version of the scatterplot.

In [55]:
from statsmodels.nonparametric.smoothers_lowess import lowess

# make noisy data
xobs = np.sort(np.random.rand(100)*4.0 - 2)
yobs = np.exp(xobs) + np.random.randn(100) / 2.0
sns.scatterplot(x = xobs, y = yobs, label = 'Raw Data')

# predict 'smoothed' values for observations
ysmooth = lowess(yobs, xobs, return_sorted = False)
sns.lineplot(x = xobs, y = ysmooth, label = 'Smoothed Estimator', color = 'r')
plt.legend(prop = {'size':15});

In our case with the bike ridership data, we want 7 curves, one for each day of the week. The $x$-axis will be the temperature and the $y$-axis will be a smoothed version of the proportion of casual riders. We want to make a graph that looks like the one below.

<img src="curveplot_temp_prop_casual.png" width = "800px" class = "center"/>

The first thing we need to do is add a column to the `bike` DataFrame.

**Question 4.2.** Add a column to the `bike` DataFrame named `fatemp` that is the temperature in Fahrenheit.

**Hint:** Refer back to the top of this homework notebook for a description of the temperature field to know how to convert to Fahrenheit. By default, the temperature field ranges from 0.0 to 1.0.

<!--
BEGIN QUESTION
name: q4_2
manual: false
-->

In [56]:
...

In [None]:
grader.check("q4_2")

<!-- BEGIN QUESTION -->

**Question 4.3.** Next, we will use [`statsmodels.nonparametric.smoothers_lowess.lowess`](http://www.statsmodels.org/dev/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html) just like in **Example 4.3.** However, unlike that example, we will plot **only** the LOWESS curve. Do not plot the actual data, which would result in overplotting (i.e., so many overlapping points or labels that individual data points become difficult to see).

**Hints:** 
- The `lowess` function expects $y$ coordinate first, then $x$ coordinate.

- Start by just plotting only one day of the week to make sure you can do that first.

- For this problem, the simplest way to plot all 7 curves is to use a loop.

<!--
BEGIN QUESTION
name: q4_3
manual: true
-->

In [60]:
...
plt.title(...)
plt.legend(...)
plt.xlabel(...)
plt.ylabel(...);

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.4.** How is `prop_casual` changing as a function of temperature?

<!--
BEGIN QUESTION
name: q4_4
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.5.** Which, if any, of the curves are approximately linear? Also, mention anything else you notice that is interesting.

<!--
BEGIN QUESTION
name: q4_5
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



----
## 5. Conclusion

A map of areas with bike sharing systems and other forms of micromobility as of 2018 is provided below (Source: [NACTO](https://nacto.org/shared-micromobility-2018/)).

<img src="Shared-Micromobility-Across-the-U.S..png" width="700px" />



**Question 5.1.** Based on the data you have explored (distribution of riders, daily patterns, weather, additional data/information you have seen), do you think bike sharing should be realistically scaled across major cities in the US in order to alleviate congestion, provide geographic connectivity, reduce carbon emissions, and promote inclusion among communities? Why or why not?

Write 3 - 4 paragraphs to answer **Question 5.1.** Be sure to use the data and the visualizations you made to support your conclusions.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file in the left side of the screen and right-click and select **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)