In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw03.ipynb")

# Homework 03: Exploratory Data Analysis and Visualization

Welcome to Homework 03! To receive credit for a homework assignment, answer all questions correctly and submit before the deadline.

**Due Date:**

**Collaboration Policy:** You are not allowed to discuss this assignment with other students. If you have questions, please refer them to your instructor.

# Introduction

In this assignment, you will perform tasks to clean, visualize, and explore the bike sharing data. You will also investigate open-ended questions. These open-ended questions ask you to think critically about how the plots you have created provide insight into the data.

After completing this assignment, you should be comfortable with:

* reading plaintext delimited data into `pandas`.

* wrangling data for analysis.

* using EDA to learn about your data. 

* making informative plots.

**Notes:**

- Your plots should be **similar** to the given examples. Small variations, such as color differences or slight variations in scale, are acceptable. However, it is in your best interest to make the plots as similar as possible, because similarity is judged by the instructor.

- From here on, all plots are expected to have appropriate titles, axis labels, legends, etc. The following question is a good guideline for what is "enough": **If I directly downloaded the plot and viewed it, would I be able to tell what was being visualized without knowing the question?**

- In this notebook a custom figure size has been configured. Click [here](https://matplotlib.org/users/customizing.html) to read the documentation about customizing aspects of `matplotlib`.

Run the cell below.

In [None]:
import numpy as np
import os
import pandas as pd
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 4)
plt.rcParams['figure.dpi'] = 100

## Loading Bike Sharing Data
The data we are exploring was collected from a bike sharing system in Washington D.C.

The variables in this data frame are defined as:

Variable  |Description
:--------- |:--------------------------------------------------------------
instant | record index
dteday | date
season | 1. spring <br> 2. summer <br> 3. fall <br> 4. winter
yr | year (0: 2011, 1: 2012)
mnth | month (1 to 12)
hr | hour (0 to 23)
holiday | whether day is holiday or not
weekday | day of the week
workingday | if day is neither weekend nor holiday
weathersit | 1. clear or partly cloudy <br> 2. mist and clouds <br> 3. light snow or rain <br> 4. heavy rain or snow
temp | normalized temperature in Celsius (divided by 41)
atemp | normalized "feels-like" temperature in Celsius (divided by 50)
hum | normalized percent humidity (divided by 100)
windspeed| normalized wind speed (divided by 67)
casual | count of casual users
registered | count of registered users
cnt | count of total rental bikes including casual and registered  

**Question 1.** Using the criteria established by the authors of the textbook [Learning Data Science](https://learningds.org/ch/10/eda_feature_types.html) in section 10.1, classify the features of the Bike Sharing dataset.

_Type your answer here, replacing this text._

### Examining the File Contents and the Metadata

Run the cell below to print out the first 5 lines of the `bikeshare.txt` file.

In [None]:
from itertools import islice

lines_to_print = 5
with open('data/bikeshare.txt', "r") as f:
    for line in list(islice(f, lines_to_print)):
        print(line, end = "")

**Question 2.** Identify the file format. Then choose an item from the list below and assign its corresponding number to the variable `q2`.


1. Excel spreadsheet

2. CSV (comma-separated values)

3. TSV (tab-separated values)

In [None]:
q2 = ...

In [None]:
grader.check("q2")

Metadata is "data that provides information about other data". In other words, it is "data about the data". [[1]](https://en.wikipedia.org/wiki/Metadata)

Run the cell below to view some metadata about the `bikeshare.txt` file.

In [None]:
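# File size in bytes, read from the file system metadata
size = os.stat("data/bikeshare.txt").st_size
# Count lines with a context manager so the file is closed afterwards
with open("data/bikeshare.txt") as f:
    line_count = sum(1 for _ in f)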

print("Size:", size, "bytes")
print("Line Count:", line_count, "lines")

**Question 3.** Identify the size and the line count. Then put the values in a dictionary object and assign it to the variable `q3`.

**Note:** Make sure the keys are *Size* and *Line Count* and that the values are the corresponding numbers.

In [None]:
q3 = ...
q3

In [None]:
grader.check("q3")

### Loading the Data

Run the cell below to load the dataset as a `pandas` `DataFrame`.

In [None]:
bike = pd.read_csv('data/bikeshare.txt')

In [None]:
bike.head()

<!-- BEGIN QUESTION -->

**Question 4.** Before we start, use the cells below to perform some initial data exploration. Include text/markdown cells before each code cell to explain what you are investigating and why it is of interest to you.

**Note:** You must use each markdown cell and each code cell.

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

# Data Wrangling
A few of the variables that are numeric/integer actually encode categorical data. These include `holiday`, `weekday`, `workingday`, and `weathersit`. In the following question, we will convert these four variables, along with `yr` and `season`, to strings specifying the categories.

In particular, we will use 3-letter labels (`Sun`, `Mon`, `Tue`, `Wed`, `Thu`, `Fri`, and `Sat`) for `weekday`. For `holiday` and `workingday`,  we will simply use `yes`/`no`.

In this exercise we will **mutate** the data frame, **overwriting the corresponding variables in the data frame**. However, our notebook will effectively document this in-place data transformation for future readers.

**Note:** Make sure to leave the underlying datafile `bikeshare.txt` unmodified.

**Question 5.** Decode the `yr`, `holiday`, `weekday`, `workingday`, `weathersit` and `season` fields:

- `yr`: Convert 0 to 2011 and 1 to 2012. These values should be strings.

- `holiday`: Convert to `yes` and `no`.

- `weekday`: Mutate the `'weekday'` column to use the 3-letter label (`'Sun'`, `'Mon'`, `'Tue'`, `'Wed'`, `'Thu'`, `'Fri'`, and `'Sat'`) instead of its current numerical values. Assume `0` corresponds to `Sun`, `1` to `Mon` and so on.

- `workingday`: Convert to `yes` and `no`.

- `weathersit`: You should replace each value with one of `Clear`, `Mist`, `Light`, or `Heavy`.

- `season`: Convert to `Spring`, `Summer`, `Fall`, or `Winter`.

**Note:** If you want to revert changes, run the cell that reloads the data file.

**Hint:**  One approach is to use the [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method of the `pandas` `DataFrame` class. We haven't discussed how to do this so you'll need to look at the documentation.
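As a rough illustration of the `replace` approach, here is a minimal sketch on a small made-up data frame. The `toy` frame, its `flag` and `level` columns, and the labels are all hypothetical and are not part of the bike data. Passing a nested dictionary maps old values to new ones column by column:

```python
import pandas as pd

# Hypothetical toy data; these columns are NOT from bikeshare.txt.
toy = pd.DataFrame({"flag": [0, 1, 0], "level": [1, 2, 3]})

# Outer keys name the columns; each inner dictionary maps old values to new ones.
toy = toy.replace({
    "flag": {0: "no", 1: "yes"},
    "level": {1: "low", 2: "mid", 3: "high"},
})
print(toy)
```

Columns that are not named in the outer dictionary are left unchanged.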

In [None]:
...
bike.head()

In [None]:
grader.check("q5")

**Question 6.** How many entries in the data correspond to holidays?  Set the variable `num_holidays` to this value.

**Note:** To earn all the points for this question you must show the code you used to obtain your result.

In [None]:
num_holidays = bike['holiday'].value_counts()['yes']
num_holidays

In [None]:
grader.check("q6")

Holidays don't always occur on the weekend. What if we wanted to know the distribution of non-working days (holidays) in each year that occurred on each weekday (i.e. Mon, Tue, Wed, Thu, Fri)?

**Question 7.** How many entries in the data correspond to non-working days that did not occur on the weekend? For this question we will assume weekend days to be Saturday and Sunday. Save your answer as a number to the variable `q7`.

**Note:** To earn all the points for this question you must show the code you used to obtain your result.
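One common way to count rows that satisfy several conditions is to combine boolean masks. The sketch below uses a made-up `toy` frame with hypothetical `day` and `open` columns (not the bike data), so it only illustrates the pattern:

```python
import pandas as pd

# Hypothetical toy data; not the bike data.
toy = pd.DataFrame({
    "day": ["Sat", "Mon", "Tue", "Sun", "Mon"],
    "open": ["no", "no", "yes", "no", "yes"],
})

# Each comparison yields a boolean Series; & combines them element-wise.
closed = toy["open"] == "no"
not_weekend = ~toy["day"].isin(["Sat", "Sun"])

# Summing a boolean mask counts the rows where it is True.
n_closed_weekdays = int((closed & not_weekend).sum())
print(n_closed_weekdays)
```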

In [None]:
q7 = ...
q7

In [None]:
grader.check("q7")

**Question 8.** How many non-working days were on each weekday (i.e. Mon, Tue, Wed, Thu, Fri) in 2011? Save your result to a dataframe named `q8`, where the index values are integers and the column names are **day** (for the day of the week) and **count** (for the number of non-working days).

**Note:** To earn all the points for this question you must show the code you used to obtain your result.
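If you are unsure how to get a per-category count into a data frame with an integer index and named columns, the following sketch shows one possible pattern on made-up data. The `toy` frame is hypothetical, and other approaches work equally well:

```python
import pandas as pd

# Hypothetical toy data; not the bike data.
toy = pd.DataFrame({"day": ["Mon", "Tue", "Mon", "Fri", "Mon"]})

# size() counts the rows in each group; reset_index(name=...) turns the
# result into a data frame with a default integer index and a named column.
counts = toy.groupby("day").size().reset_index(name="count")
print(counts)
```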

In [None]:
q8 = ...
q8

In [None]:
grader.check("q8")

<!-- BEGIN QUESTION -->

**Question 9.** Use your result from **Question 8** to make a bar chart. 

**Note:** Make sure you give the plot a title and label the axes.

In [None]:
...
...

<!-- END QUESTION -->

The granularity of this data is at the hourly level. However, for some of the analysis we will also want to compute daily statistics. In particular, in the next few questions we will be analyzing the daily number of registered and casual (unregistered) users.

**Question 10.** Construct a dataframe named `daily_count` indexed by `dteday` with the following columns:

* `casual`: total number of casual riders for each day

* `registered`: total number of registered riders for each day

* `workingday`: whether that day is a working day or not (`yes` or `no`)

**Hint**: Consider using the `groupby` and `agg` functions. For `agg`, you can check the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) for examples on applying different aggregations per column. If you use the capability to do different aggregations by column, you can do this task with a single call to `groupby` and `agg`. For the `workingday` column we can take any of the values since we are grouping by the day, thus the value will be the same within each group. It may also be helpful to take a look at the `first` or `last` aggregation functions.
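To illustrate applying a different aggregation to each column, here is a minimal sketch on a made-up data frame. The `toy` frame and its `date`, `sales`, and `open` columns are hypothetical and only stand in for the idea described in the hint:

```python
import pandas as pd

# Hypothetical toy data; not the bike data.
toy = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "sales": [3, 4, 10, 1],
    "open": ["yes", "yes", "no", "no"],
})

# The dictionary maps each column to its own aggregation:
# sum the numeric column, keep the first value of the constant column.
daily = toy.groupby("date").agg({"sales": "sum", "open": "first"})
print(daily)
```

The result is indexed by the grouping column, with one row per group.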

In [None]:
daily_count = ...
daily_count.head()

In [None]:
grader.check("q10")

# Exploring the Distribution of Riders

Let's begin by comparing the distribution of the daily counts of casual and registered riders.

<!-- BEGIN QUESTION -->

**Question 11.** Create a histogram that overlays the distribution of the daily counts of `casual` and `registered` users. 

- The temporal granularity of the records should be daily counts, which you should have after completing **Question 10**.

- After creating the plot, look at it and make sure you understand what the plot is actually telling us (e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000).

**Hint:** `matplotlib.pyplot` has methods to customize the features of a plot. For example, the syntax for the `.xlabel`, `.ylabel`, `.title`, and `.legend` methods can be found in the [documentation](https://matplotlib.org/stable/index.html).

<img src='images/casual_v_registered.png' width = "800px" class = "center"/>
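The labeling functions named in the hint can be sketched on synthetic data as follows. The arrays `a` and `b`, the bin count, and the label text are assumptions chosen for illustration and are unrelated to the bike data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=10, size=500)
b = rng.normal(loc=70, scale=15, size=500)

# alpha < 1 keeps both histograms visible where they overlap;
# label supplies the text that plt.legend() displays.
plt.hist(a, bins=30, alpha=0.5, label="group A")
plt.hist(b, bins=30, alpha=0.5, label="group B")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Overlaid histograms of two synthetic groups")
plt.legend()
```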

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 12.** In the cell below, describe the differences you notice between the histograms for casual and registered riders. Consider each of the following concepts:

- mode. 

- symmetry.

- skewness. 

- spread of the distributions. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Ethical Considerations

City planners, transportation agencies, and policy makers have started to collaborate with bike sharing companies in order to reduce congestion and transportation costs. Recently city planners and policy makers have also been trying to make transportation more equitable. 

Equity in transportation includes: finding ways to make transportation more accessible to people in all neighborhoods within a given region, making the costs of transportation affordable to people across all income levels, and assessing how inclusive transportation systems are over time. Data about city residents may shed light on how to better assess transportation cost and equity impacts on transportation users. 

Keeping this in mind, answer the following two questions on the nature of the data, their possible shortcomings, and ethical considerations associated with how we as data scientists use, manipulate, and share this data.

<!-- BEGIN QUESTION -->

**Question 13.** In addition to the type of rider (`casual` vs. `registered`) and the overall count of each, what other kinds of demographic data would be useful (e.g. identity, neighborhood, monetary expenses, etc.)? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 14.** What is an example of a privacy or consent issue that could occur when accessing the demographic data you mentioned in the previous question?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

[Seaborn](https://seaborn.pydata.org/) is a Python data visualization library based on [`matplotlib`](https://matplotlib.org/).

**Note:** For a brief introduction to the ideas behind the library, you can read the [introductory notes](https://seaborn.pydata.org/introduction.html) or the [paper](https://joss.theoj.org/papers/10.21105/joss.03021).

In the next few questions we will use seaborn.

<!-- BEGIN QUESTION -->

**Question 15.** Use the [`sns.histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) function to create a plot that overlays the distribution of the daily counts of `casual` and `registered` users. The temporal granularity of the records should be daily counts, which you should have after completing **Question 10**.

Include a legend, $x$-label, $y$-label, and title. Read the [seaborn plotting tutorial](https://seaborn.pydata.org/tutorial/distributions.html) if you're not sure how to add these. After creating the plot, look at it and make sure you understand what the plot is actually telling us (e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000).

<img src='images/g1.png' width = "800px"/>

Click [here](https://seaborn.pydata.org/generated/seaborn.histplot.html) to find more information about `sns.histplot`.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 16.** The density plots do not show us how the counts for registered and casual riders vary together. Use [`sns.lmplot`](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to make a scatter plot to investigate the relationship between casual and registered counts. This time, let's use the `bike` dataframe to plot hourly counts instead of daily counts.

The `lmplot` function will also try to draw a linear regression line (just as you saw in Data 8). Color the points in the scatter plot according to whether or not the day is a working day. There are many points in the scatter plot, so make them small to help reduce overplotting. Also make sure to set `fit_reg=True` to generate the linear regression line. You can set the `height` parameter if you want to adjust the size of the `lmplot`. Make sure to include a title.

<img src='images/casual_registered_working_nonworking.png' width="800px" />

**Hints:** 

- Check out this helpful [tutorial on `lmplot`](https://seaborn.pydata.org/tutorial/regression.html).

- You will need to set `x`, `y`, and `hue` and the `scatter_kws`.

- In the `sns.lmplot` function, keep the keyword arguments `fit_reg = True, scatter_kws = {"s": 3}, legend = False`. `scatter_kws = {"s": 3}` controls the size of each point, and `legend=False` allows you to control the placement of the legend using `plt.legend`.
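The keyword arguments from the hints can be tried out on synthetic data first. In the sketch below, the `toy` frame, its `x`, `y`, and `group` columns, and the title text are hypothetical, so treat it as a pattern rather than an answer:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, size=n),
    "group": rng.choice(["a", "b"], size=n),
})
toy["y"] = 2 * toy["x"] + rng.normal(scale=2, size=n)

# hue colors the points by group; scatter_kws shrinks the markers;
# legend=False leaves the legend placement to plt.legend below.
grid = sns.lmplot(data=toy, x="x", y="y", hue="group",
                  fit_reg=True, scatter_kws={"s": 3}, legend=False, height=4)
plt.title("Synthetic y versus x by group")
plt.legend(title="group")
```

`sns.lmplot` returns a `FacetGrid`, so the underlying axes are also available as `grid.ax` if you prefer to set labels there.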


In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 17.** What does this scatterplot seem to reveal about the relationship (if any) between casual and registered riders and whether or not the day is on the weekend?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 18.** Let's examine the behavior of riders by plotting the average number of riders for each hour of the day over the **entire dataset**, stratified by rider type. Your plot should look like the following:

<img src="images/diurnal_bikes.png" width = "800px"/>

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 19.** What do you observe about the bike use of the different categories of riders from the plot? When does each group have the most use? How does the bike use change throughout the day?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 20.** What can you say about the meaning of the peaks in the registered riders' distribution?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 21.** Create a box plot to compare the total number of riders for each season.

<img src="images/boxplot.png" width = "800px"/>

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 22.** Compare the distribution of riders across the seasons. Is this what you would expect? Why or why not?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 23.** Create a density plot to compare the total number of riders and the temperature.

<img src="images/kde.png" width = "800px"/>

In [None]:
sns.scatterplot(data=bike, x='temp', y='cnt', edgecolor='white');

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 24.** What information can you infer from your visualization in the previous question?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)