In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

# Lab 06: Plotting

Welcome to Lab 06! In this lab we will practice plotting with `matplotlib` and `pandas`. [Matplotlib](https://matplotlib.org/stable/tutorials/index) is a comprehensive library for creating static, animated, and interactive visualizations in Python. [Pandas](https://pandas.pydata.org/docs/user_guide/visualization.html) has built-in support for data visualization through charts with `matplotlib`.

To receive credit for a lab, answer all questions correctly and submit before the deadline.

**Due Date:**

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

Run the cell below.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 5)
plt.rcParams['figure.dpi'] = 100

# 1. Plotting in Python

Let's explore the `titanic` dataset.

In [None]:
titanic = pd.read_csv("data/titanic.csv")
titanic.head()

Before we start let's convert the `Embarked`, `Sex`, and `Survived` columns to type category.

**Question 1.** Convert the `Embarked`, `Sex`, and `Survived` columns to type category. Then rename the categories as follows:

* **Embarked:** C: Cherbourg, Q: Queenstown, S: Southampton
* **Sex:** F: Female, M: Male
* **Survived:** D: Died, S: Survived
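
The two steps can be sketched on a small made-up Series. This is only an illustration under assumed data (the `sizes` values and labels are hypothetical and unrelated to `titanic`): `astype("category")` changes the type, and `.cat.rename_categories()` maps old labels to new ones.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the lab's dataset.
sizes = pd.Series(["S", "L", "M", "S", "L"])

# Step 1: convert the column to the category dtype.
sizes = sizes.astype("category")

# Step 2: rename the categories with an old-label -> new-label mapping.
sizes = sizes.cat.rename_categories({"S": "Small", "M": "Medium", "L": "Large"})

print(sizes.cat.categories.tolist())
```

The values in the Series are relabeled along with the categories, so no separate replacement step is needed.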

In [None]:
...

In [None]:
grader.check("q1")

## Categorical Variables

### Bar Plot

A bar plot displays the counts of the labels from a categorical column. This can be done using `pandas`:

```
df.plot.bar()
```

or using `matplotlib`:

```
plt.bar()
```
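
As a rough side-by-side sketch of the two approaches, the block below draws the same made-up counts both ways. The `counts` Series and its labels are assumptions for illustration, and the sketch uses `plt.subplots` to place the two charts next to each other, which goes slightly beyond the two calls named above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts for a made-up categorical column.
counts = pd.Series({"Red": 5, "Green": 3, "Blue": 8})

fig, (ax_left, ax_right) = plt.subplots(1, 2)

# matplotlib: pass the categories as x and the counts as height.
bars = ax_left.bar(x=list(counts.index), height=counts.values)
ax_left.set_title("matplotlib")

# pandas: the Series index supplies the categories automatically.
counts.plot.bar(ax=ax_right, rot=0)
ax_right.set_title("pandas")
```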

In [None]:
tbl = titanic.Embarked.value_counts()
tbl

<!-- BEGIN QUESTION -->

**Question 2.** Make a bar chart using `plt.bar()` for the counts of the `Embarked` column.

**Note:** 

- `x` represents the categories.

- `height` represents the corresponding heights.

In [None]:
plt.bar(x = ..., height = ...);

<!-- END QUESTION -->

We can make the same bar plot using `pandas`.

In [None]:
tbl.plot.bar(rot = 0);

<!-- BEGIN QUESTION -->

**Question 3.** Add a title and axes labels to the plot from **Question 2**.

<img src="images/g1.png" width="400" height="100">

In [None]:
plt.bar(x = ["Southampton", "Cherbourg", "Queenstown"], height = tbl)
plt.title("...")
plt.xlabel("\n...")
plt.ylabel("...");

<!-- END QUESTION -->

We can produce a stacked bar chart by using a two-way contingency table.

<!-- BEGIN QUESTION -->

**Question 4.** Make a two-way contingency table that looks like

|**Survived**   |**Died**|**Survived**|
|---------------|--------|------------|
|**Embarked**   |        |            |
|**Cherbourg**  | 75     |          93|
|**Queenstown** | 47     |          30|
|**Southampton**| 427    |         217|

In [None]:
two_way_tbl = pd.crosstab(titanic.Embarked, titanic.Survived)
two_way_tbl

In [None]:
grader.check("q4")

<!-- END QUESTION -->



In [None]:
type(two_way_tbl)

In [None]:
print(titanic.Embarked.cat.categories)

In [None]:
two_way_tbl.loc[:, "Survived"]

### Stacked Bar Plot

**Example 1.** Let's use the `two_way_tbl` dataframe to create a stacked bar plot.

In [None]:
p1 = plt.bar(x = ["Cherbourg", "Queenstown", "Southampton"],
             height = two_way_tbl.loc[:, "Died"], label = 'Died')

p2 = plt.bar(x = ["Cherbourg", "Queenstown", "Southampton"],
             height = two_way_tbl.loc[:, "Survived"],
             bottom = two_way_tbl.loc[:, "Died"], label = 'Survived')

plt.title("Passenger Counts for Each Port")
plt.xlabel("\nPort Embarked")
plt.ylabel("Count")
plt.legend();

Let's inspect the elements of this plot.

In [None]:
type(p1), type(p2)

In [None]:
len(p1)

In [None]:
type(p1[0]), type(p1[1]), type(p1[2])

**Example 2.** We can produce the same plot using `two_way_tbl.plot.bar()` with `stacked = True`.

In [None]:
two_way_tbl.plot.bar(stacked = True, rot = 0);

**Example 3.** The `.plot.bar()` method returns a `matplotlib` `Axes` object (older versions of `matplotlib` display its type as `AxesSubplot`). We can set the labels on that object.

In [None]:
ax = two_way_tbl.plot.bar(stacked = True, rot = 0)
ax.set_title("Passenger Counts for Each Port")
ax.set_xlabel("\nPort Embarked")
ax.set_ylabel("Count")
ax.legend();
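
To check what `.plot.bar()` returns in your own environment, a small sketch like the one below may help. The `toy` dataframe is made up for illustration; the check relies only on the returned object being an instance of `matplotlib.axes.Axes`, whatever name your installed version prints.

```python
import pandas as pd
import matplotlib.axes
import matplotlib.pyplot as plt

# Hypothetical two-column table, standing in for a contingency table.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

ax = toy.plot.bar(stacked=True, rot=0)

# The printed class name can differ between matplotlib versions.
print(type(ax))
print(isinstance(ax, matplotlib.axes.Axes))

# Because it is an Axes, the usual setter methods are available.
ax.set_title("Toy stacked bars")
```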

### Side by Side Bar Plots

**Example 4.** We can make side by side bar plots with `.plot.bar()`.

In [None]:
two_way_tbl.plot.bar(rot = 0);

<!-- BEGIN QUESTION -->

**Question 5.** Add a title and axes labels to your plot from **Example 4**.

In [None]:
ax = two_way_tbl.plot.bar(rot = 0)
ax.set_title("...")
ax.set_xlabel("\n...")
ax.set_ylabel("...")
ax.legend(...);

<!-- END QUESTION -->

## Numerical Variables

### Histograms

A histogram is an approximate representation of the distribution (the frequency and pattern) of numerical data.
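
To see what a histogram actually counts, here is a minimal sketch using `np.histogram` on made-up numbers (the `values` array and the choice of three bins are assumptions for illustration). It returns the count in each bin and the bin edges, which is the information a histogram draws as bars.

```python
import numpy as np

# Hypothetical measurements, made up for illustration.
values = np.array([1, 2, 2, 3, 5, 7, 8, 8, 9])

# Three equal-width bins covering 0 to 9: [0, 3), [3, 6), [6, 9].
counts, edges = np.histogram(values, bins=3, range=(0, 9))

print(counts)  # [3 2 4]
print(edges)   # [0. 3. 6. 9.]
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both.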

Now let's look at the distribution of the ages of the passengers on the Titanic.

**Example 5.** Use `.hist()` to plot a histogram of the ages of the passengers on the Titanic.

In [None]:
titanic.Age.hist();

**Example 6.** Customize the plot from **Example 5** by removing the gridlines and adding axes labels and a title.

In [None]:
ax = titanic.Age.hist()
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges")
ax.set_ylabel("Count");

**Example 7.** We can also customize the bars.

In [None]:
ax = titanic.Age.hist(edgecolor = "black")
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges")
ax.set_ylabel("Count");

**Example 8.** We can also customize the number of bins and the bin locations.

In [None]:
ax = titanic.Age.hist(color = "red", edgecolor = "black", bins = 6)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges")
ax.set_ylabel("Count");

In [None]:
bins = [0, 10, 20, 30, 40, 50, 60, 70]
ax = titanic.Age.hist(color = "lightblue", edgecolor = "blue", bins = bins)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges")
ax.set_ylabel("Count");

Now let's compare the distribution of ages across groups of passengers.

**Example 9.** We can also compare across distributions using `by = `.

In [None]:
titanic.Age.hist(edgecolor = "black", bins = bins, by = titanic.Survived, rot = 0)
plt.xlabel("Age");

**Example 10.** We can also make overlaid histograms.

In [None]:
male = titanic.loc[titanic.Sex == 'Male'].Fare
female = titanic.loc[titanic.Sex == 'Female'].Fare
num_bins = 10
ax = male.hist(bins = num_bins, alpha = 0.5, label = 'Male', edgecolor = 'black')
ax = female.hist(bins = num_bins, alpha = 0.5, label = 'Female', edgecolor = 'black')
ax.grid(False)
ax.set_xlabel("\nFare")
ax.set_ylabel("Count")
ax.legend();

In [None]:
age_died = titanic.loc[titanic.Survived == "Died", "Age"]
age_survived = titanic.loc[titanic.Survived == "Survived", "Age"]
ax = age_died.hist(alpha = 0.5, label = 'Died', edgecolor = 'black')
ax = age_survived.hist(alpha = 0.5, label = 'Survived', edgecolor = 'black')
ax.grid(False)
ax.set_xlabel("\nAge")
ax.set_ylabel("Count")
ax.legend();

### Scatter Plots

A scatter plot is used to visualize the relationship between two numerical variables, such as whether it is linear, curved, or absent.

**Example 11.** Make a scatter plot using `Age` to predict `Fare`.

In [None]:
titanic.plot.scatter(x = "Age", y = "Fare");

If we want we can customize the marker type and size.

In [None]:
titanic.plot.scatter(x = "Age", y = "Fare", c = "red", s = 40, marker = "^");

**Example 12.** We can modify based on a label from a categorical feature.

In [None]:
died = titanic.loc[titanic.Survived == "Died"]
survived = titanic.loc[titanic.Survived == "Survived"]
plt.scatter(x = died.Age, y = died.Fare, alpha = 0.5, c = "red", label = "Died")
plt.scatter(x = survived.Age, y = survived.Fare, alpha = 0.25, c = "blue", label = "Survived")
plt.title("Scatter Plot of Age vs. Fare by Survival Status")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.legend();

### Box Plots

A box plot displays the five-number summary of a set of data.

- Min, Q1, Median, Q3, and Max

- Shows possible outliers
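
The five-number summary can be computed directly, as in the sketch below on made-up data (the `data` array is hypothetical). The outlier check assumes the common 1.5 × IQR rule, which is also the default whisker setting in `matplotlib` box plots.

```python
import numpy as np

# Hypothetical data with one unusually large value.
data = np.array([2, 4, 4, 5, 7, 9, 10, 12, 30])

# Min, Q1, Median, Q3, and Max as the 0th, 25th, 50th, 75th, and 100th percentiles.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

# Assumed rule: points beyond 1.5 * IQR from the quartiles are possible outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```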

**Example 13.** A boxplot of the `Age` variable.

In [None]:
titanic.Age.plot.box();

A boxplot grouped by `Survived`.

In [None]:
titanic.boxplot(column = "Age", by = "Survived")
plt.grid(False)
plt.suptitle("")
plt.title("Box Plot of Age Grouped by Survival Status");

For the remainder of this notebook you will create your own plots using data from the `penguins` dataframe.

**Note:** Be sure to provide a title and label your axes.

Run the cell below.

In [None]:
penguins = pd.read_csv('data/penguins.csv')
penguins.head()

<!-- BEGIN QUESTION -->

**Question 6.** Choose a categorical column from the `penguins` dataframe. Then use the data to create a bar plot.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 7.** Choose a numerical column from the `penguins` dataframe. Then use the data to create a histogram.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 8.** Choose a numerical column from the `penguins` dataframe. Then use the data to create a box plot.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 9.** Choose two numerical columns from the `penguins` dataframe. Then use the data to create a scatter plot.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 10.** Choose two different numerical columns from the `penguins` dataframe. Then use the data to create a scatter plot.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 11.** Use data from the `penguins` dataframe to create a stacked bar plot.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 12.** Use data from the `penguins` dataframe to create a side by side bar plot.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 13.** Use data from the `penguins` dataframe to create an overlaid histogram.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 14.** Use data from the `penguins` dataframe to create a scatter plot that has different colored points for different labels of a categorical variable, like we did in **Example 12**.

**Note:** Be sure to provide a title and label your axes.

In [None]:
penguins.columns

In [None]:
...

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by finding it in the file browser on the left side of the screen, right-clicking it, and selecting **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)