In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc295_003_005_a8.ipynb")

# Assignment 08

## Due: See Date in Moodle

In this assignment we will review our Python skills by working together to explore a data set through the use of numerical summaries and visualizations.

I would like you to attempt each question of the assignment. To get **full credit** on this assignment, you must complete each question.

## This Week's Assignment

Today we'll create numerical summaries and visualizations in Python.

## Loading Data

Before we begin we need to load the necessary modules and data sets.

Run the next two code cells below.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
titanic = sns.load_dataset('titanic')

## Numerical Summaries

First, let's look at the `titanic` data frame.

In [None]:
titanic.head()

We already know about `.describe()`. This method is used for numerical data. What about categorical data?

In [None]:
titanic.describe()

What data type is in the embarked column?

In [None]:
titanic.embarked

In [None]:
type(titanic.embarked[0])

### Categoricals in `pandas` 

What is the `category` type in Python? The [`pandas` categorical data type](https://pandas.pydata.org/docs/user_guide/categorical.html) is similar to R’s `factor`. Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (called categories in pandas and levels in R). Examples are gender, social class, blood type, country affiliation, observation time, or ratings on a Likert scale.
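
As a small illustration of the idea, the sketch below converts a made-up column of blood types to the `category` dtype and looks at what pandas stores. The `blood` Series and its values are assumptions chosen only for this example; they are not part of the `titanic` data.

```python
import pandas as pd

# Hypothetical toy data; not from the titanic data set.
blood = pd.Series(["A", "O", "B", "O", "AB", "O"])

# Convert the column of strings to the categorical dtype.
blood_cat = blood.astype("category")
print(blood_cat.dtype)

# The distinct categories, sorted alphabetically for strings.
print(list(blood_cat.cat.categories))

# Each value is stored as an integer code that points into the categories.
print(list(blood_cat.cat.codes))
```

The `.cat` accessor used here is the same one used for `rename_categories` later in this assignment.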

<!-- BEGIN QUESTION -->

**Question 1.** Let’s change the `embarked` column to type `category`.

In [None]:
titanic.embarked = ...
titanic.embarked[0:6] 

<!-- END QUESTION -->

**Question 2.** Let's rename the categories to something more descriptive.

In [None]:
titanic.embarked = titanic.embarked.cat.rename_categories({
    '...' : 'Cherbourg',
    '...' : 'Queenstown', 
    '...' : 'Southampton'
})
titanic.embarked[0:6] 

Let's look at the data type of the `survived` column.

In [None]:
type(titanic.survived[0])

Let's change the `survived` column to type category.

In [None]:
titanic.survived = titanic.survived.astype("str").astype("category")
titanic.survived[0:2]

**Question 3.** Let's rename the categories to something more descriptive.

In [None]:
titanic.survived = titanic.survived.cat.rename_categories({'0' : '...', '1' : '...'})
titanic.survived[0:2] 

<!-- BEGIN QUESTION -->

**Question 4.** Use the `.value_counts()` method to make one-way frequency tables for the `embarked`, `sex`, and `survived` columns.

In [None]:
print("Embarked")
print(titanic['embarked']...)
print("\n")
print("Sex")
print(titanic['sex']...)
print("\n")
print("Survived")
print(titanic['survived']...)

<!-- END QUESTION -->

## Contingency Tables

### Two-Way

**Example 1.** The `pd.crosstab()` function will make a two-way contingency table that gets returned as a dataframe.

In [None]:
pd.crosstab(titanic.embarked, titanic.survived)

**Example 2.**  We can add marginal totals.

In [None]:
pd.crosstab(titanic.embarked, titanic.survived, margins=True)
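
To see what the marginal totals contain, here is a minimal sketch on a made-up data frame. The `toy` data frame and its `pet` and `home` columns are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: six made-up observations of two categorical variables.
toy = pd.DataFrame({
    "pet":  ["cat", "cat", "dog", "dog", "dog", "cat"],
    "home": ["city", "farm", "city", "city", "farm", "city"],
})

# Two-way table of counts; margins=True adds a row and a column labelled "All".
table = pd.crosstab(toy["pet"], toy["home"], margins=True)
print(table)

# The "All" column holds the one-way counts of the row variable.
print(table.loc["cat", "All"], table.loc["dog", "All"])

# The bottom-right cell is the total number of observations.
print(table.loc["All", "All"])
```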

## Numerical Summaries Across Groups

- The pandas `.groupby()` method splits a data frame into groups based on the values in one or more columns.

- A function, such as a mean or a count, can then be applied to each group, which makes `.groupby()` an efficient way to aggregate data by category.
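
The two points above can be sketched on a tiny made-up data frame. The `toy` data frame and its `team` and `score` columns are assumptions used only for this illustration.

```python
import pandas as pd

# Hypothetical toy data; not from the titanic data set.
toy = pd.DataFrame({
    "team":  ["a", "a", "b", "b", "b"],
    "score": [10, 20, 1, 2, 3],
})

# Split: one group of rows per distinct value of "team".
grouped = toy.groupby("team")
for name, part in grouped:
    print(name, len(part))

# Apply and combine: one mean per group, returned as a Series indexed by team.
means = grouped["score"].mean()
print(means)

# Several summaries at once, returned as a DataFrame.
summary = grouped["score"].agg(["count", "mean", "max"])
print(summary)
```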

**Example 3.** Use the `.groupby` function on the `embarked` column.

In [None]:
titanic.groupby('embarked')

**Example 4.** Use the `.groups` attribute and its `.keys()` method to show all the group labels for the `embarked` column.

In [None]:
titanic.groupby('embarked').groups.keys()

<!-- BEGIN QUESTION -->

**Question 5.** Use `.groupby` together with the `.get_group` method to get the `Queenstown` group.

In [None]:
titanic.groupby('embarked').get_group('...')

<!-- END QUESTION -->

We can group on multiple columns. 

Run the cell below.

In [None]:
titanic.groupby(["survived", "sex"])

In [None]:
titanic.groupby(["survived", "sex"]).groups.keys()

In [None]:
titanic.groupby(["survived", "sex"]).get_group(('Died', 'male')).head(5)

<!-- BEGIN QUESTION -->

**Question 6.** Find the average age of the passengers that survived and the passengers that died on the Titanic. 

**Note:** To earn all the points for this question you must use `.groupby`.

In [None]:
titanic.groupby('survived')['...'].mean()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 7.** Find the average age of passengers from each port based on survival status. 

**Note:** To earn all the points for this question you must use `.groupby`.

In [None]:
titanic.groupby(['survived', 'embarked'])['age'].mean()

<!-- END QUESTION -->

## Plotting in Python

We will use plotting features from `pandas` and from `matplotlib` to create our visualizations. Run the cell below to import the libraries we need.

**Note:** To learn more about visualizations in `matplotlib` click [here](https://matplotlib.org/) and for documentation on creating visualizations using `pandas` click [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 5)
plt.rcParams['figure.dpi'] = 100

## Categorical Variables

### Bar Plot

A bar plot displays the counts of the labels from a categorical column. This can be done in `pandas` by calling the following on a `DataFrame` or a `Series`

```
.plot.bar()
```

or using `matplotlib`

```
plt.bar()
```

In [None]:
tbl = titanic.embarked.value_counts()
tbl

<!-- BEGIN QUESTION -->

**Question 8.** Make a bar chart using `plt.bar()` for the counts of the `embarked` column.

**Note:** 

- `x` represents the categories.

- `height` represents the corresponding heights.

In [None]:
plt.bar(x=..., height=...);

<!-- END QUESTION -->

We can make the same bar chart using `pandas`.

In [None]:
tbl.plot.bar(rot=0);

**Example 5.** Add a title and axes labels to the plot from **Question 8**.

<img src="images/g1.png" width="600" height="400">

In [None]:
plt.bar(x=tbl.index, height=tbl)
plt.title("Passenger Counts for Each Port")
plt.xlabel("\n Port Embarked")
plt.ylabel("Count");

<!-- BEGIN QUESTION -->

**Question 9.** Use the `survived` column from the `titanic` dataset to create a bar chart. Be sure to add a title and label the axes.

**Note:** You can use either `matplotlib` or `pandas`.

In [None]:
plt.bar(x=..., height=...)
plt.title("...")
plt.xlabel("\n ...")
plt.ylabel("...");

<!-- END QUESTION -->

## Numerical Variables

### Histograms

A histogram is an approximate representation of the distribution (the frequency and pattern) of numerical data. Let's look at the distribution of the ages of the passengers on the Titanic.

**Example 6.** Use `.hist()` to plot a histogram of the ages of the passengers on the Titanic.

In [None]:
titanic.age.hist();

**Example 7.** Customize the plot from **Example 6** by removing the gridlines, adding axis labels, drawing an edge color around the bars, and adding a title.

**Note:** In the plot below we use the Axes object to customize the plot. For more information on how this works [read this article](https://towardsdatascience.com/what-are-the-plt-and-ax-in-matplotlib-exactly-d2cf4bf164a9) and this [tutorial guide](https://realpython.com/python-matplotlib-guide/).

In [None]:
ax = titanic.age.hist(edgecolor = "white")
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\n Age")
ax.set_ylabel("Count");
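
The `ax` returned by `.hist()` in the cell above is a `matplotlib` Axes object. As a rough sketch of the same idea without `pandas`, the code below creates the Figure and Axes explicitly with `plt.subplots()` and then customizes the plot through the Axes methods. The `values` list is made up for this illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical toy values; not from the titanic data set.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# plt.subplots() returns the Figure (the whole canvas) and one Axes (the plot area).
fig, ax = plt.subplots()

# Draw on the Axes, then customize it through its own methods.
ax.hist(values, bins=4, edgecolor="white")
ax.grid(False)
ax.set_title("Toy Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Count");
```

Working through `ax` rather than `plt` makes it explicit which plot is being modified, which helps once a figure contains more than one plot.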

**Example 8.** We can change the $y$-axis from counts to density, so that the areas of the bars sum to 1.

In [None]:
ax = titanic.age.hist(edgecolor='white', density=True)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges");
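
To see what `density=True` computes, the sketch below uses `np.histogram`, which applies the same normalization, on a made-up set of ages. The `ages` array and the bin `edges` are assumptions chosen only for this illustration.

```python
import numpy as np

# Hypothetical toy ages; not from the titanic data set.
ages = np.array([2, 8, 15, 22, 25, 31, 38, 44, 52, 67])
edges = [0, 20, 40, 60, 80]

# Raw counts per bin.
counts, _ = np.histogram(ages, bins=edges)

# Density per bin: count / (total count * bin width).
density, _ = np.histogram(ages, bins=edges, density=True)
widths = np.diff(edges)

print(counts)
print(density)

# Each bar's area is density * width, and the areas add up to 1.
print((density * widths).sum())
```

Because the heights are divided by the bin widths, a density is not a percentage; it is the bar areas, not the bar heights, that sum to 1.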

**Example 9.** We can also customize the number of bins.

In [None]:
ax = titanic.age.hist(edgecolor='white', bins=7, density=True)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges");

**Example 10.** We can also customize the bin locations.

In [None]:
bins = [0, 10, 20, 30, 40, 50, 60, 70]
ax = titanic.age.hist(edgecolor='white', bins = bins, density=True)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\nAges");

**Example 10.** We can also compare across distributions using `by = `.

In [None]:
ax = titanic.age.hist(edgecolor="white", bins=bins, by=titanic.survived, rot=0)
ax[0].set_xlabel("\n Age")
ax[1].set_xlabel("\n Age");

**Example 11.** We can also compare across distributions using overlaid histograms.

In [None]:
m = titanic[titanic.sex == 'male']
f = titanic[titanic.sex == 'female']

plt.hist(m.age, alpha=0.75, edgecolor='white', bins=bins, label='Male')
plt.hist(f.age, alpha=0.75, edgecolor='white', bins=bins, label='Female')
  
plt.legend(loc='upper right')
plt.title("Age Distribution of Male and Female Passengers")
plt.xlabel("\n Age")
plt.ylabel("Count");

### Scatter Plots

A scatter plot is used to visualize the relationship between two numerical variables.

**Example 12.** Make a scatter plot to visualize the association between `age` and `fare`.

In [None]:
titanic.plot.scatter(x="age", y="fare");

If we want we can customize the marker type, color and size.

In [None]:
# `s` sets the marker size (area in points squared)
titanic.plot.scatter(x="age", y="fare", marker="^", color="red", edgecolor="white", s=300);

**Example 13.** We can modify the color based on a label from a categorical feature.

In [None]:
# plt.scatter draws on the current Axes, so both calls land on the same plot
plt.scatter(x=titanic[titanic.survived=="Died"]['age'], 
            y=titanic[titanic.survived=="Died"]['fare'], 
            edgecolor="white",
            color="red", 
            label="Died")

plt.scatter(x=titanic[titanic.survived=="Survived"]['age'],
            y=titanic[titanic.survived=="Survived"]['fare'], 
            edgecolor="white",
            color="blue", 
            label="Survived")

plt.title("Scatter Plot of Age vs. Fare by Survival Status")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.legend();

### Box Plots

A box plot displays the five-number summary of a set of data.

- Min, Q1, Median, Q3, and Max

- Shows possible outliers
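
As a minimal sketch of the two points above, the code below computes a five-number summary with `np.percentile` and flags possible outliers with the 1.5 × IQR rule, which matches the default whisker setting (`whis=1.5`) in `matplotlib` box plots. The `values` array is made up for this illustration.

```python
import numpy as np

# Hypothetical toy data; not from the titanic data set.
values = np.array([1, 3, 4, 5, 6, 7, 8, 9, 30])

# Five-number summary: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)

# Interquartile range and the fences of the 1.5 * IQR rule.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as possible outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```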


<img src="https://lsc.studysixsigma.com/wp-content/uploads/sites/6/2015/12/1435.png" width="800" height="400">

**Source:** https://www.leansigmacorporation.com/box-plot-with-minitab/ 

**Author:** Michael Parker

**Example 14.** A boxplot of the `age` variable.

In [None]:
titanic.age.plot.box();

A boxplot grouped by survival status.

In [None]:
titanic.boxplot(column="age", by="survived")
plt.grid(False)
plt.suptitle("")
plt.title("Box Plot of Age Grouped by Survival Status");

## Line Chart

A line chart is similar to a scatter plot except that the measurement points are ordered and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time (i.e. a time series), thus the line is often drawn chronologically.

**Example 15.** A line chart that shows the number of airline passengers per year.

In [None]:
flights = sns.load_dataset('flights')
flights.head()

In [None]:
flights.plot(x="year", y="passengers", legend=False);

In [None]:
df = flights.groupby('year')['passengers'].sum().to_frame().reset_index()
df

In [None]:
df.plot(x='year', y='passengers', legend=False);

In [None]:
tbl = flights.groupby('year')['passengers'].sum()
tbl

In [None]:
plt.plot(tbl.index, tbl)
# Put one tick at each year, taken from the same table that is plotted
plt.xticks(tbl.index);

Now it's your turn to make plots using the penguins data.

In [None]:
penguins = sns.load_dataset('penguins')
penguins.head()

<!-- BEGIN QUESTION -->

**Question 9.**  Choose a categorical column from the `penguins` dataframe. Then use the data to create a bar chart.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 10.**  Choose a numerical column from the `penguins` dataframe. Then use the data to create a histogram.

**Note:** Be sure to provide a title and label your axes.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 11.** Choose two numerical columns from the `penguins` dataframe. Then use the data to create a scatter plot.

In [None]:
...

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. You'll submit this .zip file to Gradescope, through the assignment in Moodle, for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)