# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), you can click the play button (▶|) on the ribbon underneath the name of the notebook, or hold down `Shift` + `Return`.

Before you begin, run the code cell below.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc201_001_a10.ipynb")

**Name:** 

**Section:** 

**Date:**

## This Week's Assignment

In this week's assignment, you will:

- summarize data in tables.

- build visualizations for categorical and numerical data.

Let's get started!

## Exploratory Data Analysis

We will be analyzing data from the `titanic.csv` file. 

<!-- BEGIN QUESTION -->

**Question 1.** Import `pandas` and `NumPy` using the appropriate aliasing.

In [None]:
# Import the pandas module as pd
import ... as ...

# Import the numpy module as np
import ... as ...

<!-- END QUESTION -->

**Question 2.** Load the `titanic.csv` file into a `pandas` `DataFrame` named `titanic`.

In [None]:
titanic = ...

In [None]:
grader.check("q2")

Now, let's explore the `titanic` data frame.

We already know about `.describe()`. This method is used for numerical data. What about categorical data?

What data type is in the `embarked` column?

## Categorical Variables

### Categorical in `pandas` 

What is the category type in Python? The [`pandas` categorical data type](https://pandas.pydata.org/docs/user_guide/categorical.html) is similar to R’s `factor`. Categoricals are a `pandas` data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are sex, social class, blood type, country affiliation, observation time, or rating via Likert scales.
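As a minimal sketch of the idea, the cell below converts a small made-up column of strings to the `category` type and then renames its categories. The `sizes` data and its labels are hypothetical and are not taken from `titanic.csv`; `astype("category")` and `.cat.rename_categories()` are the `pandas` calls being illustrated.

```python
import pandas as pd

# Hypothetical survey responses (not from titanic.csv)
sizes = pd.Series(["S", "M", "L", "M", "S"])

# Convert plain strings to the category dtype
sizes = sizes.astype("category")
print(sizes.cat.categories)  # the distinct labels stored by the column

# Rename the categories with a mapping from old label to new label
sizes = sizes.cat.rename_categories({"S": "Small", "M": "Medium", "L": "Large"})
print(sizes.tolist())
```

The values in the column change along with the category names, because each value is stored as a reference to one of the categories.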

Let’s change the `embarked` column to type `category`.

In [None]:
titanic.embarked = ...
titanic.embarked[0:6] 

Let's rename the categories to something more descriptive.

In [None]:
titanic.embarked = ...
titanic.embarked[0:6] 

What is the data type of the `survived` column?

In [None]:
type(titanic.survived[0])

Let's change the `survived` column to type category.

In [None]:
titanic.survived = ...
titanic.survived[0:2]

Let's rename the categories to something more descriptive.

In [None]:
titanic.survived = ...
titanic.survived[0:2] 

<!-- BEGIN QUESTION -->

**Question 3.** Use the `.value_counts()` method to make contingency tables for the `embarked`, `sex`, and `survived` columns.

In [None]:
print("Embarked")
print(...)
print("\n")
print("Sex")
print(...)
print("\n")
print("Survived")
print(...)

<!-- END QUESTION -->

### Two-Way Contingency Tables

A two-way contingency table, also known as a contingency table or a two-dimensional cross-tabulation, is a statistical table used to display the frequency distribution of two categorical variables. It provides a way to examine the relationship between two categorical variables by showing how the values of one variable are distributed across the categories of the other variable.
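One way to build such a table in `pandas` is `pd.crosstab`, sketched here on a small made-up data frame. The `drink` and `size` columns are hypothetical and are only meant to show the shape of the result.

```python
import pandas as pd

# Hypothetical orders (not from titanic.csv)
orders = pd.DataFrame({
    "drink": ["tea", "coffee", "tea", "coffee", "coffee", "tea"],
    "size": ["small", "large", "small", "small", "large", "large"],
})

# Rows are the categories of the first variable, columns those of the second
table = pd.crosstab(orders["drink"], orders["size"])
print(table)

# margins=True adds row and column totals labeled "All"
with_totals = pd.crosstab(orders["drink"], orders["size"], margins=True)
print(with_totals)
```

Each cell counts the rows that fall into that combination of categories, so the cell counts add up to the number of rows in the data frame.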

<!-- BEGIN QUESTION -->

**Question 4.** Make a two-way contingency table that shows the number of passengers who died and survived across the different ports.

In [None]:
...

<!-- END QUESTION -->

## Visualizations

We will use plotting features from `pandas` and from `matplotlib` to create our visualizations. Run the cell below to import the libraries we need.

**Note:** To learn more about visualizations in `matplotlib` click [here](https://matplotlib.org/) and for documentation on creating visualizations using `pandas` click [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

### Bar Plot

A bar plot displays the counts of the labels from a categorical column. This can be done using `pandas`

```
df.plot.bar()
```

or using `matplotlib`

```
plt.bar()
```

Let's import the `matplotlib.pyplot` library.

In [None]:
# Import the matplotlib.pyplot library
import ... as ...

# Magic command
%matplotlib inline

# Set default parameters for all visualizations
plt.rcParams['figure.figsize']=(12, 5)
plt.rcParams['figure.dpi']=100
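To compare the two bar-plot approaches named above, here is a small sketch on made-up data. The `pets` series is hypothetical, and drawing both versions side by side with `plt.subplots` is just one possible layout, not a requirement of the assignment.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data (not from titanic.csv)
pets = pd.Series(["cat", "dog", "dog", "fish", "dog", "cat"])
counts = pets.value_counts()  # one count per label, largest first

fig, (left, right) = plt.subplots(1, 2)

# matplotlib: pass the categories and their heights explicitly
bars = left.bar(x=counts.index, height=counts.values)
left.set_title("matplotlib")

# pandas: plot the table of counts directly onto a chosen Axes
counts.plot.bar(ax=right, rot=0, title="pandas")
```

Both panels show the same three bars; the `pandas` version reads the labels and heights from the index and values of the counts table.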

In [None]:
# Display titanic dataframe information
titanic.info()

In [None]:
# Create a table
tbl = titanic.embarked.value_counts()
tbl

In [None]:
type(tbl)

In [None]:
tbl.index

<!-- BEGIN QUESTION -->

**Question 5.** Make a bar chart using `plt.bar()` for the counts of the `embarked` column.

**Notes:** 

- `x` represents the categories.

- `height` represents the corresponding heights.

In [None]:
plt.bar(x=..., height=...);

<!-- END QUESTION -->

**Example 1.** Add a title and axes labels to the plot from **Question 5**.

<img src="images/g1.png" width="800" height="100">

We can make the same bar chart using `pandas`.

In [None]:
tbl.plot.bar(rot=0);

<!-- BEGIN QUESTION -->

**Question 6.** Use the `survived` column from the `titanic` dataset to create a bar chart. Be sure to add axis labels and a title.

**Note:** You can use either `matplotlib` or `pandas`.

In [None]:
...
plt.title("Passenger Counts by Survival Status")
plt.xlabel("\n Survival")
plt.ylabel("Count");

<!-- END QUESTION -->

## Numerical Variables

### Histograms

A histogram is an approximate representation of the distribution (the frequency and pattern) of numerical data. Let's look at the distribution of the ages of the passengers on the Titanic.

**Example 2.** Use `.hist()` to plot a histogram of the ages of the passengers on the Titanic.

In [None]:
...

**Example 3.** Customize the plot from **Example 2** by removing the gridlines, adding axes labels, placing an edge color between the bars, and adding a title.

**Note:** In the plot below we use the Axes object to customize the plot. For more information on how this works [read this article](https://towardsdatascience.com/what-are-the-plt-and-ax-in-matplotlib-exactly-d2cf4bf164a9) and this [tutorial guide](https://realpython.com/python-matplotlib-guide/).

In [None]:
ax = titanic.age.hist(edgecolor="white")
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\n Age")
ax.set_ylabel("Count");

**Example 4.** We can change the $y$-axis from counts to a density, so that the areas of the bars sum to 1.

In [None]:
ax = titanic.age.hist(edgecolor='white', density=True)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\n Age");
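To see what `density=True` does to the bar heights, the sketch below reproduces the calculation with `np.histogram`, which uses the same normalization as far as we are aware. The `ages` array is made up for illustration and is not taken from `titanic.csv`.

```python
import numpy as np

# Hypothetical ages (not from titanic.csv)
ages = np.array([2, 8, 15, 22, 24, 29, 31, 35, 38, 44, 47, 52, 58, 63])
bins = [0, 10, 20, 30, 40, 50, 60, 70]

counts, edges = np.histogram(ages, bins=bins)
density, _ = np.histogram(ages, bins=bins, density=True)

# Each density height is count / (total count * bin width)
widths = np.diff(edges)
manual = counts / (counts.sum() * widths)

# The bar areas (height * width) therefore sum to 1
total_area = (density * widths).sum()
print(counts)
print(total_area)
```

Because the heights depend on the bin widths, a density value is not a percentage; it is the area of each bar that gives the proportion of observations in that bin.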

**Example 5.** We can also customize the number of bins.

In [None]:
ax = titanic.age.hist(edgecolor='white', bins=7, density=True)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\n Age");

**Example 6.** We can also customize the bin locations.

In [None]:
bins = [0, 10, 20, 30, 40, 50, 60, 70]
ax = titanic.age.hist(edgecolor='white', bins = bins, density=True)
ax.grid(False)
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("\n Age");

**Example 7.** We can also compare across distributions using `by=`.

In [None]:
ax = titanic.age.hist(edgecolor="white", bins=bins, by=titanic.survived, rot=0)
ax[0].set_xlabel("\n Age")
ax[1].set_xlabel("\n Age");

**Example 8.** We can also compare across distributions using overlaid histograms.

In [None]:
m = titanic[titanic.sex == 'male']
f = titanic[titanic.sex == 'female']

plt.hist(m.age, alpha=0.75, edgecolor='white', bins=bins, label='Male')
plt.hist(f.age, alpha=0.75, edgecolor='white', bins=bins, label='Female')
  
plt.legend(loc='upper right')
plt.title("Age Distribution of Male and Female Passengers")
plt.xlabel("\n Age")
plt.ylabel("Count");

### Scatter Plots

A scatter plot is used to visualize the relationship between two numerical variables.

**Question 7.** Make a scatter plot to visualize the association between `age` and `fare`.

In [None]:
... 

**Example 9.** If we want, we can customize the marker type, color, and size.

In [None]:
...

**Example 10.** We can modify the color based on a label from a categorical feature.

In [None]:
# plt.scatter returns a PathCollection (not an Axes), so the result is not stored
# Passengers who died, drawn in red
plt.scatter(x=titanic[titanic.survived=="Died"]['age'],
            y=titanic[titanic.survived=="Died"]['fare'],
            edgecolor="white",
            color="red",
            label="Died")

# Passengers who survived, drawn in blue
plt.scatter(x=titanic[titanic.survived=="Survived"]['age'],
            y=titanic[titanic.survived=="Survived"]['fare'],
            edgecolor="white",
            color="blue",
            label="Survived")

plt.title("Scatter Plot of Age vs. Fare by Survival Status")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.legend();

## Comparisons Across Groups

### Box Plots

A box plot displays the five-number summary of a set of data.

- Min, Q1, Median, Q3, and Max

- Shows possible outliers


<img src="https://lsc.studysixsigma.com/wp-content/uploads/sites/6/2015/12/1435.png" width="800" height="400">

**Source:** https://www.leansigmacorporation.com/box-plot-with-minitab/ 

**Author:** Michael Parker
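The five-number summary can also be computed directly. The sketch below uses made-up values and the common 1.5 × IQR rule for flagging possible outliers, which is also the default whisker length in `matplotlib` box plots. The `values` series is hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one unusually large value
values = pd.Series([1, 3, 4, 5, 6, 7, 8, 9, 30])

# Min, Q1, median, Q3, and max
summary = values.quantile([0, 0.25, 0.5, 0.75, 1.0])
minimum, q1, median, q3, maximum = summary.tolist()

# Points more than 1.5 * IQR beyond the quartiles are flagged as possible outliers
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(summary)
print(outliers.tolist())
```

In a box plot of these values, the box would span Q1 to Q3, the line inside the box would mark the median, and the value 30 would be drawn as a separate point.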

**Example 11.** A boxplot of the `age` variable.

In [None]:
...

**Example 12.** A boxplot grouped by survival status.

In [None]:
titanic.boxplot(column="age", by="survived")
plt.grid(False)
plt.suptitle("")
plt.title("Box Plot of Age Grouped by Survival Status");

## Seaborn

Seaborn is a Python data visualization library based on [`matplotlib`](https://matplotlib.org/). It provides a high-level interface for drawing attractive and informative statistical graphics.

For a brief introduction to the ideas behind the library, you can read the [introductory notes](https://seaborn.pydata.org/tutorial/introduction.html) or the [paper](https://joss.theoj.org/papers/10.21105/joss.03021). Visit the [installation page](https://seaborn.pydata.org/installing.html) to see how you can download the package and get started with it. You can browse the [example gallery](https://seaborn.pydata.org/examples/index.html) to see some of the things that you can do with seaborn, and then check out the [tutorials](https://seaborn.pydata.org/tutorial.html) or [API reference](https://seaborn.pydata.org/api.html) to find out how.

In [None]:
# Import seaborn
import ... as ...

### Line Chart

A line chart is similar to a scatter plot except that the measurement points are ordered and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time (i.e. a time series), thus the line is often drawn chronologically.
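Since the points are joined in the order they appear, it usually helps to sort by the time variable before plotting. Here is a minimal sketch with made-up yearly values; the `data` frame and its column names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly measurements, deliberately out of order
data = pd.DataFrame({"year": [2021, 2019, 2022, 2020],
                     "value": [14, 10, 17, 12]})

# Sort chronologically so the line segments connect consecutive years
ordered = data.sort_values("year")

ax = ordered.plot(x="year", y="value", marker="o", legend=False)
ax.set_xlabel("Year")
ax.set_ylabel("Value")
```

Plotting the unsorted frame instead would connect 2021 to 2019 to 2022 to 2020, producing a line that doubles back on itself.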

**Example 13.** A line chart that shows the number of passengers per year.

In [None]:
flights = ...
flights.head()

In [None]:
flights.plot(x="year", y="passengers", legend=False);

<!-- BEGIN QUESTION -->

**Question 8.** Fix the line chart from **Example 13**.

In [None]:
... 

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When the export finishes, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. Submit this .zip file to Gradescope through the assignment in Moodle for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)