# Data Vis: Visualizing Numerical and Categorical Data
* Notebook 1: Visualizing Amounts

## Setup

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

## Data

In this notebook, we will use the NYC Flights 2013 dataset, which contains information about all domestic flights that departed from NYC in 2013. The dataset includes the following tables:
- `flights`: Contains information about each flight, including the origin and destination airports, departure and arrival times, and delays.
- `planes`: Contains information about the planes, including their tail numbers and model years.
- `airports`: Contains information about the airports, including their names and locations.
- `airlines`: Contains information about the airlines, including their names and IATA codes.
- `weather`: Contains information about the weather at the origin airports, including temperature, wind speed, and precipitation.

In [None]:
data = pd.read_csv('flights_joined.csv')

In [None]:
data.shape

In [None]:
data.head()

## Bar Chart

A bar chart is a common way to visualize counts of categorical data. Here, we will use the figure-level function `catplot()` with `kind='count'` to create a bar chart of the number of flights by carrier. The `order` parameter specifies the order of the bars in the chart (here, by decreasing count).

In [None]:
g = sns.catplot(x='carrier', data=data, kind='count', order=data['carrier'].value_counts().index)
g.fig.suptitle("Number of flights per airline")
g.set_xlabels("Airline")
g.set_ylabels("Number of flights")
plt.show()
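
To see why `value_counts().index` yields the bar order, here is a minimal sketch on a made-up carrier column (the toy values are assumptions, not taken from the dataset). `value_counts()` sorts by descending count by default, so its index lists the categories from most to least frequent.

```python
import pandas as pd

# Toy data (hypothetical), standing in for data['carrier']
toy = pd.Series(["UA", "DL", "UA", "AA", "UA", "DL"], name="carrier")

# value_counts() sorts by count in descending order by default
counts = toy.value_counts()

# The index holds the category labels in that order; this is what `order` receives
order = counts.index.tolist()
print(order)
```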

We can create the same plot using the axes-level function `barplot()`, but we first need to aggregate the data with the `groupby()` method. `catplot()` with `kind='count'` is therefore more convenient for this type of plot, as it handles the aggregation automatically.

In [None]:
data_grouped = data.groupby('carrier').size().reset_index(name='count')
data_grouped = data_grouped.sort_values(by='count', ascending=False)

ax = sns.barplot(x='carrier', y='count', data=data_grouped)
ax.set_title("Number of flights per airline")
ax.set_xlabel("Airline")
ax.set_ylabel("Number of flights")
plt.show()
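
If you want an axes-level function without manual aggregation, seaborn's `countplot()` counts the rows per category itself. The following is a minimal sketch, assuming a small made-up DataFrame named `toy` in place of `data`:

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Toy data (hypothetical), standing in for the flights data
toy = pd.DataFrame({"carrier": ["UA", "DL", "UA", "AA", "UA", "DL"]})

# Order the bars by decreasing count, as before
order = toy["carrier"].value_counts().index

# countplot() is axes-level and returns a matplotlib Axes
ax = sns.countplot(x="carrier", data=toy, order=order)
ax.set_title("Number of flights per airline")
ax.set_xlabel("Airline")
ax.set_ylabel("Number of flights")
plt.show()
```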

Next, we will create a stacked bar chart to show the number of flights by origin airport and airline carrier. We will use the `hue` parameter to split each bar by carrier, and set `multiple="stack"` to stack the resulting segments for each origin airport.

In my opinion, it's a bit confusing that stacked bar charts are created using the `displot()` function, which is usually used for histograms.

In [None]:
g = sns.displot(x="origin", hue="carrier", data=data, kind="hist", multiple="stack")
g.fig.suptitle("Number of flights per origin and airline")
g.set_xlabels("Origin")
g.set_ylabels("Number of flights")
plt.show()
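
To check the numbers behind the stacked bars, a contingency table is handy. The sketch below uses `pd.crosstab()` on made-up rows (the values are assumptions for illustration); each row sum corresponds to the total bar height for that origin.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the flights data
toy = pd.DataFrame({
    "origin":  ["EWR", "EWR", "JFK", "JFK", "JFK", "LGA"],
    "carrier": ["UA",  "UA",  "DL",  "B6",  "DL",  "AA"],
})

# Count flights for every origin/carrier combination (missing pairs become 0)
table = pd.crosstab(toy["origin"], toy["carrier"])

# Row sums give the total number of flights per origin
totals = table.sum(axis=1)

print(table)
print(totals)
```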

Similarly, we can create a grouped bar chart to show the number of flights by origin airport and airline carrier, using the `multiple="dodge"` parameter to place the bars for each origin airport side by side.

In [None]:
g = sns.displot(x="origin", hue="carrier", data=data, kind="hist", multiple="dodge")
g.fig.suptitle("Number of flights per origin and airline")
g.set_xlabels("Origin")
g.set_ylabels("Number of flights")
plt.show()

Now it's your turn. Create bar charts, stacked bar charts, and grouped bar charts to show the number of flights by other categorical variables in the dataset...

In [None]:
# YOUR CODE HERE

## Heatmap

Heatmaps are another way to visualize counts of categorical data. They are particularly useful for visualizing the relationship between two categorical variables. In this example, we will create a heatmap to show the number of flights by `weekday` and `hour`.

The variable `weekday` does not exist in the dataset, so we will create it by extracting the day of the week from the `time_hour` variable.

In [None]:
data["weekday"] = pd.to_datetime(data["time_hour"]).dt.day_name()
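
As a quick check of what `dt.day_name()` returns, here is a minimal sketch on two made-up timestamps. The string format shown is an assumption; the actual format of `time_hour` in the CSV may differ.

```python
import pandas as pd

# Hypothetical timestamps, standing in for data["time_hour"]
stamps = pd.Series(["2013-01-01 05:00:00", "2013-01-05 18:00:00"])

# Parse the strings to datetimes, then extract the English weekday name
weekdays = pd.to_datetime(stamps).dt.day_name()
print(weekdays.tolist())
```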

First, we have to aggregate the raw data by `weekday` and `hour` before we can create the heatmap. This can be done with the `pivot_table()` method, which creates a pivot table from the raw data. The `pivot_table()` method takes the following parameters:
- `index`: The variable to use for the rows of the pivot table (here, `hour`).
- `columns`: The variable to use for the columns of the pivot table (here, `weekday`).
- `values`: The variable to use for the values of the pivot table (here, `flight`).
- `aggfunc`: The aggregation function to use for the values of the pivot table (here, `count`).

In [None]:
pivot_table = data.pivot_table(index="hour", columns="weekday", values="flight", aggfunc="count")
pivot_table.head(10)
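
A minimal sketch on made-up rows shows what this aggregation produces (the toy values are assumptions for illustration). Note that a combination of hour and weekday without any flights appears as `NaN` rather than 0.

```python
import pandas as pd

# Toy data (hypothetical): five flights on two weekdays
toy = pd.DataFrame({
    "hour":    [5, 5, 6, 6, 6],
    "weekday": ["Monday", "Tuesday", "Monday", "Monday", "Monday"],
    "flight":  [101, 102, 103, 104, 105],
})

# Count the flights for each hour (rows) and weekday (columns)
toy_pivot = toy.pivot_table(index="hour", columns="weekday", values="flight", aggfunc="count")
print(toy_pivot)
```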

The weekdays are not ordered correctly, so we will use the `reindex()` method to get the columns in the right order.

In [None]:
ordered_weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
pivot_table = pivot_table.reindex(columns=ordered_weekdays)
pivot_table.head(10)

Note that there are `NaN` values in the pivot table. There are also missing rows for hours with no flights. Let's fix these issues.

In [None]:
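# Add a row of zeros for every hour that has no flights at all
for hour in range(24):
    if hour not in pivot_table.index:
        pivot_table.loc[hour] = 0

# Replace the remaining NaN values (hour/weekday pairs without flights) with 0
pivot_table = pivot_table.fillna(0)
# Restore chronological order, since new rows are appended at the end
pivot_table = pivot_table.sort_index()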

pivot_table.head(10)
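
The same cleanup can also be written without a loop. The sketch below is one possible alternative, shown on a made-up two-row table (`toy_counts` is a hypothetical stand-in for `pivot_table`): `reindex()` adds the missing hours and weekdays as `NaN`, `fillna(0)` replaces every `NaN`, and `astype(int)` turns the counts back into integers.

```python
import pandas as pd

ordered_weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Toy table (hypothetical): counts for hours 5 and 6 on two weekdays, one value missing
toy_counts = pd.DataFrame(
    {"Monday": [1.0, 3.0], "Tuesday": [1.0, float("nan")]},
    index=[5, 6],
)

full = (
    toy_counts
    .reindex(index=range(24), columns=ordered_weekdays)  # add missing hours and weekdays
    .fillna(0)                                           # missing counts become 0
    .astype(int)                                         # counts are whole numbers
)
print(full.head(8))
```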

Finally, we are ready to create the heatmap with the `heatmap()` function. The `cmap`, `linewidths`, and `linecolor` parameters, among others, can be used to customize its appearance. See https://seaborn.pydata.org/generated/seaborn.heatmap.html for more details.

In [None]:
fig = plt.figure(figsize=(9, 6))
ax = sns.heatmap(data=pivot_table, cmap='viridis', linewidths=0.1, linecolor='white')
plt.yticks(rotation=0)
plt.show()
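
As one example of further customization, `heatmap()` can print the count inside each cell via `annot=True`. This is a minimal sketch on a made-up integer table (`toy_counts` is hypothetical). The format `fmt="d"` expects integers, so a table that holds floats would need `.astype(int)` first or a float format such as `fmt=".0f"`.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Toy table (hypothetical): integer counts for two hours and two weekdays
toy_counts = pd.DataFrame(
    [[1, 0], [3, 2]],
    index=[5, 6],
    columns=["Monday", "Tuesday"],
)

fig, ax = plt.subplots(figsize=(4, 3))
# annot=True writes each value into its cell; fmt="d" formats it as an integer
sns.heatmap(data=toy_counts, cmap="viridis", annot=True, fmt="d",
            linewidths=0.1, linecolor="white", ax=ax)
ax.set_xlabel("Weekday")
ax.set_ylabel("Hour")
plt.show()
```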

Now it's your turn. Create a heatmap using other categorical variables in the dataset...

In [None]:
# YOUR CODE HERE