# Data Vis: Visualizing Numerical and Categorical Data
* Notebook 1: Visualizing Associations

## Setup

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

## Data

In this notebook, we will use the NYC Flights 2013 dataset, which contains information about all domestic flights that departed from NYC in 2013. The dataset includes the following tables:
- `flights`: Contains information about each flight, including the origin and destination airports, departure and arrival times, and delays.
- `planes`: Contains information about the planes, including their tail numbers and model years.
- `airports`: Contains information about the airports, including their names and locations.
- `airlines`: Contains information about the airlines, including their names and IATA codes.
- `weather`: Contains information about the weather at the origin airports, including temperature, wind speed, and precipitation.
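The file name `flights_joined.csv` used in the next cell suggests that these tables have already been combined into one frame. How that file was built is not shown here, so the following is only a minimal sketch of such a join on toy data. The key columns `carrier` and `tailnum` and the renamed column `plane_year` are assumptions for illustration, and all values are made up.

```python
import pandas as pd

# Toy stand-ins for three of the tables; the values are invented for illustration.
flights = pd.DataFrame({
    "carrier": ["UA", "AA", "B6"],
    "tailnum": ["N14228", "N619AA", "N804JB"],
    "origin": ["EWR", "JFK", "JFK"],
    "distance": [1400, 1089, 1576],
    "air_time": [227.0, 160.0, 183.0],
})
airlines = pd.DataFrame({
    "carrier": ["UA", "AA", "B6"],
    "name": ["United Air Lines Inc.", "American Airlines Inc.", "JetBlue Airways"],
})
planes = pd.DataFrame({
    "tailnum": ["N14228", "N804JB"],
    "year": [1999, 2012],
})

# Left joins keep every flight, even when a lookup table has no matching row.
# The plane's model year is renamed (hypothetical name) so it is not confused
# with a flight's year.
joined = (
    flights
    .merge(airlines, on="carrier", how="left")
    .merge(planes.rename(columns={"year": "plane_year"}), on="tailnum", how="left")
)
print(joined)
```

A left join is used so that flights without a matching plane record stay in the result with a missing `plane_year` instead of being dropped.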

In [None]:
data = pd.read_csv('flights_joined.csv')

In [None]:
data.shape

In [None]:
data.head()

## Scatterplot

Scatterplots are the standard way to visualize the relationship between two continuous variables `x` and `y`. In this notebook, we will use the figure-level `relplot()` function with `kind='scatter'` to create scatterplots.

In [None]:
g = sns.relplot(x='distance', y='air_time', alpha=0.5, data=data, kind='scatter')
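# `figure` replaces the deprecated `fig` attribute (seaborn >= 0.11.2);
# y > 1 moves the title above the axes so it does not overlap the plot.
g.figure.suptitle('Air Time vs Distance', y=1.02)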
g.set_xlabels("Distance")
g.set_ylabels("Air Time")
plt.show()

We can use the `hue` parameter to color the points by a categorical variable.

In [None]:
g = sns.relplot(x='distance', y='air_time', hue='carrier', alpha=0.5, data=data, kind='scatter')
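# `figure` replaces the deprecated `fig` attribute (seaborn >= 0.11.2);
# y > 1 moves the title above the axes so it does not overlap the plot.
g.figure.suptitle('Air Time vs Distance by Carrier', y=1.02)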
g.set_xlabels("Distance")
g.set_ylabels("Air Time")
plt.show()

Finally, we can map another numerical variable to the size of the points using the `size` parameter. The `sizes` parameter takes a `(min, max)` tuple that controls the range of point sizes.

In [None]:
g = sns.relplot(x='distance', y='air_time', hue="origin", alpha=0.5,
  size="dep_delay", sizes=(1, 100),
  data=data, kind='scatter')
g.set_xlabels("Distance")
g.set_ylabels("Air Time")
plt.show()

If you want a trend line on your scatterplot, you can use the axes-level `regplot()` function (its figure-level counterpart is `lmplot()`). By default, a linear regression is fitted automatically and the resulting line is drawn on top of the scatterplot. The parameters `line_kws` and `scatter_kws` control the appearance of the line and the points.

In [None]:
sns.regplot(x="distance", y="air_time", data=data,
  line_kws={'color':'red'}, scatter_kws={'alpha':0.2})
plt.show()
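`regplot()` draws the fitted line but does not report its coefficients. To see the numbers behind such a line, one option is to fit an ordinary least-squares line directly with `np.polyfit`. This is a sketch on synthetic data, not on the flights dataset: the slope, intercept, and noise level used to generate the points are made up.

```python
import numpy as np

# Synthetic stand-in for distance (miles) and air time (minutes); values are invented.
rng = np.random.default_rng(0)
distance = rng.uniform(100, 2500, size=500)
air_time = 20 + 0.12 * distance + rng.normal(0, 10, size=500)

# A degree-1 polynomial fit is an ordinary least-squares straight line.
# np.polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(distance, air_time, deg=1)
print(f"air_time ~ {intercept:.1f} + {slope:.3f} * distance")
```

The estimated slope should land close to the value used to generate the data, which is a quick sanity check that the fit behaves as expected.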

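All of the scatterplots above suffer from heavy overplotting: many points are drawn on top of each other, which hides how dense the data actually is. As an alternative, we can draw a heatmap of the two continuous variables. When both `x` and `y` are given, `displot()` draws a bivariate histogram that bins the data into rectangles and encodes the count in each bin as color.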

In [None]:
sns.displot(x="distance", y="air_time", data=data)
plt.show()
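To make the binning behind such a heatmap concrete, the counts can also be computed directly with `np.histogram2d`. This is a sketch on synthetic data with made-up values, and the choice of 20 bins per axis is arbitrary; seaborn picks its own bin widths by default.

```python
import numpy as np

# Synthetic stand-in for distance and air time; values are invented.
rng = np.random.default_rng(1)
distance = rng.uniform(100, 2500, size=10_000)
air_time = 20 + 0.12 * distance + rng.normal(0, 10, size=10_000)

# Count how many points fall into each cell of a 20 x 20 grid.
counts, x_edges, y_edges = np.histogram2d(distance, air_time, bins=(20, 20))

print(counts.shape)       # one count per grid cell
print(int(counts.sum()))  # every point lands in exactly one cell
print(int(counts.max()))  # the densest cell
```

Each entry of `counts` corresponds to one colored rectangle in the heatmap, so dense regions remain readable no matter how many points overlap.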

Now it's your turn. Create scatterplots of other continuous variables in the dataset.

In [None]:
# YOUR CODE HERE