# Matplotlib and Seaborn - Multivariate Charts

## Scatterplot

Shows the relationship between two quantitative variables. Values of one variable are plotted on the x-axis and values of the other variable on the y-axis.

### Pearson correlation coefficient

Represented by `r`, ranges from `-1` to `1`. A positive value indicates that an increase in one variable is associated with an increase in the other. A negative value indicates that when one variable increases, the other tends to decrease. Values closer to `-1` or `1` indicate a stronger correlation, while values closer to `0` indicate a weaker correlation. The strength of the correlation is not necessarily related to the slope of the line. `r` only captures linear relationships.
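As a rough illustration of these properties, the sketch below computes `r` with NumPy's `corrcoef` on made-up data (the arrays `x`, `y_steep`, `y_shallow`, and `y_curved` are invented for this example). It suggests that the slope does not change `r`, and that a perfect but non-linear relationship can still give an `r` near zero.

```python
import numpy as np

# made-up data for illustration only
x = np.linspace(-5, 5, 101)
y_steep = 2 * x + 1      # perfectly linear, steep slope
y_shallow = 0.1 * x + 1  # perfectly linear, shallow slope
y_curved = x ** 2        # fully determined by x, but not linearly

# corrcoef returns a 2x2 correlation matrix; r is the off-diagonal entry
r_steep = np.corrcoef(x, y_steep)[0, 1]
r_shallow = np.corrcoef(x, y_shallow)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(r_steep, r_shallow, r_curved)
```

Both linear cases give an `r` of essentially 1 despite very different slopes, while the curved case gives an `r` of essentially 0 because the data is symmetric around `x = 0`.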

### Plotting

Matplotlib: 

`plt.scatter(data = df, x = 'num_var1', y = 'num_var2')`

Seaborn (combo of scatter plot with regression function):

`sb.regplot(data = df, x = 'num_var1', y = 'num_var2')`

Exclude regression with the `fit_reg = False` option.

#### Plotting with the log function 

Create a function that transforms the values to `log10` form and vice-versa:

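```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

def log_trans(x, inverse = False):
    """Transform values to log10 scale, or back again when inverse is True."""
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)

# plot the y-variable on a log10 scale (df is assumed to be loaded already)
sb.regplot(x = df['num_var1'], y = df['num_var2'].apply(log_trans))
# place ticks at the log-transformed positions, labeled with the original values
tick_locs = [10, 20, 50, 100, 200, 500]
plt.yticks(log_trans(tick_locs), tick_locs)
```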
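The tick trick above works because the data is plotted in log units while the labels stay in the original units. A minimal check of that idea, using the same `log_trans` helper and the same tick values (no plotting involved), might look like this:

```python
import numpy as np

def log_trans(x, inverse=False):
    """Transform values to log10 scale, or back again when inverse is True."""
    if not inverse:
        return np.log10(x)
    return np.power(10, x)

tick_locs = np.array([10, 20, 50, 100, 200, 500])

# positions on the transformed axis where the ticks are drawn
tick_positions = log_trans(tick_locs)

# applying the inverse recovers the original values used as labels
recovered = log_trans(tick_positions, inverse=True)

print(tick_positions)
print(recovered)
```

The positions are roughly 1, 1.3, 1.7, 2, 2.3, and 2.7, so the ticks end up nearly evenly spaced on the transformed axis.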

## Overplotting, Transparency, and Jitter

### Overplotting

Overplotting happens when too many points overlap in a small area, making the scatterplot hard or impossible to interpret.

It can be resolved with sampling, transparency, or jitter.

**Sampling**: Plot a random selection of the data instead of every point.

**Transparency**: Make each individual point a little transparent, so that overlapping points appear darker. The darker areas show the peaks in the data.

**Jitter**: When the values are discrete, many points land on exactly the same position and it is hard to see any relationship. Jitter adds a small random offset to each point, so that points are moved slightly away from their original values and no longer overlap.
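Sampling has no dedicated plotting option, but it can be done on the dataframe before plotting. The sketch below uses pandas' `DataFrame.sample` on made-up data; the sample size of 500 and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# made-up data: 10,000 points would overplot heavily in a scatterplot
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_var1": rng.normal(size=10_000),
    "num_var2": rng.normal(size=10_000),
})

# draw 500 rows without replacement; random_state makes the draw repeatable
df_sample = df.sample(n=500, replace=False, random_state=0)

print(len(df), len(df_sample))
```

The subset can then be passed to the plotting call in place of the full dataframe, e.g. `plt.scatter(data = df_sample, x = 'num_var1', y = 'num_var2')`.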

#### Adding transparency

Use the `alpha` option with a transparency value from 0 (full transparency) to 1 (full opacity).

`plt.scatter(data = df, x = 'disc_var1', y = 'disc_var2', alpha = 1/5)`

`sb.regplot(data = df, x = 'disc_var1', y = 'disc_var2', fit_reg = False, scatter_kws = {'alpha' : 1/3})`
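Why overlapping points look darker can be sketched with a little arithmetic. Assuming identical markers are drawn on top of each other with standard alpha blending (an assumption about the renderer, not something stated above), `n` stacked points with transparency `alpha` have a combined opacity of `1 - (1 - alpha)^n`. The helper name `stacked_opacity` is made up for this example.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n_points identical markers stacked on one spot."""
    return 1 - (1 - alpha) ** n_points

alpha = 1 / 5
for n_points in (1, 2, 5, 10):
    print(n_points, round(stacked_opacity(alpha, n_points), 3))
```

With `alpha = 1/5`, a single point is 20% opaque, two overlapping points are 36% opaque, and ten overlapping points are close to 90% opaque, which is why dense regions stand out.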

#### Adding jitter

Jitter is not directly available in Matplotlib's `scatter`, but seaborn's `regplot` supports it:

```python
sb.regplot(data = df, x = 'disc_var1', y = 'disc_var2', fit_reg = False,
           x_jitter = 0.2, y_jitter = 0.2, scatter_kws = {'alpha' : 1/3})
```

Can do either x or y jitter, or both.
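If you want to stay in plain Matplotlib, one possible workaround is to add the random offsets yourself before calling `scatter`. The sketch below does this with NumPy uniform noise on made-up discrete data; the variable names `x_jittered` and `y_jittered` and the jitter width of 0.2 are choices made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up discrete data for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "disc_var1": rng.integers(1, 11, size=1000),
    "disc_var2": rng.integers(0, 11, size=1000),
})

# shift every point by a small uniform random offset in each direction
jitter = 0.2
x_jittered = df["disc_var1"] + rng.uniform(-jitter, jitter, size=len(df))
y_jittered = df["disc_var2"] + rng.uniform(-jitter, jitter, size=len(df))

plt.scatter(x_jittered, y_jittered, alpha=1/3)
plt.xlabel("disc_var1")
plt.ylabel("disc_var2")
```

The offsets only change where the points are drawn. The values stored in `df` are untouched, so any later computation still uses the original data.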

## Heat Maps

Uses density or lightness/darkness of color to show data. Like a 2D histogram, looking from a top-down perspective. Can be used as an alternative to scatterplots. Since color is imprecise, you probably want to add numbers to the grid.

Use heat maps for quantitative vs. quantitative variables, especially when the values are discrete. They are a good alternative to transparency when there is a lot of data.

Bin size is very important for heat maps too! Too large: too hard to see the relationship. Too small: get distracted by noise.

### Plotting

Use Matplotlib's `hist2d`:

```python
# heat map with bin edges placed between the discrete values
bins_x = np.arange(0.5, 10.5+1, 1)
bins_y = np.arange(-0.5, 10.5+1, 1)
plt.hist2d(data = df, x = 'disc_var1', y = 'disc_var2',
           bins = [bins_x, bins_y])
plt.colorbar();
```

Since there are two variables the `bins` param takes a list of two bin edge specifications. The `colorbar` function adds a color bar to the side of the plot to show the mapping legend.

### Variations

#### Coloring adjustments

For a different color palette, use the `cmap` parameter of the `hist2d` function. It accepts a string naming one of Matplotlib's built-in palettes (see [here](https://matplotlib.org/api/pyplot_summary.html#colors-in-matplotlib)).

Also, you can make sure that only cells with at least one value are colored in using the `cmin` option like so: `cmin = 0.5`. This helps distinguish cells with zero counts from those with non-zero counts.

`plt.hist2d(data = df, x = 'disc_var1', y = 'disc_var2', bins = [bins_x, bins_y], cmap = 'viridis_r', cmin = 0.5)`
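To see what `cmin` does to the underlying counts, here is a small sketch on six made-up points. It relies on `hist2d` returning the counts array as its first return value; cells whose count falls below `cmin` come back as NaN, which is what leaves them uncolored.

```python
import numpy as np
import matplotlib.pyplot as plt

# six made-up points on a 3x3 grid of discrete values
x = np.array([1, 1, 2, 3, 3, 3])
y = np.array([0, 0, 1, 2, 2, 0])
bins_x = np.arange(0.5, 3.5 + 1, 1)   # edges 0.5, 1.5, 2.5, 3.5
bins_y = np.arange(-0.5, 2.5 + 1, 1)  # edges -0.5, 0.5, 1.5, 2.5

counts, x_edges, y_edges, image = plt.hist2d(x, y, bins=[bins_x, bins_y],
                                             cmin=0.5)

# rows index the x-bins, columns index the y-bins; empty cells are NaN
print(counts)
```

Four cells contain data and the remaining five are NaN. This is worth keeping in mind when looping over the counts, since NaN cannot be converted with `int()` and comparisons against NaN are always false.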

#### Annotations

It can be helpful to have count annotations on the cells when there is a lot of data. With `hist2d`, you have to add text labels one by one to each cell.

The counts are available from the return value of `hist2d`: it returns an array of counts, two vectors of bin edges, and the plotted image object.

```python
# hist2d returns a number of different variables, including an array of counts
bins_x = np.arange(0.5, 10.5+1, 1)
bins_y = np.arange(-0.5, 10.5+1, 1)
h2d = plt.hist2d(data = df, x = 'disc_var1', y = 'disc_var2',
               bins = [bins_x, bins_y], cmap = 'viridis_r', cmin = 0.5)
counts = h2d[0]

# loop through the cell counts and add text annotations for each
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= 7: # increase visibility on darkest cells
            plt.text(bins_x[i]+0.5, bins_y[j]+0.5, int(c),
                     ha = 'center', va = 'center', color = 'white')
        elif c > 0:
            plt.text(bins_x[i]+0.5, bins_y[j]+0.5, int(c),
                     ha = 'center', va = 'center', color = 'black')
```

Note that when there is a large number of cells, annotations can be overwhelming, so it may be better to leave them off.

## Violin plots

Use for comparing a quantitative variable vs. a qualitative variable. The curves of the 'violin' show the distribution of the points (wider curves have more points). This is like a smooth histogram turned on its side.

### Plotting

Use Seaborn's `violinplot`:

`sb.violinplot(data = df, x = 'cat_var', y = 'num_var')`

Each category is shown in a different color by default. If the color is not meaningful, you can set a single color with the `color` option (e.g. use `base_color = sb.color_palette()[0]` to get the first color of the default palette).

There is also a mini box plot inside each violin. You can turn it off with the `inner = None` option.

```python
base_color = sb.color_palette()[0]
sb.violinplot(data = df, x = 'cat_var', y = 'num_var', color = base_color,
              inner = None)
```

#### Variation

Render the plot horizontally by plotting the categorical variable on the `y` and the numerical variable on the `x`. In case both are numeric, you can also use the `orient` option to specifically set the orientation.

You can also add summary stats in the violin plot with the `inner = 'quartile'` option. The line with thick dashes indicates the median, and the two lines with shorter dashes on either side the first and third quartiles.

`sb.violinplot(data = df, x = 'cat_var', y = 'num_var', color = base_color, inner = 'quartile')`

Show each individual observation inside the violin with `inner = 'point'` or `inner = 'stick'`:

`sb.violinplot(data = df, x = 'num_var', y = 'cat_var', color = base_color, inner = 'stick')`

## Box plots

Also used for comparing a quantitative variable vs. a qualitative variable. It summarizes the data by showing descriptive statistics for the numerical values in each category with boxes and whiskers.

The horizontal line in the box is the median. The lower (Q1) and upper (Q3) box edges show the 1st and 3rd quartiles. The height of the box is the interquartile range (IQR).

Whiskers extend from the box toward the minimum and maximum values. Typically, a maximum range is set on whisker length; by default this is 1.5 times the IQR beyond the box edges. Points farther out than that are considered outliers and are drawn as individual dots beyond the ends of the whiskers.
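The statistics behind a box plot can be reproduced by hand. The sketch below computes them with NumPy on ten made-up values, assuming the common 1.5 × IQR whisker rule described above; the variable names (`lower_fence`, `upper_fence`, and so on) are chosen for this example.

```python
import numpy as np

# made-up values with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# box edges and the line inside the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# whiskers may reach at most 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# everything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the box spans 3.25 to 7.75, the upper fence sits at 14.5, the upper whisker stops at 9 (the largest value inside the fence), and 30 is flagged as an outlier.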

### Plotting

Use seaborn's `boxplot`:

`sb.boxplot(data = df, x = 'cat_var', y = 'num_var', color = base_color)`

Plot horizontally by putting the categorical variable along the `y` axis and the numerical variable along the `x` axis.

## Clustered Bar Charts

Use to show the relationship between two categorical variables. In a clustered bar chart, bars are organized into clusters based on levels of the first variable, and then bars are ordered consistently across the second variable within each cluster.

To make a clustered bar chart, add a second variable to seaborn's `countplot` with the `hue` option:

`sb.countplot(data = df, x = 'cat_var1', hue = 'cat_var2')`

The first categorical variable is plotted across the x-axis as a group. The second variable is depicted as additional bars within each group, colored by category. A legend explains which color belongs to which level of the second variable.

To move the legend, use the Axes method to set the position (`countplot` returns an Axes object):

`ax = sb.countplot(data = df, x = 'cat_var1', hue = 'cat_var2')`

`ax.legend(loc = 8, ncol = 3, framealpha = 1, title = 'cat_var2')`

### Heat map as an alternative

One alternative way of depicting the relationship between two categorical variables is through a heat map. It's like a 2D version of the bar chart from above.

You can use seaborn's `heatmap` but instead of using the original dataframe as data, the counts need to be summarized into a matrix.

```python
# Create a matrix of counts for each combination of levels
ct_counts = df.groupby(['cat_var1', 'cat_var2']).size()
ct_counts = ct_counts.reset_index(name = 'count')
ct_counts = ct_counts.pivot(index = 'cat_var2', columns = 'cat_var1', values = 'count')

sb.heatmap(ct_counts)
```

Add annotations for each cell with `annot = True` option:

`sb.heatmap(ct_counts, annot = True, fmt = 'd')`

Adding `fmt = 'd'` means that annotations will all be formatted as integers. Use `fmt = '.0f'` if you have any cells with no counts, in order to account for NaNs.
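To see where the NaNs come from, the sketch below runs the same groupby/pivot steps on a tiny made-up dataframe in which one combination of levels never occurs. The `fillna(0).astype(int)` step at the end is an alternative that is not part of the notes above; it turns missing combinations into explicit zero counts so that `fmt = 'd'` can still be used.

```python
import pandas as pd

# made-up data: the combination ('b', 'y') never occurs
df = pd.DataFrame({
    "cat_var1": ["a", "a", "a", "b", "b"],
    "cat_var2": ["x", "y", "x", "x", "x"],
})

ct_counts = df.groupby(["cat_var1", "cat_var2"]).size()
ct_counts = ct_counts.reset_index(name="count")
ct_counts = ct_counts.pivot(index="cat_var2", columns="cat_var1", values="count")
print(ct_counts)  # the ('y', 'b') cell is NaN

# alternative (an assumption, not from the notes): make missing cells explicit zeros
ct_filled = ct_counts.fillna(0).astype(int)
print(ct_filled)
```

Whether an empty cell or an explicit zero reads better depends on the data, so treat the filled version as an option rather than a rule.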

[Series `reset_index` documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html), [DataFrame `pivot` documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html)

## Faceting

Faceting is a way of comparing distributions or relationships across levels of additional variables, especially when there are three or more variables of interest overall. The data is divided into separate subsets, often by the different levels of a categorical variable.

For example, rather than depicting the relationship between one numeric variable and one categorical variable using a violin plot or box plot, we could use faceting to look at a histogram of the numeric variable for subsets of the data divided by categorical variable levels.

Use seaborn's `FacetGrid` object ([docs](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)):

Step 1: Create an instance of the FacetGrid object and specify the feature to facet by:

`grid = sb.FacetGrid(data = df, col = 'cat_var')`

Step 2: Use the `map` method on the FacetGrid object to specify the plot type and the variable(s) that will be plotted in each subset (here, a histogram on 'num_var'):

`grid.map(plt.hist, 'num_var')`

Each subset of the data is plotted independently. Each facet uses the default of ten bins from `hist` to bin its own data, so the bin edges and bin sizes differ from facet to facet. The axis limits on each facet are the same to allow clear and direct comparisons between groups.

It helps to also set the same bin edges on each subset, which can be done by passing extra keyword arguments to `map`.

```python
# shared bin edges so every facet uses the same bins
bin_edges = np.arange(-3, df['num_var'].max()+1/3, 1/3)
g = sb.FacetGrid(data = df, col = 'cat_var')
g.map(plt.hist, "num_var", bins = bin_edges)
```

### Variations

If there are many subsets (say, 15), setting a wrap, e.g. `col_wrap = 5`, organizes the plots into rows of five facets each, rather than a single long row of fifteen plots.

Other operations to increase readability:

Setting each facet height to 2 inches (`height`, which was called `size` in seaborn versions before 0.9), sorting the facets by group mean (`col_order`), limiting the number of bin edges, and changing the titles of each facet to just the categorical level name using the `set_titles` method and `{col_name}` template variable.

```python
# mean of num_var for each level, sorted from highest to lowest
group_means = df.groupby('many_cat_var')['num_var'].mean()
group_order = group_means.sort_values(ascending = False).index

g = sb.FacetGrid(data = df, col = 'many_cat_var', col_wrap = 5, height = 2,
                 col_order = group_order)
g.map(plt.hist, 'num_var', bins = np.arange(5, 15+1, 1))
g.set_titles('{col_name}')
```


## Adapting Univariate Plots

### Pointplot

The `pointplot` function can be used to plot the averages as points rather than bars. This can be useful if having bars in reference to a 0 baseline isn't important or would be confusing.

```python
sb.pointplot(data = df, x = 'cat_var', y = 'num_var', linestyles = "")
plt.ylabel('Avg. value of num_var')
```

By default, `pointplot` connects values by a line. This is fine if the categorical variable is ordinal in nature, but it can be a good idea to remove the line via `linestyles = ""` for nominal data.

### Adapted bar chart for binary (e.g. 0/1 or true/false) data

If the numeric variable is binary in nature, taking values only of 0 or 1, then a box plot or violin plot will not be informative. An adapted bar chart can be the best choice for displaying the data.

`sb.barplot(data = df, x = 'condition', y = 'binary_out', color = base_color)`
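This works because `barplot` shows the mean of the y-variable for each category by default, and the mean of a 0/1 variable is the proportion of ones. A small sketch of that computation on made-up data (the level names `control` and `treatment` are invented for this example):

```python
import pandas as pd

# made-up binary outcomes for two conditions
df = pd.DataFrame({
    "condition": ["control"] * 4 + ["treatment"] * 4,
    "binary_out": [0, 0, 1, 0, 1, 1, 0, 1],
})

# mean of a 0/1 column = share of rows where the outcome is 1
proportions = df.groupby("condition")["binary_out"].mean()
print(proportions)
```

The bar heights in the adapted bar chart would correspond to these proportions: 0.25 for `control` and 0.75 for `treatment`.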

### Adapted histograms

Matplotlib's `hist` function can also be adapted so that bar heights indicate a value other than a count of points, through the use of the `weights` parameter. By default, each data point is given a weight of 1, so that the sum of point weights in each bin is equal to the number of points. If we change the weights to be a function of each point's value on a second variable, then the sum will end up representing something other than a count.

```python
bin_edges = np.arange(0, df['num_var'].max()+1/3, 1/3)

# count number of points in each bin
bin_idxs = pd.cut(df['num_var'], bin_edges, right = False, include_lowest = True,
                  labels = False).astype(int)
pts_per_bin = df.groupby(bin_idxs).size()

num_var_wts = df['binary_out'] / pts_per_bin[bin_idxs].values

# plot the data using the calculated weights
plt.hist(data = df, x = 'num_var', bins = bin_edges, weights = num_var_wts)
plt.xlabel('num_var')
plt.ylabel('mean(binary_out)')
```

To get the mean of the y-variable (`binary_out`) in each bin, the weight of each point (`num_var_wts`) should be equal to its y-variable value divided by the number of points in its x-bin. As part of this computation, we use pandas' `cut` function to associate each data point with a particular bin (`bin_idxs`). The `labels = False` parameter means that each point's bin membership is given as a numeric index rather than a string. We use these numeric indices to look up each point's bin count in `pts_per_bin`. The `.values` at the end is necessary so that the index of the looked-up counts (bin numbers) is not misaligned with the row index of `df['binary_out']` during the division.
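One way to convince yourself that these weights produce bin means is to compare the weighted histogram heights against a plain groupby mean. The sketch below does that on made-up data, using `np.histogram` (which accepts the same `bins` and `weights` arguments as `hist`) so that no plot is needed; the bin width of 0.5 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# made-up data: the chance of binary_out = 1 grows with num_var
rng = np.random.default_rng(42)
df = pd.DataFrame({"num_var": rng.uniform(0, 3, size=500)})
df["binary_out"] = (rng.uniform(size=500) < df["num_var"] / 3).astype(int)

bin_edges = np.arange(0, 3 + 0.5, 0.5)

# bin index of every point, and number of points per bin
bin_idxs = pd.cut(df["num_var"], bin_edges, right=False, include_lowest=True,
                  labels=False).astype(int)
pts_per_bin = df.groupby(bin_idxs).size()

# weight = y-value divided by the number of points in the point's bin
num_var_wts = df["binary_out"] / pts_per_bin[bin_idxs].values

# summing the weights per bin gives the bar heights of the adapted histogram
heights, _ = np.histogram(df["num_var"], bins=bin_edges, weights=num_var_wts)

# direct computation of the mean of binary_out in each bin
bin_means = df.groupby(bin_idxs)["binary_out"].mean().values

print(np.allclose(heights, bin_means))
```

The two arrays agree, which is what the weighting scheme is designed to achieve.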

## Line Plots

Plot the relationship of two quantitative variables, one on each axis, using a line. Emphasizes relative change.

In contrast to a scatterplot, where all data points are plotted, in a line plot, only one point is plotted for every unique x-value or bin of x-values (like a histogram). If there are multiple observations in an x-bin, then the y-value of the point plotted in the line plot will be a summary statistic (like mean or median) of the data in the bin. The plotted points are connected with a line that emphasizes the sequential or connected nature of the x-values.

A time series plot shows the relationship of a variable vs. time, where time is represented along the x-axis (e.g. stock or currency data). Usually, there is only one data point per time period.

### Plotting

Older versions of seaborn had a function for time series plots, `tsplot`, but it was highly specialized and has since been deprecated in favor of `lineplot`.

In Matplotlib, you can use `errorbar`, but you need to do some summarizing/processing of the data first. We need to sort the data by x value and have only one y value for each x value.

```python
# set bin edges and compute bin centers for summarizing the data on num_var1
bin_size = 0.25
xbin_edges = np.arange(0.5, df['num_var1'].max()+bin_size, bin_size)
# centers are needed so points are plotted at their accurate positions;
# drop the last value because it doesn't correspond to an actual bin center
xbin_centers = (xbin_edges + bin_size/2)[:-1]

# compute statistics in each bin
## use cut to figure out which bin each data point falls in
data_xbins = pd.cut(df['num_var1'], xbin_edges, right = False, include_lowest = True)
## group by bin and get the mean of each bin (observed = False keeps empty bins)
y_means = df['num_var2'].groupby(data_xbins, observed = False).mean()
## get the standard error of the mean to plot with the yerr param below
y_sems = df['num_var2'].groupby(data_xbins, observed = False).sem()

# plot the summarized data
plt.errorbar(x = xbin_centers, y = y_means, yerr = y_sems)
plt.xlabel('num_var1')
plt.ylabel('num_var2')
```

Since the x-variable (`num_var1`) is continuous, first set a number of bins into which the data will be grouped. In addition to the usual edges, the center of each bin is also computed for later plotting. For the points in each bin, compute the mean and standard error of the mean.
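For reference, the standard error of the mean returned by pandas' `sem` is the sample standard deviation divided by the square root of the number of points. A quick check on four made-up values:

```python
import numpy as np
import pandas as pd

# made-up values standing in for the points in one bin
y = pd.Series([2.0, 4.0, 4.0, 6.0])

sem_pandas = y.sem()
# sample standard deviation (ddof=1) divided by sqrt(n)
sem_manual = y.std(ddof=1) / np.sqrt(len(y))

print(sem_pandas, sem_manual)
```

Both come out to about 0.816. Bins with more points have smaller error bars, since the denominator grows with the bin count.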

### Alternate variations

#### Rolling window

Instead of computing summary statistics on fixed bins, you can also make computations on a rolling window through use of pandas' `rolling` method. Since the rolling window will make computations on sequential rows of the dataframe, we should use `sort_values` to put the x-values in ascending order first.

```python
# compute statistics in a rolling window
df_window = df.sort_values('num_var1')[['num_var1', 'num_var2']].rolling(15)
x_winmean = df_window.mean()['num_var1']
y_median = df_window.median()['num_var2']
y_q1 = df_window.quantile(.25)['num_var2']
y_q3 = df_window.quantile(.75)['num_var2']

# plot the summarized data
base_color = sb.color_palette()[0]
line_color = sb.color_palette('dark')[0]
plt.scatter(data = df, x = 'num_var1', y = 'num_var2')
plt.errorbar(x = x_winmean, y = y_median, c = line_color)
plt.errorbar(x = x_winmean, y = y_q1, c = line_color, linestyle = '--')
plt.errorbar(x = x_winmean, y = y_q3, c = line_color, linestyle = '--')

plt.xlabel('num_var1')
plt.ylabel('num_var2')
```
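To make the rolling-window mechanics concrete, here is a minimal sketch on five made-up rows with a window of 3 instead of 15. It shows why sorting comes first and why the first few results are NaN (the window is not yet full).

```python
import pandas as pd

# made-up, unsorted data for illustration only
df = pd.DataFrame({
    "num_var1": [3, 1, 2, 5, 4],
    "num_var2": [30, 10, 20, 50, 40],
})

# sort by x first so each window covers neighboring x-values
df_window = df.sort_values("num_var1").rolling(3)
x_winmean = df_window.mean()["num_var1"]
y_median = df_window.median()["num_var2"]

print(x_winmean.tolist())
print(y_median.tolist())
```

The first two entries of each result are NaN, and the remaining entries summarize rows 1 to 3, 2 to 4, and 3 to 5 of the sorted data. Matplotlib leaves out the NaN points, so the plotted line begins at the first full window.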



## Swarmplot

Another alternative to violin plots and box plots is the swarm plot. Similar to a scatterplot, each data point is plotted with position according to its value on the two variables being plotted. Instead of randomly jittering points as in a normal scatterplot, points are placed as close to their actual value as possible without allowing any overlap. A swarm plot can be created in seaborn using the `swarmplot` function, similar to how you would call `violinplot` or `boxplot`.

`sb.swarmplot(data = df, x = 'cat_var', y = 'num_var', color = base_color)`

Unlike the violin plot and box plot, in a swarmplot, every point is plotted, so you can now compare the frequency of each group in the same plot. While there is some distortion due to location jitter, we also have a more concrete picture of where the points actually lie, removing the long tails that can be present in violin plots.

However, it is only reasonable to use a swarm plot if we have a small or moderate amount of data. If we have too many points, then the restrictions against overlap will cause too much distortion or require a lot of space to plot the data comfortably. In addition, having too many points can actually be a distraction, making it harder to see the key signals in the visualization. Use your findings from univariate visualizations to inform which bivariate visualizations will be best, or simply experiment with different plot types to see what is most informative.

## Strip plot

It's like a swarm plot but without any dodging or jittering to keep points separate or off the categorical line. You can also think of it as a rug plot faceted by categorical levels. 

`sb.stripplot(data = df, x = 'num_var', y = 'cat_var', color = base_color)`

## Using Color for Third Variables

One of the most common ways of adding a third variable to a plot in matplotlib and seaborn is through the use of color.

The `violinplot`, `boxplot`, and `barplot` functions can all be made with third-variable clusters by adding a `hue` parameter. Code for heat maps can be adapted to depict third variables rather than counts, just by changing the `weights` parameter for `hist2d`, or the aggregation functions for your data to be fed into `heatmap`.

For scatterplots, there are two different ways of setting color, depending on the type of variable. For numeric variables, you can set the `c` parameter directly in the `scatter` function call, passing the column whose values should be mapped to color.

```python
plt.scatter(data = df, x = 'num_var1', y = 'num_var2', c = 'num_var3')
plt.colorbar()
```

If you have a qualitative variable, you can set different colors for different levels of a categorical variable through the `hue` parameter on seaborn's `FacetGrid` class.

```python
g = sb.FacetGrid(data = df, hue = 'cat_var1', height = 5)
g.map(plt.scatter, 'num_var1', 'num_var2')
g.add_legend()
```

### Color Palettes

Depending on the type of variable you have, you might want to choose a different color palette than the default provided. There are three main palette types to consider: qualitative, sequential, and diverging.

**Qualitative palettes** are built for nominal-type data. This is the palette class taken by the default palette. In a qualitative palette, consecutive color values are distinct so that there is no inherent ordering of levels implied. Colors in a good qualitative palette should also try and avoid drastic changes in brightness and saturation that would cause a reader to interpret one category as being more important than the others - unless that emphasis is deliberate and purposeful.

`sb.palplot(sb.color_palette(n_colors=9))`

(Documentation: [palplot](https://seaborn.pydata.org/generated/seaborn.palplot.html) / [color_palette](https://seaborn.pydata.org/generated/seaborn.color_palette.html))

For other types of data (ordinal and numeric), a choice may need to be made between a sequential scale and a diverging scale. 

In a **sequential palette**, consecutive color values should follow each other systematically. Typically, this follows a light-to-dark trend across a single or small range of hues, where light colors indicate low values and dark colors indicate high values. The default sequential color map, "viridis", takes the opposite approach, with dark colors indicating low values and light colors indicating high values.

`sb.palplot(sb.color_palette('viridis', 9))`
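A palette returned by `color_palette` is a list of RGB tuples, so its ordering can also be inspected numerically. The sketch below uses a rough brightness estimate (a weighted sum of the color channels; the `brightness` helper and its weights are an assumption made for this example, not a seaborn function) to check that "viridis" starts dark and ends light.

```python
import seaborn as sb

palette = sb.color_palette("viridis", 9)

def brightness(rgb):
    """Rough brightness estimate: weighted sum of the R, G, B channels."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

levels = [brightness(color) for color in palette]

print(len(palette))
print(round(levels[0], 2), round(levels[-1], 2))
```

The first color has a much lower brightness than the last, matching the dark-to-light direction described above.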

Most of the time, a sequential palette will depict ordinal or numeric data just fine. However, if there is a meaningful zero or center value for the variable, you may want to consider using a **diverging palette**. In a diverging palette, two sequential palettes with different hues are put back to back, with a common color (usually white or gray) connecting them. One hue indicates values greater than the center point, while the other indicates values smaller than the center.

`sb.palplot(sb.color_palette('vlag', 9))`