# Lesson Overview
> - Scatter plots for quant vs. quant
> - Clustered bar charts for quali vs. quali
> - Heat maps as 2D histograms and as an alternative to clustered bar charts
> - Violin plots and box plots for quant vs. quali
> - Faceting
> - Bar charts with mean instead of count
> - Line plots to reveal changes in values across time

## Scatterplots for quantitative variable vs. quantitative variable
> **Pearson Coefficient (R): statistic quantifying the strength of linear correlation between 2 numeric variables.**
- R doesn't tell you the steepness (slope) of the line that models the relationship, only how tightly the points follow a line.

![](images/R.png)

- R captures **only linear relationships**; a strong non-linear relationship can still produce an R near 0.
- Don't rely on the number alone: plot the data and check whether the coefficient is meaningful for it!
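
The two caveats above can be checked numerically. The following is a minimal sketch using NumPy's `np.corrcoef`; the arrays and the helper name `pearson_r` are made up for illustration and are not part of the lesson's dataset.

```python
import numpy as np

# made-up x values, evenly spaced and symmetric around 0
x = np.linspace(-5, 5, 101)

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r_steep = pearson_r(x, 10 * x)      # steep line
r_shallow = pearson_r(x, 0.1 * x)   # shallow line
r_parabola = pearson_r(x, x ** 2)   # strong but non-linear relationship

print(r_steep, r_shallow, r_parabola)
```

Both lines give an R of 1 (up to floating-point error) despite the hundredfold difference in slope, while the parabola gives an R close to 0 even though y is fully determined by x.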

In [None]:
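# imports used by the cells in this lesson; df is assumed to be a DataFrame loaded earlier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

# scatter plot on log-transformed data (the transform can reveal a linear pattern)
def log_trans(x, inverse = False):
    """Return log10(x), or 10**x when inverse is True."""
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)

# pass x and y by keyword: recent seaborn releases reject positional data arguments
sb.regplot(x = df['num_var1'], y = df['num_var2'].apply(log_trans))
# place ticks at the transformed positions, but label them in the original units
tick_locs = [10, 20, 50, 100, 200, 500]
plt.yticks(log_trans(tick_locs), tick_locs)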
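
The tick relabeling in the cell above relies on `log_trans` and its inverse undoing each other: ticks sit at log10 positions but carry labels in the original units. A small self-contained check, repeating the helper so the block runs on its own:

```python
import numpy as np

def log_trans(x, inverse = False):
    """Same helper as in the cell above: log10(x), or 10**x when inverse is True."""
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)

tick_locs = np.array([10, 20, 50, 100, 200, 500])

tick_positions = log_trans(tick_locs)                  # positions on the transformed axis
recovered = log_trans(tick_positions, inverse = True)  # back to the original units

print(tick_positions)
print(recovered)
```

The positions for 10 and 100 come out as 1 and 2, and applying the inverse recovers the original tick values up to floating-point error.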

## Overplotting, Transparency, and Jitter

![](images/1.png)

> - Overplotting can be reduced with transparency or by plotting a random sample of the data.
> - Overlap between points of discrete variables can be reduced with jitter, which adds a small amount of random noise to each point's position so that data points with the same value spread out a little.
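
As a rough sketch of what jitter does (an illustration of the idea, not seaborn's internal code), the snippet below adds uniform noise to made-up discrete values. The jitter width of 0.2 matches the `x_jitter` and `y_jitter` values used in the lesson cells; everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable

values = np.array([1, 1, 1, 2, 2, 3])   # made-up discrete data with repeated values
jitter = 0.2                            # half-width of the noise band

# add uniform noise in [-jitter, jitter) to each plotted position
plotted = values + rng.uniform(-jitter, jitter, size = values.shape)

print(plotted)
```

Because the noise stays well below half the spacing between values, each point is still visually attributable to its original value; rounding the jittered positions recovers the original data.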


In [None]:
# scatter plot of two discrete variables with jitter and transparency (no regression line)
sb.regplot(data = df, x = 'disc_var1', y = 'disc_var2', fit_reg = False,
           x_jitter = 0.2, y_jitter = 0.2, scatter_kws = {'alpha' : 1/3})

## Heat Maps for quant vs. quant
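> - A heat map is essentially a 2D histogram: color encodes the number of points in each cell
> - Good for discrete variable vs. discrete variable
> - Good alternative to transparency for a large amount of data
> - Bin size matters, just as in a 1D histogram: bins that are too large hide structure, bins that are too small add noise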
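
To see the "2D histogram" idea without any plotting, here is a minimal sketch with `np.histogram2d` on made-up integer data. The bin edges sit halfway between the integer values, in the same spirit as the bin edges used in the lesson's `plt.hist2d` cell.

```python
import numpy as np

# made-up discrete data: pairs of integer values
x = np.array([1, 1, 2, 2, 2, 3])
y = np.array([0, 0, 1, 1, 2, 2])

# edges halfway between the integers, so each value gets its own bin
bins_x = np.arange(0.5, 3.5 + 1, 1)    # 0.5, 1.5, 2.5, 3.5
bins_y = np.arange(-0.5, 2.5 + 1, 1)   # -0.5, 0.5, 1.5, 2.5

# counts[i, j] = number of points with x in bin i and y in bin j
counts, _, _ = np.histogram2d(x, y, bins = [bins_x, bins_y])

print(counts)
```

Each cell of `counts` is what a heat map would color; cells with a count of 0 are the ones that `cmin = 0.5` leaves blank in the lesson's plot.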

In [None]:
plt.figure(figsize = [12, 5])

# left plot: scatterplot of discrete data with jitter and transparency
plt.subplot(1, 2, 1)
sb.regplot(data = df, x = 'disc_var1', y = 'disc_var2', fit_reg = False,
           x_jitter = 0.2, y_jitter = 0.2, scatter_kws = {'alpha' : 1/3})

# right plot: heat map with bin edges between values
plt.subplot(1, 2, 2)
bins_x = np.arange(0.5, 10.5+1, 1)
bins_y = np.arange(-0.5, 10.5+1, 1)
plt.hist2d(data = df, x = 'disc_var1', y = 'disc_var2',
           bins = [bins_x, bins_y], cmap = 'viridis_r', 
           cmin = 0.5) # setting cmin to not color grids with no values
plt.colorbar();

## Violin Plots for Quant vs. Quali

In [None]:
base_color = sb.color_palette()[0]

# vertical violins: categories on the x-axis
sb.violinplot(data = df, x = 'cat_var', y = 'num_var', color = base_color,
              inner = None)

# horizontal violins: swap x and y, drawn on a new figure
plt.figure()
sb.violinplot(data = df, x = 'num_var', y = 'cat_var', color = base_color,
              inner = None)

## Box Plots for Quant vs. Quali
> Instead of box plots, we can make violin plots and set `inner = 'quartile'` to draw lines at the quartiles, which mark the interquartile range (IQR).
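
For reference, the quartiles and IQR that both plot types summarize can be computed directly. This is a small worked example on made-up values using `np.percentile` with its default linear interpolation.

```python
import numpy as np

values = np.arange(1, 10)   # made-up data: 1, 2, ..., 9

# first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1               # interquartile range: spread of the middle half of the data

print(q1, median, q3, iqr)
```

For these nine values the quartiles are 3, 5, and 7, so the IQR is 4.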

In [None]:
plt.figure(figsize = [10, 5])
base_color = sb.color_palette()[0]

# left plot: violin plot
plt.subplot(1, 2, 1)
ax1 = sb.violinplot(data = df, x = 'cat_var', y = 'num_var', color = base_color, inner='quartile')

# right plot: box plot
plt.subplot(1, 2, 2)
sb.boxplot(data = df, x = 'cat_var', y = 'num_var', color = base_color)
plt.ylim(ax1.get_ylim()) # set y-axis limits to be same as left plot

## Clustered Bar Charts for quali vs. quali

In [None]:
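# clustered bar chart: count of each vehicle class, split by fuel type
plt.figure(figsize=[16,10])
sb.countplot(x='VClass', data=fuel_econ, hue='fuelType')

# heat map of the same counts (ct_counts is not defined in these notes; assumed to be this count table)
ct_counts = fuel_econ.groupby(['fuelType', 'VClass']).size().unstack('VClass')
plt.figure(figsize=[16,10])
sb.heatmap(data=ct_counts, annot=True, fmt='.0f')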

## Faceted Plot
> - Multiple copies of the same type of plot visualized on different subsets of the data.
> - A violin plot can be thought of as a faceted plot too: one smoothed, sideways histogram per category.
> - Axis scales and limits **must be consistent** across each subplot, otherwise you may violate data integrity.
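
What faceting does underneath can be sketched without any plotting: split the data by category, then summarize every subset with the same bin edges so the panels are comparable. The data and category labels below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up data: one numeric variable and one categorical variable with three levels
num_var = rng.normal(loc = 0, scale = 1, size = 300)
levels = ['a', 'b', 'c']
cat_var = rng.choice(levels, size = 300)

# one set of bin edges computed from the FULL data, reused for every facet
bin_edges = np.linspace(num_var.min(), num_var.max(), 11)

facet_counts = {}
for level in levels:
    subset = num_var[cat_var == level]   # the data shown in one facet
    facet_counts[level], _ = np.histogram(subset, bins = bin_edges)

for level, counts in facet_counts.items():
    print(level, counts)
```

If each subset had its own bin edges, the same bar position would mean different value ranges in different panels. seaborn's `FacetGrid` shares the x and y axes across panels by default (`sharex`, `sharey`), which covers the axis side of the same requirement.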

In [None]:
# one histogram of num_var per level of cat_var, all using the same bin edges
bin_edges = np.arange(-3, df['num_var'].max()+1/3, 1/3)
g = sb.FacetGrid(data = df, col = 'cat_var')
g.map(plt.hist, "num_var", bins = bin_edges)

# many levels: order the facets by group mean and wrap them into rows of five
group_means = df.groupby('many_cat_var')['num_var'].mean()
group_order = group_means.sort_values(ascending = False).index

# `height` replaces the `size` argument used by older seaborn versions
g = sb.FacetGrid(data = df, col = 'many_cat_var', col_wrap = 5, height = 2,
                 col_order = group_order)
g.map(plt.hist, 'num_var', bins = np.arange(5, 15+1, 1))
g.set_titles('{col_name}')

## Line Plots
> - Use lines instead of bars to emphasize relative change and trends across x values
> - A line implies order and continuity between adjacent x values, so line plots are not suitable when the x-axis variable is nominal

In [None]:
# set bin edges, compute centers
bin_size = 0.25
xbin_edges = np.arange(0.5, df['num_var1'].max()+bin_size, bin_size)
xbin_centers = (xbin_edges + bin_size/2)[:-1]

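# compute statistics in each bin
data_xbins = pd.cut(df['num_var1'], xbin_edges, right = False, include_lowest = True)
# observed = False keeps empty bins, so the results stay aligned with xbin_centers
y_means = df['num_var2'].groupby(data_xbins, observed = False).mean()
y_sems = df['num_var2'].groupby(data_xbins, observed = False).sem()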

# plot the summarized data
plt.errorbar(x = xbin_centers, y = y_means, yerr = y_sems)
plt.xlabel('num_var1')
plt.ylabel('num_var2')
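
The error bars in the cell above are standard errors of the mean (SEM), one per bin. As a small worked example on made-up values for a single bin, the SEM is the sample standard deviation divided by the square root of the number of points:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0])   # made-up values falling in one bin

mean = values.mean()
sem_manual = values.std(ddof = 1) / np.sqrt(len(values))   # sample std / sqrt(n)
sem_pandas = values.sem()                                   # pandas also defaults to ddof = 1

print(mean, sem_manual, sem_pandas)
```

The two SEM values agree, so bins with few points or widely spread values get longer error bars on the line plot.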