# Include-Viz-Functions

## Define data display functions used in other notebooks

## Visualizations in Multiple Dimensions

Representing multiple dimensions on a 2D page or surface is an interesting problem. Even with a single dimension, it can be useful to visualize the range of values of an attribute (usually numerical, but it could be categorical). It's also useful to compare the values of a series of attributes side by side (again, usually numerical, but they could be categorical).

This can be done in side-by-side fashion, where the series of plots represents an additional dimension. It can also be done in a compositional fashion, where the series of plots compares things such as the values of a numerical attribute across slices of the dataset -- for example, the values for the population as a whole versus the values for a particular slice of the population with categorical value C.

We have the ability to represent _up to 8 dimensions_ on a 2D page/surface by using the following representations of dimensions:
- x, y, and z axes give us 3 dimensions (for numerical attributes but can be for some categorical attributes too)
- point/bubble size (numerical), color (typically categorical), and shape (categorical) take us to 6 dimensions
- side-by-side plots, one plot per value of a categorical attribute, take us to 7 dimensions
- segmented side by side plots -- the segments standing for the values a categorical attribute can take, gives us 8 dimensions.
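
As a rough illustration of the channel budget listed above, the sketch below assigns attributes to visual channels in the order the bullets give them. The helper `assign_channels` and the channel names are hypothetical and are not part of this notebook's function set. It also simplifies matters by reserving the axes for numerical attributes.

```python
# Hypothetical helper for illustration; not one of the notebook's plotting functions.
# Channels are listed in the same order as the bullets above.
NUMERIC_CHANNELS = ["x", "y", "z", "size"]
CATEGORICAL_CHANNELS = ["color", "shape", "facet", "facet_segment"]


def assign_channels(numeric_attrs, categorical_attrs):
    """Map attribute names to visual channels, numerical ones first."""
    if len(numeric_attrs) > len(NUMERIC_CHANNELS):
        raise ValueError("at most 4 numerical attributes fit (x, y, z, size)")
    if len(categorical_attrs) > len(CATEGORICAL_CHANNELS):
        raise ValueError("at most 4 categorical attributes fit")
    mapping = dict(zip(NUMERIC_CHANNELS, numeric_attrs))
    mapping.update(zip(CATEGORICAL_CHANNELS, categorical_attrs))
    return mapping


# Column names below are made up for the example.
print(assign_channels(["age", "income"], ["region"]))
```

With four numerical and four categorical attributes, every channel is used, which is the 8-dimension limit described above.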


## Table of Contents

### 1 Dimension
A) 1 categorical attribute
 - Bar graph

B) 1 numerical attribute
 - Histogram
 - Density Plot
 - Skyline
 - Skyline histogram

### 2 Dimensions
C) 2 categorical attributes 
 - Bar side by side
 - Multi-bar on a single graph
 - Contingency table
 
D) 2 numerical attributes
 - Correlation matrix
 - Scatter plot
 - Joint plot
 - Contour plot
 - Density contour plot
 - Pairwise scatter plots (side by side)
 
E) 1 categorical + 1 numerical attribute
 - Facets with histogram
 - Facets with density plot
 - Multiple histograms (same plot a number of numerical attributes considered one by one)
 - Box plot
 - Multiple box plots (same plot with a number of numerical attributes considered one by one)
 - Violin plot
 - Jitter plot
 - Time series plot (categorical attribute = time step)
 - Time series multi plot (same plot with a number of numerical attributes considered one by one)
 - Time series bar graph
 

### 3 Dimensions
F) 1 categorical + 2 numerical
 - Pairplot with hue
 - Density plot with hue
 - Scatter plot with hue
 - Pairwise scatter plots side by side with hue

G) 2 categorical + 1 numerical attribute
 - Violin plot with quartiles
 - Box plots with hue
 - Grouped box plots
 
H) 3 categorical attributes
 - Side by side bar graphs (factor plot)
 - Multiple bars on same graph
 
I) 3 numerical attributes
 - 3D scatter plot
 - Bubble chart
 
### 4 Dimensions
J) 2 categorical + 2 numerical attributes
 - Side by side scatter plots with hue
 
K) 1 categorical + 3 numerical attributes
 - 3D scatter plot with hue
 - Bubble chart with hue
 
### 5 Dimensions
L) 1 categorical + 4 numerical
 - 3D scatter plot with bubble size and hue

M) 2 categorical + 3 numerical
 - Side by side scatter plot with bubble size and hue
 
### 6 Dimensions
N) 2 categorical + 4 numerical attributes
 - 3D scatter plot with bubble size, markers, and hue
 
O) 3 categorical + 3 numerical attributes
 - Side by side scatter plots with bubble size and hue
 

### Other
P) Multiple numerical attributes -- taken one by one, displayed separately
 - Side by side histograms
 - Side by side density plots
 - Side by side histograms and density plots overlaid
 
Q) Multiple numerical attributes -- taken one by one but displayed on the same graph
 - Box plots
 - Swarm plots
 - Correlation table
 
R) Multiple numerical attributes + 1 categorical attribute
 - Multiple line plot
 - Multiple pair plots (with or without regression lines)

In [None]:
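# Assumes plt (matplotlib.pyplot), sns (seaborn), np (numpy), pd (pandas), itertools.groupby and the
# plotnine names used below are already imported, e.g. by a shared setup notebook
# Setting up some non-default colors for matplotlib plots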
plot_colors = ["darkorange", "green", "magenta", "violet", "aqua", "lime", "yellow", "dodgerblue", "fuchsia", "maroon", "chocolate", "darksalmon"]

## 1 Dimension

### A) 1 Categorical Attribute

In [None]:
# get the unique values of the categorical attributes/features
## Not a visualization but just a list of values
def get_cat_values(dataFrame, cat_feature_set):
    # dataFrame is the dataset in pandas dataframe format
    # cat_feature_set is a list of categorical feature names
    
    return list(zip(cat_feature_set, \
                    [dataFrame[cat_feature].unique() for cat_feature in cat_feature_set]))

In [None]:
# Single categorical attribute
# Set up the plot - use for categorical features
# Show how a categorical feature's values are distributed across the possible values it can take

def cat_value_dist(dataFrame, feature, display='percentage', orient='vert', fig_size=(8,6), ordered=1, ord=[], title=""):
    # display can be 'percentage' (default) or 'count'
    # ordered = 1 (default) orders the feature from highest to lowest
    
    fig, ax = plt.subplots(figsize=fig_size)
    
    # The unique values the feature takes
    feat_values = dataFrame[feature].unique()
    
    # Order the feature counts as needed
    if len(ord) == 0:
        ord=dataFrame[feature].value_counts().index
        
    
    # To make the plot vertical, use x=feature in the 'count' display and (feat_values, y) in the 'percentage' display
    # To make the plot horizontal, use y=feature in the 'count' display and (y, feat_values) in the 'percentage' display
    if display == 'count':
        if orient == 'vert': 
            if ordered==1:
                ax = sns.countplot(x=feature, data=dataFrame, order=ord)
            else:
                ax = sns.countplot(x=feature, data=dataFrame)
        else:
            # horiz orientation 
            if ordered==1:
                ax = sns.countplot(y=feature, data=dataFrame, order=ord)
            else:
                ax = sns.countplot(y=feature, data=dataFrame)
    elif display == 'percentage':
        # Percentage of rows that take each distinct value
        y = [len([val for val in dataFrame[feature] if val == x_val])/len(dataFrame[feature]) * 100 for x_val in feat_values]
        # Pass x and y by keyword; recent seaborn releases no longer accept them positionally
        if orient == 'vert': 
            if ordered==1:
                ax = sns.barplot(x=feat_values, y=y, order=ord)
            else:
                ax = sns.barplot(x=feat_values, y=y)
        else:
            # horiz orientation
            if ordered==1:
                ax = sns.barplot(x=y, y=feat_values, order=ord, orient='h')
            else:
                ax = sns.barplot(x=y, y=feat_values, orient='h')
    
    # If the number of distinct values is greater than n, rotate the labels
    n = 3
    if len(feat_values) > n:
        plt.xticks(rotation=90)
    
    if orient == 'vert':
        plt.ylabel(display)
        plt.title(feature + " " + title)
    else:
        plt.ylabel(feature)
        plt.title(display + " " + title)
    
    # If %matplotlib inline is invoked, we don't need to return plt.show()
    #return plt.show()
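
The 'percentage' display above reports each value's share of the rows. As a cross-check, the same numbers can be obtained with pandas' `value_counts(normalize=True)`. This is a minimal sketch on made-up data (the column name `size` is an assumption for the example), not a replacement for `cat_value_dist`.

```python
import pandas as pd

# Toy data; the column name "size" is made up for the example.
df = pd.DataFrame({"size": ["small", "large", "small", "medium", "small", "large"]})

# Share of rows per distinct value, as a percentage,
# sorted from the most frequent value to the least frequent one
pct = df["size"].value_counts(normalize=True) * 100

print(pct.round(1))
```

The index of `pct` is already in descending order of frequency, which is the same ordering the function uses when `ord` is left empty.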

In [None]:
# Single categorical attribute
# Bar Plot
def cat_value_bar(dataFrame, feature, orient='vert', display='percentage', fig_size=(12,8)):
    
    # display = 'percentage' (default) or 'count'
    # orient = 'vert' (default) or 'horiz'
    
    fig, ax = plt.subplots(figsize = fig_size)
    title = fig.suptitle("Distribution by " + feature, fontsize=14)
    fig.subplots_adjust(top=0.85, wspace=0.3)

    # The unique values the feature takes
    feat_values = dataFrame[feature].unique()

    # If the number of distinct values is greater than n, rotate the labels
    n = 3
    if len(feat_values) > n:
        plt.xticks(rotation=90)

    ax.tick_params(axis='both', which='major', labelsize=10)
    bar_color = 'steelblue'
    edge_color = 'black'
    line_width = 1
    
    # Get the values for the bar chart
    w_q = dataFrame[feature].value_counts()
    w_q = (list(w_q.index), list(w_q.values))
    total = sum(w_q[1])
    
    # Get the values as counts or percentages
    if display == 'percentage': 
        vals = [(wq/total)*100 for wq in w_q[1]]
    else:
        vals = w_q[1]
       
    if orient == 'vert':
        ax.set_xlabel(feature)
        ax.set_ylabel(display) 
        bar = ax.bar(w_q[0], 
                     vals, 
                     color=bar_color, 
                     edgecolor=edge_color, 
                     linewidth=line_width)
    else:
        ax.set_xlabel(display)
        ax.set_ylabel(feature) 
        bar = ax.barh(w_q[0], 
                      vals, 
                      color=bar_color, 
                      edgecolor=edge_color, 
                      linewidth=line_width)

In [None]:
# 1 categorical attribute
# Like cat_value_bar above but when we don't have a dataframe to pass into the function
# Plot the counts of a categorical attribute for each of its category names
# Use when the raw counts and names of the categories are available but not in a dataframe format

def plot_cat_counts(cat_counts, cat_names, title=''):
    # cat_counts is a list of counts, e.g., [23, 12, 14]
    # cat_names is a list of categories, e.g., ['small', 'medium', 'large']
    
    # Set up the ticks for the labels
    x_pos = [i for i, _ in enumerate(cat_names)]
    
    fig, ax = plt.subplots(figsize=(8, 5))
    plt.bar(x_pos, cat_counts)
    plt.title(title)
    plt.xticks(x_pos, cat_names)
    
    if len(cat_names) > 3:
        plt.xticks(rotation=90)

### B) 1 Numerical Attribute

In [None]:
# Pyplot Histogram for a single numerical attribute
def simple_hist(dataFrame, num_attribute_name, title=''):
    fig = plt.figure(figsize = (6,4))
    title = fig.suptitle(title, fontsize=14)
    fig.subplots_adjust(top=0.85, wspace=0.3)

    ax = fig.add_subplot(1,1, 1)
    ax.set_xlabel(num_attribute_name)
    ax.set_ylabel("Frequency") 
    ax.text(0.7,0.8, 'mean='+str(round(dataFrame[num_attribute_name].mean(),2)), fontsize=12, transform=ax.transAxes)
    freq, bins, patches = ax.hist(dataFrame[num_attribute_name], 
                                  color='steelblue', 
                                  bins=15,
                                  edgecolor='black', 
                                  linewidth=1)

In [None]:
# Seaborn KDE Plot for a single numerical attribute
def simple_density(dataFrame, num_attribute_name, title=''):
    fig = plt.figure(figsize = (6, 4))
    title = fig.suptitle(title, fontsize=14)
    fig.subplots_adjust(top=0.85, wspace=0.3)

    ax1 = fig.add_subplot(1,1, 1)
    ax1.set_xlabel(num_attribute_name)
    ax1.set_ylabel("Density") 
    # fill= replaces the deprecated shade= argument (seaborn >= 0.11)
    sns.kdeplot(dataFrame[num_attribute_name], ax=ax1, fill=True, color='steelblue')

In [1]:
# 1 numerical attribute
# Skyline of a single numerical feature
def num_skyline(dataFrame, num_feature_name):
    feat_vals = dataFrame[num_feature_name]
    # Remove feat_vals that are strings
    feat_vals = [val for val in feat_vals if type(val) != str]
    feat_vals_sorted = sorted(feat_vals)
    #feat_vals_sorted = np.array(feat_vals.sort_values())
    feat_vals_freq = [len(list(group)) for key, group in groupby(feat_vals_sorted)]
    feat_labels = np.unique(feat_vals_sorted)
    
    # Create the plot
    fig, ax = plt.subplots(figsize=(10,6))
    height = feat_vals_freq
    bars = feat_labels
    y_pos = np.arange(len(bars))
 
    # Create bars
    plt.bar(y_pos, height)

    # Add title and axis names
    plt.xlabel(num_feature_name)
    plt.ylabel("Count")

 
    # Create names on the x-axis
    plt.xticks(y_pos, feat_labels, rotation=90);
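
The skyline counts rely on `itertools.groupby`, which only merges adjacent equal items; that is why the values are sorted before grouping. The sketch below shows the counting step in isolation on made-up values and compares it with `collections.Counter`, which gives the same counts without a sort.

```python
from collections import Counter
from itertools import groupby

values = [3, 1, 2, 3, 3, 1]  # made-up data

# groupby only merges adjacent equal items, so the data must be sorted first
values_sorted = sorted(values)
labels = [key for key, _ in groupby(values_sorted)]
freqs = [len(list(group)) for _, group in groupby(values_sorted)]

# Counter gives the same counts without requiring a sort
counts = Counter(values)

print(labels, freqs)
```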

In [None]:
# 1 numerical attribute
# Bar plot with customized bin widths
def num_skyline_hist(dataFrame, 
                     num_feature_name,
                     bin_width=1, 
                     bin_min=0, 
                     bin_max=5):   
    
    # bin_width is the width of each bin in the histogram
    # bin_min is the lowest value of the feature
    # bin_max is the highest value of the feature
    
    feat_vals = dataFrame[num_feature_name]
    feat_vals_sorted = np.array(feat_vals.sort_values())
    feat_vals_freq = [len(list(group)) for key, group in groupby(feat_vals_sorted)]
    #feat_labels = np.unique(feat_vals_sorted)
    
    bin_freqs = []
    bin_labels = []
    while bin_min < bin_max:
        bin_next = round(bin_min + bin_width, 2) # round to 2 decimal places
        #print(bin_next)
        bin_label = ' '.join([str(bin_min), 'to', str(bin_next)])
        bin_labels.append(bin_label)
        vals_in_bin = len([item for item in feat_vals_sorted if (item >= bin_min) & (item < bin_next)])
        #print(vals_in_bin)
        bin_freqs.append(vals_in_bin)
        bin_min = bin_next
    
    
    # Create the plot
    fig, ax = plt.subplots(figsize=(12,8))
    height = bin_freqs
    bars = bin_labels
    y_pos = np.arange(len(bars))
 
    # Create bars
    plt.bar(y_pos, height)

    # Add title and axis names
    plt.xlabel(num_feature_name)
    plt.ylabel("Frequency")

 
    # Create names on the x-axis
    plt.xticks(y_pos, bars, rotation=90);
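
The loop in `num_skyline_hist` counts values in half-open bins `[bin_min, bin_next)`, so a value exactly equal to `bin_max` is not counted. The sketch below, on made-up values, compares that rule with `np.histogram`, whose last bin also includes its right edge. It is meant as a way to sanity-check bin counts, assuming integer bin edges from 0 to 5.

```python
import numpy as np

values = np.array([0.5, 1.0, 1.5, 4.9, 5.0])  # made-up data
edges = np.arange(0, 5 + 1, 1)                # bin edges 0, 1, ..., 5

# np.histogram: every bin is half-open except the last, which includes its right edge
closed_last, _ = np.histogram(values, bins=edges)

# Same rule as the loop above: every bin is half-open, [lo, hi)
half_open = [int(((values >= lo) & (values < hi)).sum())
             for lo, hi in zip(edges[:-1], edges[1:])]

print(closed_last.tolist(), half_open)
```

The two results differ only in the last bin, where the value 5.0 is counted by `np.histogram` but not by the half-open rule.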

In [None]:
# 1 numerical attribute
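# Histogram of a single numerical feature, optionally overlaid
# with a kernel density estimate (kde)
def num_hist(dataFrame, num_feature_name, kde=False):
    # dataFrame is the entire dataset
    # num_feature_name is the name of a single numerical feature, e.g., 'numerical_feature'
    fig, ax = plt.subplots(figsize=(12,8))
    # histplot replaces the deprecated distplot (seaborn >= 0.11);
    # the density scale is used together with the KDE, as distplot did
    sns.histplot(dataFrame[num_feature_name], kde=kde, stat='density' if kde else 'count', ax=ax)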

## 2 Dimensions

### C) 2 Categorical Attributes

In [None]:
# 2 categorical attributes with one of the categorical attributes having just 2 possible values
## Side by side bar graphs to represent the categorical attribute with 2 possible values
## Each bar graph represents the frequency of values for the second categorical attribute
# Using subplots or facets along with Bar Plots

def cat_bars_sbs(dataFrame, cat_attr_bivalued, cat_attr):
    
    color_1 = '#FF9999'
    color_2 = '#FFE888'
    edge_color = 'black'
    line_width = 1
    label_size = 8.5
    
    fig = plt.figure(figsize = (12, 6))
    title = fig.suptitle(cat_attr_bivalued + " - " + cat_attr, fontsize=14)
    fig.subplots_adjust(top=0.85, wspace=0.3)
    
    # Get the 2 values of the cat_attr_bivalued
    bivalues = list(dataFrame[cat_attr_bivalued].unique())
    
    # Get the unique values for the cat_attr
    cat_attr_vals = list(dataFrame[cat_attr].unique())
    n = 4 # threshold for when labels should flip to vertical orientation
    
    # First category of the cat_attr_bivalued - cat_attr
    ax1 = fig.add_subplot(1,2, 1)
    ax1.set_title(bivalues[0])
    ax1.set_xlabel(cat_attr)
    ax1.set_ylabel("Frequency") 
    rw_q = dataFrame[dataFrame[cat_attr_bivalued] == bivalues[0]][cat_attr].value_counts()
    rw_q = (list(rw_q.index), list(rw_q.values))
    #ax1.set_ylim([0, 2500])
    ax1.tick_params(axis='both', which='major', labelsize=label_size)
    bar1 = ax1.bar(rw_q[0], 
                   rw_q[1], 
                   color=color_1, 
                   edgecolor=edge_color, 
                   linewidth=line_width)

    # If the number of bar_names is greater than n, rotate the labels
    if len(cat_attr_vals) > n:
        plt.xticks(rotation=90)

    # Second category of the cat_attr_bivalued - cat_attr
    ax2 = fig.add_subplot(1,2, 2)
    ax2.set_title(bivalues[1])
    ax2.set_xlabel(cat_attr)
    ax2.set_ylabel("Frequency") 
    ww_q = dataFrame[dataFrame[cat_attr_bivalued] == bivalues[1]][cat_attr].value_counts()
    ww_q = (list(ww_q.index), list(ww_q.values))
    #ax2.set_ylim([0, 2500])
    ax2.tick_params(axis='both', which='major', labelsize=label_size)
    bar2 = ax2.bar(ww_q[0], 
                   ww_q[1], 
                   color=color_2, 
                   edgecolor=edge_color, 
                   linewidth=line_width)
    
    # If the number of bar_names is greater than n, rotate the labels
    if len(cat_attr_vals) > n:
        plt.xticks(rotation=90)

In [None]:
# 2 categorical attributes
# Seaborn Multi-Bar Plot
def cat_multibar(dataFrame, cat_attr_x, cat_attr_hue):
    
    num_x_vals = len(dataFrame[cat_attr_x].unique())
    
    fig, ax = plt.subplots(figsize=(12, 8))
    cp = sns.countplot(x=cat_attr_x, 
                       hue=cat_attr_hue, 
                       data=dataFrame)
    if num_x_vals > 3:
        plt.xticks(rotation=90)

In [None]:
# 2 categorical attributes
# Contingency table to track the relationship between any two categorical variables
def contingency_table(dataFrame, row_feat, col_feat):
    # dataFrame is the complete dataset
    # row_feat is the feature whose values are displayed as rows
    # col_feat is the feature whose values are displayed across columns
    ct = pd.crosstab(index=dataFrame[row_feat], 
                     columns=dataFrame[col_feat]
                    )

    return ct
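
`contingency_table` returns raw counts. When the row totals differ a lot, row proportions are often easier to compare; `pd.crosstab` supports this through its `normalize` argument. A minimal sketch on made-up data (the column names `size` and `color` are assumptions for the example):

```python
import pandas as pd

# Toy data; column names are made up for the example.
df = pd.DataFrame({
    "size":  ["small", "small", "large", "large", "large"],
    "color": ["red",   "blue",  "red",   "red",   "blue"],
})

# Raw counts, as returned by contingency_table above
counts = pd.crosstab(index=df["size"], columns=df["color"])

# Row proportions: each row sums to 1
row_props = pd.crosstab(index=df["size"], columns=df["color"], normalize="index")

print(counts)
print(row_props.round(2))
```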

In [None]:
# 2 categorical attributes
# Plot a contingency table as a stacked bar chart
def plot_contingency_table(dataFrame, row_feat, col_feat, stacked=True):
    ct = contingency_table(dataFrame, row_feat, col_feat)
    # For horizontal chart use kind='barh'
    # For vertical chart use kind='bar'
    ct.plot(kind="barh", 
            figsize=(10,8),
            stacked=stacked
           )

In [None]:
# 2 categorical attributes
# Another way to visualize the relationship between 2 categorical features
# Requires the plotnine package

def cat_2_bars(dataFrame, x_feat, y_feat):
    disp = (ggplot(dataFrame, aes(x=x_feat, fill=y_feat)) \
            + geom_bar(position='fill') \
            + ylab('Percentage') \
            + theme(axis_text_x=element_text(rotation=90, hjust=1)))
    
    return disp

### D) 2 Numerical Attributes

In [None]:
# 2 numerical attributes -- correlation table
# Seaborn Correlation Matrix Heatmap
## NOTE: Will use all numerical attributes in dataFrame. Restrict to certain list if needed by 
## restricting the columns of the dataFrame that's exposed. E.g., dataFrame[[col1, col2, ..., colN]]
def corr_heatmap(dataFrame):
    f, ax = plt.subplots(figsize=(10, 6))
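    # Keep only numerical columns; newer pandas versions raise on non-numeric columns in corr()
    corr = dataFrame.select_dtypes(include='number').corr()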
    hm = sns.heatmap(round(corr,2), 
                     annot=True, 
                     ax=ax, 
                     cmap="coolwarm",
                     fmt='.2f', 
                     linewidths=.05)
    f.subplots_adjust(top=0.93)
    t= f.suptitle('Correlation Heatmap', fontsize=14)
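
The heatmap is built from the pairwise correlation matrix of the numerical columns. The sketch below shows that step on its own with made-up data, restricting the frame to numerical columns before calling `corr()`. The column names are assumptions for the example.

```python
import pandas as pd

# Toy data; "label" is a non-numerical column that must be left out.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [3, 5, 7, 9],    # y = 2x + 1, perfectly correlated with x
    "z": [4, 3, 2, 1],    # z decreases as x increases
    "label": ["a", "b", "a", "b"],
})

# Pearson correlation between every pair of numerical columns
corr = df.select_dtypes(include="number").corr()

print(corr.round(2))
```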

In [None]:
# Pairs of 2 numerical attributes displayed side by side 
# Seaborn Pair-wise Scatter Plots
## Choose n numerical attributes in col_names to create an n x n table
## e.g., col_names = ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity']
def num_pair_scatter_sbs(dataFrame, col_names, title=''):
    pp = sns.pairplot(dataFrame[col_names], 
                      height=1.8, 
                      aspect=1.8,
                      plot_kws=dict(edgecolor="k", linewidth=0.5),
                      diag_kind="kde", 
                      diag_kws=dict(fill=True))  # fill= replaces the deprecated shade=

    fig = pp.fig 
    fig.subplots_adjust(top=0.93, wspace=0.3)
    t = fig.suptitle(title, fontsize=14)

In [None]:
# 2 numerical attributes
# Matplotlib Scatter Plot
def num_pair_scatter(dataFrame, num_attr_x, num_attr_y, title=''):
    plt.scatter(dataFrame[num_attr_x], 
                dataFrame[num_attr_y],
                alpha=0.4, 
                edgecolors='w'
               )

    plt.xlabel(num_attr_x)
    plt.ylabel(num_attr_y)
    plt.title(title, y=1.05)

In [None]:
# 2 numerical attributes
# Seaborn Joint Plot
def num_pair_joint(dataFrame, num_attr_x, num_attr_y, title=''):
    jp = sns.jointplot(x=num_attr_x, 
                       y=num_attr_y, 
                       data=dataFrame,
                       kind='reg', 
                       space=0, 
                       height=5, 
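                       ratio=4)
    # Show the title above the joint plot (the title argument was otherwise unused)
    jp.fig.suptitle(title, y=1.02)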

In [None]:
# 2 numerical attributes
# Simple contour plot of the KDE for any two numerical features
def kde_contour(dataFrame, num_feature_1, num_feature_2):
    # dataFrame is the entire dataset
    # num_feature_1 and 2 are individual numerical feature names, e.g., 'numerical_feature'
    # Pass x and y by keyword; recent seaborn releases no longer accept two positional vectors
    sns.kdeplot(x=dataFrame[num_feature_1], y=dataFrame[num_feature_2])

In [None]:
# 2 numerical attributes
# KDE contour + distribution for any two numerical features
def kde_contour_dist(dataFrame, num_feature_1, num_feature_2, kind='kde'):
    # dataFrame is the entire dataset
    # num_feature_1 and 2 are individual numerical feature names, e.g., 'numerical_feature'
    # kind = 'kde' or 'hex'
    with sns.axes_style('white'):
        sns.jointplot(x=dataFrame[num_feature_1], y=dataFrame[num_feature_2], kind=kind)

### E) 1 Categorical + 1 Numerical Attribute

In [None]:
# 1 categorical attribute + 1 numerical attribute
# Looking at how a single numerical feature varies across a single categorical feature
# Box plot display
def box_plot(dataFrame, cat_feature, num_feature, orient='vert'):
    # dataFrame is the entire dataset
    # cat_feature is the name of a single categorical feature 
    # num_feature is the name of a single numerical feature
    fig, ax = plt.subplots(1, 1, figsize=(12,6))
    fig.suptitle(num_feature + " by " + cat_feature, fontsize=14)
    
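    # Axis labels follow the orientation: the categorical feature is on x when vertical, on y when horizontal
    x_label, y_label = cat_feature, num_feature
    if orient == 'vert':
        ax = sns.boxplot(x=dataFrame[cat_feature], y=dataFrame[num_feature], palette="Set2")
    elif orient == 'horiz':
        ax = sns.boxplot(x=dataFrame[num_feature], y=dataFrame[cat_feature], palette="Set2")
        x_label, y_label = num_feature, cat_feature
    
    ax.set_xlabel(x_label,size = 12,alpha=0.8)
    ax.set_ylabel(y_label,size = 12,alpha=0.8)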
    
    if len(dataFrame[cat_feature].unique()) > 3:
        plt.xticks(rotation=90);
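
Each box in a box plot spans the first to the third quartile of the numerical feature within a category, with a line at the median. The numbers behind the boxes can be inspected directly with a pandas group-by. A minimal sketch on made-up data (the column names `group` and `value` are assumptions for the example):

```python
import pandas as pd

# Toy data; column names are made up for the example.
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 3,
    "value": [1, 2, 3, 4, 5, 10, 20, 30],
})

# One row per category, one column per quantile (0.25, 0.5, 0.75)
quartiles = df.groupby("group")["value"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of each box
iqr = quartiles[0.75] - quartiles[0.25]

print(quartiles)
print(iqr)
```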

In [None]:
# 1 categorical attribute + 1 numerical attribute
# Looking at how a single numerical feature varies across a single categorical feature
# Violin plot display
def violin_plot(dataFrame, cat_feature, num_feature):
    f, ax = plt.subplots(1, 1, figsize=(12, 4))
    f.suptitle(cat_feature + " - " + num_feature, fontsize=14)

    sns.violinplot(x=cat_feature, y=num_feature, data=dataFrame,  ax=ax)
    ax.set_xlabel(cat_feature,size = 12,alpha=0.8)
    ax.set_ylabel(num_feature,size = 12,alpha=0.8)
    
    if len(dataFrame[cat_feature].unique()) > 3:
        plt.xticks(rotation=90);

In [None]:
# 1 categorical attribute + 1 numerical attribute
# Looking at how a single numerical feature varies across a single categorical feature
# Jitter plot display
def jitter_plot(dataFrame, cat_feature, num_feature, orient='horiz', fig_w=10, fig_h=6):
    # dataFrame is the entire dataset
    # cat_feature is the name of a single categorical feature 
    # num_feature is the name of a single numerical feature
    fig, ax = plt.subplots(figsize=(fig_w,fig_h))
    if orient == 'vert':
        ax = sns.stripplot(x=cat_feature, y=num_feature, data=dataFrame, jitter=0.1)
    elif orient == 'horiz':
        ax = sns.stripplot(y=cat_feature, x=num_feature, data=dataFrame, jitter=0.1)
    
    if len(dataFrame[cat_feature].unique()) > 3:
        plt.xticks(rotation=90);

In [None]:
# 1 categorical + 1 numerical attribute (time steps)
# Time series for a single numerical feature
def time_series_plot(dataFrame, time_feature_name, num_feature_name):
    # dataFrame is the entire dataset
    # time_feature_name is the name of the time feature, e.g., 'PUBLISHED_DATE'
    # num_feature_name is the name of the numerical feature that evolves in time, e.g., 'COUNT_IVR'
    
    # First sort the dataframe in ascending order of the time_feature_name
    df_sorted = dataFrame.sort_values(by=[time_feature_name])
    
    fig, ax = plt.subplots(figsize=(10,6))
    x = df_sorted[time_feature_name]
    y = df_sorted[num_feature_name]
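    # Label the line so that plt.legend() has an entry to show
    plt.plot(x, y, marker='o', label=num_feature_name)
    #ax.tick_params(labelbottom='off') # turn off the x axis tick labels
    plt.xticks(rotation=90) # rotate the x axis tick labels
    ax.set_xlabel(time_feature_name)
    ax.set_ylabel(num_feature_name)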
    plt.legend()
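
`time_series_plot` sorts the rows by the time column before plotting. If that column holds dates as strings, the sort is alphabetical rather than chronological, so it can help to convert the column with `pd.to_datetime` first. The sketch below uses the example column names from the comments above with made-up values, and assumes a month/day/year string format.

```python
import pandas as pd

# Made-up values; dates stored as month/day/year strings
df = pd.DataFrame({
    "PUBLISHED_DATE": ["10/01/2021", "02/01/2021", "01/15/2022"],
    "COUNT_IVR": [3, 5, 7],
})

# Sorting the strings orders them alphabetically, not chronologically
by_text = df.sort_values(by=["PUBLISHED_DATE"])["COUNT_IVR"].tolist()

# After conversion to datetime, the same sort is chronological
df["PUBLISHED_DATE"] = pd.to_datetime(df["PUBLISHED_DATE"], format="%m/%d/%Y")
by_time = df.sort_values(by=["PUBLISHED_DATE"])["COUNT_IVR"].tolist()

print(by_text, by_time)
```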

In [None]:
# 1 numerical attribute + 1 categorical attribute (time step)
# Time series bar graph
# Example: Number of inquiries by month
# bar_heights is a list of values, one per time step
# bar_names is the list of time-step labels, in the same order as bar_heights
def bar_graph_timeseries(bar_heights, bar_names, plt_title, orient='vert'):
    # orient can be 'vert' (default) or 'horiz'
    
    fig, ax = plt.subplots(figsize=(8,6))
    # To make the plot vertical, put bar_names on x and bar_heights on y
    # To make the plot horizontal, swap them
    
    plt.title(plt_title)
    value_label = 'count'
    
    # Pass x and y by keyword; recent seaborn releases no longer accept them positionally
    if orient == 'vert': 
        ax = sns.barplot(x=bar_names, y=bar_heights)
        # If the number of bar_names is greater than n, rotate the labels
        n = 4
        if len(bar_names) > n:
            plt.xticks(rotation=90)
        plt.ylabel(value_label)
    elif orient == 'horiz':
        # horiz orientation
        ax = sns.barplot(x=bar_heights, y=bar_names, orient='h')
        plt.xlabel(value_label)
    
    # If %matplotlib inline is invoked, we don't need to return plt.show()
    #return plt.show()

## 3 Dimensions

### F) 1 Categorical + 2 Numerical Attributes

In [None]:
# 1 categorical attribute + 2 numerical attributes
# Scatter plot of two numerical features grouped by a categorical feature
def scatter_plot(dataFrame, 
                 num_feat_x, 
                 num_feat_y, 
                 cat_feat, 
                 height=10, 
                 aspect=2.5, 
                 xlim=(0,160), 
                 ylim=(0,1)
                ):
    
    # height =  height of the plot
    # width = height * aspect
    
    # Uses the list xkcd_colors that is defined in Shared/Inlcude-Setup-Functions.ipynb
    # Use the 'hue' argument to provide a factor variable
    g = sns.lmplot(x=num_feat_x, 
               y=num_feat_y, 
               data=dataFrame, 
               fit_reg=False, 
               hue=cat_feat,
               height=height, 
               aspect=aspect,
               legend_out=True, 
               palette=sns.xkcd_palette(xkcd_colors), 
               scatter_kws={'s':200}
              )
    # Set the axis ranges if needed
    g.set(xlim=xlim)
    g.set(ylim=ylim)

In [None]:
# 1 categorical attribute + 2 numerical attributes
# Scatter plot of two numerical features grouped by a categorical feature
## A variant of scatter_plot that allows annotation
def scatter_plot_annot(dataFrame, 
                       num_feat_x, 
                       num_feat_y, 
                       cat_feat, 
                       fig_w=25, 
                       fig_h=10, 
                       xlim=(0,160), 
                       ylim=(0,1), 
                       num_title_chars=30, 
                       text_size='small', 
                       plt_title=''
                      ):
    
    # The annotation for each row, taken from cat_feat
    # Truncate each item for better display
    cat_titles = dataFrame[cat_feat].str.slice(0,num_title_chars).values
    
    # Use the fig_w and fig_h arguments for the figure size
    fig, ax = plt.subplots(figsize=(fig_w,fig_h))
    # basic plot
    p1=sns.regplot(data=dataFrame, 
                   x=dataFrame[num_feat_x], 
                   y=dataFrame[num_feat_y], 
                   fit_reg=False, 
                   marker="o", 
                   color="blue",
                   scatter_kws={'s':100}, 
                   ax=ax
                  )

    # add annotations one by one with a loop
    for line in range(0,dataFrame.shape[0]):
        p1.text(dataFrame[num_feat_x].values[line]+0.015, 
                dataFrame[num_feat_y].values[line]+0.015, 
                cat_titles[line], 
                horizontalalignment='left', 
                size=text_size, 
                color='black', 
                weight='normal')
    
    # Add the plot title
    ax.set_title(plt_title)
    
    # Set the axis ranges if needed
    ax.set(xlim=xlim)
    ax.set(ylim=ylim)

In [None]:
# 1 categorical attribute (with a large number of possible values) + 2 numerical attributes
# Scatter plot that accommodates the classification of the scatter dots into a large number of items
# Use when the number of the items in a group is > 5
def scatter_plot_large(dataFrame, 
                       num_feat_x, 
                       num_feat_y, 
                       cat_feat, 
                       slice_num, 
                       slicing_feat='AUTHOR_NAME', 
                       fig_w=14, 
                       fig_h=10, 
                       text_size='small', 
                       num_title_chars=25, 
                       xlim=(0, 70), 
                       ylim=(0, 130)
                      ):
    # dataFrame is the entire dataset
    # num_feat_x is the numerical feature for the x axis
    # num_feat_y is the numerical feature for the y axis
    # cat_feat is the categorical feature by which the dots are grouped
    # slice_num is the index number of the slice for which we want to create the scatter plot
    ## For example, slice_list[0] might be CIOLC, slice_list[1] might be ALC, etc.
    # slicing_feat is the slicing to be applied to the entire dataset; for example, 
    ## 'AUTHOR_NAME' slices the data set by creating a data frame for each AUTHOR_NAME which 
    ## might be the name of a functional practice (e.g., CIO, Applications, Infrastructure, ...)
    ## See IgnitionGuide visualizations for examples
    
    # The names of the various items by which to slice the dataFrame
    ## Typically, these slices will be slices by the leadership councils (slicing_feat='AUTHOR_NAME')
    slice_list = np.unique(dataFrame[slicing_feat].values)

    # Rows of data for a given item in the slice_list
    df_slice = dataFrame[dataFrame[slicing_feat] == slice_list[slice_num]]
    
    # Get the title of the plot
    plt_title = np.unique(df_slice[slicing_feat].values)[0]

    # The items in the cat_feat for the given df_slice
    # Truncate each item for better display
    cat_titles = df_slice[cat_feat].str.slice(0,num_title_chars).values

    fig, ax = plt.subplots(figsize=(fig_w,fig_h))
    # basic plot
    p1=sns.regplot(data=df_slice, 
                   x=df_slice[num_feat_x], 
                   y=df_slice[num_feat_y], 
                   fit_reg=False, 
                   marker="o", 
                   color="blue",
                   scatter_kws={'s':100}, 
                   ax=ax
                  )

    # add annotations one by one with a loop
    for line in range(0,df_slice.shape[0]):
        p1.text(df_slice[num_feat_x].values[line]+0.4, 
                df_slice[num_feat_y].values[line]+0.2, 
                cat_titles[line], 
                horizontalalignment='left', 
                size=text_size, 
                color='black', 
                weight='normal')

    #ax.set_title(plt_title)
    
    # Set the axes ranges if needed
    ax.set(xlim=xlim)
    ax.set(ylim=ylim)

In [None]:
# 1 categorical + 2 numerical attributes
# leveraging the concepts of hue for categorical dimension
def scatter_pairplot(dataFrame, num_attr_x, num_attr_y, cat_attr):
    jp = sns.pairplot(dataFrame, 
                      x_vars=[num_attr_x], 
                      y_vars=[num_attr_y], 
                      height=4.5,
                      hue=cat_attr, 
                      plot_kws=dict(edgecolor="k", linewidth=0.5))

In [None]:
# 1 categorical attribute + 2 numerical attributes
# Visualizing 2 numerical attributes and 1 categorical attribute using kernel density plots
# leveraging the concepts of hue for categorical dimension
def density_pairplot(dataFrame, num_attr_x, num_attr_y, cat_attr):
    
    # Get the possible values the cat_attr can take
    cat_values = list(dataFrame[cat_attr].unique())
    
    fig, ax = plt.subplots(figsize=(8,6))
    # NOTE: only the first two values of cat_attr are plotted
    # x=/y= keywords, fill= and thresh= replace the older positional vectors, shade= and shade_lowest= (seaborn >= 0.11)
    ax = sns.kdeplot(x=dataFrame[dataFrame[cat_attr] == cat_values[0]][num_attr_x], 
                     y=dataFrame[dataFrame[cat_attr] == cat_values[0]][num_attr_y],
                     cmap="YlOrBr", 
                     fill=True, 
                     thresh=0.05,
                     label=cat_values[0], 
                     legend=True)

    ax = sns.kdeplot(x=dataFrame[dataFrame[cat_attr] == cat_values[1]][num_attr_x], 
                     y=dataFrame[dataFrame[cat_attr] == cat_values[1]][num_attr_y],
                     cmap="Reds", 
                     fill=True, 
                     thresh=0.05, 
                     label=cat_values[1], 
                     legend=True)

### G) 2 Categorical + 1 Numerical Attribute

In [None]:
# 2 categorical attributes + 1 numerical attribute
# Looking at how a single numerical feature varies across two categorical features
# Grouped boxplot display
def grouped_boxplot(dataFrame, x_cat_feature, y_num_feature, z_cat_feature):
    fig, ax = plt.subplots(figsize=(14,8))
    sns.boxplot(x=x_cat_feature, 
                y=y_num_feature, 
                hue=z_cat_feature, 
                data=dataFrame, 
                palette="Set3"
               )
    
    ax.legend(loc='upper right')
    
    if len(dataFrame[x_cat_feature].unique()) > 3:
        plt.xticks(rotation=90);

In [1]:
# 2 categorical attributes + 1 numerical attribute per plot
# The 2 plots show the same numerical attribute grouped by two different categorical attributes
# (cat_attr_x1 on the left, cat_attr_x2 on the right), each split by the values of cat_attr_slice.
# The side by side display is not for adding a dimension but rather for comparison of how a numerical attribute behaves
## under each of the two groupings.
# Visualizing 2 categorical and 1 numerical attribute using box plots
# leveraging the concepts of hue and axes for > 1 categorical dimensions

def boxplot_comp(dataFrame, cat_attr_x1, cat_attr_x2, cat_attr_slice, num_attr):
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))
    f.suptitle(cat_attr_slice + " - " + cat_attr_x1 + " - " + cat_attr_x2 + " - " + num_attr, fontsize=14)

    cat_attr_x1_vals = dataFrame[cat_attr_x1].unique()
    cat_attr_x2_vals = dataFrame[cat_attr_x2].unique()
    
    sns.boxplot(x=cat_attr_x1, 
                y=num_attr, 
                hue=cat_attr_slice,
                data=dataFrame, 
                ax=ax1)
    ax1.set_xlabel(cat_attr_x1,size = 12,alpha=0.8)
    ax1.set_ylabel(num_attr,size = 12,alpha=0.8)
    
    # Rotate the tick labels on the axes they belong to
    # (plt.xticks only affects the current axes, which is ax2 here)
    if len(cat_attr_x1_vals) > 3:
        ax1.tick_params(axis='x', labelrotation=90)
    
    sns.boxplot(x=cat_attr_x2, 
                y=num_attr, 
                hue=cat_attr_slice,
                data=dataFrame, 
                ax=ax2)
    ax2.set_xlabel(cat_attr_x2,size = 12,alpha=0.8)
    ax2.set_ylabel(num_attr,size = 12,alpha=0.8)
    
    if len(cat_attr_x2_vals) > 3:
        ax2.tick_params(axis='x', labelrotation=90)
    
    ax2.legend(loc='upper right', title=cat_attr_slice)

In [None]:
# 2 categorical attributes + 1 numerical attribute
# The 2 plots compare the overall values to the values by category
# The side by side display is not for adding a dimension but rather for comparison of how a numerical attribute behaves
## overall versus in a certain category.

def quartile_violin_comp(dataFrame, cat_attr_x, cat_attr_slice, num_attr): 
    # Visualizing 2 categorical and one numerical attribute using violin plots
    # leveraging the concepts of hue and axes for > 1 categorical dimensions
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
    f.suptitle(cat_attr_slice + " - " + cat_attr_x + " - " + num_attr, fontsize=14)

    sns.violinplot(x=cat_attr_x, 
                   y=num_attr,
                   data=dataFrame, 
                   inner="quart", 
                   linewidth=1.3,
                   ax=ax1)
    ax1.set_xlabel(cat_attr_x,size = 12,alpha=0.8)
    ax1.set_ylabel(num_attr,size = 12,alpha=0.8)

    sns.violinplot(x=cat_attr_x, 
                   y=num_attr, 
                   hue=cat_attr_slice, 
                   data=dataFrame, 
                   split=False, 
                   inner="quart", 
                   linewidth=1.3,
                   ax=ax2)
    ax2.set_xlabel(cat_attr_x,size = 12,alpha=0.8)
    ax2.set_ylabel(num_attr,size = 12,alpha=0.8)
    
    l = plt.legend(loc='upper right', title=cat_attr_slice)

### H) 3 Categorical Attributes

In [None]:
# 3 categorical attributes
# Visualize the relationship between 3 categorical variables
## Requires the plotnine package ## Consider Deprecating

def cat_3_bars(dataFrame, x_feat, y_feat, z_feat):
    disp = (ggplot(dataFrame, aes(x=x_feat, fill=y_feat)) \
            + geom_bar(position='fill') \
            + facet_wrap('~' + z_feat) \
            + ylab('Percentage') \
            + theme(axis_text_x=element_text(rotation=90, hjust=1))
           )
    
    return disp

In [None]:
# 3 categorical attributes
# Side by side bar plots for distribution for a categorical attribute across 2 other categorical attributes
# Visualizing 3 categorical attributes using bar plots
# leveraging the concepts of hue and facets
def cat_3_sbs(dataFrame, cat_attr_x, cat_attr_slice_1, cat_attr_slice_2):
    
    #fig, ax = plt.subplots(figsize=(12, 6))
    cat_attr_x_vals = dataFrame[cat_attr_x].unique()
    
    fig = sns.catplot(x=cat_attr_x, 
                     hue=cat_attr_slice_1, 
                     col=cat_attr_slice_2, 
                     data=dataFrame, 
                     kind="count")
    
    if len(cat_attr_x_vals) > 3:
        fig.set_xticklabels(rotation=90)
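
With `kind="count"`, the height of each bar is the number of rows sharing one combination of the x value, the hue value, and the column-facet value. The plain-Python sketch below reproduces that tally on a few made-up records so the bar heights can be checked by hand. The field names (`region`, `type`, `quality`) and the records are invented for illustration and are not part of this notebook.

```python
from collections import Counter

# Hypothetical records: 'region' plays the col facet, 'type' the x axis, 'quality' the hue
rows = [
    {"region": "north", "type": "red", "quality": "high"},
    {"region": "north", "type": "red", "quality": "high"},
    {"region": "north", "type": "white", "quality": "low"},
    {"region": "south", "type": "red", "quality": "low"},
]

# One counter entry per (facet, x, hue) combination, i.e. one bar in the faceted count plot
counts = Counter((r["region"], r["type"], r["quality"]) for r in rows)

for (region, wine_type, quality), n in sorted(counts.items()):
    print(region, wine_type, quality, n)
```

Combinations that never occur have no entry in the counter, so looking them up gives a count of zero.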

### I) 3 Numerical Attributes

In [None]:
# 3 numerical attributes
# Bubble chart showing the relationship between any three numerical features
def bubble_chart(dataFrame, x_feature, y_feature, bubble_size_feature, scale_down=0):
    # dataFrame is the entire dataset
    # x_feature and y_feature are numerical features on the x and y axis respectively
    # bubble_size_feature is represented by the size of the bubble
    
    fig, ax = plt.subplots(figsize=(10, 8))
    x = dataFrame[x_feature]
    y = dataFrame[y_feature]
    
    if scale_down == 1:
        bubble_size = dataFrame[bubble_size_feature]*50. # scale down bubble size
    else:
        bubble_size = dataFrame[bubble_size_feature]*500 # scale up bubble size
    
    plt.scatter(x, y, s=bubble_size, c=x, cmap="Blues", alpha=0.4, edgecolors="orange", linewidth=2)
    plt.xlabel(x_feature)
    plt.ylabel(y_feature)
    plt.title("Bubble Size = " + bubble_size_feature)

In [None]:
# 3 numerical attributes
## NOTE: the attributes need not strictly be all numerical ## 
# Visualizing 3-D numeric data with a bubble chart
# length, breadth and size
def simple_bubble_chart(dataFrame, num_attr_x, num_attr_y, num_attr_bubble_size):
    fig, ax = plt.subplots(figsize=(12,8))
    plt.scatter(dataFrame[num_attr_x], 
                dataFrame[num_attr_y], 
                s=dataFrame[num_attr_bubble_size]*25, 
                alpha=0.4, 
                edgecolors='w')

    plt.xlabel(num_attr_x)
    plt.ylabel(num_attr_y)
    plt.title("Bubble Size = " + num_attr_bubble_size,y=1.05)

In [None]:
# 3 numerical attributes
# Visualizing 3-D numeric data with Scatter Plots (unlike bubble charts, all points are same size here)
# length, breadth and depth
def threeD_scatter(dataFrame, num_attr_x, num_attr_y, num_attr_z, color='blue'):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')

    xs = dataFrame[num_attr_x]
    ys = dataFrame[num_attr_y]
    zs = dataFrame[num_attr_z]
    ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w', color=color)

    ax.set_xlabel(num_attr_x)
    ax.set_ylabel(num_attr_y)
    ax.set_zlabel(num_attr_z)

## 4 Dimensions

### J) 2 Categorical Attributes + 2 Numerical Attributes

In [None]:
# 2 categorical attributes + 2 numerical attributes
# Side by side view provides a categorical dimension
# Visualizing 2 numerical and 2 categorical attributes using facets and scatter plots
# leveraging the concepts of hue and facets for > 1 categorical attributes
def scatter_sbs(dataFrame, cat_attr_1, cat_attr_2, num_attr_x, num_attr_y):   
    # To order the cat_attr_1 values, use col_order=['value1', ..., 'valueN'] as a directive in sns.FacetGrid
    # To order the cat_attr_2 values, use hue_order=['valueA', ..., 'valueR'] as a directive in sns.FacetGrid
    g = sns.FacetGrid(dataFrame, 
                      col=cat_attr_1, 
                      hue=cat_attr_2, 
                      aspect=1.2, 
                      height=3.5)

    g.map(plt.scatter, 
          num_attr_x, 
          num_attr_y, 
          alpha=0.9, 
          edgecolor='white', 
          linewidth=0.5, 
          s=100)

    fig = g.fig 
    fig.subplots_adjust(top=0.8, wspace=0.3)
    #fig.suptitle('Wine Type - Alcohol - Quality - Acidity', fontsize=14)
    l = g.add_legend(title=cat_attr_2)

### K) 1 Categorical Attribute + 3 Numerical Attributes

In [None]:
# 1 categorical attribute + 3 numerical attributes
# Visualizing 3 numerical and 1 categorical attribute using a 3D scatter plot and 
# leveraging the concepts of hue
## CAUTION - TAKES TIME TO RUN ##

# Uses the list xkcd_colors defined in Shared/Include-Setup-Functions.ipynb
def scatter_4d(dataFrame, num_attr_x, num_attr_y, num_attr_z, cat_attr): 
    
    # Based on the values the cat_attr can take, assign a color to each data point
    color_list = ['yellow', 'red', 'green', 'blue', 'plum', 'orange', 'magenta', 'pink', 'black']
    cat_values = dataFrame[cat_attr].unique()
    colors = []
    for val in dataFrame[cat_attr]:
        for i in range(len(cat_values)): 
            if val == cat_values[i]:
                color = color_list[i]
                # Append once per data point (only when the matching value is found)
                colors.append(color)
    
    # Create the legend
    legend_string = cat_attr + "\n"
    for i in range(len(cat_values)):
        legend_string = legend_string + "Value: " + str(cat_values[i]) + "  " + "Color: " + color_list[i] + "\n"
        
    
    fig = plt.figure(figsize=(8, 6))
    t = fig.suptitle(legend_string, fontsize=14)
    ax = fig.add_subplot(111, projection='3d')

    xs = list(dataFrame[num_attr_x])
    ys = list(dataFrame[num_attr_y])
    zs = list(dataFrame[num_attr_z])
    data_points = [(x, y, z) for x, y, z in zip(xs, ys, zs)]
    
    for data, color in zip(data_points, colors):
        x, y, z = data
        ax.scatter(x, y, z, alpha=0.4, c=color, edgecolors='none', s=30)

    ax.set_xlabel(num_attr_x)
    ax.set_ylabel(num_attr_y)
    ax.set_zlabel(num_attr_z);
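
The color assignment in `scatter_4d` pairs each distinct category value with an entry of `color_list`. A compact way to express the same lookup, sketched here in plain Python, is a dictionary built once from the values in order of first appearance (the same order `unique()` returns). The function name `assign_colors` is hypothetical, and cycling through the palette when there are more categories than colors is an assumption of this sketch rather than the notebook's behavior.

```python
from itertools import cycle

def assign_colors(cat_values, palette):
    """Return (colors, color_map): one color per row plus the value -> color lookup."""
    color_map = {}
    palette_cycle = cycle(palette)
    for val in cat_values:
        if val not in color_map:
            # First time this category is seen: give it the next palette color
            color_map[val] = next(palette_cycle)
    colors = [color_map[val] for val in cat_values]
    return colors, color_map

# Hypothetical category column with three distinct values
colors, color_map = assign_colors(["low", "high", "low", "medium"],
                                  ["yellow", "red", "green"])
print(colors)
print(color_map)
```

The list has exactly one color per data point, which is what a per-point `c=` or `color=` argument needs. The dictionary can double as the source for a legend.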

In [None]:
# 1 categorical attribute + 3 numerical attributes
# Visualizing 3 numerical and 1 categorical attribute using a bubble plot
# leveraging the concepts of hue and size

def bubble_4d(dataFrame, num_attr_x, num_attr_y, num_attr_size, cat_attr):
    
    # Based on the values the cat_attr can take, assign a color to each data point
    color_list = ['yellow', 'red', 'green', 'blue', 'plum', 'orange', 'magenta', 'pink', 'black']
    cat_values = dataFrame[cat_attr].unique()
    colors = []
    for val in dataFrame[cat_attr]:
        for i in range(len(cat_values)): 
            if val == cat_values[i]:
                color = color_list[i]
                # Append once per data point so colors has the same length as the data
                colors.append(color)
    
    # Create the legend
    legend_string = cat_attr + "\n"
    for i in range(len(cat_values)):
        legend_string = legend_string + "Value: " + str(cat_values[i]) + "  " + "Color: " + color_list[i] + "\n"
    
    size = dataFrame[num_attr_size]*25

    plt.scatter(dataFrame[num_attr_x], 
                dataFrame[num_attr_y], 
                s=size, 
                alpha=0.4, 
                color=colors, 
                edgecolors=colors
               )

    plt.xlabel(num_attr_x)
    plt.ylabel(num_attr_y)
    plt.title("Bubble Size: " + num_attr_size + "\n" + legend_string ,y=1.05)

## 5 Dimensions

### L) 1 Categorical Attribute + 4 Numerical Attributes

In [None]:
# 1 categorical attribute + 4 numerical attributes
# Visualizing 4 numerical and 1 categorical attribute using a bubble plot
# leveraging the concepts of hue, size and 3D plotting

def scatter_5d(dataFrame, num_attr_x, num_attr_y, num_attr_z, num_attr_size, cat_attr):
    
    # Based on the values the cat_attr can take, assign a color to each data point
    color_list = ['yellow', 'red', 'green', 'blue', 'plum', 'orange', 'magenta', 'pink', 'black']
    cat_values = dataFrame[cat_attr].unique()
    colors = []
    for val in dataFrame[cat_attr]:
        for i in range(len(cat_values)): 
            if val == cat_values[i]:
                color = color_list[i]
                # Append once per data point so colors stays aligned with the data points
                colors.append(color)
    
    # Create the legend
    legend_string = "Bubble Size = " + num_attr_size + "\n" + cat_attr + "\n"
    for i in range(len(cat_values)):
        legend_string = legend_string + "Value: " + str(cat_values[i]) + "  " + "Color: " + color_list[i] + "\n"
    
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    t = fig.suptitle(legend_string, fontsize=14)

    xs = list(dataFrame[num_attr_x])
    ys = list(dataFrame[num_attr_y])
    zs = list(dataFrame[num_attr_z])
    data_points = [(x, y, z) for x, y, z in zip(xs, ys, zs)]

    ss = list(dataFrame[num_attr_size])
    
    for data, color, size in zip(data_points, colors, ss):
        x, y, z = data
        ax.scatter(x, y, z, alpha=0.4, c=color, edgecolors='none', s=size*25)

    ax.set_xlabel(num_attr_x)
    ax.set_ylabel(num_attr_y)
    ax.set_zlabel(num_attr_z);

### M) 2 Categorical Attributes + 3 Numerical Attributes

In [None]:
# 2 categorical attributes + 3 numerical attributes
# Visualizing 3 numerical and 2 categorical attributes using a bubble plot
# leveraging the concepts of hue, size and facets
# Side by side plots represent a categorical dimension

def scatter_5d_sbs(dataFrame, num_attr_x, num_attr_y, num_attr_size, cat_attr_sbs, cat_attr_hue):
    
    # To order the cat_attr_sbs values, use col_order=['value1', ..., 'valueN'] as a directive in sns.FacetGrid
    # To order the cat_attr_hue values, use hue_order=['valueA', ..., 'valueR'] as a directive in sns.FacetGrid
    
    g = sns.FacetGrid(dataFrame, 
                      col=cat_attr_sbs, 
                      hue=cat_attr_hue, 
                      aspect=1.2, 
                      height=3.5)

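    # FacetGrid.map passes each facet's subset of the named columns to the plotting function,
    # so the size column is passed by name to keep the bubble sizes aligned with their points
    def _bubble_scatter(x, y, size, **kwargs):
        plt.scatter(x, y, s=size*2, **kwargs)

    g.map(_bubble_scatter, 
          num_attr_x, 
          num_attr_y, 
          num_attr_size, 
          alpha=0.8, 
          edgecolor='white', 
          linewidth=0.5)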

    fig = g.fig 
    fig.subplots_adjust(top=0.8, wspace=0.3)
    fig.suptitle("Bubble Size: " + num_attr_size, fontsize=14)
    l = g.add_legend(title=cat_attr_hue)

## 6 Dimensions

### N) 2 Categorical Attributes + 4 Numerical Attributes

In [None]:
# 2 categorical attributes + 4 numerical attributes
# Visualizing 4 numerical and 2 categorical attributes using a scatter plot and bubble plot and markers
# leveraging the concepts of hue, size, depth and shape
def scatter_6d(dataFrame, num_attr_x, num_attr_y, num_attr_z, num_attr_size, cat_attr_marker, cat_attr_hue): 
    
    # Based on the values the cat_attr can take, assign a color to each data point
    color_list = ['yellow', 'red', 'green', 'blue', 'plum', 'orange', 'magenta', 'pink', 'black']
    marker_list = ['.', 'v', 'x', 'p', '*', 'h', '1', 'D', 's']
    marker_shape = ['point', 'triangle', 'x', 'pentagon', 'star', 'hexagon', 'y', 'diamond', 'square']
    
    cat_values_hue = dataFrame[cat_attr_hue].unique()
    cat_values_marker = dataFrame[cat_attr_marker].unique()
    
    colors = []
    for val in dataFrame[cat_attr_hue]:
        for i in range(len(cat_values_hue)): 
            if val == cat_values_hue[i]:
                color = color_list[i]
                colors.append(color)
    
    markers = []
    for val in dataFrame[cat_attr_marker]:
        for i in range(len(cat_values_marker)): 
            if val == cat_values_marker[i]:
                marker = marker_list[i]
                markers.append(marker)
    
    # Create the legend for the bubble size
    legend_string_bubble = "Bubble Size = " + num_attr_size + "\n"
    
    # Create the legend for the cat_attr_hue
    legend_string_hue = ""
    for i in range(len(cat_values_hue)):
        legend_string_hue = legend_string_hue + cat_attr_hue + "  Value: " + str(cat_values_hue[i]) + "  " + "Color: " + color_list[i] + "\n"
            
     # Create the legend for the cat_attr_marker
    legend_string_marker = ""
    for i in range(len(cat_values_marker)):
        legend_string_marker = legend_string_marker + cat_attr_marker + "  Value: " + str(cat_values_marker[i]) + "  " + "Shape: " + marker_shape[i] + "\n"
     
    
    fig = plt.figure(figsize=(12, 8))
    t = fig.suptitle(legend_string_bubble + legend_string_hue + legend_string_marker, fontsize=14)
    ax = fig.add_subplot(111, projection='3d')

    xs = list(dataFrame[num_attr_x])
    ys = list(dataFrame[num_attr_y])
    zs = list(dataFrame[num_attr_z])
    data_points = [(x, y, z) for x, y, z in zip(xs, ys, zs)]

    ss = list(dataFrame[num_attr_size])
    
    for data, color, size, mark in zip(data_points, colors, ss, markers):
        x, y, z = data
        ax.scatter(x, y, z, alpha=0.4, c=color, edgecolors='none', s=size, marker=mark)
    
    ax.set_xlabel(num_attr_x)
    ax.set_ylabel(num_attr_y)
    ax.set_zlabel(num_attr_z);

### O) 3 Categorical Attributes + 3 Numerical Attributes

In [None]:
# 3 categorical attributes + 3 numerical attributes

# Visualizing 6-D mix data using scatter charts
# leveraging the concepts of hue, facets and size
def scatter_6d_sbs(dataFrame, cat_attr_row, cat_attr_sbs, cat_attr_hue, num_attr_x, num_attr_y, num_attr_size):
    g = sns.FacetGrid(dataFrame, 
                      row=cat_attr_row, 
                      col=cat_attr_sbs, 
                      hue=cat_attr_hue, 
                      height=4)

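    # FacetGrid.map passes each facet's subset of the named columns to the plotting function,
    # so the size column is passed by name to keep the bubble sizes aligned with their points
    def _bubble_scatter(x, y, size, **kwargs):
        plt.scatter(x, y, s=size*2, **kwargs)

    g.map(_bubble_scatter,  
          num_attr_x, 
          num_attr_y, 
          num_attr_size, 
          alpha=0.5, 
          edgecolor='k', 
          linewidth=0.5)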

    fig = g.fig 
    fig.set_size_inches(18, 8)
    fig.subplots_adjust(top=0.85, wspace=0.3)
    fig.suptitle("Bubble Size: " + num_attr_size, fontsize=14)
    l = g.add_legend(title=cat_attr_hue)

## Multiple Numerical Attributes

### P) Multiple Numerical Attributes Each Displayed Separately

In [None]:
# Multiple numerical attributes -- each graph contains a single attribute
## with graphs displayed side by side
# Pandas Histograms side by side
def num_attrs_sbs(dataFrame): 
    dataFrame.hist(bins=30, 
                   color='steelblue', 
                   edgecolor='black', 
                   linewidth=1.0,
                   xlabelsize=10, 
                   ylabelsize=10, 
                   grid=False)    

    plt.tight_layout(rect=(0, 0, 2, 2))

In [None]:
# 1 numerical feature -- side by side histograms of single numerical features
# Multiple histograms
def num_hist_mult(dataFrame, num_feature_list):
    # dataFrame is the entire dataset
    # num_feature_list is a list of numerical features, e.g., ['num_feat1', ..., 'num_feat_N']
    fig, ax = plt.subplots(figsize=(12,8))
    for num_feature in num_feature_list:
        plt.hist(dataFrame[num_feature], density=True, alpha=0.5, label=num_feature)
        
    plt.legend()

In [None]:
# 1 numerical feature -- side by side density plots of single numerical features
# Distribution density curves overlaid
# Distributions of a set of numerical features
def num_kde_mult(dataFrame, num_feature_list):
    # dataFrame is the entire dataset
    # num_feature_list is a list of numerical features, e.g., ['num_feat1', ..., 'num_feat_N']
    fig, ax = plt.subplots(figsize=(8,6))
    for num_feature in num_feature_list:
        sns.kdeplot(dataFrame[num_feature], label=num_feature)
        
    plt.legend()

In [None]:
# 1 numerical feature -- side by side histograms and density plots overlaid of single numerical features
# Both histograms and density curves overlaid
# Distributions of a set of numerical features
def num_hist_kde_mult(dataFrame, num_feature_list):
    # dataFrame is the entire dataset
    # num_feature_list is a list of numerical features, e.g., ['num_feat1', ..., 'num_feat_N']
    fig, ax = plt.subplots(figsize=(8,6))
    for num_feature in num_feature_list:
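        # NOTE: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is its replacement
        sns.distplot(dataFrame[num_feature], label=num_feature)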
        
    plt.legend()

### Q) Multiple Numerical Attributes Displayed on the Same Graph

In [None]:
# Multiple numerical attributes -- all compared on the same graph
# Boxplots for a set of numerical features
# The swarmplot shows the data points jittered for better visibility
# Another option instead of the jitter is to use a violinplot (for large datasets)
def num_boxplot_mult(dataFrame, num_feature_list):
    # dataFrame is the entire dataset
    # num_feature_list is a list of numerical features, e.g., ['num_feat1', ..., 'num_feat_N']
    fig, ax = plt.subplots(figsize=(8,6))
    ax = sns.boxplot(data=dataFrame[num_feature_list], palette='Set2')
    ax = sns.swarmplot(data=dataFrame[num_feature_list], color='grey')
    
    # If the number of features is greater than n, rotate the labels
    n = 3
    if len(num_feature_list) > n:
        plt.xticks(rotation=90)
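
Each box summarizes one feature by its quartiles, and points lying more than 1.5 times the interquartile range beyond the box are conventionally drawn as outliers (1.5 is the default whisker setting in seaborn and matplotlib). The sketch below computes those numbers for a single list of values with the standard library. `box_stats` is a hypothetical helper. Quartile interpolation conventions differ between libraries, so the figures may not match a plotted box exactly.

```python
import statistics

def box_stats(values, whis=1.5):
    """Quartiles, whisker limits and outliers for one numerical feature."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - whis * iqr
    upper_limit = q3 + whis * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {"q1": q1, "median": median, "q3": q3,
            "lower_limit": lower_limit, "upper_limit": upper_limit,
            "outliers": outliers}

# Hypothetical feature values with one obvious outlier
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
print(stats)
```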

In [None]:
# Multiple numerical attributes -- all compared on the same graph
# Correlation Table -- Display the relationship between multiple numerical attributes
# Correlation Density Plot
def num_corr_table(dataFrame, num_feature_list):
    # dataFrame is the entire dataset
    # num_feature_list is a list of numerical features, e.g., ['num_feat1', ..., 'num_feat_N']
    fig, ax = plt.subplots(figsize=(8,6))
    cm = dataFrame[num_feature_list].corr()
    sns.set(font_scale=1)
    #### NOTE: fmt directive controls number of decimal points displayed in the correlation value. ####
    hm = sns.heatmap(cm,
                     cbar=True,
                     annot=True,
                     square=False,
                     fmt='.2f',
                     annot_kws={'size':14},
                     yticklabels=num_feature_list,
                     xticklabels=num_feature_list
                    )

    plt.title('Correlation Heat Map')

In [None]:
# Multiple numerical attributes with their pairwise correlations on a single graph
## USE THIS PAIRWISE CORRELATION FUNCTION ##
# Seaborn Correlation Matrix Heatmap
def pairwise_corr(dataFrame, title=''): 
    f, ax = plt.subplots(figsize=(10, 6))
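    # Restrict to numerical columns; correlation is undefined for non-numeric ones
    corr = dataFrame.select_dtypes(include='number').corr()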
    hm = sns.heatmap(round(corr,2), 
                     annot=True, 
                     ax=ax, 
                     cmap="coolwarm",
                     fmt='.2f', 
                     linewidths=.05)
    f.subplots_adjust(top=0.93)
    t= f.suptitle(title, fontsize=14)
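
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. As a minimal sketch of what that number means for one pair of columns, the following computes it with the standard library only. `pearson` is an illustrative helper, and returning NaN for a constant column is a choice made for this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: perfectly increasing together, then perfectly opposed
r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson([1, 2, 3], [3, 2, 1])
print(r_pos, r_neg)
```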

### R) Multiple Numerical Attributes + 1 Categorical Attribute

In [None]:
# Multiple numerical attributes + 1 categorical attribute (time steps)
# Time series evolution of a list of numerical features
def time_series_mult_plot(dataFrame, time_feature_name, num_feature_list, highlighted_feature=''):
    # dataFrame is the entire dataset
    # time_feature_name is the name of the time feature, e.g., 'PUBLISHED_DATE'
    # num_feature_list contains the names of the numerical features that evolve in time, 
    # e.g., ['COUNT_IVR', ..., 'Avg_Dwell_Time'] or quality_feats
    # highlighted_feature is the feature in the num_feature_list to highlight in the plot
    
    # First sort the dataframe in ascending order of the time_feature_name
    df_sorted = dataFrame.sort_values(by=[time_feature_name])
    
    if highlighted_feature != '':
        # Create the abridged list of numerical features
        abbr_feature_list = [x for x in num_feature_list if x != highlighted_feature]
    
    # set up the plot
    fig, ax = plt.subplots(figsize=(14,10))
    x = df_sorted[time_feature_name]
    if highlighted_feature != '':
        y = df_sorted[abbr_feature_list]
    else:
        y = df_sorted[num_feature_list]
    plt.plot(x,y)
    
    #ax.tick_params(labelbottom='off') # no x axis tick labels
    plt.xticks(rotation=90) # rotate the x axis tick labels
    ax.set_xlabel(time_feature_name)

    # Now re-plot the highlighted feature - bigger with distinct color
    if highlighted_feature != '':
        plt.plot(x, df_sorted[highlighted_feature], marker='o', color='purple', linewidth=3, alpha=0.7)
        plt.legend(abbr_feature_list + [highlighted_feature])
    else:
        plt.legend(num_feature_list)

In [None]:
# Multiple numerical attributes + 1 categorical attribute
# Visualize the relationship between a set of numerical features and 
# a given categorical feature
# Scatter plot format

def num_cat_scatter(dataFrame, num_feats_list, cat_feat_name):
    # dataFrame is the entire dataset
    # num_feats_list is the list of numerical features, e.g., doc_feats
    # cat_feat_name is the name of the single categorical feature, e.g., 'AUTHOR_NAME'
    
    # Create the combined dataframe
    feat_list = num_feats_list + [cat_feat_name]
    
    # Create the pairplot
    sns.pairplot(dataFrame[feat_list], kind='scatter', hue=cat_feat_name);

In [None]:
# Multiple numerical attributes + 1 categorical attribute
# Visualize the relationship between a set of numerical features and 
# a given categorical feature
# Regression plot format

def num_cat_regress(dataFrame, num_feats_list, cat_feat_name):
    # dataFrame is the entire dataset
    # num_feats_list is the list of numerical features, e.g., doc_feats
    # cat_feat_name is the name of the single categorical feature, e.g., 'AUTHOR_NAME'
    
    # Create the combined dataframe
    feat_list = num_feats_list + [cat_feat_name]
    
    # Create the pairplot
    sns.pairplot(dataFrame[feat_list], kind='reg', hue=cat_feat_name);

In [None]:
## CAUTION - TAKES TIME TO RUN ##
# Multiple numerical attributes and 1 categorical attribute
# Scaling attribute values to avoid few outliers
def cluster_lines(dataFrame, num_col_names, cat_attr):
    # num_col_names is a list of numerical attributes
    # e.g., ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity']
    subset_df = dataFrame[num_col_names]

    from sklearn.preprocessing import StandardScaler
    ss = StandardScaler()

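    scaled_df = ss.fit_transform(subset_df)
    # Keep the original index so the scaled values line up with dataFrame[cat_attr] in the concat
    scaled_df = pd.DataFrame(scaled_df, columns=num_col_names, index=dataFrame.index)
    final_df = pd.concat([scaled_df, dataFrame[cat_attr]], axis=1)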
    #final_df.head()


    f, ax = plt.subplots(1, 1, figsize=(12, 6))
    
    # plot parallel coordinates
    from pandas.plotting import parallel_coordinates
    pc = parallel_coordinates(final_df, cat_attr, ax=ax)
    
    # Rotate the labels after plotting, since plotting sets the x tick labels
    if len(num_col_names) > 4:
        plt.xticks(rotation=90)
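
`StandardScaler` centers each column on its mean and divides by its population standard deviation, which is why features with very different units can share one vertical axis in the parallel-coordinates plot. The sketch below applies the same formula to a single column in plain Python. `standardize` is a hypothetical helper, and mapping a constant column to zeros is an assumption of this sketch.

```python
import statistics

def standardize(values):
    """Z-score a list of numbers: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Constant column: every value equals the mean
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

# Hypothetical column on a large scale; after scaling it has mean 0 and unit spread
z = standardize([200.0, 400.0, 600.0])
print(z)
```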

In [None]:
# Multiple plots of pairs of numerical attributes + 1 categorical attribute (hue value)
# Scatter Plot with Hue for visualizing data in 3-D
def pairwise_scatter(dataFrame, num_col_names, cat_attr, title=''):

    # num_col_names is a list of the numerical attributes, e.g., 
    # ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity']
    
    # For this plot to work, the cat_attr must be added to the numerical columns to form col_names
    # (a new list is built so that the caller's num_col_names is not modified)
    col_names = list(num_col_names) + [cat_attr]
    
    pp = sns.pairplot(dataFrame[col_names], 
                      hue=cat_attr, 
                      height=1.8, 
                      aspect=1.8,
                      plot_kws=dict(edgecolor="black", linewidth=0.5))
    fig = pp.fig 
    fig.subplots_adjust(top=0.93, wspace=0.3)
    t = fig.suptitle(title, fontsize=14)

## Other

In [None]:
# Not a visualization function per se but one that generates
## the inputs to a visualization function

# Example: Count of inquiries by month or year
# Use these results to plot the counts using the bar_graph_timeseries function above
def group_by_time_period(dataFrame, date_field_name, other_field_name, period='1M'):
    # period can be '1M' (default for monthly) or '1Y' (yearly)
    # other_field_name is any field name in the dataFrame
    
    vals = dataFrame.groupby(pd.Grouper(key=date_field_name, freq=period)).count()
    if period == '1M':
        vals.index = vals.index.strftime('%Y %B')
    elif period == '1Y':
        vals.index = vals.index.strftime('%Y')
        
    bar_names = vals[other_field_name].index
    bar_values = vals[other_field_name].values
    
    return bar_names, bar_values
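
The grouping step buckets each row by calendar month (or year) and counts the rows per bucket. A plain-Python sketch of the monthly case is given below for intuition. `count_by_month` and the sample dates are hypothetical. Unlike frequency-based grouping, which typically reports empty periods with a count of zero, this sketch simply omits months that have no rows.

```python
from collections import Counter
from datetime import date

def count_by_month(dates):
    """Return (bar_names, bar_values): one 'YYYY Month' label and one count per month seen."""
    counts = Counter((d.year, d.month) for d in dates)
    ordered = sorted(counts)  # chronological order of (year, month) keys
    bar_names = [date(y, m, 1).strftime('%Y %B') for y, m in ordered]
    bar_values = [counts[key] for key in ordered]
    return bar_names, bar_values

# Hypothetical inquiry dates
bar_names, bar_values = count_by_month([
    date(2021, 1, 5), date(2021, 3, 9), date(2021, 1, 20),
])
print(bar_names, bar_values)
```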

In [None]:
# Use for word frequency plots
# 1 categorical attribute (words) and their associated frequencies 
# Word frequencies bar graph (x axis = word, y axis = frequency of occurrence of the word)
# Set up the plot - use for categorical features
# Show how a categorical feature's values are distributed across the possible values it can take
def bar_graph_freq(freq, plt_title='', orient='horiz'):
    # orient can be 'vert' or 'horiz' (default)
    # freq is a list of [(word1, freq count 1), ..., (wordN, freq count N)]
    # It is the output of the word_freq function in SharedFunctions/Include-NLP-Functions
    
    bar_names = [item[0] for item in freq]
    bar_heights = [item[1] for item in freq]
    
    fig, ax = plt.subplots(figsize=(8,6))
    # To make the plot vertical, use x=feature in the 'count' display and (feat_values, y) in the 'percentage' display
    # To make the plot horizontal, use y=feature in the 'count' display and (y, feat_values) in the 'percentage' display
    
    if orient == 'vert': 
        # Keyword arguments are used because newer seaborn versions no longer accept positional x and y
        ax = sns.barplot(x=bar_names, y=bar_heights)
        # If the number of bar_names is greater than n, rotate the labels
        n = 4
        if len(bar_names) > n:
            plt.xticks(rotation=90)
        plt.ylabel('Count')
        plt.title(plt_title)
    elif orient == 'horiz':
        # horiz orientation
        ax = sns.barplot(x=bar_heights, y=bar_names)
        plt.xlabel('Count')
        plt.title(plt_title)
    
    # If %matplotlib inline is invoked, we don't need to return plt.show()
    #return plt.show()
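
`bar_graph_freq` expects a list of `(word, count)` pairs. The `word_freq` function that normally produces it lives in another notebook and is not shown here, so the following is only a stand-in that builds a list of the same shape with `collections.Counter`. The name `simple_word_freq` and its whitespace tokenization are assumptions.

```python
from collections import Counter

def simple_word_freq(text, top_n=10):
    """Return the top_n (word, count) pairs, most frequent first."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return Counter(tokens).most_common(top_n)

# Hypothetical input text
freq = simple_word_freq("the cat and the hat and the bat", top_n=2)
print(freq)
```

The resulting list can be passed straight to the plotting function, e.g. `bar_graph_freq(freq, plt_title='Top words')`.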

In [None]:
# # Word Cloud Display
# from wordcloud import WordCloud
# # Create a word cloud for the words that occur in a given term's context
# # Context is defined in the get_word_context function in SharedFunctions/Include-NLP-Functions
# #### NOTE: This function requires SharedFunctions/Include-NLP-Functions to already be loaded ####
# def create_wordcloud(corpus, ignore_words=[]):
#     # word_list is a list of words -- i.e., a corpus represented as a single tokenized sentence
    
#     # Turn the corpus into a single list of tokens
#     word_list = flatten_list(corpus)
#     # Turn the single list of tokens into one giant sentence
#     word_cloud_text = " ".join(word_list)
    
#     # Create the wordcloud object -- remove the expression itself from the wordcloud
#     wordcloud = WordCloud(width=900, 
#                           height=600, 
#                           margin=0, 
#                           stopwords=ignore_words, 
#                           colormap="Blues").generate(word_cloud_text)
    
#     # Display the generated image:
#     fig, ax = plt.subplots(figsize=(15,10))
#     plt.imshow(wordcloud, interpolation='bilinear')
#     plt.axis("off")
#     plt.margins(x=0, y=0)
#     # plt.show() # not necessary if matplotlib.inline is set 