<table class="table table-bordered">
    <tr>
        <th style="width:200px;">
            <img src='https://bcgriseacademy.com/hs-fs/hubfs/RISE%202.0%20Logo_Options_25Jan23_RISE%20-%20For%20Black%20Background.png?width=3522&height=1986&name=RISE%202.0%20Logo_Options_25Jan23_RISE%20-%20For%20Black%20Background.png' style="background-color:black; width: 100%; height: 100%;">
        </th>
        <th style="text-align:center;">
            <h1>IBF TFIP</h1>
            <h2>Pandas II - Data Visualization</h2>
        </th>
    </tr>
</table>

# Learning Objectives
#### After completing this lesson, you should be able to:

1. LO1 : Understand Data Visualization with Python
2. LO2 : Understand functions in Matplotlib and Seaborn libraries
3. LO3 : Apply Matplotlib and Seaborn libraries to create visualizations


# Table of Contents <a id='tc'></a>

1. [Plotting and Visualization in Python](#p1)
2. [Matplotlib](#p2)
3. [Seaborn](#p3)
4. [DIY debugging for data visualization using ChatGPT](#p4)
5. [Hands-On Practice Exercise](#p5)

# 1. Plotting and Visualization in Python <a id='p1' />

Plotting and visualization are essential aspects of data analysis and communication. Python provides several libraries that make it easy to create visually appealing and informative plots. Some popular libraries for plotting and visualization in Python are:

`Matplotlib:` Matplotlib is one of the most widely used libraries for creating static, interactive, and animated visualizations. It provides a wide range of plot types, including line plots, bar plots, scatter plots, histograms, and more.

`Seaborn:` Seaborn is built on top of Matplotlib and provides a higher-level interface for creating attractive statistical graphics. It is particularly useful for visualizing data distributions and relationships.

`Plotly:` Plotly is a versatile library that supports interactive and web-based visualizations. It allows users to create interactive plots, 3D plots, maps, and more.

`Pandas Plotting:` The Pandas library provides built-in plotting functionality that allows users to create basic plots directly from Pandas data structures, such as DataFrames and Series.

`Bokeh:` Bokeh is another interactive visualization library that supports web-based visualizations. It is designed for creating interactive plots with large datasets.
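
As a minimal sketch of the Pandas plotting interface described above, the cell below builds a small DataFrame and calls its `.plot()` method. The column names and values are made up for illustration. Pandas draws with Matplotlib behind the scenes and returns a Matplotlib `Axes` object, so the usual Matplotlib methods can be used afterwards.

```python
import pandas as pd

# Hypothetical monthly figures, used only for illustration
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [1000, 1200, 1100, 1300],
})

# DataFrame.plot draws with Matplotlib and returns the Axes it drew on
ax = sales.plot(x="month", y="revenue", kind="line", marker="o",
                title="Monthly revenue")

# The returned Axes accepts ordinary Matplotlib calls
ax.set_ylabel("Revenue ($)")
```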


# 2. Matplotlib <a id='p2' />

Matplotlib is an excellent 2D and 3D graphics library for generating scientific figures in Python. Some of the many advantages of this library include:

* Easy to get started
* Support for $\LaTeX$ formatted labels and texts
* Great control of every element in a figure, including figure size and DPI. 
* High-quality output in many formats, including PNG, PDF, SVG, EPS.
* GUI for interactively exploring figures *and* support for headless generation of figure files (useful for batch jobs).

One of the key features of matplotlib, and the one that makes it highly suitable for generating figures for scientific publications, is that all aspects of a figure can be controlled *programmatically*. This is important for reproducibility, and convenient when one needs to regenerate a figure with updated data or change its appearance.
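
To illustrate programmatic control and headless output, here is a small sketch that builds a figure in code and writes it out in two formats. The data are made up, and in-memory buffers stand in for file paths so that nothing is written to disk; in practice a file name such as `'figure.png'` would be passed instead.

```python
import io
import matplotlib.pyplot as plt

# Build the figure entirely in code; no GUI interaction is needed
fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], "r-o")
ax.set_xlabel(r"$x$")      # LaTeX-style math in labels
ax.set_ylabel(r"$x^2$")

# Write the same figure as PNG and as PDF; the buffers stand in for file paths
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=150)

pdf_buffer = io.BytesIO()
fig.savefig(pdf_buffer, format="pdf")

plt.close(fig)  # free the figure once it has been saved
```

Because every step is code, rerunning the cell with new data regenerates identical-looking output.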

In [None]:
# import the required libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## 2.1 Basic Visualizations with Matplotlib

In [None]:
# Line chart
# Sample data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
revenue = [1000, 1200, 1100, 1300, 1500, 1400]

# Create a line chart
plt.plot(months, revenue, marker='o', linestyle='-', color='b', alpha = 0.5)

# Add labels and title
plt.xlabel('Months')
plt.ylabel('Revenue ($)')
plt.title('Monthly Revenue')
plt.show()

In [None]:
# Scatter plot
# Understanding relationships between columns of data

# Sample data for age and income
age = [25, 30, 35, 40, 45, 50, 55, 60]
income = [42000, 57000, 58000, 60000, 61000, 70000, 75000, 80000]

# Create a scatter plot
plt.scatter(age, income, marker='o', color='b', alpha = 0.5)

# Add labels and title
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.title('Income vs. Age for Employees')
plt.show()

In [None]:
# HISTOGRAM - This is important to understand distribution of any column of data

# Sample data
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

# Create a histogram plot
plt.hist(ages, bins=10, color='blue', alpha = 0.3)

# Add labels and title
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution of Survey Respondents')
plt.show()

In [None]:
# Bar/Column chart/plot

labels = ["Javascript", "Java", "Python", "R"]
usage = [62.8, 43.5, 37.2, 25]

# Plotting the bar itself
plt.bar(labels, usage, color= 'blue', alpha = 0.3)
plt.title("Usage of different programming languages")
plt.xlabel("Programming Languages")
plt.ylabel('Usage')
plt.show()



In [None]:
# Horizontal Bar Chart

labels = ["Javascript", "Java", "Python", "R"]
usage = [62.8, 43.5, 37.2, 25]

# Plotting the bar itself
plt.barh(labels, usage, color = 'green', alpha = 0.3)
plt.title("Usage of different programming languages")
plt.xlabel('Usage')                  # in a horizontal bar chart the values run along the x-axis
plt.ylabel("Programming Languages")  # and the categories along the y-axis
plt.show()


## 2.2 Subplots in Matplotlib

In [None]:
# Subplots in Matplotlib

import numpy as np

# Create a total of two plots - side by side (1 row, 2 columns)
plt.figure(figsize=(20,10)) # (width, height) in inches

# create the first plot
plt.subplot(1,2,1) # (rows, columns, panel number/subplot area number)

plt.bar(labels, usage)
plt.title("Usage of different programming languages")
plt.xlabel("Programming Languages")
plt.ylabel('Usage')

# create the second plot
plt.subplot(1,2,2)

plt.barh(labels, usage)
plt.title("Usage of different programming languages")
plt.xlabel('Usage')
plt.ylabel("Programming Languages")
plt.show()

## 2.3 More Visualizations using Matplotlib

Let's import some data and plot a simple figure with the MATLAB-like plotting API.


In [None]:
# Columns are separated by whitespace; the raw string avoids an invalid-escape warning for \s
rain = pd.read_table('./data/nashville_precip.txt', sep=r'\s+', na_values='NA', index_col=0)
rain.head()

In [None]:
# We want to plot the January rainfall for all years
x = rain.index.values
y = rain['Jan'].values

In [None]:
plt.figure()
plt.plot(x, y, 'r')
plt.xlabel('Year')
plt.ylabel('Rainfall')
plt.title('January rainfall in Nashville')
plt.show()

#### Adding data labels to the chart

In [None]:
# Create a new figure (default size)
plt.figure()

# Set x-axis and y-axis values (.loc slices include both endpoints)
rain_jan = rain.loc[2007:2011, 'Jan'].values
months = [2007, 2008, 2009, 2010, 2011] # years shown on the x-axis

# Set style of the plot
plt.style.use('tableau-colorblind10')

# Set title for plot (with fontsize change)
plt.title('January rainfall in Nashville', fontsize = 20)

# Set labels for x-axis and y-axis (with fontsize change)
plt.xlabel('Year', fontsize = 15)
plt.ylabel('Rainfall', fontsize = 15)

# Set font size of ticks on the x-axis and y-axis
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10)

# Set ranges for the x-axis and y-axis
plt.xlim(left=2006,right=2012)
plt.ylim(bottom=0,top=8)

# Turn the grid on (red, 0.2 transparency, dashed lines)
plt.grid(c = 'r', alpha = .2, linestyle = '--')

# Display Data Labels (annotations)
for x, y in zip(months, rain_jan): # zip joins x and y coordinates in pairs

    label = y # value of data label to display

    plt.annotate(label, # this is data label to display
                 (x, y), # this is the position to insert the data label
                 textcoords="offset points", # define how to position the text
                 xytext=(0,10), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
    
# Create plot
plt.plot(months, rain_jan, 'b-o', label = 'January rainfall') # remember to set legend labels

# Set the location of the legend
# plt.legend(loc=(1.02,0.9))

plt.legend()

# Display the plot
plt.show()

#### Customize plotting symbols

In [None]:
x = rain.index.values
y = rain['Jan'].values

plt.figure(figsize=(14,6))
plt.subplot(1,2,1)
plt.plot(x, y, 'r--')
plt.subplot(1,2,2)
plt.plot(x, rain['Feb'].values, 'g*-')

## 2.4 Object-oriented API <a id='p2.2' />


While the MATLAB-like API is easy and convenient, it is worth learning matplotlib's object-oriented plotting API. It is remarkably powerful, and it is much nicer to work with for advanced figures that contain subplots, insets and other components.

The main idea with object-oriented programming is to have objects with associated methods and functions that operate on them, and no object or program states should be global.

To use the object-oriented API we start out very much like in the previous example, but instead of creating a new global figure instance we store a reference to the newly created figure instance in the `fig` variable, and from it we create a new axis instance `axes` using the `add_axes` method in the `Figure` class instance `fig`.


In [None]:
fig = plt.figure()

# left, bottom, width, height (range 0 to 1)
# as fractions of figure size
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8]) 

axes.plot(x, y, 'r')

axes.set_xlabel('Year')
axes.set_ylabel('Rainfall')
axes.set_title('January rainfall in Nashville');

Although a little more code is involved, the advantage is that we now have full control of where the plot axes are placed, and we can easily add more than one axis to the figure.

In [None]:
fig = plt.figure()

axes1 = fig.add_axes([0.1, 0.1, 0.9, 0.9]) # main axes
axes2 = fig.add_axes([0.65, 0.65, 0.3, 0.3]) # inset axes

# main figure
axes1.plot(x, y, 'r')
axes1.set_xlabel('Year')
axes1.set_ylabel('Rainfall')
axes1.set_title('January rainfall in Nashville');

# inset figure
axes2.plot(x, np.log(y), 'g')
axes2.set_title('Log rainfall');

If we don't care to be explicit about where our plot axes are placed in the figure canvas, then we can use one of the many axis layout managers in matplotlib, such as `subplots` too.

In [None]:
fig, axes = plt.subplots(nrows=4, ncols=1)

months = rain.columns

for i,ax in enumerate(axes):
    ax.plot(x, rain[months[i]], 'r')
    ax.set_xlabel('Year')
    ax.set_ylabel('Rainfall')
    ax.set_title(months[i])
    

That was easy, but it's not so pretty with overlapping figure axes and labels, right?

We can deal with that by using the `fig.tight_layout` method, which automatically adjusts the positions of the axes on the figure canvas so that there is no overlapping content:

In [None]:
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10,10))

for i,ax in enumerate(axes):
    ax.plot(x, rain[months[i]], 'r')
    ax.set_xlabel('Year')
    ax.set_ylabel('Rainfall')
    ax.set_title(months[i])
    
fig.tight_layout()
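
When `plt.subplots` is given more than one row and more than one column, it returns a 2D NumPy array of axes. The sketch below uses synthetic curves rather than the rainfall data and walks over the grid with `axes.flat`.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 100)  # synthetic x values
curves = {
    "sin": np.sin(t),
    "cos": np.cos(t),
    "sin squared": np.sin(t) ** 2,
    "cos squared": np.cos(t) ** 2,
}

# With nrows=2 and ncols=2, `axes` is a 2x2 array of Axes objects
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6), sharex=True)

# axes.flat iterates over the grid row by row
for ax, (name, values) in zip(axes.flat, curves.items()):
    ax.plot(t, values)
    ax.set_title(name)

fig.tight_layout()
```

A single panel can also be addressed directly by row and column, for example `axes[1, 0]`.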

## 2.5 Manipulating figure attributes <a id='p2.3' />

Matplotlib allows the aspect ratio, DPI and figure size to be specified when the `Figure` object is created, using the `figsize` and `dpi` keyword arguments. `figsize` is a tuple with the width and height of the figure in inches, and `dpi` is the dots per inch (pixels per inch). To create a figure with size 12 by 3 inches we can do: 

In [None]:
fig, axes = plt.subplots(figsize=(12,3))

axes.plot(x, y, 'r')
axes.set_xlabel('Year')
axes.set_ylabel('Rainfall')
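
The relationship between `figsize`, `dpi` and the pixel size of the figure can be checked directly. This is a small sketch with an explicitly chosen `dpi` of 100; some interactive backends on high-resolution displays may scale the value.

```python
import matplotlib.pyplot as plt

# 12 x 3 inches at 100 dots per inch
fig, ax = plt.subplots(figsize=(12, 3), dpi=100)

width_in, height_in = fig.get_size_inches()

# Pixel size is simply inches multiplied by dots per inch
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi

print(f"{width_in:.0f} x {height_in:.0f} in -> {width_px:.0f} x {height_px:.0f} px")
```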

#### Legends can also be added to identify labelled data.

In [None]:
fig, ax = plt.subplots(figsize=(12,3))

ax.plot(x, rain['Jan'], label="Jan")
ax.plot(x, rain['Aug'], label="Aug")
ax.set_xlabel('Year')
ax.set_ylabel('Rainfall')
ax.legend(loc=1); # upper right corner
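
The legend location can also be given as a string, which is easier to read than a numeric code. A short sketch with synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.arange(10)  # synthetic data for illustration

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x_demo, x_demo, label="linear")
ax.plot(x_demo, x_demo ** 2, label="quadratic")

# 'upper left' is the string form of the numeric code loc=2
legend = ax.legend(loc="upper left", title="Growth")
```

Other common values are `'upper right'`, `'lower left'`, `'lower right'`, `'center'` and `'best'`, which lets Matplotlib pick the position.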

#### Visualizations can be fine-tuned in matplotlib, using the attributes of the figure and axes.

In [None]:
# Use the following data to create the required plot
month = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
revenue = np.random.randint(500, 10000, size=12)  # randomly generate revenue for each month
profits = np.random.randint(-1000, 5000, size=12) # randomly generate profits for each month

In [None]:
# Create a figure and subplot with 1 row and 1 column, set figure size to 10 by 5
fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize=(10, 5) )

# Create one plot containing line charts for revenue and profits (remember to set legend labels)
axes.plot(month, revenue, 'b-x', label = 'Revenue')
axes.plot(month, profits, 'g-o', label = 'Profits')

# Set appropriate title for plot (set font size to 20)
axes.set_title('Revenue and Profits Earned Each Month', fontsize = 20)

# Set labels for x-axis and y-axis (set font size to 15)
axes.set_xlabel('Month', fontsize = 15)
axes.set_ylabel('Dollars', fontsize = 15)

# Set font size of ticks on the x-axis and y-axis to 10
axes.tick_params(axis = 'x', labelsize = 10)
axes.tick_params(axis = 'y', labelsize = 10)

# Set appropriate ranges for the x-axis and y-axis
axes.set_ylim(bottom = -1000, top = 10000)

# Set grid lines to red, 0.2 transparency and dashed linestyle
axes.grid(c = 'r', alpha = .2, linestyle = '--')

# Set appropriate position for the legend (just outside the right edge of the axes)
axes.legend(loc=(1.02,0), borderaxespad=0, fontsize = 15)

# Set tight layout for figure
fig.tight_layout()

# Save plot as 'practice1.png' in the 'final deliverable' folder (the folder must already exist)
fig.savefig('./final deliverable/practice1.png', dpi = 300)

# Display the plot
plt.show()

# 3. Seaborn <a id='p3' />

Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for creating attractive and informative statistical graphics. Seaborn is particularly useful for visualizing complex datasets and exploring the relationships between variables.

Some key features of Seaborn include:

`Attractive Default Styles:` Seaborn comes with attractive and well-designed default styles, making it easy to create visually appealing plots without much customization.

`High-Level Plotting Functions:` Seaborn provides high-level functions to create a wide range of statistical plots, including scatter plots, line plots, bar plots, box plots, violin plots, heatmaps, and more.

`Color Palettes:` Seaborn offers a variety of color palettes to make it easy to apply consistent and aesthetically pleasing colors to your plots.

`Statistical Estimations:` Seaborn can automatically compute and display statistical estimates such as confidence intervals and regression lines on your plots.

`Facet Grids:` Seaborn allows you to create facet grids for visualizing data across multiple dimensions or subsets of the data.




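Seaborn is conventionally imported using the `sns` alias. In recent Seaborn versions (0.8 and later), importing the library does not by itself change how plots look; the default Seaborn settings are applied by calling `sns.set_theme()` or one of the styling functions shown below. These defaults are generally more muted colors with a light gray background and subtle white grid lines. 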
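
A minimal sketch of applying the Seaborn defaults explicitly, assuming Seaborn 0.11 or later, where `set_theme()` is available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply Seaborn's default look: 'darkgrid' style, 'notebook' context
sns.set_theme(style="darkgrid", context="notebook")

# The theme works by updating Matplotlib's rcParams, so plain
# Matplotlib plots drawn afterwards pick it up as well
grid_enabled = plt.rcParams["axes.grid"]
print(grid_enabled)
```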

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load data to construct the seaborn plots. We will be using one of the datasets available in seaborn library

df = sns.load_dataset('car_crashes')

df.head()

## 3.1 Styling and Themes in Seaborn

Visualizations are essential for gaining insights from large datasets and presenting findings to stakeholders. Enhancing the visual appeal of plots can improve both data exploration and communication. Beautiful and attractive visualizations capture attention and engage audiences more effectively than dull plots. Therefore, styling plays a crucial role in data visualization.

While Matplotlib is highly customizable, it can be challenging to fine-tune settings for visually appealing plots. Seaborn, on the other hand, provides pre-designed themes and a high-level interface to easily customize the appearance of Matplotlib figures. This feature allows users to effortlessly create visually appealing plots without delving into intricate customization details.

In [None]:
# First, let's see how we can style a simple Matplotlib plot using Seaborn’s function.

sns.set_style("whitegrid") #set the style to whitegrid
plt.scatter(df.speeding,df.alcohol)

plt.show()

In [None]:
sns.set_style("ticks") #set style to add ticks
plt.scatter(df.speeding,df.alcohol)

plt.show()


In [None]:
sns.set_style("dark") #set style to dark
plt.scatter(df.speeding,df.alcohol)

plt.show()

If we don't want to include the top and right axis spines, we can remove them using the despine() function.

In [None]:
sns.set_style("ticks")
plt.scatter(df.speeding,df.alcohol)
sns.despine()
plt.show()

The second argument of set_style() is a dictionary of parameter overrides. Changing the values of these parameters alters the plot style and lets us fine-tune the look of our plots.

In [None]:
sns.set_style("darkgrid", {'grid.color': '.5'})
plt.scatter('speeding','alcohol',data=df)
plt.show()
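
To see which parameters a style controls, `sns.axes_style()` returns them as a dictionary without changing any settings. A short sketch:

```python
import seaborn as sns

# axes_style() returns the parameters of a named style as a dictionary
darkgrid = sns.axes_style("darkgrid")
print(sorted(darkgrid)[:5])  # a few of the available parameter names

# Overrides are passed as a dict, in the same way as with set_style()
custom = sns.axes_style("darkgrid", {"grid.color": ".5"})
print(darkgrid["grid.color"], "->", custom["grid.color"])
```

Any key that appears in this dictionary can be used in the override dictionary passed to `set_style()`.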

You can also use matplotlib's pyplot functions to add chart elements, like the title, x-axis label, y-axis label etc.

In [None]:
sns.set_style("darkgrid", {'grid.color': '.5'})
plt.scatter('speeding','alcohol',data=df)
plt.title("Relationship between speeding and alcohol in car crashes")
plt.xlabel("speeding")  # axis labels use the dataset's column names
plt.ylabel("alcohol")
plt.show()

In Seaborn, we have the flexibility to control specific elements of our graphs, enabling us to adjust the scale of these elements or the overall plot appearance using the set_context() function. There are four preset templates for contexts, each based on relative size, and they are named as follows:

* Paper
* Notebook
* Talk
* Poster

By default, the context is set to 'notebook', which is the context used in the examples above. However, we can easily try a different context to observe how it influences the appearance of our plots. Changing the context allows us to quickly adapt the visual style of our visualizations to better suit the specific use case or presentation medium.

In [None]:
sns.set_style("dark")
sns.set_context("notebook")
plt.scatter(df.speeding,df.alcohol)
plt.show()

In [None]:
sns.set_style("dark")
sns.set_context("poster")
plt.scatter(df.speeding,df.alcohol)
plt.show()
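
The effect of each context can also be inspected numerically. `sns.plotting_context()` returns the scaling parameters of a named context without applying them; the sketch below prints two of them for each preset.

```python
import seaborn as sns

# plotting_context() returns the scaling parameters of a named context
for name in ["paper", "notebook", "talk", "poster"]:
    context = sns.plotting_context(name)
    print(f"{name:>8}: font size {context['font.size']}, "
          f"line width {context['lines.linewidth']}")

paper_font = sns.plotting_context("paper")["font.size"]
poster_font = sns.plotting_context("poster")["font.size"]
```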

## 3.2 Seaborn Color Palette

Seaborn is renowned for enhancing the appeal of plots and graphs by using attractive color combinations. In data visualization, color is a crucial element that significantly influences how observers perceive and interpret visualizations. Effective use of color can add substantial value to the plot and enhance its overall impact.

Seaborn provides a wide range of color palettes, totaling 170 options. A palette, analogous to a painter's mixing surface, allows users to arrange and blend colors harmoniously. These diverse color palettes in Seaborn enable users to create visually appealing and engaging visualizations that captivate audiences and effectively communicate insights from data.


![image.png](attachment:image.png)
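
As a brief sketch of working with palettes in code, `sns.color_palette()` returns a list of RGB colors for a named palette. The palette names used here (`husl` and `Blues`) are two common choices picked for illustration.

```python
import seaborn as sns

# Five evenly spaced hues, suitable for unordered categories
qualitative = sns.color_palette("husl", 5)

# Four shades of the sequential Matplotlib colormap 'Blues'
sequential = sns.color_palette("Blues", 4)

# Each color is an (r, g, b) tuple; as_hex() converts to hex codes
print(qualitative.as_hex())
print(sequential.as_hex())
```

The same names can be passed to the `palette` parameter of most Seaborn plotting functions.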

## 3.3 Seaborn Functions

### 3.3.1 Relplot

Seaborn's relplot() function is a powerful tool that offers access to various axes-level functions, each demonstrating the relationship between two variables while incorporating semantic mappings of subsets. The kind parameter enables users to choose the specific underlying axes-level function to utilize for the visualization. This flexibility allows users to explore and present relationships in the data effectively, catering to different use cases and preferences.

In [None]:
# loading new data

tips = sns.load_dataset("tips")
tips.head()

In [None]:
# creating relplot
sns.set_context("notebook")
sns.relplot(data=tips, x="total_bill", y="tip")


In [None]:
# adding hue
sns.relplot(data=tips, x="total_bill", y="tip", hue="day")

In [None]:
# adding more details to the chart and making it more readable
sns.relplot(data=tips, x="total_bill", y="tip", hue="sex", col="day", col_wrap=2)

In [None]:
# you can also use a line chart by changing to kind = 'line'
sns.relplot(data=tips, x="size", y="tip",kind="line",ci=None)

As observed in the second and third plots above, we have introduced an additional dimension by color-coding the points based on a third variable. In Seaborn, this approach is known as using a "hue semantic" since the color of each point gains significance. This is achieved by passing the third variable to the hue parameter of the relplot() function. The col parameter, used above to draw one panel per day, relies on Seaborn's FacetGrid, which appears again in the lmplot() section. The hue parameter is particularly useful for visually representing relationships between variables and highlighting patterns or differences within the data.

### 3.3.2 Histogram

Histograms are used to visualize the distribution of data by dividing it into bins based on the range of values and then displaying bars to represent the number of observations falling within each bin. In Seaborn, we can create histograms using the histplot() function; the older distplot() function is deprecated. Let's see an example of how to use this function to plot a histogram.

In [None]:
sns.histplot(tips['total_bill'], kde = True)

With kde = True this gives a histogram with a density curve drawn over it. You may choose to only visualize the histogram.

In [None]:
sns.histplot(tips['total_bill'])

### 3.3.3 Barplot

Seaborn offers a wide range of bar plots, and we will explore some of them here. As mentioned earlier, we will combine Seaborn with Matplotlib to showcase various types of bar plots. This combination allows us to leverage the strengths of both libraries and create visually appealing and informative visualizations. By using Seaborn's specialized bar plot functions in conjunction with Matplotlib's customization capabilities, we can effectively communicate insights and patterns present in the data.

#### Vertical Bar Plot

In [None]:
sns.set_context('paper')
 
# load dataset
titanic = sns.load_dataset('titanic')

# create plot
sns.barplot(x = 'embark_town', y = 'age', data = titanic,
            palette = 'PiYG',ci=None 
            )
plt.legend()
plt.show()
print(titanic.columns)

`sns.set_context('paper'):` This line sets the context of the plot to 'paper', which means the plot will be optimized for paper publication. Seaborn provides four preset contexts: 'paper', 'notebook', 'talk', and 'poster', with 'paper' being suitable for publication-quality plots.

`titanic = sns.load_dataset('titanic'):` This line loads the 'titanic' dataset from Seaborn's built-in datasets. The dataset contains information about passengers aboard the Titanic, including their age, embarkation town, and other details.

`sns.barplot(x='embark_town', y='age', data=titanic, palette='PiYG', ci=None):` This line creates the bar plot using Seaborn's barplot() function. We specify 'embark_town' on the x-axis and 'age' on the y-axis. By default, barplot() aggregates the y values within each category using the mean, so the height of each bar is the average age of the passengers who boarded at that town. The 'palette' parameter sets the color palette for the plot. In this case, 'PiYG' is used, which stands for Pink-Yellow-Green. The ci parameter is set to None, indicating that no confidence intervals will be shown on the plot.

`plt.legend():` This line adds a legend to the plot, but in this case, since there are no separate groups being represented by different colors, the legend will not show any labels.

`plt.show():` This line displays the bar plot.

`print(titanic.columns):` This line prints the column names of the 'titanic' dataset, which allows us to see the available columns and their names.
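
The bar heights in such a plot can be reproduced with a Pandas group-by, since barplot() uses the mean as its default estimator. The sketch below uses a small made-up sample with the same two column names rather than the real Titanic data.

```python
import pandas as pd

# Small made-up sample with the same two columns as the Titanic example
passengers = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Queenstown", "Cherbourg", "Southampton"],
    "age": [22.0, 38.0, 26.0, 30.0, 54.0, 36.0],
})

# Mean age per town: the value each bar would show
mean_age = passengers.groupby("embark_town")["age"].mean()
print(mean_age)
```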

In [None]:
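sns.barplot(x = 'sex', y = 'survived', hue = 'class', data = titanic,
            palette = 'GnBu',
            order = ['male', 'female'],  # order of the categories on the x-axis
            capsize = 0.05,              # width of the error bar caps
            saturation = 0.8,            # proportion of the original color saturation (0 to 1)
            errcolor = 'gray', errwidth = 2,  # error bar color and line width
            ci = None                    # no error bars, so the error bar settings above have no visible effect
            )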
plt.legend()
plt.show()

#### Horizontal Bar Plot

In [None]:
sns.set_context('paper')
sns.barplot(x = 'age', y = 'embark_town', data = titanic,
            palette = 'GnBu', orient = 'h', ci = None
            )
plt.show()

### 3.3.4 Count Plot

The count plot is a visualization that can be likened to a histogram for categorical variables. It displays the frequency of each category in the categorical variable. The following example demonstrates the count plot.

In [None]:
sns.set_context('paper')
# create plot
sns.countplot(x = 'class', hue = 'who', data = titanic, palette = 'magma')
plt.title('Number of passengers by class')
plt.show()
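
The numbers behind a count plot can be obtained directly in Pandas. This sketch uses a few made-up rows with the same column names; `value_counts()` gives the counts for one variable, and `pd.crosstab()` gives the counts split by a second variable, which corresponds to the hue grouping.

```python
import pandas as pd

# A few made-up rows, for illustration only
ticket = pd.DataFrame({
    "class": ["First", "Third", "Third", "Second", "Third", "First"],
    "who":   ["woman", "man",   "child", "man",    "man",   "man"],
})

# Frequency of each category: one bar per class
counts = ticket["class"].value_counts()
print(counts)

# Frequency of each (class, who) pair: one bar per class and hue level
by_group = pd.crosstab(ticket["class"], ticket["who"])
print(by_group)
```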

### 3.3.5 Box Plot

The box plot, commonly referred to as the box and whisker diagram, is a visualization technique used to display numerical data in groups, emphasizing the distribution through quartiles. The name "box and whisker" originates from its visual elements, comprising a box and whiskers.

The box plot provides a summary of five essential data points:

* Minimum value
* First Quartile (25th percentile)
* Median (Second Quartile or 50th percentile)
* Third Quartile (75th percentile)
* Maximum value

By presenting these key statistics, the box plot enables us to understand the spread and central tendency of the data. Additionally, it is an effective tool for identifying potential outliers in the dataset.

In [None]:
tips = sns.load_dataset("tips")
sns.boxplot(x="day", y="total_bill", data=tips, palette='GnBu')

* The bottom horizontal line (the lower whisker cap) is the smallest observation within 1.5 times the interquartile range (IQR) below the first quartile. It equals the minimum value when there are no low outliers.
* The bottom edge of the box is the first quartile or 25%.
* The horizontal line inside the box is the second quartile or 50% or median.
* The top edge of the box is the third quartile or 75%.
* The top horizontal line (the upper whisker cap) is the largest observation within 1.5 times the IQR above the third quartile. It equals the maximum value when there are no high outliers.
* The small diamond shapes are outliers, that is, observations that fall beyond the whiskers.
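
The elements of a box plot can be computed by hand with NumPy. This is a sketch on a made-up sample, assuming the common 1.5 times IQR rule for the whiskers, which is the default in Matplotlib and Seaborn.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up sample

# Quartiles define the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

In this sample the value 30 lies above the upper fence, so the upper whisker stops at 9 and 30 is shown as an outlier.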

### 3.3.6 Heatmap

A heatmap is a graphical representation of data presented in a matrix format, where each value in the matrix is depicted using colors. Seaborn allows us to create annotated heatmaps, providing additional information by annotating each cell with its actual value. Furthermore, Seaborn's heatmaps can be customized and modified using Matplotlib, allowing for flexibility in visualizing and presenting the data in the desired format.

In [None]:
# load a new dataset
car_crashes = sns.load_dataset("car_crashes")
# keep only numeric columns: the text column 'abbrev' cannot be correlated
corr=car_crashes.select_dtypes('number').corr()
sns.heatmap(corr,annot=True,linewidths=.5,cmap="YlGnBu")
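
Because a correlation matrix is symmetric, the upper triangle repeats the lower one. The sketch below builds a small synthetic DataFrame (the column names are made up) and hides the redundant half with the `mask` parameter of `sns.heatmap`.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two related columns, one unrelated, one text column
rng = np.random.default_rng(0)
base = rng.normal(size=50)
crashes = pd.DataFrame({
    "speeding": base + rng.normal(scale=0.5, size=50),
    "alcohol": base + rng.normal(scale=0.5, size=50),
    "premium": rng.normal(size=50),
    "state": ["XX"] * 50,  # text column, excluded from the correlation
})

# Correlate the numeric columns only
corr = crashes.select_dtypes("number").corr()

# True marks cells to hide: everything above the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

ax = sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
                 vmin=-1, vmax=1, cmap="YlGnBu")
```

Fixing `vmin` and `vmax` to -1 and 1 keeps the color scale comparable across different correlation matrices.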

### 3.3.7 Pair Plot

Pair plot is a visualization technique in Seaborn that creates a grid of axes, where each numeric variable in the dataset is shared across the y-axes in a single row and the x-axes in a single column. The diagonal plots of the grid are treated specially, as they show univariate distribution plots to display the marginal distribution of the data in each column. Pair plots are particularly useful for exploring relationships between multiple numerical variables and identifying patterns and correlations within the data.

In [None]:
# load a new dataset
df = sns.load_dataset('iris')
sns.set_style("ticks")
sns.pairplot(df,hue = 'species',diag_kind = "kde",kind = "scatter",palette = "husl")
plt.show()

### [Optional] 3.3.8 Joint Plot

A Joint Plot is a visualization that combines bivariate and univariate graphs for two variables. It utilizes both Scatter Plots and Histograms to depict the relationship between the variables. Additionally, Joint Plot can present data using Kernel Density Estimates (KDE) and Hexagons for a more detailed view. Furthermore, it allows us to include a Regression Line in the Scatter Plot to show the trend between the variables. Below are some examples of Joint Plot usage.

In [None]:
# we are using the restaurant tips data
sns.set_style("dark")
sns.jointplot(x='total_bill', y='tip',data=tips)

In [None]:
# You can also add a line to this scatter plot
sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg') 

In [None]:
# Display kernel density estimate instead of scatter plot and histogram
sns.jointplot(x='total_bill', y='tip', data=tips, kind='kde', fill=True)

In [None]:
# Display hexagons instead of points in scatter plot
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex') 

### [Optional] 3.3.9 Regplot

Seaborn's regplot function is utilized for visualizing linear relationships determined through regression analysis. It draws a scatter plot of the two variables and a regression line representing the relationship between them, along with a lightly shaded area around the line. This shaded band is a confidence interval for the regression estimate (95% by default, computed by bootstrapping). It is narrower where the fit is well supported by the data and wider where observations are sparse or more scattered.

In [None]:
ax = sns.regplot(x="total_bill", y="tip", data=tips)
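
The straight line in such a plot is an ordinary least-squares fit. As a rough sketch of what is being estimated, the slope and intercept can be computed with NumPy; the numbers below are made up and are not taken from the tips dataset.

```python
import numpy as np

bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up bill amounts
tip = np.array([2.0, 3.5, 5.0, 6.5, 8.0])        # made-up tips

# Fit a degree-1 polynomial: tip = slope * bill + intercept
slope, intercept = np.polyfit(bill, tip, deg=1)

print(f"tip = {slope:.2f} * total_bill + {intercept:.2f}")
```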

### [Optional] 3.3.10 LM Plot

In Seaborn, both regplot and lmplot can be used to visualize the regression between two variables. However, there is a difference between the two plots.

The regplot function is used to perform a simple linear regression model fit and plot the relationship between the variables. It provides a straightforward way to visualize the linear relationship without considering other factors.

On the other hand, the lmplot function combines the functionality of regplot with the power of FacetGrid. FacetGrid allows you to create multiple panels or subplots to visualize the distribution of one variable and the relationship between multiple variables within subsets of your dataset. This means you can plot separate regression lines for different subsets of your data.

It's important to keep in mind that using lmplot can be more computationally intensive compared to regplot, as it allows you to fit regression models across conditional subsets of your dataset, making it a more flexible and powerful tool for data exploration.

In [None]:
tips = sns.load_dataset("tips")
sns.lmplot(x="total_bill", y="tip", data=tips)

Here is how we can use the advanced features of lmplot() together with a multi-plot grid to plot conditional relationships.

In [None]:
sns.lmplot(x="total_bill", y="tip", col="day", hue="day",
               data=tips, col_wrap=2, height=3)

### [Optional] 3.3.11 KDE plot

A KDE (Kernel Density Estimate) plot is a visualization technique that shows a smooth, non-parametric estimate of the probability density function of a continuous variable. It allows us to visualize the distribution of a single variable (univariate) or multiple variables simultaneously. By using KDE, we can gain insights into the underlying probability distribution of the data, making it a useful tool for exploring the data's characteristics and patterns.

In [None]:
# load another dataset
sns.set_style("dark")
iris = sns.load_dataset("iris")
# Plotting the KDE Plot 
# fill=True shades the area under each curve (it replaces the deprecated shade=True)
sns.kdeplot(iris.loc[(iris['species']=='setosa'), 
            'sepal_length'], color='b', fill=True, label='setosa') 
sns.kdeplot(iris.loc[(iris['species']=='virginica'), 
            'sepal_length'], color='r', fill=True, label='virginica')

plt.legend()
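
To show what a kernel density estimate is, the sketch below computes one by hand with NumPy: one Gaussian bump is placed on every observation and the bumps are averaged. The sample values and the bandwidth of 0.3 are assumptions chosen for illustration; Seaborn selects a bandwidth automatically.

```python
import numpy as np

samples = np.array([4.8, 5.0, 5.1, 5.4, 5.8, 6.3, 6.5])  # made-up measurements
bandwidth = 0.3                                           # assumed smoothing width
grid = np.linspace(3.0, 8.5, 551)                         # points at which to evaluate

# One Gaussian bump per observation, evaluated on the grid
z = (grid[:, None] - samples[None, :]) / bandwidth
bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# A probability density integrates to approximately 1
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual observations more closely.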

### External Resource Links

Below are some external resource links to learn more about Matplotlib and Seaborn:

`Matplotlib:` 

- https://matplotlib.org/tutorials/index.html 
- https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html 

`Seaborn: `

- https://seaborn.pydata.org/
- https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html


# 4. DIY debugging for data visualization using ChatGPT <a id='p4' />

You may leverage ChatGPT to help create visualizations using matplotlib and seaborn.

Some example prompts for when:

- you are not sure what is the best visualization to create.

![image.png](attachment:image.png)

You can see that ChatGPT clearly states that if we are to show distribution of age, histogram will be a good option.

It also gives a code snippet to create a histogram using matplotlib. It provides an explanation on the code too.

![image-2.png](attachment:image-2.png)

- you want to create the visualization using different library with a specific color palette.

![image-3.png](attachment:image-3.png)

Again, ChatGPT seamlessly does the task of providing the code to create a histogram using seaborn library with a specific color palette.


# 5. Hands-On Practice Exercise <a id='p5' />

You can use ChatGPT to complete this hands-on practice exercise.

For the hands-on practice exercise, we will be using the below dataset.  

**Dataset: census_data.csv**

In [None]:
# load the dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
census_data = pd.read_csv("data/census_data.csv")

census_data.head()

<b>1. Create a heatmap to find the correlation between numeric columns in the dataset.</b>

<b>2. Create a line chart to plot age vs hours per week worked. </b>


<b>3. Create a horizontal boxplot to understand the distribution of age with respect to marital status.</b>

<b>4. Add the gender dimension to the above chart to see a more detailed age distribution.</b>

<b>5. Create a count plot to get the number of people in different categories of marital status.</b>

<b>6. Use the components of matplotlib to create a dashboard from any 4 visualizations above and save it as a jpg file.</b>

##### The End
[Back to Content](#tc)

Copyright © 2023 by Boston Consulting Group. All rights reserved.