# Visualising Distributions of Data



We will start using the ```seaborn``` library for data visualisation. Seaborn is a Python library built on top of ```matplotlib```. Its default styles are generally more polished than those of ```matplotlib```, and it often needs less code when you want to customise your plots, add colours, grids, etc.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

# the commonly used alias for seaborn is sns
import seaborn as sns

# set a seaborn style of your taste
sns.set_style("whitegrid")

# data
df = pd.read_csv("./global_sales_data/market_fact.csv")


In [None]:
import warnings
warnings.filterwarnings('ignore')

### Histograms and Density Plots

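Histograms and density plots show the value of a numeric variable along the x-axis and its frequency (or estimated density) along the y-axis. By default, the ```sns.distplot()``` function draws a histogram with a density curve overlaid. Note that ```distplot()``` has been deprecated since seaborn 0.11 in favour of ```histplot()```, ```displot()``` and ```kdeplot()```.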

In [None]:
# histogram with a density curve overlaid (the distplot default)
sns.distplot(df['Sales'])
plt.show()

A simple density plot (without the histogram bars) can be created by specifying ```hist=False```.

In [None]:
sns.distplot(df['Sales'], hist=False)
plt.show()

Since seaborn uses matplotlib behind the scenes, the usual matplotlib functions work well with seaborn. For example, you can use subplots to plot multiple univariate distributions.

In [None]:
# subplots

# subplot 1
plt.subplot(2, 2, 1)
plt.title('Sales')
sns.distplot(df['Sales'])

# subplot 2
plt.subplot(2, 2, 2)
plt.title('Profit')
sns.distplot(df['Profit'])

# subplot 3
plt.subplot(2, 2, 3)
plt.title('Order Quantity')
sns.distplot(df['Order_Quantity'])

# subplot 4
plt.subplot(2, 2, 4)
plt.title('Shipping Cost')
sns.distplot(df['Shipping_Cost'])

# keep titles and axis labels from overlapping
plt.tight_layout()
plt.show()


### Boxplots

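Boxplots are a compact way to visualise univariate data because they summarise it with a few statistics: the 25th, 50th (median) and 75th percentiles form the box, while the whiskers and individual points show the spread and potential outliers.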

In [None]:
# boxplot (x=variable plots the values on the horizontal axis)
sns.boxplot(x=df['Order_Quantity'])
plt.title('Order Quantity')

plt.show()

In [None]:
# to plot the values on the vertical axis, specify y=variable
sns.boxplot(y=df['Order_Quantity'])
plt.title('Order Quantity')

plt.show()
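To see what a boxplot encodes, the statistics can be computed by hand. The sketch below uses the common 1.5 × IQR rule for the whiskers, which is the default in matplotlib and seaborn; the `order_quantity` array is synthetic.

```python
import numpy as np

# Synthetic stand-in for an order quantity column (not the real data)
rng = np.random.default_rng(3)
order_quantity = rng.integers(1, 51, size=200)

# The box spans the 25th to 75th percentile; the line inside is the median
q1, median, q3 = np.percentile(order_quantity, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_inside = (order_quantity >= lower_fence) & (order_quantity <= upper_fence)
inside = order_quantity[is_inside]
outliers = order_quantity[~is_inside]

# Whiskers reach the most extreme observations that lie within the fences
whisker_low, whisker_high = inside.min(), inside.max()

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("number of outliers:", len(outliers))
```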

## Plotting Pairwise Relationships

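You'll find it helpful to plot pairwise relationships between multiple numeric variables. The cells below use the Boston housing dataset from scikit-learn. Note that ```load_boston``` was deprecated in scikit-learn 1.0 and removed in 1.2, so these cells only run on older releases.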

In [None]:
from sklearn.datasets import load_boston

In [None]:
boston = load_boston()

In [None]:
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston['Price'] = boston.target

### Pairwise Scatter Plots

Now, since we have multiple numeric variables, ```sns.pairplot()``` is a good choice to plot all of them in one figure.

In [None]:
# pairplot (it creates its own figure, so no plt.figure() call is needed)
sns.pairplot(df_boston)
plt.show()

In [None]:
df_new = df_boston[['RM', 'B', 'LSTAT', 'Price']]

In [None]:
df_new.shape

In [None]:
sns.pairplot(df_new)
plt.show()

In [None]:
# You can also inspect the correlation between the features
# using df.corr()
cor = df_boston.corr()
cor.round(3)
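By default, `DataFrame.corr()` computes the Pearson correlation coefficient for every pair of columns. The sketch below checks one entry against a manual calculation; the `demo` dataframe and its columns are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic data: y depends strongly on x, z is independent noise
rng = np.random.default_rng(5)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "y": 3.0 * x + rng.normal(scale=0.5, size=200),
    "z": rng.normal(size=200),
})

cor_demo = demo.corr()

# Pearson r by hand: covariance divided by the product of the spreads
xc = demo["x"] - demo["x"].mean()
yc = demo["y"] - demo["y"].mean()
r_xy = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(cor_demo.round(3))
print("manual r(x, y):", round(r_xy, 3))
```

The matrix is symmetric with ones on the diagonal, so only one triangle carries distinct information.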

The dataframe above is the **correlation matrix** of the housing dataset. Try to find some strong relationships between features.

## Heatmaps

It will be helpful to visualise the correlation matrix itself using ```sns.heatmap()```.

In [None]:
# figure size
plt.figure(figsize=(10,8))

# heatmap
sns.heatmap(cor, cmap="cool", annot=True)
plt.show()
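Because correlations range from -1 to 1, a diverging colour map with fixed limits can make the sign easier to read, and a mask can hide the redundant upper triangle. This is one possible variation, sketched on a synthetic dataframe; the choice of `"coolwarm"` and the masking are suggestions, not part of the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with one positive and one negative relationship
rng = np.random.default_rng(6)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.5, size=200),
    "z": -x + rng.normal(scale=1.0, size=200),
})
cor_demo = demo.corr()

# True marks the cells to hide: everything above the diagonal
mask = np.triu(np.ones(cor_demo.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cor_demo, mask=mask, cmap="coolwarm", vmin=-1, vmax=1,
            annot=True, fmt=".2f", ax=ax)
# in a script, follow with plt.show()
```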