## Problem Statement
- Load the dataset
- Univariate plots
    - Histograms
    - Density Plots
    - Box and Whisker plots
- Multivariate plots
    - Scatter Plots
    - Correlation plots

#### Load Python libraries and dataset

In [None]:
import pandas as pd
import numpy

#### Load dataset stored in CSV file

In [None]:
data = pd.read_csv("../data/pima-indians-diabetes.csv")

#### Check Your Data

In [None]:
# check the first 5 rows of the dataset
print(data.head(5))

##### Check Shape of the data in terms of rows and columns.

In [None]:
# check shape of the dataset
print(data.shape)
print(f"\nThe dataset has {data.shape[0]} rows and {data.shape[1]} columns.")

## Univariate Plots

#### Study three techniques to understand each attribute of the dataset independently
- Histograms
- Density plots
- Box and Whisker plots


### Histograms 
A fast way to get an idea of the distribution of each attribute is to look at histograms.
- Histograms group data into bins and provide a count of the number of observations in each bin. 
- From the shape of the bins one can quickly get an idea whether an attribute is Gaussian, skewed or even has an exponential distribution. 
- It can also help you see possible outliers.
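
To make the binning step concrete, here is a minimal sketch using `numpy.histogram` on a small made-up array. The values are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Synthetic values standing in for one attribute (not from the dataset).
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the value range into 4 equal-width bins and count observations per bin.
counts, bin_edges = np.histogram(values, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} observations")
```

The lone value in the last bin, far from the bulk of the data, is the kind of point a histogram makes visible as a possible outlier.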


#### Load Python libraries for data visualization

In [None]:
from matplotlib import pyplot

In [None]:
# show histogram for each attribute
data.hist(figsize=(15,15))
pyplot.show()

#### Inference from Histogram plots
- It can be observed that the attributes **age**, **pedi** and **test** may have an exponential distribution. 
- On the other hand, the **mass**, **pres** and **plas** attributes may have a Gaussian or nearly Gaussian distribution.
- This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.
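
A visual impression of shape can be backed up with a number. The sketch below uses sample skewness from pandas on two synthetic columns; the column names and parameters are hypothetical and the output says nothing about the actual dataset. Values near 0 suggest a roughly symmetric distribution, while clearly positive values suggest a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic data: one roughly Gaussian column and one exponential column.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "gaussian_like": rng.normal(loc=50, scale=10, size=5000),
    "exponential_like": rng.exponential(scale=10, size=5000),
})

# Sample skewness for each column.
skewness = sample.skew()
print(skewness.round(2))
```

The same call, `data.skew()`, could be applied to the loaded dataset, assuming all of its columns are numeric.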

### Density Plots
Density plots are another way of getting a quick idea of the distribution of each attribute.
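
A density plot is usually drawn from a kernel density estimate: a smooth bump is placed on every observation and the bumps are averaged. The following is a hand-rolled sketch of that idea with a Gaussian kernel on synthetic data. The bandwidth rule is an assumption chosen for illustration, and this is not the code pandas runs internally.

```python
import numpy as np

# Synthetic observations standing in for one attribute.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=200)

# Bandwidth from a common rule of thumb (an assumption; other choices are valid).
bandwidth = values.std(ddof=1) * len(values) ** (-1 / 5)

# Evaluate the estimate on a grid that extends a little past the data.
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)

# Place a Gaussian bump on every observation and average the bumps at each grid point.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(values) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to about 1; check with a simple Riemann sum.
area = (density * (grid[1] - grid[0])).sum()
print(f"area under the curve: {area:.3f}")
```

A larger bandwidth gives a smoother curve that may hide detail; a smaller one follows the data more closely but looks noisier.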

In [None]:
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,15))
pyplot.show()

### Box and Whisker Plots
- Another way to review the distribution of each attribute is to use Box and Whisker Plots or boxplots. 
- Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). 
- The whiskers give an idea of the spread of the data.
- The dots outside the whiskers are candidate outliers: values more than 1.5 times the interquartile range (the spread of the middle 50% of the data) below the 25th percentile or above the 75th percentile.

![box_plot.PNG](attachment:box_plot.PNG)

![iqr.PNG](attachment:iqr.PNG)
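
The quantities behind a boxplot can be computed directly. This is a minimal sketch of the 1.5 × IQR rule on a made-up array (not dataset values), using NumPy's default percentile interpolation.

```python
import numpy as np

# Synthetic values with one obviously extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range (the height of the box).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn as individual dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

Here only the value 100 falls outside the fences, so it is the only point that would be plotted as a dot.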

In [None]:
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,15))
pyplot.show()

#### Inference from Boxplots
- The boxplots show that the spread differs considerably from one attribute to another.
- Some, such as **age**, **test** and **skin**, appear quite skewed towards smaller values.


## Multivariate Plots
- Scatter Plots
- Correlation Plots

### Scatter Plots
- A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. 
- Scatter plots are useful for checking structured relationships between variables. 
- Attributes with structured relationships may also be correlated, which can make one attribute of the pair a good candidate for removal from your dataset.
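
One possible way to shortlist such candidates is to list the attribute pairs whose absolute correlation exceeds a cut-off. The sketch below uses a synthetic frame; the column names `a`, `b`, `c` and the 0.9 threshold are hypothetical choices, not values derived from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic frame: "b" is almost a linear copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
frame = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

threshold = 0.9  # hypothetical cut-off, chosen for illustration

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = frame.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Collect the pairs above the threshold (masked cells are NaN and never pass).
pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(pairs)
```

Whether to actually drop one attribute of a flagged pair remains a modelling decision; the list is only a starting point.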

#### Load Python libraries

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
# scatter plot matrix of every pair of attributes
scatter_matrix(data, figsize=(16,16))
pyplot.show()

### Correlation Plots
- Correlation gives an indication of how related the changes are between two variables. 
- **Positive Correlation**: If two variables change in the same direction, they are positively correlated.
- **Negative Correlation**: If they change in opposite directions (one goes up while the other goes down), they are negatively correlated.
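
As a small worked example, the Pearson correlation coefficient can be written out by hand and compared with `numpy.corrcoef`. The helper name `pearson` and the arrays are illustrative assumptions.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
same_direction = 2 * x + 1       # rises whenever x rises
opposite_direction = 10 - 3 * x  # falls whenever x rises

r_pos = pearson(x, same_direction)
r_neg = pearson(x, opposite_direction)
print(f"same direction: {r_pos:+.2f}, opposite direction: {r_neg:+.2f}")

# Cross-check against NumPy's built-in implementation.
r_check = np.corrcoef(x, same_direction)[0, 1]
```

The coefficient always lies between -1 and +1, which is why the heatmap colour scale is typically fixed to that range.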

#### Load Python libraries

In [None]:
import seaborn as sns

In [None]:
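# pairwise correlation between all attributes
correlations = data.corr()
# draw the correlation matrix as an annotated heatmap on a fixed -1 to 1 scale
pyplot.figure(figsize=(9,9))
heatmap = sns.heatmap(correlations, vmin=-1, vmax=1, annot=True)
pyplot.show()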