# Module 3 - Visual Analysis

Module 2 focused on understanding the general structure of a dataset and some basic statistical information. However, summary statistics alone might not fully describe the data: datasets with nearly identical statistics can look completely different when plotted. Examples such as [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and the [Datasaurus Dozen](https://www.autodeskresearch.com/publications/samestats) show us why it's important to visually explore the dataset.
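
As a small illustration of that point, the sketch below takes the y-values of the first two Anscombe datasets (which share the same x-values) and prints their mean and sample variance. It uses only the standard library, and the variable names are illustrative. The two summaries come out almost identical even though the values themselves, and their scatterplots, differ.

```python
import statistics

# y-values of Anscombe's datasets I and II (both use the same x-values)
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in [("I", y1), ("II", y2)]:
    mean = statistics.mean(y)
    var = statistics.variance(y)  # sample variance
    print(f"dataset {name}: mean={mean:.2f}, variance={var:.2f}")
```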

In this lesson, we will use the Python library `seaborn` for data visualization. Other Python libraries, such as `matplotlib`, `bokeh`, and `plotly`, can also be used for data visualization.

In [None]:
import pandas as pd
import seaborn as sns

# this shows chart outputs within the notebook
%matplotlib inline  

In [None]:
filepath = "datasets/gradedata.csv"

df = pd.read_csv(filepath)
df.head()

In [None]:
df.describe()

## Histogram

Histograms are charts that show the distribution of data. In other words, histograms show where the data is gathered or clustered together (values occur frequently) and where they are sparse. A histogram can also help identify the area(s) where there might be outliers.

A histogram that has one hump and is roughly symmetrical (a bell shape) resembles a **normal distribution**. When the data is clustered on the right side, the distribution has a **left-skew** (the sparse values spread out towards the left); when the data is clustered on the left side, it has a **right-skew** (the sparse values spread out towards the right). The sparse values on either side of the distribution are called the **tails**.
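
To see what a skew looks like in numbers, here is a minimal sketch in plain Python. The `hours` values are made up for illustration and are not taken from `gradedata.csv`. It groups the values into bins of width 3, prints a text histogram, and compares the mean with the median. As a rule of thumb, a long right tail usually pulls the mean above the median.

```python
import statistics
from collections import Counter

# hypothetical study hours: most values are small, a few are large
hours = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 9, 14]

# group values into bins of width 3: 0-2, 3-5, 6-8, ...
bin_width = 3
bins = Counter((h // bin_width) * bin_width for h in hours)
for start in range(0, max(hours) + 1, bin_width):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * bins[start]}")

mean = statistics.mean(hours)
median = statistics.median(hours)
# the long right tail pulls the mean above the median
print(f"mean={mean:.2f}, median={median}")
```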

In [None]:
# histogram for distribution of student grades
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
sns.histplot(data=df, x='grade', bins=10)

In [None]:
# histogram for hours of study
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
sns.histplot(data=df, x='hours', bins=10)

## Boxplot

Boxplots (also known as "box-and-whisker plots") are another way of showing the distribution of data. Unlike histograms, which show where all the data may be clustered or sparse, boxplots show the range of the middle 50% of the data (called the **Interquartile Range** or IQR), with the median marked inside the IQR. On each end of the box are the **whiskers**, or fences, which mark the lower and upper limits beyond which values may be outliers.

![Boxplot](https://notebooks.azure.com/priesterkc/projects/images/raw/boxplot.png)
Source: [Towards Data Science: Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)
*Image modifications by Kenisha Priester
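
The quantities a boxplot draws can also be computed by hand. The sketch below uses made-up grades (not the course dataset) and the common 1.5 × IQR rule for the fences, which is also seaborn's default (`whis=1.5`). The quartile method chosen here is one of several conventions, so other tools may give slightly different quartiles on small samples.

```python
import statistics

# hypothetical grades, including one unusually low value
grades = [30, 62, 68, 70, 72, 75, 78, 80, 83, 85, 99]

# quartiles with linear interpolation between data points
q1, median, q3 = statistics.quantiles(grades, n=4, method="inclusive")
iqr = q3 - q1

# values beyond the fences are flagged as possible outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [g for g in grades if g < lower_fence or g > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"outliers: {outliers}")
```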

In [None]:
# boxplot distribution of student grades
# (pass the column as x= so newer seaborn versions read it as the plotted variable)
sns.boxplot(x=df['grade'])

In [None]:
# boxplot distribution for student grade vs age
sns.boxplot(x='age', y='grade', data=df)

## Violinplot

Violinplots combine a distribution curve (the smoothed-out, line-graph version of a histogram) with a boxplot, so both views of the data's shape can be compared in one chart. The distribution curve is mirrored on both sides of the boxplot drawn inside the violin.
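
The smoothed curve in a violinplot is a kernel density estimate (KDE). The following is a minimal sketch of the idea, not seaborn's implementation: each data point contributes a small Gaussian bump, and the bumps are averaged. The grades and the fixed `bandwidth=5.0` are assumptions for illustration; seaborn chooses the bandwidth automatically.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x
    by averaging one Gaussian bump per data point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# hypothetical grades
grades = [62, 68, 70, 72, 75, 78, 80, 83, 85, 99]
density = gaussian_kde(grades, bandwidth=5.0)

# the curve is highest where the grades cluster
for x in range(60, 101, 10):
    print(x, round(density(x), 4))
```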

In [None]:
# distribution curve & boxplot of student grades
# (pass the column as x= so newer seaborn versions read it as the plotted variable)
sns.violinplot(x=df['grade'])

In [None]:
# distribution curve & boxplot for student grade vs age
sns.violinplot(x='age', y='grade', data=df)

## Countplot

A countplot is the visual representation of the `value_counts` method in `pandas`. `value_counts` shows how often each value or category occurs (for example, how many people like each flavor of ice cream). The countplot is used to see which categories hold a majority or a minority of the items.
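
The tally behind a countplot can be sketched in plain Python with `collections.Counter`. The ages below are made up for illustration. Like `value_counts`, `most_common()` lists the categories from most to least frequent.

```python
from collections import Counter

# hypothetical ages for ten students
ages = [17, 18, 18, 19, 18, 17, 19, 18, 17, 18]

counts = Counter(ages)
# most_common() sorts from most to least frequent
for age, n in counts.most_common():
    print(f"age {age}: {'#' * n} ({n})")
```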

In [None]:
# frequency of students for each age category
# (pass the column as x= so newer seaborn versions read it as the plotted variable)
sns.countplot(x=df['age'])

## Barplot

Barplots are a part-to-part comparison chart in which each bar shows the value of a statistical function (such as mean, median, or standard deviation) for one category. Seaborn's `barplot` function uses the *mean* as its default statistical function.
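
As a rough sketch of what each bar represents, the plain-Python example below groups made-up `(age, grade)` pairs by age and takes the mean of each group. The data and names are hypothetical. In seaborn, a different statistic can be chosen through `barplot`'s `estimator` parameter.

```python
import statistics
from collections import defaultdict

# hypothetical (age, grade) pairs
records = [(17, 70), (17, 80), (18, 85), (18, 75), (18, 95), (19, 60), (19, 90)]

# collect the grades that belong to each age group
grades_by_age = defaultdict(list)
for age, grade in records:
    grades_by_age[age].append(grade)

# one bar per age group; the bar height is the mean grade for that group
bar_heights = {
    age: statistics.mean(grades) for age, grades in sorted(grades_by_age.items())
}
print(bar_heights)
```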

In [None]:
# average (mean) grade per age category, without error bars
# (seaborn 0.12+ uses errorbar=None; older versions use ci=None)
sns.barplot(x=df['age'], y=df['grade'], errorbar=None)

## Pie Chart

Pie charts are a parts-to-whole comparison graph. Similar to countplots, pie charts are used to find a majority or minority category within the whole.
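
Each slice of a pie chart is a category's share of the total. The sketch below, using made-up ages, converts counts to percentages and formats them with the same `'%1.1f%%'` pattern that is commonly passed as `autopct`: one decimal place followed by a literal percent sign.

```python
from collections import Counter

# hypothetical ages for ten students
ages = [17, 18, 18, 19, 18, 17, 19, 18, 17, 18]

counts = Counter(ages)
total = sum(counts.values())

labels = {}
for age, n in counts.items():
    pct = 100 * n / total            # share of the whole, in percent
    labels[age] = "%1.1f%%" % pct    # same format string as autopct
print(labels)
```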

In [None]:
# percentage of students by age
df['age'].value_counts().plot(kind='pie', autopct='%1.1f%%')

## Scatterplot

A scatterplot compares two numerical features (columns of data), where one feature's values are plotted on the x-axis and the other feature's values are plotted on the y-axis. Scatterplots help us find patterns in the data that show how the two features may be related to each other.
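
One way to put a number on a straight-line pattern seen in a scatterplot is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand for made-up values; the data and the helper name `pearson_r` are hypothetical. It only measures linear relationships, so it complements the plot rather than replacing it.

```python
import math

# hypothetical hours of study and test grades for six students
hours = [1, 2, 3, 4, 5, 6]
grades = [55, 60, 70, 72, 85, 90]

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(hours, grades)
# a value close to 1 means grades tend to rise with hours
print(f"r = {r:.3f}")
```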

In [None]:
# compare hours of study vs test grade
sns.scatterplot(x='hours', y='grade', data=df)

In [None]:
# boxplots for the numeric columns of the dataset, drawn side by side
sns.boxplot(data=df)