# Visualization in Pandas


There are numerous libraries available in Python for creating visualizations. Most of the time we will use [Matplotlib](http://matplotlib.org/) and/or [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) for general-purpose plotting, and other libraries when we need something more specialized ([Plotly](https://plot.ly/) for dashboards, for example). All of these libraries let us build great-looking visualizations that can be used in a production setting. If we just want a quick-and-dirty look at our data, there is also some plotting functionality built into Pandas.

If we look at the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html), we can see that plotting in Pandas is done via the `plot()` method of a DataFrame (or Series) object. We pass arguments to `plot()` to specify exactly how to build the plot. The most important of these is the `kind` keyword argument, which tells `plot()` what kind of visualization we would like (bar plot, histogram, scatter plot, etc.). Most of the time we'll be doing our visualization in Matplotlib or Seaborn, but here is a little taste of what Pandas can do so that you know it's there.
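As a minimal sketch of how `kind` changes the result, the cell below plots a tiny made-up DataFrame (the name `toy` and its values are assumptions for illustration, not part of the wine data) once as a line plot and once as a bar plot.

```python
import pandas as pd

# Tiny made-up DataFrame, for illustration only (not the wine data)
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})

# The same method draws a different plot depending on `kind`
ax_line = toy.plot(kind="line")  # one line per column
ax_bar = toy.plot(kind="bar")    # one bar per cell, grouped by row

# Each kind is also available as an accessor method with the same effect
ax_bar_again = toy.plot.bar()
```

Each call returns a Matplotlib `Axes` object, so the plot can be customized further with Matplotlib if needed.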

For the examples in this notebook we use a dataset on the [quality of red wines](https://archive.ics.uci.edu/ml/datasets/wine+quality). You already got to know this dataset in the last notebook. Now we will try to get an even better feel for it with the help of some plots. At the end of the notebook, there are some tasks in which you will create plots with Pandas yourself.


## Learning Objectives

At the end of this notebook you will be able to:

- create plots with the Pandas function `.plot()`
- describe the different kinds of plots (e.g. histograms, scatter plots, bar plots and box plots)
- explain what conclusions you draw from these visualizations

First, we need to import pandas and load our data into a DataFrame.

In [None]:
import pandas as pd
df = pd.read_csv('data/winequality-red.csv', delimiter=';')

In [None]:
df.head()

In [None]:
df.columns

We have several input variables; they are based on physicochemical tests: 
- 1 - fixed acidity
- 2 - volatile acidity
- 3 - citric acid
- 4 - residual sugar
- 5 - chlorides
- 6 - free sulfur dioxide
- 7 - total sulfur dioxide
- 8 - density
- 9 - pH
- 10 - sulphates
- 11 - alcohol

And there is one output variable, based on sensory data:
- 12 - quality (which is a score between 0 and 10)

To get a good overview of the data, we start with histograms, which we request with the argument `kind='hist'`. A histogram shows the distribution of a single variable, so we need to select a particular column. The output variable, the quality of the wine, seems to be the most interesting, so we plot it first.

In [None]:
# specify a certain column and call the pandas plot function
df['quality'].plot(kind='hist')

You can see that the red wines in this dataset have quality scores from 3 to 8, so we have neither very bad wines nor wines with the best possible scores. The score 5 is the most common: nearly 700 wines received it.
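Counts read off a histogram can also be checked numerically. The sketch below uses a short made-up Series of scores (the name `scores` and its values are assumptions for illustration); in the notebook you would apply the same calls to `df['quality']`.

```python
import pandas as pd

# Made-up scores for illustration only; in the notebook use df['quality'] instead
scores = pd.Series([5, 5, 6, 5, 7, 6, 3, 8, 5, 6], name="quality")

# Count how often each score occurs and order the result by score
counts = scores.value_counts().sort_index()
print(counts)

# For a discrete variable, a bar plot of the counts gives exactly one bar per score
ax = counts.plot(kind="bar")
```

Because quality only takes a few integer values, the bar plot of counts avoids the arbitrary bin edges that a histogram introduces.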

Other plots can be made with the `plot()` method as well. Next we try scatter plots, for which we have to specify one column for the x-axis and one for the y-axis and set `kind='scatter'`.

In [None]:
# A scatter plot needs both an x and a y column; Pandas raises an error if either is missing.
df.plot(kind='scatter', x='total sulfur dioxide', y='free sulfur dioxide')

This is what scatter plots for continuous variables look like. You can see that as the total sulfur dioxide increases, the free sulfur dioxide tends to increase as well.
It is also interesting to see how different features affect the quality of red wines. Quality is a discrete variable, so don't be surprised that the plot looks different.

In [None]:
df.plot(kind='scatter', x='quality', y='alcohol')

This might not be the best plot for drawing conclusions, but at least we can see that wines with higher quality do not have low alcohol concentrations.
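When one variable is discrete, a grouped summary is often easier to read than a scatter plot. The following is a minimal sketch on made-up numbers (the `toy` DataFrame and its values are assumptions for illustration); in the notebook the same calls could be applied to the wine `df`.

```python
import pandas as pd

# Made-up values for illustration only; in the notebook use the wine `df`
toy = pd.DataFrame({
    "quality": [3, 3, 5, 5, 5, 7, 7, 8],
    "alcohol": [9.0, 10.0, 9.5, 10.5, 9.8, 11.5, 12.0, 12.5],
})

# Minimum, mean and maximum alcohol content per quality score
summary = toy.groupby("quality")["alcohol"].agg(["min", "mean", "max"])
print(summary)

# One box per quality score shows the same information graphically
ax = toy.boxplot(column="alcohol", by="quality")
```

The `min` column of the summary directly answers whether high-quality wines with low alcohol content occur in the data.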

The next type of plot we want to look at is the box plot. Like histograms, box plots give us a better impression of the distribution of the data.

In [None]:
df.plot(kind='box')

For most of the features we cannot distinguish their distributions because the features are on very different scales. It is therefore better to select only a few columns and see how that looks...


In [None]:
df[['fixed acidity', 'pH', 'alcohol']].plot(kind='box')

In [None]:
# This still doesn't look great - it's hard to really examine these three columns since pH is 
# so different from the other two. Let's drop pH and try one more time...
df[['fixed acidity', 'alcohol']].plot(kind='box')
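Instead of dropping columns, another option is to give every column its own axes so that each box gets its own scale. The cell below is a sketch on made-up values (the `toy` DataFrame is an assumption for illustration); in the notebook the same arguments could be passed when plotting the wine `df`.

```python
import pandas as pd

# Made-up values on different scales, for illustration only
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 6.9, 8.1],
    "pH": [3.51, 3.20, 3.16, 3.40, 3.30],
    "alcohol": [9.4, 9.8, 10.0, 11.2, 12.5],
})

# subplots=True draws each column in its own axes;
# sharey=False (the default) lets every box use its own y-axis scale
axes = toy.plot(kind="box", subplots=True, sharey=False, figsize=(9, 3))
```

This keeps all columns in one figure while still making the small spread of `pH` visible.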

## Check your understanding 

Now it's time to try plotting. Check out the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization) and try out different types of plots. See if you find any interesting new insights into the data.

After that, there are some specific tasks to test your knowledge.

**Practice with plotting**
1. Plot the average amount of `chlorides` for each `quality` value (1 from Part 3). 

2. Plot the `alcohol` values against `pH` values. Does there appear to be any relationship between the two?

3. Plot `total_acidity` values against `pH` values. Does there appear to be any relationship between the two?

4. Plot a histogram of the `quality` values. Are they evenly distributed within the data set?

5. Plot a boxplot to look at the distribution of `citric acid`.