# 1 Univariate plotting with pandas

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head(3)

## Bar charts and categorical data

In [None]:
%matplotlib inline
reviews['province'].value_counts().head(10).plot.bar()

In [None]:
(reviews['province'].value_counts().head(10) / len(reviews)).plot.bar()

In [None]:
reviews['points'].value_counts().sort_index().plot.bar()

## Line charts

In [None]:
reviews['points'].value_counts().sort_index().plot.line()

**Line charts make it harder to distinguish between individual values than bar charts do.
In general, if your data can fit into a bar chart, just use a bar chart!**

## Area charts
Area charts are just line charts, but with the bottom shaded in. That's it!

In [None]:
reviews['points'].value_counts().sort_index().plot.area()

## Histograms

In [None]:
reviews[reviews['price'] < 200]['price'].plot.hist()

A histogram looks much like a bar chart. The only analytical difference is that each bar represents a range of values instead of a single value.
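To make the "range of values" idea concrete, here is a minimal sketch of the binning a histogram performs. The prices are made up for illustration and are not taken from the wine dataset; `np.histogram` is used only to expose the bin edges and counts that a histogram would draw as bars.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, invented for this illustration.
prices = pd.Series([4, 9, 12, 15, 18, 22, 25, 31, 38, 40])

# Split the range 0..40 into 4 equal-width bins and count the values in each.
# Every bin is half-open [low, high) except the last, which includes its upper edge.
counts, edges = np.histogram(prices, bins=4, range=(0, 40))

print(edges)   # the bin boundaries
print(counts)  # how many prices fall into each bin
```

Each entry of `counts` corresponds to one bar, and each neighbouring pair of `edges` gives the range that bar covers.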

However, histograms have one major shortcoming (the reason for the $200 cutoff in the plot above). Because they break space up into even intervals, they don't deal very well with skewed data:

In [None]:
reviews['price'].plot.hist()

In [None]:
reviews[reviews['price'] > 1500]

There are many ways of dealing with the skewed data problem; those are outside the scope of this tutorial. **The easiest is to just do what I did: cut things off at a sensible level.**
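The effect of a cutoff can be sketched with a small synthetic example. The numbers below are invented, and the threshold of 200 simply mirrors the filter used earlier; treat it as an assumption to adjust for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed prices: many cheap bottles plus one extreme outlier.
prices = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 35, 2000])

# Even-width bins over the full range: the outlier stretches the axis,
# so almost every value is squeezed into the first bin.
full_counts, _ = np.histogram(prices, bins=5)

# The same number of bins after cutting off at the chosen threshold.
trimmed = prices[prices < 200]
trimmed_counts, _ = np.histogram(trimmed, bins=5)

print(full_counts)
print(trimmed_counts)
```

With the outlier included, the bins say almost nothing about where most prices sit. After the cutoff, the same five bins spread across the range that actually holds the data.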

In [None]:
reviews['points'].plot.hist()

In [None]:
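# Use the fully qualified option name; the short form 'max_columns' is ambiguous in recent pandas.
pd.set_option('display.max_columns', None)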
pokemon = pd.read_csv("../input/pokemon/Pokemon.csv")
pokemon.head(3)

In [None]:
# The frequency of Pokemon by type:
pokemon['Type 1'].value_counts().plot.bar()

In [None]:
pokemon.HP.describe()

In [None]:
# The frequency of Pokemon by HP stat total:
pokemon.HP.value_counts().sort_index().plot.line()

In [None]:
# The frequency of Pokemon by Speed stat:
pokemon.Speed.plot.hist()

## Addendum: on Pie Charts

In [None]:
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head()

In [None]:
reviews['province'].value_counts().head(10).plot.pie()

In [None]:
reviews['province'].value_counts().head(10).plot.pie()
# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')

But you shouldn't use a pie chart. The reason is simple: can you tell, looking at this chart, which province produces more wine: Veneto or Burgundy?

Research has shown that pie charts work well for quantities that are near common fractional values: one-half, one-third, and one-quarter. However, once you start to drill down into tenths, twelfths, and so on, our ability to visually compare two pie slices, especially ones not immediately adjacent to one another, breaks down.

**Pie charts are like bar charts, but wrapped around a circle. You should just use a bar chart instead.**

# 2 Bivariate plotting with pandas

In [None]:
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head()

## Scatter plot

In [None]:
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')

In [None]:
reviews[reviews['price'] < 100].plot.scatter(x='price', y='points')

Note that in order to make effective use of this plot, we had to downsample our data, taking just 100 points from the full set. This is because naive scatter plots do not effectively treat points which map to the same place.

For example, if two wines, both costing 100 dollars, get a rating of 90, then the second one is overplotted onto the first one, and we add just one point to the plot.

This isn't a problem if it happens just a few times. But with enough points the distribution starts to look like a shapeless blob, and you lose the forest for the trees, as the unsampled scatter plot above shows.

**There are a few ways to treat this problem. We've already demonstrated one way: sampling the points. Another interesting way to do this that's built right into pandas is to use our next plot type, a hexplot.**
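One rough way to gauge how much overplotting to expect is to compare the number of rows with the number of distinct positions they occupy. The sketch below uses a tiny made-up table; the column names `price` and `points` match the wine data, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical reviews, for illustration only.
sample = pd.DataFrame({
    "price":  [100, 100, 20, 20, 20, 35],
    "points": [90,  90,  87, 87, 88, 91],
})

# Rows in the table versus distinct (price, points) positions on the plot.
n_rows = len(sample)
n_positions = len(sample.drop_duplicates(subset=["price", "points"]))

# How many rows land on each position; anything above 1 is overplotted.
overlap = sample.groupby(["price", "points"]).size()

print(n_rows, n_positions)
print(overlap)
```

Here six rows collapse onto four visible markers. On a large dataset the gap between the two numbers hints at how much a plain scatter plot is hiding.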

## Hexplot
A hexplot aggregates points in space into hexagons, and then colorizes those hexagons:

In [None]:
reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)

The data in this plot is directly comparable to the scatter plot from earlier, but the story it tells us is very different. **The hexplot provides us with a much more useful view on the dataset, showing that the bottles of wine reviewed by Wine Magazine cluster around 87.5 points and around $20.**
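The aggregation behind a hexplot can be approximated with a plain rectangular grid. The sketch below swaps hexagons for square cells using `np.histogram2d`, so it is an analogy rather than what `hexbin` does internally, and the data points are invented.

```python
import numpy as np

# Hypothetical (price, points) pairs, for illustration only.
price = np.array([18, 19, 20, 21, 22, 45, 60, 80])
points = np.array([87, 88, 87, 88, 87, 90, 92, 95])

# Count how many pairs fall into each cell of a 4x4 grid.
counts, price_edges, points_edges = np.histogram2d(
    price, points, bins=[4, 4], range=[[0, 100], [80, 100]]
)

# Locate the densest cell: this is the region a hexplot would color most strongly.
i, j = np.unravel_index(np.argmax(counts), counts.shape)
print(counts[i, j])
print(price_edges[i], price_edges[i + 1])    # price range of the densest cell
print(points_edges[j], points_edges[j + 1])  # points range of the densest cell
```

Because each cell stores a count instead of drawing one marker per row, stacked points add to the color intensity instead of disappearing behind one another.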

Hexplots and scatter plots can be applied to combinations of interval variables or ordinal categorical variables.

## Stacked plots
Scatter plots and hex plots are new. But we can also use the simpler plots we saw in the last notebook.

The easiest way to modify them to support another visual variable is by using stacking. A stacked chart is one which plots the variables one on top of the other.
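"One on top of the other" amounts to a running total across the columns: each segment starts where the previous one ended. A minimal sketch with invented counts follows; the column names are placeholders.

```python
import pandas as pd

# Hypothetical counts: rows are scores, columns are two made-up categories.
counts = pd.DataFrame(
    {"Variety A": [3, 5, 2], "Variety B": [1, 4, 6]},
    index=[85, 86, 87],
)

# The top of each stacked segment is the cumulative sum across columns;
# the bottom is the top minus the segment's own height.
tops = counts.cumsum(axis=1)
bottoms = tops - counts

print(bottoms)
print(tops)
```

The first column always starts at zero, and the last column's tops equal the row totals, which is why a stacked chart also shows the overall total per row.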

In [None]:
wine_counts = pd.read_csv("../input/most-common-wine-scores/top-five-wine-score-counts.csv",
                          index_col=0)
wine_counts.head()

Many pandas multivariate plots expect input data to be in this format, with one categorical variable in the columns, one categorical variable in the rows, and counts of their intersections in the entries.
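If your data has one row per observation instead, `pd.crosstab` is one way to reshape it into this layout. The sketch below uses a small invented table; the column names `points` and `variety` are assumptions chosen to resemble the wine data.

```python
import pandas as pd

# Hypothetical long-form data: one row per review.
long_form = pd.DataFrame({
    "points":  [85, 85, 86, 86, 86, 87],
    "variety": ["Chardonnay", "Pinot Noir", "Chardonnay",
                "Chardonnay", "Pinot Noir", "Pinot Noir"],
})

# Rows become scores, columns become varieties, entries are counts of their intersections.
table = pd.crosstab(long_form["points"], long_form["variety"])
print(table)
```

A table shaped like this can then be passed to the multivariate plot methods used in this section, such as `plot.bar(stacked=True)`.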

In [None]:
wine_counts.plot.bar(stacked=True)

In [None]:
wine_counts.plot.area()

## Bivariate line chart
One plot type we've seen already that remains highly effective when made bivariate is the line chart. Because the line in this chart takes up so little visual space, it's really easy and effective to overplot multiple lines on the same chart.

In [None]:
wine_counts.plot.line()

## Exercise

In [None]:
pokemon = pd.read_csv("../input/pokemon/Pokemon.csv", index_col=0)
pokemon.head()

In [None]:
pokemon.plot.scatter(x='Attack',y='Defense')

In [None]:
pokemon.plot.hexbin(x='Attack',y='Defense', gridsize=15)

In [None]:
# Select the numeric columns before aggregating so that text columns don't break mean().
pokemon_stats_legendary = pokemon.groupby(['Legendary', 'Generation'])[['Attack', 'Defense']].mean()

In [None]:
pokemon_stats_legendary.head()

In [None]:
pokemon_stats_legendary.plot.bar(stacked=True)

In [None]:
# Select the stat columns before aggregating so that text columns don't break mean().
pokemon_stats_by_generation = pokemon.groupby('Generation')[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].mean()
pokemon_stats_by_generation.head()

In [None]:
pokemon_stats_by_generation