1. Styling your plots
2. Subplotting
3. Plotting with seaborn

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Styling your plots
## Introduction
An important skill in plot styling is knowing how to look things up. Both seaborn and pandas are well covered on StackOverflow, so you can usually find what you need **by searching for "how to do X with Y" (replacing X with what you want to do, and Y with pandas or seaborn).** If you want to change your plot in some way not covered in this brief tutorial and don't already know which function does it, a search like this is usually the fastest way to find it.

In [None]:
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head(3)

## Bar chart of points

In [None]:
reviews['points'].value_counts().sort_index().plot.bar()

In [None]:
# figsize controls the size of the image, in inches. 
# It expects a tuple of (width, height) values.
reviews['points'].value_counts().sort_index().plot.bar(figsize=(12, 6))

In [None]:
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred'
)

In [None]:
# We can use fontsize to adjust the size of the tick labels
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)

In [None]:
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16,
    title='Rankings Given by Wine Magazine',
)

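pandas doesn't give us an easy way of adjusting the title size. However, the plot call returns a matplotlib Axes object, so we can set the title on that object directly.

**Anything that you build in pandas can be built using matplotlib directly. pandas merely makes it easier to get that work done.**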

In [None]:
import matplotlib.pyplot as plt

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)
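
To illustrate the claim that a pandas plot can be rebuilt in matplotlib, here is a minimal sketch of the same bar chart drawn without the pandas plotting methods. The small `counts` Series is a hypothetical stand-in for `reviews['points'].value_counts().sort_index()`, and using `tick_params` to mimic `fontsize=16` is one reasonable choice, not the only one.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for reviews['points'].value_counts().sort_index()
counts = pd.Series([120, 340, 210], index=[85, 86, 87])

# figsize plays the same role as the figsize argument in pandas
fig, ax = plt.subplots(figsize=(12, 6))

# Draw one bar per score; string labels give a categorical x axis
ax.bar(counts.index.astype(str), counts.values, color='mediumvioletred')

# Roughly what fontsize=16 does in the pandas call: enlarge the tick labels
ax.tick_params(labelsize=16)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)
```

The pandas version needs one call where this needs four, which is the convenience pandas adds on top of matplotlib.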

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)
sns.despine(bottom=True, left=True)  # remove the black border (the axis spines) around the plot

## Exercises

In [None]:
pokemon = pd.read_csv("../input/pokemon/Pokemon.csv")
pokemon.head(3)

In [None]:
pokemon.plot.scatter(
    x='Attack',
    y='Defense',
    figsize=(12,6),
    title='Pokemon by Attack and Defense'
)

In [None]:
pokemon.describe()

In [None]:
ax = pokemon['Total'].plot.hist(
    figsize=(12, 6),
    fontsize=14,
    bins=50,
    color='gray'
)
ax.set_title('Pokemon by Stat Total', fontsize=20)

In [None]:
ax = pokemon['Type 1'].value_counts().plot.bar(
    figsize=(12,6),
    fontsize=12,
)
ax.set_title('Pokemon by Primary Type',fontsize=20)

# Subplotting

In [None]:
import matplotlib.pyplot as plt
fig, axarr = plt.subplots(2, 1, figsize=(12, 8))

subplots returns two things: a figure (which we assigned to fig) and an array of the axes contained therein (which we assigned to axarr).

To tell pandas which subplot we want a new plot to go in (the first one or the second one), we need to grab the proper axis out of the array and pass it into pandas via the ax parameter:

In [None]:
fig, axarr = plt.subplots(2, 1, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0]
)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1]
)

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))
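
The shape of axarr depends on the grid requested, which determines how it is indexed. The sketch below, which uses only matplotlib, shows that a 2x1 grid gives a one-dimensional array while a 2x2 grid gives a two-dimensional array indexed by row and then column. The variable name `top_right` is illustrative.

```python
import matplotlib.pyplot as plt

# Two rows, one column: a 1-D array, indexed as axarr[0] and axarr[1]
fig, axarr = plt.subplots(2, 1, figsize=(12, 8))
print(axarr.shape)  # (2,)

# Two rows, two columns: a 2-D array, indexed as axarr[row][col]
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))
print(axarr.shape)  # (2, 2)

# Both indexing styles return the same Axes object
top_right = axarr[0][1]
same_axes = top_right is axarr[0, 1]
```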

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0][0]
)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1][1]
)

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0][0], fontsize=12, color='mediumvioletred'
)
axarr[0][0].set_title("Wine Scores", fontsize=18)

reviews['variety'].value_counts().head(20).plot.bar(
    ax=axarr[1][0], fontsize=12, color='mediumvioletred'
)
axarr[1][0].set_title("Wine Varieties", fontsize=18)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1][1], fontsize=12, color='mediumvioletred'
)
axarr[1][1].set_title("Wine Origins", fontsize=18)

# Plot the prices themselves; value_counts() would histogram the counts instead
reviews['price'].plot.hist(
    ax=axarr[0][1], fontsize=12, color='mediumvioletred'
)
axarr[0][1].set_title("Wine Prices", fontsize=18)

plt.subplots_adjust(hspace=.3)

import seaborn as sns
sns.despine()

## Exercises

In [None]:
pokemon.head(3)

In [None]:
fig,axarr = plt.subplots(2,1,figsize=(8, 8))

In [None]:
pokemon.describe()

In [None]:
pokemon.Attack.plot.hist()

In [None]:
fig,axarr = plt.subplots(2, 1, figsize=(8, 8))

pokemon.Attack.plot.hist(ax=axarr[0], title='Pokemon Attack Ratings')
pokemon['Defense'].plot.hist(ax=axarr[1], title='Pokemon Defense Ratings')

# Plotting with seaborn

seaborn is a data visualization library built on top of matplotlib that provides many valuable plot types in a single package. **For statistical plots, it is generally a much more powerful tool than the pandas plotting methods.**

In [None]:
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
import seaborn as sns

## Countplot
Unlike pandas, seaborn doesn't require us to shape the data for it via value_counts; the countplot (true to its name) aggregates the data for us!

In [None]:
# Pass the column as x by keyword, as recent seaborn versions require
sns.countplot(x=reviews['points'])

In [None]:
reviews['points'].value_counts().sort_index().plot.bar()
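
The two cells above should draw the same bars. As a small sketch of why, the example below uses a hypothetical six-value Series in place of `reviews['points']` and computes by hand the counts that countplot works out internally. It assumes a seaborn version that accepts the `x` keyword.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for reviews['points']
scores = pd.Series([85, 86, 86, 87, 87, 87], name="points")

# What pandas needs us to do ourselves before plotting
counts = scores.value_counts().sort_index()
print(counts.tolist())  # [1, 2, 3]

# seaborn takes the raw column and does the counting for us
ax = sns.countplot(x=scores)
```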

## KDE Plot
KDE, short for "kernel density estimate", is a statistical technique for **smoothing out data noise**. 

**It addresses an important weakness of a line chart: it buffs out outliers and "in-betweener" values that would cause a line chart to suddenly dip.**

In [None]:
sns.kdeplot(reviews.query('price < 200').price)

In [None]:
reviews[reviews['price'] < 200]['price'].value_counts().sort_index().plot.line()

A KDE plot is better than a line chart for getting the "true shape" of interval data. **In fact, I recommend always using it instead of a line chart for such data.**

However, it's a worse choice for ordinal categorical data. A KDE plot expects that if there are 200 wines rated 85 and 400 rated 86, then the values in between, like 85.5, should smooth out to somewhere in between (say, 300). **However, if the value in between can't occur (wine ratings of 85.5 are not allowed), then the KDE plot is fitting to something that doesn't exist.** In these cases, use a line chart instead.
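
To make the smoothing concrete, here is a minimal sketch of a one-dimensional Gaussian KDE written with NumPy. The function name `gaussian_kde_1d`, the bandwidth of 0.5, and the toy ratings are assumptions made for illustration; seaborn chooses its bandwidth automatically and its implementation differs in detail.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Toy data: 200 wines rated 85 and 400 rated 86, nothing in between
ratings = np.repeat([85, 86], [200, 400])
grid = np.array([85.0, 85.5, 86.0])

density = gaussian_kde_1d(ratings, grid, bandwidth=0.5)
print(density)
```

The estimate at 85.5 comes out strictly between the estimates at 85 and 86, even though no wine in the toy data has that rating. That is the "fitting to something that doesn't exist" problem described above.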

In [None]:
sample = reviews.loc[reviews['price'] < 200, ['price', 'points']].dropna().sample(5000)
sns.kdeplot(data=sample, x='price', y='points')  # giving both x and y requests a bivariate KDE

**Bivariate KDE plots like this one are a great alternative to scatter plots and hex plots.** They solve the same overplotting problem that scatter plots suffer from and that hex plots address, in a different but similarly visually appealing way.

However, note that bivariate **KDE plots are very computationally intensive**, which is why the cell above plots a sample of 5,000 rows rather than the full dataset.

## Distplot
The seaborn equivalent to a pandas histogram is the distplot. 

In [None]:
sns.distplot(reviews['points'], bins=10, kde=False)

In [None]:
sns.distplot(reviews['points'], bins=10, kde=True)

## Scatterplot and hexplot

In [None]:
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 100])

In [None]:
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 100], kind='hex', 
              gridsize=20)

## Boxplot and violin plot

In [None]:
df = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]
df.shape
df.variety.unique()

In [None]:
reviews.variety.value_counts().head(5).index

In [None]:
reviews.variety.isin(reviews.variety.value_counts().head(5).index)

In [None]:
sns.boxplot(
    x='variety',
    y='points',
    data=df
)

The center of each distribution shown above is the "box" in boxplot. The top of the box is the 75th percentile, while the bottom is the 25th percentile. In other words, half of the data is distributed within the box! The line inside the box is the median.

The other part of the plot, the "whiskers", shows the extent of the points beyond the center of the distribution. Individual circles beyond that are outliers.
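
The quantities a boxplot draws can be computed by hand. The sketch below uses a hypothetical array of ten scores and assumes the common 1.5 x IQR rule for outliers, which is the default `whis=1.5` setting in matplotlib and seaborn. The names `lower_fence` and `upper_fence` are illustrative.

```python
import numpy as np

# Hypothetical scores for a single wine variety
scores = np.array([80, 84, 85, 86, 87, 88, 89, 90, 91, 100])

# Bottom of the box, median line, top of the box
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, median, q3)  # 85.25 87.5 89.75
print(outliers)        # [100]
```

Each whisker ends at the most extreme data point that still lies inside the fences, so here the whiskers would reach 80 and 91.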

This boxplot shows us that although all five wines receive broadly similar ratings, Bordeaux-style wines tend to be rated a little higher than Chardonnay.

In [None]:
sns.violinplot(
    x='variety',
    y='points',
    data=reviews[reviews.variety.isin(reviews.variety.value_counts()[:5].index)]
)

A violinplot cleverly replaces the box in the boxplot with a kernel density estimate for the data. It shows basically the same data, but is harder to misinterpret and much prettier than the utilitarian boxplot.

## Why seaborn?

This data is in a "record-oriented" format. Each individual row is a single record (a review); in aggregate, the list of all rows is the list of all records (all reviews). This is the format of choice for most kinds of data: data corresponding with individual, unit-identifiable "things" ("records"). The majority of the simple data that gets generated is created in this format, and data that isn't can almost always be converted over. This is known as a "tidy data" format.

seaborn is designed to work with this kind of data out-of-the-box, for all of its plot types, with minimal fuss. This makes it an incredibly convenient workbench tool.

pandas is not designed this way. In pandas, every plot we generate is tied very directly to the input data. In essence, pandas expects your data to be in exactly the right output shape, regardless of what the input is.
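
As a small sketch of what "the right output shape" means, the example below starts from a hypothetical four-row table of records and reshapes it before pandas can draw one bar per variety. The column values are invented for illustration.

```python
import pandas as pd

# Hypothetical record-oriented data: one row per review
records = pd.DataFrame({
    'variety': ['Chardonnay', 'Pinot Noir', 'Chardonnay', 'Pinot Noir'],
    'points': [88, 91, 90, 93],
})

# pandas plots exactly what it is given, so we aggregate first:
# one value per variety, indexed by variety
mean_points = records.groupby('variety')['points'].mean()
print(mean_points.to_dict())  # {'Chardonnay': 89.0, 'Pinot Noir': 92.0}

# Only now does the data have the shape that plot.bar expects
ax = mean_points.plot.bar()
```

A seaborn call such as `sns.barplot(x='variety', y='points', data=records)` would take the unaggregated records directly and compute the per-variety mean itself.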

Hence, in practice, the pandas plotting tools are simple and well suited to the initial stages of exploratory data analysis, but seaborn becomes your tool of choice once you start doing more sophisticated explorations.

## Examples

In [None]:
pokemon.head()

In [None]:
sns.countplot(x=pokemon['Generation'])  # pass the column as x by keyword

In [None]:
sns.distplot(pokemon['HP'])

In [None]:
sns.jointplot(x='Attack', y='Defense', data=pokemon)

In [None]:
sns.jointplot(x='Attack', y='Defense', data=pokemon,kind='hex',gridsize=20)

In [None]:
sns.kdeplot(data=pokemon, x='HP', y='Attack')  # bivariate KDE of HP against Attack

In [None]:
sns.boxplot(
    x='Legendary',
    y='Attack',
    data=pokemon
)

In [None]:
sns.violinplot(
    x='Legendary',
    y='Attack',
    data=pokemon
)