# Data Exploration

Visual data exploration is often a useful way to get an initial understanding of how values are distributed.

This notebook covers four basic types of plots:

- line plot
- scatter plot
- histogram
- boxplot

and a few more advanced plots:

- pie chart
- hexbin plot
- candlestick plot

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = np.random.normal(0, 0.01, 1000)
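
Because `np.random.normal` draws a fresh sample on every run, the plots in this notebook look slightly different each time. A minimal sketch of a reproducible alternative, assuming NumPy's `default_rng` generator is available (NumPy 1.17 or later). The seed `42` and the name `seeded_data` are arbitrary choices for illustration, not part of the original notebook:

```python
import numpy as np

# Hypothetical seed value; any integer works.
rng = np.random.default_rng(42)
seeded_data = rng.normal(loc=0, scale=0.01, size=1000)

# Re-creating the generator with the same seed reproduces the same sample.
same_again = np.random.default_rng(42).normal(loc=0, scale=0.01, size=1000)
print(np.array_equal(seeded_data, same_again))
```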

## Matplotlib plotting

In [None]:
plt.plot(data, 'o')

In [None]:
fig = plt.figure(figsize=(16,12))

ax = fig.add_subplot(2,2,1)
ax.plot(data)
ax.set_title('Line plot', size=24)

ax = fig.add_subplot(2,2,2)
ax.plot(data, 'o')
ax.set_title('Scatter plot', size=24)

ax = fig.add_subplot(2,2,3)
ax.hist(data, bins=50)
ax.set_title('Histogram', size=24)
ax.set_ylabel('count', size=16)

ax = fig.add_subplot(2,2,4)
ax.boxplot(data)
ax.set_title('Boxplot', size=24)
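
The histogram and the boxplot both summarize the same numbers, and it can help to look at those numbers directly. The following is a sketch rather than part of the original notebook: it uses a freshly generated sample (the seed and the variable names are assumptions) and computes the bin counts with `np.histogram` and the quartiles with `np.percentile`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for a repeatable sketch
sample = rng.normal(0, 0.01, 1000)

# Histogram: 50 equal-width bins spanning the data range.
counts, edges = np.histogram(sample, bins=50)

# Boxplot: the box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# With Matplotlib's default whis=1.5, points beyond these fences are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
n_outliers = int(np.sum((sample < lower_fence) | (sample > upper_fence)))

print(counts.sum(), len(edges))
print(q1, median, q3, n_outliers)
```

Fifty bins are delimited by fifty-one edges, and the counts add up to the sample size.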

## Pandas plotting

In [None]:
dataseries = pd.Series(data)

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(16,12))

dataseries.plot(ax=ax[0][0],
                title='Line plot')

dataseries.plot(ax=ax[0][1],
                style='o',
                title='Scatter plot')

dataseries.plot(ax=ax[1][0],
                kind='hist',
                bins=50,
                title='Histogram'
               )

dataseries.plot(ax=ax[1][1],
                kind='box',
                title='Boxplot'
               )
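
The pandas `plot` method draws on a Matplotlib `Axes` and returns it, so the usual Matplotlib methods can still be used to refine the result. A small sketch under that assumption, with a hypothetical `series` standing in for `dataseries`:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the notebook's dataseries.
series = pd.Series(np.random.normal(0, 0.01, 1000))

fig, ax = plt.subplots()
returned = series.plot(ax=ax, kind='hist', bins=50, title='Histogram')

# The returned object is a Matplotlib Axes, so labels can be set as usual.
returned.set_xlabel('value')
returned.set_ylabel('count')
```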

## Advanced plots

### Pie chart

In [None]:
categories = dataseries > 0.01

In [None]:
categories.head()

In [None]:
counts = categories.value_counts()
counts

In [None]:
counts.plot(kind='pie',
            figsize=(5, 5),
            explode=[0, 0.15],
            labels=['<= 0.01', '> 0.01'],
            autopct='%1.1f%%',
            shadow=True,
            startangle=90,
            fontsize=16)
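
`value_counts()` sorts its result by frequency, so the hard-coded `labels` list above is only correct while `False` is the more common category. One possible way to avoid that dependency is to derive each label from the index value it belongs to. This is a sketch with a hypothetical seeded `series`, not the notebook's own data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
series = pd.Series(rng.normal(0, 0.01, 1000))

counts = (series > 0.01).value_counts()

# Build each label from the index value it belongs to.
labels = ['> 0.01' if above else '<= 0.01' for above in counts.index]

# Fractions of the total, which is what autopct prints as percentages.
fractions = counts / counts.sum()
print(dict(zip(labels, fractions.round(3))))
```

The resulting list can then be passed as `labels=labels` to `counts.plot(kind='pie', ...)`.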

### Hexbin plot

Hexbin plots are useful for inspecting 2D distributions.

In [None]:
data = np.vstack([np.random.normal((0, 0), 2, size=(2000, 2)),
                  np.random.normal((9, 9), 3, size=(2000, 2))
                  ])

In [None]:
plt.hexbin(data[:,0], data[:,1])

In [None]:
pd.DataFrame(data).plot(kind='hexbin', x=0, y=1)
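
A hexbin plot is easier to read when the bin size is chosen explicitly and the color scale is shown. The sketch below assumes a seeded sample with the same two-cluster shape as above. The `gridsize` value, the colormap, and the variable names are illustrative choices, not settings from the original notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)  # arbitrary seed
points = np.vstack([rng.normal((0, 0), 2, size=(2000, 2)),
                    rng.normal((9, 9), 3, size=(2000, 2))])

fig, ax = plt.subplots(figsize=(6, 5))

# gridsize sets the number of hexagons along the x direction (30 is arbitrary).
hb = ax.hexbin(points[:, 0], points[:, 1], gridsize=30, cmap='viridis')

# The colorbar maps each color to the number of points in a hexagon.
fig.colorbar(hb, ax=ax, label='points per hexagon')
ax.set_xlabel('x')
ax.set_ylabel('y')
```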

## Interactive notebook plotting

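Jupyter offers interactive plotting through the magic command `%matplotlib notebook`. This backend targets the classic Notebook interface; JupyterLab and newer Notebook releases may instead need `%matplotlib widget` from the `ipympl` package.

If no plot appears, run the next cell again.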

In [None]:
%matplotlib notebook

In [None]:
_ = plt.plot(data[:, 0])

In [None]:
%matplotlib inline

## Exercises

### Exercise 1
- load the dataset: ../data/international-airline-passengers.csv
- inspect it using the .info() and .head() commands
- use the function pd.to_datetime() to change the column type of 'Month' to a datetime type
- set the index of df to be a datetime index using the column 'Month' and the df.set_index() method
- choose the appropriate plot and display the data
- choose an appropriate scale
- label the axes
- discuss with your neighbor

In [None]:
df = pd.read_csv('../data/international-airline-passengers.csv')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['Month'] = pd.to_datetime(df['Month'])
df = df.set_index('Month')

In [None]:
df.head()

In [None]:
df.plot()
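
The conversion and indexing steps can also be folded into the load itself through the `parse_dates` and `index_col` arguments of `pd.read_csv`. The sketch below reads a tiny in-memory CSV instead of the real file. Its rows and the `Passengers` column name are made up for illustration and may not match the actual dataset:

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the real column names may differ.
csv_text = """Month,Passengers
1949-01,100
1949-02,110
1949-03,120
"""

# parse_dates converts the column to datetimes, index_col makes it the index.
passengers = pd.read_csv(io.StringIO(csv_text),
                         parse_dates=['Month'],
                         index_col='Month')
print(passengers.index.dtype)
```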

### Exercise 2
- load the dataset: ../data/weight-height.csv
- inspect it
- plot it using a scatter plot with Weight as a function of Height
- plot the male and female populations with 2 different colors on a new scatter plot
- remember to label the axes
- discuss


In [None]:
df = pd.read_csv('../data/weight-height.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Gender'].value_counts()

In [None]:
_ = df.plot(kind='scatter', x='Height', y='Weight')

In [None]:
males = df[df['Gender'] == 'Male']
females = df.query('Gender == "Female"')
fig, ax = plt.subplots()

males.plot(kind='scatter', x='Height', y='Weight',
           ax=ax, color='blue', alpha=0.3,
           title='Male & Female Populations')

females.plot(kind='scatter', x='Height', y='Weight',
             ax=ax, color='red', alpha=0.3)
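
When there are more than two groups, repeating one `plot` call per group gets tedious. One alternative is to loop over `groupby`, as in the sketch below. The four rows are invented stand-ins with the same column names as the exercise data, and the `colors` mapping is an assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in rows with the same columns as weight-height.csv.
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Height': [70.1, 68.4, 63.2, 64.9],
    'Weight': [185.0, 172.5, 130.3, 141.8],
})
colors = {'Male': 'blue', 'Female': 'red'}

fig, ax = plt.subplots()

# One scatter layer per group, all drawn on the same Axes.
for gender, group in sample.groupby('Gender'):
    group.plot(kind='scatter', x='Height', y='Weight',
               ax=ax, color=colors[gender], alpha=0.3, label=gender)

ax.set_xlabel('Height')
ax.set_ylabel('Weight')
ax.set_title('Male & Female Populations')
```

Passing `label=gender` also gives each group an entry in the legend.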

### Exercise 3
- plot the histogram of the heights for males and for females on the same plot
- use alpha to control transparency in the plot command
- plot a vertical line at the mean of each population using plt.axvline()


In [None]:
males['Height'].plot(kind='hist',
                     bins=50,
                     range=(50, 80),
                     alpha=0.3,
                     color='blue')

females['Height'].plot(kind='hist',
                       bins=50,
                       range=(50, 80),
                       alpha=0.3,
                       color='red')

plt.title('Height distribution')
plt.legend(["Males", "Females"])
plt.xlabel("Height (in)")


plt.axvline(males['Height'].mean(), color='blue', linewidth=2)
plt.axvline(females['Height'].mean(), color='red', linewidth=2)
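
The two means marked by the vertical lines can also be computed in a single step with `groupby`. A minimal sketch on invented stand-in rows (the values are hypothetical, chosen so the means are easy to check by hand):

```python
import pandas as pd

# Hypothetical stand-in rows with the same columns as the exercise data.
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Height': [70.0, 68.0, 63.0, 65.0],
})

# One mean height per gender, indexed by the gender label.
mean_heights = sample.groupby('Gender')['Height'].mean()
print(mean_heights)
```

Each entry, for example `mean_heights['Male']`, could then be passed to `plt.axvline()`.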

*Copyright &copy; 2017 CATALIT LLC.  All rights reserved.*