<img src="./images/logo-iug@2x.png" alt="IUG" style="width:300px;"/>

# TEC 640
**Learning Lab #3**: Data Exploration with Python by Dr. N. Tsourakis

[ntsourakis@iun.ch](mailto:ntsourakis@iun.ch)

## Exploratory Data Analysis

`Exploratory Data Analysis (EDA)` helps us understand the main characteristics of a dataset before committing to any particular solution. Visual methods are the most commonly used tools for this, and we explore a few of them throughout the notebook.

## Visualization with Seaborn

[Seaborn](http://seaborn.pydata.org/) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics, and integrates with the functionality provided by Pandas ``DataFrame``s.

Check this gallery with different examples: [Seaborn gallery](http://seaborn.pydata.org/examples/index.html).

### Seaborn examples

We are going to create some typical plots with Seaborn using various datasets.

In [None]:
# Import the necessary libraries.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

Now we create some random walk data:

In [None]:
# Create some data
rng = np.random.RandomState(0)

# 500 evenly spaced numbers over the interval [0, 10]: x has shape (500,)
x = np.linspace(0, 10, 500)

# Cumulative sum of random steps down the rows (axis 0): y has shape (500, 6),
# i.e. six independent random walks of 500 steps each
y = np.cumsum(rng.randn(500, 6), 0)

y
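To see why this counts as random walk data, the following minimal sketch rebuilds the array in two steps and checks its structure. The names `steps` and `walks` are illustrative and do not appear in the original notebook.

```python
import numpy as np

rng = np.random.RandomState(0)

# Each row holds one random step for each of the six walks.
steps = rng.randn(500, 6)

# Summing the steps down the rows turns them into positions over time.
walks = np.cumsum(steps, axis=0)

print(walks.shape)

# The first position equals the first step ...
print(np.allclose(walks[0], steps[0]))

# ... and differencing consecutive positions recovers the remaining steps.
print(np.allclose(np.diff(walks, axis=0), steps[1:]))
```

Each column of `walks` is one line in the plot that follows.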

We can set the style by calling Seaborn's ``set()`` method.
By convention, Seaborn is imported as ``sns``:

In [None]:
import seaborn as sns
sns.set()

Now create the line plots.

In [None]:
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

# Try to change the legend of the lines.

Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables.

In [None]:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])

for col in 'xy':
    plt.hist(data[col], alpha=0.5)

# Try to change the alpha value and see what happens.
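Before plotting, it can help to confirm that the sample roughly matches the parameters passed to `multivariate_normal`. The sketch below assumes a seeded generator (`np.random.default_rng(0)`) so the result is reproducible; the names `target_cov`, `sample` and `sample_cov` are illustrative.

```python
import numpy as np
import pandas as pd

mean = [0, 0]
target_cov = np.array([[5, 2], [2, 2]])

# Seeded generator (an assumption; the cell above draws without a seed).
gen = np.random.default_rng(0)
sample = pd.DataFrame(
    gen.multivariate_normal(mean, target_cov, size=2000),
    columns=["x", "y"],
)

# The sample covariance should be close to target_cov, but not identical.
sample_cov = sample.cov()
print(sample_cov.round(2))
print(sample.mean().round(2))
```

The larger variance of `x` (5 versus 2) is why its histogram is the wider of the two.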

We can get a smooth estimate of the distribution using kernel density estimation (KDE), which Seaborn does with ``sns.kdeplot``:

In [None]:
for col in 'xy':
    # `fill` replaces the deprecated `shade` argument in current Seaborn releases.
    sns.kdeplot(data[col], fill=False)

# Change the value of fill from 'False' to 'True'.
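As a rough illustration of what a kernel density estimate computes, the sketch below places a Gaussian bump on every sample and averages them. The helper `gaussian_kde_1d` is hypothetical, and the Scott-style bandwidth is an assumption made for this example; it is not a reimplementation of `sns.kdeplot`.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


gen = np.random.default_rng(0)
samples = gen.normal(0.0, 1.0, size=500)

# Scott-style rule of thumb: sample standard deviation times n^(-1/5).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density curve should enclose an area of about 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve, a larger one a smoother curve.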

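We can combine histograms and KDE in a single plot. Older Seaborn releases did this with ``distplot``, which has since been deprecated; ``histplot`` with ``kde=True`` is its replacement:

In [None]:
sns.histplot(data['x'], kde=True, stat='density')
sns.histplot(data['y'], kde=True, stat='density');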

A ``jointplot`` is used to quickly visualize and analyze the relationship between two variables and describe their individual distributions on the same plot.

In [None]:
with sns.axes_style('white'):
    # Recent Seaborn releases expect x, y and data as keyword arguments.
    sns.jointplot(data=data, x="x", y="y", kind='hex')

# Change kind from 'hex' to 'scatter'.

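A ``pairplot`` plots every pair of variables in a dataset against each other to reveal the relationships between them. The variables can be continuous or categorical; a categorical one is typically mapped to color through ``hue``.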

We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:

In [None]:
iris = sns.load_dataset("iris")
iris.head()

We can now proceed with the plot.

In [None]:
# `height` replaces the older `size` argument.
sns.pairplot(iris, hue='species', height=2.5);

# Change the height to '4.0'.

Sometimes the best way to view data is via histograms of subsets. Seaborn's ``FacetGrid`` makes this extremely simple.
We'll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data:

In [None]:
tips = sns.load_dataset('tips')
tips.head()

In [None]:
# Calculate the tip percentage.
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
tips.head()
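The same calculation can be followed by hand on a tiny table. The rows below are hypothetical values made up for illustration (they are not taken from the `tips` dataset), and `toy_tips` and `mean_pct` are illustrative names.

```python
import pandas as pd

# Hypothetical rows, chosen so the arithmetic is easy to check by hand.
toy_tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [2.0, 3.0, 4.0, 10.0],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
})

# Same vectorised formula as above: tip as a percentage of the bill.
toy_tips["tip_pct"] = 100 * toy_tips["tip"] / toy_tips["total_bill"]

# Summarise each subset; a faceted plot shows full distributions instead.
mean_pct = toy_tips.groupby("time")["tip_pct"].mean()
print(toy_tips)
print(mean_pct)
```

Here the lunch rows average 17.5% and the dinner rows 15%.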


In [None]:
# Now show the plot.
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));

# Change the row from 'sex' to 'smoker'.
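The `bins` argument above is an array of bin edges, not a bin count. A short sketch with hypothetical percentages (the `tip_pct` values below are made up for illustration) shows what `np.linspace(0, 40, 15)` produces:

```python
import numpy as np

# 15 edges between 0 and 40 define 14 equal-width bins.
bins = np.linspace(0, 40, 15)
widths = np.diff(bins)

# Hypothetical tip percentages, all inside the [0, 40] range.
tip_pct = np.array([5.0, 12.0, 15.5, 16.0, 18.0, 22.0, 35.0])

# np.histogram counts how many values fall into each bin.
counts, edges = np.histogram(tip_pct, bins=bins)

print(len(bins), len(counts))
print(widths[0])
print(counts)
```

Because every panel of the grid receives the same edges, the histograms are directly comparable across panels.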

Factor plots, called categorical plots (``catplot``) in current Seaborn releases, can be useful for this kind of visualization as well. They let you view the distribution of one variable within the bins defined by another:

In [None]:
with sns.axes_style(style='ticks'):
    # `catplot` replaces `factorplot`, which was renamed and later removed.
    g = sns.catplot(data=tips, x="day", y="total_bill", hue="sex", kind="box")
    g.set_axis_labels("Day", "Total Bill");

# Change the kind from 'box' to 'violin'.
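A box plot is a drawing of a few summary statistics. The sketch below computes them for a small hypothetical array (`bills` is made up for illustration), assuming the common convention that whiskers reach at most 1.5 times the interquartile range beyond the box:

```python
import numpy as np

# Hypothetical values with one obvious extreme point.
bills = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3)
print(lower_fence, upper_fence)
print(outliers)
```

With these numbers the value 20 lies above the upper fence of 8.5, so it would appear as a separate point.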

Similar to the pairplot we saw earlier, we can use ``sns.jointplot`` to show the joint distribution of two variables, along with their marginal distributions:

In [None]:
with sns.axes_style('white'):
    sns.jointplot(data=tips, x="total_bill", y="tip", kind='hex')

The joint plot can even do some automatic kernel density estimation and regression:

In [None]:
sns.jointplot(data=tips, x="total_bill", y="tip", kind='reg');
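The regression line in such a plot is a straight-line fit of `tip` against `total_bill`. As a minimal sketch of that idea, the example below fits a line with `np.polyfit` to hypothetical, noise-free data built from a known slope and intercept; it illustrates least-squares fitting in general and is not how Seaborn computes its fit internally.

```python
import numpy as np

# Hypothetical bills, with tips generated from a known line: 1.00 + 15% of the bill.
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = 1.0 + 0.15 * total_bill

# Least-squares fit of a degree-1 polynomial: returns slope, then intercept.
slope, intercept = np.polyfit(total_bill, tip, deg=1)

print(round(slope, 3), round(intercept, 3))
```

On real data the points scatter around the line, which is why the plot also draws a shaded uncertainty band.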

Counts over time can be plotted using ``sns.catplot`` (formerly ``sns.factorplot``). In the following example, we'll use the Planets data.

In [None]:
planets = sns.load_dataset('planets')
planets.head()

In [None]:
with sns.axes_style('white'):
    # kind="count" draws one bar per year with the number of rows for that year.
    g = sns.catplot(data=planets, x="year", aspect=2,
                    kind="count", color='steelblue')
    g.set_xticklabels(step=5)

We can learn more by looking at the *method* of discovery of each of these planets:

In [None]:
with sns.axes_style('white'):
    # `catplot` replaces `factorplot`; `order` restricts the x axis to 2001-2014.
    g = sns.catplot(data=planets, x="year", aspect=4.0, kind='count',
                    hue='method', order=range(2001, 2015))
    g.set_ylabels('Number of Planets Discovered')
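The bars in a count plot with a `hue` are a cross-tabulation of two columns. The sketch below builds that table for a few hypothetical rows (`toy_planets` is made up for illustration and is not a sample of the real dataset):

```python
import pandas as pd

# Hypothetical discoveries: one row per planet.
toy_planets = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010, 2010],
    "method": ["Transit", "Radial Velocity", "Transit", "Transit", "Radial Velocity"],
})

# Rows are years, columns are methods, cells are the number of planets.
counts = pd.crosstab(toy_planets["year"], toy_planets["method"])
print(counts)
```

Each cell of `counts` corresponds to the height of one bar in the grouped count plot.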