## Exploratory Data Analysis in Python

Python has five main plotting tools:

* **Pandas** : for quick plotting on the fly with Pandas data frames and series
* **Matplotlib** : the primary visualization tool in Python
* **Seaborn** : wrapper for matplotlib, less flexible but often prettier than matplotlib
* **Bokeh** : Python library that allows for interactive plots in web browsers
* **Plotly** : another tool for generating interactive plots, usable from Python and other common languages

Plotting tools outside Python that are popular among data scientists include:

* **ggplot2** and **shiny** (R)
* **D3** (JavaScript)

The purpose of these notes is to demonstrate the functionality of the Python libraries. We begin by importing the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv("../data/playgolf.csv", delimiter='|')
df.rename(columns=lambda x: x.lower(), inplace=True)
print(df.head())

In [None]:
df.info()

In [None]:
df['date'] = pd.to_datetime(df['date'])
df['date']

In [None]:
df2 = df.set_index("date")
print(df2.head())

## Plotting with pandas

### What is the distribution of temperatures? (Univariate, Quantitative)

In [None]:
df['temperature'].hist(bins=10, figsize=(8,6));

### What is the change in temperature over time? (Univariate, Quantitative)

In [None]:
df.set_index("date",inplace=True)
print(df.head())

In [None]:
df['temperature'].plot(figsize=(15,6), marker='o')
plt.title("Temperature Over 2 Weeks", size=20)
plt.xlabel("Time", size=15)
plt.ylabel("Temperature", size=15)
plt.tick_params(labelsize=10);

In [None]:
df[['temperature','humidity']].plot(figsize=(15,6), marker='o')
plt.title("Temperature and Humidity Over 2 Weeks", size=20)
plt.xlabel("Time", size=15)
plt.ylabel("Degrees", size=15);

### How do the distributions of temperature and humidity differ? (Bivariate, 2 Quantitative)

In [None]:
df[['temperature','humidity']].plot(kind='box',sym='k.', showfliers=True, figsize=(8,6))
plt.title("Boxplots of Temperature and Humidity", size=20)
plt.ylabel("Degrees", size=15)
plt.xlabel("Weather Metrics", size=15)
plt.ylim([0,170]);

### What is the daily difference in temperature and humidity? (Bivariate, 2 Quantitative)

In [None]:
df[['temperature','humidity']].plot(kind='barh', figsize=(15,10))
plt.title("Comparison of Temperature and Humidity", size=20)
plt.xlabel("Degrees", size=15)
plt.ylabel("Time", size=15);

### What is the difference in mean temperature for different outlooks? (Bivariate, 1 Quantitative, 1 Categorical)

In [None]:
df2.head()

In [None]:
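# Select the numeric column explicitly; recent pandas raises on non-numeric columns in mean()
df2 = df.groupby("outlook")[['temperature']].mean()
# The outlook categories are already the index, so no x argument is needed
df2.plot(y='temperature', kind='barh', figsize=(8,6), legend=False)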
plt.title("Mean Temperature Differences For Different Outlooks", size=20)
plt.xlabel("Temperature", size=15)
plt.ylabel("Outlook", size=15);

### What is the count of times players play/don't play golf based on the outlook?

In [None]:
df.groupby(['outlook','result']).size().plot(kind='bar', figsize=(8,6))
plt.title("Counts by Outlook and Result", size=18)
plt.xlabel("Outlook and Result", size=15)
plt.ylabel("Count", size=15)
plt.xticks(rotation=0)
plt.tight_layout();

### What is the relationship between temperature and humidity? (Bivariate, 2 Quantitative)

In [None]:
df.plot(x='temperature',y='humidity', kind='scatter', figsize=(8,6))
plt.title("Scatterplot of Temperature and Humidity", size=18)
plt.xlabel("Temperature", size=15)
plt.ylabel("Humidity", size=15);

## Plot with matplotlib
- Both Pandas and Seaborn are wrappers on Matplotlib. The plotting functionality in those libraries is intended to make producing plots much more efficient. They do this, but at the expense of some flexibility and the potential for customization.
- Even if we only intend to use Matplotlib tools for plotting, we can improve the look of our plots with Seaborn. In Seaborn versions before 0.8, simply importing the library overwrote the Matplotlib defaults for plot appearance; in later versions the styling must be applied explicitly, for example with `sns.set()`. In our experience the Seaborn defaults are generally an improvement.
- When working with Pandas data structures (data frames, series) in Matplotlib, you will sometimes run into trouble. As a result, it is good practice to always extract the underlying array(s) using the `values` attribute, which is done throughout this demo.
- The convention is to import Matplotlib's plotting interface as follows: `import matplotlib.pyplot as plt`. As a result, you will almost always see Matplotlib functions called on `plt`, or else methods called on objects created through `plt`, such as axes. We will see examples of both below.
- Figures in Matplotlib often take several lines of code, especially as you become comfortable with the full range of its functionality. When you set out to make a plot, start simple and add features one at a time, checking after each step that the result is what you expect; otherwise the code can be difficult to debug.
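
The two calling styles mentioned above, and the habit of extracting arrays with `values`, can be sketched as follows. This is a minimal illustration rather than part of the original analysis: the `demo` data frame and its numbers are made up for the example and do not come from `playgolf.csv`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data (NOT the values in playgolf.csv)
demo = pd.DataFrame({
    "temperature": [85, 80, 83, 70, 68],
    "humidity": [85, 90, 78, 96, 80],
})

# Extract the underlying NumPy arrays from the pandas columns
x = demo["temperature"].values
y = demo["humidity"].values

# Style 1: pyplot functions act on the "current" figure and axes
plt.figure(figsize=(6, 4))
plt.scatter(x, y, color="k")
plt.xlabel("Temperature")
plt.ylabel("Humidity")
state_ax = plt.gca()  # handle to the axes that pyplot just drew on

# Style 2: methods called on explicit Figure and Axes objects
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, color="k")
ax.set_xlabel("Temperature")
ax.set_ylabel("Humidity")
```

Both styles produce the same scatterplot. The object-based style tends to scale better once a figure has several subplots, because each call names the axes it modifies.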

### What is the relationship between temperature, humidity, and the result?

In [None]:
play = df['result'] == 'Play'
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12, 6))
ax[0].scatter(df['temperature'].values, df['humidity'].values, color='k')
ax[0].set_xlabel('Temperature', fontsize=14)
ax[0].set_ylabel('Humidity', fontsize=14)
ax[1].scatter(df['temperature'][play].values, df['humidity'][play].values, 
              color='g', label='Play')
ax[1].scatter(df['temperature'][~play].values, df['humidity'][~play].values, 
              color='r', label="Don't Play")
ax[1].set_xlabel('Temperature', fontsize=14, labelpad=12)
ax[1].legend(loc='best', fontsize=14)
fig.suptitle('Temperature Against Humidity', fontsize=18, y=1.03)
fig.tight_layout(); 

In [None]:
crimes = pd.read_csv("../data/crime.csv")
print(crimes.head())

### What is the distribution of each crime?

In [None]:
column_names = crimes.columns[1:]
fig, axs = plt.subplots(4,2, figsize=(10,15))
for i,ax in enumerate(axs.reshape(-1)[:-1]):
    ax.set_title(column_names[i])
    ax.hist(crimes[column_names[i]].values)
fig.delaxes(axs.reshape(-1)[-1])
fig.suptitle("Crime Distributions", y=1.02, fontsize=20)
fig.text(-0.03, 0.5, 'Frequency', fontsize=18, va='center', rotation='vertical')
fig.tight_layout();

In [None]:
crimes.hist(figsize=(15,10));

### Scatterplot matrix of all crimes

In [None]:
# scatter_matrix lives in pandas.plotting; the top-level pd.scatter_matrix was removed
pd.plotting.scatter_matrix(crimes, figsize=(20,20));

## Plot with Seaborn

### Scatterplot matrix of golf features with result as third variable

In [None]:
sns.set_style('whitegrid')
# Seaborn 0.9 renamed the `size` argument to `height`
sns.pairplot(df, hue='result', height=4);

### Distribution plot of MVT (with kernel density estimate)

In [None]:
plt.figure(figsize=(8,6))
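# distplot is deprecated in recent Seaborn; histplot with kde=True draws a histogram plus KDE
mvt = sns.histplot(crimes['Motor Vehicle Theft'], kde=True)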

### Distributions of all crimes with boxplots

In [None]:
plt.figure(figsize=(8,6))
# Pass the frame by keyword so it is treated as wide-form data in all Seaborn versions
sns.boxplot(data=crimes, orient='h');

### Heatmap of correlations of all crimes

In [None]:
plt.figure(figsize=(12,10))
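# Correlate numeric columns only; recent pandas raises on non-numeric columns in corr()
sns.heatmap(crimes.select_dtypes('number').corr(), annot=True, linewidths=0.2, cmap='RdYlBu')
plt.tight_layout();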

## Plot with Bokeh

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import bokeh
print('my bokeh version {}'.format(bokeh.__version__))
output_notebook()

In [None]:
# Bokeh 3.x uses width/height (older releases used plot_width/plot_height)
p = figure(width=400, height=400, title="Scatterplot of Temperature and Humidity")
p.title.text_font_size = "15pt"

# add a circle renderer with a size, color, and alpha
p.circle(df['temperature'].values, df['humidity'].values, size=5, color="navy", alpha=0.9)
p.yaxis.axis_label = 'Humidity'
p.xaxis.axis_label = 'Temperature'
show(p)

In [None]:
# Bokeh 3.x uses width/height (older releases used plot_width/plot_height)
p2 = figure(width=600, height=400, x_axis_type="datetime")

# add a line renderer
p2.line(df.index.values, df['temperature'].values, line_width=2)
p2.yaxis.axis_label = 'Temperature'
p2.xaxis.axis_label = 'Time'
show(p2)

## Plot with Plotly

In [None]:
# ! pip install plotly --upgrade

from plotly import __version__
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import Scatter, Layout, XAxis, YAxis
print(__version__) # requires version >= 1.9.0
init_notebook_mode()

In [None]:
iplot({"data": [Scatter(x=df.index.date, 
                        y=df['temperature'].values)],
       "layout": Layout(title="Changes in Temperature Over Time",
                        xaxis=dict(title="Time"),
                        yaxis=dict(title="Temperature"))})

In [None]:
iplot({"data": [Scatter(x=df['temperature'].values, 
                        y=df['humidity'].values,
                       mode='markers')],
       "layout": Layout(title="Scatterplot of Temperature and Humidity",
                        xaxis=dict(title="Temperature"),
                        yaxis=dict(title="Humidity"))})