# Exploratory Data Analysis in Python

## What we'll cover

* Visualization
  * Types of visualizations
  * Tools
  * Time Series data
* Data delivery


### A Little Data Clean-up

In [None]:
# import libraries
import pandas as pd

In [None]:
# load the data we collected and processed earlier into a dataframe
census_ky = pd.read_csv('data/census_2010_ky.csv')
census_ky.head(5)

In [None]:
# use apply with a lambda to strip the trailing 17 characters from each label
census_ky['label'] = census_ky.label.apply(lambda x: x[:-17])
census_ky.head(5)
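
The same clean-up can be done without `apply`, using the vectorized `.str` accessor in pandas. This is a minimal sketch, assuming the labels end in a fixed 17-character suffix such as ` County, Kentucky`. The toy frame and its values are made up for illustration and are not the census data.

```python
import pandas as pd

# hypothetical stand-in for census_ky; the real label format is an assumption
toy = pd.DataFrame({'label': ['Adair County, Kentucky', 'Allen County, Kentucky']})

suffix = ' County, Kentucky'  # 17 characters, matching the [:-17] slice

# positional slice: same result as the lambda, but vectorized
by_slice = toy['label'].str[:-len(suffix)]

# explicit alternative: only strips rows that really end with the suffix
by_replace = toy['label'].str.replace(suffix + '$', '', regex=True)

print(by_slice.tolist())
```

The regex version is a little safer if some labels might not carry the suffix, since a blind slice would cut 17 characters from those rows anyway.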

In [None]:
# let's find the top ten counties for single moms

single_moms = census_ky.nlargest(10, columns=['sinmoms'])
# single_moms

In [None]:
# now let's do the same for single dads

single_dads = census_ky.nlargest(10, columns=['sindads'])
# single_dads

In [None]:
# and now for total population

total_population = census_ky.nlargest(10, columns=['totpop'])
# total_population

In [None]:
# and finally median age largest and smallest

median_age_large = census_ky.nlargest(10, columns=['medage'])
median_age_small = census_ky.nsmallest(10, columns=['medage'])
# median_age_small
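
For readers new to `nlargest` and `nsmallest`: they return the rows with the highest or lowest values in a column, already ordered. The sketch below uses a made-up four-row frame (not census values) to show that `nlargest` gives the same rows as sorting in descending order and taking the head.

```python
import pandas as pd

# made-up numbers for illustration only; not real census values
toy = pd.DataFrame({
    'label': ['A', 'B', 'C', 'D'],
    'medage': [41.2, 35.0, 38.7, 44.1],
})

# two highest median ages, highest first
top2 = toy.nlargest(2, columns='medage')

# the long way round: sort descending, then take the first two rows
top2_sorted = toy.sort_values('medage', ascending=False).head(2)

# two lowest median ages, lowest first
bottom2 = toy.nsmallest(2, columns='medage')

print(top2['label'].tolist())
print(bottom2['label'].tolist())
```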

<hr>

## Data Visualization Libraries

<hr>

### Matplotlib

In [None]:
# import pyplot from matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure(figsize=(12, 8))  # setup dimension of plot

# use a bar graph to plot single moms and dads per county
# labels will show in legend
plt.bar(single_moms.label, single_moms.sinmoms, label='single moms')
plt.bar(single_dads.label, single_dads.sindads, label='single dads')
# label your axes
plt.xlabel('Counties in Kentucky')
plt.ylabel('Number of Single Parents')
# add a title
plt.title("Counties with Highest No. of Single Moms & Dads")
# create the legend on main figure
plt.legend()
# save the image as SVG
# plt.savefig('data/matplotlib_bar.svg')
# plot the graph
plt.show()
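
Because both `plt.bar` calls above share the same x positions, the bars for a county that appears in both top-ten lists are drawn on top of each other. One common alternative is a grouped bar chart, where each series is shifted by half a bar width. The sketch below uses made-up counties and counts, not the census data.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up values for illustration; not real census counts
counties = ['A', 'B', 'C']
moms = [300, 220, 150]
dads = [90, 70, 40]

positions = np.arange(len(counties))  # one slot per county
width = 0.4                           # width of each bar within a slot

fig, ax = plt.subplots(figsize=(12, 8))
# shift each series half a bar left or right so the bars sit side by side
ax.bar(positions - width / 2, moms, width, label='single moms')
ax.bar(positions + width / 2, dads, width, label='single dads')
# put the county names back on the slot centres
ax.set_xticks(positions)
ax.set_xticklabels(counties)
ax.set_xlabel('County')
ax.set_ylabel('Number of Single Parents')
ax.legend()
```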

### Pandas Plot

In [None]:
# create a bar graph; DataFrame.plot returns the matplotlib Axes
bar_graph = total_population.plot(  # use the dataframe plot
    kind='bar',  # state the type of viz
    figsize=(12,8),  # set the figure dimension
    legend=False,  # turn off legend
    title='Biggest Counties in Kentucky - 2010',  # set title
    x='label',  # set x axis
    y='totpop'  # set y axis
)
bar_graph.set(xlabel='Counties in Kentucky', ylabel='Population Count')  # set x and y labels
plt.savefig('data/pd_bar.svg')  # save plot to file

### Seaborn

In [None]:
# import seaborn
import seaborn as sns

In [None]:
# plot single dads by median age
sns_scatter = sns.relplot(x="medage", y="sindads", data=census_ky)
sns_scatter.savefig('data/sns_scatter.svg')

In [None]:
# see the relationship between pairs of variables in your data
sns_all = sns.pairplot(census_ky)

In [None]:
# examine the same thing, but this time using KDE plots 
sns_grid = sns.PairGrid(census_ky)
sns_grid.map_diag(sns.kdeplot)
sns_grid.map_offdiag(sns.kdeplot, n_levels=6)

In [None]:
# we can examine linear relationships
sns.lmplot(x='sinmoms', y='sindads', data=census_ky);

### Bokeh

In [None]:
from bokeh.plotting import figure, output_notebook, show  # for standard .py, use output_file

In [None]:
# this example is from the Bokeh docs
# prepare some data
x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y0 = [i**2 for i in x]
y1 = [10**i for i in x]
y2 = [10**(i**2) for i in x]

In [None]:
# output to notebook
output_notebook()

In [None]:
# create a new plot
p = figure(
   tools="pan,box_zoom,reset,save",
   y_axis_type="log", y_range=[0.001, 10**11], title="log axis example",
   x_axis_label='sections', y_axis_label='particles'
)

In [None]:
# add some renderers
# legend_label replaces the deprecated legend keyword (Bokeh 1.4+)
p.line(x, x, legend_label="y=x")
p.circle(x, x, legend_label="y=x", fill_color="white", size=8)
p.line(x, y0, legend_label="y=x^2", line_width=3)
p.line(x, y1, legend_label="y=10^x", line_color="red")
p.circle(x, y1, legend_label="y=10^x", fill_color="red", line_color="red", size=6)
p.line(x, y2, legend_label="y=10^x^2", line_color="orange", line_dash="4 4")

In [None]:
# show the results
show(p)

### Altair

In [None]:
# this example is from the altair docs
import altair as alt
from vega_datasets import data

# for the notebook only (not for JupyterLab) run this command once per session
# alt.renderers.enable('notebook')

In [None]:
iris = data.iris()

alt.Chart(iris).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species'
)

### Wordclouds

Many tools exist for examining qualitative text visually; the word cloud is perhaps the simplest.

In [None]:
from wordcloud import WordCloud, STOPWORDS
import numpy as np
from PIL import Image

In [None]:
# Read the whole text.
with open('data/text.txt') as f:
    text = f.read()
# use a mask if you wish
# mask = np.array(Image.open("mask.png"))

stopwords = set(STOPWORDS)

# Generate a word cloud image
# can use kwarg mask=mask if you used a mask above
# (contour_width and contour_color only take effect when a mask is supplied)
wordcloud = WordCloud(
    background_color="black",
    stopwords=stopwords,
    contour_width=3,
    contour_color='white',
).generate(text)
wordcloud.to_file('output_image.jpg')
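
A word cloud draws each word at a size that reflects how often it occurs once stopwords are removed. To see the numbers behind the picture, a rough sketch with the standard library follows. The sample text and the stopword set are made up, and the tokenization is simpler than what the `wordcloud` package does internally.

```python
import re
from collections import Counter

# tiny made-up text and stopword list, for illustration only
sample = "Data tells a story. Good data tells a clear story, and clear data wins."
stop = {'a', 'and', 'the'}

# lowercase the text and pull out runs of letters and apostrophes
words = re.findall(r"[a-z']+", sample.lower())

# count every word that is not a stopword
counts = Counter(w for w in words if w not in stop)

print(counts.most_common(3))
```

The most frequent words in this tally are the ones that would be drawn largest in the cloud.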

<hr>

## Time Series Data

<hr>

Time series analysis is a broad topic, so here we cover only the basics of plotting time-indexed data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# import our price data and assign column names to the dataframe
# data was obtained from blockchain.com Bitcoin market data
# the data has no header and the index should be the date info which needs to be parsed
market_prices = pd.read_csv('data/blockchain_market_price.csv', header=None, index_col=0, parse_dates=True)
header = ['price']  # list of our headers, in this case just price
market_prices.columns = header  # assign the headers
market_prices.index.name = 'date'  # rename our index
market_prices.head()

In [None]:
# plot all the price data
market_prices['price'].plot(
    figsize=(12,8),  # set the figure dimension
    legend=False,  # turn off legend
    title='Price of Bitcoin - Nov 2017 to Oct 2018',  # set title
)

In [None]:
# plot data from Jan through Mar 2018, during the Bitcoin crash
# (partial-string slicing includes all of the end month)
market_prices.loc['2018-01':'2018-03'].plot(
    figsize=(12,8),  # set the figure dimension
    legend=False,  # turn off legend
    title='Price of Bitcoin - Jan to Mar 2018',  # set title
)
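
Slicing a `DatetimeIndex` with partial date strings, as in the cell above, is inclusive on both ends: `'2018-03'` as the stop value keeps every row in March. The sketch below checks this on a synthetic daily series; the dates match the range in the plot title, but the values are just a counter, not real prices.

```python
import pandas as pd

# synthetic daily series standing in for the price data (values are just a counter)
dates = pd.date_range('2017-11-01', '2018-10-31', freq='D')
toy_prices = pd.DataFrame({'price': range(len(dates))}, index=dates)

# month-level strings select whole months, including the last one
q1 = toy_prices.loc['2018-01':'2018-03']

print(q1.index.min().date(), q1.index.max().date())
```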

<hr>

## Resources

<hr>

**Note:** Many of these open-source materials are provided by people who develop them for a living. Please consider sending them a thank-you and, if you can, a few bucks to support their efforts. Thanks! :)

* [matplotlib](https://matplotlib.org/index.html)
* [seaborn](https://seaborn.pydata.org/tutorial.html)
* [bokeh](https://bokeh.pydata.org/en/latest/)
* [altair](https://altair-viz.github.io/)
* [wordcloud](https://amueller.github.io/word_cloud/)