Data Discovery with Visualization
==================

In this notebook we explore how to use Python to discover descriptive attributes of several datasets.

There are thirteen data files in the sub-directory `dat/`, each named following the pattern `dat/dataset_k{:02d}.tsv`. These tab-separated value files each contain two columns of floats, `x` and `y`.
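
As a minimal sketch of the naming scheme, the snippet below writes a few small files following the `dataset_k{:02d}.tsv` pattern into a temporary directory and reads them back. The temporary directory, the file count of three, the zero-based index, and the helper name `load_datasets` are assumptions made for illustration. Because the index is zero-padded, a plain `sorted` keeps the files in numeric order.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_datasets(directory):
    """Read every dataset_k*.tsv file in `directory`, ordered by file name."""
    paths = sorted(Path(directory).glob('dataset_k*.tsv'))
    return [pd.read_csv(p, sep='\t') for p in paths]


rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    # Write three illustrative files with float columns `x` and `y`
    for k in range(3):
        frame = pd.DataFrame({'x': rng.normal(size=5), 'y': rng.normal(size=5)})
        frame.to_csv(Path(tmp) / 'dataset_k{:02d}.tsv'.format(k), sep='\t', index=False)
    demo_dfs = load_datasets(tmp)

print(len(demo_dfs), list(demo_dfs[0].columns))
```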

We will first explore the data by computing summary statistics.

- - -

In [None]:
# Import required modules
import glob

import natsort
import numpy as np
import pandas as pd

# Generate a naturally sorted list of `tsv` files;
# `[kK]` accepts either case of the basename prefix
F = natsort.realsorted(glob.glob('dat/dataset_[kK]*.tsv'))
# Generate a list of DataFrames with list index -> dataset index
dfs = [pd.read_csv(f, sep='\t') for f in F]

In [None]:
dfs[0]

In [None]:
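# Summary statistics for one dataset, ordered to match `cols` below
def gen_summary_statistics(df):
    """Return mean, std, skewness, and kurtosis of each column, then the x-y correlation."""
    methods = [
        df.mean,
        df.std,
        df.skew,
        df.kurt,
    ]
    return np.r_[
        np.array([method().values for method in methods]).flatten(),
        df.corr().values[1, 0],
    ]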

# Column names of generated summary stats
cols = [
    'x_mean','y_mean',
    'x_std', 'y_std',
    'x_skew','y_skew',
    'x_kurt','y_kurt',
    'corr',
]

In [None]:
# Loop through DataFrames and generate summary statistics DataFrame
sdf = pd.DataFrame([ gen_summary_statistics(df) for df in dfs ], columns=cols)
sdf
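
The same row of statistics can also be sketched with `DataFrame.agg`, shown here on synthetic data so that the cell runs without the files in `dat/`. The random data and the names `demo`, `stats`, and `row` are illustrative assumptions. Note that pandas' `std` divides by n - 1 by default, whereas `numpy.std` divides by n unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame({'x': rng.normal(size=100), 'y': rng.normal(size=100)})

# Rows are statistics, columns are x and y
stats = demo.agg(['mean', 'std', 'skew', 'kurt'])

# Row-major flattening gives x_mean, y_mean, x_std, y_std, ...
row = np.r_[stats.to_numpy().ravel(), demo['x'].corr(demo['y'])]

# pandas uses ddof=1 for std; numpy needs it spelled out to agree
same_std = np.isclose(stats.loc['std', 'x'], np.std(demo['x'].to_numpy(), ddof=1))
print(row.shape, same_std)
```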

In [None]:
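# Set up plotting
import matplotlib.pyplot as plt
from plt_style import *  # local style module (assumed to define `gld`, used below)
def mk_grids():
    """Draw dashed major and dotted minor grid lines on the current axes."""
    plt.grid(which='major', linestyle='--', color='#333333')
    plt.grid(which='minor', linestyle=':', color='#555555')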

In [None]:
plt.plot(sdf.x_mean,sdf.y_mean,'o',mec='k',mfc=gld,ms=8); mk_grids()
plt.xlabel(r'$\langle x \rangle$')
plt.ylabel(r'$\langle y \rangle$')
plt.title('Mean Comparison')

In [None]:
plt.plot(sdf.x_std,sdf.y_std,'o',mec='k',mfc=gld,ms=8); mk_grids()
plt.xlabel(r'$\sigma(x)$')
plt.ylabel(r'$\sigma(y)$')
plt.title('Standard Dev. Comparison')

In [None]:
plt.plot(sdf.x_mean/sdf.x_std,sdf.y_mean/sdf.y_std,'o',mec='k',mfc=gld,ms=8); mk_grids()
plt.xlabel(r'SNR$(x)$')
plt.ylabel(r'SNR$(y)$')
plt.title('Signal to Noise Comparison')

In [None]:
plt.plot(sdf['corr'],'o',mec='k',mfc=gld,ms=8); mk_grids()
plt.xlabel(r'$k$')
plt.ylabel(r'$\rho(xy)$')
plt.title('Correlation Comparison')

In [None]:
plt.plot(sdf['x_skew'],sdf['y_skew'],'o',mec='k',mfc=gld,ms=8); mk_grids()
plt.xlabel(r'$\gamma(x)$')
plt.ylabel(r'$\gamma(y)$')
plt.title('Skewness Comparison')

In [None]:
plt.plot(sdf['x_kurt'],sdf['y_kurt'],'o',mec='k',mfc=gld,ms=8); mk_grids()
plt.xlabel(r'$\kappa(x)$')
plt.ylabel(r'$\kappa(y)$')
plt.title('Kurtosis Comparison')
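
For reading the last two plots, it may help to recall that pandas' `skew` and `kurt` return bias-corrected sample skewness and excess kurtosis, so both are close to zero for normally distributed data. The sketch below checks this on synthetic samples. The sample size, seed, and choice of distributions are illustrative assumptions and are not part of the datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200_000
samples = pd.DataFrame({
    'normal': rng.normal(size=n),            # symmetric reference case
    'exponential': rng.exponential(size=n),  # right-skewed, heavier right tail
})

skewness = samples.skew()
excess_kurtosis = samples.kurt()
print(pd.DataFrame({'skew': skewness, 'kurt': excess_kurtosis}))
```

The normal column should sit near zero in both measures, while the exponential column is clearly positive in both (its theoretical skewness is 2 and its excess kurtosis is 6).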

- - - 

What do the data actually look like?
---

In [None]:
f,axarr = plt.subplots(4,4,figsize=(21,20))
# Only 13 plots, remove last 3 empty frames
f.delaxes(axarr[-1,-3])
f.delaxes(axarr[-1,-2])
f.delaxes(axarr[-1,-1])
# Loop over DataFrames and plot each dataset; index 3 is highlighted in red
for k, df in enumerate(dfs):
    ax = axarr[k//4, k%4]
    ax.plot(df.x, df.y, 'o', c='k' if k != 3 else 'r')
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
    ax.set_title(f'Index {k:02d}')
    ax.axis('equal')  # equal scaling on every panel, not only the current axes
plt.tight_layout()
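
If the number of datasets changes, the grid shape and the unused frames can be derived rather than hard-coded. This is a sketch under assumptions: `n_panels`, `ncols`, and the `Agg` backend (chosen so the snippet runs without a display) are illustrative and not part of the original notebook.

```python
import math

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

n_panels = 13  # number of datasets to show
ncols = 4
nrows = math.ceil(n_panels / ncols)

fig, axarr = plt.subplots(nrows, ncols, figsize=(21, 5 * nrows), squeeze=False)
axes = axarr.ravel()

# Remove whatever frames are left over after the last dataset
for ax in axes[n_panels:]:
    fig.delaxes(ax)

# Label the remaining panels the same way as the plot above
for k, ax in enumerate(axes[:n_panels]):
    ax.set_title(f'Index {k:02d}')

fig.tight_layout()
```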

---------------------------------------


![Anscombe.png](fig/anscombe.png)

The data, and more background, come from [this blog post](https://www.autodeskresearch.com/publications/samestats).
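
The figure above, judging by its file name, shows Anscombe's quartet, the classic illustration of the same point. As a small sketch, the first two sets of the quartet are typed in by hand below (they share the same `x` values); the variable names are illustrative. Their means, standard deviations, and correlations agree to about two decimal places, even though one set is roughly linear and the other follows a smooth curve.

```python
import pandas as pd

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I': pd.DataFrame({'x': x, 'y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                     7.24, 4.26, 10.84, 4.82, 5.68]}),
    'II': pd.DataFrame({'x': x, 'y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                                      6.13, 3.10, 9.13, 7.26, 4.74]}),
}

# One row per set, one column per summary statistic
summary = pd.DataFrame({
    name: {
        'y_mean': df['y'].mean(),
        'y_std': df['y'].std(),
        'corr': df['x'].corr(df['y']),
    }
    for name, df in quartet.items()
}).T
print(summary.round(2))
```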
