# Exploratory Data Analysis
Exploratory data analysis, or EDA, is a standard practice prior to any data manipulation and analysis.

Recall that data engineering is primarily about data preparation to *serve* smooth and effective data analysis.  Exploratory data analysis generally refers to the step of understanding the data:  
- **summarizing characteristics of raw data**
- **visualizing data (single and multiple variables)**
- identifying missing data
- identifying outliers

This document primarily deals with the first two items.  

## Goals
In the **exploratory** phase, summaries and plots are meant for the people working behind the scenes.  

The main goals here are:
- capture the main message
- (relatively) quick exploration across many summaries (including plots)
- *not* intended for a client or presentation

What does this translate to, technically?
- each summary should have meaningful information
- **label** your plots

## Data summary
As a starting point, simply looking at the data is worthwhile.  Some common questions to consider are the following:  


1. General dataset info: size, dtypes  
2. Missing values?  
3. Duplicate data?  
4. Continuous variables  
5. Categorical variables  
6. Bivariate relationships  
7. Potential data quality issues, e.g., inconsistency, special NA characters

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

|![sns](../img/sns.jpg)|
|:---:|
|[The origin of sns.](https://seaborn.pydata.org/faq.html#why-is-seaborn-imported-as-sns)|

In [None]:
# return all available datasets in seaborn
sns.get_dataset_names()

### [Exoplanet dataset source](https://science.nasa.gov/exoplanets/exoplanet-catalog/)

In [None]:
# load and save a copy of the planet dataset
planet = sns.load_dataset('planets', cache=True, data_home='dataset/')

In [None]:
# take a glimpse of the data


In [None]:
# view a summary of the full data
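
One possible way to fill in the glimpse and summary cells is sketched below. To stay runnable offline it uses `toy`, a small made-up frame that borrows the column names of the planets data. The values are invented for illustration and are not taken from the real catalogue. In the notebook the same calls would be applied to `planet`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `planet` (hypothetical values, same column names).
toy = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Transit", "Imaging"],
    "number": [1, 2, 2, 1],
    "orbital_period": [269.3, 3.5, np.nan, 4000.0],
    "mass": [7.1, np.nan, np.nan, np.nan],
    "distance": [77.4, 120.0, 120.0, np.nan],
    "year": [2006, 2011, 2011, 2008],
})

print(toy.head())   # first rows
print(toy.shape)    # (number of rows, number of columns)
toy.info()          # dtypes and non-null counts per column

# Number of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

Together, `info()` and the missing-value count answer questions 1 and 2 from the list above.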


In [None]:
# check for duplicates (and ask whether duplicates make sense)


In [None]:
# duplicates
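
A minimal sketch of a duplicate check, again on a made-up frame (`toy` and its values are hypothetical). Whether a repeated row is a true duplicate depends on whether the columns identify an observation uniquely, so the sketch only inspects the rows and leaves the decision to drop them open.

```python
import pandas as pd

# Toy stand-in for `planet` (hypothetical values).
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Imaging", "Transit"],
    "number": [2, 2, 1, 2],
    "year": [2011, 2011, 2008, 2012],
})

# Boolean mask: True for rows that repeat an earlier row exactly.
is_dup = toy.duplicated()
n_dup = int(is_dup.sum())

# keep=False marks every member of a duplicated group, which makes
# it easier to inspect the repeated rows side by side.
dup_rows = toy[toy.duplicated(keep=False)]

# Dropping duplicates is a separate decision; shown here for reference.
deduped = toy.drop_duplicates()

print(n_dup)
print(dup_rows)
```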


In [None]:
# a quick numerical summary 
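
A quick numerical summary could look like the following sketch. The frame `toy` and its values are made up. With the real data the calls would be `planet.describe()` and `planet.describe(include="all")`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `planet` (hypothetical values).
toy = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Transit", "Imaging"],
    "orbital_period": [269.3, 3.5, 4.2, 4000.0],
    "mass": [7.1, np.nan, 1.2, np.nan],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
num_summary = toy.describe()

# All columns: adds unique, top and freq for non-numeric columns.
all_summary = toy.describe(include="all")

print(num_summary)
print(all_summary)
```

Note that `count` excludes missing values, so comparing it with the number of rows is another way to spot missing data.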


In [None]:
# check possible statistical assumption(s)
import scipy.stats as sps

In [None]:
# extract only numeric variables


In [None]:
# for example, normality test


In [None]:
# for example, another normality test
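
As a hedged illustration of the two normality-test cells, the sketch below runs the Shapiro-Wilk test and the D'Agostino-Pearson test from `scipy.stats` on a simulated right-skewed sample. The sample is an assumption standing in for a column such as `planet['mass'].dropna()`. Missing values should be dropped before either test is applied to real data.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)
# Simulated right-skewed sample (not real planet data).
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# Shapiro-Wilk test; the null hypothesis is that the data are normal.
w_stat, p_shapiro = sps.shapiro(sample)

# D'Agostino-Pearson test, based on skewness and kurtosis.
k2_stat, p_dagostino = sps.normaltest(sample)

# The same test after a log transform, a common remedy for right skew.
w_log, p_shapiro_log = sps.shapiro(np.log(sample))

print(p_shapiro, p_dagostino, p_shapiro_log)
```

Small p-values are evidence against normality. With large samples these tests flag even minor departures, so a histogram or Q-Q plot is a useful companion.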


In [None]:
# pairwise correlation
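
The numeric-extraction and pairwise-correlation cells might be filled in along these lines. The frame `toy` and its columns `x`, `y`, `z` are hypothetical. With the real data, `planet` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Toy frame with one non-numeric column and one missing value.
toy = pd.DataFrame({
    "method": ["a", "b", "a", "b", "a"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, np.nan, 2.0, 1.0],
})

# Keep only numeric columns before computing correlations.
numeric = toy.select_dtypes(include="number")

# Pearson (linear) correlation; missing values are excluded pairwise.
pearson = numeric.corr()

# Spearman (rank) correlation is less sensitive to skew and outliers.
spearman = numeric.corr(method="spearman")

print(pearson)
print(spearman)
```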


## Data visualization

In [None]:
sns.set(context='talk', style='ticks')  # simply for aesthetics
sns.set_palette('magma')
%matplotlib inline 

# planet = planet.sample(n=500)  # (if too slow) for illustration purposes

In [None]:
# histogram for continuous variables using pandas built-in plots 


In [None]:
# relative frequency? ...
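
One way to turn a count histogram into a relative-frequency histogram is to give each observation a weight of 1/n, as sketched below with plain matplotlib on simulated values (the data and the name `toy_mass` are assumptions). The `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
values = rng.lognormal(size=300)  # simulated, not real planet data

# Each observation gets weight 1/n, so the bar heights sum to 1.
weights = np.full(len(values), 1.0 / len(values))

fig, ax = plt.subplots()
ax.hist(values, bins=20, weights=weights)
ax.set_xlabel("toy_mass")
ax.set_ylabel("relative frequency")
ax.set_title("Relative-frequency histogram")

# Collect bar heights to confirm they sum to 1.
heights = [patch.get_height() for patch in ax.patches]
print(sum(heights))
```

In seaborn, `sns.histplot(..., stat="probability")` gives the same normalization.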


In [None]:
# histogram of masses by group


In [None]:
# other types of plots


In [None]:
# counts for categorical variables
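
Counts for a categorical variable can be tabulated before they are plotted. The sketch below uses a short made-up `method` column. With the real data the series would be `planet['method']`.

```python
import pandas as pd

# Hypothetical values standing in for planet['method'].
method = pd.Series(
    ["Transit", "Radial Velocity", "Transit", "Imaging", "Transit"],
    name="method",
)

# Absolute counts, sorted from most to least frequent.
counts = method.value_counts()

# Shares of the total (relative frequencies).
shares = method.value_counts(normalize=True)

print(counts)
print(shares)
```

From here, `counts.plot.bar()` or `sns.countplot` would draw the corresponding bar chart.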


In [None]:
# barplots by group

In [None]:
# bivariate plots

In [None]:
# bivariate plots (log-log)
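
When both variables span several orders of magnitude, log scales on both axes often make the relationship easier to see. A minimal sketch on simulated data follows. The variables `x` and `y` are assumptions, not columns of the planets data, and the `Agg` line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Simulated positive data with a power-law relationship plus noise.
x = rng.lognormal(mean=3.0, sigma=2.0, size=200)
y = np.sqrt(x) * rng.lognormal(sigma=0.3, size=200)

fig, ax = plt.subplots()
ax.scatter(x, y, s=10, alpha=0.6)

# Log scales require strictly positive values on both axes.
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("x (log scale)")
ax.set_ylabel("y (log scale)")
ax.set_title("Bivariate plot on log-log axes")
```

A power law appears as a straight line on log-log axes, which is why this view is a common check for such relationships.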

In [None]:
# pairwise plots  (time-consuming)

In [None]:
# another pairwise plot by group

# (Exercise) Penguins data

We may dive into the penguins dataset as an exercise.

Example questions:

- Practice on some of the exploratory questions above
- How many penguins are in the dataset for each species? 
- Do penguin sizes differ by species or by where they live?
- If we were to build a predictive model for the "sex" of penguins, how might we approach this?
- If we were to build a predictive model for the "size" of penguins, how might we approach this?

In [None]:
penguins = sns.load_dataset('penguins', cache=True, data_home='dataset')
penguins.info()
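
As a possible starting point for the counting and size questions, the sketch below works on `toy_penguins`, a made-up frame that borrows three column names from the penguins data. The values are invented. With the real data, `penguins` would take its place, and `body_mass_g` is only one of several columns that could serve as "size".

```python
import pandas as pd

# Toy stand-in for `penguins` (hypothetical values).
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0, 3700.0],
})

# How many penguins are there per species?
per_species = toy_penguins["species"].value_counts()

# Does body mass differ by species and island? Compare group means.
mass_by_group = (
    toy_penguins.groupby(["species", "island"])["body_mass_g"].mean()
)

print(per_species)
print(mass_by_group)
```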

## (In case you need this) Jupyter notebook setup

Visit https://docs.jupyter.org/en/latest/install/notebook-classic.html for guidance on setting up Jupyter Notebook.


---

*Note:* These notes are adapted from a blog post on [Tom's Blog](https://tomaugspurger.net/posts/modern-6-visualization/).
