# 5. Summarizing and Describing
*Module: Exploratory Data Analysis (Sprint 1 of 2)*

## Sprint Module Review and Data Stories

#### Module 3: Exploratory Data Analysis
*Data analysis is built around questions, and exploratory data analysis helps you know what questions to ask. Descriptive statistics and basic visualizations that summarize features or suggest relationships inspire the generation of hypotheses to confirm with statistical tests or build into statistical models.*

|Data Journalist| Data Engineer | Statistical Modeler| Business Analyst |
|----|----------------|------------------|----|
|I need to **summarize the data I have** so that I can report basic findings| I need to **identify errors and inconsistencies in the data** so that I can develop solutions to address them, and possibly their source| I need to **produce basic visual plots and summary statistics of the central tendencies and range of my data set** so that I can develop an intuition for and a familiarity with my data set| I need to **construct inventories and quality assessment of the data available** so that I can propose high value ways to use the data assets|

## Analytical Process Big Picture
![Curriculum Summary](../curriculum_summary.png)

### Whiteboard Exercise
Illustrate the collection of data from a natural process.

For example, someone asks you: What are the people in Honolulu like?
Illustrate as many of the following as you can:
- 1.) All the people in Honolulu
- 2.) All the characteristics of people in Honolulu, for example, height
- 3.) A dataset that captures all of the people in Honolulu and their characteristics
- 4.) A process for creating that dataset
- 5.) A summary of a single characteristic
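One way to make the whiteboard picture concrete is to simulate it. The sketch below invents a population of heights, draws a random sample to stand in for the dataset, and compares the population parameter with the sample statistic. The population size, mean, spread, and sample size are assumptions chosen only for illustration; none of these numbers are real Honolulu data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Items 1-2: a hypothetical population with one characteristic, height in cm.
# The size, mean, and spread are made-up values, not real Honolulu figures.
population = [random.gauss(168, 9) for _ in range(10_000)]

# Items 3-4: the "process" that creates a dataset is measuring a random subset.
sample = random.sample(population, 200)

# Item 5: summarize a single characteristic.
parameter = statistics.mean(population)  # population mean (rarely knowable in practice)
statistic = statistics.mean(sample)      # sample mean (what we can actually compute)

print(f"population mean: {parameter:.1f} cm")
print(f"sample mean:     {statistic:.1f} cm")
```

The two means should land close together but not match exactly. That gap is the difference between a parameter and a statistic.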

### Shifting to Analysis

Previously we got acquainted with the building blocks of analytical processes. We practiced working with data structures and data formats, programming and scripts, the command line, and libraries.

With these building blocks in place, we can begin actually *looking* at our data. 

### Exploring

The purpose of exploratory data analysis is to get a feel for your dataset: what's in there, what the ranges of values are, and where it might have holes. To the extent that you can understand the underlying natural phenomenon, you want to develop a sense of that as well.

Ultimately, while you're exploring your data, you want to use the opportunity to develop questions: questions about what is going on with the data, and questions that the data seems able to answer.
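A first pass at "what's in there, what the ranges are, and where the holes are" might look like the sketch below. It uses pandas on a toy table; the column names and values are invented for illustration, and a real dataset would be loaded from a file instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are invented.
df = pd.DataFrame({
    "city": ["Honolulu", "Hilo", "Kailua", None],
    "temp_c": [27.5, 24.0, np.nan, 26.1],
    "rain_mm": [1.2, 8.4, 0.0, 3.3],
})

# What's in there: size and column types.
print(df.shape)
print(df.dtypes)

# Ranges of values for the numeric columns (missing values are skipped).
numeric = df.select_dtypes(include="number")
ranges = numeric.agg(["min", "max"])
print(ranges)

# Holes: count of missing values per column.
missing = df.isna().sum()
print(missing)
```

Each output tends to suggest a question. Why is one temperature missing? Is a rainfall of exactly 0.0 a real measurement or a placeholder?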

### Where does exploratory data analysis fit in?

- Data Cleaning: identifying problems to fix, inconsistencies to smooth out
- Modeling: the first thing you do when you get a data set
- Governance: Understanding the data you're trying to secure
- Production Development: Looking at the size and types of data you need to engineer around

In a nutshell, though, analysis is fraught with "gotchas". As you've probably experienced in your programming, the devil is in the details. The more you know those details through exploration, the more productive your final analysis will be as you troubleshoot your way through it.

### Key Questions
- What is Exploratory Data Analysis?
- How does it differ from Data Analysis?
- What tools do I use to conduct Exploratory Data Analysis?
- How can you summarize data?
- What kinds of questions do I ask in Exploratory Data Analysis?
- What is the connection between a dataset and the natural process that created it?

### Three takes on EDA

**R for Data Science**
>EDA is an iterative cycle. You:

>1. Generate questions about your data.

>2. Search for answers by visualising, transforming, and modelling your data.

>3. Use what you learn to refine your questions and/or generate new questions.

>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.


**Experimental Design and Analysis**

> exploratory data analysis or “EDA” is a critical first step in analyzing the data from an experiment. Here are the main reasons we use EDA:
> - detection of mistakes
> - checking of assumptions
> - preliminary selection of appropriate models
> - determining relationships among the explanatory variables, and
> - assessing the direction and rough size of relationships between explanatory and outcome variables.

>Loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis.

**Engineering Statistics Handbook**

> Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
> - maximize insight into a data set;
> - uncover underlying structure;
> - extract important variables;
> - detect outliers and anomalies;
> - test underlying assumptions;
> - develop parsimonious models; and
> - determine optimal factor settings.

### Key Concepts and Definitions
- sample
- statistic
- population
- parameter
- central tendency
- variation
- univariate
- multivariate
- distribution
- categorical variable
- continuous / quantitative variable
- variance
- standard deviation
- interquartile range
- skewness
- kurtosis
- histogram
- stem and leaf
- box plot
- outlier
- cross tabulation
- ANOVA
- correlation
- covariance
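Several of the univariate terms above can be computed in a few lines. The following is a minimal sketch using pandas on an invented sample; the values are not from any real dataset, and the 1.5 × IQR outlier rule is one common rule of thumb rather than the only definition.

```python
import pandas as pd

# Hypothetical sample of a single continuous variable; the values are invented.
x = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 20.0])

# Central tendency
mean = x.mean()
median = x.median()

# Variation
variance = x.var()   # sample variance (pandas uses ddof=1 by default)
std_dev = x.std()    # sample standard deviation
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1        # interquartile range

# Shape of the distribution
skewness = x.skew()
kurtosis = x.kurt()  # excess kurtosis

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles.
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

Here the mean sits above the median because a single large value pulls it upward, which is also what a positive skewness indicates.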



### Exploratory Data Analysis Resources
- #### Overview: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
- #### In Depth: http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf 
- #### R: http://r4ds.had.co.nz/exploratory-data-analysis.html
- #### Tableau: http://tableauafterdark.com/exploratory-data-analysis-eda-for-tableau/ 

### Develop a Process
This is something you will do over and over again. Reusable scripts are useful. So is a reusable format for documentation.
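As a sketch of what a reusable script could look like, the helper below builds a one-row-per-column inventory that can be run on any DataFrame. The function name `column_inventory` and the choice of columns are hypothetical, not part of pandas or of this course's materials.

```python
import pandas as pd


def column_inventory(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and distinct count.

    `column_inventory` is a hypothetical helper name, not a pandas function.
    """
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # distinct non-missing values
    })


# Example call on a small invented table.
example = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "y"]})
inventory = column_inventory(example)
print(inventory)
```

Because the result is itself a DataFrame, it can be saved alongside the analysis as a consistent piece of documentation.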

### Project Ideas

Explore these datasets
https://simplystatistics.org/2018/01/22/the-dslabs-package-provides-datasets-for-teaching-data-science/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+SimplyStatistics+%28Simply+Statistics%29

Find and Replicate a Kaggle Machine Learning EDA workflow
https://www.kaggle.com/xchmiao/eda-with-python

From Justin for Both Sprints on EDA
- What is your intuition about the data? What do you ‘see’ when you look at a time series? What do you ‘see’ when you plot that same series? How does the visualization aid or deny your intuitions? What is the best way to visually encode the data you are currently evaluating?
- Histogram everything
- Using a CSV or Excel file of a data series (stock prices, weather data, house prices), compute some basic summary stats
- `df.describe()` and `df.info()`: which one gives summary stats and which one gives metadata?
- Tools: pandas for basic visualization
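To see the difference between the two calls, the sketch below runs both on a small invented price series. The column names and values are assumptions for illustration only; with a real file you would start from `pd.read_csv` or `pd.read_excel`.

```python
import io

import pandas as pd

# Hypothetical daily closing prices; the values are invented for illustration.
prices = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "close": [101.0, 103.5, None, 99.0, 104.5],
})

# describe() returns a DataFrame of summary statistics (numeric columns by default).
stats = prices.describe()
print(stats)

# info() prints metadata: column dtypes, non-null counts, memory usage.
# It returns None, so capture the text in a buffer if you want to keep it.
buffer = io.StringIO()
prices.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

In short, `describe()` summarizes the values, while `info()` describes the container that holds them.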



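In [1]:

    # Randomizer: shuffle the cohort, then rotate the order for the later days
    import random
    import numpy
    cohort = ["hunter", "jon", "michael", "nat", "runjini", "sheuli", "tori"]
    random.shuffle(cohort)  # shuffles the list in place

    print("Day 1/2/3:")
    print(cohort)
    cohort = numpy.roll(cohort, 1)  # shift every name one position to the right
    print("Day 4/5/6:")
    print(cohort)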


Day 1/2/3:
['runjini', 'tori', 'nat', 'michael', 'hunter', 'sheuli', 'jon']
Day 4/5/6:
['jon' 'runjini' 'tori' 'nat' 'michael' 'hunter' 'sheuli']


In [7]:

    import pandas as pd

    # Build a small DataFrame from a dict of columns
    data = {'Country': ['Belgium', 'India', 'Brazil'],
            'Capital': ['Brussels', 'New Delhi', 'Brasília'],
            'Population': [11190846, 1303171035, 207847528]}

    df = pd.DataFrame(data,
                      columns=['Country', 'Capital', 'Population'])

    # Select a subset of columns by passing a list of names
    df[['Country', 'Capital']]

| | Country | Capital |
|----|---------|-----------|
| 0 | Belgium | Brussels |
| 1 | India | New Delhi |
| 2 | Brazil | Brasília |
