# 6. Preliminary Findings and Hypotheses
*Module: Exploratory Data Analysis (Sprint 2 of 2)*

## Sprint Module Review and Data Stories

#### Module 3: Exploratory Data Analysis
*Data analysis is built around questions, and exploratory data analysis helps you know what questions to ask. Descriptive statistics and basic visualizations that summarize features or suggest relationships inspire the generation of hypotheses to confirm with statistical tests or build into statistical models.*

## Sprint 6: Preliminary Findings and Hypotheses
|Data Journalist| Data Engineer | Statistical Modeler| Business Analyst |
|:----------------:|:----:|:------------------:|:----:|
|I need to **identify interesting patterns** so that I can direct further investigation| I need to **understand the volume and data types of data** to understand their performance implications| I need to **produce statistical summaries that explain how variables in my data set relate to each other** so that I can develop hypotheses to guide my analysis| I need to **produce preliminary charts and dashboards** so that I can communicate with other areas of the business about problems we need to solve with joint expertise and refine data collection based on feedback|

### Questions from Last EDA
- How do we know what we CAN'T see in our data?
- How do we account for biases, false leads?
- There's a lot of data to go through. So many data sets, how do we get through all of them?
- How do we address the questions we come up with fast enough and jump from one idea to the next?
- How much domain knowledge do we need to do Exploratory Data Analysis?

### Charting ideas
- Python Chart Library: https://python-graph-gallery.com/
- http://visualizationuniverse.com/charts/?sortBy=volume&sortDir=desc
- http://datavizproject.com/
- http://ggplot2.tidyverse.org/
- http://ggplot.yhathq.com/
- http://seaborn.pydata.org
- https://github.com/wesm/feather
- https://www.statmethods.net/graphs/scatterplot.html
- http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization

### Histograms, Correlations, and other Statistical Topics
- http://flowingdata.com/2017/06/07/how-histograms-work/
- http://tinlizzie.org/histograms/
- http://rpsychologist.com/d3/correlation/
- http://rpsychologist.com/d3/CI/

## Structured Sprint Option
Objectives
- Explore key statistical concepts *(that provide the basis of the preliminary findings)*
- Develop a rigorous process *(that drives toward preliminary findings)*
- Develop expertise in a dataset that is relevant to you *(preliminary findings and hypotheses that are relevant to data in your field)*

Structure
- Week 1: Paired Sprint, develop conceptual depth and process
- Week 2: Follow Intuitive Inquiry, Polish, Communicate, and Share findings.



### Key Concepts and Definitions
#### Sprint 5
- sample
- statistic
- population
- parameter
- central tendency
- variation
- univariate
- multivariate
- distribution
- categorical variable
- continuous / quantitative variable
- variance
- standard deviation
- interquartile range
- skewness
- kurtosis
- histogram
- stem and leaf
- box plot
- outlier
- cross tabulation
- anova
- correlation
- covariance

#### Sprint 6
- one dimension
- multi-dimension
- scatterplot
- scatterplot matrix
- dimensionality
- dimensionality reduction
- covariation
- clustering
- k-means
- feature extraction
- feature selection / elimination
- feature engineering
- principal component analysis
- factor analysis




- R in a Nutshell
- Data Science from Scratch
- R For Data Science

#### Patterns
> Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

#### Correlation and Covariance

>Very often, when analyzing data, you want to know if two variables are correlated. Informally, correlation answers the question, “When we increase (or decrease) x, does y increase (or decrease), and by how much?” 

> Formally, correlation measures the linear dependence between two random variables. Correlation measures range between −1 and 1; 1 means that one variable is a (positive) linear function of the other, 0 means the two variables aren’t correlated at all, and −1 means that one variable is a negative linear function of the other (the two move in completely opposite directions).
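
The definitions above can be sketched in plain Python. This is a minimal illustration, not the quoted book's code: the function names and the sample data are made up for the example, and the convention of returning 0 for a constant variable is an assumption.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    sd_x = sqrt(covariance(xs, xs))
    sd_y = sqrt(covariance(ys, ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed convention: a constant variable has no correlation
    return covariance(xs, ys) / (sd_x * sd_y)

# Made-up illustration data
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]     # exact positive linear function of x
y_down = [10 - 3 * v for v in x]  # exact negative linear function of x
y_flat = [2, 1, 3, 1, 2]          # no linear trend in x

print(correlation(x, y_up))
print(correlation(x, y_down))
print(correlation(x, y_flat))
```

Covariance keeps the units of both variables, so its size is hard to compare across pairs of variables. Dividing by the two standard deviations is what confines correlation to the range −1 to 1.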

#### Correlation Matrix
>With many dimensions, you’d like to know how all the dimensions relate to one another. A simple approach is to look at the correlation matrix, in which the entry in row i and column j is the correlation between the ith dimension and the jth dimension of the data.
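
A correlation matrix along these lines might be built as follows. This is a sketch in plain Python; the column names (`height`, `weight`, `shoe`) and their values are hypothetical, and each column is assumed to vary (a constant column would cause a division by zero).

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Entry [i][j] is the correlation between column i and column j."""
    return [[correlation(ci, cj) for cj in columns] for ci in columns]

# Hypothetical data: three dimensions with five observations each
height = [150, 160, 170, 180, 190]
weight = [52, 58, 69, 77, 91]
shoe = [9, 7, 8, 10, 6]

matrix = correlation_matrix([height, weight, shoe])
for row in matrix:
    print([round(value, 2) for value in row])
```

The diagonal is always 1 (each dimension against itself) and the matrix is symmetric, so only the entries above the diagonal carry new information.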

>A more visual approach (if you don’t have too many dimensions) is to make a scatterplot matrix (Figure 10-4) showing all the pairwise scatterplots. To do that we’ll use `plt.subplots()`, which allows us to create subplots of our chart. We give it the number of rows and the number of columns, and it returns a figure object (which we won’t use) and a two-dimensional array of axes objects (each of which we’ll plot to).


#### Simpson’s Paradox
> One not uncommon surprise when analyzing data is Simpson’s Paradox, in which correlations can be misleading when confounding variables are ignored.
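
A small made-up data set shows how this can happen. In the sketch below, the two groups and their values are invented for illustration: within each group `y` rises with `x`, yet the pooled data shows a negative correlation because group membership (the confounding variable) is ignored.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented groups: low x with high y, and high x with low y
group_a = {"x": [1, 2, 3], "y": [10, 11, 12]}
group_b = {"x": [6, 7, 8], "y": [3, 4, 5]}

within_a = correlation(group_a["x"], group_a["y"])
within_b = correlation(group_b["x"], group_b["y"])
pooled = correlation(group_a["x"] + group_b["x"], group_a["y"] + group_b["y"])

print("within group A:", round(within_a, 2))
print("within group B:", round(within_b, 2))
print("pooled:", round(pooled, 2))
```

Computing the statistic once per group, as well as on the pooled data, is a cheap check whenever a plausible grouping variable is available.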

#### Correlation and Causation
> You have probably heard at some point that “correlation is not causation,” most likely by someone looking at data that posed a challenge to parts of his worldview that he was reluctant to question. Nonetheless, this is an important point—if x and y are strongly correlated, that might mean that x causes y, that y causes x, that each causes the other, that some third factor causes both, or it might mean nothing.

#### Dimensionality Reduction
> You might ask the question, “How do I take all of the variables I’ve collected and focus on only a few of them?” In technical terms, you want to “reduce the dimension of your feature space.” By reducing the dimension of your feature space, you have fewer relationships between variables to consider and you are less likely to overfit your model. (Note: This doesn’t immediately mean that overfitting, etc. are no longer concerns — but we’re moving in the right direction!)

>Somewhat unsurprisingly, reducing the dimension of the feature space is called “dimensionality reduction.” There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:

>- Feature Elimination
>- Feature Extraction
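
As one illustration of feature elimination, the sketch below drops any feature that is almost perfectly correlated with a feature already kept. This heuristic, the 0.95 threshold, the function name `eliminate_correlated`, and the data are all assumptions made for the example; the excerpt does not prescribe a particular method.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def eliminate_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in features.items():
        if all(abs(correlation(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical features; height_in repeats the information in height_cm
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 52],
}

reduced = eliminate_correlated(features)
print(list(reduced))
```

The result depends on the order of the features, since the first of each redundant pair is the one that survives. Feature extraction differs in that it builds new variables from combinations of the originals instead of discarding some of them.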

#### Principal Components Analysis
> Another technique for analyzing data is principal components analysis. Principal components analysis breaks a set of (possibly correlated) variables into a set of uncorrelated variables.

> A brief description
> https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
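
To make the idea concrete, here is a rough sketch that finds only the first principal component of a tiny made-up data set. It uses power iteration on the covariance matrix, a technique chosen here for brevity and not taken from the linked article; the function names are illustrative, and the start vector is assumed not to be orthogonal to the leading direction.

```python
from math import sqrt

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov to a vector and renormalize."""
    d = len(cov)
    v = [1.0] * d  # assumed start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Made-up data: two strongly correlated variables
rows = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.2], [8.0, 7.8], [10.0, 10.0]]
centered = center(rows)
pc1 = first_component(covariance_matrix(centered))

# Project each observation onto the component: one number replaces two
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
print([round(c, 3) for c in pc1])
print([round(s, 2) for s in scores])
```

Because the two variables move together, the first component points roughly along the diagonal, and the single `scores` column retains almost all of the variation in the original two columns.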

#### Factor Analysis
> In most data analysis problems, there are some quantities that we can observe and some that we cannot. The classic examples come from the social sciences. Suppose that you wanted to measure intelligence. It’s not possible to directly measure an abstract concept like intelligence, but it is possible to measure performance on different tests. You could use factor analysis to analyze a set of test scores (the observed values) to try to determine intelligence (the hidden value).

> Factor analysis is available in R through the function `factanal` in the `stats` package.

#### Bootstrap Resampling
> When analyzing statistics, analysts often wonder if the statistics are sensitive to a few outlying values. Would we get a similar result if we were to omit a few points? What is the range of values for the statistic? It is possible to answer these questions for an arbitrary statistic using a technique called bootstrapping.

> Formally, bootstrap resampling is a technique for estimating the bias of an estimator. An estimator is a statistic calculated from a data sample that provides an estimate of a true underlying value, often a mean, a standard deviation, or a hidden parameter.

> Bootstrapping works by repeatedly selecting random observations from a data sample (with replacement) and recalculating the statistic. In R, you can use bootstrap resampling through the `boot` function in the `boot` package.
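
The same procedure can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a port of R's `boot`: the helper names, the number of resamples, the fixed seed, the simple percentile interval, and the data (with one deliberate outlier) are all made up for the example.

```python
import random
from statistics import mean, median

def bootstrap(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    return [statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)]

def percentile_interval(values, lower=0.025, upper=0.975):
    """Rough 95% range of the bootstrap statistics (no interpolation)."""
    ordered = sorted(values)
    lo = ordered[int(lower * (len(ordered) - 1))]
    hi = ordered[int(upper * (len(ordered) - 1))]
    return lo, hi

# Made-up sample with one outlying value
data = [12, 15, 14, 10, 13, 95, 11, 14, 12, 13]

medians = bootstrap(data, median)
means = bootstrap(data, mean)

# Bootstrap bias estimate: average resampled statistic minus the observed one
bias_estimate = mean(medians) - median(data)

med_lo, med_hi = percentile_interval(medians)
mean_lo, mean_hi = percentile_interval(means)
print("median range:", med_lo, med_hi)
print("mean range:", mean_lo, mean_hi)
print("estimated bias of the median:", bias_estimate)
```

Comparing the two ranges answers the sensitivity question from the excerpt: the mean swings widely depending on how often the outlier is drawn, while the median barely moves.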

#### Lattice Graphic Package 
> Lattice functions make it easy to do some things that are hard to do with standard graphics, such as plotting multiple plots on the same page or superimposing plots. Additionally, most lattice functions produce clean, readable output by default. This chapter shows what lattice graphics can do and explains how to use them.

>The real strength of the lattice package is in splitting a chart into different panels (shown in a grid), or groups (shown with different colors or symbols) using a conditioning or grouping variable. This chapter includes many examples that start with a simple chart and then split it into multiple pieces to answer a question raised by the original plot.

#### k-means

> Your goal is to segment the users. This process is known by various names: besides being called segmenting, you could say that you’re going to stratify, group, or cluster the data. They all mean finding similar types of users and bunching them together.

> Why would you want to do this? Here are a few examples:

>- You might want to give different users different experiences. Marketing often does this; for example, to offer toner to people who are known to own printers.
>- You might have a model that works better for specific groups. Or you might have different models for different groups.
>- Hierarchical modeling in statistics does something like this; for example, to separately model geographical effects from household effects in survey results.
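
A bare-bones k-means loop for this kind of segmentation might look like the sketch below. The user coordinates are made up, and the deterministic starting centers (evenly spaced picks from the data) are a simplification assumed here to keep the example reproducible; practical implementations usually start from random points and rerun several times, because a poor start can settle on a poor grouping.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, iterations=100):
    # Assumed simplification: start from k evenly spaced data points
    centers = [points[i * len(points) // k] for i in range(k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = centroid(members)
    return centers, assignments

# Hypothetical users described by two numeric features
users = [(1, 2), (2, 1), (1, 1), (9, 8), (8, 9), (9, 9)]
centers, labels = k_means(users, k=2)
print(labels)
print(centers)
```

The number of segments `k` is an input, not something the algorithm discovers, so it is worth trying a few values and checking whether the resulting groups make sense for the question at hand.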


### Project Ideas

Explore these datasets
https://simplystatistics.org/2018/01/22/the-dslabs-package-provides-datasets-for-teaching-data-science/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+SimplyStatistics+%28Simply+Statistics%29

Find and Replicate a Kaggle Machine Learning EDA workflow
https://www.kaggle.com/xchmiao/eda-with-python

From Justin for Both Sprints on EDA
- What is your intuition about the data? What do you ‘see’ when you look at a time series? What do you ‘see’ when you plot that same series? How does the visualization aid or deny your intuitions? What is the best way to visually encode the data you are currently evaluating?
- Histogram everything
- Using a CSV or Excel file of a data series like stock prices, weather data, or house prices, take some basic summary stats
- `df.describe()` and `df.info()`: metadata vs. summary stats?
- Tools: pandas for basic visualization
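
To see what the summary stats and histogram steps compute, here is a small standard-library sketch. The temperature values and function names are invented for the example, and the integer binning assumes whole-number data. The dictionary keys mirror the labels that pandas `df.describe()` reports, though the quartile values may differ slightly because the interpolation methods are not the same.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev

def summarize(values):
    """Basic summary statistics for one numeric series."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": median(values),
        "75%": q3,
        "max": max(values),
    }

def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter(bin_width * (v // bin_width) for v in values)

# Made-up daily temperatures
temps = [61, 64, 67, 70, 71, 73, 74, 74, 78, 85]
bin_width = 5

summary = summarize(temps)
bins = histogram(temps, bin_width)

for name, value in summary.items():
    print(f"{name:>5}: {value}")
for edge, count in sorted(bins.items()):
    print(f"{edge}-{edge + bin_width - 1}: {'#' * count}")
```

The text histogram shows at a glance what the summary numbers hide, such as where the values pile up and whether there are gaps between bins.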



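```python
# Randomizer: shuffle the cohort, then rotate the order by one for the second half
import random
import numpy

cohort = ["hunter", "jon", "michael", "runjini", "sheuli", "tori"]
random.shuffle(cohort)  # random ordering, done in place

print("Day 1/2/3:")
print(cohort)
cohort = numpy.roll(cohort, 1)  # shift everyone one position to the right (returns a NumPy array)
print("Day 4/5/6:")
print(cohort)
```

```
Day 1/2/3:
['runjini', 'tori', 'nat', 'michael', 'hunter', 'sheuli', 'jon']
Day 4/5/6:
['jon' 'runjini' 'tori' 'nat' 'michael' 'hunter' 'sheuli']
```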
