# Introduction

> “I think, therefore I am”

- What is data analysis?
- What type of questions can be answered?
- Developing a hypothesis-driven approach.
- Making the case.


## Data Analysis as an Art

> "Science is knowledge which we understand so well that we can teach it to a computer. Everything else is art" - Donald Knuth

- We need to know the science, we need to learn the art.
- Analogous examples - Creating a hit song, Diagnosing a medical problem.
- Business problems are 'wicked' in nature - multiple stakeholders, differing problem definitions, differing solutions, interdependence, constraints, amplifying loops


![](img/problems.png)

> "Data analysis is hard, and part of the problem is that few people can explain how to do it. It’s not that there aren’t any people doing data analysis on a regular basis. It’s that the people who are really good at it have yet to enlighten us about the thought process that goes on in their heads." - Roger Peng

![](img/data_analysis.png)


## Hypothesis-driven Approach
A hypothesis is an educated guess or hunch.

Hypothesis generation asks the question "what if"; hypothesis testing follows it up by saying "if x, then y" with relevant data and analysis. If we keep doing this, then we can keep improving the hypothesis. It is a process of "iteration and learning". The definition of the problem and the solution are not separate; we keep refining, reshaping and sharpening both of them.

The hypothesis-driven approach is based on abductive reasoning. With induction, you start with data and work backward to form a rule: you look at a set of data and notice that when price increases, demand falls. With deduction, you start with a rule and make a prediction of what you will observe: when price increases, demand falls. Abduction, however, reasons from effect to cause: if demand is down, it might be because price is up.
- Induction - shows that something actually is operative.
- Deduction - proves that something must be.
- Abduction - only suggests that something may be.

Why is abduction important? Because the space of possible problems and solutions is unbounded, good hypothesis generation is critical. And because the solution is an invented choice rather than a discovered truth, it is contestable and requires persuasive argumentation.


## Making the Case

"Making the case" is important, and a compelling case comes from a data-based hypothesis. Explaining 'what is' is an essential step in building confidence in the recommendation. Learning and changing mental models are needed for implementation and acceptance.


# 1. Frame

## Types of Question

> "Doing data analysis requires quite a bit of thinking and we believe that when you’ve completed a good data analysis, you’ve spent more time thinking than doing." - Roger Peng

1. **Descriptive** - "seeks to summarize a characteristic of a set of data"
2. **Exploratory** - "analyze the data to see if there are patterns, trends, or relationships between variables" (hypothesis generating) 
3. **Inferential** - "a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data" (hypothesis testing)
4. **Predictive** - "determine the impact on one factor based on other factor in a population - to make a prediction"
5. **Causal** - "asks whether changing one factor will change another factor in a population - to establish a causal link" 
6. **Mechanistic** - "establish *how* the change in one factor results in change in another factor in a population - to determine the exact mechanism"

# 2. Acquire

> "Data is the new oil"

**Ways to acquire data** (typical data sources)

- Downloaded from an internal system
- Obtained from a client or other third party
- Extracted from a web-based API
- Scraped from a website
- Extracted from a PDF file
- Gathered manually and recorded

**Data Formats**
- Flat files (e.g. csv)
- Excel files
- Database (e.g. MySQL)
- JSON
- HDFS (Hadoop)
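
As a minimal sketch of loading two of these formats with pandas, the example below reads a flat file and a JSON document. The in-memory strings, the column names and the market values are assumptions made up for illustration; in practice you would pass a file path or URL instead.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files or API responses.
csv_text = (
    "market,year,month,quantity\n"
    "Pune(MS),2015,January,120\n"
    "Delhi(DL),2015,January,300\n"
)
json_text = (
    '[{"market": "Pune(MS)", "quantity": 120},'
    ' {"market": "Delhi(DL)", "quantity": 300}]'
)

# Flat file: read_csv accepts a path, a URL, or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: read_json turns a list of records into rows.
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.dtypes)
print(df_json.head())
```

pandas also provides `pd.read_excel` for Excel files and `pd.read_sql` for databases; both follow the same pattern of returning a dataframe.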

# 3. Refine the Data
 
> "Data is messy"

- **Remove** e.g. remove redundant data from the data frame
- **Derive** e.g. State and City from the market field
- **Parse** e.g. extract date from year and month columns

Other checks you may need to do while refining the data are...
- **Missing** e.g. Check for missing or incomplete data
- **Quality** e.g. Check for duplicates, accuracy, unusual data
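
The refine steps above could look roughly like this in pandas. This is a sketch on a made-up dataframe: the column names (`market`, `year`, `month`, `quantity`, `unused`) and the `City(ST)` layout of the market field are assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Hypothetical raw data; all names and values are illustrative.
raw = pd.DataFrame({
    "market": ["Pune(MS)", "Delhi(DL)", "Delhi(DL)", "Nagpur(MS)"],
    "year": [2015, 2015, 2015, 2016],
    "month": ["January", "February", "February", "March"],
    "quantity": [120.0, 300.0, 300.0, None],
    "unused": ["x", "x", "x", "x"],
})

# Remove: drop a column that carries no information.
df = raw.drop(columns=["unused"])

# Quality: drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Derive: split "City(ST)" into separate city and state columns.
parts = df["market"].str.extract(r"^(?P<city>[^(]+)\((?P<state>[^)]+)\)$")
df["city"] = parts["city"].str.strip()
df["state"] = parts["state"]

# Parse: build a proper date from the year and month columns.
df["date"] = pd.to_datetime(
    df["month"] + " " + df["year"].astype(str), format="%B %Y"
)

# Missing: count missing values per column.
missing_counts = df.isna().sum()

print(df)
print(missing_counts)
```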


# 4. Transform the data

> "A rough diamond is cut and shaped into a beautiful gem"

- **Convert** e.g. free text to coded value
- **Calculate** e.g. percentages, proportion
- **Merge** e.g. first and surname for full name
- **Aggregate** e.g. rollup by year, cluster by area
- **Filter** e.g. exclude based on location
- **Sample** e.g. extract a representative sample
- **Summary** e.g. show summary stats like mean
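
Each of the transforms listed above maps onto a short pandas expression. The sketch below uses a small invented dataframe; every column name and value is an assumption chosen only to make the operations visible.

```python
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "first": ["Ada", "Alan", "Grace", "Edgar"],
    "surname": ["Lovelace", "Turing", "Hopper", "Codd"],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "year": [2015, 2015, 2016, 2016],
    "size": ["small", "large", "large", "small"],
    "quantity": [100, 300, 200, 400],
})

# Convert: free text to a coded value.
df["size_code"] = df["size"].map({"small": 0, "large": 1})

# Calculate: each row's share of the total quantity, as a percentage.
df["pct_of_total"] = 100 * df["quantity"] / df["quantity"].sum()

# Merge: first name and surname into a full name.
df["full_name"] = df["first"] + " " + df["surname"]

# Aggregate: roll up quantity by year.
by_year = df.groupby("year")["quantity"].sum()

# Filter: exclude rows based on location.
not_delhi = df[df["city"] != "Delhi"]

# Sample: draw a random subset (random_state makes it repeatable).
sample = df.sample(n=2, random_state=0)

# Summary: a basic summary statistic.
mean_quantity = df["quantity"].mean()

print(by_year)
print(not_delhi[["full_name", "city", "quantity"]])
print(mean_quantity)
```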

# 5. Explore the Data

> "I don't know, what I don't know"

- Why do **visual exploration**?
- Understand Data Structure & Types
- Explore **single variable graphs** - Quantitative, Categorical
- Explore **dual variable graphs** - (Q & Q, Q & C, C & C)
- Explore **multi variable graphs**

We first want to **visually explore** the data to see if we can confirm some of our initial hypotheses, as well as form new hypotheses about the problem we are trying to solve.

For this, we will start by loading the data and understanding the structure of the dataframe we have.
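
A first look at the structure of a dataframe might go as follows. The dataframe here is a made-up stand-in for the loaded data, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "quantity": [100, 300, 200, 400],
    "price": [12.5, 10.0, 14.0, 9.5],
})

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type of each column
print(df.head())      # first few rows
print(df.describe())  # summary statistics for the numeric columns

# Separate quantitative from categorical columns by dtype.
quantitative = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()
```

Knowing which columns are quantitative and which are categorical helps decide which single and dual variable graphs make sense.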

### PRINCIPLE: Subset a Dataframe

![](img/subsetrows.png)

How do you subset a dataframe on a given criterion?

`newDataframe = df[<subset condition>]`
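
As a concrete sketch of this pattern, the subset condition is a boolean Series with one True/False value per row. The dataframe and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "quantity": [100, 300, 200, 400],
})

# The condition evaluates to one boolean per row.
mask = df["quantity"] > 150

# Indexing with the mask keeps only the rows where it is True.
large = df[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
pune_large = df[(df["city"] == "Pune") & (df["quantity"] > 150)]

print(large)
print(pune_large)
```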

### PRINCIPLE: Split Apply Combine

How do we get the sum of quantity for each city?

We need to **SPLIT** the data by each city, **APPLY** the sum to the quantity column and then **COMBINE** the data again


![](img/splitapplycombine.png)


In pandas, we use the `groupby` method to do this.
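
A minimal sketch of the sum of quantity per city, first with `groupby` and then with the three steps written out by hand. The data and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "quantity": [100, 300, 200, 400],
})

# groupby splits by city, sum is applied to quantity, and the
# results are combined into one Series indexed by city.
total_by_city = df.groupby("city")["quantity"].sum()

# The same three steps, spelled out explicitly.
pieces = {}
for city, group in df.groupby("city"):      # SPLIT
    pieces[city] = group["quantity"].sum()  # APPLY
manual = pd.Series(pieces)                  # COMBINE

print(total_by_city)
print(manual)
```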

### PRINCIPLE: Pivot Table

A pivot table is a way to summarize the data in a dataframe by index (rows), columns and values.

![](img/pivot.png)
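
One possible sketch with `pd.pivot_table`, putting cities on the rows, years on the columns and summed quantity in the cells. The data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Pune"],
    "year": [2015, 2015, 2016, 2016, 2016],
    "quantity": [100, 300, 200, 400, 50],
})

# index -> rows, columns -> columns, values -> what gets aggregated.
table = pd.pivot_table(
    df, index="city", columns="year", values="quantity", aggfunc="sum"
)

print(table)
```

Rows that share the same city and year are combined by `aggfunc`, so the two Pune rows for 2016 end up in a single cell.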

# 6. Model

> "All models are wrong, Some of them are useful"

- The power and limits of models
- Tradeoff between Prediction Accuracy and Model Interpretability
- Assessing Model Accuracy
- Regression models (Simple, Multiple)
- Classification models

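
As a rough illustration of a simple regression model and one way of assessing its accuracy, the sketch below fits a straight line with NumPy's `polyfit` and computes R-squared. The price and demand numbers are invented, and NumPy is just one of several libraries that could be used here.

```python
import numpy as np

# Hypothetical observations: demand falls as price rises.
price = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
demand = np.array([100.0, 90.0, 85.0, 70.0, 60.0])

# Simple linear regression: demand is modelled as slope * price + intercept.
slope, intercept = np.polyfit(price, demand, deg=1)
predicted = slope * price + intercept

# R-squared: share of the variance in demand explained by the model.
ss_res = np.sum((demand - predicted) ** 2)
ss_tot = np.sum((demand - demand.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

Note that this measures fit on the same data the model was trained on; accuracy on data the model has not seen is usually the more honest measure.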
### PRINCIPLE: Correlation

Correlation refers to any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other.

![](img/corr.svg)
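
In pandas, the linear (Pearson) correlation between two columns can be computed with `corr`. The sketch below uses invented price and demand values; the column names are assumptions.

```python
import pandas as pd

# Hypothetical data: demand falls as price rises.
df = pd.DataFrame({
    "price": [10, 12, 14, 16, 18],
    "demand": [100, 90, 85, 70, 60],
})

# Pearson correlation between two columns; values lie between -1 and 1.
r = df["price"].corr(df["demand"])

# Pairwise correlations between all numeric columns.
corr_matrix = df.corr()

print(r)  # strongly negative for this data
print(corr_matrix)
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 only rules out a linear one; the variables may still be related in some other way.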

# 7. Insight

> “The goal is to turn data into insight”
  
- Why do we need to communicate insight?
- Types of communication - Exploration vs. Explanation
- Explanation: Telling a story with data
- Exploration: Building an interface for people to find stories