# Ultimate Data

## Introduction

Based on [Daniel Whitenack's Ultimate Data - Introduction](https://github.com/ardanlabs/gotraining/blob/master/topics/courses/data/introduction/README.md) from GopherCon 2017.

### Introduction to Data Analysis

#### What is Data Analysis?

Data analysis transforms datasets into **insights** that have corresponding **actions** and **consequences**.

#### Prepare your mind

Before and during any data analytics project, you must be able to answer the following questions:

- What insights do I want to generate?
- What actions are triggered by the insights?
- What are the consequences of those actions?
- What do the results need to contain to best represent the desired insights?
- What is the data required to produce a valid result?
- How will I measure the validity of results?
- Can the results be effectively conveyed to decision makers as insights?
- Am I confident in the results?

#### Order of Operations

Data analytics projects should follow these steps in this order:

1. Understand the insights, actions and consequences involved.
2. Understand the relevant data to be gathered and analyzed.
3. Gather and organize the relevant data.
4. Understand the expectations for determining valid results.
5. Determine the most interpretable process that can produce valid results.
6. Determine how you will test the validity of results.
7. Develop the determined process and tests.
8. Test the results and evaluate against your expectations.
9. Refactor as necessary.
10. Look for ways to simplify, minimize and reduce.

#### Guidelines, Decision Making and Trade-Offs

Develop your design philosophy around these major categories in this order: Integrity, Value, Readability/Interpretability, and Performance.

**1) Integrity** - Generating bad insights may cause irreparable damage to real people.
- Error handling code is the main code.
- You must understand the data.
- Control the input and output of your processes.
- You must be able to reproduce results.

**2) Value** - Just because you can produce a result does not mean the result contains value.
- If an action cannot be taken based on a result, the result does not have value.
- If the impact of a result cannot be measured, the result does not have value.

**3) Readability and Interpretability** - Avoid unnecessary data transformations and analysis complexity that hides:
- The cost/impact of individual steps of the analyses.
- The underlying purpose of the data transformations and analyses.

**4) Performance** - Make your analyses run as fast as possible and produce results that minimize a given measure of error. When code is written with this as the priority, it is very difficult to write code that is readable, simple or idiomatic.

![](https://github.com/ardanlabs/gotraining/raw/master/topics/data/data_analysis/forbes_data_science.jpg)

### Pachyderm

Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.

#### Version control for data

Pachyderm version controls data, similar to what Git does with code. You can track the state of your data over time, backtest models on historical data, share data with teammates, and revert to previous states of data.

![](http://www.pachyderm.io/images/pachyderm-graph.png)

#### Language-agnostic data pipelines

Pachyderm lets you use the tools and frameworks you need, from bash scripts to TensorFlow. You declaratively tell Pachyderm what you want to run, and Pachyderm takes care of triggering, data sharding, parallelism, and resource management on the backend.

![](http://www.pachyderm.io/images/pachyderm-factory.png)

For more information, see the [local installation](http://pachyderm.readthedocs.io/en/latest/getting_started/local_installation.html) guide and the [beginner tutorial](http://pachyderm.readthedocs.io/en/latest/getting_started/beginner_tutorial.html).