# A Brief Theory of Data

## Types of data

In this section, when I talk about the "type" of the data, I am not talking about the `dtype` (`int`, `float`, `bool`, `str`) used to represent the data in a NumPy array or Pandas DataFrame. In this context the "type" of the data is used in a more abstract sense.

## Take 1: Data+Design

[Data+Design](https://infoactive.co/data-design) is an excellent online book about the theory of data. It is carefully thought out and beautifully presented, and I highly recommend spending time reading it. In Chapter 1, the authors cover [Basic Data Types](https://infoactive.co/data-design/ch01.html). Here is a short summary of those basic data types:

### Nominal

* Non-numerical
* Usually, but not always, strings
* Non-ordered
* Cannot be averaged

In [None]:
states = ['Oregon', 'California', 'Texas', 'Colorado']
states

In [None]:
grocery_sections = ['produce', 'dairy', 'frozen']
grocery_sections

In [None]:
gender = ['male', 'female']
gender
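
To make the "non-ordered" and "cannot be averaged" points concrete, here is a minimal sketch of the operations that do make sense for nominal data: counting values and finding the most common one (the mode). The `observed_states` list is made up for illustration.

```python
from collections import Counter

# Hypothetical sample of nominal observations (not from the text above).
observed_states = ['Oregon', 'Texas', 'Oregon', 'Colorado', 'Oregon', 'Texas']

# Counting values and finding the mode are meaningful for nominal data.
counts = Counter(observed_states)
mode, mode_count = counts.most_common(1)[0]
print(counts)
print(mode, mode_count)

# Sorting works mechanically, but alphabetical order is a property of the
# labels, not of the states themselves. There is no "average state".
print(sorted(set(observed_states)))
```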

### Ordinal

* Non-numerical
* Usually, but not always, strings
* Natural ordering
* Sometimes can be averaged
* Can be assigned a numerical scale, but the scale will be arbitrary

In [None]:
response = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']
response

In [None]:
temp = ['cold', 'hot']
temp

In [None]:
height = ['short', 'medium', 'tall']
height
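
The point about assigning a numerical scale can be sketched as follows. The ranks 0 to 4 are one arbitrary choice among many, and the `answers` list is hypothetical. Sorting and taking the median rely only on the ordering, which is why they are safe for ordinal data.

```python
# The agreement scale, listed from lowest to highest.
scale = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Map each label to its rank. The ranks 0..4 are arbitrary: any increasing
# sequence of numbers would preserve the ordering equally well.
rank = {label: i for i, label in enumerate(scale)}

# Hypothetical survey answers, for illustration only.
answers = ['agree', 'neutral', 'strongly agree', 'disagree', 'agree']

# Sorting by rank respects the natural ordering; alphabetical sorting would not.
ordered = sorted(answers, key=rank.get)

# The median depends only on the ordering, not on the spacing between ranks.
median_answer = ordered[len(ordered) // 2]
print(ordered)
print(median_answer)
```

A mean of the ranks, by contrast, would change if a different (equally valid) numerical scale were chosen.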

### Interval

* Equally spaced numerical data
* Ordered
* Can either be discrete (int) or continuous (float)
* No meaningful zero point
* Examples:
  - Temperature in F or C
  - Dates/Times

In [None]:
temps = [32.1, 99.4, 210.0, -76.4]
temps
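
One way to see what "no meaningful zero point" implies is that ratios of interval data depend on the unit, while differences convert consistently. The sketch below uses the standard Celsius-to-Fahrenheit conversion; the two temperatures are arbitrary illustrative values.

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0

# Two illustrative temperatures in Celsius and their Fahrenheit equivalents.
a_c, b_c = 10.0, 20.0
a_f, b_f = c_to_f(a_c), c_to_f(b_c)

# The ratio changes with the unit, so "twice as hot" is not meaningful.
ratio_c = b_c / a_c
ratio_f = b_f / a_f
print(ratio_c, ratio_f)

# Differences are meaningful: they differ only by the 9/5 scale factor.
diff_c = b_c - a_c
diff_f = b_f - a_f
print(diff_c, diff_f)
```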

### Ratio

* Equally spaced, ordered numerical data
* Can either be discrete or continuous
* Meaningful zero point that indicates an absence of the measured entity
* Examples:
  - Age in years
  - Height in inches

In [None]:
import random

ages = [random.randint(0, 100) for _ in range(10)]
ages

In [None]:
height = [76.0*random.random() for i in range(10)]
height
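
Because ratio data has a meaningful zero, statements such as "twice as tall" survive a change of units. Here is a small sketch with two made-up heights, converted from inches to centimeters.

```python
import math

CM_PER_INCH = 2.54

# Two hypothetical heights in inches.
child_in, adult_in = 36.0, 72.0

# The ratio is the same whether it is computed in inches or centimeters.
ratio_in = adult_in / child_in
ratio_cm = (adult_in * CM_PER_INCH) / (child_in * CM_PER_INCH)
print(ratio_in, ratio_cm)
print(math.isclose(ratio_in, ratio_cm))
```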

### Categorical

* Data is labelled by well-separated categories
* Often used as an umbrella for nominal and ordinal, which are unordered and ordered categorical data types respectively.

## Take 2: Polaris, Tableau, d3/vega

The data visualization community has spent a lot of time thinking carefully about fundamental data types. There is a large body of research, along with software projects that encode the results of that research in usable forms. Good examples of this research and software are:

* [Polaris: A System for...](http://graphics.stanford.edu/papers/polaris_extended/polaris.pdf), C. Stolte, D. Tang and P. Hanrahan (2002).
* [Tableau](http://www.tableau.com/), Tableau Software, website (2016).
* [d3](http://d3js.org/), Data Driven Documents, website (2016).
* [Vega](http://vega.github.io/vega/), Vega: A Visualization Grammar, website (2016).
* [Vega-Lite](http://vega.github.io/vega-lite/), Vega-Lite: A High-Level Visualization Grammar, website (2016).
* [polestar](http://vega.github.io/polestar/), Polestar website (2016).

Here is a rough union of the different data types found in this body of work:

* Ordinal (same as above)
* Nominal (same as above)
* Quantitative (ratio, interval)
* Date/time (calendar dates and/or times)
* Geographic (states, latitude/longitude)

Some of these software packages also have a `text` data type that is meant for textual data that is not categorical.
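
As a rough illustration of how these types show up in practice, here is a sketch of a Vega-Lite-style chart specification written as a plain Python dictionary. The records and field names are hypothetical; the type strings (`nominal`, `ordinal`, `quantitative`, `temporal`) are the names Vega-Lite uses for the first four types in the list above.

```python
import json

# Hypothetical records, for illustration only.
values = [
    {'state': 'Oregon', 'rating': 'agree', 'age': 34, 'date': '2016-01-15'},
    {'state': 'Texas', 'rating': 'neutral', 'age': 61, 'date': '2016-02-03'},
]

# Each field is tagged with a data type; the field names are made up.
field_types = {
    'state': 'nominal',
    'rating': 'ordinal',
    'age': 'quantitative',
    'date': 'temporal',
}

# A minimal Vega-Lite-style spec: each encoding channel names a field and its type.
spec = {
    'data': {'values': values},
    'mark': 'point',
    'encoding': {
        'x': {'field': 'date', 'type': field_types['date']},
        'y': {'field': 'age', 'type': field_types['age']},
        'color': {'field': 'state', 'type': field_types['state']},
    },
}
print(json.dumps(spec, indent=2))
```

The declared type, rather than the underlying `dtype`, is what tells the software how to choose scales, axes and legends for a field.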

## Variables

* A **variable** is some quantity that is measured, such as "age"
* A single variable can be measured in different ways that give different data types:
  - "young" or "old" = ordinal
  - Age ranges (0-9, 10-19, ...) = ordinal
  - Age in years = ratio
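
The list above can be sketched in code: one variable, age, is recorded as ratio data and then re-expressed as two different ordinal types. The function names and the cutoff of 40 are assumptions made for this example.

```python
def age_range(age):
    """Label the decade an age falls in, e.g. 34 -> '30-39' (ordinal)."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def young_or_old(age, cutoff=40):
    """Two-level ordinal label; the cutoff of 40 is an arbitrary assumption."""
    return 'young' if age < cutoff else 'old'

# Hypothetical ages in years (ratio data).
ages_in_years = [7, 34, 61]
for age in ages_in_years:
    print(age, age_range(age), young_or_old(age))
```

Each conversion discards information: the age in years can be turned into a range, but a range cannot be turned back into an exact age.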

## Records and data sets

* A **record** or **sample** is one measurement of a set of variables
* A **data set** is a set of records that measure the same set of variables in the same way

In [None]:
ages = [random.randint(0,100) for i in range(10)]
heights = [76.0*random.random() for i in range(10)]
data_set = [{'age':a, 'height':h} for a, h in zip(ages, heights)]
data_set

In [None]:
sample0 = data_set[0]
sample0
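
The definition of a data set suggests a simple consistency check. The sketch below treats "measured in the same way" as "stored with the same Python type", which is a simplifying assumption; `is_data_set` and the sample records are hypothetical.

```python
# Hypothetical records with fixed values, so the result is reproducible.
records = [
    {'age': 34, 'height': 70.5},
    {'age': 61, 'height': 64.0},
    {'age': 7, 'height': 48.2},
]

def is_data_set(records):
    """Return True if every record has the same variables with the same types."""
    if not records:
        return True
    signature = {name: type(value) for name, value in records[0].items()}
    return all(
        {name: type(value) for name, value in record.items()} == signature
        for record in records
    )

print(is_data_set(records))

# A record that measures age as an ordinal label breaks the data set.
print(is_data_set(records + [{'age': 'old', 'height': 60.0}]))

# The same data viewed by variable (column) rather than by record (row).
columns = {name: [record[name] for record in records] for name in records[0]}
print(columns)
```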

## Resources

* [Data+Design](https://infoactive.co/data-design), Trina Chiasson, Dyanna Gregory, et al. (2016).