Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data Science and the Nature of Data

## Types of variables

Structured data begins with **measurements** of some type of thing in the real world, which we call a **variable**.
Consider the example of height. 
I may measure 10 people and find that their heights in centimeters are:

| Height |
|--------|
| 165    |
| 188    |
| 153    |
| 164    |
| 150    |
| 190    |
| 169    |
| 163    |
| 165    |
| 190    |

Each of these values (e.g. 165) is a measurement of the variable *height*.
We call *height* a variable because its value isn't constant.
If everyone in the world were the same height, we wouldn't call height a variable, and we also wouldn't bother measuring it, because we'd know everyone is the same.

Variables have different **types** that can affect your analysis.

### Nominal

A nominal variable consists of unordered categories, like *male* or *female* for biological sex.
Notice that these categories are not numbers, and there is no order to the categories.
We do not say that male comes before female or is smaller than female.

### Ordinal

Ordinal variables consist of ordered categories.
You can think of them as nominal variables, but with an ordering from first to last or smallest to largest.
A common example of ordinal data is a Likert question like:

```
(1) Strongly disagree
(2) Disagree
(3) Neither agree nor disagree
(4) Agree
(5) Strongly agree
```

Even though these options are numbered 1 to 5, those numbers only indicate which comes before the others, not how "big" an option is.
For example, we wouldn't say that the difference between *Agree* and *Disagree* is the same as the difference between *Neither agree nor disagree* and *Strongly agree*.
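To make the "order but not size" idea concrete, here is a minimal sketch in plain Python. The `LIKERT_LEVELS` list, the `rank` helper, and the sample responses are made up for illustration; they are not part of this notebook's dataset.

```python
# Likert options listed from lowest to highest.
# A response's position in this list encodes its rank, nothing more.
LIKERT_LEVELS = [
    "Strongly disagree",
    "Disagree",
    "Neither agree nor disagree",
    "Agree",
    "Strongly agree",
]


def rank(response):
    """Return the position of a response in the ordering (0 = lowest)."""
    return LIKERT_LEVELS.index(response)


# Hypothetical answers from three people
responses = ["Agree", "Strongly disagree", "Neither agree nor disagree"]

# Sorting and comparing are meaningful operations for ordinal data
ordered = sorted(responses, key=rank)
print(ordered)
print(rank("Agree") > rank("Disagree"))
```

Sorting and "comes before" comparisons make sense here, but subtracting two ranks would not, because the spacing between options is not defined.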

### Interval

Interval variables are ordered *and* their measurement scales are evenly spaced.
A classic example is temperature in Fahrenheit.
In degrees Fahrenheit, the difference between 70 and 71 is the same as the difference between 90 and 91; in either case, the difference is one degree.
The other key characteristic of interval variables, and the most confusing one, is that they don't have a meaningful zero value.
Degrees Fahrenheit is an example of this because there's nothing special about 0 degrees.
0 degrees doesn't mean there's no temperature or no heat energy; it's just an arbitrary point on the scale.

### Ratio

Ratio variables are like interval variables but with meaningful zeros.
Age and height are good examples because 0 age means you have no age, and 0 height means you have no height.
The name *ratio* reflects that you can form a ratio with these variables, which means that you can say age 20 is twice as old as age 10.
Notice you can't say that about degrees Fahrenheit: 100 degrees is not really twice as hot as 50 degrees, because 0 degrees Fahrenheit doesn't mean "no temperature."
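A short arithmetic sketch can illustrate why ratios only make sense with a meaningful zero. It converts Fahrenheit to Kelvin, a temperature scale whose zero is absolute zero, using the standard conversion formula. The function name `fahrenheit_to_kelvin` is our own choice for this illustration.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to Kelvin, a scale with a true zero."""
    return (deg_f - 32) * 5 / 9 + 273.15


# Ratio variable: dividing ages is meaningful
age_ratio = 20 / 10
print(age_ratio)  # 2.0, so age 20 really is twice age 10

# Interval variable: dividing raw Fahrenheit values is misleading
naive_ratio = 100 / 50
print(naive_ratio)  # 2.0, but this depends on where the scale puts its zero

# On a scale with a meaningful zero, the ratio is much closer to 1
kelvin_ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)
print(round(kelvin_ratio, 3))  # about 1.098
```

The same two temperatures give a ratio of 2 in Fahrenheit but only about 1.1 in Kelvin, which shows that the Fahrenheit ratio is an artifact of its arbitrary zero.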

## Tabular data

The most common type of structured data is **tabular data**, which is what you find in spreadsheets.
If you've ever used a spreadsheet, you know something about tabular data!

Here's an example of tabular data, with *height* in centimeters, *age* in years, and *weight* in kilograms:

| Height | Age | Weight |
|--------|-----|--------|
| 161    | 50  | 53     |
| 161    | 17  | 53     |
| 155    | 33  | 84     |
| 180    | 51  | 84     |
| 186    | 18  | 88     |

In tabular data like this, each **row** is a person.
More generically, we would say each row is an **observation** or **datapoint** (in statistics terminology) or an **item** (in machine learning terminology).
In each row, we have measurements for each of our variables for that particular person.
Since we have five rows of measurements, we know that there are five people in this dataset.

We can also think about tabular data in terms of **columns**.
Each column represents a variable, with the name of that variable in the **column header**.
For example, *height* is at the top of the first column and is the name of the variable for that column.
Importantly, the header is not an observation but rather a description of our data.
This is why we don't count the header when we are counting the rows in our data.

### Delimited tabular data - CSV and TSV

You are probably familiar with spreadsheet files, e.g. Microsoft Excel has files that end in `.xls` or `.xlsx`.
However, in data science, it is more common to have tabular data files that are **delimited**.
A delimited file is just a plain text file where column boundaries are represented by a specific character, usually a comma or a tab.

Here's what the data above looks like in **comma separated value (CSV)** form:

```
Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
```

and here's what the data looks like in **tab separated value (TSV)** form:

```
Height	Age	Weight
161	50	53
161	17	53
155	33	84
180	51	84
186	18	88
```

The choice of the delimiter (comma, tab, or something else) is really arbitrary, but it's always better to use a delimiter that doesn't appear in your data.
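As a rough illustration that CSV and TSV hold the same information, the sketch below parses both forms with Python's built-in `csv` module. The `parse_delimited` helper is a hypothetical name, and the text is typed in directly rather than read from a file.

```python
import csv
import io

# The same table as above, once with commas and once with tabs
csv_text = "Height,Age,Weight\n161,50,53\n161,17,53\n155,33,84\n180,51,84\n186,18,88\n"
tsv_text = csv_text.replace(",", "\t")


def parse_delimited(text, delimiter):
    """Split delimited text into a header row and a list of data rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # the first line names the variables
    rows = list(reader)    # every remaining line is one observation
    return header, rows


csv_header, csv_rows = parse_delimited(csv_text, ",")
tsv_header, tsv_rows = parse_delimited(tsv_text, "\t")

print(csv_header)
print(len(csv_rows))         # the header is not counted as a row
print(csv_rows == tsv_rows)  # same data, different delimiter
```

Note that the values come back as text (for example `"161"`); turning them into numbers is a separate step that dataframe libraries usually handle for you.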

## Dataframes

Data scientists often load tabular data into a **dataframe** that they can manipulate in a program.
In other words, tabular data from a file is brought into the computational notebook in a variable that represents rows, columns, header, etc., just as they are stored in the tabular data file.
Because dataframes match tabular data in files, they are very intuitive to work with, which may explain their popularity.

We're now at the practical portion of this notebook, so let's work with dataframes!

### Read CSV into a dataframe

First, we need to import a dataframe library called `pandas`.

**Follow the steps in the video below**

Once the code is in the Jupyter cell below, you must **execute** or **run** it by either pressing the &#9658; button at the top of the window or by pressing Shift + Enter on your keyboard.
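If the video isn't available, the cell presumably looks something like the sketch below. The alias `pd` is inferred from how the library is referred to in the rest of this notebook; the version printout is only an optional check that the import worked.

```python
# Import the pandas library and give it the short name "pd"
import pandas as pd

# Optional check: print the installed pandas version
print(pd.__version__)
```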

We can now do things with `pd`, like load datasets!

Our file is called `height-age-weight.csv` and it is in the `datasets` folder.
That means the **path** from this notebook (the one you're reading) to the data is `datasets/height-age-weight.csv`.

To read this file into a dataframe, we will use `pd`. 

**Follow the steps in the video below**

Execute or run the cell by pressing the &#9658; button.

When you run the cell, it will display the dataframe directly below it.
This is one of the nice things about Jupyter - **it will display the output of the last line of code in a cell**, even if the output is text, a table, or a plot.
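Here is a minimal sketch of reading CSV data with `pd.read_csv`. So that it runs anywhere, it reads from an in-memory copy of the data and assumes the file contains the same five rows shown earlier. In this notebook you would pass the path `"datasets/height-age-weight.csv"` instead.

```python
import io

import pandas as pd

# Stand-in for the contents of datasets/height-age-weight.csv
# (assumed to match the CSV example shown earlier)
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""

# With the real file, this would be:
#   pd.read_csv("datasets/height-age-weight.csv")
result = pd.read_csv(io.StringIO(csv_text))

# In Jupyter, ending a cell with an expression displays it;
# print() does roughly the same thing outside a notebook
print(result)
```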

Right now, we haven't actually stored the dataframe anywhere.
We used `pd` to read the csv file, and then Jupyter output that so we could see it.
But if we wanted to do anything with the dataframe, we'd have to read the file again.

Instead of reading the file every time we want to access the data, we can **store it in a variable**.
In other words, we will create a variable and set it to be the dataframe we created from the file.

**Follow the steps in the video below**

As always, you need to hit the &#9658; button or press Shift + Enter to run the code.

The output is the same as before - the only difference is that we've read the csv and stored the data in the `dataframe` variable, so we will use the `dataframe` variable whenever we want to work with the data.
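A sketch of this step, again using an in-memory stand-in for the file (assumed to hold the five rows shown earlier). The variable name `dataframe` comes from the text above.

```python
import io

import pandas as pd

# Stand-in for the contents of datasets/height-age-weight.csv (assumed)
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""

# Read the data once and keep it in a variable
# (with the real file: pd.read_csv("datasets/height-age-weight.csv"))
dataframe = pd.read_csv(io.StringIO(csv_text))

# The variable can now be reused without reading the file again
print(dataframe.shape)             # (number of rows, number of columns)
print(dataframe.columns.tolist())  # the column headers

# As the last line of a Jupyter cell, this displays the table
dataframe
```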

### Select rows from a dataframe by position

There are many things we can do with dataframes.
One thing we can do is get specific rows.

**Follow the steps in the video below**

*Then &#9658; or Shift + Enter*

As you can see, the output is only the first row of the dataframe.

Try it again in the cell below, but this time, change the `1` to a `2`.

*Then &#9658; or Shift + Enter*

Now the output is the first two rows of the dataframe.
We could get arbitrary rows of the dataframe by starting at a different number and ending at a different number.
Sometimes people call this a **slice**.
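The exact code from the video isn't reproduced here. The sketch below assumes the step uses Python's `start:end` slice notation on the `dataframe` variable, and it rebuilds the dataframe from an in-memory copy of the data (assumed to match the file) so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in for the contents of datasets/height-age-weight.csv (assumed)
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# Rows are counted from 0, and the end position is not included
first_row = dataframe[0:1]   # just the first row
first_two = dataframe[0:2]   # the first two rows
middle = dataframe[2:4]      # an arbitrary slice: the third and fourth rows

print(first_row)
print(first_two)
print(middle)
```

Because the end position is excluded, `0:1` gives one row and `0:2` gives two.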

### Select columns from a dataframe by name

We can get a column of the dataframe by using the name of the variable for that column.
Before we go any further, let's step back for a second to talk about **lists**.

We can think of a dataframe in two ways:

- A list of rows
- A list of columns

We just saw the list-of-rows view.
So why are columns any different?
The difference is that our columns have variable names, and we often want to refer to columns using those names.
For example, we want to say something like "give me the Age column" instead of "give me column 2."

Let's make a list from scratch to illustrate this.

**Follow the steps in the video below**

Now execute the cell (scroll up if you need a reminder how).

This is a list with one thing inside it, `"Height"`.
Lists can have multiple things inside them, making lists a container for other variables.

Let's use a list to get a column from the dataframe.

**Follow the steps in the video below**

And run it.

We can get more than one column by adding another element to the list. 

**Follow the steps in the video below**

And run the cell (try Shift + Enter if you haven't tried it yet).

To recap, dataframes are both lists of rows and lists of columns, and lists are themselves containers for other variables.
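The following sketch pulls these steps together, assuming the column-selection step passes a list of column names inside square brackets. The variable names `one_column` and `two_columns` are made up for illustration, and the data is an in-memory stand-in assumed to match the file.

```python
import io

import pandas as pd

# Stand-in for the contents of datasets/height-age-weight.csv (assumed)
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# A list with one thing inside it
one_column = ["Height"]
print(one_column)

# Use the list to get that column from the dataframe
height_only = dataframe[one_column]
print(height_only)

# Add another element to the list to get more than one column
two_columns = ["Height", "Age"]
height_and_age = dataframe[two_columns]
print(height_and_age)
```

Writing `dataframe[["Height", "Age"]]` in one line does the same thing; the outer brackets select from the dataframe and the inner brackets create the list.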

### Select rows from a dataframe by value

There are many, many things we can do with dataframes, but let's just talk about one more for now.

We can select rows based on a value in a particular column:

**Follow the steps in the video below**

Don't forget to run the cell!

The resulting column is `True` or `False` depending on whether the value of `Height` was above 161 or not (notice a few were exactly equal to 161, so they weren't greater).

What we're about to do next is magical.

**Follow the steps in the video below**

And run it!

The dataframe only kept the rows for which `Height` was > 161, i.e. those for which this was `True`.

Notice that this time we didn't put `Height` in a list. The row selection won't work as intended if we do.
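A minimal sketch of both steps, assuming the comparison is written as `dataframe["Height"] > 161`. The names `mask` and `tall` are hypothetical, and the data is an in-memory stand-in assumed to match the file.

```python
import io

import pandas as pd

# Stand-in for the contents of datasets/height-age-weight.csv (assumed)
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# Step 1: compare every value in the Height column to 161.
# The result is a column of True/False values, one per row.
mask = dataframe["Height"] > 161
print(mask)

# Step 2: use that True/False column to keep only the True rows
tall = dataframe[mask]
print(tall)
```

The two steps are often combined into a single line, `dataframe[dataframe["Height"] > 161]`.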
