Copyright 2022 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data Science and the Nature of Data

This notebook introduces some foundational concepts in data science.
As a result, this notebook will have more reading and fewer practical exercises than normal.
But don't worry, we'll have some practical exercises at the end.
<!-- for getting data from files. -->

We have organized this notebook around **big ideas** in data science.
You may wish to refer to this notebook throughout the course when these ideas come up.
It's OK if you don't completely understand them today.
Some of these ideas are quite subtle and take time to master.

Let's get started!

## Not all data is the same

When you hear people talk about "data," you may get the impression that all data is the same.
However, there are *many* different kinds of data.
Just like we can compare animals by how many legs they have, whether they have fur, or whether they have tails, we can compare data along various *dimensions* that affect how we work with the data and what kinds of conclusions we can draw from it.

### Is the data structured or unstructured?

One of the most basic properties of data is whether it is **structured.**
It might surprise you to hear that data can be unstructured; after all, why would someone collect data that wasn't structured?
And you'd be right: normally when people plan to collect data, they structure it.
Structure means that the data is organized and ready for analysis.
The most common kind of structure is **tabular data**, like you'd see in a spreadsheet.
We'll talk about this more in a little bit.
Databases are another common source of structured data.

Unstructured data usually comes about when the data collection wasn't planned or if people didn't know how to structure it in the first place.
For example, imagine that you have a million photographs - how would you structure them as data?
Textual data is another common example of unstructured data.
If textual data were structured, in the way we're talking about, we wouldn't need search engines like Google to find things for us!
When we work with unstructured data, we must take an extra step of structuring it somehow for our analysis, i.e. we have to turn unstructured data into structured data to do something with it.
Again, images, audio, and text are common examples that need this extra step.

### Is the data clean or dirty?

Another basic property of data is **cleanliness**.
If data is clean, then we don't need to correct it or process it to remove garbage values or correct noisy values.
Just like with structured data, you might expect all data to be clean. 
However, even carefully collected data can have problems that require correction before it can be used properly.
Dirty data is the norm for unplanned data collection.
So unplanned data collection is more likely to result in *both* unstructured data and dirty data.

There are many ways that data can be dirty, but to make the idea more concrete, let's consider a few examples.
Imagine that you are interested in the weather in your backyard, so you put out a battery operated thermometer that records the temperature every hour.
You then leave it there for a month.
Now imagine that it worked fine for the first two weeks, but since you didn't change the batteries, the measurements for the last two weeks become increasingly unreliable, e.g. reporting up to 10 degrees above or below the actual temperature, until it finally shuts off leaving you with no data for the remaining days.
This kind of problem, an **instrument failure** leading to **unreliable measurement**, is actually quite common and can take a lot of planning to avoid.
<!-- Another example of dirty data is at the recording stage. 
Imagine a computer is recording audio data by writing it to the hard disk.
If the computer suddenly becomes active with another task (say streaming a video or installing an operating system update) the audio data may "glitch". -->

While *very* dirty data is usually obvious, it can sometimes be hard to recognize. 
For this reason, it is important to check your data for problems (e.g. crazy values, like a person being 1 ft tall or 200 years old) and think very seriously about how problems should be corrected.
Data cleaning is such a tricky topic that we will delay it until much later in the course.
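
Checking for implausible values (as opposed to correcting them) can still be sketched briefly. The example below is only an illustration: the `people` data are made up, and the cutoffs of 50 cm and 120 years are assumed plausibility limits, not established rules.

```r
# Made-up records for illustration; row 2 is about 1 ft tall, row 3 is 200 years old
people <- data.frame(
  height_cm = c(165, 30, 188, 153),
  age_years = c(34, 27, 200, 51)
)

# Assumed plausibility limits: flag heights under 50 cm or ages over 120 years
suspicious <- people[people$height_cm < 50 | people$age_years > 120, ]

print(suspicious)
```

A check like this only flags rows for a closer look; deciding what to do with them is the hard part.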

### Is the data experimental or nonexperimental?

The last major dimension of data we'll talk about is whether the data came from an **experiment** or not.
Why is this important?
Knowing whether the data came from an experiment is important because it tells you if you can draw causal conclusions from it.
When we talk about experiments here, what we mean are randomized controlled trials (RCTs) or an equivalent method of constructing a counterfactual.
The basic idea with an RCT is that you **randomly** assign what you are studying (e.g. people, animals, etc.) into **two or more groups**.
In one of those groups, you do nothing - this is the control group.
In one of the other groups, you **do something** to what you are studying - this is the treatment group.
After the experiment, you can see what happened when you **did something** by comparing the treatment group to the control group.
Because random assignment makes the two groups alike, on average, in every other respect, you know that any differences are a result of what you did.
This is why experiments allow you to draw causal conclusions - **because you only changed one thing, you know that change caused the difference you see.**

Let's take a common example, vaccines.
To discover if a vaccine against coronavirus is effective, I would randomly assign people to two groups.
The treatment group would receive the vaccine, and the control group wouldn't.
I would then follow up with both groups 1-2 months later and see which of them had gotten sick and which hadn't.
If there was no difference in illness between the two groups, I'd say that the vaccine had no effect.
Otherwise, I'd say the vaccine had some effectiveness.
There's some subtlety we're skipping over here about *reliable differences*, but this is the basic idea.

Now let's take an example of a non-experimental study.
Some researchers sent a survey to a million people in Europe and asked them how much coffee they drank and how old they were.
After analyzing the results, the researchers found that older people drank more coffee.
Can we infer that coffee makes people live longer?
No, we can't, because we have no control group to compare to.
Without that control group, there are many other reasons that older people could be drinking more coffee.
It could be that coffee is less popular now than 10 years ago, it could be that older people have more money to spend on coffee, or it could be some other reason we haven't thought of yet.
When we have a non-experimental result like this, we have to be **very** careful about interpretation. 
The best we can say is that there seems to be an **association** between drinking coffee and being older, but we can't say what the cause is.
We'll talk more about associations like this later on in the course.

## Different questions need different types of data

As you might expect, there are many different things data can tell us.
However, what data can tell us is limited by the type of data we have.
We can think of three kinds of questions we can ask of data in terms of *levels*, as shown in the image below.

<!-- Andrew Olney made this picture -->
<!-- Too big, resizing with html -->
<!-- ![Screenshot_2020-05-30_11-47-06.png](attachment:Screenshot_2020-05-30_11-47-06.png) -->
<div>
<img src="attachment:Screenshot_2020-05-30_11-47-06.png" width="300"/>
</div>

The most basic questions are descriptive questions.
For example, we could have a dataset containing the heights of everyone in the U.S., and we could ask descriptive questions like:

- How tall is the tallest person?
- How short is the shortest person?
- What is the average height?

Descriptive questions highlight a particular data point (like the tallest height) or summarize multiple data points (like the average height).
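
As a minimal sketch, each of the three descriptive questions above maps onto a base R function. The `heights_cm` vector is a small made-up sample used only for illustration, not a real dataset of U.S. heights.

```r
# Hypothetical sample of heights in centimeters (illustrative values only)
heights_cm <- c(165, 188, 153, 164, 150)

tallest  <- max(heights_cm)   # How tall is the tallest person?
shortest <- min(heights_cm)   # How short is the shortest person?
average  <- mean(heights_cm)  # What is the average height?

print(c(tallest = tallest, shortest = shortest, average = average))
```

The first two answers pick out a single data point, while the average summarizes all of them.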

The next level above description is prediction.
In prediction, we have at least two kinds of data, or variables, where we can predict one using the other.
For example, we can look at age and height, and predict height based on age.
Clearly there is a strong relationship between age and height, at least up to the teenage years, when people tend to stop growing.
We can ask descriptive questions about both age and height, and once we bring them together, we can further ask predictive questions like:

- How much taller does someone grow in a year?
- If someone is 10 years old, how tall do we expect them to be?
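
One way to sketch these two predictive questions in R is with a straight-line fit using `lm`. Everything about the data here is an assumption made for illustration: the ages are made up, and the heights are constructed as exactly `80 + 6 * age` so that the answers are easy to check. Real growth data would be noisier and would level off in the teenage years.

```r
# Made-up ages (years) for illustration; heights (cm) are constructed as
# 80 + 6 * age, so the fitted line is known in advance
children <- data.frame(age = c(4, 6, 8, 10, 12))
children$height <- 80 + 6 * children$age

# Fit a straight line that predicts height from age
model <- lm(height ~ age, data = children)

# "How much taller does someone grow in a year?" is the slope for age
growth_per_year <- unname(coef(model)["age"])

# "If someone is 10 years old, how tall do we expect them to be?"
expected_at_10 <- unname(predict(model, newdata = data.frame(age = 10)))

print(c(growth_per_year = growth_per_year, expected_at_10 = expected_at_10))
```

Note that this answers a predictive question only; it says nothing about *why* height changes with age.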

The final level is explaining. 
In explaining, we want to understand causality, which we just talked about with respect to experimental data.
Going back to our example, we want to know *why* people get taller as they age, not just that they do.
To know why, we need to perform an experiment.
Explaining is really important and is the main focus of science.
However, it's important to appreciate that describing and predicting are also really important, and that they may be the only questions you are interested in.
For example, if I want to increase my coffee sales, I don't need to understand why older people drink more coffee, I just need to know that they do.
Then I can target my marketing to older age groups and expect to increase my profits.

## Limits of analysis

As you progress through this course, you will feel empowered by the models you can build and the questions you can answer. 
Therefore, it's important for you to understand that none of your models will be 100% correct, for reasons we'll discuss next.
This idea has been famously expressed as:

> Essentially, all models are wrong, but some are useful.
>
> &mdash; <cite>Box & Draper (1987), p. 424</cite>

There are many ways for models to be wrong, but we'll focus on three common ways to illustrate this concept.

### Missing variables and misspecified models

Let's return to the example of growing taller with age.
If we collected a lot of data, we'd see this is a pretty strong relationship.
However, is it the case that there are no other variables that determine height?
Thinking about it more, we realize that nutrition is also an important factor. 
Are there other important factors?
It turns out that air pollution is associated with stunted growth.
We could go on and on here, but the basic idea is this: you may have identified some of the important variables in your model, and you may have identified the most important ones, but it is unlikely that you've identified *all* of them.

### Measurement error

Even if your model is perfectly specified, your data might be subject to measurement error.
For example, let's say I'm interested in how many squirrels get run over in December vs. June.
I might send out teams of students to walk up and down streets looking for dead squirrels.
Some people on those teams might be very diligent and accurately count the squirrels, but others may not pay as much attention and only count about half of them.
As a result, my model will be based on inaccurate data, which may lead me to draw the wrong conclusion.
Almost all data has *some* measurement error, so this can be a real issue.

### Generalization

Finally, my model may be specified well, and my data may be free of measurement error, but I may not be **sampling** my data in a way that allows for **generalization**.
Suppose I'm trying to predict the outcome of the next election with survey data, and I only send surveys to farmers in Iowa.
Will that help me predict how people in Chicago will vote? 
Or the U.S. as a whole?
Probably not, because I have not captured the diversity of the U.S. in my sample -- I've only captured one occupation in one area of the country.
If you want your model to generalize to new situations, which we almost always want, it's important to think about whether your data captures the complexity and diversity of the real world or only a small slice of it.

## Types of variables

We've talked about structured vs. unstructured data already, but we haven't gone into detail about how structured data is created.
Structured data begins with **measurements** of some type of thing in the real world, which we call a **variable**.
Let's return to the example of height. 
I may measure 10 people and find that their heights in centimeters are:

| Height |
|--------|
| 165    |
| 188    |
| 153    |
| 164    |
| 150    |
| 190    |
| 169    |
| 163    |
| 165    |
| 190    |

Each of these values (e.g. 165) is a measurement of the variable *height*.
We call *height* a variable because its value isn't constant.
If everyone in the world were the same height, we wouldn't call height a variable, and we also wouldn't bother measuring it, because we'd know everyone is the same.

Variables have different **types** that can affect your analysis.

### Nominal

A nominal variable consists of unordered categories, like *male* or *female* for biological sex.
Notice that these categories are not numbers, and there is no order to the categories.
We do not say that male comes before female or is smaller than female.

### Ordinal

Ordinal variables consist of ordered categories.
You can think of it as nominal data but with an ordering from first to last or smallest to largest.
A common example of ordinal data is Likert questions like:

```
(1) Strongly disagree
(2) Disagree
(3) Neither agree nor disagree
(4) Agree
(5) Strongly agree
```

Even though these options are numbered 1 to 5, those numbers only indicate which comes before the others, not how "big" an option is.
For example, we wouldn't say that the difference between *Agree*  and *Disagree* is the same as the difference between *Neither agree nor disagree* and *Strongly agree*.
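
In R, one common way to represent ordered categories is an ordered factor. The sketch below assumes a handful of made-up survey responses; it shows that order comparisons are meaningful for ordinal data even though the categories are not numbers.

```r
# The five Likert options, listed from lowest to highest
likert_levels <- c("Strongly disagree", "Disagree",
                   "Neither agree nor disagree", "Agree", "Strongly agree")

# Hypothetical survey responses stored as an ordered factor
responses <- factor(c("Agree", "Disagree", "Strongly agree", "Agree"),
                    levels = likert_levels, ordered = TRUE)

# Order comparisons work: which responses rank above "Disagree"?
above_disagree <- responses > "Disagree"
print(above_disagree)

# So does finding the highest-ranked response
highest <- max(responses)
print(highest)
```

Leaving out `ordered = TRUE` gives a plain factor, which is how a nominal variable would typically be stored; comparisons like `>` are then not meaningful.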

### Interval

Interval variables are ordered *and* their measurement scales are evenly spaced.
A classic example is temperature in Fahrenheit.
In degrees Fahrenheit, the difference between 70 and 71 is the same as the difference between 90 and 91 - either case is one degree.
The other most important characteristic of interval variables is also the most confusing one, which is that interval variables don't have a meaningful zero value.
Degrees Fahrenheit is an example of this because there's nothing special about 0 degrees. 
0 degrees doesn't mean there's no temperature or no heat energy, it's just an arbitrary point on the scale.

### Ratio

Ratio variables are like interval variables but with meaningful zeros.
Age and height are good examples because 0 age means you have no age, and 0 height means you have no height.
The name *ratio* reflects that you can form a ratio with these variables, which means that you can say age 20 is twice as old as age 10.
Notice you can't say that about degrees Fahrenheit: 100 degrees is not really twice as hot as 50 degrees, because 0 degrees Fahrenheit doesn't mean "no temperature."
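
A quick calculation makes this concrete. The sketch below converts Fahrenheit to kelvins, a temperature scale whose zero is absolute zero, and compares the two ratios; the helper name `fahrenheit_to_kelvin` is our own, introduced only for this illustration.

```r
# Convert degrees Fahrenheit to kelvins (zero kelvins is absolute zero)
fahrenheit_to_kelvin <- function(f) (f - 32) * 5 / 9 + 273.15

# The ratio we might be tempted to compute on the interval scale
naive_ratio <- 100 / 50

# The ratio on a scale with a meaningful zero
kelvin_ratio <- fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)

print(naive_ratio)             # 2
print(round(kelvin_ratio, 2))  # about 1.1
```

On a scale with a meaningful zero, 100 degrees Fahrenheit turns out to be only about 10% "hotter" than 50 degrees, not twice as hot.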

## Measurement

We previously said that structured data begins with measurement of a variable, but we haven't explained what measurement really is.
Measurement is, quite simply, the assignment of a value to a variable.
In the context of a categorical variable like biological sex, we would say the assignment of *male* or *female* is a measurement.
Similarly for height, we would say that 180 cm is a measurement.
Notice that in these two examples, the measurement depends closely on the type of variable (e.g. categorical or ratio).

How we measure is tightly connected to how we've defined the variable.
This makes sense, because our measurements serve as a way of defining the variable.
For some variables, this is more obvious than for other variables.
For example, we all know what *length* is. 
It is a measure of distance that we can see with our eyes, and we can measure it in different units like centimeters or inches.
However, some variables are not as obvious, like *justice*.
How do we measure *justice*?
One way would be to ask people, e.g. to ask them how just or unjust they thought a situation was.
There are two problems with this approach.
First, different people will tell you different things.
Second, you may not really be measuring *justice* when you ask this question; you could end up measuring something else by accident, like people's religious beliefs.

When we talk about measurement, especially of things we can't directly observe, there are two important properties of measurement that we want, **validity** and **reliability**.
The picture below presents a conceptual illustration of these ideas using a target.

<!-- Attribution: © Nevit Dilmen -->
<!-- https://commons.wikimedia.org/wiki/File:Reliability_and_validity.svg -->
![image.png](attachment:image.png)

Simply stated, **validity means we are measuring what we intend to measure**.
In the images, validity is being "on target," so that our measurements are *centered* on what we are trying to measure.

In contrast, **reliability means our measurements are consistent**. 
Our measurements could be consistently wrong, which would make them reliable but not valid (lower left).
Ideally, our measurements will be both valid and reliable (lower right).

When it comes to validity and reliability, the most important thing to understand is that **validity is not optional.**
If you don't have validity, your variable is wrong - you're not measuring what you think you're measuring.
Reliability is optional to a certain extent, but if the reliability is very low, we won't be able to get much information out of the variable.

## Tabular data

The most common type of structured data is **tabular data**, which is what you find in spreadsheets.
If you've ever used a spreadsheet, you know something about tabular data!

Here's an example of tabular data, with *height* in centimeters, *age* in years, and *weight* in kilograms:

| Height | Age | Weight |
|--------|-----|--------|
| 161    | 50  | 53     |
| 161    | 17  | 53     |
| 155    | 33  | 84     |
| 180    | 51  | 84     |
| 186    | 18  | 88     |

In tabular data like this, each **row** is a person.
More generically, we would say each row is an **observation** or **datapoint** (in statistics terminology) or an **item** (in machine learning terminology).
In each row, we have measurements for each of our variables for that particular person.
Since we have five rows of measurements, we know that there are five people in this dataset.

We can also think about tabular data in terms of **columns**.
Each column represents a variable, with the name of that variable in the **column header**.
For example, *height* is at the top of the first column and is the name of the variable for that column.
Importantly, the header is not an observation but rather a description of our data.
This is why we don't count the header when we are counting the rows in our data.

### Delimited tabular data - CSV and TSV

You are probably familiar with spreadsheet files, e.g. Microsoft Excel has files that end in `.xls` or `.xlsx`.
However, in data science, it is more common to have tabular data files that are **delimited**.
A delimited file is just a plain text file where column boundaries are represented by a specific character, usually a comma or a tab.

Here's what the data above looks like in **comma separated value (CSV)** form:

```
Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
```

and here's what the data looks like in **tab separated value (TSV)** form:

```
Height	Age	Weight
161	50	53
161	17	53
155	33	84
180	51	84
186	18	88
```

The choice of the delimiter (comma, tab, or something else) is really arbitrary, but it's always better to use a delimiter that doesn't appear in your data.
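
To see that the delimiter really is arbitrary, here is a minimal sketch that parses the same two rows written three ways. It assumes `readr` version 2.0 or later, where wrapping a string in `I()` tells `readr` to treat it as literal data instead of a file path; the semicolon-delimited text is our own variation for illustration.

```r
library(readr)

# The same two observations written with three different delimiters
csv_text  <- "Height,Age,Weight\n161,50,53\n161,17,53\n"
tsv_text  <- "Height\tAge\tWeight\n161\t50\t53\n161\t17\t53\n"
semi_text <- "Height;Age;Weight\n161;50;53\n161;17;53\n"

# I() marks each string as literal data rather than a path to a file
from_csv  <- read_csv(I(csv_text), show_col_types = FALSE)
from_tsv  <- read_tsv(I(tsv_text), show_col_types = FALSE)
from_semi <- read_delim(I(semi_text), delim = ";", show_col_types = FALSE)

# All three parse to the same table
print(all(from_csv == from_tsv))
print(all(from_csv == from_semi))
```

Once the file is parsed, the delimiter is gone; only the rows and columns remain.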

### Dataframes

Data scientists often load tabular data into a **dataframe** that they can manipulate in a program.
In other words, tabular data from a file is brought into the computational notebook in a variable that represents rows, columns, header, etc., just like they are stored in the tabular data file.
Because dataframes match tabular data in files, they are very intuitive to work with, which may explain their popularity.

We're now at the practical portion of this notebook, so let's work with dataframes!

**If you haven't seen a demonstration of Blockly, [see this short video tutorial](https://youtu.be/ovCJln08mG8?vq=hd720) or [this long video tutorial](https://youtu.be/-luPzplPDI0?vq=hd720).**

#### Read CSV into dataframe

First, let's read a CSV file into a dataframe.
To do that, we need to import a library for reading files called `readr`.
**If it isn't already open**, open up the Blockly extension by clicking on the painter's palette icon, then clicking on `Blockly R`.

![image.png](attachment:image.png)

Using the IMPORT menu in the Blockly palette, click on an import block `library some library`:

![image.png](attachment:image.png)

When you click on the block, it drops onto the Blockly workspace.
Change `some library` to `readr`: click on the `some library` dropdown, choose `Rename variable...`, and type `readr` into the box that pops up.
This imports the `readr` library and gives it the variable name, or alias, `readr`.

In the future, we will abbreviate these steps as:

- `library readr`

Make sure the code cell below is selected (has a blue bar next to it) and press the `Blocks to Code` button below the Blockly workspace.
This will insert the code corresponding to the blocks into the **active cell** in Jupyter, which is the cell that has a blue bar next to it.

Once the code appears in that Jupyter cell, you must **execute** or **run** it by either pressing the &#9658; button at the top of the window or by pressing Shift + Enter on your keyboard.

```r
library(readr)
```

We can now do things with `readr`, like load datasets!

Our file is called `height-age-weight.csv` and it is in the `datasets` folder.
That means the **path** from this notebook (the one you're reading) to the data is `datasets/height-age-weight.csv`.

To read this file into a dataframe, we will use `readr`. 
Go to the VARIABLES menu in the Blockly palette and click on the `with readr do ...` block.

![image.png](attachment:image.png)

After it drops into the Blockly workspace, wait until the dropdown stops loading, and then click on it and select `read_csv`.
You can also start typing `read_csv` to narrow the dropdown to matching options.
Then get a `" "` block from TEXT, drop it on the workspace, drag it to the `using` part of the first block, and type the file path `datasets/height-age-weight.csv` into it.
Your blocks should look like this:

![image.png](attachment:image.png)

Make sure the cell below is selected, then press `Blocks to Code`, and execute the cell to run the code by pressing the &#9658; button.

In the future, we will abbreviate these steps as:

- `with readr do read_csv using "datasets/height-age-weight.csv"`

```r
readr::read_csv("datasets/height-age-weight.csv")
```

```
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): Height, Age, Weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

| Height  | Age     | Weight  |
|---------|---------|---------|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 161     | 50      | 53      |
| 161     | 17      | 53      |
| 155     | 33      | 84      |
| 180     | 51      | 84      |
| 186     | 18      | 88      |


When you run the cell, it will display some information and then the dataframe directly below it.
This is one of the nice things about Jupyter - **it will display the output of the last line of code in a cell**, even if the output is text, a table, or a plot.

Right now, we haven't actually stored the dataframe anywhere.
We used `readr` to read the csv file, and then Jupyter output that so we could see it.
But if we wanted to do anything with the dataframe, we'd have to read the file again.

Instead of reading the file every time we want to access the data, we can **store it in a variable**.
In other words, we will create a variable and set it to be the dataframe we created from the file.

Using the VARIABLES menu in the Blockly palette, click on `Create variable...` and type `dataframe` into the pop up window.
Then click on the `set dataframe to` block so that your blocks below look like this:

![image.png](attachment:image.png)

Then go get the same blocks you used before to read the file and connect them to the `set dataframe to` block.
You can do this from scratch or you can use the following procedure:

- Click the code cell below and press `Blocks to Code` to save your intermediate work (the `set dataframe to` block)
- Go back to the previous cell, click on the block you want, and copy it using Ctrl+c
- Click on the code cell below to select it, click the Blockly workspace, and paste the block using Ctrl+v

*Tip: If you don't save your intermediate work, you'll lose it because `Notebook Sync` will clear the Blockly workspace when it loads the blocks in the previous cell.*

After you've added the blocks to read the dataframe, drop a variable block for `dataframe` underneath it to display the dataframe.
The result should look like this:

![image.png](attachment:image.png)

In the future, we will abbreviate these steps as:

- Create `dataframe` and set it to `with readr do read_csv using "datasets/height-age-weight.csv"`
- `dataframe`

As always, you need to hit the &#9658; button or press Shift + Enter to run the code.

```r
dataframe = readr::read_csv("datasets/height-age-weight.csv")

dataframe
```

```
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): Height, Age, Weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

| Height  | Age     | Weight  |
|---------|---------|---------|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 161     | 50      | 53      |
| 161     | 17      | 53      |
| 155     | 33      | 84      |
| 180     | 51      | 84      |
| 186     | 18      | 88      |


You should see the same output as before - the only difference is that we've read the csv and stored the data into the `dataframe` block, so we will use the `dataframe` block whenever we want to work with the data.
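
Once the data is stored in a variable, we can ask questions about it without reading the file again. The sketch below is self-contained, so it builds `dataframe` from an inline copy of the CSV text shown earlier (a stand-in for `datasets/height-age-weight.csv`); in the notebook you would simply reuse the `dataframe` you already created. It assumes `readr` 2.0 or later for `I()` and `show_col_types`.

```r
library(readr)

# Stand-in for datasets/height-age-weight.csv so this sketch runs on its own
csv_text <- "Height,Age,Weight\n161,50,53\n161,17,53\n155,33,84\n180,51,84\n186,18,88\n"

# show_col_types = FALSE quiets the column specification message
dataframe <- read_csv(I(csv_text), show_col_types = FALSE)

n_rows    <- nrow(dataframe)   # number of observations (the header is not counted)
n_cols    <- ncol(dataframe)   # number of variables
col_names <- names(dataframe)  # the column headers

print(n_rows)
print(n_cols)
print(col_names)
```

The row count matches the five people in the table, which confirms that the header line was treated as a description of the data and not as an observation.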

#### Dataframes as a list of rows

There are many things we can do with dataframes.
One thing we can do is get specific rows, which are our datapoints.
We can manipulate dataframes easily using another library called `dplyr`, so let's load it first:

- `library dplyr`

*Then &#9658; or Shift + Enter*

```r
library(dplyr)
```

```
“package ‘dplyr’ was built under R version 4.1.3”

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
```




Don't worry too much about the messages displayed at this point.

Let's get the first row of the dataframe.
We can do that using the `slice` function:

- with `dplyr` do `slice` using `dataframe` and `1` 

To get an extra slot for `1`, use the `+` button on the block.
You can get a `1` block by getting a number block `123` from MATH and changing the value of the number block to `1`.

Your blocks should look like this:

![image.png](attachment:image.png)

*Make sure the code cell below is selected, then &#9658; or Shift + Enter*

```r
dplyr::slice(dataframe,1)
```

| Height  | Age     | Weight  |
|---------|---------|---------|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 161     | 50      | 53      |


As you can see, the output is only the first row of the dataframe.

Try it again (i.e. copy the blocks, select the cell below, and paste the blocks in the Blockly workspace), but this time, change the `1` to a `1:2`:

- with `dplyr` do `slice` using `dataframe` and `1:2` 

To make a `1:2` block, use a block from the `FREESTYLE` category.
You can read `1:2` as "from 1 to 2".

*Then &#9658; or Shift + Enter*

```r
dplyr::slice(dataframe,1:2)
```

| Height  | Age     | Weight  |
|---------|---------|---------|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 161     | 50      | 53      |
| 161     | 17      | 53      |


Now the output is the first two rows of the dataframe.
We could get arbitrary rows of the dataframe by starting at a different number and ending at a different number.
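
For instance, here is a minimal sketch of getting rows 2 through 4, and of getting rows that are not next to each other. To keep it self-contained, it rebuilds `dataframe` from an inline copy of the CSV text (assuming `readr` 2.0 or later); in the notebook you would use the existing `dataframe`.

```r
library(readr)
library(dplyr)

# Stand-in for datasets/height-age-weight.csv so this sketch runs on its own
csv_text <- "Height,Age,Weight\n161,50,53\n161,17,53\n155,33,84\n180,51,84\n186,18,88\n"
dataframe <- read_csv(I(csv_text), show_col_types = FALSE)

# Rows 2 through 4: read 2:4 as "from 2 to 4"
middle_rows <- dplyr::slice(dataframe, 2:4)

# Rows that are not adjacent: c() combines individual row numbers
picked_rows <- dplyr::slice(dataframe, c(1, 5))

print(middle_rows)
print(picked_rows)
```

In Blockly, expressions like `2:4` or `c(1, 5)` can be typed into a freestyle block, just like `1:2` above.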

#### Dataframes as a list of columns

Similarly, we can get a column of the dataframe by using the name of that column in a freestyle block.
The name must exactly match the spelling and case of the column:

- with `dplyr` do `select` using `dataframe` and `Height` 

And run it.

```r
dplyr::select(dataframe,Height)
```

| Height  |
|---------|
| `<dbl>` |
| 161     |
| 161     |
| 155     |
| 180     |
| 186     |


Just like before when we got more than one row, we can get more than one column:

- with `dplyr` do `select` using `dataframe` and `Height:Age`

And run the cell (try Shift + Enter if you haven't tried it yet).

In [18]:
dplyr::select(dataframe,Height:Age)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="LiPrc==C!jd{;fWA-(}6">dplyr</variable><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="varDoMethod_R" id="eGwb`#8yU[:`w{muZddM" x="8" y="176"><mutation items="2"></mutation><field name="VAR" id="LiPrc==C!jd{;fWA-(}6">dplyr</field><field name="MEMBER">select</field><data>dplyr:select</data><value name="ADD0"><block type="variables_get" id="k(}Ej}(Zx@:mO9D!9z~#"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="BOQ$uM/td%JJik~_Lc(X"><field name="CODE">Height:Age</field></block></value></block></xml>

Height,Age
<dbl>,<dbl>
161,50
161,17
155,33
180,51
186,18


If we wanted a column here and a column there rather than a range of columns, we could use `and` to give the name of each column we want, one after another.
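
In plain R, each extra column name simply becomes another argument to `select`.
The sketch below assumes a stand-in dataframe called `first_two` (a made-up name) holding only the two rows shown earlier in this notebook, and picks `Height` and `Weight` while skipping `Age`.

```r
# Stand-in holding the first two rows shown earlier; the name `first_two` is hypothetical.
first_two <- data.frame(
  Height = c(161, 161),
  Age    = c(50, 17),
  Weight = c(53, 53)
)

# List the columns we want by name; Age is left out
height_weight <- dplyr::select(first_two, Height, Weight)
print(height_weight)
```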

To recap, dataframes are both lists of rows and lists of columns.
Whether we treat a dataframe as a list of rows or list of columns depends on what we want to do.
If we want to select datapoints (observations), then we treat it as a list of rows, because each row is a datapoint.
In our dataset above, this would be like selecting the people in the dataset we want to analyze, since each row is a person.
In contrast, if we want to select variables, then we treat the dataframe like a list of columns.

#### Dataframes and types of variables

Earlier we talked about four different kinds of variables: nominal, ordinal, interval, and ratio.
These are really important to know, because many kinds of analysis are only valid on particular types of variable.

Does `readr` take care of this for us?
Let's find out!
We can use the `spec` function to give us the specifications of the dataframe:

- with `readr` do `spec` using `dataframe`

And run it.

In [20]:
readr::spec(dataframe)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="(cA1)X2lCPQio$W{:j4y">readr</variable><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="varDoMethod_R" id="NBbE2M,V(Js`SQt{!7A6" x="-77" y="164"><mutation items="1"></mutation><field name="VAR" id="(cA1)X2lCPQio$W{:j4y">readr</field><field name="MEMBER">spec</field><data>readr:spec</data><value name="ADD0"><block type="variables_get" id="{sNlqY/W($*30Q^l00O@"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></value></block></xml>

cols(
  Height = col_double(),
  Age = col_double(),
  Weight = col_double()
)

The output tells us the data type of each variable, in this case `double`.
This is just one way the computer can store information.
Some of the common ways are:

- logical: TRUE or FALSE
- integer: a whole number (no decimal part)
- double: a floating point value (can have a decimal part)
- character: a text string
- factor: a nominal value
- ordered: an ordinal value
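
If you'd like to see these types for yourself, the sketch below asks R what type a few example values have.
The values and variable names are made up for illustration; `typeof` and `class` are base R functions.

```r
# Basic storage types
logical_type   <- typeof(TRUE)    # "logical"
integer_type   <- typeof(2L)      # "integer" (the L suffix asks for an integer)
double_type    <- typeof(2.5)     # "double"
plain_number   <- typeof(2)       # "double": a number typed without L is a double
character_type <- typeof("two")   # "character"

# Factors carry category information on top of their stored values
nominal_class <- class(factor(c("red", "blue")))   # "factor"
ordinal_class <- class(
  factor(c("low", "high"), levels = c("low", "high"), ordered = TRUE)
)                                                   # "ordered" "factor"

print(nominal_class)
print(ordinal_class)
```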

As you can see, data types don't line up exactly with nominal, ordinal, interval, and ratio types of variables.
Additionally, when you read in a file, `readr` will guess the types of each column based on the values in your file.

`readr` will never guess factor or ordered.
Instead, `readr` will interpret text categories as `character` (and categories coded as numbers as numbers), so if you want a column to be nominal or ordinal, you have to give `readr` special instructions.
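
One way to give those special instructions is the `col_types` argument of `read_csv`, using `cols` and `col_factor` from `readr`.
The sketch below is only an illustration: the file, the column names `Color` and `Size`, and their values are all made up and are not part of this notebook's dataset.

```r
# Hypothetical data, written to a temporary file so the example runs on its own
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Color,Size", "red,small", "blue,large", "red,medium"), csv_path)

# Default behavior: readr guesses, and both columns come back as character
guessed <- suppressMessages(readr::read_csv(csv_path))

# Special instructions: Color is nominal (factor), Size is ordinal (ordered factor)
typed <- readr::read_csv(
  csv_path,
  col_types = readr::cols(
    Color = readr::col_factor(levels = c("red", "blue")),
    Size  = readr::col_factor(levels = c("small", "medium", "large"), ordered = TRUE)
  )
)

print(class(guessed$Size))
print(class(typed$Size))
```

Because `Size` is ordered, comparisons such as `typed$Size[1] < typed$Size[2]` follow the order of the levels we gave.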

There's no explicit representation of ratio or interval either.
Both are simply stored as integer or double.
What this means in practice is that we have to be vigilant and keep track of the type of variable ourselves, because `readr` won't automatically do it for us.
That means by default `readr` will let us do things with our data that don't make sense, so watch out!
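
Here is a small sketch of what that can look like.
The data is made up for illustration: jersey numbers are labels (nominal), but because they look like numbers, `readr` reads them as numbers, and R will happily average them.

```r
# Hypothetical roster; the file, column names, and values are invented for this example
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Player,JerseyNumber", "A,10", "B,23", "C,7"), csv_path)

roster <- suppressMessages(readr::read_csv(csv_path))

# readr treated the labels as numbers
jersey_is_numeric <- is.numeric(roster$JerseyNumber)

# R computes this without complaint (about 13.3), but an "average jersey number" means nothing
meaningless_mean <- mean(roster$JerseyNumber)
print(meaningless_mean)
```

Nothing in the output warns us that the result is meaningless, which is why we have to keep track of the type of variable ourselves.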
