# Rows and Observational Units

Recall that the rows of a tabular data set represent observations (or records). Whenever you encounter a new (tabular) data
set, the first question you should ask yourself is

> "What is the observational unit?"

In other words, what does each row of the `DataFrame` represent? In the case of the OKCupid data set in the previous section, the observational unit was clearly an OKCupid user. But it is not always so obvious what the observational unit is.

For example, consider the [Framingham Heart Study](https://www.framinghamheartstudy.org/fhs-about/) data set, which is available at https://raw.githubusercontent.com/kevindavisross/data301/main/data/framingham_long.csv.
This data comes from a study of men and women in the town of Framingham, Massachusetts, which has enrolled thousands of patients since it began in 1948 and is still ongoing. The goal of the study is to identify risk factors for cardiovascular disease (CVD) by following the subjects over time. The data set that we will analyze was collected on 4,434 subjects between 1956 and 1968. A description of the data set is available [here](https://biolincc.nhlbi.nih.gov/media/teachingstudies/FHS_Teaching_Longitudinal_Data_Documentation_2021a.pdf).

You might guess that the observational unit is a subject. Let's see if that guess is correct.

In [None]:
import pandas as pd

In [None]:
df_framingham = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/framingham_long.csv")
df_framingham

Each `RANDID` corresponds to a unique subject in the study, but each subject appears multiple times in the data set. That is because this is a *longitudinal* study; each subject was measured at multiple points during their lifetime. So the observational unit in the Framingham Heart Study data set is a _measurement_ of a subject at a point in time.
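
One way to check a guess about the observational unit is to count how often each identifier occurs. The sketch below does this on a small made-up table (`toy`, with invented values standing in for the real data) rather than on the Framingham file itself.

```python
import pandas as pd

# Made-up stand-in for a longitudinal data set; all values are invented.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "SYSBP": [106.0, 121.0, 121.0, 105.0, 108.0],
    "AGE": [39, 52, 46, 52, 58],
})

n_rows = len(toy)                                # number of rows (measurements)
n_subjects = toy["RANDID"].nunique()             # number of distinct subjects
rows_per_subject = toy["RANDID"].value_counts()  # how many rows each subject has

print(n_rows, n_subjects)
print(rows_per_subject)
```

If the observational unit were a subject, `n_rows` would equal `n_subjects`. More rows than subjects means that at least one subject appears more than once.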

If there is a variable or a set of variables in the data set that uniquely identifies the observational unit, then it is customary to make those variables the **index** of the `DataFrame`. In the Framingham data set, `RANDID` and `TIME` uniquely identify the observational unit, so we move these columns to the index. (Notice that we specify `inplace=True` so that `.set_index()` modifies the existing `DataFrame` rather than returning a new one.)

In [None]:
df_framingham.set_index(["RANDID", "TIME"], inplace=True)
df_framingham
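
Before moving columns to the index, it can help to confirm that they really do identify each row uniquely. The following is a minimal sketch on a made-up table (`toy` and its values are assumptions for illustration). It also shows the non-mutating form of `.set_index()`, which returns a new `DataFrame` instead of using `inplace=True`.

```python
import pandas as pd

# Made-up stand-in for a longitudinal data set; all values are invented.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "SYSBP": [106.0, 121.0, 121.0, 105.0, 108.0],
    "AGE": [39, 52, 46, 52, 58],
})

# True if no (RANDID, TIME) pair occurs more than once.
key_is_unique = not toy.duplicated(subset=["RANDID", "TIME"]).any()

# Returns a new DataFrame and leaves `toy` unchanged.
# verify_integrity=True raises an error if the new index has duplicates.
toy_indexed = toy.set_index(["RANDID", "TIME"], verify_integrity=True)

print(key_is_unique)
print(toy_indexed.index.is_unique)
```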

## Selecting Rows

We can select an individual row from a `DataFrame` using its label in the index. For example, the fourth row in the Framingham data set above has the `(RANDID, TIME)` label `(6238, 2156)`. The `.loc` attribute of the `DataFrame` is used to select a row by its label.

In [None]:
row = df_framingham.loc[(6238, 2156)]
row

We can also select a row by its position using the `.iloc` attribute. Because Python uses zero-based indexing, the first row is row 0, so the fourth row can also be extracted as:

In [None]:
df_framingham.iloc[3]
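
Labels and positions are two ways of addressing the same row. As a small illustration on a made-up table (`toy` and its values are invented), the index's `.get_loc()` method translates a label into a position, and `.loc` and `.iloc` then return the same row.

```python
import pandas as pd

# Made-up stand-in for a longitudinal data set; all values are invented.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "SYSBP": [106.0, 121.0, 121.0, 105.0, 108.0],
    "AGE": [39, 52, 46, 52, 58],
}).set_index(["RANDID", "TIME"])

# Zero-based position of the row whose label is (6238, 2156).
position = toy.index.get_loc((6238, 2156))

by_label = toy.loc[(6238, 2156)]   # select by label
by_position = toy.iloc[position]   # select by position

same_row = by_label.equals(by_position)
print(position, same_row)
```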

Notice that a single row from a `DataFrame` is no longer a `DataFrame` but a different data structure, called a `Series`.

In [None]:
type(row)
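
To see what such a `Series` contains, the sketch below inspects one on a made-up table (`toy` and its values are invented for illustration). The row's label becomes the name of the `Series`, and the column names become its index.

```python
import pandas as pd

# Made-up stand-in for a longitudinal data set; all values are invented.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "SYSBP": [106.0, 121.0, 121.0, 105.0, 108.0],
    "AGE": [39, 52, 46, 52, 58],
}).set_index(["RANDID", "TIME"])

row = toy.loc[(6238, 2156)]

print(type(row))        # a pandas Series
print(row.name)         # the (RANDID, TIME) label of the selected row
print(list(row.index))  # the column names of the DataFrame
print(row["SYSBP"])     # one value, looked up by column name
```

Because a `Series` has a single data type, values from columns of different numeric types (here, an integer and a float column) are stored in a common type.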

We can also select multiple rows by passing a _list_ of labels or positions to `.loc` and `.iloc`, respectively.

In [None]:
rows = df_framingham.loc[[(2448, 4628), (6238, 2156)]]
rows

In [None]:
df_framingham.iloc[[1, 3]]

Notice that when we select multiple rows, we get a `DataFrame` back.

In [None]:
type(rows)

A `Series` can be used to store a single observation (across multiple variables), while a `DataFrame` is used to store multiple observations (across multiple variables). We will see soon that a `Series` can also be used to store a single column (across multiple observations), so

- A `Series` stores one-dimensional data (i.e., a single row or column).
- A `DataFrame` stores two-dimensional data (i.e., both rows and columns).
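
The difference in dimensionality can be checked with the `.ndim` and `.shape` attributes. The sketch below uses a made-up table (`toy` and its values are invented). Selecting a column with `toy["SYSBP"]` is only a preview here, and passing a one-element list to `.loc` shows that a single row can also be kept as a `DataFrame`.

```python
import pandas as pd

# Made-up stand-in for a longitudinal data set; all values are invented.
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "SYSBP": [106.0, 121.0, 121.0, 105.0, 108.0],
    "AGE": [39, 52, 46, 52, 58],
}).set_index(["RANDID", "TIME"])

row = toy.loc[(6238, 2156)]              # one row as a Series
column = toy["SYSBP"]                    # one column as a Series
one_row_frame = toy.loc[[(6238, 2156)]]  # one row, kept as a DataFrame

print(row.ndim, row.shape)                      # one dimension
print(column.ndim, column.shape)                # one dimension
print(toy.ndim, toy.shape)                      # two dimensions
print(one_row_frame.ndim, one_row_frame.shape)  # two dimensions, one row
```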

To select consecutive rows, we can use Python's slice notation. For example, the code below selects the fourth through ninth rows (positions 3 through 8). The stop position, 9, corresponds to the tenth row, which is not included.

In [None]:
df_framingham.iloc[3:9]
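
As a rough illustration of how slice endpoints behave, the sketch below builds a made-up ten-row table (`toy`, with invented labels and values) and compares a position slice with the equivalent list of positions. It also shows a label slice with `.loc`, which, unlike `.iloc`, includes the stop label. Label slicing on a multi-level index assumes the index is sorted, as it is here.

```python
import pandas as pd

# Made-up table with ten rows; labels and values are invented.
toy = pd.DataFrame(
    {"value": range(10)},
    index=pd.MultiIndex.from_product(
        [[1, 2], [0, 10, 20, 30, 40]], names=["RANDID", "TIME"]
    ),
)

by_position = toy.iloc[3:9]                 # positions 3, 4, ..., 8
same_rows = toy.iloc[[3, 4, 5, 6, 7, 8]]    # the same positions, listed out
by_label = toy.loc[(1, 30):(2, 30)]         # label slice; both endpoints included

print(len(by_position))
print(by_position.equals(same_rows), by_position.equals(by_label))
```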