# Rows and Observational Units

Recall that the rows of a tabular data set represent observations (or records). Whenever you encounter a new (tabular) data
set, the first question you should ask yourself is

> "What is the observational unit?"

In other words, what does each row of the `DataFrame` represent? In the OKCupid data set from the previous section, the observational unit was clearly an OKCupid user. But the observational unit is not always so obvious.

For example, consider the [Framingham Heart Study](https://www.framinghamheartstudy.org/fhs-about/) data set, which is available at https://raw.githubusercontent.com/kevindavisross/data301/main/data/framingham_long.csv.
This data comes from a study of men and women in the town of Framingham, Massachusetts, which has enrolled thousands of patients since it began in 1948 and is still ongoing. The goal of the study is to identify risk factors for cardiovascular disease (CVD) by following the subjects over time. The data set that we will analyze was collected on 4,434 subjects between 1956 and 1968. A description of the data set is available [here](https://biolincc.nhlbi.nih.gov/media/teachingstudies/FHS_Teaching_Longitudinal_Data_Documentation_2021a.pdf).

You might guess that the observational unit is a subject. Let's see if that guess is correct.

In [6]:
import pandas as pd

In [7]:
df_framingham = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/framingham_long.csv")
df_framingham

,RANDID,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
0,2448,1,195.0,39,106.0,70.0,0,0.0,26.97,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
1,2448,1,209.0,52,121.0,66.0,0,0.0,,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2,6238,2,250.0,46,121.0,81.0,0,0.0,28.73,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
3,6238,2,260.0,52,105.0,69.5,0,0.0,29.43,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,6238,2,237.0,58,108.0,66.0,0,0.0,28.50,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11622,9998212,1,173.0,46,126.0,82.0,0,0.0,19.17,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,0
11623,9998212,1,153.0,52,143.0,89.0,0,0.0,25.74,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,0
11624,9999312,2,196.0,39,133.0,86.0,1,30.0,20.91,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4201
11625,9999312,2,240.0,46,138.0,79.0,1,20.0,26.39,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4201


Each `RANDID` corresponds to a unique subject in the study, but each subject appears multiple times in the data set. That is because this is a *longitudinal* study; each subject was measured at multiple points during their lifetime. So the observational unit in the Framingham Heart Study data set is a _measurement_ of a subject at a point in time.
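
One way to check a guess about the observational unit is to count how often each candidate identifier appears. The sketch below does this on a small stand-in `DataFrame` (the name `toy` is hypothetical) whose values are copied from the first few rows of the data set; it is an illustration of the idea, not a check on the full data.

```python
import pandas as pd

# Toy stand-in for the first five rows of the data set (only a few columns kept).
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0],
    "AGE": [39, 52, 46, 52, 58],
})

# Number of rows per subject: a count above 1 means the subject was measured repeatedly.
rows_per_subject = toy["RANDID"].value_counts()
print(rows_per_subject)

# RANDID alone does not identify a row, but the pair (RANDID, TIME) does.
randid_is_unique = toy["RANDID"].is_unique
pair_is_unique = not toy.duplicated(subset=["RANDID", "TIME"]).any()
print(randid_is_unique, pair_is_unique)
```

If a column you expected to identify rows turns out not to be unique, that is a strong hint that the observational unit is something finer than you guessed.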

If there is a variable or a set of variables in the data set that uniquely identifies the observational unit, then it is customary to make those variables the **index** of the `DataFrame`. In the Framingham data set, `RANDID` and `TIME` together uniquely identify the observational unit (`TIME` is one of the columns hidden behind the `...` in the display above), so we move these two columns to the index. (Notice that we specify `inplace=True` so that `.set_index()` modifies the existing `DataFrame` rather than returning a new one.)

In [8]:
df_framingham.set_index(["RANDID", "TIME"], inplace=True)
df_framingham

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2448,0,1,195.0,39,106.0,70.0,0,0.0,26.97,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2448,4628,1,209.0,52,121.0,66.0,0,0.0,,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
6238,0,2,250.0,46,121.0,81.0,0,0.0,28.73,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6238,4344,2,237.0,58,108.0,66.0,0,0.0,28.50,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9998212,2333,1,173.0,46,126.0,82.0,0,0.0,19.17,0,0.0,...,0,1,8766,8766,8766,8766,8766,8766,8766,0
9998212,4538,1,153.0,52,143.0,89.0,0,0.0,25.74,0,0.0,...,0,1,8766,8766,8766,8766,8766,8766,8766,0
9999312,0,2,196.0,39,133.0,86.0,1,30.0,20.91,0,0.0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4201
9999312,2390,2,240.0,46,138.0,79.0,1,20.0,26.39,0,0.0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4201
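
As an alternative to `inplace=True`, `.set_index()` can be left to return a new `DataFrame`, which is then assigned to a name. The sketch below shows this on a small stand-in `DataFrame` (the names `toy` and `indexed` are hypothetical) built from the first five rows above. It also passes `verify_integrity=True`, which asks pandas to raise an error if the chosen columns do not uniquely identify the rows.

```python
import pandas as pd

# Toy stand-in for the first five rows shown above (only two measurement columns kept).
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0],
    "AGE": [39, 52, 46, 52, 58],
})

# Without inplace=True, set_index() returns a new DataFrame and leaves `toy` unchanged.
# verify_integrity=True raises a ValueError if the new index contains duplicates.
indexed = toy.set_index(["RANDID", "TIME"], verify_integrity=True)
print(indexed.index.is_unique)

# RANDID alone is not a valid identifier: each subject appears more than once.
try:
    toy.set_index("RANDID", verify_integrity=True)
    randid_alone_ok = True
except ValueError:
    randid_alone_ok = False
print(randid_alone_ok)
```

Either style works; returning a new `DataFrame` simply keeps the original available in case you need the identifier columns as ordinary columns again.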


## Selecting Rows

We can select an individual row from a `DataFrame` using its label in the index. For example, the fourth row in the Framingham data set above has the `(RANDID, TIME)` label `(6238, 2156)`. The `.loc` attribute of the `DataFrame` is used to select a row by its label.

In [9]:
row = df_framingham.loc[(6238, 2156)]
row

SEX            2.00
TOTCHOL      260.00
AGE           52.00
SYSBP        105.00
DIABP         69.50
CURSMOKE       0.00
CIGPDAY        0.00
BMI           29.43
DIABETES       0.00
BPMEDS         0.00
HEARTRTE      80.00
GLUCOSE       86.00
educ           2.00
PREVCHD        0.00
PREVAP         0.00
PREVMI         0.00
PREVSTRK       0.00
PREVHYP        0.00
PERIOD         2.00
HDLC            NaN
LDLC            NaN
DEATH          0.00
ANGINA         0.00
HOSPMI         0.00
MI_FCHD        0.00
ANYCHD         0.00
STROKE         0.00
CVD            0.00
HYPERTEN       0.00
TIMEAP      8766.00
TIMEMI      8766.00
TIMEMIFC    8766.00
TIMECHD     8766.00
TIMESTRK    8766.00
TIMECVD     8766.00
TIMEDTH     8766.00
TIMEHYP     8766.00
Name: (6238, 2156), dtype: float64

We can also select a row by its position using the `.iloc` attribute. Python uses zero-based indexing, so the first row is row 0 and the fourth row is row 3. The fourth row could therefore also be extracted as:

In [10]:
df_framingham.iloc[3]

SEX            2.00
TOTCHOL      260.00
AGE           52.00
SYSBP        105.00
DIABP         69.50
CURSMOKE       0.00
CIGPDAY        0.00
BMI           29.43
DIABETES       0.00
BPMEDS         0.00
HEARTRTE      80.00
GLUCOSE       86.00
educ           2.00
PREVCHD        0.00
PREVAP         0.00
PREVMI         0.00
PREVSTRK       0.00
PREVHYP        0.00
PERIOD         2.00
HDLC            NaN
LDLC            NaN
DEATH          0.00
ANGINA         0.00
HOSPMI         0.00
MI_FCHD        0.00
ANYCHD         0.00
STROKE         0.00
CVD            0.00
HYPERTEN       0.00
TIMEAP      8766.00
TIMEMI      8766.00
TIMEMIFC    8766.00
TIMECHD     8766.00
TIMESTRK    8766.00
TIMECVD     8766.00
TIMEDTH     8766.00
TIMEHYP     8766.00
Name: (6238, 2156), dtype: float64
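
To make the link between labels and positions concrete, the sketch below repeats both selections on a small stand-in `DataFrame` (the names `toy` and `indexed` are hypothetical; the values come from the first five rows of the data set) and checks that they return the same row. It also uses the index's `.get_loc()` method to translate a label into a position, which assumes the index is unique.

```python
import pandas as pd

# Toy stand-in for the first five rows of the data set, indexed by (RANDID, TIME).
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0],
    "AGE": [39, 52, 46, 52, 58],
})
indexed = toy.set_index(["RANDID", "TIME"])

# Select the fourth row by label and by (zero-based) position.
by_label = indexed.loc[(6238, 2156)]
by_position = indexed.iloc[3]
print(by_label.equals(by_position))

# For a unique index, get_loc() translates a label into its integer position.
position = indexed.index.get_loc((6238, 2156))
print(position)
```

Labels stay attached to a row when the `DataFrame` is sorted or filtered, whereas positions do not, so `.loc` is usually the safer choice when you mean a specific observation.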

Notice that a single row from a `DataFrame` is no longer a `DataFrame` but a different data structure, called a `Series`.

In [11]:
type(row)

pandas.core.series.Series

We can also select multiple rows by passing a _list_ of labels or positions to `.loc` and `.iloc`, respectively.

In [12]:
rows = df_framingham.loc[[(2448, 4628), (6238, 2156)]]
rows

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2448,4628,1,209.0,52,121.0,66.0,0,0.0,,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766


In [13]:
df_framingham.iloc[[1, 3]]

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2448,4628,1,209.0,52,121.0,66.0,0,0.0,,0,0.0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766


Notice that when we select multiple rows, we get a `DataFrame` back.

In [14]:
type(rows)

pandas.core.frame.DataFrame

A `Series` can be used to store a single observation (across multiple variables), while a `DataFrame` is used to store multiple observations (across multiple variables). We will see soon that a `Series` can also be used to store a single column (across multiple observations). In summary:

- A `Series` stores one-dimensional data (i.e., a single row or column).
- A `DataFrame` stores two-dimensional data (i.e., both rows and columns).
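
One consequence of this distinction is that the return type depends on whether you pass a single label or a list of labels, even when the list has only one element. The sketch below illustrates this on a small stand-in `DataFrame` (the names `toy` and `indexed` are hypothetical; the values come from the first five rows of the data set).

```python
import pandas as pd

# Toy stand-in for the first five rows of the data set, indexed by (RANDID, TIME).
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238],
    "TIME": [0, 4628, 0, 2156, 4344],
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0],
    "AGE": [39, 52, 46, 52, 58],
})
indexed = toy.set_index(["RANDID", "TIME"])

# A single label gives a one-dimensional Series (one entry per column).
one_row = indexed.loc[(6238, 2156)]
print(type(one_row).__name__, one_row.ndim, one_row.shape)

# A list containing that same label gives a two-dimensional DataFrame with one row.
one_row_frame = indexed.loc[[(6238, 2156)]]
print(type(one_row_frame).__name__, one_row_frame.ndim, one_row_frame.shape)
```

Wrapping a label in a list is a handy way to keep a result two-dimensional when later code expects a `DataFrame`.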

To select consecutive rows, we can use Python's slice notation. A slice includes its start position but excludes its stop position. For example, the code below selects the rows at positions 3 through 8, that is, from the fourth row up to (but not including) the tenth row.

In [15]:
df_framingham.iloc[3:9]

RANDID,TIME,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,BPMEDS,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
6238,2156,2,260.0,52,105.0,69.5,0,0.0,29.43,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6238,4344,2,237.0,58,108.0,66.0,0,0.0,28.5,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
9428,0,1,245.0,48,127.5,80.0,1,20.0,25.34,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
9428,2199,1,283.0,54,141.0,89.0,1,30.0,25.34,0,0.0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
10552,0,2,225.0,61,150.0,95.0,1,30.0,28.58,0,0.0,...,1,1,2956,2956,2956,2956,2089,2089,2956,0
10552,1977,2,232.0,67,183.0,109.0,1,20.0,30.18,0,0.0,...,1,1,2956,2956,2956,2956,2089,2089,2956,0
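
The slice above can be reproduced on a small stand-in `DataFrame` to see exactly which positions it covers. This is only a sketch: the names `toy` and `indexed` are hypothetical, and the nine rows are copied from the first nine rows of the data set with just two measurement columns kept.

```python
import pandas as pd

# Toy stand-in for the first nine rows of the data set, indexed by (RANDID, TIME).
toy = pd.DataFrame({
    "RANDID": [2448, 2448, 6238, 6238, 6238, 9428, 9428, 10552, 10552],
    "TIME": [0, 4628, 0, 2156, 4344, 0, 2199, 0, 1977],
    "TOTCHOL": [195.0, 209.0, 250.0, 260.0, 237.0, 245.0, 283.0, 225.0, 232.0],
    "AGE": [39, 52, 46, 52, 58, 48, 54, 61, 67],
})
indexed = toy.set_index(["RANDID", "TIME"])

# Positions 3, 4, 5, 6, 7, 8 are selected; the stop position 9 is excluded.
sliced = indexed.iloc[3:9]
print(len(sliced))

# The slice is equivalent to passing the list of positions explicitly.
explicit = indexed.iloc[[3, 4, 5, 6, 7, 8]]
print(sliced.equals(explicit))
```

The number of rows returned by `.iloc[start:stop]` is `stop - start` (here 9 - 3 = 6), provided both positions fall within the `DataFrame`.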
