## Learning Objectives

By the end of this lecture, you will be able to:
- Load epidemiological datasets into a Pandas DataFrame in Colab
- Inspect the structure and content of real-world data
- Explore variable types and distributions
- Practice basic data exploration techniques


## 🗂️ 1. Loading Data in Google Colab

In [42]:
import numpy as np
import pandas as pd

In [49]:
filename = '../Data/frmgham2.csv'
frame = pd.read_csv(filename)
frame.head()

Unnamed: 0,RANDID,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
0,2448,1,195.0,39,106.0,70.0,0,0.0,26.97,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
1,2448,1,209.0,52,121.0,66.0,0,0.0,,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2,6238,2,250.0,46,121.0,81.0,0,0.0,28.73,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
3,6238,2,260.0,52,105.0,69.5,0,0.0,29.43,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,6238,2,237.0,58,108.0,66.0,0,0.0,28.5,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
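Row 1 of the preview has an empty BMI field. The sketch below shows how `pd.read_csv` handles such gaps. It uses a tiny in-memory CSV whose three rows are copied from the preview above, with `io.StringIO` standing in for the file on disk; the name `sample` is only for illustration.

```python
import io
import pandas as pd

# Three rows copied from the preview above; the second has an empty BMI field.
csv_text = """RANDID,SEX,TOTCHOL,AGE,BMI
2448,1,195.0,39,26.97
2448,1,209.0,52,
6238,2,250.0,46,28.73
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, which makes the column a float column.
print(sample)
print(sample['BMI'].isna().sum())
```

The same call works with a file path or a URL in place of the `StringIO` object.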


## 🔍 2. Inspecting Data Structure

In [3]:
# Check dimensions
print(frame.shape)

(11627, 39)


In [4]:
# List column names
frame.columns

Index(['RANDID', 'SEX', 'TOTCHOL', 'AGE', 'SYSBP', 'DIABP', 'CURSMOKE',
       'CIGPDAY', 'BMI', 'DIABETES', 'BPMEDS', 'HEARTRTE', 'GLUCOSE', 'educ',
       'PREVCHD', 'PREVAP', 'PREVMI', 'PREVSTRK', 'PREVHYP', 'TIME', 'PERIOD',
       'HDLC', 'LDLC', 'DEATH', 'ANGINA', 'HOSPMI', 'MI_FCHD', 'ANYCHD',
       'STROKE', 'CVD', 'HYPERTEN', 'TIMEAP', 'TIMEMI', 'TIMEMIFC', 'TIMECHD',
       'TIMESTRK', 'TIMECVD', 'TIMEDTH', 'TIMEHYP'],
      dtype='object')

In [5]:
# Check variable types
frame.dtypes

RANDID        int64
SEX           int64
TOTCHOL     float64
AGE           int64
SYSBP       float64
DIABP       float64
CURSMOKE      int64
CIGPDAY     float64
BMI         float64
DIABETES      int64
BPMEDS      float64
HEARTRTE    float64
GLUCOSE     float64
educ        float64
PREVCHD       int64
PREVAP        int64
PREVMI        int64
PREVSTRK      int64
PREVHYP       int64
TIME          int64
PERIOD        int64
HDLC        float64
LDLC        float64
DEATH         int64
ANGINA        int64
HOSPMI        int64
MI_FCHD       int64
ANYCHD        int64
STROKE        int64
CVD           int64
HYPERTEN      int64
TIMEAP        int64
TIMEMI        int64
TIMEMIFC      int64
TIMECHD       int64
TIMESTRK      int64
TIMECVD       int64
TIMEDTH       int64
TIMEHYP       int64
dtype: object

In [6]:
# Quick info summary
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11627 entries, 0 to 11626
Data columns (total 39 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   RANDID    11627 non-null  int64  
 1   SEX       11627 non-null  int64  
 2   TOTCHOL   11218 non-null  float64
 3   AGE       11627 non-null  int64  
 4   SYSBP     11627 non-null  float64
 5   DIABP     11627 non-null  float64
 6   CURSMOKE  11627 non-null  int64  
 7   CIGPDAY   11548 non-null  float64
 8   BMI       11575 non-null  float64
 9   DIABETES  11627 non-null  int64  
 10  BPMEDS    11034 non-null  float64
 11  HEARTRTE  11621 non-null  float64
 12  GLUCOSE   10187 non-null  float64
 13  educ      11332 non-null  float64
 14  PREVCHD   11627 non-null  int64  
 15  PREVAP    11627 non-null  int64  
 16  PREVMI    11627 non-null  int64  
 17  PREVSTRK  11627 non-null  int64  
 18  PREVHYP   11627 non-null  int64  
 19  TIME      11627 non-null  int64  
 20  PERIOD    11627 non-null  int64  
 ...
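The non-null counts above imply missing values: GLUCOSE, for example, has 10187 non-null entries out of 11627 rows, so 1440 are missing. A minimal sketch of counting them directly, using a hypothetical miniature frame `toy` in place of `frame`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `frame`.
toy = pd.DataFrame({
    'TOTCHOL': [195.0, 209.0, np.nan, 260.0],
    'GLUCOSE': [77.0, np.nan, np.nan, 85.0],
    'AGE': [39, 52, 46, 52],
})

# Number of missing values per column, most incomplete column first.
missing = toy.isna().sum().sort_values(ascending=False)

# Share of missing values per column (the mean of a boolean mask is a proportion).
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing)
print(missing_share)
```

On the real data, the same two lines applied to `frame` would give the counts per column without reading them off `info()`.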

## 📊 3. Exploring the Content

In [9]:
# Summary statistics of all variables
frame.describe(include='all')

Unnamed: 0,RANDID,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
count,11627.0,11627.0,11218.0,11627.0,11627.0,11627.0,11627.0,11548.0,11575.0,11627.0,...,11627.0,11627.0,11627.0,11627.0,11627.0,11627.0,11627.0,11627.0,11627.0,11627.0
mean,5004741.0,1.568074,241.162418,54.79281,136.324116,83.037757,0.432528,8.250346,25.877349,0.045584,...,0.249333,0.74327,7241.556893,7593.846736,7543.036725,7008.153608,7660.880021,7166.082996,7854.10295,3598.956395
std,2900877.0,0.495366,45.36803,9.564299,22.798625,11.660144,0.495448,12.186888,4.10264,0.208589,...,0.432646,0.436848,2477.78001,2136.730285,2192.120311,2641.344513,2011.077091,2541.668477,1788.369623,3464.164659
min,2448.0,1.0,107.0,32.0,83.5,30.0,0.0,0.0,14.43,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.0,0.0
25%,2474378.0,1.0,210.0,48.0,120.0,75.0,0.0,0.0,23.095,0.0,...,0.0,0.0,6224.0,7212.0,7049.5,5598.5,7295.0,6004.0,7797.5,0.0
50%,5006008.0,2.0,238.0,54.0,132.0,82.0,0.0,0.0,25.48,0.0,...,0.0,1.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0,2429.0
75%,7472730.0,2.0,268.0,62.0,149.0,90.0,1.0,20.0,28.07,0.0,...,0.0,1.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0,7329.0
max,9999312.0,2.0,696.0,81.0,295.0,150.0,1.0,90.0,56.8,1.0,...,1.0,1.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0,8766.0
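For indicator columns coded 0/1, the `mean` row of `describe()` is the share of rows coded 1 (about 0.43 for CURSMOKE above). SEX is coded 1/2, so its mean of 1.568 corresponds to roughly 56.8% of rows with SEX equal to 2. A small sketch of this reading, on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical values; CURSMOKE is a 0/1 indicator, SYSBP is continuous.
toy = pd.DataFrame({
    'CURSMOKE': [0, 1, 1, 0, 0],
    'SYSBP': [106.0, 121.0, 150.0, 105.0, 108.0],
})

summary = toy.describe()

# The mean of a 0/1 column equals the proportion of ones.
smoker_share = summary.loc['mean', 'CURSMOKE']

print(summary.loc[['count', 'mean', 'min', 'max']])
print(smoker_share)
```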


### Demographics

In [10]:
# Value counts of a categorical variable
frame['SEX'].value_counts()

SEX
2    6605
1    5022
Name: count, dtype: int64

In [16]:
# value_counts() on AGE works, but one row per year of age is hard to read:
pd.DataFrame(frame['AGE'].value_counts()).reset_index().sort_values('AGE')

Unnamed: 0,AGE,count
49,32,1
47,33,5
45,34,18
41,35,42
38,36,84
36,37,93
32,38,145
30,39,178
28,40,212
27,41,213


In [20]:
# Create age group variable
frame['age_group'] = pd.cut(frame['AGE'], bins = [0,18,25,35,45,55,65,76,85,95])
frame['age_group'].value_counts()

age_group
(45, 55]    4155
(55, 65]    3592
(35, 45]    2082
(65, 76]    1651
(76, 85]      81
(25, 35]      66
(0, 18]        0
(18, 25]       0
(85, 95]       0
Name: count, dtype: int64
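The intervals produced by `pd.cut` are right-closed by default, so an age of 35 falls in `(25, 35]` and 36 falls in `(35, 45]`. The sketch below illustrates this and lists the groups in age order rather than by frequency. The series `ages` and the coarser bins are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages, including values that sit exactly on bin edges.
ages = pd.Series([32, 35, 36, 45, 46, 81], name='AGE')

# Coarser bins than in the cell above, for illustration only.
groups = pd.cut(ages, bins=[25, 35, 45, 55, 85])

# sort_index() lists the intervals in ascending order instead of by count.
counts = groups.value_counts().sort_index()

print(groups)
print(counts)
```

Choosing bins that cover only the observed age range also avoids empty groups such as `(0, 18]` in the output above.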

In [56]:
frame['educ'].value_counts()

educ
1.0    4690
2.0    3410
3.0    1885
4.0    1347
Name: count, dtype: int64

### Outcomes

In [35]:
frame['PREVCHD'].value_counts()

PREVCHD
0    10785
1      842
Name: count, dtype: int64

In [38]:
pd.DataFrame(frame[['age_group','PREVCHD']].value_counts()).sort_values(['PREVCHD','age_group'])

age_group,PREVCHD,count
"(25, 35]",0,66
"(35, 45]",0,2059
"(45, 55]",0,3997
"(55, 65]",0,3227
"(65, 76]",0,1371
"(76, 85]",0,65
"(35, 45]",1,23
"(45, 55]",1,158
"(55, 65]",1,365
"(65, 76]",1,280

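The long table above can also be read as proportions: in the `(55, 65]` group, 365 of 3592 rows (about 10%) have PREVCHD equal to 1, compared with 23 of 2082 (about 1%) in `(35, 45]`. A sketch of producing the wide counts and the row shares with `pd.crosstab`, on a hypothetical frame `toy` with string labels in place of the real intervals:

```python
import pandas as pd

# Hypothetical rows; the age-group labels are plain strings for simplicity.
toy = pd.DataFrame({
    'age_group': ['35-45', '35-45', '45-55', '45-55', '45-55', '55-65'],
    'PREVCHD':   [0, 0, 0, 1, 0, 1],
})

# Wide table of counts: one row per age group, one column per PREVCHD value.
counts = pd.crosstab(toy['age_group'], toy['PREVCHD'])

# normalize='index' divides each row by its total, giving within-group shares.
row_share = pd.crosstab(toy['age_group'], toy['PREVCHD'], normalize='index')

print(counts)
print(row_share)
```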

## 4. Longitudinal Observations per Person
The Framingham Heart Study follows a cohort of participants over multiple time points, so one person can appear in several rows. What does the design look like?

In [21]:
# Observations vs people:
print('Dataset has %i rows for %i unique individuals.'%(frame.shape[0],len(frame['RANDID'].unique())))

Dataset has 11627 rows for 4434 unique individuals.


In [25]:
frame['RANDID'].unique()[:10]

array([ 2448,  6238,  9428, 10552, 11252, 11263, 12629, 12806, 14367,
       16365])
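With 11627 rows for 4434 individuals, each person contributes about 2.6 rows on average. One way to see the distribution behind that average is to count rows per RANDID. The sketch below uses a hypothetical frame `toy`; the PERIOD values in it are illustrative, not copied from the data.

```python
import pandas as pd

# Hypothetical long-format data: one row per person per examination.
toy = pd.DataFrame({
    'RANDID': [2448, 2448, 6238, 6238, 6238, 9428],
    'PERIOD': [1, 3, 1, 2, 3, 1],
})

# Number of rows (examinations) recorded for each person.
visits_per_person = toy.groupby('RANDID').size()

# How many people have 1, 2, or 3 examinations.
design = visits_per_person.value_counts().sort_index()

print(visits_per_person)
print(design)
```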

## 5. Zooming In on Subgroups

### Slicing and Indexing

In [83]:
frame.loc[1:5,['SEX','AGE']]

Unnamed: 0,SEX,AGE
1,1,52
2,2,46
3,2,52
4,2,58
5,1,48


In [84]:
frame.loc[:5,'SEX':'AGE']

Unnamed: 0,SEX,TOTCHOL,AGE
0,1,195.0,39
1,1,209.0,52
2,2,250.0,46
3,2,260.0,52
4,2,237.0,58
5,1,245.0,48


In [88]:
frame.iloc[:5,:5]

Unnamed: 0,RANDID,SEX,TOTCHOL,AGE,SYSBP
0,2448,1,195.0,39,106.0
1,2448,1,209.0,52,121.0
2,6238,2,250.0,46,121.0
3,6238,2,260.0,52,105.0
4,6238,2,237.0,58,108.0
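Note the difference in the outputs above: `frame.loc[:5, ...]` returned six rows (labels 0 through 5), while `frame.iloc[:5, ...]` returned five. Slices in `.loc` are by label and include the end point; slices in `.iloc` are by position and exclude it. A compact sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3.
toy = pd.DataFrame({
    'SEX': [1, 1, 2, 2],
    'TOTCHOL': [195.0, 209.0, 250.0, 260.0],
    'AGE': [39, 52, 46, 52],
})

# Label-based: rows 0, 1, 2 and columns SEX through AGE, both ends included.
by_label = toy.loc[:2, 'SEX':'AGE']

# Position-based: rows 0, 1 and the first two columns, end excluded.
by_position = toy.iloc[:2, :2]

print(by_label)
print(by_position)
```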


### Boolean Indexing with `.loc`

In [87]:
frame.loc[frame['AGE']>80,['AGE','TOTCHOL']]

Unnamed: 0,AGE,TOTCHOL
8570,81,251.0
8924,81,305.0
11405,81,250.0


In [89]:
df_age_39 = frame.loc[frame['AGE']==39]

In [82]:
df_age_39.shape

(178, 39)
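Several conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on a hypothetical frame `toy`; the variable names are illustrative:

```python
import pandas as pd

# Hypothetical rows standing in for `frame`.
toy = pd.DataFrame({
    'AGE': [39, 52, 46, 81, 81],
    'SEX': [1, 1, 2, 2, 1],
    'TOTCHOL': [195.0, 209.0, 250.0, 305.0, 251.0],
})

# Rows where both conditions hold, keeping two columns.
older_sex2 = toy.loc[(toy['AGE'] > 50) & (toy['SEX'] == 2), ['AGE', 'TOTCHOL']]

# Rows where at least one condition holds.
young_or_high = toy.loc[(toy['AGE'] < 45) | (toy['TOTCHOL'] > 300)]

print(older_sex2)
print(young_or_high)
```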

### Tracking One Person

In [91]:
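# Track one person across their examinations:
person_df = frame[frame['RANDID']==2448]
print(person_df.iloc[:,:5])
# Differences in TIME and AGE between consecutive examinations
print(person_df[['TIME','AGE']].diff())
# For comparison with the TIME difference: 12 years expressed in days
12*365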

   RANDID  SEX  TOTCHOL  AGE  SYSBP
0    2448    1    195.0   39  106.0
1    2448    1    209.0   52  121.0
     TIME   AGE
0     NaN   NaN
1  4628.0  13.0


4380

In [92]:
frame[frame['RANDID']==6238]

Unnamed: 0,RANDID,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
2,6238,2,250.0,46,121.0,81.0,0,0.0,28.73,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
3,6238,2,260.0,52,105.0,69.5,0,0.0,29.43,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,6238,2,237.0,58,108.0,66.0,0,0.0,28.5,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
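The `diff()` call used for one person can be applied to everyone at once by grouping on RANDID first, so that differences never cross from one person to the next. The sketch below uses the five rows shown in the previews above and assumes rows are already ordered by examination within each person; `toy` and `changes` are illustrative names.

```python
import pandas as pd

# The first five rows of the data, restricted to three columns.
toy = pd.DataFrame({
    'RANDID':  [2448, 2448, 6238, 6238, 6238],
    'AGE':     [39, 52, 46, 52, 58],
    'TOTCHOL': [195.0, 209.0, 250.0, 260.0, 237.0],
})

# Within-person change since the previous examination.
# The first row of each person has no predecessor and becomes NaN.
changes = toy.groupby('RANDID')[['AGE', 'TOTCHOL']].diff()

print(changes)
```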
