## Framingham Reshaping

This notebook reshapes the longitudinal Framingham data into a format suitable for multistate modeling.

We will focus on creating a data set suitable for four states: No disease, Hypertension, Cardiovascular disease (any), and Death (absorbing).

Some people will begin in State 2 because they already had hypertension at baseline (PREVHYP = 1).

Retained covariates will include: Age, Sex, Diabetes (y/n), and Smoker (y/n).
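
As a quick reference, the state codes can be written down as a small lookup table. This is only a sketch: `STATE_LABELS`, `ABSORBING_STATES`, and `describe_state` are hypothetical names, and the code 99 for censored records is the value this notebook assigns further down.

```python
# Hypothetical lookup table: codes 1-4 follow the state list above,
# and 99 is the code the notebook later uses for censored records.
STATE_LABELS = {
    1: "No disease",
    2: "Hypertension",
    3: "Cardiovascular disease",
    4: "Death",
    99: "Censored",
}

# Death is the only absorbing state in this design.
ABSORBING_STATES = {4}


def describe_state(code):
    """Return a readable label, flagging absorbing states."""
    label = STATE_LABELS[code]
    return f"{label} (absorbing)" if code in ABSORBING_STATES else label


print(describe_state(2))
print(describe_state(4))
```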

In [1]:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np

In [2]:
framingham = pd.read_csv("Datasets/framingham.csv")

In [3]:
framingham

Unnamed: 0,SEX,RANDID,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,...,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
0,1,2448,195,39,106.0,70.0,0,0,26.97,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
1,1,2448,209,52,121.0,66.0,0,0,.,0,...,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2,2,6238,250,46,121.0,81.0,0,0,28.73,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
3,2,6238,260,52,105.0,69.5,0,0,29.43,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,2,6238,237,58,108.0,66.0,0,0,28.5,0,...,0,0,8766,8766,8766,8766,8766,8766,8766,8766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11622,1,9998212,173,46,126.0,82.0,0,0,19.17,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,0
11623,1,9998212,153,52,143.0,89.0,0,0,25.74,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,0
11624,2,9999312,196,39,133.0,86.0,1,30,20.91,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4201
11625,2,9999312,240,46,138.0,79.0,1,20,26.39,0,...,0,1,8766,8766,8766,8766,8766,8766,8766,4201


In [4]:
framingham.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11627 entries, 0 to 11626
Data columns (total 38 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEX       11627 non-null  int64  
 1   RANDID    11627 non-null  int64  
 2   TOTCHOL   11627 non-null  object 
 3   AGE       11627 non-null  int64  
 4   SYSBP     11627 non-null  float64
 5   DIABP     11627 non-null  float64
 6   CURSMOKE  11627 non-null  int64  
 7   CIGPDAY   11627 non-null  object 
 8   BMI       11627 non-null  object 
 9   DIABETES  11627 non-null  int64  
 10  BPMEDS    11627 non-null  object 
 11  HEARTRTE  11627 non-null  object 
 12  GLUCOSE   11627 non-null  object 
 13  PREVCHD   11627 non-null  int64  
 14  PREVAP    11627 non-null  int64  
 15  PREVMI    11627 non-null  int64  
 16  PREVSTRK  11627 non-null  int64  
 17  PREVHYP   11627 non-null  int64  
 18  TIME      11627 non-null  int64  
 19  PERIOD    11627 non-null  int64  
 20  HDLC      11627 non-null  ob

## Define the at-risk population
We want to begin with people who do not already have cardiovascular disease, so we keep only rows where PREVAP, PREVCHD, PREVMI, and PREVSTRK are all 0.

In [5]:
at_risk = framingham[(framingham["PREVAP"] == 0) & (framingham["PREVCHD"] == 0) & (framingham["PREVMI"] == 0) & (framingham["PREVSTRK"] == 0)] 
at_risk = at_risk[["RANDID", "TIME", "PERIOD", "CVD", "TIMECVD", "DEATH", "TIMEDTH", "HYPERTEN", "TIMEHYP", "PREVHYP", "AGE", "SEX", "CURSMOKE", "DIABETES"]]
at_risk

Unnamed: 0,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES
0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0
1,2448,4628,3,1,6438,0,8766,0,8766,0,52,1,0,0
2,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0
3,6238,2156,2,0,8766,0,8766,0,8766,0,52,2,0,0
4,6238,4344,3,0,8766,0,8766,0,8766,0,58,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11622,9998212,2333,2,0,8766,0,8766,1,0,1,46,1,0,0
11623,9998212,4538,3,0,8766,0,8766,1,0,1,52,1,0,0
11624,9999312,0,1,0,8766,0,8766,1,4201,0,39,2,1,0
11625,9999312,2390,2,0,8766,0,8766,1,4201,0,46,2,1,0


## Discover Ns

The last day of the Framingham study was day # 8,766.

Patients who reached day 8,766 **without** getting CVD would have TIMECVD set to this number - meaning they could have gotten CVD after this day, but nobody would know about it.

Another possibility is that they were censored because they dropped out of the study before day 8,766.  In this case, DEATH would be 0, and TIMEDTH would be set to the drop-out date.

Being in an unknown state, or censored, is something we want to account for.

In other words, a time column for an event (e.g. TIMEDTH, TIMECVD) equal to 8,766 means the patient reached the end of the study period without that event. When the indicator flag for a state (HYPERTEN, CVD, DEATH) is 0, no event was observed, and the value in TIMEDTH is the censoring day.

So we want to figure out how many unique individuals in the study **did** get diagnosed with HYPERTENSION (State 2), get diagnosed with CVD (State 3), or Died (State 4) and how many were censored.
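
The difference between an observed event and a censored record can be sketched on toy data. The rows below are made up, and `death_status` is a hypothetical helper; only the column names and the value 8,766 come from the text above.

```python
import pandas as pd

LAST_DAY = 8766  # final day of follow-up, as stated above

# Toy rows (not real participants), one per person.
toy = pd.DataFrame({
    "RANDID":  [1, 2, 3],
    "DEATH":   [1, 0, 0],
    "TIMEDTH": [2956, 8766, 5000],
})


def death_status(row):
    """Label a row as an observed death, end-of-study censoring, or early drop-out."""
    if row["DEATH"] == 1:
        return "died"
    if row["TIMEDTH"] >= LAST_DAY:
        return "censored at end of study"
    return "censored early (dropped out)"


toy["STATUS"] = toy.apply(death_status, axis=1)
print(toy)
```

The same pattern applies to the CVD and HYPERTEN flags with their own time columns.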

In [6]:
last_day_of_study = 8766

ind_ids = at_risk.drop_duplicates(subset=["RANDID"])["RANDID"].tolist()
died_ids = at_risk[(at_risk["DEATH"] == 1) & (at_risk["TIMEDTH"]<last_day_of_study)]["RANDID"].drop_duplicates()
cvd_ids = at_risk[(at_risk["CVD"] == 1) & (at_risk["TIMECVD"]<last_day_of_study)]["RANDID"].drop_duplicates()
htn_ids = at_risk[((at_risk["HYPERTEN"] == 1) & (at_risk["TIMEHYP"] > 0) & (at_risk["TIMEHYP"]<last_day_of_study)) | 
                  (at_risk["PREVHYP"]==1)]["RANDID"].drop_duplicates()

censored_ids = at_risk[((at_risk["TIMEDTH"] > 0) & (at_risk["DEATH"] == 0) & 
                        ((at_risk["CVD"] == 0) | (at_risk["HYPERTEN"] == 0)))]["RANDID"].drop_duplicates()

There are 4,215 unique individuals in the study.

In [7]:
len(ind_ids)

4215

3,065 of them (73%) either started the study with Hypertension, or became Hypertensive after TIME 0 (entered State 2).

This is a lot of people by today's standards, but in the 1950s and 60s, when these data were collected, many people had untreated hypertension. Physicians generally considered it part of the aging process and didn't treat it.

In [8]:
htn_ids

7          10552
9          11252
12         11263
15         12629
17         12806
          ...   
11613    9990894
11616    9993179
11619    9995546
11621    9998212
11624    9999312
Name: RANDID, Length: 3065, dtype: int64

996 of them (24%) got Cardiovascular disease after TIME 0 (entered State 3).

In [9]:
cvd_ids

0           2448
7          10552
12         11263
55         43770
64         54224
          ...   
11590    9967157
11595    9969773
11604    9982118
11608    9984683
11619    9995546
Name: RANDID, Length: 996, dtype: int64

And 1,388 of them (33%) died after TIME 0 but before the last day of the study (entered State 4).

In [10]:
died_ids

7          10552
32         23727
35         24721
38         30928
42         33555
          ...   
11607    9983319
11608    9984683
11610    9989287
11613    9990894
11616    9993179
Name: RANDID, Length: 1388, dtype: int64

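2,507 individuals (59%) reached the end of their follow-up alive, so their later state is unknown. Note that the filter above also requires CVD == 0 or HYPERTEN == 0, so participants who were alive with both conditions are not counted here.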

We will consider them censored.

In [11]:
censored_ids

0           2448
2           6238
5           9428
9          11252
15         12629
          ...   
11582    9960803
11584    9961615
11601    9978986
11621    9998212
11624    9999312
Name: RANDID, Length: 2507, dtype: int64

Now that we know how many individuals reached each state, we need to figure out what state they started in.

Some people started in State 1 - No Disease, while others started in State 2 - Hypertension.

Some had recorded data at TIME 0 and then no further recorded data, or no recorded data until period 3 (TIME around day 4,500).

Even when Framingham participants had no further TIME/examination data, researchers filled in the exact day of each later diagnosis. So a person could have a single row, at TIME 0, and that row could hold the day they got Hypertension (after TIME 0) and the day they died (after TIME 0), for example.

Relatedly, transition times from one state to another are in the TIMEHYP, TIMECVD and TIMEDTH columns, NOT in the TIME column. The TIME column only records examination times for longitudinal study participants, so we do not use it for transitions. We use the later examination rows only to pick up the latest covariate values, such as smoking and diabetes status.
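
A minimal sketch of this layout, on made-up rows: the event-time columns repeat the same value on every exam row of a participant, while covariates can change between exams. The values and the names `event_cols`, `events_constant`, and `smoke_changes` are illustrative assumptions. Whether the real file is equally consistent would need to be checked the same way.

```python
import pandas as pd

# Toy exam-level data (hypothetical values): participant 1 has three exams,
# participant 2 has only the baseline exam.
toy = pd.DataFrame({
    "RANDID":   [1, 1, 1, 2],
    "TIME":     [0, 2100, 4400, 0],
    "CURSMOKE": [1, 1, 0, 0],
    "TIMEHYP":  [3000, 3000, 3000, 1500],
    "TIMEDTH":  [8766, 8766, 8766, 6000],
})

event_cols = ["TIMEHYP", "TIMEDTH"]

# Number of distinct values of each event column per participant.
distinct = toy.groupby("RANDID")[event_cols].nunique()
events_constant = bool((distinct == 1).all().all())

# Covariates, by contrast, can change between exams.
smoke_changes = toy.groupby("RANDID")["CURSMOKE"].nunique()

print(events_constant)
print(smoke_changes)
```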

In [12]:
start_times = at_risk.drop_duplicates(subset=["RANDID"], keep="first")
start_times

Unnamed: 0,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES
0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0
2,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0
5,9428,0,1,0,8766,0,8766,0,8766,0,48,1,1,0
7,10552,0,1,1,2089,1,2956,1,0,1,61,2,1,0
9,11252,0,1,0,8766,0,8766,1,4285,0,46,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11613,9990894,0,1,0,6433,1,6433,1,2219,0,48,2,1,0
11616,9993179,0,1,0,6729,1,6729,1,4396,0,44,2,1,0
11619,9995546,0,1,1,5209,0,8766,1,735,0,52,2,0,0
11621,9998212,0,1,0,8766,0,8766,1,0,1,40,1,0,0


Looks like all 4,215 participants started at TIME 0.

In [13]:
start_times["TIME"].value_counts()

0    4215
Name: TIME, dtype: int64

2,917 of them (69%) were free of hypertension, and 1,298 (31%) already had Hypertension at baseline (PREVHYP = 1).

In [14]:
start_times["PREVHYP"].value_counts()

0    2917
1    1298
Name: PREVHYP, dtype: int64

In [15]:
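# One row per participant: the baseline exam.
framingham_ms1 = at_risk.drop_duplicates(subset=["RANDID"], keep="first").reset_index()
# Baseline rows are PERIOD 1, so copying PERIOD starts everyone in State 1.
framingham_ms1["STATE"] = framingham_ms1["PERIOD"].copy()
# Participants with prevalent hypertension start in State 2 instead.
framingham_ms1.loc[framingham_ms1["PREVHYP"]==1, "STATE"] = 2
# The starting state is entered on day 0.
framingham_ms1["DAYS"] = 0
framingham_ms1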

Unnamed: 0,index,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0,1,0
1,2,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0,1,0
2,5,9428,0,1,0,8766,0,8766,0,8766,0,48,1,1,0,1,0
3,7,10552,0,1,1,2089,1,2956,1,0,1,61,2,1,0,2,0
4,9,11252,0,1,0,8766,0,8766,1,4285,0,46,2,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4210,11613,9990894,0,1,0,6433,1,6433,1,2219,0,48,2,1,0,1,0
4211,11616,9993179,0,1,0,6729,1,6729,1,4396,0,44,2,1,0,1,0
4212,11619,9995546,0,1,1,5209,0,8766,1,735,0,52,2,0,0,1,0
4213,11621,9998212,0,1,0,8766,0,8766,1,0,1,40,1,0,0,2,0


Now we have the first States recorded for all the participants.

We need to add rows for those who also experienced State 2 after TIME 0, and those who experienced State 3 (Cardiovascular disease) and State 4 (Death).

## Things we need to account for:

Sometimes a person might get Hypertension & CVD on the same day.

Or, they might get CVD & Die on the same day.

Or, they might get Hypertension and Die on the same day.

Or, potentially, all THREE things might occur on the same day.

We have to resolve these conflicts, or the model won't run.

So, wherever a person is recorded as entering two or more states on the same day (States 2 & 3, 3 & 4, or 2, 3 & 4, for example), only the transition to the most advanced state is kept: State 3 when Hypertension and CVD coincide, and State 4 whenever Death is involved.
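
One way to resolve such ties, sketched on a made-up participant, is to stack the rows in order of increasing state and keep only the last row for each (RANDID, DAYS) pair. The data below are hypothetical; the `pd.concat` plus `drop_duplicates(..., keep="last")` pattern is the same one used in the cells that follow.

```python
import pandas as pd

# Toy long-format rows (hypothetical participant 7): CVD and death
# are both recorded on day 430.
baseline = pd.DataFrame({"RANDID": [7], "STATE": [1], "DAYS": [0]})
cvd = pd.DataFrame({"RANDID": [7], "STATE": [3], "DAYS": [430]})
death = pd.DataFrame({"RANDID": [7], "STATE": [4], "DAYS": [430]})

# Concatenate in order of increasing state, then keep the LAST row for each
# (RANDID, DAYS) pair so the most advanced state wins a same-day tie.
stacked = pd.concat([baseline, cvd, death], ignore_index=True)
resolved = (
    stacked
    .drop_duplicates(subset=["RANDID", "DAYS"], keep="last")
    .reset_index(drop=True)
)
print(resolved)
```

Because the outcome depends on stacking order, the frames must be concatenated from the least to the most advanced state.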

## Also:

Some people will have covariate updates after TIME 0 and before DEATH, and others won't.

We want the latest covariates for those who survive to additional checkups (after TIME 0), where checkups provide information about a later transition.

But we *don't* want the age recorded at those later checkups. We keep the baseline age instead, because age is updated later from the time to each transition.
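
The idea of "latest covariates, baseline age" can be sketched with a `groupby` on made-up exam rows. This is an illustrative alternative, not the notebook's own method, which pairs two `drop_duplicates` calls (`keep="last"` and `keep="first"`); the values and the name `latest` are assumptions.

```python
import pandas as pd

# Toy exam-level rows (hypothetical values).
toy = pd.DataFrame({
    "RANDID":   [1, 1, 1, 2],
    "TIME":     [0, 2100, 4400, 0],
    "AGE":      [39, 45, 51, 60],
    "CURSMOKE": [1, 1, 0, 0],
    "DIABETES": [0, 0, 1, 0],
})

grouped = toy.sort_values(["RANDID", "TIME"]).groupby("RANDID")

# Latest recorded covariates, but age taken from the first (baseline) exam.
latest = grouped[["CURSMOKE", "DIABETES"]].last()
latest["AGE"] = grouped["AGE"].first()
latest = latest.reset_index()
print(latest)
```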

In [16]:
framingham_ms2 = at_risk[(at_risk["TIMEHYP"]>0) & (at_risk["PREVHYP"]==0) 
                         & (at_risk["HYPERTEN"] == 1) & (at_risk["TIMEHYP"]<last_day_of_study) &
                        (at_risk["DEATH"]==0) & (at_risk["CVD"]==0)].drop_duplicates(subset=["RANDID"], keep="last").reset_index()

framingham_ms2b = at_risk[(at_risk["TIMEHYP"]>0) & (at_risk["PREVHYP"]==0) 
                         & (at_risk["HYPERTEN"] == 1) & (at_risk["TIMEHYP"]<last_day_of_study) &
                        (at_risk["DEATH"]==0) & (at_risk["CVD"]==0)].drop_duplicates(subset=["RANDID"], keep="first").reset_index()

framingham_ms2["STATE"] = 2
framingham_ms2["DAYS"] = framingham_ms2["TIMEHYP"].copy()
framingham_ms2["AGE"] = framingham_ms2b["AGE"].copy()
framingham_ms2

Unnamed: 0,index,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,10,11252,2072,2,0,8766,0,8766,1,4285,0,46,2,1,0,2,4285
1,15,12629,0,1,0,8766,0,8766,1,2212,0,63,2,0,0,2,2212
2,19,12806,4289,3,0,8766,0,8766,1,8679,0,45,2,1,0,2,8679
3,43,34689,0,1,0,8766,0,8766,1,2157,0,38,2,1,0,2,2157
4,46,36459,0,1,0,8766,0,8766,1,1469,0,41,1,0,0,2,1469
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1173,11561,9939850,2064,2,0,8766,0,8766,1,4416,0,56,2,0,0,2,4416
1174,11568,9948320,0,1,0,8766,0,8766,1,1491,0,52,2,0,0,2,1491
1175,11572,9949037,2205,2,0,8766,0,8766,1,2856,0,48,1,0,0,2,2856
1176,11584,9961615,0,1,0,8766,0,8766,1,2205,0,45,1,1,0,2,2205


In [17]:
framingham_ms_temp = pd.concat([framingham_ms1, framingham_ms2], ignore_index=True)
framingham_ms_temp = framingham_ms_temp.drop_duplicates(subset=["RANDID", "DAYS"], keep="last").reset_index()
framingham_ms_temp = framingham_ms_temp.drop(columns=["level_0", "index"], axis=1)
framingham_ms_temp


Unnamed: 0,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0,1,0
1,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0,1,0
2,9428,0,1,0,8766,0,8766,0,8766,0,48,1,1,0,1,0
3,10552,0,1,1,2089,1,2956,1,0,1,61,2,1,0,2,0
4,11252,0,1,0,8766,0,8766,1,4285,0,46,2,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5388,9939850,2064,2,0,8766,0,8766,1,4416,0,56,2,0,0,2,4416
5389,9948320,0,1,0,8766,0,8766,1,1491,0,52,2,0,0,2,1491
5390,9949037,2205,2,0,8766,0,8766,1,2856,0,48,1,0,0,2,2856
5391,9961615,0,1,0,8766,0,8766,1,2205,0,45,1,1,0,2,2205


In [18]:
framingham_ms3 = at_risk[(at_risk["TIMECVD"]>0) & (at_risk["CVD"] == 1) & 
                         (at_risk["TIMECVD"]<last_day_of_study)].drop_duplicates(subset=["RANDID"], keep="last").reset_index()

framingham_ms3b = at_risk[(at_risk["TIMECVD"]>0) & (at_risk["CVD"] == 1) & 
                          (at_risk["TIMECVD"]<last_day_of_study)].drop_duplicates(subset=["RANDID"], keep="first").reset_index()

framingham_ms3["STATE"] = 3
framingham_ms3["DAYS"] = framingham_ms3["TIMECVD"].copy()
framingham_ms3["AGE"] = framingham_ms3b["AGE"].copy()
framingham_ms3

Unnamed: 0,index,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,1,2448,4628,3,1,6438,0,8766,0,8766,0,39,1,0,0,3,6438
1,8,10552,1977,2,1,2089,1,2956,1,0,1,61,2,1,0,3,2089
2,14,11263,4351,3,1,5719,0,8766,1,0,1,43,2,0,1,3,5719
3,57,43770,4375,3,1,6384,1,6410,1,723,1,52,2,0,1,3,6384
4,64,54224,0,1,1,430,1,430,0,430,0,47,1,1,0,3,430
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,11591,9967157,2195,2,1,3273,1,7362,0,7362,0,58,1,0,0,3,3273
992,11597,9969773,4305,3,1,7994,0,8766,1,0,1,50,2,0,1,3,7994
993,11606,9982118,4429,3,1,8346,1,8457,1,0,1,58,1,0,0,3,8346
994,11608,9984683,0,1,1,1884,1,4300,1,0,1,50,1,1,0,3,1884


In [19]:
framingham_ms_temp2 = pd.concat([framingham_ms_temp, framingham_ms3], ignore_index=True)
framingham_ms_temp2 = framingham_ms_temp2.drop_duplicates(subset=["RANDID", "DAYS"], keep="last").reset_index()
framingham_ms_temp2 = framingham_ms_temp2.drop(columns=["level_0", "index"], axis=1)
framingham_ms_temp2

Unnamed: 0,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0,1,0
1,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0,1,0
2,9428,0,1,0,8766,0,8766,0,8766,0,48,1,1,0,1,0
3,10552,0,1,1,2089,1,2956,1,0,1,61,2,1,0,2,0
4,11252,0,1,0,8766,0,8766,1,4285,0,46,2,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6384,9967157,2195,2,1,3273,1,7362,0,7362,0,58,1,0,0,3,3273
6385,9969773,4305,3,1,7994,0,8766,1,0,1,50,2,0,1,3,7994
6386,9982118,4429,3,1,8346,1,8457,1,0,1,58,1,0,0,3,8346
6387,9984683,0,1,1,1884,1,4300,1,0,1,50,1,1,0,3,1884


In [20]:
framingham_ms4 = at_risk[(at_risk["TIMEDTH"]>0) & (at_risk["DEATH"] == 1) & 
                         (at_risk["TIMEDTH"]<last_day_of_study)].drop_duplicates(subset=["RANDID"], keep="last").reset_index()

framingham_ms4b = at_risk[(at_risk["TIMEDTH"]>0) & (at_risk["DEATH"] == 1) & 
                         (at_risk["TIMEDTH"]<last_day_of_study)].drop_duplicates(subset=["RANDID"], keep="first").reset_index()

framingham_ms4["STATE"] = 4
framingham_ms4["DAYS"] = framingham_ms4["TIMEDTH"].copy()
framingham_ms4["AGE"] = framingham_ms4b["AGE"].copy()
framingham_ms4

Unnamed: 0,index,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,8,10552,1977,2,1,2089,1,2956,1,0,1,61,2,1,0,4,2956
1,34,23727,4503,3,0,5592,1,5592,1,0,1,41,2,0,0,4,5592
2,37,24721,4408,3,0,6411,1,6411,1,4408,1,39,2,1,0,4,6411
3,38,30928,0,1,0,146,1,146,1,0,1,38,2,1,0,4,146
4,42,33555,0,1,0,1442,1,1442,0,1442,0,46,2,1,0,4,1442
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1383,11607,9983319,0,1,0,565,1,565,1,0,1,68,1,0,0,4,565
1384,11608,9984683,0,1,1,1884,1,4300,1,0,1,50,1,1,0,4,4300
1385,11612,9989287,4439,3,0,7746,1,7746,0,7746,0,51,1,1,0,4,7746
1386,11615,9990894,4417,3,0,6433,1,6433,1,2219,1,48,2,1,0,4,6433


In [21]:
framingham_ms_temp3 = pd.concat([framingham_ms_temp2, framingham_ms4], ignore_index=True)
framingham_ms_temp3 = framingham_ms_temp3.drop_duplicates(subset=["RANDID", "DAYS"], keep="last").reset_index()
framingham_ms_temp3 = framingham_ms_temp3.drop(columns=["level_0", "index"], axis=1)
framingham_ms_temp3

Unnamed: 0,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0,1,0
1,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0,1,0
2,9428,0,1,0,8766,0,8766,0,8766,0,48,1,1,0,1,0
3,10552,0,1,1,2089,1,2956,1,0,1,61,2,1,0,2,0
4,11252,0,1,0,8766,0,8766,1,4285,0,46,2,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7620,9983319,0,1,0,565,1,565,1,0,1,68,1,0,0,4,565
7621,9984683,0,1,1,1884,1,4300,1,0,1,50,1,1,0,4,4300
7622,9989287,4439,3,0,7746,1,7746,0,7746,0,51,1,1,0,4,7746
7623,9990894,4417,3,0,6433,1,6433,1,2219,1,48,2,1,0,4,6433


In [22]:
# Defining the Censored Population:
# those who reached the end of their involvement, or of the study, without dying.
# Note: the filter also requires CVD == 0 or HYPERTEN == 0.

framingham_ms5 = at_risk[((at_risk["TIMEDTH"] > 0) & 
                          (at_risk["DEATH"] == 0) & 
                          ((at_risk["CVD"] == 0) | (at_risk["HYPERTEN"] == 0)))].drop_duplicates(subset=["RANDID"], keep="last").reset_index()
framingham_ms5b = at_risk[((at_risk["TIMEDTH"] > 0) & 
                          (at_risk["DEATH"] == 0) & 
                          ((at_risk["CVD"] == 0) | (at_risk["HYPERTEN"] == 0)))].drop_duplicates(subset=["RANDID"], keep="first").reset_index()
framingham_ms5["STATE"] = 99
framingham_ms5["DAYS"] = framingham_ms5["TIMEDTH"].copy()
framingham_ms5["AGE"] = framingham_ms5b["AGE"].copy()
framingham_ms5

Unnamed: 0,index,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,1,2448,4628,3,1,6438,0,8766,0,8766,0,39,1,0,0,99,8766
1,4,6238,4344,3,0,8766,0,8766,0,8766,0,46,2,0,0,99,8766
2,6,9428,2199,2,0,8766,0,8766,0,8766,0,48,1,1,0,99,8766
3,11,11252,4285,3,0,8766,0,8766,1,4285,1,46,2,1,0,99,8766
4,15,12629,0,1,0,8766,0,8766,1,2212,0,63,2,0,0,99,8766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2502,11583,9960803,2187,2,0,8766,0,8766,0,8766,0,47,1,0,0,99,8766
2503,11586,9961615,4684,3,0,8766,0,8766,1,2205,1,45,1,1,1,99,8766
2504,11603,9978986,4230,3,0,8766,0,8766,1,0,1,56,2,0,0,99,8766
2505,11623,9998212,4538,3,0,8766,0,8766,1,0,1,40,1,0,0,99,8766


In [23]:
framingham_ms_temp4 = pd.concat([framingham_ms_temp3, framingham_ms5], ignore_index=True)
framingham_ms_temp4 = framingham_ms_temp4.drop_duplicates(subset=["RANDID", "DAYS"], keep="last").reset_index()
framingham_ms_temp4 = framingham_ms_temp4.drop(columns=["level_0", "index"], axis=1)
framingham_ms_temp4

Unnamed: 0,RANDID,TIME,PERIOD,CVD,TIMECVD,DEATH,TIMEDTH,HYPERTEN,TIMEHYP,PREVHYP,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS
0,2448,0,1,1,6438,0,8766,0,8766,0,39,1,0,0,1,0
1,6238,0,1,0,8766,0,8766,0,8766,0,46,2,0,0,1,0
2,9428,0,1,0,8766,0,8766,0,8766,0,48,1,1,0,1,0
3,10552,0,1,1,2089,1,2956,1,0,1,61,2,1,0,2,0
4,11252,0,1,0,8766,0,8766,1,4285,0,46,2,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10127,9960803,2187,2,0,8766,0,8766,0,8766,0,47,1,0,0,99,8766
10128,9961615,4684,3,0,8766,0,8766,1,2205,1,45,1,1,1,99,8766
10129,9978986,4230,3,0,8766,0,8766,1,0,1,56,2,0,0,99,8766
10130,9998212,4538,3,0,8766,0,8766,1,0,1,40,1,0,0,99,8766


After adding in those who were censored (reached the end of the study or left before any diagnosis / state change), we have 10,132 rows recorded: starting states, state transitions, and censoring records.

In [24]:
framingham_ms = framingham_ms_temp4.copy()

In [25]:
framingham_ms = framingham_ms.drop(columns=["TIME", "PERIOD", "CVD", "TIMECVD", "DEATH", "TIMEDTH", "HYPERTEN", 
                                            "TIMEHYP", "PREVHYP"], axis=1)
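# Convert days to years. Note: 8,766 days is exactly 24 years at 365.25 days per year,
# so dividing by 365 slightly overstates elapsed time (24.016 years at study end).
framingham_ms["YEARS"] = framingham_ms["DAYS"]/365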
framingham_ms

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
0,2448,39,1,0,0,1,0,0.000000
1,6238,46,2,0,0,1,0,0.000000
2,9428,48,1,1,0,1,0,0.000000
3,10552,61,2,1,0,2,0,0.000000
4,11252,46,2,1,0,1,0,0.000000
...,...,...,...,...,...,...,...,...
10127,9960803,47,1,0,0,99,8766,24.016438
10128,9961615,45,1,1,1,99,8766,24.016438
10129,9978986,56,2,0,0,99,8766,24.016438
10130,9998212,40,1,0,0,99,8766,24.016438


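## Update ages

At this point AGE holds each participant's baseline age on every row. For every row after baseline (YEARS > 0), we add the elapsed time, rounded to the nearest whole year.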



In [26]:
framingham_ms.loc[(framingham_ms["YEARS"]>0) & (framingham_ms["STATE"]==2), "AGE"] = framingham_ms["AGE"].copy() + round(framingham_ms["YEARS"].copy())
framingham_ms.loc[(framingham_ms["YEARS"]>0) & (framingham_ms["STATE"]==3), "AGE"] = framingham_ms["AGE"].copy() + round(framingham_ms["YEARS"].copy())
framingham_ms.loc[(framingham_ms["YEARS"]>0) & (framingham_ms["STATE"]==4), "AGE"] = framingham_ms["AGE"].copy() + round(framingham_ms["YEARS"].copy())
framingham_ms.loc[(framingham_ms["YEARS"]>0) & (framingham_ms["STATE"]==99), "AGE"] = framingham_ms["AGE"].copy() + round(framingham_ms["YEARS"].copy())
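
The age update can be checked by hand on a made-up participant. This sketch differs from the cell above in two labeled ways: it divides by 365.25 rather than 365 (an assumption, chosen because 8,766 / 365.25 is exactly 24), and it writes the result to a hypothetical `AGE_AT_ENTRY` column instead of overwriting AGE.

```python
import pandas as pd

# Toy long-format rows for one hypothetical participant.
toy = pd.DataFrame({
    "RANDID": [1, 1, 1],
    "AGE":    [39, 39, 39],   # baseline age repeated on every row
    "STATE":  [1, 3, 99],
    "DAYS":   [0, 6438, 8766],
})

DAYS_PER_YEAR = 365.25  # assumption: the notebook cell divides by 365 instead
toy["YEARS"] = toy["DAYS"] / DAYS_PER_YEAR

# Baseline rows have YEARS == 0, so adding the rounded years leaves them unchanged.
toy["AGE_AT_ENTRY"] = toy["AGE"] + toy["YEARS"].round().astype(int)
print(toy)
```

For these rows the rounded ages are the same whichever divisor is used; the two conventions only diverge when a transition falls close to a half-year boundary.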

In [27]:
framingham_ms

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
0,2448,39.0,1,0,0,1,0,0.000000
1,6238,46.0,2,0,0,1,0,0.000000
2,9428,48.0,1,1,0,1,0,0.000000
3,10552,61.0,2,1,0,2,0,0.000000
4,11252,46.0,2,1,0,1,0,0.000000
...,...,...,...,...,...,...,...,...
10127,9960803,71.0,1,0,0,99,8766,24.016438
10128,9961615,69.0,1,1,1,99,8766,24.016438
10129,9978986,80.0,2,0,0,99,8766,24.016438
10130,9998212,64.0,1,0,0,99,8766,24.016438


In [28]:
framingham_ms = framingham_ms.sort_values(by=['RANDID', 'YEARS'])
framingham_ms = framingham_ms.reset_index()
framingham_ms = framingham_ms.drop(columns=["index"], axis=1)
framingham_ms

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
0,2448,39.0,1,0,0,1,0,0.000000
1,2448,57.0,1,0,0,3,6438,17.638356
2,2448,63.0,1,0,0,99,8766,24.016438
3,6238,46.0,2,0,0,1,0,0.000000
4,6238,70.0,2,0,0,99,8766,24.016438
...,...,...,...,...,...,...,...,...
10127,9998212,40.0,1,0,0,2,0,0.000000
10128,9998212,64.0,1,0,0,99,8766,24.016438
10129,9999312,39.0,2,1,0,1,0,0.000000
10130,9999312,51.0,2,1,0,2,4201,11.509589


## The final dataframe has 10,132 entries.

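- 2,917 persons experienced State 1 (No disease).
- 2,476 persons have a State 2 (Hypertension) row. Some of these persons (1,298) never experienced State 1, but began in State 2. This is fewer than the 3,065 counted earlier because cell 16 adds a State 2 row only for participants with CVD == 0 and DEATH == 0.
- 844 persons have a State 3 (Cardiovascular disease) row. The other 152 of the 996 CVD cases had a later-stacked record (such as death) on the same day, so their State 3 row was dropped.
- 1,388 persons experienced State 4 (Death).
- 2,507 persons reached the end of their participation in an unknown state, without dying (were Censored).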
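
Long-format rows like these can be turned into from/to pairs by lining each row up with the previous row of the same participant. The sketch below uses made-up rows; the column names `FROM_STATE`, `TO_STATE`, `START`, and `STOP` are hypothetical, and what a given multistate package expects as input is not specified here.

```python
import pandas as pd

# Toy long-format rows in the same layout as the final table (hypothetical values).
toy = pd.DataFrame({
    "RANDID": [1, 1, 1, 2, 2],
    "STATE":  [1, 3, 99, 2, 4],
    "YEARS":  [0.0, 17.6, 24.0, 0.0, 0.4],
})

toy = toy.sort_values(["RANDID", "YEARS"])
by_person = toy.groupby("RANDID")

# Pair each row with the previous row of the same participant;
# the first row of each participant has no predecessor and is dropped.
transitions = toy.assign(
    FROM_STATE=by_person["STATE"].shift(),
    START=by_person["YEARS"].shift(),
).dropna(subset=["FROM_STATE"])
transitions = transitions.rename(columns={"STATE": "TO_STATE", "YEARS": "STOP"})
transitions["FROM_STATE"] = transitions["FROM_STATE"].astype(int)

pairs = [
    (int(a), int(b))
    for a, b in zip(transitions["FROM_STATE"], transitions["TO_STATE"])
]
print(pairs)
```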

In [29]:
framingham_ms["STATE"].value_counts()

1     2917
99    2507
2     2476
4     1388
3      844
Name: STATE, dtype: int64

In [30]:
framingham_ms.to_csv("Datasets/framingham_ms.csv")

In [31]:
# Rows 21 and 22 (zero-based positions 20 and 21)
framingham_ms.iloc[20:22]

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
20,12806,69.0,2,1,0,99,8766,24.016438
21,14367,52.0,1,0,0,2,0,0.0


In [32]:
framingham_ms[framingham_ms["RANDID"]==12629]

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
15,12629,63.0,2,0,0,1,0,0.0
16,12629,69.0,2,0,0,2,2212,6.060274
17,12629,87.0,2,0,0,99,8766,24.016438


In [33]:
framingham_ms[framingham_ms["RANDID"]==30928]

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
35,30928,38.0,2,1,0,2,0,0.0
36,30928,38.0,2,1,0,4,146,0.4


In [34]:
framingham_ms[framingham_ms["RANDID"]==69134]

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
79,69134,59.0,2,0,0,2,0,0.0
80,69134,61.0,2,0,0,3,724,1.983562
81,69134,62.0,2,0,0,4,1047,2.868493


In [35]:
framingham_ms[framingham_ms["RANDID"]==63221]

Unnamed: 0,RANDID,AGE,SEX,CURSMOKE,DIABETES,STATE,DAYS,YEARS
68,63221,61.0,2,0,0,2,0,0.0
69,63221,61.0,2,0,0,4,168,0.460274
