## Framingham Heart Study

When a subject has multiple visits, we need to reshape the data so that each row covers one interval between visits, with the start and end times of that interval along with its covariates.

```
subject time1 time2 death creatinine1
5     0    90     0        0.92
5    90   120     0        1.53       
5   120   185     1        1.2
```
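One way to reach this layout from one-row-per-visit data is to pair each visit time with the next visit time for the same subject. The sketch below is only an illustration: the `visits` data frame and its columns (`visit_time`, `end_time`, `died`) are hypothetical names, not columns of the Framingham file, and the values simply mirror the table above.

```r
library(dplyr)

# Hypothetical per-visit data for one subject (values mirror the table above)
visits <- data.frame(
  subject    = c(5, 5, 5),
  visit_time = c(0, 90, 120),
  creatinine = c(0.92, 1.53, 1.2),
  end_time   = 185,  # assumed end of follow-up for this subject
  died       = 1     # assumed coding: 1 = died at end_time, 0 = censored
)

intervals <- visits %>%
  arrange(subject, visit_time) %>%
  group_by(subject) %>%
  mutate(
    time1 = visit_time,
    # each interval ends at the next visit; the last one ends at end of follow-up
    time2 = lead(visit_time, default = first(end_time)),
    # the event can only be recorded in the subject's final interval
    death = as.integer(row_number() == n() & died == 1)
  ) %>%
  ungroup() %>%
  select(subject, time1, time2, death, creatinine)

print(intervals)
```

Each covariate value is carried on the interval that starts at the visit where it was measured, which is the usual convention for time-varying covariates.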

In [5]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [1]:
df <- read.csv("../../../../usecases_data/framington_pseudo_heart_study/FRAMINGHAM_teaching_2019a/csv/frmgham2.csv")

In [2]:
head(df)

,RANDID,SEX,TOTCHOL,AGE,SYSBP,DIABP,CURSMOKE,CIGPDAY,BMI,DIABETES,⋯,CVD,HYPERTEN,TIMEAP,TIMEMI,TIMEMIFC,TIMECHD,TIMESTRK,TIMECVD,TIMEDTH,TIMEHYP
,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<int>,<int>,<dbl>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,2448,1,195,39,106.0,70.0,0,0,26.97,0,⋯,1,0,8766,6438,6438,6438,8766,6438,8766,8766
2,2448,1,209,52,121.0,66.0,0,0,,0,⋯,1,0,8766,6438,6438,6438,8766,6438,8766,8766
3,6238,2,250,46,121.0,81.0,0,0,28.73,0,⋯,0,0,8766,8766,8766,8766,8766,8766,8766,8766
4,6238,2,260,52,105.0,69.5,0,0,29.43,0,⋯,0,0,8766,8766,8766,8766,8766,8766,8766,8766
5,6238,2,237,58,108.0,66.0,0,0,28.5,0,⋯,0,0,8766,8766,8766,8766,8766,8766,8766,8766
6,9428,1,245,48,127.5,80.0,1,20,25.34,0,⋯,0,0,8766,8766,8766,8766,8766,8766,8766,8766


In [4]:
names(df)

#### Data wrangling

To tell R that we have multiple visits from the same individual, structure the data so that each row is one interval, then specify the interval (start, end) within `Surv` like the following:

```
library(survival)  # provides coxph() and Surv()
cox_phm <- coxph(Surv(time=START_TIME, time2=TIME, event=DEATH) ~ SEX, data=df)
```

How can you wrangle the data such that this will work?
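One possible approach, offered as a sketch rather than the intended answer, is to treat each exam time as the start of an interval and the next exam (or the end of follow-up) as its end. The code below runs on simulated data, not the Framingham file. It assumes that `TIME` is the exam time in days, that `TIMEDTH` is the time of death or end of follow-up, and that `DEATH` is a per-subject 0/1 flag. `END_TIME` and `EVENT` are names introduced here for illustration; `END_TIME` plays the role of `time2`.

```r
library(dplyr)
library(survival)

set.seed(1)
n_subj <- 200

# Simulated stand-in for the Framingham columns (NOT real data)
subjects <- data.frame(
  RANDID  = seq_len(n_subj),
  SEX     = sample(1:2, n_subj, replace = TRUE),
  TIMEDTH = pmin(round(rexp(n_subj, rate = 1 / 9000)) + 1, 8766)
)
subjects$DEATH <- as.integer(subjects$TIMEDTH < 8766)

# Up to three exams per subject (days 0, 2000, 4000), kept only while under follow-up
exams <- data.frame(TIME = c(0, 2000, 4000))
toy <- merge(subjects, exams, by = NULL) %>%  # by = NULL gives the Cartesian product
  filter(TIME < TIMEDTH)

long <- toy %>%
  arrange(RANDID, TIME) %>%
  group_by(RANDID) %>%
  mutate(
    START_TIME = TIME,
    # interval ends at the next exam, or at TIMEDTH for the last exam
    END_TIME   = lead(TIME, default = first(TIMEDTH)),
    # death is recorded only on the subject's final interval
    EVENT      = as.integer(row_number() == n() & DEATH == 1)
  ) %>%
  ungroup()

cox_phm <- coxph(Surv(time = START_TIME, time2 = END_TIME, event = EVENT) ~ SEX,
                 data = long)
summary(cox_phm)
```

Setting the event flag only on the final interval matters: if a per-subject `DEATH` flag were left on every row, each death would be counted once per visit.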

In [6]:
agg_df <- df %>% group_by(RANDID) %>% summarise(sex=tail(SEX, 1),
                                                age_start=min(AGE),
                                                time_start=min(TIME),
                                                time_end=max(TIME),
                                                num_visits=n())

`summarise()` ungrouping output (override with `.groups` argument)



In [8]:
head(agg_df)

RANDID,sex,age_start,time_start,time_end,num_visits
<int>,<int>,<int>,<int>,<int>,<int>
2448,1,39,0,4628,2
6238,2,46,0,4344,3
9428,1,48,0,2199,2
10552,2,61,0,1977,2
11252,2,46,0,4285,3
11263,2,43,0,4351,3


#### Check the data quality of your wrangling: what would you look at?
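A few checks that may be worth running on data in the (start, end) layout described above are sketched here. The `long` data frame is a small hypothetical example, and `START_TIME`, `END_TIME` and `EVENT` are illustrative column names, so adapt them to whatever your wrangling produced.

```r
library(dplyr)

# Hypothetical start/stop data for two subjects
long <- data.frame(
  RANDID     = c(1, 1, 1, 2, 2),
  START_TIME = c(0, 2000, 4000, 0, 2100),
  END_TIME   = c(2000, 4000, 8766, 2100, 3000),
  EVENT      = c(0, 0, 0, 0, 1)
)

checks <- long %>%
  arrange(RANDID, START_TIME) %>%
  group_by(RANDID) %>%
  summarise(
    num_intervals   = n(),
    positive_length = all(END_TIME > START_TIME),             # no empty or reversed intervals
    contiguous      = all(START_TIME[-1] == END_TIME[-n()]),  # no gaps or overlaps
    at_most_1_event = sum(EVENT) <= 1,                        # a subject dies at most once
    event_is_last   = all(EVENT[-n()] == 0),                  # event only in the final interval
    .groups = "drop"
  )

print(checks)

# Missing values per column are also worth a look
print(colSums(is.na(long)))
```

Comparing `num_intervals` against the number of visits per subject in the raw data is another quick way to confirm that no rows were lost or duplicated.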

#### Examine the survival rate by gender first
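A common first look is a Kaplan-Meier curve per group, with a log-rank test for the difference. The sketch below uses simulated one-row-per-subject data rather than the Framingham file. The column names mirror it (`SEX`, `TIMEDTH`, `DEATH`), but the assumption that `TIMEDTH` is the time of death or censoring and `DEATH` is a 0/1 flag should be checked against the data dictionary.

```r
library(survival)

set.seed(2)
n_subj <- 200

# Simulated one-row-per-subject data (NOT real data)
subj <- data.frame(
  SEX     = sample(1:2, n_subj, replace = TRUE),
  TIMEDTH = pmin(round(rexp(n_subj, rate = 1 / 9000)) + 1, 8766)
)
subj$DEATH <- as.integer(subj$TIMEDTH < 8766)

# Kaplan-Meier estimate of survival, stratified by SEX
km <- survfit(Surv(TIMEDTH, DEATH) ~ SEX, data = subj)
print(km)

plot(km, col = c("blue", "red"),
     xlab = "Days since baseline", ylab = "Survival probability")
legend("bottomleft", legend = c("SEX = 1", "SEX = 2"),
       col = c("blue", "red"), lty = 1)

# Log-rank test for a difference between the two curves
lr <- survdiff(Surv(TIMEDTH, DEATH) ~ SEX, data = subj)
print(lr)
```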

#### What else would you add into the model?