## Understanding the Loop Dataset
This notebook describes the structure of the Loop dataset and suggests how to process the data.

## The Loop study

**Title**: An Observational Study of Individuals with Type 1 Diabetes Using the Loop System for Automated Insulin Delivery: The Loop Observational Study (LOS)


**Description**: Passive collection of data on the efficacy, safety, usability, and quality of life/psychosocial effects of the Loop System
    
**Devices**: insulin pump and a Dexcom or Medtronic CGM

**Study Population**: People of any age with Type 1 Diabetes

# Data
The study data folder is named **Loop study public dataset 2023-01-31**

The DataGlossary.rtf file identifies the following relevant files, which are stored in the **Data Tables** subfolder.

* **LOOPDeviceBasal#.txt**: # = 1-3. LOOP study basal data exported from Tidepool
* **LOOPDeviceBolus.txt**: Bolus data exported from Tidepool
* **LOOPDeviceCGM#.txt**: # = 1-6. Downloaded CGM data
* **PtRoster.txt**: Patient Roster

These are delimited text files (`|` separator) that hold columns for the insulin pump events and the CGM readings. The glossary provides information about each column. Each file contains a limited number of columns compared to the FLAIR data. The columns of interest in each file are listed below; the raw basal file, for example, also contains record IDs and `Origin*` device metadata.

## LOOPDeviceBasal1-3
* **PtID**: Patient ID
* **DeviceDtTm**: Local device date and time; note not present in most rows because unavailable in Tidepool data source
* **UTCDtTm**: Date and time with timezone offset
* **Duration**: Actual number of milliseconds basal will be in effect
* **ExpectedDuration**: Expected number of milliseconds basal will be in effect
* **Percnt**: Percentage of suppressed basal that should be delivered
* **Rate**: Number of units per hour
* **ExtendedBolusPortion**: Flag distinguishing the immediate (Now) portion of the bolus (if any) from the extended (Later) portion [Now, Later]
* **SuprBasalType**: Suppressed basal delivery type (suppressed basal = basal event not being delivered because this one is active)
* **SuprDuration**: Suppressed duration
* **SuprRate**: Suppressed rate
* **TmZnOffset**: Timezone offset
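
Because `Rate` is given in units per hour and `Duration` in milliseconds, the insulin delivered by a single basal record can be estimated as rate times duration. The following is a minimal sketch, assuming both fields are populated and the rate is constant over the segment; `basal_units` is a hypothetical helper name, not part of the dataset or any library.

```python
MS_PER_HOUR = 60 * 60 * 1000


def basal_units(rate_u_per_h: float, duration_ms: float) -> float:
    """Estimate the insulin (in units) delivered by one basal segment."""
    return rate_u_per_h * duration_ms / MS_PER_HOUR


# A 5-minute temp basal at 1.2 U/h delivers about 0.1 U
print(round(basal_units(1.2, 5 * 60 * 1000), 3))
```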

## LOOPDeviceBolus
* **PtID**: Patient ID
* **DeviceDtTm**: Local device date and time; note not present in most rows because unavailable in Tidepool data source
* **UTCDtTm**: Device date and time (with timezone offset)
* **BolusType**: Subtype of data (ex: "Normal" and "Square" are subtypes of "Bolus" type)
* **Normal**: Number of units of normal bolus
* **ExpectedNormal**: Expected number of units of normal bolus
* **Extended**: Number of units for extended delivery
* **ExpectedExtended**: Expected number of units for extended delivery
* **Duration**: Time span over which the bolus was delivered (milliseconds for Tidepool data, minutes for Diasend data)
* **ExpectedDuration**: Expected time span over which the bolus should have been delivered (milliseconds for Tidepool data, minutes for Diasend data)
* **TmZnOffset**: Timezone offset
* **OriginName**: Data origin name
* **OriginType**: Data origin type
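
Since the unit of `Duration` depends on the data source, it may help to normalise it into a `timedelta` as early as possible. The sketch below assumes the source of each row is already known; the `source` labels and the helper name `bolus_duration` are hypothetical, and how to derive the source from the data is still an open question.

```python
from datetime import timedelta


def bolus_duration(value: float, source: str) -> timedelta:
    """Convert a raw bolus Duration into a timedelta.

    `source` is an assumed label: "tidepool" (milliseconds)
    or "diasend" (minutes).
    """
    if source == "tidepool":
        return timedelta(milliseconds=value)
    if source == "diasend":
        return timedelta(minutes=value)
    raise ValueError(f"unknown source: {source!r}")


# The same 30-minute extended bolus expressed in both conventions
print(bolus_duration(1_800_000, "tidepool"))
print(bolus_duration(30, "diasend"))
```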

## LOOPDeviceCGM1-6
* **PtID**: Patient ID
* **DeviceDtTm**: Local device date and time; note not present in most rows because unavailable in Tidepool data source
* **UTCDtTm**: Device date and time (with timezone offset)
* **CGMVal**: Glucose reading from the CGM (in mmol/L from Tidepool)
* **Units**: Glucose reading units
* **DexInternalDtTm**: Dexcom Internal date and time
* **DexTrend**: Dexcom trend
* **TmZnOffset**: Timezone offset
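
`CGMVal` is stored in mmol/L, while many glucose metrics are reported in mg/dL. A minimal conversion sketch follows; the factor 18.016 is an assumption derived from the molar mass of glucose (about 180.16 g/mol), and `mmol_to_mgdl` is a hypothetical helper name.

```python
# Assumed conversion factor: 1 mmol/L of glucose is about 18.016 mg/dL
MGDL_PER_MMOLL = 18.016


def mmol_to_mgdl(value_mmol_l: float) -> float:
    """Convert a glucose value from mmol/L to mg/dL."""
    return value_mmol_l * MGDL_PER_MMOLL


print(round(mmol_to_mgdl(5.5)))  # 99
```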


## Notes
* There are 3 Basal files, 1 Bolus file, and 6 CGM files
* The Basal files are 2.9GB, 2.9GB, and 1.35GB in size
* The bolus file is 349 MB
* The CGM files are 2.14, 2.24, 2.3, 2.31, 2.33, and 1.53 GB
* There is exercise data contained within LOOPDeviceExercise
* There is food data within LOOPDeviceFood

## Questions
* How do we determine whether the data was uploaded from Tidepool or Diasend? This affects how the duration of extended boluses is interpreted (milliseconds vs. minutes).
  

In [1]:
import os, sys, time, random
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from matplotlib import pyplot as plt
import dask.dataframe as dd

## Determine best way to load large files

In [2]:
# Build the paths to the raw and cleaned data folders
current_dir = os.getcwd()
original_data_path = os.path.join(current_dir, '..', 'data/raw')
cleaned_data_path = os.path.join(current_dir,  '..', 'data/cleaned')
path = os.path.join(original_data_path, 'Loop study public dataset 2023-01-31', 'Data Tables', 'LOOPDeviceBasal1.txt')

In [4]:
t = time.time()
df_insulin = pd.read_csv(path, sep="|", low_memory=False)
elapsed = time.time() - t
print(f"Loading one file of the Loop basal data (no optimizations) takes {elapsed:.2f}s")

Loading one file of the Loop basal data (no optimizations) takes 675.51s


In [5]:
df_insulin.columns

Index(['PtID', 'RecID', 'ParentLOOPDeviceUploadsID', 'DeviceDtTm', 'UTCDtTm',
       'BasalType', 'Duration', 'ExpectedDuration', 'Percnt', 'Rate',
       'SuprBasalType', 'SuprDuration', 'SuprRate', 'TmZnOffset', 'OriginName',
       'OriginVers', 'OriginType', 'OriginDeviceFirmwrVer',
       'OriginDeviceHardwrVer', 'OriginDeviceManufact', 'OriginDeviceModel',
       'OriginOperatingSystVer', 'OriginProductType'],
      dtype='object')

In [4]:
t = time.time()
df_insulin = pd.read_csv(path, sep="|", low_memory=False,
                         usecols=['PtID', 'UTCDtTm', 'BasalType', 'ExpectedDuration', 'Percnt', 'Rate',
                         'SuprBasalType', 'SuprDuration', 'SuprRate', 'TmZnOffset','OriginName', 'OriginVers', 'OriginType'])
elapsed = time.time() - t
print(f"Loading the Loop basal data (with reduced columns) takes {elapsed:.2f}s")

Loading the Loop basal data (with reduced columns) takes 507.86s
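
Both pandas loads above take several minutes. One pandas-only option is to read the file in chunks with explicit dtypes and aggregate as we go, so the whole file never has to sit in memory at once. The sketch below uses a tiny in-memory stand-in with made-up rows instead of the real file; the chunk size and dtypes are illustrative assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for LOOPDeviceBasal1.txt (made-up rows, same "|" separator)
sample = io.StringIO(
    "PtID|UTCDtTm|BasalType|Rate\n"
    "1|2018-05-29 10:02:56|temp|1.475\n"
    "1|2018-05-29 09:57:55|temp|1.196\n"
    "2|2018-05-29 09:54:04|scheduled|1.6\n"
)

rows_per_patient = {}
reader = pd.read_csv(
    sample,
    sep="|",
    usecols=["PtID", "Rate"],                    # read only what is needed
    dtype={"PtID": "int32", "Rate": "float32"},  # smaller dtypes save memory
    chunksize=2,                                 # use e.g. 1_000_000 on real data
)
for chunk in reader:
    # Aggregate each chunk, then merge into the running totals
    for pt_id, n in chunk.groupby("PtID").size().items():
        rows_per_patient[int(pt_id)] = rows_per_patient.get(int(pt_id), 0) + int(n)

print(rows_per_patient)  # {1: 2, 2: 1}
```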


### Try dask for improved load times

In [59]:
t = time.time()
df = dd.read_csv(path, sep="|",
                 usecols=['PtID', 'UTCDtTm', 'BasalType', 'ExpectedDuration', 'Percnt', 'Rate',
                         'SuprBasalType', 'SuprDuration', 'SuprRate', 'TmZnOffset'],
                 dtype={'DeviceDtTm': 'object',
                       },
                parse_dates=[1])
elapsed = time.time() - t
print(f"Loading one file of the Loop basal data with Dask while parsing dates takes {elapsed:.2f}s")

Loading one file of the Loop basal data with Dask while parsing dates takes 0.05s


In [36]:
display(df.head());
print('Length of Dataframe: ',len(df.PtID.to_dask_array(lengths=True)))

| | PtID | UTCDtTm | BasalType | ExpectedDuration | Percnt | Rate | SuprBasalType | SuprDuration | SuprRate | TmZnOffset |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1082 | 2018-05-29 10:02:56 | temp | | | 1.475 | scheduled | | 1.6 | |
| 1 | 1082 | 2018-05-29 09:57:55 | temp | | | 1.196 | scheduled | | 1.6 | |
| 2 | 1082 | 2018-05-29 09:54:04 | temp | | | 1.558 | scheduled | | 1.6 | |
| 3 | 1082 | 2018-05-29 09:49:44 | temp | | | 1.385 | scheduled | | 1.6 | |
| 4 | 1082 | 2018-05-29 09:45:42 | temp | | | 1.488 | scheduled | | 1.6 | |


Length of Dataframe:  19156089


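* Dask's `read_csv` is lazy: the 0.05s above only covers setting up the task graph, and the file is actually read when a result is requested (e.g. `head()`, `len()` or `compute()`). This lets us set up work on these large data sets quickly and process them partition by partition.
* Dask dataframes expose much of the pandas API, so they are used in largely the same way as pandas dataframes.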
* Documentation: https://docs.dask.org/en/stable/dataframe.html

## Load the data

In [28]:
path = os.path.join(original_data_path, 'Loop study public dataset 2023-01-31', 'Data Tables', 'LOOPDeviceBasal1.txt')
df_basal = dd.read_csv(path, sep="|",
                 usecols=['PtID', 'UTCDtTm', 'BasalType', 'ExpectedDuration', 'Percnt', 'Rate'],
                 dtype={'DeviceDtTm': 'object',
                       },
                parse_dates=[1])

path = os.path.join(original_data_path, 'Loop study public dataset 2023-01-31', 'Data Tables', 'LOOPDeviceBolus.txt')
df_bolus = dd.read_csv(path, sep="|",
                 usecols=['PtID', 'UTCDtTm', 'BolusType', 'Normal', 'Extended', 'Duration'],
                 dtype={'DeviceDtTm': 'object',
                       },
                parse_dates=[1])

path = os.path.join(original_data_path, 'Loop study public dataset 2023-01-31', 'Data Tables', 'LOOPDeviceCGM1.txt')
df_cgm = dd.read_csv(path, sep="|",
                 usecols=['PtID', 'UTCDtTm', 'CGMVal', 'Units', 'DexInternalDtTm'],
                 dtype={'DeviceDtTm': 'object',
                       'DexInternalDtTm': 'object'},
                parse_dates=[1])

## Inspecting the Dask dataframe

### The data is partitioned into multiple pandas dataframes

In [42]:
partitions = df_cgm.npartitions 
print(f'There are {partitions:d} partitions in the CGM data')

There are 33 partitions in the CGM data


In [47]:
length = len(df_cgm.partitions[0])
print(f'There are {length:d} rows in the first partition of the CGM data')

There are 580069 rows in the first partition of the CGM data


### Number of unique values across the entire dataframe

In [62]:
df_cgm.melt().groupby('variable')['value'].nunique().compute()

variable
CGMVal               315691
DexInternalDtTm      123994
PtID                    589
UTCDtTm            14592206
Units                     1
Name: value, dtype: int64

### Number of unique values in the first partition

In [64]:
df_cgm.partitions[0].melt().groupby('variable')['value'].nunique().compute()

variable
CGMVal              25351
DexInternalDtTm      8659
PtID                   24
UTCDtTm            558881
Units                   1
Name: value, dtype: int64

In [70]:
df_cgm.partitions[0].PtID.unique().compute()

0     1082
1     1173
2      296
3     1152
4     1013
5      829
6      104
7      942
8      105
9      960
10     467
11     458
12     589
13     407
14     856
15     605
16     738
17     787
18     691
19     763
20     653
21     511
22      85
23     164
Name: PtID, dtype: int64

In [71]:
df_cgm.partitions[1].PtID.unique().compute()

0      164
1      425
2      213
3     1080
4      153
5      621
6      462
7      543
8      607
9      826
10      67
11     961
12     970
13     456
14    1082
15    1173
16     296
17    1152
18    1013
19     104
20     942
21     105
22     960
23     467
24     605
25     458
26     589
27     407
28     856
29     738
30     787
31     691
32     763
33     653
34     511
35      85
36     920
37     936
38    1079
39    1081
40     968
41    1138
42     125
43     644
44     859
45     650
46     587
47     212
Name: PtID, dtype: int64

### Patient IDs overlap between partitions
How do we ensure that each patient ID appears in only one partition?

In [74]:
print('Partition 1 date range: ', df_cgm.partitions[0].UTCDtTm.min().compute(),' to ', df_cgm.partitions[0].UTCDtTm.max().compute())

Partition 1 date range:  2017-10-20 15:11:26  to  2019-01-13 03:49:46


In [75]:
print('Partition 2 date range: ', df_cgm.partitions[1].UTCDtTm.min().compute(),' to ', df_cgm.partitions[1].UTCDtTm.max().compute())

Partition 2 date range:  2017-10-23 19:44:45  to  2019-01-17 04:39:34


In [76]:
print('Partition 1 CGM range: ', df_cgm.partitions[0].CGMVal.min().compute(),' to ', df_cgm.partitions[0].CGMVal.max().compute())

Partition 1 CGM range:  2.10928  to  22.2585


In [77]:
print('Partition 2 CGM range: ', df_cgm.partitions[1].CGMVal.min().compute(),' to ', df_cgm.partitions[1].CGMVal.max().compute())

Partition 2 CGM range:  1.77624  to  22.2585


In [3]:
path = os.path.join(original_data_path, 'Loop study public dataset 2023-01-31', 'Data Tables', 'LOOPDeviceCGM1.txt')
t = time.time()
df_cgm = dd.read_csv(path, sep="|",
                 usecols=['PtID', 'UTCDtTm', 'CGMVal', 'Units', 'DexInternalDtTm'],
                 dtype={'DeviceDtTm': 'object',
                       'DexInternalDtTm': 'object'},
                parse_dates=[1])
elapsed = time.time() - t
print(f"Loading Loop CGM data with Dask while parsing dates takes {elapsed:.2f}s")

Loading Loop CGM data with Dask while parsing dates takes 0.03s


In [8]:
t = time.time()
df_cgm_dt = df_cgm.set_index(df_cgm.UTCDtTm, sorted=False)
df_cgm_dt = df_cgm_dt.repartition(npartitions=25)
elapsed = time.time() - t
print(f"Repartitioning data takes {elapsed:.2f}s")

Repartitioning data takes 141.12s


In [82]:
print('Partition 1 date range: ', df_cgm_dt.partitions[0].UTCDtTm.min().compute(),' to ', df_cgm_dt.partitions[0].UTCDtTm.max().compute())

Partition 1 date range:  2017-03-05 06:21:42  to  2018-01-23 18:34:48


In [83]:
print('Partition 2 date range: ', df_cgm_dt.partitions[1].UTCDtTm.min().compute(),' to ', df_cgm_dt.partitions[1].UTCDtTm.max().compute())

Partition 2 date range:  2018-01-23 18:34:50  to  2018-03-01 17:03:59


In [5]:
t = time.time()
df_cgm_id = df_cgm.set_index(df_cgm.PtID, sorted=False)
df_cgm_id = df_cgm_id.repartition(npartitions=589)
elapsed = time.time() - t
print(f"Repartitioning data takes {elapsed:.2f}s")

Repartitioning data takes 137.21s


In [89]:
df_cgm_id.partitions[0].PtID.unique().compute()

0    3
Name: PtID, dtype: int64

In [90]:
df_cgm_id.partitions[1].PtID.unique().compute()

0    4
Name: PtID, dtype: int64

In [91]:
display(df_cgm_id.partitions[0].head());

| PtID (index) | PtID | UTCDtTm | CGMVal | Units | DexInternalDtTm |
|---|---|---|---|---|---|
| 3 | 3 | 2018-10-20 12:12:57 | 9.71381 | mmol/L | |
| 3 | 3 | 2018-10-12 08:25:18 | 15.1535 | mmol/L | |
| 3 | 3 | 2018-10-12 08:20:17 | 15.5421 | mmol/L | |
| 3 | 3 | 2018-10-12 08:15:18 | 15.9307 | mmol/L | |
| 3 | 3 | 2018-10-12 08:10:18 | 16.0972 | mmol/L | |


In [6]:
df_cgm_id

| npartitions=589 | PtID | UTCDtTm | CGMVal | Units | DexInternalDtTm |
|---|---|---|---|---|---|
| 3 | int64 | datetime64[ns] | float64 | string | string |
| 4 | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| 1209 | ... | ... | ... | ... | ... |
| 1211 | ... | ... | ... | ... | ... |


In [3]:
path = os.path.join(original_data_path, 'Loop study public dataset 2023-01-31', 'Data Tables', 'LOOPDeviceCGM*.txt')
t = time.time()
df_cgm = dd.read_csv(path, sep="|",
                 usecols=['PtID', 'UTCDtTm', 'CGMVal', 'Units'],
                 dtype={'DeviceDtTm': 'object',
                       'DexInternalDtTm': 'object'},
                parse_dates=[1])
elapsed = time.time() - t
print(f"Loading all 6 Loop CGM data files with Dask while parsing dates takes {elapsed:.2f}s")

Loading all 6 Loop CGM data files with Dask while parsing dates takes 0.04s


In [8]:
df_cgm.melt().groupby('variable')['value'].nunique().compute()

variable
CGMVal               346237
DexInternalDtTm      343999
PtID                    851
UTCDtTm            45095186
Units                     1
Name: value, dtype: int64

In [4]:
t = time.time()
df_cgm_id = df_cgm.set_index(df_cgm.PtID, sorted=False)
df_cgm_id = df_cgm_id.repartition(npartitions=851)
elapsed = time.time() - t
print(f"Repartitioning CGM data into individual PtID partitions takes {elapsed:.2f}s")

Repartitioning CGM data into individual PtID partitions takes 796.21s


In [6]:
display(df_cgm_id.partitions[800].head(15));

| PtID (index) | PtID | UTCDtTm | CGMVal | Units | DexInternalDtTm |
|---|---|---|---|---|---|
| 1211 | 1211 | 2018-10-06 02:30:26 | 6.93843 | mmol/L | |
| 1211 | 1211 | 2018-09-09 11:03:08 | 6.82742 | mmol/L | |
| 1211 | 1211 | 2018-09-09 11:03:08 | 6.82742 | mmol/L | |
| 1211 | 1211 | 2018-09-09 10:58:08 | 6.60539 | mmol/L | |
| 1211 | 1211 | 2018-09-09 10:58:08 | 6.60539 | mmol/L | |
| 1211 | 1211 | 2018-09-09 10:53:08 | 6.16133 | mmol/L | |
| 1211 | 1211 | 2018-08-21 05:04:11 | 9.93584 | mmol/L | |
| 1211 | 1211 | 2018-08-21 04:59:11 | 9.93584 | mmol/L | |
| 1211 | 1211 | 2018-08-21 04:59:11 | 9.93584 | mmol/L | |
| 1211 | 1211 | 2018-08-21 04:54:12 | 10.0469 | mmol/L | |
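
The rows above contain repeated timestamps for the same patient (for example, 2018-09-09 11:03:08 appears twice) and are not in chronological order. A possible cleaning step, sketched here in pandas on a few of those rows, is to drop duplicates on `PtID` and `UTCDtTm` and then sort by time. Whether the first duplicate is the right one to keep is an assumption that should be checked against the full data.

```python
import pandas as pd

# A few rows copied from the partition shown above
cgm = pd.DataFrame({
    "PtID": [1211, 1211, 1211, 1211, 1211],
    "UTCDtTm": pd.to_datetime([
        "2018-09-09 11:03:08",
        "2018-09-09 11:03:08",
        "2018-09-09 10:58:08",
        "2018-09-09 10:58:08",
        "2018-09-09 10:53:08",
    ]),
    "CGMVal": [6.82742, 6.82742, 6.60539, 6.60539, 6.16133],
})

clean = (
    cgm.drop_duplicates(subset=["PtID", "UTCDtTm"], keep="first")
       .sort_values(["PtID", "UTCDtTm"])
       .reset_index(drop=True)
)
print(clean)
```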
