## Project & data description

This project aims to support type 1 (T1) diabetes management by applying machine-learning analytics to wearable sensor signals. The dataset was acquired in a study of 20 patients with T1 diabetes, from the diabetes center of Royal Berkshire Hospital, who used FreeStyle LibreLink.

This is the first stage of the project: we cluster the glucose time series and identify phenotypes based on glucose variation.

## Preprocessing

In [None]:
import pandas as pd
import os
import pickle
import datetime
import numbers
import numpy as np

from joblib import Parallel, delayed
from tqdm import tqdm
import math

import itertools
from random import sample 

### 1. Formalize

The original dataset is noisy. The first preprocessing step merges the duplicated glucose readings into a single column and converts the device timestamps into a sorted time index using pandas.

In [None]:
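def extract_bg(df1, countNo):
    # Use the historic reading, falling back to the scan reading where it is missing.
    df1['bg'] = df1['Historic Glucose mmol/L'].fillna(df1['Scan Glucose mmol/L'])
    # Copy the slice so the assignments below do not trigger SettingWithCopyWarning.
    df_p = df1[['Device Timestamp','bg']].copy()
    df_p.columns = ['time', 'bg']
    df_p['time'] = pd.to_datetime(df_p['time'], format='%d-%m-%Y %H:%M')
    df_p = df_p.set_index('time').sort_index(ascending=True)
#     df_p = df_p.interpolate()
    df_p['No'] = countNo  # patient number
    return df_p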
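
The merge and timestamp conversion performed by `extract_bg` can be illustrated on a tiny frame. This is a minimal sketch: the three rows are made-up values, not study data, and only the column names and timestamp format are taken from the function above.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the export used above.
raw = pd.DataFrame({
    "Device Timestamp": ["01-02-2020 00:30", "01-02-2020 00:00", "01-02-2020 00:15"],
    "Historic Glucose mmol/L": [6.1, 5.8, None],
    "Scan Glucose mmol/L": [None, None, 6.0],
})

# Historic reading first, scan reading as the fallback.
bg = raw["Historic Glucose mmol/L"].fillna(raw["Scan Glucose mmol/L"])

tidy = (
    pd.DataFrame({
        "time": pd.to_datetime(raw["Device Timestamp"], format="%d-%m-%Y %H:%M"),
        "bg": bg,
    })
    .set_index("time")
    .sort_index()
)
print(tidy)
```

After this step every row carries exactly one glucose value and the rows are in chronological order, which is what the partitioning step relies on.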

In [None]:
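dfs = []
countNo = 0
# Sort the listing so that patient numbers are reproducible across runs.
for file in sorted(os.listdir('../data/libre/')):
#     print(file)
    if '.csv' in file:
        df = pd.read_csv('../data/libre/'+file)
        df_p = extract_bg(df, countNo)
        dfs.append(df_p)
        countNo += 1  # give each patient file its own number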

### 2. Partitioning

In the second step, we partition each long time series into windows of different sizes. For each dataset, an overlap ratio specifies how much adjacent windows overlap.
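
As a worked illustration of the overlap ratio, the sketch below converts a window size and an overlap ratio into a stride, measured in readings. It uses the same arithmetic as the partitioning function in this section and assumes one reading every 15 minutes; `stride_for` is a hypothetical helper name used only here.

```python
import math

def stride_for(minutes_step, each_gap=15, over_lap=0.0):
    """Number of readings to advance between the starts of adjacent windows."""
    counts = int(minutes_step / each_gap)       # 15-minute steps per window
    return math.ceil(counts * (1 - over_lap))   # larger overlap -> smaller stride

for ratio in (0, 0.25, 0.5, 0.75):
    print("overlap", ratio, "-> stride", stride_for(150, over_lap=ratio))
```

For a 150-minute window (10 steps), the strides are 10, 8, 5 and 3 readings for overlap ratios of 0, 0.25, 0.5 and 0.75. Because the stride is rounded up, the effective overlap can be slightly smaller than the nominal ratio.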

In [None]:
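def select_concecutive(df, minutes_step=150, each_gap=15, over_lap=0):
    # df            data to be processed (time-indexed, with 'bg' and 'No' columns)
    # minutes_step  window size in minutes
    # each_gap      sampling interval in minutes
    # over_lap      fraction of each window shared with the next one
    counts = int(minutes_step/each_gap)       # steps per window
    lag = math.ceil(counts *(1 - over_lap))   # stride between window starts
    print('No=', df.No.iloc[0])
    sep_sets = []
    len_index = len(df.index)
    for i in range(0, len_index, lag):
        sep = []
        prev_i = df.index[i]
        prev_bg = df.loc[prev_i].bg
        # A non-float here means duplicated timestamps; the window is then dropped below.
        if isinstance(prev_bg, float):
            sep.append(prev_bg)
        for j in range(counts):
            next_i = prev_i + datetime.timedelta(minutes=each_gap)
            # Index.contains was removed from pandas; use the `in` operator instead.
            if next_i in df.index:
                next_bg = df.loc[next_i].bg
                if isinstance(next_bg, float):
                    sep.append(next_bg)
                prev_i = next_i
            else:
                # A missing reading breaks the window, so discard it.
                sep = []
                break
        # Keep only complete windows: the first reading plus `counts` steps.
        if len(sep) == (counts+1):
            sep_sets.append(sep)
    return sep_sets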

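# Flatten the per-patient lists of windows into one table (one window per row)
def flattern(all_sets):
    all_sets_flat = list(itertools.chain.from_iterable(all_sets))
    df_all = pd.DataFrame(np.stack(all_sets_flat))
    # fillna(method='pad') is deprecated in recent pandas; ffill() is the equivalent.
    df_all = df_all.ffill()
    return df_all

dfs_sub = dfs[:20]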
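
The gap check at the heart of `select_concecutive` can be shown without pandas. The following is a simplified sketch, not the function above: it takes a plain dictionary of made-up readings, omits the float check, and uses the hypothetical name `consecutive_windows`. A window is kept only if every expected 15-minute timestamp is present.

```python
import math
from datetime import datetime, timedelta

def consecutive_windows(readings, minutes_step=150, each_gap=15, over_lap=0.0):
    """readings: dict mapping datetime -> glucose value."""
    counts = minutes_step // each_gap
    lag = math.ceil(counts * (1 - over_lap))
    step = timedelta(minutes=each_gap)
    times = sorted(readings)
    windows = []
    for i in range(0, len(times), lag):
        # Timestamps the window must contain: the start plus `counts` steps.
        expected = [times[i] + k * step for k in range(counts + 1)]
        if all(t in readings for t in expected):
            windows.append([readings[t] for t in expected])
    return windows

# Made-up series with readings at 0..60 min and 90..120 min (75 min is missing).
t0 = datetime(2020, 2, 1)
minutes = [0, 15, 30, 45, 60, 90, 105, 120]
toy = {t0 + timedelta(minutes=m): float(m) for m in minutes}
toy_windows = consecutive_windows(toy, minutes_step=30)
print(toy_windows)
```

With a 30-minute window and no overlap, only the windows starting at 0 and 30 minutes survive; the window starting at 60 minutes is dropped because the 75-minute reading is missing, and the last one runs past the end of the data.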

The window sizes are 60, 90, 120, 150 and 180 minutes, and the overlap ratios are 0, 0.25, 0.5 and 0.75, which gives 5 × 4 = 20 combinations. To speed up the computation, each patient's series is processed in parallel.

In [None]:
data_sets = []
# Make sure the output directory exists before writing any CSV files.
os.makedirs('datasets2', exist_ok=True)
for step in (60,90,120,150,180):
    for ratio in (0, 0.25, 0.5, 0.75):
        print('step:',step)
        print('overlap_ratio:', ratio)
        # One job per patient; n_jobs=-1 uses all available cores.
        _set = Parallel(n_jobs=-1)(delayed(select_concecutive)(df,minutes_step=step, over_lap=ratio) for df in tqdm(dfs_sub))
        df_set = flattern(_set)
        df_set.to_csv('datasets2/'+str(step)+'_'+str(ratio)+'.csv', index=False)
        data_sets.append(df_set)
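
The nested loop above writes one CSV file per (window size, overlap ratio) pair. As a small sketch of that naming scheme, the hypothetical helper `grid_filenames` below lists the files a given grid would produce; the two-by-two grid is only an illustration, not the grid used in the study.

```python
import itertools

def grid_filenames(steps, ratios, out_dir="datasets2"):
    """File names following the '<step>_<ratio>.csv' pattern used above."""
    return [
        f"{out_dir}/{step}_{ratio}.csv"
        for step, ratio in itertools.product(steps, ratios)
    ]

names = grid_filenames(steps=(120, 150), ratios=(0, 0.5))
print(names)
```

Listing the expected files this way may help when the R part of the analysis needs to load the datasets back by window size and overlap ratio.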

## Clustering evaluation (Part 2 with R)

## Data visualization