# Further Dataset Difficulties and Biases

The objective of the first step in this learning project was to explore the dataset and conduct a thorough analysis to uncover its structure, challenges, and potential biases. This notebook highlights the most significant observations and considerations. While many important decisions cannot be made at this stage, they will be addressed in the subsequent steps of the project.

## Observations on data

- <strong>Inconsistency:</strong> the sampling interval differs between patients (some are recorded every 5 minutes, others every 15 minutes).
- <strong>Potential data loss:</strong> if we standardize the time intervals, we risk losing data or oversimplifying trends, especially for patients with finer granularity (5-minute intervals).

<b>Possible Approaches:</b>

<b>1.</b> Resampling the data to a consistent time interval:
- Up-sampling (resample to 5 minutes for all patients).
- Down-sampling (resample to 15 minutes for all patients).

<b>a) Up-sampling:</b> involves resampling all patient data to the finer 5-minute intervals. For patients who have data recorded every 15 minutes, we could interpolate missing values for the in-between time points.

<b>Pros:</b>
- Keeps the finer-granularity data intact for those patients who already have 5-minute intervals.
- Allows for more detailed time-series analysis.
  
<b>Cons:</b>
- Interpolating data for the patients who originally have 15-minute intervals might introduce artificial data points, which may not capture the true variability of the measurements.

<b>b) Down-sampling:</b> resampling everyone’s data to 15-minute intervals by aggregating the 5-minute data for patients with finer granularity (e.g. taking averages or sums over each 15-minute period).

<b>Pros:</b>
- Simpler, and no interpolation is needed.
- Keeps the data more consistent with what was actually measured for those with 15-minute intervals.

<b>Cons:</b>
- We might lose detail from patients who had more frequent measurements, which could result in losing important patterns in their data.
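
As a rough illustration of the down-sampling option, the sketch below aggregates 5-minute readings into 15-minute bins with pandas. The column names `bg` and `carbs` and all values are hypothetical stand-ins, and the choice of mean for glucose and sum for carbohydrates is an assumption about which aggregation suits each kind of column.

```python
import pandas as pd

# Hypothetical 5-minute readings for a single patient (made-up values).
idx = pd.date_range("2024-01-01 00:00", periods=6, freq="5min")
df = pd.DataFrame(
    {
        "bg": [5.0, 5.3, 5.6, 6.0, 6.2, 6.1],       # level-type signal
        "carbs": [0.0, 10.0, 0.0, 0.0, 0.0, 5.0],   # event-type signal
    },
    index=idx,
)

# Aggregate each 15-minute bin: average the levels, add up the events.
down = df.resample("15min").agg({"bg": "mean", "carbs": "sum"})
print(down)
```

Six 5-minute rows collapse into two 15-minute rows, which makes the loss of detail mentioned above concrete.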

<b>2.</b> Handling each interval separately: another option is to build separate models for the patients with 5-minute intervals and the patients with 15-minute intervals.

<b>Pros:</b>
- We retain the original data for each patient.
- We don’t need to interpolate or down-sample, so the true variability is preserved.

<b>Cons:</b>
- This requires more work since we’re essentially running two analyses or models, one for each group.
- It leads to smaller training sets for each group.

<b>3.</b> Feature Engineering: we can create an additional feature to account for the differences in time intervals:
- Time interval: indicating whether the data point comes from a 5-min or 15-min interval.

This will let the model account for different time resolutions without explicitly resampling the data.
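
One possible way to derive such a feature is to take the median gap between consecutive timestamps for each patient. In the sketch below, the column names `p_num`, `time`, and `bg`, and the new column `interval_min`, are hypothetical, and the data is made up.

```python
import pandas as pd

# Hypothetical long-format data: one row per patient and timestamp.
df = pd.DataFrame(
    {
        "p_num": ["p01"] * 3 + ["p02"] * 3,
        "time": pd.to_datetime(
            [
                "2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10",
                "2024-01-01 00:00", "2024-01-01 00:15", "2024-01-01 00:30",
            ]
        ),
        "bg": [5.0, 5.2, 5.4, 7.0, 7.6, 7.1],
    }
)
df = df.sort_values(["p_num", "time"])

# Gap to the previous reading of the same patient, in minutes
# (the first reading of each patient has no predecessor, so it is NaN).
step_min = df.groupby("p_num")["time"].diff().dt.total_seconds() / 60

# The median gap per patient is robust to occasional missing readings.
df["interval_min"] = step_min.groupby(df["p_num"]).transform("median")
print(df)
```

Using the median rather than the first gap is a design choice: a single dropped reading should not change the inferred resolution of a patient.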

<b>Recommended approach:</b> Since only 3 patients have a 15-minute interval while 6 patients have a 5-minute interval, it is reasonable to resample the dataset to 5-minute intervals for all patients. Steps:
- Resample the data for all patients to 5-minute intervals.
- Interpolate the missing data points for patients who originally had 15-minute intervals.
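
The two steps above could be sketched as follows for one patient with 15-minute data. The series name `bg`, the flag `is_measured`, and the values are hypothetical, and time-based linear interpolation is an assumption that may not suit every column (event-like columns such as carbohydrate intake probably should not be interpolated this way).

```python
import pandas as pd

# Hypothetical 15-minute readings for a single patient (made-up values).
idx = pd.date_range("2024-01-01 00:00", periods=3, freq="15min")
bg = pd.Series([6.0, 7.5, 6.9], index=idx, name="bg")

# Step 1: move to a 5-minute grid; the new time points start out as NaN.
up = bg.resample("5min").asfreq()

# Remember which points were actually measured before filling the gaps.
is_measured = up.notna()

# Step 2: fill the gaps linearly with respect to the timestamps.
result = pd.DataFrame(
    {"bg": up.interpolate(method="time"), "is_measured": is_measured}
)
print(result)
```

Keeping an `is_measured` flag lets later steps distinguish real readings from interpolated ones, for example when evaluating a model only on measured values.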



## Challenges and potential difficulties

- <b>Missing Data</b>
    - In the dataset, many columns (e.g. carbs-\*, activity-\*) have a high percentage of missing values. This is a common issue in studies involving self-reported data or continuous monitoring (device resets etc.).
    - Handling strategy: impute missing values, for example by using statistical models to fill in the gaps.
- <b>Outliers</b>
    - From the variable analysis, we observed a large number of outliers, most of which stem from the skewness present in most features. As a result, the commonly used IQR method for outlier detection is not reliable in this case, because it flags many legitimate values in the long tail of a skewed distribution.
    - Handling strategy: consider transformations to reduce skewness before detecting outliers.
- <b>Non-normal distributions</b>
    - Features like insulin and blood glucose are often highly skewed. This could impact the performance of models that assume normality or require scaling.
- <b>Feature engineering for Time-Series</b>
    - Extracting useful features such as:
        - Lagged features: past glucose, insulin, carb or other values from previous time steps.
        - Rolling averages/windows: smoothed glucose or insulin averages over time to capture trends.
- <b>Multicollinearity</b>
    - Since the dataset contains many related measurements over time, there might be strong correlations between features. This could impact model performance, especially for regression models.
    - Handling strategy: regularization techniques (Lasso, Ridge) or dimensionality reduction (PCA).  <font color="red">(?)</font>
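
The lagged and rolling-window features listed above could be built per patient roughly as follows. This is a minimal sketch on made-up data; the column names `p_num`, `bg`, `bg_lag1`, and `bg_roll3`, the lag of one step, and the window of three steps are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical glucose readings for two patients, already sorted by time.
df = pd.DataFrame(
    {
        "p_num": ["p01"] * 4 + ["p02"] * 4,
        "bg": [5.0, 6.0, 7.0, 8.0, 4.0, 4.0, 6.0, 8.0],
    }
)

# Group by patient so that no value leaks from one patient into another.
grouped = df.groupby("p_num")["bg"]

# Lagged feature: the glucose value one time step earlier.
df["bg_lag1"] = grouped.shift(1)

# Rolling feature: mean over the current and the two previous readings.
df["bg_roll3"] = grouped.transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)
print(df)
```

Because lagged and rolling features are derived from the same signal, they tend to be strongly correlated with each other, which is one source of the multicollinearity noted above.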


## Possible biases

- <b>Selection Bias:</b>
    - If the study only included certain types of participants (e.g. individuals with a specific health condition, age group, or gender), this could lead to selection bias, meaning the results may not generalize to other populations.
- <b>Measurement Bias:</b>
    - Self-reported data like carbohydrate intake and activity are prone to inaccuracies, which can lead to measurement bias. Wearable devices (for heart rate, steps, etc.) can also be less reliable depending on usage conditions.
- <b>Sampling Bias:</b>
    - If the self-reported measurements are not taken consistently (e.g., some participants report their data more frequently than others), the dataset might not represent all participants equally, leading to biased conclusions.
