# Quality Control and Assurance

At this point, the dataset is ready for analysis. Before starting, however, we need to verify that it is consistent and valid; otherwise the results of the analysis cannot be considered accurate or reliable.

The training dataset contains a time series for each patient, and its rows are supposed to be in consecutive time order. However, the `time` column holds only the time of day and there is no explicit "day" column, so we must verify that the rows are indeed sequential. We use the lag feature columns for this check.

**Validation Approach**:
* Compare the lag features in each row to the corresponding data in preceding rows.
* Identify and flag any gaps or inconsistencies in the time series data.

**Expected Outcome**:  
If the lag features align correctly with the data in prior rows, we can assume that the dataset is sequential and free from significant gaps, validating its integrity for further analysis.
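The validation approach above can be sketched on toy data. This is a minimal illustration, not the project's implementation: the frame `toy`, the function `check_sequence` and the 15-minute step are assumptions made for the example, and only the column naming pattern (`<parameter>-H:MM`) is taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one patient, rows 15 minutes apart.
index = pd.date_range("2000-01-01 06:10", periods=5, freq="15min")
toy = pd.DataFrame(
    {
        "hr-0:15": [np.nan, 60.0, 62.0, 99.0, 65.0],  # value 15 minutes earlier
        "hr-0:00": [60.0, 62.0, 64.0, 65.0, 66.0],    # current value
    },
    index=index,
)


def check_sequence(frame, older, newer, step):
    """Return (timestamps that follow a gap, timestamps with a lag mismatch)."""
    # Gap check: consecutive timestamps should be exactly one step apart.
    deltas = frame.index.to_series().diff().iloc[1:]
    gap_rows = deltas[deltas != step].index

    # Lag check: the older lag value should equal the newer value
    # stored in the preceding row. Missing values are not compared.
    previous = frame[newer].shift(1)
    comparable = frame[older].notna() & previous.notna()
    mismatch = comparable & (frame[older] != previous)
    bad_rows = frame.index[mismatch.to_numpy()]
    return gap_rows, bad_rows


gap_rows, bad_rows = check_sequence(
    toy, older="hr-0:15", newer="hr-0:00", step=pd.Timedelta(minutes=15)
)
print("rows after a gap:", list(gap_rows))
print("rows with a lag mismatch:", list(bad_rows))
```

Note that the positional comparison with `shift(1)` is only meaningful where the gap check passes; a row that follows a gap has no valid predecessor to compare against.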

## Implementation

To ensure the correctness and reliability of the validation algorithm, we implemented it in two independent ways, developed by different team members.

1. [The first implementation](https://github.com/DataScientest-Studio/sep24_bds_int_medical/blob/main/notebooks/0.5-christian-timeseries-and-coherence-checks.ipynb):
    * performs a cell-by-cell comparison of the lag feature columns,
    * identifies any changes in data types or values within the lag features.

2. [The second implementation](https://github.com/DataScientest-Studio/sep24_bds_int_medical/blob/main/notebooks/0.2-ralf-consistency-check.ipynb): 
    * shifts and reindexes the lag feature columns based on the time differences,
    * detects non-unique values in the parameter columns for each row, effectively spotting discrepancies in sequential data.

Both implementations together enhance the robustness of the validation process and reduce the risk of undetected errors in the dataset.
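The idea behind the second approach can be illustrated as follows. This is a sketch under stated assumptions rather than the code from the linked notebook: the toy frame and the helpers `lag_of` and `realign` are hypothetical, and lag columns are assumed to follow the `<parameter>-H:MM` naming pattern.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): three lag columns of one parameter, 15-minute rows.
index = pd.date_range("2000-01-01 06:10", periods=5, freq="15min")
toy = pd.DataFrame(
    {
        "hr-0:30": [np.nan, np.nan, 60.0, 62.0, 64.0],
        "hr-0:15": [np.nan, 60.0, 62.0, 99.0, 65.0],
        "hr-0:00": [60.0, 62.0, 64.0, 65.0, 66.0],
    },
    index=index,
)


def lag_of(column):
    """Parse the lag from a column name such as 'hr-0:15'."""
    hours, minutes = column.rsplit("-", 1)[1].split(":")
    return pd.Timedelta(hours=int(hours), minutes=int(minutes))


def realign(frame, columns):
    """Label every lag value with the timestamp it refers to."""
    aligned = {}
    for col in columns:
        series = frame[col].dropna()
        series.index = series.index - lag_of(col)  # shift back by the lag
        aligned[col] = series
    # Building a frame from Series reindexes them onto a shared time axis.
    return pd.DataFrame(aligned)


aligned = realign(toy, ["hr-0:30", "hr-0:15", "hr-0:00"])

# After realignment, every column in a row describes the same instant,
# so more than one distinct non-missing value signals a discrepancy.
conflicts = aligned.nunique(axis=1) > 1
print(aligned[conflicts])
```

In this sketch a conflict is reported at the instant the values refer to, not at the row that stores the deviating value, so locating the offending row requires adding the lag back.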

### Load the Dataset

```python
import pandas as pd
import numpy as np
import os

# Load the dataset
df = pd.read_csv(os.path.join('..', '..', '..', 'data', 'raw', 'train.csv'), na_values=np.nan, low_memory=False)
```

### Define a Date Time Index

```python
from helpers import set_datetime_index

df = set_datetime_index(df)
display(df.iloc[:, :4].head())
df = df.drop(columns=['time'])
```

| datetime | id | p_num | time | bg-5:55 |
|---|---|---|---|---|
| 2000-01-01 06:10:00 | p01_0 | p01 | 06:10:00 | NaN |
| 2000-01-01 06:25:00 | p01_1 | p01 | 06:25:00 | NaN |
| 2000-01-01 06:40:00 | p01_2 | p01 | 06:40:00 | NaN |
| 2000-01-01 06:55:00 | p01_3 | p01 | 06:55:00 | NaN |
| 2000-01-01 07:10:00 | p01_4 | p01 | 07:10:00 | NaN |
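The `set_datetime_index` helper lives in the project's `helpers` module and is not reproduced here. As a rough illustration of how a full timestamp could be derived from a time-of-day column, the following sketch starts every patient on 2000-01-01 and advances the date whenever the time of day decreases. The function `build_datetime_index` and its logic are assumptions for illustration and may differ from the actual helper.

```python
import pandas as pd


def build_datetime_index(frame, start="2000-01-01"):
    """Hypothetical stand-in for `set_datetime_index` (illustration only)."""
    out = frame.copy()
    time_of_day = pd.to_timedelta(out["time"])
    seconds = time_of_day.dt.total_seconds()

    # Assumption: within a patient, a drop in the time of day means midnight
    # was crossed exactly once (gaps of a day or more would go unnoticed).
    wrapped = seconds.groupby(out["p_num"]).diff() < 0
    day_number = wrapped.astype(int).groupby(out["p_num"]).cumsum()

    stamps = pd.Timestamp(start) + pd.to_timedelta(day_number, unit="D") + time_of_day
    out.index = pd.DatetimeIndex(stamps, name="datetime")
    return out


# Toy data (hypothetical): p01 crosses midnight, p02 does not.
toy = pd.DataFrame(
    {
        "p_num": ["p01", "p01", "p01", "p02", "p02"],
        "time": ["23:40:00", "23:55:00", "00:10:00", "06:10:00", "06:25:00"],
    }
)
indexed = build_datetime_index(toy)
print(indexed)
```

Because every patient starts on the same synthetic date, timestamps in such an index are unique only within a patient, not across the whole frame.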


### Consistency Check for Lag Features of Parameters and the Target Variable

```python
from helpers import consistency_check

result_dict = consistency_check(df)
```

### Display the Results

The following table summarizes the results of the validation process. For each patient and parameter, it shows the number of rows in which the lag features did not match the data in the preceding rows, alongside the total number of rows for that patient. The check also covers a `target` column, which compares the `bg` parameter (blood glucose) at a time difference of +1:00. Columns that contain only zeros are dropped before display, so only parameters with at least one inconsistency appear.

```python
from helpers import get_parameters

result = pd.DataFrame.from_dict(result_dict, columns=get_parameters() + ['total'], orient='index')
result = result.loc[:, (result != 0).any(axis=0)]
result
```

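| Patient | hr | steps | cals | total |
|---|---|---|---|---|
| p01 | 0 | 0 | 0 | 16865 |
| p02 | 0 | 0 | 0 | 26335 |
| p03 | 0 | 0 | 0 | 26427 |
| p04 | 0 | 0 | 0 | 25047 |
| p05 | 0 | 0 | 0 | 16248 |
| p06 | 0 | 0 | 0 | 16674 |
| p10 | 0 | 0 | 0 | 25874 |
| p11 | 68 | 26 | 45 | 25205 |
| p12 | 0 | 0 | 0 | 26048 |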
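To put these counts in proportion, the share of inconsistent rows per patient can be computed from the table. The sketch below re-enters the displayed values by hand so that it runs on its own; in the notebook the same calculation could be applied to `result` directly.

```python
import pandas as pd

# Values copied from the results table above.
patients = ["p01", "p02", "p03", "p04", "p05", "p06", "p10", "p11", "p12"]
counts = pd.DataFrame(
    {
        "hr": [0, 0, 0, 0, 0, 0, 0, 68, 0],
        "steps": [0, 0, 0, 0, 0, 0, 0, 26, 0],
        "cals": [0, 0, 0, 0, 0, 0, 0, 45, 0],
        "total": [16865, 26335, 26427, 25047, 16248, 16674, 25874, 25205, 26048],
    },
    index=patients,
)

# Percentage of each patient's rows that are inconsistent, per parameter.
share = counts[["hr", "steps", "cals"]].div(counts["total"], axis=0) * 100

# Keep only patients with at least one inconsistent row.
affected = share[(share > 0).any(axis=1)].round(2)
print(affected)
```

For `p11` this gives roughly 0.27% (`hr`), 0.10% (`steps`) and 0.18% (`cals`) of that patient's rows.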


## Conclusion

The analysis shows that the lag features are consistent with the data in the preceding rows for eight of the nine patients, which indicates a high level of dataset reliability. Only patient `p11` shows inconsistencies: 68 rows for `hr`, 26 for `steps` and 45 for `cals`, out of 25,205 rows for that patient (less than 0.3% per parameter).

Although the number of inconsistent rows is small relative to the size of the dataset, these discrepancies warrant further investigation to ensure complete data integrity.