# Outliers and Anomalies

In this notebook, we analyze the dataset for outliers and anomalies, because addressing them is a crucial step when **standardizing** numerical values in a machine learning project.

Because the explanatory variables are time series, we treat each group of columns that share a prefix (e.g., bg-*) as a single variable type and inspect it in detail.

In [1]:
import pandas as pd
import os

base_dir = os.path.abspath(os.path.join('..', '..', '..', 'data', 'raw'))
file_path = os.path.join(base_dir, 'train.csv') 

patients = pd.read_csv(file_path, low_memory=False)

In [None]:
bg_cols = [col for col in patients.columns if col.startswith('bg-')]
insulin_cols = [col for col in patients.columns if col.startswith('insulin-')]
carbs_cols = [col for col in patients.columns if col.startswith('carbs-')]
hr_cols = [col for col in patients.columns if col.startswith('hr-')]
steps_cols = [col for col in patients.columns if col.startswith('steps-')]
cals_cols = [col for col in patients.columns if col.startswith('cals-')]
activity_cols = [col for col in patients.columns if col.startswith('activity-')]

> *Note:* The commonly used IQR method for outlier detection is not well suited to our dataset because its distributions are skewed. This issue will be addressed in the next phase of the project, "Data Preprocessing".
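
To illustrate why the IQR rule can mislead on skewed data, here is a minimal sketch on synthetic samples (not the patient data). The helper `iqr_outlier_share` and the two random samples are assumptions made for illustration only; the sketch compares the share of points outside the Tukey fences for a symmetric and a right-skewed distribution.

```python
import numpy as np
import pandas as pd


def iqr_outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(outside.mean())


# Synthetic data only: one symmetric sample and one right-skewed sample
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

share_symmetric = iqr_outlier_share(symmetric)
share_skewed = iqr_outlier_share(skewed)

print(f"Share flagged (normal):    {share_symmetric:.3f}")
print(f"Share flagged (lognormal): {share_skewed:.3f}")
```

On the skewed sample, the rule flags a much larger share of ordinary tail values, all of them on the upper side, which is the behavior the note above warns about.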

## Outliers for bg-* columns

In [None]:
patients[bg_cols].describe()

In [None]:
# Get the describe() result for all 'bg-*' columns
df_bg = patients[bg_cols].describe()

# Extract the minimum and maximum values across all bg-* columns
overall_min = df_bg.loc['min'].min()  # Get the smallest 'min' value across all bg-* columns
overall_max = df_bg.loc['max'].max()  # Get the largest 'max' value across all bg-* columns

# Display the results
print(f"Overall minimum value across all bg-* columns: {overall_min}")
print(f"Overall maximum value across all bg-* columns: {overall_max}")

In [None]:
patients.groupby('p_num')['bg-5:45'].agg(['min', 'max'])

> <b>Summary:</b> Some extreme blood glucose values are present, but they are still realistic for some patients.

## Outliers for insulin-* columns

In [None]:
# Get the describe() result for all 'insulin-*' columns
df_insulin = patients[insulin_cols].describe()

# Extract the minimum and maximum values across all insulin-* columns
overall_min = df_insulin.loc['min'].min() 
overall_max = df_insulin.loc['max'].max()  

# Display the results
print(f"Overall minimum value across all insulin-* columns: {overall_min}")
print(f"Overall maximum value across all insulin-* columns: {overall_max}")

<b>a) Investigating negative insulin</b>

In [None]:
# (patients[insulin_cols] < 0).groupby(patients['p_num']).sum()  # result: only p12 has some negative values

# Create an empty set to store unique negative values
unique_negative_values = set()

# Iterate through each insulin column and collect the negative values of patient 'p12'
for col in insulin_cols:
    negative_values = patients[(patients['p_num'] == 'p12') & (patients[col] < 0)][col]
    
    # Add the negative values to the set (automatically handles uniqueness)
    unique_negative_values.update(negative_values.dropna())

# Convert the set to a sorted list and display
unique_negative_values = sorted(unique_negative_values)
print(unique_negative_values)


<b>b) Investigating positive extremes</b>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

insulin_values = patients["insulin-0:00"].dropna()

# Create a figure for the histogram and KDE plot
plt.figure(figsize=(7, 3))

sns.histplot(insulin_values, bins=30, kde=True, color='blue', stat='density')

# Add labels and title
plt.xlabel('Insulin dose')
plt.ylabel('Density')
plt.title('Distribution of insulin doses (insulin-0:00) for all patients')

plt.show()

In [None]:
patients.groupby('p_num')['insulin-0:00'].agg(['min', 'max'])

> <b>Summary:</b> For patient p12, we detected some negative values. Negative insulin values are often the result of data entry errors, sensor malfunctions, or recording issues in the dataset.
> Since there are only two different negative values and they fall outside the usual value ranges, we will replace them with their corresponding positive values. This step will be carried out during the data preprocessing phase.
>
> Additionally, for two other patients, we observe extremely high insulin doses. While these values do not appear to be errors, the distribution is still right-skewed, which will need to be addressed.
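
The sign flip described above could look roughly like the following sketch. It runs on a small toy frame with invented values (the real negative values are not reproduced here), and the names `toy`, `toy_insulin_cols`, and `cleaned` are hypothetical. The actual fix is deferred to the preprocessing phase.

```python
import pandas as pd

# Toy frame with invented values; column names mimic the insulin-* pattern
toy = pd.DataFrame({
    "p_num": ["p12", "p12", "p01"],
    "insulin-0:05": [-0.3, 0.2, 0.1],
    "insulin-0:00": [0.1, -0.3, None],
})

toy_insulin_cols = [c for c in toy.columns if c.startswith("insulin-")]

# Replace negative doses with their absolute value; missing values stay missing
cleaned = toy.copy()
cleaned[toy_insulin_cols] = cleaned[toy_insulin_cols].abs()

print(cleaned)
```

Working on a copy keeps the raw values available for later comparison, which seems useful as long as the cause of the negative entries is not confirmed.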

## Outliers for carbs-* columns

In [None]:
# Get the describe() result for all 'carbs-*' columns
df_carbs = patients[carbs_cols].describe()

# Extract the minimum and maximum values across all carbs-* columns
overall_min = df_carbs.loc['min'].min() 
overall_max = df_carbs.loc['max'].max() 

# Display the results
print(f"Overall minimum value across all carbs-* columns: {overall_min}")
print(f"Overall maximum value across all carbs-* columns: {overall_max}")

> <b>Summary:</b> The values for carbohydrate consumption, ranging from 1.0 to 852.0 grams, raise concerns about their realism and appropriateness for our use case. Given that these data are self-reported by patients and 98% of the values are missing, we may consider excluding these columns from the model.
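
A missing-value share like the one quoted above can be computed as in this sketch. The toy frame and its values are invented for illustration; on the real data, the same expressions would be applied to `patients[carbs_cols]`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mimic the carbs-* pattern
toy = pd.DataFrame({
    "carbs-0:05": [np.nan, np.nan, 30.0, np.nan],
    "carbs-0:00": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column and across all cells
missing_share_per_col = toy.isna().mean()
overall_missing_share = toy.isna().to_numpy().mean()

print(missing_share_per_col)
print(f"Overall share of missing values: {overall_missing_share:.1%}")
```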

## Outliers for hr-* columns

In [None]:
# Get the describe() result for all 'hr-*' columns
df_hr = patients[hr_cols].describe()

# Extract the minimum and maximum values across all hr-* columns
overall_min = df_hr.loc['min'].min()
overall_max = df_hr.loc['max'].max()

# Display the results
print(f"Overall minimum value across all hr-* columns: {overall_min}")
print(f"Overall maximum value across all hr-* columns: {overall_max}")

In [None]:
# Group by 'p_num' and calculate the min and max of 'hr-5:55' for each patient
patients.groupby('p_num')['hr-5:55'].agg(['min', 'max'])

> <b>Summary:</b>
> For one patient (p06), we observed an unusually low heart rate of 37.6 bpm. While this value could be realistic for elite athletes or during deep sleep, it is abnormally low for most people. For now, we leave this value as is, since it might carry an important clinical signal (e.g., severe bradycardia). However, in a small dataset of only nine patients, a single outlier can have a larger impact than it would in a larger dataset. If we apply models such as random forests, gradient boosting, or SVMs, which tend to be more robust to outliers, this value might have minimal impact.
>
> A similar consideration applies to the extreme value of 185.3 bpm for patient p02.
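
One way to check whether such extremes are isolated readings is to count, per patient, how many values fall outside a plausibility band. The sketch below uses a toy frame: only 37.6 and 185.3 come from the summary above, the other readings are invented, and the thresholds `LOW_BPM` and `HIGH_BPM` are hypothetical choices, not clinical cut-offs.

```python
import pandas as pd

# Toy frame: 37.6 and 185.3 are taken from the summary, the rest is invented
toy = pd.DataFrame({
    "p_num": ["p06", "p06", "p06", "p02", "p02"],
    "hr-5:55": [37.6, 62.0, 58.4, 185.3, 90.1],
})

# Hypothetical plausibility band, chosen only for illustration
LOW_BPM, HIGH_BPM = 40.0, 180.0

flags = toy.assign(
    low=toy["hr-5:55"] < LOW_BPM,
    high=toy["hr-5:55"] > HIGH_BPM,
)

# Number of flagged readings per patient
counts = flags.groupby("p_num")[["low", "high"]].sum()
print(counts)
```

A count of one per patient would suggest an isolated reading, whereas larger counts would point to a sustained pattern that deserves a closer look.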

## Outliers for steps-* columns

In [None]:
# Get the describe() result for all 'steps-*' columns
df_steps = patients[steps_cols].describe()

# Extract the minimum and maximum values across all steps-* columns
overall_min = df_steps.loc['min'].min()  
overall_max = df_steps.loc['max'].max() 

# Display the results
print(f"Overall minimum value across all steps-* columns: {overall_min}")
print(f"Overall maximum value across all steps-* columns: {overall_max}")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

steps_values = patients["steps-0:00"].dropna()

# Create a figure for the histogram and KDE plot
plt.figure(figsize=(7, 3))

sns.histplot(steps_values, bins=30, kde=True, color='blue', stat='density')

# Add labels and title
plt.xlabel('Steps')
plt.ylabel('Density')
plt.title('Distribution of steps (steps-0:00) for all patients')

plt.show()

In [None]:
# Group by 'p_num' and calculate the min and max of 'steps-0:00' for each patient
patients.groupby('p_num')['steps-0:00'].agg(['min', 'max'])

In [None]:
# Filter the dataset for patient 'p02' where 'steps-0:00' equals 1359
activity_steps_1359_p02 = patients[(patients['p_num'] == 'p02') & (patients['steps-0:00'] == 1359)][['activity-0:00', 'steps-0:00']]

# Display the result
print(activity_steps_1359_p02)

In [None]:
patients[(patients['p_num'] == 'p02') & (patients['steps-0:00'] == 1359)][steps_cols + activity_cols]

> <b>Summary:</b> The average walking cadence for a healthy adult is around 100 to 120 steps per minute, so over a 5-minute period a typical person would take approximately 500 to 600 steps at a moderate pace. A value of 1359 steps therefore appears to be an outlier or extreme value. However, after considering the frequency and trends in neighboring records, along with the fact that the activity was reported as "Walking", we conclude that this value is likely valid in the context of possible intense physical activity.
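
The arithmetic behind this comparison can be written out as a short sketch. It assumes each record covers a 5-minute interval, as stated in the summary; the helper `steps_per_minute` is a hypothetical name introduced for illustration.

```python
def steps_per_minute(steps: float, interval_minutes: float = 5.0) -> float:
    """Convert a step count for one interval into a per-minute cadence."""
    return steps / interval_minutes


# Typical walking cadence range quoted in the summary (steps per minute)
typical_low, typical_high = 100, 120

cadence = steps_per_minute(1359)
ratio_to_typical_high = cadence / typical_high

print(f"Cadence: {cadence:.1f} steps per minute")
print(f"Ratio to upper typical walking cadence: {ratio_to_typical_high:.2f}")
```

The resulting cadence is more than twice the upper end of the typical walking range, which is why the value is treated as an extreme rather than an ordinary reading.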

## Outliers for cals-* columns

In [None]:
# Get the describe() result for all 'cals-*' columns
df_cals = patients[cals_cols].describe()

# Extract the minimum and maximum values across all cals-* columns
overall_min = df_cals.loc['min'].min()  
overall_max = df_cals.loc['max'].max() 

# Display the results
print(f"Overall minimum value across all cals-* columns: {overall_min}")
print(f"Overall maximum value across all cals-* columns: {overall_max}")

> <b>Summary:</b> The values observed here appear realistic, considering potential physical activities.