# Revisiting and Enhancing the Original Model for Predicting Workplace Air Concentrations

**Date:** June 25, 2024 <br>
**Author:** Jacob Kvasnicka

### Recap of Previous Discussion
In our previous meeting, we outlined a three-stage framework for prioritizing chemicals of concern in workplaces:
- Predict industrial sectors from chemical structure
- Predict inhalation exposure using an enhanced version of Jeff's Bayesian model
- Prioritize chemicals of concern, incorporating human-equivalent points of departure

### Initial Objective 
The initial objective of this notebook was to replicate the original methodology for predicting air concentrations of chemicals using OSHA data. The goal was to determine whether similar performance could be attained using a model that's less computationally intensive than the Bayesian hierarchical model.

As an alternative to the Bayesian model, I began implementing a two-step random forest approach, keeping the exact training and test sets from the original study:
1. Classify whether a sample has a detectable air concentration
2. Predict the concentration levels for those detectable samples
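
The two steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's fitted model: the feature matrix, the targets, and names such as `clf`, `reg`, and `X_new` are hypothetical, and all features are numeric here, whereas the real data also include a categorical NAICS subsector code that would need encoding.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic stand-in data (hypothetical): 6 numeric features per sample
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
detected = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
# The concentration target is only meaningful where detected == 1
log10_conc = X[:, 1] + rng.normal(scale=0.1, size=200)

# Step 1: classify detect vs. non-detect using all samples
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, detected)

# Step 2: regress log10 concentration using detected samples only
is_detected = detected == 1
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X[is_detected], log10_conc[is_detected])

# Prediction: a concentration is predicted only where step 1 predicts a detect
X_new = rng.normal(size=(5, 6))
pred_detected = clf.predict(X_new)
pred_log10_conc = np.full(len(X_new), np.nan)
mask = pred_detected == 1
if mask.any():
    pred_log10_conc[mask] = reg.predict(X_new[mask])
```

Samples predicted as non-detects keep `NaN` as their concentration, so the two outputs stay aligned row by row.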

### Issues Identified in Data Preparation
However, further inspection of the original code and dataset preparation revealed potential issues that may have influenced the performance estimates:
- **Duplicate Samples:** A considerable number of duplicate samples were found, attributable to errors in the data preparation process.
- **Identically-Valued Samples:** A large proportion of the classification training samples are identical, which could potentially bias the model towards these repeated values.
- **Data Leakage:** Preprocessing steps such as centering and scaling were conducted on the entire dataset prior to the train-test split, likely causing data leakage and potentially overoptimistic performance estimates.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

In [2]:
# Load data files
osha_full = pd.read_csv('../ht_occupational-1.0.0/output/osha_processed_full.csv')
osha_train = pd.read_csv('../ht_occupational-1.0.0/output/osha_processed_training.csv')
osha_test = pd.read_csv('../ht_occupational-1.0.0/output/osha_processed_test.csv')


In [3]:
# Define data configuration settings
data_config = {
    'phys_chem_cols' : [
        'logp_pred', 
        'bp_pred', 
        'loghl_pred', 
        'rt_pred', 
        'logoh_pred', 
        'logkoc_pred'
    ],
    'naics_col' : 'naics_unified',
    'class_target_col' : 'detected',
    'reg_target_col' : 'conc_mgm3',
}

In [4]:
def prepare_classification_data(
        osha_data, 
        phys_chem_cols, 
        naics_col, 
        target_col
    ):
    '''
    Prepare binary classification data.

    The features include physical-chemical properties plus NAICS subsector 
    codes. The target is the sample detection category (detect/non-detect).
    
    Parameters
    ----------
    osha_data : pd.DataFrame
        The OSHA dataset, which may be the train or test set.
    phys_chem_cols : list
        List of columns representing physical-chemical properties.
    naics_col : str
        Column name representing NAICS code.
    target_col : str
        Column name representing the target variable.

    Returns
    -------
    tuple
        Features and target data.
    '''
    X = prepare_features(osha_data, phys_chem_cols, naics_col)
    y = osha_data[target_col]

    return X, y
        
def prepare_regression_data(
        osha_data, 
        phys_chem_cols, 
        naics_col, 
        target_col
    ):
    '''
    Prepare regression data for detected samples.
    
    Parameters
    ----------
    osha_data : pd.DataFrame
        The OSHA dataset, which may be the train or test set.
    phys_chem_cols : list
        List of columns representing physical-chemical properties.
    naics_col : str
        Column name representing NAICS code.
    target_col : str
        Column name representing the target variable.

    Returns
    -------
    tuple
        Features and target data.
    '''
    X = prepare_features(osha_data, phys_chem_cols, naics_col)

    # Filter out non-detect values
    where_detected = osha_data['detected'] == 1
    y = np.log10(osha_data.loc[where_detected, target_col])
    X = X.loc[y.index]

    return X, y

def prepare_features(osha_data, phys_chem_cols, naics_col):
    '''
    Prepare features data including physical-chemical and NAICS features.
    
    Parameters
    ----------
    osha_data : pd.DataFrame
        The OSHA dataset, which may be the train or test set.
    phys_chem_cols : list
        List of columns representing physical-chemical properties.
    naics_col : str
        Column name representing NAICS code.
        
    Returns
    -------
    pd.DataFrame
        Prepared features data.
    '''
    X_phys_chem = osha_data[phys_chem_cols]
    X_naics = prepare_naics_features(osha_data, naics_col)
    return pd.concat([X_phys_chem, X_naics], axis=1)
    
def prepare_naics_features(osha_data, naics_col):
    '''
    Prepare NAICS features.

    For now, this is just the subsector codes.
    
    Parameters
    ----------
    osha_data : pd.DataFrame
        The OSHA dataset, which may be the train or test set.
    naics_col : str
        Column name representing NAICS code.

    Returns
    -------
    pd.Series
        Prepared NAICS features.
    '''
    return osha_data[naics_col].str[:3]

## Duplicate Samples

In [5]:
def analyze_duplicates(dataframe):
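    '''
    Return the percentage of duplicate rows in a dataframe.

    A row counts as a duplicate if it repeats an earlier row in every column;
    the first occurrence is not counted.

    Parameters
    ----------
    dataframe : pd.DataFrame
        The dataframe to analyze for duplicates.

    Returns
    -------
    float
        Percentage of rows that are duplicates (0 to 100).
    '''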
    total_duplicates = dataframe.duplicated().sum()

    proportion_duplicates = total_duplicates / len(dataframe)

    return proportion_duplicates * 100

In [6]:
print(f'{round(analyze_duplicates(osha_full))}% of all samples are duplicates')

13% of all samples are duplicates


In [7]:
osha_full.iloc[:2]

Unnamed: 0,inspection_number,establishment_name,preferred_name,naics_unified,sector_name,subsector_name,industry_group_name,sample_type,conc_mgm3,detected,...,bp x logoh,bp x logkoc,loghl x rt,loghl x logoh,loghl x logkoc,rt x logoh,rt x logkoc,logoh x logkoc,index_s,index_ss
0,42,HIGHWAY DEPARTMENT TOWN ROTTERDAM,"1,6-Diisocyanatohexane",611110,Educational Services,Educational Services,Elementary and Secondary Schools,P,0.071,1,...,0.12756,-0.200765,-0.051483,-0.077787,0.122428,0.115624,-0.18198,-0.274959,5,11
1,42,HIGHWAY DEPARTMENT TOWN ROTTERDAM,"1,6-Diisocyanatohexane",611110,Educational Services,Educational Services,Elementary and Secondary Schools,P,0.071,1,...,0.12756,-0.200765,-0.051483,-0.077787,0.122428,0.115624,-0.18198,-0.274959,5,11
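
To look at the duplicated rows themselves rather than only their share, something like the sketch below may help. The toy dataframe and its values are hypothetical; only the pandas calls (`duplicated`, `drop_duplicates`) matter.

```python
import pandas as pd

# Toy frame (hypothetical values) with one fully repeated row
df = pd.DataFrame({
    'inspection_number': [42, 42, 43],
    'preferred_name': ['A', 'A', 'B'],
    'conc_mgm3': [0.071, 0.071, 0.5],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = df[df.duplicated(keep=False)]

# The default keep='first' marks only the second and later occurrences
n_repeats = df.duplicated().sum()

# Keep one row per duplicate group
deduplicated = df.drop_duplicates().reset_index(drop=True)
```

Whether dropping repeats is appropriate depends on whether they are preparation errors or genuine repeated measurements, which this sketch cannot tell apart.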


## Identically-Valued Samples
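This issue arises partly from duplication errors and partly from the method of data aggregation, which grouped samples by inspection number rather than by unique chemical or unique chemical-subsector combination. Because the features consist only of chemical-level properties and the NAICS subsector code, all samples of the same chemical in the same subsector with the same detection status are identical.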
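
As a rough sketch of the alternative grouping, the snippet below aggregates samples by chemical-subsector combination. The toy data are hypothetical, and the summary statistics (`n_samples`, `detection_rate`) are assumptions chosen for illustration; the appropriate aggregates for the model are still to be decided.

```python
import pandas as pd

# Toy sample-level data (hypothetical values)
samples = pd.DataFrame({
    'preferred_name': ['A', 'A', 'A', 'B'],
    'naics_unified': ['611110', '611310', '332710', '332710'],
    'detected': [1, 0, 1, 1],
})

# The subsector is the first three digits of the NAICS code
samples['subsector'] = samples['naics_unified'].str[:3]

# One row per unique chemical-subsector combination
aggregated = (
    samples
    .groupby(['preferred_name', 'subsector'], as_index=False)
    .agg(
        n_samples=('detected', 'size'),
        detection_rate=('detected', 'mean'),
    )
)
```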

In [8]:
# Define helper function for convenience
def combine_Xy(X, y):
    '''Concatenate features and target column-wise into one dataframe.'''
    return pd.concat([X, y], axis=1)

Xy_train_class = combine_Xy(*prepare_classification_data(
    osha_train,
    data_config['phys_chem_cols'], 
    data_config['naics_col'],
    data_config['class_target_col']
))
print(f'{round(analyze_duplicates(Xy_train_class))}% of classification training samples are identical')

87% of classification training samples are identical


In [9]:
Xy_train_class

Unnamed: 0,logp_pred,bp_pred,loghl_pred,rt_pred,logoh_pred,logkoc_pred,naics_unified,detected
0,-0.536572,0.305187,-0.186106,0.276632,0.417972,-0.657841,611,1
1,-0.536572,0.305187,-0.186106,0.276632,0.417972,-0.657841,611,1
2,0.109881,-1.096611,1.223598,-0.985595,-0.655160,-0.398183,611,1
3,-1.418435,-0.053236,-1.456628,-0.985595,0.271104,-1.281596,611,0
4,0.220900,1.005122,-0.918491,3.024304,0.623042,1.073490,611,1
...,...,...,...,...,...,...,...,...
41385,-0.039618,-0.560311,0.621655,-0.182988,0.232836,-0.475266,332,1
41386,-0.804940,-1.202590,0.510407,-0.985595,-0.670397,-0.798232,332,1
41387,0.334963,-0.844608,1.277709,-0.292999,0.020578,-0.162848,332,1
41388,-0.536572,0.305187,-0.186106,0.276632,0.417972,-0.657841,926,0


In [10]:
Xy_test_class = combine_Xy(*prepare_classification_data(
    osha_test,
    data_config['phys_chem_cols'], 
    data_config['naics_col'],
    data_config['class_target_col']
))
print(f'{round(analyze_duplicates(Xy_test_class))}% of classification test samples are identical')

88% of classification test samples are identical


In [11]:
Xy_train_reg = combine_Xy(*prepare_regression_data(
    osha_train,
    data_config['phys_chem_cols'], 
    data_config['naics_col'],
    data_config['reg_target_col']
))
print(f'{round(analyze_duplicates(Xy_train_reg))}% of regression training samples are identical')

14% of regression training samples are identical


In [12]:
Xy_test_reg = combine_Xy(*prepare_regression_data(
    osha_test,
    data_config['phys_chem_cols'], 
    data_config['naics_col'],
    data_config['reg_target_col']
))
print(f'{round(analyze_duplicates(Xy_test_reg))}% of regression test samples are identical')

0% of regression test samples are identical


## Data Leakage

Inspection of the original code (not shown here) showed that preprocessing steps such as centering and scaling were applied to the entire dataset before it was split into training and test sets. As a result, the training features were scaled using statistics that include the test samples, so test set information was inadvertently incorporated into model training.

The code below checks for a separate, more direct form of leakage: samples that are present in both the training and test sets.

In [13]:
def count_overlapping_samples(train_df, test_df):
    '''
    Check for overlapping samples between training and test sets. 
    
    A single identifier is created by joining all column values as strings.

    Parameters
    ----------
    train_df : pd.DataFrame
        Training dataset containing features and target.
    test_df : pd.DataFrame
        Test dataset containing features and target.

    Returns
    -------
    int
        The number of distinct rows (unique value combinations) that appear
        in both the training and test sets.
    '''
    train_df, test_df = train_df.copy(), test_df.copy()
    
    train_df['combined'] = train_df.apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    test_df['combined'] = test_df.apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

    train_set = set(train_df['combined'])
    test_set = set(test_df['combined'])

    overlap = test_set.intersection(train_set)

    return len(overlap)

In [14]:
overlap_class = count_overlapping_samples(Xy_train_class, Xy_test_class) / len(Xy_test_class) * 100
print(f'{round(overlap_class)}% of classification test samples included in training set')

0% of classification test samples included in training set


In [15]:
overlap_reg = count_overlapping_samples(Xy_train_reg, Xy_test_reg) / len(Xy_test_reg) * 100
print(f'{round(overlap_reg)}% of regression test samples included in training set')

0% of regression test samples included in training set
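
Note that `count_overlapping_samples` counts distinct overlapping rows, while the percentages above divide by the total number of test rows, repeats included. A row-level count can give a different picture when many rows are identical. The sketch below shows one way to compute both on toy data; the dataframes are hypothetical, and matching relies on exact equality of every column.

```python
import pandas as pd

# Toy train/test frames (hypothetical values) containing repeated rows
train = pd.DataFrame({
    'x': [1.0, 1.0, 2.0],
    'naics': ['611', '611', '332'],
    'detected': [1, 1, 0],
})
test = pd.DataFrame({
    'x': [1.0, 1.0, 3.0, 1.0],
    'naics': ['611', '611', '332', '611'],
    'detected': [1, 1, 1, 1],
})

# Left-merge test rows onto the distinct training rows; the indicator column
# records whether each test row has an exact match in the training set
columns = list(test.columns)
flagged = test.merge(
    train.drop_duplicates(), how='left', on=columns, indicator=True
)
in_train = flagged['_merge'] == 'both'

# Share of test rows (repeats included) that also occur in the training set
row_share = in_train.mean() * 100

# Distinct overlapping rows divided by all test rows, as in the cells above
n_distinct = len(flagged.loc[in_train, columns].drop_duplicates())
distinct_share = n_distinct / len(test) * 100
```

In this toy case `row_share` is 75 while `distinct_share` is 25, because three of the four test rows are copies of a single training row.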


## Potential Path Forward

This last section outlines an approach to address both the immediate concerns identified and potential enhancements for a future publication.

### Resolving Data Preparation Errors (Original Paper)
1. **Address Duplicate Samples and Improper Preprocessing**: Correct the origins of duplicate samples and ensure that preprocessing steps like centering and scaling are applied appropriately after data splitting.
2. **Re-execute the Original Bayesian Model**: After addressing the data issues, rerun the original model to assess whether the previously reported performance estimates were affected.
3. **Consider Issuing an Erratum**: If the re-evaluation yields significantly different results, consider issuing an erratum to the original publication to correct the scientific record.
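
For the preprocessing fix in step 1, a minimal sketch of leak-free centering and scaling is shown below. It uses scikit-learn's `StandardScaler` on synthetic data as an assumption; the original code may have used a different scaling routine, and the split proportions here are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first, so the test rows never influence the scaling statistics
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler on the training rows only ...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ... then reuse the training mean and standard deviation on the test rows
X_test_scaled = scaler.transform(X_test)
```

The scaled test features will generally not have exactly zero mean and unit variance, which is expected: they are transformed with the training statistics.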

### Additional Enhancements (Future Paper)
1. **Implement More Rigorous Data Cleaning**: Adopt the more rigorous data cleaning methodology suggested by Lavoué et al., 2023, to ensure the integrity of the dataset.
2. **Incorporate Additional OSHA Datasets**: Expand the dataset to include more recent inspection years and the IMIS dataset, enhancing the model's robustness and relevance.
3. **Aggregate Samples by Unique Chemical or Chemical-Subsector Combination**: This strategy mitigates the issue with identically-valued samples and aligns with the intended use of the model.
4. **Implement Cross-Validation**: Employ cross-validation techniques to more fully gauge the model's prediction error, moving beyond a single train-test split.
5. **Evaluate Less Computationally Intensive Models**: Explore simpler models, such as random forests, which may offer sufficient performance with reduced computational demands.
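
Items 4 and 5 could be combined along the lines of the sketch below, which cross-validates a random forest with folds grouped by chemical. Grouping by chemical is an assumption made here so that identically-valued samples of one chemical never sit on both sides of a split; the data are synthetic and the names (`groups`, `rmse_per_fold`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic data (hypothetical): 20 chemicals with 5 samples each
rng = np.random.default_rng(0)
n_chemicals, n_per_chemical = 20, 5
groups = np.repeat(np.arange(n_chemicals), n_per_chemical)

# Features are identical within a chemical, mimicking chemical-level properties
chem_features = rng.normal(size=(n_chemicals, 6))
X = chem_features[groups]
y = X[:, 0] + rng.normal(scale=0.3, size=len(groups))

# Grouped folds: all samples of a chemical fall in the same fold
cv = GroupKFold(n_splits=5)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn reports negated RMSE so that higher scores are better
scores = cross_val_score(
    model, X, y, groups=groups, cv=cv,
    scoring='neg_root_mean_squared_error'
)
rmse_per_fold = -scores
```

The spread of `rmse_per_fold` across folds gives a sense of how much the error estimate depends on which chemicals are held out, something a single train-test split cannot show.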