name: lcms_data_processing
date: 08/20/2024
version: 1.0
author: Justin Sankey

description: Takes raw liquid chromatography mass spectrometry (LCMS) data,
extracts relevant parameters for analysis and writes them to Excel.

When you execute the notebook for the first time, you need to install all required Python packages.
To do so, run the following commands in your Python console or Anaconda prompt:
- pip install pandas
- pip install numpy
- pip install matplotlib
- pip install seaborn
- pip install scikit-learn
- pip install jinja2
- pip install openpyxl

In [None]:
# import all needed packages
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

This is the only cell that you should have to edit.
Enter your desired input and output file paths
and change what you deem to be an acceptable recovery range.
*Replaced .txt input with \t separator by .csv input with , separator.*
*Should we implement both options? Which will be used?*
*New: indicate the directory to save plots to.*
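
Both input formats could be supported by choosing the separator from the file extension. The following is only a minimal sketch of that idea; the helper name `separator_for` and the `SEPARATORS` mapping are hypothetical and not part of this notebook.

```python
import os

# Hypothetical helper (not part of the notebook): map a file extension to
# the column separator that pd.read_csv should use.
SEPARATORS = {'.csv': ',', '.txt': '\t'}


def separator_for(filepath: str) -> str:
    """Return the column separator for a raw data file, based on its extension."""
    extension = os.path.splitext(filepath)[1].lower()
    if extension not in SEPARATORS:
        raise ValueError(f'Unsupported file type: {extension!r}')
    return SEPARATORS[extension]


print(repr(separator_for('PFAS_ACF_Batch1.csv')))
print(repr(separator_for('PFAS_ACF_Batch1.txt')))
```

The result could then be passed on as `pd.read_csv(raw_filepath, sep=separator_for(raw_filepath))`.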


In [None]:
# raw data upload file path
raw_filepath = r'example_data_raw\PFAS_ACF_Batch1.csv'

# processed data output excel file path
processed_filepath =r'example_data_processed\PFAS_ACF_Batch1_processed.xlsx'

# processed data long format
processed_filepath_long =r'example_data_processed\PFAS_ACF_Batch1_processed.csv'

# file path to write QCS0 data to 
qcs0_filepath = r'example_data_processed\QCS0_area_values.csv' 

# directory to save plots to
plot_directory = r'example_figures'

#color-coding for recoveries table
in_range = 'background-color: green'
in_range_min_val = 0.6 
in_range_max_val = 1.4
out_range = 'background-color: red'
out_range_min_val = 0.4 
out_range_max_val = 1.8
question_range = 'background-color: yellow'

# file paths for IDL and IQL data - not meant to be adopted
IDL_2024_filepath = r'example_data_raw\IDL_2024.csv'
IDL_IQL_filepath = r'example_data_raw\IDL_IQL.csv'

In [None]:
# Load data file
if raw_filepath[-4:] == '.csv':
    data = pd.read_csv(raw_filepath, delimiter=',', encoding='utf-8', low_memory=False, header=0)
elif raw_filepath[-4:] == '.txt':
    # tab-separated export
    data = pd.read_csv(raw_filepath, delimiter='\t', encoding='utf-8', low_memory=False, header=0)
else:
    raise ValueError(f'Unsupported file type: {raw_filepath}')

# clean up messy 'calculated concentration' column
data['Calculated Concentration'] = data['Calculated Concentration'].replace(
    {'<1 points': np.nan, '< 0': np.nan, 'no root': np.nan, 'NaN': np.nan}
    ).astype('float')

# correct channel names in original data (all of the TOF channels are labelled with _TOF MS, only 2 of them are labelled with only _TOF)
mask_names = data['Component Name'].str.endswith('_TOF', na=False)
data.loc[mask_names, 'Component Name'] = data.loc[mask_names, 'Component Name'] + ' MS'

# split data into quantification data, calibration data, and process blanks
# explicit copies avoid writing to views of `data` later on
calibration_only = data[data['Sample Type'] == 'Standard'].copy()
quantification_blank = data[data['Sample Type'] != 'Standard'].copy()
is_blank = quantification_blank['Sample Comment'].str.contains('Blank', na=False)
blank_only = quantification_blank[is_blank].copy()
quantification_only = quantification_blank[~is_blank].copy()

# apply the same correction to the quantification subset (a no-op if the names were already corrected above)
mask_names = quantification_only['Component Name'].str.endswith('_TOF', na=False)
quantification_only.loc[mask_names, 'Component Name'] = quantification_only.loc[mask_names, 'Component Name'] + ' MS'

# display settings 
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_colwidth', None)  # Show full width of columns


Calculate response factors from calibration data: the ratio of (i) area of IDA (= extracted standard) * actual concentration of IPS to (ii) area of IPS (= non-extracted standard) * actual concentration of IDA.
Generate a boxplot of the response factors and write the mean and standard deviation for each IDA to the Excel file.

Conclusion: the values used so far are very close to the new calculation; one or two calibration points may be missing or added, which is not a big problem.
The response factor varies tremendously from PFAS to PFAS, so I would not use the average of all response factors.
Right now concentration is used for normalization only for IPS. Discuss with Jitka and Simon why IDA concentration is not included in the evaluation.
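
As a worked example of the response factor described above, the sketch below evaluates the ratio for three calibration injections of a single IDA. All numbers are made up for illustration, and the helper name `ida_ips_ratio` is hypothetical.

```python
from statistics import mean, stdev


def ida_ips_ratio(area_ida, conc_ida, area_ips, conc_ips):
    """(IDA area * IPS concentration) / (IPS area * IDA concentration)."""
    return (area_ida * conc_ips) / (area_ips * conc_ida)


# Made-up calibration injections for one IDA:
# (IDA area, IDA concentration, IPS area, IPS concentration)
calibration_points = [
    (9000.0, 5.0, 10000.0, 5.0),
    (10000.0, 5.0, 10000.0, 5.0),
    (11000.0, 5.0, 10000.0, 5.0),
]

ratios = [ida_ips_ratio(*point) for point in calibration_points]
response_factor_mean = mean(ratios)
# sample standard deviation (n - 1), the same convention as pandas .std()
response_factor_std = stdev(ratios)

print(ratios, response_factor_mean, response_factor_std)
```

With equal IDA and IPS concentrations the ratio reduces to the plain area ratio, so the three injections give ratios close to 0.9, 1.0 and 1.1.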

In [None]:
# define function which calculates the IDA IPS ratio.
# challenge - find the right IPS row, indicated in the Component Group Name of the IDA.
def calulate_ida_ips_ratio(data: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Calculates (IDA area * IPS concentration) / (IPS area * IDA concentration).

    :param data: Entire data chunk (including all rows and the following columns:
    Component Name, Sample Index, Component Group Name, Actual Concentration and Area).
    :type data: pd.DataFrame
    :param column_name: name of the column the calculated ratio should be saved to
    :type column_name: str
    :return: Data chunk only containing IDA rows with the corresponding ratio saved to a new column
    :rtype: pd.DataFrame
    """
    # work on a copy so that the new columns are not written to a view of the input
    data_only_ida = data[data['Component Name'].str.contains('IDA')].copy()
    data_only_ida[column_name] = np.nan
    data_only_ida['IPS Area'] = np.nan
    data_only_ida['IPS Concentration'] = np.nan
    for row_index in data_only_ida.index:
        sample_index = data_only_ida.loc[row_index, 'Sample Index']
        ips_channel_name = data_only_ida.loc[row_index, 'Component Group Name']
        corresponding_ips_area_row = data[(
            (data['Sample Index'] == sample_index) &
            (data['Component Name'] == ips_channel_name)
            )]
        data_only_ida.loc[row_index, 'IPS Area'] = corresponding_ips_area_row['Area'].iloc[0]
        data_only_ida.loc[row_index, 'IPS Concentration'] = corresponding_ips_area_row['Actual Concentration'].iloc[0]
        data_only_ida.loc[row_index, f'{column_name}'] = \
            (data_only_ida.loc[row_index, 'Area'] * corresponding_ips_area_row['Actual Concentration']).iloc[0] \
            / (corresponding_ips_area_row['Area'] * data_only_ida.loc[row_index, 'Actual Concentration']).iloc[0]
    
    return data_only_ida

In [None]:
# extract values of IDAs (= extracted standard) and IPS (= non-extracted standard)
# and save basic sample information as well as peak areas, actual concentration and Component Group Name.
# the Component Group Name is used to associate the right IPS with each IDA.
selected_columns = ['Sample Name', 'Sample Index','Component Name', 'Component Group Name', 'Area', 'Actual Concentration', 'IDA Average Response Factor']
calibration_only_ida = calibration_only[selected_columns]

# calculate IDA IPS ratio to compute response factors with calibration data
calibration_only_ida = calulate_ida_ips_ratio(data=calibration_only_ida, column_name="Response Factor from Scratch")

# create and save response factor box plots
image_path = os.path.join(plot_directory, 'response_factors.png')
_, ax = plt.subplots(figsize=(8, 8))
calibration_only_ida.boxplot(column='Response Factor from Scratch', by='Component Name', ax=ax)
plt.ylabel('Response Factor (IDA area/IPS area)')
plt.xticks(rotation=90)
plt.savefig(image_path)
plt.show()

# place holder to compare response factors with original data
calibration_only_ida['Response Factor from Scratch'] = calibration_only_ida['Response Factor from Scratch'] / 1

# create data frame with this response factor calculation (from scratch), its standard deviation and the original values evaluated by Sciex
response_factor = calibration_only_ida.groupby('Component Name', as_index=False)['Response Factor from Scratch'].mean()
response_factor['Response Factor Std'] = calibration_only_ida.groupby('Component Name')['Response Factor from Scratch'].std().to_list()
response_factor['Response Factor Sciex'] = calibration_only_ida.groupby('Component Name')['IDA Average Response Factor'].mean().to_list()
response_factor.rename(columns={'Response Factor from Scratch': 'Response Factor Mean'}, inplace=True)
response_factor.index = response_factor['Component Name']
response_factor.drop(columns=['Component Name'], inplace=True)

# write response factor to excel file
with pd.ExcelWriter(processed_filepath, engine='openpyxl') as writer:
    response_factor.to_excel(writer, sheet_name='Response Factor')

# save response factor plot to excel file
workbook = load_workbook(processed_filepath)
plot_sheet = workbook.create_sheet('Details RF')

img = Image(os.path.join(plot_directory, 'response_factors.png'))

cell_position = plot_sheet.cell(row=1, column=1).coordinate
plot_sheet.add_image(img, cell_position)

workbook.save(processed_filepath)

Calculate the recovery rate for each IDA compound in each sample.
$$
recovery~rate = \frac{\frac{area_{IDA~sample}~\cdot~concentration_{IPS~sample}}{area_{IPS~sample}~\cdot~concentration_{IDA~sample}}}{average\left(\frac{area_{IDA~calibration}~\cdot~concentration_{IPS~calibration}}{area_{IPS~calibration}~\cdot~concentration_{IDA~calibration}}\right)} = \frac{ratio}{response~factor}
$$
Estimate the uncertainty with the maximum error method.
$$
\Delta recovery~rate = \Delta response~factor \cdot \frac{ratio}{response~factor^{2}} + \Delta ratio \cdot \frac{1}{response~factor}
$$
The code below evaluates only the first term and stores it as 'Recovery Rate Uncertainty 1'.
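
The two formulas above can be checked with a small worked example. The values are made up, and the function names are hypothetical; setting the ratio uncertainty to zero leaves only the response factor term.

```python
def recovery_rate(ratio, response_factor):
    """Recovery rate = sample IDA-IPS ratio / calibration response factor."""
    return ratio / response_factor


def recovery_rate_max_error(ratio, delta_ratio, response_factor, delta_response_factor):
    """Maximum error: sum of the absolute contributions of both inputs."""
    from_response_factor = delta_response_factor * ratio / response_factor ** 2
    from_ratio = delta_ratio / response_factor
    return from_response_factor + from_ratio


# Made-up values for one IDA in one sample
ratio, delta_ratio = 0.8, 0.0
response_factor, delta_response_factor = 1.0, 0.1

rate = recovery_rate(ratio, response_factor)
rate_error = recovery_rate_max_error(ratio, delta_ratio, response_factor, delta_response_factor)
print(rate, rate_error)
```

Here the recovery rate is 0.8 with an uncertainty of about 0.08, i.e. roughly 80 % ± 8 %.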

Assign the correct recovery rate with its uncertainty to each PFAS default channel.
The assignment is made per sample using the 'IS Name' column, which states which IDA standard is associated with which PFAS.
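
The per-sample assignment via 'IS Name' amounts to a join between the PFAS rows and the IDA rows. The sketch below shows this with `DataFrame.merge` on made-up miniature tables; the component names ('PFAS-1', 'IDA-A', ...) are invented, and this is an alternative illustration rather than the loop used in the following cell.

```python
import pandas as pd

# Made-up miniature tables; the column names follow the notebook.
ida_rows = pd.DataFrame({
    'Sample Index': [1, 1, 2],
    'Component Name': ['IDA-A', 'IDA-B', 'IDA-A'],
    'Recovery Rate': [0.95, 0.70, 1.10],
})
pfas_rows = pd.DataFrame({
    'Sample Index': [1, 1, 2],
    'Component Name': ['PFAS-1', 'PFAS-2', 'PFAS-1'],
    'IS Name': ['IDA-A', 'IDA-B', 'IDA-A'],
})

# Join each PFAS row to the IDA row of the same sample that is named in 'IS Name'.
# validate='many_to_one' raises if an IDA appears twice for the same sample.
assigned = pfas_rows.merge(
    ida_rows.rename(columns={'Component Name': 'IS Name'}),
    on=['Sample Index', 'IS Name'],
    how='left',
    validate='many_to_one',
)
print(assigned)
```

A left join keeps every PFAS row and leaves the recovery rate empty where no matching IDA row exists, instead of raising an error.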

In [None]:
# select ida rows from quantification data and calculate ida ips ratio
selected_columns = [
    'Sample Name', 'Sample Index', 'Component Name', 'Component Group Name', 'Area', 'Actual Concentration', 'Reported Recovery'
    ]
quantification_ida = quantification_only[selected_columns]
quantification_ida = calulate_ida_ips_ratio(data=quantification_ida, column_name="IDA-IPS Ratio")

# assign right response factor to each IDA and calculate Recovery rate
quantification_ida.loc[:, 'Response Factor Mean'] = [np.nan] * len(quantification_ida)
quantification_ida.loc[:, 'Response Factor Std'] = [np.nan] * len(quantification_ida)

for component in response_factor.index:
    quantification_ida.loc[quantification_ida['Component Name'] == component, 'Response Factor Mean'] = response_factor.loc[component, 'Response Factor Mean']
    quantification_ida.loc[quantification_ida['Component Name'] == component, 'Response Factor Std'] = response_factor.loc[component, 'Response Factor Std']

# calculate recovery rate and its related uncertainty
quantification_ida['Recovery Rate'] = quantification_ida['IDA-IPS Ratio'] / quantification_ida['Response Factor Mean']
quantification_ida['Recovery Rate Uncertainty 1'] = quantification_ida['Response Factor Std'] * quantification_ida['IDA-IPS Ratio'] \
     / (quantification_ida['Response Factor Mean'] * quantification_ida['Response Factor Mean'])

# assign right recovery rate to each natural pfas channel
selected_columns = [
    'Sample Name', 'Sample Index', 'Acquisition Date & Time', 'Component Name', 'IS Name', 'Calculated Concentration', 'Area',
    ]
quantification_pfas_default = quantification_only[selected_columns].rename(columns={'IS Name': 'IDA Name'})

# get only PFAS default channels
quantification_pfas_default = quantification_pfas_default[~quantification_pfas_default['Component Name'].str.contains('IDA|IPS|13C|MS')]

# assign right recovery rate and uncertainty to each PFAS compound
quantification_pfas_default.loc[:,'IDA Area'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IDA Concentration'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IPS Area'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IPS Concentration'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IDA-IPS Ratio'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IPS Name'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'Recovery Rate'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'Recovery Rate Uncertainty 1'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'Recovery Rate Sciex'] = [np.nan] * len(quantification_pfas_default)
for row_index in quantification_pfas_default.index:
    sample_index = quantification_pfas_default.loc[row_index, 'Sample Index']
    ida_channel_name = quantification_pfas_default.loc[row_index, 'IDA Name']
    recovery_row = quantification_ida[(
        (quantification_ida['Component Name'] == ida_channel_name) &
        (quantification_ida['Sample Index'] == sample_index)
    )]
    quantification_pfas_default.loc[row_index,'IDA Area'] = recovery_row.loc[:,'Area'].to_list()[0]
    quantification_pfas_default.loc[row_index,'IDA Concentration'] = recovery_row.loc[:,'Actual Concentration'].to_list()[0]
    quantification_pfas_default.loc[row_index,'IPS Area'] = recovery_row.loc[:,'IPS Area'].to_list()[0]
    quantification_pfas_default.loc[row_index,'IPS Concentration'] = recovery_row.loc[:,'IPS Concentration'].to_list()[0]
    quantification_pfas_default.loc[row_index, 'IDA-IPS Ratio'] = \
        (recovery_row.loc[:,'IDA-IPS Ratio'] * 100).to_list()[0]
    quantification_pfas_default.loc[row_index, 'IPS Name'] = \
        recovery_row.loc[:,'Component Group Name'].to_list()[0]
    quantification_pfas_default.loc[row_index, 'Recovery Rate'] = \
        (recovery_row.loc[:,'Recovery Rate'] * 100).to_list()[0]
    quantification_pfas_default.loc[row_index, 'Recovery Rate Uncertainty 1'] = \
        (recovery_row.loc[:,'Recovery Rate Uncertainty 1'].round(decimals=3) * 100).to_list()[0]
    quantification_pfas_default.loc[row_index, 'Recovery Rate Sciex'] = \
        recovery_row.loc[:,'Reported Recovery'].to_list()[0]
    

# round the percentage columns once, after all rows have been filled
for column in ['IDA-IPS Ratio', 'Recovery Rate', 'Recovery Rate Uncertainty 1', 'Recovery Rate Sciex']:
    quantification_pfas_default[column] = quantification_pfas_default[column].round(decimals=1)

# put relevant information to pivot tables
ratio_ida_ips = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='IDA-IPS Ratio', aggfunc='first'
    )
recovery_sciex = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Recovery Rate Sciex', aggfunc='first'
    )
recovery = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Recovery Rate', aggfunc='first'
    )
recovery_uncertainty_1 = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Recovery Rate Uncertainty 1', aggfunc='first'
    )

# write to excel file
with pd.ExcelWriter(processed_filepath, engine='openpyxl', mode='a') as writer:
    recovery_sciex.to_excel(writer, sheet_name='Recovery Rate Sciex')
    recovery.to_excel(writer, sheet_name='Recovery Rate')
    recovery_uncertainty_1.to_excel(writer, sheet_name='Recovery Rate Uncertainty 1')
    ratio_ida_ips.to_excel(writer, sheet_name='IDA-IPS Ratio')

Calculation of the sample IDA-IPS ratio and the calibration IDA-IPS ratio (referred to as response factor above), Method Justin:
uses the columns 'Area IDA' and 'Area IPS' directly.
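
This method divides the average calibration IDA area by the average calibration IPS area, whereas the response factor above averages the per-injection ratios. The two are not the same quantity in general, as the following sketch with made-up areas for one IDA component illustrates (concentrations are ignored here).

```python
from statistics import mean

# Made-up calibration areas for one IDA component
cal_area_ida = [900.0, 1000.0, 1100.0]
cal_area_ips = [1000.0, 1000.0, 2000.0]

# Ratio of the averages, as in the cells below
ratio_of_means = mean(cal_area_ida) / mean(cal_area_ips)

# Average of the per-injection ratios, as in the response factor calculation
mean_of_ratios = mean([ida / ips for ida, ips in zip(cal_area_ida, cal_area_ips)])

# Made-up sample areas, normalized to the calibration ratio
sample_ratio = 800.0 / 1000.0
ips_normalized_recovery = sample_ratio / ratio_of_means

print(ratio_of_means, mean_of_ratios, ips_normalized_recovery)
```

In this example the ratio of the averages is 0.75 while the average of the ratios is about 0.82, so the choice of method changes the resulting recovery.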


In [None]:
#Area IDA values for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IDA']
area_ida = quantification_blank[selected_columns_area]
area_ida = area_ida[area_ida['Component Name'].str.contains('IDA')]
area_ida.loc[:,'Sample Name Date'] = area_ida['Sample Name'].astype(str) + "_" + area_ida['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
area_ida_piv = area_ida.pivot_table(index=('Sample Name Date',), columns='Component Name', values='Area IDA', aggfunc='first')
area_ida_piv

In [None]:
#Calibration Area IDA Average for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IDA']
cal_area_ida = calibration_only[selected_columns_area]
cal_area_ida = cal_area_ida[cal_area_ida['Component Name'].str.contains('IDA')]
cal_area_ida.loc[:,'Sample Name Date'] = cal_area_ida['Sample Name'].astype(str) + "_" + cal_area_ida['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
cal_area_ida_piv = cal_area_ida.pivot_table(index=('Sample Name Date',), columns='Component Name', values='Area IDA', aggfunc='first')
#Establishes a row with average of each column
mean_cal = cal_area_ida_piv.mean(skipna=True)
mean_cal=pd.DataFrame(mean_cal).T
mean_cal.index=['Average']
cal_area_ida_piv = pd.concat([cal_area_ida_piv,mean_cal])

In [None]:
#IPS Area Values for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IPS']
area_ips = quantification_blank[selected_columns_area]
area_ips = area_ips[area_ips['Component Name'].str.contains('IDA')]
area_ips.loc[:,'Sample Name Date'] = area_ips['Sample Name'].astype(str) + "_" + area_ips['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
area_ips_piv = area_ips.pivot_table(index=('Sample Name Date',), columns='Component Name', values='Area IPS', aggfunc='first')

In [None]:
#IPS Calibration Area and Average for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IPS']
cal_area_ips = calibration_only[selected_columns_area]
cal_area_ips = cal_area_ips[cal_area_ips['Component Name'].str.contains('IDA')]
cal_area_ips.loc[:,'Sample Name Date'] = cal_area_ips['Sample Name'].astype(str) + "_" + cal_area_ips['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
cal_area_ips_piv = cal_area_ips.pivot_table(index=('Sample Name Date',), columns='Component Name', values='Area IPS', aggfunc='first')
#establishes mean for each column
mean_cal = cal_area_ips_piv.mean(skipna=True)
mean_cal=pd.DataFrame(mean_cal).T
mean_cal.index=['Average']
cal_area_ips_piv = pd.concat([cal_area_ips_piv,mean_cal])


In [None]:
#calibration IDA/IPS ratio
cal_ida_ips_ratio = cal_area_ida_piv.loc['Average']/cal_area_ips_piv.loc['Average']
cal_area_ips_piv.loc['Average']
cal_area_ida_piv.loc['Average']
#Sample IDA/IPS Ratio
sample_ida_ips_ratio = area_ida_piv/area_ips_piv

In [None]:
#IPS normalized recoveries to calibration data
ips_norm_recovery = sample_ida_ips_ratio/cal_ida_ips_ratio
# ips_norm_recovery
# color code recoveries
def color_map(val):
    if in_range_min_val <= val <= in_range_max_val:
        return in_range
    elif val < out_range_min_val or val > out_range_max_val:
        return out_range
    else:
        return question_range

# Apply the style function to the entire DataFrame
styled_ips_norm_recovery = ips_norm_recovery.style.applymap(color_map)
styled_ips_norm_recovery
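
The color coding above sorts each recovery into one of three bands. The sketch below restates that logic with plain labels instead of CSS strings. The function name `recovery_band` is hypothetical, and returning an empty label for missing values is an assumption of this sketch: `color_map` as written above sends NaN to the yellow band.

```python
import math

# Thresholds copied from the configuration cell
IN_RANGE_MIN, IN_RANGE_MAX = 0.6, 1.4
OUT_RANGE_MIN, OUT_RANGE_MAX = 0.4, 1.8


def recovery_band(value: float) -> str:
    """Classify a recovery as 'green', 'red' or 'yellow' ('' for missing values)."""
    if math.isnan(value):
        return ''  # assumption: leave missing recoveries uncolored
    if IN_RANGE_MIN <= value <= IN_RANGE_MAX:
        return 'green'
    if value < OUT_RANGE_MIN or value > OUT_RANGE_MAX:
        return 'red'
    return 'yellow'


for value in (1.0, 0.5, 1.6, 0.3, 2.0, float('nan')):
    print(value, repr(recovery_band(value)))
```

Values between 0.4 and 0.6 or between 1.4 and 1.8 fall into the yellow band, which marks recoveries that need a closer look.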

In [None]:
#Reported Recovery Pivot Table
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Reported Recovery']
reported_recovery = quantification_blank[selected_columns_area]
reported_recovery = reported_recovery[reported_recovery['Component Name'].str.contains('IDA')]
reported_recovery.loc[:,'Sample Name Date'] = reported_recovery['Sample Name'].astype(str) + "_" + reported_recovery['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and reported recovery as the values
reported_recovery_piv = reported_recovery.pivot_table(index=('Sample Name Date',), columns='Component Name', values='Reported Recovery', aggfunc='first')
reported_recovery_piv=reported_recovery_piv/100

In [None]:
#Color map of Reported recoveries
styled_reported_recovery = reported_recovery_piv.style.applymap(color_map)
styled_reported_recovery

In [None]:
# The following cell extracts quality control standard data (QCS0) from the isotope dilution analysis (IDA)
# and appends it to an existing QCS0 file, which is specified in the second code cell
# (*qcs0_filepath*).

# Filter rows where 'Sample Name' contains 'QCS0'
#qcs0_samples = area_ida_piv.reset_index()
#qcs0_samples = qcs0_samples[qcs0_samples['Sample Name Date'].str.contains('QCS0')]

# Append the filtered DataFrame to an existing CSV file
#if os.path.exists(qcs0_filepath):
    # Load the existing data
   # qsc0_existing_data = pd.read_csv(qcs0_filepath)
    
    # Combine existing data with new data, avoiding duplicates
    #qsc0_combined_data = pd.concat([qsc0_existing_data, qcs0_samples]).drop_duplicates(subset=['Sample Name Date'])
    
    # Write back to the CSV without writing headers again
    #qsc0_combined_data.to_csv(qcs0_filepath, index=False)
#else:
    # If the file doesn't exist, write the data with headers
    #qcs0_samples.to_csv(qcs0_filepath, index=False)

Compute method detection limits (MDL) based on the average and standard deviation of the process blanks. Use instrument detection limits (IDL) for the PFAS compounds not included in the process blanks.
Introduce the new column 'Below Detection Threshold' and flag all Calculated Concentration values of the PFAS quantification that lie below the determined detection limits.
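
The detection threshold logic can be sketched for a single compound as follows: MDL = mean + 3 * standard deviation of the blanks, with the IDL as fallback when the blanks do not allow a standard deviation. The numbers are made up and the function name `detection_threshold` is hypothetical.

```python
from statistics import mean, stdev


def detection_threshold(blank_concentrations, idl=None):
    """MDL = mean + 3 * sample std of the process blanks; fall back to the IDL."""
    values = [value for value in blank_concentrations if value is not None]
    if len(values) < 2:
        # not enough blank data for a standard deviation
        return idl
    return mean(values) + 3 * stdev(values)


# Made-up blank concentrations for one compound
mdl_from_blanks = detection_threshold([0.10, 0.12, 0.14])
# No blank data: the (made-up) IDL is used instead
mdl_from_idl = detection_threshold([], idl=0.05)

print(mdl_from_blanks, mdl_from_idl)
```

For the three blanks the mean is 0.12 and the sample standard deviation 0.02, which gives a threshold of about 0.18.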

In [None]:
selected_columns = ['Sample Name', 'Component Name', 'Calculated Concentration']

# isolate blank data and remove IDA/IPS values as well as TOF channels
blank_only = blank_only[selected_columns]
blank_only = blank_only[~blank_only['Component Name'].str.contains('IDA|IPS|13C|MS')]

# create data frame with average and standard deviation values and calculate MDL = mean + 3 * std
mdl = blank_only.groupby('Component Name', as_index=False)['Calculated Concentration'].mean()
mdl['Calculated Concentration Std'] = blank_only.groupby('Component Name', as_index=False)['Calculated Concentration'].std()['Calculated Concentration'].to_list()
mdl['MDL'] = mdl['Calculated Concentration'] + 3 * mdl['Calculated Concentration Std']
mdl.rename(columns={"Calculated Concentration": "Calculated Concentration Mean"}, inplace=True)

# load idl values from IDL_IQL file and IDL_2024 file
idl_2024 = pd.read_csv(IDL_2024_filepath, index_col=0, low_memory=False, nrows=1)
idl = pd.read_csv(IDL_IQL_filepath, skiprows=[2], index_col=0)
idl = idl.apply(pd.to_numeric, errors='coerce')

# use idl_2024 and append all values from idl, which are not included in 2024
for idl_column in idl.columns:
    if idl_column not in idl_2024.columns:
        idl_2024[idl_column] = idl[idl_column]

# write each IDL value to a new column of the mdl dataframe
mdl['IDL'] = np.nan
for row_index in mdl.index:
    component_name = mdl.loc[row_index, 'Component Name']
    try:
        mdl.loc[row_index, 'IDL'] = idl_2024[component_name].to_list()[0]
    except (KeyError, IndexError):
        print(f'No IDL available for {component_name}')

mdl.index = mdl['Component Name']
mdl.drop(columns=['Component Name'], inplace=True)

# use the MDL where available, otherwise fall back to the IDL
mdl['Detection Threshold'] = mdl['MDL'].fillna(mdl['IDL'])

# write detection threshold to excel file
with pd.ExcelWriter(processed_filepath, engine='openpyxl', mode='a') as writer:
    mdl.to_excel(writer, sheet_name='Detection Threshold')

quantification_pfas_default.loc[:, 'Below Detection Threshold'] = [False] * len(quantification_pfas_default)
quantification_pfas_default.loc[:, 'Detection Threshold'] = np.nan
for row_index in quantification_pfas_default.index:
    component_name = quantification_pfas_default.loc[row_index, 'Component Name']
    detection_threshold = mdl.loc[component_name, 'Detection Threshold']
    quantification_pfas_default.loc[row_index, 'Detection Threshold'] = detection_threshold
    if quantification_pfas_default.loc[row_index, 'Calculated Concentration'] < detection_threshold:
        quantification_pfas_default.loc[row_index, 'Below Detection Threshold'] = True

Evaluate the relative deviation between the calculated concentrations from the default channel and the _TOF MS channel for all PFAS components and all samples.
Flag all calculated concentration values that deviate by more than 30 % between channels.
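
The deviation used in the next cell is the difference between the two channels divided by their mean, expressed in percent, which is where the factor 200 comes from. A small sketch with made-up concentrations (the function name `channel_deviation_percent` is hypothetical):

```python
def channel_deviation_percent(conc_default: float, conc_tof: float) -> float:
    """Difference between the two channels relative to their mean, in percent."""
    return 200 * (conc_default - conc_tof) / (conc_default + conc_tof)


# Made-up concentration pairs (default channel, TOF MS channel)
pairs = [(10.0, 10.0), (12.0, 8.0), (8.0, 12.0), (10.0, 9.0)]

for conc_default, conc_tof in pairs:
    deviation = round(channel_deviation_percent(conc_default, conc_tof), 1)
    print(conc_default, conc_tof, deviation, abs(deviation) > 30)
```

The measure is symmetric: swapping the two channels only flips the sign, so the 30 % limit is applied to the absolute value.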

In [None]:
# assign TOF channel to each PFAS and save calculated concentration to new column
quantification_pfas_default.loc[:,'Calculated Concentration TOF'] = [np.nan] * len(quantification_pfas_default)
for row_index in quantification_pfas_default.index:
    sample_index = quantification_pfas_default.loc[row_index, 'Sample Index']
    tof_channel_name = quantification_pfas_default.loc[row_index, 'Component Name'] + '_TOF MS'
    tof_channel_row = quantification_only[(
        (quantification_only['Component Name'] == tof_channel_name) &
        (quantification_only['Sample Index'] == sample_index)
    )]
    quantification_pfas_default.loc[row_index, 'Calculated Concentration TOF'] = \
        tof_channel_row.loc[:,'Calculated Concentration'].to_list()[0]

# calculate the deviation between the channels relative to their mean, in percent
quantification_pfas_default['Channel Ratio'] = (
    200 * (quantification_pfas_default['Calculated Concentration'] - quantification_pfas_default['Calculated Concentration TOF']) \
        / (quantification_pfas_default['Calculated Concentration'] + quantification_pfas_default['Calculated Concentration TOF'])
).round(decimals=1)

# introduce new column which flags every value with more than 30 % deviation between the channels
quantification_pfas_default['Channel Ratio > 30'] = abs(quantification_pfas_default['Channel Ratio']) > 30

#create pivot table
channel_ratio = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Channel Ratio', aggfunc='first'
)

# color code percentage deviation
def deviation_color_map(val):
    # leave missing values unstyled
    if pd.isna(val):
        return ''
    if -30 <= val <= 30:
        return in_range
    return out_range

channel_ratio = channel_ratio.style.applymap(deviation_color_map)

# write to existing excel file
with pd.ExcelWriter(processed_filepath, engine='openpyxl', mode='a') as writer:
    channel_ratio.to_excel(writer, sheet_name='Channel Ratio')

Write the final concentration table with all information to Excel. Use the concentration value, mark everything below the detection threshold with '< MDL', and mark everything with more than 30 % deviation between channels with '> CR'.
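
The labelling rule can be written as a small function. In the cell below the '< MDL' label is assigned last and therefore overrides the channel label when both flags are set; the sketch mirrors that order. The function name `concentration_label` is hypothetical and the values are made up.

```python
def concentration_label(value, below_threshold: bool, channel_deviates: bool):
    """Return the table entry for one PFAS value.

    The detection threshold label takes precedence over the channel label.
    """
    if below_threshold:
        return '< MDL'
    if channel_deviates:
        return '> CR'
    return value


print(concentration_label(2.5, False, False))
print(concentration_label(2.5, False, True))
print(concentration_label(0.01, True, True))
```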

In [None]:
final_table = quantification_pfas_default[['Sample Name', 'Component Name', 'Calculated Concentration']].copy()
# cast to object so that text labels can be stored next to numeric values
final_table['Calculated Concentration'] = final_table['Calculated Concentration'].astype(object)
final_table.loc[quantification_pfas_default['Channel Ratio > 30'], 'Calculated Concentration'] = '> CR'
final_table.loc[quantification_pfas_default['Below Detection Threshold'], 'Calculated Concentration'] = '< MDL'

final_table = final_table.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Calculated Concentration', aggfunc='first'
)

# write to existing excel file
with pd.ExcelWriter(processed_filepath, engine='openpyxl', mode='a') as writer:
    final_table.to_excel(writer, sheet_name='Concentration Table')

quantification_pfas_default = quantification_pfas_default[[
    'Sample Name', 'Sample Index', 'Acquisition Date & Time','Component Name', 'Area', 'Calculated Concentration',
    'IDA Name', 'IDA Area', 'IDA Concentration', 'IPS Name', 'IPS Area', 'IPS Concentration', 'IDA-IPS Ratio',
    'Recovery Rate', 'Recovery Rate Uncertainty 1', 'Recovery Rate Sciex', 'Detection Threshold', 'Below Detection Threshold',
    'Channel Ratio', 'Channel Ratio > 30'
]]

quantification_pfas_default.to_csv(processed_filepath_long)

Determine linear calibration curve based on calibration data.
*to be discussed: method for R2 computation*
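
As input for that discussion: the cell below fits a line and then calls `r2_score` on the fit's own predictions, which corresponds to R² = 1 - SS_res / SS_tot. The sketch below spells this out in plain Python on made-up calibration points; the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up calibration points (concentration ratio, area ratio)
x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 0.9, 2.1, 2.9]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(slope, intercept, r2)
```

For a straight-line fit with intercept this value equals the squared Pearson correlation between x and y. Other choices, such as a fit forced through the origin or a weighted fit, would need a different definition.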

In [None]:
# Function to sanitize file names
def sanitize_filename(name):
    """Removes special characters from PFAS name to create valid directory names."""    
    return re.sub(r'[\\/*?:"<>|]', "_", name)

# extract PFAS data
selected_columns = ['Sample Name', 'Component Name', 'Actual Concentration','IS Actual Concentration','Area','IS Area']
data_pfas = data[data['Sample Name'].str.contains('PFAS CS')].copy()
data_pfas = data_pfas.fillna(0)
data_pfas = data_pfas[~data_pfas['Component Name'].str.contains('IDA|IPS|13C|d3|d5')]
data_pfas_ = data_pfas[~data_pfas['Component Name'].str.contains('TOF')].copy()
data_pfas_['Concentration/IS Concentration'] = data_pfas_['Actual Concentration']/data_pfas_['IS Actual Concentration']
data_pfas_['Area/IS Area'] = data_pfas_['Area']/data_pfas_['IS Area']

# Create a list of unique component names and count them.
components = data_pfas_['Component Name'].unique()
n_components = len(components)

image_paths = []
# Iterate over each component and create scatter plots with regression lines
for i, component in enumerate(components):
    component_data = data_pfas_[data_pfas_['Component Name'] == component]
    
    # Extract x and y values
    x = component_data['Concentration/IS Concentration'].values.reshape(-1, 1)
    y = component_data['Area/IS Area'].values
    
    # Perform linear regression
    model = LinearRegression()
    model.fit(x, y)
    y_pred = model.predict(x)
    r2 = r2_score(y, y_pred)

    # Regression equation
    slope = model.coef_[0]
    intercept = model.intercept_
    equation = f'y = {slope:.2f}x + {intercept:.2f}'

    plt.figure(figsize = (8,6))
    # Plot with Seaborn
    sns.regplot(
       # ax=axes[i], 
        x=x.flatten(), 
        y=y, 
        scatter=True, 
        fit_reg=True,
        line_kws={"color": "red"},  # Color of the regression line
        scatter_kws={"s": 50, "alpha": 0.7},  # Customize scatter points
        ci=95
    )
    # Set the title with the component name
    plt.title(f'{component}')
    plt.xlabel('Concentration/IS Concentration')
    plt.ylabel('Area/IS Area')
    plt.text(0.05, 0.95, f'{equation}\n$R^2$ = {r2:.2f}', 
             transform=plt.gca().transAxes, 
             fontsize=10, 
             verticalalignment='top', 
             bbox=dict(boxstyle="round,pad=0.3", edgecolor="black", facecolor="white"))
   
    # Sanitize the file name
    sanitized_component = sanitize_filename(component)
    image_path = os.path.join(plot_directory, f'{sanitized_component}.png')
    
    # Save the plot as an image
    plt.savefig(image_path)
    plt.close()
    
    image_paths.append(image_path)

Write all relevant data to the Excel file and add the calibration curves.
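
The calibration curve images are laid out in a grid with two images per row, 15 rows between image rows and 20 columns between image columns. The sketch below reproduces the anchor cells without openpyxl. The helper names are hypothetical; openpyxl offers `openpyxl.utils.get_column_letter` for the letter conversion.

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column index to spreadsheet letters (1 -> 'A', 27 -> 'AA')."""
    letters = ''
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return letters


def image_anchor(i: int, images_per_row: int = 2, row_step: int = 15, col_step: int = 20) -> str:
    """Top-left cell of the i-th image in the grid (0-based image index)."""
    row = 1 + (i // images_per_row) * row_step
    column = 1 + (i % images_per_row) * col_step
    return f'{column_letter(column)}{row}'


anchors = [image_anchor(i) for i in range(4)]
print(anchors)
```

The first four images land on A1, U1, A16 and U16. If the plots overlap in the sheet, the two step sizes are the values to adjust.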

In [None]:
# create excel with pandas excelwriter
with pd.ExcelWriter(processed_filepath, engine='openpyxl', mode='a') as writer:
    #styled_recovery.to_excel(writer, sheet_name='Recoveries')
    styled_ips_norm_recovery.to_excel(writer, sheet_name = 'Calculated Recoveries')
    styled_reported_recovery.to_excel(writer, sheet_name = 'Reported Recoveries')
    #ida_area_piv.to_excel(writer, sheet_name='Area_Pivot')
    cal_ida_ips_ratio.to_excel(writer, sheet_name='Calibration IDA_IPS Ratio')
    cal_area_ida_piv.to_excel(writer, sheet_name='Calibration IDA')
    cal_area_ips_piv.to_excel(writer, sheet_name='Calibration IPS')
    sample_ida_ips_ratio.to_excel(writer, sheet_name='Sample IDA_IPS Ratio')
    area_ida_piv.to_excel(writer, sheet_name='Sample IDA')
    area_ips_piv.to_excel(writer, sheet_name='Sample IPS')
    data_pfas_.to_excel(writer, sheet_name='Calibration Data')

workbook = load_workbook(processed_filepath)
plot_sheet = workbook.create_sheet('Calibration Curves')
    
# Insert all images into one sheet in a grid format
row_offset = 1  # Start at the first row
col_offset = 1  # Start at the first column
images_per_row = 2  # Number of images per row

for i, image_path in enumerate(image_paths):
    # Calculate the position for each image
    row_position = row_offset + (i // images_per_row) * 15  # Adjust the multiplier to control spacing
    col_position = col_offset + (i % images_per_row) * 20   # Adjust the multiplier to control spacing
    
    # Load the image
    img = Image(image_path)
    
    # Place the image at the calculated position
    cell_position = plot_sheet.cell(row=row_position, column=col_position).coordinate
    plot_sheet.add_image(img, cell_position)

workbook.save(processed_filepath)
