name: lcms_data_processing
date: 10/14/2024
version: 1.1
author: Justin Sankey, Johanna Ganglbauer

description: Takes raw liquid chromatography mass spectrometry (LCMS) data (table exported from the SCIEX Analyst software), computes recovery rates, method detection limits, and ratios of the default channel to the TOF MS channel. Generates plots, writes the results to Excel, and creates a long-format table (.csv) for data publications.

**Wish List**
- JG: allow separate recovery filters for separate compounds
- JG: automatically create list of chemicals including precursor mass / child mass and retention time from data

In [None]:
# import all needed packages
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
import math as ma

This is the first cell you have to edit. \
**Enter your input file paths:** \
(1) raw_filepaths is a list of files, each an exported table from the SCIEX Analyst software that you want to analyze \
(2) idl_2024_filepath is a .csv file containing instrument detection limits (IDLs) \
(3) idl_iql_filepath is another .csv file containing instrument detection limits for more PFAS compounds \
(4) processed_filepath gives the location and file prefix under which your results are saved \
(5) plot_directory is the folder in which your plots are created.

for (2) and (3) use default files available on the Lohmann drive under....

In [None]:
# raw data upload file path
# raw_filepath = r'example_data_raw\20240903_pfas_kynol_ks_single_compound.csv'
# raw_filepath = r'example_data_raw\20230703_ACF_Batch1_update_20240923.txt'
raw_filepaths = [
    r'example_data_raw\test_data.txt',
]

# file paths for IDL and IQL data - not meant to be changed
idl_2024_filepath = r'example_data_raw\IDL_2024.csv'
idl_iql_filepath = r'example_data_raw\IDL_IQL.csv'

# processed data output excel file path
processed_filepath = r'example_data_processed\test'

# directory to save plots to
plot_directory = r'example_figures'

This is the second cell you can make edits to:

(1) Change what you deem to be an acceptable recovery range. \
(2) Indicate which samples you want to use to calculate MDL values. You can only use terms that occur in the "Sample Comment" column of your input data. \
(3) Enter the basic facts about your sample.

A common source of confusion is that the developed method reports the concentration of the targeted compounds in the LCMS analysis in ng/sample. \
To convert the results to ng/l (water samples) or ng/g (tissue, sediment, etc.), you need to indicate the 'non-extracted sample quantity'. \
To compare the results to the calibration data, you need to indicate the volume of your extracted sample (commonly 0.5 ml).
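
As a minimal sketch of the unit conversion described above: the helper name `to_sample_concentration` and the example numbers are illustrative assumptions and are not taken from this notebook. Dividing an amount in ng/sample by the non-extracted sample quantity gives ng/l or ng/g.

```python
def to_sample_concentration(amount_ng_per_sample: float,
                            nonextracted_sample_quantity: float) -> float:
    """Convert ng/sample to ng per litre (or gram) of the original sample.

    Hypothetical helper: the quantity is the volume in l (water samples)
    or the weight in g (tissue, sediment, ...) that went into one extract.
    """
    if nonextracted_sample_quantity <= 0:
        raise ValueError("nonextracted_sample_quantity must be positive")
    return amount_ng_per_sample / nonextracted_sample_quantity


# 2.5 ng/sample found in an extract that was prepared from 0.25 l of water
concentration_ng_per_l = to_sample_concentration(2.5, 0.25)
print(concentration_ng_per_l)  # 10.0
```

With `nonextracted_sample_quantity = 1`, as in the default settings, the values in ng/sample and ng/l (or ng/g) are numerically identical.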

In [None]:
# color-coding and range for recoveries table
in_range = 'background-color: green'
in_range_min_val = 0.8 
in_range_max_val = 1.2
out_range = 'background-color: red'
out_range_min_val = 0.4 
out_range_max_val = 1.6
question_range = 'background-color: yellow'

# threshold for acceptance of absolute percentage difference between default channel and TOF MS channel
allowed_channel_deviation = 30

# samples used for MDL calculation - keywords must occur in the "Sample Comment" column of the input data
# if you want to use IDLs only, use an empty list.
mdl_selection = [
    'IS Check', 'Process Blank', 'Water Extraction Blank',
]

# hard facts about your sample - needed to correctly compare IPS concentrations and to convert the PFAS concentrations.
ips_concentration_calibration = 4  # ng/ml
extracted_sample_volume = 0.5  # ml
sample_unit = 'l'  # either 'l' for liter or 'g' for gram
nonextracted_sample_quantity = 1  # weight (if sample_unit is 'g') or volume (if sample_unit is 'l') of your non-extracted sample.

Reads in data and cleans it up.

In [None]:
# Parse inputs
if sample_unit not in ['l', 'g']:
    raise ValueError("Please use either 'g' or 'l' for variable sample_unit.")

# Ensure file path and folder path exist to write outputs to and create folders, if they do not exist
folder_path = os.path.dirname(processed_filepath)
if folder_path and not os.path.exists(folder_path):
    os.makedirs(folder_path)
processed_filepath_xlsx = processed_filepath + '.xlsx'
processed_filepath_csv = processed_filepath + '.csv'
    
if not os.path.exists(plot_directory):
    os.makedirs(plot_directory)

# Define columns of input which are needed for further processes:
columns_considered = [
    'Sample Name', 'Sample Index', 'Sample Comment', 'Sample Type',
    'Component Name',  'Component Group Name', 'Component Comment', 'IS Name',
    'Acquisition Date & Time', 'Injection Volume', 'Used',
    'Calculated Concentration', 'Actual Concentration', 'IS Actual Concentration', 
    'Reported Recovery', 'IDA Average Response Factor',
    'Area', 'IS Area', 'Area IDA', 'Area IPS',
    'Retention Time', 'IS Retention Time', 'Retention Time Error (%)', 'Retention Time Delta (min)',
    'Start Time', 'IS Start Time', 'End Time', 'IS End Time',
    'Precursor Mass', 'Fragment Mass', 
]

# Load input data files and put them all in one dataframe
data = pd.DataFrame()  # initialize empty data frames
component_index = 0  # initialize Component Index
for file in raw_filepaths:
    # read in file (comma-separated .csv or tab-separated .txt)
    if file.endswith('.csv'):
        this_data = pd.read_csv(file, delimiter=',', encoding='utf-8', low_memory=False, header=0)
    elif file.endswith('.txt'):
        this_data = pd.read_csv(file, delimiter='\t', encoding='utf-8', low_memory=False, header=0)
    else:
        # stop here: continuing would fail or silently reuse the previously loaded file
        raise ValueError(f'Raw input file paths must either be .csv or .txt files: {file}')
    # increase component index to remain unique for multiple data frames
    this_data['Component Index'] = this_data['Component Index'] + component_index
    component_index += max(this_data['Component Index'])

    if data.empty:
        # initialize data in first step (when data is empty); copy so later edits do not act on a view
        data = this_data[columns_considered].copy()
    else:
        data = pd.concat([data, this_data[columns_considered]], ignore_index=True)  # append to data
        

# Clean up 'Calculated Concentration' column - set all strange strings to NaN
data['Calculated Concentration'] = data['Calculated Concentration'].replace(
    {'<1 points': np.nan, '< 0': np.nan, 'no root': np.nan, 'NaN': np.nan}
    ).astype('float')

# Replace 'IPS-13C2_PFOA' values: optional, only when component name occurs
if any(data['Component Name'].isin(['IPS-13C2_PFOA'])):
    data['Component Group Name'] = data['Component Group Name'].replace('IPS-13C2_PFOA', 'IPS-13C4_PFOA')

    # Find rows where 'Component Group Name' is 'IPS-13C4_PFOA' (after replacement)
    mask = data['Component Group Name'] == 'IPS-13C4_PFOA'

    # Iterate through each of these rows and replace area in column
    for idx, row in data[mask].iterrows():
        sample_name = row['Sample Name']
        
        # Find the corresponding row with 'Component Name' == 'IPS-13C4_PFOA' and the same 'Sample Name'
        matching_row = data[(data['Component Name'] == 'IPS-13C4_PFOA') & (data['Sample Name'] == sample_name)]
        
        if not matching_row.empty:
            # Update the 'Area IPS' with the value from 'Area' in the matching row
            data.at[idx, 'Area IPS'] = matching_row['Area'].values[0]

# Correct channel names in original data (all of the TOF channels are labelled by _TOF MS, only 2 of them are labelled by only _TOF)
# .loc is used instead of chained indexing so that the assignment reliably modifies `data`
mask_names = data['Component Name'].str.endswith('_TOF', na=False)
data.loc[mask_names, 'Component Name'] = data.loc[mask_names, 'Component Name'] + ' MS'

mask_names = data['Component Name'].str.endswith('_TOF_MS', na=False)
data.loc[mask_names, 'Component Name'] = data.loc[mask_names, 'Component Name'].str[:-3] + ' MS'

# Get the order of components and split it to default channel (MS/MS) and TOF channel.
# If either channel is not available a copy of the other one is used respectively.
first_sample_id = data['Sample Index'].value_counts().index[0]  # sample index with the most rows (value_counts sorts by frequency)
components_sorted = data.loc[data['Sample Index'] == first_sample_id, 'Component Name']  # channel names of that sample
components_sorted = components_sorted[~components_sorted.str.contains('IDA|IPS|13C')].to_list()  # channel names excluding IPS and IDA

# initialize and fill lists of sorted components
components_default = []
components_tof = []
skip_index = []
for (index, component) in enumerate(components_sorted):
    if index in skip_index:
        continue
    if '_TOF MS' in component:
        if component[:-7] in components_sorted:
            components_default.append(component[:-7])
            components_tof.append(component)
            skip_index.append(components_sorted.index(component[:-7]))
        else:
            continue
    else:
        components_default.append(component)
        if component + '_TOF MS' in components_sorted:
            components_tof.append(component + '_TOF MS')
            skip_index.append(components_sorted.index(component + '_TOF MS'))
        else:
            components_tof.append(component)

# display settings 
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_colwidth', None)  # Show full width of columns

Separates the data into calibration data, data for the MDL calculation (blank data), and quantification data.
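
The split can be illustrated on a toy table. This is a sketch only: the sample comments and the keyword list below are made up, and `na=False` is used so that samples without a comment count as regular quantification samples.

```python
import pandas as pd

# Hypothetical miniature input: one calibration standard and three other samples
samples = pd.DataFrame({
    'Sample Type': ['Standard', 'Unknown', 'Unknown', 'Unknown'],
    'Sample Comment': [None, 'Process Blank', 'River water', None],
})
mdl_keywords = ['IS Check', 'Process Blank']  # assumed keyword list

calibration = samples[samples['Sample Type'] == 'Standard']
non_standard = samples[samples['Sample Type'] != 'Standard']

# Keywords are joined into one regular expression; missing comments become False
is_blank = non_standard['Sample Comment'].str.contains('|'.join(mdl_keywords), na=False)
blanks = non_standard[is_blank]
quantification = non_standard[~is_blank]

print(len(calibration), len(blanks), len(quantification))  # 1 1 2
```

Because the keywords are interpreted as a regular expression, keywords containing characters such as `(` or `+` would need escaping, for example with `re.escape`.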

In [None]:
# Split data into quantification data, calibration data, and blanks for mdl calculation
calibration_only = data[(data['Sample Type'] == 'Standard')]
quantification_blank = data[(data['Sample Type'] != 'Standard')]

if not quantification_blank.empty:
    # select data for computation of mdl and exclude it from quantification data
    if mdl_selection == []:
        blank_only = None
        quantification_only = quantification_blank
    else:
        # na=False treats samples without a comment as non-blanks
        blank_selection = quantification_blank['Sample Comment'].str.contains('|'.join(mdl_selection), na=False)
        if sum(blank_selection) == 0:
            print(f'Be careful, no samples have been collected for the MDL calculation because none of {mdl_selection} occurs in a Sample Comment.')
            print('If you do not want to select blanks, set the variable mdl_selection to [].')
        if sum(blank_selection) == len(quantification_blank):
            print('Be careful, all samples have been collected for the MDL calculation.')
            print('If this is not what you want, reset the variable mdl_selection in the top block.')
        blank_only = quantification_blank[blank_selection]
        quantification_only = quantification_blank[~blank_selection]

To avoid misunderstandings: two abbreviations are used extensively in the following:
- IDA: isotope dilution analysis, also known as SS = surrogate standard or EIS = extracted internal standard
- IPS: isotope performance standard, also known as IS = injection standard or NIS = non-extracted internal standard

The following two blocks calculate response factors from the calibration data as the ratio of (i) area of IDA * **actual concentration of IPS** and (ii) area of IPS * actual concentration of IDA. \
The data is saved to an Excel file and a boxplot of the response factors is created.

Note: The response factor calculation within SCIEX uses the ratio of (i) area of IDA and (ii) area of IPS * actual concentration of IDA. \
Because the IPS concentration is missing from that calculation, the two response factors differ by a factor of 4, which is the actual concentration of IPS (in ng/ml) in the calibration data.
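
The factor of 4 can be checked with a few made-up numbers. The areas and the IDA concentration below are purely illustrative; only the IPS concentration of 4 ng/ml is taken from the text above.

```python
import numpy as np

# Hypothetical calibration values for one IDA channel (three calibration runs)
area_ida = np.array([1200.0, 1500.0, 900.0])
area_ips = np.array([2400.0, 3000.0, 1800.0])
conc_ida = np.array([2.0, 2.0, 2.0])  # ng/ml, assumed
conc_ips = np.array([4.0, 4.0, 4.0])  # ng/ml, as in the calibration data

# Response factor as computed in this notebook (includes the IPS concentration)
rf_notebook = (area_ida * conc_ips) / (area_ips * conc_ida)
# Response factor without the IPS concentration, as described for SCIEX
rf_sciex_style = area_ida / (area_ips * conc_ida)

print(rf_notebook.mean())     # 1.0
print(rf_sciex_style.mean())  # 0.25
```

The two results differ exactly by the IPS concentration, which is why the boxplot created below also shows the SCIEX values multiplied by 4.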

In [None]:
# define function which calculates the IDA IPS ratio.
# challenge - find the right IPS row, which is indicated in the Component Group Name of the IDA.
def calulate_ida_ips_ratio(data: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Calculates IDA area times IPS concentration divided by IPS area times IDA concentration and saves the results in the indicated column.

    :param data: Entire data chunk (including all rows and the following columns:
    Component Name, Sample Index, Component Group Name, Actual Concentration, Area, and IDA Average Response Factor)
    :type data: pd.DataFrame
    :param column_name: name of the column the calculated ratio should be saved to
    :type column_name: str
    :return: Data chunk only containing IDA rows with the corresponding ratio saved to the new column
    :rtype: pd.DataFrame
    """
    # work on a copy so that the new columns are not written to a view of the input data
    data_only_ida = data[data['Component Name'].str.contains('IDA')].copy()
    data_only_ida[column_name] = np.nan
    data_only_ida['IPS Area'] = np.nan
    data_only_ida['IPS Concentration'] = np.nan
    for row_index in data_only_ida.index:
        sample_index = data_only_ida.loc[row_index, 'Sample Index']
        ips_channel_name = data_only_ida.loc[row_index, 'Component Group Name']
        corresponding_ips_area_row = data[(
            (data['Sample Index'] == sample_index) &
            (data['Component Name'] == ips_channel_name)
            )]
        data_only_ida.loc[row_index, 'IPS Area'] = corresponding_ips_area_row['Area'].iloc[0]
        data_only_ida.loc[row_index, 'IPS Concentration'] = corresponding_ips_area_row['Actual Concentration'].iloc[0]
        data_only_ida.loc[row_index, f'{column_name}'] = \
            (data_only_ida.loc[row_index, 'Area'] * corresponding_ips_area_row['Actual Concentration']).iloc[0] \
            / (corresponding_ips_area_row['Area'] * data_only_ida.loc[row_index, 'Actual Concentration']).iloc[0]
    
    return data_only_ida

In [None]:
# Extract values of IDAs and IPS
# Save basic sample information as well as areas of intensity peak, actual concentration and Component Group Name.
# The 'Component Group Name' is useful to associate the right IPS with each IDA.

# calculate IDA IPS ratio to compute response factors with calibration data
calibration_only_ida = calulate_ida_ips_ratio(
    data=calibration_only, column_name='Response Factor from Scratch',
    )

# create data frame with this response factor calculation (from scratch), the standard deviation and the original values evaluated by Sciex
response_factor = calibration_only_ida.groupby('Component Name', as_index=False)['Response Factor from Scratch'].mean()
response_factor['Response Factor Std'] = calibration_only_ida.groupby('Component Name')['Response Factor from Scratch'].std().to_list()
response_factor['Response Factor Sciex'] = calibration_only_ida.groupby('Component Name')['IDA Average Response Factor'].mean().to_list()
response_factor.rename(columns={'Response Factor from Scratch': 'Response Factor Mean'}, inplace=True)
response_factor.index = response_factor['Component Name']
response_factor.drop(columns=['Component Name'], inplace=True)

# write response factor to excel file
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl') as writer:
    response_factor.to_excel(writer, sheet_name='Response Factor')

# create and save response factor box plots
image_path = os.path.join(plot_directory, 'response_factors.png')
fig, ax = plt.subplots(figsize=(8, 8))
calibration_only_ida.boxplot(column='Response Factor from Scratch', by='Component Name', ax=ax, label='data')
ax.plot([np.nan] + (response_factor['Response Factor Sciex'] * 4).to_list(), color='red', linestyle='', marker="o", label='4 * RF Sciex')
ax.plot([np.nan] + (response_factor['Response Factor Sciex']).to_list(), color='blue', linestyle='', marker="o", label='RF Sciex')
fig.suptitle('')
ax.set_title('')
plt.ylabel('Response Factor (IDA area/IPS area)')
plt.xticks(rotation=90)
plt.legend()
plt.savefig(image_path, bbox_inches='tight')
plt.show()

# save response factor box plot to excel file
workbook = load_workbook(processed_filepath_xlsx)
plot_sheet = workbook.create_sheet('Details RF')

img = Image(os.path.join(plot_directory, 'response_factors.png'))

cell_position = plot_sheet.cell(row=1, column=1).coordinate
plot_sheet.add_image(img, cell_position)

workbook.save(processed_filepath_xlsx)

The following block helps to detect and correct wrong information about the amount of IPS added to the samples. \
The ratio of the mean IPS area in the samples to the mean IPS area in the calibration data is a good indicator.

If the IPS areas are in the same range for calibration and quantification, the added IPS concentration should also be in the same range. \
If the IPS areas of your quantification are twice as large as in the calibration, the added IPS concentration should also be twice as high.

**Source of confusion:**
The IPS concentration in the calibration data is given per milliliter.\
The IPS concentration in your quantification data is given per sample. If the volume of your extracted sample is 0.5 ml, the IPS concentration per ml is twice the value given per sample.

Here, the IPS concentration in your sample is converted from concentration per sample to concentration per ml to allow a fair comparison with the calibration data. \
The value in the 'Actual Concentration' column is usually given per sample - it is converted by dividing by the volume of your extracted sample.

After checking the graph, make sure you entered the amount of standard you added correctly in the SCIEX software.
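
A small numerical sketch of this conversion follows. The helper name `ips_concentration_per_ml` and the amount of 2 ng per sample are assumptions chosen for illustration; the 0.5 ml extract volume and the 4 ng/ml calibration concentration are the default settings of this notebook.

```python
def ips_concentration_per_ml(ips_ng_per_sample: float,
                             extracted_sample_volume_ml: float) -> float:
    """Convert an IPS amount given per sample into a concentration per ml of extract."""
    return ips_ng_per_sample / extracted_sample_volume_ml


calibration_ng_per_ml = 4.0                             # IPS concentration in the calibration data
sample_ng_per_ml = ips_concentration_per_ml(2.0, 0.5)   # 2 ng IPS in a 0.5 ml extract (assumed)

# If both concentrations agree, the mean IPS areas should be of similar size
expected_area_ratio = sample_ng_per_ml / calibration_ng_per_ml
print(sample_ng_per_ml, expected_area_ratio)  # 4.0 1.0
```

An observed area ratio far from the expected one hints at a wrong IPS amount entered in the SCIEX software, or at a real difference in instrument response.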

In [None]:
# Calculate IPS average area per compound in calibration data and plot it
calibration_only_ips = calibration_only[calibration_only['Component Name'].str.contains('IPS')]
quantification_blank_only_ips = quantification_blank[quantification_blank['Component Name'].str.contains('IPS')]

# evaluate mean per component
calibration_ips_area_averages = calibration_only_ips.groupby('Component Name')['Area'].mean()
quantification_blank_ips_area_averages = quantification_blank_only_ips.groupby('Component Name')['Area'].mean()

# evaluate IPS concentration in quantification
ips_concentration = quantification_blank_only_ips['Actual Concentration'].mean() / extracted_sample_volume

# create plot for comparison
image_path = os.path.join(plot_directory, 'ips_areas.png')
fig, ax = plt.subplots(ncols=2, figsize=(8, 8), sharey=True)
calibration_only_ips.boxplot(column='Area', by='Component Name', ax=ax[0], label='calibration')
ax[0].plot([np.nan] + calibration_ips_area_averages.to_list(), color='red', linestyle='', marker="o", label='calibration average')
quantification_blank_only_ips.boxplot(column='Area', by='Component Name', ax=ax[1], label='quantification')
ax[1].plot([np.nan] + quantification_blank_ips_area_averages.to_list(), color='red', linestyle='', marker="o", label='quantification average')
fig.suptitle('')
ax[0].set_title(f'Calibration: {ips_concentration_calibration} ng/ml')
ax[1].set_title(f'Quantification: {ips_concentration} ng/ml (?)')
ax[0].set_xticks(
    ticks=range(len(calibration_ips_area_averages) + 1),
    labels=[''] + calibration_ips_area_averages.index.to_list(), rotation=90
    )
ax[1].set_xticks(
    ticks=range(len(quantification_blank_ips_area_averages) + 1),
    labels=[''] + quantification_blank_ips_area_averages.index.to_list(), rotation=90
    )
ax[0].set_ylabel('IPS area')
[this_ax.set_xlabel('') for this_ax in ax]
plt.savefig(image_path, bbox_inches='tight')
plt.show()

# save IPS area comparison plot to excel file
workbook = load_workbook(processed_filepath_xlsx)
plot_sheet = workbook.create_sheet('Details IPS concentration')

img = Image(os.path.join(plot_directory, 'ips_areas.png'))

cell_position = plot_sheet.cell(row=1, column=1).coordinate
plot_sheet.add_image(img, cell_position)

workbook.save(processed_filepath_xlsx)

The following block calculates recovery rates for each IDA compound in each sample.
$$
recovery~rate = \frac{\frac{area_{IDA~sample}~\cdot~concentration_{IPS~sample}}{area_{IPS~sample}~\cdot~concentration_{IDA~sample}}}{average(\frac{area_{IDA~calibration}~\cdot~concentration_{IPS~calibration}}{area_{IPS~calibration}~\cdot~concentration_{IDA~calibration}})} = \frac{ratio}{response~factor}
$$

The recovery rates are illustrated for all samples and all IDAs to enable a quick visual sanity check.

The recovery rate computed within the SCIEX software is not normalized by concentration, so the recovery rate calculated by SCIEX reads:
$$
recovery~rate = \frac{\frac{area_{IDA~sample}}{area_{IPS~sample}}}{average(\frac{area_{IDA~calibration}}{area_{IPS~calibration}})}
$$

The uncertainty of the recovery rate is estimated with a maximum error method - for now, you can just ignore this part...
$$
\Delta recovery~rate = \Delta response~factor \cdot \frac{ratio}{response~factor^{2}} + \Delta ratio \cdot \frac{1}{response~factor}
$$
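
The recovery rate and its maximum-error uncertainty can be written as a small function. This is a sketch under assumptions: the function name `recovery_with_uncertainty` and all input numbers are hypothetical.

```python
def recovery_with_uncertainty(ratio: float, response_factor: float,
                              delta_ratio: float, delta_response_factor: float):
    """Return (recovery rate, maximum error) for one IDA in one sample."""
    recovery = ratio / response_factor
    # Maximum error: sum of the absolute contributions of both inputs
    delta_recovery = (delta_response_factor * ratio / response_factor ** 2
                      + delta_ratio / response_factor)
    return recovery, delta_recovery


recovery, delta_recovery = recovery_with_uncertainty(
    ratio=1.5, response_factor=2.0, delta_ratio=0.125, delta_response_factor=0.25,
)
print(recovery, delta_recovery)  # 0.75 0.15625
```

The code cell below stores only the first term (the contribution of the response factor) in the column 'Recovery Rate Uncertainty 1'.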

In [None]:
# Select ida rows from quantification data and calculate ida ips ratio
# function is defined in previous block
quantification_ida = calulate_ida_ips_ratio(
    data=quantification_only, column_name="IDA-IPS Ratio",
    )

# Assign right response factor to each IDA
quantification_ida.loc[:, 'Response Factor Mean'] = [np.nan] * len(quantification_ida)
quantification_ida.loc[:, 'Response Factor Std'] = [np.nan] * len(quantification_ida)

for component in response_factor.index:
    quantification_ida.loc[quantification_ida['Component Name'] == component, 'Response Factor Mean'] = response_factor.loc[component, 'Response Factor Mean']
    quantification_ida.loc[quantification_ida['Component Name'] == component, 'Response Factor Std'] = response_factor.loc[component, 'Response Factor Std']

# Calculate recovery rate and its related uncertainty
quantification_ida.loc[:, 'Recovery Rate'] = quantification_ida.loc[:, 'IDA-IPS Ratio'] / quantification_ida.loc[:, 'Response Factor Mean']
quantification_ida.loc[:, 'Recovery Rate Uncertainty 1'] = quantification_ida.loc[:,'Response Factor Std'] * quantification_ida.loc[:,'IDA-IPS Ratio'] \
     / (quantification_ida.loc[:, 'Response Factor Mean'] * quantification_ida.loc[:, 'Response Factor Mean'])

# Put recovery rate in pivot table
recovery = quantification_ida.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Recovery Rate', aggfunc='first', dropna=False,
    )

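# Cell style for the recoveries table, built from the thresholds set in the settings cell.
# Assumed colouring rule (adjust if needed): green inside [in_range_min_val, in_range_max_val],
# red below out_range_min_val or above out_range_max_val, yellow in between.
def color_map(value):
    if pd.isna(value):
        return ''
    if in_range_min_val <= value <= in_range_max_val:
        return in_range
    if value < out_range_min_val or value > out_range_max_val:
        return out_range
    return question_range

# Write recovery rate to excel file
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl', mode='a') as writer:
    recovery.style.map(color_map).to_excel(writer, sheet_name='Recovery Rate')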

# Multiply recovery rate by 100
quantification_ida["Recovery Rate"] = 100 * quantification_ida["Recovery Rate"] 

# # Plot recovery rates vs sciex recovery rates
# for title, group in quantification_ida.groupby('Component Name'):
#     plt.figure()
#     group.plot(x='Recovery Rate', y='Reported Recovery', title=title, kind='scatter', xlim=[0, 200], ylim=[0, 400], grid=True)
#     plt.show()

# Box plot for recovery rates
image_path = os.path.join(plot_directory, 'recovery_rates_box.png')
fig, ax = plt.subplots(figsize=(8, 8))
quantification_ida.boxplot(column='Recovery Rate', by='Component Name', ax=ax, label='data',)
ax.set_ylim([0,500])
fig.suptitle('')
ax.set_title('')
plt.ylabel('Recovery Rate')
plt.xticks(rotation=90)
plt.legend()
plt.savefig(image_path, bbox_inches='tight')
plt.show()

# plot data as points for recovery rates:
plot_data = quantification_ida.groupby('Sample Name')  # group data for plotting
cmap = plt.get_cmap('tab20', len(plot_data))  # initialize colours (plt.cm.get_cmap was removed in newer matplotlib versions)
image_path = os.path.join(plot_directory, 'recovery_rates.png')  # set path for figure

fig, ax = plt.subplots(figsize=(8, 8))
for index, (title, group) in enumerate(plot_data):
    # index by component name (keeping the column) and sort, so that all samples share the same x order
    group = group.set_index('Component Name', drop=False).sort_index()
    group.plot(
        y='Recovery Rate', ax=ax, marker='.', linestyle='None', label=title, grid=True, color = cmap(index),
    )
ax.set_ylim([0,400])
fig.suptitle('')
ax.set_xticks(range(len(group)))
ax.set_xticklabels(group['Component Name'], rotation=90)
ax.set_title('')
plt.xticks(rotation=90)
plt.ylabel('Recovery Rate')
plt.legend(loc='center right', bbox_to_anchor=(1.4, 0.5))
plt.savefig(image_path, bbox_inches='tight')
plt.show()

# save recovery rate plot to excel file
workbook = load_workbook(processed_filepath_xlsx)
plot_sheet = workbook.create_sheet('Details Recovery Rates')

img1 = Image(os.path.join(plot_directory, 'recovery_rates_box.png'))
img1.anchor = 'A1'
plot_sheet.column_dimensions['A'].width = img1.width / 6
plot_sheet.row_dimensions[1].height = img1.height
plot_sheet.add_image(img1)

img2 = Image(os.path.join(plot_directory, 'recovery_rates.png'))
img2.anchor = 'B1'
plot_sheet.column_dimensions['B'].width = img2.width / 6
plot_sheet.add_image(img2)

workbook.save(processed_filepath_xlsx)

# Initialize data frame for following assignments, select only needed columns and use reasonable naming.
selected_columns = [
    'Sample Name', 'Sample Index', 'Acquisition Date & Time', 'Component Name', 'IS Name', 'Calculated Concentration', 'Area',
    ]
# rename returns a new data frame, so the later column assignments do not act on a view of quantification_only
quantification_pfas_default = quantification_only[selected_columns].rename(columns={'IS Name': 'IDA Name'})

The right recovery rate is assigned to each PFAS. Each PFAS component has a corresponding IDA - the recovery rate is deduced from the recovery rate of the corresponding IDA.
The assignment works for each sample by using the 'IS Name' column which provides information on which IDA standard is associated to which PFAS.
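
The same assignment can be sketched as a table join instead of a row-by-row loop. The code cell below uses a loop; the merge shown here is an alternative illustration, and all component names and recovery values are made up.

```python
import pandas as pd

# Hypothetical PFAS rows: each row names its IDA in the 'IS Name' column
pfas = pd.DataFrame({
    'Sample Index': [1, 1, 2, 2],
    'Component Name': ['PFOA', 'PFOS', 'PFOA', 'PFBA'],
    'IS Name': ['IDA-13C8_PFOA', 'IDA-13C8_PFOS', 'IDA-13C8_PFOA', 'IDA-13C4_PFBA'],
})
# Hypothetical IDA rows with recovery rates (no PFBA standard in sample 2)
ida = pd.DataFrame({
    'Sample Index': [1, 1, 2],
    'Component Name': ['IDA-13C8_PFOA', 'IDA-13C8_PFOS', 'IDA-13C8_PFOA'],
    'Recovery Rate': [95.0, 110.0, 80.0],
})

# Match on sample and IDA name; a left join keeps PFAS rows without a matching IDA (NaN)
assigned = pfas.merge(
    ida.rename(columns={'Component Name': 'IS Name'}),
    on=['Sample Index', 'IS Name'],
    how='left',
)
print(assigned[['Sample Index', 'Component Name', 'Recovery Rate']])
```

PFAS rows whose IDA is not present in the sample end up with a missing recovery rate, which mirrors the `else` branch of the loop below.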

In [None]:
# Get only PFAS default channels
quantification_pfas_default = quantification_pfas_default[quantification_pfas_default['Component Name'].isin(components_default)]

# Assign right recovery rate and uncertainty to each PFAS compound
quantification_pfas_default.loc[:,'IDA Area'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IDA Concentration'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IPS Area'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IPS Concentration'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'IPS Name'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'Recovery Rate'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'Recovery Rate Uncertainty 1'] = [np.nan] * len(quantification_pfas_default)
quantification_pfas_default.loc[:,'Recovery Rate Sciex'] = [np.nan] * len(quantification_pfas_default)
for row_index in quantification_pfas_default.index:
    sample_index = quantification_pfas_default.loc[row_index, 'Sample Index']
    ida_channel_name = quantification_pfas_default.loc[row_index, 'IDA Name']
    if ida_channel_name in quantification_ida['Component Name'].to_list():
        recovery_row = quantification_ida[(
            (quantification_ida['Component Name'] == ida_channel_name) &
            (quantification_ida['Sample Index'] == sample_index)
        )]
        quantification_pfas_default.loc[row_index,'IDA Area'] = recovery_row.loc[:,'Area'].to_list()[0]
        quantification_pfas_default.loc[row_index,'IDA Concentration'] = recovery_row.loc[:,'Actual Concentration'].to_list()[0]
        quantification_pfas_default.loc[row_index,'IPS Area'] = recovery_row.loc[:,'IPS Area'].to_list()[0]
        quantification_pfas_default.loc[row_index,'IPS Concentration'] = recovery_row.loc[:,'IPS Concentration'].to_list()[0]
        quantification_pfas_default.loc[row_index, 'IPS Name'] = \
            recovery_row.loc[:,'Component Group Name'].to_list()[0]
        quantification_pfas_default.loc[row_index, 'Recovery Rate'] = \
            (recovery_row.loc[:,'Recovery Rate']).to_list()[0]
        quantification_pfas_default.loc[row_index, 'Recovery Rate Uncertainty 1'] = \
            (recovery_row.loc[:,'Recovery Rate Uncertainty 1'].round(decimals=3) * 100).to_list()[0]
        quantification_pfas_default.loc[row_index, 'Recovery Rate Sciex'] = \
            recovery_row.loc[:,'Reported Recovery'].to_list()[0]
        
    else:
        quantification_pfas_default.loc[row_index,'IDA Area'] = np.nan
        quantification_pfas_default.loc[row_index,'IDA Concentration'] = np.nan
        quantification_pfas_default.loc[row_index,'IPS Area'] = np.nan
        quantification_pfas_default.loc[row_index,'IPS Concentration'] = np.nan
        quantification_pfas_default.loc[row_index, 'IPS Name'] = np.nan
        quantification_pfas_default.loc[row_index, 'Recovery Rate'] = np.nan
        quantification_pfas_default.loc[row_index, 'Recovery Rate Uncertainty 1'] = np.nan
        quantification_pfas_default.loc[row_index, 'Recovery Rate Sciex'] = np.nan
    
# Round the recovery columns once, after all rows have been filled
for column in ['Recovery Rate', 'Recovery Rate Uncertainty 1', 'Recovery Rate Sciex']:
    quantification_pfas_default[column] = quantification_pfas_default[column].round(decimals=1)

# Put recovery rate in pivot table
recovery = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Recovery Rate', aggfunc='first', dropna=False,
    )
recovery = recovery[components_default]

# Write to excel file
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl', mode='a') as writer:
    recovery.to_excel(writer, sheet_name='Recovery Rate Extended')

Calculation of the sample IDA-IPS ratio and the calibration IDA-IPS ratio (referred to as response factor above) - Justin's method: \
Uses the columns "Area IDA" and "Area IPS" directly.


In [None]:
#Area IDA values for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IDA']
area_ida = quantification_blank[selected_columns_area]
area_ida = area_ida[area_ida['Component Name'].str.contains('IDA')]
area_ida.loc[:,'Sample Name Date'] = area_ida['Sample Name'].astype(str) + "_" + area_ida['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
area_ida_piv = area_ida.pivot_table(
    index=('Sample Name Date',), columns='Component Name', values='Area IDA', aggfunc='first', dropna=False,
    )
area_ida_piv

In [None]:
#Calibration Area IDA Average for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IDA']
cal_area_ida = calibration_only[selected_columns_area]
cal_area_ida = cal_area_ida[cal_area_ida['Component Name'].str.contains('IDA')]
cal_area_ida.loc[:,'Sample Name Date'] = cal_area_ida['Sample Name'].astype(str) + "_" + cal_area_ida['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
cal_area_ida_piv = cal_area_ida.pivot_table(
    index=('Sample Name Date',), columns='Component Name', values='Area IDA', aggfunc='first', dropna=False,
    )
#Establishes a row with average of each column
mean_cal = cal_area_ida_piv.mean(skipna=True)
mean_cal=pd.DataFrame(mean_cal).T
mean_cal.index=['Average']
cal_area_ida_piv = pd.concat([cal_area_ida_piv,mean_cal])

In [None]:
#IPS Area Values for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IPS']
area_ips = quantification_blank[selected_columns_area]
area_ips = area_ips[area_ips['Component Name'].str.contains('IDA')]
area_ips.loc[:,'Sample Name Date'] = area_ips['Sample Name'].astype(str) + "_" + area_ips['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
area_ips_piv = area_ips.pivot_table(
    index=('Sample Name Date',), columns='Component Name', values='Area IPS', aggfunc='first', dropna=False,
    )

In [None]:
#IPS Calibration Area and Average for IDA components
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Area IPS']
cal_area_ips = calibration_only[selected_columns_area]
cal_area_ips = cal_area_ips[cal_area_ips['Component Name'].str.contains('IDA')]
cal_area_ips.loc[:,'Sample Name Date'] = cal_area_ips['Sample Name'].astype(str) + "_" + cal_area_ips['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and area as the values
cal_area_ips_piv = cal_area_ips.pivot_table(
    index=('Sample Name Date',), columns='Component Name', values='Area IPS', aggfunc='first', dropna=False,
    )
#establishes mean for each column
mean_cal = cal_area_ips_piv.mean(skipna=True)
mean_cal=pd.DataFrame(mean_cal).T
mean_cal.index=['Average']
cal_area_ips_piv = pd.concat([cal_area_ips_piv,mean_cal])

In [None]:
#calibration IDA/IPS ratio
cal_ida_ips_ratio = cal_area_ida_piv.loc['Average']/cal_area_ips_piv.loc['Average']
#Sample IDA/IPS Ratio
sample_ida_ips_ratio = area_ida_piv/area_ips_piv

Calculate the response factor (IPS-normalized recovery) by dividing the sample IDA/IPS ratio by the average calibration IDA/IPS ratio.
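
A minimal sketch of this normalization on made-up numbers. The component names `IDA-A` and `IDA-B`, the sample labels, and all values are hypothetical. Dividing a DataFrame by a Series aligns the Series index with the DataFrame columns, so each sample ratio is divided by the calibration ratio of the same component.

```python
import pandas as pd

# Hypothetical IDA/IPS area ratios per sample (rows) and component (columns)
sample_ratio = pd.DataFrame(
    {"IDA-A": [0.50, 0.40], "IDA-B": [1.20, 0.60]},
    index=["sample_1", "sample_2"],
)
# Hypothetical average IDA/IPS area ratio of the calibration standards
cal_ratio = pd.Series({"IDA-A": 0.50, "IDA-B": 1.00})

# DataFrame / Series aligns on column labels: one calibration ratio per component
recovery = sample_ratio / cal_ratio
print(recovery)
```

A value of 1.0 means the sample behaves like the calibration standards; values far from 1.0 point to losses or enhancement for that component.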

In [None]:
# IPS normalized recoveries to calibration data
ips_norm_recovery = sample_ida_ips_ratio / cal_ida_ips_ratio

# Apply the style function to the entire DataFrame
styled_ips_norm_recovery = ips_norm_recovery.style.map(color_map)
styled_ips_norm_recovery


In [None]:
# Reset the index to turn 'Sample Name Date' into a column
calculated_recovery = ips_norm_recovery.reset_index()
test = '240821 PB_9/5/2024 10:47'

pattern = r'(.+?)_(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}$)'

# Apply the regex to extract 'Sample Name' and 'Acquisition Date & Time'
calculated_recovery[['Sample Name', 'Acquisition Date & Time']] = calculated_recovery['Sample Name Date'].str.extract(pattern)

# # Convert 'Acquisition Date & Time' to datetime format -> nice idea, should be implemented in the future, does currently not work for any type of input file
# calculated_recovery['Acquisition Date & Time'] = pd.to_datetime(calculated_recovery['Acquisition Date & Time'], format='%m/%d/%Y %H:%M:%S', errors='coerce')
# calculated_recovery['Acquisition Date & Time'] = calculated_recovery['Acquisition Date & Time'].dt.strftime('%m/%d/%Y %H:%M:%S')

# Drop the original 'Sample Name Date' if no longer needed
calculated_recovery.drop('Sample Name Date', axis=1, inplace=True)
calculated_recovery= pd.melt(calculated_recovery, 
                      id_vars=['Sample Name', 'Acquisition Date & Time'],  # Keep these as identifier variables
                      var_name='IDA Name',    # The new column for 'IDA name' will contain what were previously column names
                      value_name='Calculated Recovery')     # This column will contain the values from the old 'IDA name' columns

# Perform an inner merge to get only the rows that match on 'Sample Name', 'Acquisition Date & Time', and 'IDA Name'
matching_rows_df = pd.merge(
    quantification_pfas_default[['Sample Name', 'Acquisition Date & Time', 'IDA Name']],
    calculated_recovery[['Sample Name', 'Acquisition Date & Time', 'IDA Name']],
    on=['Sample Name', 'Acquisition Date & Time', 'IDA Name'],
    how='inner'
)

# Filter quantification_pfas_default for rows that match in all three columns
quantification_matching = quantification_pfas_default[
    quantification_pfas_default[['Sample Name', 'Acquisition Date & Time', 'IDA Name']].apply(
        tuple, axis=1
    ).isin(matching_rows_df.apply(tuple, axis=1))
]

# Filter calculated_recovery for rows that match in all three columns
calculated_recovery_matching = calculated_recovery[
    calculated_recovery[['Sample Name', 'Acquisition Date & Time', 'IDA Name']].apply(
        tuple, axis=1
    ).isin(matching_rows_df.apply(tuple, axis=1))
]
# Merge the filtered calculated_recovery back into quantification_pfas_default
quantification_pfas_default = quantification_pfas_default.merge(
    calculated_recovery_matching[['Sample Name', 'Acquisition Date & Time', 'IDA Name', 'Calculated Recovery']],
    on=['Sample Name', 'Acquisition Date & Time', 'IDA Name'],
    how='left'  # Using 'left' join to keep all rows from quantification_pfas_default
)
# Add a new column 'Poor Recovery' based on whether 'Calculated Recovery' is outside the range
# quantification_pfas_default.loc[:,'Poor Recovery'] = ~quantification_pfas_default['Calculated Recovery'].between(
#     out_range_min_val, out_range_max_val, inclusive='both',
#     )

quantification_pfas_default.loc[:,'Poor Recovery'] = ~quantification_pfas_default['Recovery Rate'].between(
    out_range_min_val * 100, out_range_max_val * 100, inclusive='both',
    )
quantification_pfas_default.loc[quantification_pfas_default['Recovery Rate'].isnull(), 'Poor Recovery'] = False

In [None]:
#Reported Recovery Pivot Table
selected_columns_area = ['Sample Name', 'Sample Index','Acquisition Date & Time','Component Name', 'Reported Recovery']
reported_recovery = quantification_blank[selected_columns_area]
reported_recovery = reported_recovery[reported_recovery['Component Name'].str.contains('IDA')]
reported_recovery.loc[:,'Sample Name Date'] = reported_recovery['Sample Name'].astype(str) + "_" + reported_recovery['Acquisition Date & Time']

# Create pivot table with Sample name as the index, component name as the column headers, and reported recovery as the values
reported_recovery_piv = reported_recovery.pivot_table(
    index=('Sample Name Date',), columns='Component Name', values='Reported Recovery', aggfunc='first', dropna=False,
    )
reported_recovery_piv=reported_recovery_piv/100

In [None]:
#Color map of Reported recoveries
styled_reported_recovery = reported_recovery_piv.style.map(color_map)  # Styler.map, consistent with the other styling cells
styled_reported_recovery

# create excel with pandas excelwriter
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl', mode='a') as writer:
    # styled_ips_norm_recovery.to_excel(writer, sheet_name = 'Calculated Recoveries')
    styled_reported_recovery.to_excel(writer, sheet_name = 'Reported Recoveries')

In [None]:
# The following cell extracts quality control standard data (QCS0) from the isotope dilution analysis (IDA)
# and appends it to an existing QCS0 file which is indicated in the second code block
# (*qcs0_filepath*).

# file path to write QCS0 data to 
# qcs0_filepath = r'example_data_processed\QCS0_area_values.csv' 

# Filter rows where 'Sample Name' contains 'QCS0'
#qcs0_samples = ida_area_piv.reset_index()
#qcs0_samples = qcs0_samples[qcs0_samples['Sample Name Date'].str.contains('QCS0')]

# Append the filtered DataFrame to an existing CSV file
#if os.path.exists(qcs0_filepath):
    # Load the existing data
   # qsc0_existing_data = pd.read_csv(qcs0_filepath)
    
    # Combine existing data with new data, avoiding duplicates
    #qsc0_combined_data = pd.concat([qsc0_existing_data, qcs0_samples]).drop_duplicates(subset=['Sample Name Date'])
    
    # Write back to the CSV without writing headers again
    #qsc0_combined_data.to_csv(qcs0_filepath, index=False)
#else:
    # If the file doesn't exist, write the data with headers
    #qcs0_samples.to_csv(qcs0_filepath, index=False)

Compute method detection limits (MDL) based on the average and standard deviation of selected samples (process blanks, etc.): MDL = mean + 3 × standard deviation. Use instrument detection limits (IDL) for the PFAS compounds not included in the process blanks. \
Introduce a new column 'Below Detection Threshold' and flag all Calculated Concentration values of the PFAS quantification that lie below the determined detection limits.
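
As a worked example of this rule, the sketch below computes the detection threshold for two compounds with plain Python. All concentrations and IDL values are invented for illustration, and `detection_threshold` is a hypothetical helper, not part of the notebook. It uses the sample standard deviation (n - 1), which matches the default of pandas `.std()`.

```python
import statistics

# Hypothetical blank concentrations per compound; PFOS has no blank data
blanks = {"PFOA": [0.10, 0.12, 0.14], "PFOS": []}
# Hypothetical instrument detection limits
idl = {"PFOA": 0.05, "PFOS": 0.08}

def detection_threshold(values, idl_value):
    """Return mean + 3 * sample std of the blanks, or the IDL as fallback."""
    if len(values) < 2:  # the sample standard deviation needs at least two blanks
        return idl_value
    return statistics.mean(values) + 3 * statistics.stdev(values)

thresholds = {name: detection_threshold(blanks[name], idl[name]) for name in blanks}
print(thresholds)
```

For PFOA the blanks have a mean of 0.12 and a standard deviation of 0.02, so the threshold is 0.12 + 3 × 0.02 = 0.18; PFOS falls back to its IDL.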

In [None]:
if blank_only is None or blank_only.empty:
    # Make empty dataframe if blanks for MDL calculations are not available
    mdl = quantification_only.groupby('Component Name', as_index=False)['Calculated Concentration'].sum()
    mdl['Calculated Concentration'] = [np.nan] * len(mdl)
    mdl['Calculated Concentration Std'] = [np.nan] * len(mdl)
    mdl['MDL'] = [np.nan] * len(mdl)
else:
    # Isolate blank data and remove IDA/IPS values as well as TOF channels
    selected_columns = ['Sample Name', 'Component Name', 'Calculated Concentration']
    blank_only_default = blank_only[selected_columns].copy()  # copy to avoid modifying a view of blank_only
    blank_only_default = blank_only_default[blank_only_default['Component Name'].isin(components_default)]

    # create data frame with average and standard deviation values for MDL calculation and calculate MDL
    mdl = blank_only_default.groupby('Component Name', as_index=False)['Calculated Concentration'].mean()
    mdl['Calculated Concentration Std'] = blank_only_default.groupby('Component Name', as_index=False)['Calculated Concentration'].std()['Calculated Concentration'].to_list()
    mdl['MDL'] = mdl['Calculated Concentration'] + 3 * mdl['Calculated Concentration Std']

# Specify column name for clarity
mdl.rename(columns={'Calculated Concentration': 'Calculated Concentration Mean'}, inplace=True)

# Load idl values from IDL_IQL file and IDL_2024 file
idl_2024 = pd.read_csv(idl_2024_filepath, index_col=0, low_memory=False, nrows=1)
idl = pd.read_csv(idl_iql_filepath, skiprows=[2], index_col=0)
idl = idl.apply(pd.to_numeric, errors='coerce')

# Use idl_2024 and append all values from idl, which are not included in 2024
for idl_column in idl.columns:
    if idl_column not in idl_2024.columns:
        idl_2024[idl_column] = idl[idl_column]

# Write each IDL value into a new column of the mdl dataframe
mdl.loc[:,'IDL'] = [np.nan] * len(mdl)
for row_index in mdl.index:
    component_name = mdl.loc[row_index, 'Component Name']
    try:
        mdl.loc[row_index, 'IDL'] = idl_2024[f'{component_name}'].to_list()[0]
    except (KeyError, IndexError):
        print(f'No IDL available for {component_name}')

# Use component name as new index of data frame.
mdl.index = mdl['Component Name']
mdl.drop(columns=['Component Name'], inplace=True)

# Use the MDL where available and fall back to the IDL otherwise
mdl['Detection Threshold'] = mdl['MDL'].fillna(mdl['IDL'])

# write detection threshold to excel file
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl', mode='a') as writer:
    mdl.to_excel(writer, sheet_name='Detection Threshold')

# include detection threshold to long format table data quantification_pfas_default
# assign right detection thresholds to right rows.
quantification_pfas_default.loc[:, 'Below Detection Threshold'] = [False] * len(quantification_pfas_default)
quantification_pfas_default.loc[:, 'Detection Threshold'] = [np.nan] * len(quantification_pfas_default)
for row_index in quantification_pfas_default.index:
    component_name = quantification_pfas_default.loc[row_index, 'Component Name']
    detection_threshold = mdl.loc[component_name, 'Detection Threshold']
    quantification_pfas_default.loc[row_index, 'Detection Threshold'] = detection_threshold
    if quantification_pfas_default.loc[row_index, 'Calculated Concentration'] < detection_threshold:
        quantification_pfas_default.loc[row_index, 'Below Detection Threshold'] = True

Evaluate the deviation between the calculated concentrations from the default channel and the _TOF MS channel for all PFAS components and all samples, expressed in percent of their mean.
Flag all calculated concentration values which deviate by more than 30 % between channels.
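
The deviation used here is the signed difference of the two channels relative to their mean, in percent: 200 · (a − b) / (a + b). The following sketch applies it to two invented concentration pairs; `channel_deviation` is a hypothetical helper and the limit of 30 is assumed for the example.

```python
def channel_deviation(default_conc: float, tof_conc: float) -> float:
    """Signed difference of two channels in percent of their mean."""
    return round(200 * (default_conc - tof_conc) / (default_conc + tof_conc), 1)

allowed_channel_deviation = 30  # percent, value assumed for this example

# Two invented (default channel, TOF channel) concentration pairs
for default_conc, tof_conc in [(1.3, 1.0), (1.0, 0.7)]:
    deviation = channel_deviation(default_conc, tof_conc)
    print(deviation, abs(deviation) > allowed_channel_deviation)
```

Both pairs differ by 0.3 in absolute terms, but only the second one exceeds the limit, because the difference is measured against the mean of the two channels.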

In [None]:
# Assign TOF channel to each PFAS and save calculated concentration to new column
quantification_pfas_default.loc[:,'Calculated Concentration TOF'] = [np.nan] * len(quantification_pfas_default)
for row_index in quantification_pfas_default.index:
    sample_index = quantification_pfas_default.loc[row_index, 'Sample Index']
    tof_channel_index = components_default.index(quantification_pfas_default.loc[row_index, 'Component Name'])
    tof_channel_name = components_tof[tof_channel_index]
    tof_channel_row = quantification_only[(
        (quantification_only['Component Name'] == tof_channel_name) &
        (quantification_only['Sample Index'] == sample_index)
    )]
    quantification_pfas_default.loc[row_index, 'Calculated Concentration TOF'] = \
        tof_channel_row.loc[:,'Calculated Concentration'].to_list()[0]

# Calculate the signed deviation between the channels in percent of their mean
quantification_pfas_default['Channel Ratio'] = (
    200 * (quantification_pfas_default['Calculated Concentration'] - quantification_pfas_default['Calculated Concentration TOF']) \
        / (quantification_pfas_default['Calculated Concentration'] + quantification_pfas_default['Calculated Concentration TOF'])
).round(decimals=1)

# Introduce new column marking every value whose channel deviation exceeds the allowed limit
quantification_pfas_default[f'Channel Ratio > {allowed_channel_deviation}'] = abs(quantification_pfas_default['Channel Ratio']) > allowed_channel_deviation

# Create pivot table
channel_ratio = quantification_pfas_default.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Channel Ratio', aggfunc='first', dropna=False,
)
channel_ratio = channel_ratio[components_default]

# color code percentage deviation
def deviation_color_map(val):
    """Return the style for a channel deviation value; missing values stay unstyled."""
    if pd.isna(val):
        return ''
    if abs(val) <= allowed_channel_deviation:
        return in_range
    return out_range

channel_ratio = channel_ratio.style.map(deviation_color_map)

# write to existing excel file
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl', mode='a') as writer:
    channel_ratio.to_excel(writer, sheet_name='Channel Ratio')

Write the final concentration table with all information to Excel. Use the concentration value, mark everything below the detection threshold with '< MDL', everything with a poor recovery with 'Poor Recovery', and everything with more than 30 % deviation between channels with '> CR'. \
In addition, concentration values are converted to ng/g or ng/l, respectively.
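
When several flags apply to the same value, the order in which they are written decides which one survives. In `flag_values` the flags are assigned in the order '> CR', 'Poor Recovery', '< MDL', so the last one wins. The sketch below spells out that precedence for a single value; `flag_value` is a hypothetical helper for illustration only.

```python
def flag_value(value, channel_ratio_exceeded, poor_recovery, below_threshold):
    """Return the value itself or the flag with the highest precedence."""
    if below_threshold:
        return "< MDL"
    if poor_recovery:
        return "Poor Recovery"
    if channel_ratio_exceeded:
        return "> CR"
    return value

print(flag_value(1.2, False, False, False))  # unflagged value is kept
print(flag_value(1.2, True, True, False))    # poor recovery overrides channel ratio
print(flag_value(1.2, True, True, True))     # detection threshold overrides both
```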

In [None]:
def flag_values(data: pd.DataFrame, column: str) -> pd.DataFrame:
    """Flags column values (most probably concentrations) with channel ratio, recovery rates and MDLs.

    :param data: data frame containing the boolean columns 'Channel Ratio > <allowed deviation>',
    'Poor Recovery', and 'Below Detection Threshold', as well as the column you indicated.
    :type data: pd.DataFrame
    :param column: Column name of data frame to be filtered or flagged.
    :type column: str
    :return: Data frame, where the column data is flagged.
    :rtype: pd.DataFrame
    """
    # Work on a copy and allow strings next to numbers in the flagged column
    final_table = data[['Sample Name', 'Component Name', column]].copy()
    final_table[column] = final_table[column].astype(object)
    # Later flags overwrite earlier ones, so '< MDL' takes precedence
    final_table.loc[data[f'Channel Ratio > {allowed_channel_deviation}'], column] = '> CR'
    final_table.loc[data['Poor Recovery'], column] = 'Poor Recovery'
    final_table.loc[data['Below Detection Threshold'], column] = '< MDL'
    return final_table

# Transform concentration to ng/g or ng/l, depending on your sample_unit
quantification_pfas_default[f'Concentration in ng per {sample_unit}'] = quantification_pfas_default['Calculated Concentration'] / nonextracted_sample_quantity

# Flag concentration values with channel ratio, recovery rates and detection threshold.
calculated_concentration = flag_values(data=quantification_pfas_default, column='Calculated Concentration')
calculated_concentration_II = flag_values(data=quantification_pfas_default, column=f'Concentration in ng per {sample_unit}')

# Pivot concentration tables.
calculated_concentration = calculated_concentration.pivot_table(
    index=('Sample Name',), columns='Component Name', values='Calculated Concentration', aggfunc='first', dropna=False,
)
calculated_concentration = calculated_concentration[components_default]

calculated_concentration_II = calculated_concentration_II.pivot_table(
    index=('Sample Name',), columns='Component Name', values=f'Concentration in ng per {sample_unit}', aggfunc='first', dropna=False,
)
calculated_concentration_II = calculated_concentration_II[components_default]

# Write pivot tables to existing excel file
with pd.ExcelWriter(processed_filepath_xlsx, engine='openpyxl', mode='a') as writer:
    calculated_concentration.to_excel(writer, sheet_name='Concentration Table')
    if sample_unit == "l":
        calculated_concentration_II.to_excel(writer, sheet_name=f'Concentration (ng per l)')
    else:
        calculated_concentration_II.to_excel(writer, sheet_name=f'Concentration (ng per g)')

# Write long format data to csv
quantification_pfas_default = quantification_pfas_default[[
    'Sample Name', 'Sample Index', 'Acquisition Date & Time','Component Name', 'Area', 'Calculated Concentration',
    f'Concentration in ng per {sample_unit}', 'IDA Name', 'IDA Area', 'IDA Concentration', 'IPS Name', 'IPS Area', 'IPS Concentration',
    'Recovery Rate', 'Recovery Rate Uncertainty 1', 'Recovery Rate Sciex', 'Detection Threshold', 'Below Detection Threshold',
    'Channel Ratio', f'Channel Ratio > {allowed_channel_deviation}','Calculated Recovery','Poor Recovery'
]]

quantification_pfas_default.to_csv(processed_filepath_csv)

# Print output
calculated_concentration

Determine linear calibration curve based on calibration data.
*to be moved somewhere else, not relevant at the moment*
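
A minimal sketch of a weighted linear calibration fit with `np.polyfit`, using invented data points that lie exactly on y = 2x. The 1/x weighting mirrors the weights used in the cell below. Note that R² is computed here as 1 − SS_res / SS_tot, which is a swapped-in formulation and not the SS_reg / SS_tot expression used in the notebook.

```python
import numpy as np

# Hypothetical calibration points: concentration ratio (x) and area ratio (y)
x = np.array([0.5, 1.0, 2.0, 4.0])
y = np.array([1.0, 2.0, 4.0, 8.0])  # exactly y = 2x, so the fit is trivial

# Weight low concentrations more strongly (1/x)
weights = 1 / x

# polyfit returns the coefficients from highest to lowest degree
slope, intercept = np.polyfit(x, y, deg=1, w=weights)

# Coefficient of determination from residual and total sums of squares
y_pred = slope * x + intercept
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.3f}")
```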

In [None]:
# Function to sanitize file names
def sanitize_filename(name):
    """Removes special characters from PFAS name to create valid directory names."""    
    return re.sub(r'[\\/*?:"<>|]', "_", name)

# extract PFAS data
selected_columns = ['Sample Name', 'Component Name', 'Actual Concentration','IS Actual Concentration','Area','IS Area', 'Used']
calibration_only = calibration_only[selected_columns]
calibration_only = calibration_only[~calibration_only['Component Name'].str.contains('IDA|IPS|13C|MS')]
calibration_only['Concentration/IS Concentration'] = calibration_only['Actual Concentration']/calibration_only['IS Actual Concentration']
calibration_only['Area/IS Area'] = calibration_only['Area']/calibration_only['IS Area']
calibration_only = calibration_only.loc[calibration_only['Used'], :]
# calibration_only.loc[calibration_only['Area/IS Area'].isnull(), :] = 0
calibration_only['Area/IS Area'] = calibration_only['Area/IS Area'].fillna(0)

# Create a list of unique component names.
components = calibration_only['Component Name'].unique()

image_paths = []
calibration_coefficients = {'slope': {}, 'intercept':{}}
# Iterate over each component and create scatter plots with regression lines
for i, component in enumerate(components):
    component_data = calibration_only[calibration_only['Component Name'] == component]
    
    # Extract x and y values
    x = component_data['Concentration/IS Concentration'].values.reshape(-1, 1).flatten()
    y = component_data['Area/IS Area'].values
    y_forced = [0 if x_elem==0 else y_elem for (x_elem, y_elem) in zip(x, y)]  # force calibration curve through zero

    # log values
    # x = [1e-3 if elem == 0 else ma.log10(elem) for elem in x]
    # y_forced = [1e-3 if elem == 0 else ma.log10(elem) for elem in y_forced]

    weights = []
    for elem in x:
        if elem == 0:
            weights.append(1e3)
        else:
            weights.append(1/elem)
    if all(elem==1e3 for elem in weights):
        print(f'All calibration data for {component} is NaN. Therefore the component is skipped.')
        continue

    # perform linear regression with numpy and weights
    numpy_model = np.polyfit(x=x, y=y_forced, deg=1, w=weights)

    slope = numpy_model[0]
    intercept = numpy_model[1]

    # fit values, and mean
    ypred = [numpy_model[0] * elem + numpy_model[1] for elem in x]                        # or [p(z) for z in x]
    ybar = np.sum(y)/len(y)          # or sum(y)/len(y)
    ssreg = np.sum((ypred-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = np.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
    r2 = ssreg / sstot
    
    # save calibration coefficient to dictionary
    calibration_coefficients['slope'][component] = slope
    calibration_coefficients['intercept'][component] = intercept

    # Regression equation
    equation = f'y = {slope:.2f}x + {intercept:.2f}'

    plt.figure(figsize = (8,6))
    # Plot with Seaborn
    sns.regplot(
       # ax=axes[i], 
        x=x, 
        y=y, 
        scatter=True, 
        fit_reg=True,
        line_kws={"color": "red"},  # Color of the regression line
        scatter_kws={"s": 50, "alpha": 0.7},  # Customize scatter points
        ci=95
    )
    plt.plot(x, ypred)
    # Set the title with the component name
    plt.title(f'{component}')
    plt.xlabel('Concentration/IS Concentration')
    plt.ylabel('Area/IS Area')
    plt.text(0.05, 0.95, f'{equation}\n$R^2$ = {r2:.2f}', 
             transform=plt.gca().transAxes, 
             fontsize=10, 
             verticalalignment='top', 
             bbox=dict(boxstyle="round,pad=0.3", edgecolor="black", facecolor="white"))
    # Sanitize the file name
    sanitized_component = sanitize_filename(component)
    image_path = os.path.join(plot_directory, f'{sanitized_component}.png')

    # Save the plot as an image before showing it; saving after plt.show() can write an empty figure
    plt.savefig(image_path)
    plt.show()
    plt.close()
    
    image_paths.append(image_path)

In [None]:
# compare concentrations from the calibration curve with the Sciex values and print deviations above 5 %
i = 0
for (_, row) in quantification_pfas_default.iterrows():
    try:
        coef_k = calibration_coefficients['slope'][row['Component Name']]
        coef_d = calibration_coefficients['intercept'][row['Component Name']]
    except KeyError:
        print(f"Calibration curve for {row['Component Name']} does not exist")
        continue
    concentration_sciex = row['Calculated Concentration']
    concentration_from_scratch = (row['Area'] / row['IDA Area'] - coef_d) * row['IDA Concentration'] / coef_k
    if pd.notna(concentration_sciex):
        percentage_deviation = round(
            200 * (concentration_sciex - concentration_from_scratch)/(concentration_sciex + concentration_from_scratch), 1
            )
        if abs(percentage_deviation) > 5:
            print(i, row['Component Name'], percentage_deviation)
            i+=1