name: link_data \
date: 12/12/2024 \
version: 1.0 \
author: Juliana Agudelo, Johanna Ganglbauer \

**description**: Links LCMS data with general sample description, plots and evaluates correlations.

**changes in comparison to previous version**:

**what needs to be implemented**:


In [None]:
# load packages (they must already be installed)
library('ggplot2')
library('corrplot')
library('xlsx')
library('dplyr')
library('glue')

In [2]:
# indicate filepath where general sample description is saved to.
linking_data_file = 'example_data_raw/linking_data_input/linking_data.csv'

# specific case
proteomics_results_file = 'example_data_raw/linking_data_input/TPA-calc_23-09-05-10-16_20230905_VICTERProject_FoldChange.xlsx'
proteomics_results_sheet_name = 'Total Proteins'
lcms_results_file = 'example_data_raw/linking_data_input/202303_and_202306_milk_sample_lcms.xlsx'
lcms_results_sheet_name = '06_23 results'

# general case
results_folder = 'example_data_processed/linking_data_results/'

# thresholds
sigma = 0.5  # threshold for correlation value to be shown in correlation plot (correlations below sigma will not be shown)
cols_per_plot <- 50  # number of proteins shown on each correlation plot
nan_to_zero <- FALSE  # if TRUE, empty cells in proteomics and <MDL values in LCMS are treated as zero

The code below reads the LCMS result files produced by the LCMS data analysis script, selects the samples of interest (the relevant samples are listed in linking_data_file) and collects them in a data frame. The general LCMS import is currently commented out.
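
The general case expects a long table with one row per sample and component. As a rough illustration of how such a table could be turned into one column per component, the sketch below uses base R's `reshape()` on invented values. The column names are taken from the commented-out code in this notebook; the data frame `toy_long` and all numbers are hypothetical.

```r
# invented long-format LCMS results (one row per sample and component)
toy_long <- data.frame(
    Sample.Name = c('S1', 'S1', 'S2', 'S2'),
    Component.Name = c('PFOA', 'PFOS', 'PFOA', 'PFOS'),
    Calculated.Concentration = c(0.4, 1.2, 0.9, 0.1),
    Below.Detection.Threshold = c(FALSE, FALSE, FALSE, TRUE),
    stringsAsFactors = FALSE
)

# concentrations below the detection threshold are treated as missing
toy_long$Calculated.Concentration[toy_long$Below.Detection.Threshold] <- NA

# one row per sample, one concentration column per component
toy_wide <- reshape(
    toy_long[, c('Sample.Name', 'Component.Name', 'Calculated.Concentration')],
    idvar = 'Sample.Name', timevar = 'Component.Name', direction = 'wide'
)
print(toy_wide)
```

The wide table could then be joined to the linking data on the sample name.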

In [3]:
# read in linking data, LCMS data and put everything together in a data frame
link_data <- data.frame(read.csv(linking_data_file, na.strings=''))

# # get all results in lcms result folder and put them together in a dataframe
# # current status: 1 file: todo -> make list!
# list_csv_files = list.files(path = lcms_results_folder, pattern = '*.csv')
# lcms_data <- read.csv(file.path(lcms_results_folder, list_csv_files[1]), na.strings='', stringsAsFactors=FALSE)
# lcms_data <- data.frame(lcms_data)
# for (column in c('Below.Detection.Threshold')){
#     lcms_data[[column]] = as.logical(lcms_data[[column]])
#     lcms_data[is.na(lcms_data[[column]]), column] <- FALSE
# }

# lcms_data$Calculated.Concentration[lcms_data$Below.Detection.Threshold] <- NA

# for (pfas_component in lcms_data$Component.Name){
#     link_data[, pfas_component] <- NA
#     relevant_concentration = lcms_data[lcms_data$Component.Name == pfas_component,]
#     for (row in 1:nrow(link_data)){
#         concentration = relevant_concentration$Calculated.Concentration[relevant_concentration$Sample.Name == link_data[row, 'sample_name_in_lcms']]
#         link_data[row, pfas_component] = concentration
#     }
# }

Very specific case: special input file for LCMS (not generalized), relevant for one concrete example. \
Read in data and replace '<MDL' with NA \
Append lcms data to original 'linking_data' data frame
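
The lookup by sample name can also be written with base R's `match()`. The sketch below illustrates the '<MDL' replacement and the join on invented data; `toy_link`, `toy_lcms` and the component column `PFOA` are hypothetical stand-ins, not values from the input files, and this is not the code used in the cell below.

```r
# toy stand-ins for link_data and lcms_data (all values are invented)
toy_link <- data.frame(
    sample_name_in_lcms = c('S1', 'S2', 'S3'),
    stringsAsFactors = FALSE
)
toy_lcms <- data.frame(
    ng.g = c('S3', 'S1', 'S2'),
    PFOA = c('0.8', '<MDL', '1.5'),
    stringsAsFactors = FALSE
)

# values below the method detection limit become NA
toy_lcms[toy_lcms == '<MDL'] <- NA

# position of each linking-table sample within the LCMS table (NA if absent)
idx <- match(toy_link$sample_name_in_lcms, toy_lcms$ng.g)

# copy every component column except the sample-name column
for (component in setdiff(colnames(toy_lcms), 'ng.g')){
    toy_link[[component]] <- as.numeric(toy_lcms[[component]][idx])
}
print(toy_link)
```

With this approach a sample that has no match in the LCMS table ends up with NA in the component columns.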

In [4]:
lcms_data <- read.xlsx(
    lcms_results_file, lcms_results_sheet_name, rowIndex = 18:21, colIndex = 1:14, stringsAsFactors = FALSE
    )
lcms_data <- lcms_data %>% mutate_all(~ifelse(. == "<MDL", NA, .))

for (pfas_component in colnames(lcms_data)){
    if (pfas_component == 'ng.g') {
        next
        }
    for (row in 1:nrow(link_data)){
        concentration = as.numeric(as.character(lcms_data[[pfas_component]][lcms_data$ng.g == link_data[row, 'sample_name_in_lcms']]))
        link_data[row, pfas_component] = concentration
    }
}

Input file for proteomics.
Append proteomics data to original 'linking_data' data frame.

Shorten descriptions that are longer than 100 characters, because long descriptions do not work with correlation plots.

If `nan_to_zero` is TRUE, replace all missing values (NA) with zeros.
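
Both clean-up steps can be tried on a small example first. The sketch below uses an invented data frame `toy` and a hypothetical variable `max_chars`; the limit of 100 matches the value used in the code of this notebook.

```r
# toy data frame with one very long column name (invented for illustration)
long_name <- paste(rep('a', 120), collapse = '')
toy <- data.frame(x = c(1, NA, 3), y = c(NA, 2, 4))
colnames(toy)[2] <- long_name

# cut every column name at max_chars characters (shorter names are unchanged)
max_chars <- 100
colnames(toy) <- substr(colnames(toy), start = 1, stop = max_chars)

# replace missing values with zero
toy[is.na(toy)] <- 0

print(nchar(colnames(toy)))
print(toy$x)
```

Note that truncation can produce duplicate column names when two descriptions share their first 100 characters; `make.unique()` would be one way to disambiguate them.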

In [5]:
proteomics_data <- read.xlsx(
    proteomics_results_file, proteomics_results_sheet_name, stringsAsFactors = FALSE
    )

for (protein in proteomics_data$PG.ProteinDescriptions){
    protein_concentrations <- proteomics_data[proteomics_data$PG.ProteinDescriptions == protein, ]
    for (row in 1:nrow(link_data)){
        sample_name_in_proteomics <- make.names(link_data[row, 'sample_name_in_proteomics'])
        link_data[row, protein] <- protein_concentrations[[sample_name_in_proteomics]]
    }
}

for (column_index in seq_along(colnames(link_data))){
    column_name = colnames(link_data)[column_index]
    if (nchar(column_name) > 100){
        colnames(link_data)[column_index] <- substr(column_name, start=1, stop=100)

    }
}

if (nan_to_zero){
    link_data <- link_data %>% mutate_all(~ifelse(is.na(.), 0, .))
}

Functions to compute the correlations and filter them. \
`corr_simple` keeps only correlations that are not NaN, not exactly 1 and whose magnitude is greater than sigma. \
`plot_corr` plots the filtered correlation matrix; `plot_single_corrs` draws a scatter plot with a linear fit for each remaining PFAS-protein pair.
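
To see what the filter does, the sketch below applies the same rule to a small invented data set. It is a simplified variant: it masks the rejected entries of the correlation matrix directly instead of reshaping to a table and back. The data frames `x` and `y` and the variable `threshold` are hypothetical.

```r
# toy data: 5 samples, 2 "PFAS" columns and 3 "protein" columns (values invented)
x <- data.frame(pfas_a = c(1, 2, 3, 4, 5), pfas_b = c(2, 1, 4, 3, NA))
y <- data.frame(
    prot_1 = c(2, 4, 6, 8, 10),
    prot_2 = c(5, 3, 4, 1, 2),
    prot_3 = c(1, 1, 2, 1, 2)
)
threshold <- 0.5

# each pair uses only the rows where both columns are present
corr <- cor(x, y, use = 'pairwise.complete.obs')

# mask perfect correlations and everything at or below the threshold
corr[corr == 1 | abs(corr) <= threshold] <- NA
print(round(corr, 2))
```

The threshold is applied to the magnitude of the correlation coefficient only; it is not a test of statistical significance.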

In [6]:
corr_simple <- function(xdata, ydata, sig){
  # compute pairwise correlations; entries with magnitude <= sig are dropped below
  corr <- cor(xdata, ydata, use='pairwise.complete.obs')
  # drop perfect correlations
  corr[corr == 1] <- NA
  # turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  # remove the NA values from above 
  corr <- na.omit(corr) 
  # keep only correlations whose magnitude exceeds the threshold
  corr <- subset(corr, abs(Freq) > sig) 
  # sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)),] 
  # turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  # return correlation matrix
  return(mtx_corr)
  }

plot_corr <-function(mtx_corr, results_folder, n){
  # plot correlations visually
  png(glue('{results_folder}correlation_plot_{n}.png'))
  corrplot(mtx_corr, is.corr=FALSE, tl.cex=0.5, tl.col="black", na.label=" ")
  dev.off()
}

plot_single_corrs <-function(mtx_corr, link_data, results_folder, sigma){
  for (pfas in rownames(mtx_corr)){
    xdata = link_data[[pfas]]
    i <- 0
    for (protein in colnames(mtx_corr)){
      correlation <- mtx_corr[pfas, protein]
      if (is.na(correlation)){
        next
      } else{
        if (abs(correlation) > sigma){
          ydata = link_data[[protein]]
          png(glue('{results_folder}correlation_{pfas}_{i}.png'))
          gg <- ggplot(data=link_data, aes(x=xdata, y=ydata)) + geom_point() + geom_smooth(method='lm', se=TRUE, linetype='dashed') + 
          labs(x=glue('{pfas} [ng/g]'), y=glue('{protein} [?]')) + theme_minimal()
          print(gg)
          dev.off()
          i <- i + 1
        }
      }
    }
  }
}

Split the correlation matrix (PFAS vs. proteomics) across several plots, because a single plot would be too large to interpret.
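
One way to split column indices into consecutive chunks is shown below. This is only a sketch: `n_cols` and `cols_per_chunk` are hypothetical values chosen for illustration.

```r
# hypothetical sizes, chosen only for illustration
n_cols <- 120
cols_per_chunk <- 50

# assign each column index to a chunk number and group the indices by it
chunk_id <- ceiling(seq_len(n_cols) / cols_per_chunk)
chunks <- split(seq_len(n_cols), chunk_id)

# every column appears in exactly one chunk; the last chunk may be shorter
for (n in seq_along(chunks)){
    cat(sprintf('plot %d: columns %d to %d\n', n, min(chunks[[n]]), max(chunks[[n]])))
}
```

Each element of `chunks` can then be used to select the columns of the correlation matrix for one plot.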

In [None]:
mtx_corr = corr_simple(
  xdata=link_data[c(4:16)], ydata=link_data[c(seq(from=17, to=ncol(link_data), by=1))], sig=sigma
  )
  
# number of plots needed so that every column is shown exactly once
n_plots <- ceiling(ncol(mtx_corr) / cols_per_plot)
for (n in seq_len(n_plots)){
  # column range of this plot; the last plot may hold fewer than cols_per_plot columns
  cols <- seq((n - 1) * cols_per_plot + 1, min(n * cols_per_plot, ncol(mtx_corr)))
  plot_corr(mtx_corr=mtx_corr[, cols, drop=FALSE], results_folder=results_folder, n=n)
}

plot_single_corrs(mtx_corr=mtx_corr, link_data=link_data, results_folder=results_folder, sigma=sigma)


In [8]:
# # evaluate correlations and plot correlation matrix for some components
# correlation_matrix <- cor(link_data[c(4:ncol(link_data))], use='pairwise.complete.obs')
# print('corr is not the problem')

# #corrplot(correlation_matrix, type = "upper", order = "hclust", tl.cex=0.05, na.label = NA)
# corrplot(correlation_matrix, tl.cex=0.05)
#         #  tl.col = "black", tl.srt = 45,  na.label = "NA")

for (pfas in colnames(link_data)[4:16]){
    xdata = link_data[[pfas]]
    for (protein in colnames(link_data)[17:ncol(link_data)]){
        ydata = link_data[[protein]]
        gg <- ggplot(data=link_data, aes(x=xdata, y=ydata)) + geom_point() + geom_smooth(method='lm', se=TRUE, linetype='dashed') +
        labs(x=glue('{pfas} [ng/g]'), y=glue('{protein} [?]')) + theme_minimal()
        # plots created inside a loop are only shown when printed explicitly
        print(gg)
    }
}