# How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?

This notebook uses the existing NMDC-runtime API endpoints (as of May 2024) to explore how the taxonomic distribution of metagenome contigs differs between the mineral and organic soil layers in Colorado. It involves 9 API requests to reach the scaffold lineage TSV data objects needed to analyze the taxonomic distribution. Iterating through the TSV files requires 100+ API calls to get the necessary taxonomic counts and is time-consuming.

In [2]:
# Load essential libraries
library(jsonlite, warn.conflicts=FALSE)
library(dplyr, warn.conflicts=FALSE)
library(tidyr, warn.conflicts=FALSE)
library(readr, warn.conflicts=FALSE)
library(ggplot2, warn.conflicts=FALSE)

## Define a general API call function to nmdc-runtime

This function provides a general-purpose way to make an API request to NMDC's runtime API. Note that it returns only the first page of results. Its inputs are the name of the collection to access (e.g. `biosample_set`), the filter to apply, the maximum page size, and a comma-separated string of the fields to retrieve. It returns the parsed JSON response as a list: the records are in its `resources` element as a data frame, and a `next_page_token` element is present when more pages are available.

In [3]:
get_first_page_results <- function(collection, filter, max_page_size, fields) {
  og_url <- paste0(
      'https://api.microbiomedata.org/nmdcschema/', 
      collection, '?&filter=', filter, '&max_page_size=', max_page_size, '&projection=', fields
      )
  
  response <- jsonlite::fromJSON(URLencode(og_url, repeated = TRUE))
  
  return(response)
}
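To make the request format concrete, the sketch below builds and encodes a request URL the same way the function does, without sending it. The filter and field values are shortened illustrations, not the ones used later in the notebook.

```r
# Build (but do not send) a request URL for a biosample query.
base_url <- 'https://api.microbiomedata.org/nmdcschema/'
collection <- 'biosample_set'
filter <- '{"soil_horizon":{"$exists": true}}'  # illustrative filter
fields <- 'id,soil_horizon'                     # illustrative projection

raw_url <- paste0(base_url, collection, '?&filter=', filter,
                  '&max_page_size=', 10, '&projection=', fields)

# URLencode() percent-encodes characters that are not valid in a URL,
# such as the braces, quotes, and spaces in the JSON filter.
encoded_url <- URLencode(raw_url, repeated = TRUE)
print(encoded_url)
```

Printing the encoded URL before calling `jsonlite::fromJSON()` is a quick way to check a filter when a request fails.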

## Define an nmdc-runtime API call function to include pagination

The `get_next_results` function uses the `get_first_page_results` function, defined above, to retrieve the rest of the results from a call with multiple pages. It takes the same inputs as `get_first_page_results`: the name of the collection to be retrieved, the filter string, the maximum page size, and a comma-separated string of the fields to be returned. It uses the `next_page_token` key in each page of results to retrieve the following page, and returns all results as a single data frame (which may contain nested columns).

In [4]:
get_next_results <- function(collection, filter_text, max_page_size, fields) {
  initial_data <- get_first_page_results(collection, filter_text, max_page_size, fields)
  results_df <- initial_data$resources
  
  if (!is.null(initial_data$next_page_token)) {
    next_page_token <- initial_data$next_page_token
    
    while (TRUE) {
      url <- paste0('https://api.microbiomedata.org/nmdcschema/', collection, '?&filter=', filter_text, '&max_page_size=', max_page_size, '&page_token=', next_page_token, '&projection=', fields)
      response <- jsonlite::fromJSON(URLencode(url, repeated = TRUE))

      results_df <- results_df %>% bind_rows(response$resources)
      next_page_token <- response$next_page_token
      
      if (is.null(next_page_token)) {
        break
      }
    }
  }
  
  return(results_df)
}
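The token-following loop can be tried offline. The sketch below replaces the API with a hypothetical in-memory set of pages (`mock_pages` and `fetch_mock_page` are made up for illustration) and keeps requesting pages until a response carries no `next_page_token`.

```r
# Hypothetical stand-in for the API: three "pages" of records held in memory.
mock_pages <- list(
  start = list(resources = data.frame(id = c('a', 'b')), next_page_token = 'p2'),
  p2    = list(resources = data.frame(id = c('c', 'd')), next_page_token = 'p3'),
  p3    = list(resources = data.frame(id = 'e'))  # last page: no token
)

# Hypothetical fetcher: returns the page stored under the given token.
fetch_mock_page <- function(token = 'start') mock_pages[[token]]

# Start with the first page, then follow tokens until none is returned.
page <- fetch_mock_page()
all_records <- page$resources
while (!is.null(page$next_page_token)) {
  page <- fetch_mock_page(page$next_page_token)
  all_records <- rbind(all_records, page$resources)
}
print(all_records)
```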

# 1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name

The first step in answering how the taxonomic distribution of contigs differs by soil layer is to get a list of all the biosamples that have metadata for `soil_horizon` and a `geo_loc_name` containing the string "Colorado". We use the `get_next_results` function (defined above) to do this. We query the `biosample_set` collection with a mongo-like filter of `{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}`, a maximum page size of 100, and three fields to return: `id`, `soil_horizon`, and `geo_loc_name`. Note that `id` is always returned. Since we will be joining the results of multiple API requests that each have an `id` field for a different collection, we rename the `id` column to be more explicit, calling it `biosample_id` instead.

In [5]:
# Get biosamples using get_next_results function
biosample_df <- get_next_results(
    collection = 'biosample_set', 
    filter_text = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}', 
    max_page_size = 100, 
    fields = 'id,soil_horizon,geo_loc_name'
    )

# Clarify the column names
biosample_df <- biosample_df %>%
    unnest(
        cols = c(
            geo_loc_name
        ), names_sep = "_") %>% 
    rename(biosample_id = id,
           geo_loc_name = geo_loc_name_has_raw_value)
head(biosample_df)

“cannot open URL 'https://api.microbiomedata.org/nmdcschema/biosample_set?&filter=%7B%22soil_horizon%22:%7B%22$exists%22:%20true%7D,%20%22geo_loc_name.has_raw_value%22:%20%7B%22$regex%22:%20%22Colorado%22%7D%7D&max_page_size=100&projection=id,soil_horizon,geo_loc_name': HTTP status was '503 Service Unavailable'”


ERROR: Error in open.connection(con, "rb"): cannot open the connection to 'https://api.microbiomedata.org/nmdcschema/biosample_set?&filter=%7B%22soil_horizon%22:%7B%22$exists%22:%20true%7D,%20%22geo_loc_name.has_raw_value%22:%20%7B%22$regex%22:%20%22Colorado%22%7D%7D&max_page_size=100&projection=id,soil_horizon,geo_loc_name'


## Define an API request function that uses a list of ids to filter on

This function constructs a different type of API request that filters on a list of ids (e.g. the `biosample` ids retrieved above). Its inputs are the name of the new collection to query (`collection`), the name of the field in that collection to match the ids against (`match_id_field`, e.g. `id` or `has_input`), the list of ids (`id_list`), the fields to return (`fields`), and the maximum number of ids to include in a single request (`max_id`). Long id lists are split into chunks of `max_id`, and the results of all chunks are combined into one data frame.

In [6]:
get_results_by_id <- function(collection, match_id_field, id_list, fields, max_id = 50) {
    # collection: the name of the collection to query
    # match_id_field: the field in the new collection to match to the id_list
    # id_list: a list of ids to filter on
    # fields: a list of fields to return
    # max_id: the maximum number of ids to include in a single query
    
    # If id_list is longer than max_id, split it into chunks of max_id
    if (length(id_list) > max_id) {
        id_list <- split(id_list, ceiling(seq_along(id_list)/max_id))
    } else {
        id_list <- list(id_list)
    }
    
    output <- list()
    for (i in seq_along(id_list)) {
        # Cast as a character vector and add double quotes around each ID
        mongo_id_string <- as.character(id_list[[i]]) %>%
            paste0('"', ., '"') %>%
            paste(collapse = ', ')
        
        # Create the filter string
        filter_text <- paste0('{"', match_id_field, '": {"$in": [', mongo_id_string, ']}}')
        
        # Get the data
        output[[i]] <- get_next_results(
            collection = collection,
            filter_text = filter_text,
            max_page_size = max_id*3, # assumes that there are no more than 3 records per id
            fields = fields
        )
    }
    
    # Combine the per-chunk results and return them explicitly
    return(bind_rows(output))
}
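As a small offline illustration of the chunking and filter construction, the sketch below uses made-up ids and a deliberately small `max_id` so the chunks are easy to inspect. It mirrors the steps in the function but does not call the API.

```r
ids <- paste0('nmdc:example-', 1:5)  # hypothetical ids, for illustration only
max_id <- 2

# Split the ids into consecutive chunks of at most max_id elements.
chunks <- split(ids, ceiling(seq_along(ids) / max_id))

# Build one mongo-like "$in" filter string per chunk.
filters <- vapply(chunks, function(chunk) {
  quoted <- paste0('"', chunk, '"', collapse = ', ')
  paste0('{"has_input": {"$in": [', quoted, ']}}')
}, character(1))

print(unname(filters))
```

Keeping each chunk small keeps the encoded URL short, which is presumably why the later cells pass `max_id = 20`.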

# 2. Get all Pooling results where the Pooling `has_input` are the biosample ids

We use the `get_results_by_id` function above to get all pooling results whose `has_input` field contains the `biosample_id`s we retrieved in step 1. The pooling results are then unnested into a flat data frame, and the columns are renamed so it is clear which collection the results are from.

In [7]:
pooling_df <- get_results_by_id(
    collection = 'pooling_set',
    match_id_field = 'has_input',
    id_list = biosample_df$biosample_id,
    fields = 'id,has_input,has_output',
    max_id = 20
)

# Unnest the has_input and has_output columns, get unique results, and rename the columns.
pooling_df2 <- pooling_df %>%
    unnest(
        cols = c(
            has_input,
            has_output
        ), names_sep = "_") %>%
    distinct() %>%
    rename(pooling_id = id,
           biosample_id = has_input,
           pooling_has_output = has_output)
head(pooling_df2)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df' not found
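Because the cell above could not run, here is a toy version of the unnesting step. It uses made-up ids and assumes the parsed JSON yields list-columns in which each pooling record has several inputs and a single output.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

# Toy pooling records with hypothetical ids: each row has a list of inputs
# and a single output, mimicking the nested shape of the parsed JSON.
toy_pooling <- tibble(
  id = c('pool-1', 'pool-2'),
  has_input = list(c('bsm-1', 'bsm-2'), 'bsm-3'),
  has_output = list('procsm-1', 'procsm-2')
)

# Unnest both list-columns together: the single output is repeated for
# every input of the same pooling record.
toy_flat <- toy_pooling %>%
  unnest(cols = c(has_input, has_output)) %>%
  distinct() %>%
  rename(pooling_id = id,
         biosample_id = has_input,
         pooling_has_output = has_output)
print(toy_flat)
```

The result has one row per biosample, so it can be joined back to the biosample table on `biosample_id`.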


Merge the biosample and pooling dataframes together to get a dataframe with biosample and pooling data.

In [9]:
biosample_df2 <- left_join(biosample_df, pooling_df2, by = 'biosample_id')
head(biosample_df2)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df' not found


# 3. Get processed samples where the processed sample `id`s are the `pooling_has_output` field

We use the `get_results_by_id` function again, this time to get the processed samples whose `id` matches the `pooling_has_output` identifiers. We return only the `id` field and rename it to `processed_sample_id` so it is clear that the identifiers come from the `processed_sample_set`.

In [10]:
process_set1_df <- get_results_by_id(
    collection = 'processed_sample_set',
    match_id_field = 'id',
    id_list = unique(pooling_df2$pooling_has_output),
    fields = 'id',
    max_id = 20
)

process_set1_df <- process_set1_df %>%
    rename(processed_sample_id = id)
head(process_set1_df)

ERROR: Error in eval(expr, envir, enclos): object 'pooling_df2' not found


Merge the processed sample data with the biosample and pooling data, where the processed sample id is the same as the pooling_has_output


In [11]:
biosample_df3 <- biosample_df2 %>%
    rename(processed_sample_id = pooling_has_output) %>%
    left_join(process_set1_df, by = join_by(processed_sample_id))
head(biosample_df3)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df2' not found


# 4. Get extraction results where the `processed_sample_id` identifier is the `has_input` to the `extraction_set`

The `get_results_by_id` function is used again (you can see the pattern), this time to query the `extraction_set` using the `processed_sample_id` identifiers as the `has_input`. The resulting data frame is unnested and the names are adjusted to make it clear which set the inputs, outputs, and ids are from.

In [None]:
extraction_df <- get_results_by_id(
    collection = 'extraction_set',
    match_id_field = 'has_input',
    id_list = unique(biosample_df3$processed_sample_id),
    fields = 'id,has_input,has_output',
    max_id = 20
)

extraction_df <- extraction_df %>%
    unnest(
        cols = c(
            has_input,
            has_output
        ), names_sep = "_") %>%
    distinct() %>%
    rename(extraction_id = id,
           processed_sample_id = has_input,
           extraction_has_output = has_output)
head(extraction_df)
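The following cells refer to `biosample_df4`, which is not created in any cell shown here. By analogy with the other steps, it is presumably the result of joining the extraction results onto `biosample_df3` by `processed_sample_id`. The sketch below shows that join on toy data with made-up ids. The commented line at the end is an assumption about what the notebook cell would look like, not the author's confirmed code.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-ins (hypothetical ids) for biosample_df3 and extraction_df.
toy_df3 <- tibble(
  biosample_id = c('bsm-1', 'bsm-2'),
  processed_sample_id = c('procsm-1', 'procsm-1')
)
toy_extraction <- tibble(
  extraction_id = 'extr-1',
  processed_sample_id = 'procsm-1',
  extraction_has_output = 'procsm-9'
)

# Attach the extraction columns to every row sharing the processed sample id.
toy_df4 <- toy_df3 %>%
  left_join(toy_extraction, by = join_by(processed_sample_id))
print(toy_df4)

# Assumed notebook equivalent (not run here):
# biosample_df4 <- biosample_df3 %>%
#     left_join(extraction_df, by = join_by(processed_sample_id))
```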

# 5. Get processed sample results from the output of the extraction results

We query the `processed_sample_set` again, this time using the `extraction_has_output` ids. We only need to return the `processed_sample_set` identifiers.

In [12]:
process_set2_df <- get_results_by_id(
    collection = 'processed_sample_set',
    match_id_field = 'id',
    id_list = unique(biosample_df4$extraction_has_output),
    fields = 'id',
    max_id = 20
)

process_set2_df <- process_set2_df %>%
    rename(processed_sample_id2 = id)
head(process_set2_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df4' not found


Merge the processed sample data with the biosample, pooling, processed sample, and extraction data

In [14]:
biosample_df5 <- biosample_df4 %>%
    rename(processed_sample_id2 = extraction_has_output) %>%
    left_join(process_set2_df, by = join_by(processed_sample_id2))
head(biosample_df5)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df4' not found


# 6. Get the `library_preparation_set`

Using the `processed_sample_id2` identifiers from the last query as the `has_input` for the `library_preparation_set`, we get a new batch of results, returning the library preparation identifiers, inputs, and outputs. The results are unnested and the names are clarified to show they are from the `library_preparation_set`.

In [15]:
library_prep_df <- get_results_by_id(
    collection = 'library_preparation_set',
    match_id_field = 'has_input',
    id_list = unique(biosample_df5$processed_sample_id2),
    fields = 'id,has_input,has_output',
    max_id = 20
)

library_prep_df <- library_prep_df %>%
    unnest(
        cols = c(
            has_input,
            has_output
        ), names_sep = "_") %>%
    distinct() %>%
    rename(library_preparation_id = id,
           processed_sample_id2 = has_input,
           library_preparation_has_output = has_output)
head(library_prep_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df5' not found


Merge the library preparation data with the biosample, pooling, processed sample, extraction, and processed sample data

In [16]:
biosample_df6 <- biosample_df5 %>%
    left_join(library_prep_df, by = join_by(processed_sample_id2))
head(biosample_df6)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df5' not found


# 7. Get third set of processed samples from the library preparation output

For a third and last time, we query the `processed_sample_set` identifier field, using the `library_preparation_has_output` identifiers. We only return the `id` field (as `processed_sample_id3`).

In [17]:
process_set3_df <- get_results_by_id(
    collection = 'processed_sample_set',
    match_id_field = 'id',
    id_list = unique(biosample_df6$library_preparation_has_output),
    fields = 'id',
    max_id = 20
)

process_set3_df <- process_set3_df %>%
    rename(processed_sample_id3 = id)
head(process_set3_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df6' not found


Merge the processed sample data with the biosample, pooling, processed sample, extraction, processed sample, and library preparation data

In [18]:
biosample_df7 <- biosample_df6 %>%
    rename(processed_sample_id3 = library_preparation_has_output) %>%
    left_join(process_set3_df, by = join_by(processed_sample_id3))
head(biosample_df7)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df6' not found


# 8. Get omics_processing results from the processed sample identifiers

Using the third batch of processed sample identifiers, we query the `omics_processing_set` on the `has_input` field. The `id` and `has_input` field names are changed to specify that they came from the `omics_processing_set`.

In [19]:
omics_processing_df <- get_results_by_id(
    collection = 'omics_processing_set',
    match_id_field = 'has_input',
    id_list = unique(biosample_df7$processed_sample_id3),
    fields = 'id,has_input',
    max_id = 20
)

omics_processing_df <- omics_processing_df %>%
    unnest(
        cols = c(
            has_input
        ), names_sep = "_") %>%
    rename(omics_processing_id = id,
           processed_sample_id3 = has_input)
head(omics_processing_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df7' not found


Merge the omics processing data with the biosample, pooling, processed sample, extraction, processed sample, library preparation, and processed sample data

In [20]:
biosample_df8 <- biosample_df7 %>%
    left_join(omics_processing_df, by = join_by(processed_sample_id3))
head(biosample_df8)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df7' not found


# 9. Get the metagenome_annotation_activity_set using the omics processing identifiers

The `metagenome_annotation_activity_set` is queried using the identifiers obtained from the omics processing to match with the `was_informed_by` field in the `metagenome_annotation_activity_set`. Field names are clarified, once again to specify the collection they came from.

In [21]:
metagenome_annotation_df <- get_results_by_id(
    collection = 'metagenome_annotation_activity_set',
    match_id_field = 'was_informed_by',
    id_list = unique(biosample_df8$omics_processing_id),
    fields = 'id,was_informed_by,has_output',
    max_id = 20
)

metagenome_annotation_df <- metagenome_annotation_df %>%
    unnest(
        cols = c(
            was_informed_by,
            has_output
        ), names_sep = "_") %>%
    rename(metagenome_annotation_id = id,
           omics_processing_id = was_informed_by,
           matagenome_annotation_has_output = has_output)
head(metagenome_annotation_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df8' not found


Merge the metagenome annotation data with the biosample, pooling, processed sample, extraction, processed sample, library preparation, processed sample, omics processing, and processed sample data

In [22]:
biosample_df9 <- biosample_df8 %>%
    left_join(metagenome_annotation_df, by = join_by(omics_processing_id))
head(biosample_df9)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df8' not found


# 10. Get data objects from the metagenome activity result outputs

We query the `data_object_set` using the `matagenome_annotation_has_output` identifiers to match the `id` field in the data objects. We then filter the results to keep only those with a `data_object_type` of `Scaffold Lineage tsv` (since this has contig taxonomy results). Note that `url` is a newly returned field that points to the TSV files we will need for the final analysis.

In [23]:
data_object_df <- get_results_by_id(
    collection = 'data_object_set',
    match_id_field = 'id',
    id_list = unique(biosample_df9$matagenome_annotation_has_output),
    fields = 'id,data_object_type,url',
    max_id = 50
)

# Filter the data object results to only include the Scaffold Lineage tsv files
data_object_df <- data_object_df %>%
    rename(data_object_id = id) %>%
    filter(data_object_type == 'Scaffold Lineage tsv')
head(data_object_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df9' not found


Merge the data object data with the biosample, pooling, processed sample, extraction, processed sample, library preparation, processed sample, omics processing, processed sample, and metagenome annotation data

In [24]:
biosample_df10 <- biosample_df9 %>%
    rename(data_object_id = matagenome_annotation_has_output) %>%
    left_join(data_object_df, by = join_by(data_object_id))
head(biosample_df10)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df9' not found


## Clean up the combined results

In the final step of retrieving and cleaning the data, we clean up the final merged data frame by removing the "joining columns" that are not needed in our final analysis. We retain `biosample_id`, `soil_horizon`, `geo_loc_name`, `data_object_id`, `data_object_type`, and the `url` to the TSV, keep only distinct rows, and drop rows without a `url`. Because some of the biosamples were pooled, the same `url` can appear for more than one `biosample_id`, so later steps work from the unique URLs. The first rows of `biosample_df_final` are displayed.

In [26]:
biosample_df_final <- biosample_df10 %>%
    select(biosample_id, soil_horizon, geo_loc_name, data_object_id, data_object_type, url) %>%
    distinct() %>%
    filter(!is.na(url))
head(biosample_df_final)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df10' not found


## Show how many results have M horizon vs. O horizon

The `soil_horizon` column can be counted using the `count()` functionality. There are many more M horizon samples than O horizon.

In [27]:
biosample_df_final %>%
    count(soil_horizon)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df_final' not found


## Example of what the TSV contig taxa file looks like

A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for the taxa, where each row is `1.0`. However, the same taxa appears in many rows, so the true count for several taxa is greater than `1.0` (they appear as repeated rows, each with a count of `1.0`). We take this into consideration when we calculate the relative abundance for each taxa.

In [28]:
url <- biosample_df_final$url[1]

# Read the TSV file
contig_taxa_df <- read_tsv(url, col_names = FALSE)

# Add column names 
colnames(contig_taxa_df) <- c('contig_id', 'taxa', 'initial_count')

# Show the first few rows
head(contig_taxa_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df_final' not found


## Iterate through the TSVs to get the contig taxa information

Using readr's `read_tsv` function, we iterate over the TSV URLs and gather the taxa information. Each TSV is read into a data frame and reshaped into the structure we need. The columns are given names and the taxa column is split into a proper list (instead of a string of items separated by a semicolon `;`). The third element of the list is retrieved to get only the phylum-level information for the taxa (or, when that is missing, "Unknown" followed by the highest available taxon). The data are then grouped by the `taxa` column and the number of rows per taxa is counted, which is used to calculate the relative abundance of each taxa for each sample.

Any errors in requesting the TSV URLs are collected in a list (`error_dict`), so we can either try to query them again or look into why they could not be collected.

In [30]:
urls <- unique(biosample_df_final$url)
results_list <- list()
error_dict <- list()

for (i in seq_along(urls)) {
    # Print the progress every 10th URL
    if (i %% 10 == 0) {
        print(paste('Processing', i, 'of', length(urls)))
    }
    url <- urls[i]
    tryCatch({
        contig_taxa_df <- read_tsv(url, col_names = FALSE, show_col_types = FALSE)
        colnames(contig_taxa_df) <- c('contig_id', 'taxa', 'initial_count')
        
        # Clean up the taxa column and deal with unknown taxa
        contig_taxa_df$taxa_new <- contig_taxa_df$taxa
        contig_taxa_df$taxa_new <- sapply(strsplit(contig_taxa_df$taxa_new, ';'), function(x) x[3])
        contig_taxa_df$taxa_new <- ifelse(
            is.na(contig_taxa_df$taxa_new), 
            paste('Unknown', sapply(strsplit(contig_taxa_df$taxa, ';'), function(x) x[2])), 
            contig_taxa_df$taxa_new)
        contig_taxa_df$taxa_new <- ifelse(
            contig_taxa_df$taxa_new == "Unknown NA", 
            paste('Unknown', sapply(strsplit(contig_taxa_df$taxa, ';'), function(x) x[1])), 
            contig_taxa_df$taxa_new)
        contig_taxa_df$taxa <- contig_taxa_df$taxa_new

        contig_taxa_df <- contig_taxa_df %>%
            group_by(taxa) %>%
            summarise(count = n()) %>%
            mutate(relative_abundance = count / sum(count))
        contig_taxa_df$url <- url
        results_list[[i]] <- contig_taxa_df

    }, error = function(e) {
        # Use <<- so the error is stored in the outer error_dict, keyed by URL
        error_dict[[url]] <<- e
    })
}

contig_df <- bind_rows(results_list) 

head(contig_df)

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df_final' not found
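The taxa-labelling and counting logic can be checked on a few made-up lineage strings. The strings below are placeholders, and the real files may have different ranks or depths. The fallback is written more compactly than in the loop above but gives the same labels for lineages with one or two elements.

```r
# Hypothetical lineage strings, for illustration only.
lineages <- c('A;B;C;D', 'A;B;C;E', 'A;B', 'A')
parts <- strsplit(lineages, ';')

# Take the third element where it exists; otherwise fall back to the
# deepest available element, prefixed with "Unknown".
label <- vapply(parts, function(x) {
  if (length(x) >= 3) x[3] else paste('Unknown', x[length(x)])
}, character(1))

# Count rows per label and convert the counts to relative abundances.
counts <- table(label)
relative_abundance <- counts / sum(counts)
print(relative_abundance)
```

Counting rows, rather than summing the third column, is what accounts for the repeated `1.0` rows described earlier.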


## Clean up the relative abundance data to fill in NAs with 0 for unobserved taxa

In [31]:
# First merge to get the url for geo_loc_name and soil_horizon
biosample_taxa_df <- biosample_df_final %>%
    select(soil_horizon, geo_loc_name, url) %>%
    distinct() %>%
    right_join(contig_df, by = join_by(url))

# Then pivot the table to fill in the relative abundance as zero for un-observed taxa
biosample_taxa_df_wide <- biosample_taxa_df %>%
    pivot_wider(id_cols = c(url, soil_horizon, geo_loc_name),
        names_from = taxa, values_from = relative_abundance) %>%
    replace(is.na(.), 0)

# And unpivot the table to get the taxa relative abundance for each biosample
biosample_taxa_df <- biosample_taxa_df_wide %>%
    pivot_longer(cols = -c(url, soil_horizon, geo_loc_name), names_to = 'taxa', values_to = 'relative_abundance')

ERROR: Error in eval(expr, envir, enclos): object 'biosample_df_final' not found
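The widen-then-lengthen step exists so that a taxa missing from a sample counts as zero when averages are taken, instead of being left out. A toy example with made-up values is sketched below. It uses the `values_fill` argument of `pivot_wider()` as an alternative to the `replace(is.na(.), 0)` call in the cell above.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

# Toy abundances: taxa 'B' was not observed in sample 'u2'.
toy_abund <- tibble(
  url = c('u1', 'u1', 'u2'),
  taxa = c('A', 'B', 'A'),
  relative_abundance = c(0.6, 0.4, 1.0)
)

# Widen so every taxa gets a column (missing cells become 0),
# then lengthen again so every sample has a row for every taxa.
toy_filled <- toy_abund %>%
  pivot_wider(names_from = taxa, values_from = relative_abundance,
              values_fill = 0) %>%
  pivot_longer(cols = -url, names_to = 'taxa',
               values_to = 'relative_abundance')
print(toy_filled)
```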


## Plot the average taxa abundance for all M and O horizon soil samples

First we calculate the mean relative abundance for each taxa in each soil horizon. Next, we pull out the top 15 taxa and lump all others into an "Other" category for plotting purposes using the `forcats::fct_other` function. Finally, we choose an appropriate color palette and plot the mean relative abundance of each taxa for each soil horizon.

In [32]:
horizon_taxa <- biosample_taxa_df %>%
    group_by(soil_horizon, taxa) %>%
    summarise(mean_relative_abundance = mean(relative_abundance))%>%
    arrange(mean_relative_abundance) %>%
    mutate(taxa = factor(taxa, levels = rev(unique(taxa)))) %>%
    mutate(taxa_lump = forcats::fct_other(taxa, keep = levels(taxa)[1:15], other_level = 'Other')) 
           
# Make a color palette that is 15 colors long, followed by grey for 'Other'
color_pal <- c(RColorBrewer::brewer.pal(8, 'Set1'), RColorBrewer::brewer.pal(7, 'Set3'), 'grey')
g <- ggplot(horizon_taxa, aes(x = soil_horizon, y = mean_relative_abundance, fill = taxa_lump)) +
    geom_bar(stat = 'identity', color = NA) +
    labs(title = 'Average taxa abundance of M and O horizon soil samples',
         x = 'Soil Horizon', y = 'Mean relative abundance', fill = NULL) +
    scale_fill_manual(values = color_pal) +
    # Apply the complete theme first so the axis-text tweak is not overridden
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
g

ERROR: Error in eval(expr, envir, enclos): object 'biosample_taxa_df' not found
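The lumping step can be illustrated on a toy factor. The sketch below assumes the factor levels are already ordered from most to least abundant, as the `arrange()` and `factor()` calls above intend, and keeps only the first two levels.

```r
# Toy taxa factor with levels ordered from most to least abundant (made up).
taxa <- factor(c('A', 'B', 'C', 'D'), levels = c('A', 'B', 'C', 'D'))

# Keep the first two levels and collapse everything else into 'Other'.
lumped <- forcats::fct_other(taxa, keep = levels(taxa)[1:2],
                             other_level = 'Other')
print(levels(lumped))
```

Since "Other" becomes the last level, putting `'grey'` last in the color palette assigns it to the lumped category.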


## Plot the taxa abundance of M and O horizon soil samples for each location
First we'll calculate the mean relative abundance of each taxa for each soil horizon at each location. Then we'll pull out the top 15 taxa and lump all others into an "Other" category for plotting purposes using the `forcats::fct_other` function. Finally, we'll plot the relative abundance of each taxa for each soil horizon at each location (using the same color palette as above).

In [33]:
geo_taxa <- biosample_taxa_df %>%
    group_by(geo_loc_name, soil_horizon, taxa) %>%
    summarise(mean_relative_abundance = mean(relative_abundance)) %>%
    arrange(mean_relative_abundance) %>%
    mutate(taxa = factor(taxa, levels = rev(unique(taxa)))) %>%
    mutate(taxa_lump = forcats::fct_other(taxa, keep = levels(taxa)[1:15], other_level = 'Other')) %>%
    mutate(soil_horizon = factor(soil_horizon, levels = c('M horizon', 'O horizon'), labels = c('M', 'O')))

g <- ggplot(geo_taxa, aes(x = soil_horizon, y = mean_relative_abundance, fill = taxa_lump)) +
    geom_bar(stat = 'identity', color = NA) +
    facet_wrap(~geo_loc_name, nrow = 1,labeller =  label_wrap_gen(width = 20, multi_line = TRUE)) +
    labs(title = 'Taxa abundance of M and O horizon soil samples for each location',
         x = 'Soil Horizon', y = 'Mean relative abundance', fill = NULL) +
    scale_fill_manual(values = color_pal) +
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom") 
g

ERROR: Error in eval(expr, envir, enclos): object 'biosample_taxa_df' not found
