This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Brodie Biosample names missing local names after merging DNA and RNA GOLD biosamples #340

Closed
dehays opened this issue May 7, 2021 · 11 comments
@dehays
Contributor

dehays commented May 7, 2021

The primary issue here is that the biosample names (which originate from GOLD) are truncated in the search portal.

Eoin Brodie pointed out in a call yesterday that there is no way to tell the samples apart in the UI, because the part of the biosample name that differs between them has been truncated.

For example, GOLD appears to build the biosample name by appending the local sample name to the end of the study name: "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_115"

But most of the Brodie biosamples display "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -" as the sample name.

First and easier step to address this: display the entire biosample name

In future, consider adopting a different sample naming scheme than GOLD's.

@jbeezley

jbeezley commented May 7, 2021

If your screen is wide enough the full name is shown. Maybe we need to make the name multi-line rather than truncating (with an ellipsis) when it doesn't fit?

@jbeezley

jbeezley commented May 7, 2021

Oh wait, I guess for the Brodie biosamples the name is like that in the database. I assume this is coming from upstream in the pipeline because nothing in the ingest does any truncation.

@dehays
Contributor Author

dehays commented May 7, 2021

You're right @jbeezley - I see it in the Mongo documents:

{
  "_id": {"$oid": "602551d125261d62add15a31"},
  "name": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ",
  "description": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
  "lat_lon": {
    "has_raw_value": "38.9206 -106.9489",
    "latitude": 38.9206,
    "longitude": -106.9489
  },
  ...

I agree with making the display wrap onto multiple lines so that long names are shown in full regardless of window width.

I'll move this to the ETL issues as I think that is where the truncation must be happening.

@dehays dehays transferred this issue from microbiomedata/nmdc-server May 7, 2021
@dehays
Contributor Author

dehays commented May 7, 2021

@wdduncan It appears that this biosample name truncation is happening in the ETL. The example above is truncated at 103 characters. (Maybe when you load from Oracle to your local DB?)
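
The 103-character figure can be checked directly. A minimal sketch, using only the name string from the Mongo document quoted above (the variable names are illustrative, not part of any NMDC code):

```python
# Name exactly as it appears in the Mongo document quoted above,
# including the trailing " - " where the local sample name should follow.
stored_name = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States - "
)

# Split on the last " - " separator: study part, separator, local name.
study_part, separator, local_name = stored_name.rpartition(" - ")

print(len(stored_name))  # total length of the stored name
print(len(study_part))   # length of the study-name part alone
print(repr(local_name))  # whatever follows the final separator
```

The stored name is the study name plus the bare " - " separator, with an empty local-name part after it.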

@wdduncan
Contributor

The JSON that I output has more data than what is shown above. Here is the JSON for one of the biosamples in Brodie's study (Gs0135149); note that the name ends with " - ER_DNA_115".

{
  "id": "gold:Gb0191643",
  "name": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_115",
  "description": "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
  "type": "nmdc:Biosample",
  "collection_date": {
    "has_raw_value": "2017-03-07"
  },
  "lat_lon": {
    "has_raw_value": "38.9206 -106.9489",
    "latitude": 38.9206,
    "longitude": -106.9489
  }
}

The name matches what is on the GOLD portal for biosample Gb0191643 (see screenshot).
[Screenshot: GOLD portal entry for biosample Gb0191643]

@dehays
Contributor Author

dehays commented May 10, 2021

This is kinda yuck. @wdduncan - this is not truncation happening in the GOLD ETL as I had originally thought.

For the Brodie study, there are 53 biosample metadata documents in Mongo and as expected 53 biosamples on the search portal. Bill, your ETL produces nearly twice that number. This is because the RNA and DNA samples were merged into single source samples. @dwinston - the naming from that merge appears to be the shared part of the name but doesn't include the different part of the name. (Makes sense, there'd need to be a special rule to do something useful with the different parts of the names.) So two samples from GOLD, Gb0191643 and Gb0205601 with names "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_115" and "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_RNA_115", get merged to one sample igsn:IEWFS0001 with name "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -".

These look truncated but are really the result of that 2:1 biosample transformation. And this happens for each sample that has both a DNA and an RNA biosample in GOLD. (There are only three GOLD DNA biosamples for this study that had no corresponding RNA biosample, and those appear correctly in the portal: ...ER_DNA_379, ...ER_DNA_380 and ...ER_DNA_381.)

Something similar happens for the Organic Matter samples that have no corresponding GOLD biosample record; e.g. igsn:IEWFS000K, which ends up with name: "Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -"

Yet again - in this case, we know what happened, but there is no general solution beyond these specific samples. The local names for the samples were numbers 115 - 381. The DNA and RNA isolations got "ER_DNA_" and "ER_RNA_" prefixes for the local names for samples provided to JGI. EMSL got mostly the numbered samples - except for a set of samples used for metabolomics that had completely different naming.

I don't see any way for a transform to set appropriate sample names except as a one-off that has knowledge of the local naming schemes.
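
To make the one-off idea concrete, here is a minimal sketch of what such a rule could look like for the "ER_DNA_<n>" / "ER_RNA_<n>" names only. It is not the actual merge code: the function name `merged_name`, the regular expression, and the choice to keep just the shared number are all assumptions for illustration, and it would not cover the EMSL metabolomics samples with their different naming.

```python
import re

# Hypothetical pattern for the Brodie local names described above:
# "ER_DNA_<n>" or "ER_RNA_<n>" following the final " - " separator.
LOCAL_NAME = re.compile(r"^ER_(DNA|RNA)_(\d+)$")

STUDY = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States"
)


def merged_name(gold_names):
    """Build one name for a merged sample from its GOLD biosample names.

    Returns "<study name> - <local number>" when every input shares the
    same study part and the same local number. Returns None otherwise,
    so the caller can fall back to manual curation.
    """
    study_parts, numbers = set(), set()
    for name in gold_names:
        study, _, local = name.rpartition(" - ")
        match = LOCAL_NAME.match(local)
        if not study or match is None:
            return None  # name does not follow the assumed scheme
        study_parts.add(study)
        numbers.add(match.group(2))
    if len(study_parts) != 1 or len(numbers) != 1:
        return None  # inputs disagree (or the list was empty)
    return f"{study_parts.pop()} - {numbers.pop()}"


# The two GOLD names that were merged into igsn:IEWFS0001.
print(merged_name([f"{STUDY} - ER_DNA_115", f"{STUDY} - ER_RNA_115"]))
```

Returning None instead of guessing keeps the rule from silently producing another truncated-looking name when a sample falls outside the assumed scheme.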

@jbeezley

jbeezley commented May 10, 2021

Some do have that extra text, some don't. These are the entities in question:

nmdc> select id, name from biosample where name like '%- '                                                                     
+----------------+---------------------------------------------------------------------------------------------------------+
| id             | name                                                                                                    |
|----------------+---------------------------------------------------------------------------------------------------------|
| igsn:IEWFS0001 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0002 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0003 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0004 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0005 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0006 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0007 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0008 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0009 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000C | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000D | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000E | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000F | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000G | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000H | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000L | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000M | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000N | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000O | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000P | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000Q | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000R | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000S | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000T | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000U | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000V | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000W | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000X | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000Y | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000Z | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0010 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0011 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0012 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0013 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0014 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0015 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0016 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0017 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0018 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS0019 | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001A | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001B | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001C | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001D | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS001E | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000I | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000K | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000B | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000A | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
| igsn:IEWFS000J | Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States -  |
+----------------+---------------------------------------------------------------------------------------------------------+
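
The same check can be run outside the database, for example against exported biosample documents. A minimal sketch, assuming each document is a dict with `id` and `name` keys as in the JSON shown earlier in this thread; the helper name `missing_local_name` is made up for illustration:

```python
STUDY = (
    "Soil microbial communities from the East River watershed "
    "near Crested Butte, Colorado, United States"
)


def missing_local_name(name: str) -> bool:
    """True when a name ends in the bare " -" separator with nothing after it."""
    # rstrip() makes the check indifferent to a trailing space after the dash.
    return name.rstrip().endswith(" -")


# Two records taken from this thread: one merged sample, one GOLD biosample.
biosamples = [
    {"id": "igsn:IEWFS0001", "name": f"{STUDY} - "},
    {"id": "gold:Gb0191643", "name": f"{STUDY} - ER_DNA_115"},
]

affected = [b["id"] for b in biosamples if missing_local_name(b["name"])]
print(affected)
```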

@dehays
Contributor Author

dehays commented May 10, 2021

@jbeezley See my explanation above - each merged RNA and DNA GOLD biosample, and each EMSL-only sample, ends up looking truncated.

@wdduncan
Contributor

wdduncan commented May 11, 2021

@dehays glad to know it is not a GOLD ETL issue. But, I'm not sure what the right approach to take is.

@dehays dehays changed the title Biosample names are truncated Brodie Biosample names missing local names after merging DNA and RNA GOLD biosamples May 20, 2021
@dehays dehays assigned dehays and unassigned wdduncan May 20, 2021
@dehays dehays assigned dwinston and unassigned dehays Jun 22, 2021
@dehays
Contributor Author

dehays commented Jun 22, 2021

@dwinston You addressed this in the changes you made while meeting with Bill and me last Thursday. Can you close this with the PR for those changes?

@dwinston
Contributor

dwinston commented Jul 1, 2021

@dwinston dwinston closed this as completed Jul 1, 2021