This repository has been archived by the owner on Mar 27, 2024. It is now read-only.

[clinical] Support NLST clinical data #28

Closed
fedorov opened this issue Mar 23, 2022 · 9 comments

@fedorov
Member

fedorov commented Mar 23, 2022

Currently, it is captured in the nlst_clinical_* tables, with the dictionaries in RTF format linked here: https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#nlst. I don't know if they are already handled, but I did not find the dictionaries/tables in idc-dev-etl:clinical.

@G-White-ISB

I have not started on parsing the RTF files

@G-White-ISB

This issue is still outstanding

@fedorov
Member Author

fedorov commented May 16, 2023

@G-White-ISB I made Excel spreadsheets for each of the dictionaries, see here:

NLST_dicts_xls.zip

For convenience, the original RTF files are here:

nlst780.idc.delivery.052821.zip

Can you please review and let me know whether you would organize it differently or whether it is OK as is? If it is OK, I will send the files to TCIA and ask them to post them on the wiki so you can ingest them from there and incorporate them into your workflow. Would it be possible to add this to v15?

@G-White-ISB

I can definitely work with these Excel files. We'll get this in for v15. I don't know if we need to bother TCIA. After the release I'll see if I can do the rtf to Excel conversion programmatically.

@fedorov
Member Author

fedorov commented May 16, 2023

I don't know if we need to bother TCIA.

I do! If I put effort into this, and I believe it can help someone, I want it to be available, and ideally at a central place. I will take care of this.

After the release I'll see if I can do the rtf to Excel conversion programmatically

I do not think this is worth the effort. I don't think we can expect those files to update dynamically, and it is not a common representation, so we do it once and forget about it until the next time (if the next time ever comes).

@fedorov
Member Author

fedorov commented Jul 7, 2023

@G-White-ISB I was reviewing this, and I have trouble understanding the BQ content.

I selected column metadata using this query:

SELECT
  *
FROM
  `bigquery-public-data.idc_v15_clinical.column_metadata`
WHERE
  collection_id="nlst"
  AND table_name="bigquery-public-data.idc_v15_clinical.nlst_prsn"
ORDER BY
  column_label
  • I would expect each tab in an individual spreadsheet to be stored as a separate table, but I only see nlst_prsn (the corresponding Excel file contains 6 sheets)
  • I do not see the variables below
    [image]
  • I do see variables that are not in the spreadsheet (and option descriptions are missing):
    [image]

@G-White-ISB

The source DATA for nlst_prsn is all in ONE CSV file with all 30+ columns. The accompanying RTF document, which was used to create the Excel spreadsheet, explains different sets of columns on different pages.

Some columns were missed because their names do not appear literally in the dictionary. Columns scr_iso0, scr_iso1, scr_iso2 are apparently covered by the single entry scr_iso0-2 in the dictionary.
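
To illustrate the mismatch, here is a minimal Python sketch of how a range-style dictionary entry such as scr_iso0-2 could be expanded into concrete column names before matching against the CSV header. This is not the actual parsing script; the function names, the regular expression, and the column name pid are hypothetical, and the sketch assumes range entries always have the form prefix, start number, hyphen, end number.

```python
import re

# Assumed (hypothetical) shape of a range entry: "<prefix><start>-<end>",
# for example "scr_iso0-2". Anything else is treated as a plain column name.
_RANGE = re.compile(r"^(?P<prefix>.*?)(?P<start>\d+)-(?P<end>\d+)$")


def expand_dictionary_name(name):
    """Return the concrete column names covered by one dictionary entry."""
    name = name.strip()
    match = _RANGE.match(name)
    if match is None:
        return [name]
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        # Not a sensible range; leave the entry untouched.
        return [name]
    return [f"{match['prefix']}{i}" for i in range(start, end + 1)]


def undocumented_columns(csv_columns, dictionary_names):
    """List CSV columns that no dictionary entry covers, in original order."""
    covered = set()
    for name in dictionary_names:
        covered.update(expand_dictionary_name(name))
    return [column for column in csv_columns if column not in covered]


# "pid" is a made-up column name used only for this example.
print(expand_dictionary_name("scr_iso0-2"))
print(undocumented_columns(["pid", "scr_iso0", "scr_iso1"], ["pid", "scr_iso0-2"]))
```

With this kind of expansion, a literal name comparison would no longer report scr_iso0, scr_iso1 and scr_iso2 as missing from the dictionary.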

@fedorov
Member Author

fedorov commented Jul 7, 2023

This needs to be addressed in the custom parsing script. The dictionary should contain actual values for meaning/labels.

@G-White-ISB

The column_metadata table in the pdp_staging dataset has been updated to include the column labels and options for the scr_iso0..scr_iso2 and scr_days0..scr_days2 columns, as parsed from the dictionary. The table still needs to be updated in the public dataset.

@fedorov fedorov closed this as completed Jul 19, 2023