# IDC data sampling

In this notebook we sample the data hosted by IDC based on the matching reference data.

We then save the metadata of the sample to a CSV file, which is uploaded to a Google Cloud Storage bucket for reuse in later steps.

### Environment variables and imports

In [None]:
myProjectID = '%%PUT-YOUR-PROJECT-ID-HERE%%'
bucket_name = '%%PUT-YOUR-BUCKET-ID-HERE%%'

In [None]:
# Colab specific authentication helpers
from google.colab import auth

# Other imports
import pandas as pd

In [None]:
auth.authenticate_user()

## Query the BQ tables

Queries that fetch the matched data we're interested in can be executed directly from the Colab notebook using the `%%bigquery` cell magic.

If this operation is run from a Python script instead, a BigQuery client must be instantiated. Please refer to the [client library documentation](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for details and examples.
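
As a rough sketch of the script-based route, the helper below wraps the `google-cloud-bigquery` client. The function name `run_query` and its optional `client` argument are illustrative choices, not part of the original notebook, and the sketch assumes that `google-cloud-bigquery` and pandas are installed and that credentials are already configured.

```python
def run_query(sql, project_id, client=None):
    """Run `sql` in BigQuery and return the result as a pandas DataFrame.

    `client` is an optional, hypothetical injection point so the helper
    can be exercised without network access; by default a real
    bigquery.Client is created for `project_id`.
    """
    if client is None:
        # Imported lazily so this module loads even without the library.
        from google.cloud import bigquery

        client = bigquery.Client(project=project_id)
    # query() starts the job; to_dataframe() waits for it and downloads rows.
    return client.query(sql).to_dataframe()


# Example usage (requires valid credentials):
# cohort_df = run_query("SELECT 1 AS x", myProjectID)
```

Passing the SQL as a plain string keeps the query text identical between the notebook and a script.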

In [None]:
%%bigquery --project=$myProjectID cohort_df

WITH
  reference_data AS (
  SELECT
    PV_MRI_assay,
    SPLIT(Assay_Name, '/')[OFFSET(1)] AS study_id,
    SPLIT(Assay_Name, '/')[OFFSET(2)] AS series_id,
  FROM
    `idc-external-007.ReferenceData.ReferenceDataTable`),
  idc_data AS (
  SELECT
    StudyInstanceUID,
    SeriesInstanceUID,
    gcs_url
  FROM
    `canceridc-data.idc_v2.dicom_all`)
SELECT
  PV_MRI_assay,
  study_id,
  series_id,
  gcs_url
FROM
  reference_data
INNER JOIN
  idc_data
ON
  idc_data.SeriesInstanceUID = reference_data.series_id
ORDER BY
  rand()
LIMIT
  100000
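
The `SPLIT(Assay_Name, '/')[OFFSET(n)]` expressions in the query use zero-based indexing, so the study and series UIDs are taken from the second and third `/`-separated parts of `Assay_Name`. The Python sketch below mimics that step; the example string is made up for illustration, since the real layout of `Assay_Name` is not shown in this notebook.

```python
def split_assay_name(assay_name):
    """Mimic SPLIT(Assay_Name, '/')[OFFSET(1)] and [OFFSET(2)].

    OFFSET is zero-based in BigQuery, matching Python list indexing.
    """
    parts = assay_name.split("/")
    return parts[1], parts[2]


# Hypothetical value, used only to show which parts are picked.
example = "collection/1.2.3/4.5.6/file"
study_id, series_id = split_assay_name(example)
```

As with `OFFSET` in BigQuery, indexing past the end of the list raises an error rather than returning a null value.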

In [None]:
cohort_df

Unnamed: 0,PV_MRI_assay,study_id,series_id,gcs_url
0,T1-weighted pre-contrast,1.3.6.1.4.1.14519.5.2.1.5826.4001.113291808789...,1.3.6.1.4.1.14519.5.2.1.5826.4001.171072817207...,gs://idc-open/71bf09c0-d238-4be6-b102-8c98c65a...
1,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.2531.4003.112684169989...,1.3.6.1.4.1.14519.5.2.1.2531.4003.314673445079...,gs://idc-open/f839168c-281d-48f3-b1bc-88025c3c...
2,T1-weighted pre-contrast,1.3.6.1.4.1.14519.5.2.1.4591.4001.278928118420...,1.3.6.1.4.1.14519.5.2.1.4591.4001.316434438257...,gs://idc-open/07862c91-5f99-45e3-a44a-daaa8ceb...
3,T1-weighted pre-contrast,1.3.6.1.4.1.14519.5.2.1.2531.4003.176367534254...,1.3.6.1.4.1.14519.5.2.1.2531.4003.925945533711...,gs://idc-open/78320abb-bc08-4e95-964f-420c7c95...
4,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.4591.4001.278928118420...,1.3.6.1.4.1.14519.5.2.1.4591.4001.170539642438...,gs://idc-open/125f6019-9888-4de4-ad41-258d9c17...
...,...,...,...,...
55920,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.2531.4003.231610681115...,1.3.6.1.4.1.14519.5.2.1.2531.4003.324452218716...,gs://idc-open/909ea8b8-4b49-4a41-bac2-e9543bfa...
55921,T2-weighted image,1.3.6.1.4.1.14519.5.2.1.4591.4003.425990241084...,1.3.6.1.4.1.14519.5.2.1.4591.4003.796723886140...,gs://idc-open/87d4ae5c-90b2-44fd-9ac8-afbec5c5...
55922,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.7695.4001.912696403696...,1.3.6.1.4.1.14519.5.2.1.7695.4001.220439982391...,gs://idc-open/6caab1d3-0ae3-458c-8bb2-5ad76d14...
55923,T1-weighted pre-contrast,1.3.6.1.4.1.14519.5.2.1.5826.4001.194554107490...,1.3.6.1.4.1.14519.5.2.1.5826.4001.555701715534...,gs://idc-open/ab59234c-909d-4e01-b271-85bb428a...


In [None]:
cohort_df['PV_MRI_assay'].value_counts(dropna=False)

T1-weighted post-contrast    20633
T2-weighted image            12952
T1-weighted pre-contrast     11807
T2-weighted image flair      10533
Name: PV_MRI_assay, dtype: int64

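Now draw a fixed-size sample from each class.

Here we take 10,000 slices for each of the target classes. This works because every class above has at least 10,000 rows; `sample` raises an error if `n` exceeds the size of a group (unless `replace=True`). For testing or illustration, smaller samples should be sufficient.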

In [None]:
sample_df = cohort_df.groupby('PV_MRI_assay').sample(n=10000, random_state=1)
sample_df['PV_MRI_assay'].value_counts(dropna=False)

T1-weighted post-contrast    10000
T2-weighted image            10000
T2-weighted image flair      10000
T1-weighted pre-contrast     10000
Name: PV_MRI_assay, dtype: int64
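
If a class might contain fewer rows than the requested sample size, one option is to cap `n` at the group size. The helper below is a minimal sketch of that idea on a toy dataframe; the name `sample_per_class` and the toy values are illustrative and not part of the original notebook.

```python
import pandas as pd


def sample_per_class(df, class_col, n, seed=1):
    """Sample up to `n` rows per class, taking whole groups that are smaller."""
    parts = [
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in df.groupby(class_col)
    ]
    return pd.concat(parts)


# Toy data: class "a" has 5 rows, class "b" only 2.
toy = pd.DataFrame(
    {
        "PV_MRI_assay": ["a"] * 5 + ["b"] * 2,
        "gcs_url": [f"gs://example/{i}" for i in range(7)],
    }
)
small = sample_per_class(toy, "PV_MRI_assay", n=3)
```

Note that capping `n` this way produces unbalanced classes whenever a group is smaller than `n`, which may or may not be acceptable for later steps.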

In [None]:
sample_df

Unnamed: 0,PV_MRI_assay,study_id,series_id,gcs_url
22393,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.7695.4001.569320506186...,1.3.6.1.4.1.14519.5.2.1.7695.4001.149026159959...,gs://idc-open/5f5b0bb1-f2bc-4cc2-91e0-547bc2e4...
27223,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.2531.4003.666089736017...,1.3.6.1.4.1.14519.5.2.1.2531.4003.261424643169...,gs://idc-open/1d2c9f3f-865d-49b1-8e7b-55d055a8...
34112,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.2531.4003.666089736017...,1.3.6.1.4.1.14519.5.2.1.2531.4003.261424643169...,gs://idc-open/1d4c1c72-fa74-4344-998e-159b953c...
213,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.7695.4001.155576204150...,1.3.6.1.4.1.14519.5.2.1.7695.4001.332152476495...,gs://idc-open/7566f7fe-1fa6-4f75-9cc7-50212f93...
50080,T1-weighted post-contrast,1.3.6.1.4.1.14519.5.2.1.4591.4001.145527987060...,1.3.6.1.4.1.14519.5.2.1.4591.4001.125672926847...,gs://idc-open/862e33bf-0e35-46f1-875b-7b957dae...
...,...,...,...,...
2499,T2-weighted image flair,1.3.6.1.4.1.14519.5.2.1.8862.4001.322746683938...,1.3.6.1.4.1.14519.5.2.1.8862.4001.239852680542...,gs://idc-open/ecc9dd94-5ecf-42b0-a4e0-f9a6c506...
25820,T2-weighted image flair,1.3.6.1.4.1.14519.5.2.1.3775.4001.338505699209...,1.3.6.1.4.1.14519.5.2.1.3775.4001.242527138671...,gs://idc-open/77166447-8cff-4e11-b5a0-edaa01f2...
21777,T2-weighted image flair,1.3.6.1.4.1.14519.5.2.1.4591.4001.176500029848...,1.3.6.1.4.1.14519.5.2.1.4591.4001.491441365887...,gs://idc-open/495b7186-8f3b-4a86-a97b-5cce85a4...
21763,T2-weighted image flair,1.3.6.1.4.1.14519.5.2.1.4591.4003.122287415106...,1.3.6.1.4.1.14519.5.2.1.4591.4003.238576995972...,gs://idc-open/aeed2fa1-bfdc-43a9-a8e4-9bc2d2e6...


Save the sample dataframe to GCS for use in later steps. Note that the file is written with tab separators despite its `.csv` extension.

In [None]:
sample_df.to_csv('/tmp/sample.csv', sep='\t')
!gsutil -u $myProjectID cp /tmp/sample.csv gs://$bucket_name/

Copying file:///tmp/sample.csv [Content-Type=text/csv]...
/ [1 files][  8.2 MiB/  8.2 MiB]                                                
Operation completed over 1 objects/8.2 MiB.                                      
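
Because the file is written with `sep='\t'` and includes the dataframe index, any later step that loads it needs matching options. The round trip below is a small sketch of this using a temporary local file; the helper name `read_sample` and the single-row dataframe are illustrative only.

```python
import os
import tempfile

import pandas as pd


def read_sample(path):
    """Read a sample file written with to_csv(path, sep='\\t')."""
    # index_col=0 restores the original dataframe index from the first column.
    return pd.read_csv(path, sep="\t", index_col=0)


# Made-up row, used only to demonstrate the round trip.
df = pd.DataFrame(
    {
        "PV_MRI_assay": ["T2-weighted image"],
        "gcs_url": ["gs://example-bucket/object"],
    },
    index=[42],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    df.to_csv(path, sep="\t")
    restored = read_sample(path)
```

Reading the file with the default comma separator would instead yield a single column containing the whole tab-separated line.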


---