All steps were conducted on Linux Ubuntu 20.04.3 LTS.

## Create data folder

In [None]:
!mkdir -p ../data/lung_data

### Exposure Data

* [GDC Data Portal Repository for Transcriptome Profiling Gene Expression Quantification FPKM+UQ Files for TCGA Projects LUAD+LUSC (lung cancers)](https://portal.gdc.cancer.gov/repository?facetTab=cases&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22TCGA%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-LUAD%22%2C%22TCGA-LUSC%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20FPKM-UQ%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22transcriptome%20profiling%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%5D%7D)
(last visited 2022/01/06)
* Add All Files to Cart.
* At the cart, press Clinical, then TSV. From that compressed directory, extract the file `exposure.tsv` and place it
in the same directory as this notebook.
* Still at the cart, press Metadata. Save that JSON file in the same directory as this notebook.

* The file `exposure.tsv` has two issues that must be resolved for our purposes.
* First, it contains extraneous information; we only need the number of cigarettes smoked per day,
which serves as our label.
* Second, samples are identified by a case id, whereas the gene expression quantification data are identified by
filename. A mapping between the two is required to match each label with its sample's gene expression
levels.

* To fix these issues, we start by loading the `case_id` and `cigarettes_per_day` columns from the `exposure.tsv` file
into a pandas dataframe. Entries marked `'--` denote missing values, so we parse them as NaN and drop the affected rows.

In [1]:
import pandas as pd

raw_cigs_per_day = pd.read_csv('exposure.tsv', index_col=None, header=0, sep='\t',
                           usecols=['case_id', 'cigarettes_per_day'], na_values="'--").dropna()
print("(Number labels, number columns) = ", raw_cigs_per_day.shape)

(Number labels, number columns) =  (776, 2)


In [2]:
raw_cigs_per_day.head()

| | case_id | cigarettes_per_day |
|---|---|---|
| 0 | cbbea9f1-396a-4bf3-b67c-2cac3394dceb | 0.821918 |
| 1 | e499069b-a16a-49e9-941a-e3e9ea62af25 | 3.287671 |
| 3 | aee86a89-0377-4080-b16c-408bfbe78687 | 4.383562 |
| 5 | 44218b35-219c-4ad9-a01e-fde14067c4c0 | 5.150685 |
| 6 | 34d8e84e-c3e1-417d-8b9b-8563d9fa0f8e | 3.287671 |


* To ensure the label and gene expression data have matching sample names, we next load the metadata
to create a dictionary mapping case ids to filenames.

In [3]:
import json

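def get_caseid_filename_dict(metadata):
    """Map each case id in the GDC metadata JSON file to its gene expression filename."""
    with open(metadata) as f:
        metadata_json = json.load(f)
    # Each entry describes one file; its first associated entity carries the case id.
    caseid_filename_dict = {entry['associated_entities'][0]['case_id']: entry['file_name']
                            for entry in metadata_json}
    return caseid_filename_dict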
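
The dictionary comprehension above keeps a single filename per case: if a case has several files in the cart, later entries overwrite earlier ones. The following is a minimal sketch of how one might check for such cases before building the mapping. The toy metadata list and all of its values are invented for illustration; only the keys `associated_entities`, `case_id`, and `file_name` come from the function above.

```python
from collections import defaultdict

# Toy stand-in for the parsed metadata JSON; all values are invented for illustration.
toy_metadata = [
    {"file_name": "a.FPKM-UQ.txt.gz", "associated_entities": [{"case_id": "case-1"}]},
    {"file_name": "b.FPKM-UQ.txt.gz", "associated_entities": [{"case_id": "case-2"}]},
    {"file_name": "c.FPKM-UQ.txt.gz", "associated_entities": [{"case_id": "case-1"}]},
]


def group_files_by_case(metadata_entries):
    """Map each case id to the list of all filenames associated with it."""
    files_by_case = defaultdict(list)
    for entry in metadata_entries:
        case_id = entry["associated_entities"][0]["case_id"]
        files_by_case[case_id].append(entry["file_name"])
    return dict(files_by_case)


files_by_case = group_files_by_case(toy_metadata)
# Cases with more than one file would lose all but one filename in a plain dict mapping.
multi_file_cases = {case: files for case, files in files_by_case.items() if len(files) > 1}
print(multi_file_cases)
```

If `multi_file_cases` is non-empty for the real metadata, it may be worth deciding explicitly which file per case to keep.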

In [4]:
mapping_dict = get_caseid_filename_dict('metadata.cart.2022-01-06.json')  # replace with your metadata's filename

raw_cigs_per_day['case_id'] = raw_cigs_per_day['case_id'].map(mapping_dict)
raw_cigs_per_day = raw_cigs_per_day.rename(columns={'case_id': 'filenames'})
raw_cigs_per_day = raw_cigs_per_day.set_index('filenames')

In [5]:
raw_cigs_per_day.head()

| filenames | cigarettes_per_day |
|---|---|
| 1ef4c9a4-403e-4ed2-b9d6-06302a842278.FPKM-UQ.txt.gz | 0.821918 |
| 44615ba8-36ce-4d2e-9d9d-b2f01179b6c8.FPKM-UQ.txt.gz | 3.287671 |
| 378bba31-bcf1-49a1-b1ff-e14278f7054c.FPKM-UQ.txt.gz | 4.383562 |
| 6fdbffc8-dac1-4a98-9912-9084b2ce3f28.FPKM-UQ.txt.gz | 5.150685 |
| ee6fc916-052f-4fab-974f-119ab34078d6.FPKM-UQ.txt.gz | 3.287671 |


### Gene expression data

* [GDC Data Portal Repository for Transcriptome Profiling Gene Expression Quantification FPKM+UQ Files for TCGA Projects LUAD+LUSC (lung cancers)](https://portal.gdc.cancer.gov/repository?facetTab=cases&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22TCGA%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-LUAD%22%2C%22TCGA-LUSC%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20FPKM-UQ%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22transcriptome%20profiling%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%5D%7D)
(last visited 2022/01/06)
* Add All Files to Cart.
* At the cart, either download the data directly, or press Download > Manifest.
Save the manifest in the same folder as this notebook.
* Install the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)
(last visited 2022/01/06; the download link used with wget below is subject to change).

In [6]:
!wget https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip
!unzip gdc-client_v1.6.*.zip
!rm -f gdc-client_v1.6.*.zip wget-log

* We make a directory in which to store all the files.
* Then we use the transfer tool to download the files listed in the manifest into that newly created directory.

Note: Downloading the transcriptome profiling files typically takes a while.
Furthermore, if the download fails with an error, it may be worth re-running the following block of code until all files are downloaded.

In [7]:
!mkdir -p fpkm-tcga-lung-gene-exp
!./gdc-client download -d fpkm-tcga-lung-gene-exp --manifest gdc_manifest*.txt

UnboundLocalError: local variable 'child' referenced before assignment
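
Since the download may fail partway through, re-running it by hand can become tedious. The following is a minimal sketch of a retry helper, assuming that the transfer tool exits with a non-zero status on failure and picks up where it left off when run again (neither is confirmed here). The helper name `run_with_retries` and the manifest filename in the usage comment are hypothetical.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=5):
    """Run `cmd` until it exits with status 0; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        completed = subprocess.run(cmd)
        if completed.returncode == 0:
            return attempt
        print(f"Attempt {attempt} failed with exit code {completed.returncode}", file=sys.stderr)
    raise RuntimeError(f"Command still failing after {max_attempts} attempts: {cmd}")


# Hypothetical usage; replace the manifest filename with your own
# (shell wildcards are not expanded when the command is passed as a list):
# run_with_retries(["./gdc-client", "download", "-d", "fpkm-tcga-lung-gene-exp",
#                   "--manifest", "gdc_manifest.txt"])
```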

* Verify that the number of files in your directory matches the number of files
in your GDC cart.
* Next, we open each of the files to merge them into a single dataframe.
* We will save the concatenated dataframe into a file for future use.
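
The file-count check can also be done programmatically by comparing the manifest against the download directory. The sketch below assumes that the manifest is a tab-separated file with a `filename` column and that downloaded files sit somewhere beneath the download directory; both are assumptions about the GDC tooling rather than something stated in this notebook, and `find_missing_files` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def find_missing_files(manifest_path, download_dir):
    """Return manifest filenames that are not present anywhere under download_dir."""
    with open(manifest_path, newline="") as f:
        # Assumption: the manifest is tab-separated and has a 'filename' column.
        expected = {row["filename"] for row in csv.DictReader(f, delimiter="\t")}
    # Search recursively, since each file may live in its own subdirectory.
    present = {p.name for p in Path(download_dir).rglob("*") if p.is_file()}
    return sorted(expected - present)


# Hypothetical usage (replace the manifest filename with your own):
# print(find_missing_files("gdc_manifest.txt", "fpkm-tcga-lung-gene-exp"))
```

An empty list would indicate that every file named in the manifest was found.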

In [9]:
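import sys
sys.path.append('..')  # assumes the `functions` package sits in the parent directory of this notebook
from functions.load_gexp_dataset import load_gexp_dataset

load_gexp_dataset(infolder='fpkm-tcga-lung-gene-exp', outfile='../data/lung_data/lung_fpkm.csv')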

Because we are only interested in the gene expression levels of samples for which the number of
cigarettes smoked per day is available, we intersect the feature and label datasets, and sort them to ensure that they
match on a per-sample basis.

In [10]:
raw_lung = pd.read_csv('../data/lung_data/lung_fpkm.csv', index_col=0)
print("Before matching:", raw_lung.shape)

lung_gexp = raw_lung.loc[raw_lung.index.isin(raw_cigs_per_day.index)].sort_index()
cigs_per_day = raw_cigs_per_day.loc[raw_cigs_per_day.index.isin(lung_gexp.index)].sort_index()
print("After matching:")
print("Gene expression Features", lung_gexp.shape)
print("Cigarettes per Day Labels", cigs_per_day.shape)

Before matching: (1145, 60483)
After matching:
Gene expression Features (776, 60483)
Cigarettes per Day Labels (776, 1)
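
Matching shapes alone do not prove that the rows line up sample by sample. As a small sketch of the same intersect-and-sort pattern on toy data (the filenames `f1` to `f3` and all values are made up), one can additionally check that the two indices are identical and free of duplicates:

```python
import pandas as pd

# Toy feature matrix and label vector indexed by made-up filenames.
toy_features = pd.DataFrame({"gene_a": [1.0, 2.0, 3.0]}, index=["f3", "f1", "f2"])
toy_labels = pd.DataFrame({"cigarettes_per_day": [0.5, 1.5]}, index=["f2", "f3"])

# Same pattern as above: keep only shared samples, then sort both by index.
features = toy_features.loc[toy_features.index.isin(toy_labels.index)].sort_index()
labels = toy_labels.loc[toy_labels.index.isin(features.index)].sort_index()

# Row i of the features and row i of the labels should refer to the same sample.
assert features.index.equals(labels.index)
assert features.index.is_unique
print(list(features.index))
```

The same two assertions could be applied to `lung_gexp` and `cigs_per_day`.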


Since coding gene selection is a static preprocessing step, we perform it during the
one-time data preparation step in order to avoid introducing unnecessary computations when
iterating over and fine-tuning dynamic steps in the ML pipelines.

**Obtaining a list of coding genes**
* [Download a list of protein coding gene IDs from Ensembl](http://www.ensembl.org/biomart/martview/0c0008282d973b80155b23e263f874a8)
(last visited 2022/01/06).
* To select protein coding genes, in Dataset choose Ensembl Genes (Version) > Human Genes (Version);
then click Filters and, under GENE, tick Gene Type and select protein coding;
lastly, go to Attributes, and under GENE untick all boxes except Gene stable ID.
* To download the list of protein coding genes, go to Results,
then Export all results to > File and TSV and tick Unique results only.
Save the file as `protein_coding_genes.txt` in the same directory as this notebook.
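
Before using the exported list, it may be worth confirming that it looks as expected. The sketch below assumes that the file contains one header line followed by one human Ensembl gene ID per line (`ENSG` followed by 11 digits); `read_gene_ids` is a hypothetical helper, not part of this project.

```python
import re

# Human Ensembl gene stable IDs consist of "ENSG" followed by 11 digits.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def read_gene_ids(path):
    """Read one gene ID per line, skipping the header, and reject malformed IDs."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    gene_ids = lines[1:]  # assumption: the first non-empty line is the column header
    malformed = [g for g in gene_ids if not ENSEMBL_GENE_ID.fullmatch(g)]
    if malformed:
        raise ValueError(f"Unexpected gene IDs, e.g. {malformed[:3]}")
    return gene_ids


# Hypothetical usage:
# gene_ids = read_gene_ids("protein_coding_genes.txt")
# print(len(gene_ids), "protein coding gene IDs")
```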

**Coding Gene Selection**
* Load the protein coding gene list and select only coding genes from the main dataframe.
* By reducing the number of genes (features), we reduce the dimensionality of the data and thereby mitigate the curse of dimensionality.

In [None]:
# The file has a single column with a header line, so the default parser settings suffice.
protein_coding_genes = pd.read_csv('protein_coding_genes.txt', header=0).iloc[:, 0].tolist()
# Compare on the ID before any '.' (assumes columns are Ensembl IDs, optionally with a version suffix).
column_gene_ids = lung_gexp.columns.str.split('.').str[0]
coding_lung_gexp = lung_gexp.loc[:, column_gene_ids.isin(protein_coding_genes)]

print("Gene expression matrix dimensions after coding gene selection:", coding_lung_gexp.shape)
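
The Ensembl list contains unversioned gene IDs, so a column only matches if its name contains such an ID. Assuming the expression columns are Ensembl IDs with a version suffix such as `.13` (an assumption; the version numbers below are illustrative), exact matching after removing the suffix can be sketched on toy data as follows:

```python
def strip_version(gene_id):
    """Drop everything from the first '.' onwards, e.g. a trailing version number."""
    return gene_id.split(".", 1)[0]


# Toy column names and toy coding-gene list; version suffixes are illustrative only.
toy_columns = ["ENSG00000000003.13", "ENSG00000000005.5", "ENSG00000223972.5"]
toy_coding_ids = {"ENSG00000000003", "ENSG00000000005"}

# Keep a column only if its unversioned ID is in the coding-gene set.
selected = [col for col in toy_columns if strip_version(col) in toy_coding_ids]
print(selected)
```

A set lookup per column also avoids building one very long regular expression out of all gene IDs.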

* Write the processed gene expression dataset to files.
* CSV files work well with pandas.
* For Dask, writing Parquet files is recommended.
* To facilitate certain downstream steps, we reset the index to default ordinal integers.

In [11]:
coding_lung_gexp.to_csv(path_or_buf='../data/lung_data/coding_lung_fpkm.csv', index=False)
cigs_per_day.to_csv(path_or_buf='../data/lung_data/cigs_per_day.csv', index=False)

* For use with Dask, choosing a suitable partition size for the Parquet dataset is important.
Smaller partitions allow more parallelization, but larger partitions incur less computational overhead.
* To guarantee that the samples of the feature matrix and the label vector match, we give both the same
set of sample-wise divisions.

In [14]:
import dask.dataframe as dd
import sys

# 6e7 bytes is around 60 MB; dividing the dataframe's total byte size by this value
# gives each partition roughly that size. npartitions must be an integer of at least 1.
dask_coding_lung_gexp = dd.from_pandas(coding_lung_gexp.reset_index(drop=True),
                                       npartitions=max(1, int(sys.getsizeof(coding_lung_gexp) // 6e7)))

dask_coding_lung_gexp.to_parquet('../data/lung_data/coding_lung_fpkm.parquet',
                                 engine='pyarrow', compression='snappy')

# Repartition the labels along the same divisions as the features.
# The index was already reset in pandas; resetting it again in Dask would restart it within each partition.
dask_coding_cigs_per_day = dd.from_pandas(cigs_per_day.reset_index(drop=True), npartitions=1).repartition(
    divisions=dask_coding_lung_gexp.divisions)
dask_coding_cigs_per_day.to_parquet('../data/lung_data/coding_cigs_per_day.parquet',
                                    engine='pyarrow', compression='snappy')
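
The partition-count arithmetic can be checked in isolation. The sketch below uses the 776 samples reported above together with a hypothetical 20,000 `float64` columns; `partitions_for` is an illustrative helper, and the real column count depends on the coding gene list.

```python
def partitions_for(total_bytes, target_bytes=60_000_000):
    """Number of partitions so that each holds roughly target_bytes (at least one)."""
    return max(1, int(total_bytes // target_bytes))


# Hypothetical matrix: 776 samples x 20,000 float64 columns (8 bytes per value).
total_bytes = 776 * 20_000 * 8
print(total_bytes, partitions_for(total_bytes))
```

Guarding with `max(1, ...)` matters for small dataframes, where the integer division would otherwise yield zero partitions.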
