In [1]:
# -------------------------------------------------------------------------
# Title: DNA Methylation Data Processing from TCGA Cancer Projects
# Description:
#   This component is developed using the R programming language to download,
#   process, and save DNA methylation data from various TCGA cancer projects.
#
# Modules:
# 1. Library Installation:
#   - Install the required libraries for extracting data from the GDC repository.
#
# 2. Querying Data List:
#   - Select the project, type of cancer, data category, platform, data type,
#     and data accessibility.
#   - The output is a list of data IDs.
#   - For this project, the chosen datasets are two types of lung cancer: LUAD and LUSC.
#
# 3. Sample Selection:
#   - Randomly select 20 samples from each type of cancer.
#   - Only 20 samples per type are used because of the large data volume.
#
# 4. Querying Data:
#   - Query information related to the samples chosen in the previous step.
#
# 5. Downloading Selected Data:
#   - Download the selected samples.
#
# 6. Converting to Dataframe:
#   - Convert the downloaded data to the dataframe type.
#   - Result: 2 dataframes, each for a specific type of cancer and project.
#
# 7. Labeling Data Module:
#   - Add a label named "Type" (a row at this stage, a column after transposing)
#     based on the type of cancer and the project's name.
#   - Label assignment:
#     • LUAD: 0
#     • LUSC: 1
#
# 8. Data Binding:
#   - Bind together the 2 dataframes for different cancer types and projects.
#
# 9. Transposing:
#   - Transpose the dataframe to facilitate further data processing.
#
# 10. Saving as CSV:
#   - Save the prepared dataset as a CSV file for subsequent use.
# -------------------------------------------------------------------------


In [2]:
## Note: This step may take a significant amount of time.
# --------------------------------------------------
# Install and Load Essential Packages
# --------------------------------------------------

# Ensure BiocManager is installed and loaded
install.packages('BiocManager')
library(BiocManager)

# Install required packages using BiocManager:
# 1. tidyverse: A collection of R packages designed for data science.
# 2. TCGAbiolinks: Provides an interface to The Cancer Genome Atlas (TCGA) data.
# 3. SummarizedExperiment: Offers summarized data representation tools.
# 4. sesameData: Annotation support for sesame.
# 5. sesame: Tools for the analysis of DNA methylation arrays.

BiocManager::install(c('tidyverse', 'TCGAbiolinks', 'SummarizedExperiment', 'sesameData', 'sesame'))


# Load the installed packages
library(TCGAbiolinks)
library(tidyverse)
library(SummarizedExperiment)
library(sesameData)
library(sesame)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.17 (BiocManager 1.30.22), R 4.3.1 (2023-06-16)

“package(s) not installed when version(s) same as or greater than current; use
  `force = TRUE` to re-install: 'tidyverse'”
Installing package(s) 'BiocVersion', 'TCGAbiolinks', 'SummarizedExperiment',
  'sesameData', 'sesame'

also installing the dependencies ‘lazyeval’, ‘png’, ‘Biostrings’, ‘httpuv’, ‘xtable’, ‘sourcetools’, ‘later’, ‘promises’, ‘htmlwidgets’, ‘crosstalk’, ‘formatR’, ‘KEGGREST’, ‘zlibbioc’, ‘bitops’, ‘plogr’, ‘shiny’, ‘DT’, ‘lambda.r’, ‘futile.options’, ‘AnnotationDbi’, ‘XVector’, ‘Rcpp’, ‘R.oo’, ‘R.methodsS3’, ‘matrixStats’, ‘RCurl’, ‘GenomeInfoDbData’, ‘abind’, ‘RSQLite’, ‘interactiveDisplayBase’, ‘futile.logger’, ‘snow’, ‘BH
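
Reinstalling every package on each run is the slow part of this cell. A minimal sketch, assuming it is acceptable to skip packages that are already present: check each one with `requireNamespace()` and pass only the missing ones to the installer. The helper name `missing_packages` is hypothetical and not part of BiocManager, and the install calls are left commented out so the sketch has no side effects beyond loading namespaces.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "TCGAbiolinks", "SummarizedExperiment",
              "sesameData", "sesame")
to_install <- missing_packages(required)

# Install only what is missing (uncomment to actually install).
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# if (length(to_install) > 0) BiocManager::install(to_install)
print(to_install)
```

On a machine where everything is already installed, `to_install` is empty and nothing needs to be downloaded.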

In [3]:
# Query DNA Methylation data from The Cancer Genome Atlas (TCGA)

# Note: This step may take some time.


# Query for Lung Adenocarcinoma (LUAD) methylation data
query_methly_LUAD <- GDCquery(
  project = 'TCGA-LUAD',                         # Specify the cancer type (Lung Adenocarcinoma)
  data.category = 'DNA Methylation',             # We are interested in DNA methylation data
  platform = "Illumina Human Methylation 450",   # Specify the platform used for the methylation analysis
  access = 'open',                               # We only want data that's publicly accessible
  data.type = 'Methylation Beta Value'           # Specify the type of methylation data we're looking for
)

# Query for Lung Squamous Cell Carcinoma (LUSC) methylation data
query_methly_LUSC <- GDCquery(
  project = 'TCGA-LUSC',                         # Specify the cancer type (Lung Squamous Cell Carcinoma)
  data.category = 'DNA Methylation',
  platform = "Illumina Human Methylation 450",
  access = 'open',
  data.type = 'Methylation Beta Value'
)

# Note: Add queries for other cancer types as needed below this line.

# For more detail on the data and the fields available for querying,
# visit the GDC repository at: https://portal.gdc.cancer.gov/repository



# Extract methylation results for different cancer types
output_query_methyl_LUAD <- getResults(query_methly_LUAD)  # Extract results for LUAD (Lung Adenocarcinoma) cancer type
output_query_methyl_LUSC <- getResults(query_methly_LUSC)  # Extract results for LUSC (Lung Squamous Cell Carcinoma) cancer type


--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-LUAD

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By access

ooo By data.type

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases

ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-LUSC

--------------------

oo Filtering results

---------------

In [4]:
# Display the query results (file metadata) for LUAD.
output_query_methyl_LUAD

Unnamed: 0_level_0,id,data_format,cases,access,file_name,submitter_id,data_category,type,platform,file_size,⋯,analysis_id,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>
1,99204ca7-9dd0-4422-951e-19da468967e1,TXT,TCGA-93-A4JN-01A-11D-A24U-05,open,a06c8019-c063-450b-92f3-4f39568361e4.methylation_array.sesame.level3betas.txt,baf37b13-2bb6-44df-98f7-b492da5e5bc0,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13197148,⋯,7ffdd580-48a8-424c-9717-4f23854bd8b0,released,a06c8019-c063-450b-92f3-4f39568361e4,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-93-A4JN,TCGA-93-A4JN-01A
2,bca42f75-ac65-4add-8450-35b846c991df,TXT,TCGA-44-7662-01A-11D-2064-05,open,f1df8f0c-382d-49d7-980a-c16a4765e407.methylation_array.sesame.level3betas.txt,878484f9-4a6e-4bff-990b-4f035dea7e1d,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13181105,⋯,8c0502b0-fc01-4702-92be-3b2d21b93f0a,released,f1df8f0c-382d-49d7-980a-c16a4765e407,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-44-7662,TCGA-44-7662-01A
3,c9894ac2-3a81-482e-be5e-c951a83856dc,TXT,TCGA-73-4658-01A-01D-1756-05,open,00685208-8382-48e1-838b-fc0b7807f67f.methylation_array.sesame.level3betas.txt,e02f90cd-c959-474c-97f8-1f40daace5d5,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13124401,⋯,88fc8467-617f-4303-952f-197dc1945a4c,released,00685208-8382-48e1-838b-fc0b7807f67f,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-73-4658,TCGA-73-4658-01A
4,0cb10a2d-1f60-46a8-8f50-d62941ab09f6,TXT,TCGA-69-7765-01A-11D-2168-05,open,9d47dfb6-8ff8-4771-884f-53433f21ba68.methylation_array.sesame.level3betas.txt,2165db95-b19b-42db-9bf8-d111125791db,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13059462,⋯,d4b7722e-7e86-4b74-a3ab-3df1aaf14437,released,9d47dfb6-8ff8-4771-884f-53433f21ba68,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-69-7765,TCGA-69-7765-01A
5,ada7caf4-d45f-4bcc-8eb1-bd247d1222f4,TXT,TCGA-55-7995-01A-11D-2185-05,open,a9d5eda4-b86a-42bc-b32f-c2d5cc4f38fe.methylation_array.sesame.level3betas.txt,53b76c94-0ac4-484d-8e21-763025c0a868,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13140938,⋯,b1afb267-889c-4b1f-9cca-78237774e705,released,a9d5eda4-b86a-42bc-b32f-c2d5cc4f38fe,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-55-7995,TCGA-55-7995-01A
6,2c0787b2-39fe-4007-95fd-69359ee2f424,TXT,TCGA-49-4487-01A-21D-1856-05,open,d7e3478b-7126-4e40-af10-39f2bfd5c718.methylation_array.sesame.level3betas.txt,9a9e31ae-0729-4472-ad3c-ebf148306a57,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13185116,⋯,15c5b85b-fc49-4ffe-aa4b-813935cd3ffa,released,d7e3478b-7126-4e40-af10-39f2bfd5c718,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-49-4487,TCGA-49-4487-01A
7,22ce762e-783e-455a-bb04-fa95a0c4b1d2,TXT,TCGA-55-7727-01A-11D-2168-05,open,fb46eff2-33da-4061-bcd9-9b13508bbce8.methylation_array.sesame.level3betas.txt,02cf1f7e-b1d7-4e25-b4c2-e34e49efe2dc,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13007727,⋯,b9b68ed2-62e8-4c73-aef9-662380f66c21,released,fb46eff2-33da-4061-bcd9-9b13508bbce8,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-55-7727,TCGA-55-7727-01A
8,a0f8ac3a-99e2-4da8-942a-9d5e6ca862fa,TXT,TCGA-44-2668-01A-01D-A276-05,open,d5769119-6454-48f2-9b9c-9f22c9761df9.methylation_array.sesame.level3betas.txt,613ca21e-a0e5-496d-941e-075698090e8f,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13133450,⋯,d5b64871-73da-497b-8b22-b2295c5a8f7b,released,d5769119-6454-48f2-9b9c-9f22c9761df9,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-44-2668,TCGA-44-2668-01A
9,3e3c799b-a3c4-4e6f-802a-5504d3312ef2,TXT,TCGA-95-7043-01A-11D-1947-05,open,a472e1c8-4a94-4b20-a3ca-8bdfd29cdec5.methylation_array.sesame.level3betas.txt,9cbf96bf-109b-4a45-93da-97db6a489f4e,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13149558,⋯,cf4e6b9c-fb2e-4d9e-a3fe-8bab52172e76,released,a472e1c8-4a94-4b20-a3ca-8bdfd29cdec5,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-95-7043,TCGA-95-7043-01A
10,5df2e0f0-9e18-42f5-bf61-643530af6c01,TXT,TCGA-75-5147-01A-01D-1626-05,open,c3113131-59d9-4d87-85da-27f6c65d1b4e.methylation_array.sesame.level3betas.txt,9c2d3875-ba39-4867-8a4d-443088960472,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13193330,⋯,1d93cf59-3e73-4ccd-859b-e04313a61355,released,c3113131-59d9-4d87-85da-27f6c65d1b4e,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,dc9467b78cf538a0430cd4027dc5092f77d47e20,Primary Tumor,,TCGA-75-5147,TCGA-75-5147-01A


In [5]:
# Set seed for reproducibility
set.seed(42)

# Randomly select 20 samples from each dataset by drawing from the "cases" column.
# Despite the variable names, the selected values are sample barcodes, not row indices.

# LUAD cancer type
selected_indices_LUAD <- sample(output_query_methyl_LUAD$cases, 20)

# LUSC cancer type
selected_indices_LUSC <- sample(output_query_methyl_LUSC$cases, 20)
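
The query results also carry a `sample_type` column (visible in the table above), and `sample()` draws from every row regardless of its value. A minimal sketch on a made-up data frame, assuming that only primary tumor samples should receive a cancer-type label. The helper `sample_tumor_barcodes` and the toy barcodes are hypothetical, and the "Solid Tissue Normal" rows are included only to illustrate a non-tumor entry.

```r
# Toy stand-in for getResults() output; the barcodes here are invented.
toy_results <- data.frame(
  cases = sprintf("TCGA-XX-%04d-%s", 1:8, rep(c("01A", "11A"), times = c(6, 2))),
  sample_type = rep(c("Primary Tumor", "Solid Tissue Normal"), times = c(6, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw n barcodes from the primary tumor rows only.
sample_tumor_barcodes <- function(results, n, seed = 42) {
  tumor <- results$cases[which(results$sample_type == "Primary Tumor")]
  stopifnot(length(tumor) >= n)  # fail early if too few tumor samples exist
  set.seed(seed)
  sample(tumor, n)
}

picked <- sample_tumor_barcodes(toy_results, n = 4)
print(picked)
```

Setting the seed inside the helper keeps each draw reproducible on its own, independent of the order in which the cells are run.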


In [6]:
# Query information for each cancer type based on selected samples.
# Only samples chosen in the previous step are queried.
query_methly_LUAD_final_version <- GDCquery(
  project = 'TCGA-LUAD',
  data.category = 'DNA Methylation',
  platform = "Illumina Human Methylation 450",
  access = 'open',
  data.type = 'Methylation Beta Value',
  barcode = selected_indices_LUAD                # Restrict the query to the sampled barcodes
)

query_methly_LUSC_final_version <- GDCquery(
  project = 'TCGA-LUSC',
  data.category = 'DNA Methylation',
  platform = "Illumina Human Methylation 450",
  access = 'open',
  data.type = 'Methylation Beta Value',
  barcode = selected_indices_LUSC                # Restrict the query to the sampled barcodes
)

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-LUAD

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By access

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases

ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-LUSC

--------------------

oo Filtering results


In [7]:
# Extract results for the final version of selected samples for LUAD
output_query_methly_LUAD_final_version <- getResults(query_methly_LUAD_final_version)

# Extract results for the final version of selected samples for LUSC
output_query_methly_LUSC_final_version <- getResults(query_methly_LUSC_final_version)


In [8]:
# Display the query results for the selected LUAD samples.
output_query_methly_LUAD_final_version

Unnamed: 0_level_0,id,data_format,cases,access,file_name,submitter_id,data_category,type,platform,file_size,⋯,analysis_id,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>
49,c7264ff7-7b32-4430-ab2e-201a5fb8b691,TXT,TCGA-86-7713-01A-11D-2064-05,open,be75986f-4acd-4e56-8bd3-c6fc1829fe64.methylation_array.sesame.level3betas.txt,aa3c4486-ab66-4c13-8006-e91d0ed8278a,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13145092,⋯,2878474e-5de4-44ce-8674-3361dd29d1f1,released,be75986f-4acd-4e56-8bd3-c6fc1829fe64,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,dc9467b78cf538a0430cd4027dc5092f77d47e20,Primary Tumor,,TCGA-86-7713,TCGA-86-7713-01A
485,7433085a-d10e-4597-a999-cc6d800509bd,TXT,TCGA-86-8674-01A-21D-2398-05,open,e5419571-0fe0-4358-b803-3fd83e5d653b.methylation_array.sesame.level3betas.txt,60f9a8bd-4d75-4638-a009-cc588a79d132,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13174871,⋯,dfbd021c-47dd-4b3d-84f5-c697b97a451a,released,e5419571-0fe0-4358-b803-3fd83e5d653b,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-86-8674,TCGA-86-8674-01A
321,4f8e99f9-a32a-4ab9-8f71-642d5db85fe9,TXT,TCGA-55-6987-01A-11D-1947-05,open,59ded356-91a9-45d8-b2c8-c44f797c725f.methylation_array.sesame.level3betas.txt,f3da3b64-9045-405e-9f0f-858868a38156,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13175478,⋯,4b36e50a-5d9b-48fd-a374-ee337d5df7bf,released,59ded356-91a9-45d8-b2c8-c44f797c725f,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-55-6987,TCGA-55-6987-01A
153,ddadc636-be3c-4258-8ae6-c93e591a12a7,TXT,TCGA-49-4514-01A-21D-1856-05,open,41adfb52-afb5-462f-98a4-018be9613677.methylation_array.sesame.level3betas.txt,e874a57c-089a-481e-8e79-d24a13005fe3,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13200460,⋯,56572c89-2734-45d2-93c4-94e0c316aa4b,released,41adfb52-afb5-462f-98a4-018be9613677,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-49-4514,TCGA-49-4514-01A
74,04e13d09-b1b4-4bff-a06b-b30106322bb3,TXT,TCGA-55-7574-01A-11D-2037-05,open,0f8c4c8c-414d-4c23-97c0-9442d33fd1c9.methylation_array.sesame.level3betas.txt,ba24a22c-f0e6-41e0-89ca-f43ea7b11636,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13184651,⋯,72575f0f-43dc-4aaf-8221-526baf344e60,released,0f8c4c8c-414d-4c23-97c0-9442d33fd1c9,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,dc9467b78cf538a0430cd4027dc5092f77d47e20,Primary Tumor,,TCGA-55-7574,TCGA-55-7574-01A
228,54a66b64-7a0f-4a5f-93f8-a4d34c952fad,TXT,TCGA-86-8076-01A-31D-2239-05,open,f2e79be4-ff8f-47e8-969f-c29d401cd34a.methylation_array.sesame.level3betas.txt,6cc3df93-f46f-4d03-8397-c0351e5d000b,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13075037,⋯,af8dc7b6-ca11-41f9-b90c-ef8b900f562d,released,f2e79be4-ff8f-47e8-969f-c29d401cd34a,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-86-8076,TCGA-86-8076-01A
146,c01350c9-89ac-40ae-84ac-0fac77712446,TXT,TCGA-62-A46S-01A-11D-A24I-05,open,82df3c33-91dc-4cfe-b6e1-ac38dc3e94a3.methylation_array.sesame.level3betas.txt,0295387b-d1de-4bea-b0cb-4d626a5ebacc,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13168456,⋯,c2292f13-8358-4a47-94e0-6923fe4efb95,released,82df3c33-91dc-4cfe-b6e1-ac38dc3e94a3,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,dc9467b78cf538a0430cd4027dc5092f77d47e20,Primary Tumor,,TCGA-62-A46S,TCGA-62-A46S-01A
122,1198cff7-3fb9-43ba-ace1-11ede3ef5816,TXT,TCGA-86-8278-01A-11D-2285-05,open,1fc57c52-b7bd-4fec-b974-7cc75086d861.methylation_array.sesame.level3betas.txt,b7320dea-5cd4-421f-8126-37c71927ad04,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13124094,⋯,a5fbbcfb-580b-482c-ba90-49e890de4b08,released,1fc57c52-b7bd-4fec-b974-7cc75086d861,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,9e258094ac9d4febe3721ceba25c1b8dbd1a27f8,Primary Tumor,,TCGA-86-8278,TCGA-86-8278-01A
507,f3987c5d-f346-4f2f-9f4f-b774327e77b0,TXT,TCGA-44-5644-01A-21D-2037-05,open,e1828cf1-3237-48f1-9605-527231455dfc.methylation_array.sesame.level3betas.txt,3aaa5f2b-265a-4314-8479-da80b01f3ea9,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13037527,⋯,34b6332b-3463-4868-84ad-d871f8607739,released,e1828cf1-3237-48f1-9605-527231455dfc,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,dc9467b78cf538a0430cd4027dc5092f77d47e20,Primary Tumor,,TCGA-44-5644,TCGA-44-5644-01A
128,9bccd373-9b0c-430b-85a6-4ea5e5a74e76,TXT,TCGA-69-7980-01A-11D-2185-05,open,3aaabbfe-84a6-4c04-bd35-94c0c8cc3557.methylation_array.sesame.level3betas.txt,9961a9b5-8467-4bcf-834c-73423990b128,DNA Methylation,methylation_beta_value,Illumina Human Methylation 450,13091649,⋯,8ef56dbc-faeb-4b7f-b455-b604a964519a,released,3aaabbfe-84a6-4c04-bd35-94c0c8cc3557,quay.io/ncigdc,SeSAMe Methylation Beta Estimation,f8ef9dc6375573fc05fee47b70278d89f79a4cd7,Primary Tumor,,TCGA-69-7980,TCGA-69-7980-01A


In [9]:
# Download the selected files.
# Note: Depending on the size of the queried data, this step may take a while.

# Download methylation data for LUAD (Lung Adenocarcinoma)
GDCdownload(query_methly_LUAD_final_version)

# Download methylation data for LUSC (Lung Squamous Cell Carcinoma)
GDCdownload(query_methly_LUSC_final_version)


Downloading data for project TCGA-LUAD

GDCdownload will download 20 files. A total of 262.900687 MB

Downloading as: Mon_Oct__9_16_04_03_2023.tar.gz



Downloading: 100 MB     

Downloading data for project TCGA-LUSC

GDCdownload will download 20 files. A total of 262.147226 MB

Downloading as: Mon_Oct__9_16_05_26_2023.tar.gz



Downloading: 100 MB     

In [10]:
# Process the downloaded methylation data
# The GDCprepare function processes and organizes the data into a more accessible format.
# The 'summarizedExperiment' parameter specifies that the output should be a SummarizedExperiment object.

# Process data for LUAD
dna.meth_LUAD <- GDCprepare(query_methly_LUAD_final_version, summarizedExperiment = TRUE)

# Process data for LUSC (stored under the name dna.meth_LUSD, which later cells reuse)
dna.meth_LUSD <- GDCprepare(query_methly_LUSC_final_version, summarizedExperiment = TRUE)


-------------------

oo Reading 20 files

-------------------





-------------------

oo Merging 20 files

-------------------

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Creating a SummarizedExperiment from DNA methylation input

Accessing DNAm annotation from sesame package for: hg38 - HM450

see ?sesameData and browseVignettes('sesameData') for documentation

downloading 1 resources

retrieving 1 resource

loading from cache

Starting to add information to samples

 => Add clinical information to samples

 => Adding TCGA molecular information from marker papers

 => Information will have prefix 'paper_' 

luad subtype information from:doi:10.1038/nature13385

-------------------

oo Reading 20 files

-------------------





-------------------

oo Merging 20 files

-------------------

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Creating a SummarizedExperiment from DNA methylation input

Accessing DNAm annotation from sesame package for: hg38 - HM450

see ?sesameData and browseVignettes('sesameData') for documentation

loading from cache

Starting to add information to samples

 => Add clinical information to samples

 => Adding TCGA molecular information from marker papers

 => Information will have prefix 'paper_' 

lusc subtype information from:doi:10.1038/nature11404



In [11]:
# Convert the prepared data to dataframe
df_LUAD <- as.data.frame(assay(dna.meth_LUAD))
df_LUSD <- as.data.frame(assay(dna.meth_LUSD))

In [12]:
# In this dataframe:
# - Each column represents a unique sample of data.
# - Each row corresponds to a specific feature of the dataset.

# Add a label (record) to the data to indicate the type of cancer:
# - If the type of cancer is LUAD, the label is 0.
# - If the type of cancer is LUSC, the label is 1.

# Determine the number of samples (columns) in the existing datasets for LUAD and LUSC.
num_cols_LUAD <- ncol(dna.meth_LUAD)
num_cols_LUSD <- ncol(dna.meth_LUSD)

# Create a new record for the cancer type label (0 for LUAD and 1 for LUSC).
new_record_LUAD <- rep(0, num_cols_LUAD)
new_record_LUSD <- rep(1, num_cols_LUSD)

# Add the new record (label) to the existing dataframes.
df_LUAD <- rbind(df_LUAD, new_record_LUAD)
df_LUSD <- rbind(df_LUSD, new_record_LUSD)

# Define the name for the new record (row) as "Type".
new_record_name <- "Type"

# Assign the name to the new record in the rownames of the respective dataframes.
rownames(df_LUAD)[nrow(df_LUAD)] <- new_record_name
rownames(df_LUSD)[nrow(df_LUSD)] <- new_record_name
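
The labeling steps above (build a constant vector, `rbind` it, rename the last row) are the same for both cancer types, so they can be wrapped once. A minimal sketch on a toy table; the helper `add_type_row` and the probe and sample names are hypothetical.

```r
# Toy beta-value table: rows are probes, columns are samples (names are invented).
toy_df <- data.frame(
  sample_a = c(0.10, 0.80),
  sample_b = c(0.25, 0.60),
  row.names = c("probe_1", "probe_2")
)

# Hypothetical helper: append a constant label as a final row called "Type".
add_type_row <- function(df, label) {
  labelled <- rbind(df, rep(label, ncol(df)))
  rownames(labelled)[nrow(labelled)] <- "Type"
  labelled
}

toy_labelled <- add_type_row(toy_df, label = 0)
print(toy_labelled)
```

Because the label sits in the same table as the beta values, it is stored as a number alongside them and becomes an ordinary column once the table is transposed.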

In [13]:
# Combine both df_LUAD and df_LUSD dataframes horizontally (column-wise).
combined_df <- cbind(df_LUAD, df_LUSD)


In [14]:
# Transpose the combined dataframe so that samples become rows and features become columns for the final output.
# Note: t() returns a matrix rather than a dataframe.
transposed_df <- t(combined_df)


In [15]:
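# Save the prepared dataset as a CSV file for subsequent use.
# Note: row.names = FALSE omits the sample identifiers (the row names of transposed_df) from the file.
write.csv(transposed_df, "data.csv", row.names = FALSE)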
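
A minimal sketch of the bind, transpose, and save steps on toy data, assuming the sample identifiers are worth keeping in the file. It differs from the cell above in two ways that are choices of this sketch, not the notebook's method: it checks that both tables share the same rows before `cbind`, and it writes with `row.names = TRUE`. All table, probe, and sample names are invented.

```r
# Toy labelled tables: rows are probes plus "Type", columns are samples.
toy_luad <- data.frame(s1 = c(0.1, 0.8, 0), s2 = c(0.2, 0.7, 0),
                       row.names = c("probe_1", "probe_2", "Type"))
toy_lusc <- data.frame(s3 = c(0.9, 0.3, 1), s4 = c(0.6, 0.4, 1),
                       row.names = c("probe_1", "probe_2", "Type"))

# Bind only after confirming both tables list the same rows in the same order.
stopifnot(identical(rownames(toy_luad), rownames(toy_lusc)))
toy_transposed <- t(cbind(toy_luad, toy_lusc))

# Write with sample identifiers kept, then read back to check the layout.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_transposed, out_file, row.names = TRUE)
round_trip <- read.csv(out_file, row.names = 1)
print(round_trip)
```

Reading the file back with `row.names = 1` restores one row per sample, with the probes and the `Type` label as columns.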