This document records the code used to build a Synapse dataset in the Biobank Synapse workspace from a subset of the processed files in the NF-OSI-Processed folder.

I need all MAF files in the exomeseq somatic variant calls folder and want to store them as a versioned dataset that I can come back to again and again.


Log into Synapse

In [None]:
import synapseclient
syn = synapseclient.Synapse()
syn.login()

Import the required packages

In [44]:
from synapseclient import Activity
from synapseclient import Entity, Project, Folder, File, Link
from synapseclient import Evaluation, Submission, SubmissionStatus
from synapseclient import Wiki
import synapseutils
from re import search
import pandas as pd

Take a look at the items in the folder

In [6]:
# getChildren returns a lazy generator; wrap it in list() so the items can be inspected.
children = list(syn.getChildren(parent="syn34678050"))
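
Because `syn.getChildren` yields its results lazily, the cell above shows nothing by itself. A minimal sketch of one way to tabulate the results follows. It assumes each child is a dict with `name`, `id` and `type` keys; the helper name `summarize_children` and the sample records are hypothetical and only stand in for real Synapse output.

```python
def summarize_children(children):
    """Return (name, id, short type) tuples sorted by name.

    Assumption: each child is a dict with 'name', 'id' and 'type' keys.
    """
    rows = []
    for child in children:
        # 'type' is assumed to be a dotted type string; keep only the last part.
        short_type = child.get("type", "").rsplit(".", 1)[-1]
        rows.append((child["name"], child["id"], short_type))
    return sorted(rows)


# Hypothetical records standing in for syn.getChildren(parent=...)
example_children = [
    {"name": "sample_b.maf", "id": "syn0000002",
     "type": "org.sagebionetworks.repo.model.FileEntity"},
    {"name": "sample_a.vcf.gz", "id": "syn0000001",
     "type": "org.sagebionetworks.repo.model.FileEntity"},
]

summary = summarize_children(example_children)
for name, syn_id, kind in summary:
    print(f"{syn_id}\t{kind}\t{name}")
```

With a live session, the same helper could be called on the `children` variable from the cell above.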

There is already a dataset in the project for the somatic variants. Take a look at the items in that dataset.

In [86]:
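dataset = syn.get("syn34678050")

# Collect one row per dataset item: the file name and its Synapse ID.
# (DataFrame.append was removed in pandas 2.0, so build the rows first.)
rows = []
for item in dataset.datasetItems:
    f_id = item.get('entityId')
    # downloadFile=False fetches only the metadata, not the file itself.
    filename = syn.get(f_id, downloadFile=False).get('name')
    rows.append({"files": filename, "synid": f_id})
df2 = pd.DataFrame(rows, columns=["files", "synid"])

# Keep only the filtered, gzipped VCF files.
selected_files = df2[df2["files"].str.contains('_filtered') & df2["files"].str.endswith('.vcf.gz')]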

Looks like the dataset only had VCF files, no MAF files.
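
One way to confirm this is to count the file names by suffix. The sketch below uses plain Python; the helper `count_suffixes` and the example file names are hypothetical, and in the notebook the input would be the `files` column of `df2`.

```python
def count_suffixes(filenames, suffixes=(".vcf.gz", ".maf")):
    """Count how many file names end with each suffix; the rest go to 'other'."""
    counts = {suffix: 0 for suffix in suffixes}
    counts["other"] = 0
    for name in filenames:
        for suffix in suffixes:
            if name.endswith(suffix):
                counts[suffix] += 1
                break
        else:
            # No suffix matched this file name.
            counts["other"] += 1
    return counts


# Hypothetical file names standing in for df2["files"]
example_names = [
    "patient1_filtered.vcf.gz",
    "patient2_filtered.vcf.gz",
    "patient2_filtered.vcf.gz.tbi",
]
suffix_counts = count_suffixes(example_names)
print(suffix_counts)
```

A count of zero for `.maf` would match the observation above.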

So we take a step back and make a manifest of all the files in the original processed file folder.

In [163]:
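# List every file under the processed-file folder without downloading the contents.
# This call must run at least once, otherwise `files` is undefined below.
files = synapseutils.sync.syncFromSynapse(syn, "syn27650634", downloadFile=False)
files_df = pd.DataFrame(files)
files_df = files_df[["id", "name", "versionNumber"]]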

Then we select only the MAF files, keeping their Synapse IDs and version numbers.

We will make a new dataset for these MAF files. Both the Synapse ID and the version number of each file are needed to build the dataset.

In [164]:
selected_files_df = files_df[files_df["name"].str.endswith('.maf')]

In [173]:
# Dataset items are dicts keyed by entityId and versionNumber.
items = selected_files_df[["id", "versionNumber"]].rename(columns={"id": "entityId"})
items = items.to_dict('records')
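
The selection and renaming steps above can also be sketched without pandas, which makes the expected shape of each dataset item explicit. The helper `build_dataset_items` and the sample records are hypothetical; the only assumption carried over from the notebook is that each record has `id`, `name` and `versionNumber` fields.

```python
def build_dataset_items(records, suffix=".maf"):
    """Return dataset items for every record whose name ends with `suffix`.

    Each item pins a specific file version as
    {'entityId': <Synapse ID>, 'versionNumber': <int>}.
    """
    items = []
    for record in records:
        if record["name"].endswith(suffix):
            items.append({
                "entityId": record["id"],
                # Cast to a plain Python int so the item is a simple dict of built-ins.
                "versionNumber": int(record["versionNumber"]),
            })
    return items


# Hypothetical records standing in for rows of files_df
example_records = [
    {"id": "syn0000001", "name": "patient1.maf", "versionNumber": 2},
    {"id": "syn0000002", "name": "patient1_filtered.vcf.gz", "versionNumber": 1},
    {"id": "syn0000003", "name": "patient2.maf", "versionNumber": 1},
]
example_items = build_dataset_items(example_records)
print(example_items)
```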


Now we create the new dataset for the MAF files. Once the items are added, we use the Synapse web UI to publish a static version of the dataset for future reference.

In [175]:
dataset = synapseclient.Dataset(
    name="Somatic_SNV_Maf",
    parent="syn4939902",
    dataset_items=items,
)
dataset = syn.store(dataset)

# Individual items can also be added one at a time, for example:
# dataset.add_item({'entityId': "syn111", 'versionNumber': 1})

We can then use this dataset to download the important files that we will use for future analysis.

In [177]:

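# Use syn_id rather than id to avoid shadowing the Python built-in.
download_dir = "/Users/jineta/git/gitrepo/biobank-release-2/data/somatic_snv_batch1_2_3"
for syn_id in selected_files_df["id"]:
    syn.get(syn_id, downloadLocation=download_dir)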
