NetFlow3D is a computational tool that aims to map how somatic mutations act across scales in cancer. If you find NetFlow3D helpful, please cite https://doi.org/10.1101/2023.03.06.531441. You can also upload your data to our web server (http://netflow3d.yulab.org) and run NetFlow3D there.
NetFlow3D requires the Python Standard Library and the following packages:
- scipy (version 1.9.3)
- numpy (version 1.23.5)
- networkx (version 2.8.8)
- pandas (version 1.5.2)
- statsmodels (version 0.13.5)
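
To compare an existing environment against the versions listed above, a small check like the following can help. It is only a sketch: the function name `find_mismatches` is made up for illustration, and whether other versions of these packages also work is not stated here, so a mismatch is a hint rather than an error.

```python
from importlib import metadata

# Versions copied from the package list above.
REQUIRED = {
    "scipy": "1.9.3",
    "numpy": "1.23.5",
    "networkx": "2.8.8",
    "pandas": "1.5.2",
    "statsmodels": "0.13.5",
}


def find_mismatches(required=REQUIRED):
    """Return {package: installed version or None} for every package whose
    installed version differs from the listed one (None = not installed)."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches().items():
        print(f"{name}: listed {REQUIRED[name]}, found {installed or 'not installed'}")
```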
git clone https://github.com/hyulab/NetFlow3D.git
cd NetFlow3D
python NetFlow3D.py -h
Your command should be in the following format (the contents in [] are optional):
python NetFlow3D.py -m <input_maf> -I <job_name> [-X <expressed_genes>] [-n <binary_interactome>] [-o <output_path>] [-t <threads>]
- -m <input_maf>: replace <input_maf> with the path to your MAF file.
- -I <job_name>: replace <job_name> with a name of your choice for the current job.
- -X <expressed_genes>: replace <expressed_genes> with the path to your file which stores a complete list of expressed genes/proteins (see Optional input for how to generate the file). If not specified, all genes/proteins will be considered expressed.
- -n <binary_interactome>: replace <binary_interactome> with the path to your file which stores a complete list of existing protein-protein interactions (see Optional input for how to generate the file). If not specified, NetFlow3D will use the high quality binary interactome of Homo sapiens curated by HINT (http://hint.yulab.org/).
- -o <output_path>: replace <output_path> with a directory where the output files will be stored. If not specified, the output files will be stored in ./output/.
- -t <threads>: replace <threads> with a positive integer. This argument specifies the number of threads to use. If not specified, NetFlow3D will use 5 threads.
We provide example input files in ./example/input/. Here is an example command (please run it to check that NetFlow3D is working properly; it takes about 1 minute):
python NetFlow3D.py -m example/input/mutations.maf -I test -X example/input/expressed_genes.txt -n example/input/interactome.txt -t 10
If you run the above command, the output should be found in ./output/, including test_signatures.txt, test_subnetworks_intercept1.0_lowres_edgeweightTrue.txt, and a folder test/. To get an idea of what the output files should look like, please see the example output files in ./example/output/.
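
For scripted or batch runs, the command format described above can be assembled from Python. The following is a minimal sketch, not part of NetFlow3D: `build_command` and `run_netflow3d` are hypothetical helper names, and the sketch assumes it is run from the cloned NetFlow3D directory so that `NetFlow3D.py` resolves. Only the flags documented above are used.

```python
import subprocess
import sys


def build_command(maf, job_name, expressed_genes=None, interactome=None,
                  output_path=None, threads=None, script="NetFlow3D.py"):
    """Assemble the argument list for one NetFlow3D run.

    Optional arguments left as None are omitted, so NetFlow3D falls back
    to its documented defaults.
    """
    if threads is not None and (not isinstance(threads, int) or threads < 1):
        raise ValueError("threads must be a positive integer")
    cmd = [sys.executable, script, "-m", maf, "-I", job_name]
    optional = [("-X", expressed_genes), ("-n", interactome),
                ("-o", output_path), ("-t", threads)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_netflow3d(**kwargs):
    """Launch NetFlow3D and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(**kwargs), check=True)


if __name__ == "__main__":
    # Prints the example command from above without launching it.
    print(" ".join(build_command(
        "example/input/mutations.maf", "test",
        expressed_genes="example/input/expressed_genes.txt",
        interactome="example/input/interactome.txt",
        threads=10,
    )))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.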
- A Mutation Annotation Format (MAF) file (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format)
Required columns:
- Chromosome
- Start_Position
- Variant_Classification
- ENSP
- Transcript_ID
- Gene
- Protein_position
- Tumor_Sample_Barcode
Other columns can also be present in the MAF file but they will not be used.
- A text file containing a complete list of genes/transcripts expressed in the contexts where the mutations occur. One ID per line. Ensembl gene ID and Ensembl transcript ID are accepted. Example:
ENSG00000163166
ENSG00000110422
ENSG00000077312
ENSG00000180660
ENSG00000186635
- A text file containing a complete list of protein-protein interactions existing in the contexts where the mutations occur. One interaction per line. Protein IDs should be separated by tab. Only UniProt ID is accepted. Example:
Q9H4A3 Q9HBL0
Q15654 Q15797
P63279 Q13643
O43236 O43236
P01112 P04049
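
The input formats described above can be checked and produced with a few lines of Python. This is a sketch under stated assumptions: the helper names (`missing_maf_columns`, `write_expressed_genes`, `write_interactome`) are made up for illustration, the MAF file is assumed to be tab-delimited with optional leading comment lines starting with `#`, and where your gene IDs and interactions come from is up to you.

```python
import csv

# Required MAF columns, copied from the list above.
REQUIRED_MAF_COLUMNS = [
    "Chromosome", "Start_Position", "Variant_Classification", "ENSP",
    "Transcript_ID", "Gene", "Protein_position", "Tumor_Sample_Barcode",
]


def missing_maf_columns(maf_path):
    """Return the required columns that are absent from the MAF header."""
    with open(maf_path, newline="") as handle:
        # Assumption: lines starting with '#' are comments before the header.
        data_lines = (line for line in handle if not line.startswith("#"))
        header = next(csv.reader(data_lines, delimiter="\t"), [])
    return [col for col in REQUIRED_MAF_COLUMNS if col not in header]


def write_expressed_genes(ensembl_ids, out_path):
    """Write one Ensembl gene or transcript ID per line."""
    with open(out_path, "w") as handle:
        for ensembl_id in ensembl_ids:
            handle.write(f"{ensembl_id}\n")


def write_interactome(uniprot_pairs, out_path):
    """Write one interaction per line as two tab-separated UniProt IDs."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(uniprot_pairs)
```

For example, `write_interactome([("Q9H4A3", "Q9HBL0"), ("O43236", "O43236")], "interactome.txt")` writes the first and fourth interactions from the example above, and an empty list from `missing_maf_columns("mutations.maf")` means every required column is present.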
NetFlow3D will output the following files and a folder. {job_name} will be replaced by the job name you specified. If you run the example command, {job_name} will be replaced by test.
- {job_name}_signatures.txt
  This is a tab-separated file containing the significant 3D clusters and LoF enrichment signals identified by NetFlow3D. The first line is a header. The following columns are present:
  - Signature_ID
  - Type
  - Uniprots
  - Canonical_isoform
  - Structure_source ([NA] means not applicable)
  - Mutation_frequency
    The content format in this column depends on the content in "Type":
    - If the content in "Type" is "LoF_IntraProtein_Enriched", the format of this column is {UniProt ID}:{number of LoF mutations in all samples}
    - Otherwise, the format of this column is {residue1}:{number of mutated samples},{residue2}:{number of mutated samples},...
  - LoF_enrichment ([NA] means not applicable)
  - Raw_pvalue
  - Adjusted_pvalue
- {job_name}_subnetworks_intercept1.0_lowres_edgeweightTrue.txt
  This is a tab-separated file containing the interconnected modules identified by NetFlow3D. Two columns are present:
  - Subnetwork_UniProts
  - Subnetwork_size (i.e., the number of proteins in the interconnected module)
- {job_name}/
  This is a folder containing intermediate files generated by NetFlow3D:
  - initial_state_intercept1.0_lowres_edgeweightTrue.graphml.gz: input to the network propagation model in NetFlow3D.
  - final_state_intercept1.0_lowres_edgeweightTrue.graphml.gz: output from the network propagation model in NetFlow3D.
  - choose_delta_intercept1.0_lowres_edgeweightTrue.txt: δ's from randomized input.
  - ShortVersion_mutation_data.txt: mutation information summarized per residue.
  - PIONEER_inter_pvalue.txt, PDB_intra_pvalue.txt, PDB_inter_pvalue.txt, AlphaFold2_intra_pvalue_pLDDT0.txt: 3D cluster information.
  - All_intra_LoF_pvalue.txt: LoF enrichment information.
  - PDB_graph, AlphaFold2_graph_pLDDT0: residue-residue contact maps.
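
To work with {job_name}_signatures.txt downstream, the Mutation_frequency column can be parsed into counts. The sketch below relies only on the format described above (comma-separated `label:count` items, where the label is a UniProt ID for LoF signatures and a residue otherwise). The function names are hypothetical, and because the exact spelling of residue labels is not specified here, labels are kept as opaque strings.

```python
import csv


def parse_mutation_frequency(value):
    """Turn 'label1:3,label2:5' into {'label1': 3, 'label2': 5}.

    The split is on the last ':' of each item, so labels that themselves
    contain ':' are preserved intact.
    """
    counts = {}
    for item in value.split(","):
        label, _, count = item.strip().rpartition(":")
        counts[label] = int(count)
    return counts


def read_signatures(path):
    """Read the tab-separated signatures file into a list of dicts,
    replacing the Mutation_frequency string with parsed counts."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        row["Mutation_frequency"] = parse_mutation_frequency(row["Mutation_frequency"])
    return rows
```

With the example run, `read_signatures("output/test_signatures.txt")` would return one dict per signature, keyed by the header names of the file.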