Funseq2 output vcf parsing errors #1

Closed
lybird300 opened this issue Apr 9, 2018 · 2 comments

Comments

@lybird300

lybird300 commented Apr 9, 2018

Hi there, first of all thanks for sharing the code; I'm looking forward to your paper when it comes out. This is not an issue with your tool, but I would really appreciate your insight since you have been working with Funseq2 for so long (I guess). I was able to get funseq2.1.6 (downloaded from the official website) to run on my data and generate the Output.vcf file, but reading the file with VariantAnnotation::scanVcf (invoked by VariantAnnotation::readVcf) gives me Error: scanVcf: invalid split pattern ',(?=(ID|Number|Type)=[[:alnum:]])|,(?=Description=".?")'. I googled but could not find any solution except for this thread. I'm wondering whether you have ever encountered the same problem, or whether there is any way to bypass the issue and still make your tool work. Thank you for your time!

Below are the first several lines of my Output.vcf file:

##fileformat=VCFv4.0
##INFO=<ID=OTHER,Number=.,Type=String, Description = "Other Information From Original File">
##INFO=<ID=SAMPLE,Number=.,Type=String,Description="Sample id">
##INFO=<ID=CDS,Number=.,Type=String,Description="Coding Variants or not">
##INFO=<ID=VA,Number=.,Type=String,Description="Coding Variant Annotation">
##INFO=<ID=HUB,Number=.,Type=String,Description="Network Hubs, PPI (protein protein interaction network), REG (regulatory network), PHOS (phosphorylation network)...">
##INFO=<ID=GNEG,Number=.,Type=String,Description="Gene Under Negative Selection">
##INFO=<ID=GERP,Number=.,Type=String,Description="Gerp Score">
##INFO=<ID=NCENC,Number=.,Type=String,Description="NonCoding ENCODE Annotation">
##INFO=<ID=HOT,Number=.,Type=String,Description="Highly Occupied Target Region">
##INFO=<ID=MOTIFBR,Number=.,Type=String,Description="Motif Breaking">
##INFO=<ID=MOTIFG,Number=.,Type=String,Description="Motif Gain">
##INFO=<ID=SEN,Number=.,Type=String,Description="In Sensitive Region">
##INFO=<ID=USEN,Number=.,Type=String,Description="In Ultra-Sensitive Region">
##INFO=<ID=UCONS,Number=.,Type=String,Description="In Ultra-Conserved Region">
##INFO=<ID=GENE,Number=.,Type=String,Description="Target Gene (For coding - directly affected genes ; For non-coding - promoter, enhancer regulatory module)">
##INFO=<ID=CANG,Number=.,Type=String,Description="Prior Gene Information, e.g.[cancer][TF_regulating_known_cancer_gene][up_regulated][actionable]...";
##INFO=<ID=CDSS,Number=.,Type=String,Description="Coding Score">
##INFO=<ID=NCDS,Number=.,Type=String,Description="NonCoding Score">
##INFO=<ID=RECUR,Number=.,Type=String,Description="Recurrent elements / variants">
##INFO=<ID=DBRECUR,Number=.,Type=String,Description="Recurrence database">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr13 100252870 . T A . . SAMPLE=T100;GERP=-4.14;CDS=No;HUB=CLYBL:PPI(0.361),TM9SF2:PPI(0.236)REG(0.409);NCENC=DHS(MCV-7|chr13:100252825-100252975),Enhancer(Roadmap_stringent|chr13:100252624-100252923),Enhancer(chmm/segway|chr13:100252783-100253000),TFM(CTCF_DHS|SP1_disc3|chr13:100252864-100252879),TFP(CTCF|chr13:100252207-100252907),TFP(CTCF|chr13:100252476-100252894),TFP(CTCF|chr13:100252487-100252873),TFP(CTCF|chr13:100252497-100252927),TFP(CTCF|chr13:100252503-100252891),TFP(CTCF|chr13:100252503-100252930),TFP(CTCF|chr13:100252505-100252888),TFP(CTCF|chr13:100252514-100253036),TFP(CTCF|chr13:100252515-100252912),TFP(CTCF|chr13:100252515-100252924),TFP(CTCF|chr13:100252516-100252910),TFP(CTCF|chr13:100252517-100252885),TFP(CTCF|chr13:100252519-100252971),TFP(CTCF|chr13:100252524-100252882),TFP(CTCF|chr13:100252527-100252884),TFP(CTCF|chr13:100252531-100252949),TFP(CTCF|chr13:100252532-100252884),TFP(CTCF|chr13:100252533-100252888),TFP(CTCF|chr13:100252535-100252892),TFP(CTCF|chr13:100252539-100252884),TFP(CTCF|chr13:100252541-100252876),TFP(CTCF|chr13:100252548-100252894),TFP(CTCF|chr13:100252564-100252892),TFP(CTCF|chr13:100252575-100252873),TFP(CTCF|chr13:100252587-100252877),TFP(CTCF|chr13:100252589-100252892),TFP(RAD21|chr13:100252520-100252938),TFP(RAD21|chr13:100252524-100252936),TFP(RAD21|chr13:100252529-100252933),TFP(RAD21|chr13:100252531-100252912),TFP(RAD21|chr13:100252537-100252892),TFP(RAD21|chr13:100252537-100252897);HOT=H1hesc;MOTIFBR=CTCF_DHS#SP1_disc3#100252864#100252879#-#10#0.131148#0.245902;MOTIFG=MZF1_3#100252864#100252870#-#1#6.892#5.961;SEN=Yes;GENE=CLYBL(Distal)TM9SF2(Distal);NCDS=3.854994663534:4.854994663534;RECUR=TFP(CTCF|chr13:100252207-100252907):T100&T236,TFP(CTCF|chr13:100252476-100252894):T100&T236
chr4 60720 . T C . . SAMPLE=T100;GERP=0.69;CDS=No;HUB=ZNF595:REG(0.987);NCENC=TFM(MAFF_MAFK|CEBPG_1|chr4:60714-60727),TFP(CHD2|chr4:50497-61009),TFP(EP300|chr4:57632-61054),TFP(FAM48A|chr4:58427-60880),TFP(FAM48A|chr4:59515-61055),TFP(FOS|chr4:55158-60924),TFP(IRF3|chr4:60169-61169),TFP(JUND|chr4:50500-62894),TFP(KAT2A|chr4:59928-60898),TFP(MAFF|chr4:60004-61056),TFP(MAFK|chr4:50489-63382),TFP(MAFK|chr4:59870-61092),TFP(MAFK|chr4:60116-61036),TFP(MXI1|chr4:50616-61014),TFP(SIN3A|chr4:52691-62808),TFP(STAT1|chr4:60005-61046),TFP(ZZZ3|chr4:59671-60953);HOT=K562;MOTIFBR=MAFF_MAFK#CEBPG_1#60714#60727#+#6#0.000000#0.714286;SEN=Yes;USEN=Yes;GENE=ZNF595(Intron);NCDS=3.7416745628426:4.7416745628426;RECUR=TFP(CHD2|chr4:50497-61009):T100&T104&T106&T107&T11&T115&T153&T157&T167&T252&T256&T258&T264&T267&T272&T275&T284&T294&T297&T300&T314&T318&T59&T70&T80&T92,TFP(EP300|chr4:57632-61054):T100&T104&T106&T107&T264&T272&T275&T333&T59&T80&T92,TFP(FAM48A|chr4:58427-60880):T100&T264&T272&T59&T80&T92,TFP(FAM48A|chr4:59515-61055):T100&T106&T264&T272&T333&T59&T80,TFP(FOS|chr4:55158-60924):T100&T104&T106&T107&T115&T256&T264&T272&T275&T294&T297&T318&T59&T80&T92,TFP(IRF3|chr4:60169-61169):T100&T106&T264&T272&T333&T80,TFP(JUND|chr4:50500-62894):T100&T104&T106&T107&T108&T11&T115&T153&T157&T167&T252&T256&T258&T264&T267&T272&T275&T284&T294&T297&T300&T314&T318&T333&T52&T59&T70&T80&T92,TFP(KAT2A|chr4:59928-60898):T100&T264&T272&T80,TFP(MAFF|chr4:60004-61056):T100&T106&T264&T272&T333&T80,TFP(MAFK|chr4:50489-63382):T100&T104&T106&T107&T108&T11&T115&T153&T157&T167&T252&T256&T258&T264&T267&T272&T275&T284&T294&T297&T300&T314&T318&T333&T52&T59&T70&T80&T92,TFP(MAFK|chr4:59870-61092):T100&T106&T264&T272&T333&T80,TFP(MAFK|chr4:60116-61036):T100&T106&T264&T272&T80,TFP(MXI1|chr4:50616-61014):T100&T104&T106&T107&T11&T115&T153&T157&T167&T252&T256&T258&T264&T267&T272&T275&T284&T294&T297&T300&T314&T318&T59&T70&T80&T92,TFP(SIN3A|chr4:52691-62808):T100&T104&T106&T107&T108&T11&T115&T153&T252&T256&T258&T264&T272&T275
&T284&T294&T297&T314&T318&T333&T59&T80&T92,TFP(STAT1|chr4:60005-61046):T100&T106&T264&T272&T333&T80,TFP(ZZZ3|chr4:59671-60953):T100&T264&T272&T59&T80;DBRECUR=TFP(CHD2|chr4:50497-61009):Lung_Adeno(Altered in 17/24(70.83%) samples.)|Pancreas(Altered in 2/15(13.33%) samples.)|Prostate(Altered in 5/64(7.81%) samples.),TFP(EP300|chr4:57632-61054):Lung_Adeno(Altered in 10/24(41.67%) samples.),TFP(FAM48A|chr4:58427-60880):Lung_Adeno(Altered in 7/24(29.17%) samples.),TFP(FAM48A|chr4:59515-61055):Lung_Adeno(Altered in 6/24(25.00%) samples.),TFP(FOS|chr4:55158-60924):Lung_Adeno(Altered in 14/24(58.33%) samples.)|Prostate(Altered in 2/64(3.12%) samples.),TFP(JUND|chr4:50500-62894):Lung_Adeno(Altered in 17/24(70.83%) samples.)|Pancreas(Altered in 2/15(13.33%) samples.)|Prostate(Altered in 5/64(7.81%) samples.),TFP(KAT2A|chr4:59928-60898):Lung_Adeno(Altered in 3/24(12.50%) samples.),TFP(MAFF|chr4:60004-61056):Lung_Adeno(Altered in 3/24(12.50%) samples.),TFP(MAFK|chr4:50489-63382):Lung_Adeno(Altered in 17/24(70.83%) samples.)|Pancreas(Altered in 2/15(13.33%) samples.)|Prostate(Altered in 5/64(7.81%) samples.),TFP(MAFK|chr4:59870-61092):Lung_Adeno(Altered in 3/24(12.50%) samples.),TFP(MXI1|chr4:50616-61014):Lung_Adeno(Altered in 17/24(70.83%) samples.)|Pancreas(Altered in 2/15(13.33%) samples.)|Prostate(Altered in 5/64(7.81%) samples.),TFP(SIN3A|chr4:52691-62808):Lung_Adeno(Altered in 17/24(70.83%) samples.)|Prostate(Altered in 4/64(6.25%) samples.),TFP(STAT1|chr4:60005-61046):Lung_Adeno(Altered in 3/24(12.50%) samples.),TFP(ZZZ3|chr4:59671-60953):Lung_Adeno(Altered in 4/24(16.67%) samples.)

@mil2041
Owner

mil2041 commented Apr 9, 2018

Hi,

It seems VariantAnnotation::scanVcf cannot correctly parse the VCF output file from FunSeq2.
Because I have not encountered this error message in the VariantAnnotation package before, I am not sure which line of your Output.vcf generates this error.

You will need to figure out which line in Output.vcf cannot be recognized by the VariantAnnotation parser. Alternatively, writing a custom VCF file parser is another possible solution.
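As a rough idea of what a custom parser could look like, here is a minimal Python sketch. It is not part of FunSeq2 or VariantAnnotation; the function names `parse_info` and `read_vcf_records` are made up for illustration. It assumes the file has the eight fixed VCF columns shown in the header above, that INFO entries are separated by ";", and that INFO values themselves contain no ";". It skips all header lines, so a malformed `##INFO` line would not stop it.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> raw value mapping."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item:
            continue
        # Split on the first "=" only; entries without "=" get an empty value.
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def read_vcf_records(lines: Iterable[str]) -> Iterator[Tuple[List[str], Dict[str, str]]]:
    """Yield (first seven columns, parsed INFO) for each record, skipping headers."""
    for line in lines:
        line = line.rstrip("\r\n")  # tolerate both "\n" and "\r\n" endings
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) < 8:
            # Fallback for copies where tabs were turned into spaces:
            # split at most 7 times so spaces inside INFO are preserved.
            columns = line.split(None, 7)
        if len(columns) < 8:
            raise ValueError("expected at least 8 columns: " + line[:60])
        yield columns[:7], parse_info(columns[7])
```

For example, `for fixed, info in read_vcf_records(open("Output.vcf")): print(fixed[0], fixed[1], info.get("NCDS"))` would print the chromosome, position, and raw noncoding score string for each record. The values are left as raw strings; splitting entries such as `NCENC` further would need FunSeq2-specific rules that are not covered here.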

There are a couple of things you can check. (1) Does every line in your file end with a proper "\n" newline? A Mac may insert "^M" (carriage return) characters into the file, which can cause the parser to fail. (2) If you randomly pick several lines from Output.vcf and VariantAnnotation::scanVcf no longer reports the error, you can be more confident that only some lines in Output.vcf trigger it.
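Both checks can be scripted. The Python sketch below is only one way to do it, and the helper names `find_carriage_returns` and `write_subset` are hypothetical. The first function reports which lines contain a carriage return (what editors display as "^M"); the second copies every header line plus a random sample of records into a new file, which can then be passed to VariantAnnotation::readVcf to see whether the error persists.

```python
import random
from typing import List


def find_carriage_returns(path: str) -> List[int]:
    """Return 1-based numbers of lines that contain a carriage return."""
    hits = []
    with open(path, "rb") as handle:  # binary mode so "\r" is not translated
        for number, raw in enumerate(handle, start=1):
            if b"\r" in raw:
                hits.append(number)
    return hits


def write_subset(src: str, dst: str, n_records: int, seed: int = 0) -> int:
    """Copy all header lines plus a random sample of records from src to dst."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src, "r", newline="") as handle:
        lines = handle.readlines()
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]
    rng = random.Random(seed)  # fixed seed so a failing subset can be reproduced
    count = min(n_records, len(records))
    picked = sorted(rng.sample(range(len(records)), count))  # keep file order
    with open(dst, "w", newline="") as out:
        out.writelines(header)
        out.writelines(records[i] for i in picked)
    return count
```

If the error disappears for some subsets and returns for others, repeating `write_subset` with different seeds or smaller samples narrows down the offending record. If it appears even with zero records, the header lines are the more likely culprit.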

@lybird300
Author

Hi Eric, thank you for your quick reply. I reinstalled all related R packages and the readVcf function works now! Still no clue what went wrong but I can live with that. Thanks again and good luck with everything!
