Tag annotations ignored when multiple are present #2

Closed
xrobin opened this issue Oct 30, 2015 · 1 comment
xrobin commented Oct 30, 2015

The following line from the Ensembl Homo_sapiens.GRCh37.75.gtf is parsed incorrectly:

1   protein_coding  exon    860260  860328  .   +   .   gene_id "ENSG00000187634"; transcript_id "ENST00000420190"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-011"; transcript_source "havana"; exon_id "ENSE00001637883"; tag "cds_end_NF"; tag "mRNA_end_NF";

As you can see, the line contains two 'tag' attributes. Only the second one is present in the DataFrame returned by read_gtf_as_dataframe:

  seqname          source feature   start     end  score strand frame  \
0       1  protein_coding    exon  860260  860328    NaN      +     .

           gene_id    transcript_id exon_number gene_name     gene_source  \
0  ENSG00000187634  ENST00000420190           1    SAMD11  ensembl_havana

     gene_biotype transcript_name transcript_source          exon_id  \
0  protein_coding      SAMD11-011            havana  ENSE00001637883

           tag
0  mRNA_end_NF

The cds_end_NF tag is lost. Ideally both tags should be presented in a list, but I'm not sure if that's possible with pandas.
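
For illustration, here is a minimal sketch of how repeated keys could be collected instead of overwritten. It is not the library's actual parser: `parse_attributes` and `ATTRIBUTE_PATTERN` are hypothetical names, and the regular expression assumes every value is double-quoted and terminated by a semicolon, as in the line above.

```python
import re

# Hypothetical helper for illustration; not part of the library in this issue.
# Assumes each attribute looks like: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(attribute_field):
    """Map each attribute key to the list of all values seen for it."""
    values_by_key = {}
    for key, value in ATTRIBUTE_PATTERN.findall(attribute_field):
        # Append rather than assign, so a repeated key keeps every value.
        values_by_key.setdefault(key, []).append(value)
    return values_by_key


attributes = parse_attributes(
    'gene_id "ENSG00000187634"; exon_id "ENSE00001637883"; '
    'tag "cds_end_NF"; tag "mRNA_end_NF";'
)
print(attributes["tag"])  # ['cds_end_NF', 'mRNA_end_NF']
```

With plain assignment (`values_by_key[key] = value`) the second 'tag' would replace the first, which matches the behaviour reported here.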

iskandr (Contributor) commented Feb 19, 2018

Sorry for the very slow response and not sure if this is still relevant to you. I think this PR is trying to address the same issue: #6

In that PR, the solution is to concatenate the multiple values into a comma-separated string. I think collecting a list makes more sense.
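
To make the two options concrete, here is a rough sketch of both representations side by side in a pandas DataFrame. The column names `tag_joined` and `tag_list` are made up for this example and are not what the library or the PR produces. It also shows that an object-dtype column can hold Python lists.

```python
import pandas as pd

# Values taken from the example line in this issue.
tags = ["cds_end_NF", "mRNA_end_NF"]

df = pd.DataFrame({
    "exon_id": ["ENSE00001637883"],
    "tag_joined": [",".join(tags)],  # option 1: comma-separated string
    "tag_list": [tags],              # option 2: list stored in an object column
})

# The joined string has to be split again before a membership test.
has_cds_joined = df["tag_joined"].str.split(",").apply(
    lambda values: "cds_end_NF" in values
)

# The list column can be tested directly.
has_cds_list = df["tag_list"].apply(lambda values: "cds_end_NF" in values)

print(df["tag_joined"].iloc[0])  # cds_end_NF,mRNA_end_NF
print(df["tag_list"].iloc[0])    # ['cds_end_NF', 'mRNA_end_NF']
```

The joined string stays a simple scalar that is easy to write to CSV, but it becomes ambiguous if a value ever contains a comma. The list keeps values separate, at the cost of an object column that some pandas operations handle less conveniently.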
