How to prepare the gtf? #45

Closed
ld9866 opened this issue Mar 4, 2023 · 7 comments

Comments

@ld9866

ld9866 commented Mar 4, 2023

Hello developers!
I noticed that the sample folder (vgrna-project-paper/originals/data/transcripts/gencode29), for which we downloaded gencode.v29.primary_assembly.annotation.gff3.gz and gencode.v29.primary_assembly.annotation.gtf.gz, contains four scripts: exons.sh, gene_transcripts.sh, preprocess.sh and subsample_transcripts.sh. I do not know how to run subsample_transcripts.sh; when we run it with bash, nothing is produced in our folder. Also, in what order should the four scripts be executed?

@ld9866
Author

ld9866 commented Mar 5, 2023

When we run the code, it prints "(standard_in) 1: syntax error" and we do not know how to solve the problem.

grep -P "\ttranscript\t" gencode.v29.primary_assembly.annotation_renamed_full.gtf | cut -f9 | cut -d ';' -f2 | cut -d '"' -f2 | uniq | shuf | head -n $(echo "172449 * ${1} / 100" | bc | cut -d '.' -f1) > transcripts_subset${1}.txt

@jonassibbesen
Owner

Hi,

Are you interested in replicating the benchmark from the manuscript or do you have your own data that you want to run the pipeline on?

The scripts you are referring to are not really part of the standard pipeline. They were used specifically for the data and benchmarking presented in the manuscript:

  • preprocess.sh renames the contigs in the annotation (column 1) to match the genome used (removes chr prefix) and filters transcripts that are not full length.
  • exons.sh creates a BED file of exon coordinates from the annotation.
  • gene_transcripts.sh creates a table of gene and transcript names.
  • subsample_transcripts.sh creates a subset of the annotation by removing transcripts. Takes as input the percentage that should be kept.
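For readers who only need the general idea of the first two steps, here is a minimal sketch of that kind of transformation. It is not the repository's preprocess.sh or exons.sh, which may do more (for example the full-length filtering mentioned above); the function names `strip_chr_prefix` and `exons_to_bed` and the BED column choices are assumptions made for illustration.

```shell
# Hypothetical stand-ins, not the scripts from the repository.

strip_chr_prefix() {
    # Remove a leading "chr" from column 1; '#' header lines pass through unchanged.
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { sub(/^chr/, "", $1) } { print }'
}

exons_to_bed() {
    # GTF coordinates are 1-based and inclusive, BED is 0-based and half-open,
    # so only the start coordinate is shifted by one.
    # Output columns: chrom, start, end, name (placeholder), score (placeholder), strand.
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ && $3 == "exon" { print $1, $4 - 1, $5, ".", 0, $7 }'
}

# Example on a single made-up exon record.
printf 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "X";\n' \
    | strip_chr_prefix | exons_to_bed
# prints the tab-separated fields: 1 11868 12227 . 0 +
```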

Running these scripts is only needed if you want to exactly replicate the benchmark that was done in the manuscript. If you have your own data and just want to run the pipeline to get expression estimates, these scripts are not really needed.
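A note on the "(standard_in) 1: syntax error" message reported earlier in the thread: that is what bc prints when it receives a malformed expression. One plausible cause, given that the one-liner reads the percentage from `${1}`, is that the script was run without an argument, so bc received `172449 *  / 100`. The sketch below shows the same arithmetic with an argument check. The function name `keep_count` is hypothetical and not part of the repository; the total of 172449 is taken from the quoted one-liner.

```shell
# Hypothetical helper: number of transcripts to keep for a given percentage.
keep_count() {
    total=$1
    percent=$2
    # Reject an empty, non-numeric or zero-padded percentage. An empty value
    # is what turns the bc expression in the one-liner into a syntax error.
    case "$percent" in
        ''|*[!0-9]*|0?*)
            echo "usage: keep_count <total> <percent 0-100>" >&2
            return 1 ;;
    esac
    if [ "$percent" -gt 100 ]; then
        echo "percent must be between 0 and 100" >&2
        return 1
    fi
    # Integer division, matching bc's default scale of 0.
    echo $(( total * percent / 100 ))
}

keep_count 172449 10   # prints 17244
```

Under that assumption, running the script with the percentage as its first argument, for example `bash subsample_transcripts.sh 10`, should avoid the error.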

@ld9866
Author

ld9866 commented Mar 6, 2023

Hello developers!
Thank you for your patient and timely reply!
I built a pangenome from 15 genomes using minigraph-cactus, and it turned out well. However, I also want to add short-read sequencing data from 500 individuals to build a more complete pangenome, in order to obtain more complete transcriptome information.
Is it necessary to add these individuals to our pan-genome?
Best yours,

@jonassibbesen
Owner

That depends on what you are interested in. If you want to create haplotype-specific transcripts using the haplotypes from those individuals then you would need to add them to the pangenome graph. However, the genotypes need to be phased for that.
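As a quick way to see whether a call set is phased, the sketch below counts phased and unphased genotypes in a VCF. In VCF, a phased genotype separates alleles with `|` and an unphased one with `/`, and GT is the first FORMAT subfield when present. The function name `count_gt_phasing` is hypothetical, and the input is assumed to be an uncompressed VCF on standard input.

```shell
# Hypothetical check: count phased ("|") and unphased ("/") genotype calls.
count_gt_phasing() {
    awk 'BEGIN { FS = "\t" }
         /^#/ { next }                      # skip meta and header lines
         {
             for (i = 10; i <= NF; i++) {   # sample columns start at column 10
                 split($i, f, ":")          # GT is the first colon-separated subfield
                 if (index(f[1], "|") > 0) phased++
                 else if (index(f[1], "/") > 0) unphased++
             }
         }
         END { printf "phased=%d unphased=%d\n", phased, unphased }'
}

# Example on a made-up two-sample record.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0|1:12\t0/1:9\n' \
    | count_gt_phasing
# prints phased=1 unphased=1
```

Haploid calls, which have no separator, are not counted by this sketch.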

@ld9866
Author

ld9866 commented Mar 7, 2023

Thank you for your reply!
Now we have one more little question.
Our genome is about the same size as the human genome. We started vg autoindex several days ago, and it has still not finished. Could you please tell us how to solve this problem? If we split the genome by chromosome and analyze each part separately, we cannot match them up in the transcriptome, which confuses us.
Best yours,

@jonassibbesen
Owner

I am unfortunately not able to answer that. You should ask the question on the vg GitHub.

@ld9866
Author

ld9866 commented Mar 7, 2023

Thank you for your reply
Best wishes

@ld9866 ld9866 closed this as completed Mar 7, 2023