Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

odgi extract - please remove nodes that are not in specified BED range(s) #526

Open
subwaystation opened this issue Aug 24, 2023 · 9 comments

Comments

@subwaystation
Copy link
Member

When extracting a subgraph using a BED file via -b the resulting subgraph contains all BED range(s) specified plus all nodes between these ranges which are not touched by any path!
It would be nice to have an additional flag which allows for the removal of nodes and edges that are not touched by any path.

@subwaystation
Copy link
Member Author

subwaystation commented Aug 25, 2023

This is becoming a serious problem now @AndreaGuarracino. I am extraction some full range paths and want to sort and visualize the graph. However, the nodes and edges which are not visited by any path are disturbing the 1D visualization.

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr6.hprc-v1.0-pggb.gfa.gz
gunzip chr6.hprc-v1.0-pggb.gfa.gz
odgi build -g chr6.hprc-v1.0-pggb.gfa -o chr6.hprc-v1.0-pggb.og -P -t 28
odgi paths -i chr6.hprc-v1.0-pggb.og -f > chr6.hprc-v1.0-pggb.fasta
samtools faidx chr6.hprc-v1.0-pggb.fasta
grep 'chm13#chr6\|grch38#chr6\|HG01928#2\|HG01071#2\|HG01106#2\|HG02622#2' chr6.hprc-v1.0-pggb.fasta.fai | cut -f 1 > 18_paths_to_extract.txt
grep 'chm13#chr6\|grch38#chr6\|HG01928#2\|HG01071#2\|HG01106#2\|HG02622#2' chr6.hprc-v1.0-pggb.fasta.fai | cut -f 1,2 | sed 's/\t/\t0\t/g' > 18_paths_to_extract.bed
odgi extract -i chr6.hprc-v1.0-pggb.og -o chr6.hprc-v1.0-pggb.18paths.og -E -b 18_paths_to_extract.bed -P -p 18_paths_to_extract.txt -t 28
odgi viz -i chr6.hprc-v1.0-pggb.18paths.og -o chr6.hprc-v1.0-pggb.18paths.og.png
bedtools merge -i /mnt/vol1/software/odgi/git/master/test/chr6.HLA_genes.bed -d 10000000 > MHC.bed
echo -e '\tMHC' >> MHC.bed
cat MHC.bed | tr '\n' ' ' | sed 's/ //g' > MHC.final.bed
odgi procbed -i chr6.hprc-v1.0-pggb.18paths.og -b MHC.final.bed > MHC.adj.bed
odgi inject -i chr6.hprc-v1.0-pggb.18paths.og -b MHC.adj.bed -o chr6.hprc-v1.0-pggb.18paths.MHC.og -P -t 28
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.png
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og -r -P -t 28 -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.png -u -d
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og -p Y -t 28 -P
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og.png -u -d
odgi paths -i chr6.hprc-v1.0-pggb.18paths.MHC.og -L | grep "grch" > chr6.grch38.MHC.name
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og -P -t 28 -H chr6.grch38.MHC.name -Y
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og.png -d -u

default viz
chr6 hprc-v1 0-pggb 18paths MHC og
random sort
chr6 hprc-v1 0-pggb 18paths MHC og r og
PG-SGD
chr6 hprc-v1 0-pggb 18paths MHC og r og Y og
RPG-SGD
chr6 hprc-v1 0-pggb 18paths MHC og r og HY og

Could you please fix this or tell me where to change what in the code @AndreaGuarracino. Thanks!

@subwaystation
Copy link
Member Author

subwaystation commented Aug 25, 2023

panacus table -c nodes -a chr6.hprc-v1.0-pggb.18paths.MHC.gfa > chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt
wc -l chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt
4699147
grep -P "\t0" chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt | wc -l
1248659

Around 1/4 of nodes are not visited by any path.

@subwaystation
Copy link
Member Author

panacus table -c edges -a chr6.hprc-v1.0-pggb.18paths.MHC.gfa > chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt
wc -l chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt
6552109
grep -P "\t0" chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt | wc -l
2554251

Around 1/3 of edges are not followed by any path.

@subwaystation
Copy link
Member Author

Maybe this is related to how you fixed this problem in smoothxg @AndreaGuarracino ?

@AndreaGuarracino
Copy link
Member

This is a problem of odgi extract interface that is not super intuitive. You want

        --paths-to-extract=[FILE]         List of paths to keep in the extracted
                                          graph. The FILE must contain one path
                                          name per line and a subset of all
                                          paths can be specified. Paths
                                          specified in the input path ranges
                                          (with -r/--path-range and/or
                                          -b/--bed-file) will be kept in any
                                          case.

so try

odgi extract ... -p <(cut -f 1 18_paths_to_extract.bed | sort | uniq)

@subwaystation
Copy link
Member Author

I already am using -p, please take an exact look at my command above.

odgi extract -i chr6.hprc-v1.0-pggb.og -o chr6.hprc-v1.0-pggb.tomato.og -P -p 18_paths_to_extract.txt -t 28
ls -lisah chr6.hprc-v1.0-pggb.tomato.og
192399413 4.0K -rw-rw-r-- 1 ubuntu ubuntu 60 Aug 28 14:35 chr6.hprc-v1.0-pggb.tomato.og
cat 18_paths_to_extract.txt
chm13#chr6
grch38#chr6
HG01071#2#JAHBCE010000006.1
HG01071#2#JAHBCE010000042.1
HG01071#2#JAHBCE010000076.1
HG01071#2#JAHBCE010000081.1
HG01106#2#JAHAMB010000019.1
HG01106#2#JAHAMB010000032.1
HG01106#2#JAHAMB010000054.1
HG01106#2#JAHAMB010000082.1
HG01928#2#JAGYVP010000017.1
HG01928#2#JAGYVP010000027.1
HG01928#2#JAGYVP010000201.1
HG02622#2#JAHAON010000002.1
HG02622#2#JAHAON010000041.1
HG02622#2#JAHAON010000058.1
HG02622#2#JAHAON010000190.1
HG02622#2#JAHAON010000229.1

As expected, your suggestion only gives me an empty graph.

@AndreaGuarracino
Copy link
Member

Can you also try without -E?

@AndreaGuarracino
Copy link
Member

About the edges not supported by any path, yes, it is related to a recent bug I've fixed in smoothxg (pangenome/smoothxg#194).

@subwaystation
Copy link
Member Author

Leaving out -E does the trick. Thanks. Leaving this open so you can fix the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants