Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-aligned BAM files, and reversion to single end for HiC-Pro use #105

Open
jamesdalg opened this issue Oct 25, 2017 · 26 comments
Open

Pre-aligned BAM files, and reversion to single end for HiC-Pro use #105

jamesdalg opened this issue Oct 25, 2017 · 26 comments

Comments

@jamesdalg
Copy link

Hello!
I'm having difficulty running the program in a stepwise fashion:
After copying two bam files to lscratch, I receive the following error:

"Exit: Error: Directory Hierarchy of rawdata '/lscratch/52514766/' is not correct. Paired '.bam' files with _R1/_R2 are required for 'proc_hic quality_checks merge_persample build_contact_maps ice_norm' step(s)"
I was aware that R1 and R2 fastq files were needed, but I was unaware that R1 and R2 bam files would be needed. What is the best tool to convert a single, aligned bam file to a pair of bamfiles that HiC-Pro will accept?
My syntax is as follows (I am attempting to do all of the steps except for the first, which requires fastq files).
HiC-Pro -i /lscratch/${SLURM_JOBID}/ -o /path/to/ouptut -c config_file.txt -s proc_hic -s quality_checks -s merge_persample -s build_contact_maps -s ice_norm

Thanks,
James D

@nservant
Copy link
Owner

Hi James,
Indeed, I know that I should improve the management of BAM files. It's in the todo list !
Currently, if you do not want to map the data with HiC-pro, it expects to have 2 independent BAM files (R1 and R2 aligned files). Then HiC-pro will merge them and processed the merged BAM.
So in theory, you can easily split your single BAM file, into R1/R2 bam file using samtools.

samtools view -F 0x40 -h SRR400264_00_hg19.bwt2pairs.bam > R1.bam
samtools view -F 0x80 -h SRR400264_00_hg19.bwt2pairs.bam > R2.bam
Let me know if it works
N

@jamesdalg
Copy link
Author

jamesdalg commented Oct 26, 2017

I just tried this method, but it doesn't recognize the format of the resulting sam file. I have removed some of the condition names for privacy purposes of those with whom I work. See the results below:
Run HiC-Pro 2.9.0

Wed Oct 25 18:10:26 EDT 2017
Pairing of R1 and R2 tags ...
/usr/local/Anaconda/envs_app/hicpro/2.9.0/HiC-Pro_2.9.0/scripts/bowtie_pairing.sh -c /config/file/path/config_from_bam.txt >> hicpro.log
Traceback (most recent call last):
File "/usr/local/Anaconda/envs_app/hicpro/2.9.0/HiC-Pro_2.9.0/scripts/mergeSAM.py", line 222, in
with pysam.Samfile(R1file, "rb") as hr1, pysam.Samfile(R2file, "rb") as hr2:
File "pysam/libcalignmentfile.pyx", line 444, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 664, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
make: *** [bowtie_pairing] Error 1
Here is the config file:

@nservant
Copy link
Owner

could you try with BAM files instead of SAM ?

@jamesdalg
Copy link
Author

jamesdalg commented Oct 26, 2017

I did, but I could try making SAM files instead of bam if it is helpful. Here is the syntax I used to create them.
#!/bin/bash
module load hicpro
module load samtools
cd /data/path/
cp -r /data/path/*.bam /lscratch/${SLURM_JOBID}/
mkdir -p /lscratch/${SLURM_JOBID}/Treatment/
mkdir -p /lscratch/${SLURM_JOBID}/Control/

samtools view -f 0x40 /lscratch/${SLURM_JOBID}/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R1.bam
samtools view -f 0x80 /lscratch/${SLURM_JOBID}/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R2.bam

samtools view -f 0x40 /lscratch/${SLURM_JOBID}/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam
samtools view -f 0x80 /lscratch/${SLURM_JOBID}/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam
ls -lrt /lscratch/${SLURM_JOBID}/
HiC-Pro -i /lscratch/${SLURM_JOBID}/ -o /data/path/output_sbatch_multithreaded_lscratch_from_bam -c /data/path/config_from_bam.txt -s proc_hic -s quality_checks -s merge_persample -s build_contact_maps -s ice_norm

@jamesdalg
Copy link
Author

jamesdalg commented Oct 26, 2017

Here is the config file:

Please change the variable settings below if necessary

#########################################################################

Paths and Settings - Do not edit !

#########################################################################

TMP_DIR = tmp
LOGS_DIR = logs
BOWTIE2_OUTPUT_DIR = bowtie_results
MAPC_OUTPUT = hic_results
RAW_DIR = bam_input

#######################################################################

SYSTEM - PBS - Start Editing Here !!

#######################################################################
N_CPU = 32
LOGFILE = hicpro_HiChip-Treatment.log

JOB_NAME = hicpro-HiChip-Control
JOB_MEM = 32G
JOB_WALLTIME = 24:00:00
JOB_QUEUE = quick,ccr,norm
JOB_MAIL = my_email@server.com

#########################################################################

Data

#########################################################################

PAIR1_EXT = _R1
PAIR2_EXT = _R2

#######################################################################

Alignment options

#######################################################################

FORMAT = phred33
MIN_MAPQ = 30

BOWTIE2_IDX_PATH = /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/
BOWTIE2_GLOBAL_OPTIONS = --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 16
BOWTIE2_LOCAL_OPTIONS = --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 16

#######################################################################

Annotation files

#######################################################################

REFERENCE_GENOME = genome
GENOME_SIZE = chrom_hg19.sizes

#######################################################################

Allele specific

#######################################################################

ALLELE_SPECIFIC_SNP =

#######################################################################

Digestion Hi-C

#######################################################################

GENOME_FRAGMENT = /data/path/HindIII_resfrag_hg19.bed
LIGATION_SITE = AAGCTAGCTT
MIN_FRAG_SIZE =
MAX_FRAG_SIZE =
MIN_INSERT_SIZE =
MAX_INSERT_SIZE =

#######################################################################

Hi-C processing

#######################################################################

MIN_CIS_DIST =
GET_ALL_INTERACTION_CLASSES = 1
GET_PROCESS_SAM = 1
RM_SINGLETON = 1
RM_MULTI = 1
RM_DUP = 1

#######################################################################

Contact Maps

#######################################################################

BIN_SIZE = 1000 2500 5000 10000 25000 500000 1000000
MATRIX_FORMAT = upper

#######################################################################

ICE Normalization

#######################################################################
MAX_ITER = 100
FILTER_LOW_COUNT_PERC = 0.02
FILTER_HIGH_COUNT_PERC = 0
EPS = 0.1

@nservant
Copy link
Owner

Could you please show me the content of :
-i /lscratch/${SLURM_JOBID}/
Thanks

@nservant
Copy link
Owner

You have to use the data organisation with one folder per sample.
In addition, R1 and R2 files must contain the PAIR1_EXT keys.
/scratch/${SLURM_JOBID}/
/scratch/${SLURM_JOBID}/sample1
/scratch/${SLURM_JOBID}/sample1/file_R1.bam
/scratch/${SLURM_JOBID}/sample1/file_R2.bam

best

@jamesdalg
Copy link
Author

Here is the directory listing, after performing the asked operations (on the deleted post, I noticed I had done ls prior to the samtools lines, hence the lack of split reads):
bash-4.1$ cd /lscratch/${SLURM_JOBID}
bash-4.1$ find
.
./X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam
./X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam
./Pg
./Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R2.bam
./Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R1.bam
./Neg
./Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam
./Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam

@nservant
Copy link
Owner

ok. Sounds good, but please remove the
./X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam
./X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam
which are not in a folder directory

@jamesdalg
Copy link
Author

done.
SCRIPT FILE:
#!/bin/bash
module load hicpro
module load samtools
cd /data/dalgleishjl/hicpro/fanHiC10112017/
cp -r /data/dalgleishjl/fanHiC/bam/*sorted.bam /lscratch/${SLURM_JOBID}/
ls -l /lscratch/${SLURM_JOBID}
mkdir -p /lscratch/${SLURM_JOBID}/Pg/
mkdir -p /lscratch/${SLURM_JOBID}/Neg/

samtools view -f 0x40 /lscratch/${SLURM_JOBID}/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R1.bam
samtools view -f 0x80 /lscratch/${SLURM_JOBID}/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R2.bam

samtools view -f 0x40 /lscratch/${SLURM_JOBID}/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam
samtools view -f 0x80 /lscratch/${SLURM_JOBID}/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam > /lscratch/${SLURM_JOBID}/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam
cd /lscratch/${SLURM_JOBID}
find
rm /lscratch/${SLURM_JOBID}/*sorted.bam
find
ls -lrt /lscratch/${SLURM_JOBID}/
HiC-Pro -i /lscratch/${SLURM_JOBID}/ -o /data/CCRBioinfo/dalgleishjl/hicpro/output_sbatch_multithreaded_lscratch_from_bamv2 -c /data/CCRBioinfo/dalgleishjl/hicpro/config_fanHiC_from_bam.txt -s proc_hic -s quality_checks -s merge_persample -s build_contact_maps -s ice_norm

OUTPUT:
-bash-4.1$ cat slurm-53091057.out
[+] Loading gcc 4.9.1 ...
[+] Loading GSL 2.2.1 ...
[+] Loading Graphviz v2.38.0 ...
[+] Loading gdal 2.0 ...
[+] Loading proj 4.9.2 ...
[-] Unloading gcc 4.9.1 ...
[+] Loading gcc 4.9.1 ...
[+] Loading openmpi 1.10.0 for GCC 4.9.1
[+] Loading tcl_tk 8.6.3
[+] Loading pandoc 1.15.0.6 ...
[+] Loading Zlib 1.2.8 ...
[+] Loading Bzip2 1.0.6 ...
[+] Loading pcre 8.38 ...
[+] Loading liblzma 5.2.2 ...
[-] Unloading Zlib 1.2.8 ...
[+] Loading Zlib 1.2.8 ...
[-] Unloading liblzma 5.2.2 ...
[+] Loading liblzma 5.2.2 ...
[+] Loading libjpeg-turbo 1.5.1 ...
[+] Loading tiff 4.0.7 ...
[+] Loading curl 7.46.0 ...
[+] Loading boost libraries v1.65 ...
[+] Loading R 3.4.0 on cn0630
[+] Loading samtools 1.5 ...
[+] Loading hicpro 2.9.0
[-] Unloading samtools 1.5 ...
[+] Loading samtools 1.5 ...
total 50021444
-rw-r--r-- 1 dalgleishjl dalgleishjl 21093061259 Nov 2 13:15 X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.dupsmarked.matefixed.namesorted.bam
-rw-r----- 1 dalgleishjl dalgleishjl 14704999205 Nov 2 13:16 X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam
-rw-r----- 1 dalgleishjl dalgleishjl 15423880819 Nov 2 13:17 X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam
.
./Neg
./Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam
./Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam
./X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.dupsmarked.matefixed.namesorted.bam
./X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted.bam
./X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam
./Pg
./Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R1.bam
./Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R2.bam
.
./Neg
./Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam
./Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam
./Pg
./Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R1.bam
./Pg/X_AHJNHNAFXX.1009_NEXTSEQ-2017-04-25.fq.hg19.bwa.sorted_R2.bam
total 8
drwxr-xr-x 2 dalgleishjl dalgleishjl 4096 Nov 2 13:26 Pg
drwxr-xr-x 2 dalgleishjl dalgleishjl 4096 Nov 2 13:44 Neg

Run HiC-Pro 2.9.0

Thu Nov 2 13:54:10 EDT 2017
Pairing of R1 and R2 tags ...
/usr/local/Anaconda/envs_app/hicpro/2.9.0/HiC-Pro_2.9.0/scripts/bowtie_pairing.sh -c /data/CCRBioinfo/dalgleishjl/hicpro/config_fanHiC_from_bam.txt >> hicpro_T47D-HiChip-Pg.log
Traceback (most recent call last):
File "/usr/local/Anaconda/envs_app/hicpro/2.9.0/HiC-Pro_2.9.0/scripts/mergeSAM.py", line 222, in
with pysam.Samfile(R1file, "rb") as hr1, pysam.Samfile(R2file, "rb") as hr2:
File "pysam/libcalignmentfile.pyx", line 444, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 664, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
make: *** [bowtie_pairing] Error 1

@jamesdalg
Copy link
Author

from pysam-developers/pysam#51, it seems that there may be a way to make it work by setting check_sq to false on line 222 of mergeSAM.py.

@nservant
Copy link
Owner

nservant commented Nov 2, 2017

Could you show the mergeSAM.py command line used. It should be written in the hicpro_T47D-HiChip-Pg.log file.
I just want to be sure that the input files are correct.
Thanks

@jamesdalg
Copy link
Author

/usr/local/Anaconda/envs_app/hicpro/2.9.0/bin/python /usr/local/Anaconda/envs_app/hicpro/2.9.0/HiC-Pro_2.9.0/scripts/mergeSAM.py -q 0 -t -v -f bam_input/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam -r bam_input/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam -o bowtie_results/bwt2/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam.bwt2pairs.bam > logs/Neg/mergeSAM.log
/usr/local/Anaconda/envs_app/hicpro/2.9.0/bin/python /usr/local/Anaconda/envs_app/hicpro/2.9.0/HiC-Pro_2.9.0/scripts/mergeSAM.py -q 0 -t -v -f bam_input/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R1.bam -r bam_input/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted_R2.bam -o bowtie_results/bwt2/Neg/X_AHJY2LAFXX.1005_NEXTSEQ-2017-04-18.fq.hg19.bwa.sorted.bam.bwt2pairs.bam > logs/Neg/mergeSAM.log

@nservant
Copy link
Owner

nservant commented Nov 3, 2017

why do you have bam_input/... instead of rawdata/... ??
If you go into the HiC-Pro output folder, do these files are accessible ?
Otherwise, would you mind sharing with me a pair of BAM file, so that I can try here ?
thanks

@jamesdalg
Copy link
Author

jamesdalg commented Nov 6, 2017

I figured it would be best to store the bam files in a separate directory so that the precise files I wanted (the bam files) could be read. I've included the output. It's very small and only consists of logs. I've also sent you the data, with permission from the experimenter. Thanks so much for being willing to look into this. I'd love to use this for BAM files and do the alignment myself, if possible. Having bam files input be more seamless should make the software better for many others outside of my use case as well. If you can, please keep the input data confidential.
output_sbatch_multithreaded_lscratch_from_bamv2.zip

@nservant
Copy link
Owner

nservant commented Nov 8, 2017

Hi James,
Thanks for the data. Don't worry, I'll keep them confidential.
The only point is that I'm going to travel this week. So I will have a look next week.
We keep in touch
Best

@jamesdalg
Copy link
Author

Sounds good. Thanks so much for looking into this. If you find a way to make it work, let me know. I'd be interested to make this work as it will speed processing by quite a bit if one can bypass the aligment step and also allow workarounds when alignment issues arise. If there is anything further I can do, let me know. I imagine you are quite busy.

@nsauerwald
Copy link

Was there ever a resolution to this? I'm also trying to run HiCPro on some pre-aligned data and having a lot of trouble. It would save me a lot of time to bypass the bowtie alignment step, so I'd love to find out if it's possible.

@nservant
Copy link
Owner

nservant commented Jul 3, 2018

not yet, but it is in my top priority !

@jamesdalg
Copy link
Author

jamesdalg commented Jul 3, 2018

We found that we couldn't do it on non-bowtie aligned files previously as HiC-pro functions are dependent on bowtie format BAM files. Just my thinking, but perhaps you could try if the files were bowtie aligned.

@nsauerwald
Copy link

Thanks! My files are bowtie aligned (they're actually just subsampled from a previous alignment run through HiC-Pro), but I'm not sure exactly which files HiC-Pro needs for the next steps. The docs just say ".bam" files, but does this mean the bwt2pairs.bam, or bwt2merged.bam files? And does it also require any of the .mapstat or .mmapstat or .mpairstat files?

@nservant
Copy link
Owner

nservant commented Jul 3, 2018

Hi,
It requires the bwt2merged.bam files.
Organize your data as usually, with on folder per sample, and thus, your bam files instead of fastq files, and run HiC-Pro with options -s proc_hic -s merge_persample -s build_contact_maps -s ice_norm

@nsauerwald
Copy link

Perfect, thanks! I was doing exactly what you said, but used the bwt2pairs.bam files, so that must be why it wasn't working.

@nservant
Copy link
Owner

nservant commented Jul 3, 2018

Yes I know. This is exactly what I have to update.
I just finished to fix other points. I'm not sure I will include this in the next release, but in the next-next one ;)

@ToruOkadaOi
Copy link

Hello,
Is there any workaround for this issue yet?

@nservant
Copy link
Owner

yes, move to the nf-core-hic pipeline https://nf-co.re/hic/latest/docs/usage/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants