Skip to content

larsgr/dna-sequence-models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DNA sequence models

work in progress

This repo currently contain scripts for preparing training data for ML models that predict ATAC-seq peak data from genomic sequences.

"download_from_salmobase.sh" - downloads peak files, blacklist file from salmobase.org and fasta file from Ensembl.

"binify_bed.py" - reads multiple peak files (bed files with non-overlapping regions) and divides the genome into bins of certain size (default: 200 basepairs) and outputs the bins where atleast one of the files had a peak that overlaps the bin with more than a given number of bases (default 50% of bin). The sequence for that bin is also extracted from a given fasta file. I also supports blacklists, i.e. a bed file of regions to exclude.

Example:

python3 binify_bed.py \
  --bedlist bed_list_test.txt \
  --fasta downloads/genome/Salmo_salar.Ssal_v3.1.dna_sm.toplevel.fa \
  --outfile test.tsv \
  --exclude data/blacklist/AtlanticSalmon_blacklist_sorted.bed \
  --seq_length 10

where "bed_list_test.txt" contains paths to peak files (one per line):

downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_LateBlastulation_R1.mLb.clN_peaks.narrowPeak
downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_LateSomitogenesis_R1.mLb.clN_peaks.narrowPeak
downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_Brain_Immature_Female_R1.mLb.clN_peaks.narrowPeak
downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_Liver_Immature_Female_R1.mLb.clN_peaks.narrowPeak
downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_Muscle_Immature_Female_R1.mLb.clN_peaks.narrowPeak
downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_Gonad_Mature_Female_R1.mLb.clN_peaks.narrowPeak
downloads/AtlanticSalmon-ATAC-peaks/AtlanticSalmon_ATAC_Gonad_Mature_Male_R1.mLb.clN_peaks.narrowPeak

The output file is a .tsv file with columns:

  • binID: currently shows the location (chr:start-end) of the extracted sequence
  • sequence: extracted sequence (Note: may contain N's)
  • values: one for each file. 1 if peak overlaps 0 if not
1:444096-444105 ACACAAACAA      1       1       1       1       0       0       0

About

Scripts for ML models on genomic data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published