# Load SELEX Matrices and Run MOODS from Python

We're going to load the SELEX matrices from Regulation into Python and output them as nicely formatted JSON. Then we'll select 50 random matrices and use them to scan some enhancer sequences (also loaded into Python, not using the file system).

In [None]:
import itertools
import random
import json
from multiprocessing import Pool
from pathlib import Path

from tfbs.pfm_reader import pfm_reader, PFM
from tfbs.moods import Scanner
from tfbs.utils import iter_fasta
from tfbs.pwm import PWM

In [None]:
SELEX = Path("/home/malcolm/Data/Regulation/PWMs/SELEX")

# build an iterator that makes everything look like one big file
def load_pfms(d: Path):
    for pfm_file in d.iterdir():
        with open(pfm_file) as f:
            yield [pfm_file.stem, *f.readlines()]


pfm_iter = pfm_reader(itertools.chain.from_iterable(load_pfms(SELEX / "matrices")))

# read in SELEX PFMs into a list, using each file name as the PFM id
SELEX_pfms = [PFM(id=info[0], PFM=pfm) for info, pfm in pfm_iter]

SELEX_pfms[50].dict()
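
To illustrate the "one big file" trick used above, here is a minimal sketch with in-memory data. `fake_files` is a hypothetical stand-in for `load_pfms`: it yields one list per file (name first, then raw lines), and `itertools.chain.from_iterable` flattens those lists into a single stream of lines. What `pfm_reader` then does with that stream is not shown here.

```python
import itertools


def fake_files():
    # Hypothetical stand-in for load_pfms: one list per file,
    # with the file stem first and the raw lines after it.
    yield ["motif_a", "1\t2\n", "3\t4\n"]
    yield ["motif_b", "5\t6\n"]


# Flatten the per-file lists into one continuous sequence of lines.
lines = list(itertools.chain.from_iterable(fake_files()))
print(lines)
```

Because the flattening is lazy, each file is only opened when the consumer reaches its lines.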


From here we can serialize everything as JSON:

In [None]:
with open(SELEX / "selex.json", 'w') as jsonfile:
    json.dump([pfm.dict() for pfm in SELEX_pfms], jsonfile, indent=4)

And then read it back in:

In [None]:
with open(SELEX / "selex.json") as jsonfile:
    pfms = [PFM(**item) for item in json.load(jsonfile)]

pfms[50].dict()
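
A quick way to gain confidence in the JSON round trip is to check that what comes back equals what went in. The sketch below uses a plain dataclass, `Matrix`, as a hypothetical stand-in for `PFM` (the real class may behave differently) and an in-memory buffer instead of a file.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Matrix:
    # Hypothetical stand-in for PFM: an id plus a 4 x N count matrix.
    id: str
    PFM: list[list[int]]


originals = [Matrix("motif_a", [[1, 2], [3, 4], [5, 6], [7, 8]])]

# Serialize to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
json.dump([asdict(m) for m in originals], buffer, indent=4)

# Rewind and rebuild the objects from the parsed dictionaries.
buffer.seek(0)
restored = [Matrix(**item) for item in json.load(buffer)]

print(restored == originals)
```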

Now we can select the PWMs we want to scan; in this case, 50 random matrices from the SELEX data:

In [None]:
# random.sample draws without replacement, so no matrix is picked twice
chosen_pfms = random.sample(SELEX_pfms, k=50)
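
When picking random matrices, it is worth remembering that the two sampling functions in the standard `random` module differ: `random.choices` draws with replacement (the same item can appear more than once), while `random.sample` draws without replacement. A small sketch, using a hypothetical list of names in place of `SELEX_pfms`:

```python
import random

# Hypothetical stand-in for SELEX_pfms.
pool = [f"pfm_{i}" for i in range(100)]

rng = random.Random(0)  # seeded only to make the sketch repeatable

with_replacement = rng.choices(pool, k=50)    # duplicates are possible
without_replacement = rng.sample(pool, k=50)  # always 50 distinct items

print(len(set(with_replacement)), len(set(without_replacement)))
```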


In [None]:
PWMs = [PWM(p.PFM, p.id, pvalue=1e-3) for p in chosen_pfms]

Build the Scanner object:

In [None]:
s = Scanner(PWMs)

Read in the fasta sequences:

In [None]:
seqs = iter_fasta(Path.home() / "Data" / "Other Resources" / "vista" / "vista_20_04_21.fa")

# module-level wrapper so that Pool can pass the function to its workers
def scan(fa: tuple[str, str]):
    return s.scan(fa)

with Pool(16) as p:
    results = p.map(scan, seqs)  # Pool.map already returns a list


Now we have a list of hits for each sequence:

In [None]:
results[:10]

We can output or store this however we want. Here we convert the hits to chromosomal coordinates and print them in a BED-like format:

NB: I have not checked whether the MOODS hit positions are 0-based or 1-based, so the coordinates below still need to be tested.

In [None]:
header = ["chr", "start", "end", "name", "score", "strand", "PWM"]

print("\t".join(header))

for r in results[:50]:
    chrom, start, end, name = r['header'].split(':')
    for h in r['hits']:
        h_start = int(start) + h.start
        h_end = int(start) + h.end
        print("\t".join(map(str, [chrom, h_start, h_end, name, h.score, h.strand, h.TF])))
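
Until the 0-based versus 1-based question is settled, it may help to isolate the conversion in one place. The sketch below is only an illustration: `hit_to_bed` is a hypothetical helper, and it assumes the region start taken from the FASTA header is already 0-based (as in BED) and that the hit is described by a start offset and a motif length. BED intervals are 0-based and half-open, so the end is the start plus the length.

```python
def hit_to_bed(region_start: int, hit_start: int, hit_length: int,
               one_based: bool = False) -> tuple[int, int]:
    """Convert a hit offset within a region to a BED-style interval.

    Hypothetical helper. Assumes region_start is 0-based; set
    one_based=True if the scanner reports 1-based hit positions.
    """
    offset = hit_start - 1 if one_based else hit_start
    bed_start = region_start + offset
    # BED is half-open: the end coordinate is one past the last base.
    return bed_start, bed_start + hit_length


# The same hit (first base of the region, 8 bp long) under each convention.
print(hit_to_bed(1000, 0, 8))
print(hit_to_bed(1000, 1, 8, one_based=True))
```

Both calls describe the same interval, which is the property to check once the scanner's convention is known, for example by scanning a sequence with a motif planted at a known position.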