# Installation

To install these scripts, I suggest the following steps:

1. Create a folder called pybin in your home directory. Open a terminal, then type:

    ```mkdir ~/pybin```

2. Clone this GitHub repo into pybin:

    ```cd ~/pybin``` 
    
    ```git clone https://github.com/jsolvason/js```
    
3. Go back to your home directory:

    ```cd ~```

4. Find your shell profile (it will be called ```.bash_profile```, ```.bashrc```, or ```.zshrc```):

    ```ls -a ~``` 

5. Open your shell profile with the terminal text editor nano (this assumes it's named ```.zshrc```):

    ```nano ~/.zshrc```

6. Add the following line to your shell profile, then save and exit:

    ```export PYTHONPATH=~/pybin/js:~/pybin/:$PYTHONPATH```


7. Test that this works by opening a new terminal window and typing the following. If it does not return an error message, then it works!

    ```python```

    ```import js``` 
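
If the import fails, it can help to check what Python actually sees. The snippet below is a minimal sketch using only the standard library: it lists the directories in `PYTHONPATH` and reports whether each module can be located. The helper names `module_is_importable` and `pythonpath_entries` are illustrative and are not part of this repo.

```python
import importlib.util
import os

def module_is_importable(name):
    """Return True if Python can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None

def pythonpath_entries():
    """Return the directories listed in the PYTHONPATH environment variable."""
    raw = os.environ.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]

print("PYTHONPATH entries:", pythonpath_entries())
for name in ("js", "jsAff", "jsDna", "jsGenome"):
    status = "found" if module_is_importable(name) else "NOT found"
    print(name, status)
```

If a module is reported as not found, the usual causes are that the `export` line was not saved, or that the terminal was opened before the profile was edited.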


# Download reference files

Reference files are too large to upload to GitHub. They can be found on our lab's Google Drive here:

https://drive.google.com/drive/u/1/folders/1EP85teWmxO69mGgdENufN6YEux7i3X33

# Load all packages

In [18]:
import js
import jsAff as jsa
import jsDna as jsd
import jsGenome as jsg

help(js)

Help on module js:

NAME
    js

FUNCTIONS
    basename(fn)
    
    dprint(d, n)
    
    get_basename(fn)
    
    listModules()
    
    million(number)
    
    percent(number)
    
    read_tsv(fn, pc=False, header=True, breakBool=False, sep='\t', pc_list=False)
    
    strjoin(delim, l)
    
    write_row(rowList, delim='\t')

FILE
    /Users/joe/pybin/js/js.py




In [19]:
# List all modules in the package
js.listModules()

js.py
jsAff.py
jsDna.py
jsGataEts.py
jsGenome.py
jsNearGene.py
jsPandas.py



# Affinity scripts

This module allows you to load affinity reference files in the form of a dictionary with ```key=dna_sequence``` and ```value=affinity``` where max affinity is 1.0

In [20]:
help(jsa)

Help on module jsAff:

NAME
    jsAff

FUNCTIONS
    loadAr(ref='/Users/joe/code/ref/binding_affinity/ar/js-normalized-ar-8mer-affinities.tsv')
        Ar. Returns dictionary with key=8mer dna sequence, value=affinity (0-1.0)
    
    loadEts(ref='/Users/joe/code/ref/binding_affinity/ets/parsed_Ets1_8mers.txt')
        Ets1 from mouse. Returns dictionary with key=8mer dna sequence, value=affinity (0-1.0)
    
    loadGata4(ref='/Users/joe/code/ref/binding_affinity/gata4/js-normalized-gata4.tsv')
        Gata4 Mariani 2017. Returns dictionary with key=8mer dna sequence, value=affinity (0-1.0)
    
    loadGata6(ref='/Users/joe/code/ref/binding_affinity/gata6/parsed_Gata6_3769_contig8mers.txt')
        Gata6 Badis 2009
    
    loadNkx(ref='/Users/joe/code/ref/binding_affinity/nkx2-5/js-normalized-nkx25.tsv')
        Nkx2 Barerra 2016. Returns dictionary with key=8mer dna sequence, value=affinity (0-1.0)
    
    loadTbx(ref='/Users/joe/code/ref/binding_affinity/tbx5/mariani-et-al-2017/j

## Downloading affinity files

Download here: https://drive.google.com/drive/u/1/folders/1DhCFV5_naMZRFjfwdsYMnGatXaTX824y

## Loading dictionary

In [21]:
# Load Ets1 affinity data
seq2aff=jsa.loadEts(ref='/Users/joe/code/ref/binding_affinity/ets/parsed_Ets1_8mers.txt')
js.dprint(seq2aff,0)

AAAAAAAA 0.14700431859740803


In [22]:
# Max and min affinities
round(min(seq2aff.values()),3),max(seq2aff.values())

(0.071, 1.0)

In [23]:
# Sequence with the maximum affinity
seq2aff['CCGGAAGT']

1.0

In [24]:
# Note that you can look up the forward or the reverse-complement sequence and get the same answer.
seq2aff[jsd.revcomp('CCGGAAGT')]

1.0
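
The lookups above work in either orientation and top out at 1.0. One way such a table could be built is sketched below. This is not the jsAff implementation: the two-column `sequence<TAB>score` input, the `load_affinity_table` name, and the toy scores are all assumptions made for illustration. Each score is divided by the maximum, and the result is stored under both the sequence and its reverse complement.

```python
import io

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def load_affinity_table(handle):
    """Hypothetical loader: read 'sequence<TAB>score' lines into a dict."""
    raw = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        seq, score = line.split("\t")
        raw[seq] = float(score)

    top = max(raw.values())
    seq2aff = {}
    for seq, score in raw.items():
        aff = score / top              # normalize so the best site is 1.0
        seq2aff[seq] = aff
        seq2aff[revcomp(seq)] = aff    # same value for the other strand
    return seq2aff

# Toy input with made-up scores, only to show the mechanics
toy = io.StringIO("CCGGAAGT\t8.0\nAAAAAAAA\t2.0\n")
toy_seq2aff = load_affinity_table(toy)
print(toy_seq2aff["CCGGAAGT"], toy_seq2aff["ACTTCCGG"], toy_seq2aff["TTTTTTTT"])
```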

# Dna scripts

This module allows you to perform various operations on DNA sequences.

In [25]:
import jsDna as jsd
help(jsd)

Help on module jsDna:

NAME
    jsDna

FUNCTIONS
    GenerateAllPossibleSequences(template)
        Takes a string with letters A,T,C,G,N and returns all possible DNA strings converting N=>A,T,G,C
    
    GenerateRandomDNA(length)
        Takes length as input and returns a random DNA string of that length.
    
    GenerateSingleRandomSequence(template)
        Takes a string with letters A,T,C,G,N and returns a DNA string with rnadom AGTC nucleotides at position(s) with N.
    
    IupacToAllPossibleSequences(dna)
        Takes DNA with IUPAC letters and returns all possible DNA strings with only A/G/T/C.
    
    IupacToRegexPattern(dna)
        Takes DNA with IUPAC letters and returns a regex object that can search DNA with only A/T/G/C for the corresponding IUPAC DNA string.
    
    gc_content(seq)
        Takes sequence as input and returns GC content from 0-1.0.
    
    get_kmers(string, k)
        Takes DNA sequence as input and a kmer length and YIELDS all kmers of length K

## Hamming Distance

In [None]:
str1='AATTGGCC'
str2='TTTTGGCC'
jsd.hamming(str1, str2)    
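
For reference, the Hamming distance between two equal-length strings is the number of positions at which they differ. The sketch below is a standalone version of that definition. It is not the jsDna source, and raising an error on unequal lengths is an assumption about the desired behavior.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("AATTGGCC", "TTTTGGCC"))  # 2: only the first two bases differ
```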

## Reverse Complement

In [None]:
dna='ATGC'
jsd.revcomp(dna)    

## GC Content

In [None]:
seq='ATGGCCAT'
jsd.gc_content(seq)    
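
GC content is the fraction of bases that are G or C. The sketch below is a standalone version, not the jsDna source. Upper-casing the input and returning 0.0 for an empty sequence are assumptions made here.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence, from 0 to 1.0."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGGCCAT"))  # 0.5: four of the eight bases are G or C
```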

## Iterating over kmers

In [None]:
string='AATTGGCC'
k=3
jsd.get_kmers(string, k)    

In [None]:
for kmer in jsd.get_kmers(string, k):
    print(kmer) 

In [None]:
list(jsd.get_kmers(string, k))
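
According to its docstring, `get_kmers` yields its results, which is why the examples above either loop over it or wrap it in `list()`. Calling it on its own only gives back a generator object. The standalone sketch below shows the same pattern with a sliding window. It is an illustration, not the jsDna source.

```python
def get_kmers(seq, k):
    """Yield every substring of length k, moving one base at a time."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

kmers = list(get_kmers("AATTGGCC", 3))
print(kmers)  # ['AAT', 'ATT', 'TTG', 'TGG', 'GGC', 'GCC']
```

A sequence of length n contains n - k + 1 kmers, so the 8-base example gives 6 3-mers.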

## Generating random DNA

In [None]:
length=5
jsd.GenerateRandomDNA(length)    

In [None]:
template='AANAA'
jsd.GenerateSingleRandomSequence(template)    
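
The two functions above can be approximated with the standard `random` module. In the sketch below, the names `random_dna` and `fill_template` and the optional `rng` argument are illustrative assumptions, not the jsDna API. Passing a seeded `random.Random` makes the output reproducible.

```python
import random

def random_dna(length, rng=random):
    """Return a random string of A/C/G/T with the requested length."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def fill_template(template, rng=random):
    """Replace each N with a random base and keep every other letter."""
    return "".join(rng.choice("ACGT") if base == "N" else base
                   for base in template)

rng = random.Random(0)  # seeded so repeated runs give the same sequences
seq = random_dna(5, rng)
filled = fill_template("AANAA", rng)
print(seq, filled)
```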

In [None]:
template='AANAA'
jsd.GenerateAllPossibleSequences(template)
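
Enumerating every sequence that matches a template is a Cartesian product over the choices at each position. The sketch below uses `itertools.product`. The `expand_template` name and the A, C, G, T output order are assumptions, so the order may differ from what jsDna returns.

```python
from itertools import product

def expand_template(template, alphabet="ACGT"):
    """Return all sequences obtained by replacing each N with A, C, G or T."""
    choices = [alphabet if base == "N" else base for base in template]
    return ["".join(combo) for combo in product(*choices)]

expanded = expand_template("AANAA")
print(expanded)  # ['AAAAA', 'AACAA', 'AAGAA', 'AATAA']
```

The number of results is 4 to the power of the number of N positions, so templates with many Ns grow quickly.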

In [None]:
jsd.Iupac2AllNt

In [None]:
dna='AYN'
jsd.IupacToAllPossibleSequences(dna)    

In [None]:
dna='AYN'
pattern=jsd.IupacToRegexPattern(dna)    
pattern

In [None]:
jsd.revcomp_regex(pattern)
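
An IUPAC string can be turned into a regular expression by mapping each ambiguity code to a character class. The sketch below is a standalone version built on the standard IUPAC nucleotide codes. The `IUPAC` table and the `iupac_to_regex` name are illustrative and do not reproduce `jsd.Iupac2AllNt` or `jsd.IupacToRegexPattern`.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(dna):
    """Compile an IUPAC DNA string into a regex over A/C/G/T."""
    parts = []
    for base in dna.upper():
        letters = IUPAC[base]
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return re.compile("".join(parts))

iupac_pattern = iupac_to_regex("AYN")
print(iupac_pattern.pattern)                                   # A[CT][ACGT]
print([m.group() for m in iupac_pattern.finditer("GACTATG")])  # ['ACT', 'ATG']
```

Note that `finditer` reports non-overlapping matches only.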

# Genome scripts

This module is used to load a genome as a dictionary with ```key=chromosome_name``` and ```value=chromosome_sequence```.
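
To make the data structure concrete, the sketch below parses FASTA text into the same kind of dictionary. It is a minimal illustration, not the jsGenome source. The `load_fasta` name, the toy sequences, and taking the chromosome name as the first word after `>` are all assumptions.

```python
import io

def load_fasta(handle):
    """Read FASTA records into a dict of name -> full sequence string."""
    chr2seq = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                chr2seq[name] = "".join(chunks)
            name = line[1:].split()[0]  # first word after '>' is the key
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        chr2seq[name] = "".join(chunks)  # store the final record
    return chr2seq

# Toy genome, only to show the mechanics
toy_genome = load_fasta(io.StringIO(">chr1 toy\nACGT\nACGT\n>chr2\nGGCC\n"))
print(list(toy_genome.keys()))   # ['chr1', 'chr2']
print(toy_genome["chr1"][2:6])   # GTAC
```

Because each value is a plain string, a region is extracted with ordinary zero-based, end-exclusive slicing.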

## Download genome

Download here: https://drive.google.com/drive/u/1/folders/1zRZUe7bcVZqB1qsYbjWSAHIAlGw-UyAb

## Load genome

In [2]:
import jsGenome as jsg
help(jsg)

Help on module jsGenome:

NAME
    jsGenome

FUNCTIONS
    loadCi08(file_genome='/Users/joe/code/ref/genomes/ciona/2008/JoinedScaffold.fasta')
        Load 2008 ciona genome
    
    loadCi19(file_genome='/Users/joe/code/ref/genomes/ciona/2019/HT.Ref.fasta')
        Load 2019 ciona genome
    
    loadCi19_beta(file_genome='/Users/joe/code/ref/genomes/ciona/2019/HT.Ref.forBetaTesting.fasta')
        Load 2019 ciona genome (for beta testing, first 1kb of each chrom)
    
    loadDr11(file_genome='/Users/joe/code/ref/genomes/zebrafish/danRer11/danRer11.fa')
        Load Zebrafish danRer11 genome
    
    loadGenome(file_genome)
        Loads arbitrary genome
    
    loadHg19(file_genome='/Users/joe/code/ref/genomes/human/hg19/hg19.fa')
        Load hg19  genome
    
    loadHg38(file_genome='/Users/joe/code/ref/genomes/human/hg38/hg38.fa')
        Load hg38  genome
    
    loadMm10(file_genome='/Users/joe/code/ref/genomes/mouse/mm10/mm10.fa')
        Load mm10 mouse genome

FILE
    /U

In [3]:
chr2seq=jsg.loadHg38()

In [6]:
# Inspect keys of dictionary
list(chr2seq.keys())[0]

'chr1'

In [7]:
# Print an arbitrary 100 bp window of the genome
chr2seq['chr1'][500000:500100]

'AGGTATCCTCTCATCTCAGCTTCCCTAGTAGTTGGAACTCTAGGTGCACAACACCACACCAGTTATTATTATTATTTTTTAATTTTTTATAGAGACAGGT'

# Reading other files

In [9]:
import js
help(js.read_tsv)

Help on function read_tsv in module js:

read_tsv(fn, pc=False, header=True, breakBool=False, sep='\t', pc_list=False)



## Tab separated value file 

In [11]:
# Define file name
fn='./test-data/test.tsv'

# Usually you will just use the parameters 'fn','pc' (printcols), 'header'

# Usually you want to start by viewing what column is at what pythonic index in the row
for row in js.read_tsv(fn, pc=True, header=True):
    pass

0 name
1 sequence
2 value



In [12]:
# Now that we know what column is at what index, we can load each row
for row in js.read_tsv(fn, pc=False, header=True):
    print("row =",row)
    
    name,seq,val=row
    print(name,seq,val)
    
    print()

row = ['a', 'ATG', '1.0']
a ATG 1.0

row = ['b', 'AAA', '3.1']
b AAA 3.1
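
The behavior shown above (print the column indices once, skip the header, then yield each row as a list of strings) can be sketched as a small generator. This is a guess at the mechanics based only on the outputs above, not the source of `js.read_tsv`. The `read_rows` name and its parameters are hypothetical.

```python
import io

def read_rows(handle, print_cols=False, header=True, sep="\t"):
    """Yield each line as a list of string fields."""
    for i, line in enumerate(handle):
        fields = line.rstrip("\n").split(sep)
        if i == 0:
            if print_cols:
                # Show which column sits at which index
                for idx, field in enumerate(fields):
                    print(idx, field)
            if header:
                continue  # the first line holds column names, not data
        yield fields

toy_tsv = "name\tsequence\tvalue\na\tATG\t1.0\nb\tAAA\t3.1\n"
rows = list(read_rows(io.StringIO(toy_tsv)))
print(rows)  # [['a', 'ATG', '1.0'], ['b', 'AAA', '3.1']]

# Every field is a string, so numeric columns need an explicit cast
values = [float(val) for name, seq, val in rows]
```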



## Bed file


In [13]:
# Define file name
fn='./test-data/test.bed'

# This program assumes the bed file has no headers. Bed file header lines start with a '#', so if you really want
# to read in a bed file with headers, you can just skip every row that begins with a '#'.

for row in js.read_tsv(fn, pc=True, header=False):
    pass

0 chr1
1 500000
2 501000
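
Skipping `#` header lines can be done with a single check per line. The sketch below is a standalone illustration with a hypothetical `read_bed` helper. It does not use `js.read_tsv`, and it assumes the first three columns are chromosome, start and end.

```python
import io

def read_bed(handle):
    """Yield (chrom, start, end) tuples, ignoring '#' header lines."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        yield chrom, int(start), int(end)

toy_bed = io.StringIO("#chrom\tstart\tend\nchr1\t500000\t501000\n")
intervals = list(read_bed(toy_bed))
print(intervals)  # [('chr1', 500000, 501000)]
```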



In [27]:
# Now you can do things like iterate over all kmers for each entry in the bed file
for row in js.read_tsv(fn, pc=False, header=False):
    chrom,start,end=row
    start,end=int(start),int(end)
    
    seq=chr2seq[chrom][start:end]
    for kmer in jsd.get_kmers(seq,8):
        
        # Find one example of an ets binding site and leave
        if kmer[2:6] == 'GGAA':
            affinity = round(seq2aff[kmer],2)
            print(kmer,affinity)
            break
    break

TTGGAACT 0.12
