# Extracting FASTA files

This notebook explains how to extract FASTA sequences for a desired region of the genome. For each gene, the code provided extracts the 500 bp region upstream of the TSS closest to that gene's start codon. These limits can be changed to extract regions of a different length.

### Imports

In [1]:
from Bio import SeqIO

### Write FASTA file

In [None]:
# Write FASTA file for regions upstream of TSS
file_handle_TSS=open("500bp_upstream_TSS.FASTA", "w")

### Extract a list of TSS coordinates in BED format for each chromosome

The TSS map that we used (YeastTSS) provides all TSS coordinates in BED format. We downloaded the BED file for each chromosome from YeastTSS and saved the files as "TSSs_chrA.bed", "TSSs_chrB.bed", "TSSs_chrC.bed", etc. From each BED file we then extracted a list of TSS coordinates using the code below.

In [None]:
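def get_TSSs(file):
    """Return [position, strand] pairs for every TSS in a BED file."""
    TSSs=[]

    with open(file,"r") as bed:
        for line in bed:
            # Strip the newline so the strand matches even when it is the last column
            splits=line.rstrip("\n").split("\t")
            # Skip header, blank, or malformed lines
            if len(splits) < 6:
                continue
            try:
                if splits[5]=='+':
                    # Plus strand: the TSS is the BED start coordinate
                    TSSs.append([int(splits[1]), 1])
                elif splits[5]=='-':
                    # Minus strand: the TSS is the BED end coordinate
                    TSSs.append([int(splits[2]), -1])
            except ValueError:
                # Skip lines whose coordinates are not integers
                continue
    return TSSs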
        
for chr in ["chrA", "chrB", "chrC", "chrD", "chrE", "chrF"]:
    TSSs = get_TSSs("TSSs_"+chr+".bed")
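
Because the loop above reassigns `TSSs` on every iteration, only the list for the last chromosome remains once it finishes. One possible way to keep every chromosome available is a dictionary keyed by chromosome name. The following is a minimal sketch and not part of the original notebook: `parse_bed_lines` and `TSSs_by_chr` are hypothetical names, and the BED lines are made-up toy data standing in for the per-chromosome files.

```python
def parse_bed_lines(lines):
    """Return [position, strand] pairs from an iterable of BED lines."""
    tss_list = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip header or malformed lines
        if fields[5] == "+":
            # Plus strand: take the BED start coordinate
            tss_list.append([int(fields[1]), 1])
        elif fields[5] == "-":
            # Minus strand: take the BED end coordinate
            tss_list.append([int(fields[2]), -1])
    return tss_list

# Toy BED lines (made-up coordinates), one list per chromosome
toy_bed = {
    "chrA": ["chrA\t100\t101\ttss1\t0\t+\n", "chrA\t900\t901\ttss2\t0\t-\n"],
    "chrB": ["chrB\t50\t51\ttss3\t0\t+\n"],
}

# One TSS list per chromosome, so no list is overwritten
TSSs_by_chr = {name: parse_bed_lines(lines) for name, lines in toy_bed.items()}
print(TSSs_by_chr["chrA"])
```

With real data, the same dictionary could be filled by reading each "TSSs_chrX.bed" file and then indexed by chromosome name wherever a TSS list is needed.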

### Define function to find the closest TSS

We next define a function that finds the TSS closest to the start codon of each gene. The function only considers TSSs that lie upstream of the ATG and within 2500 bp of it. Crucially, it only assigns a TSS to a start codon on the same strand.

In [None]:
def current_closest_TSS(strand, loc_startcodon):
    # Relies on the global list TSSs produced by get_TSSs()
    # Changing max_distance will change the maximum search distance
    max_distance=2500
    minimum=100000000

    closest_TSS = None

    for TSS, TSS_strand in TSSs:
        # Only consider TSSs on the same strand as the start codon
        if TSS_strand!=strand:
            continue
        # Distance from TSS to start codon; positive when the TSS is upstream
        distance=(loc_startcodon-TSS)*strand
        if 0 < distance <= max_distance and distance <= minimum:
            closest_TSS=TSS
            minimum=distance

    return closest_TSS, minimum
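
To make the selection rule concrete, here is a small worked example on made-up coordinates. It is a sketch rather than the notebook's own function: `closest_upstream_tss` is a hypothetical variant that takes the TSS list as an argument instead of reading a global, and it returns `None` for the distance when no TSS qualifies.

```python
def closest_upstream_tss(tss_list, strand, loc_startcodon, max_distance=2500):
    """Return (TSS, distance) for the nearest same-strand upstream TSS, or (None, None)."""
    best, best_distance = None, None
    for tss, tss_strand in tss_list:
        if tss_strand != strand:
            continue  # different strand: never assigned
        # Positive when the TSS is upstream of the start codon on this strand
        distance = (loc_startcodon - tss) * strand
        if 0 < distance <= max_distance and (best_distance is None or distance <= best_distance):
            best, best_distance = tss, distance
    return best, best_distance

# Toy TSS list: [position, strand]
toy_TSSs = [[1000, 1], [1400, 1], [1600, 1], [1550, -1], [5000, -1]]

plus_hit = closest_upstream_tss(toy_TSSs, 1, 1500)    # plus-strand gene, ATG at 1500
minus_hit = closest_upstream_tss(toy_TSSs, -1, 1500)  # minus-strand gene, ATG at 1500
no_hit = closest_upstream_tss(toy_TSSs, 1, 900)       # no plus-strand TSS upstream of 900

print(plus_hit, minus_hit, no_hit)
```

For the plus-strand gene, the TSS at 1600 is skipped because it lies downstream of the ATG, and 1400 wins over 1000 because it is closer. For the minus-strand gene, the TSS at 5000 is skipped because it is more than 2500 bp away.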

### Write nucleotide sequence into FASTA file

To write the sequences into a FASTA file, we must first extract the ATG coordinates from an online database. Although we are extracting the sequence upstream of the TSS, we need the ATG coordinates of each gene to identify the TSS closest to that gene. We obtained ATG coordinates from GenBank (assembly accession GCA_001761485.1). Crucially, for the code below to work, the GenBank file for each chromosome must be saved under the following names: 'Yali_chrA.gbk', 'Yali_chrB.gbk', etc.

In [None]:
# Loop over each chromosome
for chr in ["chrA", "chrB", "chrC", "chrD", "chrE", "chrF"]:
    # Load the TSSs of this chromosome; current_closest_TSS reads this global list
    TSSs = get_TSSs("TSSs_"+chr+".bed")
    # Parse the GenBank file once and reuse the record
    record=next(SeqIO.parse("Yali_"+chr+".gbk", "genbank"))
    seq_features=record.features
    chromosome=str(record.seq)
    for feature in seq_features:
        if feature.type=="CDS":
            # Assumption: the locus_tag qualifier is used as the FASTA header; change this if another identifier is preferred
            gene_name=feature.qualifiers.get("locus_tag", ["unknown"])[0]
            if feature.location.strand==1:
                start=int(feature.location.start)
                TSS, dist = current_closest_TSS(feature.location.strand, start)
                if TSS is not None:
                    # To extract a different length of DNA, change these limits, e.g. to extract 200bp upstream and 100bp downstream of the TSS use [TSS-200:TSS+100]
                    # max(0, ...) keeps the slice valid when the TSS is less than 500bp from the chromosome start
                    region=chromosome[max(0, TSS-500):TSS]
                    file_handle_TSS.write(">" + gene_name + "\n")
                    file_handle_TSS.write(region+"\n")

            elif feature.location.strand==-1:
                end=int(feature.location.end)
                TSS, dist = current_closest_TSS(feature.location.strand, end)
                if TSS is not None:
                    # To extract a different length of DNA, change these limits, e.g. to extract 200bp upstream and 100bp downstream of the TSS use [TSS-100:TSS+200]
                    # Note: the sequence is written as stored on the forward strand; it is not reverse-complemented
                    region=chromosome[TSS:TSS+500]
                    file_handle_TSS.write(">" + gene_name + "\n")
                    file_handle_TSS.write(region+"\n")
file_handle_TSS.close()
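
The slice limits differ between strands because "upstream" means lower coordinates on the plus strand and higher coordinates on the minus strand. The sketch below illustrates this on a made-up 16 bp sequence. It is not the notebook's method: `region_around_tss` and `reverse_complement` are hypothetical helpers, and unlike the code above this sketch reverse-complements minus-strand regions, on the assumption that the sequence is wanted in the gene's own 5' to 3' orientation.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def region_around_tss(chromosome, tss, strand, upstream=500, downstream=0):
    """Slice a window around a TSS, oriented 5' to 3' relative to the gene."""
    if strand == 1:
        # Plus strand: upstream is toward lower coordinates
        start, end = max(0, tss - upstream), tss + downstream
        return chromosome[start:end]
    # Minus strand: upstream is toward higher coordinates
    start, end = max(0, tss - downstream), tss + upstream
    return reverse_complement(chromosome[start:end])

toy_chromosome = "ACGTTGCAGGATCCAA"  # made-up sequence

plus_region = region_around_tss(toy_chromosome, 8, 1, upstream=4)
minus_region = region_around_tss(toy_chromosome, 8, -1, upstream=4)
minus_wide = region_around_tss(toy_chromosome, 8, -1, upstream=4, downstream=2)

print(plus_region, minus_region, minus_wide)
```

If forward-strand output is preferred for minus-strand genes, as in the notebook code, the `reverse_complement` call can simply be dropped.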