## Relationship between number of TATA-boxes and gene expression level

This notebook shows how to count the predicted TATA boxes per gene from a CSV file of ElemeNT results. The code only works if the CSV file has the following column order: column 1: Elements found, column 2: Start position, column 3: Sequence, column 4: PWM score, column 5: Consensus match. The parsing code also expects each gene's hits to follow a line carrying the gene ID (here, IDs containing `YALI`).

We also show how to extract gene expression values directly from .txt files generated in Galaxy. The TATA-box counts and the expression values are stored in separate dictionaries keyed by gene name, and we then plot one against the other as a scatter plot to look for a correlation.

### Imports

In [None]:
import matplotlib.pyplot as plt
from Bio import SeqIO
import pandas as pd
from pandas import DataFrame

### Extract no. of TATA box hits per gene

In [None]:
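file_name = "Predicted_TATA_hits.csv"
dict_hits = {}
current_gene_name = None
current_count = 0

# Each gene's block starts with a line containing its ID ("YALI" here);
# change the marker if your gene IDs look different.
with open(file_name, "r") as file_handle:
    for line in file_handle:
        if "YALI" in line:
            # Store the finished gene before starting the next one.
            if current_gene_name is not None:
                dict_hits[current_gene_name] = current_count

            # Drop the first character and the last two, then any commas.
            current_gene_name = line[1:-2].replace(",", "")
            current_count = 0
        if line.startswith("TATA box"):
            current_count += 1

# Store the last gene in the file.
if current_gene_name is not None:
    dict_hits[current_gene_name] = current_count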
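
As a quick sanity check, it can help to tally how many genes have each hit count before plotting. The sketch below uses a small made-up dictionary (`example_hits`, whose gene IDs and counts are hypothetical) in place of `dict_hits`; swap in the real dictionary to inspect your own data.

```python
from collections import Counter

# Made-up counts standing in for dict_hits; the IDs and values are hypothetical.
example_hits = {
    "YALI1_A00014g": 0,
    "YALI1_A00019g": 2,
    "YALI1_A00032g": 1,
    "YALI1_A00058g": 2,
}

# Map each hit count to the number of genes with that count.
hits_distribution = Counter(example_hits.values())

for n_boxes in sorted(hits_distribution):
    print(f"{n_boxes} TATA boxes: {hits_distribution[n_boxes]} gene(s)")
```

If nearly every gene ends up with the same count, the gene ID marker or the `"TATA box"` prefix probably does not match your file.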

### Extract gene expression values

In [None]:
file_name_expression = "Expression.txt"
dict_expression = {}

with open(file_name_expression, "r") as file_handle_expression:
    for line in file_handle_expression:
        splits = line.split("\t")
        # In our .txt file, the gene name is in column 5 (index 4). Change the index to match your file.
        gene_name = splits[4]
        # In our .txt file, the expression value is in column 10 (index 9). Change the index to match your file.
        expression_level = float(splits[9])
        # Some rows list several comma-separated genes; give each the same value.
        if "," in gene_name:
            for sub_gene in gene_name.split(","):
                dict_expression[sub_gene] = expression_level
        else:
            dict_expression[gene_name] = expression_level
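
Only genes whose names match exactly in both dictionaries can appear in the plot, so it may be worth checking how much the two sets of names overlap. The following is a minimal sketch on made-up data: `example_hits` and `example_expression` are hypothetical stand-ins for `dict_hits` and `dict_expression`.

```python
# Made-up dictionaries standing in for dict_hits and dict_expression.
example_hits = {"geneA": 1, "geneB": 0, "geneC": 3}
example_expression = {"geneA": 120.5, "geneC": 8.0, "geneD": 42.0}

# Dictionary key views support set operations directly.
shared = sorted(example_hits.keys() & example_expression.keys())
only_hits = sorted(example_hits.keys() - example_expression.keys())
only_expression = sorted(example_expression.keys() - example_hits.keys())

print(f"{len(shared)} genes in both dictionaries")
print(f"{len(only_hits)} genes with TATA counts but no expression value: {only_hits}")
print(f"{len(only_expression)} genes with expression but no TATA counts: {only_expression}")
```

A very small overlap usually points to a naming mismatch between the two files, such as stray whitespace or a different ID scheme, rather than to genuinely missing genes.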

### Plot correlation between gene expression and TATA box abundance

In [None]:
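# Plot only genes present in both dictionaries; genes without an expression value are skipped.
shared_genes = [gene_name for gene_name in dict_hits if gene_name in dict_expression]
plt.scatter(
    [dict_hits[gene_name] for gene_name in shared_genes],
    [dict_expression[gene_name] for gene_name in shared_genes],
)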

plt.xlim(-0.5, 8.5)
plt.ylim(0,22000)
plt.xlabel("No. TATA boxes", color="black", fontsize=14)
plt.ylabel("Normalised FPKM",color="black", fontsize=14)
plt.xticks(fontsize=13, color='black')
plt.yticks(fontsize=13, color='black')
plt.suptitle('Relationship between no. of TATA-boxes and expression level', fontsize=16, fontweight='bold', color='black')
plt.show()
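
To put a number on the trend seen in the scatter plot, one option is a Pearson correlation coefficient over the genes present in both dictionaries. The sketch below is an illustration, not part of the original analysis: `pearson_r` is a hypothetical helper, the data are made up, and it assumes both variables actually vary (a constant input would cause a division by zero). For strongly skewed expression values, a rank-based measure such as Spearman's may be a better fit.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data standing in for dict_hits and dict_expression.
example_hits = {"geneA": 0, "geneB": 1, "geneC": 2, "geneD": 3}
example_expression = {"geneA": 10.0, "geneB": 30.0, "geneC": 50.0, "geneD": 70.0}

# Keep only genes that have both a hit count and an expression value.
shared = [g for g in example_hits if g in example_expression]
r = pearson_r(
    [example_hits[g] for g in shared],
    [example_expression[g] for g in shared],
)
print(f"Pearson r = {r:.3f} over {len(shared)} genes")
```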