## Functional annotation

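We will quickly assemble a "full" set of ChIP-Seq x RNA-Seq target genes by combining the gene annotations, the ChIP-Seq peaks, and the RNA-Seq results into one table.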

In [None]:
import pandas as pd

#### Gene annotations

Read the gene annotation table and extract the gene names.

In [None]:
genes = pd.read_csv("~/shared/MCB280A_data/S288C_R64-3-1/saccharomyces_cerevisiae_R64-3-1_20210421.gff",
                   delimiter="\t",
                   header=None,
                   names=['seqid', 'source', 'type', 
                          'start', 'end', 'score', 'strand',
                          'phase', 'attributes'])
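# Keep only gene features; .copy() avoids a SettingWithCopyWarning
# when new columns are added to the filtered table
genes = genes[genes['type'] == "gene"].copy()
# Assumes the second attribute of each gene row is "Name=<gene name>"
genes['name'] = genes['attributes'].str.split(';').str[1]
genes['name'] = genes['name'].str.replace("Name=", "")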

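Build a promoter region for each gene: the 1 kb window upstream of the gene, which lies before `start` on the `+` strand and after `end` on the `-` strand.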

In [None]:
import numpy as np
genes['prmstart'] = np.where(genes['strand'] == '+', 
                             genes['start'] - 1000, 
                             genes['end'] + 1)
genes['prmend'] = genes['prmstart'] + 1000

#### ChIP-Seq Peaks

Now, we'll read in the table of ChIP-Seq peaks.

In [None]:
peaks = pd.read_csv("~/Hsf1/ChIP-Seq/macs2/Hsf1_ChIP_heatshk_peaks.xls",
                    comment='#', delimiter='\t')

Compute the intersection between promoter regions and ChIP-Seq peaks, keeping only high-confidence peaks whose summit falls inside a promoter.

In [None]:
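gene_peaks = {}

# Keep only high-confidence peaks (q-value below 1e-20)
top_peaks = peaks[peaks['-log10(qvalue)'] > 20]

# Record a peak for each gene whose promoter contains the peak summit.
# If several peaks hit the same promoter, the last one seen is kept.
for peak in top_peaks.itertuples():
    for gene in genes.itertuples():
        if (peak.chr == gene.seqid
                and gene.prmstart <= peak.abs_summit <= gene.prmend):
            gene_peaks[gene.name] = peak.name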
            
gene_peaks = pd.Series(gene_peaks, name='peak')
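
The nested loop compares every peak with every gene, which is fine for a small genome but slow for large tables. As an alternative, the same overlap can be sketched with a merge on the chromosome followed by a filter. This is a minimal illustration on made-up toy tables (`toy_genes`, `toy_peaks`, and their values are hypothetical), not a replacement for the cell above, and when several peaks hit one promoter it may not pick the same peak as the loop does.

```python
import pandas as pd

# Toy stand-ins for the `genes` and `top_peaks` tables (values are made up).
toy_genes = pd.DataFrame({
    "seqid": ["chrI", "chrI", "chrII"],
    "name": ["geneA", "geneB", "geneC"],
    "prmstart": [100, 5000, 100],
    "prmend": [1100, 6000, 1100],
})
toy_peaks = pd.DataFrame({
    "chr": ["chrI", "chrII", "chrII"],
    "abs_summit": [500, 500, 9000],
    "name": ["peak_1", "peak_2", "peak_3"],
})

# Pair every gene with every peak on the same chromosome...
pairs = pd.merge(toy_genes, toy_peaks, left_on="seqid", right_on="chr",
                 suffixes=("_gene", "_peak"))
# ...then keep the pairs whose summit lies inside the promoter window.
inside = pairs[pairs["abs_summit"].between(pairs["prmstart"], pairs["prmend"])]

# Keep one peak per gene and index the result by gene name.
toy_gene_peaks = (inside.drop_duplicates("name_gene", keep="last")
                        .set_index("name_gene")["name_peak"]
                        .rename("peak"))
print(toy_gene_peaks)
```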


#### ChIP-Seq Genes

Add the peak names to the gene table. Genes without a peak get a missing value in the new `peak` column.

In [None]:
genes2 = pd.merge(genes, gene_peaks,
                  left_on='name', right_index=True, how='left')

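Now, we will merge in the peaks table by matching up the `peak` column with the `name` column in the peaks table. Because both tables have a `name` column, pandas renames them to `name_x` (the gene name) and `name_y` (the peak name) in the merged table.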

In [None]:
genes3 = pd.merge(genes2, peaks, 
                  left_on='peak', right_on='name', how='left')

#### RNA-Seq data

Finally, we're ready to read in the table of RNA-Seq results.

In [None]:
results = pd.read_csv("full.results.csv",
                     index_col=0)

Merge the RNA-Seq results into the gene table by gene name. The default inner join keeps only genes that appear in both tables.

In [None]:
genes4 = pd.merge(genes3, results,
                  left_on='name_x', right_index=True)

### Hsf1 Target Genes

Here we collect the genes that have both a ChIP-Seq peak and a significant expression change into a subset called `targets`.
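
One way to do this is to combine two boolean masks. The sketch below runs on a made-up stand-in for the merged table (`toy_genes4` and its values are hypothetical). It assumes the RNA-Seq results carry an adjusted p-value column named `padj` and uses 0.05 as the significance cutoff; both are assumptions, so adjust them to match the actual results table.

```python
import pandas as pd

# Toy stand-in for the merged gene table (values are made up).
toy_genes4 = pd.DataFrame({
    "name_x": ["geneA", "geneB", "geneC", "geneD"],
    "peak": ["peak_1", None, "peak_2", None],
    "padj": [0.001, 0.0004, 0.60, 0.90],  # assumed column name
})

# A gene has a ChIP-Seq peak if the left merge filled in its `peak` column.
has_peak = toy_genes4["peak"].notna()
# Assumed significance criterion: adjusted p-value below 0.05.
is_significant = toy_genes4["padj"] < 0.05

# Target genes satisfy both conditions.
targets = toy_genes4[has_peak & is_significant]
print(targets)
```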

We want a list of target genes for functional analysis. The gene names can be found in the `name_x` column.

We want a simple listing of gene names, which we can produce by calling the `to_string()` method on the column with `index=False` to turn off the "index", i.e., the row labels.

This generates a string, which we need to `print(...)`.
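
As a small illustration, here is that call on a made-up `targets` table (the gene names are placeholders, not real results):

```python
import pandas as pd

# Toy stand-in for the `targets` table (placeholder gene names).
targets = pd.DataFrame({"name_x": ["geneA", "geneB", "geneC"]})

# to_string() returns one string with one gene name per line;
# index=False leaves out the row labels.
listing = targets["name_x"].to_string(index=False)
print(listing)
```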

### RNA-Seq enrichment analysis

We can also run an enrichment analysis based just on the RNA-Seq data.

To do this, we write a table of genes and expression changes.

We want to exclude genes that are not expressed at all under any condition. Create a table of `present` genes that are above a cutoff `baseMean` value.
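
A minimal sketch of this filter on a made-up results table follows. The `toy_results` values, the extra column names `log2FoldChange` and `padj`, and the cutoff of 10 are all assumptions chosen for illustration; pick a cutoff that suits the real data.

```python
import pandas as pd

# Toy stand-in for the RNA-Seq results, indexed by gene name (values are made up).
toy_results = pd.DataFrame(
    {
        "baseMean": [0.0, 3.2, 250.0, 1200.0],
        "log2FoldChange": [float("nan"), 0.4, 2.5, -1.25],  # assumed column name
        "padj": [float("nan"), 0.8, 0.001, 0.03],           # assumed column name
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

# Arbitrary illustrative cutoff on mean expression.
base_mean_cutoff = 10

# Keep only the genes expressed above the cutoff.
present = toy_results[toy_results["baseMean"] > base_mean_cutoff]
print(present)
```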

See how many significantly changed genes show up in this analysis.
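
Counting can be done by summing a boolean mask. The sketch below uses a made-up `present` table and again assumes an adjusted p-value column named `padj` with a 0.05 cutoff; neither is confirmed by the results file.

```python
import pandas as pd

# Toy stand-in for the `present` table (values are made up).
present = pd.DataFrame(
    {
        "baseMean": [250.0, 1200.0, 40.0],
        "log2FoldChange": [2.5, -1.25, 0.1],
        "padj": [0.001, 0.03, 0.7],  # assumed column name
    },
    index=["geneC", "geneD", "geneE"],
)

# True counts as 1 when summed, so this is the number of significant genes.
n_significant = int((present["padj"] < 0.05).sum())
print(n_significant, "of", len(present), "present genes are significant")
```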

Extract the column of expression changes.

Write a file of expression changes, using the `sep` parameter to make a tab-delimited text file rather than the default CSV.
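
The last two steps might look like the sketch below, which runs on a made-up `present` table and writes into a temporary directory. The column name `log2FoldChange`, the file name `expression_changes.txt`, and the choice to omit the header row are assumptions for illustration; check what format the enrichment tool expects.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the `present` table, indexed by gene name (values are made up).
present = pd.DataFrame(
    {"baseMean": [250.0, 1200.0], "log2FoldChange": [2.5, -1.25]},
    index=["geneC", "geneD"],
)

# Extract the column of expression changes (assumed column name).
changes = present["log2FoldChange"]

# Write gene name and expression change as tab-delimited text.
# The output path is hypothetical; a temporary directory keeps the example tidy.
out_path = Path(tempfile.mkdtemp()) / "expression_changes.txt"
changes.to_csv(out_path, sep="\t", header=False)

print(out_path.read_text())
```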