## ChIP-Seq and RNA-Seq

We now want to combine our ChIP-Seq and RNA-Seq data, using pandas.

### Gene annotations

We can read our GFF file of gene annotations. It's just a text table.

There are no headers, though. The columns of the file are:
```
['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
```
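As a minimal sketch of this step, the block below reads a tiny GFF-style table with `pd.read_csv`. The three rows and their coordinates are made up so that the example runs on its own; in the real analysis the first argument would be the path to the GFF file. The `comment='#'` option is used here on the assumption that header lines such as `##gff-version 3` start with `#`.

```python
import io
import pandas as pd

# Made-up GFF excerpt (toy coordinates), standing in for the real annotation file.
gff_text = (
    "##gff-version 3\n"
    "chrI\ttoy\tgene\t2000\t3000\t.\t+\t.\tID=YAL069W;Name=YAL069W\n"
    "chrI\ttoy\tCDS\t2000\t3000\t.\t+\t0\tParent=YAL069W_mRNA\n"
    "chrI\ttoy\tgene\t5000\t6500\t.\t-\t.\tID=YAL005C;Name=YAL005C\n"
)

gff_columns = ['seqid', 'source', 'type', 'start', 'end',
               'score', 'strand', 'phase', 'attributes']

annot = pd.read_csv(
    io.StringIO(gff_text),  # replace with the GFF file path in practice
    sep='\t',               # GFF is tab-delimited
    header=None,            # the file has no header row
    names=gff_columns,      # so we supply the column names ourselves
    comment='#',            # skip '#' header lines
)
print(annot)
```

One caveat with `comment='#'`: pandas also truncates any data line at a `#` character, so this shortcut is only safe if no attribute value contains one.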

There are many different "types" of entries in the file. We can use the `value_counts()` method on the `type` column to see all of them.

We only want the `"gene"` type of entry, so we'll select just these rows from the data frame.
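Counting the entry types and then keeping only the gene rows might look like the sketch below. The small `annot` data frame is a toy stand-in for the annotation table, and `genes` is just an illustrative variable name.

```python
import pandas as pd

# Toy stand-in for the annotation table (values are made up).
annot = pd.DataFrame({
    'type': ['chromosome', 'gene', 'CDS', 'mRNA', 'gene', 'CDS'],
    'start': [1, 2000, 2000, 2000, 5000, 5000],
    'end': [10000, 3000, 3000, 3000, 6500, 6500],
})

# How many rows of each type are there?
type_counts = annot['type'].value_counts()
print(type_counts)

# Boolean mask selects the "gene" rows; .copy() gives an independent
# data frame so that adding columns later does not trigger warnings.
genes = annot[annot['type'] == 'gene'].copy()
print(genes)
```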

The systematic name of the gene is buried inside the `attributes` column. This column has a bunch of labeled data in the form _key_=_value_.

We can extract it using the `.str.split()` method and then use `.str.replace()` to get rid of the unwanted `ID=` part of it to create a `name` column.
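Here is a small sketch of that extraction. It assumes the fields in `attributes` are separated by `;` and that the `ID=` field comes first, which is typical for GFF3 but worth checking against the actual file. The two attribute strings are made up.

```python
import pandas as pd

# Toy attribute strings in key=value form (made up for illustration).
genes = pd.DataFrame({
    'attributes': [
        'ID=YAL069W;Name=YAL069W',
        'ID=YAL005C;Name=YAL005C;gene=example',
    ]
})

# Split each string on ';' and keep the first field, e.g. 'ID=YAL069W'.
first_field = genes['attributes'].str.split(';').str[0]

# Remove the literal 'ID=' prefix; regex=False treats it as plain text.
genes['name'] = first_field.str.replace('ID=', '', regex=False)
print(genes['name'])
```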

The data table lists the coordinates of genes, but we want to look for ChIP-Seq peaks in the promoter regions. For a `+` strand gene, the promoter is to the left (smaller coordinate numbers) and for a `-` strand gene it is to the right.

We'll use a 1 kb window here.

For a `+` strand gene, the promoter is _start_-1000 to _start_-1.

For a `-` strand gene, the promoter is _end_+1 to _end_+1000.

We can use the `np.where(...)` function from `numpy` to handle this situation. We'll compute the starting position for each promoter in the `prmstart` column.

We'll be sure to check our coordinate math for genes on each strand.

Next, we'll make a `prmend` column for the last base of each 1 kb window. With the inclusive coordinates above, that is `prmstart` + 999 on either strand, so we don't need to do anything different based on the strand here.
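The promoter arithmetic can be sketched on two toy genes, one per strand. The gene names and coordinates are made up, and the window end is computed as `prmstart + 999` on the assumption that both ends of the 1000-base window are inclusive, matching the ranges given above.

```python
import numpy as np
import pandas as pd

# One toy gene per strand (made-up coordinates).
genes = pd.DataFrame({
    'name': ['plus_gene', 'minus_gene'],
    'strand': ['+', '-'],
    'start': [5000, 8000],
    'end': [6000, 9000],
})

# np.where(condition, value_if_true, value_if_false), evaluated row by row:
# '+' strand -> promoter begins 1000 bases before the gene start
# '-' strand -> promoter begins 1 base after the gene end
genes['prmstart'] = np.where(genes['strand'] == '+',
                             genes['start'] - 1000,
                             genes['end'] + 1)

# Inclusive end of a 1000-base window, the same on both strands.
genes['prmend'] = genes['prmstart'] + 999
print(genes)
```

For the `+` gene this gives 4000 to 4999 (that is, _start_-1000 to _start_-1) and for the `-` gene 9001 to 10000 (_end_+1 to _end_+1000), which is the check on the coordinate math described above.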

### ChIP-Seq Peaks

Now, we'll read in the table of ChIP-Seq peaks.

At this point we need to loop over each peak and find out which promoter(s) it affects.

We'll do this in a multi-step process.
1. Generate a dictionary where keys are gene names and values are associated ChIP-Seq peak names. There won't be an entry for every gene.
1. Convert this dictionary into a pandas `Series` and merge it into the genes data frame as a new column holding peak names. It will have many `NaN` entries for "missing" data.
1. Merge the whole table of peaks (with enrichment and p-values) into the gene table based on the names.

First, we will loop over each peak, using the `.itertuples()` method. In class, we'll just use the top 10 peaks by significance.
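The first step could be sketched as follows. Everything about the peaks table here is an assumption made for illustration: the column names `chrom`, `summit`, and `neg_log10_pvalue` are hypothetical, and a peak is counted as hitting a promoter when its summit falls inside the promoter window, which is only one of several reasonable overlap rules.

```python
import pandas as pd

# Toy promoter table (made-up coordinates).
genes = pd.DataFrame({
    'name': ['geneA', 'geneB', 'geneC'],
    'seqid': ['chrI', 'chrI', 'chrII'],
    'prmstart': [4000, 9001, 4000],
    'prmend': [4999, 10000, 4999],
})

# Toy peaks table; column names are hypothetical.
peaks = pd.DataFrame({
    'name': ['peak_1', 'peak_2', 'peak_3'],
    'chrom': ['chrI', 'chrII', 'chrI'],
    'summit': [4500, 4200, 20000],
    'neg_log10_pvalue': [50.0, 80.0, 10.0],
})

# Most significant peaks first, keeping at most 10.
top_peaks = peaks.sort_values('neg_log10_pvalue', ascending=False).head(10)

gene_to_peak = {}
for peak in top_peaks.itertuples():
    # Promoters on the same chromosome whose window contains the summit.
    hits = genes[(genes['seqid'] == peak.chrom)
                 & (genes['prmstart'] <= peak.summit)
                 & (genes['prmend'] >= peak.summit)]
    for gene_name in hits['name']:
        # setdefault keeps the first (most significant) peak for each gene.
        gene_to_peak.setdefault(gene_name, peak.name)

print(gene_to_peak)
```

Each row from `.itertuples()` is a named tuple, so columns are read as attributes such as `peak.chrom`. Genes with no peak, like `geneB` here, simply never get a key.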

Now we need to convert this into a `Series` and give it a name.

We can merge our named `Series` into the data frame of genes. We want to match up the `name` column in the genes table with the "index" of the `Series`.

To make sure this worked, check on the row for the gene `YAL005C`, which does have a peak.
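A sketch of the conversion and merge on toy data is shown below. The dictionary contents and the peak names are made up (including the peak assigned to `YAL005C`), and `genes2` is an illustrative name for the merged table. The `Series` must be named because the name becomes the new column's label.

```python
import pandas as pd

# Toy genes table and a toy gene -> peak dictionary (made-up values).
genes = pd.DataFrame({
    'name': ['YAL005C', 'geneB', 'geneC'],
    'strand': ['-', '+', '+'],
})
gene_to_peak = {'YAL005C': 'peak_1', 'geneC': 'peak_2'}

# Dictionary keys become the index; name='peak' becomes the column label.
peak_series = pd.Series(gene_to_peak, name='peak')

# Left join: keep every gene, matching genes['name'] to the Series index.
genes2 = genes.merge(peak_series, how='left',
                     left_on='name', right_index=True)

# Spot-check a gene that should have a peak.
print(genes2[genes2['name'] == 'YAL005C'])
```

Using `how='left'` keeps genes without a peak in the table, with `NaN` in the `peak` column.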

Now, we will merge in the peaks table by matching up the `peak` column with the `name` column in the peaks table.

Again, we'll need to check on a row that has a peak to be sure it worked.
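This second merge can be sketched with toy tables as below. The `fold_enrichment` values are made up, and `genes3` is an illustrative name. Because both tables have a `name` column that is not used as a shared key, pandas disambiguates them with the default suffixes `_x` (left table) and `_y` (right table).

```python
import pandas as pd

# Toy genes table that already carries a 'peak' column (made-up values).
genes2 = pd.DataFrame({
    'name': ['geneA', 'geneB', 'geneC'],
    'peak': ['peak_1', None, 'peak_2'],
})

# Toy peaks table with one statistic per peak.
peaks = pd.DataFrame({
    'name': ['peak_1', 'peak_2', 'peak_3'],
    'fold_enrichment': [12.5, 30.1, 4.2],
})

# Match genes2['peak'] to peaks['name'], keeping every gene.
genes3 = genes2.merge(peaks, how='left', left_on='peak', right_on='name')

print(genes3.columns.tolist())
# Spot-check a gene that should have a peak.
print(genes3[genes3['name_x'] == 'geneC'])
```

After this step the gene name lives in `name_x` and the peak name appears in both `peak` and `name_y`.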

### RNA-Seq data

Finally, we're ready to read in the table of RNA-Seq results.

Again we can pull out some statistically significant genes to be sure it worked. Notice that almost every significant gene is down-regulated.

Finally, we're ready to merge the results with the genes.

We'll match up the `name_x` column of the genes (which was renamed because the peak table also had a name column) with the "index" of the results table.
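As a sketch, the sanity check and the merge might look like this on toy data. The `padj` column name and the 0.01 cutoff are assumptions made for illustration, all numbers are made up, and `genes4` follows the variable name used later in this section.

```python
import pandas as pd

# Toy genes table after the peak merge (made-up values).
genes3 = pd.DataFrame({
    'name_x': ['geneA', 'geneB', 'geneC'],
    'fold_enrichment': [12.5, float('nan'), 30.1],
})

# Toy RNA-Seq results, indexed by gene name; 'padj' is an assumed column name.
results = pd.DataFrame(
    {'log2FoldChange': [-2.1, 0.1, -1.4], 'padj': [1e-8, 0.9, 1e-4]},
    index=['geneA', 'geneB', 'geneC'],
)

# Sanity check: pull out the statistically significant genes.
significant = results[results['padj'] < 0.01]
print(significant)

# Match genes3['name_x'] against the index of the results table.
genes4 = genes3.merge(results, how='left',
                      left_on='name_x', right_index=True)
print(genes4)
```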

Let's check out the ChIP-Seq genes in the RNA-Seq data.

Of course, most genes have a `NaN` missing value for fold enrichment. We can use the `pd.isna()` function to test whether a value is `NaN` or not.

Rows where `fold_enrichment` is _not_ `NaN` are genes with a potential ChIP-Seq peak.
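A short sketch of this filter on toy data follows. A dedicated test is needed because `NaN` never compares equal to anything, including itself, so `== np.nan` would not find the missing values.

```python
import numpy as np
import pandas as pd

# Toy merged table; NaN marks genes without a ChIP-Seq peak (made-up values).
genes4 = pd.DataFrame({
    'name_x': ['geneA', 'geneB', 'geneC', 'geneD'],
    'fold_enrichment': [12.5, np.nan, 30.1, np.nan],
})

# pd.isna gives True where the value is missing; ~ inverts the mask.
has_peak = ~pd.isna(genes4['fold_enrichment'])

chip_genes = genes4[has_peak]
print(chip_genes)
```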

Many of these genes have significant adjusted p-values and negative log fold-changes. Let's look at this trend more rigorously by plotting the histogram of fold-changes for these groups.

Start by importing matplotlib.pyplot.

We'll make a histogram of `log2FoldChange` values for all genes.

Then, we'll make a similar histogram, for genes with a ChIP-Seq peak.

We'll use the `range` parameter so the histograms are easier to compare.
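The two histograms could be drawn as in the sketch below, which plots `log2FoldChange` (the fold-change column used in the statistical test later on). The data are randomly generated stand-ins, and the bin count and the `(-4, 4)` range are arbitrary choices for illustration. Passing the same `range` and `bins` to both calls gives identical bin edges, which is what makes the panels directly comparable.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data: 200 genes, the first 20 with a peak
# and a downward shift in fold-change.
rng = np.random.default_rng(0)
log2fc = rng.normal(0, 1, 200)
log2fc[:20] -= 1.5
genes4 = pd.DataFrame({
    'log2FoldChange': log2fc,
    'fold_enrichment': np.where(np.arange(200) < 20, 10.0, np.nan),
})
with_peak = genes4[~pd.isna(genes4['fold_enrichment'])]

shared_range = (-4, 4)  # same x-range for both histograms
fig, (ax_all, ax_chip) = plt.subplots(2, 1, sharex=True)

counts_all, edges_all, _ = ax_all.hist(
    genes4['log2FoldChange'].dropna(), bins=40, range=shared_range)
ax_all.set_title('All genes')

counts_chip, edges_chip, _ = ax_chip.hist(
    with_peak['log2FoldChange'].dropna(), bins=40, range=shared_range)
ax_chip.set_title('Genes with a ChIP-Seq peak')
ax_chip.set_xlabel('log2FoldChange')

fig.tight_layout()
```

Values outside `range` are left out of the histogram, so the range should be wide enough to cover the bulk of the data.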

The two distributions look different, but is the difference statistically significant?

We'll use the `mannwhitneyu` function from `scipy.stats` to run a statistical test. To do this, we'll need to remove the `NaN` values using the `.dropna()` method.

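```
import scipy.stats as stats

# log2 fold-changes for genes without a ChIP-Seq peak (fold_enrichment is NaN)
nochip = genes4[pd.isna(genes4['fold_enrichment'])]['log2FoldChange'].dropna()
# log2 fold-changes for genes with a ChIP-Seq peak
yeschip = genes4[~pd.isna(genes4['fold_enrichment'])]['log2FoldChange'].dropna()

# Compare the two distributions with a Mann-Whitney U test
stats.mannwhitneyu(nochip, yeschip)
```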

Finally, we'll make a table of the high-confidence targets that have a ChIP-Seq peak and an expression change.
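One way to build such a table is sketched below on toy data. What counts as "high-confidence" is not fixed here, so the `padj < 0.01` cutoff and the `padj` column name are assumptions, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy merged table (made-up values); 'padj' is an assumed column name.
genes4 = pd.DataFrame({
    'name_x': ['geneA', 'geneB', 'geneC', 'geneD'],
    'fold_enrichment': [12.5, np.nan, 30.1, 8.0],
    'log2FoldChange': [-2.1, -3.0, -1.4, 0.05],
    'padj': [1e-8, 1e-6, 1e-4, 0.8],
})

has_peak = ~pd.isna(genes4['fold_enrichment'])  # gene has a ChIP-Seq peak
is_changed = genes4['padj'] < 0.01              # assumed significance cutoff

# Keep genes meeting both criteria, most significant first.
targets = (genes4[has_peak & is_changed]
           .sort_values('padj')
           [['name_x', 'fold_enrichment', 'log2FoldChange', 'padj']])
print(targets)
```

Here `geneB` is dropped because it has no peak and `geneD` because its expression change is not significant. Rows with a missing `padj` would also be dropped, since comparisons against `NaN` are `False`.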