## RNA-Seq Counts

We have two tables of RNA-Seq read counts now, and we'll have four after we process all of the samples.

We'll read in these count tables using pandas and set up the input for DESeq2.

The count tables are tab-delimited text tables.

They have no headers, and we'll want to give meaningful names to the columns. In particular, we'll be constructing a multi-column data frame with one column per sample. In order to make this easier, the name of the counts column should match the name of the sample.
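As a minimal sketch of this step, the example below reads one header-less, tab-delimited table and names its columns. The inline text stands in for a real count file, and the read counts in it are made up for illustration; `rapa_0` is used as the sample name.

```python
import io
import pandas as pd

# Hypothetical stand-in for one count file: tab-delimited, no header row.
rapa_0_text = "YAL001C\t52\nYAL005C\t9100\n__no_feature\t31000\n"

# header=None says the first line is data, not column names;
# names= supplies the labels, with the counts column named after the sample.
rapa_0 = pd.read_csv(io.StringIO(rapa_0_text), sep="\t", header=None,
                     names=["gene", "rapa_0"])
print(rapa_0)
```

With a real file, the file name would take the place of the `io.StringIO(...)` argument.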

Now, we can use the `merge` function to combine the data frames.

We match up rows from different count tables according to the `gene` column.
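A small sketch of the merge, assuming two per-sample data frames named `rapa_0` and `rapa_60` (toy values, not real counts):

```python
import pandas as pd

# Toy per-sample tables; each has a `gene` column and one counts column.
rapa_0 = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "__no_feature"],
                       "rapa_0": [52, 9100, 31000]})
rapa_60 = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "__no_feature"],
                        "rapa_60": [61, 1200, 28000]})

# Rows are paired up wherever the `gene` values agree.
counts = pd.merge(rapa_0, rapa_60, on="gene")
print(counts)
```

By default `merge` keeps only genes present in both tables (an inner join), which is what we want when every table lists the same genes.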

In addition to all the "real" genes, we also have entries in our table that count up the reads that don't match any gene (`__no_feature`), the reads that could match two or more genes (`__ambiguous`, not many of these), and so forth.

We can search for those rows using the `.str.contains()` method on the `.gene` column.

We can *remove* all of those unwanted rows from our counts matrix by keeping only the rows whose gene name does *not* contain a `_`. The `~` operator negates the result of `.str.contains()`, turning "contains" into "does not contain".
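Both the search and the removal can be sketched on a toy table. The data frame name `counts` and the values are assumptions for illustration.

```python
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "__no_feature", "__ambiguous"],
                       "rapa_0": [52, 9100, 31000, 40],
                       "rapa_60": [61, 1200, 28000, 35]})

# Boolean Series: True for rows whose gene name contains an underscore.
is_special = counts.gene.str.contains("_")

# The special counter rows on their own.
special = counts[is_special]

# ~ flips True and False, so this keeps only the "real" genes.
genes_only = counts[~is_special]
print(genes_only)
```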

Next we'll look at the highest-expression genes using the `.sort_values()` method.
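For instance, sorting a toy `counts` table by its `rapa_0` column, highest first, might look like this (values are illustrative):

```python
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W", "YAL012W"],
                       "rapa_0": [3, 9100, 400, 150],
                       "rapa_60": [12, 1200, 2000, 140]})

# ascending=False puts the highest-expression genes at the top.
by_expression = counts.sort_values("rapa_0", ascending=False)
print(by_expression.head())
```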

### Getting ready for DESeq2

At this point, we'll write a CSV file with the counts matrix so that we can analyze it in DESeq2.

We'll also need a "conditions" matrix that describes the samples, for use with DESeq2.

We will write this to a CSV file too.
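One way to write both files is sketched below. The `condition` column, its labels, and the file names are hypothetical choices, and the files go to a temporary directory only so the sketch has no side effects. The useful property is that the row names of the conditions table match the column names of the counts matrix.

```python
import os
import tempfile
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C"],
                       "rapa_0": [52, 9100],
                       "rapa_60": [61, 1200]})

# One row per sample; the `condition` column and its labels are hypothetical.
conditions = pd.DataFrame({"condition": ["untreated", "rapamycin"]},
                          index=["rapa_0", "rapa_60"])
conditions.index.name = "sample"

# Temporary directory for the sketch; substitute your own output path.
outdir = tempfile.mkdtemp()
counts_path = os.path.join(outdir, "counts.csv")
conditions_path = os.path.join(outdir, "conditions.csv")

# Genes become row names, so every remaining column is one sample.
counts.set_index("gene").to_csv(counts_path)
conditions.to_csv(conditions_path)
```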

### Manual analysis

Before running DESeq2, let's simply *look* at our data, using matplotlib.

We start by making a scatter plot of the read counts in these two samples.

Because of the very wide spread in gene expression levels, we usually make these plots with log-scaled axes. The `loglog()` function turns on log-scaled axes.
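A sketch of such a plot on toy counts follows; the axis labels are our own additions. Note that genes with zero reads in either sample cannot be placed on a log axis.

```python
import matplotlib.pyplot as plt
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W", "YAL012W"],
                       "rapa_0": [3, 9100, 400, 150],
                       "rapa_60": [12, 1200, 2000, 140]})

# One point per gene: x is the count in rapa_0, y the count in rapa_60.
plt.scatter(counts.rapa_0, counts.rapa_60)

# Called with no arguments, loglog() just switches both axes to log scale.
plt.loglog()
plt.xlabel("rapa_0 reads")
plt.ylabel("rapa_60 reads")
```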

In order to hunt down genes that change a lot between the two conditions, we'll compute the *ratio* between the counts, as a new column.

We can sort the ratios to look for really extreme values.
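As a sketch, the ratio can be stored in a new column (called `ratio` here, an assumed name) and then sorted. The toy values are illustrative.

```python
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W"],
                       "rapa_0": [50, 9000, 4],
                       "rapa_60": [100, 900, 1]})

# rapa_60 over rapa_0: values below 1 mean the count went down.
# A gene with zero rapa_0 reads would give inf (or NaN for 0/0).
counts["ratio"] = counts.rapa_60 / counts.rapa_0

# Smallest ratios first; use ascending=False to see the largest instead.
by_ratio = counts.sort_values("ratio")
print(by_ratio)
```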

Most of the extreme values show up when the absolute number of reads is very small. These probably reflect statistical noise rather than real biology: with only a handful of reads, a difference of a few counts produces a large ratio.

Instead, let's return to sorting genes according to their expression in the unperturbed experiment, taking the 1000 highest-expression genes.

The ratio values are all pretty close to 1 here. Maybe there are some high-expression genes with ratio values very different from 1. What we really want to know is: among all highly expressed genes, which have the most extreme ratios?

To do this, we can first sort by `rapa_0` expression, take the top 1000, and then sort those genes by ratio.
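The two-stage sort can be sketched as below. The toy table has only four genes, so `n_top` is set to 2 here where the text uses 1000; the `ratio` column name is an assumption.

```python
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W", "YAL012W"],
                       "rapa_0": [3, 9100, 400, 150],
                       "rapa_60": [12, 1200, 2000, 140]})
counts["ratio"] = counts.rapa_60 / counts.rapa_0

n_top = 2  # the text uses 1000; 2 is only for this toy table

# Stage 1: keep the n_top genes with the highest rapa_0 counts.
top = counts.sort_values("rapa_0", ascending=False).head(n_top)

# Stage 2: within those, order by ratio so the extremes sit at either end.
top = top.sort_values("ratio")
print(top)
```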

Now we see YAL005C near the top: high expression and a very low `rapa_60` over `rapa_0` ratio.

We can also plot a histogram of these ratios across all highly expressed genes, using `plt.hist()`.
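A sketch of the histogram, using a toy table in place of the top-expression subset. The name `top`, the bin count, and the axis labels are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the highly expressed genes.
top = pd.DataFrame({"gene": ["YAL005C", "YAL003W", "YAL012W", "YAL038W"],
                    "rapa_0": [9100, 400, 150, 7000],
                    "rapa_60": [1200, 2000, 140, 6500]})
top["ratio"] = top.rapa_60 / top.rapa_0

# hist() returns the per-bin counts, the bin edges, and the drawn bars.
n, edges, bars = plt.hist(top.ratio, bins=20)
plt.xlabel("rapa_60 / rapa_0")
plt.ylabel("number of genes")
```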

We have been sorting according to `rapa_0` expression, but this is asymmetric. We'd miss a gene that had low expression in `rapa_0` and went up a lot in `rapa_60`.

It would be more fair to look at the *average* read count between the two samples.
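A minimal sketch of the average, stored in a new column named `average` (an assumed name), on toy counts:

```python
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W", "YAL012W"],
                       "rapa_0": [3, 9100, 400, 150],
                       "rapa_60": [12, 1200, 2000, 140]})

# axis=1 averages across the two sample columns within each row (gene).
counts["average"] = counts[["rapa_0", "rapa_60"]].mean(axis=1)
print(counts)
```

Unlike sorting on `rapa_0` alone, this treats the two samples symmetrically: YAL003W scores well here even though most of its reads come from `rapa_60`.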

From here, we can plot the average on the x-axis and the ratio on the y-axis. This is called an "MA" plot.
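A sketch of this plot on toy counts is shown below. Putting both axes on a log scale is our assumption here, chosen to match the earlier scatter plot; the column names `average` and `ratio` are also assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W", "YAL012W"],
                       "rapa_0": [3, 9100, 400, 150],
                       "rapa_60": [12, 1200, 2000, 140]})
counts["average"] = (counts.rapa_0 + counts.rapa_60) / 2
counts["ratio"] = counts.rapa_60 / counts.rapa_0

# x: overall expression level; y: change between the two samples.
plt.scatter(counts.average, counts.ratio)
plt.loglog()  # assumed: log scale on both axes
plt.xlabel("average read count")
plt.ylabel("rapa_60 / rapa_0")
```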

It has a characteristic arrowhead shape. The big spread on the left-hand side means that the ratios vary widely, in both directions, when the average is small.

We will pick an arbitrary cutoff of 100 reads and pick out a "good" data set of genes with average expression over 100.

Now, we can look at the extreme changes in this data set.
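The cutoff and the sort can be sketched together. The names `good`, `average`, and `ratio` are assumptions, and the counts are toy values.

```python
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C", "YAL003W", "YAL012W"],
                       "rapa_0": [3, 9100, 400, 150],
                       "rapa_60": [12, 1200, 2000, 140]})
counts["average"] = (counts.rapa_0 + counts.rapa_60) / 2
counts["ratio"] = counts.rapa_60 / counts.rapa_0

# Keep genes whose average read count is above the (arbitrary) cutoff.
good = counts[counts.average > 100]

# Lowest ratios first; the largest increases are at the bottom.
good_by_ratio = good.sort_values("ratio")
print(good_by_ratio)
```

YAL001C has a ratio of 4 but is dropped, because that ratio rests on only a handful of reads.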

The systematic gene names aren't very useful to us.

I like to "annotate" these data sets by merging them with a data frame containing the names and brief descriptions of yeast genes.

We'll load this data set from a tab-delimited file.

Now, we'll merge the expression data with the _Saccharomyces_ Genome Database (SGD) table. The `sgd` data frame doesn't have column names, so we'll refer to the column we're matching on by its number, 3, instead.
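The sketch below shows how a merge on a numbered column can work. The inline text is a simplified, illustrative stand-in for the annotation file, with the systematic name placed in column 3 to match the text; the real file has more columns, and its exact layout is not shown here.

```python
import io
import pandas as pd

counts = pd.DataFrame({"gene": ["YAL001C", "YAL005C"],
                       "rapa_0": [52, 9100],
                       "rapa_60": [61, 1200]})

# Illustrative header-less annotation rows (assumed layout, toy subset).
sgd_text = ("S000000001\tORF\tVerified\tYAL001C\tTFC3\n"
            "S000000004\tORF\tVerified\tYAL005C\tSSA1\n")

# With header=None and no names=, the columns are labeled 0, 1, 2, ...
sgd = pd.read_csv(io.StringIO(sgd_text), sep="\t", header=None)

# Match our `gene` column against the integer-labeled column 3 of sgd.
annotated = pd.merge(counts, sgd, left_on="gene", right_on=3)
print(annotated)
```

Because the default is an inner join, any gene missing from the annotation table would be dropped from the result; passing `how="left"` keeps those rows instead.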

Now we can repeat our analysis: pick genes with average over 100 and sort the extreme ratio values.