# Gene expression data

In [1]:
import pandas as pd

First, let's import the gene expression data.

In [14]:
gasch_data = pd.read_csv("complete_dataset_gasch.txt", sep="\t")
gasch_data = gasch_data.iloc[:, :11]  # Keep UID, NAME, GWEIGHT and the 8 time points of the first heat-shock experiment (hs-1).
gasch_data

Unnamed: 0,UID,NAME,GWEIGHT,Heat Shock 05 minutes hs-1,Heat Shock 10 minutes hs-1,Heat Shock 15 minutes hs-1,Heat Shock 20 minutes hs-1,Heat Shock 30 minutes hs-1,Heat Shock 40 minutes hs-1,Heat Shock 60 minutes hs-1,Heat Shock 80 minutes hs-1
0,YAL001C,YAL001C TFC3 TRANSCRIPTION ...,1,1.53,-0.06,0.58,0.52,0.42,0.16,0.79,
1,YAL002W,YAL002W VPS8 VACUOLAR PROTEIN TARGETIN...,1,-0.01,-0.30,0.23,0.01,-0.15,0.45,-0.04,0.14
2,YAL003W,YAL003W EFB1 PROTEIN SYNTHESIS ...,1,0.15,-0.07,-0.25,-0.30,-1.12,-0.67,-0.15,-0.43
3,YAL004W,YAL004W UNKNOWN ...,1,0.24,0.76,0.20,0.34,0.11,0.07,0.01,0.36
4,YAL005C,YAL005C SSA1 ER AND MITOCHONDRIAL TRAN...,1,2.85,3.34,,,,,,
5,YAL007C,YAL007C ERP2 MEMBRANE TRAFFICKING; SEC...,1,-0.22,-0.12,-0.29,-0.51,-0.81,-0.47,0.28,-0.10
6,YAL008W,YAL008W FUN14 UNKNOWN ...,1,0.19,0.25,0.69,0.34,0.65,0.48,,
7,YAL009W,YAL009W SPO7 MEIOSIS ...,1,0.23,0.05,0.18,-0.15,-0.06,-0.19,-0.20,
8,YAL010C,YAL010C MDM10 MITOCHONDRIAL BIOGENESIS ...,1,0.03,-0.23,0.33,,0.23,-0.20,,0.29
9,YAL011W,YAL011W UNKNOWN ...,1,0.01,-0.12,0.00,-0.22,,-0.34,0.45,-0.10


From the Gasch website we have this description:

*"The data contained in all files represents the normalized, background-corrected log2 values of the Red/Green ratios measured on the DNA microarrays. Data from the figures have been mathematically transformed as described in Materials and Methods, such that each sample described is compared to the unstressed cells. The untransformed data can be accessed in the Complete Dataset file, with descriptions of each microarray reference in the online version of Materials and Methods. "*

I need to reread the Materials and Methods of the Gasch paper to confirm this, but I think each reported value is log2(expression at time t) - log2(expression at time 0), i.e. the log2 fold change relative to the unstressed cells.
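To make that reading concrete, here is a minimal sketch of the arithmetic, assuming my interpretation is right (I still need to check it against the paper). The function names `log2_ratio` and `fold_change` are mine, not anything from the Gasch files.

```python
import math

def log2_ratio(expression_t, expression_0):
    """log2(expression at time t) - log2(expression at time 0)."""
    return math.log2(expression_t) - math.log2(expression_0)

def fold_change(log2_value):
    """Convert a log2 ratio back into a plain fold change."""
    return 2.0 ** log2_value

# A gene whose expression goes from 2 to 8 (arbitrary units) has a log2 ratio of 2,
# which corresponds to a 4-fold increase.
ratio = log2_ratio(8.0, 2.0)
print(ratio, fold_change(ratio))

# Under this reading, the 1.53 reported for YAL001C at 5 minutes would be
# 2**1.53, i.e. a bit less than 3-fold induction.
print(fold_change(1.53))
```

Under the same reading, a value of 0 means no change and negative values mean repression.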

We have expression data for 6152 genes, although quite a few data points are missing. Note that all the genes are identified by their systematic ORF names (e.g. YAL001C), not by their common gene names. This does not match the format of the Yeastract data for the gene regulatory network, so we'll have to map the ORFs to the gene names.
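To put a number on "quite a few", something like the following could be run on `gasch_data`. This is only a sketch: `summarize_missing` is a helper name I made up, and the toy frame below stands in for the real data (its values are copied from the table above, with shortened column names).

```python
import pandas as pd

def summarize_missing(df, value_columns):
    """Return per-column NaN counts and the number of rows with no NaN at all."""
    values = df[value_columns]
    missing_per_column = values.isna().sum()
    complete_rows = int(values.notna().all(axis=1).sum())
    return missing_per_column, complete_rows

# Toy stand-in for gasch_data: 5-minute and 80-minute values for three ORFs.
nan = float("nan")
toy = pd.DataFrame({
    "UID": ["YAL001C", "YAL002W", "YAL005C"],
    "hs_05": [1.53, -0.01, 2.85],
    "hs_80": [nan, 0.14, nan],
})

missing_per_column, complete_rows = summarize_missing(toy, ["hs_05", "hs_80"])
print(missing_per_column)
print("rows with a complete time course:", complete_rows)
```

On the real data, `value_columns` would be the eight heat-shock columns, e.g. `gasch_data.columns[3:]`.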

I looked around a bit, and it seems that the Yeastract website has a way to do this, but we need to input a list of ORFs to get the gene names and vice versa. Because of this, I'll write the full list of ORF names to a text file that we can hopefully just copy-paste to get the gene names.

In [17]:
# Write one ORF name per line; the with-block closes the file automatically.
with open("ORFnames.txt", "w") as outfile:
    for orfname in gasch_data["UID"]:
        outfile.write(orfname + "\n")

I then pasted the contents of this file into the Yeastract form (http://www.yeastract.com/formorftogene.php) to get the list of common gene names. This mostly worked: only about 25 names came back unidentified, which isn't bad out of roughly 6100 genes, and we may be able to hunt those down individually. There is a problem, though: the output is an HTML table that is awkward to work with. I couldn't figure out how to paste it into LibreOffice Calc (a free spreadsheet program similar to Excel) without the last two columns disappearing. We need a way to get the data out of the page and into a CSV or some other sane format. It may also be possible to extract the data directly from the page's HTML source.
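One way to go from the HTML source to a CSV is sketched below, using only the standard library. I haven't checked how the Yeastract result page is actually structured, so this assumes a plain `<table>` made of `<tr>`, `<th>` and `<td>` elements; the class name `TableRowParser`, the function `html_table_to_csv`, and the sample table with its column headers are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

def html_table_to_csv(html_text):
    """Return the table rows found in html_text as CSV text."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()

# Made-up sample; the real page may use different columns and markup.
sample = """<table>
<tr><th>ORF</th><th>Gene</th></tr>
<tr><td>YAL001C</td><td><a href="#">TFC3</a></td></tr>
<tr><td>YAL002W</td><td>VPS8</td></tr>
</table>"""
print(html_table_to_csv(sample))
```

To use it on the real output, we would save the result page from the browser and pass the file's contents to `html_table_to_csv`. If the page contains several tables, their rows would all end up in the same CSV and would need to be filtered afterwards.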

Alternatively, I also found this list (https://www.uniprot.org/docs/yeast.txt), which maps all yeast ORFs to gene names (and lists every name when a gene has several). We could write a small script to look up the name for each ORF. I'd rather use the Yeastract list if we can, since it's more likely to match the names used in their network data.
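If we do end up using the UniProt list, the small script could look roughly like this. It is a sketch under assumptions I haven't verified against the file: that each data line starts with the gene name(s), separated by `;`, followed by the ORF name, and that ORFs follow the usual nuclear pattern (e.g. YAL001C). Mitochondrial ORFs would not match this pattern. The names `ORF_PATTERN`, `parse_orf_line` and `build_orf_map` are mine, and the sample lines are made up (`ALIAS1` is a placeholder, not a real alias).

```python
import re

# Systematic nuclear ORF names such as YAL001C or YAL064W-B (assumed pattern).
ORF_PATTERN = re.compile(r"\bY[A-P][LR]\d{3}[WC](?:-[A-Z])?\b")

def parse_orf_line(line):
    """Return (orf, [gene names]) for a data line, or None for any other line."""
    match = ORF_PATTERN.search(line)
    if match is None:
        return None
    orf = match.group(0)
    # Everything before the ORF is taken to be the ';'-separated gene names.
    names = [name.strip() for name in line[:match.start()].split(";") if name.strip()]
    if not names:
        # No separate gene name on the line: fall back to the ORF itself.
        names = [orf]
    return orf, names

def build_orf_map(lines):
    """Map each ORF found in lines to its list of gene names."""
    orf_map = {}
    for line in lines:
        parsed = parse_orf_line(line)
        if parsed is not None:
            orf, names = parsed
            orf_map[orf] = names
    return orf_map

# Made-up lines in the assumed layout; the remaining columns are elided.
sample_lines = [
    "Gene designations        OLN        ...",
    "TFC3; ALIAS1             YAL001C    ...",
    "VPS8                     YAL002W    ...",
]
print(build_orf_map(sample_lines))
```

Header and footer lines are skipped simply because they contain nothing that looks like an ORF, which is crude but avoids hard-coding line offsets.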

After doing this we should be able to map the gene expression data to the connectivity data.
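Once we have an ORF-to-gene mapping in some form, attaching it to the expression table should be straightforward. Here is a sketch, assuming the mapping ends up as a plain dict; `add_gene_names` and the `GENE` column are names I made up, and the toy data reuses values and gene names from the table above.

```python
import pandas as pd

def add_gene_names(expression, orf_to_gene, orf_column="UID"):
    """Return a copy of expression with a GENE column, plus the list of unmapped ORFs."""
    annotated = expression.copy()
    # ORFs missing from the dict become NaN in the new column.
    annotated["GENE"] = annotated[orf_column].map(orf_to_gene)
    unmapped = annotated.loc[annotated["GENE"].isna(), orf_column].tolist()
    return annotated, unmapped

# Toy stand-ins for gasch_data and the ORF-to-gene mapping.
toy_expression = pd.DataFrame({
    "UID": ["YAL001C", "YAL002W", "YAL004W"],
    "Heat Shock 05 minutes hs-1": [1.53, -0.01, 0.24],
})
toy_mapping = {"YAL001C": "TFC3", "YAL002W": "VPS8"}

annotated, unmapped = add_gene_names(toy_expression, toy_mapping)
print(annotated)
print("ORFs without a gene name:", unmapped)
```

Returning the unmapped ORFs separately makes it easy to see which genes we would still have to track down by hand.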