## Reading the data

In [1]:
import pandas as pd
import os.path

We will read the gene expression intensities, which were acquired with DNA microarrays:

In [2]:
df = pd.read_csv(os.path.join('data','data.txt'), comment="#", sep="\t")
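As a minimal sketch of what the `comment` and `sep` arguments do, the snippet below parses a small in-memory table. The table contents (header line, probe IDs, gene names, values) are invented for illustration and are not taken from `data.txt`.

```python
import io
import pandas as pd

# Hypothetical miniature of a tab-separated export that starts with a '#' line
toy_text = (
    "# made-up header line written by an export tool\n"
    "ID\tGene Symbol\t1 Avg (log2)\n"
    "probe_1\tGENE_A\t7.5\n"
    "probe_2\tGENE_B\t3.25\n"
)

# comment="#" skips the '#' line; sep="\t" splits the fields on tabs
toy_df = pd.read_csv(io.StringIO(toy_text), comment="#", sep="\t")

print(toy_df.columns.tolist())
print(toy_df.shape)
```

Note that `comment="#"` also cuts off a data line at the first `#`, so this approach assumes that no data field contains that character.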

Which columns are available?

In [3]:
df.columns

Index(['ID', '1 Avg (log2)', '2 Avg (log2)', '1 Standard Deviation',
       '2 Standard Deviation', '1 Expressed', '2 Expressed', 'Fold Change',
       'P-val', 'FDR P-val', 'Expressed in Both Conditions', 'Public Gene IDs',
       'Gene Symbol', 'Description', 'Chromosome', 'Strand', 'Group', 'Start',
       'Stop', 'SNU-449 control R1.sst-rma-gene-full.chp Signal',
       'SNU-449 control R2.sst-rma-gene-full.chp Signal',
       'SNU-449 control R3.sst-rma-gene-full.chp Signal',
       'SNU-449 KD R1.sst-rma-gene-full.chp Signal',
       'SNU-449 KD R2.sst-rma-gene-full.chp Signal',
       'SNU-449 KD R3.sst-rma-gene-full.chp Signal'],
      dtype='object')

The columns `'1 Avg (log2)'` and `'2 Avg (log2)'` contain averaged intensities (obtained from three control replicates and three replicates in which ESRP2 was silenced). The last 6 columns contain the intensities measured in each of the 6 samples (3 x control, 3 x knockdown). The experiment is described in the [paper](https://www.mdpi.com/1422-0067/22/14/7477).
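One way to sanity-check this relationship is to recompute the average from the replicate columns and compare it with the reported one. The sketch below does this on invented toy data and rests on two assumptions that the source does not confirm: that the average is a plain arithmetic mean of the three replicate signals, and that those signals are already on the log2 scale. The column names `ctrl R1 Signal` etc. are hypothetical stand-ins for the long `SNU-449 control ...` names.

```python
import numpy as np
import pandas as pd

# Invented toy values; the real replicate columns have much longer names
toy = pd.DataFrame({
    "ctrl R1 Signal": [7.0, 3.0],
    "ctrl R2 Signal": [8.0, 3.5],
    "ctrl R3 Signal": [9.0, 4.0],
    "1 Avg (log2)": [8.0, 3.5],
})

# Select the replicate columns and average them row by row
replicate_cols = [c for c in toy.columns if c.endswith("Signal")]
recomputed = toy[replicate_cols].mean(axis=1)

# Assumption: the reported average is the arithmetic mean of the replicates
matches = np.allclose(recomputed, toy["1 Avg (log2)"])
print(matches)
```

If the same check fails on the real table, one of the two assumptions above probably does not hold for this export.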

We will focus only on the columns with average intensities. We will also need the gene names (symbols), which are used in GEM as well:

In [4]:
df1 = df[['Gene Symbol', '1 Avg (log2)']]
df2 = df[['Gene Symbol', '2 Avg (log2)']] 

We will rename the columns to `gene` and `value` and store the data in two separate files:

In [5]:
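# Use the same generic column names in both tables
df1.columns = ['gene', 'value']
df2.columns = ['gene', 'value']

# Write one file per condition, without the row index
df1.to_csv(os.path.join('data','data_control.txt'), index=False)
df2.to_csv(os.path.join('data','data_kd.txt'), index=False)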
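The following sketch illustrates the format of the written files on an invented two-row table saved to a temporary directory. The gene names, values and the file name `toy_control.txt` are hypothetical. It shows that `to_csv` writes comma-separated values by default, even though the input file was tab-separated and the output file has a `.txt` extension.

```python
import os.path
import tempfile
import pandas as pd

# Invented stand-in for df1 after renaming the columns
toy = pd.DataFrame({"gene": ["GENE_A", "GENE_B"], "value": [7.5, 3.25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_control.txt")
    toy.to_csv(path, index=False)

    # Inspect the header line to see which separator was used
    with open(path) as fh:
        first_line = fh.readline().strip()

    # The default read_csv separator (comma) matches the default to_csv one
    restored = pd.read_csv(path)

print(first_line)
print(restored)
```

Files written this way can therefore be read back with `pd.read_csv(path)` without passing `sep`.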
