# Data pre-processing

POTFUL requires two types of files to work.


1. `Auxiliary files`: Organism-specific files
2. `Sample(s) files`: Experiment-specific files


In [36]:
import pandas as pd

def make_pretty(styler):
    styler.background_gradient(axis=None, cmap="YlGnBu")
    return styler

## 1. Auxiliary files

These files include:
### [WGCNA_COLOR_MAP.csv](https://raw.githubusercontent.com/nilesh-iiita/POTFUL/main/Auxiliary_File/WGCNA_COLOR_MAP.csv)

WGCNA color map file, fixed for all experiments.

`Comma-separated values`


In [15]:
pd.read_csv("Auxiliary_File/WGCNA_COLOR_MAP.csv", sep=",").head()

Unnamed: 0,Color_name,Color_hex
0,grey,#BEBEBE
1,turquoise,#40E0D0
2,blue,#0000FF
3,brown,#A52A2A
4,yellow,#FFFF00
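
As a minimal sketch of how this file could be used, the snippet below builds a lookup from module color name to hex code. It reads a few rows copied from the output above through `io.StringIO` so that it runs on its own; the assumption is that the sample holds only the two named columns. The name `name_to_hex` is illustrative and is not part of POTFUL.

```python
import io
import pandas as pd

# Sample rows copied from the output above. In practice, pass the path
# "Auxiliary_File/WGCNA_COLOR_MAP.csv" to pd.read_csv instead.
sample = io.StringIO(
    "Color_name,Color_hex\n"
    "grey,#BEBEBE\n"
    "turquoise,#40E0D0\n"
    "blue,#0000FF\n"
)

color_map = pd.read_csv(sample, sep=",")

# Plain dict: module color name -> hex code (illustrative helper, not POTFUL API)
name_to_hex = dict(zip(color_map["Color_name"], color_map["Color_hex"]))
print(name_to_hex["turquoise"])
```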


### [masterTF-target.txt](https://github.com/nilesh-iiita/POTFUL/blob/main/Auxiliary_File/masterTF-target.txt)

Curated TF-target network, specific to *Arabidopsis thaliana*. To analyze data from a different species, supply an equivalent TF-target file for that organism.

`Tab-separated values`


In [40]:
pd.read_csv("Auxiliary_File/masterTF-target.txt", sep="\t").head().style.pipe(make_pretty)

Unnamed: 0,TF,Gene,Input
0,AT1G08290,AT5G66250,1
1,AT1G08290,AT5G65920,1
2,AT1G08290,AT5G65005,1
3,AT1G08290,AT5G37660,1
4,AT1G08290,AT5G26990,1
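
One way to work with this table is to group it into a mapping from each TF to its set of target genes. The sketch below assumes the `TF` and `Gene` column names shown in the output above and uses three of those rows as inline sample data; `targets_by_tf` is an illustrative name, not something POTFUL defines.

```python
import io
import pandas as pd

# Sample rows copied from the output above. In practice, read
# "Auxiliary_File/masterTF-target.txt" with sep="\t".
sample = io.StringIO(
    "TF\tGene\tInput\n"
    "AT1G08290\tAT5G66250\t1\n"
    "AT1G08290\tAT5G65920\t1\n"
    "AT1G08290\tAT5G65005\t1\n"
)

tf_target = pd.read_csv(sample, sep="\t")

# Map each TF to the set of genes it targets (illustrative structure)
targets_by_tf = {
    tf: set(genes) for tf, genes in tf_target.groupby("TF")["Gene"]
}
print(len(targets_by_tf["AT1G08290"]))
```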


### [Arabidopsis_TF and family.csv](https://github.com/nilesh-iiita/POTFUL/blob/main/Auxiliary_File/Arabidopsis_TF%20and%20family.csv)

List of TFs and their families, specific to *Arabidopsis thaliana*. To analyze data from a different species, supply an equivalent TF-family file for that organism.

`Comma-separated values`


In [16]:
pd.read_csv("Auxiliary_File/Arabidopsis_TF and family.csv", sep=",").head()

Unnamed: 0,Protein ID,TF-Family
0,AT1G04880,ARID
1,AT1G20910,ARID
2,AT1G55650,ARID
3,AT1G76110,ARID
4,AT1G76510,ARID
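
The following sketch shows how the TF-family table could serve as a lookup, for example to check whether a gene ID is a known TF. It assumes the `Protein ID` and `TF-Family` column names from the output above; `family_of` is a hypothetical helper name.

```python
import io
import pandas as pd

# Sample rows copied from the output above. In practice, read
# "Auxiliary_File/Arabidopsis_TF and family.csv" with sep=",".
sample = io.StringIO(
    "Protein ID,TF-Family\n"
    "AT1G04880,ARID\n"
    "AT1G20910,ARID\n"
)

tf_family = pd.read_csv(sample, sep=",")

# Dict: TF gene ID -> family name (illustrative helper, not POTFUL API)
family_of = tf_family.set_index("Protein ID")["TF-Family"].to_dict()

# .get returns None for gene IDs that are not listed as TFs
print(family_of.get("AT1G04880"), family_of.get("AT5G66250"))
```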


## 2. Sample(s) files

These are the files that are specific to an experiment.


### Co-expression 

### WGCNA node file

`Tab-separated values`
`Must have a header; column names can be anything.`

1. 1<sup>st</sup> column is the gene ID/probe ID.
2. 2<sup>nd</sup> column is the gene name or symbol. Often the 1<sup>st</sup> and 2<sup>nd</sup> columns are the same.
3. 3<sup>rd</sup> column is the WGCNA module color.

In [39]:
pd.read_csv("Data/2_WGCNA_data/WGCNA_GSE30166_pH/CytoscapeInput-nodes-pH_Anno.txt", sep="\t").head()

Unnamed: 0,nodeName,altName,"nodeAttr[nodesPresent, ]"
0,244901_at,ATMG00640,turquoise
1,244929_at,ATMG00580,brown
2,244979_at,ATCG00750,blue
3,245031_at,AT2G26360,blue
4,245035_at,AT2G26400,brown
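
Because the column names are free-form, the columns have to be addressed by position. The sketch below is one possible way to check the layout described above and give the first three columns stable names. The function `read_node_file` and the names `gene_id`, `gene_name`, and `module_color` are hypothetical; this is not POTFUL's own parser.

```python
import io
import pandas as pd

def read_node_file(source):
    """Read a tab-separated WGCNA node file and name its columns by position."""
    nodes = pd.read_csv(source, sep="\t")
    if nodes.shape[1] < 3:
        raise ValueError("expected at least 3 tab-separated columns")
    # Keep only the first three columns, whatever the header calls them
    nodes = nodes.iloc[:, :3].copy()
    nodes.columns = ["gene_id", "gene_name", "module_color"]  # hypothetical names
    return nodes

# Sample rows copied from the output above
sample = io.StringIO(
    "nodeName\taltName\tnodeAttr[nodesPresent, ]\n"
    "244901_at\tATMG00640\tturquoise\n"
    "244929_at\tATMG00580\tbrown\n"
    "244979_at\tATCG00750\tblue\n"
)

nodes = read_node_file(sample)
print(nodes["module_color"].tolist())
```

Renaming by position keeps downstream code independent of whatever header the WGCNA export happens to produce.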


### WGCNA edge file

`Tab-separated values`

1. 1<sup>st</sup> column gene A.
2. 2<sup>nd</sup> column gene B.
3. 3<sup>rd</sup> column is WGCNA edge weight.

In [37]:
pd.read_csv("Data/2_WGCNA_data/WGCNA_GSE10576_Fe/CytoscapeInput-edges-Fe_Anno.txt", sep="\t").head().style.pipe(make_pretty)

Unnamed: 0,fromNode,toNode,weight
0,ATMG00680,AT5G66530,0.838411
1,ATMG00710,AT4G22650,0.825572
2,ATMG00710,AT3G31460,0.855061
3,ATMG00710,AT2G07698,0.869021
4,ATMG00840,AT5G24810,0.890208
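
As an illustration of the three-column layout, the sketch below loads a few edges and keeps those at or above a weight cutoff. The cutoff value `MIN_WEIGHT = 0.85` is an arbitrary assumption chosen for the example; the source does not state that POTFUL filters edges this way.

```python
import io
import pandas as pd

# Sample rows copied from the output above
sample = io.StringIO(
    "fromNode\ttoNode\tweight\n"
    "ATMG00680\tAT5G66530\t0.838411\n"
    "ATMG00710\tAT4G22650\t0.825572\n"
    "ATMG00710\tAT3G31460\t0.855061\n"
    "ATMG00710\tAT2G07698\t0.869021\n"
)

edges = pd.read_csv(sample, sep="\t")

MIN_WEIGHT = 0.85  # hypothetical cutoff, not a POTFUL default

# The weight is the 3rd column, so select it by position
strong_edges = edges[edges.iloc[:, 2] >= MIN_WEIGHT]
print(len(strong_edges))
```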


### Gene regulatory network file
This is the output of a GRN inference experiment.

`Tab-separated values`

1. 1<sup>st</sup> column TF.
2. 2<sup>nd</sup> column Target.
3. 3<sup>rd</sup> column is weight/importance.

In [35]:
pd.read_csv("Data/3_GRN_data/GSE30166_Sulfur_regnet_small.tsv", sep="\t").head().style.pipe(make_pretty)

Unnamed: 0,TF,target,importance
0,AT1G74840,AT2G18700,0.007215
1,AT3G10113,AT1G53320,0.007215
2,AT5G47390,AT2G38400,0.007215
3,AT5G51190,AT4G24570,0.007215
4,AT5G47390,AT4G30690,0.007215
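
A quick sanity check on a GRN file is to count how many distinct targets each TF regulates. The sketch below assumes the `TF` and `target` column names from the output above; `targets_per_tf` is an illustrative name.

```python
import io
import pandas as pd

# Sample rows copied from the output above
sample = io.StringIO(
    "TF\ttarget\timportance\n"
    "AT1G74840\tAT2G18700\t0.007215\n"
    "AT3G10113\tAT1G53320\t0.007215\n"
    "AT5G47390\tAT2G38400\t0.007215\n"
    "AT5G51190\tAT4G24570\t0.007215\n"
    "AT5G47390\tAT4G30690\t0.007215\n"
)

grn = pd.read_csv(sample, sep="\t")

# Number of distinct targets per TF, largest first
targets_per_tf = (
    grn.groupby("TF")["target"].nunique().sort_values(ascending=False)
)
print(targets_per_tf.index[0], targets_per_tf.iloc[0])
```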


In [27]:
import session_info
session_info.show(html=0)

-----
pandas              1.4.3
session_info        1.0.0
-----
IPython             8.4.0
jupyter_client      7.3.4
jupyter_core        4.10.0
jupyterlab          3.4.3
notebook            6.4.12
-----
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Linux-5.15.57.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
-----
Session information updated at 2022-09-22 05:16
