# Example Notebook: Fibroblast Gene Regulatory Network Inference

In [1]:
from src.GRNinference import inferGRN, crossvalidateGRN

Specify the data and library arguments for inference:
- `path_to_data`: path to the CSV file containing gene expression data (formatted as genes x samples)
- `lib_dir`: path to the directory containing sub-directories for all TF-target databases (included in the repository as `data`)
- `lib_name`: name of the sub-directory containing the database files to use for inference and refinement
  - Here, the [CHEA](https://pubmed.ncbi.nlm.nih.gov/20709693/) database of transcription factor targets is used

### Run a single network inference:  
Here we'll pass the required arguments described above and keep all other arguments at their defaults. During computation, a link to the Dask dashboard is printed to the cell output.

In [6]:
path_to_data = "data\\expression\\GSE133529_ProcessedDataFile.csv.gz"
lib_dir = "data\\"
lib_name = "CHEA"
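
The expression file is expected in a genes x samples layout, so a quick orientation check before running inference can catch a transposed table. The sketch below uses a small invented table rather than the real file, and the helper `looks_like_genes_by_samples` is a hypothetical name, not part of `src.GRNinference`. It only checks that gene labels are unique and that every sample column is numeric.

```python
import pandas as pd

# Hypothetical toy table standing in for the real expression CSV:
# rows are genes, columns are samples (all values are invented).
expr = pd.DataFrame(
    {"sample_1": [5.1, 0.3, 2.2], "sample_2": [4.8, 0.1, 2.9]},
    index=["HIF1A", "MMP1", "RUNX2"],
)


def looks_like_genes_by_samples(df: pd.DataFrame) -> bool:
    """Heuristic layout check (an assumption, not the package's own validation)."""
    # Every column should hold numeric expression values.
    numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)
    # Each gene label should appear only once in the row index.
    unique_genes = df.index.is_unique
    return bool(numeric and unique_genes)


print(expr.shape)
print(looks_like_genes_by_samples(expr))
```

For the real data, something like `pd.read_csv(path_to_data, index_col=0)` could load the table, assuming the first column holds the gene symbols.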

### Run k-fold cross validation for the inference pipeline:
Using the `crossvalidateGRN` function, we can specify the number of folds `k` and infer one network from each training set. After inference and refinement, the networks are concatenated into a single pandas DataFrame, with the fold recorded in a separate column. We can additionally specify a save directory to write each network to a CSV file, along with the samples used for each training/testing set.

In this case, we have 16 samples of bulk RNA-seq data, so we'll use 8-fold cross validation, which leaves 2 samples in each test set if the folds are equal in size.
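
To make the fold arithmetic concrete, the sketch below partitions 16 placeholder sample names into 8 folds, so each fold holds out 2 samples and trains on the other 14. This is a generic, unshuffled k-fold split written for illustration. How `crossvalidateGRN` actually assigns samples to folds is not shown in this notebook, and the sample names and the `kfold_splits` helper are invented here.

```python
def kfold_splits(samples, k):
    """Yield (train, test) lists for a plain, unshuffled k-fold split."""
    n = len(samples)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = samples[start:start + size]
        train = samples[:start] + samples[start + size:]
        yield train, test
        start += size


# Placeholder sample names (assumption: the real column names differ).
samples = [f"sample_{i:02d}" for i in range(1, 17)]
splits = list(kfold_splits(samples, k=8))

for fold, (train, test) in enumerate(splits):
    print(fold, len(train), test)
```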

In [None]:
k = 8
path_to_save = "data\\networks\\"

grn_all = crossvalidateGRN(path_to_data, lib_dir, lib_name, k, savedir=path_to_save)
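
Once the per-fold networks are stacked into one DataFrame, a natural follow-up is to ask which edges are recovered consistently. The sketch below counts, for each TF-target pair, the number of folds in which it appears and its mean importance. The `TF`, `target`, and `importance` columns match the output shown in this notebook, but the fold column name `fold` and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the concatenated cross-validation output (invented values).
grn_all_demo = pd.DataFrame(
    {
        "TF": ["HIF1A", "HIF1A", "RUNX2", "SRF", "HIF1A"],
        "target": ["MMP1", "MMP1", "SERPINE1", "ETV4", "MMP1"],
        "importance": [50.0, 40.0, 25.0, 20.0, 60.0],
        "fold": [0, 1, 0, 1, 2],  # assumed column name
    }
)

# Number of folds containing each edge, and its average importance.
edge_stability = (
    grn_all_demo.groupby(["TF", "target"])
    .agg(n_folds=("fold", "nunique"), mean_importance=("importance", "mean"))
    .sort_values("n_folds", ascending=False)
    .reset_index()
)
print(edge_stability)
```

Edges that appear in most folds are less likely to depend on any particular pair of held-out samples.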

In [7]:
grn = inferGRN(path_to_data, lib_dir, lib_name)

http://127.0.0.1:8787/status
(211801, 3)
(1028, 3)
(98, 4)


In [8]:
# inspect the final network
grn

| Unnamed: 0 | index | TF | target | importance |
|---|---|---|---|---|
| 893 | 10 | HIF1A | MMP1 | 53.416939 |
| 845 | 113 | RUNX2 | SERPINE1 | 26.032237 |
| 525 | 112 | SRF | ETV4 | 21.864790 |
| 801 | 192 | GLI2 | ACTA2 | 21.501221 |
| 846 | 136 | AHR | SERPINE1 | 20.987329 |
| ... | ... | ... | ... | ... |
| 630 | 75 | TEAD4 | HMGA1 | 0.469218 |
| 496 | 8 | NFATC1 | ETS1 | 0.465846 |
| 516 | 87 | NFATC3 | RUNX2 | 0.443590 |
| 147 | 118 | FOS | RUNX1 | 0.365915 |
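
Because the network is displayed as a table with `TF`, `target`, and `importance` columns, ordinary pandas filtering can be used to explore it. The sketch below rebuilds the five highest-importance rows shown above and selects the regulators of a single target. The name `top_edges` is introduced here for illustration, and this is only one of many ways to query the result.

```python
import pandas as pd

# The five top-ranked edges copied from the displayed output.
top_edges = pd.DataFrame(
    {
        "TF": ["HIF1A", "RUNX2", "SRF", "GLI2", "AHR"],
        "target": ["MMP1", "SERPINE1", "ETV4", "ACTA2", "SERPINE1"],
        "importance": [53.416939, 26.032237, 21.864790, 21.501221, 20.987329],
    }
)

# All regulators of SERPINE1 among these rows, strongest first.
serpine1_regulators = (
    top_edges[top_edges["target"] == "SERPINE1"]
    .sort_values("importance", ascending=False)
)
print(serpine1_regulators["TF"].tolist())  # ['RUNX2', 'AHR']
```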
