# 3.1. Prepare YEASTRACT protein-DNA interaction network

Generate a network file of protein-DNA interactions in which genes are identified by their locus names.

Duplicated interaction entries and self-loops are removed from the final network.

## Input

* `data-create_networks/yeastract_TF/RegulationTwoColumnTable_Documented_DNA_binding_and_expression_evidence_2019-08-14.tsv`: protein-DNA interactions of transcription factors from the YEASTRACT database.

## Output

* `data-create_networks/yeastract_TF/protein_DNA_interactions.tsv`: cleaned protein-DNA interactions of transcription factors from the YEASTRACT database, with locus names.

In [1]:
import pandas as pd

In [2]:
yeastract_file = '../../data-create_networks/yeastract_TF/yeastract2019-flat-file.tsv'
pdiTable_file = '../../data-create_networks/yeastract_TF/protein_DNA_interactions.tsv'

## Prepare PDI table from YEASTRACT bulk flat file

In [3]:
# get all interactions
yeastract = pd.read_csv(yeastract_file, sep='\t', header=None).rename({0:'source', 2:'target'}, axis=1)

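# keep interactions that are annotated as "Direct" (column 8)
direct = yeastract[ yeastract[8] == 'Direct' ][ ['source', 'target'] ]
# keep interactions that have a gene expression effect (column 7 is not NaN)
with_effect = yeastract[ ~yeastract[7].isna() ][ ['source', 'target'] ]

# build the PDI table from the pairs present in both sets; the inner merge pairs
# every matching row, so repeated pairs are multiplied here and removed below
confirmed_interactions = pd.merge( direct, with_effect, on=['source', 'target'], how='inner' )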

# get proteins involved in the regulatory network
tfs = confirmed_interactions['source'].unique()
targets = confirmed_interactions['target'].unique()
yeastract_proteins = pd.Series( list( set(tfs) | set(targets) ) )

print("transcription factors:", len(tfs))
print("target genes:", len(targets))
print("combined:", len(yeastract_proteins))
print("raw interaction entries:", len(confirmed_interactions))

transcription factors: 152
target genes: 3915
combined: 3940
raw interaction entries: 42492
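
The filter-and-merge step above can be illustrated on a toy table. This is only a sketch: the gene names and the values in columns 7 and 8 are invented, and only the four columns used by the cell above are modelled. It shows why the raw entry count can exceed the number of distinct pairs: an inner merge on `source` and `target` pairs every matching row from one side with every matching row from the other.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flat file (hypothetical values, not YEASTRACT data).
toy = pd.DataFrame({
    0: ['TF1', 'TF1', 'TF1', 'TF2'],
    2: ['G1', 'G1', 'G2', 'G1'],
    7: ['Positive', np.nan, 'Negative', np.nan],
    8: ['Direct', 'Direct', 'Indirect', 'Direct'],
}).rename({0: 'source', 2: 'target'}, axis=1)

# Same two filters as in the notebook cell.
direct = toy[toy[8] == 'Direct'][['source', 'target']]
with_effect = toy[~toy[7].isna()][['source', 'target']]

# Pairs present in both sets; (TF1, G1) occurs twice in `direct`
# and once in `with_effect`, so it appears twice after the merge.
merged = pd.merge(direct, with_effect, on=['source', 'target'], how='inner')

print(len(merged))                    # rows after the merge
print(len(merged.drop_duplicates()))  # distinct (source, target) pairs
```

In this toy case only the pair (TF1, G1) satisfies both criteria, but it is listed more than once until duplicates are dropped.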


## Clean PDI table

In [4]:
# remove duplicated interactions
pdi = confirmed_interactions.drop_duplicates()
print( 'duplicated interaction entries:', len(confirmed_interactions[confirmed_interactions.duplicated()]) )

# remove self-loops
print( 'self loops:', len(pdi[pdi.iloc[:,0] == pdi.iloc[:,1]]) )
pdi = pdi[pdi.iloc[:,0] != pdi.iloc[:,1]]

print('number of protein-DNA interactions:', len(pdi))

duplicated interaction entries: 30990
self loops: 43
number of protein-DNA interactions: 11459
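
The two cleaning steps can be sketched on a small example table. The data below are invented for illustration, and the columns are selected by name rather than by position, which is an alternative to the `iloc` calls in the cell above and not the notebook's own code. The last assertion checks that the counts reported above are consistent with each other.

```python
import pandas as pd

# Hypothetical edge list with one repeated pair and one self-loop.
toy = pd.DataFrame({
    'source': ['TF1', 'TF1', 'TF2', 'TF2'],
    'target': ['G1', 'G1', 'TF2', 'G1'],
})

# Step 1: drop repeated (source, target) pairs.
deduplicated = toy.drop_duplicates()
n_duplicates = len(toy) - len(deduplicated)

# Step 2: drop self-loops, i.e. rows where a gene regulates itself.
is_self_loop = deduplicated['source'] == deduplicated['target']
cleaned = deduplicated[~is_self_loop]

# Bookkeeping: raw entries = duplicates + self-loops + remaining interactions.
assert len(cleaned) == len(toy) - n_duplicates - int(is_self_loop.sum())

# The same bookkeeping applied to the counts printed in this notebook.
assert 42492 - 30990 - 43 == 11459
```

Because self-loops are counted after duplicates have been dropped, each self-regulating gene is counted once.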


## Export

In [5]:
pdi.to_csv( pdiTable_file, sep='\t', index=False )
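
A quick round-trip check of the export format, as a minimal sketch: it writes a toy table (invented names) to an in-memory buffer instead of the real output path and reads it back, to confirm that a tab-separated export without the index yields exactly the two named columns.

```python
import io

import pandas as pd

# Hypothetical two-row PDI table standing in for `pdi`.
toy = pd.DataFrame({'source': ['TF1', 'TF2'], 'target': ['G1', 'G2']})

# Write tab-separated text without the row index, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep='\t')

print(list(reloaded.columns))
```

The same check could be run on the real output file by passing its path to `pd.read_csv` with `sep='\t'`.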