This notebook takes the original Chasman network as input. It removes the metapath, cxorf, and cxcx interaction types along with edges from inferred sources, recodes the direction column as 'D' (directed) or 'U' (undirected), and adds a weight column with every edge weight set to 0.5.
It then generates statistics for each source protein: how many directed, undirected, outgoing, and incoming edges the protein has in the dataset. Finally, it finds the undirected edges in the interactome that involve a source protein and appends them to the dataframe that otherwise holds only directed edges.

Input: Original input network developed by Chasman (yeastdata.txt)
Output: ChasmanNetwork-UndirEdges.txt, the undirected edges that involve a source protein (GeneA, GeneB); ChasmanNetwork-DirUndir.txt, all directed edges of the interactome plus those undirected source edges (GeneA, GeneB, edge weight, direction)
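
The steps described above can be sketched end to end on a toy table. This is a minimal Python 3 sketch, not the notebook's own code: the rows, the `Source` value `db`, and the source protein name `SRC1` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the Chasman network; every row here is invented.
toy = pd.DataFrame(
    [
        ("ppi", "SRC1", "G1", "0", "inferred"),
        ("kinase", "G1", "G2", "1", "db"),
        ("metapath", "G2", "G3", "1", "db"),
        ("ppi", "SRC1", "G2", "0", "db"),
        ("ppi", "G2", "G3", "0", "db"),
    ],
    columns=["Interaction", "GeneA", "GeneB", "Direction", "Source"],
)
sources = ["SRC1"]  # hypothetical source protein

# 1. Remove excluded interaction types and inferred edges.
keep = ~toy["Interaction"].isin(["metapath", "cxorf", "cxcx"]) & (toy["Source"] != "inferred")
edges = toy.loc[keep, ["GeneA", "GeneB", "Direction"]].copy()

# 2. Recode direction and add the constant edge weight.
edges["Direction"] = edges["Direction"].map({"1": "D", "0": "U"})
edges.insert(2, "Weight", 0.5)

# 3. Keep directed edges plus undirected edges that touch a source protein.
directed = edges[edges["Direction"] == "D"]
touches_source = edges["GeneA"].isin(sources) | edges["GeneB"].isin(sources)
undirected_sources = edges[(edges["Direction"] == "U") & touches_source]
result = pd.concat([directed, undirected_sources])
print(result)
```

The undirected edge between `G2` and `G3` is dropped because neither endpoint is a source protein.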

# Import Pandas Library and Data Files

In [23]:
import os.path
import matplotlib.pyplot as plt
import pandas as pd

#raw string literals (r'...'): backslashes in the path are kept as-is
USERPATH = r'/home/dylan/Documents/HDD/Wisconsin/'
NOTEBOOK = r'osmotic-stress/Notebooks/'
FILEPATH = USERPATH + NOTEBOOK + 'ChasmanNetwork-DirUndir/'
#FILEPATH = r'/home/dylan/Documents/HDD/Wisconsin/osmotic-stress/Notebooks/ChasmanNetwork-DirUndir/'

Location = FILEPATH + 'yeastdata.txt'

#check that the input file exists
assert os.path.isfile(Location), "File does not exist."

In [24]:
## Change Headers

In [25]:
#Changes the names of the headers
df = pd.read_csv(Location, sep = '\t', names = ["Interaction", "GeneA", "GeneB", "Direction", "Sign", "Source", "PMID"], skiprows = 0)

In [26]:
## Remove metapath, cxorf, cxcx, inferred interactions

In [27]:
#Drop all rows with a 'metapath', 'cxorf', or 'cxcx' interaction type or an 'inferred' source
df = df[((df.Interaction != 'metapath') & (df.Interaction != 'cxorf') & (df.Source != 'inferred') & (df.Interaction != 'cxcx'))]

#Verify that no excluded rows remain
check = True
for index,row in df.iterrows():
    if(row['Source'] == 'inferred' or row['Interaction'] == 'metapath' or row['Interaction'] == 'cxorf' or row['Interaction'] == 'cxcx'):
        check = False

if check:
    print("All inferred, metapath, cxorf, and cxcx have been removed!")

All inferred, metapath, cxorf, and cxcx have been removed!
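
The same removal and check can be written without a row-by-row loop. The following is a minimal Python 3 sketch on invented rows, using `Series.isin` to build one boolean mask; the frame `demo` and the `Source` value `db` are assumptions for illustration only.

```python
import pandas as pd

# Invented rows: one to keep and three that should be removed.
demo = pd.DataFrame({
    "Interaction": ["ppi", "metapath", "kinase", "cxcx"],
    "Source": ["db", "db", "inferred", "db"],
})
excluded = ["metapath", "cxorf", "cxcx"]

# A row is unwanted if its interaction type is excluded or its source is inferred.
bad = demo["Interaction"].isin(excluded) | (demo["Source"] == "inferred")
filtered = demo[~bad]

# Re-evaluate the mask on the result to confirm nothing slipped through.
still_bad = filtered["Interaction"].isin(excluded) | (filtered["Source"] == "inferred")
assert not still_bad.any()
print(len(filtered))
```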


# Drop parts of dataset and convert direction to 'U' or 'D'

In [28]:
#Drop the Interaction, Source, Sign, and PubMed ID columns from the data
df = df.drop(['Interaction','Source', 'Sign', 'PMID'], axis=1)

df.loc[df['Direction'] == '1', 'Direction'] = 'D'
df.loc[df['Direction'] == '0', 'Direction'] = 'U'
df = df.drop([0]) #Drop the first row that contains the headings
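
Why the direction values are compared as strings and why row 0 is dropped: when `names` is passed to `read_csv` and no `header` is given, pandas treats the file's own heading line as data. A minimal Python 3 sketch with an invented two-edge file (the heading names in `raw` are made up):

```python
import io
import pandas as pd

# Invented file with a heading line, standing in for yeastdata.txt.
raw = "type\ta\tb\tdir\nppi\tG1\tG2\t0\nkinase\tG2\tG3\t1\n"
demo = pd.read_csv(io.StringIO(raw), sep="\t",
                   names=["Interaction", "GeneA", "GeneB", "Direction"])

# The heading line is now row 0, so Direction holds strings, not integers.
print(demo.iloc[0].tolist())

# Recode '1'/'0' as 'D'/'U'; the heading text in row 0 maps to NaN.
demo["Direction"] = demo["Direction"].map({"1": "D", "0": "U"})
demo = demo.drop([0])  # remove the row that held the original headings
print(demo)
```

This also explains why the index of the resulting dataframe starts at 1 rather than 0.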

# Define Protein Class

In [29]:
class Proteins:
    def __init__(self, name):
        self.name = name
        self.directed = 0
        self.undirected = 0
        self.outgoingEdge = 0
        self.incomingEdge = 0
    #Functions below increment the number of directed,undirected,outgoing,or incoming edges
    def incrementDirected(self):
        self.directed = self.directed + 1
    def incrementUndirected(self):
        self.undirected = self.undirected + 1
    def incrementOutgoing(self):
        self.outgoingEdge = self.outgoingEdge + 1
    def incrementIncoming(self):
        self.incomingEdge = self.incomingEdge + 1
        
    #display function to print the edge counts of a protein
    def display(self):
        print("Name of protein: " + self.name)
        print("Number of Directed Edges:  {}".format(self.directed))
        print("Number of Undirected Edges:  {}".format(self.undirected))
        print("Number of Outgoing Edges:  {}".format(self.outgoingEdge))
        print("Number of Incoming Edges:  {}".format(self.incomingEdge))
        print("")
        print("")

# Declare protein objects

In [30]:
YDR420W = Proteins('YDR420W')
YGR014W = Proteins('YGR014W')
YER118C = Proteins('YER118C')
YPR075C = Proteins('YPR075C')
YIL147C = Proteins('YIL147C')    

# Iterate through all proteins and generate statistics related to those proteins

In [31]:
proteinObjects = [YDR420W, YGR014W, YER118C, YPR075C, YIL147C]
for index,row in df.iterrows(): #for each row in dataframe
    for protein in proteinObjects: #for each protein in proteinObjects list
        if(row['GeneA'] == protein.name): #protein is in the GeneA column: counted as outgoing
            protein.incrementOutgoing()
            if(row['Direction'] == 'D'):
                protein.incrementDirected()
            else:
                protein.incrementUndirected()

        elif(row['GeneB'] == protein.name): #protein is in the GeneB column: counted as incoming
            protein.incrementIncoming()
            if(row['Direction'] == 'D'):
                protein.incrementDirected()
            else:
                protein.incrementUndirected()

    

for protein in proteinObjects:
    protein.display()

Name of protein: YDR420W
Number of Directed Edges:  0
Number of Undirected Edges:  3
Number of Outgoing Edges:  3
Number of Incoming Edges:  0


Name of protein: YGR014W
Number of Directed Edges:  1
Number of Undirected Edges:  4
Number of Outgoing Edges:  1
Number of Incoming Edges:  4


Name of protein: YER118C
Number of Directed Edges:  1
Number of Undirected Edges:  18
Number of Outgoing Edges:  12
Number of Incoming Edges:  7


Name of protein: YPR075C
Number of Directed Edges:  0
Number of Undirected Edges:  4
Number of Outgoing Edges:  0
Number of Incoming Edges:  4


Name of protein: YIL147C
Number of Directed Edges:  2
Number of Undirected Edges:  5
Number of Outgoing Edges:  4
Number of Incoming Edges:  3
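
The same four counts can be computed per protein with boolean masks instead of nested loops. This is a minimal Python 3 sketch under the counting rules used above: "outgoing" means the protein sits in the GeneA column, "incoming" means GeneB, and a self-loop is counted once (as outgoing). The helper `edge_stats` and the toy rows are hypothetical.

```python
import pandas as pd

def edge_stats(edges, name):
    """Count edges for one protein using the same rules as the loop."""
    as_a = edges["GeneA"] == name
    # Mirror the elif branch: a row already matched on GeneA is not counted again.
    as_b = (edges["GeneB"] == name) & ~as_a
    touching = edges[as_a | as_b]
    return {
        "directed": int((touching["Direction"] == "D").sum()),
        "undirected": int((touching["Direction"] != "D").sum()),
        "outgoing": int(as_a.sum()),
        "incoming": int(as_b.sum()),
    }

# Invented edges: P appears in GeneA twice (once as a self-loop) and in GeneB once.
toy = pd.DataFrame(
    [("P", "X", "U"), ("Y", "P", "D"), ("P", "P", "U"), ("X", "Y", "D")],
    columns=["GeneA", "GeneB", "Direction"],
)
stats = edge_stats(toy, "P")
print(stats)
```

For undirected edges the outgoing and incoming labels only reflect column position in the file, not a biological direction.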




In [32]:
## Insert edge weight of .5

In [33]:
df.insert(2, 'Weight', .5) #insert a weight column into index 2 with values of .5
#Check all weights are .5
for index,row in df.iterrows():
    assert (row['Weight'] == .5),"Weight is incorrect."
df

Unnamed: 0,GeneA,GeneB,Weight,Direction
1,YCL032W,YLR006C,0.5,D
2,YCL032W,YNR031C,0.5,D
3,YCL032W,YJL128C,0.5,D
4,YNR031C,YJL128C,0.5,D
5,YJL128C,YLR113W,0.5,D
6,YAL040C,YMR037C,0.5,D
7,YJL164C,YMR037C,0.5,D
8,YOR360C,YJL164C,0.5,D
9,YOR360C,YPL203W,0.5,D
10,YOR360C,YKL166C,0.5,D


In [34]:
## Find all sourceProteins in the network and append them to df2

In [35]:
sourceProteins = ['YDR420W', 'YGR014W', 'YER118C','YIL147C', 'YPR075C']

#Find all sourceProteins in the network and collect their edges in df2
#(pd.concat is used because DataFrame.append was removed in pandas 2.0)
df2 = pd.concat([df[(df.GeneA == protein) | (df.GeneB == protein)] for protein in sourceProteins])

df2.count()

GeneA        38
GeneB        38
Weight       38
Direction    38
dtype: int64

In [36]:
## Remove duplicates found associated with source proteins

In [37]:
#df3 = df2.drop_duplicates(subset=['GeneA', 'GeneB'], keep=False)
df2 = df2.drop_duplicates()
df2.count()

GeneA        36
GeneB        36
Weight       36
Direction    36
dtype: int64
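
The duplicates arise because an edge between two source proteins is selected once for each of them. A minimal Python 3 sketch on invented edges shows the effect, and a one-pass `isin` selection that avoids it; the names `S1` and `S2` are hypothetical source proteins.

```python
import pandas as pd

# Invented edges; S1 and S2 stand in for source proteins.
toy = pd.DataFrame(
    [("S1", "S2", "U"), ("S1", "X", "U"), ("Y", "S2", "D"), ("X", "Y", "D")],
    columns=["GeneA", "GeneB", "Direction"],
)
sources = ["S1", "S2"]

# Selecting per protein picks up the S1-S2 edge twice.
per_protein = pd.concat(
    [toy[(toy.GeneA == s) | (toy.GeneB == s)] for s in sources]
)
deduped = per_protein.drop_duplicates()

# A single mask over both columns selects each row at most once.
one_pass = toy[toy.GeneA.isin(sources) | toy.GeneB.isin(sources)]
print(len(per_protein), len(deduped), len(one_pass))
```

The two approaches differ in one respect: `drop_duplicates` would also collapse rows that are genuinely repeated in the input file, whereas the `isin` mask keeps them.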

In [38]:
df2

Unnamed: 0,GeneA,GeneB,Weight,Direction
1319,YDR420W,YER118C,0.5,U
11778,YDR420W,YOR153W,0.5,U
13715,YDR420W,YHR154W,0.5,U
5701,YBR160W,YGR014W,0.5,D
10646,YGL209W,YGR014W,0.5,U
17029,YGL035C,YGR014W,0.5,U
21994,YGR014W,YLR229C,0.5,U
22379,YER118C,YGR014W,0.5,U
6083,YER118C,YJL128C,0.5,U
6480,YER118C,YNL152W,0.5,U


In [39]:
## Find all connections to source proteins that are undirected

In [40]:
df2 = df2[df2['Direction'] == 'U'] #keep only the undirected source edges
print(df2.count())
df = df[df['Direction'] == 'D'] #include only directed edges from original network

#output undirected edges to file
df3 = df2.drop(['Weight', 'Direction'], axis=1)
path = FILEPATH + 'ChasmanNetwork-UndirEdges.txt'
df3.to_csv(path,  index = False, header = False, sep = '\t')

print(df2)

GeneA        32
GeneB        32
Weight       32
Direction    32
dtype: int64
         GeneA    GeneB  Weight Direction
1319   YDR420W  YER118C     0.5         U
11778  YDR420W  YOR153W     0.5         U
13715  YDR420W  YHR154W     0.5         U
10646  YGL209W  YGR014W     0.5         U
17029  YGL035C  YGR014W     0.5         U
21994  YGR014W  YLR229C     0.5         U
22379  YER118C  YGR014W     0.5         U
6083   YER118C  YJL128C     0.5         U
6480   YER118C  YNL152W     0.5         U
7540   YDL117W  YER118C     0.5         U
8382   YER118C  YOR208W     0.5         U
9781   YCL027W  YER118C     0.5         U
13184  YBR023C  YER118C     0.5         U
14785  YER118C  YER118C     0.5         U
17529  YER118C  YLR452C     0.5         U
18202  YER118C  YPR032W     0.5         U
19915  YER118C  YLR353W     0.5         U
20250  YER118C  YMR032W     0.5         U
23059  YER118C  YOR188W     0.5         U
23640  YCL032W  YER118C     0.5         U
24783  YAL041W  YER118C     0.5         U

In [41]:
## Combine undirected edges associated with source proteins to all the directed edges of network

In [42]:
print("df count before appending:  {}".format(df.count()))
df = pd.concat([df, df2]) #DataFrame.append was removed in pandas 2.0
print("df count after appending:  {}".format(df.count()))
df

df count before appending:  GeneA        7584
GeneB        7584
Weight       7584
Direction    7584
dtype: int64
df count after appending:  GeneA        7616
GeneB        7616
Weight       7616
Direction    7616
dtype: int64


Unnamed: 0,GeneA,GeneB,Weight,Direction
1,YCL032W,YLR006C,0.5,D
2,YCL032W,YNR031C,0.5,D
3,YCL032W,YJL128C,0.5,D
4,YNR031C,YJL128C,0.5,D
5,YJL128C,YLR113W,0.5,D
6,YAL040C,YMR037C,0.5,D
7,YJL164C,YMR037C,0.5,D
8,YOR360C,YJL164C,0.5,D
9,YOR360C,YPL203W,0.5,D
10,YOR360C,YKL166C,0.5,D
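
The counts above are consistent: 7584 directed edges plus 32 undirected source edges give 7616 rows. A minimal Python 3 sketch of the same sanity check on invented frames, using `pd.concat`:

```python
import pandas as pd

# Invented frames standing in for the directed edges and the undirected source edges.
directed = pd.DataFrame({"GeneA": ["A", "B"], "GeneB": ["B", "C"],
                         "Weight": 0.5, "Direction": "D"})
undirected = pd.DataFrame({"GeneA": ["S"], "GeneB": ["A"],
                           "Weight": 0.5, "Direction": "U"})

combined = pd.concat([directed, undirected])

# Concatenation neither drops nor duplicates rows.
assert len(combined) == len(directed) + len(undirected)
# Same arithmetic as the counts reported in the notebook output.
assert 7584 + 32 == 7616
print(combined["Direction"].value_counts().to_dict())
```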


In [43]:
## Output directed + some undirected edges to file

In [44]:
path = FILEPATH + 'ChasmanNetwork-DirUndir.txt'

df.to_csv(path,  index = False, header = False, sep = '\t')
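
Because the file is written without a header row or index, a reader has to supply the column names. A minimal Python 3 sketch of a round trip through a temporary directory; the file name `toy-DirUndir.txt` and the edges are invented for illustration.

```python
import os
import tempfile
import pandas as pd

# Invented edges in the same four-column layout as the output file.
edges = pd.DataFrame({"GeneA": ["A", "S"], "GeneB": ["B", "A"],
                      "Weight": [0.5, 0.5], "Direction": ["D", "U"]})

tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "toy-DirUndir.txt")  # hypothetical file name
edges.to_csv(out_path, index=False, header=False, sep="\t")

# The file has no header row, so column names are supplied on read.
back = pd.read_csv(out_path, sep="\t",
                   names=["GeneA", "GeneB", "Weight", "Direction"])
print(back)
```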