# Social Network Analysis of a Company Email Network

This assignment works with a company's email network in which each node represents an employee and each edge indicates that at least one email has been sent between two employees. Each node also carries two attributes, `Department` and `ManagementSalary`. `Department` is the department the employee belongs to, and `ManagementSalary` indicates whether the employee receives a management position salary.

In [1]:
import networkx as nx # NetworkX: creation and analysis of complex networks
import pandas as pd # pandas: data structures and data analysis tools
import numpy as np # NumPy: fundamental package for scientific computing
import pickle # pickle: serialization and de-serialization of Python objects

In [2]:
G = nx.read_gpickle('email_prediction.txt') # Reading email network of a small company in Python pickle format

print(nx.info(G)) # Print short summary of information for the email network of the company

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458
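
The average degree in the summary can be checked by hand. In an undirected graph every edge adds one to the degree of both of its endpoints, so the average degree is twice the edge count divided by the node count. A minimal check using only the counts printed above:

```python
# Counts taken from the network summary printed above.
num_nodes = 1005
num_edges = 16706

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * num_edges / num_nodes
print(round(average_degree, 4))
```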


In [3]:
list(G.nodes())[:10] # displaying the first 10 employee nodes of the company email network

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [4]:
list(G.edges())[:10] # displaying the first 10 email communication edges of the company email network

[(0, 1),
 (0, 17),
 (0, 316),
 (0, 146),
 (0, 581),
 (0, 268),
 (0, 221),
 (0, 218),
 (0, 18),
 (0, 734)]

### Part A - Salary Prediction

Network `G` is used to identify employees with missing values for the node attribute `ManagementSalary` and to predict whether or not they receive a management position salary.

To achieve this, a matrix of node features is created using networkx, and a sklearn classifier is trained on the nodes that have `ManagementSalary` data. The classifier then predicts the probability of receiving a management salary for the nodes where `ManagementSalary` is missing.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC), which is computed on the assignment server.
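
Since the AUC is computed on the server, the notebook never calculates it. As an illustration of what the metric measures, here is a minimal pure-Python sketch: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the sample labels and scores are hypothetical and not part of the assignment.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

Because only the ranking of the scores matters, predicted probabilities can be submitted directly without choosing a classification threshold.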

In [5]:
from sklearn.svm import SVC # importing Support Vector Classifier from scikit-learn
from sklearn.neural_network import MLPClassifier # importing Multi-layer Perceptron classifier from scikit-learn
from sklearn.preprocessing import MinMaxScaler # importing min max scaler to transform features by scaling each feature to a given range

In [6]:
def is_management_salary(node): # maps a (node, attributes) pair to 0 or 1, or to None when the ManagementSalary label is missing
    management_salary = node[1]['ManagementSalary']
    if management_salary == 0:
        return 0
    elif management_salary == 1:
        return 1
    else:
        return None

In [7]:
df = pd.DataFrame(index=G.nodes()) # creating an empty dataframe indexed by the employee nodes of the company network

In [8]:
df['clustering coefficient'] = pd.Series(nx.clustering(G)) # clustering coefficient of each employee node
df['degree_view'] = pd.Series(dict(G.degree())) # degree (number of email contacts) of each employee node
df['degree_centrality'] = pd.Series(nx.degree_centrality(G)) # degree centrality of each employee node
df['closeness_centrality'] = pd.Series(nx.closeness_centrality(G, normalized=True)) # closeness centrality of each employee node
df['betweenness_centrality'] = pd.Series(nx.betweenness_centrality(G, normalized=True)) # betweenness centrality of each employee node
df['PageRank'] = pd.Series(nx.pagerank(G)) # PageRank of each employee node
df['Management Salary'] = pd.Series([is_management_salary(node) for node in G.nodes(data=True)]) # management salary label (0, 1, or missing) of each employee node

In [9]:
df.head() # displaying the feature rows of the first five employee nodes

| | clustering coefficient | degree_view | degree_centrality | closeness_centrality | betweenness_centrality | PageRank | Management Salary |
|---|---|---|---|---|---|---|---|
| 0 | 0.276423 | 44 | 0.043825 | 0.421991 | 0.001124 | 0.001224 | 0.0 |
| 1 | 0.265306 | 52 | 0.051793 | 0.42236 | 0.001195 | 0.001426 | NaN |
| 2 | 0.297803 | 95 | 0.094622 | 0.46149 | 0.00657 | 0.002605 | NaN |
| 3 | 0.38491 | 71 | 0.070717 | 0.441663 | 0.001654 | 0.001833 | 1.0 |
| 4 | 0.318691 | 96 | 0.095618 | 0.462152 | 0.005547 | 0.002526 | 1.0 |
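
Two of these features are simple enough to verify by hand. Degree centrality is the degree divided by the largest possible degree, `n - 1`, and the clustering coefficient of a node is the fraction of pairs of its neighbours that are themselves connected. The sketch below recomputes the `degree_centrality` column from the `degree_view` values shown above and evaluates the clustering coefficient on a small toy graph; the helper `local_clustering` and the toy graph are hypothetical illustrations, not part of the notebook.

```python
# Degrees of nodes 0-4, read from the degree_view column above.
degrees = {0: 44, 1: 52, 2: 95, 3: 71, 4: 96}
num_nodes = 1005  # from the network summary

# Degree centrality: degree divided by the maximum possible degree, n - 1.
degree_centrality = {node: d / (num_nodes - 1) for node, d in degrees.items()}
for node, value in degree_centrality.items():
    print(node, round(value, 6))

def local_clustering(adjacency, node):
    """Fraction of pairs of a node's neighbours that are connected to each other."""
    neighbours = list(adjacency[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in adjacency[neighbours[i]]
    )
    return 2 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle a-b-c with an extra node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(local_clustering(toy, "a"))
```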


In [10]:
df_train = df[~pd.isnull(df['Management Salary'])] # extracting only the rows with valid Management Salary values for train set
df_test = df[pd.isnull(df['Management Salary'])] # extracting only the rows with null values in Management Salary for test set

In [11]:
df_train.head() # displaying the first five rows of train set

| | clustering coefficient | degree_view | degree_centrality | closeness_centrality | betweenness_centrality | PageRank | Management Salary |
|---|---|---|---|---|---|---|---|
| 0 | 0.276423 | 44 | 0.043825 | 0.421991 | 0.001124 | 0.001224 | 0.0 |
| 3 | 0.38491 | 71 | 0.070717 | 0.441663 | 0.001654 | 0.001833 | 1.0 |
| 4 | 0.318691 | 96 | 0.095618 | 0.462152 | 0.005547 | 0.002526 | 1.0 |
| 6 | 0.155183 | 115 | 0.114542 | 0.475805 | 0.012387 | 0.003146 | 1.0 |
| 7 | 0.287785 | 72 | 0.071713 | 0.420156 | 0.002818 | 0.002002 | 0.0 |


In [12]:
df_test.head() # displaying the first five rows of test set

| | clustering coefficient | degree_view | degree_centrality | closeness_centrality | betweenness_centrality | PageRank | Management Salary |
|---|---|---|---|---|---|---|---|
| 1 | 0.265306 | 52 | 0.051793 | 0.42236 | 0.001195 | 0.001426 | NaN |
| 2 | 0.297803 | 95 | 0.094622 | 0.46149 | 0.00657 | 0.002605 | NaN |
| 5 | 0.107002 | 171 | 0.170319 | 0.501484 | 0.030995 | 0.004914 | NaN |
| 8 | 0.447059 | 37 | 0.036853 | 0.413151 | 0.000557 | 0.001059 | NaN |
| 14 | 0.215784 | 80 | 0.079681 | 0.442068 | 0.003726 | 0.002166 | NaN |


In [13]:
features = ['clustering coefficient', 'degree_view', 'degree_centrality', 'closeness_centrality', 'betweenness_centrality', 'PageRank'] # features of interest to keep
X_train = df_train[features] # keeping only features of interest for train set
Y_train = df_train['Management Salary'] # train set labels
X_test = df_test[features] # keeping only features of interest for test set

In [14]:
scaler = MinMaxScaler() # initializing min-max scaler
X_train_scaled = scaler.fit_transform(X_train) # fitting the scaler on the train set and scaling the train set
X_test_scaled = scaler.transform(X_test) # scaling the test set with the ranges learned from the train set
# Multi-layer Perceptron classifier with two hidden layers (10 and 5 units), L2 penalty alpha = 5, and the lbfgs solver
clf = MLPClassifier(hidden_layer_sizes = [10, 5], alpha = 5,
                       random_state = 0, solver='lbfgs', verbose=0)
clf.fit(X_train_scaled, Y_train) # fitting the classifier on the scaled train set
test_proba = clf.predict_proba(X_test_scaled)[:, 1] # predicted probability of class 1 (management salary) for each employee in the test set
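
The scaler is fitted on the train set only and then reused on the test set, so both sets are mapped with the same per-feature minimum and maximum. The pure-Python sketch below illustrates that idea; the helpers `fit_min_max` and `apply_min_max` and the sample rows are hypothetical and only approximate what `MinMaxScaler` does with its default settings.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, minimums, maximums):
    """Scale rows with previously learned statistics; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, minimums, maximums)
        ])
    return scaled

# Hypothetical feature rows (degree, clustering coefficient), for illustration only.
train_rows = [[10, 0.2], [30, 0.6], [50, 0.4]]
test_rows = [[20, 0.5], [70, 0.1]]

# Statistics come from the train rows only and are reused for the test rows.
minimums, maximums = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, minimums, maximums)
test_scaled = apply_min_max(test_rows, minimums, maximums)
print(train_scaled)
print(test_scaled)
```

Test values that lie outside the range seen during training end up outside [0, 1], which is expected when the scaler is not refitted on the test data.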

In [15]:
test_proba[:10] # displaying the predicted probabilities for the first 10 employees in the test set

array([ 0.12843931,  0.58626097,  0.97985876,  0.13370215,  0.30462535,
        0.208766  ,  0.2695859 ,  0.33614353,  0.1691909 ,  0.14351299])

In [16]:
probability_employee_MG_salary = pd.Series(test_proba,X_test.index) # storing the probabilities in a pandas Series indexed by employee node
probability_employee_MG_salary.head() # displaying the first five entries of the Series

1     0.128439
2     0.586261
5     0.979859
8     0.133702
14    0.304625
dtype: float64

### Part B - New Connections Prediction

The second part of this assignment involves predicting future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
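
A CSV file stores each index entry as text, so the node pairs have to be converted back into tuples when the file is read. The sketch below shows one way to do that conversion with `ast.literal_eval`, which accepts only Python literals and is therefore a safer choice than `eval`. The sample strings are illustrative values in the format the file uses.

```python
from ast import literal_eval

# Index entries in the textual form stored in the CSV (illustrative values).
raw_index = ["(6, 840)", "(97, 226)"]

# literal_eval parses Python literals only, so it cannot execute arbitrary code.
pairs = [literal_eval(text) for text in raw_index]
print(pairs)
```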

In [17]:
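from ast import literal_eval # safely parses Python literals, used here instead of eval
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: literal_eval}) # reading the future connections, converting each index entry from text to a tuple of nodes
future_connections.head(10) # displaying the first 10 rows of the future connections dataframe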

| | Future Connection |
|---|---|
| (6, 840) | 0.0 |
| (4, 197) | 0.0 |
| (620, 979) | 0.0 |
| (519, 872) | 0.0 |
| (382, 423) | 0.0 |
| (97, 226) | 1.0 |
| (349, 905) | 0.0 |
| (429, 860) | 0.0 |
| (309, 989) | 0.0 |
| (468, 880) | 0.0 |


The pairs in `future_connections` whose `Future Connection` value is missing are the ones to predict: for each of them, the task is to estimate the probability that the edge will form in the future.

To achieve this, a matrix of edge features for the pairs in `future_connections` is created using networkx. A sklearn classifier is trained on the pairs that have `Future Connection` data and is then used to predict the probability of a future connection for the pairs where `Future Connection` is missing.

The evaluation metric for this part is again the Area Under the ROC Curve (AUC), which is computed on the assignment server.

In [18]:
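for node in G.nodes(): # copying each employee's Department into the 'community' attribute, which cn_soundarajan_hopcroft reads by default
    G.node[node]['community'] = G.node[node]['Department']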

In [19]:
G.nodes()[:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [20]:
G.edges()[:10]

[(0, 1),
 (0, 17),
 (0, 316),
 (0, 146),
 (0, 581),
 (0, 268),
 (0, 221),
 (0, 218),
 (0, 18),
 (0, 734)]

In [21]:
preferential_attachment = list(nx.preferential_attachment(G)) # computing preferential attachment scores for all employee node pairs that are not yet connected and storing them in a list

In [22]:
preferential_attachment[:10] # displaying the preferential attachment score of the first 10 rows of employee node pairs

[(0, 2, 4180),
 (0, 3, 3124),
 (0, 4, 4224),
 (0, 7, 3168),
 (0, 8, 1628),
 (0, 9, 1760),
 (0, 10, 2068),
 (0, 11, 3344),
 (0, 12, 2552),
 (0, 13, 7920)]
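
These scores can be cross-checked by hand, because the preferential attachment score of a pair is the product of the two nodes' degrees. A minimal check, using the degrees of nodes 0, 2, 3 and 4 from the `degree_view` column in Part A:

```python
# Degrees of a few nodes, read from the degree_view column in Part A.
degrees = {0: 44, 2: 95, 3: 71, 4: 96}

# Preferential attachment score of a pair (u, v): degree(u) * degree(v).
scores = {(u, v): degrees[u] * degrees[v] for u, v in [(0, 2), (0, 3), (0, 4)]}
for (u, v), score in scores.items():
    print((u, v, score))
```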

In [23]:
df = pd.DataFrame(index=[(x[0], x[1]) for x in preferential_attachment]) # storing the employee node pairs as indexes in an empty pandas dataframe

In [24]:
df.head() # displaying the first 5 rows of the employee node pairs

"(0, 2)"
"(0, 3)"
"(0, 4)"
"(0, 7)"
"(0, 8)"


In [25]:
df['preferential_attachment_score'] = [x[2] for x in preferential_attachment] # storing the preferential attachment scores of the unconnected employee pairs as a feature
cn_soundarajan_hopcroft = list(nx.cn_soundarajan_hopcroft(G)) # counting the common neighbors of each unconnected employee pair, with a bonus for common neighbors in the same community (department) as both employees
df_cn_soundarajan_hopcroft = pd.DataFrame(index=[(x[0], x[1]) for x in cn_soundarajan_hopcroft]) # creating an empty dataframe indexed by employee node pair
df_cn_soundarajan_hopcroft['cn_soundarajan_hopcroft'] = [x[2] for x in cn_soundarajan_hopcroft] # storing the Soundarajan-Hopcroft common neighbor scores as a feature
df = df.join(df_cn_soundarajan_hopcroft,how='outer') # joining the two dataframes on the node-pair index
df['cn_soundarajan_hopcroft'] = df['cn_soundarajan_hopcroft'].fillna(value=0) # filling missing Soundarajan-Hopcroft scores with zero
df['resource_allocation_index'] = [x[2] for x in list(nx.resource_allocation_index(G))] # resource allocation index of each unconnected pair (relies on the pairs being returned in the same order as the dataframe index)
df['jaccard_coefficient'] = [x[2] for x in list(nx.jaccard_coefficient(G))] # Jaccard coefficient of each unconnected pair (same ordering assumption)
df = future_connections.join(df,how='outer') # merging the feature dataframe with the future connection dataframe
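
The three neighbourhood-based scores used above can be written out in a few lines. The sketch below is a pure-Python illustration on a small toy graph, not the networkx implementation: the helper `link_features`, the toy adjacency sets, and the community labels are all hypothetical.

```python
def link_features(adjacency, community, u, v):
    """Neighbourhood-based scores for a candidate edge (u, v)."""
    common = adjacency[u] & adjacency[v]
    union = adjacency[u] | adjacency[v]
    return {
        # Shared neighbours relative to all neighbours of either node.
        "jaccard_coefficient": len(common) / len(union) if union else 0.0,
        # Shared neighbours, each weighted by the inverse of its degree.
        "resource_allocation_index": sum(1 / len(adjacency[w]) for w in common),
        # Shared neighbours, plus one extra for each that is in the same
        # community as both u and v.
        "cn_soundarajan_hopcroft": len(common) + sum(
            1 for w in common if community[w] == community[u] == community[v]
        ),
    }

# Hypothetical toy graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
adjacency = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
# Hypothetical community (department) labels.
community = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}

features_ab = link_features(adjacency, community, "a", "b")
print(features_ab)
```

Here `a` and `b` share the neighbours `c` and `d`, but only `c` belongs to their community, so only `c` earns the extra community point.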

In [26]:
df.head() # displaying the first five rows of the dataframe

| | Future Connection | preferential_attachment_score | cn_soundarajan_hopcroft | resource_allocation_index | jaccard_coefficient |
|---|---|---|---|---|---|
| (0, 2) | 0.0 | 4180 | 6 | 0.05534 | 0.045802 |
| (0, 3) | 0.0 | 3124 | 3 | 0.021388 | 0.027273 |
| (0, 4) | 0.0 | 4224 | 3 | 0.021388 | 0.022222 |
| (0, 7) | 0.0 | 3168 | 4 | 0.061668 | 0.036364 |
| (0, 8) | 0.0 | 1628 | 1 | 0.011628 | 0.012821 |


In [27]:
df_train = df[~pd.isnull(df['Future Connection'])] # extracting only the rows with valid future connection values for train set
df_test = df[pd.isnull(df['Future Connection'])] # extracting only the rows with null future connection values for test set

In [28]:
df_train.head()  # displaying the first five rows of the train set

| | Future Connection | preferential_attachment_score | cn_soundarajan_hopcroft | resource_allocation_index | jaccard_coefficient |
|---|---|---|---|---|---|
| (0, 2) | 0.0 | 4180 | 6 | 0.05534 | 0.045802 |
| (0, 3) | 0.0 | 3124 | 3 | 0.021388 | 0.027273 |
| (0, 4) | 0.0 | 4224 | 3 | 0.021388 | 0.022222 |
| (0, 7) | 0.0 | 3168 | 4 | 0.061668 | 0.036364 |
| (0, 8) | 0.0 | 1628 | 1 | 0.011628 | 0.012821 |


In [29]:
df_test.head()  # displaying the first five rows of the test set

| | Future Connection | preferential_attachment_score | cn_soundarajan_hopcroft | resource_allocation_index | jaccard_coefficient |
|---|---|---|---|---|---|
| (0, 9) | NaN | 1760 | 2 | 0.041931 | 0.025 |
| (0, 19) | NaN | 3168 | 4 | 0.064557 | 0.036364 |
| (0, 20) | NaN | 3256 | 7 | 0.090283 | 0.06422 |
| (0, 35) | NaN | 2596 | 1 | 0.005848 | 0.01 |
| (0, 38) | NaN | 2068 | 0 | 0.0 | 0.0 |


In [30]:
features = ['cn_soundarajan_hopcroft', 'preferential_attachment_score', 'resource_allocation_index', 'jaccard_coefficient'] # features of interest to keep
X_train = df_train[features] # keeping only features of interest for train set
Y_train = df_train['Future Connection'] # train set labels
X_test = df_test[features] # keeping only features of interest for test set
scaler = MinMaxScaler() # initializing min-max scaler
X_train_scaled = scaler.fit_transform(X_train) # fitting the scaler on the train set and scaling the train set
X_test_scaled = scaler.transform(X_test) # scaling the test set with the ranges learned from the train set
# Multi-layer Perceptron classifier with two hidden layers (10 and 5 units), L2 penalty alpha = 5, and the lbfgs solver
clf = MLPClassifier(hidden_layer_sizes = [10, 5], alpha = 5,
                    random_state = 0, solver='lbfgs', verbose=0)
clf.fit(X_train_scaled, Y_train) # fitting the classifier on the scaled train set
test_proba = clf.predict_proba(X_test_scaled)[:, 1] # predicted probability of a new connection for each employee pair in the test set

predictions = pd.Series(test_proba,X_test.index) # storing the predicted probabilities in a pandas Series indexed by employee pair
target = future_connections[pd.isnull(future_connections['Future Connection'])].copy() # extracting the pairs with a missing label, i.e. the pairs to predict; .copy() avoids the pandas SettingWithCopyWarning on the next line
target['prob'] = [predictions[x] for x in target.index] # attaching the predicted probability to each employee pair


In [31]:
target.head()

| | Future Connection | prob |
|---|---|---|
| (107, 348) | NaN | 0.029863 |
| (542, 751) | NaN | 0.012068 |
| (20, 426) | NaN | 0.56685 |
| (50, 989) | NaN | 0.012165 |
| (942, 986) | NaN | 0.01227 |
