# Analyzing a company's email network

In [1]:
# Importing libraries
import networkx as nx
import pandas as pd
import numpy as np
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc

## Company Emails

In this project we will be working with a company's email network, where each node corresponds to a person at the company and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department the person belongs to, and `ManagementSalary` indicates whether that person receives a management position salary.

In [21]:
# Loading the graph
G = nx.read_gpickle('email_prediction.txt')

# Printing graph info
print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


### Salary Prediction

Using network `G`, we will identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals receive a management position salary.

To accomplish this, we will need to create a matrix of node features using networkx, train a sklearn classifier on the nodes that have `ManagementSalary` data, and predict the probability of receiving a management salary for the nodes where `ManagementSalary` is missing.
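
As a rough illustration of the feature-matrix step described above, the sketch below builds one row per node on a tiny made-up graph. The helper name `node_feature_frame` and the toy graph are assumptions for illustration only; the features mirror the ones used later in this notebook. Building the frame from Series keyed by node id keeps every column aligned with the right node.

```python
import networkx as nx
import pandas as pd


def node_feature_frame(graph):
    """Return one row per node with a few structural features (hypothetical helper)."""
    return pd.DataFrame({
        'clustering': pd.Series(nx.clustering(graph)),
        # dict(...) works whether degree() returns a dict or a view of (node, degree) pairs
        'degree': pd.Series(dict(graph.degree())),
        'closeness': pd.Series(nx.closeness_centrality(graph)),
        'betweenness': pd.Series(nx.betweenness_centrality(graph)),
    })


# Toy graph (made up): a triangle 0-1-2 with a pendant node 3 attached to node 2
toy = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])
features = node_feature_frame(toy)
print(features)
```

Node 2 sits between the triangle and the pendant node, so it ends up with the highest degree, closeness, and betweenness in this toy example.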


Our predictions will need to be given as the probability that the corresponding employee receives a management position salary.

The evaluation metric for this project is the Area Under the ROC Curve (AUC).
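
To make the metric concrete, here is a minimal sketch that computes AUC with the `roc_curve` and `auc` functions imported at the top of the notebook. The labels and scores are made-up toy values. The second calculation shows the ranking interpretation: AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score (ties are not handled in this simplified version).

```python
from sklearn.metrics import roc_curve, auc

# Made-up toy labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC from the ROC curve
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Same quantity as a ranking statistic (no ties in this toy data)
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(roc_auc, pairwise)  # both 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```

Because AUC depends only on how the scores rank the examples, returning probabilities rather than hard 0/1 labels is what allows the metric to distinguish good models from mediocre ones.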

In [175]:
def salary_predictions():
    
    '''
        Description: This function estimates the probability that an employee receives a management salary.
        
        Args:
            - N/A
        
        Returns: 
            - predictions (Series): the predicted probability of receiving a management salary, indexed by node, for each employee whose `ManagementSalary` is missing
    '''
    
    # A list to store graph properties
    mngnt_salary = []

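    # Iterating over the graph nodes
    # (G.nodes(data=True) works in networkx 1.x and 2.x+; nodes_iter() exists only in 1.x)
    for node, attrs in G.nodes(data=True):
        # Storing the attribute values for each node
        mngnt_salary.append({'Department': attrs['Department'], 'ManagementSalary': attrs['ManagementSalary']})
    # Creating a dataframe from the list, indexed by node id so the graph metrics below align by node
    df_salary = pd.DataFrame(mngnt_salary, index=list(G.nodes()))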

    # Creating a column with 'clustering' graph property
    df_salary['clustering'] = pd.Series(nx.clustering(G))
    
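    # Creating a column with 'degree' graph property
    # (dict(...) handles both the dict returned by networkx 1.x and the DegreeView returned by 2.x+)
    df_salary['degree'] = pd.Series(dict(G.degree()))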
    
    # Creating a column with 'closeness_centrality' graph property
    df_salary['closeness'] = pd.Series(nx.closeness_centrality(G))
    
    # Creating a column with 'betweenness_centrality' graph property
    df_salary['betweness'] = pd.Series(nx.betweenness_centrality(G))

    # Cloning the original dataframe
    df_missing = df_salary.copy()
    
    # Filtering the entries where 'ManagementSalary' is missing
    df_missing = df_missing[np.isnan(df_missing['ManagementSalary'])]
    
    # Drop NA values
    df_salary.dropna(inplace=True)
    
    # Selecting X features
    X = df_salary[['clustering', 'degree', 'closeness',
       'betweness']]
    
    # Selecting target feature
    y = df_salary['ManagementSalary']

    # Selecting the employees whose management salary status we want to predict
    X_pred = df_missing[['clustering', 'degree', 'closeness',
           'betweness']]

    # Splitting data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Applying Standard Scaler to normalize the data
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test) 
    X_pred = scaler.transform(X_pred) 

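    # Training an SVC model (fitted for comparison; not used for the final predictions)
    svc_model = SVC(C=10, probability=True, random_state=42).fit(X_train, y_train)
    
    # Training a KNN model (fitted for comparison; not used for the final predictions)
    knn_classifier = KNeighborsClassifier().fit(X_train, y_train)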
    
    # Training a Logistic Regression Model
    logistic_reg = LogisticRegression().fit(X_train, y_train)

    # Predicting the probability that each person receives a management salary
    model_pred = logistic_reg.predict_proba(X_pred)
    
    # Storing the predictions
    predictions = pd.Series(model_pred[:, 1], index=df_missing.index)
    
    # Returning the list of probabilities
    return predictions
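
The function above fits three classifiers and holds out a test split, but it only uses the logistic regression and never scores the held-out data. One way to compare the candidates is to compute the AUC of each model on the test split, as in the sketch below. The data here is synthetic (generated with `make_classification` as a stand-in for the node-feature matrix), and the dictionary names are illustrative assumptions, so the resulting scores say nothing about the email network itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four node features and the binary target
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Same three model families as in salary_predictions()
models = {
    'svc': SVC(C=10, probability=True, random_state=42),
    'knn': KNeighborsClassifier(),
    'logreg': LogisticRegression(),
}

# Held-out AUC for each model, using the probability of the positive class
test_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

best_name = max(test_auc, key=test_auc.get)
print(test_auc, best_name)
```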

### New Connections Prediction

In this part of the project, we will predict future connections between employees in the network. The future connections information is loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates whether an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
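
The tuple-valued index described above is stored as text in a CSV file, so each index entry has to be parsed back into a Python tuple when loading. A minimal sketch of one way to do that is shown here, using `ast.literal_eval`, which only evaluates Python literals and is therefore a safer choice than `eval` for file contents. The inline CSV text is a small illustrative sample, and its exact layout is an assumption about how the real file is structured.

```python
import ast
import io

import pandas as pd

# Small illustrative sample; the real file's layout is assumed to look like this
csv_text = ''',Future Connection
"(6, 840)",0.0
"(4, 197)",0.0
"(97, 226)",1.0
'''

# Column 0 holds strings such as "(6, 840)"; literal_eval turns each into a tuple
sample = pd.read_csv(io.StringIO(csv_text), index_col=0,
                     converters={0: ast.literal_eval})

first_pair = sample.index[0]
print(first_pair, type(first_pair))
```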

In [22]:
# Loading data
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})

# Showing top 10 entries
future_connections.head(10)

|  | Future Connection |
|---|---|
| (6, 840) | 0.0 |
| (4, 197) | 0.0 |
| (620, 979) | 0.0 |
| (519, 872) | 0.0 |
| (382, 423) | 0.0 |
| (97, 226) | 1.0 |
| (349, 905) | 0.0 |
| (429, 860) | 0.0 |
| (309, 989) | 0.0 |
| (468, 880) | 0.0 |


Using network `G` and `future_connections`, we will identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, we will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.
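
The edge-feature step described above can be sketched on a tiny made-up graph. For each candidate pair of unconnected nodes, the example computes the preferential attachment score (the product of the two node degrees) and the number of common neighbors, the same two features used later in this notebook. The graph, the candidate pairs, and the column names are illustrative assumptions.

```python
import networkx as nx
import pandas as pd

# Toy graph (made up) and candidate pairs that are not currently connected
toy = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)])
candidate_pairs = [(0, 3), (0, 4), (1, 4), (2, 4)]

rows = []
# preferential_attachment yields (u, v, score) triples in the order of the pairs given
for u, v, pref_attach in nx.preferential_attachment(toy, candidate_pairs):
    rows.append({
        'u': u,
        'v': v,
        'preferential_attachment': pref_attach,
        'common_neighbors': len(list(nx.common_neighbors(toy, u, v))),
    })

edge_features = pd.DataFrame(rows)
print(edge_features)
```

In this toy graph, the pair (0, 3) scores highest on both features: the two nodes have degrees 2 and 3 and share neighbors 1 and 2.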

The evaluation metric for this project is the Area Under the ROC Curve (AUC).

In [27]:
def new_connections_predictions():
    
    '''
        Description: This function estimates the probability that a pair of employees will become connected in the company's email network.
        
        Args:
            - N/A
        
        Returns: 
            - predictions (Series): the predicted probability of a future connection, indexed by edge, for each edge whose `Future Connection` value is missing
    
    '''
    
    # Adding the 'preferential attachment' score of each edge to the dataframe
    future_connections['Preferential Attachment'] = [i[2] for i in nx.preferential_attachment(G, future_connections.index)]
    
    # Adding 'common neighbors' graph attribute to the dataframe
    future_connections['Common Neighbors'] = future_connections.index.map(lambda empl: len(list(nx.common_neighbors(G, empl[0], empl[1]))))

    # Cloning the original dataframe
    ft_mising = future_connections.copy()
    
    # Filtering the entries where 'Future Connection' is missing
    ft_mising = ft_mising[np.isnan(ft_mising['Future Connection'])]
    
    # Drop NA values
    future_connections.dropna(inplace=True)
    
    # Selecting predictor features
    X = future_connections[['Preferential Attachment', 'Common Neighbors']]
    
    # Selecting target feature
    y = future_connections['Future Connection']
    
    # Selecting the edges whose probability of being a future connection we want to predict
    X_pred = ft_mising[['Preferential Attachment', 'Common Neighbors']]
    
    # Splitting the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Applying Standard Scaler to normalize the data
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test) 
    X_pred = scaler.transform(X_pred) 

    # Training a Logistic Regression Model
    logistic_reg = LogisticRegression().fit(X_train, y_train)
    
    # Predicting the probability of employees being a future connection
    ft_predictions = logistic_reg.predict_proba(X_pred)
    
    # Storing the predictions
    predictions = pd.Series(ft_predictions[:, 1], index=ft_mising.index)

    # Returning the predicted probabilities
    return predictions
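
Note that `new_connections_predictions()` adds columns to the shared `future_connections` dataframe and drops its unlabelled rows in place, so a second call would find no rows left to predict. A possible alternative, sketched below under that observation, is to split labelled and unlabelled rows into copies and leave the original frame untouched. The helper name `split_labelled` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def split_labelled(frame, target):
    """Return (labelled, unlabelled) copies of frame without modifying it (hypothetical helper)."""
    missing = frame[target].isnull()
    return frame[~missing].copy(), frame[missing].copy()


# Made-up toy frame: two labelled rows and two rows with a missing target
toy = pd.DataFrame({
    'Common Neighbors': [2, 0, 1, 1],
    'Future Connection': [1.0, 0.0, np.nan, np.nan],
})

labelled, unlabelled = split_labelled(toy, 'Future Connection')
print(len(toy), len(labelled), len(unlabelled))
```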