---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-social-network-analysis/resources/yPcBs) course resource._

---

# Assignment 4

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

# !pip install --upgrade git+https://github.com/bmurauer/pipelinehelper

---

## Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

In [2]:
P1_Graphs = pickle.load(open('A4_graphs','rb'))
P1_Graphs

[<networkx.classes.graph.Graph at 0x112f4d400>,
 <networkx.classes.graph.Graph at 0x112f4d630>,
 <networkx.classes.graph.Graph at 0x112f4d668>,
 <networkx.classes.graph.Graph at 0x112f4d6a0>,
 <networkx.classes.graph.Graph at 0x112f4d6d8>]

<br>
`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs and determine which of the three algorithms generated the graph.

*The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.*

In [3]:
def graph_identification():
    def classify(G):
        # Compute the average clustering coefficient once per graph
        c = nx.average_clustering(G)
        return 'PA' if 0.01 < c < 0.1 else 'SW_H' if c < 0.01 else 'SW_L'
    return [classify(G) for G in P1_Graphs]
    
graph_identification()

['PA', 'SW_L', 'SW_L', 'PA', 'SW_H']
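
The clustering thresholds used in `graph_identification` can be sanity-checked against graphs whose generator is known. The following is a minimal sketch and not part of the assignment: the graph size `n`, the generator parameters (5, 10, 0.05, 0.5) and the seed are assumptions chosen only for illustration, and the graphs in `A4_graphs` may have been built with different settings.

```python
import networkx as nx

n = 1000  # assumed graph size, chosen only for illustration

# Reference graphs built with known generators (parameters are assumptions)
reference = {
    'PA': nx.barabasi_albert_graph(n, 5, seed=0),
    'SW_L': nx.watts_strogatz_graph(n, 10, 0.05, seed=0),
    'SW_H': nx.watts_strogatz_graph(n, 10, 0.5, seed=0),
}

summary = {}
for label, graph in reference.items():
    degrees = dict(graph.degree()).values()
    summary[label] = {
        'clustering': nx.average_clustering(graph),
        'max_degree': max(degrees),
    }
    print(label, summary[label])
```

In such reference graphs, a small-world graph with little rewiring keeps most of the clustering of its ring lattice, heavy rewiring destroys it, and preferential attachment stands out through a few very high-degree hubs. Comparing both statistics is therefore usually more robust than relying on clustering alone.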

---

## Part 2 - Company Emails

For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.

In [4]:
G = nx.read_gpickle('email_prediction.txt')

print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


In [5]:
# print(G.nodes(data=True))

In [6]:
# print(G.edges(data=True))

### Part 2A - Salary Prediction

Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.
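
The workflow described above (node features, a classifier trained on labelled nodes, probabilities for unlabelled nodes) can be sketched on a small public graph. This is only an illustration under stated assumptions: it uses Zachary's karate club graph bundled with networkx instead of the email network, treats the `club` attribute as a stand-in for `ManagementSalary`, and hides every fourth label arbitrarily to mimic missing values.

```python
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

toy = nx.karate_club_graph()  # stand-in for the email network

# One row per node, one column per network feature
features = pd.DataFrame(index=list(toy.nodes()))
features['Clustering'] = pd.Series(nx.clustering(toy))
features['Degree'] = pd.Series(dict(toy.degree()))
features['Closeness'] = pd.Series(nx.closeness_centrality(toy))

# Binary label derived from the 'club' attribute (stand-in for ManagementSalary)
clubs = nx.get_node_attributes(toy, 'club')
labels = pd.Series({node: float(club == 'Officer') for node, club in clubs.items()})
labels.iloc[::4] = np.nan  # hide every fourth label (arbitrary choice)

# Train on nodes with a label, predict probabilities for the rest
known = labels.notnull()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features[known], labels[known])

proba = clf.predict_proba(features[~known])
positive_col = list(clf.classes_).index(1.0)
toy_predictions = pd.Series(proba[:, positive_col], index=features[~known].index)
print(toy_predictions.head())
```

Looking up the positive class through `clf.classes_` avoids assuming that the probability of the positive class is always in the second column.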



Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and one with an AUC of 0.82 or higher will pass (receiving 80% of the full points).
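
For intuition about the metric, the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper below is a hypothetical, pure-Python sketch of that definition (the name `auc_from_scores` is not part of the assignment or of any library), and the four labels and scores are made up for the example.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
example_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```

Because only the ranking of the scores matters, well-ordered probabilities are enough for a high AUC even if they are poorly calibrated.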

Using your trained classifier, return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64

In [7]:
# from sklearn.pipeline import Pipeline
# from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler
# from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.neighbors import KNeighborsClassifier
# from pipelinehelper import PipelineHelper

# pipe = Pipeline([
#     ('scaler', PipelineHelper([
#         ('std', StandardScaler()),
#         ('max', MaxAbsScaler()),
#     ], include_bypass=True)), # this will produce one setting without scaler
#     ('classifier', PipelineHelper([
# #         ('svm', SVC()),
# #         ('rf', RandomForestClassifier()),
# #         ('ada', AdaBoostClassifier()),
# #         ('gb', GradientBoostingClassifier()),
# #         ('knn', KNeighborsClassifier()),
        
# #         ('nb_pipe', Pipeline([
#             # Naive Bayes (MultinomialNB) needs non-negative features
# #             ('scaler', MinMaxScaler()),
# #             ('nb', MultinomialNB())
# #         ])),
        
#         ('mlp_pipe', Pipeline([
#             # Scale features to [0, 1] before the neural network
#             ('scaler', MinMaxScaler()),
#             ('mlp', MLPClassifier())
#         ])),
#     ])),
# ])

# params = {
#     'scaler__selected_model': pipe.named_steps['scaler'].generate({
#         'std__with_mean': [True, False],
#         'std__with_std': [True, False],
#         # no params for 'max' leads to using standard params
#     }),
#     'classifier__selected_model': pipe.named_steps['classifier'].generate({

# #         'svm__C': [0.1, 1.0],
# #         'svm__kernel': ['linear', 'rbf'],

# #         'rf__n_estimators': [10, 20, 50, 100, 150],
# #         'rf__max_features' : ['auto', 'sqrt', 'log2'],
# #         'rf__min_samples_split' : [2, 5, 10],
# #         'rf__min_samples_leaf' : [1, 2, 4],
# #         'rf__bootstrap': [True, False],

# #         'ada__n_estimators': [10, 20, 40, 100],
# #         'ada__algorithm': ['SAMME', 'SAMME.R'],
        
# #         'gb__n_estimators': [10, 20, 50, 100],
# #         'gb__criterion': ['friedman_mse', 'mse', 'mae'],
# #         'gb__max_features': ['auto', 'sqrt', None],

# #         'knn__n_neighbors'  : [2, 3, 5, 7, 10],
# #         'knn__leaf_size':[1,2,3,5],
# #         'knn__weights': ['uniform', 'distance'],
# #         'knn__algorithm': ['auto', 'ball_tree','kd_tree','brute'],

# #         'nb_pipe__nb__fit_prior': [True, False],
# #         'nb_pipe__nb__alpha': [0.1, 0.2],
        
#         'mlp_pipe__mlp__hidden_layer_sizes': [(100,100,100),(30,30,30),(10,10,10),(5,2),(10,2),(30,5)],
#         'mlp_pipe__mlp__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 5, 10],
#         'mlp_pipe__mlp__random_state': [0, 1, 10, 20],
#         'mlp_pipe__mlp__solver': ['lbfgs','adam']
#     })
# }


# salary_predictions data set

# BEST PARAMETERS
# {'classifier__selected_model': ('rf', 
# {'n_estimators': 100, 'max_features': 'log2', 'min_samples_split': 10, 'min_samples_leaf': 4, 'bootstrap': True})
# , 'scaler__selected_model': ('std', {'with_mean': True, 'with_std': True})}

# ROC SCORE: 0.9722596980504754

# {'classifier__selected_model': ('rf', {'n_estimators': 150, 'max_features': 'log2', 'min_samples_split': 2, 'min_samples_leaf': 4, 'bootstrap': True}), 'scaler__selected_model': ('std', {'with_mean': True, 'with_std': False})}
# 0.9682045101556715

# {'classifier__selected_model': ('mlp_pipe', {'mlp__hidden_layer_sizes': (10, 10, 10), 'mlp__alpha': 1, 'mlp__random_state': 0, 'mlp__solver': 'lbfgs'}), 'scaler__selected_model': ('std', {'with_mean': True, 'with_std': True})}
# 0.9593457652411499

In [8]:
def salary_predictions():
    # Initialize the dataframe, using the nodes as the index
    df = pd.DataFrame(index=G.nodes())
    # Extract the node attributes 
    df['Department'] = pd.Series(nx.get_node_attributes(G,'Department'))
    df['Management Salary'] = pd.Series(nx.get_node_attributes(G,'ManagementSalary'))
    # Creating node based features
    df['Clustering'] = pd.Series(nx.clustering(G))
    df['Degree'] = pd.Series(dict(G.degree()))
    df['Degree Centrality'] = pd.Series(nx.degree_centrality(G))
    df['Closeness'] = pd.Series(nx.closeness_centrality(G,normalized=True))
    df['Betweeness'] = pd.Series(nx.betweenness_centrality(G,normalized=True))
    df['PageRank'] = pd.Series(nx.pagerank(G))
#     df['Hubs'] = pd.Series(nx.hits(G,normalized=True)[0])
    
    # predict a probability of the node receiving a management salary for nodes where ManagementSalary is missing.
    # target = Management Salary
    df_train = df[~pd.isnull(df['Management Salary'])]
    df_test = df[pd.isnull(df['Management Salary'])]
    
#     return df_train, df_test
    
    X_train = df_train[['Department','Clustering','Degree','Degree Centrality','Closeness','Betweeness','PageRank']]
    X_test = df_test[['Department','Clustering','Degree','Degree Centrality','Closeness','Betweeness','PageRank']]
    y_train = df_train['Management Salary']
    
    scaler = StandardScaler(with_mean=True,with_std=False)
#     scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    rf = RandomForestClassifier(**{'n_estimators': 150, 'max_features': 'log2', 'min_samples_split': 2, 'min_samples_leaf': 4, 'bootstrap': True})
    rf.fit(X_train_scaled,y_train)
    yhat_prob = rf.predict_proba(X_test_scaled)

#     mlp = MLPClassifier(**{'hidden_layer_sizes': (10, 10, 10), 'alpha': 1, 'random_state': 0, 'solver': 'lbfgs'})
#     mlp.fit(X_train_scaled,y_train)
#     yhat_prob = mlp.predict_proba(X_test_scaled)
    
    return pd.Series(data=yhat_prob[:,1],index=X_test.index)

salary_predictions()

1       0.006647
2       0.890217
5       0.999999
8       0.148112
14      0.017568
18      0.168842
27      0.371367
30      0.423367
31      0.078429
34      0.060233
37      0.008235
40      0.029168
45      0.133255
54      0.534884
55      0.945227
60      0.032986
62      0.999972
65      0.988911
77      0.022210
79      0.026931
97      0.023756
101     0.003087
103     0.240765
108     0.025784
113     0.274330
122     0.001957
141     0.334581
142     0.695667
144     0.016096
145     0.882458
          ...   
913     0.000959
914     0.005529
915     0.001390
918     0.099507
923     0.000180
926     0.013048
931     0.001678
934     0.000077
939     0.000241
944     0.000556
945     0.000395
947     0.092325
950     0.002049
951     0.000324
953     0.000531
959     0.014104
962     0.002202
963     0.109878
968     0.008077
969     0.009075
974     0.024397
984     0.000057
987     0.033237
989     0.010657
991     0.016900
992     0.021744
994     0.001172
996     0.0014

In [9]:
# df_train, df_test = salary_predictions()

# features = ['Department','Clustering','Degree','Degree Centrality','Closeness','Betweeness','PageRank']
# X_train = df_train[features]
# X_test = df_test[features]
# y_train = df_train['Management Salary']

In [10]:
# grid = GridSearchCV(pipe,params,scoring='roc_auc',cv=5,verbose=1,n_jobs=-1)
# grid.fit(X_train,y_train)
# print(grid.best_params_)
# print(grid.best_score_)

### Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.

In [11]:
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
# future_connections.head(10)
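
The `converters={0: eval}` argument turns the text of the first column into tuples. A possible alternative, sketched here, is `ast.literal_eval`, which parses Python literals only and does not execute arbitrary expressions. The rows below are made up, and the header layout (an unnamed first column followed by `Future Connection`) is an assumption about the file.

```python
import ast
import io
import pandas as pd

# Made-up rows in an assumed layout similar to Future_Connections.csv
toy_csv = io.StringIO(
    ',Future Connection\n'
    '"(6, 840)",0.0\n'
    '"(4, 197)",1.0\n'
    '"(620, 979)",\n'
)

# Parse the first column into tuples without calling eval
toy_connections = pd.read_csv(toy_csv, index_col=0,
                              converters={0: ast.literal_eval})
print(toy_connections)
```

The last row has no value in `Future Connection`, so it is read as `NaN`, which is how the edges to be predicted appear in the data.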

In [12]:
# print(G.nodes(data=True))

Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.
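
As a small illustration of an edge feature matrix, the networkx link-prediction functions accept an `ebunch` argument, so scores can be computed for chosen node pairs only and are yielded in the order the pairs were given. The graph and the two pairs below are made up for the example; this is a sketch of the idea, not the solution used in this notebook.

```python
import networkx as nx
import pandas as pd

# Tiny made-up graph; (1, 4) and (1, 5) are not connected
toy = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
pairs = [(1, 4), (1, 5)]

scorers = {
    'Preferential Attachment': nx.preferential_attachment,
    'Jaccard Coefficient': nx.jaccard_coefficient,
    'Resource Allocation Index': nx.resource_allocation_index,
    'Adamic Adar Index': nx.adamic_adar_index,
}

# Keep each pair as a single tuple label instead of a MultiIndex
edge_features = pd.DataFrame(index=pd.Index(pairs, tupleize_cols=False))
for name, scorer in scorers.items():
    # Each scorer yields (u, v, score) triples in the order of `pairs`
    edge_features[name] = [score for _, _, score in scorer(toy, pairs)]

print(edge_features)
```

Passing the pairs explicitly avoids scoring every non-edge of the graph and removes any reliance on several generators returning pairs in the same order.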



Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and one with an AUC of 0.82 or higher will pass (receiving 80% of the full points).

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [16]:
pd.options.mode.chained_assignment = None

def new_connections_predictions():
    # create index on tuple of u,v (pair of nodes)
    df = pd.DataFrame(index=[(x[0], x[1]) for x in list(nx.preferential_attachment(G))])
    # take p score
    df['Preferential Attachment'] = [x[2] for x in list(nx.preferential_attachment(G))]
    # take p score
    df['CN Soundarajan Hopcroft'] = [x[2] for x in list(nx.cn_soundarajan_hopcroft(G,community='Department'))]
    # take p score
    df['Resource Allocation Index'] = [x[2] for x in list(nx.resource_allocation_index(G))]
    # take p score
    df['Jaccard Coefficient'] = [x[2] for x in list(nx.jaccard_coefficient(G))]
    # take p score
    df['Adamic Adar Index'] = [x[2] for x in list(nx.adamic_adar_index(G))]
    # take p score
    # df['RA Index Soundarajan Hopcroft'] = [x[2] for x in list(nx.ra_index_soundarajan_hopcroft(G,community='Department'))]    
    # take p score
    # df['Within Inter Cluster'] = [x[2] for x in list(nx.within_inter_cluster(G,community='Department'))]    
       
    # join future connections
    df = future_connections.join(df,how='outer')
    
    df_train = df[~pd.isnull(df['Future Connection'])]
    df_test = df[pd.isnull(df['Future Connection'])]
    
#     return df_train, df_test

    X_train = df_train[['Preferential Attachment','CN Soundarajan Hopcroft','Resource Allocation Index','Jaccard Coefficient','Adamic Adar Index']]
    X_test = df_test[['Preferential Attachment','CN Soundarajan Hopcroft','Resource Allocation Index','Jaccard Coefficient','Adamic Adar Index']]
    y_train = df_train['Future Connection']
    
#     scaler = StandardScaler(with_mean=True,with_std=False)
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
#     rf = RandomForestClassifier(**{'n_estimators': 150, 'max_features': 'log2', 'min_samples_split': 2, 'min_samples_leaf': 4, 'bootstrap': True})
#     rf.fit(X_train_scaled,y_train)
#     yhat_prob = rf.predict_proba(X_test_scaled)

    mlp = MLPClassifier(**{'hidden_layer_sizes': (10, 10, 10), 'alpha': 1, 'random_state': 0, 'solver': 'lbfgs'})
    mlp.fit(X_train_scaled,y_train)
    yhat_prob = mlp.predict_proba(X_test_scaled)

    preds = pd.Series(data=yhat_prob[:,1],index=X_test.index)
    
    # select the rows of future_connections that have a missing value
    target = future_connections[future_connections.isnull().any(axis=1)]
    
    target['Probability'] = [preds[x] for x in target.index]
    
    return target['Probability']

new_connections_predictions()    

(107, 348)    0.025915
(542, 751)    0.010576
(20, 426)     0.615793
(50, 989)     0.010638
(942, 986)    0.010703
(324, 857)    0.010629
(13, 710)     0.143566
(19, 271)     0.133401
(319, 878)    0.010659
(659, 707)    0.010583
(49, 843)     0.010678
(208, 893)    0.010597
(377, 469)    0.008234
(405, 999)    0.017541
(129, 740)    0.017122
(292, 618)    0.022193
(239, 689)    0.010663
(359, 373)    0.009167
(53, 523)     0.241084
(276, 984)    0.010675
(202, 997)    0.010687
(604, 619)    0.053479
(270, 911)    0.010662
(261, 481)    0.072299
(200, 450)    0.999926
(213, 634)    0.010557
(644, 735)    0.043586
(346, 553)    0.010405
(521, 738)    0.010114
(422, 953)    0.017220
                ...   
(672, 848)    0.010662
(28, 127)     0.939200
(202, 661)    0.010380
(54, 195)     0.999805
(295, 864)    0.010620
(814, 936)    0.010567
(839, 874)    0.010703
(139, 843)    0.010625
(461, 544)    0.009910
(68, 487)     0.009875
(622, 932)    0.010605
(504, 936)    0.016878
(479, 528) 

In [14]:
# df_train, df_test = new_connections_predictions()

# features = ['Preferential Attachment','CN Soundarajan Hopcroft','Resource Allocation Index','Jaccard Coefficient','Adamic Adar Index']
# X_train = df_train[features]
# X_test = df_test[features]
# y_train = df_train['Future Connection']

In [15]:
# grid = GridSearchCV(pipe,params,scoring='roc_auc',cv=5,verbose=1,n_jobs=-1)
# grid.fit(X_train,y_train)
# print(grid.best_params_)
# print(grid.best_score_)