---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-social-network-analysis/resources/yPcBs) course resource._

---

# Assignment 4

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

---

## Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

In [2]:
with open('A4_graphs', 'rb') as f:
    P1_Graphs = pickle.load(f)

<br>
`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs and determine which of the three algorithms generated the graph.

*The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.*

---

### QUESTION 1 - Solution Approach


Recall that in the preferential attachment model, nodes that already have a very high degree are more likely to gain new neighbors. A network grown this way therefore ends up with a few hubs and a wide spread of degree values, which is the signature used below to recognize it. The same intuition underlies the preferential attachment measure for link prediction: if both nodes in a pair have a very high degree, they are more likely to become connected in the future.

1. **HOW TO IDENTIFY 'PA'**: Look for very high-degree nodes, which show up as a wide spread of degree values (look for networks with more than 10-15 distinct degree values)
2. **HOW TO IDENTIFY 'SW_H'**: Look for a low value of average_clustering, since a high rewiring probability destroys most local triangles (look for values below 0.2)
3. **HOW TO IDENTIFY 'SW_L'**: By elimination: if it is neither 'PA' nor 'SW_H', then it is 'SW_L'



**SELECTION CRITERIA: PROCESS FOR ASSIGNING EACH OF THE UNKNOWN NETWORKS TO ITS RESPECTIVE ALGORITHM**

- **1st step :** Identify whether the network has very high degrees $\rightarrow$ **Selection Criterion Nº1** If len(hist_degrees) >= 15 $\Rightarrow$ `'PA'`

- **2nd step :** Identify whether the network has low average clustering $\rightarrow$ **Selection Criterion Nº2** If avg_cluster value < 0.2 $\Rightarrow$ `'SW_H'`

- **3rd step :** By elimination, if it is neither of the two, it must be $\Rightarrow$ `'SW_L'`
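The criteria above can be sanity-checked on synthetic graphs whose generator is known. The sketch below is illustrative only: the graph sizes and parameters (`n=1000`, `m=3`, `k=4`, rewiring probabilities 0.05 and 0.9) and the helper names `summarize` and `classify` are assumptions, not values taken from `A4_graphs`, and the thresholds may need tuning for graphs of a different size or density.

```python
import networkx as nx


def summarize(graph):
    """Return (number of distinct degree values, average clustering)."""
    degrees = dict(graph.degree())
    distinct_degrees = len(set(degrees.values()))
    return distinct_degrees, nx.average_clustering(graph)


def classify(graph, degree_cutoff=15, clustering_cutoff=0.2):
    """Apply the three-step rule: wide degree spread, then low clustering, else SW_L."""
    distinct_degrees, avg_clustering = summarize(graph)
    if distinct_degrees >= degree_cutoff:
        return 'PA'
    if avg_clustering < clustering_cutoff:
        return 'SW_H'
    return 'SW_L'


# Hypothetical test graphs with a known generator (parameters are assumptions)
samples = {
    'PA': nx.barabasi_albert_graph(1000, 3, seed=0),
    'SW_L': nx.watts_strogatz_graph(1000, 4, 0.05, seed=0),
    'SW_H': nx.watts_strogatz_graph(1000, 4, 0.9, seed=0),
}

for label, graph in samples.items():
    print(label, summarize(graph), classify(graph))
```

Printing the summary next to the predicted label makes it easy to see how far each graph sits from the two cutoffs before trusting them on the unknown graphs.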



In [3]:
def graph_identification():
    
    # Your Code Here
    master_list = []

    for graph in P1_Graphs:
        # Degree of every node as a dict (dict() also handles the DegreeView of networkx 2.x+)
        degrees = dict(graph.degree())

        # Sorted unique degree values of the network
        degrees_values = sorted(set(degrees.values()))

        # Histogram: fraction of nodes having each distinct degree value
        histogram = [list(degrees.values()).count(i) / float(nx.number_of_nodes(graph)) for i in degrees_values]


        # Algorithm identification process (criteria set above)
        # Check whether there are at least 15 distinct degree values
        if len(histogram) >= 15:
            master_list.append('PA')

        # Check whether the average clustering is lower than 0.2
        elif nx.average_clustering(graph) < 0.2:
            master_list.append('SW_H')
        
        # By discard process
        else:
            master_list.append('SW_L')
    
    
    return master_list

---

## Part 2 - Company Emails

For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.

In [4]:
G = nx.read_gpickle('email_prediction.txt')

### Part 2A - Salary Prediction

Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.
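A minimal sketch of the feature-matrix step described above, using networkx's built-in karate club graph as a stand-in for `G`. The `label` column, the rule that hides every fifth node, and the column names are hypothetical; PageRank is left out only to keep the sketch short.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Stand-in graph (assumption): Zachary's karate club, 34 nodes with a 'club' attribute
toy = nx.karate_club_graph()

# One row per node, one column per node attribute
nodes = pd.DataFrame.from_dict(dict(toy.nodes(data=True)), orient='index')

# Hypothetical binary label; hide every fifth node to mimic missing values
nodes['label'] = (nodes['club'] == 'Officer').astype(float)
nodes.loc[nodes.index % 5 == 0, 'label'] = np.nan

# Node features computed from the graph structure (pd.Series aligns on node id)
nodes['degree'] = pd.Series(nx.degree_centrality(toy))
nodes['closeness'] = pd.Series(nx.closeness_centrality(toy))
nodes['betweenness'] = pd.Series(nx.betweenness_centrality(toy))

feature_cols = ['degree', 'closeness', 'betweenness']

# Labeled rows are used for training, unlabeled rows are the ones to predict
labeled = nodes[nodes['label'].notnull()]
unlabeled = nodes[nodes['label'].isnull()]

X_train, y_train = labeled[feature_cols], labeled['label'].astype(int)
X_predict = unlabeled[feature_cols]
```

Building the frame with `from_dict(..., orient='index')` puts the node id in the index directly, so no separate rename or drop step is needed.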



Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
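To see why probabilities rather than hard 0/1 labels are requested, it helps to recall that for binary labels the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below implements that definition directly in plain Python; the function name `pairwise_auc` and the toy data are assumptions for illustration.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two negatives followed by two positives
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.8]

# Thresholding at 0.5 throws away the ranking among examples on the same side
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_hard_labels = pairwise_auc(y_true, hard_labels)
```

Because thresholding creates ties, the hard labels score lower here even though they come from the same model output.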

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and one with an AUC of 0.82 or higher will pass (get 80% of the full points).

Using your trained classifier, return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64
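The expected return shape can be sketched on made-up data: fit a classifier on labeled rows, then wrap the positive-class column of `predict_proba` in a `pd.Series` that reuses the node ids as its index. The feature names `f1` and `f2`, the random data, and the node ids are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical labeled nodes: two random features and a label derived from f1
X_known = pd.DataFrame(rng.rand(40, 2), columns=['f1', 'f2'], index=range(40))
y_known = (X_known['f1'] > 0.5).astype(int)

# Hypothetical nodes with a missing label, identified by their node ids
X_missing = pd.DataFrame(rng.rand(5, 2), columns=['f1', 'f2'],
                         index=[101, 205, 318, 422, 530])

clf = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
clf.fit(X_known, y_known)

# Look up the column of class 1 instead of assuming it is the second one
positive_col = list(clf.classes_).index(1)
probs = pd.Series(clf.predict_proba(X_missing)[:, positive_col],
                  index=X_missing.index)
```

Reusing `X_missing.index` is what keeps each probability attached to the right node id.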

In [5]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

In [6]:
# Store nodes info in a dataframe
df = pd.DataFrame(G.nodes(data=True))

# Set the node_id column as index
df = df.rename(columns={0: 'node_id'}).set_index('node_id')

# Extract Department & Management_Salary from dictionary and store it in separate cols
df['Department'] = pd.Series(nx.get_node_attributes(G, 'Department'))
df['Management_Salary'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))
# Drop the dictionary column
df.drop(1, axis=1, inplace=True)

# Add Various Centrality Measurements of Network
df['Degree_Centrality'] = pd.Series(nx.degree_centrality(G))
df['Closeness_Centrality'] = pd.Series(nx.closeness_centrality(G))
df['Betweeness_Centrality_Normalized'] = pd.Series(nx.betweenness_centrality(G))
df['Scaled_PageRank'] = pd.Series(nx.pagerank(G, alpha=0.8))

# Split the data into labeled rows (train) and unlabeled rows (Management_Salary is NaN, used for predictions)
# .copy() avoids pandas' SettingWithCopyWarning when columns are assigned below
df_train = df[df['Management_Salary'].notnull()].copy()
df_predict = df[df['Management_Salary'].isnull()].copy()

# Encode the Department values as integer category codes using LabelEncoder
label_scaler = LabelEncoder()
# Fit_transform the Departments of the train data
df_train['Department'] = label_scaler.fit_transform(df_train['Department'])
# Transform the Departments of the prediction data
df_predict['Department'] = label_scaler.transform(df_predict['Department'])

# Train-Test split
cols = ['Degree_Centrality', 'Closeness_Centrality', 'Betweeness_Centrality_Normalized', 'Scaled_PageRank']
X = df_train[cols]
y = df_train['Management_Salary'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Explore different classifiers with criteria: higher roc-auc (> 0.88 for 100%)

# List of classifiers to use
classifiers = [LogisticRegression(),
               DecisionTreeClassifier(),
               RandomForestClassifier(max_depth=3, random_state=2),
               KNeighborsClassifier(),
               MLPClassifier(hidden_layer_sizes=(10,5), random_state=0),
               SVC(kernel='rbf', probability=True)]

# Collect each classifier's name and ROC-AUC score
master_list = []

# Loop through the classifiers list
for clf in classifiers:
    clf.fit(X_train, y_train)
    master_list.append({'classifier': type(clf).__name__,
                        'roc_auc_score': roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])})
    
#master_list

AttributeError: 'Graph' object has no attribute '_node'

In [None]:
# Store nodes info in a dataframe
df = pd.DataFrame(G.nodes(data=True))

# Set the node_id column as index
df = df.rename(columns={0: 'node_id'}).set_index('node_id')

# Extract Department & Management_Salary from dictionary and store it in separate cols
df['Department'] = pd.Series(nx.get_node_attributes(G, 'Department'))
df['Management_Salary'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))
# Drop the dictionary column
df.drop(1, axis=1, inplace=True)

# Add Various Centrality Measurements of Network
df['Degree_Centrality'] = pd.Series(nx.degree_centrality(G))
df['Closeness_Centrality'] = pd.Series(nx.closeness_centrality(G))
df['Betweeness_Centrality_Normalized'] = pd.Series(nx.betweenness_centrality(G))
df['Scaled_PageRank'] = pd.Series(nx.pagerank(G, alpha=0.8))

# Split the data into labeled rows (train) and unlabeled rows (Management_Salary is NaN, used for predictions)
# .copy() avoids pandas' SettingWithCopyWarning when columns are assigned below
df_train = df[df['Management_Salary'].notnull()].copy()
df_predict = df[df['Management_Salary'].isnull()].copy()

# Encode the Department values as integer category codes using LabelEncoder
label_scaler = LabelEncoder()

# Fit_transform the Departments of the train data
df_train['Department'] = label_scaler.fit_transform(df_train['Department'])
# Transform the Departments of the prediction data
df_predict['Department'] = label_scaler.transform(df_predict['Department'])

# Feature columns used by the model
cols = ['Degree_Centrality', 'Closeness_Centrality', 'Betweeness_Centrality_Normalized', 'Scaled_PageRank']

# Train on all labeled nodes
X_train = df_train[cols]
y_train = df_train['Management_Salary'].astype(int)

# Prediction set: nodes with missing Management_Salary (y_test is all NaN)
X_test = df_predict[cols]
y_test = df_predict['Management_Salary']

# Set RandomForest Classifier
random_clf = RandomForestClassifier(max_depth=3, random_state=2).fit(X_train, y_train)

# Estimate predict_proba() and save it in a pd.Series()
answer = pd.Series(data=random_clf.predict_proba(X_test)[:, 1], index=X_test.index)

In [None]:
def salary_predictions():
    
    # Your Code Here
    # Store nodes info in a dataframe
    df = pd.DataFrame(G.nodes(data=True))

    # Set the node_id column as index
    df = df.rename(columns={0: 'node_id'}).set_index('node_id')

    # Extract Department & Management_Salary from dictionary and store it in separate cols
    df['Department'] = pd.Series(nx.get_node_attributes(G, 'Department'))
    df['Management_Salary'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))
    # Drop the dictionary column
    df.drop(1, axis=1, inplace=True)

    # Add Various Centrality Measurements of Network
    df['Degree_Centrality'] = pd.Series(nx.degree_centrality(G))
    df['Closeness_Centrality'] = pd.Series(nx.closeness_centrality(G))
    df['Betweeness_Centrality_Normalized'] = pd.Series(nx.betweenness_centrality(G))
    df['Scaled_PageRank'] = pd.Series(nx.pagerank(G, alpha=0.8))

    # Split the data into labeled rows (train) and unlabeled rows (Management_Salary is NaN, used for predictions)
    # .copy() avoids pandas' SettingWithCopyWarning when columns are assigned below
    df_train = df[df['Management_Salary'].notnull()].copy()
    df_predict = df[df['Management_Salary'].isnull()].copy()

    # Encode the Department values as integer category codes using LabelEncoder
    label_scaler = LabelEncoder()
    # Fit_transform the Departments of the train data
    df_train['Department'] = label_scaler.fit_transform(df_train['Department'])
    # Transform the Departments of the prediction data
    df_predict['Department'] = label_scaler.transform(df_predict['Department'])

    # Feature columns used by the model
    cols = ['Degree_Centrality', 'Closeness_Centrality', 'Betweeness_Centrality_Normalized', 'Scaled_PageRank']

    # Train on all labeled nodes
    X_train = df_train[cols]
    y_train = df_train['Management_Salary'].astype(int)

    # Prediction set: nodes with missing Management_Salary (y_test is all NaN)
    X_test = df_predict[cols]
    y_test = df_predict['Management_Salary']

    # Set RandomForest Classifier
    random_clf = RandomForestClassifier(max_depth=3, random_state=2).fit(X_train, y_train)
    
    
    return pd.Series(data=random_clf.predict_proba(X_test)[:, 1], index=X_test.index)

### Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
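The tuple index described above is stored as text in the CSV file, so it has to be parsed back into Python tuples on load. The sketch below shows that step on a small in-memory CSV using `ast.literal_eval`, which only accepts Python literals and is therefore a safer drop-in than `eval` for this purpose. The four rows are made-up sample data, not the contents of `Future_Connections.csv`.

```python
import ast
import io

import pandas as pd

# Hypothetical sample in the same layout: unnamed index column plus 'Future Connection'
csv_text = (
    ',Future Connection\n'
    '"(6, 840)",0.0\n'
    '"(4, 197)",0.0\n'
    '"(620, 979)",\n'
    '"(519, 872)",1.0\n'
)

# Convert column 0 from strings like "(6, 840)" into tuples, then use it as the index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0,
                     converters={0: ast.literal_eval})

first_pair = list(sample.index)[0]
n_missing = int(sample['Future Connection'].isnull().sum())
```

The empty field in the third row is read as `NaN`, which is how the edges to be predicted are marked.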

In [None]:
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
#future_connections.head(10)

Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.
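A minimal sketch of the edge-feature step described above, on a hand-built six-edge graph so the values can be checked by hand. The graph, the node pairs, and the column names are assumptions for illustration. `nx.jaccard_coefficient` and `nx.preferential_attachment` accept the whole list of pairs and yield `(u, v, score)` triples in the same order, so they can be called once instead of once per pair.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph: a small cluster (1, 2, 3, 4) with a pendant node 5
toy = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])

# Candidate pairs that are not connected yet
pairs = [(1, 4), (1, 5), (2, 5), (3, 5)]

features = pd.DataFrame(
    {
        # Number of neighbors shared by both endpoints
        'Common Neighbors': [len(list(nx.common_neighbors(toy, u, v)))
                             for u, v in pairs],
        # Shared neighbors divided by the size of the union of both neighborhoods
        'Jaccard Coefficient': [score for _, _, score
                                in nx.jaccard_coefficient(toy, pairs)],
        # Product of the two degrees
        'Preferential Attachment Score': [score for _, _, score
                                          in nx.preferential_attachment(toy, pairs)],
    },
    # Keep each pair as a single tuple label instead of a MultiIndex
    index=pd.Index(pairs, tupleize_cols=False),
)
```

For the pair `(1, 4)`, both nodes share neighbors 2 and 3, the union of their neighborhoods is `{2, 3, 5}`, and their degrees are 2 and 3, which gives 2 common neighbors, a Jaccard coefficient of 2/3, and a preferential attachment score of 6.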



Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and one with an AUC of 0.82 or higher will pass (get 80% of the full points).

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [None]:
# Compute the link prediction measures that will be used as edge features

# Estimate the Common Neighbors Measurement
future_connections['Common Neighbors'] = [len(list(nx.common_neighbors(G, n[0], n[1])))
                                          for n in future_connections.index]

# Estimate the Jaccard Coefficient
future_connections['Jaccard Coefficient'] = [list(nx.jaccard_coefficient(G, [node_tuple]))[0][2]
                                             for node_tuple in future_connections.index]

# Estimate the Preferential Attachment Score
future_connections['Preferential Attachment Score'] = [list(nx.preferential_attachment(G, [node_tuple]))[0][2]
                                                       for node_tuple in future_connections.index]

# Split the data into labeled and unlabeled rows (Future Connection == nan are used for predictions)
df_train = future_connections[future_connections['Future Connection'].notnull()].sample(frac=0.2)
df_test = future_connections[future_connections['Future Connection'].isnull()]

# Train-Test Split
cols = ['Common Neighbors', 'Jaccard Coefficient', 'Preferential Attachment Score']
X = df_train[cols]
y = df_train['Future Connection'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)


# Explore different classifiers with criteria: higher roc-auc (> 0.88 for 100%)

# List of classifiers to use
classifiers = [LogisticRegression(),
               DecisionTreeClassifier(),
               RandomForestClassifier(max_depth=3, random_state=2),
               KNeighborsClassifier()]
               #SVC(kernel='rbf', probability=True)]

# Collect each classifier's name and ROC-AUC score
master_list = []

# Loop through the classifiers list
for clf in classifiers:
    #print(f"Fitting Classifier: {str(clf).split('(')[0]}")
    clf.fit(X_train, y_train)
    master_list.append({'classifier': type(clf).__name__,
                        'roc_auc_score': roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])})
    #print(f"End fitting and appending ROC-AUC results of model {str(clf).split('(')[0]}\n")


In [None]:
# Compute the link prediction measures that will be used as edge features

# Estimate the Common Neighbors Measurement
future_connections['Common Neighbors'] = [len(list(nx.common_neighbors(G, n[0], n[1])))
                                          for n in future_connections.index]

# Estimate the Jaccard Coefficient
future_connections['Jaccard Coefficient'] = [list(nx.jaccard_coefficient(G, [node_tuple]))[0][2]
                                             for node_tuple in future_connections.index]

# Estimate the Preferential Attachment Score
future_connections['Preferential Attachment Score'] = [list(nx.preferential_attachment(G, [node_tuple]))[0][2]
                                                       for node_tuple in future_connections.index]

# Split the data in train and test (Future Connection == nan are used for predictions)
df_train = future_connections[future_connections['Future Connection'].notnull()]
df_test = future_connections[future_connections['Future Connection'].isnull()]

# Train on all labeled edges
cols = ['Common Neighbors', 'Jaccard Coefficient', 'Preferential Attachment Score']
X_train = df_train[cols]
y_train = df_train['Future Connection'].astype(int)

# Test
X_test = df_test[cols]
y_test = df_test['Future Connection']

# Set RandomForest Classifier
random_clf = RandomForestClassifier(max_depth=3, random_state=2).fit(X_train, y_train)

In [None]:
def new_connections_predictions():
    
    # Your Code Here
    # Compute the link prediction measures that will be used as edge features

    # Estimate the Common Neighbors Measurement
    future_connections['Common Neighbors'] = [len(list(nx.common_neighbors(G, n[0], n[1])))
                                              for n in future_connections.index]

    # Estimate the Jaccard Coefficient
    future_connections['Jaccard Coefficient'] = [list(nx.jaccard_coefficient(G, [node_tuple]))[0][2]
                                                 for node_tuple in future_connections.index]

    # Estimate the Preferential Attachment Score
    future_connections['Preferential Attachment Score'] = [list(nx.preferential_attachment(G, [node_tuple]))[0][2]
                                                           for node_tuple in future_connections.index]

    # Split the data in train and test (Future Connection == nan are used for predictions)
    df_train = future_connections[future_connections['Future Connection'].notnull()]
    df_test = future_connections[future_connections['Future Connection'].isnull()]

    # Train on all labeled edges
    cols = ['Common Neighbors', 'Jaccard Coefficient', 'Preferential Attachment Score']
    X_train = df_train[cols]
    y_train = df_train['Future Connection'].astype(int)

    # Test
    X_test = df_test[cols]
    y_test = df_test['Future Connection']

    # Set RandomForest Classifier
    random_clf = RandomForestClassifier(max_depth=3, random_state=2).fit(X_train, y_train)


    return pd.Series(data=random_clf.predict_proba(X_test)[:, 1], index=X_test.index)