---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-social-network-analysis/resources/yPcBs) course resource._

---

# Assignment 4

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import pickle

---

## Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

In [41]:
P1_Graphs = pickle.load(open('A4_graphs','rb'))
P1_Graphs

[<networkx.classes.graph.Graph at 0x1454fb059c8>,
 <networkx.classes.graph.Graph at 0x1454fb003c8>,
 <networkx.classes.graph.Graph at 0x1453fa50a48>,
 <networkx.classes.graph.Graph at 0x1454ccb2208>,
 <networkx.classes.graph.Graph at 0x1454ccb2688>]

<br>
`P1_Graphs` is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
* Preferential Attachment (`'PA'`)
* Small World with low probability of rewiring (`'SW_L'`)
* Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs and determine which of the three algorithms generated the graph.

*The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.*
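
One way to build intuition for this task is to generate reference graphs with known settings and compare their summary statistics. The sketch below is an illustration only: the graph size, seeds, and rewiring probabilities (`0.01` and `0.5`) are assumptions, not the values used to create `P1_Graphs`, and it assumes NetworkX 2.0 or later.

```python
import networkx as nx

def summarize(graph):
    """Return statistics that tend to separate the three generators."""
    degrees = [d for _, d in graph.degree()]
    return {
        "clustering": nx.average_clustering(graph),
        "path_length": nx.average_shortest_path_length(graph),
        "max_degree": max(degrees),
    }

# Hypothetical reference graphs; every parameter here is an illustrative assumption.
n = 500
reference = {
    "PA": nx.barabasi_albert_graph(n, 5, seed=0),
    "SW_L": nx.connected_watts_strogatz_graph(n, 10, 0.01, seed=0),
    "SW_H": nx.connected_watts_strogatz_graph(n, 10, 0.5, seed=0),
}
stats = {name: summarize(g) for name, g in reference.items()}
for name, row in stats.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```

In reference graphs like these, preferential attachment tends to produce a few very high-degree hubs, low rewiring keeps clustering high and paths long, and high rewiring lowers both.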

In [42]:
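def graph_identification():
    # Classify each graph from two summary statistics, using hand-picked thresholds:
    # average clustering coefficient and average shortest path length.
    answer = []
    for graph in P1_Graphs:
        clustering = nx.average_clustering(graph)
        path_length = nx.average_shortest_path_length(graph)
        if clustering < 0.1 and path_length < 5:
            # Low clustering with short paths: preferential attachment
            answer.append("PA")
        elif clustering > 0.1 and path_length > 5:
            # High clustering with long paths: small world, low rewiring
            answer.append("SW_L")
        else:
            # Everything else is treated as small world, high rewiring
            answer.append("SW_H")

    return answer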

---

## Part 2 - Company Emails

For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company which the person belongs to, and `ManagementSalary` indicates whether that person is receiving a management position salary.

In [4]:
G = nx.read_gpickle('email_prediction.txt')

print(nx.info(G))

Name: 
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree:  33.2458


### Part 2A - Salary Prediction

Using network `G`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.
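
The workflow described above can be sketched on a small public graph. This is a minimal illustration, not the assignment solution: it uses Zachary's karate club in place of `G`, a made-up binary label derived from the `club` attribute, three arbitrarily chosen features, and assumes NetworkX 2.0 or later.

```python
import networkx as nx
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for G, with a hypothetical binary label per node.
H = nx.karate_club_graph()
labels = {n: float(H.nodes[n]["club"] == "Officer") for n in H.nodes()}
for n in list(H.nodes())[::4]:
    labels[n] = float("nan")  # hide every fourth label to mimic missing values

# One row per node, one column per network feature.
features = pd.DataFrame({
    "degree": dict(H.degree()),
    "clustering": nx.clustering(H),
    "closeness": nx.closeness_centrality(H),
})
features["label"] = pd.Series(labels)

cols = ["degree", "clustering", "closeness"]
known = features[features["label"].notna()]
unknown = features[features["label"].isna()]

# Train on labeled nodes, then score the unlabeled ones.
model = LogisticRegression(max_iter=1000).fit(known[cols], known["label"])
# Column 1 of predict_proba is P(label == 1); the result is indexed by node id.
probs = pd.Series(model.predict_proba(unknown[cols])[:, 1], index=unknown.index)
print(probs)
```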



Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and one with an AUC of 0.82 or higher will pass (get 80% of the full points).
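
Because AUC measures how well the scores rank positives above negatives, it can differ depending on whether probabilities or thresholded 0/1 labels are scored. The sketch below computes AUC directly from its pairwise definition on made-up numbers; the helper name `roc_auc` and the data are hypothetical and only illustrate the metric.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
y_hard = [int(p >= 0.5) for p in y_prob]  # thresholded 0/1 predictions

print(roc_auc(y_true, y_prob))  # 8 of 9 pairs ranked correctly
print(roc_auc(y_true, y_hard))  # ties from thresholding lower the score to 7.5 of 9
```

In this toy case, thresholding discards the ordering among the low-scoring nodes, which is why the task asks for probabilities rather than class labels.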

Using your trained classifier, return a series of length 252 with the data being the probability of receiving management salary, and the index being the node id.

    Example:
    
        1       1.0
        2       0.0
        5       0.8
        8       1.0
            ...
        996     0.7
        1000    0.5
        1001    0.0
        Length: 252, dtype: float64

In [59]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Each item of G.nodes(data=True) is (node, attribute dict); unpack the dict into columns.
data=pd.DataFrame(list(G.nodes(data=True)),columns=["node","attrs"]).set_index("node")
data["department"]=data["attrs"].map(lambda x:x["Department"])
data["salary"]=data["attrs"].map(lambda x:x["ManagementSalary"])
data.drop(columns=["attrs"],inplace=True)


data["closeness"]=pd.Series(nx.closeness_centrality(G,normalized=True))
data["betweenness"]=pd.Series(nx.betweenness_centrality(G,normalized=True))
data["degree_centrality"]=pd.Series(nx.degree_centrality(G))
data['clustering'] = pd.Series(nx.clustering(G))
# dict() yields node -> degree pairs whether degree() returns a dict or a view
data['degree'] = pd.Series(dict(G.degree()))
data["pr"]=pd.Series(nx.pagerank(G))

na_data = data[data["salary"].isna()]
data=data[data["salary"].notna()]

X=data[["closeness","betweenness","degree","clustering","degree_centrality","pr"]]
y=data["salary"]
X_train,X_test,y_train,y_test=train_test_split(X,y)
scaler = MinMaxScaler()
X_train_scaled =scaler.fit_transform(X_train)
X_test_scaled =scaler.transform(X_test)
################################### All The Methods Tried  ###############################################

# clf = RandomForestClassifier()
# grid_values = {"n_estimators":[9,10,11],"max_depth":[3,4,5]}
# grid = GridSearchCV(clf,scoring="roc_auc",param_grid=grid_values,cv=5)
# grid.fit(X_train_scaled,y_train)

# clf = SVC(kernel="rbf",probability=True,random_state=0)
# grid_values={"gamma":[0.001,0.01,0.1,1]}
# grid = GridSearchCV(clf,param_grid=grid_values,scoring="roc_auc")
# grid.fit(X_train_scaled,y_train)

# clf = LogisticRegression(random_state=0)
# grid_values = {"C":[0.01,0.1,1]}
# grid = GridSearchCV(clf,scoring="roc_auc",param_grid=grid_values)
# grid.fit(X_train_scaled,y_train)

####################################              ###################################################################

clf = MLPClassifier(hidden_layer_sizes = [10, 5],
                   random_state = 0, solver='lbfgs', verbose=0)
grid_values={"alpha":[0.1]}       # grid_values={"alpha":[0.001,0.01,0.1,1,5]}  grid.best_params_ gives alpha=0.1
grid = GridSearchCV(clf,param_grid=grid_values,scoring="roc_auc")
grid.fit(X_train_scaled,y_train)


############################# Answer To Return For Assignment ###################################################

# answer_X=na_data[["closeness","betweenness","degree","clustering","degree_centrality","pr"]]
# na_index=na_data.index.tolist()
# answer_X=scaler.transform(answer_X)
# prob=grid.predict_proba(answer_X)[:,1]   # use the fitted grid search; clf itself is never fitted
# answer=pd.Series(prob,index=na_index)

###################################################            ###############################################


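# Note: grid.predict returns hard 0/1 labels, so the score below is the AUC of thresholded labels;
# grid.predict_proba(X_test_scaled)[:,1] would score the probability ranking instead.
y_pred=grid.predict(X_test_scaled)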

roc_auc_score(y_test,y_pred)  # Testing roc_auc_score  


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

0.8561821613968854

### Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
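
When node pairs are stored in a CSV, they arrive as text such as `"(6, 840)"` and have to be parsed back into tuples. A minimal sketch using only the standard library is shown below; the two-row excerpt and the column name `pair` are hypothetical. `ast.literal_eval` accepts only Python literals, so unlike `eval` it will not execute arbitrary code found in a file.

```python
import ast
import csv
import io

# Hypothetical excerpt laid out like the file; the "pair" header is an assumption.
raw = 'pair,Future Connection\n"(6, 840)",0.0\n"(97, 226)",1.0\n'

labels = {}
for row in csv.DictReader(io.StringIO(raw)):
    pair = ast.literal_eval(row["pair"])  # "(6, 840)" -> (6, 840)
    labels[pair] = float(row["Future Connection"])

print(labels)
```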

In [2]:
future_connections = pd.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)

            Future Connection
(6, 840)                  0.0
(4, 197)                  0.0
(620, 979)                0.0
(519, 872)                0.0
(382, 423)                0.0
(97, 226)                 1.0
(349, 905)                0.0
(429, 860)                0.0
(309, 989)                0.0
(468, 880)                0.0


Using network `G` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.
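
Edge features of this kind can be computed only for the pairs of interest by passing those pairs to NetworkX's link prediction functions. The sketch below is one possible approach on a tiny made-up graph; the helper `edge_features`, the choice of three measures, and the candidate pairs are assumptions for illustration.

```python
import networkx as nx

def edge_features(graph, pairs):
    """Return {pair: {feature name: score}} for the given candidate pairs."""
    pairs = list(pairs)
    feats = {pair: {} for pair in pairs}
    scorers = {
        "preferential_attachment": nx.preferential_attachment,
        "jaccard": nx.jaccard_coefficient,
        "resource_allocation": nx.resource_allocation_index,
    }
    for name, scorer in scorers.items():
        # Each scorer yields (u, v, score) for exactly the pairs it is given.
        for u, v, score in scorer(graph, pairs):
            feats[(u, v)][name] = score
    return feats

# Tiny made-up graph and candidate (currently unconnected) pairs.
T = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
candidates = [(1, 4), (2, 5), (3, 5)]
feats = edge_features(T, candidates)
for pair, row in feats.items():
    print(pair, row)
```

Keying each score by its pair avoids relying on the iteration order of the scoring functions when the features are later joined to the labels.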



Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.88 or higher will receive full points, and one with an AUC of 0.82 or higher will pass (get 80% of the full points).

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

    Example:
    
        (107, 348)    0.35
        (542, 751)    0.40
        (20, 426)     0.55
        (50, 989)     0.35
                  ...
        (939, 940)    0.15
        (555, 905)    0.35
        (75, 101)     0.65
        Length: 122112, dtype: float64

In [5]:

from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

for x in G.nodes():
    G.node[x]["community"]=G.node[x]["Department"]
preferential_attachment=list(nx.preferential_attachment(G))
data = pd.DataFrame(index=[(x[0],x[1]) for x in preferential_attachment])
data["preferential_attachment"]=[x[2] for x in preferential_attachment]

cn_sound = list(nx.cn_soundarajan_hopcroft(G))
data_cn = pd.DataFrame(index=[(x[0],x[1]) for x in cn_sound])
data_cn["cn_sound"]=[x[2] for x in cn_sound]


data=data.join(data_cn,how="outer")

data["cn_sound"] = data["cn_sound"].fillna(value=0)


data["resource_allocation"]=[x[2] for x in list(nx.resource_allocation_index(G))]
data["jaccard"]=[x[2] for x in list(nx.jaccard_coefficient(G))]

df = future_connections.join(data,how="outer")
df_na = df[df["Future Connection"].isna()]
df = df[df["Future Connection"].notna()]

X =df[["preferential_attachment","cn_sound","resource_allocation","jaccard"]]
y =df["Future Connection"]

X_train,X_test,y_train,y_test = train_test_split(X,y)


scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

        
        
###########################  All The Methods Tried ######################################################        
        
#     clf = RandomForestClassifier()
#     grid_values = {"n_estimators":[9,10,11],"max_depth":[3,4,5]}
#     grid = GridSearchCV(clf,scoring="roc_auc",param_grid=grid_values,cv=5)
#     grid.fit(X_train,y_train)

#     clf = SVC(kernel="rbf",probability=True,random_state=0)
#     grid_values={"gamma":[0.01,0.1,1]}
#     grid = GridSearchCV(clf,param_grid=grid_values,scoring="roc_auc")
#     grid.fit(X_train_scaled,y_train)


#     clf = LogisticRegression(random_state=0)
#     grid_values = {"C":[0.01,0.1,1]}
#     grid = GridSearchCV(clf,scoring="roc_auc",param_grid=grid_values)
#     grid.fit(X_train_scaled,y_train)

######################################                  ##########################################


clf = MLPClassifier(hidden_layer_sizes = [10, 5],
                   random_state = 0, solver='lbfgs', verbose=0)
grid_values = {"alpha":[0.1,1]}                                   # grid.best_params_  gives alpha= 1
    
grid = GridSearchCV(clf,param_grid=grid_values,scoring="roc_auc")
grid.fit(X_train_scaled,y_train)

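# Note: grid.predict returns hard 0/1 labels, so the AUC below is computed on thresholded labels;
# grid.predict_proba(X_test_scaled)[:,1] would score the probability ranking instead.
y_pred = grid.predict(X_test_scaled)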

####################################  Assignment Answer         #######################################
    
    
#     answer_X =df_na[["preferential_attachment","cn_sound","resource_allocation","jaccard"]] 
#     index = df_na.index.tolist()
#     answer_X = scaler.transform(answer_X)
    
#     prob=grid.predict_proba(answer_X)[:,1]   # use the fitted grid search; clf itself is never fitted
#     answer = pd.Series(prob,index=index)


roc_auc_score(y_test,y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

0.8039859109667596