**Authors**: 1. Tridev Parashar (Tridev-Parashar.Parashar@bayes.city.ac.uk)<br>
&emsp;&emsp;&emsp;&emsp;&emsp;&nbsp;2. Preeti Pothireddy (Preeti.Pothireddy@bayes.city.ac.uk)<br>&emsp;&emsp;&emsp;&emsp;&emsp;&nbsp;3. Mihir Salunke (Mihir.Salunke@bayes.city.ac.uk)<br>&emsp;&emsp;&emsp;&emsp;&emsp;&nbsp;4. Rajat Sawant (Rajat.Sawant@bayes.city.ac.uk)<br> 
&emsp;&emsp;&emsp;&emsp;&emsp;&nbsp;5. Elisavet Demetriou (Elisavet.Demetriou@bayes.city.ac.uk)<br>
 
**Synopsis:** This notebook builds network-analytic models to identify how Silico Inc. can improve
the performance of its R&D projects.

**Date:** 2022-11-19 22:33:30 


## Importing Libraries

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import math
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
import statsmodels.formula.api as smf
from collections import Counter
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)

## Loading Datasets

In [2]:
UA = nx.read_graphml('ua.graphml')
IE = nx.read_graphml('ie.graphml')
PA = nx.read_graphml('pa.graphml')
PO = pd.read_csv('po.csv')

## Descriptive Statistics

In [3]:
#Calculating the count of male and female employees, the average Ti_Exp and the average Tenure in the IE network:

Gender=[attrs['gender'] for _, attrs in IE.nodes(data=True)]
Ti_Exp=[attrs['ti_exp'] for _, attrs in IE.nodes(data=True)]
Tenure=[attrs['tenure'] for _, attrs in IE.nodes(data=True)]
print('\n1. Number of Male Employees :', Counter(Gender)[1])
print('2. Number of Female Employees :', Counter(Gender)[0])
#Note: math.ceil rounds each average up to the next whole number
print('3. Average Ti_Exp of Employees :', math.ceil(np.mean(Ti_Exp)))
print('4. Average Tenure of Employees :', math.ceil(np.mean(Tenure)))


1. Number of Male Employees : 491
2. Number of Female Employees : 667
3. Average Ti_Exp of Employees : 6
4. Average Tenure of Employees : 12


In [4]:
#Calculating Blau Index to determine network diversity in terms of Gender, Ti_Exp & Tenure:

print('\n1. Blau Index (Gender):', round(1-sum([(value/len(Gender))**2 for key,value in Counter(Gender).items()]),3))
print('2. Blau Index (Ti_Exp):', round(1-sum([(value/len(Ti_Exp))**2 for key,value in Counter(Ti_Exp).items()]),3))
print('3. Blau Index (Tenure):', round(1-sum([(value/len(Tenure))**2 for key,value in Counter(Tenure).items()]),3))


1. Blau Index (Gender): 0.488
2. Blau Index (Ti_Exp): 0.874
3. Blau Index (Tenure): 0.999
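
The Blau index used above is 1 minus the sum of squared category shares. A minimal sketch of the same calculation as a reusable function follows; the name `blau_index` is hypothetical and not part of the notebook, and the gender list is rebuilt from the counts printed earlier.

```python
from collections import Counter


def blau_index(values):
    """Return 1 - sum(p_k ** 2), where p_k is the share of category k."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    return 1 - sum((count / n) ** 2 for count in Counter(values).values())


# Gender counts reported above: 491 employees coded 1 and 667 coded 0.
gender = [1] * 491 + [0] * 667
print(round(blau_index(gender), 3))
```

Because Tenure is a continuous variable, almost every value forms its own category, so its index sits close to the upper bound of 1 - 1/n. That is likely why the Tenure figure is 0.999; binning Tenure first would probably give a more informative diversity measure.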


In [5]:
#Calculating the number of Strong and Weak Ties:

Ties=[i[2] for i in list(IE.edges(data=True))]
print('\n1. Number of Weak Ties:', Ties.count({'strength': 0}))
print('2. Number of Strong Ties:', Ties.count({'strength': 1}))     


1. Number of Weak Ties: 5072
2. Number of Strong Ties: 3321
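
Counting ties by comparing whole attribute dictionaries only matches edges whose data is exactly `{'strength': 0}` or `{'strength': 1}`. The sketch below counts by the attribute value instead, which still works if an edge carries extra attributes. The edge records are invented for illustration and only mimic the `(u, v, data)` shape that `IE.edges(data=True)` yields.

```python
from collections import Counter

# Hypothetical edge records, shaped like the tuples from IE.edges(data=True).
edges = [
    ("a", "b", {"strength": 1}),
    ("a", "c", {"strength": 0}),
    ("b", "c", {"strength": 0, "note": "extra attribute"}),
    ("c", "d", {}),  # edge with no strength recorded
]

# Tally edges by their strength value; a missing value is counted under None.
strength_counts = Counter(data.get("strength") for _, _, data in edges)
weak = strength_counts[0]
strong = strength_counts[1]
print("Weak:", weak, "Strong:", strong, "Missing:", strength_counts[None])
```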


## DataFrame Creation

In [6]:
#Step1: Converting IE_Nodes_Data into a DataFrame:

Emp_ID=list(IE.nodes())
Gender=[(i)[1]['gender'] for i in list(IE.nodes(data=True))]
Ti_Exp=[(i)[1]['ti_exp'] for i in list(IE.nodes(data=True))]
Tenure=[(i)[1]['tenure'] for i in list(IE.nodes(data=True))]
IE_Data_Frame=pd.DataFrame({'Emp_ID':Emp_ID,'Gender':Gender,'Ti_Exp':Ti_Exp,'Tenure':Tenure})

#Step2: Calculating Network_Measures at node level:

#Degree_centrality
Degree_Centrality=pd.DataFrame.from_dict(nx.degree_centrality(IE),orient='index')
Degree_Centrality.index.rename('Emp_ID',inplace=True)
Degree_Centrality.rename(columns={0:'Degree_Centrality'},inplace=True)

#Betweeness_Centrality
Betweeness_Centrality=pd.DataFrame.from_dict(nx.betweenness_centrality(IE),orient='index')
Betweeness_Centrality.index.rename('Emp_ID',inplace=True)
Betweeness_Centrality.rename(columns={0:'Betweeness_Centrality'},inplace=True)


#Closeness_Centrality
Closeness_Centrality=pd.DataFrame.from_dict(nx.closeness_centrality(IE),orient='index')
Closeness_Centrality.index.rename('Emp_ID',inplace=True)
Closeness_Centrality.rename(columns={0:'Closeness_Centrality'},inplace=True)

IE_Data_Frame=IE_Data_Frame.merge(Degree_Centrality,on='Emp_ID').merge(Betweeness_Centrality,on='Emp_ID').merge(Closeness_Centrality,on='Emp_ID')
#IE_Data_Frame
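
`nx.degree_centrality` divides each node's degree by n - 1, the largest degree a node can have in a simple graph. The sketch below reproduces that normalization in plain Python on a toy adjacency mapping (the helper name and toy data are hypothetical), then shows which tie counts correspond to typical centrality values in a 1158-node network, assuming IE is a simple undirected graph.

```python
def degree_centrality(adjacency):
    """Normalized degree: number of neighbours divided by (n - 1)."""
    n = len(adjacency)
    if n < 2:
        raise ValueError("need at least two nodes to normalize")
    return {node: len(neighbours) / (n - 1) for node, neighbours in adjacency.items()}


# Toy undirected graph: "a" is tied to "b" and "c"; "d" is isolated.
toy = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
toy_centrality = degree_centrality(toy)
print(toy_centrality)

# In a network of 1158 employees, a centrality value maps back to a tie count.
n_employees = 1158
centrality_by_ties = {ties: round(ties / (n_employees - 1), 6) for ties in (13, 14, 15, 16)}
print(centrality_by_ties)
```

Under that assumption, a Degree_Centrality of 0.012965 corresponds to an employee with 15 direct ties.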

In [7]:
#Step3: Merging IE_Data_Frame with Project level details(PA network and project performance details):

PA_Edges=pd.DataFrame.from_dict(list(PA.edges()))
PA_Edges.rename(columns={0:'Emp_ID',1:'Project_ID'},inplace=True)
PO.rename(columns={'project':'Project_ID'},inplace=True)
Final_Df=PA_Edges.merge(IE_Data_Frame,on='Emp_ID').merge(PO,on='Project_ID')
Final_Df

Unnamed: 0,Emp_ID,Project_ID,Gender,Ti_Exp,Tenure,Degree_Centrality,Betweeness_Centrality,Closeness_Centrality,project_score,patent_application
0,11-1,11-p5,1,8.0,19.043628,0.012965,0.000773,0.243579,54,0
1,11-64,11-p5,0,11.0,13.490373,0.013829,0.001981,0.241243,54,0
2,11-22,11-p5,0,4.0,13.763704,0.013829,0.001733,0.222800,54,0
3,11-32,11-p5,0,2.0,4.758325,0.012965,0.005137,0.237333,54,0
4,11-35,11-p5,0,0.0,18.410328,0.011236,0.000594,0.219461,54,0
...,...,...,...,...,...,...,...,...,...,...
660,3-24,3-p6,0,8.0,13.948475,0.012100,0.000618,0.218426,71,0
661,3-9,3-p6,0,8.0,2.881731,0.012965,0.002450,0.232003,71,0
662,3-31,3-p6,1,25.0,12.060524,0.013829,0.010988,0.241142,71,0
663,3-22,3-p6,0,0.0,19.560133,0.011236,0.000482,0.214220,71,0


In [8]:
#Step4: Determining the correlation between project_score and patent_application (numeric columns only; Emp_ID and Project_ID are strings):
Final_Df.select_dtypes('number').corr()

Unnamed: 0,Gender,Ti_Exp,Tenure,Degree_Centrality,Betweeness_Centrality,Closeness_Centrality,project_score,patent_application
Gender,1.0,-0.014943,0.059164,0.01683,0.007449,-0.030592,0.022526,-0.014938
Ti_Exp,-0.014943,1.0,0.084177,-0.010484,0.005486,0.051221,0.011883,-0.010205
Tenure,0.059164,0.084177,1.0,0.07102,0.049108,0.033594,0.002119,-0.005695
Degree_Centrality,0.01683,-0.010484,0.07102,1.0,0.612226,0.536351,0.0854,-0.015158
Betweeness_Centrality,0.007449,0.005486,0.049108,0.612226,1.0,0.690157,0.183567,0.045625
Closeness_Centrality,-0.030592,0.051221,0.033594,0.536351,0.690157,1.0,0.119788,0.009145
project_score,0.022526,0.011883,0.002119,0.0854,0.183567,0.119788,1.0,0.052711
patent_application,-0.014938,-0.010205,-0.005695,-0.015158,0.045625,0.009145,0.052711,1.0


In [9]:
#Step5: Cleaning Final_Df by dropping project_score because of its very low correlation with the KPI under consideration (patent_application):

Node_Df = Final_Df.drop('project_score',axis = 1)


#Step6: Aggregating node-level data to project level using median values (except Gender, which is aggregated using the mode):

Final_DF= Node_Df.groupby(['Project_ID'])[['Ti_Exp','Tenure','Degree_Centrality','Betweeness_Centrality','Closeness_Centrality','patent_application']].median()
Final_DF=Final_DF.merge(Node_Df.groupby(['Project_ID'])[['Gender']].agg(pd.Series.mode),on='Project_ID')
Final_DF

Project_ID,Ti_Exp,Tenure,Degree_Centrality,Betweeness_Centrality,Closeness_Centrality,patent_application,Gender
0-p0,10.0,15.976054,0.012965,0.002009,0.228160,0.0,1
0-p1,3.0,12.171597,0.012965,0.001022,0.228837,0.0,0
0-p2,0.0,12.547671,0.013829,0.004453,0.237528,0.0,1
0-p3,14.0,11.178196,0.012965,0.001609,0.239594,0.0,1
0-p4,5.0,11.519989,0.011236,0.000799,0.227353,0.0,0
...,...,...,...,...,...,...,...
9-p2,6.0,10.677689,0.012965,0.003601,0.251031,0.0,0
9-p3,0.0,8.657564,0.012965,0.003206,0.237479,0.0,0
9-p4,6.0,11.856158,0.012100,0.004440,0.240791,0.0,0
9-p5,4.0,10.513502,0.012100,0.001770,0.240891,0.0,0
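
The aggregation above collapses member-level rows into one row per project: a median for numeric columns and a mode for Gender. The stdlib sketch below illustrates the same idea on invented rows and flags the case where a project has no single modal gender, which the mode-based aggregation would need to handle. The row values and variable names are hypothetical.

```python
from collections import defaultdict
from statistics import median, multimode

# Hypothetical member-level rows (not taken from the real data).
rows = [
    {"Project_ID": "p1", "Ti_Exp": 8.0, "Gender": 1},
    {"Project_ID": "p1", "Ti_Exp": 11.0, "Gender": 0},
    {"Project_ID": "p1", "Ti_Exp": 4.0, "Gender": 0},
    {"Project_ID": "p2", "Ti_Exp": 2.0, "Gender": 0},
    {"Project_ID": "p2", "Ti_Exp": 6.0, "Gender": 1},
]

# Group the rows by project.
by_project = defaultdict(list)
for row in rows:
    by_project[row["Project_ID"]].append(row)

# Median for the numeric column, list of modal values for Gender.
summary = {}
for project, members in by_project.items():
    modes = multimode(m["Gender"] for m in members)
    summary[project] = {
        "Ti_Exp": median(m["Ti_Exp"] for m in members),
        "Gender_modes": modes,
        "Gender_tie": len(modes) > 1,  # True when no single mode exists
    }
print(summary)
```

Checking for ties before modelling is a cheap safeguard: a project with equally many members of each gender has no unique mode, and the aggregated Gender column would then not hold a single 0/1 value for that project.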


## Binary Logit Regressions

In [10]:
#Step7: Checking correlations in the final DataFrame before running regressions:
Final_DF.corr()

Unnamed: 0,Ti_Exp,Tenure,Degree_Centrality,Betweeness_Centrality,Closeness_Centrality,patent_application,Gender
Ti_Exp,1.0,0.017727,0.13208,-0.049682,0.04799,-0.157921,-0.050511
Tenure,0.017727,1.0,0.061753,0.143232,0.003751,0.048562,0.157372
Degree_Centrality,0.13208,0.061753,1.0,0.440696,0.33982,-0.130568,0.013966
Betweeness_Centrality,-0.049682,0.143232,0.440696,1.0,0.561545,0.118085,0.017268
Closeness_Centrality,0.04799,0.003751,0.33982,0.561545,1.0,-0.022945,-0.135711
patent_application,-0.157921,0.048562,-0.130568,0.118085,-0.022945,1.0,0.009549
Gender,-0.050511,0.157372,0.013966,0.017268,-0.135711,0.009549,1.0


In [11]:
#Step8: Checking for multicollinearity using the variance inflation factor (VIF)

#VIF was first calculated for Final_DF[['Betweeness_Centrality', 'Ti_Exp','Gender','Tenure']]; Tenure and Gender were then removed because of high collinearity

X1 = Final_DF[['Betweeness_Centrality', 'Ti_Exp']] 
vif_data1 = pd.DataFrame()
vif_data1["Feature"] = X1.columns
vif_data1["VIF"] = [vif(X1.values,i)for i in range(len(X1.columns))]
print(vif_data1);print('\n')

#VIF was first calculated for Final_DF[['Degree_Centrality', 'Ti_Exp','Gender','Tenure']]; Tenure and Gender were then removed because of high collinearity

X2 = Final_DF[['Degree_Centrality', 'Ti_Exp']] 
vif_data2 = pd.DataFrame()
vif_data2["Feature"] = X2.columns
vif_data2["VIF"] = [vif(X2.values,i)for i in range(len(X2.columns))]
print(vif_data2);print('\n')

#VIF was first calculated for Final_DF[['Closeness_Centrality', 'Ti_Exp','Gender','Tenure']]; Tenure and Gender were then removed because of high collinearity

X3 = Final_DF[['Closeness_Centrality', 'Ti_Exp']] 
vif_data3 = pd.DataFrame()
vif_data3["Feature"] = X3.columns
vif_data3["VIF"] = [vif(X3.values,i)for i in range(len(X3.columns))]
print(vif_data3)


                 Feature       VIF
0  Betweeness_Centrality  1.602618
1                 Ti_Exp  1.602618


             Feature       VIF
0  Degree_Centrality  2.665205
1             Ti_Exp  2.665205


                Feature       VIF
0  Closeness_Centrality  2.635904
1                Ti_Exp  2.635904
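
In each table above the two features share exactly the same VIF. With only two columns this is expected, since VIF is 1 / (1 - R²) and the R² of regressing one column on the other is symmetric. The sketch below illustrates this in plain Python. It also rests on an assumption about statsmodels: as far as we understand, `variance_inflation_factor` regresses each column on the others exactly as supplied, so when no constant column is included the R² is the uncentered one. The helper names and sample data are hypothetical.

```python
def uncentered_r2(x, y):
    """R^2 from regressing x on y with no intercept (uncentered sums of squares)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy ** 2 / (sxx * syy)


def vif_two_features(x, y):
    """VIF of x given y; the formula is symmetric in x and y."""
    return 1 / (1 - uncentered_r2(x, y))


def centered(values):
    """Subtract the mean, which mimics including an intercept in the regression."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical data, not taken from Final_DF.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

vif_raw = vif_two_features(x, y)                           # no constant
vif_with_const = vif_two_features(centered(x), centered(y))  # intercept included
print(round(vif_raw, 3), round(vif_with_const, 3))
```

If the assumption holds, the VIFs reported above partly reflect the non-zero means of the columns rather than the correlation between them, so re-running the check with a constant column added to the design matrix would be a sensible robustness test.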


### Model 1 - Regression summary and odds ratio

In [14]:
#Model: patent_application ~ Degree_Centrality + Ti_Exp 

log_reg1 = smf.logit("patent_application ~ Degree_Centrality+Ti_Exp", data = Final_DF).fit() 
print(log_reg1.summary())

Optimization terminated successfully.
         Current function value: 0.428710
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:     patent_application   No. Observations:                  133
Model:                          Logit   Df Residuals:                      130
Method:                           MLE   Df Model:                            2
Date:                Mon, 21 Nov 2022   Pseudo R-squ.:                 0.04421
Time:                        14:14:54   Log-Likelihood:                -57.018
converged:                       True   LL-Null:                       -59.656
Covariance Type:            nonrobust   LLR p-value:                   0.07156
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             4.5855      4.594      0.998      0.318      -4.418      13.589
Degree_C
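
The fit statistics in the summary header can be cross-checked by hand from the two log-likelihoods. The sketch below recomputes McFadden's pseudo R² and the likelihood-ratio p-value from the rounded values printed above, so small differences in the last digit are expected. It relies on the fact that a chi-square distribution with 2 degrees of freedom has the survival function exp(-x / 2).

```python
import math

# Values copied from the Model 1 summary above (already rounded).
ll_model = -57.018
ll_null = -59.656
df_model = 2

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic: twice the gain in log-likelihood over the null model.
llr_stat = 2 * (ll_model - ll_null)

# For exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2).
assert df_model == 2
llr_p_value = math.exp(-llr_stat / 2)

print(round(pseudo_r2, 4), round(llr_stat, 3), round(llr_p_value, 4))
```

An LLR p-value of roughly 0.07 means the model as a whole does not improve on the intercept-only model at the 5% level.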

In [15]:
#Determining the odds of generating a patent application using degree centrality as an independent variable:

odds_ratios1 = pd.DataFrame(
    {
        "OR": log_reg1.params,
        "Lower CI": log_reg1.conf_int()[0],
        "Upper CI": log_reg1.conf_int()[1],
    }
)
odds_ratios1 = np.exp(odds_ratios1)
print(odds_ratios1)

                              OR  Lower CI       Upper CI
Intercept           9.804851e+01  0.012062   7.970259e+05
Degree_Centrality  3.697417e-199  0.000000  1.250413e+119
Ti_Exp              8.755514e-01  0.744955   1.029042e+00
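
The odds ratio for Degree_Centrality looks extreme because it describes a one-unit change, and one unit spans the entire 0 to 1 range of a normalized centrality. A rough sketch of a more readable rescaling follows: it converts the reported odds ratio back to a coefficient and re-expresses it per additional tie, assuming one tie equals 1 / (n - 1) of centrality in the 1158-node IE network. Variable names are hypothetical, and the inputs are the rounded values printed above.

```python
import math

n_employees = 1158

# Odds ratios copied from the Model 1 output above.
or_degree = 3.697417e-199
or_ti_exp = 0.8755514

# Recover the logit coefficient: coef = ln(OR).
coef_degree = math.log(or_degree)

# Assumption: one extra tie raises degree centrality by 1 / (n - 1).
one_tie = 1 / (n_employees - 1)
or_per_tie = math.exp(coef_degree * one_tie)

# Percentage change in the odds for one extra unit of Ti_Exp.
pct_change_ti_exp = (or_ti_exp - 1) * 100

print(round(coef_degree, 2), round(or_per_tie, 3), round(pct_change_ti_exp, 1))
```

Both confidence intervals reported above include 1, so neither effect can be distinguished from no effect at the 5% level, whichever scale is used.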


### Model 2 - Regression summary and odds ratio

In [16]:
#Model: patent_application ~ Closeness_Centrality + Ti_Exp 

log_reg2 = smf.logit("patent_application ~ Closeness_Centrality+Ti_Exp", data = Final_DF).fit() 
print(log_reg2.summary())

Optimization terminated successfully.
         Current function value: 0.434436
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:     patent_application   No. Observations:                  133
Model:                          Logit   Df Residuals:                      130
Method:                           MLE   Df Model:                            2
Date:                Mon, 21 Nov 2022   Pseudo R-squ.:                 0.03144
Time:                        14:14:59   Log-Likelihood:                -57.780
converged:                       True   LL-Null:                       -59.656
Covariance Type:            nonrobust   LLR p-value:                    0.1533
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                0.5659      7.289      0.078      0.938     -13.719      14.851

In [17]:
#Determining the odds of generating a patent application using closeness centrality as an independent variable:

odds_ratios2 = pd.DataFrame(
    {
        "OR": log_reg2.params,
        "Lower CI": log_reg2.conf_int()[0],
        "Upper CI": log_reg2.conf_int()[1],
    }
)
odds_ratios2 = np.exp(odds_ratios2)
print(odds_ratios2)

                            OR      Lower CI      Upper CI
Intercept             1.760947  1.100776e-06  2.817044e+06
Closeness_Centrality  0.000948  1.973168e-30  4.550637e+23
Ti_Exp                0.863447  7.353966e-01  1.013793e+00


### Model 3 - Regression summary and odds ratio

In [18]:
#Model: patent_application ~ Betweeness_Centrality + Ti_Exp  

log_reg3 = smf.logit("patent_application ~ Betweeness_Centrality+Ti_Exp", data = Final_DF).fit() 
print(log_reg3.summary())

Optimization terminated successfully.
         Current function value: 0.429111
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:     patent_application   No. Observations:                  133
Model:                          Logit   Df Residuals:                      130
Method:                           MLE   Df Model:                            2
Date:                Mon, 21 Nov 2022   Pseudo R-squ.:                 0.04331
Time:                        14:15:09   Log-Likelihood:                -57.072
converged:                       True   LL-Null:                       -59.656
Covariance Type:            nonrobust   LLR p-value:                   0.07548
                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                -1.4521      0.500     -2.902      0.004      -2.433      -0.

In [19]:
#Determining the odds of generating a patent application using betweenness centrality as an independent variable:

odds_ratios3 = pd.DataFrame(
    {
        "OR": log_reg3.params,
        "Lower CI": log_reg3.conf_int()[0],
        "Upper CI": log_reg3.conf_int()[1],
    }
)
odds_ratios3 = np.exp(odds_ratios3)
print(odds_ratios3)

                                 OR      Lower CI       Upper CI
Intercept              2.340756e-01  8.778828e-02   6.241312e-01
Betweeness_Centrality  1.448715e+75  3.665683e-45  5.725471e+194
Ti_Exp                 8.677203e-01  7.393777e-01   1.018341e+00
