### Data 620 Project 1
#### Centrality Measures Across Categorical Groups

#### Jit and Sheryl
June 17, 2020

**Dataset** 
The data used in this project come from SG&R, a U.S. corporate law firm in New England. The dataset contains network measurements for 71 attorneys collected from 1988-1991. Three types of relationships are measured: co-worker, advice, and social. Attorneys are classified as partners or associates, and the data also record each attorney's gender.

**Co-worker Network** 
Attorneys were asked to go through a list of names and mark off those that they had worked with in the last year, either working on the same case or reading/using some of each other's work product.

**Advice Network** 
Attorneys were asked to go through a list of names and mark off those that they had gone to for professional advice in the last year.

**Social Network**
Attorneys were asked to go through a list of names and mark off those that they had socialized with outside of work.

Source: http://moreno.ss.uci.edu/data.html#lazega

**Analysis** 
The network data file contains three 71 x 71 matrices, one for each network type. A separate file contains a 71 x 7 matrix of attorney attributes, including gender. Both files are loaded for analysis.

Degree centrality and eigenvector centrality will be calculated for the nodes in each of the network types. These centrality measures will then be compared by gender. Hypothesis testing will be performed to determine if there are statistically significant differences in centrality measures by gender.
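To make the two measures concrete, here is a minimal from-scratch sketch on a toy graph. It is only an illustration: the function names, the adjacency-set representation, and the star graph are assumptions made for this example, and `networkx` may differ in details such as the starting vector and the convergence test.

```python
import math

def degree_centrality(adj):
    """Fraction of the other n-1 nodes that each node is connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def eigenvector_centrality(adj, max_iter=1000, tol=1e-9):
    """Power iteration on (A + I), normalised to unit Euclidean length."""
    x = {node: 1.0 for node in adj}
    for _ in range(max_iter):
        # New score = own score plus the sum of the neighbours' scores.
        new = {node: x[node] + sum(x[nbr] for nbr in adj[node]) for node in adj}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {node: v / norm for node, v in new.items()}
        if sum(abs(new[k] - x[k]) for k in adj) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")

# Toy undirected graph (hypothetical): a star with centre 0 and leaves 1, 2, 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
deg = degree_centrality(star)
eig = eigenvector_centrality(star)
print(deg)
print({node: round(score, 3) for node, score in eig.items()})
```

In the star, the centre touches every other node, so its degree centrality is 1, while each leaf scores 1/3. Eigenvector centrality also ranks the centre highest, because a node's score grows with the scores of its neighbours.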


In [47]:
import json
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
from pandas import DataFrame
from scipy import stats

### Read in data and map attributes to individuals

In [48]:
df_rel = pd.read_table("http://moreno.ss.uci.edu/lazega.dat",
                   skiprows=7) # Get relationship data
df_map = pd.read_table("http://moreno.ss.uci.edu/lazatt.dat")
df_cols = df_map[3:10] # dataframe with column labels
df_vals = df_map[11:].copy() # dataframe with column mapping values (copied so the assignment below does not write to a slice)
df_rel['DATA:'] = df_rel['DATA:'].apply(lambda x: x.split())
df_rel = df_rel['DATA:'].apply(pd.Series)
df_vals['DL'] = df_vals['DL'].apply(lambda x: x.split())
df_vals = df_vals['DL'].apply(pd.Series)
df_vals.columns = ['INDEX'] + df_cols['DL'].tolist() 



In [49]:
df_vals.head()

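|    | INDEX | STATUS | GENDER | OFFICE | SENIORITY | AGE | PRACTICE | LAW_SCHOOL |
|----|-------|--------|--------|--------|-----------|-----|----------|------------|
| 11 | 1     | 1      | 1      | 1      | 31        | 64  | 1        | 1          |
| 12 | 2     | 1      | 1      | 1      | 32        | 62  | 2        | 1          |
| 13 | 3     | 1      | 1      | 2      | 13        | 67  | 1        | 1          |
| 14 | 4     | 1      | 1      | 1      | 31        | 59  | 2        | 3          |
| 15 | 5     | 1      | 1      | 2      | 31        | 59  | 1        | 2          |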


### Build mapping dictionaries for gender and status

In [50]:
gender_dict = {int(k)-1:int(v) for k,v in zip(df_vals['INDEX'],df_vals['GENDER'])}
status_dict = {int(k)-1:int(v) for k,v in zip(df_vals['INDEX'],df_vals['STATUS'])}

### Look at Social Network

#### Build social network graph

In [51]:
df_social = df_rel[71:142].astype(int).reset_index(drop=True)

G_social=nx.from_pandas_adjacency(df_social)
for node in list(G_social.nodes):
    G_social.nodes[node]['gender'] = gender_dict[node]
    G_social.nodes[node]['status'] = status_dict[node]
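The slicing above relies on the three relations being stacked one after another in the file, 71 rows each. The sketch below shows the same idea on a hypothetical miniature input, using plain Python lists instead of pandas. The helper names and the rule that a tie exists if either person named the other are assumptions made for this illustration.

```python
def split_blocks(rows, n):
    """Split k*n parsed rows into k consecutive n x n blocks."""
    if len(rows) % n != 0:
        raise ValueError("row count is not a multiple of n")
    return [rows[i:i + n] for i in range(0, len(rows), n)]

def undirected_adjacency(matrix):
    """Adjacency sets where i--j is an edge if matrix[i][j] or matrix[j][i] is nonzero."""
    n = len(matrix)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(n):
            # Diagonal entries (self-nominations) are ignored in this sketch.
            if i != j and matrix[i][j] != 0:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Hypothetical miniature file: two stacked 3 x 3 matrices, one text line per row.
lines = ["0 1 0", "0 0 0", "1 0 0",   # first relation
         "0 0 1", "0 0 0", "0 0 0"]   # second relation
rows = [[int(tok) for tok in line.split()] for line in lines]
first, second = split_blocks(rows, 3)
adj_first = undirected_adjacency(first)
adj_second = undirected_adjacency(second)
print(adj_first)
print(adj_second)
```

Node ids here are 0-based row positions, which is why the attribute dictionaries in the notebook subtract 1 from the 1-based `INDEX` column.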

#### Visualize Social Network
The visualization of the social network by gender is below.  Two individuals, one male and one female, are not connected to the rest of the network.  There are also more males than females, and judging by node size, males tend to have higher degree than females.

### Social Network - by Gender

Orange is male, purple is female. Size indicates degree.

![title](graphics/social_gender.png)

The visualization of the social network by status is below.  We see the two individuals that are not connected to the rest of the network are associates.  Looking at the size of the node, it appears that partners have higher degrees than associates.

### Social Network - by Status

Brown is partner, green is associate. Size indicates degree.

![title](graphics/social_status.png)

#### Get centrality
Degree centrality and eigenvector centrality are calculated for each node.

In [52]:
social_eig_centrality = nx.eigenvector_centrality(G_social, max_iter=1000)
social_deg_centrality = nx.degree_centrality(G_social)

print('Social Network: Eigenvector Centrality')
print('---------------------------------------------------------------')
print('')

for attribute in [1,2]:
    
    gender_centrality = {k:v for k,v in social_eig_centrality.items() if gender_dict[k]==attribute}
    status_centrality = {k:v for k,v in social_eig_centrality.items() if status_dict[k]==attribute}
    
    print('Gender : {}'.format(attribute))
    print(sorted([(round(x,3),round(y,3)) 
                  for x,y in gender_centrality.items()], 
                  key=lambda x: x[1], reverse=True)[:5])
    
    print('Status : {}'.format(attribute))
    print(sorted([(round(x,3),round(y,3)) 
                  for x,y in status_centrality.items()], 
                  key=lambda x: x[1], reverse=True)[:5])
    print('---------------------------------------------------------------')
    print('')
    
print('Social Network: Degree Centrality: All')
print(sorted([(round(x,3),round(y,3)) 
               for x,y in social_deg_centrality.items()], 
               key=lambda x: x[1], reverse=True)[:5])

Social Network: Eigenvector Centrality
---------------------------------------------------------------

Gender : 1
[(23, 0.265), (25, 0.25), (12, 0.245), (16, 0.236), (3, 0.216)]
Status : 1
[(23, 0.265), (25, 0.25), (12, 0.245), (16, 0.236), (26, 0.221)]
---------------------------------------------------------------

Gender : 2
[(26, 0.221), (42, 0.174), (37, 0.157), (28, 0.138), (38, 0.135)]
Status : 2
[(64, 0.196), (40, 0.177), (42, 0.174), (37, 0.157), (38, 0.135)]
---------------------------------------------------------------

Social Network: Degree Centrality: All
[(30, 0.4), (16, 0.357), (23, 0.357), (12, 0.343), (25, 0.343)]


In the social network, the top 5 males by eigenvector centrality are nodes 23, 25, 12, 16, and 3.  Four of these also appear in the top 5 partners (status 1); node 3 is displaced there by node 26, a female partner with a slightly higher score (0.221 versus 0.216).  The top 5 females by eigenvector centrality are 26, 42, 37, 28, and 38.  With the exception of node 26, the top females score lower than every one of the top 5 males.

Looking at degree centrality for all nodes, we see some familiar individuals.  16, 23, 12, and 25 were also in the top 5 males of eigenvector centrality.  Node 30 has the highest degree centrality, but is not in the top for eigenvector centrality.  Looking back at the visualization we see that node 30 is also male.
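The per-group ranking printed above can be factored into a small helper. This is a sketch under assumptions: `top_k_by_group` is a hypothetical name, and the scores and group codes below are made up for illustration rather than taken from the law-firm data.

```python
def top_k_by_group(scores, groups, k=5):
    """Return {group: [(node, score), ...]} with the k highest-scoring nodes per group."""
    result = {}
    for group in sorted(set(groups.values())):
        members = [(node, round(score, 3))
                   for node, score in scores.items()
                   if groups[node] == group]
        result[group] = sorted(members, key=lambda pair: pair[1], reverse=True)[:k]
    return result

# Hypothetical centrality scores and group codes (1 and 2).
scores = {0: 0.30, 1: 0.10, 2: 0.25, 3: 0.05, 4: 0.20}
groups = {0: 1, 1: 1, 2: 2, 3: 2, 4: 1}
top = top_k_by_group(scores, groups, k=2)
print(top)
```

The same helper could be called once with a gender mapping and once with a status mapping, which avoids repeating the sort expression for each attribute.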

In [53]:
social_df = pd.DataFrame(np.array([list(social_eig_centrality.values()),
                                   list(social_deg_centrality.values()),
                                   list(social_deg_centrality.keys())]).T,
                          columns = ['eig_centrality', 'deg_centrality','id'])

social_df['gender'] = social_df['id'].map(gender_dict)
social_df['status'] = social_df['id'].map(status_dict)

#### Scatter Plot of Social Network Eigenvector Centrality by Gender 

In [114]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=social_df[social_df['gender']==1]['deg_centrality'],
    y=social_df[social_df['gender']==1]['eig_centrality'],
    text=social_df[social_df['gender']==1]['id'],
    hoverinfo='text',
    mode='markers+text',
    name='Male',
    textfont_size=8,
    textfont_color='black',
    marker=dict(size=14, 
                color='lightsalmon',
                symbol='circle')))

fig.add_trace(go.Scatter(
    x = social_df[social_df['gender']==2]['deg_centrality'],
    y = social_df[social_df['gender']==2]['eig_centrality'],
    text=social_df[social_df['gender']==2]['id'],
    mode='markers+text',
    name = 'Female',
    textfont_size=8,
    textfont_color='white',
    marker=dict(size=14, 
                color='purple', 
                symbol='circle')))
    

fig.update_layout(go.Layout(
    title='Social Network - Centrality by Gender',
    xaxis=dict(
        title='Degree Centrality'
    ),
    yaxis=dict(
        title='Eigenvector Centrality'
    ),
    hovermode='closest',
    plot_bgcolor='white',
    paper_bgcolor='white',
))
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.show()

#### Get Mean and Variance for Social Network Eigenvector Centrality by Gender

In [55]:
print('Social Network Eigenvector Centrality Statistics')
print('Male mean: {}'.format(round(np.mean(social_df[social_df['gender']==1]['eig_centrality']),3)))
print('Female mean: {}'.format(round(np.mean(social_df[social_df['gender']==2]['eig_centrality']),3)))
print('Male variance: {}'.format(round(np.var(social_df[social_df['gender']==1]['eig_centrality']),4)))
print('Female variance: {}'.format(round(np.var(social_df[social_df['gender']==2]['eig_centrality']),4)))

Social Network Eigenvector Centrality Statistics
Male mean: 0.101
Female mean: 0.079
Male variance: 0.0052
Female variance: 0.0042
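One detail worth keeping in mind when reading these variances: `np.var` defaults to `ddof=0`, which is the population variance, whereas a two-sample t-test works with the sample variance (`ddof=1`). The sketch below shows the difference on made-up values using only the standard library; the `summarize` helper is a hypothetical name.

```python
from statistics import mean, pvariance, variance

def summarize(values):
    """Mean plus both variance conventions for a list of numbers."""
    return {
        "mean": mean(values),
        "pop_var": pvariance(values),    # divides by n, like ddof=0
        "sample_var": variance(values),  # divides by n - 1, like ddof=1
    }

# Hypothetical centrality values for one group.
values = [0.10, 0.20, 0.30, 0.40]
summary = summarize(values)
print({name: round(v, 4) for name, v in summary.items()})
```

For small groups the two conventions can differ noticeably, so it helps to state which one a reported variance uses.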


### T-Test for Differences in Social Network Eigenvector Centrality - Gender  

Test whether there is a statistically significant difference between mean social network eigenvector centralities for males and females using an unpaired t-test that does not assume equal variances (Welch's t-test, `equal_var = False`).

In [57]:
social_ttest = stats.ttest_ind(social_df[social_df['gender']==1]['eig_centrality'],
                               social_df[social_df['gender']==2]['eig_centrality'], 
                               equal_var = False)
print('p-value for t-test comparing mean social eigenvector centralities for males and females: {}'.format(round(social_ttest[1],3)))

p-value for t-test comparing mean social eigenvector centralities for males and females: 0.251


There is no statistically significant difference between mean social eigenvector centralities for males and females (p-value = 0.251). 
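Calling `stats.ttest_ind` with `equal_var = False` corresponds to Welch's unequal-variance t-test. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom by hand on made-up samples. It stops short of the p-value, which needs the t-distribution CDF that SciPy supplies. The function name `welch_t` and the sample values are assumptions for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean, using sample variances (n - 1).
    va = variance(a) / na
    vb = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical samples of unequal size and spread.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0]
t_stat, df = welch_t(a, b)
print(round(t_stat, 3), round(df, 3))
```

Because the degrees of freedom depend on both group sizes and both variances, they are usually not an integer and are smaller than `na + nb - 2`.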

### Look at Advice Network

#### Build advice network graph

In [58]:
df_advice = df_rel[0:71].astype(int).reset_index(drop=True)

G_advice=nx.from_pandas_adjacency(df_advice)
for node in list(G_advice.nodes):
    G_advice.nodes[node]['gender'] = gender_dict[node]
    G_advice.nodes[node]['status'] = status_dict[node]

#### Visualize Advice Network
The visualization of the advice network by gender is below.  Again, male nodes appear to have higher degree than female nodes.  Node 25, a male, has the highest degree in the advice network.  Interestingly, node 30, which had the highest degree in the social network, is not very prominent in the advice network.

### Advice Network - by Gender

Orange is male, purple is female. Size indicates degree

![title](graphics/advice_gender.png)

The visualization of the advice network by status is below.  As we would expect, in general, partners appear to have higher degrees in the advice network than associates.

### Advice Network - by Status

Brown is partner, green is associate. Size indicates degree

![title](graphics/advice_status.png)

#### Get centrality
Degree centrality and eigenvector centrality are calculated for each node.

In [59]:
advice_eig_centrality = nx.eigenvector_centrality(G_advice, max_iter=1000)
advice_deg_centrality = nx.degree_centrality(G_advice)

print('Advice Network: Eigenvector Centrality')
print('---------------------------------------------------------------')
print('')

for attribute in [1,2]:
    
    gender_centrality = {k:v for k,v in advice_eig_centrality.items() if gender_dict[k]==attribute}
    status_centrality = {k:v for k,v in advice_eig_centrality.items() if status_dict[k]==attribute}
    
    print('Gender : {}'.format(attribute))
    print(sorted([(round(x,3),round(y,3)) 
                  for x,y in gender_centrality.items()], 
                  key=lambda x: x[1], reverse=True)[:5])
    
    print('Status : {}'.format(attribute))
    print(sorted([(round(x,3),round(y,3)) 
                  for x,y in status_centrality.items()], 
                  key=lambda x: x[1], reverse=True)[:5])
    print('---------------------------------------------------------------')
    print('')
    
print('Advice Network: Degree Centrality: All')
print(sorted([(round(x,3),round(y,3)) 
               for x,y in advice_deg_centrality.items()], 
               key=lambda x: x[1], reverse=True)[:5])

Advice Network: Eigenvector Centrality
---------------------------------------------------------------

Gender : 1
[(25, 0.251), (12, 0.221), (23, 0.206), (40, 0.179), (15, 0.169)]
Status : 1
[(25, 0.251), (12, 0.221), (23, 0.206), (15, 0.169), (16, 0.166)]
---------------------------------------------------------------

Gender : 2
[(26, 0.159), (33, 0.132), (38, 0.127), (28, 0.121), (37, 0.117)]
Status : 2
[(40, 0.179), (41, 0.154), (54, 0.152), (64, 0.151), (39, 0.15)]
---------------------------------------------------------------

Advice Network: Degree Centrality: All
[(25, 0.657), (12, 0.571), (23, 0.514), (15, 0.486), (40, 0.457)]


In the advice network, the top 5 males by eigenvector centrality are 25, 12, 23, 40, and 15.  Three of these males were also at the top of the social network (25, 12, 23).  Again, four of these top 5 are partners; node 40 is an associate.  The top 5 females by eigenvector centrality are 26, 33, 38, 28, and 37.  Four of these (26, 38, 28, 37) were also in the top 5 of the social network.  Again, the values for the top females are lower than those of the top 5 males.  

Looking at degree centrality for all nodes, we see some familiar individuals.  The top 5 nodes are the same five as the top 5 males by eigenvector centrality, with nodes 15 and 40 swapping places.

In [60]:
advice_df = pd.DataFrame(np.array([list(advice_eig_centrality.values()),
                                   list(advice_deg_centrality.values()),
                                   list(advice_deg_centrality.keys())]).T,
                          columns = ['eig_centrality', 'deg_centrality','id'])

advice_df['gender'] = advice_df['id'].map(gender_dict)
advice_df['status'] = advice_df['id'].map(status_dict)

#### Scatter Plot of Advice Network Eigenvector Centrality by Gender 

In [113]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=advice_df[advice_df['gender']==1]['deg_centrality'],
    y=advice_df[advice_df['gender']==1]['eig_centrality'],
    text=advice_df[advice_df['gender']==1]['id'],
    hoverinfo='text',
    mode='markers+text',
    name='Male',
    textfont_size=8,
    textfont_color='black',
    marker=dict(size=14, 
                color='lightsalmon',
                symbol='circle')))

fig.add_trace(go.Scatter(
    x = advice_df[advice_df['gender']==2]['deg_centrality'],
    y = advice_df[advice_df['gender']==2]['eig_centrality'],
    text=advice_df[advice_df['gender']==2]['id'],
    mode='markers+text',
    name = 'Female',
    textfont_size=8,
    textfont_color='white',
    marker=dict(size=14, 
                color='purple', 
                symbol='circle')))
    

fig.update_layout(go.Layout(
    title='Advice Network - Centrality by Gender',
    xaxis=dict(
        title='Degree Centrality'
    ),
    yaxis=dict(
        title='Eigenvector Centrality'
    ),
    hovermode='closest',
    plot_bgcolor='white',
    paper_bgcolor='white',
))
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.show()


#### Get Mean and Variance for Advice Network Eigenvector Centrality by Gender

In [62]:
print('Advice Network Eigenvector Centrality Statistics')
print('Male mean: {}'.format(round(np.mean(advice_df[advice_df['gender']==1]['eig_centrality']),3)))
print('Female mean: {}'.format(round(np.mean(advice_df[advice_df['gender']==2]['eig_centrality']),3)))
print('Male variance: {}'.format(round(np.var(advice_df[advice_df['gender']==1]['eig_centrality']),4)))
print('Female variance: {}'.format(round(np.var(advice_df[advice_df['gender']==2]['eig_centrality']),4)))

Advice Network Eigenvector Centrality Statistics
Male mean: 0.116
Female mean: 0.089
Male variance: 0.0023
Female variance: 0.0014


### T-Test for Differences in Advice Network Eigenvector Centrality - Gender  

Test whether there is a statistically significant difference between mean advice network eigenvector centralities for males and females using an unpaired t-test that does not assume equal variances (Welch's t-test, `equal_var = False`).

In [64]:
advice_ttest = stats.ttest_ind(advice_df[advice_df['gender']==1]['eig_centrality'],
                               advice_df[advice_df['gender']==2]['eig_centrality'], 
                               equal_var = False)
print('p-value for t-test comparing mean advice eigenvector centralities for males and females: {}'.format(round(advice_ttest[1],3)))

p-value for t-test comparing mean advice eigenvector centralities for males and females: 0.024


There is a statistically significant difference between mean advice eigenvector centralities for males and females (p-value = 0.024). 

### Look at Co-worker Network

#### Build co-worker network graph

In [65]:
df_work = df_rel[142:].astype(int).reset_index(drop=True)

G_work=nx.from_pandas_adjacency(df_work)
for node in list(G_work.nodes):
    G_work.nodes[node]['gender'] = gender_dict[node]
    G_work.nodes[node]['status'] = status_dict[node]

In [18]:
nx.write_gml(G_work, 'G_work.gml')
nx.write_gml(G_social, 'G_social.gml')
nx.write_gml(G_advice, 'G_advice.gml')

#### Visualize Co-worker Network
The visualization of the co-worker network by gender is below.  Again, it appears the male nodes have higher degrees than the females.  Nodes 23 and 25, both males, have high degrees in the co-worker network.  

### Work Network - by Gender

Orange is male, purple is female. Size indicates degree

![title](graphics/work_gender.png)

The visualization of the co-worker network by status is below.  As we would expect, in general, partners appear to have higher degrees in the co-worker network than associates.

### Work Network - by Status

Brown is partner, green is associate. Size indicates degree

![title](graphics/work_status.png)

In [66]:
work_eig_centrality = nx.eigenvector_centrality(G_work, max_iter=1000)
work_deg_centrality = nx.degree_centrality(G_work)

print('Co-worker Network: Eigenvector Centrality')
print('---------------------------------------------------------------')
print('')

for attribute in [1,2]:
    
    gender_centrality = {k:v for k,v in work_eig_centrality.items() if gender_dict[k]==attribute}
    status_centrality = {k:v for k,v in work_eig_centrality.items() if status_dict[k]==attribute}
    
    print('Gender : {}'.format(attribute))
    print(sorted([(round(x,3),round(y,3)) 
                  for x,y in gender_centrality.items()], 
                  key=lambda x: x[1], reverse=True)[:5])
    
    print('Status : {}'.format(attribute))
    print(sorted([(round(x,3),round(y,3)) 
                  for x,y in status_centrality.items()], 
                  key=lambda x: x[1], reverse=True)[:5])
    print('---------------------------------------------------------------')
    print('')
    
print('Co-worker Network: Degree Centrality: All')
print(sorted([(round(x,3),round(y,3)) 
               for x,y in work_deg_centrality.items()], 
               key=lambda x: x[1], reverse=True)[:5])

Co-worker Network: Eigenvector Centrality
---------------------------------------------------------------

Gender : 1
[(23, 0.236), (25, 0.235), (21, 0.197), (14, 0.195), (18, 0.195)]
Status : 1
[(23, 0.236), (25, 0.235), (21, 0.197), (14, 0.195), (18, 0.195)]
---------------------------------------------------------------

Gender : 2
[(28, 0.146), (42, 0.145), (33, 0.14), (37, 0.134), (38, 0.116)]
Status : 2
[(65, 0.154), (64, 0.148), (42, 0.145), (67, 0.135), (37, 0.134)]
---------------------------------------------------------------

Co-worker Network: Degree Centrality: All
[(23, 0.643), (25, 0.614), (14, 0.529), (18, 0.529), (21, 0.514)]


In the co-worker network, the top 5 males by eigenvector centrality are 23, 25, 21, 14, and 18.  Two of these (23 and 25) were also in the top 5 of both the social and advice networks.  All 5 are partners in the law firm.  The top 5 females by eigenvector centrality are 28, 42, 33, 37, and 38.  Three of these (28, 37, 38) were also in the top 5 of both the social and advice networks.  Again, the values for the top females are lower than those of the top 5 males.  

Looking at degree centrality for all nodes, we see some familiar individuals.  The top 5 nodes are the same five as the top 5 males by eigenvector centrality, though in a different order (node 21 drops from third to fifth).
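The overlap statements above can be checked mechanically with set operations. The node ids below are copied from the printed top-5 lists in this notebook; only the variable names are introduced for this sketch.

```python
# Top-5 node ids by eigenvector centrality among females (gender code 2),
# copied from the printed output of each network.
top_female = {
    "social": [26, 42, 37, 28, 38],
    "advice": [26, 33, 38, 28, 37],
    "work":   [28, 42, 33, 37, 38],
}
# Nodes that appear in the female top 5 of every network.
in_all = set.intersection(*(set(ids) for ids in top_female.values()))

# Co-worker network: top-5 males by eigenvector centrality vs. top-5 overall by degree.
work_eig_male = [23, 25, 21, 14, 18]
work_deg_all = [23, 25, 14, 18, 21]
same_nodes = set(work_eig_male) == set(work_deg_all)
same_order = work_eig_male == work_deg_all

print(sorted(in_all), same_nodes, same_order)  # [28, 37, 38] True False
```

Comparing sets rather than lists separates two questions: whether the same individuals appear, and whether they appear in the same rank order.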

#### Create dataframes 

In [68]:
work_df = pd.DataFrame(np.array([list(work_eig_centrality.values()),
                                 list(work_deg_centrality.values()),
                                 list(work_deg_centrality.keys())]).T,
                          columns = ['eig_centrality', 'deg_centrality','id'])

work_df['gender'] = work_df['id'].map(gender_dict)
work_df['status'] = work_df['id'].map(status_dict)

#### Scatter Plot of Co-Worker Network Eigenvector Centrality by Gender 

In [112]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=work_df[work_df['gender']==1]['deg_centrality'],
    y=work_df[work_df['gender']==1]['eig_centrality'],
    text=work_df[work_df['gender']==1]['id'],
    hoverinfo='text',
    mode='markers+text',
    name='Male',
    textfont_size=8,
    textfont_color='black',
    marker=dict(size=14, 
                color='lightsalmon',
                symbol='circle')))

fig.add_trace(go.Scatter(
    x = work_df[work_df['gender']==2]['deg_centrality'],
    y = work_df[work_df['gender']==2]['eig_centrality'],
    text=work_df[work_df['gender']==2]['id'],
    mode='markers+text',
    name = 'Female',
    textfont_size=8,
    textfont_color='white',
    marker=dict(size=14, 
                color='purple', 
                symbol='circle')))
    

fig.update_layout(go.Layout(
    title='Work Network - Centrality by Gender',
    xaxis=dict(
        title='Degree Centrality'
    ),
    yaxis=dict(
        title='Eigenvector Centrality'
    ),
    hovermode='closest',
    plot_bgcolor='white',
    paper_bgcolor='white',
))
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.show()


#### Get Mean and Variance for Co-Worker Network Eigenvector Centrality by Gender

In [71]:
print('Work Network Eigenvector Centrality Statistics')
print('Male mean: {}'.format(round(np.mean(work_df[work_df['gender']==1]['eig_centrality']),3)))
print('Female mean: {}'.format(round(np.mean(work_df[work_df['gender']==2]['eig_centrality']),3)))
print('Male variance: {}'.format(round(np.var(work_df[work_df['gender']==1]['eig_centrality']),4)))
print('Female variance: {}'.format(round(np.var(work_df[work_df['gender']==2]['eig_centrality']),4)))

Work Network Eigenvector Centrality Statistics
Male mean: 0.116
Female mean: 0.092
Male variance: 0.0021
Female variance: 0.0011


### T-Test for Differences in Co-Worker Eigenvector Centrality - Gender  

Test whether there is a statistically significant difference between mean co-worker eigenvector centralities for males and females using an unpaired t-test that does not assume equal variances (Welch's t-test, `equal_var = False`).

In [72]:
work_ttest = stats.ttest_ind(work_df[work_df['gender']==1]['eig_centrality'],
                               work_df[work_df['gender']==2]['eig_centrality'], 
                               equal_var = False)
print('p-value for t-test comparing mean work eigenvector centralities for males and females: {}'.format(round(work_ttest[1],3)))

p-value for t-test comparing mean work eigenvector centralities for males and females: 0.024


There is a statistically significant difference between mean co-worker eigenvector centralities for males and females (p-value = 0.024). 
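For a side-by-side view, the reported means and p-values for the three networks can be gathered into one small summary. The numbers are copied from the outputs above. The 0.05 threshold is an assumption, since the notebook does not state a significance level, and the variable names are hypothetical.

```python
# Values copied from the printed outputs for each network.
reported = {
    "social":    {"male_mean": 0.101, "female_mean": 0.079, "p_value": 0.251},
    "advice":    {"male_mean": 0.116, "female_mean": 0.089, "p_value": 0.024},
    "co-worker": {"male_mean": 0.116, "female_mean": 0.092, "p_value": 0.024},
}
ALPHA = 0.05  # assumed significance level, not stated in the notebook

summary = {
    name: {
        "diff": round(r["male_mean"] - r["female_mean"], 3),
        "significant": r["p_value"] < ALPHA,
    }
    for name, r in reported.items()
}
for name, row in summary.items():
    print(f"{name:10s} diff={row['diff']:.3f} significant={row['significant']}")
```

Under that assumed threshold, the male-female gap in mean eigenvector centrality is significant for the advice and co-worker networks but not for the social network, even though the raw differences are of similar size.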