## Data 620 Project 1
### Centrality Measures Across Categorical Groups

Jit Seneviratne and Sheryl Piechocki
June 17, 2020

**Dataset** 
The data used in this project is from SG&R law firm, a U.S. corporate law firm in New England. It contains network measurements for 71 attorneys from 1988-1991. There are three types of relationships measured, i.e. co-worker, advice, and social. Attorneys are classified as partners or associates and the data contains gender classification.

**Co-worker Network** 
Attorneys were asked to go through a list of names and mark off those that they had worked with in the last year, either working on the same case or reading/using some of each other's work product.

**Advice Network** 
Attorneys were asked to go through a list of names and mark off those that they had gone to for professional advice in the last year.

**Social Network**
Attorneys were asked to go through a list of names and mark off those that they had socialized with outside of work.

Source: http://moreno.ss.uci.edu/data.html#lazega

**Analysis** 
The network data file contains three 71 x 71 matrices, one for each network type. A separate file contains a 71 x 7 matrix of attorney attributes, including gender and status (partner or associate). Both files are loaded for analysis.
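As a rough illustration of this layout, the sketch below splits a vertical stack of square blocks into separate matrices. The helper name `split_stacked` and the toy 2 x 2 data are assumptions made for illustration; the real file has n = 71 and is handled with pandas slicing later in the notebook.

```python
def split_stacked(rows, n):
    """Split a list of k*n rows into k consecutive blocks of n rows each."""
    if len(rows) % n != 0:
        raise ValueError("row count must be a multiple of n")
    return [rows[start:start + n] for start in range(0, len(rows), n)]


# Toy stand-in for the real file: three 2x2 matrices stacked vertically.
stacked = [
    [0, 1], [1, 0],   # block 0
    [0, 0], [1, 0],   # block 1
    [0, 1], [0, 0],   # block 2
]
blocks = split_stacked(stacked, n=2)
print(len(blocks))   # 3
print(blocks[1])     # [[0, 0], [1, 0]]
```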

Degree centrality and eigenvector centrality will be calculated for the nodes in each of the network types. These centrality measures will then be compared by gender. Hypothesis testing will be performed to determine if there are statistically significant differences in centrality measures by gender.


### Degree Centrality vs Eigenvector Centrality

Degree centrality is the number of edges attached to a node, normalized by the maximum possible degree. Nodes with a higher degree are treated as more important to the network. Eigenvector centrality instead assigns each node a score that depends on the scores of its neighbors, so connections to well-connected nodes are valued higher than connections to poorly connected ones.
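For reference, the usual textbook definitions can be written as follows. The notation is ours rather than the dataset's: $A$ is the adjacency matrix, $n$ the number of nodes, $\deg(v)$ the degree of node $v$, and $\lambda$ the largest eigenvalue of $A$.

$$
C_D(v) = \frac{\deg(v)}{n - 1}, \qquad x_v = \frac{1}{\lambda} \sum_{u} A_{vu}\, x_u
$$

The second equation is what makes the measure recursive: the score $x_v$ of a node is proportional to the sum of its neighbors' scores.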

#### Why is eigenvector centrality important?

It creates a level playing field between those who have many connections and those who have meaningful connections. For example, having five connections, each of whom has ten connections, could be more valuable than having ten connections, each of whom has just three.
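This can be checked on a toy graph. The sketch below approximates eigenvector centrality by simple power iteration in plain Python; it is meant as an illustration of the idea, not as the routine networkx uses. The graph, node names, and helper functions are all invented for this example.

```python
import math


def eigenvector_centrality(adj, iterations=500):
    """Approximate eigenvector centrality by power iteration.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    scores = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Each node's new score is the sum of its neighbours' current scores.
        updated = {node: sum(scores[nb] for nb in adj[node]) for node in adj}
        norm = math.sqrt(sum(value * value for value in updated.values()))
        scores = {node: value / norm for node, value in updated.items()}
    return scores


def build_graph(edges):
    """Build an undirected adjacency mapping from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Toy graph: c1..c4 form a tightly knit core.
# A has 2 ties, both to core members; B has 4 ties, 3 of them to peripheral nodes.
edges = [
    ("c1", "c2"), ("c1", "c3"), ("c1", "c4"),
    ("c2", "c3"), ("c2", "c4"), ("c3", "c4"),
    ("A", "c1"), ("A", "c2"),
    ("B", "c3"), ("B", "l1"), ("B", "l2"), ("B", "l3"),
]
adj = build_graph(edges)
n = len(adj)
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}
eig = eigenvector_centrality(adj)

print(degree_centrality["A"] < degree_centrality["B"])  # True
print(eig["A"] > eig["B"])                               # True
```

Node B has twice as many ties as node A, yet A scores higher on eigenvector centrality because both of its ties lead into the well-connected core.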

#### Our approach

For clarity, we will look at eigenvector centrality across gender and role (partner or associate). For simplicity, this analysis uses gender as opposed to sex, even though in reality sex would be the more accurate term and would incorporate more identities than the gender binary.



In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode, plot, iplot
from scipy import stats

# json and seaborn were imported here but never used; the seaborn import
# failed with ModuleNotFoundError, so both have been dropped.
init_notebook_mode(connected=True)  # render plotly figures inline

### Read in data and map attributes to individuals

In [None]:
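# Skip the 7 lines that precede the 'DATA:' marker; that marker becomes the column name
df_rel = pd.read_table("http://moreno.ss.uci.edu/lazega.dat",
                   skiprows=7) # Get relationship data
df_map = pd.read_table("http://moreno.ss.uci.edu/lazatt.dat")
df_cols = df_map[3:10] # dataframe with column labels
df_vals = df_map[11:].copy() # dataframe with column mapping values (copy avoids SettingWithCopyWarning)
# Each row is a single whitespace-separated string; split it into one column per value
df_rel['DATA:'] = df_rel['DATA:'].apply(lambda x: x.split())
df_rel = df_rel['DATA:'].apply(pd.Series)
df_vals['DL'] = df_vals['DL'].apply(lambda x: x.split())
df_vals = df_vals['DL'].apply(pd.Series)
# First column is the attorney id; the rest take their names from the label rows
df_vals.columns = ['INDEX'] + df_cols['DL'].tolist() 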

In [None]:
df_vals.head()

### EDA on Columns of Interest

In [None]:
for col in ['STATUS','GENDER']:
    print(df_vals[col].value_counts(normalize=True))

We can see that males outnumber females roughly 3 to 1. This imbalance could have an effect on the network measures.

### Build mapping dictionaries for gender and status

In [None]:
gender_dict = {int(k)-1:int(v) for k,v in zip(df_vals['INDEX'],df_vals['GENDER'])}
status_dict = {int(k)-1:int(v) for k,v in zip(df_vals['INDEX'],df_vals['STATUS'])}
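The `- 1` in the dictionary keys accounts for an indexing mismatch: the code treats the ids in the attribute table as starting at 1, while the rows of the adjacency DataFrame (and hence the graph nodes) start at 0. A tiny illustration with made-up records, not taken from the dataset:

```python
# Hypothetical attribute records: (1-based id as text, gender code as text).
records = [("1", "1"), ("2", "2"), ("3", "1")]

# Shift ids to 0-based so they line up with DataFrame row positions.
toy_gender = {int(idx) - 1: int(code) for idx, code in records}
print(toy_gender)  # {0: 1, 1: 2, 2: 1}
```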

### Look at Social Network

#### Build social network graph

In [None]:
df_social = df_rel[71:142].astype(int).reset_index(drop=True)

G_social=nx.from_pandas_adjacency(df_social)
for node in list(G_social.nodes):
    G_social.nodes[node]['gender'] = gender_dict[node]
    G_social.nodes[node]['status'] = status_dict[node]

#### Get centrality
Degree centrality and eigenvector centrality are calculated for each node.

In [None]:
social_eig_centrality = nx.eigenvector_centrality(G_social, max_iter=1000)
social_deg_centrality = nx.degree_centrality(G_social)

In [None]:
social_df = pd.DataFrame(np.array([list(social_eig_centrality.values()),
                                   list(social_deg_centrality.values()),
                                   list(social_deg_centrality.keys())]).T,
                          columns = ['eig_centrality', 'deg_centrality','id'])

social_df['gender'] = social_df['id'].map(gender_dict)
social_df['status'] = social_df['id'].map(status_dict)

print('Eigenvector Centrality')
print('----------------------')
for filter_ in ['gender','status']:
    for value in [1,2]:
        print(filter_.title(),value)
        print(social_df[social_df[filter_]==value].sort_values(by=['eig_centrality',
                                                             'deg_centrality'],
                                                         ascending=False).head(5))
        print('\n')
print('Degree Centrality - All')
print('----------------------')
print(social_df.sort_values(by=['deg_centrality',
                               'eig_centrality'],
                            ascending=False).head(5))
print('\n')

In the social network, the top 5 males with the highest eigenvector centrality are 23, 25, 12, 16, and 3.  Four of these top 5 have status as a partner in the law firm.  The top 5 females with the highest eigenvector centrality are 26, 42, 37, 28, and 38.  Note that the values for the top females are lower than those of the top 5 males, which could be informed by the disparity in the number of males vs females in the firm.

Looking at degree centrality for all nodes, we see some familiar individuals.  Nodes 16, 23, 12, and 25 were also among the top 5 males by eigenvector centrality.  Node 30 has the highest degree centrality, but is not among the top nodes for eigenvector centrality.  The visualization below shows that node 30 is also male.

#### Visualize Social Network
The visualization of the social network by gender is below.  Two individuals are not connected to the rest of the network, one male and one female.  There are also more males than females, and judging by node size, males appear to have higher degrees than females.

##### Social Network - by Gender

Orange is male, purple is female. Size indicates degree.

![title](graphics/social_gender_a.png)

The visualization of the social network by status is below.  The two individuals that are not connected to the rest of the network, one male and one female, are both associates.  Judging by node size, partners appear to have higher degrees than associates.

##### Social Network - by Status

Brown is partner, green is associate. Size indicates degree.

![title](graphics/social_status_a.png)

#### Scatter Plot of Social Network Eigenvector Centrality by Gender 

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=social_df[social_df['gender']==1]['deg_centrality'],
    y=social_df[social_df['gender']==1]['eig_centrality'],
    text=social_df[social_df['gender']==1]['id'],
    hoverinfo='text',
    mode='markers+text',
    name='Male',
    textfont_size=8,
    textfont_color='black',
    marker=dict(size=14, 
                color='lightsalmon',
                symbol='circle')))

fig.add_trace(go.Scatter(
    x = social_df[social_df['gender']==2]['deg_centrality'],
    y = social_df[social_df['gender']==2]['eig_centrality'],
    text=social_df[social_df['gender']==2]['id'],
    mode='markers+text',
    name = 'Female',
    textfont_size=8,
    textfont_color='white',
    marker=dict(size=14, 
                color='purple', 
                symbol='circle')))
    

fig.update_layout(go.Layout(
    title='Social Network - Centrality by Gender',
    xaxis=dict(
        title='Degree Centrality'
    ),
    yaxis=dict(
        title='Eigenvector Centrality'
    ),
    hovermode='closest',
    plot_bgcolor='white',
    paper_bgcolor='white',
))
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='grey')

Pearson's correlation coefficient between the two centrality measures (computed here for male attorneys only):

In [None]:
print("Correlation:", np.corrcoef(social_df[social_df['gender']==1]['deg_centrality'],
                                  social_df[social_df['gender']==1]['eig_centrality'])[0][1])

We see a strong correlation between degree centrality and eigenvector centrality, meaning that influential nodes are generally connected to other influential nodes. The upper right quadrant of the scatter plot features mostly influential male partners.
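For readers less familiar with the statistic: `np.corrcoef` returns a correlation matrix, and its `[0][1]` entry is the Pearson coefficient between the two inputs. A dependency-free sketch of the same quantity, using made-up numbers rather than the notebook's data, might look like this:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up centrality values, for illustration only.
deg = [0.1, 0.2, 0.3, 0.4]
eig = [0.05, 0.10, 0.15, 0.20]   # exactly proportional to deg
print(round(pearson_r(deg, eig), 6))  # 1.0
```

A value near 1 means nodes with many ties also tend to have well-connected neighbors; a value near 0 would mean the two measures rank nodes very differently.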

#### Get Mean and Variance for Social Network Eigenvector Centrality by Gender

In [None]:
print('Social Network Eigenvector Centrality Statistics')
print('Male mean: {}'.format(round(np.mean(social_df[social_df['gender']==1]['eig_centrality']),3)))
print('Female mean: {}'.format(round(np.mean(social_df[social_df['gender']==2]['eig_centrality']),3)))
print('Male variance: {}'.format(round(np.var(social_df[social_df['gender']==1]['eig_centrality']),4)))
print('Female variance: {}'.format(round(np.var(social_df[social_df['gender']==2]['eig_centrality']),4)))
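One detail worth flagging: `np.var` defaults to `ddof=0`, i.e. the population variance, whereas the t-test used later in this notebook is based on the sample variance (`ddof=1`). For small groups the two can differ noticeably. A small illustration with the standard library and made-up values:

```python
import statistics

values = [0.02, 0.05, 0.11, 0.08]  # made-up centrality scores

population_var = statistics.pvariance(values)  # divides by n, like np.var(x)
sample_var = statistics.variance(values)       # divides by n - 1, like np.var(x, ddof=1)

print(population_var < sample_var)  # True
```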

#### T-Test for Differences in Social Network Eigenvector Centrality - Gender  

Test if there is a statistically significant difference between mean social network eigenvector centralities for males and females using an unpaired t-test. 
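Because the code below passes `equal_var = False`, the test is Welch's variant, which does not assume equal variances in the two groups. In standard notation (ours, not the dataset's), with group means $\bar{x}_m, \bar{x}_f$, sample variances $s_m^2, s_f^2$, and group sizes $n_m, n_f$, the statistic and its approximate degrees of freedom are:

$$
t = \frac{\bar{x}_m - \bar{x}_f}{\sqrt{\dfrac{s_m^2}{n_m} + \dfrac{s_f^2}{n_f}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_m^2}{n_m} + \dfrac{s_f^2}{n_f}\right)^2}{\dfrac{\left(s_m^2 / n_m\right)^2}{n_m - 1} + \dfrac{\left(s_f^2 / n_f\right)^2}{n_f - 1}}
$$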

In [None]:
social_ttest = stats.ttest_ind(social_df[social_df['gender']==1]['eig_centrality'],
                               social_df[social_df['gender']==2]['eig_centrality'], 
                               equal_var = False)
print('p-value for t-test comparing mean social eigenvector centralities for males and females: {}'.format(round(social_ttest[1],3)))

There is not a statistically significant difference between mean social eigenvector centralities for males and females (p-value = 0.251). 

In [None]:
print("Number of Cliques:",len(list(nx.find_cliques(G_social))))
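Note that `nx.find_cliques` enumerates maximal cliques (cliques that cannot be extended by adding another node), so the count printed above refers to those rather than to every complete subgraph. The plain-Python Bron-Kerbosch sketch below illustrates the idea on an invented toy graph; it is not the networkx implementation.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(clique)  # nothing can extend it: maximal
            return
        for node in list(candidates):
            expand(clique | {node}, candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adj), set())
    return found


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cliques = maximal_cliques(adj)
print(len(cliques))  # 2
```

The two maximal cliques are the triangle {a, b, c} and the edge {c, d}; smaller subsets such as {a, b} are cliques too, but they are not counted because they can be extended.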

### Look at Advice Network

#### Build advice network graph

In [None]:
df_advice = df_rel[0:71].astype(int).reset_index(drop=True)

G_advice=nx.from_pandas_adjacency(df_advice)
for node in list(G_advice.nodes):
    G_advice.nodes[node]['gender'] = gender_dict[node]
    G_advice.nodes[node]['status'] = status_dict[node]

#### Get centrality
Degree centrality and eigenvector centrality are calculated for each node.

In [None]:
advice_eig_centrality = nx.eigenvector_centrality(G_advice, max_iter=1000)
advice_deg_centrality = nx.degree_centrality(G_advice)

In [None]:
advice_df = pd.DataFrame(np.array([list(advice_eig_centrality.values()),
                                   list(advice_deg_centrality.values()),
                                   list(advice_deg_centrality.keys())]).T,
                          columns = ['eig_centrality', 'deg_centrality','id'])

advice_df['gender'] = advice_df['id'].map(gender_dict)
advice_df['status'] = advice_df['id'].map(status_dict)

print('Eigenvector Centrality')
print('----------------------')
for filter_ in ['gender','status']:
    for value in [1,2]:
        print(filter_.title(),value)
        print(advice_df[advice_df[filter_]==value].sort_values(by=['eig_centrality',
                                                             'deg_centrality'],
                                                         ascending=False).head(5))
        print('\n')
print('Degree Centrality - All')
print('----------------------')
print(advice_df.sort_values(by=['deg_centrality',
                               'eig_centrality'],
                            ascending=False).head(5))
print('\n')

In the advice network, the top 5 males with the highest eigenvector centrality are 25, 12, 23, 40, and 15.  A few of these males were also in the top of the social network (25, 12, 23).  Again, four of these top 5 have status as a partner in the law firm.  The top 5 females with the highest eigenvector centrality are 26, 33, 38, 28, and 37.  Four of these were also in the top of the social network.  Again, the values for the top females are lower than those of the top 5 males.  

Looking at degree centrality for all nodes, we see some familiar individuals.  The top 5 nodes are the exact same as the top 5 eigenvector centralities for males. It is also interesting to note that the male with id 40 is not a partner.

#### Visualize Advice Network
The visualization of the advice network by gender is below.  Again, it appears the male nodes have higher degrees than the females.  Node 25, a male, has the highest degree in the advice network.  Interestingly, node 30, which had the highest degree in the social network, is not very prominent in the advice network.

##### Advice Network - by Gender

Orange is male, purple is female. Size indicates degree

![title](graphics/advice_gender_a.png)

The visualization of the advice network by status is below.  As we would expect, in general, partners appear to have higher degrees in the advice network than associates.

##### Advice Network - by Status

Brown is partner, green is associate. Size indicates degree

![title](graphics/advice_status_a.png)

#### Scatter Plot of Advice Network Eigenvector Centrality by Gender 

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=advice_df[advice_df['gender']==1]['deg_centrality'],
    y=advice_df[advice_df['gender']==1]['eig_centrality'],
    text=advice_df[advice_df['gender']==1]['id'],
    hoverinfo='text',
    mode='markers+text',
    name='Male',
    textfont_size=8,
    textfont_color='black',
    marker=dict(size=14, 
                color='lightsalmon',
                symbol='circle')))

fig.add_trace(go.Scatter(
    x = advice_df[advice_df['gender']==2]['deg_centrality'],
    y = advice_df[advice_df['gender']==2]['eig_centrality'],
    text=advice_df[advice_df['gender']==2]['id'],
    mode='markers+text',
    name = 'Female',
    textfont_size=8,
    textfont_color='white',
    marker=dict(size=14, 
                color='purple', 
                symbol='circle')))
    

fig.update_layout(go.Layout(
    title='Advice Network - Centrality by Gender',
    xaxis=dict(
        title='Degree Centrality'
    ),
    yaxis=dict(
        title='Eigenvector Centrality'
    ),
    hovermode='closest',
    plot_bgcolor='white',
    paper_bgcolor='white',
))
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='grey')

Pearson's correlation coefficient between the two centrality measures (computed here for male attorneys only):

In [None]:
print("Correlation:", np.corrcoef(advice_df[advice_df['gender']==1]['deg_centrality'],
                                  advice_df[advice_df['gender']==1]['eig_centrality'])[0][1])

Eigenvector centrality and degree centrality show a stronger correlation here than in the social network. We can also see stronger separation between the top three nodes (23, 12 and 25) and the rest.

#### Get Mean and Variance for Advice Network Eigenvector Centrality by Gender

In [None]:
print('Advice Network Eigenvector Centrality Statistics')
print('Male mean: {}'.format(round(np.mean(advice_df[advice_df['gender']==1]['eig_centrality']),3)))
print('Female mean: {}'.format(round(np.mean(advice_df[advice_df['gender']==2]['eig_centrality']),3)))
print('Male variance: {}'.format(round(np.var(advice_df[advice_df['gender']==1]['eig_centrality']),4)))
print('Female variance: {}'.format(round(np.var(advice_df[advice_df['gender']==2]['eig_centrality']),4)))

#### T-Test for Differences in Advice Network Eigenvector Centrality - Gender  

Test if there is a statistically significant difference between mean advice network eigenvector centralities for males and females using an unpaired t-test. 

In [None]:
advice_ttest = stats.ttest_ind(advice_df[advice_df['gender']==1]['eig_centrality'],
                               advice_df[advice_df['gender']==2]['eig_centrality'], 
                               equal_var = False)
print('p-value for t-test comparing mean advice eigenvector centralities for males and females: {}'.format(round(advice_ttest[1],3)))

There is a statistically significant difference between mean advice eigenvector centralities for males and females (p-value = 0.024). 

### Look at Co-worker Network

#### Build co-worker network graph

In [None]:
df_work = df_rel[142:].astype(int).reset_index(drop=True)

G_work=nx.from_pandas_adjacency(df_work)
for node in list(G_work.nodes):
    G_work.nodes[node]['gender'] = gender_dict[node]
    G_work.nodes[node]['status'] = status_dict[node]

#### Get centrality

In [None]:
work_eig_centrality = nx.eigenvector_centrality(G_work, max_iter=1000)
work_deg_centrality = nx.degree_centrality(G_work)

In [None]:
work_df = pd.DataFrame(np.array([list(work_eig_centrality.values()),
                                 list(work_deg_centrality.values()),
                                 list(work_deg_centrality.keys())]).T,
                          columns = ['eig_centrality', 'deg_centrality','id'])

work_df['gender'] = work_df['id'].map(gender_dict)
work_df['status'] = work_df['id'].map(status_dict)

print('Eigenvector Centrality')
print('----------------------')
for filter_ in ['gender','status']:
    for value in [1,2]:
        print(filter_.title(),value)
        print(work_df[work_df[filter_]==value].sort_values(by=['eig_centrality',
                                                             'deg_centrality'],
                                                         ascending=False).head(5))
        print('\n')
print('Degree Centrality - All')
print('----------------------')
print(work_df.sort_values(by=['deg_centrality',
                               'eig_centrality'],
                            ascending=False).head(5))
print('\n')

In the co-worker network, the top 5 males with the highest eigenvector centrality are 23, 25, 21, 14, and 18.  A few of these males were also in the top of the social network and advice network (25, 23). All 5 of these have status as a partner in the law firm.  The top 5 females with the highest eigenvector centrality are 28, 42, 33, 37, and 38.  Three of these were also in the top of the social and advice networks (28, 37, 38).  Again, the values for the top females are lower than those of the top 5 males.  

Looking at degree centrality for all nodes, we see some familiar individuals.  The top 5 nodes are the exact same as the top 5 eigenvector centralities for males.

#### Visualize Co-worker Network
The visualization of the co-worker network by gender is below.  Again, it appears the male nodes have higher degrees than the females.  Nodes 23 and 25, both males, have high degrees in the co-worker network.  

##### Work Network - by Gender

Orange is male, purple is female. Size indicates degree

![title](graphics/work_gender_a.png)

The visualization of the co-worker network by status is below.  As we would expect, in general, partners appear to have higher degrees in the co-worker network than associates.

##### Work Network - by Status

Brown is partner, green is associate. Size indicates degree

![title](graphics/work_status_a.png)

#### Scatter Plot of Co-Worker Network Eigenvector Centrality by Gender 

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=work_df[work_df['gender']==1]['deg_centrality'],
    y=work_df[work_df['gender']==1]['eig_centrality'],
    text=work_df[work_df['gender']==1]['id'],
    hoverinfo='text',
    mode='markers+text',
    name='Male',
    textfont_size=8,
    textfont_color='black',
    marker=dict(size=14, 
                color='lightsalmon',
                symbol='circle')))

fig.add_trace(go.Scatter(
    x = work_df[work_df['gender']==2]['deg_centrality'],
    y = work_df[work_df['gender']==2]['eig_centrality'],
    text=work_df[work_df['gender']==2]['id'],
    mode='markers+text',
    name = 'Female',
    textfont_size=8,
    textfont_color='white',
    marker=dict(size=14, 
                color='purple', 
                symbol='circle')))
    

fig.update_layout(go.Layout(
    title='Work Network - Centrality by Gender',
    xaxis=dict(
        title='Degree Centrality'
    ),
    yaxis=dict(
        title='Eigenvector Centrality'
    ),
    hovermode='closest',
    plot_bgcolor='white',
    paper_bgcolor='white',
))
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='grey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='grey')


Pearson's correlation coefficient between the two centrality measures (computed here for male attorneys only):

In [None]:
print("Correlation:", np.corrcoef(work_df[work_df['gender']==1]['deg_centrality'],
                                  work_df[work_df['gender']==1]['eig_centrality'])[0][1])

Once again we see a stronger correlation than in the social network, and strong separation between the top nodes (25, 23) and the rest. It is interesting to note that node 12 ranks higher in the social and advice networks than in the work network. 

#### Get Mean and Variance for Co-Worker Network Eigenvector Centrality by Gender

In [None]:
print('Work Network Eigenvector Centrality Statistics')
print('Male mean: {}'.format(round(np.mean(work_df[work_df['gender']==1]['eig_centrality']),3)))
print('Female mean: {}'.format(round(np.mean(work_df[work_df['gender']==2]['eig_centrality']),3)))
print('Male variance: {}'.format(round(np.var(work_df[work_df['gender']==1]['eig_centrality']),4)))
print('Female variance: {}'.format(round(np.var(work_df[work_df['gender']==2]['eig_centrality']),4)))

#### T-Test for Differences in Co-Worker Eigenvector Centrality - Gender  

Test if there is a statistically significant difference between mean co-worker eigenvector centralities for males and females using an unpaired t-test. 

In [None]:
work_ttest = stats.ttest_ind(work_df[work_df['gender']==1]['eig_centrality'],
                             work_df[work_df['gender']==2]['eig_centrality'], 
                             equal_var = False)
print('p-value for t-test comparing mean work eigenvector centralities for males and females: {}'.format(round(work_ttest[1],3)))

There is a statistically significant difference between mean co-worker eigenvector centralities for males and females (p-value = 0.024). 

### Conclusion:
    
#### Eigenvector Centrality vs. Degree Centrality

Generally, we see a strong positive correlation between eigenvector centrality and degree centrality, indicating that, irrespective of the network, lawyers tend to be connected to other lawyers who have similar reach. However, the correlation was weaker in the social network than in the other two, possibly because social interactions are a little more organic than advice and co-worker relationships.

#### Males vs Females

On average, males have stronger social, advice and co-worker networks, although the difference in mean eigenvector centrality was not statistically significant for the social network. The gap could also be informed by the fact that males outnumber females 3 to 1. Among males, ids 23 and 25 feature strongly in all three networks. Id 12 seems to be less prominent in the co-worker network than in the other networks, perhaps due to shorter tenure. Among females, ids 28, 37 and 38 were influential in all three networks.

#### Partners vs Associates

Partners showed up as more central than associates in all networks, with the exception of id 40, a male who is not a partner but showed up as prominent in the social and advice networks. Id 64 also showed up as central in the social network.



In [None]:
# nx.write_gml(G_work, 'G_work.gml')
# nx.write_gml(G_social, 'G_social.gml')
# nx.write_gml(G_advice, 'G_advice.gml')