## Network Analysis of Shakespeare and Company and Goodreads

This notebooks performs a network analysis of readership in two networks: Shakespeare and Company borrowers and another of Goodreads reviewers. In both networks, books are nodes and edges are created between nodes when the same person reads both of them (more readers reading the same two books = higher edge weight). The Goodreads dataset has been winnowed to include only books that were in the Shakespeare and Co library. 

### Quick data preparation:

In [1]:
# import functions from graph.py
from graph import get_goodreads_graph, get_sc_graph
from core_periphery_sbm import core_periphery as cp

import networkx as nx

from collections import Counter
from operator import itemgetter
import pandas as pd

In [2]:
# get vertex lists, edge weights, vertex to neighbors, and number of nodes
sc_books_in_vertex_order, sc_book_to_vertex_index, sc_edge_to_weight, sc_vertex_to_neighbors, sc_n = get_sc_graph()
gr_books_in_vertex_order, gr_book_to_vertex_index, gr_edge_to_weight, gr_vertex_to_neighbors, gr_n = get_goodreads_graph()

In [3]:
# core-periphery code is simplest when in networkx graph format
# following code converts from tuple structure into a list, which will then be added by nodes/edges to graph

def tuple_to_list(edge_to_weight):
    fill = []
    for i in edge_to_weight.items():
        l = list(i)
        fill.append(l)
    edge_to_weight_list = []
    for i in fill:
        edges = i[0]
        listed_edge = list(edges)
        listed_edge.append(i[1])
        edge_to_weight_list.append(listed_edge)
    return edge_to_weight_list

In [4]:
sc_weights_list = tuple_to_list(sc_edge_to_weight)
gr_weights_list = tuple_to_list(gr_edge_to_weight)

In [5]:
# from edge [0] to edge [1], the weight is [2]
sc_weights_list[0]

[860, 3, 4]

### Shakespeare and Co Analysis

In [6]:
# Create SHAKESPEARE AND CO graph
sc_G = nx.Graph()
sc_G.add_weighted_edges_from(sc_weights_list)

Next chunk is optional: I do this only to make the results more interpretable. 

In [7]:
# optional: change vertex ids to book names
sc_dict = {value:key for key, value in sc_book_to_vertex_index.items()}
mapping = sc_dict # Dictionary from id to title
sc_G = nx.relabel_nodes(sc_G, mapping)

### Core-Periphery Structure

This section draws from Gallagher et al.'s ["A clarified typology of core-periphery structure in networks"](https://advances.sciencemag.org/content/7/12/eabc9800), with code available [here](https://github.com/ryanjgallagher/core_periphery_sbm). Core-periphery structure studies how networks can be divided into a core of densely interconnected nodes, which are highly connected to other core nodes, and periphery nodes that are connected only to core nodes and not each other. This paper develops/assesses two core-periphery model types: the hub-and-spoke model that divides the network into two clean blocks (core vs. periphery) and a layered model that allows for layers of periphery-ness. 

In [9]:
# Initialize hub-and-spoke model and infer structure
hubspoke = cp.HubSpokeCorePeriphery(n_gibbs=100, n_mcmc=10*len(sc_G))
hubspoke.infer(sc_G)

In [10]:
layered = cp.LayeredCorePeriphery(n_layers=3, n_gibbs=100, n_mcmc=10*len(sc_G))
layered.infer(sc_G)

In [11]:
# Get core and periphery assignments from hub-and-spoke model
node2label_hs = hubspoke.get_labels(last_n_samples=50)

# Get layer assignments from the layered model
node2label_l = layered.get_labels(last_n_samples=50)

**Create dataframes**: the goal is to get this into a nice csv with columns for book title, author, h&s cp label, layered cp label, coreness, and probability.

In [12]:
sc_hub_spoke_df = pd.DataFrame.from_dict(node2label_hs, orient='index')
sc_layered_df = pd.DataFrame.from_dict(node2label_l, orient='index')

**Core-Periphery Label:** For both models (hub-and-spoke and layered) the core is 0; for the layered model, the further away from 0 the more peripheral the node. 

In [14]:
# Number of nodes in periphery vs. core in hub-and-spoke
Counter(node2label_hs.values())

Counter({1: 767, 0: 476})

In [15]:
# Number of nodes in each layer between periphery to core
Counter(node2label_l.values())

Counter({2: 700, 0: 309, 1: 234})

**All core books:**

In [42]:
for book, label in node2label_l.items():  # for name, age in dictionary.iteritems():  
    if label == 0:
        print(book)

1919 by Dos Passos, John (1932)
A Farewell to Arms by Hemingway, Ernest (1929)
A Handful of Dust by Waugh, Evelyn (1934)
A High Wind in Jamaica by Hughes, Richard (1929)
A Note in Music by Lehmann, Rosamond (1930)
A Passage to India by Forster, E. M. (1924)
A Room of One's Own by Woolf, Virginia (1929)
A Room with a View by Forster, E. M.
A Tale of a Tub by Swift, Jonathan (1704)
After Many a Summer by Huxley, Aldous (1939)
Agnes Grey by Brontë, Anne (1847)
Alice Adams by Tarkington, Booth (1921)
All Passion Spent by Sackville-West, Vita (1931)
All This and Heaven Too by Field, Rachel (1938)
Antic Hay by Huxley, Aldous (1923)
Appointment in Samarra by O'Hara, John (1934)
Arrowsmith by Lewis, Sinclair (1925)
Autobiographies by Yeats, William Butler (1926)
Autobiography by Powys, John Cowper (1934)
Axel's Castle: A Study in the Imaginative Literature of 1870 – 1930 by Wilson, Edmund (1931)
Babbitt by Lewis, Sinclair (1922)
Back Street by Hurst, Fannie (1931)
Banbury Bog by Taylor, Phoebe

**Probability** that a node is in the core vs. periphery.

In [59]:
#  Dictionary of node -> ordered array of probabilities
node2probs_l = hubspoke.get_labels(last_n_samples=50, prob=True, return_dict=True)

In [60]:
sc_layered_probs_df = pd.DataFrame.from_dict(node2probs_l, orient='index')

**Coreness**, where the closer to 1 the more core; closer to 0 the more peripheral.

In [19]:
# Dictionary of node -> coreness
node2coreness_hs = hubspoke.get_coreness(last_n_samples=50, return_dict=True)
node2coreness_l = layered.get_coreness(last_n_samples=50, return_dict=True)

In [21]:
sc_layered_coreness_df = pd.DataFrame.from_dict(node2coreness_l, orient='index')

In [22]:
sc_layered_coreness_df

Unnamed: 0,0
"1914 and Other Poems by Brooke, Rupert (1915)",0.02
"1919 by Dos Passos, John (1932)",0.91
365 Days (1936),0.28
"A Backward Glance by Wharton, Edith (1934)",0.44
"A Book by Barnes, Djuna (1923)",0.00
...,...
Zola,0.00
"Zola and His Time by Josephson, Matthew (1928)",0.01
"Zuleika Dobson by Beerbohm, Max (1911)",0.51
[unknown],0.02


In [23]:
# all books that are especially "corey" in the layered model
most_core = []
for book, coreness in node2coreness_l.items():  # for name, age in dictionary.iteritems():  
    if coreness > .93:
        most_core.append(book)

In [24]:
most_core

['A Farewell to Arms by Hemingway, Ernest (1929)',
 'Alice Adams by Tarkington, Booth (1921)',
 'All This and Heaven Too by Field, Rachel (1938)',
 'Antic Hay by Huxley, Aldous (1923)',
 'Arrowsmith by Lewis, Sinclair (1925)',
 'Babbitt by Lewis, Sinclair (1922)',
 'Back Street by Hurst, Fannie (1931)',
 'Buddenbrooks by Mann, Thomas (1924)',
 'Celibate Lives by Moore, George (1927)',
 'Crewe Train by Macaulay, Rose (1926)',
 'Dusty Answer by Lehmann, Rosamond (1927)',
 'Eyeless in Gaza by Huxley, Aldous (1936)',
 "Jacob's Room by Woolf, Virginia (1922)",
 'Jew Süss by Feuchtwanger, Lion (1925)',
 'Manhattan Transfer by Dos Passos, John (1925)',
 'Mr. Norris Changes Trains by Isherwood, Christopher (1935)',
 'Mrs. Dalloway by Woolf, Virginia (1925)',
 'Oscar Wilde',
 'Pity Is Not Enough by Herbst, Josephine (1933)',
 'So Red the Rose by Young, Stark (1934)',
 'Some Do Not... by Ford, Ford Madox (1924)',
 'Sparkenbroke by Morgan, Charles (1936)',
 'The Apes of God by Lewis, Wyndham (193

## Save to csv

In [62]:
sc_layered_df['book_info'] = sc_layered_df.index
sc_layered_df = sc_layered_df.rename(columns={0: "layer"})

In [63]:
sc_hub_spoke_df['book_info'] = sc_hub_spoke_df.index
sc_hub_spoke_df = sc_hub_spoke_df.rename(columns={0: "hub_and_spoke"})

In [64]:
sc_layered_coreness_df['book_info'] = sc_layered_coreness_df.index
sc_layered_coreness_df = sc_layered_coreness_df.rename(columns={0: "coreness"})

In [67]:
full_sc_df = sc_layered_df.merge(sc_hub_spoke_df, how='inner', on='book_info').merge(sc_layered_coreness_df, how='inner', on='book_info')

In [68]:
full_sc_df = full_sc_df[["book_info", "hub_and_spoke", "layer", "coreness"]]

In [69]:
full_sc_df

Unnamed: 0,book_info,hub_and_spoke,layer,coreness
0,"1914 and Other Poems by Brooke, Rupert (1915)",1,2,0.02
1,"1919 by Dos Passos, John (1932)",0,0,0.91
2,365 Days (1936),1,1,0.28
3,"A Backward Glance by Wharton, Edith (1934)",1,1,0.44
4,"A Book by Barnes, Djuna (1923)",1,2,0.00
...,...,...,...,...
1238,Zola,1,2,0.00
1239,"Zola and His Time by Josephson, Matthew (1928)",1,2,0.01
1240,"Zuleika Dobson by Beerbohm, Max (1911)",0,1,0.51
1241,[unknown],1,2,0.02


In [70]:
# vertex
vertex_title = pd.DataFrame.from_dict(sc_dict, orient = "index")

In [72]:
vertex_title["vertex"] = vertex_title.index

In [73]:
vertex_title = vertex_title.rename(columns={0: "book_info"})

In [74]:
full_sc_df = full_sc_df.merge(vertex_title, how='inner', on='book_info')

In [75]:
full_sc_df = full_sc_df[["vertex","book_info", "layer", "hub_and_spoke", "coreness"]]

In [155]:
full_sc_df.head(10)

Unnamed: 0,vertex,book_info,layer,hub_and_spoke,coreness,degree_centrality,between_centrality,eigenvector_centrality
0,394,"1914 and Other Poems by Brooke, Rupert (1915)",2,1,0.02,8,0.0,0.000368
1,685,"1919 by Dos Passos, John (1932)",0,0,0.91,493,0.001969,0.048415
2,607,365 Days (1936),1,1,0.28,154,5.3e-05,0.016411
3,1215,"A Backward Glance by Wharton, Edith (1934)",1,1,0.44,208,0.000465,0.018054
4,1076,"A Book by Barnes, Djuna (1923)",2,1,0.0,39,4.8e-05,0.00281
5,704,"A Book of Nonsense by Lear, Edward (1846)",2,1,0.0,8,0.0,0.000521
6,207,"A Child's Garden of Verses by Stevenson, Rober...",2,1,0.0,97,0.0,0.009044
7,1152,"A Christmas Garland by Beerbohm, Max (1912)",2,1,0.01,75,5.3e-05,0.00671
8,393,"A City of Bells by Goudge, Elizabeth (1936)",1,0,0.53,275,0.000132,0.030492
9,61,A Connecticut Yankee in King Arthur's Court by...,2,1,0.01,10,0.0,0.000939


In [77]:
#full_sc_df.to_csv("shakespeare-co-core-periphery.csv")

# Other metrics

### Density
How connected is this graph? This finds the number of exissting edges divided by the number of total possible edges. 

In [50]:
sc_density = nx.density(sc_G)
print("Shakespeare and Co Network Density:", sc_density)

Shakespeare and Co Network Density: 0.16937620400490735


### Transitivity
How likely is it that if book A and book B are read together, and book B and book C are also read together, that books A and C are also connected by an edge? 

In [51]:
triadic_closure = nx.transitivity(sc_G)
print("Triadic closure for S&C:", triadic_closure)

Triadic closure for S&C: 0.6236327807463062


### Diameter length
Because this is not a connected graph, diameter length measures are slightly more complex. The below code finds the largest connected component of the graph, makes that a "subgraph" and then calculates the diameter of the largest connected component. 

In [52]:
# Get the largest connected component of the graph
components = nx.connected_components(sc_G)
largest_component = max(components, key=len)

# Create a "subgraph" of the largest component and find diameter
subgraph = sc_G.subgraph(largest_component)
diameter = nx.diameter(subgraph)
print("Network diameter of Shakespeare and Co's largest component:", diameter)

Network diameter of Shakespeare and Co's largest component: 5


### Centrality Measures
There are multiple ways to assess centrality in a network. Centrality measures usually try to capture something similar to significance or importance in a network--but there are different ways to understand importance. This code looks at the following:
- **degree centrality:** the sum of all of a node's edges. When considering S&C, a book with the highest number of degrees demonstrates that it was the book most often read with other books in the network. This is a measure of a type of popularity (*but remember -> this isn't the book checked out the most times, it's the book checked out the most times with any other book*).  
- **betweeness centrality:** betweenness centrality disregards node degree, and instead focuses on path length for determining the most important nodes. This looks at shortest paths to figure out which nodes connect otherwise disparate parts of the network. 
- **eigenvector centrality:** eigenvector centrality accounts for whether or not a node is connected to many other high-degree nodes--this would make it a hub, and also accounts for a central node that may not have the highest # of degrees, but is highly important, regardless

In [78]:
# degree centrality
sc_degree_dict = dict(sc_G.degree(sc_G.nodes()))
nx.set_node_attributes(sc_G, sc_degree_dict, 'degree')

sc_sorted_degree = sorted(sc_degree_dict.items(), key=itemgetter(1), reverse=True)
print("Top 100 nodes by degree in S&C:")
for d in sc_sorted_degree[:20]:
    print(d)

Top 100 nodes by degree in S&C:
('The Sun Also Rises by Hemingway, Ernest (1926)', 760)
('A Portrait of the Artist as a Young Man by Joyce, James (1916)', 734)
('A Passage to India by Forster, E. M. (1924)', 727)
('Dubliners by Joyce, James (1914)', 711)
('A Farewell to Arms by Hemingway, Ernest (1929)', 709)
('Sanctuary by Faulkner, William (1931)', 705)
('Eyeless in Gaza by Huxley, Aldous (1936)', 696)
('Pointed Roofs (Pilgrimage 1) by Richardson, Dorothy M. (1915)', 694)
('To the Lighthouse by Woolf, Virginia (1927)', 690)
("Jacob's Room by Woolf, Virginia (1922)", 686)
('Manhattan Transfer by Dos Passos, John (1925)', 685)
('Mr. Norris Changes Trains by Isherwood, Christopher (1935)', 684)
('New Writing', 684)
('The Garden Party and Other Stories by Mansfield, Katherine (1922)', 670)
('The Waves by Woolf, Virginia (1931)', 670)
("Axel's Castle: A Study in the Imaginative Literature of 1870 – 1930 by Wilson, Edmund (1931)", 670)
('Sparkenbroke by Morgan, Charles (1936)', 670)
('Mrs.

In [85]:
sc_degree_df = pd.DataFrame.from_dict(sc_degree_dict, orient='index')
sc_degree_df["book_info"] = sc_degree_df.index

In [87]:
sc_degree_df = sc_degree_df.rename(columns={0: "degree_centrality"})

In [79]:
# betweenness centrality
betweenness_dict = nx.betweenness_centrality(sc_G) # Run betweenness centrality
# Assign each to an attribute in your network
nx.set_node_attributes(sc_G, betweenness_dict, 'betweenness')

sorted_betweenness = sorted(betweenness_dict.items(), key=itemgetter(1), reverse=True)

print("Top 20 S&C nodes by betweenness centrality:")
for b in sorted_betweenness[:20]:
    print(b)

Top 20 S&C nodes by betweenness centrality:
('A Portrait of the Artist as a Young Man by Joyce, James (1916)', 0.0211823163565959)
('Dubliners by Joyce, James (1914)', 0.014421762191648936)
('Pointed Roofs (Pilgrimage 1) by Richardson, Dorothy M. (1915)', 0.01299321287359992)
('The Sun Also Rises by Hemingway, Ernest (1926)', 0.011548535324779827)
('Mr. Norris Changes Trains by Isherwood, Christopher (1935)', 0.009603896350449798)
('Exiles by Joyce, James (1918)', 0.009564421220576092)
('Moby-Dick; Or, the Whale by Melville, Herman (1851)', 0.009473956082605615)
('The Garden Party and Other Stories by Mansfield, Katherine (1922)', 0.008946568470141466)
('A Passage to India by Forster, E. M. (1924)', 0.008914960276224064)
('Manhattan Transfer by Dos Passos, John (1925)', 0.008622284488015441)
('Bliss and Other Stories by Mansfield, Katherine (1920)', 0.008391996154496885)
('New Writing', 0.008006549886583031)
('A Farewell to Arms by Hemingway, Ernest (1929)', 0.007916833409597613)
('Abs

In [None]:
sc_between_df = pd.DataFrame.from_dict(betweenness_dict, orient='index')
sc_between_df["book_info"] = sc_between_df.index

sc_between_df = sc_between_df.rename(columns={0: "between_centrality"})

In [80]:
# eigenvector centrality
eigenvector_dict = nx.eigenvector_centrality(sc_G) # Run eigenvector centrality
nx.set_node_attributes(sc_G, eigenvector_dict, 'eigenvector')

sorted_eigenvector = sorted(eigenvector_dict.items(), key=itemgetter(1), reverse=True)

print("Top 20 S&C nodes by eigenvector centrality:")
for b in sorted_eigenvector[:20]:
    print(b)

Top 20 S&C nodes by eigenvector centrality:
('A Passage to India by Forster, E. M. (1924)', 0.06129393107749066)
('The Sun Also Rises by Hemingway, Ernest (1926)', 0.061157170929938345)
('Eyeless in Gaza by Huxley, Aldous (1936)', 0.061058426522785245)
('A Farewell to Arms by Hemingway, Ernest (1929)', 0.06034363737272585)
('Sanctuary by Faulkner, William (1931)', 0.0602831778003923)
('Sparkenbroke by Morgan, Charles (1936)', 0.06026089984278902)
('To the Lighthouse by Woolf, Virginia (1927)', 0.060231445215858985)
("Axel's Castle: A Study in the Imaginative Literature of 1870 – 1930 by Wilson, Edmund (1931)", 0.060230281618190516)
('Mrs. Dalloway by Woolf, Virginia (1925)', 0.05998365750681798)
('The Waves by Woolf, Virginia (1931)', 0.05996296419039787)
("Jacob's Room by Woolf, Virginia (1922)", 0.0598260413364359)
('The Death of the Heart by Bowen, Elizabeth (1938)', 0.05968057791376069)
('The Rains Came by Bromfield, Louis (1937)', 0.05954315894900877)
('South Riding: An English La

In [90]:
sc_eigenvector_df = pd.DataFrame.from_dict(eigenvector_dict, orient='index')
sc_eigenvector_df["book_info"] = sc_eigenvector_df.index

In [91]:
sc_eigenvector_df = sc_eigenvector_df.rename(columns={0: "eigenvector_centrality"})

### Combine all

In [95]:
full_sc_df = full_sc_df.merge(sc_degree_df, how='inner', on='book_info').merge(sc_between_df, how='inner', on='book_info').merge(sc_eigenvector_df, how='inner', on='book_info')

In [157]:
full_sc_df.head(10)

Unnamed: 0,vertex,book_info,layer,hub_and_spoke,coreness,degree_centrality,between_centrality,eigenvector_centrality
0,394,"1914 and Other Poems by Brooke, Rupert (1915)",2,1,0.02,8,0.0,0.000368
1,685,"1919 by Dos Passos, John (1932)",0,0,0.91,493,0.001969,0.048415
2,607,365 Days (1936),1,1,0.28,154,5.3e-05,0.016411
3,1215,"A Backward Glance by Wharton, Edith (1934)",1,1,0.44,208,0.000465,0.018054
4,1076,"A Book by Barnes, Djuna (1923)",2,1,0.0,39,4.8e-05,0.00281
5,704,"A Book of Nonsense by Lear, Edward (1846)",2,1,0.0,8,0.0,0.000521
6,207,"A Child's Garden of Verses by Stevenson, Rober...",2,1,0.0,97,0.0,0.009044
7,1152,"A Christmas Garland by Beerbohm, Max (1912)",2,1,0.01,75,5.3e-05,0.00671
8,393,"A City of Bells by Goudge, Elizabeth (1936)",1,0,0.53,275,0.000132,0.030492
9,61,A Connecticut Yankee in King Arthur's Court by...,2,1,0.01,10,0.0,0.000939


In [97]:
full_sc_df.to_csv("shakespeare-co-core-periphery.csv")

## Model Selection
This is the section that needs continued work: how meaningful are either of these models? 


In [31]:
from core_periphery_sbm import model_fit as mf

# Get description length of hub-and-spoke model
inf_labels_hs = hubspoke.get_labels(last_n_samples=50, prob=False, return_dict=False)
mdl_hubspoke = mf.mdl_hubspoke(sc_G, inf_labels_hs, n_samples=100000)

# Get the description length of layered model
inf_labels_l = layered.get_labels(last_n_samples=50, prob=False, return_dict=False)
mdl_layered = mf.mdl_layered(sc_G, inf_labels_l, n_layers=3, n_samples=100000)

In [32]:
print("Description length of hub-and-spoke model: " + str(mdl_hubspoke))
print("Description length of layered model: " + str(mdl_layered))

Description length of hub-and-spoke model: 236981.57824596137
Description length of layered model: 217498.64027873726


So, the layered model is a better fit since it has a shorter description length. **BUT** this still lacks an assessment of the meaningfulness of the goodneses of fit. 

## Goodreads Core-Periphery Analysis
---> remember! this might not be the best way to compare the two networks since they represent such different types of readership w/ missing books in the Goodreads data

In [98]:
# Create SHAKESPEARE AND CO graph
gr_G = nx.Graph()
gr_G.add_weighted_edges_from(gr_weights_list)

In [99]:
# optional: change vertexes ids to book names
gr_dict = {value:key for key, value in sc_book_to_vertex_index.items()}
gr_mapping = gr_dict # Dictionary from id to title
gr_G = nx.relabel_nodes(gr_G, mapping)

### Core-Periphery Structure

In [100]:
# Initialize hub-and-spoke model and infer structure
gr_hubspoke = cp.HubSpokeCorePeriphery(n_gibbs=100, n_mcmc=10*len(gr_G))
gr_hubspoke.infer(gr_G)

In [101]:
gr_layered = cp.LayeredCorePeriphery(n_layers=3, n_gibbs=100, n_mcmc=10*len(gr_G))
gr_layered.infer(gr_G)

In [102]:
# Get core and periphery assignments from hub-and-spoke model
gr_node2label_hs = gr_hubspoke.get_labels(last_n_samples=50)

# Get layer assignments from the layered model
gr_node2label_l = gr_layered.get_labels(last_n_samples=50)

In [103]:
gr_hub_spoke_df = pd.DataFrame.from_dict(gr_node2label_hs, orient='index')

In [104]:
gr_layered_df = pd.DataFrame.from_dict(gr_node2label_l, orient='index')

**Core-Periphery Label:** For both models (hub-and-spoke and layered) the core is 0; for the layered model, the further away from 0 the more peripheral the node. 

In [105]:
# Number of nodes in periphery vs. core in hub-and-spoke
Counter(gr_node2label_hs.values())

Counter({1: 891, 0: 277})

In [106]:
# Number of nodes in each layer between periphery to core
Counter(gr_node2label_l.values())

Counter({2: 667, 1: 266, 0: 235})

In [107]:
for book, label in gr_node2label_hs.items():  # for name, age in dictionary.iteritems():  
    if label == 0:
        print(book)

A Fleet in Being by Kipling, Rudyard (1898)
A General History of the Robberies and Murders of the Most Notorious Pirates by Johnson, Captain Charles (1926)
A Hazard of New Fortunes by Howells, William Dean (1890)
A Journal of the Plague Year by Defoe, Daniel (1722)
A Little Tour in France by James, Henry (1900)
A Shropshire Lad by Housman, A. E. (1896)
A Story Teller's Story by Anderson, Sherwood (1924)
African Game Trails: An Account of the African Wanderings of an American Hunter-Naturalist by Roosevelt, Theodore (1910)
Aleck Maury, Sportsman by Gordon, Caroline (1934)
All Men Are Enemies by Aldington, Richard (1933)
All Passion Spent by Sackville-West, Vita (1931)
All Quiet on the Western Front by Remarque, Erich Maria (1929)
All Souls' Night by Walpole, Hugh (1933)
Almayer's Folly by Conrad, Joseph (1895)
Along the Road: Notes and Essays of a Tourist by Huxley, Aldous (1925)
An Introduction to Modern Philosophy by Joad, C. E. M. (Cyril Edwin Mitchinson) (1925)
An Unsocial Socialist

**Probability** that a node is in the core vs. periphery.

In [108]:
#  Dictionary of node -> ordered array of probabilities
gr_node2probs_hs = gr_hubspoke.get_labels(last_n_samples=50, prob=True, return_dict=True)

# n_nodes x n_layers array of probabilities
gr_inf_probs_l = gr_layered.get_labels(last_n_samples=50, prob=True, return_dict=False)

In [109]:
gr_hs_probs_df = pd.DataFrame.from_dict(gr_node2probs_hs, orient='index')

**Coreness**, where the closer to 1 the more core; closer to 0 the more peripheral.

In [111]:
# Dictionary of node -> coreness
gr_node2coreness_hs = gr_hubspoke.get_coreness(last_n_samples=50, return_dict=True)
gr_node2coreness_l = gr_layered.get_coreness(last_n_samples=50, return_dict=True)

In [122]:
gr_layered_coreness_df = pd.DataFrame.from_dict(gr_node2coreness_l, orient='index')

In [114]:
# all books that are especially "corey" in the layered model
gr_most_core = []
for book, coreness in gr_node2coreness_hs.items():  # for name, age in dictionary.iteritems():  
    if coreness == 1:
        gr_most_core.append(book)

## Save as CSV

In [116]:
gr_layered_df['book_info'] = gr_layered_df.index
gr_layered_df = gr_layered_df.rename(columns={0: "layer"})

In [117]:
gr_hub_spoke_df['book_info'] = gr_hub_spoke_df.index
gr_hub_spoke_df = gr_hub_spoke_df.rename(columns={0: "hub_and_spoke"})

In [126]:
gr_layered_coreness_df['book_info'] = gr_layered_coreness_df.index
gr_layered_coreness_df = gr_layered_coreness_df.rename(columns={0: "coreness"})

In [128]:
full_gr_df = gr_layered_df.merge(gr_hub_spoke_df, how='inner', on='book_info').merge(gr_layered_coreness_df, how='inner', on='book_info')

In [130]:
full_gr_df = full_gr_df[["book_info", "layer", "hub_and_spoke", "coreness"]]

In [131]:
# vertex
gr_vertex_title = pd.DataFrame.from_dict(gr_dict, orient = "index")

In [132]:
gr_vertex_title["vertex"] = gr_vertex_title.index

In [133]:
gr_vertex_title = gr_vertex_title.rename(columns={0: "book_info"})

In [135]:
full_gr_df = full_gr_df.merge(gr_vertex_title, how='inner', on='book_info')

In [137]:
full_gr_df = full_gr_df[["vertex", "book_info", "layer", "hub_and_spoke", "coreness"]]

In [120]:
#full_gr_df.to_csv("goodreads-core-periphery.csv")

## Other metrics

### Density

In [139]:
gr_density = nx.density(gr_G)
print("Goodreads Network Density:", gr_density)

Goodreads Network Density: 0.0626489300512965


### Transitivity


In [140]:
gr_triadic_closure = nx.transitivity(gr_G)
print("Triadic closure for Goodreads:", gr_triadic_closure)

Triadic closure for Goodreads: 0.4466288038513365


### Diameter length

In [141]:
# Get the largest connected component of the graph
components = nx.connected_components(gr_G)
largest_component = max(components, key=len)

# Create a "subgraph" of the largest component and find diameter
subgraph = gr_G.subgraph(largest_component)
diameter = nx.diameter(subgraph)
print("Network diameter of Goodread's largest component:", diameter)

Network diameter of Goodread's largest component: 5


### Centrality Measures

In [142]:
# degree centrality
gr_degree_dict = dict(gr_G.degree(gr_G.nodes()))
nx.set_node_attributes(gr_G, gr_degree_dict, 'degree')

gr_sorted_degree = sorted(gr_degree_dict.items(), key=itemgetter(1), reverse=True)
print("Top 100 nodes by degree in Goodreads:")
for d in gr_sorted_degree[:20]:
    print(d)

Top 100 nodes by degree in Goodreads:
('Montaigne', 533)
('Lolly Willowes by Warner, Sylvia Townsend (1926)', 507)
('The Unvanquished by Faulkner, William (1938)', 504)
("Ten North Frederick by O'Hara, John (1955)", 501)
('Goodbye to All That: An Autobiography by Graves, Robert (1929)', 484)
('Strait Is the Gate by Gide, André (1924)', 482)
("Strange Interlude by O'Neill, Eugene (1928)", 463)
('The Fortunes of Richard Mahony by Richardson, Henry Handel (1931)', 463)
('Mr. Bennett and Mrs. Brown by Woolf, Virginia (1924)', 446)
('Huntingtower by Buchan, John (1922)', 445)
('Grand Hotel by Baum, Vicki (1931)', 444)
('Saint Joan: A Chronicle Play in Six Scenes and an Epilogue by Shaw, George Bernard (1924)', 442)
('All Passion Spent by Sackville-West, Vita (1931)', 435)
('An Unsocial Socialist by Shaw, George Bernard (1917)', 429)
('Poems by Meredith, George', 422)
('Streets of Night by Dos Passos, John (1923)', 421)
('The Republic by Plato', 417)
('The Soldier and the Gentlewoman by Vaug

In [143]:
gr_degree_df = pd.DataFrame.from_dict(gr_degree_dict, orient='index')
gr_degree_df["book_info"] = gr_degree_df.index

In [144]:
gr_degree_df = gr_degree_df.rename(columns={0: "degree_centrality"})

In [146]:
# betweenness centrality
gr_betweenness_dict = nx.betweenness_centrality(gr_G) # Run betweenness centrality
# Assign each to an attribute in your network
nx.set_node_attributes(gr_G, betweenness_dict, 'betweenness')

gr_sorted_betweenness = sorted(gr_betweenness_dict.items(), key=itemgetter(1), reverse=True)

print("Top 20 GR nodes by betweenness centrality:")
for b in gr_sorted_betweenness[:20]:
    print(b)

Top 20 GR nodes by betweenness centrality:
('Montaigne', 0.044360882313723426)
("Ten North Frederick by O'Hara, John (1955)", 0.02653761956312943)
('Lolly Willowes by Warner, Sylvia Townsend (1926)', 0.025261467362823285)
('The Unvanquished by Faulkner, William (1938)', 0.024973792887256995)
('Goodbye to All That: An Autobiography by Graves, Robert (1929)', 0.024777621609717883)
('The Fortunes of Richard Mahony by Richardson, Henry Handel (1931)', 0.018942674185599872)
('Huntingtower by Buchan, John (1922)', 0.017308742069390162)
('Valmouth by Firbank, Ronald (1919)', 0.017098804375863388)
('Mr. Bennett and Mrs. Brown by Woolf, Virginia (1924)', 0.01683473072223191)
("Strange Interlude by O'Neill, Eugene (1928)", 0.016034617435096814)
('Strait Is the Gate by Gide, André (1924)', 0.015872169735874544)
('All Quiet on the Western Front by Remarque, Erich Maria (1929)', 0.015682468487507244)
('Saint Joan: A Chronicle Play in Six Scenes and an Epilogue by Shaw, George Bernard (1924)', 0.015

In [None]:
gr_between_df = pd.DataFrame.from_dict(gr_betweenness_dict, orient='index')
gr_between_df["book_info"] = gr_between_df.index

gr_between_df = gr_between_df.rename(columns={0: "between_centrality"})


In [148]:
# eigenvector centrality
gr_eigenvector_dict = nx.eigenvector_centrality(gr_G) # Run eigenvector centrality
nx.set_node_attributes(gr_G, gr_eigenvector_dict, 'eigenvector')

gr_sorted_eigenvector = sorted(gr_eigenvector_dict.items(), key=itemgetter(1), reverse=True)

print("Top 20 GR nodes by eigenvector centrality:")
for b in gr_sorted_eigenvector[:20]:
    print(b)

Top 20 GR nodes by eigenvector centrality:
('Lolly Willowes by Warner, Sylvia Townsend (1926)', 0.09035293999436013)
('Montaigne', 0.09008099630796959)
('The Unvanquished by Faulkner, William (1938)', 0.08992104054894763)
("Ten North Frederick by O'Hara, John (1955)", 0.08907464064615696)
('Strait Is the Gate by Gide, André (1924)', 0.08862605096337263)
('Goodbye to All That: An Autobiography by Graves, Robert (1929)', 0.0882979082692426)
("Strange Interlude by O'Neill, Eugene (1928)", 0.08694687771709478)
('The Fortunes of Richard Mahony by Richardson, Henry Handel (1931)', 0.086455403477043)
('Saint Joan: A Chronicle Play in Six Scenes and an Epilogue by Shaw, George Bernard (1924)', 0.08539422610125849)
('Mr. Bennett and Mrs. Brown by Woolf, Virginia (1924)', 0.08483449426528818)
('Huntingtower by Buchan, John (1922)', 0.08481290271746149)
('All Passion Spent by Sackville-West, Vita (1931)', 0.08445883363664931)
('Poems by Meredith, George', 0.08424966063233859)
('Grand Hotel by Bau

In [149]:
gr_eigenvector_df = pd.DataFrame.from_dict(gr_eigenvector_dict, orient='index')
gr_eigenvector_df["book_info"] = gr_eigenvector_df.index

In [150]:
gr_eigenvector_df = gr_eigenvector_df.rename(columns={0: "eigenvector_centrality"})

### Combine all

In [152]:
full_gr_df = full_gr_df.merge(gr_degree_df, how='inner', on='book_info').merge(gr_between_df, how='inner', on='book_info').merge(gr_eigenvector_df, how='inner', on='book_info')

In [160]:
full_gr_df.head(10)

Unnamed: 0,vertex,book_info,layer,hub_and_spoke,coreness,degree_centrality,between_centrality,eigenvector_centrality
0,394,"1914 and Other Poems by Brooke, Rupert (1915)",2,1,0.03,2,0.001696158,0.000348
1,685,"1919 by Dos Passos, John (1932)",1,1,0.46,69,0.0007121886,0.015438
2,607,365 Days (1936),2,1,0.02,30,1.719823e-06,0.010243
3,1076,"A Book by Barnes, Djuna (1923)",1,1,0.5,73,8.313327e-07,0.023388
4,704,"A Book of Nonsense by Lear, Edward (1846)",1,1,0.5,89,0.0003616448,0.025975
5,207,"A Child's Garden of Verses by Stevenson, Rober...",2,1,0.01,3,0.0,0.000814
6,1152,"A Christmas Garland by Beerbohm, Max (1912)",2,1,0.01,12,0.0,0.003526
7,393,"A City of Bells by Goudge, Elizabeth (1936)",2,1,0.01,7,0.0,0.000378
8,61,A Connecticut Yankee in King Arthur's Court by...,1,1,0.51,82,0.0001289415,0.02331
9,330,"A Doll's House by Ibsen, Henrik (1889)",2,1,0.06,49,0.0006103457,0.008319


In [154]:
full_gr_df.to_csv("goodreads-core-periphery.csv")

### Future work
- might look into allowing multiple cores ([Xiao Zhang, Travis Martin, and M. E. J. Newman. 2015. “Identification of core-periphery structure in networks” *Physical Review* E 91](https://journals.aps.org/pre/abstract/10.1103/PhysRevE.91.032803))
- Stats about the core:
    - density of the core
    - relative size of core
- four quadrants of core -> core, core -> periphery, periphery -> periphery, periphery -> core

## Other basic network measures
Many of these measures are drawn from Ladd et al.'s ["Exploring and Analyzing Network Data with Python"](https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python). Each measure is only used with the Shakespeare and Co dataset, but could just as easily be applied to Goodreads.  