[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeljov/NAP2025/blob/main/Pride_and_Prejudice_network_analysis.ipynb)

## Text analytics and SNA applied to Jane Austen's "Pride and Prejudice" novel

Pride and Prejudice is one of the most beloved classic English novels from the first half of the 19th century. As is the case with many other classics, the novel is publicly available through [Project Gutenberg](https://www.gutenberg.org/).

Similar to how movie scripts were used at [Movie Galaxies](https://moviegalaxies.com/) to build networks of movie characters, we will use the (publicly available) text of Pride and Prejudice to create a social network of its characters. In networks of movie characters, two characters are connected if they occur in the same scene. Analogously, we will create a network where two characters are connected if they occur in the same paragraph, and the frequency of their co-occurrence is reflected in the weight of the edge that connects them.

To build such a network, we will adopt the following method:
* **Download and preprocess the book content**. Download the book as a .txt file either directly from [Project Gutenberg](https://www.gutenberg.org/ebooks/1342) or from one of the many public repos where it is available. Then, split the book into volumes and chapters, and do some text cleaning along the way.
* **Collect data about the book characters**. Look for a place on the net where the list of book characters can be found - this will be needed to properly identify characters in the text. For example, one such list is available [here](https://austenprose.com/pride-and-prejudice-character-list/), as a part of a website that is generally about Jane Austen's work and can be considered a credible source. Extract the list of characters and make it available for the entity extraction task.
* **Do paragraph-level entity (character) extraction**. Extract entities from each book paragraph using spaCy for tokenisation and entity detection. Since spaCy's NER (Named Entity Recognition) module was trained on news articles and web content, it is not able to handle the text of a 19th century novel properly. Thus, we need to define a set of custom rules for the entity detection task.
* **Create an edge list**. Recall that a typical source for network creation is an edge list. So, having identified entities in each paragraph, for each paragraph with at least two entities detected, we add all pairs of the identified entities to an edge list. For example, if Elizabeth, Jane, and Darcy are mentioned in one paragraph, we would add three edges to the edge list: Elizabeth - Jane, Elizabeth - Darcy, and Jane - Darcy. After all edges (i.e., detected entity pairs) have been added to the edge list, reduce the list by introducing edge weight, so that instead of having multiple occurrences of the same edge, we have one edge occurrence with the weight reflecting its frequency.
* **Create and visualise the (paragraph-level) network**. The edge list can be directly used to create a network using the networkX library. Then we visualise and explore the network.
* **Explore the network**. Compute the basic network statistics and identify the most central characters. Examine how connected the network is and whether some communities can be detected.
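
The edge-list step described above can be illustrated on a toy example. The sketch below is not the notebook's own implementation: the three paragraphs are hypothetical, and `Counter` stands in for the pandas-based aggregation used later on.

```python
from collections import Counter
from itertools import combinations

# hypothetical paragraphs, each reduced to the set of characters it mentions
paragraph_chars = [
    {"Elizabeth", "Jane", "Darcy"},
    {"Elizabeth", "Darcy"},
    {"Jane"},  # a single character yields no edge
]

edge_weights = Counter()
for chars in paragraph_chars:
    # sorting gives every pair one canonical orientation
    for source, target in combinations(sorted(chars), 2):
        edge_weights[(source, target)] += 1

for (source, target), weight in sorted(edge_weights.items()):
    print(f"{source} - {target}: {weight}")
```

The first paragraph contributes three edges, the second repeats one of them, and the third contributes none, so the weighted list ends up with three distinct edges.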



### Install and load the required libraries

In [None]:
# pyvis is a network visualisation library capable of much better network visualisation compared to that provided by networkX
# https://pypi.org/project/pyvis/

!pip install -q pyvis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

import requests
from bs4 import BeautifulSoup

from google.colab import files
import pickle

from collections import defaultdict

import re
import spacy

import networkx as nx

from pyvis.network import Network
import IPython

import warnings

### Download and preprocess the book content

The book content was downloaded from [this GitHub repo](https://github.com/laumann/ds/blob/master/hashing/books/) and stored locally.

In [None]:
book_file = files.upload()

In [None]:
with open('jane-austen-pride-prejudice.txt', 'r') as fobj:
  book_raw_txt = fobj.read()

print(len(book_raw_txt))

Split the book content into volumes, chapters, and eventually paragraphs.
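
The splitting relies on the behaviour of `re.split` with a capturing group: the matched labels are kept in the result, alternating with the text between them. A minimal sketch on a hypothetical miniature text (the `sample` string is made up for illustration):

```python
import re

# hypothetical miniature text that mimics the layout of the book file
sample = "Front matter\nCHAPTER I.\nFirst chapter text.\nCHAPTER II.\nSecond chapter text."

parts = re.split(r'(CHAPTER\s+[IVXLCDM]+\.)', sample)

# parts[0] is the text before the first label; labels and contents then alternate
labels = [p.strip() for p in parts[1::2]]
contents = [p.strip() for p in parts[2::2]]

for label, content in zip(labels, contents):
    print(f"{label} -> {content}")
```

This is why the loops in the cell below start at index 1 and advance in steps of 2.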

In [None]:
# Regarding the regex patterns below:
# \s+ matches one or more whitespace characters
# [IVXLCDM]+ matches one or more Roman numeral characters
# \. captures the dot at the end of the chapter number
volume_pattern = r'(VOL\.\s+[IVXLCDM]+\.)'
chapter_pattern = r'(CHAPTER\s+[IVXLCDM]+\.)'

# split the text into volumes
# note that we included parentheses in the pattern to keep the volume titles in the text
book_volumes = re.split(volume_pattern, book_raw_txt)

# the text of VOL. I is preceded by the 'preface'
preamble = book_volumes[0]

chapters = list()

for i in range(1, len(book_volumes)-1, 2):
    volume_lbl = book_volumes[i].strip()
    volume_content = book_volumes[i+1].strip()

    # split a volume into chapters
    volume_chapters = re.split(chapter_pattern, volume_content)
    # skip the content before the first chapter, it's a preface
    # (a separate index name is used so that the outer loop variable is not shadowed)
    for j in range(1, len(volume_chapters)-1, 2):
      chapter_lbl = volume_chapters[j].strip()
      chapter_content = volume_chapters[j+1].strip()
      chapters.append({
          'volume':volume_lbl,
          'chapter':chapter_lbl,
          'content':chapter_content
      })



In [None]:
len(chapters)

In [None]:
print(chapters[0])

### Collect data about the book characters

The character data will be collected by scraping the relevant content of the [Pride and Prejudice: List of Characters](https://austenprose.com/pride-and-prejudice-character-list/) page.

In [None]:
char_groups = ['Longbourn', 'Netherfield Park', 'Lucas Lodge', 'Meryton', 'Rosings Park', 'Pemberley', 'Town', 'Regiment']

In [None]:
characters = defaultdict(list)
url = "https://austenprose.com/pride-and-prejudice-character-list/"

try:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    main_chars_elem = soup.find(lambda elem: (elem.name=='div') and elem.has_attr('class') and (elem['class'][0] == 'post-content'))

    current_char_group = ""
    for elem in main_chars_elem.find_all(lambda elem: elem.name in ['h3', 'p']):

      if (elem.name == 'h3'):
        if elem.text.strip() == "Minor Characters":
          break
        group_lbl = elem.text.split(maxsplit=1)[1].strip()
        if (group_lbl in char_groups) and (current_char_group != group_lbl):
          current_char_group = group_lbl

      if (current_char_group != "") and (elem.name == 'p') and (elem.find_next(name='strong')):
          char_in_bold = elem.find_next('strong')
          characters[current_char_group].append({'name': char_in_bold.text, 'description':elem.text})

except Exception as e:
    # report the exception type together with its message
    print(f"Scraping failed: {e!r}")

In [None]:
characters_list = list()
for char_group, char_list in characters.items():
  for char in char_list:
    char.update({'group':char_group})
    characters_list.append(char)

In [None]:
len(characters_list)

In [None]:
for char in characters_list[:5]:
  print(char)

In [None]:
characters_df = pd.DataFrame(characters_list)

print(characters_df.shape)
characters_df.head()

In [None]:
characters_df.tail()

In [None]:
characters_df.name.tolist()

Since each character can be referenced in the text in many different ways, we will first split the full name into segments (title, first name, last name) in order to be able to create various versions of character mentions in the text.

In [None]:
def extract_title_firstname_surname(full_name):
  name_parts = full_name.split()
  if len(name_parts) == 2:
    title, surname = name_parts
    return title, None, surname
  elif len(name_parts) == 3:
    title, fname, surname = name_parts
    return title, fname, surname
  else:
    surname = " ".join(name_parts[-2:])
    fname = name_parts[-3]
    n = len(name_parts)
    title = " ".join(name_parts[0:(n-3)])
    return title, fname, surname

In [None]:
name_parts_df = characters_df.name.apply(extract_title_firstname_surname).apply(pd.Series)
name_parts_df.columns = ['title', 'first_name', 'last_name']

pride_prej_chars_df = pd.concat([name_parts_df, characters_df], axis=1)
pride_prej_chars_df

We will now add a column with all possible name variants to look for in the text:
* full name,
* title + first name (if first name exists),
* first name + surname (if first name exists),
* title + surname (if first name exists),
* first name only (if it exists),
* special cases:
  * include nicknames "Lizzy" and "Eliza" for Elizabeth, as well as "Mrs. Darcy" (what she becomes at the very end),
  * include nickname "Kitty" for Catherine Bennet,
  * include surnames (without the title) for Bingley and Darcy, as they are often referenced like that in the book,
  * exclude the first name for Darcy: he is never referred to by it, and leaving it out avoids confusing him with Colonel Fitzwilliam, his cousin,
  * exclude the title + surname combination for the Bennet sisters, since it is the same for all of them and thus introduces ambiguity,
  * include "Lady Catherine" for Rt. Hon. Lady Catherine de Bourgh,
  * include the labels "the late Mr. Darcy", "the elder Mr. Darcy", and "old Mr. Darcy" for Mr. Darcy's father; this handles the ambiguity that arises from both the father and the son being referred to as "Mr. Darcy".
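
Since several of the special cases above exist to avoid ambiguity, it may be worth checking whether any variant ends up assigned to more than one character. The sketch below is one possible check; the `toy_variants` dictionary is hypothetical and only mimics the kind of data built in the next cell.

```python
from collections import defaultdict

# hypothetical variant lists; the real ones are built from the scraped character data
toy_variants = {
    "Miss Catherine Bennet": ["Miss Catherine Bennet", "Catherine", "Kitty"],
    "Lady Catherine de Bourgh": ["Lady Catherine de Bourgh", "Lady Catherine", "Catherine"],
    "Mr. Charles Bingley": ["Mr. Charles Bingley", "Mr. Bingley", "Bingley"],
}

# map each variant to the set of characters that claim it
owners = defaultdict(set)
for full_name, variants in toy_variants.items():
    for variant in variants:
        owners[variant].add(full_name)

ambiguous = {variant: sorted(names) for variant, names in owners.items() if len(names) > 1}
print(ambiguous)
```

Any variant reported here would be matched to only one of its owners during entity detection, so it is a candidate for another special rule.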

In [None]:
chars_name_variants = []

for _, row in pride_prej_chars_df.iterrows():
  fn = row['first_name']
  ln = row['last_name']
  title = row['title']

  name_variants = [row['name']]

  if fn:
    name_variants.extend([f"{title} {fn}", f"{fn} {ln}"])

  # the following one is to add the title + surname combination for all with the first name
  # except for the Bennet sisters - see the special rules above
  if fn and (ln != "Bennet"):
    name_variants.append(f"{title} {ln}")

  # the following one is to add the first name for all characters except Darcy - see the special rules above
  if fn and (ln != "Darcy"):
    name_variants.append(fn)

  if row['name'] == 'Miss Elizabeth Bennet':
    name_variants.extend(['Lizzy', 'Eliza', 'Miss Eliza', 'Mrs. Darcy'])

  if row['name'] == 'Miss Catherine Bennet':
    name_variants.extend(['Kitty', 'Miss Kitty'])

  if row['name'] in ['Mr. Charles Bingley', 'Mr. Fitzwilliam Darcy']:
    name_variants.append(ln)

  if row['name'] == 'Mr. Darcy (the elder)':
    name_variants = ['the late Mr. Darcy', 'the elder Mr. Darcy', 'old Mr. Darcy']

  chars_name_variants.append({
      'name': row['name'],
      'name_variants': name_variants
  })

chars_name_variants_df = pd.DataFrame(chars_name_variants)
#chars_name_variants_df

In [None]:
pride_prej_chars_df = pd.merge(pride_prej_chars_df, chars_name_variants_df, on='name', how='inner')
pride_prej_chars_df.info()

In [None]:
pride_prej_chars_df.head()

In [None]:
pride_prej_chars_df.to_csv("pride_and_prejudice_characters.csv", index=False)
files.download("pride_and_prejudice_characters.csv")

### Do paragraph-level entity (character) extraction

Install a spaCy model for text processing.

In [None]:
!python3 -m spacy download en_core_web_md

Prepare a dictionary where each key is the full character name (used as an identifier) and the value is a list of the different ways this character may appear in the text. This will be needed for setting the rules for character detection.

In [None]:
characters_in_text_dict = dict()

for _, row in pride_prej_chars_df.iterrows():
  characters_in_text_dict[row['name']] = row['name_variants']

# for k, v in characters_in_text_dict.items():
#   print(f"{k}: {v}")

Load the NLP model and define a set of custom rules for character detection
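
The rules given to the entity ruler are plain dictionaries. The sketch below only shows their shape and does not load spaCy; the `toy_characters` mapping is a hypothetical stand-in for the dictionary prepared above.

```python
# hypothetical mapping from a character's full name to the ways it may appear in text
toy_characters = {
    "Mr. Charles Bingley": ["Mr. Charles Bingley", "Mr. Bingley", "Bingley"],
    "Miss Jane Bennet": ["Miss Jane Bennet", "Jane Bennet", "Jane"],
}

toy_patterns = []
for full_name, variants in toy_characters.items():
    # one identifier per character, shared by all of its name variants
    char_id = "_".join(full_name.split())
    for variant in variants:
        # a string pattern is matched as an exact phrase
        toy_patterns.append({"label": "PERSON", "pattern": variant, "id": char_id})

for pattern in toy_patterns:
    print(pattern)
```

Because every variant carries the same `id`, "Bingley" and "Mr. Bingley" can later be resolved to one and the same character.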

In [None]:
spacy_pipeline = spacy.load("en_core_web_md")

# add the ruler BEFORE the ner
# overwrite_ents is a setting of the component, so it is passed through the config
ruler = spacy_pipeline.add_pipe("entity_ruler", before="ner", config={"overwrite_ents": True})

# patterns to be used for identifying characters in the text
patterns = []

# # first, add general templates of the kind: Title + Capitalized Word
# titles = ["Mr.", "Mrs.", "Miss", "Lady", "Sir", "Rev.", "Colonel"]
# for title in titles:
#     patterns.append({
#         "label": "PERSON",
#         "pattern": [{"LOWER": title.lower()}, {"IS_TITLE": True}]
#     })

# next, create patterns for specific name combinations
for full_name, name_variants in characters_in_text_dict.items():
  char_id = "_".join(full_name.split())
  for name in name_variants:
      patterns.append({"label": "PERSON", "pattern": name, "id":char_id})

# load all the patterns into the ruler
ruler.add_patterns(patterns)

In [None]:
# test the pipeline
doc = spacy_pipeline("Mr. Bennet and George Wickham both knew Elizabeth Bennet.")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_}) ({ent.id_})")

Texts downloaded from Project Gutenberg require some preprocessing to be ready for further analysis. Typical problems include almost randomly placed newlines / tabs / multiple spaces, as well as '--' joining two words that should not be joined (e.g., "Mrs. Bennet.--They").
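
The effect of the two cleaning steps can be seen on a single made-up string (the `raw` value below is hypothetical, constructed to contain both kinds of problems):

```python
import re

# hypothetical fragment with a stray newline, a tab, repeated spaces and a double dash
raw = "said Mrs.\nBennet.--They   were\tlate."

# collapse all whitespace runs, then pad the double dash with spaces
clean = re.sub(r'\s+', ' ', raw).replace('--', ' -- ').strip()
print(clean)
# prints: said Mrs. Bennet. -- They were late.
```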

In [None]:
def clean_gutenberg_paragraphs(raw_paragraphs):

    cleaned_paragraphs = []

    for par in raw_paragraphs:
        # replace any internal newlines or tabs or multiple spaces with a single space
        clean_par = re.sub(r'\s+', ' ', par)

        # replace the Gutenberg double-dash with a space-padded version
        # this is to avoid problems with detection of entities (e.g., 'Mrs. Bennet.--They')
        clean_par = clean_par.replace('--', ' -- ')

        # strip leading/trailing whitespace
        clean_par = clean_par.strip()

        if clean_par: # if the paragraph isn't empty
            cleaned_paragraphs.append(clean_par)

    return cleaned_paragraphs

The function below extracts characters from individual paragraphs of a book chapter by:
* splitting the chapter into paragraphs
* cleaning the text of those paragraphs
* passing the paragraphs to spaCy's NLP pipeline for further processing, specifically tokenisation and entity extraction based on the custom rules (set above); note that the `parser` and `lemmatizer` are excluded from the pipeline since they are not needed for entity extraction; in addition, the `ner` module is also excluded since it proved not to work well with 19th century language (it was trained on modern-day web and news content)
* filtering the extracted entities to keep only those of the type PERSON.

In [None]:
def extract_characters_from_chapter(book_chapter: dict, nlp_pipeline) -> list:

  chapter_entity_labels = []
  chapter_entity_ids = []

  chapter_pars = re.split(r'\n\n+', book_chapter['content'])

  cleaned_chapter_pars = clean_gutenberg_paragraphs(chapter_pars)

  processed_paragraphs = nlp_pipeline.pipe(cleaned_chapter_pars, disable=["ner", "parser", "lemmatizer"])

  for p in processed_paragraphs:
      ent_labels = []
      ent_ids = []
      for ent in p.ents:
        if ent.label_ == 'PERSON':
          ent_labels.append(ent.text.strip())
          ent_ids.append(ent.id_)

      chapter_entity_labels.append(ent_labels)
      chapter_entity_ids.append(ent_ids)

  chapter_pars_ents = []
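  # pair the entities with the cleaned paragraphs they were extracted from
  # (empty paragraphs are dropped during cleaning, so the raw list may not align)
  for par_text, par_ents, par_ent_ids in zip(cleaned_chapter_pars, chapter_entity_labels, chapter_entity_ids):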
    chapter_pars_ents.append({
        'volume': book_chapter['volume'],
        'chapter': book_chapter['chapter'],
        'paragraph': par_text,
        'entity_lbls': set(par_ents),
        'entity_ids': set(par_ent_ids)})

  return chapter_pars_ents

In [None]:
chapters_with_entities = []

for i, chapter in enumerate(chapters):
  if i % 10 == 0: print(f"processing {i+1}. chapter")
  chapters_with_entities.extend(extract_characters_from_chapter(chapter, spacy_pipeline))

print(len(chapters_with_entities))

In [None]:
chapters_with_entities_df = pd.DataFrame(chapters_with_entities)
chapters_with_entities_df.head(10)

The code below is just for checking what PERSON type entities have been detected.

In [None]:
# the following line shows how to turn a list of lists into a flat list
# flattened_list = [item for sublist in list_of_lists for item in sublist]

# all_extracted_entities = set([ent for paragraph_ents in chapters_with_entities_df.entity_lbls.to_list() for ent in paragraph_ents])

# all_extracted_entity_ids = set([ent for paragraph_ents in chapters_with_entities_df.entity_ids.to_list() for ent in paragraph_ents])

# print("\n".join(all_extracted_entities))
# print("\n".join(all_extracted_entity_ids))
# print(len(all_extracted_entity_ids))

#### Filter out the extracted entities

We may need to filter the extracted entities to keep only those that are really book characters. To that end, we will use the data about the book characters that was collected from the web.

Note: this step is required only if we use some more general rules for entity detection - for example, rules of the type: Title + Capitalized Word, as given (and commented out) in the code above.

In [None]:
def check_entity(ent_label):
  for full_name, name_variants in characters_in_text_dict.items():
    if ent_label in name_variants:
      return full_name
  # if the entity label is not associated with any known entity
  return None

true_ents_in_pars = []

for _, row in chapters_with_entities_df.iterrows():
  if not row['entity_lbls'] or len(row['entity_lbls']) == 0:
    continue

  par_ents = []
  for ent_lbl in row['entity_lbls']:
    ent_id = check_entity(ent_lbl)
    if ent_id:
      par_ents.append(ent_id)

  if len(par_ents) > 0:

    vol_num = row['volume'].split()[1]
    chapt_num = row['chapter'].split()[1]
    vol_chapt = f"{vol_num.strip('.')}_{chapt_num.strip('.')}"

    true_ents_in_pars.append({
        'chapter': vol_chapt,
        'paragraph': row['paragraph'],
        'chars' : set(par_ents)
    })

In [None]:
true_ents_in_pars_df = pd.DataFrame(true_ents_in_pars)
true_ents_in_pars_df.head()

In [None]:
true_ents_in_pars_df.info()

Note that in the cell below, the data frame is serialised using pickle: the `chars` column holds Python sets, which a .csv file would store only as plain strings.
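
The difference between the two storage options can be shown with the standard library alone. The sketch below uses a hypothetical single record instead of the data frame, and the `csv` module instead of pandas, but the underlying issue is the same: a set written to CSV comes back as text.

```python
import csv
import io
import pickle

# hypothetical record with a set-valued field, like the 'chars' column
record = {"chapter": "I_1", "chars": {"Mr. Bennet", "Mrs. Bennet"}}

# CSV round trip: the set is written as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["chapter", "chars"])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# pickle round trip: the Python object is restored as it was
from_pickle = pickle.loads(pickle.dumps(record))

print(type(from_csv["chars"]).__name__)     # str
print(type(from_pickle["chars"]).__name__)  # set
```

The string could be parsed back into a set (for example with `ast.literal_eval`), but pickling avoids that extra step.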

In [None]:
# true_ents_in_pars_df.to_csv("chapters_with_paragraph_level_entities.csv", index = False)

with open("chapters_with_paragraph_level_entities.pkl", "wb") as fobj:
  pickle.dump(true_ents_in_pars_df, fobj)

files.download("chapters_with_paragraph_level_entities.pkl")

### Create a paragraph-level edge-list

To create this edge list (and later network), we will establish connections between characters who appear in the same paragraph. If there is just one person mentioned in a paragraph, we simply skip the paragraph.

In [None]:
# data_file = files.upload()

# with open("chapters_with_paragraph_level_entities.pkl", "rb") as fobj:
#   true_ents_in_pars_df = pickle.load(fobj)


In [None]:
edge_list = []

for _, row in true_ents_in_pars_df.iterrows():
  n_chars = len(row['chars'])
  if n_chars < 2: continue

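  # sort the names so that each pair always gets the same source/target orientation;
  # otherwise (A, B) and (B, A) would be counted as two different edges
  chars_list = sorted(row['chars'])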

  for i in range(n_chars-1):
    for j in range(i+1, n_chars):
      edge_list.append({
          'chapter': row['chapter'],
          'paragraph': row['paragraph'],
          'source' : chars_list[i],
          'target': chars_list[j]
      })

print(f"Total number of edges: {len(edge_list)}")


In [None]:
edge_list_df = pd.DataFrame(edge_list)
edge_list_df.head(10)

In [None]:
weighted_edge_list = edge_list_df.groupby(['source','target']).paragraph.count()
weighted_edge_list = weighted_edge_list.reset_index(drop=False)
weighted_edge_list.info()

In [None]:
weighted_edge_list.rename(columns={'paragraph':'weight'}, inplace=True)
weighted_edge_list.head()

In [None]:
weighted_edge_list.sort_values(by='weight', ascending=False).head(10)

### Create and visualise the (paragraph-level) network

In [None]:
G_pars = nx.from_pandas_edgelist(df=weighted_edge_list,
                                    source='source',
                                    target='target',
                                    edge_attr='weight',
                                    create_using=nx.Graph)
print(G_pars)

In [None]:
# for edge in G_pars.edges(data=True):
#   print(edge)

Add the character group (the location-based one) as a node attribute. Also add a numerical representation of the group, as that will be useful later on for visualisation.

In [None]:
unique_char_groups = pride_prej_chars_df['group'].unique().tolist()
groups_ids_mapping = {group:i for i, group in enumerate(unique_char_groups)}

for node in G_pars.nodes():
  loc_group = pride_prej_chars_df.loc[pride_prej_chars_df.name == node, 'group'].iloc[0]
  loc_group_id = groups_ids_mapping[loc_group]

  G_pars.nodes[node]['group_lbl'] = loc_group
  G_pars.nodes[node]['group'] = loc_group_id

In [None]:
def plot_graph(G,
               graph_name,
               graph_layout = None,
               node_color_modifiers=None,
               node_size_modifiers=None,
               edge_weight_multiplier=1):
    plt.figure(figsize=(12,11))

    if graph_layout:
      pos = graph_layout
    else:
      # pos = nx.kamada_kawai_layout(G)
      pos = nx.spring_layout(G, seed=9, k=0.95, weight='weight')


    if node_color_modifiers:
        node_color = [node_color_modifiers[node] for node in G.nodes()]
    else:
        node_color = 'purple'

    if node_size_modifiers:
      node_size = [200 + 1500*node_size_modifiers[node] for node in G.nodes()]
    else:
      node_size = 350

    edge_width = [attr['weight']*edge_weight_multiplier for (u, v, attr) in G.edges(data=True)]

    nx.draw_networkx_nodes(G, pos, node_size=node_size, node_color=node_color, cmap='viridis')
    nx.draw_networkx_edges(G, pos, width=edge_width, edge_color='silver')
    nx.draw_networkx_labels(G, pos, font_color='indigo', font_size=9, font_weight='bold', horizontalalignment='left', verticalalignment='bottom')

    plt.title(graph_name)

    plt.axis('off')
    plt.show()

In [None]:
plot_graph(G_pars,
           "Network of Pride and Prejudice actors, based on co-occurrence in book paragraphs",
          #  graph_layout=nx.kamada_kawai_layout(G_pars),
           node_color_modifiers= nx.get_node_attributes(G_pars, 'group'),
           edge_weight_multiplier=0.15)

To examine the tendency of the network actors to group / cluster, we will compute key measures of actor connectedness and tendency to form groups in a network:
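
To make the three measures concrete, they can be computed by hand on a small graph. The sketch below uses plain Python rather than networkX and a hypothetical four-node graph (a triangle with one pendant node); edge weights are ignored here, as they are by the networkX calls in the next cell.

```python
from itertools import combinations

# hypothetical undirected graph as an adjacency dictionary
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
n = len(adj)
m = sum(len(nbrs) for nbrs in adj.values()) // 2

# density: existing edges divided by all possible edges
density = 2 * m / (n * (n - 1))

def neighbour_pairs(node):
    """Return (number of linked neighbour pairs, number of all neighbour pairs)."""
    pairs = list(combinations(sorted(adj[node]), 2))
    linked = sum(1 for u, v in pairs if v in adj[u])
    return linked, len(pairs)

local_clustering = {}
closed_total, pairs_total = 0, 0
for node in adj:
    linked, total = neighbour_pairs(node)
    # local coefficient: share of a node's neighbour pairs that are themselves connected
    local_clustering[node] = linked / total if total else 0.0
    closed_total += linked
    pairs_total += total

# transitivity: the same ratio, but pooled over the whole graph
transitivity = closed_total / pairs_total
avg_local = sum(local_clustering.values()) / n

print(f"Density: {density:.4f}, transitivity: {transitivity:.4f}, avg. local clust. coeff: {avg_local:.4f}")
```

The pendant node D has no neighbour pairs, so it pulls the average local coefficient below the transitivity; the two measures need not agree.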

In [None]:
from statistics import mean

# network density
den = nx.density(G_pars)
# global clustering coefficient (transitivity)
trans = nx.transitivity(G_pars)
# local clustering coefficient
loc_clust_coef = nx.clustering(G_pars)
avg_loc_clust = mean(loc_clust_coef.values())

print(f"Density: {den:.4f}, transitivity (global clust. coeff.): {trans:.4f}, avg. local clust. coeff: {avg_loc_clust:.4f}")

Compute also node centralities. In this context, only degree and closeness centrality seem to be meaningful.
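
As a reminder of what the two measures capture: degree centrality is the share of other nodes a node is directly connected to, while closeness centrality is, for a connected graph, the number of other nodes divided by the sum of shortest-path distances to them. The plain-Python sketch below works on a hypothetical four-node graph and counts distances in hops, ignoring edge weights.

```python
from collections import deque

# hypothetical undirected, connected graph as an adjacency dictionary
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
n = len(adj)

degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def hop_distances(start):
    """Breadth-first search: number of hops from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nbr in adj[current]:
            if nbr not in dist:
                dist[nbr] = dist[current] + 1
                queue.append(nbr)
    return dist

closeness_centrality = {}
for node in adj:
    total = sum(hop_distances(node).values())
    closeness_centrality[node] = (n - 1) / total

for node in sorted(adj):
    print(f"{node}: degree={degree_centrality[node]:.2f}, closeness={closeness_centrality[node]:.2f}")
```

Node C, which touches everyone, scores 1.0 on both measures, while the pendant node D scores lowest on both.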

In [None]:
degree_centr = nx.degree_centrality(G_pars)
closeness_centr = nx.closeness_centrality(G_pars)

In [None]:
plot_graph(G_pars,
           "Network of Pride and Prejudice actors, based on co-occurrence in book paragraphs",
          #  graph_layout=nx.kamada_kawai_layout(G_pars),
           node_color_modifiers= nx.get_node_attributes(G_pars, 'group'),
           node_size_modifiers=degree_centr,
           edge_weight_multiplier=0.15)

Create a copy of the network for visualisation purposes, as pyvis changes some network attributes.

In [None]:
G_pars_vis = G_pars.copy()

nx.set_node_attributes(G_pars_vis, degree_centr, 'value')

for u, v, attr in G_pars_vis.edges(data=True):
    G_pars_vis[u][v]['width'] = attr['weight'] * 0.5

In [None]:
net = Network(notebook = True, cdn_resources='remote', width="1000px", height="700px", bgcolor='#222222', font_color='white')

net.from_nx(G_pars_vis)

# net.show_buttons(filter_=['physics'])

options = """
var options = {
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -15000,
      "centralGravity": 0.3,
      "springLength": 95
    },
    "minVelocity": 0.75,
    "stabilization": {
      "enabled": true,
      "iterations": 1000
    }
  }
}
"""
net.set_options(options)

net.save_graph("pride_and_prejudice.html")

# Use IPython to display the file in the Colab cell
IPython.display.HTML(filename="pride_and_prejudice.html")

In [None]:
# for edge in G_pars_vis.edges(data=True):
#   print(edge)

Examine if clusters can be identified in the network
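
The quality of a detected partition is commonly summarised by its modularity, which compares the share of edges inside communities with the share expected if edges were placed at random while keeping node degrees. A minimal sketch of the unweighted version, on a hypothetical graph of two triangles joined by one edge (the notebook's networks are weighted, so this is only an illustration of the idea):

```python
# hypothetical graph: two triangles connected by a single bridge edge
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
communities = [{0, 1, 2}, {3, 4, 5}]

m = len(edges)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

modularity = 0.0
for community in communities:
    # edges with both endpoints inside the community
    internal = sum(1 for u, v in edges if u in community and v in community)
    # sum of the degrees of the community's nodes
    total_degree = sum(degree[node] for node in community)
    modularity += internal / m - (total_degree / (2 * m)) ** 2

print(f"Modularity: {modularity:.4f}")
```

Values well above zero indicate that the partition captures more within-group edges than chance would suggest.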

In [None]:
louvaine_communities = nx.community.louvain_communities(G_pars, weight='weight')

print(f"Number of detected communities: {len(louvaine_communities)}")

In [None]:
nx.community.modularity(G_pars, louvaine_communities)

In [None]:
louvain_partitions = {}
for i, community in enumerate(louvaine_communities):
    for node in community:
        louvain_partitions[node] = i

In [None]:
plot_graph(G_pars,
           "Communities detected in the Pride and Prejudice network, with the Louvain method",
          #  graph_layout=nx.kamada_kawai_layout(G_pars),
           node_size_modifiers=degree_centr,
           node_color_modifiers=louvain_partitions,
           edge_weight_multiplier=0.15)

In [None]:
eb_communities = list(nx.community.girvan_newman(G_pars))
partition_modularity_dict = {i : nx.community.modularity(G_pars, partition) for i, partition in enumerate(eb_communities)}
best_partition_item = max(partition_modularity_dict.items(), key=lambda item: item[1])

best_partition = eb_communities[best_partition_item[0]]
eb_modularity = best_partition_item[1]

print(f"Modularity for Edge-betweenness method: {eb_modularity:.4f}")
print(f"Number of clusters: {len(best_partition)}")
print(f"Partitioning: {best_partition}")

## Create a chapter-level network

#### Identify the set of entities for each chapter and establish a connection between any two of them

In [None]:
# data_file = files.upload()
# with open("chapters_with_paragraph_level_entities.pkl", "rb") as fobj:
#   true_ents_in_pars_df = pickle.load(fobj)

In [None]:
chars_edge_list = []

for chapter, chapter_group in true_ents_in_pars_df.groupby('chapter'):
  all_chars = [char for par_chars in chapter_group['chars'] for char in par_chars]
  # sort the names so that each pair always gets the same source/target orientation
  all_unique_chars = sorted(set(all_chars))

  n_chars = len(all_unique_chars)
  if n_chars < 2:
    continue

  for i in range(n_chars-1):
    for j in range(i+1, n_chars):
      chars_edge_list.append({
          'chapter': chapter,
          'source' : all_unique_chars[i],
          'target': all_unique_chars[j]
      })

print(len(chars_edge_list))

In [None]:
chars_edge_list_df = pd.DataFrame(chars_edge_list)
chars_edge_list_df.head(10)

In [None]:
weighted_edge_list = chars_edge_list_df.groupby(['source','target']).chapter.count()
weighted_edge_list = weighted_edge_list.reset_index(drop=False)
weighted_edge_list.info()

In [None]:
weighted_edge_list.rename(columns={'chapter':'weight'}, inplace=True)
weighted_edge_list.head()

In [None]:
weighted_edge_list.sort_values(by='weight', ascending=False).head(20)

#### Create and explore the network

In [None]:
G_chapters = nx.from_pandas_edgelist(df=weighted_edge_list,
                                    source='source',
                                    target='target',
                                    edge_attr='weight',
                                    create_using=nx.Graph)
print(G_chapters)

In [None]:
warnings.simplefilter('ignore', UserWarning)

plot_graph(G_chapters,
           "Network of Pride and Prejudice actors, based on co-occurrence in a chapter",
           edge_weight_multiplier=0.15)

In [None]:
from statistics import mean

den = nx.density(G_chapters)
trans = nx.transitivity(G_chapters)
loc_clust_coef = nx.clustering(G_chapters)
avg_loc_clust = mean(loc_clust_coef.values())

print(f"Density: {den:.4f}, transitivity (global clust. coeff.): {trans:.4f}, avg. local clust. coeff: {avg_loc_clust:.4f}")

### Directions for future exploration

Create a separate network for each book volume (the novel was originally published in three volumes), so that the evolution of the interactions among the book characters can be traced over time. Compare the networks in terms of the main descriptive statistics and observe how distinct character groups (location-based groups) interact in different parts of the book.
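
One possible starting point, sketched under the assumption that edges carry the `<volume>_<chapter>` labels built earlier (e.g. `I_1`): group the edges by the volume part of the label and count them separately. The `toy_edges` records below are hypothetical.

```python
from collections import Counter, defaultdict

# hypothetical edge records; 'chapter' follows the '<volume>_<chapter>' labelling
toy_edges = [
    {"chapter": "I_1", "source": "Mr. Bennet", "target": "Mrs. Bennet"},
    {"chapter": "I_3", "source": "Mr. Bennet", "target": "Mrs. Bennet"},
    {"chapter": "II_2", "source": "Miss Elizabeth Bennet", "target": "Mr. Fitzwilliam Darcy"},
]

# one weighted edge list per volume
edges_by_volume = defaultdict(Counter)
for edge in toy_edges:
    volume = edge["chapter"].split("_")[0]
    edges_by_volume[volume][(edge["source"], edge["target"])] += 1

for volume, weights in edges_by_volume.items():
    print(volume, dict(weights))
```

Each per-volume edge list could then be turned into its own network and described with the same statistics as above.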
