# BookNLP Demo: Get Characters, People, Places, Quotations, Etc. for Any Book

By [Melanie Walsh](https://melaniewalsh.org/), based on a notebook by [David Bamman](https://people.ischool.berkeley.edu/~dbamman/)

[BookNLP](https://github.com/booknlp/booknlp) is a natural language processing tool developed by David Bamman. It can computationally identify characters, people, places, quotations, events, and a lot more in book-length documents in English (support for other languages coming soon!). Most NLP tools do not work well on book-length documents, which makes BookNLP extremely useful.

This Colab notebook, which is [based on a notebook](https://github.com/booknlp/booknlp/blob/main/examples/Read%20character%20file.ipynb) created by David Bamman, demonstrates how BookNLP works with a single book. The default book for this notebook is Virginia Woolf's *Mrs. Dalloway* (1925). However, you can substitute another URL in the "Pick Your Book" section to try BookNLP out on another book. BookNLP will take a few minutes to process a text, depending on how long it is.

# 🚨 Before You Begin 🚨

First, you need to sign into a Google account to use this notebook.

Second, BookNLP will work best if you switch to using a GPU, or Graphics Processing Unit, for this notebook. To use a GPU in Google Colab, go to the menu at the top of the screen and select:

`Runtime > Change runtime type > Hardware accelerator > GPU (Then click "Save")`

To run all the code in this notebook, you can select:

`Runtime > Run all`

If you want to save your own changes to this notebook, you'll need to save a copy.

# Pick Your Book

The default book for this notebook is Virginia Woolf's *Mrs. Dalloway*: https://gutenberg.net.au/ebooks02/0200991.txt. But you can find a .txt URL from [Project Gutenberg](https://www.gutenberg.org/ebooks/search/?sort_order=downloads) or GitHub or anywhere else and plug it in below:

In [None]:
!wget "https://gutenberg.net.au/ebooks02/0200991.txt" -O my_book.txt

--2021-11-22 21:48:01--  https://gutenberg.net.au/ebooks02/0200991.txt
Resolving gutenberg.net.au (gutenberg.net.au)... 43.229.63.241
Connecting to gutenberg.net.au (gutenberg.net.au)|43.229.63.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 370280 (362K) [text/plain]
Saving to: ‘my_book.txt’


2021-11-22 21:48:03 (449 KB/s) - ‘my_book.txt’ saved [370280/370280]



Then click `Runtime > Run all`

# Install and Import Packages

In [None]:
!pip install booknlp
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 8.2 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
from booknlp.booknlp import BookNLP
import json
from collections import Counter
from pathlib import Path
import pandas as pd
pd.options.display.max_rows = 200
pd.options.display.max_colwidth = 100

using device cuda


# Set Up BookNLP

In [None]:
model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
}

booknlp = BookNLP("en", model_params)

{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'big'}
--- startup: 17.561 seconds ---
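
The `pipeline` value is a comma-separated string and `model` selects the model size. As a minimal sketch (the helper name `build_model_params` and the `has_gpu` flag are our own, not part of BookNLP), the dictionary can be assembled from a list of components, falling back to the `small` model that the BookNLP README describes for machines without a GPU:

```python
def build_model_params(components, has_gpu):
    """Build a BookNLP-style parameter dictionary (hypothetical helper).

    components: list of pipeline step names, e.g. ["entity", "quote"]
    has_gpu:    True to request the "big" model, False for "small"
    """
    return {
        "pipeline": ",".join(components),
        "model": "big" if has_gpu else "small",
    }


params = build_model_params(
    ["entity", "quote", "supersense", "event", "coref"], has_gpu=True
)
print(params)
```

In this notebook, `has_gpu` could be set from `torch.cuda.is_available()`, assuming PyTorch is importable in your runtime.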


# Run BookNLP

Before we apply BookNLP, let's check to make sure our text file has the right character encoding.

If your text file has a character encoding that is not UTF-8 or ISO-8859-1, then you will need to uncomment the last two lines in the code cell below and manually enter the character encoding of the file before transforming it into a UTF-8 file.

In [None]:
# Check to see if text file opens with UTF-8 encoding
try:
    with open("my_book.txt", encoding='utf-8') as f:
        f.read()
except UnicodeDecodeError:
    try:
        # Check to see if file opens with ISO-8859-1 encoding and, if so, rewrite the file as UTF-8
        with open("my_book.txt", encoding='ISO-8859-1') as f:
            text = f.read()
        with open('my_book.txt', mode='w', encoding='utf-8') as f:
            f.write(text)
    except UnicodeDecodeError:
        print("Character encoding error: You need to uncomment the lines below and specify the character encoding for this text file")

# Open the file with the current encoding
#text = open("my_book.txt", encoding='Your Character Encoding Here').read()
# Rewrite the file as a UTF-8 file
#open('my_book.txt', mode='w', encoding='utf-8').write(text)
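
The same fallback idea can be wrapped in a small helper. This is only a sketch: the function name `convert_to_utf8` and the list of candidate encodings are our own assumptions, not part of BookNLP. Because ISO-8859-1 maps every possible byte to a character, it never raises a decoding error, so it belongs at the end of the list as a last resort.

```python
from pathlib import Path


def convert_to_utf8(path, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Rewrite `path` as UTF-8 and return the encoding that decoded it.

    Hypothetical helper: tries each candidate encoding in order and
    stops at the first one that decodes the file without an error.
    """
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        Path(path).write_text(text, encoding="utf-8")
        return encoding
    raise ValueError(f"None of {candidates} could decode {path}")
```

Calling `convert_to_utf8("my_book.txt")` before running BookNLP would leave a UTF-8 file in place. A successful decode does not guarantee the guess was right, so it is worth skimming the text for garbled characters afterwards.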

Apply BookNLP

In [None]:
inputFile = "my_book.txt"
outputDir = "my_book_dir/"
idd="my_book"

booknlp.process(inputFile, outputDir, idd)

--- spacy: 26.765 seconds ---
--- entities: 80.104 seconds ---
--- quotes: 0.149 seconds ---
--- attribution: 5.977 seconds ---
--- name coref: 0.655 seconds ---
--- coref: 26.719 seconds ---
--- TOTAL (excl. startup): 140.780 seconds ---, 78358 words
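
BookNLP writes its results into `outputDir` as files named after `idd`, including the `.book`, `.quotes`, and `.entities` files read later in this notebook. A quick way to confirm that processing finished is to list what was written. The helper below is a sketch, and its name `list_outputs` is our own:

```python
from pathlib import Path


def list_outputs(output_dir, book_id):
    """Return {file name: size in bytes} for files named after `book_id`."""
    return {
        path.name: path.stat().st_size
        for path in sorted(Path(output_dir).glob(f"{book_id}.*"))
    }


# In this notebook the call would be:
# list_outputs("my_book_dir/", "my_book")
```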


# Get Character Data

Load character data

In [None]:
character_data = json.load(open("my_book_dir/my_book.book"))

Make a counter

In [None]:
def get_counter_from_dependency_list(dependency_list):
    
    counter = Counter()

    for token in dependency_list:
        term = token["w"]
        tokenGlobalIndex=token["i"]
        counter[term] += 1
    return counter
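
To see what this function does, here is a small worked example. The entries are made up, but they have the same shape as the lists in the `.book` file: `"w"` holds the word and `"i"` holds the value the cell above calls `tokenGlobalIndex`.

```python
from collections import Counter

# Hypothetical entries shaped like a character's "agent" list
sample_agent_list = [
    {"w": "said", "i": 120},
    {"w": "thought", "i": 245},
    {"w": "said", "i": 310},
]

# Same counting logic as get_counter_from_dependency_list, written compactly
counter = Counter(token["w"] for token in sample_agent_list)
print(counter.most_common(2))  # [('said', 2), ('thought', 1)]
```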

Loop through character data and pull out information, then transform it into a DataFrame

In [None]:
df_list = []
for character in character_data["characters"]:
    
    agentList = character["agent"]
    patientList = character["patient"]
    possList = character["poss"]
    modList = character["mod"]
    character_id = character["id"]
    count = character["count"]
    # Default to "unknown" when BookNLP has no gender inference for this character
    referential_gender_distribution = referential_gender = "unknown"

    if character["g"] is not None and character["g"] != "unknown":
        referential_gender_distribution = character["g"]["inference"]
        referential_gender = character["g"]["argmax"]

    mentions=character["mentions"]
    proper_mentions=mentions["proper"]
    max_proper_mention=""

    # just print out information about named characters
    if len(mentions["proper"]) > 0:
        max_proper_mention=mentions["proper"][0]["n"]
        
        df_list.append( {'Name':max_proper_mention , 'Character ID': character_id,
                         'Mentions': count,
                       'Gender': referential_gender,
                       'Possessives': get_counter_from_dependency_list(possList).most_common(10),
                       'Agent': get_counter_from_dependency_list(agentList).most_common(10),
                       'Patient': get_counter_from_dependency_list(patientList).most_common(10),
                       'Modifiers': get_counter_from_dependency_list(modList).most_common(10)}
        )
df = pd.DataFrame(df_list)
df['Character ID'] = df['Character ID'].astype(str)
df

Unnamed: 0,Name,Character ID,Mentions,Gender,Possessives,Agent,Patient,Modifiers
0,Clarissa,144,1263,she/her,"[(party, 12), (dress, 10), (hand, 8), (voice, 7), (parties, 7), (life, 5), (eyes, 5), (hat, 4), ...","[(said, 61), (had, 34), (thought, 25), (felt, 23), (asked, 13), (knew, 13), (say, 12), (come, 11...","[(thought, 12), (loved, 7), (tell, 6), (asked, 5), (see, 5), (told, 4), (marry, 4), (left, 4), (...","[(positive, 3), (girl, 2), (cold, 2), (happy, 2), (pause, 1), (suspense, 1), (right, 1), (part, ..."
1,Peter,146,937,he/him/his,"[(knife, 12), (life, 7), (hand, 7), (age, 6), (eyes, 5), (name, 5), (boots, 3), (undoing, 3), (s...","[(thought, 75), (said, 61), (had, 17), (come, 9), (felt, 8), (made, 7), (remembered, 7), (knew, ...","[(asked, 6), (marry, 4), (see, 4), (married, 3), (called, 3), (left, 3), (meet, 3), (remember, 2...","[(happy, 2), (failure, 2), (man, 2), (sticks, 1), (old, 1), (older, 1), (adventurer, 1), (buccan..."
2,Miss Kilman,174,333,she/her,"[(mother, 4), (eyes, 3), (way, 3), (head, 2), (knowledge, 2), (body, 2), (fingers, 2), (hands, 2...","[(said, 14), (thought, 5), (had, 5), (sat, 5), (looked, 4), (go, 4), (stood, 3), (let, 3), (did,...","[(loved, 2), (ask, 2), (hated, 2), (starved, 1), (rasped, 1), (liking, 1), (trusting, 1), (lifte...","[(poor, 3), (able, 1), (dismissal, 1), (creature, 1), (serious, 1), (good, 1), (hungry, 1), (fon..."
3,Sally,169,328,she/her,"[(way, 4), (name, 4), (hair, 2), (lips, 2), (self, 2), (hand, 2), (grandfather, 2), (parents, 1)...","[(said, 27), (felt, 11), (had, 9), (asked, 6), (knew, 5), (saw, 4), (thought, 4), (supposed, 4),...","[(kissed, 2), (given, 2), (asking, 2), (see, 2), (told, 2), (meet, 1), (lending, 1), (mauled, 1)...","[(excited, 1), (reckless, 1), (absurd, 1), (alone, 1), (old, 1), (best, 1), (spiteful, 1), (fran..."
4,Sir William,173,314,he/him/his,"[(patients, 6), (wife, 3), (profession, 3), (eyes, 2), (head, 2), (pencil, 2), (advice, 2), (sen...","[(said, 22), (had, 5), (saw, 4), (see, 4), (lay, 4), (thought, 4), (looked, 3), (cried, 3), (mut...","[(tell, 1), (killed, 1), (fascinated, 1), (caught, 1), (prevent, 1), (calling, 1), (told, 1), (l...","[(free, 2), (certain, 2), (happy, 1), (young, 1), (ill, 1), (fit, 1), (man, 1), (master, 1), (ri..."
5,Lady Bruton,210,292,she/her,"[(friend, 3), (plate, 2), (room, 2), (sex, 2), (head, 2), (attention, 2), (soul, 2), (dress, 2),...","[(said, 11), (felt, 9), (had, 8), (asked, 6), (thought, 4), (let, 3), (used, 3), (come, 3), (saw...","[(see, 2), (telling, 2), (bringing, 1), (help, 1), (bothered, 1), (bethinking, 1), (sunk, 1), (w...","[(general, 1), (sure, 1), (asleep, 1), (right, 1), (woman, 1), (good, 1), (admirable, 1), (happy..."
6,Elizabeth,158,287,she/her,"[(mother, 8), (father, 5), (gloves, 3), (party, 2), (dinner, 2), (abstraction, 2), (eyes, 2), (h...","[(said, 10), (thought, 8), (had, 8), (go, 6), (went, 5), (felt, 5), (going, 4), (stood, 4), (loo...","[(told, 2), (ask, 2), (guided, 2), (compare, 2), (unwind, 1), (remembering, 1), (suited, 1), (ha...","[(dark, 1), (child, 1), (charming, 1), (interested, 1), (bored, 1), (delighted, 1), (right, 1), ..."
7,Hugh,159,253,he/him/his,"[(hat, 2), (legs, 2), (credit, 2), (name, 2), (hand, 2), (carnations, 2), (pen, 2), (waistcoat, ...","[(said, 7), (had, 7), (thought, 5), (going, 4), (do, 3), (hear, 2), (met, 2), (married, 2), (did...","[(told, 2), (requiring, 1), (coming, 1), (known, 1), (take, 1), (saw, 1), (read, 1), (produced, ...","[(unselfish, 2), (slow, 2), (intolerable, 1), (impossible, 1), (certain, 1), (specimen, 1), (muc..."
8,Rezia,196,221,she/her,"[(hands, 4), (life, 2), (face, 2), (fingers, 2), (head, 2), (mind, 2), (arms, 1), (cheek, 1), (s...","[(said, 28), (sat, 7), (thought, 6), (say, 5), (have, 4), (had, 4), (cried, 4), (asked, 3), (tol...","[(told, 2), (thought, 1), (asked, 1), (clutching, 1), (telling, 1), (rejoiced, 1), (struck, 1), ...","[(afraid, 1), (sewing, 1), (careful, 1), (tree, 1)]"
9,Richard,163,209,she/her,"[(hands, 3), (mind, 2), (arms, 1), (face, 1), (hand, 1), (eyes, 1), (parasol, 1), (own, 1), (dre...","[(said, 18), (had, 9), (came, 5), (thought, 4), (see, 3), (felt, 2), (went, 2), (asked, 2), (mad...","[(see, 2), (grown, 2), (driven, 1), (asking, 1), (kissed, 1), (quoting, 1), (preferred, 1), (ask...","[(torpid, 1), (happy, 1), (able, 1), (best, 1), (strict, 1), (prig, 1), (help, 1), (glad, 1)]"


# View Character Data

Let's view the most frequently mentioned characters as well as their referential gender, actions for which they are the agent and patient, objects they possess, and modifiers.

In [None]:
df

Unnamed: 0,Name,Character ID,Mentions,Gender,Possessives,Agent,Patient,Modifiers
0,Clarissa,144,1263,she/her,"[(party, 12), (dress, 10), (hand, 8), (voice, 7), (parties, 7), (life, 5), (eyes, 5), (hat, 4), ...","[(said, 61), (had, 34), (thought, 25), (felt, 23), (asked, 13), (knew, 13), (say, 12), (come, 11...","[(thought, 12), (loved, 7), (tell, 6), (asked, 5), (see, 5), (told, 4), (marry, 4), (left, 4), (...","[(positive, 3), (girl, 2), (cold, 2), (happy, 2), (pause, 1), (suspense, 1), (right, 1), (part, ..."
1,Peter,146,937,he/him/his,"[(knife, 12), (life, 7), (hand, 7), (age, 6), (eyes, 5), (name, 5), (boots, 3), (undoing, 3), (s...","[(thought, 75), (said, 61), (had, 17), (come, 9), (felt, 8), (made, 7), (remembered, 7), (knew, ...","[(asked, 6), (marry, 4), (see, 4), (married, 3), (called, 3), (left, 3), (meet, 3), (remember, 2...","[(happy, 2), (failure, 2), (man, 2), (sticks, 1), (old, 1), (older, 1), (adventurer, 1), (buccan..."
2,Miss Kilman,174,333,she/her,"[(mother, 4), (eyes, 3), (way, 3), (head, 2), (knowledge, 2), (body, 2), (fingers, 2), (hands, 2...","[(said, 14), (thought, 5), (had, 5), (sat, 5), (looked, 4), (go, 4), (stood, 3), (let, 3), (did,...","[(loved, 2), (ask, 2), (hated, 2), (starved, 1), (rasped, 1), (liking, 1), (trusting, 1), (lifte...","[(poor, 3), (able, 1), (dismissal, 1), (creature, 1), (serious, 1), (good, 1), (hungry, 1), (fon..."
3,Sally,169,328,she/her,"[(way, 4), (name, 4), (hair, 2), (lips, 2), (self, 2), (hand, 2), (grandfather, 2), (parents, 1)...","[(said, 27), (felt, 11), (had, 9), (asked, 6), (knew, 5), (saw, 4), (thought, 4), (supposed, 4),...","[(kissed, 2), (given, 2), (asking, 2), (see, 2), (told, 2), (meet, 1), (lending, 1), (mauled, 1)...","[(excited, 1), (reckless, 1), (absurd, 1), (alone, 1), (old, 1), (best, 1), (spiteful, 1), (fran..."
4,Sir William,173,314,he/him/his,"[(patients, 6), (wife, 3), (profession, 3), (eyes, 2), (head, 2), (pencil, 2), (advice, 2), (sen...","[(said, 22), (had, 5), (saw, 4), (see, 4), (lay, 4), (thought, 4), (looked, 3), (cried, 3), (mut...","[(tell, 1), (killed, 1), (fascinated, 1), (caught, 1), (prevent, 1), (calling, 1), (told, 1), (l...","[(free, 2), (certain, 2), (happy, 1), (young, 1), (ill, 1), (fit, 1), (man, 1), (master, 1), (ri..."
5,Lady Bruton,210,292,she/her,"[(friend, 3), (plate, 2), (room, 2), (sex, 2), (head, 2), (attention, 2), (soul, 2), (dress, 2),...","[(said, 11), (felt, 9), (had, 8), (asked, 6), (thought, 4), (let, 3), (used, 3), (come, 3), (saw...","[(see, 2), (telling, 2), (bringing, 1), (help, 1), (bothered, 1), (bethinking, 1), (sunk, 1), (w...","[(general, 1), (sure, 1), (asleep, 1), (right, 1), (woman, 1), (good, 1), (admirable, 1), (happy..."
6,Elizabeth,158,287,she/her,"[(mother, 8), (father, 5), (gloves, 3), (party, 2), (dinner, 2), (abstraction, 2), (eyes, 2), (h...","[(said, 10), (thought, 8), (had, 8), (go, 6), (went, 5), (felt, 5), (going, 4), (stood, 4), (loo...","[(told, 2), (ask, 2), (guided, 2), (compare, 2), (unwind, 1), (remembering, 1), (suited, 1), (ha...","[(dark, 1), (child, 1), (charming, 1), (interested, 1), (bored, 1), (delighted, 1), (right, 1), ..."
7,Hugh,159,253,he/him/his,"[(hat, 2), (legs, 2), (credit, 2), (name, 2), (hand, 2), (carnations, 2), (pen, 2), (waistcoat, ...","[(said, 7), (had, 7), (thought, 5), (going, 4), (do, 3), (hear, 2), (met, 2), (married, 2), (did...","[(told, 2), (requiring, 1), (coming, 1), (known, 1), (take, 1), (saw, 1), (read, 1), (produced, ...","[(unselfish, 2), (slow, 2), (intolerable, 1), (impossible, 1), (certain, 1), (specimen, 1), (muc..."
8,Rezia,196,221,she/her,"[(hands, 4), (life, 2), (face, 2), (fingers, 2), (head, 2), (mind, 2), (arms, 1), (cheek, 1), (s...","[(said, 28), (sat, 7), (thought, 6), (say, 5), (have, 4), (had, 4), (cried, 4), (asked, 3), (tol...","[(told, 2), (thought, 1), (asked, 1), (clutching, 1), (telling, 1), (rejoiced, 1), (struck, 1), ...","[(afraid, 1), (sewing, 1), (careful, 1), (tree, 1)]"
9,Richard,163,209,she/her,"[(hands, 3), (mind, 2), (arms, 1), (face, 1), (hand, 1), (eyes, 1), (parasol, 1), (own, 1), (dre...","[(said, 18), (had, 9), (came, 5), (thought, 4), (see, 3), (felt, 2), (went, 2), (asked, 2), (mad...","[(see, 2), (grown, 2), (driven, 1), (asking, 1), (kissed, 1), (quoting, 1), (preferred, 1), (ask...","[(torpid, 1), (happy, 1), (able, 1), (best, 1), (strict, 1), (prig, 1), (help, 1), (glad, 1)]"
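
The `Possessives`, `Agent`, `Patient`, and `Modifiers` columns hold lists of `(word, count)` pairs, which get cut off in the display. As a sketch, a small formatter (the name `format_top_terms` is our own) can turn each list into a short readable string:

```python
def format_top_terms(pairs, limit=5):
    """Turn [("said", 61), ("had", 34)] into "said (61), had (34)"."""
    return ", ".join(f"{term} ({count})" for term, count in pairs[:limit])


# First three Agent entries for Clarissa, copied from the table above
clarissa_agent = [("said", 61), ("had", 34), ("thought", 25)]
print(format_top_terms(clarissa_agent, limit=2))  # said (61), had (34)
```

Applied to the DataFrame, this would look like `df['Agent'].apply(format_top_terms)`.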


# Get Quotation Data

In [None]:
# Load quotation data
quote_df = pd.read_csv("my_book_dir/my_book.quotes", delimiter='\t')
quote_df['char_id'] = quote_df['char_id'].astype(str)
quote_df = pd.merge(df[['Character ID', 'Name']], quote_df, left_on = 'Character ID', right_on= 'char_id')
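
Note that `pd.merge` performs an inner join by default. Since `df` only keeps characters with at least one proper-name mention, any quotation attributed to a character ID that is missing from `df` is dropped. The toy example below (all IDs and quotes are made up) contrasts this with a left join that keeps every quotation:

```python
import pandas as pd

names = pd.DataFrame({"Character ID": ["144", "146"],
                      "Name": ["Clarissa", "Peter"]})
quotes = pd.DataFrame({
    "char_id": ["144", "146", "999"],  # "999" is a made-up ID with no name
    "quote": ["first quote", "second quote", "third quote"],
})

# Inner join (the default): the quote from "999" disappears
inner = pd.merge(names, quotes, left_on="Character ID", right_on="char_id")

# Left join from the quotes side: every quote survives, Name is NaN for "999"
left = pd.merge(quotes, names, how="left",
                left_on="char_id", right_on="Character ID")

print(len(inner), len(left))  # 2 3
```

Which join is preferable depends on whether unnamed speakers matter for your question.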

# View Quotation Data

Let's view the first 100 identified quotations.

In [None]:
quote_df.sort_values(by='quote_start')[:100]

Unnamed: 0,Character ID,Name,quote_start,quote_end,mention_start,mention_end,mention_phrase,char_id,quote
81,146,Peter,403,407,399,400,Peter Walsh,146,Musing among the vegetables?--was
82,146,Peter,409,413,417,417,He,146,"it?--""I prefer men to cauliflowers""--was"
239,159,Hugh,1255,1264,1266,1266,Hugh,159,"Good - morning to you , Clarissa !"
240,159,Hugh,1280,1287,1266,1266,Hugh,159,Where are you off to ?
337,141,Mrs. Dalloway,1288,1295,1297,1298,Mrs. Dalloway,141,"I love walking in London ,"
338,141,Mrs. Dalloway,1300,1311,1297,1298,Mrs. Dalloway,141,Really it 's better than walking in the country .
368,160,the Whitbreads,1344,1349,1341,1342,the Whitbreads,160,to see doctors .
0,144,Clarissa,3438,3443,3444,3444,she,144,"That is all ,"
1,144,Clarissa,3453,3458,3459,3459,she,144,"That is all ,"
185,173,Sir William,3524,3530,3520,3520,He,173,I have had enough .
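
Once quotations carry a `Name` column, counting how often each character speaks is a one-liner. The sketch below uses a made-up stand-in for `quote_df` so that it runs on its own:

```python
import pandas as pd

# Made-up stand-in for quote_df; only the Name column matters here
toy_quotes = pd.DataFrame({
    "Name": ["Peter", "Hugh", "Peter", "Clarissa", "Peter"],
    "quote": ["q1", "q2", "q3", "q4", "q5"],
})

# Number of attributed quotations per character, most frequent first
quote_counts = toy_quotes["Name"].value_counts()
print(quote_counts.idxmax(), quote_counts.max())  # Peter 3
```

On the real data, the equivalent call would be `quote_df['Name'].value_counts()`.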


# Get Named Entities

Read in named entities and view all named entities

In [None]:
entity_df = pd.read_csv("my_book_dir/my_book.entities", delimiter='\t')
entity_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,141,5,6,PROP,PER,Mrs. Dalloway
1,141,12,13,PROP,PER,Virginia Woolf
2,142,17,19,PROP,PER,Gutenberg of Australia
3,142,60,66,PROP,PER,Don Lainson dlainson@sympatico.ca Project Gutenberg of Australia
4,397,89,89,PRON,PER,We
...,...,...,...,...,...,...
11302,0,78336,78336,PRON,PER,me
11303,144,78343,78343,PROP,PER,Clarissa
11304,146,78345,78345,PRON,PER,he
11305,144,78350,78350,PRON,PER,she


Get a breakdown of all the named entity categories

In [None]:
entity_df['cat'].value_counts()

PER    10019
FAC      705
GPE      295
LOC      182
VEH      100
ORG        6
Name: cat, dtype: int64
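
These raw counts are easier to compare as percentages. The sketch below recomputes the shares from the numbers printed above. The short glosses for each tag are our own wording.

```python
# Counts copied from the value_counts() output above
category_counts = {"PER": 10019, "FAC": 705, "GPE": 295,
                   "LOC": 182, "VEH": 100, "ORG": 6}

# Readable labels; the wording of these glosses is ours, not BookNLP's
category_labels = {
    "PER": "people", "FAC": "facilities", "GPE": "geopolitical entities",
    "LOC": "locations", "VEH": "vehicles", "ORG": "organizations",
}

total = sum(category_counts.values())
shares = {cat: round(100 * n / total, 1) for cat, n in category_counts.items()}

for cat, share in shares.items():
    print(f"{cat} ({category_labels[cat]}): {share}%")
```

With pandas, `entity_df['cat'].value_counts(normalize=True)` gives the same proportions directly. People dominate partly because pronouns such as "she" and "he" are counted as `PER` mentions.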

# Get Locations

Let's view the commonly mentioned locations.

In [None]:
entity_filter = entity_df['cat'] == 'LOC'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})

Unnamed: 0,entity,text
0,the world,47
1,the sea,9
2,the earth,8
3,the whole world,6
4,the country,5
5,the lake,4
6,the desert,4
7,the river,4
8,the Strand,3
9,a river,2
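
The filter-and-count pattern above is repeated for every category in the sections that follow, so it can be wrapped in a function. This is a sketch with a name of our own choosing (`top_entities`). It also labels the columns explicitly, because the column names that `value_counts().reset_index()` produces differ between pandas versions.

```python
import pandas as pd


def top_entities(entity_df, category, n=50):
    """Count mention texts for one entity category (hypothetical helper).

    Returns a DataFrame with columns "entity" and "count",
    sorted from most to least frequent.
    """
    texts = entity_df.loc[entity_df["cat"] == category, "text"]
    return (
        texts.value_counts()
        .rename_axis("entity")        # name the index explicitly
        .reset_index(name="count")    # name the counts column explicitly
        .head(n)
    )


# Made-up rows with the same columns as the .entities file
toy = pd.DataFrame({
    "cat": ["LOC", "LOC", "GPE", "LOC"],
    "text": ["the sea", "the world", "London", "the world"],
})
print(top_entities(toy, "LOC"))
```

On the real data, `top_entities(entity_df, 'LOC')` would stand in for the two lines of the cell above.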


# Get People

Let's view the commonly mentioned people.

In [None]:
entity_filter = entity_df['cat'] == 'PER'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,her,1221
1,she,1152
2,he,908
3,his,489
4,him,419
5,She,372
6,they,283
7,He,269
8,Clarissa,242
9,them,161
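
Pronouns dominate this list, and capitalized forms such as "She" are counted separately from "she". Two small adjustments can help, sketched here on made-up rows: filtering on the `prop` column to keep only proper names (`PROP`), and lowercasing the text before counting.

```python
import pandas as pd

# Made-up rows with the same columns as the .entities file
toy_entities = pd.DataFrame({
    "prop": ["PRON", "PRON", "PROP", "PROP", "PROP"],
    "cat":  ["PER", "PER", "PER", "PER", "GPE"],
    "text": ["she", "She", "Clarissa", "Clarissa", "London"],
})

# Keep only people referred to by a proper name
is_named_person = (toy_entities["cat"] == "PER") & (toy_entities["prop"] == "PROP")
named_counts = toy_entities.loc[is_named_person, "text"].value_counts()

# Lowercasing merges sentence-initial forms such as "She" and "she"
folded_counts = toy_entities["text"].str.lower().value_counts()

print(named_counts.to_dict())  # {'Clarissa': 2}
```

The same two filters apply unchanged to `entity_df`.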


# Get Geopolitical Entities

Let's view the commonly mentioned geopolitical entities.

In [None]:
entity_filter = entity_df['cat'] == 'GPE'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,London,43
1,Bourton,25
2,India,25
3,England,16
4,Westminster,10
5,the country,8
6,Manchester,7
7,Surrey,5
8,Whitehall,5
9,Kensington,5


# Get Facilities

Let's view the commonly mentioned facilities.

In [None]:
entity_filter = entity_df['cat'] == 'FAC'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,the room,40
1,the street,24
2,home,18
3,upstairs,15
4,Regent 's Park,14
5,there,10
6,the terrace,9
7,Bond Street,9
8,the house,9
9,the hall,7


# Get Vehicles

Let's view the commonly mentioned vehicles.

In [None]:
entity_filter = entity_df['cat'] == 'VEH'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,the car,8
1,the motor car,7
2,motor cars,5
3,vans,5
4,the ambulance,4
5,an omnibus,4
6,the boat,4
7,the aeroplane,4
8,a train,3
9,The car,2
