# Banned Books BookNLP Analysis

**Summary:**
This notebook applies BookNLP to analyze banned books, extracting features and comparing literary elements across texts. It is focused on character, event, and stylistic analysis for banned literature.

This notebook analyzes banned books using BookNLP, focusing on extracting literary features, character and event data, and comparing results across the banned books corpus.
- **Dataset:** Banned books corpus (see Data/ for sources)
- **Methods:** BookNLP pipeline, text preprocessing, feature extraction, comparative analysis
- **Goal:** Explore literary patterns and metadata in banned books, and compare with other literary corpora.

# BookNLP of Banned Books

BookNLP sample data generated from our corpus of banned books is available in this GitHub repository: https://github.com/representationlab/booknlp_sample

A limited set of completely open data is available here: https://github.com/representationlab/booknlp_open

The following code pulls the BookNLP data for a specific book and lets you explore it. To view a different book, open that book's output files on GitHub and paste the 'raw' link for each file into the `wget` commands in the "Import BookNLP Output Files" section.
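
As a rough illustration of how those raw links are put together, the sketch below percent-encodes a book's file name and appends it to the repository's raw base URL. The `main` branch and the `.txt.<extension>` naming are taken from the `wget` cell in this notebook; `raw_url` is a hypothetical helper, and the sample repository may still require the `?token=...` suffix that GitHub appends to raw links.

```python
from urllib.parse import quote

# Base of the raw links used in this notebook's wget cell (branch "main").
RAW_BASE = "https://raw.githubusercontent.com/representationlab/booknlp_sample/main/"

def raw_url(book_stem, extension):
    """Hypothetical helper: build the raw URL for one BookNLP output file."""
    # quote() turns spaces into %20 and leaves letters, digits, "-" and "." alone.
    return RAW_BASE + quote(f"{book_stem}.txt.{extension}")

book = "10 Things I Can See From Here - Carrie Mac"
for ext in ("entities", "quotes", "book"):
    print(raw_url(book, ext))
```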

# Installations

To use a GPU in Google Colab, change the following:

`Runtime > Change runtime type > Hardware accelerator > GPU`

To execute this notebook, sign in to Google and then run all cells:

`Runtime > Run all`

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [None]:
%%capture
import pandas as pd
!pip install booknlp
!python -m spacy download en_core_web_sm
from booknlp.booknlp import BookNLP


In [None]:
model_params={
		"pipeline":"entity,quote,supersense,event,coref",
		"model":"big",
	}

booknlp=BookNLP("en", model_params)

{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'big'}
--- startup: 17.758 seconds ---


# Sync GitHub

In [None]:
GITHUB_PRIVATE_KEY = """-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
QyNTUxOQAAACB620hfG9cx3EDooSgaSqv+z5cgXK/1xLKFLNgbs6OyywAAAJioYSQoqGEk
KAAAAAtzc2gtZWQyNTUxOQAAACB620hfG9cx3EDooSgaSqv+z5cgXK/1xLKFLNgbs6Oyyw
AAAEA+xNxU0IHlE/j6KUUWQz9AAdWkHqOq31nhRBfdvRRJ63rbSF8b1zHcQOihKBpKq/7P
lyBcr/XEsoUs2Buzo7LLAAAAEmhhd2NAQWxleHMtTUJQLmxhbgECAw==
-----END OPENSSH PRIVATE KEY-----
"""

In [None]:
# set up Secrets in Colab - not working yet
#from google.colab import userdata
#GITHUB_PRIVATE_KEY = userdata.get('GITHUB_PRIVATE_KEY')
#GITHUB_PRIVATE_KEY

In [None]:
# Create the directory if it doesn't exist.
! mkdir -p /root/.ssh
# Write the key
with open("/root/.ssh/id_ed25519", "w") as f:
  f.write(GITHUB_PRIVATE_KEY)
# Add github.com to our known hosts
! ssh-keyscan -t ed25519 github.com >> ~/.ssh/known_hosts
# Restrict the key permissions, or else SSH will complain.
! chmod go-rwx /root/.ssh/id_ed25519

# github.com:22 SSH-2.0-babeld-f8b1fc6c
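
An alternative to pasting the key into a cell, sketched under assumptions: read it from an environment variable (the name `GITHUB_PRIVATE_KEY` is only a placeholder here) and create the key file with owner-only permissions from the start, rather than tightening them afterwards. `write_private_key` is a hypothetical helper, not part of Colab or GitHub tooling.

```python
import os
from pathlib import Path

def write_private_key(key_text, key_path):
    """Hypothetical helper: write an SSH key readable by the owner only."""
    key_path = Path(key_path)
    key_path.parent.mkdir(parents=True, exist_ok=True)
    # Create (or truncate) the file with mode 0600 so it is never group/world readable.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(key_text)
    # If the file already existed, the mode passed to os.open() is ignored, so set it again.
    os.chmod(key_path, 0o600)
    return key_path

# Example (not run here; assumes the variable has been set beforehand):
# write_private_key(os.environ["GITHUB_PRIVATE_KEY"], "/root/.ssh/id_ed25519")
```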


In [None]:
# Note the `git@github.com` syntax, which will fetch over SSH instead of HTTP.
!git clone git@github.com:representationlab/booknlp_sample.git

fatal: destination path 'booknlp_sample' already exists and is not an empty directory.
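
The `fatal: destination path ... already exists` message appears whenever the clone cell is re-run in the same session. A minimal sketch of a guard, assuming it is acceptable to reuse whatever is already in the directory; `clone_if_missing` is a hypothetical helper, and its `run` parameter exists only so the function can be exercised without network access.

```python
import subprocess
from pathlib import Path

def clone_if_missing(repo_url, dest, run=subprocess.run):
    """Hypothetical helper: clone only when dest is absent or empty."""
    dest = Path(dest)
    if dest.exists() and any(dest.iterdir()):
        # Something is already there; skip the clone instead of failing.
        return False
    run(["git", "clone", repo_url, str(dest)], check=True)
    return True

# Example (not run here):
# clone_if_missing("git@github.com:representationlab/booknlp_sample.git", "booknlp_sample")
```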


# Import BookNLP Output Files

In [None]:
import json
from collections import Counter

In [None]:
def proc(filename):
    with open(filename) as file:
        data=json.load(file)
    return data

In [None]:
%%capture
!wget https://raw.githubusercontent.com/representationlab/booknlp_sample/main/10%20Things%20I%20Can%20See%20From%20Here%20-%20Carrie%20Mac.txt.entities?token=GHSAT0AAAAAACJX2MB3HZ3BRL4RAQSWXAEOZKELAOQ
!wget https://raw.githubusercontent.com/representationlab/booknlp_sample/main/10%20Things%20I%20Can%20See%20From%20Here%20-%20Carrie%20Mac.txt.quotes?token=GHSAT0AAAAAACJX2MB2CYVTXNKJDBJBC4O6ZKEK5RA
!wget https://raw.githubusercontent.com/representationlab/booknlp_sample/main/10%20Things%20I%20Can%20See%20From%20Here%20-%20Carrie%20Mac.txt.book?token=GHSAT0AAAAAACJX2MB2DJ2X56ROMCLFURJEZKEK7QQ

In [None]:
data=proc("/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.book")

# Explore Character Data

In [None]:
import json

with open ("/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.book", "r") as f:
    book_data = json.load(f)
book_data.keys()

dict_keys(['characters'])

In [None]:
len(book_data["characters"])


549

In [None]:
book_data["characters"][0].keys()


dict_keys(['agent', 'patient', 'mod', 'poss', 'id', 'g', 'count', 'mentions'])

In [None]:
book_data["characters"][0]["agent"][:1]
book_data["characters"][0]["patient"][:1]
book_data["characters"][0]["mod"][:1]
book_data["characters"][0]["poss"][:1]
book_data["characters"][0]["id"]
book_data["characters"][0]["g"]
book_data["characters"][0]["count"]
book_data["characters"][0]["mentions"].keys()

dict_keys(['proper', 'common', 'pronoun'])

In [None]:
book_data["characters"][0]["mod"][:50]


[{'w': 'sick', 'i': 1466},
 {'w': 'person', 'i': 2540},
 {'w': 'late', 'i': 5766},
 {'w': 'surprised', 'i': 10045},
 {'w': 'sure', 'i': 10463},
 {'w': 'close', 'i': 11405},
 {'w': 'upset', 'i': 11451},
 {'w': 'little', 'i': 14951},
 {'w': 'person', 'i': 15660},
 {'w': 'person', 'i': 15681},
 {'w': 'younger', 'i': 15794},
 {'w': 'worried', 'i': 15873},
 {'w': 'sure', 'i': 15897},
 {'w': 'about', 'i': 17434},
 {'w': 'stupid', 'i': 17464},
 {'w': 'sick', 'i': 17697},
 {'w': 'adult', 'i': 17821},
 {'w': 'person', 'i': 17994},
 {'w': 'woman', 'i': 18821},
 {'w': 'girl', 'i': 18824},
 {'w': 'nervous', 'i': 19456},
 {'w': 'nervous', 'i': 19510},
 {'w': 'sure', 'i': 19662},
 {'w': 'disappointed', 'i': 20461},
 {'w': 'freak', 'i': 20551},
 {'w': 'right', 'i': 20999},
 {'w': 'little', 'i': 25903},
 {'w': 'ready', 'i': 26072},
 {'w': 'sure', 'i': 26263},
 {'w': 'about', 'i': 26950},
 {'w': 'shocked', 'i': 27497},
 {'w': 'mad', 'i': 28687},
 {'w': 'sad', 'i': 28698},
 {'w': 'door', 'i': 30413},
 {

In [None]:
book_data["characters"][0]["poss"][:20]


[{'w': 'Avoidance', 'i': 549},
 {'w': 'therapist', 'i': 957},
 {'w': 'knee', 'i': 1074},
 {'w': 'lap', 'i': 1112},
 {'w': 'ticket', 'i': 1140},
 {'w': 'horse', 'i': 1247},
 {'w': 'bus', 'i': 1272},
 {'w': 'nails', 'i': 1539},
 {'w': 'passengers', 'i': 1755},
 {'w': 'suitcase', 'i': 1901},
 {'w': 'mind', 'i': 2246},
 {'w': 'head', 'i': 2291},
 {'w': 'tracks', 'i': 2332},
 {'w': 'suitcase', 'i': 2337},
 {'w': 'heart', 'i': 2344},
 {'w': 'hands', 'i': 2858},
 {'w': 'knees', 'i': 2861},
 {'w': 'hands', 'i': 2890},
 {'w': 'chest', 'i': 2903},
 {'w': 'sketchbook', 'i': 3113}]

In [None]:
book_data["characters"][0]["id"]


0

In [None]:
book_data["characters"][0]["g"]


{'inference': {'he/him/his': 0.338,
  'she/her': 0.591,
  'they/them/their': 0.05,
  'xe/xem/xyr/xir': 0.008,
  'ze/zem/zir/hir': 0.013},
 'argmax': 'she/her',
 'max': 0.591,
 'total': 612.689}
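
To make the relationship between the fields in `g` concrete, the sketch below recomputes `argmax` and `max` from the `inference` distribution printed above. It relies only on the values shown in this notebook; how BookNLP derives `total` is not shown here, so it is left out.

```python
# Values copied from the output for character 0.
inference = {
    "he/him/his": 0.338,
    "she/her": 0.591,
    "they/them/their": 0.05,
    "xe/xem/xyr/xir": 0.008,
    "ze/zem/zir/hir": 0.013,
}

# The label with the highest probability, and that probability.
argmax = max(inference, key=inference.get)
print(argmax, inference[argmax])

# The five probabilities form a distribution over pronoun sets.
print(round(sum(inference.values()), 3))
```

With a maximum of 0.591, the label for character 0 is not a confident one, which may be worth keeping in mind before grouping characters by the `argmax` value alone.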

In [None]:
book_data["characters"][0]["count"]


3990

In [None]:
book_data["characters"][0]["mentions"].keys()


dict_keys(['proper', 'common', 'pronoun'])

In [None]:
book_data["characters"][0]["mentions"]["proper"]


[]

In [None]:
book_data["characters"][0]["mentions"]["common"]


[]

In [None]:
book_data["characters"][0]["mentions"]["pronoun"]


[{'c': 2440, 'n': 'I'},
 {'c': 560, 'n': 'my'},
 {'c': 478, 'n': 'me'},
 {'c': 218, 'n': 'you'},
 {'c': 79, 'n': 'My'},
 {'c': 77, 'n': 'You'},
 {'c': 54, 'n': 'your'},
 {'c': 39, 'n': 'myself'},
 {'c': 18, 'n': 'mine'},
 {'c': 10, 'n': 'Your'},
 {'c': 7, 'n': 'Me'},
 {'c': 2, 'n': 'Mine'},
 {'c': 2, 'n': 'yourself'},
 {'c': 2, 'n': 'yours'},
 {'c': 1, 'n': 'I’d'},
 {'c': 1, 'n': 'ME'},
 {'c': 1, 'n': 'j'},
 {'c': 1, 'n': 'YOUR'}]
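
Character 0 has no proper or common mentions, only pronouns, which is what one would expect of a first-person narrator. A small sketch of how one might pick a readable label for any character by falling back from proper names to common nouns to pronouns; `display_name` is a hypothetical helper and assumes, as in the output above, that each mention list is sorted by count in descending order.

```python
def display_name(character):
    """Hypothetical helper: most frequent mention, preferring proper names."""
    mentions = character["mentions"]
    for kind in ("proper", "common", "pronoun"):
        if mentions[kind]:
            # First entry is assumed to be the most frequent mention of this kind.
            return mentions[kind][0]["n"]
    # No mentions at all: fall back to the numeric id.
    return f"character_{character['id']}"

# Trimmed copy of character 0 from the output above.
narrator = {
    "id": 0,
    "mentions": {
        "proper": [],
        "common": [],
        "pronoun": [{"c": 2440, "n": "I"}, {"c": 560, "n": "my"}],
    },
}
print(display_name(narrator))
```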

# Explore Book Characters in detail

In [None]:
def get_counter_from_dependency_list(dep_list):
    counter=Counter()
    for token in dep_list:
        term=token["w"]
        tokenGlobalIndex=token["i"]
        counter[term]+=1
    return counter

In [None]:
for character in data["characters"]:

    agentList=character["agent"]
    patientList=character["patient"]
    possList=character["poss"]
    modList=character["mod"]

    character_id=character["id"]
    count=character["count"]

    # Default both values to "unknown" so a character without gender data
    # does not reuse the previous character's label.
    referential_gender_distribution=referential_gender="unknown"

    if character["g"] is not None and character["g"] != "unknown":
        referential_gender_distribution=character["g"]["inference"]
        referential_gender=character["g"]["argmax"]

    mentions=character["mentions"]
    proper_mentions=mentions["proper"]
    max_proper_mention=""

    # just print out information about named characters
    if len(mentions["proper"]) > 0:
        max_proper_mention=mentions["proper"][0]["n"]

        print(character_id, count, max_proper_mention, referential_gender)

        print()
        printTop=10
        for k, v in get_counter_from_dependency_list(possList).most_common(printTop):
            print("\tposs\t%s %s" % (v,k))
        print()
        for k, v in get_counter_from_dependency_list(agentList).most_common(printTop):
            print("\tagent\t%s %s" % (v,k))
        print()
        for k, v in get_counter_from_dependency_list(patientList).most_common(printTop):
            print("\tpatient\t%s %s" % (v,k))
        print()
        for k, v in get_counter_from_dependency_list(modList).most_common(printTop):
            print("\tmod\t%s %s" % (v,k))
        print()

152 1012 Salix she/her

	poss	19 violin
	poss	17 hand
	poss	8 eyes
	poss	7 hands
	poss	5 head
	poss	4 case
	poss	4 fingers
	poss	4 lips
	poss	3 feet
	poss	3 dad

	agent	57 said
	agent	25 know
	agent	25 took
	agent	15 pulled
	agent	15 put
	agent	15 going
	agent	14 ’m
	agent	12 looked
	agent	11 do
	agent	9 did

	patient	6 tell
	patient	5 texted
	patient	5 kissed
	patient	5 meet
	patient	4 Thank
	patient	3 see
	patient	3 texting
	patient	3 Tell
	patient	3 Show
	patient	3 told

	mod	2 person
	mod	2 nice
	mod	2 thing
	mod	1 polite
	mod	1 girl
	mod	1 cute
	mod	1 kind
	mod	1 upset
	mod	1 embarrassed
	mod	1 worried

97 921 Dad he/him/his

	poss	8 phone
	poss	8 face
	poss	6 head
	poss	6 eyes
	poss	6 arm
	poss	5 arms
	poss	5 hand
	poss	5 voice
	poss	5 shoes
	poss	4 truck

	agent	41 said
	agent	22 ’m
	agent	17 going
	agent	10 want
	agent	10 say
	agent	8 have
	agent	8 had
	agent	7 thought
	agent	7 ’re
	agent	6 looked

	patient	5 tell
	patient	4 called
	patient	4 told
	patient	3 call
	patient	3 sup

In [None]:
df_list = []
for character in data["characters"]:

    agentList = character["agent"]
    patientList = character["patient"]
    possList = character["poss"]
    modList = character["mod"]
    character_id = character["id"]
    count = character["count"]
    # Default both values to "unknown" so a character without gender data
    # does not reuse the previous character's label.
    referential_gender_distribution = referential_gender = "unknown"

    if character["g"] is not None and character["g"] != "unknown":
        referential_gender_distribution=character["g"]["inference"]
        referential_gender=character["g"]["argmax"]

    mentions=character["mentions"]
    proper_mentions=mentions["proper"]
    max_proper_mention=""

    # only keep named characters (those with at least one proper-name mention)
    if len(mentions["proper"]) > 0:
        max_proper_mention=mentions["proper"][0]["n"]

        df_list.append( {'Name':max_proper_mention , 'Character ID': character_id,
                         'Mentions': count,
                       'Gender': referential_gender,
                       'Possessives': get_counter_from_dependency_list(possList).most_common(10),
                       'Agent': get_counter_from_dependency_list(agentList).most_common(10),
                       'Patient': get_counter_from_dependency_list(patientList).most_common(10),
                       'Modifiers': get_counter_from_dependency_list(modList).most_common(10)}
        )
df = pd.DataFrame(df_list)
df['Character ID'] = df['Character ID'].astype(str)
df

Unnamed: 0,Name,Character ID,Mentions,Gender,Possessives,Agent,Patient,Modifiers
0,Salix,152,1012,she/her,"[(violin, 19), (hand, 17), (eyes, 8), (hands, ...","[(said, 57), (know, 25), (took, 25), (pulled, ...","[(tell, 6), (texted, 5), (kissed, 5), (meet, 5...","[(person, 2), (nice, 2), (thing, 2), (polite, ..."
1,Dad,97,921,he/him/his,"[(phone, 8), (face, 8), (head, 6), (eyes, 6), ...","[(said, 41), (’m, 22), (going, 17), (want, 10)...","[(tell, 5), (called, 4), (told, 4), (call, 3),...","[(dead, 3), (asshole, 3), (home, 2), (reason, ..."
2,Claire,100,859,she/her,"[(belly, 12), (head, 8), (voice, 5), (feet, 5)...","[(said, 55), (put, 15), (want, 13), (know, 13)...","[(tell, 10), (helped, 4), (told, 4), (drive, 3...","[(okay, 5), (mother, 2), (angry, 2), (sarcasti..."
3,Owen,106,359,he/him/his,"[(head, 6), (arms, 4), (hand, 4), (voice, 3), ...","[(said, 26), (want, 5), (sat, 5), (took, 5), (...","[(Thank, 2), (thank, 2), (set, 2), (pulled, 2)...","[(dead, 2), (blue, 1), (limp, 1), (smaller, 1)..."
4,Ruthie,90,345,she/her,"[(mouth, 5), (hands, 4), (bedroom, 2), (dad, 2...","[(said, 15), (talking, 5), (wanted, 4), (sat, ...","[(tell, 7), (told, 5), (wanted, 3), (trust, 2)...","[(interesting, 3), (smart, 2), (friend, 2), (a..."
...,...,...,...,...,...,...,...,...
57,Beethoven,170,2,he/him/his,"[(Concerto, 1)]",[],[],[]
58,Hope,175,2,she/her,"[(future, 1)]","[(died, 1)]",[],[]
59,Oscar Heidelman,180,2,he/him/his,[],[],[],[]
60,AJ,203,2,she/her,[],[],[],"[(blip, 1)]"
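
The list-of-tuples columns are convenient to read but awkward to filter or plot. One option, sketched here with plain Python on two rows abridged from the table above, is to flatten a column into one record per (character, word) pair; the resulting list of dictionaries can be passed straight to `pd.DataFrame`. `flatten_column` is a hypothetical helper.

```python
def flatten_column(records, column):
    """Hypothetical helper: one output row per (word, count) tuple in `column`."""
    rows = []
    for record in records:
        for word, count in record[column]:
            rows.append({"Name": record["Name"], "role": column,
                         "word": word, "count": count})
    return rows

# Two abridged rows; in the notebook, df.to_dict("records") yields this shape.
sample = [
    {"Name": "Salix", "Agent": [("said", 57), ("know", 25), ("took", 25)]},
    {"Name": "Dad", "Agent": [("said", 41), ("’m", 22)]},
]
for row in flatten_column(sample, "Agent"):
    print(row)
```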


In [None]:
# Load quotation data
quote_df = pd.read_csv("/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.quotes", delimiter='\t')
quote_df['char_id'] = quote_df['char_id'].astype(str)
quote_df = pd.merge(df[['Character ID', 'Name']], quote_df, left_on = 'Character ID', right_on= 'char_id')

In [None]:
quote_df.sort_values(by='quote_start')[:100]

Unnamed: 0,Character ID,Name,quote_start,quote_end,mention_start,mention_end,mention_phrase,char_id,quote
1546,95,Nancy,1059,1066,1067,1067,She,95,"“ Oh , Maeve , sweetheart . ”"
1547,95,Nancy,1076,1086,1070,1070,her,95,“ It will be okay . I know it . ”
1420,96,Mom,1155,1160,1161,1161,Mom,96,"“ I love you , ”"
1421,96,Mom,1164,1169,1161,1161,Mom,96,“ I love you . ”
1552,103,Disappointment,4160,4173,4115,4115,Disappointment,103,“ What if I stayed at Dan ’s place the whole t...
...,...,...,...,...,...,...,...,...,...
500,97,Dad,8525,8530,8520,8520,he,97,“ They ’re beautiful . ”
501,97,Dad,8607,8625,8582,8582,he,97,"“ When you come in the summer , to stay , your..."
1422,96,Mom,10880,10884,10855,10855,She,96,“ proper hello . ”
1458,101,Raymond,11115,11118,11119,11119,Raymond,101,“ Maeve ! ”
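
`pd.merge` performs an inner join by default, so quotes attributed to characters that are missing from `df` (which only holds characters with a proper-name mention) are silently dropped. If those quotes matter, a left join from the quotes table keeps them. A minimal sketch on made-up rows; the column names follow the notebook, the quote text does not.

```python
import pandas as pd

# Illustrative stand-ins for df[['Character ID', 'Name']] and quote_df.
names = pd.DataFrame({"Character ID": ["95", "96"], "Name": ["Nancy", "Mom"]})
quotes = pd.DataFrame({"char_id": ["95", "96", "0"], "quote": ["q1", "q2", "q3"]})

# Default (inner) join: the quote from character "0" disappears.
inner = pd.merge(names, quotes, left_on="Character ID", right_on="char_id")

# Left join from the quotes side: every quote is kept, unnamed speakers get a label.
left = pd.merge(quotes, names, how="left", left_on="char_id", right_on="Character ID")
left["Name"] = left["Name"].fillna("(unnamed)")

print(len(inner), len(left))
```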


# Entities Exploration

In [None]:
entity_df = pd.read_csv("/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.entities", delimiter='\t')
entity_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,86,2,3,PROP,PER,CARRIE MAC
1,234,25,26,NOM,PER,the author
2,235,37,38,NOM,PER,actual persons
3,86,57,58,PROP,PER,Carrie Mac
4,86,74,75,PROP,PER,Carrie Mac
...,...,...,...,...,...,...
15032,2814,94394,94394,PRON,PER,our
15033,2814,94405,94405,PRON,PER,We
15034,2815,94432,94432,PRON,PER,your
15035,2815,94437,94437,PRON,PER,your


In [None]:
entity_df['cat'].value_counts()

PER    13477
FAC      910
VEH      252
GPE      198
LOC      177
ORG       23
Name: cat, dtype: int64

# Get Locations

In [None]:
entity_filter = entity_df['cat'] == 'LOC'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})

Unnamed: 0,entity,text
0,the beach,15
1,the lake,14
2,the water,11
3,the world,9
4,Alice Lake,7
...,...,...
93,Nowhere near Gastown,1
94,the Fraser River,1
95,the edge of the park,1
96,the edge of the road,1
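
The next four sections repeat this filter-and-count pattern with a different category each time. A sketch of a small helper that wraps it; `top_entities` is a hypothetical name. It also names the output columns explicitly via `rename_axis` and `reset_index(name=...)`, because the column names produced by `value_counts().reset_index()` differ between pandas versions, so renaming `'index'` may have no effect on newer releases.

```python
import pandas as pd

def top_entities(entity_df, category, n=50):
    """Hypothetical helper: most frequent entity strings for one category."""
    counts = entity_df.loc[entity_df["cat"] == category, "text"].value_counts()
    # Name the index and the counts explicitly so the result does not depend
    # on the pandas version's default column names.
    return counts.rename_axis("entity").reset_index(name="count").head(n)

# Tiny illustrative frame with the same 'cat' and 'text' columns as entity_df.
demo = pd.DataFrame({
    "cat": ["LOC", "LOC", "LOC", "GPE", "VEH"],
    "text": ["the beach", "the beach", "the lake", "Haiti", "the van"],
})
print(top_entities(demo, "LOC"))
```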


# Get People

In [None]:
entity_filter = entity_df['cat'] == 'PER'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,I,3056
1,her,760
2,you,716
3,my,621
4,me,579
5,she,459
6,he,443
7,She,406
8,his,355
9,He,326


# Get Geopolitical entities

In [None]:
entity_filter = entity_df['cat'] == 'GPE'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,Haiti,27
1,Port Townsend,15
2,the city,12
3,Vancouver,11
4,Seattle,10
5,Gnomenville,8
6,Thailand,8
7,Juilliard,8
8,Continental,6
9,California,6


# Get Facilities

In [None]:
entity_filter = entity_df['cat'] == 'FAC'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,home,106
1,there,26
2,the street,26
3,the park,24
4,the hospital,23
5,here,19
6,the road,15
7,the stairs,15
8,the parking lot,15
9,upstairs,14


# Get Vehicles

In [None]:
entity_filter = entity_df['cat'] == 'VEH'
entity_df[entity_filter]['text'].value_counts().reset_index().rename(columns={'index':'entity'})[:50]

Unnamed: 0,entity,text
0,the van,32
1,the bus,18
2,the train,15
3,the car,13
4,the ferry,12
5,the truck,9
6,the wagon,9
7,the ambulance,8
8,a bus,7
9,the boat,5


# Analysis

For more, see: https://booknlp.pythonhumanities.com/04_character_analysis.html

In [None]:
#data=proc("x")


In [None]:
def create_character_data(data, printTop):
    character_data = {}
    for character in data["characters"]:

        agentList=character["agent"]
        patientList=character["patient"]
        possList=character["poss"]
        modList=character["mod"]

        character_id=character["id"]
        count=character["count"]

        # Default both values to "unknown" so a character without gender data
        # does not reuse the previous character's label.
        referential_gender_distribution=referential_gender="unknown"

        if character["g"] is not None and character["g"] != "unknown":
            referential_gender_distribution=character["g"]["inference"]
            referential_gender=character["g"]["argmax"]

        mentions=character["mentions"]
        proper_mentions=mentions["proper"]
        max_proper_mention=""

        #Let's create some empty lists that we can append to.
        poss_items = []
        agent_items = []
        patient_items = []
        mod_items = []

        # only keep named characters (those with at least one proper-name mention)
        if len(mentions["proper"]) > 0:
            max_proper_mention=mentions["proper"][0]["n"]
            for k, v in get_counter_from_dependency_list(possList).most_common(printTop):
                poss_items.append((v,k))

            for k, v in get_counter_from_dependency_list(agentList).most_common(printTop):
                agent_items.append((v,k))

            for k, v in get_counter_from_dependency_list(patientList).most_common(printTop):
                patient_items.append((v,k))

            for k, v in get_counter_from_dependency_list(modList).most_common(printTop):
                mod_items.append((v,k))




            # print(character_id, count, max_proper_mention, referential_gender)
            character_data[character_id] = {"id": character_id,
                                  "count": count,
                                  "max_proper_mention": max_proper_mention,
                                  "referential_gender": referential_gender,
                                  "possList": poss_items,
                                  "agentList": agent_items,
                                  "patientList": patient_items,
                                  "modList": mod_items
                                 }

    return character_data

In [None]:
character_data = create_character_data(data, 10)


In [None]:
print (character_data[98])


{'id': 98, 'count': 8, 'max_proper_mention': 'Li', 'referential_gender': 'he/him/his', 'possList': [(1, 'seat')], 'agentList': [(2, 'went'), (1, 'sat'), (1, 'pulled'), (1, 'stabbed'), (1, 'sawed'), (1, 'paraded'), (1, 'sliced')], 'patientList': [(1, 'noticed'), (1, 'done')], 'modList': []}


In [None]:
def find_verb_usage(character_data, analysis=("agent", "patient")):
    # Map each requested role to the matching key in character_data.
    role_keys = {"agent": "agentList", "patient": "patientList"}
    verb_usage = {"agent": {}, "patient": {}}
    for character_id, info in character_data.items():
        for role in analysis:
            if role not in role_keys:
                continue
            for _count, verb in info[role_keys[role]]:
                # Lowercase so that "Tell" and "tell" are grouped together.
                verb_usage[role].setdefault(verb.lower(), []).append(
                    (character_id, info["max_proper_mention"]))
    return verb_usage

In [None]:
# Pass the dictionary built by create_character_data, not the raw BookNLP data.
verb_data = find_verb_usage(character_data)
verb_data

{'agent': {'said': [(152, 'Salix'),
   (97, 'Dad'),
   (100, 'Claire'),
   (106, 'Owen'),
   (90, 'Ruthie'),
   (105, 'Corbin'),
   (112, 'Mrs. Patel'),
   (181, 'Mr. Heidelman'),
   (96, 'Mom'),
   (101, 'Raymond'),
   (111, 'Jessica'),
   (128, 'Grandma'),
   (104, 'Billy'),
   (102, 'Dan'),
   (95, 'Nancy'),
   (230, 'Gigi'),
   (107, 'Deena'),
   (160, 'Bandhu')],
  'know': [(152, 'Salix'),
   (100, 'Claire'),
   (106, 'Owen'),
   (140, 'Maeve'),
   (101, 'Raymond'),
   (128, 'Grandma'),
   (95, 'Nancy'),
   (107, 'Deena'),
   (195, 'Honey')],
  'took': [(152, 'Salix'),
   (106, 'Owen'),
   (181, 'Mr. Heidelman'),
   (230, 'Gigi'),
   (205, 'Brava')],
  'pulled': [(152, 'Salix'), (105, 'Corbin'), (98, 'Li')],
  'put': [(152, 'Salix'), (100, 'Claire'), (128, 'Grandma'), (230, 'Gigi')],
  'going': [(152, 'Salix'), (97, 'Dad'), (105, 'Corbin'), (96, 'Mom')],
  '’m': [(152, 'Salix'),
   (97, 'Dad'),
   (100, 'Claire'),
   (105, 'Corbin'),
   (181, 'Mr. Heidelman'),
   (124, 'the Wrens'

In [None]:
verb_data["agent"]["said"]


[(152, 'Salix'),
 (97, 'Dad'),
 (100, 'Claire'),
 (106, 'Owen'),
 (90, 'Ruthie'),
 (105, 'Corbin'),
 (112, 'Mrs. Patel'),
 (181, 'Mr. Heidelman'),
 (96, 'Mom'),
 (101, 'Raymond'),
 (111, 'Jessica'),
 (128, 'Grandma'),
 (104, 'Billy'),
 (102, 'Dan'),
 (95, 'Nancy'),
 (230, 'Gigi'),
 (107, 'Deena'),
 (160, 'Bandhu')]

In [None]:
verb_data["agent"]["looked"]


[(152, 'Salix'), (97, 'Dad'), (229, 'Pete'), (147, 'O’Ryan')]
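
Once verbs are indexed by character, the same structure can be turned around to compare characters. A sketch, using a hand-made dictionary in the shape `find_verb_usage` returns, of listing the agent verbs two characters have in common; `shared_verbs` is a hypothetical helper.

```python
def shared_verbs(role_index, name_a, name_b):
    """Hypothetical helper: verbs used by both named characters in one role."""
    shared = []
    for verb, users in role_index.items():
        names = {name for _character_id, name in users}
        if name_a in names and name_b in names:
            shared.append(verb)
    return sorted(shared)

# Abridged entries in the shape of verb_data["agent"].
agent_index = {
    "said": [(152, "Salix"), (97, "Dad")],
    "pulled": [(152, "Salix"), (98, "Li")],
    "looked": [(152, "Salix"), (97, "Dad"), (229, "Pete")],
}
print(shared_verbs(agent_index, "Salix", "Dad"))
```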

# Events Analysis

In [None]:
import pandas as pd
df = pd.read_csv("/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.tokens", delimiter="\t")
df



Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,fine_POS_tag,dependency_relation,syntactic_head_ID,event
0,0,0,0,0,ALSO,also,3,7,ADV,RB,advmod,1,O
1,0,0,1,1,BY,by,8,10,ADP,IN,ROOT,1,O
2,0,0,2,2,CARRIE,CARRIE,11,17,PROPN,NNP,compound,4,O
3,0,0,3,3,MAC,MAC,18,21,PROPN,NNP,compound,4,O
4,1,0,4,4,Wildfire,Wildfire,26,34,PROPN,NNP,pobj,1,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69719,3222,8037,14,94456,.,.,404536,404537,PUNCT,.,punct,94442,O
69720,3223,8038,0,94457,Sign,Sign,404542,404546,PROPN,NNP,ROOT,94457,O
69721,3223,8038,1,94458,up,up,404547,404549,ADP,RP,prt,94457,O
69722,3223,8038,2,94459,now,now,404550,404553,ADV,RB,advmod,94457,O
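
One detail in the output above is worth checking: the frame's last row index is 69722, while the last `token_ID_within_document` is 94459, so the frame holds fewer rows than the file has tokens. A possible cause, not confirmed for this file, is that pandas treats a bare `"` token as the start of a quoted field and absorbs the lines that follow into it. The sketch below reproduces that effect on a tiny in-memory table and shows `quoting=csv.QUOTE_NONE` as one way to switch quote handling off.

```python
import csv
import io

import pandas as pd

# Four tokens; two of them are a bare double-quote character.
tsv = 'word\tlemma\n"\tquote\nsaid\tsay\n"\tquote\nend\tend\n'

# Default settings: the first '"' opens a quoted field that runs to the next '"'.
default = pd.read_csv(io.StringIO(tsv), delimiter="\t")

# QUOTE_NONE: quote characters are treated as ordinary text, one row per line.
literal = pd.read_csv(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)

print(len(default), len(literal))
```

If the row counts differ in the same way for the real `.tokens` file, re-reading it with `quoting=csv.QUOTE_NONE` would be a reasonable thing to try.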


In [None]:
events = df[~df['event'].isnull()]
events



Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,fine_POS_tag,dependency_relation,syntactic_head_ID,event
0,0,0,0,0,ALSO,also,3,7,ADV,RB,advmod,1,O
1,0,0,1,1,BY,by,8,10,ADP,IN,ROOT,1,O
2,0,0,2,2,CARRIE,CARRIE,11,17,PROPN,NNP,compound,4,O
3,0,0,3,3,MAC,MAC,18,21,PROPN,NNP,compound,4,O
4,1,0,4,4,Wildfire,Wildfire,26,34,PROPN,NNP,pobj,1,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69719,3222,8037,14,94456,.,.,404536,404537,PUNCT,.,punct,94442,O
69720,3223,8038,0,94457,Sign,Sign,404542,404546,PROPN,NNP,ROOT,94457,O
69721,3223,8038,1,94458,up,up,404547,404549,ADP,RP,prt,94457,O
69722,3223,8038,2,94459,now,now,404550,404553,ADV,RB,advmod,94457,O


In [None]:
event_options = set(events.event.tolist())
print (event_options)

{'EVENT', 'O'}


In [None]:
real_events = events.loc[events["event"] == "EVENT"]
real_events

Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,fine_POS_tag,dependency_relation,syntactic_head_ID,event
753,107,47,5,753,took,take,4678,4682,VERB,VBD,ccomp,759,EVENT
759,107,47,11,759,threw,throw,4702,4707,VERB,VBD,ROOT,759,EVENT
781,107,48,12,781,dragged,drag,4812,4819,VERB,VBD,advcl,777,EVENT
808,107,49,5,808,researched,research,4950,4960,VERB,VBD,advcl,805,EVENT
836,107,55,1,836,read,read,5081,5085,VERB,VBD,ROOT,836,EVENT
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69582,3211,8021,1,94301,grab,grab,403790,403794,VERB,VBP,ROOT,94301,EVENT
69588,3211,8023,1,94312,hand,hand,403823,403827,VERB,VBP,ROOT,94312,EVENT
69595,3212,8026,1,94332,leave,leave,403892,403897,VERB,VBP,ROOT,94332,EVENT
69652,3212,8031,1,94389,smoke,smoke,404131,404136,NOUN,NN,nsubj,94390,EVENT


In [None]:
event_words = set(real_events.word.tolist())
len(event_words)

1294

In [None]:
event_lemmas = list(set(real_events.lemma.tolist()))
event_lemmas.sort()
len(event_lemmas)
print (event_lemmas[:10])

['Breathless', 'Drinking', 'Exhausted', 'Finished', 'Honking', 'Kissing', 'Rang', 'accident', 'accuse', 'ache']


In [None]:
final_lemmas = []
for lemma in event_lemmas:
    lemma = lemma.lower()
    if lemma not in final_lemmas:
        final_lemmas.append(lemma)

print(len(final_lemmas))
print(final_lemmas[:10])

918
['breathless', 'drinking', 'exhausted', 'finished', 'honking', 'kissing', 'rang', 'accident', 'accuse', 'ache']
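
The loop above checks each lemma against a growing list, which gets slower as the list grows. An equivalent that keeps first-seen order, sketched on the ten lemmas printed earlier plus one made-up duplicate (`'Ache'`, added only to show the de-duplication), uses `dict.fromkeys`:

```python
# First ten lemmas from the output above; 'Ache' is a made-up extra entry.
event_lemmas_sample = ['Breathless', 'Drinking', 'Exhausted', 'Finished', 'Honking',
                       'Kissing', 'Rang', 'accident', 'accuse', 'ache', 'Ache']

# dict keys are unique and keep insertion order, so this lowercases and
# de-duplicates in a single pass.
final_lemmas_sample = list(dict.fromkeys(lemma.lower() for lemma in event_lemmas_sample))
print(final_lemmas_sample)
```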


In [None]:
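sentences = real_events.sentence_ID.tolist()
# Use a separate name so the `events` DataFrame defined earlier is not overwritten.
event_word_list = real_events.word.tolist()
print (sentences[:10])
print (event_word_list[:10])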

[47, 47, 48, 49, 55, 63, 64, 67, 69, 69]
['took', 'threw', 'dragged', 'researched', 'read', 'talked', 'told', 'stumped', 'took', 'admit']


In [None]:
df




Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,fine_POS_tag,dependency_relation,syntactic_head_ID,event
0,0,0,0,0,ALSO,also,3,7,ADV,RB,advmod,1,O
1,0,0,1,1,BY,by,8,10,ADP,IN,ROOT,1,O
2,0,0,2,2,CARRIE,CARRIE,11,17,PROPN,NNP,compound,4,O
3,0,0,3,3,MAC,MAC,18,21,PROPN,NNP,compound,4,O
4,1,0,4,4,Wildfire,Wildfire,26,34,PROPN,NNP,pobj,1,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69719,3222,8037,14,94456,.,.,404536,404537,PUNCT,.,punct,94442,O
69720,3223,8038,0,94457,Sign,Sign,404542,404546,PROPN,NNP,ROOT,94457,O
69721,3223,8038,1,94458,up,up,404547,404549,ADP,RP,prt,94457,O
69722,3223,8038,2,94459,now,now,404550,404553,ADV,RB,advmod,94457,O


In [None]:
sentence1 = sentences[0]
result = df[df["sentence_ID"] == int(sentence1)]
result

Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,fine_POS_tag,dependency_relation,syntactic_head_ID,event
748,107,47,0,748,But,but,4658,4661,CCONJ,CC,cc,753,O
749,107,47,1,749,the,the,4662,4665,DET,DT,det,751,O
750,107,47,2,750,last,last,4666,4670,ADJ,JJ,amod,751,O
751,107,47,3,751,time,time,4671,4675,NOUN,NN,npadvmod,753,O
752,107,47,4,752,I,I,4676,4677,PRON,PRP,nsubj,753,O
753,107,47,5,753,took,take,4678,4682,VERB,VBD,ccomp,759,EVENT
754,107,47,6,754,the,the,4683,4686,DET,DT,det,755,O
755,107,47,7,755,train,train,4687,4692,NOUN,NN,dobj,753,O
756,107,47,8,756,",",",",4692,4693,PUNCT,",",punct,759,O
757,107,47,9,757,a,a,4694,4695,DET,DT,det,758,O


In [None]:
words = result.word.tolist()
resentence = " ".join(words)

In [None]:
print (resentence)

But the last time I took the train , a woman threw herself in front of it just outside Everett .
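
Joining tokens with spaces leaves gaps before punctuation, as in `train ,` and `Everett .` above. A rough cleanup sketch with a single regular expression; it only handles common closing punctuation and is not a full detokenizer (quotation marks and contractions such as `’m` are left untouched). `tidy_spacing` is a hypothetical helper.

```python
import re

def tidy_spacing(text):
    """Hypothetical helper: remove whitespace before , . ; : ! ?"""
    return re.sub(r"\s+([,.;:!?])", r"\1", text)

resentence_sample = ("But the last time I took the train , a woman threw herself "
                     "in front of it just outside Everett .")
print(tidy_spacing(resentence_sample))
```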


In [None]:
# wget only downloads URLs, so pointing it at a local path fails with "Scheme missing."
# The .tokens file is already on disk; list it to confirm it is there.
!ls -lh "/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.tokens"


In [None]:
def grab_event_sentences(file):
    # keep_default_na=False keeps tokens such as "null" or "NA" as strings.
    # Parsed as NaN they become floats, and " ".join() then raises a TypeError,
    # which is the likely cause of the error this cell originally produced.
    df = pd.read_csv(file, delimiter="\t", keep_default_na=False)
    # Rebuild each sentence once instead of re-filtering the frame for every event.
    sentence_text = df.groupby("sentence_ID")["word"].agg(
        lambda words: " ".join(map(str, words)))
    real_events = df.loc[df["event"] == "EVENT"]
    final_sentences = []
    for sentence_id, word, lemma in zip(real_events.sentence_ID,
                                        real_events.word,
                                        real_events.lemma):
        final_sentences.append({"event_word": word,
                                "event_lemma": lemma,
                                "sentence": sentence_text[sentence_id]
                               })
    return final_sentences

In [None]:
event_data = grab_event_sentences("/content/booknlp_sample/10 Things I Can See From Here - Carrie Mac.txt.tokens")

In [None]:
print (event_data[0])

In [None]:
new_df = pd.DataFrame(event_data)
new_df


In [None]:
new_df.to_csv("10ThingsCarrieMac.events", index=False)
