# Intro to Bookworm
For the sake of coherence with what I've outlined in the README, I'm going to start with Infinite Jest.

Go and have a quick look at the stripped-down bookworm code in [utils.py](./utils.py). The first thing we're going to do is load in all of those functions. I'll also explain how each function works along the way.

In [2]:
import utils
from utils import *

Next thing to do is load in the book and the characters. These operations are both pretty simple. The book is loaded in as one long string from a `.txt` file. Character lists are stored in a `.csv`, with all potential names for a character stored on each row. They're loaded in as tuples of names in a list of characters.  
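
To make the character format concrete, here's a rough sketch of how a `.csv` of names could be turned into a list of tuples. `load_characters_sketch()` is a made-up name and this isn't the code in `utils.py`; the lowercasing and the trailing space on each name are assumptions based on how the tuples look when they're printed later in this notebook.

```python
import csv
import io


def load_characters_sketch(csv_file):
    """Read one character per row; every cell in the row is an alias."""
    characters = []
    for row in csv.reader(csv_file):
        # assumption: lowercase each alias and pad it with a trailing space,
        # mirroring the look of the tuples printed later in this notebook
        names = tuple(name.strip().lower() + ' ' for name in row if name.strip())
        if names:
            characters.append(names)
    return characters


# a tiny in-memory stand-in for a characters .csv file
example_csv = io.StringIO("Hal\nGately,Don\nJoelle,Van Dyne,Lucille\n")
example_characters = load_characters_sketch(example_csv)
print(example_characters)
```

Each row becomes one tuple, so a character with three aliases is still a single entry in the list.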

Then we split the book down into sections. Bookworm works by looking for _cooccurrence_ of characters in these sections of the text as a proxy for their connectedness. It's a very simple trick which works stupidly well.  
There are a few ways we can break down the book into sections:
- `get_sentence_sequences()` uses [NLTK](http://www.nltk.org/)'s standard `.tokenize()` function to split the book into sentences.  
- `get_word_sequences()` uses [NLTK](http://www.nltk.org/)'s `word_tokenize()` to split the book into words, of which it will then select ordered lists of length `n` (default 40).  
- `get_character_sequences()` uses Python builtins to split it into substrings of length `n` (default 200).  

Fundamentally, they all return a list of strings which each cover a very small section of the novel. For simplicity's sake we're going to use the sentence-wise splitter.
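
If you'd like a feel for the fixed-length splitters without opening `utils.py`, here's a minimal sketch of the word-wise and character-wise ideas. The function names are hypothetical, `str.split()` stands in for NLTK's `word_tokenize()`, and I'm assuming the windows don't overlap, which may not match the real implementation.

```python
def character_sequences_sketch(text, n=200):
    """Cut the text into consecutive, non-overlapping substrings of length n."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def word_sequences_sketch(text, n=40):
    """Group consecutive runs of n words and join each run back into a string."""
    words = text.split()  # assumption: a crude stand-in for word_tokenize()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


toy_text = 'hal sat down . mario smiled warmly . schtitt nodded slowly .'
char_chunks = character_sequences_sketch(toy_text, n=20)
word_chunks = word_sequences_sketch(toy_text, n=4)

print(char_chunks)
print(word_chunks)
```

Either way the result has the same shape as the sentence splitter's output: a plain list of short strings.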

In [2]:
book = load_book('../data/raw/ij.txt')
characters = load_characters('../data/raw/characters_ij.csv')

sequences = get_sentence_sequences(book)

Now comes the fun bit. We've assembled our cast, and moved the text that they inhabit into a nice, machine-interpretable format.  
What we want to generate now is the blank table below which describes the presence of a character in a sentence. At this point, Bookworm hasn't really 'read' any of the text so all of the interactions between characters and sentences (where each cell in the table represents an interaction) are set to 0:

|            | character 1 | character 2 | character 3 |
|------------|-------------|-------------|-------------|
| sentence 1 | 0           | 0           | 0           |
| sentence 2 | 0           | 0           | 0           |
| sentence 3 | 0           | 0           | 0           |
| sentence 4 | 0           | 0           | 0           |


However, tuples of names aren't that nicely placed into tables as column headings, and entire sentences (especially those in Infinite Jest) are inappropriately long to be used as row indexes. To get around this aesthetic lumpiness, we can set up a hash table for each of the lists, allowing us to move backwards and forwards quickly and easily between the tuple/text themselves and their numeric fingerprints.  
Note: we'll be using pandas to build the table above, so hashing is already being done automatically under the hood. Doing it explicitly is a purely aesthetic choice.  

In [3]:
hash_to_sequence, sequence_to_hash = get_sequence_hashes(sequences)
hash_to_character, character_to_hash = get_character_hashes(characters)

Now we've got:

|                  | hash(character 1) | hash(character 2) | hash(character 3) |
|------------------|-------------------|-------------------|-------------------|
| hash(sentence 1) | 0                 | 0                 | 0                 |
| hash(sentence 2) | 0                 | 0                 | 0                 |
| hash(sentence 3) | 0                 | 0                 | 0                 |
| hash(sentence 4) | 0                 | 0                 | 0                 |

| hash              | character   |
|-------------------|-------------|
| hash(character 1) | character 1 |
| hash(character 2) | character 2 |
| hash(character 3) | character 3 |

| hash             | sentence   |
|------------------|------------|
| hash(sentence 1) | sentence 1 |
| hash(sentence 2) | sentence 2 |
| hash(sentence 3) | sentence 3 |
| hash(sentence 4) | sentence 4 |

We have the hash tables in reverse too, just in case they become useful later on.
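
A pair of lookup tables like these can be sketched with two dictionary comprehensions. `get_hashes_sketch()` is a hypothetical helper built on Python's builtin `hash()`, which is what I'd guess sits underneath, not a copy of the functions in `utils.py`.

```python
def get_hashes_sketch(items):
    """Build lookups in both directions between items and their hashes."""
    hash_to_item = {hash(item): item for item in items}
    item_to_hash = {item: hash(item) for item in items}
    return hash_to_item, item_to_hash


toy_characters = [('hal ',), ('gately ', 'don ')]
hash_to_toy, toy_to_hash = get_hashes_sketch(toy_characters)

# round trip: tuple -> hash -> tuple
fingerprint = toy_to_hash[('gately ', 'don ')]
print(hash_to_toy[fingerprint])
```

One thing to watch: Python randomises string hashes per process unless `PYTHONHASHSEED` is set, so these fingerprints are only stable within a single session.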

The first bit of the `find_connections()` function sets up the blank table above. 

In [4]:
df = find_connections(sequences, characters)

Next, it iterates through the list of sentences it has been fed, checking for an instance of each character. If it finds a character in the sentence, it marks their presence with a 1.  
So if **character 1** appears with **character 2** in sentence 1, and with **character 3** in sentence 2, we would see the following, with the rest of the cells remaining 0:

|                  | hash(character 1) | hash(character 2) | hash(character 3) |
|------------------|-------------------|-------------------|-------------------|
| hash(sentence 1) | 1                 | 1                 | 0                 |
| hash(sentence 2) | 1                 | 0                 | 1                 |
| hash(sentence 3) | 0                 | 0                 | 0                 |
| hash(sentence 4) | 0                 | 0                 | 0                 |
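
Here's a toy version of that marking step in plain Python. It assumes a character counts as present when any of their names appears as a substring of the sentence; `presence_table_sketch()` and the example names are made up for illustration, and the real `find_connections()` builds a pandas table instead of nested lists.

```python
def presence_table_sketch(sequences, characters):
    """Return one row per sequence and one 0/1 column per character."""
    table = []
    for sequence in sequences:
        # assumption: a character is present if any alias is a substring
        row = [int(any(name in sequence for name in names))
               for names in characters]
        table.append(row)
    return table


toy_characters = [('alice ',), ('bob ', 'robert '), ('carol ',)]
toy_sequences = [
    'alice waved at bob .',
    'carol wrote to alice again .',
    'nobody was home .',
    'it rained all day .',
]
presence = presence_table_sketch(toy_sequences, toy_characters)
print(presence)
```

The four rows reproduce the pattern in the table above: two sentences with a pair of characters each, and two with nobody in them.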

The next stage is the enumeration of cooccurrence. We can compute this very quickly by taking the `numpy` dot product of the table's transpose with the table itself, which gives one row and one column per character.

In [5]:
cooccurence = calculate_cooccurence(df)

`calculate_cooccurence()` does this computation and then wipes out any interaction of a character with themselves. For the table above, this would give us:

|                   | hash(character 1) | hash(character 2) | hash(character 3) |
|-------------------|-------------------|-------------------|-------------------|
| hash(character 1) | 0                 | 1                 | 1                 |
| hash(character 2) | 1                 | 0                 | 0                 |
| hash(character 3) | 1                 | 0                 | 0                 |

showing that **character 1** has interacted with **character 2** and **character 3**, but **character 2** and **character 3** haven't interacted. Note the symmetry across the diagonal...
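
The same arithmetic can be checked by hand with `numpy` on the toy table. This is just a sketch of the idea using a bare array; `calculate_cooccurence()` itself works on the pandas table and may differ in its details.

```python
import numpy as np

# rows are sentences, columns are characters (the toy table from above)
presence = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
])

# (characters x sentences) . (sentences x characters) = characters x characters
counts = presence.T.dot(presence)

# the diagonal counts each character's own appearances, so wipe it out
np.fill_diagonal(counts, 0)
print(counts)
```

Before the diagonal is cleared it holds each character's total number of appearances (2, 1 and 1 here), which is why it has to be zeroed before ranking relationships.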

Of course, the example table here is minuscule in comparison to the dozens of characters who might turn up in a reasonably sized novel, and the thousands of opportunities they have to interact. The cooccurrence matrix in reality is likely to contain much larger numbers between characters who regularly appear in the same sentences. Unless we're working with a _really_ tiny, incestuous network, this cooccurrence matrix is also probably going to be pretty sparse. For that reason it'll often make sense to store it as a sparse matrix...

In [6]:
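# note: DataFrame.to_sparse() was removed in pandas 1.0; on newer versions, cooccurence.astype(pd.SparseDtype(int, 0)) is the closest equivalent
cooccurence = cooccurence.to_sparse()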

Hopefully you now have a decent understanding of what Bookworm does, and how.

We can now show some results! Despite describing a set of tiny matrices above, we've really been computing all of Infinite Jest's massiveness while working through the notebook.

We can print the strongest relationships for a chosen character using the function below:

In [7]:
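def print_five_closest(character):
    """Print a character's names, then their five strongest connections."""
    # header: the character tuple framed by dashes of matching width
    print('-'*len(str(character))
          + '\n' + str(character) + '\n'
          + '-'*len(str(character)))
    
    # take this character's column, rank it by cooccurrence count
    # and keep the hashes of the top five
    top_five = (cooccurence[hash(character)]
                .sort_values(ascending=False)
                .index.values
                [:5])
    
    # translate each hash back into a readable tuple of names
    for hashed_name in top_five:
        print(hash_to_character[hashed_name])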

Applying this to 5 characters at random:

In [8]:
import random

for i in range(5):
    # random.choice() can never index past the end of the list
    print_five_closest(random.choice(characters))
    print()

-------------
('schtitt ',)
-------------
('delint ',)
('mario ',)
('hal ',)
('charles tavis ', 'tavis ')
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')

-------------------
('calvin thrust ',)
-------------------
('gately ', 'don ')
('randy ', 'lenz ')
('tiny ewell ', 'ewell ')
('burt f. smith ', 'burt ')
('ferocious francis ', 'francis ')

-------------------
('gretchen holt ',)
-------------------
('mario ',)
('jolene criess ',)
('prissburger ',)
('charles tavis ', 'tavis ')
('tiny ewell ', 'ewell ')

-----------------------------
('sharyn vaught ', 'sharyn ')
-----------------------------
('caryn vaught ', 'caryn ')
('schtitt ',)
('prissburger ',)
('lateral ', 'alice moore ')
('audern tallat-kelpsa ',)

----------------
('rik dunkel ',)
----------------
('charles tavis ', 'tavis ')
('ortho "the darkness" stice ', 'stice ', 'ortho ', 'the darkness ')
('delint ',)
('hal ',)
('barry loach ',)



Those all seem to make sense... Let's try with a few characters who we know about in more detail.

In [9]:
print_five_closest(('the moms ', 'avril ', 'mondragon '))

-------------------------------------
('the moms ', 'avril ', 'mondragon ')
-------------------------------------
('hal ',)
('orin ',)
('mario ',)
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')
('joelle ', 'van dyne ', 'lucille ')


In [10]:
print_five_closest(('joelle ', 'van dyne ', 'lucille '))

------------------------------------
('joelle ', 'van dyne ', 'lucille ')
------------------------------------
('orin ',)
('gately ', 'don ')
('the moms ', 'avril ', 'mondragon ')
('erdedy ',)
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')


In [11]:
print_five_closest(('pemulis ',))

-------------
('pemulis ',)
-------------
('hal ',)
('trevor "axhandle" axford ', 'axford ', 'axhandle ')
('jim troeltsch ', 'troeltsch ')
('james struck ', 'struck ')
('ted schacht ', 'schacht ')


In [12]:
print_five_closest(('bruce green ',))

-----------------
('bruce green ',)
-----------------
('randy ', 'lenz ')
('gately ', 'don ')
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')
('kate gompert ', 'gompert ')
('tommy doocey ',)


Yep...  Go and look at the diagram in the README and compare the results.

# Same code, different book
Let's run the whole thing for an entirely different book and see whether we get similarly positive results. This time it's Harry Potter and the Philosopher's Stone, chosen because you're more likely to have some contextual knowledge of who's who and what's what in that book.

In [3]:
book = load_book('../data/raw/hp.txt')
characters = load_characters('../data/raw/characters_hp.csv')
sequences = get_sentence_sequences(book)

hash_to_sequence, sequence_to_hash = get_sequence_hashes(sequences)
hash_to_character, character_to_hash = get_character_hashes(characters)

df = find_connections(sequences, characters)
cooccurence = calculate_cooccurence(df).to_sparse()

In [4]:
characters

[('vernon ', ' dursley '),
 ('petunia ', ' dursley '),
 ('dudley ', ' duddy '),
 ('lily ',),
 ('james ',),
 ('harry ', ' potter '),
 ('voldemort ', ' lord ', ' you-know-who '),
 ('jim mcguffin ',),
 ('secretary ',),
 ('dumbledore ', ' albus '),
 ('mcgonagall ', ' minerva '),
 ('diggle ',),
 ('pomfrey ', ' madam '),
 ('hagrid ', ' rubeus '),
 ('sirius ',),
 ('marge ', ' aunt '),
 ('figg ',),
 ('tibbles ',),
 ('snowy ',),
 ('mr paws ',),
 ('tufty ',),
 ('yvonne ',),
 ('piers polkiss ',),
 ('a boa constrictor ',),
 ('dennis ',),
 ('malcolm ',),
 ('gordon ',),
 ('merlin ',),
 ('mr evans ',),
 ('mrs evans ',),
 ('cornelius ', ' fudge '),
 ('miranda goshawk ',),
 ('bathilda ', ' bagshot '),
 ('adalbert waffling ',),
 ('emeric switch ',),
 ('phyllida spore ',),
 ('arsenus jigger ',),
 ('newton scamander ',),
 ('quentin trimble ',),
 ('tom ',),
 ('doris crockford ',),
 ('quirrell ',),
 ('griphook ',),
 ('madam malkin ',),
 ('draco ', ' malfoy '),
 ('lucius ', ' mr. malfoy '),
 ('narcissa ', ' 

In [14]:
print_five_closest(('harry ', ' potter '))

----------------------
('harry ', ' potter ')
----------------------
('ron ', ' weasley ')
('hermione ', ' granger ')
('hagrid ', ' rubeus ')
('snape ', ' severus ')
('dudley ', ' duddy ')


In [15]:
print_five_closest(('voldemort ', ' lord ', ' you-know-who '))

------------------------------------------
('voldemort ', ' lord ', ' you-know-who ')
------------------------------------------
('harry ', ' potter ')
('quirrell ',)
('snape ', ' severus ')
('dumbledore ', ' albus ')
('ron ', ' weasley ')


In [16]:
print_five_closest(('crabbe ',))

------------
('crabbe ',)
------------
('goyle ',)
('draco ', ' malfoy ')
('harry ', ' potter ')
('neville ', ' longbottom ')
('fred ',)


In [17]:
print_five_closest(('fred ',))

----------
('fred ',)
----------
('george ',)
('ron ', ' weasley ')
('harry ', ' potter ')
('adrian pucey ',)
('katie bell ',)


Hopefully that's enough proof that this thing works well. Go and look at the [next notebook](./Visualising%20networks.ipynb), which covers a bit more network analysis and some ways in which we can visualise the network.