# NLP with spaCy

spaCy is a modern natural language processing library for Python: https://spacy.io/

We follow the material in the official GitHub repository: https://github.com/explosion/spacy-notebooks

First, install the spaCy library and the language-specific resources, if necessary:

In [1]:
#!pip3 install spacy
#!python3 -m spacy download it # "it" is the Italian language code; use "en" for English

Then import the library and load the language resources in Python (this might take a while):

In [2]:
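import spacy
# 'en' is a shortcut name that works in spaCy v2; spaCy v3 removed shortcuts,
# so there you would load a full package name such as 'en_core_web_sm'.
nlp = spacy.load('en')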

## Analyzing Pride and Prejudice

This notebook reproduces the tutorial available at: https://github.com/explosion/spacy-notebooks/blob/master/notebooks/conference_notebooks/pycon_nlp/01_pride_and_predjudice.ipynb

The text of the book can be downloaded here: https://github.com/explosion/spacy-notebooks/blob/master/notebooks/conference_notebooks/pycon_nlp/data/pride_and_prejudice.txt

First, we define a convenience function that reads a text file, then we use it to load the book:

In [3]:
def read_file(file_name):
    with open(file_name, 'r') as file:
        return file.read()

In [4]:
text = read_file('data/pride_and_prejudice.txt')

# then parse it with spacy
processed_text = nlp(text)

In [5]:
# how many sentences in the book?
sentences = list(processed_text.sents)
print(len(sentences))

# take a look at some sentences
print(sentences[15:25])

7761
[", My dear Mr. Bennet," said his lady to him one day, "have you heard that
Netherfield Park is let at last?, "

, Mr. Bennet replied that he had not.

, ", But it is," returned she; "for Mrs. Long has just been here, and she
told me all about it., "

, Mr. Bennet made no answer.

, "Do you not want to know who has taken it?, " cried his wife impatiently.

]
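The printed sentences keep the line breaks of the source text file, which makes the list hard to read. A minimal sketch for tidying them up before display, assuming only the plain string of each sentence is needed (`normalize_whitespace` is a hypothetical helper, not part of spaCy):

```python
def normalize_whitespace(sentence_text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(sentence_text.split())


# Illustrative raw strings based on the sentences printed above; with a
# spaCy Doc, the strings would come from `sent.text` for each sentence.
raw_sentences = [
    '"My dear Mr. Bennet," said his lady to him one day, "have you heard that\n'
    'Netherfield Park is let at last?"\n\n',
    "Mr. Bennet replied that he had not.\n\n",
]
for raw in raw_sentences:
    print(normalize_whitespace(raw))
```

The helper works on strings only, so it can be applied to any text, not just spaCy sentences.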


### Find names of characters using named entities

We leverage named entity recognition to build a list of the characters' names and the number of times each occurs in the text.

First, we import the `Counter` class from the `collections` module:

In [6]:
from collections import Counter
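
`Counter` is a dictionary subclass that maps each key to a count. Missing keys start at zero, so `counter[key] += 1` works without any initialization. A small illustration with made-up input (the `names` list is illustrative only and is not taken from the book):

```python
from collections import Counter

names = ["Jane", "Darcy", "Jane", "Lizzy", "Jane", "Darcy"]  # illustrative input

counts = Counter()
for name in names:
    counts[name] += 1  # missing keys default to 0

print(counts.most_common())   # all entries, most frequent first
print(counts.most_common(1))  # only the top entry
```
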

In [7]:
# extract all the personal names and count their occurrences
# output is a list in the following form: [('Elizabeth', 620), ('Darcy', 364), ('Bennet', 290), ('Jane', 285) ...].

def find_character_occurences(doc):
    """
    Return a list of names from `doc` with the corresponding occurrences.

    :param doc: spaCy parsed document
    :return: list of tuples in form
        [('Elizabeth', 620), ('Darcy', 364), ('Bennet', 290), ('Jane', 285)]
    """

    characters = Counter() # initialize counter
    for ent in doc.ents: # cycle through named entities recognized in the spaCy doc
        if ent.label_ == 'PERSON': # filter only person names
            characters[ent.lemma_] += 1 # add 1 to the sum of occurrences of the found name

    return characters.most_common() # without an integer argument, it returns all entries sorted by count

For example:

In [8]:
print(find_character_occurences(processed_text)[:20])

[('Elizabeth', 620), ('Darcy', 364), ('Bennet', 290), ('Jane', 285), ('Bingley', 252), ('Wickham', 185), ('Collins', 178), ('Gardiner', 94), ('Lizzy', 93), ('Lady Catherine', 84), ('Kitty', 69), ('Longbourn', 60), ('Lydia', 50), ('Charlotte', 45), ('Pemberley', 42), ('Forster', 39), ('Mary', 37), ('William', 34), ('Fitzwilliam', 33), ('Hurst', 33)]


In [9]:
print(find_character_occurences(processed_text)[:50])

[('Elizabeth', 620), ('Darcy', 364), ('Bennet', 290), ('Jane', 285), ('Bingley', 252), ('Wickham', 185), ('Collins', 178), ('Gardiner', 94), ('Lizzy', 93), ('Lady Catherine', 84), ('Kitty', 69), ('Longbourn', 60), ('Lydia', 50), ('Charlotte', 45), ('Pemberley', 42), ('Forster', 39), ('Mary', 37), ('William', 34), ('Fitzwilliam', 33), ('Hurst', 33), ('Project Gutenberg - tm', 31), ('Meryton', 27), ('Phillips', 26), ('Catherine', 21), ('Lady Lucas', 18), ('Maria', 18), ('Derbyshire', 17), ('Netherfield', 17), ('de Bourgh', 16), ('Kent', 14), ('Long', 14), ('Miss Lucas', 13), ('Denny', 12), ('Lucases', 10), ('Caroline', 9), ('Jenkinson', 9), ('Reynolds', 9), ('William Lucas', 9), ('Lucas', 8), ('Hill', 8), ('Elizabeth Bennet', 8), ('longbourn', 8), ('Gutenberg - tm', 8), ('Eliza', 7), ('Charlotte Lucas', 7), ('Charles', 7), ('George Wickham', 7), ('Lady Catherine de Bourgh', 6), ('Gutenberg', 6), ('Aye', 6)]
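
The list above contains several surface forms of the same character ('Elizabeth', 'Lizzy', 'Eliza', 'Elizabeth Bennet') as well as entries that are not characters ('Longbourn', 'Project Gutenberg - tm'). One possible post-processing step, sketched below, merges aliases and drops unwanted entries. The `ALIASES` and `NOT_CHARACTERS` tables and the `merge_aliases` helper are hand-written assumptions for illustration; they are not produced by spaCy.

```python
from collections import Counter

# Hand-written assumptions: which surface forms refer to the same character,
# and which recognized "persons" are actually places or boilerplate.
ALIASES = {
    "Lizzy": "Elizabeth",
    "Eliza": "Elizabeth",
    "Elizabeth Bennet": "Elizabeth",
    "George Wickham": "Wickham",
}
NOT_CHARACTERS = {"Longbourn", "Pemberley", "Project Gutenberg - tm"}


def merge_aliases(occurrences):
    """Merge alias counts into a canonical name and drop non-characters."""
    merged = Counter()
    for name, count in occurrences:
        if name in NOT_CHARACTERS:
            continue  # skip places and boilerplate
        merged[ALIASES.get(name, name)] += count
    return merged.most_common()


# A few (name, count) pairs copied from the output above
sample = [("Elizabeth", 620), ("Darcy", 364), ("Wickham", 185), ("Lizzy", 93),
          ("Longbourn", 60), ("Elizabeth Bennet", 8), ("Eliza", 7),
          ("George Wickham", 7)]
print(merge_aliases(sample))
```

The same function can be applied to the full result of `find_character_occurences(processed_text)` once the two tables have been extended by hand.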


### Find words that describe a character & find characters that are doing specific actions

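We leverage the dependency parse. To describe a character, we collect the adjectives found in the subtree of each mention of the character, plus the adjectival complements (`acomp`) of verbs whose subject is the character. To find who performs an action, we check the lemma of the syntactic head of each mention.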

In [10]:
def get_character_adjectives(doc, character_lemma):
    """
    Find all the adjectives related to `character_lemma` in `doc`

    :param doc: spaCy parsed document
    :param character_lemma: lemma of the character's name (string)
    :return: list of adjective lemmas related to `character_lemma` (duplicates kept)
    """

    adjectives = []
    # first pass: adjectives in the dependency subtree of each mention
    for ent in doc.ents:
        if ent.lemma_ == character_lemma:
            for token in ent.subtree:
                if token.pos_ == 'ADJ': # alternatively, filter on token.dep_ == 'amod'
                    adjectives.append(token.lemma_)

    # second pass: adjectival complements of the verb whose subject is the mention
    for ent in doc.ents:
        if ent.lemma_ == character_lemma:
            if ent.root.dep_ == 'nsubj':
                for child in ent.root.head.children:
                    if child.dep_ == 'acomp':
                        adjectives.append(child.lemma_)

    return adjectives

For example:

In [13]:
print(get_character_adjectives(processed_text, 'Darcy'))

['surprised', 'unwilling', 'least', 'grave', 'late', 'late', 'late', 'late', 'intimate', 'confidential', 'present', 'late', 'superior', 'evident', 'late', 'late', 'poor', 'last', 'little', 'disagreeable', 'clever', 'worth', 'delighted', 'studious', 'sorry', 'unworthy', 'answerable', 'impatient', 'ashamed', 'kind', 'handsome', 'proud', 'tall', 'punctual', 'engage', 'delight', 'fond']
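
Because the returned list keeps duplicates, 'late' appears several times. Counting the adjectives gives a more compact view. A sketch using the list printed above (`summarize_adjectives` is a hypothetical helper):

```python
from collections import Counter


def summarize_adjectives(adjectives, top_n=5):
    """Return the `top_n` most frequent adjectives with their counts."""
    return Counter(adjectives).most_common(top_n)


# The list printed above for 'Darcy'
darcy_adjectives = [
    'surprised', 'unwilling', 'least', 'grave', 'late', 'late', 'late', 'late',
    'intimate', 'confidential', 'present', 'late', 'superior', 'evident',
    'late', 'late', 'poor', 'last', 'little', 'disagreeable', 'clever',
    'worth', 'delighted', 'studious', 'sorry', 'unworthy', 'answerable',
    'impatient', 'ashamed', 'kind', 'handsome', 'proud', 'tall', 'punctual',
    'engage', 'delight', 'fond',
]
print(summarize_adjectives(darcy_adjectives, top_n=1))
```
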


In [19]:
# Find the characters who most often 'say' something (change VERB_LEMMA to look at other verbs).

character_verb_counter = Counter()
VERB_LEMMA = 'say'

for ent in processed_text.ents:
    if ent.label_ == 'PERSON' and ent.root.head.lemma_ == VERB_LEMMA:
        character_verb_counter[ent.text] += 1

For example:

In [20]:
print(character_verb_counter.most_common(10))

[('Elizabeth', 46), ('Bennet', 30), ('Bingley', 16), ('Jane', 13), ('Darcy', 13), ('Fitzwilliam', 5), ('Lady Catherine', 5), ('Lizzy', 5), ('Gardiner', 5), ('Charlotte', 5)]
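
The cell above hard-codes the verb and reads the global `processed_text`. A possible refactoring into a reusable function is sketched below. `count_characters_by_verb` is a hypothetical name, and the function relies only on the attributes already used above (`doc.ents`, `ent.label_`, `ent.root.head.lemma_`, `ent.text`).

```python
from collections import Counter


def count_characters_by_verb(doc, verb_lemma, top_n=10):
    """
    Count PERSON entities whose syntactic head has the lemma `verb_lemma`.

    :param doc: spaCy parsed document
    :param verb_lemma: lemma of the verb to look for, e.g. 'say'
    :param top_n: how many (name, count) pairs to return
    :return: list of (entity text, count) tuples, most frequent first
    """
    counter = Counter()
    for ent in doc.ents:
        # ent.root is the head token of the entity span; its head is the
        # token that governs the mention in the dependency tree
        if ent.label_ == 'PERSON' and ent.root.head.lemma_ == verb_lemma:
            counter[ent.text] += 1
    return counter.most_common(top_n)
```

With the parsed book, `count_characters_by_verb(processed_text, 'say')` should correspond to the cell above, and other verb lemmas can be tried the same way.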
