# Getting `(possessor,possessed)` Pairs

I use spaCy to find pairs like `("cat","tail")`, `("tree","leaf")`, and `("duke","appanage")` in Project Gutenberg texts.

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

In [2]:
from tqdm import tqdm
import json
from collections import defaultdict
from nltk import ngrams
from nltk.corpus import wordnet as wn

In [3]:
import spacy
nlp = spacy.load('en')  # spaCy v2 shortcut; spaCy v3+ needs a full model name such as 'en_core_web_sm'

In [4]:
for a in nlp(u"The dog was small and cold."):
    print(a,a.dep_)

The det
dog nsubj
was ROOT
small acomp
and cc
cold conj
. punct


Load in some randomly chosen Gutenberg texts.

In [5]:
import os
gb_files = [f for f in os.listdir("/Users/kyle/Documents/downloading_gutenberg/data/") if f.startswith('gb_')]

In [6]:
len(gb_files)

19507

Define a function to extract `(possessor,possessed)` relations.

In [7]:
from spacy import displacy

In [8]:
type(nlp("The wifes of the man was here.")[1])

spacy.tokens.token.Token

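I extract `(possessor,possessed)` tuples conservatively, keeping only the `'s` possessive. Matching the "X of Y" pattern would capture relationships like "the scales of the fish", but it would also capture non-possessive phrases like "ball of yarn."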

In [9]:
def extract_possessor_possessed(tempspacy):
    """
    For text like "The dog's fur, the teeth of the cat." this is meant to return [("dog", "fur")].
    Only the 's possessive is matched; the "X of Y" pattern is skipped on purpose.
    Both nouns are lowercased, not lemmatized.
    """

    nouns = ["NN", "NNS"]  # common nouns only, so pronouns and proper nouns are excluded
    possessor_possessed_tuples = []

    # Match the possessor's possessed, e.g. "the cat's meow".
    # Skip "the meow of the cat": that pattern is too ambiguous ("the ball of yarn").
    for token in tempspacy:
        # find a noun whose head is also a noun
        if (token.tag_ in nouns) and (token.head.tag_ in nouns):
            # make sure the dependency that connects them is `poss`
            if token.dep_ == "poss":
                possessor_possessed_tuples.append((token.text.lower(), token.head.text.lower()))
        # The "of" pattern, left disabled:
        # if (token.tag_ in nouns) and (token.dep_=="pobj") and (token.head.lemma_=="of") and (token.head.head.tag_ in nouns):
        #     possessor_possessed_tuples.append((token.head.head.text,token.text))

    return possessor_possessed_tuples

extract_possessor_possessed(nlp(u"Beyond the old cat's stinky legs and ugly face.  A ball of yarn. The shining scales of the fish.  The king's mansions.  Your friend is here."))

[('cat', 'legs'), ('king', 'mansions')]
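The filtering rule can be checked in isolation, without loading a spaCy model. The sketch below is only an illustration: `FakeToken` and `possessive_pairs` are hypothetical names, and the tags and dependency labels are written by hand rather than produced by a parser. It shows why "cat's legs" passes while "Your friend" (a pronoun possessor) and "ball of yarn" (no `poss` arc) do not.

```python
from dataclasses import dataclass
from typing import Optional

NOUN_TAGS = {"NN", "NNS"}


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token: only the attributes the rule reads."""
    text: str
    tag_: str
    dep_: str
    head: Optional["FakeToken"] = None


def possessive_pairs(tokens):
    """Keep (possessor, possessed) when a common noun is the `poss` child of a common noun."""
    pairs = []
    for tok in tokens:
        head = tok.head
        if head is None:
            continue
        if tok.dep_ == "poss" and tok.tag_ in NOUN_TAGS and head.tag_ in NOUN_TAGS:
            pairs.append((tok.text.lower(), head.text.lower()))
    return pairs


# Hand-written annotations (assumed for illustration, not parser output).
legs = FakeToken("legs", "NNS", "pobj")
cat = FakeToken("cat", "NN", "poss", head=legs)          # "cat's legs"
friend = FakeToken("friend", "NN", "nsubj")
your = FakeToken("Your", "PRP$", "poss", head=friend)    # pronoun possessor
of = FakeToken("of", "IN", "prep")
yarn = FakeToken("yarn", "NN", "pobj", head=of)          # "ball of yarn"

example_pairs = possessive_pairs([cat, legs, your, friend, of, yarn])
print(example_pairs)
```

Only the noun-to-noun `poss` arc survives, which matches the conservative behavior described above.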

I want to loop through a batch of Gutenberg texts that I downloaded at random with the Gutenberg Python package.

In [10]:
from gutenberg.cleanup import strip_headers

In [11]:
possessor2possessed_tuples_with_filenumber = []

for fy in tqdm(gb_files):
    with open("/Users/kyle/Documents/downloading_gutenberg/data/"+fy,'r') as f:
        tempdata = f.read()
        filenumber = os.path.splitext(fy)[0][len("gb_"):]  ## "gb_2068.txt" -> "2068"
        if "Language: English" in tempdata[:1000]:  ## keep English texts only
            tempdata = strip_headers(tempdata)
            tempspacy = nlp(tempdata[:200000])  ## limit to the first 200,000 characters
            ## collect tuples such as (king, kingdom)
            try:
                for pp in extract_possessor_possessed(tempspacy):
                    possessor2possessed_tuples_with_filenumber.append((filenumber,pp))
            except:
                pass

100%|██████████| 19507/19507 [18:59:54<00:00,  3.51s/it]   
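The file number is whatever sits between the `gb_` prefix and the `.txt` extension. One thing to keep in mind when parsing it: `str.lstrip` and `str.rstrip` remove any run of the listed characters, not a literal prefix or suffix. That is harmless for purely numeric IDs like `2068`, but it would mangle other names. A minimal sketch of an explicit helper follows; the name `file_number` and the non-numeric filename are hypothetical.

```python
def file_number(filename, prefix="gb_", suffix=".txt"):
    """Return the part of `filename` between a literal prefix and suffix."""
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError(f"unexpected filename: {filename!r}")
    return filename[len(prefix):len(filename) - len(suffix)]


numeric_id = file_number("gb_2068.txt")

# Hypothetical non-numeric name, to show where character-set stripping goes wrong.
explicit = file_number("gb_text.txt")
stripped = "gb_text.txt".lstrip("gb_").rstrip(".txt")

print(numeric_id, explicit, stripped)
```

Slicing by the affix lengths keeps the result independent of which characters the ID happens to contain.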


What do the `(possessor,possessed)` tuples look like? Each one is stored alongside the number of the file it came from.

In [12]:
possessor2possessed_tuples_with_filenumber[:100]

[('2068', ('captain', 'property')),
 ('2068', ('dealer', 'wagon')),
 ('2068', ('takin', 'carpets')),
 ('2068', ('body', 'eyesight')),
 ('2068', ('praisin', 'beliefs')),
 ('2068', ('doctor', 'bills')),
 ('2068', ('month', 'rent')),
 ('2068', ('girl', 'eyes')),
 ('2068', ('daughter', 'summer')),
 ('2068', ('father', 'money')),
 ('2068', ('law', 'funeral')),
 ('2068', ('settin', 'takin')),
 ('2068', ('sister', 'orders')),
 ('2068', ('schooner', 'sails')),
 ('2068', ('stovepipe', 'loose')),
 ('2068', ('brother', 'knees')),
 ('2068', ('brother', 'money')),
 ('2068', ('morning', 'duties')),
 ('2068', ('world', 'comin')),
 ('2068', ('doin', 'years')),
 ('2068', ('pastor', 'tenantry')),
 ('2068', ('settin', 'room')),
 ('2068', ('settin', 'room')),
 ('2068', ('lookin', 'ones')),
 ('2068', ('man', 'voice')),
 ('2068', ('week', 'cruise')),
 ('2068', ('unknown', 'stock')),
 ('2068', ('dressin', 'room')),
 ('2068', ('stranger', 'answer')),
 ('2068', ('coastin', 'schooner')),
 ('2068', ('way', 'tis'

In [13]:
print(len(possessor2possessed_tuples_with_filenumber))

583321
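With this many rows, one possible next step is to aggregate them. The sketch below is an assumption about how that could look, not part of the original run: it counts repeated pairs with `Counter` and groups the possessed nouns by possessor with `defaultdict`, using a few rows copied from the output above.

```python
from collections import Counter, defaultdict

# A few rows copied from the output above: (filenumber, (possessor, possessed)).
rows = [
    ("2068", ("captain", "property")),
    ("2068", ("brother", "knees")),
    ("2068", ("brother", "money")),
    ("2068", ("settin", "room")),
    ("2068", ("settin", "room")),
    ("2068", ("father", "money")),
]

# How often does each (possessor, possessed) pair occur?
pair_counts = Counter(pair for _, pair in rows)

# Which things does each possessor have?
possessed_by = defaultdict(set)
for _, (possessor, possessed) in rows:
    possessed_by[possessor].add(possessed)

print(pair_counts.most_common(1))
print(sorted(possessed_by["brother"]))
```

Counting also makes parser noise such as `('settin', 'room')` easy to spot, since it shows up as a frequent pair.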


In [15]:
with open("possessor2possessed_tuples_with_filenumber.json","w") as f:
    json.dump(possessor2possessed_tuples_with_filenumber,f)
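JSON has no tuple type, so the nested tuples are written as arrays and come back as lists when the file is loaded. A minimal sketch of the round trip, using an in-memory string instead of the file and two rows copied from the output above:

```python
import json

rows = [("2068", ("captain", "property")), ("2068", ("dealer", "wagon"))]

encoded = json.dumps(rows)     # tuples are serialized as JSON arrays
decoded = json.loads(encoded)  # arrays come back as Python lists

# Rebuild the original (filenumber, (possessor, possessed)) shape.
restored = [(filenumber, tuple(pair)) for filenumber, pair in decoded]

print(decoded[0])
print(restored[0])
```

Converting back to tuples matters if the pairs are later used as dictionary keys or set members, since lists are not hashable.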

***