# Getting `(adjective, noun, number_of_noun)` and `(noun, transitive_verb, number_of_noun)` Tuples

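I'm extracting yet more information from Gutenberg data: adjective-noun pairs and subject-verb pairs. The `transitive_verb` is lemmatized; the nouns and adjectives are lowercased but not lemmatized. `number_of_noun` is `1` (singular) or `2` (plural). For example, the sentence "The small dog ate the purple leaves." yields: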

>`[("small","dog",1),("purple","leaves",2)]`

and 

>`[("dog","eat",1)]`

***

Todo

* Don't just get adjectives; also get participles used as modifiers, like "cooked cabbage".

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

In [2]:
from tqdm import tqdm
import json
from collections import defaultdict
from nltk import ngrams
from nltk.corpus import wordnet as wn

In [3]:
import spacy
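# 'en' is the spaCy 1.x/2.x shortcut name; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')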

In [4]:
for a in nlp(u"The wild dog was small and cold.  The dog ate the house.  The dog slept."):
    print(a,a.dep_)

The det
wild amod
dog nsubj
was ROOT
small acomp
and cc
cold conj
. punct
  
The det
dog nsubj
ate ROOT
the det
house dobj
. punct
  
The det
dog nsubj
slept ROOT
. punct


Load in some randomly chosen Gutenberg texts.

In [5]:
import os
gb_files = [f for f in os.listdir("/Users/kyle/Documents/downloading_gutenberg/data/") if f.startswith('gb_')]

In [6]:
len(gb_files)

19507

Define a function to extract relations.

In [7]:
type(nlp("The wifes of the man was here.")[1])

spacy.tokens.token.Token

Extracting `(adjective, noun, number_of_noun)` tuples conservatively: a token is kept only if its dependency is `amod`, its part of speech is `ADJ`, and its head is tagged `NN` or `NNS`.

In [8]:
def extract_adj2nouns(tempspacy):
    """
    returns [(adjective, noun, number of noun),...]
    """
    nouns = ["NN", "NNS"]
    adj_noun_tuples = []
    for token in tempspacy:  ## for every token in the document
        ## keep adjectival modifiers (`amod`) that are tagged as adjectives
        if token.dep_ == "amod" and token.pos_ == "ADJ":
            if token.head.tag_ in nouns:  ## the head must be a common noun
                number = 1 if token.head.tag_ == "NN" else 2  ## 1 = singular, 2 = plural
                ## store the lowercased surface forms (not lemmas)
                adj_noun_tuples.append((token.text.lower(), token.head.text.lower(), number))
    return adj_noun_tuples

extract_adj2nouns(nlp("Frogs and small frogs were not the only ones there, in the blue house of the blue house.  The smiling dog walked itself through the house."))

[('small', 'frogs', 2), ('only', 'ones', 2), ('blue', 'house', 1)]
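
Note that "smiling dog" is dropped above because "smiling" is not tagged `ADJ`, which is what the Todo item about participles refers to. One possible way to cover it, as a minimal sketch: accept the participle tags `VBN` and `VBG` alongside the adjective tags. This assumes the parser labels such pre-nominal participles `amod`, which is worth checking on real output. `FakeToken`, `MODIFIER_TAGS`, and `extract_modifier2nouns` are hypothetical names; `FakeToken` only stands in for the few spaCy `Token` attributes used here so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class FakeToken:
    """Hypothetical stand-in for the spaCy Token attributes used below."""
    text: str
    dep_: str
    tag_: str
    head: Optional["FakeToken"] = None


NOUN_TAGS = {"NN": 1, "NNS": 2}                       # tag -> number_of_noun
MODIFIER_TAGS = {"JJ", "JJR", "JJS", "VBN", "VBG"}    # adjectives plus participles (assumption)


def extract_modifier2nouns(tokens):
    """Return [(modifier, noun, number_of_noun), ...] for amod modifiers of common nouns."""
    pairs = []
    for token in tokens:
        head = token.head
        if token.dep_ != "amod" or token.tag_ not in MODIFIER_TAGS:
            continue
        if head is not None and head.tag_ in NOUN_TAGS:
            pairs.append((token.text.lower(), head.text.lower(), NOUN_TAGS[head.tag_]))
    return pairs


# Hand-built tokens for "Small dogs ... cooked cabbage"
dogs = FakeToken("dogs", "nsubj", "NNS")
small = FakeToken("Small", "amod", "JJ", head=dogs)
cabbage = FakeToken("cabbage", "dobj", "NN")
cooked = FakeToken("cooked", "amod", "VBN", head=cabbage)

result = extract_modifier2nouns([small, dogs, cooked, cabbage])
print(result)
```

With real spaCy tokens the same function body should work unchanged, since it only reads `dep_`, `tag_`, `text`, and `head`.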

In [9]:
def extract_noun2verb(tempspacy):
    """
    returns [(noun, transitive verb, number of noun),...]
    """
    nouns = ["NN", "NNS"]
    noun2verbs = []
    for token in tempspacy:
        if token.tag_ in nouns and token.dep_ == "nsubj":  ## find a noun subject
            verb = token.head  ## make sure its head is a verb
            if verb.pos_ == "VERB":
                ## treat the verb as transitive if any of its children is a direct object
                if any(c.dep_ == "dobj" for c in verb.children):
                    number = 1 if token.tag_ == "NN" else 2  ## 1 = singular, 2 = plural
                    ## lowercased subject noun and lemmatized verb; the object itself is not stored
                    noun2verbs.append((token.text.lower(), verb.lemma_, number))
    return noun2verbs


extract_noun2verb(nlp(u"He swung through the fence.  The men circling the house. While running down the street, the man ate apples from a cart. The tree bore fruit and lemons. The dog who ran through the night ate chicken through the night.  The woods are on fire."))

[('men', 'circle', 2), ('man', 'eat', 1), ('dog', 'eat', 1)]
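
The function above only checks that a direct object exists and then discards it. If the object is wanted too, a variant could return it alongside the subject and verb. The following is a rough sketch under that assumption, not the notebook's method; `FakeToken`, `attach`, and `extract_subject_verb_object` are hypothetical names, and `FakeToken` merely mimics the spaCy `Token` attributes that are read.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class FakeToken:
    """Hypothetical stand-in for the spaCy Token attributes used below."""
    text: str
    dep_: str
    tag_: str
    pos_: str
    lemma_: str
    head: Optional["FakeToken"] = None
    children: List["FakeToken"] = field(default_factory=list)


def attach(child, head):
    """Link a child token to its syntactic head (test scaffolding only)."""
    child.head = head
    head.children.append(child)


NOUN_TAGS = {"NN": 1, "NNS": 2}  # tag -> number_of_noun


def extract_subject_verb_object(tokens):
    """Return [(subject, verb_lemma, object, number_of_subject), ...]."""
    triples = []
    for token in tokens:
        if token.tag_ not in NOUN_TAGS or token.dep_ != "nsubj":
            continue
        verb = token.head
        if verb is None or verb.pos_ != "VERB":
            continue
        for child in verb.children:
            if child.dep_ == "dobj":  # one triple per direct object
                triples.append(
                    (token.text.lower(), verb.lemma_, child.text.lower(), NOUN_TAGS[token.tag_])
                )
    return triples


# Hand-built parse of "The man ate apples."
ate = FakeToken("ate", "ROOT", "VBD", "VERB", "eat")
man = FakeToken("man", "nsubj", "NN", "NOUN", "man")
apples = FakeToken("apples", "dobj", "NNS", "NOUN", "apple")
attach(man, ate)
attach(apples, ate)

triples = extract_subject_verb_object([man, ate, apples])
print(triples)
```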

Next, loop over the Gutenberg texts, a random selection that I downloaded with the `gutenberg` Python package.

In [10]:
from gutenberg.cleanup import strip_headers

In [11]:
adj2nouns_tuples_with_filenumber = []
noun2verb_tuples_with_filenumber = []

for fy in tqdm(gb_files):
    with open("/Users/kyle/Documents/downloading_gutenberg/data/"+fy,'r') as f:
        tempdata = f.read()
    ## "gb_2068.txt" -> "2068"; slice explicitly, since lstrip/rstrip strip character sets, not a prefix/suffix
    filenumber = os.path.splitext(fy)[0][len("gb_"):]
    if "Language: English" in tempdata[:1000]:  ## make sure the text is English
        tempdata = strip_headers(tempdata)
        tempspacy = nlp(tempdata[:200000])  ## limit to the first 200,000 characters
        try:
            for pp in extract_adj2nouns(tempspacy):
                adj2nouns_tuples_with_filenumber.append((filenumber,pp))
        except Exception:  ## skip a text that fails rather than abort the whole run
            pass
        try:
            for pp in extract_noun2verb(tempspacy):
                noun2verb_tuples_with_filenumber.append((filenumber,pp))
        except Exception:
            pass
                

100%|██████████| 19507/19507 [18:47:14<00:00,  3.47s/it]   
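
A note on deriving the file number from a name like `gb_2068.txt`: `str.lstrip` and `str.rstrip` remove any of the given characters, not a fixed prefix or suffix, so they only happen to work when the ID contains none of those characters. A minimal sketch of explicit slicing; `file_number` is a hypothetical helper and the second file name is made up to show the pitfall.

```python
import os


def file_number(filename, prefix="gb_"):
    """Drop the extension and an exact leading prefix from a file name."""
    stem, _ext = os.path.splitext(filename)
    return stem[len(prefix):] if stem.startswith(prefix) else stem


print(file_number("gb_2068.txt"))          # numeric ID: both approaches agree
print("gb_bg12.txt".lstrip("gb_"))         # strips 'g', 'b', '_' characters, eating part of the ID
print(file_number("gb_bg12.txt"))          # explicit slicing keeps the whole ID
```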


A sample of the extracted `(adjective, noun, number_of_noun)` tuples, each paired with its file number:

In [12]:
adj2nouns_tuples_with_filenumber[:30]

[('2068', ('main', 'road', 1)),
 ('2068', ('gray', 'drapery', 1)),
 ('2068', ('front', 'yards', 2)),
 ('2068', ('new', 'weathervane', 1)),
 ('2068', ('regular', 'church', 1)),
 ('2068', ('fine', 'residence', 1)),
 ('2068', ('opposite', 'gate', 1)),
 ('2068', ('front', 'gate', 1)),
 ('2068', ('other', 'side', 1)),
 ('2068', ('main', 'road', 1)),
 ('2068', ('little', 'story', 1)),
 ('2068', ('populous', 'center', 1)),
 ('2068', ('precarious', 'earnings', 2)),
 ('2068', ('large', 'establishment', 1)),
 ('2068', ('wet', 'blanket', 1)),
 ('2068', ('small', 'yard', 1)),
 ('2068', ('venerable', 'trees', 2)),
 ('2068', ('conservative', 'trees', 2)),
 ('2068', ('old', 'age', 1)),
 ('2068', ('dismantled', 'room', 1)),
 ('2068', ('faded', 'carpet', 1)),
 ('2068', ('outward', 'appearances', 2)),
 ('2068', ('large', 'portion', 1)),
 ('2068', ('wet', 'blanket', 1)),
 ('2068', ('more', 'sunshine', 1)),
 ('2068', ('plump', 'arms', 2)),
 ('2068', ('strong', 'arms', 2)),
 ('2068', ('tall', 'boy', 1)),
 

In [13]:
print(len(adj2nouns_tuples_with_filenumber))

17127986


In [14]:
noun2verb_tuples_with_filenumber[:30]

[('2068', ('weathervane', 'present', 1)),
 ('2068', ('gate', 'stand', 1)),
 ('2068', ('ward', 'remove', 1)),
 ('2068', ('pair', 'discuss', 1)),
 ('2068', ('discussion', 'reach', 1)),
 ('2068', ('passengers', 'make', 2)),
 ('2068', ('kind', 'keep', 1)),
 ('2068', ('family', 'get', 1)),
 ('2068', ('companion', 'put', 1)),
 ('2068', ('month', 'work', 1)),
 ('2068', ('sewin', 'keep', 1)),
 ('2068', ('nobody', 'owe', 1)),
 ('2068', ('youngster', 'cherish', 1)),
 ('2068', ('draught', 'keep', 1)),
 ('2068', ('latter', 'leave', 1)),
 ('2068', ('body', 'need', 1)),
 ('2068', ('girl', 'take', 1)),
 ('2068', ('tide', 'reach', 1)),
 ('2068', ('use', 'advise', 1)),
 ('2068', ('everybody', 'know', 1)),
 ('2068', ('nobody', 'take', 1)),
 ('2068', ('possibility', 'enter', 1)),
 ('2068', ('refusal', 'enter', 1)),
 ('2068', ('knees', 'give', 2)),
 ('2068', ('nobody', 'lead', 1)),
 ('2068', ('nobody', 'put', 1)),
 ('2068', ('everybody', 'know', 1)),
 ('2068', ('folks', 'say', 2)),
 ('2068', ('thing', 'sa

In [31]:
print(len(noun2verb_tuples_with_filenumber))

2161512


In [23]:
adj2nouns_tuples_with_filenumber_unique_per_file = list(set(adj2nouns_tuples_with_filenumber))

In [25]:
adj2nouns_tuples = [tup for filenumber,tup in adj2nouns_tuples_with_filenumber_unique_per_file]
adj2nouns_tuples[:10]

[('new', 'place', 1),
 ('only', 'problem', 1),
 ('large', 'parcels', 2),
 ('regular', 'manner', 1),
 ('secret', 'suffering', 1),
 ('happy', 'provision', 1),
 ('whitened', 'necks', 2),
 ('few', 'people', 2),
 ('calm', 'assurance', 1),
 ('faint', 'tinge', 1)]
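
Because duplicates are removed per `(filenumber, tuple)` pair, a tuple appears in the flat list once for every file that contains it. Counting the flat list therefore gives the number of files each tuple occurs in. A small sketch with made-up data (the tuples and file numbers below are illustrative, not taken from the corpus):

```python
from collections import Counter

# Toy stand-in for adj2nouns_tuples_with_filenumber
tuples_with_filenumber = [
    ("1", ("old", "man", 1)),
    ("1", ("old", "man", 1)),      # repeated within file 1
    ("2", ("old", "man", 1)),
    ("2", ("tall", "trees", 2)),
]

# Same deduplication as above: one copy per (file, tuple) pair
unique_per_file = set(tuples_with_filenumber)

# Number of distinct files in which each tuple occurs
file_counts = Counter(tup for _filenumber, tup in unique_per_file)
print(file_counts.most_common())
```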

In [28]:
len(adj2nouns_tuples)/len(adj2nouns_tuples_with_filenumber)

0.8607644237915655

In [29]:
noun2verb_tuples_with_filenumber_unique_per_file = list(set(noun2verb_tuples_with_filenumber))

In [30]:
noun2verb_tuples = [tup for filenumber,tup in noun2verb_tuples_with_filenumber_unique_per_file]
noun2verb_tuples[:10]

[('education', 'demand', 1),
 ('diagram', 'show', 1),
 ('story', 'stir', 1),
 ('lamb', 'follow', 1),
 ('limestones', 'bear', 2),
 ('failure', 'bring', 1),
 ('officer', 'take', 1),
 ('mind', 'give', 1),
 ('scenery', 'lose', 1),
 ('bridge', 'cross', 1)]

In [33]:
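# NOTE: the denominator is the adjective-noun list, so this is the size of the deduplicated
# noun-verb list relative to all adjective-noun tuples, not the fraction surviving deduplication;
# for that, divide by len(noun2verb_tuples_with_filenumber) instead
len(noun2verb_tuples)/len(adj2nouns_tuples_with_filenumber)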

0.12248532898146927

In [34]:
with open("adj2nouns_tuples.json","w") as f:
    json.dump(adj2nouns_tuples,f)

In [35]:
with open("noun2verb_tuples.json","w") as f:
    json.dump(noun2verb_tuples,f)
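
One thing to keep in mind when reading these files back: JSON has no tuple type, so each tuple is written as an array and comes back as a Python list. A minimal round-trip sketch using an in-memory string instead of the files above:

```python
import json

tuples = [("small", "dog", 1), ("purple", "leaves", 2)]

encoded = json.dumps(tuples)           # tuples are serialized as JSON arrays
decoded = json.loads(encoded)          # ...and come back as lists
restored = [tuple(item) for item in decoded]  # convert back if tuples are needed (e.g. as dict keys)

print(decoded[0], restored[0])
```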

***