# Mining Semantic Relations from Gutenberg

I'm extracting three kinds of relation tuples from Project Gutenberg texts with spaCy:

- `(possessor, possessed)`
- `(adjective, noun, number_of_noun)`
- `(noun, transitive_verb, number_of_noun)`


The `transitive_verb` is lemmatized. `number_of_noun` is `1` (singular) or `2` (plural). From the sentence "The small dog ate the purple leaves.":

>`[("small","dog",1),("purple","leaves",2)]`

and 

>`[("dog","eat",1)]`

***

Todo

* don't just get adjectives, also participles like "cooked cabbage" (maybe don't do this? yes...too many adjectives as it is...)

In [1]:
from tqdm import tqdm
import json
from collections import defaultdict
from nltk import ngrams
from nltk.corpus import wordnet as wn

In [2]:
import nltk
print(nltk.__version__)

3.7


In [3]:
import spacy

In [4]:
nlp = spacy.load("en_core_web_lg")

In [5]:
print(spacy.__version__)

3.4.1


In [6]:
for a in nlp("The wild dog was small and cold.  The dog ate the house.  The dog slept. The tree bore fruit and lemons."):
    print(a,a.dep_)

The det
wild amod
dog nsubj
was ROOT
small acomp
and cc
cold conj
. punct
  dep
The det
dog nsubj
ate ROOT
the det
house dobj
. punct
  dep
The det
dog nsubj
slept ROOT
. punct
The det
tree nsubj
bore ROOT
fruit dobj
and cc
lemons conj
. punct


## Define functions to extract relations

Extracting `(possessor, possessed)` tuples conservatively: only noun possessors attached with the `poss` dependency ("the cat's legs") are used, because trying to get relationships like "the scales of the fish" also gets "ball of yarn."

In [7]:
def extract_possessor_possessed(tempspacy):
    """
    For a sentence like "The dog's fur, the teeth of the cat." returns [(dog, fur)].
    Only `poss` dependencies are used; "X of Y" constructions are skipped.
    Both nouns are lowercased, not lemmatized.
    """
    nouns = ["NN","NNS"]
    possessor_possessed_tuples = []
    ### the noun's possessed, e.g. the cat's meow
    ### don't go for "the meow of the cat": too ambiguous (the ball of yarn)
    for token in tempspacy:
#         try:
        if (token.tag_ in nouns) and (token.head.tag_ in nouns): ### find a noun and its noun head
            if token.dep_=="poss": ## make sure the dep that connects them is `poss`
                possessor_possessed_tuples.append((token.text.lower(),token.head.text.lower()))
#             if (token.tag_ in nouns) and (token.dep_=="pobj") and (token.head.lemma_=="of") and (token.head.head.tag_ in nouns):
#                 possessor_possessed_tuples.append((token.head.head.text,token.text))
#         except:
#             pass
    return possessor_possessed_tuples

extract_possessor_possessed(nlp(u"Beyond the old cat's stinky legs and ugly face.  A ball of yarn. The shining scales of the fish.  The king's mansions.  Your friend is here."))

[('cat', 'legs'), ('king', 'mansions')]


In [8]:
def extract_adj2nouns(tempspacy):
    """
    returns [(adjective, noun, number of noun),...]
    """
    nouns = ["NN","NNS"]
    adj_noun_tuples = []
    for token in tempspacy:  ## for every token in the document
        #print(token,token.dep_,token.pos_)
#         try: 
        if token.dep_=="amod":  ## try to see if it is an `amod`, an adjective
            if token.pos_=="ADJ":
                if token.head.tag_ in nouns:  ## try to see if the head is a noun
                    adj_noun_tuples.append((token.text.lower(),token.head.text.lower(),1 if token.head.tag_=="NN" else 2)) ## add the adjective, the noun (lowercased, not lemmatized), and the noun's number
#         except:
#             pass
    return adj_noun_tuples
                                       
extract_adj2nouns(nlp("Frogs and small frogs were not the only ones there, in the blue bird of the blue house.  The smiling dog walked itself through the house."))

[('small', 'frogs', 2), ('only', 'ones', 2), ('blue', 'bird', 1)]
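
The Todo item about participles ("cooked cabbage") could be handled by also accepting `amod` tokens tagged `VBN` or `VBG`. A rough sketch of that filter, assuming spaCy labels such participles that way (not checked against the corpus here). `is_noun_modifier` is a hypothetical helper, and the stand-in tokens below only mimic the spaCy attributes it reads:

```python
from types import SimpleNamespace

NOUN_TAGS = {"NN", "NNS"}
PARTICIPLE_TAGS = {"VBN", "VBG"}   # past and present participles (Penn Treebank tags)

def is_noun_modifier(token, include_participles=False):
    """True if token is an `amod` adjective (optionally a participle) on a common noun."""
    if token.dep_ != "amod" or token.head.tag_ not in NOUN_TAGS:
        return False
    if token.pos_ == "ADJ":
        return True
    return include_participles and token.tag_ in PARTICIPLE_TAGS

# Stand-in tokens with hand-assigned labels; a real run would use tokens from nlp(...).
cabbage = SimpleNamespace(text="cabbage", tag_="NN")
cooked = SimpleNamespace(text="cooked", dep_="amod", pos_="VERB", tag_="VBN", head=cabbage)
green = SimpleNamespace(text="green", dep_="amod", pos_="ADJ", tag_="JJ", head=cabbage)

print(is_noun_modifier(green))                              # True
print(is_noun_modifier(cooked))                             # False
print(is_noun_modifier(cooked, include_participles=True))   # True
```

Keeping the flag off by default leaves the current adjective-only behaviour unchanged, which matters given how many adjective pairs there already are.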

In [9]:
def extract_noun2verb(tempspacy):
    """
    returns [(noun, transitive verb, number of noun),...]
    """
    
    nouns = ["NN","NNS"]

    noun2verbs = []
    
    for token in tempspacy:
#         try:
        if ((token.tag_ in nouns) and (token.dep_=="nsubj")): ## find a noun subject
            verb = token.head  ## the subject's head should be a verb
            if verb.pos_ == "VERB":
                verb_children_deps = [c.dep_ for c in verb.children] ## the verb must have a direct object to count as transitive
                if "dobj" in verb_children_deps:
                    noun2verbs.append((token.text.lower(),verb.lemma_,1 if token.tag_=="NN" else 2)) ## lowercased subject noun, verb lemma, number of the noun
#         except:
#             pass
    return noun2verbs


extract_noun2verb(nlp("8:4 And the ark rested in the seventh month, on the seventeenth day of\
the month, upon the mountains of Ararat.And the ark rested in the seventh month, on the seventeenth day of the month, upon the mountains of Ararat. He swung through the fence.  The men circling the house. While running down the street, the man ate apples from a cart. The tree bore fruit and lemons. The dog who ran through the night ate chicken through the night.  The woods are on fire."))

[('man', 'eat', 1), ('tree', 'bear', 1), ('dog', 'eat', 1)]

## Load in some Gutenberg texts.

I want to loop through a bunch of Gutenberg texts that I've downloaded with the `gutenberg` Python package.

In [10]:
import os
#directory = "/Users/kyle/Documents/downloading_gutenberg/data/"
directory = "/Volumes/extra_data/gutenberg/data/"
gb_files = [f for f in os.listdir(directory) if f.startswith('gb_')]

In [11]:
len(gb_files)

31116
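
The files are assumed to be named `gb_<id>.txt`, and the numeric id is needed later to skip specific books. A minimal sketch of pulling that id out. `book_id_from_filename` is a hypothetical helper, not part of the `gutenberg` package:

```python
import os

def book_id_from_filename(filename, prefix="gb_"):
    """Return the integer id in a name like 'gb_10.txt', or None if it doesn't fit the pattern."""
    stem, _ext = os.path.splitext(filename)   # 'gb_10.txt' -> ('gb_10', '.txt')
    if not stem.startswith(prefix):
        return None
    digits = stem[len(prefix):]
    return int(digits) if digits.isdigit() else None

print(book_id_from_filename("gb_10.txt"))   # 10
print(book_id_from_filename("notes.txt"))   # None
```

Slicing off the prefix sidesteps `str.lstrip("gb_")`, which strips any leading `g`, `b`, or `_` characters rather than the literal prefix.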

In [12]:
just_testing = False ### False to run on all Gutenberg files in directory

In [13]:
if just_testing:
    gb_files = gb_files[:20]

In [14]:
from gutenberg.cleanup import strip_headers

In [15]:
from collections import defaultdict

In [16]:
possessor2possessed = defaultdict(int)
adj2nouns_tuples = defaultdict(int)
noun2verb_tuples = defaultdict(int)
banned_books = [
    10,#bible...the verse numbers disturb spacy's parser
]

for fy in tqdm(gb_files):
    with open(os.path.join(directory, fy),'r') as f:
        tempdata = f.read()
    ## str.lstrip/rstrip strip character sets, not prefixes, so slice the id out instead
    filenumber = os.path.splitext(fy)[0][len("gb_"):]
    if ("Language: English" in tempdata[:1000] and int(filenumber) not in banned_books):  ## make sure the book is English
        tempdata = strip_headers(tempdata)
        tempdata = tempdata.replace("\n"," ") ## important! spacy does better without \n
        tempspacy = nlp(tempdata[:200000]) ### limit to first 200,000 chars
        ## set(...) means each tuple is counted at most once per book
        try:
            for pp in set(extract_possessor_possessed(tempspacy)):
                possessor2possessed[pp]+=1
        except Exception:
            pass
        try:
            for pp in set(extract_adj2nouns(tempspacy)):
                adj2nouns_tuples[pp]+=1
        except Exception:
            pass
        try:
            for pp in set(extract_noun2verb(tempspacy)):
                noun2verb_tuples[pp]+=1
        except Exception:
            pass
                

100%|██████████████████████████████████| 31116/31116 [31:42:27<00:00,  3.67s/it]
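
Because each extractor's output is passed through `set(...)` before counting, a tuple is counted at most once per book, so the totals are the number of books a tuple appears in rather than raw occurrences. A toy sketch of that counting scheme, with made-up tuples standing in for extractor output:

```python
from collections import defaultdict

# Made-up extractor output for two "books"; real input would come from spaCy.
books = [
    [("dog", "eat", 1), ("dog", "eat", 1), ("tree", "bear", 1)],
    [("dog", "eat", 1)],
]

book_counts = defaultdict(int)
for tuples in books:
    for t in set(tuples):      # de-duplicate within a book
        book_counts[t] += 1

print(dict(book_counts))
```

Here `("dog", "eat", 1)` ends up with a count of 2 (it appears in both books), even though it occurs three times in total.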


What do the `(possessor, possessed)` tuples look like?

In [17]:
list(possessor2possessed.items())[:10]

[(('strength', 'abundance'), 3),
 (('honour', 'ring'), 2),
 (('tongue', 'tune'), 2),
 (('father', 'skill'), 16),
 (('tiger', 'jaws'), 5),
 (('winter', 'deface'), 3),
 (('love', 'parts'), 4),
 (('love', 'face'), 17),
 (('eyes', 'falsehood'), 2),
 (('husband', 'shape'), 3)]

In [18]:
sum(possessor2possessed.values())

1175185

In [19]:
list(adj2nouns_tuples.items())[:10]

[(('holy', 'writ', 1), 8),
 (('plausive', 'words', 2), 1),
 (('vulgar', 'thief', 1), 6),
 (('too', 'base', 1), 5),
 (('cold', 'fountain', 1), 10),
 (('mere', 'word', 1), 189),
 (('honourable', 'service', 1), 46),
 (('second', 'burthen', 1), 6),
 (('obedient', 'servant', 1), 816),
 (('antique', 'song', 1), 10)]

In [20]:
sum(adj2nouns_tuples.values())

33418670

In [21]:
list(noun2verb_tuples.items())[:10]

[(('grace', 'kill', 1), 2),
 (('child', 'sum', 1), 4),
 (('world', 'see', 1), 715),
 (('thou', 'find', 1), 132),
 (('unworthiness', 'raise', 1), 1),
 (('dreams', 'show', 2), 11),
 (('truth', 'need', 1), 27),
 (('death', 'do', 1), 112),
 (('ear', 'confound', 1), 4),
 (('beauty', 'hold', 1), 44)]

In [22]:
sum(noun2verb_tuples.values())

4961751
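
`list(d.items())[:10]` shows entries in insertion order, which says little about which relations are common. A small sketch of ranking by count instead, using a few of the `(adjective, noun, number)` counts printed above. `top_modifiers` is a hypothetical helper name:

```python
from collections import Counter

# A handful of counts copied from the adj2nouns_tuples output above.
sample = Counter({
    ("obedient", "servant", 1): 816,
    ("mere", "word", 1): 189,
    ("honourable", "service", 1): 46,
    ("cold", "fountain", 1): 10,
    ("holy", "writ", 1): 8,
})

def top_modifiers(counts, noun, k=5):
    """Return up to k (adjective, count) pairs for one noun, most frequent first."""
    hits = Counter()
    for (adj, n, _number), c in counts.items():
        if n == noun:
            hits[adj] += c     # singular and plural entries are merged
    return hits.most_common(k)

print(sample.most_common(2))
# [(('obedient', 'servant', 1), 816), (('mere', 'word', 1), 189)]
print(top_modifiers(sample, "servant"))
# [('obedient', 816)]
```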

### Output

In [23]:
if not just_testing:
    with open("possessor2possessed_tuples_with_count.json","w") as f:
        json.dump(list(possessor2possessed.items()),f)  ## .items() keeps the tuples; .values() would save only the counts
else:
    print("just a test; not saving data")

In [None]:
if not just_testing:
    with open("adj2nouns_tuples_with_count.json","w") as f:
        json.dump(list(adj2nouns_tuples.items()),f)  ## .items() keeps the tuples; .values() would save only the counts
else:
    print("just a test; not saving data")

In [None]:
if not just_testing:
    with open("noun2verb_tuples_with_count.json","w") as f:
        json.dump(list(noun2verb_tuples.items()),f)  ## .items() keeps the tuples; .values() would save only the counts
else:
    print("just a test; not saving data")
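
JSON object keys must be strings, so tuple-keyed counts are easiest to store as a list of `[tuple, count]` pairs. A sketch of writing that shape and reading it back into a dictionary. The helper names and the file name are illustrative only:

```python
import json
import os
import tempfile

def save_counts(counts, path):
    """Write {tuple: count} as a JSON list of [list, count] pairs."""
    with open(path, "w") as f:
        json.dump([[list(key), count] for key, count in counts.items()], f)

def load_counts(path):
    """Read the pairs back, turning each key list into a tuple again."""
    with open(path) as f:
        return {tuple(key): count for key, count in json.load(f)}

counts = {("dog", "eat", 1): 3, ("tree", "bear", 1): 1}
path = os.path.join(tempfile.mkdtemp(), "demo_counts.json")
save_counts(counts, path)
print(load_counts(path) == counts)   # True
```

JSON has no tuple type, so the keys come back as lists and need the `tuple(...)` conversion before they can be used as dictionary keys again.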

***