# Gymnasion Data Processing

Here I'm going to mine a sample of Project Gutenberg texts for `(adj,noun)` and `(noun,verb,object)` relations, mostly using spaCy and textacy.  Extracting them is easy.  Filtering out the chaff is not so easy.  

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

In [2]:
from tqdm import tqdm
import json
from collections import defaultdict
from nltk import ngrams

In [3]:
from textacy import extract

In [4]:
import spacy
nlp = spacy.load('en')

Load in some randomly chosen Gutenberg texts.

In [5]:
import os
gb_files = [f for f in os.listdir("/Users/kyle/Documents/gutenberg_samples/") if f.startswith('gb_')]

Define a function to extract `(adj,noun)` relations.

In [6]:
def extract_adj2nouns(tempspacy):
    """
    For a sentence like "I ate the small frog." returns [(small, frog)].
    Lemmatizes the noun and lowercases the adjective.
    """
    nouns = ["NN","NNS"]
    adj_noun_tuples = []
    for token in tempspacy:
        if token.dep_=="amod":
            if token.head.tag_ in nouns:
                adj_noun_tuples.append((token.text.lower(),token.head.lemma_))
    return adj_noun_tuples
                                       
extract_adj2nouns(nlp(u"The small frogs were not the only ones there.  The dog walked itself through the house."))

[(u'small', u'frog'), (u'only', u'one')]

Textacy extracts `(s,v,o)` triples.

In [7]:
for triple in extract.subject_verb_object_triples(nlp(u"The husband ignored his wife.")):
    print triple

(husband, ignored, wife)


I want to loop through a bunch of Gutenberg texts that I've randomly downloaded with the Gutenberg Python package.  

In [8]:
from langdetect import detect ## to make sure texts are english

In [9]:
from unidecode import unidecode ## to crudely deal with text encoding issues

In [10]:
noun2adj = defaultdict(list)
noun2object = defaultdict(list)

In [11]:
noun2adj_tuples = []
svo_triples = []
errors = 0

for fy in tqdm(gb_files[:1000]):
    with open("/Users/kyle/Documents/gutenberg_samples/"+fy,'r') as f:
        tempdata = f.read()
        try:
            if detect(tempdata)=="en": ## check english
                tempspacy = nlp(tempdata.decode('utf-8'))

                ### adjectives
                try:
                    for pair in extract_adj2nouns(tempspacy):
                        noun2adj_tuples.append(pair)
                except Exception: ## skip this extraction but still let Ctrl-C interrupt the run
                    pass

                ### svo triples
                try:
                    gutenberg_svo_triples = extract.subject_verb_object_triples(tempspacy)
                    for trip in gutenberg_svo_triples:
                        svo_triples.append(trip)
                except Exception: ## skip this extraction but still let Ctrl-C interrupt the run
                    pass
        except Exception: ## count files that fail language detection, decoding, or parsing
            errors+=1
            


100%|██████████| 1000/1000 [1:14:54<00:00,  6.57s/it] 


How many pairs (not unique) do I have of `(adj,noun)` relations?

In [12]:
len(noun2adj_tuples)

2285591

Of `(s,v,o)` relations?

In [13]:
len(svo_triples)

2403881

## Inspecting the data so far...

### `(adj, noun)` relations

In [14]:
import random

In [15]:
random.sample(noun2adj_tuples,20)

[(u'uncomplaining', u'spirit'),
 (u'white', u'drapery'),
 (u'second', u'address'),
 (u'little', u'hillock'),
 (u'little', u'benefit'),
 (u'voluntary', u'contribution'),
 (u'beautiful', u'bird'),
 (u'political', u'connection'),
 (u'speculative', u'look'),
 (u'whimsical', u'portrait'),
 (u'such', u'practice'),
 (u'delightful', u'time'),
 (u'wounded', u'man'),
 (u'clean', u'shore'),
 (u'single', u'course'),
 (u'worn', u'grass'),
 (u'hapless', u'prey'),
 (u'colored', u'shirt'),
 (u'leading', u'theory'),
 (u'useful', u'plant')]

Another way to inspect data: frequency distributions.

In [16]:
from nltk import FreqDist as fd

In [17]:
ADJ_noun_fd = fd([a for a,n in noun2adj_tuples])
adj_NOUN_fd = fd([n for a,n in noun2adj_tuples])
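
As an aside, NLTK's `FreqDist` is a subclass of `collections.Counter`, so the same per-slot counting can be tried out without NLTK. The sketch below uses a made-up `toy_pairs` list standing in for `noun2adj_tuples`; the names and data are illustrative only, not taken from the corpus.

```python
from collections import Counter

# Toy stand-in for noun2adj_tuples: (adjective, noun) pairs.
toy_pairs = [
    ("small", "frog"),
    ("small", "house"),
    ("green", "frog"),
    ("old", "house"),
    ("small", "dog"),
]

# Count each slot of the pair separately, one counter per slot.
adj_counts = Counter(a for a, n in toy_pairs)
noun_counts = Counter(n for a, n in toy_pairs)

print(adj_counts.most_common(1))   # the single most frequent adjective
print(noun_counts["frog"])         # how often "frog" is modified
```
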

In [18]:
ADJ_noun_fd.most_common(40)

[(u'great', 48163),
 (u'other', 46884),
 (u'little', 46060),
 (u'old', 37059),
 (u'good', 34388),
 (u'own', 32720),
 (u'first', 29181),
 (u'many', 28795),
 (u'same', 27871),
 (u'such', 27821),
 (u'young', 24405),
 (u'new', 21912),
 (u'long', 19462),
 (u'few', 19209),
 (u'last', 18339),
 (u'small', 16753),
 (u'whole', 16372),
 (u'large', 14756),
 (u'more', 14084),
 (u'white', 12726),
 (u'several', 12081),
 (u'much', 11978),
 (u'next', 11899),
 (u'certain', 11299),
 (u'poor', 10796),
 (u'black', 10134),
 (u'high', 9956),
 (u'human', 8625),
 (u'full', 8498),
 (u'different', 8428),
 (u'only', 8320),
 (u'general', 8246),
 (u'best', 8235),
 (u'second', 7987),
 (u'big', 7245),
 (u'present', 6954),
 (u'public', 6900),
 (u'short', 6899),
 (u'various', 6855),
 (u'very', 6740)]

In [19]:
adj_NOUN_fd.most_common(40)

[(u'man', 50670),
 (u'time', 27787),
 (u'thing', 24171),
 (u'day', 21246),
 (u'woman', 13259),
 (u'way', 13226),
 (u'year', 12865),
 (u'eye', 12829),
 (u'hand', 12460),
 (u'part', 12177),
 (u'life', 11846),
 (u'people', 11307),
 (u'place', 11195),
 (u'girl', 10362),
 (u'one', 9507),
 (u'word', 9303),
 (u'friend', 8939),
 (u'face', 8731),
 (u'work', 8555),
 (u'house', 7599),
 (u'voice', 7344),
 (u'person', 7000),
 (u'night', 6969),
 (u'form', 6921),
 (u'lady', 6865),
 (u'child', 6706),
 (u'boy', 6630),
 (u'side', 6550),
 (u'line', 6486),
 (u'room', 6368),
 (u'fellow', 6307),
 (u'water', 5996),
 (u'power', 5948),
 (u'something', 5857),
 (u'manner', 5521),
 (u'world', 5453),
 (u'condition', 5368),
 (u'air', 5308),
 (u'light', 5302),
 (u'moment', 5275)]

#### Ideas...

So there are really two problems.  Looking at the frequency distribution tells me that some of the most common adjectives (e.g. "few", "other") are undesirable, because they aren't closely tied to any particular noun.  Knowing that leaves are `green` is more useful than knowing that leaves can be `other`.  (Also, certain common nouns are probably not as interesting, especially ones like `other`.)  I have two intuitions: 1) really common relationships between adjectives and nouns are less interesting/desirable than less common ones, and 2) at the same time, I really don't want `(adj,noun)` pairs that are totally aberrant.  Regarding the second point, I could filter out any adjective that doesn't occur at least `n` times as a modifier of a given noun, but that really penalizes uncommon nouns (which won't have many adjectives modifying them).
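
To make the problem with a fixed count threshold concrete, here is a minimal sketch on invented data. `filter_by_min_count` and `toy_pairs` are hypothetical names for illustration; this is not the filter used in the rest of the notebook.

```python
from collections import Counter, defaultdict

# Invented data: a common noun with many modifiers, a rare noun with few.
toy_pairs = (
    [("green", "leaf")] * 5
    + [("other", "leaf")] * 4
    + [("dry", "leaf")] * 3
    + [("floppy", "sombrero")] * 2
    + [("plumed", "sombrero")]
)

def filter_by_min_count(pairs, n):
    """Keep (adj, noun) pairs that occur at least n times."""
    counts = Counter(pairs)
    kept = defaultdict(list)
    for (adj, noun), count in counts.items():
        if count >= n:
            kept[noun].append(adj)
    return dict(kept)

filtered = filter_by_min_count(toy_pairs, n=3)
print(filtered)
```

With `n=3` the rare noun `sombrero` loses all of its adjectives, while the uninformative `other` survives for `leaf`. A raw count threshold fails on both intuitions at once.
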

My plan:

1. Filter out relations containing the most common adjectives as well as a handful of annoying nouns
2. Filter out those relations between words that are not strongly related according to a word2vec model

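First collect the 40 most common adjectives to exclude, then extend NLTK's stopword list with a handmade list of words (including the annoying nouns) to exclude.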

In [20]:
ADJS_nouns_to_exclude = [word for word,count in ADJ_noun_fd.most_common(40)]

In [21]:
print ADJS_nouns_to_exclude

[u'great', u'other', u'little', u'old', u'good', u'own', u'first', u'many', u'same', u'such', u'young', u'new', u'long', u'few', u'last', u'small', u'whole', u'large', u'more', u'white', u'several', u'much', u'next', u'certain', u'poor', u'black', u'high', u'human', u'full', u'different', u'only', u'general', u'best', u'second', u'big', u'present', u'public', u'short', u'various', u'very']


In [22]:
from nltk.corpus import stopwords
stops = stopwords.words('english')

In [23]:
stops = stops + ["whose","less","thee","thine","thy","thou","one"] ##adjectives nouns
stops = stops + ["time","thing","one","way","part","something"] ##annoying nouns

In [24]:
noun2adj = defaultdict(list)
for a,n in noun2adj_tuples:
    if n not in stops:
        if (a not in ADJS_nouns_to_exclude) and (a not in stops):
            noun2adj[n].append(a)

In [25]:
import gensim
word2vec_path = "/Users/kyle/Desktop/GoogleNews-vectors-negative300.bin"
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

Using Theano backend.


Below I define a pretty confusing loop that goes through the dictionary I just made and filters out those `(adj,noun)` pairs that are unrelated according to a word2vec model.  (Here I'm using just cosine similarity, but I could probably instead measure the probability of the text according to the model.)
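
For reference, gensim's `model.similarity(w1, w2)` returns the cosine similarity between the two word vectors. The sketch below computes the same quantity by hand on made-up 3-dimensional vectors. The real GoogleNews vectors have 300 dimensions, and the numbers in `toy_vectors` are invented purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings"; real word2vec vectors are learned, not hand-written.
toy_vectors = {
    "hat": [0.9, 0.1, 0.0],
    "floppy": [0.8, 0.3, 0.1],
    "political": [0.0, 0.2, 0.9],
}

related = cosine_similarity(toy_vectors["hat"], toy_vectors["floppy"])
unrelated = cosine_similarity(toy_vectors["hat"], toy_vectors["political"])
print(related > unrelated)
```

Vectors pointing in nearly the same direction score close to 1, and orthogonal ones score 0, which is why sorting by this score pushes pairs like `(political, hat)` to the bottom.
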

In [26]:
new_noun2adj = defaultdict(list)

for k in tqdm(noun2adj.keys()):
    adjs = []
    for adj in noun2adj[k]:
        try:
            adjs.append((adj,model.similarity(adj,k)))
        except:
            pass
    adjs.sort(key = lambda x: x[1], reverse=True)
    adjs_ = []
    for a,score in adjs:
        if a not in adjs_:
            adjs_.append(a)
            
    ## this is some weird hand-crafted logic to filter adjectives belonging to rare and common nouns differently...
    ## the idea is to only take the cream of the crop when there are a lot of options --- i.e. when the noun is common
    if len(adjs_)>20:
        for adj in adjs_[:10]:
            new_noun2adj[k].append(adj)
    elif len(adjs_)>10:
        for adj in adjs_[:10]:
            new_noun2adj[k].append(adj)
    elif len(adjs_)>2:
        adj = adjs_[0]
        new_noun2adj[k].append(adj)
    else:
        pass

100%|██████████| 34122/34122 [00:47<00:00, 724.92it/s] 
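
Since the loop above is hard to read, here is the same rank, de-duplicate, and trim logic pulled out into a small standalone function and run on invented scores. `rank_and_trim` is a hypothetical helper name, and the thresholds simply mirror the ones hard-coded above (where both upper tiers keep ten adjectives).

```python
def rank_and_trim(scored):
    """Sort (word, score) pairs by score, drop repeats, then keep fewer
    words when there are fewer distinct candidates."""
    ranked = []
    for word, _score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if word not in ranked:
            ranked.append(word)
    if len(ranked) > 10:   # common noun: keep the ten best
        return ranked[:10]
    if len(ranked) > 2:    # rarer noun: keep only the single best
        return ranked[:1]
    return []              # very rare noun: keep nothing

# Invented similarity scores, for illustration only.
rare = [("plumed", 0.4), ("floppy", 0.6), ("plumed", 0.4), ("khaki", 0.3)]
tiny = [("odd", 0.1), ("odd", 0.1)]
common = [("adj%d" % i, i / 100.0) for i in range(12)]

print(rank_and_trim(rare))
print(rank_and_trim(tiny))
print(len(rank_and_trim(common)))
```
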


In [27]:
new_noun2adj['hat']

[u'uniform',
 u'tweed',
 u'worn',
 u'broadcloth',
 u'pink',
 u'cowboy',
 u'khaki',
 u'blue',
 u'floppy',
 u'plumed']

In [28]:
with open("data/noun2adj.json","w") as f:
    json.dump(new_noun2adj,f)

###  `(s,v,o)` triples...

In [29]:
svo_triples_reformatted = [(s.lemma_,v.lemma_,o.text) for s,v,o in svo_triples]

Inspect data.

In [30]:
random.sample(svo_triples_reformatted,20)

[(u'ruth', u'close', u'novel'),
 (u'alliance', u'be', u'together'),
 (u'give', u'be', u'procedure'),
 (u'jungle wall', u'be', u'line'),
 (u'-PRON-', u'bear', u'yoke'),
 (u'-PRON-', u'be', u'as'),
 (u'fred', u'take', u'habit'),
 (u'-PRON-', u'hunt', u'superintendent'),
 (u'officer', u'take', u'care'),
 (u'remedy', u'banish', u'tender Sentiments'),
 (u'-PRON-', u'begin', u'flinging'),
 (u'miss flagg', u'take', u'manuscript'),
 (u'-PRON-', u'may believe', u'charge'),
 (u'troop', u'offer', u'show'),
 (u'-PRON-', u'say', u'nothing'),
 (u'man', u'be call', u'mad'),
 (u'there', u'be', u'demand'),
 (u'miltiades', u'become', u'household problem'),
 (u'-PRON-', u'deceive', u'them'),
 (u'court', u'pronounce', u'defendant--')]

In [31]:
Svo_fd = fd([s for s,v,o in svo_triples_reformatted])
sVo_fd = fd([v for s,v,o in svo_triples_reformatted])
svO_fd = fd([o for s,v,o in svo_triples_reformatted])

In [32]:
topS = [word for word,count in Svo_fd.most_common(40)]
print topS

[u'-PRON-', u'there', u'who', u'which', u'that', u'this', u'man', u'one', u'what', u'people', u'god', u'woman', u'father', u'all', u'boy', u'be', u'mother', u'some', u'thing', u'girl', u'king', u'other', u'thou', u'child', u'these', u'friend', u'name', u'"', u'eye', u'word', u'nothing', u'person', u'have', u'lady', u'many', u'those', u'wife', u'son', u'face', u'life']


In [33]:
topV = [word for word,count in sVo_fd.most_common(40)]
print topV

[u'be', u'have', u'give', u'take', u'seem', u'make', u'see', u'tell', u'know', u'have be', u'begin', u'say', u'find', u'want', u'call', u'leave', u'do', u'get', u'bring', u'hear', u'ask', u'become', u'show', u'send', u'come', u'keep', u'use', u'love', u'put', u'ought', u'hold', u'go', u'appear', u'wish', u'be not', u'would be', u'feel', u'reach', u'try', u'mean']


In [34]:
topO = [word for word,count in svO_fd.most_common(40)]
print topO

[u'him', u'it', u'me', u'you', u'them', u'her', u'to be', u'one', u'us', u'nothing', u'that', u'man', u'himself', u'more', u'something', u'way', u'place', u'time', u'thing', u'to', u'to do', u'all', u'this', u'anything', u'head', u'to see', u'to go', u'to make', u'to have', u'hand', u'themselves', u'to say', u'things', u'part', u'men', u'eyes', u'name', u'life', u'to take', u'herself']


The loop below filters out an `(s,v,o)` triple if any one of its elements meets certain exclusionary conditions.

In [35]:
svo_triples_filtered = []

for s,v,o in svo_triples_reformatted:
    Sval,Vval,Oval=False,False,False
    if len(s.split())==1: ## make sure it's not a complicated noun chunk
        if s.lower() not in stops:  ## make sure it's not a stopword
            Sval=True
    if v not in topV:   ## make sure it's not really common
        if len(v.split())==1:  ## make sure it's not a complicated verb chunk
            if v.lower() not in stops:  ## make sure it's not a stopwords
                Vval=True    
    if len(o.split())==1:    ### make sure it's not a complicated noun chunk
        if o.lower() not in stops:  ### make sure it's not a stopword
            if o.lower()==o:  ## this is kind of a hack to exclude proper nouns
                if o.endswith("ing")==False:  ## filter out annoying present participles
                    Oval=True
    if (Sval,Vval,Oval)==(True,True,True):
        svo_triples_filtered.append((s,v,o))
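
The same checks can also be written as a single predicate, which makes it easier to try them on a handful of triples. This is only a sketch: `keep_triple`, `toy_stops`, and `toy_common_verbs` are hypothetical names, and the toy sets stand in for the much larger `stops` and `topV` lists built above.

```python
def keep_triple(s, v, o, stops, common_verbs):
    """Return True only if subject, verb, and object all pass their checks."""
    subject_ok = len(s.split()) == 1 and s.lower() not in stops
    verb_ok = (
        v not in common_verbs          # not one of the most common verbs
        and len(v.split()) == 1        # not a multi-word verb chunk
        and v.lower() not in stops
    )
    object_ok = (
        len(o.split()) == 1            # not a multi-word noun chunk
        and o.lower() not in stops
        and o.lower() == o             # crude proper-noun filter
        and not o.endswith("ing")      # drop present participles
    )
    return subject_ok and verb_ok and object_ok

# Small stand-ins for the notebook's `stops` and `topV`.
toy_stops = {"there", "them", "nothing"}
toy_common_verbs = {"be", "take", "say"}

# Invented triples in the style of the sample shown earlier.
toy_triples = [
    ("king", "build", "tope"),
    ("there", "be", "demand"),
    ("jungle wall", "be", "line"),
    ("remedy", "banish", "tender Sentiments"),
    ("officer", "enjoy", "flinging"),
    ("officer", "take", "care"),
    ("seed", "contain", "tannin"),
]

kept = [t for t in toy_triples
        if keep_triple(*t, stops=toy_stops, common_verbs=toy_common_verbs)]
print(kept)
```
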

In [36]:
noun2v_o = defaultdict(list)
for s,v,o in svo_triples_filtered:
    noun2v_o[s].append((v,o))

In [37]:
noun2v_o["king"]

[(u'decline', u'share'),
 (u'request', u'presence'),
 (u'present', u'offerings'),
 (u'beg', u'monks'),
 (u'present', u'offerings'),
 (u'receive', u'increase'),
 (u'divide', u'relics'),
 (u'employ', u'men'),
 (u'employ', u'protect'),
 (u'build', u'tope'),
 (u'prepare', u'supply'),
 (u'caparison', u'elephant'),
 (u'present', u'flowers'),
 (u'present', u'incense'),
 (u'turn', u'furrow'),
 (u'promise', u'command'),
 (u'cast', u'forth'),
 (u'acknowledge', u'superior'),
 (u'seek', u'alliance'),
 (u'offer', u'price'),
 (u'rest', u'responsibility'),
 (u'undertake', u'repair'),
 (u'command', u'celebration'),
 (u'detest', u'religion'),
 (u'prove', u'faithful'),
 (u'betroth', u'daughters'),
 (u'raise', u'siege'),
 (u'provide', u'vessels'),
 (u'summon', u'force'),
 (u'stake', u'weight'),
 (u'cause', u'sick'),
 (u'cause', u'wounded'),
 (u'cede', u'conquests'),
 (u'allow', u'suspicion'),
 (u'allow', u'cruelty'),
 (u'open', u'gates'),
 (u'order', u'proceeded'),
 (u'order', u'appeared'),
 (u'declare',

Again, filter out those `(s,v,o)` combinations in which the `v` and `o` are not similar according to the word2vec model.

In [38]:
new_noun2v_o = defaultdict(list)

for k in tqdm(noun2v_o.keys()):
    vos = []
    for verb,obj in noun2v_o[k]:
        try:
            vos.append((verb,obj,model.similarity(obj,verb)))
        except:
            pass
    vos.sort(key = lambda x: x[2], reverse=True)
    
    ##again, logic to handle rare and common nouns differently
    if len(vos)>20:
        for verb,obj,value in vos[:10]:
            new_noun2v_o[k].append((verb,obj))
    elif len(vos)>10:
        for verb,obj,value in vos[:5]:
            new_noun2v_o[k].append((verb,obj))
    elif len(vos)>2:
        verb,obj,value = vos[0]
        new_noun2v_o[k].append((verb,obj))
    else:
        pass

100%|██████████| 23431/23431 [00:11<00:00, 2037.84it/s]


In [39]:
new_noun2v_o["seed"]

[(u'produce', u'plant'),
 (u'bruise', u'head'),
 (u'burst', u'breaks'),
 (u'produce', u'fruits'),
 (u'prove', u'none'),
 (u'bear', u'fruit'),
 (u'lie', u'latent'),
 (u'beg', u'bread'),
 (u'contain', u'tannin'),
 (u'contain', u'mixed')]

In [40]:
with open("data/noun2v_o.json","w") as f:
    json.dump(new_noun2v_o,f)
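
One thing to keep in mind when reading this file back: JSON has no tuple type, so the `(verb, object)` tuples come back as two-element lists, and the `defaultdict` comes back as a plain `dict`. A minimal round-trip sketch on toy data (`toy_noun2v_o` is an invented stand-in for `new_noun2v_o`):

```python
import json
from collections import defaultdict

toy_noun2v_o = defaultdict(list)
toy_noun2v_o["seed"].append(("bear", "fruit"))
toy_noun2v_o["seed"].append(("contain", "tannin"))

# Round-trip through a JSON string instead of a file on disk.
serialized = json.dumps(toy_noun2v_o)
loaded = json.loads(serialized)
print(loaded["seed"])

# Convert back to tuples if downstream code needs hashable pairs.
restored = {noun: [tuple(vo) for vo in vos] for noun, vos in loaded.items()}
print(restored["seed"])
```
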