# Recursive Ekphrasis Gym

## Processing `(possessor,possessed)` pairs

In [1]:
! wget https://raw.githubusercontent.com/kbooten/ekphrasisgym/main/possessor2possessed_tuples_with_filenumber.json

--2022-05-27 11:37:30--  https://raw.githubusercontent.com/kbooten/ekphrasisgym/main/possessor2possessed_tuples_with_filenumber.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18738032 (18M) [text/plain]
Saving to: ‘possessor2possessed_tuples_with_filenumber.json.3’


2022-05-27 11:37:31 (143 MB/s) - ‘possessor2possessed_tuples_with_filenumber.json.3’ saved [18738032/18738032]



In [2]:
import json

with open('possessor2possessed_tuples_with_filenumber.json','r') as f:
  possessor2possessed_tuples_with_filenumber_lists = json.load(f)

In [3]:
possessor2possessed_tuples_with_filenumber_lists[:4]

[['2068', ['captain', 'property']],
 ['2068', ['dealer', 'wagon']],
 ['2068', ['takin', 'carpets']],
 ['2068', ['body', 'eyesight']]]

In [4]:
from collections import defaultdict

possessor2possessed = defaultdict(list)

In [5]:
## each record has the shape [filenumber, [possessor, possessed]]
for i in possessor2possessed_tuples_with_filenumber_lists:
  filenumber = i[0]
  possessor = i[1][0]
  possessed = i[1][1]
  possessor2possessed[possessor].append((possessed,filenumber))
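
The grouping step can be tried on a few made-up records. This is a minimal sketch: the rows in `records` are invented for illustration (they are not taken from the JSON file), and `grouped` is a hypothetical stand-in for `possessor2possessed`.

```python
from collections import defaultdict

# Made-up records in the same shape as the JSON file:
# [filenumber, [possessor, possessed]]
records = [
    ["1", ["captain", "ship"]],
    ["1", ["captain", "ship"]],
    ["2", ["captain", "ship"]],
    ["2", ["dealer", "wagon"]],
]

grouped = defaultdict(list)
for filenumber, (possessor, possessed) in records:
    # Keep the file number next to each possessed noun so that
    # repeats within one file can be collapsed later.
    grouped[possessor].append((possessed, filenumber))

print(dict(grouped))
```

Keeping the file number at this stage is what lets the later cleanup count a possessed noun once per source file.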

In [6]:
possessor2possessed["curate"][:10]

[('garden', '43754'),
 ('daughters', '48198'),
 ('side', '28684'),
 ('children', '145'),
 ('children', '145'),
 ('son', '145'),
 ('place', '22008'),
 ('love', '2686'),
 ('wife', '11825'),
 ('ministry', '46570')]

Use WordNet as a rough spell-check: a token counts as a word only if it has at least one noun synset.

In [7]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
wn.synsets("hat",pos=wn.NOUN)

[Synset('hat.n.01'), Synset('hat.n.02')]

In [9]:
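for key,values in possessor2possessed.items():
  ## set() collapses repeated (possessed, filenumber) pairs, so a possessed noun
  ## is kept at most once per file, but once for every file it occurs in
  values_unique_per_file = [thing for thing,f in set(values)]
  values_unique_per_file = [v for v in values_unique_per_file if wn.synsets(v,pos=wn.NOUN)] ## spellcheck via wordnet: keep tokens with a noun synset
  possessor2possessed[key] = values_unique_per_file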
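
To see what the per-file deduplication does, here is a small sketch on invented `(possessed, filenumber)` pairs. The WordNet check is left out so the example runs without downloading any corpus data, and `sorted` is used only to make the result order predictable, since set order is arbitrary.

```python
# Invented pairs: "ship" appears twice in file "1" and once in file "2".
pairs = [("ship", "1"), ("ship", "1"), ("ship", "2"), ("wagon", "2")]

# set() removes the duplicate ("ship", "1") pair; the file number is then dropped.
once_per_file = sorted(thing for thing, _filenumber in set(pairs))

print(once_per_file)  # ['ship', 'ship', 'wagon']
```

After this step, the number of times a noun appears in a possessor's list is the number of distinct files in which the pair was found, not the raw number of mentions.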

In [10]:
possessor2possessed["curate"][:5]

['chamber', 'presence', 'rage', 'jealousy', 'calves']

In [11]:
possessor2possessed["sake"][:3]

['sake', 'sake', 'cause']

Remove values that are identical to their key (for example, `sake` listed as possessing `sake`).

In [12]:
for key,values in possessor2possessed.items():
  possessor2possessed[key] = [v for v in values if v!=key]

In [13]:
possessor2possessed["sake"][:3]

['cause']

Some values are just a single letter. Drop every value shorter than three characters.

In [14]:
for key,values in possessor2possessed.items():
  possessor2possessed[key] = [v for v in values if len(v)>2]
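
The two cleanup passes above could also be folded into one helper. This is only a sketch: `clean_values` and its `min_len` parameter are hypothetical names, not part of the notebook.

```python
def clean_values(key, values, min_len=3):
    """Drop values equal to the key and values shorter than min_len characters."""
    return [v for v in values if v != key and len(v) >= min_len]


cleaned = clean_values("sake", ["sake", "sake", "cause", "a", "ox"])
print(cleaned)  # ['cause']
```

With `min_len=3` the length test matches the `len(v)>2` condition used in the cell above.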

### Rank by TF-IDF
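
For reference, the weighting computed in the cells below can be written out as follows. The symbols are introduced here for illustration: $V_p$ is the cleaned list of possessed nouns for possessor $p$, $c(w,p)$ is the number of times $w$ occurs in $V_p$, $N$ is `total_number_of_sets`, and $\mathrm{df}(w)$ is `word2doc_count[w]`.

$$
\mathrm{tf}(w,p) = \frac{c(w,p)}{|V_p|}, \qquad
\mathrm{idf}(w) = \frac{N}{\mathrm{df}(w)}, \qquad
\mathrm{weight}(w,p) = \mathrm{tf}(w,p)\cdot\mathrm{idf}(w)
$$

Note that the inverse document frequency here is the raw ratio $N/\mathrm{df}(w)$ with no logarithm, so rare words are weighted more steeply than under the usual log-scaled variant.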

In [15]:
sets_of_words = list(possessor2possessed.values())

In [16]:
total_number_of_sets = len(sets_of_words)
total_number_of_sets

16254

In [17]:
## defaultdict was already imported in In [4]
word2doc_count = defaultdict(int) ## word -> number of possessors whose list contains it

In [18]:
for s in sets_of_words:
  for t in set(s): ## count each word once per possessor
    word2doc_count[t]+=1

In [19]:
word2doc_count['friend']

255

In [20]:
possessor2possessed_and_weights = {}

In [21]:
for key,words in possessor2possessed.items():
  if len(words)!=0: ## no empty sets
    unique_words = list(set(words))
    possessed_and_weights = []
    for w in unique_words:
      tf = words.count(w)/len(words) ## share of this possessor's list taken up by w
      idf = total_number_of_sets/word2doc_count[w] ## raw ratio, not log-scaled
      tfidf = tf * idf
      possessed_and_weights.append((w,tfidf))
    possessor2possessed_and_weights[key]=possessed_and_weights
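
The same weighting can be checked by hand on a tiny example. This sketch uses invented data (`toy`) and swaps in `collections.Counter` for the `list.count` calls, which is an alternative to the cell above rather than the notebook's own code. It also stores the weights in a dict per possessor instead of a list of tuples.

```python
from collections import Counter

# Invented possessor -> possessed lists; "owl" is empty on purpose.
toy = {
    "wolf": ["howl", "den", "eyes", "howl"],
    "dog": ["eyes", "bark"],
    "cat": ["eyes"],
    "owl": [],
}

n_sets = len(toy)  # empty lists still count toward the total, as in the notebook

# Document frequency: number of possessors whose list contains the word.
doc_count = Counter()
for words in toy.values():
    doc_count.update(set(words))

weights = {}
for key, words in toy.items():
    if not words:
        continue  # skip possessors with nothing left after cleanup
    counts = Counter(words)
    weights[key] = {
        w: (c / len(words)) * (n_sets / doc_count[w]) for w, c in counts.items()
    }

print(weights["wolf"])
```

For `wolf`, `howl` fills half the list and belongs to one possessor out of four, so its weight is 0.5 × 4 = 2.0, while `eyes`, shared by three possessors, gets 0.25 × 4/3.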

In [22]:
possessor2possessed_and_weights['wolf']

[('mantle', 1.7857613711272247),
 ('den', 7.488022113022114),
 ('shadow', 0.770697012802276),
 ('ears', 0.22254169062679702),
 ('charge', 0.717806041335453),
 ('tooth', 1.7432432432432432),
 ('road', 0.9386694386694386),
 ('bite', 2.7117117117117115),
 ('heart', 0.3636567030606766),
 ('eyes', 0.40994521957567864),
 ('head', 2.203832595136943),
 ('coat', 0.3453595104538501),
 ('belly', 3.6608108108108106),
 ('cubs', 9.152027027027026),
 ('press', 4.0675675675675675),
 ('howl', 73.21621621621621),
 ('lair', 4.437346437346437),
 ('brain', 0.45475910693301996),
 ('neck', 0.21597703898588855),
 ('attention', 0.18535750940814233),
 ('form', 0.4693347193347193),
 ('rapacity', 12.202702702702702),
 ('sire', 4.576013513513513),
 ('skin', 5.198784583399968),
 ('log', 5.22972972972973),
 ('challenge', 2.2186732186732185),
 ('path', 0.42567567567567566),
 ('attitude', 0.42321512263708794),
 ('fur', 0.8513513513513513),
 ('cage', 1.5253378378378377),
 ('joint', 20.91891891891892),
 ('shape', 1.2844

In [23]:
with open('possessor2possessed_and_weights.json','w') as f:
  json.dump(possessor2possessed_and_weights,f)
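
The weights are saved unsorted, so a consumer of the file has to rank them. Below is a minimal sketch of reading such a file back and taking the top entries. The `sample` values, the temporary file name, and the `top_k` helper are all hypothetical. One detail worth knowing is that JSON has no tuple type, so each `(word, weight)` tuple comes back as a two-element list.

```python
import json
import os
import tempfile

# Hypothetical weights in the same shape the notebook writes out.
sample = {"wolf": [("den", 7.5), ("howl", 73.2), ("eyes", 0.4)]}

path = os.path.join(tempfile.mkdtemp(), "weights_demo.json")
with open(path, "w") as f:
    json.dump(sample, f)

with open(path) as f:
    loaded = json.load(f)


def top_k(weights, possessor, k=2):
    """Return the k highest-weighted possessed nouns for a possessor."""
    ranked = sorted(weights[possessor], key=lambda pair: pair[1], reverse=True)
    return [word for word, _weight in ranked[:k]]


print(loaded["wolf"][0])      # ['den', 7.5]
print(top_k(loaded, "wolf"))  # ['howl', 'den']
```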

In [24]:
from google.colab import files
files.download('/content/possessor2possessed_and_weights.json')


***