# Taaltheorie en Taalverwerking · 2019 · Week 20

In this assignment, we will work more with WordNet and explore music processing with NLTK tools. Don't forget to load WordNet!

In [3]:
# FILL THIS IN FOR YOUR GROUP, also name your file as: tttv-w20-<group>-<name1>-<name2>.ipynb

# Group        : D
# Name - UvaID : Joshua de Roos - 11242736
# Name - UvaID : Lodewijk van Keizerswaard - 11054115
# Date         : 24-05-2019

In [4]:
from math import *
import nltk
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')

You will also need to install and use the [**requests**](https://2.python-requests.org/en/master/) library for this assignment, either by typing `pip install requests` in a terminal or by using your Python package manager.

In [5]:
import requests

### Question 1 (11 pts total)

In this exercise you will explore the output of a distributional semantic model and compare its semantic similarity ranking to that obtained with path-length distance in an ontology. 

[Indra](http://lambda3.org/Indra/) is a library for working with several distributional semantic models and includes a number of pre-trained models that can be queried online. For a target word, Indra will return its $n$ nearest semantic neighbours ordered by similarity strength, calculated using the similarity measure of your choice.
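To make "nearest semantic neighbours ordered by similarity strength" concrete, here is a minimal sketch of a cosine-similarity ranking. The three-dimensional vectors and the helper names `cosine` and `nearest_neighbours` are invented for illustration; they are not part of Indra, whose models use much higher-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(target, vectors, n):
    """Return the n (word, score) pairs most similar to `target`, best first."""
    scores = {w: cosine(vectors[target], v) for w, v in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy vectors, made up for this example only.
toy_vectors = {
    'sailboat': [1.0, 0.9, 0.1],
    'catamaran': [0.9, 1.0, 0.2],
    'potato': [0.1, 0.2, 1.0],
}

for word, score in nearest_neighbours('sailboat', toy_vectors, 2):
    print(word, round(score, 3))
```

As in the Indra output, the target word itself comes first, because a vector always has cosine similarity 1 (up to rounding) with itself.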

For this assignment, we will use the Word2Vec model on English-language corpora (`wiki-2018` and `googlenews`), and we will restrict ourselves to the five most similar words according to cosine similarity. The query below searches `wiki-2018` for the word *sailboat*. The response is a dictionary containing the most similar words according to Word2Vec, together with their cosine similarity to *sailboat*.

In [6]:
r = requests.post('http://indra.lambda3.org/neighbors/relatedness', json = {
        'corpus': 'wiki-2018',
        'model': 'W2V',
        'language': 'EN', 
        'topk': 5, 
        'scoreFunction': 'COSINE',
        'terms': ['sailboat']
})
r.json()

# {'corpus': 'wiki-2018',
#  'model': 'W2V',
#  'language': 'EN',
#  'topk': 5,
#  'terms': {'sailboat': {'sailboat': 0.9999999999999999,
#    'catamaran': 0.7580487287345087,
#    'trimaran': 0.7260682723499513,
#    'dinghy': 0.7202068755210131,
#    'multihull': 0.6924827761338878}}}

#### Question 1.1 (3 pts)

Search for the target word *potato* using the same parameters as the example above. As output, you should get a list of five words including *potato*, with a similarity score that indicates how similar they are to *potato* according to the model. 

Use NLTK to determine the lowest common hypernym of these five words in *WordNet*.

**Hint:** Don't forget that you need to check all possible senses for each word. The lowest common hypernym is the lowest-level synset that subsumes at least one sense of all five words. You probably *don't* want to use the NLTK **lowest_common_hypernyms()** method. (Can you see why?)
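The idea in the hint can be sketched on a toy taxonomy: gather the ancestors of *every* sense of each word, intersect those sets across words, and pick the deepest node that survives. The taxonomy, the sense inventory, and the function names below are all invented for illustration and do not come from WordNet.

```python
# Toy taxonomy (node -> immediate hypernyms), invented for illustration.
TOY_HYPERNYMS = {
    'entity': [],
    'food': ['entity'],
    'plant': ['entity'],
    'vegetable': ['food'],
    'root_vegetable': ['vegetable'],
    'potato.food': ['root_vegetable'],
    'potato.plant': ['plant'],
    'turnip.food': ['root_vegetable'],
    'turnip.plant': ['plant'],
    'puree.food': ['food'],
}
# Toy sense inventory (word -> senses), also invented.
TOY_SENSES = {
    'potato': ['potato.food', 'potato.plant'],
    'turnip': ['turnip.food', 'turnip.plant'],
    'puree': ['puree.food'],
}

def ancestors(node):
    """All nodes reachable upwards from `node`, including the node itself."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in TOY_HYPERNYMS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def depth(node):
    """Length of the shortest path from `node` up to a root."""
    parents = TOY_HYPERNYMS[node]
    return 0 if not parents else 1 + min(depth(p) for p in parents)

def lowest_common_hypernym(words):
    """Deepest node that subsumes at least one sense of every word."""
    per_word = [set().union(*(ancestors(s) for s in TOY_SENSES[w])) for w in words]
    common = set.intersection(*per_word)
    return max(common, key=depth) if common else None

print(lowest_common_hypernym(['potato', 'turnip']))
print(lowest_common_hypernym(['potato', 'turnip', 'puree']))
```

Because the union over senses is taken per word before intersecting, a single pairwise call on two fixed synsets cannot give the same answer for five words with several senses each.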

In [None]:
def get_trees(term):
    """Collect every synset on any hypernym path of any noun sense of `term`."""
    nodes = set()
    for sense in wn.synsets(term, pos=wn.NOUN):
        for path in sense.hypernym_paths():
            nodes.update(path)
    return nodes

def get_common_hypernym(intersection):
    """Return the deepest synset in `intersection`, or None if it is empty."""
    depth = -1
    ret_node = None
    for node in intersection:
        d = node.min_depth()
        if d > depth:
            depth = d
            ret_node = node
    return ret_node

r = requests.post('http://indra.lambda3.org/neighbors/relatedness', json = {
        'corpus': 'wiki-2018',
        'model': 'W2V',
        'language': 'EN', 
        'topk': 5, 
        'scoreFunction': 'COSINE',
        'terms': ['potato']
})
# The neighbours sit under 'terms' -> 'potato', as in the sailboat response above.
ret = list(r.json()['terms']['potato'].keys())
print(ret)

# Neighbours returned in our run (hard-coded originally):
# potato, corn, rutabaga, potatoes, puree.
# A synset that subsumes at least one sense of every word is in all five sets.
intersection = set.intersection(*(get_trees(word) for word in ret))

print(get_common_hypernym(intersection))

# print(ret)
# print(ret['terms']['potato'])

# sn1 = wn.synsets('potato', pos=wn.NOUN)
# sn1[0].common_hypernyms(sn1[1])

# ss1 = wn.synsets('potato', pos=wn.NOUN)
# ss2 = wn.synsets('corn', pos=wn.NOUN)
# ss3 = wn.synsets('rutabaga', pos=wn.NOUN)
# ss4 = wn.synsets('puree', pos=wn.NOUN)



# print(ss1[0].hypernym_paths())


#### Question 1.2 (3 pts)

Repeat the same exercise using the `googlenews` corpus instead.

In [None]:
# Add your code here.

#### Question 1.3 (3 pts)

Implement a function called **co\_hyponym(word1, word2)** to check whether two *words* (not synsets) are co-hyponyms or sister terms (i.e., whether any sense of one of the words has some immediate hypernym in common with some sense of the other word).

Which of the semantic neighbours on your lists from the previous two questions are co-hyponyms of *potato*? (Do not count *potatoes*, which has the same lemma as *potato*.)
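The definition above amounts to a set intersection over immediate hypernyms. The sketch below shows this on invented toy data; `TOY_PARENTS`, `TOY_SENSES`, and `co_hyponym_toy` are hypothetical names for illustration, not a WordNet-backed solution.

```python
# Toy data, invented for illustration: each sense maps to its immediate hypernyms.
TOY_PARENTS = {
    'potato.food': {'root_vegetable'},
    'potato.plant': {'vine'},
    'turnip.food': {'root_vegetable'},
    'puree.food': {'food'},
}
TOY_SENSES = {
    'potato': ['potato.food', 'potato.plant'],
    'turnip': ['turnip.food'],
    'puree': ['puree.food'],
}

def immediate_hypernyms(word):
    """Union of the immediate hypernyms of every sense of `word`."""
    parents = set()
    for sense in TOY_SENSES.get(word, []):
        parents |= TOY_PARENTS[sense]
    return parents

def co_hyponym_toy(word1, word2):
    """True if some sense of each word shares an immediate hypernym."""
    return bool(immediate_hypernyms(word1) & immediate_hypernyms(word2))

print(co_hyponym_toy('potato', 'turnip'))  # True
print(co_hyponym_toy('potato', 'puree'))   # False
```

With WordNet, `wn.synsets(word)` and `Synset.hypernyms()` would play the roles of `TOY_SENSES` and `TOY_PARENTS`.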

In [None]:
# Check if they have the same parent.
def co_hyponym(word1, word2):
    pass

# Make a print statement like the following for every neighbour.
print(co_hyponym("potato", "potatoes"))

#### Question 1.4 (2 pts)

Give the results of **textbook_similarity** queries between *potato* and each of the other semantic neighbours from previous questions, using your function from last week. Give the resulting list of semantic neighbours ordered by similarity strength (according to WordNet path-length) and compare this ranking to the rankings from Indra.
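Last week's exact definition of **textbook_similarity** is not repeated here, so the following is only a sketch under an assumption: similarity is taken to be $-\log(\text{pathlen})$, with pathlen equal to one plus the number of edges on the shortest path. The toy graph and all function names are invented; if your definition from last week differs, the ranking logic still applies.

```python
from collections import deque
from math import log

# Toy undirected taxonomy edges, invented for illustration.
TOY_EDGES = [
    ('potato', 'root_vegetable'),
    ('turnip', 'root_vegetable'),
    ('root_vegetable', 'vegetable'),
    ('vegetable', 'food'),
    ('puree', 'food'),
]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def edge_distance(graph, start, goal):
    """Number of edges on the shortest path, found by breadth-first search."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two nodes

def path_similarity_toy(graph, a, b):
    """Assumed formula: -log(1 + number of edges on the shortest path)."""
    dist = edge_distance(graph, a, b)
    return None if dist is None else -log(dist + 1)

graph = build_graph(TOY_EDGES)
ranking = sorted(['puree', 'turnip'],
                 key=lambda w: path_similarity_toy(graph, 'potato', w),
                 reverse=True)
print(ranking)
```

Sorting in descending order puts the closest neighbour first, which makes the list directly comparable to the Indra ranking.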

In [None]:
def textbook_similarity(word1, word2, pos):
    # Copy your implementation from last week, or ask your TA for help if you had it wrong.
    pass

# Make a print statement like the following for every neighbour.
print(textbook_similarity("potato", "potatoes", pos = wn.NOUN))

## Question 2: Parsing Eurovision (7 pts total)

This question will introduce the basics of grammar-based syntactic analysis of musical harmony.

### Question 2.1 (1 pt)

Go to http://chordify.net and run an analysis of France's official entry for the Eurovision Song Contest 2019 by pasting the following link into the search box: https://www.youtube.com/watch?v=dw7WqoSHtgU. Ignore the Eurovision branding at the beginning of the video. Starting from the entrance of the piano at the beginning of the song, copy the first 7 chords as they appear in the Chordify interface into a list of Python strings. You may find it slightly faster to switch from the *Diagrammen* (diagrams) view to the *Akkorden* (chords) view. Use the '#' character for sharp signs.

**Answer:**

In [None]:
french_chords = []

### Question 2.2 (3 pts)

The following grammar is a (somewhat simplified) implementation of the harmonic grammar Fred Lerdahl proposes in his book *Tonal Pitch Space* (2001). In addition to the traditional harmonic classes of tonic, dominant, and subdominant harmony, Lerdahl's grammar includes a *departure* class (`Dep`), a *return* class (`Ret`), and a *neighbour* class (`N`). His definitions of the harmonic classes in terms of Roman numerals, the 'lexical rules' for this grammar, are commented out.

The tonal centre of France's Eurovision entry is F$\sharp$ (Chordify gets it wrong). Use the principles we discussed in class to replace the Roman-numeral lexical rules with lexical rules for all of the chord symbols in `french_chords` (e.g., `T -> 'G'` for pieces in the key of G).

Run the code block when you are finished to see how many parse trees Lerdahl's grammar can compute for this sentence.

**Hint:** If you still find the rules for converting between chord names and Roman numerals confusing, try browsing Wikipedia's surprisingly good collection of articles on harmony, starting here: https://en.wikipedia.org/wiki/Roman_numeral_analysis
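As a rough aid for the conversion, the sketch below maps a chord symbol to a Roman numeral by counting semitones between the chord root and the tonal centre. It assumes major-scale degrees and ignores chord quality (suffixes such as `m` or `7`); a minor key would need a different degree table. The names `root_pitch_class` and `roman_numeral` are hypothetical helpers, not part of NLTK.

```python
# Semitone offsets of the natural note names above C.
PITCH_CLASS = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}
# Assumption: scale degrees of a major scale, keyed by semitones above the tonic.
MAJOR_DEGREES = {0: 'I', 2: 'II', 4: 'III', 5: 'IV', 7: 'V', 9: 'VI', 11: 'VII'}

def root_pitch_class(chord):
    """Pitch class (0-11) of a chord symbol's root, e.g. 'C#m' -> 1."""
    pc = PITCH_CLASS[chord[0]]
    if chord[1:2] == '#':
        pc += 1
    elif chord[1:2] == 'b':
        pc -= 1
    return pc % 12

def roman_numeral(chord, key):
    """Major-scale degree of `chord` in `key`, or None if the root is chromatic."""
    interval = (root_pitch_class(chord) - root_pitch_class(key)) % 12
    return MAJOR_DEGREES.get(interval)

print(roman_numeral('G', 'G'))     # I
print(roman_numeral('D', 'G'))     # V
print(roman_numeral('C#', 'F#'))   # V
print(roman_numeral('D#m', 'F#'))  # VI
```

Each Roman numeral then tells you which harmonic classes in the commented rules the chord symbol belongs to.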

In [None]:
from nltk import CFG
from nltk.parse.chart import LeftCornerChartParser

lerdahl_grammar = CFG.fromstring("""
  P -> T
  T -> T T
  T -> D T
  D -> D D
  D -> S D
  S -> S S
  T -> T N T
  D -> D N D
  S -> S N S
  T -> T Dep
  D -> D Dep
  S -> S Dep
  Dep -> N Dep
  T -> Ret T
  D -> Ret D
  S -> Ret S
  
  # Replace the following strings with the actual chord symbols you need.
  # 
  # T -> 'I'
  # D -> 'V' | 'VII'
  # S -> 'II' | 'III' | 'IV' | 'VI' | 'VII'
  # Dep -> 'II' | 'III' | 'IV' | 'V' | 'VI' | 'VII' 
  # Ret -> 'II' | 'III' | 'IV' | 'V' | 'VI' | 'VII' 
  # N -> 'II' | 'III' | 'IV' | 'V' | 'VI' | 'VII'
""")

lerdahl_parser = LeftCornerChartParser(lerdahl_grammar)
lerdahl_parses = lerdahl_parser.parse(french_chords)

# Print the total number of parses as well as the actual parse trees.
lerdahl_sum = 0
for t in lerdahl_parses: 
    lerdahl_sum = lerdahl_sum + 1
    t.pretty_print()
print(lerdahl_sum, 'trees')
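The tree count can grow quickly because recursive rules such as `T -> T T` are ambiguous. As a small illustration, the sketch below counts the trees that this one rule alone (plus a lexical rule for a single tonic chord) builds over a sequence of $n$ identical tonic chords; the function name `count_trees` is invented here, and the full grammar has more rules and will generally produce different totals.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(n):
    """Number of binary trees that T -> T T builds over n tonic chords."""
    if n == 1:
        return 1  # a single chord is covered by the lexical rule only
    # Split the sequence into a left part of k chords and a right part of n - k.
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))

for n in range(1, 8):
    print(n, count_trees(n))
```

For seven chords this single rule already yields 132 trees (the sixth Catalan number), which is worth keeping in mind when comparing the two grammars.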

### Question 2.3 (2 pts)

The next grammar is based on the foundations of Chordify (Bas de Haas, *Music Information Retrieval Based on Tonal Harmony*, 2012). Chordify uses a different set of harmonic classes: tonic, dominant, subdominant, and *tonic prolongation* (`TPG`). It also defines the members of the traditional classes differently from Fred Lerdahl; its definitions are based on a beautiful branch of music theory known as neo-Riemannian theory. Replace the commented 'lexical rules' in this grammar with actual chord symbols from `french_chords`, just like you did in the previous question, and run the block.

In [None]:
chordify_grammar = CFG.fromstring("""
  P -> PCPa
  P -> PCPb
  P -> HCP
  P -> P P

  PCPa -> D T | D D T
  PCPa -> PCPa T
  PCPb -> T D T
  HCP -> T D
  HCP -> T HCP
  D -> S D
  T -> TPG
  
  T -> T T
  D -> D D
  S -> S S
  
  # Replace the following strings with the actual chord symbols you need.
  # 
  # T -> 'I'
  # D -> 'V' | 'VII'
  # S -> 'II' | 'IV' 
  # TPG -> 'III' | 'VI'
""")

chordify_parser = LeftCornerChartParser(chordify_grammar)
chordify_parses = chordify_parser.parse(french_chords)

# Print the total number of parses as well as the actual parse trees.
chordify_sum = 0
for t in chordify_parses: 
    chordify_sum = chordify_sum + 1
    t.pretty_print()
print(chordify_sum, 'trees')

### Question 2.4 (1 pt)

Which grammar is more practical for daily use, Lerdahl's or Chordify's? Why?

**Answer:**