# COLX 561 Lab Assignment 4: Discourse (Cheat sheet)

## Assignment objectives

In this assignment you will:
- Manually analyze a short passage using the RST and Centering Theory frameworks
- Show that texts in the Brown corpus have lexical coherence that can be identified using Word2Vec embeddings 
- Build a rule-based anaphor resolution system in the context of an existing annotated corpus (WikiCoref)

## Getting started

Run the code below to access relevant modules (you can add to this as needed):

In [1]:
#provided code
import nltk
import os
import re
import gensim
import numpy as np
from scipy.spatial.distance import cosine
from nltk.corpus import brown
from nltk import pos_tag
from bs4 import BeautifulSoup
from nltk.data import find
from nltk.corpus import wordnet as wn

## Tidy submission

rubric={mechanics:1}

To get the marks for tidy submission:

- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

### Exercise 1: Discourse analysis

This exercise involves doing discourse analysis for the following 5 sentences:

#### 1.1
rubric={raw:3}

Provide a reasonable RST parse for these sentences. For each relation, provide the relation name, the two spans involved (in order), and which of the two is the nucleus (if any). For example:

```
Condition 1-2 3 0
``` 

which means there is a *condition* relationship between sentences 1-2 and sentence 3, and the nucleus is the first (i.e., 0th) of the two spans. There is more than one reasonable interpretation.

Answer:

- Background
- Cause 
- Elaboration 
- Contrast 

#### 1.2
rubric={raw:3}

Provide a centering algorithm analysis of these sentences. For each sentence, provide the $C_f$, $C_b$, and $C_p$, and the type of transition from the previous sentence, if appropriate. For example:

4. $C_f$ = {Ted, Bill}, $C_b$ = Bill, $C_p$ = Ted, Retain
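
The transition labels follow from comparing the current $C_b$ with the previous $C_b$ and with the current $C_p$. The sketch below encodes the standard four-way classification from Centering Theory; the function name `centering_transition` is hypothetical, and treating an undefined previous $C_b$ as "unchanged" is the usual convention, stated here as an assumption. The usage line assumes the previous sentence also had $C_b$ = Bill, which the example above does not state.

```python
def centering_transition(cb_prev, cb, cp):
    """Classify a centering transition from the previous Cb, current Cb, and current Cp."""
    if cb is None:
        # No backward-looking center (e.g. the first sentence): no transition
        return None
    # An undefined previous Cb is treated like an unchanged Cb (assumed convention)
    same_cb = cb_prev is None or cb == cb_prev
    if same_cb:
        return "Continue" if cb == cp else "Retain"
    return "Smooth shift" if cb == cp else "Rough shift"


# Assuming the previous Cb was also Bill:
print(centering_transition("Bill", "Bill", "Ted"))  # Retain
```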

Answer:

1. $C_f$ = ... , None
2. $C_f$ = ... , Continue
3. $C_f$ = ... , Smooth shift
4. $C_f$ = ... , Retain
5. $C_f$ = ... , Retain

### Exercise 2: Lexical coherence (Optional)

rubric={accuracy:1}

In this exercise you will be checking the lexical coherence of adjacent sentences in the Brown corpus, comparing coherence within a text and across text boundaries.

Run the code below which creates a list of sentences from the first 50 texts of the Brown corpus and a list of the boundaries between them, and loads the built-in word2vec model included in `nltk`.

In [2]:
import nltk
nltk.download('word2vec_sample')

# provided code
doc_count = 0
sents = []
doc_boundaries = []

for fileid in brown.fileids():
    for sent in brown.sents(fileid):
        sents.append(sent)
    doc_boundaries.append(len(sents)) #This will establish the "breakpoints" between sentences
    doc_count += 1
    if doc_count == 50: #We're only interested in the first 50 documents
        break

doc_boundaries.pop(-1) # don't need the last boundary

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


Your task is to finish the `get_cosine_similarities()` function which creates an array consisting of the cosine similarities of the sums of the embeddings for the words in the $k$ sentences on either side of each sentence boundary in the corpus. You should not calculate this cosine similarity when you are less than $k$ from the edge; the code to insert placeholders (-1) for these is provided to you. Make sure you do this in an appropriately modular and efficient way.

* The =   `[0.2, 0.3, 0.4]`
* black = `[-0.1, 0.7, 1.5]`
* cat =   `[0.5,-0.3, -0.4]`  

Given these embeddings, `get_vector_sum()` will return `[0.6, 0.7, 1.5]` (`= [0.2-0.1+0.5, 0.3+0.7-0.3, 0.4+1.5-0.4]`, computed by `np.sum(vectors, axis=0)`), which is a sentence representation for *The black cat*.
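
The same arithmetic can be checked without any libraries. This is a minimal sketch of the column-wise sum that `np.sum(vectors, axis=0)` performs, written in plain Python; the dictionary `toy_vectors` and the helper `sum_vectors` are hypothetical stand-ins for the word2vec model and for `get_vector_sum()`.

```python
# Toy embeddings from the example above (not real word2vec values)
toy_vectors = {
    "the": [0.2, 0.3, 0.4],
    "black": [-0.1, 0.7, 1.5],
    "cat": [0.5, -0.3, -0.4],
}


def sum_vectors(words, vectors):
    """Sum the vectors of all words that have an embedding, dimension by dimension."""
    known = [vectors[w.lower()] for w in words if w.lower() in vectors]
    # zip(*known) groups the values of each dimension together
    return [sum(column) for column in zip(*known)]


sentence_vector = sum_vectors(["The", "black", "cat"], toy_vectors)
print([round(x, 2) for x in sentence_vector])  # [0.6, 0.7, 1.5]
```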

In [4]:
def get_vector_sum(sents, word_vectors):
    '''Gets the sum of the word vectors for the words in sents.
       sents: a list of k sentences, each of which is a list of tokens
       Hint: passing a list of words to word_vectors will return the word vectors corresponding to the words.
       Return value: a single vector that is the sum of the vectors for all the words in sents.
       e.g.: The = [0.2, 0.3, 0.4], black = [-0.1, 0.7, 1.5], cat = [0.5,-0.3, -0.4].  Then "The black cat"
       will give [[0.2, 0.3, 0.4], [-0.1, 0.7, 1.5], [0.5,-0.3,-0.4]], and the function will return
       [0.6, 0.7, 1.5] (i.e., [0.2-0.1+0.5, 0.3+0.7-0.3, 0.4+1.5-0.4]).  In other words, the function sums
       those vectors along a single axis, giving one vector that is the sum of the vectors of each word
       in the sentences.
    '''
# my code here
    words = []
    for sent in sents:
        words.extend([word.lower() for word in sent if word.lower() in word_vectors])
    return np.sum(word_vectors[words], axis=0)
# my code here

from scipy.spatial.distance import cosine  # cosine() returns a distance; similarity = 1 - cosine(u, v)

def get_cosine_similarities(sents, word_vectors,k):
    ''' returns an array of similarities of len(sents), corresponding to the cosine of the
    vectors produced by summing the embeddings of words in the k sentences of either side of 
    each sentence boundary, with the first and last k similarities set to -1
    
    For each sentence in the range from k to len(sents) - k, you should append the cosine between
    the sum of k sentences before the current sentence (using get_vector_sum), and the k sentences after (again, using get_vector_sum) 
    
    
    '''
    cosine_similarities = [-1]*k #Set first k similarities to -1
    #your code here

    #your code here
    for i in range(k):
        cosine_similarities.append(-1) #Set last k similarities to -1
    
    return np.array(cosine_similarities)
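
To make the windowing concrete, here is a dependency-free sketch on toy data. It is one possible reading of the docstring, not the reference solution: it assumes the boundary at index `i` sits just before sentence `i`, so the "before" window is `sents[i-k:i]` and the "after" window is `sents[i:i+k]`, and it assumes `len(sents) >= 2*k`. The names `vector_sum`, `cosine_similarity`, `window_similarities`, and the two-dimensional `toy_vectors` are all hypothetical.

```python
import math


def vector_sum(sentences, vectors):
    """Sum the vectors of every known word in a list of tokenized sentences."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for sent in sentences:
        for word in sent:
            vec = vectors.get(word.lower())
            if vec is not None:
                total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def window_similarities(sents, vectors, k):
    """One similarity per sentence index, with -1 placeholders at both edges."""
    sims = [-1] * k
    for i in range(k, len(sents) - k):
        before = vector_sum(sents[i - k:i], vectors)
        after = vector_sum(sents[i:i + k], vectors)
        sims.append(cosine_similarity(before, after))
    sims.extend([-1] * k)
    return sims


toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
toy_sents = [["cat"], ["cat"], ["dog"], ["dog"]]
print(window_similarities(toy_sents, toy_vectors, 1))  # [-1, 1.0, 0.0, -1]
```

The drop from 1.0 to 0.0 at index 2 is the toy analogue of a document boundary: the topic changes from "cat" to "dog" exactly there.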

The code below compares the cosine similarities of sentence boundaries which are also document boundaries to cosine similarities involving only sentences within a text.

In [8]:
cosine_similarities = get_cosine_similarities(sents,word2vec_model,3)
print(len(cosine_similarities))
doc_boundary_mean = np.mean(cosine_similarities[doc_boundaries])
doc_boundaries_window = set()

for boundary in [0,len(cosine_similarities)] + doc_boundaries:
    doc_boundaries_window.update(range(boundary -3, boundary + 3))
non_doc_boundaries = list(set(range(len(cosine_similarities))) - doc_boundaries_window)
non_doc_boundary_mean = np.mean(cosine_similarities[non_doc_boundaries])

print(doc_boundary_mean)
print(non_doc_boundary_mean)

assert(doc_boundary_mean < 0.8)
assert(len(cosine_similarities) == 5238)
assert(non_doc_boundary_mean - doc_boundary_mean > 0.05)
print("Success!")       

### Exercise 3: Anaphor resolution for pronouns

In this exercise, we will make use of the WikiCoref corpus, which can be downloaded [here](http://rali.iro.umontreal.ca/rali/sites/default/files/resources/wikicoref/WikiCoref.tar.gz). Do not include it with your submission. Instead, put the path to the `Annotation` directory of the unzipped corpus in the `wikicoref_path` variable below, and we will change the path if we need to test your code:

Ghaddar, A., & Langlais, P. (2016). **WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles**. In *Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16)*, 136–142. https://aclanthology.org/L16-1021

***Susan dropped [the plate]$_1$. [It]$_1$ shattered loudly.***

- ***It* is an anaphor**
- ***the plate* and *It* have the same referent (co-reference)**

In [2]:
#provided code
wikicoref_path = "... Lab4/WikiCoref/Annotation/" # modify this path

The WikiCoref corpus contains Wikipedia texts with annotations of coreference relationships among entity mentions in those texts. The `Annotation` directory contains a series of subdirectories, each one corresponding to a Wikipedia entry. There are two files that we will access for each entry. The first is a text file which contains the text, one token on each line, with a single blank line marking each sentence boundary. Below is a provided function `get_text_with_boundaries()` which (assuming you have the correct `wikicoref_path`) will build two data structures you will need: a text consisting of a list of tokens, and a set which has the indices of the sentence boundaries. You don't need to modify this code. Note that the text must be a list of tokens (and not a list of sentences) so that mentions can be properly identified using the information you will collect in **3.1**.

**Canada % ls -R** (the files marked `(*)` are the two we access for each entry)

```
Basedata		Canada.txt(*)		Markables		Styles
Canada.mmax		Customizations		Schemes			common_paths.xml

./Basedata:
Canada_words.xml	words.dtd

./Customizations:
coref_customization.xml		sentence_customization.xml

./Markables:
Canada_coref_level.xml(*)			Canada_coref_level_OntoNotesScheme.xml	test
Canada_coref_level.xml.bak		Canada_sentence_level.xml
Canada_coref_level_ACEScheme.xml	markables.dtd

./Schemes:
coref_scheme.xml	sentence_scheme.xml

./Styles:
default_style.xsl
```

**Barack_Obama % ls -R**
```
Barack Obama.mmax	Basedata		Markables		Styles
Barack Obama.txt	Customizations		Schemes			common_paths.xml

./Basedata:
Barack Obama_words.xml	words.dtd

./Customizations:
coref_customization.xml		sentence_customization.xml

./Markables:
Barack Obama_coref_level.xml			Barack Obama_coref_level_OntoNotesScheme.xml	test
Barack Obama_coref_level.xml.bak		Barack Obama_sentence_level.xml
Barack Obama_coref_level_ACEScheme.xml		markables.dtd

./Schemes:
coref_scheme.xml	sentence_scheme.xml

./Styles:
default_style.xsl
```

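Note that the directory name and the file names of an entry differ, e.g. `Barack_Obama/Markables/Barack Obama_coref_level.xml`:

- the directory is `Barack_Obama` (with `_`)
- the file is `Barack Obama_coref_level.xml` (with a space instead of `_`)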

In [3]:
#provided code
def get_text_with_boundaries(entry):
    '''given a WikiCoref entry name, opens the corresponding text file in the WikiCoref corpus and 
    converts it into a list of tokens and a set which has the indices of the boundaries between sentences'''
    text_filename = entry.replace("_"," ") + ".txt"
    sentence_boundaries = set()
    text = []
    with open(wikicoref_path + "/" + entry + "/" + text_filename,encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                text.append(line)
            else:
                sentence_boundaries.add(len(text) - 1)
    return text,sentence_boundaries

- `get_text_with_boundaries(entry)[0]` = a list of tokens
- `get_text_with_boundaries(entry)[1]` = a set of sentence boundary positions

In [4]:
print(get_text_with_boundaries("Canada")[0][:40])
37 in get_text_with_boundaries("Canada")[1]

['Canada', 'is', 'a', 'North', 'American', 'country', 'consisting', 'of', 'ten', 'provinces', 'and', 'three', 'territories', '.', 'Located', 'in', 'the', 'northern', 'part', 'of', 'the', 'continent', ',', 'it', 'extends', 'from', 'the', 'Atlantic', 'to', 'the', 'Pacific', 'and', 'northward', 'into', 'the', 'Arctic', 'Ocean', '.', 'Canada', 'is']


True


The second function, `get_mention_info()`, loads a list of objects which have information about entity mentions from an XML file (`*entry name*_coref_level.xml`) found in the `Markables` directory inside the folder for the entry. You don't need to modify this code either. I encourage you to take a look at what is returned by this function, as it will help you with the later exercises.

In [None]:
# <markable id="markable_318" span="word_1..word_1" coref_class="set_399" topic="http://rdf.freebase.com/ns/m.0d060g" coreftype="ident" mentiontype="ne"  mmax_level="coref" />
# <markable id="markable_319" span="word_3..word_13" coref_class="set_399" topic="http://rdf.freebase.com/ns/m.0d060g" coreftype="cop" mentiontype="np"  mmax_level="coref" />
# <markable id="markable_538" span="word_4..word_5" coref_class="set_3" topic="http://rdf.freebase.com/ns/m.059g4" coreftype="ident" mentiontype="ne"  mmax_level="coref" />
# <markable id="markable_555" span="word_9..word_10" coref_class="set_226" topic="nan" coreftype="ident" mentiontype="np"  mmax_level="coref" />
# <markable id="markable_552" span="word_9..word_13" coref_class="set_27" topic="nan" coreftype="ident" mentiontype="np"  mmax_level="coref" />


In [7]:
def get_mention_info(entry):
    '''Creates a list of mention_info objects for the Wikipedia entry from the corpus. Here, a 
    mention_info object corresponds to a beautiful soup node of a markable tag in the Wikicoref xml'''
    mention_info_list = []
    with open(wikicoref_path + "/" + entry + "/Markables/" +entry.replace("_", " ") +"_coref_level.xml",encoding="utf-8") as f:
        soup = BeautifulSoup(f,"lxml")
        for markable in soup.find_all("markable"):
            mention_info_list.append(markable)
    return mention_info_list

In [9]:
get_mention_info("Canada")[:5]

[<markable coref_class="set_399" coreftype="ident" id="markable_318" mentiontype="ne" mmax_level="coref" span="word_1..word_1" topic="http://rdf.freebase.com/ns/m.0d060g"></markable>,
 <markable coref_class="set_399" coreftype="cop" id="markable_319" mentiontype="np" mmax_level="coref" span="word_3..word_13" topic="http://rdf.freebase.com/ns/m.0d060g"></markable>,
 <markable coref_class="set_3" coreftype="ident" id="markable_538" mentiontype="ne" mmax_level="coref" span="word_4..word_5" topic="http://rdf.freebase.com/ns/m.059g4"></markable>,
 <markable coref_class="set_226" coreftype="ident" id="markable_555" mentiontype="np" mmax_level="coref" span="word_9..word_10" topic="nan"></markable>,
 <markable coref_class="set_27" coreftype="ident" id="markable_552" mentiontype="np" mmax_level="coref" span="word_9..word_13" topic="nan"></markable>]

In [10]:
assert len(get_mention_info("Canada")) == 1079
print("Success!")

Success!


#### 3.1

rubric={accuracy:2}

Next, you're going to complete a function which will allow you to grab the actual text corresponding to the mention. `get_mention_indices()` should take one of the `mention_info` objects created above and pull out a 2-tuple of indices (hint: remember "match" and "group" for regexes from 521?). The provided function `get_mention()` does the rest, getting the text associated with those tuples. You will need to use `get_mention()` extensively for the rest of this exercise.

The one slightly tricky part of this problem is that this corpus annotation uses a different indexing philosophy than Python does! You will want to figure this out first, and convert the one in the annotation to the one that Python uses.
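
As an illustration of the conversion, the sketch below works on a plain span string rather than on a `mention_info` object. The helper name `span_to_slice` and the shortened `tokens` list are hypothetical; the regular expression is the same pattern as `span_re` in the cell below. It assumes every span has the `word_X..word_Y` form seen in the examples.

```python
import re

span_re = re.compile(r"word_(\d+)\.\.word_(\d+)")


def span_to_slice(span):
    """Convert a 1-based inclusive 'word_X..word_Y' span into a Python (start, end) slice."""
    match = span_re.match(span)
    first, last = int(match.group(1)), int(match.group(2))
    # 1-based inclusive start -> 0-based start; inclusive end -> exclusive end
    return first - 1, last


tokens = ["Canada", "is", "a", "North", "American", "country", "consisting",
          "of", "ten", "provinces", "and", "three", "territories", "."]
start, end = span_to_slice("word_9..word_13")
print((start, end))        # (8, 13)
print(tokens[start:end])   # ['ten', 'provinces', 'and', 'three', 'territories']
```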

The other functions provided below also deal with indices but can be ignored for now.

**We need to store the following attributes of each `markable` XML tag: `span` (`get_mention_indices()`), `coref_class`, and `mentiontype`.**

```
<markable 
    coref_class="set_399"        
    coreftype="cop" 
    id="markable_319" 
    mentiontype="np"             
    mmax_level="coref" 
    span="word_3..word_13"       <--- get_mention_indices()
    topic="http://rdf.freebase.com/ns/m.0d060g">
</markable>
```
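
To see how these three attributes can be read off a tag, here is a small standalone sketch. It swaps in the standard library's `xml.etree.ElementTree` so that it runs without extra packages; in the notebook itself the objects are BeautifulSoup nodes, where the equivalent lookup is `mention_info["span"]`. The variable names `markable_xml` and `record` are hypothetical.

```python
import xml.etree.ElementTree as ET

# One markable tag copied from the Canada entry shown above
markable_xml = (
    '<markable id="markable_319" span="word_3..word_13" coref_class="set_399" '
    'topic="http://rdf.freebase.com/ns/m.0d060g" coreftype="cop" '
    'mentiontype="np" mmax_level="coref" />'
)

markable = ET.fromstring(markable_xml)
# Keep only the attributes used in this exercise
record = {
    "span": markable.get("span"),
    "coref_class": markable.get("coref_class"),
    "mentiontype": markable.get("mentiontype"),
}
print(record)
# {'span': 'word_3..word_13', 'coref_class': 'set_399', 'mentiontype': 'np'}
```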

In [14]:
span_re = re.compile(r"word_(\d+)\.\.word_(\d+)")
#The items returned from "get_mention_info" contain a string "word_X..word_Y", where X is the 1-based index
#of the first token of the span, and Y is the 1-based index of its last token (inclusive).


def get_mention_indices(mention_info):
    '''takes a mention_info object created by the get_mention_info function and 
    returns a tuple of integers corresponding to the span of the mention in the text'''
    # your code here
    # `mention_info["span"]` will return `word_3..word_13` (see above);
    # you extract numbers from word_3..word_13 using group(1) and group(2)
    # and return (3-1, 13);
    # note that you should return int, not str. 
    # your code here
    
def get_mention(mention_info,text):
    '''get a list of tokens from text corresponding to the mention from mention_info'''
    start,end = get_mention_indices(mention_info)
    return text[start:end]

### used for 3.5
def get_mentions_ind_for_same_sentence(index, mention_list, sentence_boundaries):
    '''given a starting index in mention_list, this returns the index of the mention in mention_list which
    is the first mention in the same sentence as index, based on the provided set of sentence_boundaries'''
    mention_start, mention_end = get_mention_indices(mention_list[index])
    i = mention_start
    while i >=0 and i not in sentence_boundaries:
        i -=1
    j = index - 1
    while j >=0 and get_mention_indices(mention_list[j])[0] >= i:
        j -= 1
    return j + 1

#### Functions below are not needed but good for checking what you're doing (print out sentences with mentions)
def get_mention_sentence_indices(mention_info,sentence_boundaries):
    '''takes a mention info object created by the get_mention_info function and the set of sentence boundaries in the
    corresponding text, and returns the indices of the sentence which contains that mention'''
    mention_start, mention_end = get_mention_indices(mention_info)
    sent_start = mention_start
    while sent_start - 1 not in sentence_boundaries and sent_start > 0:
        sent_start -= 1
    sent_end = mention_end
    # NOTE: `text` is not a parameter of this function; the loop below relies on a global `text`
    # (e.g. from get_text_with_boundaries) already being defined in the notebook
    while sent_end not in sentence_boundaries and sent_end < len(text):
        sent_end += 1
    return (sent_start,sent_end)

def get_mention_sentence(mention_info,sentence_boundaries,text):
    '''get a list of tokens from text corresponding to sentence containing the mention from 
    mention_info'''    
    start,end = get_mention_sentence_indices(mention_info,sentence_boundaries)
    return text[start:end]                                                      

In [16]:
# <markable id="markable_552" span="word_9..word_13" ...
get_mention_indices(get_mention_info("Canada")[4])

(8, 13)

In [23]:
print(get_mention_info("Canada")[4])
print(get_text_with_boundaries("Canada")[0][:30])
print(get_mention(get_mention_info("Canada")[4], get_text_with_boundaries("Canada")[0]))

<markable coref_class="set_27" coreftype="ident" id="markable_552" mentiontype="np" mmax_level="coref" span="word_9..word_13" topic="nan"></markable>
['Canada', 'is', 'a', 'North', 'American', 'country', 'consisting', 'of', 'ten', 'provinces', 'and', 'three', 'territories', '.', 'Located', 'in', 'the', 'northern', 'part', 'of', 'the', 'continent', ',', 'it', 'extends', 'from', 'the', 'Atlantic', 'to', 'the']
['ten', 'provinces', 'and', 'three', 'territories']


#### 3.2

rubric={accuracy:3}

Next, iterate over all the mentions in the corpus and use the information in the `mentiontype` field to collect a set of all the pronouns ("pro"), using the `get_mention()` function from above to get the text of each one. HINT: each mention has a key `mentiontype`. Print out this set (you should lowercase your pronouns, so that you don't end up with both lowercased and uppercased pronouns in the set).

```
<markable 
    coref_class="set_399" 
    coreftype="ident" 
    id="markable_594" 
    mentiontype="pro"             <---
    mmax_level="coref" 
    span="word_24..word_24" 
    topic="http://rdf.freebase.com/ns/m.0d060g">
</markable>
```

* using `get_mention_info()`, find `mentiontype == "pro"` and add it to `pronouns`
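
The filtering step can be sketched on toy data as follows. Here the mentions are plain dictionaries with a precomputed `(start, end)` slice, which is a hypothetical simplification of the BeautifulSoup objects and of `get_mention()`; `toy_text` and `toy_mentions` are made up for illustration.

```python
toy_text = ["Canada", "is", "large", ".", "It", "extends", "north", "."]
toy_mentions = [
    {"mentiontype": "ne", "slice": (0, 1)},   # 'Canada'
    {"mentiontype": "pro", "slice": (4, 5)},  # 'It'
]

toy_pronouns = set()
for mention in toy_mentions:
    if mention["mentiontype"] == "pro":
        start, end = mention["slice"]
        # A mention is a list of tokens, so join it into a string before adding it to the set
        toy_pronouns.add(" ".join(toy_text[start:end]).lower())

print(toy_pronouns)  # {'it'}
```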

In [18]:
pronouns = set()
for entry in os.listdir(wikicoref_path):
    if ".ini" in entry: #ignore .ini entries
        continue
    #Your code here
    # 1. `get_mention_info(entry)` will return `mention_info_list`
    # 2. `get_text_with_boundaries(entry)` will return `text` and `sentence_boundaries`
    # 3. iterate over `mention_info_list` to find mentions where `mention_info["mentiontype"] == "pro"`
    # 4. for each one, use `get_mention(mention_info,text)` to get its tokens and add the pronoun to `pronouns`
    #    (`get_mention()` returns a list of tokens, so turn it into a string, and don't forget `lower()`)

    #Your code here            
print(pronouns)

{'his', 'him', 'itself', 'we', 'i', 'them', 'her', 'those', 'she', 'herself', 'myself', 'me', 'they', 'its', 'my', 'their', 'our', 'you', 'these', 'he', 'ourselves', 'it', 'himself', 'your', 'themselves', 'this'}


Based on this, we create a new object called `pronoun_features` which has, for each *3rd-person* pronoun, a feature dictionary with information about the number, personhood, and gender of the pronoun. Remember that *3rd-person* pronouns refer to individuals who are neither the speaker nor the addressee (i.e., "he", "she", "they", "her", etc.). Note that personhood $\neq$ (grammatical) person, e.g. *3rd-person*. Instead, personhood is whether the pronoun refers to a person. You should remove first- and second-person pronouns (such as *me* and *you*). Here's an example with a suggested schema you could use for this:

```
{"she": {"PLUR": False,
         "PERS": True,
         "MALE": False}}
```

Note that if a pronoun isn't specified for one of these attributes (e.g. *they* isn't specified for gender or personhood in English), the feature (e.g. "MALE") should not be included.

In [19]:
#Provided code:
pronoun_features = {'itself':{"PERS":False,"PLUR":False},
                'they':{"PLUR":True}, 'them':{"PLUR":True}, 
                'those':{"PERS":False,"PLUR":True}, 
                'their':{"PLUR":True}, 
                'themselves':{"PLUR":True}, 
                'this':{"PLUR":False,"PERS":False}, 
                'his':{"PLUR":False,"PERS":True,"MALE":True}, 
                'he':{"PLUR":False,"PERS":True,"MALE":True}, 
                'herself':{"PLUR":False,"PERS":True,"MALE":False}, 
                'she':{"PLUR":False,"PERS":True,"MALE":False}, 
                'its':{"PLUR":False,"PERS":False}, 
                'him':{"PLUR":False,"PERS":True,"MALE":True}, 
                'her':{"PLUR":False,"PERS":True,"MALE":False}, 
                'himself':{"PLUR":False,"PERS":True,"MALE":True}, 
                'it':{"PLUR":False,"PERS":False}, 
                'these':{"PLUR":True,"PERS":False}}

You can compare these feature dictionaries for compatibility using the provided `compatible()` function below:

In [20]:
#provided code
def compatible(mention1,mention2):
    '''
       This function checks that two mentions match in all marked pronoun features
    '''
    for feature in mention1:
        if feature in mention2 and mention1[feature] != mention2[feature]:
            return False
    for feature in mention2:
        if feature in mention1 and mention1[feature] != mention2[feature]:
            return False
    return True
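
The reason unspecified features are left out of the dictionaries is that a missing key acts as a wildcard. The sketch below restates the same check compactly so that it runs on its own; `features_agree` is a hypothetical name, not part of the provided code.

```python
def features_agree(features1, features2):
    """Return True when no feature present in both dictionaries has conflicting values."""
    shared = features1.keys() & features2.keys()
    return all(features1[key] == features2[key] for key in shared)


she = {"PLUR": False, "PERS": True, "MALE": False}
they = {"PLUR": True}
unknown = {}  # a mention about which nothing is known

print(features_agree(she, unknown))               # True: nothing to conflict with
print(features_agree(she, they))                  # False: PLUR differs
print(features_agree(they, {"PERS": False}))      # True: no shared feature
```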

In [21]:
assert "me" not in pronoun_features
assert "you" not in pronoun_features
assert compatible(pronoun_features["this"],pronoun_features["it"])
assert not compatible(pronoun_features["their"],pronoun_features["it"])
assert compatible(pronoun_features["she"],pronoun_features["her"])
assert not compatible(pronoun_features["she"],pronoun_features["himself"])
assert not compatible(pronoun_features["her"],pronoun_features["its"])
print("Success!")

Success!


#### 3.3

rubric={accuracy:3,quality:1}

Next you will build a basic coreference system and test it out. Complete the `test_anaphor_resolution_system` function below. Your algorithm should do the following:

1. Iterate over all the Wikipedia entries and, for each entry, iterate over the mentions using indices (this code is provided for you). 

2. When you find an anaphor (pronoun) that is in your `pronoun_features` dict from **3.2** (you should add 1 to `total` at this point), you will attempt to find its antecedent by iterating through past mentions using a second index which you must get from the `search_iterator` (by default, this just iterates backwards; however, you will upgrade this function in **3.4**).

3. When the past mention is a non-anaphoric mention (i.e., not a pronoun), you will extract its features by passing the result of the `get_mention` function from **3.1** to the `get_mention_features` function (which currently just returns an empty dictionary, but you will upgrade this in **3.5**!) and check to see if your current pronoun is compatible with it, using the `compatible` function given above. If it is not compatible, the search should continue.

4. If it is, then your system will guess that antecedent. You can check to see if you are correct by seeing if the pronoun and potential antecedent have the same "coref_class"; if so, you should add 1 to `correct`.

5. Regardless of whether you are correct or not, once you have guessed an antecedent for the anaphor, you should stop looking for an antecedent and continue looking for the next anaphor (back to step 1).

6. When you have finished trying to find antecedents for all anaphors in all the entries, return your accuracy (this code is provided)
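
The six steps above can be sketched on toy data as follows. This is a simplified illustration, not the notebook's solution: mentions are plain dictionaries carrying their own text, the search is a plain backwards loop instead of `search_iterator`, and `features_agree`, `toy_resolution_accuracy`, `toy_mentions`, and `toy_features` are all hypothetical names.

```python
def features_agree(features1, features2):
    """Return True when no feature present in both dictionaries has conflicting values."""
    shared = features1.keys() & features2.keys()
    return all(features1[key] == features2[key] for key in shared)


def toy_resolution_accuracy(mentions, pronoun_features, feature_fn):
    total = 0
    correct = 0
    for i, mention in enumerate(mentions):
        word = mention["text"].lower()
        # Step 2: only score pronouns that have an entry in pronoun_features
        if mention["mentiontype"] != "pro" or word not in pronoun_features:
            continue
        total += 1
        for j in range(i - 1, -1, -1):  # plain backwards search
            candidate = mentions[j]
            if candidate["mentiontype"] == "pro":
                continue  # Step 3: only non-pronoun candidates are considered
            if features_agree(pronoun_features[word], feature_fn(candidate["text"])):
                # Step 4: the guess is right if both share a coref_class
                if candidate["coref_class"] == mention["coref_class"]:
                    correct += 1
                break  # Step 5: stop after the first guess, right or wrong
    return correct / total if total else 0.0


toy_mentions = [
    {"text": "Susan", "mentiontype": "ne", "coref_class": "set_1"},
    {"text": "the plate", "mentiontype": "np", "coref_class": "set_2"},
    {"text": "It", "mentiontype": "pro", "coref_class": "set_2"},
    {"text": "She", "mentiontype": "pro", "coref_class": "set_1"},
]
toy_features = {
    "it": {"PLUR": False, "PERS": False},
    "she": {"PLUR": False, "PERS": True, "MALE": False},
}

# With no mention features, every candidate is compatible, so "She" wrongly picks "the plate"
print(toy_resolution_accuracy(toy_mentions, toy_features, lambda text: {}))  # 0.5
```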

As suggested above, you MUST use the two functions given below in your implementation. They are not doing much right now (ignoring their arguments, in fact), but you will improve them later.

Run the provided test code to see how well your basic system is working.

In [22]:
print(get_mention_info("Canada")[0])
print(get_mention(get_mention_info("Canada")[0], get_text_with_boundaries("Canada")[0]))
print("-")

print(get_mention_info("Canada")[1])
print(get_mention(get_mention_info("Canada")[1], get_text_with_boundaries("Canada")[0]))

print("-")
print(get_mention_info("Canada")[6])
print(get_mention(get_mention_info("Canada")[6], get_text_with_boundaries("Canada")[0]))

<markable coref_class="set_399" coreftype="ident" id="markable_318" mentiontype="ne" mmax_level="coref" span="word_1..word_1" topic="http://rdf.freebase.com/ns/m.0d060g"></markable>
['Canada']
-
<markable coref_class="set_399" coreftype="cop" id="markable_319" mentiontype="np" mmax_level="coref" span="word_3..word_13" topic="http://rdf.freebase.com/ns/m.0d060g"></markable>
['a', 'North', 'American', 'country', 'consisting', 'of', 'ten', 'provinces', 'and', 'three', 'territories']
-
<markable coref_class="set_399" coreftype="ident" id="markable_594" mentiontype="pro" mmax_level="coref" span="word_24..word_24" topic="http://rdf.freebase.com/ns/m.0d060g"></markable>
['it']


In [23]:
def search_iterator(anaphor_index, mention_info_list,sentence_boundaries):
    '''a generator function which ignores most of its arguments and just provides indices
    in backwards order, starting with the index before the provided anaphor_index'''
    for i in range(anaphor_index - 1,-1,-1):
        yield i

def get_mention_features(mention):
    '''just returns an empty dictionary, which means the mention has no known features and so
    will be compatible with everything'''
    return {}

In [25]:
def test_anaphor_resolution_system(wikicoref_path):
    '''iterates through all the mentions for all the Wikipedia entries in wikicoref_path and attempts to assign an antecedent
    to each. Returns an accuracy score based on how many antecedents are correctly assigned'''
    total = 0
    correct = 0
    for entry in os.listdir(wikicoref_path):
        if ".ini" in entry:
            continue
        mention_info_list = get_mention_info(entry)
        text, sentence_boundaries = get_text_with_boundaries(entry)
        for i in range(len(mention_info_list)):
            # your code here for step 2:
            # in this snippet, you should check whether the current mention (`mention_info_list[i]`)
            # has `mentiontype` "pro" and whether its text (using `get_mention()` and `lower()`) is in `pronoun_features`.
            # If it is, you should get its features 
            #   from the `pronoun_features` dictionary above,
            #   and total += 1
            
            # your code here
                    for j in search_iterator(i, mention_info_list,sentence_boundaries):
                        # your code here:
                        # in this snippet, you should check that the mention returned by the search iterator
                        #       (`mention_info_list[j]`) is not a pronoun, and get its features by passing
                        #       `get_mention(mention_info_list[j],text)` to `get_mention_features()`.
                        # If it is compatible with your current pronoun (from the for loop above),
                        #       check whether the coreference class of the pronoun is the same as this candidate's.
                        #       If it is, then increase the number of correctly predicted items: correct += 1.
                        #       Either way, break (you have made your guess for this pronoun).
    return correct/total

In [26]:
result = test_anaphor_resolution_system(wikicoref_path)
print(result)
assert 0.39 < result < 0.40
print("Success!")

0.3922413793103448
Success!


#### 3.4

rubric={accuracy:2}

The next piece of your anaphor resolution system will involve an enhancement to your backward searching process: instead of just stepping back mention-by-mention, you will start looking from the *beginning* of sentences rather than the end. This is a rough way of injecting information about grammatical saliency into your search for an antecedent (in English, at least, the subject often comes first, and has highest salience). A major piece of this has been done for you, in the form of the `get_mentions_ind_for_same_sentence()` function provided with the other mention functions in **3.1**. For the example below:

***Ted* is happy. *He* is getting married to *Molly* who used to be married to *Fred*. *Her* bridesmaid is *Dolly* and *his* best man is *Ned*.**

We have the mentions: *Ted*, *He*, *Molly*, *Fred*, *Her*, *Dolly*, *his*, and *Ned*. If we are looking for the antecedent of *his*, calling `get_mentions_ind_for_same_sentence()` with the index of mention *his* will return the index of *Her* (ie, the start of that sentence). Using that index as a starting point, we would check forward through that sentence until we hit *his*, and then move to the previous sentence by calling `get_mentions_ind_for_same_sentence()` on the mention before *Her*, namely *Fred*. That would get us started at the index for *He*, and then we iterate forward until we hit *Her*, at which point we call  `get_mentions_ind_for_same_sentence()` on *Ted*, return *Ted*'s index, then stop because we are already at index 0. Implement this logic as a generator function (**HINT**: your solution should have one `while` loop and then one `for` loop within it).

So the process is: 
1. Starting with "his" in the final sentence, find the first mention in the sentence, using `get_mentions_ind_for_same_sentence`.  
2. Starting at that index, yield the indices of the mentions from that mention up to (but not including) the anaphor or, in earlier sentences, up to the end of that sentence. 
3. Move to the mention just before the one found in 1. 
4. Return to 1.  If the index found in 1 is 0, you're done, because you have reached the start of the text.

For the example text, in terms of word positions (punctuation ignored), the search order will be [16 (Her),17,18,19,20,21,3 (He),4,5,6,7,8,9,10,11,12,13,14,15,0 (Ted),1,2] (you don't need the items in parentheses; those are just for clarification). This list is only a word-level illustration: your generator should yield indices into the mention list, as in the mention-level example below. The logic behind this order is: "to resolve this anaphor, go to the start of its sentence and check the compatibility of each mention with the anaphor. If you find one that is compatible, good! If not, go to the previous sentence and do the same thing. Keep doing this until you get to the start of the discourse".

There are two sets of tests below: one tests whether the algorithm is working properly, and the other calls `test_anaphor_resolution_system` again to see how much this new version of `search_iterator` has improved performance.

***
```
0*Ted* is happy. 
1*He* is getting married to 2*Molly* who used to be married to 3*Fred*. 
4*Her* bridesmaid is 5*Dolly* and 6*his* best man is 7*Ned*.
```
***

previously, using `search_iterator()` with `anaphor_index` = 6 (*his*):
```
indices == [5, 4, 3, 2, 1, 0]    
```


Now, using the *new* `search_iterator()` with `anaphor_index` = 6 (*his*):
```
indices == [4, 5, 1, 2, 3, 0]
```
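
The ordering above can be reproduced with a small standalone generator. This sketch is only an illustration of the traversal order: instead of `sentence_boundaries` and `get_mentions_ind_for_same_sentence()`, it assumes a hypothetical list `sentence_ids` that gives the sentence number of each mention, and the name `sentence_first_order` is likewise made up.

```python
def sentence_first_order(anaphor_index, sentence_ids):
    """Yield mention indices sentence by sentence, moving backwards through the text
    but forwards within each sentence."""
    end = anaphor_index                  # exclusive upper bound of the current chunk
    current = sentence_ids[anaphor_index]
    while end > 0:
        # Walk back to the first mention of the current sentence
        start = end
        while start > 0 and sentence_ids[start - 1] == current:
            start -= 1
        for i in range(start, end):
            yield i
        # Continue with the sentence of the mention just before this chunk
        end = start
        if end > 0:
            current = sentence_ids[end - 1]


# Ted | He, Molly, Fred | Her, Dolly, his, Ned
sentence_ids = [0, 1, 1, 1, 2, 2, 2, 2]
print(list(sentence_first_order(6, sentence_ids)))  # [4, 5, 1, 2, 3, 0]
```

Mentions that follow the anaphor (here *Ned*) are never yielded, since only earlier mentions are candidate antecedents.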

In [27]:
def search_iterator(anaphor_index, mention_list, sentence_boundaries):
    '''generator function which yields indices in mention_list based on the idea of moving backwards through the text starting
    from end_index but forward through sentences as indicated by the set of sentence_boundaries'''
    start_index = -1
    end_index = anaphor_index
    first = True
    # your code here

    # your code here

In [28]:
mention_list = get_mention_info("Canada")
text,sentence_boundaries = get_text_with_boundaries("Canada")
indices = list(search_iterator(14, mention_list, sentence_boundaries))
assert indices == [10, 11, 12, 13, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5] #How many sentences are in this utterance?
print("Success!")

Success!


In [29]:
result = test_anaphor_resolution_system(wikicoref_path)
print(result)
assert 0.52 < result < 0.53
print("Success!")

0.5275862068965518
Success!


#### 3.5

rubric={accuracy:3, quality:1}

Finally, you will upgrade the `get_mention_features` function so that it guesses at least one of the three relevant features for *non-pronominal* mentions. Based on our pronoun list, there are clearly three options for this: plural (whether a mention is plural), person (whether the mention is a person), and gender (whether the mention is male or female). For example, if provided the mention ```["the", "boy"]```, a fully functional version of `get_mention_features` would return a dictionary ```{"PLUR":False,"PERS":True,"MALE":True}```. You can use any resources you like for this, including POS taggers, manually created or off-the-shelf lexicons (remember that NLTK has a lexicon of male and female names!), WordNet, word2vec vectors, sklearn classifiers, etc. You can get up to 3 bonus points for this question if you tackle multiple features or provide a particularly comprehensive solution for one of them, but you only have to cover one feature to get the non-optional points.

First, you must test your function *intrinsically* by showing that it works for particular mentions. To do this, iterate over mentions (your code from **3.3** can be modified for this purpose) and print out each mention and the output of `get_mention_features`. Based on your own manual analysis, you should be getting at least 3/4 of cases correct *in both classes*; if not, try to improve your function. When you are satisfied with your performance, test *extrinsically* by using the new `get_mention_features` in the context of your coreference resolution system. It is okay if your extrinsic evaluation gives negative results, as long as your intrinsic evaluation shows you are doing pretty well (though it definitely should be possible to improve your overall results!). Note that in a real-world situation we would probably want to separate our intrinsic and extrinsic evaluations (making sure they are not done over the same mentions), since the former might involve overfitting on the specific mentions in this set, but we are not going to require you to worry about this here (though you can if you like!).
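The "3/4 of cases in both classes" check can be made concrete with a small per-class accuracy helper. This is only a sketch: the function name `per_class_accuracy` and the gold labels below are invented for illustration, and in practice the gold labels would come from your own manual analysis of the printed mentions.

```python
from collections import Counter


def per_class_accuracy(gold, predicted):
    """Return {label: accuracy} computed separately for each gold label."""
    correct, total = Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        total[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical PLUR judgements for six mentions (made up for illustration)
gold_plur = [True, True, False, False, False, False]
predicted_plur = [True, False, False, False, False, True]
scores = per_class_accuracy(gold_plur, predicted_plur)
print(scores)  # {True: 0.5, False: 0.75}
```

Reporting the two classes separately matters because most mentions are singular, so a function that always returns `False` could look accurate overall while failing on every plural mention.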

`["the", "boy"] --> {"PLUR":False,"PERS":True,"MALE":True}`
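For the person and gender features, one option hinted at above is a name lexicon. The sketch below assumes a hypothetical helper, `name_based_features`, that takes the lexicons as plain sets so it can run without any downloaded data; the tiny sets here are invented for illustration. Leaving a key out when the lexicon gives no evidence is also an assumption, and whether that suits the rest of the resolution system depends on how it compares feature dictionaries.

```python
def name_based_features(mention, male_names, female_names):
    """Guess PERS and MALE from name lexicons; omit keys when there is no evidence."""
    features = {}
    tokens = set(mention)
    has_male = bool(tokens & male_names)
    has_female = bool(tokens & female_names)
    if has_male or has_female:
        features["PERS"] = True
        if has_male != has_female:   # skip mentions matching both lexicons
            features["MALE"] = has_male
    return features


# Toy lexicons (illustrative only). In the notebook they could instead be built
# from NLTK's names corpus, e.g. set(nltk.corpus.names.words("male.txt")).
toy_male = {"Jason", "Charles", "Ted"}
toy_female = {"Molly", "Dolly"}

print(name_based_features(["Jason", "Charles", "Beck"], toy_male, toy_female))
print(name_based_features(["the", "town"], toy_male, toy_female))
```

A lexicon of this kind only covers proper names; common nouns such as *boy* would need another resource, for example WordNet or a hand-built word list.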


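```
pos_tag(["the", "boy"]) --> [("the", "DT"), ("boy", "NN")]
pos_tag(["boys"])       --> [("boys", "NNS")]
```
If one of the noun tags in the mention ends with `"S"` (`NNS` or `NNPS`), then `"PLUR":True`. Restrict the check to noun tags, since other Penn Treebank tags such as `POS`, `JJS`, and `RBS` also end in `S`.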
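The plural rule can be sketched as a function over already-tagged tokens, which keeps the example runnable without the tagger model. The name `plural_from_tags` is hypothetical, and the tags in the examples are written by hand rather than produced by `pos_tag`; in the notebook you would pass `pos_tag(mention)` instead.

```python
PLURAL_NOUN_TAGS = {"NNS", "NNPS"}   # Penn Treebank plural common / proper noun


def plural_from_tags(tagged_mention):
    """tagged_mention: list of (token, tag) pairs, the shape nltk.pos_tag returns."""
    is_plural = any(tag in PLURAL_NOUN_TAGS for _, tag in tagged_mention)
    return {"PLUR": is_plural}


# Hand-written tags for illustration
print(plural_from_tags([("the", "DT"), ("boy", "NN")]))
print(plural_from_tags([("boys", "NNS")]))
# A possessive marker is tagged POS, which ends in "S" but is not a plural noun
print(plural_from_tags([("Blackbird", "NNP"), ("'s", "POS"), ("Field", "NNP")]))
```

Checking membership in an explicit tag set avoids false positives from tags that merely end in `S`. It still marks a mention as plural if any noun inside it is plural, not only the head noun, so long mentions with embedded plurals may need extra handling.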



In [30]:
def get_mention_features(mention):
    '''returns a feature dictionary with PLUR either True or False depending on whether the corresponding mention 
    contains a plural noun, based on its part-of-speech tags'''
    features = {}
    #Your code here

    #Your code here
    return features

def check_mention_features():
    '''print out mentions and mention features for the first five non-pronominal mentions of each entry 
    in the WikiCoref corpus'''
    for entry in os.listdir(wikicoref_path):
        if ".ini" in entry:
            continue
        count = 0
        mention_info_list = get_mention_info(entry)
        text,sentence_boundaries = get_text_with_boundaries(entry)
        for mention_info in mention_info_list:      
            if mention_info["mentiontype"] != "pro":
                count += 1
                mention = get_mention(mention_info,text)
                print(mention)
                print(get_mention_features(mention))
                if count == 5:
                    break

In [31]:
check_mention_features()

['The', 'Siege', 'of', 'Chaves']
{'PLUR': True}
['the', 'French', 'siege', 'and', 'capture', 'of', 'Chaves', ',', 'Portugal', 'from', '10', 'to', '12', 'March', '1809', ',', 'and', 'the', 'subsequent', 'siege']
{'PLUR': False}
['Chaves']
{'PLUR': True}
['Portugal']
{'PLUR': False}
['the', 'town']
{'PLUR': False}
['A', 'financial', 'analyst']
{'PLUR': False}
['securities', 'analyst']
{'PLUR': True}
['research', 'analyst']
{'PLUR': False}
['equity', 'analyst']
{'PLUR': False}
['investment', 'analyst']
{'PLUR': False}
['The', 'Battle', 'of', 'Kosovo']
{'PLUR': False}
['the', 'Battle', 'of', 'Kosovo', 'Field']
{'PLUR': False}
['the', 'Battle', 'of', 'Blackbird', "'s", 'Field']
{'PLUR': False}
['St.', 'Vitus', "'", 'Day']
{'PLUR': False}
['St.', 'Vitus', "'", 'Day', ',', 'June', '15', ',', '1389']
{'PLUR': False}
['Gonzales']
{'PLUR': True}
['Gonzales', ',', '-LRB-', 'born', 'Jason', 'Charles', 'Beck', ';', '1972', '-RRB-']
{'PLUR': True}
['Jason', 'Charles', 'Beck']
{'PLUR': False}
['a', '

In [32]:
test_anaphor_resolution_system(wikicoref_path)

0.5482758620689655