# Assignment 2 starter code
This notebook contains code to run [coreferee](https://github.com/explosion/coreferee), a coreference resolution system that runs under spaCy and extracts coreference chains (or clusters) from text.
To run the notebook, you first have to install coreferee. See the instructions at https://spacy.io/universe/project/coreferee; in most cases, all you need to do is run the following from a command prompt:

    $ python -m pip install coreferee
    $ python -m coreferee install en
    
You'll also need to download the spaCy transformer model for English (`en_core_web_trf`), which requires the large model (`en_core_web_lg`) as well.
At the time of writing, spaCy has just released new versions that coreferee is not yet compatible with, so you need to install specific versions of each model:

    $ python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz
    $ python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0.tar.gz

If you are running this on Binder, skip the commands above and run the two cells under "Binder only" below instead.
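
Whichever installation route you use, it can be worth confirming that the pinned model versions are the ones actually installed. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `check_models` are hypothetical, and the sketch assumes the models are registered under the distribution names `en_core_web_lg` and `en_core_web_trf`.

```python
from importlib import metadata

# Versions named in the install commands above.
EXPECTED_MODEL_VERSIONS = {
    "en_core_web_lg": "3.4.0",
    "en_core_web_trf": "3.4.0",
}


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def check_models(expected=EXPECTED_MODEL_VERSIONS):
    """Map each package name to (installed_version, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (found, found == wanted)
    return report


for name, (found, ok) in check_models().items():
    status = "OK" if ok else "check this one"
    print(f"{name}: installed={found} ({status})")
```

A result of `installed=None` means the package was not found in the current environment, which usually indicates that the install command ran under a different Python interpreter.
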

## Part 1: Run coreferee

### Binder only

In [None]:
# run this cell only if you are on Binder

!python3 -m pip install coreferee
!python3 -m coreferee install en

In [None]:
# run this cell only if you are on Binder

!python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz
!python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0.tar.gz

### End of Binder only

In [7]:
# import what we need, load the transformer model, 
# add coreferee to the spacy nlp pipeline

import coreferee, spacy
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x20c4a4d2908>

In [8]:
# this is just a test, so that you can see what the coreference chains look like
# you may get a CUDA warning here. As long as it's only a warning, things should run just fine

doc = nlp('Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.')

In [9]:
# now we print the coreference chains found

doc._.coref_chains.print()

0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)


A few things to note about the output:

* We have 4 coreference chains, relating to: *Peter, work, wife(+Peter), Spain*
* Coreferee is able to deal with cataphora, where the pronoun (*he*) appears before the referent (*Peter*)
* Coreferee can deal with groups: *\[he+wife\], they*
* The wife does not get a chain of her own, because no referring expression points to her alone. She appears only as part of the group *he and his wife*.
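
The printed format can also be read back into plain Python data, which may help when you want to compare chains across documents. This is a minimal sketch and not part of coreferee: `parse_chain_line` and `TOKEN_RE` are hypothetical names, and the parser assumes the `text(index)` layout seen in the output above, with grouped mentions in square brackets.

```python
import re

# One printed mention token: the text followed by its token index, e.g. "he(1)".
TOKEN_RE = re.compile(r"([^\s\[\];,()]+)\((\d+)\)")


def parse_chain_line(line):
    """Parse one printed chain line into (chain_index, mentions).

    Each mention is a list of (text, token_index) pairs. A grouped mention
    such as "[He(16); wife(19)]" yields a list with more than one pair.
    """
    index_part, _, mentions_part = line.partition(":")
    mentions = []
    # Take bracketed groups whole; otherwise take runs between commas.
    for raw in re.findall(r"\[[^\]]*\]|[^,\[\]]+", mentions_part):
        pairs = [(text, int(i)) for text, i in TOKEN_RE.findall(raw)]
        if pairs:
            mentions.append(pairs)
    return int(index_part), mentions


chain_index, mentions = parse_chain_line(
    "2: [He(16); wife(19)], they(21), They(26), they(31)"
)
print(chain_index, mentions)
```

The first mention in the parsed result holds two pairs, which is how the group *he and his wife* shows up in this representation.
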

In [10]:
# once we have the token index of a particular referring expression,
# we can ask coreferee to resolve it. For instance, the following
# prints the referent(s) of the referring expression
# at token index 31 (they)

print(doc._.coref_chains.resolve(doc[31]))

[Peter, wife]
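
Because a resolved expression can have more than one referent, any code that consumes these results has to handle lists rather than single tokens. The sketch below illustrates this with plain strings. It does not call coreferee: `substitute_referents` is a hypothetical helper, and the token list and the `referents_by_index` mapping are hand-written stand-ins for what you would collect from `resolve`.

```python
def substitute_referents(tokens, referents_by_index):
    """Replace each resolved token with its referent(s), joined by 'and'.

    tokens: list of token strings.
    referents_by_index: dict mapping a token index to a list of referent strings.
    """
    resolved = []
    for i, token in enumerate(tokens):
        referents = referents_by_index.get(i)
        if referents:
            resolved.append(" and ".join(referents))
        else:
            resolved.append(token)
    return resolved


# Hand-written stand-ins (assumptions, not coreferee output objects).
tokens = ["They", "travelled", "to", "Spain"]
referents_by_index = {0: ["Peter", "wife"]}

print(" ".join(substitute_referents(tokens, referents_by_index)))
# prints: Peter and wife travelled to Spain
```
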


### Run coreferee on local files

In [11]:
# compute coreference chains for 5 documents in the data/ directory
# below is a sample for the first text

with open("data/5c1dbe1d1e67d78e2797d611.txt", "r", encoding="utf-8") as f:
    text1 = f.read()

In [12]:
doc1 = nlp(text1)

In [13]:
doc1._.coref_chains.print()

0: couple(9), couple(76)
1: years(16), their(19)
2: letter(48), them(59), they(78)
3: Ayo(72), Ayo(150)
4: custody(87), it(96)
5: News(112), News(136)
6: fact(117), It(165)
7: orphanage(161), orphanage(206)
8: letter(171), letter(219)
9: Kim(174), Kim(211)
10: children(254), their(266)
11: adoption(300), adoption(333)
12: headlines(317), they(325)
13: Kim(338), Kim(401), Kim(425), her(437), she(452)
14: Nigeria(345), country(359)
15: papers(372), them(376)
16: Clark(395), Clark(432), He(444), him(478), him(486), him(496), he(501)
17: [Kim(401); son(404)], their(403)
18: Canada(474), They(484)
19: Nigeria(489), They(494)
20: Morans(561), Morans(602)
21: family(578), family(636), they(658), their(660), they(684)
22: government(589), it(598)
23: Kim(633), she(652), she(695)
24: agency(648), it(672)


In [14]:
# example: who does she(452) refer to?

print(doc1._.coref_chains.resolve(doc1[452]))

[Kim]


In [15]:
# print the named entities that spaCy labels as people (PERSON)
# we are interested in those because they are the sources of quotes
for ent in doc1.ents:
    if ent.label_ in ["PERSON"]:
        print(ent.text, ent.label_)

Kim PERSON
Clark Moran PERSON
Ayo PERSON
Kim PERSON
Ayo PERSON
Kim PERSON
Clark PERSON
Kim PERSON
Ayo PERSON
Morans PERSON
Kim PERSON
Ayo PERSON
Clark PERSON
Kim PERSON
Kim PERSON
Clark PERSON
Ayo PERSON
Kim PERSON
Morans PERSON
Ayo PERSON
Kim PERSON
Ben Miljure PERSON
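
The same name appears many times in this list, so a per-person tally is often easier to read than the raw output. This is a minimal sketch using `collections.Counter`. The helper `count_people` is hypothetical, and the `sample` list is hand-written: its first four pairs are copied from the output above and the `("Canada", "GPE")` pair is an invented non-person entry for contrast. With a real document you could build the pairs from `(ent.text, ent.label_)`.

```python
from collections import Counter


def count_people(entities):
    """Count how often each PERSON entity text occurs.

    entities: iterable of (text, label) pairs.
    """
    return Counter(text for text, label in entities if label == "PERSON")


# Hand-written sample; the GPE entry is illustrative only.
sample = [
    ("Kim", "PERSON"),
    ("Clark Moran", "PERSON"),
    ("Ayo", "PERSON"),
    ("Kim", "PERSON"),
    ("Canada", "GPE"),
]

for name, n in count_people(sample).most_common():
    print(name, n)
```

Note that the tally is based on surface text only, so *Clark* and *Clark Moran* would be counted as different people.
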


## Side note: visualizations
If you want to see this all in a much prettier format, you can use [displacy](https://spacy.io/usage/visualizers). 

In [10]:
from spacy import displacy

In [11]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "lightsteelblue"}}

displacy.render(doc1, style="ent", options=options, jupyter=True)

## Part 2: Run the quote extraction from Assignment 1
I suggest using the Matcher quote extraction system from A1, but, if you implemented your own version, or improved on this one, feel free to use that instead.

In [12]:
# import what we need for this
# (the matcher itself is created in the next cell)
from spacy.matcher import Matcher

In [13]:
# we don't need to load the text again; use text1 from above

matcher = Matcher(nlp.vocab)
# pattern: an opening quote, one or more alphabetic tokens,
# any trailing punctuation, and a closing quote
pattern_q = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}]
matcher.add("QUOTES", [pattern_q], greedy='LONGEST')
doc = nlp(text1)
matches_q = matcher(doc)
# each match is a (match_id, start, end) tuple; sort by start token index
matches_q.sort(key=lambda x: x[1])
print(len(matches_q))
for match in matches_q[:10]:
    print(match, doc[match[1]:match[2]])

3
(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."
(16432004385153140588, 164, 174) "It does say that in the letter,"
(16432004385153140588, 179, 209) "I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage."


## Your turn

Check instructions on Canvas for what to do and what to submit. 