# Assignment 2 starter code
This notebook contains code to run [coreferee](https://github.com/explosion/coreferee), a coreference system running under spaCy to extract coreference chains (or clusters) from text.
To run the notebook, you first have to install coreferee. See the instructions at https://spacy.io/universe/project/coreferee; in most cases it is enough to run the following from a command prompt:

    $ python -m pip install coreferee
    $ python -m coreferee install en
    
You'll also need to download the spaCy transformer model for English; the large English model is needed alongside it.
spaCy has recently released new versions that coreferee is not yet compatible with, so you need to install these specific versions of each model:

    $ python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz
    $ python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0.tar.gz

If you are running this on Binder, skip the commands above and run the next two cells instead.
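
Because the model versions are pinned, it can help to confirm what is actually installed before loading anything. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper names `installed_version` and `check_versions` are made up for illustration, and the distribution names in `EXPECTED` are an assumption about how pip registers the two models.

```python
from importlib import metadata

# Versions taken from the install commands above. The distribution names are
# an assumption about how pip registers the model packages.
EXPECTED = {
    "en_core_web_lg": "3.4.0",
    "en_core_web_trf": "3.4.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED):
    """Map each package name to (installed version, whether it matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (found, found == wanted)
    return report


for package, (found, ok) in check_versions().items():
    status = "ok" if ok else "check this"
    print(f"{package}: installed={found} ({status})")
```

A package reported as `installed=None` is simply not visible to this Python environment, which usually means the install command was run with a different interpreter.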

## Part 1: Run coreferee

### Binder only

In [None]:
# run this cell only if you are on Binder

!python3 -m pip install coreferee
!python3 -m coreferee install en

In [None]:
# run this cell only if you are on Binder

!python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz
!python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0.tar.gz

### End of Binder only

In [36]:
# import what we need, load the transformer model, 
# add coreferee to the spacy nlp pipeline

import coreferee, spacy
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x7faf512a5d30>

In [58]:
# this is just a test, so that you can see what the coreference chains look like
# you may get a CUDA warning here. As long as it's only a warning, things should run just fine

doc = nlp("Although he was very busy with his work, Peter had had enough of it. 'God, I'm frustrated!', he said. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

In [60]:
# now we print the coreference chains found

doc._.coref_chains.print()

print(doc._.coref_chains.resolve(doc[25]))

0: he(1), his(6), Peter(9), he(25), He(28), his(30)
1: work(7), it(14)
2: [He(28); wife(31)], they(33), They(38), they(43)
3: Spain(41), country(46)
[Peter]


A few things to note about the output:

* We have 4 coreference chains, relating to: *Peter, work, wife(+Peter), Spain*
* Coreferee is able to deal with cataphora, where the pronoun (*he*) appears before the referent (*Peter*)
* Coreferee can deal with groups: *\[he+wife\], they*
* The wife does not appear as an entity with a chain of her own, because no referring expression points to that entity alone. She only appears as part of *he and his wife*.
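
To make the printed format concrete: each line starts with the chain number, each mention is a token followed by its token index in parentheses, and a group mention is written in square brackets with its members separated by semicolons. The sketch below parses one printed line into plain Python data. `parse_chain_line` is a hypothetical helper written for this notebook, not part of coreferee, and it assumes token texts contain no commas, brackets or whitespace.

```python
import re

# One top-level mention: either a bracketed group or a single token(index)
MENTION_RE = re.compile(r"\[[^\]]*\]|[^,\s\[][^,]*")
# One token with its index, e.g. "wife(31)"
TOKEN_RE = re.compile(r"([^\s\[\];,]+?)\((\d+)\)")


def parse_chain_line(line):
    """Parse one line of printed chain output into (chain number, mentions).

    Each mention is a list of (token text, token index) pairs; a group
    mention such as [He(28); wife(31)] yields a list with several pairs.
    """
    chain_id, _, rest = line.partition(":")
    mentions = []
    for piece in MENTION_RE.findall(rest):
        tokens = [(text, int(index)) for text, index in TOKEN_RE.findall(piece)]
        if tokens:
            mentions.append(tokens)
    return int(chain_id), mentions


# The third chain from the output above: a group mention followed by pronouns
number, mentions = parse_chain_line("2: [He(28); wife(31)], they(33), They(38), they(43)")
for mention in mentions:
    print(number, mention)
```

Working from the printed text is only a convenience for inspection; inside the notebook the same information is available directly from `doc._.coref_chains`.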

In [40]:
# once we have an index for a particular referring expression, 
# we can ask coreferee to resolve it. For instance, printing
# the following expression gives us the referent for 
# the referring expression 33 (they)

print(doc._.coref_chains.resolve(doc[33]))

[Peter, wife]
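
Resolving one token at a time extends naturally to a whole document. The sketch below replaces every token that coreferee can resolve with the text of its referent(s). `substitute_referents` is a hypothetical helper, and it assumes that `resolve` returns either a list of tokens (as in the outputs above) or nothing for tokens it cannot resolve.

```python
def substitute_referents(doc):
    """Return the document's token texts with resolvable tokens replaced.

    `doc` is expected to be a spaCy Doc processed by a pipeline that
    includes coreferee, so that doc._.coref_chains is available.
    """
    words = []
    for token in doc:
        referents = doc._.coref_chains.resolve(token)
        if referents:
            # A group referent such as [Peter, wife] becomes "Peter and wife"
            words.append(" and ".join(ref.text for ref in referents))
        else:
            words.append(token.text)
    return " ".join(words)


# In the notebook, after the test sentence has been processed:
# print(substitute_referents(doc))
```

Joining tokens with single spaces loses the original spacing around punctuation, which is acceptable for inspection but not for producing clean text.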


### Run coreferee on local files

In [47]:
# load ten documents from the data/ directory:
# five for coreference chains and five for displaCy visualization

def read_text(doc_id):
    """Return the contents of data/<doc_id>.txt as a string."""
    with open(f"data/{doc_id}.txt", "r", encoding="utf-8") as f:
        return f.read()

# COREFERENCE 5 TEXTS

text1 = read_text("5c1dbe1d1e67d78e2797d611")
text2 = read_text("5c1dccbf1e67d78e279807d8")
text3 = read_text("5c1de1661e67d78e27984d34")
text4 = read_text("5c1e0b68795bd2a5d03a49a9")
text5 = read_text("5c1efb3d1e67d78e279bd39a")

# DISPLACY 5 TEXTS

text6 = read_text("5c5d3e251e67d78e275e54b5")
text7 = read_text("5c5d3f0a1e67d78e275e5788")
text8 = read_text("5c5d532a795bd2d5c282a094")
text9 = read_text("5c5da7aa1e67d78e275f8a3c")
text10 = read_text("5c5e50711e67d78e27616b23")


#### for Rachel's reference

5c1dbe1d1e67d78e2797d611.txt

5c1dccbf1e67d78e279807d8.txt

5c1de1661e67d78e27984d34.txt

5c1e0b68795bd2a5d03a49a9.txt

5c1efb3d1e67d78e279bd39a.txt

#### for Antanila's reference

5c5d3e251e67d78e275e54b5.txt

5c5d3f0a1e67d78e275e5788.txt

5c5d532a795bd2d5c282a094.txt

5c5da7aa1e67d78e275f8a3c.txt

5c5e50711e67d78e27616b23.txt
    

In [48]:
#coreferee texts
doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)
doc4 = nlp(text4)
doc5 = nlp(text5)

#displacy texts

doc6 = nlp(text6)
doc7 = nlp(text7)
doc8 = nlp(text8)
doc9 = nlp(text9)
doc10 = nlp(text10)

#### DOCUMENT 1 COREFERENCE CHAINS

In [51]:
doc1._.coref_chains.print()

0: couple(9), couple(76)
1: years(16), their(19)
2: letter(48), them(59), they(78)
3: Ayo(72), Ayo(150)
4: custody(87), it(96)
5: News(112), News(136)
6: fact(117), It(165)
7: orphanage(161), orphanage(206)
8: letter(171), letter(219)
9: Kim(174), Kim(211)
10: children(254), their(266)
11: adoption(300), adoption(333)
12: headlines(317), they(325)
13: Kim(338), Kim(401), Kim(425), her(437), she(452)
14: Nigeria(345), country(359)
15: papers(372), them(376)
16: Clark(395), Clark(432), He(444), him(478), him(486), him(496), he(501)
17: [Kim(401); son(404)], their(403)
18: Canada(474), They(484)
19: Nigeria(489), They(494)
20: Morans(561), Morans(602)
21: family(578), family(636), they(658), their(660), they(684)
22: government(589), it(598)
23: Kim(633), she(652), she(695)
24: agency(648), it(672)


#### DOCUMENT 2 COREFERENCE CHAINS

In [52]:
doc2._.coref_chains.print()


0: his(5), him(23), gatekeeper(33)
1: Mall(19), Mall(100)
2: visitors(36), their(44)
3: Roarke(70), his(80)
4: Press(97), Press(105)
5: Espinos(135), his(146)
6: Press(183), its(188)
7: superheroes(248), They(250), they(270)
8: creator(373), their(417)
9: Espinos(378), his(382), He(422), his(432)
10: their(437), [Markowitz(450); guy(459)]
11: publisher(487), They(514), They(524)
12: Campbell(496), his(509)
13: Campbell(534), his(550), His(557), his(569)
14: maverick(598), his(602), he(608), he(631), his(642)
15: Campbell(668), His(694), his(706)
16: tons(691), They(734)
17: trainer(717), It(754)
18: None(767), it(774)
19: two(817), It(877)
20: resident(896), his(903)
21: copy(936), it(943)
22: all(945), it(948), it(969)
23: Xander(992), he(997)
24: dad(1005), he(1023), he(1039)


#### DOCUMENT 3 COREFERENCE CHAINS 

In [53]:
doc3._.coref_chains.print()

0: Scheer(5), Scheer(24)
1: Trudeau(10), Trudeau(29)
2: Canada(21), Canada(48)
3: [Trudeau(29); party(33)], them(40)
4: Trudeau(76), his(87), His(107), Trudeau(113)
5: Scheer(104), Scheer(119)
6: [Scheer(119); Conservatives(122)], themselves(128)
7: accusations(157), they(163)
8: Convoy(187), his(236)
9: pipelines(193), their(197), they(204)
10: Scheer(242), he(244), his(259), Scheer(289)
11: Twitter(284), it(291)
12: reaction(294), it(303)
13: defence(352), it(358)
14: Chicken(365), he(401), he(406)
15: language(388), it(391), it(425)
16: WelcomeToCanada(440), Canada(525)
17: U.S.(450), country(467)
18: Trudeau(452), Trudeau(458), Trudeau(483), he(487)
19: Scheer(462), Scheer(533), he(535), his(537), Scheer(570), he(659)
20: [he(535); party(538)], they(545)
21: system(633), it(643)
22: Canada(653), It(666)
23: Conservatives(675), Conservatives(728), Conservatives(765)
24: spokeswoman(686), he(690)
25: Trudeau(788), Trudeau(824), Trudeau(826)
26: Conservatives(829), they(836)
27: sover

#### DOCUMENT 4 COREFERENCE CHAINS

In [54]:
doc4._.coref_chains.print()

0: Vancouver(20), Vancouver(34)
1: letter(44), it(50)
2: tax(87), It(89)
3: Vancouver(130), Vancouver(154)
4: council(140), its(145)
5: Some(161), they(170)
6: tax(167), tax(235)
7: Eby(191), his(210)
8: those(231), they(237), they(252), They(284), their(292)
9: Members(349), their(357)
10: Stewart(369), his(373)
11: Carr(419), her(428)
12: Vancouverites(434), they(448), their(457), Vancouverites(514)
13: Carr(483), Carr(517)
14: owners(496), they(500)
15: Fry(533), he(542), he(554), He(562), he(591), his(597)
16: virtue(581), It(625)
17: vote(589), vote(645)
18: reservations(605), they(612)
19: councillor(623), her(635)
20: Fry(653), he(655), his(660), he(700), he(704), He(716), His(730), he(740), Fry(747), himself(749), He(771)
21: province(664), province(690)
22: that(719), it(728)
23: council(758), it(774)
24: motion(787), it(804)
25: issues(816), their(823)
26: One(827), them(838), they(842)


#### DOCUMENT 5 COREFERENCE CHAINS

In [55]:
doc5._.coref_chains.print()



0: Stella(38), Stella(90), Stella(134), Stella(207), Stella(280)
1: union(46), union(77)
2: the(68), It(81)
3: workers(88), They(93), their(101)
4: lot(126), It(137)
5: restaurateurs(176), their(185)
6: Traeger(214), He(236)
7: union(233), union(239)
8: their(270), [owners(282); Sohlberg(284); Abreder(287)], they(289), they(327)
9: Traeger​(399), he(401), He(422), he(434)
10: unionization(424), it(447)
11: Stella(462), he(546)
12: employees(465), they(475), they(498), their(505)
13: ends(530), they(534), they(541)
14: union(613), union(658)
15: Stella(619), Stella(644)


In [43]:
# example: who does she(452) refer to?

print(doc1._.coref_chains.resolve(doc1[452]))

[Kim]


In [44]:
# print referring expressions that are people
# we are interested in those because they are the sources of quotes
for ent in doc1.ents:
    if ent.label_ in ["PERSON"]:
        print(ent.text, ent.label_)

Kim PERSON
Clark Moran PERSON
Ayo PERSON
Kim PERSON
Ayo PERSON
Kim PERSON
Clark PERSON
Kim PERSON
Ayo PERSON
Morans PERSON
Kim PERSON
Ayo PERSON
Clark PERSON
Kim PERSON
Kim PERSON
Clark PERSON
Ayo PERSON
Kim PERSON
Morans PERSON
Ayo PERSON
Kim PERSON
Ben Miljure PERSON
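
The list above repeats the same names many times. Since the goal is to identify candidate quote sources, a count per name may be easier to read. The following is a minimal sketch; `person_counts` is a hypothetical helper, and the stand-in entities in the demonstration only mimic the `text` and `label_` attributes of spaCy entities.

```python
from collections import Counter
from types import SimpleNamespace


def person_counts(ents):
    """Count how often each PERSON entity text occurs in an iterable of entities."""
    return Counter(ent.text for ent in ents if ent.label_ == "PERSON")


# Stand-in entities for illustration only; in the notebook, pass doc1.ents instead.
sample_ents = [
    SimpleNamespace(text="Kim", label_="PERSON"),
    SimpleNamespace(text="Clark Moran", label_="PERSON"),
    SimpleNamespace(text="Ayo", label_="PERSON"),
    SimpleNamespace(text="Kim", label_="PERSON"),
    SimpleNamespace(text="CTV News", label_="ORG"),
]

for name, count in person_counts(sample_ents).most_common():
    print(name, count)
```

Note that this counts surface strings, so *Clark* and *Clark Moran* are tallied separately; merging them would need the coreference chains or a name-matching step.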


## Side note: visualizations
If you want to see this all in a much prettier format, you can use [displacy](https://spacy.io/usage/visualizers). 

In [14]:
from spacy import displacy

#### Document 1 Visualization

In [45]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "green"}}

displacy.render(doc1, style="ent", options=options, jupyter=True)

#### Document 2 Visualization

In [29]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "purple"}}

displacy.render(doc2, style="ent", options=options, jupyter=True)

#### Document 3 Visualization

In [32]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "red"}}

displacy.render(doc3, style="ent", options=options, jupyter=True)

#### Document 4 Visualization

In [34]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "blue"}}

displacy.render(doc4, style="ent", options=options, jupyter=True)

#### Document 5 Visualization

In [35]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "black"}}

displacy.render(doc5, style="ent", options=options, jupyter=True)
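
The five visualization cells differ only in the document and the colour, so the options dictionary can be built by a small function. This is a sketch only: `person_options` and the `COLOURS` mapping are names introduced here for illustration, with the colours copied from the cells above.

```python
def person_options(color):
    """Build displaCy options that highlight only PERSON entities in one colour."""
    return {"ents": ["PERSON"], "colors": {"PERSON": color}}


# Colours used in the five cells above, keyed by document number
COLOURS = {1: "green", 2: "purple", 3: "red", 4: "blue", 5: "black"}

for number, color in COLOURS.items():
    options = person_options(color)
    print(number, options)
    # In the notebook, render the matching document with these options, e.g.
    # displacy.render(doc1, style="ent", options=options, jupyter=True)
```

Keeping the document number and colour side by side in one mapping makes it harder to render the wrong document under a heading.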

## Part 2: Run the quote extraction from Assignment 1
I suggest using the Matcher-based quote extraction system from A1, but if you implemented your own version or improved on that one, feel free to use it instead.

In [62]:
#import what we need for this
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

In [104]:
# we don't need to load the text again; use text1 from above

matcher = Matcher(nlp.vocab)
pattern_q = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}]
matcher.add("QUOTES", [pattern_q], greedy='LONGEST')
doc = nlp(text1)
matches_q = matcher(doc)
matches_q.sort(key = lambda x: x[1])
print (len(matches_q))
for match in matches_q[:10]:
    print (match, doc[match[1]:match[2]])

#The first quote matched starts at 115 and ends at 133
#(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."

#Print ten tokens before & after
print(doc[105:143])

#Find person entity
for ent in doc[105:143].ents:
    if ent.label_ in ["PERSON"]:
        print(ent.text)

# A short test sentence with single-quoted speech. It gets its own name so
# that it does not overwrite doc1 (document 1) from above.
test_doc = nlp("Although he was very busy with his work, Peter had had enough of it. 'God I am frustrated!', he said. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

# Same pattern as above, but for single quotation marks
matcher = Matcher(nlp.vocab)
pattern_q = [{"ORTH": "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {"ORTH": "'"}]

matcher.add("QUOTES", [pattern_q], greedy='LONGEST')
matches_q = matcher(test_doc)
matches_q.sort(key=lambda x: x[1])
print(len(matches_q))
for match in matches_q[:10]:
    print(match, test_doc[match[1]:match[2]])

# Find the pronoun "he" within ten tokens either side of the quote (tokens 16 to 23)
pronoun_index = None
for token in test_doc[6:33]:
    if token.text == "he":
        pronoun_index = token.i
        print(token.i)
        print(token)

# Find who that pronoun refers to
if pronoun_index is not None:
    print(test_doc._.coref_chains.resolve(test_doc[pronoun_index]))

3
(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."
(16432004385153140588, 164, 174) "It does say that in the letter,"
(16432004385153140588, 179, 209) "I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage."
right now," Kim told CTV News Friday. "The fact that we are being accused right now of an unethical adoption is crazy.". 
 CTV News has learned that a third party
Kim
1
(16432004385153140588, 16, 23) 'God I am frustrated!'
24
he
[Peter]
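
Both examples above search a window of ten tokens on either side of the matched quote: 105 to 143 for the quote at (115, 133), and 6 to 33 for the quote at (16, 23). The sketch below computes that window from the match offsets instead of hard-coding it, and lists the token indexes that lie in the window but outside the quote itself. `context_window` and `candidate_indexes` are hypothetical helpers, and the document lengths passed in the demonstration are placeholders rather than the real token counts.

```python
def context_window(start, end, doc_length, width=10):
    """Return (lo, hi) token offsets covering a quote plus `width` tokens each side.

    `start` and `end` are the match offsets from the Matcher, with `end`
    exclusive. The window is clipped to the document boundaries.
    """
    lo = max(start - width, 0)
    hi = min(end + width, doc_length)
    return lo, hi


def candidate_indexes(start, end, doc_length, width=10):
    """Return token indexes inside the window but outside the quote itself."""
    lo, hi = context_window(start, end, doc_length, width)
    return list(range(lo, start)) + list(range(end, hi))


# Placeholder document lengths (700 and 60), used for illustration only
print(context_window(115, 133, 700))
print(candidate_indexes(16, 23, 60))
```

In the notebook, `doc[lo:hi].ents` can then be checked for PERSON entities, and the candidate tokens can be passed to `resolve`. Excluding the quote's own tokens avoids picking up pronouns such as *I* that occur inside the quoted speech.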
