# Ling 450/807 SFU - Assignment 1

This assignment walks you through two different ways of extracting simple quotes from text and then directs you to a third, already implemented way. Your task is to enhance the simple methods or develop your own. For further instructions, check the assignment file on Canvas. 
The binder contains this notebook and some sample files.

Group 1: Antanila, Rachel, Lovely

## Approach 1: Using regular expressions

In [1]:
import spacy
import re

In [2]:
# This cell loads and processes only one file at a time. You need to process 5 files and comment on the results.
# To load the 5 texts, you can either change the file name below for each run
# or figure out a way to pass a list of files to the read command. It's up to you.

with open("data/5c1dbe1d1e67d78e2797d611.txt", "r", encoding="utf-8") as f:
    text = f.read()
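
One possible way to pass a list of files, as the comment above suggests, is a small helper that reads each path and keeps the text keyed by file name. This is only a sketch: `read_texts` is a hypothetical name, not part of the assignment code.

```python
from pathlib import Path


def read_texts(paths, encoding="utf-8"):
    """Return a dict that maps each file name to the full text of that file."""
    texts = {}
    for path in map(Path, paths):
        # Path.read_text opens, reads, and closes the file in one step
        texts[path.name] = path.read_text(encoding=encoding)
    return texts
```

Assuming the sample files sit in `data/`, a call such as `read_texts(sorted(Path("data").glob("*.txt"))[:5])` would load the first five of them.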

In [3]:
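def find_sents(text):
    # note: the model is loaded on every call; load it once outside the
    # function if you call find_sents on several texts
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    sentences = list(doc.sents)
    return sentences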

### Finding text within quotes

In [4]:
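def get_quotes(text):
    # note: the pattern starts with a space, so a quote at the very
    # beginning of the string is not matched
    quotes = re.findall(r' "(.*?)"', text)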
    return(quotes)
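
One way to enhance this function is sketched below, under stated assumptions. It drops the required leading space so that quotes at the start of a string are found, accepts curly quotation marks as well as straight ones, and skips very short quoted strings such as scare quotes. The name `get_quotes_relaxed` and the `min_words` threshold are illustrative choices, not part of the assignment code.

```python
import re

# straight quotes "..." or curly quotes “...”; the content may not contain
# the closing mark itself
QUOTE_RE = re.compile(r'"([^"]+)"|“([^”]+)”')


def get_quotes_relaxed(text, min_words=3):
    """Return quoted strings that contain at least `min_words` words."""
    quotes = []
    for match in QUOTE_RE.finditer(text):
        # exactly one of the two groups is filled, depending on the quote style
        content = (match.group(1) or match.group(2)).strip()
        if len(content.split()) >= min_words:
            quotes.append(content)
    return quotes
```

The word threshold is a blunt filter: it removes one-word scare quotes but would also drop a genuine short quote like "No comment", so the value is worth tuning against your five texts.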

In [5]:
found_sents = find_sents(text)

In [6]:
# note: this only prints the text in quotes. If you want to save the results locally
# to analyze how the 3 approaches differ, you need to write them out yourself,
# for instance to a text file

for sent in found_sents:
    str_sent = str(sent)
    found_quotes = get_quotes(str_sent)
    if len(found_quotes) > 0:
        print(found_quotes)

["Honestly, it feels like we're living our worst nightmare right now,"]
['It does say that in the letter,', 'I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage.']
['the Government of Canada has obligations under international conventions to ensure children are not abducted, bought or sold, or removed from their biological families without legal consent.']
['in some cases, extra steps in the citizenship or immigration process may be needed to make sure the adoption meets all requirements of international adoption.']
["We're not giving up, but it feels really overwhelming to think about what this means and what they're trying to do to us right now,"]
["I can't believe that this is our life, that this is our story."]
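
To compare the three approaches later, the printed lists can be written to disk instead. The sketch below stores the quotes per file as JSON; `save_quotes`, `load_quotes`, and the output file name are hypothetical, and a plain text file would work just as well.

```python
import json
from pathlib import Path


def save_quotes(quotes_by_file, out_path):
    """Write a {file name: [quotes]} dict to a JSON file and return its path."""
    out_path = Path(out_path)
    # ensure_ascii=False keeps curly quotes and accented characters readable
    out_path.write_text(
        json.dumps(quotes_by_file, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out_path


def load_quotes(path):
    """Read back a dict written by save_quotes."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, `save_quotes({"5c1dbe1d1e67d78e2797d611.txt": all_quotes}, "approach1_quotes.json")` would store one list of quotes under the file's name, where `all_quotes` is a list you collect in the loop above instead of printing.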


## Approach 2: Using spaCy's Matcher

This approach is based on notebooks by Dr. W.J.B. Mattingly, http://spacy.pythonhumanities.com

In [3]:
# load all the stuff we'll need
import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

## spaCy's Matcher
This notebook relies on spaCy's Matcher (see Advanced NLP with spaCy, [chapter 2](https://course.spacy.io/en/chapter2)). 
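
As a quick orientation, the sketch below shows the basic Matcher workflow: define a token pattern, register it under a name, and run it over a Doc. It assumes a blank English pipeline (tokenizer only) is enough, which holds here because the pattern looks at token text and casing rather than POS tags. The pattern and the sample sentence are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# a blank pipeline only tokenizes, so no model download is needed
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "CTV" followed by one or more title-cased tokens
pattern = [{"ORTH": "CTV"}, {"IS_TITLE": True, "OP": "+"}]
matcher.add("CTV_NAME", [pattern], greedy="LONGEST")

doc = nlp("CTV Vancouver. Kim told CTV News Friday.")

# each match is a (match_id, start_token, end_token) tuple;
# sort by start position so results follow the order of the text
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

Patterns that use `POS`, as in the cells below, need a pipeline with a tagger such as `en_core_web_sm`.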

## Finding quotes and speakers

In [4]:
# load the five text files
with open("data/5c1dbe1d1e67d78e2797d611.txt", "r", encoding="utf-8") as f:
    text = f.read()
with open("data/5c1dccbf1e67d78e279807d8.txt", "r", encoding="utf-8") as f:
    text2 = f.read()
with open("data/5c1de1661e67d78e27984d34.txt", "r", encoding="utf-8") as f:
    text3 = f.read()
with open("data/5c1e0b68795bd2a5d03a49a9.txt", "r", encoding="utf-8") as f:
    text4 = f.read()
with open("data/5c1efb3d1e67d78e279bd39a.txt", "r", encoding="utf-8") as f:
    text5 = f.read()

In [20]:
# Convert texts to Docs
doc = nlp(text)
doc2 = nlp(text2)
doc3 = nlp(text3)
doc4 = nlp(text4)
doc5 = nlp(text5)

docList = [doc, doc2, doc3, doc4, doc5]

# Print each document, followed by a separator
# (the loop variable is named d so that it does not overwrite doc)
for d in docList:
    print(d)
    print("------")

 CTV Vancouver. 
 An Abbotsford, B.C. couple that has been waiting nearly two years to bring their newly adopted son home from Africa has learned that the Canadian government is not prepared to grant the child citizenship. 
 Kim and Clark Moran received a letter this week from Immigration, Refugees and Citizenship Canada informing them that the federal department has concerns about two-year-old Ayo, whom the couple claims they adopted from an orphanage in Nigeria and gained custody of in August. 
 "Honestly, it feels like we're living our worst nightmare right now," Kim told CTV News Friday. "The fact that we are being accused right now of an unethical adoption is crazy.". 
 CTV News has learned that a third party has come forward with an allegation that Ayo's adoption came from a private residence and not an orphanage. 
 "It does say that in the letter," Kim confirmed, adding that "I have no idea where that information came from because both Clark and I were there in the office with a

### Finding proper nouns

In [17]:
# Extract multi-word proper nouns.
# greedy="LONGEST" matches as much of the noun sequence as possible.
# In this case, do we want "CTV News Friday" or just "Friday"?

matcher = Matcher(nlp.vocab)
pattern_n = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern_n], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])


56
(3232560085755078826, 111, 114) CTV News Friday
(3232560085755078826, 1, 3) CTV Vancouver
(3232560085755078826, 44, 46) Clark Moran
(3232560085755078826, 56, 58) Citizenship Canada
(3232560085755078826, 135, 137) CTV News
(3232560085755078826, 722, 724) CTV Vancouver
(3232560085755078826, 725, 727) Ben Miljure
(3232560085755078826, 6, 7) Abbotsford
(3232560085755078826, 8, 9) B.C.
(3232560085755078826, 25, 26) Africa


### Finding quotes

In [21]:
# a simple pattern to extract things in double quotes
# as with Approach 1, the for loop prints the results to the screen
# you can try to save them to a file if you want to compare with Approaches 1 and 3

matcher = Matcher(nlp.vocab)
pattern_q = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}]
matcher.add("QUOTES", [pattern_q], greedy='LONGEST')
doc = nlp(text)
matches_q = matcher(doc)
matches_q.sort(key=lambda x: x[1])
print(len(matches_q))
for match in matches_q[:10]:
    print(match, doc[match[1]:match[2]])


3
(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."
(16432004385153140588, 164, 174) "It does say that in the letter,"
(16432004385153140588, 179, 209) "I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage."
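
The pattern above finds only three quotes, likely because it requires a run of alphabetic tokens followed only by punctuation, so a comma or a contraction in the middle of a quote breaks the match. One alternative, sketched below, does not use the Matcher at all: it pairs quotation-mark tokens and keeps the span between each pair, dropping very short spans. The function name `quote_spans` and the `min_tokens` threshold are hypothetical, and a blank pipeline is assumed to be enough because only tokenization is needed.

```python
import spacy

QUOTE_MARKS = {'"', "“", "”"}


def quote_spans(doc, min_tokens=4):
    """Pair up quotation-mark tokens and return the spans they enclose."""
    spans = []
    open_idx = None
    for token in doc:
        if token.text not in QUOTE_MARKS:
            continue
        if open_idx is None:
            # first mark of a pair: remember where the quote opens
            open_idx = token.i
        else:
            # second mark of a pair: the span includes both quotation marks
            span = doc[open_idx : token.i + 1]
            if len(span) - 2 >= min_tokens:
                spans.append(span)
            open_idx = None
    return spans


nlp = spacy.blank("en")
doc = nlp('"It does say that in the letter," Kim confirmed, calling it "odd" too.')
print([span.text for span in quote_spans(doc)])
```

Pairing marks in order assumes the quotation marks in the text are balanced; a single stray mark would shift every later pair, so it is worth checking the output against the source text.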


### Expansion on Approach 2: "capture all the strings in quotes, but not shorter strings that are not really quotes, and the speaker of each quote."

In [32]:
# Match quote to speaker
# Expansion: To capture speaker, make use of speech verbs

speech_verbs = ["told", "confirmed", "notes", "says", "said"]

matcher = Matcher(nlp.vocab)
#pattern_n = [{"POS": "PROPN", "OP": "+"}]

pattern_test = [
    
    #"___", said Alice
    [{"ORTH": {"IN": speech_verbs}}, {"ORTH": ",", "OP": "*"},{"POS": "PROPN", "OP": "+"}],
    
    #Alice said ___
    [{"POS": "PROPN", "OP": "+"}, {"ORTH": {"IN": speech_verbs}}]
    
]
matcher.add("PROPER_NOUNS", pattern_test, greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

5
(3232560085755078826, 110, 114) told CTV News Friday
(3232560085755078826, 174, 176) Kim confirmed
(3232560085755078826, 508, 510) Kim said
(3232560085755078826, 514, 516) told CTV
(3232560085755078826, 633, 635) Kim said
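
In the output above, "told CTV News Friday" captures the addressee rather than the speaker, because "told" is treated like the verbs that can precede a speaker's name. A possible refinement, sketched below, registers the two word orders under separate names and leaves "told" out of the verb-first pattern. All names here (`find_speakers`, `NAME_VERB`, `VERB_NAME`, the two verb lists) are hypothetical. To keep the sketch runnable without a model, the example Doc is built by hand with POS tags; in the notebook you would pass `nlp(text)` from `en_core_web_sm` instead.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

VERBS_ANY = ["told", "confirmed", "notes", "says", "said"]
# verbs that can come before the speaker, as in: "...", said Alice
VERBS_BEFORE_NAME = ["confirmed", "notes", "says", "said"]


def find_speakers(doc):
    """Return (speaker, verb) pairs in text order."""
    matcher = Matcher(doc.vocab)
    name = {"POS": "PROPN", "OP": "+"}
    matcher.add("NAME_VERB", [[name, {"ORTH": {"IN": VERBS_ANY}}]], greedy="LONGEST")
    matcher.add("VERB_NAME", [[{"ORTH": {"IN": VERBS_BEFORE_NAME}}, name]], greedy="LONGEST")
    results = []
    for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
        span = doc[start:end]
        if doc.vocab.strings[match_id] == "NAME_VERB":
            speaker, verb = span[:-1], span[-1]
        else:
            speaker, verb = span[1:], span[0]
        results.append((speaker.text, verb.text))
    return results


# hand-tagged example (an assumption that stands in for a tagged Doc)
words = ["Kim", "told", "CTV", "News", "Friday", ".",
         "Later", ",", "said", "Clark", "Moran", "."]
pos = ["PROPN", "VERB", "PROPN", "PROPN", "PROPN", "PUNCT",
       "ADV", "PUNCT", "VERB", "PROPN", "PROPN", "PUNCT"]
doc = Doc(spacy.blank("en").vocab, words=words, pos=pos)
print(find_speakers(doc))
```

This still does not link a speaker to a specific quote. One option is to compare the token offsets returned here with the offsets of the quote matches and attach each speaker to the nearest quote.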


## Approach 3: Implemented version
This approach was implemented by colleagues at the [Australian Text Analytics Platform](https://www.atap.edu.au/) (ATAP). It is based on the [Gender Gap Tracker](https://github.com/sfu-discourse-lab/GenderGapTracker), developed at the Discourse Processing Lab here at SFU.

The first link below leads you to a binder where you can load your own files and download the output. If you prefer to do everything in your own notebook, you can download or clone the project, which includes a notebook (quote_extractor_notebook.ipynb).

* [Binder link](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/workshop_01_20220908/README.md)
* [Regular GitHub project](https://github.com/Australian-Text-Analytics-Platform/quotation-tool)

Within the ATAP binder, upload 5 files from A1/data (the same ones you used for Approaches 1 and 2), process them, and download the results to your own computer.

## Your turn

Check instructions on Canvas for what to do and what to submit. 

