# Ling 450/807 SFU - Assignment 1

Virginia Uhi, Eunice Wong, & Han Zhang


## Import packages

We import everything we will need here at the beginning and load the spaCy language model. Note that we are using the small English model. One thing you could try is to download and load [other models for English](https://spacy.io/models/en) and compare the results. 

In [1]:
# Import spaCy, its Matcher, and the standard-library modules we need
import os
import spacy
import re
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)


## Approach 1: Using regular expressions

In [21]:
# The starter file loads and processes only one file at a time. Here we process five files in one run.

# The five text files to process
five_files = [
    "A1_data/5c1548a31e67d78e2771624f.txt",
    "A1_data/5c489df91e67d78e271d66c5.txt",
    "A1_data/5c182ac21e67d78e277944ad.txt",
    "A1_data/5c28972a795bd2fac69fa974.txt",
    "A1_data/5c29beda1e67d78e27b74939.txt",
]

def get_quotes(text):
    # Find every span enclosed in double quotation marks (straight or curly).
    # The capture group returns only the quoted content, without the marks themselves.
    return re.findall(r'["“](.*?)[”"]', text)

# Extract the quotes from all five files and collect them in a single output file.
# open() needs one path, not the list five_files; "quotes_output.txt" is a placeholder name (an assumption).
with open("quotes_output.txt", "w", encoding="utf-8") as output:
    for file_path in five_files:
        output.write(file_path + "\n")
        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()
        # The starter code separates the text into sentences at the beginning.
        # To identify the quotes better, we do not split the text into sentences first.

        found_quotes = get_quotes(text)
        print(found_quotes)
        for quote in found_quotes:
            output.write(quote + "\n")  # "\n" puts each quote on its own line
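
To make the behaviour of the regular expression in `get_quotes` concrete, here is a minimal sketch run on an invented sample string rather than on the assignment data. It shows that the capture group returns only the quoted content. The `re.DOTALL` flag is an addition that is not in the cell above: without it, `.` does not match a newline, so a quote that continues across a line break would be missed.

```python
import re

# Same pattern as in get_quotes, compiled once.
# re.DOTALL is an assumption added here so that "." also matches newlines.
QUOTE_RE = re.compile(r'["“](.*?)[”"]', re.DOTALL)

def get_quotes(text):
    """Return the text inside each pair of straight or curly double quotes."""
    return QUOTE_RE.findall(text)

# Invented sample: one straight-quoted and one curly-quoted span,
# the second of which contains a line break.
sample = 'She said, "It works." Then he added, “So\nit does.”'
quotes = get_quotes(sample)
print(quotes)
```

Because the pattern accepts either kind of mark on both sides, a curly opening mark followed by a straight closing mark would also count as a quote; whether that matters depends on how consistently the source texts are typeset.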

## Approach 2: Using spaCy's Matcher

This approach is based on notebooks by [William J.B. Mattingly](https://wjbmattingly.com/). His book, Introduction to Python for Humanists, is available online from the [SFU Library](https://sfu-primo.hosted.exlibrisgroup.com/permalink/f/usv8m3/01SFUL_ALMA51476999620003611). 

For more on spaCy's Matcher, see Advanced NLP with spaCy, [chapter 2](https://course.spacy.io/en/chapter2).

We have already loaded everything we need at the beginning of this notebook (imported Matcher, assigned it to a `matcher` object), so now we can use it. 

## Finding quotes and speakers

### Finding proper nouns

In [19]:
# This is optional. It lists the proper nouns (including people) mentioned in the text. You can use it later if you want to find out the speakers of the quotes.

# matcher = Matcher(nlp.vocab)
# pattern_n = [{"POS": "PROPN"}]
# matcher.add("PROPER_NOUNS", [pattern_n], greedy="LONGEST")
# doc = nlp(text)
# matches = matcher(doc)
# print (len(matches))
# for match in matches[:10]:
    #print (match, doc[match[1]:match[2]])
    
# You can try to extract full names by matching multi-word proper nouns; see http://spacy.pythonhumanities.com/02_02_matcher.html

63
(3232560085755078826, 1, 2) CTV
(3232560085755078826, 2, 3) Vancouver
(3232560085755078826, 6, 7) Abbotsford
(3232560085755078826, 8, 9) B.C.
(3232560085755078826, 25, 26) Africa
(3232560085755078826, 42, 43) Kim
(3232560085755078826, 44, 45) Clark
(3232560085755078826, 45, 46) Moran
(3232560085755078826, 52, 53) Immigration
(3232560085755078826, 54, 55) Refugees


In [6]:
# We comment out the code above and try another method.
# The starter code for extracting proper nouns runs into problems when a proper noun contains multiple words.

matcher = Matcher(nlp.vocab)
pattern_n = [{"POS": "PROPN", "OP": "+"}]  # "OP": "+" matches one or more consecutive proper nouns
matcher.add("PROPER_NOUNS", [pattern_n], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])  # sort by start token, in case greedy="LONGEST" returns the matches ordered by length
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])
    

70
(3232560085755078826, 0, 1) Australia
(3232560085755078826, 9, 10) Canadians
(3232560085755078826, 12, 13) China
(3232560085755078826, 30, 31) Canada
(3232560085755078826, 39, 44) Foreign Affairs Minister Marise Payne
(3232560085755078826, 45, 46) Sunday
(3232560085755078826, 55, 56) Australia
(3232560085755078826, 71, 72) France
(3232560085755078826, 73, 76) New York Times
(3232560085755078826, 83, 84) China


### Finding quotes

In [25]:
matcher = Matcher(nlp.vocab)

# One pattern each for straight and curly quotation marks
pattern_q16 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #7
pattern_q17 = [{'ORTH': '“'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '”'}] #4
matcher.add("QUOTES", [pattern_q16, pattern_q17])

for file in os.listdir("A1_grp3_txt"):
    path = os.path.join("A1_grp3_txt", file)
    if not os.path.isfile(path):
        continue  # skip directories such as .ipynb_checkpoints, which cannot be opened as files
    print(file)

    with open(path, "r", encoding="utf-8") as f:  # open each file in the folder and read it
        text = f.read()

    doc = nlp(text)
    matches_q = matcher(doc)
    print(len(matches_q))
    for match in matches_q[:10]:
        print(match, doc[match[1]:match[2]])
    print("\n")  # blank line between outputs
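
`os.listdir` returns every entry in a folder, including subdirectories such as the `.ipynb_checkpoints` folder that Jupyter creates, and calling `open` on a directory raises `IsADirectoryError`. A minimal sketch of one way to list only regular files, using `pathlib` on a throwaway temporary folder; the helper name `list_regular_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def list_regular_files(folder):
    """Return the regular files in `folder`, sorted by name, skipping subdirectories."""
    return sorted(p for p in Path(folder).iterdir() if p.is_file())

# Demonstration on a temporary folder with two files and one subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first", encoding="utf-8")
    (root / "b.txt").write_text("second", encoding="utf-8")
    (root / ".ipynb_checkpoints").mkdir()
    names = [p.name for p in list_regular_files(root)]

print(names)
```

The same check can be written with `os.path.isfile(os.path.join(folder, name))` if you prefer to stay with `os.listdir`.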

In [5]:
print(os.getcwd())

/home/a1a01287-00fd-4f66-9d41-05fe0b880957/LING807 Compuling


In [69]:
# Combinations of different OP values. The number after each pattern is the number of matches it returned.

matcher = Matcher(nlp.vocab)
pattern_q = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "!"}, {'IS_PUNCT': True, "OP": "!"}, {'ORTH': '"'}] #0
pattern_q1 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "!"}, {'IS_PUNCT': True, "OP": "?"}, {'ORTH': '"'}] #0
pattern_q2 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "!"}, {'IS_PUNCT': True, "OP": "+"}, {'ORTH': '"'}] #0
pattern_q3 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "!"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #0

pattern_q4 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "?"}, {'IS_PUNCT': True, "OP": "!"}, {'ORTH': '"'}] #0
pattern_q5 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "?"}, {'IS_PUNCT': True, "OP": "?"}, {'ORTH': '"'}] #0
pattern_q6 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "?"}, {'IS_PUNCT': True, "OP": "+"}, {'ORTH': '"'}] #0
pattern_q7 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "?"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #0

pattern_q8 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "!"}, {'ORTH': '"'}] #0
pattern_q9 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "?"}, {'ORTH': '"'}] #3
pattern_q10 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "+"}, {'ORTH': '"'}] #3
pattern_q11 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #3

pattern_q12 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "!"}, {'ORTH': '"'}] #0
pattern_q13 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "?"}, {'ORTH': '"'}] #3
pattern_q14 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "+"}, {'ORTH': '"'}] #3
pattern_q15 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #3

pattern_q16 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #7
pattern_q17 = [{'ORTH': '“'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '”'}] #4

#pattern_q4 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'ORTH': '"'}]  #control
#pattern_q5 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'ORTH': '.,!?': True, "OP": "+"}, {'ORTH': '"'}] 
#pattern_q6 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'ORTH': '"'}]
#pattern_q7 = [{'ORTH': '"'}, {'IS_ALPHA': True, 'OP': '?'}, {'ORTH': '"'}]

matcher.add("QUOTES", [pattern_q16, pattern_q17])
doc = nlp(text)
matches_q = matcher(doc)
# matches_q.sort(key = lambda x: x[1])
print (len(matches_q))
for match in matches_q[:10]:
    print (match, doc[match[1]:match[2]])

# Straight vs. curly quotation marks may matter: a straight mark can either open or close a quote, so the text between two quotes is matched as well.

7
(16432004385153140588, 108, 116) " Kim told CTV News Friday. "
(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."
(16432004385153140588, 108, 133) " Kim told CTV News Friday. "The fact that we are being accused right now of an unethical adoption is crazy."
(16432004385153140588, 164, 174) "It does say that in the letter,"
(16432004385153140588, 173, 180) " Kim confirmed, adding that "
(16432004385153140588, 179, 209) "I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage."
(16432004385153140588, 280, 309) "in some cases, extra steps in the citizenship or immigration process may be needed to make sure the adoption meets all requirements of international adoption."
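
The output above includes spans such as `" Kim told CTV News Friday. "`, where the closing mark of one quote and the opening mark of the next are treated as a pair. One possible workaround, sketched here under the assumption that straight quotation marks strictly alternate between opening and closing throughout the text, is to pair the marks in order. The function `paired_quote_spans` and the token list are hypothetical; with spaCy, the list could be built as `[t.text for t in doc]`.

```python
def paired_quote_spans(tokens, mark='"'):
    """Pair quotation marks in order: 1st with 2nd, 3rd with 4th, and so on.

    Returns (start, end) token spans with an exclusive end, including the
    marks themselves. An unpaired final mark is ignored.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == mark]
    openers = positions[0::2]   # 1st, 3rd, 5th ... mark
    closers = positions[1::2]   # 2nd, 4th, 6th ... mark
    return [(start, end + 1) for start, end in zip(openers, closers)]

# Invented token list: two quotes separated by a reporting clause.
tokens = ['"', 'It', 'is', 'crazy', ',', '"', 'Kim', 'said', '.',
          '"', 'We', 'were', 'there', '.', '"']
spans = paired_quote_spans(tokens)
quotes = [" ".join(tokens[start:end]) for start, end in spans]
print(spans)
print(quotes)
```

This only works if no quotation mark is missing or stray; a single unbalanced mark shifts every later pair, so the result is worth spot-checking against the text.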


## Approach 3: Implemented version
This approach was implemented by colleagues at the [Australian Text Analytics Platform](https://www.atap.edu.au/) (ATAP). It is based on the [Gender Gap Tracker](https://github.com/sfu-discourse-lab/GenderGapTracker), developed at the Discourse Processing Lab here at SFU.

The first link below leads you to a binder where you can load your own files and download the output. If you prefer to do everything in your own notebook, you can download/clone the project from GitHub. 

* [Binder link](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/workshop_01_20220908/README.md)

    * Click on the "binder launch" button.
    * At the CILogin, under "Select an Identity Provider", open the drop-down menu (which usually defaults to ORCID) and select "Simon Fraser University".
    * This launches [Binder](https://mybinder.readthedocs.io/en/latest/), a service that lets you run a notebook online in JupyterLab (similar to Google Colab).
    * Run all the code cells in that notebook, uploading files from the A1_data directory. 
    * At the end, you can save the output as an Excel file. 

* [Regular GitHub project](https://github.com/Australian-Text-Analytics-Platform/quotation-tool)

    * Run the notebook "quote_extractor_notebook.ipynb"

Within the ATAP binder, upload 5 files from A1_data (the same ones you used for approaches 1 and 2), process them, and download the results to your own computer.

## Your turn

Check instructions on Canvas for what to do and what to submit. 

