# Ling 450/807 SFU - Assignment 1

This assignment walks you through two different ways of extracting simple quotes from text and then directs you to a third, already implemented way. Your task is to enhance the simple methods or develop your own. For further instructions, check the assignment file on Canvas. You'll need this notebook and the sample files in the A1_data directory. 

## Import packages

We import everything we will need here at the beginning and load the spaCy language model. Note that we are using the small English model. One thing you could try is to download and load [other models for English](https://spacy.io/models/en) and compare the results. 

In [23]:
import spacy
import re
import os
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

## Approach 1: Using regular expressions

### Finding text within quotes

In [27]:
import re
import os

# Define function to extract text within quotes
def get_quotes(text):
    quotes = re.findall(r'“(.*?)”', text)
    return quotes

# Directory containing the text files
directory = "C:\\Users\\User\\SDA250Mywork\\A1_data"

# Process each text file in the directory
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r", encoding='utf-8') as f:
            text = f.read()

        print(f"Quotes extracted from {file_path}:")
        quotes = get_quotes(text)

        # Print the extracted quotes
        for quote in quotes:
            print(quote)
        print()


Quotes extracted from C:\Users\User\SDA250Mywork\A1_data\5c1452701e67d78e276ee126.txt:
I was clear when I was mayor – I don’t support Uber at all,
It was a twinkle in some engineer’s eye some years ago.
Mayor McCallum’s statements vary greatly from truth,
There’s a tried-and-true method in Canadian politics: after an election a new government takes office and says, ‘Oh my gosh, the cupboards are bare.’ Or, ‘We’re much deeper in debt than I thought we were, and now I’ve seen the real books.' So I think there’s an element of that kind of gamesmanship going on,
Then there’s the fact that McCallum has been out of office for quite some time, thinking he knew the job, but some things have changed,
If you take Fraser Highway SkyTrain and if we’re building that seven days a week around the clock, we probably can save, and this is TransLink’s figures, we can probably save $2-300 million,
TransLink has not conducted any detailed study on potential construction methods for a SkyTrain route from S
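
The pattern in `get_quotes` only matches curly double quotes and, because `.` does not match newlines by default, it misses quotes that span a line break. Below is a minimal sketch of one possible enhancement, assuming that straight double quotes may also appear in some files. The name `get_quotes_extended` and the sample string are hypothetical and not part of the assignment code.

```python
import re

# Matches text between curly double quotes or between straight double quotes.
# re.DOTALL lets "." match newlines, so quotes spanning lines are captured.
QUOTE_PATTERN = re.compile(r'“(.*?)”|"(.*?)"', re.DOTALL)

def get_quotes_extended(text):
    """Return the text inside curly or straight double quotes, in order."""
    quotes = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one of the two groups is set for each match
        content = match.group(1) if match.group(1) is not None else match.group(2)
        # Collapse line breaks and repeated whitespace inside the quote
        quotes.append(" ".join(content.split()))
    return quotes

# Made-up sample mixing both quote styles and a line break
sample = 'He said, “I don’t support\nUber at all,” and she replied, "Noted."'
print(get_quotes_extended(sample))
```

Nested quotes (a quote inside a quote) and unbalanced quote marks are not handled here; those are cases you may want to examine in your own files.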

## Approach 2: Using spaCy's Matcher

This approach is based on notebooks by [William J.B. Mattingly](https://wjbmattingly.com/). His book, Introduction to Python for Humanists, is available online from the [SFU Library](https://sfu-primo.hosted.exlibrisgroup.com/permalink/f/usv8m3/01SFUL_ALMA51476999620003611). 

For more on spaCy's Matcher, see Advanced NLP with spaCy, [chapter 2](https://course.spacy.io/en/chapter2).

We have already loaded everything we need at the beginning of this notebook (imported Matcher, assigned it to a `matcher` object), so now we can use it. 

## Finding quotes and speakers

In [None]:
# Load a text file. Remember, you have to process 5 files in total.
with open("A1_data/5c1dbe1d1e67d78e2797d611.txt", "r", encoding='utf-8') as f:
    text = f.read()

In [None]:
# convert it to a spacy doc
doc = nlp(text)

### Finding proper nouns

In [None]:
# This is optional. It lists the proper nouns in the text, which shows you who is mentioned. You can use it later if you want to identify the speakers of the quotes.

matcher = Matcher(nlp.vocab)
pattern_n = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUNS", [pattern_n], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

## You can try to extract full names by matching multi-word proper nouns; see http://spacy.pythonhumanities.com/02_02_matcher.html

### Finding quotes

In [None]:
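# a simple pattern to extract text inside straight double quotes (")
# note that it will not match curly quotes (“ ”), which Approach 1 looks for
# as with Approach 1, the for loop prints the results to the screen
# you can try saving the results to a file if you want to compare with Approaches 1 and 3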

matcher = Matcher(nlp.vocab)
pattern_q = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}]
matcher.add("QUOTES", [pattern_q], greedy='LONGEST')
doc = nlp(text)
matches_q = matcher(doc)
matches_q.sort(key=lambda x: x[1])
print(len(matches_q))
for match in matches_q[:10]:
    print(match, doc[match[1]:match[2]])
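
To compare the approaches, it can help to save each approach's output in the same format. The sketch below writes (file name, quote) pairs to a CSV file and reads them back, using only the standard library. The helpers `save_quotes` and `load_quotes`, the column names, and the output file name are hypothetical choices, not part of the assignment. The example rows are made up, and the demonstration writes to a temporary directory so that it leaves no files behind.

```python
import csv
import os
import tempfile

def save_quotes(rows, out_path):
    """Write (source_file, quote) pairs to a CSV file with a header row."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source_file", "quote"])
        writer.writerows(rows)

def load_quotes(in_path):
    """Read the (source_file, quote) pairs back, skipping the header row."""
    with open(in_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [tuple(row) for row in reader]

# Made-up example rows; in the notebook these would come from your matches
example_rows = [
    ("5c1dbe1d1e67d78e2797d611.txt", "an example quote"),
    ("5c1dbe1d1e67d78e2797d611.txt", "another quote, with a comma"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "approach2_quotes.csv")
    save_quotes(example_rows, out_path)
    loaded_rows = load_quotes(out_path)

print(loaded_rows)
```

In the notebook, the rows for Approach 2 could be built from `doc[match[1]:match[2]].text` for each match, and `out_path` could point to a file in your working directory instead of a temporary one.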

## Approach 3: Implemented version
This approach was implemented by colleagues at the [Australian Text Analytics Platform](https://www.atap.edu.au/) (ATAP). It is based on the [Gender Gap Tracker](https://github.com/sfu-discourse-lab/GenderGapTracker), developed at the Discourse Processing Lab here at SFU. 

The first link below leads you to a binder where you can load your own files and download the output. If you prefer to do everything in your own notebook, you can download/clone the project from GitHub. 

* [Binder link](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/workshop_01_20220908/README.md)

    * Click on the "binder launch" button.
    * At the CILogin, under "Select an Identity Provider", open the drop-down menu (which usually defaults to ORCID) and select "Simon Fraser University".
    * This launches [Binder](https://mybinder.readthedocs.io/en/latest/), a service that allows you to run a notebook online on Jupyter Lab (similar to Google Colab). 
    * Run all the code cells in that notebook, uploading files from the A1_data directory. 
    * At the end, you can save the output as an Excel file. 

* [Regular GitHub project](https://github.com/Australian-Text-Analytics-Platform/quotation-tool)

    * Run the notebook "quote_extractor_notebook.ipynb"

Within the ATAP binder, upload 5 files from A1_data (the same ones you used for Approaches 1 and 2), process them, and download the results to your own computer. 
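
Once you have output from more than one approach, a rough way to compare them is to normalize the quotes and look at the overlap. This is a minimal sketch, assuming each approach's quotes are available as a list of strings. The helpers `normalize` and `compare_quotes` are hypothetical names, and the example lists are made up for illustration.

```python
def normalize(quote):
    """Casefold a quote and strip surrounding quote marks, commas, periods, and spaces."""
    collapsed = " ".join(quote.split())
    return collapsed.strip(' "“”\'‘’,.').casefold()

def compare_quotes(quotes_a, quotes_b):
    """Return quotes found by both approaches, only by A, and only by B."""
    set_a = {normalize(q) for q in quotes_a}
    set_b = {normalize(q) for q in quotes_b}
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Made-up example output from two approaches
regex_quotes = ["I don’t support Uber at all,", "Some things have changed,"]
matcher_quotes = ['"I don’t support Uber at all"', '"A quote the regex missed."']

result = compare_quotes(regex_quotes, matcher_quotes)
print(result)
```

Exact matching after normalization is strict: two approaches that capture the same quote with slightly different boundaries will be counted as different quotes, so treat the overlap as a starting point for closer reading rather than a final score.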

## Your turn

Check instructions on Canvas for what to do and what to submit. 

