## Expansion on Regular Expressions

In [1]:
import re

def extract_quotes_with_speaker_regex(text):
    # Define the regular expression pattern to match speaker and quote pairs
    pattern = r'(?P<speaker>[\w\s]+): "(?P<quote>.*?)"'
    
    # Find all matches using the regular expression pattern
    matches = re.finditer(pattern, text)
    
    quotes_with_speaker = []
    for match in matches:
        speaker = match.group('speaker').strip()
        quote = match.group('quote').strip()
        quotes_with_speaker.append((speaker, quote))
    
    return quotes_with_speaker

# Test with your dataset
text = """
Justin Trudeau: "The people are revolting against carbon taxes."
Conservatives: "The planet burns!"
"""

quotes_with_speaker_regex = extract_quotes_with_speaker_regex(text)

# Output the speaker and quote pairs
for speaker, quote in quotes_with_speaker_regex:
    # Both values were already stripped inside the function
    print(f"Speaker: {speaker}, Quote: {quote}")


Speaker: Justin Trudeau, Quote: The people are revolting against carbon taxes.
Speaker: Conservatives, Quote: The planet burns!


#### Explanation of the code above
This version extends the earlier method, which extracted only the text inside quotation marks and ignored who said it. The pattern now uses two named groups: `speaker` matches a run of word characters and whitespace before a colon (for example, "Justin Trudeau"), and `quote` lazily matches the text between the following pair of double quotes. For each match, the function strips surrounding whitespace from both groups and stores them as a `(speaker, quote)` tuple, so the result records who said what. Because the speaker group admits only word characters and whitespace, names containing punctuation such as periods or hyphens are matched only in part.
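One further limitation is worth illustrating: `\s` also matches newlines, so the speaker group can reach back across a preceding line that contains no quote. The sketch below compares the pattern above with a line-anchored variant. `LINE_ANCHORED`, `extract_pairs`, and the sample text are hypothetical names and data introduced here for illustration, and the variant assumes one speaker and one quote per line.

```python
import re

# Pattern from the function above: \s in the speaker group also matches newlines
ORIGINAL = re.compile(r'(?P<speaker>[\w\s]+): "(?P<quote>.*?)"')

# Hypothetical variant: anchor at the start of a line and keep both groups on that line
LINE_ANCHORED = re.compile(
    r'^(?P<speaker>[^\n:"]+):\s*"(?P<quote>[^"\n]*)"',
    re.MULTILINE,
)

def extract_pairs(pattern, text):
    """Return (speaker, quote) tuples for every match of `pattern` in `text`."""
    return [
        (m.group("speaker").strip(), m.group("quote").strip())
        for m in pattern.finditer(text)
    ]

# Illustrative sample: the first line has no colon and no quote
sample = (
    "Reported on Monday\n"
    'Justin Trudeau: "The people are revolting against carbon taxes."\n'
    'Conservatives: "The planet burns!"\n'
)

original = extract_pairs(ORIGINAL, sample)
line_anchored = extract_pairs(LINE_ANCHORED, sample)

# The original pattern pulls the unrelated first line into the speaker
print(original[0][0].splitlines())
# The anchored variant keeps only the name on the quote's own line
print(line_anchored[0][0])
```

The anchored variant trades flexibility for precision: it will miss a quote that starts mid-line, so which pattern fits better depends on how the source text is laid out.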


## Method to capture indirect quotes

In [6]:
import os
import spacy

# Load English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")

def extract_indirect_quotes_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Process the text with spaCy
    doc = nlp(text)

    indirect_quotes = []

    # Iterate over the sentences in the document
    for sent in doc.sents:
        # Look for a verb of saying at the root of the sentence's dependency tree
        for token in sent:
            if token.dep_ == 'ROOT' and token.lemma_ in ['say', 'claim', 'state', 'declare', 'announce']:
                # The ROOT token's subtree spans the whole sentence,
                # so this keeps the full sentence text, not just the reported clause
                reported_speech = [tok.text_with_ws for tok in token.subtree]
                indirect_quote = ''.join(reported_speech).strip()
                indirect_quotes.append(indirect_quote)
                break  # Stop scanning this sentence once its root has matched

    return indirect_quotes

# Directory containing the text files
directory = "C:\\Users\\User\\SDA250Mywork\\A1_data"

# Iterate over each file in the directory
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        print(f"Extracting indirect quotes from {filename}:")
        indirect_quotes = extract_indirect_quotes_from_file(file_path)
        for quote in indirect_quotes:
            print(quote)
        print()


Extracting indirect quotes from 5c1452701e67d78e276ee126.txt:
A claim the city's debt is higher than city reports say it is could support a total of $135 million in proposed cuts to the previous administration's projects, including the $58 million Grandview Heights Community Centre and Library.
A claim the costs of building a SkyTrain extension to Langley are lower than TransLink says they are could make one of McCallum's two big-ticket promises, SkyTrain and a municipal police force, more attractive.
Some voters say he is already losing their trust.
But observers say there may be a method to McCallum’s messages.
So I think there’s an element of that kind of gamesmanship going on,” University of the Fraser Valley political scientist Hamish Telford said.
“Then there’s the fact that McCallum has been out of office for quite some time, thinking he knew the job, but some things have changed,” Telford said.
McCallum claimed in that speech the cost could be reduced if the construction was do

#### Explanation of the code above

This script extracts indirect quotes from every `.txt` file in the `A1_data` directory. It relies on spaCy, which tokenizes each text, splits it into sentences, and builds a dependency parse. The key function, `extract_indirect_quotes_from_file`, checks whether the root verb of each sentence is a verb of saying (`say`, `claim`, `state`, `declare`, or `announce`) and, if so, keeps that sentence. Because the subtree of the root covers the entire sentence, the result is the full sentence, speaker and reporting verb included, rather than the reported clause alone. Sentences that attribute a direct quote with "said" are returned as well, as the output above shows. The script then loops over the files, applies the function to each, and prints the matches, which avoids searching each file by hand.
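The file loop can also be separated from the extraction step so that results are collected rather than printed. The following is a minimal sketch under stated assumptions: `collect_quotes` and `naive_extract` are hypothetical helpers, and `naive_extract` is a crude regex stand-in for the spaCy-based function, used only so the example runs without a language model.

```python
import re
import tempfile
from pathlib import Path

# Stand-in extractor (assumption): a keyword check, not a dependency parse
SAYING = re.compile(r"\b(say|says|said|claim|claims|claimed)\b")

def naive_extract(text):
    """Return the lines of `text` that contain a reporting verb."""
    return [line.strip() for line in text.splitlines() if SAYING.search(line)]

def collect_quotes(directory, extractor):
    """Map each .txt filename in `directory` to the list its extractor returns."""
    results = {}
    for path in sorted(Path(directory).glob("*.txt")):
        results[path.name] = extractor(path.read_text(encoding="utf-8"))
    return results

# Demonstrate on a temporary directory with one .txt file and one file to skip
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text(
        "Some voters say he is losing their trust.\nThe weather was mild.\n",
        encoding="utf-8",
    )
    Path(tmp, "notes.md").write_text("Telford said nothing.\n", encoding="utf-8")
    results = collect_quotes(tmp, naive_extract)

print(results)
```

Passing the extractor as an argument means the spaCy-based logic could be supplied in place of `naive_extract`, provided it is adapted to accept text instead of a file path.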

