# Full Extraction Pipeline

This is a notebook where I'm updating the entire pipeline of the data extraction process from start to finish, with a single example. Functions/aspects of the process should be thoroughly tested and vetted before being added to this notebook. This notebook should be a "final" version of the pipeline, and should be able to be run from start to finish with no issues.

# Loading in HTML File(s)

Creates/finds a folder called "raw_htmls" and each file in that folder:
- loads the HTML file
- parses the HTML file
- cleans and extracts the text from the HTML file (gets rid of encoding artifacts, extra lines, etc)
- OPTIONAL -- creates a .txt file with the cleaned text, named according to the file number of the case
- creates a CSV with the raw_file_str and the file_number of the case
- puts this CSV (Pandas df) through the rest of the pipeline
- writes to a CSV the contents of the populated Pandas df once pipeline is complete

In [1]:
import pandas as pd
import os
from bs4 import BeautifulSoup

In [2]:
input_dir = "raw_html_files/"

files_dict = {}
files_dict['raw_file_str'] = []

for file in os.listdir(input_dir):
    try:
        if os.path.isfile(input_dir + file) and not file.startswith('.') and file.endswith('.html'): # will only work for non-system files that are .html files
            # print("Adding ", file, "...")
            with open(input_dir + file) as f:
                html = f.read()
            soup = BeautifulSoup(html, "html.parser")

            # find metadata
            document_meta = soup.find("div", {"id": "documentMeta"}) 
            meta_items = document_meta.find_all("div", {"class": "row py-1"})

            # "Metadata"
            case_ID = ""
            meta_data = []
            for meta_item in meta_items:
                children_text = []
                for x in meta_item.findChildren()[:2]:
                    children_text.append(x.text)
                child_string = '\t'.join(children_text)
                if "file number" in child_string.lower():
                    case_ID = child_string.split("\t")[1].strip()
                    # print(case_ID)
                meta_data.append(child_string)

            # "Content"
            document_body = soup.find("div", {"class": "documentcontent"}).get_text()

            # add to raw_files_dict{} to be put into dataframe later
            files_dict['raw_file_str'].append('Metadata:\n' +          # metadata marker
                                               '\n'.join(meta_data) +   # metadata text
                                               'Content:\n' +           # content marker
                                               document_body)           # content text
            
    except:
        print("Error with:", file)

data_df = pd.DataFrame(files_dict)
data_df

Unnamed: 0,raw_file_str
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...


# Step 1: General Cleaning
- General file cleaning

In [3]:
import re

def general_cleaning(raw_file_str: str):
    """
    Performs general cleaning on a raw file string.

    This function removes tabs, non-breaking spaces, leading/trailing whitespace, empty lines, 
    and "\xa0" characters. This function operates line-by-line for the input text and only keeps 
    non-empty lines after stripping.

    Parameters
    ----------
    raw_file_str : str
        The raw file content as a string, where different lines are separated by '\n'.

    Returns
    -------
    list
        A list of cleaned lines. Each element of the list is a cleaned string corresponding to a non-empty 
        line in the input string. Tabs and "\xa0" characters are replaced with spaces, leading/trailing 
        whitespaces are removed.

    Examples
    --------
    >>> general_cleaning("  First line \t \n \xa0 \nSecond line \n   Third line\t")
    ['First line', 'Second line', 'Third line']
    """

    # gets rid of tabs, non-breaking spaces, leading/trailing whitespace, removes empty lines, and "\xa0"
    generally_cleaned_list = [line.replace("\t", " ").replace("\xa0", "").strip() for line in raw_file_str.split('\n') if line.strip() != '']
    return generally_cleaned_list

def remove_whitespace_and_underscores(string):
    """
    Removes consecutive whitespace and more than three consecutive underscores from a given string.
    
    Parameters
    ----------
    string : str
        The input string to be processed.
        
    Returns
    -------
    str
        The processed string with consecutive whitespace and more than three consecutive underscores removed.
    
    Examples
    --------
    >>> remove_whitespace_and_underscores("Hello    world___")
    'Hello world'
    
    >>> remove_whitespace_and_underscores("   This    string_has___many____underscores  ")
    'This string_has_many_underscores'
    """
    # Remove consecutive whitespace
    string = re.sub(r'\s+', ' ', string)

    # Remove more than three consecutive underscores
    string = re.sub(r'_+', '', string)

    return string.strip()

# Step 2: Metadata + Content Separation

This is the Flan-T5 model trained to separate Content and Metadata but tbh the rule-based method worked perfectly and took about 0.00001x the time so I think we should use that instead

In [4]:
# import transformers
# # from transformers import AutoTokenizer
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

# model_name = "metadata_extractor_flant5_small"

# # folder where the model files are located -- unzip before running
# model_dir = f"/Users/kmaurinjones/Desktop/School/UBC/UBC_Coursework/capstone/Allard_A_Capstone/models/metadata_extractor/{model_name}"

# tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

In [5]:
# def extract_metadata_t5(raw_case_file_text: str, model, tokenizer = tokenizer):
#     """
#     Extracts metadata and content from a raw case file text using a T5-based model.

#     This function performs general cleaning on the raw case file text and then applies a Flant-T5-small model
#     to extract the metadata and content. The metadata is extracted using the "extract metadata boundary:"
#     prefix, and the content is obtained by removing the metadata from the cleaned text.

#     Parameters
#     ----------
#     raw_case_file_text : str
#         The raw case file text to be processed.
#     model : T5Model
#         The T5-based model to use for extraction. By default, it uses the pre-defined model.
#     tokenizer : T5Tokenizer, optional
#         The tokenizer associated with the T5-based model. By default, it uses the pre-defined tokenizer.

#     Returns
#     -------
#     tuple
#         A tuple containing two strings: the extracted metadata and the content.
    
#     Examples
#     --------
#     >>> raw_text = "metadata: Title: Example Case\nContent: This is the content of the case."
#     >>> extract_metadata_t5(raw_text)
#     ('Title: Example Case', 'This is the content of the case.')

#     >>> raw_text = "metadata: Author: John Doe\nContent: Some content."
#     >>> extract_metadata_t5(raw_text)
#     ('Author: John Doe', 'Some content.')
#     """

#     # do general case file cleaning
#     clean_file_list = general_cleaning(raw_case_file_text)
#     clean_file_str = " ".join([line for line in clean_file_list if ("metadata:" or "content:") not in line.lower()])

#     if "Browse myCanLII Save this case Set up citation alert Email this case" in clean_file_str:
#         clean_file_str = clean_file_str.replace("Browse myCanLII Save this case Set up citation alert Email this case", "").strip()

#     # run model on cleaned case file text
#     inputs = ["extract metadata boundary:" + clean_file_str] # PREFIX = "extract metadata boundary:"

#     inputs = tokenizer(inputs, max_length = 256, truncation = True, return_tensors = "pt")
#     output = model.generate(**inputs, num_beams = 8, do_sample = True, min_length = 1, max_length = 128)
#     decoded_output = tokenizer.batch_decode(output, skip_special_tokens = True)[0]

#     for to_delete in ["<", ">"]:
#         decoded_output = decoded_output.replace(to_delete, "")

#     metadata = decoded_output.strip()

#     # this is just for reformatting the first URL in the metadata -- really specific but seemed to be the only pitfall of the model
#     # this fixes the issue completely
#     pattern = r'https://[^,]*,'
#     matches = re.findall(pattern, metadata)
#     metadata = metadata.replace(matches[0], f"<{matches[0][:-1]}>,")

#     # differentially get the content
#     content = clean_file_str.replace(metadata, "").replace("Content:", "").strip()

#     full_file_cleaned = "Metadata: " + metadata + " " + "Content: " + content
    
#     return full_file_cleaned, metadata, content

In [6]:
# for row in data_df.index:
#     full_raw_text = data_df.loc[row, 'raw_file_str']

#     # full_file, case_metadata, case_content = extract_metadata_t5(
#     #     raw_case_file_text = full_raw_text,
#     #     model = model,
#     #     tokenizer = tokenizer)
    
#     full_file, case_metadata, case_content = separate_file_sections(full_raw_text)
    
#     data_df.loc[row, 'full_file'] = full_file
#     data_df.loc[row, 'metadata'] = case_metadata
#     data_df.loc[row, 'content'] = case_content

# data_df

Rule-based method for separating raw case file str into content and metadata, and also returns a cleaned version of the entire case file (metadata + content) in case we want to print that to a text file later on (it includes \n characters that we can later split the str by in order to print it to a human-readable list)

In [7]:
def separate_file_sections(text_with_newlines: str):
    metadata_list = []
    content_list = []

    is_metadata = True
    is_content = False

    cleaned_full_file = general_cleaning(text_with_newlines)

    for line in text_with_newlines.split("\n"):
        if line.strip() == 'Metadata:':
            is_metadata = True
            is_content = False
        elif line.strip() == 'Content:':
            is_metadata = False
            is_content = True
        elif is_metadata:
            metadata_list.append(remove_whitespace_and_underscores(line))
        elif is_content:
            content_list.append(remove_whitespace_and_underscores(line))

    return "\n".join(cleaned_full_file).strip(), " ".join(cleaned_full_file).strip(), " ".join(metadata_list).strip(), " ".join(content_list).strip()

In [8]:
for row in data_df.index:
    full_raw_text = data_df.loc[row, 'raw_file_str']

    # full_file, case_metadata, case_content = extract_metadata_t5(
    #     raw_case_file_text = full_raw_text,
    #     model = model,
    #     tokenizer = tokenizer)
    
    for_txt_file, full_file_str, case_metadata, case_content = separate_file_sections(full_raw_text)
    
    data_df.loc[row, 'cleaned_case_for_txt_file'] = for_txt_file
    data_df.loc[row, 'full_file'] = full_file_str
    data_df.loc[row, 'metadata'] = case_metadata
    data_df.loc[row, 'content'] = case_content

data_df

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...


# Step 3: File Number + Citation

In [9]:
import re

def get_case_citation(metadata_list):
    """
    Extracts the case citation from a list of metadata lines.

    This function searches through the metadata lines for a line containing "Citation:" or "Référence:"
    and extracts the citation information from that line.

    Parameters
    ----------
    metadata_list : list of str
        A list of metadata lines.

    Returns
    -------
    str or None
        The extracted case citation, or None if no citation is found.

    Examples
    --------
    >>> metadata = ["Title: Example Case", "Citation: ABC123 (LTB)"]
    >>> get_case_citation(metadata)
    'ABC123 (LTB)'

    >>> metadata = ["Title: Another Case", "Référence: XYZ789 (LTB)"]
    >>> get_case_citation(metadata)
    'XYZ789 (LTB)'
    """
    if isinstance(metadata_list, str):
        metadata_list = metadata_list.split("\n")

    for line in metadata_list:
        if "Citation:" in line:
            citation_start = line.find("Citation: ")
            citation_end = line.find("LTB)") + 4
            return line[citation_start:citation_end].replace("Citation: ", "").strip()
        elif "Référence: " in line:
            citation_start = line.find("Référence: ")
            citation_end = line.find("LTB)") + 4
            return line[citation_start:citation_end].replace("Référence: ", "").strip()
    return None

def get_file_number(metadata_list):
    """
    Extracts the file number from a list of metadata lines.

    This function concatenates the metadata lines into a single string and extracts the file number
    from that string. The file number is obtained either after "File number:" or "Numéro de dossier:".

    Parameters
    ----------
    metadata_list : list of str
        A list of metadata lines.

    Returns
    -------
    str or None
        The extracted file number, or None if no file number is found.

    Examples
    --------
    >>> metadata = ["File number: TNL-10001-18", "Citation: ABC123 (LTB)"]
    >>> get_file_number(metadata)
    'TNL-10001-18'

    >>> metadata = ["Numéro de dossier: XYZ789", "Référence: DEF456 (LTB)"]
    >>> get_file_number(metadata)
    'XYZ789'
    """
    if isinstance(metadata_list, list):
        metadata_str = " ".join(metadata_list)
    else:
        metadata_str = metadata_list

    if "Citation: " in metadata_str:
        file_nums = metadata_str[metadata_str.find("File number: ") + len("File number: ") : metadata_str.find("Citation:")].strip()
    elif "Référence: " in metadata_str:
        file_nums = metadata_str[metadata_str.find("Numéro de dossier: ") + len("Numéro de dossier: ") : metadata_str.find("Référence")].strip()

    if len(file_nums) == 0:
        return None

    file_nums = file_nums.replace(";", " ")

    file_num = list(set(file_nums.split()))
    file_num = ";".join(file_num)
    file_num = re.sub(r'[^\w\s]$', '', file_num)

    if ";" in file_num:
        file_num = list(set(file_num.split(";")))
        file_num = [re.sub(r'[\(\)]', '', num) for num in file_num]
        file_num = ";".join(file_num)

    file_num = re.sub(r'[\(\)]', '', file_num)

    return file_num

In [10]:
for row in data_df.index:
    data_df.loc[row, 'citation'] = get_case_citation(data_df.loc[row, 'metadata'])
    data_df.loc[row, 'file_number'] = get_file_number(data_df.loc[row, 'metadata'])

data_df

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17


# Step 4: Detect Language
- not necessary for anything in the pipeline, just a fun extra point of data

In [11]:
# !pip install langdetect
from langdetect import detect

def is_mostly_french(text, threshold):
    try:
        detected_language = detect(text)
        if detected_language == 'fr':
            return True
        else:
            return False
    except:
        return False

def is_french(text, threshold):
    try:
        detected_language = detect(text)
        if detected_language == 'fr':
            return True
        language_probabilities = detect_langs(text)
        for lang in language_probabilities:
            if lang.lang == 'fr' and lang.prob > threshold:
                return True
        return False
    except:
        return False

In [12]:
for row in data_df.itertuples():

    # adding to 'language' column
    if is_french(data_df.loc[row.Index, "raw_file_str"], 0.7) == True:
        data_df.at[row.Index, 'language'] = "French"
    else:
        data_df.at[row.Index, 'language'] = "English"

data_df

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number,language
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18,English
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17,English
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM,English
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17,English


# Step 5: Year
- also not necessary for anything in the pipeline, just another datapoint for the corpus

In [13]:
year_pattern = r"\b(\d{4})\b"

for row in data_df.itertuples():

    year_match = re.search(year_pattern, data_df.loc[row.Index, "metadata"])
    if year_match:
        year = year_match.group(1)
        data_df.loc[row.Index, "year"] = year
    else:
        data_df.loc[row.Index, "year"] = "year not found"

data_df

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number,language,year
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18,English,2018
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17,English,2017
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM,English,2017
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17,English,2017


# Step 6: LTB Location
- there are a few different methods and things to try, so I run them in succession to make sure to capture SOMETHING

In [14]:
def find_all_positions(text, keyword):
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

# backup for other methods, using " ON " + postal code as a marker
def find_postal_code_index(text):
    match = re.search(r' ON [A-Z]\d[A-Z] ?\d[A-Z]\d', text)
    if match:
        return match.start()
    else:
        return None

def extract_loc_rule(text_list):

    if isinstance(text_list, list):
        content_str = " ".join(text_list)
    else:
        content_str = text_list
    # print(content_str)

    # "hear" is a good marker for the location sentence ("heard in/on", "hearing", etc)
    if "hear" in content_str:
        hear_inds = find_all_positions(content_str, "hear") # list of indices where "hear" appears in string
        # print(hear_inds)

        for hear_ind in hear_inds:

            hear_substr = content_str[hear_ind - 50 : hear_ind + 50]
            possible_sentences = hear_substr.split(". ")
            # print(possible_sentences)
            
            hear_sent = [sent for i, sent in enumerate(possible_sentences) if "hear" in sent][0]
            # print(hear_sent)

            if len(hear_sent.split(" on ")) != 2:
                # print("TEST")
                # return None
                pass # go to next "hear" location in string
            else:
                
                location_sent, date = hear_sent.split(" on ") # should only split into 2 parts
                # print(location_sent)

                if " in " in location_sent:

                    # print(location_sent)
                    location = location_sent.split(" in ")[1].strip() # location name should be last tokens of string after token containing " in " (city name could be multiple tokens, so need to get all tokens after " in " token, not just last one)

                    # print(location)
                    location = location.split(" ")[-1] # location name should be last tokens of string after token containing "hear" (city name could be multiple tokens, so need to get all tokens after "hear" token, not just last one)
                    # print("YESSS")
                    return location.strip()
                
                else:
                    pass

    # use " ON " + postal code as a marker
    if " ON " in content_str:
        # print("|UESSS")
        # print(find_postal_code_index(content_str))
        postal_ind = find_postal_code_index(content_str)
        content_str_subsection = (content_str[postal_ind - 50 : postal_ind + 50])
        location = content_str_subsection.split(" ON ")[0].split()[-1]
        return location.strip()
    
    # if absolutely nothing works, return None and we'll use a transformer or something more nuanced
    return None

# this is a list that was made from all ltb locations identified in the annotated data
all_locations = [
    'Sudbury',
    # 'Review completed without hearing',
    'Thunder Bay',
    'Ottawa',
    'Woodstock',
    'Orangeville',
    'Toronto',
    'Burlington',
    'Waterloo',
    'Stratford',
    # 'Not stated, hearing in Windsor',
    'London',
    'Lindsay',
    'Hamilton',
    'Peterborough',
    'Windsor',
    'Brantford',
    'Cobourg',
    'Kingston',
    'Belleville',
    'Bracebridge',
    'Barrie',
    'Newmarket',
    'Mississauga',
    'Not stated',
    'by telephone',
    'Whitby'
    ]

import spacy

# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

def extract_location_spacy(string_list, model = nlp, other_locations = all_locations):

    if isinstance(string_list, list):
        content_str = " ".join(string_list)
    else:
        content_str = string_list

    content_str = " ".join(string_list)

    # uses a spacy model + its vocabulary to extract and return the location if possible
    doc = model(content_str)
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            return ent.text

    # otherwise looks through the list of all locations in the annotated data and returns the first one that appears in the string -- for example, "Hamilton"
    for tok in content_str.split():
        if tok in all_locations:
            return tok

    # if all else fails, return None -- use a transformer or something later idk
    return None

# all_locations

2023-06-06 18:31:52.875402: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [15]:
for row in data_df.itertuples():

    ### rule-based extremely quick -- pretty effective but imperfect
    location = extract_loc_rule(data_df.loc[row.Index, 'content'])#.title() # returns the string in title case

    if not location[0].isalnum(): # something like "[CITY]" -- use spacy method
        location = extract_location_spacy(data_df.loc[row.Index, 'content'])

    if not location[0].isupper():
        # I know this isn't a great rule in general but this seems to be consistent/reliable across all cases.
        # City names are all capitalized. Otherwise it finds "it", "heard", and more as locations
        location = extract_location_spacy(data_df.loc[row.Index, 'content'])

    if location == None: # rule-based returns None
        location = extract_location_spacy(data_df.loc[row.Index, 'content'])

    # use the found location
    data_df.at[row.Index, 'ltb_location'] = location.title() # Title casing for consistency

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number,language,year,ltb_location
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18,English,2018,F
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17,English,2017,Whitby
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM,English,2017,Toronto
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17,English,2017,Lindsay


# Step 7: Hearing Date
- this is in progress -- current system has 74% accuracy and is a rule-based, statistical system
- ML might work better but something needs to be trained for it

# Step 8: Decision Date
- also in progress, but performs better than hearing date extraction

# Step 9: Case URL
- also not necessary for any other part of the pipeline. good for corpus

In [16]:
import re

def get_url_from_citation_string(text: str):
    """
    Returns URL to case file given a list of strings of metadata from a case file.
    String must begin with "Citation: " and URL must be within angle brackets.

    Parameters
    ----------
    text : str
        A string of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    pattern = r"<(.*?)>"
    matches = re.findall(pattern, text)
    return matches[0]

def get_url_from_metadata(case_metadata: list):
    """
    Extract URL to case file from a list of strings of metadata from a case file.

    Parameters
    ----------
    case_metadata : list
        A list of strings of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    if isinstance(case_metadata, str):
        case_metadata = case_metadata.split("\n")

    for line in case_metadata:
        if ("Citation:" or "Référence:") in line:
            return get_url_from_citation_string(line)
        
    return None

In [17]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'url'] = get_url_from_metadata(data_df.loc[row.Index, 'metadata'])
    except Exception as any_error:
        data_df.at[row.Index, 'url'] = "URL NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number,language,year,ltb_location,url
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18,English,2018,F,https://canlii.ca/t/hv7qd
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17,English,2017,Whitby,https://canlii.ca/t/h5z39
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto,https://canlii.ca/t/h539n
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM,English,2017,Toronto,https://canlii.ca/t/h5z3r
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17,English,2017,Lindsay,https://canlii.ca/t/h5z3s


# Step 10: Adjudicating Member

In [18]:
import re

def get_adj_member(list_of_strings: list):

    if isinstance(list_of_strings, str):
        list_of_strings = list_of_strings.split("\n")

    text = " ".join(list_of_strings)
    pattern = r"Date Issued(.*?)Member"
    matches = re.findall(pattern, text, re.DOTALL)
    # extracted_text = [match.strip() for match in matches]
    extracted_text = list(set(match.strip() for match in matches))
    if len(extracted_text) > 0:
        return ", ".join(extracted_text) # returns a list of matches and sometimes there's more than one match so we just take the first one -- there are never two
    elif "date issued" in text.lower():
        DI_inds = find_all_positions(text.lower(), "date issued")
        for DI_ind in DI_inds:
            DI_substr = text[DI_ind - 50 : DI_ind + 50].lower()
            if len(DI_substr.split("date issued")) != 2:
                # should only contain 2
                pass

            DI_sent = DI_substr.split("date issued")[1].strip()
            if ", " in DI_sent: # there should be a comma be just in case, it doesn't hurt to have this (and this to try the "hear" method after iterating over all of these if none work)
                DI_sent = DI_sent.split(", ")[0]

            if "member" in DI_sent:
                DI_sent = DI_sent.replace("member", "")

            return DI_sent.title().strip()

In [19]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'adjudicating_member'] = get_adj_member(data_df.loc[row.Index, 'content']).replace("Vice Chair", "").replace("Vice-Chair", "").strip()
    except Exception as any_error:
        data_df.at[row.Index, 'adjudicating_member'] = "MEMBER NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number,language,year,ltb_location,url,adjudicating_member
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18,English,2018,F,https://canlii.ca/t/hv7qd,Kevin Lundy
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17,English,2017,Whitby,https://canlii.ca/t/h5z39,Ruth Carey
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto,https://canlii.ca/t/h539n,Laura Hartslief
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM,English,2017,Toronto,https://canlii.ca/t/h5z3r,Shelby Whittick
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17,English,2017,Lindsay,https://canlii.ca/t/h5z3s,Laura Hartslief


# Step 11: Extract Case Outcome Span
- still need to classify case outcome span

In [20]:
import re
import itertools

def find_all_positions(text: str, keyword: str):
    """
    Finds all positions of a keyword in a given text.

    This function searches for a keyword in a given text and returns a list of positions where the keyword is found.

    Parameters
    ----------
    text : str
        The text to search within.
    keyword : str
        The keyword to find in the text.

    Returns
    -------
    list
        A list of integers representing the positions of the keyword in the text.

    Examples
    --------
    >>> find_all_positions("This is an example sentence.", "example")
    [11]
    """
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

def get_outcome_span(text: str, return_truncated: bool = True):
    """
    Extracts the outcome span from a given text using different methods.

    This function extracts the outcome span from a given text using multiple methods. It first attempts to find
    the span between occurrences of the phrases "accordance with" and "ordered". If that method fails, it then
    tries to find the span after the phrase "it is ordered". If that also fails, it looks for the span after the
    phrase "find". The function returns the extracted outcome span as a cleaned string.

    Parameters
    ----------
    text : str
        The text from which to extract the outcome span.

    Returns
    -------
    str or None
        The extracted outcome span as a cleaned string, or None if no span is found.

    Examples
    --------
    >>> get_outcome_span(unstructured_case_file)
    "In accordance with the order, it is ordered that the defendant pays a fine."
    """

    ############### FIRST METHOD ################

    for keyword in ['in accordance with', 'grant', 'relief', 'fair']: # these all seem common but none seem to exist in 100% of cases

        if keyword in text:

            # find all occurrences of 'in accordance with' and 'ordered'
            accordance_with_indices = [m.end() for m in re.finditer(keyword, text)]
            ordered_indices = [m.start() for m in re.finditer("ordered", text)]

            # generate all possible pairs of indices
            index_pairs = list(itertools.product(accordance_with_indices, ordered_indices))

            # filter pairs where 'accordance with' index is less than 'ordered' index
            index_pairs = [(i, j) for (i, j) in index_pairs if i < j]
            if index_pairs:
                # find the pair with the shortest distance between indices
                min_distance_pair = min(index_pairs, key = lambda x: x[1] - x[0])
                try:
                    best_subset = text[min_distance_pair[0] - 300 : min_distance_pair[1] + 400].strip()
                except IndexError:
                    best_subset = text[min_distance_pair[0] - 600 : min_distance_pair[1]].strip()

                best_subset = best_subset.split(". ")

                if not best_subset:
                    continue # to next match of all matches of the keyword

                sent_id = [idx for idx, i in enumerate(best_subset) if keyword in i.lower()][0]

                clean_outcome = best_subset[sent_id]

                # return JUST the (presumably) most relevant outcome span (after cleaning it up a bit)
                if return_truncated:
                    clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                    clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                    if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                        clean_outcome = clean_outcome.split(")")[1].strip()
                    return clean_outcome

                # return all case file text until the end of the outcome span
                else:
                    return text[: text.find(clean_outcome) + len(clean_outcome)].strip()

    ################ SECOND METHOD ################

    keyword = "it is ordered"
    if keyword in text.lower():
        matches = find_all_positions(text.lower(), keyword)

        for match in matches:
            try: # match + 400 chars
                clean_outcome = ". ".join(text[match - 400 : match + 400].split(". ")[1:-1]) 
            except IndexError: # match idx until end of string (+ 400 is sometimes out of range)
                clean_outcome = ". ".join(text[match - 600 :].split(". ")[1:-1])

            # return None
            # print("METHOD 2")
            if not clean_outcome:
                continue # to next match of all matches of the keyword

            if return_truncated:
                clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                    clean_outcome = clean_outcome.split(")")[1].strip()
                return clean_outcome

            # return all case file text until the end of the outcome span
            else:
                return text[: text.find(clean_outcome) + len(clean_outcome)].strip()

    ############### THIRD METHOD ################

    keyword = " find " # spaces to prevent "finding" or other derivations from being included -- specifically looking for statements like "I find that..."
    if keyword in text.lower():
        matches = find_all_positions(text.lower(), keyword)
        for match in matches:

            try: # match + 400 chars
                clean_outcome = ". ".join(text[match - 400 : match + 400].split(". ")[1:-1]) 
            except IndexError: # match idx until end of string (+ 400 is sometimes out of range)
                clean_outcome = ". ".join(text[match - 600 :].split(". ")[1:-1])

            if not clean_outcome:
                continue # to next match of all matches of the keyword
            
            if return_truncated:
                clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                    clean_outcome = clean_outcome.split(")")[1].strip()
                return clean_outcome
            else:
                return text[: text.find(clean_outcome) + len(clean_outcome)].strip()

    # if absolutely nothing works, return none and try Longformer or something idk
    return None

In [21]:
for row in data_df.itertuples():

    try:
        pass
        content_str = data_df.at[row.Index, 'content']
        data_df.at[row.Index, 'outcome_span'] = get_outcome_span(content_str, return_truncated = True)
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

print(data_df.shape)
data_df.head()

(5, 13)


Unnamed: 0,raw_file_str,cleaned_case_for_txt_file,full_file,metadata,content,citation,file_number,language,year,ltb_location,url,adjudicating_member,outcome_span
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Metadata:\nDate: 2018-07-06\nFile number:\nSWL...,Metadata: Date: 2018-07-06 File number: SWL-17...,Date: 2018-07-06 File number: SWL-17348-18 Ci...,Order under Section 69 Residential Tenancies A...,"SWL-17348-18 (Re), 2018 CanLII 88643 (ON LTB)",SWL-17348-18,English,2018,F,https://canlii.ca/t/hv7qd,Kevin Lundy,I have considered all of the disclosed circums...
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Metadata:\nDate: 2017-07-05\nFile number:\nTEL...,Metadata: Date: 2017-07-05 File number: TEL-80...,Date: 2017-07-05 File number: TEL-80773-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-80773-17 (Re), 2017 CanLII 60498 (ON LTB)",TEL-80773-17,English,2017,Whitby,https://canlii.ca/t/h5z39,Ruth Carey,I have considered all of the disclosed circums...
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Metadata:\nDate: 2017-05-26\nFile number:\nTEL...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto,https://canlii.ca/t/h539n,Laura Hartslief,I have considered all of the disclosed circums...
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Metadata:\nDate: 2017-07-18\nFile number:\nTEL...,Metadata: Date: 2017-07-18 File number: TEL-81...,Date: 2017-07-18 File number: TEL-81359-17-AM ...,Order under Section 69 Residential Tenancies A...,"TEL-81359-17-AM (Re), 2017 CanLII 60052 (ON LTB)",TEL-81359-17-AM,English,2017,Toronto,https://canlii.ca/t/h5z3r,Shelby Whittick,I have considered all of the disclosed circums...
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Metadata:\nDate: 2017-08-17\nFile number:\nTEL...,Metadata: Date: 2017-08-17 File number: TEL-81...,Date: 2017-08-17 File number: TEL-81405-17 Ci...,Order under Section 69 Residential Tenancies A...,"TEL-81405-17 (Re), 2017 CanLII 60203 (ON LTB)",TEL-81405-17,English,2017,Lindsay,https://canlii.ca/t/h5z3s,Laura Hartslief,I have considered all of the disclosed circums...


In [24]:
for row in data_df['outcome_span']:
    print (row)
    print()

I have considered all of the disclosed circumstances in accordance with subsection 83(2) of the Act, including the request of the Landlord’s Legal Representative, and find that it would not be unfair to grant relief from eviction subject to the condition(s) set out in this order pursuant to subsection 83(1)(a) and 204(1) of the Act

I have considered all of the disclosed circumstances in accordance with subsection 83(2) of the Residential Tenancies Act, 2006 (the 'Act'), and find that it would be unfair to grant relief from eviction pursuant to subsection 83(1) of the Act

I have considered all of the disclosed circumstances in accordance with subsection 83(2) of the Residential Tenancies Act, 2006 (the 'Act'), and find that it would be unfair to grant relief from eviction pursuant to subsection 83(1) of the Act

I have considered all of the disclosed circumstances in accordance with subsection 83(2) of the Residential Tenancies Act, 2006 (the 'Act'), and find that it would not be unfa