# Full Extraction Pipeline

This notebook runs the entire data extraction pipeline from start to finish on a single example. Functions and other parts of the process should be thoroughly tested and vetted before being added here. The notebook is meant to be the "final" version of the pipeline and should run from start to finish with no issues.

# Loading in HTML File(s)

Creates/finds a folder called "raw_html_files" and, for each file in that folder:
- loads the HTML file
- parses the HTML file
- cleans and extracts the text from the HTML file (gets rid of encoding artifacts, extra lines, etc)
- OPTIONAL -- creates a .txt file with the cleaned text, named according to the file number of the case
- creates a CSV with the raw_file_str and the file_number of the case
- puts this CSV (Pandas df) through the rest of the pipeline
- writes to a CSV the contents of the populated Pandas df once pipeline is complete
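
The optional .txt step and the CSV step in the list above could be sketched roughly as follows. This is only an illustration: `write_case_outputs`, the shape of `example_cases`, and the `cases.csv` file name are assumptions, not part of the pipeline code in this notebook.

```python
import csv
import tempfile
from pathlib import Path

def write_case_outputs(cases, out_dir):
    """Write one <file_number>.txt per case plus a summary CSV (hypothetical helper).

    `cases` is assumed to be a list of dicts with the keys
    'file_number' and 'raw_file_str'.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    for case in cases:
        # replace path separators defensively so the file number is a safe file name
        safe_name = case["file_number"].replace("/", "_")
        (out_path / f"{safe_name}.txt").write_text(case["raw_file_str"], encoding="utf-8")

    csv_path = out_path / "cases.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_number", "raw_file_str"])
        writer.writeheader()
        writer.writerows(cases)
    return csv_path

example_cases = [
    {"file_number": "SWL-57718-22",
     "raw_file_str": "Metadata:\nDate:\t2022-02-24\nContent:\nOrder under Section 77"},
]

# try it out in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_case_outputs(example_cases, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
```

The `csv` module quotes fields that contain newlines, so the multi-line `raw_file_str` survives a round trip through the CSV.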

In [1]:
import pandas as pd
import os
from bs4 import BeautifulSoup

In [2]:
input_dir = "raw_html_files/"

files_dict = {}
files_dict['raw_file_str'] = []

for file in os.listdir(input_dir):
    file_path = os.path.join(input_dir, file)
    try:
        # only process non-hidden .html files
        if os.path.isfile(file_path) and not file.startswith('.') and file.endswith('.html'):
            # print("Adding ", file, "...")
            with open(file_path) as f:
                html = f.read()
            soup = BeautifulSoup(html, "html.parser")

            # find metadata
            document_meta = soup.find("div", {"id": "documentMeta"})
            meta_items = document_meta.find_all("div", {"class": "row py-1"})

            # "Metadata"
            case_ID = ""
            meta_data = []
            for meta_item in meta_items:
                children_text = []
                for x in meta_item.findChildren()[:2]:
                    children_text.append(x.text)
                child_string = '\t'.join(children_text)
                if "file number" in child_string.lower():
                    case_ID = child_string.split("\t")[1].strip()
                    # print(case_ID)
                meta_data.append(child_string)

            # "Content"
            document_body = soup.find("div", {"class": "documentcontent"}).get_text()

            # add to files_dict to be put into dataframe later
            files_dict['raw_file_str'].append('Metadata:\n' +          # metadata marker
                                               '\n'.join(meta_data) +   # metadata text
                                               '\nContent:\n' +         # content marker, on its own line
                                               document_body)           # content text

    except Exception as e:
        # report the failing file and the reason, then continue with the next one
        print("Error with:", file, "-", e)

data_df = pd.DataFrame(files_dict)
data_df

Unnamed: 0,raw_file_str
0,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...
1,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...
2,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...
3,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...
4,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...


In [4]:
import glob

# Get a list of all .txt files ending with -22.txt in the 2022/ directory
file_list = glob.glob("2022/*-22.txt")

all_data = {}
all_data['raw_file_str'] = []

# Iterate over each file and read its contents
for file_path in file_list:
    with open(file_path, "r") as file:
        contents = file.read()
        all_data['raw_file_str'].append(contents)
        # print(f"Contents of {file_path}:\n{contents}\n")

len(all_data)

1

In [6]:
data_df = pd.DataFrame(all_data)
data_df

Unnamed: 0,raw_file_str
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...
...,...
527,Metadata:\nDate:\t2022-02-07\nFile number:\t\n...
528,Metadata:\nDate:\t2022-01-21\nFile number:\t\n...
529,Metadata:\nDate:\t2022-02-16\nFile number:\t\n...
530,Metadata:\nDate:\t2022-02-09\nFile number:\t\n...


# Step 1: General Cleaning
- General file cleaning

In [7]:
import re

def general_cleaning(raw_file_str: str):
    r"""
    Performs general cleaning on a raw file string.

    This function replaces tabs and non-breaking spaces ("\xa0") with regular spaces, strips
    leading/trailing whitespace, and drops empty lines. It operates line-by-line on the input
    text and only keeps lines that are non-empty after stripping.

    Parameters
    ----------
    raw_file_str : str
        The raw file content as a string, where different lines are separated by '\n'.

    Returns
    -------
    list
        A list of cleaned lines. Each element of the list is a cleaned string corresponding to a non-empty 
        line in the input string. Tabs and "\xa0" characters are replaced with spaces, and leading/trailing 
        whitespace is removed.

    Examples
    --------
    >>> general_cleaning("  First line \t \n \xa0 \nSecond line \n   Third line\t")
    ['First line', 'Second line', 'Third line']
    """

    # replace tabs and non-breaking spaces ("\xa0") with spaces, strip each line, and drop empty lines
    generally_cleaned_list = [line.replace("\t", " ").replace("\xa0", " ").strip() for line in raw_file_str.split('\n') if line.strip() != '']
    return generally_cleaned_list

def remove_whitespace_and_underscores(string):
    """
    Collapses consecutive whitespace into a single space and removes all underscores from a given string.
    
    Parameters
    ----------
    string : str
        The input string to be processed.
        
    Returns
    -------
    str
        The processed string with runs of whitespace collapsed to one space, all underscores removed,
        and leading/trailing whitespace stripped.
    
    Examples
    --------
    >>> remove_whitespace_and_underscores("Hello    world___")
    'Hello world'
    
    >>> remove_whitespace_and_underscores("   This    string_has___many____underscores  ")
    'This stringhasmanyunderscores'
    """
    # Collapse consecutive whitespace into a single space
    string = re.sub(r'\s+', ' ', string)

    # Remove every run of underscores
    string = re.sub(r'_+', '', string)

    return string.strip()

# Step 2: Metadata + Content Separation

This is the Flan-T5 model trained to separate Content and Metadata. The rule-based method worked perfectly and took about 0.00001x the time, so I think we should use that instead.

In [8]:
# import transformers
# # from transformers import AutoTokenizer
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

# model_name = "metadata_extractor_flant5_small"

# # folder where the model files are located -- unzip before running
# model_dir = f"/Users/kmaurinjones/Desktop/School/UBC/UBC_Coursework/capstone/Allard_A_Capstone/models/metadata_extractor/{model_name}"

# tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

In [9]:
# def extract_metadata_t5(raw_case_file_text: str, model, tokenizer = tokenizer):
#     """
#     Extracts metadata and content from a raw case file text using a T5-based model.

#     This function performs general cleaning on the raw case file text and then applies a Flan-T5-small model
#     to extract the metadata and content. The metadata is extracted using the "extract metadata boundary:"
#     prefix, and the content is obtained by removing the metadata from the cleaned text.

#     Parameters
#     ----------
#     raw_case_file_text : str
#         The raw case file text to be processed.
#     model : T5Model
#         The T5-based model to use for extraction. By default, it uses the pre-defined model.
#     tokenizer : T5Tokenizer, optional
#         The tokenizer associated with the T5-based model. By default, it uses the pre-defined tokenizer.

#     Returns
#     -------
#     tuple
#         A tuple containing two strings: the extracted metadata and the content.
    
#     Examples
#     --------
#     >>> raw_text = "metadata: Title: Example Case\nContent: This is the content of the case."
#     >>> extract_metadata_t5(raw_text)
#     ('Title: Example Case', 'This is the content of the case.')

#     >>> raw_text = "metadata: Author: John Doe\nContent: Some content."
#     >>> extract_metadata_t5(raw_text)
#     ('Author: John Doe', 'Some content.')
#     """

#     # do general case file cleaning
#     clean_file_list = general_cleaning(raw_case_file_text)
#     clean_file_str = " ".join([line for line in clean_file_list if "metadata:" not in line.lower() and "content:" not in line.lower()])

#     if "Browse myCanLII Save this case Set up citation alert Email this case" in clean_file_str:
#         clean_file_str = clean_file_str.replace("Browse myCanLII Save this case Set up citation alert Email this case", "").strip()

#     # run model on cleaned case file text
#     inputs = ["extract metadata boundary:" + clean_file_str] # PREFIX = "extract metadata boundary:"

#     inputs = tokenizer(inputs, max_length = 256, truncation = True, return_tensors = "pt")
#     output = model.generate(**inputs, num_beams = 8, do_sample = True, min_length = 1, max_length = 128)
#     decoded_output = tokenizer.batch_decode(output, skip_special_tokens = True)[0]

#     for to_delete in ["<", ">"]:
#         decoded_output = decoded_output.replace(to_delete, "")

#     metadata = decoded_output.strip()

#     # this is just for reformatting the first URL in the metadata -- really specific but seemed to be the only pitfall of the model
#     # this fixes the issue completely
#     pattern = r'https://[^,]*,'
#     matches = re.findall(pattern, metadata)
#     metadata = metadata.replace(matches[0], f"<{matches[0][:-1]}>,")

#     # differentially get the content
#     content = clean_file_str.replace(metadata, "").replace("Content:", "").strip()

#     full_file_cleaned = "Metadata: " + metadata + " " + "Content: " + content
    
#     return full_file_cleaned, metadata, content

In [10]:
# for row in data_df.index:
#     full_raw_text = data_df.loc[row, 'raw_file_str']

#     # full_file, case_metadata, case_content = extract_metadata_t5(
#     #     raw_case_file_text = full_raw_text,
#     #     model = model,
#     #     tokenizer = tokenizer)
    
#     full_file, case_metadata, case_content = separate_file_sections(full_raw_text)
    
#     data_df.loc[row, 'full_file'] = full_file
#     data_df.loc[row, 'metadata'] = case_metadata
#     data_df.loc[row, 'content'] = case_content

# data_df

Rule-based method for separating the raw case file string into content and metadata. It also returns a cleaned version of the entire case file (metadata + content) in case we want to write it to a text file later on. That version keeps its \n characters, so it can be split into a human-readable list of lines.
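
As a minimal, standalone illustration of the marker-based idea described above: a line equal to `Metadata:` or `Content:` switches the active section, and every other non-empty line is appended to whichever section is active. `split_on_markers` is a hypothetical simplified stand-in, not the notebook's own function, and it skips the underscore removal and the full-file outputs.

```python
def split_on_markers(raw_text):
    """Split a raw case string into (metadata, content) using marker lines."""
    sections = {"Metadata:": [], "Content:": []}
    current = "Metadata:"  # lines before any marker are treated as metadata

    for line in raw_text.split("\n"):
        stripped = line.strip()
        if stripped in sections:
            # a marker line switches the active section and is not kept
            current = stripped
        elif stripped:
            # collapse tabs and repeated spaces inside the line
            sections[current].append(" ".join(stripped.split()))

    return " ".join(sections["Metadata:"]), " ".join(sections["Content:"])

example = (
    "Metadata:\nDate:\t2022-02-24\nFile number:\t\nSWL-57718-22\n"
    "Content:\nOrder under Section 77\nResidential Tenancies\nAct, 2006"
)
metadata, content = split_on_markers(example)
```

Because the split relies on exact marker lines, a `Content:` marker that is fused onto the end of another line would not be detected, and everything after it would be treated as metadata.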

In [11]:
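def separate_file_sections(text_with_newlines: str):
    """
    Splits a raw case file string into metadata and content using the 'Metadata:' and 'Content:' marker lines.

    Returns a tuple of four strings: the cleaned full file joined with newlines (for writing to a .txt file),
    the cleaned full file joined with spaces, the metadata, and the content.
    """
    metadata_list = []
    content_list = []

    # lines before any marker are treated as metadata
    is_metadata = True
    is_content = False

    cleaned_full_file = general_cleaning(text_with_newlines)

    for line in text_with_newlines.split("\n"):
        if line.strip() == 'Metadata:':
            is_metadata = True
            is_content = False
        elif line.strip() == 'Content:':
            is_metadata = False
            is_content = True
        elif is_metadata:
            metadata_list.append(remove_whitespace_and_underscores(line))
        elif is_content:
            content_list.append(remove_whitespace_and_underscores(line))

    return "\n".join(cleaned_full_file).strip(), " ".join(cleaned_full_file).strip(), " ".join(metadata_list).strip(), " ".join(content_list).strip()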

In [12]:
for row in data_df.index:
    full_raw_text = data_df.loc[row, 'raw_file_str']

    # full_file, case_metadata, case_content = extract_metadata_t5(
    #     raw_case_file_text = full_raw_text,
    #     model = model,
    #     tokenizer = tokenizer)
    
    for_txt_file, full_file_str, case_metadata, case_content = separate_file_sections(full_raw_text)
    
    data_df.loc[row, 'cleaned_case_with_newlines'] = for_txt_file
    data_df.loc[row, 'full_file'] = full_file_str
    data_df.loc[row, 'metadata'] = case_metadata
    data_df.loc[row, 'content'] = case_content

data_df

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...
...,...,...,...,...,...
527,Metadata:\nDate:\t2022-02-07\nFile number:\t\n...,Metadata:\nDate: 2022-02-07\nFile number:\nSWL...,Metadata: Date: 2022-02-07 File number: SWL-57...,Date: 2022-02-07 File number: SWL-57644-22 Ci...,Order under Section 78(6) Residential Tenancie...
528,Metadata:\nDate:\t2022-01-21\nFile number:\t\n...,Metadata:\nDate: 2022-01-21\nFile number:\nHOL...,Metadata: Date: 2022-01-21 File number: HOL-12...,Date: 2022-01-21 File number: HOL-12856-22 Ci...,Order under Section 77 Residential Tenancies A...
529,Metadata:\nDate:\t2022-02-16\nFile number:\t\n...,Metadata:\nDate: 2022-02-16\nFile number:\nSOL...,Metadata: Date: 2022-02-16 File number: SOL-26...,Date: 2022-02-16 File number: SOL-26900-22 Ci...,Order under Section 77 Residential Tenancies A...
530,Metadata:\nDate:\t2022-02-09\nFile number:\t\n...,Metadata:\nDate: 2022-02-09\nFile number:\nCEL...,Metadata: Date: 2022-02-09 File number: CEL-04...,Date: 2022-02-09 File number: CEL-04532-22 Ci...,Order under Section 77 Residential Tenancies A...


In [13]:
print(data_df.loc[0, 'cleaned_case_with_newlines'])

Metadata:
Date: 2022-02-24
File number:
SWL-57718-22
Citation: Drier v Hill, 2022 CanLII 128599 (ON LTB), <https://canlii.ca/t/jv57m>, retrieved on 2023-05-16
Content:
Order under Section 77
Residential Tenancies
Act, 2006
File Number: SWL-57718-22
In the
matter of:
26 DUNSMERE DRIVE KITCHENER ON N2E2B4
Between:
Jason Drier
Celia Drier
Landlords
and
Lorna Hill Martin
Hill
Tenants
Jason Drier and Celia Drier (the
'Landlords') applied for an order to terminate the tenancy and evict Lorna Hill and Martin Hill (the 'Tenants') because the Tenants
entered into an agreement to terminate the tenancy.
Determinations:
1.  The Landlords and the Tenants
signed an agreement
to terminate the tenancy as of
March 1, 2022.
It is
ordered that:
1.  The tenancy between
the Landlords and the Tenants
is terminated. The Tenants must move out of the rental unit on or
before March 7, 2022.
2.  If the unit is not vacated on or before
March 7, 2022, then starting
March 8, 2022, the
Landlords may file this order 

# Step 3: File Number + Citation
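
For orientation, the two fields pulled out in this step can also be expressed as a pair of regular expressions. This is a rough sketch under the assumption that the metadata string always lists the file number immediately before the citation, as in the sample output above. `parse_metadata` and both patterns are hypothetical and are not the functions used in this notebook.

```python
import re

# citation: everything from the label up to and including the first "LTB)"
CITATION = re.compile(r"(?:Citation|Référence):\s*(.+?LTB\))")
# file number: the text between the file-number label and the citation label
FILE_NUMBER = re.compile(
    r"(?:File number|Numéro de dossier):\s*(.+?)\s*(?:Citation|Référence):"
)

def parse_metadata(metadata):
    """Return the citation and file number found in a metadata string, or None for each."""
    citation = CITATION.search(metadata)
    file_number = FILE_NUMBER.search(metadata)
    return {
        "citation": citation.group(1) if citation else None,
        "file_number": file_number.group(1) if file_number else None,
    }

sample = (
    "Date: 2022-02-24 File number: SWL-57718-22 "
    "Citation: Drier v Hill, 2022 CanLII 128599 (ON LTB), "
    "<https://canlii.ca/t/jv57m>, retrieved on 2023-05-16"
)
parsed = parse_metadata(sample)
```

Unlike the slicing approach, a missing label simply yields `None` here instead of an unexpected slice.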

In [14]:
import re

def get_case_citation(metadata_list):
    """
    Extracts the case citation from a list of metadata lines.

    This function searches through the metadata lines for a line containing "Citation:" or "Référence:"
    and extracts the citation information from that line.

    Parameters
    ----------
    metadata_list : list of str
        A list of metadata lines.

    Returns
    -------
    str or None
        The extracted case citation, or None if no citation is found.

    Examples
    --------
    >>> metadata = ["Title: Example Case", "Citation: ABC123 (LTB)"]
    >>> get_case_citation(metadata)
    'ABC123 (LTB)'

    >>> metadata = ["Title: Another Case", "Référence: XYZ789 (LTB)"]
    >>> get_case_citation(metadata)
    'XYZ789 (LTB)'
    """
    if isinstance(metadata_list, str):
        metadata_list = metadata_list.split("\n")

    for line in metadata_list:
        if "Citation:" in line:
            citation_start = line.find("Citation: ")
            citation_end = line.find("LTB)") + 4
            return line[citation_start:citation_end].replace("Citation: ", "").strip()
        elif "Référence: " in line:
            citation_start = line.find("Référence: ")
            citation_end = line.find("LTB)") + 4
            return line[citation_start:citation_end].replace("Référence: ", "").strip()
    return None

def get_file_number(metadata_list):
    """
    Extracts the file number from a list of metadata lines.

    This function concatenates the metadata lines into a single string and extracts the file number
    from that string. The file number is obtained either after "File number:" or "Numéro de dossier:".

    Parameters
    ----------
    metadata_list : list of str
        A list of metadata lines.

    Returns
    -------
    str or None
        The extracted file number, or None if no file number is found.

    Examples
    --------
    >>> metadata = ["File number: TNL-10001-18", "Citation: ABC123 (LTB)"]
    >>> get_file_number(metadata)
    'TNL-10001-18'

    >>> metadata = ["Numéro de dossier: XYZ789", "Référence: DEF456 (LTB)"]
    >>> get_file_number(metadata)
    'XYZ789'
    """
    if isinstance(metadata_list, list):
        metadata_str = " ".join(metadata_list)
    else:
        metadata_str = metadata_list

    if "Citation: " in metadata_str:
        file_nums = metadata_str[metadata_str.find("File number: ") + len("File number: ") : metadata_str.find("Citation:")].strip()
    elif "Référence: " in metadata_str:
        file_nums = metadata_str[metadata_str.find("Numéro de dossier: ") + len("Numéro de dossier: ") : metadata_str.find("Référence")].strip()
    else:
        # neither label is present, so there is nothing to slice between
        return None

    if len(file_nums) == 0:
        return None

    file_nums = file_nums.replace(";", " ")

    # de-duplicate while preserving order (set() ordering is not stable across runs)
    file_num = list(dict.fromkeys(file_nums.split()))
    file_num = ";".join(file_num)

    # drop a trailing punctuation character and any parentheses
    file_num = re.sub(r'[^\w\s]$', '', file_num)
    file_num = re.sub(r'[\(\)]', '', file_num)

    return file_num

In [15]:
for row in data_df.index:
    data_df.loc[row, 'citation'] = get_case_citation(data_df.loc[row, 'metadata'])
    data_df.loc[row, 'file_number'] = get_file_number(data_df.loc[row, 'metadata'])

data_df

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22
...,...,...,...,...,...,...,...
527,Metadata:\nDate:\t2022-02-07\nFile number:\t\n...,Metadata:\nDate: 2022-02-07\nFile number:\nSWL...,Metadata: Date: 2022-02-07 File number: SWL-57...,Date: 2022-02-07 File number: SWL-57644-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Marda Management Inc. v Derose, 2022 CanLII 97...",SWL-57644-22
528,Metadata:\nDate:\t2022-01-21\nFile number:\t\n...,Metadata:\nDate: 2022-01-21\nFile number:\nHOL...,Metadata: Date: 2022-01-21 File number: HOL-12...,Date: 2022-01-21 File number: HOL-12856-22 Ci...,Order under Section 77 Residential Tenancies A...,"Farahani v Plourde, 2022 CanLII 78831 (ON LTB)",HOL-12856-22
529,Metadata:\nDate:\t2022-02-16\nFile number:\t\n...,Metadata:\nDate: 2022-02-16\nFile number:\nSOL...,Metadata: Date: 2022-02-16 File number: SOL-26...,Date: 2022-02-16 File number: SOL-26900-22 Ci...,Order under Section 77 Residential Tenancies A...,"Retsinas v Napieraj, 2022 CanLII 108707 (ON LTB)",SOL-26900-22
530,Metadata:\nDate:\t2022-02-09\nFile number:\t\n...,Metadata:\nDate: 2022-02-09\nFile number:\nCEL...,Metadata: Date: 2022-02-09 File number: CEL-04...,Date: 2022-02-09 File number: CEL-04532-22 Ci...,Order under Section 77 Residential Tenancies A...,"Beers v Banks, 2022 CanLII 87905 (ON LTB)",CEL-04532-22


# Step 4: Detect Language
- not necessary for anything in the pipeline, just a fun extra point of data

In [16]:
# !pip install langdetect
from langdetect import detect, detect_langs

def is_mostly_french(text, threshold):
    # `threshold` is unused here; it is kept so both helpers share a signature
    try:
        return detect(text) == 'fr'
    except Exception:
        # langdetect raises when the text has no detectable features (e.g. an empty string)
        return False

def is_french(text, threshold):
    try:
        if detect(text) == 'fr':
            return True
        # otherwise accept French if its probability still exceeds the threshold
        for lang in detect_langs(text):
            if lang.lang == 'fr' and lang.prob > threshold:
                return True
        return False
    except Exception:
        return False

In [17]:
for row in data_df.itertuples():

    # adding to 'language' column
    if is_french(data_df.loc[row.Index, "raw_file_str"], 0.7) == True:
        data_df.at[row.Index, 'language'] = "French"
    else:
        data_df.at[row.Index, 'language'] = "English"

data_df

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English
...,...,...,...,...,...,...,...,...
527,Metadata:\nDate:\t2022-02-07\nFile number:\t\n...,Metadata:\nDate: 2022-02-07\nFile number:\nSWL...,Metadata: Date: 2022-02-07 File number: SWL-57...,Date: 2022-02-07 File number: SWL-57644-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Marda Management Inc. v Derose, 2022 CanLII 97...",SWL-57644-22,English
528,Metadata:\nDate:\t2022-01-21\nFile number:\t\n...,Metadata:\nDate: 2022-01-21\nFile number:\nHOL...,Metadata: Date: 2022-01-21 File number: HOL-12...,Date: 2022-01-21 File number: HOL-12856-22 Ci...,Order under Section 77 Residential Tenancies A...,"Farahani v Plourde, 2022 CanLII 78831 (ON LTB)",HOL-12856-22,English
529,Metadata:\nDate:\t2022-02-16\nFile number:\t\n...,Metadata:\nDate: 2022-02-16\nFile number:\nSOL...,Metadata: Date: 2022-02-16 File number: SOL-26...,Date: 2022-02-16 File number: SOL-26900-22 Ci...,Order under Section 77 Residential Tenancies A...,"Retsinas v Napieraj, 2022 CanLII 108707 (ON LTB)",SOL-26900-22,English
530,Metadata:\nDate:\t2022-02-09\nFile number:\t\n...,Metadata:\nDate: 2022-02-09\nFile number:\nCEL...,Metadata: Date: 2022-02-09 File number: CEL-04...,Date: 2022-02-09 File number: CEL-04532-22 Ci...,Order under Section 77 Residential Tenancies A...,"Beers v Banks, 2022 CanLII 87905 (ON LTB)",CEL-04532-22,English


# Step 5: Year
- also not necessary for anything in the pipeline, just another datapoint for the corpus
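
The year can be read straight from the `Date:` field at the start of the metadata string. A slightly stricter variant than matching the first four-digit token is sketched below, assuming dates always appear as `Date: YYYY-MM-DD`. `extract_year` and `DATE_FIELD` are hypothetical names, not part of the pipeline.

```python
import re

# anchor on the "Date:" label so other four-digit numbers are never picked up
DATE_FIELD = re.compile(r"Date:\s*(\d{4})-\d{2}-\d{2}")

def extract_year(metadata):
    """Return the year from the Date field as an int, or None if it is missing."""
    match = DATE_FIELD.search(metadata)
    return int(match.group(1)) if match else None

sample = (
    "Date: 2022-02-24 File number: SWL-57718-22 "
    "Citation: Drier v Hill, 2022 CanLII 128599 (ON LTB)"
)
year = extract_year(sample)
```

Returning `None` rather than a placeholder string keeps the column numeric-friendly, at the cost of needing a null check downstream.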

In [18]:
year_pattern = r"\b(\d{4})\b"

for row in data_df.itertuples():

    year_match = re.search(year_pattern, data_df.loc[row.Index, "metadata"])
    if year_match:
        year = year_match.group(1)
        data_df.loc[row.Index, "year"] = year
    else:
        data_df.loc[row.Index, "year"] = "year not found"

data_df

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022
...,...,...,...,...,...,...,...,...,...
527,Metadata:\nDate:\t2022-02-07\nFile number:\t\n...,Metadata:\nDate: 2022-02-07\nFile number:\nSWL...,Metadata: Date: 2022-02-07 File number: SWL-57...,Date: 2022-02-07 File number: SWL-57644-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Marda Management Inc. v Derose, 2022 CanLII 97...",SWL-57644-22,English,2022
528,Metadata:\nDate:\t2022-01-21\nFile number:\t\n...,Metadata:\nDate: 2022-01-21\nFile number:\nHOL...,Metadata: Date: 2022-01-21 File number: HOL-12...,Date: 2022-01-21 File number: HOL-12856-22 Ci...,Order under Section 77 Residential Tenancies A...,"Farahani v Plourde, 2022 CanLII 78831 (ON LTB)",HOL-12856-22,English,2022
529,Metadata:\nDate:\t2022-02-16\nFile number:\t\n...,Metadata:\nDate: 2022-02-16\nFile number:\nSOL...,Metadata: Date: 2022-02-16 File number: SOL-26...,Date: 2022-02-16 File number: SOL-26900-22 Ci...,Order under Section 77 Residential Tenancies A...,"Retsinas v Napieraj, 2022 CanLII 108707 (ON LTB)",SOL-26900-22,English,2022
530,Metadata:\nDate:\t2022-02-09\nFile number:\t\n...,Metadata:\nDate: 2022-02-09\nFile number:\nCEL...,Metadata: Date: 2022-02-09 File number: CEL-04...,Date: 2022-02-09 File number: CEL-04532-22 Ci...,Order under Section 77 Residential Tenancies A...,"Beers v Banks, 2022 CanLII 87905 (ON LTB)",CEL-04532-22,English,2022


# Step 6: LTB Location
- there are a few different methods to try, so I run them in succession to make sure we capture SOMETHING
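
Running several methods in succession amounts to a fallback chain: try each extractor in order and keep the first non-empty result. The sketch below shows that pattern only. `first_hit`, `by_heard_in`, and `by_postal_code` are hypothetical stand-ins, and the "heard in <City>" pattern and the optional space in the postal code are assumptions for illustration.

```python
import re

# Canadian postal code, with an optional space between the two halves (assumption)
POSTAL_CODE = re.compile(r"\b[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d\b")

def by_heard_in(text):
    """Look for a phrase such as 'heard in Toronto' (illustrative pattern)."""
    match = re.search(r"heard in ([A-Z][a-z]+)", text)
    return match.group(1) if match else None

def by_postal_code(text):
    """Return the first postal code, normalised to six uppercase characters."""
    match = POSTAL_CODE.search(text)
    return match.group().replace(" ", "").upper() if match else None

def first_hit(text, extractors):
    """Run extractors in order and return (extractor name, result) for the first hit."""
    for extractor in extractors:
        result = extractor(text)
        if result:
            return extractor.__name__, result
    return None, None

methods = [by_heard_in, by_postal_code]
from_address = first_hit("26 DUNSMERE DRIVE KITCHENER ON N2E2B4", methods)
from_phrase = first_hit("This application was heard in Toronto on March 1, 2022.", methods)
```

Returning the name of the extractor that succeeded makes it easy to audit later which method produced each location.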

In [19]:
import re

def find_all_positions(text: str, keyword: str):
    """
    Finds all positions of a keyword in a given text.

    This function searches for a keyword in a given text and returns a list of positions where the keyword is found.

    Parameters
    ----------
    text : str
        The text to search within.
    keyword : str
        The keyword to find in the text.

    Returns
    -------
    list
        A list of integers representing the positions of the keyword in the text.

    Examples
    --------
    >>> find_all_positions("This is an example sentence.", "example")
    [11]
    """
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

def get_postal_code(text: str):
    """
    Finds a postal code in the format "L4Z2G5" (no space) within the given text.

    Args:
        text (str): The input text to search for a postal code.

    Returns:
        str or None: The first postal code found in the text, or None if no postal code is found.

    Examples:
        >>> get_postal_code("This is a sample text with a postal code L4Z2G5.")
        'L4Z2G5'
    """

    pattern = r"\b[A-Za-z]\d[A-Za-z]\d[A-Za-z]\d\b"
    match = re.search(pattern, text)

    if match:
        return match.group()
    else:
        return None

def find_closest_subset(text: str, keywords: list):
    """
    Finds the span of text between a date and whichever of the given keywords appears closest to it,
    but only if that span starts before the word "determination" in the lowercased text and does not contain the word "member".

    Args:
        text (str): The input text to search for the subset.
        keywords (list): The list of keywords to search for. Each keyword is used as a regular expression.

    Returns:
        tuple: A tuple containing the span of text that runs from whichever comes first (keyword or date) up to the other,
               and the corresponding keyword. When the keyword comes first, the span ends where the date begins.
               Returns an empty string and None if no match is found, if the text contains no "determination",
               if the span appears after "determination", or if it contains the word "member".

    Examples:
        >>> find_closest_subset("The application was heard in Toronto on April 25, 2018. Determinations: ...", ["was heard"])
        ("was heard in Toronto on ", "was heard")

    """

    pattern = r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"
    date_matches = re.findall(pattern, text)
    keyword_positions = [(m.start(), m.end(), keyword) for keyword in keywords for m in re.finditer(keyword, text)]

    if not date_matches or not keyword_positions:
        return "", None

    smallest_distance = float('inf')
    best_subset = ""
    best_keyword = None
    
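    lowered = text.lower()

    # earliest position of a "determination(s)" marker; a subset must start before it
    marker_idxs = [lowered.find(marker) for marker in ("determination", "it is determined that")]
    marker_idxs = [idx for idx in marker_idxs if idx != -1]
    if not marker_idxs:
        # without a marker, no subset can be confirmed to precede it
        return "", None
    cutoff = min(marker_idxs)

    for date in date_matches:
        date_idx = text.find(date)
        for start, end, keyword in keyword_positions:
            distance = abs(start - date_idx)
            subset_start = min(start, date_idx)
            subset = text[subset_start : max(end, date_idx)]
            subset_lower = subset.lower()

            # test each excluded phrase explicitly: `("a" or "b") not in s` only ever checks "a"
            is_excluded = "member" in subset_lower or "with the request to review" in subset_lower

            if distance < smallest_distance and subset_start < cutoff and not is_excluded:
                smallest_distance = distance
                best_subset = subset
                best_keyword = keyword

    return best_subset, best_keyword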


def get_ltb_location_by_postal_code(case_content_str: str):
    """
    Extracts the text that immediately precedes the postal code in the given case content string,
    which is typically the end of the street address and the city.

    Args:
        case_content_str (str): The case content string to extract the location from.

    Returns:
        str or None: Up to 30 characters of text preceding the postal code, cut at the province ("ON" or "Ontario") when present.
                     None if the text contains no postal code.

    Examples:
        >>> get_ltb_location_by_postal_code("The hearing was held at 10 Dundas Street, Mississauga ON L4Z2G5.")
        "Dundas Street, Mississauga "
    """

    # if there isn't a postal code, return None right away
    postal_code = get_postal_code(case_content_str)
    if not postal_code:
        return None

    pc_idx = case_content_str.find(postal_code)
    # clamp at 0 so a postal code near the start does not produce a negative (wrap-around) slice index
    subset = case_content_str[max(pc_idx - 30, 0) : pc_idx]

    # drop the province and anything after it; note that this plain substring split
    # also matches "ON" inside upper-case city names such as "HAMILTON"
    if "ON" in subset:
        subset = " ".join(subset.split("ON")[:-1])
    elif "Ontario" in subset:
        subset = " ".join(subset.split("Ontario")[:-1])

    if "floor" in subset.lower():
        floor_idx = subset.lower().find("floor")
        subset = subset[floor_idx + len("floor") :].strip()

    return subset

def get_ltb_location(case_content_str: str):
    """
    Extracts the location information from the given case content string.

    Args:
        case_content_str (str): The case content string to extract the location from.

    Returns:
        str or None: The extracted location information if found, otherwise None.

    Examples:
        >>> get_ltb_location("The application was heard in Newmarket on April 23, 2018. Determinations: ...")
        "Newmarket"
    """

    keywords = ["application was heard", "applications were heard", "was heard", "were heard together",
                "was held", "set to be heard",
                # "heard by telephone", "heard by teleconference", "heard via teleconference",
                "heard by", "heard via",
                "motion were heard", "motion was heard", "came before the board in",
                "was then heard in", "were then heard in"]

    subset, keyword = find_closest_subset(text = case_content_str, keywords = keywords)

    if subset:
        subset = subset.replace(keyword, "")
        subset = subset.split()
        subset = [tok for tok in subset if tok not in ['in', 'on', 'via', 'together', 'by']]
        subset = " ".join(subset).strip()
        subset = subset.replace("With The Request To Review", "")

    if subset: # sometimes the hearing location is redacted and replaced with [CITY]
        if str(subset) != "[CITY]":
            return subset.title().replace("And Avenue, Unit 2 ", "").strip()

    # otherwise, go by postal code
    subset = get_ltb_location_by_postal_code(case_content_str = case_content_str)
    if subset:
        return subset.title().replace("And Avenue, Unit 2 ", "").strip()
    else:
        return None

In [20]:
for row in data_df.itertuples():

    try:
        location = get_ltb_location(data_df.loc[row.Index, 'content']) # already returned in title case

        if location:
            data_df.at[row.Index, 'ltb_location'] = location
        else:
            data_df.at[row.Index, 'ltb_location'] = "LOCATION NOT FOUND"

    except Exception as any_error:
        data_df.at[row.Index, 'ltb_location'] = "LOCATION NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year,ltb_location
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022,6 Dunsmere Drive Kitchener
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022,49 Holborn Drive Kitchener
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022,"R, 1306 King St E Hamilt"
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022,Juniper Crescent Brampt
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022,9Th Avenue East Owen Sound

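Several `ltb_location` values above end mid-word ("Hamilt", "Brampt"). One possible cause, assuming the source addresses are written in upper case, is that `split("ON")` in `get_ltb_location_by_postal_code` also cuts at the "ON" inside "HAMILTON" or "BRAMPTON". Below is a minimal sketch of a word-boundary split that would avoid this. `strip_province` is a hypothetical helper that is not part of the pipeline, and the sample strings are made up for illustration.

```python
import re

def strip_province(address_fragment: str) -> str:
    """Hypothetical helper: cut an address fragment at a standalone province token."""
    # \b keeps "ON" from matching inside upper-case city names such as "HAMILTON"
    parts = re.split(r"\b(?:ON|Ontario)\b", address_fragment, maxsplit=1)
    return parts[0].strip(" ,")

print(strip_province("1306 KING ST E HAMILTON ON "))         # 1306 KING ST E HAMILTON
print(strip_province("Juniper Crescent BRAMPTON, Ontario"))  # Juniper Crescent BRAMPTON
```

A fragment without a province token is returned unchanged apart from the stripped spaces and commas.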

# Step 7: Hearing Date

In [21]:
def find_date(text: str):
    """
    Finds a date in the format "Month Day, Year" within the given text.

    Args:
        text (str): The input text to search for a date.

    Returns:
        str: The date found in the text. Returns an empty string if no date is found.

    Examples:
        >>> find_date("The event will take place on April 23, 2018.")
        "April 23, 2018"
    """

    pattern = r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"
    match = re.search(pattern, text)

    if match:
        return match.group()
    else:
        return ""

def get_hearing_date(case_content_str: str):
    """
    Extracts the hearing date from the given case content string.

    Args:
        case_content_str (str): The case content string to extract the hearing date from.

    Returns:
        str or None: The extracted hearing date in the format "Month Day, Year" if found, otherwise None.

    Examples:
        >>> get_hearing_date("The application was heard on April 23, 2018. It is determined that...")
        "April 23, 2018"
    """

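    lowered = case_content_str.lower()

    # end the search window at the first "determinations" marker;
    # -1 (no marker found) keeps everything except the last character
    kw_idx = -1
    for keyword in ["determinations:", "it is determined"]:
        if keyword in lowered:
            kw_idx = lowered.find(keyword) # search the lowercased text, matching the membership test above
            break

    subset = case_content_str[lowered.find("application") : kw_idx].strip()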
    date = find_date(subset)

    if date:
        return date.strip()
        
    # otherwise return None
    return None

In [22]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'hearing_date'] = get_hearing_date(data_df.loc[row.Index, 'content']) # is already a str
    except Exception as any_error:
        data_df.at[row.Index, 'hearing_date'] = "HEARING DATE NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year,ltb_location,hearing_date
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022,6 Dunsmere Drive Kitchener,
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022,49 Holborn Drive Kitchener,"January 1, 2022"
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022,"R, 1306 King St E Hamilt","February 1, 2022"
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022,Juniper Crescent Brampt,"January 1, 2022"
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022,9Th Avenue East Owen Sound,"January 1, 2022"

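The `hearing_date` column holds "Month Day, Year" strings, empty values, or the "HEARING DATE NOT FOUND" sentinel, which makes it awkward to sort or compare. A minimal sketch of normalising such values to ISO 8601 with the standard library follows. `to_iso` is a hypothetical helper, not part of the pipeline, and it assumes English month names.

```python
from datetime import datetime

def to_iso(date_str):
    """Hypothetical helper: 'Month Day, Year' -> 'YYYY-MM-DD', or None if the value is not a date."""
    try:
        return datetime.strptime(date_str, "%B %d, %Y").date().isoformat()
    except (TypeError, ValueError):
        # TypeError covers None/NaN cells, ValueError covers sentinels and malformed strings
        return None

print(to_iso("January 1, 2022"))         # 2022-01-01
print(to_iso("HEARING DATE NOT FOUND"))  # None
print(to_iso(None))                      # None
```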

# Step 8: Decision Date

In [23]:
import re
from dateutil.parser import parse
import spacy
nlp = spacy.load("en_core_web_sm")

def find_date(text: str):
    """
    Finds a date in the format "Month Day, Year" within the given text.

    Args:
        text (str): The input text to search for a date.

    Returns:
        str: The date found in the text. Returns an empty string if no date is found.

    Examples:
        >>> find_date("The event will take place on April 23, 2018.")
        "April 23, 2018"
    """

    pattern = r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"
    match = re.search(pattern, text)

    if match:
        return match.group()
    else:
        return ""

def extract_date(text, nlp = nlp):
    """
    Extracts a date from a string of text using spaCy's entity recognition.

    Args:
        text (str): The text to extract the date from.

    Returns:
        str: The extracted date string, or an empty string if no date is found.

    Examples:
        >>> extract_date("The event will take place on April 23, 2018.")
        "April 23, 2018"
    """

    doc = nlp(text)

    for entity in doc.ents:
        if entity.label_ == "DATE":
            return entity.text

    return ""

def convert_date(date_str):
    """
    Parses a date string in any format that dateutil recognizes and converts it to the format "Month Day, Year" (the day is zero-padded, e.g. "May 05, 2022").

    Args:
        date_str (str): The date string to parse.

    Returns:
        str: The parsed date string in the format "Month Day, Year", or an empty string if parsing fails.

    Examples:
        >>> convert_date("2022-05-31")
        "May 31, 2022"

        >>> convert_date("05/31/2018")
        "May 31, 2018"
    """

    try:
        parsed_date = parse(date_str)
        formatted_date = parsed_date.strftime("%B %d, %Y")
        return formatted_date
    except ValueError:
        return ""

def get_decision_date(case_content_str: str):
    """
    Extracts the decision date from the given case content string.

    Args:
        case_content_str (str): The case content string to extract the decision date from.

    Returns:
        str or None: The extracted decision date in the format "Month Day, Year" if found, otherwise None.

    Examples:
        >>> get_decision_date("The tenant shall pay the arrears. April 23, 2018 Date Issued")
        "April 23, 2018"
    """

    lowered = case_content_str.lower()

    # intentionally searches these in this order. Any amendment would be the most recent date
    for keyword in ['date order amended', 'date issued', 'date order issued']:
        if keyword in lowered:
            di_idx = lowered.find(keyword)
            # the date sits immediately before the keyword; clamp at 0 to avoid a wrap-around slice
            subset = case_content_str[max(di_idx - 18, 0) : di_idx].strip().split(". ")[-1]
            return subset.strip()

    # no keyword matched: fall back to the first "date" label near the top of the document
    if "date" in lowered[: 500]:
        date_idx = lowered.find('date')
        subset = case_content_str[date_idx + len('date') : date_idx + len('date') + 50].strip()
        subset = extract_date(subset).strip()
        return convert_date(subset).strip()

    # otherwise return None
    return None


In [24]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'decision_date'] = get_decision_date(data_df.loc[row.Index, 'content']) # is already a str
    except Exception as any_error:
        data_df.at[row.Index, 'decision_date'] = "DECISION DATE NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year,ltb_location,hearing_date,decision_date
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022,6 Dunsmere Drive Kitchener,,"February 24, 2022"
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022,49 Holborn Drive Kitchener,"January 1, 2022","February 2, 2022"
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022,"R, 1306 King St E Hamilt","February 1, 2022","February 23, 2022"
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022,Juniper Crescent Brampt,"January 1, 2022","February 4, 2022"
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022,9Th Avenue East Owen Sound,"January 1, 2022","January 19, 2022"

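`get_decision_date` can return dates through two paths: the keyword path returns the date as written in the case ("February 2, 2022"), while `convert_date` formats with `%d`, which zero-pads the day. A minimal sketch of a formatter that avoids the padding, so both paths would produce the same style; `format_long_date` is a hypothetical helper and assumes English month names.

```python
from datetime import date

def format_long_date(d: date) -> str:
    """Hypothetical helper: format a date as 'Month Day, Year' without zero-padding the day."""
    return f"{d:%B} {d.day}, {d.year}"

print(date(2022, 2, 2).strftime("%B %d, %Y"))  # February 02, 2022
print(format_long_date(date(2022, 2, 2)))      # February 2, 2022
```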

# Step 9: Case URL
- Not required by any other part of the pipeline, but useful to have in the corpus.

In [25]:
import re

def get_url_from_citation_string(text: str):
    """
    Returns the URL to the case file given the citation line from a case file's metadata.
    The line must begin with "Citation: " and the URL must be within angle brackets.

    Parameters
    ----------
    text : str
        A string of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    pattern = r"<(.*?)>"
    matches = re.findall(pattern, text)
    return matches[0]

def get_url_from_metadata(case_metadata: list):
    """
    Extract URL to case file from a list of strings of metadata from a case file.

    Parameters
    ----------
    case_metadata : list or str
        A list of strings of metadata from a case file, or a single newline-separated string.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    if isinstance(case_metadata, str):
        case_metadata = case_metadata.split("\n")

    for line in case_metadata:
        if "Citation:" in line or "Référence:" in line:
            return get_url_from_citation_string(line)
        
    return None

In [26]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'url'] = get_url_from_metadata(data_df.loc[row.Index, 'metadata'])
    except Exception as any_error:
        data_df.at[row.Index, 'url'] = "URL NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year,ltb_location,hearing_date,decision_date,url
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022,6 Dunsmere Drive Kitchener,,"February 24, 2022",https://canlii.ca/t/jv57m
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022,49 Holborn Drive Kitchener,"January 1, 2022","February 2, 2022",https://canlii.ca/t/js2rt
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022,"R, 1306 King St E Hamilt","February 1, 2022","February 23, 2022",https://canlii.ca/t/jv55d
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022,Juniper Crescent Brampt,"January 1, 2022","February 4, 2022",https://canlii.ca/t/js3mt
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022,9Th Avenue East Owen Sound,"January 1, 2022","January 19, 2022",https://canlii.ca/t/jr90z

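One detail worth illustrating for the URL step: an expression of the form `("Citation:" or "Référence:") in line` evaluates the parenthesised `or` first, so only "Citation:" is ever tested. A minimal sketch of a label check that tests both, combined with a lookup that returns None when no bracketed URL is present. `url_from_citation_line` is a hypothetical helper, and the sample line and its URL are made up.

```python
import re

CITATION_LABELS = ("Citation:", "Référence:")

def url_from_citation_line(line: str):
    """Hypothetical helper: return the first <...>-delimited URL on a citation line, else None."""
    if not any(label in line for label in CITATION_LABELS):
        return None
    match = re.search(r"<([^<>\s]+)>", line)
    return match.group(1) if match else None

french_line = "Référence:\tExemple c Exemple, <https://example.org/case>"

# the parenthesised `or` evaluates to "Citation:", so the French label is never tested
print(("Citation:" or "Référence:") in french_line)  # False
print(url_from_citation_line(french_line))           # https://example.org/case
```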

# Step 10: Adjudicating Member

In [27]:
def get_adj_member(case_content_str: str):
    """
    Retrieves the adjudicating member(s) mentioned in the given case content string.

    Args:
        case_content_str (str): The input string containing the case content.

    Returns:
        str: The adjudicating member(s) mentioned in the case content. If no adjudicating member is found, returns "nan".

    Examples:
        >>> get_adj_member("April 23, 2018 Date Issued Jane Doe Member, Landlord and Tenant Board")
        "Jane Doe"

    Notes:
        The function looks for specific keywords in the `case_content_str` to identify the adjudicating member(s).
        The keywords are evaluated in the following order: "date issued", "date of reasons", and "date order issued".
        For every instance of the first keyword that is present, the function extracts the text that follows it, keeps the part
        before the first comma, and cuts it at any title ("Member", "Vice Chair", "Vice-Chair"). Distinct names are joined with ", ".
        If no adjudicating member is found, the function returns "nan".

    Raises:
        AttributeError: If `case_content_str` is not a string (it has no `.lower()` method).

    """

    keyword_1 = "date issued" # this is the most reliable one
    keyword_2 = "date of reasons" # first fallback
    keyword_3 = "date order issued" # second fallback

    # find which is best for the case (in order of best option to worst option)
    lowered = case_content_str.lower()
    keyword = None # default when none of the keywords is present
    for candidate in (keyword_1, keyword_2, keyword_3):
        if candidate in lowered:
            keyword = candidate
            break

    # if nothing is found, better to return nothing than to return something clearly incorrect
    if not keyword:
        return "nan"
    
    # getting index of whichever keyword was found first
    kw_idxs = find_all_positions(text = case_content_str.lower(), keyword = keyword)
    
    
    ### If there are multiple members found ###

    if len(kw_idxs) > 1:

        adj_membs = []

        for kw_idx in kw_idxs:
                
            subset = case_content_str[kw_idx + len(keyword): kw_idx + 100] # subsetting to an arbitrary distance after the keyword location
            subset = subset.split(", ")[0].strip()

            # removing "member" if applicable
            if "member" in subset.lower():
                memb_idx = subset.lower().find("member")
                subset = subset[: memb_idx].strip()

            # removing "vice chair" if applicable
            if "vice chair" in subset.lower():
                memb_idx = subset.lower().find("vice chair")
                subset = subset[: memb_idx].strip()

            # removing "vice-chair" if applicable
            if "vice-chair" in subset.lower():
                memb_idx = subset.lower().find("vice-chair")
                subset = subset[: memb_idx].strip()

            # return subset
            adj_membs.append(subset)

        return ", ".join(list(set([memb for memb in adj_membs if memb != ""]))) # removing empty and duplicate items
    
    ### If there's only one member found ###

    kw_idx = case_content_str.lower().find(keyword)

    subset = case_content_str[kw_idx + len(keyword): kw_idx + 100] # subsetting to an arbitrary distance after the keyword location
    subset = subset.split(", ")[0].strip()

    # removing "member" if applicable
    if "member" in subset.lower():
        memb_idx = subset.lower().find("member")
        subset = subset[: memb_idx].strip()

    # removing "vice chair" if applicable
    if "vice chair" in subset.lower():
        memb_idx = subset.lower().find("vice chair")
        subset = subset[: memb_idx].strip()

    # removing "vice-chair" if applicable
    if "vice-chair" in subset.lower():
        memb_idx = subset.lower().find("vice-chair")
        subset = subset[: memb_idx].strip()

    return subset

In [28]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'adjudicating_member'] = get_adj_member(data_df.loc[row.Index, 'content']).replace("Vice Chair", "").replace("Vice-Chair", "").strip()
    except Exception as any_error:
        data_df.at[row.Index, 'adjudicating_member'] = "MEMBER NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year,ltb_location,hearing_date,decision_date,url,adjudicating_member
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022,6 Dunsmere Drive Kitchener,,"February 24, 2022",https://canlii.ca/t/jv57m,Trish Carson
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022,49 Holborn Drive Kitchener,"January 1, 2022","February 2, 2022",https://canlii.ca/t/js2rt,Emile Ramlochan
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022,"R, 1306 King St E Hamilt","February 1, 2022","February 23, 2022",https://canlii.ca/t/jv55d,
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022,Juniper Crescent Brampt,"January 1, 2022","February 4, 2022",https://canlii.ca/t/js3mt,Ian Speers
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022,9Th Avenue East Owen Sound,"January 1, 2022","January 19, 2022",https://canlii.ca/t/jr90z,Vladislav Shustov

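The three title checks in `get_adj_member` ("member", "vice chair", "vice-chair") all do the same thing: cut the extracted text at the title. A minimal sketch of the same idea as a single regular expression; `strip_title` is a hypothetical helper, and unlike the substring checks it only matches whole words. The sample inputs assume the title follows the name, as the function does.

```python
import re

# matches "member", "vice chair" or "vice-chair" as whole words, in any letter case
TITLE_PATTERN = re.compile(r"\b(?:member|vice[ -]chair)\b", re.IGNORECASE)

def strip_title(name_fragment: str) -> str:
    """Hypothetical helper: keep only the text before the first title, if there is one."""
    match = TITLE_PATTERN.search(name_fragment)
    if match:
        name_fragment = name_fragment[: match.start()]
    return name_fragment.strip()

print(strip_title("Trish Carson Member"))    # Trish Carson
print(strip_title("Ian Speers Vice-Chair"))  # Ian Speers
```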

# Step 11: Extract Case Outcome Span
- still need to classify case outcome span

In [29]:
import re
import itertools

def find_all_positions(text: str, keyword: str):
    """
    Finds all positions of a keyword in a given text.

    This function searches for a keyword in a given text and returns a list of positions where the keyword is found.

    Parameters
    ----------
    text : str
        The text to search within.
    keyword : str
        The keyword to find in the text.

    Returns
    -------
    list
        A list of integers representing the positions of the keyword in the text.

    Examples
    --------
    >>> find_all_positions("This is an example sentence.", "example")
    [11]
    """
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

def get_outcome_span(text: str, return_truncated: bool = True):
    """
    Extracts the outcome span from a given text using different methods.

    This function extracts the outcome span from a given text using multiple methods. It first looks for the
    closest pairing of one of the phrases "in accordance with", "grant", "relief", or "fair" with a following
    "ordered", and returns the first sentence in the surrounding window that contains that phrase. If that method
    fails, it takes the sentences surrounding the phrase "it is ordered". If that also fails, it takes the sentences
    surrounding the word "find". The function returns the extracted outcome span as a cleaned string.

    Parameters
    ----------
    text : str
        The text from which to extract the outcome span.
    return_truncated : bool, optional
        If True (default), return only the cleaned outcome span. If False, return all of the text up to
        and including the outcome span.

    Returns
    -------
    str or None
        The extracted outcome span as a cleaned string, or None if no span is found.

    Examples
    --------
    >>> get_outcome_span(unstructured_case_file)
    "In accordance with the order, it is ordered that the defendant pays a fine."
    """

    ############### FIRST METHOD ################

    for keyword in ['in accordance with', 'grant', 'relief', 'fair']: # these all seem common but none seem to exist in 100% of cases

        if keyword in text:

            # find all occurrences of 'in accordance with' and 'ordered'
            accordance_with_indices = [m.end() for m in re.finditer(keyword, text)]
            ordered_indices = [m.start() for m in re.finditer("ordered", text)]

            # generate all possible pairs of indices
            index_pairs = list(itertools.product(accordance_with_indices, ordered_indices))

            # filter pairs where 'accordance with' index is less than 'ordered' index
            index_pairs = [(i, j) for (i, j) in index_pairs if i < j]
            if index_pairs:
                # find the pair with the shortest distance between indices
                min_distance_pair = min(index_pairs, key = lambda x: x[1] - x[0])
                # slicing never raises IndexError, so no try/except is needed; the start is clamped at 0
                # because a negative slice index would wrap around to the end of the text
                best_subset = text[max(min_distance_pair[0] - 300, 0) : min_distance_pair[1] + 400].strip()

                best_subset = best_subset.split(". ")

                if not best_subset:
                    continue # to next match of all matches of the keyword

                sent_id = [idx for idx, i in enumerate(best_subset) if keyword in i.lower()][0]

                clean_outcome = best_subset[sent_id]

                # return JUST the (presumably) most relevant outcome span (after cleaning it up a bit)
                if return_truncated:
                    clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                    clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                    if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                        clean_outcome = clean_outcome.split(")")[1].strip()
                    return clean_outcome

                # return all case file text until the end of the outcome span
                else:
                    return text[: text.find(clean_outcome) + len(clean_outcome)].strip()

    ################ SECOND METHOD ################

    keyword = "it is ordered"
    if keyword in text.lower():
        matches = find_all_positions(text.lower(), keyword)

        for match in matches:
            # take the sentences within 400 chars either side of the match, dropping the partial first and last ones;
            # the start is clamped at 0 because a negative slice index would wrap around to the end of the text
            clean_outcome = ". ".join(text[max(match - 400, 0) : match + 400].split(". ")[1:-1])

            # return None
            # print("METHOD 2")
            if not clean_outcome:
                continue # to next match of all matches of the keyword

            if return_truncated:
                clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome) # removes bracketed numbers such as "[12]"
                clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes a leading list number such as "16. "

                # drop a dangling list marker such as "a)" from the start; split only once so any later ")" is kept
                if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                    clean_outcome = clean_outcome.split(")", 1)[1].strip()
                return clean_outcome

            # return all case file text until the end of the outcome span
            else:
                return text[: text.find(clean_outcome) + len(clean_outcome)].strip()

    ############### THIRD METHOD ################

    keyword = " find " # spaces to prevent "finding" or other derivations from being included -- specifically looking for statements like "I find that..."
    if keyword in text.lower():
        matches = find_all_positions(text.lower(), keyword)
        for match in matches:

            # take up to 400 chars on either side of the match; slicing past the end of a string is safe,
            # but a negative start index would wrap around to the end of the string, so clamp it at 0
            window = text[max(match - 400, 0) : match + 400]
            clean_outcome = ". ".join(window.split(". ")[1:-1]) # drop the partial sentences at either edge

            if not clean_outcome:
                continue # to next match of all matches of the keyword
            
            if return_truncated:
                clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome) # removes bracketed numbers such as "[12]"
                clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes a leading list number such as "16. "

                # drop a dangling list marker such as "a)" from the start; split only once so any later ")" is kept
                if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                    clean_outcome = clean_outcome.split(")", 1)[1].strip()
                return clean_outcome
            else:
                return text[: text.find(clean_outcome) + len(clean_outcome)].strip()

    # if none of the three methods finds an outcome span, return None (a model such as Longformer could be tried as a fallback)
    return None
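
The same cleanup is repeated in each of the three methods above: strip bracketed numbers like "[12]", strip a leading "16. "-style number, and drop a dangling list marker like "a)". Below is a minimal sketch of pulling that into one helper. The name `clean_outcome_text` is hypothetical and not part of the pipeline; note that it splits on the first ")" only, so any later parenthetical in the span is kept.

```python
import re

def clean_outcome_text(span: str) -> str:
    """Hypothetical helper (not in the pipeline): tidy an extracted outcome span."""
    span = re.sub(r'\[\d+\]', '', span)            # remove bracketed numbers such as "[12]"
    span = re.sub(r'^\d+\.\s*', '', span).strip()  # remove a leading list number such as "16. "
    # drop a dangling list marker such as "a)" at the start;
    # maxsplit=1 keeps everything after the first ")" intact
    if ")" in span[:10] and "(" not in span[:10]:
        span = span.split(")", 1)[1].strip()
    return span

# made-up example sentence, for illustration only
example = "16. a) The Tenant shall pay $500 (five hundred dollars) to the Landlord."
print(clean_outcome_text(example))
```

Each `if return_truncated:` branch could then, presumably, reduce to `return clean_outcome_text(clean_outcome)`.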

In [30]:
for row in data_df.itertuples():
    try:
        content_str = data_df.at[row.Index, 'content']
        data_df.at[row.Index, 'outcome_span'] = get_outcome_span(content_str, return_truncated = True)
    except Exception as any_error:
        # report the failing row and keep going so one bad file does not stop the run
        print(f"{any_error} with file at df row: {row.Index}")

print(data_df.shape)
data_df.head()

(532, 15)


Unnamed: 0,raw_file_str,cleaned_case_with_newlines,full_file,metadata,content,citation,file_number,language,year,ltb_location,hearing_date,decision_date,url,adjudicating_member,outcome_span
0,Metadata:\nDate:\t2022-02-24\nFile number:\t\n...,Metadata:\nDate: 2022-02-24\nFile number:\nSWL...,Metadata: Date: 2022-02-24 File number: SWL-57...,Date: 2022-02-24 File number: SWL-57718-22 Ci...,Order under Section 77 Residential Tenancies A...,"Drier v Hill, 2022 CanLII 128599 (ON LTB)",SWL-57718-22,English,2022,6 Dunsmere Drive Kitchener,,"February 24, 2022",https://canlii.ca/t/jv57m,Trish Carson,Determinations: 1. The Landlords and the Tena...
1,Metadata:\nDate:\t2022-02-02\nFile number:\t\n...,Metadata:\nDate: 2022-02-02\nFile number:\nSWL...,Metadata: Date: 2022-02-02 File number: SWL-57...,Date: 2022-02-02 File number: SWL-57618-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Waterloo Region Housing v Underwood, 2022 CanL...",SWL-57618-22,English,2022,49 Holborn Drive Kitchener,"January 1, 2022","February 2, 2022",https://canlii.ca/t/js2rt,Emile Ramlochan,The amount that is still owing from that order...
2,Metadata:\nDate:\t2022-02-23\nFile number:\t\n...,Metadata:\nDate: 2022-02-23\nFile number:\nSOL...,Metadata: Date: 2022-02-23 File number: SOL-26...,Date: 2022-02-23 File number: SOL-26921-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Anastasakis v Iwashita, 2022 CanLII 128519 (ON...",SOL-26921-22,English,2022,"R, 1306 King St E Hamilt","February 1, 2022","February 23, 2022",https://canlii.ca/t/jv55d,,6. The Landlord collected a rent deposit of $1...
3,Metadata:\nDate:\t2022-02-04\nFile number:\t\n...,Metadata:\nDate: 2022-02-04\nFile number:\nCEL...,Metadata: Date: 2022-02-04 File number: CEL-04...,Date: 2022-02-04 File number: CEL-04513-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Virk v Hashey, 2022 CanLII 88013 (ON LTB)",CEL-04513-22,English,2022,Juniper Crescent Brampt,"January 1, 2022","February 4, 2022",https://canlii.ca/t/js3mt,Ian Speers,"Since the date of the order, the Tenant has fa..."
4,Metadata:\nDate:\t2022-01-19\nFile number:\t\n...,Metadata:\nDate: 2022-01-19\nFile number:\nCEL...,Metadata: Date: 2022-01-19 File number: CEL-04...,Date: 2022-01-19 File number: CEL-04413-22 Ci...,Order under Section 78(6) Residential Tenancie...,"Grey Bruce Property Rentals Inc v Thompson, 20...",CEL-04413-22,English,2022,9Th Avenue East Owen Sound,"January 1, 2022","January 19, 2022",https://canlii.ca/t/jr90z,Vladislav Shustov,"Since the date of the order, the Tenant has fa..."
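
Since `get_outcome_span` returns `None` when none of its three methods match, it may be worth checking how many rows ended up without a span before moving on. Below is a minimal sketch on a toy DataFrame. The rows and file numbers are invented for illustration; only the column names `file_number` and `outcome_span` come from the pipeline.

```python
import pandas as pd

# toy stand-in for data_df; the values are made up for illustration
toy_df = pd.DataFrame({
    "file_number": ["AAA-00001-22", "AAA-00002-22", "AAA-00003-22"],
    "outcome_span": [
        "The application is dismissed.",
        None,  # what get_outcome_span returns when nothing matches
        "The Tenant shall pay the Landlord $500.",
    ],
})

missing = toy_df["outcome_span"].isna()  # True where no span was extracted
print(f"{missing.sum()} of {len(toy_df)} rows have no outcome span")
print(toy_df.loc[missing, "file_number"].tolist())
```

The same two lines applied to `data_df` would list the file numbers that need a manual look or a fallback method.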


In [32]:
# data_df.to_csv("data_for_web_app_kai.csv", index = False)
data_df['ltb_location']

0      6 Dunsmere Drive Kitchener
1      49 Holborn Drive Kitchener
2        R, 1306 King St E Hamilt
3         Juniper Crescent Brampt
4      9Th Avenue East Owen Sound
                  ...            
527    71 Tecumseh Road W Windsor
528      Rfa Crescent West Kingst
529    1 Buffalo Street Brantford
530    Albot Street Port Mcnicoll
531      487 Leacock Drive Barrie
Name: ltb_location, Length: 532, dtype: object