# Full Extraction Pipeline

This notebook walks through the entire data extraction pipeline from start to finish, using a single example. Functions and other parts of the process should be thoroughly tested and vetted before being added here. This notebook is meant to be the "final" version of the pipeline, and it should run from start to finish with no issues.

## Generating data for testing

In [1]:
import pandas as pd

gold_data = pd.read_csv("data/gold_labels_with_files.csv")

# making new names for the columns in gold_data

new_names = {
    'Timestamp': 'timestamp',
    'Email Address': 'email_address',
    'What is the file number of the case?': 'file_number_gold',
    'What was the date of the hearing? [mm/dd/yyyy]': 'hearing_date',
    'What was the date of the decision? [mm/dd/yyyy]': 'decision_date',
    'Who was the member adjudicating the decision?': 'adjudicating_member',
    'What was the location of the landlord tenant board?': 'ltb_location',
    'Did the decision state the landlord was represented?': 'landlord_represented',
    'Did the decision state the landlord attended the hearing?': 'landlord_attended_hearing',
    'Did the decision state the tenant was represented?': 'tenant_represented',
    'Did the decision state the tenant attended the hearing?': 'tenant_attended_hearing',
    'Did the decision state the landlord was a not-for-profit landlord (e.g. Toronto Community Housing)?': 'landlord_nonprofit',
    'Did the decision state the tenant was collecting a subsidy?': 'tenant_collecting_subsidy',
    'What was the outcome of the case?': 'case_outcome',
    'What was the length of the tenancy, or in other words, how long had the tenants lived at the residence in question? ': 'tenancy_length',
    'What was the monthly rent?': 'monthly_rent',
    'What was the amount of the rental deposit? ': 'rental_deposit',
    'If any rent increases occurred, what was the rent after the increase(s)?': 'rent_after_increase',
    'If any rent increases occurred, when did the rent increase(s) come into effect? ': 'rent_increase_effect_date',
    'What was the total amount of arrears?': 'total_arrears',
    'Over how many months did the arrears accumulate? ': 'arrears_duration',
    'If the tenant made a payment on the arrears after the eviction notice was served and/or prior to the hearing, what was the amount of the payment? ': 'arrears_payment_amount',
    'Did the decision mention a history of arrears by the tenant separate from the arrears in the current claim (more than one period of arrears, recurrently coming in and out of arrears, arrears with previous landlord, etc.)?': 'tenant_arrears_history_mentioned',
    'If the tenant had a history of arrears, did the decision mention a history of the tenant making payments on those arrears (separate from any payments made in response to the present eviction notice/hearing)?': 'tenant_arrears_payment_history_mentioned',
    'How frequently were rent payments made late?': 'rent_payments_late_frequency',
    'Did the member find the tenant had or seemed to have the ability to pay rent, but chose not do so?': 'tenant_ability_to_pay_rent',
    'What were the specific mental, medical, or physical conditions of the tenant, if any? ': 'tenant_conditions',
    'Did the decision state that the tenant had children living with them?': 'tenant_children_present',
    'How many total children did the tenant have living with them? ': 'total_children',
    'How many total children aged 17 or younger did the tenant have living with them?': 'children_17_or_younger',
    'How many total children aged 13 or younger did the tenant have living with them? ': 'children_13_or_younger',
    'How many total children aged 4 or younger did the tenant have living with them?': 'children_4_or_younger',
    'Did the decision state any of the children had mental, medical or physical conditions?': 'children_conditions_mentioned',
    'If yes to the previous question, did the decision state these conditions would make moving particularly burdensome?': 'conditions_making_moving_burdensome',
    'Was the tenant employed at the time of the hearing?': 'tenant_employed',
    'If the tenant was not employed, did the decision state the tenant was receiving any form of government assistance (e.g. OW, childcare benefits, ODSP, OSAP)?': 'tenant_government_assistance',
    'If the tenant was employed, did the decision state any doubts about the stability of employment e.g. lack of guaranteed hours, contract work, etc.?': 'employment_stability_doubts',
    'Did the member find the tenant had sufficient income to pay rent?': 'sufficient_income_to_pay_rent',
    'What was the total income of the tenant’s household? ': 'total_household_income',
    'Did the decision mention the tenant lost their job leading up to or during the period of the hearing?': 'tenant_job_loss_mentioned',
    'Did the decision mention any other extenuating circumstances experienced by the tenant leading up to or during the period of the claim (e.g. hospitalization, death in the family, etc.)?': 'tenant_extenuating_circumstances',
    'Did the tenant propose a payment plan?': 'tenant_proposed_payment_plan',
    'If the tenant did propose a payment plan, did the member accept the proposed payment plan?': 'accepted_proposed_payment_plan',
    'If a payment plan was ordered, what was the length of the payment plan? ': 'payment_plan_length',
    'Did the decision mention the tenant’s difficulty finding alternative housing for any reason e.g.physical limitations, reliance on social assistance, etc.?': 'tenant_difficulty_finding_housing',
    'If yes to the previous question, which of the following were applicable to the tenant?': 'applicable_difficulty_reasons',
    'Did the decision state the tenant was given prior notice for the eviction?': 'tenant_prior_notice_given',
    'If the tenant was given prior notice for the eviction, how much notice was given?': 'prior_notice_duration',
    'Did the decisions state postponement would result in the tenant accruing additional arrears?': 'postponement_additional_arrears',
    'Which other specific applications of the landlord or the tenant were mentioned?': 'mentioned_applications',
    'Did the decision mention the validity of an N4 eviction notice?': 'validity_of_N4_notice_mentioned',
    'Were there detail(s) in the decision not captured by this questionnaire that should be included?': 'additional_details_in_decision',
    'Exec Review': 'executive_review',
    'Review Status': 'review_status'
}

gold_data = gold_data.rename(columns = new_names)
# sorting by file_number -- so that ordering of the new data annotations can match this and be more easily compared
gold_data = gold_data.sort_values(by = ['file_number_gold'], ascending = True).reset_index(drop = True)

silver_df = gold_data.loc[:, ['file_number_gold', 'raw_file_text']] # keeping file numbers just for easier comparability. they will still be generated
silver_df['raw_file_text_no_markers'] = silver_df['raw_file_text'].apply(lambda x: x.replace("Metadata:", "").replace("Content:", "").strip()) # removing markers
silver_df.head()

Unnamed: 0,file_number_gold,raw_file_text,raw_file_text_no_markers
0,SWL-17348-18,Metadata:\nDate:\t2018-07-06\nFile number:\t\n...,Date:\t2018-07-06\nFile number:\t\nSWL-17348-1...
1,TEL-79722-17,Metadata:\nDate:\t2017-05-26\nFile number:\t\n...,Date:\t2017-05-26\nFile number:\t\nTEL-79722-1...
2,TEL-80773-17,Metadata:\nDate:\t2017-07-05\nFile number:\t\n...,Date:\t2017-07-05\nFile number:\t\nTEL-80773-1...
3,TEL-81359-17-AM,Metadata:\nDate:\t2017-07-18\nFile number:\t\n...,Date:\t2017-07-18\nFile number:\t\nTEL-81359-1...
4,TEL-81405-17,Metadata:\nDate:\t2017-08-17\nFile number:\t\n...,Date:\t2017-08-17\nFile number:\t\nTEL-81405-1...


# Step 1: General Cleaning
- General file cleaning

In [2]:
import re

def general_cleaning(raw_file_str: str) -> list:
    r"""
    Performs general cleaning on a raw file string.

    This function replaces tabs and non-breaking spaces ("\xa0") with spaces, strips leading/trailing
    whitespace, and drops empty lines. It operates line by line on the input text and only keeps
    non-empty lines after stripping.

    Parameters
    ----------
    raw_file_str : str
        The raw file content as a string, where different lines are separated by '\n'.

    Returns
    -------
    list
        A list of cleaned lines. Each element of the list is a cleaned string corresponding to a non-empty 
        line in the input string. Tabs and "\xa0" characters are replaced with spaces, leading/trailing 
        whitespaces are removed.

    Examples
    --------
    >>> general_cleaning("  First line \t \n \xa0 \nSecond line \n   Third line\t")
    ['First line', 'Second line', 'Third line']
    """

    # replaces tabs and non-breaking spaces ("\xa0") with spaces, strips leading/trailing whitespace, and drops empty lines
    generally_cleaned_list = [line.replace("\t", " ").replace("\xa0", " ").strip() for line in raw_file_str.split('\n') if line.strip() != '']
    return generally_cleaned_list

def remove_whitespace_and_underscores(string):
    """
    Replaces runs of three or more consecutive underscores with a space and collapses consecutive whitespace.
    
    Parameters
    ----------
    string : str
        The input string to be processed.
        
    Returns
    -------
    str
        The processed string with underscore runs removed and whitespace collapsed to single spaces.
        Single and double underscores (e.g. inside identifiers) are kept.
    
    Examples
    --------
    >>> remove_whitespace_and_underscores("Hello    world___")
    'Hello world'
    
    >>> remove_whitespace_and_underscores("   This    string_has___many____underscores  ")
    'This string_has many underscores'
    """
    # Replace runs of three or more underscores (e.g. signature lines) with a space
    string = re.sub(r'_{3,}', ' ', string)

    # Collapse consecutive whitespace, including any left behind by the removed underscores
    string = re.sub(r'\s+', ' ', string)

    return string.strip()
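To see how the two cleaning steps fit together, here is a minimal sketch run on lines shaped like the signature block of an LTB order. The name `clean_lines` is hypothetical, and the sketch assumes that a run of three or more underscores is a signature rule that can be dropped, while single underscores inside words are kept.

```python
import re

def clean_lines(raw_text: str) -> list:
    """Split on newlines, normalise whitespace, drop underscore rules and empty lines."""
    cleaned = []
    for line in raw_text.split("\n"):
        # tabs and non-breaking spaces become ordinary spaces
        line = line.replace("\t", " ").replace("\xa0", " ")
        # assumption: three or more underscores in a row are a signature rule
        line = re.sub(r"_{3,}", " ", line)
        # collapse whatever whitespace is left into single spaces
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned.append(line)
    return cleaned

sample = "May 26, 2017          _______________________\n\xa0\nDate Issued\t\tLaura Hartslief"
print(clean_lines(sample))
```

Removing the underscores before collapsing whitespace avoids leaving a double space where the rule used to be.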

# Step 2: Metadata + Content Separation
- Extract metadata, then separate metadata from content using a Flan-T5 model

In [3]:
import transformers
# from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model_name = "metadata_extractor_flant5_small"

# folder where the model files are located -- unzip before running
model_dir = f"/Users/kmaurinjones/Desktop/School/UBC/UBC_Coursework/capstone/Allard_A_Capstone/models/metadata_extractor/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)



In [4]:
def extract_metadata_t5(raw_case_file_text: str, model, tokenizer = tokenizer):
    """
    Extracts metadata and content from a raw case file text using a T5-based model.

    This function performs general cleaning on the raw case file text and then applies a Flan-T5-small model
    to extract the metadata. The model input is the cleaned text prefixed with "extract metadata boundary:",
    and the content is obtained by removing the extracted metadata from the cleaned text.

    Parameters
    ----------
    raw_case_file_text : str
        The raw case file text to be processed.
    model : transformers.PreTrainedModel
        The T5-based sequence-to-sequence model to use for extraction (required).
    tokenizer : transformers.PreTrainedTokenizerBase, optional
        The tokenizer associated with the model. By default, it uses the tokenizer loaded above.

    Returns
    -------
    tuple
        A tuple of three strings: the full cleaned file ("Metadata: ... Content: ..."), the extracted
        metadata, and the content.

    Examples
    --------
    >>> full_file, metadata, content = extract_metadata_t5(raw_text, model = model, tokenizer = tokenizer)
    >>> full_file == "Metadata: " + metadata + " " + "Content: " + content
    True
    """

    # do general case file cleaning
    clean_file_list = general_cleaning(raw_case_file_text)
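    # drop any line that carries a "metadata:" or "content:" marker (case-insensitive);
    # note that `("metadata:" or "content:")` would evaluate to just "metadata:", so each marker is checked explicitly
    clean_file_str = " ".join([line for line in clean_file_list if not any(marker in line.lower() for marker in ("metadata:", "content:"))])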

    if "Browse myCanLII Save this case Set up citation alert Email this case" in clean_file_str:
        clean_file_str = clean_file_str.replace("Browse myCanLII Save this case Set up citation alert Email this case", "").strip()

    # run model on cleaned case file text
    inputs = ["extract metadata boundary:" + clean_file_str] # PREFIX = "extract metadata boundary:"

    inputs = tokenizer(inputs, max_length = 256, truncation = True, return_tensors = "pt")
    output = model.generate(**inputs, num_beams = 8, do_sample = True, min_length = 1, max_length = 128)
    decoded_output = tokenizer.batch_decode(output, skip_special_tokens = True)[0]

    for to_delete in ["<", ">"]:
        decoded_output = decoded_output.replace(to_delete, "")

    metadata = decoded_output.strip()

    # re-wrap the first URL in the metadata in angle brackets (they were stripped above) -- really specific,
    # but this seemed to be the only pitfall of the model, and it fixes the issue completely
    # (skipped when the output contains no URL, which would otherwise raise an IndexError)
    pattern = r'https://[^,]*,'
    matches = re.findall(pattern, metadata)
    if matches:
        metadata = metadata.replace(matches[0], f"<{matches[0][:-1]}>,")

    # differentially get the content
    content = clean_file_str.replace(metadata, "").strip()

    full_file_cleaned = "Metadata: " + metadata + " " + "Content: " + content

    # full case file
    # full_file = ". ".join([line.replace("  ", " ") for line in clean_file_list if ("metadata:" or "content:") not in line.lower()])

    return full_file_cleaned, metadata, content
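The URL re-wrapping step near the end of `extract_metadata_t5` can be checked in isolation, without loading the model. The sketch below is a stand-alone version under the same regex; the helper name `wrap_first_url` is hypothetical, and unlike the notebook code it replaces only the first occurrence of the matched URL.

```python
import re

def wrap_first_url(metadata: str) -> str:
    """Wrap the first 'https://...,' span in angle brackets; leave the text alone if there is none."""
    match = re.search(r"https://[^,]*,", metadata)
    if match is None:
        # no URL in the model output, so there is nothing to re-wrap
        return metadata
    url = match.group(0)[:-1]  # drop the trailing comma
    return metadata.replace(match.group(0), f"<{url}>,", 1)

decoded = "Citation: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB), https://canlii.ca/t/h539n, retrieved on 2023-05-30"
print(wrap_first_url(decoded))
```

Returning the input unchanged when nothing matches keeps the step safe for decisions whose metadata has no URL.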

In [5]:
copied_from_website = """
Date:
2017-05-26
File number:
TEL-79722-17
Citation:
TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB), <https://canlii.ca/t/h539n>, retrieved on 2023-05-30

Browse myCanLII

Save this case

Set up citation alert

Email this case
Order under Section 69

Residential Tenancies Act, 2006

 

File Number: TEL-79722-17

 

 
 

 


T.I.L. (the 'Landlord') applied for an order to terminate the tenancy and evict B. (R.) S. (the 'Tenant') because the Tenant did not pay the rent that the Tenant owes (the ‘L1 application’).

 

The Landlord also applied for an order to terminate the tenancy and evict the Tenant because the Tenant has substantially interfered with the reasonable enjoyment or lawful right, privilege or interest of the Landlord or another tenant (the ‘L2 application’).

 

This application was heard in Toronto on May 19, 2017.

 

Only the Landlord and the Landlord’s witness, T.G., attended the hearing.

 

Determinations:

 

The L1 Application

 

1.      The Tenant has not paid the total rent the Tenant was required to pay for the period from February 1, 2017 to May 31, 2017.  Because of the arrears, the Landlord served a Notice of Termination effective April 17, 2017.

2.      The arrears of rent owing for the period ending May 31, 2017, total $1,960.00.

3.      The Landlord incurred costs as a result of filing the application in the amount of $190.00 and is entitled to reimbursement for those costs.

4.      At the hearing, the Board indicated to the Landlord that the notice of termination fails to identify the rental unit pursuant to the requirements outlined in subsection 43(1)(a) of the Residential Tenancies Act, 2016 (the ‘Act’). As the notice of termination fails to comply with the mandatory requirements of the Act, the notice is invalid.

5.      As the notice was found to be invalid, the Landlord elected to pursue an order for arrears of rent only. An order will issue accordingly.

The L2 Application  

6.      The Landlord says that the Tenant has substantially interfered with another tenant’s reasonable enjoyment of the residential complex and has substantially interfered with a lawful right, privilege or interest of the Landlord.

7.      For the reasons that follow, I am satisfied that the Tenant’s conduct has substantially interfered with both the Landlord and the other tenant living in the residential complex.

8.      The residential complex is a house. The Tenant lives in a bedroom in the basement, another tenant lives in a second room in the basement and the main floor is a 3 bedroom unit that is currently vacant.

9.      The Landlord says that the Tenant constantly exhibits aggressive, argumentative and inappropriate behaviour towards the other tenant, towards the Landlord and towards anyone around him.

10.   In support of his position, the Landlord called the other basement tenant, TG, to provide specific examples. TG provided the Board with multiple audio recordings in which the Tenant can be heard yelling extremely loudly, threatening TG and others around him, swearing, uttering racial slurs, and being belligerent, confrontational and aggressive.

11.   Both the Landlord and TG say they have personally witnessed the Tenant conducting himself in this way on multiple occasions on an almost daily basis both inside the house and outside in the yard and garage.

12.   The Landlord provided the Board with a list of dates and specific examples which TG had sent to him. The list contains 9 specific dates from September 2016 to early March 2017, and descriptions of the Tenant threatening, swearing, yelling, insulting, uttering racial slurs and uttering homophobic slurs towards TG.  Both the Landlord and TG say that this is not an exhaustive list; the Tenant continuously engages in this behaviour and has become impossible to endure.

13.   In addition, the Landlord says that the Tenant’s behaviour has prevented him from renting the main floor of the house. The Landlord says he is concerned that a family might move into the 3 bedroom unit and then be subject to the Tenant’s confrontational, threatening and belligerent behaviour. The Landlord says he is currently suffering a financial loss because he is worried about the negative effect the Tenant’s behaviour might have on any new tenants moving into the house.

14.   Based on the evidence before me, I am satisfied that the Tenant’s conduct is substantially interfering with both the basement tenant’s reasonable enjoyment of his unit as well as the Landlord’s lawful right privilege or interest. As a result, the Landlord’s application must be granted.

15.   I have considered all of the disclosed circumstances in accordance with subsection 83(2) of the Residential Tenancies Act, 2006 (the 'Act'), and find that it would be unfair to grant relief from eviction pursuant to subsection 83(1) of the Act.

 

It is ordered that:

1.      The tenancy between the Landlord and the Tenant is terminated.  The Tenant must move out of the rental unit on or before June 6, 2017.

2.      The Tenant shall pay to the Landlord $1,856.72*, which represents the amount of rent owing and compensation up to May 26, 2017.

3.      The Tenant shall also pay to the Landlord $21.37 per day for compensation for the use of the unit starting May 27, 2017 to the date the Tenant moves out of the unit.

4.      The Tenant shall also pay to the Landlord $190.00 for the cost of filing the application.

5.      If the Tenant does not pay the Landlord the full amount owing* on or before June 6, 2017, the Tenant will start to owe interest.  This will be simple interest calculated from June 7, 2017 at 2.00% annually on the balance outstanding.

6.      If the unit is not vacated on or before June 6, 2017, then starting June 7, 2017, the Landlord may file this order with the Court Enforcement Office (Sheriff) so that the eviction may be enforced.

7.      Upon receipt of this order, the Court Enforcement Office (Sheriff) is directed to give vacant possession of the unit to the Landlord, on or after June 7, 2017.

 

 

May 26, 2017                                                                    _______________________

Date Issued                                                                     Laura Hartslief

                                                                                                                           Member, Landlord and Tenant Board

 

Toronto East-RO

2275 Midland Avenue, Unit 2

Toronto ON M1P3E7

 

If you have any questions about this order, call 416-645-8080 or toll free at 1-888-332-3234.

 

In accordance with section 81 of the Act, the part of this order relating to the eviction expires on December 7, 2017 if the order has not been filed on or before this date with the Court Enforcement Office (Sheriff) that has territorial jurisdiction where the rental unit is located.

 

*           Refer to section A on the attached Summary of Calculations.


Schedule 1

SUMMARY OF CALCULATIONS

 

File Number: TEL-79722-17

 

A.        Amount the Tenant must pay if the tenancy is terminated:

 

Reasons for amount owing

Period

Amount

 

 Arrears: (up to the termination date in the Notice of Termination)

February 1, 2017 to April 17, 2017

$1,023.29

 

Plus compensation: (from the day after the termination date in the Notice to the date of the order)

April 18, 2017 to May 26, 2017

$833.43

 

Amount owing to the Landlord on the order date:(total of previous boxes)

$1,856.72

 

Additional costs the Tenant must pay to the Landlord:

$190.00

 

Plus daily compensation owing for each day of occupation starting May 27, 2017:

$21.37 (per day)

 

Total the Tenant must pay the Landlord if the tenancy is terminated:

$2,046.72, + $21.37 per day starting May 27, 2017


"""

In [6]:
# replace this with importing and iterating over folder of txt files when actually testing the pipeline
# test_row = 0
# test_raw_text = silver_df['raw_file_text_no_markers'][test_row]

data_dict = {'raw_file_text': [copied_from_website]}
data_df = pd.DataFrame(data_dict)
data_df

Unnamed: 0,raw_file_text
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...


In [7]:
full_file, case_metadata, case_content = extract_metadata_t5(raw_case_file_text = copied_from_website, model = model, tokenizer = tokenizer)
print(case_metadata)
print(case_content)
full_file

Date: 2017-05-26 File number: TEL-79722-17 Citation: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB), <https://canlii.ca/t/h539n>, retrieved on 2023-05-30
Order under Section 69 Residential Tenancies Act, 2006 File Number: TEL-79722-17 T.I.L. (the 'Landlord') applied for an order to terminate the tenancy and evict B. (R.) S. (the 'Tenant') because the Tenant did not pay the rent that the Tenant owes (the ‘L1 application’). The Landlord also applied for an order to terminate the tenancy and evict the Tenant because the Tenant has substantially interfered with the reasonable enjoyment or lawful right, privilege or interest of the Landlord or another tenant (the ‘L2 application’). This application was heard in Toronto on May 19, 2017. Only the Landlord and the Landlord’s witness, T.G., attended the hearing. Determinations: The L1 Application 1.      The Tenant has not paid the total rent the Tenant was required to pay for the period from February 1, 2017 to May 31, 2017.  Because of the arr

"Metadata: Date: 2017-05-26 File number: TEL-79722-17 Citation: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB), <https://canlii.ca/t/h539n>, retrieved on 2023-05-30 Content: Order under Section 69 Residential Tenancies Act, 2006 File Number: TEL-79722-17 T.I.L. (the 'Landlord') applied for an order to terminate the tenancy and evict B. (R.) S. (the 'Tenant') because the Tenant did not pay the rent that the Tenant owes (the ‘L1 application’). The Landlord also applied for an order to terminate the tenancy and evict the Tenant because the Tenant has substantially interfered with the reasonable enjoyment or lawful right, privilege or interest of the Landlord or another tenant (the ‘L2 application’). This application was heard in Toronto on May 19, 2017. Only the Landlord and the Landlord’s witness, T.G., attended the hearing. Determinations: The L1 Application 1.      The Tenant has not paid the total rent the Tenant was required to pay for the period from February 1, 2017 to May 31, 2017.

In [8]:
# case_metadata.split(". ")
# case_content#.split("")

In [9]:
for row in data_df.index:
    data_df.loc[row, 'full_file'] = full_file
    data_df.loc[row, 'metadata'] = case_metadata
    data_df.loc[row, 'content'] = case_content

data_df

Unnamed: 0,raw_file_text,full_file,metadata,content
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...


# Step 3: File Number + Citation

In [10]:
import re

def get_case_citation(metadata_list):
    """
    Extracts the case citation from a list of metadata lines.

    This function searches through the metadata lines for a line containing "Citation:" or "Référence:"
    and extracts the citation information from that line.

    Parameters
    ----------
    metadata_list : list of str
        A list of metadata lines.

    Returns
    -------
    str or None
        The extracted case citation, or None if no citation is found.

    Examples
    --------
    >>> metadata = ["Title: Example Case", "Citation: ABC123 (LTB)"]
    >>> get_case_citation(metadata)
    'ABC123 (LTB)'

    >>> metadata = ["Title: Another Case", "Référence: XYZ789 (LTB)"]
    >>> get_case_citation(metadata)
    'XYZ789 (LTB)'
    """
    if isinstance(metadata_list, str):
        metadata_list = metadata_list.split("\n")

    for line in metadata_list:
        for label in ("Citation:", "Référence:"):
            if label in line:
                citation_start = line.find(label) + len(label)
                ltb_ind = line.find("LTB)", citation_start)
                # if "LTB)" is missing, keep the rest of the line rather than slicing at a bogus index
                citation_end = ltb_ind + len("LTB)") if ltb_ind != -1 else len(line)
                return line[citation_start:citation_end].strip()
    return None

def get_file_number(metadata_list):
    """
    Extracts the file number from a list of metadata lines.

    This function concatenates the metadata lines into a single string and extracts the file number
    from that string. The file number is obtained either after "File number:" or "Numéro de dossier:".

    Parameters
    ----------
    metadata_list : list of str
        A list of metadata lines.

    Returns
    -------
    str or None
        The extracted file number, or None if no file number is found.

    Examples
    --------
    >>> metadata = ["File number: TNL-10001-18", "Citation: ABC123 (LTB)"]
    >>> get_file_number(metadata)
    'TNL-10001-18'

    >>> metadata = ["Numéro de dossier: XYZ789", "Référence: DEF456 (LTB)"]
    >>> get_file_number(metadata)
    'XYZ789'
    """
    if isinstance(metadata_list, list):
        metadata_str = " ".join(metadata_list)
    else:
        metadata_str = metadata_list

    if "Citation: " in metadata_str:
        start_label, end_label = "File number: ", "Citation:"
    elif "Référence: " in metadata_str:
        start_label, end_label = "Numéro de dossier: ", "Référence"
    else:
        # neither label is present, so there is nothing to anchor the slice on
        return None

    start_ind = metadata_str.find(start_label)
    if start_ind == -1:
        return None

    file_nums = metadata_str[start_ind + len(start_label) : metadata_str.find(end_label)].strip()
    if len(file_nums) == 0:
        return None

    # split on whitespace/semicolons, drop brackets, and de-duplicate while keeping first-seen order
    file_nums = re.sub(r'[\(\)]', '', file_nums.replace(";", " "))
    file_num = ";".join(dict.fromkeys(file_nums.split()))
    # strip a single trailing punctuation character (e.g. a dangling comma)
    file_num = re.sub(r'[^\w\s]$', '', file_num)

    return file_num
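As a cross-check on the label-based slicing, the file number can also be pulled out by pattern. This is only a sketch: it assumes file numbers look like `TEL-79722-17`, optionally with a two-letter suffix such as `-AM` (both shapes appear in the gold data above), and other formats would need their own pattern. `FILE_NUMBER_PATTERN` and `find_file_numbers` are hypothetical names.

```python
import re

# Assumption: three capital letters, five digits, two digits, optional "-XX" suffix.
FILE_NUMBER_PATTERN = re.compile(r"\b[A-Z]{3}-\d{5}-\d{2}(?:-[A-Z]{2})?\b")

def find_file_numbers(text: str) -> list:
    """Return the distinct file numbers found in `text`, in first-seen order."""
    return list(dict.fromkeys(FILE_NUMBER_PATTERN.findall(text)))

metadata = (
    "Date: 2017-05-26 File number: TEL-79722-17 Citation: TEL-79722-17 (Re), "
    "2017 CanLII 48856 (ON LTB), <https://canlii.ca/t/h539n>, retrieved on 2023-05-30"
)
print(find_file_numbers(metadata))
```

If this disagrees with `get_file_number` for a given decision, that row is probably worth a manual look.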

In [11]:
for row in data_df.index:
    data_df.loc[row, 'citation'] = get_case_citation(data_df.loc[row, 'metadata'])
    data_df.loc[row, 'file_number'] = get_file_number(data_df.loc[row, 'metadata'])

data_df

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17


# Step 4: Detect Language
- not necessary for anything in the pipeline, just a fun extra point of data

In [12]:
# !pip install langdetect
from langdetect import detect, detect_langs

def is_mostly_french(text, threshold):
    # `threshold` is unused here; it is kept so the signature matches is_french
    try:
        return detect(text) == 'fr'
    except Exception:
        # langdetect raises when no language can be detected (e.g. empty text); treat that as not French
        return False

def is_french(text, threshold):
    try:
        # French is the single most likely language
        if detect(text) == 'fr':
            return True
        # otherwise, accept French if its probability still exceeds the threshold
        for lang in detect_langs(text):
            if lang.lang == 'fr' and lang.prob > threshold:
                return True
        return False
    except Exception:
        return False

In [13]:
for row in data_df.itertuples():

    # adding to 'language' column
    data_df.at[row.Index, 'language'] = "French" if is_french(row.raw_file_text, 0.7) else "English"

data_df

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number,language
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English
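A cheap sanity check on the detected language is to look for the French metadata labels that Step 3 already handles. The sketch below assumes French decisions carry `Référence:` or `Numéro de dossier:` in their metadata; `FRENCH_LABELS` and `has_french_labels` are hypothetical names, and this is meant as a cross-check, not a replacement for `langdetect`.

```python
# Labels taken from the Step 3 functions; assumed to appear only in French metadata.
FRENCH_LABELS = ("Référence:", "Numéro de dossier:")

def has_french_labels(metadata: str) -> bool:
    """Return True if any French metadata label appears in the metadata string."""
    return any(label in metadata for label in FRENCH_LABELS)

english_metadata = "Date: 2017-05-26 File number: TEL-79722-17 Citation: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)"
print(has_french_labels(english_metadata))
```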


# Step 5: Year
- also not necessary for anything in the pipeline, just another datapoint for the corpus

In [14]:
year_pattern = r"\b(\d{4})\b"

for row in data_df.itertuples():

    year_match = re.search(year_pattern, data_df.loc[row.Index, "metadata"])
    if year_match:
        year = year_match.group(1)
        data_df.loc[row.Index, "year"] = year
    else:
        data_df.loc[row.Index, "year"] = "year not found"

data_df

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number,language,year
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017
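The loop above takes the first four-digit token in the metadata, which is the year only because the metadata happens to start with the `Date:` field. A slightly stricter variant, sketched below, anchors on that field directly. It assumes the date is written as `YYYY-MM-DD` right after `Date:`, as in the sample; `year_from_date_field` is a hypothetical name.

```python
import re

def year_from_date_field(metadata: str):
    """Return the year from a 'Date: YYYY-MM-DD' field, or None if the field is absent."""
    match = re.search(r"Date:\s*(\d{4})-\d{2}-\d{2}", metadata)
    return match.group(1) if match else None

metadata = "Date: 2017-05-26 File number: TEL-79722-17 Citation: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)"
print(year_from_date_field(metadata))
```

Returning `None` instead of a placeholder string keeps the column easy to filter with `isna()` later.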


# Step 6: LTB Location
- there are a few different methods and things to try, so I run them in succession to make sure to capture SOMETHING
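Running methods in succession amounts to a fallback chain: try each extractor in order and keep the first non-empty result. The sketch below shows the shape of that idea with two simplified, hypothetical extractors (`loc_from_hearing_sentence`, `loc_from_postal_line`); they are stand-ins for illustration, not the rules used in the cell that follows.

```python
import re

# One or more capitalised words, e.g. a city name.
CITY = r"[A-Z][\w.\-]*(?: [A-Z][\w.\-]*)*"

def loc_from_hearing_sentence(text):
    """Hypothetical extractor: city named in 'heard in <City> on <date>'."""
    match = re.search(rf"heard in ({CITY}) on ", text)
    return match.group(1) if match else None

def loc_from_postal_line(text):
    """Hypothetical extractor: capitalised word(s) right before ' ON <postal code>'."""
    match = re.search(rf"({CITY}) ON [A-Z]\d[A-Z] ?\d[A-Z]\d", text)
    return match.group(1) if match else None

def first_location(text, extractors):
    """Return the first non-empty result from the extractors, or None if all fail."""
    for extract in extractors:
        location = extract(text)
        if location:
            return location
    return None

chain = [loc_from_hearing_sentence, loc_from_postal_line]
print(first_location("This application was heard in Toronto on May 19, 2017.", chain))
print(first_location("2275 Midland Avenue, Unit 2 Toronto ON M1P3E7", chain))
```

Keeping the extractors in a list makes it easy to reorder them or append a model-based fallback at the end.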

In [15]:
def find_all_positions(text, keyword):
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

# backup for other methods, using " ON " + postal code as a marker
def find_postal_code_index(text):
    match = re.search(r' ON [A-Z]\d[A-Z] ?\d[A-Z]\d', text)
    if match:
        return match.start()
    else:
        return None

def extract_loc_rule(text_list):

    if isinstance(text_list, list):
        content_str = " ".join(text_list)
    else:
        content_str = text_list
    # print(content_str)

    # "hear" is a good marker for the location sentence ("heard in/on", "hearing", etc)
    if "hear" in content_str:
        hear_inds = find_all_positions(content_str, "hear") # list of indices where "hear" appears in string
        # print(hear_inds)

        for hear_ind in hear_inds:

            hear_substr = content_str[hear_ind - 50 : hear_ind + 50]
            possible_sentences = hear_substr.split(". ")
            # print(possible_sentences)
            
            hear_sent = [sent for i, sent in enumerate(possible_sentences) if "hear" in sent][0]
            # print(hear_sent)

            if len(hear_sent.split(" on ")) != 2:
                # print("TEST")
                # return None
                pass # go to next "hear" location in string
            else:
                
                location_sent, date = hear_sent.split(" on ") # should only split into 2 parts
                # print(location_sent)

                if " in " in location_sent:

                    # print(location_sent)
                    location = location_sent.split(" in ")[1].strip() # text that follows the first " in "

                    # print(location)
                    location = location.split(" ")[-1] # keep only the last token; NOTE: a multi-word city such as "Thunder Bay" is cut down to "Bay" here
                    # print("YESSS")
                    return location.strip()
                
                else:
                    pass

    # use " ON " + postal code as a marker
    postal_ind = find_postal_code_index(content_str)
    if postal_ind is not None: # " ON " can appear without a postal code after it
        content_str_subsection = content_str[max(0, postal_ind - 50) : postal_ind + 50]
        before_on = content_str_subsection.split(" ON ")[0].split()
        if before_on: # the city should be the last word before " ON "
            return before_on[-1].strip()
    
    # if absolutely nothing works, return None and we'll use a transformer or something more nuanced
    return None


# this is a list that was made from all ltb locations identified in the annotated data
all_locations = [
    'Sudbury',
    # 'Review completed without hearing',
    'Thunder Bay',
    'Ottawa',
    'Woodstock',
    'Orangeville',
    'Toronto',
    'Burlington',
    'Waterloo',
    'Stratford',
    # 'Not stated, hearing in Windsor',
    'London',
    'Lindsay',
    'Hamilton',
    'Peterborough',
    'Windsor',
    'Brantford',
    'Cobourg',
    'Kingston',
    'Belleville',
    'Bracebridge',
    'Barrie',
    'Newmarket',
    'Mississauga',
    'Not stated',
    # 'by telephone',
    'Whitby'
    ]

import spacy

# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

def extract_location_spacy(string_list, model = nlp, other_locations = all_locations):

    if isinstance(string_list, list):
        string = " ".join(string_list)
    else:
        string = string_list # already a single string; joining it would put a space between every character

    # uses a spacy model + its vocabulary to extract and return the location if possible
    doc = model(string)
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            return ent.text

    # otherwise looks through the list of all locations in the annotated data and returns the first one that appears in the string -- for example, "Hamilton"
    for tok in string.split():
        if tok in other_locations:
            return tok

    # if all else fails, return None -- use a transformer or something later idk
    return None

# all_locations

In [16]:
for row in data_df.itertuples():

    content = data_df.loc[row.Index, 'content']

    ### rule-based extremely quick -- pretty effective but imperfect
    location = extract_loc_rule(content)

    # Fall back to spaCy when the rule-based result is missing or implausible:
    #   - None or empty: the rules found nothing (checked first so location[0] can't fail)
    #   - non-alphanumeric first character: something like "[CITY]"
    #   - lowercase first character: city names are all capitalized, otherwise it finds "it", "heard", and more as locations
    # I know the capitalization rule isn't great in general but it seems to be consistent/reliable across all cases.
    if not location or not location[0].isalnum() or not location[0].isupper():
        location = extract_location_spacy(content)

    # use the found location (title-cased for consistency), or a placeholder if both methods fail
    data_df.at[row.Index, 'ltb_location'] = location.title() if location else "LOCATION NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number,language,year,ltb_location
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto
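
The idea of running several methods in succession can be sketched as a small fallback chain. This is an illustration only, not the notebook's implementation: `first_plausible`, `heard_in`, and `before_postal_code` are hypothetical helpers, the regexes are simplified, and the sample sentences (including the address) are made up.

```python
import re
from typing import Callable, Iterable, Optional

def first_plausible(text: str,
                    extractors: Iterable[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Return the first extractor result that starts with a capital letter."""
    for extract in extractors:
        candidate = extract(text)
        if candidate and candidate[0].isupper():
            return candidate.title()
    return None

# Hypothetical extractor 1: "heard in <City> on <date>"
def heard_in(text: str) -> Optional[str]:
    match = re.search(r"heard in ([A-Z][\w ]*?) on ", text)
    return match.group(1) if match else None

# Hypothetical extractor 2: the word right before " ON " + postal code
def before_postal_code(text: str) -> Optional[str]:
    match = re.search(r"(\w+) ON [A-Z]\d[A-Z] ?\d[A-Z]\d", text)
    return match.group(1) if match else None

methods = [heard_in, before_postal_code]

# Made-up sample sentences
from_sentence = first_plausible("This application was heard in Thunder Bay on May 1, 2017.", methods)
from_address = first_plausible("123 Example St, Toronto ON M5V 2T6", methods)
nothing_found = first_plausible("No location is mentioned here.", methods)

print(from_sentence, from_address, nothing_found)
```

Keeping the plausibility check in one place means every extractor is held to the same rule, and a `None` result never reaches string methods such as `.title()`.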


# Step 7: Hearing Date
- in progress: the current rule-based, statistical system reaches 74% accuracy
- an ML model might work better, but one would have to be trained for it first
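
For illustration only, one rule of the kind such a system might use is "take the first written-out date that follows `heard ... on` within the same sentence". The sketch below is an assumption, not the in-progress system described above; `guess_hearing_date` is a hypothetical name and the sample sentences are made up.

```python
import re
from datetime import date, datetime
from typing import Optional

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# "heard", then anything except a period (stay in the same sentence),
# then "on <Month> <day>, <year>"
DATE_AFTER_HEARD = re.compile(
    rf"heard\b[^.]*?\bon\s+((?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}})"
)

def guess_hearing_date(text: str) -> Optional[date]:
    """Return the first date that follows 'heard ... on' in one sentence, else None."""
    match = DATE_AFTER_HEARD.search(text)
    if not match:
        return None
    normalized = " ".join(match.group(1).split())  # collapse repeated whitespace
    return datetime.strptime(normalized, "%B %d, %Y").date()

found = guess_hearing_date("The application was heard in Toronto on May 1, 2017.")
missing = guess_hearing_date("The tenant was heard. The order was issued on May 26, 2017.")
print(found, missing)
```

The second sample returns `None` because the date sits in a different sentence from "heard", which is the kind of case where a single rule falls short and other rules or a trained model would be needed.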

# Step 8: Decision Date
- also in progress, but performs better than hearing date extraction

# Step 9: Case URL
- also not needed by any other part of the pipeline, but useful for the corpus

In [17]:
import re

def get_url_from_citation_string(text: str):
    """
    Returns the URL to a case file given the citation line from its metadata.
    The line is expected to contain "Citation:" and the URL must be within angle brackets.
    Raises IndexError if no angle-bracketed text is found.

    Parameters
    ----------
    text : str
        A string of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    pattern = r"<(.*?)>"
    matches = re.findall(pattern, text)
    return matches[0]

def get_url_from_metadata(case_metadata: list):
    """
    Extract URL to case file from a list of strings of metadata from a case file.

    Parameters
    ----------
    case_metadata : list
        A list of strings of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    if isinstance(case_metadata, str):
        case_metadata = case_metadata.split("\n")

    for line in case_metadata:
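        # each label must be tested separately: ("Citation:" or "Référence:") evaluates to just "Citation:"
        if "Citation:" in line or "Référence:" in line: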
            return get_url_from_citation_string(line)
        
    return None

In [18]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'url'] = get_url_from_metadata(data_df.loc[row.Index, 'metadata'])
    except Exception as any_error:
        data_df.at[row.Index, 'url'] = "URL NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number,language,year,ltb_location,url
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto,https://canlii.ca/t/h539n
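
A minimal, self-contained sketch of the URL lookup is shown below. It assumes the citation line carries the URL in angle brackets; `url_from_citation_line` and `CITATION_LABELS` are hypothetical names, and the sample lines are constructed from the citation and URL in the output above rather than copied from a real file.

```python
import re
from typing import Optional

# English and French labels for the citation line
CITATION_LABELS = ("Citation:", "Référence:")

def url_from_citation_line(line: str) -> Optional[str]:
    """Return the angle-bracketed URL from a citation line, or None."""
    # any() checks each label on its own
    if not any(label in line for label in CITATION_LABELS):
        return None
    match = re.search(r"<(.*?)>", line)
    return match.group(1) if match else None

english = url_from_citation_line(
    "Citation: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB), <https://canlii.ca/t/h539n>"
)
french = url_from_citation_line(
    "Référence: TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB), <https://canlii.ca/t/h539n>"
)
other = url_from_citation_line("Date: 2017-05-26")
print(english, french, other)
```

Returning `None` instead of raising lets the caller decide on the placeholder value without a broad `try`/`except`.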


# Step 10: Adjudicating Member

In [19]:
import re

def get_adj_member(list_of_strings: list):

    if isinstance(list_of_strings, str):
        list_of_strings = list_of_strings.split("\n")

    text = " ".join(list_of_strings)
    pattern = r"Date Issued(.*?)Member"
    matches = re.findall(pattern, text, re.DOTALL)
    # extracted_text = [match.strip() for match in matches]
    extracted_text = list(set(match.strip() for match in matches))
    if len(extracted_text) > 0:
        return ", ".join(extracted_text) # usually a single match; if there are several distinct matches they are joined with ", "
    elif "date issued" in text.lower():
        DI_inds = find_all_positions(text.lower(), "date issued")
        for DI_ind in DI_inds:
            DI_substr = text[max(0, DI_ind - 50) : DI_ind + 50].lower() # clamp the start so a negative index doesn't wrap around
            if len(DI_substr.split("date issued")) != 2:
                # should only contain 2 parts; otherwise try the next occurrence
                continue

            DI_sent = DI_substr.split("date issued")[1].strip()
            if ", " in DI_sent: # there should be a comma after the name; keep only the text before it
                DI_sent = DI_sent.split(", ")[0]

            if "member" in DI_sent:
                DI_sent = DI_sent.replace("member", "")

            return DI_sent.title().strip()

    return None # no usable "Date Issued" marker found

In [20]:
for row in data_df.itertuples():

    try:
        data_df.at[row.Index, 'adjudicating_member'] = get_adj_member(data_df.loc[row.Index, 'content']).replace("Vice Chair", "").replace("Vice-Chair", "").strip()
    except Exception as any_error:
        data_df.at[row.Index, 'adjudicating_member'] = "MEMBER NOT FOUND"

data_df.head()

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number,language,year,ltb_location,url,adjudicating_member
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto,https://canlii.ca/t/h539n,Laura Hartslief
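
The core of the member lookup is a non-greedy match between two markers. Here is a small sketch of that idea on its own, assuming the name sits between "Date Issued" and "Member"; `members_between_markers` is a hypothetical helper and the sample text is made up around the name shown in the output above. It de-duplicates while keeping first-seen order, since a `set` does not guarantee order.

```python
import re
from typing import List

def members_between_markers(text: str) -> List[str]:
    """Return distinct names found between 'Date Issued' and 'Member', in order."""
    matches = re.findall(r"Date Issued(.*?)Member", text, re.DOTALL)
    names: List[str] = []
    for raw in matches:
        name = raw.strip()
        if name and name not in names:  # skip empties and repeats
            names.append(name)
    return names

# Made-up signature block, repeated to mimic a decision that prints it twice
block = "May 26, 2017 Date Issued Laura Hartslief Member, Landlord and Tenant Board "
single = members_between_markers(block)
repeated = members_between_markers(block + block)
none_found = members_between_markers("No signature block here.")
print(single, repeated, none_found)
```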


# Step 11: Extract Case Outcome Span
- the extracted outcome span still needs to be classified

In [21]:
import re
import itertools

def find_all_positions(text: str, keyword: str):
    """
    Finds all positions of a keyword in a given text.

    This function searches for a keyword in a given text and returns a list of positions where the keyword is found.

    Parameters
    ----------
    text : str
        The text to search within.
    keyword : str
        The keyword to find in the text.

    Returns
    -------
    list
        A list of integers representing the positions of the keyword in the text.

    Examples
    --------
    >>> find_all_positions("This is an example sentence.", "example")
    [11]
    """
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

def get_outcome_span(text: str, return_truncated: bool = True):
    """
    Extracts the outcome span from a given text using different methods.

    This function extracts the outcome span from a given text using multiple methods. It first looks for the
    span between one of several start keywords ("in accordance with", "grant", "relief", "fair") and the
    nearest following "ordered". If that method fails, it takes the sentences around the phrase
    "it is ordered". If that also fails, it takes the sentences around the word " find ". The function
    returns the extracted outcome span as a cleaned string.

    Parameters
    ----------
    text : str
        The text from which to extract the outcome span.
    return_truncated : bool, default True
        If True, return only the cleaned outcome span. If False, return all case file
        text from the start up to the end of the outcome span.

    Returns
    -------
    str or None
        The extracted outcome span as a cleaned string, or None if no span is found.

    Examples
    --------
    >>> get_outcome_span(unstructured_case_file)
    "In accordance with the order, it is ordered that the defendant pays a fine."
    """

    ############### FIRST METHOD ################

    for keyword in ['in accordance with', 'grant', 'relief', 'fair']: # these all seem common but none seem to exist in 100% of cases

        if keyword in text:

            # find all occurrences of 'in accordance with' and 'ordered'
            accordance_with_indices = [m.end() for m in re.finditer(keyword, text)]
            ordered_indices = [m.start() for m in re.finditer("ordered", text)]

            # generate all possible pairs of indices
            index_pairs = list(itertools.product(accordance_with_indices, ordered_indices))

            # filter pairs where 'accordance with' index is less than 'ordered' index
            index_pairs = [(i, j) for (i, j) in index_pairs if i < j]
            if index_pairs:
                # find the pair with the shortest distance between indices
                min_distance_pair = min(index_pairs, key = lambda x: x[1] - x[0])
                # slicing never raises IndexError, so no try/except is needed;
                # clamp the start at 0 because a negative start would wrap around to the end of the text
                start_idx = max(0, min_distance_pair[0] - 300)
                best_subset = text[start_idx : min_distance_pair[1] + 400].strip()

                best_subset = best_subset.split(". ")

                if not best_subset:
                    continue # to next match of all matches of the keyword

                sent_id = [idx for idx, i in enumerate(best_subset) if keyword in i.lower()][0]

                clean_outcome = best_subset[sent_id]

                # return JUST the (presumably) most relevant outcome span (after cleaning it up a bit)
                if return_truncated:
                    clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                    clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                    if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                        clean_outcome = clean_outcome.split(")", 1)[1].strip() # split once so later ")" don't truncate the span
                    return clean_outcome

                # return all case file text until the end of the outcome span
                else:
                    return text[: text.find(clean_outcome) + len(clean_outcome)]

    ################ SECOND METHOD ################

    keyword = "it is ordered"
    if keyword in text.lower():
        matches = find_all_positions(text.lower(), keyword)

        for match in matches:
            # window of up to 400 chars on each side of the match; slicing past the end is safe,
            # but the start is clamped at 0 because a negative index would wrap around
            window = text[max(0, match - 400) : match + 400]
            clean_outcome = ". ".join(window.split(". ")[1:-1]) # drop the partial first and last sentences

            # return None
            # print("METHOD 2")
            if not clean_outcome:
                continue # to next match of all matches of the keyword

            if return_truncated:
                clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                    clean_outcome = clean_outcome.split(")", 1)[1].strip() # split once so later ")" don't truncate the span
                return clean_outcome

            # return all case file text until the end of the outcome span
            else:
                return text[: text.find(clean_outcome) + len(clean_outcome)]

    ############### THIRD METHOD ################

    keyword = " find " # spaces to prevent "finding" or other derivations from being included -- specifically looking for statements like "I find that..."
    if keyword in text.lower():
        matches = find_all_positions(text.lower(), keyword)
        for match in matches:

            # window of up to 400 chars on each side of the match; slicing past the end is safe,
            # but the start is clamped at 0 because a negative index would wrap around
            window = text[max(0, match - 400) : match + 400]
            clean_outcome = ". ".join(window.split(". ")[1:-1]) # drop the partial first and last sentences

            if not clean_outcome:
                continue # to next match of all matches of the keyword
            
            if return_truncated:
                clean_outcome = re.sub(r'\[\d+\]', '', clean_outcome)
                clean_outcome = re.sub(r'^\d+\.\s*', '', clean_outcome).strip() # removes numbers from the start of the string such as "16. " from start of string

                if ")" in clean_outcome[:10] and "(" not in clean_outcome[:10]:
                    clean_outcome = clean_outcome.split(")", 1)[1].strip() # split once so later ")" don't truncate the span
                return clean_outcome
            else:
                return text[: text.find(clean_outcome) + len(clean_outcome)]

    # if absolutely nothing works, return none and try Longformer or something idk
    return None

In [23]:
for row in data_df.itertuples():

    try:
        content_str = data_df.at[row.Index, 'content']
        data_df.at[row.Index, 'outcome_span'] = get_outcome_span(content_str, return_truncated = True)
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

data_df.head()

Unnamed: 0,raw_file_text,full_file,metadata,content,citation,file_number,language,year,ltb_location,url,adjudicating_member,outcome_span
0,\nDate:\n2017-05-26\nFile number:\nTEL-79722-1...,Metadata: Date: 2017-05-26 File number: TEL-79...,Date: 2017-05-26 File number: TEL-79722-17 Cit...,Order under Section 69 Residential Tenancies A...,"TEL-79722-17 (Re), 2017 CanLII 48856 (ON LTB)",TEL-79722-17,English,2017,Toronto,https://canlii.ca/t/h539n,Laura Hartslief,I have considered all of the disclosed circums...
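
The first method hinges on picking the start keyword and the "ordered" that sit closest together. The sketch below isolates that step; `closest_pair` is a hypothetical helper and the sample sentence is made up. Unlike the cell above it escapes the keywords with `re.escape`, which is an added assumption that they should be matched literally.

```python
import itertools
import re
from typing import Optional, Tuple

def closest_pair(text: str, start_kw: str, end_kw: str) -> Optional[Tuple[int, int]]:
    """Return (end of start_kw, start of end_kw) for the tightest ordered pair."""
    starts = [m.end() for m in re.finditer(re.escape(start_kw), text)]
    ends = [m.start() for m in re.finditer(re.escape(end_kw), text)]
    # keep only pairs where the start keyword comes before the end keyword
    pairs = [(i, j) for i, j in itertools.product(starts, ends) if i < j]
    return min(pairs, key=lambda p: p[1] - p[0]) if pairs else None

sample = ("in accordance with A. Later, in accordance with the Act, "
          "it is ordered that X.")
pair = closest_pair(sample, "in accordance with", "ordered")
between = sample[pair[0]:pair[1]] if pair else None
no_pair = closest_pair("nothing relevant here", "in accordance with", "ordered")
print(repr(between), no_pair)
```

Here the second "in accordance with" wins because it is nearer to "ordered", so the text between the pair is the short clause leading into the order.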
