# Pre-Processing to Match Annotated Data
- Ultimate goal: Match everything as closely as possible (even if it doesn't always make sense to)

## Imports

In [23]:
import os
import pandas as pd
import time
import numpy as np
from collections import deque

# Gold Data (Annotated by Partner)

## Renaming `gold_data` columns

In [24]:
import pandas as pd
# gold_data = pd.read_csv("data/gold_labels_with_files.csv") # with partner annotations
gold_data = pd.read_csv('data/allard_labels_with_text.csv') # allard annotations

# making new names for the columns in gold_data

new_names = {
    'Timestamp': 'timestamp',
    'Email Address': 'email_address',
    'What is the file number of the case?': 'file_number_gold',
    'What was the date of the hearing? [mm/dd/yyyy]': 'hearing_date',
    'What was the date of the decision? [mm/dd/yyyy]': 'decision_date',
    'Who was the member adjudicating the decision?': 'adjudicating_member',
    'What was the location of the landlord tenant board?': 'ltb_location',
    'Did the decision state the landlord was represented?': 'landlord_represented',
    'Did the decision state the landlord attended the hearing?': 'landlord_attended_hearing',
    'Did the decision state the tenant was represented?': 'tenant_represented',
    'Did the decision state the tenant attended the hearing?': 'tenant_attended_hearing',
    'Did the decision state the landlord was a not-for-profit landlord (e.g. Toronto Community Housing)?': 'landlord_nonprofit',
    'Did the decision state the tenant was collecting a subsidy?': 'tenant_collecting_subsidy',
    'What was the outcome of the case?': 'case_outcome',
    'What was the length of the tenancy, or in other words, how long had the tenants lived at the residence in question? ': 'tenancy_length',
    'What was the monthly rent?': 'monthly_rent',
    'What was the amount of the rental deposit? ': 'rental_deposit',
    'If any rent increases occurred, what was the rent after the increase(s)?': 'rent_after_increase',
    'If any rent increases occurred, when did the rent increase(s) come into effect? ': 'rent_increase_effect_date',
    'What was the total amount of arrears?': 'total_arrears',
    'Over how many months did the arrears accumulate? ': 'arrears_duration',
    'If the tenant made a payment on the arrears after the eviction notice was served and/or prior to the hearing, what was the amount of the payment? ': 'arrears_payment_amount',
    'Did the decision mention a history of arrears by the tenant separate from the arrears in the current claim (more than one period of arrears, recurrently coming in and out of arrears, arrears with previous landlord, etc.)?': 'tenant_arrears_history_mentioned',
    'If the tenant had a history of arrears, did the decision mention a history of the tenant making payments on those arrears (separate from any payments made in response to the present eviction notice/hearing)?': 'tenant_arrears_payment_history_mentioned',
    'How frequently were rent payments made late?': 'rent_payments_late_frequency',
    'Did the member find the tenant had or seemed to have the ability to pay rent, but chose not do so?': 'tenant_ability_to_pay_rent',
    'What were the specific mental, medical, or physical conditions of the tenant, if any? ': 'tenant_conditions',
    'Did the decision state that the tenant had children living with them?': 'tenant_children_present',
    'How many total children did the tenant have living with them? ': 'total_children',
    'How many total children aged 17 or younger did the tenant have living with them?': 'children_17_or_younger',
    'How many total children aged 13 or younger did the tenant have living with them? ': 'children_13_or_younger',
    'How many total children aged 4 or younger did the tenant have living with them?': 'children_4_or_younger',
    'Did the decision state any of the children had mental, medical or physical conditions?': 'children_conditions_mentioned',
    'If yes to the previous question, did the decision state these conditions would make moving particularly burdensome?': 'conditions_making_moving_burdensome',
    'Was the tenant employed at the time of the hearing?': 'tenant_employed',
    'If the tenant was not employed, did the decision state the tenant was receiving any form of government assistance (e.g. OW, childcare benefits, ODSP, OSAP)?': 'tenant_government_assistance',
    'If the tenant was employed, did the decision state any doubts about the stability of employment e.g. lack of guaranteed hours, contract work, etc.?': 'employment_stability_doubts',
    'Did the member find the tenant had sufficient income to pay rent?': 'sufficient_income_to_pay_rent',
    'What was the total income of the tenant’s household? ': 'total_household_income',
    'Did the decision mention the tenant lost their job leading up to or during the period of the hearing?': 'tenant_job_loss_mentioned',
    'Did the decision mention any other extenuating circumstances experienced by the tenant leading up to or during the period of the claim (e.g. hospitalization, death in the family, etc.)?': 'tenant_extenuating_circumstances',
    'Did the tenant propose a payment plan?': 'tenant_proposed_payment_plan',
    'If the tenant did propose a payment plan, did the member accept the proposed payment plan?': 'accepted_proposed_payment_plan',
    'If a payment plan was ordered, what was the length of the payment plan? ': 'payment_plan_length',
    'Did the decision mention the tenant’s difficulty finding alternative housing for any reason e.g.physical limitations, reliance on social assistance, etc.?': 'tenant_difficulty_finding_housing',
    'If yes to the previous question, which of the following were applicable to the tenant?': 'applicable_difficulty_reasons',
    'Did the decision state the tenant was given prior notice for the eviction?': 'tenant_prior_notice_given',
    'If the tenant was given prior notice for the eviction, how much notice was given?': 'prior_notice_duration',
    'Did the decisions state postponement would result in the tenant accruing additional arrears?': 'postponement_additional_arrears',
    'Which other specific applications of the landlord or the tenant were mentioned?': 'mentioned_applications',
    'Did the decision mention the validity of an N4 eviction notice?': 'validity_of_N4_notice_mentioned',
    'Were there detail(s) in the decision not captured by this questionnaire that should be included?': 'additional_details_in_decision',
    'Exec Review': 'executive_review',
    'Review Status': 'review_status'
}

gold_data = gold_data.rename(columns = new_names)
# sorting by file_number -- so that ordering of the new data annotations can match this and be more easily compared
gold_data = gold_data.sort_values(by = ['case number'], ascending = True).reset_index(drop = True)
gold_data.columns

Index(['case number', 'adjudicating_member', 'ltb_location',
       'landlord_represented', 'landlord_attended_hearing',
       'tenant_represented', 'tenant_attended_hearing', 'landlord_nonprofit',
       'tenant_collecting_subsidy', 'case_outcome',
       'What was the length of the tenancy', 'monthly_rent',
       'rental deposit amount', 'was there an rent increases', 'total_arrears',
       'Over how many months did the arrears accumulate?',
       'Does the tenant made a payment on the arrears after the eviction notice',
       'tenant_arrears_history_mentioned', 'tenant_ability_to_pay_rent',
       'What were the specific mental, medical, or physical conditions of the tenant, if any?',
       'tenant_children_present', 'tenant_employed',
       'tenant_government_assistance', 'employment_stability_doubts',
       'sufficient_income_to_pay_rent', 'tenant_job_loss_mentioned',
       'tenant_extenuating_circumstances', 'tenant_difficulty_finding_housing',
       'tenant_prior_notic

In [25]:
from dateutil.parser import parse

def convert_to_datetime(date_str):
    # Parse date using dateutil.parser.parse
    dt = parse(date_str)
    
    # Format date with strftime in the format 'MM/DD/YYYY'
    return dt.strftime('%m/%d/%Y')

gold_data['hearing_date'] = gold_data['hearing_date'].apply(lambda x: convert_to_datetime(x))
gold_data['decision_date'] = gold_data['decision_date'].apply(lambda x: convert_to_datetime(x))
gold_data['hearing_date']

KeyError: 'hearing_date'

# Silver Data
- Only 678 of 702 case files match

In [26]:
# # formatted_cases_path = "/Users/kmaurinjones/Desktop/School/UBC/UBC_Coursework/capstone/Allard_A_Capstone/scraping/45k_formatted_cases/"
# folder_path = "./raw_case_files/"

# # def create_master_dictionary(directory):
# master_dict = {}
# master_dict['raw_file_name'] = []
# master_dict['raw_file_text'] = []

# # Iterate over .txt files in the folder
# for file_name in os.listdir(folder_path):
#     file_path = os.path.join(folder_path, file_name)
    
#     # Check if the file is a .txt file
#     if os.path.isfile(file_path) and file_name.endswith('.txt'):
        
#         # Read the contents of the .txt file
#         with open(file_path, 'r') as file:
#             contents = file.read()
            
#             # Append the contents to the list in the master_dict
#             master_dict['raw_file_name'].append(file_name)
#             master_dict['raw_file_text'].append(contents)
# # master_dict

In [27]:
# for key, value in master_dict.items():
#     print(key, len(value))

# silver_data = pd.DataFrame.from_dict(master_dict)
# print(silver_data.shape)
# silver_data.head()

In [28]:
# matched_files = []
# print(len(silver_data))
# for silver_fn in silver_data.raw_file_name.unique().tolist():
#     if silver_fn[:-4] in gold_data.file_number.unique().tolist():
#         # print(silver_fn)
#         matched_files.append(silver_fn)

# silver_data = silver_data[silver_data.raw_file_name.isin(matched_files)]
# print(len(silver_data))
# silver_data = silver_data.sort_values(by = ['raw_file_name'], ascending = True).reset_index(drop = True)
# silver_data

## Creating `silver_data` df from `gold_data` raw text

In [29]:
silver_data = gold_data.copy()
silver_data = silver_data.drop(columns = [col for col in silver_data.columns if col not in ['raw_file_name', 'raw_file_text']])
silver_data

Unnamed: 0,raw_file_text,raw_file_name
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt
...,...,...
667,Metadata:\nDate:\t2018-12-13\nFile number:\t\n...,TSL-98918-18-RV.txt
668,Metadata:\nDate:\t2018-11-23\nFile number:\t\n...,TSL-99691-18.txt
669,Metadata:\nDate:\t2018-11-29\nFile number:\t\n...,TSL-99824-18.txt
670,Metadata:\nDate:\t2018-12-12\nFile number:\t\n...,TSL-99900-18.txt


In [30]:
gold_data = gold_data.rename(columns = {'case number': 'file_number',
                                        'board_location': 'ltb_location'})
gold_data.columns.tolist()

['file_number',
 'adjudicating_member',
 'ltb_location',
 'landlord_represented',
 'landlord_attended_hearing',
 'tenant_represented',
 'tenant_attended_hearing',
 'landlord_nonprofit',
 'tenant_collecting_subsidy',
 'case_outcome',
 'What was the length of the tenancy',
 'monthly_rent',
 'rental deposit amount',
 'was there an rent increases',
 'total_arrears',
 'Over how many months did the arrears accumulate?',
 'Does the tenant made a payment on the arrears after the eviction notice',
 'tenant_arrears_history_mentioned',
 'tenant_ability_to_pay_rent',
 'What were the specific mental, medical, or physical conditions of the tenant, if any?',
 'tenant_children_present',
 'tenant_employed',
 'tenant_government_assistance',
 'employment_stability_doubts',
 'sufficient_income_to_pay_rent',
 'tenant_job_loss_mentioned',
 'tenant_extenuating_circumstances',
 'tenant_difficulty_finding_housing',
 'tenant_prior_notice_given',
 'postponement_additional_arrears',
 'validity_of_N4_notice_mentio

# ALL CASE FILES

In [31]:
# all_cases_master = pd.read_csv("large_files/44k_cases_pproc_filenums.csv")
# all_cases_master

In [32]:
# all_cases_master['file_numbers_from_file_name'] = all_cases_master.raw_file_name.apply(lambda x: x.split(".txt")[0])
# all_cases_master

# `get_nulls()`
- for checking df after each addition to it

In [33]:
def get_nulls(df, col, return_index = False):
    # returns a list of the indices of null values in a column of a dataframe
    null_rows = silver_data[silver_data[col].isnull()] # df of all rows with null ltb_location
    nulls_inds = null_rows.index.tolist()

    if return_index:
        return nulls_inds
    else:
        return null_rows
    
get_nulls(silver_data, 'raw_file_text', return_index = False)

Unnamed: 0,raw_file_text,raw_file_name


# General Cleaning
- Removing unnecessary lines (blank)
- Removing unnecessary characters (extra whitespaces and underscores)
- Separating metadata and content

In [34]:
import re

def general_cleaning(raw_file_str: str):
    # gets rid of tabs, non-breaking spaces, leading/trailing whitespace, removes empty lines, and "\xa0"
    generally_cleaned_list = [line.replace("\t", " ").replace("\xa0", "").strip() for line in raw_file_str.split('\n') if line.strip() != '']
    return generally_cleaned_list

def remove_whitespace_and_underscores(string):
    # Remove consecutive whitespace
    string = re.sub(r'\s+', ' ', string)

    # Remove more than three consecutive underscores
    string = re.sub(r'_+', '', string)

    return string.strip()

def separate_file_sections(text_list):
    metadata_list = []
    content_list = []

    is_metadata = True
    is_content = False

    for line in text_list:
        if line.strip() == 'Metadata:':
            is_metadata = True
            is_content = False
        elif line.strip() == 'Content:':
            is_metadata = False
            is_content = True
        elif is_metadata:
            metadata_list.append(remove_whitespace_and_underscores(line))
        elif is_content:
            content_list.append(remove_whitespace_and_underscores(line))

    return metadata_list, content_list

def merge_numerical_entries(strings_list):
    """
    Turns something like
        [..., '3.',
        'The tenant took occupancy of the rental unit in or about the beginning of December 2016.', ...]
    into
        [..., '3. The tenant took occupancy of the rental unit in or about the beginning of December 2016.', ...]
    
    """
    for i in range(len(strings_list) - 2, -1, -1):
        if re.fullmatch(r'\d+\.', strings_list[i]):
            strings_list[i] += ' ' + strings_list[i + 1]
            del strings_list[i + 1]
    return strings_list

def move_trailing_numbers(strings_list):
    """
    Turns something like
        [..., 'Credibility of the Parties 4.',
        'The Landlord said about two to three months ago he ...', ...]
    into
        [..., 'Credibility of the Parties',
        '4. The Landlord said about two to three months ago he...', ...]
    
    """
    for i in range(len(strings_list) - 1, -1, -1):
        match = re.search(r'\s+(\d{2}\.)$', strings_list[i])
        if match:
            number = match.group(1)
            strings_list[i] = re.sub(r'\s+\d{1,2}\.$', '', strings_list[i])
            strings_list[i + 1] = number + ' ' + strings_list[i + 1]
    return strings_list

import re

def remove_end_tag_and_restructure(metadata_list: list):

    cleaned_str = " ".join(metadata_list)

    # this doesn't add any meaning to the case details we need to extract, and instead just adds noise to the extraction process + adds extra unnecessary tokens
    if cleaned_str.find("If you have any questions about this order") > (len(cleaned_str) - 500):
        cleaned_str = cleaned_str[: cleaned_str.find("If you have any questions about this order")].strip() # ending tag removed
    
    # otherwise just do everything else
    cleaned_str = cleaned_str.replace(". ", ".\n")
    # cleaned_str = cleaned_str.replace(". ", ".\n") # deprecated by regex approach
    # trimmed_list = [line.strip() for line in re.split(r'(?<!\d)\. ', cleaned_str) if line.strip() != ''] # deprecated by regex approach
    cleaned_str = re.sub(r'(?<!\d)\. ', "\n", cleaned_str)
    trimmed_list = [line.strip() for line in cleaned_str.split('\n') if line.strip() != '']
    trimmed_list = merge_numerical_entries(trimmed_list)
    trimmed_list = move_trailing_numbers(trimmed_list)
    return trimmed_list

file_name = "CEL-74519-18.txt"
# row of this particular case
case_file_ind = silver_data.loc[silver_data['raw_file_name'] == file_name].index.tolist()[0]
test_text = silver_data.loc[206, "raw_file_text"]#.item()

metadata, content = separate_file_sections(general_cleaning(test_text))
remove_end_tag_and_restructure(content)

['Order under Section 69 Residential Tenancies Act, 2006 File Number: TEL-05033-19 M.',
 'M.',
 "(the 'Landlord') applied for an order to terminate the tenancy and evict T.",
 "O'G.",
 "(the 'Tenant') because the Tenant did not pay the rent that the Tenant owes.",
 'This application was heard in Peterborough on October 24, 2019.',
 'The Landlord and the Tenant attended the hearing.',
 'The Landlord was represented by G.',
 'G..',
 'The Tenant spoke to Tenant Duty Counsel prior the hearing.',
 'Determinations: 1.',
 'By way of background, this tenancy began in January 2015; the monthly rent is $1,300.00 and is due on the first of each month.',
 '2. The Tenant has not paid the total rent the Tenant was required to pay for the period from January 1, 2015 to October 31, 2019.',
 'Because of the arrears, the Landlord served a Notice of Termination effective September 22, 2019.',
 '3. Since the application was filed, the Tenant made no payments towards the arrears.',
 '4. The arrears and cos

## Updating CSV with Cleaned File, Metadata, and Case Contents
- Adding column for cleaned full file
- Adding column for metadata
- Adding column for case contents

In [35]:
silver_data

Unnamed: 0,raw_file_text,raw_file_name
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt
...,...,...
667,Metadata:\nDate:\t2018-12-13\nFile number:\t\n...,TSL-98918-18-RV.txt
668,Metadata:\nDate:\t2018-11-23\nFile number:\t\n...,TSL-99691-18.txt
669,Metadata:\nDate:\t2018-11-29\nFile number:\t\n...,TSL-99824-18.txt
670,Metadata:\nDate:\t2018-12-12\nFile number:\t\n...,TSL-99900-18.txt


In [36]:
import time
import numpy as np
from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 100 iteration times
time_deque = deque(maxlen = 500)

cases_contents = []
cases_metadata = []
full_cleaned = []

raw_files = silver_data['raw_file_text'].tolist()
for index, raw_file in enumerate(raw_files):
    iteration_start_time = time.time()
    better_file = general_cleaning(raw_file)
    try:
        metadata_list, content_list = separate_file_sections(better_file)
        full_cleaned.append(better_file)
        # cases_metadata.append(remove_end_tag_and_restructure(metadata_list)) # removing a bit more text if possible
        cases_metadata.append(metadata_list) # removing a bit more text if possible
        cases_contents.append(remove_end_tag_and_restructure(content_list))

        # Save the end time of this iteration and push it into the deque
        iteration_end_time = time.time()
        time_deque.append(iteration_end_time - iteration_start_time)

        # progress tracker
        average_time_per_file = np.mean(time_deque)
        files_left = len(raw_files) - (index + 1)
        estimated_time_left = files_left * average_time_per_file

        print(f"Files processed: {index + 1} of {len(raw_files)}, Estimated time remaining: {time.strftime('%H:%M:%S', time.gmtime(estimated_time_left))}", end='\r')
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", index)

silver_data['full_cleaned'] = full_cleaned
silver_data['metadata'] = cases_metadata
silver_data['content'] = cases_contents
silver_data.head()

Files processed: 672 of 672, Estimated time remaining: 00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...


# Case Citation and File Number
- Extracting case citation from metadata
- Extracting file number from all data (it's not as consistently formatted so the approach has to be more broad)

In [37]:
import re

def get_case_citation(metadata_list):
    for line in metadata_list:
        if "Citation:" in line:
            citation_start = line.find("Citation: ")
            # print(citation_start)
            citation_end = line.find("LTB)") + 4
            # print(citation_end)
            return line[citation_start : citation_end].replace("Citation: ", "").strip()
        elif "Référence: " in line:
            citation_start = line.find("Référence: ")
            # print(citation_start)
            citation_end = line.find("LTB)") + 4
            # print(citation_end)
            return line[citation_start : citation_end].replace("Référence: ", "").strip()
    return None

def get_file_number(metadata_list):
    # metadata_str = "".join(get_case_citation(metadata_list))
    metadata_str = " ".join(metadata_list)

    if "Citation: " in metadata_str:
        file_nums = metadata_str[metadata_str.find("File number: ") + len("File number: ") : metadata_str.find("Citation:")].strip()
    elif "Référence: " in metadata_str:
        file_nums = metadata_str[metadata_str.find("Numéro de dossier: ") + len("Numéro de dossier: ") : metadata_str.find("Référence")].strip()

    if len(file_nums) == 0:
        return None
    
    file_nums = file_nums.replace(";", " ")
    
    # eliminates duplicates or instances of something like 'TNL-10004-18 TNL-10004-18' being one list item
    file_num = list(set(file_nums.split()))
    file_num = ";".join(file_num)
    file_num = re.sub(r'[^\w\s]$', '', file_num)

    # removing duplicate file numbers
    if ";" in file_num:
        file_num = list(set(file_num.split(";"))) 
        file_num = [re.sub(r'[\(\)]', '', num) for num in file_num] # removing any trailing punctuation from each file number
        file_num = ";".join(file_num)

    # removes any parentheses from the file number
    file_num = re.sub(r'[\(\)]', '', file_num)

    return file_num

# test_row = 149
# test_metadata = silver_data.loc[test_row, "metadata"]# + silver_data.loc[test_row, "content"]
# print(get_file_number(test_metadata))
# silver_data.loc[test_row, "metadata"]

In [38]:
import time
import numpy as np
from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 500 iteration times
time_deque = deque(maxlen = 500)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    # adding to 'case_citation' and 'file_number' columns
    try:
        metadata_list, content_list = separate_file_sections(general_cleaning(silver_data.loc[row.Index, "raw_file_text"]))
        silver_data.at[row.Index, 'case_citation'] = get_case_citation(metadata_list)
        silver_data.at[row.Index, 'file_number'] = get_file_number(metadata_list)
        
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print(f"Files processed: {index + 1} of {len(silver_data)}, Estimated time remaining: {time.strftime('%H:%M:%S', time.gmtime(estimated_time_left))}", end='\r')

silver_data.head()

Files processed: 672 of 672, Estimated time remaining: 00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16


In [39]:
from evaluation import *
# wrongs = evaluate(silver_data, gold_data, "file_number", return_inaccurate = True, metric = "jaro_winkler")
# extracted_successfully = 0
# for row in wrongs.index:
#     if str(wrongs.loc[row, wrongs.columns[1]]) in str(wrongs.loc[row, wrongs.columns[0]]):
#         extracted_successfully += 1
        
# print(f"\nExtracted {extracted_successfully} out of {len(silver_data)} also file numbers correctly. Total accuracy could be {round((extracted_successfully / len(silver_data)), 2) + 0.916}%")
evaluate(silver_data, gold_data, "file_number", return_inaccurate = True, metric = "jaro_winkler")

Unnamed: 0,silver_file_number,gold_file_number
33,CEL-71370-17-RV,CEL-71370-17
48,CET-73038-18;CEL-75759-18;CEL-73640-18,CEL-73640-18
50,CEL-76175-18-SA;CEL-73963-18-RV,CEL-73963-18-RV
51,CEL-73985-18;CET-72638-18,CEL-73985-18
60,CEL-76171-18;CEL-75701-18;CET-75675-18,CEL-76171-18
61,CEL-76435-18-RV,CEL-76435-18
72,CEL-77933-18;CET-77737-18,CEL-77933-18
73,CEL-78470-18.;CEL-78005-18;CEL-78005-18-RV;CEL...,CEL-78005-18-RV
83,CEL-80413-18-IN,CEL-80413-18
92,CET-65284-17;CEL-65840-17;CEL-65426-17,CET-65284-17


In [40]:
from evaluation import *
wrongs = evaluate(silver_data, gold_data, "file_number", return_inaccurate = True, metric = "jaro_winkler")
extracted_successfully = 0
for row in wrongs.index:
    if str(wrongs.loc[row, wrongs.columns[1]]) in str(wrongs.loc[row, wrongs.columns[0]]):
        extracted_successfully += 1
        
print(f"\nExtracted {extracted_successfully} out of {len(silver_data)} also citations correctly. Total accuracy could be {round((extracted_successfully / len(silver_data)), 2) + 0.916}%")


Extracted 45 out of 672 also citations correctly. Total accuracy could be 0.986%


In [43]:
test_row = 67
test_metadata = silver_data.loc[test_row, "metadata"]# + silver_data.loc[test_row, "content"]
print(get_file_number(test_metadata))
silver_data.loc[test_row, "metadata"]

CEL-77387-18


['Date: 2018-07-30',
 'File number:',
 'CEL-77387-18',
 'CEL-77387-18',
 'Citation: CEL-77387-18 (Re), 2018 CanLII 88554 (ON LTB), <https://canlii.ca/t/hv7l3>, retrieved on 2023-05-16 https://canlii.ca/t/hv7l3']

In [44]:
def check_df(df):
    """
    Checks df for null values, number of unique values in each column, and data types
    """

    print(f"Df Size: {df.shape[0]} rows, {df.shape[1]} columns")
    print("-" * 30)

    print("\n" + "Checking for null values...")
    print("-" * 25)
    print(df.isnull().sum())

    print("\n" + "Checking data types...")
    print("-" * 25)
    for col in df.columns:
        col_type = str(type(df.loc[0, col])).replace("<class '", "").replace("'>", "")
        print(f"{col}: {col_type}")

check_df(silver_data)

Df Size: 672 rows, 7 columns
------------------------------

Checking for null values...
-------------------------
raw_file_text    0
raw_file_name    0
full_cleaned     0
metadata         0
content          0
case_citation    0
file_number      0
dtype: int64

Checking data types...
-------------------------
raw_file_text: str
raw_file_name: str
full_cleaned: list
metadata: list
content: list
case_citation: str
file_number: str


In [45]:
## can't eval because it wasn't actually annotated in the gold data
# evaluate(silver_data, gold_data, "case_citation", return_inaccurate = True, metric = "jaro_winkler")

# Language Detection

In [46]:
# !pip install langdetect
from langdetect import detect

def is_mostly_french(text, threshold):
    try:
        detected_language = detect(text)
        if detected_language == 'fr':
            return True
        else:
            return False
    except:
        return False

def is_french(text, threshold):
    try:
        detected_language = detect(text)
        if detected_language == 'fr':
            return True
        language_probabilities = detect_langs(text)
        for lang in language_probabilities:
            if lang.lang == 'fr' and lang.prob > threshold:
                return True
        return False
    except:
        return False

is_french(silver_data.loc[109, "raw_file_text"], 0.7)

False

## Updating CSV with Language

In [47]:
# import time
# import numpy as np
# from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 500 iteration times
time_deque = deque(maxlen = 500)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    # adding to 'case_citation' and 'file_number' columns
    try:
        # adding to 'language' column
        if is_french(silver_data.loc[row.Index, "raw_file_text"], 0.7) == True:
            silver_data.at[row.Index, 'language'] = "French"
        else:
            silver_data.at[row.Index, 'language'] = "English"

    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print(f"Files processed: {index + 1} of {len(silver_data)}, Estimated time remaining: {time.strftime('%H:%M:%S', time.gmtime(estimated_time_left))}", end='\r')

silver_data.head()

Files processed: 672 of 672, Estimated time remaining: 00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English


# Year
- not meaningful really but nice for the corpus

In [48]:
import re

def get_year_from_file_number(file_number):

    if isinstance(file_number, list):
        file_number = " ".join(file_number)

    if ";" not in file_number:
        # print("UES")
        file_number = [tok for tok in file_number.split("-") if (len(tok) == 2 and tok.isdigit())]
        return "20" + file_number[0]
    else:
        file_numbers = file_number.split(";")
        file_numbers = [tok for tok in file_numbers if (len(tok) == 2 and tok.isdigit())]
        return "20" + file_number[0]
    # elif file_number.isinstance(list):

print(get_year_from_file_number("TEL-81359-17-AM")) # Outputs: "17"
print(get_year_from_file_number("TEL-81405-17")) # Outputs: "17"
print(get_year_from_file_number(silver_data.loc[5, 'file_number'])) # Outputs: "17"
silver_data.loc[5, 'file_number']

2017
2017
2017


'CEL-63559-17'

## Updating CSV with Year

In [49]:
# import time
# import numpy as np
# from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 500 iteration times
time_deque = deque(maxlen = 500)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    # adding to 'case_citation' and 'file_number' columns
    try:
        silver_data.at[row.Index, 'year'] = get_year_from_file_number(silver_data.at[row.Index, 'file_number'])
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print(f"Files processed: {index + 1} of {len(silver_data)}, Estimated time remaining: {time.strftime('%H:%M:%S', time.gmtime(estimated_time_left))}", end='\r')
    
    # if index == 6:
    #     break

silver_data.head()

Files processed: 672 of 672, Estimated time remaining: 00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016


In [50]:
check_df(silver_data)

Df Size: 672 rows, 9 columns
------------------------------

Checking for null values...
-------------------------
raw_file_text    0
raw_file_name    0
full_cleaned     0
metadata         0
content          0
case_citation    0
file_number      0
language         0
year             0
dtype: int64

Checking data types...
-------------------------
raw_file_text: str
raw_file_name: str
full_cleaned: list
metadata: list
content: list
case_citation: str
file_number: str
language: str
year: str


# LTB Location
- not 100% effective but close to. Fallback on other methods (transformers, etc) if need be

## Method 1 - Rule-Based

In [51]:
def find_all_positions(text, keyword):
    positions = []
    start = 0
    while True:
        index = text.find(keyword, start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    return positions

In [52]:
# backup for other methods, using " ON " + postal code as a marker
def find_postal_code_index(text):
    match = re.search(r' ON [A-Z]\d[A-Z] ?\d[A-Z]\d', text)
    if match:
        return match.start()
    else:
        return None

def extract_loc_rule(text_list):

    content_str = " ".join(text_list)
    # print(content_str)

    # "hear" is a good marker for the location sentence ("heard in/on", "hearing", etc)
    if "hear" in content_str:
        hear_inds = find_all_positions(content_str, "hear") # list of indices where "hear" appears in string
        # print(hear_inds)

        for hear_ind in hear_inds:

            hear_substr = content_str[hear_ind - 50 : hear_ind + 50]
            possible_sentences = hear_substr.split(". ")
            # print(possible_sentences)
            
            hear_sent = [sent for i, sent in enumerate(possible_sentences) if "hear" in sent][0]
            # print(hear_sent)

            if len(hear_sent.split(" on ")) != 2:
                # print("TEST")
                # return None
                pass # go to next "hear" location in string
            else:
                
                location_sent, date = hear_sent.split(" on ") # should only split into 2 parts
                # print(location_sent)

                if " in " in location_sent:

                    # print(location_sent)
                    location = location_sent.split(" in ")[1].strip() # location name should be last tokens of string after token containing " in " (city name could be multiple tokens, so need to get all tokens after " in " token, not just last one)

                    # print(location)
                    location = location.split(" ")[-1] # location name should be last tokens of string after token containing "hear" (city name could be multiple tokens, so need to get all tokens after "hear" token, not just last one)
                    # print("YESSS")
                    return location.strip()
                
                else:
                    pass

    # use " ON " + postal code as a marker
    if " ON " in content_str:
        # print("|UESSS")
        # print(find_postal_code_index(content_str))
        postal_ind = find_postal_code_index(content_str)
        content_str_subsection = (content_str[postal_ind - 50 : postal_ind + 50])
        location = content_str_subsection.split(" ON ")[0].split()[-1]
        return location.strip()
    
    # if absolutely nothing works, return None and we'll use a transformer or something more nuanced
    return None

test_num = 10
extract_loc_rule(silver_data.loc[test_num, 'content'])
# " ".join(silver_data.loc[56, 'content'])
# silver_data.loc[test_num, 'content']

'Mississauga'

## Method 2 - SpaCy

In [53]:
gold_data.columns

Index(['file_number', 'adjudicating_member', 'ltb_location',
       'landlord_represented', 'landlord_attended_hearing',
       'tenant_represented', 'tenant_attended_hearing', 'landlord_nonprofit',
       'tenant_collecting_subsidy', 'case_outcome',
       'What was the length of the tenancy', 'monthly_rent',
       'rental deposit amount', 'was there an rent increases', 'total_arrears',
       'Over how many months did the arrears accumulate?',
       'Does the tenant made a payment on the arrears after the eviction notice',
       'tenant_arrears_history_mentioned', 'tenant_ability_to_pay_rent',
       'What were the specific mental, medical, or physical conditions of the tenant, if any?',
       'tenant_children_present', 'tenant_employed',
       'tenant_government_assistance', 'employment_stability_doubts',
       'sufficient_income_to_pay_rent', 'tenant_job_loss_mentioned',
       'tenant_extenuating_circumstances', 'tenant_difficulty_finding_housing',
       'tenant_prior_notic

In [54]:
all_locations = list(set(gold_data['ltb_location'].unique().tolist())) # list of all unique locations in the annotated data
all_locations = list(set([loc.strip() for loc in all_locations]))
all_locations

import spacy

# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

def extract_location_spacy(string_list, model = nlp, other_locations = all_locations):

    string = " ".join(string_list)

    # uses a spacy model + its vocabulary to extract and return the location if possible
    doc = model(string)
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            return ent.text

    # otherwise looks through the list of all locations in the annotated data and returns the first one that appears in the string -- for example, "Hamilton"
    for tok in string.split():
        if tok in other_locations:
            return tok

    # if all else fails, return None -- use a transformer or something later idk
    return None

all_locations

2023-05-30 22:05:09.519931: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


['Belleville',
 'Bracebridge',
 'Toronto',
 'London',
 'Mississauga',
 'by telephone',
 'Whitby',
 'Kingston',
 'Barrie',
 'Stratford',
 'Peterborough',
 'Hamilton',
 'Brantford',
 'Cobourg',
 'Ottawa',
 'Windsor',
 'Sudbury',
 'Review completed without hearing',
 'Waterloo',
 'Not stated',
 'Lindsay',
 'Orangeville',
 'Burlington',
 'Newmarket',
 'Woodstock',
 'Thunder Bay',
 'Not stated, hearing in Windsor']

In [55]:
start_time = time.time()

# Initialize a deque to store the latest 100 iteration times
time_deque = deque(maxlen = 500)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    try:
        ### rule-based extremely quick -- pretty effective but imperfect
        location = extract_loc_rule(silver_data.loc[row.Index, 'content'])#.title() # returns the string in title case

        if not location[0].isalnum(): # something like "[CITY]" -- use spacy method
            location = extract_location_spacy(silver_data.loc[row.Index, 'content'])

        if not location[0].isupper():
            # I know this isn't a great rule in general but this seems to be consistent/reliable across all cases.
            # City names are all capitalized. Otherwise it finds "it", "heard", and more as locations
            location = extract_location_spacy(silver_data.loc[row.Index, 'content'])

        if location == None: # rule-based returns None
            location = extract_location_spacy(silver_data.loc[row.Index, 'content'])

        # use the found location
        silver_data.at[row.Index, 'ltb_location'] = location.title() # Title casing for consistency

    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print("Files processed: ", index + 1, "of", len(silver_data),
          "Estimated time remaining: ", time.strftime('%H:%M:%S', time.gmtime(estimated_time_left)), end='\r')

silver_data.head()

'NoneType' object is not subscriptable with file at Df row:  459
Files processed:  672 of 672 Estimated time remaining:  00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga


In [56]:
silver_data.ltb_location.value_counts()

Toronto         365
Mississauga      78
Whitby           63
Newmarket        33
Belleville       16
Peterborough     15
Barrie           12
London           11
Ottawa           11
Lindsay          10
Windsor           8
Burlington        5
Hamilton          5
Duty Counsel      5
Waterloo          5
Cobourg           3
Stratford         3
Kingston          2
Sudbury           2
Woodstock         1
Dg                1
Elgin             1
Dd                1
D.C.              1
Nc                1
N.F.              1
Kw                1
Ac                1
Lg                1
Gl                1
Sarnia            1
Goderich          1
Tc                1
Brantford         1
Bay               1
Bracebridge       1
Orangeville       1
J.H.S.            1
Name: ltb_location, dtype: int64

In [57]:
get_nulls(silver_data, 'ltb_location') # this one is problematic for other columns too. The formatting seems to be completely different

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location
459,Metadata:\nDate:\t2018-12-17\nFile number:\t\n...,TNL-05489-18.txt,"[Metadata:, Date: 2018-12-17, File number:, 65...","[Date: 2018-12-17, File number:, 659/18;, TNL-...","[CITATION: Capreit 2 Limited Partnership v., R...",Cit,659/18;TNL-05489-18,English,206,


In [58]:
evaluate(silver_data, gold_data, "ltb_location", return_inaccurate = True, metric = "jaro_winkler").head(60)

Unnamed: 0,silver_ltb_location,gold_ltb_location
5,Barrie,Mississauga
6,Elgin,Mississauga
34,Barrie,Mississauga
81,Toronto,Mississauga
85,Barrie,Mississauga
88,Toronto,Mississauga
106,Ottawa,by telephone
110,Ottawa,by telephone
117,Barrie,Mississauga
123,Hamilton,Toronto


In [59]:
silver_data.loc[454, 'content']

["Order under Section 69 Residential Tenancies Act, 2006 File Number: TNL-04964-18 RV (the 'Landlord') applied for an order to terminate the tenancy and evict GG (the 'Tenant') because the Tenant did not pay the rent that the Tenant owes.",
 'This application was heard in Toronto on June 5, 2018.',
 'The Landlord and the Tenant attended the hearing.',
 'Determinations and Reasons: 1.',
 'The Tenant has not paid the total rent she was required to pay for the period from February 1, 2018 to June 30, 2018.',
 'Because of the arrears, the Landlord served a Notice of Termination effective May 11, 2018.',
 '2. The Tenant testified that she is living in the rental unit with four young grandchildren, aged between seven and fourteen years old.',
 'She was supporting them on short-term disability payments and child tax credits until the payments stopped in April, 2018.',
 'She has been working to replace her short term disability with ODSP payments.',
 'She has not paid the rent since February, 

# Decision Date
- uses entire case file (metadata + content)
- Eg.: "This application was heard in Toronto on October 16, 2019." should return "October 16, 2019" (then date is converted into the same format as the other dates in the df)
- Rule-based may be too simple for this so may need to use ML

In [60]:
from dateutil.parser import parse

def convert_to_datetime(date_str):
    # Parse date using dateutil.parser.parse
    dt = parse(date_str)
    
    # Format date with strftime in the format 'MM/DD/YYYY'
    return dt.strftime('%m/%d/%Y')

In [61]:
import spacy

nlp = spacy.load("en_core_web_sm") # loading this outside of the function saves ~2s per function call

def find_dates_in_list(string_list, model=nlp):
    extracted_dates = []
    for string in string_list:
        doc = nlp(string)
        for entity in doc.ents:
            if entity.label_ == "DATE":
                extracted_dates.append(entity.text)

    pattern = r"(?i)(\b\w+ \d{1,2}, \d{4}\b)"
    valid_dates = re.findall(pattern, ", ".join(extracted_dates))
    return list(set(valid_dates))

string_list = [
    "This application was heard in Toronto on October 16, 2019.",
    "The case was heard on June 12, 2020.",
    "The hearing was conducted on January 5, 2021.",
    "The meeting was adjourned, and a new date was set for November 30, 2022.",
    "No hearing was held in this matter.",
    "The matter was discussed on September 14, 2021, and a decision was made."
]
result = find_dates_in_list(string_list)
print(result)

['November 30, 2022', 'January 5, 2021', 'June 12, 2020', 'September 14, 2021', 'October 16, 2019']


In [62]:
def get_hearing_date(string_list, prox = 20):
    case_tokens = " ".join(string_list).split() # case file merged into one string
    match_found = False # assume there's no match found
    first_prox = prox

    while not match_found:
        possible_strings_with_dates = [] # we'll look for dates in these strings
        # print(prox)
        for tok in case_tokens:
            if "issue" in tok.lower():
                hear_ind = case_tokens.index(tok)
                near_toks = case_tokens[hear_ind - (prox // 2) : hear_ind + prox]
                near_text = " ".join(near_toks)
                possible_strings_with_dates.append(near_text) # finds all sections within a "prox" range of the word "hear" and adds them to a list
        
        # print(possible_strings_with_dates)

        best_matches = list(set(find_dates_in_list(possible_strings_with_dates))) # finds all dates in the possible strings and returns a list of unique dates
        # print(best_matches)
        if len(best_matches) == 1 or prox <= 0:
            match_found = True
            break
        else:
            prox -= (first_prox // 4) # decrement by 25% of the first_prox each time
    
    if len(best_matches) > 0:
        return convert_to_datetime(best_matches[0])
    
    # if none of the above works, look for "Date: " and get that date lol (it's also the decision date -- I checked the annotated data)
    for line in string_list:
        if "Date: " in line:
            return convert_to_datetime(line.split("Date: ")[1]) # convert to the same format as other dates in the df

case_num = 156
full_case_list = silver_data.loc[case_num, 'metadata'] + silver_data.loc[case_num, 'content']
full_case_list[:15]
print(get_hearing_date(silver_data.loc[case_num, 'metadata'] + silver_data.loc[case_num, 'content'], prox = 20))
(silver_data.loc[case_num, 'metadata'] + silver_data.loc[case_num, 'content'])[:15]

04/10/2018


['Date: 2018-04-10',
 'File number:',
 'SWL-11849-17',
 'SWL-11849-17',
 'Citation: SWL-11849-17 (Re), 2018 CanLII 88671 (ON LTB), <https://canlii.ca/t/hv7p8>, retrieved on 2023-05-16 https://canlii.ca/t/hv7p8',
 "Order under Section 69 Residential Tenancies Act, 2006 File Number: SWL-11849-17 JP (the 'Landlord') applied for an order to terminate the tenancy and evict RS (the 'Tenant') because the Landlord intends to demolish the rental unit.",
 'This application was heard in Windsor on March 29, 2018.',
 'MH, the Landlord’s Legal Representative, attended the hearing.',
 'The Tenant attended the hearing and declined the opportunity to speak with Duty Counsel prior to the hearing as he had previously been represented by counsel.',
 'Determinations: 1.',
 'The Landlord requires the rental unit to be vacated in order to demolish it.',
 'I am satisfied that the Landlord has taken all reasonable steps to obtain the necessary permits for this work.',
 "2. As set out in the attached reasons, 

## Updating CSV with Decision Date

In [63]:
# about 13 per second

start_time = time.time()

# Initialize a deque to store the latest 100 iteration times
time_deque = deque(maxlen = 500)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    try:
        full_case_list = silver_data.loc[row.Index, 'metadata'] + silver_data.loc[row.Index, 'content']
        silver_data.at[row.Index, 'decision_date'] = get_hearing_date(full_case_list, prox = 20)

    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print("Files processed: ", index + 1, "of", len(silver_data),
          "Estimated time remaining: ", time.strftime('%H:%M:%S', time.gmtime(estimated_time_left)), end = '\r')

silver_data.head()

Files processed:  672 of 672 Estimated time remaining:  00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga,01/18/2017
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga,01/09/2017
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga,01/09/2017
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga,01/09/2017
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga,01/10/2017


In [64]:
get_nulls(silver_data, 'decision_date')

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date


In [66]:
evaluate(silver_data, gold_data, "decision_date", return_inaccurate = True, metric = "jaro_winkler")#.head(60)

KeyError: 'decision_date'

In [None]:
wrongs = evaluate(silver_data, gold_data, "decision_date", return_inaccurate = True, metric = "accuracy").index#.head(60)
# gold_data.loc[wrongs, ["hearing_date", "decision_date"]].head(20)
evaluate(silver_data, gold_data, "decision_date", return_inaccurate = True, metric = "accuracy")#.head(60)

In [67]:
import spacy

nlp = spacy.load("en_core_web_sm") # loading this outside of the function saves ~2s per function call

def find_dates_in_string(string, model=nlp):
    extracted_dates = []
    # for string in string_list:
    doc = nlp(string)
    for entity in doc.ents:
        if entity.label_ == "DATE":
            extracted_dates.append(entity.text)
    
    return list(set(extracted_dates))
    # pattern = r"(?i)(\b\w+ \d{1,2}, \d{4}\b)"
    # valid_dates = re.findall(pattern, ", ".join(extracted_dates))
    # return valid_dates

from datetime import datetime

def order_dates(dates_list):
    sorted_dates = sorted(dates_list, key=lambda x: datetime.strptime(x, '%m/%d/%Y'))
    return sorted_dates

In [68]:
# def get_hearing_date()
test_case_list = silver_data.loc[wrongs[0], 'content']
test_case_str = " ".join(test_case_list)
hdate_candidates = [line for line in test_case_str.lower().split(". ") if ("application" in line.lower() and "hear" in line.lower())]

order_dates([convert_to_datetime(date) for date in find_dates_in_list(hdate_candidates)])

KeyError: 0

In [None]:
silver_data.loc[wrongs[0], 'metadata']

# Hearing Date

In [69]:
import re
from datetime import datetime

def clean_and_convert_date(date_str):
    """
    Gets rid of artifacts found within strings containing dates and converts them to a consistent format.

    """
    
    # Use regex to find date in the format 'Month DD, YYYY'
    match = re.search(r'([a-zA-Z]+ \d{1,2}, \d{4})', date_str)
    if match:
        date_str = match.group(1)

        # Parse date with strptime in the format 'Month DD, YYYY'
        dt = datetime.strptime(date_str, '%B %d, %Y')

        # Format date with strftime in the format 'MM/DD/YYYY'
        return dt.strftime('%m/%d/%Y')

    return None

# Test the function
date_str = "September 3, 2019 f"
new_date_str = clean_and_convert_date(date_str)
print(new_date_str)

09/03/2019


In [70]:
def extract_date_rule(text_list):

    content_str = " ".join(text_list)
    # print(content_str)

    if "date issued" in content_str.lower():
        DI_inds = find_all_positions(content_str.lower(), "date issued")
        for DI_ind in DI_inds:
            DI_substr = content_str[DI_ind - 50 : DI_ind + 50].lower()
            # print(DI_ind)
            # print(DI_substr)

            if len(DI_substr.split("date issued")) != 2:
                # should only contain 2
                pass

            # DI_sent = DI_substr.split("date issued")#.strip()
            # print(DI_substr)

            # regex pattern to find any date within the DI_sent substring
            date_pattern = r"(?i)(january|february|march|april|may|june|july|august|september|october|november|december) [0-9]{1,2}, [0-9]{4}"

            match = re.search(date_pattern, DI_substr)
            if match:
                # print(match)
                return match.group(0)
            
    elif "hear" in content_str:

        hear_inds = find_all_positions(content_str, "hear") # list of indices where "hear" appears in string
        # print(hear_inds)

        for hear_ind in hear_inds:

            hear_substr = content_str[hear_ind - 50 : hear_ind + 50]
            possible_sentences = hear_substr.split(". ")
            # print(possible_sentences)
            
            hear_sent = [sent for i, sent in enumerate(possible_sentences) if "hear" in sent][0]
            # print(hear_sent)

            if len(hear_sent.split(" on ")) != 2:
                # return None
                pass # go to next "hear" location in string
            else:
                location_sent, date = hear_sent.split(" on ") # should only split into 2 parts
                # print(location_sent)

                return date.strip()

    # if all else fails, return None
    return None

print(extract_date_rule(silver_data.loc[2, 'content']))
# print(clean_and_convert_date(extract_date_rule(silver_data.loc[56, 'content']))) # converts it into desired format
# silver_data.loc[56, 'content']

january 9, 2017


In [71]:
def extract_date_2(case_metadata: list):
    for line in case_metadata:
        if "Date: " in line:
            break
    
    # return the date in the format they gave us the data in - this might make things more useful for them idk
    return line.replace("Date: ", "")
    # return line.replace("Date: ", "")

print(extract_date_rule(silver_data.loc[0, 'content']))
print(extract_date_2(silver_data.loc[0, 'metadata']))
convert_to_datetime(extract_date_rule(silver_data.loc[0, 'content']))
convert_to_datetime(extract_date_2(silver_data.loc[0, 'metadata']))

january 30, 2017
2017-01-18


'01/18/2017'

## Updating CSV with Hearing Date

In [72]:
import time
import numpy as np
from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 100 iteration times
time_deque = deque(maxlen=100)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    try:
        # rule-based method #1 -- uses 'content'
        found_date = extract_date_rule(silver_data.loc[row.Index, 'content'])

        if found_date == None: # rule-based method #2 -- uses 'metadata'
            found_date = extract_date_2(silver_data.loc[row.Index, 'metadata'])

        # normalize the format
        # formatted_date = clean_and_convert_date(found_date)
        formatted_date = convert_to_datetime(found_date)

        # if still None, try one more time to convert it
        silver_data.at[row.Index, 'hearing_date'] = formatted_date
            
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print("Files processed: ", index + 1, "of", len(silver_data),
          "Estimated time remaining: ", time.strftime('%H:%M:%S', time.gmtime(estimated_time_left)), end='\r')

silver_data.head()

String does not contain a date: t with file at Df row:  45900:00
Files processed:  672 of 672 Estimated time remaining:  00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga,01/18/2017,01/30/2017
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga,01/09/2017,01/09/2017
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga,01/09/2017,01/09/2017
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga,01/09/2017,01/09/2017
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga,01/10/2017,02/03/2017


In [73]:
check_df(silver_data)

Df Size: 672 rows, 12 columns
------------------------------

Checking for null values...
-------------------------
raw_file_text    0
raw_file_name    0
full_cleaned     0
metadata         0
content          0
case_citation    0
file_number      0
language         0
year             0
ltb_location     1
decision_date    0
hearing_date     1
dtype: int64

Checking data types...
-------------------------
raw_file_text: str
raw_file_name: str
full_cleaned: list
metadata: list
content: list
case_citation: str
file_number: str
language: str
year: str
ltb_location: str
decision_date: str
hearing_date: str


In [74]:
get_nulls(silver_data, 'hearing_date')

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date
459,Metadata:\nDate:\t2018-12-17\nFile number:\t\n...,TNL-05489-18.txt,"[Metadata:, Date: 2018-12-17, File number:, 65...","[Date: 2018-12-17, File number:, 659/18;, TNL-...","[CITATION: Capreit 2 Limited Partnership v., R...",Cit,659/18;TNL-05489-18,English,206,,07/13/2018,


In [75]:
evaluate(silver_data, gold_data, "hearing_date", return_inaccurate = True, metric = "jaro_winkler")#.head(60)

KeyError: 'hearing_date'

In [None]:
print(convert_to_datetime(extract_date_2(silver_data.loc[454, 'metadata'])))
silver_data.loc[454, 'metadata']

# Case CanLII URL
- In the metadata, there's a hyperlink that seems to be a shortened URL to the case
- Eg.:  *'Citation: NOL-10723-12 (Re), 2013 CanLII 5182 (ON LTB), <https://canlii.ca/t/fw1m8>, retrieved on 2023-05-17'*

In [None]:
silver_data.loc[0, 'metadata']

In [76]:
import re

def get_url_from_citation_string(text: str):
    """
    Returns URL to case file given a list of strings of metadata from a case file.
    String must begin with "Citation: " and URL must be within angle brackets.

    Parameters
    ----------
    text : str
        A string of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    pattern = r"<(.*?)>"
    matches = re.findall(pattern, text)
    return matches[0]

def get_url_from_metadata(case_metadata: list):
    """
    Extract URL to case file from a list of strings of metadata from a case file.

    Parameters
    ----------
    case_metadata : list
        A list of strings of metadata from a case file.

    Returns
    -------
    str
        A string of the URL to the case file.
    """

    for line in case_metadata:
        if ("Citation:" or "Référence:") in line:
            return get_url_from_citation_string(line)
        
    return None
        
get_url_from_metadata(silver_data.loc[1, 'metadata'])

'https://canlii.ca/t/gxq6r'

## Updating CSV with URLs

In [77]:
import time
import numpy as np
from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 100 iteration times
time_deque = deque(maxlen = 100)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    try:
        silver_data.at[row.Index, 'url'] = get_url_from_metadata(silver_data.loc[row.Index, 'metadata'])
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print("Files processed: ", index + 1, "of", len(silver_data),
          "Estimated time remaining: ", time.strftime('%H:%M:%S', time.gmtime(estimated_time_left)), end='\r')

silver_data.head()

Files processed:  672 of 672 Estimated time remaining:  00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date,url
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga,01/18/2017,01/30/2017,https://canlii.ca/t/gxq6n
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6r
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6s
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6t
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga,01/10/2017,02/03/2017,https://canlii.ca/t/h3w7b


In [78]:
check_df(silver_data)

Df Size: 672 rows, 13 columns
------------------------------

Checking for null values...
-------------------------
raw_file_text    0
raw_file_name    0
full_cleaned     0
metadata         0
content          0
case_citation    0
file_number      0
language         0
year             0
ltb_location     1
decision_date    0
hearing_date     1
url              0
dtype: int64

Checking data types...
-------------------------
raw_file_text: str
raw_file_name: str
full_cleaned: list
metadata: list
content: list
case_citation: str
file_number: str
language: str
year: str
ltb_location: str
decision_date: str
hearing_date: str
url: str


In [79]:
silver_data.loc[2, 'metadata']
silver_data.loc[150, 'content'][-15:]

['6. If the unit is not vacated on or before February 28, 2018, then starting March 1, 2018, the Landlords may file this order with the Court Enforcement Office (Sheriff) so that the eviction may be enforced.',
 '7. Upon receipt of this order, the Court Enforcement Office (Sheriff) is directed to give vacant possession of the unit to the Landlords, on or after March 1, 2018.',
 '8. If, on or before February 28, 2018, the Tenant pays the amount of $4,719.00** to the Landlords or to the Board in trust, this order for eviction will be void.',
 'This means that the tenancy would not be terminated and the Tenant could remain in the unit.',
 'If this payment is not made in full and on time, the Landlords may file this order with the Court Enforcement Office (Sheriff) so that the eviction may be enforced.',
 '9. The Tenant may make a motion to the Board under subsection 74(11) of the Act to set aside this order if they pay the amount required under that subsection on or after March 1, 2018 bu

# Adjudicating Member
- Typically in format like:
    ```[...,
    'Date',
    'Issued Gerald',
    'Taylor',
    'Member,'
    ...]```
- This seems to be formatted pretty consistently but not sure if it may present in different ways

In [80]:
silver_data.loc[2, 'metadata']
silver_data.loc[130, 'content'][-15:]

['8. Upon receipt of this order, the Court Enforcement Office (Sheriff) is directed to give vacant possession of the unit to the Landlord, on or after September 8, 2018.',
 '9. If the Tenant wishes to void this order and continue the tenancy, the Tenant must pay to the Landlord or to the Board in trust: i) $7,675.00 if the payment is made on or before August 31, 2018, or ii) $10,175.00 if the payment is made on or before September 7, 2018**.',
 'If the Tenant does not make full payment in accordance with this paragraph and by the appropriate deadline, then the Landlord may file this order with the Court Enforcement Office (Sheriff) so that the eviction may be enforced.',
 '10. The Tenant may make a motion to the Board under subsection 74(11) of the Act to set aside this order if they pay the amount required under that subsection on or after September 8, 2018 but before the Sheriff gives vacant possession to the Landlord.',
 'The Tenant is only entitled to make this motion once during t

In [81]:
import re

def get_adj_member(list_of_strings: list):
    text = " ".join(list_of_strings)
    pattern = r"Date Issued(.*?)Member"
    matches = re.findall(pattern, text, re.DOTALL)
    # extracted_text = [match.strip() for match in matches]
    extracted_text = list(set(match.strip() for match in matches))
    if len(extracted_text) > 0:
        return ", ".join(extracted_text) # returns a list of matches and sometimes there's more than one match so we just take the first one -- there are never two
    elif "date issued" in text.lower():
        DI_inds = find_all_positions(text.lower(), "date issued")
        for DI_ind in DI_inds:
            DI_substr = text[DI_ind - 50 : DI_ind + 50].lower()
            if len(DI_substr.split("date issued")) != 2:
                # should only contain 2
                pass

            DI_sent = DI_substr.split("date issued")[1].strip()
            if ", " in DI_sent: # there should be a comma be just in case, it doesn't hurt to have this (and this to try the "hear" method after iterating over all of these if none work)
                DI_sent = DI_sent.split(", ")[0]

            if "member" in DI_sent:
                DI_sent = DI_sent.replace("member", "")

            return DI_sent.title().strip()

test = silver_data.loc[453, 'content']#[-15:]

get_adj_member(test)

'Nancy Morris'

### Updating CSV with Adjudicating Member

In [82]:
# import time
# import numpy as np
# from collections import deque

start_time = time.time()

# Initialize a deque to store the latest 100 iteration times
time_deque = deque(maxlen = 500)

for index, row in enumerate(silver_data.itertuples()):

    # Save the start time of this iteration
    iteration_start_time = time.time()

    try:
        silver_data.at[row.Index, 'adjudicating_member'] = get_adj_member(silver_data.loc[row.Index, 'content']).replace("Vice Chair", "").replace("Vice-Chair", "").strip()
    except Exception as any_error:
        print(f"{any_error} with file at Df row: ", row.Index)

    # Save the end time of this iteration and push it into the deque
    iteration_end_time = time.time()
    time_deque.append(iteration_end_time - iteration_start_time)

    # progress tracker
    average_time_per_row = np.mean(time_deque)
    rows_left = len(silver_data) - (index + 1)
    estimated_time_left = rows_left * average_time_per_row

    print("Files processed: ", index + 1, "of", len(silver_data),
          "Estimated time remaining: ", time.strftime('%H:%M:%S', time.gmtime(estimated_time_left)), end = '\r')

silver_data.head()

'NoneType' object has no attribute 'replace' with file at Df row:  459
'NoneType' object has no attribute 'replace' with file at Df row:  649
'NoneType' object has no attribute 'replace' with file at Df row:  657
Files processed:  672 of 672 Estimated time remaining:  00:00:00

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date,url,adjudicating_member
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga,01/18/2017,01/30/2017,https://canlii.ca/t/gxq6n,Avril Cardoso
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6r,Tiisetso Russell
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6s,Tiisetso Russell
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6t,Tiisetso Russell
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga,01/10/2017,02/03/2017,https://canlii.ca/t/h3w7b,Karen Wallace


In [83]:
check_df(silver_data)

Df Size: 672 rows, 14 columns
------------------------------

Checking for null values...
-------------------------
raw_file_text          0
raw_file_name          0
full_cleaned           0
metadata               0
content                0
case_citation          0
file_number            0
language               0
year                   0
ltb_location           1
decision_date          0
hearing_date           1
url                    0
adjudicating_member    3
dtype: int64

Checking data types...
-------------------------
raw_file_text: str
raw_file_name: str
full_cleaned: list
metadata: list
content: list
case_citation: str
file_number: str
language: str
year: str
ltb_location: str
decision_date: str
hearing_date: str
url: str
adjudicating_member: str


In [84]:
nulls_adj = get_nulls(silver_data, 'adjudicating_member', return_index = False)
nulls_inds = nulls_adj.index.tolist()
nulls_adj

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date,url,adjudicating_member
459,Metadata:\nDate:\t2018-12-17\nFile number:\t\n...,TNL-05489-18.txt,"[Metadata:, Date: 2018-12-17, File number:, 65...","[Date: 2018-12-17, File number:, 659/18;, TNL-...","[CITATION: Capreit 2 Limited Partnership v., R...",Cit,659/18;TNL-05489-18,English,206,,07/13/2018,,https://canlii.ca/t/hwpvq,
649,Metadata:\nDate:\t2018-11-06\nFile number:\t\n...,TSL-97498-18-RV2-AM.txt,"[Metadata:, Date: 2018-11-06, File number:, TS...","[Date: 2018-11-06, File number:, TSL-97498-18-...",[Amended Order Order under Sections 21.1 and 2...,"TSL-97498-18-RV2-AM (Re), 2018 CanLII 141664 (...",TSL-97498-18-RV2-AM,English,2018,Toronto,12/06/2018,11/01/2018,https://canlii.ca/t/j0fjl,
657,Metadata:\nDate:\t2018-10-09\nFile number:\t\n...,TSL-98384-18-AM.txt,"[Metadata:, Date: 2018-10-09, File number:, TS...","[Date: 2018-10-09, File number:, TSL-98384-18,...",[Amended Order Order under Section 69 Resident...,"TSL-98384-18-AM (Re), 2018 CanLII 120875 (ON LTB)",TSL-98384-18,English,2018,Toronto,10/12/2018,09/24/2018,https://canlii.ca/t/hwmcl,


In [85]:
evaluate(silver_data, gold_data, "adjudicating_member", return_inaccurate = True, metric = "jaro_winkler").head(60)

Unnamed: 0,silver_adjudicating_member,gold_adjudicating_member
0,Avril Cardoso,avril cardoso
1,Tiisetso Russell,tiisetso russell
2,Tiisetso Russell,tiisetso russell
3,Tiisetso Russell,tiisetso russell
4,Karen Wallace,karen wallace
5,Avril Cardoso,avril cardoso
6,Avril Cardoso,avril cardoso
7,Karen Wallace,karen wallace
8,Avril Cardoso,avril cardoso
9,Karen Wallace,karen wallace


In [86]:
silver_data.file_number.value_counts()

TEL-95255-18-AM    2
HOL-03461-18-RV    2
TEL-91060-18-RV    2
TNL-04964-18       1
TNL-04214-18       1
                  ..
TEL-61064-15-RV    1
TEL-75185-16       1
TEL-75630-16       1
TEL-75637-16       1
TSL-99965-18       1
Name: file_number, Length: 669, dtype: int64

In [87]:
silver_data

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date,url,adjudicating_member
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga,01/18/2017,01/30/2017,https://canlii.ca/t/gxq6n,Avril Cardoso
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6r,Tiisetso Russell
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6s,Tiisetso Russell
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6t,Tiisetso Russell
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga,01/10/2017,02/03/2017,https://canlii.ca/t/h3w7b,Karen Wallace
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,Metadata:\nDate:\t2018-12-13\nFile number:\t\n...,TSL-98918-18-RV.txt,"[Metadata:, Date: 2018-12-13, File number:, TS...","[Date: 2018-12-13, File number:, TSL-98918-18-...",[Order under Section 21.2 of the Statutory Pow...,"TSL-98918-18-RV (Re), 2018 CanLII 141679 (ON LTB)",TSL-98918-18-RV,English,2018,Toronto,11/08/2018,12/13/2018,https://canlii.ca/t/j0fjv,Nancy Henderson
668,Metadata:\nDate:\t2018-11-23\nFile number:\t\n...,TSL-99691-18.txt,"[Metadata:, Date: 2018-11-23, File number:, TS...","[Date: 2018-11-23, File number:, TSL-99691-18,...",[Order under Section 69 Residential Tenancies ...,"TSL-99691-18 (Re), 2018 CanLII 141675 (ON LTB)",TSL-99691-18,English,2018,Toronto,11/23/2018,11/23/2018,https://canlii.ca/t/j0fk1,David Lee
669,Metadata:\nDate:\t2018-11-29\nFile number:\t\n...,TSL-99824-18.txt,"[Metadata:, Date: 2018-11-29, File number:, TS...","[Date: 2018-11-29, File number:, TSL-99824-18,...",[Order under Section 69 Residential Tenancies ...,"TSL-99824-18 (Re), 2018 CanLII 141673 (ON LTB)",TSL-99824-18,English,2018,Toronto,11/29/2018,12/11/2018,https://canlii.ca/t/j0fk2,Renée Lang
670,Metadata:\nDate:\t2018-12-12\nFile number:\t\n...,TSL-99900-18.txt,"[Metadata:, Date: 2018-12-12, File number:, TS...","[Date: 2018-12-12, File number:, TSL-99900-18,...",[Order under Section 69 Residential Tenancies ...,"TSL-99900-18 (Re), 2018 CanLII 140403 (ON LTB)",TSL-99900-18,English,2018,Toronto,12/12/2018,12/12/2018,https://canlii.ca/t/hzzb6,David Mungovan


In [88]:
# silver_data.to_csv(f"pproc_{len(silver_data)}_files_1.csv", index = False)

# Rule-Based Case OutCome Extraction

In [89]:
# gold_data.columns # looking for case outcome column name
len(gold_data['case_outcome'].unique()) # only 47 unique outcomes across 679 cases - that means it can probably be summarized effectively
gold_data['case_outcome'].unique()

array(['No relief', 'Relief', 'Conditional Order'], dtype=object)

In [90]:
reclassified_outcomes = {
    'no_relief': [
        'No relief',
        'No relief, but arrears calculated based on the lawful monthly rent ($1400), despite the landlord claiming it to be $1500',
        'Tenants already vacated rental unit. Tenants pay full amount owed.',
        'Ordered payment of arrears, no eviction',
        'Conditional order to preserve tenancy',
        'Conditional relief',
        'Payment of damages (not a rent dispute)',
        'Request to review is denied',
        'No relief; rent abatement due to maintenance repairs',
        'Application dismissed',
        'Relief not to evict. Tenant has paid all arrears and there was a material error on the landlord\'s part.',
        'Notice of termination was found to be invalid so Landlord requested consent of Board to withdraw application of non-payment of rent which was granted.',
        'Application Dismissed',
        'No eviction granted, payment plan merely involved paying rent on time',
        'Tenant ordered to pay adjusted arrears cost',
        'Order to pay rent on time',
        'Tenant must pay arrears with EI lump-sum check or else he will be evicted',
        'landlord was not seeking eviction but seeking to terminate tenancy because of persistent late payment; relief granted (not terminating tenancy) if tenant can pay in time and in full for the next 11 months',
        'No relief, because tenant can pay off the arrears with the help of social assistance',
        'No relief because tenants could pay the balance',
        'No relief because tenant can pay the balance'
    ],
    'no_payment': [
        'Payment plan',
        'Eviction refused; Tenant pays arrears only',
        'Payment and eviction',
        'Eviction order set aside',
        'Pay on time order'
    ],
    'other': [
        'Postponement of eviction',
        'Order under review reversed in part to correct serious error',
        'Extended termination date',
        'Other: Standard eviction order but if arrears are paid, tenant can stay as long as all future payments for 1 year are paid on time (subject to s. 78)',
        'Postponement of eviction but conditioned on the fact that all rent is paid on time for one year period',
        'Probation',
        'Relief from eviction subject to conditions',
        'Conditional order',
        'Conditional Order',
        'Landlord\'s Application for Eviction was dismissed',
        'Abatement of Rent',
        'Voiding of order',
        'Tenant forced out, landlord wants space. All rent was paid.',
        'Tenant shall pay rent on time from February 2020 - January 2021. Additionally, tenant must pay application cost + $20 NSF charges incurred by the Landlord',
        'Full relief',
        'Interim order'
    ]
}

for new_class, old_classes in reclassified_outcomes.items():
    for old_class in old_classes:
        gold_data.loc[gold_data['case_outcome'] == old_class, 'new_case_outcome'] = new_class

gold_data['new_case_outcome'].value_counts()

no_relief    430
other          7
Name: new_case_outcome, dtype: int64

In [91]:
def find_outcome_context(text: str, proximity: int):
    full_text = text
    positions = []
    start = 0
    while True:
        index = text.find("ordered", start)
        if index == -1:
            break
        positions.append(index)
        start = index + 1
    # return positions

    if len(positions) == 0:
        return None
    elif len(positions) == 1:
        surrounding_text = text[positions[0] - (proximity // 4) : positions[0] + proximity]#.lower()
        surrounding_sents = surrounding_text.split(". ")
        # print(surrounding_sents)
        ordered_ind = [i for i, sent in enumerate(surrounding_sents) if "ordered" in sent][0]
        # print(surrounding_sents[ordered_ind:])
        # print(surrounding_sents)
        relevant_text = "".join(full_text[full_text.find(surrounding_sents[ordered_ind]):])
        return_text_end = relevant_text.lower().find("date issued")
        print(return_text_end)

        return " ".join(relevant_text.split(". "))
        return relevant_text
    else: # if there is more than one instance of the keyword, return the one that is closest to "find"
        print("POOOPOOOO")
        # closest = min(positions, key = lambda x: abs(x - text.find("find")))
        # surrounding_text = text[closest - (proximity // 4) : closest + proximity]#.lower()
        # return surrounding_text

row = 5
test_str = " ".join(silver_data.loc[row, "content"])
print(find_outcome_context(test_str, 1000))

POOOPOOOO
None


In [92]:
for row in silver_data.index:
    # row = 5
    # print(gold_data.loc[row, "case_outcome"])
    test_str = "\n".join(silver_data.loc[row, "content"])
    keyword = "consider"

    # print(find_outcome_context(test_str, 500))
    # keyword = "ordered"
    finds_found = find_all_positions(test_str.lower(), keyword)
    proximity = 500
    if len(finds_found) == 0:
        # print(finds_found)
        # print(f"{row}: No 'ordered'")
        if len(find_all_positions(test_str.lower(), "find")) == 0:
            print(f"{row}: No '{keyword}'")
        # for found_position in finds_found:
        #     surrounding_text = test_str[found_position - (proximity // 2) : found_position + proximity].lower()
        #     # if "it is ordered that" in surrounding_text:
        #         # print (surrounding_text.strip())

# silver_data.loc[46, "content"]

219: No 'consider'
255: No 'consider'
340: No 'consider'
344: No 'consider'
388: No 'consider'
389: No 'consider'
422: No 'consider'
448: No 'consider'
467: No 'consider'
489: No 'consider'
608: No 'consider'
662: No 'consider'
664: No 'consider'
665: No 'consider'


In [93]:
silver_data.loc[6, "content"]

["Arrears Worksheet File Number: CEL-63931-17 Time period for Arrears Owing From: December 1, 2016 to December 16, 2016 (From the commencement of arrears to the termination date in the notice, or the end of the rental period if the tenancy is not being terminated.) Part 1 - Calculations of Arrears Owing (A) Rent Period (monthly, weekly, etc.) (B) Rent Charged (C) Lawful Rent (if issue raised) (D) Lower of (B) and (C) (E) Rent Paid (F) Amount Owing (D-E) 01/12/2016 - 16/12/2016 $526.03 $526.03 $500.00 $26.03 **Part Month ** To calculate the Rent for part of a month, use the following formula for columns (B), (C) and (D): Monthly Rent X 12 X # Days 365 (F) Total Rent Owing $26.03 (G1) Arrears Owing $26.03 [From (F)] (G2) Arrears Claimed $1,670.00 (G3) Include whichever is less when Calculation Total Arrears Owing $26.03 (I) Total Amount Owing $26.03 Part II - Calculation of Compensation (Use this part if the tenancy is being terminated) (J) (i) Lump Sum Compensation Start Date December 1

It seems like it's usually around the text "...find that...." and "....it is ordered that...."

- could get all text within x left and right of the text that this is near, and train a model to classify it according to a certain number of class options (just like metadata extraction)

# Testing Metadata Extraction Model
- report to be written in based on data from WANDB
- wandb report link: https://wandb.ai/kmaurinjones/huggingface/reports/FLAN-T5-Metadata-Extractor-Model--Vmlldzo0NDk4MDEx

### General Cleaning functions (used with model)

In [94]:
import re

def general_cleaning(raw_file_str: str):
    # gets rid of tabs, non-breaking spaces, leading/trailing whitespace, removes empty lines, and "\xa0"
    generally_cleaned_str = [line.replace("\t", " ").replace("\xa0", "").strip() for line in raw_file_str.split('\n') if line.strip() != '']
    return generally_cleaned_str

def remove_whitespace_and_underscores(string):
    # Remove consecutive whitespace
    string = re.sub(r'\s+', ' ', string)

    # Remove more than three consecutive underscores
    string = re.sub(r'_+', '', string)

    return string.strip()

In [95]:
import transformers
# from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [96]:
model_name = "metadata_extractor_flant5_small"

# folder where the model files are located -- unzip before running
model_dir = f"/Users/kmaurinjones/Desktop/School/UBC/UBC_Coursework/capstone/Allard_A_Capstone/models/metadata_extractor/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

In [97]:
def extract_metadata_t5(raw_case_file_text: str, model = model, tokenizer = tokenizer):
    # do general case file cleaning
    clean_file_list = general_cleaning(raw_case_file_text)
    clean_file_str = " ".join([line for line in clean_file_list if ("metadata:" or "content:") not in line.lower()])
    print(clean_file_str)
    
    # run model on cleaned case file text
    inputs = ["extract metadata boundary:" + clean_file_str] # PREFIX = "extract metadata boundary:"

    # print("INPUT:", inputs)
    inputs = tokenizer(inputs, max_length = 256, truncation = True, return_tensors = "pt")
    output = model.generate(**inputs, num_beams = 8, do_sample = True, min_length = 1, max_length = 128)
    decoded_output = tokenizer.batch_decode(output, skip_special_tokens = True)[0]
    # print("OUTPUT:", decoded_output)

    return decoded_output
    
    # return metadata and content as lists

In [98]:
row_num = 3
# print(silver_data.loc[row_num, 'raw_file_text'])
[line for line in general_cleaning(silver_data.loc[row_num, 'raw_file_text']) if ("metadata:" or "content:") not in line.lower()]

['Date: 2017-01-20',
 'File number:',
 'CEL-63056-16',
 'CEL-63056-16',
 'Citation: CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB), <https://canlii.ca/t/gxq6t>, retrieved on 2023-05-16 https://canlii.ca/t/gxq6t',
 'Content:',
 'Arrears Worksheet',
 'File Number: CEL-63056-16',
 'Time',
 'period for Arrears Owing From: November 1, 2016 to November 24,',
 '2016',
 '(From',
 'the commencement of arrears to the termination date in the notice, or the end',
 'of the rental period if the tenancy is not being terminated.)',
 'Part',
 '1',
 '- Calculations of Arrears Owing',
 '(A)',
 'Rent Period (monthly, weekly, etc.)',
 '(B)',
 'Rent Charged',
 '(C)',
 'Lawful Rent',
 '(if issue raised)',
 '(D)',
 'Lower of',
 '(B) and (C)',
 '(E)',
 'Rent Paid',
 '(F)',
 'Amount Owing',
 '(D-E)',
 '01/11/2016',
 '- 24/11/2016',
 '$895.54',
 '$895.54',
 '$0.15',
 '$895.39',
 '**Part',
 'Month',
 '**',
 'To calculate the Rent for part of a month, use the',
 'following formula for columns (B), (C) and (D):',
 'M

In [99]:
test_case_file = silver_data.loc[row_num, 'raw_file_text']
print(f"GOAL: {' '.join(silver_data.loc[row_num, 'metadata'])}")
print()
print(f"MODEL: {extract_metadata_t5(raw_case_file_text = test_case_file)}")

GOAL: Date: 2017-01-20 File number: CEL-63056-16 CEL-63056-16 Citation: CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB), <https://canlii.ca/t/gxq6t>, retrieved on 2023-05-16 https://canlii.ca/t/gxq6t

Date: 2017-01-20 File number: CEL-63056-16 CEL-63056-16 Citation: CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB), <https://canlii.ca/t/gxq6t>, retrieved on 2023-05-16 https://canlii.ca/t/gxq6t Content: Arrears Worksheet File Number: CEL-63056-16 Time period for Arrears Owing From: November 1, 2016 to November 24, 2016 (From the commencement of arrears to the termination date in the notice, or the end of the rental period if the tenancy is not being terminated.) Part 1 - Calculations of Arrears Owing (A) Rent Period (monthly, weekly, etc.) (B) Rent Charged (C) Lawful Rent (if issue raised) (D) Lower of (B) and (C) (E) Rent Paid (F) Amount Owing (D-E) 01/11/2016 - 24/11/2016 $895.54 $895.54 $0.15 $895.39 **Part Month ** To calculate the Rent for part of a month, use the following formula for col

## case outcome extraction 2

In [272]:
silver_data = silver_data.drop(columns='outcome_text')
silver_data

Unnamed: 0,raw_file_text,raw_file_name,full_cleaned,metadata,content,case_citation,file_number,language,year,ltb_location,decision_date,hearing_date,url,adjudicating_member,new_case_outcome
0,Metadata:\nDate:\t2017-01-18\nFile number:\t\n...,CEL-62600-16.txt,"[Metadata:, Date: 2017-01-18, File number:, CE...","[Date: 2017-01-18, File number:, CEL-62600-16,...",[Arrears Worksheet File Number: CEL-62600-16 T...,"CEL-62600-16 (Re), 2017 CanLII 9545 (ON LTB)",CEL-62600-16,English,2016,Mississauga,01/18/2017,01/30/2017,https://canlii.ca/t/gxq6n,Avril Cardoso,No relief
1,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-62852-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-62852-16,...",[Arrears Worksheet File Number: CEL-62852-16 T...,"CEL-62852-16 (Re), 2017 CanLII 9535 (ON LTB)",CEL-62852-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6r,Tiisetso Russell,Relief
2,Metadata:\nDate:\t2017-01-09\nFile number:\t\n...,CEL-63024-16.txt,"[Metadata:, Date: 2017-01-09, File number:, CE...","[Date: 2017-01-09, File number:, CEL-63024-16,...",[Arrears Worksheet File Number: CEL-63024-16 T...,"CEL-63024-16 (Re), 2017 CanLII 9543 (ON LTB)",CEL-63024-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6s,Tiisetso Russell,Relief
3,Metadata:\nDate:\t2017-01-20\nFile number:\t\n...,CEL-63056-16.txt,"[Metadata:, Date: 2017-01-20, File number:, CE...","[Date: 2017-01-20, File number:, CEL-63056-16,...",[Arrears Worksheet File Number: CEL-63056-16 T...,"CEL-63056-16 (Re), 2017 CanLII 9537 (ON LTB)",CEL-63056-16,English,2016,Mississauga,01/09/2017,01/09/2017,https://canlii.ca/t/gxq6t,Tiisetso Russell,No relief
4,Metadata:\nDate:\t2017-02-03\nFile number:\t\n...,CEL-63193-16.txt,"[Metadata:, Date: 2017-02-03, File number:, CE...","[Date: 2017-02-03, File number:, CEL-63193-16,...",[Arrears Worksheet File Number: CEL-63193-16 T...,"CEL-63193-16 (Re), 2017 CanLII 30828 (ON LTB)",CEL-63193-16,English,2016,Mississauga,01/10/2017,02/03/2017,https://canlii.ca/t/h3w7b,Karen Wallace,No relief
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,Metadata:\nDate:\t2018-12-13\nFile number:\t\n...,TSL-98918-18-RV.txt,"[Metadata:, Date: 2018-12-13, File number:, TS...","[Date: 2018-12-13, File number:, TSL-98918-18-...",[Order under Section 21.2 of the Statutory Pow...,"TSL-98918-18-RV (Re), 2018 CanLII 141679 (ON LTB)",TSL-98918-18-RV,English,2018,Toronto,11/08/2018,12/13/2018,https://canlii.ca/t/j0fjv,Nancy Henderson,No relief
668,Metadata:\nDate:\t2018-11-23\nFile number:\t\n...,TSL-99691-18.txt,"[Metadata:, Date: 2018-11-23, File number:, TS...","[Date: 2018-11-23, File number:, TSL-99691-18,...",[Order under Section 69 Residential Tenancies ...,"TSL-99691-18 (Re), 2018 CanLII 141675 (ON LTB)",TSL-99691-18,English,2018,Toronto,11/23/2018,11/23/2018,https://canlii.ca/t/j0fk1,David Lee,No relief
669,Metadata:\nDate:\t2018-11-29\nFile number:\t\n...,TSL-99824-18.txt,"[Metadata:, Date: 2018-11-29, File number:, TS...","[Date: 2018-11-29, File number:, TSL-99824-18,...",[Order under Section 69 Residential Tenancies ...,"TSL-99824-18 (Re), 2018 CanLII 141673 (ON LTB)",TSL-99824-18,English,2018,Toronto,11/29/2018,12/11/2018,https://canlii.ca/t/j0fk2,Renée Lang,No relief
670,Metadata:\nDate:\t2018-12-12\nFile number:\t\n...,TSL-99900-18.txt,"[Metadata:, Date: 2018-12-12, File number:, TS...","[Date: 2018-12-12, File number:, TSL-99900-18,...",[Order under Section 69 Residential Tenancies ...,"TSL-99900-18 (Re), 2018 CanLII 140403 (ON LTB)",TSL-99900-18,English,2018,Toronto,12/12/2018,12/12/2018,https://canlii.ca/t/hzzb6,David Mungovan,No relief


In [273]:
# import pandas as pd
# gold_df = pd.read_csv('data/allard_labels_with_text.csv')
# gold_df.head()
# silver_data['new_case_outcome'] = gold_df['case_outcome']
# silver_data.head()
silver_data.to_csv('data/outcome_extraction_testing.csv', index = False)

In [138]:
# row = 0

keyword = "accordance with"
# keyword = "based on"
# keyword = "considered"

found_total = 0

for row in silver_data.index:
    gold_outcome = silver_data.loc[row, 'new_case_outcome']
    # print(gold_outcome)

    case_text = " ".join(silver_data.loc[row, 'content'])
    # if case_text.find(keyword) != -1:
    if keyword in case_text.lower():
        found_total += 1

print(f"{found_total / len(silver_data.index)}; {found_total} / {len(silver_data.index)}")

0.9419642857142857; 633 / 672


In [257]:

start_boundary = "accordance with"
end_boundary = "ordered that"

# start_boundary = "based on"
# start_boundary = "considered"

found_total = 0
# row = 500

for row in silver_data.index:
    gold_outcome = silver_data.loc[row, 'new_case_outcome']

    # print(silver_data.loc[row, 'url'])
    # print(gold_outcome)
    # print()

    case_text = " ".join(silver_data.loc[row, 'content'])
    start_bound_positions = find_all_positions(case_text.lower(), start_boundary)

    proximity = 1000 # number of characters after the start boundary to look for the end boundary

    for pos in start_bound_positions:
        near_text = case_text[pos - 100 : pos + proximity]
        near_text = ". ".join(near_text.split(". ")[1:])
        end_bounds_finds = find_all_positions(near_text, end_boundary)
        # if len(end_bounds_finds) > 1:
        #     print("MORE THAN ONE END BOUNDARY")
        if end_boundary in near_text:
            found_total += 1
            subset2 = near_text[:near_text.find(end_boundary)]
            outcome = ". ".join(subset2.split(". ")[:-1])
            # print("YES END BOUNDARY")
            # print(f"{len(outcome.split(' '))} words")
            # print()
            outcome = re.sub(r'^\d+\.\s*', '', outcome).strip() # removes "16. " from start of string
            silver_data.loc[row, 'outcome_text'] = outcome
            # print(outcome)
            break
        else:
            silver_data.loc[row, 'outcome_text'] = "NEED OTHER METHOD"
            continue
            # break
        # print(near_text)
    
found_total / len(silver_data.index)

0.7247023809523809

In [270]:
start_boundary = "accordance with"
end_boundary = "ordered that"

found_total = 0

max_outcome_len = 0
total_outcome_len = 0

for row in silver_data.index:
    gold_outcome = silver_data.loc[row, 'new_case_outcome']

    case_text = " ".join(silver_data.loc[row, 'content'])
    start_bound_positions = find_all_positions(case_text.lower(), start_boundary)

    proximity = 1000 # number of characters after the start boundary to look for the end boundary

    for pos in start_bound_positions:
        near_text = case_text[pos - 100 : pos + int(proximity)]
        near_text = ". ".join(near_text.split(". ")[1:])

        # Use a while loop to increase the proximity if end_boundary is not found in near_text
        while end_boundary not in near_text and proximity < len(case_text) - pos:
            proximity *= 1.5
            near_text = case_text[pos - 100 : pos + int(proximity)]
            near_text = ". ".join(near_text.split(". ")[1:])

        end_bounds_finds = find_all_positions(near_text, end_boundary)

        if end_boundary in near_text:
            found_total += 1
            subset2 = near_text[:near_text.find(end_boundary)]
            outcome = ". ".join(subset2.split(". ")[:-1])
            outcome = re.sub(r'^\d+\.\s*', '', outcome).strip() # removes "16. " from start of string
            silver_data.loc[row, 'outcome_text'] = outcome
            outcome_len = len(outcome.split(' '))
            if outcome_len > max_outcome_len:
                max_outcome_len = outcome_len
            total_outcome_len += outcome_len
            break
        else:
            silver_data.loc[row, 'outcome_text'] = "NEED OTHER METHOD"

    # Reset the proximity back to the original value for the next row
    proximity = 1000

print(found_total / len(silver_data.index))
print(max_outcome_len)
print(total_outcome_len / len(silver_data.index))

0.9211309523809523
4235
221.10267857142858


In [264]:
silver_data['outcome_text'].tolist()

["The N5 Notice has a termination date of November 17, 2016 and alleges that the conduct of the Tenant substantially interfered with the reasonable enjoyment of the residential complex and the Tenant has wilfully or negligently damaged the rental unit or residential complex. 5. The N5 Notice sets out the dates, times and specific allegations against the Tenant. 6. Because this is a first N5 Notice, the Tenant has an opportunity to void the notice in accordance with section 64(3) of the Act by correcting the identified problems within seven days of the date the notice was served. In this case, the seven day period begins the day following service of the N5 Notice, from October 28th to November 3rd, 2016. The Landlord's Legal Representative confirmed that the first N5 Notice was not voided because the conduct has not stopped since the notice was served and the Tenant has not paid $500.00 required to replace or replace the damaged property. Window screens 7. The Tenant negligently caused 

In [224]:
# len("""
# 29.   I have considered all of the disclosed circumstances in accordance with subsection 83(2) of the Act, and find that it would not be unfair to grant relief from eviction subject to  the conditions set out in this order pursuant to subsection 83(1)(a) and 204(1) of the Act.

# 30.   In granting relief, I am taking into account the submissions of the Landlord and her financial obligations, and balancing that against the Tenant’s disclosed circumstances. Therefore the relief from eviction is subject to conditions set out in this order which, if not met, would allow the Landlord to apply to the Board, without further notice to the Tenant, for an order terminating the tenancy and evicting the Tenant pursuant to section 78 of the Act.

# It is ordered that:

# """)