# Event-based Timeline Generation Tool for Document Analysis
This project addresses a pressing need in the legal profession: analyzing large volumes of legal documents efficiently. Lawyers often spend weeks aggregating evidence from numerous submissions to build a chronological timeline of the events relevant to a case, and a single case can involve thousands or even millions of documents. I therefore propose a tool that lets lawyers upload case-related documents and receive a chronological timeline of events, with each event linked to the documents it came from so the source is easy to consult. The goal is to turn this time-consuming task into one that takes minutes rather than weeks. I believe the tool could also be extended beyond legal work to other fields that depend on efficient document analysis.

## Import necessary libraries

In [1]:
import openai
import base64
import fitz
import os
import io
import re
import numpy as np
import spacy
from PIL import Image
from datetime import datetime
from scipy.spatial.distance import cosine

In [2]:
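# Set up your OpenAI API key (read from the OPENAI_API_KEY environment variable when set, instead of hard-coding it)
openai.api_key = os.environ.get('OPENAI_API_KEY', 'YOUR-OPENAI-KEY')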

## Extract document texts

In [3]:
def extract_texts_from_case_documents(data_directory='../Data/pdfs', case_folder_name='fdi_moot_case_2024'):
    """
    Extracts text from documents within a specified case folder.

    Args:
        data_directory (str): The directory containing case folders.
        case_folder_name (str): The name of the case folder to extract texts from.

    Returns:
        tuple: A tuple containing a dictionary of document texts and a list of document paths.
            - document_texts (dict): A dictionary where keys are document names and values are corresponding texts.
            - document_paths (list): A list of paths to the extracted documents.
    """
    folder_path = os.path.join(data_directory, case_folder_name)
    documents = os.listdir(folder_path)
    
    # Create full paths for each document
    document_paths = [os.path.join(folder_path, doc) for doc in documents if not doc.startswith('.')]
    
    document_texts = {}
    for path in document_paths:
        key = os.path.basename(path).split('.')[0]
        text = ""
        with fitz.open(path) as pdf_document:
            num_pages = pdf_document.page_count
    
            for page_number in range(num_pages):
                page = pdf_document[page_number]
                text += page.get_text()
        document_texts[key] = text

    return document_texts, document_paths

In [4]:
document_texts, document_paths = extract_texts_from_case_documents()

In [5]:
print(document_texts['fdi_moot_case_2024_part_1'])

4 
INTERNATIONAL CENTRE FOR SETTLEMENT OF INVESTMENT DISPUTES 
In the arbitration proceeding between 
Astracommex Regional Satellite Communication Inc. 
(Claimant) 
and 
The Republic of Celestria 
(Respondent) 
REQUEST FOR ARBITRATION 
9 September 2022 
For the Claimant: 
Ms. Astrid Stellaris 
AstroJuris Arbitration 
3 Saturn St., 48798 Stelaria 
Nebuland 
 
 
 
5 
I. 
INTRODUCTION
1. Astracommex Regional Satellite Communications Inc. hereby submits a request to initiate 
arbitration (the “Request”) in a dispute with the Republic of Celestria (“Celestria”) in 
accordance with Article 36 of the Convention on the Settlement of Investment Disputes 
between States and Nationals of Other States, which entered into force on 14 October 1966 
(the “ICSID Convention”), Rules 1 and 2 of the 2022 ICSID Institution Rules, and Article 10 
5 
of the Agreement on Reciprocal Promotion and Protection of Investments between the 
Kingdom of Nebuland and the Republic of Celestria, which entered into force

## OCR for image-only PDFs

In [6]:
def get_ocr_keys(document_texts, document_paths):
    """
    Identifies documents that need OCR because almost no text could be extracted from them.

    Args:
        document_texts (dict): A dictionary containing document names as keys and corresponding text as values.
        document_paths (list): A list of paths to the extracted documents.

    Returns:
        tuple: A tuple containing lists of document keys and OCR document paths.
            - doc_keys_for_ocr (list): A list of document keys suitable for OCR processing.
            - ocr_document_paths (list): A list of paths to the OCR suitable documents.
    """
    doc_keys_for_ocr = []
    for key, doc in document_texts.items():
        # Check if the document text is short (rough threshold)
        if len(doc) <= 30:
            doc_keys_for_ocr.append(key)
            
    ocr_document_paths = [path for path in document_paths
                          if os.path.basename(path).split('.')[0] in doc_keys_for_ocr]
    
    return doc_keys_for_ocr, ocr_document_paths
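
The 30-character threshold above is applied to a document's full text, so a PDF that mixes text pages with scanned pages would not be flagged. A minimal sketch of a per-page variant follows, assuming the per-page texts are already available as a list of strings. The function name `pages_needing_ocr` and the `min_chars` parameter are illustrative and not part of the notebook.

```python
def pages_needing_ocr(page_texts, min_chars=30):
    """Return indexes of pages whose extracted text is too short to be real content."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) <= min_chars]


# Hypothetical per-page texts: pages 1 and 2 yielded (almost) nothing.
example_pages = [
    "REQUEST FOR ARBITRATION\n9 September 2022 ...and many more words here",
    "C-3\n",
    "",
]
flagged = pages_needing_ocr(example_pages)
print(flagged)
```

Only the flagged pages would then need to go through the image-based OCR step, which may reduce the number of images sent to the model.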

In [7]:
doc_keys_for_ocr, ocr_document_paths = get_ocr_keys(document_texts, document_paths)

In [8]:
doc_keys_for_ocr

['fdi_moot_case_2024_part_4']

In [9]:
ocr_document_paths

['../Data/pdfs/fdi_moot_case_2024/fdi_moot_case_2024_part_4.pdf']

In [10]:
document_texts[doc_keys_for_ocr[0]]

'C-3\n'

In [11]:
def extract_images_from_pdfs(pdf_paths, images_folder='../Data/images'):
    """
    Extracts images from multiple PDFs and saves them as JPG files in separate folders.

    Args:
        pdf_paths (list): List of paths to the PDF files.
        images_folder (str): Path to the folder where images will be saved. Default is '../Data/images'.

    Returns:
        dict: Dictionary where keys are folder names and values are lists of paths to the saved image files.
    """
    image_dict = {}

    for pdf_path in pdf_paths:
        document_name = os.path.splitext(os.path.basename(pdf_path))[0]
        pdf_folder = os.path.join(images_folder, document_name)
        os.makedirs(pdf_folder, exist_ok=True)

        image_paths = []

        try:
            with fitz.open(pdf_path) as pdf_document:
                for page_number in range(len(pdf_document)):
                    page = pdf_document.load_page(page_number)
                    images = page.get_images(full=True)

                    for img in images:
                        xref = img[0]
                        base_image = pdf_document.extract_image(xref)
                        image_bytes = base_image["image"]
                        image = Image.open(io.BytesIO(image_bytes))
                        image_path = os.path.join(pdf_folder, f'{document_name}_page_{page_number}.jpg')
                        # JPEG cannot store alpha or palette modes, so convert to RGB before saving
                        image.convert('RGB').save(image_path)
                        image_paths.append(image_path)

        except Exception as e:
            print(f"An error occurred while processing {pdf_path}: {e}")

        image_dict[document_name] = image_paths

    return image_dict

In [12]:
image_dict = extract_images_from_pdfs(ocr_document_paths)

In [13]:
image_dict

{'fdi_moot_case_2024_part_4': ['../Data/images/fdi_moot_case_2024_part_4/fdi_moot_case_2024_part_4_page_0.jpg',
  '../Data/images/fdi_moot_case_2024_part_4/fdi_moot_case_2024_part_4_page_1.jpg',
  '../Data/images/fdi_moot_case_2024_part_4/fdi_moot_case_2024_part_4_page_2.jpg',
  '../Data/images/fdi_moot_case_2024_part_4/fdi_moot_case_2024_part_4_page_3.jpg']}

In [14]:
def encode_images(image_dict):
    """
    Encodes images in the given dictionary into base64 format.

    Args:
        image_dict (dict): A dictionary where keys are folder names and values are lists of image paths.

    Returns:
        dict: A dictionary where keys are folder names and values are lists of tuples containing page numbers and encoded images.
    """
    encoded_images = {}

    for folder_name, image_paths in image_dict.items():
        encoded_images[folder_name] = []

        for image_path in image_paths:
            with open(image_path, "rb") as image_file:
                encoded_image = base64.b64encode(image_file.read()).decode('utf-8')
                # Extract the page number from the image path
                page_number = int(image_path.split('_')[-1].split('.')[0])
                encoded_images[folder_name].append((page_number, encoded_image))

        # Sort the list of tuples based on the page number
        encoded_images[folder_name].sort(key=lambda x: x[0])

    return encoded_images
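
The page number above is recovered by splitting the file name on `_`, which works for the naming scheme used here but gives no clear error for other names. A small sketch of a regex-based alternative is shown below. The helper `page_number_from_path` is hypothetical and assumes file names end in `_page_N.<ext>`; the example paths are made up.

```python
import re


def page_number_from_path(image_path):
    """Extract N from a file name ending in '_page_N.<ext>'; return None if absent."""
    match = re.search(r"_page_(\d+)\.[A-Za-z0-9]+$", image_path)
    return int(match.group(1)) if match else None


# Hypothetical paths, deliberately out of order.
paths = [
    "../Data/images/doc/doc_page_10.jpg",
    "../Data/images/doc/doc_page_2.jpg",
    "../Data/images/doc/doc_page_0.jpg",
]
# Sorting by the parsed integer keeps page 10 after page 2,
# which a plain string sort would not do.
ordered = sorted(paths, key=page_number_from_path)
print(ordered)
```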

In [15]:
base64_images = encode_images(image_dict)

In [16]:
base64_images.keys()

dict_keys(['fdi_moot_case_2024_part_4'])

In [17]:
def ocr_document_pages_with_gpt(base64_images, pages_per_batch=3):
    """
    Performs OCR (Optical Character Recognition) on document pages using OpenAI's GPT models.

    Args:
        base64_images (dict): A dictionary containing document names as keys and lists of base64-encoded page images as values.
        pages_per_batch (int, optional): The number of pages to process per batch. Defaults to 3.

    Returns:
        dict: A dictionary containing OCR results for each document, where keys are document names and values are the extracted text.
    """
    ocr_results = {}

    for document_name, pages in base64_images.items():
        ocr_results[document_name] = ''

        for i in range(0, len(pages), pages_per_batch):
            batch = pages[i:i+pages_per_batch]
            content = [{"type": "text", "text": """Given the pdf pages in correct order, extract the EXACT information in the same order, that is, do OCR. You must extract every word as it is, without your additional thoughts and comments. DO NOT ADD redundant (double) new lines after each text line, that is, do not separate text parts into separate paragraphs when they belong together and should be one paragraph.
                        BAD OUTPUT EXAMPLE:
                        AGREEMENT OF 1 OCTOBER 2016 FOR THE IMPLEMENTATION OF THE RURAL DIGITALIZATION FUND
                        
                        The present Agreement for the Implementation of the Rural Digitalization Fund (hereinafter referred to as the “Agreement”) is made and entered at the location and on the date indicated below between
                        
                        the National Frequency Agency of Celestria, 3 hertz bvd., 98479 Starvalis, Celestria (hereinafter “NFA”); and
                        
                        the Rural Development Agency of Celestria, 3 tree road, 98479 Starvalis, Celestria (hereinafter “RDA”); and
                        
                        Astracomnex Regional Satellite Communications Inc., 115 Neptune St., 48799 Stelaria, Neuland (hereinafter “Astracomnex Regional”)
                        
                        (hereinafter together referred to as “the Parties”)
                        
                        under the following terms and conditions:
                        
                        (A) On 1 January 2016, the Republic of Celestria established the Rural Digitalization Fund (hereinafter “RDF”) with the goal of digitalizing public services.
                        
                        (B) On 1 January 2016, the NFA and RDA announced an invitation for foreign and domestic companies to submit their applications to the RDF for allocation of funds and spectrum.
                        
                        (C) On 15 February 2016, Astracomnex Regional submitted its application to the RDF (hereinafter the “RDF Application”).
                        
                        (D) On 1 August 2016, the NFA and RDA announced Astracomnex Regional as one of three successful applicants.
                        
                        THEREFORE, in consideration of the foregoing, the Parties agree to the following:
                        
                        Article 1 Definitions
                        
                        ORBIT The path, relative to a specified frame of reference, described by the center of mass of a satellite or other object in space subjected primarily to natural forces, mainly the force of gravity (per article 1.184 of the International Telecommunication Union (“ITU”) Radio Regulations).
                        
                        GEOSYNCHRONOUS SATELLITE An earth satellite whose period of revolution is equal to the period of rotation of the Earth about its axis (per article 1.188 of the ITU Radio Regulations).
                        
                        GEOSTATIONARY SATELLITE A geosynchronous satellite whose circular and direct orbit lies in the plane of the Earth’s equator and which thus remains fixed relative to the Earth; by extension, a geosynchronous satellite which remains approximately fixed relative to the Earth (per article 1.189 of the ITU Radio Regulations).
                        
                        FIXED-SATELLITE SERVICE A radiocommunication service between earth stations at given positions, when one or more satellites are used; the given position may be a specified fixed point or any fixed point within specified areas (per article 1.21 of the ITU Radio Regulations).
                        
                        MOBILE SATELLITE-SERVICE A radio communication service between mobile earth stations and one or more space stations, or between space stations used by this service; or between mobile earth stations by means of one or more space stations (per article 1.25 of the ITU Radio Regulations).
                        
                        Article 2 Deployment Requirements
                        
                        1. Astracomnex Regional commits to offer stand-alone broadband service at speeds consistent with the RDF Application, i.e., with at least 25 Mbps downstream and 3 Mbps upstream (25/3 Mbps) at rates reasonably comparable to those available in urban areas to all locations within an awarded area of the ten-year program.
                        
                        2. The initial interim deployment milestones are set as follows. Astracomnex Regional as carrier must complete:
                        
                        i. 10 percent of deployments by the end of year one;
                        
                        ii. 30 percent of deployments by the end of year two;
                        
                        iii. 50 percent of deployments by the end of year three;
                        
                        iv. 70 percent of deployments by the end of year four;
                        
                        v. 100 percent of deployments by the end of year five.
                
                        GOOD OUTPUT EXAMPLE with no redundant spacing inside paragraphs and double new lines (\n\n) between separate paragraphs:
                        AGREEMENT OF 1 OCTOBER 2016 FOR THE IMPLEMENTATION OF THE RURAL DIGITALIZATION FUND
                        The present Agreement for the Implementation of the Rural Digitalization Fund (hereinafter referred to as the “Agreement”) is made and entered at the location and on the date indicated below between the National Frequency Agency of Celestria, 3 hertz bvd., 98479 Starvalis, Celestria (hereinafter “NFA”); and the Rural Development Agency of Celestria, 3 tree road, 98479 Starvalis, Celestria (hereinafter “RDA”); and Astracomnex Regional Satellite Communications Inc., 115 Neptune St., 48799 Stelaria, Neuland (hereinafter “Astracomnex Regional”) (hereinafter together referred to as “the Parties”) under the following terms and conditions:
                        (A) On 1 January 2016, the Republic of Celestria established the Rural Digitalization Fund (hereinafter “RDF”) with the goal of digitalizing public services.
                        (B) On 1 January 2016, the NFA and RDA announced an invitation for foreign and domestic companies to submit their applications to the RDF for allocation of funds and spectrum.
                        (C) On 15 February 2016, Astracomnex Regional submitted its application to the RDF (hereinafter the “RDF Application”).
                        (D) On 1 August 2016, the NFA and RDA announced Astracomnex Regional as one of three successful applicants.
                        
                        THEREFORE, in consideration of the foregoing, the Parties agree to the following:
                        Article 1 Definitions
                        ORBIT The path, relative to a specified frame of reference, described by the center of mass of a satellite or other object in space subjected primarily to natural forces, mainly the force of gravity (per article 1.184 of the International Telecommunication Union (“ITU”) Radio Regulations).
                        GEOSYNCHRONOUS SATELLITE An earth satellite whose period of revolution is equal to the period of rotation of the Earth about its axis (per article 1.188 of the ITU Radio Regulations).
                        GEOSTATIONARY SATELLITE A geosynchronous satellite whose circular and direct orbit lies in the plane of the Earth’s equator and which thus remains fixed relative to the Earth; by extension, a geosynchronous satellite which remains approximately fixed relative to the Earth (per article 1.189 of the ITU Radio Regulations).
                        FIXED-SATELLITE SERVICE A radiocommunication service between earth stations at given positions, when one or more satellites are used; the given position may be a specified fixed point or any fixed point within specified areas (per article 1.21 of the ITU Radio Regulations).
                        MOBILE SATELLITE-SERVICE A radio communication service between mobile earth stations and one or more space stations, or between space stations used by this service; or between mobile earth stations by means of one or more space stations (per article 1.25 of the ITU Radio Regulations).
                        
                        Article 2 Deployment Requirements
                        1. Astracomnex Regional commits to offer stand-alone broadband service at speeds consistent with the RDF Application, i.e., with at least 25 Mbps downstream and 3 Mbps upstream (25/3 Mbps) at rates reasonably comparable to those available in urban areas to all locations within an awarded area of the ten-year program.
                        2. The initial interim deployment milestones are set as follows. Astracomnex Regional as carrier must complete:
                        i. 10 percent of deployments by the end of year one;
                        ii. 30 percent of deployments by the end of year two;
                        iii. 50 percent of deployments by the end of year three;
                        iv. 70 percent of deployments by the end of year four;
                        v. 100 percent of deployments by the end of year five.
                        
                        Finally, DO NOT apply any text modifications (bold font, italics font, fontsize differences, ...) to the text and give extracted clean text word for word as in the image, that is, do ACCURATE OCR."""}]
                
            # Add image data to the messages
            for _, image_data in batch:
                content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},})

            response = openai.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {
                      "role": "user",
                      "content": content,
                    }
                  ]
            )

            # Append the OCR text of this batch (a single choice is returned by default)
            ocr_results[document_name] += '\n\n' + response.choices[0].message.content

            # Remove the first \n\n
            ocr_results[document_name] = ocr_results[document_name].lstrip('\n')

    return ocr_results
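
The batching logic in the function above can be checked without calling the API. The sketch below only builds the per-batch `content` lists in the same shape the function sends. `build_ocr_batches` is an illustrative helper, the prompt is shortened, and the base64 strings are placeholders rather than real images.

```python
def build_ocr_batches(pages, prompt, pages_per_batch=3):
    """Yield one chat `content` list per batch: the prompt followed by the page images."""
    for start in range(0, len(pages), pages_per_batch):
        content = [{"type": "text", "text": prompt}]
        for _, image_data in pages[start:start + pages_per_batch]:
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
            })
        yield content


# Four placeholder pages as (page_number, base64_string) tuples.
dummy_pages = [(n, f"BASE64_PAGE_{n}") for n in range(4)]
batches = list(build_ocr_batches(dummy_pages, prompt="Do OCR."))
print([len(batch) for batch in batches])
```

With four pages and three pages per batch, the first request carries the prompt plus three images and the second carries the prompt plus the remaining image.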

In [18]:
ocr_results = ocr_document_pages_with_gpt(base64_images)

In [19]:
ocr_results.keys()

dict_keys(['fdi_moot_case_2024_part_4'])

In [20]:
print(ocr_results['fdi_moot_case_2024_part_4'])

AGREEMENT OF 1 OCTOBER 2016 FOR THE IMPLEMENTATION OF THE RURAL DIGITALIZATION FUND
The present Agreement for the Implementation of the Rural Digitalization Fund (hereinafter referred to as the “Agreement”) is made and entered at the location and on the date indicated below between the National Frequency Agency of Celestria, 3 hertz bvd., 98479 Starvalis, Celestria (hereinafter “NFA”); and the Rural Development Agency of Celestria, 3 tree road, 98479 Starvalis, Celestria (hereinafter “RDA”); and Astracomnex Regional Satellite Communications Inc., 115 Neptune St., 48799 Stelaria, Neuland (hereinafter “Astracomnex Regional”) (hereinafter together referred to as “the Parties”) under the following terms and conditions: (A) On 1 January 2016, the Republic of Celestria established the Rural Digitalization Fund (hereinafter “RDF”) with the goal of digitalizing public services. (B) On 1 January 2016, the NFA and RDA announced an invitation for foreign and domestic companies to submit their app

In [21]:
def update_with_ocr_texts(document_texts, ocr_results):
    """
    Updates document texts with OCR (Optical Character Recognition) results.

    Args:
        document_texts (dict): A dictionary containing document names as keys and their corresponding texts as values.
        ocr_results (dict): A dictionary containing OCR results for documents, with document names as keys and OCR-extracted text as values.

    Returns:
        dict: Updated document_texts dictionary with OCR-extracted text replacing original text for documents that require OCR.
    """
    # Only the keys are needed here, so pass an empty path list instead of relying on a global variable
    doc_keys_for_ocr, _ = get_ocr_keys(document_texts, [])
    
    for key in document_texts.keys():
        if key in doc_keys_for_ocr:
            # Replace the value corresponding to the key with the value from ocr_results
            document_texts[key] = ocr_results.get(key, document_texts[key])
            
    return document_texts

In [22]:
document_texts = update_with_ocr_texts(document_texts, ocr_results)

In [23]:
document_texts['fdi_moot_case_2024_part_4']

'AGREEMENT OF 1 OCTOBER 2016 FOR THE IMPLEMENTATION OF THE RURAL DIGITALIZATION FUND\nThe present Agreement for the Implementation of the Rural Digitalization Fund (hereinafter referred to as the “Agreement”) is made and entered at the location and on the date indicated below between the National Frequency Agency of Celestria, 3 hertz bvd., 98479 Starvalis, Celestria (hereinafter “NFA”); and the Rural Development Agency of Celestria, 3 tree road, 98479 Starvalis, Celestria (hereinafter “RDA”); and Astracomnex Regional Satellite Communications Inc., 115 Neptune St., 48799 Stelaria, Neuland (hereinafter “Astracomnex Regional”) (hereinafter together referred to as “the Parties”) under the following terms and conditions: (A) On 1 January 2016, the Republic of Celestria established the Rural Digitalization Fund (hereinafter “RDF”) with the goal of digitalizing public services. (B) On 1 January 2016, the NFA and RDA announced an invitation for foreign and domestic companies to submit their a

## Divide documents into chunks/batches

In [24]:
def get_chunks_with_spacy(document_texts, min_chunk_length=1000, batch_max_chunk_length=25000, batch=False, nlp = spacy.load("en_core_web_sm")):
    """
    Splits document texts into chunks using spaCy.

    Args:
        document_texts (dict): A dictionary containing document names as keys and their corresponding cleaned texts as values.
        min_chunk_length (int, optional): The minimum length (in characters) of each chunk, except possibly the last one. Defaults to 1000.
        batch_max_chunk_length (int, optional): The maximum length of chunks in a batch. Defaults to 25000.
        batch (bool, optional): If True, chunks will be split into batches based on the batch_max_chunk_length. Defaults to False.
        nlp (spaCy Language, optional): An instance of a spaCy Language model. Defaults to spacy.load("en_core_web_sm").

    Returns:
        dict: A dictionary containing document names as keys and lists of chunks as values.
    """
    document_texts_chunks = {}
    
    for document_name, text in document_texts.items():
        chunks = []
        
        # Process the text with spaCy
        doc = nlp(text)
        
        # Split the text into sentences (doc.sents yields sentences, not paragraphs)
        paragraphs = [par.text.strip() for par in doc.sents]
        
        current_chunk = ""
        
        for paragraph in paragraphs:
            # Add the current paragraph to the current chunk
            if len(current_chunk) == 0:
                current_chunk = paragraph
            else:
                current_chunk += " " + paragraph
            
            # Close the chunk once it has reached the minimum chunk length
            if len(current_chunk) >= min_chunk_length:
                chunks.append(current_chunk)
                current_chunk = ""
        
        # Check if there's remaining text to create a final chunk
        if len(current_chunk) > 0:
            chunks.append(current_chunk)

        if not batch:
            # Append chunks along with their indexes to the list
            document_texts_chunks[document_name] = [(idx, chunk) for idx, chunk in enumerate(chunks)]
        else:
            batches = []
            current_batch = ""
            for chunk in chunks:
                # Start a new batch when adding this chunk would exceed the limit
                if current_batch and len(current_batch) + len(chunk) > batch_max_chunk_length:
                    batches.append(current_batch)
                    current_batch = chunk
                else:
                    current_batch = current_batch + "\n" + chunk if current_batch else chunk

            # Keep the final (possibly partial) batch so the last chunk is never dropped
            if current_batch:
                batches.append(current_batch)
                    
            document_texts_chunks[document_name] = batches

    return document_texts_chunks
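
The chunking rule above is a greedy merge: sentences are appended until the chunk reaches `min_chunk_length`, and whatever remains becomes a shorter final chunk. A minimal sketch of just that rule is given below, with plain strings standing in for spaCy sentences. `merge_sentences` is an illustrative name, not a function from the notebook.

```python
def merge_sentences(sentences, min_chunk_length=1000):
    """Greedily join sentences until each chunk reaches min_chunk_length characters."""
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}" if current else sentence
        if len(current) >= min_chunk_length:
            chunks.append(current)
            current = ""
    if current:  # keep the shorter tail as a final chunk
        chunks.append(current)
    return chunks


# Three 6-character "sentences" with a minimum chunk length of 10.
sentences = ["a" * 6, "b" * 6, "c" * 6]
chunks = merge_sentences(sentences, min_chunk_length=10)
print(chunks)
```

Because a chunk is only closed at a sentence boundary, chunks can exceed the minimum by up to one sentence, and the last chunk can fall below it.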

## Clean the text with OpenAI

In [25]:
def clean_texts_with_gpt(document_texts):
    """
    Cleans document texts using GPT-3.5 model.

    Args:
        document_texts (dict): A dictionary containing document names as keys and their corresponding texts as values.

    Returns:
        dict: A dictionary containing cleaned document texts.
    """
    document_texts_cleaned = {}
    
    batches = get_chunks_with_spacy(document_texts, batch=True)
    for key, batches_to_clean in batches.items():
        # print(f"{key} - Batches: {len(batches_to_clean)}")
        cleaned_text = ""
        for batch in batches_to_clean:
            response = openai.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant helping to clean document texts."},
                    {"role": "user", "content": f"""Clean the following extracted pdf document's text from redundant information, such as the page numbers, row numbers, and other unnecessary symbols/characters/information that I got as a result of extracting the text from pdf. Return the EXACT same text without any modifications besides cleaning the pdf extractor's extracted redundant irrelevant information. Remember, don't do any contextual or text changes, your job is to just clean and return the text, without leaving out or modifying any information besides what was instructed. \n TEXT to clean: {batch}"""},
                ]
            )
            
            cleaned_text += '\n' + response.choices[0].message.content
            cleaned_text = cleaned_text.lstrip('\n') # Remove the first \n

        document_texts_cleaned[key] = cleaned_text
        
    return document_texts_cleaned
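
Since the model is asked to return the exact same text minus extraction debris, it may be worth checking that the cleaned text has not drifted far from the original. The sketch below is one possible sanity check using the standard library's `difflib`; it is not part of the notebook. The `similarity` helper is hypothetical, and any threshold for "too different" would have to be chosen empirically.

```python
from difflib import SequenceMatcher


def similarity(original, cleaned):
    """Ratio in [0, 1] of how much of the whitespace-normalized text is shared."""
    a = " ".join(original.split())
    b = " ".join(cleaned.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Short example modeled on the raw extraction output shown earlier.
raw = "4 \nREQUEST FOR ARBITRATION \n9 September 2022 \n5 \nI. \nINTRODUCTION"
cleaned = "REQUEST FOR ARBITRATION\n9 September 2022\nI.\nINTRODUCTION"
score = similarity(raw, cleaned)
print(round(score, 3))
```

A score close to 1 suggests only small fragments such as page numbers were removed. `SequenceMatcher` is quadratic in the worst case, so for long documents it is probably better applied per batch than to the whole text.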

In [26]:
document_texts_cleaned = clean_texts_with_gpt(document_texts)

In [27]:
# Print the cleaned texts
for key, cleaned_text in document_texts_cleaned.items():
    print(f"Cleaned text for key '{key}':")
    print(cleaned_text)
    print("-----------------------------------------")

Cleaned text for key 'fdi_moot_case_2024_part_12':
INTERNATIONAL CENTRE FOR SETTLEMENT OF INVESTMENT DISPUTES 
In the arbitration proceeding between 
Astracommex Regional Satellite Communication Inc. 
(Claimant) 
and 
The Republic of Celestria 
(Respondent) 
RESPONSE TO THE REQUEST FOR ARBITRATION 
11 November 2022 
For the Respondent: 
Dr. Janis Pletnik 
CosmoLex 
5 lunar bvd., 98479 Starvalis 
Celestria 
I. 
INTRODUCTION 
1. The Republic of Celestria (“Celestria”), the Respondent in the present proceeding, hereby 
submits a short response to the Claimant’s Request for Arbitration (the “Response”). 2. In this Response, unless otherwise stated, the Respondent adopts the abbreviations used in the 
Claimant’s Request for Arbitration. 3. Unless otherwise stated, the Respondent disagrees with every statement made by the Claimant 
in the Request for Arbitration. 
II. THE 
TRIBUNAL 
LACKS 
JURISDICTION 
RATIONE 
TEMPORIS 
TO 
ENTERTAIN THE CLAIMANT’S CLAIM 
4. The Respondent considers that t

In [28]:
# Chunk the text
document_texts_cleaned_chunks = get_chunks_with_spacy(document_texts_cleaned)

In [29]:
document_texts_cleaned_chunks

{'fdi_moot_case_2024_part_12': [(0,
   'INTERNATIONAL CENTRE FOR SETTLEMENT OF INVESTMENT DISPUTES \nIn the arbitration proceeding between \nAstracommex Regional Satellite Communication Inc. \n(Claimant) \nand \nThe Republic of Celestria \n(Respondent) \nRESPONSE TO THE REQUEST FOR ARBITRATION \n11 November 2022 \nFor the Respondent: \nDr. Janis Pletnik \nCosmoLex \n5 lunar bvd., 98479 Starvalis \nCelestria \nI. \nINTRODUCTION \n1. The Republic of Celestria (“Celestria”), the Respondent in the present proceeding, hereby \nsubmits a short response to the Claimant’s Request for Arbitration (the “Response”). 2. In this Response, unless otherwise stated, the Respondent adopts the abbreviations used in the \nClaimant’s Request for Arbitration. 3. Unless otherwise stated, the Respondent disagrees with every statement made by the Claimant \nin the Request for Arbitration. \nII. THE \nTRIBUNAL \nLACKS \nJURISDICTION \nRATIONE \nTEMPORIS \nTO \nENTERTAIN THE CLAIMANT’S CLAIM \n4. The Respondent

## Extract dates from chunks, standardize them, and sort them chronologically
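
Once dates are standardized to dd/mm/yyyy, ordering them chronologically requires parsing them first, because sorting the strings directly compares the day before the year (for example, "02/01/2023" would sort before "15/06/2016"). A minimal sketch follows, assuming events are held as `(date_string, title)` pairs. The `sort_events` helper and the event titles are illustrative; the dates are taken from the documents shown above.

```python
from datetime import datetime


def sort_events(events):
    """Sort (date_string, title) pairs chronologically; dates are dd/mm/yyyy."""
    return sorted(events, key=lambda event: datetime.strptime(event[0], "%d/%m/%Y"))


# Illustrative events; titles are placeholders.
events = [
    ("09/09/2022", "Request for Arbitration"),
    ("01/10/2016", "RDF implementation agreement"),
    ("11/11/2022", "Response to the Request"),
]
timeline = sort_events(events)
for date, title in timeline:
    print(date, title)
```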

In [30]:
def extract_dates_and_events_with_gpt(possible_date, date_context):
    """
    Extracts dates and events from text using GPT-3.5 model.

    Args:
        possible_date (str): The extracted possible date.
        date_context (str): The context in which the date occurred.

    Returns:
        str: A string containing the extracted event details or "INVALID DATE" if the possible date is not valid.
    """
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant helping to validate dates, standardize dates, extract the short and long-form event descriptions of that date from document texts."},
            {"role": "user", "content": f"""Given the extracted possible date and the context in which the date occurred, do the following steps: 1. Is the extracted date a valid date? There were some invalid numbers extracted such as the address, etc, if the given possible date is indeed not a date, then only return INVALID DATE and don't continue to the next steps. Please note that if you get 'today', 'tomorrow', 'the next day', or other variants that indeed indicate some dates, these are valid dates, and your job is given the context, identify to what exact date these are referring to. When either validating the dates, or identifying the dates from context, return the date in a standardized format, dd/mm/yyyy. Finally, if you are able to identify a date range, that is from some dd/mm/yyyy - dd/mm/yyyy, then return only the START date of the range. 2. Now with the validated, identified, standardized date dd/mm/yyyy, identify what event happened on that date, and return the 4-5 word event title, and one sentence event description/summary.\n\nExtracted possible date: {possible_date}\n\nContext for extracting the event from date (the chunk were date was found is marked with DATE CHUNK): {date_context}\n\n
            EXAMPLE where it is not a date but an address:\n
            Extracted possible date: 48798\n\n
            Context for extracting the event from date (the chunk where the date was found is marked with DATE CHUNK):
                                            DATE CHUNK: INTERNATIONAL CENTRE FOR SETTLEMENT OF INVESTMENT DISPUTES 
                                            In the arbitration proceeding between 
                                            Astracommex Regional Satellite Communication Inc. 
                                            (Claimant) 
                                            and 
                                            The Republic of Celestria 
                                            (Respondent) 
                                            REQUEST FOR ARBITRATION 
                                            9 September 2022 
                                            For the Claimant: 
                                            Ms. Astrid Stellaris 
                                            AstroJuris Arbitration 
                                            3 Saturn St., 48798 Stelaria 
                                            Nebuland\n\n
            INVALID DATE
            \n\nEXAMPLE where it is a non-standard date and there is a date range:\n
            Extracted possible date: this week\n\n
            Context for extracting the event from date (the chunk where the date was found is marked with DATE CHUNK):
                                            Exhibit C-7: Press Report of 5 January 2021 on Solar Radiation Storm
                                            PLANETARY NEWS
                                            TRENDING
                                            WEATHER
                                            ASTROGEOLOGY
                                            ASTROBIOLOGY
                                            Home > Tag > Trending
                                            Solar Radiation Storm Four Times Earth’s Size Recorded by
                                            Amateur Astronomer in Celestria
                                            By Lucy Franks 5 Jan 2021 14:35 CET
                                            An amateur astronomer and Celestrian educational YouTuber named Rick Carrington captured a
                                            solar storm emerging away from the Sun where “four entire Earths” could fit in.
                                            Solar radiation storms happen when a massive magnetic explosion, typically triggering a coronal
                                            mass ejection and a related solar flare, speeds up charged particles in the sun’s atmosphere to
                                            extremely high speeds.
                                            The key particles in these events are protons, which can reach speeds close
                                            to that of light.
                                            These high-energy protons, when they hit satellites or humans in space, can deeply
                                            penetrate the impacted objects, potentially harming electronic components or biological DNA.
                                            DATE CHUNK: AS100 collision was said to be caused by the solar storm observed this week.
                                            In the past few months, there’s been a rise in solar activity on the sun, the sole star of our solar
                                            system, as part of the ongoing Solar Cycle.
                                            While solar storms are a regular occurrence, their
                                            observation and recording hold great scientific value.
                                            Solar Storm Video
                                            Carrington detected the unusual space storm on a video
                                            posted on Instagram on 1 January 2021. In the caption, the
                                            amateur astronomer stated he was not expecting to capture
                                            the event, which happened after he pointed out his telescope
                                            on 30 December 2020. In the video, it can be noticed the
                                            Sun emitted a curve-shaped plasma out of its surface.
                                            The video has gained thousands of views and even caught
                                            the attention of Celestrian Space Agency.
                                            Rick Carrington is reportedly well recognized for uploading
                                            videos in matters relating to astronomy, astrophotography,
                                            telescopes and cameras.
                                            Note: CET refers to “Celestrian Eastern Time.\n\n
            DATE: 05/01/2021\n
            EVENT TITLE: Solar storm observed\n
            EVENT SUMMARY: An amateur astronomer named Rick Carrington captured a solar storm, which emitted a curve-shaped plasma out of the Sun's surface, gaining thousands of views on Instagram and catching the attention of the Celestrian Space Agency.

            Remember, return only the structure shown above: either INVALID DATE, or DATE: EVENT TITLE: EVENT SUMMARY:. DO NOT ADD any other information. Give standardized answers ONLY in the given format. Even if you are able to identify an event's start and end dates, give only the start date and nothing beyond the standardized format, as the output will later be processed with regex, not an LLM.
            """},
        ]
    )

    return response.choices[0].message.content
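
The prompt asks for a fixed `DATE:` / `EVENT TITLE:` / `EVENT SUMMARY:` layout so that the reply can be processed with regex. A minimal sketch of such a parser follows; `parse_gpt_response` and `FIELD_PATTERN` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper (an assumption, not the notebook's own parser): reads the
# "DATE: / EVENT TITLE: / EVENT SUMMARY:" reply layout requested in the prompt.
FIELD_PATTERN = re.compile(
    r"^[ \t]*(DATE|EVENT TITLE|EVENT SUMMARY)[ \t]*:[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_gpt_response(response_text):
    """Return a dict of the fields found, or None for an INVALID DATE reply."""
    if response_text.strip().upper().startswith("INVALID DATE"):
        return None
    fields = {}
    for name, value in FIELD_PATTERN.findall(response_text):
        # Keep only the first occurrence of each field
        fields.setdefault(name.upper(), value.strip())
    return fields

sample_reply = (
    "DATE: 05/01/2021\n\n"
    "EVENT TITLE: Solar storm observed\n\n"
    "EVENT SUMMARY: A solar storm was captured on video."
)
print(parse_gpt_response(sample_reply))
```

Because the pattern is anchored to the start of each line, a summary that happens to contain the text `date:` is still read as a summary.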

In [31]:
def clean_date(date_text):
    """
    Cleans the extracted date to ensure it is in a valid format.

    Args:
        date_text (str): The extracted date text.

    Returns:
        str: The cleaned date text, or an empty string if the date is invalid.
    """
    # Remove any characters other than digits and '/'
    cleaned_date = re.sub(r'[^\d/]', '', date_text)
    
    # Check if the cleaned date matches a valid date format
    try:
        datetime.strptime(cleaned_date, '%d/%m/%Y')
        return cleaned_date
    except ValueError:
        return ""  # Return an empty string if the date is invalid
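
Since `clean_date` strips every character except digits and slashes, a reply that still contains a range such as `05/01/2021 - 10/01/2021` collapses into one string that fails to parse. `strptime` also accepts `5/1/2021`, which is returned unpadded and so becomes a different dictionary key from `05/01/2021`. One possible alternative, sketched under the assumption that only the first date matters, is to search for the first `dd/mm/yyyy` match and zero-pad it. `extract_first_date` is a hypothetical name, not a function from the notebook.

```python
import re
from datetime import datetime

# Hypothetical alternative to clean_date (an assumption for illustration)
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_first_date(date_text):
    """Return the first valid dd/mm/yyyy date in date_text, zero-padded, or ''."""
    for match in DATE_PATTERN.finditer(date_text):
        day, month, year = match.groups()
        candidate = f"{int(day):02d}/{int(month):02d}/{year}"
        try:
            datetime.strptime(candidate, "%d/%m/%Y")
            return candidate
        except ValueError:
            continue  # Impossible date such as 31/02/2021; try the next match
    return ""

print(extract_first_date("05/01/2021 - 10/01/2021"))
print(extract_first_date("5/1/2021."))
```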

In [32]:
def sort_dictionary_date_keys(dictionary):
    """
    Sorts a dictionary with date keys chronologically.

    Args:
        dictionary (dict): A dictionary with date keys in the format 'dd/mm/yyyy'.

    Returns:
        dict: A new dictionary with date keys sorted chronologically and their corresponding values. Keys that do not parse as 'dd/mm/yyyy' are dropped.
    """
    valid_dates = []
    for date in dictionary.keys():
        try:
            datetime.strptime(date, '%d/%m/%Y')
            valid_dates.append(date)
        except ValueError:
            continue
            
    # Sort dates chronologically
    sorted_dates = sorted(valid_dates, key=lambda x: datetime.strptime(x, '%d/%m/%Y'))
    
    # Create a new dictionary with the sorted keys and their corresponding values
    sorted_dictionary = {date: dictionary[date] for date in sorted_dates}  

    return sorted_dictionary
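
The keys are parsed with `datetime.strptime` before sorting because `dd/mm/yyyy` strings do not sort chronologically as plain text. A small illustration, using three dates that appear in the output further down:

```python
from datetime import datetime

dates = ["02/12/2003", "18/03/1965", "27/01/1980"]

# Plain string sorting compares the day digits first, which is not chronological
lexicographic = sorted(dates)

# Parsing each key into a datetime gives the intended order
chronological = sorted(dates, key=lambda d: datetime.strptime(d, "%d/%m/%Y"))

print(lexicographic)
print(chronological)
```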

In [33]:
def extract_dates_and_events(document_texts_cleaned_chunks, nlp=None):
    '''
    Extracts dates and events using spaCy and GPT.

    Args:
        document_texts_cleaned_chunks (dict): A dictionary containing document names as keys and lists of (chunk index, chunk text) tuples as values.
        nlp (spacy.language.Language, optional): A loaded spaCy pipeline. If omitted, "en_core_web_sm" is loaded.

    Returns:
        dict: A dictionary where dates are keys and values are lists of (event title, event summary, chunk index, document name) tuples.

    Note: If the GPT response indicates "INVALID DATE", the corresponding event information will not be added to the output dictionary.
    '''
    # Load the spaCy model at call time rather than at function-definition time
    if nlp is None:
        nlp = spacy.load("en_core_web_sm")

    dates_and_events = {}
    
    for document_name, chunks in document_texts_cleaned_chunks.items():
        for chunk_index, chunk_text in chunks:
            # Use spaCy to extract potential dates
            doc = nlp(chunk_text)
    
            # Initialize a list to store dates found in the chunk
            chunk_dates = []
    
            # Extract and process recognized dates
            for ent in doc.ents:
                if ent.label_ == 'DATE':
                    chunk_dates.append((ent.text, chunk_index))
    
            if chunk_dates:
                for date in chunk_dates:
                    date_chunk_index = date[1]
                    
                    # Calculate the indices of the previous and next chunks
                    prev_chunk_indices = list(range(max(0, date_chunk_index - 2), date_chunk_index))
                    next_chunk_indices = list(range(date_chunk_index + 1, min(len(chunks), date_chunk_index + 3)))
    
                    context_chunk_indices_for_date_chunk = sorted(set(prev_chunk_indices + [date_chunk_index] + next_chunk_indices))
    
                    context_chunk = ''
                    for idx in context_chunk_indices_for_date_chunk:
                        if idx == date_chunk_index:
                            context_chunk += f'DATE CHUNK: {chunks[idx][1]}\n'
                        else:
                            context_chunk += f'{chunks[idx][1]}\n'
    
                    response = extract_dates_and_events_with_gpt(date[0], context_chunk)
                    print(response)
                    print('----')
    
                    # Process the response and extract event information
                    event_info = {}
                    for line in response.split("\n"):
                        # Match on the line prefix so a summary containing "date:" is not misread
                        line_lower = line.strip().lower()
                        if line_lower.startswith("date:"):
                            extracted_date = line.split(":", 1)[1].strip()
                            # Clean the extracted date
                            cleaned_date = clean_date(extracted_date)
                            event_info["DATE"] = cleaned_date
                        elif line_lower.startswith("event title:"):
                            event_info["EVENT TITLE"] = line.split(":", 1)[1].strip()
                        elif line_lower.startswith("event summary:"):
                            event_info["EVENT SUMMARY"] = line.split(":", 1)[1].strip()
                    
                    # Add the event information to the dates_and_events dictionary.
                    # clean_date returns "" for anything that is not a valid dd/mm/yyyy date,
                    # so an empty value covers both "INVALID DATE" replies and malformed dates.
                    extracted_date = event_info.get("DATE", "")
                    if extracted_date:
                        event_title = event_info.get("EVENT TITLE", "No event title")
                        event_summary = event_info.get("EVENT SUMMARY", "No event summary")
                        event = (event_title, event_summary, date_chunk_index, document_name)
                        dates_and_events.setdefault(extracted_date, []).append(event)
            
    sorted_dates_and_events = sort_dictionary_date_keys(dates_and_events)   
    
    return sorted_dates_and_events
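
The context passed to GPT is the date chunk plus up to two chunks on either side, clipped at the document boundaries. That index arithmetic can be isolated as below. `context_window_indices` is a hypothetical helper name, and the sketch assumes, as the function above does, that each chunk's index equals its position in the list.

```python
def context_window_indices(date_chunk_index, n_chunks, window=2):
    """Indices of the date chunk plus up to `window` neighbours on each side."""
    start = max(0, date_chunk_index - window)
    stop = min(n_chunks, date_chunk_index + window + 1)
    return list(range(start, stop))

# First chunk, a middle chunk, and the last chunk of a 10-chunk document
print(context_window_indices(0, 10))
print(context_window_indices(5, 10))
print(context_window_indices(9, 10))
```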

In [34]:
sorted_dates_and_events = extract_dates_and_events(document_texts_cleaned_chunks)

DATE: 11/11/2022

EVENT TITLE: RESPONSE TO THE REQUEST FOR ARBITRATION

EVENT SUMMARY: The Republic of Celestria submits a response to Astracommex Regional Satellite Communication Inc.'s Request for Arbitration in an arbitration proceeding regarding jurisdiction over the dispute.
----
INVALID DATE
----
DATE: 27/01/1980
EVENT TITLE: Vienna Convention on the Laws of Treaties
EVENT SUMMARY: The Vienna Convention on the Laws of Treaties entered into force on 27 January 1980, defining provisions regarding the binding nature of treaties in relation to acts or situations prior to the treaty's entry into force.
----
DATE: 10/10/2022

EVENT TITLE: Interpretive statement issued

EVENT SUMMARY: The contracting parties of the BIT issued a binding interpretive statement on 10 October 2022, excluding disputes that arose prior to its entry into force.
----
DATE: 15/10/2020

EVENT TITLE: Astracommex Regional's Letter Dispute

EVENT SUMMARY: On 15 October 2020, Astracommex Regional sent a letter contes

In [35]:
sorted_dates_and_events

{'18/03/1965': [('ICSID Convention on the Settlement of Investment Disputes',
   'The Convention on the Settlement of Investment Disputes between States and Nationals of Other States was signed on 18 March 1965, providing a framework for the resolution of investment disputes between investors and States.',
   8,
   'fdi_moot_case_2024_part_34')],
 '14/10/1966': [('ICSID Convention Entered into Force',
   'The Convention on the Settlement of Investment Disputes between States and Nationals of Other States entered into force on 14 October 1966, establishing rules for resolving investment disputes.',
   0,
   'fdi_moot_case_2024_part_1')],
 '27/01/1980': [('Vienna Convention on the Laws of Treaties',
   "The Vienna Convention on the Laws of Treaties entered into force on 27 January 1980, defining provisions regarding the binding nature of treaties in relation to acts or situations prior to the treaty's entry into force.",
   1,
   'fdi_moot_case_2024_part_12')],
 '02/12/2003': [('Free Tra

In [36]:
# Print the sorted events
for date, events_list in sorted_dates_and_events.items():
    print(f"Date: {date}")
    for event in events_list:
        event_title, event_summary, date_chunk_index, document_name = event
        print(f"Event Title: {event_title}")
        print(f"Event Summary: {event_summary}")
        print(f"Date Chunk Index: {date_chunk_index}")
        print(f"Document Name: {document_name}")
        print()

Date: 18/03/1965
Event Title: ICSID Convention on the Settlement of Investment Disputes
Event Summary: The Convention on the Settlement of Investment Disputes between States and Nationals of Other States was signed on 18 March 1965, providing a framework for the resolution of investment disputes between investors and States.
Date Chunk Index: 8
Document Name: fdi_moot_case_2024_part_34

Date: 14/10/1966
Event Title: ICSID Convention Entered into Force
Event Summary: The Convention on the Settlement of Investment Disputes between States and Nationals of Other States entered into force on 14 October 1966, establishing rules for resolving investment disputes.
Date Chunk Index: 0
Document Name: fdi_moot_case_2024_part_1

Date: 27/01/1980
Event Title: Vienna Convention on the Laws of Treaties
Event Summary: The Vienna Convention on the Laws of Treaties entered into force on 27 January 1980, defining provisions regarding the binding nature of treaties in relation to acts or situations prior 

## Post-process the extracted events (handle duplicate events, ...)

In [37]:
def split_events_by_count(sorted_dates_and_events):
    """
    Splits a dictionary of dates and corresponding events into two dictionaries based on the count of events.

    Args:
        sorted_dates_and_events (dict): A dictionary with date keys and corresponding lists of events.

    Returns:
        tuple: A tuple containing two dictionaries:
            - single_event_dates: A dictionary with dates as keys and single-element event lists as values.
            - multi_event_dates: A dictionary with dates as keys and lists of multiple events as values.
    """
    single_event_dates = {}
    multi_event_dates = {}
    
    for date, events_list in sorted_dates_and_events.items():
        if len(events_list) == 1:
            single_event_dates[date] = events_list
        else:
            multi_event_dates[date] = events_list
    
    return single_event_dates, multi_event_dates

In [38]:
def compare_texts(text1, text2, engine="text-embedding-3-small"):
    """
    Compares the similarity between two texts using embeddings and cosine similarity.

    Args:
        text1 (str): The first text to compare.
        text2 (str): The second text to compare.
        engine (str, optional): The name of the OpenAI embedding model to use. Defaults to "text-embedding-3-small".

    Returns:
        float: The cosine similarity between the embeddings of the two texts.
    """
    # Embed both texts in a single API request; results come back in input order
    response = openai.embeddings.create(input=[text1, text2], model=engine)
    embedding1 = np.array(response.data[0].embedding)
    embedding2 = np.array(response.data[1].embedding)
    
    # Calculate cosine similarity
    similarity = 1 - cosine(embedding1, embedding2)
    
    return similarity
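
`scipy.spatial.distance.cosine` returns the cosine *distance*, which is why the function subtracts it from 1 to obtain a similarity. The quantity being computed can be written out in plain Python as a quick check. `cosine_similarity` here is an illustrative helper, not part of the notebook.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # 45 degrees apart
```

Values close to 1 mean the two embeddings point in nearly the same direction, which is what the title and summary thresholds below are compared against.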

In [39]:
def merge_events(events_list, title_threshold, summary_threshold):
    """
    Merges similar events within a list based on title and summary similarities.

    Args:
        events_list (list): A list of tuples containing event titles, summaries, chunk indices, and document names.
        title_threshold (float): Threshold for title similarity.
        summary_threshold (float): Threshold for summary similarity.

    Returns:
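        list: A list of merged events. An event found in a single location stays a
            (title, summary, chunk index, document name) tuple; an event found in several
            locations becomes (title, summary, [(chunk index, document name), ...]).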
    """
    merged_events = []
    merged_indices = []
    
    for i, (title_i, summary_i, chunk_index_i, document_name_i) in enumerate(events_list):
        if i in merged_indices:
            continue
        
        merged_indices.append(i)
        merged_indices_temp = [i]
        merged_documents_temp = [(chunk_index_i, document_name_i)]  # Store the (chunk index, document name) tuple of the current event
        
        for j, (title_j, summary_j, chunk_index_j, document_name_j) in enumerate(events_list[i+1:], start=i+1):
            if j in merged_indices:
                continue
            
            title_similarity = compare_texts(title_i, title_j)
            summary_similarity = compare_texts(summary_i, summary_j)
            
            if title_similarity > title_threshold and summary_similarity > summary_threshold:
                merged_indices.append(j)
                merged_indices_temp.append(j)
                merged_documents_temp.append((chunk_index_j, document_name_j))  # Store the (chunk index, document name) tuple of the other event
        
        # Keep the title and summary of the first event in the group
        merged_title = title_i
        merged_summary = summary_i
        merged_chunk_indices_final = list(dict.fromkeys(merged_documents_temp))  # Remove duplicates while preserving order

        # If there is only one tuple, expand it
        if len(merged_chunk_indices_final) == 1:
            merged_chunk_index, merged_document_name = merged_chunk_indices_final[0]
            merged_events.append((merged_title, merged_summary, merged_chunk_index, merged_document_name))
        else:
            merged_events.append((merged_title, merged_summary, merged_chunk_indices_final))
    
    return merged_events
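
The merging above is greedy: each not-yet-merged event becomes a seed, and every later event that is similar enough to that seed joins its group. The same control flow can be tried offline with a pluggable similarity function. In this sketch `group_similar` and `string_similarity` are hypothetical names, and `difflib.SequenceMatcher` is only a stand-in for the embedding comparison, so its scores are not comparable to the embedding thresholds.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Stand-in for compare_texts: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_similar(items, threshold, similarity=string_similarity):
    """Greedy grouping: each unused item seeds a group of later similar items."""
    groups = []
    used = set()
    for i, seed in enumerate(items):
        if i in used:
            continue
        used.add(i)
        group = [seed]
        for j in range(i + 1, len(items)):
            # Later items are compared against the seed only, not the whole group
            if j not in used and similarity(seed, items[j]) > threshold:
                used.add(j)
                group.append(items[j])
        groups.append(group)
    return groups

titles = ["Solar storm observed", "Solar Storm Observed", "Request for arbitration filed"]
print(group_similar(titles, threshold=0.75))
```

Because comparisons are made against the seed only, the result can depend on the order of the events in the list.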

In [40]:
def merge_duplicate_events(sorted_dates_and_events, title_threshold=0.75, summary_threshold=0.7):
    """
    Merges duplicate events within a dictionary of sorted dates and events.

    Args:
        sorted_dates_and_events (dict): A dictionary containing sorted dates as keys and lists of events as values.
        title_threshold (float, optional): Threshold for title similarity. Defaults to 0.75.
        summary_threshold (float, optional): Threshold for summary similarity. Defaults to 0.7.

    Returns:
        dict: A dictionary with merged events.
    """
    single_event_dates, multi_event_dates = split_events_by_count(sorted_dates_and_events)
    
    merged_dates_and_events = {}
    
    # Merge events for dates with multiple events
    for date, events_list in multi_event_dates.items():
        merged_events = merge_events(events_list, title_threshold, summary_threshold)
        merged_dates_and_events[date] = merged_events
    
    # Combine single-event dates with merged multi-event dates
    merged_dates_and_events.update(single_event_dates)

    # Sort dates
    merged_dates_and_events = sort_dictionary_date_keys(merged_dates_and_events)
    
    return merged_dates_and_events
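
`merge_events` calls `compare_texts` once per pair of events, so the seed event's title and summary are re-embedded for every comparison. One way to reduce repeated requests, sketched here as an assumption rather than as the notebook's approach, is to memoize the embedding lookup. `make_cached_embedder` and `fake_embed` are hypothetical names; in practice the wrapped function would be whatever performs the embedding request.

```python
from functools import lru_cache

def make_cached_embedder(embed_fn):
    """Wrap embed_fn so each distinct text is embedded only once."""
    @lru_cache(maxsize=None)
    def cached(text):
        # Store a tuple so the cached value cannot be mutated by callers
        return tuple(embed_fn(text))
    return cached

# Stand-in embedding function that records how often it is called
calls = []

def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]

embed = make_cached_embedder(fake_embed)
embed("Solar storm observed")
embed("Solar storm observed")  # served from the cache
embed("Request for arbitration")
print(len(calls))
```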

In [41]:
merged_dates_and_events = merge_duplicate_events(sorted_dates_and_events)

In [42]:
merged_dates_and_events

{'18/03/1965': [('ICSID Convention on the Settlement of Investment Disputes',
   'The Convention on the Settlement of Investment Disputes between States and Nationals of Other States was signed on 18 March 1965, providing a framework for the resolution of investment disputes between investors and States.',
   8,
   'fdi_moot_case_2024_part_34')],
 '14/10/1966': [('ICSID Convention Entered into Force',
   'The Convention on the Settlement of Investment Disputes between States and Nationals of Other States entered into force on 14 October 1966, establishing rules for resolving investment disputes.',
   0,
   'fdi_moot_case_2024_part_1')],
 '27/01/1980': [('Vienna Convention on the Laws of Treaties',
   "The Vienna Convention on the Laws of Treaties entered into force on 27 January 1980, defining provisions regarding the binding nature of treaties in relation to acts or situations prior to the treaty's entry into force.",
   1,
   'fdi_moot_case_2024_part_12')],
 '02/12/2003': [('Free Tra

In [43]:
def display(merged_dates_and_events):
    """
    Displays merged dates and events.

    Args:
        merged_dates_and_events (dict): A dictionary containing merged events with dates as keys.

    Returns:
        None
    """
    ordered_dates = list(merged_dates_and_events.keys())
    
    for date in ordered_dates:
        for event in merged_dates_and_events[date]:
            if isinstance(event[2], int):
                event_title, event_summary, date_chunk_index, document_name = event
                print(f"DATE: {date}")
                print(f"Event Title: {event_title}")
                print(f"Event Summary: {event_summary}")
                print(f"Date Chunk Index: {date_chunk_index}")
                print(f"Document Name: {document_name}")
                print()
            elif isinstance(event[2], list):
                event_title, event_summary, chunk_index_and_doc_name_list = event
                print(f"DATE: {date}")
                print(f"Event Title: {event_title}")
                print(f"Event Summary: {event_summary}")
                for (date_chunk_index, document_name) in chunk_index_and_doc_name_list:
                    print(f"Date Chunk Index: {date_chunk_index}")
                    print(f"Document Name: {document_name}")
                print()

In [44]:
display(merged_dates_and_events)

DATE: 18/03/1965
Event Title: ICSID Convention on the Settlement of Investment Disputes
Event Summary: The Convention on the Settlement of Investment Disputes between States and Nationals of Other States was signed on 18 March 1965, providing a framework for the resolution of investment disputes between investors and States.
Date Chunk Index: 8
Document Name: fdi_moot_case_2024_part_34

DATE: 14/10/1966
Event Title: ICSID Convention Entered into Force
Event Summary: The Convention on the Settlement of Investment Disputes between States and Nationals of Other States entered into force on 14 October 1966, establishing rules for resolving investment disputes.
Date Chunk Index: 0
Document Name: fdi_moot_case_2024_part_1

DATE: 27/01/1980
Event Title: Vienna Convention on the Laws of Treaties
Event Summary: The Vienna Convention on the Laws of Treaties entered into force on 27 January 1980, defining provisions regarding the binding nature of treaties in relation to acts or situations prior 