In [1]:
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
import re
from dotenv import load_dotenv
import pandas as pd
from IPython.display import Markdown, display, HTML
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
load_dotenv()
os.chdir(os.path.dirname(os.getcwd()))

In [2]:
summary_df = pd.read_parquet('citation_summary_keywords.parquet')
df = pd.read_parquet('./data/splade.parquet')
summary_df.shape, df.shape

((117, 6), (1987, 24))

In [3]:
df1 = df.drop_duplicates(subset=['context'], keep='first')
df.shape, df1.shape

((1987, 24), (1987, 24))
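
The identical shapes above suggest that no `context` value repeats in the data. A minimal sketch of the same check on a toy frame follows; the column values here are made up for illustration and are not taken from the parquet file.

```python
import pandas as pd

# Toy data (hypothetical): the first and third rows share the same context text.
toy = pd.DataFrame({"context": ["x", "y", "x"], "id": [1, 2, 3]})

# Count rows whose context already appeared earlier in the frame.
dup_count = int(toy["context"].duplicated().sum())

# Keep only the first row for each distinct context, as in the cell above.
deduped = toy.drop_duplicates(subset=["context"], keep="first")

print(dup_count, deduped.shape)
```

Counting duplicates first makes it explicit whether `drop_duplicates` removed anything, rather than inferring it from the shapes.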

In [4]:
df1.head(2)

Unnamed: 0,index,id,citation,name,name_abbreviation,decision_date,court_id,court_name,court_slug,judges,attorneys,citations,url,head,body,name_contains_lm,body_contains_lm,year,context,context_citation,context_tokens,openai_embeddings,splade_embeddings,state
0,0,411690,154 Ill. 2d 90,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",Johnson v. Halloran,2000-01-13,8837,Illinois Appellate Court,ill-app-ct,[],"['Wolter, Beeman, Lynch & McIntyre, of Springf...","[{'type': 'official', 'cite': '312 Ill. App. 3...",https://api.case.law/v1/cases/411690/,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",JUSTICE HALL\r\ndelivered the opinion of the c...,False,True,2000,The public defender of Cook County was appoint...,154 Ill. 2d 90,1317,"[-0.0017778094625100493, -0.002360282698646187...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",New Mexico
1,2,411690,111 Ill. 2d 229,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",Johnson v. Halloran,2000-01-13,8837,Illinois Appellate Court,ill-app-ct,[],"['Wolter, Beeman, Lynch & McIntyre, of Springf...","[{'type': 'official', 'cite': '312 Ill. App. 3...",https://api.case.law/v1/cases/411690/,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",JUSTICE HALL\r\ndelivered the opinion of the c...,False,True,2000,The defense in the criminal case was then assi...,111 Ill. 2d 229,1346,"[-0.005424648057669401, -0.0027988876681774855...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",Nevada


In [5]:
test_query = """
Regarding the pollution exclusion clause under the terms of comprehensive general liability (CGL) insurance, \
how have the courts defined the phrase "sudden and accidental", in particular for polluting events? Also, has there been any consideration for intentional vs unintentional polluting events?
"""
Markdown(test_query)


Regarding the pollution exclusion clause under the terms of comprehensive general liability (CGL) insurance, how have the courts defined the phrase "sudden and accidental", in particular for polluting events? Also, has there been any consideration for intentional vs unintentional polluting events?


In [6]:
from src.agent.tools.semantic_search import SemanticSearchConfig
from src.agent.tools.splade_search import SPLADESearchConfig
from src.search.query_expansion import segment_search_query

DATA_PATH = "data/splade.parquet"

SEMANTIC_TEXT_COLUMN = "context"
SPLADE_TEXT_COLUMN = "context"

semantic_config = SemanticSearchConfig(
    data_path=DATA_PATH,
    text_column=SEMANTIC_TEXT_COLUMN,
)
keyword_config = SPLADESearchConfig(
    data_path=DATA_PATH,
    text_column=SPLADE_TEXT_COLUMN,
)

queries = segment_search_query(
    test_query,
    n='2 to 3'
)

In [7]:
search_results = queries.execute(
    semantic_config,
    keyword_config,
)

2024-04-27 18:55:44 - INFO - Read in df with shape: (1987, 24)
2024-04-27 18:55:45 - INFO - Read in df with shape: (1987, 24)
2024-04-27 18:55:45 - INFO - Using pre-computed 'context' embeddings from existing column: splade_embeddings
2024-04-27 18:55:45 - INFO - 

Thought: To understand how courts have interpreted the phrase 'sudden and accidental' within the context of
pollution exclusion clauses in CGL insurance policies, it's necessary to explore case law and legal
interpretations. This will involve identifying key court decisions that have addressed this phrase
and analyzing the reasoning behind these interpretations.
Search topic: Court Interpretations of 'Sudden and Accidental'
2024-04-27 18:55:45 - INFO - Running vector (OpenAI) search: How have courts defined the phrase 'sudden and accidental' in the context of pollution exclusion clauses in comprehensive general liability insurance policies?
2024-04-27 18:55:45 - INFO - Runnin

In [8]:
search_results[1].to_pandas['search_type'].value_counts()

search_type
vector    15
splade     5
Name: count, dtype: int64

In [9]:
search_results[1].metrics[:3]

[ChunkMetric(source='SemanticSearch', document_id='1620', score=0.8667, rank=1),
 ChunkMetric(source='SemanticSearch', document_id='585', score=0.8665, rank=2),
 ChunkMetric(source='SemanticSearch', document_id='628', score=0.8644, rank=3)]

In [23]:
import openai
from src.search.rank_gpt import RankGPTRerank, get_default_llm

llm = get_default_llm()

reranker = RankGPTRerank(
    top_n=10,
    llm=llm,
    verbose=True,
)

In [25]:
test_rerank = reranker.rerank_dataframe(
    df=search_results[0].to_pandas,
    query=test_query,
    text_column='context'
)

After Reranking, new rank list for nodes: [5, 13, 7, 8, 10, 11, 14, 15, 16, 17, 18, 0, 1, 2, 3, 4, 6, 9, 12, 19]
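
The log line above reads as a permutation of the original result positions, best first. The internals of `RankGPTRerank` are not shown, so the following is only a sketch of how such a list could be applied to keep the top results; `apply_rank_order` and the `chunk-N` labels are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def apply_rank_order(items: Sequence[T], new_order: Sequence[int], top_n: int) -> List[T]:
    """Reorder items by a list of original positions and keep the first top_n."""
    # Guard against a malformed ranking (missing or repeated positions).
    if sorted(new_order) != list(range(len(items))):
        raise ValueError("new_order must be a permutation of the item positions")
    return [items[i] for i in new_order[:top_n]]


# The permutation printed by the reranker above.
new_order = [5, 13, 7, 8, 10, 11, 14, 15, 16, 17, 18, 0, 1, 2, 3, 4, 6, 9, 12, 19]
chunks = [f"chunk-{i}" for i in range(20)]  # placeholder labels for the 20 results

top_chunks = apply_rank_order(chunks, new_order, top_n=10)
print(top_chunks[:3])
```

The permutation check matters with LLM-based rerankers, since a generated ranking can drop or repeat positions.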

In [30]:
search_results[0].sub_question

"How have California courts defined the phrase 'sudden and accidental' in the context of pollution exclusion clauses in comprehensive general liability (CGL) insurance policies?"

In [40]:
Markdown(search_results[0].retrieval_docs[0].page_content[:2500])

However, as discussed in detail below, such a showing, by itself, is not sufficient to prove an “occurrence” because, under the express language of the policy, plaintiffs have to further show that the “exposure to conditions” was unexpected and unintended. Because of our holding in Emerson I that under Missouri law, the standard-form “sudden and accidental” exception to the pollution exclusion precludes coverage for property damage caused by gradual, non-abrupt releases of pollutants, plaintiffs are precluded from coverage of gradual pollution at the third-party sites under the 1984-85 policy. The National Priorities List is the EPA’s list of the most serious sites of known or threatened releases of hazardous substances, pollutants, or contaminants throughout the United States. The Hatfield, Pennsylvania, site was a subject of our prior opinion in Emerson I. We held that plaintiffs had proven the “exposure to conditions” component of the definition of “occurrence” at that site and were not required to identify a specific event or release which caused the exposure. Emerson I, 319 Ill. App. 3d at 253-54, 743 N.E.2d at 655-56. However, as previously noted, having cleared that hurdle, plaintiffs still had to prove that the exposure was unexpected and unintended. The EPA defines a surface impoundment as a topographic depression, excavation, or diked area, primarily formed from earthen materials (lined or unlined) and designed to hold accumulated liquid wastes, wastes containing free liquids, or sludges. The TWA court, likewise, did not decide that question, having found that, irrespective of whether the term “sudden” must be interpreted with a temporal connotation, the releases were not “accidental.” As noted above, plaintiffs limited their motion as to the Hatfield, Pennsylvania, site to the 1983-84 policy. 
For the purposes of this analysis, we need not draw a distinction between the standard “sudden and accidental” exception to the pollution exclusion and the customized “accidental” exception. The court emphasized that “ ‘[t]here is a vast difference between an intended act, and an intended result.’ ” White, 440 S.W.2d at 507, quoting Murray v. Landenberger, 5 Ohio App. 2d 294, 299, 215 N.E.2d 412, 415-16 (1966). “Although the court discussed the decision of the Eighth Circuit reversing in part the district court’s holding in General Dynamics, it did so solely in reference to whether the term “sudden” in the “sudden and accidental” exception to the pollution 

In [10]:
from src.agent.tools.utils import ResearchReport
import openai
import instructor
from tenacity import Retrying, stop_after_attempt, wait_fixed

from src.utils.gen_utils import count_tokens


def create_context(
    df: pd.DataFrame,
    text_column: str,
    context_token_limit: int = 10000
) -> str:
    """
    Creates a context string from a DataFrame within a specified token limit,
    applying word wrapping to each row's text.

    Args:
        df (pd.DataFrame): The DataFrame containing case data; must include a 'url' column.
        text_column (str): The name of the column holding the text to include.
        context_token_limit (int): The maximum number of tokens for the context.

    Returns:
        str: A formatted string containing case details within the token limit,
             with word wrapping applied to each row's text.
    """
    import textwrap

    # Work on a re-indexed copy so the caller's DataFrame is not modified.
    df = df.reset_index(drop=True)
    returns = []
    count = 1
    total_tokens = 100  # Starting token count to account for initial text.
    # Add the text to the context until the context is too long.
    for _, row in df.iterrows():
        wrapped_summary = textwrap.fill(row[text_column], width=100)
        text = (
            f"[{count}] {row['url']}\n"
            f"Summary: {wrapped_summary}\n"
            f"{'-'*100}\n"
        )
        text_tokens = count_tokens(text)
        if total_tokens + text_tokens > context_token_limit:
            break
        returns.append(text)
        total_tokens += text_tokens
        count += 1
    return "\n\n".join(returns)


def create_formatted_input(
    df: pd.DataFrame,
    query: str,
    text_column: str,
    context_token_limit: int = 25000,
    instructions: str = """
Using only the provided context, offer insights on relevancy of the past case(s) and how the legal researcher might reference them to address the sub-question, and ultimately the main question. 
Use structured markdown formatting and ALWAYS include markdown hyperlink citations when referencing the context, for example `[13](url goes here)`, and cite as many relevant cases as possible.
    """
) -> str:

    context = create_context(df, text_column, context_token_limit)

    # Assemble the prompt: numbered context first, then the instructions and the query.
    return f"""{context}\n\n{instructions}\n{query}\n\nAnalysis:"""
    

def get_final_answer(formatted_input: str, model_name: str) -> ResearchReport:
    client = instructor.patch(openai.OpenAI())
    return client.chat.completions.create(
        model=model_name,
        response_model=ResearchReport,
        max_retries=Retrying(
            stop=stop_after_attempt(5),
            wait=wait_fixed(1),
        ),
        messages=[
            {
                "role": "system",
                "content": "You are a helpful legal research assistant helping organize research results for a sub-question related to a higher level question. Using only the provided context, offer insights on relevancy of the past case(s) and how the legal researcher can reference them to address this sub-question. Make sure to use highly structured markdown formatting including markdown links for all references.",
            },
            {
                "role": "user",
                "content": f"{formatted_input}"
            },
        ],
    )
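
The token budgeting in `create_context` above is a greedy loop: entries are added in rank order until the next one would exceed the limit. A minimal standalone sketch of that idea follows. It assumes a whitespace word count as a stand-in for `count_tokens`, whose implementation is not shown here; `pack_context` and the example URLs are hypothetical.

```python
from typing import Callable, Dict, List


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def pack_context(
    rows: List[Dict[str, str]],
    token_limit: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    reserved_tokens: int = 0,
) -> List[str]:
    """Add numbered entries in order until the next one would exceed the budget."""
    entries: List[str] = []
    total = reserved_tokens
    for number, row in enumerate(rows, start=1):
        entry = f"[{number}] {row['url']}\nSummary: {row['context']}\n"
        cost = count_tokens(entry)
        if total + cost > token_limit:
            break  # stop at the first entry that does not fit
        entries.append(entry)
        total += cost
    return entries


rows = [
    {"url": "https://example.com/a", "context": "one two three"},
    {"url": "https://example.com/b", "context": "four five six seven"},
    {"url": "https://example.com/c", "context": "eight"},
]
packed = pack_context(rows, token_limit=12)
print(len(packed))
```

Because the loop breaks at the first overflow, a later, shorter entry is not considered even if it would fit. That keeps the numbering contiguous and preserves rank order, at the cost of leaving some of the budget unused.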


In [11]:
query = f"""
To help answer the main question we have a sub-question with research results we need to analyze.
Here is the main question:\n{test_query}
Here is the sub-question we need to address using the context:\n{search_results[0].sub_question}
Remember to include markdown links wherever possible, e.g., '[[2] Allstate vs. Smith](URL)'.
"""

In [12]:
formatted_input = create_formatted_input(
    df=search_results[0].to_pandas, 
    query=query, 
    text_column='context',
    context_token_limit=20000)

In [13]:
print(formatted_input)

[1] https://api.case.law/v1/cases/5454708/
Summary: However, as discussed in detail below, such a showing, by itself, is not sufficient to prove an
“occurrence” because, under the express language of the policy, plaintiffs have to further show that
the “exposure to conditions” was unexpected and unintended. Because of our holding in Emerson I that
under Missouri law, the standard-form “sudden and accidental” exception to the pollution exclusion
precludes coverage for property damage caused by gradual, non-abrupt releases of pollutants,
plaintiffs are precluded from coverage of gradual pollution at the third-party sites under the
1984-85 policy. The National Priorities List is the EPA’s list of the most serious sites of known or
threatened releases of hazardous substances, pollutants, or contaminants throughout the United
States. The Hatfield, Pennsylvania, site was a subject of our prior opinion in Emerson I. We held
that plaintiffs had proven the “exposure to conditions” component of 

In [14]:
response_model = get_final_answer(formatted_input, model_name="gpt-4-turbo")
Markdown(response_model.research_report)

### Legal Analysis: Definitions of 'Sudden and Accidental' in Pollution Exclusion Clauses

The examination of the phrase 'sudden and accidental' in the context of pollution exclusion clauses involves understanding how various courts have interpreted these terms within comprehensive general liability insurance policies. Here's a breakdown of several relevant cases:

1. **Emerson I** highlighted cases both before and after a 1986 Missouri Supreme Court decision restricted the concept of 'sudden' to its traditional meaning (immediate and unexpected) but allowed for broader interpretations of 'accidental' (unintended and unforeseen events). The cases are highly relevant as they show the phrase's evolution in judicial interpretation. See [[1](https://api.case.law/v1/cases/5454708/)].

2. **N.W. Electric and White cases** delve into interpretations of "accidental." They define an event as 'caused by accident' even if initial acts are intentional but result in unintended damage, reiterating that 'accidental' means lacks intent to harm. This nuanced view aligns with broader interpretations, arguing against a strictly temporal understanding of 'sudden.' See [[2](https://api.case.law/v1/cases/1527705/)].

3. **Emerson I's consideration of policy language** regarding 'sudden and accidental' showcases judicial efforts to seek a balance between insurer liability and policyholder protection. The case reflects deliberation on how gradual processes yet resulting from 'accidental' circumstances doesn't meet the sudden and accidental criteria, emphasizing the temporal element of 'sudden.' See [[1](https://api.case.law/v1/cases/5454708/)].

4. **General Dynamics and IPC interpretations** contrast different jurisdictions considering 'accident'. The cases show distinct court tendencies, with some stressing intent while others look at whether the polluter exercised control, making them a useful comparative analysis tool. Specifically, these decisions underline differing judicial perspectives within various states regarding the interpretation of the clause. See [[2](https://api.case.law/v1/cases/1527705/)].

5. **Case Law in Illinois** emphasizes how 'sudden and accidental' are tied together as well as individually dissected to determine applicability in pollution exclusions. The state-law-based interpretations provided are helpful in understanding regional judicial attitudes towards insurance policy enforcement. See [[1](https://api.case.law/v1/cases/5454708/)].

These cases collectively provide a broader understanding of how different jurisdictions interpret the 'sudden and accidental' clause in pollution exclusions, highlighting shifts in legal interpretation over time, as well as variations across different states.

In [15]:
response_model.citations

['1', '2']
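
How `ResearchReport.citations` is populated is not shown in this notebook. As an independent cross-check, the cited source numbers can be pulled from the markdown links in the report text. This is a sketch only: the regular expression assumes links shaped like `[1](https://...)`, and `extract_cited_sources` is a hypothetical helper.

```python
import re
from typing import Dict

# Matches "[<number>](<url>)", which also covers the "[[1](url)]" form used above.
LINK_PATTERN = re.compile(r"\[(\d+)\]\((https?://[^)\s]+)\)")


def extract_cited_sources(report: str) -> Dict[str, str]:
    """Map each cited source number to the first URL it was linked to."""
    cited: Dict[str, str] = {}
    for number, url in LINK_PATTERN.findall(report):
        cited.setdefault(number, url)
    return cited


sample = (
    "See [[1](https://api.case.law/v1/cases/5454708/)] and "
    "[[2](https://api.case.law/v1/cases/1527705/)]; "
    "compare [[1](https://api.case.law/v1/cases/5454708/)]."
)
cited = extract_cited_sources(sample)
print(list(cited))
```

Comparing the extracted numbers and URLs against the numbered context can reveal links whose number and URL do not match the entries that were supplied.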

In [16]:
query = f"""
To help answer the main question we have a sub-question with research results we need to analyze.
Here is the main question:\n{test_query}
Here is the sub-question we need to address using the context:\n{search_results[1].sub_question}
Remember to include markdown links wherever possible, e.g., '[[2] Allstate vs. Smith](URL)'.
"""

In [17]:
formatted_input = create_formatted_input(
    df=search_results[1].to_pandas, 
    query=query, 
    text_column='context',
    context_token_limit=20000)

In [18]:
response_model2 = get_final_answer(formatted_input, model_name="gpt-4-turbo")
Markdown(response_model2.research_report)

In addressing how courts have considered intentional versus unintentional polluting events in relation to the pollution exclusion clause in comprehensive general liability insurance policies, various legal cases provide extensive insights.

### Key Cases and Considerations:

1. **Intentional versus Unintentional Discharges**:
   In many cases, the courts have closely examined whether the discharges were expected or intended by the insured to determine the applicability of the pollution exclusion. For instance, in case [1](https://api.case.law/v1/cases/12535237/), the court agreed with previous decisions that the pollution exclusion did not apply when the insured did not expect or intend for pollutants to enter the environment, implying that unintentional discharges can be covered under CGL policies despite pollution exclusions.

2. **Definition of 'Sudden and Accidental'**:
   From cases such as [2](https://api.case.law/v1/cases/5454708/) and [10](https://api.case.law/v1/cases/5454708/), it is evident that the interpretation of 'sudden and accidental' within the pollution exclusion has been debated. For instance, case [10](https://api.case.law/v1/cases/5454708/) discusses how under Missouri law, 'sudden' has been interpreted more with respect to temporal elements and 'accidental' focuses on the unexpected nature of the release, thereby influencing coverage decisions.

3. **Distinguishing Factors between Legal Jurisdictions**:
   Different jurisdictions have varied interpretations about what constitutes incidental pollution events as opposed to expected or intentional ones. This can be seen in [3](https://api.case.law/v1/cases/1527705/) where Missouri’s interpretations are referenced, which may contrast with those of other states showing the complexity and varied nature of legal interpretations across jurisdictions.

### Conclusion:
These cases collectively depict how intentionality plays a crucial role in determining the applicability and reach of pollution exclusions. They also highlight the significance of the specific definitions used within policies and the specific language of the courts' interpretation of 'sudden and accidental'. This is instrumental in resolving the complexities surrounding the application of pollution exclusions in liability insurance policies.

In [7]:
from src.search.query_planning import generate_query_filters

query_plan = generate_query_filters(
    search_df=df1,
    query=test_query,
    filter_fields=[
        'state',
    ],
    verbose=True,
)
len(query_plan)

2024-04-27 12:32:23 - INFO - Schema shown to LLM: 
Name of each field, its type and unique values (up to 20):
* state (string);  Values - ['New Mexico' 'Nevada' 'Virginia' 'Vermont' 'Kansas' 'Maryland' 'Arkansas'
 'Massachusetts' 'West Virginia' 'Pennsylvania' 'Texas' 'Maine'
 'Mississippi' 'Hawaii' 'Iowa' 'Kentucky' 'Ohio' 'New Hampshire'
 'Wisconsin' 'North Dakota'], ... 30 more


1

In [8]:
filtered_df = query_plan[0].filter_df(df1)

2024-04-27 12:32:26 - INFO - Input DataFrame has 1,987 rows
2024-04-27 12:32:26 - INFO - Applying filter(s): {'state': 'California'}
2024-04-27 12:32:26 - INFO - After filtering the DataFrame has 41 rows
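
The log shows a dictionary of equality filters being applied to the frame. The actual `filter_df` implementation is not shown, so the following is only a sketch of one way to apply such a dictionary with pandas; `filter_by_equality` and the toy data are hypothetical.

```python
from typing import Dict

import pandas as pd


def filter_by_equality(df: pd.DataFrame, filters: Dict[str, object]) -> pd.DataFrame:
    """Keep rows where every listed column equals the requested value."""
    mask = pd.Series(True, index=df.index)
    for column, value in filters.items():
        # Combine conditions with logical AND, one column at a time.
        mask &= df[column] == value
    return df[mask]


toy = pd.DataFrame({"state": ["California", "Nevada", "California"], "id": [1, 2, 3]})
subset = filter_by_equality(toy, {"state": "California"})
print(len(subset))
```

An empty filter dictionary returns the frame unchanged, which is a reasonable fallback when the query planner finds no applicable metadata filter.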


In [None]:
from src.agent.tools.semantic_search import SemanticSearchConfig, SemanticSearchEngine
from src.agent.tools.splade_search import SPLADESearchConfig, SPLADESparseSearch
from src.agent.tools.hybrid_search import HybridSearchConfig, HybridSearchEngine
from src.search.base import SearchType
from src.search.models import SearchQuery

SEMANTIC_TEXT_COLUMN = "context"
SPLADE_TEXT_COLUMN = "context"
HYBRID_TEXT_COLUMN = "context"

DATA_PATH = "data/splade.parquet"

def retrieve_chunks(search_query: SearchQuery):
    search_type = search_query.search_type
    if search_type == SearchType.SEMANTIC:
        config = SemanticSearchConfig(
            data_path=DATA_PATH,
            text_column=SEMANTIC_TEXT_COLUMN,
        )
        search_engine = SemanticSearchEngine.create(config)
    elif search_type == SearchType.SPLADE:
        config = SPLADESearchConfig(
            data_path=DATA_PATH,
            text_column=SPLADE_TEXT_COLUMN,
        )
        search_engine = SPLADESparseSearch.create(config)
    elif search_type == SearchType.HYBRID:
        config = HybridSearchConfig(
            data_path=DATA_PATH,
            text_column=HYBRID_TEXT_COLUMN,
        )
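        search_engine = HybridSearchEngine.create(config)
    else:
        # Fail loudly instead of hitting an unbound `search_engine` below.
        raise ValueError(f"Unsupported search type: {search_type}")

    return search_engine.query_similar_documents(search_query.query, top_k=search_query.num_hits)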

In [None]:
from src.agent.tools.semantic_search import SemanticSearch
from src.agent.tools.splade_search import SPLADESparseSearch
from src.search.query_expansion import segment_search_query

semantic_search = SemanticSearch(
    df=summary_df,
    text_column='context',
    embedding_column='openai_embeddings_summary'
)

splade_search = SPLADESparseSearch(
    df=summary_df,
    text_column='keywords',
    embedding_column='splade_embeddings'
)

queries = segment_search_query(
    test_query,
    n='3 to 5'
)

In [41]:
from src.search.models import dataframe_to_text_nodes

nodes = dataframe_to_text_nodes(
    df=df,
    id_column='index',
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
    embedding_column='openai_embeddings',
    has_score=False,
)

In [6]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader
import chromadb

In [18]:
df.head(1)

Unnamed: 0,index,id,citation,name,name_abbreviation,decision_date,court_id,court_name,court_slug,judges,attorneys,citations,url,head,body,name_contains_lm,body_contains_lm,year,context,context_citation,context_tokens,openai_embeddings,splade_embeddings
0,0,411690,154 Ill. 2d 90,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",Johnson v. Halloran,2000-01-13,8837,Illinois Appellate Court,ill-app-ct,[],"['Wolter, Beeman, Lynch & McIntyre, of Springf...","[{'type': 'official', 'cite': '312 Ill. App. 3...",https://api.case.law/v1/cases/411690/,"RICHARD R. JOHNSON, Plaintiff-Appellant and Cr...",JUSTICE HALL\r\ndelivered the opinion of the c...,False,True,2000,The public defender of Cook County was appoint...,154 Ill. 2d 90,1317,"[-0.0017778094625100493, -0.002360282698646187...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [23]:
loader = DataFrameLoader(df[['context', 'id', 'citation', 'name_abbreviation', 'court_name']], page_content_column="context")
docs = loader.load()
len(docs)

1987

In [7]:
embeddings_model = OpenAIEmbeddings()

# db = Chroma.from_documents(docs, embeddings_model, persist_directory="./data/chroma_db")
# load from disk
db = Chroma(persist_directory="./data/chroma_db", embedding_function=embeddings_model)

In [19]:
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("citation_context")

In [42]:
from typing import List
from llama_index.core.indices.vector_store import VectorIndexRetriever, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core.schema import TextNode

from src.parsing.search import find_closest_matches_with_bm25_df, find_fuzzy_matches_in_df


def init_vector_store_index(nodes: List[TextNode]):
    chroma_client = chromadb.EphemeralClient()
    chroma_collection = chroma_client.create_collection("chroma_db")

    embeddings = OpenAIEmbedding(
        model='text-embedding-ada-002',
    )
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store,
    )
    index = VectorStoreIndex(
        nodes=nodes,
        embed_model=embeddings,
        storage_context=storage_context,
    )
    return index

index = init_vector_store_index(nodes=nodes)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

In [45]:
res = retriever.retrieve(test_query)
len(res)

10

In [48]:
index.storage_context.persist()

In [50]:
index.storage_context.vector_store.persist(persist_path='./data/index/chromadb')

In [46]:
res

[NodeWithScore(node=TextNode(id_='1145', embedding=None, metadata={'url': 'https://api.case.law/v1/cases/4268691/', 'head': 'COUNTRY MUTUAL INSURANCE COMPANY, Plaintiff-Appellee, v. STEVE CARR, d/b/a Carr Construction, Defendant-Appellant and Third-Party Plaintiff (Jon Seevers et al., Defendants; Harold Vogelzang, Third-Party Defendant).\r\nFourth District\r\nNo. 4—06—0589\r\nOpinion filed March 19, 2007.\r\nEdward H. Rawles (argued), of Rawles, O’Byrne, Stanko, Kepley & Jefferson, EC., of Champaign, for appellant.\r\nKaren L. Kendall (argued), of Heyl, Royster, Voelker & Allen, of Peoria, and Michael E. Raub, of Heyl, Royster, Voelker & Allen, of Urbana, for appellee.', 'citation': '89 Ill. App. 3d 617', 'name': 'COUNTRY MUTUAL INSURANCE COMPANY, Plaintiff-Appellee, v. STEVE CARR, d/b/a Carr Construction, Defendant-Appellant and Third-Party Plaintiff (Jon Seevers et al., Defendants; Harold Vogelzang, Third-Party Defendant)'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys

In [None]:
def pretty_print(df: pd.DataFrame) -> None:
    """
    Displays a DataFrame as HTML with line breaks for better readability.

    Args:
        df (pd.DataFrame): The DataFrame to display.
    """
    return display(HTML(df.to_html().replace("\\n", "<br>")))

def visualize_retrieved_nodes(nodes: list) -> None:
    """
    Visualizes the retrieved nodes by creating a DataFrame with scores and trimmed texts,
    then displaying it using the pretty_print function.

    Args:
        nodes (list): A list of nodes to visualize, where each node has a score and text.
    """
    result_dicts = []
    for node in nodes:
        # Trim the text to 1000 characters max
        trimmed_text = node.node.get_text()[:1000]
        result_dict = {"Score": node.score, "Text": trimmed_text}
        result_dicts.append(result_dict)

    pretty_print(pd.DataFrame(result_dicts))

In [21]:
import datetime
import logging
from pathlib import Path
from time import perf_counter
import json
from dotenv import load_dotenv

from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.generators import SimpleDatasetGenerator
from continuous_eval.llm_factory import LLMFactory

load_dotenv()


def main():
    logging.basicConfig(level=logging.INFO)

    generator_llm = "gpt-4-0125-preview"
    num_questions = 20
    multi_hop_percentage = 0.1
    max_try_ratio = 2

    print(f"Generating a {num_questions}-questions dataset with {generator_llm}...")

    tic = perf_counter()
    dataset_generator = SimpleDatasetGenerator(
        vector_store_index=db,
        generator_llm=LLMFactory(generator_llm),
    )
    dataset = dataset_generator.generate(
        embedding_vector_size=1536,
        num_questions=num_questions,
        multi_hop_percentage=multi_hop_percentage,
        max_try_ratio=max_try_ratio,
    )
    toc = perf_counter()
    # Elapsed time is end minus start; the reverse order prints a negative duration.
    print(f"Finished generating dataset in {toc - tic:.2f}sec.")

    current_datetime = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    output_directory = Path("generated_dataset")
    output_directory.mkdir(parents=True, exist_ok=True)
    fname = (
        output_directory / f"G_{generator_llm}_Q_{num_questions}_MH%_{multi_hop_percentage}_{current_datetime}.jsonl"
    )
    print(f"Saving dataset to {fname}")
    # Write one JSON object per line (JSONL).
    with open(fname, 'w', encoding='utf-8') as file:
        for item in dataset:
            json_string = json.dumps(item)
            file.write(json_string + '\n')
    print("Done.")

main()

Generating a 20-questions dataset with gpt-4-0125-preview...


Samples:  95%|█████████▌| 19/20 [02:38<00:08,  8.35s/it]

Finished generating dataset in 159.39sec.
Saving dataset to generated_dataset\G_gpt-4-0125-preview_Q_20_MH%_0.1_20240408_163005.jsonl
Done.





In [9]:
import json
from typing import List, Dict

def read_dataset(file_path: str) -> List[Dict]:
    """
    Reads a dataset from a JSONL file and returns it as a list of dictionaries.

    Args:
        file_path (str): The path to the JSONL file containing the dataset.

    Returns:
        List[Dict]: A list of dictionaries, where each dictionary represents an item in the dataset.
    """
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data.append(json.loads(line))
    return data

# Example usage
dataset_path = "./generated_dataset/G_gpt-4-0125-preview_Q_20_MH%_0.1_20240408_163005.jsonl"
dataset = read_dataset(dataset_path)
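
As a quick sanity check of the JSONL format used above, the sketch below writes a few records to a temporary file and reads them back. The records are made up for illustration. Unlike `read_dataset`, this version skips blank lines, which is an added assumption about how tolerant the reader should be.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records with the same top-level keys as the generated dataset.
records = [
    {"question": "q1", "answer": "a1", "contexts": ["c1"]},
    {"question": "q2", "answer": "a2", "contexts": ["c2", "c3"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"

    # One JSON object per line.
    with open(path, "w", encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(record) + "\n")

    # Read back, ignoring empty lines such as a trailing newline at end of file.
    with open(path, "r", encoding="utf-8") as file:
        loaded = [json.loads(line) for line in file if line.strip()]

print(loaded == records)
```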

In [10]:
dataset

[{'question': "Did the trial court abuse its discretion in denying Guidant's motion for a partial stay?",
  'answer': 'No.',
  'contexts': ['A trial court’s decision to grant or deny a motion to stay will not be overturned on appeal absent an abuse of discretion. Employing the relevant standard of review, we cannot say that the trial court abused its discretion in denying Guidant’s motion for a partial stay.'],
  'metadata': [{'citation': '212 Ill. App. 3d 556',
    'court_name': 'Illinois Appellate Court',
    'id': 3600472,
    'name_abbreviation': 'Allianz Insurance v. Guidant Corp.'}],
  'question_type': 'Single Hop Fact Seeking'},
 {'question': 'Did the court find American had a duty to defend McHugh in the Marciano case?',
  'answer': 'No, the court did not find American had a duty to defend McHugh in the Marciano case.',
  'contexts': ['We disagree. As counsel for McHugh admitted during appellate oral arguments, the Marciano complaint alleges only direct negligence against McHug

In [32]:
from continuous_eval.eval import Module, Pipeline, Dataset, ModuleOutput
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics # Deterministic metrics
from continuous_eval.metrics.generation.text import (
    FleschKincaidReadability, # Deterministic metric
    DebertaAnswerScores, # Semantic metric
    LLMBasedFaithfulness, # LLM-based metric
)
from typing import List, Dict
from continuous_eval.eval.tests import MeanGreaterOrEqualThan

dataset = Dataset(dataset_path="./generated_dataset/G_gpt-4-0125-preview_Q_20_MH%_0.1_20240408_163005.jsonl",
                  manifest_path='./generated_dataset/manifest.yaml')

In [33]:
from langchain_core.documents.base import Document

Documents = List[Dict[str, str]]
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

In [34]:
from continuous_eval.metrics.retrieval import RankedRetrievalMetrics, ExactChunkMatch

retriever = Module(
    name="retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1(matching_strategy=ExactChunkMatch()).use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.contexts,
        ),
    ],
    tests=[
        MeanGreaterOrEqualThan(
            test_name="Context Recall", metric_name="context_recall", min_value=0.8
        ),
    ],
)

reranker = Module(
    name="reranker",
    input=retriever,
    output=Documents,
    eval=[
        RankedRetrievalMetrics(matching_strategy=ExactChunkMatch()).use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.contexts,
        ),
    ],
    tests=[
        MeanGreaterOrEqualThan(
            test_name="Average Precision", metric_name="average_precision", min_value=0.7
        ),
    ],
)


pipeline = Pipeline([retriever, reranker], dataset=dataset)

In [35]:
print(pipeline.graph_repr())

graph TD;
    Dataset((Dataset));
    retriever --> reranker;
    Dataset -. "question" .-> retriever;
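
The retriever module above is scored with precision, recall and F1 under exact chunk matching. A minimal sketch of that computation follows, assuming a retrieved chunk counts as a hit only when its text is identical to a ground-truth context. The function name `exact_match_prf` is hypothetical, and this is not the continuous_eval implementation.

```python
from typing import Dict, List


def exact_match_prf(retrieved: List[str], ground_truth: List[str]) -> Dict[str, float]:
    """Precision, recall and F1 where a chunk counts only on an exact string match."""
    retrieved_set = set(retrieved)
    truth_set = set(ground_truth)
    hits = len(retrieved_set & truth_set)
    # Guard against empty inputs so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"context_precision": precision, "context_recall": recall, "context_f1": f1}


scores = exact_match_prf(
    retrieved=["chunk a", "chunk b", "chunk c", "chunk d", "chunk e"],
    ground_truth=["chunk a", "chunk z"],
)
print(scores)
```

With five retrieved chunks and only one or two ground-truth contexts per question, precision is capped well below 1 even when recall is perfect, so recall is the more informative number for this stage.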



In [36]:
from src.agent.tools.semantic_search import SemanticSearch
from src.embedding_models.models import ColbertReranker
from langchain_openai.chat_models import ChatOpenAI
from langchain_community.document_loaders import DataFrameLoader

def Dataframe2documents(df):
    loader = DataFrameLoader(df[['context', 'id', 'citation', 'name_abbreviation', 'court_name']], page_content_column="context")
    docs = loader.load()
    return docs

def documents2Dataframe(documents: List[Document]) -> pd.DataFrame:
    rows = []
    for chunk in documents:
        row = {
            "context": chunk.page_content,
            **chunk.metadata,
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    return df

def retrieve(q):
    retriever = SemanticSearch(
        df=df1,
        text_column='context',
        embedding_column='openai_embeddings'
    )
    res_df = retriever.query_similar_documents(q, top_n=5)
    res_docs = Dataframe2documents(res_df)
    return res_docs

def rerank(q, retrieved_docs):
    reranker = ColbertReranker(
        column='context'
    )
    retrieved_docs = documents2Dataframe(retrieved_docs)
    res_df = reranker.rerank(q, retrieved_docs)
    res_docs = Dataframe2documents(res_df)
    return res_docs


def ask(q, retrieved_docs):
    model = ChatOpenAI(model="gpt-4-0125-preview")
    system_prompt = (
        "You are a legal research assistant.\n"
        "Answer the question below based on the context provided."
    )
    user_prompt = f"Question: {q}\n\n"
    user_prompt += "Contexts:\n" + "\n".join(
        [doc.page_content for doc in retrieved_docs]
    )
    try:
        # Separate the instructions from the question so they do not run together.
        result = model.invoke(system_prompt + "\n\n" + user_prompt).content
    except Exception as e:
        print(e)
        result = "Sorry, I cannot answer this question."
    return result

In [37]:
from pathlib import Path
from continuous_eval.eval.manager import eval_manager

eval_manager.set_pipeline(pipeline)
eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]
    # Run and log Retriever results
    retrieved_docs = retrieve(q)
    eval_manager.log("retriever", [doc.__dict__ for doc in retrieved_docs])
    # Run and log Reranker results
    reranked_docs = rerank(q, retrieved_docs)
    eval_manager.log("reranker", [doc.__dict__ for doc in reranked_docs])
    eval_manager.next_sample()

eval_manager.evaluation.save(Path("results.jsonl"))

In [39]:
from continuous_eval.eval.manager import eval_manager
from pathlib import Path

eval_manager.set_pipeline(pipeline)

# Evaluation
eval_manager.evaluation.load(Path("results.jsonl"))
eval_manager.run_metrics()
eval_manager.metrics.save(Path("metrics_results.json"))

# Tests
agg = eval_manager.metrics.aggregate()
print(agg)
eval_manager.run_tests()
eval_manager.tests.save(Path("test_results.json"))

for module_name, test_results in eval_manager.tests.results.items():
    print(f"{module_name}")
    for test_name in test_results:
        print(f" - {test_name}: {test_results[test_name]}")

{'retriever': {'context_precision': 0.4, 'context_recall': 0.825, 'context_f1': 0.5146825396825397}, 'reranker': {'average_precision': 0.7280555555555555, 'reciprocal_rank': 0.7291666666666666, 'ndcg': 0.7402841628990883}, 'llm': {'LLM_based_faithfulness': 0.75, 'deberta_answer_entailment': 0.29618267125915737, 'deberta_answer_contradiction': 0.022239959875423664, 'flesch_reading_ease': 14.291352026267827, 'flesch_kincaid_grade_level': 18.58895135990985}}
retriever
 - Average Precision: True
reranker
 - Context Recall: True
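
The reranker figures above (`average_precision`, `reciprocal_rank`) depend on where the relevant chunks land in the ordering, not only on whether they were retrieved. The sketch below shows one common definition of each under exact matching. The function names are hypothetical, and the library may normalise average precision differently (for example, by the number of relevant chunks actually retrieved).

```python
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none is present."""
    for position, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each position k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for position, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            hits += 1
            precision_sum += hits / position
    # Assumption: normalise by the total number of relevant chunks.
    return precision_sum / len(relevant) if relevant else 0.0


ranked = ["chunk b", "chunk a", "chunk c", "chunk z"]
relevant = {"chunk a", "chunk z"}
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
print(rr, ap)
```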


In [10]:
from src.agent.tools.semantic_search import SemanticSearch
from src.agent.tools.splade_search import SparseEmbeddingsSplade
from src.search.models import dataframe_to_text_nodes, text_nodes_to_dataframe
from src.search.rerank_openai import apply_gpt_relevance_to_df

bm25_df = find_closest_matches_with_bm25_df(
    df=df1,
    text_column='context',
    query=test_query,
    k=20,
)
bm25_nodes = dataframe_to_text_nodes(
    df=bm25_df,
    id_column='index',
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
    has_score=True,
)
node_index_df = text_nodes_to_dataframe(
    res,
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
    embedding_column='openai_embeddings',
)
semantic_search = SemanticSearch(
    df=df1,
    text_column='context',
    embedding_column='openai_embeddings'
)
splade_search = SparseEmbeddingsSplade(
    df=df1,
    text_column='context',
    embedding_column='splade_embeddings'
)
splade_df = splade_search.query_similar_documents(
    query=test_query,
    top_n=20,
)
splade_nodes = dataframe_to_text_nodes(
    df=splade_df,
    id_column='index',
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
    embedding_column='splade_embeddings',
    has_score=True,
)
vec_df = semantic_search.query_similar_documents(
    query=test_query,
    top_n=20,
)
vec_nodes = dataframe_to_text_nodes(
    df=vec_df,
    id_column='index',
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
    embedding_column='openai_embeddings',
    has_score=True,
)

len(vec_nodes), len(splade_nodes)

2024-04-08 00:52:50 - INFO - Using pre-computed 'context' embeddings from existing column: splade_embeddings


(20, 20)

In [11]:
visualize_retrieved_nodes(bm25_nodes[:4])

Unnamed: 0,Score,Text
0,45.434171,"However, as discussed in detail below, such a showing, by itself, is not sufficient to prove an “occurrence” because, under the express language of the policy, plaintiffs have to further show that the “exposure to conditions” was unexpected and unintended. Because of our holding in Emerson I that under Missouri law, the standard-form “sudden and accidental” exception to the pollution exclusion precludes coverage for property damage caused by gradual, non-abrupt releases of pollutants, plaintiffs are precluded from coverage of gradual pollution at the third-party sites under the 1984-85 policy. The National Priorities List is the EPA’s list of the most serious sites of known or threatened releases of hazardous substances, pollutants, or contaminants throughout the United States. The Hatfield, Pennsylvania, site was a subject of our prior opinion in Emerson I. We held that plaintiffs had proven the “exposure to conditions” component of the definition of “occurrence” at that site and were"
1,40.773825,"See Willis Corroon Corp. , 203 F.3d at 453. After careful consideration of the arguments and the record before us, we find the trial court correctly found Travelers' conduct breached its duty to defend. ¶ 56 II. Pollution Exclusions ¶ 57 We are also asked to consider whether the pollution exclusions in the policies apply. Travelers contends coverage is barred by the pollution exclusions contained in some of the policies because Rogers's liabilities are the result of decades of intentional discharges of hazardous chemicals into the soil, unlined ponds, and a public sewer. Rogers responds that (1) Travelers should be estopped from raising any exclusions, (2) its use of containment ponds and Sauget's sewer system does not implicate Travelers' pollution exclusion, (3) it did not expect or intend overflows from the sewer system or retention ponds, and (4) its use of sewers and ponds was not illegal. ¶ 58 As set forth in our consideration of the first issue, Travelers is estopped from raisin"
2,36.808561,"II. In granting summary judgment with respect to the owned sites: (a) the trial court’s decision is erroneous as a matter of law because the trial court ignored this court’s prior ruling in Emerson I that, for the purposes of the pollution exclusion, a “release” is the escape of pollutants from, not the intentional deposit of wastes into, places of expected containment; (b) even if the trial court were not bound by this court’s decision in Emerson I, its decisions were inconsistent with the plain meaning of the insurance policy and Missouri law, which also provide that a “release” under the pollution exclusion is a discharge from a place of expected containment; (c) the trial court’s rulings, based on the amended “accidental” exception to the pollution exclusion in the 1983-84 policy, that releases at the owned sites could not be “accidental” because they were gradual, contravened both this court’s decision in Emerson I and the established Missouri law; (d) there were genuine questions"
3,36.651254,"While neither of these Missouri Court of Appeals decisions specifically addresses the word “accidental” as used in an exception to a pollution exclusion, their interpretations of the term “caused by accident” are in line with the decision in General Dynamics I. In N.W. Electric, the plaintiff cooperative, a supplier of electric power, sought coverage for damages to private property which resulted when the plaintiff exceeded an easement granted to it for construction of a power line over the property. The court concluded that while the acts producing the damage might be termed intentional, the resulting harm was “caused by accident” within the meaning of the insurance policy, where there was no evidence that the plaintiff cooperative knew of the acts which caused the damage or directed them to be done, nor was there any evidence of an intent to injure. N.W. Electric, 451 S.W.2d at 361-62. Similarly, the court in White held that the contamination of a well was “caused by accident,” where"
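
The scores in the table above come from BM25 keyword matching. The sketch below is a small pure-Python Okapi BM25 scorer, given only to show how such scores arise. It is not the implementation behind `find_closest_matches_with_bm25_df`; whitespace tokenisation and the parameter values `k1=1.5` and `b=0.75` are assumptions.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation penalises long documents.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "pollution exclusion clause applies",
    "duty to defend the insured",
    "sudden and accidental pollution release",
]
scores = bm25_scores("pollution exclusion", docs)
print(scores)
```

Unlike cosine similarities, BM25 scores are unbounded and depend on corpus statistics, so they are only comparable within a single query.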


In [19]:
from lancedb.rerankers import ColbertReranker

In [16]:
bm25_df2 = apply_gpt_relevance_to_df(
    query=test_query,
    df=bm25_df,
    passage_col='context',
    score_col_name='score_gpt'
)

2024-04-08 00:57:31 - INFO - Predicted 'is relevant': Yes - Score: 24.581820219874547
2024-04-08 00:57:31 - INFO - Predicted 'is relevant': Yes - Score: 13.04373289722047
2024-04-08 00:57:32 - INFO - Predicted 'is relevant': Yes - Score: 19.774827988658743
2024-04-08 00:57:32 - INFO - Predicted 'is relevant': Yes - Score: 23.192935747195328
2024-04-08 00:57:33 - INFO - Predicted 'is relevant': No - Score: -1.9154196646176787
2024-04-08 00:57:33 - INFO - Predicted 'is relevant': No - Score: -28.50855336424877
2024-04-08 00:57:34 - INFO - Predicted 'is relevant': No - Score: -3.5620396709345004
2024-04-08 00:57:34 - INFO - Predicted 'is relevant': No - Score: -29.20614741505005
2024-04-08 00:57:35 - INFO - Predicted 'is relevant': No - Score: -37.741131390804846
2024-04-08 00:57:35 - INFO - Predicted 'is relevant': No - Score: -2.2806426577332375
2024-04-08 00:57:35 - INFO - Predicted 'is relevant': Yes - Score: 2.099506091192047
2024-04-08 00:57:36 - INFO - Predicted 'is relevant': Yes 

In [22]:
from src.search.rank_gpt import RankGPTRerank
from llama_index.llms.openai import OpenAI

bm25_nodes = dataframe_to_text_nodes(
    df=bm25_df,
    id_column='index',
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
    has_score=True,
)

reranker = RankGPTRerank(
    top_n=10,
    llm=OpenAI(),
    verbose=True,
)
bm25_nodes2 = reranker.postprocess_nodes(
    nodes=bm25_nodes,
    query_bundle=test_query,
)

bm25_df3 = text_nodes_to_dataframe(
    text_nodes=bm25_nodes2,
    text_column='context',
    metadata_fields=['url', 'head', 'citation', 'name'],
)

After Reranking, new rank list for nodes: [0, 2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
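
The printed rank list is read here as a list of original positions in their new order, which agrees with the order of the first rows in the `bm25_df3.head()` output. A minimal sketch of applying such a list follows; the helper `apply_rank_list` is hypothetical and is not part of `RankGPTRerank`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def apply_rank_list(items: Sequence[T], rank_list: List[int], top_n: int) -> List[T]:
    """Reorder items by a list of original positions, then keep the first top_n."""
    if sorted(rank_list) != list(range(len(items))):
        raise ValueError("rank_list must be a permutation of the item positions")
    return [items[position] for position in rank_list][:top_n]


passages = ["However, as discussed", "See Willis Corroon", "II. In granting", "While neither"]
reordered = apply_rank_list(passages, rank_list=[0, 2, 1, 3], top_n=3)
print(reordered)
```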

In [24]:
bm25_df3.head()

Unnamed: 0,context,score,url,head,citation,name
0,"However, as discussed in detail below, such a ...",24.685309,https://api.case.law/v1/cases/5454708/,"EMERSON ELECTRIC COMPANY et al., Plaintiffs-Ap...",268 Ill. App. 3d 598,"EMERSON ELECTRIC COMPANY et al., Plaintiffs-Ap..."
1,II. In granting summary judgment with respect ...,19.774828,https://api.case.law/v1/cases/5454708/,"EMERSON ELECTRIC COMPANY et al., Plaintiffs-Ap...",154 Ill. 2d 90,"EMERSON ELECTRIC COMPANY et al., Plaintiffs-Ap..."
2,"See Willis Corroon Corp. , 203 F.3d at 453. Af...",13.670125,https://api.case.law/v1/cases/12535237/,"ROGERS CARTAGE COMPANY, Pharmacia Corporation,...",154 Ill. 2d 90,"ROGERS CARTAGE COMPANY, Pharmacia Corporation,..."
3,While neither of these Missouri Court of Appea...,23.544951,https://api.case.law/v1/cases/1527705/,"EMERSON ELECTRIC COMPANY et al., Plaintiffs-Ap...",155 Ill. 2d 402,"EMERSON ELECTRIC COMPANY et al., Plaintiffs-Ap..."
4,(Emphasis omitted.) United States Fidelity & G...,-1.91542,https://api.case.law/v1/cases/980702/,"ATLANTIC MUTUAL INSURANCE COMPANY et al., Plai...",89 Ill. App. 3d 617,"ATLANTIC MUTUAL INSURANCE COMPANY et al., Plai..."


In [27]:
from src.search.llm_filter import filter_chunks

bm25_nodes3 = filter_chunks(
    query=test_query,
    chunks_to_filter=bm25_nodes2,
)

len(bm25_nodes3)

Running LLM usefulness eval in parallel (following logging may be out of order)


6

In [29]:
visualize_retrieved_nodes(bm25_nodes3[:5])

Unnamed: 0,Score,Text
0,19.774828,"II. In granting summary judgment with respect to the owned sites: (a) the trial court’s decision is erroneous as a matter of law because the trial court ignored this court’s prior ruling in Emerson I that, for the purposes of the pollution exclusion, a “release” is the escape of pollutants from, not the intentional deposit of wastes into, places of expected containment; (b) even if the trial court were not bound by this court’s decision in Emerson I, its decisions were inconsistent with the plain meaning of the insurance policy and Missouri law, which also provide that a “release” under the pollution exclusion is a discharge from a place of expected containment; (c) the trial court’s rulings, based on the amended “accidental” exception to the pollution exclusion in the 1983-84 policy, that releases at the owned sites could not be “accidental” because they were gradual, contravened both this court’s decision in Emerson I and the established Missouri law; (d) there were genuine questions"
1,13.670125,"See Willis Corroon Corp. , 203 F.3d at 453. After careful consideration of the arguments and the record before us, we find the trial court correctly found Travelers' conduct breached its duty to defend. ¶ 56 II. Pollution Exclusions ¶ 57 We are also asked to consider whether the pollution exclusions in the policies apply. Travelers contends coverage is barred by the pollution exclusions contained in some of the policies because Rogers's liabilities are the result of decades of intentional discharges of hazardous chemicals into the soil, unlined ponds, and a public sewer. Rogers responds that (1) Travelers should be estopped from raising any exclusions, (2) its use of containment ponds and Sauget's sewer system does not implicate Travelers' pollution exclusion, (3) it did not expect or intend overflows from the sewer system or retention ponds, and (4) its use of sewers and ponds was not illegal. ¶ 58 As set forth in our consideration of the first issue, Travelers is estopped from raisin"
2,23.544951,"While neither of these Missouri Court of Appeals decisions specifically addresses the word “accidental” as used in an exception to a pollution exclusion, their interpretations of the term “caused by accident” are in line with the decision in General Dynamics I. In N.W. Electric, the plaintiff cooperative, a supplier of electric power, sought coverage for damages to private property which resulted when the plaintiff exceeded an easement granted to it for construction of a power line over the property. The court concluded that while the acts producing the damage might be termed intentional, the resulting harm was “caused by accident” within the meaning of the insurance policy, where there was no evidence that the plaintiff cooperative knew of the acts which caused the damage or directed them to be done, nor was there any evidence of an intent to injure. N.W. Electric, 451 S.W.2d at 361-62. Similarly, the court in White held that the contamination of a well was “caused by accident,” where"
3,-1.91542,"(Emphasis omitted.) United States Fidelity & Guaranty Co., 144 Ill. 2d at 73. Furthermore, if the insurer relies on an exclusionary provision, it must be “clear and free from doubt” that the policy’s exclusion prevents coverage. Bituminous Casualty Corp. v. Fulkerson, 212 Ill. App. 3d 556, 564, 571 N.E.2d 256 (1991). Additionally, we must liberally construe the underlying complaint and the insurance policy in favor of the insured. United States Fidelity & Guaranty Co., 144 Ill. 2d at 74 . In accordance with the above propositions of law, we must analyze the underlying complaint in light of the applicable policy provisions to determine whether the claim is within or potentially within coverage. Both Atlantic Mutual and Centennial rely on portions of their CGL policies that provide coverage only for bodily injury caused by an “occurrence” as defined in their respective policies. Centennial’s CGL policy provides the following: “This company will pay on behalf of the insured all sums which"
4,-3.56204,"The trees were cleared from the wrong lots. The tree-cutler’s commercial general liability (CGL) insurer, Pekin Insurance Co., refused to defend Miller against lawsuits brought by the owners of the property trees were removed from. This court must decide whether clearing trees off the wrong lots constitutes an “occurrence” under the CGL policy and whether certain exclusions in the policy bar coverage. The trial court found Pekin has a duty to defend. We agree. FACTS In the underlying lawsuit, plaintiffs Chicago Title & Trust Co. as trustee under trust No. 53885, William Givens, Marilyn Givens, John Marek, and Harriet Slayton, filed suit against Miller, d/b/a Miller Tree Service, and Bineet Sarang, d/b/a Sarang Corporation (Sarang), for trespass and violations of the Wrongful Tree Cutting Act (740 ILCS 185/2 (West 2000)). The plaintiffs later added additional counts of negligent trespass. They alleged Sarang hired Miller to remove trees from lots 13, 14, and 15 of a subdivision in Hanov"


In [13]:
retrieved_df3 = apply_gpt_relevance_to_df(
    query=test_query,
    df=splade_df,
    passage_col='context',
    score_col_name='score'
)

2024-04-08 00:54:25 - INFO - Predicted 'is relevant': Yes - Score: 24.581820219874547
2024-04-08 00:54:25 - INFO - Predicted 'is relevant': Yes - Score: 20.62935967994126
2024-04-08 00:54:25 - INFO - Predicted 'is relevant': Yes - Score: 23.54495143265143
2024-04-08 00:54:26 - INFO - Predicted 'is relevant': No - Score: -152.2382138316638
2024-04-08 00:54:26 - INFO - Predicted 'is relevant': Yes - Score: 13.04373289722047
2024-04-08 00:54:27 - INFO - Predicted 'is relevant': No - Score: -97.49343406094958
2024-04-08 00:54:28 - INFO - Predicted 'is relevant': No - Score: -2.022808950994338
2024-04-08 00:54:29 - INFO - Predicted 'is relevant': No - Score: -2.442676012149675
2024-04-08 00:54:29 - INFO - Predicted 'is relevant': No - Score: -511.249874552062
2024-04-08 00:54:29 - INFO - Predicted 'is relevant': No - Score: -1318.759110977008
2024-04-08 00:54:30 - INFO - Predicted 'is relevant': No - Score: -813.8807690717495
2024-04-08 00:54:30 - INFO - Predicted 'is relevant': No - Score:

In [14]:
retrieved_df3 = apply_gpt_relevance_to_df(
    query=test_query,
    df=vec_df,
    passage_col='context',
    score_col_name='score'
)

2024-04-08 00:54:42 - INFO - Predicted 'is relevant': Yes - Score: 24.581820219874547
2024-04-08 00:54:42 - INFO - Predicted 'is relevant': No - Score: -1.6020784981936724
2024-04-08 00:54:43 - INFO - Predicted 'is relevant': No - Score: -1.9436635722052453
2024-04-08 00:54:43 - INFO - Predicted 'is relevant': No - Score: -44.50275294029689
2024-04-08 00:54:43 - INFO - Predicted 'is relevant': No - Score: -149.36246498850468
2024-04-08 00:54:44 - INFO - Predicted 'is relevant': Yes - Score: 23.54495143265143
2024-04-08 00:54:44 - INFO - Predicted 'is relevant': No - Score: -307.63982907038877
2024-04-08 00:54:45 - INFO - Predicted 'is relevant': No - Score: -1.9154196646176787
2024-04-08 00:54:46 - INFO - Predicted 'is relevant': No - Score: -243.29580171197526
2024-04-08 00:54:46 - INFO - Predicted 'is relevant': No - Score: -9.788421315579486
2024-04-08 00:54:46 - INFO - Predicted 'is relevant': No - Score: -3.4858878709786447
2024-04-08 00:54:47 - INFO - Predicted 'is relevant': Yes

In [5]:
from src.agent.tools.semantic_search import SemanticSearch
from src.agent.tools.splade_search import SPLADESparseSearch
from src.search.query_expansion import segment_search_query

semantic_search = SemanticSearch(
    df=summary_df,
    embedding_column='openai_embeddings_summary'
)

splade_search = SPLADESparseSearch(
    df=summary_df,
    text_column='keywords',
    embedding_column='splade_embeddings'
)

queries = segment_search_query(
    test_query,
    n='3 to 5'
)

2024-04-06 19:49:44 - INFO - Using pre-computed 'keywords' embeddings from existing column: splade_embeddings


In [6]:
queries.searches

[SubQuestion(chain_of_thought="Understanding the basic framework of the pollution exclusion clause in CGL policies is essential to grasp how the 'sudden and accidental' phrase is interpreted. This will provide a foundation for further analysis.", sub_question_topic='Overview of Pollution Exclusion Clause in CGL Insurance', sub_question_query='What is the pollution exclusion clause in comprehensive general liability insurance?', sub_question_keywords=['pollution exclusion clause', 'CGL insurance', 'comprehensive general liability']),
 SubQuestion(chain_of_thought="Clarifying the legal interpretation of 'sudden and accidental' within the context of pollution exclusion is crucial to understanding how courts and insurers apply this phrase to pollution events.", sub_question_topic="Legal Interpretation of 'Sudden and Accidental'", sub_question_query="How is the phrase 'sudden and accidental' legally interpreted in the context of pollution exclusion clauses in CGL insurance?", sub_question_k

In [7]:
vector_results, keyword_results = queries.execute(semantic_search, splade_search)
len(vector_results), len(keyword_results)

2024-04-06 19:49:58 - INFO -

Thought: Understanding the basic framework of the pollution exclusion clause in CGL policies is essential to
grasp how the 'sudden and accidental' phrase is interpreted. This will provide a foundation for
further analysis.
Search topic: Overview of Pollution Exclusion Clause in CGL Insurance
2024-04-06 19:49:58 - INFO - Running vector (OpenAI) search: What is the pollution exclusion clause in comprehensive general liability insurance?
2024-04-06 19:49:58 - INFO - Running keyword (SPLADE) search: pollution exclusion clause, CGL insurance, comprehensive general liability
2024-04-06 19:49:58 - INFO - Returning 10 records from vector search and 10 from keywords
2024-04-06 19:49:58 - INFO - ---------------------------------------------------------------------------
2024-04-06 19:49:58 - INFO -

Thought: Clarifying the legal interpretation of 'sudden and accidental' within the context of pollution
exclusion is 

(4, 4)

In [8]:
vector_results[0].head(2)

Unnamed: 0,citation,summary,keywords,recency,openai_embeddings_summary,splade_embeddings,search_type,score
19,578 N.E.2d 926,The citation 578 N.E.2d 926 concerns primarily...,"insurance policy interpretation, duty to defen...",,"[0.0031179494690150023, -0.00488113472238183, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.851879
89,357 Ill. App. 3d 955,The case cautions against broadly interpreting...,"'arising out of' language, insurance policy ex...",,"[-0.015558742918074131, -0.011712239123880863,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.825766


In [9]:
vector_results = [df.reset_index(drop=False) for df in vector_results]
keyword_results = [df.reset_index(drop=False) for df in keyword_results]

In [10]:
pd.concat(vector_results).shape, pd.concat(keyword_results).shape

((40, 9), (40, 9))

In [11]:
from src.search.doc_joiner import DocJoinerDF

df_joiner = DocJoinerDF(
    join_mode="reciprocal_rank_fusion",
    top_k=10,
)

In [12]:
vector_res = df_joiner.run(vector_results, text_column='summary')
keyword_res = df_joiner.run(keyword_results, text_column='keywords')
vector_res.shape, keyword_res.shape

((10, 9), (10, 9))
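
The joiner above merges the per-sub-query result lists with reciprocal rank fusion. A minimal sketch of the standard technique follows, assuming each list is already sorted best-first and documents are identified by a hashable id. The function name and the constant `k=60` (a commonly used default) are assumptions. `DocJoinerDF` may differ; its scores in the output appear to be rescaled, which this sketch does not reproduce.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60, top_k: int = 10
) -> List[Tuple[str, float]]:
    """Fuse several ranked id lists: each appearance adds 1 / (k + rank)."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; truncate to the requested number of results.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]


rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Because only ranks enter the formula, the method can combine lists whose raw scores live on different scales, such as cosine similarity and sparse dot products.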

In [13]:
vector_res

Unnamed: 0,index,citation,summary,keywords,recency,openai_embeddings_summary,splade_embeddings,search_type,score
0,19,578 N.E.2d 926,The citation 578 N.E.2d 926 concerns primarily...,"insurance policy interpretation, duty to defen...",,"[0.0031179494690150023, -0.00488113472238183, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.995968
7,53,643 N.E.2d 1226,"643 N.E.2d 1226, United States Gypsum Co. v. A...","insurance law, continuous trigger approach, pr...",References to 643 N.E.2d 1226 have evolved to ...,"[-0.02118617109954357, -0.003369903890416026, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.939225
1,89,357 Ill. App. 3d 955,The case cautions against broadly interpreting...,"'arising out of' language, insurance policy ex...",,"[-0.015558742918074131, -0.011712239123880863,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.93246
10,33,268 Ill. App. 3d 598,The case law from United States Gypsum Co. v. ...,"insurance, claim, coverage, policy, occurrence...",,"[-0.008939363993704319, -0.011452711187303066,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.741935
11,40,89 Ill. App. 3d 617,The case Aetna Casualty & Surety Co. v. Freyer...,"insurance, accident, definition, unforeseen oc...",The definition of 'accident' established in 89...,"[-0.003400814952328801, -0.015401927754282951,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.705141
5,71,199 Ill. 2d 281,"General Casualty Insurance Co. v. Lacey, 199 I...","insurance policy, summary judgment, duty to de...",This case has continually influenced the inter...,"[-0.0014027409488335252, 0.010859929956495762,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.689733
2,91,363 Ill. App. 3d 335,Liberty Mutual Fire Insurance Co. v. St. Paul ...,"insurance policy, interpretation, obligations,...",,"[-0.006612452678382397, -0.00979774072766304, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.480345
4,77,329 Ill. App. 3d 228,The case National Union Fire Insurance Co. of ...,"insurance policy, duty to defend, allegations,...",The case provides insights into the interpreta...,"[0.003789374837651849, -0.008720937184989452, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.462227
15,62,197 Ill. 2d 278,Travelers Insurance Co. v. Eljer Manufacturing...,"insurance policy, initial permission rule, cov...",,"[-0.015268008224666119, -0.011478173546493053,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.462121
3,4,158 Ill. 2d 116,The seminal case of National Union Fire Insura...,"insurance policy, coverage, liability, endorse...",The case continues to be referenced in numerou...,"[-0.012328366748988628, -0.013215634971857071,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",vector,0.456138


In [14]:
keyword_res

Unnamed: 0,index,citation,summary,keywords,recency,openai_embeddings_summary,splade_embeddings,score,search_type
0,92,223 Ill. 2d 352,Valley Forge Insurance Co. v. Swiderski Electr...,"insurance, duty to defend, allegations, insura...",The 2006 decision continues to influence how i...,"[-0.014363071881234646, -0.010653404518961906,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.0,splade
1,9,687 N.E.2d 72,"The case law 687 N.E.2d 72, typically referenc...","insurance policy interpretation, ambiguity res...",,"[-0.001692606951110065, 0.0007399397436529398,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.979967,splade
2,58,105 Ill. 2d 486,"The citation 105 Ill. 2d 486, pertaining to th...","insurance policy, indemnification, self-insure...",,"[-0.014985797926783562, -0.008534330874681473,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.972158,splade
3,19,578 N.E.2d 926,The citation 578 N.E.2d 926 concerns primarily...,"insurance policy interpretation, duty to defen...",,"[0.0031179494690150023, -0.00488113472238183, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.942456,splade
8,63,757 N.E.2d 481,"The citation 757 N.E.2d 481, primarily referen...","insurance policy, intent of parties, insurer d...",References to the principles established in Tr...,"[-0.006457023788243532, 0.004370039328932762, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.904548,splade
4,77,329 Ill. App. 3d 228,The case National Union Fire Insurance Co. of ...,"insurance policy, duty to defend, allegations,...",The case provides insights into the interpreta...,"[0.003789374837651849, -0.008720937184989452, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.703957,splade
10,91,363 Ill. App. 3d 335,Liberty Mutual Fire Insurance Co. v. St. Paul ...,"insurance policy, interpretation, obligations,...",,"[-0.006612452678382397, -0.00979774072766304, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.696843,splade
11,18,144 Ill. 2d 64,United States Fidelity & Guaranty Co. v. Wilki...,"insurance, duty to defend, policy coverage, am...","Notably, the principles established in United ...","[-0.020754359662532806, -0.013158555142581463,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.67634,splade
6,4,158 Ill. 2d 116,The seminal case of National Union Fire Insura...,"insurance policy, coverage, liability, endorse...",The case continues to be referenced in numerou...,"[-0.012328366748988628, -0.013215634971857071,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.673081,splade
7,89,357 Ill. App. 3d 955,The case cautions against broadly interpreting...,"'arising out of' language, insurance policy ex...",,"[-0.015558742918074131, -0.011712239123880863,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.672794,splade


In [15]:
from src.embedding_models.models import ColbertReranker
from src.search.threadpool import run_functions_tuples_in_parallel

In [16]:
USEFUL_PAT = "Yes useful"
NONUSEFUL_PAT = "Not useful"

CHUNK_FILTER_PROMPT = f"""
Determine if the reference section is USEFUL for answering the user query.
It is good enough for the section to be related or similar to the query, \
it should be relevant information that is USEFUL for comparing to the query.
If the section contains ANY useful information, that is good enough, \
it does not need to fully answer the user query, but it \
should at least address a component to be USEFUL.

Reference Section:
```
{{chunk_text}}
```

User Query:
```
{{user_query}}
```

Respond with EXACTLY AND ONLY: "{USEFUL_PAT}" or "{NONUSEFUL_PAT}"
""".strip()

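The prompt above is built in two stages: the f-string fills in the answer patterns immediately, while the doubled braces survive as single-brace placeholders for a later `.format(...)` call. The short sketch below shows that mechanism on a reduced template; the names `USEFUL` and `template` are illustrative only.

```python
USEFUL = "Yes useful"  # stands in for USEFUL_PAT

# Stage 1: the f-string substitutes {USEFUL} now and turns {{...}} into {...}.
template = f"Section: {{chunk_text}} | Query: {{user_query}} | Reply: {USEFUL}"
print(template)

# Stage 2: the remaining placeholders are filled per chunk and query.
filled = template.format(chunk_text="some passage", user_query="some question")
print(filled)

# Braces inside the substituted values are not re-parsed by str.format.
with_braces = template.format(chunk_text="{not a field}", user_query="q")
print(with_braces)
```
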
In [17]:
from langchain.schema.messages import AIMessage
from langchain.schema.messages import BaseMessage
from langchain.schema.messages import HumanMessage
from langchain.schema.messages import SystemMessage

def dict_based_prompt_to_langchain_prompt(
    messages: list[dict[str, str]]
) -> list[BaseMessage]:
    prompt: list[BaseMessage] = []
    for message in messages:
        role = message.get("role")
        content = message.get("content")
        if not role:
            raise ValueError(f"Message missing `role`: {message}")
        if not content:
            raise ValueError(f"Message missing `content`: {message}")
        if role == "user":
            prompt.append(HumanMessage(content=content))
        elif role == "system":
            prompt.append(SystemMessage(content=content))
        elif role == "assistant":
            prompt.append(AIMessage(content=content))
        else:
            raise ValueError(f"Unknown role: {role}")
    return prompt

In [18]:
from typing import Callable
from langchain_openai import ChatOpenAI

def llm_eval_chunk(query: str, chunk_content: str) -> bool:
    def _get_usefulness_messages() -> list[dict[str, str]]:
        messages = [
            {
                "role": "user",
                "content": CHUNK_FILTER_PROMPT.format(
                    chunk_text=chunk_content, user_query=query
                ),
            },
        ]

        return messages

    def _extract_usefulness(model_output: BaseMessage) -> bool:
        """Default to 'useful' if the LLM reply doesn't match the pattern exactly.
        It is better to trust the (re)ranking when the LLM filter fails."""
        reply = model_output.content.strip().strip('"').lower()
        return reply != NONUSEFUL_PAT.lower()

    llm = ChatOpenAI(model='gpt-3.5-turbo')

    messages = _get_usefulness_messages()
    filled_llm_prompt = dict_based_prompt_to_langchain_prompt(messages)
    model_output = llm.invoke(filled_llm_prompt)

    return _extract_usefulness(model_output)


def llm_batch_eval_chunks(
    query: str, chunk_contents: list[str], use_threads: bool = True
) -> list[bool]:
    if use_threads:
        functions_with_args: list[tuple[Callable, tuple]] = [
            (llm_eval_chunk, (query, chunk_content)) for chunk_content in chunk_contents
        ]

        print(
            "Running LLM usefulness eval in parallel (following logging may be out of order)"
        )
        parallel_results = run_functions_tuples_in_parallel(
            functions_with_args, allow_failures=True
        )

        # In case of failure/timeout, don't throw out the chunk
        return [True if item is None else item for item in parallel_results]

    else:
        return [
            llm_eval_chunk(query, chunk_content) for chunk_content in chunk_contents
        ]
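
`run_functions_tuples_in_parallel` is imported from `src.search.threadpool` and its implementation is not shown here. As a rough guess at the contract the code above depends on (results in input order, `None` for calls that failed), a stand-in could look like this. The names `run_in_parallel_sketch` and `flaky_eval` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_in_parallel_sketch(
    functions_with_args: list[tuple[Callable[..., Any], tuple]],
    allow_failures: bool = True,
    max_workers: int = 8,
) -> list[Any]:
    """Hypothetical stand-in: run each (func, args) pair in a thread and
    return results in input order, with None for calls that raised."""
    results: list[Any] = [None] * len(functions_with_args)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, *args) for func, args in functions_with_args]
        for position, future in enumerate(futures):
            try:
                results[position] = future.result()
            except Exception:
                if not allow_failures:
                    raise
    return results


def flaky_eval(query: str, chunk: str) -> bool:
    # Toy evaluator: raises on empty chunks, otherwise does a substring check.
    if not chunk:
        raise ValueError("empty chunk")
    return query in chunk


raw = run_in_parallel_sketch(
    [(flaky_eval, ("pollution", c)) for c in ["pollution exclusion", "", "tenant law"]]
)
# Same fallback as llm_batch_eval_chunks: a failed call keeps the chunk.
keep = [True if item is None else item for item in raw]
print(raw, keep)
```
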

In [19]:
from llama_index_client import TextNode
from pydantic import BaseModel
from src.search.models import dataframe_to_text_nodes

def filter_chunks(
    query: str,
    chunks_to_filter: list[BaseModel],
    max_llm_filter_chunks: int = 20,
) -> list[BaseModel]:
    """Filters chunks based on whether the LLM thought they were relevant to the query.

    Args:
        query (str): The query to filter chunks against.
        chunks_to_filter (list[BaseModel]): A list of BaseModel objects to filter.
        max_llm_filter_chunks (int, optional): The maximum number of chunks to consider. Defaults to 20.

    Returns:
        list[BaseModel]: A list of BaseModel objects that were marked as relevant.
    """
    chunks_to_filter = chunks_to_filter[: max_llm_filter_chunks]
    llm_chunk_selection = llm_batch_eval_chunks(
        query=query,
        chunk_contents=[chunk.text for chunk in chunks_to_filter],
    )
    return [
        chunk
        for ind, chunk in enumerate(chunks_to_filter)
        if llm_chunk_selection[ind]
    ]
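
One consequence of `filter_chunks` that is easy to miss: chunks beyond `max_llm_filter_chunks` are sliced off before evaluation, so they are dropped rather than passed through. A small sketch with a keyword stub in place of the LLM call (`filter_with_mask` and `keyword_evaluator` are hypothetical names) makes that visible:

```python
from itertools import compress
from typing import Callable


def filter_with_mask(
    query: str,
    chunks: list[str],
    evaluate: Callable[[str, list[str]], list[bool]],
    max_chunks: int = 20,
) -> list[str]:
    """Keep only flagged chunks; chunks past max_chunks are never evaluated or returned."""
    candidates = chunks[:max_chunks]
    mask = evaluate(query, candidates)
    return list(compress(candidates, mask))


def keyword_evaluator(query: str, chunks: list[str]) -> list[bool]:
    # Hypothetical stub standing in for the LLM usefulness call.
    return [query.lower() in chunk.lower() for chunk in chunks]


chunks = ["Pollution exclusion clause", "Landlord repairs", "Gradual pollution events"]
top_two_only = filter_with_mask("pollution", chunks, keyword_evaluator, max_chunks=2)
all_chunks = filter_with_mask("pollution", chunks, keyword_evaluator)
print(top_two_only)
print(all_chunks)
```

With `max_chunks=2` the third chunk is lost even though the stub would have accepted it, so the cap is worth choosing with the retrieval `top_k` in mind.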

In [20]:
from src.types import Document
from src.utils.pydantic_utils import dataframe_to_documents, dataframe_to_pydantic_objects

In [21]:
assert 1+1==3

AssertionError: 

In [22]:
docs = dataframe_to_documents(vector_results[0], content='summary', embedding='openai_embeddings_summary')

In [23]:
print(docs[0])

DynamicDocument:
content: 'The citation 578 N.E.2d 926 concerns primarily the interpretation of
insurance policies, focusing on the broad duty of insurers to defend their
insured, the consideration of ambiguous policy terms in favor of the
insured, and the principles for determining coverage in complex scenarios
such as pollu'..., 
metadata: {
    "source": null
}, score: 0.8519, embedding: vector of size 1536


In [None]:
nodes = dataframe_to_documents(
    vector_res,
    content='summary',
    metadata=['citation']
)

filtered_chunks = filter_chunks(test_query, nodes)
print(f"\nReturned {len(filtered_chunks)} nodes")
print(f"{test_query}\n")
for obj in filtered_chunks:
    print(obj)
    print("-" * 50)

Running LLM usefulness eval in parallel (following logging may be out of order)

Returned 8 nodes

Regarding the pollution exclusion clause under the terms of comprehensive general liability (CGL) insurance, how is phrase "sudden and accidental' defined and applied given a claim involving gradual but unintentional polluting events.


DynamicDocument:
content: 'The citation 578 N.E.2d 926 concerns primarily the interpretation of
insurance policies, focusing on the broad duty of insurers to defend their
insured, the consideration of ambiguous policy terms in favor of the
insured, and the principles for determining coverage in complex scenarios
such as pollu'..., 
metadata: {
    "source": null,
    "citation": "578 N.E.2d 926"
}, score: 0.9766
--------------------------------------------------
DynamicDocument:
content: 'The case law from United States Gypsum Co. v. Admiral Insurance Co., 268
Ill. App. 3d 598, clarifies significant concepts in insurance law across
multiple topics, such as

In [None]:
nodes_kw = dataframe_to_documents(
    df=keyword_res, 
    content='keywords',
    metadata=['citation']
)

filtered_chunks_kw = filter_chunks(test_query, nodes_kw)
print(f"\nReturned {len(filtered_chunks_kw)} nodes")
print(f"{test_query}\n")
for obj in filtered_chunks_kw:
    print(obj)
    print("-" * 50)

Running LLM usefulness eval in parallel (following logging may be out of order)

Returned 7 nodes

Regarding the pollution exclusion clause under the terms of comprehensive general liability (CGL) insurance, how is phrase "sudden and accidental' defined and applied given a claim involving gradual but unintentional polluting events.


DynamicDocument:
content: 'insurance policy interpretation, ambiguity resolution, insurer's duty to
defend, duty to indemnify, de novo review, intentions of parties,
exclusionary provisions, family member ambiguity, entitlement exclusion,
environmental pollution'..., 
metadata: {
    "source": null,
    "citation": "687 N.E.2d 72"
}, score: 0.9802
--------------------------------------------------
DynamicDocument:
content: 'insurance policy interpretation, duty to defend, policy coverage, ambiguous
policy terms, pollution exclusion, property damage, insurer's obligations,
coverage disputes'..., 
metadata: {
    "source": null,
    "citation": "578 N.E.2d 9

In [None]:
distinct_citations = list(set(node.metadata.citation for node in filtered_chunks + filtered_chunks_kw))

In [None]:
len(distinct_citations)

10
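
Building the list through `set` discards the ranking order of the results. If that order matters downstream, one option is `dict.fromkeys`, which removes duplicates while keeping first-seen order. The citation lists below are made-up examples, not the notebook's actual results.

```python
# Hypothetical citations in ranked order, with overlap between the two result lists.
vector_citations = ["578 N.E.2d 926", "268 Ill. App. 3d 598", "643 N.E.2d 1226"]
keyword_citations = ["687 N.E.2d 72", "578 N.E.2d 926"]

# dict keys are unique and keep insertion order (Python 3.7+).
distinct_in_order = list(dict.fromkeys(vector_citations + keyword_citations))
print(distinct_in_order)
```
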

In [None]:
filtered_chunks.extend(filtered_chunks_kw)

In [None]:
from src.utils.pydantic_utils import flatten_pydantic_instance

test_flat_model = flatten_pydantic_instance(filtered_chunks[0])

In [None]:
search_res_df = pd.DataFrame([test_flat_model])

In [None]:
results = []

# Iterate directly and avoid the name `df`, which would overwrite the DataFrame loaded earlier.
for chunk in filtered_chunks:
    flat_model = flatten_pydantic_instance(chunk)
    results.append(pd.DataFrame([flat_model]))


In [None]:
context_df = pd.concat(results)
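
`flatten_pydantic_instance` is not shown in this notebook. The column names that appear later (`metadata__source`, `metadata__citation`) suggest it joins nested keys with a double underscore, so the sketch below is only a guess at that behaviour, written against plain dicts. `flatten_dict` is a hypothetical name.

```python
from typing import Any


def flatten_dict(nested: dict[str, Any], parent_key: str = "", sep: str = "__") -> dict[str, Any]:
    """Flatten nested dicts, joining key paths with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


record = {
    "content": "duty to defend",
    "metadata": {"source": None, "citation": "578 N.E.2d 926"},
    "score": 0.9766,
}
rows = [flatten_dict(record)]
print(rows[0])
```

A list of flat dicts like `rows` can be handed to `pd.DataFrame(rows)` in a single call, which also produces a unique 0..n-1 index instead of the repeated `0` index that concatenating one-row frames yields.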

In [None]:
Markdown(context_df.head(1)['content'].tolist()[0])

The citation 578 N.E.2d 926 concerns primarily the interpretation of insurance policies, focusing on the broad duty of insurers to defend their insured, the consideration of ambiguous policy terms in favor of the insured, and the principles for determining coverage in complex scenarios such as pollution or property damage. It establishes that insurers have a broad duty to defend their insured if the allegations in the underlying complaint potentially fall within the policy’s coverage. It also sets precedent for interpreting ambiguous policy language in favor of the insured and outlines the conditions under which pollution exclusions apply.

In [None]:
context_df.tail()

Unnamed: 0,id,content,metadata__source,metadata__citation,score_,embedding,index,keywords,recency,openai_embeddings_summary,splade_embeddings,search_type,score,summary
0,cc931aa3da3830196b56c6e88639764b28abdd541c481c...,"insurance, claim, coverage, policy, occurrence...",,268 Ill. App. 3d 598,0.71531,,33,,,"[-0.008939363993704319, -0.011452711187303066,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",splade,0.71531,The case law from United States Gypsum Co. v. ...
0,ee22aac895c83ec83d9900e8e2171f32c46e27815417e0...,"insurance law, continuous trigger approach, pr...",,643 N.E.2d 1226,0.705325,,53,,References to 643 N.E.2d 1226 have evolved to ...,"[-0.02118617109954357, -0.003369903890416026, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",splade,0.705325,"643 N.E.2d 1226, United States Gypsum Co. v. A..."
0,caff281ae5c222364389d859aed731584a418163158e45...,"'arising out of' language, insurance policy ex...",,357 Ill. App. 3d 955,0.686908,,89,,,"[-0.015558742918074131, -0.011712239123880863,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",splade,0.686908,The case cautions against broadly interpreting...
0,64100020552c4150874373e53e10e40484aea600c027eb...,"insurance policy, indemnification, self-insure...",,105 Ill. 2d 486,0.659979,,58,,,"[-0.014985797926783562, -0.008534330874681473,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",splade,0.659979,"The citation 105 Ill. 2d 486, pertaining to th..."
0,96a86e2c5b26d21ea30feddea76386987af5752132ad29...,"insurance, accident, definition, unforeseen oc...",,89 Ill. App. 3d 617,0.480345,,40,,The definition of 'accident' established in 89...,"[-0.003400814952328801, -0.015401927754282951,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",splade,0.480345,The case Aetna Casualty & Surety Co. v. Freyer...


In [None]:
from src.utils.gen_utils import count_tokens


def create_context(
    df: pd.DataFrame,
    context_token_limit: int = 25000
) -> str:
    """
    Creates a context string from a DataFrame within a specified token limit,
    applying word wrapping to the summary text.

    Args:
        df (pd.DataFrame): The DataFrame containing case data.
        context_token_limit (int): The maximum number of tokens for the context.

    Returns:
        str: A formatted string containing case details within the token limit,
             with word wrapping applied to the summary text.
    """
    import textwrap

    # Re-index a copy so the caller's DataFrame is not mutated in place.
    df = df.reset_index(drop=True)
    returns = []
    count = 1
    total_tokens = 100  # Starting token count to account for initial text.
    # Add the text to the context until the context is too long.
    for _, row in df.iterrows():
        wrapped_summary = textwrap.fill(row['content'], width=80)
        text = (
            f"[{count}] {row['metadata__citation']}\n"
            f"Summary: {wrapped_summary}\n"
            "-----------------------------------------\n"
        )
        text_tokens = count_tokens(text)
        if total_tokens + text_tokens > context_token_limit:
            break
        returns.append(text)
        total_tokens += text_tokens
        count += 1
    return "\n\n".join(returns)
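
The greedy budget logic in `create_context` can be tried in isolation. The sketch below swaps `count_tokens` for a crude whitespace count (an assumption; the real helper presumably uses a model tokenizer) and uses hypothetical names `rough_token_count` and `pack_entries`.

```python
import textwrap


def rough_token_count(text: str) -> int:
    # Assumption: whitespace-separated words as a crude proxy for tokens.
    return len(text.split())


def pack_entries(entries: list[tuple[str, str]], token_limit: int, reserved: int = 0) -> str:
    """Greedily add numbered entries until the next one would exceed the limit."""
    blocks: list[str] = []
    used = reserved
    for number, (citation, summary) in enumerate(entries, start=1):
        block = f"[{number}] {citation}\nSummary: {textwrap.fill(summary, width=80)}\n"
        cost = rough_token_count(block)
        if used + cost > token_limit:
            break
        blocks.append(block)
        used += cost
    return "\n".join(blocks)


entries = [
    ("578 N.E.2d 926", "Insurers have a broad duty to defend."),
    ("268 Ill. App. 3d 598", "Continuous trigger approach for progressive property damage."),
]
tight = pack_entries(entries, token_limit=15)
roomy = pack_entries(entries, token_limit=100)
print(tight)
```

Because the loop stops at the first entry that does not fit, later and shorter entries are never considered; that matches the `break` in `create_context`.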

In [None]:
context = create_context(context_df)

In [None]:
print(context)

[1] 578 N.E.2d 926
Summary: The citation 578 N.E.2d 926 concerns primarily the interpretation of insurance
policies, focusing on the broad duty of insurers to defend their insured, the
consideration of ambiguous policy terms in favor of the insured, and the
principles for determining coverage in complex scenarios such as pollution or
property damage. It establishes that insurers have a broad duty to defend their
insured if the allegations in the underlying complaint potentially fall within
the policy’s coverage. It also sets precedent for interpreting ambiguous policy
language in favor of the insured and outlines the conditions under which
pollution exclusions apply.
-----------------------------------------


[2] 268 Ill. App. 3d 598
Summary: The case law from United States Gypsum Co. v. Admiral Insurance Co., 268 Ill.
App. 3d 598, clarifies significant concepts in insurance law across multiple
topics, such as the determination of claim coverage based on third party's claim
contents, 

In [None]:
from src.agent.tools.utils import extract_citation_numbers_in_brackets

def create_formatted_input(
    df: pd.DataFrame,
    query: str,
    context_token_limit: int = 25000,
    instructions: str = """Instructions: Working step-by-step using only the provided search results that are relevant to a particular component of the user query, write a detailed analysis focusing on how the prior case(s) can inform a component of the user query.\n\nNew Query:""",
) -> str:

    context = create_context(df, context_token_limit)

    try:
        prompt = f"""{context}\n\n{instructions}\n{query}\n\nAnalysis:"""
        return prompt
    except Exception as e:
        print(e)
        return ""

In [None]:
from src.agent.tools.utils import ResearchReport
import openai
import instructor
from tenacity import Retrying, stop_after_attempt, wait_fixed


def get_final_answer(formatted_input: str, model_name: str) -> ResearchReport:
    client = instructor.patch(openai.OpenAI())
    return client.chat.completions.create(
        model=model_name,
        response_model=ResearchReport,
        max_retries=Retrying(
            stop=stop_after_attempt(5),
            wait=wait_fixed(1),
        ),
        messages=[
            {
                "role": "system",
                "content": "You are a helpful legal research assistant. Working step-by-step, first break down the user query into logical sub-questions, then analyze each component with respect to the case law search results. Using only the provided context, offer insights on the applicability of the past case(s) and how the legal researcher can reference them to address the components of the broader question, and ultimately the full question. Make sure to use highly structured markdown formatting.",
            },
            {
                "role": "user",
                "content": f"Search Results:\n\n{formatted_input}"
            },
        ],
    )
    


In [None]:
formatted_input = create_formatted_input(search_res_df, test_query, context_token_limit=25000)

response_model = get_final_answer(formatted_input, model_name="gpt-4-turbo-preview")

In [None]:
Markdown(response_model.research_report)

The query focuses on how the phrase 'sudden and accidental' within the context of a pollution exclusion clause under comprehensive general liability (CGL) insurance policies is defined and applied, especially in cases involving gradual but unintentional polluting events.

Considering the summary of the citation 578 N.E.2d 926, it directly addresses relevant aspects that can inform the analysis of the new query:

1. **Interpretation of Insurance Policies**: The case provides a precedent for a broad interpretation in favor of the insured when policy terms are ambiguous. This principle can be crucial in analyzing how 'sudden and accidental' is defined, especially if the policy language does not explicitly clarify whether gradual pollution events are covered.

2. **Determination of Coverage in Complex Scenarios**: The case discusses how insurers determine coverage in intricate situations, including pollution or property damage. This is particularly relevant for evaluating how the 'sudden and accidental' phrase is applied in claims involving gradual but unintentional polluting events, which are complex by nature.

3. **Pollution Exclusions**: The case outlines the conditions under which pollution exclusions apply. This directly touches upon how 'sudden and accidental' might be interpreted in the context of pollution exclusions, providing a foundation for understanding whether and how such clauses might exclude or include coverage for gradual pollution incidents.

In summary, the principles established in 578 N.E.2d 926 suggest that in cases involving ambiguous policy terms or complex scenarios like gradual pollution, the interpretation is likely to lean in favor of the insured. Therefore, if the phrase 'sudden and accidental' within the pollution exclusion clause is ambiguous or can be interpreted in more than one way in the context of gradual but unintentional polluting events, the precedent suggests that courts might favor an interpretation that includes coverage for such events under CGL insurance policies. This precedent can be remendously valuable for legal researchers or practitioners arguing for the inclusion of gradual pollution incidents in insurance coverage, despite the presence of a pollution exclusion clause.

In [None]:
Markdown(response_model.research_report)

Regarding the interpretation and application of the 'sudden and accidental' phrase within the context of pollution exclusion clauses under Comprehensive General Liability (CGL) insurance policies, especially in cases involving gradual but unintentional pollution events, a detailed analysis of applicable case law is essential. This analysis will explore how this phrase has been defined and applied in past rulings to guide the current enquiry.

**Case Law Analysis:**

- **United States Gypsum Co. v. Admiral Insurance Co., [1] & [3]:** These entries note the application of the 'continuous trigger approach' for progressive and inseparable property damage and the interpretation of insurance coverage through cause analysis. While not directly referencing 'sudden and accidental,' these cases are significant for understanding how courts may approach policy interpretations involving gradual pollution events. The principles of cause analysis and continuous trigger might be indirectly relevant to dissecting the 'sudden and accidental' clause in pollution exclusions.

- **Travelers Insurance Co. v. Eljer Manufacturing, Inc., [6] & [17]:** These references address the construction of insurance policy provisions and interpretation duties of insurers. They establish guidelines that might be applicable when debating the definition and application of 'sudden and accidental' in pollution exclusions, especially since the summary points to discussions around 'occurrence' and 'property damage' in insurance contracts.

- **Zurich Insurance Co. v. Raymark Industries, Inc., [18]:** This case directly deals with the interpretation of policy terms, especially concerning duties to defend and indemnify and trigger of coverage for specific claims like asbestos exposure. While not exclusively concerning pollution, the principles set forth regarding insurance policy interpretation could be instrumental in understanding how 'sudden and accidental' might be construed in the context of pollution.

**Discussion:**

The analyzed cases suggest that the interpretation of 'sudden and accidental' within pollution exclusions likely depends on broader legal principles of insurance policy interpretation, including the continuous trigger for coverage and cause analysis. Since gradual pollution events fall outside the traditional 'sudden' framework, the application of these clauses would require a nuanced approach, examining the specific language of the policy and considering precedents that interpret similar terms.

**Conclusion:**

Given the indirect relevance of the discussed cases, further research into directly related case law or advisory opinions concerning 'sudden and accidental' in the context of CGL policies and pollution exclusions might be necessary. However, these cases provide a foundational understanding of how courts approach complex policy interpretations and might influence arguments concerning the applicability of pollution exclusions to gradual environmental damage.
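
The report refers to search results by bracketed number (`[1]`, `[17]`, and so on), which correspond to the `[count]` labels written by `create_context`. The imported helper `extract_citation_numbers_in_brackets` is not shown, so the following is only a guess at that kind of extraction, under the hypothetical name `bracket_numbers`:

```python
import re


def bracket_numbers(text: str) -> list[int]:
    """Return the distinct integers that appear as [n], in order of first appearance."""
    found = (int(match) for match in re.findall(r"\[(\d+)\]", text))
    return list(dict.fromkeys(found))


report = "Gypsum, [1] & [3]; Eljer, [6] & [17]; Raymark, [18]; again [1]."
cited = bracket_numbers(report)
print(cited)
```

Each number can then be mapped back to a row of the context DataFrame (position `n - 1`) to check that the cited source actually supports the claim.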

In [None]:
from src.search.query_expansion import generate_subquestions

questions = generate_subquestions(test_query, n='any number of')
questions.questions

['Should I just pay the rent and wait for my refund?',
 'If the post office shows that my original money order was cashed, am I out that money?',
 'What can I do about a landlord who is slow to make repairs?',
 'Is it legal for my landlord to enter my apartment without permission?',
 'Can a landlord raise rent in response to making repairs?',
 'How do I handle disruptive neighbors?',
 'What should I do if I suspect my landlord of stealing from me?',
 'What are my rights as a tenant in Missouri?',
 'Is it legal to use pliers to turn on water in lieu of a broken knob?',
 "What actions can I take if I've been treated unfairly by my landlord?"]

In [None]:
questions._raw_response.usage

CompletionUsage(completion_tokens=147, prompt_tokens=1130, total_tokens=1277)

In [None]:
from src.search.query_filter import generate_query_plan, auto_filter_fts_search

In [None]:
query_plan = generate_query_plan(
    input_df=df,
    query=test_query,
    filter_fields=[
        'state',
    ]
)
filtered_df = query_plan.filter_df(df=df)

2024-03-19 21:42:45 - INFO - Schema shown to LLM: 
Name of each field, its type and unique values (up to 20):
* state (string);  Values - ['NM' 'IN' 'WY' 'NH' 'MP' 'PA' 'MH' 'ID' 'AR' 'MA' 'KS' 'AS' 'ND' 'PR'
 'DE' 'FL' 'LA' 'OR' 'VT' 'PW'], ... 39 more



2024-03-19 21:42:55 - INFO - Input DataFrame has 5,000 rows
2024-03-19 21:42:55 - INFO - Applying filter(s): state LIKE '%OR%'
2024-03-19 21:43:03 - INFO - Filtered DataFrame has 86 rows


In [None]:
filtered_df.head(2)

Unnamed: 0,index,created_utc,full_link,id,body,title,text_label,flair_label,embeddings,token_count,llm_title,state,kmeans_label,topic_title,splade_embeddings
0,2029,1578267399,https://www.reddit.com/r/legaladvice/comments/...,ekl2ef,For context I live in the Philippines. I wont ...,My professor refuses to show us ALL of our tes...,school,9,"[-0.00954271624451715, 0.007157037183387862, 0...",953,"""Unrevealed Grades and Lack of Transparency: S...",OR,9,Legal Consequences of False Accusations,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,3320,1591126549,https://www.reddit.com/r/legaladvice/comments/...,gve3nq,Edit: I live in Washington state.\n\nSo I live...,My landlord has been harassing me about my pet...,housing,7,"[-0.0034782202413053045, 0.00616729225832095, ...",759,"""Legal dispute over pet snake: Landlord threat...",OR,3,Rental Property and Landlord Matters,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [None]:
print(query_plan.original_query)
print(query_plan.rephrased_query)

Do I have any legal recourse here? I know Oregon is an 'at will' state, but it sounds like there are at LEAST two instances that offer grounds for wrongful termination (just based on my limited knowledge of the ADA, dept of labor, BOLI, etc.). 
legal recourse for wrongful termination in 'at will' employment including issues related to mistreatment, health code violations, improper handling of company money, and potential discrimination due to medical conditions


In [None]:
test_res = auto_filter_fts_search(
    df=df,
    query='marijuana',
    top_k=20,
    text_column="body",
    embeddings_column="embeddings",
    filter_fields=[
        'state',
    ])

2024-03-19 21:44:19 - INFO - Schema shown to LLM: 
Name of each field, its type and unique values (up to 20):
* state (string);  Values - ['NM' 'IN' 'WY' 'NH' 'MP' 'PA' 'MH' 'ID' 'AR' 'MA' 'KS' 'AS' 'ND' 'PR'
 'DE' 'FL' 'LA' 'OR' 'VT' 'PW'], ... 39 more



2024-03-19 21:44:22 - INFO - No filters were identified for query: marijuana
2024-03-19 21:44:22 - INFO - Revised query: marijuana
2024-03-19 21:44:23 - INFO - Full Text Search (FTS) search yielded a DataFrame with 20 rows


In [None]:
Markdown(test_res['body'].tolist()[0])

I'm a New York State medical marijuana patient. I also work in healthcare. I applied to a new job at a new hospital, and they are discriminating against me for being a medical marijuana patient. I was offered the job and accepted, but when I went to get my pre-employment physical conducted, I gave them my medical marijuana card and informed them that I am a patient. They are now refusing to hire me. Is this legal? I already contacted the division of human rights at the labor department and they said I may or may not have a case.