In [19]:
import pandas as pd

Exoneration documents, the records produced or acquired during the legal process that formally vindicates individuals erroneously convicted of crimes, are rich, informative resources for wrongful conviction research. They are particularly revealing about the law enforcement personnel involved in such cases. However, these documents are both voluminous, with thousands of pages of text per case, and unstructured.

We seek to make these collections searchable and useful for lawyers, advocates, and community members to better investigate patterns of police misconduct and corruption. In order to do so, we rely on a multi-stage process:

1. Metadata Compilation: We started by compiling a comprehensive CSV index. This structured approach forms the foundation of our file management system, enabling file retrieval and basic deduplication. The metadata we organize in this step includes:

    - file path and name
    - file type
    - sha1 content hash: we truncate this to create unique file identifiers
    - file size and number of pages
    - case ID: when we scanned the documents, we organized them into directories by case ID; here we extract and validate the directory name to add a structured case ID field to our metadata index.

2. Page classification: The documents in the collection are varied, representing all documents produced or acquired in the course of an exoneration case, with case timelines going back decades. After some internal review and discussions with the Innocence Project New Orleans (IPNO) case management team, we narrowed our focus to three types of documents:

    - police reports: include mentions of officers involved in the arrest that led to the wrongful conviction, or related arrests.
    - transcripts: court transcripts, recorded by clerks of the court, where officers appear as witnesses and describe their role in the conviction (including making the arrest, transporting evidence, interrogating the accused).
    - testimonies: witness testimony, which will include testimony from officers involved in the conviction

    [*Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval*](https://arxiv.org/abs/1502.07058) describes an effective approach for retrieving specific types of documents from disorganized collections: fine-tuning a pretrained convolutional neural network to label thumbnail images of document pages. In order to use this technique, we needed training data and a pretrained model. To quickly assemble a training data set for our page classifier, we started by noticing that in many cases the file name indicated the document type. These documents were scanned by many people at different times, so we could not rely on this heuristic for comprehensive categorization of documents, but there was more than enough there to jumpstart our training process. We collected our initial training data by probing filenames for specific search terms. Once we had a trained classifier, we were able to measure generalization performance on documents that couldn't be classified via filename, and we were also better able to target additional training data, for example by reviewing pages where the classifier had low confidence about its prediction. Once we had training data, we used [FastAI](https://docs.fast.ai/) to fine-tune the `ResNet34` architecture, pretrained on [ImageNet](https://www.image-net.org/), to identify reports, transcripts, and testimonies based on page thumbnails.

3. Information Extraction: Currently, we're engaged in extracting structured information from the documents we've identified, and that work is the focus of the current post. Our goal is to extract structured information related to each police officer or prosecutor mentioned in the documents, such as their names, ranks, and roles in the wrongful conviction.

4. Deduplication: The previous step leaves us with many distinct mentions, but some individuals are mentioned many times, within the same case or across cases. Here we rely on HRDAG's [extensive experience with database deduplication](https://hrdag.org/tech-notes/adaptive-blocking-writeup-1.html) to create a unique index of officers and prosecutors involved in wrongful convictions, along with a record of the role or roles each had in the wrongful conviction.

5. Cross-referencing: In the final stage, we'll cross-reference the officer names and roles we've extracted with the Louisiana Law Enforcement Accountability Database [(LLEAD.co)](https://llead.co/). This stage will assist us in identifying other individuals linked with the implicated officers, such as their partners, those co-accused in misconduct complaints, or those co-involved in use-of-force incidents. The list of officers associated with previous wrongful conviction cases can then be cross-referenced with the IPNO's internal data on potential wrongful convictions with the aim of uncovering new instances of wrongful convictions.
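
As an illustration of step 1, here is a minimal sketch of how one row of such a metadata index could be built. It is not the project's actual code: the function names, the 8-character truncation length, and the case-ID pattern are all assumptions made for the example, and page counting is left out because it depends on a PDF library.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical choices for illustration: the truncation length and the
# case-ID pattern are assumptions, not the project's actual values.
FILE_ID_LENGTH = 8
CASE_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def file_id(content: bytes) -> str:
    """Truncated SHA-1 content hash, used as a short file identifier."""
    return hashlib.sha1(content).hexdigest()[:FILE_ID_LENGTH]


def case_id_from_path(path: Path, root: Path):
    """Take the first directory under root as the case ID, or None if invalid."""
    parts = path.relative_to(root).parts
    if len(parts) < 2:  # file sits directly under root, so no case directory
        return None
    return parts[0] if CASE_ID_PATTERN.match(parts[0]) else None


def index_row(path: Path, root: Path) -> dict:
    """Build one metadata row for a file on disk."""
    content = path.read_bytes()
    return {
        "filepath": str(path),
        "filename": path.name,
        "filetype": path.suffix.lower().lstrip("."),
        "fileid": file_id(content),
        "filesize": len(content),
        "case_id": case_id_from_path(path, root),
    }
```

Rows built this way can be written out with `csv.DictWriter` or collected into a pandas DataFrame; two files with the same `fileid` are candidates for deduplication.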

A primary task in our process is extracting officer information from documents – relevant information includes the officer's name and the role the officer played in the wrongful conviction. We initially considered a regex-based solution but soon realized that the complexity and variability of the data rendered regex less than ideal. While regex is efficient for pattern matching, it struggles with language variations and nuances. For instance, a text string from a court transcript reading, "Sergeant Ruiz was mentioned as being involved in the joint investigation with Detective Martin Venezia regarding the Seafood City burglary and the murder of Kathy Ulfers," would pose a problem for regex because it fails to capture semantic context, making it unable to infer that Sergeant Ruiz acted as a lead detective in Kathy Ulfers' murder case. 
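
To make the limitation concrete, here is a small sketch of what a regex baseline might look like. The pattern and the list of titles are assumptions chosen for this example, not the baseline we evaluated. It finds titled names in the sentence above, but nothing in the match says who led which investigation.

```python
import re

# Hypothetical pattern: a rank or title followed by one or more capitalized words.
TITLE_NAME = re.compile(
    r"\b(Sergeant|Detective|Officer|Lieutenant|Captain)\s+"
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def find_titled_names(text):
    """Return (title, name) pairs for every titled name found in the text."""
    return TITLE_NAME.findall(text)


sentence = (
    "Sergeant Ruiz was mentioned as being involved in the joint investigation "
    "with Detective Martin Venezia regarding the Seafood City burglary and "
    "the murder of Kathy Ulfers."
)

mentions = find_titled_names(sentence)
print(mentions)  # [('Sergeant', 'Ruiz'), ('Detective', 'Martin Venezia')]
```

The pattern recovers both names, but assigning a role such as lead detective requires reading the surrounding context, which is the part a pattern match cannot do.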

In [20]:
### add results from baseline model here?

An alternative approach is to prompt a generative language model with the document text along with a query describing our required output. One challenge with this approach is that the documents we're processing may be hundreds of pages long, whereas generative models have a limit to the length of the prompt you supply. We needed a way to pull out of each document just the chunks of text where the relevant officer information appears, to provide a more helpful prompt.

We split the problem into two steps: identifying the relevant chunks of text, and then extracting structured officer information from those chunks. We use [Langchain](https://docs.langchain.com/docs/), a natural language processing library, to manage this pipeline, with OpenAI's GPT-3 as the language model powering it.

For the first step, identifying the relevant chunks of text within the larger document, we used the approach outlined in [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496). This method splits our information retrieval task into four steps:

1. First we split the document text into overlapping chunks, and calculate embeddings for each chunk.
2. We then feed our query asking for names and roles of mentioned officers to an instruction-following generative language model, instructing it to compose a "hypothetical" document in response to the query.
3. We embed the hypothetical document using the same embedding system as we use to encode the text chunks from the document.
4. We use [Faiss](https://faiss.ai/) to do a similarity search, comparing the hypothetical document embedding against the chunk embeddings to find chunks of text that resemble our hypothetical document.

Here is the method we use to generate hypothetical document embeddings. These embeddings encapsulate the contextual information in our documents.

In [21]:
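# Import paths follow the legacy `langchain` package layout.
from langchain.chains import HypotheticalDocumentEmbedder, LLMChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate


def generate_hyde():
    # Language model that writes the hypothetical document for a query
    llm = OpenAI()
    prompt_template = """\
    You're an AI assistant specializing in criminal justice research. 
    Your main focus is on identifying the names and providing detailed context of mention for each law enforcement personnel.
    ...
    """
    prompt = PromptTemplate(input_variables=["question"], template=prompt_template)
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    # Embedding model shared by the document chunks and the hypothetical document
    base_embeddings = OpenAIEmbeddings()
    embeddings = HypotheticalDocumentEmbedder(llm_chain=llm_chain, base_embeddings=base_embeddings)
    return embeddings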
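
The retrieval steps listed above can be illustrated without calling any external service. The sketch below is a toy stand-in, not our pipeline: a bag-of-words counter replaces the OpenAI embeddings, a plain cosine-similarity sort replaces Faiss, the overlapping chunks are written by hand, and the "hypothetical document" is typed in rather than generated by a language model. All names in it are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks, hypothetical_document, k):
    """Return the k chunks most similar to the hypothetical document."""
    query_vector = embed(hypothetical_document)
    ranked = sorted(chunks, key=lambda chunk: cosine(embed(chunk), query_vector), reverse=True)
    return ranked[:k]


# Step 1: hand-written overlapping chunks of a made-up transcript.
chunks = [
    "The court convened at nine in the morning and the clerk read the docket.",
    "the clerk read the docket. Counsel for the defense requested a short recess",
    "requested a short recess before the first witness. Sergeant Martin Venezia testified",
    "Sergeant Martin Venezia testified that he took over the homicide investigation.",
    "took over the homicide investigation. The jury was excused for lunch",
]

# Step 2: in the real pipeline a language model writes this text from the query.
hypothetical_document = (
    "Sergeant John Smith testified that he took over the homicide investigation."
)

# Steps 3 and 4: embed the hypothetical document and rank chunks by similarity.
top = retrieve(chunks, hypothetical_document, k=2)
print(top[0])
```

The hypothetical document names an officer who does not exist, yet it still pulls up the chunk where a real officer describes his role, because the two passages are phrased alike. That is the idea the method relies on.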

With the hypothetical document embedder in place, the `process_single_document` function is the first step in handling raw text. It uses Langchain's `RecursiveCharacterTextSplitter` to split each document into chunks of 500 tokens, with an overlap of 250 tokens between consecutive chunks to preserve context across chunk boundaries. Because our primary concern is capturing true positives, we evaluated configurations with the [F-beta score](https://en.wikipedia.org/wiki/F-score#F%CE%B2_score) (with β=2), which weighs recall twice as heavily as precision. We tested chunk sizes of 2000, 1000, and 500, with corresponding overlaps of 1000, 500, and 250; a chunk size of 500 with an overlap of 250 performed best. After splitting, `FAISS.from_documents` embeds each chunk with the embedder passed in and builds an index for similarity search.

In [22]:
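import logging

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

logger = logging.getLogger(__name__)

# JSONLoader is assumed to be imported or defined elsewhere in the project.
def process_single_document(file_path, embeddings):
    logger.info(f"Processing document: {file_path}")

    loader = JSONLoader(file_path)
    text = loader.load()
    logger.info(f"Text loaded from document: {file_path}")

    # chunk_size and chunk_overlap are measured in characters here
    # (the splitter's default length function is len), not in tokens.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=1500)
    docs = text_splitter.split_documents(text)

    db = FAISS.from_documents(docs, embeddings)
    return db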

In the following sections, we define the core function `get_response_from_query(db, query)`. This function is the backbone of our information extraction process: it takes a document database and a query and returns the model's response to the query.

The process begins by setting up the relevant parameters. We use a prompt template to guide the query and a role template to define the roles we're interested in. We set the temperature parameter to 0 to make the responses as deterministic as possible. The `k` parameter is set to 20, a decision guided by the F-beta score results from our testing phase, so the system selects and concatenates the 20 most relevant text chunks from the document. These chunks are then sorted by similarity score and reordered before being passed to the model. As reported in [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172), current language models make the best use of relevant information that appears at the beginning or end of their context window, and the worst use of information in the middle. We therefore place the highest-ranked third of the chunks first, the lowest-ranked third in the middle, and the remaining third last.

The reordered chunks are then passed to the `run` method of Langchain's `LLMChain`, together with the role template and the original query. The chain itself is constructed with the prompt template.

The LLMChain processes these inputs and generates a structured response to the initial query.

In [None]:
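# Import paths follow the legacy `langchain` package layout.
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

PROMPT_TEMPLATE_MODEL = PromptTemplate(
    input_variables=["roles", "question", "docs"],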
    template="""
    As an AI assistant, my role is to meticulously analyze court transcripts, traditional officer roles, and extract information about law enforcement personnel.

    Query: {question}

    Transcripts: {docs}

    Roles: {roles}

    The response will contain:

    1) The name of a officer, detective, deputy, lieutenant, 
       sergeant, captain, officer, coroner, investigator, criminalist, patrolman, or technician - 
       if an individual's name is not associated with one of these titles they do not work in law enforcement.
       Please prefix the name with "Officer Name: ". 
       For example, "Officer Name: John Smith".

    2) If available, provide an in-depth description of the context of their mention. 
       If the context induces ambiguity regarding the individual's employment in law enforcement, 
       remove the individual.
       Please prefix this information with "Officer Context: ". 

    3) Review the context to discern the role of the officer.
       Please prefix this information with "Officer Role: "
       For example, the column "Officer Role: Lead Detective" will be filled with a value of 1 for officer's who were the lead detective.
""",
)

ROLE_TEMPLATE = """
US-IPNO-Exonerations: Model Evaluation Guide 
Roles:
Lead Detective
•	Coordinates with other detectives and law enforcement officers on the case.
•	Liaises with the prosecutor's office, contributing to legal strategy and court proceedings.
•	May be involved in obtaining and executing search warrants.
•	Could be called to testify in court about the investigation.
"""

def get_response_from_query(db, query):
    # Set up the parameters
    prompt = PROMPT_TEMPLATE_MODEL
    roles = ROLE_TEMPLATE
    temperature = 0
    k = 20

    # Perform the similarity search; each result is a (document, score) pair
    doc_list = db.similarity_search_with_score(query, k=k)

    # Sort by score, highest first. Note: Langchain's FAISS store reports L2
    # distance by default, where a lower score means a closer match.
    docs = sorted(doc_list, key=lambda x: x[1], reverse=True)

    # Split the sorted chunks into thirds (each slice is already sorted)
    third = len(docs) // 3
    highest_third = docs[:third]
    middle_third = docs[third:2*third]
    lowest_third = docs[2*third:]

    # Place the lowest-scoring third in the middle of the context window
    docs = highest_third + lowest_third + middle_third

    docs_page_content = " ".join([d[0].page_content for d in docs])

    # Create the chat model; temperature is a model setting, so it is set here
    # rather than passed to chain.run, which only fills in prompt variables
    llm = ChatOpenAI(model_name="gpt-4", temperature=temperature)

    # Create an instance of the LLMChain
    chain = LLMChain(llm=llm, prompt=prompt)

    # Run the LLMChain and print the response
    response = chain.run(roles=roles, question=query, docs=docs_page_content)
    print(response)

    # Return the response
    return response
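
The reordering inside `get_response_from_query` is easier to see on a small list. The helper below is a standalone sketch written for this illustration (its name is not part of our code); it takes items already ranked from best to worst and applies the same top, bottom, middle arrangement.

```python
def reorder_by_thirds(ranked):
    """Given items ranked best-first, return top third + bottom third + middle third."""
    third = len(ranked) // 3
    top = ranked[:third]
    middle = ranked[third:2 * third]
    bottom = ranked[2 * third:]
    return top + bottom + middle


# Nine items ranked "a" (best) to "i" (worst)
print(reorder_by_thirds(list("abcdefghi")))
# ['a', 'b', 'c', 'g', 'h', 'i', 'd', 'e', 'f']
```

The best-ranked items open the context, the second-best group closes it, and the weakest group sits in the middle, where the model is least likely to use it.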

For additional context, see the following inputs and outputs:

**Query**

"Identify individuals, by name, with the specific titles of officers, sergeants, lieutenants, captains, detectives, homicide officers, and crime lab personnel in the transcript. Specifically, provide the context of their mention related to key events in the case, if available."

**Relevant Document**

(1 of 20 documents identified by the Faiss similarity search as relevant)
 
 Martin Venezia, New Orleans police sergeant. A 16 .01 Sergeant Venezia, where are you assigned now? : - A Second Police District. 13 . And in October, September of 1979 and in Q 19 September and October of 1980, where were you assigned? :1 Homicide division. A. And how long have you been on the police department right now? Thirteen and a half years. A Officer Venezia, when did you or did you ever take over the investigation of ... Cathy Ulfers' murder? A", metadata={'source': '../../data/convictions/transcripts/iterative\\(C) Det. Martin Venezia Testimony - Trial One.docx'

**Response from the Model** 

Officer Name: Sergeant Martin Venezia

Officer Context: Sergeant Martin Venezia, formerly assigned to the Homicide Division, took over the invesitgation of Cather Ulfers murder.

Officer Role: Lead Detective 
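
Because the prompt asks the model to prefix each field, the response can be turned into structured records with a few lines of parsing. The parser below is a minimal sketch under that assumption; the function name and the record layout are made up for this example and are not taken from our pipeline.

```python
import re

# Matches lines such as "Officer Name: Sergeant Martin Venezia"
FIELD_PATTERN = re.compile(r"^\s*Officer (Name|Context|Role):\s*(.*\S)\s*$")


def parse_officer_response(response):
    """Group prefixed lines into one dict per officer; a Name line starts a new record."""
    records = []
    current = None
    for line in response.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue  # skip blank lines and anything without a known prefix
        field, value = match.group(1).lower(), match.group(2)
        if field == "name":
            current = {"name": value, "context": None, "role": None}
            records.append(current)
        elif current is not None:
            current[field] = value
    return records


# The model response shown above, copied as-is
response = """Officer Name: Sergeant Martin Venezia

Officer Context: Sergeant Martin Venezia, formerly assigned to the Homicide Division, took over the invesitgation of Cather Ulfers murder.

Officer Role: Lead Detective
"""

records = parse_officer_response(response)
print(records[0]["name"], "|", records[0]["role"])
```

Records in this shape feed directly into the deduplication stage, where repeated mentions of the same officer are merged.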

## Evaluations, issues, improvements

In our effort to optimize the model's capability to extract officer names from documents, we evaluated it on various parameters. 

**Preprocessing Parameters**:

- Chunk Size: This refers to the number of consecutive words or units of text processed at once.
- Chunk Overlap: The number of units that consecutive chunks share. For instance, with a chunk size of 500 and an overlap of 250, each chunk begins with the last 250 units of the preceding chunk, which preserves context across chunk boundaries.
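
The relationship between chunk size and overlap can be shown with a simple sliding window. This is a simplified sketch, not the splitter we used: Langchain's `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this function ignores. The function name is an assumption for the example.

```python
def chunk_with_overlap(units, chunk_size, chunk_overlap):
    """Split a sequence into windows of chunk_size that share chunk_overlap units."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # this chunk already reaches the end of the sequence
    return chunks


print(chunk_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each chunk advances by `chunk_size - chunk_overlap` units, so a document of 1000 units with a chunk size of 500 and an overlap of 250 yields three chunks.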

**Model-specific Parameters**:

- Hypothetical Document Embeddings (HYDE): Investigated their effect on the model's overall performance.
- 'k' Value: Denotes the number of text chunks input to the model.
- Temperature Parameter: Influences the randomness of the model.

For evaluating our model's performance, we utilized the F-beta score as our primary metric. Unlike the F1 score, which gives equal weight to precision (correctness) and recall (completeness), the F-beta score allows for differential weighting. We designed our score to weigh recall twice as much as precision, reflecting the importance of accurately spotting relevant information, even if it means occasionally flagging some irrelevant content.
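
For reference, with precision $P$ and recall $R$ the score is $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, and we use $\beta = 2$. The short function below is a sketch for checking the arithmetic, not the evaluation code itself; it recomputes the score from true positive, false positive, and false negative counts, using the counts of the best police-report and transcript runs reported in this post.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Best police-report run: TP=105, FP=2, FN=20
print(round(f_beta(105, 2, 20), 6))  # 0.864909
# Best transcript run: TP=34, FP=27, FN=3
print(round(f_beta(34, 27, 3), 6))   # 0.813397
```

The transcript run shows the effect of the weighting: its precision is only about 0.56, but its recall of about 0.92 keeps the score above 0.81.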

Based on our evaluations, our model performed best with:
- A chunk size of 500
- A chunk overlap of 250
- Incorporating HYDE embeddings
- A 'k' value of 20

For police reports, the F-beta score reached 0.864909, while for transcripts, the F-beta score peaked at 0.813397.

Although larger chunk sizes, such as 1000 and 2000, might offer advantages for certain applications, they resulted in lower F-beta scores during our tests. Similarly, greater overlaps like 500 and 1000 reduced our performance, even with the potential for more context. The consistent advantage of incorporating HYDE embeddings was evident, underscoring their value to our model.

Another key observation concerned the temperature parameter, which influences the model's level of randomness. With the temperature set to 1, we generally saw higher F-beta scores, especially for identifying officer names in police reports. As we move to the next phase, extracting detailed context about each officer's role within the document, careful handling of this parameter will be crucial because a high temperature can skew results or generate "hallucinated" content.

In [26]:
def read_summary():
    summary = pd.read_excel("data/overall-summary-with-F1-Fbeta.xlsx")
    summary = summary.sort_values("F_beta", ascending=False)
    return summary

read_summary()

| Unnamed: 0 | chunk_size | chunk_overlap | temperature | k | hyde | filetype | FN | FP | TP | n_files | precision | recall | F1 | F_beta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 500 | 250 | 1 | 20 | 1 | police-report | 20 | 2 | 105 | 5 | 0.981308 | 0.84 | 0.905172 | 0.864909 |
| 2 | 2000 | 1000 | 1 | 5 | 0 | police-report | 12 | 32 | 71 | 5 | 0.68932 | 0.855422 | 0.763441 | 0.816092 |
| 0 | 500 | 250 | 1 | 20 | 1 | transcript | 3 | 27 | 34 | 4 | 0.557377 | 0.918919 | 0.693878 | 0.813397 |
| 1 | 500 | 250 | 0 | 20 | 1 | police-report | 6 | 56 | 60 | 5 | 0.517241 | 0.909091 | 0.659341 | 0.789474 |
| 8 | 2000 | 1000 | 0 | 5 | 0 | police-report | 15 | 13 | 54 | 5 | 0.80597 | 0.782609 | 0.794118 | 0.787172 |
| 3 | 2000 | 1000 | 1 | 5 | 1 | transcript | 3 | 11 | 17 | 3 | 0.607143 | 0.85 | 0.708333 | 0.787037 |
| 6 | 1000 | 500 | 0 | 10 | 1 | transcript | 15 | 31 | 57 | 6 | 0.647727 | 0.791667 | 0.7125 | 0.757979 |
| 10 | 2000 | 1000 | 0 | 5 | 1 | transcript | 22 | 18 | 60 | 7 | 0.769231 | 0.731707 | 0.75 | 0.738916 |
| 7 | 2000 | 1000 | 0 | 5 | 1 | police-report | 13 | 37 | 49 | 5 | 0.569767 | 0.790323 | 0.662162 | 0.733533 |
| 12 | 1000 | 500 | 1 | 10 | 1 | police-report | 34 | 10 | 78 | 5 | 0.886364 | 0.696429 | 0.78 | 0.727612 |
