Guidelines:

•	introduce the context, what are the exonerations documents, why do we want to index officers
•	very briefly explain where we are in the process (e.g. we’ve done page classification, we have these transcripts, etc)

•	introduce the extraction method – do we want to address alternative approaches? 
    Maybe we discuss a regex-based solution just as a way of introducing our method and the strengths/weaknesses
•	describe the main problem: responses to multiple prompts, no automatic way to choose the best response
•	describe our solution with the summarizer

Exoneration documents—records that formally vindicate individuals erroneously convicted of crimes—serve as rich, informative resources in the field of wrongful conviction research. These documents are particularly revealing about the law enforcement personnel involved in such cases. However, these documents are both voluminous, with thousands of pages of text per case, and unstructured, printed and collected in legal file storage boxes.

We seek to make these collections searchable and useful for lawyers, advocates, and community members to better investigate patterns of police misconduct and corruption. In order to do so, we rely on a multi-stage process:

1. Metadata Compilation: We started by compiling a comprehensive CSV index. This structured approach forms the foundation of our file management system, enabling efficient file retrieval. The metadata we organize in this step includes:

    - file path and name
    - file type
    - sha1 content hash: we truncate this to create unique file identifiers
    - file size and number of pages
    - case ID: when we scanned the documents, we organized them into directories by case ID. Here we pluck and validate the directory name to add a structured case ID field to our metadata index.

2. Page classification: The documents in the collection are varied, representing all documents produced or acquired in the course of an exoneration case, with case timelines going back decades. After some internal review and discussions with Innocence Project lawyers, we narrowed our focus to three types of documents:

    - police reports: include mentions of officers involved in the arrest that led to the wrongful conviction, or related arrests.
    - transcripts: court transcripts, recorded by clerks of the court
    - testimonies: witness testimony

    [*Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval*](https://arxiv.org/abs/1502.07058) describes an effective approach for retrieving specific types of documents from disorganized collections: fine-tuning a pretrained convolutional neural network to label thumbnail images of document pages. In order to use this technique, we needed training data and a pretrained model.

3. To quickly assemble a training data set, we started by noticing that in many cases the file name indicated the document type. These documents were scanned by many people at different times, so we could not rely on this heuristic for comprehensive categorization of documents, but there was more than enough there to jumpstart our training process. We collected our initial training data by probing filenames for specific search terms. Once we had a trained classifier, we were able to measure generalization performance on documents that couldn't be classified via filename, and we were also better able to target additional training data, for example by reviewing pages where the classifier had low confidence about its prediction.

4. We then used [FastAI](https://docs.fast.ai/) to fine-tune the `ResNet34` architecture, pretrained on [ImageNet](https://www.image-net.org/), to identify reports, transcripts, and testimonies based on page thumbnails.

5. Information Extraction: Currently, we're engaged in extracting structured information from the documents we've identified, and that work is the focus of the current post. Our goal is to extract structured information related to each police officer or prosecutor mentioned in the documents, such as their names, ranks, and roles ("arresting officer", "handled evidence", etc).

6. Deduplication: The previous step leaves us with many distinct mentions, but some individuals are mentioned many times, within the same case or across cases. Here we rely on HRDAG's [extensive experience with database deduplication](https://hrdag.org/tech-notes/adaptive-blocking-writeup-1.html) to create a unique index of officers and prosecutors involved in wrongful convictions, along with a record of the role or roles they had in the wrongful conviction.

7. Cross-referencing: In the final stage, we'll cross-reference the officer names and roles we've extracted with the Louisiana Law Enforcement Accountability Database ([LLEAD.co](https://llead.co/)). This step will help us identify additional individuals associated with implicated officers (for example those who are co-accused on misconduct complaints, or who are involved in the same use-of-force incidents), and will let us request public records in order to review arrests by these officers.
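To make the first step more concrete, here is a minimal sketch of how one row of the metadata index could be built. It is an illustration rather than our production code: the function name, the column names, and the identifier length of 8 characters are assumptions, and page counting and case ID validation are left out because they depend on the file format and on the case ID scheme.

```python
import csv
import hashlib
from pathlib import Path


def file_record(path, id_length=8):
    """Build one metadata row for a file (id_length is an assumed value)."""
    path = Path(path)
    data = path.read_bytes()
    sha1 = hashlib.sha1(data).hexdigest()
    return {
        "filepath": str(path),
        "filetype": path.suffix.lstrip(".").lower(),
        "filesha1": sha1,
        # A truncated content hash serves as a short file identifier.
        "fileid": sha1[:id_length],
        "filesize": len(data),
        # Files are stored in one directory per case, so the parent
        # directory name doubles as the case ID.
        "case_id": path.parent.name,
    }


def write_index(paths, out_csv):
    """Write one row per file to a CSV index."""
    rows = [file_record(p) for p in paths]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Because the identifier is derived from file content, two scans of the same file get the same ID regardless of where they were saved. A shorter truncation gives more readable IDs at the cost of a higher chance of collisions.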

A primary task in our process is extracting officer information from documents. We initially considered a regex-based solution but soon realized that the complexity and variability of the data rendered regex less than ideal. While regex is efficient for pattern matching, it struggles with language variations and nuances.

For example, let's look at the task of cleaning officer names. Here's an illustration of how we employ regex in our script:

In [None]:
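import re


def clean_name(officer_name):
    # Remove any listed title, with an optional trailing period, together with the whitespace after it.
    return re.sub(
        r"((Detective)|(Officer)|(Deputy)|(Captain)|([CcPpLl])|(Sergeant)|(Lieutenant)|(Techn))\.?\s+",
        "",
        officer_name,
    )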

This function aims to remove titles from officer names. It uses a regular expression to match common titles (like "Detective", "Officer", "Deputy", etc.) and remove them. However, this approach can falter if the titles do not exactly match the regex pattern -- for example due to the use of nonstandard abbreviations, typos, or OCR errors.
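A short experiment shows the kind of failure we mean. The sketch below repeats the pattern, written with single backslashes inside the raw string so that `\.` and `\s` act as regex escapes, and applies it to three made-up inputs: a standard title, a nonstandard abbreviation, and a simulated OCR error in which the letter "O" was read as a zero.

```python
import re

TITLE_PATTERN = re.compile(
    r"((Detective)|(Officer)|(Deputy)|(Captain)|([CcPpLl])|(Sergeant)|(Lieutenant)|(Techn))\.?\s+"
)


def clean_name(officer_name):
    """Strip any title the pattern recognizes from the name."""
    return TITLE_PATTERN.sub("", officer_name)


# Made-up examples, not taken from the case files.
examples = [
    "Detective John Smith",  # standard title
    "Det. John Smith",       # abbreviation the pattern does not list
    "0fficer Jane Doe",      # OCR error: zero instead of capital O
]

for raw in examples:
    print(f"{raw!r} -> {clean_name(raw)!r}")
```

Only the first input is cleaned. The other two pass through unchanged, and each new variant would need its own addition to the pattern.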

To resolve this issue, we turned to [Langchain](https://docs.langchain.com/docs/), a framework for building applications on top of large language models, and OpenAI's GPT family of language models (the extraction code below uses `gpt-3.5-turbo-0613`). Using a language model allows us to handle a wider variety of data and extract more nuanced information.

Specifically, we used the approach outlined in [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496). This method splits our information retrieval task into three steps:

1. We start by feeding a query to an instruction-following generative language model, instructing it to compose a "hypothetical" document in response to the query.
2. We embed the hypothetical document using the same embedding system as we use to encode the document collection.
3. We use [Faiss](https://faiss.ai/) to do a similarity search, comparing the embedding of the hypothetical document against the embeddings of our document chunks to find text content that looks similar to the hypothetical document.
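The three steps can be illustrated with a toy example that runs without any API access. Everything here is a stand-in chosen for illustration: the hypothetical document is hard-coded instead of being written by a language model, `toy_embed` is a bag-of-words counter instead of a learned embedding, and a sorted list replaces the Faiss index. The names and the sample text are made up.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(hypothetical_doc, chunks, k=1):
    """Rank chunks by similarity to the hypothetical document (steps 2 and 3)."""
    query_vec = toy_embed(hypothetical_doc)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


# Step 1 stand-in: the kind of text a language model might write for the query.
hypothetical_doc = "Detective Smith arrested the suspect and collected the evidence at the scene."

chunks = [
    "The court will recess until two o'clock this afternoon.",
    "Detective Jones testified that he collected evidence at the scene of the arrest.",
    "The defendant's mother described his childhood and schooling.",
]

best = hyde_search(hypothetical_doc, chunks, k=1)[0]
print(best)
```

The hypothetical document names the wrong detective, but it still retrieves the chunk about an officer collecting evidence. The search does not need the hypothetical answer to be correct, only to resemble the passages we are looking for.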

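Here is the function we use to build the hypothetical document embedder. For each query, the embedder asks the language model to write a hypothetical answer and then embeds that answer, so the search vector resembles the passages we want to retrieve.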

In [None]:
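# Import paths follow the LangChain 0.0.x package layout (an assumption about the installed version).
from langchain.chains import HypotheticalDocumentEmbedder, LLMChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate


def generate_hyde():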
    llm = OpenAI()
    prompt_template = """\
    You're an AI assistant specializing in criminal justice research. 
    Your main focus is on identifying the names and providing detailed context of mention for each law enforcement personnel.
    ...
    """
    prompt = PromptTemplate(input_variables=["question"], template=prompt_template)
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    base_embeddings = OpenAIEmbeddings()
    embeddings = HypotheticalDocumentEmbedder(llm_chain=llm_chain, base_embeddings=base_embeddings)
    return embeddings

This function initializes an instance of the OpenAI language model, defines a prompt template for the language model, and creates a hypothetical document embedder instance with the language model and the prompt. The hypothetical document embedder is used to generate embeddings for the documents we process.

Our analysis of wrongful conviction cases employs a suite of computational tools: the RecursiveCharacterTextSplitter, FAISS (Facebook AI Similarity Search), a structured PromptTemplate, and the gpt-3.5-turbo-0613 model.

The RecursiveCharacterTextSplitter initiates the text processing stage. It segments extensive textual data from case files into manageable chunks. The configured chunk size of 3000 characters and overlap of 1500 characters ensure preservation of contextual information across divisions.
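To show what a 3000-character chunk with a 1500-character overlap means in practice, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks), and the function name is made up. It only illustrates the overlap arithmetic.

```python
def sliding_chunks(text, chunk_size=3000, overlap=1500):
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "".join(str(i % 10) for i in range(7000))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
```

With half of each chunk repeated in the next one, a mention that falls on a chunk boundary appears in full in at least one chunk. The cost is that roughly twice as much text gets embedded.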

Following the text segmentation, we use FAISS for document similarity search. This library, developed by Meta's AI Research group, indexes the embedding vector of each chunk and efficiently retrieves the chunks most similar to a given query vector, which keeps search fast even over a large corpus.

After the document similarity search, our structured PromptTemplate comes into play, formulating precise and consistent queries. These queries are processed by the gpt-3.5-turbo-0613 model, which generates responses by extracting pertinent information from the retrieved document chunks.

Each tool serves a distinct yet interconnected role in the information extraction process. The RecursiveCharacterTextSplitter prepares our data, FAISS optimizes the search within this data, and the integration of our PromptTemplate with gpt-3.5-turbo-0613 enables effective data interpretation. Collectively, they form a comprehensive pipeline that addresses the challenges associated with large-scale data analysis in wrongful conviction cases.

In [None]:
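import logging

# Import paths follow the LangChain 0.0.x package layout (an assumption about the installed version).
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import Docx2txtLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

logger = logging.getLogger(__name__)


def process_single_document(file_path, embeddings):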
    logger.info(f"Processing document: {file_path}")

    loader = Docx2txtLoader(file_path)
    text = loader.load()
    logger.info(f"Text loaded from document: {file_path}")

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=1500)
    docs = text_splitter.split_documents(text)

    db = FAISS.from_documents(docs, embeddings)

    return db

def get_response_from_query(db, query, k=3):
    logger.info("Performing query...")
    docs = db.similarity_search(query, k=k)
    docs_page_content = " ".join([d.page_content for d in docs])

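    # temperature is a model setting; 0 keeps the output as repeatable as the API allows.
    llm = ChatOpenAI(model_name="gpt-3.5-turbo-0613", temperature=0)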

    prompt = PromptTemplate(
        input_variables=["question", "docs"],
        template="""
        As an AI assistant, my role is to meticulously analyze court transcripts and extract information about law enforcement personnel.
        Note: titles "Technician" and "Tech" might be used interchangeably.
        Note: titles may be abbreviated to the following Sgt., Cpl, Cpt, Det., Ofc., Lt., P.O. and P/O

        Query: {question}

        Transcripts: {docs}

        The response will contain:

        1) The name of a police officer, detective, deputy, lieutenant, 
           sergeant, captain, officer, coroner, investigator, criminalist, or technician.
           If the name does not have one of these titles as a prefix, the person does not work in law enforcement.  
           If you are not at least fifty percent sure that the identified individual works in law enforcement, 
           add "Based on the name, I'm not sure that this individual works in law enforcement".
           Please prefix the name with "Officer Name: ". 
           For example, "Officer Name: John Smith".

        2) If available, provide an in-depth description of the context of their mention. 
           If the context induces ambiguity regarding the individual's employment in law enforcement, 
           add "Based on the context, I'm not sure that this individual works in law enforcement".
           Please prefix this information with "Officer Context: ". 

        Continue this pattern for each officer and each context, until all law enforcement personnel are identified.  

        Guidelines for the AI assistant:
        
        - Derive responses from factual information found within the police reports.
        - If the context of an identified person's mention is not clear in the report, provide their name and note that the context is not specified.
        - If there is insufficient information to answer the query, simply respond with "Insufficient information to answer the query".
    """,
    )

    chain = LLMChain(llm=llm, prompt=prompt)
    # Only the prompt variables are chain inputs; sampling settings such as temperature are configured on the model.
    response = chain.run(question=query, docs=docs_page_content)

    formatted_response = ""
    officers = response.split("Officer Name:")
    for i, officer in enumerate(officers):
        if officer.strip() != "":
            formatted_response += f"Officer Name {i}:{officer.replace('Officer Context:', 'Officer Context ' + str(i) + ':')}\n\n"

    officer_data = extract_officer_data(formatted_response)
    return officer_data, docs


queries = [
    "Identify individuals with the specific titles of police officers, sergeants, lieutenants, captains, detectives, homicide officers, and crime lab personnel in the transcript. Provide the context of their mention, if available.",
    "List individuals specifically titled as police officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel mentioned in the transcript and provide the context of their mention.",
    "Locate individuals specifically referred to as police officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript and explain their context of mention.",
    "Highlight individuals explicitly titled as police officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript and describe their context of mention.",
    "Outline individuals directly identified as police officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript and specify their context of mention.",
    "Pinpoint individuals directly labeled as police officers, sergeants, lieutenants, captains, detectives, homicide units, and crime lab personnel in the transcript and provide their context of mention.",
]
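The `extract_officer_data` helper called above is not shown in this post. As a hypothetical sketch of what such a parser could look like, the function below (the name, the output layout, and the sample response are all made up for illustration) reads the numbered "Officer Name N:" and "Officer Context N:" format produced by the formatting loop and returns one dictionary per officer.

```python
import re

# One numbered name followed by the context carrying the same number.
OFFICER_PATTERN = re.compile(
    r"Officer Name (\d+):\s*(.*?)\s*Officer Context \1:\s*(.*?)\s*(?=Officer Name \d+:|\Z)",
    re.DOTALL,
)


def extract_officer_data_sketch(formatted_response):
    """Parse numbered name/context pairs into a list of dictionaries.

    Entries that lack a matching "Officer Context N:" line are skipped.
    """
    return [
        {"name": name, "context": context}
        for _number, name, context in OFFICER_PATTERN.findall(formatted_response)
    ]


# Made-up response in the numbered format.
sample = (
    "Officer Name 1: John Smith\n"
    "        Officer Context 1: Arrested the suspect.\n\n"
    "Officer Name 2: Det. Jane Doe\n"
    "        Officer Context 2: Collected evidence.\n\n"
)

parsed = extract_officer_data_sketch(sample)
print(parsed)
```

Parsing into structured records at this point lets the later stages work with names and contexts as separate fields instead of free text.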

Despite the strengths of this approach, a major challenge remains: determining the best response from the model. We pose several paraphrased queries against each document (six in the list above), each query produces its own response, and there is no automatic way to tell which response is the most accurate or complete.

Let's consider a situation where we have multiple queries, and for each query, an officer may be identified more than once. This repetition is not a limitation but an inherent characteristic of our approach because it allows us to capture every possible mention of an officer. Hence, we end up with a rich, albeit redundant, dataset, where the same officer could be mentioned multiple times across different queries.

To handle this redundancy and to extract the most valuable information, we use a summarizer model. The summarizer model takes as input the multiple responses from the AI model. It then condenses them and extracts the most crucial information, providing a summary that amalgamates the information from all responses.

This summary doesn't just reduce the length of the text; it also synthesizes the multiple mentions of each officer, helping us understand the different contexts in which they were mentioned. As a result, we get a consolidated, comprehensive view of each officer's involvement in the cases we're examining.

In [None]:
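# Assumed import: Summarizer as provided by the bert-extractive-summarizer package.
from summarizer import Summarizer


def summarize_context(context):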
    model = Summarizer()
    result = model(context, min_length=60)
    summary = "".join(result)
    return summary
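Before summarizing, the mentions returned by the different queries have to be brought together per officer. The sketch below shows one simple way to do that, assuming each extracted mention is a dictionary with `name` and `context` keys. That layout, the function name, and the sample data are hypothetical, and the name matching is a plain case and whitespace normalization, not the deduplication step described earlier.

```python
from collections import defaultdict


def group_contexts(mentions):
    """Collect the distinct contexts for each officer into one text block."""
    grouped = defaultdict(list)
    for mention in mentions:
        # Normalize case and spacing so trivial variants share a key.
        key = " ".join(mention["name"].lower().split())
        context = mention["context"].strip()
        # Drop empty contexts and verbatim repeats across queries.
        if context and context not in grouped[key]:
            grouped[key].append(context)
    return {name: " ".join(contexts) for name, contexts in grouped.items()}


# Made-up mentions, as if returned by three different queries.
mentions = [
    {"name": "John Smith", "context": "Arrested the suspect."},
    {"name": "john  smith", "context": "Testified about the lineup."},
    {"name": "John Smith", "context": "Arrested the suspect."},
]

combined = group_contexts(mentions)
print(combined)
```

Each combined text block could then be passed to `summarize_context`, so that the summary for an officer draws on every query's response and not on a single one.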

Through this method, we can handle the multi-dimensional nature of our data—multiple queries, multiple documents, and multiple mentions of each officer—and distill it into a coherent and concise summary. This summary forms the basis of our index, ensuring that we capture a comprehensive picture of each officer's role in the exoneration cases.