## 1. Setup and Libraries

In [1]:
!pip install transformers scikit-learn



In [2]:
import os
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import T5Tokenizer, T5ForConditionalGeneration

2024-12-07 08:34:10.049851: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## 2. Load Metadata and Preprocess

In [3]:
metadata_file = "metadata_new.json"

def load_metadata(file_path):
    """Load metadata from a JSON file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)

metadata = load_metadata(metadata_file)
print(f"Loaded {len(metadata)} metadata entries.")

Loaded 3777 metadata entries.


In [4]:
# Extract texts for TF-IDF
texts = [item['text'] for item in metadata]
print(f"Sample text: {texts[0][:200]}")

Sample text: DAWSON, District Judge.
Petitioner, by his guardian, ad litem, sets forth that he is unlawfully restrained of his liberty by Lieutenant Commander J. S. Newell, naval officer in charge at this station,


## 3. Generate TF-IDF Embeddings

In [5]:
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')

In [6]:
tfidf_matrix = vectorizer.fit_transform(texts)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

TF-IDF matrix shape: (3777, 9787)
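
To make the weighting behind the matrix above concrete, here is a minimal pure-Python sketch of TF-IDF with smoothed IDF and L2 normalisation, which is the default scheme documented for `TfidfVectorizer`. The helper names (`tfidf_vectors`, `cosine`) and the toy sentences are illustrative assumptions, not part of the notebook, and the whitespace tokenizer is a simplification: it does not reproduce scikit-learn's token pattern or the `stop_words='english'` filtering used above.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) using smoothed IDF and L2 normalisation."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(a, b):
    # Rows are already L2-normalised, so the dot product equals the cosine
    return sum(x * y for x, y in zip(a, b))

# Toy corpus (made up for illustration)
toy_docs = [
    "the court denied the writ",
    "the court granted the writ of habeas corpus",
    "the vessel was seized at sea",
]
vocab, vecs = tfidf_vectors(toy_docs)
sims = [cosine(vecs[0], v) for v in vecs]
print([round(s, 3) for s in sims])
```

The first document is most similar to itself, then to the second document (which shares "court" and "writ"), and only weakly to the third, which shares nothing but "the".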


## 4. Query with TF-IDF

In [7]:
def tfidf_query(query, top_k=5):
    """
    Perform a query using TF-IDF and return top-k results.
    
    Args:
        query (str): The query text.
        top_k (int): Number of top results to return.

    Returns:
        list: Top-k results with metadata and scores.
    """
    # Transform the query to match the TF-IDF matrix
    query_vector = vectorizer.transform([query])

    # Calculate cosine similarity
    scores = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Sort scores in descending order and get top-k indices
    top_indices = scores.argsort()[-top_k:][::-1]
    
    # Gather top-k results
    results = []
    for idx in top_indices:
        results.append({
            "file": metadata[idx]["file"],
            "text_snippet": metadata[idx]["text"][:200],  # First 200 characters
            "score": scores[idx]
        })
    
    return results
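
Because `tfidf_query` always returns `top_k` rows, it can return hits with a score of zero, and several chunks from the same file can appear in the list. One possible post-processing step, sketched below, drops non-matching hits and keeps only the best-scoring hit per file. `filter_results` and `min_score` are hypothetical names introduced here, and the sample rows are toy data shaped like the dictionaries that `tfidf_query` builds.

```python
def filter_results(results, min_score=0.0):
    """Drop hits at or below min_score and keep the best hit per file."""
    seen_files = set()
    filtered = []
    # Sort defensively so the first hit seen for a file is its best one
    for res in sorted(results, key=lambda r: r["score"], reverse=True):
        if res["score"] <= min_score or res["file"] in seen_files:
            continue
        seen_files.add(res["file"])
        filtered.append(res)
    return filtered

# Toy rows (assumption): same keys as the dictionaries from tfidf_query
sample_results = [
    {"file": "0165-01.json", "text_snippet": "...", "score": 0.2378},
    {"file": "0536-01.json", "text_snippet": "...", "score": 0.2283},
    {"file": "0070-01.json", "text_snippet": "...", "score": 0.2245},
    {"file": "0070-01.json", "text_snippet": "...", "score": 0.2191},
    {"file": "9999-01.json", "text_snippet": "...", "score": 0.0},  # hypothetical non-match
]
unique_hits = filter_results(sample_results)
print([hit["file"] for hit in unique_hits])
```

Whether deduplicating by file is desirable depends on the use case: if each metadata entry is a chunk of a longer opinion, a second chunk from the same file may still carry useful context for summarization.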


## 5. Summarization with T5

In [8]:
t5_model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(t5_model_name)
model = T5ForConditionalGeneration.from_pretrained(t5_model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
def generate_summary(query, results):
    """
    Generate a summary for the query based on top results using T5.

    Args:
        query (str): The user query.
        results (list): Top results from the query function.

    Returns:
        str: Generated summary.
    """
    # Combine text from top results
    context = " ".join([res["text_snippet"] for res in results])

    if not context.strip():
        return "No relevant document content found for summarization."

    # Prepare the T5 input using the question/context (QA-style) prompt format
    input_text = f"question: {query} context: {context}"
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs, max_length=150, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
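
The original T5 checkpoints were trained with task prefixes: `summarize: ` for summarization and `question: ... context: ...` for question answering. `generate_summary` uses the second form, so it may be worth comparing both. The sketch below only builds the prompt string; `build_t5_prompt`, `mode`, and `max_chars` are hypothetical names introduced for illustration, and which prefix works better on this corpus has not been measured here.

```python
def build_t5_prompt(query, snippets, mode="summarize", max_chars=2000):
    """Build a T5 input string from retrieved snippets.

    mode="summarize" uses the summarization prefix;
    mode="qa" uses the question/context format.
    Returns None when there is no usable context.
    """
    # Join non-empty snippets and cap the length before tokenization
    context = " ".join(s.strip() for s in snippets if s.strip())[:max_chars]
    if not context:
        return None
    if mode == "summarize":
        return f"summarize: {context}"
    if mode == "qa":
        return f"question: {query} context: {context}"
    raise ValueError(f"Unknown mode: {mode!r}")

# Toy snippets (made up for illustration)
snippets = ["The libel was filed in April.", "  ", "The hearing was set for June."]
print(build_t5_prompt("When was the hearing?", snippets))
print(build_t5_prompt("When was the hearing?", snippets, mode="qa"))
```

The returned string could be passed to `tokenizer.encode(...)` in place of `input_text`; the 512-token truncation there still applies regardless of `max_chars`.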


## 6. Interactive Query System

In [10]:
def interactive_query():
    """
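    Run a menu loop: read a numbered choice (1-5), run a TF-IDF query,
    and print the top results followed by a T5-generated summary.

    Options 1-4 all issue the same free-text TF-IDF query; the choice
    only changes the input prompt shown to the user.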
    """
    while True:
        print("\n--- TF-IDF Query System ---")
        print("Select a query type:")
        print("1. Search by Name")
        print("2. Search by File Name")
        print("3. Search by Decision Date")
        print("4. Search by Custom Legal Query")
        print("5. Exit")
        
        choice = input("Enter choice (1-5): ").strip()
        query = ""

        if choice == "1":
            query = input("Enter case name: ").strip()
        elif choice == "2":
            query = input("Enter file name: ").strip()
        elif choice == "3":
            query = input("Enter decision date (YYYY-MM-DD): ").strip()
        elif choice == "4":
            query = input("Enter your query: ").strip()
        elif choice == "5":
            print("Exiting the query system. Goodbye!")
            break
        else:
            print("Invalid choice. Please try again.")
            continue

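        # Skip empty input: an all-zero query vector scores 0 against every document
        if not query:
            print("Empty query. Please try again.")
            continue

        # Perform the query
        results = tfidf_query(query)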
        
        # Display the results
        print("\nTop Results:")
        for res in results:
            print(f"File: {res['file']}, Score: {res['score']:.4f}")
            print(f"Text Snippet: {res['text_snippet']}\n")

        # Generate a summary for the query
        print("\nGenerating summary for the query...")
        summary = generate_summary(query, results)
        print("\nGenerated Summary:")
        print(summary)


In [11]:
interactive_query()


--- TF-IDF Query System ---
Select a query type:
1. Search by Name
2. Search by File Name
3. Search by Decision Date
4. Search by Custom Legal Query
5. Exit


Enter choice (1-5):  2
Enter file name:  1892-03-08



Top Results:
File: 0165-01.json, Score: 0.2378
Text Snippet: ference to such findings I am able to find in Hill’s Annotated Codes is found in section 396, p. 412, vol. 1, compilation of 1892, which reads as follows:
“The provisions of title 1 of chapter 2, of t

File: 0536-01.json, Score: 0.2283
Text Snippet: el for the plaintiff cites Nemitz v. Conrad (a case from the Supreme Court of Oregon, decided March 29, 1892) 29 Pac. 548, in support of his contention that by filing the bond for discharge in this ca

File: 0070-01.json, Score: 0.2245
Text Snippet: tutes of the United States and acts of Congress and the proclamation of the President thereunder.
The hearing of said' cáse was set for June 20, 1892, and the usual monition was issued and published a

File: 0070-01.json, Score: 0.2191
Text Snippet: TRUITT, District Judge.
The libel of information in this case was filed by Chas. S. Johnson, United States District Attorney, on the 29th day of April, 1892, in said court, against said st

Enter choice (1-5):  exit


Invalid choice. Please try again.

--- TF-IDF Query System ---
Select a query type:
1. Search by Name
2. Search by File Name
3. Search by Decision Date
4. Search by Custom Legal Query
5. Exit


Enter choice (1-5):  5


Exiting the query system. Goodbye!
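
In the session above, typing `exit` is rejected as an invalid choice because only the string `"5"` ends the loop. A small normalization step, sketched below, could map common exit words onto option 5 before the `if`/`elif` chain runs. `normalize_choice`, `EXIT_WORDS`, and `VALID_CHOICES` are hypothetical names, not part of the notebook.

```python
EXIT_WORDS = {"exit", "quit", "q"}          # assumption: words treated as "Exit"
VALID_CHOICES = {"1", "2", "3", "4", "5"}   # the menu options printed by the loop

def normalize_choice(raw):
    """Map raw menu input to "1".."5", or None if it is not recognised."""
    choice = raw.strip().lower()
    if choice in EXIT_WORDS:
        return "5"
    return choice if choice in VALID_CHOICES else None

print(normalize_choice("  exit "))
print(normalize_choice("2"))
print(normalize_choice("seven"))
```

Inside `interactive_query`, the result of `normalize_choice(input(...))` could replace `choice`, with `None` handled by the existing "Invalid choice" branch.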
