Perform all the imports

In [2]:
import sys
import os
from dotenv import load_dotenv
from openai import OpenAI
import anthropic
import json
import cohere
import gradio as gr
from keybert import KeyBERT
from transformers import AutoModel, AutoTokenizer
from langchain.schema.document import Document
from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain.llms import Ollama


Load the dataset

In [3]:
from datasets import load_dataset

ds = load_dataset("coastalcph/lex_glue", "case_hold")

About the dataset:

Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset to evaluate the performance of NLP methods in legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria largely from SuperGLUE.

As in GLUE and SuperGLUE (Wang et al., 2019b,a), one of our goals is to push towards generic (or ‘foundation’) models that can cope with multiple NLP tasks, in our case legal NLP tasks, possibly with limited task-specific fine-tuning. Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legal NLP. Having these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways to make it easier for newcomers and generic models to address all tasks.

The LexGLUE benchmark is accompanied by experimental infrastructure that relies on the Hugging Face Transformers library and resides at: https://github.com/coastalcph/lex-glue.

In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['context', 'endings', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['context', 'endings', 'label'],
        num_rows: 3600
    })
    validation: Dataset({
        features: ['context', 'endings', 'label'],
        num_rows: 3900
    })
})

Merge the train, validation, and test splits so the RAG pipeline can draw on all the data

In [10]:
# Collect every row from all three splits into one flat list.
ds1 = []
for split in ('train', 'validation', 'test'):
    ds1.extend(ds[split])

Load the environment variables

In [5]:
load_dotenv(override=True)
google_api_key = os.getenv('GEMINI_API_KEY')
claude_api_key = os.getenv('ANTHROPIC_API_KEY')
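
`os.getenv` returns `None` when a variable is not set, so a missing key only surfaces later as an authentication error. A minimal sketch of failing early instead; the helper name `require_env` and the `DEMO_KEY` variable are hypothetical and not part of the original notebook:

```python
import os

def require_env(name, env=os.environ):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# Demonstration with a plain dict standing in for the real environment.
fake_env = {"DEMO_KEY": "abc123"}
print(require_env("DEMO_KEY", env=fake_env))
```

In the notebook this could be used as `require_env('ANTHROPIC_API_KEY')` after `load_dotenv(override=True)`.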

In [6]:
geminimodel=OpenAI(api_key=google_api_key,
                  base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
client = anthropic.Anthropic(
    api_key=claude_api_key
)

Analysing the dataset


In [7]:
print(ds['train'][0]['context'])
print(ds['train'][0]['endings'])
print(ds['train'][0]['label'])

Drapeau’s cohorts, the cohort would be a “victim” of making the bomb. Further, firebombs are inherently dangerous. There is no peaceful purpose for making a bomb. Felony offenses that involve explosives qualify as “violent crimes” for purposes of enhancing the sentences of career offenders. See 18 U.S.C. § 924(e)(2)(B)(ii) (defining a “violent felony” as: “any crime punishable by imprisonment for a term exceeding one year ... that ... involves use of explosives”). Courts have found possession of a'bomb to be a crime of violence based on the lack of a nonviolent purpose for a bomb and the fact that, by its very nature, there is a substantial risk that the bomb would be used against the person or property of another. See United States v. Newman, 125 F.3d 863 (10th Cir.1997) (unpublished) (<HOLDING>); United States v. Dodge, 846 F.Supp. 181,
['holding that possession of a pipe bomb is a crime of violence for purposes of 18 usc  3142f1', 'holding that bank robbery by force and violence or 
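
In `case_hold`, each `context` contains a `<HOLDING>` placeholder, `endings` lists the candidate holdings, and `label` is the index of the correct one. A minimal sketch of resolving a row into its completed citation follows. The row is made up for illustration and the helper name `fill_holding` is hypothetical:

```python
def fill_holding(row):
    """Return the context with <HOLDING> replaced by the correct ending."""
    correct = row["endings"][row["label"]]
    return row["context"].replace("<HOLDING>", correct)

# Made-up row for illustration only; not taken from the dataset.
toy_row = {
    "context": "See Smith v. Jones (<HOLDING>); see also Doe v. Roe.",
    "endings": ["holding that A", "holding that B", "holding that C"],
    "label": 1,
}
print(fill_holding(toy_row))
```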

This approach is not recommended: the metadata contains only the label, which is not sufficient on its own.

In [None]:
docs = []
for row in ds1:
    # Combine the context with all candidate endings into one text.
    full_text = row['context'] + " " + " ".join(row['endings'])
    docs.append(Document(page_content=full_text.strip(), metadata={"class": row['label']}))

Trying a KeyBERT model to create tags for the page_content. However, KeyBERT often loses context and registers non-violence as violence because of stop-word removal. This would lead to misclassified tags and poorly vectorized data.


In [10]:

legal_bert_model = SentenceTransformer("nlpaueb/legal-bert-base-uncased")


kw_model = KeyBERT(model=legal_bert_model)

No sentence-transformers model found with name nlpaueb/legal-bert-base-uncased. Creating a new one with mean pooling.


In [None]:
def extract_legal_tags(text, top_n=5, ngram_range=(1, 3)):
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=ngram_range,
        stop_words='english',
        top_n=top_n,
        use_mmr=True,           # optional: promotes diversity
        diversity=0.5           # optional: 0.0 = similar, 1.0 = diverse
    )
    return [kw for kw, _ in keywords]
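
The negation problem noted earlier can be shown with a toy example. This sketch assumes a tiny hand-written stop-word list and a simple regex tokenizer, so it only imitates the effect and does not reproduce KeyBERT's actual vectorizer:

```python
import re

# Tiny hand-written stop-word list for illustration; an assumption, not the
# actual list used when stop_words='english' is passed to KeyBERT.
TOY_STOP_WORDS = {"a", "is", "not", "no", "of", "the"}

def candidate_terms(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

violent = candidate_terms("Possession of a bomb is a crime of violence")
nonviolent = candidate_terms("Possession of a bomb is not a crime of violence")

print(violent)
print(nonviolent)
# Once "not" is removed, the two sentences yield identical candidate terms.
print(violent == nonviolent)
```

Because the negation is discarded before any keyword is scored, both sentences can end up with the same tags even though they state opposite holdings.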

Using an LLM such as Claude or Gemini to create tags for the data

In [1]:
system_prompt3="""You are a legal assistant. Extract 3 to 5 concise legal issue tags from the following case text. Pay close attention to negations and context. Return only a comma-separated list of tags. Give only tags strictly.
"""

In [8]:
def generate_legal_tags(text):
    # Separate the instruction from the case text so the two do not run together.
    user_query = "Extract 3-5 concise legal tags from case text. Only give tags.\n\n" + text
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1000,
        temperature=0.7,
        system=system_prompt3,
        messages=[
            {"role": "user", "content": user_query}
        ],
    )

    # Join the text blocks of the reply into one plain string.
    plain_text_response = ''.join(block.text for block in response.content)

    return plain_text_response

In [9]:
def create_tags(text):
    tags = generate_legal_tags(text)
    tags=tags.split(',')
    tags = [tag.strip() for tag in tags if tag.strip()]  # Clean up whitespace
    return tags
# Output example: ['Eighth Amendment', 'life sentence', 'non-violent drug offense', 'Harmelin v. Michigan']
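
A model does not always return a clean comma-separated line; replies may arrive as a bulleted list or contain repeated tags. A minimal sketch of a more forgiving parser follows. The function name `parse_tags`, the accepted separators, and the `max_tags` cap are assumptions made for illustration:

```python
import re

def parse_tags(raw, max_tags=5):
    """Split a model reply into clean, de-duplicated tags."""
    # Accept commas, semicolons, or newlines as separators.
    parts = re.split(r"[,;\n]", raw)
    tags, seen = [], set()
    for part in parts:
        # Strip whitespace, leading list bullets, and surrounding quotes.
        tag = part.strip().lstrip("-*• ").strip("\"' ")
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            tags.append(tag)
    return tags[:max_tags]

reply = "- Eighth Amendment\n- life sentence, Life Sentence; 'Harmelin v. Michigan'"
print(parse_tags(reply))
```

De-duplication is case-insensitive and keeps the first spelling seen, so `life sentence` and `Life Sentence` collapse into a single tag.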

In [None]:
docs = []
for row in ds1:
    full_text = row['context'] + " " + " ".join(row['endings'])
    tags= create_tags(full_text)
    docs.append(Document(page_content=full_text.strip(), metadata={"class":row['label'],"tags": tags}))
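
The loop above makes one API call per row, so a single transient failure would abort the whole run. A minimal sketch of a retry wrapper with exponential backoff follows. The name `tag_with_retries`, its parameters, and the fallback to an empty tag list are assumptions, and the flaky tagger only simulates failures:

```python
import time

def tag_with_retries(text, tagger, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call tagger(text); on failure wait and retry, returning [] if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return tagger(text)
        except Exception:
            # A broad catch keeps the sketch short; narrowing it to the
            # client's own error types would be safer in practice.
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    return []

# Simulated tagger that fails twice before succeeding.
calls = {"n": 0}

def flaky_tagger(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return ["tag a", "tag b"]

# The sleep function is replaced so the demonstration does not actually wait.
print(tag_with_retries("some case text", flaky_tagger, sleep=lambda _: None))
```

In the notebook, `create_tags` could be passed as `tagger`, for example `tag_with_retries(full_text, create_tags)`.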

Store the new dataset with tags locally to use later

In [None]:
with open('my_list5.txt', 'w') as f:
    for doc in docs:
        json.dump({'metadata': doc.metadata, 'page_content': doc.page_content}, f)
        f.write('\n')
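
The file written above holds one JSON object per line, so it can be read back line by line. A minimal sketch of the round trip, using a temporary file and a made-up record in place of `my_list5.txt` (the helper name `load_records` is hypothetical):

```python
import json
import os
import tempfile

def load_records(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Made-up record in the same shape as the ones written by the notebook.
sample = [{"metadata": {"class": 0, "tags": ["tag a"]}, "page_content": "made-up text"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        for rec in sample:
            json.dump(rec, f)
            f.write("\n")
    loaded = load_records(path)

print(loaded == sample)
```

Each loaded record can then be turned back into a document with `Document(page_content=rec['page_content'], metadata=rec['metadata'])`.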