<h2>☁️ Colab-Based Development</h2>

<p>
  This project is built and executed in <strong>Google Colab</strong> to take advantage of its
  <em>hosted compute</em> and access to <strong>free GPU/TPU runtimes</strong>.
</p>

<p>
  Google Colab provides an ideal environment for running computationally intensive components of this project, including:
</p>

<ul>
  <li>🔠 Embedding large volumes of legal text</li>
  <li>📊 Performing semantic search over vector databases</li>
  <li>🧠 Running inference with the <code>GPT-Neo 1.3B</code> language model</li>
</ul>

<p>
  By using Colab, the project achieves efficient execution without requiring local high-end hardware or paid cloud services.
</p>

In [None]:
pip install langchain



In [None]:
pip install -U langchain-community



## Importing Libraries

In [None]:
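# Integrations now live in langchain-community; the old langchain.* paths are deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.llms import CTransformers  # imported but not used below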

## function to load file

In [None]:
def courpus_loader(path):
  """Load every PDF found directly inside the directory `path`, one Document per page."""
  loader = DirectoryLoader(path, glob='*.pdf', loader_cls=PyPDFLoader)
  book = loader.load()
  return book

In [None]:
pip install pypdf



In [None]:
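# DirectoryLoader expects a directory; forward slashes avoid the "\r" escape in "dataset\raw"
courpus=courpus_loader("dataset/raw")  # folder containing OJ-L-2016-119-FULL-EN-TXT.pdf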

## function to make chunks of the corpus

In [None]:
def chunk_maker(c_size, c_overlap):
  """Split the loaded pages (global `courpus`) into chunks of at most `c_size` characters."""
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=c_size, chunk_overlap=c_overlap)
  final_chunks = text_splitter.split_documents(courpus)
  return final_chunks

In [None]:
chunks=chunk_maker(c_size=400,c_overlap=40)

In [None]:
len(chunks)

1868
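
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its real chunks will differ. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text, size, overlap):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, size=10, overlap=2)
print(pieces)
```

With `size=10` and `overlap=2`, each chunk starts 8 characters after the previous one, so the last two characters of one chunk reappear at the start of the next.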

## function to download embedding model

In [None]:
def download_embedings(model_name):
  embeddings=HuggingFaceEmbeddings(model_name=model_name)
  return embeddings

In [None]:
model_embeddings=download_embedings(model_name="sentence-transformers/all-MiniLM-L6-V2")



In [None]:
model_embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-V2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)
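
The repr above shows a mean-pooling layer (`pooling_mode_mean_tokens: True`) followed by `Normalize()`, producing 384-dimensional vectors. The sketch below illustrates those two steps on toy 2-dimensional data. It is an assumption-laden simplification: the real model also masks padding tokens before averaging, and the helper names `mean_pool` and `l2_normalize` are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]

# Toy 2-dimensional "token embeddings"; the real model uses 384 dimensions.
tokens = [[1.0, 2.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
sentence_vec = l2_normalize(pooled)
print(pooled, sentence_vec)
```

Because the output vectors have unit length, ranking neighbours by Euclidean distance gives the same order as ranking by cosine similarity.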

## making embeddings of chunks

In [None]:
hf_embeddings = model_embeddings.embed_documents([chunk.page_content for chunk in chunks])

In [None]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS


### alternative method to build vector database 

In [None]:
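# Empty flat L2 index sized to the embedding dimension (384 for this model)
index = faiss.IndexFlatL2(len(model_embeddings.embed_query("hello world")))
vector_store = FAISS(
    embedding_function=model_embeddings,  # embedding model created above
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# Embed the chunks and add them to the index
vector_store.add_documents(documents=chunks)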

### saving vector database

In [None]:
vector_store.save_local(folder_path="./Memory",index_name="CWC_index")

In [None]:
pip install faiss-cpu



In [None]:
from langchain_community.vectorstores import FAISS

## making vector database

In [None]:
vectorstore = FAISS.from_documents(chunks, model_embeddings)

### making prompt using user query

In [None]:
query = "What is the significance of the Official Journal of the European Union, specifically issue L 119?"
# Perform similarity search
retrieved_docs = vectorstore.similarity_search(query, k=2)
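
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The brute-force sketch below shows that idea with toy 2-dimensional vectors, assuming Euclidean (L2) distance as the metric. FAISS does the same job with optimized index structures, and the name `top_k_by_l2` is hypothetical.

```python
import math

def top_k_by_l2(query_vec, doc_vecs, k):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Three toy "chunk embeddings" and one toy "query embedding"
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.0, 1.0]
nearest = top_k_by_l2(query_vec, doc_vecs, k=2)
print([i for i, _ in nearest])
```

Here the query coincides with vector 1, and vector 2 is the next closest, so those two indices come back in that order.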

In [None]:
# Take the page number from the same document whose text is used as context
page1 = retrieved_docs[0].metadata.get("page", "unknown")
retrieved_context = retrieved_docs[0].page_content + f"\n\n(Page {page1})"
# Creating the prompt
augmented_prompt=f"""
Given the context below answer the question.
Question: {query}
Context : {retrieved_context}
Remember to answer only based on the context provided and not from any ot
If the question cannot be answered based on the provided context, say I d
s"""

In [None]:
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
    input_variables=["retrieved_context","query"],
    template=augmented_prompt
)
c_type_k={"prompt":prompt}
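
A template is only reusable if its placeholders stay unfilled until a query arrives. The sketch below shows that pattern with plain `str.format`. The template wording is illustrative rather than the notebook's exact prompt, and `TEMPLATE` and `build_prompt` are hypothetical names.

```python
TEMPLATE = (
    "Given the context below, answer the question.\n"
    "Question: {query}\n"
    "Context: {retrieved_context}\n"
)

def build_prompt(query, retrieved_context):
    """Fill the placeholders at call time so one template serves every query."""
    return TEMPLATE.format(query=query, retrieved_context=retrieved_context)

example = build_prompt("What is L 119?", "sample context text")
print(example)
```

A string with the same two placeholders could also be passed as `template=` to `PromptTemplate` together with `input_variables=["retrieved_context", "query"]`.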

In [None]:
pip install transformers torch accelerate


### downloading model and tokenizer from hugging face

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
from langchain.llms import HuggingFacePipeline
# Load the GPT-Neo 1.3B model from Hugging Face
model_name = "EleutherAI/gpt-neo-1.3B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=True)

llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cpu


The next cell sets up a retrieval-based QA system in which:

- The user sends a query.
- The system retrieves the 2 most relevant text chunks from the vector store.
- The language model uses those chunks to generate an informed answer.
- The original source documents are returned along with the answer.
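
The flow described above can be sketched without any library, using stand-in functions for the retriever and the language model. This is an illustration of the retrieve-then-generate pattern, not the internals of `RetrievalQA`; `answer_with_sources`, `fake_retrieve`, and `fake_generate` are hypothetical.

```python
def answer_with_sources(query, retrieve, generate, k=2):
    """Retrieve k chunks, place them in one prompt, and return answer plus sources."""
    docs = retrieve(query, k)
    context = "\n\n".join(docs)
    prompt = f"{context}\n\nQuestion: {query}\nHelpful Answer:"
    return {"query": query, "result": generate(prompt), "source_documents": docs}

# Stand-ins for the vector store and the language model (hypothetical)
corpus = ["alpha text", "beta text", "gamma text"]

def fake_retrieve(query, k):
    return corpus[:k]

def fake_generate(prompt):
    return f"(answer based on {len(prompt)} prompt characters)"

out = answer_with_sources("What is alpha?", fake_retrieve, fake_generate)
print(out["source_documents"])
```

The returned dictionary mirrors the shape used later in the notebook, where the answer is read from the `result` key.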

In [None]:
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={'k':2}), return_source_documents=True)


### displaying user query and model response

In [26]:
result=qa_chain.invoke({"query":query})
print(result['result'])



Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

4.5.2016 L 119/121 Official Jour nal of the European Union EN

4.5.2016 L 119/141 Official Jour nal of the European Union EN

Question: What is the significance of the Official Journal of the European Union, specifically issue L 119?
Helpful Answer:
The L 119 was published in April 2016.
According to the Official Journal of the European Union, issue L 119 is "published within the scope of all the European Union policies as the journal of the European Commission (http://ec.europa.eu/transparency/journals/ec_journal/en/).
The Journal L 119 is also published by the European Commission and its member states."
The Journal L 119, entitled Official Journal of the European Commission, was published on 15.April 2016.

From Wikipedia:

Official Journal of the European Union (OJEU), the official newspaper of the European Union, carries
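
The printed result echoes the whole prompt because the pipeline was created with `return_full_text=True`. One possible way to show only the generated answer is to keep the text after the `Helpful Answer:` marker seen in the output. The sketch below assumes that marker appears once at the end of the prompt; `extract_answer` and the shortened `raw` string are hypothetical.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the first marker, or the whole text if it is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

raw = (
    "Use the following pieces of context to answer the question at the end.\n"
    "Question: What is L 119?\n"
    "Helpful Answer:\n"
    "The L 119 was published in April 2016."
)
print(extract_answer(raw))
```

Setting `return_full_text=False` when building the pipeline would be another option for dropping the echoed prompt.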