## Overview of Assignment 4

This assignment explores advanced concepts and techniques in information retrieval. The primary objectives are to build a Retrieval-Augmented Generation (RAG) pipeline and to learn about language models.

## Enter your details below

## Name

## Banner ID

## GitHub Link of your Assignment 4

## Q1: Setting up the libraries and the environment

In [4]:
import sys
print(sys.executable)

c:\Users\AVuser\AppData\Local\Programs\Python\Python312\python.exe


In [7]:
!{sys.executable} -m pip install -q --upgrade pip

!{sys.executable} -m pip install -q langchain faiss-cpu sentence-transformers transformers datasets tiktoken openai chromadb langchain-community



In [12]:
import langchain
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFaceHub
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate

from transformers import AutoTokenizer, AutoModel, T5EncoderModel
from sentence_transformers import SentenceTransformer
import faiss
from datasets import load_dataset

import os
import numpy as np
import torch

## Q2: Data Preprocessing and Model Selection

In [13]:
ROOT_DIR = "pandas_code"
MODEL_NAME = "Salesforce/codet5-small"
MAX_TOKENS = 512
BATCH_SIZE = 8
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [14]:
# 1. Load all .py code files
def load_code_files(root_dir):
    code_texts = []
    for subdir, _, files in os.walk(root_dir):
        for file in files:
            if file.endswith('.py'):
                filepath = os.path.join(subdir, file)
                with open(filepath, 'r', encoding='utf-8') as f:
                    code_texts.append(f.read())
    return code_texts

code_chunks = load_code_files(ROOT_DIR)
print(f"Loaded {len(code_chunks)} Python files from '{ROOT_DIR}'.")

Loaded 1506 Python files from 'pandas_code'.


In [15]:
# 2. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5EncoderModel.from_pretrained(MODEL_NAME).to(DEVICE)
model.eval()

T5EncoderModel(
  (shared): Embedding(32100, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dropout(p=0.1, i

In [16]:
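# 3. Token chunking
# Split a token list into consecutive chunks of at most max_len tokens.
# Note: max_len does not reserve room for the special tokens the tokenizer adds
# on re-encoding, so a full-length chunk can lose a few trailing tokens to truncation.
def chunk_tokens(tokens, max_len=MAX_TOKENS):
    return [tokens[i:i+max_len] for i in range(0, len(tokens), max_len)]

# Join a chunk of tokens back into a plain string for the embedding step.
def tokens_to_text(tokens):
    return tokenizer.convert_tokens_to_string(tokens)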
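
The fixed-size split above can be adjusted to leave room for the special tokens that are added when a chunk is re-encoded. The following is a minimal sketch rather than the notebook's method: `chunk_with_budget` and its `num_special` parameter are hypothetical names, and the default of 2 is an assumption about how many special tokens the tokenizer adds per sequence.

```python
def chunk_with_budget(tokens, max_len=512, num_special=2):
    """Split tokens into chunks that still fit max_len once special tokens are added.

    num_special is an ASSUMED count of tokens (e.g. start/end markers) that the
    tokenizer adds to every sequence; check it for the tokenizer actually used.
    """
    budget = max_len - num_special
    if budget <= 0:
        raise ValueError("max_len must be larger than num_special")
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]


# Toy example with placeholder tokens instead of real tokenizer output.
toy_tokens = [f"tok{i}" for i in range(1030)]
toy_chunks = chunk_with_budget(toy_tokens)
print([len(c) for c in toy_chunks])
```

With a Hugging Face tokenizer, `tokenizer.num_special_tokens_to_add()` reports the value to pass as `num_special`.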

In [None]:
# 4. Prepare code chunks
chunked_texts = []
for code in code_chunks:
    token_chunks = chunk_tokens(tokenizer.tokenize(code))
    chunked_texts.extend(tokens_to_text(chunk) for chunk in token_chunks)

print(f"Prepared {len(chunked_texts)} text chunks for embedding.")

Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 512). Running this sequence through the model will result in indexing errors


Prepared 13683 text chunks for embedding.



In [None]:
# 5. Generate embeddings
def get_embeddings(texts, tokenizer, model, device=DEVICE, batch_size=BATCH_SIZE):
    """Mean-pool the encoder's last hidden state over non-padding tokens."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # Pad to the longest item in the batch; truncate to the model limit.
        encoded = tokenizer(batch, padding=True, truncation=True,
                            max_length=MAX_TOKENS, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**encoded)
            hidden_states = outputs.last_hidden_state  # [B, T, H]
            # Float mask broadcast to [B, T, H]; padding positions contribute zero.
            mask = encoded['attention_mask'].unsqueeze(-1).expand(hidden_states.size()).float()
            summed = torch.sum(hidden_states * mask, dim=1)
            counts = torch.clamp(mask.sum(1), min=1e-9)  # avoid division by zero
            mean_pooled = summed / counts
            embeddings.append(mean_pooled.cpu().numpy())
    return np.vstack(embeddings)

print("Computing embeddings...")
embeddings = get_embeddings(chunked_texts, tokenizer, model)
print(f"Generated embeddings of shape: {embeddings.shape}")

Computing embeddings...
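
The pooling step inside `get_embeddings` averages each sequence's hidden states while ignoring padding. The sketch below reproduces that arithmetic in NumPy on a made-up toy batch so the effect of the mask is easy to inspect. `masked_mean` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def masked_mean(hidden_states, attention_mask):
    """Average hidden states over positions where attention_mask is 1.

    hidden_states: array of shape [B, T, H]
    attention_mask: array of shape [B, T], 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # [B, T, 1]
    summed = (hidden_states * mask).sum(axis=1)                    # [B, H]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                 # [B, 1]
    return summed / counts


# Toy batch: 2 sequences, 3 positions, hidden size 2 (made-up values).
toy_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean(toy_hidden, toy_mask)
print(pooled)
```

The padded position in the first sequence carries large values on purpose: because its mask entry is 0, it has no influence on the pooled vector.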


In [None]:
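# 6. Save embeddings using FAISS
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 index
# FAISS expects a contiguous float32 array of shape [num_vectors, dimension].
index.add(np.ascontiguousarray(embeddings, dtype=np.float32))

faiss.write_index(index, "code_index.faiss")
print("Saved FAISS index to 'code_index.faiss'")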
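
`IndexFlatL2` performs exact search and ranks stored vectors by squared Euclidean distance to the query. As a rough illustration of what a lookup against such an index computes, the sketch below does the same brute-force search in NumPy on toy vectors. `l2_search` and the toy data are hypothetical and stand in for the real index and embeddings.

```python
import numpy as np


def l2_search(vectors, queries, k):
    """Exact nearest-neighbour search by squared L2 distance.

    Returns (distances, indices), each of shape [num_queries, k],
    sorted from nearest to farthest.
    """
    diff = queries[:, None, :] - vectors[None, :, :]        # [Q, N, D]
    dists = (diff ** 2).sum(axis=-1)                        # [Q, N]
    order = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, order, axis=1), order


# Toy "index" of three 2-d vectors and a single query (made-up values).
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
toy_query = np.array([[0.9, 0.1]], dtype=np.float32)
distances, ids = l2_search(toy_vectors, toy_query, k=2)
print(ids)
```

With the saved index, the equivalent lookup would be `faiss.read_index("code_index.faiss")` followed by `index.search(query_vectors, k)`, where `query_vectors` is a float32 array produced by the same embedding function. The index stores only vectors, so `chunked_texts` has to be kept in the same order to map the returned ids back to code.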

## Q3: Implementing RAG using LangChain for different queries

## Q4: Modify and evaluate the different components of RAG

## Q5: Selecting and implementing a pretrained model for a new task