
In [1]:
# sentencepiece is required by T5Tokenizer, which is loaded in a later cell
%pip install pandas transformers faiss-cpu numpy torch sentence-transformers sentencepiece



In [2]:
import pandas as pd

# Load the dataset
dataset_path = "Training Dataset.csv"  # Update with actual path
df = pd.read_csv(dataset_path)

# Convert each row to a text representation for embedding
def row_to_text(row):
    return (f"Loan ID: {row['Loan_ID']}, Gender: {row['Gender']}, Married: {row['Married']}, "
            f"Dependents: {row['Dependents']}, Education: {row['Education']}, "
            f"Self Employed: {row['Self_Employed']}, Applicant Income: {row['ApplicantIncome']}, "
            f"Coapplicant Income: {row['CoapplicantIncome']}, Loan Amount: {row['LoanAmount']}, "
            f"Loan Term: {row['Loan_Amount_Term']}, Credit History: {row['Credit_History']}, "
            f"Property Area: {row['Property_Area']}, Loan Status: {row['Loan_Status']}")

# Create text representations
df['text'] = df.apply(row_to_text, axis=1)

# Optional: Add dataset metadata as a separate document
metadata_text = (
    "The Loan Approval Prediction dataset contains information about loan applications. "
    "It includes columns such as Loan_ID, Gender, Married, Dependents, Education, "
    "Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, "
    "Credit_History, Property_Area, and Loan_Status (Y/N). The dataset is used to predict "
    "whether a loan application will be approved based on these features."
)
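
If the CSV has empty cells, pandas reads them as NaN, and the f-strings in `row_to_text` would render them as the literal text `nan`. The sketch below shows one way to substitute a placeholder instead. The helper `format_value`, the placeholder word, and the sample row are hypothetical and not part of the original notebook.

```python
import math

def format_value(value, placeholder="unknown"):
    """Return a display string, replacing None/NaN with a placeholder."""
    if value is None:
        return placeholder
    # NaN values read by pandas are floats, so math.isnan applies to them
    if isinstance(value, float) and math.isnan(value):
        return placeholder
    return str(value)

# Hypothetical row with two missing fields (values are illustrative only)
row = {"Loan_ID": "EX0001", "Gender": float("nan"), "LoanAmount": None}
text = ", ".join(f"{key}: {format_value(val)}" for key, val in row.items())
print(text)  # Loan_ID: EX0001, Gender: unknown, LoanAmount: unknown
```

The same helper could be called inside each f-string of `row_to_text`, for example `{format_value(row['Gender'])}`.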

In [3]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load a lightweight embedding model
embedder = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Generate embeddings for the dataset rows and metadata
texts = df['text'].tolist() + [metadata_text]
embeddings = embedder.encode(texts, show_progress_bar=True)

# Create a FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.astype(np.float32))

# Save text references for retrieval
text_references = texts
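
`faiss.IndexFlatL2` performs an exact, brute-force search and reports squared L2 distances. As a rough illustration of what that search computes, here is a NumPy-only sketch. The function `flat_l2_search` and the toy vectors are assumptions made for this example; the function is not a FAISS API.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exact nearest-neighbour search by squared L2 distance (brute force)."""
    # Pairwise squared distances, shape (num_queries, num_vectors)
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, nearest first
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-D "embeddings" standing in for the real sentence embeddings
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)

distances, indices = flat_l2_search(vectors, queries, k=2)
print(indices)    # nearest first: vector 1, then vector 0
print(distances)  # squared distances for those two vectors
```

A flat index compares the query against every stored vector, which is fine for a dataset of this size (20 encoding batches in the run shown here).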


In [4]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the generative model and tokenizer
model_name = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)



In [5]:
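def answer_query(query, top_k=3):
    """Retrieve the top_k most similar documents and generate an answer from them."""
    # Embed the query; FAISS expects a 2-D float32 array of shape (1, dimension)
    query_embedding = embedder.encode([query]).astype(np.float32)

    # Search the FAISS index
    distances, indices = index.search(query_embedding, top_k)

    # Retrieve relevant texts; FAISS pads results with -1 when the index holds
    # fewer than top_k vectors, and -1 would silently select the last text
    retrieved_texts = [text_references[i] for i in indices[0] if i >= 0]
    context = "\n".join(retrieved_texts)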

    # Create prompt for the generative model
    prompt = f"Answer the following question based on the context:\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Generate response
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=150, num_beams=5, early_stopping=True)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer
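
The tokenizer call in `answer_query` truncates at 512 tokens, and Hugging Face tokenizers truncate from the right by default, so a very long context could cut off the question at the end of the prompt. The sketch below shows one possible guard: cap the context before building the prompt. The helper `build_prompt`, its `max_context_words` parameter, and the use of whitespace-separated words as a rough stand-in for tokens are all assumptions for illustration.

```python
def build_prompt(query, passages, max_context_words=300):
    """Assemble a prompt, adding passages only while a rough word budget allows."""
    kept, used = [], 0
    for passage in passages:
        n_words = len(passage.split())
        if used + n_words > max_context_words:
            break  # stop before the context outgrows the budget
        kept.append(passage)
        used += n_words
    context = "\n".join(kept)
    return (
        "Answer the following question based on the context:\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Toy passages; the second one exceeds the 4-word budget and is dropped
prompt = build_prompt(
    "What is this about?",
    ["alpha beta gamma", "delta epsilon"],
    max_context_words=4,
)
print(prompt)
```

With `top_k=3` and one short line per loan record, the original prompt is unlikely to hit the limit; the guard matters mainly if `top_k` is raised.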

In [6]:
# Example queries
queries = [
    "What is the Loan Approval Prediction dataset about?",
    "What are the columns in the dataset?",
    "Can you give me an example of a loan application from the dataset?",
]

for query in queries:
    print(f"Query: {query}")
    print(f"Answer: {answer_query(query)}\n")

Query: What is the Loan Approval Prediction dataset about?
Answer: Loan applications

Query: What are the columns in the dataset?
Answer: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, Loan_Amount_Term, Credit_History, Property_Area, and Loan_Status

Query: Can you give me an example of a loan application from the dataset?
Answer: Loan Approval Prediction

