In [1]:
!pip install wikipedia faiss-cpu sentence-transformers transformers torch numpy langchain_community
import os

# Disable Weights & Biases logging so transformers does not prompt for a login.
os.environ["WANDB_DISABLED"] = "true"

# Wikipedia loader (uses the `wikipedia` package under the hood)
from langchain_community.document_loaders import WikipediaLoader

# Embedding model and seq2seq generator
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

**Load Wikipedia**

In [2]:
loader = WikipediaLoader(
    query="Artificial Intelligence",
    load_max_docs=2
)

documents = loader.load()
print(f"Loaded {len(documents)} Wikipedia documents")


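# Chunking function: fixed-size character windows with overlap

def chunk_text(text, size=400, overlap=50):
    """Split text into windows of `size` characters; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        # With overlap >= size, `start` would never advance and the loop would not terminate.
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])  # the final window may be shorter than `size`
        start = end - overlap           # step forward by size - overlap
    return chunks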

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc.page_content))

print(f"Total chunks: {len(chunks)}")


Loaded 2 Wikipedia documents
Total chunks: 24
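
To make the sliding window easier to see, the sketch below runs the same chunking logic on a short string. The toy input and the `size=10`, `overlap=3` values are illustrative assumptions, not values from the notebook run.

```python
def chunk_text(text, size=400, overlap=50):
    """Same sliding-window logic as the chunking cell."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

toy_text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters, illustrative only
toy_chunks = chunk_text(toy_text, size=10, overlap=3)

for i, c in enumerate(toy_chunks):
    print(i, repr(c))
```

Each chunk starts `size - overlap` characters after the previous one, so neighbouring chunks share `overlap` characters and a sentence cut at a boundary still appears whole in at least one of them when it is short enough.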


**Create Embeddings**

In [3]:
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks, convert_to_numpy=True)




**FAISS Vector Store**

In [4]:
import faiss
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(chunk_embeddings)
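
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. As a rough illustration of what that search computes, here is a NumPy-only sketch. The function name `brute_force_l2_search` and the random toy vectors are assumptions made for this example; they stand in for `chunk_embeddings` and are not part of the notebook.

```python
import numpy as np

def brute_force_l2_search(vectors, query, top_k=3):
    """Exact nearest-neighbour search by squared L2 distance (hypothetical helper)."""
    diffs = vectors - query                    # broadcast the query against every row
    sq_dists = np.sum(diffs * diffs, axis=1)   # squared Euclidean distance per row
    order = np.argsort(sq_dists)[:top_k]       # indices of the top_k smallest distances
    return sq_dists[order], order

# Toy data: 24 random 8-dimensional vectors (illustrative only).
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(24, 8)).astype("float32")

# A query that is a slightly shifted copy of row 5, so row 5 should rank first.
toy_query = toy_vectors[5] + 0.01
dists, ids = brute_force_l2_search(toy_vectors, toy_query)
print(ids, dists)
```

A flat index is a reasonable choice at this scale (a few dozen chunks), since exact search over so few vectors is cheap; approximate indexes mainly pay off for much larger collections.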

**Load Pretrained Seq2Seq Generator**

In [5]:
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



**RAG Inference**

In [6]:
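def rag_answer(question, top_k=3):
    """Retrieve the top_k nearest chunks and generate an answer conditioned on them."""
    # Embed the question with the same model used for the chunks.
    query_vec = embedder.encode([question], convert_to_numpy=True)
    # IndexFlatL2 returns squared L2 distances and row indices, nearest first.
    _, indices = index.search(query_vec, top_k)

    # FAISS pads with -1 when fewer than top_k vectors are indexed; skip those
    # so that chunks[-1] is not returned by accident.
    retrieved_chunks = [chunks[i] for i in indices[0] if i != -1]
    context = "\n".join(retrieved_chunks)

    prompt = f"""
Answer the question using the context below.

Context:
{context}

Question:
{question}
"""

    # Truncation cuts from the end of the prompt, so a very long context
    # could push the question out of the model input.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=150)

    return tokenizer.decode(outputs[0], skip_special_tokens=True), retrieved_chunks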
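
Because `truncation=True` drops tokens from the end of the input by default, a prompt that ends with the question can lose it when the context is long. One possible variation, sketched below, puts the question first and caps the context length. The helper name `build_prompt` and the `max_context_chars` budget are hypothetical, and a character budget is only a rough proxy for the tokenizer's token limit.

```python
def build_prompt(question, retrieved_chunks, max_context_chars=1200):
    """Build a prompt with the question ahead of the context (hypothetical helper)."""
    # Cap the context by characters; this is an assumption, not a token-accurate limit.
    context = "\n".join(retrieved_chunks)[:max_context_chars]
    return (
        "Answer the question using the context below.\n\n"
        f"Question:\n{question}\n\n"
        f"Context:\n{context}\n"
    )

# Illustrative inputs only.
demo_prompt = build_prompt(
    "What is artificial intelligence?",
    ["chunk one " * 100, "chunk two " * 100],
    max_context_chars=500,
)
print(demo_prompt[:200])
```

With this ordering, any truncation removes trailing context rather than the question itself. Whether it changes the generated answer depends on the model and was not tested in this notebook.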

**Ask Question**

In [7]:
question = "What is artificial intelligence?"
answer, evidence = rag_answer(question)

print("\nQuestion:", question)
print("\nAnswer:", answer)

print("\nRetrieved Evidence:")
for i, e in enumerate(evidence, 1):
    print(f"\nSource {i}:")
    print(e[:300])


Question: What is artificial intelligence?

Answer: AI

Retrieved Evidence:

Source 1:
Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and so

Source 2:
Artificial general intelligence (AGI) is a hypothetical type of artificial intelligence that would match or surpass human capabilities across virtually all cognitive tasks.
Beyond AGI, artificial superintelligence (ASI) would outperform the best human abilities across every domain by a wide margin. 

Source 3:
adly to any programs that neither experience consciousness nor have a mind in the same sense as humans.
Related concepts include artificial superintelligence and transformative AI. An artificial superintelligence (ASI) is a hypothetical type of AGI that is much more generally intelligent