<a href="https://colab.research.google.com/github/hema-255/webGPT/blob/main/webGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install streamlit pyngrok sentence-transformers faiss-cpu -q

In [None]:
# Install the scraping libraries used below
!pip install requests beautifulsoup4 -q

In [None]:
!pip install language_tool_python -q

In [None]:
import requests
from bs4 import BeautifulSoup

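# Fetch a page and return the text of its <p> elements as a list of strings
def scrape_website(url, timeout=10):
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return []
    if response.status_code != 200:
        print(f"Failed to fetch {url}: {response.status_code}")
        return []
    soup = BeautifulSoup(response.text, 'html.parser')
    # Join nested text nodes with a space so that words around inline tags
    # (links, bold text) do not run together
    paragraphs = soup.find_all('p')
    return [para.get_text(" ", strip=True) for para in paragraphs]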

# Example usage
urls = [
    "https://byjus.com/biology/nutrition-in-plants/",
    "https://byjus.com/biology/nutrition-modes-living-organisms/",
    "https://byjus.com/biology/nutrition-animals/",
    "https://byjus.com/biology/photosynthesis/"
]

# Scrape content from all URLs
website_data = {url: scrape_website(url) for url in urls}

In [None]:
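# Split a list of paragraphs into chunks of roughly max_length characters
def chunk_text(content, max_length=500):
    chunks = []
    current_chunk = []
    current_length = 0

    for paragraph in content:
        # Start a new chunk once adding this paragraph would exceed max_length.
        # The current_chunk check avoids emitting an empty chunk when a single
        # paragraph is longer than max_length; such a paragraph is kept whole.
        if current_chunk and current_length + len(paragraph) > max_length:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(paragraph)
        current_length += len(paragraph)

    # Flush whatever is left after the last paragraph
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks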

# Chunk the scraped content
chunked_data = {url: chunk_text(content) for url, content in website_data.items()}
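
To make the greedy packing above concrete, here is a minimal standalone sketch on toy data. The function name `chunk_paragraphs` and the sample strings are hypothetical and chosen only for illustration; the logic is assumed to mirror the chunking step, including the guard against emitting an empty chunk.

```python
def chunk_paragraphs(paragraphs, max_length=500):
    """Greedily pack paragraphs into chunks of roughly max_length characters."""
    chunks, current, length = [], [], 0
    for p in paragraphs:
        # Close the running chunk when the next paragraph would overflow it
        if current and length + len(p) > max_length:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(p)
        length += len(p)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy input: three short paragraphs and one longer than the limit
sample = ["a" * 30, "b" * 30, "c" * 30, "d" * 80]
chunks = chunk_paragraphs(sample, max_length=70)
print([len(c) for c in chunks])
```

Note that a paragraph longer than `max_length` is never split; it simply becomes its own oversized chunk. The length counter also ignores the joining spaces, so the limit is approximate.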

In [None]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for the chunks
embeddings = {}
for url, chunks in chunked_data.items():
    embeddings[url] = embedding_model.encode(chunks, convert_to_tensor=True)

print("Embeddings generated successfully!")

Embeddings generated successfully!


In [None]:
import faiss
import numpy as np

# Initialize FAISS index
dimension = embedding_model.get_sentence_embedding_dimension()  # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dimension)  # Exact search over squared L2 (Euclidean) distance

# Add embeddings to the FAISS index
chunk_metadata = []  # (url, chunk_id) for each vector, in insertion order
for url, embed_vectors in embeddings.items():
    if len(embed_vectors) == 0:
        continue  # Skip URLs that produced no chunks (e.g. a failed fetch)
    # Move embeddings to CPU and convert to a float32 NumPy array, as FAISS expects
    embed_vectors_np = embed_vectors.cpu().numpy().astype(np.float32)
    index.add(embed_vectors_np)
    chunk_metadata.extend([(url, i) for i in range(len(embed_vectors_np))])

print(f"FAISS index contains {index.ntotal} vectors.")

FAISS index contains 56 vectors.
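
As a rough mental model of what a flat L2 index does at query time, the sketch below runs the same kind of exhaustive nearest-neighbour search in plain Python. It is not FAISS code: `l2_search` and the toy vectors are hypothetical names and data used only for illustration. Like `IndexFlatL2`, it ranks by squared Euclidean distance.

```python
def l2_search(vectors, query, top_k=3):
    """Return (squared_distance, position) pairs for the top_k closest vectors."""
    scored = []
    for pos, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, pos))
    scored.sort()  # Smallest distance first
    return scored[:top_k]

# Toy 2-D "embeddings" and a query close to the second vector
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = l2_search(vectors, [0.9, 0.1], top_k=2)
print([pos for _, pos in hits])
```

The returned positions play the same role as the `indices` array from `index.search`: they refer to insertion order, which is why the notebook keeps a parallel `chunk_metadata` list.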


In [None]:
# Function to handle user queries
def query_rag_system(query, top_k=3):
    # Convert query into embedding
    query_vector = embedding_model.encode([query], convert_to_tensor=True).cpu().numpy()

    # Perform similarity search
    distances, indices = index.search(query_vector, top_k)

    # Retrieve relevant chunks
    results = []
    for idx in indices[0]:
        if idx < 0:
            continue  # FAISS pads results with -1 when fewer than top_k vectors exist
        url, chunk_id = chunk_metadata[idx]
        results.append((url, chunked_data[url][chunk_id]))
    return results
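
The lookup from a flat index position back to a source URL and chunk can be sketched without FAISS. In the example below, `resolve_hits`, the `page-a`/`page-b` keys, and the chunk strings are all hypothetical stand-ins; the list of positions imitates one row of the `indices` array, including a `-1` padding entry.

```python
# Toy stand-ins for chunked_data and chunk_metadata
chunked = {"page-a": ["a0", "a1"], "page-b": ["b0"]}
metadata = [(url, i) for url, chunks in chunked.items() for i in range(len(chunks))]

def resolve_hits(indices, metadata, chunked):
    """Map flat index positions back to (url, chunk_text) pairs."""
    results = []
    for idx in indices:
        if idx < 0:
            continue  # Padding entry: no vector was found for this slot
        url, chunk_id = metadata[idx]
        results.append((url, chunked[url][chunk_id]))
    return results

hits = resolve_hits([2, 0, -1], metadata, chunked)
print(hits)
```

This only works because vectors are added to the index in the same order that entries are appended to the metadata list, so position `n` in one always corresponds to position `n` in the other.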

In [None]:
# Example query
user_query = "What is chlorophyll?"
retrieved_chunks = query_rag_system(user_query)
for url, chunk in retrieved_chunks:
    print(f"From {url}:\n{chunk}\n")

From https://byjus.com/biology/photosynthesis/:
Chlorophyll is a green pigment found in the chloroplasts of the plant cell and in the mesosomes of cyanobacteria. This green colour pigment plays a vital role in the process of photosynthesis by permitting plants to absorb energy from sunlight. Chlorophyll is a mixture of chlorophyll-a and chlorophyll-b. Besides green plants, other organisms that perform photosynthesis contain various other forms of chlorophyll such as chlorophyll-c1, chlorophyll-c2, chlorophyll-d and chlorophyll-f.

From https://byjus.com/biology/nutrition-in-plants/:
Chlorophyll is a green pigment present in leaves which helps the leaves capture energy from sunlight to prepare their food. This production of food which takes place in the presence of sunlight is known as photosynthesis. Hence, the sun serves as the primary source for all living organisms During photosynthesis, water and carbon dioxide are used in the presence of sunlight to produce carbohydrates and oxygen. 

**UI Starts here**

In [None]:
import pickle

# Save FAISS index, metadata, and chunked data
with open("faiss_index.pkl", "wb") as f:
    pickle.dump((index, chunk_metadata, chunked_data), f)

print("FAISS index, metadata, chunked data saved successfully.")

FAISS index, metadata, chunked data saved successfully.


In [None]:
%%writefile app.py

import streamlit as st
import pickle
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the FAISS index, metadata, and chunked data
with open("faiss_index.pkl", "rb") as f:
    index, metadata, chunked_data = pickle.load(f)

# Load the embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Function to retrieve relevant chunks
def retrieve_chunks(query, top_k=3):
    query_vector = model.encode([query])
    distances, indices = index.search(np.array(query_vector), top_k)
    # Skip -1 entries, which FAISS returns when fewer than top_k vectors exist
    results = [(metadata[i][0], chunked_data[metadata[i][0]][metadata[i][1]]) for i in indices[0] if i >= 0]
    return results

# Streamlit UI
st.title("WebGPT")
st.markdown("Ask me anything about the ingested content!")

# Input from user
user_query = st.text_input("Enter your question:", "")

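import html  # Standard library; used to escape scraped text before rendering

def beautify_answer(response):
    # Escape the text so any markup in the scraped content is shown literally
    # rather than rendered, since the result is passed to
    # st.markdown(..., unsafe_allow_html=True)
    safe_text = html.escape(str(response))
    return f'<div style="text-align: justify; font-size: 16px; line-height: 1.6;">{safe_text}</div>'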


if user_query:
    st.markdown("### Retrieved Context:")
    retrieved_chunks = retrieve_chunks(user_query)
    for i, (url, chunk) in enumerate(retrieved_chunks):
        st.write(f"**Source {i+1}:** {url}")
        st.markdown(beautify_answer(chunk), unsafe_allow_html=True)
        st.write("\n")

    # Generate response (basic concatenation for now)
    response = " ".join([str(chunk) for url, chunk in retrieved_chunks])  # Ensure only chunks are concatenated
    # Fallback if no valid chunks
    if not response.strip():
        response = "Sorry, I couldn't find relevant information to answer your query."
    st.markdown("### Answer:")
    st.markdown(beautify_answer(response), unsafe_allow_html=True)

Overwriting app.py


In [None]:
!ngrok authtoken YOUR_NGROK_AUTHTOKEN  # Replace with your own token; never commit a real one

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
from pyngrok import ngrok

public_url = ngrok.connect(8501, "http")  # Specify port and protocol
print(f"Streamlit app is running at {public_url}")

Streamlit app is running at NgrokTunnel: "https://ef64-34-34-74-9.ngrok-free.app" -> "http://localhost:8501"


In [None]:
import os

# Launch the Streamlit app in the background on its default port (8501)
os.system("streamlit run app.py &")

0

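In [None]:
# Find the process listening on port 8501
!lsof -i:8501

In [None]:
# Stop the app; replace PID with the process ID reported by lsof
!kill -9 PID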