<a href="https://colab.research.google.com/github/mahessh/MLAI-community-labs/blob/main/PersonalizedRecSys/AI_Incubator_for_Local_Machine_install.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Install Ollama (Windows, macOS, or Linux)
2. Open a terminal: C:\....> ollama
3. ollama serve (starts the Ollama server)

With Ollama running
----
4. ollama pull nomic-embed-text (embedding model for the RAG vector DB)
5. ollama pull llama3.2:3b (LLM)
----
6. pip install chromadb (vector database)
7. pip install streamlit (frontend)
8. pip install sentence-transformers (embedding models)
9. pip install pymupdf (for parsing PDFs)
10. pip install langchain-community (workflow utilities)
11. pip install pytube (for fetching YouTube captions)
12. pip install pypdf (used by PyPDFLoader)
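
Before moving on, it can help to confirm that the Ollama server is up and that both models from steps 4 and 5 were pulled. The sketch below queries Ollama's local REST endpoint for the list of pulled models. It assumes the server listens on the default address http://localhost:11434; the helper names (installed_model_names, missing_models, fetch_tags) are hypothetical and not part of the original notebook.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"
REQUIRED_MODELS = ["nomic-embed-text", "llama3.2:3b"]


def installed_model_names(tags_payload):
    """Return the model names listed in an /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def missing_models(required, installed):
    """Return required models that are not installed (no tag means ':latest')."""
    missing = []
    for name in required:
        wanted = name if ":" in name else name + ":latest"
        if wanted not in installed:
            missing.append(name)
    return missing


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=5):
    """Fetch the list of locally pulled models from the Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        names = installed_model_names(fetch_tags())
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        print(f"Could not read the model list from {OLLAMA_TAGS_URL}: {exc}")
    else:
        gaps = missing_models(REQUIRED_MODELS, names)
        print("All models present." if not gaps else f"Run 'ollama pull' for: {gaps}")
```

If the script reports a connection error, the server from step 3 is probably not running.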

----initial setup -----

RAG program code: save the code below as a script, then run it from a terminal:

streamlit run yourscript.py

In [2]:
pip install chromadb streamlit sentence-transformers pypdf pytube pymupdf langchain-community



This starter code splits the text from a PDF or video into chunks, stores them in the vector DB, and lets the user query the stored chunks. The code was tested on a local Windows 11 x64 PC with 32 GB RAM.
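
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap, using the same numbers as the code below (500 characters per chunk, 50 characters of overlap). It is a simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line boundaries, which this sketch does not. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 1200 characters of filler text, just to show the window sizes.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.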

In [3]:
import os
import streamlit as st
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from pytube import YouTube
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load Embedding Model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file."""
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    return "\n".join([page.page_content for page in pages])

def extract_text_from_video(video_url):
    """Extract video transcript (Requires YouTube subtitles)."""
    yt = YouTube(video_url)
    caption = yt.captions.get_by_language_code('en')
    if caption:
        return caption.generate_srt_captions()
    return "No transcript available."

def process_and_store_text(texts, collection_name="rag_store"):
    """Splits text and stores embeddings in ChromaDB."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    documents = text_splitter.create_documents(texts)

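    # Pass collection_name through so the function argument is actually used
    chroma_db = Chroma.from_documents(
        documents,
        embedding_model,
        collection_name=collection_name,
        persist_directory="./chroma_db",
    )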
    chroma_db.persist()
    return chroma_db

def query_rag(query, chroma_db):
    """Retrieve and display relevant documents."""
    results = chroma_db.similarity_search(query, k=3)
    return results

# Streamlit UI
st.title("RAG: PDF & Video Query System")

uploaded_files = st.file_uploader("Upload PDFs", type=["pdf"], accept_multiple_files=True)
video_url = st.text_input("Enter YouTube Video URL")

documents = []
if uploaded_files:
    for uploaded_file in uploaded_files:
        with open(os.path.join("./", uploaded_file.name), "wb") as f:
            f.write(uploaded_file.getbuffer())
        documents.append(extract_text_from_pdf(uploaded_file.name))

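if video_url:
    transcript = extract_text_from_video(video_url)
    # Do not index the placeholder message as if it were document text
    if transcript == "No transcript available.":
        st.warning(transcript)
    else:
        documents.append(transcript)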

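if documents:
    # Streamlit reruns the script on every interaction, so index only when the inputs change
    docs_key = tuple(hash(d) for d in documents)
    if st.session_state.get("docs_key") != docs_key:
        st.session_state["db"] = process_and_store_text(documents)
        st.session_state["docs_key"] = docs_key
    db = st.session_state["db"]
    st.success("Data processed and stored!")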

    query = st.text_input("Enter query")
    if st.button("Search"):
        results = query_rag(query, db)
        for idx, res in enumerate(results):
            st.write(f"### Result {idx+1}")
            st.write(res.page_content)
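
Conceptually, similarity_search(query, k=3) embeds the query and returns the k stored chunks whose vectors are closest to it. The sketch below shows that ranking idea with cosine similarity on toy vectors. Both the metric and the brute-force scan are assumptions made for illustration: Chroma uses its own index and configured distance function, and real embeddings from all-MiniLM-L6-v2 have far more than three dimensions. The names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
toy_chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ranking = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
print([index for index, _ in ranking])  # [0, 2]
```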




In the following section, the code combines an LLM with the vector database. This starter code splits the text from a PDF or video into chunks, stores them in the vector DB, and uses a Llama 3 model served by Ollama to answer queries from the retrieved chunks. The code was tested on a local Windows 11 x64 PC with 32 GB RAM.
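
The step that connects retrieval to the LLM is prompt assembly: the retrieved chunks are joined into a context block and placed ahead of the question. A minimal sketch of that step follows. The numbering of chunks, the character budget (max_context_chars), and the "use only the context" instruction are assumptions added for illustration; the notebook's own prompt is simpler. The function name build_prompt is hypothetical.

```python
def build_prompt(query, chunks, max_context_chars=1500):
    """Join retrieved chunks into a numbered context block and append the question."""
    parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # stop before the context budget is exceeded
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


prompt = build_prompt("What is RAG?", ["RAG retrieves text chunks. ", "An LLM then answers."])
print(prompt)
```

Capping the context keeps the prompt within what a small local model can handle; the right budget depends on the model and is not something the original code specifies.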

In [4]:
!pip install ollama



In [5]:
import os
import streamlit as st
import chromadb
import ollama
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from pytube import YouTube
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load Embedding Model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Ensure ChromaDB persistence directory exists
CHROMA_DB_DIR = "./chroma_db"
os.makedirs(CHROMA_DB_DIR, exist_ok=True)

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file."""
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    return "\n".join([page.page_content for page in pages])

def extract_text_from_video(video_url):
    """Extract video transcript (Requires YouTube subtitles)."""
    try:
        yt = YouTube(video_url)
        caption = yt.captions.get_by_language_code('en')
        if caption:
            return caption.generate_srt_captions()
        return "No transcript available."
    except Exception as e:
        return f"Error fetching transcript: {str(e)}"

def process_and_store_text(texts, collection_name="rag_store"):
    """Splits text and stores embeddings in ChromaDB."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    documents = text_splitter.create_documents(texts)

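    # Pass collection_name through so the function argument is actually used
    chroma_db = Chroma.from_documents(
        documents,
        embedding_model,
        collection_name=collection_name,
        persist_directory=CHROMA_DB_DIR,
    )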
    chroma_db.persist()
    return chroma_db

def query_rag(query, chroma_db):
    """Retrieve and display relevant documents."""
    results = chroma_db.similarity_search(query, k=3)
    return results

def generate_response(query, context):
    """Generate a response using Llama3 via Ollama."""
    prompt = f"""
    Context:
    {context}

    Question: {query}

    Answer:
    """
    # Use the model tag pulled during setup (ollama pull llama3.2:3b)
    response = ollama.generate(model='llama3.2:3b', prompt=prompt)
    return response['response']

# Streamlit UI
st.title("RAG: PDF & Video Query System with Llama3")

uploaded_files = st.file_uploader("Upload PDFs", type=["pdf"], accept_multiple_files=True)
video_url = st.text_input("Enter YouTube Video URL")

documents = []
if uploaded_files:
    for uploaded_file in uploaded_files:
        file_path = os.path.join("./", uploaded_file.name)
        with open(file_path, "wb") as f:
            f.write(uploaded_file.getbuffer())
        documents.append(extract_text_from_pdf(file_path))

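if video_url:
    transcript = extract_text_from_video(video_url)
    # Do not index the placeholder or error message as if it were document text
    if transcript.startswith(("No transcript available", "Error fetching transcript")):
        st.warning(transcript)
    else:
        documents.append(transcript)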

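if documents:
    # Streamlit reruns the script on every interaction, so index only when the inputs change
    docs_key = tuple(hash(d) for d in documents)
    if st.session_state.get("docs_key") != docs_key:
        st.session_state["db"] = process_and_store_text(documents)
        st.session_state["docs_key"] = docs_key
    db = st.session_state["db"]
    st.success("Data processed and stored!")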

    query = st.text_input("Enter query")
    if st.button("Search"):
        results = query_rag(query, db)
        context = "\n".join([res.page_content for res in results])
        response = generate_response(query, context)

        st.write("### Response from Llama3:")
        st.write(response)
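
One detail of the video path: the transcript comes back in SRT format, so sequence numbers and timestamps are chunked and embedded along with the spoken text. A possible cleanup step, sketched below, strips those lines before the text is passed to process_and_store_text. The function name srt_to_text and the sample captions are hypothetical, and the sketch assumes standard SRT blocks separated by blank lines.

```python
import re

# Matches an SRT timing line such as "00:00:02,500 --> 00:00:05,000".
TIMESTAMP_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt):
    """Drop SRT sequence numbers and timestamp lines, keeping only caption text."""
    kept = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Remove the leading sequence number, if present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        # Remove the timestamp line, if present.
        if lines and TIMESTAMP_LINE.match(lines[0]):
            lines = lines[1:]
        kept.extend(lines)
    return " ".join(kept)


# Made-up captions for illustration.
sample_srt = """1
00:00:00,000 --> 00:00:02,500
Welcome to the lab.

2
00:00:02,500 --> 00:00:05,000
Today we build a RAG app.
"""
print(srt_to_text(sample_srt))  # Welcome to the lab. Today we build a RAG app.
```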


