# GenAI with Open-Source Models: Complete Tutorial
## Building RAG Systems with Local Models & PDF Processing

**📚 What you'll learn:**
- How to use **completely open-source** models (no API keys needed!)
- Process **real PDF documents** with practical use cases
- Build **in-memory vector stores** for fast prototyping
- Create **educational explanations** for each step

**Use Case:** Build a **Research Paper Assistant** that helps students understand academic papers

## Step 1: Environment Setup (No API Keys!)

**What we're doing:** Installing only open-source packages that run locally. After the one-time package and model downloads, no API keys or hosted services are needed.

In [None]:
# Install open-source packages only
!pip install -q sentence-transformers chromadb pypdf langchain langchain-community faiss-cpu transformers torch numpy pandas

# Verify installations
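import subprocess
import sys

# Call pip through the kernel's own interpreter so we list the environment we just installed into
result = subprocess.run([sys.executable, '-m', 'pip', 'list'], capture_output=True, text=True)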
print("Installed packages:")
for line in result.stdout.split('\n'):
    if any(pkg in line for pkg in ['sentence-transformers', 'chromadb', 'langchain', 'faiss']):
        print(f"  {line.strip()}")

In [None]:
# Import all required libraries
import os
import json
import time
import numpy as np
from typing import List, Dict, Any
from datetime import datetime

# PDF processing
from pypdf import PdfReader

# Vector stores and embeddings (all open-source)
import chromadb
from sentence_transformers import SentenceTransformer
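# Newer LangChain releases ship the splitter in `langchain_text_splitters`;
# fall back to the legacy import path on older installs
try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.text_splitter import RecursiveCharacterTextSplitter

# Document handling (same idea: prefer the current module, fall back to the legacy one)
try:
    from langchain_core.documents import Document
except ImportError:
    from langchain.schema import Document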

# Progress tracking
from tqdm.notebook import tqdm

print("All libraries imported successfully!")

## Step 2: Understanding Our Use Case

**Scenario:** You're a student who needs to understand research papers quickly. We'll build a system that:
1. Reads PDF research papers
2. Creates a searchable knowledge base
3. Answers questions about the paper content
4. Explains complex concepts in simple terms

In [None]:
# Create sample research paper content (simulating a real PDF)
sample_research_content = """
Title: Deep Learning for Natural Language Processing: A Comprehensive Survey

Abstract: This paper presents a comprehensive survey of deep learning techniques applied to natural language processing tasks. We examine the evolution from traditional statistical methods to modern neural architectures, focusing on transformer-based models and their applications.

1. Introduction
Natural Language Processing (NLP) has undergone a revolutionary transformation with the advent of deep learning. Traditional rule-based systems have given way to neural networks that learn patterns from vast amounts of text data.

2. Background: Traditional NLP Methods
Before deep learning, NLP relied heavily on:
- Bag-of-words models
- N-gram language models
- Hidden Markov Models (HMMs)
- Conditional Random Fields (CRFs)

3. Neural Network Foundations
Deep learning in NLP builds upon several key neural architectures:
- Recurrent Neural Networks (RNNs) for sequential data
- Long Short-Term Memory (LSTM) networks for long dependencies
- Convolutional Neural Networks (CNNs) for local patterns

4. Transformer Architecture
The transformer model, introduced in "Attention is All You Need," revolutionized NLP through:
- Self-attention mechanisms
- Parallel processing capabilities
- Scalability to large datasets

5. Large Language Models
Modern LLMs like BERT, GPT, and T5 demonstrate:
- Few-shot learning capabilities
- Transfer learning effectiveness
- Emergent behaviors at scale

6. Applications and Future Directions
Current applications include machine translation, question answering, and text generation. Future research focuses on efficiency, interpretability, and reducing computational requirements.
"""

# Save as a plain-text file that stands in for a real PDF in this tutorial
with open('sample_research_paper.txt', 'w', encoding='utf-8') as f:
    f.write(sample_research_content)

print("Sample research paper created!")
print("File: sample_research_paper.txt")
print(f"Size: {len(sample_research_content)} characters")

## Step 3: PDF Processing Explained

**What happens here:**
- We read the PDF file
- Split it into manageable chunks (like paragraphs)
- Add metadata so we know where each piece came from
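
The three steps above can be sketched as follows. This is a minimal sketch rather than the tutorial's original cell: `load_text` and `chunk_text` are hypothetical helper names, the chunk size and overlap are illustrative values, and the fixed character window is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter`. Because the sample paper was saved as plain text, `load_text` reads `.txt` files directly and only reaches for `pypdf` when given a real PDF.

```python
from pathlib import Path
from typing import Dict, List


def load_text(path: str) -> str:
    """Return the text of a .txt file, or of every page of a PDF."""
    if path.lower().endswith(".pdf"):
        from pypdf import PdfReader  # imported lazily; only needed for PDFs
        reader = PdfReader(path)
        # Pages without extractable text (e.g. scanned images) contribute ""
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return Path(path).read_text(encoding="utf-8")


def chunk_text(text: str, source: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
    """Split text into overlapping character windows and attach metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks: List[Dict] = []
    step = chunk_size - overlap  # each window starts `step` characters after the last
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append({
                "text": piece,
                "metadata": {"source": source, "chunk_id": len(chunks), "start_char": start},
            })
    return chunks


# Small demonstration on synthetic text (120 characters, windows of 50, overlap of 10)
demo_chunks = chunk_text("abcdefghij" * 12, source="demo.txt", chunk_size=50, overlap=10)
print(len(demo_chunks), demo_chunks[1]["metadata"])
```

For the sample paper, the same helpers would be called as `chunk_text(load_text('sample_research_paper.txt'), source='sample_research_paper.txt')`. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.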

## Step 4: Open-Source Embeddings Explained

**What are embeddings?**
- Think of them as "smart fingerprints" for text
- Similar texts have similar fingerprints
- We use the **all-MiniLM-L6-v2** model (free, and runs offline once it has been downloaded)
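
To make the "fingerprint" idea concrete, here is a small sketch of how similarity between embeddings is usually measured, using cosine similarity. The three toy vectors are made-up numbers, not real model output, and `cosine_similarity` and `embed_texts` are hypothetical helper names. `embed_texts` shows one way to get real embeddings from `sentence-transformers`; it is defined but not called here because it downloads the model on first use.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as "unrelated"
    return dot / (norm_a * norm_b)


def embed_texts(texts: List[str], model_name: str = "all-MiniLM-L6-v2") -> List[List[float]]:
    """Embed texts with sentence-transformers (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy: heavy import
    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()  # one vector (list of floats) per input text


# Toy 3-dimensional "fingerprints" (illustrative values only)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

With the real model, `embed_texts(["What is self-attention?"])` would return one vector per input string, and the same `cosine_similarity` function could compare a question against each chunk.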

## Step 5: In-Memory Vector Store (ChromaDB)

**What is ChromaDB?**
- A lightweight, open-source vector database that can run entirely in memory
- Perfect for prototyping and learning
- No setup required - works immediately!
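
A minimal sketch of how chunks might be stored and searched in ChromaDB follows. It assumes the embeddings were already computed elsewhere (for example with `sentence-transformers`). The helper names `build_records`, `index_chunks`, and `search`, and the collection name `research_papers`, are hypothetical. The ChromaDB functions are defined but not called here, so the cell runs even before any embeddings exist.

```python
from typing import Dict, List, Tuple


def build_records(chunks: List[str], source: str) -> Tuple[List[str], List[Dict]]:
    """Create one unique id and one metadata dict per chunk."""
    ids = [f"{source}-chunk-{i}" for i in range(len(chunks))]
    metadatas = [{"source": source, "chunk_id": i} for i in range(len(chunks))]
    return ids, metadatas


def index_chunks(chunks: List[str], embeddings: List[List[float]], source: str,
                 collection_name: str = "research_papers"):
    """Store chunks and their precomputed embeddings in an in-memory collection."""
    import chromadb  # imported lazily so the helpers above work without it
    client = chromadb.Client()  # in-memory client: nothing is written to disk
    collection = client.get_or_create_collection(name=collection_name)
    ids, metadatas = build_records(chunks, source)
    collection.add(ids=ids, documents=chunks, embeddings=embeddings, metadatas=metadatas)
    return collection


def search(collection, query_embedding: List[float], k: int = 3) -> List[str]:
    """Return the text of the k stored chunks closest to the query embedding."""
    results = collection.query(query_embeddings=[query_embedding], n_results=k)
    return results["documents"][0]  # results are grouped per query; we sent one query


# The id/metadata helper can be checked without ChromaDB installed
ids, metadatas = build_records(["first chunk", "second chunk"], "sample_research_paper.txt")
print(ids)
```

Because the client lives in memory, the collection disappears when the notebook kernel restarts, which is fine for prototyping but means the index has to be rebuilt each session.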

## Step 6: Open-Source Language Model

**What we're using:**
- **Flan-T5** - Google's open-source model
- **Completely free** - runs on your machine
- **Good for educational purposes** - explains concepts clearly
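
One possible way to load and prompt Flan-T5 with the `transformers` library is sketched below. The checkpoint `google/flan-t5-small` is an assumption chosen because it is small; the tutorial does not say which size to use. The prompt wording and the helper names `build_prompt`, `load_flan_t5`, and `generate_answer` are likewise illustrative. Only `build_prompt` runs in this cell; the model helpers are defined but not called, since loading downloads the weights.

```python
def build_prompt(question: str, context: str) -> str:
    """Combine retrieved context and the user's question into one instruction."""
    return (
        "Answer the question using only the context below. "
        "Explain it in simple terms for a student.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def load_flan_t5(model_name: str = "google/flan-t5-small"):
    """Load tokenizer and model once (downloads the weights on first use)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # lazy: heavy import
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def generate_answer(prompt: str, tokenizer, model, max_new_tokens: int = 128) -> str:
    """Run the prompt through the model and decode the generated text."""
    # Truncate long prompts so they fit a 512-token input budget
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Building the prompt needs no model, so it can be inspected right away
demo_prompt = build_prompt(
    question="What is self-attention?",
    context="The transformer model revolutionized NLP through self-attention mechanisms.",
)
print(demo_prompt)
```

To generate text, call `tokenizer, model = load_flan_t5()` once and then `generate_answer(demo_prompt, tokenizer, model)`. Loading once and reusing the pair avoids paying the model start-up cost on every question.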

## Step 7: Complete RAG Pipeline

**Putting it all together:**
- **R**etrieval: Find relevant chunks from PDF
- **A**ugmentation: Add context to the question
- **G**eneration: Create answer using open-source model
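
The three stages can be wired together roughly as follows. This is a sketch under stated assumptions: the class name `ResearchAssistant` and its `ask` method are hypothetical, and the retriever and generator are passed in as plain functions so the flow can be tried without any model. The `keyword_retrieve` and `echo_generate` functions are toy stand-ins; in a real run they would be replaced by a vector-store search and a Flan-T5 call.

```python
import json
from datetime import datetime
from typing import Callable, Dict, List


class ResearchAssistant:
    """Retrieval -> augmentation -> generation, with a record of each exchange."""

    def __init__(self, retrieve: Callable[[str, int], List[str]],
                 generate: Callable[[str], str], top_k: int = 3):
        self.retrieve = retrieve
        self.generate = generate
        self.top_k = top_k
        self.conversation_history: List[Dict] = []

    def ask(self, question: str) -> str:
        chunks = self.retrieve(question, self.top_k)               # Retrieval
        context = "\n\n".join(chunks)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"  # Augmentation
        answer = self.generate(prompt)                              # Generation
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "question": question,
            "sources": chunks,
            "answer": answer,
        })
        return answer


# Toy stand-ins so the pipeline runs without a vector store or language model
docs = [
    "The transformer model uses self-attention mechanisms.",
    "Hidden Markov Models were common before deep learning.",
]


def keyword_retrieve(question: str, k: int) -> List[str]:
    """Rank documents by how many lowercase words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]


def echo_generate(prompt: str) -> str:
    """Pretend 'model' that simply returns the first line of the context."""
    return prompt.splitlines()[1]


assistant = ResearchAssistant(keyword_retrieve, echo_generate, top_k=1)
print(assistant.ask("What does the transformer model use?"))
print(json.dumps(assistant.conversation_history[0]["sources"]))
```

Passing the retriever and generator in as functions keeps the pipeline logic independent of any particular vector store or model, and every entry in `conversation_history` is plain JSON-serializable data.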

## Step 8: Interactive Learning Session

**Let's test our system with educational questions!**
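
A question loop for such a session might look like the sketch below. The helper name `run_session` and the three sample questions (drawn from the sample paper's sections) are illustrative, and `placeholder_ask` is a stand-in so the loop can be tried before any model is loaded. In a real session you would pass your assistant's question-answering function instead.

```python
from typing import Callable, Dict, List

SAMPLE_QUESTIONS = [
    "What did NLP rely on before deep learning?",
    "What made the transformer architecture so influential?",
    "What do modern large language models demonstrate?",
]


def run_session(ask: Callable[[str], str], questions: List[str]) -> List[Dict[str, str]]:
    """Ask each question in turn, print the exchange, and return a transcript."""
    transcript = []
    for number, question in enumerate(questions, start=1):
        answer = ask(question)
        print(f"Q{number}: {question}")
        print(f"A{number}: {answer}\n")
        transcript.append({"question": question, "answer": answer})
    return transcript


def placeholder_ask(question: str) -> str:
    """Stand-in answer function; replace with a real question-answering call."""
    return f"(answer to: {question})"


transcript = run_session(placeholder_ask, SAMPLE_QUESTIONS)
```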

## Final Summary & Next Steps

In [None]:
print("🎉 Congratulations! You've built a complete GenAI system!")
print("\nWhat you learned:")
print("  - How to process PDF documents into searchable chunks")
print("  - Using open-source embedding models (no API keys!)")
print("  - Building in-memory vector databases with ChromaDB")
print("  - Creating Q&A systems with open-source language models")
print("  - Adding educational features for better learning")

print("\nNext steps to explore:")
print("1. Try with your own PDF research papers")
print("2. Experiment with different embedding models")
print("3. Add conversation memory for follow-up questions")
print("4. Create a web interface using Streamlit")
print("5. Try larger open-weight models like Llama 2")

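# Save conversation history for review.
# `assistant` is the RAG assistant created in the earlier steps; skip the save if it does not exist.
if 'assistant' in globals():
    with open('learning_session.json', 'w') as f:
        json.dump(assistant.conversation_history, f, indent=2)
    print("\nConversation history saved to 'learning_session.json'")
else:
    print("\nNo assistant found; run the earlier steps to record a learning session.")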