# LLMs as Teaching Assistants

## Agenda
- Introduction to LLMs as Teaching Assistants
  - Responses in No Time
  - Cost Reduction
  - On-Demand Responses
- Understanding System Design of LLM as Teaching Assistants
- Discussing Key Concepts and Techniques
- Exploring Use Cases and Architecture

---

## What are LLMs?

**Definition:** Large Language Models (LLMs) are advanced neural networks designed to process and understand human language. They are trained on vast datasets and are capable of generating contextually relevant human-like text.

**Why train LLMs?**
- To generate human-like responses.
- To produce coherent text based on the context provided.
  
### Characteristics
- **Deep Neural Networks:** These models consist of numerous layers (hence 'deep') that enable them to learn complex patterns from large datasets.
- **Parameters:** The number of parameters ranges from millions to billions. More parameters generally lead to better model performance, but they also require more computing power.

### Comparison with Small Language Models (SLMs)
- **Size and Scope:** LLMs can generate longer, more coherent text than SLMs, which tend to be limited in the complexity and length of what they can produce.
- **Use Cases:** LLMs are preferable for complex language tasks like translation, summarization, and question answering due to their sophisticated understanding.

---

## Basic Use Cases of LLMs
1. **Question Answering:** Providing instant responses to user queries.
2. **Translations:** Translating texts from one language to another accurately.
3. **Summarization:** Creating concise summaries of lengthy documents.
4. **Sentiment Analysis:** Analyzing texts for their emotional tone.

---

## What is the SwiGLU Activation Function?

**Definition:** SwiGLU is a non-linear activation function that combines the advantages of both the Swish and GLU (Gated Linear Unit) activation functions. It helps the model learn complex patterns and dependencies more effectively than simpler activation functions like ReLU.

### Formula:
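$\text{SwiGLU}(x) = \text{Swish}(xW) \odot (xV)$, where $\text{Swish}(z) = z \cdot \text{Sigmoid}(z)$, $W$ and $V$ are learned weight matrices, and $\odot$ denotes element-wise multiplication.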
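
In the form commonly used in Transformer feed-forward layers, SwiGLU multiplies a Swish-activated projection of the input by a second, purely linear projection. The NumPy sketch below illustrates that gating; the matrix sizes and the names `W` and `V` are assumptions chosen for illustration, and bias terms are omitted.

```python
import numpy as np

def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu(x, W, V):
    """Gated unit: the Swish-activated projection x @ W gates the linear projection x @ V."""
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))   # 2 tokens, model width 4 (hypothetical sizes)
W = rng.normal(size=(4, 8))   # gate projection
V = rng.normal(size=(4, 8))   # value projection
out = swiglu(x, W, V)
print(out.shape)  # (2, 8)
```

Because the gate is computed from the input itself, each output feature can be scaled up or down per token, which a fixed activation such as ReLU cannot do.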


---

## What are Rotary Embeddings?

**Definition:** Rotary embeddings (RoPE) encode the position of each token by rotating its query and key vectors by an angle proportional to that position. Attention scores then depend on the relative distance between tokens, which lets the model capture word order better than traditional absolute positional embeddings.

### Importance
- **Context Awareness:** By using rotary embeddings, the model can better associate words with their respective contexts, which is crucial for understanding sentences correctly.
- **Robustness:** This method helps the model maintain meaningful relationships between words in complex sentences.
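
A minimal NumPy sketch of the idea, assuming the usual pairwise-rotation formulation with a frequency base of 10000 (both are assumptions, not details given in these notes). It rotates a query and a key by their positions and checks that the attention score depends only on the offset between them.

```python
import numpy as np

def rotary_embed(x, position, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector by angles proportional to position."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same offset (3) at different absolute positions gives the same score.
score_a = rotary_embed(q, 5) @ rotary_embed(k, 2)
score_b = rotary_embed(q, 105) @ rotary_embed(k, 102)
print(np.isclose(score_a, score_b))  # True
```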

---

## Why is LLaMA Preferred over GPT?

**LLaMA (Large Language Model Meta AI):**
- **Parameter Efficiency:** The original LLaMA models range from 7 billion to 65 billion parameters, far fewer than GPT-3's 175 billion, which makes them cheaper to run.
- **Performance:** Meta's published benchmarks report that LLaMA-13B outperforms the much larger GPT-3 on most of the tasks evaluated, providing higher-quality responses with fewer resources.
- **Training Data:** LLaMA was trained on a diverse mix of publicly available datasets, which broadens its coverage of topics.

---

## Advantages of Using LLMs as Teaching Assistants
- **Cost Efficiency:** Reduces the need for hiring multiple human teaching assistants.
- **Instantaneous Responses:** Users receive answers without waiting time, which enhances learning.
- **Scalability:** Can handle multiple users simultaneously without degradation of service quality.

---

## Rough Sketch of LLM as Teaching Assistant Architecture

```mermaid
flowchart TD
    User -->|Sends Query| A[Teaching Assistant System]
    A -->|Forwards Query| B[ChatGPT]
    B -->|Generates Response| A
    A -->|Responds to User| User
```

### Functional Issues of This Architecture
1. **Quality of Responses:** If the user query is misunderstood, the model may provide irrelevant answers.
2. **API Usage Costs:** External APIs typically charge per token, so costs grow with query volume and prompt length.
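
The token-based cost can be estimated with simple arithmetic. The sketch below is illustrative only: the per-token prices, token counts, and query volume are hypothetical placeholders, not real provider pricing.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  price_in_per_1k, price_out_per_1k):
    """Return the cost of one request given per-1,000-token prices."""
    input_cost = (prompt_tokens / 1000) * price_in_per_1k
    output_cost = (completion_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost

# Hypothetical prices, chosen only to make the arithmetic easy to follow.
per_query = estimate_cost(1500, 500, price_in_per_1k=0.001, price_out_per_1k=0.002)
monthly = per_query * 10_000   # assumed 10,000 student queries per month
print(round(per_query, 6), round(monthly, 2))
```

Long prompts dominate the bill in this setup, which is one reason to send only the most relevant context rather than whole documents.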

---

## What is Retrieval Augmented Fine Tuning (RAFT)?

**Definition:** RAFT combines retrieval-augmented generation (RAG) with fine-tuning: the model is fine-tuned on domain-specific questions paired with retrieved documents, including irrelevant "distractor" documents, so that it learns to answer from the relevant context and ignore the rest.

## Differences Between RAG and RAFT
- **RAG:** Retrieves relevant pieces of information from an external database at query time and passes them to an unmodified model to formulate responses.
- **RAFT:** Keeps the retrieval step but also fine-tunes the model on question, document, and answer examples from the target domain, leading to more accurate, contextually appropriate outputs.
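
To make the difference concrete, here is a rough sketch of how a single RAFT-style training record might be assembled: a question, a context that mixes the relevant ("oracle") document with distractors, and the target answer. The function name, the record layout, and the default `p_oracle=0.8` are assumptions for illustration, not values taken from these notes.

```python
import random

def build_raft_example(question, oracle_doc, distractor_docs, answer,
                       p_oracle=0.8, num_distractors=2, rng=None):
    """Assemble one fine-tuning record: shuffled context documents plus the target answer."""
    rng = rng or random.Random()
    context = rng.sample(distractor_docs, k=min(num_distractors, len(distractor_docs)))
    # With probability p_oracle the relevant document is included; otherwise
    # the context holds only distractors.
    if rng.random() < p_oracle:
        context.append(oracle_doc)
    rng.shuffle(context)
    prompt = "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(context))
    prompt += f"\n\nQuestion: {question}"
    return {"prompt": prompt, "completion": answer}

example = build_raft_example(
    question="What does FAISS do?",
    oracle_doc="FAISS performs similarity search over dense vectors.",
    distractor_docs=["PostgreSQL is a relational database.",
                     "Whisper converts speech to text.",
                     "Mermaid renders diagrams from text."],
    answer="It finds the stored vectors closest to a query vector.",
    p_oracle=1.0,
    rng=random.Random(0),
)
print(example["prompt"])
```

Plain RAG would build a similar prompt at query time but never train on it; in RAFT, many such records are used to fine-tune the model before deployment.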

---

## What is FAISS?

**Definition:** FAISS (Facebook AI Similarity Search) is a library developed for efficient similarity search and clustering of dense vectors. It provides tools for high-performance nearest neighbor search.

### Applications
- Used for similarity search in RAG architectures: the query is converted to an embedding, and FAISS returns the stored document embeddings closest to it.
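
The sketch below shows, in plain NumPy, the computation that an exact nearest-neighbor index performs. It is a stand-in rather than FAISS itself: in FAISS, the corresponding steps would be building an `IndexFlatL2`, calling `add` with the stored vectors, and calling `search` with the query vectors. The function name `search_l2` and the sample vectors are hypothetical.

```python
import numpy as np

def search_l2(index_vectors, query_vectors, k):
    """Return (distances, ids) of the k nearest stored vectors for each query."""
    # Squared Euclidean distance between every query and every stored vector.
    diffs = query_vectors[:, None, :] - index_vectors[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    ids = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, ids, axis=1), ids

stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")
distances, ids = search_l2(stored, query, k=2)
print(ids)  # [[1 0]]
```

This brute-force scan compares the query with every stored vector; libraries such as FAISS add optimized and approximate index types so the search stays fast as the collection grows.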

## What is a PostgreSQL Vector Database?

PostgreSQL can be enhanced with extensions to support vector operations, allowing for efficient management and querying of embedding data.

### Pgvector
- Pgvector is a PostgreSQL extension for vector similarity search. It enables storage of vectors and efficient querying based on similarity.
  
---

## Detailed Architecture Analysis

- **Video Upload:** The admin uploads videos to the server.
- **Transcription:** Videos are transcribed using a service such as Amazon Transcribe.
- **Storing in Vector DB:** Transcripts are split into chunks, embedded, and stored in a vector store (e.g., FAISS).
- **Query Handling:** Each user query is embedded and matched against the vector store to retrieve the most relevant transcript chunks.
- **ChatGPT Integration:** The retrieved chunks are added to the prompt as context, and ChatGPT generates the answer from them.

```mermaid
flowchart TD
    A[Video Upload] --> B[Transcription]
    B --> C[Store in Vector DB]
    C --> D[User Query]
    D --> E[Retrieve Documents]
    E --> F[Send to ChatGPT]
    F --> G[Response to User]
```
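
The retrieval and prompt-building steps of this flow can be sketched end to end without any external service. In the sketch below, `embed` is a toy bag-of-words stand-in for a real embedding model, and all function names and sample transcript chunks are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Return the k transcript chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Combine retrieved context and the user question into one prompt."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

transcript_chunks = [
    "Gradient descent updates weights in the direction of the negative gradient.",
    "A vector database stores embeddings for similarity search.",
]
question = "How does gradient descent update weights?"
prompt = build_prompt(question, retrieve(question, transcript_chunks))
print(prompt)
```

In the real system the prompt would then be sent to the LLM, and `embed` would be replaced by a learned embedding model so that paraphrases match as well as exact words.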

---

## What is Amazon Transcribe?

**Definition:** Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that converts speech into text. This service is particularly useful for processing audio files into readable transcripts.
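
As a rough sketch of how a transcription job might be started from Python, the helper below wraps the `start_transcription_job` call of a boto3 Transcribe client. The helper name, job name, and S3 location are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def start_transcription(client, job_name, media_uri,
                        media_format="mp4", language_code="en-US"):
    """Start an asynchronous transcription job for a media file stored in S3."""
    return client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        MediaFormat=media_format,
        LanguageCode=language_code,
    )

# Usage (requires AWS credentials; the bucket and key are hypothetical):
# import boto3
# client = boto3.client("transcribe")
# start_transcription(client, "lecture-01", "s3://my-bucket/lecture-01.mp4")
# status = client.get_transcription_job(TranscriptionJobName="lecture-01")
```

The job runs asynchronously, so the caller has to poll `get_transcription_job` (or react to a notification) before fetching the finished transcript.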

## Differences Between PyPDF Loader and PyMuPDF Loader

- **PyPDF Loader:** Uses the pure-Python pypdf library to read PDF files and extract their text.
- **PyMuPDF Loader:** Built on PyMuPDF, which also gives access to images and annotations in PDFs and is typically faster at text extraction.

---

## AI Interviewer App Architecture

**Architecture Overview:**
1. **User Speech to Text Conversion:** Using services like Whisper.
2. **Evaluation Logic:** ChatGPT evaluates responses based on predefined criteria.
3. **Text to Speech Feedback:** The application converts feedback text back into speech for user interaction.

```mermaid
flowchart TD
    A[User Speech] --> B[Convert to Text]
    B --> C[ChatGPT Evaluation]
    C --> D[Text to Speech]
    D --> E[User Feedback]
```
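
One way to keep these three stages swappable is to pass each one in as a function. The sketch below wires the flow together with stub components; every name in it is hypothetical, and in a real app the stubs would be replaced by calls to a speech-to-text service, an LLM, and a text-to-speech service.

```python
def run_interview_turn(audio, speech_to_text, evaluate, text_to_speech):
    """Run one turn: transcribe the answer, score it, and voice the feedback."""
    answer_text = speech_to_text(audio)
    feedback_text = evaluate(answer_text)
    return text_to_speech(feedback_text)

# Stub components so the flow can be exercised without any external service.
def fake_stt(audio):
    return audio.decode("utf-8")

def fake_eval(text):
    return f"Feedback on: {text}"

def fake_tts(text):
    return text.encode("utf-8")

spoken = run_interview_turn(b"I would use a hash map.", fake_stt, fake_eval, fake_tts)
print(spoken)  # b'Feedback on: I would use a hash map.'
```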

---



# LLMs as Teaching Assistants - Extended Notes with Multimodal RAG

## Multimodal Retrieval-Augmented Generation (RAG)

### Introduction to Multimodal RAG
**Definition:** An advanced AI system that combines multiple data types (text, images, audio, video) to enhance language model capabilities by retrieving and processing information from multiple modalities simultaneously.

**Why Needed?**
- Real-world scenarios often involve both visual and textual information (e.g., medical reports with X-rays and doctor's notes)
- Text and images provide complementary information (diagrams clarifying textual concepts)
- Standard text-only RAG misses visual context, leading to incomplete responses

### Three Major Implementation Methods

#### Option 1: Multimodal Embeddings + Multimodal LLM
1. **Embedding:** Use models like CLIP to encode images/text into shared vector space
2. **Retrieval:** Perform similarity search across both modalities
3. **Synthesis:** Pass raw images + text to multimodal LLM (e.g., GPT-4V)

```mermaid
flowchart TD
    A[Text] --> B[Multimodal Embedding Model]
    C[Image] --> B
    B --> D[Vector Database]
    E[Query] --> F[Similarity Search]
    D --> F
    F --> G[Multimodal LLM]
    G --> H[Answer]
```

#### Option 2: Image Summarization + Text Retrieval
1. **Summarization:** Use multimodal LLM to generate text descriptions of images
2. **Embedding:** Embed these summaries with standard text embeddings
3. **Synthesis:** Pass retrieved text to standard LLM

#### Option 3: Hybrid Approach (Used in Project)
1. **Summarization:** Generate text summaries from images
2. **Embedding:** Store summaries with references to original images
3. **Synthesis:** Pass both raw images and text to multimodal LLM
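
The key data-structure decision in the hybrid approach is keeping a pointer from each text summary back to its source image. The sketch below illustrates that with a small record type and a toy word-overlap ranking in place of real embeddings; the class name, function name, and file paths are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ImageRecord:
    summary: str      # text description produced by a multimodal LLM
    image_path: str   # reference back to the original image file

def retrieve_images(query, records, k=1):
    """Rank records by word overlap between the query and each summary (toy scoring)."""
    query_words = set(query.lower().split())

    def overlap(record):
        return len(query_words & set(record.summary.lower().split()))

    return sorted(records, key=overlap, reverse=True)[:k]

records = [
    ImageRecord("bar chart of monthly payment trends", "images/page3_img1.png"),
    ImageRecord("diagram of a neural network", "images/page7_img2.png"),
]
best = retrieve_images("payment trends chart", records)[0]
# Both the summary and the raw image at best.image_path would go to the multimodal LLM.
print(best.image_path)  # images/page3_img1.png
```

Searching over summaries keeps retrieval text-only and cheap, while the stored path lets the final synthesis step still see the original image.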

### Code Implementation Breakdown

#### 1. PDF Processing with PyMuPDF
```python
import fitz  # PyMuPDF
import os

def extract_images_from_pdf(pdf_path, output_folder):
    """Save every embedded image in the PDF to output_folder."""
    pdf_document = fitz.open(pdf_path)
    os.makedirs(output_folder, exist_ok=True)

    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        # Each entry describes one image on the page; item 0 is its xref
        image_list = page.get_images(full=True)

        for img_index, img in enumerate(image_list):
            xref = img[0]
            # Look up the raw image bytes and file extension by xref
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            # Name files by page and position, e.g. page1_img1.png
            image_filename = f"page{page_number+1}_img{img_index+1}.{image_ext}"
            image_filepath = os.path.join(output_folder, image_filename)

            with open(image_filepath, "wb") as image_file:
                image_file.write(image_bytes)

    pdf_document.close()
```

Key Components:
- `get_images()`: Extracts all images from a PDF page
- `extract_image()`: Gets binary image data using cross-reference (xref)
- File naming convention preserves page/image relationships

#### 2. Image Encoding and Description
```python
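import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def describe_image(base64_image):
    """Send the image to a vision-capable model and return its text reply."""
    response = client.chat.completions.create(
        # Model name as given in these notes; it may need to be replaced with a
        # currently available vision-capable model.
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract ALL text from this image"},
                    {
                        "type": "image_url",
                        # The API expects an object with a "url" field
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                    }
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content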
```

Key Points:
- Base64 encoding converts binary images to text-safe format
- GPT-4V processes both the image and textual prompt
- With this prompt, the response contains the text found in the image; a broader prompt is needed to also get an interpretation of the image
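
The snippet above hardcodes `image/png` in the data URL, while images extracted from a PDF can come in other formats such as JPEG. A possible refinement, sketched here with a hypothetical helper name, is to guess the MIME type from the file extension using the standard library.

```python
import base64
import mimetypes

def to_data_url(image_path):
    """Encode an image file as a data URL, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"  # fallback when the extension is unknown
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the image URL in the request, so the declared type matches the actual file.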

#### 3. Combined Text+Image Processing
```python
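# Relies on fitz, os, encode_image, and describe_image from the sections above.
def extract_images_and_text_from_pdf(pdf_path, output_folder):
    pdf_document = fitz.open(pdf_path)
    os.makedirs(output_folder, exist_ok=True)
    combined_text = ""

    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        # Raw page text, labeled with its page number
        text = page.get_text()
        combined_text += f"\n\nPage {page_number + 1}:\n{text}"

        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            # Save the image to disk (same steps as extract_images_from_pdf)
            base_image = pdf_document.extract_image(img[0])
            image_filename = f"page{page_number+1}_img{img_index+1}.{base_image['ext']}"
            image_filepath = os.path.join(output_folder, image_filename)
            with open(image_filepath, "wb") as image_file:
                image_file.write(base_image["image"])

            # Describe the image and append the description after the page text
            base64_image = encode_image(image_filepath)
            image_description = describe_image(base64_image)
            combined_text += f"\n\n[Image: {image_filename}]\n{image_description}"

    pdf_document.close()
    return combined_text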
```

Workflow:
1. Extracts raw text from each page
2. Processes each image through GPT-4V
3. Combines original text with image descriptions
4. Preserves references between text and images

#### 4. Vector Database Setup
```python
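# Import paths assume the split LangChain packages
# (langchain-community, langchain-openai, langchain-text-splitters).
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# `loaders` was not defined in the original notes; a PyMuPDFLoader over a
# hypothetical file path is assumed here.
loaders = PyMuPDFLoader("lecture_notes.pdf")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum characters per chunk
    chunk_overlap=60,    # characters shared between neighboring chunks
    separators=["\n\n", "\n"]
)

splits = text_splitter.split_documents(loaders.load())
embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
db = FAISS.from_documents(splits, embeddings)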
```

Key Configurations:
- Chunks of up to 500 characters with a 60-character overlap, so context that straddles a chunk boundary appears in both neighboring chunks
- FAISS enables efficient similarity search
- OpenAI embeddings create semantic representations
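
To see what chunk size and overlap mean in practice, here is a simplified sliding-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` first tries to break on the listed separators, so its chunks will not match these fixed windows exactly, and the function name is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 60-character overlap plays at full scale.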

#### 5. Retrieval-Augmented QA System
```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# {context} is filled with the retrieved chunks, {question} with the user query
template = """Use context to answer the question. Include image references when relevant:
{context}
Question: {question}
Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# `db` is the FAISS vector store built in the previous step
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

result = qa_chain.invoke({"query": "Explain payment trends from the image"})
print(result["result"])  # the generated answer
```

### Practical Applications

1. **Education**
   - Lecture notes with diagrams (like the payment trends example)
   - Automated generation of alt-text for accessibility

2. **E-commerce**
   - Product manuals with both textual instructions and diagrams
   - Visual search combined with textual queries

3. **Medical**
   - Radiology reports combining scan images with doctor's notes
   - Prescription understanding with pill images

### Challenges and Solutions

| Challenge | Solution |
|-----------|----------|
| Large PDF processing | Chunking strategies + parallel processing |
| Image quality issues | Preprocessing with OpenCV/Pillow |
| Cost of multimodal LLMs | Hybrid approach (Option 3) |
| Reference preservation | Structured metadata in vector DB |
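
As one possible reading of the "parallel processing" entry, image descriptions can be requested concurrently, since each call mostly waits on the network. The sketch below uses a thread pool from the standard library; `describe_all` and the stub `fake_describe` are hypothetical names, and a real description function would take the stub's place.

```python
from concurrent.futures import ThreadPoolExecutor

def describe_all(image_paths, describe, max_workers=4):
    """Apply describe() to every image concurrently, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(describe, image_paths))

# Stand-in for a slow, network-bound description call.
def fake_describe(path):
    return f"description of {path}"

print(describe_all(["page1_img1.png", "page1_img2.png"], fake_describe))
```

The worker count is a tuning knob: provider rate limits usually cap how many requests can usefully run at once.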

### Advanced Concepts

**CLIP Embeddings**
- Contrastive Language-Image Pretraining
- Enables cross-modal similarity search
- Can replace GPT-4V for retrieval when synthesis isn't needed

**Pgvector Alternative**
```sql
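-- Enable the pgvector extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_embeddings (
    id SERIAL PRIMARY KEY,
    content TEXT,
    image_path VARCHAR(255),
    embedding VECTOR(1536)  -- must match the embedding model's output dimension
);

-- Find the 5 most similar documents by cosine distance (<=>).
-- '[query_embedding]' is a placeholder for the query vector literal.
SELECT content, image_path
FROM document_embeddings
ORDER BY embedding <=> '[query_embedding]'
LIMIT 5;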
```
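
The `<=>` operator in the query orders rows by cosine distance, which is one minus the cosine similarity. A small Python sketch of that quantity, using a hypothetical helper name:

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Smaller values mean the vectors point in more similar directions, which is why the query sorts in ascending order and keeps the first five rows.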

### Future Directions
1. **Audio Integration** - Adding speech-to-text and audio analysis
2. **Video Processing** - Frame extraction + temporal understanding
3. **3D Model Support** - For engineering/CAD applications

---

## Integration with Teaching Assistant System

```mermaid
flowchart TD
    A[Student Query] --> B{Contains Image?}
    B -->|Yes| C[Multimodal Processing]
    B -->|No| D[Standard RAG]
    C --> E[Image Extraction]
    E --> F[Description Generation]
    F --> G[Combined Embedding]
    D --> H[Text Embedding]
    G & H --> I[Vector DB Search]
    I --> J[Response Generation]
    J --> K[Student]
```

**Benefits for Education:**
- Can explain textbook diagrams automatically
- Processes handwritten notes when shared as images
- Understands mathematical notation in scans

## Conclusion

In this session, we explored the system design of LLMs as teaching assistants, delving into functionalities, architectures, and applications. Understanding how to utilize LLMs efficiently, especially in educational contexts, opens up possibilities for enhanced learning experiences.

### Next Steps
- Review the LLM architecture and key concepts discussed.
- Implement a sample application based on provided architecture and code snippets.
- Explore additional literature on RAG and RAFT for a deeper understanding.