## Week 1 Lab Manual
### Foundations of Deep Learning & AI Functionality

**Instructor Note**: This lab manual provides the aim, code, and explanation for each practical task. Focus on the architectural patterns and the transition from theoretical concepts to functional AI implementations.

---

# Week 1: Foundations of Language Modelling & Setup
## The Journey from NLP to Modern Transformers

###  Weekly Table of Contents
1. [Basic Tokenization](#-Lab-1.1:-Basic-Tokenization)
2. [Building a Website Summarizer](#-Lab-1.2:-Building-a-Website-Summarizer)
3. [Intro to LangChain](#-Lab-1.3:-Intro-to-LangChain)
- Environment Configuration
- Website Scraping Logic
- Summarization Logic
- Local Model Integration (Ollama)
- LangChain Re-implementation

###  Learning Objectives
Welcome to the Language Modelling curriculum! This week, we bridge the gap between traditional NLP and modern GenAI. You will learn:
1.  **Fundamental Concepts**: A look at NLP, Deep Learning, and the Transformer architecture.
2.  **Environment Setup**: Configuring Google Gemini 1.5 Flash and Ollama.
3.  **Prompt & Response**: Understanding how to talk to models (Cloud vs Local).
4.  **Basics of NLP**: Tokenization, Embeddings, and why "Context" matters.
5.  **Hands-on Project**: Building a "Web Research Assistant" using Gemini.

---

###  1.1 Basics of NLP & Deep Learning

#### What is NLP?
**Natural Language Processing (NLP)** is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language.

*   **Tokenization**: The process of converting a sequence of characters into a sequence of tokens. For example, the sentence "I love coding" becomes `["I", "love", "coding"]`.
*   **Embeddings**: Words aren't numbers, but computers need numbers. Embeddings represent words as dense vectors (lists of numbers) in a high-dimensional space where words with similar meanings are closer together.
*   **Stop Words**: Common words (like "the", "a", "is") that are often removed in traditional NLP to focus on "meaningful" words.
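
The three ideas above can be tried out with a small, standard-library-only sketch. The stop-word set and the 3-dimensional vectors below are hand-made assumptions for illustration; real embeddings are learned by a model and typically have hundreds or thousands of dimensions.

```python
import math

# Hypothetical, hand-picked stop-word set (real toolkits ship much longer lists)
STOP_WORDS = {"the", "a", "is", "of"}

# Toy 3-dimensional "embeddings", invented for illustration only
TOY_EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (closer to 1.0 means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

tokens = tokenize("The king is a friend of the queen")
print(remove_stop_words(tokens))  # ['king', 'friend', 'queen']

royal = cosine_similarity(TOY_EMBEDDINGS["king"], TOY_EMBEDDINGS["queen"])
fruit = cosine_similarity(TOY_EMBEDDINGS["king"], TOY_EMBEDDINGS["apple"])
print(f"king~queen: {royal:.2f}, king~apple: {fruit:.2f}")
```

With these toy vectors, "king" scores much closer to "queen" than to "apple", which is the behaviour the Embeddings bullet describes.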

---

##  Lab 1.1: Basic Tokenization
**Aim**: To understand the fundamental concept of tokenization by comparing simple whitespace splitting with regular expression-based word boundary detection.

**Explanation**:
This lab demonstrates two primary methods of tokenization:
1.  **Simple Split**: Uses Python's `split()` method, which separates text based on whitespace. This often keeps punctuation attached to words (e.g., "token.").
2.  **Regex Split**: Uses the pattern `\w+|[^\w\s]` to extract words (`\w+`) or individual punctuation characters (`[^\w\s]`), providing a much cleaner set of tokens for natural language processing tasks.

*Insight: Modern LLMs use 'Subword Tokenization' which we will explore in Week 3!*

In [None]:
# --- Lab 1.1: Basic Tokenization ---
import re

text = "Language Modelling is the art of predicting the next token. Isn't it fascinating?"

# 1. Simple Word Tokenization (Split by space)
tokens_simple = text.split()
print(f"Simple Split ({len(tokens_simple)}): {tokens_simple}")

# 2. Regex Tokenization (Handling punctuation)
tokens_regex = re.findall(r"\w+|[^\w\s]", text)
print(f"Regex Split ({len(tokens_regex)}): {tokens_regex}")

# Insight: Modern LLMs use 'Subword Tokenization' which we will explore in Week 3!
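
One detail worth noticing in the cell above: the pattern `\w+|[^\w\s]` splits the contraction "Isn't" into three tokens (`Isn`, `'`, `t`). The sketch below compares it with a variant that lets a word continue across an apostrophe. The variant pattern is an assumption added for illustration, not part of the original lab.

```python
import re

text = "Language Modelling is the art of predicting the next token. Isn't it fascinating?"

# Lab pattern: runs of word characters OR single punctuation characters
basic = re.findall(r"\w+|[^\w\s]", text)

# Illustrative variant: a word may contain an apostrophe followed by more word characters
contraction_aware = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(basic[-6:])              # ['Isn', "'", 't', 'it', 'fascinating', '?']
print(contraction_aware[-4:])  # ["Isn't", 'it', 'fascinating', '?']
```

Which behaviour is preferable depends on the downstream task, so treat the variant as one option rather than a replacement.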


##  Lab 1.2: Building a Website Summarizer
**Aim**: To build a web scraping and summarization tool that uses Gemini 1.5 Flash to summarize text content fetched from live URLs.

**Explanation**:
This project implements a complete pipeline for AI-driven web research:
1.  **Configuration**: Uses `dotenv` to securely manage API keys.
2.  **Scraping**: Uses `BeautifulSoup` and `requests` to extract clean text while ignoring boilerplate elements like scripts and navigation bars.
3.  **Prompt Engineering**: A structured system prompt guides the model to act as a research assistant, ensuring concise markdown output.
4.  **Integration**: Connects to **Gemini 1.5 Flash** for cloud processing and to **Ollama** for local execution, giving a hybrid deployment model.

### Import Required Libraries
We'll import the libraries needed for web scraping (`requests`, `BeautifulSoup`), environment variables (`dotenv`), the Gemini API (`google.generativeai`), local models (`ollama`), and notebook display formatting (`IPython.display`).

In [None]:
# 📦 WEEK 1 INITIALIZATION
import os
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import google.generativeai as genai
import ollama
from IPython.display import Markdown, display

# --- CONFIGURATION ---
load_dotenv(override=True)

# GEMINI CLOUD SETUP
GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY") or os.getenv("GEMINI_API_KEY")
if GEMINI_API_KEY:
    genai.configure(api_key=GEMINI_API_KEY)
else:
    print("❌ ERROR: Neither GOOGLE_API_KEY nor GEMINI_API_KEY was found in the environment.")

# MODEL CONFIGURATION
CLOUD_MODEL = "gemini-1.5-flash" 
LOCAL_MODEL = "gemma2:2b"

# Initialize models
model = genai.GenerativeModel(CLOUD_MODEL)

# Verify Ollama status
try:
    ollama.list()
    print("✅ Ollama local server is active.")
except Exception:
    print("⚠️ Warning: Ollama server not detected. Local model features will be unavailable.")

print(f"✅ Cloud Model Configured: {CLOUD_MODEL}")
print(f"✅ Local Model Configured: {LOCAL_MODEL}")

In [None]:
# 🌐 WEB SCRAPING INFRASTRUCTURE
# Robust website content extraction using BeautifulSoup

class Website:
    """
    A utility class for fetching and cleaning webpage content for LLM consumption.
    """
    def __init__(self, url: str):
        self.url = url
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
        }
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            
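            # soup.title.string is None when <title> is empty or contains nested tags
            self.title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled Page"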
            
            # Clean unwanted elements
            if soup.body:
                for element in soup.body(["script", "style", "img", "input", "nav", "footer"]):
                    element.decompose()
                self.text = soup.body.get_text(separator="\n", strip=True)
            else:
                self.text = "No content found in body."
                
        except Exception as e:
            self.title = "Error"
            self.text = f"Failed to fetch {url}: {str(e)}"

    def get_contents(self) -> str:
        return f"Page Title: {self.title}\n\nContent:\n{self.text[:15000]}" # Limit context window for efficiency

# Test the infrastructure
test_site = Website("https://blog.google/technology/ai/")
print(f"✅ Website Scraping Test: {test_site.title}")
print(f"Content length: {len(test_site.text)} chars")
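
To see the cleaning step in isolation, without a network call, here is a minimal offline sketch. It deliberately swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, so it is an analogue of the `Website` class rather than the same implementation. The `TextExtractor` class and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text while skipping the contents of boilerplate tags."""
    # Void tags such as <img> and <input> carry no text, so they need no special handling here
    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped tag
        self.chunks = []      # cleaned text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

# Invented sample page for illustration
SAMPLE_HTML = """
<html><body>
  <nav>Home | About</nav>
  <h1>Release notes</h1>
  <p>Version 2 adds streaming.</p>
  <script>console.log("tracking");</script>
  <footer>Copyright</footer>
</body></html>
"""

extractor = TextExtractor()
extractor.feed(SAMPLE_HTML)
clean_text = "\n".join(extractor.chunks)
print(clean_text)
```

Only the heading and the paragraph survive; the navigation bar, script, and footer text are dropped, which mirrors what `element.decompose()` achieves in the class above.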


In [None]:
# ✍️ SUMMARIZATION LOGIC
# Defining the system prompt and the summarization pipeline

system_prompt = (
    "You are an expert technical assistant. Your task is to analyze the content of a website "
    "and provide a concise, professional summary in Markdown format. Focus on core features, "
    "announcements, and key takeaways."
)

def summarize_website(url: str):
    """
    Fetches website content and generates a summary using Gemini.
    """
    site = Website(url)
    
    # Construct user message
    user_message = (
        f"Analyze the following website titled '{site.title}'.\n\n"
        f"Content:\n{site.text[:10000]}"
    )
    
    # Call Gemini (initialized in the first block)
    # Using the standardized 'model' instance
    response = model.generate_content([system_prompt, user_message])
    return response.text

def display_summary(url: str):
    """Utility to display the markdown summary in the notebook"""
    print(f"Summarizing: {url}...")
    summary = summarize_website(url)
    display(Markdown(summary))

# Execution
display_summary("https://openai.com/news/")
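
The slice `site.text[:10000]` keeps the prompt small, but it can cut the text in the middle of a word or sentence. Because `Website.text` joins fragments with `"\n"`, one possible refinement is to cut at the last line break instead. The helper below is a hypothetical sketch of that idea, not part of the original pipeline.

```python
def truncate_at_line_boundary(text: str, max_chars: int) -> str:
    """Return at most max_chars characters, cutting at the last newline when one exists."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    last_newline = clipped.rfind("\n")
    # Fall back to a hard cut when the clipped text contains no usable newline
    return clipped[:last_newline] if last_newline > 0 else clipped

sample = "First paragraph.\nSecond paragraph is longer.\nThird paragraph."
print(truncate_at_line_boundary(sample, 30))  # First paragraph.
```

A character budget is still only a rough proxy for the model's token limit, so the chosen `max_chars` value remains a judgment call.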


In [None]:
# 🏠 LOCAL MODELS WITH OLLAMA
# Running open-source models (Gemma 2:2b) locally for privacy and cost-efficiency.

def call_local_gemma(prompt: str):
    """
    Interacts with the local Gemma 2:2b model via Ollama.
    """
    try:
        response = ollama.chat(model=LOCAL_MODEL, messages=[
            {
                'role': 'user',
                'content': prompt,
            },
        ])
        return response['message']['content']
    except Exception as e:
        return f"Error calling Ollama: {str(e)}"

# Example Usage
print(f"Testing Local Model ({LOCAL_MODEL}):")
print(call_local_gemma("Explain the concept of 'Tokenization' in NLP in one professional sentence."))


In [None]:
# 🏠 LOCAL SUMMARIZATION
# Adapting the summarization pipeline for local execution using Ollama.

def summarize_local(url, model_name=LOCAL_MODEL):
    """Local summarize function using Ollama"""
    website = Website(url)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Summarize this website titled '{website.title}':\n\n{website.text[:5000]}"}
    ]
    response = ollama.chat(model=model_name, messages=messages)
    return response['message']['content']

def display_summary_local(url):
    print(f"Summarizing Locally ({LOCAL_MODEL}): {url}")
    summary = summarize_local(url)
    display(Markdown(summary))

# Test local summary
display_summary_local("https://blog.google/technology/ai/")
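
With both a cloud and a local summarizer available, the hybrid deployment idea can be sketched as a simple fallback: try one backend and switch to the other if it raises. The function and the two stand-in backends below are hypothetical and exist only so the sketch runs without an API key or an Ollama server.

```python
from typing import Callable

def summarize_with_fallback(text: str,
                            primary: Callable[[str], str],
                            fallback: Callable[[str], str]) -> str:
    """Try the primary backend first; use the fallback if it raises."""
    try:
        return primary(text)
    except Exception as exc:
        print(f"Primary backend failed ({exc}); using fallback.")
        return fallback(text)

# Stand-in backends (assumptions for illustration, not real model calls)
def fake_cloud(text: str) -> str:
    raise ConnectionError("cloud unavailable")

def fake_local(text: str) -> str:
    return f"[local] {text[:20]}"

result = summarize_with_fallback("Gemini and Gemma are both Google models.", fake_cloud, fake_local)
print(result)  # [local] Gemini and Gemma are
```

In the notebook, `summarize_website` and `summarize_local` both take a URL string, so they could plausibly be passed as `primary` and `fallback`. Note that this only helps when the primary function raises; `call_local_gemma`, for example, returns an error string instead.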


## Company Brochure Generator

### Enhanced Website Class with Link Extraction
This extended version of our Website class also extracts all links from the webpage, which we'll use to find relevant company pages.

In [None]:
# 💼 EXTENDED APPLICATION: THE COMPANY BROCHURE GENERATOR
# Combining link extraction, AI filtering, and multi-page summarization.

from typing import List
from urllib.parse import urljoin

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field

class LinkList(BaseModel):
    links: List[dict] = Field(description="A list of relevant links with 'type' (e.g., About, Careers) and 'url'")

class EnhancedWebsite(Website):
    """Extended Website class that also extracts and de-duplicates the page's links"""
    def __init__(self, url):
        super().__init__(url)
        try:
            response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            hrefs = [link.get('href') for link in soup.find_all('a') if link.get('href')]
            # Resolve site-relative links against the page URL; keep absolute http(s) links as they are
            self.links = list(set(urljoin(url, h) if h.startswith('/') else h for h in hrefs if h.startswith(('http', '/'))))
        except Exception:
            self.links = []  # Fall back to no links if the page cannot be fetched or parsed

def get_relevant_links(url):
    website = EnhancedWebsite(url)
    links_text = "\n".join(website.links[:30])
    # Ask for JSON matching LinkList so the result has a top-level 'links' key
    parser = JsonOutputParser(pydantic_object=LinkList)
    prompt = ChatPromptTemplate.from_template(
        "Identify top 3 links (About, Careers, Products) from this list for a brochure: {links}. "
        "Return JSON.\n{format_instructions}"
    )
    # The prompt must be the first stage of the chain; CLOUD_MODEL comes from the initialization cell
    chain = prompt | ChatGoogleGenerativeAI(model=CLOUD_MODEL, google_api_key=GEMINI_API_KEY, temperature=0) | parser
    return chain.invoke({"links": links_text, "format_instructions": parser.get_format_instructions()})

def create_brochure(company_name, url):
    print(f"Creating brochure for {company_name}...")
    details = EnhancedWebsite(url).get_contents()
    links = get_relevant_links(url)

    for link in links.get('links', []):
        try:
            details += f"\n\n-- {link['type']} --\n" + Website(link['url']).get_contents()
        except (KeyError, TypeError):
            pass  # Skip entries that lack a 'type' or 'url' field

    # Note: the landing page alone can use most of the 8000-character budget,
    # so content from the linked pages may be cut off by this slice.
    prompt = f"Create a markdown brochure for {company_name} using this info:\n{details[:8000]}"
    return model.generate_content(prompt).text

# Example execution
# brochure = create_brochure("HuggingFace", "https://huggingface.co")
# display(Markdown(brochure))
print("✅ Brochure Generation logic defined.")
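
Link extraction depends on turning the `href` values found in a page into absolute URLs. One standard-library way to do that is `urllib.parse.urljoin`, which handles site-relative and page-relative paths. The sketch below is a hypothetical helper with made-up `example.com` links, shown only to illustrate the normalization step.

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep http(s) links only, and de-duplicate in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

# Invented hrefs for illustration
hrefs = ["/about", "careers", "https://example.com/about", "mailto:hi@example.com"]
links = normalize_links("https://example.com/company/", hrefs)
print(links)
# ['https://example.com/about', 'https://example.com/company/careers']
```

Keeping the original order (instead of using a bare `set`) makes the later `[:30]` style slicing deterministic between runs.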


---
##  Lab 1.3: Intro to LangChain
**Aim**: To recreate our website summarizer using **LangChain**, a widely used framework for building LLM-powered applications.

**Explanation**:
This lab introduces the LangChain framework to rebuild our summarization pipeline. It highlights key advantages:
- **Composability**: Chain different components together using LCEL.
- **Model-Agnostic**: Swap Gemini with other models (like local Llama/Gemma) easily.
- **Rich Ecosystem**: Built-in parsers, prompt templates, and output handling.
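
To build intuition for how LCEL's `|` operator can chain components, here is a toy sketch based on Python operator overloading. It is not LangChain's actual implementation; the `Step` class and the three stand-in stages are assumptions made purely for illustration.

```python
class Step:
    """Wraps a function so that `a | b` builds a new Step that runs a, then b."""
    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose: feed this step's output into the next step
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Stand-ins for a prompt template, a model, and an output parser
fill_prompt = Step(lambda d: f"Summarize the page titled {d['title']}")
fake_model = Step(lambda prompt: {"content": prompt.upper()})
to_string = Step(lambda message: message["content"])

chain = fill_prompt | fake_model | to_string
print(chain.invoke({"title": "Week 1"}))  # SUMMARIZE THE PAGE TITLED WEEK 1
```

Because every stage exposes the same `invoke` interface, any stage can be swapped without touching the others, which is the property the Composability and Model-Agnostic points rely on.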

In [None]:
# 🏗️ LANGCHAIN RE-IMPLEMENTATION
# Using ChatGoogleGenerativeAI to wrap our Gemini model
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chat_model = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    google_api_key=GEMINI_API_KEY,
    temperature=0
)

# 1. Define the Prompt Template
summary_template = """
You are a professional technical researcher. 
Analyze the following website content and provide a concise, bulleted summary in Markdown.
Focus on core value propositions and key announcements.

Website Title: {title}
Website Content: {content}
"""

summary_prompt = ChatPromptTemplate.from_template(summary_template)

# 2. Build the LCEL Chain (LangChain Expression Language)
summarize_chain = summary_prompt | chat_model | StrOutputParser()

# 3. Execute for a website
def langchain_summarize(url):
    ws = Website(url)
    result = summarize_chain.invoke({
        "title": ws.title,
        "content": ws.get_contents()
    })
    display(Markdown(result))

# Test LangChain implementation
print("Summarizing via LangChain...")
langchain_summarize("https://www.deepmind.com")


### Summary and Learning Outcomes - Week 1
- **Model Mastery**: You've used both Google's cloud-hosted **Gemini 1.5 Flash** and a locally hosted **Gemma 2** model via Ollama.
- **Workflow Automation**: Built a complete pipeline from raw URL to professional summary.
- **Modern Paradigms**: Introduced **LangChain** and **LCEL** (LangChain Expression Language) for building robust AI pipelines.

**Next Week**: We dive deeper into Conversational AI and advanced UI development with Gradio.

---

##  Instructor's Evaluation & Lab Summary

###  Assessment Criteria
1. **Technical Implementation**: Adherence to the lab objectives and code functionality.
2. **Logic & Reasoning**: Clarity in the explanation of the underlying AI principles.
3. **Best Practices**: Use of secure environment variables and structured prompts.

**Lab Completion Status: Verified**
**Focus Area**: Language Modelling & Deep Learning Systems.