# **Buckle Up! We Are Starting Our Week 2 Roller Coaster**

In our first week we covered some theoretical concepts and completed our setup, so it’s time to start building!

## 📓 **Conversational AI Concepts & Model Pipelines**

🎯 By the end of this week, you will:

- Understand LLMs, STT, TTS models and their roles.

- Know how to connect to LLMs with APIs (Groq as example).

- Use Python (requests + JSON) for API interaction.

- Start building a basic chatbot with memory and preprocessing.

---

## 🌟 Large Language Models (LLMs) 🌟

---

### ❗ **Question 1**: What is an LLM?

👉 It’s like a super-smart text predictor that can read, understand, and generate human-like sentences.

You give it some words → it guesses the next words in a way that makes sense.

For example:

1) You ask a question → it gives you an answer.

2) You write a sentence → it can complete it.

3) You give it a topic → it can write an essay, code, or even a story.

So, it’s a type of AI trained on huge amounts of text data to generate or understand text.
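
To make the "text predictor" idea concrete, here is a toy sketch: a bigram counter that guesses the next word from a tiny made-up corpus. Real LLMs use neural networks trained on vastly more text, so treat this only as an illustration of "predict the next word"; the names `train_bigrams` and `predict_next` are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count which word follows which in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Tiny made-up corpus, just for illustration.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigrams(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" most often here
print(predict_next(model, "dog"))  # None: "dog" never appeared in the corpus
```

The second call shows the "training data" limitation in miniature: the model can only predict from what it has seen.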

---

### Types of LLMs

1. Encoder-only models (e.g., BERT)

    - Best for understanding text (classification, sentiment analysis, embeddings).

    - ❌ Not good at generating text.

2. Decoder-only models (e.g., GPT, LLaMA, Mistral)

    - Best for text generation (chatbots, writing, summarization).

    - What we use in chatbots.

3. Encoder-decoder models (e.g., T5, BART)

    - Good at transforming text (translation, summarization, Q&A).

### Must-Knows about LLMs

- They don’t “think” like humans → They predict text based on training.

- Garbage in → garbage out: Poor prompts = poor answers.

- Token limits: Models can only “see” a certain number of tokens (roughly, word pieces) at a time.

- Biases: Trained on internet text → may reflect biases/errors.
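
One practical consequence of token limits is that long chat histories have to be trimmed before they are sent to the model. The sketch below keeps only the most recent messages that fit a budget. It assumes a very rough estimate of about 4 characters per token, which is only a ballpark for English text; a real tokenizer would give exact counts. `rough_token_count` and `trim_to_budget` are hypothetical helper names.

```python
def rough_token_count(text):
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(messages, max_tokens):
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent context is kept first.
    for message in reversed(messages):
        cost = rough_token_count(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    {"role": "user", "content": "a" * 40},       # ~10 tokens
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 20},       # ~5 tokens
]
trimmed = trim_to_budget(history, max_tokens=15)
print([m["content"][0] for m in trimmed])  # the oldest message is dropped
```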

### 💡 **Quick Questions**: 

1. Why might a chatbot built on BERT (encoder-only) struggle to answer open-ended questions?

A BERT (encoder‑only) chatbot struggles with open‑ended answers because it is designed to produce contextual embeddings for understanding, not to autoregressively generate fluent continuations.

Key reasons:
- No decoder / generation head: Vanilla BERT is bidirectional; it “sees” both left and right context while training (masked language modeling). It does not learn to emit a long sequence token‑by‑token the way decoder models (GPT, LLaMA) do.
- Masked objective mismatch: Predicting randomly masked tokens teaches *fill‑in* understanding, not sustained narrative or dialog flow.
- Output shape: Typical BERT fine‑tunes to classification (single label), span extraction (start/end indices), or embedding tasks—not freeform text strings.
- Lack of autoregressive memory: Decoder models maintain a growing context and generate the next token conditioned on everything so far; BERT would need extra architecture (e.g., a seq2seq wrapper) to do that.
- Generation hacks are clunky: You can sample masks iteratively or attach a lightweight decoder, but quality, coherence, and length control are worse than purpose‑built decoder models.
- Conversational dynamics: Open‑ended answers require pragmatic choices (length, style, elaboration). BERT provides semantic understanding but not a learned policy for multi‑sentence composition.

Rule of thumb:
- Use encoder (BERT) for understanding / classification / retrieval.
- Use decoder (GPT‑style) for freeform generation and open‑ended chat.
- Use encoder‑decoder (T5/BART) when you need to *transform* one text into another (summarization, translation) with strong conditioning.

So: A BERT chatbot “knows” context but lacks the generative machinery to speak freely without additional components.
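
The "bidirectional vs. left-to-right" difference described above can be pictured as an attention mask: a grid saying which positions each token is allowed to look at. This is a simplified illustration in plain Python, not how any particular library builds its masks; the function names are made up for the example.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (encoder / BERT-style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions 0..i (decoder / GPT-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat"]
masks = [
    ("bidirectional", bidirectional_mask(len(tokens))),
    ("causal", causal_mask(len(tokens))),
]
for name, mask in masks:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>4} sees {row}")
```

With the causal mask, "the" sees only itself while "sat" sees all three tokens. That is what lets a decoder generate one token at a time without peeking at the future.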

---

## 🌟 Speech-to-Text (STT) 🌟

---

### ❗ **Question 2**: What is STT?

👉 It listens to your voice and turns it into written text.

- Converts **audio → text**.
- Enables voice input for conversational AI.
- Think of it as the **ears** of the chatbot.

**Popular STT Models**:

1) **Whisper (OpenAI)** – strong at multilingual speech recognition.
2) **Google Speech-to-Text API** – widely used, real-time transcription.
3) **Vosk** – lightweight, offline speech recognition.

**Common Usages**

1) Voice assistants (Alexa, Siri, Google Assistant).
2) Automated captions in meetings or lectures.
3) Voice-enabled customer support.

---

### Must-Knows about STT

- Accuracy depends on **noise, accents, clarity of speech**.

- Some models need an **internet connection** (API-based); others run **offline**.

- Preprocessing audio (noise reduction) improves results.


### 💡 **Quick Questions**: 

2. Why do you think meeting transcription apps like Zoom or Google Meet struggle when multiple people talk at once?

Because current speech recognition pipelines assume (mostly) one dominant speaker at a time; overlapping speech violates that assumption and degrades signal quality.

Main challenges:
- Overlapping waveforms: Two voices mix additively → the model receives a single tangled acoustic signal; standard STT expects one source.
- Voice separation (source separation) is hard: Need to unmix (“who said what”) before transcription; errors cascade downstream.
- Speaker diarization limits: Diarization models detect speaker turns; rapid interjections or simultaneous talk confuses boundary detection → timestamps & labels drift.
- Acoustic masking: Louder speaker masks quieter one; weaker segments fall below model confidence thresholds → omissions.
- Accents & differing timbre: When overlapped, spectral features blur → phoneme recognition accuracy drops.
- Latency constraints: Real-time meeting apps can’t run heavy separation (e.g., advanced neural source separation + language modeling) without adding lag.
- Noise + echo: Conference mics capture room reverberation; overlapping speech + echo increases ambiguity.
- Language model context corruption: Partial words from two speakers interleave, producing sequences not seen in training → more decoding errors.

Mitigations (partial):
- Multi-channel input: Per-participant audio streams (headsets) instead of a single room mic.
- Neural source separation (e.g., ConvTasNet, Demucs variants) prior to ASR; still imperfect for >2 speakers or heavy overlap.
- Better diarization with embeddings (x-vectors) + overlap-aware models.
- Turn-taking UX cues: Encourage minimal overlap (push-to-talk, visual indicators).
- Post-processing alignment: Combine partial hypotheses and reconcile with language context.

Rule of thumb: Accurate transcription = clean single-speaker segments. As simultaneity increases, word error rate rises sharply unless you invest in separation + diarization pipelines.
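
Word error rate (WER), mentioned above, is commonly computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Here is a minimal sketch. The lowercasing and whitespace splitting are simplifying assumptions, and the two sentences are made-up examples.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate(
    "please share the quarterly report",
    "please share quarterly reports",
)
print(wer)  # one deletion + one substitution over 5 reference words
```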

---

## 🌟 Text-to-Speech (TTS) 🌟

---

### ❗ **Question 3**: What is TTS?

👉 It takes written text and speaks it out loud in a human-like voice.

- Converts **text → audio (speech)**.
- Think of it as the **mouth** of the chatbot.
- Makes AI “speak” naturally.

**Popular TTS Models**:

1) **Google TTS** – supports many languages and voices.
2) **Amazon Polly** – lifelike voice synthesis with customization.
3) **ElevenLabs** – cutting-edge, realistic voice cloning.

**Common Usages**

1) Screen readers for visually impaired users.
2) AI chatbots with voice output.
3) Audiobooks or podcast generation.

---

### Must-Knows about TTS

- Some voices sound robotic; others use **neural TTS** for natural tones.

- Latency matters → If too slow, conversation feels unnatural.

- Some TTS services allow **custom voices**.

### 💡 **Quick Questions**: 

3. If you were designing a voice-based AI tutor, what qualities would you want in its TTS voice (tone, speed, clarity, etc.)?

For a voice‑based AI tutor, the TTS voice should balance clarity, warmth, and adaptability:

Core qualities:
- Clear articulation: Distinct phonemes so learners at different proficiency levels can follow.
- Neutral accent (or selectable): Minimizes regional bias; offer optional localized variants.
- Moderate pace with adaptive speed: ~150–170 wpm baseline, slow down for definitions, speed up for review.
- Warm, encouraging tone: Friendly but not overly playful; slight prosody variation to avoid monotony.
- Consistent volume & dynamic range: Enough variation for emphasis without large loudness jumps.
- Proper pausing: Short pauses after clauses; longer pauses after key concepts or questions to invite reflection.
- Emphasis & prosody control: Highlight keywords (slight pitch rise) to aid memory.
- Low latency: Fast start (<300ms) so dialog feels natural.

Adaptive behaviors:
- Context-aware pacing: Detects code, formulas, or lists → slows and inserts separators.
- Emotional calibration: Calmer tone for corrections; enthusiastic tone for achievements.
- Personalization: Adjusts speed, pitch, or energy based on user preference or performance (e.g., slower after repeated mistakes).

Accessibility considerations:
- High intelligibility in noisy environments (robust phoneme modeling).
- Optional simplified mode (slower, more explicit punctuation verbalization for screen reader synergy).

Avoid:
- Excessive expressiveness (can distract from learning).
- Robotic flatness (reduces engagement & retention).
- Overly gendered stereotypes—offer a few balanced, professional voices.

Extra features (nice to have):
- Inline spelling mode (for difficult terms on request).
- Pronunciation contrast (“color” vs “caller” in minimal pairs) when teaching language nuances.

Rule of thumb: Start neutral + clear, then let the learner dial in speed, warmth, and detail level as they progress.
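
To get a feel for the speaking-rate numbers above, this small sketch estimates how long a sentence takes to speak at a given words-per-minute rate. It ignores pauses and treats every whitespace-separated chunk as one word, so it is only a rough planning aid; `estimate_speech_seconds` is a made-up helper name.

```python
def estimate_speech_seconds(text, words_per_minute=160):
    """Rough speaking time in seconds, ignoring pauses and punctuation."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    word_count = len(text.split())
    return word_count / words_per_minute * 60


explanation = "A variable is a named place where a program stores a value."

print(estimate_speech_seconds(explanation, 160))  # baseline pace
print(estimate_speech_seconds(explanation, 120))  # slowed down for a definition
```

Slowing from 160 to 120 wpm makes the same 12-word sentence take about a third longer, which is the kind of trade-off an adaptive tutor voice would manage.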

---

## 🌟 Using APIs for LLMs with Groq 🌟

```python
from groq import Groq
import os
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

api_key = os.getenv("GROQ_API_KEY")
if not api_key:
    raise ValueError("GROQ_API_KEY not set. Add it to your .env file or export as an environment variable.")

# Simple sanity check so we know code executed up to here without indentation errors
print("Environment loaded. Initializing client...")

client = Groq(api_key=api_key)

response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[
        {"role": "system", "content": "You are a helpful tutor."},
        {"role": "user", "content": "Hello! What is conversational AI?"}
    ]
)

print(response.choices[0].message.content)
```

Output:

```text
Environment loaded. Initializing client...
Conversational AI, also known as conversational systems or conversational interfaces, is a subset of artificial intelligence (AI) that enables computers to interact with humans in a more natural and intuitive way through conversations. 

These systems can understand and respond to spoken or written language in a way that simulates human-like conversation. Conversational AI can include various applications, such as:

1. Chatbots: Computer programs designed to converse with humans through text-based interfaces.
2. Virtual assistants: AI-powered programs that can understand voice commands and perform specific tasks, such as Amazon's Alexa or Google Assistant.
3. Voice assistants: Similar to virtual assistants but focus on spoken language, and can include systems like Siri or Cortana.
4. Dialogue systems: Software designed to engage in conversation with humans, often using natural language processing (NLP) and machine learning (ML) algorithms to u
```
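
This week's goals also mention plain Python with requests + JSON. Below is a rough sketch of the same kind of chat call without the SDK. It assumes Groq's OpenAI-compatible endpoint at `https://api.groq.com/openai/v1/chat/completions` and the usual `choices[0].message.content` response shape; please confirm both against the current Groq documentation. `build_chat_request` and `send_chat_request` are hypothetical helper names, and `requests` must be installed for the live call.

```python
import json
import os

# Assumed endpoint: Groq's OpenAI-compatible chat completions URL.
GROQ_CHAT_URL = "https://api.groq.com/openai/v1/chat/completions"


def build_chat_request(api_key, user_text, model="llama-3.1-8b-instant"):
    """Build the HTTP headers and JSON body for one chat request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful tutor."},
            {"role": "user", "content": user_text},
        ],
    }
    return headers, json.dumps(payload)


def send_chat_request(api_key, user_text):
    """POST the request and return the assistant's reply text."""
    import requests  # imported here so the builder above works without it

    headers, body = build_chat_request(api_key, user_text)
    response = requests.post(GROQ_CHAT_URL, headers=headers, data=body, timeout=30)
    response.raise_for_status()  # surface HTTP errors such as 401 or 429
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    key = os.getenv("GROQ_API_KEY")
    if key:
        print(send_chat_request(key, "Hello! What is conversational AI?"))
    else:
        print("GROQ_API_KEY not set; skipping the live request.")
```

Splitting "build" from "send" keeps the JSON payload easy to inspect and test without spending any API quota.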

---

## 🌟 Assignments 🌟

### 📝 Assignment 1: LLM Understanding

* Write a short note (3–4 sentences) explaining the difference between **encoder-only, decoder-only, and encoder-decoder LLMs**.
* Give one example usage of each.

**Answer:** Encoder-only models (e.g., BERT) read the whole sentence bidirectionally to produce rich embeddings for understanding tasks like sentiment or intent classification—great at “reading,” not at long freeform “speaking.” Decoder-only models (e.g., GPT, LLaMA) generate text left‑to‑right, making them ideal for open‑ended tasks like chatbots, story or code generation. Encoder‑decoder (seq2seq) models (e.g., T5, BART) first encode the input, then a decoder generates a new sequence conditioned on that representation—perfect for transforming text (translation, summarization, Q&A). Examples: encoder-only → spam detection; decoder-only → conversational assistant; encoder‑decoder → English→Spanish translation or abstractive news summarization.

### 📝 Assignment 2: STT/TTS Exploration

* Find **one STT model** and **one TTS model** (other than Whisper/Google).
* Write down:

  * What it does.
  * One possible application.

**Answer:** STT model – wav2vec 2.0 (Facebook AI): learns speech representations self‑supervised from raw audio, then fine‑tuned for transcription; strong accuracy with less labeled data. Application: real‑time captioning in low‑resource languages after fine‑tuning on a small local dataset. TTS model – Tacotron 2: sequence‑to‑sequence model that converts text → mel spectrograms then uses a vocoder (e.g., WaveGlow / HiFi-GAN) to synthesize natural speech. Application: generating consistent branded voice lines for an educational app without hiring voice actors for every update.

### 📝 Assignment 3: Build a Chatbot with Memory

* Write a Python program that:

  * Takes user input in a loop.
  * Sends it to the Groq API.
  * Stores the last 5 messages in memory.
  * Ends when user types `"quit"`.
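
One possible starting point for the memory part, as a sketch rather than a full solution: a `deque` with `maxlen=5` automatically forgets the oldest message. To keep the example runnable offline, the model call is replaced by a stand-in `echo_reply` function, and the user input is scripted. `run_chat` and its parameters are invented names for this illustration.

```python
from collections import deque


def run_chat(get_reply, read_input=input, write_output=print, memory_size=5):
    """Chat loop that remembers only the last `memory_size` messages."""
    memory = deque(maxlen=memory_size)  # oldest entries drop off automatically
    while True:
        user_text = read_input("You: ").strip()
        if user_text.lower() == "quit":
            break
        memory.append({"role": "user", "content": user_text})
        reply = get_reply(list(memory))
        memory.append({"role": "assistant", "content": reply})
        write_output(f"Bot: {reply}")
    return list(memory)


def echo_reply(messages):
    """Stand-in for the API call: reports how many messages it was given."""
    return f"I received {len(messages)} message(s)."


# Scripted input so the example runs without a keyboard or an API key.
scripted = iter(["hello", "what is STT?", "and TTS?", "quit"])
final_memory = run_chat(echo_reply, read_input=lambda prompt: next(scripted))
print(len(final_memory))  # capped at 5 even though 6 messages were exchanged
```

To talk to a real model, `get_reply` could wrap the `client.chat.completions.create(...)` call shown earlier, passing the remembered messages and returning `response.choices[0].message.content`.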

### 📝 Assignment 4: Preprocessing Function

* Write a function to clean user input:

  * Lowercase text.
  * Remove punctuation.
  * Strip extra spaces.

Test with: `"  HELLo!!!  How ARE you?? "`
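
A minimal sketch of one way to do this with only the standard library. It assumes "punctuation" means the ASCII characters in `string.punctuation`; the function name `clean_text` is just a suggestion.

```python
import string


def clean_text(text):
    """Lowercase, remove ASCII punctuation, and collapse extra spaces."""
    lowered = text.lower()
    # str.translate with a deletion table removes every punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(no_punct.split())


print(clean_text("  HELLo!!!  How ARE you?? "))  # hello how are you
```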


### 📝 Assignment 5: Text Preprocessing

* Write a function that:

    * Converts text to lowercase.
    * Removes punctuation & numbers.
    * Removes stopwords (`the, is, and...`).
    * Applies stemming or lemmatization.
    * Removes words shorter than 3 characters.
    * Keeps only nouns, verbs, and adjectives (using POS tagging).
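
Here is a partial sketch covering only the steps that need nothing beyond the standard library: lowercasing, removing punctuation and numbers, stopword removal, and the length filter. Stemming or lemmatization and POS tagging usually come from a library such as NLTK or spaCy, so they are left as an optional `normalize` hook. The stopword list is a tiny made-up sample, and the regex assumes plain English (ASCII) text.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "on", "it", "are", "was"}


def preprocess(text, normalize=None, min_length=3):
    """Return cleaned tokens; `normalize` can be a stemmer or lemmatizer function."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter or whitespace
    # (punctuation, digits) with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = [t for t in letters_only.split() if t not in STOPWORDS]
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return [t for t in tokens if len(t) >= min_length]


print(preprocess("The 3 cats are sleeping on the mat, and it is 9 PM!"))
# ['cats', 'sleeping', 'mat']
```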

### 📝 Assignment 6: Reflection

* Answer in 2–3 sentences:

    * Why is context memory important in chatbots?
    * Why should beginners always check **API limits and pricing**?

**Context memory importance in chatbots**  
Context memory lets a chatbot stay coherent and useful—it can resolve pronouns ("that", "it"), avoid repeating earlier questions, and personalize guidance based on prior user intent or preferences.   
**API limits and pricing**  
Beginners must check API limits and pricing early to prevent surprise bills and throttling; understanding cost per token/request shapes prompt length, batching, retries, and helps design an efficient, sustainable workflow from day one.
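
A quick back-of-the-envelope calculation shows why pricing matters. The prices below are hypothetical placeholders, not real Groq rates; always take the numbers from the provider's pricing page. `estimate_cost_usd` is a made-up helper name.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimate cost when prices are quoted in USD per million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices: $0.10 per million input tokens, $0.20 per million output tokens.
cost = estimate_cost_usd(
    input_tokens=2_000_000,
    output_tokens=500_000,
    input_price_per_million=0.10,
    output_price_per_million=0.20,
)
print(f"${cost:.2f}")
```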

---

### **Hints:**

1) Stemming:
    - Cuts off word endings to get the “root.”
    - Very mechanical → may produce non-real words.
    - Example:
        - "studies" → "studi"
        - "running" → "run"

2) Lemmatization:
    - Smarter → uses vocabulary + grammar rules.
    - Always gives a real word (the **lemma**).
    - Example:
        - "studies" → "study"
        - "running" → "run"

3) Part-of-Speech (POS) tagging means labeling each word in a sentence with its grammatical role — like **noun, verb, adjective, adverb, pronoun, etc.**

    - Example:
        - Sentence → *“The cat is sleeping on the mat.”*

    - POS tags →
        - The → Determiner (DT)
        - cat → Noun (NN)
        - is → Verb (VBZ)
        - sleeping → Verb (VBG)
        - on → Preposition (IN)
        - the → Determiner (DT)
        - mat → Noun (NN)

    - **In short:** POS tagging helps machines understand **how words function in a sentence**, which is useful in NLP tasks like machine translation, text classification, and question answering.
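
Once words are tagged, filtering by part of speech is a simple prefix check. The sketch below uses the tags from the example sentence above and keeps nouns, verbs, and adjectives, whose Penn Treebank tags start with `NN`, `VB`, and `JJ`. In practice the (word, tag) pairs would come from a tagger such as NLTK's `pos_tag`; here they are typed in by hand so the example runs without extra downloads.

```python
# Penn Treebank tag prefixes: NN* = nouns, VB* = verbs, JJ* = adjectives.
KEEP_PREFIXES = ("NN", "VB", "JJ")


def keep_content_words(tagged_tokens):
    """Keep words whose POS tag marks a noun, verb, or adjective."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEEP_PREFIXES)]


# Hand-written tags, copied from the example sentence above.
tagged = [
    ("The", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sleeping", "VBG"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]
print(keep_content_words(tagged))  # ['cat', 'is', 'sleeping', 'mat']
```

Note that "is" survives because it is tagged as a verb; combining this filter with stopword removal would drop it.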


---

### ✅ Recap

This week you learned:

* **LLMs**: Types, uses, must-knows.
* **STT & TTS**: How they connect with LLMs.
* **APIs**: Connecting to LLMs with Groq.
* **Chatbot**: Laying the foundation for your first chatbot.