## **Conversational AI Concepts & Model Pipelines**

The agenda for Week 2:

- Understand LLMs, STT, TTS models and their roles.

- Know how to connect to LLMs with APIs (Groq as example).

- Use Python (requests + JSON) for API interaction.

- Start building a basic chatbot with memory and preprocessing.

---

## Large Language Models (LLMs)

---

### **Question 1**: What is an LLM?

👉 It’s like a super-smart text predictor that can read, understand, and generate human-like sentences.

You give it some words → it guesses the next words in a way that makes sense.

For example:

1) You ask a question → it gives you an answer.

2) You write a sentence → it can complete it.

3) You give it a topic → it can write an essay, code, or even a story.

So, it's a type of AI trained on huge amounts of text data to generate or understand text.
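To make the "text predictor" idea concrete, here is a toy sketch that counts which word follows which in a tiny corpus and predicts the most frequent follower. It is only an illustration: the corpus and the function names `train_bigrams` and `predict_next` are made up for this example, and real LLMs use neural networks over tokens rather than simple counts.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Tiny made-up corpus, just for illustration.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```

The same loop of "look at the context, pick a likely next word, repeat" is what lets a model complete sentences or answer questions, only at a vastly larger scale.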

---

### Types of LLMs

1. Encoder-only models (e.g., BERT)

    - Best for understanding text (classification, sentiment analysis, embeddings).

    - Not good at generating text.

2. Decoder-only models (e.g., GPT, LLaMA, Mistral)

    - Best for text generation (chatbots, writing, summarization).

    - What we use in chatbots.

3. Encoder-decoder models (e.g., T5, BART)

    - Good at transforming text (translation, summarization, Q&A).

### Must-Knows about LLMs

- They don’t “think” like humans → They predict text based on training.

- Garbage in → garbage out: Poor prompts = poor answers.

- Token limits: Models can only “see” a limited number of tokens (words or word pieces) at a time, known as the context window.

- Biases: Trained on internet text → may reflect biases/errors.
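The token-limit point can be sketched in code. The example below keeps only the most recent messages that fit a token budget. The "about 4 characters per token" rule in `estimate_tokens` is a rough assumption for illustration, not how any particular model counts tokens, and both function names are hypothetical.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)

def fit_to_budget(messages, max_tokens):
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "a" * 40},       # about 10 tokens
    {"role": "assistant", "content": "b" * 40},  # about 10 tokens
    {"role": "user", "content": "c" * 40},       # about 10 tokens
]
print(len(fit_to_budget(history, max_tokens=25)))  # keeps the 2 newest messages
```

For accurate counts you would use the tokenizer that belongs to the model you are calling.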

### **Quick Question**: 

1. Why might a chatbot built on BERT (encoder-only) struggle to answer open-ended questions?

- BERT stands for Bidirectional Encoder Representations from Transformers. It is an encoder-only transformer model: every word in a sentence attends to the words before and after it, which gives a context-aware representation of the whole input.

- Because it is built to understand text rather than to produce it word by word, BERT has no natural way to write a free-form answer to an open-ended question. It is used instead for tasks like text classification, named entity recognition, and extractive question answering. For example, given the sentence "I love ice cream!", BERT reads the entire sentence, relates each word to the others, and predicts the overall sentiment.
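The difference between "looking both ways" and "looking only backwards" can be shown with attention masks. This is a simplified sketch: a `1` means a position may attend to another position, and the helper names are made up for illustration. Encoder-style models such as BERT use the full mask, while decoder-style models such as GPT use the causal one so they can generate text left to right.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["I", "love", "ice", "cream"]
for name, mask in [("encoder", bidirectional_mask(4)), ("decoder", causal_mask(4))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<6}{row}")
```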

---

## Speech-to-Text (STT) 

---

### **Question 2**: What is STT?

👉 It listens to your voice and turns it into written text.

- Converts **audio → text**.
- Enables voice input for conversational AI.
- Think of it as the **ears** of the chatbot.

**Popular STT Models**:

1) **Whisper (OpenAI)** – strong at multilingual speech recognition.
2) **Google Speech-to-Text API** – widely used, real-time transcription.
3) **Vosk** – lightweight, offline speech recognition.

**Common Usages**

1) Voice assistants (Alexa, Siri, Google Assistant).
2) Automated captions in meetings or lectures.
3) Voice-enabled customer support.

---

### Must-Knows about STT

- Accuracy depends on **noise, accents, clarity of speech**.

- Some models need **internet connection** (API-based), others run **offline**.

- Preprocessing audio (noise reduction) improves results.
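As a rough illustration of audio preprocessing, the sketch below normalizes loudness and then silences very quiet samples. It assumes the audio is already a list of floats between -1.0 and 1.0, and the thresholds are arbitrary. A noise gate like this is only a crude stand-in for real noise reduction, which usually relies on dedicated signal-processing libraries.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    return [s * target_peak / peak for s in samples]

def noise_gate(samples, threshold=0.05):
    """Zero out samples quieter than `threshold` (a crude noise gate)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

# Made-up samples: quiet background values mixed with louder speech values.
raw = [0.01, -0.02, 0.30, -0.45, 0.02, 0.15]
cleaned = noise_gate(peak_normalize(raw))
print(cleaned)
```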


### **Quick Questions**: 

2. Why do you think meeting transcription apps like Zoom or Google Meet struggle when multiple people talk at once?

- Speech-to-text models used in apps like Zoom or Google Meet struggle in this case because automatic speech recognition (ASR) systems are typically designed to process one speaker’s voice at a time. When multiple people speak simultaneously, their voices overlap, creating what’s known as the "cocktail party problem". This makes it harder for the model to separate and identify each individual’s speech, leading to reduced accuracy in transcription.

---

## Text-to-Speech (TTS) 

---

### **Question 3**: What is TTS?

👉 It takes written text and speaks it out loud in a human-like voice.

- Converts **text → audio (speech)**.
- Think of it as the **mouth** of the chatbot.
- Makes AI “speak” naturally.

**Popular TTS Models**:

1) **Google TTS** – supports many languages and voices.
2) **Amazon Polly** – lifelike voice synthesis with customization.
3) **ElevenLabs** – cutting-edge, realistic voice cloning.

**Common Usages**

1) Screen readers for visually impaired users.
2) AI chatbots with voice output.
3) Audiobooks or podcast generation.

---

### Must-Knows about TTS

- Some voices sound robotic; others use **neural TTS** for natural tones.

- Latency matters → If too slow, conversation feels unnatural.

- Some TTS services allow **custom voices**.
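One common way to reduce perceived latency is to split a long reply into sentences and start speaking the first one while the rest is still being synthesized. The sketch below shows only the splitting step. The regex is a simple assumption that will mis-split abbreviations like "e.g.", and the TTS call itself is left as a comment because it depends on the service you use.

```python
import re

def split_into_sentences(text):
    """Split on ., ! or ? followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

reply = "Great question! Photosynthesis turns light into energy. Want an example?"
for sentence in split_into_sentences(reply):
    # In a real system each sentence would be sent to the TTS engine here.
    print(sentence)
```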

### **Quick Questions**: 

3. If you were designing a voice-based AI tutor, what qualities would you want in its TTS voice (tone, speed, clarity, etc.)?

- If I were designing a voice-based AI tutor, I would focus mainly on accent and tone, because tone strongly affects whether the voice sounds human or machine-like. Speed and clarity matter as well: the voice should be slow and clear enough for a learner to follow.

---

## Using APIs for LLMs with Groq

In [13]:
# !pip install groq python-dotenv


In [14]:
from groq import Groq
from dotenv import load_dotenv
import os

load_dotenv()  # load GROQ_API_KEY from a local .env file
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
client = Groq(api_key=GROQ_API_KEY)

# Send one user message and cap the reply at 200 tokens.
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello! Tell me about Imran Khan?"}],
    max_tokens=200,
    temperature=0.7,
)

# The reply text lives in the first choice's message.
print(response.choices[0].message.content)


Imran Khan is a Pakistani politician, former cricketer, and philanthropist who served as the 22nd Prime Minister of Pakistan from August 2018 to April 2022. Here are some key points about his life and career:

**Early Life and Education**

Imran Ahmed Khan Niazi was born on October 5, 1952, in Lahore, Punjab, Pakistan. His father, Ikramullah Khan Niazi, was a civil engineer, and his mother, Shaukat Khanum, was a housewife. Imran Khan studied at Aitchison College in Lahore and later graduated from Keble College, Oxford, with a degree in Philosophy, Politics, and Economics.

**Cricketer and Philanthropist**

Imran Khan is widely regarded as one of the greatest cricketers of all time. He played for the Pakistani national team from 1971 to 1992 and led the team to victory in the 1992 Cricket World
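The agenda mentions `requests` + JSON, while the cell above uses the `groq` SDK. Below is a hedged sketch of the same call as a plain HTTP request. It assumes Groq's OpenAI-compatible endpoint URL and the usual Bearer-token header, so check the official API reference before relying on either. The helper names `build_request` and `ask_groq` are made up for this example, and only the payload is printed here; no network call is made until `ask_groq` is invoked.

```python
import json
import os

# Assumed endpoint (OpenAI-compatible); verify against the Groq documentation.
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_request(api_key, user_message, model="llama-3.1-8b-instant"):
    """Return the headers and JSON body for one chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 200,
        "temperature": 0.7,
    }
    return headers, payload

def ask_groq(user_message):
    """Send the request and return the reply text (requires network access)."""
    import requests  # imported here so the rest of the sketch runs without it
    headers, payload = build_request(os.getenv("GROQ_API_KEY"), user_message)
    response = requests.post(GROQ_URL, headers=headers,
                             data=json.dumps(payload), timeout=30)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

headers, payload = build_request("YOUR_API_KEY", "Hello!")
print(json.dumps(payload, indent=2))
```

Seeing the raw payload makes it clear that the SDK is a thin wrapper: the model name, the message list, and the sampling settings are all just fields in a JSON body.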


---

## Assignments

### Assignment 1: LLM Understanding

* Write a short note (3–4 sentences) explaining the difference between **encoder-only, decoder-only, and encoder-decoder LLMs**.
* Give one example usage of each.


**Encoder-only:**
Encoder-only models use the encoder part of the transformer architecture. They are designed to understand the input, not to generate new text. They are well suited to tasks where the model has to make sense of a sentence or document, like text classification or information extraction.
**Use Cases:**
- Sentiment Analysis
- Text Classification
- Named Entity Recognition
- Extractive Question Answering

**Decoder-Only Models:**
These models use only the decoder part of the transformer architecture. They are used for generating text by predicting the next word in a sequence, which makes them well suited to tasks like writing stories or auto-completing sentences.
**Use Cases:**
- Text Generation
- Language Modeling

**Encoder-Decoder Models:**
These models use both the encoder and decoder parts of the transformer. The encoder first understands the input, and the decoder then generates a related output. This is useful for tasks where you need to transform the input into something else, like translating a sentence into another language or summarizing text.
**Use Cases:**
- Question Answering
- Text Summarization
- Machine Translation
 

### Assignment 2: STT/TTS Exploration

* Find **one STT model** and **one TTS model** (other than Whisper/Google).
* Write down:

  * What it does.
  * One possible application.

**STT Model:**


Granite Speech 3.3 is a speech-to-text (STT) model developed by IBM for enterprise use. With 8 billion parameters, it is among the largest open-source STT models. It is optimized for languages widely used in business (English, French, German, and Spanish) and also supports English-to-Japanese and English-to-Mandarin through built-in, configurable speech translation. This makes it a good fit for applications that need multilingual and translation support.

**For example**, Granite Speech 3.3 can be configured to translate the speech in an English instructional video into Mandarin text.

**TTS Model:**

Higgs Audio V2 is a large TTS model developed by BosonAI. At the time of writing, it is the top trending text-to-speech model on Hugging Face. It is an open-source model built on top of Llama 3.2 3B and pre-trained on over 10 million hours of audio data. It offers expressive audio generation and multilingual voice cloning.

**For example**, Higgs Audio V2 scores well with listeners on conveying emotion and on question intonation, which suits applications that need expressive speech.


### Assignment 3: Build a Chatbot with Memory

* Write a Python program that:

  * Takes user input in a loop.
  * Sends it to Groq API.
  * Stores the last 5 messages in memory.
  * Ends when user types `"quit"`.

In [20]:
from groq import Groq
import os
from dotenv import load_dotenv

load_dotenv()  # read GROQ_API_KEY from a local .env file
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
client = Groq(api_key=GROQ_API_KEY)

MAX_HISTORY = 5  # the assignment asks for the last 5 messages
messages = []
while True:
    user_input = input("Type your message: ")
    print(f"You:{user_input}")
    if user_input.strip().lower() == "quit":
        break
    messages.append({"role": "user", "content": user_input})
    messages = messages[-MAX_HISTORY:]  # trim the history before sending
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"Bot: {reply}")





You:Hello, kesy ho?
Bot: Namaste. Main theek hoon, dhanyavad. Aap kaise ho?
You:Main bhi theek hun bhai
Bot: Achha hai! Kya karna hai? Chahte hain koi baat karne mein madad karna?
You:kuch bhi nahi bas, testing kr rha hun 
Bot: Testing karna bahut zaroori hai. Main thoda sa test kar raha hoon bhi, aapka conversation ka response dekhna. Agar koi problem to kuch nahi hai, aapko kuch bhi pata nahi chal raha hai?
You:quit
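The assignment asks for the last 5 messages. One way to enforce a fixed-size window without manual slicing is `collections.deque` with `maxlen`, sketched below with fake messages instead of real API calls. The `remember` helper and the sample contents are made up for illustration.

```python
from collections import deque

MAX_MESSAGES = 5  # matches the assignment's "last 5 messages"

memory = deque(maxlen=MAX_MESSAGES)

def remember(role, content):
    """Append a message; the deque drops the oldest one when full."""
    memory.append({"role": role, "content": content})

# Simulate four user/assistant exchanges (8 messages in total).
for turn in range(1, 5):
    remember("user", f"question {turn}")
    remember("assistant", f"answer {turn}")

# The API expects a plain list, so convert before sending.
messages = list(memory)
print([m["content"] for m in messages])
```

A trade-off to keep in mind: a small window saves tokens, but the bot forgets anything said earlier than the window.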


### Assignment 4: Preprocessing Function

* Write a function to clean user input:

  * Lowercase text.
  * Remove punctuation.
  * Strip extra spaces.

Test with: `"  HELLo!!!  How ARE you?? "`


In [16]:
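def clean_text(text):
    text = text.lower()   # lowercase
    text = text.strip()   # trim leading/trailing spaces
    # keep only letters, digits, and whitespace (drops punctuation)
    text = ''.join(char for char in text if char.isalnum() or char.isspace())
    text = ' '.join(text.split())  # collapse repeated spaces
    return text

clean_text("  HELLo!!!  How ARE you?? ")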


'hello how are you'
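An alternative sketch of the same cleaning step uses `str.translate` with `string.punctuation`, which avoids the character-by-character loop. Note one difference: this version removes only ASCII punctuation, while the `isalnum` approach also drops other symbols. The function name `clean_text_translate` is made up to keep it distinct from the cell above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_translate(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return " ".join(text.split())

print(clean_text_translate("  HELLo!!!  How ARE you?? "))  # hello how are you
```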

### Assignment 5: Text Preprocessing

* Write a function that:

    * Converts text to lowercase.
    * Removes punctuation & numbers.
    * Removes stopwords (`the, is, and...`).
    * Applies stemming or lemmatization.
    * Removes words shorter than 3 characters.
    * Keeps only nouns, verbs, and adjectives (using POS tagging).

In [17]:
# pip install nltk spacy


In [19]:
import re
import nltk
nltk.data.path.append('/root/nltk_data')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')

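def preprocess_text(text):
    # Lowercase, then keep only letters and whitespace (drops punctuation and digits).
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    # Remove English stopwords such as "the", "is", "and".
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize; without a POS argument every word is treated as a noun,
    # so verb forms like "running" are left unchanged.
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Drop words shorter than 3 characters.
    tokens = [word for word in tokens if len(word) >= 3]
    # Keep only nouns, verbs, and adjectives (Penn Treebank tags).
    tagged_words = pos_tag(tokens)
    allowed_tags = ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS']
    filtered_words = [word for word, tag in tagged_words if tag in allowed_tags]
    return filtered_words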


print(preprocess_text("  HELLo!!!  How ARE you?? I am running fast and coding happily. "))

['hello', 'running', 'coding']
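In the output above, "running" stays "running" because `WordNetLemmatizer.lemmatize` treats words as nouns unless told otherwise. A possible refinement, sketched here, is to map each Penn Treebank tag from `pos_tag` to a WordNet POS letter and pass it to the lemmatizer. The helper name `penn_to_wordnet` is made up for this example, and the NLTK call is shown only as a comment since it needs the WordNet data to run.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for everything else

for tag in ["NN", "VBG", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))

# Possible use with NLTK (not run here, requires the WordNet data):
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

For this to help, tagging would need to happen before lemmatization, ideally on the original sentence so the tagger still sees the surrounding words.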


### Assignment 6: Reflection

* Answer in 2–3 sentences:

    * Why is context memory important in chatbots?
    * Why should beginners always check **API limits and pricing**?



**Context memory in chatbots** is important because it enables coherent, multi-turn conversations by remembering past interactions. This leads to more personalized, efficient, and satisfying user experiences. Without context memory, chatbots act as **stateless** systems, treating each prompt as a new conversation.

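**Rate limiting** is essential for APIs because all APIs operate on finite resources. It keeps the service available for as many users as possible by preventing excessive resource usage. Beginners should check the limits up front because requests that exceed them are typically rejected until the limit resets, which can break an app that was never designed to wait and retry.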
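A common way to cope with rate limits on the client side is to retry with exponential backoff. The sketch below is a minimal illustration: `RuntimeError` stands in for whatever rate-limit error a real client raises, `flaky_call` is a fake API call, and the `sleep` parameter is injectable so the demo does not actually wait.

```python
import time

def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`; on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RuntimeError:  # stand-in for a provider's rate-limit error
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo with a fake call that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit hit")
    return "ok"

delays = []
print(call_with_backoff(flaky_call, sleep=delays.append))  # ok
print(delays)  # [1.0, 2.0]
```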

**Why API Pricing is Important:**

1. **Cost Control** – Prevents unexpected high bills by knowing usage charges in advance.
2. **Efficient Design** – Encourages minimizing API calls/tokens to save money.
3. **Scaling Decisions** – Helps plan affordable solutions as usage grows.
4. **Client Billing** – Allows accurate cost estimates for clients or projects.
5. **Provider Comparison** – Enables choosing the best value API for performance vs cost.
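Cost control becomes easier with a quick estimate before building. The sketch below multiplies token counts by per-million-token prices. The prices and the request volume are hypothetical placeholders, not real Groq pricing, so substitute the numbers from your provider's pricing page.

```python
def estimate_cost(input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Return the estimated cost in dollars for one request."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

# Hypothetical prices, NOT real Groq pricing: $0.05 in, $0.08 out per million tokens.
per_request = estimate_cost(500, 200, 0.05, 0.08)
monthly = per_request * 10_000  # assuming 10,000 requests per month
print(f"per request: ${per_request:.6f}, per month: ${monthly:.2f}")
```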

---


### **Hints:**

1) Stemming:
    - Cuts off word endings to get the “root.”
    - Very mechanical → may produce non-real words.
    - Example:
        - "studies" → "studi"
        - "running" → "run"

2) Lemmatization:
    - Smarter → uses vocabulary + grammar rules.
    - Always gives a real word (the **lemma**).
    - Example:
        - "studies" → "study"
        - "running" → "run"

3) Part-of-Speech (POS) tagging means labeling each word in a sentence with its grammatical role — like **noun, verb, adjective, adverb, pronoun, etc.**

    - Example:
        - Sentence → *“The cat is sleeping on the mat.”*

    - POS tags →
        - The → Determiner (DT)
        - cat → Noun (NN)
        - is → Verb (VBZ)
        - sleeping → Verb (VBG)
        - on → Preposition (IN)
        - the → Determiner (DT)
        - mat → Noun (NN)

    - **In short:** POS tagging helps machines understand **how words function in a sentence**, which is useful in NLP tasks like machine translation, text classification, and question answering.
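The difference between stemming and lemmatization in hints 1 and 2 can be sketched with two toy functions. These are deliberately simplified: `toy_stem` applies a couple of suffix rules and is not the Porter algorithm, and `TOY_LEMMAS` is a tiny hand-written dictionary standing in for a real vocabulary such as WordNet.

```python
def toy_stem(word):
    """Strip a few common suffixes mechanically (not the Porter algorithm)."""
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return word

# A lemmatizer looks words up in a vocabulary; this tiny dict stands in for one.
TOY_LEMMAS = {"studies": "study", "running": "run", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "running"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer produces "studi", which is not a real word, while the dictionary lookup returns "study". That is the trade-off: stemming is fast and rule-based, lemmatization needs a vocabulary but gives real words.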


---

### Recap

This week I learned:

* **LLMs**: Types, uses, must-knows.
* **STT & TTS**: How they connect with LLMs.
* **APIs**: Connecting to LLMs with Groq.
* **Chatbot**: Built my first chatbot foundation with memory and preprocessing.