# Unit 1 Hands-on: Generative AI & NLP Fundamentals

Welcome to your interactive guide to **Generative AI**. This notebook is designed to be a step-by-step tutorial, explaining not just *how* to code, but *why* we use these tools.


## 1. Introduction & Setup

In this section, we will set up our environment. But first, let's understand the tools we are using.


### What is Hugging Face?

Hugging Face (https://huggingface.co/) is often called the "GitHub of AI". It is a massive repository where researchers and companies share their trained models, datasets, and demos.

Instead of training a model from scratch (which can cost millions of dollars for large models), we can download pretrained models such as GPT-2, BERT, or RoBERTa directly from Hugging Face and use them.


### What is the `transformers` library?

The `transformers` library is the bridge between the models on Hugging Face and your code. It provides APIs to easily download, load, and run state-of-the-art pretrained models.

It supports framework interoperability, meaning you can often move between PyTorch, TensorFlow, and JAX.


### What is `pipeline()`?

The `pipeline()` function is the library's highest-level entry point. It wraps the underlying math and processing into three steps:

1.  **Preprocessing**: Converts your raw text into numbers (Tokens & IDs) that the model can understand.
2.  **Model Inference**: The model processes the numbers and outputs predictions (logits).
3.  **Post-processing**: The raw predictions are converted back into human-readable text (labels, answers, summaries).

With just one line, `pipeline('task-name')` handles all of this for you.
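
To make the three stages concrete, here is a toy sketch in plain Python. It is not how `transformers` implements `pipeline()`; the vocabulary, the labels, and the word-counting "model" are all hypothetical stand-ins chosen only to show the shape of each step.

```python
import math

# Hypothetical toy vocabulary and labels, invented for illustration only.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Step 1: split raw text into tokens and map each token to an integer ID."""
    tokens = text.lower().split()
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

def toy_model(token_ids):
    """Step 2: stand-in for a neural network; returns one raw score (logit) per label."""
    positive = float(token_ids.count(VOCAB["love"]))
    negative = float(token_ids.count(VOCAB["hate"]))
    return [negative, positive]

def postprocess(logits):
    """Step 3: turn logits into probabilities (softmax) and pick the best label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three steps, mirroring what a single pipeline call hides."""
    return postprocess(toy_model(preprocess(text)))

result = toy_pipeline("I love this course")
print(result)
```

A real pipeline swaps each toy function for a trained tokenizer, a neural network, and task-specific decoding, but the data still flows text → IDs → logits → readable output.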


### Import Pipeline
Let's import this powerful function.


In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer




### Import Utilities
We also need `nltk` for some traditional NLP tasks and `os` for file handling.


In [2]:
import os
import nltk


### Loading the Course Material
We will define the path to our course text file (`unit 1.txt`).


In [4]:
file_path = "unit 1.txt"


Now we read the file. This text will be the 'Knowledge Base' for our tasks later.


In [5]:
try:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    print("File loaded successfully!")
except FileNotFoundError:
    # Fall back to an empty string so later cells do not raise NameError
    text = ""
    print(f"Error: '{file_path}' not found.")


File loaded successfully!


Let's look at the first 500 characters to make sure we have the right data.


In [6]:
print("--- Data Preview ---")
print(text[:500] + "...")


--- Data Preview ---
Generative AI and Its Applications: A Foundational Briefing

Executive Summary

This document provides a comprehensive overview of Generative AI, synthesizing foundational concepts, technological underpinnings, and practical applications as outlined in the course materials from PES University. Generative AI represents a transformative subset of Artificial Intelligence focused on creating novel content, a capability primarily driven by the advent of Large Language Models (LLMs). The evolution of ...


## 2. Generative AI: Dumb vs. Smart Models

Generative AI creates new content (text, images, audio), but the quality of that content depends heavily on the model's size and training.

We will compare two models:
1.  **`distilgpt2`**: A 'distilled' version. It is smaller, faster, and requires less memory, but it might be less coherent (a "Dumb" model for this comparison).
2.  **`gpt2`**: The standard version (The "Smart" model, though still small by modern standards).

**How to access a model?**
1.  Go to the Hugging Face Models page (https://huggingface.co/models).
2.  Search for a task (e.g., 'Text Generation').
3.  Pick a model (e.g., `gpt2`).
4.  Copy the model name.


### Step 1: Set a Seed

A **seed value** is used to make random results **reproducible**. When we set a seed, the random number generator starts from the same point each time, which means it will produce the **same sequence of random values**.

Try running the code multiple times using the **same seed value** and observe the output.

Now, change the seed value and run the code again. This time, the output **will change** because a different seed creates a different sequence of random numbers.
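
The same idea can be seen without any model, using Python's built-in `random` module. This is only an illustrative sketch: `sample_sequence` is a hypothetical helper, and `set_seed` from `transformers` applies the same principle to the random number generators that text generation relies on.

```python
import random

def sample_sequence(seed, n=5):
    """Draw n pseudo-random integers from a generator started at the given seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

run_a = sample_sequence(42)
run_b = sample_sequence(42)   # same seed as run_a
run_c = sample_sequence(7)    # different seed

print("Same seed, same sequence:", run_a == run_b)
print("Different seed, same sequence:", run_a == run_c)
```

The first comparison is always true because both generators start from the same state; the second seed starts the generator from a different state and therefore yields a different sequence.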


In [7]:
set_seed(42)


### Step 2: Define a Prompt
Both models will complete this sentence.


In [8]:
prompt = "Generative AI is a revolutionary technology that"


### Step 3: Fast Model (`distilgpt2`)
Let's see how the smaller model performs.


In [10]:
# Initialize the pipeline with the specific model
fast_generator = pipeline('text-generation', model='distilgpt2')

# Generate text
output_fast = fast_generator(prompt, max_length=50, num_return_sequences=1)
print(output_fast[0]['generated_text'])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that has been around since the dawn of AI. It is an AI-related project, aimed at making AI a real-time, interactive and real-time experience.

### Step 4: Standard Model (`gpt2`)
Now let's try the standard model.


In [11]:
smart_generator = pipeline('text-generation', model='gpt2')

output_smart = smart_generator(prompt, max_length=50, num_return_sequences=1)
print(output_smart[0]['generated_text'])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that harnesses the power of natural selection to create new artificial intelligence. It is capable of creating great ideas from scratch, and creates new tools that change the world for better or for worse. It is so transformative that it is considered as the next Darwin.

The AI is an important part of the evolution of the human race. The AI is the first step towards a future where the human race will be able to adapt to the new reality of the world. The AI will make the world better for all of humanity, and will make the human race stronger.

The AI is an important part of the evolution of the human race. The AI is the first step towards a future where the human race will be able to adapt to the new reality of the world. The AI will make the world better for all of humanity, and will make the human race stronger. The AI is the first step towards a future where the human race will be able to adapt to the new reality of the world. The AI is a 

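**Analysis**: Compare the two outputs. Does the standard model stay more on topic? Does the fast model drift into nonsense? Also note the warning in the logs: the pipeline's default `max_new_tokens` (256) takes precedence over `max_length=50`, so the outputs can run well past 50 tokens (the `gpt2` sample above ends up looping on the same sentences). Passing `max_new_tokens=50` instead of `max_length` caps the generated length explicitly.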
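
One rough way to put a number on "drifting" or "looping" is to measure how often a generated text repeats its own word sequences. The sketch below is an assumption of ours, not a standard metric from the course material; `repeated_ngram_fraction` is a hypothetical helper and the two sample strings are hand-written stand-ins for model output.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that already appeared earlier in the same text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    seen = set()
    repeats = 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)

# Hand-written examples: one varied sentence, one that loops like the gpt2 sample.
varied = "Generative AI creates new text images and audio from learned patterns"
looping = "The AI is the first step. The AI is the first step. The AI is the first step."

varied_score = repeated_ngram_fraction(varied)
looping_score = repeated_ngram_fraction(looping)
print(f"varied:  {varied_score:.3f}")
print(f"looping: {looping_score:.3f}")
```

You could pass `output_fast[0]['generated_text']` and `output_smart[0]['generated_text']` to the same function to compare the two models; a higher fraction suggests the model is stuck repeating itself.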


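### Step 5: BERT Model (`bert-base-uncased`)
BERT is an encoder-only model trained to fill in masked tokens, not to continue text from left to right, so expect poor results from the `text-generation` task.

In [38]:
# Use a dedicated variable name so the earlier generators are not overwritten
bert_generator = pipeline('text-generation', model='bert-base-uncased')

output_bert = bert_generator(prompt, max_length=50, num_return_sequences=1)
print(output_bert[0]['generated_text'])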

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that................................................................................................................................................................................................................................................................


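### Step 6: RoBERTa Model (`roberta-base`)
RoBERTa, like BERT, is an encoder-only model trained with masked language modeling rather than left-to-right generation.

In [16]:
# Initialize the pipeline with the specific model
roberta_generator = pipeline('text-generation', model='roberta-base')

# Generate text
output_roberta = roberta_generator(prompt, max_length=50, num_return_sequences=1)
print(output_roberta[0]['generated_text'])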


If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that


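### Step 7: BART Model (`facebook/bart-base`)
BART is an encoder-decoder model. The `text-generation` pipeline loads it as `BartForCausalLM`, and the log reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` are newly initialized, which explains the random-looking tokens in the output.

In [17]:
# Initialize the pipeline with the specific model
bart_generator = pipeline('text-generation', model='facebook/bart-base')

# Generate text
output_bart = bart_generator(prompt, max_length=50, num_return_sequences=1)
print(output_bart[0]['generated_text'])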


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that Brotherhood Aluminum Aluminum decrease Utilities Utilitiesウス sophistic Utilities Utilities UtilitiesIncreases Jer Jer Lavrov coaching Utilities Utilitiesivedived unbiased memo sophistic'd Pwr coaching memotheless block Jer Utilities Utilities Facebook Facebook sophisticthelesstheless Utilities Jer Utilities coaching Charity Yellow healthier Utilities Utilities Jer Yellow depicts Utilities Utilities squirrel vengeance Utilities Utilities"},western Jerittensogie healthier nearby Utilities Utilities Yellowittens 1963ittens Utilitieshunter memo Utilities healthier Cambridge Philosophy Cambridgetheless Philosophyived OthersPurchase Utilities healthier healthier Utilities 1929 UtilitiesittensGameplay Cambridgeacle priceless StayGameplay Utilities memo Utilities Utilities decrease healthier healthier healthier unbiased priceless healthier healthier Cambridge strainedittensPurchase attractive priceless healthierPurchase Utilities Basin Utilities

## 3. NLP Fundamentals: Under the Hood

Before any "Magic" happens, the text must be processed. The pipeline does this automatically, but let's break it down manually to understand the steps.


### 3.1 Tokenization
**Why?** Models cannot read English strings. They only understand numbers.

**What?** Tokenization breaks text into pieces (Tokens) and assigns each piece a unique ID.


In [18]:
# 1. Initialize the Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


Let's take a sample sentence.


In [19]:
sample_sentence = "Transformers revolutionized NLP."


Now we split it into tokens.


In [20]:
tokens = tokenizer.tokenize(sample_sentence)
print(f"Tokens: {tokens}")


Tokens: ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']


And finally, convert tokens to IDs.


In [21]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")


Token IDs: [41762, 364, 5854, 1143, 399, 19930, 13]
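
The `Ġ` prefix in the token list marks a token that begins with a space in the original text. The sketch below reuses the tokens and IDs printed above and reverses the split by hand. It is a simplification for this ASCII sentence only (`detokenize` is a hypothetical helper); in practice `tokenizer.decode(token_ids)` does this properly, including byte-level details that are ignored here.

```python
# Tokens and IDs copied from the notebook output above.
tokens = ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']
token_ids = [41762, 364, 5854, 1143, 399, 19930, 13]

def detokenize(pieces):
    """Join subword pieces and turn the 'Ġ' marker back into a space (simplified)."""
    return "".join(pieces).replace("Ġ", " ")

# Show which ID each piece maps to.
for tok, tid in zip(tokens, token_ids):
    print(f"{tok!r:>16} -> {tid}")

restored = detokenize(tokens)
print(restored)
```

Notice that "Transformers" is split into `Transform` + `ers`: subword tokenization lets a fixed vocabulary cover words it has never seen as a whole.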


### 3.2 POS Tagging (Part-of-Speech)
**Why?** To understand grammar. Is 'book' a noun (the object) or a verb (to book a flight)?

**What?** We label each word as Noun (NN), Verb (VB), Adjective (JJ), etc.
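
To see why context matters for the 'book' example, here is a deliberately tiny rule-based sketch. It is not how NLTK's tagger works (that one is a trained statistical model); the lexicon and the single rule below are hypothetical and only cover these two sentences.

```python
# Hypothetical mini-lexicon, invented for illustration only.
LEXICON = {"i": "PRP", "the": "DT", "a": "DT", "to": "TO",
           "want": "VBP", "read": "VBP", "flight": "NN"}

def toy_pos_tag(words):
    """Tag words with a lexicon lookup plus one context rule for the word 'book'."""
    tags = []
    for word in words:
        lowered = word.lower()
        prev_tag = tags[-1] if tags else None
        if lowered == "book":
            # After 'to' the word acts as a verb; otherwise treat it as a noun.
            tag = "VB" if prev_tag == "TO" else "NN"
        else:
            tag = LEXICON.get(lowered, "NN")  # unknown words default to noun
        tags.append(tag)
    return list(zip(words, tags))

verb_case = toy_pos_tag("I want to book a flight".split())
noun_case = toy_pos_tag("I read the book".split())
print(verb_case)
print(noun_case)
```

The same surface word receives two different tags depending on its neighbours, which is exactly the ambiguity a real tagger has to resolve.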


In [22]:
# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt', quiet=True)


True

Let's tag our sentence.


### 4.2 Question Answering

This task is **Extractive**. We provide a `context` (our text) and a `question`. The model does not write a new answer; it selects a span of the context (a start and an end position) and returns that span as the answer.


In [39]:
qa_pipeline = pipeline("question-answering", model="bert-base-uncased")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


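Let's ask about the risks mentioned in our text. Keep the warning above in mind: `bert-base-uncased` ships without a trained question-answering head (`qa_outputs` is newly initialized), so the spans it returns are essentially arbitrary. A checkpoint fine-tuned for QA, for example `distilbert-base-cased-distilled-squad`, would be needed for meaningful answers. Note also that the first question cannot be answered from this context at all.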


In [41]:
# Use a separate variable so the course text loaded earlier is not overwritten
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=context)
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")



Q: What is the fundamental innovation of the Transformer?
A: hallucinations, bias, and deepfakes

Q: What are the risks of using Generative AI?
A: hallucinations, bias, and deepfakes


In [45]:
qa_pipeline = pipeline("question-answering", model="roberta-base")


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


In [46]:
# Use a separate variable so the course text loaded earlier is not overwritten
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=context)
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")


Q: What is the fundamental innovation of the Transformer?
A: poses significant risks such as hallucinations, bias, and deepfakes

Q: What are the risks of using Generative AI?
A: poses significant risks such as hallucinations, bias, and deepfakes


In [51]:
qa_pipeline = pipeline("question-answering", model="facebook/bart-base")

Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


In [52]:
# Define and use the same variable name (the original defined TEXT but passed text)
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."

questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=context)
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")


Q: What is the fundamental innovation of the Transformer?
A: deepfakes

Q: What are the risks of using Generative AI?
A: deepfakes


### 4.3 Masked Language Modeling (The 'Fill-in-the-Blank' Game)

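This is the core training objective of BERT. We hide a token and ask the model to predict it from the surrounding context. The placeholder depends on the tokenizer: BERT uses `[MASK]`, while RoBERTa and BART use `<mask>`.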
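
Because the placeholder differs between models, it helps to build the masked sentence from a template instead of hard-coding the token. The sketch below uses a hand-written mapping for the three checkpoints tried in this notebook; `MASK_TOKENS` and `build_masked_sentence` are hypothetical names. With a loaded pipeline you would more likely read the token from `mask_filler.tokenizer.mask_token`.

```python
# Mask tokens for the checkpoints used in this notebook; other models may differ.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}

def build_masked_sentence(template, model_name):
    """Fill the {mask} slot of a template with the model's own mask token."""
    return template.format(mask=MASK_TOKENS[model_name])

template = "The goal of Generative AI is to create new {mask}."

sentences = {name: build_masked_sentence(template, name) for name in MASK_TOKENS}
for name, sentence in sentences.items():
    print(f"{name}: {sentence}")
```

Passing a sentence with the wrong placeholder typically makes the `fill-mask` pipeline raise an error, since it cannot find a mask token in the input.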


In [29]:
mask_filler = pipeline("fill-mask", model="bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Let's see what the model thinks Generative AI creates.


In [30]:
masked_sentence = "The goal of Generative AI is to create new [MASK]."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")


applications: 0.06
ideas: 0.05
problems: 0.05
systems: 0.04
information: 0.03


In [54]:
mask_filler = pipeline("fill-mask", model="roberta-base")

Device set to use cpu


In [57]:
masked_sentence = "The goal of Generative AI is to create new <mask>."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")

 AI: 0.07
 agents: 0.06
 intelligence: 0.05
 applications: 0.04
 insights: 0.04


In [58]:
mask_filler = pipeline("fill-mask", model="facebook/bart-base")

Device set to use cpu


In [59]:
masked_sentence = "The goal of Generative AI is to create new <mask>."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")

 ways: 0.16
 AI: 0.10
 and: 0.05
 models: 0.04
,: 0.03
