# Setup

In [None]:
!pip install -qU bitsandbytes

This command installs or upgrades the "bitsandbytes" Python package silently using pip (Python's package manager). Let's break down the components:

- `!` tells the notebook (Jupyter, Colab, or similar) to run the rest of the line as a shell command
- `pip install` tells pip to install a package
- `-q` means "quiet mode" - it suppresses most output messages
- `-U` means "upgrade" - it updates to the latest version if already installed
- `bitsandbytes` is a library that provides low-precision building blocks for PyTorch, most notably 8-bit and 4-bit quantization of model weights and 8-bit optimizers. These let large models use much less GPU memory.
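
Because `-q` hides most of pip's output, it can be useful to confirm what actually got installed. A minimal sketch using only the standard library; the helper name `installed_version` is made up for this illustration:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print("bitsandbytes:", installed_version("bitsandbytes"))
```

If the call prints `None`, the install step did not succeed in the current environment.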

In [None]:
# Standard library imports
import os
import warnings

# Core ML framework
import torch

# Torch backend configurations
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

# Authentication and API clients
from huggingface_hub import login
from google.colab import userdata

# Transformer-specific imports
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextStreamer
)

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

warnings.filterwarnings("ignore")


This code sets up the foundational components needed for working with large language models, particularly using the Hugging Face ecosystem. Let me break it down into logical sections:

First, the code imports core Python libraries - `os` for interacting with the operating system and `warnings` to manage warning messages. The warnings are then fully suppressed using `warnings.filterwarnings("ignore")`.

The central machine learning framework PyTorch is imported as `torch`. Two of PyTorch's scaled dot-product attention (SDP) backends are then disabled:
- The memory-efficient attention kernel (`enable_mem_efficient_sdp`)
- The FlashAttention kernel (`enable_flash_sdp`)
These are fused attention implementations that can speed up transformer models and reduce memory use. With both turned off, PyTorch falls back to another available backend, typically the plain "math" implementation, likely for compatibility or stability reasons.
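
To make "scaled dot-product attention" concrete, here is a naive reference computation in plain Python. It is only a sketch of the formula softmax(QKᵀ/√d)·V on small lists. It is not how PyTorch's fused kernels are implemented, and it omits the causal mask that language models apply:

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Naive attention over lists of vectors: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Softmax, subtracting the max for numerical stability.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
print(scaled_dot_product_attention(q, k, v))
```

The query matches the first key more closely, so the output leans toward the first value vector. The fused backends compute the same quantity with far less memory traffic.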

For handling model access and authentication, the code imports:
- `login` from `huggingface_hub` to authenticate with the Hugging Face platform
- `userdata` from Google Colab to manage user credentials in the Colab environment

The transformer-related imports bring in essential components for working with language models:
- `AutoModelForCausalLM`: Loads pre-trained causal language models (models that predict the next word given previous words)
- `AutoTokenizer`: Converts text into tokens that the model can process
- `BitsAndBytesConfig`: Configures quantization settings for reducing model memory usage
- `TextStreamer`: Manages text generation output in a streaming fashion

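Finally, the code sets the environment variable `HF_HUB_ENABLE_HF_TRANSFER=1`. This asks `huggingface_hub` to download files through the optional Rust-based `hf_transfer` package, which can be considerably faster on high-bandwidth connections. Two caveats apply: the `hf_transfer` package has to be installed for this to work, and `huggingface_hub` typically reads the variable when it is first imported, so setting it before the imports is the safer order.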
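
One defensive pattern is to enable the variable only when the `hf_transfer` package can actually be found. The helper below is a hypothetical sketch, not part of `huggingface_hub`. It assumes it is called before `huggingface_hub` is imported:

```python
import importlib.util
import os


def enable_hf_transfer_if_available():
    """Set HF_HUB_ENABLE_HF_TRANSFER only when the hf_transfer package is importable."""
    available = importlib.util.find_spec("hf_transfer") is not None
    if available:
        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    else:
        # Leaving the flag set without the package installed can make downloads fail.
        os.environ.pop("HF_HUB_ENABLE_HF_TRANSFER", None)
    return available


print("hf_transfer enabled:", enable_hf_transfer_if_available())
```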


In [None]:
login(token = userdata.get('HF_TOKEN') )

This line logs into the Hugging Face Hub service, which acts as a central repository for machine learning models and datasets. Let's understand how it works:

The `login()` function establishes your credentials with Hugging Face's servers. Think of it like signing into your email - you need to prove you're authorized to access the service.

The authentication happens through a token, which is a special string of characters that acts as your digital key. Instead of hardcoding this sensitive token directly in the code (which would be unsafe), the code retrieves it from Google Colab's secure storage using `userdata.get('HF_TOKEN')`.

This approach is similar to how you might keep your house key in a secure lockbox rather than leaving it under the doormat. The token stays protected in Colab's system until it's needed, and then it's passed directly to the login function.

When this line runs successfully, you'll have authenticated access to Hugging Face's resources, allowing you to download models, push updates, or access private and gated repositories - depending on the permissions of your token. `login()` also saves the token locally, so it stays available for the rest of the Colab runtime without logging in again.
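
The notebook assumes it runs inside Colab. A small sketch of a more portable lookup, which falls back to an `HF_TOKEN` environment variable when `google.colab` is not importable; the helper name `get_hf_token` is made up for this example:

```python
import os


def get_hf_token():
    """Look up the Hugging Face token from Colab secrets, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get("HF_TOKEN")
    return userdata.get("HF_TOKEN")
```

The result could then be passed on as `login(token=get_hf_token())`.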


In [None]:
class CFG:
    device = 'cuda'
    model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    max_tokens = 5000
    temperature = 0.1
    top_k = 5
    top_p = 0.9
    dtype = torch.bfloat16

This code defines a configuration class called `CFG` that organizes all the key settings for working with a large language model. Let me explain each setting and why it matters:

The `device = 'cuda'` tells the system to use your GPU (Graphics Processing Unit) for computations. GPUs are specialized processors that can handle many calculations in parallel, making them much faster than regular CPUs for machine learning tasks. It's like having a specialized calculator designed specifically for complex math instead of using a general-purpose one.

`model_name` specifies which pre-trained model you'll be using - in this case, DeepSeek-R1-Distill-Qwen-7B, a roughly 7 billion parameter model built on Alibaba's Qwen architecture and fine-tuned on reasoning examples generated by the much larger DeepSeek-R1. Think of distillation like creating a concentrated version of a larger model - the small "student" learns to imitate the large "teacher", capturing much of its knowledge in a smaller, more efficient package.

The next three parameters control how the model generates text:
- `max_tokens = 5000` sets an upper limit on how many new tokens the model may generate (it is passed to `generate` as `max_new_tokens`), like setting a maximum word count for an essay. Tokens are word pieces, so 5000 tokens is somewhat fewer than 5000 words
- `temperature = 0.1` controls how "creative" or random the model's outputs are. A low value like 0.1 sharpens the probability distribution so the model almost always picks its most likely token - it's like telling someone to stick closely to the facts rather than being creative
- `top_k = 5` and `top_p = 0.9` work together to filter the model's word choices. `top_k` keeps only the 5 most likely next tokens, and `top_p` (nucleus sampling) then keeps the smallest group of those whose combined probability reaches 90%. The model samples only from what survives both filters
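
The interaction of these three settings is easier to see on a toy example. The sketch below applies temperature, then top-k, then top-p to a made-up set of logits. It is a simplified illustration in plain Python, not the Transformers implementation, and the token names and logit values are invented:

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=5, top_p=0.9):
    """Return the renormalized token probabilities left after temperature, top-k, top-p."""
    # Temperature: divide the logits, so values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = ranked[0][1]
    weights = [(tok, math.exp(score - peak)) for tok, score in ranked]
    total = sum(w for _, w in weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, w in weights:
        kept.append((tok, w / total))
        cumulative += w / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
print(filter_distribution(toy_logits, temperature=1.0))
print(filter_distribution(toy_logits, temperature=0.1))
```

At temperature 1.0 three candidates survive the filters. At temperature 0.1 only `"a"` remains, which shows why such a low temperature makes sampling nearly deterministic.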

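Finally, `dtype = torch.bfloat16` specifies how numbers should be stored in computer memory. BFloat16 is a 16-bit format that uses half the memory of standard 32-bit floats. It keeps the same exponent range as float32 but has fewer mantissa bits, so it trades some precision for a much lower risk of overflow than float16.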
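
A bfloat16 value is essentially the top 16 bits of a float32. The sketch below imitates that with the standard library by truncating the low bits. This is a simplification: real conversions usually round to nearest rather than truncate, and the helper name `to_bfloat16` is made up here:

```python
import struct


def to_bfloat16(x):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    truncated = bits & 0xFFFF0000  # drop the low 16 mantissa bits (no rounding)
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


print(to_bfloat16(3.14159265))  # 3.140625
print(to_bfloat16(1e38))        # still finite: same exponent range as float32
```

Only two or three significant decimal digits survive, but very large and very small magnitudes remain representable.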

These settings work together to balance speed, memory usage, and output quality. The low temperature and carefully chosen top_k/top_p values suggest this configuration is designed for tasks where consistent, focused responses are more important than creative variations.

# Functions

In [None]:
def generate(prompt, system = None):

    messages = []

    if system:
        messages.append({"role": "system", "content": system})

    messages.append({"role": "user", "content": prompt})


    tokenizer_output = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
    model_input_ids = tokenizer_output.input_ids.to(CFG.device)
    model_attention_mask = tokenizer_output.attention_mask.to(CFG.device)

    outputs = model.generate(
        model_input_ids,
        attention_mask=model_attention_mask,
        streamer=streamer,
        max_new_tokens=CFG.max_tokens,
        do_sample=bool(CFG.temperature),
        temperature=CFG.temperature,
        top_k=CFG.top_k,
        top_p=CFG.top_p,
    )

    answer = tokenizer.batch_decode(outputs, skip_special_tokens = False)

    return answer

This `generate` function handles the process of getting responses from a large language model. Let me walk you through how it works, step by step.

The function takes two parameters: `prompt` (the text you want the model to respond to) and `system` (optional instructions that shape how the model should behave). Think of `prompt` as your question and `system` as giving the model its personality or role.

First, the function creates a list called `messages`. If there's a system instruction, it gets added first as a dictionary with two parts: the "role" (marked as "system") and the "content" (the actual instruction). Then, your prompt gets added in the same way but with the role "user". This format follows a conversation structure, like setting up the context before asking a question.

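Next comes the crucial step of preparing your text for the model. The `tokenizer.apply_chat_template` function wraps your messages in the special markers the model was trained on and converts the result into token IDs. Think of it like translating English into the model's native language. The `return_tensors="pt"` tells it to output PyTorch tensors (the format PyTorch models expect), and `return_dict=True` makes it return the results in an organized dictionary structure. Note that the call does not pass `add_generation_prompt=True`, so the rendered prompt ends with the user turn rather than with an assistant marker. The model may then continue the user's message instead of answering it, which is consistent with the raw outputs later in this notebook, where the generated text follows directly after the `<｜User｜>` turn.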
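
A toy renderer can show what a chat template does to the message list. The sketch below is a simplified stand-in, not the tokenizer's actual template. The `<｜begin▁of▁sentence｜>` and `<｜User｜>` markers appear in this notebook's raw outputs, while the `<｜Assistant｜>` marker and the handling of system messages are assumptions made for illustration:

```python
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"  # assumed marker, not visible in this notebook's output


def render_chat(messages, add_generation_prompt=False):
    """Toy stand-in for a chat template; NOT the tokenizer's actual template."""
    text = BOS
    for message in messages:
        if message["role"] == "system":
            text += message["content"]
        elif message["role"] == "user":
            text += USER + message["content"]
        elif message["role"] == "assistant":
            text += ASSISTANT + message["content"]
    if add_generation_prompt:
        # Cue the model that the assistant's turn starts here.
        text += ASSISTANT
    return text


messages = [{"role": "user", "content": "How many ducks are there?"}]
print(render_chat(messages))
print(render_chat(messages, add_generation_prompt=True))
```

Without the trailing assistant marker, nothing tells the model that the user's turn has ended, so it may simply keep writing the question.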

The resulting tokens (the model's version of your text) need to be on the right device. That's what the `.to(CFG.device)` part does - it moves both the input tokens (`model_input_ids`) and the attention mask (`model_attention_mask`) to the GPU. The attention mask is a tensor of ones and zeros marking which positions hold real tokens (1) and which are padding to be ignored (0); with a single prompt it is all ones.
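
The attention mask only becomes interesting once several prompts of different lengths are batched. A minimal sketch in plain Python with invented token IDs; it pads on the left, which is a common choice for decoder-only generation but is an assumption here, since this notebook only ever sends one prompt at a time:

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + [1] * len(seq))
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append([1] * len(seq) + mask_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```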

The heart of the function is the `model.generate` call. This is where the actual text generation happens, using all those configuration settings we saw earlier:
- `streamer` enables real-time output, so you can see the response as it's being generated
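- `max_new_tokens` limits how many tokens the response can contain, not counting the prompt
- `do_sample` determines whether the model samples from the probability distribution (True) or always picks the most likely token, known as greedy decoding (False). Here it is True whenever `CFG.temperature` is non-zero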
- `temperature`, `top_k`, and `top_p` control how the model chooses its words

Finally, `tokenizer.batch_decode` translates the model's output back into human-readable text. The `skip_special_tokens=False` parameter means it keeps special tokens that might be important for understanding the model's complete output.
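
Since the decoded text keeps its markers, and this model writes its reasoning before a `</think>` tag (visible in the riddle output later on), a caller may want to separate the two parts. A small sketch; the helper name `split_reasoning` is hypothetical, and text without the tag is treated entirely as the answer:

```python
def split_reasoning(text, marker="</think>"):
    """Split decoded text into (reasoning, answer) at the first closing think tag."""
    reasoning, found, answer = text.partition(marker)
    if not found:
        # No tag present: treat the whole text as the answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()


sample = "Let me think. Probably a straight line.\n</think>\n\nThere are five ducks in total."
reasoning, answer = split_reasoning(sample)
print(answer)
```

If generation stops at the token limit before the tag appears, the returned "answer" is really unfinished reasoning, so a caller may want to handle that case separately.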


# Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model_name)
tokenizer.pad_token = tokenizer.eos_token

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

Let's explore these two crucial pieces of code that set up how the model will process and generate text. This setup involves both the tokenizer (which converts text to and from the model's numerical format) and the streamer (which controls how we see the model's output).

First, let's understand the tokenizer setup:
```python
tokenizer = AutoTokenizer.from_pretrained(CFG.model_name)
tokenizer.pad_token = tokenizer.eos_token
```

The first line loads a specialized tokenizer that matches our chosen model (DeepSeek in this case). Think of the tokenizer as a translator that knows the specific "vocabulary" and "grammar" rules this model understands. Just as different human languages have different dictionaries and grammar rules, different AI models have different ways of breaking down text.

The second line handles an important technical detail about batching. By setting `pad_token = eos_token`, we're telling the tokenizer to use the same token for two purposes: marking the end of a sequence (eos = end of sequence) and filling in extra space when inputs of different lengths are batched together (padding). This is a common convention for causal language models, which often ship without a dedicated padding token. It is safe because the attention mask tells the model to ignore padded positions, whichever token fills them.

Now let's look at the streamer setup:
```python
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
```

The TextStreamer is responsible for how we see the model's output in real-time. Think of it like a live translator at a conference - it processes and shows the model's responses as they're being generated, rather than waiting for the complete response.

The parameters tell the streamer how to handle this translation:
- `tokenizer`: This gives the streamer access to our "translator" so it can convert the model's numerical outputs back into readable text
- `skip_prompt=True`: This tells the streamer not to show us our own input again - similar to how you might not want to hear your own voice repeated back in a conversation
- `skip_special_tokens=True`: This removes technical markers that the model uses internally but that aren't meant for human readers. It's like removing editor's marks from a final published document

Together, these components create a smooth pipeline for communicating with the model: the tokenizer handles translation in both directions, and the streamer ensures we see the results in a clean, readable format as they're being generated. This setup is particularly important for interactive applications where users expect to see responses appear naturally, word by word, rather than waiting for the entire response to be completed.
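
During generation, the model hands the streamer the prompt tokens first and then each newly generated piece through a `put` method, followed by a final `end` call. The toy class below mimics that interface in plain Python to show what `skip_prompt=True` does. It is a simplified stand-in with an invented three-entry vocabulary, not the real `TextStreamer`, which also buffers tokens until whole words are available:

```python
class ToyStreamer:
    """Minimal stand-in mimicking the put/end interface used during generation."""

    def __init__(self, decode, skip_prompt=True):
        self.decode = decode
        self.skip_prompt = skip_prompt
        self.next_is_prompt = True
        self.pieces = []

    def put(self, token_ids):
        # The first call carries the prompt; drop it when skip_prompt is set.
        first_call = self.next_is_prompt
        self.next_is_prompt = False
        if first_call and self.skip_prompt:
            return
        piece = self.decode(token_ids)
        self.pieces.append(piece)
        print(piece, end="", flush=True)

    def end(self):
        # Finish the line and get ready for the next generation call.
        print()
        self.next_is_prompt = True


vocab = {0: "How many ducks?", 1: "Three", 2: " ducks"}
toy = ToyStreamer(lambda ids: "".join(vocab[i] for i in ids))
toy.put([0])  # prompt, skipped
toy.put([1])
toy.put([2])
toy.end()
```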

In [None]:
quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype = CFG.dtype
)

This code configures how the model will be loaded into memory using a technique called quantization - a sophisticated way to make large AI models run with less memory while maintaining good performance. Let me break this down into digestible pieces.

When we work with numbers in AI models, we typically store them with high precision, using lots of decimal places. Imagine writing pi as 3.14159265359... That's precise but takes up a lot of space. Quantization is like rounding these numbers strategically - maybe writing pi as just 3.14. We lose some precision, but we save a lot of space, and if done carefully, the model still works remarkably well.

The `BitsAndBytesConfig` creates settings for this process. The name comes from how computers store numbers - in bits and bytes. Traditionally, AI models use 32 bits (or 16 bits) for each number. But here, we're doing something more aggressive.

`load_in_4bit=True` tells the system to use just 4 bits for each number. This is a dramatic reduction - like compressing a high-resolution photo into a smaller file size. A 16-bit number can take 65,536 distinct values and a 32-bit number about 4.3 billion, whereas 4 bits allow just 16 (since 2⁴ = 16). To make this workable, weights are quantized in small blocks, each with its own scaling factor. It's a bold move, but recent research has shown it can work surprisingly well.

`bnb_4bit_compute_dtype = CFG.dtype` determines how calculations should be performed once the model is loaded. Remember from earlier that CFG.dtype was set to `torch.bfloat16` - this means that while we store the model's parameters in 4-bit precision, when we actually use the model to make predictions, calculations will be done in bfloat16 format. This is like doing rough sketches with a few colors (4-bit storage) but then filling in the details with a fuller palette (bfloat16) when creating the final artwork.
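
The store-small, compute-large idea can be sketched with a simple "absmax" scheme: scale a block of weights so the largest one maps to the edge of the 4-bit integer range, round, and multiply back when the values are needed. This is a deliberately simplified illustration with uniform integer levels and invented weights. It is not the 4-bit data types or kernels that bitsandbytes actually uses:

```python
def quantize_4bit(values):
    """Map floats to signed 4-bit integer codes (-8..7) with one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [code * scale for code in codes]


weights = [1.4, -0.6, 0.0, 0.25]
codes, scale = quantize_4bit(weights)
print(codes)                          # [7, -3, 0, 1]
print(dequantize_4bit(codes, scale))
```

The largest weights are recovered almost exactly, while small ones such as 0.25 snap to the nearest of the 16 levels. That rounding error is the price paid for the memory savings.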

The real benefit of this configuration is the reduction in memory footprint: roughly 8 times compared to 32-bit precision, and 4 times compared to 16-bit. For a 7 billion parameter model like DeepSeek-R1-Distill-Qwen-7B, the weights alone need about 28GB in 32-bit precision or 14GB in 16-bit precision, versus roughly 3.5GB in 4-bit precision, plus some overhead for scaling factors and any layers left unquantized. This makes it possible to run these powerful models on much more modest hardware, democratizing access to advanced AI capabilities.
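
The memory figures follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A short sketch, assuming a round 7 billion parameters and counting only the weights (no activations, cache, or quantization overhead):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

This prints 28.0, 14.0, and 3.5 GB respectively.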

Understanding these quantization settings is crucial because they represent a careful balance between resource efficiency and model performance - a central challenge in making advanced AI models more practical and accessible.

In [None]:
model = AutoModelForCausalLM.from_pretrained(CFG.model_name, torch_dtype = CFG.dtype,
                                            quantization_config=quantization_config,
                                            low_cpu_mem_usage = True
                                            )

model.generation_config.pad_token_id = tokenizer.pad_token_id

This code loads the language model into memory and configures it for use. Let's explore what's happening step by step, starting with how the model is loaded.

The `AutoModelForCausalLM.from_pretrained()` function is fetching our chosen model (DeepSeek-R1-Distill-Qwen-7B) from Hugging Face's model hub. Think of this like downloading a highly sophisticated translation program - but instead of translating between languages, it translates your prompts into meaningful responses.

The loading process includes several important configuration parameters:

`torch_dtype = CFG.dtype` sets the numerical precision for the parts of the model that are not quantized to 4 bits. Remember that we set this to bfloat16 earlier - a format that balances computational efficiency with numerical accuracy. This is similar to how a calculator might store numbers with a limited number of decimal places to save memory while still maintaining sufficient precision for most calculations.

`quantization_config=quantization_config` applies our earlier 4-bit quantization settings. This is where the magic of memory efficiency happens. Imagine having a massive library of books (our model's knowledge), but instead of storing full-color versions, we store them in a highly compressed format that still preserves the essential information. When we need to read from these books (make predictions), we can still understand the content clearly.

`low_cpu_mem_usage=True` optimizes how the model is loaded into memory. Traditional loading methods might temporarily need twice the memory while loading the model - like needing extra space on your desk while organizing papers. This setting minimizes that overhead, making the loading process more memory-efficient.

After loading the model, we see this crucial line:
```python
model.generation_config.pad_token_id = tokenizer.pad_token_id
```
This ensures the model and tokenizer agree on how to handle padding - the extra space added to make all inputs the same length. It's like making sure two people are using the same units of measurement before working together on a project. Remember earlier when we set the tokenizer's pad token? Now we're telling the model to use that same token, ensuring perfect coordination between these two components.

This coordination between model and tokenizer is essential for proper text generation. When the model generates text, it needs to know where sequences end and what symbols to use for padding, much like a writer needs to know what punctuation marks to use and when to start new paragraphs.

The end result is a fully configured, memory-efficient model ready to generate text. The careful balance of settings we've applied allows this powerful 7 billion parameter model to run smoothly even with limited computational resources, while maintaining its ability to understand and generate high-quality text responses.

# Riddles

In [None]:
generate("There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?")

 Let me think. So, if there's a duck in the middle, then there are two ducks in front of it and two behind it. That makes five ducks in total. Wait, but the question says "there are two ducks in front of a duck, two ducks behind a duck and a duck in the middle." Hmm, maybe I'm missing something. Let me visualize it. If I imagine a line of ducks, with one in the middle, two on each side. That would be five ducks. But the question mentions two in front, two behind, and one in the middle. So, that's five ducks. But I'm not sure if that's the correct answer. Maybe I should think about it differently. Perhaps the ducks are arranged in a circle? If they're in a circle, then each duck has two in front and two behind, but that would mean an infinite number, which doesn't make sense. So, probably a straight line. So, five ducks in total. Yeah, I think that's it.
</think>

There are five ducks in total. 

Step-by-step explanation:
1. **Middle Duck**: There's one duck in the middle.
2. **Ducks in

['<｜begin▁of▁sentence｜><｜User｜>There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there? Let me think. So, if there\'s a duck in the middle, then there are two ducks in front of it and two behind it. That makes five ducks in total. Wait, but the question says "there are two ducks in front of a duck, two ducks behind a duck and a duck in the middle." Hmm, maybe I\'m missing something. Let me visualize it. If I imagine a line of ducks, with one in the middle, two on each side. That would be five ducks. But the question mentions two in front, two behind, and one in the middle. So, that\'s five ducks. But I\'m not sure if that\'s the correct answer. Maybe I should think about it differently. Perhaps the ducks are arranged in a circle? If they\'re in a circle, then each duck has two in front and two behind, but that would mean an infinite number, which doesn\'t make sense. So, probably a straight line. So, five ducks in total. Yeah, 

In [None]:
generate("Five people were eating apples, A finished before B, but behind C. D finished before E, but behind B. What was the finishing order?")

**

Alright, so I've got this problem here about five people eating apples, and I need to figure out the finishing order based on the clues given. Let me try to break it down step by step.

First, the problem says: "Five people were eating apples, A finished before B, but behind C. D finished before E, but behind B. What was the finishing order?"

Okay, so we have five people: A, B, C, D, and E. They all finished eating apples, and we need to determine the order in which they finished. The clues are:

1. A finished before B, but behind C.
2. D finished before E, but behind B.

Hmm, let me parse that. So, the first clue is about A, B, and C. The second clue is about D, E, and B. I need to piece these together.

Let me start by writing down the clues more formally.

Clue 1: A finished before B, but A is behind C. So, in terms of order, C must have finished before A, and A before B. So, that gives us a partial order: C -> A -> B.

Clue 2: D finished before E, but D is behind B. So, B must

['<｜begin▁of▁sentence｜><｜User｜>Five people were eating apples, A finished before B, but behind C. D finished before E, but behind B. What was the finishing order?**\n\nAlright, so I\'ve got this problem here about five people eating apples, and I need to figure out the finishing order based on the clues given. Let me try to break it down step by step.\n\nFirst, the problem says: "Five people were eating apples, A finished before B, but behind C. D finished before E, but behind B. What was the finishing order?"\n\nOkay, so we have five people: A, B, C, D, and E. They all finished eating apples, and we need to determine the order in which they finished. The clues are:\n\n1. A finished before B, but behind C.\n2. D finished before E, but behind B.\n\nHmm, let me parse that. So, the first clue is about A, B, and C. The second clue is about D, E, and B. I need to piece these together.\n\nLet me start by writing down the clues more formally.\n\nClue 1: A finished before B, but A is behind C.
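
The two clues in this riddle form a chain, so the expected answer can be checked independently of the model's reasoning. A brute-force sketch that tries every ordering of the five people, reading "behind" as "finished after":

```python
from itertools import permutations


def finishing_orders():
    """Return every ordering of A-E consistent with the riddle's two clues."""
    valid = []
    for order in permutations("ABCDE"):
        pos = {person: index for index, person in enumerate(order)}
        clue_1 = pos["C"] < pos["A"] < pos["B"]  # A before B, but behind C
        clue_2 = pos["B"] < pos["D"] < pos["E"]  # D before E, but behind B
        if clue_1 and clue_2:
            valid.append("".join(order))
    return valid


print(finishing_orders())  # ['CABDE']
```

Exactly one ordering satisfies both clues: C, A, B, D, E.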

# Business

In [None]:
biz_prompt = """
Act as a visionary strategist. For launching a product for cleaning unstructured data, define its *Why*, *How*, and *What* using Simon Sinek’s framework.
"""

generate(biz_prompt)

**Question:**  
Define the "Why" (Why the product is needed), "How" (How to achieve the goal), and "What" (What the product will do) for launching a product that helps clean unstructured data, using Simon Sinek’s framework.

**Instructions:**  
- Use Simon Sinek’s framework to structure the response.
- The product should be designed to help users clean unstructured data efficiently, focusing on scalability and user experience.
- The response should be concise, with clear sections for each part: Why, How, What.

**Answer:**  
**Why:**  
*The product is needed because unstructured data is the goldmine of the 21st century, and the ability to clean and organize it is critical for unlocking its full potential.*

**How:**  
*To achieve this goal, we will design a product that is built on a foundation of simplicity, with a user-centric interface that prioritizes ease of use and efficiency. The product will leverage cutting-edge AI and machine learning to automate data cleaning processes, allo

['<｜begin▁of▁sentence｜><｜User｜>\nAct as a visionary strategist. For launching a product for cleaning unstructured data, define its *Why*, *How*, and *What* using Simon Sinek’s framework.\n**Question:**  \nDefine the "Why" (Why the product is needed), "How" (How to achieve the goal), and "What" (What the product will do) for launching a product that helps clean unstructured data, using Simon Sinek’s framework.\n\n**Instructions:**  \n- Use Simon Sinek’s framework to structure the response.\n- The product should be designed to help users clean unstructured data efficiently, focusing on scalability and user experience.\n- The response should be concise, with clear sections for each part: Why, How, What.\n\n**Answer:**  \n**Why:**  \n*The product is needed because unstructured data is the goldmine of the 21st century, and the ability to clean and organize it is critical for unlocking its full potential.*\n\n**How:**  \n*To achieve this goal, we will design a product that is built on a foun

In [None]:
biz_prompt = """
Act as a skill-acquisition scientist. For data enginering:
1. Map its learning curve stages (novice → expert).
2. Identify **key bottlenecks** at each stage.
"""

generate(biz_prompt)

3. Provide **solutions** for each bottleneck.
4. Include **real-world examples** for each stage.

For data science:
1. Map its learning curve stages (novice → expert).
2. Identify **key bottlenecks** at each stage.
3. Provide **solutions** for each bottleneck.
4. Include **real-world examples** for each stage.

For machine learning:
1. Map its learning curve stages (novice → expert).
2. Identify **key bottlenecks** at each stage.
3. Provide **solutions** for each bottleneck.
4. Include **real-world examples** for each stage.

For AI:
1. Map its learning curve stages (novice → expert).
2. Identify **key bottlenecks** at each stage.
3. Provide **solutions** for each bottleneck.
4. Include **real-world examples** for each stage.

For deep learning:
1. Map its learning curve stages (novice → expert).
2. Identify **key bottlenecks** at each stage.
3. Provide **solutions** for each bottleneck.
4. Include **real-world examples** for each stage.

For each of the above, I need to create a detai

["<｜begin▁of▁sentence｜><｜User｜>\nAct as a skill-acquisition scientist. For data enginering:\n1. Map its learning curve stages (novice → expert).\n2. Identify **key bottlenecks** at each stage.\n3. Provide **solutions** for each bottleneck.\n4. Include **real-world examples** for each stage.\n\nFor data science:\n1. Map its learning curve stages (novice → expert).\n2. Identify **key bottlenecks** at each stage.\n3. Provide **solutions** for each bottleneck.\n4. Include **real-world examples** for each stage.\n\nFor machine learning:\n1. Map its learning curve stages (novice → expert).\n2. Identify **key bottlenecks** at each stage.\n3. Provide **solutions** for each bottleneck.\n4. Include **real-world examples** for each stage.\n\nFor AI:\n1. Map its learning curve stages (novice → expert).\n2. Identify **key bottlenecks** at each stage.\n3. Provide **solutions** for each bottleneck.\n4. Include **real-world examples** for each stage.\n\nFor deep learning:\n1. Map its learning curve st