# Setup

In [None]:
!pip install -qU bitsandbytes

This command installs or upgrades the "bitsandbytes" Python package silently. The `-q` flag makes the installation quiet by suppressing most output messages, while `-U` ensures you get the latest version even if an older one exists.

Bitsandbytes is a library that helps run large AI models more efficiently through low-bit quantization (8-bit and 4-bit), a technique that reduces how much memory the model weights need. It's particularly useful when working with large language models or other deep learning projects where memory usage is a concern. This notebook uses its 4-bit mode.

The exclamation mark at the start tells Jupyter notebooks or Google Colab to run this as a shell command rather than Python code. Without it, you'd need to use Python's package management tools directly.

The installation is handled by pip, Python's standard package installer. pip plays the same role for Python that npm plays for JavaScript or cargo for Rust: it's the main way to add external code to your project.
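Outside a notebook, the same installation can be triggered from plain Python. The sketch below is one possible way to do it, not part of the original notebook: the helper names `pip_install_command` and `pip_install` are made up for illustration, and the install itself only happens if `pip_install` is actually called.

```python
import subprocess
import sys


def pip_install_command(package, quiet=True, upgrade=True):
    """Build the argument list equivalent to `!pip install -qU <package>`."""
    # Using sys.executable targets the interpreter that is running this code.
    cmd = [sys.executable, "-m", "pip", "install"]
    if quiet:
        cmd.append("-q")
    if upgrade:
        cmd.append("-U")
    cmd.append(package)
    return cmd


def pip_install(package, **flags):
    """Run pip in a subprocess; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package, **flags))


# Only print the command here; call pip_install("bitsandbytes") to really install.
print(pip_install_command("bitsandbytes"))
```

Building the command as a list rather than a single string avoids shell quoting issues.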

In [None]:
# Standard library imports
import os
import warnings

# Core ML framework
import torch

# Torch backend configurations
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

# Authentication and API clients
from huggingface_hub import login
from google.colab import userdata

# Transformer-specific imports
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextStreamer
)

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

warnings.filterwarnings("ignore")


This code sets up the infrastructure needed to work with large language models, particularly focusing on efficient memory usage and proper configuration. Let me break it down into logical sections:

At the core, the code imports essential Python modules. The os module gives access to operating system functions like environment variables, while warnings lets us control which warning messages appear during execution.

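The code then configures PyTorch, the deep learning framework. It specifically disables two fused implementations of scaled dot product attention: the memory-efficient kernel and the flash attention kernel. With both turned off, PyTorch falls back to its remaining attention implementations (such as the plain "math" one), which are typically slower and use more memory but avoid problems these kernels can cause with certain model, dtype, or GPU combinations.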

For working with models, the code brings in several key components from the transformers library. AutoModelForCausalLM handles text generation models - think of it as the engine that powers text completion. AutoTokenizer converts text into a format the model can understand, similar to how a translator converts between languages. BitsAndBytesConfig manages memory-efficient model loading, while TextStreamer helps display model output smoothly.

The code sets up authentication too. It imports login from huggingface_hub to access models from Hugging Face's repository, and userdata from Google Colab to handle credentials. This combination allows secure access to hosted models.

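An important environment variable is set: HF_HUB_ENABLE_HF_TRANSFER = "1". This tells huggingface_hub to download files through hf_transfer, an optional Rust-based download library that can speed up large model downloads on fast connections. It is not a separate protocol, and it only works if the hf_transfer package is installed (the pip cell above does not install it explicitly). Note also that the variable is set after huggingface_hub has already been imported; the library most likely reads this flag at import time, so setting it at the very top of the cell is the safer order.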
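One defensive way to set this flag is to enable it only when the optional `hf_transfer` package can actually be imported. The helper below is a hypothetical sketch (the function name is an assumption, not something from the notebook or from huggingface_hub), and it is meant to run before huggingface_hub or transformers are imported.

```python
import importlib.util
import os


def enable_hf_transfer_if_available(env=os.environ):
    """Set HF_HUB_ENABLE_HF_TRANSFER to "1" only when hf_transfer is importable."""
    # find_spec returns None when the package is not installed.
    available = importlib.util.find_spec("hf_transfer") is not None
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "1" if available else "0"
    return available


# Assumption: this runs before huggingface_hub / transformers are imported.
enabled = enable_hf_transfer_if_available()
print("hf_transfer enabled:", enabled)
```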

Finally, the code suppresses warnings using warnings.filterwarnings("ignore"). While this makes the output cleaner, it's worth noting that in development environments, seeing these warnings can help catch potential issues early.
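If hiding every warning feels too broad, the standard library also allows silencing a single category inside a limited scope. The following is a minimal sketch using only the `warnings` module; `noisy_step` is a made-up stand-in for library code.

```python
import warnings


def noisy_step():
    """Stand-in for library code that emits two kinds of warnings."""
    warnings.warn("old API, will be removed", DeprecationWarning)
    warnings.warn("result may be inaccurate", UserWarning)
    return "done"


# catch_warnings restores the previous filter state when the block exits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record everything by default
    # Later filters take precedence, so this silences only deprecations.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    result = noisy_step()

remaining = [str(w.message) for w in caught]
print(result, remaining)
```

Here the deprecation notice is dropped while the user warning is still captured, so potentially important messages remain visible.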

The structure follows a common pattern in machine learning code: first setting up basic Python tools, then configuring the deep learning framework, and finally importing specialized components for model handling. Each import and configuration serves a specific purpose in creating a stable environment for working with large language models.

In [None]:
login(token = userdata.get('HF_TOKEN') )

This line of code handles authentication with the Hugging Face model hub. Let's break down what's happening and why it matters.

The login function is attempting to authenticate with Hugging Face using a token stored in Google Colab's secure storage system. Think of this like using a special key card to enter a secure building - the token proves you have permission to access Hugging Face's resources.

The userdata.get('HF_TOKEN') part is reaching into Colab's secure storage to retrieve your Hugging Face access token. This is similar to how a password manager securely stores your credentials. Storing the token this way, rather than writing it directly in the code, is a security best practice - it keeps sensitive credentials out of your code where they could be accidentally shared or exposed.

The login() call saves the token so that later requests to the Hugging Face Hub are authenticated. This lets you download gated or private models, push changes, or interact with private repositories on Hugging Face. Without this authentication step, you'd only have access to public models, and requests for gated models would be refused.

A key security feature here is that the token is being retrieved dynamically rather than hardcoded. This follows the principle of keeping sensitive credentials separate from code, similar to how websites store authentication tokens in secure HTTP-only cookies rather than in JavaScript code.

If this login fails, it might mean either the token isn't properly stored in Colab's userdata, or the token itself might be invalid or expired. In that case, you'd need to generate a new token from your Hugging Face account settings and store it in Colab's secrets management system.
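The failure cases above can be handled a little more gracefully with a small lookup helper. This is a hypothetical sketch: `get_hf_token` is not part of Colab or huggingface_hub, and the fallback to an `HF_TOKEN` environment variable is an assumption made for illustration.

```python
import os


def get_hf_token(secret_getter=None, env=os.environ, name="HF_TOKEN"):
    """Return a token from a secret store if possible, else from the environment."""
    if secret_getter is not None:
        try:
            token = secret_getter(name)
            if token:
                return token
        except Exception:
            # A secret store may raise if the secret is missing or access is denied;
            # in that case fall through to the environment lookup.
            pass
    token = env.get(name)
    if not token:
        raise RuntimeError(f"No {name} found; create a token in your Hugging Face settings.")
    return token


# In Colab this might be used as: login(token=get_hf_token(userdata.get))
print(get_hf_token(lambda name: "hf_example_token"))
```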

In [None]:
class CFG:
    device = 'cuda'
    model = "mistralai/Mistral-7B-Instruct-v0.1"
    max_tokens = 5000
    temperature = 0.1
    top_k = 5
    top_p = 0.9
    dtype = torch.bfloat16

This code defines a configuration class that controls how a language model will run. Think of it like setting up the control panel before operating complex machinery - each setting affects how the model will behave.

Let's examine each setting and understand its purpose:

The device setting tells the model to run on a CUDA-enabled GPU rather than CPU. This is crucial for performance - GPUs can process the complex matrix operations in language models much faster than CPUs, similar to how a specialized wood-cutting machine works better than a general-purpose knife.

The model selection specifies Mistral-7B-Instruct-v0.1 from Mistral AI. The "7B" in the name indicates roughly 7 billion parameters, and "v0.1" is the release version. The "Instruct" part means it's specifically tuned for following instructions, like a well-trained assistant.

The max_tokens setting caps the length of generated text at 5,000 tokens. Think of this like setting a word limit on an essay - it prevents the model from rambling indefinitely.

Temperature, top_k, and top_p work together to control the randomness and creativity of the model's outputs. With temperature at 0.1, the model will be quite deterministic, almost always choosing the most likely next word. This is like turning down the "creativity dial" to get more focused, predictable responses. The top_k value of 5 means it only considers the 5 most likely next words, while top_p of 0.9 ensures it captures 90% of the probability mass - these settings help balance between creativity and coherence.
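To see how these three settings interact, here is a simplified pure-Python sketch of the filtering applied to a toy list of logits. It is an illustration under assumptions, not the transformers implementation: the function name is made up, the logits are invented, and it assumes temperature is greater than zero.

```python
import math


def sampling_distribution(logits, temperature=0.1, top_k=5, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing by a small value sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
print(sampling_distribution(toy_logits, temperature=0.1))
print(sampling_distribution(toy_logits, temperature=1.0))
```

With these toy logits, a temperature of 0.1 leaves only the single most likely token in the candidate set, while a temperature of 1.0 leaves several, which is why the low setting behaves almost deterministically.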

The dtype setting uses bfloat16, a 16-bit number format that halves memory use compared with 32-bit floats. This is like using efficient shorthand instead of writing everything in longhand - it helps the model fit in GPU memory while still performing well. The torch.bfloat16 format is particularly good for neural networks because it keeps the same exponent range as 32-bit floats, so very large and very small values don't overflow or underflow as easily as in regular float16. The trade-off is that it has fewer precision bits than float16.
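The range-versus-precision trade-off of bfloat16 can be illustrated with the standard library alone. The sketch below emulates bfloat16 by keeping the top 16 bits of a float32; this truncation is a simplification (real conversions usually round to nearest), and both helper names are assumptions for illustration.

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    truncated = bits & 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


def fits_float16(x):
    """Check whether x can be stored as an IEEE half-precision float."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


print(to_bfloat16(1.001))   # small differences near 1.0 are lost
print(to_bfloat16(3.0e38))  # huge values still fit thanks to the wide exponent
print(fits_float16(1.0e5))  # regular float16 cannot hold this value
```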

Together, these settings create a configuration aimed at efficient, controlled text generation with a focus on instruction-following rather than creative exploration. The low temperature and limited top_k suggest this setup is designed for tasks requiring precise, consistent outputs.

# Functions

In [None]:
def generate(prompt, system = None):

    messages = []

    if system:
        messages.append({"role": "system", "content": system})

    messages.append({"role": "user", "content": prompt})


    tokenizer_output = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
    model_input_ids = tokenizer_output.input_ids.to(CFG.device)
    model_attention_mask = tokenizer_output.attention_mask.to(CFG.device)

    outputs = model.generate(model_input_ids,
                           attention_mask=model_attention_mask,
                           streamer = streamer,
                           max_new_tokens = CFG.max_tokens,
                           do_sample=True if CFG.temperature else False,
                           temperature = CFG.temperature,
                           top_k = CFG.top_k,
                           top_p = CFG.top_p)

    answer = tokenizer.batch_decode(outputs, skip_special_tokens = False)

    return answer

This code defines a function that generates text responses using a language model, similar to how a translator processes and generates text in different languages. Let's walk through how it works step by step.

The function accepts two parameters: 'prompt' (the main text input) and 'system' (optional instructions that shape how the model should behave). Think of the prompt as the actual question you're asking, while the system parameter is like giving background instructions to a human assistant before they start their task.

The function first creates a list called 'messages' to store the conversation structure. If there's a system message provided, it adds that first - like giving context before asking a question. Then it always adds the user's prompt. Each message is formatted as a dictionary with two keys: 'role' (who's speaking) and 'content' (what they're saying).
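The message-building step can be isolated into a tiny helper, which makes the structure easy to inspect without loading a model. This is a sketch; `build_messages` is a hypothetical name, not a function from the notebook.

```python
def build_messages(prompt, system=None):
    """Return a chat-style message list: optional system message, then the user prompt."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return messages


print(build_messages("Summarize this text."))
print(build_messages("Summarize this text.", system="You are a concise assistant."))
```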

Next comes the crucial preparation step where the text gets converted into a format the model can understand. The tokenizer.apply_chat_template method takes our messages and transforms them into numerical representations (tokens) that the model can process. This is similar to how a music sheet translates musical ideas into a standardized notation that musicians can read.

The tokenized input gets moved to the specified device (GPU in this case) using .to(CFG.device). This is like moving your work materials to your workbench before starting a project. Both the input tokens (model_input_ids) and the attention mask (which tells the model which parts of the input to focus on) are transferred.

The generation phase is where the model actually creates its response. The model.generate method takes several parameters that control how the text is generated:

- streamer allows for real-time output display
- max_new_tokens limits response length
- do_sample determines if the model should use randomness in its choices (here it is enabled whenever temperature is non-zero)
- temperature, top_k, and top_p control how creative versus deterministic the output should be
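Since temperature, top_k, and top_p only matter when sampling is on, one option is to build the keyword arguments conditionally. The helper below is a hypothetical sketch of that idea; it only prepares a dictionary and does not call any model.

```python
def build_generation_kwargs(temperature, top_k, top_p, max_new_tokens):
    """Assemble generation settings, adding sampling options only when sampling is on."""
    kwargs = {"max_new_tokens": max_new_tokens, "do_sample": bool(temperature)}
    if kwargs["do_sample"]:
        # These settings are meaningful only for sampling-based decoding.
        kwargs.update(temperature=temperature, top_k=top_k, top_p=top_p)
    return kwargs


print(build_generation_kwargs(0.1, 5, 0.9, 5000))
print(build_generation_kwargs(0, 5, 0.9, 5000))
```

The resulting dictionary could then be unpacked into a call such as `model.generate(..., **kwargs)`.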

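Finally, the generated tokens are converted back into human-readable text using tokenizer.batch_decode. The skip_special_tokens parameter determines whether to include or remove special markers the model uses internally; here it is False, so the markers are kept. Note that for a causal model, generate returns the prompt tokens followed by the new tokens, so the decoded text contains the formatted prompt as well as the answer, and batch_decode returns a list with one string per sequence.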

Understanding this function is key because it encapsulates the entire process of how we interact with language models: taking human input, converting it to a form the model understands, generating a response under specific constraints, and converting that response back to human-readable text. It's like having a complete translation pipeline that handles both the conversion of languages and the generation of appropriate responses.

In [None]:
def resolve_mistakes(text: str, options = ('all',), one_shot = True):  # tuple default avoids a shared mutable list
    """
    A function that corrects mistakes in text using a language model.
    The function specializes in different types of linguistic errors.
    """

    system = "You are an expert specializing in improving texts in English."

    prompt_start = "Check the following English text and correct it. Pay special attention to:\n\n"

    # Possible options to include as the second argument: "typo", "spelling", "punctuation", "inflection", "syntax", "lexical", "stylistic"
    prompt_typo = "* Typos (omission, repetition, or insertion of incorrect characters)\n\n"
    prompt_spelling = "* Spelling mistakes (incorrect letter usage or combinations, incorrect capitalization, incorrect use of periods in abbreviations, incorrect joint writing)\n\n"
    prompt_punctuation = "* Punctuation errors (unnecessary punctuation marks, missing punctuation marks, incorrect handling of multiple punctuation marks, use of incorrect punctuation marks)\n\n"
    prompt_inflection = "* Inflection errors (related to choosing incorrect word forms, incorrect declension, or incorrect lack of declension)\n\n"
    prompt_syntax = "* Syntax errors (using incorrect forms or constructions where required by governing words, word order errors, incorrect use of participle equivalents)\n\n"
    prompt_lexical = "* Lexical errors (incorrect word usage in given constructions, confusing similar-sounding words, unnecessary borrowings or overuse of foreign words, word-formation errors and errors resulting from word-formation associations, distortions of phraseological units)\n\n"
    prompt_stylistic = "* Stylistic errors (mismatched linguistic forms for correspondence character and function, unintentionally making text ambiguous)\n\n"

    prompt_one_shot = "Here's an example: \n\nText to correct: 'pete sat hunced at the tble and was silent .'\n\nCorrectly fixed text: 'Pete sat hunched at the table and was silent.'"
    prompt_end = "Remember that intervention in the content should only be related to error correction and should not interfere with the original meaning. Return only the corrected text without additional comments. Below is the text to correct:\n\n"

    if 'all' in options:  # accept 'all' at any position, not only first
        options = ['typo', 'spelling', 'punctuation', 'inflection', 'syntax', 'lexical', 'stylistic']

    # Constructing the complete prompt by combining different error type prompts based on selected options
    prompt_complete = prompt_start + (prompt_typo if 'typo' in options else '') + (prompt_spelling if 'spelling' in options else '') + (prompt_punctuation if 'punctuation' in options else '') + (prompt_inflection if 'inflection' in options else '') + (prompt_syntax if 'syntax' in options else '') + (prompt_lexical if 'lexical' in options else '') + (prompt_stylistic if 'stylistic' in options else '') + (prompt_one_shot if one_shot else '') + prompt_end + text

    messages = []
    messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": str(prompt_complete)})

    # Processing the text through the language model
    tokenizer_output = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
    model_input_ids = tokenizer_output.input_ids.to(CFG.device)
    model_attention_mask = tokenizer_output.attention_mask.to(CFG.device)

    outputs = model.generate(model_input_ids,
                           attention_mask=model_attention_mask,
                           streamer = streamer,
                           max_new_tokens = CFG.max_tokens,
                           # Greedy decoding: temperature, top_k and top_p are not passed
                           # because they only apply when do_sample=True.
                           do_sample=False)

    answer = tokenizer.batch_decode(outputs, skip_special_tokens = False)

    return answer

This function creates a sophisticated text correction system that can identify and fix various types of language errors. Let's explore how it works through each major component.

The function starts by accepting three parameters: the text to correct, which error types to look for (defaulting to 'all'), and whether to include an example correction (one_shot). This flexibility allows users to target specific types of errors or conduct a comprehensive review.

The heart of the function lies in its detailed prompt construction system. It builds specialized instructions for the language model by combining different components:

First, it establishes the model's role through a system message that positions it as an English language expert. This creates a context where the model approaches the text with specialized linguistic knowledge.

The prompt construction then uses a modular approach, breaking down potential errors into seven distinct categories, each targeting specific linguistic challenges:

1. Typos focus on mechanical errors like missing or extra characters
2. Spelling addresses systematic issues with how words are written
3. Punctuation handles the intricate rules of marks and pauses
4. Inflection deals with word forms and declensions
5. Syntax manages sentence structure and word relationships
6. Lexical errors cover word choice and usage
7. Stylistic issues address tone and clarity

The one_shot parameter adds an educational element - when enabled, it provides an example correction demonstrating proper error fixing. This example, "pete sat hunced at the tble and was silent," shows multiple error types and their corrections, helping guide the model's approach to the task.

The function then assembles these components based on which error types were requested. If the user specifies 'all', it includes every error category. Otherwise, it selectively includes only the requested error types, creating a focused correction lens.
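The selective assembly described above can also be written with a dictionary lookup instead of a long chain of conditionals. The sketch below is an alternative formulation, not the notebook's code: the category texts are shortened placeholders, and `build_correction_prompt`, `header`, and `footer` are hypothetical names.

```python
# Shortened placeholder descriptions; the notebook uses longer ones.
ERROR_PROMPTS = {
    "typo": "* Typos\n\n",
    "spelling": "* Spelling mistakes\n\n",
    "punctuation": "* Punctuation errors\n\n",
    "inflection": "* Inflection errors\n\n",
    "syntax": "* Syntax errors\n\n",
    "lexical": "* Lexical errors\n\n",
    "stylistic": "* Stylistic errors\n\n",
}


def build_correction_prompt(text, options=("all",),
                            header="Check the following text:\n\n",
                            footer="Text to correct:\n\n"):
    """Join the header, the requested error categories, the footer and the text."""
    if "all" in options:
        options = tuple(ERROR_PROMPTS)
    unknown = [name for name in options if name not in ERROR_PROMPTS]
    if unknown:
        raise ValueError(f"Unknown options: {unknown}")
    # Iterate over the dictionary so sections keep a fixed order.
    sections = "".join(ERROR_PROMPTS[name] for name in ERROR_PROMPTS if name in options)
    return header + sections + footer + text


print(build_correction_prompt("pete sat at the tble .", ["typo", "punctuation"]))
```

A lookup table like this also makes it easy to reject misspelled option names instead of silently ignoring them.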

The prompt ends with clear instructions about preserving the original meaning while fixing errors. This constraint ensures that corrections improve the text's form without altering its substance.

The final section handles the technical process of getting the model to generate corrections:
- It packages the constructed prompt into a message format
- Converts the text into tokens the model can process
- Moves these tokens to the GPU for efficient processing
- Generates the corrected text using the model's parameters
- Decodes the model's output back into human-readable text

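The generation parameters are set for precision rather than creativity (do_sample=False), which is appropriate for a correction task where accuracy is paramount. With do_sample=False the model decodes greedily, always picking the most likely next token, so sampling settings such as temperature, top_k, and top_p play no role in this call.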

This architecture creates a powerful tool for improving text quality, combining linguistic expertise with systematic error detection and correction. Its modular design allows for both comprehensive editing and targeted improvements to specific aspects of language use.

# Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model)
tokenizer.pad_token = tokenizer.eos_token

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

Let me walk you through this code, which sets up the critical components needed for processing and generating text with our language model. Let's break down each part to understand what's happening and why it matters.

The first line loads a tokenizer matched to our specific model (mistralai/Mistral-7B-Instruct-v0.1 in this case). A tokenizer is like a translator that converts human-readable text into numbers that the model can understand. Think of it as breaking down words into smaller pieces, similar to how we might break down "sunflower" into "sun" and "flower" to understand its meaning better.

The next line addresses an important technical detail: setting the pad_token equal to the eos_token (end of sequence token). Padding is how sequences of different lengths are brought to a common length so they can be processed together in a batch. Some tokenizers ship without a dedicated padding token, which is why the code assigns one here. Reusing the end-of-sequence token is a common workaround: the attention mask tells the model to ignore the padded positions, so it matters little which token fills them.
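What padding does to a batch can be shown with plain lists. This is a toy sketch rather than the tokenizer's implementation: `pad_batch` is a made-up helper, and the token ids are invented. The `side` argument is included because padding for generation with decoder-only models is commonly done on the left.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-id lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        ignore = [0] * len(padding)  # 0 = position the model should ignore
        attend = [1] * len(seq)      # 1 = real token
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(ignore + attend)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(attend + ignore)
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=2)
print(ids)
print(mask)
```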

The final piece creates a TextStreamer object, which controls how the model's output is displayed. The skip_prompt=True parameter tells the streamer not to show the input text again in the output - similar to how when having a conversation, you don't repeat the question before giving your answer. The skip_special_tokens=True parameter removes internal markers that the model uses but aren't meant for human readers, like cleaning up stage directions before presenting a final theatrical script.

Together, these components create a complete pipeline for text processing:
1. The tokenizer converts human text into model-readable numbers
2. The padding system ensures all inputs are properly formatted
3. The streamer manages how the model's responses are presented back to humans

This setup is fundamental to how the model understands and generates text, similar to how having proper tools and workspace organization is essential before starting any complex project. Each component plays a vital role in ensuring smooth communication between humans and the language model.

Understanding this infrastructure helps us appreciate why the model can handle such sophisticated tasks as style transformation and error correction - it has the proper tools to break down, understand, and reconstruct text in meaningful ways.

In [None]:
quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype = CFG.dtype
)

This code sets up a crucial configuration for running large language models efficiently with limited computational resources. Let me break down what's happening and why it matters.

The BitsAndBytesConfig creates settings for quantization - a technique that reduces the precision of numbers used in the model to save memory while maintaining performance. Think of quantization like converting high-resolution photos to a lower resolution that still looks good but takes up less space.

The configuration has two key settings:

First, load_in_4bit=True tells the system to use 4-bit quantization. To understand why this is significant, let's consider how computers typically store numbers. Standard neural networks often use 32 or 16 bits per number, which provides high precision but requires lots of memory. By using just 4 bits, we can store numbers using only 1/8 or 1/4 of the original memory. This dramatic reduction is like compressing a large file so it can fit on a smaller drive.
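The memory savings can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a model with roughly 7 billion parameters and counts weight storage only; it ignores quantization metadata, activations, and the generation cache, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes), weights only."""
    return num_parameters * bits_per_weight / 8 / 1e9


PARAMS = 7e9  # assumption: roughly 7 billion parameters
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: about {weight_memory_gb(PARAMS, bits):.1f} GB")
```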

Second, bnb_4bit_compute_dtype = CFG.dtype (which we saw earlier was set to torch.bfloat16) specifies the precision to use during actual computations. This creates an interesting hybrid approach: while the model's weights are stored in 4-bit format to save memory, calculations are done in bfloat16 format to maintain accuracy. It's similar to how a photographer might store photos in a compressed format but temporarily decompress them for editing to ensure high-quality results.
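The store-small, compute-large idea can be illustrated with a toy quantizer. Note the assumption: this is a simple linear "absmax" scheme written for illustration, not the block-wise 4-bit formats that bitsandbytes actually uses, and the function names are made up.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed 4-bit integer codes (-8..7) plus one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats; this is the form used for computation."""
    return [c * scale for c in codes]


weights = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

Only the small integer codes and the scale need to be stored; the approximate floats are reconstructed when a computation needs them, at the cost of a rounding error of at most half the scale per value.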

The bfloat16 format itself is worth understanding. Unlike standard 16-bit floating point numbers, bfloat16 maintains the same number of exponent bits as 32-bit numbers while reducing precision bits. This makes it particularly good for neural networks because it can represent a wide range of values (from very small to very large) while using less memory.

This configuration represents a careful balance between three competing needs:
1. Reducing memory usage to run large models on consumer hardware
2. Maintaining enough numerical precision for the model to work effectively
3. Keeping computation speed reasonable

Without such optimizations, many modern language models would be too large to run on typical hardware. This configuration makes it possible to use sophisticated models on more modest hardware while still getting good results - democratizing access to advanced AI technology.

The effectiveness of this approach shows how theoretical insights about neural networks (like their resilience to reduced precision) can be translated into practical engineering solutions that make advanced technology more accessible.

In [None]:
model = AutoModelForCausalLM.from_pretrained(CFG.model,
                                            torch_dtype = CFG.dtype,
                                            quantization_config=quantization_config,
                                            low_cpu_mem_usage = True
                                            )

model.generation_config.pad_token_id = tokenizer.pad_token_id


This code loads and configures a large language model, setting it up to run efficiently while maintaining good performance. Let me break down each part to help build a clear understanding of what's happening.

The main function, AutoModelForCausalLM.from_pretrained(), loads our chosen model. The "ForCausalLM" part of the name tells us this is specifically designed for text generation, where each word is predicted based on all the previous words - similar to how we naturally complete sentences based on their beginnings.

Let's examine each configuration parameter:

torch_dtype = CFG.dtype sets the numerical format to bfloat16, which we discussed earlier. This choice reflects a deep understanding of how neural networks process numbers. While regular computers often need high precision for calculations like banking, neural networks can work remarkably well with less precise numbers. Think of it like how human brains can recognize a friend's face whether it's in bright sunlight or dim evening light - we don't need perfect precision to perform complex tasks.

quantization_config=quantization_config applies our earlier 4-bit quantization settings. This is where the model performs its memory diet, converting its internal weights from high-precision numbers to more compact 4-bit representations. Imagine converting a high-resolution photo to a smaller size that still captures all important details - we're doing the same thing with the model's knowledge.

low_cpu_mem_usage=True reduces peak CPU RAM during loading. Instead of first building a fully initialized copy of the model and then overwriting it with the checkpoint, the weights are loaded into an empty model piece by piece. This is particularly important when working with large models. It's similar to how a moving company might carefully plan how to get large furniture through a narrow doorway - we're making sure the model loads efficiently even with limited resources.

The final line, setting pad_token_id, ensures consistency between our tokenizer and model. This might seem like a small detail, but it's crucial for proper text processing. Think of it like making sure two people working together use the same signal to indicate they've finished speaking - without this coordination, communication breaks down.

This configuration represents years of research and engineering advances in making large language models practical. Just a few years ago, running models of this size would have required expensive specialized hardware. Now, through techniques like quantization and careful memory management, we can run sophisticated models on more modest hardware while maintaining most of their capabilities.

Understanding these settings helps us appreciate the careful balance between model performance, memory usage, and computational efficiency. It's like conducting an orchestra - each setting must be tuned just right to create the desired result. The choices made here reflect a deep understanding of both the theoretical foundations of neural networks and the practical constraints of running them on real hardware.

In [None]:
email = """
Dear Mr Thomson
I am writing to gave you an update about the progress of our project. During last two weeks, our team have made significant improvements in terms of the developments of the new software platform.
I would like too schedule a meeting next week to discuss the results what we achieved. The meeting will took approximately 2 hours, and we need discussing the following topics:

Implementation of new features which was requested by clients
Budget adjustments for the next quater
Time line for deploying the system

Please let me known what time suits you better. I am available between monday and thursday next week, preferable in the morning hours.
Looking forward to hear from you soon.
Best Regards
Sarah Miller
Senior Project Manger
Digital Solutions Department
"""

In [None]:
email_corrected = resolve_mistakes(email, options = ['punctuation'], one_shot = True)

Dear Mr. Thomson,

I am writing to provide you with an update on the progress of our project. Over the past two weeks, our team has made significant improvements in terms of the development of the new software platform.

I would like to schedule a meeting next week to discuss the results we have achieved. The meeting will take approximately two hours, and we need to cover the following topics:

* Implementation of new features requested by clients
* Budget adjustments for the next quarter
* Time line for deploying the system

Please let me know what time works best for you. I am available between Monday and Thursday next week, preferably in the morning hours.

Looking forward to hearing from you soon.

Best Regards,
Sarah Miller
Senior Project Manager
Digital Solutions Department
