# Pre-training an LLM using a small dataset

In [32]:
import warnings
warnings.filterwarnings('ignore')

In [33]:
import torch

def fix_torch_seed(seed=42):
    # Seed PyTorch's default random number generator
    # This makes random numbers the same every time you run the code
    torch.manual_seed(seed)

    # Seed the generators on every visible GPU as well
    # (silently ignored when CUDA is not available)
    torch.cuda.manual_seed_all(seed)

    # Ask cuDNN to use deterministic algorithms
    # This avoids small random variations between runs on the GPU
    torch.backends.cudnn.deterministic = True

    # Disable the cuDNN autotuner, which may pick different kernels between runs
    # Makes results more consistent but can be slightly slower
    torch.backends.cudnn.benchmark = False


# Call the function so all experiments are reproducible
fix_torch_seed()
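
A quick way to see what seeding buys us is to draw random numbers twice with the same seed. The snippet below is a minimal sketch; `sample_with_seed` is an illustrative helper that is not part of the notebook, and it only exercises the CPU generator.

```python
import torch


def sample_with_seed(seed=42):
    """Re-seed PyTorch and draw three random numbers on the CPU."""
    torch.manual_seed(seed)
    return torch.rand(3)


first = sample_with_seed()
second = sample_with_seed()

# Both draws start from the same generator state, so they match exactly.
print(torch.equal(first, second))
```

Without the call to `torch.manual_seed`, the two draws would generally differ.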

# Load Pretrained Model

Here I work with small models that fit within the memory of my hardware. TinySolar-248m-4k is a small decoder-only model with 248M parameters (similar in scale to GPT-2) and a 4096-token context window. You can find the model in the Hugging Face model library at https://huggingface.co/upstage/TinySolar-248m-4k.

We can load the model in three steps:
1. Specify the path to the model in the Hugging Face model library
2. Load the model using `AutoModelForCausalLM` in the `transformers` library
3. Load the tokenizer for the model from the same model path using `AutoTokenizer` in the `transformers` library

In [34]:
model_path = "upstage/TinySolar-248m-4k"

In [35]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map='cpu',
    torch_dtype=torch.bfloat16
)




Loading the tokenizer and model explicitly, as above, gives full control over:
- tokenizer
- model
- device placement
- generation settings
- batching
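
As a rough sanity check on the `torch.bfloat16` setting used when loading, the memory taken by the weights can be estimated from the parameter count. The sketch below assumes exactly 248 million parameters (the rounded figure quoted earlier) and ignores activations and framework overhead; `estimate_weight_memory_mb` is an illustrative helper, not part of `transformers`.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: the rounded parameter count quoted for TinySolar-248m-4k.
NUM_PARAMS = 248_000_000


def estimate_weight_memory_mb(num_params, dtype):
    """Return the approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    size_mb = estimate_weight_memory_mb(NUM_PARAMS, dtype)
    print(f"{dtype}: ~{size_mb:.0f} MB")
```

Under these assumptions, `bfloat16` halves the weight footprint relative to `float32`, which helps when the model has to fit on modest hardware.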

---

## Using Pipeline (Simpler Way)

The same thing can be done using the `pipeline` API.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model_path,
    device_map="cpu"
)

generator("The future of AI is", max_new_tokens=50)
```

`pipeline` automatically:
- loads the tokenizer
- loads the model
- prepares everything for inference

---

## When to Use Each

**Use `pipeline` for:**
- quick experiments
- demos
- simple scripts

**Use `AutoModelForCausalLM` for:**
- training
- custom generation logic
- performance tuning
- research work

## Generate a text sample with the pretrained model

In [36]:
from transformers import TextStreamer

prompt = "I love AI/ML and would love to work at "

inputs = tokenizer(prompt, return_tensors='pt')

streamer = TextStreamer(
    tokenizer,
    skip_prompt=False,
    skip_special_tokens=True
)

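outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,         # greedy decoding: always pick the highest-scoring token
    temperature=0.0,         # has no effect when do_sample=False
    repetition_penalty=1.1   # discourage tokens that have already appeared
)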

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


I love AI/ML and would love to work at 100% of the time.
I'm a big fan of the new "AI" game, but I don't think it's going to be as good as the original. It's still a great game, but it's not perfect. The graphics are pretty bad, but the story is very well done.
The gameplay is really good, and the characters are all very well written. The music is nice and the soundtrack is good too.
I've played this game for about 2 years now and it's been on my playlist for quite some time. I'


### What is happening here

For basic text generation we only need to define **four main things**:

1. **Prompt**  
   The starting text we give the model.
   ```python
   prompt = "I love AI/ML and would love to work at "
   ```

2. **Inputs**  
   The prompt converted into tokens so the model can understand it.
   ```python
   inputs = tokenizer(prompt, return_tensors="pt")
   ```

3. **Streamer**  
   Prints the generated tokens live as the model produces them instead of waiting for the full output.
   ```python
   streamer = TextStreamer(tokenizer)
   ```

4. **Output Generation**  
   The model predicts the next tokens based on the prompt.
   ```python
   outputs = model.generate(...)
   ```

In simple terms:

**Prompt → Tokenized Input → Model Generates → Streamed Output**
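
The flow above can be illustrated with a toy decoder. This is not how `transformers` implements `generate`; it is a minimal pure-Python sketch that assumes a made-up four-word vocabulary and a hypothetical `toy_next_token_logits` lookup table in place of a real model. It shows greedy decoding (the `do_sample=False` setting: always take the highest-scoring token) together with a repetition penalty in the style of `repetition_penalty=1.1`, where the score of an already-seen token is divided by the penalty when positive and multiplied by it when negative.

```python
VOCAB = ["the", "cat", "sat", "down"]  # made-up vocabulary for illustration

# Hypothetical "model": next-token scores that depend only on the last token.
TOY_TABLE = {
    0: [0.1, 2.0, 0.5, 0.2],
    1: [0.3, 0.1, 2.0, 0.4],
    2: [0.2, 0.1, 0.3, 2.0],
    3: [2.0, 0.5, 0.1, 0.3],
}


def toy_next_token_logits(ids):
    """Return one score per vocabulary entry for the next token."""
    return TOY_TABLE[ids[-1]]


def apply_repetition_penalty(logits, seen_ids, penalty=1.1):
    """Lower the scores of tokens that already appear in the sequence."""
    adjusted = list(logits)
    for token_id in set(seen_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def greedy_decode(prompt_ids, max_new_tokens, penalty=1.1):
    """Repeatedly append the highest-scoring next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = apply_repetition_penalty(toy_next_token_logits(ids), ids, penalty)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids


ids = greedy_decode(prompt_ids=[0], max_new_tokens=3)
print(" ".join(VOCAB[i] for i in ids))
```

With this table the decoder extends the prompt "the" to "the cat sat down". A real model replaces the lookup table with a forward pass over the whole token sequence.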

## Generate Python samples with the general pretrained model, which is trained on English text

In [37]:
prompt = 'def find_max(numbers):'

In [38]:
inputs = tokenizer(prompt, return_tensors='pt')

streamer = TextStreamer(
    tokenizer,
    skip_prompt=False,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    temperature=0.0,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


def find_max(numbers):
       """
       Returns the number of times a user has been added to the list.
       """
       return num_users() + 1

   def get_user_id(self, id):
       """
       Returns the number of users that have been added to the list.
       """
       return len(self.get_users())

   def get_user_name(self, name):
       """
       Returns the name of the user that has been added to the list.
       """
       return self.get_user_name(name)



This model was trained on general English text rather than on code, so it does not help with our coding problem: the completion above has nothing to do with finding a maximum.

## Generate Python samples with a model fine-tuned on Python code

In [39]:
model_path = "upstage/TinySolar-248m-4k-code-instruct"

In [40]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map='cpu',
    torch_dtype=torch.bfloat16
)


In [41]:
prompt = 'def find_max(numbers):'

inputs = tokenizer(prompt, return_tensors="pt")

streamer = TextStreamer(
    tokenizer,
    skip_prompt=False,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    temperature=0.0,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


def find_max(numbers):
   if len(numbers) == 0:
       return "Invalid input"
   else:
       return max(numbers)
```

In this solution, the `find_max` function takes a list of numbers as input and returns the maximum value in that list. It then iterates through each number in the list and checks if it is greater than or equal to 1. If it is, it adds it to the `max` list. Finally, it returns the maximum value found so far.


Fine-tuning on our domain (code) alone is not enough. The fine-tuned model definitely performs better than the general pretrained model, but still not as well as expected. To get better results, we can try a model that was pretrained entirely on the domain we are interested in.

## Generate Python samples with the pretrained Python model

In [42]:
model_path = "upstage/TinySolar-248m-4k-py"

In [43]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map='cpu',
    torch_dtype=torch.bfloat16
)


In [44]:
prompt = 'def find_max(numbers):'

inputs = tokenizer(prompt, return_tensors='pt')

streamer = TextStreamer(
    tokenizer,
    skip_prompt=False,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False,
    temperature=0.0,
    repetition_penalty=1.1
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


def find_max(numbers):
   """Find the maximum number of numbers in a list."""
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max


def get_min_max(numbers, min_value=1):
   """Get the minimum value of a list."""
   min_value = min_value or 1
   for num in numbers:
       if num < min_value:
           min_value = num
   return min_value



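The model pretrained on Python gives better results than the fine-tuned model: its completion is a self-contained `find_max` implementation with a docstring, whereas the fine-tuned model followed its code with an explanation that does not match what the code does.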
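
The load-and-generate steps were repeated for each of the three checkpoints, so they can be wrapped in one helper. The sketch below is one possible refactoring, not the notebook's original code: `complete` and `compare_checkpoints` are illustrative names, the generation settings mirror the cells above (without the streamer), and the heavy libraries are imported inside the function so that defining it stays cheap.

```python
# The three checkpoints compared in this notebook.
MODEL_PATHS = [
    "upstage/TinySolar-248m-4k",
    "upstage/TinySolar-248m-4k-code-instruct",
    "upstage/TinySolar-248m-4k-py",
]


def complete(model_path, prompt, max_new_tokens=128):
    """Load a checkpoint and return its greedy completion of `prompt`."""
    # Imported here so torch and transformers load only when the helper is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="cpu",
        torch_dtype=torch.bfloat16,
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        repetition_penalty=1.1,
    )
    # outputs[0] holds the prompt tokens followed by the generated tokens.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def compare_checkpoints(prompt, model_paths=MODEL_PATHS):
    """Print each checkpoint's completion of the same prompt."""
    for path in model_paths:
        print(f"=== {path} ===")
        print(complete(path, prompt))
```

Calling `compare_checkpoints("def find_max(numbers):")` would download each checkpoint on first use and print the three completions one after another.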

## Checking whether the output of the decoder model actually works

In [45]:
def find_max(numbers):
   """Find the maximum number of numbers in a list."""
   max = 0
   for num in numbers:
       if num > max:
           max = num
   return max

In [46]:
nums = [1, 7, 4, 5, 6, 8]
print(find_max(nums))

8


Great, it works: the generated function returns the correct maximum, 8, for this list of positive numbers.
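
A single input is a thin test, so it is worth trying a few more. The sketch below compares the generated logic against Python's built-in `max` on several lists. `find_max_generated` reproduces the model's logic with the local variable renamed so that it no longer shadows the built-in, and `find_max_checked` is a hypothetical variant added here for comparison; neither name comes from the model.

```python
def find_max_generated(numbers):
    """The model-generated logic, with `max` renamed to `largest`."""
    largest = 0
    for num in numbers:
        if num > largest:
            largest = num
    return largest


def find_max_checked(numbers):
    """Hypothetical variant that starts from the first element instead of 0."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    largest = numbers[0]
    for num in numbers[1:]:
        if num > largest:
            largest = num
    return largest


test_cases = [[1, 7, 4, 5, 6, 8], [-3, -1, -7], [5]]

for case in test_cases:
    expected = max(case)  # built-in used as the reference answer
    print(
        case,
        "generated ok:", find_max_generated(case) == expected,
        "checked ok:", find_max_checked(case) == expected,
    )
```

The all-negative list is where the generated version goes wrong: because it starts from 0, it returns 0 instead of -1. Starting from the first element avoids that, at the cost of having to decide what an empty list should do.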