# Lesson 2: Data Preparation

In this lesson you'll carry out some of the data cleaning steps required to prepare data for pretraining. In the video, Sung mentioned an Upstage tool called **Dataverse** that can help you with data cleaning. You can check out the features of Dataverse at [this link](https://github.com/UpstageAI/dataverse).

In [1]:
import warnings
warnings.filterwarnings("ignore")

## 1. Sourcing datasets for pretraining

In this section, you'll see two ways to source data for training:
1. Download an existing dataset from Hugging Face
2. Create a dataset of Python scripts sourced from GitHub

In both cases the result will be a Hugging Face `Dataset` object, part of the `Datasets` library. You can read more about the properties of Datasets and how to work with them on the [Hugging Face website](https://huggingface.co/docs/datasets/en/index).

### Download data from Hugging Face

The dataset you download here is a subset of a much larger dataset called **Red Pajama**. The full 1-trillion-token dataset is available on Hugging Face at [this link](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).

In [2]:
import datasets
pretraining_dataset = datasets.load_dataset(
    "upstage/Pretraining_Dataset",
    split="train"
)

Downloading data:   0%|          | 0.00/150M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/60000 [00:00<?, ? examples/s]

In [3]:
print(pretraining_dataset)

Dataset({

    features: ['text', 'meta'],

    num_rows: 60000

})


Only work with the `text` column:

In [4]:
pretraining_dataset = pretraining_dataset.select_columns(
    ["text"]
)

Print a sample:

In [5]:
print(pretraining_dataset[0]["text"][:500])

In 1793 Zaman Shah, a grandson of Ahmad Shah Durrani, won a brief war of succession to become ruler of Afghanistan. The support of Painda Khan, chief of the Baraksai branch of the Durrani tribe, was decisive in his victory. In the next fifty year., the brothers of Zaman shah and the sons of Painda Khan were to dominate the affairs of Afghanistan. The Durrani tribe was very large with several branches and numerous clans. 1 Abmad Shah and his successors belonged to the Sadozai clan, but other clan


### Compare pretraining and fine-tuning datasets
In the next cell, you'll download a fine-tuning dataset to contrast with the pretraining dataset you loaded above. You can read more about the Alpaca model and instruction tuning dataset [here](https://crfm.stanford.edu/2023/03/13/alpaca.html). 

In [6]:
instruction_dataset = datasets.load_dataset(
    "c-s-ale/alpaca-gpt4-data",
    split="train"
)
print(instruction_dataset)

Downloading readme:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/43.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

Dataset({

    features: ['instruction', 'input', 'output'],

    num_rows: 52002

})


In [7]:
i=0
print("Instruction: " + instruction_dataset[i]["instruction"] 
      + "\nInput: " + instruction_dataset[i]["input"] 
      + "\nOutput: " + instruction_dataset[i]["output"])

Instruction: Give three tips for staying healthy.

Input: 

Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.



2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.



3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.


Notice how, in contrast to the pretraining data, which is just raw text, fine-tuning datasets are structured into question-answer pairs or instruction-response sets that can include additional input context if required.

Moving forward, you'll only work with the unstructured pretraining dataset.

### Scrape Python code from GitHub
Here, you'll download a selection of Python scripts from GitHub and then prepare them as a Hugging Face `Dataset` object to use in training.

The same pattern will work for preparing any text scraped from the web.

In [8]:
# Import the required packages
import os
import requests

# Path to the directory where the Python scripts are stored
code_dir = "/kaggle/working/"

In [9]:
urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py",
    "https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py",
    "https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py",
    "https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py",
    "https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
    "https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/django/contrib/messages/__init__.py",
    "https://raw.githubusercontent.com/PaliC/pytorch/master/test/fx/test_subgraph_rewriter.py"
]

Retrieve the Python scripts:

In [10]:
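for url in urls:
    print(f"Working on url: {url}")
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    file_name = os.path.basename(url)
    file_path = os.path.join(code_dir, file_name)

    # Write the raw bytes of the script to disk
    with open(file_path, "wb") as file:
        file.write(response.content)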

Working on url: https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py

Working on url: https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py

Working on url: https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py

Working on url: https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py

Working on url: https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py

Working on url: https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py

Working on url: https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py

Working on url: https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packag

In [11]:
files = os.listdir(code_dir)
for file in files:
    print(file)

numpy_mlp.py

__notebook__.ipynb

double_linear_search_recursion.py

__init__.py

values.py

module_util.py

visualize.py

version.py

test_subgraph_rewriter.py

distribute_coordinator_context.py
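
One thing to watch in the download loop above: `os.path.basename(url)` keeps only the last path component, so two URLs that end in the same file name would silently overwrite each other in `code_dir`. The sketch below shows one possible workaround, prefixing each name with a short hash of the full URL. The helper `local_name` and the example URLs are hypothetical and not part of the lesson.

```python
import hashlib
import os
from urllib.parse import urlparse


def local_name(url: str) -> str:
    """Build a collision-resistant file name from a URL (illustrative helper)."""
    base = os.path.basename(urlparse(url).path)
    # Eight hex characters of the URL's SHA-1 keep names short but distinct.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{base}"


# Two made-up URLs that share the same basename.
example_urls = [
    "https://example.com/repo_a/utils.py",
    "https://example.com/repo_b/utils.py",
]
names = [local_name(u) for u in example_urls]
print(names)
```

With a scheme like this, the saved name still ends in the original file name, so the files stay recognizable.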


Concatenate scripts into a list:

In [12]:
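import os

code_dataset = []
# Note: this reads every regular file in code_dir, so the listing above
# (nine scripts plus __notebook__.ipynb) yields ten entries.
for file in os.listdir(code_dir):
    file_path = os.path.join(code_dir, file)
    if os.path.isfile(file_path):  # Only process regular files, not directories
        with open(file_path, 'r') as f:
            code_dataset.append({'text': f.read()})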


Convert list to Hugging Face `Dataset` object:

In [13]:
code_dataset = datasets.Dataset.from_list(code_dataset)
print(code_dataset)

Dataset({

    features: ['text'],

    num_rows: 10

})


Combine the Python code dataset with the pretraining dataset you downloaded above:

In [14]:
dataset = datasets.concatenate_datasets(
    [pretraining_dataset, code_dataset]
)

print(dataset)

Dataset({

    features: ['text'],

    num_rows: 60010

})


## 2. Data cleaning

In the cells below, you'll carry out the following cleaning steps:
1. Filter out samples that are too short
2. Remove repetitions within a single text example
3. Remove duplicated documents
4. Quality filter to remove non-English texts 

In [15]:
dataset.num_rows

60010

### Remove examples that are too short

In [16]:
import heapq

def paragraph_length_filter(x):
    """Returns False if a page has too few lines or lines are too short."""
    lines = x['text'].split('\n')
    if (
        len(lines) < 3  # fewer than three lines
        # or the third-longest line has fewer than three characters
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True

In [17]:
dataset = dataset.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/60010 [00:00<?, ? examples/s]

In [18]:
dataset.num_rows

52358
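
To get a feel for what this filter keeps and drops, the sketch below applies the same rule to a few toy pages. The function body is copied from the cell above; the sample texts are made up for illustration.

```python
import heapq


def paragraph_length_filter(x):
    """Same rule as above: keep a page only if it has at least three lines
    and its third-longest line has at least three characters."""
    lines = x["text"].split("\n")
    if (
        len(lines) < 3
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True


# Hypothetical toy pages, not taken from the dataset.
toy_pages = {
    "single line": "Just one line of text",
    "three short lines": "a\nb\nc",
    "three real lines": "first line\nsecond line\nthird line",
}
results = {name: paragraph_length_filter({"text": text})
           for name, text in toy_pages.items()}
print(results)
```

Note that a document with no line breaks is rejected no matter how long it is, because the first condition only counts lines.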

### Remove repeated text within training examples

Here you'll remove text repetitions within each example. 

In [19]:
def find_duplicates(paragraphs):
    """
    Use this function to find the number of repetitions 
    in the paragraphs.
    """
    
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars

In [20]:
import re

def paragraph_repetition_filter(x):
    """
    Returns False iff a page has too many repetitions.
    """
    text = x['text']
    paragraphs = re.compile(r"\n{2,}").split(text.strip())                # Split by paragraphs (2 or more newlines)
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)  # Find number of duplicates in paragraphs
    if paragraphs_duplicates / len(paragraphs) > 0.3:
        return False
    if char_duplicates / len(text) > 0.2:
        return False
    return True

In [21]:
dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)

Filter:   0%|          | 0/52358 [00:00<?, ? examples/s]

In [22]:
dataset.num_rows

52328
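
As a worked example of the two thresholds, the sketch below runs the duplicate count on a made-up page where one paragraph appears three times. The helper mirrors `find_duplicates` from the cell above; the page text is hypothetical.

```python
import re


def find_duplicates(paragraphs):
    """Count repeated paragraphs and the characters they contain."""
    seen = set()
    duplicate_elements = 0
    duplicate_chars = 0
    for paragraph in paragraphs:
        if paragraph in seen:
            duplicate_elements += 1
            duplicate_chars += len(paragraph)
        else:
            seen.add(paragraph)
    return duplicate_elements, duplicate_chars


# Hypothetical page: one paragraph repeated three times, plus one unique paragraph.
text = "Subscribe now!\n\nSubscribe now!\n\nSubscribe now!\n\nActual article text."
paragraphs = re.split(r"\n{2,}", text.strip())
dup_paragraphs, dup_chars = find_duplicates(paragraphs)

paragraph_ratio = dup_paragraphs / len(paragraphs)  # share of paragraphs that are repeats
char_ratio = dup_chars / len(text)                  # share of characters in repeats
print(dup_paragraphs, dup_chars, paragraph_ratio, char_ratio)
```

Only the second and third copies count as duplicates, so two of the four paragraphs are repeats. That ratio of 0.5 exceeds the 0.3 threshold, and `paragraph_repetition_filter` would drop this page.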

### Deduplication

In this section, you'll remove duplicate examples from the entire dataset (in contrast to the previous step, where you only looked for repeated text within each example).

In [23]:
def deduplication(ds):
    def dedup_func(x):
        """Use this function to remove duplicate entries"""
        if x['text'] in unique_text:
            return False
        else:
            unique_text.add(x['text'])
            return True

    unique_text = set()

    ds = ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)
    return ds

dataset = deduplication(dataset)

Filter:   0%|          | 0/52328 [00:00<?, ? examples/s]

In [24]:
dataset.num_rows

43599
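
The `deduplication` function above keeps every full document string in a Python set, which is simple but can use a lot of memory on large corpora. A possible variation, sketched here as an assumption rather than the course's method, stores a fixed-size SHA-256 digest of each text instead. Like the original, it only catches exact duplicates. `dedup_by_hash` and the toy texts are hypothetical.

```python
import hashlib


def dedup_by_hash(texts):
    """Keep the first occurrence of each exact text, tracking only digests."""
    seen_digests = set()
    kept = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            kept.append(text)
    return kept


toy_texts = ["alpha", "beta", "alpha", "gamma", "beta"]
unique_texts = dedup_by_hash(toy_texts)
print(unique_texts)
```

Either way, the filter has to run in a single process (`num_proc=1` above), because separate worker processes would each keep their own set and miss duplicates seen by the others.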

### Quality filter - Language

Here you'll remove any text examples that are in a language other than English. The code here uses a language detection model called fastText. You can read about fastText [here](https://fasttext.cc/).

In [25]:
!pip install fasttext






In [26]:
# _FastText is the class that fasttext.load_model returns
from fasttext.FastText import _FastText

def english_language_filter(ds):
    # load language detection model
    model = _FastText('/kaggle/input/l2_language_model.bin/other/default/1/L2_language_model.bin')
    
    def is_english(x):
        # Predict the language label and its probability; fastText labels
        # look like "__label__en", so the language code is the third piece
        language, score = model.predict(x['text'].replace("\n", ""))

        language = language[0].split("__")[2]
        return score > 0.4 and language == "en"  # change the code here if building a model in another language

    ds = ds.filter(is_english, load_from_cache_file=False, num_proc=1)
    return ds

dataset = english_language_filter(dataset)



Filter:   0%|          | 0/43599 [00:00<?, ? examples/s]

In [27]:
dataset.num_rows

40474
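
The label handling inside `is_english` is easy to misread, so here is a small sketch of just that step, without loading the model. It assumes a prediction has the shape `(('__label__en',), [0.97])`, that is, a tuple of label strings and a matching sequence of probabilities. The function name `parse_prediction` and the sample predictions are made up.

```python
def parse_prediction(labels, scores, threshold=0.4, target="en"):
    """Return True if the top label is `target` with probability above `threshold`."""
    # "__label__en".split("__") gives ["", "label", "en"], so index 2 is the code.
    language = labels[0].split("__")[2]
    return scores[0] > threshold and language == target


# Made-up predictions in the assumed output format.
confident_english = parse_prediction(("__label__en",), [0.97])
unsure_english = parse_prediction(("__label__en",), [0.25])
confident_german = parse_prediction(("__label__de",), [0.99])
print(confident_english, unsure_english, confident_german)
```

To build a dataset in another language, changing `target` (and possibly `threshold`) would be the place to start.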

## 3. Save the dataset to disk

Read more about the parquet data format [here](https://parquet.apache.org/).

In [28]:
file_path = "/kaggle/working/preprocessed_dataset.parquet"
dataset.to_parquet(file_path)

Creating parquet from Arrow format:   0%|          | 0/41 [00:00<?, ?ba/s]

197101804

# Lesson 3: Data Packaging
## 1. Tokenizing and creating input_ids

Start by loading the dataset from the previous lesson:

In [29]:
import datasets

dataset = datasets.load_dataset(
    "parquet", 
    data_files="/kaggle/working/preprocessed_dataset.parquet", 
    split="train"
)
print(dataset)

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({

    features: ['text'],

    num_rows: 40474

})


Use the `shard` method of the Hugging Face `Dataset` object to split the dataset into 10 smaller pieces, or *shards* (think shards of broken glass). You can read more about sharding at [this link](https://huggingface.co/docs/datasets/en/process#shard).

In [30]:
dataset = dataset.shard(num_shards=10, index=0)
print(dataset)

Dataset({

    features: ['text'],

    num_rows: 4048

})
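
To see where the 4,048 rows come from, the sketch below assigns row indices to shards in a strided fashion (row `i` goes to shard `i % num_shards`). Whether `shard` uses this strided layout or contiguous blocks depends on its `contiguous` argument, but the number of rows per shard comes out the same either way. `shard_sizes` is a hypothetical helper, not a library function.

```python
def shard_sizes(num_rows, num_shards):
    """Number of rows each shard receives when row i goes to shard i % num_shards."""
    return [len(range(index, num_rows, num_shards)) for index in range(num_shards)]


sizes = shard_sizes(num_rows=40474, num_shards=10)
print(sizes)       # shard 0 is the one selected with index=0 above
print(sum(sizes))  # every row lands in exactly one shard
```

Because 40,474 is not a multiple of 10, the first four shards get one more row than the rest.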


Load the tokenizer and try it out:

In [31]:
from transformers import AutoTokenizer
model_path_or_name = "upstage/SOLAR-10.7B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(
    model_path_or_name,
    use_fast=False
)

tokenizer_config.json:   0%|          | 0.00/966 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

In [32]:
tokenizer.tokenize("I'm a short sentence")

['▁I', "'", 'm', '▁a', '▁short', '▁sentence']
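
The `▁` character in the output marks the start of a new word (a space in the original text), which is the convention used by SentencePiece-style tokenizers. As a rough illustration, assuming only that convention, the tokens can be joined back into the sentence like this:

```python
tokens = ['▁I', "'", 'm', '▁a', '▁short', '▁sentence']

# Concatenate the pieces, then turn each word-boundary marker back into a space.
text = "".join(tokens).replace("▁", " ").strip()
print(text)
```

In practice you would let the tokenizer do this, for example with its `convert_tokens_to_string` method, since real vocabularies also contain byte-level and special tokens that this one-liner ignores.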

In [33]:
def tokenization(example):
    #Tokenize
    tokens = tokenizer.tokenize(example["text"])
    
    # Convert tokens to ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    # Add <bos>, <eos> tokens to the front and back of tokens_ids 
    # bos: begin of sequence, eos: end of sequence
    token_ids = [
        tokenizer.bos_token_id] \
        + token_ids \
        + [tokenizer.eos_token_id
    ]
    example["input_ids"] = token_ids
    
    # We will be using this column to count the total number of tokens 
    # in the final dataset
    example["num_tokens"] = len(token_ids)
    return example

Tokenizing:
In the first step, the text "Merhaba dünya" is split into tokens. Suppose, for illustration, that this yields two tokens: ["Merhaba", "dünya"].

Converting tokens to IDs:
Each token is then mapped to its numeric ID in the model's vocabulary. For example, the token "Merhaba" might be represented by ID 1234 and "dünya" by ID 5678. The tokens then become [1234, 5678].

Adding the special tokens:
To add the beginning (<bos>) and end (<eos>) tokens, the values of bos_token_id and eos_token_id are placed at the front and back of the ID list. This walkthrough uses 101 and 102 as placeholder values (the tokenizer loaded in this notebook uses 1 and 2), so the token IDs become [101, 1234, 5678, 102].

Counting the tokens:
Finally, the total number of tokens is computed. Since the ID list [101, 1234, 5678, 102] has four entries, the value 4 is stored in the example dictionary under the "num_tokens" key.

As a result, the example dictionary becomes:
    {
        "text": "Merhaba dünya",
        "input_ids": [101, 1234, 5678, 102],
        "num_tokens": 4
    }
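
The same steps can be sketched in code with a toy tokenizer. Everything here is hypothetical: the whitespace split, the two-entry vocabulary, and the placeholder IDs 101 and 102 stand in for the real tokenizer, whose `bos_token_id` and `eos_token_id` are 1 and 2 in this notebook.

```python
# Toy stand-ins for the real tokenizer (all values are illustrative).
VOCAB = {"Merhaba": 1234, "dünya": 5678}
BOS_ID, EOS_ID = 101, 102


def toy_tokenization(example):
    """Mimic the steps of `tokenization` with a whitespace tokenizer."""
    tokens = example["text"].split()                 # 1. split into tokens
    token_ids = [VOCAB[token] for token in tokens]   # 2. map tokens to IDs
    token_ids = [BOS_ID] + token_ids + [EOS_ID]      # 3. add <bos> and <eos>
    example["input_ids"] = token_ids
    example["num_tokens"] = len(token_ids)           # 4. count the tokens
    return example


result = toy_tokenization({"text": "Merhaba dünya"})
print(result)
```

The returned dictionary has the same three keys that `dataset.map(tokenization)` produces for each row: `text`, `input_ids`, and `num_tokens`.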


Tokenize all the examples in the pretraining dataset:

This code applies tokenization to a dataset using the Hugging Face datasets library and then prints the result. The `dataset.map` method applies the `tokenization` function to every example, and `load_from_cache_file=False` makes sure the data is processed afresh on each run rather than loaded from a cache. `print(dataset)` shows the final state of the dataset.

Suppose the dataset contains these examples:

[{'text': 'Merhaba dünya'}, {'text': 'Nasılsın?'}]

After the tokenization function is applied, the dataset might look like this (the IDs are illustrative):

[{'text': 'Merhaba dünya', 'input_ids': [101, 1234, 5678, 102], 'num_tokens': 4},
 {'text': 'Nasılsın?', 'input_ids': [101, 6789, 1020, 102], 'num_tokens': 4}]


In [34]:
dataset = dataset.map(tokenization, load_from_cache_file=False)
print(dataset)

Map:   0%|          | 0/4048 [00:00<?, ? examples/s]

Dataset({

    features: ['text', 'input_ids', 'num_tokens'],

    num_rows: 4048

})


In [35]:
sample = dataset[3]

print("text", sample["text"][:30])
print("\ninput_ids", sample["input_ids"][:30])
print("\nnum_tokens", sample["num_tokens"])

text The Colorado Climate Center pr



input_ids [1, 415, 15837, 1366, 3314, 6064, 5312, 430, 19102, 304, 1178, 356, 281, 3928, 28725, 9735, 28713, 28725, 264, 1052, 14455, 4623, 28725, 9390, 1452, 274, 28725, 17268, 28713, 28725]



num_tokens 549


## 2. Packing the data

First, suppose the dataset looks like this: each example holds a list of token IDs, so `input_ids` might be [ [1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20] ].

The first step is to concatenate all the token IDs into one long list, placing the examples one after another. The result is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20].

Next, the list is trimmed so that it can be split into chunks of a fixed maximum length. Take a maximum length of 8. The list has 20 elements, so it is cut down to 16, the largest multiple of 8 that fits. The last 4 tokens (17, 18, 19, 20) are discarded, leaving [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].

The trimmed list is then split into rows of the maximum length, which gives two rows of 8 tokens each: [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]].

Finally, the reshaped list is converted to a Hugging Face `Dataset`. Each row represents one example, so the result is a dataset with two examples, each holding a fixed-length list of token IDs, ready to be used for training.

These steps describe the general flow for packing tokenized data into the fixed-length format used for language modeling.
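
The toy walkthrough above can be reproduced in a few lines of plain Python. This is only a sketch of the idea with the same made-up numbers (20 tokens, maximum length 8); the notebook cells that follow do the same thing with NumPy on the real data.

```python
max_seq_length = 8

# Four toy examples of five token IDs each, as in the walkthrough.
examples = [list(range(start, start + 5)) for start in (1, 6, 11, 16)]

# 1. Concatenate all token IDs into one long list.
input_ids = [token_id for example in examples for token_id in example]

# 2. Drop the tail so the length is a multiple of max_seq_length.
total_length = len(input_ids) - len(input_ids) % max_seq_length
input_ids = input_ids[:total_length]

# 3. Cut the list into rows of max_seq_length tokens.
rows = [
    input_ids[i:i + max_seq_length]
    for i in range(0, total_length, max_seq_length)
]
print(rows)
```

Notice that the row boundaries ignore the original example boundaries: the first row contains all of example one and part of example two.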

Check the total number of tokens in the dataset:

In [36]:
import numpy as np
np.sum(dataset["num_tokens"])

5113663

Concatenate input_ids for all examples into a single list:

In [37]:
input_ids = np.concatenate(dataset["input_ids"])
print(len(input_ids))

5113663


In [38]:
max_seq_length = 32

In [39]:
total_length = len(input_ids) - len(input_ids) % max_seq_length
print(total_length)

5113632


Discard the extra tokens from the end of the list so that the number of tokens is exactly divisible by `max_seq_length`:

In [40]:
input_ids = input_ids[:total_length]
print(input_ids.shape)

(5113632,)


In [41]:
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape

(159801, 32)
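
The shapes above follow from simple integer arithmetic, which can be checked directly using the token count printed earlier:

```python
num_tokens = 5113663   # total tokens reported above
max_seq_length = 32

leftover = num_tokens % max_seq_length          # tokens dropped from the end
total_length = num_tokens - leftover            # length after truncation
num_sequences = total_length // max_seq_length  # rows in the reshaped array

print(leftover, total_length, num_sequences)
```

Only the remainder is discarded, which is always fewer than `max_seq_length` tokens, so packing wastes almost nothing compared with padding each example separately.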

In [42]:
type(input_ids_reshaped)

numpy.ndarray

Convert to Hugging Face dataset:

In [43]:
input_ids_list = input_ids_reshaped.tolist()
packaged_pretrain_dataset = datasets.Dataset.from_dict(
    {"input_ids": input_ids_list}
)
print(packaged_pretrain_dataset)

Dataset({

    features: ['input_ids'],

    num_rows: 159801

})


## 3. Save the packed dataset to disk

In [44]:
packaged_pretrain_dataset.to_parquet("/kaggle/working/packaged_pretrain_dataset.parquet")

Creating parquet from Arrow format:   0%|          | 0/160 [00:00<?, ?ba/s]

21093732

# Lesson 4: Preparing your model for training

In [45]:
# Ignore insignificant warnings (e.g., deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

# Set a seed value for reproducibility
import torch

def fix_torch_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_torch_seed()

## 1. Model configuration

You'll configure models based on Meta's Llama family of models. The transformers library has several tools for working with these models, which you can read about [here](https://huggingface.co/docs/transformers/main/en/model_doc/llama).

Start by creating a `LlamaConfig` object to configure the architecture of the model:

In [46]:
from transformers import LlamaConfig
config = LlamaConfig()
print(config)

LlamaConfig {

  "attention_bias": false,

  "attention_dropout": 0.0,

  "bos_token_id": 1,

  "eos_token_id": 2,

  "hidden_act": "silu",

  "hidden_size": 4096,

  "initializer_range": 0.02,

  "intermediate_size": 11008,

  "max_position_embeddings": 2048,

  "mlp_bias": false,

  "model_type": "llama",

  "num_attention_heads": 32,

  "num_hidden_layers": 32,

  "num_key_value_heads": 32,

  "pretraining_tp": 1,

  "rms_norm_eps": 1e-06,

  "rope_scaling": null,

  "rope_theta": 10000.0,

  "tie_word_embeddings": false,

  "transformers_version": "4.42.3",

  "use_cache": true,

  "vocab_size": 32000

}




Next, update parameters to change the model architecture:

In [47]:
config.num_hidden_layers = 12      # reduced from 32 to 12
config.hidden_size = 1024          # reduced to 1/4, from 4096 to 1024
config.intermediate_size = 4096    # reduced to roughly 1/3, from 11008 to 4096
config.num_key_value_heads = 8     # reduced to 1/4, from 32 to 8 (defaults to num_attention_heads=32)
config.torch_dtype = "bfloat16"    # for half-precision training
config.use_cache = False           # `True` is incompatible with gradient checkpointing
print(config)

LlamaConfig {

  "attention_bias": false,

  "attention_dropout": 0.0,

  "bos_token_id": 1,

  "eos_token_id": 2,

  "hidden_act": "silu",

  "hidden_size": 1024,

  "initializer_range": 0.02,

  "intermediate_size": 4096,

  "max_position_embeddings": 2048,

  "mlp_bias": false,

  "model_type": "llama",

  "num_attention_heads": 32,

  "num_hidden_layers": 12,

  "num_key_value_heads": 8,

  "pretraining_tp": 1,

  "rms_norm_eps": 1e-06,

  "rope_scaling": null,

  "rope_theta": 10000.0,

  "tie_word_embeddings": false,

  "torch_dtype": "bfloat16",

  "transformers_version": "4.42.3",

  "use_cache": false,

  "vocab_size": 32000

}




## 2. Weight initialization

In the next sections, you'll explore four different ways to initialize the weights of a model for training:
1. Random weight initialization
2. Using an existing model for continued pre-training
3. Downscaling an existing model
4. Upscaling an existing model

### Random weight initialization

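Random initialization draws the weights of the linear and embedding layers from a normal distribution with mean 0 and standard deviation 0.02, the `initializer_range` value in the config. Note that the weight sample printed further down contains values beyond 2 sigma (for example, 4.7622e-02), so the distribution is not truncated at 2 sigma in this run.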

In [48]:
from transformers import LlamaForCausalLM
model = LlamaForCausalLM(config)
print(model)

LlamaForCausalLM(

  (model): LlamaModel(

    (embed_tokens): Embedding(32000, 1024)

    (layers): ModuleList(

      (0-11): 12 x LlamaDecoderLayer(

        (self_attn): LlamaSdpaAttention(

          (q_proj): Linear(in_features=1024, out_features=1024, bias=False)

          (k_proj): Linear(in_features=1024, out_features=256, bias=False)

          (v_proj): Linear(in_features=1024, out_features=256, bias=False)

          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)

          (rotary_emb): LlamaRotaryEmbedding()

        )

        (mlp): LlamaMLP(

          (gate_proj): Linear(in_features=1024, out_features=4096, bias=False)

          (up_proj): Linear(in_features=1024, out_features=4096, bias=False)

          (down_proj): Linear(in_features=4096, out_features=1024, bias=False)

          (act_fn): SiLU()

        )

        (input_layernorm): LlamaRMSNorm()

        (post_attention_layernorm): LlamaRMSNorm()

      )

    )

    (norm): LlamaRMSNorm()

In [49]:
def print_nparams(model):
    """Calculate the total number of model parameters"""
    nparams = sum(p.numel() for p in model.parameters())
    print(f"The total number of parameters is: {nparams}")

print_nparams(model)  # 248013824 => 248M

The total number of parameters is: 248013824
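
The 248M figure can be reproduced by hand from the configuration. The sketch below counts the parameters of each component (embeddings, attention and MLP projections, RMSNorm weights, and a separate output head). The function `llama_param_count` is a hypothetical helper written for this check; it assumes no bias terms and untied embeddings, as in the config above.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads):
    """Count parameters of a Llama-style decoder without biases or tied embeddings."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim                # width of k_proj / v_proj outputs

    attention = 2 * hidden_size * hidden_size       # q_proj and o_proj
    attention += 2 * hidden_size * kv_dim           # k_proj and v_proj
    mlp = 3 * hidden_size * intermediate_size       # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                         # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size           # embed_tokens
    lm_head = vocab_size * hidden_size              # separate output projection
    final_norm = hidden_size
    return embeddings + num_layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab_size=32000, hidden_size=1024, intermediate_size=4096,
    num_layers=12, num_heads=32, num_kv_heads=8,
)
print(total)
```

Changing `num_layers` to 10 or 16 gives 217,601,024 and 308,839,424, which match the downscaled and upscaled models later in this lesson.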


Take a look at a sample of the weights in a single layer:

In [50]:
layer_name = "model.layers.0.self_attn.q_proj.weight"

for name,param in model.named_parameters():
    if name == layer_name:
        print(f"First 30 weights of layer '{layer_name}':")
        print(param.data.view(-1)[:30])
        break

First 30 weights of layer 'model.layers.0.self_attn.q_proj.weight':

tensor([ 1.5794e-02, -2.2748e-02,  2.0156e-02, -2.6072e-02, -8.3267e-05,

         8.7432e-03, -9.0255e-04, -4.2442e-02,  1.5337e-02,  1.4482e-02,

         1.3526e-02,  1.9171e-03, -2.3141e-02, -4.2336e-03,  6.9818e-04,

         8.9955e-03, -2.0524e-02, -1.3378e-02,  2.3255e-02,  9.5167e-04,

         2.1053e-02,  1.2794e-02, -7.6783e-03, -3.7832e-03, -8.9180e-03,

         7.4018e-04, -2.5204e-02, -1.7069e-02,  1.3481e-03,  4.7622e-02])


Try using the model for inference:

In [51]:
# Load a tokenizer from Upstage Solar, 
# which is compatible with the Llama-2 tokenizer
from transformers import LlamaTokenizer
model_dir = "upstage/SOLAR-10.7B-v1.0"
tokenizer = LlamaTokenizer.from_pretrained(model_dir)

# Run simple inference with prompt
from transformers import TextStreamer

prompt = "I am an engineer. I love"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=128,
    do_sample=False
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


possessed possessed possessed possessed possessed possessedcontinuecontinuecontinuecontinuecontinueDownloadџcontinueDownloadcontinueDownloadcontinueertsxE Point remoterts remoterts remoterts갑continuecontinuecontinue wide wide atr wide atr wide wide wide wide wide wide wide wide wide wide wide wideursor otra FC otraopesopesopesopesopesopesopesopesopesopesopes wideopes wideopes wideopes wideopes wideopes wideopes wideopesimpse Library wideopesasterasterasterasterasterasterasterasterasterasterasterasterasterasterasterasterasterasterasteraster primarily primarily primarily primarily primarily primarily primarilyasterasterasterasterasterasterasterasterasterasterasteraster primarilyitä primarilyitä primarilyitä primarilyitä


Remove the model from memory to avoid crashing the kernel:

In [52]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del model
del streamer
del outputs
gc.collect()

23

### Reuse general pretrained model weights

If you load an existing model, you can use it as is to continue pretraining on new data.

In [53]:
from transformers import AutoModelForCausalLM

model_name_or_path = "upstage/TinySolar-248m-4k"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

config.json:   0%|          | 0.00/687 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Remove the model from memory to avoid crashing the kernel:

In [54]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
del model
gc.collect()

0

### Downscaling from a general pretrained model

Here you'll downscale the TinySolar-248m-4k model from a 12-layer model to a 10-layer model.

In [55]:
from transformers import AutoTokenizer, AutoConfig

model_name_or_path = "upstage/TinySolar-248m-4k"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

tokenizer_config.json:   0%|          | 0.00/966 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [56]:
print(model)

LlamaForCausalLM(

  (model): LlamaModel(

    (embed_tokens): Embedding(32000, 1024)

    (layers): ModuleList(

      (0-11): 12 x LlamaDecoderLayer(

        (self_attn): LlamaSdpaAttention(

          (q_proj): Linear(in_features=1024, out_features=1024, bias=False)

          (k_proj): Linear(in_features=1024, out_features=256, bias=False)

          (v_proj): Linear(in_features=1024, out_features=256, bias=False)

          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)

          (rotary_emb): LlamaRotaryEmbedding()

        )

        (mlp): LlamaMLP(

          (gate_proj): Linear(in_features=1024, out_features=4096, bias=False)

          (up_proj): Linear(in_features=1024, out_features=4096, bias=False)

          (down_proj): Linear(in_features=4096, out_features=1024, bias=False)

          (act_fn): SiLU()

        )

        (input_layernorm): LlamaRMSNorm()

        (post_attention_layernorm): LlamaRMSNorm()

      )

    )

    (norm): LlamaRMSNorm()

In [57]:
print_nparams(model) # 248013824 => 248M

The total number of parameters is: 248013824


Remove the middle two layers (indices 5 and 6, counting from 0) and update the configuration:

In [58]:
layers = model.model.layers
model.model.layers = layers[:5] + layers[-5:]

config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_hidden_layers=len(model.model.layers),
)
model.config = config

print_nparams(model)  # 217601024 => 217M

The total number of parameters is: 217601024
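
To double-check which layers survive the slicing, the same expression can be applied to a plain list of layer indices. This is only an index-level sketch; no model is involved.

```python
num_layers = 12
layer_indices = list(range(num_layers))

# Same slicing as `layers[:5] + layers[-5:]`, applied to plain indices.
kept = layer_indices[:5] + layer_indices[-5:]
removed = [i for i in layer_indices if i not in kept]

print(kept)
print(removed)
```

The first five and last five layers are kept in their original order, and the two in the middle are the ones dropped.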


Clear the memory to avoid crashing the kernel:

In [59]:
# NOTE: We're running large models in a limited environment. Run me if you encounter any memory issues.
import gc
del model
gc.collect()

99

### Depth Upscaling from a general pretrained model

Here you are going to upscale the TinySolar-248m-4k model from 12 layers to 16 layers. Here are the steps you'll take:
1. Configure a 16-layer model and initialize it with random weights
2. Load the 12-layer TinySolar-248m-4k model into memory
3. Copy the bottom 8 and top 8 layers from the 12-layer model and use them to overwrite the random weights of the 16-layer model
4. Copy over the embedding and classifying layers to replace the randomly initialized counterparts in the 16-layer model
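
Step 3 can look puzzling at first, since 8 + 8 layers come out of a 12-layer model. The index-level sketch below uses the same `[:-4]` and `[4:]` slices as the layer-copying cell later in this section and shows that the middle layers simply appear twice. It works on plain integers only, not on the model.

```python
num_layers = 12
layer_indices = list(range(num_layers))

bottom = layer_indices[:-4]   # layers 0-7, the first eight
top = layer_indices[4:]       # layers 4-11, the last eight
upscaled = bottom + top

# Layers that appear twice in the 16-layer stack.
duplicated = sorted(set(bottom) & set(top))

print(upscaled)
print(duplicated)
```

So the upscaled model contains every pretrained layer at least once, with the four middle layers duplicated.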

In [60]:
config = LlamaConfig(
    num_hidden_layers=16,  # The upscaled model will have 16 layers
    hidden_size=1024,
    intermediate_size=4096,
    num_attention_heads=32,
    num_key_value_heads=8,
    torch_dtype="bfloat16",
    use_cache=False 
)
print(config)

LlamaConfig {

  "attention_bias": false,

  "attention_dropout": 0.0,

  "bos_token_id": 1,

  "eos_token_id": 2,

  "hidden_act": "silu",

  "hidden_size": 1024,

  "initializer_range": 0.02,

  "intermediate_size": 4096,

  "max_position_embeddings": 2048,

  "mlp_bias": false,

  "model_type": "llama",

  "num_attention_heads": 32,

  "num_hidden_layers": 16,

  "num_key_value_heads": 8,

  "pretraining_tp": 1,

  "rms_norm_eps": 1e-06,

  "rope_scaling": null,

  "rope_theta": 10000.0,

  "tie_word_embeddings": false,

  "torch_dtype": "bfloat16",

  "transformers_version": "4.42.3",

  "use_cache": false,

  "vocab_size": 32000

}




In [61]:
model = LlamaForCausalLM(config)
model = model.to(dtype=torch.bfloat16)  # convert to bfloat16
print_nparams(model)  # 308839424 => 308M

The total number of parameters is: 308839424


In [62]:
model_name_or_path = "upstage/TinySolar-248m-4k"
pretrained_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="cpu",
    torch_dtype=torch.bfloat16,    
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

print_nparams(pretrained_model) #  248013824 => 248M

The total number of parameters is: 248013824


In [63]:
from copy import deepcopy

model.model.layers = deepcopy(pretrained_model.model.layers[:-4]) \
    + deepcopy(pretrained_model.model.layers[4:])

model.model.embed_tokens = deepcopy(pretrained_model.model.embed_tokens)

model.lm_head = deepcopy(pretrained_model.lm_head)

print(model.config)

LlamaConfig {

  "attention_bias": false,

  "attention_dropout": 0.0,

  "bos_token_id": 1,

  "eos_token_id": 2,

  "hidden_act": "silu",

  "hidden_size": 1024,

  "initializer_range": 0.02,

  "intermediate_size": 4096,

  "max_position_embeddings": 2048,

  "mlp_bias": false,

  "model_type": "llama",

  "num_attention_heads": 32,

  "num_hidden_layers": 16,

  "num_key_value_heads": 8,

  "pretraining_tp": 1,

  "rms_norm_eps": 1e-06,

  "rope_scaling": null,

  "rope_theta": 10000.0,

  "tie_word_embeddings": false,

  "torch_dtype": "bfloat16",

  "transformers_version": "4.42.3",

  "use_cache": false,

  "vocab_size": 32000

}




Check that the number of parameters is still 308 million:

In [64]:
print_nparams(model)  # 308839424 => 308M

The total number of parameters is: 308839424
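
The parameter counts printed in this lesson can be reproduced by hand from the configuration values. The sketch below assumes a Llama-style decoder with no bias terms and untied input/output embeddings, as the printed config indicates (`attention_bias: false`, `mlp_bias: false`, `tie_word_embeddings: false`). The function name `llama_param_count` is illustrative.

```python
def llama_param_count(
    num_layers: int,
    hidden_size: int = 1024,
    intermediate_size: int = 4096,
    num_attention_heads: int = 32,
    num_key_value_heads: int = 8,
    vocab_size: int = 32000,
) -> int:
    """Count parameters of a bias-free Llama-style model with untied embeddings."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = head_dim * num_key_value_heads  # grouped-query attention: fewer K/V heads

    attention = (
        hidden_size * hidden_size      # q_proj
        + 2 * hidden_size * kv_dim     # k_proj and v_proj
        + hidden_size * hidden_size    # o_proj
    )
    mlp = 3 * hidden_size * intermediate_size  # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                    # input and post-attention RMSNorm
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size      # embed_tokens
    lm_head = vocab_size * hidden_size         # output projection (not tied)
    final_norm = hidden_size                   # final RMSNorm

    return num_layers * per_layer + embeddings + lm_head + final_norm


print(llama_param_count(10))  # 217601024 (downscaled model)
print(llama_param_count(12))  # 248013824 (original TinySolar-248m-4k)
print(llama_param_count(16))  # 308839424 (upscaled model)
```

Each decoder layer contributes 15,206,400 parameters, so adding four layers to the 12 layer model adds about 61 million parameters.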


Try using the model for inference:

In [65]:
# Run simple inference with the upscaled model before any further training
prompt = "I am an engineer. I love"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(
    tokenizer, 
    skip_prompt=True, 
    skip_special_tokens=True
)

outputs = model.generate(
    **inputs, 
    streamer=streamer, 
    use_cache=True, 
    max_new_tokens=128, 
    do_sample=False
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


to work with people who are not afraid to look at the world and are not afraid to look at the world with a little bit of a twist.

I am a very humble person and I am very fortunate to have a great team of people who work hard to make a difference.

I am very fortunate to have a great team of people who work hard to make a difference.

I am very fortunate to have a great team of people who work hard to make a difference.

I am very fortunate to have a great team of people who work hard to make a difference.

I am very fortunate to have a great team


### Save the model to disk

Note the new model name, which reflects the 308 million parameters of the upscaled model.

In [66]:
model.save_pretrained('/kaggle/working/TinySolar-308m-4k-init')

# Lesson 5. Model training

Pretraining is very expensive! Please check costs carefully before starting a pretraining project.

You can get a rough estimate of your training job cost using [this calculator](https://huggingface.co/training-cluster) from Hugging Face. For training on other infrastructure, e.g. AWS or Google Cloud, please consult those providers for up-to-date cost estimates.

In [None]:
import warnings
warnings.filterwarnings('ignore')

## 1. Load the model to be trained

Load the upscaled model from the previous lesson:

In [None]:
import torch
from transformers import AutoModelForCausalLM

pretrained_model = AutoModelForCausalLM.from_pretrained(
    "/content/TinySolar-308m-4k-init",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
    use_cache=False,
)

In [None]:
pretrained_model

## 2. Load dataset

Here you'll subclass PyTorch's `Dataset` and override two of its methods, `__len__` and `__getitem__`, so that it can interface with the trainer. These will be applied when you specify the dataset you created in Lesson 3 as the training data in the next section.

Note that the code has additional comment strings that don't appear in the video. These are to help you understand what each part of the code is doing.

In [None]:
import datasets
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, args, split="train"):
        """Initializes the custom dataset object."""
        self.args = args
        self.dataset = datasets.load_dataset(
            "parquet",
            data_files=args.dataset_name,
            split=split
        )

    def __len__(self):
        """Returns the number of samples in the dataset."""
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Retrieves a single data sample from the dataset
        at the specified index
        """
        # Convert the lists to a LongTensor for PyTorch
        input_ids = torch.LongTensor(self.dataset[idx]["input_ids"])
        labels = torch.LongTensor(self.dataset[idx]["input_ids"])

        # Return the sample as a dictionary
        return {"input_ids": input_ids, "labels": labels}
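
In the class above, `labels` is an unshifted copy of `input_ids`. This works because Hugging Face causal language models shift the labels internally when computing the loss, so each position is trained to predict the token that follows it. The sketch below illustrates that pairing with plain lists. The helper `next_token_pairs` and the token IDs are made up for illustration.

```python
def next_token_pairs(input_ids: list) -> list:
    """Pair each input token with the token the model is trained to predict next."""
    labels = list(input_ids)         # same content as input_ids, as in CustomDataset
    seen_tokens = input_ids[:-1]     # the last token has nothing to predict
    target_tokens = labels[1:]       # labels shifted left by one position
    return list(zip(seen_tokens, target_tokens))


# Made-up token IDs, used only to show the pairing
print(next_token_pairs([1, 306, 626, 385]))
# [(1, 306), (306, 626), (626, 385)]
```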

## 3. Configure Training Arguments

Here you set up the training run. The training dataset you created in Lesson 3 is specified in the Dataset configuration section.

Note: there are comment strings in the cell below that don't appear in the video. These have been included to help you understand what each parameter does.

In [None]:
from dataclasses import dataclass, field
import transformers

@dataclass
class CustomArguments(transformers.TrainingArguments):
    dataset_name: str = field(                           # Dataset configuration
        default="/content/packaged_pretrain_dataset.parquet")
    num_proc: int = field(default=1)                     # Number of subprocesses for data preprocessing
    max_seq_length: int = field(default=32)              # Maximum sequence length

    # Core training configurations
    seed: int = field(default=0)                         # Random seed for initialization, ensuring reproducibility
    optim: str = field(default="adamw_torch")            # Optimizer, here it's AdamW implemented in PyTorch
    max_steps: int = field(default=30)                   # Number of maximum training steps
    per_device_train_batch_size: int = field(default=2)  # Batch size per device during training

    # Other training configurations
    learning_rate: float = field(default=5e-5)           # Initial learning rate for the optimizer
    weight_decay: float = field(default=0)               # Weight decay
    warmup_steps: int = field(default=10)                # Number of steps for the learning rate warmup phase
    lr_scheduler_type: str = field(default="linear")     # Type of learning rate scheduler
    gradient_checkpointing: bool = field(default=True)   # Enable gradient checkpointing to save memory
    dataloader_num_workers: int = field(default=2)       # Number of subprocesses for data loading
    bf16: bool = field(default=True)                     # Use bfloat16 precision for training on supported hardware
    gradient_accumulation_steps: int = field(default=1)  # Number of steps to accumulate gradients before updating model weights

    # Logging configuration
    logging_steps: int = field(default=3)                # Frequency of logging training information
    report_to: str = field(default="none")               # Destination for logging (e.g., WandB, TensorBoard)

    # Saving configuration
    # save_strategy: str = field(default="steps")          # Can be replaced with "epoch"
    # save_steps: int = field(default=3)                   # Frequency of saving training checkpoint
    # save_total_limit: int = field(default=2)             # The total number of checkpoints to be saved
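
To see how `learning_rate`, `warmup_steps`, `max_steps`, and the `"linear"` scheduler interact, here is a minimal sketch of a linear warmup followed by a linear decay to zero. The function `linear_schedule_lr` is a hypothetical stand-in written for this illustration, and the values the `Trainer` applies at each step may differ in detail.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 5e-5,
    warmup_steps: int = 10,
    max_steps: int = 30,
) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 at max_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 5, 10, 20, 30):
    print(step, linear_schedule_lr(step))
```

With the defaults used in this lesson, the learning rate peaks at step 10 and falls back to zero at step 30.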

Parse the custom arguments and set the output directory where the model will be saved:

In [None]:
parser = transformers.HfArgumentParser(CustomArguments)
args, = parser.parse_args_into_dataclasses(
    args=["--output_dir", "output"]
)

Set up the training dataset:

In [None]:
train_dataset = CustomDataset(args=args)

Check the shape of the dataset:

In [None]:
print("Input shape: ", train_dataset[0]['input_ids'].shape)
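
The sequence length and the training arguments together determine how many tokens the model sees in a run. The sketch below is a rough estimate that assumes a single device and that every packed sample contains exactly `max_seq_length` tokens. The helper `tokens_per_run` is illustrative only.

```python
def tokens_per_run(
    max_steps: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    max_seq_length: int,
    num_devices: int = 1,  # assumption: training on a single device
) -> int:
    """Estimate the number of tokens processed over a whole training run."""
    sequences_per_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    return max_steps * sequences_per_step * max_seq_length


# Defaults from CustomArguments: 30 steps, batch size 2, no accumulation, length 32
print(tokens_per_run(30, 2, 1, 32))  # 1920
```

At under two thousand tokens, this run is a demonstration of the training loop rather than meaningful pretraining.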

## 4. Run the trainer and monitor the loss

First, set up a callback to log the loss values during training (note this cell is not shown in the video):

In [None]:
from transformers import Trainer, TrainingArguments, TrainerCallback

# Define a custom callback to log the loss values
class LossLoggingCallback(TrainerCallback):
    def __init__(self):
        self.logs = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            self.logs.append(logs)

# Initialize the callback
loss_logging_callback = LossLoggingCallback()

Then, create an instance of the Hugging Face `Trainer` object from the `transformers` library. Call the `train()` method of the trainer to start the training run:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=pretrained_model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=None,
    callbacks=[loss_logging_callback]
)

trainer.train()
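
Once training finishes, the dictionaries collected in `loss_logging_callback.logs` can be reduced to a list of loss values. The sketch below assumes that entries logged during training carry a `"loss"` key while the final summary entry does not. The helper `extract_losses` is hypothetical, and the example values are placeholders, not results from a real run.

```python
def extract_losses(logs: list) -> list:
    """Keep only the loss values from a list of Trainer log dictionaries."""
    return [entry["loss"] for entry in logs if "loss" in entry]


# Placeholder log entries for illustration only (not real training results)
example_logs = [
    {"loss": 4.61, "learning_rate": 1.5e-05, "epoch": 0.0},
    {"loss": 4.48, "learning_rate": 3.0e-05, "epoch": 0.0},
    {"train_runtime": 12.3, "train_loss": 4.5, "epoch": 0.0},  # summary entry, no "loss"
]

print(extract_losses(example_logs))  # [4.61, 4.48]
```

In the notebook you would call `extract_losses(loss_logging_callback.logs)` and check that the values trend downward.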

To save intermediate model checkpoints in your own training run, uncomment the following fields in the `CustomArguments` class defined earlier:

In [None]:
# Saving configuration
# save_strategy: str = field(default="steps")          # Can be replaced with "epoch"
# save_steps: int = field(default=3)                   # Frequency of saving training checkpoint
# save_total_limit: int = field(default=2)             # The total number of checkpoints to be saved

### Checking the performance of an intermediate checkpoint

Below, you can try generating text using an intermediate checkpoint of the model. This checkpoint was saved after 10,000 training steps. As you did in previous lessons, you'll use the Solar tokenizer and then set up a `TextStreamer` object to display the text as it is generated:

In [None]:
from transformers import AutoTokenizer, TextStreamer
model_name_or_path = "./models/upstage/TinySolar-248m-4k"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

In [None]:
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

model_name_or_path = "./models/output/checkpoint-10000"
model2 = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)


In [None]:
prompt = "I am an engineer. I love"

inputs = tokenizer(prompt, return_tensors="pt").to(model2.device)

streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

outputs = model2.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=64,
    do_sample=True,
    temperature=1.0,
)
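
The cell above samples with `do_sample=True` and `temperature=1.0`, whereas the earlier upscaled-model example used greedy decoding (`do_sample=False`) and produced repetitive text. The following minimal sketch shows the difference for a single decoding step using only the standard library. The function names and the logits are illustrative and do not reflect how `generate()` is implemented internally.

```python
import math
import random


def softmax_with_temperature(logits: list, temperature: float = 1.0) -> list:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits: list, do_sample: bool, temperature: float = 1.0, rng=None) -> int:
    """Choose the next token index greedily or by sampling."""
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


example_logits = [0.1, 2.0, 0.5]               # made-up scores for three tokens
print(pick_token(example_logits, do_sample=False))   # 1
print(softmax_with_temperature(example_logits, temperature=1.0))
print(softmax_with_temperature(example_logits, temperature=0.5))
```

Greedy decoding always returns the same token for the same logits, while sampling can return any token in proportion to its probability, which tends to reduce repetition.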

# Lesson 6. Model evaluation

The model comparison tool that Sung described in the video can be found at this link: https://console.upstage.ai/ (note that you need to create a free account to try it out.)

A useful tool for evaluating LLMs is the **LM Evaluation Harness** built by EleutherAI. Information about the harness can be found at this [GitHub repo](https://github.com/EleutherAI/lm-evaluation-harness).

You can run the commented code below to install the evaluation harness in your own environment:

In [None]:
#!pip install -U git+https://github.com/EleutherAI/lm-evaluation-harness

You will evaluate TinySolar-248m-4k on 5 questions from the **TruthfulQA MC2 task**. This is a multiple-choice question answering task that tests the model's ability to identify true statements. You can read more about the TruthfulQA benchmark in [this paper](https://arxiv.org/abs/2109.07958), and you can check out the code for implementing the tasks at this [GitHub repo](https://github.com/sylinrl/TruthfulQA).

The code below runs only the TruthfulQA MC2 task using the LM Evaluation Harness:

In [None]:
!lm_eval --model hf \
    --model_args pretrained=./models/upstage/TinySolar-248m-4k \
    --tasks truthfulqa_mc2 \
    --device cpu \
    --limit 5
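
The sketch below shows one way an MC2-style score can be computed for a single question, assuming the score is the share of probability mass the model assigns to the true answers out of all candidate answers. The helper `mc2_score` and the log-likelihood values are illustrative; refer to the harness source for the authoritative implementation.

```python
import math


def mc2_score(true_loglikelihoods: list, false_loglikelihoods: list) -> float:
    """Fraction of total answer probability assigned to the true answers."""
    p_true = [math.exp(ll) for ll in true_loglikelihoods]
    p_false = [math.exp(ll) for ll in false_loglikelihoods]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Made-up log-likelihoods for two true and two false candidate answers
print(mc2_score([-1.0, -2.0], [-1.5, -3.0]))
```

A score of 0.5 means the model spreads probability evenly between true and false answers, and a score closer to 1.0 means it favors the true ones.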

### Evaluation for the Hugging Face Leaderboard
You can use the code below to test your own model against the evaluations required for the [Hugging Face leaderboard](https://huggingface.co/open-llm-leaderboard). 

If you decide to run this evaluation on your own model, don't change the few-shot numbers below; they are set by the rules of the leaderboard.

In [None]:
import os

def h6_open_llm_leaderboard(model_name):
  task_and_shot = [
      ('arc_challenge', 25),
      ('hellaswag', 10),
      ('mmlu', 5),
      ('truthfulqa_mc2', 0),
      ('winogrande', 5),
      ('gsm8k', 5)
  ]

  for task, fewshot in task_and_shot:
    eval_cmd = f"""
    lm_eval --model hf \
        --model_args pretrained={model_name} \
        --tasks {task} \
        --device cpu \
        --num_fewshot {fewshot}
    """
    os.system(eval_cmd)

h6_open_llm_leaderboard(model_name="YOUR_MODEL")
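
As an alternative to passing a shell string to `os.system`, each command can be built as an argument list, which avoids shell quoting issues when a model path contains spaces. This is a sketch rather than a replacement for the function above: `build_eval_command` is a hypothetical helper, and it uses only the `lm_eval` flags that already appear in this lesson.

```python
import subprocess

# Task names and few-shot counts copied from the function above
LEADERBOARD_TASKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]


def build_eval_command(model_name: str, task: str, num_fewshot: int, device: str = "cpu") -> list:
    """Build the lm_eval argument list for one task (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_name}",
        "--tasks", task,
        "--device", device,
        "--num_fewshot", str(num_fewshot),
    ]


for task, shots in LEADERBOARD_TASKS:
    command = build_eval_command("YOUR_MODEL", task, shots)
    print(" ".join(command))
    # Uncomment to run once lm_eval is installed and YOUR_MODEL is replaced:
    # subprocess.run(command, check=True)
```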