<a href="https://colab.research.google.com/github/kallepalomaki/South-Ostrobothnia-Deep-Seek/blob/main/deep_seek_southob_blog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a Google Colab notebook based on the DataCamp DeepSeek fine-tuning tutorial (https://www.datacamp.com/tutorial/fine-tuning-deepseek-r1-reasoning-model), which I have adapted for translating standard Finnish into the South Ostrobothnian dialect. Fine-tuning is done in two passes: first on a bit over 300 standard Finnish and South Ostrobothnian sentence pairs, and then on a bit over 2000 standard Finnish and South Ostrobothnian word pairs. The South Ostrobothnian sentences are taken from Pekka Pohjanpää's blog: http://pohopekka.blogspot.com/. I'm not sharing the sentence pairs themselves, only the standard Finnish sentences that correspond to the blog text. The blog text is indexed by sentence order so that the standard Finnish sentences can be matched to it. I also share the vocabulary. The shared data is placed in the directory data.
Note that the blog is copyrighted material and I haven't reached the copyright holder. Therefore I provide only a download function. Use it at your own responsibility.

In [1]:
%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git


In [2]:
!pip install bs4



Directory settings. Adapt them as needed. The secrets directory holds the Hugging Face and wandb tokens. The data directory holds the standard Finnish to South Ostrobothnian vocabulary in vocabulary_deep_seek_format.json and the standard Finnish sentences, keyed by dialect sentence index, in standard_finnish_sentences.json. I'm using Google Colab, so the directories point to Google Drive.

In [3]:
SECRETS_DIR = "/content/drive/My Drive/secrets"
DATA_DIR = "/content/drive/My Drive/data"
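
Once Drive is mounted, it can save a failed run to check that the files named in this notebook are actually where the two directories point. The following is a minimal sketch; `missing_files` and `EXPECTED_FILES` are hypothetical helpers that are not part of the original notebook, and only the file names come from it.

```python
from pathlib import Path

# File names as used in this notebook; the helper itself is hypothetical.
EXPECTED_FILES = {
    "secrets": ["huggingfacetoken.key", "wandb.key"],
    "data": ["vocabulary_deep_seek_format.json", "standard_finnish_sentences.json"],
}


def missing_files(secrets_dir, data_dir):
    """Return the expected files that are not present on disk."""
    dirs = {"secrets": Path(secrets_dir), "data": Path(data_dir)}
    return [
        str(dirs[kind] / name)
        for kind, names in EXPECTED_FILES.items()
        for name in names
        if not (dirs[kind] / name).is_file()
    ]
```

Calling `missing_files(SECRETS_DIR, DATA_DIR)` after `drive.mount` should return an empty list when everything is in place.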

In [4]:
from google.colab import drive
drive.mount('/content/drive')

secret_file_path = SECRETS_DIR+'/huggingfacetoken.key'
with open(secret_file_path) as f:
  hf_token=f.read().strip()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
secret_file_path = SECRETS_DIR+'/wandb.key'

with open(secret_file_path) as f:
  wb_token=f.read().strip()

In [6]:
from huggingface_hub import login
login(hf_token)

Wandb is used here to monitor training. You'll need to create an account and get a token to use it.

In [7]:
import wandb
import os
os.environ["WANDB_SILENT"] = "true"

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on South Ostrobotnia dialect 2',
    job_type="training",
    anonymous="allow"
)

The DeepSeek model is downloaded from Hugging Face. You'll need a Hugging Face API token for this.

In [8]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.17: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Prompt for testing how standard Finnish translates to the dialect.

In [9]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
Translate standard Finnish sentence to South Ostrobothnian dialect.

### Question:
{}

### Response:
<think>{}"""
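
The template has two `{}` slots: the first takes the standard Finnish input, and the second is left empty at inference time so that the prompt ends with an open `<think>` tag for the model to continue from. A small sketch of this, using a shortened stand-in template (`demo_style` is not the notebook's full prompt):

```python
# Shortened stand-in for prompt_style; it has the same two {} slots.
demo_style = """### Question:
{}

### Response:
<think>{}"""

# Slot 1 gets the standard Finnish input. Slot 2 stays empty at inference,
# so the model continues right after the opening <think> tag.
demo_prompt = demo_style.format("Katsoin häntä aitan ovelta.", "")
print(demo_prompt)
```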

Run a test on the original model before fine-tuning.

In [10]:
standard_finnish = "Hevoskauppias ajoi uudella hevosellaan meidän talon ohi. Katsoin häntä aitan ovelta."

FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(standard_finnish, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print("Dialect:")
print(response[0].split("### Response:")[1])

Dialect:

<think>
Okay, so I have this task where I need to translate a standard Finnish sentence into the South Ostrobothnian dialect. Let me break this down step by step. First, I need to understand the original sentence and then figure out how it would be expressed in the dialect.

The original sentence is: "Hevoskauppias ajoi uudella hevosellaan meidän talon ohi. Katsoin häntä aitan ovelta."

Alright, let's parse this sentence. The first part is "Hevoskauppias ajoi uudella hevosellaan meidän talon ohi." This seems to be about a horse dealer who is driving a new horse past their house. The second part is "Katsoin häntä aitan ovelta," which translates to "I watched him from the window."

Now, translating this into South Ostrobothnian dialect. I know that dialects often have unique words and structures compared to standard language. So, I need to identify the key components and see how they might be expressed differently.

Starting with "Hevoskauppias." In the dialect, I think "kauppi

Training prompt for standard Finnish and dialect sentence pairs.

In [11]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
Translate standard Finnish sentence to South Ostrobothnian dialect.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

Training prompt for standard Finnish to dialect vocabulary.

In [12]:
train_prompt_vocabulary_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
Translate the following word from standard Finnish to South Ostrobothnian dialect.

### Word:
{}

### Dialect Equivalent:
<think>
{}
</think>
{}"""

Formatting function for standard Finnish to dialect sentence pairs.

In [13]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    # `question`/`answer` avoid shadowing the built-in input()
    for question, cot, answer in zip(inputs, cots, outputs):
        text = train_prompt_style.format(question, cot, answer) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }
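
To see what one formatted training text looks like without loading the model, the same column-wise logic can be run on a plain dictionary. This is only a sketch: `EOS`, `template`, and `format_batch` are stand-ins introduced here, and the sentences are placeholders rather than real training data.

```python
# Stand-ins for the notebook's objects, so the sketch runs without the model.
EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token
template = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"


def format_batch(examples):
    """Same column-wise logic as formatting_prompts_func, on plain dicts."""
    rows = zip(examples["Question"], examples["Complex_CoT"], examples["Response"])
    return {"text": [template.format(q, cot, r) + EOS for q, cot, r in rows]}


batch = {
    "Question": ["<standard Finnish sentence>"],
    "Complex_CoT": [""],  # the sentence pairs carry no reasoning text
    "Response": ["<dialect sentence>"],
}
texts = format_batch(batch)["text"]
print(texts[0])
```

Because `Complex_CoT` is empty for the sentence pairs, every training text contains an empty `<think>` block followed directly by the dialect sentence.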

Formatting function for vocabulary prompts.

In [14]:
def formatting_prompts_vocab_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    # `word`/`answer` avoid shadowing the built-in input()
    for word, cot, answer in zip(inputs, cots, outputs):
        text = train_prompt_vocabulary_style.format(word, cot, answer) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

Function to download Pekka Pohjanpää's blog. The blog is split into posts and arranged as a list of dictionaries, each with a title and the post content. Splitting into sentences happens in the matching step.

In [15]:
import random
import requests
from bs4 import BeautifulSoup
import json
def get_pohopekka():
    main_page_link="http://pohopekka.blogspot.com/"
    structured_data = []
    latest_h2 = None  # title of the most recent <h2>; avoids a NameError if a post comes first
    page=requests.get(main_page_link)
    soup = BeautifulSoup(page.content, 'html.parser')
    for element in soup.find_all(["h2", "div"]):
        if element.name=="h2":
            latest_h2 = element.text.strip()
        elif element.name=="div" and "blogPost" in element.get("class", []):
            if latest_h2:
                content=element.text.split("# posted")[0].strip().replace('\r',' ').replace('PEKAN SIVUPERSOONA','').\
                    replace('Pekan Sivupersoona','').replace("\n"," ")

                content=content.split("----",1)[0].strip()

                if len(content)>5:
                    structured_data.append({"title": latest_h2, "content": content})
                latest_h2 = None

    return structured_data

Match the dialect sentences of Pekka Pohjanpää's stories to standard Finnish using a precalculated index.

In [16]:
def match_story_sentences_to_standard(stories,standard_finnish_sentences):
    story_idx=0
    dialect=dict()
    for story_ in stories:
        story=story_["content"]
        story_splitted=story.split(". ")
        inside_story_idx=0
        for sentence in story_splitted:
            dialect[str(story_idx) + "-" + str(inside_story_idx)] = sentence
            inside_story_idx+=1
        story_idx+=1


    with open(standard_finnish_sentences) as f:
        standard=json.load(f)

    standard_dialect=[]

    for key in standard:
        standard_dialect.append({"standard":standard[key],"dialect":dialect[key]})
    return standard_dialect
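
The keys used for matching have the form `<story index>-<sentence index>`, where sentences come from splitting each story on `". "`. A small sketch of the indexing step alone, on placeholder text (`index_sentences` and `toy_stories` are hypothetical names, not part of the notebook):

```python
def index_sentences(stories):
    """Build '<story>-<sentence>' keys the same way as the matching function."""
    dialect = {}
    for story_idx, story in enumerate(stories):
        for sent_idx, sentence in enumerate(story["content"].split(". ")):
            dialect[f"{story_idx}-{sent_idx}"] = sentence
    return dialect


# Placeholder text, not real blog content.
toy_stories = [{"content": "A one. A two."}, {"content": "B one"}]
toy_index = index_sentences(toy_stories)
print(toy_index)
```

Two consequences of this scheme are worth keeping in mind: splitting on `". "` drops the period from every sentence except the last one in a story, and the keys depend on the order of posts on the page, so any key in standard_finnish_sentences.json that no longer exists in the downloaded blog raises a `KeyError`.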

Function to compose training records from the standard Finnish and dialect sentence pairs.

In [17]:
def write_translated(standard_dialect):
    data_arr=[]
    for data in standard_dialect:
        target_data=data["dialect"].strip()
        input_data=data["standard"].strip()
        data_arr.append({
            "Question": input_data,  # Standard Finnish
            "Complex_CoT": "",  # No reasoning text for the sentence pairs
            "Response": target_data,  # South Ostrobothnian dialect
            "text": input_data + " - " + target_data  # Replaced later by formatting_prompts_func
        })

    return data_arr

In [18]:
stories=get_pohopekka()
standard_dialect=match_story_sentences_to_standard(stories=stories,standard_finnish_sentences=DATA_DIR+"/standard_finnish_sentences.json")
train_set=write_translated(standard_dialect=standard_dialect)
random.shuffle(train_set)

Build the standard Finnish vs. dialect dataset and format the prompts.

In [19]:
from datasets import load_dataset, Dataset

dataset0 = Dataset.from_list(train_set)

dataset0 = dataset0.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/306 [00:00<?, ? examples/s]

Get the vocabulary data, which I'm sharing in the repo.

In [20]:
data_file_path = DATA_DIR+'/vocabulary_deep_seek_format.json'


dataset1 = load_dataset('json',data_files=data_file_path, split = "train[:-1]",trust_remote_code=True)

dataset1 = dataset1.map(formatting_prompts_vocab_func, batched = True,)

Model configuration (LoRA adapters).

In [21]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.3.17 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
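
With `r=16` on all seven projection types, the number of trainable parameters can be estimated by hand: each adapted linear layer gets two low-rank matrices with `r * (in_features + out_features)` weights in total. The sketch below assumes the usual Llama-3.1-8B layer shapes, which are not stated in this notebook; under that assumption it gives the same trainable parameter count that the training log reports.

```python
# Assumed Llama-3.1-8B shapes (in_features, out_features); not stated in the notebook.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(r):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (i + o) for i, o in SHAPES.values())
    return per_layer * LAYERS


print(lora_param_count(16))  # 41943040
```

Since the count is linear in `r`, doubling the rank would double the number of trainable parameters.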


Trainer for standard Finnish - dialect sentence pairs.

In [22]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer_transl = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset0,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        #max_steps=60,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/306 [00:00<?, ? examples/s]

Trainer for vocabulary data. It uses fewer epochs and a smaller learning rate, because the vocabulary data has more training samples than the sentence-pair data.

In [23]:
#works when not ran
trainer_vocab = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset1,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        #max_steps=60,
        num_train_epochs=1,
        learning_rate=5e-5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs/vocab",
    ),
)

Run trainers.

In [24]:
!nvidia-smi
trainer_stats = trainer_transl.train()

Wed Mar 19 20:06:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   71C    P0             32W /   70W |    6242MiB /  15360MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 306 | Num Epochs = 3 | Total steps = 114
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,3.2455
20,1.8585
30,1.5911
40,1.4417
50,1.1337
60,1.1204
70,1.0418
80,0.9724
90,0.7293
100,0.7805


In [25]:
!nvidia-smi
trainer_stats = trainer_vocab.train()

Wed Mar 19 20:16:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             33W /   70W |    7012MiB /  15360MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,688 | Num Epochs = 1 | Total steps = 211
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Step,Training Loss
10,1.236
20,0.3639
30,0.3213
40,0.3255
50,0.3229
60,0.3402
70,0.329
80,0.336
90,0.3218
100,0.297
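
The "Total steps" values in the two training logs follow from the dataset size, the per-device batch size, the gradient accumulation steps, and the number of epochs. The sketch below reproduces them under the assumption of a single GPU and a partial final accumulation group being dropped each epoch; `total_steps` is a hypothetical helper, and other library versions may round differently.

```python
def total_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps for one GPU, dropping a partial final accumulation group."""
    batches_per_epoch = -(-num_examples // per_device_batch)  # ceiling division
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


print(total_steps(306, 2, 4, 3))   # sentence pairs: 114
print(total_steps(1688, 2, 4, 1))  # vocabulary: 211
```

With `logging_steps=10`, the loss tables above only show steps up to 100, so the last logged value is not the loss at the final step of either run.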


Run a test on a single sentence.

In [27]:
standard_finnish = "Hevoskauppias ajoi uudella hevosellaan meidän talon ohi. Katsoin häntä aitan ovelta."


FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(standard_finnish, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print("Dialect:")
print(response[0].split("### Response:")[1])


Dialect:

<think>

</think>
Hevooskauppias ajoo uureella hevoosella, meirän talo ohi, kattoon sitä aita ovelta<｜end▁of▁sentence｜>
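
The printed response still contains the `<think>` block and the end-of-sentence token. A minimal post-processing sketch that keeps only the translation could look like this; `extract_answer` is a hypothetical helper, and the default token string is copied from the output above (in the notebook, `tokenizer.eos_token` could be passed instead).

```python
def extract_answer(decoded, eos_token="<｜end▁of▁sentence｜>"):
    """Return the text after '### Response:' with the <think> block and EOS removed."""
    answer = decoded.split("### Response:", 1)[-1]
    if "</think>" in answer:
        # Keep only what follows the closing tag of the reasoning block.
        answer = answer.split("</think>", 1)[1]
    return answer.replace(eos_token, "").strip()


sample = "### Response:\n<think>\n\n</think>\nHevooskauppias ajoo<｜end▁of▁sentence｜>"
print(extract_answer(sample))
```

If generation stops before `</think>` is produced, as in the truncated output of the original model, the function returns the unfinished reasoning text unchanged apart from whitespace stripping.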
