# Training Mistral 7B
After some research, we determined that Mistral 7B is an ideal model for fine-tuning on question-answer pairs.
Unfortunately, we were not able to run the training on our devices because they did not have enough RAM.

In [8]:
def setup():
    # Install each dependency once; the original list repeated several packages.
    !pip install torch transformers[torch] datasets
    !pip install -U accelerate
    !pip install scipy==1.11.1
    !pip install mistral_inference
    !pip install markdown nltk more_itertools matplotlib
    # Depending on your system, you might need a different build, see https://pytorch.org/get-started/locally/
    !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    
#setup()

In [9]:
import torch
from huggingface_hub import HfApi, list_models
import requests
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset
import json

## Hugging Face access
Mistral 7B is loaded via Hugging Face. To set up the connection, the following steps are required:

- Register at https://huggingface.co/
- Request access for Mistral-7B-v0.1 at https://huggingface.co/mistralai/Mistral-7B-v0.1
- Create an access token with "Write" authorization at https://huggingface.co/settings/tokens
- Insert the token below

In [10]:
hf_api = HfApi(
    endpoint="https://huggingface.co", # Can be a Private Hub endpoint.
    token="your_token_here", # Token is not persisted on the machine.
)
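
Hardcoding the token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The helper name `get_hf_token` is hypothetical, and the sketch assumes the token was exported as `HF_TOKEN` before the notebook was started.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    `get_hf_token` is a hypothetical helper for illustration; it assumes the
    token was exported (e.g. `export HF_TOKEN=...`) before starting Jupyter.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token
```

The result could then be passed on as `HfApi(endpoint="https://huggingface.co", token=get_hf_token())`.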

## Preparing data
The prepared question-answer pairs need to be converted into a list of dictionaries, one per pair.
The Cleantech media data and the Google Patents data are kept separate so that they can later serve as training and test data, respectively.

In [11]:
# Download the markdown file
url_media= 'https://raw.githubusercontent.com/pscllbssr/clt-cleantech-project/main/faq_media.md'
url_patent='https://raw.githubusercontent.com/pscllbssr/clt-cleantech-project/main/faq_patent.md'

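def tokenize_qa(url, filename):
    """Download a markdown FAQ, parse its Q/A pairs, save them as JSON, and return a Dataset.

    Despite the name, no tokenization happens here; that is done in the Encoding section.
    """
    response = requests.get(url)
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    md_content = response.text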
    
    # Parse the markdown file
    qa_pairs = []
    lines = md_content.split('\n')
    question, answer = None, None

    for line in lines:
        if line.startswith('# Q:'):
            if question and answer:
                qa_pairs.append({'question': question, 'answer': answer})
            question = line.replace('# Q:', '').strip()
            answer = None
        elif line.startswith('A:'):
            answer = line.replace('A:', '').strip()
    
    # Add the last QA pair
    if question and answer:
        qa_pairs.append({'question': question, 'answer': answer})
        print(question + ";;;" + answer)  # Sanity check: prints only the last pair
    
    # Save QA pairs to JSON file
    with open(filename+'.json', 'w') as f:
        json.dump(qa_pairs, f, indent=2)
    return Dataset.from_dict({'question': [qa['question'] for qa in qa_pairs],
                                 'answer': [qa['answer'] for qa in qa_pairs]})

dataset_media = tokenize_qa(url_media, 'media')
dataset_patent = tokenize_qa(url_patent, 'patent')
print(dataset_patent["question"])
print(dataset_media['answer'])

What is the purpose of Array Technologies' agreement with POSCO?;;;Array Technologies has entered into a multi-year supply arrangement with steelmaker POSCO to diversify and strengthen its global supply chain, providing access to POSCO's proprietary PosMAC material – an alloy-coated corrosion-resistant steel.
What are the main components of a IoT device's controlling means?;;;The main components include the fixed rotation motor provided with in the bottom of the controlling means shell's inner wall, the fixed axis of rotation that passes through the top of the bearing and controlling means casing, and extends to the top of the controlling means casing.
['What does "Disclosed" refer to?', 'Is the energy recycling method suitable for all types of buildings?', 'What factors affect the growth of plants in different environments?', 'What are the key features of the utility model?', 'What components are typically found on the front and rear sides of a water tank?', 'What is the purpose of ar
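
To make the expected input format concrete, here is a small offline sketch of the same parsing logic applied to an inline string instead of a downloaded file. The function name `parse_faq_markdown` and the two sample pairs are made up for illustration; only the `# Q:` and `A:` prefixes are taken from the code above.

```python
def parse_faq_markdown(md_content: str) -> list[dict[str, str]]:
    """Collect question/answer pairs from lines starting with '# Q:' and 'A:'."""
    qa_pairs = []
    question = None
    for line in md_content.splitlines():
        line = line.strip()
        if line.startswith("# Q:"):
            question = line[len("# Q:"):].strip()
        elif line.startswith("A:") and question:
            answer = line[len("A:"):].strip()
            qa_pairs.append({"question": question, "answer": answer})
            question = None  # Wait for the next question before accepting an answer
    return qa_pairs


# Hypothetical sample in the same layout as the FAQ files
sample = """# Q: What is green hydrogen?
A: Hydrogen produced by electrolysis using renewable electricity.

# Q: What does PV stand for?
A: Photovoltaics.
"""

pairs = parse_faq_markdown(sample)
print(pairs)
```

Unlike the notebook function, this variant appends a pair as soon as its answer line is seen, so no special handling of the last pair is needed.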

## Encoding
The prepared question-answer pairs now need to be tokenized with Mistral's tokenizer.

In [12]:
# Set up the tokenizer
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')
# Mistral's tokenizer ships without a padding token, so add one for fixed-length padding
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

def preprocess_function(examples):
    inputs = examples['question']
    targets = examples['answer']
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding='max_length')

    # Tokenize the answers as labels. The deprecated `as_target_tokenizer()` context
    # manager is unnecessary here because inputs and targets share the same tokenizer.
    labels = tokenizer(targets, max_length=512, truncation=True, padding='max_length')

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset_training = dataset_media.map(preprocess_function, batched=True)
tokenized_dataset_test = dataset_patent.map(preprocess_function, batched=True)

Map:   0%|          | 0/25 [00:00<?, ? examples/s]



Map:   0%|          | 0/43 [00:00<?, ? examples/s]

In [13]:
print(tokenized_dataset_training[0])

{'question': 'What process does the facility use to convert solar energy into hydrogen?', 'answer': 'The facility converts solar energy into green hydrogen through water electrolysis.', 'input_ids': [32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32
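
The preprocessing above tokenizes questions and answers separately, which mirrors a sequence-to-sequence setup. For a decoder-only model such as Mistral, a common alternative is to concatenate question and answer into one sequence and exclude the question and padding positions from the loss by setting their labels to `-100`. The following is a minimal sketch of that idea on plain lists of token ids, not the method used in this notebook. The function name `build_causal_lm_example` and the toy token ids are hypothetical; `32000` is the padding id visible in the output above, and right-side padding is chosen only for readability.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default


def build_causal_lm_example(prompt_ids, answer_ids, max_length, pad_id):
    """Concatenate prompt and answer; only answer tokens contribute to the loss."""
    input_ids = (prompt_ids + answer_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Pad on the right up to max_length; padding is masked out and ignored by the loss
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }


# Toy ids for illustration only; real ids would come from the tokenizer
example = build_causal_lm_example(
    prompt_ids=[1, 10, 11], answer_ids=[20, 21, 2], max_length=8, pad_id=32000
)
print(example)
```

The labels are deliberately not shifted here, because Hugging Face causal language models shift them internally when computing the loss.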

## Setup Model
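Now the model needs to be loaded. Adding the `[PAD]` token grew the tokenizer vocabulary to 32001 entries, one more than the 32000 rows of the model's embedding matrix, so the embeddings have to be resized by one row.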

In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1').to(device)

# The added [PAD] token has no embedding row yet; resizing the embeddings fixes the mismatch
print("Before resizing:")
print("Model embedding size:", model.get_input_embeddings().weight.size(0))
print("Tokenizer vocabulary size:", len(tokenizer))

print("After resizing:")
model.resize_token_embeddings(len(tokenizer))
print("Model embedding size:", model.get_input_embeddings().weight.size(0))
print("Tokenizer vocabulary size:", len(tokenizer))

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Before resizing:
Model embedding size: 32000
Tokenizer vocabulary size: 32001
After resizing:
Model embedding size: 32001
Tokenizer vocabulary size: 32001


## Setting up Training
Here the training parameters are defined. The batch size, gradient accumulation, and mixed-precision settings are currently chosen to minimize memory usage.

In [15]:
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=500,
    weight_decay=0.01,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=1,
    save_steps=5,
    evaluation_strategy="steps",
    eval_steps=5,
    load_best_model_at_end=True,
    gradient_accumulation_steps=8,
    fp16=True  # Enable mixed precision training (requires a supported GPU)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_training,
    eval_dataset=tokenized_dataset_test
)
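
With only 25 training examples, it is worth checking how many optimizer steps these settings produce. The sketch below is rough arithmetic, not the Trainer's exact bookkeeping: the rounding of partial accumulation windows depends on the library version, the learning rate of `5e-5` is the `TrainingArguments` default (an assumption, since the notebook does not set it), and both function names are hypothetical.

```python
import math


def optimizer_steps(n_examples, batch_size, grad_accum, epochs):
    """Approximate number of optimizer updates over the whole training run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    return math.ceil(batches_per_epoch / grad_accum) * epochs


def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during linear warmup, capped at base_lr afterwards."""
    return base_lr * min(1.0, step / max(1, warmup_steps))


# 25 media examples, batch size 1, 8 accumulation steps, 3 epochs
total_steps = optimizer_steps(n_examples=25, batch_size=1, grad_accum=8, epochs=3)
final_lr = warmup_lr(total_steps, base_lr=5e-5, warmup_steps=500)

print(f"total optimizer steps: {total_steps}")
print(f"learning rate reached at the end: {final_lr:.2e}")
```

Under these assumptions the run ends after roughly a dozen updates, long before the 500 warmup steps are over, so the learning rate would stay at a small fraction of its target value. A much smaller `warmup_steps` would likely suit a dataset of this size better.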



## Training
Now for the actual training. Unfortunately, this is as far as we got: our Jupyter kernels crashed on every attempt to execute the following command, and running it as a plain Python script crashed as well due to lack of RAM.

In [None]:
trainer.train()
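
A back-of-the-envelope estimate helps explain the crashes. The sketch below assumes full fine-tuning with AdamW, where weights, gradients, and the two optimizer states per parameter are all held as 32-bit floats, and it ignores activations entirely. The parameter count of about 7.24 billion is an approximation, and the function name `training_memory_gb` is hypothetical.

```python
N_PARAMS = 7.24e9  # Assumption: approximate parameter count of Mistral-7B-v0.1


def training_memory_gb(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Return (weights_gb, total_gb) with 1 GB = 1e9 bytes, excluding activations.

    optimizer_bytes=8 models AdamW's two float32 states per parameter.
    """
    weights_gb = n_params * weight_bytes / 1e9
    total_gb = n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9
    return weights_gb, total_gb


weights_gb, total_gb = training_memory_gb(N_PARAMS)
print(f"float32 weights only: ~{weights_gb:.0f} GB")
print(f"weights + gradients + AdamW states: ~{total_gb:.0f} GB")
```

Under these assumptions, loading the weights alone takes about 29 GB and a full training step needs on the order of 116 GB, which is far beyond typical laptop or desktop memory. Parameter-efficient approaches or quantized loading are the usual ways to bring this down, although they were not tried here.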