# Training and Deploying Mistral 7B

This notebook was copied from AWS SageMaker to show the workflow for configuring and deploying the model.

In [2]:
!pip install bitsandbytes


Collecting bitsandbytes
  Using cached bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl.metadata (2.2 kB)
Using cached bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.1


In [1]:
import sagemaker
import boto3
import io
from sagemaker.huggingface import get_huggingface_llm_image_uri
import json
from sagemaker.huggingface import HuggingFaceModel

ModuleNotFoundError: No module named 'sagemaker'

# Modeling

In [29]:
sess = sagemaker.Session()

# Prepare the AWS resources and permissions so that SageMaker can access
# the data it needs and perform operations on behalf of the user.

# sagemaker_session_bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist

sagemaker_session_bucket=None

if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::566086704797:role/service-role/AmazonSageMaker-ExecutionRole-20240222T123141
sagemaker session region: us-west-2


## Hugging Face Deep Learning Container

In [3]:
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.1.0"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

llm image uri: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04


## Configuration
The Mistral 7B Instruct model is about 15 GB. With an ml.g5.xlarge instance I'm able to load the model, since the instance has 1 GPU with 24 GB of VRAM (GPU memory).

### Increasing computational efficiency - Quantization (8-bit)
Quantization reduces the precision of the model's weights and activations, and therefore reduces the memory footprint as well. There is a small loss in accuracy, but typically not enough to noticeably affect the model's output.
Memory and computation reduction: lower precision decreases the memory and computational requirements of the model. This makes the model cheaper to serve and can improve inference speed, which makes it more suitable for user interaction with this app.
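
To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights at different precisions. The parameter count (about 7.24 billion) is an approximation I'm assuming for Mistral 7B, and `weight_memory_gb` is a hypothetical helper. The estimate ignores activations, the KV cache, and framework overhead, so the real footprint will be higher.

```python
# Rough estimate of the memory needed to hold model weights only.
# Assumption: ~7.24e9 parameters for Mistral 7B (approximate figure).
APPROX_PARAMS = 7.24e9


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return weight memory in GB (10^9 bytes) for a given precision."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(APPROX_PARAMS, bits):.1f} GB")
```

At 16 bits this comes to roughly 14.5 GB, in line with the ~15 GB figure quoted for the model, and 8-bit weights need about half of that.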

In [21]:

import os

# sagemaker config
instance_type = "ml.g5.xlarge"
n_gpu = 1
health_check_timeout = 800
model_checkpoint = 'mistralai/Mistral-7B-v0.1'

# Model configuration for text generation inference with Hugging Face
config = {
    'HF_MODEL_ID': model_checkpoint,  # model_id for Mistral 7B
    'SM_NUM_GPUS': json.dumps(n_gpu),
    'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
    # Read the Hub token from the environment instead of hardcoding it in the notebook
    # (assumes it has been exported as HUGGING_FACE_HUB_TOKEN).
    # Pass it as a plain string: json.dumps would wrap it in literal quotes.
    'HUGGING_FACE_HUB_TOKEN': os.environ['HUGGING_FACE_HUB_TOKEN'],
    'HF_MODEL_QUANTIZE': "bitsandbytes",  # Enable bitsandbytes quantization (8-bit in TGI)
    'QUANTIZATION_BITS': json.dumps(8)  # Intended to set 8-bit; "bitsandbytes" above is already 8-bit, so this key may be ignored
}

# HF Model Class
hf_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
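
The three token limits in the config depend on each other: as I understand TGI's constraints, the input limit must be smaller than the total limit, and the batch limit must be at least as large as the total limit. The sketch below checks this before deploying, which is cheaper than waiting for the container to fail its health check. `check_token_limits` is a hypothetical helper, not part of SageMaker or TGI, and it assumes the values are stored as JSON strings as in `config`.

```python
import json


def check_token_limits(env: dict) -> None:
    """Raise ValueError if the token limits are mutually inconsistent."""
    max_input = json.loads(env["MAX_INPUT_LENGTH"])
    max_total = json.loads(env["MAX_TOTAL_TOKENS"])
    max_batch = json.loads(env["MAX_BATCH_TOTAL_TOKENS"])
    if not max_input < max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch < max_total:
        raise ValueError("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")


# Same values as the config above, repeated here so the block runs on its own.
example_env = {
    "MAX_INPUT_LENGTH": json.dumps(2048),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(8192),
}
check_token_limits(example_env)  # passes silently for these values
```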

## Deploy Mistral 7B

In [22]:
# deploy the HuggingFaceModel to Amazon SageMaker
# Creates an endpoint that will contain the model
llm = hf_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)

----------------!

## Structure and configure response
The Mistral-7B-Instruct model, unlike the base Mistral-7B model, has undergone instruction fine-tuning. This training improves the model's ability to understand and follow instructions, making it better suited to tasks that require precise adherence to a command. The fine-tuning data marks each instruction with special identifiers, the [INST] and [/INST] tags used in the prompt below, and these markers guide the model's responses.

In [16]:
message = 'what are the ingredients of a meat pie'
prompt= f'''[INST] {message} [/INST]'''
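
If several messages need to be sent, the wrapping can be pulled into a small function. This is a minimal sketch for single-turn prompts only; `build_instruct_prompt` is a hypothetical helper name, and it produces the same string as the f-string above.

```python
def build_instruct_prompt(message: str) -> str:
    """Wrap a single user message in the [INST] ... [/INST] markers."""
    return f"[INST] {message.strip()} [/INST]"


print(build_instruct_prompt("what are the ingredients of a meat pie"))
```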

### Hyperparameters

In [10]:
# Configure the request payload and generation parameters
input_data = {
    'inputs': prompt,  # The prompt or input text that you want the model to generate a response for.
    'parameters': {
        'do_sample': True,  # Whether to use sampling; if False, the model will use greedy decoding.
        'top_p': 0.6,  # The cumulative probability threshold for nucleus sampling. Only the smallest set of tokens with cumulative probability >= top_p are considered.
        'temperature': 0.3,  # Controls the randomness of predictions by scaling the logits before applying softmax. Lower values make the model more deterministic.
        'top_k': 50,  # The number of highest probability vocabulary tokens to keep for top-k filtering. Only the top_k tokens are considered for sampling.
        'max_new_tokens': 512  # The maximum number of new tokens to generate in the response.
        # 'repetition_penalty': 1.03  # Penalty for repeated tokens. Values > 1.0 penalize repetition, encouraging the model to produce more diverse outputs.
    }
}

body_input_data_json = json.dumps(input_data)  # the request body must be a JSON string
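
To see how `temperature`, `top_k`, and `top_p` interact, here is a toy sketch in plain Python that applies them to a made-up set of logits. It is an illustration of the general technique, not TGI's actual implementation, whose ordering and details may differ. `filtered_distribution` and the toy logits are hypothetical.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return {token: probability}."""
    # Temperature: divide logits before softmax; values < 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(weight for _, weight in ranked)
    ranked = [(tok, weight / total) for tok, weight in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens form a proper distribution.
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Made-up logits for four candidate tokens.
toy_logits = {"pie": 2.0, "crust": 1.0, "beef": 0.5, "onion": -1.0}
dist = filtered_distribution(toy_logits, temperature=0.3, top_k=50, top_p=0.6)
token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(dist, token)
```

With the settings used in this notebook (temperature 0.3, top_p 0.6), only the most likely toy token survives the filtering, which shows how a low temperature combined with a moderate top_p pushes sampling close to greedy decoding.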

### Request the response

In [11]:
#sagemaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')
endpoint = 'huggingface-pytorch-tgi-inference-2024-04-26-05-43-18-580'
content_type = 'application/json'


# Requests inference from AWS SageMaker endpoint
# Model deployed - Mistral 7B
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint,
    ContentType=content_type,
    Body=body_input_data_json.encode('utf-8')
)

### Parse the response
The response is returned in JSON format.

In [12]:
response_body = response['Body'].read().decode('utf-8')
response_json = json.loads(response_body)

In [13]:
response_json

[{'generated_text': '[INST] what are the ingredients of a meat pie [/INST] The ingredients for a meat pie can vary depending on the recipe, but some common ingredients include:\n\n* Flaky pie crust\n* Ground beef, pork, or a combination of the two\n* Onion\n* Garlic\n* Carrots\n* Peas\n* Thyme\n* Rosemary\n* Salt\n* Pepper\n* Egg wash (optional)\n\nSome recipes may also include other ingredients such as mushrooms, celery, or diced tomatoes.'}]

### Extract and Print Response
Let's see how the baseline Mistral-7B model performs without fine-tuning. This output serves as the point of comparison for the fine-tuned model, which is trained on a SQL dataset.

In [14]:
generated_text = response_json[0]['generated_text']
print(generated_text[len(prompt):])

 The ingredients for a meat pie can vary depending on the recipe, but some common ingredients include:

* Flaky pie crust
* Ground beef, pork, or a combination of the two
* Onion
* Garlic
* Carrots
* Peas
* Thyme
* Rosemary
* Salt
* Pepper
* Egg wash (optional)

Some recipes may also include other ingredients such as mushrooms, celery, or diced tomatoes.
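
The request, parse, and extract steps above can be folded into one function. The sketch below takes the runtime client as an argument so it can be exercised without AWS. `generate` and `FakeRuntime` are hypothetical names, and the fake client only imitates the shape of the `invoke_endpoint` response seen in this notebook.

```python
import io
import json


def generate(client, endpoint_name: str, prompt: str, parameters: dict) -> str:
    """Invoke the endpoint and return only the newly generated text."""
    payload = json.dumps({"inputs": prompt, "parameters": parameters})
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload.encode("utf-8"),
    )
    result = json.loads(response["Body"].read().decode("utf-8"))
    text = result[0]["generated_text"]
    # Strip the echoed prompt only when it is actually present.
    return text[len(prompt):] if text.startswith(prompt) else text


class FakeRuntime:
    """Stand-in for the sagemaker-runtime client, for local illustration only."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        prompt = json.loads(Body)["inputs"]
        reply = [{"generated_text": prompt + " Flour, butter, beef."}]
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


print(generate(FakeRuntime(), "hypothetical-endpoint", "[INST] hi [/INST]", {"max_new_tokens": 16}))
```

In the notebook itself, the first argument would be the `sagemaker_runtime` client and the second the deployed endpoint name.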


# Fine-Tuning

## Pre-processing the data

In [3]:
from transformers import AutoTokenizer
from datasets import load_dataset

In [4]:
dataset = load_dataset('b-mc2/sql-create-context')

Downloading readme: 100%|██████████| 4.43k/4.43k [00:00<00:00, 10.7MB/s]
Downloading data: 100%|██████████| 21.8M/21.8M [00:02<00:00, 7.53MB/s]
Downloading data files: 100%|██████████| 1/1 [00:02<00:00,  2.92s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 305.51it/s]
Generating train split: 78577 examples [00:00, 241250.60 examples/s]


In [2]:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space = True)

tokenizer_config.json: 100%|██████████| 1.47k/1.47k [00:00<00:00, 1.75MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 870kB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 1.92MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 208kB/s]


In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answer'],
        num_rows: 78577
    })
})

In [6]:
dataset['train'].features

{'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answer': Value(dtype='string', id=None)}

In [None]:
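def tokenize_function(examples):
    # The Mistral tokenizer ships without a pad token; reuse EOS so padding works.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer(
        examples['context'],
        examples['question'],
        padding='max_length',
        max_length=512,  # assumed cap, not from the original notebook; adjust to fit the data
        truncation='only_second'  # only the 'question' (second sequence) may be truncated if the input is too long
    )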

In [14]:
directory = '../data'
file_name = 'data_sql.jsonl'
f'{directory}/{file_name}'

'../data/data_sql.jsonl'

In [None]:
def load_data_sql(data_dir: str = directory):
    """Download b-mc2/sql-create-context and write it out as JSON Lines."""
    import json
    import os
    from datasets import load_dataset

    dataset = load_dataset("b-mc2/sql-create-context")

    # Make sure the target directory exists before writing.
    os.makedirs(data_dir, exist_ok=True)
    out_path = os.path.join(data_dir, file_name)

    # The dataset ships a single "train" split; write one JSON object per line.
    with open(out_path, "w") as f:
        for item in dataset["train"]:
            newitem = {
                "input": item["question"],
                "context": item["context"],
                "output": item["answer"],
            }
            f.write(json.dumps(newitem) + "\n")
    return out_path
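
After writing the file, it is worth reading a few records back to confirm the format. The sketch below does that on a throwaway file so it runs without downloading the dataset. `read_jsonl` is a hypothetical helper, and the sample record is made up for illustration rather than taken from the dataset.

```python
import json
import os
import tempfile


def read_jsonl(path: str, limit: int = 3) -> list:
    """Return the first `limit` records of a JSON Lines file."""
    records = []
    with open(path) as f:
        for line in f:
            if len(records) >= limit:
                break
            records.append(json.loads(line))
    return records


# Made-up record with the same three keys that load_data_sql writes.
sample = {
    "input": "How many users are there?",
    "context": "CREATE TABLE users (id INTEGER)",
    "output": "SELECT COUNT(*) FROM users",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w") as f:
        for _ in range(5):
            f.write(json.dumps(sample) + "\n")
    preview = read_jsonl(demo_path, limit=2)

print(preview)
```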