# Fine-tuning Mistral-7B for Instruction Generation

## Overview
This Jupyter notebook demonstrates fine-tuning the Mistral-7B language model (the `Mistral-7B-Instruct-v0.1` checkpoint) for instruction generation using Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA). The goal is to adapt the model to generate the instruction that could have produced a given response, essentially reversing the typical instruction-following behavior of large language models.

## Purpose
- Showcase the fine-tuning process for large language models
- Demonstrate the use of LoRA for efficient adaptation of pre-trained models
- Provide a practical example of preparing data, configuring models, and training for a specific NLP task
- Include a section on deploying the fine-tuned model using FastAPI and testing the model's performance using the checkpoints created during the fine-tuning process

## Key Components
1. Data preparation using the mosaicml/instruct-v3 dataset
2. Model loading and configuration with 4-bit quantization
3. LoRA setup for parameter-efficient fine-tuning
4. Training process using the SFTTrainer from the TRL library
5. Model deployment using FastAPI
6. Testing the fine-tuned model with sample prompts

## How to Use This Notebook
1. **Environment Setup**: Ensure you have a GPU-enabled environment with Python and Jupyter installed.
2. **Dependencies**: Run the first cell to install required libraries.
3. **Data Preparation**: Follow the cells that load and preprocess the dataset.
4. **Model Configuration**: Execute cells that load and configure the Mistral-7B model.
5. **Training**: Run the training cell to fine-tune the model.
6. **Evaluation**: Use the provided functions to test the model's performance after training.
7. **Deployment**: Explore the section that deploys the fine-tuned model as a FastAPI web service.
8. **Testing**: Test the deployed model with sample prompts and inspect the generated responses.

## Notes
- This notebook uses a subset of the full dataset for quicker experimentation. Adjust dataset size as needed.
- The training process is resource-intensive. Ensure you have adequate GPU memory available.
- Experiment with different LoRA configurations and training parameters to optimize results.
- The deployment and testing sections leverage the checkpoints created during the fine-tuning process.

By following this notebook, you'll gain hands-on experience in fine-tuning large language models for specific tasks using state-of-the-art techniques in natural language processing, as well as deploying the fine-tuned model as a web service.

## Installing Required Libraries

**What it's doing:**
Installing necessary Python libraries for the project.

**Why:**
These libraries are essential for working with transformers, fine-tuning models, handling datasets, and optimizing performance. Installing them ensures we have all the tools needed for our task.


In [1]:
! pip install transformers trl accelerate torch bitsandbytes peft datasets -qU

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.0+cu121 requires torch==2.2.0, but you have torch 2.3.1 which is incompatible.
torchvision 0.17.0+cu121 requires torch==2.2.0, but you have torch 2.3.1 which is incompatible.


## Loading the Dataset

**What it's doing:**
Loading the "mosaicml/instruct-v3" dataset.

**Why:**
This dataset contains instruction-response pairs, which are crucial for our task of fine-tuning a model to generate instructions. It provides the training data we need.


In [2]:
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

## Examining the Dataset

**What it's doing:**
Displaying the structure of the loaded dataset.

**Why:**
This helps us understand the composition of our dataset, including the number of examples and the available features. It's an important step for data exploration and verification.


In [3]:
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 6807
    })
})

## Filtering the Dataset

**What it's doing:**
Filtering the dataset to only include examples from the "dolly_hhrlhf" source.

**Why:**
By focusing on a specific subset of the data, we can potentially improve the quality and consistency of our fine-tuning results. This step helps in data curation.


In [4]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["source"] == "dolly_hhrlhf")
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 34333
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 4771
    })
})

## Reducing Dataset Size

**What it's doing:**
Limiting the dataset to the first 5,000 training examples and the first 200 test examples.

**Why:**
A smaller dataset allows faster experimentation and requires fewer computational resources. It's a common practice when initially developing and testing a model fine-tuning pipeline.


In [5]:
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(5_000))
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(200))
instruct_tune_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 200
    })
})

## Defining the Prompt Template

**What it's doing:**
Creating a template for formatting our training data.

**Why:**
This template structures our input data consistently, telling the model how to interpret the input and what kind of output we expect. It's crucial for instruction-tuning tasks.


In [6]:
prompt_template = """<s>### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.

### Input:
{input}

### Response:
{response}</s>"""

## Creating the Prompt Function

**What it's doing:**
Defining a function to format each sample from our dataset according to the prompt template.

**Why:**
This function prepares our data for training, ensuring each example is formatted consistently and correctly for our specific task of instruction generation.


In [7]:
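def create_prompt(sample):
    # The 'response' from the dataset becomes the 'input' for our new task
    input_text = sample["response"]
    # The dataset 'prompt' becomes the target; strip its leading boilerplate.
    # Note: the trailing "### Response" marker of the dataset prompt is not
    # removed here, so it remains in the formatted text (visible in the test output).
    response_text = sample["prompt"].replace("Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction\n", "").strip()

    full_prompt = prompt_template.format(input=input_text, response=response_text)

    return full_prompt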

## Testing the Prompt Function

**What it's doing:**
Applying the prompt function to a sample from the dataset.

**Why:**
This test ensures our prompt function is working correctly before we use it in training. It's a crucial verification step in our data preparation process.


In [8]:
create_prompt(instruct_tune_dataset["train"][0])

'<s>### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.\n\n### Input:\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.\n\n### Response:\nWhat are different types of grass?\n\n### Response</s>'
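
The test output above ends with a stray `### Response` before `</s>`: `create_prompt` removes the leading boilerplate from the dataset prompt but not the trailing section marker. A minimal sketch of a stricter cleanup, assuming every dataset prompt follows the `### Instruction` / `### Response` layout seen in this sample (the helper name `extract_instruction` is hypothetical and not part of the notebook):

```python
PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction\n"
)
RESPONSE_MARKER = "### Response"


def extract_instruction(dataset_prompt: str) -> str:
    """Return only the instruction text from a dataset prompt (hypothetical helper)."""
    text = dataset_prompt.replace(PREAMBLE, "").strip()
    # Drop the trailing section marker if the prompt ends with it.
    if text.endswith(RESPONSE_MARKER):
        text = text[: -len(RESPONSE_MARKER)].rstrip()
    return text


# Assumed shape of a raw prompt, reconstructed from the output above.
raw = PREAMBLE + "What are different types of grass?\n\n### Response\n"
print(extract_instruction(raw))
```

Swapping this into `create_prompt` would change the training text, so the recorded outputs in this notebook would no longer match exactly.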

## Loading the Pre-trained Model and Tokenizer

**What it's doing:**
Loading the Mistral-7B model and its tokenizer, with 4-bit quantization.

**Why:**
This step prepares our base model for fine-tuning. The 4-bit quantization allows us to work with this large model on more modest hardware by reducing its memory footprint.

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False
)

print(model)

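# Note: the tokenizer comes from the base (non-Instruct) repo while the model is
# the Instruct variant. The two v0.1 repos are expected to share a vocabulary,
# but loading both from the same repo would remove any doubt.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# No pad token is set by default here, so reuse EOS and pad on the right for training
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"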


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
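
As a rough check on the memory savings, the layer shapes printed above give the number of weights held in the quantized `Linear4bit` layers. The sketch below estimates their storage at 4 bits versus 16 bits per weight. It ignores quantization constants, embeddings, activations, and optimizer state, so treat the numbers as a back-of-the-envelope estimate rather than a measurement.

```python
# (in_features, out_features) copied from the printed model
layer_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
num_layers = 32  # "(0-31): 32 x MistralDecoderLayer"

weights_per_layer = sum(n_in * n_out for n_in, n_out in layer_shapes.values())
quantized_weights = num_layers * weights_per_layer

bytes_4bit = quantized_weights // 2  # 4 bits = half a byte per weight
bytes_bf16 = quantized_weights * 2   # 16 bits = two bytes per weight

print(f"weights in Linear4bit layers: {quantized_weights:,}")
print(f"approx. 4-bit storage: {bytes_4bit / 1e9:.2f} GB")
print(f"approx. bf16 storage:  {bytes_bf16 / 1e9:.2f} GB")
```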

## Defining the Generation Function

**What it's doing:**
Creating a function to generate responses using our model.

**Why:**
This function allows us to test our model's outputs at various stages of fine-tuning, helping us assess its performance and progress.



In [10]:
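def generate_response(prompt, model):
    # Tokenize the prompt; add_special_tokens=True prepends the <s> token
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')  # assumes a CUDA GPU is available

    # Sample up to 1000 new tokens; the EOS token doubles as the padding token
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    # Remove the echoed prompt so only the newly generated text is returned
    return decoded_output[0].replace(prompt, "")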

## Testing the Generation Function

**What it's doing:**
Generating a response with our base model before fine-tuning.

**Why:**
This provides a baseline to compare against after fine-tuning, helping us understand how much the model's performance improves.

In [11]:
generate_response("### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.\n\n### Input:\nI think it depends a little on the individual, but there are a number of steps you’ll need to take.  First, you’ll need to get a college education.  This might include a four-year undergraduate degree and a four-year doctorate program.  You’ll also need to complete a residency program.  Once you have your education, you’ll need to be licensed.  And finally, you’ll need to establish a practice.\n\n### Response:", model)

'<s> \nTo become a healthcare professional, such as a medical doctor, you must first obtain a four-year undergraduate degree and a four-year doctorate degree. After completing your education, you must complete a residency program. Once you have successfully completed these steps, you will need to become licensed before establishing a practice.</s>'

## Configuring LoRA for Fine-tuning

**What it's doing:**
Setting up the Low-Rank Adaptation (LoRA) configuration for fine-tuning.

**Why:**
LoRA allows us to fine-tune the model efficiently by adding a small number of trainable parameters. This configuration defines how LoRA will be applied to our model.

In [12]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)
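
The configuration above sets the LoRA rank `r` and `lora_alpha`. In the standard LoRA formulation the adapted layer computes `W x + (alpha / r) * B (A x)`, with `B` initialized to zero so that training starts from the unmodified model. A toy pure-Python sketch of that forward pass follows (tiny made-up dimensions, dropout omitted; it illustrates the formula and is not PEFT's actual implementation):

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x)."""
    scaling = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 16  # toy sizes, not the notebook's
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so the output equals W x.
print(lora_forward(x, W, A, B, alpha, r) == matvec(W, x))

# Scaling factor for the notebook's settings (lora_alpha=16, r=64)
notebook_scaling = 16 / 64
print(notebook_scaling)
```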



## Preparing the Model for LoRA Fine-tuning

**What it's doing:**
Applying the LoRA configuration to our model.

**Why:**
This step prepares our model for efficient fine-tuning, setting up the additional LoRA parameters while keeping most of the original model frozen.


In [13]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

print(peft_config)  # Print your LoRA configuration to confirm it's set up correctly

for name, module in model.named_modules():
    print(f"Module: {name}")

for name, module in model.named_modules():
    if any(lora_term in name.lower() for lora_term in ['lora', 'adapter', 'peft']):
        print(f"Potential LoRA adapter found in: {name}")

for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable parameter: {name}")

for name, module in model.named_modules():
    if 'lora' in name.lower():
        print(f"LoRA adapter found in: {name}")

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='mistralai/Mistral-7B-Instruct-v0.1', revision=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules={'v_proj', 'q_proj'}, lora_alpha=16, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)
Module: 
Module: base_model
Module: base_model.model
Module: base_model.model.model
Module: base_model.model.model.embed_tokens
Module: base_model.model.model.layers
Module: base_model.model.model.layers.0
Module: base_model.model.model.layers.0.self_attn
Module: base_model.model.model.layers.0.self_attn.q_proj
Module: base_model.model.model.layers.0.self_attn.q_proj.base_layer
Module: base_model.model.model.layers.0.self_attn.q_proj.lora_dro

## Setting Up Training Arguments

**What it's doing:**
Configuring the training process parameters.

**Why:**
These arguments define crucial aspects of our training process, such as learning rate, batch size, and evaluation frequency. They significantly impact the efficiency and effectiveness of fine-tuning.


In [14]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "mistral_end_to_end",
  #num_train_epochs=5,
  max_steps = 100, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 4,
  warmup_steps = 0,
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=20, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  bf16=True,
  lr_scheduler_type='constant',
)



## Setting Up the Trainer

**What it's doing:**
Initializing the SFTTrainer with our model, datasets, and training configuration.

**Why:**
The trainer handles the fine-tuning process, managing the training loop, evaluation, and logging. This setup brings together all the components we've prepared for fine-tuning.

In [15]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
        trainable_params += param.numel()
print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")




Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs


trainable params: 27262976 || all params: 3779334144 || trainable%: 0.7213698223345028
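
The printed counts can be reproduced by hand from the layer shapes shown earlier and the LoRA settings (`r=64`, target modules `q_proj` and `v_proj`). The sketch below is a worked check, not part of the notebook. It assumes that each 4-bit layer stores two weights per element (which is why "all params" is roughly half of 7B), and that the model has an untied `lm_head` and a final norm, neither of which appears in the truncated model printout.

```python
r = 64           # LoRA rank from peft_config
num_layers = 32  # decoder layers in the printed model

# (in_features, out_features) of the adapted projections
target_shapes = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}

# Each adapted layer gains A (r x in_features) and B (out_features x r)
lora_per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
lora_total = num_layers * lora_per_layer

linear_shapes = [
    (4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096),  # q, k, v, o
    (4096, 14336), (4096, 14336), (14336, 4096),             # gate, up, down
]
linear_weights = num_layers * sum(n_in * n_out for n_in, n_out in linear_shapes)
packed_4bit = linear_weights // 2    # assumption: two 4-bit weights per stored element

embeddings = 2 * 32000 * 4096        # embed_tokens plus an assumed untied lm_head
norms = (2 * num_layers + 1) * 4096  # two RMSNorms per layer plus an assumed final norm

all_params = packed_4bit + embeddings + norms + lora_total
print(f"trainable params: {lora_total} || all params: {all_params}")
```

Under these assumptions both figures match the trainer output above, which suggests the `trainable%` value is computed against the packed parameter count rather than the full 7B weights.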




## Training the Model

**What it's doing:**
Running the fine-tuning process and testing the result.

**Why:**
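This is the main training step where our model learns from the prepared dataset. After training, we run it on a sample input and check GPU memory usage to understand the computational cost of fine-tuning. Note that the sample prompt is built with `create_prompt`, so it already contains the reference answer and the closing `</s>`; this is likely why the model output below consists only of special tokens.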

In [16]:
trainer.train()

sample_input = instruct_tune_dataset["train"][0]
formatted_input = create_prompt(sample_input)
print("Sample Input:")
print(formatted_input)
print("\nModel Output:")
print(generate_response(formatted_input, model))

import torch
print(f"GPU memory allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"GPU memory cached: {torch.cuda.memory_reserved()/1e9:.2f} GB")



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 1.5439        | 1.336502        |
| 40   | 1.4112        | 1.298247        |
| 60   | 1.4278        | 1.285116        |
| 80   | 1.4144        | 1.279086        |
| 100  | 1.3394        | 1.272258        |


Sample Input:
<s>### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.

### Input:
There are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.

### Response:
What are different types of grass?

### Response</s>

Model Output:
<s></s>
GPU memory allocated: 4.99 GB
GPU memory cached: 22.08 GB
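
The empty model output above follows from the prompt already ending in the answer and `</s>`. For a post-training check, the prompt would normally stop at `### Response:` so the model has something to complete, as in the baseline call made before training. A minimal sketch (the helper `create_inference_prompt` and the template constant are hypothetical names; `<s>` is left out because the tokenizer adds it when `add_special_tokens=True`):

```python
INFERENCE_TEMPLATE = """### Instruction:
Use the provided input to create an instruction that could have been used to generate the response with an LLM.

### Input:
{input}

### Response:"""


def create_inference_prompt(sample: dict) -> str:
    """Format a dataset sample for generation: no reference answer, no </s>."""
    return INFERENCE_TEMPLATE.format(input=sample["response"])


# Toy sample with the same keys as the dataset rows
sample = {
    "prompt": "(unused at inference time)",
    "response": "There are more than 12,000 species of grass.",
}
prompt = create_inference_prompt(sample)
print(prompt)
```

Passing this string to `generate_response(prompt, model)` mirrors the format of the baseline test, which makes before-and-after outputs comparable.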


In [17]:
results = trainer.evaluate()
print(results)

{'eval_loss': 1.2722580432891846, 'eval_runtime': 3.7054, 'eval_samples_per_second': 4.048, 'eval_steps_per_second': 0.54, 'epoch': 0.8}
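
A few derived quantities make these numbers easier to read. The sketch below converts the evaluation loss to perplexity and back-calculates how many packed sequences were involved. The sequence counts assume a single device, the default `gradient_accumulation_steps` of 1, and that the reported `epoch` value is not heavily rounded, so they are estimates.

```python
import math

# Values copied from the evaluation output above
results = {
    "eval_loss": 1.2722580432891846,
    "eval_runtime": 3.7054,
    "eval_samples_per_second": 4.048,
    "epoch": 0.8,
}

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(results["eval_loss"])
print(f"perplexity: {perplexity:.2f}")

# With packing=True a "sample" is a packed sequence, not a dataset row
eval_sequences = round(results["eval_runtime"] * results["eval_samples_per_second"])
print(f"packed eval sequences: {eval_sequences}")

# 100 steps at batch size 4 covered 0.8 of an epoch
max_steps, batch_size = 100, 4
train_sequences = round(max_steps * batch_size / results["epoch"])
print(f"approx. packed train sequences per epoch: {train_sequences}")
```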


## Deploying the Model

**What it's doing:**
This code installs the following Python packages:
- `fastapi`: A high-performance web framework for building APIs with Python
- `uvicorn`: An ASGI (Asynchronous Server Gateway Interface) web server for running FastAPI applications
- `transformers`: A library for state-of-the-art natural language processing (NLP) models
- `torch`: The PyTorch machine learning library
- `nest_asyncio`: A library that allows nested event loops in Python, which is needed for some asynchronous operations
- `requests`: A popular library for making HTTP requests in Python
- `torchvision`: A library that provides access to popular datasets, model architectures, and common image transformations for computer vision

**Why:**
These packages are needed to serve the fine-tuned model over HTTP. FastAPI and Uvicorn provide the web server and API, `transformers` and `torch` load and run the model, `nest_asyncio` lets the server run inside the notebook's event loop, and `requests` is used to call the API. `torchvision` is not used by the code below; reinstalling it alongside `torch` is presumably meant to resolve the version conflict reported during the first install.

In [18]:
pip install fastapi uvicorn transformers torch nest_asyncio requests torchvision

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



**What it's doing:**
Reloading the full `mosaicml/instruct-v3` dataset with `load_dataset` from the `datasets` library and printing the first example of the "train" split.

**Why:**
Printing a raw example shows the unprocessed prompt format, including the boilerplate preamble and the `### Instruction` marker that `create_prompt` strips during data preparation.

In [19]:
from datasets import load_dataset

dataset = load_dataset("mosaicml/instruct-v3")
print(dataset['train'][0])


{'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction\nQuestion: Nancy and Rose are making bracelets, and there are eight beads in each bracelet. Nancy has 40 metal beads and 20 more pearl beads. Rose has 20 crystal beads and twice as many stone beads as crystal beads. How many bracelets can Nancy and Rose make?\nAnswer: Nancy has 40 + 20 = 60 pearl beads. So, Nancy has a total of 40 + 60 = 100 beads. Rose has 2 x 20 = 40 stone beads. So, Rose has 20 + 40 = 60 beads. Thus, Nancy and Rose have 100 + 60 = 160 beads altogether. Therefore, they can make 160 / 8 = 20 bracelets. The answer is 20.\n[Question]Ms. Estrella is an entrepreneur with a startup company having 10 employees. The company makes a revenue of $400000 a month, paying 10% in taxes, 5% of the remaining amount on marketing and ads, 20% of the remaining amount on operational costs, and 15% of the remaining amount on employee wages. Assuming each

**What it's doing:**
This code snippet imports the `transformers` library and its `pipeline` helper, then prints the installed version of `transformers`. No pipeline object is created at this point.

**Why:**
The `transformers` library provides a high-level interface for using pre-trained NLP models, such as BERT, GPT, and RoBERTa, for tasks like text classification, named entity recognition, and question answering. By importing and using this library, you can easily leverage these powerful models in your own machine learning projects.

Printing the version number can be useful for ensuring that you're using a compatible version of the library with your project or for troubleshooting issues that may be version-specific.

In [1]:
import transformers
from transformers import pipeline

print(transformers.__version__)

4.42.3


**What it's doing:**
This code sets up a FastAPI web application with a text generation endpoint. The application uses the Transformers library to load a fine-tuned language model and generate text based on a provided prompt.

The key steps are:

1. Import necessary libraries and modules, including `FastAPI`, `Transformers`, and `Torch`.
2. Define a `Prompt` model using `pydantic.BaseModel` to represent the input data for the text generation endpoint.
3. Implement the `generate_text` function that takes a `Prompt` object, loads the fine-tuned language model from the `checkpoint-100` directory, and generates text based on the provided prompt. Note that the model is loaded on every request.
4. Add a `shutdown` endpoint to the FastAPI app so the server can be stopped from the notebook.
5. Define a `run_server` function that starts the Uvicorn server in a separate thread.
6. Test the API by sending a POST request to the `/generate` endpoint and print the generated text.
7. Implement a `shutdown_server` function to shut down the server.

**Why:**
This code demonstrates how to build a web application using FastAPI that can generate text using a fine-tuned language model from the Transformers library. This can be useful for various applications, such as:

- Building a text generation API for creative writing, translation, or summarization tasks.
- Integrating a text generation model into a larger application or system.
- Experimenting with fine-tuning language models and serving the resulting model behind an HTTP API.

Running the server in a separate thread keeps the notebook free to execute further cells, such as the test request, while the server is running. The shutdown endpoint provides a way to stop the server when needed.

This cell can be used as a starting point for building more complex text generation applications or as a reference for integrating Transformers models into a web-based system.

In [2]:
import nest_asyncio
import uvicorn
from threading import Thread
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Apply the nest_asyncio patch
nest_asyncio.apply()

# Initialize FastAPI app
app = FastAPI()

# Define the Prompt model
class Prompt(BaseModel):
    instruction: str
    input_text: str

# Define the text generation endpoint
@app.post("/generate")
def generate_text(prompt: Prompt):
    model_path = "mistral_end_to_end/checkpoint-100"  # Path to your fine-tuned model checkpoint
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    device = 0 if torch.cuda.is_available() else -1
    nlp = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)
    
    # Combine instruction and input_text to form the full prompt
    full_prompt = f"Instruction: {prompt.instruction}\nInput: {prompt.input_text}\nOutput:"
    print(f"Full Prompt: {full_prompt}")  # Debugging print statement
    
    outputs = nlp(full_prompt, max_length=100, num_return_sequences=1, truncation=True)
    if outputs:
        generated_text = outputs[0].get("generated_text", "")
        print(f"Generated Text: {generated_text}")  # Debugging print statement
    else:
        generated_text = ""
        print("No output generated")  # Debugging print statement
    
    return {"generated_text": generated_text}

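# Adding shutdown endpoint to FastAPI app
@app.post("/shutdown")
def shutdown():
    # Ask Uvicorn to stop once this response has been sent. Calling os._exit(0)
    # here would terminate the whole Python process (in a notebook, the kernel)
    # before the response could be returned.
    server.should_exit = True
    return {"message": "Server is shutting down..."}

# Create the Uvicorn server at module level so the shutdown endpoint can reach it
config = uvicorn.Config(app, host="0.0.0.0", port=8000, loop="asyncio")
server = uvicorn.Server(config)

# Function to run the server
def run_server():
    import asyncio
    asyncio.set_event_loop_policy(asyncio.DefaultEventLoopPolicy())
    server.run()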

# Start the server in a new thread
server_thread = Thread(target=run_server)
server_thread.start()

# Test the API
import requests
import time

# Wait a few seconds for the server to start
time.sleep(5)

# Define the API endpoint
api_url = "http://localhost:8000/generate"

# Define the instruction and input text
prompt_data = {
    "instruction": "Translate the following English text to French.",
    "input_text": "Hello, how are you?"
}

# Send the POST request
response = requests.post(api_url, json=prompt_data)

# Print the response
if response.status_code == 200:
    generated_text = response.json().get("generated_text")
    print("Generated Text:", generated_text)
else:
    print(f"Error: {response.status_code}, {response.text}")

# Shutdown the server
def shutdown_server():
    response = requests.post("http://localhost:8000/shutdown")
    print(response.json())


INFO:     Started server process [342]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)



Full Prompt: Instruction: Translate the following English text to French.
Input: Hello, how are you?
Output:
Generated Text: Instruction: Translate the following English text to French.
Input: Hello, how are you?
Output: Bonjour, comment allez vous?

### Response:
I need to translate this text from English to French: Hello, how are you?

### Response
INFO:     127.0.0.1:34000 - "POST /generate HTTP/1.1" 200 OK
Generated Text: Instruction: Translate the following English text to French.
Input: Hello, how are you?
Output: Bonjour, comment allez vous?

### Response:
I need to translate this text from English to French: Hello, how are you?

### Response
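
The generated text above echoes the prompt and then runs on into a `### Response:` section, a pattern the model likely picked up from the training template. A small post-processing step can return only the first completion. The helper below is a hypothetical sketch; it assumes the prompt is echoed verbatim at the start of the output and that `###` marks the start of unwanted text.

```python
def extract_completion(generated_text: str, full_prompt: str) -> str:
    """Strip the echoed prompt and anything from the first '###' marker onward."""
    completion = generated_text
    if completion.startswith(full_prompt):
        completion = completion[len(full_prompt):]
    # Cut at the first section marker the model starts to emit.
    completion = completion.split("###", 1)[0]
    return completion.strip()


# Strings copied from the server output above
full_prompt = (
    "Instruction: Translate the following English text to French.\n"
    "Input: Hello, how are you?\n"
    "Output:"
)
generated_text = full_prompt + (
    " Bonjour, comment allez vous?\n\n### Response:\n"
    "I need to translate this text from English to French: Hello, how are you?\n\n### Response"
)

print(extract_completion(generated_text, full_prompt))
```

The text-generation pipeline also accepts `return_full_text=False`, which drops the echoed prompt at the source; the marker-based cut would still be needed for the trailing section.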


In [None]:
# Call the shutdown function
shutdown_server()

# Wait a bit and then join the thread to ensure it has completed
server_thread.join()
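
The `/generate` handler above rebuilds the model, tokenizer, and pipeline on every request. One common way to avoid that cost is to cache the loaded object so it is built once per process. A minimal sketch using `functools.lru_cache`, with a stand-in loader so it runs without a GPU (`get_generator` and `handle_request` are hypothetical names; in the notebook the loader body would contain the `from_pretrained` and `pipeline(...)` calls):

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive loader actually runs


@lru_cache(maxsize=1)
def get_generator(model_path: str):
    """Stand-in for the expensive model/tokenizer/pipeline construction."""
    global load_calls
    load_calls += 1
    # Placeholder generator: simply appends a marker to the prompt.
    return lambda prompt: f"{prompt} <generated text>"


def handle_request(prompt: str) -> str:
    """Stand-in for the endpoint body: fetch the cached generator and run it."""
    generator = get_generator("mistral_end_to_end/checkpoint-100")
    return generator(prompt)


first = handle_request("first prompt")
second = handle_request("second prompt")
print(load_calls)  # the loader ran once despite two requests
```

Loading at module level before starting the server would achieve the same effect; the cached function simply defers the cost until the first request.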