# Model Tuning

Now that we have our data ready, it's time to get a model.

One of the best places to find pretrained models right now is [Hugging Face](https://huggingface.co/). Not only is Hugging Face a great repository of Gen AI models, it has also developed a number of Python libraries for working with models, which we'll use here.

We'll attempt to do all of this locally, so we won't need to create a Hugging Face account.

In addition to the libraries provided by Hugging Face, we'll use [LangChain](https://www.langchain.com/) to streamline the workflow.

In [1]:
from langchain.llms import HuggingFacePipeline
#import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from trl import SFTTrainer
from pprint import pprint
import pandas as pd
from datasets import load_dataset, Dataset

### Where to start
---

As part of this process we want to see how much a model can improve after being fine-tuned. So instead of choosing a model that has been specifically designed for code assistance, we'll take a very basic model and see what improvements we can get.

For that we'll use a base GPT-2 model with a relatively small number of parameters.

Many Gen AI models in production have parameter counts in the billions, which would make them very time intensive to fine-tune without access to dedicated GPUs. Instead, we'll see what hurdles we face with a smaller model trained directly on the CPU.
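
To put "smaller" in perspective, here is a rough sketch that counts the weights of the smallest GPT-2 checkpoint from its architecture alone. The configuration values (12 layers, hidden size 768, a 50,257-token vocabulary, 1,024 positions, and an output layer that shares weights with the token embeddings) are assumptions taken from the published GPT-2 small configuration, not something this notebook measures.

```python
# Back-of-the-envelope parameter count for the smallest GPT-2 checkpoint.
# Assumed configuration (not stated in this notebook): 12 layers, hidden size 768,
# 50,257 vocabulary entries, 1,024 positions, output layer tied to the embeddings.

def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, n_positions=1024):
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each layer norm has a scale and a bias vector
    layer_norm = 2 * n_embd
    # Fused query/key/value projection plus the attention output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward block: expand to 4x the hidden size, then project back
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # All layers plus the final layer norm
    return embeddings + n_layer * per_layer + layer_norm

total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes out at roughly 124 million, which is small next to billion-parameter models but still a lot of weights to update on a CPU.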

The Hugging Face libraries let us download both the model and its associated tokenizer.

To give the model an input and receive an output, we'll build a pipeline in four steps:
1. Use a model and tokenizer from Hugging Face
2. Put everything in a pipeline
3. Create a local model
4. Ask a question

In [2]:
# Use a Hugging Face model and tokenizer

model_link = "openai-community/gpt2"
model = AutoModelForCausalLM.from_pretrained(model_link)
tokenizer = AutoTokenizer.from_pretrained(model_link)

In [3]:
# Build the Pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100,  # Maximum total length in tokens (prompt + generated text); raise for longer answers
    pad_token_id=50256  # GPT-2 has no padding token, so reuse its end-of-text token (id 50256) to allow open-ended generation
)

In [4]:
# Create a local model

local_llm = HuggingFacePipeline(pipeline=pipe)

##### Questions

We'll ask some basic questions about MongoDB to get an idea of how well the model currently knows the subject matter.

In [5]:
pprint(local_llm.invoke("What is mongodb?"))

('\n'
 '\n'
 'Mongodb is a simple Python script for displaying JSON formatted results for '
 'users of Google Analytics. It supports JSON formatting at compile, with a '
 'few settings that help determine just how much data is displayed and not '
 'what it is. It also sends "message" to the browser when a browser doesn\'t '
 'see any MONGODB.\n'
 '\n'
 "To start listening for mongodb, simply open your own web browser and you'll "
 'see this message')


In [6]:
pprint(local_llm.invoke("How do I match by date in mongodb?"))

(' If you have an option enter the date using the following format: date = '
 'monday, Monday, Tuesday, Wednesday night, Thursday, Friday Sunday, and '
 'Saturday Monday, Tuesday, Thursday, Friday, Saturday, Sunday, and Monday day '
 'or Sunday and Sunday.\n'
 '\n'
 "You can convert to MIME Type if you don't have a date converter, or convert "
 'it to MIME type without your date converter.\n'
 '\n'
 'Here is an example')


In [7]:
pprint(local_llm.invoke("How to use a custom function in a mongodb aggregation pipeline?"))

('\n'
 '\n'
 '1. Install a custom function with your current user base! When launching '
 'ng-app:\n'
 '\n'
 '1. Create a new mongo project.\n'
 '\n'
 '2. Place mongo project in your ng-app:\n'
 '\n'
 '2. Type this code to generate results in the following format;\n'
 '\n'
 '3. The output you see in the following output is your custom function. To '
 'view the list of your custom functions')


##### Answers

As we can see, the model currently doesn't know much about MongoDB.

The next step is to try to fine-tune the model on our local machine to see if we can get improved output.

### Fine Tuning

We'll use Hugging Face's training and dataset libraries to bring in our data and tune the model.

First we'll import our formatted CSV into a pandas DataFrame, then convert it to a Hugging Face `Dataset` for use in our trainer.

In [8]:
dataset_df = pd.read_csv("./data/csv_template_formatted.csv")
dataset_df

Unnamed: 0,text
0,### Question: Mongoose findById is not returni...
1,### Question: Why is my mongo collection being...
2,### Question: MongoDb score results based on s...
3,### Question: Laravel 5.7 mongodb atlas connec...
4,### Question: Remote Mongo DB connection throu...
...,...
87085,### Question: Validator error when POSTing. Cr...
87086,### Question: Does Meteor-JS support offline s...
87087,### Question: Update a given mongo field in un...
87088,### Question: MongoDB search - find newest wit...


In [9]:
dataset=Dataset.from_pandas(dataset_df)
dataset

Dataset({
    features: ['text'],
    num_rows: 87090
})

In [10]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",  # Column holding the formatted question text
    # tokenizer=tokenizer,
    max_seq_length=512,  # Longer examples are truncated to 512 tokens
)


### Training the model

The next step would be to run the following code:

```python
trainer.train()
```

I let the model train for about six days, and it was only about 32% done.

![Training Time](./images/training_time.png)

As the attached image shows, the model is indeed training, but the time it would take is much longer than is practical for easy model evaluation and experimentation.
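
For a sense of scale, the figures above can be extrapolated with a quick sketch. It assumes training throughput stays constant for the rest of the run, which is a simplification; the function name is just for illustration.

```python
# Rough extrapolation of total training time from the progress seen so far.
# Assumption: the run keeps the same pace until it finishes.

def estimate_total_days(elapsed_days, fraction_done):
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_days / fraction_done

elapsed_days = 6.0    # "about six days" of training
fraction_done = 0.32  # "about 32% done"

total_days = estimate_total_days(elapsed_days, fraction_done)
remaining_days = total_days - elapsed_days
print(f"Estimated total: {total_days:.1f} days, remaining: {remaining_days:.1f} days")
```

At that pace a full run would take on the order of 19 days, with nearly 13 still to go.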

## What's Next?

This isn't to say that we can't continue training the model this way, but for exploratory learning the time commitment may be a bit much. So how do we try to resolve this?

##### Parameter Tuning
---
The libraries provided by Hugging Face have a large number of parameters that can be adjusted. In particular, modifying the tokenization may help speed up the training process.
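
As a rough illustration of how much these settings matter, the sketch below counts optimizer steps for a run over our 87,090 rows. The batch size of 8 and the 3 epochs are illustrative assumptions rather than values read from this notebook, and `optimizer_steps` is a hypothetical helper, so check your own training arguments before relying on the numbers.

```python
import math

def optimizer_steps(num_rows, batch_size, epochs):
    """Number of batches processed over a full run, keeping the last partial batch."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return batches_per_epoch * epochs

num_rows = 87090  # rows in the dataset built above

# Assumed baseline: batch size 8 for 3 epochs (illustrative values)
baseline = optimizer_steps(num_rows, batch_size=8, epochs=3)

# A cheaper exploratory run: one tenth of the rows for a single epoch
subset = optimizer_steps(num_rows // 10, batch_size=8, epochs=1)

print(f"Baseline steps: {baseline:,}")
print(f"Exploratory steps: {subset:,}")
```

Fewer rows or fewer epochs cut the step count directly, while a shorter maximum sequence length should make each step cheaper without changing the count.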

##### GPUs
---
We could drop the earlier goal of training on any machine and specifically use one with a compatible GPU. This would drastically cut down on the training time.

##### OS Specific Libraries
---
If you're using a Mac with an M-series chip, Apple has been developing libraries that let machine learning workloads tap into the computer's GPU:

[TensorFlow Metal](https://pypi.org/project/tensorflow-metal/)

[Hugging Face Apple GPU-Acceleration](https://huggingface.co/docs/accelerate/en/usage_guides/mps#how-it-works-out-of-the-box)
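
Before committing to either route, it can help to check which accelerator PyTorch can actually see on the machine. The sketch below is one way to do that; `pick_device` is a hypothetical helper, and it falls back to the CPU if PyTorch isn't installed or is too old to expose the Apple `mps` backend.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        return "cpu"  # No PyTorch installed, so there is nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU (or another CUDA-compatible build)
    mps = getattr(torch.backends, "mps", None)  # Missing in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"  # Apple silicon GPU through Metal
    return "cpu"

device = pick_device()
print(f"Training would run on: {device}")
```

If this prints `cpu` on a machine that has a GPU, the installed PyTorch build is a likely place to start looking.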

##### Different Tuning Methods
---
There are a variety of ways to fine-tune a model. We tried training all of the available parameters here and, as we saw, it was very resource intensive. One method we could try next is [Low-Rank Adaptation of Large Language Models](https://huggingface.co/docs/diffusers/en/training/lora) (LoRA), as it trains a much smaller number of weights.
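
To see why this trains so many fewer weights, here is a small sketch of the arithmetic. LoRA freezes an original weight matrix and learns two thin matrices of rank `r` alongside it, so each adapted matrix adds `r * (d_in + d_out)` trainable parameters. The setup below (rank 8, adapting only the fused query/key/value projection in each of 12 layers of GPT-2 small, and a full-model count of about 124 million) is an illustrative assumption, not a configuration used in this notebook.

```python
def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed setup: GPT-2 small, adapting the 768 -> 2304 attention projection per layer
n_layer, n_embd, rank = 12, 768, 8
trainable = n_layer * lora_extra_params(n_embd, 3 * n_embd, rank)

full_model = 124_439_808  # Assumed GPT-2 small parameter count
share = 100 * trainable / full_model
print(f"{trainable:,} trainable parameters, {share:.2f}% of the full model")
```

Under these assumptions only a fraction of a percent of the weights would need gradients, which is the main reason the method is attractive on limited hardware.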
