## Getting started

Download the Amazon reviews and load them into a pandas DataFrame.

### Download Amazon reviews

Get the reviews from https://www.kaggle.com/datasets/mexwell/amazon-reviews-multi?resource=download. The reviews were previously hosted at https://huggingface.co/datasets/amazon_reviews_multi, but they were taken down at the request of the data owner. Extract the archive and place the CSV files in a folder named "reviews", next to this file.
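
Before loading anything, it can help to confirm that the archive was extracted where the notebook expects it. The following is a minimal sketch, assuming the three CSV files are named `train.csv`, `test.csv` and `validation.csv` (the names used by the loading code in this notebook); `missing_review_files` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

# File names the rest of this notebook expects inside ./reviews (assumption)
EXPECTED_FILES = ["train", "test", "validation"]

def missing_review_files(folder: str = "./reviews") -> list[str]:
    """Return the expected CSV paths that do not exist in `folder`."""
    root = Path(folder)
    return [str(root / f"{name}.csv")
            for name in EXPECTED_FILES
            if not (root / f"{name}.csv").is_file()]

missing = missing_review_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All review files found.")
```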

### Get reviews into a Pandas dataframe

See the code below:

In [43]:
import pandas as pd

PATH = "./reviews/%s.csv"
FILES = ["train", "test", "validation"]
PREPARED_OUTPUT_PATH = "amazon-english-full-%s-sentiment.jsonl"

# load the training split
training_df = pd.read_csv(PATH % "train")


In [42]:
def prepare_df_for_training(df: pd.DataFrame) -> pd.DataFrame:
  '''
  Returns the English records, without duplicates, in shuffled order.
  Records follow OpenAI's prompt/completion format for tuning models.
  '''
  # filter on the frame that was passed in (not the global training_df)
  # and copy it so the new columns are not set on a slice
  english_df = df[df['language'] == 'en'].copy()
  english_df['prompt'] = (english_df['review_title'] + '\n\n'
                          + english_df['review_body'])
  english_df['completion'] = english_df['stars'].astype(str)
  english_df = english_df.drop_duplicates(subset=['prompt'])
  # sample(frac=1) returns every row in random order
  return english_df[['prompt', 'completion']].sample(frac=1)

def create_jsonl_files() -> None:
  for name in FILES:
    file_path = PATH % name
    print('loading %s' % file_path)
    df = pd.read_csv(file_path)
    df = prepare_df_for_training(df)
    df.to_json(PREPARED_OUTPUT_PATH % name, orient='records', lines=True)
    print("saved %s" % file_path)

# prepare data frames for use with OpenAI
create_jsonl_files()

loading ./reviews/train.csv
saved ./reviews/train.csv
loading ./reviews/test.csv
saved ./reviews/test.csv
loading ./reviews/validation.csv
saved ./reviews/validation.csv


At this point, three JSONL files (one each for train, test, and validation) have been written to your local directory. These files use a prompt/completion format that is compatible with OpenAI's fine-tuning API.
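
For reference, each line of such a file is one JSON object with a `prompt` and a `completion` key. The sketch below builds a single made-up review (the values are illustrative, not taken from the dataset), serializes it with the same `orient='records', lines=True` settings, and reads it back.

```python
import io
import pandas as pd

# One made-up record; real rows come from the Amazon review CSVs.
sample = pd.DataFrame({
    "prompt": ["Great value\n\nA must buy if you are into reading"],
    "completion": ["5"],
})

# Same serialization settings as the notebook: one JSON object per line.
jsonl_text = sample.to_json(orient="records", lines=True)
print(jsonl_text)

# Reading it back; dtype=False keeps the star rating as a string.
restored = pd.read_json(io.StringIO(jsonl_text), lines=True, dtype=False)
print(list(restored.columns))
```

Without `dtype=False`, pandas infers types on load, so the rating column would come back as integers rather than strings.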

If you are using OpenAI to fine-tune a model, store your API key in the conda environment with `conda env config vars set OPENAI_API_KEY=<your_api_key>`.
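
Once the variable is set and the conda environment has been reactivated, the key can be read from Python instead of being pasted into the notebook. A small sketch follows; `require_env` is a hypothetical helper name, not part of any SDK.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; run `conda env config vars set {name}=...` "
            "and reactivate the environment.")
    return value

# Usage: api_key = require_env("OPENAI_API_KEY")
```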

I am not going to use OpenAI; instead, I will rely on Hugging Face to fine-tune Mistral 7B. Run the following commands:

1. `pip install -U autotrain-advanced`
1. `pip install datasets transformers`

More context and examples of how Mistral was trained to follow instructions are in https://www.kdnuggets.com/how-to-finetune-mistral-ai-7b-llm-with-hugging-face-autotrain.



In [48]:
# this instruction will be used to predict the rating for every review
INSTRUCTION = "\
You are a customer reviewing a product that you purchased \
and your task is to rate how you feel about the product based \
on your review. The rating is a number in the range of 1 to 5 \
where 1 is extremely negative and 5 is extremely positive."

# this will be used to prompt mistral7b during training
TEXT ="\
<s>[INST] Below is an instruction that describes a task, \
paired with an input that provides context for the task. \
Write a response that appropriately completes the request. \
\n\n### Instruction:\n%s \
\n\n### Input:\n <product_review>%s</product_review> [/INST] \
%s</s>"

import os  # needed to create the output folder

def format_text(data: pd.Series) -> str:
  '''
  Builds the training text for one review.
  data: a pandas Series with indexes "prompt" and "completion"
  '''
  return TEXT % (INSTRUCTION,
                 data["prompt"],
                 data["completion"])

def create_csv_for_mistral(fileNames: list[str] = ["train"]) -> None:
  '''
  Persists a CSV to train Mistral for each given JSONL file with
  prompt and completion columns (the format prepared for OpenAI).
  fileNames: list of files to load
  '''
  os.makedirs('data', exist_ok=True)
  for filename in fileNames:
    path = PREPARED_OUTPUT_PATH % filename
    df = pd.read_json(path, lines=True)
    df["text"] = df.apply(format_text, axis=1)
    df["instruction"] = INSTRUCTION
    df["input"] = df["prompt"]
    df["output"] = df["completion"]
    # one CSV per input file, so later files do not overwrite data/train.csv
    df[["instruction", "input", "output", "text"]]\
      .to_csv('data/%s.csv' % filename, index=False)

create_csv_for_mistral()

We're now ready to fine-tune Mistral 7B Instruct (specifically https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).

We'll use Hugging Face's AutoTrain to fine-tune Mistral.
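
AutoTrain is pointed at a folder and a text column, so a quick sanity check of the CSV before launching a long run can save time. The sketch below is an assumption about what is worth checking: that the file has a `text` column, that no row is empty, and that every row ends with the `</s>` marker used in the prompt template. `check_training_csv` is a hypothetical helper, not an AutoTrain function.

```python
import pandas as pd

def check_training_csv(path: str = "data/train.csv",
                       text_column: str = "text") -> list[str]:
    """Return a list of human-readable problems found in the training CSV."""
    df = pd.read_csv(path)
    if text_column not in df.columns:
        return [f"column '{text_column}' is missing"]
    problems = []
    text = df[text_column]
    if text.isna().any():
        problems.append(f"{int(text.isna().sum())} rows have no text")
    # Every training example built from the template should end with </s>.
    bad_end = ~text.dropna().astype(str).str.endswith("</s>")
    if bad_end.any():
        problems.append(f"{int(bad_end.sum())} rows do not end with </s>")
    return problems

# Usage: print(check_training_csv() or "training CSV looks fine")
```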

In [53]:
import os

!autotrain setup

# setup training parameters
project_name = 'my_autotrained_sentiment_predictor_llm'
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
push_to_hub = False
# if you cannot find your token, check
# https://huggingface.co/settings/tokens
# then set your token with:
# conda env config vars set HUGGINGFACE_TOKEN=token_goes_here
hf_token = os.environ.get("HUGGINGFACE_TOKEN") or "type_your_token"
repo_id = "schroedinger-s-cat/product_rating_predictor"
learning_rate = 2e-4
num_epochs = 4
batch_size = 4
block_size = 1024
trainer = "sft"  # note: not forwarded to the autotrain call below
warmup_ratio = 0.1
weight_decay = 0.01
gradient_accumulation = 4
use_fp16 = False
use_peft = True
use_int4 = False
lora_r = 16
lora_alpha = 32
lora_dropout = 0.045

# propagate parameters to the local environment
os.environ["PROJECT_NAME"] = project_name
os.environ["MODEL_NAME"] = model_name
os.environ["PUSH_TO_HUB"] = str(push_to_hub)
os.environ["HF_TOKEN"] = hf_token
os.environ["REPO_ID"] = repo_id
os.environ["LEARNING_RATE"] = str(learning_rate)
os.environ["NUM_EPOCHS"] = str(num_epochs)
os.environ["BATCH_SIZE"] = str(batch_size)
os.environ["BLOCK_SIZE"] = str(block_size)
os.environ["WARMUP_RATIO"] = str(warmup_ratio)
os.environ["WEIGHT_DECAY"] = str(weight_decay)
os.environ["GRADIENT_ACCUMULATION"] = str(gradient_accumulation)
os.environ["USE_FP16"] = str(use_fp16)
os.environ["USE_PEFT"] = str(use_peft)
os.environ["USE_INT4"] = str(use_int4)
os.environ["LORA_R"] = str(lora_r)
os.environ["LORA_ALPHA"] = str(lora_alpha)
os.environ["LORA_DROPOUT"] = str(lora_dropout)

# if needed, check auto train cmd line params:
# https://github.com/huggingface/autotrain-advanced/blob/main/src/autotrain/cli/run_llm.py
# run AutoTrain
!autotrain llm \
--train \
--model ${MODEL_NAME} \
--project-name ${PROJECT_NAME} \
--data-path data/ \
--text-column text \
--lr ${LEARNING_RATE} \
--batch-size ${BATCH_SIZE} \
--epochs ${NUM_EPOCHS} \
--block-size ${BLOCK_SIZE} \
--warmup-ratio ${WARMUP_RATIO} \
--lora-r ${LORA_R} \
--lora-alpha ${LORA_ALPHA} \
--lora-dropout ${LORA_DROPOUT} \
--weight-decay ${WEIGHT_DECAY} \
--gradient-accumulation ${GRADIENT_ACCUMULATION} \
$( [[ "$USE_FP16" == "True" ]] && echo "--mixed-precision fp16" ) \
$( [[ "$USE_PEFT" == "True" ]] && echo "--use-peft" ) \
$( [[ "$USE_INT4" == "True" ]] && echo "--quantization int4" ) \
$( [[ "$PUSH_TO_HUB" == "True" ]] && echo "--push-to-hub --token ${HF_TOKEN} --repo-id ${REPO_ID}" )

> INFO    Installing latest xformers
> INFO    Successfully installed latest xformers
> INFO    Running LLM
> INFO    Params: Namespace(version=False, text_column='text', rejected_text_column='rejected', prompt_text_column='prompt', model_ref=None, warmup_ratio=0.1, optimizer='adamw_torch', scheduler='linear', weight_decay=0.01, max_grad_norm=1.0, add_eos_token=False, block_size=1024, peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.045, logging_steps=-1, evaluation_strategy='epoch', save_total_limit=1, save_strategy='epoch', auto_find_batch_size=False, mixed_precision=None, quantization=None, model_max_length=1024, trainer='default', target_modules=None, merge_adapter=False, use_flash_attention_2=False, dpo_beta=0.1, apply_c

In [None]:
# use the model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = project_name
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
review_text = "This book is great value.\n\
A must buy if you are into reading"
input_text = f"\
[INST] Below is an instruction that describes a task, \
paired with an input that provides context for the task. \
Write a response that appropriately completes the request. \
\n\n### Instruction:\n{INSTRUCTION} \
\n\n### Input:\n <product_review>{review_text}</product_review> [/INST]"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# generate() returns the prompt tokens followed by up to 5 new tokens
output = model.generate(input_ids, max_new_tokens=5)
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)

In [None]:
# use the model via pipelines
from transformers import pipeline

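# the task must be given explicitly when passing a loaded model object
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(input_text, max_new_tokens=5)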
print(result)
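
Both the decoded `generate` output and the default text-generation pipeline output repeat the prompt before the model's answer, so the rating still has to be pulled out of the string. The following is a minimal sketch, assuming the model answers with a single digit from 1 to 5 right after the `[/INST]` marker; `extract_rating` is a hypothetical helper.

```python
import re
from typing import Optional

def extract_rating(generated_text: str) -> Optional[int]:
    """Return the first digit 1-5 after the last [/INST] marker, or None.

    If the marker is absent, the whole string is searched.
    """
    # Keep only the model's answer; the decoded text repeats the prompt first.
    answer = generated_text.rsplit("[/INST]", 1)[-1]
    match = re.search(r"[1-5]", answer)
    return int(match.group()) if match else None

print(extract_rating("[INST] ... range of 1 to 5 ... [/INST] 4"))  # 4
```

Splitting on the last `[/INST]` matters here because the instruction itself mentions "1 to 5", which a search over the full text would match first.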