# Goal 1

In this script, I will show the full end-to-end process including making tuning records from the raw dataset.

In many cases, the customer has only raw datasets instead of formalized training set. So, 

1. I will make formalized training dataset efficiently through LLM, 

2. And, Basically LLMs are based on statistics, so I will apply the following input/output data format design.

>  - Input Data Formatting : For example, all input dataset should include the following template - "As a xxx role, we need to answer the following question. Question : ${question}". 
>  - Input Data Duplicating : There is no statistics/evidence for how much impacted duplicated input data will be used. For example 
>
>>      Record 1 : { "input" : "As a banker role, which accounts of xxx bank would be the most beneficial for a foreigner with L1 Visa ?", "Answer : It's the xxx account type."  }
>>      Record 2 : { "input" : "As a banker role, which accounts of xxx bank would be the most beneficial for a foreigner with L1 Visa ?", "Answer : It's the xxx account type."  } -- Duplicated records.
  
3. But the duplicated records could make wrong sub-optimal parameters for the task, we will add some general corpos to enhance 'prefix' layer. 
  In this whitepaper (https://arxiv.org/pdf/2101.00190.pdf), the prefix PEFT uses thw following datasets to optimize the prefix parameters (E2E- 50K, WebNLG - 22K). 

4. I will use prompt template to increase correctness of PEFT model. 

> PROMPT : 'As a xxx role, we need to answer the following question. Question : ${question}'

# Goal 2

In this script, how to refine raw dataset such like PDF files. 

During processing PDF files, you can meet the following issues. 

1. Some exceptional phrases (footer/header/numbering) could be added into the context

2. Some phrases are splited by pages. 

3. Pages with two or three columns are often analyzed incorrectly.

4. ETC...

In this script, I will try to solve Issue #1, Issue #2 with LLMs.



## 1. Making Formalized Training Set with more Large LLM

LargeLLM - such like Bard, ChatGPT3.5/4 can make the input/output pairs with raw dataset such like the following. 
(In this example, I will use the Robinhood's account descriptions - https://robinhood.com/us/en/support/articles/robinhood-accounts/)

> Prompt : "As a robinhood counseller, make four questions and answers related with the ${context}"
>
>> Context : Margin accounts - Trade with unsettled funds from stock and option sales. You don’t have to wait for funds to settle (2 trading days for stocks, 1 for options).
>>
>> Access margin investing. If eligible, you can enable margin investing and get more buying power.
>>
>> Access Level 3 options trading. If eligible, you can trade multi-leg options, like spreads.
>>
>> Pattern day trading regulations. If you make 4 or more day trades within 5 trading days in a margin account, you may be flagged as a pattern day trader. Review Pattern day trading for more information.

The quality of these QnA pairs are good. For example. 

> Q1: What is a margin account?
> 
> A1: A margin account is a type of brokerage account that allows investors to trade with borrowed funds, also known as margin, based on the value of their existing securities.
> 
> 
> Q2: How does trading with unsettled funds work in a margin account?
> 
> A2: In a margin account, you can trade with unsettled funds from stock and option sales. This means you don't have to wait for the funds to settle before making new trades.
> 
> 
> Q3: How long does it typically take for funds to settle in a margin account?
> 
> A3: For stocks, it usually takes 2 trading days for funds to settle, while for options, it takes 1 trading day.
> 
> 
> Q4: What is margin investing, and how can I access it?
> 
> A4: Margin investing allows you to leverage your existing securities to increase your buying power. If you are eligible, you can enable margin investing through your brokerage account.
> 

So, we can convert these results into formalized training sets. 

## 2. Formalized Splitter

But What **you have to worry about** is 'context length'. 

In many moder LLMs has its own context length limitations. So we need to split the raw context into small context-awared phrases. This issue is mentioned in Goal# 2.

So I have to make appropriate splitter with LLM at first.

In [1]:
## There is no official Bard's API endpoint. So we will use hand-made wrapper class. 

#! pip3 install langchain typing_extensions==4.5.0
#! pip3 install pypdf

# Instead of using Bard, I replaced it with openAI for trial.

# !pip3 install pycryptodome

In [2]:
import os

MODEL_ID="text-bison"
PROJECT_NUMBER=os.getenv('PROJECT_NUMBER')
VERTEX_AI_LOCATION="us-central1"


In [3]:
from langchain.document_loaders import PyPDFLoader # for loading the pdf


--I choose a Non-English content for multi-lingual cases.--
--The below content(samsungpay appcard document) can be downloaded from Hyundai Card site. (https://www.hyundaicard.com/doc/samsungpay_appcard_existing.pdf)--

I choose a English content(https://advisor.morganstanley.com/chad.detienne/documents/field/c/ch/chad-h--detienne/Guide%20to%20Reading%20Your%20Morgan%20Stanley%20Statement.pdf) cause of no permission to PaLM2 non-English API access.

Before to proceed, you should download this file from the website. 

In [4]:
#! wget https://advisor.morganstanley.com/chad.detienne/documents/field/c/ch/chad-h--detienne/Guide%20to%20Reading%20Your%20Morgan%20Stanley%20Statement.pdf
#! mv Guide* ../resources

## If you can't download it, browse the site and download it mannualy and then copy it to resources directory.


In [5]:
pdf_path = "../resources/Guide to Reading Your Morgan Stanley Statement.pdf"
loader = PyPDFLoader(pdf_path)

documents = loader.load()

for document in documents:
    print (document)

page_content='Guide to Reading  \nYour Morgan\xa0Stanley Statement\nYour account statement is a valuable resource that provides the information you need as you \nwork with a member of your Morgan\xa0Stanley team towards realizing your financial objectives. \nBy\xa0carefully reading your statement, you can remain up-to-date on your account(s). The goal of \nthis guide is to provide suggestions on how to read and understand your statement.\nWe encourage you to review your statements online via Morgan\xa0Stanley Online: it’s secure and \nconvenient. By enrolling in eDelivery, you reduce the volume of paper you have to manage, \nwhile\xa0retaining online access to seven years of statements. As an added benefit, you need \nnot wait for the mail: when your statement is available to view online, you receive an email \nnotification. To enroll in eDelivery, go to www.morganstanley.com/edelivery.\nRegardless of the delivery method you choose, we are committed to providing statements  \nthat keep

In [6]:
COMMON_ROLE="investment bank customer service representative"
SPECIFIED_ROLE="Morgan Stanley investment bank customer service representative"

In [7]:
SPLITTER_PROMPT_TEMPLATE = """
"As a {common_role}, remove useless footer/header words and divide the paragraphs into sections with hierarchical section numbers which could be given in context in CSV format in English. 
Hierarchical Section numbering rule : 1-1
CSV format is <hierarchical section number> , <phrase>
Context : last section number : 1-3
 4. This card MUST be used by the card owner. A. Some cards can be used in your mobile platform
Result : 
"1-4", "4. This card MUST be used by the card owner."
"1-4-1", "A. Some cards can be used in your mobile platform"

Context : last section number : {last_section_number} \n {content}
"""

In [8]:
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

def formalized_splitter_openai(document, last_section_number):
    response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages = [{"role": "system", "content" : SPLITTER_PROMPT_TEMPLATE.format(common_role=COMMON_ROLE, last_section_number=last_section_number, content=document)}]
    )
    return response['choices'][0]['message']['content']

# print(len(documents[0].page_content))


In [9]:

import vertexai
from vertexai.preview.language_models import TextGenerationModel
from google.auth import default

# Without Scopes, you will see the error. 
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])


vertexai.init(project=PROJECT_NUMBER, location=VERTEX_AI_LOCATION, credentials=credentials)
parameters = {
    "temperature" : 0.2,
    "max_output_tokens" : 1024, 
    "top_p": 0.8, 
    "top_k" : 10
}

model = TextGenerationModel.from_pretrained("text-bison@001")

In [10]:

def formalized_splitter_palm2(document, last_section_number):
  response=model.predict(prompt=SPLITTER_PROMPT_TEMPLATE.format(common_role=COMMON_ROLE, last_section_number=last_section_number, content=document),
  **parameters)
  return response.text

# Below line is for testing
print("##" + formalized_splitter_palm2(document=documents[0],last_section_number="1-14"))



##Result : 
"1-1", "Guide to Reading  \nYour Morgan\xa0Stanley Statement"
"1-2", "Your account statement is a valuable resource that provides the information you need as you \nwork with a member of your Morgan\xa0Stanley team towards realizing your financial objectives. \nBy\xa0carefully reading your statement, you can remain up-to-date on your account(s). The goal of \nthis guide is to provide suggestions on how to read and understand your statement."
"1-3", "We encourage you to review your statements online via Morgan\xa0Stanley Online: it’s secure and \nconvenient. By enrolling in eDelivery, you reduce the volume of paper you have to manage, \nwhile\xa0retaining online access to seven years of statements. As an added benefit, you need \nnot wait for the mail: when your statement is available to view online, you receive an email \nnotification. To enroll in eDelivery, go to www.morganstanley.com/edelivery."
"1-4", "Regardless of the delivery method you choose, we are committed to pro

In [11]:
# We will use parm2 API for splitter
def formalized_splitter(document, last_section_number):
  return formalized_splitter_palm2(document, last_section_number)


In [12]:
import csv
import traceback
import io

phrases = []
analyzed_phrases = ""
skipped_phrases = []
MAX_TRY_COUNT = 2
MIN_ANALYZED_CONTENT_LENGTH = 10
MIN_RESULT_CONTENT_LENGTH = 10

def full_page_splitter(documents):
  last_section_number = ""
  last_phrase_text = ""
  for document in documents:
    phrases_list = []
    analysis_target_text = last_phrase_text + '\n' + document.page_content
    if len(analysis_target_text) < MIN_ANALYZED_CONTENT_LENGTH :
      last_phrase_text = analysis_target_text
      continue
    try_count = 0
    while True:
      try:
        analyzed_phrases = formalized_splitter(document=analysis_target_text, last_section_number=last_section_number)
        #print('##' + analyzed_phrases)
        csv_file = io.StringIO(analyzed_phrases)
        reader = csv.reader(csv_file, delimiter=',')
        for row in reader:
          if len(row) >= 2:
            section_number = row[0]
            phrase = ' '.join(row[1:])
            phrases_list.append({ 'section' : section_number, 'phrase' : phrase})
        if len(phrases_list) > 1 :
          last_phrase = phrases_list.pop()
          last_section_number = last_phrase['section']
          last_phrase_text = last_phrase['phrase']
        phrases.extend(phrases_list)
        break
      except:
        print('Error Occured')
        traceback.print_exc()
        try_count = try_count + 1
        if try_count >= MAX_TRY_COUNT :
          skipped_phrases.append(analysis_target_text)
          break
  phrases.append(last_phrase)
  return phrases


In [13]:
import pandas as pd
import fastparquet as fp
import os

#parquet_file_path = '../resources/splitted.parquet'
parquet_file_path = '../resources/splitted_guide.parquet'

if not os.path.exists(parquet_file_path):
  # At first time, you have to  unremark the following three lines
  full_page_splitter(documents=documents)
  df_phrases = pd.DataFrame.from_records(phrases)
  df_phrases.to_parquet(parquet_file_path)
else:
  df_phrases = fp.ParquetFile(parquet_file_path).to_pandas()
  phrases = df_phrases.to_dict(orient='records')

In [14]:
phrases[0:10]

[{'section': '1',
  'phrase': ' "Guide to Reading Your Morgan\xa0Stanley Statement"'},
 {'section': '1-1',
  'phrase': ' "Your account statement is a valuable resource that provides the information you need as you work with a member of your Morgan\xa0Stanley team towards realizing your financial objectives."'},
 {'section': '1-2',
  'phrase': ' "By\xa0carefully reading your statement  you can remain up-to-date on your account(s). The goal of this guide is to provide suggestions on how to read and understand your statement."'},
 {'section': '1-3',
  'phrase': ' "We encourage you to review your statements online via Morgan\xa0Stanley Online: it’s secure and convenient."'},
 {'section': '1-4',
  'phrase': ' "By enrolling in eDelivery  you reduce the volume of paper you have to manage  while\xa0retaining online access to seven years of statements."'},
 {'section': '1-5',
  'phrase': ' "As an added benefit  you need not wait for the mail: when your statement is available to view online  you

Ok. You can see the splittd content with appropriate phrases (but wrong section numbering. but it doens't matter)

We will make the QnA pairs with these splitted context.

Let's do it.

In [15]:
MAKING_QNA_PAIR_PROMPT_TEMPLATE = """
As a {specific_role}, make {number_of_qna} QnA pairs based on the $Context with reasoning in CSV format in English.
Example Context : This card will only be used in Mobile App. 
Example Result : 
"context", "question", "reason", "answer", 
"This card will only be used in Mobile App.", "Which platform the user can use this card ?", "In the context, it said only mobile app. so mobile app is correct answer", "mobile app platform"

Context : 
{context}
"""

In [17]:
def formalized_qna_maker_api_openai(phrases, number_of_qna):
  response = openai.ChatCompletion.create(
              model="gpt-3.5-turbo-16k",
              messages = [{"role": "system", "content" : MAKING_QNA_PAIR_PROMPT_TEMPLATE.format(specific_role=SPECIFIED_ROLE, context=phrases, number_of_qna=number_of_qna)}]
          )
  return response['choices'][0]['message']['content']

def formalized_qna_maker_api_palm(phrases, number_of_qna):
  response=model.predict(prompt=MAKING_QNA_PAIR_PROMPT_TEMPLATE.format(specific_role=SPECIFIED_ROLE, context=phrases, number_of_qna=number_of_qna),
  **parameters)
  return response.text

def formalized_qna_maker_api(phrases, number_of_qna):
  return formalized_qna_maker_api_openai(phrases, number_of_qna)

In [18]:
import csv
import io

def formalized_qna_maker(phrases, number_of_qna):
    try_count = 0
    skipped_phrases = ""
    while True:
        qna_pairs = []
        try_count = try_count + 1
        try:
            analyzed_phrases = formalized_qna_maker_api(phrases=phrases, number_of_qna=number_of_qna)
            csv_file = io.StringIO(analyzed_phrases)
            reader = csv.reader(csv_file, delimiter=',')
            for row in reader:
                if len(row) >= 4:
                    context = row[0]
                    question = row[1]
                    reason = row[2]
                    answer = ' '.join(row[3:])
                    qna_pairs.append({'context' : context, 'question': question, 'reason': reason, 'answer': answer})
            break
        except:
            print("Error Occured : {phrase}".format(phrases))
            if try_count >= MAX_TRY_COUNT:
                skipped_phrases = phrases
                break
    return qna_pairs, skipped_phrases

To make context-awared QnAs, we will use hierarchical section number.

All phrases would be reaggregate with sections. 

I will just use two-depth section number.

In [19]:
last_parent_phrase = ""
last_parent_section = ""
last_phrase = ""
last_section = ""
last_aggregate_phrase = ""
aggregate_phrase = ""

MIN_PHRASE_SIZE = 100
MAX_PHRASE_SIZE = 3000

NUM_OF_QUESTION_ESTIMATION_SIZE = 30

def has_same_depth(last_section, section):
  last_sections = last_section.split('-')
  sections = section.split('-')
  # zero depth ?
  if len(last_sections) == 0 or len(sections) == 0 :
    return False
  # one dpeth ?
  if len(last_sections) == 1 or len(sections) == 1 :
    return last_sections[0] == sections[0]
  #print(len(last_sections), len(sections), len(last_sections) == 1 | len(sections) == 1)
  return last_sections[1] == sections[1]

def full_qna_maker(phrases):
  total_qna_pairs = []
  skipped_phrases_list = []
  last_aggregate_phrase = ""
  for phrase in phrases:
    section = phrase['section']
    aggregate_phrase = last_aggregate_phrase + '\n' + phrase['phrase']
    if len(aggregate_phrase) > MIN_PHRASE_SIZE:
      if len(aggregate_phrase) > MAX_PHRASE_SIZE:
        num_of_questions = len(last_aggregate_phrase) / NUM_OF_QUESTION_ESTIMATION_SIZE
        qna_pairs, skipped_phrases = formalized_qna_maker(last_aggregate_phrase, num_of_questions)
        total_qna_pairs.extend(qna_pairs)
        skipped_phrases_list.append(skipped_phrases)
        last_aggregate_phrase = phrase['phrase']
        last_section = section
        continue
      elif not has_same_depth(last_section=last_section, section=section):
        num_of_questions = len(last_aggregate_phrase) / NUM_OF_QUESTION_ESTIMATION_SIZE
        qna_pairs, skipped_phrases = formalized_qna_maker(last_aggregate_phrase, num_of_questions)
        total_qna_pairs.extend(qna_pairs)
        skipped_phrases_list.append(skipped_phrases)
        last_aggregate_phrase = phrase['phrase']
        last_section = section
      else:
        last_section = section
        last_aggregate_phrase = aggregate_phrase
    else:
      last_section = section
      last_aggregate_phrase = aggregate_phrase
  num_of_questions = len(last_aggregate_phrase) / NUM_OF_QUESTION_ESTIMATION_SIZE
  qna_pairs, skipped_phrases = formalized_qna_maker(last_aggregate_phrase, num_of_questions)
  total_qna_pairs.extend(qna_pairs)
  skipped_phrases_list.append(skipped_phrases)
  return total_qna_pairs, skipped_phrases_list


In [53]:
# The below line is test code.
total_qna_pairs, skipped_phrases = full_qna_maker(phrases=phrases)

#total_qna_pairs, skipped_phrases = full_qna_maker(phrases=phrases)

If you can see the context of Context / Question / Reasoning / Answer, these contents all have missing information. 

It's the product name and role of this answer. 

So I will add this information into input / output training dataset.

I don't have multi ligual model permission not yet. So I will translate all qna pair into English. 

In [57]:
MISSING_INFORMATION_PADDING_TEMPLATE = 'As a {specified_role}, you are answering to the customer for the {product_name}. Context: {context}. Question : {question}'
#OUTPUT_FORMAT = 'reason : {reason} answer: {answer}'
OUTPUT_FORMAT = 'Answer: {answer}'

refined_qna_pairs = []

for qna_pair in total_qna_pairs:
    if len(qna_pair['question']) > 12 and not qna_pair['context'] == 'context':
        refined_qna_pairs.append({ 
            "input_text" : MISSING_INFORMATION_PADDING_TEMPLATE.format(specified_role=SPECIFIED_ROLE,product_name='Guide to reading your morgan stanley statement',context=qna_pair['context'],question=qna_pair['question']), 
            #"output_text" : OUTPUT_FORMAT.format(reason=qna_pair['reason'], answer=qna_pair['answer']) 
            "output_text" : OUTPUT_FORMAT.format(answer=qna_pair['answer']) 
            })


In [23]:
#! pip3 install jsonlines

In [58]:
import jsonlines

multiplied_qna_pairs = []
for item in refined_qna_pairs:
   multiplied_qna_pairs.append(item)
   multiplied_qna_pairs.append(item)
   multiplied_qna_pairs.append(item)
   multiplied_qna_pairs.append(item)
   multiplied_qna_pairs.append(item)

    


Above codes mutiplied original dataset 5 times. 

I will add more examples into the dataset mentioned before.

There are many Q & A Datasets in Hugging Face.

For example. 

- SQuAD (Stanford Question Answering Dataset): SQuAD is one of the most widely used Q&A datasets. It contains questions and corresponding answers for passages or documents. You can access the SQuAD dataset through the Hugging Face datasets library.

- NewsQA: NewsQA is a dataset that includes questions and answers extracted from news articles. It provides various types of questions and corresponding answers found in the news articles.

- TriviaQA: TriviaQA is a dataset that includes knowledge-based questions and answers extracted from documents covering various topics. It encompasses a range of question types and difficulty levels.

- Natural Questions: The Natural Questions dataset contains questions asked by real users in search engines, along with corresponding answers. The questions are written in natural language, and the answers are found in a set of documents.

- DuoRC: DuoRC is a dataset where two colleagues provide different answers to the same question from the same document. This dataset is useful for improving models' ability to compare and evaluate different answers.

SQuAD would be used.

In [None]:
#! pip3 install wget
#! pip3 install datasets


In [59]:
from datasets import load_dataset

dataset = load_dataset("squad")

train_data = dataset['train']
validation_data = dataset['validation']

print(train_data[0])
print(validation_data[0])

Found cached dataset squad (/home/postgres/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)
100%|██████████| 2/2 [00:00<00:00, 627.04it/s]

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
{'id': '56be4db0acb8001400a502ec', 'title': 'Super_Bow




As you can see, 

the format of SQuAD is

    {
      id: text
      title: text
      context: text
      question: text
      answers: list
    }

So, we make input jsonl using context + question -> input_text, answers[0] -> output_text

In [60]:
for one_record in train_data.to_list():
  input_text = "context: {context} question: {question}".format(context=one_record['context'], question=one_record['question'])
  output_text = 'answer: {answer}'.format(answer=one_record['answers']['text'])
  multiplied_qna_pairs.append({ 
            "input_text" : input_text, 
            "output_text" : output_text })

import random

random.shuffle(multiplied_qna_pairs)

In [61]:
multiplied_qna_pairs[0:10]

[{'input_text': 'context: The core technology used in a videoconferencing system is digital compression of audio and video streams in real time. The hardware or software that performs compression is called a codec (coder/decoder). Compression rates of up to 1:500 can be achieved. The resulting digital stream of 1s and 0s is subdivided into labeled packets, which are then transmitted through a digital network of some kind (usually ISDN or IP). The use of audio modems in the transmission line allow for the use of POTS, or the Plain Old Telephone System, in some low-speed applications, such as videotelephony, because they convert the digital pulses to/from analog waves in the audio spectrum range. question: What is the software that performs audio and/or video compression?',
  'output_text': "answer: ['a codec (coder/decoder)']"},
 {'input_text': 'context: Nuestra Señora del Sagrado Corazón ("Our Lady of the Sacred Heart"), also known as Iglesia Punta Carretas ("Punta Carretas Church"), w

In [62]:
import json
import jsonlines

original_training_set_file_path = "../resources/original_set.jsonl"

with jsonlines.open(original_training_set_file_path, mode='a') as writer:
  for item in refined_qna_pairs:
      writer.write(item)

trainingset_file_path = "../resources/training_set.jsonl"

with jsonlines.open(trainingset_file_path, mode='a') as writer:
  for item in multiplied_qna_pairs:
      writer.write(item)


OK. It's time to make customize model through PEFT feature of Vertex AI.

In this time, there are only two regions in which PEFT feature is enabled - us (multi-region), eu (multi-region).

I will use eu region. 

**Fine Tune steps are the followings**

--1. we should make a bucket where the jsonl file will be stored.--

--2. we should make a bucket where the customized model would be store after training. (You can use it as same bucket made in the above line)--

1. Make a dataframe from the training set.

3. Create pipeine and Wait. 

In [63]:
import jsonlines
import pandas as pd


total_training_set = []

# Open the JSONL file
with jsonlines.open(trainingset_file_path) as reader:
    for json_obj in reader:
        total_training_set.append(json_obj)
        
df_trainingset = pd.DataFrame.from_records(total_training_set)

In [64]:
df_trainingset

Unnamed: 0,input_text,output_text
0,"context: The phonograph disc record was the primary medium used for music reproduction until late in the 20th century, replacing the phonograph cylinder record–with which it had co-existed from the late 1880s through to the 1920s–by the late 1920s. Records retained the largest market share even when new formats such as compact cassette were mass-marketed. By the late 1980s, digital media, in the form of the compact disc, had gained a larger market share, and the vinyl record left the mainstream in 1991. From the 1990s to the 2010s, records continued to be manufactured and sold on a much smaller scale, and were especially used by disc jockeys (DJ)s, released by artists in some genres, and listened to by a niche market of audiophiles. The phonograph record has made a niche resurgence in the early 21st century – 9.2 million records were sold in the U.S. in 2014, a 260% increase since 2009. Likewise, in the UK sales have increased five-fold from 2009 to 2014. question: Approximately how many phonograph records were sold in 2014?",answer: ['9.2 million']
1,"context: It was believed that immortality could be achieved if one reached the lands of the Queen Mother of the West or Mount Penglai. Han-era Daoists assembled into small groups of hermits who attempted to achieve immortality through breathing exercises, sexual techniques and use of medical elixirs. By the 2nd century AD, Daoists formed large hierarchical religious societies such as the Way of the Five Pecks of Rice. Its followers believed that the sage-philosopher Laozi (fl. 6th century BC) was a holy prophet who would offer salvation and good health if his devout followers would confess their sins, ban the worship of unclean gods who accepted meat sacrifices and chant sections of the Daodejing. question: What could be earned if an individual had reached the lands of the Queen Mother of the West?",answer: ['immortality']
2,"context: Mohinga is the traditional breakfast dish and is Myanmar's national dish. Seafood is a common ingredient in coastal cities such as Sittwe, Kyaukpyu, Mawlamyaing (formerly Moulmein), Mergui (Myeik) and Dawei, while meat and poultry are more commonly used in landlocked cities like Mandalay. Freshwater fish and shrimp have been incorporated into inland cooking as a primary source of protein and are used in a variety of ways, fresh, salted whole or filleted, salted and dried, made into a salty paste, or fermented sour and pressed. question: What is considered as an alternative to tofu for the valuable ingredient it holds for those not living near water in BUrma?",answer: ['Freshwater fish and shrimp have been incorporated into inland cooking as a primary source of protein']
3,"As a Morgan Stanley investment bank customer service representative, you are answering to the customer for the Guide to reading your morgan stanley statement. Context: How are unrealized gains or losses calculated? Question : ""In the context","reason : it says that unrealized gains or losses are calculated using the average cost tax lot method."" answer: ""Average cost tax lot method"""
4,"context: Six games were initially available in Japan, while eagerly anticipated titles such as Dead or Alive 4 and Enchanted Arms were released in the weeks following the console's launch. Games targeted specifically for the region, such as Chromehounds, Ninety-Nine Nights, and Phantasy Star Universe, were also released in the console's first year. Microsoft also had the support of Japanese developer Mistwalker, founded by Final Fantasy creator Hironobu Sakaguchi. Mistwalker's first game, Blue Dragon, was released in 2006 and had a limited-edition bundle which sold out quickly with over 10,000 pre-orders. Blue Dragon is one of three Xbox 360 games to surpass 200,000 units in Japan, along with Tales of Vesperia and Star Ocean: The Last Hope. Mistwalker's second game, Lost Odyssey also sold over 100,000 copies. question: Blue Dragon surpassed this sales figure in Japan?","answer: ['200,000 units']"
...,...,...
179418,"context: Though sexual attraction, behavior, and identity are all components of sexual orientation, if a person defined by one of these dimensions were congruent with those defined by another dimension it would not matter which was used in assessing orientation, but this is not the case. There is ""little coherent relationship between the amount and mix of homosexual and heterosexual behavior in a person's biography and that person's choice to label himself or herself as bisexual, homosexual, or heterosexual"". Individuals typically experience diverse attractions and behaviors that may reflect curiosity, experimentation, social pressure and is not necessarily indicative of an underlying sexual orientation. For example, a woman may have fantasies or thoughts about sex with other women but never act on these thoughts and only have sex with opposite gender partners. If sexual orientation was being assessed based on one's sexual attraction then this individual would be considered homosexual, but her behavior indicates heterosexuality. question: What do individuals typically experience?","answer: ['diverse attractions and behaviors that may reflect curiosity, experimentation, social pressure']"
179419,"context: Eleven days after Orsini's assassination attempt in France, Victoria's eldest daughter married Prince Frederick William of Prussia in London. They had been betrothed since September 1855, when Princess Victoria was 14 years old; the marriage was delayed by the Queen and Prince Albert until the bride was 17. The Queen and Albert hoped that their daughter and son-in-law would be a liberalising influence in the enlarging Prussian state. Victoria felt ""sick at heart"" to see her daughter leave England for Germany; ""It really makes me shudder"", she wrote to Princess Victoria in one of her frequent letters, ""when I look round to all your sweet, happy, unconscious sisters, and think I must give them up too – one by one."" Almost exactly a year later, Princess Victoria gave birth to the Queen's first grandchild, Wilhelm, who would become the last German Kaiser. question: How old was Victoria's oldest daughter when she was amrried?",answer: ['17']
179420,"context: With the record company a global operation in 1965, the Columbia Broadcasting System upper management started pondering changing the name of their record company subsidiary from Columbia Records to CBS Records. question: CBS began thinking of a name change to their record label in what year?",answer: ['1965']
179421,"context: With such a high percentage of children working, the rising of illiteracy, and the lack of a formal education became a widespread issue for many children who worked to provide for their families. Due to this problematic trend, many parents developed a change of opinion when deciding whether or not to send their children to work. Other factors that lead to the decline of child labour included financial changes in the economy, changes in the development of technology, raised wages, and continuous regulations on factory legislation. question: What are the reasons that lead to a decline of child labour?","answer: ['financial changes in the economy, changes in the development of technology, raised wages, and continuous regulations on factory legislation']"


In [66]:
from __future__ import annotations
import pandas as pd

def tuning(
    training_data: pd.DataFrame | str,
    train_steps: int = 10,
    learning_rate: float = 0.001
) -> None:

    model.tune_model(
        training_data=training_data,
        train_steps=train_steps,
        tuning_job_location="europe-west4",  # Only supported in europe-west4 for Public Preview
        tuned_model_location=VERTEX_AI_LOCATION,
        learning_rate=learning_rate
    )

    print(model._job.status)


tuning(training_data=df_trainingset,train_steps=100,learning_rate=0.00005)


Creating PipelineJob
PipelineJob created. Resource name: projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716005951
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716005951')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west4/pipelines/runs/tune-large-model-20230716005951?project=547505032058
PipelineJob projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716005951 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716005951 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716005951 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/547505032058/locations/europe-west4/pi

In [67]:
print(model._job._model.__dict__)

{'_model_id': 'publishers/google/models/text-bison@001', '_endpoint_name': 'projects/547505032058/locations/us-central1/endpoints/56783178504863744', '_endpoint': <google.cloud.aiplatform.models.Endpoint object at 0x7f56ffff23a0> 
resource name: projects/547505032058/locations/us-central1/endpoints/56783178504863744}


In [68]:
## The list shows up the latest model at the top.

latest_tuned_model = model.list_tuned_model_names()
tuned_model = model.get_tuned_model(latest_tuned_model)




projects/547505032058/locations/us-central1/models/4013333989614944256


OK. model creation is completed.

We will test it with two different models - Customized model, Frozon Model (text-bison-001)


In [70]:
QUESTION_TEMPLATE = "As a {specified_role}, you are answering to the customer, Question : {question}"



In [73]:
pd.set_option('display.max_colwidth', None)


                                                                                                                                                                                                                                                                                                                                                            input_text  \
4074             As a Morgan Stanley investment bank customer service representative, you are answering to the customer for the Guide to reading your morgan stanley statement. Context: Individual and Advisory Education Accounts receive all Basic Securities Advisory Accounts. Question :  "What type of accounts do Basic Securities Advisory Accounts receive?"   
16853   As a Morgan Stanley investment bank customer service representative, you are answering to the customer for the Guide to reading your morgan stanley statement. Context: Individual and Advisory Education Accounts receive all Basic Securities Advisory Accounts. Question 

In [78]:
df_trainingset[10:20]

Unnamed: 0,input_text,output_text
10,"context: The decorations were seldom displayed, however. After the Tito–Stalin split of 1948 and his inauguration as president in 1953, Tito rarely wore his uniform except when present in a military function, and then (with rare exception) only wore his Yugoslav ribbons for obvious practical reasons. The awards were displayed in full number only at his funeral in 1980. Tito's reputation as one of the Allied leaders of World War II, along with his diplomatic position as the founder of the Non-Aligned Movement, was primarily the cause of the favorable international recognition. question: Who was inauguarated as President of Yugoslavia in 1953?",answer: ['Tito']
11,"context: Bermuda was colonised by the English as an extension of Virginia and has long had close ties with the US Atlantic Seaboard and Canadian Maritimes as well as the UK. It had a history of African slavery, although Britain abolished it decades before the US. Since the 20th century, there has been considerable immigration to Bermuda from the West Indies, as well as continued immigration from Portuguese Atlantic islands. Unlike immigrants from British colonies in the West Indies, the latter immigrants have had greater difficulty in becoming permanent residents as they lacked British citizenship, mostly spoke no English, and required renewal of work permits to remain beyond an initial period. From the 1950s onwards, Bermuda relaxed its immigration laws, allowing increased immigration from Britain and Canada. Some Black politicians accused the government of using this device to counter the West Indian immigration of previous decades. question: Why did the English originally colonize Bermuda?",answer: ['an extension of Virginia']
12,"context: E 122nd Street runs four blocks (2,250 feet (690 m)) west from the intersection of Second Avenue and terminates at the intersection of Madison Avenue at Marcus Garvey Memorial Park. This segment runs in East Harlem and crosses portions of Third Avenue, Lexington, and Park (Fourth Avenue). question: A segment of what road crosses portions of Third Avenue, Lexington, and Park and runs in East Harlem?",answer: ['E 122nd Street']
13,"context: Egypt has a wide range of beaches situated on the Mediterranean and the Red Sea that extend to over 3,000 km. The Red Sea has serene waters, coloured coral reefs, rare fish and beautiful mountains. The Akba Gulf beaches also provide facilities for practising sea sports. Safaga tops the Red Sea zone with its beautiful location on the Suez Gulf. Last but not least, Sharm el-Sheikh (or City of Peace), Hurghada, Luxor (known as world's greatest open-air museum/ or City of the ⅓ of world monuments), Dahab, Ras Sidr, Marsa Alam, Safaga and the northern coast of the Mediterranean are major tourist's destinations of the recreational tourism. question: What locations on Egypt's northern coast are major tourist destinations for recreational tourism?","answer: ['Dahab, Ras Sidr, Marsa Alam, Safaga']"
14,"context: Since 1947, Canadian military units have participated in more than 200 operations worldwide, and completed 72 international operations. Canadian soldiers, sailors, and aviators came to be considered world-class professionals through conspicuous service during these conflicts and the country's integral participation in NATO during the Cold War, First Gulf War, Kosovo War, and in United Nations Peacekeeping operations, such as the Suez Crisis, Golan Heights, Cyprus, Croatia, Bosnia, Afghanistan, and Libya. Canada maintained an aircraft carrier from 1957 to 1970 during the Cold War, which never saw combat but participated in patrols during the Cuban Missile Crisis. question: How many operations have been completed by the Canadian Military Internationally?",answer: ['72']
15,"context: One of the major developments in the military sphere during the Late Middle Ages was the increased use of infantry and light cavalry. The English also employed longbowmen, but other countries were unable to create similar forces with the same success. Armour continued to advance, spurred by the increasing power of crossbows, and plate armour was developed to protect soldiers from crossbows as well as the hand-held guns that were developed. Pole arms reached new prominence with the development of the Flemish and Swiss infantry armed with pikes and other long spears. question: Along with crossbows, what was plate armor designed to defend against?",answer: ['hand-held guns']
16,"context: On 25 and 26 April, May and Taylor appeared on the eleventh series of American Idol at the Nokia Theatre, Los Angeles, performing a Queen medley with the six finalists on the first show, and the following day performed ""Somebody to Love"" with the 'Queen Extravaganza' band. Queen were scheduled to headline Sonisphere at Knebworth on 7 July 2012 with Adam Lambert before the festival was cancelled. Queen's final concert with Freddie Mercury was in Knebworth in 1986. Brian May commented, ""It's a worthy challenge for us, and I'm sure Adam would meet with Freddie's approval."" Queen expressed disappointment at the cancellation and released a statement to the effect that they were looking to find another venue. It was later announced that Queen + Adam Lambert would play two shows at the Hammersmith Apollo, London on 11 and 12 July 2012. Both shows sold out within 24 hours of tickets going on open sale. A third London date was scheduled for 14 July. On 30 June, Queen + Lambert performed in Kiev, Ukraine at a joint concert with Elton John for the Elena Pinchuk ANTIAIDS Foundation. Queen also performed with Lambert on 3 July 2012 at Moscow's Olympic Stadium, and on 7 July 2012 at the Municipal Stadium in Wroclaw, Poland. question: Where did Queen perform with Adam Lambert on 3 July 2012?","answer: [""Moscow's Olympic Stadium""]"
17,"As a Morgan Stanley investment bank customer service representative, you are answering to the customer for the Guide to reading your morgan stanley statement. Context: As an added benefit you need not wait for the mail: when your statement is available to view online you receive an email notification. Question : ""How do I know when my statement is available to view online?""","reason : ""In the context answer: it stated that you receive an email notification when your statement is available to view online."" ""You will receive an email notification"""
18,"context: The campaigns of French Emperor and General Napoleon Bonaparte characterized the Napoleonic Era. Born on Corsica as the French invaded, and dying suspiciously on the tiny British Island of St. Helena, this brilliant commander, controlled a French Empire that, at its height, ruled a large portion of Europe directly from Paris, while many of his friends and family ruled countries such as Spain, Poland, several parts of Italy and many other Kingdoms Republics and dependencies. The Napoleonic Era changed the face of Europe forever, and old Empires and Kingdoms fell apart as a result of the mighty and ""Glorious"" surge of Republicanism. question: From where did the French empire rule a large portion of Europe?",answer: ['Paris']
19,"context: In 1981, the new US President Ronald Reagan pursued a hard line approach to Libya, erroneously considering it a puppet regime of the Soviet Union. In turn, Gaddafi played up his commercial relationship with the Soviets, visiting Moscow again in April 1981 and 1985, and threatening to join the Warsaw Pact. The Soviets were nevertheless cautious of Gaddafi, seeing him as an unpredictable extremist. Beginning military exercises in the Gulf of Sirte – an area of sea that Libya claimed as a part of its territorial waters – in August 1981 the U.S. shot down two Libyan Su-22 planes monitoring them. Closing down Libya's embassy in Washington, D.C., Reagan advised U.S. companies operating in the country to reduce the number of American personnel stationed there. In March 1982, the U.S. implemented an embargo of Libyan oil, and in January 1986 ordered all U.S. companies to cease operating in the country, although several hundred workers remained. Diplomatic relations also broke down with the U.K., after Libyan diplomats were accused in the shooting death of Yvonne Fletcher, a British policewoman stationed outside their London embassy, in April 1984. In Spring 1986, the U.S. Navy again began performing exercises in the Gulf of Sirte; the Libyan military retaliated, but failed as the U.S. sank several Libyan ships. question: What did Reagan wrongly believe Libya to be?",answer: ['a puppet regime of the Soviet Union']


In [79]:
QUESTIONS = [
    #"How can I review my statements?",                                              # Morgan Stanley Online
    #"What is the benefit of enrolling in eDelivery?",       # You reduce the volume of paper you have to manage and retain online access to seven years of statements
    #"How do I know when my statement is available to view online?", # You will receive an email notification
    #"Which section of Morgan Stanley statement can show the change in value ?"                          # CHANGE IN VALUE OF YOUR ACCOUNT
    #"What information is typically included in an Account Summary?", # recent transactions  and any important notifications or alerts.\"  \"An Account Summary typically includes a summary of the account balance  recent transactions  and any important notifications or alerts.
    # "What purpose does the statement serve?", #suggesting that the answer is to provide an overview of the account and its growth.\"  \"To provide an overview of the account and its growth
    #"What type of accounts receive all Basic Securities Advisory Accounts?", # Individual and Advisory Education Accounts
    # "What can cause the manual summing to be inaccurate?",
    # "Approximately how many phonograph records were sold in 2014?",
    # "What was the average unemployment rate in the U.S. in 2014?", # X, X
    # "Why did the English originally colonize Bermuda?", 
]

def qna_predict(question):
    print("Question \n{question}".format(specified_role=SPECIFIED_ROLE, question=question))
    print("\n")
    print("Answer \n[Customized Model]\n{answer}".format(answer=tuned_model.predict(QUESTION_TEMPLATE.format(specified_role=SPECIFIED_ROLE, question=question),
        **parameters
    ).text))
    print("\n")
    print("Answer \n[Frozen Model]\n{answer}".format(answer=model.predict(QUESTION_TEMPLATE.format(specified_role=SPECIFIED_ROLE, question=question),
        **parameters
    ).text))

for question in QUESTIONS:
    qna_predict(question)
    print('\n')
    


Question 
Why did the English originally colonize Bermuda?


Answer 
[Customized Model]
The English originally colonized Bermuda in 1609 as a way to establish a permanent presence in the New World. The island was first discovered by the Spanish in 1503, but it was not until the English arrived that it was settled. The English were looking for a place to establish a colony that would be a base for trade and exploration in the Americas. Bermuda was seen as an ideal location because it was located in a strategic position in the Atlantic Ocean and it had a good climate. The English colonists quickly established a thriving community on the island, and it became an important part of the British Empire.


Answer 
[Frozen Model]
The English originally colonized Bermuda in 1609 as a way to establish a permanent presence in the New World. The island was first discovered by the Spanish in 1503, but it was not until the English arrived that it was settled. The English were looking for a place to e

The result doesn't seem good.

I need to find why this PEFT couldn't show the accuracy improvement. 

So I will make a simple loop to test to find which learning rate / steps - hyperparameters - is best fit for the small knowlege base adoption. 

In [84]:
context1 = 'In the novel - ''mistaken'', there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one.'
question1 = 'Who are the best hero in novel ''mistaken'' ?'
answer1 = 'Abraxas'
question2 = 'Heroes name in novel ''mistaken'' ?'
answer2 = 'Rox, Soroth, Abraxas'

small_knowledge_base = [{
    'input_text': '# Context : {context} Question: {question}'.format(context=context1, question=question1),
    'output_text': 'Answer: {answer}'.format(answer=answer1)
},{
    'input_text': '# Context : {context} Question: {question}'.format(context=context1, question=question2),
    'output_text': 'Answer: {answer}'.format(answer=answer2)
}]

knowledge_base = []

for x in range(1,100):
    knowledge_base.extend(small_knowledge_base)


df_small_knowledge_base = pd.DataFrame.from_records(knowledge_base)

df_small_knowledge_base

Unnamed: 0,input_text,output_text
0,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Who are the best hero in novel mistaken ?",Answer: Abraxas
1,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Heroes name in novel mistaken ?","Answer: Rox, Soroth, Abraxas"
2,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Who are the best hero in novel mistaken ?",Answer: Abraxas
3,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Heroes name in novel mistaken ?","Answer: Rox, Soroth, Abraxas"
4,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Who are the best hero in novel mistaken ?",Answer: Abraxas
...,...,...
193,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Heroes name in novel mistaken ?","Answer: Rox, Soroth, Abraxas"
194,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Who are the best hero in novel mistaken ?",Answer: Abraxas
195,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Heroes name in novel mistaken ?","Answer: Rox, Soroth, Abraxas"
196,"# Context : In the novel - mistaken, there are some heros - Rox, Soroth, Abraxas. Abraxas is the best one. Question: Who are the best hero in novel mistaken ?",Answer: Abraxas


In [85]:

steps = [10,50,100,500]
learning_rates = [0.015, 0.005, 0.001, 0.0005, 0.0001]

def find_step_and_rate():
    is_sucessful = False
    for step in steps:
        if is_sucessful:
            break
        for learning_rate in learning_rates:
            print('step : {step}, learning_rate : {learning_rate}'.format(step=step, learning_rate=learning_rate))
            tuning(training_data=df_small_knowledge_base,train_steps=step,learning_rate=learning_rate)
            tuned_model = model.get_tuned_model(model.list_tuned_model_names()[0])
            check_response = tuned_model.predict('# Question : {question}'.format(question=question1), **parameters).text
            if answer1 in check_response:
                print("Succeed !")
                is_sucessful = True
                break;

find_step_and_rate()
        


step : 10, learning_rate : 0.015
Creating PipelineJob
PipelineJob created. Resource name: projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716101851
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716101851')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west4/pipelines/runs/tune-large-model-20230716101851?project=547505032058
PipelineJob projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716101851 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716101851 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/547505032058/locations/europe-west4/pipelineJobs/tune-large-model-20230716101851 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/54750

| Dataset Size	| Training Time (Sec)	| Steps	| Learning Rate	| Remark |
|---------------|---------------------|-------|---------------|--------|
|200	|1634	|10	|0.015	| |
|200	|1288	|10	|0.005	| |
|200	|1103	|10	|0.001	| |
|200	|1072	|10	|0.0005	| |
|200	|1134	|10	|0.0001	| |
|200	|3368	|50	|0.015	| |
|200	|4920	|50	|0.005	| |
|200	|5100	|50	|0.001	| |
|200	|2559	|50	|0.0005	| |
|200	|2324	|50	|0.0001	| |
|200	|3660	|100	|0.015	| |
|200	|3660	|100	|0.005	| |
|200	|3660	|100	|0.001	| |
|200	|3660	|100	|0.0005	| |
|200	|3720	|100	|0.0001	| |
|200	|14940	|500	|0.015	| Sometimes, It shows the trained knowledge in response |
|200	|15240	|500	|0.005	| |
|200	|15240	|500	|0.001	| Frequently, It show the traind knoweledge in response |
|200	|	|500	|0.0005	| |
|200	|	|500	|0.0001	| |

