<a href="https://colab.research.google.com/github/phuonghoathu/nothing1988nevergive/blob/main/L7_Synthetic_Data_FAQ_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --quiet -U langchain chromadb langchain-openai pypdf gradio datasets

In [None]:
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader

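# Split the PDF into chunks of at most 512 characters, with 50 characters of overlap between neighbors.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
loader = PyPDFLoader("/content/VinaLLaMA_final.pdf")
# load_and_split() reads every page and applies the splitter, returning a list of Document objects.
splits = loader.load_and_split(text_splitter)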
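
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that class also tries to break on separators such as paragraphs and sentences), and `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=50):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values make the overlap visible: each chunk repeats the last
# character of the previous one.
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears, at least partly, in the next chunk, which gives the model more context when generating questions.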

In [None]:
synthetic_prompt = """
You will be given a passage of text. Your task is to use its content to create 20 FAQ-style question-answer pairs that an ordinary person would ask:
---CONTEXT---
{context}
{context2}
---END---
Create 20 FAQ-style question-answer pairs that an ordinary person would ask, using the following format
---FORMAT INSTRUCTION---
Question:
<question1>
Answer:
<answer1>

Question:
<question2>
Answer:
<answer2>
... do this 20 times
---END---
Now, let's start
---START---
"""

In [None]:
splits[0].page_content

'VinaLLaMA: LLaMA-based Vietnamese Foundation Model\nQuan Nguyen∗, Huy Pham and†Dung Dao‡\nDecember 15, 2023\nAbstract\nIn this technical report, we present VinaLLaMA, an open-source, state-of-the-art (SOTA)\nLarge Language Model for the Vietnamese language, built upon LLaMA-2 with an additional\n800 billion trained tokens. VinaLLaMA not only demonstrates fluency in Vietnamese but also\nexhibits a profound understanding of Vietnamese culture, making it a truly indigenous model.'

In [None]:
gen_prompt = synthetic_prompt.format(context=splits[0].page_content, context2=splits[1].page_content)

In [None]:
print(gen_prompt)


You will be given a passage of text. Your task is to use its content to create 20 FAQ-style question-answer pairs that an ordinary person would ask:
---CONTEXT---
VinaLLaMA: LLaMA-based Vietnamese Foundation Model
Quan Nguyen∗, Huy Pham and†Dung Dao‡
December 15, 2023
Abstract
In this technical report, we present VinaLLaMA, an open-source, state-of-the-art (SOTA)
Large Language Model for the Vietnamese language, built upon LLaMA-2 with an additional
800 billion trained tokens. VinaLLaMA not only demonstrates fluency in Vietnamese but also
exhibits a profound understanding of Vietnamese culture, making it a truly indigenous model.
VinaLLaMA-7B-chat, trained on 1-million high quality synthetic samples, achieves SOTA results
on key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese, marking a
significant advancement in the Vietnamese AI landscape and offering a versatile resource for various
applications.
1 Introduction
The surge in Large Language Models (LLMs)
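
The prompt above covers only the first two chunks. A minimal sketch of how the same template could be filled for every consecutive pair of chunks follows; `build_prompts` is a hypothetical helper, and the short `demo_template` stands in for the notebook's `synthetic_prompt`.

```python
def build_prompts(chunk_texts, template):
    """Fill a two-placeholder template once per pair of consecutive chunks."""
    prompts = []
    for start in range(0, len(chunk_texts), 2):
        first = chunk_texts[start]
        # An odd number of chunks leaves the last one without a partner,
        # so the second placeholder is filled with an empty string.
        second = chunk_texts[start + 1] if start + 1 < len(chunk_texts) else ""
        prompts.append(template.format(context=first, context2=second))
    return prompts


# Stand-in template with the same placeholders as the notebook's prompt.
demo_template = "---CONTEXT---\n{context}\n{context2}\n---END---"
demo_prompts = build_prompts(["chunk A", "chunk B", "chunk C"], demo_template)
print(len(demo_prompts))
```

In this notebook the call would look like `build_prompts([s.page_content for s in splits], synthetic_prompt)`, under the assumption that pairing neighboring chunks is the desired grouping.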

In [None]:
from openai import OpenAI
from google.colab import userdata

# Read the API key from Colab's secret store rather than hard-coding it in the notebook.
llm = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

In [None]:
response = llm.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[{"role": "user", "content": gen_prompt}],
    temperature=0.1,  # a low temperature makes the output more deterministic
)

In [None]:
print(response.choices[0].message.content)

Question:
What is VinaLLaMA?
Answer:
VinaLLaMA is an open-source, state-of-the-art Large Language Model for the Vietnamese language.

Question:
How many trained tokens does VinaLLaMA have?
Answer:
VinaLLaMA is built upon LLaMA-2 with an additional 800 billion trained tokens.

Question:
What sets VinaLLaMA apart from other language models?
Answer:
VinaLLaMA not only demonstrates fluency in Vietnamese but also exhibits a profound understanding of Vietnamese culture.

Question:
What is VinaLLaMA-7B-chat trained on?
Answer:
VinaLLaMA-7B-chat is trained on 1-million high quality synthetic samples.

Question:
What benchmarks has VinaLLaMA-7B-chat achieved SOTA results on?
Answer:
VinaLLaMA-7B-chat has achieved SOTA results on key benchmarks including VLSP, VMLU, and Vicuna Benchmark Vietnamese.

Question:
How does VinaLLaMA contribute to the Vietnamese AI landscape?
Answer:
VinaLLaMA marks a significant advancement in the Vietnamese AI landscape and offers a versatile resource for various ap
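
When the same request has to be sent for many prompts, it can help to wrap the call in a small function. The sketch below only repackages the call already shown in this notebook; `generate_faq_text` is a hypothetical helper name, and `client` is assumed to be an `openai.OpenAI` instance such as `llm`.

```python
def generate_faq_text(client, prompt, model="gpt-3.5-turbo", temperature=0.1):
    """Send one prompt as a single user message and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    # The first choice holds the generated FAQ text.
    return response.choices[0].message.content
```

With the objects defined earlier, `generate_faq_text(llm, gen_prompt)` would return the same kind of text as the cell above.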

In [None]:
def parse_qa_pairs(text):
    """
    Parses text containing "Question:"/"Answer:" blocks into a list of dictionaries.

    Parameters:
    - text (str): Model output in the format requested by the prompt.

    Returns:
    - List[Dict[str, str]]: One {"question": ..., "answer": ...} dictionary per complete pair.
    """
    qa_pairs = []

    # Split on "Question:" and drop the first piece (whatever precedes the first question).
    sections = text.split("Question:")[1:]

    for section in sections:
        # partition() never raises, unlike unpacking the result of split(),
        # so a malformed or truncated section is skipped instead of crashing.
        question_part, separator, answer_part = section.partition("Answer:")
        question = question_part.strip()
        answer = answer_part.strip()
        if separator and question and answer:
            qa_pairs.append({"question": question, "answer": answer})

    return qa_pairs

In [None]:
qa = parse_qa_pairs(response.choices[0].message.content)

In [None]:
qa

[{'question': 'What is VinaLLaMA?',
  'answer': 'VinaLLaMA is an open-source, state-of-the-art Large Language Model for the Vietnamese language.'},
 {'question': 'How many trained tokens does VinaLLaMA have?',
  'answer': 'VinaLLaMA is built upon LLaMA-2 with an additional 800 billion trained tokens.'},
 {'question': 'What sets VinaLLaMA apart from other language models?',
  'answer': 'VinaLLaMA not only demonstrates fluency in Vietnamese but also exhibits a profound understanding of Vietnamese culture.'},
 {'question': 'What is VinaLLaMA-7B-chat trained on?',
  'answer': 'VinaLLaMA-7B-chat is trained on 1-million high quality synthetic samples.'},
 {'question': 'What benchmarks has VinaLLaMA-7B-chat achieved SOTA results on?',
  'answer': 'VinaLLaMA-7B-chat has achieved SOTA results on key benchmarks including VLSP, VMLU, and Vicuna Benchmark Vietnamese.'},
 {'question': 'How does VinaLLaMA contribute to the Vietnamese AI landscape?',
  'answer': 'VinaLLaMA marks a significant advance
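
Generated FAQ lists often contain repeated questions or very short answers. The following is a minimal cleaning sketch for a list shaped like `qa`; `clean_qa_pairs` and its `min_answer_chars` threshold are hypothetical, and the default of 10 characters is an arbitrary assumption to tune per dataset.

```python
def clean_qa_pairs(pairs, min_answer_chars=10):
    """Drop pairs with empty questions, too-short answers, or repeated questions."""
    seen_questions = set()
    cleaned = []
    for pair in pairs:
        question = pair.get("question", "").strip()
        answer = pair.get("answer", "").strip()
        # casefold() makes the duplicate check case-insensitive.
        key = question.casefold()
        if not question or len(answer) < min_answer_chars or key in seen_questions:
            continue
        seen_questions.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned


demo_pairs = [
    {"question": "What is VinaLLaMA?", "answer": "A Vietnamese Large Language Model."},
    {"question": "what is vinallama?", "answer": "A duplicate of the first question."},
    {"question": "Is it open-source?", "answer": "Yes."},
]
print(clean_qa_pairs(demo_pairs))
```

Only the first pair survives here: the second repeats the first question, and the third answer is shorter than the threshold.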

## Convert to ShareGPT

In [None]:
sharegpt_data = []
for pair in qa:
    # Each Q&A pair becomes a two-turn conversation: question from "human", answer from "gpt".
    convo = {'conversations': [
        {"from": "human", "value": pair['question']},
        {"from": "gpt", "value": pair['answer']},
    ]}
    sharegpt_data.append(convo)

In [None]:
sharegpt_data

[{'conversations': [{'from': 'human', 'value': 'What is VinaLLaMA?'},
   {'from': 'gpt',
    'value': 'VinaLLaMA is an open-source, state-of-the-art Large Language Model for the Vietnamese language.'}]},
 {'conversations': [{'from': 'human',
    'value': 'How many trained tokens does VinaLLaMA have?'},
   {'from': 'gpt',
    'value': 'VinaLLaMA is built upon LLaMA-2 with an additional 800 billion trained tokens.'}]},
 {'conversations': [{'from': 'human',
    'value': 'What sets VinaLLaMA apart from other language models?'},
   {'from': 'gpt',
    'value': 'VinaLLaMA not only demonstrates fluency in Vietnamese but also exhibits a profound understanding of Vietnamese culture.'}]},
 {'conversations': [{'from': 'human',
    'value': 'What is VinaLLaMA-7B-chat trained on?'},
   {'from': 'gpt',
    'value': 'VinaLLaMA-7B-chat is trained on 1-million high quality synthetic samples.'}]},
 {'conversations': [{'from': 'human',
    'value': 'What benchmarks has VinaLLaMA-7B-chat achieved SOTA res
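
Before saving, it can be worth checking that every record has the shape shown above. This is a minimal sketch that assumes the two-role layout used in this notebook (alternating `human` and `gpt` turns); ShareGPT-style datasets elsewhere may carry extra roles or fields, and `is_valid_sharegpt` is a hypothetical helper.

```python
def is_valid_sharegpt(record):
    """Return True if the record holds alternating human/gpt turns with non-empty text."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or len(turns) < 2 or len(turns) % 2 != 0:
        return False
    expected_roles = ("human", "gpt")
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            return False
        # Even positions must come from "human", odd positions from "gpt".
        if turn.get("from") != expected_roles[index % 2]:
            return False
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            return False
    return True


good_record = {"conversations": [
    {"from": "human", "value": "What is VinaLLaMA?"},
    {"from": "gpt", "value": "An open-source Large Language Model for Vietnamese."},
]}
bad_record = {"conversations": [{"from": "gpt", "value": "An answer with no question."}]}
print(is_valid_sharegpt(good_record), is_valid_sharegpt(bad_record))
```

A filter such as `[r for r in sharegpt_data if is_valid_sharegpt(r)]` would then keep only well-formed conversations.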

In [None]:
# Save sharegpt_data to a JSON file; ensure_ascii=False keeps non-ASCII (e.g. Vietnamese) characters readable.

import json

with open('sharegpt_data.json', 'w', encoding='utf-8') as f:
  json.dump(sharegpt_data, f, indent=4, ensure_ascii=False)
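
The effect of `ensure_ascii` is easiest to see side by side. This short sketch uses a made-up Vietnamese sample record (not taken from the generated data) and shows that both settings decode to the same object, while only `ensure_ascii=False` keeps the text human-readable in the file.

```python
import json

# Hypothetical sample record containing non-ASCII characters.
sample = [{"conversations": [{"from": "human", "value": "VinaLLaMA là gì?"}]}]

escaped = json.dumps(sample, ensure_ascii=True)    # non-ASCII becomes \uXXXX escapes
readable = json.dumps(sample, ensure_ascii=False)  # non-ASCII is written as-is

print(escaped)
print(readable)

# Both strings decode back to exactly the same data.
print(json.loads(escaped) == json.loads(readable) == sample)
```

Because the file is opened with `encoding='utf-8'`, the readable form is written safely regardless of the platform's default encoding.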

In [None]:
import datasets

dataset = datasets.load_dataset('json', data_files='sharegpt_data.json')

In [None]:
dataset.push_to_hub('qnguyen3/demo_faq')

CommitInfo(commit_url='https://huggingface.co/datasets/qnguyen3/demo_faq/commit/e4ebda7555b8e3f5f1232dd2d05eea52f92cf431', commit_message='Upload dataset', commit_description='', oid='e4ebda7555b8e3f5f1232dd2d05eea52f92cf431', pr_url=None, pr_revision=None, pr_num=None)