In [None]:
!pip install -q -U google-generativeai

### Set up your API key

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

<a class="button button-primary" href="https://makersuite.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `GOOGLE_API_KEY`.

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GOOGLE_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`
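
The two options above can be combined into one lookup. The following is a minimal sketch, not part of the SDK: `resolve_api_key` is a hypothetical helper introduced here for illustration. It prefers a key passed explicitly and otherwise falls back to the `GOOGLE_API_KEY` environment variable.

```python
import os


def resolve_api_key(explicit_key=None, env_var="GOOGLE_API_KEY"):
    """Return the explicit key if given, else the value of the environment variable.

    Hypothetical helper for illustration only; it is not provided by the SDK.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {env_var}"
        )
    return key
```

The result can then be handed to `genai.configure(api_key=resolve_api_key())`, or to `genai.configure(api_key=resolve_api_key(api_key))` when the key comes from the Colab secrets manager.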

In [None]:
# Used to securely store your API key
from google.colab import userdata

api_key = userdata.get('GOOGLE_API_KEY')

In [None]:
import google.generativeai as genai
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-pro')

In [None]:
def split_file(file_path, num_chunks=10):
    """Split a text file into up to `num_chunks` chunks of similar sentence count."""
    # Read as UTF-8 so non-ASCII text is decoded correctly
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Naive sentence split on '.'; this also splits on decimals and abbreviations
    sentences = [s.strip() for s in content.split('.') if s.strip()]

    # Sentences per chunk; at least 1 so short files do not produce empty chunks
    chunk_size = max(1, len(sentences) // num_chunks)

    chunks = []
    start = 0
    for _ in range(num_chunks - 1):
        chunk = sentences[start:start + chunk_size]
        if not chunk:
            break
        chunks.append('. '.join(chunk) + '.')
        start += chunk_size

    # Add the remaining sentences to the last chunk
    last_chunk = sentences[start:]
    if last_chunk:
        chunks.append('. '.join(last_chunk) + '.')

    return chunks

# Example usage
file_path = '/content/vinallama.txt'
chunks = split_file(file_path)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:")
    print(chunk)
    print()

Chunk 1:
In this technical report, we present VinaLLaMA, an open-source, state-of-the-art (SOTA) Large Language Model for the Vietnamese language, built upon LLaMA-2 with an additional 800 billion trained tokens. VinaLLaMA not only demonstrates fluency in Vietnamese but also exhibits a profound understanding of Vietnamese culture, making it a truly indigenous model. VinaLLaMA-7B-chat, trained on 1-million high quality synthetic samples, achieves SOTA results on key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese, marking a significant advancement in the Vietnamese AI landscape and offering a versatile resource for various applications. 1 Introduction 
The surge in Large Language Models (LLMs) such as ChatGPT and GPT-4 has significantly advanced the field of artificial intelligence (AI), particularly in language processing. In 2023, Vietnam’s AI sector witnessed a notable development with the introduction of several Vietnamese-centric LLMs, including BLOOMZ’s Vietc

Summarization (Tóm Tắt) is one of the hardest tasks for LLMs. It requires the model to handle long context while still retrieving well from the text. For this reason, with the vast majority of LLMs we use prompt chaining to tackle the problem.
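
As a minimal sketch of what prompt chaining means here: summarize each chunk separately, then summarize the concatenated partial summaries. The function `summarize_with_chaining` and the stand-in `fake_generate` are hypothetical names used only for illustration, and the English prompt templates are placeholders. The model call is passed in as a plain callable so the sketch runs without network access.

```python
from typing import Callable, List


def summarize_with_chaining(chunks: List[str],
                            generate: Callable[[str], str],
                            chunk_prompt: str,
                            final_prompt: str) -> str:
    """Summarize each chunk, then summarize the labeled partial summaries."""
    # Step 1: one model call per chunk
    partial = [generate(chunk_prompt.format(context=c)) for c in chunks]
    # Step 2: label the partial summaries and ask for one overall summary
    combined = "".join(f"Context {i}: {s}\n" for i, s in enumerate(partial, 1))
    return generate(final_prompt.format(context=combined))


# Stand-in for a real model call (an assumption, so the sketch runs offline)
calls = []


def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"summary#{len(calls)}"


result = summarize_with_chaining(
    ["first part", "second part"],
    fake_generate,
    chunk_prompt="Summarize: {context}",
    final_prompt="Combine these summaries:\n{context}",
)
print(result)  # summary#3 (two chunk calls, then one final call)
```

With a real model, `generate` could be `lambda p: model.generate_content(p).text`. The chain costs one call per chunk plus one final call, which trades latency for shorter inputs per request.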

In [None]:
# English gloss: "Summarize the text: {context}. I only need the summary, nothing else."
question = """
Hãy tóm tắt văn bản: {context}.
Tôi chỉ cần nội dung tóm tắt, không cần thêm bất kì thứ gì khác.
"""

In [None]:
def build_final_context(chunks):
    """Label each chunk as 'Context N: ...' and concatenate them, one per line."""
    context = ""
    for index, chunk in enumerate(chunks, 1):
        context += f"Context {index}: {chunk}\n"
    return context

In [None]:
context = build_final_context(chunks)
context

In [None]:
%%time
response = model.generate_content(question.format(context=context))

CPU times: user 137 ms, sys: 15.6 ms, total: 153 ms
Wall time: 7.85 s


In [None]:
response.text

'VinaLLaMA là một Mô hình Ngôn ngữ Lớn (LLM) tiên tiến, nguồn mở, dành riêng cho tiếng Việt, được xây dựng dựa trên LLaMA-2 với 800 tỷ token được đào tạo thêm. VinaLLaMA không chỉ thể hiện sự lưu loát tiếng Việt mà còn hiểu sâu sắc về văn hóa Việt Nam, biến nó trở thành một mô hình thực sự bản địa. VinaLLaMA-7B-chat, được đào tạo trên 1 triệu mẫu tổng hợp chất lượng cao, đạt kết quả SOTA trên các điểm chuẩn chính, bao gồm VLSP, VMLU và Vicuna Benchmark Vietnamese, đánh dấu một bước tiến đáng kể trong bối cảnh AI của tiếng Việt và cung cấp một nguồn đa năng cho nhiều ứng dụng khác nhau.'

In [None]:
# English gloss: "Below is content extracted from a document; your task is to
# summarize it. [context] Give a summary of the text above. I only need the
# summary, and nothing else. Summary:"
summarize_prompt = """
Sau đây là một nội dung được trích xuất của một văn bản, nhiệm vụ của bạn là hãy tóm tắt nó
---CONTEXT---
{context}
---END CONTEXT---
Hãy đưa ra tóm tắt của văn bản trên. Tôi chỉ cần phần tóm tắt, và không cần thêm bất kì thứ gì khác.
Tóm Tắt:
"""

# English gloss: "Below are summaries of the individual small sections of a large
# document. [context] Based on the sections above, give an overall summary of all
# of them. I only need the overall summary, nothing else. Overall summary:"
final_ans_prompt = """
Dưới đây là các đoạn tóm tắt của từng đoạn nhỏ của một văn bản lớn.
---CONTEXT---
{context}
---END CONTEXT---
Dựa trên các đoạn trên, hãy đưa ra tóm tắt tổng của tất cả các đoạn trên. Tôi chỉ cần tóm tắt tổng, không cần thêm bất kì thứ gì khác.
Tóm tắt tổng:
"""

In [None]:
# Summarize each chunk on its own; the results are combined in the next cell
summarize_chunks = []
for chunk in chunks:
    message = summarize_prompt.format(context=chunk)
    response = model.generate_content(message)
    summarize_chunks.append(response.text)
summarize_chunks

In [None]:
context_new = build_final_context(summarize_chunks)
message = final_ans_prompt.format(context=context_new)
response = model.generate_content(message)
response.text


'VinaLLaMA-7B là một mô hình ngôn ngữ lớn tinh chỉnh hướng dẫn tiên tiến cho tiếng Việt, đạt hiệu suất vượt trội trong các điểm chuẩn về khả năng hiểu và tạo văn bản, nhờ dữ liệu tổng hợp được thiết kế cẩn thận. Các chiến lược tinh chỉnh được triển khai tốt với dữ liệu tổng hợp có tiềm năng nâng cao khả năng của các mô hình ngôn ngữ lớn, như được minh họa bởi sự cân bằng tối ưu giữa kích thước, tốc độ và hiệu suất của VinaLLaMA-7B-chat. Các bộ công cụ đánh giá như VMLU và Vicuna Benchmark Vietnamese cung cấp các cách đánh giá sáng tạo và toàn diện về hiệu suất của các mô hình này trên các nhiệm vụ đa dạng.'

In [None]:
# Sample retrieved passages: one about Bitcoin (irrelevant) and three about phở
samples_retrieval = ['Hôm nay, giá Bitcoin đã đạt đến một đỉnh mới.',
           'Phở Việt Nam rất ngon, Hà Nội nổi tiếng với món Phở Bò.',
           'Phở là món ăn Việt Nam xuất hiện ở rất nhiều nơi trên thế giới.',
           'Một thị trường khó như Nhật cũng rất ấn tượng với món Phở Bò Việt Nam.']

context = build_final_context(samples_retrieval)
print(context)

Context 1: Hôm nay, giá Bitcoin đã đạt đến một đỉnh mới.
Context 2: Phở Việt Nam rất ngon, Hà Nội nổi tiếng với món Phở Bò.
Context 3: Phở là món ăn Việt Nam xuất hiện ở rất nhiều nơi trên thế giới.
Context 4: Một thị trường khó như Nhật cũng rất ấn tượng với món Phở Bò Việt Nam.



In [None]:
# English gloss of the question: "Introduce Vietnamese beef phở."
sample_question = "Hãy giới thiệu về món Phở Bò Việt Nam"
# English gloss of the prompt: "Below is content retrieved from the database:
# [context] Your task is to rewrite the content above based on the question
# below, removing irrelevant content to avoid noise. I only need your rewrite;
# please do not add any other information. Question: {question}"
refinement_prompt = """
Dưới đây là các nội dung được trích xuất từ Database:
---CONTEXT---
{context}
---END CONTEXT---
Nhiệm vụ của bạn là hãy viết lại những nội dung trên dựa trên câu hỏi dưới đây, hãy loại bỏ các nội dung không liên quan để tránh gây nhiễu loạn. Tôi chỉ cần phần viết lại của bạn, vui lòng không thêm bất kì thông tin nào khác.
Câu hỏi: {question}
"""

In [None]:
sample_message = refinement_prompt.format(context=context, question=sample_question)
sample_response = model.generate_content(sample_message)
sample_response.text

'* Phở là món ăn Việt Nam xuất hiện ở rất nhiều nơi trên thế giới.\n* Thậm chí thị trường khó tính như Nhật Bản cũng rất thích món Phở Bò Việt Nam.'