<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/cookbooks/upstage/Solar-Full-Stack LLM-101/05_3_OracleDB.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Retrieval Augmented Generation (RAG) Baseline
## Overview  
This notebook walks through the baseline code.
The goal of this project is to provide students with hands-on experience in handling and enhancing Large Language Models (LLMs) provided by [**Upstage**](https://www.upstage.ai) (Solar).

You may use any engineering method to improve benchmark performance, except direct training (fine-tuning).

*Collecting data directly related to the test set (e.g., using the MMLU-Pro dataset or EWHA.pdf for the knowledge base (KB)) is considered cheating.*

In [None]:
!pip install openai



In [None]:
# @title set API key
# First, register your API key as a Colab secret.
from pprint import pprint
import os

import warnings

warnings.filterwarnings("ignore")

from IPython import get_ipython

upstage_api_key_env_name = "upstage_api_key"


def load_env():
    if "google.colab" in str(get_ipython()):
        # Running in Google Colab
        from google.colab import userdata

        upstage_api_key = userdata.get(upstage_api_key_env_name)
        # print(upstage_api_key)
        return os.environ.setdefault(upstage_api_key_env_name, upstage_api_key)
    else:
        # Running in local Jupyter Notebook
        from dotenv import load_dotenv

        load_dotenv()
        return os.environ.get(upstage_api_key_env_name)


UPSTAGE_API_KEY = load_env() # Setting API Key

In [None]:
# pip install openai

from openai import OpenAI # openai==1.52.2

client = OpenAI(
    api_key=UPSTAGE_API_KEY,
    base_url="https://api.upstage.ai/v1"
)

stream = client.chat.completions.create(
    model="solar-pro2",
    messages=[
        {
            "role": "user",
            "content": "Hi, how are you?"
        }
    ],
    stream=True,  # stream the response incrementally, as ChatGPT does
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

# Use with stream=False
# print(stream.choices[0].message.content)

Hello! I'm just a large language model, so I don't have feelings, but I'm here and ready to help you. How can I assist you today? 😊  

Is there something specific you'd like to talk about or ask? I'm here to provide information, answer questions, or even have a conversation!  

(And if you're just saying hi, that's totally fine too! 🌟)
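
When the streamed text is also needed as a single string (for logging or post-processing), the deltas can be accumulated while they are printed. The following is a minimal sketch: `collect_stream` and `_fake_chunk` are hypothetical helpers, not part of the `openai` package, and the stand-in chunks only mimic the `choices[0].delta.content` attribute path used in the loop above.

```python
from types import SimpleNamespace


def collect_stream(stream, echo=True):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # defensively skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta is not None:
            if echo:
                print(delta, end="")
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    # Stand-in with the same attribute path as a real chunk (assumption for the demo).
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


fake_stream = [_fake_chunk("Hello"), _fake_chunk(None), _fake_chunk(", world")]
full_text = collect_stream(fake_stream, echo=False)
```

With a real client, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `fake_stream`.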

In [None]:
response = client.chat.completions.create(
    model="solar-pro2",
    messages=[
        {
            "role": "user",
            "content": "Hi, how are you?"
        }
    ],
    stream=False,  # return the complete response at once
)
print(response.choices[0].message.content)

I'm just a large language model, so I don't have feelings, but thanks for asking! I'm here and ready to help you. How can I assist you today? 😊  

Feel free to ask any questions or let me know if you need information, creative ideas, or help with something specific!


# Baseline

In [None]:
!pip3 install -qU python-dotenv PyPDF2 langchain langchain-community langchain-core langchain-text-splitters langchain_upstage oracledb

In [None]:
# Additional Contents : https://wikidocs.net/253106

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# set parameters

api_key = UPSTAGE_API_KEY
data_path = "./drive/MyDrive/Upstage_Project" # folder path containing ewha.pdf and testset.csv

In [None]:
from langchain_upstage import UpstageDocumentParseLoader
import os

UPSTAGE_API_KEY = api_key

# 1. Load: parse the PDF into plain text
loader = UpstageDocumentParseLoader(
    api_key=UPSTAGE_API_KEY,
    file_path=os.path.join(data_path, "ewha.pdf"),
    output_format="text",
)

docs = loader.load()  # or loader.lazy_load()

In [None]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))


Splits: 54
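
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification and an assumption made for illustration: `RecursiveCharacterTextSplitter` splits on separators first and only then merges pieces up to the size limit, so its chunk boundaries will differ. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last window reached the end
            break
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 100-character overlap plays in the cell above: a sentence that straddles a boundary still appears whole in at least one chunk more often.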


In [None]:
print(splits[0])

page_content='이화여자대학교 학칙1946. 8. 15. 제정
2017. 8. 16. 개정제1장 총칙제1조(목적) 본교는 대한민국의 교육이념과 기독교정신을 바탕으로 하여 학술의 깊은 이론과
그 광범하고 정밀한 응용방법을 교수․연구하며, 인격을 도야하여 국가와 인류사회의 발전에
공헌할 수 있는 지도여성을 양성함을 목적으로 한다.제2조(명칭) 본교는 이화여자대학교라 부른다.
제3조(위치) 본교는 서울특별시 서대문구 이화여대길 52에 둔다. (개정 2013.2.25.)제2장 편제제4조(대학 및 대학원) ① 본교에는 다음 각 호의 대학을 둔다.1. 인문과학대학, 사회과학대학, 자연과학대학, 엘텍공과대학, 음악대학, 조형예술대학, 사범
대학, 경영대학, 신산업융합대학, 의과대학, 간호대학, 약학대학, 스크랜튼대학(이하 “각
대학”이라 한다) (개정 2016.6.16.)
2. 호크마(HOKMA)교양대학
② 

In [None]:
from langchain_upstage import UpstageEmbeddings

upstage_embeddings = UpstageEmbeddings(api_key=UPSTAGE_API_KEY, model="solar-embedding-1-large")

embedded_query = upstage_embeddings.embed_query(splits[0].page_content)
print(embedded_query)

[0.0035944078117609024, -0.014428798109292984, -0.014659044332802296, 0.014492754824459553, 0.0025023347698152065, -0.017063844949007034, -0.009216266684234142, -0.0012735524214804173, -0.0011456375941634178, 0.00723358616232872, -0.014211342670023441, 0.004720058757811785, 0.006271026562899351, -0.005465162917971611, 0.00855750497430563, -0.003635980188846588, -0.023625876754522324, 0.00567302480340004, -0.004665695130825043, 0.002415992086753249, 0.012043185532093048, -0.004975888412445784, -0.003664761083200574, -0.014505546540021896, -0.00757256057113409, 0.009619198739528656, -0.012976964004337788, -0.010674496181309223, 0.0017156582325696945, -0.0014158576959744096, -0.0024975379928946495, -0.024470115080475807, -0.024367783218622208, -0.018726738169789314, -0.016091691330075264, -0.013379896059632301, 0.017038261517882347, 0.03985827416181564, 0.033002037554979324, -0.0091011431068182, -0.01711500994861126, 0.0027181911282241344, -0.006165497004985809, 0.009855841286480427, -0.0
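
Embeddings like the one printed above become useful for RAG once a question vector is compared against every chunk vector. The following is a minimal sketch of top-k retrieval by cosine similarity in plain Python. The toy 2-dimensional vectors are made up for the demo, and `cosine_similarity` and `top_k` are hypothetical helpers; in this notebook the vectors would presumably come from `upstage_embeddings.embed_query(...)` for the question and the embeddings object's document-embedding method for the chunks.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors standing in for real embeddings (assumption for the demo).
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]
best = top_k(toy_query, toy_chunks, k=2)
print(best)  # [0, 2]
```

The returned indices could then select entries of `splits` to use as context, instead of a single fixed chunk.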

In [None]:
# read the test set CSV file (columns: prompts, answers)

import pandas as pd

def read_data(data_path):
    data = pd.read_csv(data_path)
    prompts = data['prompts']
    answers = data['answers']
    # returns two pandas Series: prompts and answers
    return prompts, answers

In [None]:
prompts, answers = read_data(os.path.join(data_path, 'testset.csv'))

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_upstage import ChatUpstage


llm = ChatUpstage(api_key = UPSTAGE_API_KEY,
                  model="solar-pro2",
                  )

prompt_template = PromptTemplate.from_template(
    """
    Please provide most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {context}
    """
)
chain = prompt_template | llm

responses = []

for prompt in prompts[:10]:
    response = chain.invoke({"question": prompt, "context": splits[0].page_content})
    responses.append(response.content)

In [None]:
for i in responses[:10]:
    print(i)
    print('-'*10)

The information is not present in the context.

The provided context from "이화여자대학교 학칙" (Ewha Womans University Regulations) does not contain any specific details regarding the deadline for submitting a leave of absence request (휴학 신청) after the semester begins. The text only includes general provisions about the university's purpose, structure, and academic organizations, but no procedural rules for leaves of absence. Therefore, the correct answer cannot be determined from the given material.
----------
The information is not present in the context. 

The provided context from "이화여자대학교 학칙" does not contain the specific article (제28조제4호) or any clause mentioning the values of "a" and "b" related to 재입학 (readmission). Therefore, the answer cannot be determined from the given text. 

Correct choice: **(D) A,B,C 중 답 없음**.
----------
The information is not present in the context.  

(Explanation: The provided context includes
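
The responses above are free-form, and every question receives the same first chunk as context. One possible direction, sketched below under stated assumptions, is to build the prompt from several chunks and to ask for a fixed final line in the `[ANSWER]: (X)` form that the extraction step later in this notebook looks for. `ANSWER_PROMPT`, `build_prompt`, and the `max_chars` limit are hypothetical names and values, not part of the baseline, and the sketch uses plain string formatting rather than a LangChain template.

```python
ANSWER_PROMPT = (
    "Answer the multiple-choice question using only the context.\n"
    "Finish with one line in exactly this form: [ANSWER]: (X)\n"
    "where X is the letter of your choice.\n"
    "---\n"
    "Question: {question}\n"
    "---\n"
    "Context: {context}\n"
)


def build_prompt(question, context_chunks, max_chars=4000):
    """Join the chunks, trim to max_chars, and fill the template."""
    context = "\n\n".join(context_chunks)[:max_chars]
    return ANSWER_PROMPT.format(question=question, context=context)


demo_prompt = build_prompt(
    "Which option is correct? (A) first (B) second",
    ["chunk one text", "chunk two text"],
)
print(demo_prompt)
```

Whether this wording actually improves the benchmark score is untested here; it only illustrates how the prompt and the extraction pattern can be kept in agreement.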

In [None]:
# function to extract an answer from a response

import re

def extract_answer(response):
    """
    Extracts the answer letter from the response using a regular expression.
    Expected format: "[ANSWER]: (A) convolutional networks"

    If the response contains no answer in that format, falls back to
    extract_again(), which returns None when no candidate letter is found.
    """
    pattern = r"\[ANSWER\]:\s*\((A|B|C|D|E)\)"
    match = re.search(pattern, response)

    if match:
        return match.group(1) # Extract the letter inside parentheses (e.g., A)
    else:
        return extract_again(response)

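def extract_again(response):
    # Fallback: a standalone capital letter A-J that is not followed by another
    # standalone A-J on the same line ("." does not match newlines here).
    pattern = r"\b[A-J]\b(?!.*\b[A-J]\b)"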
    match = re.search(pattern, response)
    if match:
        return match.group(0)
    else:
        return None
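
A few sample strings make the behavior of the two patterns concrete. The sketch below copies both regular expressions from the cell above into a single hypothetical function, `extract`, so that it runs on its own; the sample responses are invented for illustration.

```python
import re

# Same two patterns as extract_answer() and extract_again() above.
STRICT = r"\[ANSWER\]:\s*\((A|B|C|D|E)\)"
FALLBACK = r"\b[A-J]\b(?!.*\b[A-J]\b)"


def extract(response):
    """Try the strict pattern first, then the fallback; None if neither matches."""
    match = re.search(STRICT, response)
    if match:
        return match.group(1)
    match = re.search(FALLBACK, response)
    return match.group(0) if match else None


cases = [
    "[ANSWER]: (A) convolutional networks",  # strict format
    "The correct choice is (C).",            # fallback finds the letter
    "Answer: (B). I hope this helps.",       # fallback picks the pronoun "I"
    "the information is not present",        # no standalone capital letter
]
results = [extract(text) for text in cases]
for text, letter in zip(cases, results):
    print(f"{letter!r:>6}  <-  {text}")
```

The third case shows a weakness of the fallback: because `I` lies in the range A-J, an ordinary English sentence after the answer can override the intended letter. Responses that follow the strict `[ANSWER]: (X)` format avoid this.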

In [None]:
# print accuracy

cnt = 0

for answer, response in zip(answers[:10], responses[:10]):
    print("-"*10)
    generated_answer = extract_answer(response)
    # print(response)
    # check
    if generated_answer:
        print(f"generated answer: {generated_answer}, answer: {answer}")
    else:
        print("extraction fail")


    if generated_answer is None:
        continue
    if generated_answer in answer:
        cnt += 1

print()
print(f"acc: {(cnt/10)*100}%")

----------
extraction fail
----------
extraction fail
----------
extraction fail
----------
extraction fail
----------
extraction fail
----------
extraction fail
----------
extraction fail
----------
extraction fail
----------
generated answer: C, answer: (C)
----------
generated answer: B, answer: (B)

acc: 20.0%
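
The accuracy cell above hard-codes the divisor 10. The following sketch wraps the same rule (a prediction counts as correct when the extracted letter appears in the gold answer string) in a hypothetical `accuracy` function that divides by the actual number of items. The toy inputs only mirror the shape of the run above, eight failed extractions and two hits; the first eight gold labels are placeholders, not values from the test set.

```python
def accuracy(predictions, gold_answers):
    """Percentage of items whose predicted letter appears in the gold answer string."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    if not predictions:
        return 0.0
    correct = sum(
        1
        for pred, gold in zip(predictions, gold_answers)
        if pred is not None and pred in gold  # a failed extraction counts as wrong
    )
    return 100.0 * correct / len(predictions)


# Toy data: None marks a failed extraction; "(A)" entries are placeholders.
toy_predictions = [None] * 8 + ["C", "B"]
toy_gold = ["(A)"] * 8 + ["(C)", "(B)"]
acc = accuracy(toy_predictions, toy_gold)
print(f"acc: {acc}%")  # acc: 20.0%
```

Counting failed extractions as wrong, rather than skipping them, keeps the score comparable across runs with different extraction success rates.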
