# LiOn
LiOn is a model that connects to a variety of data sources to extend its expertise in natural language processing.

## LiOnConnect
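LiOnConnect combines the LiOn model's natural language processing capabilities with connections to a variety of databases, aiming for high performance and efficiency on natural language tasks.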


### How
- [alpaca-lora.ipynb](https://colab.research.google.com/drive/1eWAmesrW99p7e1nah5bipn0zikMb8XYC#scrollTo=upOB2AQJSW9-): This notebook contains minimal code for **running [Alpaca-LoRA](https://github.com/tloen/alpaca-lora/)** for demonstration purposes.
- [Make ChatGPT-replica](https://colab.research.google.com/drive/1UcLLV4mLtn8vxGk5U3TxiLNbVBealy16?usp=sharing): Covers the principles behind ChatGPT, namely **GPT fine-tuning, reinforcement learning (PPO), RLHF, and building a ChatGPT dataset**, with hands-on code.
- [LangChain](https://langchain.readthedocs.io/en/latest/index.html): Use this library to **query a database in natural language**.
- [r-build/sf-restaurants-sql](https://github.com/r-build/sf-restaurants-sql): San Francisco food health inspection **data**.
- [OpenAI GPT-3 and LangChain](https://blog.devgenius.io/query-database-using-natural-language-openai-gpt-3-d2403636527a): Uses OpenAI GPT-3 and LangChain to **query a database in natural language**.
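
The natural-language database querying described in the links above follows a simple loop: turn a question into SQL, run it, and phrase the result. Below is a minimal sketch of that loop using Python's built-in `sqlite3`. The in-memory `Items` table, its rows, and the `question_to_sql` function are assumptions for illustration; a real setup would call a language model where the stand-in returns a fixed query.

```python
import sqlite3


def question_to_sql(question: str) -> str:
    """Hypothetical stand-in for the language model: always returns one fixed query."""
    return "SELECT Name, Price2 FROM Items ORDER BY Price2 DESC LIMIT 3"


def answer(conn: sqlite3.Connection, question: str) -> list:
    sql = question_to_sql(question)      # step 1: question -> SQL
    rows = conn.execute(sql).fetchall()  # step 2: run the SQL
    return rows                          # step 3: a model would phrase these rows as prose


# In-memory demo data, made up for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (Name TEXT, Price2 REAL)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?)",
    [("A", 10.0), ("B", 30.0), ("C", 20.0), ("D", 5.0)],
)

top3 = answer(conn, "What are the top 3 most expensive products?")
print(top3)
```

Keeping the model call behind a single function makes it easy to swap the stand-in for a hosted or local model later.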

### Related issues
1. [Derivative works of this software may be created for non-commercial research purposes only.](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform)

# Talk to Alpaca-LoRA

This notebook contains minimal code for running [Alpaca-LoRA](https://github.com/tloen/alpaca-lora/) for demonstration purposes. Please check the repo for more details.

## Use Model

In [1]:
!pip install bitsandbytes
!pip install -q datasets loralib sentencepiece
!pip install -q git+https://github.com/zphang/transformers@c3dc391
!pip install -q git+https://github.com/huggingface/peft.git


In [2]:
from peft import PeftModel
from transformers import LLaMATokenizer, LLaMAForCausalLM, GenerationConfig

tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

In [3]:
def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

In [4]:
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    num_beams=4,
)

def evaluate(instruction, input=None):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256
    )
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        print("Response:", output.split("### Response:")[1].strip())
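
`evaluate` indexes `output.split("### Response:")[1]`, which raises `IndexError` whenever the decoded text lacks the marker. A small defensive variant is sketched below. `extract_response` is a hypothetical helper, not part of Alpaca-LoRA, and it works on plain strings so it can be tried without loading the model.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


# Made-up decoded strings for illustration
with_marker = extract_response("### Instruction:\nSay hi\n\n### Response:\n Hello! ")
without_marker = extract_response("  plain text ")
print(repr(with_marker), repr(without_marker))
```

Falling back to the full text is one possible policy; returning an empty string or raising a clearer error would be equally reasonable, depending on the caller.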

### Run

In [5]:
# while 1:
#     evaluate(input("Instruction: "))

### Download

In [6]:
# from google.colab import drive
# drive.mount('/content/drive')
# model.save_pretrained('drive/MyDrive/LiOn/alpaca-lora-7b.pt')

# MVP
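Connects to a company database (DB) and **accesses internal information in real time** to carry out the required tasks.
- Example: corporate call center
- In-house accountants and lawyers
- Personal assistants and doctors
- Global psychological counselors and professors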

## Default Preferences

In [7]:
# %cd drive/MyDrive/LiOn/
# !ls

### Installation of required libraries

In [8]:
!pip install openai
!pip install langchain 
!pip install pymssql 


### Setting up what is needed

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
!ls /content/drive/MyDrive/LiOn/archive

olist_customers_dataset.csv	  olist_products_dataset.csv
olist_geolocation_dataset.csv	  olist_sellers_dataset.csv
olist_order_items_dataset.csv	  olist_sellers_dataset.csv.db
olist_order_payments_dataset.csv  product_category_name_translation.csv
olist_order_reviews_dataset.csv   Untitled.ipynb
olist_orders_dataset.csv


In [11]:
!apt-get install sqlite3

Reading package lists... Done
Building dependency tree       
Reading state information... Done
sqlite3 is already the newest version (3.31.1-4ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.


In [26]:
import os
import openai
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain, PromptTemplate

# https://drive.google.com/drive/folders/1YL3AD4oXy3mGTCB6RiXNCac8O7AYiPec?usp=sharing
db = SQLDatabase.from_uri("sqlite:////content/drive/MyDrive/LiOn/db/exported_data.db")


# # Database connection info
# connection_string = "postgresql://username:password@hostname:port/database"

# # Create the SQLDatabase object
# db = SQLDatabase.from_uri(connection_string)
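
The four slashes in the `sqlite:////content/...` URI are easy to get wrong. SQLAlchemy-style SQLite URLs are `sqlite:///` followed by the file path, so an absolute path that itself starts with `/` yields four slashes in total. A minimal sketch, assuming POSIX paths and a hypothetical helper named `sqlite_uri`:

```python
from pathlib import PurePosixPath


def sqlite_uri(path: str) -> str:
    """Build a SQLAlchemy-style SQLite URL from a POSIX file path."""
    # "sqlite:///" + "relative.db"  -> three slashes
    # "sqlite:///" + "/abs/path.db" -> four slashes in total
    return "sqlite:///" + str(PurePosixPath(path))


uri = sqlite_uri("/content/drive/MyDrive/LiOn/db/exported_data.db")
relative_uri = sqlite_uri("exported_data.db")
print(uri)
print(relative_uri)
```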

In [27]:
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    num_beams=4,
)

def chat(instruction, input=None):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256
    )
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        print("Response:", output.split("### Response:")[1].strip())


_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer.
Descriptions of columns are shown below.
--
Name: product name
ModelName: model name
MakerName: manufacturer name
Price2: supply price
--

Use the following format:

Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

Only use the following tables:

"Items"

Question: {input}"""

PROMPT = PromptTemplate(input_variables=["input"], template=_DEFAULT_TEMPLATE)
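
The template asks the model to answer in a `Question / SQLQuery / SQLResult / Answer` layout. As a rough illustration of how text in that layout could be pulled apart, the sketch below uses a hypothetical `parse_fields` helper and a made-up completion. It does not reproduce how LangChain itself processes the output.

```python
import re

FIELDS = ("Question", "SQLQuery", "SQLResult", "Answer")


def parse_fields(completion: str) -> dict:
    """Split a completion in the template's layout into a {field: value} dict."""
    pattern = re.compile(r"^(%s):\s*" % "|".join(FIELDS), re.MULTILINE)
    matches = list(pattern.finditer(completion))
    parsed = {}
    for i, m in enumerate(matches):
        # A field's value runs until the next field label or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(completion)
        parsed[m.group(1)] = completion[m.end():end].strip().strip('"')
    return parsed


# Made-up completion for illustration only
sample = '''SQLQuery: "SELECT Name, Price2 FROM Items ORDER BY Price2 DESC LIMIT 3"
SQLResult: [('B', 30.0), ('C', 20.0), ('A', 10.0)]
Answer: "B, C and A are the three most expensive items."'''

fields = parse_fields(sample)
print(fields["SQLQuery"])
print(fields["Answer"])
```

A line-anchored pattern keeps a label that appears in the middle of an answer from being treated as a new field.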

## Running LangChain
Creating `CustomSQLDatabaseChain` and `CustomOpenAI` makes it possible to use a model that previously did not work.

In [28]:
from langchain.chains import SQLDatabaseChain

class CustomSQLDatabaseChain(SQLDatabaseChain):
    def __init__(self, *args, **kwargs):
        llm = kwargs.pop("llm", None)
        super().__init__(llm=llm, *args, **kwargs)

    def get_response(self, text):
        # Render the chain's prompt template with the user's question.
        prompt = self.prompt.format(input=text)
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].cuda()
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256
        )
        # The SQL prompt template has no "### Response:" marker, so splitting on
        # it would raise IndexError. Decode only the tokens after the prompt.
        new_tokens = generation_output.sequences[0][input_ids.shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        return {"text": response}

In [29]:
from langchain import OpenAI
from typing import Any
from pydantic import Field

class CustomOpenAI(OpenAI):
    model: Any = Field(default=None, description="The model to use for generating responses")
    tokenizer: Any = Field(default=None, description="The tokenizer to use with the model")

    def __call__(self, text, *args, **kwargs):
        # Tokenize the already-formatted prompt and move it to the GPU.
        inputs = self.tokenizer(text, return_tensors="pt")
        input_ids = inputs["input_ids"].cuda()
        generation_output = self.model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256
        )
        # The incoming prompt may not contain a "### Response:" marker, so
        # decode only the newly generated tokens instead of splitting on it.
        new_tokens = generation_output.sequences[0][input_ids.shape[1]:]
        response = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        return {"text": response}

In [30]:
### Prompts
llm = CustomOpenAI(model=model, tokenizer=tokenizer)

db_chain = CustomSQLDatabaseChain(
    llm=llm,
    database=db, prompt=PROMPT, verbose=True, return_intermediate_steps=True
)

ValidationError: ignored

In [31]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LLaMAForCausalLM(
      (model): LLaMAModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=31999)
        (layers): ModuleList(
          (0-31): 32 x LLaMADecoderLayer(
            (self_attn): LLaMAAttention(
              (q_proj): Linear8bitLt(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): Dropout(p=0.05, inplace=False)
                (lora_A): Linear(in_features=4096, out_features=8, bias=False)
                (lora_B): Linear(in_features=8, out_features=4096, bias=False)
              )
              (k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear8bitLt(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): Dropout(p=0.05, inplace=False)
                (lora_A): Linear(in_features=4096, out_features=8, bias=False)
                (lora_B): Linear(in_features=

### Prompts

In [None]:
text1 = "What are the top 3 most expensive products?"
text2 = "Can you give me a list of the 3 priciest items we have for sale?"
text3 = "Which 3 products are the most expensive ones we offer?"

# Ask the same question in three phrasings and keep each result.
results = [db_chain(text) for text in (text1, text2, text3)]

# for _ in range(15):
#     text = input()
#     result = db_chain(text)

### ChatGPT API

In [None]:
# llm = CustomOpenAI(temperature=0)
# db_chain = CustomSQLDatabaseChain(
#     llm=llm,
#     database=db, prompt=PROMPT, verbose=True, return_intermediate_steps=True
# )
# result = db_chain("Tell me the 3 most expensive products")