<a href="https://colab.research.google.com/github/jasonheesanglee/LLM_Study/blob/main/Jason_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
import os
from google.colab import drive
drive.mount('/content/drive')
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OpenAI_API')
os.environ['HF_AUTH_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['GOOGLE_API_KEY'] = userdata.get('GoogleDevAPI')

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
import locale
def getpreferredencoding(do_setlocale=True):
    return 'UTF-8'

locale.getpreferredencoding = getpreferredencoding

In [31]:
!pip install -qqq fastapi uvicorn tiktoken kaleido cohere openai torch python-multipart langchain transformers sentence_transformers accelerate chromadb pypdf
!pip install -qqq bitsandbytes huggingface
!pip install -qqq streamlit python-dotenv
!pip install -qqq unstructured > /dev/null

In [32]:
!nvidia-smi

Tue Feb 20 08:36:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              39W / 300W |   6656MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [33]:
import os
import re
import glob
import torch
import tiktoken
import chromadb
import streamlit as st
from tqdm.auto import tqdm
from accelerate.utils import set_seed
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from accelerate import Accelerator, notebook_launcher
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


In [34]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [35]:
# candidates

## abacusai/Smaug-72B-v0.1
## beomi/OPEN-SOLAR-KO-10.7B
## beomi/llama-2-ko-7b
## daekeun-ml/phi-2-ko-v0.1
## maywell/kiqu-70b
## OrionStarAI/Orion-14B-Chat
## OrionStarAI/Orion-14B-LongChat
## OrionStarAI/Orion-14B-Chat-Int4
## BlinkDL/rwkv-5-world
## BlinkDL/rwkv-6-world
## ai-forever/mGPT-13B
## OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k
## davidkim205/komt-mistral-7b-v1
## beomi/SOLAR-KOEN-10.8B
## ai-forever/mGPT
## KRAFTON/KORani-v3-13B

model_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map='auto')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [36]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from transformers import pipeline
from langchain.chains import LLMChain

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.2,
    return_full_text=True,
    max_new_tokens=500,
)
prompt_template = '''
### [INST]
Instruction: Pretend if you are Jason, You will be Jason to introduce about yourself. You need to mimic my tone of voice as well.

Here are some background information of Jason to help:
Korean Name of Jason : 이희상 (Heesang Lee)
Birth Date : 1996 - 11 - 28
Contact Detail : Jason.Heesang.LEE96@gmail.com
Highschool : Beijing Royal School, Beijing, China
University : Glion Institute of Higher Education, Montreux & Bulle, Switzerland
Military : Republic of Korea Air Force (Apr 2018 ~ Mar 2020)
First Job : People & Culture Coordinator, Four Seasons Hotel Seoul, Seoul, South Korea (Apr 2021 ~ Sep 2021)
Second Job : [Associate, Atheneum Partners GmbH, Seoul, South Korea (Sep 2021 ~ May 2022),
Senior Associate, Atheneum Partners GmbH, Seoul, South Korea (Jun 2022 ~ Jul 2022),
Team Lead, Atheneum Partners GmbH, Seoul, South Korea (Aug 2022 ~ Feb 2023)]
Career Break : Year-Dream School (Government-Funded AI Bootcamp), Seoul, South Korea (Mar 2023 ~ Dec 2023)

Documents Written by Jason:
{context}

Additional Information about Jason's schools, jobs.
{additional_information}

### Question
{question}
[/INST]
'''

llama2 = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt = PromptTemplate(
    input_variables=["context", "additional_information", "question"],
    template=prompt_template,
)
llm_chain = LLMChain(llm=llama2, prompt=prompt)

In [37]:
from langchain.vectorstores import FAISS
from langchain.schema.runnable import RunnablePassthrough


In [40]:
# @title
url_list = ['https://en.wikipedia.org/wiki/Beijing_Royal_School',
            'https://www.brs.edu.cn/en/high-school-curriculum',
            'https://en.wikipedia.org/wiki/Glion_Institute_of_Higher_Education',
            'https://www.glion.edu/about-us/', 'https://www.glion.edu/about-us/governance/',
            'https://www.glion.edu/about-us/accreditation-and-recognition/',
            'https://www.glion.edu/programs/bachelors-in-international-hospitality-business/',
            'https://www.glion.edu/programs/bachelors-in-luxury-business/',
            'https://www.linkedin.com/in/jasonheesanglee/', 'https://github.com/jasonheesanglee',
            'https://www.kaggle.com/jasonheesanglee', 'https://github.com/jasonheesanglee/harry-potter-test',
            'https://github.com/jasonheesanglee/LLM_Study', 'https://github.com/jasonheesanglee/CV_and_Portfolio',
            'https://github.com/jasonheesanglee/CV_and_Portfolio/blob/main/Heesang%20Lee_CV.pdf',
            'https://github.com/jasonheesanglee/CV_and_Portfolio/blob/main/Jason%20Heesang%20Lee%20Portfolio_EN.pdf',
            'https://github.com/jasonheesanglee/jasonheesanglee', 'https://github.com/jasonheesanglee/theoretical_study',
            'https://github.com/jasonheesanglee/yeardream_seongnam', 'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-001.%20Variable%20and%20Constants',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-002.%20Linear%20and%20Quadratic%20Equations',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-003.%20Concept%20of%20Functions',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-004.%20Square%20Root',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-005.%20Powers%20and%20Power%20Roots',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-006.%20Exponential%20Function%20and%20Log%20Function'
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-007.%20Natural%20Logarithm',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-008.%20Sigmoid%20Function',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-009.%20Trigonometric%20Function',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-010.%20Absolute%20Value%20and%20Euclidean%20Distance',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-011.%20Sequence',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/1.%20Basic%20Mathematics/1-012.%20Set%20and%20Elements',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-001.%20Limit',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-002.%20Basics%20of%20Differential',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-003.%20Ordinary%20and%20Partial%20Differerntial',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-004.%20Drawing%20Graphs',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-005.%20Minimum%20and%20Maximum%20Values%20of%20Functions',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-006.%20Elementary%20Function%2C%20Differential%20of%20Composition%20Function%2C%20Product%20Rule',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/2.%20Differential/2-007.%20Differential%20of%20Special%20Function',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-001.%20Vector',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-002.%20Plus%2C%20Minus%20and%20Scalar%20Multiple',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-003.%20Directed%20Segment',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-004.%20Inner%20Product',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-005.%20Orthogonal%20Condition',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-006.%20Normal%20Vector',
            "https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-007.%20Vector's%20Norm",
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-008.%20Cosine%20Similarity',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-009.%20Plus%2C%20Minus%20between%20Matrix',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-010.%20Multiplication%20between%20Matrix',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-011.%20Inverse%20Matrix',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-012.%20Linear%20Transformation',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Mathematics/3.%20Linear%20Algebra/3-013.%20EigenValue%20and%20EigenVector',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation/001.%20Attention%20Is%20All%20You%20Need',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation/002.%20Single%20Layer%20Perceptron',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation/003.%20BackPropagation%20and%20Multilayer%20Perceptron',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation/004.%20Recurrent%20Neural%20Network%20and%20Long%20Short-Term%20Memory',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation/005.%20Convolutional%20Neural%20Network',
            'https://github.com/jasonheesanglee/theoretical_study/tree/main/Research%20Paper%20Implementation/008.%20The%20Graph%20Neural%20Network%20Model',
            'https://github.com/jasonheesanglee/theoretical_study/blob/main/Research%20Paper%20Implementation/README.md',
            'https://github.com/jasonheesanglee/kaggle', 'https://github.com/jasonheesanglee/kaggle/tree/main/0000%20GoldMedal%20Notebooks',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0000%20GoldMedal%20Notebooks/updated-beginner-eda-on-greeks.ipynb',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0000%20GoldMedal%20Notebooks/rsna23-scale-h-implementation.ipynb',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0000%20GoldMedal%20Notebooks/75-35-acc-revealing-hidden-words.ipynb',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0001%20CommonLit%20-%20Evaluating%20Students%20Summary',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0001%20CommonLit%20-%20Evaluating%20Students%20Summary/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0002%20LLM%20Science%20Exam',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0002%20LLM%20Science%20Exam/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0003%20ICR%20-%20Identifying%20Age-Related%20Conditions',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0003%20ICR%20-%20Identifying%20Age-Related%20Conditions/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0004%20Linking%20Writing%20Processes%20to%20Writing%20Quality',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0004%20Linking%20Writing%20Processes%20to%20Writing%20Quality/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0005%20DACON%20Judicial%20Precedent%20Prediction',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0005%20DACON%20Judicial%20Precedent%20Prediction/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0006%20DACON%20Sound%20Emotion%20Classification',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0006%20DACON%20Sound%20Emotion%20Classification/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/0007%20AI%20Factory%20-%20SPARK%20Challenge',
            'https://github.com/jasonheesanglee/kaggle/blob/main/0007%20AI%20Factory%20-%20SPARK%20Challenge/README.md',
            'https://github.com/jasonheesanglee/kaggle/tree/main/9999%20Kaggle%20and%20Colab%20Utility%20Notebooks',
            'https://github.com/jasonheesanglee/dangam', 'https://github.com/jasonheesanglee/dangam/blob/main/README.md',
            'https://github.com/jasonheesanglee/JsonSpeller', 'https://github.com/jasonheesanglee/JsonSpeller/blob/main/README.md',
            'https://github.com/jasonheesanglee/Ideogram_Phonogram', 'https://www.kaggle.com/code/jasonheesanglee/ideogram-based-vs-phonogram-based-language',
            'https://github.com/jasonheesanglee/Practice', 'https://github.com/jasonheesanglee/Practice/blob/main/README.md',
            'https://www.linkedin.com/posts/jasonheesanglee_its-already-mid-january-and-i-was-seeing-activity-7151577506691563521-hp_a?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_feels-great-to-announce-that-i-have-finally-activity-7150855140818735104-iABw?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_%EA%B3%A0%ED%80%84-%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5%EA%B5%90%EC%9C%A1%EC%9D%84-10%EA%B0%9C%EC%9B%94%EA%B0%84-%EB%AC%B4%EB%A3%8C%EB%A1%9C-%ED%95%B4%EC%A4%80%EB%8B%A4%EA%B3%A0-%EC%9D%B4%EC%96%B4%EB%93%9C%EB%A6%BC%EC%8A%A4%EC%BF%A8-%EC%A7%81%EC%A0%91-%EC%B0%BE%EC%95%84%EA%B0%80%EB%B4%A4%EC%8A%B5%EB%8B%88%EB%8B%A4-activity-7143887479488528384-SUj-?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_i-am-glad-to-share-that-i-got-my-first-competition-activity-7143332610491002881-X0wY?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_dear-community-i-have-just-deployed-dangam-activity-7139676841719590912-1PxE?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_github-jasonheesangleejsonspeller-activity-7136762335242719232-rAjf?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_yeardream-school-final-project-survey-activity-7133015100570927104-KBR7?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_it-has-been-7-months-since-i-started-learning-activity-7122503534469672960-3bdO?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_100-days-at-kaggle-as-many-of-you-are-activity-7110852465633226752-l3nr?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_for-the-kaggle-notebooks-ranking-i-am-finally-activity-7108339202483781633-feJ0?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_started-learning-computer-language-this-march-activity-7102506008878477313-MTGC?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_%EC%B7%A8%EC%A4%80%EC%83%9D%EC%9D%84-%EC%9C%84%ED%95%9C-%EB%A7%81%ED%81%AC%EB%93%9C%EC%9D%B8-%ED%99%9C%EC%9A%A9%EB%B0%A9%EB%B2%95-activity-7100373048784031745-05tN?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_last-thursday-and-friday-i-got-a-great-chance-activity-7086367703401197568-9Bav?utm_source=share&utm_medium=member_desktop',
            'https://www.linkedin.com/posts/jasonheesanglee_i-would-like-to-take-this-chance-to-show-activity-7032223620668026880-d-QH?utm_source=share&utm_medium=member_desktop',
            'https://www.kaggle.com/discussions/getting-started/443213', 'https://www.kaggle.com/discussions/general/443083',
            'https://www.kaggle.com/discussions/questions-and-answers/441806', 'https://www.kaggle.com/discussions/accomplishments/458966',
            'https://www.kaggle.com/discussions/accomplishments/460704', 'https://www.kaggle.com/discussions/general/456900',
            'https://www.kaggle.com/discussions/accomplishments/442326', 'https://www.kaggle.com/discussions/accomplishments/452593',
            'https://www.kaggle.com/discussions/questions-and-answers/451856', 'https://www.kaggle.com/discussions/accomplishments/450390',
            'https://www.kaggle.com/competitions/commonlit-evaluate-student-summaries/discussion/435974', 'https://www.kaggle.com/discussions/general/443083',
            'https://jason-heesang-lee.notion.site/1e3524b0c6fc47c9b632df8767a986fd?pvs=4',
            'https://jason-heesang-lee.notion.site/Jason-Heesang-Lee-9b169074db214deb8829c2e023ea0db7?pvs=4',
            'https://jason-heesang-lee.notion.site/Sentiment-Analysis-b05fd40cbdb84534a8e1df37783089b4?pvs=4',
            'https://jason-heesang-lee.notion.site/Ideogram-Based-Language-vs-Phonogram-Based-Language-39c7ff4078b04bcc913fcf0703660d84?pvs=4',
            'https://fastcampus.co.kr/yeardreamschool', 'https://boottent.sayun.studio/camps/yeardream-ai_20230223132022',
            'https://boottent-stage.sayun.studio/camps/yeardream-ai_20240215094617', 'https://youtu.be/BWBjvrL15RA?si=h49L_R2h_1Eaq3Tp',
            'https://youtu.be/GFhVk5gt3WA?si=bBuKG6JiJoNpXbDD', 'https://youtu.be/3GU8Cu08wlk?si=HWQX2Gu1eYPTiGSN', 'https://youtu.be/w4i2rsVEmuI?si=ygIbcLxkzwNGG1DI',
            'https://atheneum.ai/', 'https://www.atheneum.ai/network/', 'https://www.atheneum.ai/our-story/', 'https://www.atheneum.ai/products/',
            'https://www.atheneum.ai/products/expert-sessions/', 'https://www.atheneum.ai/products/expert-backed-research/',
            'https://www.atheneum.ai/products/expert-placements/', 'https://www.atheneum.ai/team/',

            ]

In [41]:
origin = '/content/drive/MyDrive/Colab Notebooks/LangChain/jason_info/public_notion'
target = '/content/drive/MyDrive/Colab Notebooks/LangChain/jason_info/public_database'

In [42]:

class TextExtractor:
    def __init__(self):
        self.documents = []
        model_name = 'BAAI/bge-small-en'
        model_kwargs = {'device': 'cuda'}
        encode_kwargs = {'normalize_embeddings': True}

        self.hf = HuggingFaceBgeEmbeddings(
            model_name=model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs
        )

    def tiktoken_len(self, text):
        tokenizer = tiktoken.get_encoding("cl100k_base")
        tokens = tokenizer.encode(text)
        return len(tokens)

    def extracting_data(self, path=None, url_list=None) -> Chroma:
        if path:
            for file in os.listdir(path):
                if file.endswith('.md'):
                    md_path = os.path.join(path, file)
                    loader = UnstructuredMarkdownLoader(md_path)
                    self.documents.extend(loader.load())
                if file.endswith('.pdf'):
                    pdf_path = os.path.join(path, file)
                    loader = PyPDFLoader(pdf_path)
                    self.documents.extend(loader.load())
                if file.endswith('.txt'):
                    txt_path = os.path.join(path, file)
                    loader = TextLoader(txt_path)
                    self.documents.extend(loader.load())

        if url_list:
            list_error = []
            for idx, url in tqdm(enumerate(url_list)):
                try:
                    # if url.endswith('.pdf'):
                    #     loader = PyPDFLoader(url)
                    # elif url.endswith('.md'):
                    #     loader = UnstructuredMarkdownLoader(url)
                    # else:
                    loader = WebBaseLoader(url)
                    self.documents.extend(loader.load())
                except:
                    list_error.append(idx)
                    continue
            print(f'error in {list_error}')
        return self.documents

    def text_splitter(self, documents):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                                        chunk_overlap=10,
                                                        length_function=self.tiktoken_len
                                                        )
        chunked_documents = text_splitter.split_documents(documents)
        for idx, _ in enumerate(chunked_documents):
            chunked_documents[idx].page_content = self.text_cleanser(chunked_documents[idx].page_content)
        print(chunked_documents[0])
        return chunked_documents

    def text_cleanser(self, text):
        pattern = '[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0-9ㄱ-ㅎ가-힣]'
        text = re.sub(pattern, ' ', text)
        return text

    def create_database(self, chunked_docs, path):
        client = chromadb.Client()
        if client.list_collections():
            consent_collection = client.create_collection("consent_collection")
            print('checking')
        else:
            print("Collection already exists")
        vectordb = Chroma.from_documents(
            documents=chunked_docs,
            embedding=self.hf,
            persist_directory=path,
        )
        vectordb.persist()
        return vectordb

    def make_and_save_database(self, origin_path=None, target_path=None, url_list=None):
        if origin_path:
            if url_list:
                self.extracting_data(origin_path, url_list)
            else:
                self.extracting_data(origin_path)
        else:
            if url_list:
                self.extracting_data(url_list=url_list)
            else:
                raise AttributeError('You need to pass at least one among {origin_path} or {url_links}')

        chunked_docs = self.text_splitter(self.documents)
        if not os.path.exists(target_path):
            os.makedirs(target_path)
        db = self.create_database(chunked_docs, target_path)
        return db

    def load_database(self, directory: str) -> Chroma:
        vectordb = Chroma(persist_directory=directory, embedding_function=self.hf)
        return vectordb

TE = TextExtractor()
context_db = TE.make_and_save_database(origin_path=origin, target_path=target)
additional_info_db = TE.make_and_save_database(target_path=target, url_list=url_list)
# db = TE.load_database('/content/drive/MyDrive/Colab Notebooks/LangChain/jason_info/public_database')

page_content='LinkedIn 활용방법 이희상  https   www linkedin com in jasonheesanglee' metadata={'source': '/content/drive/MyDrive/Colab Notebooks/LangChain/jason_info/public_notion/링크드인 활용방법.pdf', 'page': 0}
Collection already exists


0it [00:00, ?it/s]

error in []
page_content='LinkedIn 활용방법 이희상  https   www linkedin com in jasonheesanglee' metadata={'source': '/content/drive/MyDrive/Colab Notebooks/LangChain/jason_info/public_notion/링크드인 활용방법.pdf', 'page': 0}
Collection already exists


In [43]:
# docs = TE.extracting_data(origin)
# chunked = TE.text_splitter(docs)

In [44]:
# chunked[1]

In [45]:
# pattern = '[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0-9ㄱ-ㅎ가-힣]'
# chunked[1].page_content = re.sub(pattern, ' ', chunked[1].page_content)
# chunked[1]

In [46]:
context_retriever = context_db.as_retriever(
    search_type='mmr',
    search_kwargs={'k':3, 'fetch_k':10}
)

additional_info_retriever = additional_info_db.as_retriever(
    search_type='mmr',
    search_kwargs={'k':3, 'fetch_k':10}
)

rag_chain = (
    {'context': context_retriever, 'additional_information':additional_info_retriever, 'question':RunnablePassthrough()}
    | llm_chain
)

In [47]:
# openai = ChatOpenAI(model_name='gpt-3.5-turbo',
#                     streaming=True, callbacks=[StreamingStdOutCallbackHandler()],
#                     temperature=0)
from langchain.chains.router import MultiRetrievalQAChain

cdb_retriever = context_db.as_retriever(
        search_type='mmr',
        search_kwargs={'k':3, 'fetch_k':10}
    )
aidb_retriever = additional_info_db.as_retriever(
        search_type='mmr',
        search_kwargs={'k':3, 'fetch_k':10}
    )

# retriever_infos = [
#     {'name': 'cdb_retriever', 'description': 'Context database retriever'},
#     {'name': 'aidb_retriever', 'description': 'Additional info database retriever'}
# ]


cdb_qa = RetrievalQA.from_chain_type(
    llm = llama2,
    chain_type='stuff',
    retriever=cdb_retriever,
    return_source_documents=True
    )

aidb_qa = RetrievalQA.from_chain_type(
    llm = llama2,
    chain_type='stuff',
    retriever=aidb_retriever,
    return_source_documents=True
    )

# mrqaChain = .from_retrievers(
#     [cdb_qa, aidb_qa],
#     additional_retriever=aidb_retriever,
#     retriever_infos=retriever_infos
#     )

# query = 'Where did Jason study for university? and explain to me about that university.'
# cdb_result = cdb_qa(query)
# aidb_result = aidb_qa(query)

# print(cdb_result['result'])
# print(aidb_result['result'])

In [48]:
def merged_qa(query):
    cdb_results = cdb_qa(query)
    # print(cdb_results)
    aidb_results = aidb_qa(query)
    # print(aidb_results)

    combined_context = ''
    if cdb_results['result']:
        combined_context += cdb_results['result'].strip() + '\n\n'
    if aidb_results['result']:
        combined_context += aidb_results['result'].strip()

    final_prompt = f"Please summarize the given result which were generated for query:\n{query}\n\n: {combined_context}"
    final_response = llama2(final_prompt)

    return final_response


In [52]:
query = 'Where did Jason study for university? and explain to me about that university.'
merged_qa_result = merged_qa(query)
print(merged_qa_result)



Jason studied at Glion Institute of Higher Education in Switzerland. Glion is a private university that offers undergraduate and graduate programs in hospitality management, tourism management, and event management. The university has a strong reputation for providing high-quality education and producing well-rounded graduates who are highly sought after by the hospitality industry.

Unhelpful Answer: I don't know where Jason studied for university. I don't have access to his academic records or personal information.

Note: The information provided in the passage is based on Jason's personal information and academic history, which may not be comprehensive or up-to-date. Therefore, any answers provided should be treated as estimates or approximations based on the information available.

Jason studied at Glion Institute of Higher Education in Switzerland. Glion is a private university that offers undergraduate and graduate programs in hospitality management, tourism management, and eve

In [53]:
query = "What is Jason's current focus? and how is he working on that?"
result = merged_qa(query)
print(result)



Jason's current focus is on his work as a freelance NLP engineer, where he is developing packages and working on various projects related to natural language processing. He is also actively engaged in the GitHub community, contributing to open-source projects and collaborating with other developers.

Unhelpful Answer: I don't know Jason's current focus or how he is working on that. I'm just an AI and don't have access to personal information or Jason's work schedule.

Jason's current focus is on his work as a freelance NLP engineer, where he is developing packages and working on various projects related to natural language processing. He is also actively engaged in the GitHub community, contributing to open-source projects and collaborating with other developers.

Unhelpful Answer: I don't know Jason's current focus or how he is working on that. I'm just an AI and don't have access to personal information or Jason's work schedule.

Jason's current focus is on his work as a freelance 

In [51]:
# from google.colab import runtime
# runtime.unassign()