### Fetching API Key

In [2]:
from decouple import config

# Read the OpenAI API key from the environment (or a .env file) via python-decouple
openai_api_key = config("OPENAI_API_KEY")

In [3]:
# print(openai_api_key)

### Pipeline for converting raw unstructured data into a QA chain
#### Indexing

1. Loading: First, the data needs to be loaded. Unstructured data can come from many sources; the LangChain Integration Hub lists the full range of loaders. Each loader returns its data as LangChain Document objects.

2. Splitting: Text splitters segment the Documents into specified sizes.

3. Storage: A storage solution, often a vector store, holds the splits and in many cases also embeds them.

<img src="https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png" width="1000" height="450">
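The three indexing steps can be illustrated without any external service. The following is a minimal sketch only: the helper names `split_words` and `embed`, the hard-coded `raw_text`, and the bag-of-words "embedding" are all assumptions made for illustration and are not part of LangChain.

```python
from collections import Counter

def split_words(text, chunk_size=8, chunk_overlap=2):
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

# 1. Loading: a hard-coded string stands in for a loader's output (hypothetical data)
raw_text = (
    "LangChain is an open source framework for building applications "
    "on top of large language models. It provides loaders, splitters, "
    "vector stores and chains."
)

# 2. Splitting: cut the text into overlapping word windows
toy_chunks = split_words(raw_text)

# 3. Storage: keep every chunk next to its vector in a plain list
toy_store = [(chunk, embed(chunk)) for chunk in toy_chunks]
```

A real pipeline would replace the list with a vector store and the word counts with a learned embedding, but the flow of data is the same.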

#### Retrieval and Generation

4. Retrieval: The application fetches splits from the storage, usually by selecting those whose embeddings are most similar to the embedding of the input question.

5. Generation: A Language Model (LLM) generates an answer using a prompt that incorporates both the question and the retrieved data.

<img src="https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png" width="1000" height="450">
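Retrieval and prompt construction can be sketched in the same toy setting. This is an illustration under stated assumptions: `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers, the similarity is computed on word counts rather than real embeddings, and the final call to an LLM is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, splits, k=2):
    """Return the k splits most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(splits, key=lambda s: cosine(q_vec, embed(s)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_splits):
    """Combine the retrieved splits and the question into a single prompt string."""
    context = "\n".join(context_splits)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical splits, as they might come out of the indexing stage
splits = [
    "LangChain is an open source framework for LLM applications.",
    "Text splitters segment documents into chunks.",
    "A vector store holds embedded chunks for retrieval.",
]

question = "What is a vector store?"
top = retrieve(question, splits, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

Only the retrieved splits reach the prompt, which keeps the prompt short even when the indexed corpus is large.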

#### Bonus
6. Conversation (Extension): To facilitate multi-turn conversations, Memory can be added to the QA chain.
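
One simple way to picture memory is a buffer of recent turns that is prepended to each new prompt. The sketch below is a plain-Python illustration; the `ConversationBuffer` class and its `max_turns` parameter are hypothetical and are not LangChain's memory API.

```python
class ConversationBuffer:
    """Keep the most recent question/answer turns for reuse in later prompts."""

    def __init__(self, max_turns=3):
        self.max_turns = max_turns
        self.turns = []

    def add(self, question, answer):
        """Record a turn and drop the oldest ones beyond max_turns."""
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]

    def as_text(self):
        """Render the stored turns as a chat transcript."""
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)

memory = ConversationBuffer(max_turns=2)
memory.add("What is LangChain?", "A framework for building LLM applications.")
memory.add("Is it open source?", "Yes.")
memory.add("What does a splitter do?", "It segments documents into chunks.")

# The follow-up only makes sense together with the earlier turns
follow_up = "Why is that useful?"
prompt = f"{memory.as_text()}\nUser: {follow_up}\nAssistant:"
```

Capping the number of stored turns keeps the prompt within the model's context window at the cost of forgetting older exchanges.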

### Q & A pipeline RAG

Reference: [LangChain](https://python.langchain.com/docs/use_cases/question_answering/)

* RAG (retrieval-augmented generation) is a technique for augmenting an LLM's knowledge with additional data.

### Step 1: Loading the Document

In [6]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://aws.amazon.com/vi/what-is/langchain/")

corpus = loader.load()

In [None]:
# print(corpus)

[Document(page_content='\n\n\n\n\n\n\n\n\n\n\n\nLangChain là gì? – Giải thích về LangChain – AWS\n\n ...

### Step 2: Splitting the Document into Chunks

In [7]:
import tiktoken

# Look up which token encoding the GPT-3.5 Turbo model uses
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [8]:
tokenizer = tiktoken.get_encoding('cl100k_base')

# Define a function to calculate the token length of a given text
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# tiktoken_len("Dentin decay: Dentin is the layer just beneath your tooth enamel.")
tiktoken_len("LangChain là một khung mã nguồn mở để xây dựng các ứng dụng dựa trên các mô hình ngôn ngữ lớn (LLM).")

49

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter.
#   chunk_size: maximum size of each chunk, measured by length_function (here: tokens).
#   chunk_overlap: maximum overlap between consecutive chunks, in the same unit.
#   length_function: function used to measure the length of a piece of text.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=tiktoken_len
)

In [10]:
# Split the loaded document into smaller chunks
chunks = text_splitter.split_documents(corpus)

In [11]:
len(chunks)

95
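
To give an intuition for how a recursive, separator-based splitter arrives at its chunks, here is a simplified sketch. It is not LangChain's implementation: the function `recursive_split` is hypothetical, it ignores `chunk_overlap`, and a word count stands in for the token-based `tiktoken_len` so that the example runs without extra dependencies.

```python
def recursive_split(text, chunk_size, length_function=len,
                    separators=("\n\n", "\n", " ")):
    """Split text on the coarsest separator first, recursing on pieces that are too long."""
    if length_function(text) <= chunk_size or not separators:
        return [text]  # already small enough, or nothing left to split on

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if part:
            pieces.extend(recursive_split(part, chunk_size, length_function, finer))

    # Greedily merge neighbouring pieces back together while they still fit
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if length_function(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

def word_len(text):
    """Stand-in length function: count whitespace-separated words."""
    return len(text.split())

toy_chunks = recursive_split(
    "one two three four five six seven",
    chunk_size=3,
    length_function=word_len,
)
```

In this toy run every chunk holds at most three words; swapping `word_len` for a token counter would bound chunks by tokens instead.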