# Langchain RAG Tutorial
---
<br>

## Install dependencies

1. Do the following before installing the dependencies found in `requirements.txt` file because of current challenges installing `onnxruntime` through `pip install onnxruntime`. 

    - For MacOS users, a workaround is to first install `onnxruntime` dependency for `chromadb` using:

    ```python
     conda install onnxruntime -c conda-forge
    ```
    See this [thread](https://github.com/microsoft/onnxruntime/issues/11037) for additonal help if needed. 

     - For Windows users, follow the guide [here](https://github.com/bycloudai/InstallVSBuildToolsWindows?tab=readme-ov-file) to install the Microsoft C++ Build Tools. Be sure to follow through to the last step to set the enviroment variable path.


2. Now run this command to install dependenies in the `requirements.txt` file. 

```python
pip install -r requirements.txt
```

3. Install markdown depenendies with: 

```python
pip install "unstructured[md]"
```

You'll also need to set up an OpenAI account (and set the OpenAI key in your environment variable) for this to work.
<br>

### References

Instructions and code snippets from [Langchain RAG Tutorial](https://github.com/pixegami/langchain-rag-tutorial)

In [1]:
pip install -r requirements.txt

Collecting python-dotenv==1.0.1 (from -r requirements.txt (line 1))
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting langchain==0.2.2 (from -r requirements.txt (line 2))
  Downloading langchain-0.2.2-py3-none-any.whl.metadata (13 kB)
Collecting langchain-community==0.2.3 (from -r requirements.txt (line 3))
  Downloading langchain_community-0.2.3-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-openai==0.1.8 (from -r requirements.txt (line 4))
  Downloading langchain_openai-0.1.8-py3-none-any.whl.metadata (2.5 kB)
Collecting unstructured==0.14.4 (from -r requirements.txt (line 5))
  Downloading unstructured-0.14.4-py3-none-any.whl.metadata (28 kB)
Collecting chromadb==0.5.0 (from -r requirements.txt (line 9))
  Downloading chromadb-0.5.0-py3-none-any.whl.metadata (7.3 kB)
Collecting openai==1.31.1 (from -r requirements.txt (line 10))
  Downloading openai-1.31.1-py3-none-any.whl.metadata (21 kB)
Collecting tiktoken==0.7.0 (from -r requirements.txt (li

In [2]:
pip install "unstructured[md]"

Collecting markdown (from unstructured[md])
  Downloading Markdown-3.6-py3-none-any.whl.metadata (7.0 kB)
Collecting certifi>=2017.4.17 (from requests->unstructured[md])
  Using cached certifi-2024.6.2-py3-none-any.whl.metadata (2.2 kB)
Collecting urllib3<3,>=1.21.1 (from requests->unstructured[md])
  Using cached urllib3-2.2.2-py3-none-any.whl.metadata (6.4 kB)
Downloading Markdown-3.6-py3-none-any.whl (105 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.4/105.4 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m [36m0:00:01[0m
[?25hUsing cached certifi-2024.6.2-py3-none-any.whl (164 kB)
Using cached urllib3-2.2.2-py3-none-any.whl (121 kB)
Installing collected packages: urllib3, markdown, certifi
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.7
    Uninstalling urllib3-1.26.7:
      Successfully uninstalled urllib3-1.26.7
  Attempting uninstall: certifi
    Found existing installation: certifi 2021.10.8
    Uninstalling certifi-2021.10

## Create database

Create the Chroma DB.

In [5]:
# from langchain.document_loaders import DirectoryLoader
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
# from langchain.embeddings import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import openai 
from dotenv import load_dotenv
import os
import shutil

# Load environment variables. Assumes that project contains .env file with API keys
load_dotenv()
#---- Set OpenAI API key 
# Change environment variable name from "OPENAI_API_KEY" to the name given in 
# your .env file.
openai.api_key = os.environ['OPENAI_API_KEY']

CHROMA_PATH = "chroma"
DATA_PATH = "data"


def main():
    generate_data_store()


def generate_data_store():
    documents = load_documents()
    chunks = split_text(documents)
    save_to_chroma(chunks)


def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
    db.persist()
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")


if __name__ == "__main__":
    main()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/thomaskojoaddaquay/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/thomaskojoaddaquay/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Split 1 documents into 10 chunks.


IndexError: list index out of range

## Query the database

Query the Chroma DB.
