<a href="https://colab.research.google.com/github/nnennandukwe/GenAI-Dev-Onboarding-Starter-Kit/blob/main/GenAI_Dev_Onboarding_Starter_Kit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gen AI Dev Onboarding Starter Kit #1: Intro to RAG! 🚀

Welcome! This notebook walks you through a Retrieval-Augmented Generation (RAG) pipeline demo, built for developers (and enthusiasts) who want hands-on experience with Generative AI!

### Tools You'll Use

- LangChain
- OpenAI
- ChromaDB
- Ragas


### Key steps covered:
1.  Setting up the environment and installing dependencies (including LangChain).
2.  Processing local documents (`doc1.txt` - Security Guidelines, `doc2.txt` - Company FAQs) using LangChain document loaders and text splitters.
3.  Generating embeddings using LangChain wrappers for OpenAI (`text-embedding-3-small`).
4.  Storing and indexing embeddings in ChromaDB using its LangChain integration.
5.  Performing question answering with a LangChain `RetrievalQA` chain.
6.  Evaluating the RAG pipeline with Ragas (metrics: context precision, context recall, faithfulness), leveraging LangChain components.

### Before you begin

*   You will need an OpenAI API key.
*   The project files (including `embedding_processor_langchain.py`, `evaluation_langchain.py`, `doc1.txt`, `doc2.txt`, and `pyproject.toml`) must be accessible in your Colab environment. They become available once you run the first setup cell, which clones the repository; alternatively, you can upload them yourself.

## 2. Set Up Environment

In [None]:
# Clone the project repository only if it has not been cloned already
![ -d GenAI-Dev-Onboarding-Starter-Kit ] || git clone https://github.com/nnennandukwe/GenAI-Dev-Onboarding-Starter-Kit.git

# Change directory to the project folder
%cd GenAI-Dev-Onboarding-Starter-Kit

print("Environment setup started...")

In [None]:
# Confirm you're in the correct directory before we begin installation
import os
project_root = os.getcwd()
print(project_root)  # expected output: /content/GenAI-Dev-Onboarding-Starter-Kit

### Install Poetry for managing Python dependencies

In [None]:
# Install Poetry
!curl -sSL https://install.python-poetry.org | python3 -

# Add Poetry to PATH for the current Colab session
import os
os.environ["PATH"] += ":" + os.path.expanduser("~/.local/bin")

!poetry --version
print("Poetry installed.")

### Install project dependencies with Poetry

In [None]:
# This command reads pyproject.toml and installs all dependencies including Langchain.
!poetry install --no-root
print("Project dependencies (including Langchain) installed.")

### Set up OpenAI API Key

In [None]:
import os
from getpass import getpass
from google.colab import userdata

api_key = os.environ.get("OPENAI_API_KEY")

if not api_key:
    # Check for a key stored in Colab's Secrets panel first
    try:
        api_key = userdata.get("OPENAI_API_KEY")
    except Exception:
        # Raised when the secret is missing or notebook access is not granted
        api_key = None

    if api_key:
        print("Using Google Colab API key.")
    else:
        # Request a new OpenAI API key if none is available
        api_key = getpass("Please enter your OpenAI API key: ")

    if api_key:
        os.environ["OPENAI_API_KEY"] = api_key

# Report success only if the environment variable is actually populated
if os.environ.get("OPENAI_API_KEY"):
    print("OpenAI API key set successfully!")
else:
    print("Failed to set OpenAI API key.")

## 3. Data Processing and Embedding (LangChain)

Now we will use the **LangChain-based script** (`embedding_processor_langchain.py`) to:
1. Load documents (`doc1.txt`, `doc2.txt`) using `TextLoader`.
2. Split documents into chunks using `RecursiveCharacterTextSplitter`.
3. Generate embeddings with `OpenAIEmbeddings` (`text-embedding-3-small`).
4. Store chunks and embeddings in ChromaDB via its LangChain integration.

> Make sure `doc1.txt`, `doc2.txt`, and `embedding_processor_langchain.py` (inside the `my_rag_project` subdirectory) are present.
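
To make the chunking step more concrete, here is a minimal pure-Python sketch of splitting text into overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences), and the function name `split_with_overlap` and the sizes used are assumptions chosen for illustration, not values taken from the project script.

```python
def split_with_overlap(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is less likely to be cut off from its context when chunks are retrieved individually.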

### Double-check current directory and available script files

In [None]:
project_root = os.getcwd() # Should be /content/GenAI-Dev-Onboarding-Starter-Kit
script_dir = os.path.join(project_root, "my_rag_project") # where scripts go

!ls -l {project_root}
!ls -l {script_dir}

### Run the LangChain embedding processor script

In [None]:
# This script will use Langchain for loading, chunking, embedding, and storing in ChromaDB.
# Ensure your OPENAI_API_KEY is set.

!poetry run python my_rag_project/embedding_processor_langchain.py

print("Langchain embedding process execution attempted.")
# Check for the new chroma_db_langchain directory
!ls -l
!ls -l chroma_db_langchain # This should exist if the script ran successfully

## 4. Ragas Evaluation (LangChain Pipeline)

Next, we'll use the **LangChain-based Ragas evaluation script** (`evaluation_langchain.py`). This script will:
1.  Connect to the ChromaDB populated by the LangChain embedding script.
2.  Set up a LangChain `RetrievalQA` chain using an OpenAI LLM (`gpt-3.5-turbo` or similar) and the ChromaDB retriever.
3.  Generate answers for predefined questions using the QA chain.
4.  Retrieve contexts (source documents) used by the QA chain.
5.  Calculate Ragas metrics: `context_precision`, `context_recall`, and `faithfulness`.

> Make sure `evaluation_langchain.py` (inside `my_rag_project` subdirectory) is present.
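
As a rough picture of what the retriever does before the LLM answers, the sketch below ranks a few stored chunks by cosine similarity to a query vector and returns the top matches. Everything here is an assumption for illustration: the three-dimensional vectors and the chunk texts are invented (they are not from `doc1.txt` or `doc2.txt`), the helper names are hypothetical, and the distance function ChromaDB actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk text, made-up embedding). Real embeddings have far more dimensions.
toy_index = [
    ("Rotate passwords every 90 days.", [0.9, 0.1, 0.0]),
    ("Office hours are 9am to 5pm.", [0.0, 0.2, 0.9]),
    ("Enable two-factor authentication.", [0.8, 0.3, 0.1]),
]

query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded security question
results = top_k(query_vector, toy_index, k=2)
print(results)
# → ['Rotate passwords every 90 days.', 'Enable two-factor authentication.']
```

In the real pipeline the retrieved chunks are passed to the LLM as context, which is why retrieval quality directly affects the answer and the Ragas scores.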

### Run the LangChain Ragas evaluation script

In [None]:
# This script requires the ChromaDB (langchain version) to be populated.
# It also requires the OPENAI_API_KEY.
!poetry run python my_rag_project/evaluation_langchain.py

print("Langchain Ragas evaluation process execution attempted.")
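
To help read the scores the script reports, here is a small sketch of the arithmetic behind a context-precision-style metric, based on my understanding of the formula in the Ragas documentation: average the precision at each rank where a relevant chunk appears. Ragas itself uses an LLM to judge whether each retrieved chunk is relevant; in this sketch those verdicts are simply supplied as a hypothetical list of 0/1 flags, and the function name is an assumption.

```python
def context_precision(relevance: list[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `relevance` lists one 0/1 flag per retrieved chunk, in ranked order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank at this relevant position
    return weighted_sum / hits if hits else 0.0


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(context_precision([1, 0, 1]), 4))
# → 0.8333

# The only relevant chunk is ranked last: (1/3) / 1
print(round(context_precision([0, 0, 1]), 4))
# → 0.3333
```

The score rewards retrievers that place relevant chunks near the top: the same set of chunks scores lower when the useful ones are ranked last.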

## 5. Conclusion: You Did It! 🎊 ✅

This notebook demonstrated the core steps of setting up a RAG pipeline **using LangChain**, from document processing and embedding to QA and evaluation with Ragas.

**Further exploration:**
*   Read through `embedding_processor_langchain.py` and `evaluation_langchain.py` for a closer look at the embedding and evaluation code!
*   Try different LLMs available through LangChain. All you have to do is change the `LLM_MODEL` value in `evaluation_langchain.py`!
*   Explore more advanced LangChain chains and agents for RAG.
*   Expand the evaluation dataset and try [other Ragas metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/) alongside the LangChain pipeline.