<a href="https://colab.research.google.com/github/nnennandukwe/GenAI-Dev-Onboarding-Starter-Kit/blob/main/GenAI_Dev_Onboarding_Starter_Kit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gen AI Dev Onboarding Starter Kit #1: Intro to RAG! 🚀

Welcome! This notebook walks you through a Retrieval-Augmented Generation (RAG) pipeline, built for developers and enthusiasts who want hands-on experience with Generative AI.

### Tools You'll Use

- LangChain
- OpenAI
- ChromaDB
- Ragas


### Key steps covered:
1.  Setting up the environment and installing dependencies (including LangChain).
2.  Processing local documents (`doc1.txt` - Security Guidelines, `doc2.txt` - Company FAQs) using LangChain document loaders and text splitters.
3.  Generating embeddings using LangChain wrappers for OpenAI (`text-embedding-3-small`).
4.  Storing and indexing embeddings in ChromaDB using its LangChain integration.
5.  Performing question answering using a LangChain `RetrievalQA` chain.
6.  Evaluating the RAG pipeline using Ragas (metrics: context precision, context recall, faithfulness), leveraging LangChain components.
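
To make the retrieve-then-generate idea behind steps 3 to 5 concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the notebook's code: word counts replace the OpenAI embeddings, an in-memory list replaces ChromaDB, and the function names (`embed`, `retrieve`, `build_prompt`) and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """'Stuff' the retrieved chunks into a single prompt for the LLM."""
    context_block = "\n\n".join(contexts)
    return f"Answer using only this context:\n\n{context_block}\n\nQuestion: {question}"


# Hypothetical chunks, made up for illustration
chunks = [
    "Passwords must be at least 12 characters long.",
    "Report security incidents to the IT help desk.",
    "The office is closed on public holidays.",
]
question = "How long must passwords be?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real pipeline, the prompt would be sent to the LLM; the sketch stops at prompt construction so it runs without an API key.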

### Before you begin

*   You will need an OpenAI API key.
*   The project files (`embedding_processor_langchain.py`, `evaluation_langchain.py`, `doc1.txt`, `doc2.txt`, and `pyproject.toml`) must be accessible in your Colab environment to complete the guide. Cloning the repository in the first setup cell takes care of this; alternatively, you can upload the files yourself.
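
If you'd like to verify the files programmatically, the following is a minimal sketch. The helper name `find_missing` is hypothetical, and because the exact location of each file inside the repository is treated as an assumption here, the search is recursive.

```python
from pathlib import Path

# File names taken from the list above; their location within the
# repository is an assumption, so we search all subdirectories.
REQUIRED_FILES = [
    "embedding_processor_langchain.py",
    "evaluation_langchain.py",
    "doc1.txt",
    "doc2.txt",
    "pyproject.toml",
]


def find_missing(root: str, required: list[str] = REQUIRED_FILES) -> list[str]:
    """Return the required file names not found anywhere under root."""
    base = Path(root)
    return [name for name in required if not any(base.rglob(name))]
```

After cloning, `find_missing(".")` should return an empty list when run from the project folder.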

## 2. Set Up Environment

In [1]:
# Clone the project repository if not already done
!git clone https://github.com/nnennandukwe/GenAI-Dev-Onboarding-Starter-Kit.git

# Change directory to the project folder
%cd GenAI-Dev-Onboarding-Starter-Kit

print("Environment setup started...")

Cloning into 'GenAI-Dev-Onboarding-Starter-Kit'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 39 (delta 15), reused 9 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (39/39), 28.53 KiB | 2.59 MiB/s, done.
Resolving deltas: 100% (15/15), done.
/content/GenAI-Dev-Onboarding-Starter-Kit
Environment setup started...


In [11]:
# Confirm you're in the correct directory before we begin installation
import os
project_root = os.getcwd()
print(project_root)  # output below should be /content/GenAI-Dev-Onboarding-Starter-Kit

/content/GenAI-Dev-Onboarding-Starter-Kit


### Install Poetry for managing Python dependencies

In [3]:
# Install Poetry
!curl -sSL https://install.python-poetry.org | python3 -

# Add Poetry to PATH for the current Colab session
import os
os.environ["PATH"] += ":" + os.path.expanduser("~/.local/bin")

!poetry --version
print("Poetry installed.")

Retrieving Poetry metadata

The latest version (2.1.3) is already installed.
Poetry (version 2.1.3)
Poetry installed.


### Install project dependencies with Poetry

In [14]:
# This command reads pyproject.toml and installs all dependencies including Langchain.
!poetry install --no-root
print("Project dependencies (including Langchain) installed.")

Updating dependencies
Resolving dependencies... (15.0s)

Package operations: 128 installs, 0 updates, 0 removals

  - Installing certifi (2025.4.26): Pending...
  - Installing charset-normalizer (3.4.2): Pending...
  - Installing h11 (0.16.0): Pending...
  - Installing idna (3.10): Pending...
  - Installing sniffio (1.3.1): Pending...
  - Installing typing-extensions (4.13.2): Pending...

### Set up OpenAI API Key

In [19]:
import os
from getpass import getpass

from google.colab import userdata

api_key = os.environ.get("OPENAI_API_KEY")

if not api_key:
    # Check for a key stored in Colab Secrets. userdata.get raises an error
    # if the secret is missing or the notebook has not been granted access.
    try:
        api_key = userdata.get("OPENAI_API_KEY")
    except Exception:
        api_key = None

    if api_key:
        print("Using Google Colab API key.")
    else:
        # Prompt for an OpenAI API key if none is available
        api_key = getpass("Please enter your OpenAI API key: ")

    os.environ["OPENAI_API_KEY"] = api_key

# Report only whether the key is set; never print the key itself.
if os.environ.get("OPENAI_API_KEY"):
    print("OpenAI API key set successfully!")
else:
    print("Failed to set OpenAI API key.")

OpenAI API key set successfully!
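
If you want visible confirmation of which key is loaded, avoid printing the full value, since cell output is saved with the notebook. One option is sketched below; `mask_secret` is a hypothetical helper and not part of the project, and the sample value is a placeholder rather than a real key.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Return a display-safe version of a secret, showing only its last few characters."""
    if not secret:
        return "<not set>"
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * 8 + secret[-visible:]


# Hypothetical placeholder value, not a real key
masked = mask_secret("sk-example-1234")
print("OpenAI API key:", masked)
```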


## 3. Data Processing and Embedding (LangChain)

Now, we will use the **LangChain-based script** (`embedding_processor_langchain.py`) to:
1. Load documents (`doc1.txt`, `doc2.txt`) using `TextLoader`.
2. Split documents into chunks using `RecursiveCharacterTextSplitter`.
3. Generate embeddings with `OpenAIEmbeddings` (`text-embedding-3-small`).
4. Store chunks and embeddings in ChromaDB via its LangChain integration.

> Make sure `doc1.txt` and `doc2.txt` (in the project root) and `embedding_processor_langchain.py` (in the `my_rag_project` subdirectory) are present.
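
To illustrate what chunking with overlap means in step 2, here is a simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter also tries to break on separators such as paragraphs, lines, and spaces, while this version uses fixed character windows. The function name `split_text` and the size and overlap values are assumptions; the script's actual settings are not shown in this notebook.

```python
def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 40) -> list[str]:
    """Split text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = split_text(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(repr(piece))
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.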

### Double-check current directory and available script files

In [20]:
project_root = os.getcwd() # Should be /content/GenAI-Dev-Onboarding-Starter-Kit
script_dir = os.path.join(project_root, "my_rag_project") # where scripts go

!ls -l {project_root}
!ls -l {script_dir}

total 444
-rw-r--r-- 1 root root   9184 May 14 15:28 colab_notebook_guide.md
-rw-r--r-- 1 root root  10219 May 14 15:28 doc1.txt
-rw-r--r-- 1 root root   3634 May 14 15:28 doc2.txt
drwxr-xr-x 2 root root   4096 May 14 16:59 my_rag_project
-rw-r--r-- 1 root root 399471 May 14 16:52 poetry.lock
-rw-r--r-- 1 root root    518 May 14 16:55 pyproject.toml
-rw-r--r-- 1 root root   5123 May 14 15:28 README.md
drwxr-xr-x 3 root root   4096 May 14 15:28 src
drwxr-xr-x 2 root root   4096 May 14 15:28 tests
total 16
-rw-r--r-- 1 root root 4473 May 14 16:59 embedding_processor_langchain.py
-rw-r--r-- 1 root root 6404 May 14 15:28 evaluation_langchain.py


### Run the LangChain embedding processor script

In [22]:
# This script will use Langchain for loading, chunking, embedding, and storing in ChromaDB.
# Ensure your OPENAI_API_KEY is set.

!poetry run python my_rag_project/embedding_processor_langchain.py

print("Langchain embedding process execution attempted.")
# Check for the new chroma_db_langchain directory
!ls -l
!ls -l chroma_db_langchain # This should exist if the script ran successfully

Starting Langchain-based embedding process...
Looking for documents in: /content/GenAI-Dev-Onboarding-Starter-Kit
Document paths: ['./doc1.txt', './doc2.txt']
Loaded document: doc1.txt
Loaded document: doc2.txt
Total documents loaded via Langchain: 2
Split documents into 21 chunks.
Initialized OpenAIEmbeddings with model: text-embedding-3-small
Initializing Chroma vector store at: /content/GenAI-Dev-Onboarding-Starter-Kit/chroma_db_langchain
  warn_deprecated(
Successfully created/updated Chroma vector store
Total items in collection: 21
Langchain-based embedding process completed.
Langchain embedding process execution attempted.
total 448
drwxr-xr-x 3 root root   4096 May 14 17:15 chroma_db_langchain
-rw-r--r-- 1 root root   9184 May 14 15:28 colab_notebook_guide.md
-rw-r--r-- 1 root root  10219 May 14 15:28 doc1.txt
-rw-r--r-- 1 root root   3634 May 14 15:28 doc2.txt
drwxr-xr-x 2 root root   4096 May 14 17:14 my_rag_project
-rw-r--r-- 1 root root 399471 May 14 16:52 poetry.lock
-rw-r

## 4. Ragas Evaluation (LangChain Pipeline)

Next, we'll use the **LangChain-based Ragas evaluation script** (`evaluation_langchain.py`). This script will:
1.  Connect to the ChromaDB populated by the LangChain embedding script.
2.  Set up a LangChain `RetrievalQA` chain using an OpenAI LLM (`gpt-3.5-turbo` or similar) and the ChromaDB retriever.
3.  Generate answers for predefined questions using the QA chain.
4.  Retrieve contexts (source documents) used by the QA chain.
5.  Calculate Ragas metrics: `context_precision`, `context_recall`, and `faithfulness`.

> Make sure `evaluation_langchain.py` (inside `my_rag_project` subdirectory) is present.
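
To build intuition for the three metrics in step 5, the sketch below computes them from hand-supplied true/false verdicts. It follows a simplified reading of the definitions in the Ragas documentation; in Ragas itself the verdicts come from an LLM judge rather than from hard-coded lists. The function signatures and example verdicts are assumptions made for illustration.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in the generated answer that the retrieved contexts support."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall(reference_claim_found: list[bool]) -> float:
    """Fraction of claims in the reference answer that can be found in the retrieved contexts."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision(relevant_at_rank: list[bool]) -> float:
    """Mean of precision@k, taken over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(relevant_at_rank, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Hypothetical verdicts for a single question
f_score = faithfulness([True, True, False])       # 2 of 3 answer claims supported
r_score = context_recall([True, False])           # 1 of 2 reference claims retrieved
p_score = context_precision([True, False, True])  # relevant chunks at ranks 1 and 3
print(round(f_score, 3), round(r_score, 3), round(p_score, 3))
```

Context precision rewards placing relevant chunks early in the ranking, which is why the same two relevant chunks would score lower at ranks 2 and 3 than at ranks 1 and 3.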

### Run the LangChain Ragas evaluation script

In [24]:
# This script requires the ChromaDB (langchain version) to be populated.
# It also requires the OPENAI_API_KEY.
!poetry run python my_rag_project/evaluation_langchain.py

print("Langchain Ragas evaluation process execution attempted.")

from langchain_community.output_parsers.rail_parser import GuardrailsOutputParser
Starting Langchain-based RAGAS evaluation process...
Initialized OpenAIEmbeddings with model: text-embedding-3-small
Initialized ChatOpenAI with model: gpt-3.5-turbo
Connecting to Chroma vector store at: /content/GenAI-Dev-Onboarding-Starter-Kit/chroma_db_langchain
Successfully connected to Chroma vector store
Langchain RetrievalQA chain created.
Generating answers and retrieving contexts using Langchain QA chain...
Q: What is the company's policy on password complexity?
A: The company's policy on password complexity requires passwords to be at least 12 characters long and include a mix of uppercase letters, lowercase letters, numbers, and special symbols. It also advises against using easily guessable information like names, birthdays, or common words.
Retrieved contexts: 3
Q: How should employees report a security incident?
A: Employees should report any suspected or confirmed security incidents, such a

## 5. Conclusion: You Did It! 🎊 ✅

This notebook demonstrated the core steps of setting up a RAG pipeline **using LangChain**, from document processing and embedding to QA and evaluation with Ragas.

**Further exploration:**
*   Explore the `embedding_processor_langchain.py` and `evaluation_langchain.py` files to get a deeper look into the embedding and evaluation code!
*   Try different LLMs available through LangChain. All you have to do is edit the `LLM_MODEL` value in `evaluation_langchain.py`.
*   Explore more advanced LangChain chains and agents for RAG.
*   Expand the evaluation dataset and try [other Ragas metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/) alongside LangChain.