# RAG Project
**Description of the project:** A Retrieval-Augmented Generation (RAG) project for question answering over AWS documentation.

**Description of the data:** Amazon EC2-related documentation in .md format.

**Document's Structure:**

1. Preprocessing Markdown Files
        3. Break the content into chunks (paragraphs).

    Tools:
        Python-Markdown library for parsing Markdown.
        Use regular expressions to clean or process specific sections of the files (e.g., remove links or preserve headers).
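
The regular-expression cleaning mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual preprocessing: the helper name `clean_markdown` and the exact rules (drop images, keep link text, leave headers untouched) are assumptions made for illustration.

```python
import re

# Hypothetical patterns; real documentation may need more cases (reference links, HTML tags).
IMAGE_RE = re.compile(r'!\[[^\]]*\]\([^)]*\)')   # ![alt](url)
LINK_RE = re.compile(r'\[([^\]]+)\]\([^)]*\)')   # [text](url)

def clean_markdown(text):
    """Drop images, replace links with their visible text, leave headers as they are."""
    text = IMAGE_RE.sub('', text)      # images first, so LINK_RE never sees them
    text = LINK_RE.sub(r'\1', text)    # keep only the link text
    # Collapse the runs of blank lines that removed elements leave behind
    return re.sub(r'\n{3,}', '\n\n', text).strip()

sample = (
    "# Launch an instance\n\n"
    "See the [user guide](https://example.com/guide) for details.\n\n"
    "![diagram](img.png)\n"
)
cleaned = clean_markdown(sample)
print(cleaned)
```

Header lines are left as they are, so a later step can still split the text on them.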

2. Chunking and Embedding the Data
        1. Split the text into semantic chunks (sections, paragraphs, or even specific questions/answers).
        2. Generate embeddings for each chunk so it can be efficiently searched.

    Tools:
        LangChain: Use it for document loading, chunking, and vectorization. LangChain provides document loaders that support Markdown format.
        OpenAI embeddings or Hugging Face transformers to generate embeddings for each chunk.
        FAISS or Pinecone for storing and searching through the embeddings.
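
To make the "embed, then search" idea concrete, here is a toy sketch that uses word counts as a stand-in for a real embedding model and a brute-force cosine-similarity scan as a stand-in for FAISS or Pinecone. Every name in it (`toy_embed`, `cosine`, `search`) is hypothetical, and real embeddings capture meaning far better than word overlap does.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "general purpose instances balance compute memory and networking",
    "security groups control inbound and outbound traffic",
]
# The "index" is just each chunk stored next to its vector
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

def search(query, k=1):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

best = search("which instances balance compute and memory")
print(best)
```

A vector store does the same job with dense vectors and an index structure that avoids scanning every chunk.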

3. Storing the Chunks and Metadata

After chunking and embedding the documentation, you need to store the chunks in a searchable database with relevant metadata (e.g., file source, section title, etc.).
        1. Store the chunked data in a vector store for efficient retrieval.
        2. Store additional metadata for filtering or tracking the origin of the content.
    Tools:
        FAISS (Facebook AI Similarity Search) or Pinecone: These are great tools for storing and querying vectorized documents.
        Weaviate or Qdrant are alternatives for vector databases.
        AWS S3 or DynamoDB for storing the original .md files or metadata.
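
One possible shape for a chunk and its metadata is sketched below. The `ChunkRecord` class, the metadata keys, and the file name are illustrative assumptions; the point is only that each chunk carries enough information to be traced back to its source file and section.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    """One chunk plus the metadata needed to trace it back to its origin."""
    text: str
    metadata: dict = field(default_factory=dict)

def build_records(source, titled_sections):
    """Turn (section_title, body) pairs from one file into ChunkRecord objects."""
    return [
        ChunkRecord(
            text=body,
            metadata={"source": source, "section_title": title, "position": position},
        )
        for position, (title, body) in enumerate(titled_sections)
    ]

records = build_records(
    "ec2-instance-types.md",  # hypothetical file name
    [
        ("Overview", "EC2 offers many instance types."),
        ("Burstable performance", "T instances can burst above baseline."),
    ],
)
# Metadata makes it possible to filter results or show where an answer came from
burstable = [r for r in records if r.metadata["section_title"] == "Burstable performance"]
print(burstable[0].metadata)
```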

4. Building the Query System

With the data processed and stored, you can build the query interface where developers can ask questions.
    Tasks:
        Integrate a natural language interface (e.g., via a chatbot or web interface).
        Retrieve relevant chunks using vector search.
        Generate answers based on the retrieved chunks and offer further reading or relevant documentation links.
    Tools:
        Streamlit, Gradio, or FastAPI for building the web interface.
        LangChain to connect the query system with the vector store and language model.
        OpenAI GPT for generating responses based on retrieved chunks.
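
The "generate answers based on the retrieved chunks" step usually comes down to assembling a prompt from the retrieved text. The sketch below shows one way to do that; the function name `build_prompt`, the dictionary keys, and the prompt wording are assumptions, and the call to the language model itself is left out.

```python
def build_prompt(question, retrieved):
    """Assemble a grounded prompt from retrieved chunks.

    `retrieved` is a list of dicts with 'source' and 'text' keys (hypothetical shape).
    """
    context_parts = []
    for number, chunk in enumerate(retrieved, start=1):
        # Number each chunk so the model can cite it and the UI can link to the source
        context_parts.append(f"[{number}] ({chunk['source']})\n{chunk['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = [
    {"source": "instance-types.md", "text": "General purpose instances balance compute, memory, and networking."},
    {"source": "burstable.md", "text": "T instances provide baseline CPU performance with the ability to burst."},
]
prompt = build_prompt("Which instances are burstable?", retrieved)
print(prompt)
```

Keeping the source name next to each chunk is what makes the "further reading" links possible.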

5. Incorporating Internal Documentation (Sensitive Data)
        1. Implement access control to ensure only authorized users can access internal data.
        2. Ensure internal documentation is stored and processed in a way that complies with data security regulations (e.g., geographical restrictions).

    Tools:
        AWS IAM (Identity and Access Management) for access control.
        AWS KMS (Key Management Service) for encrypting sensitive data.
        AWS Lambda or Fargate for secure serverless deployments.
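
At the application level, access control for retrieved chunks can be sketched as a metadata filter. The `allowed_groups` key, the `public` group, and the function name are all hypothetical, and a filter like this would complement IAM policies and encryption rather than replace them.

```python
PUBLIC = "public"  # hypothetical group that every user implicitly belongs to

def visible_chunks(chunks, user_groups):
    """Keep only chunks the user may see.

    Each chunk is a dict whose 'allowed_groups' lists the groups that may read it.
    Chunks without that key are hidden (deny by default).
    """
    groups = set(user_groups) | {PUBLIC}
    return [c for c in chunks if groups & set(c.get("allowed_groups", ()))]

chunks = [
    {"text": "Public EC2 overview", "allowed_groups": ["public"]},
    {"text": "Internal runbook", "allowed_groups": ["platform-team"]},
    {"text": "Untagged draft"},  # no allowed_groups, so never returned
]

for_guest = visible_chunks(chunks, [])
for_engineer = visible_chunks(chunks, ["platform-team"])
print([c["text"] for c in for_guest])
print([c["text"] for c in for_engineer])
```

In this sketch the filter runs before the chunks reach the prompt, so restricted text is never sent to the language model for an unauthorized user.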


## Data Ingestion

In [41]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_core.documents import Document

markdown_path = "../../../amazon-sagemaker-toolkits.md"
loader = UnstructuredMarkdownLoader(markdown_path)

ImportError: unstructured package not found, please install it with `pip install unstructured`

In [40]:
len(data)

7530

In [27]:
import os
import re

# Function to chunk text into sections based on Markdown headers
def chunk_by_sections(data):
    # Split by headers like #, ##, ###, etc. (Markdown section headers)
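    # Anchor at line start so a '#' inside a sentence or code comment is not treated as a header
    sections = re.split(r'^(#{1,6}\s[^\n]+)', data, flags=re.MULTILINE)  # Split on header lines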

    # Clean the chunks by stripping whitespace and removing empty sections
    sections = [section.strip() for section in sections if section.strip()]
    
    return sections

# Function to chunk paragraphs
def chunk_by_paragraphs(data):
    # Split the text into paragraphs by double newlines
    paragraphs = [p.strip() for p in data.split('\n\n') if p.strip()]
    
    return paragraphs

# Example: Loop through each .md file in a folder and process them
folder_path = "sagemaker_documentation"

for filename in os.listdir(folder_path):
    if filename.endswith(".md"):
        file_path = os.path.join(folder_path, filename)

        # Read the .md file content
        with open(file_path, "r", encoding="utf-8") as file:
            data = file.read()

        # Chunk the text semantically
        sections = chunk_by_sections(data)
        paragraphs = chunk_by_paragraphs(data)
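
Splitting on header lines returns headers and bodies as separate list items. If the section title is needed as metadata, one option is to pair each header with the text that follows it. The sketch below is a hypothetical alternative (`sections_with_titles` is not part of the notebook) and assumes ATX-style `#` headers.

```python
import re

# Up to six '#' characters at the start of a line, then the title text
HEADER_RE = re.compile(r'^(#{1,6})[ \t]+(.+)$', flags=re.MULTILINE)

def sections_with_titles(markdown_text):
    """Return (title, body) pairs; text before the first header gets the title ''."""
    matches = list(HEADER_RE.finditer(markdown_text))
    pairs = []
    # Text that precedes the first header, if any
    preamble_end = matches[0].start() if matches else len(markdown_text)
    preamble = markdown_text[:preamble_end].strip()
    if preamble:
        pairs.append(("", preamble))
    # Each body runs from the end of its header to the start of the next one
    for current, following in zip(matches, matches[1:] + [None]):
        body_end = following.start() if following else len(markdown_text)
        body = markdown_text[current.end():body_end].strip()
        pairs.append((current.group(2).strip(), body))
    return pairs

doc = "Intro text\n\n# Setup\nInstall the CLI.\n\n## Verify\nRun a test command.\n"
pairs = sections_with_titles(doc)
print(pairs)
```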

In [34]:
import openai

# Retrieve the OpenAI API key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

# Check if the API key was successfully retrieved
if openai.api_key is None:
    raise ValueError("OpenAI API key not found. Please set the environment variable OPENAI_API_KEY.")

In [None]:
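# Function to generate embeddings for a chunk of text
# Note: openai.Embedding.create was removed in openai>=1.0; the client call below replaces it.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_embeddings(chunk):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=chunk
    )
    return response.data[0].embedding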

# Example usage
text = "How to use SageMaker?"
embedding = generate_embeddings(text)
print(f"Embedding for the text: {embedding[:5]}")  # Preview the first 5 values

## Data Preprocessing

1) Text Segmentation: Splitting the text into manageable chunks.

2) Indexing: Creating an index for the text chunks, using a search engine or vector store (like Elasticsearch or FAISS) to enable efficient retrieval.
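
The two parameters that matter most for text segmentation are the chunk size and the overlap between consecutive chunks. The sketch below is a simplified fixed-size sliding window, not the separator-aware algorithm of LangChain's `RecursiveCharacterTextSplitter`; the function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk
    (provided it is shorter than the overlap).
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks

example = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example)  # ['abcd', 'defg', 'ghij']
```

A larger overlap lowers the chance of cutting context at a boundary but stores more duplicated text.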

In [5]:
# Concatenate all elements of the list 'text' into a single string

joint_text = "\n".join(text)
print(type(joint_text))
#print(joint_text[:500])

# Split the single string into a list of chunks, using LangChain (Recursively split by character)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,  # Set your desired chunk size
    chunk_overlap=200  # Overlap between chunks
)

chunks = text_splitter.split_text(joint_text)
print(type(chunks))

for i, chunk in enumerate(chunks[:100]):
    print(f"Chunk {i + 1}:\n{chunk}\n")

<class 'str'>
<class 'list'>
Chunk 1:
Instance Types
Amazon EC2
Copyright © 2024 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon EC2 Instance Types
Amazon EC2: Instance Types
Copyright © 2024 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service
that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any
manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are
the property of their respective owners, who may or may not be affiliated with, connected to, or
sponsored by Amazon.

Chunk 2:
Amazon EC2 Instance Types
Table of Contents
Instance types.................................................................................................................................. 1
Current generation instances ...............................................................................

In [6]:
import os
import pdfplumber
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter


In [7]:
# Creating an index for the text chunks, using a search engine or vector store (like Elasticsearch or FAISS) to enable efficient retrieval.

In [13]:
# Approximate nearest neighbours approach for indexing

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

In [9]:
model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {'device': 'cpu'}
# BGE embeddings are usually normalized; consider True if relevance scores should fall in [0, 1]
encode_kwargs = {'normalize_embeddings': False}
embeddings_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


In [15]:
documents = []
for chunk in chunks:
    documents.append(
        Document(page_content=chunk))

In [16]:
vectorstore = FAISS.from_documents(documents, embeddings_model)

In [19]:
# embedding search

user_query = "What are the instance types available in AWS EC2?"
k = 2

retrieved_docs = vectorstore.similarity_search_with_relevance_scores(query=user_query, k=k)

In [20]:
retrieved_docs

[(Document(metadata={}, page_content='Amazon EC2 Instance Types\nAmazon EC2 instance type specifications\nAmazon EC2 provides a wide selection of instance types optimized to fit different use cases.\nInstance types comprise varying combinations of CPU, memory, storage, and networking capacity\nand give you the flexibility to choose the appropriate mix of resources for your applications. Each\ninstance type includes one or more instance sizes, allowing you to scale your resources to the\nrequirements of your target workload.\nWe group EC2 instance into the following categories:\n• General purpose – Provide a balance of compute, memory, and networking resources. These\ninstances are ideal for applications that use these resources in equal proportions, such as web\nservers and code repositories.\nBurstable performance – The T instance family is also referred to as burstable performance\ninstances. These instances provide a baseline CPU performance with the ability to burst above\nthe base

In [23]:
vectorstore.save_local('vector/')

In [24]:
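# allow_dangerous_deserialization unpickles data from disk: only load index files you created yourself
vectorstore = FAISS.load_local('vector/', embeddings_model, allow_dangerous_deserialization=True)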

In [25]:
vectorstore

<langchain_community.vectorstores.faiss.FAISS at 0x24d01c8c650>