# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains sample hotel sales data with columns for departure date, hotel name, region, hotel type, room type, salesperson, and sales amount. This dataset is used for a RAG use case: a question-answering system over the hotel sales records.

## Key Components

1. Loading and splitting CSV files
2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and all-MiniLM-L6-v2 embeddings
3. Retriever setup for querying the processed documents
4. Question answering over the CSV data

## Method Details

### Document Preprocessing

1. The CSV file is loaded with LangChain's `CSVLoader`, which creates one document per row.
2. The documents are split into chunks.
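
The loading step can be illustrated without LangChain. The sketch below uses only the standard library to mimic, roughly, what `CSVLoader` produces: one document per row, with each field written as a `column: value` line and the source file and row number kept as metadata. The function name `rows_to_documents` and the dictionary layout are assumptions for illustration, not the library's implementation.

```python
import csv
import io

# Two rows adapted from the preview of hotel_data_grouped.csv (index column omitted).
SAMPLE_CSV = '''depature_date,hotel_name,reg,types,room-type,Sales Person,sales
2024-01-04,"Le Méridien, Sabah",East,Resort Hotel,Deluxe Room,Derek,697.5
2024-01-04,"Lexis Suites, Penang",North,City Hotel,Deluxe Room,Amy,657.5
'''


def rows_to_documents(csv_text, source="hotel_data_grouped.csv"):
    """Turn each CSV row into a dict holding text content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per field, so column names travel with the values.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents


documents = rows_to_documents(SAMPLE_CSV)
print(documents[0]["page_content"])
```

Keeping the column name next to each value means a query such as "sales for Deluxe Room" can match on both the field name and its content once the text is embedded.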


### Vector Store Creation

1. all-MiniLM-L6-v2 embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.
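
To make the similarity search concrete, here is a small pure-Python sketch of the exhaustive search that a flat L2 index such as `faiss.IndexFlatL2` performs conceptually: compare the query against every stored vector and keep the closest ones by squared Euclidean distance. The toy 3-dimensional vectors and the helper names `squared_l2` and `search` are assumptions for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and FAISS does this work in optimized native code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index_vectors, query, k=2):
    """Return (distance, position) pairs for the k stored vectors closest to query."""
    scored = [(squared_l2(vec, query), pos) for pos, vec in enumerate(index_vectors)]
    return sorted(scored)[:k]


# Toy stand-ins for chunk embeddings.
index_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query = [1.0, 0.0, 0.0]

results = search(index_vectors, query, k=2)
print(results)
```

The positions returned by the search are what the vector store maps back to stored chunks, which is the role of the `index_to_docstore_id` mapping in the notebook code.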

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses all-MiniLM-L6-v2 sentence embeddings, which run locally, for compact general-purpose text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file.

Import libraries

In [1]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from pathlib import Path
from sentence_transformers import SentenceTransformer
import os
import json
import pandas as pd
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
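# Read the Groq API key from the environment instead of hard-coding it in the notebook.
if "GROQ_API_KEY" not in os.environ:
    raise RuntimeError("Set the GROQ_API_KEY environment variable before running this cell.")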
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0.0,
    max_retries=2,
)



Load and preview the CSV file

In [33]:
file_path = 'hotel_data_grouped.csv'  # insert the path of the CSV file
data = pd.read_csv(file_path)

# Drop the unnamed index column, then preview the first rows
data = data.drop(columns='Unnamed: 0')
data.head()

|   | depature_date | hotel_name | reg | types | room-type | Sales Person | sales |
|---|---|---|---|---|---|---|---|
| 0 | 2024-01-04 | Le Méridien, Sabah | East | Resort Hotel | Deluxe Room | Derek | 697.5 |
| 1 | 2024-01-04 | Le Méridien, Sabah | East | Resort Hotel | Single Room | Leo | 542.5 |
| 2 | 2024-01-04 | Lexis Suites, Penang | North | City Hotel | Deluxe Room | Amy | 657.5 |
| 3 | 2024-01-04 | The Hilton, Kuala Lumpur | Central | City Hotel | Deluxe Room | Amy | 1987.5 |
| 4 | 2024-01-04 | The Hilton, Kuala Lumpur | Central | City Hotel | Single Room | Derek | 542.5 |


Load and process the CSV data

In [34]:
loader = CSVLoader(file_path=file_path)
docs = loader.load_and_split()

Initialize the FAISS vector store and the all-MiniLM-L6-v2 embeddings

In [36]:
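# Wrap the model in a LangChain Embeddings object (requires the langchain-huggingface package),
# since passing a bare function as `embedding_function` is deprecated.
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
# Size the index from a sample embedding (384 dimensions for all-MiniLM-L6-v2)
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)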


Add the split CSV data to the vector store

In [37]:
vector_store.add_documents(documents=docs)

['919af8b0-1fbf-49a9-9bb9-0e986acd9d93',
 'e6e91a17-b9ae-4664-ad43-ebcb9935d828',
 'fe082c76-ee9f-4b86-bbb6-c30be8d9c070',
 ...]

In [None]:
vector_store.save_local("Analysis_FAISS")
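
The notebook imports `ChatPromptTemplate`, `create_stuff_documents_chain`, and `create_retrieval_chain` but stops after saving the index. A minimal sketch of how the retriever and the question-answering chain could be wired together follows. The helper name `build_rag_chain`, the prompt wording, and the default `k=4` are assumptions for illustration and are not part of the original notebook.

```python
def build_rag_chain(vector_store, llm, k=4):
    """Build a retrieval chain that answers questions from the indexed CSV rows."""
    # Imported here so the snippet is self-contained; the notebook's first
    # cell already imports these names.
    from langchain_core.prompts import ChatPromptTemplate
    from langchain.chains import create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain

    # Retrieve the k most similar chunks for each question.
    retriever = vector_store.as_retriever(search_kwargs={"k": k})

    # The retrieved chunks are inserted ("stuffed") into {context}.
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the question using only the context below.\n\n{context}"),
        ("human", "{input}"),
    ])

    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, combine_docs_chain)
```

With the `vector_store` and `llm` objects from the cells above, `build_rag_chain(vector_store, llm).invoke({"input": "Which hotel had the highest sales?"})["answer"]` would return the model's answer as a string; the question is only an example.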



