In [3]:
!pip install sentence_transformers faiss-cpu -q

In [1]:
import glob
file_paths = glob.glob("./data/*")
type(file_paths)

list

In [2]:
file_paths

['./data/The Picture of Dorian Gray',
 './data/Dracula',
 './data/Pride and Prejudice',
 './data/Alice’s Adventures in Wonderland',
 './data/The Adventures of Sherlock Holmes']

In [4]:
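documents = []
for path in file_paths:
    # Read as UTF-8 explicitly; the platform default encoding may differ.
    with open(path, "r", encoding="utf-8") as file:
        text = file.read()
    # Treat each blank-line-separated paragraph as one chunk.
    chunks = text.split("\n\n")
    documents.extend(chunks)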

In [5]:
len(documents)

10083

In [6]:
documents[0]

'\ufeffThe Project Gutenberg eBook of The Picture of Dorian Gray\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.'

In [7]:
cleaned_chunks = [chunk.strip() for chunk in documents if chunk.strip()]

In [8]:
len(cleaned_chunks)

9856
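
The two steps above (splitting on blank lines, then stripping and dropping empty chunks) can be folded into one helper. The sketch below is a hypothetical `split_into_chunks` function, not part of the original notebook; it additionally removes the leading byte-order mark (`\ufeff`) that is visible at the start of `documents[0]`.

```python
def split_into_chunks(text: str) -> list[str]:
    """Split text into paragraph chunks on blank lines.

    Hypothetical helper (not from the notebook): strips a leading BOM,
    trims whitespace, and drops chunks that are empty after trimming.
    """
    text = text.lstrip("\ufeff")
    chunks = (chunk.strip() for chunk in text.split("\n\n"))
    return [chunk for chunk in chunks if chunk]


sample = "\ufeffFirst paragraph.\n\n\n\n  Second paragraph.  \n\n"
print(split_into_chunks(sample))  # ['First paragraph.', 'Second paragraph.']
```

Paragraph-sized chunks vary a lot in length, so a real pipeline might also merge very short chunks or cap long ones; that is left out here.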

In [10]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(cleaned_chunks)


In [11]:
import faiss
import numpy as np

dimension = embeddings.shape[1]  # Size of each embedding vector
index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) index using L2 distance
index.add(np.asarray(embeddings, dtype=np.float32))  # FAISS expects float32 vectors
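
To make the behaviour of the flat index concrete: `IndexFlatL2` compares the query against every stored vector and ranks the results by squared Euclidean distance. The pure-Python sketch below mirrors that idea on toy data. `flat_l2_search` is an illustrative name, and the code is far slower than FAISS; it is meant only as an aid to understanding.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors nearest to query.

    Illustrative stand-in for an exact (flat) L2 index: every stored
    vector is scored, then the k smallest distances are kept.
    """
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    nearest = heapq.nsmallest(k, scored)
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    return distances, indices


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 0.5], k=2)
print(toy_indices)    # [1, 0]
print(toy_distances)  # [0.25, 1.25]
```

Because every vector is scored, search time grows linearly with the number of chunks, which is usually acceptable at the scale of a few thousand paragraphs.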

In [12]:
import pickle
faiss.write_index(index, "faiss_index.bin")
with open("chunks.pkl", "wb") as f:
    pickle.dump(cleaned_chunks, f)

In [13]:
# Load the index and chunks
index = faiss.read_index("faiss_index.bin")
with open("chunks.pkl", "rb") as f:
    cleaned_chunks = pickle.load(f)
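
Pickle works here, but loading a pickle can execute code embedded in the file, so it is only safe for files you created yourself. Since `cleaned_chunks` is a plain list of strings, JSON is one possible alternative. The sketch below is an assumption about how that could look; the helper names and the `chunks.json` file name are hypothetical.

```python
import json
import os
import tempfile


def save_chunks(chunks, path):
    """Write a list of text chunks to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)


def load_chunks(path):
    """Read the list of text chunks back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round-trip demo in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    chunks_path = os.path.join(tmp_dir, "chunks.json")
    save_chunks(["“But how,” said I", "The Count smiled"], chunks_path)
    restored = load_chunks(chunks_path)

print(restored == ["“But how,” said I", "The Count smiled"])  # True
```

Whichever format is used, the saved list must keep the same order as the vectors added to the index, because search results refer to chunks by position.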

In [15]:

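# Reload the same embedding model that was used to build the index
model = SentenceTransformer('all-MiniLM-L6-v2')
query = "can it have remained so long undiscovered, when there is a sure index to it if men will but take the trouble to look?"
# encode() takes a list of texts; the result has one row per query
query_embedding = model.encode([query])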

In [16]:
k = 5  # Number of chunks to retrieve
distances, indices = index.search(np.array(query_embedding), k)
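
`index.search` returns two arrays of shape `(number_of_queries, k)`: the distances and the row positions of the nearest vectors. A small helper can pair them with the stored text. The sketch below uses plain nested lists in place of the NumPy arrays, and `pair_results` is a hypothetical name. It also skips the `-1` positions that FAISS uses as padding when fewer than `k` neighbours are available.

```python
def pair_results(distances, indices, chunks):
    """Pair each hit for the first query with its distance and text.

    distances and indices are 2-D (one row per query), as returned by
    a FAISS search; only row 0 is used. Positions equal to -1 mean
    "no result" and are skipped.
    """
    results = []
    for dist, pos in zip(distances[0], indices[0]):
        if pos == -1:
            continue
        results.append((float(dist), chunks[int(pos)]))
    return results


# Toy stand-ins for the real search output (values are made up).
toy_chunks = ["alpha", "beta", "gamma"]
toy_distances = [[0.12, 0.57, 9.99]]
toy_indices = [[2, 0, -1]]

for dist, text in pair_results(toy_distances, toy_indices, toy_chunks):
    print(f"{dist:.2f}  {text}")
```

Keeping the distance next to each chunk makes it easier to judge how close a match is, or to drop weak matches before passing text to a language model.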

In [None]:
# Retrieve the corresponding text chunks
retrieved_chunks = [cleaned_chunks[i] for i in indices[0]]

# Print or use the retrieved chunks (e.g., pass to a language model)
for chunk in retrieved_chunks:
    print(chunk)

“But how,” said I, “can it have remained so long undiscovered, when
there is a sure index to it if men will but take the trouble to look?”
The Count smiled, and as his lips ran back over his gums, the long,
sharp, canine teeth showed out strangely; he answered:--
“But if a woman is partial to a man, and does not endeavor to conceal
it, he must find it out.”
Later in the day I got together the whole crew, and told them, as they
evidently thought there was some one in the ship, we would search from
stem to stern. First mate angry; said it was folly, and to yield to such
foolish ideas would demoralise the men; said he would engage to keep
them out of trouble with a handspike. I let him take the helm, while the
rest began thorough search, all keeping abreast, with lanterns: we left
no corner unsearched. As there were only the big wooden boxes, there
were no odd corners where a man could hide. Men much relieved when
search over, and went back to work cheerfully. First mate scowled, but
said