# Search Directory Tutorial

This notebook demonstrates how to use the `SearchDirectory` class to extract text from a directory of documents, chunk and embed that text, and query the result with a FAISS index.

In [18]:
import os
import shutil
from file_processing import SearchDirectory

1. Get text information from documents in a directory

In [11]:
# directory where the report, chunks, embeddings, and FAISS index are saved
search_dir_path = "docs/sample_search_docs"

# create a SearchDirectory object
search = SearchDirectory(search_dir_path)

# directory containing the files to extract text from
resource_path = "tests/resources/similarity_test_files"

# generate a CSV report that contains the text information
search.report_from_directory(resource_path)

Processing files: 20 files completed [00:00, 73.99 files completed/s]
Processing batches: 1 batches completed [00:00,  3.70 batches completed/s]


2. Chunk the text data from the report

In [12]:
search.chunk_text()

Total rows (excluding header): 20


Processing rows: 100%|██████████| 20/20 [00:00<00:00, 19972.88it/s]

Chunking complete and saved to 'data_chunked.csv'.
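
Chunking splits each document's text into smaller pieces so that each piece can be embedded on its own. This notebook does not show how `chunk_text` decides where to split, so the sketch below illustrates one common approach, fixed-size word windows with overlap, rather than the library's actual method. The function `chunk_words` and its parameters `chunk_size` and `overlap` are hypothetical names, not part of `file_processing`.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping windows of words (illustrative only)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.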

3. Specify the embedding model to use

In [13]:
search.load_embedding_model("paraphrase-MiniLM-L3-v2")

4. Perform embeddings on the chunked text data

In [14]:
search.embed_text(row_start=0, row_end=-1, batch_size=20)

100%|██████████| 20/20 [00:00<00:00, 48.06it/s]


Embedding batch complete and saved to embeddings (0-20).npy').


100%|██████████| 20/20 [00:00<00:00, 44.12it/s]


Embedding batch complete and saved to embeddings (20-40).npy').


100%|██████████| 20/20 [00:00<00:00, 50.26it/s]


Embedding batch complete and saved to embeddings (40-60).npy').


100%|██████████| 16/16 [00:00<00:00, 49.74it/s]

Embedding batch complete and saved to embeddings (60-76).npy').
Embeddings combined and saved to embeddings.npy
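
The log above shows 76 chunk rows embedded in four batches: (0-20), (20-40), (40-60), and (60-76). The sketch below reproduces that batching arithmetic. It assumes that `row_end=-1` means "through the last row", which is inferred from the output rather than from the library's documentation, and `batch_ranges` is a hypothetical helper.

```python
def batch_ranges(n_rows, row_start=0, row_end=-1, batch_size=20):
    """Yield half-open (start, end) row ranges covering row_start..row_end."""
    # Assumption: row_end=-1 means "through the last row".
    end = n_rows if row_end == -1 else min(row_end, n_rows)
    for start in range(row_start, end, batch_size):
        # The final batch may be shorter than batch_size.
        yield start, min(start + batch_size, end)


print(list(batch_ranges(76)))
# [(0, 20), (20, 40), (40, 60), (60, 76)]
```

Saving each batch separately, as the log suggests the library does, lets a long embedding run resume from the last completed batch instead of starting over.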

5. Create the FAISS index

In [15]:
search.create_flat_index()
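
A flat FAISS index stores every vector uncompressed and answers a query by comparing it against all of them, so results are exact but search time grows linearly with the number of vectors. The pure-Python sketch below shows that exhaustive search using squared L2 distance. The metric is an assumption, since the notebook does not say which one `create_flat_index` uses, and `flat_search` is a hypothetical name.

```python
import heapq


def flat_search(vectors, query, k=3):
    """Return the k (distance, row) pairs closest to query by squared L2 distance."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), row)
        for row, vec in enumerate(vectors)
    )
    # nsmallest scans every candidate: this is the "flat" (brute-force) part.
    return heapq.nsmallest(k, scored)


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
vectors = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.2), (5.0, 5.0)]
for distance, row in flat_search(vectors, query=(1.0, 1.0), k=2):
    print(row, round(distance, 3))
```

For a collection this small (76 chunks), exhaustive search is typically fast enough that an approximate index would add complexity without a noticeable speedup.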

6. Query the FAISS index and find the most similar documents

In [16]:
query = "What is the meaning of life, the universe, and everything?"
search.search(query, k=3)

Unnamed: 0,file_path,content
10,tests\resources\similarity_test_files\climate_...,The earth's climate is naturally variable on a...
43,tests\resources\similarity_test_files\history_...,"The Huron-Wendat of the Great Lakes Region, li..."
56,tests\resources\similarity_test_files\origin_o...,"Origin of the name ""Canada""\nToday, it seems i..."
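
The values in the first column (10, 43, 56) appear to be chunk row numbers rather than document numbers, so a larger `k` could return several chunks from the same file. The sketch below shows one way to reduce chunk-level hits to one entry per file. The helper `best_hit_per_file` and the file names in the example are hypothetical.

```python
def best_hit_per_file(hits):
    """Keep the highest-ranked chunk for each file, preserving rank order.

    hits: iterable of (chunk_row, file_path) pairs, best match first.
    """
    seen = {}
    for chunk_row, file_path in hits:
        # setdefault keeps the first (best-ranked) chunk seen for each file.
        seen.setdefault(file_path, chunk_row)
    return list(seen.items())


# Hypothetical hits: two chunks from a.txt and one from b.txt.
hits = [(10, "a.txt"), (43, "b.txt"), (12, "a.txt")]
print(best_hit_per_file(hits))
# [('a.txt', 10), ('b.txt', 43)]
```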


Clean up created files

In [19]:
os.remove("docs/sample_search_docs/report.csv")
os.remove("docs/sample_search_docs/data_chunked.csv")
os.remove("docs/sample_search_docs/setup_data.json")
shutil.rmtree("docs/sample_search_docs/embedding_batches")
os.remove("docs/sample_search_docs/embeddings.npy")
os.remove("docs/sample_search_docs/index.faiss")
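
The `os.remove` calls above raise `FileNotFoundError` if an earlier step was skipped and its output file was never created. A more tolerant cleanup could look like the sketch below. The helper `clean_search_dir` is hypothetical; the artifact names are the ones listed in the cleanup cell.

```python
import shutil
from pathlib import Path

# Artifact names taken from the cleanup cell above.
ARTIFACT_FILES = ["report.csv", "data_chunked.csv", "setup_data.json",
                  "embeddings.npy", "index.faiss"]
ARTIFACT_DIRS = ["embedding_batches"]


def clean_search_dir(search_dir):
    """Delete generated artifacts, skipping any that do not exist."""
    root = Path(search_dir)
    for name in ARTIFACT_FILES:
        (root / name).unlink(missing_ok=True)  # missing_ok needs Python 3.8+
    for name in ARTIFACT_DIRS:
        shutil.rmtree(root / name, ignore_errors=True)
```

Calling `clean_search_dir("docs/sample_search_docs")` would then remove the same files as the cell above, and running it twice is harmless.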