# Search Directory Tutorial

This notebook demonstrates how to use the `SearchDirectory` class to create embeddings and query a set of documents using a FAISS index.

The process can be broken down into the following steps:
1. Get text information from documents in a directory
2. Chunk the text data
3. Load an embedding model
4. Embed the chunked text
5. Create a FAISS index
6. Use a query to search the FAISS index

In [1]:
import os
import shutil
from file_processing import SearchDirectory

### 1. Get text information from documents in a directory

This step can be skipped if you already have a CSV containing the document names/file paths and the text data.

In [2]:
# specify the directory where the chunks, embeddings, and FAISS index will be saved
search_dir_path = "docs/sample_search_docs"

# create a SearchDirectory object
search = SearchDirectory(search_dir_path)

In [3]:
# specify the path with the files to extract text from
resource_path = "tests/resources/similarity_test_files"

# generate a CSV report that contains text information
search.report_from_directory(resource_path)

Processing files: 20 files completed [00:00, 70.91 files completed/s]
Processing batches: 1 batches completed [00:00,  3.54 batches completed/s]
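If the text already exists outside of `report_from_directory`, a CSV of file paths and text can be assembled by hand. The sketch below uses only the standard library. The helper name `write_text_csv` and the column names `path` and `content` are assumptions chosen to mirror the arguments passed to `chunk_text` in the next step; the columns that `SearchDirectory` writes to its own `report.csv` are not shown in this tutorial.

```python
import csv
import tempfile
from pathlib import Path


def write_text_csv(source_dir, csv_path, path_col="path", text_col="content"):
    """Write one CSV row per .txt file in source_dir: its path and its full text.

    Hypothetical helper, not part of file_processing.
    """
    files = sorted(Path(source_dir).glob("*.txt"))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[path_col, text_col])
        writer.writeheader()
        for file in files:
            writer.writerow({
                path_col: str(file),
                text_col: file.read_text(encoding="utf-8"),
            })
    return len(files)


# Usage: build a small CSV from two throwaway text files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("first document", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("second document", encoding="utf-8")
    n_rows = write_text_csv(tmp, Path(tmp) / "my_report.csv")
    print(f"wrote {n_rows} rows")
```

The `csv` module quotes any text that contains commas or newlines, so document content survives the round trip unchanged.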


### 2. Chunk the text data from the report

With no arguments, `chunk_text` uses the `report.csv` generated in the previous step. Alternatively, pass the path of another CSV file containing text data, along with the names of its file-path and text-content columns.

In [4]:
# generate chunks from the report
search.chunk_text()

Total rows (excluding header): 20


Processing rows: 100%|██████████| 20/20 [00:00<00:00, 6652.35it/s]

Chunking complete and saved to 'data_chunked.csv'.





In [None]:
# generate chunks from a CSV file
search.chunk_text("tests/resources/document_search_test_files/report_modified.csv",
                  "path",
                  "content")
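This tutorial does not describe how `chunk_text` splits a document, so the following is only a sketch of one common strategy: fixed-size word windows that overlap, so that a sentence cut at a boundary still appears whole in at least one chunk. The function name `chunk_words` and its parameters are hypothetical and are not part of `SearchDirectory`.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into windows of chunk_size words.

    Each window repeats the last `overlap` words of the previous one.
    Illustrative only; the real chunking rule may differ.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


for chunk in chunk_words("a b c d e f g h", chunk_size=5, overlap=2):
    print(chunk)
```

Smaller chunks give more precise matches but more vectors to embed and search; the overlap trades some redundancy for fewer ideas split across chunk boundaries.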

### 3. Specify the embedding model to use

In [5]:
search.load_embedding_model("paraphrase-MiniLM-L3-v2")

### 4. Perform embeddings on the chunked text data

By default, the embedding task is split into batches, and each batch is saved so that progress is preserved during long computations. The example below sets `batch_size` to 20 chunks. The work can be narrowed further by specifying the start and end chunks (`row_start` and `row_end`). Chunks that have already been saved are not recomputed.

Once all the chunks have been computed and saved, they are combined and saved to `embeddings.npy`.

In [6]:
search.embed_text(row_start=0, row_end=-1, batch_size=20)

Embeddings combined and saved to embeddings.npy
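The batch-and-resume behaviour described above can be sketched without any embedding library. Everything here is an assumption made for illustration: the helper `embed_in_batches`, the `batch_<start>_<end>.json` file naming, the use of JSON rather than `.npy` files, and the stand-in `toy_embed` function. The actual layout of `embedding_batches` is not shown in this tutorial.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(text):
    """Stand-in for a real embedding model: a 2-dimensional 'vector'."""
    return [float(len(text)), float(text.count(" "))]


def embed_in_batches(chunks, embed_fn, batch_dir, batch_size=20):
    """Embed chunks batch by batch, skipping batches already saved on disk.

    Returns (all_vectors, number_of_batches_computed_in_this_call).
    """
    batch_dir = Path(batch_dir)
    batch_dir.mkdir(parents=True, exist_ok=True)
    bounds = [(s, min(s + batch_size, len(chunks)))
              for s in range(0, len(chunks), batch_size)]

    computed = 0
    for start, end in bounds:
        batch_file = batch_dir / f"batch_{start}_{end}.json"
        if batch_file.exists():
            continue  # saved by an earlier run, so do not recompute
        vectors = [embed_fn(chunk) for chunk in chunks[start:end]]
        batch_file.write_text(json.dumps(vectors))
        computed += 1

    # Combine the saved batches in order once every batch exists.
    combined = []
    for start, end in bounds:
        batch_file = batch_dir / f"batch_{start}_{end}.json"
        combined.extend(json.loads(batch_file.read_text()))
    return combined, computed


with tempfile.TemporaryDirectory() as tmp:
    chunks = [f"chunk number {i}" for i in range(45)]
    _, first_run = embed_in_batches(chunks, toy_embed, tmp)
    _, second_run = embed_in_batches(chunks, toy_embed, tmp)
    print(first_run, second_run)
```

Because each batch is written as soon as it is computed, an interrupted run loses at most one batch of work, and a rerun only computes the batches that are missing.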


### 5. Create the FAISS index

Several types of FAISS index can be created, each with its own hyperparameters. Creating and using FAISS indexes is demonstrated in more depth in `faiss_demo.ipynb`. This class uses the same methods as that demo but always saves the FAISS index after creating it.

In [7]:
search.create_flat_index()
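Conceptually, a flat index stores every vector unchanged and answers a query by comparing it against all of them, which is exact but scales linearly with the number of chunks. The sketch below imitates that behaviour in plain Python with squared L2 distance. It is not the FAISS API, and the class name `FlatL2Index` is hypothetical; the distance metric used by `create_flat_index` is also not stated in this tutorial, so L2 is an assumption.

```python
import heapq


class FlatL2Index:
    """Toy exhaustive-search index: keeps all vectors and scans them per query."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vectors):
        for vector in vectors:
            if len(vector) != self.dim:
                raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
            self.vectors.append(list(vector))

    def search(self, query, k):
        """Return up to k (squared_distance, vector_id) pairs, closest first."""
        scored = (
            (sum((a - b) ** 2 for a, b in zip(vector, query)), idx)
            for idx, vector in enumerate(self.vectors)
        )
        return heapq.nsmallest(k, scored)


index = FlatL2Index(dim=2)
index.add([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
for distance, idx in index.search([0.9, 0.0], k=2):
    print(idx, round(distance, 2))
```

Other index types trade some of this exactness for speed, which is where the extra hyperparameters mentioned above come in.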

### 6. Query the FAISS index and find the most similar documents

Specify a query and the number of similar chunks to return (plus any hyperparameters required by the FAISS index in use). The search returns a data frame of the most similar chunks, according to the embedding model and FAISS index used.

In [8]:
query = "What is the meaning of life, the universe, and everything?"
search.search(query, k=3)

| Unnamed: 0 | file_path | content |
|---|---|---|
| 10 | tests\resources\similarity_test_files\climate_... | The earth's climate is naturally variable on a... |
| 43 | tests\resources\similarity_test_files\history_... | The Huron-Wendat of the Great Lakes Region, li... |
| 56 | tests\resources\similarity_test_files\origin_o... | Origin of the name "Canada"\nToday, it seems i... |
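An index search returns vector ids and distances, not text, so producing a table like the one above requires joining those ids back to the chunk rows. The sketch below shows that lookup on made-up data. The helper `hits_to_rows`, the example rows, and the `chunk_id` and `distance` keys are all hypothetical, and how `SearchDirectory.search` performs this step internally is not shown in this tutorial.

```python
def hits_to_rows(hits, chunk_rows):
    """Join (distance, chunk_id) hits to their chunk records, preserving hit order.

    Negative ids are skipped: FAISS uses -1 to pad results when fewer
    than k neighbours are available.
    """
    return [
        {"chunk_id": chunk_id, "distance": distance, **chunk_rows[chunk_id]}
        for distance, chunk_id in hits
        if chunk_id >= 0
    ]


# Made-up chunk table and search hits, for illustration only.
chunk_rows = [
    {"file_path": "a.txt", "content": "first chunk"},
    {"file_path": "a.txt", "content": "second chunk"},
    {"file_path": "b.txt", "content": "third chunk"},
]
hits = [(0.12, 2), (0.40, 0), (float("inf"), -1)]

for row in hits_to_rows(hits, chunk_rows):
    print(row["chunk_id"], row["file_path"], row["content"])
```

This only works if the order of the embeddings matches the order of the rows in the chunk table, which is why the chunked CSV and the embeddings file need to be kept in sync.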


Finally, clean up the files created by this tutorial.

In [9]:
# remove every file the tutorial wrote to search_dir_path
for name in ("report.csv", "data_chunked.csv", "setup_data.json",
             "embeddings.npy", "index.faiss"):
    os.remove(os.path.join(search_dir_path, name))

# remove the directory of per-batch embedding files
shutil.rmtree(os.path.join(search_dir_path, "embedding_batches"))