# Create Embeddings Vectorstore

This Colab script creates a [Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html) vector store from the specified **zip** file, which contains `JSON` files with splits of unstructured documents.

Use the following [script](https://github.com/gosha70/document-assistant/blob/main/embeddings/document_loader.py) to scan and process local documents. After cloning the [document-assistant](https://github.com/gosha70/document-assistant) repository, navigate into the cloned repo and run:

```
> python3 -m embeddings.document_loader --dir_path DOCUMENT_FOLDER --file_type FILE_TYPE --file_patterns FILE_NAME_PATTERN --persist_directory OUTPUT_FOLDER
```

Where:

* `--dir_path` - (optional) the root directory in which to look for documents. The default is `"."`.
* `--file_type` - the list of file extensions:
  * `md`
  * `java`
  * `xml`
  * `html`
  * `pdf`
* `--file_patterns` - (optional) file name patterns for each file type; for example: `--file_patterns "java:**/*Function.java" "java:**/*Api*"`
* `--persist_directory` - (optional) the path to the directory where the unstructured document splits are saved.
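
Each `--file_patterns` value pairs a file type with a glob pattern, separated by a colon. As a minimal sketch of how such values could be grouped by file type: `parse_file_patterns` below is a hypothetical helper, not part of `document_loader.py`, and the actual script may parse these values differently.

```python
from collections import defaultdict

def parse_file_patterns(values):
    """Group 'type:pattern' strings by file type (hypothetical helper)."""
    patterns = defaultdict(list)
    for value in values:
        # Split on the first colon only, so the pattern itself may contain colons
        file_type, sep, pattern = value.partition(":")
        if not sep or not file_type or not pattern:
            raise ValueError(f"Expected 'type:pattern', got: {value!r}")
        patterns[file_type].append(pattern)
    return dict(patterns)

patterns = parse_file_patterns(["java:**/*Function.java", "java:**/*Api*"])
print(patterns)
```

Both patterns from the example end up under the `java` key, so several patterns can be supplied for one file type.
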
<br/><br/>
<br/><br/>
>
> &copy; **EGOGE** - All Rights Reserved.
>
> _This software may be used and distributed according to the terms of the CC-BY-SA-4.0 license._
<br/><br/>

### Installation
---
Install the required Python libraries:


1.   [LangChain](https://python.langchain.com/docs/modules/chains/foundational/llm_chain)

2. [ChromaDB](https://docs.trychroma.com/)

3. [Huggingface](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceInstructEmbeddings.html)




In [None]:
# Install langchain and embedding libraries
!pip install langchain transformers tqdm sentence_transformers InstructorEmbedding huggingface huggingface_hub chromadb

### Map Google Drive
---
Mount the Google Drive that contains the zip file with the document splits.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

## Unzip Documents

Set the following variables:


*   `zip_path`: the full path to the zip file in the mounted Google Drive
*   `extract_dir`: the full path to the directory where extracted files are saved



In [3]:
import zipfile
import os
# Specify a full path to the zip file:
zip_path = '/content/drive/MyDrive/DOCS_ZIP.zip'

# Specify a directory to extract the zip file to:
extract_dir = '/content/drive/MyDrive/UNZIP_FOLDER'

# Specify a directory to cache the embedding data.
# Comment out this line if you want to disable the Transformer cache.
os.environ['TRANSFORMERS_CACHE'] = '/content/drive/MyDrive/TRANSFORMER_CACHE'

def unzip_documents(zip_file, unzip_folder):
  # Create the target directory if it doesn't exist
  os.makedirs(unzip_folder, exist_ok=True)

  # Open the zip file
  with zipfile.ZipFile(zip_file, 'r') as zip_ref:
      # Extract all the contents into the target directory
      zip_ref.extractall(unzip_folder)

# Unzip documents.
# If you run the script several times, comment out this line after the first run.
unzip_documents(zip_file=zip_path, unzip_folder=extract_dir)
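
Before extracting a large archive, it can help to confirm that it actually contains JSON splits. The sketch below counts `.json` entries without extracting anything. `count_json_entries` is a hypothetical helper, and the small in-memory archive exists only to keep the example self-contained; in the notebook you would pass `zip_path` instead.

```python
import io
import zipfile

def count_json_entries(zip_source):
    """Return the number of .json entries in a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return sum(1 for name in zip_ref.namelist() if name.lower().endswith(".json"))

# Build a tiny in-memory archive for demonstration purposes only
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_ref:
    zip_ref.writestr("splits/Main_0.json", "{}")
    zip_ref.writestr("splits/Main_1.json", "{}")
    zip_ref.writestr("README.txt", "not a split")
buffer.seek(0)

json_count = count_json_entries(buffer)
print(json_count)
```
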

## JSON to Document

A conversion utility that turns the textual representation of a `Document`, stored in a loaded **JSON** file, into an in-memory `Document`.

Here is an example of a *jsonified* `Document`:
```
{
    "lc": 1,
    "type": "constructor",
    "id": [
        "langchain_core",
        "documents",
        "base",
        "Document"
    ],
    "kwargs": {
        "page_content": "public class Main {}\n\n}",
        "metadata": {
            "source": "Main.java"
        }
    }
}
```
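
The conversion only relies on two fields of this structure: `kwargs.page_content` and `kwargs.metadata`. A minimal sketch of a structural check using only the standard library follows; `is_document_json` is a hypothetical helper and not part of LangChain.

```python
import json

def is_document_json(data):
    """Check that parsed JSON has the shape of a serialized Document (hypothetical helper)."""
    if not isinstance(data, dict):
        return False
    kwargs = data.get("kwargs")
    if data.get("type") != "constructor" or not isinstance(kwargs, dict):
        return False
    # page_content must be text; metadata is treated as optional but must be a mapping
    return isinstance(kwargs.get("page_content"), str) and isinstance(
        kwargs.get("metadata", {}), dict
    )

sample = json.loads("""
{
    "lc": 1,
    "type": "constructor",
    "id": ["langchain_core", "documents", "base", "Document"],
    "kwargs": {
        "page_content": "public class Main {}",
        "metadata": {"source": "Main.java"}
    }
}
""")
valid = is_document_json(sample)
invalid = is_document_json({"kwargs": {"metadata": {}}})
print(valid, invalid)
```
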

In [4]:
import json
from langchain_core.documents import Document

def load_document(file) -> Document:
    """
    Create a Document from the specified JSON file.

    Parameters:
    - file (File): the JSON file

    Returns (Document)
    """
    try:
        # Read and process the content as JSON
        json_content = file.read()

        # Parse the JSON content
        data = json.loads(json_content)

        # Access the "page_content" field
        page_content = data['kwargs']['page_content']

        # Access the "metadata" field
        metadata = data['kwargs']['metadata']

        # Transform the data into a langchain_core.documents.Document
        # Assuming the JSON structure fits the Document's requirements
        return Document(page_content=page_content, metadata=metadata)
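    except Exception as error:
        # Covers malformed JSON as well as missing "kwargs" fields
        name = getattr(file, 'name', file)
        print(f"File {name} is not a valid Document JSON: {str(error)}")

    return None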

## Create Embedding LLM

1. Update the device type in `EMBEDDING_KWARGS` based on the selected runtime environment: `cuda` or `cpu`.

2. Update `DEFAULT_MODEL_NAME` with the name of the embedding model. The chosen model must match the model that will be used later in the runtime application. Both models must have the same dimensionality, that is, the number of features encoded in each vector representation.
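
The dimensionality requirement can be illustrated with plain Python lists standing in for embedding vectors. This is a sketch that is not tied to any embedding library: `check_same_dimension` and `cosine_similarity` are hypothetical helpers, and the vectors are made-up values.

```python
import math

def check_same_dimension(index_vector, query_vector):
    """Raise if two embedding vectors differ in length (hypothetical helper)."""
    if len(index_vector) != len(query_vector):
        raise ValueError(
            f"Dimension mismatch: {len(index_vector)} vs {len(query_vector)}"
        )

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    check_same_dimension(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stored = [0.6, 0.8, 0.0]  # made-up vector produced at indexing time
query = [0.6, 0.8, 0.0]   # made-up vector produced at query time
similarity = cosine_similarity(stored, query)

try:
    # A 2-dimensional query cannot be compared with 3-dimensional stored vectors
    cosine_similarity(stored, [1.0, 0.0])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(similarity, mismatch_detected)
```

A similarity search compares the query vector with the stored vectors component by component, which is why a model with a different output size cannot query the store.
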


In [None]:
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from tqdm.auto import tqdm

# You have to choose GPU in order to use CUDA
EMBEDDING_KWARGS = {'device': 'cpu'} # 'cuda' or 'cpu'
ENCODE_KWARG = {'normalize_embeddings': True}

def create_embedding(model_name):
    return HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=EMBEDDING_KWARGS,
        encode_kwargs=ENCODE_KWARG
    )

DEFAULT_MODEL_NAME = "hkunlp/instructor-large"

# Create Embedding LLM
embedding = create_embedding(model_name=DEFAULT_MODEL_NAME)

## Create/Update Chroma Vectorstore


Customize the following variables:
*   `persist_directory`: the folder where the **Chroma** vectorstore is located
*   `collection_name`: the name of the **Chroma** vectorstore



In [6]:
import chromadb
from datetime import datetime
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma

def create_vectorestore(embedding, documents) -> Chroma:
  # Specify the location in Google Drive where the new Chroma vectorstore is saved.
  persist_directory = '/content/drive/MyDrive/CHROMA_DB/'
  # Specify the name of the new Chroma vectorstore
  collection_name="EGOGE_DOCUMENTS_DB"
  print(f"Creating the embedding vectorstore with {embedding}\n with {len(documents)} document splits ...")
  docs_db = Chroma.from_documents(
      documents=documents,
      collection_name=collection_name,
      embedding=embedding,
      persist_directory=persist_directory
  )

  # Save the newly created Chroma database
  print(f"Saving the Chroma vectorstore with {len(documents)} documents  ...")
  docs_db.persist()
  
  meta_inf_path = os.path.join(persist_directory, 'META-INF')
  manifest_file_path = os.path.join(meta_inf_path, 'MANIFEST.MF')
  
  # Get current UTC time and format it
  current_utc_datetime = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z' 

  # Manifest content: one "Key: Value" entry per line, without leading whitespace
  manifest_content = "\n".join([
      "Manifest-Version: 1.0",
      f"Created-On: {current_utc_datetime}",
      "Created-By: EGOGE (https://github.com/gosha70/document-assistant)",
      f"Collection Name: {collection_name}",
      "Embedding Class: langchain_community.embeddings.HuggingFaceInstructEmbeddings",
      f"Embedding Model Name: {DEFAULT_MODEL_NAME}",
  ]) + "\n"

  if not os.path.exists(meta_inf_path):
      os.makedirs(meta_inf_path)
      print(f"Created directory: {meta_inf_path}")

  with open(manifest_file_path, 'w') as file:
      file.write(manifest_content)
      print(f"Created MANIFEST.MF at: {manifest_file_path}")

  return docs_db

def process_documents_in_chunks(embedding, file_paths, chunk_size) -> Chroma:
  docs_db = None
  total_count = len(file_paths)
  # Process in chunks
  for i in range(0, len(file_paths), chunk_size):
      chunk = file_paths[i:i + chunk_size]
      documents = []

      # Process each file in the chunk
      for file_path in chunk:
          with open(file_path, 'r') as file:
              doc = load_document(file)
              if doc is not None:
                  documents.append(doc)

      total_batches = (total_count + chunk_size - 1) // chunk_size  # ceiling division
      batch_id = f"{i // chunk_size + 1}/{total_batches}"
      print(f"Processing the batch {batch_id}: {len(documents)} documents")
      if documents:
          if docs_db is None:
            docs_db = create_vectorestore(embedding=embedding, documents=documents)
          else:
            print(f"Updating the embedding vectorstore with {len(documents)} document splits ...")
            docs_db.add_documents(documents=documents)
            # Save the Chroma database after processing each chunk
            docs_db.persist()
          print(f"Finished the batch {batch_id}.")

  # docs_db stays None when no file could be loaded as a Document
  if docs_db is not None:
      docs_db.persist()
  return docs_db
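
The batching loop above slices the list of file paths into fixed-size chunks. A minimal, library-free sketch of the same slicing is shown below, with the batch count computed by ceiling division so that a list whose length is an exact multiple of the batch size does not report an extra batch. `iter_batches` is a hypothetical helper and the file names are made up.

```python
import math

def iter_batches(items, batch_size):
    """Yield (batch_number, total_batches, batch) tuples (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total_batches = math.ceil(len(items) / batch_size)
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, total_batches, items[start:start + batch_size]

files = [f"split_{n}.json" for n in range(5)]
batches = list(iter_batches(files, batch_size=2))
for number, total, batch in batches:
    print(f"Batch {number}/{total}: {batch}")
```
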

## Process All Documents

In [None]:
CHUNK_SIZE = 1000  # Number of files to process at a time

# Collect all file paths
file_paths = []
for dirpath, dirnames, filenames in os.walk(extract_dir):
    for file_name in filenames:
        full_path = os.path.join(dirpath, file_name)
        file_paths.append(full_path)

filesCount = len(file_paths)
print(f"Found {filesCount} files; they will be processed in batches of {CHUNK_SIZE} ...")

In [None]:
if filesCount > 0:
  chroma_db = process_documents_in_chunks(embedding, file_paths, CHUNK_SIZE)
  if chroma_db is not None:
    ids = chroma_db.get()["ids"]
    print(f"Finished the creation of the Chroma vectorstore with {len(ids)} documents:\n {chroma_db}")
  else:
    print("ERROR: None of the files could be loaded as a Document.")
else:
  print("ERROR: Cannot create the Chroma vectorstore without documents.")
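
The `MANIFEST.MF` file written next to the vector store records the collection name and the embedding model, which a runtime application needs in order to open the store with a matching model. A minimal sketch of reading those `Key: Value` lines back follows; `parse_manifest` is a hypothetical helper, and the sample text is illustrative rather than copied from a real run.

```python
def parse_manifest(text):
    """Parse 'Key: Value' lines into a dict (hypothetical helper)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ': ' only, so values such as URLs stay intact
        key, sep, value = line.partition(": ")
        if sep:
            entries[key] = value
    return entries

sample_manifest = """
  Manifest-Version: 1.0
  Created-By: EGOGE (https://github.com/gosha70/document-assistant)
  Collection Name: EGOGE_DOCUMENTS_DB
  Embedding Model Name: hkunlp/instructor-large
"""
manifest = parse_manifest(sample_manifest)
print(manifest["Collection Name"], manifest["Embedding Model Name"])
```
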