# Tutorial 4: Large similarity matrices

In this tutorial you are going to learn how to:

<div class="alert alert-block alert-warning">
    
**[Set up a batched similarity matrix](#Setup)**

**[Run the calculation](#Running-the-calculation)**

**[Write the file of most similar materials](#Most-similar-materials-search)**
    
</div>

Let's get started!

## Setup

When a large number of materials need to be compared to find the most similar ones, the computer's memory may not be able to hold the full matrix, or the similarity calculation may be so slow that a high degree of parallelization is needed to obtain results in a reasonable time.

For this purpose, a `BatchedSimilarityMatrix` can be used to distribute the calculation of the matrix over different processes. This is achieved by dividing the matrix into sub-matrices, where each sub-matrix contains the similarities between two _batches_ of fingerprints. This makes it possible to scale the calculation up to an arbitrary number of independent tasks.
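
To make the batching concrete, the following minimal sketch enumerates the batch ranges and the resulting sub-matrices for the numbers used in this tutorial (196 fingerprints, batch size 10). It is plain Python and does not call `madas`; the helper `batch_ranges` is hypothetical, and the sketch assumes that the similarity is symmetric, so that only one sub-matrix per unordered pair of batches is needed.

```python
from itertools import combinations_with_replacement


def batch_ranges(n_items, batch_size):
    """Return (start, stop) index pairs that split n_items into batches."""
    return [(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]


ranges = batch_ranges(n_items=196, batch_size=10)

# Assumption: similarity is symmetric, so each unordered pair of batches
# (including a batch paired with itself) needs only one sub-matrix.
pairs = list(combinations_with_replacement(ranges, 2))

print(f"{len(ranges)} batches, last batch covers {ranges[-1]}")
print(f"{len(pairs)} sub-matrices to calculate")
```

With these numbers the sketch yields 20 batches and 210 sub-matrices, each of which can in principle be calculated independently of the others.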

First, we load the database and extract the fingerprints. For demonstration purposes we use the PTE fingerprint here, but the same procedure works with any type of `Fingerprint`.

In [1]:
from madas import MaterialsDatabase

db = MaterialsDatabase(filename="CuPdAu-energies.db")
assert len(db) == 196, "Please run tutorial 1 first!"

In [2]:
fp_list = db.get_fingerprints("PTE")

  0%|          | 0/196 [00:00<?, ?it/s]

In [3]:
print(f"Got {len(fp_list)} fingerprints!")

Got 196 fingerprints!


Next, we need to serialize the fingerprints and write them to an external file.

In [4]:
import json

In [5]:
with open("data/CuPdAu_PTE_fingerprints.json", "w") as f:
    json.dump([fp.serialize() for fp in fp_list], f)

The next steps require us to import the `BatchedSimilarityMatrix`.

In [6]:
from madas.similarity import BatchedSimilarityMatrix

In the next step, we are going to generate the file structure that holds the large similarity matrix and split the large fingerprints file into smaller _batches_.

In [7]:
bsm = BatchedSimilarityMatrix(root_path="data", # relative path where the matrix is stored
                              matrix_folder_name="CuPdAu_bsm", # name of the folder that holds sub-matrices
                              # name of the initial fingerprint file
                              fingerprint_files_name="CuPdAu_PTE_fingerprints.json",  
                              batch_size=10, # size of a batch, i.e., the size of individual sub-matrices
                              n_tasks=2, # Number of individual tasks for calculating the matrix
                              load_from_file=True) # read matrix metadata from file

In [8]:
bsm.fingerprint_file_batches(bsm.fingerprint_files_name, fingerprint_file_path="data")

You can see how many fingerprint files have been created by:

In [9]:
!ls data/CuPdAu_bsm | grep json

CuPdAu_PTE_fingerprints.json_0_10.json
CuPdAu_PTE_fingerprints.json_100_110.json
CuPdAu_PTE_fingerprints.json_10_20.json
CuPdAu_PTE_fingerprints.json_110_120.json
CuPdAu_PTE_fingerprints.json_120_130.json
CuPdAu_PTE_fingerprints.json_130_140.json
CuPdAu_PTE_fingerprints.json_140_150.json
CuPdAu_PTE_fingerprints.json_150_160.json
CuPdAu_PTE_fingerprints.json_160_170.json
CuPdAu_PTE_fingerprints.json_170_180.json
CuPdAu_PTE_fingerprints.json_180_190.json
CuPdAu_PTE_fingerprints.json_190_196.json
CuPdAu_PTE_fingerprints.json_20_30.json
CuPdAu_PTE_fingerprints.json_30_40.json
CuPdAu_PTE_fingerprints.json_40_50.json
CuPdAu_PTE_fingerprints.json_50_60.json
CuPdAu_PTE_fingerprints.json_60_70.json
CuPdAu_PTE_fingerprints.json_70_80.json
CuPdAu_PTE_fingerprints.json_80_90.json
CuPdAu_PTE_fingerprints.json_90_100.json
CuPdAu_bsm_metadata.json


Having this prepared, we can continue by:

## Running the calculation

Before running the calculation, we decide how many individual tasks to use. This depends on the available resources, e.g., a compute cluster. In this example, we assume that the SLURM workload manager is available.

First, we need to write a Python script that is executed on each compute node:

```python
from madas.similarity import BatchedSimilarityMatrix
from madas.fingerprints.PTE_fingerprint import PTE_similarity

import os

MATRIX_ROOT_PATH = "<absolute path to the matrix folder>"

def main():
    n_tasks = int(os.getenv('SLURM_NTASKS')) # SLURM provides the number of tasks as an environment variable
    task_id = int(os.getenv('SLURM_PROCID')) # This is the id of the task that runs this script
    
    matrix = BatchedSimilarityMatrix(root_path=MATRIX_ROOT_PATH, matrix_folder_name="CuPdAu_bsm")
    
    # To make sure that the correct number of tasks and task id are used, we set them explicitly.
    # If the number of tasks is not set correctly, it can happen that some parts of the matrix
    # are not calculated.
    matrix.set_n_tasks(n_tasks)
    matrix.set_task_id(task_id)
    
    # Printing the number of matrices for each task helps to estimate the runtime
    print(f"Task {task_id} calculating {matrix.matrices_for_this_task} matrices.")

    # Here we start calculating all sub-matrices for this task
    matrix.calculate(PTE_similarity)

if __name__ == "__main__":
    main()
```

You can write this code to a file.
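
When the script is started outside of SLURM, `SLURM_NTASKS` and `SLURM_PROCID` are not set, `os.getenv` returns `None`, and `int(None)` raises a `TypeError`. The following is a minimal sketch of one way to guard against this for local test runs. The helper `read_task_layout` and its fallback to a single task are assumptions made for illustration and are not part of `madas`.

```python
import os


def read_task_layout(environ=None):
    """Return (n_tasks, task_id), defaulting to one task with id 0."""
    environ = os.environ if environ is None else environ
    n_tasks = int(environ.get("SLURM_NTASKS", "1"))
    task_id = int(environ.get("SLURM_PROCID", "0"))
    if not 0 <= task_id < n_tasks:
        raise ValueError(f"task_id {task_id} is outside 0..{n_tasks - 1}")
    return n_tasks, task_id


# Simulate what the second of two SLURM tasks would see.
print(read_task_layout({"SLURM_NTASKS": "2", "SLURM_PROCID": "1"}))
# Empty environment: falls back to a single task.
print(read_task_layout({}))
```

The returned values could then be passed to `set_n_tasks` and `set_task_id` exactly as in the script above.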

Next, we need a submission script that tells SLURM what to do:

```bash
#!/bin/bash
#SBATCH -p <name-of-queue>

# The name of the job:
#SBATCH --job-name="batched_similarity_matrix_calculation_tutorial"

#SBATCH -o simat_test.out
#SBATCH -e simat_test.err
# Here we use two different tasks
#SBATCH --ntasks=2 
#SBATCH --cpus-per-task=2
#SBATCH --tasks-per-node=1

#SBATCH --time=01:00:00

source <path-to-venv-where-madas-is-installed>

time srun python -u <name-of-python-runscript>
```

Since the matrix is small in this case, we can simulate the behaviour of the code by running the tasks one after the other in the notebook:

In [10]:
from madas.fingerprints.PTE_fingerprint import PTE_similarity

for task_id in range(bsm.n_tasks):
    bsm.set_task_id(task_id)
    print(f"Task {task_id} calculating {bsm.matrices_for_this_task} matrices.")
    bsm.calculate(PTE_similarity)

Task 0 calculating 105 matrices.
Task 1 calculating 105 matrices.
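
The count of 105 sub-matrices per task is what an even split of the work over two tasks would give. The sketch below reproduces these numbers in plain Python. The round-robin assignment is only an assumption chosen for illustration; `madas` may order or assign the sub-matrices differently.

```python
from itertools import combinations_with_replacement

n_batches = 20   # 196 fingerprints in batches of 10
n_tasks = 2

# All unordered pairs of batch indices, one per sub-matrix.
sub_matrices = list(combinations_with_replacement(range(n_batches), 2))

# Assumption: round-robin assignment; madas may distribute the work differently.
work = {task_id: sub_matrices[task_id::n_tasks] for task_id in range(n_tasks)}

for task_id, items in work.items():
    print(f"Task {task_id} would calculate {len(items)} sub-matrices.")
```

Whatever scheme is used, every sub-matrix has to be assigned to exactly one task, which is why the number of tasks must be set consistently for all processes.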


To make it possible to restart calculations, the `BatchedSimilarityMatrix` skips existing sub-matrices. To overwrite them, pass the keyword argument `overwrite=True`, i.e., `BatchedSimilarityMatrix().calculate(<similarity_function>, overwrite=True)`.

We can inspect what was created by this process by looking into the matrix folder:

In [11]:
!ls data/CuPdAu_bsm | grep npy

CuPdAu_bsm_0_10__0_10.npy
CuPdAu_bsm_0_10__100_110.npy
CuPdAu_bsm_0_10__10_20.npy
CuPdAu_bsm_0_10__110_120.npy
CuPdAu_bsm_0_10__120_130.npy
CuPdAu_bsm_0_10__130_140.npy
CuPdAu_bsm_0_10__140_150.npy
CuPdAu_bsm_0_10__150_160.npy
CuPdAu_bsm_0_10__160_170.npy
CuPdAu_bsm_0_10__170_180.npy
CuPdAu_bsm_0_10__180_190.npy
CuPdAu_bsm_0_10__190_196.npy
CuPdAu_bsm_0_10__20_30.npy
CuPdAu_bsm_0_10__30_40.npy
CuPdAu_bsm_0_10__40_50.npy
CuPdAu_bsm_0_10__50_60.npy
CuPdAu_bsm_0_10__60_70.npy
CuPdAu_bsm_0_10__70_80.npy
CuPdAu_bsm_0_10__80_90.npy
CuPdAu_bsm_0_10__90_100.npy
CuPdAu_bsm_100_110__100_110.npy
CuPdAu_bsm_100_110__110_120.npy
CuPdAu_bsm_100_110__120_130.npy
CuPdAu_bsm_100_110__130_140.npy
CuPdAu_bsm_100_110__140_150.npy
CuPdAu_bsm_100_110__150_160.npy
CuPdAu_bsm_100_110__160_170.npy
CuPdAu_bsm_100_110__170_180.npy
CuPdAu_bsm_100_110__180_190.npy
CuPdAu_bsm_100_110__190_196.npy
CuPdAu_bsm_10_20__100_110.npy
CuPdAu_bsm_10_20__10_20.npy
CuPdAu_bsm_10_20__110_120.npy
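
The file names in this listing appear to encode which two batches a sub-matrix covers. As a minimal sketch, assuming the pattern `<matrix name>_<row start>_<row stop>__<column start>_<column stop>.npy` inferred from the listing, the index ranges can be recovered as follows. The helper `parse_sub_matrix_name` is hypothetical and not part of `madas`.

```python
import re

# Assumed pattern, inferred from the listing: <name>_<r0>_<r1>__<c0>_<c1>.npy
PATTERN = re.compile(r"^(?P<name>.+)_(\d+)_(\d+)__(\d+)_(\d+)\.npy$")


def parse_sub_matrix_name(filename):
    """Return (matrix_name, (row_start, row_stop), (col_start, col_stop))."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    r0, r1, c0, c1 = (int(g) for g in match.groups()[1:])
    return match.group("name"), (r0, r1), (c0, c1)


print(parse_sub_matrix_name("CuPdAu_bsm_0_10__190_196.npy"))
```

Under this reading, `CuPdAu_bsm_0_10__190_196.npy` would hold the similarities between fingerprints 0 to 9 and fingerprints 190 to 195.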

## Most similar materials search
Now that the matrix has been calculated, we can search for the most similar materials. This step is also designed to run as multiple independent tasks, so we write a new Python script:

```python
from madas.similarity import BatchedSimilarityMatrix

import os

MATRIX_ROOT_PATH = "<absolute path to the matrix folder>"

def main():
    n_tasks = int(os.getenv('SLURM_NTASKS')) # SLURM provides the number of tasks as an environment variable
    task_id = int(os.getenv('SLURM_PROCID')) # This is the id of the task that runs this script
    
    matrix = BatchedSimilarityMatrix(root_path=MATRIX_ROOT_PATH, matrix_folder_name="CuPdAu_bsm")
    
    # To make sure that the correct number of tasks and task id are used, we set them explicitly.
    # If the number of tasks is not set correctly, it can happen that materials are missing.
    matrix.set_n_tasks(n_tasks)
    matrix.set_task_id(task_id)
    
    # Here we start searching for the most similar materials
    matrix.write_most_similar_materials_file(k=10)

if __name__ == "__main__":
    main()
```

For this example, we can use the same trick as above:

In [12]:
for task_id in range(bsm.n_tasks):
    bsm.set_task_id(task_id)
    bsm.write_most_similar_materials_file(k=10)

In [13]:
!ls data/CuPdAu_bsm | grep most_similar_materials

CuPdAu_bsm_most_similar_materials_0.txt
CuPdAu_bsm_most_similar_materials_1.txt


Looking at these files, we can see that each line contains the `mid` of the reference material together with a dictionary of the most similar materials, where the keys are the `mid`s and the values are the similarities. In our example, the majority of materials have $S = 1$, since they have very similar atomic compositions.

In [14]:
!tail -n 1 data/CuPdAu_bsm/CuPdAu_bsm_most_similar_materials_1.txt

{"OgU4E2hIjUyHVUnzkJPV-xIhDUT6": {"yYnCOkHlYfoBzVT6mvuZzHupD69w": "1.0", "y5trjCmfE9JyRF4m-fedgsqOUcnd": "1.0", "w7y48xkqP8ofPKwkCWJVWh3y6b4t": "1.0", "uJPjF9ywEVcyHQdl-05FnqLQSptF": "1.0", "t6R-4fS8leaO0RCMMj95xVT8J2nH": "1.0", "t5rZMgux78pnbeXxg0dK2tR6km9q": "1.0", "suIwtjsjtqfkNDdVAVh3pAGzUKDz": "1.0", "siPgEpTrlVw4AkLH8Nbuxy4NhYhi": "1.0", "sT0bXwaPdrAteiTUDAIAXEjGlmRS": "1.0", "s9a3rbvlgP5S7h9tfLVG1AJ0ZzOu": "1.0"}}
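
Such a line can be read back with the standard `json` module. The following minimal sketch assumes that every line of the file is a standalone JSON object with a single reference material, as the output above suggests. The sample line is shortened to the first two entries of that output.

```python
import json

# Shortened sample line, using the first two entries of the output above.
line = ('{"OgU4E2hIjUyHVUnzkJPV-xIhDUT6": '
        '{"yYnCOkHlYfoBzVT6mvuZzHupD69w": "1.0", '
        '"y5trjCmfE9JyRF4m-fedgsqOUcnd": "1.0"}}')

record = json.loads(line)
# Assumption: each record holds exactly one reference material.
(reference_mid, neighbours), = record.items()
# Similarities are stored as strings, so convert them to floats.
similarities = {mid: float(value) for mid, value in neighbours.items()}

print(reference_mid, similarities)
```

To process a whole file, the same steps could be applied to each line in turn.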
