<h1 align="center">ArcticDB Diagnostics Notebook</h1>



<img src="https://raw.githubusercontent.com/man-group/ArcticDB/master/static/ArcticDBCropped.png" alt="ArcticDB Logo" width="400">



This notebook provides diagnostics for **ArcticDB**. It helps you analyze and understand how your ArcticDB data is configured and stored, and reports in-depth structural insights and performance metrics.

With our diagnostics notebook, users can:
- Easily assess different options for each ArcticDB library
- Gain valuable insights into the structure and performance of their databases
- Receive personalized explanations for each option
- Make informed decisions about their database operations
- Enhance overall productivity and effectiveness

Whether you're troubleshooting an issue, optimizing your database, or just exploring your data, our diagnostics notebook for ArcticDB is an indispensable tool that simplifies complex database analysis. 

### Necessary packages installation

Run the following cell to install the packages required by the diagnostics notebook.

In [None]:
!pip install networkx
!pip install matplotlib
!pip install seaborn

<h1 align="center">Library level Diagnosis</h1>

### Setting Up the Library Object for Use Throughout This Notebook

From this point onward, we will be interacting with a specific library object. This library will be used for all the operations showcased in the rest of this Jupyter notebook.

- Please make sure to assign the library object to the "lib" variable.
- Example: lib = your_code_to_retrieve_library()

In [None]:
lib = None
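
As a minimal sketch of what `your_code_to_retrieve_library()` might look like, the helper below opens a library through the public `arcticdb.Arctic` entry point. The helper name `open_library`, the LMDB URI, and the library name are hypothetical placeholders; substitute the storage URI and library you actually use.

```python
def open_library(uri, library_name):
    """Return an ArcticDB library handle for the given storage URI (illustrative helper)."""
    # Imported inside the function so the helper can be defined before ArcticDB is installed.
    import arcticdb as adb

    ac = adb.Arctic(uri)
    # get_library raises if the library does not exist, which is the
    # desired behaviour when diagnosing an existing library.
    return ac.get_library(library_name)


# Example with a hypothetical URI and library name:
# lib = open_library("lmdb:///tmp/arcticdb_diagnostics_demo", "demo_library")
```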

<h2 align="center">Library Options Diagnosis</h2>


Gain comprehensive insights into your ArcticDB library options, and understand how they interact with other elements of your database. This knowledge can help you optimize your database performance and functionality, ensuring that your ArcticDB database is always running at its best.

#### Benefits:
- Understand all the options you've specifically set up for your library
- Gain insights into how they can affect your overall user experience with the database
- Optimize your database performance and functionality

#### Please, run the following cell

In [None]:
from arcticdb.diagnostics_engine import display_library_options
display_library_options(lib)

<h2 align="center">Symbol list consistency diagnosis</h2>


- This section is designed to address a specific challenge when working with your database library's symbol list. When the symbol_list option is enabled, the library employs a caching technique to store the list of symbols. This greatly enhances performance when using lib.list_symbols, as the function can quickly retrieve the symbol list from the cache instead of querying the database.

- However, due to certain edge cases, this caching mechanism can occasionally lead to inconsistencies between the cached list of symbols and the actual state of the database. These inconsistencies can cause problems when relying on the symbol list for database operations.

- This section verifies the consistency of the cached symbol list by comparing it with the result of a direct database query, thereby highlighting any discrepancies.
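
The comparison described above can be illustrated with plain Python sets. This is only a sketch of the idea, not the implementation of `symbol_list_consistency`; the function name `compare_symbol_lists` and the sample symbols are hypothetical.

```python
def compare_symbol_lists(cached_symbols, stored_symbols):
    """Report symbols that appear on only one side of the comparison."""
    cached = set(cached_symbols)
    stored = set(stored_symbols)
    return {
        # Symbols the cache still lists but that are absent from storage.
        "only_in_cache": sorted(cached - stored),
        # Symbols present in storage that the cache has missed.
        "only_in_storage": sorted(stored - cached),
        "consistent": cached == stored,
    }


report = compare_symbol_lists(
    cached_symbols=["AAPL", "MSFT", "TSLA"],
    stored_symbols=["AAPL", "MSFT", "GOOG"],
)
print(report)
```

An empty `only_in_cache` and `only_in_storage` means the two views agree.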

#### Please, run the following cell

In [None]:
from arcticdb.diagnostics_engine import symbol_list_consistency
%matplotlib inline
symbol_list_consistency(lib)

<h1 align="center">Symbol level Diagnosis</h1>

### Setting Up the Symbol Name and Version for Use Throughout This Notebook

From this point onward, we will be interacting with a specific symbol and version from your previously specified library. This symbol and version will be used for all the operations showcased in the rest of this Jupyter notebook.

Specify a symbol and an as_of:
- Please make sure to assign the symbol name to the "symbol" variable.
- Please make sure to assign the as_of version to the "as_of" variable. Specify it only if you want to evaluate a specific version; leave it as None to evaluate the latest version.
- Example: 
    - symbol = "my_symbol"
    - as_of = 3

In [2]:
symbol = None
as_of = None

<h2 align="center">Index Key structure</h2>

In this section we provide the data segment information for a specific version. A detailed breakdown of the data segments will be shown in a tabular format. Each row corresponds to a segment and contains the following information:

- ***Version_id***: The identifier of the version to which the data segment belongs (i.e. the version that first wrote it). Each version may have multiple data segments.
- ***Start_index***: The index at which the segment begins.
- ***End_index***: The index at which the segment ends.
- ***Creation_ts***: The timestamp indicating when the segment was created.
- ***Start_row, end_row***: These represent the range of rows included in the segment.
- ***Start_col, end_col***: These denote the range of columns included in the segment.

#### Please, run the following cell

In [None]:
from arcticdb.diagnostics_engine import display_index_key_structure 
%matplotlib inline
display_index_key_structure(lib, symbol, as_of)

<h2 align="center">Fragmentation diagnosis</h2>

- Fragmentation refers to poor space optimization when storing data. In our context, fragmentation implies that the storage space allocated to data segments within a version is not being efficiently utilized.

- When the total number of data points in individual data segments represents a small fraction of the maximum number of data points that can be accommodated per segment, it indicates fragmentation.

- From a performance perspective, fragmentation implies inefficient use of storage space. Not only does it lead to storing less data in more space, but it can also degrade the performance of data retrieval operations. 

- The extent of fragmentation for each data segment can be quantified as the ratio of the actual number of data points to the maximum possible number of data points per segment, expressed as a percentage. The closer this percentage is to 100%, the more efficiently the segment space is being utilized. Conversely, a lower percentage indicates higher fragmentation.
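
The ratio described above can be worked through with a few lines of Python. This is a sketch under stated assumptions: segment size is measured in rows, and the capacity of 100,000 rows per segment is an assumed value for illustration. Check your library options for the real limit. The function name `segment_utilization` is hypothetical.

```python
def segment_utilization(segment_row_counts, max_rows_per_segment=100_000):
    """Return each segment's fill level as a percentage of the assumed capacity."""
    return [100.0 * rows / max_rows_per_segment for rows in segment_row_counts]


# Three hypothetical segments: one full, one a quarter full, one nearly empty.
segment_rows = [100_000, 25_000, 500]
utilization = segment_utilization(segment_rows)

for rows, pct in zip(segment_rows, utilization):
    print(f"{rows:>7} rows -> {pct:.1f}% of segment capacity")
```

Segments with a low percentage, such as the last one here, are the ones that indicate fragmentation.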

#### Next, run the following cell to diagnose how fragmented your data is

In [None]:
from arcticdb.diagnostics_engine import fragmentation_diagnosis 
%matplotlib inline
fragmentation_diagnosis(lib, symbol, as_of)

<h2 align="center">Sortedness diagnosis</h2>


- The ArcticDB Sortedness Verification layer is designed to ensure the integrity and optimal performance of your data operations in an efficient manner.
- In data science and analytics, the order of data can be crucial for various operations such as time series analysis. Having data pre-sorted can significantly speed up these operations. 
- ArcticDB determines the order of your data by checking if your dataframes are sorted in an ascending or descending manner, or if they are unsorted.
    - During the initial write, the sorting status is stored with the dataframe. If neither of the flags (DataFrame.index.is_monotonic_increasing and DataFrame.index.is_monotonic_decreasing) is set to True, the dataframe's ordering state is marked as UNSORTED.
    - In situations where the dataframe is older (written when sorting status wasn't tracked) or the data isn't a pandas dataframe (and thus doesn't have the aforementioned flags), the ordering state is marked as UNKNOWN.
    - Further appends or updates to the dataframe follow a state transition path that depends on its current sorting status.
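
The classification of an index at write time can be sketched in plain Python. The function below mirrors the non-strict monotonicity checks that the pandas flags perform; it is an illustration rather than ArcticDB's own code, and the name `ordering_state` and the returned labels for sorted data are assumptions.

```python
def ordering_state(index_values):
    """Classify a sequence of index values by comparing each adjacent pair."""
    values = list(index_values)
    pairs = list(zip(values, values[1:]))
    # Non-strict comparisons: repeated index values do not break monotonicity.
    if all(a <= b for a, b in pairs):
        return "ASCENDING"
    if all(a >= b for a, b in pairs):
        return "DESCENDING"
    return "UNSORTED"


print(ordering_state([1, 2, 2, 5]))
print(ordering_state([9, 4, 1]))
print(ordering_state([3, 1, 2]))
```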

#### Next, run the following cell to diagnose the degree of sorting in your data

In [None]:
from arcticdb.diagnostics_engine import sortedness_diagnosis 
%matplotlib inline
sortedness_diagnosis(lib, symbol, as_of)

<h2 align="center">Version list</h2>

This section provides a comprehensive overview of your version list, focusing on the process of retrieving a specific version or the latest one. It illustrates the structure of the version list and offers useful information about the steps involved in locating a particular version.


#### Four scenarios are considered in the analysis:
1. ***Version Present***: If the version is included in the version list, the number of steps required to access it from the initial version key is provided. Each step represents a disk I/O operation, which can be costly for versions located further away.
2. ***Version Deleted***: In the scenario where a specific version is absent from the version list, the number of steps is calculated under two possible circumstances:
    - Either we stop traversing the list when we encounter an explicit deletion node (tombstone) for that version, indicating that the targeted version has been deleted, or
    - we stop when the end of the chain is reached.
    - In either case, the tool determines the number of snapshots for that library, which need to be traversed to verify if the deleted version exists. 
3. ***Version Compacted***: If a version is missing from the version list but one of the nodes contains its index key, it implies that the version has been compacted. The tool calculates the number of steps in the version chain needed to reach the node that contains the compacted versions.
4. ***Version in Snapshots Only***: If a version is not present or is marked as deleted in the version list but is found in any of the snapshots, the process involves traversing all the snapshots until the version is found. This operation can be costly as it requires examining the total number of snapshots in the library. In addition, the process includes navigating first through the version list, with each step involving corresponding I/O operations.
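
To make the step counting concrete, here is a toy model of a version chain walked from the newest node to the oldest. The node kinds, the tuple layout, and the function name `steps_to_version` are all hypothetical simplifications for illustration; they do not reflect ArcticDB's storage format, and the snapshot lookup is left out.

```python
def steps_to_version(chain, target):
    """Walk the chain newest-first and return (outcome, steps taken)."""
    outcome_by_kind = {
        "version": "present",
        "tombstone": "deleted",
        "compacted": "compacted",
    }
    for steps, (kind, versions) in enumerate(chain, start=1):
        # Each node visited stands for one read from storage.
        if target in versions:
            return outcome_by_kind[kind], steps
    # End of chain reached without finding the version or a tombstone for it.
    return "not_found", len(chain)


# Newest node first. Version 3 was deleted; versions 1 and 2 were compacted.
chain = [
    ("version", {5}),
    ("tombstone", {3}),
    ("version", {4}),
    ("compacted", {1, 2}),
]

for version in (5, 4, 3, 2, 7):
    print(version, steps_to_version(chain, version))
```

In this model, a result of `not_found` or `deleted` is the point at which snapshots would have to be searched.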

#### Next, run the following cell

In [None]:
from arcticdb.diagnostics_engine import version_list_diagnosis 
%matplotlib inline
version_list_diagnosis(lib, symbol, as_of)

<h2 align="center">Pickling</h2>


ArcticDB has some limitations on data storage that you should be aware of. When data cannot be normalized, it will be pickled.
- This section allows you to determine ahead of time whether the data you plan to store will need to be pickled for storage in the database.
- Data pickling is a process that converts Python objects into a byte stream, allowing complex data structures to be saved and reloaded for later use. However, pickling is not always the most efficient method for storing data, particularly when it comes to database operations.
- The types of data that ArcticDB supports without requiring pickling are:
    - Pandas DataFrames
    - NumPy arrays
    - Integers (including timestamps, though timezone information in timestamps is removed)
    - Floats
    - Bools
    - Strings (if written as part of a DataFrame/NumPy array)
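
For readers unfamiliar with pickling, the following sketch uses Python's standard `pickle` module to turn an object into a byte stream and back. The sample object is hypothetical; a nested dictionary was chosen because it is not among the natively supported types listed above.

```python
import pickle

# A hypothetical object that is neither a DataFrame nor a NumPy array.
data = {"tickers": {"AAPL", "MSFT"}, "window": (5, 20)}

# Serialize to an opaque byte stream, then restore the original object.
payload = pickle.dumps(data)
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == data)
```

The byte stream is opaque to the database, which is why pickled data cannot benefit from operations that rely on knowing the data's structure.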




#### Enter the code to retrieve the data you want to check for pickling
- Please make sure to assign the data object to the "data" variable.
- Example: data = your_code_to_retrieve_frame()

In [2]:
data = None

#### Next, run the following cell to check whether your data would need to be pickled

In [None]:
from arcticdb.diagnostics_engine import pickling_diagnosis 
pickling_diagnosis(lib, data)