

<img src="https://raw.githubusercontent.com/mahvin92/DNAcrypt-AI/main/assets/DNAcrypt%20logo/DNAcrypt-logo-3.png" height="200" align="center" style="height:340px">


# DNAcrypt-AI
**Genome-based cryptography using artificial intelligence.**


DNAcrypt-AI is an artificial intelligence system that uses the human genome as a high-entropy foundation for generating, encrypting, and decrypting passwords and cryptographic keys. It generates and encrypts a password or key by extending a kmer-based strategy to encode alphanumeric symbols and randomly building a genome vocabulary by sampling DNA coordinates from multiple human genome assemblies. It decrypts the information by reconstructing the corresponding DNA sequences using [FAS2rDNA](https://fas2rdna.chordexbio.com/). These sequences form a complex corpus used by [Covary](https://covary.chordexbio.com/) to extract deep sequence patterns through machine learning. The resulting vector embeddings are mapped back into the unique alphanumeric and symbolic encodings from the kmer dictionary; these, in turn, represent the type and sequence of characters encrypted in the genome vocabulary, forming the original password or key.


Version: v.0.1 (Beta, no longer supported)


Learn more by visiting our [website](https://dnacryptai.chordexbio.com) or [GitHub repo](https://github.com/mahvin92/DNAcrypt-AI)


This Colab version of DNAcrypt-AI was built through the [assemBold program by ChordexBio](https://assembold.chordexbio.com).

---
Versions:

[v.1.1](https://colab.research.google.com/drive/1GZfnlJke6Lql41d215hSnfn4-17KB6cQ) - improved version (data builder issue fixed)

[v.1.0](https://colab.research.google.com/github/mahvin92/DNAcrypt-AI/blob/main/ipynb/DNAcrypt-AI_v.1.0.ipynb) - foundational release (not supported anymore)

[v.0.1](https://colab.research.google.com/github/mahvin92/DNAcrypt-AI/blob/main/ipynb/DNAcrypt-AI_beta(v.0.1).ipynb) - beta version (not supported anymore)

In [None]:
# @title ## User configuration

import json
import os
from google.colab import files

INPUT_DIR = "/content/DNAcrypt" # please do not change
OUTPUT_DIR = "/content/DNAcrypt/outputs" # please do not change
project_name = "DNAcrypt project" # @param {"type":"string","placeholder":"e.g., My experiment"}
char_count = 12 # @param {"type":"number","placeholder":"12"}
# @markdown ####Use cases:
Password = True # @param {type:"boolean"}
Encryption = False # @param {type:"boolean"}
Decryption = False # @param {type:"boolean"}
input_type = 1 if Encryption else 0
# @markdown ####Kmer dictionary:
Default = True # @param {type:"boolean"}
Custom = False # @param {type:"boolean"}
kmer_type = 1 if Custom else 0

# @markdown ##### For decryption, upload your files below 👇 after a runtime is initiated.

if Decryption:
  input_type = 2
else:
  if Password == Encryption:
    if Password:
        print("❌ Error: Both 'Password' and 'Encryption' are selected. Please pick only one.")
    else:
        print("❌ Error: Neither 'Password' nor 'Encryption' is selected. Please pick one.")
    raise ValueError("Invalid use case selection")

  # Validity checks:
  if not (6 <= char_count <= 90):
    print(f"❌ Error: char_count ({char_count}) must be between 6 and 90.")
    # Stop execution in Colab cell
    raise ValueError("Invalid char_count")

  if Default == Custom:
    if Default:
        print("❌ Error: Both 'Default' and 'Custom' are selected. Please pick only one.")
    else:
        print("❌ Error: Neither 'Default' nor 'Custom' is selected. Please pick one.")
    raise ValueError("Invalid Kmer dictionary selection")

os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
%cd {OUTPUT_DIR}

input_length = char_count

data = {
    "project_name": project_name,
    "input_length": input_length,
    "input_type": input_type,
    "kmer_type": kmer_type
}

file_name = "config.json"

with open(file_name, 'w') as json_file:
    json.dump(data, json_file, indent=4)



#Decryption
if Decryption:
  !rm -rf /content/DNAcrypt/outputs/kmer_dict.json
  !rm -rf /content/DNAcrypt/userdata_input.txt
  !rm -rf /content/DNAcrypt/source_dict.json
  %cd /content/DNAcrypt

  print("Proceed with decrypting...")

  def is_valid_metadata(obj):
    if not isinstance(obj, dict):
        return False

    required_top = {"data", "userinput", "usecase"}
    if not required_top.issubset(obj):
        return False

    # data[]
    if not isinstance(obj["data"], list) or not obj["data"]:
        return False
    d = obj["data"][0]
    if not {"id", "sequence", "datetime"}.issubset(d):
        return False

    # userinput[]
    if not isinstance(obj["userinput"], list) or not obj["userinput"]:
        return False
    u = obj["userinput"][0]
    if not {"sample_id", "seq_id", "seq_loc", "description"}.issubset(u):
        return False

    # usecase[]
    if not isinstance(obj["usecase"], list) or not obj["usecase"]:
        return False
    if "type" not in obj["usecase"][0]:
        return False
    if obj["usecase"][0]["type"] not in (0, 1):
        return False

    return True

  def is_valid_kmer_dict(obj):
    if not isinstance(obj, dict):
        return False

    vocab = obj.get("kmer_vocab")
    if not isinstance(vocab, dict):
        return False

    for k, v in vocab.items():
        if isinstance(k, str) and isinstance(v, list):
            if all(isinstance(x, str) and len(x) == 5 for x in v):
                return True

    return False


  # Upload files here
  uploaded = files.upload()

  metadata_file = None
  kmer_dict_file = None

  for fname, raw in uploaded.items():
    if not fname.lower().endswith(".json"):
        print(f"⚠️ Skipped {fname}: not a valid JSON file")
        continue

    try:
        obj = json.loads(raw.decode("utf-8"))

        if is_valid_metadata(obj):
            target_name = "DNAcrypt_metadata.json"

            if fname != target_name:
                with open(target_name, "w") as f:
                    json.dump(obj, f, indent=2)
                metadata_file = target_name
            else:
                metadata_file = fname

            # Reconstitute DNAcrypt-required files
            metadata_path = "/content/DNAcrypt/DNAcrypt_metadata.json"
            source_json_out = "/content/DNAcrypt/source_dict.json"
            userdata_txt_out = "/content/DNAcrypt/userdata_input.txt"

            with open(metadata_path, "r") as f:
                metadata = json.load(f)

            source_dict = {
                "data": metadata.get("data", [])
            }

            with open(source_json_out, "w") as f:
                json.dump(source_dict, f, indent=2)

            userinput = metadata.get("userinput", [])

            if userinput:
                headers = list(userinput[0].keys())

                with open(userdata_txt_out, "w") as f:
                    f.write("\t".join(headers) + "\n")
                    for row in userinput:
                        # Cast to str so non-string values (e.g., numbers) do not break the join
                        f.write("\t".join(str(row.get(h, "")) for h in headers) + "\n")


        if is_valid_kmer_dict(obj):
            target_name = "kmer_dict.json"

            if fname != target_name:
                with open(target_name, "w") as f:
                    json.dump(obj, f, indent=2)
                kmer_dict_file = target_name

            else:
                kmer_dict_file = fname

            !mv /content/DNAcrypt/kmer_dict.json /content/DNAcrypt/outputs/kmer_dict.json

        else:

          # Determine correct static kmer dictionary to use
          metadata_path = "/content/DNAcrypt/DNAcrypt_metadata.json"

          if not os.path.exists(metadata_path):
              raise FileNotFoundError("❌ DNAcrypt_metadata.json not found")

          with open(metadata_path, "r") as f:
              metadata = json.load(f)

          usecase = metadata.get("usecase", [])
          if not usecase or "type" not in usecase[0]:
              raise ValueError("❌ Corrupted metadata: the use case type is missing")

          usecase_type = usecase[0]["type"] # use to rebuild dictionary after json folder is fetched

    except Exception as e:
        print(f"❌ Failed to read {fname}: {e}")


print(f"Proceeding to install DNAcrypt-AI. Please wait...")


In [None]:
# @title ## Install DNAcrypt-AI
%%capture

%cd /content/DNAcrypt/

install_dnacrypt = "https://github.com/mahvin92/DNAcrypt-AI.git"
!git init
!git remote add origin {install_dnacrypt}
!git config core.sparseCheckout true
!echo "json/*" >> .git/info/sparse-checkout
!echo "source/*" >> .git/info/sparse-checkout
!git pull origin main
!mv source/* . 2>/dev/null
!rm -rf source .git
!cp /content/DNAcrypt/json/assembly_dict.json /content/DNAcrypt/outputs

# Get dependencies
!pip install /content/DNAcrypt/install/codeenigma_runtime-1.2.0-py3-none-any.whl
!pip install pandas pyfaidx tqdm umap-learn
!apt-get update -qq
!apt-get install -y samtools

# Standard library
import builtins
import math
import os
import re
import sys
import zipfile
from datetime import datetime
from itertools import product

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import umap
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras import layers, models

In [None]:
# @title ##Workflow detection

import json
from pathlib import Path
import os
from google.colab import files

config_path = Path('/content/DNAcrypt/outputs/config.json')

# Encryption
if not Decryption:
  sys.path.append("/content/DNAcrypt")
  builtins.exit = sys.exit
  import dict_router

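else:
  # Use a static kmer dictionary only when no custom one was uploaded;
  # otherwise the uploaded kmer_dict.json would be overwritten.
  if kmer_dict_file is not None:
    print("Using the uploaded custom kmer dictionary.")

  elif usecase_type == 0:
    !cp /content/DNAcrypt/json/kmer_dict_key.json /content/DNAcrypt/outputs/kmer_dict.json

  elif usecase_type == 1:
    !cp /content/DNAcrypt/json/kmer_dict_enc.json /content/DNAcrypt/outputs/kmer_dict.json

  else:
    raise ValueError("❌ Invalid encryption data")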


In [None]:
# @title ## Genome encryption

if not Decryption:
  import json
  import sys
  from pathlib import Path

  sys.path.append("/content/DNAcrypt")
  builtins.exit = sys.exit
  import num_crypt
  import genome_keyring

In [None]:
# @title ## Sequence reconstruction (FAS2rDNA)

INPUT_DIR = "/content/fas2rdna" # please do not change
OUTPUT_DIR = "/content/fas2rdna/outputs" # please do not change
os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
!cp /content/DNAcrypt/userdata_input.txt /content/fas2rdna
%cd {INPUT_DIR}

# Initiate FAS2rDNA
sys.path.append("/content/DNAcrypt")
builtins.exit = sys.exit
import fas2rdna

# Append anchor
import append_anchor

In [None]:
# @title ## Machine learning (Covary)
%cd /content/

# Clone Covary-encoder
COVARY_DIR = "/content/Covary-encoder" # please do not change
os.makedirs(COVARY_DIR, exist_ok=True)

%cd Covary-encoder
!cp /content/fas2rdna/outputs/userdata_input.fasta /content/Covary-encoder/
os.rename("/content/Covary-encoder/userdata_input.fasta", "/content/Covary-encoder/input_seq.fasta")

# Pre-processing for Covary inference
sys.path.append("/content/DNAcrypt")
builtins.exit = sys.exit
import clean_seq

# Genetic encoding with Covary_encoder
sys.path.append("/content/DNAcrypt")
builtins.exit = sys.exit
import covary_encoder

# Machine learning with Covary-ML
print(f"Deconvoluting your data now on Covary. Please wait...")
sys.path.append("/content/DNAcrypt")
builtins.exit = sys.exit
import covary_ml

In [None]:
# @title ## Data builder

import os
import shutil
import json
from google.colab import files

# Prompt decrypted key
sys.path.append("/content/DNAcrypt")
builtins.exit = sys.exit
import dnacrypt_builder

if Decryption:
  print("Decryption finished ...")
else:
  sys.path.append("/content/DNAcrypt")
  builtins.exit = sys.exit
  import dnacrypt_compiler

# Instructions

## Quick notice:
DNAcrypt-AI randomly generates a password (alphanumeric + symbol) or cryptographic key (alphanumeric) of a user-defined length. The password or key is then encrypted using the coordinates of randomly sampled, variable-length DNA sequences found in the human genome (referenced to assemblies hg19 and hg38). DNAcrypt-AI decrypts the data using a high-throughput DNA sequence reconstitution pipeline ([FAS2rDNA](https://github.com/mahvin92/FAS2rDNA)) and a sequence-informed machine learning model ([Covary](https://github.com/mahvin92/Covary)). It does not support custom password generation; however, it supports high-entropy random sequence assignment.

DNAcrypt-AI supports alphanumeric encodings ```a -> z, A -> Z, 0 -> 9``` (for both Password and Encryption) and symbolic encodings such as ```!@#$%^&*()-_+=”``` (for Password only).
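
As a rough illustration of the two character pools described above, the minimal sketch below draws a random string with Python's `secrets` module. The function name `random_secret` and the exact symbol set are assumptions made for this example (the symbols above are introduced with "such as", so the real pool may differ). The 6 to 90 length limit mirrors the check in the user configuration cell. This is not DNAcrypt-AI's own generator.

```python
import secrets
import string

# Alphanumeric pool: a-z, A-Z, 0-9 (used by both Password and Encryption)
ALNUM = string.ascii_letters + string.digits
# Assumed symbol pool for the Password use case (illustrative only)
SYMBOLS = "!@#$%^&*()-_+="


def random_secret(length, use_symbols):
    """Return a random string; use_symbols=True mimics the Password use case."""
    if not 6 <= length <= 90:
        raise ValueError("length must be between 6 and 90")
    alphabet = ALNUM + SYMBOLS if use_symbols else ALNUM
    # secrets.choice draws from the operating system's secure random source
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = random_secret(12, use_symbols=True)   # Password-style string
key = random_secret(32, use_symbols=False)       # Encryption-style key
print(len(password), len(key))
```

`secrets` is used here instead of `random` because it is intended for security-sensitive values.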

## Supported human genome assemblies:
The hg19 and hg38 genome assemblies are used as references to encrypt and decrypt your password/key. DNAcrypt-AI therefore uses native biological genome sequences to store and read your secret information. Work is already under way to expand this support to multi-species genome assemblies and thereby increase the genome vocabularies of DNAcrypt-AI. Reach out to us if you have any recommendations or suggestions to improve this project.

## Generating and encrypting a password/key:
DNAcrypt-AI is very intuitive to use. To generate and encrypt a password or a key:
1. Create a user configuration:
- char_count: string length (6 to 90 characters)
- Use cases: *Password* = alphanumeric + symbol; *Encryption* = alphanumeric only
2. Run DNAcrypt-AI: ```Runtime -> Run all```
3. Store your encrypted data (DNAcrypt_metadata.json and, if you used a custom kmer dictionary, kmer_dict.json)
- These are automatically downloaded. However, if your browser is preventing the download, you can get them directly from ```/content/``` and/or ```/content/DNAcrypt/outputs/``` through the File browser.

## Decrypting a password/key:
Decrypting information is very easy. Follow the steps below:
1. Modify the user configuration
- Use cases: select *Decryption*
2. Run DNAcrypt-AI: ```Runtime -> Run all```
3. Upload your encrypted data
- Note that in addition to the original 'DNAcrypt_metadata.json', you are required to upload the 'kmer_dict.json' if you encrypted your data using a customized kmer dictionary.
4. Wait for the key to be decrypted
- This may take up to about 15 minutes, depending on the genome vocabularies you have.

## Encrypting and decrypting using a custom kmer dictionary:
a. Encryption
1. In the user configuration, set char_count and a use case, then select *Custom* under Kmer dictionary.
2. Run DNAcrypt-AI as usual
3. Store both the *DNAcrypt_metadata.json* and *kmer_dict.json* files for decryption later.

b. Decryption
1. Select *Decryption* as use case in the user configuration
2. Run DNAcrypt-AI as usual
3. Upload the *DNAcrypt_metadata.json* and *kmer_dict.json* files
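
Before uploading a custom dictionary, it can help to confirm that the file still has the shape the notebook looks for: a top-level `kmer_vocab` object whose values are lists of 5-character strings. The sketch below is a hypothetical pre-upload check, not part of DNAcrypt-AI. The function name `check_kmer_dict` and the toy entries are invented for illustration, and the check is deliberately stricter than the notebook's own validation because it inspects every entry.

```python
def check_kmer_dict(obj, k=5):
    """Return a list of problems; an empty list means the structure looks usable."""
    if not isinstance(obj, dict):
        return ["top level is not a JSON object"]
    vocab = obj.get("kmer_vocab")
    if not isinstance(vocab, dict) or not vocab:
        return ["'kmer_vocab' is missing, empty, or not an object"]

    problems = []
    for symbol, kmers in vocab.items():
        if not isinstance(kmers, list) or not kmers:
            problems.append(f"{symbol!r}: expected a non-empty list of kmers")
            continue
        # Collect entries that are not strings of exactly k characters
        bad = [x for x in kmers if not (isinstance(x, str) and len(x) == k)]
        if bad:
            problems.append(f"{symbol!r}: {len(bad)} entries are not {k}-character strings")
    return problems


# Toy dictionary with made-up kmers, only to show the expected layout
toy = {"kmer_vocab": {"a": ["ACGTA", "TTGCA"], "B": ["GGATC"]}}
print(check_kmer_dict(toy))  # prints [] because every entry is well-formed
```

To check a real file, load it with `json.load` and pass the resulting object to `check_kmer_dict`.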

## Encryption data:
Your encryption data MUST NOT be modified. Any loss, modification, or unnecessary addition will affect the stored information. Tampering with the encrypted data (both *DNAcrypt_metadata.json* and *kmer_dict.json*) will affect the recovery of your password/key. However, you can rename the files as you wish.

The encrypted data is lightweight and takes up little storage (200 KB at most). You can store your encrypted data as follows:
1. Printed copy - private and secure but requires re-encoding to a digital copy later before using
2. Digital copy - ready to use but also allows others to decrypt once they get a copy of your file/s
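
Because any change to the file contents can prevent recovery, one simple safeguard for a digital copy is to record a checksum right after encryption and compare it before decryption. The sketch below uses SHA-256 from Python's standard `hashlib`. The helper name `sha256_of` is an assumption for this example and is not something DNAcrypt-AI provides.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read 64 KiB at a time until an empty bytes object signals end of file
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (paths are illustrative):
# recorded = sha256_of("DNAcrypt_metadata.json")      # right after encryption
# assert sha256_of("my_backup.json") == recorded      # before decryption
```

The digest depends only on the file contents, so renaming a file leaves it unchanged, while a single edited byte produces a different value.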

## Training DNAcrypt-AI:
DNAcrypt-AI (this notebook version) runs entirely in your browser; the developer neither collects nor stores your data in the cloud or on a local server. Furthermore, DNAcrypt-AI does not index your encrypted data, nor does it use a pre-packed knowledge base or vector store of genomic vocabularies (such as encoded human genome sequences or k-mer dictionaries used in alphanumeric/symbol encoding). By design, through [Covary](https://github.com/mahvin92/Covary), users train and run DNAcrypt-AI locally, ensuring their data remains exclusively theirs. Please note that machine learning inference is computationally intensive, so expect a processing time of approximately 15 minutes to decrypt your password or key.

## Troubleshooting:
1. Decrypted password/key does not match the expected alphanumeric-symbol sequence
- Re-run the decryption or restart your session ```Runtime -> Restart session and run all```. If this does not fix the issue, check whether you are using a GPU or CPU for training. It has been reported that decryption varies between GPU- and CPU-based inference. Try changing your runtime from GPU to CPU or *vice versa*.
2. DNAcrypt-AI session crashed due to RAM exhaustion
- Machine learning is a RAM-hungry process; ensure that you are connected to a GPU runtime (e.g., T4)
3. DNAcrypt-AI is taking so long in reconstructing my genome corpus
- The reconstruction of DNA sequences is delegated to FAS2rDNA, which needs to download either or both of the hg19 and hg38 human genome assemblies. These assemblies are large (>3 GB each). Please wait for them to be fetched and indexed
4. DNAcrypt-AI is not downloading the required file (e.g., *DNAcrypt_metadata.json*)
- If DNAcrypt-AI executed without an error and yet did not produce a download, it means that your browser is blocking the process. You can fetch your encrypted files from ```/content``` or ```/content/DNAcrypt/outputs/``` directory through the File browser.
5. I obtained a low entropy sequence
- DNAcrypt-AI estimates the entropy of the generated password (the random assignment and sequence of alphanumeric-symbol characters), not necessarily the strength of the encryption. If you are not satisfied with the entropy estimate, you can run DNAcrypt-AI again until you arrive at a desirable entropy. If you encounter issues while running DNAcrypt-AI multiple times, restarting your session can fix most of them ```Runtime -> Restart session and run all```
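
The notebook does not state how its entropy estimate is computed, so the sketch below only illustrates two common ways to put a number on a generated string; both function names are hypothetical. The first measures the size of the search space from the string length and character pool, and the second measures how evenly the characters are used within one particular string.

```python
import math
from collections import Counter


def pool_entropy_bits(length, alphabet_size):
    """Bits of entropy for a uniformly random string: length * log2(pool size)."""
    return length * math.log2(alphabet_size)


def shannon_entropy_bits(text):
    """Empirical Shannon entropy of the characters in text, summed over its length."""
    counts = Counter(text)
    n = len(text)
    per_char = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return per_char * n


# A 12-character alphanumeric key drawn from 62 symbols
print(round(pool_entropy_bits(12, 62), 2))
# A string with only two distinct characters scores low regardless of pool size
print(shannon_entropy_bits("abab"))  # 4.0
```

A short string can score low on the second measure even when it was drawn from a large pool, which is one reason a re-run may report a different estimate.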

## Bug:

If you encounter any bugs, please open a ticket [here](https://github.com/mahvin92/DNAcrypt-AI/issues/new) or request support [here](https://dnacryptai.chordexbio.com).


## Funding (unofficial):

[ChordexBio](https://chordexbio.com)


## License:
DNAcrypt-AI uses FAS2rDNA and Covary, with permission from ChordexBio. Your usage of DNAcrypt-AI and its dependencies may be limited; [read the full license here.](https://github.com/mahvin92/DNAcrypt-AI)


## Footnote:
DNAcrypt-AI is powered by [ChordexBio](https://chordexbio.com), [FAS2rDNA](https://github.com/mahvin92/FAS2rDNA), [Covary](https://github.com/mahvin92/Covary), [assemBold program](https://assembold.chordexbio.com) and [CodeEnigma](https://github.com/KrishnanSG/codeenigma), made with Python, and tested using Google Colab ❤️