# **Hi there!**

This is a Jupyter notebook for searching the AlphaMissense database ([AlphaMissense 2023 publication](https://www.science.org/doi/10.1126/science.adg7492)) using a UniProt ID.

AlphaMissense is a deep learning model that builds on the protein structure prediction tool AlphaFold2. The model is trained on population frequency data and uses sequence and predicted structural context, all of which contribute to its performance. The authors evaluated the model against related methods using clinical databases not included in the training and demonstrated agreement with multiplexed assays of variant effect. Predictions for all single–amino acid substitutions in the human proteome are provided as a community resource.

---

Here, we provide a simple way to look up the prediction for a specific variant, or all predictions for a specific residue, of your selected protein.

---
**Bugs**
- If you encounter any bugs, please report them at https://github.com/pablo-arantes/AlphaMissense_db_search/issues

**Acknowledgments**
- We would like to thank the [AlphaMissense](https://github.com/google-deepmind/alphamissense) team for providing pre-computed predictions for all possible human amino acid substitutions and missense variants.

In [2]:
#Install Dependencies
#If you have done this step before, you don't need to run this cell again.
!pip install gdown

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Collecting filelock (from gdown)
  Downloading filelock-3.15.4-py3-none-any.whl.metadata (2.9 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Downloading filelock-3.15.4-py3-none-any.whl (16 kB)
Installing collected packages: filelock, gdown
Successfully installed filelock-3.15.4 gdown-5.2.0


In [3]:
#Download the Database
#Run the cell to download the database file. If you have done this step before, you can skip this step.
import gdown
url = 'https://drive.google.com/u/0/uc?id=1FrDf9qirnQ_hd23xv3X9yrPbVaOJ-Q70'
output = 'protein_data.db'
gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/u/0/uc?id=1FrDf9qirnQ_hd23xv3X9yrPbVaOJ-Q70
From (redirected): https://drive.google.com/uc?id=1FrDf9qirnQ_hd23xv3X9yrPbVaOJ-Q70&confirm=t&uuid=45073462-c92f-41a5-8647-5db6451cc249
To: /Users/pabloarantes/Downloads/protein_data.db
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.0G/12.0G [05:53<00:00, 33.9MB/s]


'protein_data.db'
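
Before querying, it can help to confirm what the downloaded SQLite file actually contains. The sketch below lists every table and its column names using SQLite's `sqlite_master` catalog and `PRAGMA table_info`. The helper name `describe_database` is hypothetical and not part of the original notebook; the queries in the following cells only tell us that a table called `data` with a `uniprot_id` column should exist.

```python
import sqlite3
from contextlib import closing


def describe_database(db_name):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper, not part of the original notebook.
    """
    with closing(sqlite3.connect(db_name)) as conn:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA statements do not accept bound parameters, so the table
            # name (read from the database itself) is quoted as an identifier.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            # Each row is (cid, name, type, notnull, dflt_value, pk).
            schema[table] = [col[1] for col in columns]
        return schema


# Example call (assumes the download above has finished):
# print(describe_database("protein_data.db"))
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so an empty result usually means the path is wrong or the download did not complete.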

In [4]:
#Query the Data based on specific variant

#Below, you should provide all inputs, e.g., **uniprot_id:** P09874 **protein_variant:** F74V.
workDir = "./"
database_name = "protein_data.db"
uniprot_id = 'P09874' #YOUR UNIPROT_ID
protein_variant = "F74V" #YOUR PROTEIN VARIANT, e.g. F74V or S75C
output_file_name = str(uniprot_id) + "_all_results.dat"

import sqlite3
import os

def query_and_print_specific_variant(db_name, uniprot_id, protein_variant, output_file):
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()

    # Execute SQL query for all data related to uniprot_id
    cur.execute("SELECT * FROM data WHERE uniprot_id = ?", (uniprot_id,))
    rows = cur.fetchall()

    # Open the output file and write the header and all results
    with open(output_file, 'w') as file:
        # Tab-separated header, matching the tab-separated rows written below
        header = "uniprot_id\tprotein_variant\tam_pathogenicity\tam_class\n"

        # Write the header to file
        file.write(header)

        # Print the header to console
        print(header.strip())

        # Write all rows and print specific variant rows
        for row in rows:
            file.write('\t'.join(map(str, row)) + '\n')
            if row[1] == protein_variant:  # Check if the row is the specific variant
                print('\t'.join(map(str, row)))

    conn.close()

# Usage Example
db_name = os.path.join(workDir, database_name)
uniprot_id_to_search = uniprot_id  # Replace with the ID you want to search
protein_variant_to_search = protein_variant
output_file = os.path.join(workDir, output_file_name)    # Name of the output file

query_and_print_specific_variant(db_name, uniprot_id_to_search, protein_variant_to_search, output_file)

uniprot_id	protein_variant	am_pathogenicity	am_class
P09874	F74V	0.5155	ambiguous
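
The cell above fetches every row for the protein and picks out the variant in Python, which is convenient when the full table is also written to disk. If only the single variant is needed, the filter could instead be pushed into the SQL query. The following is a minimal sketch under one assumption: that the second column of the `data` table is named `protein_variant`. That name is taken from the printed header, not from the database schema, so it should be checked before use. The function name `fetch_variant` is likewise hypothetical.

```python
import sqlite3
from contextlib import closing


def fetch_variant(db_name, uniprot_id, protein_variant):
    """Return only the rows matching one UniProt ID and one variant.

    Assumes the table `data` has columns named `uniprot_id` and
    `protein_variant`; the second name is an assumption.
    """
    query = (
        "SELECT * FROM data "
        "WHERE uniprot_id = ? AND protein_variant = ?"
    )
    # closing() guarantees the connection is released even if the query fails.
    with closing(sqlite3.connect(db_name)) as conn:
        return conn.execute(query, (uniprot_id, protein_variant)).fetchall()


# Example call (assumes protein_data.db is in the working directory):
# for row in fetch_variant("protein_data.db", "P09874", "F74V"):
#     print("\t".join(map(str, row)))
```

Using `?` placeholders, as the original cell does, keeps user-supplied values out of the SQL string itself.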


In [5]:
#Query the Data based on residue number

#Below, you should provide all inputs, e.g., **uniprot_id:** P09874 **residue_number:** 74.

workDir = "./"
database_name = "protein_data.db"
uniprot_id = 'P09874' #YOUR UNIPROT_ID
residue_number = "74" #YOUR RESIDUE NUMBER
output_file_name = str(uniprot_id) + "_residue_" + str(residue_number) +".dat"


import sqlite3
import os

def save_and_print_variants_by_residue(db_name, uniprot_id, residue_number, output_file):
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()

    # Execute SQL query for all data related to uniprot_id
    cur.execute("SELECT * FROM data WHERE uniprot_id = ?", (uniprot_id,))
    rows = cur.fetchall()

    # Open the output file and write the header
    with open(output_file, 'w') as file:
        header = "uniprot_id\tprotein_variant\tam_pathogenicity\tam_class\n"
        file.write(header)

        # Print the header to console
        print(header.strip())

        # Write rows matching the specified residue number to the file and print them
        for row in rows:
            # Extract the residue number from the protein variant
            if len(row[1]) > 1 and row[1][1:-1].isdigit():
                residue_num = int(row[1][1:-1])
                if residue_num == residue_number:
                    line = '\t'.join(map(str, row))
                    file.write(line + '\n')
                    print(line)

    conn.close()

# Usage Example
db_name = os.path.join(workDir, database_name)
uniprot_id_to_search = uniprot_id  # Replace with the ID you want to search
residue_number_to_search = int(residue_number)
output_file = os.path.join(workDir, output_file_name)    # Name of the output file

save_and_print_variants_by_residue(db_name, uniprot_id_to_search, residue_number_to_search, output_file)

uniprot_id	protein_variant	am_pathogenicity	am_class
P09874	F74A	0.9557	pathogenic
P09874	F74C	0.7501	pathogenic
P09874	F74D	0.9955	pathogenic
P09874	F74E	0.9956	pathogenic
P09874	F74G	0.9829	pathogenic
P09874	F74H	0.9503	pathogenic
P09874	F74I	0.4493	ambiguous
P09874	F74K	0.9953	pathogenic
P09874	F74L	0.9283	pathogenic
P09874	F74M	0.8263	pathogenic
P09874	F74N	0.9772	pathogenic
P09874	F74P	0.998	pathogenic
P09874	F74Q	0.9897	pathogenic
P09874	F74R	0.9871	pathogenic
P09874	F74S	0.9419	pathogenic
P09874	F74T	0.9636	pathogenic
P09874	F74V	0.5155	ambiguous
P09874	F74W	0.6688	pathogenic
P09874	F74Y	0.2123	benign
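
The residue cell extracts the position by slicing `row[1][1:-1]`, which relies on variants being written as one reference letter, a position, and one alternate letter (for example `F74V`). A stricter parse with a regular expression, plus a quick tally of the `am_class` column, could look like the sketch below. The helper names `parse_variant` and `summarize_classes` are hypothetical, and the sample rows are copied from the output above rather than read from the database.

```python
import re
from collections import Counter

# One uppercase reference residue, a numeric position, one uppercase alternate.
VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(variant):
    """Split a variant such as 'F74V' into (reference, position, alternate)."""
    match = VARIANT_RE.match(variant)
    if match is None:
        raise ValueError(f"Unrecognized variant format: {variant!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def summarize_classes(rows):
    """Count rows per am_class, assumed to be the fourth column of each row."""
    return Counter(row[3] for row in rows)


# A few rows copied from the residue 74 output above.
sample_rows = [
    ("P09874", "F74I", 0.4493, "ambiguous"),
    ("P09874", "F74V", 0.5155, "ambiguous"),
    ("P09874", "F74W", 0.6688, "pathogenic"),
    ("P09874", "F74Y", 0.2123, "benign"),
]

print(parse_variant("F74V"))
print(summarize_classes(sample_rows))
```

Raising an error on unexpected formats, instead of silently skipping them, makes it easier to notice if the database contains variant strings in another notation.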
