```
# Copyright 2023 Google Inc.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

This Colab supports UniProt 2023_01, for which Google predicted protein names for tens of millions of proteins previously named "Uncharacterized protein".

You can run this notebook to check whether a prediction produced by Google's systems (especially one for a previously "Uncharacterized" protein) is supported by other sources.


**Paste in the UniProt accession of the protein below!**

# Imports and dependencies

In [1]:
!apt-get install pigz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  pigz
0 upgraded, 1 newly installed, 0 to remove and 22 not upgraded.
Need to get 57.4 kB of archives.
After this operation, 259 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 pigz amd64 2.4-1 [57.4 kB]
Fetched 57.4 kB in 0s (167 kB/s)
Selecting previously unselected package pigz.
(Reading database ... 128275 files and directories currently installed.)
Preparing to unpack .../archives/pigz_2.4-1_amd64.deb ...
Unpacking pigz (2.4-1) ...
Setting up pigz (2.4-1) ...
Processing triggers for man-db (2.9.1-1) ...


In [2]:
!pip install binary_file_search

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting binary_file_search
  Downloading binary_file_search-0.7.tar.gz (4.0 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: binary_file_search
  Building wheel for binary_file_search (setup.py) ... done
  Created wheel for binary_file_search: filename=binary_file_search-0.7-py3-none-any.whl size=3968 sha256=08db9047bad570a67c78614c5cfd8b1417b8ffc8e7bf4d24222a6608f59ee3bf
  Stored in directory: /root/.cache/pip/wheels/4e/ad/a2/3a9a72f26e1b3dc30147de9f09a853f4e2c73c2683a71bba2d
Successfully built binary_file_search
Installing collected packages: binary_file_search
Successfully installed binary_file_search-0.7


In [3]:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
import IPython.display
def print_markdown(string):
    IPython.display.display(IPython.display.Markdown(string))

# Download evidence file and unzip
(takes a few minutes)

In [4]:
!wget -c -O sorted_evidencer.tsv.gz https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2023_01_evidencer_sorted.tsv.gz

--2023-03-13 18:15:10--  https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2023_01_evidencer_sorted.tsv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.31.128, 172.253.62.128, 142.251.163.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.31.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 317478150 (303M) [application/octet-stream]
Saving to: ‘sorted_evidencer.tsv.gz’


2023-03-13 18:15:12 (166 MB/s) - ‘sorted_evidencer.tsv.gz’ saved [317478150/317478150]



In [5]:
!unpigz -fk sorted_evidencer.tsv.gz
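
If `pigz` is not available (for example, outside Colab), the same decompression step can be approximated with Python's standard library. This is a minimal sketch, not part of the original notebook: the helper name `decompress_gzip` is hypothetical. Like `unpigz -fk`, it keeps the `.gz` file and overwrites any existing output, but it runs single-threaded and will likely be slower.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path, chunk_size=1 << 20):
    """Decompress src_path (gzip) into dst_path, keeping the source file.

    Hypothetical helper for illustration. Data is streamed in chunks so the
    full file never has to fit in memory.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=chunk_size)


# Usage with the file names from the cells above:
# decompress_gzip("sorted_evidencer.tsv.gz", "sorted_evidencer.tsv")
```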

# Search for your evidence

We search for evidence for a prediction via two alignment methods: sequence alignment via [**phmmer**](http://hmmer.org/) and structure alignment via [**tmalign**](https://zhanggroup.org/TM-score/).

In particular, for a given prediction, we define "evidence" as a protein in UniProt whose name matches our prediction and which is similar to the query protein, either in amino acid sequence (a phmmer bit score [above 25](https://hmmer-web-docs.readthedocs.io/en/latest/searches.html#significance-bit-scores)) or in structure (a tmalign score [above 0.5](https://zhanggroup.org/TM-score/) against a confident AlphaFold structure). We ignore proteins named by ProtNLM when searching for evidence in UniProt.

When evidence is found, we provide one example for each alignment method.
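
The evidence criteria above can be sketched as a small predicate. This is an illustration only, not the pipeline that produced the evidence file. The function name, its parameters, and the case-insensitive exact name match are all assumptions. The text does not say how names are compared or how a "confident" structure is determined, so confidence is passed in as a boolean.

```python
PHMMER_BIT_SCORE_THRESHOLD = 25.0  # "above 25", from the text
TMALIGN_SCORE_THRESHOLD = 0.5      # "above 0.5", from the text


def supporting_methods(predicted_name, candidate_name,
                       phmmer_bit_score=None, tmalign_score=None,
                       structure_is_confident=False, named_by_protnlm=False):
    """Return the alignment methods under which a candidate counts as evidence.

    Hypothetical helper. Assumes names match when they are equal after
    stripping whitespace and lowercasing.
    """
    # Proteins named by ProtNLM are ignored as evidence.
    if named_by_protnlm:
        return []
    # The candidate's name must match the prediction.
    if predicted_name.strip().lower() != candidate_name.strip().lower():
        return []

    methods = []
    # Sequence support: phmmer bit score strictly above the threshold.
    if phmmer_bit_score is not None and phmmer_bit_score > PHMMER_BIT_SCORE_THRESHOLD:
        methods.append("phmmer")
    # Structure support: tmalign score strictly above the threshold,
    # and only against a confident AlphaFold structure.
    if (tmalign_score is not None and structure_is_confident
            and tmalign_score > TMALIGN_SCORE_THRESHOLD):
        methods.append("tmalign")
    return methods
```

A candidate can be supported by either method, both, or neither, which is why the sketch returns a list rather than a single boolean.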

In [18]:
accession = "A0A3M3H8U9" #@param {type:"string"}
with BinaryFileSearch('sorted_evidencer.tsv', sep="\t", string_mode=True) as bfs:
  try:
    lines = bfs.search(accession)
  except KeyError:
    raise ValueError('Sorry, this protein\'s accession wasn\'t found in our database. Maybe check your spelling, or maybe this prediction wasn\'t provided by Google?')

if len(lines) > 1:
  for l in lines:
    print(l)
  raise ValueError('There was some sort of error - we found multiple predictions for this protein!', lines)


if len(lines[0]) == 2:
  accession, prediction = tuple(lines[0])
  to_print = (f'The prediction **{prediction}** for **{accession}**: \n'
              f"* **no support** found with these automated methods. Perhaps the protein is very new or the prediction is wrong?")
else:
  accession, prediction, _, alignment_support, structure_support = tuple(lines[0])
  to_print_prefix = f'The prediction **{prediction}** for **{accession}**: \n'
  to_print = to_print_prefix

  if alignment_support:
    to_add = f"* has a strong phmmer alignment to **{alignment_support}** (bit score > 25).\n"
    to_print += to_add
  if structure_support:
    to_add = f"* has a structural alignment to the high-confidence AlphaFold structure for **{structure_support}** (tmalign score > .5).\n"
    to_print += to_add

print_markdown(to_print)

The prediction **ATPase** for **A0A3M3H8U9**: 
* has a strong phmmer alignment to **A0A345UY82** (bit score > 25).
* has a structural alignment to the high-confidence AlphaFold structure for **A0A345UY82** (tmalign score > .5).
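
The row handling in the cell above can also be written as a small parser that returns a structured record. This is a sketch under assumptions: the `EvidenceRow` class and `parse_evidence_row` function are hypothetical names, and the column layout is inferred only from how the cell unpacks rows (two fields when no support was found, five fields otherwise, with the third field unused). Unlike the cell, the sketch rejects rows of any other length.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EvidenceRow:
    accession: str
    prediction: str
    alignment_support: Optional[str] = None  # accession found via phmmer
    structure_support: Optional[str] = None  # accession found via tmalign


def parse_evidence_row(fields: Sequence[str]) -> EvidenceRow:
    """Turn one row of the evidence TSV into an EvidenceRow (hypothetical helper)."""
    if len(fields) == 2:
        # Accession and prediction only: no support was found.
        accession, prediction = fields
        return EvidenceRow(accession, prediction)
    if len(fields) == 5:
        # The third column is not used by the notebook, so it is skipped here.
        accession, prediction, _, alignment, structure = fields
        # Empty strings are treated as "no support for this method".
        return EvidenceRow(accession, prediction, alignment or None, structure or None)
    raise ValueError(f"Unexpected number of fields: {len(fields)}")


# Example using the accessions shown in the output above.
row = parse_evidence_row(["A0A3M3H8U9", "ATPase", "", "A0A345UY82", "A0A345UY82"])
has_any_support = bool(row.alignment_support or row.structure_support)
```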


# License of downloaded data

The evidencer file is licensed under CC-BY 4.0 and is derived from UniProt.

The UniProt Consortium. "UniProt: the universal protein knowledgebase in 2021." Nucleic Acids Research 49, no. D1 (2021): D480-D489.