```
# Copyright 2023 Google Inc.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

This Colab accompanies UniProt release 2023_01, for which Google predicted protein names for tens of millions of proteins previously named "Uncharacterized protein".

Run this notebook to check whether a given prediction from Google's systems (especially one for a previously "Uncharacterized" protein) is supported by other sources.


**Paste the protein's UniProt accession into the `accession` field of the "Search for your evidence" cell below!**

# Imports and dependencies

In [48]:
!apt-get install pigz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
pigz is already the newest version (2.4-1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.


In [49]:
!pip install binary_file_search

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [50]:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
import IPython.display
def print_markdown(string):
    IPython.display.display(IPython.display.Markdown(string))

# Download evidence file and unzip
(takes a few minutes)

In [51]:
!wget -c -O sorted_evidencer.csv.gz https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2023_01_evidencer_sorted.csv.gz

--2023-02-27 22:53:30--  https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2023_01_evidencer_sorted.csv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.117.128, 142.250.107.128, 173.194.202.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.117.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 317356364 (303M) [application/octet-stream]
Saving to: ‘sorted_evidencer.csv.gz’


2023-02-27 22:53:36 (58.8 MB/s) - ‘sorted_evidencer.csv.gz’ saved [317356364/317356364]



In [52]:
!unpigz -fk sorted_evidencer.csv.gz
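
Judging by how the search cell in this notebook unpacks a matching row, each row of the decompressed file carries five comma-separated fields: accession, prediction, support in UniProt, alignment support, and structure support. The sketch below gives those fields names. The `Evidence` type and `parse_evidence_line` helper are hypothetical, not part of the notebook, and the literal text of the third field (`True` here) is an assumption about how the file encodes it.

```python
import csv
from typing import NamedTuple


class Evidence(NamedTuple):
    """One row of the evidence file, in the column order the search cell assumes."""
    accession: str
    prediction: str
    support_in_uniprot: str
    alignment_support: str
    structure_support: str


def parse_evidence_line(line: str) -> Evidence:
    """Split one CSV line into named fields, failing loudly on a malformed row."""
    fields = next(csv.reader([line]))
    if len(fields) != len(Evidence._fields):
        raise ValueError(f"expected {len(Evidence._fields)} fields, got {len(fields)}")
    return Evidence(*fields)


# Illustrative row only: the accessions and prediction echo this notebook's
# example output, but the exact row contents in the real file are assumed.
example = parse_evidence_line("A0A3M3H8U9,ATPase,True,A0A345UY82,A0A345UY82")
print(example.prediction, example.structure_support)
```

Because every field arrives as a string, a non-empty value such as `"False"` would still be truthy in Python, so it is worth checking how the file encodes "no support" before relying on plain `if` tests.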

# Search for your evidence

In [58]:
accession = "A0A3M3H8U9" #@param {type:"string"}
# Binary-search the sorted CSV for rows matching the accession.
with BinaryFileSearch('sorted_evidencer.csv', sep=",", string_mode=True) as bfs:
  try:
    lines = bfs.search(accession)
  except KeyError:
    raise ValueError("Sorry, this protein's accession wasn't found in our database. Maybe check your spelling, or maybe this prediction wasn't provided by Google?")

# Each accession should have exactly one prediction; more than one points to a problem.
if len(lines) > 1:
  for line in lines:
    print(line)
  raise ValueError('There was some sort of error - we found multiple predictions for this protein!', lines)

accession, prediction, support_in_uniprot, alignment_support, structure_support = tuple(lines[0])

to_print_prefix = f'The prediction **{prediction}** for **{accession}**: \n'
to_print = to_print_prefix
if support_in_uniprot:
  to_print += f"* appears as a substring of the **UniProt page** for **{accession}**.\n"
if alignment_support:
  to_add = f"* has a strong phmmer alignment to **{alignment_support}** (bit score > 25).\n"
  to_print += to_add
if structure_support:
  to_add = f"* has a structural alignment to the high-confidence AlphaFold structure for **{structure_support}** (tmalign score > .5).\n"
  to_print += to_add

if to_print == to_print_prefix:
  to_print += "* **no support** found with these automated methods. Perhaps the protein is very new or the prediction is wrong?"

print_markdown(to_print)

The prediction **ATPase** for **A0A3M3H8U9**: 
* appears as a substring of the **UniProt page** for **A0A3M3H8U9**.
* has a strong phmmer alignment to **A0A345UY82** (bit score > 25).
* has a structural alignment to the high-confidence AlphaFold structure for **A0A345UY82** (tmalign score > .5).
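
The lookup above depends on the evidence file being sorted by accession, which lets a binary search find a row without reading the whole file. As a rough illustration of the idea only (not the `binary_file_search` implementation), the sketch below runs the same kind of search over a small in-memory list. The `find_row` helper, the `ACC000x` keys, and the row contents are all hypothetical.

```python
def find_row(sorted_rows, accession):
    """Binary-search rows sorted by their first field; raise KeyError if absent."""
    lo, hi = 0, len(sorted_rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_rows[mid][0] < accession:
            lo = mid + 1  # target, if present, lies to the right of mid
        else:
            hi = mid      # target, if present, lies at mid or to its left
    if lo < len(sorted_rows) and sorted_rows[lo][0] == accession:
        return sorted_rows[lo]
    raise KeyError(accession)


# Hypothetical rows, already sorted by the first field.
rows = [
    ["ACC0001", "Example protein A"],
    ["ACC0002", "Example protein B"],
    ["ACC0003", "Example protein C"],
]

print(find_row(rows, "ACC0002"))
```

Each step halves the remaining range, so the number of comparisons grows with the logarithm of the row count. That is why a lookup in a file with tens of millions of rows can still be fast.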


# License of downloaded data

The evidencer file is licensed under CC BY 4.0 and is derived from UniProt.

The UniProt Consortium. "UniProt: the universal protein knowledgebase in 2021." Nucleic Acids Research 49, no. D1 (2021): D480-D489.