```
# Copyright 2022 Google Inc.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

This colab supports UniProt release 2022_04, for which Google predicted protein names for 88% of all Uncharacterized proteins (more than 1 in 5 proteins in the database).

You can run this file to check whether any prediction produced by Google's systems is supported by other sources. Note that some of the predicted names have since been updated in UniProt 2022_05.

**Paste in the UniProt accession of the protein below!**

# Imports and dependencies

In [1]:
!apt-get install pigz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  pigz
0 upgraded, 1 newly installed, 0 to remove and 12 not upgraded.
Need to get 57.4 kB of archives.
After this operation, 259 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 pigz amd64 2.4-1 [57.4 kB]
Fetched 57.4 kB in 1s (99.5 kB/s)
Selecting previously unselected package pigz.
(Reading database ... 123934 files and directories currently installed.)
Preparing to unpack .../archives/pigz_2.4-1_amd64.deb ...
Unpacking pigz (2.4-1) ...
Setting up pigz (2.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [2]:
!pip install binary_file_search

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting binary_file_search
  Downloading binary_file_search-0.7.tar.gz (4.0 kB)
Building wheels for collected packages: binary-file-search
  Building wheel for binary-file-search (setup.py) ... done
  Created wheel for binary-file-search: filename=binary_file_search-0.7-py3-none-any.whl size=3970 sha256=ad37736153f21eae44011a843bbd7814e1c66c7a8fb442460426e61c6f9694cb
  Stored in directory: /root/.cache/pip/wheels/7c/a9/91/fed3d9ba96b88a121ddb4dc5198f12e14119c4e59afe26fd8f
Successfully built binary-file-search
Installing collected packages: binary-file-search
Successfully installed binary-file-search-0.7


In [3]:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
import IPython.display
def print_markdown(string):
    IPython.display.display(IPython.display.Markdown(string))

# Download and unzip the evidence file
(This takes a few minutes.)

In [4]:
!wget -c -O sorted_evidencer.csv.gz https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/evidencer_sorted.csv.gz

--2022-10-11 18:00:36--  https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/evidencer_sorted.csv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 172.253.117.128, 142.250.99.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2358963308 (2.2G) [application/octet-stream]
Saving to: ‘sorted_evidencer.csv.gz’


2022-10-11 18:01:08 (72.2 MB/s) - ‘sorted_evidencer.csv.gz’ saved [2358963308/2358963308]
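
Because `wget -c` resumes partial downloads, it can be worth confirming that the archive is complete before unzipping it. The sketch below compares the size on disk with the `Length` value reported in the log above (2358963308 bytes). The helper name `download_is_complete` is hypothetical and not part of the original notebook, and a matching size only rules out a truncated download; it does not prove the contents are intact.

```python
import os

# "Length" value reported by wget in the log above.
EXPECTED_BYTES = 2358963308


def download_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if `path` is a file with exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes
```

In the notebook this could be called as `download_is_complete("sorted_evidencer.csv.gz")` before running the unzip cell.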



In [5]:
!unpigz -fk sorted_evidencer.csv.gz
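
The search cell below relies on binary search, which only works if the rows are sorted by accession and the file can be read at arbitrary offsets; that is presumably why the archive is decompressed first. The sketch below illustrates the idea on a small in-memory sample using Python's `bisect` module. It is not how `binary_file_search` is implemented. The helper names are hypothetical, and apart from the `A0A009` row (taken from the example output in this notebook) the sample rows and the support-column placeholder values are made up.

```python
import bisect

# Hypothetical sample in the five-column layout the search cell unpacks:
# accession, prediction, support_in_uniprot, support_in_uniref,
# support_in_uniref_example. "x" is a placeholder for "support found";
# the real file may encode this differently.
SAMPLE_ROWS = [
    ["A0A001", "Example protein one", "NULL", "NULL", "NULL"],
    ["A0A009", "Carbamoyltransferase", "NULL", "x", "D6A7G3"],
    ["A0A010", "Example protein two", "x", "NULL", "NULL"],
]


def is_sorted_by_accession(rows):
    """Return True if the first field of each row is in non-decreasing order."""
    keys = [row[0] for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def find_rows(sorted_rows, accession):
    """Return every row whose first field equals `accession` via binary search."""
    keys = [row[0] for row in sorted_rows]
    lo = bisect.bisect_left(keys, accession)
    hi = bisect.bisect_right(keys, accession)
    return sorted_rows[lo:hi]
```

Binary search needs only about log2(n) comparisons per lookup, which is what makes a single-accession query practical on a multi-gigabyte file without loading it into memory.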

# Search for your evidence

In [6]:
accession = "A0A009" #@param {type:"string"}
with BinaryFileSearch('sorted_evidencer.csv', sep=",", string_mode=True) as bfs:
    lines = bfs.search(accession)

if not lines:
  raise ValueError("Sorry, this protein's accession wasn't found in our database. Maybe check your spelling, or maybe it's a very new protein?")
if len(lines) > 1:
  # Print every matching row before failing, to help with debugging.
  for line in lines:
    print(line)
  raise ValueError('There was some sort of error - we found multiple predictions for this protein!', lines)

accession, prediction, support_in_uniprot, support_in_uniref, support_in_uniref_example = tuple(lines[0])

if support_in_uniprot and support_in_uniprot != "NULL":
  print_markdown(f"The prediction **{prediction}** for **{accession}** appears as a substring of the **UniProt page** for **{accession}**.\n")
  print_markdown(f"More info at https://www.uniprot.org/uniprotkb/{accession}/entry")
elif support_in_uniref and support_in_uniref != "NULL":
  print_markdown(f"The prediction **{prediction}** for **{accession}** appears as a **substring of UniRef50 cluster** member {support_in_uniref_example}.\n")
  print_markdown(f"More info at https://www.uniprot.org/uniprotkb/{support_in_uniref_example}/entry")
else:
  print_markdown(f"The prediction **{prediction}** for **{accession}** has **no support** found with these automated methods. Perhaps the protein is very new?")

The prediction **Carbamoyltransferase** for **A0A009** appears as a **substring of UniRef50 cluster** member D6A7G3.


More info at https://www.uniprot.org/uniprotkb/D6A7G3/entry
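
The branching in the cell above can also be captured as a small pure function, which may be convenient when checking many accessions in a loop. This is only a sketch: `classify_support` and its return labels are hypothetical names, and it assumes each row has the same five fields that the cell above unpacks, with `"NULL"` or an empty string meaning that no support was found.

```python
def classify_support(row):
    """Classify one evidence row, mirroring the if/elif/else in the cell above.

    Returns a (label, accession_to_inspect) pair, where label is one of
    "uniprot", "uniref50", or "none". UniProt support takes precedence.
    """
    accession, prediction, in_uniprot, in_uniref, uniref_example = row

    def present(value):
        # Same test as the notebook: non-empty and not the literal "NULL".
        return bool(value) and value != "NULL"

    if present(in_uniprot):
        return "uniprot", accession
    if present(in_uniref):
        return "uniref50", uniref_example
    return "none", None
```

For the example above, a row for `A0A009` with UniRef50 support would yield the label `"uniref50"` together with `D6A7G3`, the cluster member whose entry contains the predicted name.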

# License of downloaded data

The evidencer file is licensed under CC BY 4.0 and is built from UniProt data.

The UniProt Consortium. "UniProt: the universal protein knowledgebase in 2021." Nucleic Acids Research 49, no. D1 (2021): D480-D489.