```
# Copyright 2022 Google Inc.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

This colab accompanies UniProt release 2022_04, in which Google predicted
protein names for 88% of all Uncharacterized proteins (more than 1 in 5 proteins in UniProt).

This colab allows you to run a model that's very similar to the one used in the UniProt release. To run this colab:

1. Select an appropriate runtime. We recommend choosing a runtime with at least 20 GB of RAM, e.g. the publicly available "TPU v2" runtime.
2. **Enter the amino acid sequence below**, then select "Runtime > Run all" from the menu bar above to **get name predictions for your protein**!

This colab takes a few minutes to run the first time; after that, you get protein name predictions in a few seconds!

# 1. Import code

In [None]:
#@markdown Please execute this cell by pressing the _Play_ button
#@markdown on the left to import the dependencies. It can take a few minutes.
!python3 -m pip install -q -U tensorflow==2.8.2
!python3 -m pip install -q -U tensorflow-text==2.8.2
import tensorflow as tf
import tensorflow_text
import numpy as np
import re

import IPython.display
from absl import logging

tf.compat.v1.enable_eager_execution()

logging.set_verbosity(logging.ERROR)  # Turn down tensorflow warnings

def print_markdown(string):
  IPython.display.display(IPython.display.Markdown(string))

# 2. Load the model

In [3]:
#@markdown Please execute this cell by pressing the _Play_ button.

def query(seq):
  return f"[protein_name_in_english] <extra_id_0> [sequence] {seq}"

# Dots are escaped so that they match a literal '.' rather than any character.
EC_NUMBER_REGEX = r'(\d+)\.([\d\-n]+)\.([\d\-n]+)\.([\d\-n]+)'

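def run_inference(seq):
  # The serving signature is called with a batch of prompts; we send a batch of one.
  labeling = infer(tf.constant([query(seq)]))
  # 'output_0' holds the candidate names (as bytes) and 'output_1' their scores.
  names = labeling['output_0'][0].numpy().tolist()
  scores = labeling['output_1'][0].numpy().tolist()
  # Reverse the beam so that the highest-scoring candidate comes first.
  names = [name.decode().replace('<extra_id_0> ', '') for name in reversed(names)]
  # Prefix candidates that start with an EC number, e.g. '4.1.1.48' -> 'EC:4.1.1.48'.
  names = ['EC:' + name if re.match(EC_NUMBER_REGEX, name) else name for name in names]
  # Exponentiate the scores; this assumes the model returns log-probabilities.
  scores = [np.exp(score) for score in reversed(scores)]
  return names, scores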
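
The prompt construction and the EC-number tagging can be tried without loading the model. The sketch below repeats `query` from the cell above and factors the tagging step into a helper; `tag_ec_numbers` and the short input strings are hypothetical names used only for illustration, and the pattern escapes the dots on the assumption that they are meant to match literal periods.

```python
import re

# Same prompt template as query() in the cell above.
def query(seq):
  return f"[protein_name_in_english] <extra_id_0> [sequence] {seq}"

# Assumption: the dots are meant as literal periods, so they are escaped here.
EC_NUMBER_PATTERN = re.compile(r'(\d+)\.([\d\-n]+)\.([\d\-n]+)\.([\d\-n]+)')

def tag_ec_numbers(names):
  """Prefix every name that starts with an EC number with 'EC:' (hypothetical helper)."""
  return ['EC:' + n if EC_NUMBER_PATTERN.match(n) else n for n in names]

print(query("MVLSPADKTNV"))
print(tag_ec_numbers(["4.1.1.48", "IGPS", "Hemoglobin subunit alpha"]))
```

Because `match` anchors only at the start of the string, a name that merely contains an EC number later in the text is left unchanged.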

In [4]:
#@markdown Please execute this cell by pressing the _Play_ button
#@markdown on the left to load the model. It can take a few minutes.

! mkdir -p protnlm

! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/saved_model.pb -P protnlm -q --no-check-certificate
! mkdir -p protnlm/variables
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.index -P protnlm/variables/ -q --no-check-certificate
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.data-00000-of-00001 -P protnlm/variables/ -q --no-check-certificate

imported = tf.saved_model.load(export_dir="protnlm")
infer = imported.signatures["serving_default"]
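
The three `wget` commands above reproduce the directory layout that `tf.saved_model.load` expects: a `saved_model.pb` file next to a `variables/` folder. As a rough sketch, the same mapping from remote files to local paths can be written in Python; `download_plan` is a hypothetical helper that only builds the (URL, path) pairs and does not download anything.

```python
import posixpath

# Base URL copied from the wget commands above.
BASE_URL = ("https://storage.googleapis.com/brain-genomics-public/research/"
            "proteins/protnlm/uniprot_2022_04/"
            "savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu")

# Relative paths of the three files fetched above.
MODEL_FILES = [
    "saved_model.pb",
    "variables/variables.index",
    "variables/variables.data-00000-of-00001",
]

def download_plan(export_dir="protnlm"):
  """Return (url, local_path) pairs for the model files (hypothetical helper)."""
  return [(f"{BASE_URL}/{rel}", posixpath.join(export_dir, rel))
          for rel in MODEL_FILES]

for url, path in download_plan():
  print(path, "<-", url)
```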

In [5]:
#@title 3. Enter your protein sequence here (hemoglobin is pre-loaded)

#@markdown Press the _Play_ button to get a prediction.
#@markdown The first time can take a few minutes.
#@markdown
#@markdown Subsequent predictions take a few seconds.
sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP AVHASLDKFLASVSTVLTSKYR" #@param {type:"string"}
sequence = sequence.replace(' ', '')

names, scores = run_inference(sequence)

for i, (name, score) in enumerate(zip(names, scores), start=1):
  print_markdown(f"Prediction number {i}: **{name}** with a score of **{score:.3f}**")

Prediction number 1: **Hemoglobin subunit alpha** with a score of **0.364**

Prediction number 2: **Alpha-globin** with a score of **0.160**

Prediction number 3: **Hemoglobin subunit alpha-like** with a score of **0.069**

Prediction number 4: **Hemoglobin alpha chain** with a score of **0.068**

Prediction number 5: **Hemoglobin alpha subunit 2** with a score of **0.035**

Prediction number 6: **Hemoglobin subunit alpha-1** with a score of **0.033**

Prediction number 7: **GLOBIN domain-containing protein** with a score of **0.027**

Prediction number 8: **Hemoglobin subunit alpha-2** with a score of **0.025**

Prediction number 9: **Hemoglobin alpha-1 chain** with a score of **0.018**

Prediction number 10: **Hemoglobin alpha-2 chain** with a score of **0.012**
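
The cell above removes only spaces from the pasted sequence. Sequences copied from FASTA files often contain newlines or lowercase letters as well, so a slightly stricter clean-up step may be useful before calling `run_inference`. The sketch below is one possible version: `clean_sequence` and `VALID_RESIDUES` are hypothetical names, and the accepted alphabet (the 20 standard residues plus the codes B, J, O, U, X, Z) is an assumption, not something the model documentation confirms.

```python
import re

# Assumption: accept the 20 standard amino acids plus ambiguity/rare codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" "BJOUXZ")

def clean_sequence(raw):
  """Strip all whitespace, upper-case, and reject unexpected characters."""
  seq = re.sub(r"\s+", "", raw).upper()
  bad = sorted(set(seq) - VALID_RESIDUES)
  if bad:
    raise ValueError(f"Unexpected characters in sequence: {bad}")
  return seq

print(clean_sequence("mvlspadktn vkaawgkvga\nHAGEYGAEAL"))
```

The cleaned string can then be passed to `run_inference` in place of `sequence.replace(' ', '')`.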

In [6]:
#@title 4. A second example prediction (bifunctional TrpCF protein P22098)

#@markdown Press the _Play_ button to get a prediction.
sequence = "MKMTDFNTQQANNLSEHVSKKEAEMAEVLAKIVRDKYQWVAERKASQHLSTFQSDLLPSD RSFYDALSGDKTVFITECKKASPSKGLIRNDFDLDYIASVYNNYADAISVLTDEKYFQGS FDFLPQVRRQVKQPVLCKDFMVDTYQVYLARHYGADAVLLMLSVLNDEEYKALEEAAHSL NMGILTEVSNEEELHRAVQLGARVIGINNRNLRDLTTDLNRTKALAPTIRKLAPNATVIS ESGIYTHQQVRDLAEYADGFLIGSSLMAEDNLELAVRKVTLGENKVCGLTHPDDAAKAYQ AGAVFGGLIFVEKSKRAVDFESARLTMSGAPLNYVGVFQNHDVDYVASIVTSLGLKAVQL HGLEDQEYVNQLKTELPVGVEIWKAYGVADTKPSLLADNIDRHLLDAQVGTQTGGTGHVF DWSLIGDPSQIMLAGGLSPENAQQAAKLGCLGLDLNSGVESAPGKKDSQKLQAAFHAIRN Y" #@param {type:"string"}
sequence = sequence.replace(' ', '')

names, scores = run_inference(sequence)

for i, (name, score) in enumerate(zip(names, scores), start=1):
  print_markdown(f"Prediction number {i}: **{name}** with a score of **{score:.3f}**")

Prediction number 1: **Multifunctional fusion protein** with a score of **0.182**

Prediction number 2: **Indole-3-glycerol phosphate synthase** with a score of **0.168**

Prediction number 3: **IGPS** with a score of **0.162**

Prediction number 4: **EC:4.1.1.48** with a score of **0.160**

Prediction number 5: **N-(5'-phosphoribosyl)anthranilate isomerase** with a score of **0.109**

Prediction number 6: **PRAI** with a score of **0.105**

Prediction number 7: **EC:5.3.1.24** with a score of **0.100**

Prediction number 8: **Bifunctional indole-3-glycerol-phosphate synthase TrpC/phosphoribosylanthranilate isomerase TrpF** with a score of **0.002**

Prediction number 9: **Tryptophan biosynthesis protein TrpCF** with a score of **0.001**

Prediction number 10: **Anthranilate synthase component 1** with a score of **0.001**
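
For a bifunctional protein like this one, the beam mixes full names, abbreviations, and EC numbers. One way to make the list easier to read is to drop low-scoring candidates and separate the EC numbers from the names. The sketch below does this on the output shown above; `split_predictions` is a hypothetical helper and the `min_score` cutoff of 0.05 is an arbitrary illustrative choice, not a recommended threshold.

```python
def split_predictions(names, scores, min_score=0.05):
  """Keep candidates scoring at least min_score and split off 'EC:' entries."""
  kept = [(n, s) for n, s in zip(names, scores) if s >= min_score]
  ec_numbers = [(n, s) for n, s in kept if n.startswith("EC:")]
  protein_names = [(n, s) for n, s in kept if not n.startswith("EC:")]
  return protein_names, ec_numbers

# Names and scores copied from the TrpCF output above.
names = [
    "Multifunctional fusion protein",
    "Indole-3-glycerol phosphate synthase",
    "IGPS",
    "EC:4.1.1.48",
    "N-(5'-phosphoribosyl)anthranilate isomerase",
    "PRAI",
    "EC:5.3.1.24",
    "Bifunctional indole-3-glycerol-phosphate synthase TrpC/phosphoribosylanthranilate isomerase TrpF",
    "Tryptophan biosynthesis protein TrpCF",
    "Anthranilate synthase component 1",
]
scores = [0.182, 0.168, 0.162, 0.160, 0.109, 0.105, 0.100, 0.002, 0.001, 0.001]

protein_names, ec_numbers = split_predictions(names, scores)
print(protein_names)
print(ec_numbers)
```

With this cutoff the two EC numbers, 4.1.1.48 and 5.3.1.24, both survive, which matches the two enzymatic activities named in the other candidates.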