<a href="https://colab.research.google.com/github/mithril-security/blindai/blob/jupyter-0.5.0-redux/examples/wav2vec2/BlindAI-Wav2vec2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About this example

This example shows how you can run a Wav2Vec2 model to perform Speech-To-Text with confidentiality guarantees. 

By using BlindAI, people can send recordings of their conversations to an AI model for analysis without having to fear privacy leaks.

Wav2Vec2 is a state-of-the-art Transformer model for speech. You can learn more about it in the [FAIR blog post](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/).

# Installing dependencies

Install the dependencies this example needs.

In [1]:
!pip install -q transformers[onnx] torch



We will need `librosa` to load the "hello world" audio file. You might need to downgrade `numpy` to 1.21 to make it work. The following commands should do the trick to install `librosa`:

In [2]:
!pip install -q --upgrade numpy==1.21
!pip install -q librosa


In addition, you might need to install `ffmpeg` to have a backend to process the wav file.

In [3]:
!sudo apt-get install -y ffmpeg

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.


Install the latest version of BlindAI.

In [4]:
!pip install blindai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting blindai
  Downloading blindai-0.5.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting cryptography
  Downloading cryptography-38.0.1-cp36-abi3-manylinux_2_24_x86_64.whl (4.0 MB)
Collecting bitstring
  Downloading bitstring-3.1.9-py3-none-any.whl (38 kB)
Collecting grpcio-tools>=1.4
  Downloading grpcio_tools-1.48.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting grpcio>=1.4
  Downloading grpcio-1.48.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Installing collected packages: grpcio, grpcio-tools, cryptography, bitstring, blindai
  Attempting uninstall: g

# Preparing the model

Here we will use the base Wav2Vec2 model fine-tuned on 960 hours of speech (`facebook/wav2vec2-base-960h`). The first step is to get the model and its processor.

In [5]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We will use a short "hello world" audio file as an example. Let's download it.

In [6]:
!wget https://github.com/mithril-security/blindai/raw/master/examples/wav2vec2/hello_world.wav

--2022-09-11 19:30:22--  https://github.com/mithril-security/blindai/raw/master/examples/wav2vec2/hello_world.wav
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mithril-security/blindai/master/examples/wav2vec2/hello_world.wav [following]
--2022-09-11 19:30:22--  https://raw.githubusercontent.com/mithril-security/blindai/master/examples/wav2vec2/hello_world.wav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6083 (5.9K) [audio/wav]
Saving to: ‘hello_world.wav’


2022-09-11 19:30:22 (42.9 MB/s) - ‘hello_world.wav’ saved [6083/6083]



We can listen to it here:

In [7]:
import IPython.display as ipd
ipd.Audio('hello_world.wav')

We can then see the Wav2Vec2 model in action on the hello world file.

In [8]:
import librosa

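# Load the audio, resampled to the 16 kHz rate the model was trained on
audio, rate = librosa.load("hello_world.wav", sr=16000)

# Preprocess the sampled audio into the tensor format the model expects
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values

# Retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits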

# Take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)



['HELLO WORLD']
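
To make the "argmax and decode" step more concrete, here is a minimal, dependency-free sketch of greedy CTC decoding, which is roughly what happens between the logits and the final text: take the most likely ID per audio frame, collapse consecutive repeats, then drop the blank token. The vocabulary (`ID_TO_CHAR`), `BLANK_ID`, and the frame sequence below are made up for illustration and are not the processor's real vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary: NOT the real Wav2Vec2 vocabulary.
BLANK_ID = 0
ID_TO_CHAR = {1: "H", 2: "E", 3: "L", 4: "O", 5: " ", 6: "W", 7: "R", 8: "D"}


def argmax(row):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=BLANK_ID):
    """Collapse consecutive repeated IDs, then remove blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# One predicted ID per audio frame (toy values). The blank between the
# two runs of 3 is what lets the double "L" survive the collapsing step.
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0, 5, 6, 4, 4, 7, 3, 8, 8, 0]
print(ctc_greedy_decode(frame_ids, ID_TO_CHAR))  # prints "HELLO WORLD"
```

In the notebook, `torch.argmax(logits, dim=-1)` plays the role of `argmax` over every frame, and `processor.batch_decode` performs the collapsing and blank removal with the real vocabulary.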


To simplify deployment, we will add the post-processing step (the argmax) directly to the model. This way, the client only has to decode the predicted IDs into text.

In [9]:
import torch.nn as nn

# Let's embed the post-processing phase with argmax inside our model
class ArgmaxLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, outputs):
        # `outputs` is the Wav2Vec2ForCTC output object; keep the most likely ID per frame
        return torch.argmax(outputs.logits, dim=-1)

In [10]:
final_layer = ArgmaxLayer()

# Finally we concatenate everything
full_model = nn.Sequential(model, final_layer)

We can check that the results are the same.

In [11]:
predicted_ids = full_model(input_values)
transcription = processor.batch_decode(predicted_ids)
transcription

['HELLO WORLD']

Now we can export the model to ONNX format, so that we can later upload it to our BlindAI server.

In [12]:
torch.onnx.export(
    full_model,
    input_values,
    'wav2vec2_hello_world.onnx',
    export_params=True,
    opset_version=11,
)



# Deployment on BlindAI

Now we can upload the model to BlindAI Cloud. To upload the model, make sure you have an API key.

You can get one on the [Mithril Cloud](https://cloud.mithrilsecurity.io/).

You might get an error if the name you want to use is already taken, as models are uniquely identified by their `model_id`. We will implement namespaces soon to avoid that. In the meantime, you will have to choose a unique ID. The example below uploads your model under a unique name:

In [14]:
import blindai
import uuid

api_key = "YOUR_API_KEY" # Enter your API key here
model_id = "wav2vec2-" + str(uuid.uuid4())

# Upload the ONNX file to the remote enclave
with blindai.connect(api_key=api_key) as client:
    response = client.upload_model("wav2vec2_hello_world.onnx", model_id=model_id)
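
Because the ID is random, you need to keep track of it if the prediction step runs in a different session from the upload. A minimal sketch of one way to do that, using only the standard library; the helper names (`make_model_id`, `save_model_id`, `load_model_id`) and the file name are hypothetical and not part of the BlindAI API:

```python
import tempfile
import uuid
from pathlib import Path


def make_model_id(prefix: str) -> str:
    """Build an ID that is unique for practical purposes: '<prefix>-<random UUID>'."""
    return f"{prefix}-{uuid.uuid4()}"


def save_model_id(model_id: str, path: Path) -> None:
    """Persist the ID so that a later session can reuse it."""
    path.write_text(model_id, encoding="utf-8")


def load_model_id(path: Path) -> str:
    """Read back a previously saved ID."""
    return path.read_text(encoding="utf-8").strip()


# Hypothetical location; any path you can read back later would do.
id_file = Path(tempfile.gettempdir()) / "blindai_model_id.txt"

model_id = make_model_id("wav2vec2")
save_model_id(model_id, id_file)
print(load_model_id(id_file) == model_id)
```

The loaded value can then be passed as `model_id` when calling `predict`.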



# Sending data for confidential prediction

Now it's time to check it's working live!

We only need to prepare the input on the client side. The model itself runs inside the BlindAI secure enclave, which processes the data.

First we prepare our input data, the hello world audio file.

In [15]:
from transformers import Wav2Vec2Processor
import librosa

# Load the processor only: the model itself now runs inside the BlindAI enclave
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Load the audio, resampled to the 16 kHz rate the model was trained on
audio, rate = librosa.load("hello_world.wav", sr=16000)

# Preprocess the sampled audio into the tensor format the model expects
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values
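
For intuition about what the processor does to the raw waveform before it leaves the client: to our understanding, for this checkpoint it mainly rescales the samples to zero mean and unit variance. The sketch below illustrates that normalization in plain Python. It is an assumption-laden illustration, not the library's implementation; the function name and the `eps` value are ours.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift samples to zero mean and scale them to (almost exactly) unit variance.

    `eps` is a small constant (our choice) that avoids division by zero on silence.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in samples]


# Toy waveform standing in for the samples returned by librosa.load
waveform = [0.0, 0.5, -0.5, 0.25, -0.25]
normalized = zero_mean_unit_var(waveform)

mean_after = sum(normalized) / len(normalized)
var_after = sum((x - mean_after) ** 2 for x in normalized) / len(normalized)
print(mean_after, var_after)
```

Since this step runs locally, only the preprocessed tensor is sent to the enclave.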



Now we can send the audio data to be processed confidentially!

In [16]:
# Reuse the same API key and model ID as for the upload
with blindai.connect(api_key=api_key) as client:
    response = client.predict(model_id, input_values)

We can reconstruct the output now:

In [17]:
# Decode the output
processor.batch_decode(response.output[0].as_torch().unsqueeze(0))

['HELLO WORLD']

Et voilà! We have been able to apply a state-of-the-art speech recognition model without ever having to show the data in the clear to the people operating the service!

If you liked this example, do not hesitate to drop a star on our [GitHub](https://github.com/mithril-security/blindai) and chat with us on our [Discord](https://discord.gg/TxEHagpWd4)!