# About this example

This example shows how you can run a Wav2Vec2 model to perform Speech-To-Text with confidentiality guarantees. 

With BlindAI, users can send recordings of their conversations to an AI model for analysis without fearing privacy leaks.

Wav2Vec2 is a state-of-the-art Transformer model for speech. You can learn more about it in the [FAIR blog post](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/).

# Installing dependencies

Install the dependencies this example needs.

In [1]:
!pip install -q transformers[onnx] torch



We will need `librosa` to load the "hello world" audio file. You might need to downgrade `numpy` to 1.21 to make it work. The following commands should do the trick to install `librosa`:

In [2]:
!pip install -q --upgrade numpy==1.21
!pip install -q librosa



In addition, you might need to install `ffmpeg` as a backend for decoding the WAV file.

In [3]:
!sudo apt-get install -y ffmpeg



Install the latest version of BlindAI.

In [4]:
!pip install blindai



# Preparing the model

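Here we will use the base Wav2Vec2 model fine-tuned on 960 hours of LibriSpeech audio (`facebook/wav2vec2-base-960h`). The first step is to get the model and its processor, which bundles the feature extractor and the tokenizer.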

In [5]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We can download a "hello world" audio file to use as an example.

In [6]:
!wget https://github.com/mithril-security/blindai/raw/master/examples/wav2vec2/hello_world.wav



We can listen to it here:

In [7]:
import IPython.display as ipd
ipd.Audio('hello_world.wav')

We can then see the Wav2Vec2 model in action on the hello world file.

In [8]:
import librosa

audio, rate = librosa.load("hello_world.wav", sr = 16000)

# Tokenize sampled audio to input into model
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values

# Retrieve logits
logits = model(input_values).logits

# Take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)

['HELLO WORLD']
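
For intuition about the "argmax and decode" step, here is a minimal pure-Python sketch of greedy CTC decoding: take the best-scoring token for each frame, merge consecutive repeats, then drop the blank token. The toy vocabulary, the scores, and the function names are illustrative assumptions. The real vocabulary ships with the processor, and `processor.batch_decode` remains the call to use in practice.

```python
# Toy vocabulary (hypothetical): index 0 is the CTC blank, "|" separates words.
VOCAB = ["<pad>", "|", "H", "E", "L", "O", "W", "R", "D"]
BLANK_ID = 0
WORD_SEP = "|"


def argmax_per_frame(logits):
    """Return the index of the highest score for each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]


def greedy_ctc_decode(ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Merge consecutive repeats, drop blanks, and map ids to text."""
    merged = [i for k, i in enumerate(ids) if k == 0 or i != ids[k - 1]]
    chars = [vocab[i] for i in merged if i != blank_id]
    return "".join(chars).replace(WORD_SEP, " ").strip()


# Frame-level ids as an argmax could produce them for "HELLO WORLD".
# The blank between the two L's keeps them from being merged.
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 6, 5, 7, 4, 8, 0]
print(greedy_ctc_decode(frame_ids))  # HELLO WORLD

# Argmax on a tiny table of 3 frames and 3 tokens.
scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.2, 0.2, 0.6]]
print(argmax_per_frame(scores))  # [1, 0, 2]
```

The blank token is what lets CTC output a doubled letter such as the "LL" in "HELLO": without it, the two runs of `L` would be merged into one.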


To simplify deployment, we fold the post-processing (the argmax over the logits) directly into the model. This way the client only has to decode the returned token ids.

In [9]:
import torch.nn as nn

# Let's embed the post-processing phase with argmax inside our model
class ArgmaxLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, outputs):
        return torch.argmax(outputs.logits, dim = -1)

In [10]:
final_layer = ArgmaxLayer()

# Finally we concatenate everything
full_model = nn.Sequential(model, final_layer)

We can check that the results are the same.

In [11]:
predicted_ids = full_model(input_values)
transcription = processor.batch_decode(predicted_ids)
transcription

['HELLO WORLD']
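
With the argmax folded in, the model returns one integer token id per output frame instead of a full row of scores. As a rough sketch of what that means for the response size, the snippet below estimates the number of output frames from the number of audio samples. It assumes the default Wav2Vec2 feature-extractor configuration (seven unpadded convolutions with the kernel sizes and strides listed in the code) and a 32-token vocabulary; check `model.config` for the actual values. The function name is hypothetical.

```python
# Assumed defaults of the Wav2Vec2 convolutional feature extractor
# (compare with model.config.conv_kernel and model.config.conv_stride).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
VOCAB_SIZE = 32  # assumption; see model.config.vocab_size


def num_output_frames(num_samples):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


frames = num_output_frames(16000)  # one second of audio at 16 kHz
print(frames)               # 49 token ids returned by the model with argmax
print(frames * VOCAB_SIZE)  # 1568 scores if raw logits were returned instead
```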

Now we can export the model to ONNX format, so that we can later upload the ONNX file to our BlindAI server.

In [12]:
torch.onnx.export(
    full_model,
    input_values,
    'wav2vec2_hello_world.onnx',
    export_params=True,
    opset_version=11,
)

# Deployment on BlindAI

Please make sure the **server is running**. To launch the server, refer to the [Launching the server](https://docs.mithrilsecurity.io/getting-started/quick-start/run-the-blindai-server) documentation page. 

If you have followed those steps and have the Docker image ready, you simply have to run `docker run -it -p 50051:50051 -p 50052:50052 mithrilsecuritysas/blindai-server-sim:latest`.

The first thing we need to do is connect securely to the BlindAI server instance. Here we will use simulation mode for ease of use. This means that we do not leverage the hardware security properties of secure enclaves, but in return we do not need specific hardware to run the Docker image.

If you wish to run this example in hardware mode, you need to prepare the `host_server.pem` and `policy.toml` files. Learn more on the [Deploy on Hardware](https://docs.mithrilsecurity.io/getting-started/deploy-on-hardware) documentation page. 

In [13]:
from blindai.client import BlindAiClient, ModelDatumType

# Launch client
client = BlindAiClient()

# Simulation mode
client.connect_server(addr="localhost", simulation=True)

# Hardware mode
# client.connect_server(addr="localhost", policy="./policy.toml", certificate="./host_server.pem")
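
One way to keep both modes in a single notebook is to build the `connect_server` arguments from an explicit switch. The helper below is a hypothetical sketch, not part of the BlindAI client: it only uses the argument names shown in the cell above (`addr`, `simulation`, `policy`, `certificate`), and it refuses to build hardware-mode arguments when a required file is missing rather than silently falling back to simulation.

```python
from pathlib import Path


def connection_kwargs(hardware, addr="localhost", policy="./policy.toml",
                      certificate="./host_server.pem"):
    """Build keyword arguments for client.connect_server (hypothetical helper)."""
    if not hardware:
        # Simulation mode: convenient for testing, no enclave guarantees.
        return {"addr": addr, "simulation": True}
    # Hardware mode: both files must exist before we try to connect.
    missing = [p for p in (policy, certificate) if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError("hardware mode needs: " + ", ".join(missing))
    return {"addr": addr, "policy": policy, "certificate": certificate}


kwargs = connection_kwargs(hardware=False)
print(kwargs)  # {'addr': 'localhost', 'simulation': True}

# Intended use with the client created above:
# client.connect_server(**kwargs)
```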



Then, upload the model to the BlindAI server. This simply means uploading the ONNX file created before.

When uploading the model, we have to specify the shape and the data type of the input.

In this case, because we use a Wav2Vec2 model, we will need to send floats for the audio file. By default BlindAI outputs floats, but here the model returns token ids, so we have to specify that we expect integers as outputs.

In [14]:
client.upload_model(model="./wav2vec2_hello_world.onnx", shape=input_values.shape, 
                    dtype=ModelDatumType.F32, dtype_out=ModelDatumType.I64)



<blindai.client.UploadModelResponse at 0x7fc89aee8e20>

# Sending data for confidential prediction

Now it's time to check that everything works live!

We will prepare some input and send it to the model running inside the BlindAI secure enclave.

First we prepare our input data, the hello world audio file.

In [15]:
from transformers import Wav2Vec2Processor
import torch
import librosa

# Load only the processor: the model itself now runs on the BlindAI server
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

audio, rate = librosa.load("hello_world.wav", sr = 16000)

# Tokenize sampled audio to input into model
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values

Now we can send the audio data to be processed confidentially!

In [16]:
response = client.run_model(input_values.flatten().tolist())
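
The client sends the input as a flat list of floats, which only makes sense together with the shape declared at upload time. Below is a minimal sketch of that bookkeeping in plain Python. It assumes a hypothetical `flatten_and_check` helper (not part of BlindAI) that mirrors `input_values.flatten().tolist()` for nested lists and verifies the element count before anything is sent.

```python
from math import prod


def flatten(nested):
    """Recursively flatten nested lists into one flat list (row-major order)."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def flatten_and_check(nested, declared_shape):
    """Flatten the input and check it has as many elements as the declared shape."""
    flat = flatten(nested)
    expected = prod(declared_shape)
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return flat


batch = [[0.1, -0.2, 0.3, 0.0]]          # toy stand-in for a (1, 4) tensor
print(flatten_and_check(batch, (1, 4)))  # [0.1, -0.2, 0.3, 0.0]
```

With the real tensor, the equivalent check would compare `input_values.numel()` with the product of the shape passed to `upload_model`.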

We can reconstruct the output now:

In [17]:
# Decode the output
print(processor.batch_decode(torch.tensor(response.output).unsqueeze(0)))

['HELLO WORLD']


Et voilà! We have applied a state-of-the-art speech recognition model without ever exposing the data in clear to the people operating the service!

If you liked this example, do not hesitate to drop a star on our [GitHub](https://github.com/mithril-security/blindai) and chat with us on our [Discord](https://discord.gg/TxEHagpWd4)!