# About this example

This example shows how to run a Wav2Vec2 model to perform speech-to-text with confidentiality guarantees.

With BlindAI, users can send recordings of their conversations to an AI model for analysis without having to fear privacy leaks.

Wav2Vec2 is a state-of-the-art Transformer model for speech. You can learn more about it in the [FAIR blog post](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/).

# Installing dependencies

Install the dependencies this example needs.

In [None]:
!pip install -q transformers[onnx] torch

We will need `librosa` to load the "hello world" audio file. You might need to downgrade `numpy` to 1.21 to make it work. The following commands should do the trick to install `librosa`:

In [None]:
!pip install -q --upgrade numpy==1.21
!pip install -q librosa

In addition, you might need to install `ffmpeg` so that an audio backend is available to decode the WAV file.

In [None]:
!sudo apt-get install -y ffmpeg

Install the latest version of BlindAI.

In [None]:
!pip install blindai

# Preparing the model

Here we will use the base Wav2Vec2 model (`facebook/wav2vec2-base-960h`). The first step is to load the model and its processor.

In [None]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Let's download a hello world audio file to use as an example.

In [None]:
!wget https://github.com/mithril-security/blindai/raw/master/examples/wav2vec2/hello_world.wav

We can listen to it here:

In [None]:
import IPython.display as ipd
ipd.Audio('hello_world.wav')

We can then see the Wav2Vec2 model in action on the hello world file.

In [None]:
import librosa

audio, rate = librosa.load("hello_world.wav", sr = 16000)

# Preprocess the sampled audio into the tensor the model expects
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values

# Retrieve logits
logits = model(input_values).logits

# Take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
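
`processor.batch_decode` does more than look up one character per ID: for a CTC model such as Wav2Vec2, consecutive repeated IDs are merged and the blank (padding) token is dropped before the text is assembled. A minimal sketch of that collapse step, using a hypothetical toy vocabulary rather than the model's real one:

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration; the real model ships its own.
# ID 0 plays the role of the CTC blank, "|" marks a word boundary.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0

def ctc_greedy_collapse(ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Merge consecutive repeats, drop blanks, and map IDs to text."""
    merged = [token_id for token_id, _ in groupby(ids)]
    kept = [token_id for token_id in merged if token_id != blank_id]
    text = "".join(vocab[token_id] for token_id in kept)
    return text.replace("|", " ").strip()

# One ID per frame, as produced by an argmax over the logits.
frame_ids = [0, 2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 0, 1, 0]
print(ctc_greedy_collapse(frame_ids))  # HELLO
```

Note how the blank between the two runs of `L` is what lets the double letter survive the merge.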

To simplify deployment, we will embed the argmax post-processing step directly in the model. This way the client only has to decode the predicted IDs into text.

In [None]:
import torch.nn as nn

# Let's embed the post-processing phase with argmax inside our model
class ArgmaxLayer(nn.Module):
    def __init__(self):
        super(ArgmaxLayer, self).__init__()

    def forward(self, outputs):
        return torch.argmax(outputs.logits, dim = -1)

In [None]:
final_layer = ArgmaxLayer()

# Finally we concatenate everything
full_model = nn.Sequential(model, final_layer)

We can check that the results are the same.

In [None]:
predicted_ids = full_model(input_values)
transcription = processor.batch_decode(predicted_ids)
transcription
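
The tensor returned by `full_model` contains one predicted ID per output frame, not per audio sample. Assuming the convolutional feature encoder configuration published for the base Wav2Vec2 model (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, which are not stated in this notebook), the number of frames can be estimated roughly as follows:

```python
# Assumed feature-encoder configuration of the base Wav2Vec2 model.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

# One second of audio sampled at 16 kHz.
print(num_output_frames(16000))  # 49
```

Under that assumption, each predicted ID covers roughly 20 ms of audio.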

Now we can export the model to ONNX format, so that we can later upload the ONNX file to our BlindAI server.

In [None]:
torch.onnx.export(
    full_model,
    input_values,
    'wav2vec2_hello_world.onnx',
    export_params=True,
    opset_version=11,
)
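
Because this export call does not pass `dynamic_axes`, the ONNX graph keeps the exact input shape of `input_values`, so the exported model expects clips of the same length as the hello world file. A minimal sketch of how a client might fit other clips to that length; `fit_to_length` is a hypothetical helper, not part of BlindAI or Transformers, and zero-padding or truncating a clip may affect the transcription:

```python
def fit_to_length(samples, target_length, pad_value=0.0):
    """Truncate or right-pad a sequence of samples to target_length."""
    if target_length < 0:
        raise ValueError("target_length must be non-negative")
    clipped = list(samples[:target_length])
    padding = [pad_value] * (target_length - len(clipped))
    return clipped + padding

short_clip = [0.1, -0.2, 0.3]
long_clip = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
print(fit_to_length(short_clip, 5))  # [0.1, -0.2, 0.3, 0.0, 0.0]
print(fit_to_length(long_clip, 5))   # [0.1, -0.2, 0.3, 0.4, -0.5]
```

In this notebook, the target length would be `input_values.shape[1]` from the export cell.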

# Deployment on BlindAI

In [None]:
import blindai

# Launch client
with blindai.connect() as client:
    response = client.upload_model(
        model="./wav2vec2_hello_world.onnx"
    )
    model_id = response.model_id

This securely uploads the model to the Mithril Cloud.
If you wish to run this example on-premises, read the [Deploy on Hardware](https://blindai.mithrilsecurity.io/en/latest/getting-started/deploy-on-hardware/) documentation page.

# Sending data for confidential prediction

Now it's time to check that it works live!

We just need to prepare some input for the model running inside BlindAI's secure enclave to process.

First we prepare our input data from the hello world audio file.

In [None]:
from transformers import Wav2Vec2Processor
import torch
import librosa

# load the processor (the model itself now runs on the BlindAI server)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

audio, rate = librosa.load("hello_world.wav", sr = 16000)

# Preprocess the sampled audio into the tensor the model expects
input_values = processor(audio, sampling_rate=rate, return_tensors="pt", padding="longest").input_values
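
On the input side, the processor's main job is to turn the waveform into a normalized float tensor. A rough sketch of that step, assuming the feature extractor for this checkpoint applies zero-mean, unit-variance normalization with a small epsilon (the exact constant used here is an assumption):

```python
import math

def normalize_waveform(samples, eps=1e-7):
    """Scale samples to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]

normalized = normalize_waveform([0.0, 0.5, -0.5, 1.0, -1.0])
new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(abs(new_mean) < 1e-9, abs(new_variance - 1.0) < 1e-4)
```

This only illustrates the idea; in the notebook, the processor call above remains the way to build `input_values`.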

Now we can send the audio data to be processed confidentially!

In [None]:
with blindai.connect() as client:
    response = client.predict(model_id, input_values)

We can now decode the output into text:

In [None]:
# Decode the output
processor.batch_decode(response.output[0].as_torch().unsqueeze(0))

Et voilà! We have applied a state-of-the-art speech recognition model without ever exposing the data in clear to the people operating the service!

If you liked this example, feel free to drop a star on our [GitHub](https://github.com/mithril-security/blindai) and chat with us on our [Discord](https://discord.gg/TxEHagpWd4)!