# Gai/Gen: Speech-to-Text (STT)

## 1. Note

The following examples have been tested in this environment:

-   NVIDIA GeForce RTX 2060 6GB
-   Windows 11 + WSL2
-   Ubuntu 22.04
-   Python 3.10
-   CUDA Toolkit 11.8
-   openai 1.6.1
-   transformers 4.36.2
-   accelerate 0.25.0


## 2. Create Virtual Environment and Install Dependencies

We will create a separate virtual environment for this to avoid conflicts between the dependencies that each underlying model requires.

```sh
sudo apt update -y && sudo apt install ffmpeg git git-lfs -y
conda create -n STT python=3.10.10 -y
conda activate STT
pip install -e ".[STT]"
```
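After installation, a quick check can confirm that `ffmpeg` is on the `PATH` and show which package versions were actually installed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_environment` are illustrative and not part of `gai`. The versions it compares against are the ones listed in the tested environment, and other versions may work as well.

```python
import shutil
from importlib import metadata

# Versions listed in the tested environment; other versions may also work.
TESTED_VERSIONS = {
    "openai": "1.6.1",
    "transformers": "4.36.2",
    "accelerate": "0.25.0",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Collect (name, found, tested) tuples for ffmpeg and the Python packages."""
    report = [("ffmpeg", shutil.which("ffmpeg"), "on PATH")]
    for package, tested in TESTED_VERSIONS.items():
        report.append((package, installed_version(package), tested))
    return report


if __name__ == "__main__":
    for name, found, tested in check_environment():
        status = found if found is not None else "MISSING"
        print(f"{name:<14} found: {status}  (tested: {tested})")
```

Run it inside the activated `STT` environment so that it inspects the packages installed there.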

## 3. Install Model

```bash
huggingface-cli download openai/whisper-large-v3 \
        --local-dir ~/gai/models/whisper-large-v3 \
        --local-dir-use-symlinks False
```
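To confirm that the download finished, the target directory can be inspected before loading the model. This is a rough sketch under two assumptions: that the checkpoint contains a `config.json`, and that it has at least one `*.safetensors` or `*.bin` weights file. The function `check_model_dir` is a hypothetical helper, not part of `gai` or `huggingface-cli`.

```python
from pathlib import Path

# Local directory used by the download command above.
MODEL_DIR = Path.home() / "gai" / "models" / "whisper-large-v3"


def check_model_dir(model_dir):
    """Return a list of problems found in the model directory (empty if none)."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return [f"directory not found: {model_dir}"]
    problems = []
    # Assumption: the checkpoint ships a config.json ...
    if not (model_dir / "config.json").is_file():
        problems.append("config.json is missing")
    # ... and at least one weights file in safetensors or PyTorch format.
    weights = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("*.bin"))
    if not weights:
        problems.append("no *.safetensors or *.bin weights file found")
    return problems


if __name__ == "__main__":
    issues = check_model_dir(MODEL_DIR)
    print("OK" if not issues else "\n".join(issues))
```

An empty result only means the expected files are present; it does not verify their integrity.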

## 4. Examples

```python
# 6.12 OpenAI Whisper Speech-to-Text
from pathlib import Path

from gai.gen import Gaigen

print("GENERATING:")
gen = Gaigen.GetInstance().load('openai-whisper')

# Method 1: Using Path
response = gen.create(
    file=Path("../tests/gen/stt/tell-me-a-one-paragraph-story.wav")
)
print(response)

# Method 2: Using File (the with-block closes the file afterwards)
with open("../tests/gen/stt/tell-me-a-one-paragraph-story.wav", "rb") as file:
    response = gen.create(
        file=file
    )
print(response)

# Method 3: Using Bytes (Not-In-Spec)
with open("../tests/gen/stt/tell-me-a-one-paragraph-story.wav", "rb") as file:
    data = file.read()
response = gen.create(
    file=data
)
print(response)
```

Output:

```
GENERATING:
Transcription(text='Tell me a one-paragraph story.')
Transcription(text='Tell me a one-paragraph story.')
Transcription(text='Tell me a one-paragraph story.')
```
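The three calls above pass the same audio as a `Path`, an open binary file, and raw `bytes`, and all return the same transcription. How `Gaigen` handles these inputs internally is not shown here. One plausible approach is to reduce every form to bytes first, as in this hypothetical helper (`to_audio_bytes` is an illustrative name, not a `gai` function):

```python
import io
from pathlib import Path


def to_audio_bytes(file):
    """Hypothetical helper: return raw audio bytes for any of the three input forms."""
    if isinstance(file, (bytes, bytearray)):
        return bytes(file)              # Method 3: already raw bytes
    if isinstance(file, Path):
        return file.read_bytes()        # Method 1: read the file at this path
    if hasattr(file, "read"):
        data = file.read()              # Method 2: open file object
        if isinstance(data, str):
            raise TypeError("open the file in binary mode ('rb')")
        return data
    raise TypeError(f"unsupported input type: {type(file).__name__}")


# In-memory stand-in for an opened audio file.
fake_file = io.BytesIO(b"RIFF0000WAVE")
print(len(to_audio_bytes(fake_file)))
```

Normalizing at the boundary like this keeps the transcription code itself independent of how the caller supplied the audio.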


```python
# 6.13 Open-Sourced Whisper Speech-to-Text
from pathlib import Path

from gai.gen import Gaigen

print("GENERATING:")
gen = Gaigen.GetInstance().load('whisper-transformers')

# Method 1: Using Path
response = gen.create(
    file=Path("../tests/gen/stt/today-is-a-wonderful-day.wav")
)
print(response)

# Method 2: Using File (the with-block closes the file afterwards)
with open("../tests/gen/stt/today-is-a-wonderful-day.wav", "rb") as file:
    response = gen.create(
        file=file
    )
print(response)

# Method 3: Using Bytes (Not-In-Spec)
with open("../tests/gen/stt/today-is-a-wonderful-day.wav", "rb") as file:
    data = file.read()
response = gen.create(
    file=data
)
print(response)
```

Output:

```
GENERATING:
Transcription(text='Today is a wonderful day to build something people love.')
Transcription(text='Today is a wonderful day to build something people love.')
Transcription(text='Today is a wonderful day to build something people love.')
```


## 5. Running as a Service

#### Step 1: Start Docker container

```bash
docker run -d \
    --name gai-stt \
    -p 12031:12031 \
    --gpus all \
    -v ~/gai/models:/app/models \
    kakkoii1337/gai-stt:latest
```

#### Step 2: Wait for model to load

```bash
docker logs gai-stt
```

When the loading is completed, the logs should show this:

```bash
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12031 (Press CTRL+C to quit)
```
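Instead of re-running `docker logs` by hand, the wait can be scripted by polling the logs for the `Application startup complete.` line shown above. This is a minimal sketch: the function names, the timeout, and the polling interval are arbitrary choices for illustration, and it assumes the `docker` CLI is available on the host.

```python
import subprocess
import time

# Log line that marks the end of startup, taken from the output above.
READY_MARKER = "Application startup complete."


def read_docker_logs(container="gai-stt"):
    """Return the container's combined stdout and stderr as text."""
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


def wait_until_ready(read_logs=read_docker_logs, timeout=600.0, interval=5.0):
    """Poll the logs until READY_MARKER appears; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if READY_MARKER in read_logs():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` with the defaults checks the `gai-stt` container every five seconds for up to ten minutes. Polling the logs rather than the TCP port is deliberate: a published Docker port can accept connections before the application inside the container is listening.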

#### Step 3: Test


```bash
curl -X 'POST' \
'http://localhost:12031/gen/v1/audio/transcriptions' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -s \
    -F 'file=@../tests/gen/stt/today-is-a-wonderful-day.wav' \
    -F 'model=whisper-transformers'
```

Response:

```json
{"text":" Today is a wonderful day to build something people love.","chunks":[{"timestamp":[0.0,3.14],"text":" Today is a wonderful day to build something people love."}]}
```
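The JSON response above contains the full transcript in `text` and a list of `chunks`, each with a `[start, end]` timestamp in seconds. The sketch below parses that exact response with the standard library; `summarize` is an illustrative helper, and it assumes every chunk carries both a start and an end time, as in the example.

```python
import json

# Response body copied from the example output above.
raw = (
    '{"text":" Today is a wonderful day to build something people love.",'
    '"chunks":[{"timestamp":[0.0,3.14],'
    '"text":" Today is a wonderful day to build something people love."}]}'
)


def summarize(response_json):
    """Return the stripped transcript and a list of (start, end, text) tuples."""
    payload = json.loads(response_json)
    text = payload["text"].strip()
    chunks = [
        (chunk["timestamp"][0], chunk["timestamp"][1], chunk["text"].strip())
        for chunk in payload.get("chunks", [])
    ]
    return text, chunks


text, chunks = summarize(raw)
print(text)
for start, end, chunk_text in chunks:
    print(f"[{start:.2f}s - {end:.2f}s] {chunk_text}")
```

Note that the service returns the text with a leading space, which the helper strips.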