# Transcribe tutorial

In [1]:
import tempfile
import mlrun

## Importing the transcribe function from hub

To import the function directly from the MLRun function hub, use:
```python 
transcribe_fn = mlrun.import_function("hub://transcribe")
```

In [2]:
artifact_path = tempfile.mkdtemp()

In [3]:
transcribe_fn = mlrun.import_function("function.yaml")

## Running transcribe without speaker diarization

In [4]:
transcribe_run = transcribe_fn.run(
    handler="transcribe",
    params={
        "model_name": "tiny",
        "input_path": "./data",
        "decoding_options": {"fp16": False},
        "output_directory": "./output",
    },
    returns=[
        "transcriptions: path",
        "transcriptions_df: dataset",
        {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
    ],
    local=True,
    artifact_path=artifact_path,
)

> 2023-08-31 18:43:39,249 [info] Storing function: {'name': 'transcribe-transcribe', 'uid': '7aaac428055b4dab8d7fb49a204cd2c4', 'db': None}


The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.


> 2023-08-31 18:43:45,146 [info] Loading whisper model: 'tiny'
> 2023-08-31 18:43:46,032 [info] Model loaded.


Transcribing:   0%|          | 0/6 [00:00<?, ?file/s]

> 2023-08-31 18:44:18,997 [info] Done:
       audio_file transcription_file language      length  rate_of_speech
0  real_state.mp3     real_state.txt       en  147.604626        2.906413
1   speech_01.mp3      speech_01.txt       en    2.028299        3.451167
2   speech_02.mp3      speech_02.txt       en   17.513243        3.026281


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...204cd2c4,0,Aug 31 18:43:39,completed,transcribe-transcribe,v3io_user=pengw; kind=; owner=pengw; host=jupyter-pengw-5f99fb678d-zkdv7,,model_name=tiny; input_path=./data; decoding_options={'fp16': False}; output_directory=./output,,transcriptions; transcriptions_df; transcriptions_errors





> 2023-08-31 18:44:19,380 [info] Run execution finished: {'status': 'completed', 'name': 'transcribe-transcribe'}


In [5]:
transcribe_run.outputs

{'transcriptions': 'store://artifacts/default/transcribe-transcribe_transcriptions:7aaac428055b4dab8d7fb49a204cd2c4',
 'transcriptions_df': 'store://artifacts/default/transcribe-transcribe_transcriptions_df:7aaac428055b4dab8d7fb49a204cd2c4',
 'transcriptions_errors': 'store://artifacts/default/transcribe-transcribe_transcriptions_errors:7aaac428055b4dab8d7fb49a204cd2c4'}
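
Each key in `outputs` matches an entry in the `returns` list passed to `run`. That list mixes two notations: short strings such as `"transcriptions: path"` and full dictionaries. The sketch below shows how the two forms line up, assuming the string form is read as `key: artifact_type`. The helper `normalize_returns` is hypothetical and written only for illustration; it is not MLRun's own parsing code.

```python
from typing import Any, Dict, List, Union


def normalize_returns(returns: List[Union[str, Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: expand "key: type" strings into dictionaries."""
    normalized = []
    for entry in returns:
        if isinstance(entry, dict):
            # Dictionary entries are already explicit; copy them unchanged.
            normalized.append(dict(entry))
            continue
        # Assumption: the text before the first colon is the key,
        # and the text after it is the artifact type.
        key, _, artifact_type = entry.partition(":")
        spec = {"key": key.strip()}
        if artifact_type.strip():
            spec["artifact_type"] = artifact_type.strip()
        normalized.append(spec)
    return normalized


specs = normalize_returns(
    [
        "transcriptions: path",
        "transcriptions_df: dataset",
        {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
    ]
)
for spec in specs:
    print(spec)
```

Read this way, the three entries yield the three keys reported in `transcribe_run.outputs`.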

**Note**: If you are connected to an MLRun server, you can load the dataset artifact directly:

```python
df = transcribe_run.artifact("transcriptions_df")
```

In [6]:
artifact_path += f"/{transcribe_run.metadata.name}/{transcribe_run.metadata.iteration}/"

In [7]:
df = mlrun.get_dataitem(artifact_path + "transcriptions_df.parquet").as_df()
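
The two cells above locate the dataset file by appending the run name and iteration to the artifact path. A minimal sketch of the same path construction follows, assuming the local layout `<artifact_path>/<run name>/<iteration>/<file>` used in this notebook. The helper names and the `/tmp/tmpabc` directory are hypothetical. `posixpath.join` is used so that a base path ending in `/` does not produce a doubled separator.

```python
import posixpath


def run_artifact_dir(base_path: str, run_name: str, iteration: int) -> str:
    """Hypothetical helper: directory holding the artifacts of one run iteration."""
    return posixpath.join(base_path, run_name, str(iteration))


def run_artifact_file(base_path: str, run_name: str, iteration: int, filename: str) -> str:
    """Hypothetical helper: full path of a single artifact file."""
    return posixpath.join(run_artifact_dir(base_path, run_name, iteration), filename)


# "/tmp/tmpabc" stands in for the directory returned by tempfile.mkdtemp().
parquet_path = run_artifact_file(
    "/tmp/tmpabc", "transcribe-transcribe", 0, "transcriptions_df.parquet"
)
print(parquet_path)

# A base path with a trailing slash is handled without a double "//".
print(run_artifact_file("./", "transcribe-transcribe", 0, "real_state.csv"))
```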

In [8]:
df.head()

Unnamed: 0,audio_file,transcription_file,language,length,rate_of_speech
0,real_state.mp3,real_state.txt,en,147.604626,2.906413
1,speech_01.mp3,speech_01.txt,en,2.028299,3.451167
2,speech_02.mp3,speech_02.txt,en,17.513243,3.026281
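
The `length` and `rate_of_speech` columns appear to be related: assuming `length` is the audio duration in seconds and `rate_of_speech` is words per second, their product should be close to the number of words in each transcription. The sketch below checks this assumption against the three rows shown above. The function name is hypothetical.

```python
# (audio_file, length, rate_of_speech) copied from the table above.
rows = [
    ("real_state.mp3", 147.604626, 2.906413),
    ("speech_01.mp3", 2.028299, 3.451167),
    ("speech_02.mp3", 17.513243, 3.026281),
]


def estimated_word_count(length_seconds: float, words_per_second: float) -> float:
    """Hypothetical helper: word count implied by duration and speech rate."""
    return length_seconds * words_per_second


estimates = {name: estimated_word_count(length, rate) for name, length, rate in rows}
for name, estimate in estimates.items():
    print(f"{name}: {estimate:.3f} words (nearest integer: {round(estimate)})")
```

Each product lands within a small rounding error of a whole number, which is consistent with the words-per-second reading.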


In [9]:
!cat ./output/real_state.txt

 Hello, we're Luis speaking. Hi, Luis. This is Sophia. Good day. I'm finding from the real estate. Lee generation.com.au. We just want to see. Hello. Our custom lead generation packages are now live. We're getting real estate agent Lee task with speak. I know you're busy. We're going to see the LSE's, what if your team is already set up with the stakeholder movement in coming leads from property appraisals, property management and general anchorage that's fine. We'll leave you to it, but we've been working with real estate agencies for over ten years and there's always a common theme. Where is your next lead going to come from? So my call today is just a book a time when one of our real estate lead Specialist work based in Botzinyan Malburn, they'll quickly introduce a package to you. Just run through a few of the finer details, L.A.L.E. It will only take 10 to 20 minutes to give you the background of the package. So let your ask any questions you may have. Are you free next week? Wedn

## Running transcribe with speaker diarization

In [19]:
artifact_path = "./"
proj = mlrun.get_or_create_project("transcribe", "./")
fn = mlrun.code_to_function(
    project="transcribe",
    name="transcribe",
    filename="transcribe.py",
    kind="job",
    image="mlrun/mlrun",
    handler="transcribe",
    description="This function is used to transcribe audio files",
)
transcribe_run = fn.run(
    handler="transcribe",
    params={
        "model_name": "tiny",
        "input_path": "./data/real_state.mp3",
        "decoding_options": {"fp16": False},
        "output_directory": "./output_1",
        "url_path": "./input/dataset.parquet",
    },
    returns=[
        "transcriptions: path",
        "transcriptions_df: dataset",
        {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
    ],
    local=True,
    artifact_path=artifact_path,
)

> 2023-08-31 19:12:10,078 [info] Project loaded successfully: {'project_name': 'transcribe'}
> 2023-08-31 19:12:10,220 [info] Storing function: {'name': 'transcribe-transcribe', 'uid': '232532c2a22f4a5f998b5b30adf4c3aa', 'db': None}
> 2023-08-31 19:12:10,357 [info] Loading whisper model: 'tiny'
> 2023-08-31 19:12:10,953 [info] Model loaded.


Transcribing:   0%|          | 0/1 [00:00<?, ?file/s]

> 2023-08-31 19:12:33,886 [info] Done:
   audio_file transcription_file language      length  rate_of_speech
15          .     real_state.txt       en  146.912625        3.253635
An exception was raised during the packing of '{'key': 'transcriptions_errors', 'file_format': 'yaml'}': 'yaml'


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
transcribe,...adf4c3aa,0,Aug 31 19:12:10,completed,transcribe-transcribe,v3io_user=pengw; kind=; owner=pengw; host=jupyter-pengw-5f99fb678d-zkdv7,,model_name=tiny; input_path=./data/real_state.mp3; decoding_options={'fp16': False}; output_directory=./output_1; url_path=./input/dataset.parquet,,real_state; transcriptions; transcriptions_df





> 2023-08-31 19:12:34,185 [info] Run execution finished: {'status': 'completed', 'name': 'transcribe-transcribe'}


In [22]:
artifact_path += f"/{transcribe_run.metadata.name}/{transcribe_run.metadata.iteration}/"

In [24]:
df = mlrun.get_dataitem(artifact_path + "real_state.csv").as_df()

In [25]:
df

Unnamed: 0,speaker,start_time,end_time,transcription
0,speaker_0,8940.0,10410.0,"Hello, hello, we speaking."
1,speaker_0,10860.0,19485.0,"Hi, all of you. This is Sophia. Good day. I'm..."
2,speaker_0,19900.0,63210.0,Our custom lead generation packages are now l...
3,speaker_1,63500.0,67130.0,I'm really sorry. I'll be running around at o...
4,speaker_0,67660.0,73285.0,What about Friday next week or we can do Apri...
5,speaker_1,73285.0,75535.0,Then actually understand what is.
6,speaker_0,75535.0,75785.0,
7,speaker_1,76035.0,76285.0,
8,speaker_0,76285.0,81290.0,Custom lead generation package. This is your ...
9,speaker_0,81580.0,103850.0,"So, this are just custom lead generations to ..."
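
With one row per speaker turn, the table lends itself to simple per-speaker summaries. The sketch below totals the talk time for each speaker using only the ten rows shown above, so it does not cover the full recording. The time unit is not stated in the output and is assumed here to be milliseconds. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict

# Speaker and timing columns copied from the ten rows shown above.
SEGMENTS_CSV = """speaker,start_time,end_time
speaker_0,8940.0,10410.0
speaker_0,10860.0,19485.0
speaker_0,19900.0,63210.0
speaker_1,63500.0,67130.0
speaker_0,67660.0,73285.0
speaker_1,73285.0,75535.0
speaker_0,75535.0,75785.0
speaker_1,76035.0,76285.0
speaker_0,76285.0,81290.0
speaker_0,81580.0,103850.0
"""


def talk_time_per_speaker(csv_text: str) -> Dict[str, float]:
    """Hypothetical helper: sum (end_time - start_time) for each speaker."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["speaker"]] += float(row["end_time"]) - float(row["start_time"])
    return dict(totals)


totals = talk_time_per_speaker(SEGMENTS_CSV)
for speaker, total in sorted(totals.items()):
    # Dividing by 1000 assumes the timestamps are in milliseconds.
    print(f"{speaker}: {total / 1000:.2f} s")
```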


In [26]:
!cat output_1/real_state.txt

speaker_0  8940.0  10410.0: 
  Hello, hello, we speaking.
speaker_0  10860.0  19485.0: 
  Hi, all of you. This is Sophia. Good day. I'm finding from the real estate. Lee generation.com.au. We just want to be with you.
speaker_0  19900.0  63210.0: 
  Our custom lead generation packages are now live. We're getting real estate agent Lee task we speak. I know you're busy. I'll leave. What if you open your team is already set up with a stable stream of income and lead property management and general and curious that's fine. We'll leave it to it. But we've been working with real estate agencies for over 10 years. And there's always a common theme. Where is your next lead going to come from? So my call today is just a book a time. We had one of our real estate lead specialists were based in both Sydney and Melbourne. They'll quickly introduce a package to you. Just run through a few of the finer details. It will only take 10 to 20 minutes to give you the background of the package. The letter 
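
The diarized text file alternates between a header line (`speaker  start  end:`) and an indented line of text. A minimal sketch of parsing that layout back into structured segments follows, assuming every header matches the pattern seen in the output above. The parser is hypothetical and not part of the transcribe function, and the sample text is abbreviated from the output above.

```python
import re
from typing import Any, Dict, List

# Header lines look like "speaker_0  8940.0  10410.0: " in the output above.
HEADER = re.compile(r"^(\S+)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?):\s*$")


def parse_diarized_transcript(text: str) -> List[Dict[str, Any]]:
    """Hypothetical parser: split a diarized transcript into speaker segments."""
    segments: List[Dict[str, Any]] = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            # A header starts a new segment with empty text.
            segments.append(
                {
                    "speaker": match.group(1),
                    "start_time": float(match.group(2)),
                    "end_time": float(match.group(3)),
                    "transcription": "",
                }
            )
        elif segments and line.strip():
            # Indented lines belong to the most recent header.
            current = segments[-1]
            current["transcription"] = (
                current["transcription"] + " " + line.strip()
            ).strip()
    return segments


sample = (
    "speaker_0  8940.0  10410.0: \n"
    "  Hello, hello, we speaking.\n"
    "speaker_0  10860.0  19485.0: \n"
    "  Hi, all of you. This is Sophia. Good day.\n"
)
segments = parse_diarized_transcript(sample)
for segment in segments:
    print(segment)
```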