## the `transcribe` module

This document reviews the `transcribe` module, which takes an audio or video file as input and returns a transcription of the words spoken in it.  Transcription data is returned as JSON.

This document includes an overview of custom pipeline setup, current model set, parameters, and `.process` usage for this module.

To follow along with this demonstration, be sure to initialize your krixik session with your API key and URL as shown below.

We illustrate loading these required secrets via [python-dotenv](https://pypi.org/project/python-dotenv/), storing them in a `.env` file.  This is good practice for storing / loading secrets (e.g., doing so reduces the chance that you inadvertently push secrets to a repo).

In [1]:
import sys 
sys.path.append('../../')
from docs.utilities.reset import reset_pipeline

In [2]:
# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.


This small function pretty-prints dictionaries in notebooks / markdown.

In [3]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [using the `tiny` model](#using-the-tiny-model)
- [using the `large-v3` model](#using-the-large-v3-model)

## pipeline setup

Below we set up a simple single-module pipeline using the `transcribe` module.

In [4]:
# import custom module creation tools
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# instantiate module
module_1 = Module(module_type="transcribe")

# create custom pipeline object
custom = CreatePipeline(name='transcribe-pipeline-1', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

The `transcribe` module comes with the current range of [whisper](https://openai.com/research/whisper) transcription models.  These range from tiny to large, and offer a trade-off of transcription accuracy versus computational cost, with smaller models being less accurate but cheaper to run.  

- [`whisper-tiny`](https://huggingface.co/openai/whisper-tiny): the smallest model, cheapest to run, but least accurate (default)
- [`whisper-base`](https://huggingface.co/openai/whisper-base): about 2x model parameters compared to tiny - small, cheap to run, reasonably accurate
- [`whisper-small`](https://huggingface.co/openai/whisper-small): about 3x model parameters compared to base - cheap to run, accurate 
- [`whisper-medium`](https://huggingface.co/openai/whisper-medium): about 3x model parameters compared to small - more costly to run, accurate
- [`whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3): about 2x model parameters compared to medium - most costly to run, most accurate
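
The size ratios quoted in the list above can be checked with a few lines of arithmetic. The sketch below assumes the approximate parameter counts (in millions) published for the Whisper model family; these figures are not returned by krixik, and `size_ratios` is a hypothetical helper written only for this illustration.

```python
# Approximate parameter counts in millions, as published for the Whisper family.
# These numbers are an assumption for illustration; krixik does not report them.
APPROX_PARAMS_M = {
    "whisper-tiny": 39,
    "whisper-base": 74,
    "whisper-small": 244,
    "whisper-medium": 769,
    "whisper-large-v3": 1550,
}


def size_ratios(params):
    """Return each model's size relative to the next smaller model (hypothetical helper)."""
    names = list(params)
    return {
        names[i]: round(params[names[i]] / params[names[i - 1]], 1)
        for i in range(1, len(names))
    }


ratios = size_ratios(APPROX_PARAMS_M)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio}x the parameters of the previous model")
```

Under these assumed counts the ratios come out close to the rounded 2x and 3x figures given in the list.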

These available modeling options and parameters are stored in our custom pipeline's configuration (described further in LINK HERE).  We can examine this configuration as shown below.

In [5]:
# nicely print the configuration of our custom pipeline
json_print(custom.config)

{
  "pipeline": {
    "name": "transcribe-pipeline-1",
    "modules": [
      {
        "name": "transcribe",
        "models": [
          {
            "name": "whisper-tiny"
          },
          {
            "name": "whisper-base"
          },
          {
            "name": "whisper-small"
          },
          {
            "name": "whisper-medium"
          },
          {
            "name": "whisper-large-v3"
          }
        ],
        "defaults": {
          "model": "whisper-tiny"
        },
        "input": {
          "type": "audio",
          "permitted_extensions": [
            ".mp3",
            ".mp4"
          ]
        },
        "output": {
          "type": "json"
        }
      }
    ]
  }
}


Here we can see the models and their associated parameters available for use.
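
The same information can also be read from the configuration programmatically. The sketch below embeds a copy of the configuration printed above so that it runs on its own; `summarize_module` is a hypothetical helper, not part of krixik.

```python
# Copy of the configuration printed above, embedded so the sketch is self-contained.
config = {
    "pipeline": {
        "name": "transcribe-pipeline-1",
        "modules": [
            {
                "name": "transcribe",
                "models": [
                    {"name": "whisper-tiny"},
                    {"name": "whisper-base"},
                    {"name": "whisper-small"},
                    {"name": "whisper-medium"},
                    {"name": "whisper-large-v3"},
                ],
                "defaults": {"model": "whisper-tiny"},
                "input": {"type": "audio", "permitted_extensions": [".mp3", ".mp4"]},
                "output": {"type": "json"},
            }
        ],
    }
}


def summarize_module(config, module_name):
    """Collect model names, default model, and permitted extensions (hypothetical helper)."""
    for module in config["pipeline"]["modules"]:
        if module["name"] == module_name:
            return {
                "models": [model["name"] for model in module["models"]],
                "default_model": module["defaults"]["model"],
                "extensions": module["input"]["permitted_extensions"],
            }
    raise KeyError(f"no module named {module_name!r} in config")


summary = summarize_module(config, "transcribe")
print(summary["models"])
print(summary["default_model"])
print(summary["extensions"])
```

In a live session the helper could be called on `custom.config` in place of the embedded copy.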

In [6]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

## using the `tiny` model

We first define a path to a local input file.

In [7]:
# define path to an input file
test_file = "../input_data/Interesting Facts About Colombia.mp4"

Let's take a quick look at this file before processing.

In [8]:
# examine contents of input file
from IPython.display import Video
Video(test_file)

Now let's process it using our `tiny` model.  Because `tiny` is the default model, we do not need to pass the optional `modules` argument to `.process`.

In [9]:
# define path to an input file
test_file = "../input_data/Interesting Facts About Colombia.mp4"

# process the input file through the transcribe pipeline
process_output = pipeline.process(local_file_path=test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*5,         # set all process data to expire in 5 minutes
                                  wait_for_process=True,    # wait for process to complete before returning control
                                  verbose=False)            # set verbosity to False

The output of this process is printed below.  Because the output of this particular module-model pair is JSON, the process output is included in this object as well.  The output file itself has been saved to the location listed under the `process_output_files` key.  The `file_id` of the processed input is used as a filename prefix for the output file.
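
The structure described above can be handled in a few lines. The sketch below uses a mock dictionary with the same keys as a `.process` response; the transcript text and the output file path (including its `.json` extension) are placeholders assumed for illustration, and both helpers are hypothetical rather than part of krixik.

```python
import os

# Mock response with the keys described above. The transcript text and the
# output file path are placeholders (assumptions), not real krixik output.
mock_output = {
    "status_code": 200,
    "file_id": "28d4eeed-3dde-4dce-83fd-83123463baad",
    "process_output": [{"transcript": " placeholder transcript text"}],
    "process_output_files": ["./28d4eeed-3dde-4dce-83fd-83123463baad.json"],
}


def extract_transcripts(output):
    """Return the transcript strings from a successful response (hypothetical helper)."""
    if output["status_code"] != 200:
        raise RuntimeError(f"process failed with status {output['status_code']}")
    return [item["transcript"].strip() for item in output["process_output"]]


def files_match_file_id(output):
    """Check that every output filename starts with the response's file_id."""
    return all(
        os.path.basename(path).startswith(output["file_id"])
        for path in output["process_output_files"]
    )


transcripts = extract_transcripts(mock_output)
print(transcripts)
print(files_match_file_id(mock_output))
```

In a live session the same helpers could be applied to `process_output` directly.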

In [10]:
# nicely print the output of this process
json_print(process_output)

{
  "status_code": 200,
  "pipeline": "transcribe-pipeline-1",
  "request_id": "960340c1-6e4e-46c2-9622-2c29c51274dd",
  "file_id": "28d4eeed-3dde-4dce-83fd-83123463baad",
  "message": "SUCCESS - output fetched for file_id 28d4eeed-3dde-4dce-83fd-83123463baad.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "transcript": " That's the episode looking at the great country of Columbia. We looked at some really basic facts. It's name, a bit of its history, the type of people that live there, land size, and all that jazz. But in this video, we're going to go into a little bit more of a detailed look. Yo, what is going on guys? Welcome back to F2D facts. The channel where I look at people cultures and places. My name is Dave Wouple, and today we are going to be looking more at Columbia and our Columbia part two video. Which just reminds me guys, this is part of our Columbia playlist. So put it down in the description box below, and I'll talk abo

We load the JSON output file listed in `process_output_files` below.

In [11]:
# load in process output from file
with open(process_output['process_output_files'][0], "r") as file:
    json_print(json.load(file))

[
  {
    "transcript": " That's the episode looking at the great country of Columbia. We looked at some really basic facts. It's name, a bit of its history, the type of people that live there, land size, and all that jazz. But in this video, we're going to go into a little bit more of a detailed look. Yo, what is going on guys? Welcome back to F2D facts. The channel where I look at people cultures and places. My name is Dave Wouple, and today we are going to be looking more at Columbia and our Columbia part two video. Which just reminds me guys, this is part of our Columbia playlist. So put it down in the description box below, and I'll talk about that more in the video. But if you're new here, join me every single Monday to learn about new countries from around the world. You can do that by hitting that subscribe and that belt notification button. But let's get started. Learn about Columbia. So we all know Columbia is famous for its coffee, right? Yes, right. I know. You guys are sit

## using the `large-v3` model

To use a non-default model like `large-v3`, we select it explicitly through the `modules` argument when invoking `.process`, using its full name `whisper-large-v3`.

We use it below to process the same input file shown above.

In [12]:
# define path to an input file
test_file = "../input_data/Interesting Facts About Colombia.mp4"

# process the input file using the non-default large-v3 model
process_output = pipeline.process(local_file_path=test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*5,         # set all process data to expire in 5 minutes
                                  wait_for_process=True,    # wait for process to complete before returning control
                                  verbose=False,            # set verbosity to False
                                  modules={"transcribe":{"model":"whisper-large-v3"}})
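
A `modules` selection like the one above can be checked against the available model names before it is sent. The sketch below is a minimal illustration: `build_modules_arg` is a hypothetical helper, not part of krixik, and the model list is copied from the pipeline configuration shown earlier.

```python
# Model names copied from the pipeline configuration shown earlier.
AVAILABLE_MODELS = [
    "whisper-tiny",
    "whisper-base",
    "whisper-small",
    "whisper-medium",
    "whisper-large-v3",
]


def build_modules_arg(model, module_name="transcribe", available=AVAILABLE_MODELS):
    """Build a `modules` dictionary after validating the model name (hypothetical helper)."""
    if model not in available:
        raise ValueError(f"unknown model {model!r}; choose one of {available}")
    return {module_name: {"model": model}}


modules = build_modules_arg("whisper-large-v3")
print(modules)
```

Validating locally catches a misspelled model name before any file is uploaded for processing.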

The output of this process is printed below.  Because the output of this particular module-model pair is JSON, the process output is included in this object as well.  The output file itself has been saved to the location listed under the `process_output_files` key.

In [14]:
# nicely print the output of this process
json_print(process_output)

{
  "status_code": 200,
  "pipeline": "transcribe-pipeline-1",
  "request_id": "66e2ae1c-8a0c-4830-8dc5-0f66e91a17bf",
  "file_id": "b1f517a9-4c48-4e6a-b797-4aa1cd185fee",
  "message": "SUCCESS - output fetched for file_id b1f517a9-4c48-4e6a-b797-4aa1cd185fee.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "transcript": " Episode looking at the great country of Colombia We looked at some really just basic facts its name a bit of its history the type of people that live there Landsize and all that jazz, but in this video, we're gonna go into a little bit more of a detailed look Yo, what is going on guys? Welcome back to have to D facts a channel where I look at people cultures and places My name is Dave Walpole and today We are gonna be looking more at Colombia in our Columbia part 2 video, which just reminds me guys This is part of our Columbia playlist I'll put it down in the description box below and I'll talk about that more at the end