## Single-Module Pipeline: `transcribe`

This document walks through how to assemble and use a single-module pipeline that includes only a [`transcribe`](../../modules/ai_model_modules/transcribe_module.md) module. It is divided into the following sections:

- [Pipeline Setup](#pipeline-setup)
- [Required Input Format](#required-input-format)
- [Using the Default Model](#using-the-default-model)
- [Using a Non-Default Model](#using-a-non-default-model)

In [1]:
# import utilities
import sys
import json
import importlib
sys.path.append('../../../')
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline

# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key=MY_API_KEY,
            api_url=MY_API_URL)

SUCCESS: You are now authenticated.


### Pipeline Setup

Let's first instantiate a single-module [`transcribe`](../../modules/ai_model_modules/transcribe_module.md) pipeline.

We use the [`.create_pipeline`](../../system/pipeline_creation/create_pipeline.md) method for this, passing only the [`transcribe`](../../modules/ai_model_modules/transcribe_module.md) module name into `module_chain`.

In [2]:
# create a pipeline with a single transcribe module
pipeline = krixik.create_pipeline(name="single_transcribe_1",
                                  module_chain=["transcribe"])

### Required Input Format

The [`transcribe`](../../modules/ai_model_modules/transcribe_module.md) module accepts audio inputs. For the time being, MP3 is the only accepted file format.
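Since only MP3 is accepted, it can help to check a file's extension locally before sending it through the pipeline. The following is a minimal sketch of such a check; the helper name `is_accepted_audio_file` and the `ACCEPTED_EXTENSIONS` set are illustrative assumptions and not part of the Krixik API. It inspects only the file name, not the audio content.

```python
from pathlib import Path

# Assumption: MP3 is the only accepted format, as stated in the text above.
ACCEPTED_EXTENSIONS = {".mp3"}


def is_accepted_audio_file(file_path: str) -> bool:
    """Return True if the file name ends in an accepted extension (case-insensitive)."""
    return Path(file_path).suffix.lower() in ACCEPTED_EXTENSIONS


# The input file used in this walkthrough passes the check.
print(is_accepted_audio_file("../../../data/input/Interesting Facts About Colombia.mp3"))
```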

Let's take a quick look at a valid input file, and then process it.

In [3]:
# examine contents of input file
import IPython
IPython.display.Audio("../../../data/input/Interesting Facts About Colombia.mp3")

### Using the Default Model

Let's process our test input file using the [`transcribe`](../../modules/ai_model_modules/transcribe_module.md) module's [default model](../../modules/ai_model_modules/transcribe_module.md#available-models-in-the-transcribe-module): [`whisper-tiny`](https://huggingface.co/openai/whisper-tiny).

Given that this is the default model, we need not specify model selection through the optional [`modules`](../../system/parameters_processing_files_through_pipelines/process_method.md#selecting-models-via-the-modules-argument) argument in the [`.process`](../../system/parameters_processing_files_through_pipelines/process_method.md) method.

In [4]:
# process the file with the default model
process_output = pipeline.process(local_file_path="../../../data/input/Interesting Facts About Colombia.mp3", # the initial local filepath where the input file is stored
                                  local_save_directory="../../../data/output", # the local directory that the output file will be saved to
                                  expire_time=60 * 30, # process data will be deleted from the Krixik system in 30 minutes
                                  wait_for_process=True, # wait for process to complete before returning IDE control to user
                                  verbose=False) # do not display process update printouts upon running code

The output of this process is printed below. To learn more about each component of the output, review documentation for the [`.process`](../../system/parameters_processing_files_through_pipelines/process_method.md) method.

Because the output of this particular module-model pair is a JSON file, the process output is also included in the returned object under the `process_output` key (this is only the case for JSON outputs). Moreover, the output file itself has been saved to the location noted in the `process_output_files` key. The `file_id` of the processed input is used as a filename prefix for the output file.
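As a rough illustration of how these keys might be read programmatically, the sketch below pulls them out of a dictionary shaped like the return value of `.process`. The helper `summarize_process_output` and the stand-in dictionary are hypothetical, and treating a `status_code` of 200 as success is an assumption based on the output displayed in this walkthrough.

```python
def summarize_process_output(output: dict) -> dict:
    """Collect the fields discussed above from a .process return value."""
    return {
        # Assumption: 200 signals success, as in the output displayed in this walkthrough.
        "succeeded": output.get("status_code") == 200,
        "file_id": output.get("file_id"),
        "output_files": output.get("process_output_files", []),
    }


# Hypothetical stand-in for a real return value; the id and path are made up.
example_output = {
    "status_code": 200,
    "file_id": "example-file-id",
    "process_output_files": ["../../../data/output/example-file-id.json"],
}

summary = summarize_process_output(example_output)
print(summary["succeeded"], summary["output_files"])
```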

In [5]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "single_transcribe_1",
  "request_id": "7112b2bf-4f00-4a05-8640-fd64523fe53c",
  "file_id": "031af2e4-23ee-4f66-969e-6a02c91c10cd",
  "message": "SUCCESS - output fetched for file_id 031af2e4-23ee-4f66-969e-6a02c91c10cd.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "transcript": " This episode, looking at the great country of Columbia, we looked at some really basic facts. It's name, a bit of its history, the type of people that live there, land size, and all that jazz. But in this video, we're going to go into a little bit more of a detailed look. Yo, what is going on guys? Welcome back to F2D facts. The channel where I look at people cultures and places, my name is Dave Wouple, and today we are going to be looking more at Columbia in our Columbia Part 2 video. Which just reminds me guys, this is part of our Columbia playlist. So put it down in the description box below and I'll talk about that mor

To confirm that everything went as it should have, let's load the JSON output file listed in `process_output_files`:

In [6]:
# load in process output from file
with open(process_output["process_output_files"][0]) as f:
    print(json.dumps(json.load(f), indent=2))

[
  {
    "transcript": " This episode, looking at the great country of Columbia, we looked at some really basic facts. It's name, a bit of its history, the type of people that live there, land size, and all that jazz. But in this video, we're going to go into a little bit more of a detailed look. Yo, what is going on guys? Welcome back to F2D facts. The channel where I look at people cultures and places, my name is Dave Wouple, and today we are going to be looking more at Columbia in our Columbia Part 2 video. Which just reminds me guys, this is part of our Columbia playlist. So put it down in the description box below and I'll talk about that more at the end of the video. But if you're new here, join me every single Monday to learn about new countries from around the world. You can do that by hitting that subscribe and that belt notification button. But let's get started. So we all know, Columbia is famous for its coffee, right? Yes, right. I know. You guys are sitting there going, f

As anticipated, the returned JSON file contains not only the snippets of transcribed text but also, for each snippet, timestamps and a "confidence" value for the accuracy of its transcription.
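If only the text is needed, the snippets can be reduced to a single string. The sketch below assumes the file holds a list of objects that each carry a `transcript` key, as in the output shown above; the key names for timestamps and confidence are not visible in that truncated output, so they are not used here. The helper `join_transcripts` and the sample snippets are hypothetical.

```python
def join_transcripts(snippets: list) -> str:
    """Concatenate the 'transcript' field of each snippet into one string."""
    return " ".join(
        snippet["transcript"].strip()
        for snippet in snippets
        if "transcript" in snippet
    )


# Hypothetical snippets in the same shape as the output file's entries.
example_snippets = [
    {"transcript": " First snippet of speech."},
    {"transcript": " Second snippet of speech."},
]

full_text = join_transcripts(example_snippets)
print(full_text)
```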

### Using a Non-Default Model

To use a [non-default model](../../modules/ai_model_modules/transcribe_module.md#available-models-in-the-transcribe-module) like [`whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3), we must enter it explicitly through the [`modules`](../../system/parameters_processing_files_through_pipelines/process_method.md#selecting-models-via-the-modules-argument) argument when invoking the [`.process`](../../system/parameters_processing_files_through_pipelines/process_method.md) method.

We do so below to process the same input file shown above.
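The `modules` argument is a nested dictionary keyed by module name, with the chosen model under a `model` key. A small sketch of building that structure follows; the helper `build_modules_arg` is a hypothetical convenience and not part of Krixik, and it does not check whether the model name is actually available.

```python
def build_modules_arg(model_name: str, module_name: str = "transcribe") -> dict:
    """Build a dictionary in the shape passed to the `modules` argument in this walkthrough."""
    return {module_name: {"model": model_name}}


modules_arg = build_modules_arg("whisper-large-v3")
print(modules_arg)
```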

In [7]:
# process the file with a non-default model
process_output = pipeline.process(local_file_path="../../../data/input/Interesting Facts About Colombia.mp3", # all parameters save 'modules' as above
                                  local_save_directory="../../../data/output",
                                  expire_time=60 * 30,
                                  wait_for_process=True,
                                  verbose=False,
                                  modules={"transcribe": {"model": "whisper-large-v3"}}) # specify a non-default model for this process

We once again print out and review the output as we did above.

In [8]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "single_transcribe_1",
  "request_id": "f833fd1e-a23e-43c7-8a99-36ab714c419d",
  "file_id": "f2225a00-7174-4298-a4bc-541e1b360b1b",
  "message": "SUCCESS - output fetched for file_id f2225a00-7174-4298-a4bc-541e1b360b1b.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "transcript": " Episode looking at the great country of Colombia We looked at some really just basic facts its name a bit of its history the type of people that live there Landsize and all that jazz, but in this video, we're gonna go into a little bit more of a detailed look Yo, what is going on guys? Welcome back to have to D facts a channel where I look at people cultures and places My name is Dave Walpole and today We are gonna be looking more at Colombia in our Columbia part 2 video, which just reminds me guys This is part of our Columbia playlist I'll put it down in the description box below and I'll talk about that more at the end o

In [9]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)