# Semantically searchable multilingual transcription pipeline

This document details a modular pipeline that takes in an audio/video file, transcribes it, translates the transcription into a desired language, and makes the result semantically searchable.

Such a pipeline could be used to make podcast conversations, or recordings of audio/video meetings, searchable in any language.


Description: transcribe any audio/video file and make it searchable in any language.

Business use cases:

- make podcasts searchable in any language
- generate transcripts and automated summaries of recorded meetings


Pipelines described:

- transcribe

- transcribe --> json-to-txt --> parser --> text-embedder --> vector-search

- transcribe --> translate --> json-to-txt --> parser --> text-embedder --> vector-search

- transcribe --> summarize

- transcribe --> sentiment
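As a rough sketch, the five chains listed above can be written down as plain Python lists of module names. The dictionary keys (`transcribe-only`, `searchable`, and so on) and the helper `render_chain` are hypothetical labels used only for this illustration; the module names are the ones used in the walkthrough code.

```python
# Module chains for the pipelines described above.
# The last four modules are shared by both searchable pipelines.
SEARCH_TAIL = ["json-to-txt", "parser", "text-embedder", "vector-search"]

# Keys are hypothetical labels, not krixik pipeline names.
PIPELINES = {
    "transcribe-only": ["transcribe"],
    "searchable": ["transcribe", *SEARCH_TAIL],
    "translated-searchable": ["transcribe", "translate", *SEARCH_TAIL],
    "summarize": ["transcribe", "summarize"],
    "sentiment": ["transcribe", "sentiment"],
}


def render_chain(chain):
    """Render a list of module names in the arrow notation used above."""
    return " --> ".join(chain)


for label, chain in PIPELINES.items():
    print(f"{label}: {render_chain(chain)}")
```

Every chain starts with `transcribe`; the variants differ only in what is attached after it.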


# Code walkthrough

### boilerplate

In [4]:
import sys 
sys.path.append('..')
from dotenv import load_dotenv
import os
load_dotenv()

TEST_DUMMY_API_KEY = os.getenv('TEST_DUMMY_API_KEY_DEV')
TEST_DUMMY_API_URL = os.getenv('TEST_DUMMY_API_URL_DEV')

from krixik import krixik
krixik.init(api_key = TEST_DUMMY_API_KEY, 
            api_url = TEST_DUMMY_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))
    
# define directory for input files 
input_directory = 'input_files/'

# define directory for output files
output_directory = 'output_files'

# define directory for pipeline_configs
pipeline_configs_directory = 'pipeline_configs/'

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
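Note that `os.getenv` returns `None` when a variable is missing, so a misconfigured `.env` file only surfaces later, at authentication time. A minimal sketch of a stricter lookup follows; the helper `require_env` and the variable `EXAMPLE_API_KEY` are hypothetical and not part of krixik or python-dotenv.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name!r} is not set; check your .env file")
    return value


# Demonstration with a hypothetical variable; in the notebook this would be
# called with 'TEST_DUMMY_API_KEY_DEV' and 'TEST_DUMMY_API_URL_DEV' instead.
os.environ.setdefault("EXAMPLE_API_KEY", "demo-value")
print(require_env("EXAMPLE_API_KEY"))
```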


### transcribe

In [5]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="transcribe")

# create custom pipeline object
custom = CreatePipeline(name='transcribe-pipeline-1', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

In [8]:
pipeline.pipeline

'transcribe-pipeline-1'

In [4]:
test_file = "latinx_pride_short.mp4"
output = pipeline.process(local_file_path = input_directory + test_file,
                          expire_time=60*3)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted data/latinx_pride_short.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmprcecrni3/krixik_converted_version_latinx_pride_short.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_cxtedywpgc.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Tue Apr  9 14:35:31 2024 UTC
INFO: transcribe-pipeline-1 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: d6497aa8-c0f5-91dc-578e-b3808243ce71
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - transcribe processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output d
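The log above reports that `expire_time=60*3` means the file is removed 180 seconds after upload. The arithmetic can be sketched as follows; `expiry_time` is a hypothetical helper, and the start time is an assumed value chosen to line up with the expiry shown in the log.

```python
from datetime import datetime, timedelta, timezone


def expiry_time(expire_time, now=None):
    """Return the UTC datetime at which a file uploaded at `now` would expire.

    `expire_time` is a number of seconds, as passed to pipeline.process.
    """
    if expire_time <= 0:
        raise ValueError("expire_time must be a positive number of seconds")
    if now is None:
        now = datetime.now(timezone.utc)
    return now + timedelta(seconds=expire_time)


# Assumed upload time, three minutes before the expiry reported in the log.
start = datetime(2024, 4, 9, 14, 32, 31, tzinfo=timezone.utc)
print(expiry_time(60 * 3, now=start).isoformat())
```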

Once processing is complete, we can view the output.

In [6]:
json_print(output)

{
  "status_code": 200,
  "request_id": "ed4fbf33-05b3-4a5e-8b91-88e27e1868fe",
  "file_id": "6854e4ec-5146-4648-b7ff-d4f6d61d9c68",
  "message": "SUCCESS - output fetched for file_id 6854e4ec-5146-4648-b7ff-d4f6d61d9c68.",
  "process_output": [
    {
      "transcript": " Every time I use the term Latinx there is a palpable rage that fills my comment sections that I just don't get it You can still use the word Latino. I still self identify as Latino all the time",
      "timestamped_transcript": [
        {
          "id": 0,
          "start": 0.34,
          "end": 6.36,
          "text": " Every time I use the term Latinx there is a palpable rage that fills my comment sections that I just don't get it",
          "no_speech_prob": 0.026368845254182816,
          "confidence": 0.813,
          "words": [
            {
              "text": "Every",
              "start": 0.34,
              "end": 0.58,
              "confidence": 0.858
            },
            {
              "te

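The output above nests word-level timings and confidences inside each transcript segment. As a sketch of how that structure might be consumed, the helpers below print one timestamped line per segment and list words whose confidence falls below a threshold. The helper names and the 0.5 default threshold are assumptions for illustration; the sample dictionary is an abbreviated copy of the output shown, limited to the fields visible before truncation.

```python
def format_timestamp(seconds):
    """Format a time in seconds as MM:SS.cc (centisecond precision)."""
    centiseconds = round(seconds * 100)
    minutes, remainder = divmod(centiseconds, 6000)
    return f"{minutes:02d}:{remainder // 100:02d}.{remainder % 100:02d}"


def segment_lines(output):
    """Return one '[start - end] text' line per transcript segment."""
    lines = []
    for result in output.get("process_output", []):
        for segment in result.get("timestamped_transcript", []):
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return lines


def low_confidence_words(output, threshold=0.5):
    """Return the text of every word whose confidence is below `threshold`."""
    flagged = []
    for result in output.get("process_output", []):
        for segment in result.get("timestamped_transcript", []):
            for word in segment.get("words", []):
                if word["confidence"] < threshold:
                    flagged.append(word["text"])
    return flagged


# Abbreviated copy of the output above: first segment, first word only.
sample_output = {
    "process_output": [
        {
            "timestamped_transcript": [
                {
                    "id": 0,
                    "start": 0.34,
                    "end": 6.36,
                    "text": " Every time I use the term Latinx there is a palpable rage "
                            "that fills my comment sections that I just don't get it",
                    "confidence": 0.813,
                    "words": [
                        {"text": "Every", "start": 0.34, "end": 0.58, "confidence": 0.858},
                    ],
                }
            ]
        }
    ]
}

for line in segment_lines(sample_output):
    print(line)
print(low_confidence_words(sample_output, threshold=0.9))
```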
Pipelines remain flexible even after they are constructed.

Our current pipeline consists of a single module, `transcribe`, which offers several model options.

We can view these below.

In [None]:
custom.config

### English searchable-transcripts

Let's make our transcriptions searchable.

We can simply append the `json-to-txt`, `parser`, `text-embedder`, and `vector-search` modules after `transcribe`, creating a new pipeline.

This will make our transcripts searchable in English.

In [2]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="transcribe")
module_2 = Module(name="json-to-txt")
module_3 = Module(name="parser")
module_4 = Module(name="text-embedder")
module_5 = Module(name="vector-search")

# create custom pipeline object
custom = CreatePipeline(name='transcribe-pipeline-2', 
                        module_chain=[module_1, module_2, module_3, module_4, module_5])

# save your config for later use (that way you don't need to re-build in python)
custom.save(config_path=pipeline_configs_directory + 'transcribe-pipeline-2.yml')

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

In [3]:
test_file = "latinx_pride_short.mp4"
output = pipeline.process(local_file_path = input_directory + test_file,
                          expire_time=60*3)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted data/latinx_pride_short.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmp8h0swv85/krixik_converted_version_latinx_pride_short.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_vboarqcogg.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Tue Apr  9 15:15:41 2024 UTC
INFO: transcribe-pipeline-2 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: 3dd7b005-5740-0783-45a2-cd681c37ea77
INFO: File process and p

Now our transcript is (vector) searchable.

Let's try it out.

In [6]:
output = pipeline.vector_search(query="they matter and you dont", 
                                symbolic_directory_paths=['/*'])

json_print(output)

{
  "status_code": 200,
  "request_id": "60fabbda-5561-4941-9e32-17ff50645d51",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "4748b502-7a4a-4649-9906-30ce1cd845ea",
      "file_metadata": {
        "file_name": "krixik_generated_vboarqcogg.mp3",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 1,
        "created_at": "2024-04-09 22:12:42",
        "last_updated": "2024-04-09 22:12:42"
      },
      "search_results": [
        {
          "snippet": " Every time I use the term Latinx there is a palpable rage that fills my comment sections that I just don't get it You can still use the word Latino. I still self identify as Latino all the time",
          "line_numbers": [
            1
          ],
          "distance": 0.482
        }
      ]
    }
  ]
}
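A search response groups results by file, so ranking snippets across files takes a small amount of flattening. The sketch below does this for a dictionary shaped like the response above. It assumes that a smaller `distance` means a closer match, which the source does not state explicitly; `ranked_snippets`, the file names, and the snippet texts in the sample are hypothetical.

```python
def ranked_snippets(search_output, max_distance=None):
    """Flatten a vector_search-style response into (distance, file_name, snippet)
    tuples, sorted by ascending distance (assumed here to mean best match first)."""
    hits = []
    for item in search_output.get("items", []):
        file_name = item["file_metadata"]["file_name"]
        for result in item.get("search_results", []):
            if max_distance is None or result["distance"] <= max_distance:
                hits.append((result["distance"], file_name, result["snippet"].strip()))
    return sorted(hits, key=lambda hit: hit[0])


# Hypothetical response with the same shape as the output above.
sample_search_output = {
    "status_code": 200,
    "items": [
        {
            "file_id": "file-a",
            "file_metadata": {"file_name": "episode_a.mp3"},
            "search_results": [
                {"snippet": " first example snippet", "line_numbers": [1], "distance": 0.482},
            ],
        },
        {
            "file_id": "file-b",
            "file_metadata": {"file_name": "episode_b.mp3"},
            "search_results": [
                {"snippet": "second example snippet", "line_numbers": [1], "distance": 0.271},
            ],
        },
    ],
}

for distance, file_name, snippet in ranked_snippets(sample_search_output):
    print(f"{distance:.3f}  {file_name}  {snippet}")
```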


### (your-language-goes-here)-searchable-transcripts

Perhaps our audience would like to view and search transcripts in another language, such as Spanish.

We can easily adjust our pipeline to allow for this by adding a `translate` module.

In [2]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="transcribe")
module_2 = Module(name="translate")
module_3 = Module(name="json-to-txt")
module_4 = Module(name="parser")
module_5 = Module(name="text-embedder")
module_6 = Module(name="vector-search")

# create custom pipeline object
custom = CreatePipeline(name='transcribe-pipeline-3', 
                        module_chain=[module_1, module_2, module_3, module_4, module_5, module_6])

# save your config for later use (that way you don't need to re-build in python)
custom.save(config_path=pipeline_configs_directory + 'transcribe-pipeline-3.yml')

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

This single pipeline has quite a bit of flexibility baked into it.

Most modules offer several choices of model and parameters.

For example, we can choose among transcription models of various sizes and among several translation models.

In [3]:
custom.config

{'pipeline': {'name': 'transcribe-pipeline-3',
  'modules': [{'name': 'transcribe',
    'models': [{'name': 'whisper-tiny'},
     {'name': 'whisper-base'},
     {'name': 'whisper-small'},
     {'name': 'whisper-medium'},
     {'name': 'whisper-large-v3'}],
    'defaults': {'model': 'whisper-tiny'},
    'input': {'type': 'audio', 'permitted_extensions': ['.mp3', '.mp4']},
    'output': {'type': 'json'}},
   {'name': 'translate',
    'models': [{'name': 'opus-mt-de-en'},
     {'name': 'opus-mt-en-es'},
     {'name': 'opus-mt-es-en'},
     {'name': 'opus-mt-en-fr'},
     {'name': 'opus-mt-fr-en'},
     {'name': 'opus-mt-it-en'},
     {'name': 'opus-mt-zh-en'}],
    'defaults': {'model': 'opus-mt-en-es'},
    'input': {'type': 'json', 'permitted_extensions': ['.json']},
    'output': {'type': 'json'}},
   {'name': 'json-to-txt',
    'models': [{'name': 'base'}],
    'defaults': {'model': 'base'},
    'input': {'type': 'json', 'permitted_extensions': ['.json']},
    'output': {'type': 'text
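Since the config lists the models each module offers, a `modules` override can be checked against it before a file is processed. The sketch below assumes a plain dictionary shaped like the config printed above; `available_models` and `validate_modules` are hypothetical helpers, not krixik functions, and the sample is abbreviated to the first two modules.

```python
def available_models(config):
    """Map each module name to the list of model names its config entry offers."""
    return {
        module["name"]: [model["name"] for model in module["models"]]
        for module in config["pipeline"]["modules"]
    }


def validate_modules(config, modules):
    """Raise ValueError if `modules` names a module or model absent from `config`."""
    offered = available_models(config)
    for module_name, selection in modules.items():
        if module_name not in offered:
            raise ValueError(f"unknown module: {module_name!r}")
        model = selection.get("model")
        if model is not None and model not in offered[module_name]:
            raise ValueError(f"{module_name!r} offers {offered[module_name]}, not {model!r}")


# Abbreviated copy of the config printed above (first two modules only).
sample_config = {
    "pipeline": {
        "name": "transcribe-pipeline-3",
        "modules": [
            {
                "name": "transcribe",
                "models": [{"name": "whisper-tiny"}, {"name": "whisper-base"},
                           {"name": "whisper-small"}, {"name": "whisper-medium"},
                           {"name": "whisper-large-v3"}],
                "defaults": {"model": "whisper-tiny"},
            },
            {
                "name": "translate",
                "models": [{"name": "opus-mt-de-en"}, {"name": "opus-mt-en-es"},
                           {"name": "opus-mt-es-en"}, {"name": "opus-mt-en-fr"},
                           {"name": "opus-mt-fr-en"}, {"name": "opus-mt-it-en"},
                           {"name": "opus-mt-zh-en"}],
                "defaults": {"model": "opus-mt-en-es"},
            },
        ],
    }
}

# English-to-Spanish is offered, so this passes silently.
validate_modules(sample_config, {"translate": {"model": "opus-mt-en-es"}})
print(available_models(sample_config)["translate"])
```

In the list shown, translation out of English is offered only into Spanish and French; the remaining models translate into English.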

In [3]:
test_file = "latinx_pride_short.mp4"
output = pipeline.process(local_file_path = input_directory + test_file,
                          expire_time=60*3,
                          modules={"translate": {"model": "opus-mt-en-es"}})

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_files/latinx_pride_short.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpw1iduw64/krixik_converted_version_latinx_pride_short.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'translate': {'model': 'opus-mt-en-es', 'params': {}}, 'json-to-txt': {'model': 'base', 'params': {}}, 'parser': {'model': 'fixed', 'params': {'chunk_size': 10, 'overlap_size': 2}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_gltnlkxpmh.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Wed Apr 24 12:47:57 2024 UTC
INFO: transcribe-pipeline-3 fil

Now we can search our translated transcript in the language it was translated into (here, Spanish).

In [6]:
output = pipeline.vector_search(query="ellos importan y tu no", 
                                symbolic_directory_paths=['/*'])

json_print(output)

{
  "status_code": 200,
  "request_id": "5f897211-1206-4d7b-94d1-72536f547fd7",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "ccfc436f-efa3-45f7-837d-f18d396d10e4",
      "file_metadata": {
        "file_name": "krixik_generated_vtwikwigjk.mp3",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 5,
        "created_at": "2024-04-18 18:34:58",
        "last_updated": "2024-04-18 18:34:58"
      },
      "search_results": [
        {
          "snippet": "Cada vez que uso el trmino Latinx hay una rabia",
          "line_numbers": [
            1
          ],
          "distance": 0.271
        },
        {
          "snippet": "comentarios que simplemente no lo entiendo Todava puedes usar la",
          "line_numbers": [
            1
          ],
          "distance": 0.283
        },
        {
          "snippet": "Latino todo el tiempo",
          "line_numbers": [
            1
          ],
          "di

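The short snippets above are consistent with the parser settings reported in the process log (`'model': 'fixed'`, `'chunk_size': 10`, `'overlap_size': 2`): the text is split into chunks and each chunk is embedded separately, which is why `num_vectors` is 5 here. The sketch below shows one plausible reading of those parameters, namely windows of `chunk_size` words that share `overlap_size` words with the previous window. This is an assumption about the behavior, not the parser module's actual implementation, and `chunk_words` is a hypothetical name.

```python
def chunk_words(text, chunk_size=10, overlap_size=2):
    """Split `text` into chunks of up to `chunk_size` words, where each chunk
    repeats the last `overlap_size` words of the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap_size < chunk_size")
    words = text.split()
    stride = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


transcript = (
    "Every time I use the term Latinx there is a palpable rage that fills my "
    "comment sections that I just don't get it You can still use the word "
    "Latino. I still self identify as Latino all the time"
)

for chunk in chunk_words(transcript):
    print(chunk)
```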
If you want to reuse your pipeline without rebuilding it in Python, just save its config. You can reload it later.

# Extra

### Summarize raw transcript output

`transcribe --> summarize`

In [5]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# create a few modules
module_1 = Module(name="transcribe")
module_2 = Module(name="summarize")

pipeline = CreatePipeline(name='transcribe-pipeline-4', 
                          module_chain=[module_1, module_2])
pipeline.save(config_path='pipeline_configs/transcribe-pipeline-4.yaml')

In [6]:
my_pipeline = krixik.load_pipeline(config_path="pipeline_configs/transcribe-pipeline-4.yaml")
test_file_name = 'data/latinx_pride_short.mp4'

# modules={} is passed on purpose: the argument is optional, and default
# models are filled in for any module left unspecified
output = my_pipeline.process(local_file_path = test_file_name,
                             expire_time=60*3,
                             modules={})

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted data/latinx_pride_short.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpu8av3ehp/krixik_converted_version_latinx_pride_short.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'summarize': {'model': 'bart-large-cnn', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_vaikdghybu.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Tue Apr  9 15:42:48 2024 UTC
INFO: transcribe-pipeline-4 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: 9aed893b-1b2d-a52f-877d-55c2a600213d
INFO: File process and processing status:
SUCCESS: module 1 (of 2) - transcribe processing complete.
SUCCE

In [7]:
json_print(output)

{
  "status_code": 200,
  "request_id": "3fc540bb-7746-4c22-baf6-53c58dcba514",
  "file_id": "6d3665bd-4a04-488f-baf3-c561dd74ca68",
  "message": "SUCCESS - output fetched for file_id 6d3665bd-4a04-488f-baf3-c561dd74ca68.",
  "process_output": [
    {
      "summary": "Every time I use the term Latinx there is a palpable rage that fills my comment sections that I just don't get it You can still use the word Latino. I still self identify as Latino all the time."
    }
  ]
}
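For this short clip the summary is the transcript itself with a final period added, so nothing was condensed. A quick way to check how much a summary shortens its input is a word-count ratio, sketched below; `compression_ratio` is a hypothetical helper, and the two strings are taken from the outputs shown earlier.

```python
def compression_ratio(transcript, summary):
    """Return summary length divided by transcript length, both counted in words."""
    transcript_words = len(transcript.split())
    if transcript_words == 0:
        raise ValueError("transcript is empty")
    return len(summary.split()) / transcript_words


transcript = (
    " Every time I use the term Latinx there is a palpable rage that fills my "
    "comment sections that I just don't get it You can still use the word "
    "Latino. I still self identify as Latino all the time"
)
# The summary shown above equals the transcript plus a trailing period.
summary = transcript.strip() + "."

print(f"{compression_ratio(transcript, summary):.2f}")  # prints 1.00
```

A ratio near 1.0 indicates the input was short enough to pass through essentially unchanged; longer recordings would be expected to give smaller values.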


### use sentiment analysis on transcript output

In [2]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# create a few modules
module_1 = Module(name="transcribe")
module_2 = Module(name="sentiment")

pipeline = CreatePipeline(name='transcribe-pipeline-5', 
                          module_chain=[module_1, module_2])
pipeline.save(config_path='pipeline_configs/transcribe-pipeline-5.yaml')

In [3]:
my_pipeline = krixik.load_pipeline(config_path="pipeline_configs/transcribe-pipeline-5.yaml")
test_file_name = 'data/latinx_pride_short.mp4'

# modules={} is passed on purpose: the argument is optional, and default
# models are filled in for any module left unspecified
output = my_pipeline.process(local_file_path = test_file_name,
                             expire_time=60*3,
                             modules={})

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted data/latinx_pride_short.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmp9pf8jcm6/krixik_converted_version_latinx_pride_short.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'sentiment': {'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_xgpovlhhvi.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Tue Apr  9 15:41:28 2024 UTC
INFO: transcribe-pipeline-5 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: 98497c36-de15-3f2b-1d61-ac1b6b70c30f
INFO: File process and processing status:
SUCCESS: module 1 (of 2) - tran

In [4]:
json_print(output)

{
  "status_code": 200,
  "request_id": "7469baa2-105a-4b0c-89eb-af53c0fa37c9",
  "file_id": "0823ed91-a98c-4d38-a272-eff11e26deb7",
  "message": "SUCCESS - output fetched for file_id 0823ed91-a98c-4d38-a272-eff11e26deb7.",
  "process_output": [
    {
      "snippet": " Every time I use the term Latinx there is a palpable rage that fills my comment sections that I just don't get it You can still use the word Latino. I still self identify as Latino all the time",
      "positive": 0.02,
      "negative": 0.98,
      "neutral": 0.0
    }
  ]
}
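Each entry in `process_output` carries three scores, so reducing it to a single label is a one-line lookup. The sketch below assumes every entry has the `positive`, `negative`, and `neutral` keys shown above; `dominant_sentiment` is a hypothetical helper.

```python
def dominant_sentiment(result):
    """Return the (label, score) pair with the highest score in a sentiment result."""
    scores = {label: result[label] for label in ("positive", "negative", "neutral")}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Scores copied from the output above.
sample_result = {"positive": 0.02, "negative": 0.98, "neutral": 0.0}

print(dominant_sentiment(sample_result))  # prints ('negative', 0.98)
```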
