# Multilingual-to-English transcription with sentiment analysis pipeline

This document details a modular pipeline that takes in an audio/video file in a non-English language, transcribes it, translates the transcription into English, and then performs sentiment analysis on each sentence of the translated transcript.

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [processing a file](#processing-a-file)
- [performing semantic search](#performing-semantic-search)
- [saving the pipeline config for future use](#saving-the-pipeline-config-for-future-use)


In [1]:
# import utilities
import sys 
import json
import importlib
sys.path.append('../../../')
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline

# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.
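
If either secret is missing from the `.env` file, `os.getenv` silently returns `None` and the failure only shows up later at authentication. A minimal sketch of an earlier check is shown below. The helper name `require_env` is hypothetical and is not part of krixik or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set; check your .env file")
    return value


# Possible usage after load_dotenv(...), with the variable names used in the cell above:
# MY_API_KEY = require_env("MY_API_KEY")
# MY_API_URL = require_env("MY_API_URL")
```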


## Pipeline setup

Below we set up a multi-module pipeline that transcribes any audio/video file in a non-English language, translates the resulting transcription into English, and then performs sentiment analysis on the result, sentence by sentence.

To do this we will use the following modules:

- [`transcribe`](modules/transcribe.md): takes in audio/video input, outputs json of content transcription
- [`translate`](modules/translate.md): takes in json of text snippets, outputs json of translated snippets
- [`json-to-txt`](modules/json-to-txt.md): takes in json of text snippets, merges into text file
- [`parser`](modules/parser.md): takes in text, slices into (possibly overlapping) strings
- [`sentiment`](modules/sentiment.md): takes in text snippets and returns scores for their sentiments

We do this by passing the module names to the `module_chain` argument of [`create_pipeline`](system/create_save_load.md) along with a name for our pipeline.

In [2]:
# create a multi-module pipeline
pipeline = krixik.create_pipeline(name="examples-transcribe-multilingual-sentiment-docs",
                                  module_chain=["transcribe",
                                                "translate",
                                                "json-to-txt",
                                                "parser",
                                                "sentiment"])
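
The order of the modules matters because each module consumes what the previous one produces. The sketch below checks that property for the chain above. The format labels (`"audio"`, `"json"`, `"text"`) are paraphrased from the module list and are illustrative assumptions, as is the JSON output assumed for `parser`; none of this is part of the krixik API.

```python
# (input format, output format) per module, paraphrased from the module list.
# These labels are assumptions made for illustration only.
MODULE_IO = {
    "transcribe": ("audio", "json"),
    "translate": ("json", "json"),
    "json-to-txt": ("json", "text"),
    "parser": ("text", "json"),
    "sentiment": ("json", "json"),
}


def check_chain(module_chain):
    """Return the (upstream, downstream) pairs whose formats do not match."""
    mismatches = []
    for upstream, downstream in zip(module_chain, module_chain[1:]):
        if MODULE_IO[upstream][1] != MODULE_IO[downstream][0]:
            mismatches.append((upstream, downstream))
    return mismatches


chain = ["transcribe", "translate", "json-to-txt", "parser", "sentiment"]
# An empty list means every module's output can feed the next module.
print(check_chain(chain))
```

Under these assumptions, dropping `json-to-txt` from the chain would be reported as a mismatch between `translate` and `parser`.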

With our pipeline built, we can now pass a test file to its `process` method.

## Processing a file

Let's take a quick look at a test file before processing.

This is a short video in Spanish.  After transcription we will translate it into English.

In [3]:
# examine contents of input file
test_file = "../../../data/input/deadlift.mp4"
from IPython.display import Video
Video(test_file)

The spoken language of the input video is Spanish.  We will use the `opus-mt-es-en` model of the [`translate`](modules/translate.md) module to translate the transcript of this video into English.

For this run we will use the default models for the remainder of the modules.
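
To illustrate how a single override combines with defaults, the sketch below rebuilds the "hydrated input modules" dictionary that appears in the process log further down. The default model names are copied from that log line. The `hydrate` function is a hypothetical illustration and not krixik's actual implementation. No default is listed for `translate` because the log only shows the value we override.

```python
# Default model per module, as reported in the process log of this document.
DEFAULT_MODELS = {
    "transcribe": "whisper-tiny",
    "json-to-txt": "base",
    "parser": "sentence",
    "sentiment": "distilbert-base-uncased-finetuned-sst-2-english",
}


def hydrate(module_chain, overrides):
    """Combine per-module overrides with defaults, keyed by position in the chain."""
    hydrated = {}
    for position, name in enumerate(module_chain, start=1):
        settings = overrides.get(name, {})
        model = settings.get("model", DEFAULT_MODELS.get(name))
        if model is None:
            raise ValueError(f"no model given for module {name!r}")
        hydrated[f"module_{position}"] = {
            "model": model,
            "params": settings.get("params", {}),
        }
    return hydrated


chain = ["transcribe", "translate", "json-to-txt", "parser", "sentiment"]
hydrated = hydrate(chain, {"translate": {"model": "opus-mt-es-en"}})
print(hydrated["module_2"])
```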


In [4]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

In [5]:
# test file
test_file = "../../../data/input/deadlift.mp4"

# process test input
process_output = pipeline.process(local_file_path = test_file,
                                  expire_time=60*10,
                                  modules={"translate": {"model": "opus-mt-es-en"}},
                                  verbose=True)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted ../../../data/input/deadlift.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmphh53msr4/krixik_converted_version_deadlift.mp3
INFO: hydrated input modules: {'module_1': {'model': 'whisper-tiny', 'params': {}}, 'module_2': {'model': 'opus-mt-es-en', 'params': {}}, 'module_3': {'model': 'base', 'params': {}}, 'module_4': {'model': 'sentence', 'params': {}}, 'module_5': {'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_rhlnmnumla.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 600 seconds, at Mon May  6 15:44:33 2024 UTC
INFO: examples-transcribe-multilingual-sentiment-docs file process and input processing started...
INFO: metadata can b

The output of this process is printed below.  Because the final module of this pipeline, `sentiment`, returns JSON, the result is included directly under the `process_output` key.  The output has also been saved locally, at the location listed under the `process_output_files` key.

In [6]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "examples-transcribe-multilingual-sentiment-docs",
  "request_id": "718000ab-edd3-4554-a38c-89a3cdebf394",
  "file_id": "7c1d7f56-36aa-4beb-824e-772e50140506",
  "message": "SUCCESS - output fetched for file_id 7c1d7f56-36aa-4beb-824e-772e50140506.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": "To begin, we want to see the feet in the anchors of the chair and the men, the columns in the ground, a neutral column, mediated by the abdomen, the men are going to go through there.",
      "positive": 0.985,
      "negative": 0.015,
      "neutral": 0.0
    },
    {
      "snippet": "The men are lightly in front of the bar or in the top, the men are symmetric and sufficient to not be in the knees.",
      "positive": 0.573,
      "negative": 0.427,
      "neutral": 0.0
    },
    {
      "snippet": "We can have a look at the front.",
      "positive": 0.998,
      "negative": 0.002,
      "neutra
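
Each entry in `process_output` carries one score per sentiment label, so a common next step is to pick the highest-scoring label for each snippet. The sketch below does this for the first two entries shown above (the third is truncated in this document and is left out). The helper name `dominant_sentiment` is hypothetical.

```python
# First two entries of the process output shown above.
results = [
    {
        "snippet": "To begin, we want to see the feet in the anchors of the chair and the men, the columns in the ground, a neutral column, mediated by the abdomen, the men are going to go through there.",
        "positive": 0.985,
        "negative": 0.015,
        "neutral": 0.0,
    },
    {
        "snippet": "The men are lightly in front of the bar or in the top, the men are symmetric and sufficient to not be in the knees.",
        "positive": 0.573,
        "negative": 0.427,
        "neutral": 0.0,
    },
]


def dominant_sentiment(entry):
    """Return the (label, score) pair with the highest score for one snippet."""
    labels = ("positive", "negative", "neutral")
    best = max(labels, key=lambda label: entry[label])
    return best, entry[best]


for entry in results:
    label, score = dominant_sentiment(entry)
    # Show the label, its score, and the first 40 characters of the snippet.
    print(f"{label:8s} {score:.3f}  {entry['snippet'][:40]}")
```

A score close to 0.5, as for the second snippet, suggests the label is uncertain and may deserve a manual check, especially when the translation itself is rough.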

## Performing semantic search

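Semantic search is available on pipelines that include the `text-embedder` and `vector-db` modules.  The pipeline built above ends with `sentiment` and includes neither, so the call below fails with an `AttributeError`.  To search the translated transcription (here in English, since we processed our file with a Spanish-English model), those modules would need to be part of the module chain.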

In [7]:
# semantically search translated transcription
search_output = pipeline.semantic_search(query="hechos realmente básicos", 
                                         file_ids=[process_output["file_id"]])

print(json.dumps(search_output, indent=2))

AttributeError: 'KrixikBasePipeline' object has no attribute 'semantic_search'
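
One way to avoid an unhandled error like the one above is to check for the method before calling it. The sketch below uses only Python built-ins. `StandInPipeline` is a hypothetical stand-in class, used so the example runs without krixik, and `try_semantic_search` is likewise a made-up helper name.

```python
class StandInPipeline:
    """Hypothetical stand-in for a pipeline object without search support."""

    def process(self, **kwargs):
        return {"status_code": 200}


def try_semantic_search(pipeline, **kwargs):
    """Call pipeline.semantic_search if it exists, otherwise return None."""
    search = getattr(pipeline, "semantic_search", None)
    if search is None:
        print("semantic_search is not available on this pipeline")
        return None
    return search(**kwargs)


result = try_semantic_search(StandInPipeline(), query="basic facts")
```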

Learn more about the [`semantic_search` method here](system/semantic_search.md).

In [None]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)