## Multi-Module Pipeline: Sentiment Analysis on Translated Transcription

This document details a modular pipeline that takes in an audio/video file in a non-English language, [`transcribes`](../modules/ai_model_modules/transcribe_module.md) it, [`translates`](../modules/ai_model_modules/translate_module.md) the transcript into English, and then performs [`sentiment analysis`](../modules/ai_model_modules/sentiment_module.md) on each sentence of the translated transcript.

The document is divided into the following sections:

- [Pipeline Setup](#pipeline-setup)
- [Processing an Input File](#processing-an-input-file)

### Pipeline Setup

To achieve what we've described above, let's set up a pipeline consisting of the following modules, in sequence:

- A [`transcribe`](../modules/ai_model_modules/transcribe_module.md) module.

- A [`translate`](../modules/ai_model_modules/translate_module.md) module.

- A [`json-to-txt`](../modules/support_function_modules/json-to-txt_module.md) module.

- A [`parser`](../modules/ai_model_modules/parser_module.md) module.

- A [`sentiment`](../modules/ai_model_modules/sentiment_module.md) module.

We use the [`json-to-txt`](../modules/support_function_modules/json-to-txt_module.md) and [`parser`](../modules/ai_model_modules/parser_module.md) combination, which first merges the transcribed snippets into one document and then splits it again into sentences, so that pauses in speech don't leave partial snippets that could confuse the [`sentiment`](../modules/ai_model_modules/sentiment_module.md) model.
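
The effect of merging and re-splitting can be illustrated with a minimal sketch in plain Python. It is not the implementation used by the `json-to-txt` or `parser` modules; the sample snippets, the helper names `join_snippets` and `split_sentences`, and the naive punctuation-based splitting rule are all assumptions made for illustration.

```python
import re

# Hypothetical transcript snippets: pauses in speech have cut sentences in two.
snippets = [
    "To perform the movement, our athlete will push",
    "from the heels. When the bar passes the knees,",
    "we extend the hip.",
]


def join_snippets(parts):
    """Concatenate snippets into one document, dropping empty pieces."""
    return " ".join(p.strip() for p in parts if p.strip())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


sentences = split_sentences(join_snippets(snippets))
for sentence in sentences:
    print(sentence)
```

Each resulting item is a complete sentence, which is the kind of input a sentence-level sentiment model is meant to score.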

Pipeline setup is accomplished through the [`.create_pipeline`](../system/pipeline_creation/create_pipeline.md) method, as follows:

In [2]:
# create a pipeline as detailed above

pipeline_1 = krixik.create_pipeline(name="multi_sentiment_analysis_on_translated_transcription",
                                    module_chain=["transcribe",
                                                  "translate",
                                                  "json-to-txt",
                                                  "parser",
                                                  "sentiment"])

### Processing an Input File

Let's take a quick look at a test file before processing. Given that we're [`translating`](../modules/ai_model_modules/translate_module.md) before performing [`sentiment`](../modules/ai_model_modules/sentiment_module.md) analysis, we'll start with a Spanish-language video file.

In [3]:
# examine contents of input file

from IPython.display import Video
Video("../../../data/input/deadlift.mp4")

Since the input video is in Spanish, we'll use the (non-default) [`opus-mt-es-en`](https://huggingface.co/Helsinki-NLP/opus-mt-es-en) model of the [`translate`](../modules/ai_model_modules/translate_module.md) module to translate its transcript into English. We will also leverage a stronger model than the [default](../modules/ai_model_modules/transcribe_module.md#available-models-in-the-transcribe-module) for our [`transcription`](../modules/ai_model_modules/transcribe_module.md).

We will use the default models for every other module in the pipeline, so they don't have to be specified in the [`modules`](../system/parameters_processing_files_through_pipelines/process_method.md#selecting-models-via-the-modules-argument) argument of the [`.process`](../system/parameters_processing_files_through_pipelines/process_method.md) method.
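
How unspecified modules fall back to their defaults can be pictured with the sketch below. It is only an illustration of the idea, not Krixik's code: the function `hydrate_modules` and the table `DEFAULT_MODELS` are hypothetical, the defaults for `json-to-txt`, `parser` and `sentiment` are copied from the "hydrated input modules" line of the process log in this document, and the defaults listed for `transcribe` and `translate` are assumptions that should be checked against each module's documentation.

```python
# Default model per module type (see the lead-in for which values are assumed).
DEFAULT_MODELS = {
    "transcribe": "whisper-tiny",   # assumed default, verify in the module docs
    "translate": "opus-mt-de-en",   # assumed default, verify in the module docs
    "json-to-txt": "base",
    "parser": "sentence",
    "sentiment": "distilbert-base-uncased-finetuned-sst-2-english",
}


def hydrate_modules(module_chain, overrides=None):
    """Return one {'model', 'params'} entry per position in the chain."""
    overrides = overrides or {}
    unknown = set(overrides) - set(module_chain)
    if unknown:
        raise ValueError(f"modules not in pipeline: {sorted(unknown)}")
    hydrated = {}
    for position, name in enumerate(module_chain, start=1):
        chosen = overrides.get(name, {})
        hydrated[f"module_{position}"] = {
            "model": chosen.get("model", DEFAULT_MODELS[name]),
            "params": chosen.get("params", {}),
        }
    return hydrated


chain = ["transcribe", "translate", "json-to-txt", "parser", "sentiment"]
hydrated = hydrate_modules(
    chain,
    {"transcribe": {"model": "whisper-base"}, "translate": {"model": "opus-mt-es-en"}},
)
print(hydrated)
```

Only the two overridden modules differ from the default table; the remaining three are filled in without being mentioned in the call.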

In [5]:
# process the file through the pipeline, as described above

process_output_1 = pipeline_1.process(local_file_path="../../../data/input/deadlift.mp4", # the initial local filepath where the input file is stored
                                      local_save_directory="../../../data/output", # the local directory that the output file will be saved to
                                      expire_time=60*30, # process data will be deleted from the Krixik system in 30 minutes
                                      wait_for_process=True, # wait for process to complete before returning IDE control to user
                                      verbose=False, # do not display process update printouts upon running code
                                      modules={"transcribe": {"model": "whisper-base"}, "translate": {"model": "opus-mt-es-en"}}) # specify a non-default model for use in two modules whose type is only present once each in the pipeline (otherwise, would have to refer to them positionally)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted ../../../data/input/deadlift.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpbc1_ib05/krixik_converted_version_deadlift.mp3
INFO: hydrated input modules: {'module_1': {'model': 'whisper-medium', 'params': {}}, 'module_2': {'model': 'opus-mt-es-en', 'params': {}}, 'module_3': {'model': 'base', 'params': {}}, 'module_4': {'model': 'sentence', 'params': {}}, 'module_5': {'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_tihuizzppb.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 600 seconds, at Mon May  6 16:44:53 2024 UTC
INFO: examples-transcribe-multilingual-sentiment-docs file process and input processing started...
INFO: metadata can

The output of this process is printed below. To learn more about each component of the output, review documentation for the [`.process`](../system/parameters_processing_files_through_pipelines/process_method.md) method.

Because the output of this particular module-model pair is a JSON file, the process output is also included in the returned object under the `process_output` key (this is only the case for JSON outputs). Moreover, the output file itself has been saved to the location noted in the `process_output_files` key. The `file_id` of the processed input is used as a filename prefix for the output file.
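
The filename-prefix convention can be checked with a few lines of Python. The sketch below is an assumption-laden illustration: `outputs_for_file` is a hypothetical helper, and the example dictionary only mimics the shape of the returned object, with a made-up output path built from the `file_id` and save directory that appear in this document.

```python
from pathlib import Path


def outputs_for_file(process_output):
    """Return the saved output paths whose filename starts with the file_id."""
    file_id = process_output["file_id"]
    return [
        path
        for path in process_output.get("process_output_files", [])
        if Path(path).name.startswith(file_id)
    ]


# Hypothetical stand-in for the object returned by .process
example_output = {
    "file_id": "efdc2954-9bef-4427-8de1-2bd18a830015",
    "process_output_files": [
        "../../../data/output/efdc2954-9bef-4427-8de1-2bd18a830015.json",  # assumed path
    ],
}

matching = outputs_for_file(example_output)
print(matching)
```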

In [6]:
# nicely print the output of this process

import json

print(json.dumps(process_output_1, indent=2))

{
  "status_code": 200,
  "pipeline": "examples-transcribe-multilingual-sentiment-docs",
  "request_id": "1119f07f-e4a1-4021-9668-2f19ea367568",
  "file_id": "efdc2954-9bef-4427-8de1-2bd18a830015",
  "message": "SUCCESS - output fetched for file_id efdc2954-9bef-4427-8de1-2bd18a830015.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": "For the starting position, we want to see the feed between the hip and shoulders width, the heels on the floor, a neutral column mediated by abdominal tension, the shoulders are lightly in front of the bar or above, straight arms, symmetrical hands and enough width to not rather the knees and we can have a lightly look forward.",
      "positive": 0.99,
      "negative": 0.01,
      "neutral": 0.0
    },
    {
      "snippet": "To perform the movement, our athlete will push from the heels, he will start to raise the hips and shoulders together, when the bar passes the knees, we extend the hip.",
   

To confirm that everything went as it should have, let's load in the JSON output file listed in `process_output_files`:

In [None]:
# load in process output from file

with open(process_output_1["process_output_files"][0]) as f:
  print(json.dumps(json.load(f), indent=2))

[
  {
    "snippet": " That's the episode looking at the great country of Columbia.",
    "positive": 0.993,
    "negative": 0.007,
    "neutral": 0.0
  },
  {
    "snippet": "We looked at some really basic facts.",
    "positive": 0.252,
    "negative": 0.748,
    "neutral": 0.0
  },
  {
    "snippet": "It's name, a bit of its history, the type of people that live there, land size, and all that jazz.",
    "positive": 0.998,
    "negative": 0.002,
    "neutral": 0.0
  },
  {
    "snippet": "But in this video, we're going to go into a little bit more of a detailed look.",
    "positive": 0.992,
    "negative": 0.008,
    "neutral": 0.0
  },
  {
    "snippet": "Yo, what is going on guys?",
    "positive": 0.005,
    "negative": 0.995,
    "neutral": 0.0
  },
  {
    "snippet": "Welcome back to F2D facts.",
    "positive": 0.999,
    "negative": 0.001,
    "neutral": 0.0
  },
  {
    "snippet": "The channel where I look at people cultures and places.",
    "positive": 0.999,
    "negativ

In [7]:
# delete all processed datapoints belonging to this pipeline

reset_pipeline(pipeline_1)
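
Since the output file remains on disk after the pipeline is reset, the per-sentence scores can still be summarized locally. The following is a minimal sketch, assuming entries shaped like the ones printed in this document (a `snippet` plus `positive`, `negative` and `neutral` scores); the helpers `dominant_label` and `summarize` are hypothetical and the three sample entries are copied from the output shown above.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")

# Entries copied from the saved output shown earlier in this document.
sample = [
    {"snippet": " That's the episode looking at the great country of Columbia.",
     "positive": 0.993, "negative": 0.007, "neutral": 0.0},
    {"snippet": "We looked at some really basic facts.",
     "positive": 0.252, "negative": 0.748, "neutral": 0.0},
    {"snippet": "Yo, what is going on guys?",
     "positive": 0.005, "negative": 0.995, "neutral": 0.0},
]


def dominant_label(entry):
    """Return the label with the highest score for one snippet."""
    return max(LABELS, key=lambda label: entry[label])


def summarize(entries):
    """Count how many snippets fall under each dominant label."""
    return dict(Counter(dominant_label(entry) for entry in entries))


for entry in sample:
    print(f"{dominant_label(entry):>8}  {entry['snippet'].strip()}")
print(summarize(sample))
```

Note that a neutral greeting such as "Yo, what is going on guys?" is scored as strongly negative, so the dominant label is best read as a rough signal rather than a verdict on each sentence.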