## Transcription with sentiment analysis pipeline

This document details a modular pipeline that takes in an English-language audio/video file, transcribes it, and then performs sentiment analysis on each sentence of the transcript.

A table of contents for the remainder of this document is shown below.


- [Pipeline setup](#pipeline-setup)
- [Processing a file](#processing-a-file)

In [1]:
# import utilities
import sys 
import json
import importlib
sys.path.append('../../../')
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline

# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)
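
`os.getenv` returns `None` when a variable is missing, so a mistyped `.env` path only surfaces later, at initialization time. The following is a minimal sketch of a fail-fast check; the helper name `require_env` and the dummy values are assumptions made for illustration and are not part of krixik or python-dotenv.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of `name`, or raise if it is missing or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"environment variable {name} is not set; check your .env file"
        )
    return value


# Demonstration with a stand-in mapping (hypothetical values, not real secrets).
demo_env = {"MY_API_KEY": "dummy-key", "MY_API_URL": "https://example.invalid"}
api_key = require_env("MY_API_KEY", demo_env)
api_url = require_env("MY_API_URL", demo_env)
```

In the notebook itself, the same helper could be called without the second argument, after `load_dotenv`, to read from the real environment.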

## Pipeline setup

Below we set up a multi-module pipeline that transcribes any audio/video file and then performs sentiment analysis on the resulting transcript, sentence by sentence.

To do this we will use the following modules:

- [`transcribe`](modules/transcribe.md): takes in audio/video input, outputs a JSON transcription of its content
- [`json-to-txt`](modules/json-to-txt.md): takes in JSON of text snippets, merges them into a single text file
- [`parser`](modules/parser.md): takes in text, slices it into (possibly overlapping) strings
- [`sentiment`](modules/sentiment.md): takes in text snippets, returns sentiment scores for each one

We do this by passing the module names to the `module_chain` argument of [`create_pipeline`](system/create_save_load.md) along with a name for our pipeline.

In [4]:
# create a multi-module pipeline
pipeline = krixik.create_pipeline(name="examples-transcribe-sentiment-docs",
                                  module_chain=["transcribe",
                                                "json-to-txt",
                                                "parser",
                                                "sentiment"])

This pipeline's available modeling options and parameters are stored in your custom [pipeline's configuration](system/create_save_load.md).

In [None]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)
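
To make the hand-off between the two middle modules of this chain more concrete, here is a minimal plain-Python sketch. It is not krixik's implementation. It assumes the transcription arrives as a list of records with a `snippet` key (a hypothetical layout chosen for illustration), and the helpers `merge_snippets` and `split_sentences` are invented stand-ins for `json-to-txt` and `parser`.

```python
import re


def merge_snippets(records):
    """Stand-in for json-to-txt: join text snippets into one string."""
    return " ".join(record["snippet"].strip() for record in records)


def split_sentences(text):
    """Stand-in for a sentence parser: naive split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Hypothetical transcription chunks whose boundaries ignore sentence breaks.
transcript_chunks = [
    {"snippet": "Yo, what is going on guys? Welcome back"},
    {"snippet": "to F2D facts. The channel where I look at people cultures and places."},
]

merged = merge_snippets(transcript_chunks)
sentences = split_sentences(merged)
for sentence in sentences:
    print(sentence)
```

Merging first and splitting second is what lets sentence boundaries be recovered even when a transcription chunk ends mid-sentence, so that each sentence can be scored on its own by the final module.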

## Processing a file

We first define a path to a local input file.

Let's take a quick look at this file before processing.

In [6]:
# examine contents of input file
from IPython.display import Video

test_file = "../../../data/input/Interesting Facts About Colombia.mp4"
Video(test_file)

For this run we will use the default models for the entire chain of modules.

In [8]:
# test file
test_file = "../../../data/input/Interesting Facts About Colombia.mp4"

# process test input with the default model of every module
process_output = pipeline.process(local_file_path=test_file,
                                  expire_time=60*10,  # seconds until the processed file expires (10 minutes)
                                  verbose=False)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted ../../input_data/Interesting Facts About Colombia.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpas8o2_np/krixik_converted_version_Interesting Facts About Colombia.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'json-to-txt': {'model': 'base', 'params': {}}, 'parser': {'model': 'sentence', 'params': {}}, 'sentiment': {'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_gndqwwmyqb.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Mon Apr 29 15:42:13 2024 UTC
INFO: transcribe-sentiment-pipeline file process and input processing started...
INFO: metadata can be updated using the .up

The output of this process is printed below. Because the output of this particular pipeline is a JSON file, its contents are also returned directly under the `process_output` key. The output file itself has been saved locally at the path listed under the `process_output_files` key.

In [9]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "transcribe-sentiment-pipeline",
  "request_id": "bca798e6-85de-4f8a-9974-744108545dae",
  "file_id": "dfaced90-11ed-41c8-9bf0-8751656be563",
  "message": "SUCCESS - output fetched for file_id dfaced90-11ed-41c8-9bf0-8751656be563.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": " That's the episode looking at the great country of Columbia.",
      "positive": 0.993,
      "negative": 0.007,
      "neutral": 0.0
    },
    {
      "snippet": "We looked at some really basic facts.",
      "positive": 0.252,
      "negative": 0.748,
      "neutral": 0.0
    },
    {
      "snippet": "It's name, a bit of its history, the type of people that live there, land size, and all that jazz.",
      "positive": 0.998,
      "negative": 0.002,
      "neutral": 0.0
    },
    {
      "snippet": "But in this video, we're going to go into a little bit more of a detailed look.",
      "positive": 0.992,
      

We can also load the same output directly from the saved file.

In [14]:
# load in process output from file
with open(process_output["process_output_files"][0]) as f:
    print(json.dumps(json.load(f), indent=2))

[
  {
    "snippet": " That's the episode looking at the great country of Columbia.",
    "positive": 0.993,
    "negative": 0.007,
    "neutral": 0.0
  },
  {
    "snippet": "We looked at some really basic facts.",
    "positive": 0.252,
    "negative": 0.748,
    "neutral": 0.0
  },
  {
    "snippet": "It's name, a bit of its history, the type of people that live there, land size, and all that jazz.",
    "positive": 0.998,
    "negative": 0.002,
    "neutral": 0.0
  },
  {
    "snippet": "But in this video, we're going to go into a little bit more of a detailed look.",
    "positive": 0.992,
    "negative": 0.008,
    "neutral": 0.0
  },
  {
    "snippet": "Yo, what is going on guys?",
    "positive": 0.005,
    "negative": 0.995,
    "neutral": 0.0
  },
  {
    "snippet": "Welcome back to F2D facts.",
    "positive": 0.999,
    "negative": 0.001,
    "neutral": 0.0
  },
  {
    "snippet": "The channel where I look at people cultures and places.",
    "positive": 0.999,
    "negativ

In [None]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)
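
The per-sentence records shown in the output above lend themselves to simple post-processing. The following is a minimal sketch that labels each sentence by its highest score and summarizes the result. The sample records are copied from the output printed earlier; the helper names `dominant_label` and `summarize` are assumptions made for illustration and are not part of krixik.

```python
from collections import Counter

# A few records copied from the pipeline output shown earlier.
records = [
    {"snippet": " That's the episode looking at the great country of Columbia.",
     "positive": 0.993, "negative": 0.007, "neutral": 0.0},
    {"snippet": "We looked at some really basic facts.",
     "positive": 0.252, "negative": 0.748, "neutral": 0.0},
    {"snippet": "Yo, what is going on guys?",
     "positive": 0.005, "negative": 0.995, "neutral": 0.0},
    {"snippet": "Welcome back to F2D facts.",
     "positive": 0.999, "negative": 0.001, "neutral": 0.0},
]

LABELS = ("positive", "negative", "neutral")


def dominant_label(record):
    """Return the sentiment label with the highest score for one record."""
    return max(LABELS, key=lambda label: record[label])


def summarize(records):
    """Count dominant labels and compute the mean positive score."""
    counts = Counter(dominant_label(record) for record in records)
    mean_positive = sum(record["positive"] for record in records) / len(records)
    return counts, mean_positive


counts, mean_positive = summarize(records)
most_negative = max(records, key=lambda record: record["negative"])

print(dict(counts))
print(round(mean_positive, 3))
print(most_negative["snippet"])
```

With the full output, `records` could instead be read from the file listed under `process_output_files`, as in the loading cell above.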