# transcription with sentiment analysis pipeline

This document details a modular pipeline that takes in an English-language audio/video file, transcribes it, and then performs sentiment analysis on each sentence of the transcript.

To follow along with this demonstration, be sure to initialize your krixik session with your API key and URL as shown below.

We illustrate loading these required secrets via [python-dotenv](https://pypi.org/project/python-dotenv/), storing them in a `.env` file.  This is good practice for storing / loading secrets (e.g., doing so reduces the chance you inadvertently push secrets to a repo).
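
As a minimal sketch of a guard you might add after `load_dotenv`, the helper below raises a clear error when a secret is missing, rather than passing `None` on to `krixik.init`. Note that `require_env` is a hypothetical name used for illustration; it is not part of krixik or python-dotenv, and it relies only on the standard library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset/empty.

    Hypothetical helper for illustration; not part of krixik or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"environment variable {name!r} is not set; check your .env file"
        )
    return value
```

With this in place, something like `MY_API_KEY = require_env('MY_API_KEY')` would fail immediately with a readable message if the `.env` file was not found or the variable was misspelled.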


In [2]:
import sys 
sys.path.append('../../../')

from docs.utilities.reset import reset_pipeline

In [3]:
# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.


This small helper function pretty-prints dictionaries in notebooks / markdown.

In [4]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [processing a file](#processing-a-file)
- [saving the pipeline config for future use](#saving-the-pipeline-config-for-future-use)

## pipeline setup

Below we set up a multi-module pipeline to serve our intended purpose: transcribing English-language audio/video and scoring the sentiment of each sentence in the resulting transcript.

To do this we will use the following modules:

- [`transcribe`](modules/transcribe.md): takes in audio/video input, outputs json of content transcription
- [`json-to-txt`](modules/json-to-txt.md): takes in json of text snippets, merges into text file
- [`parser`](modules/parser.md): takes in text, slices into (possibly overlapping) strings
- [`sentiment`](modules/sentiment.md): takes in text snippets and returns scores for their sentiments

In [4]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(module_type="transcribe")
module_2 = Module(module_type="json-to-txt")
module_3 = Module(module_type="parser")
module_4 = Module(module_type="sentiment")

# create custom pipeline object
custom = CreatePipeline(name='transcribe-sentiment-pipeline', 
                        module_chain=[module_1, module_2, module_3, module_4])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

With our `custom` pipeline built and loaded, we can now use it to process a test file.

## processing a file

We first define a path to a local input file.

In [5]:
# define path to an input file
test_file = "../../input_data/Interesting Facts About Colombia.mp4"

Let's take a quick look at this file before processing.

In [6]:
# examine contents of input file
from IPython.display import Video
Video(test_file)

For this run we will use the default models for the entire chain of modules.

In [7]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

In [8]:
# test file
test_file = "../../input_data/Interesting Facts About Colombia.mp4"

# process test input
process_output = pipeline.process(local_file_path = test_file,
                                  expire_time=60*5)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted ../../input_data/Interesting Facts About Colombia.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpas8o2_np/krixik_converted_version_Interesting Facts About Colombia.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'json-to-txt': {'model': 'base', 'params': {}}, 'parser': {'model': 'sentence', 'params': {}}, 'sentiment': {'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_gndqwwmyqb.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Mon Apr 29 15:42:13 2024 UTC
INFO: transcribe-sentiment-pipeline file process and input processing started...
INFO: metadata can be updated using the .up

The output of this process is printed below.  Because the output of this particular pipeline is a json file, its contents are also included in the returned object under the `process_output` key.  The output file itself has been saved locally at the path listed under the `process_output_files` key.

In [9]:
# nicely print the output of this process
json_print(process_output)

{
  "status_code": 200,
  "pipeline": "transcribe-sentiment-pipeline",
  "request_id": "bca798e6-85de-4f8a-9974-744108545dae",
  "file_id": "dfaced90-11ed-41c8-9bf0-8751656be563",
  "message": "SUCCESS - output fetched for file_id dfaced90-11ed-41c8-9bf0-8751656be563.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": " That's the episode looking at the great country of Columbia.",
      "positive": 0.993,
      "negative": 0.007,
      "neutral": 0.0
    },
    {
      "snippet": "We looked at some really basic facts.",
      "positive": 0.252,
      "negative": 0.748,
      "neutral": 0.0
    },
    {
      "snippet": "It's name, a bit of its history, the type of people that live there, land size, and all that jazz.",
      "positive": 0.998,
      "negative": 0.002,
      "neutral": 0.0
    },
    {
      "snippet": "But in this video, we're going to go into a little bit more of a detailed look.",
      "positive": 0.992,
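
The returned dictionary can also be checked programmatically before reading the output file. The sketch below assumes only the keys that appear in this document (`status_code` and `process_output_files`); the helper name `first_output_path` and the sample file path are hypothetical and used purely for illustration.

```python
def first_output_path(process_output: dict) -> str:
    """Return the first local output file path from a process result.

    Hypothetical helper; assumes the `status_code` and `process_output_files`
    keys shown in this document.
    """
    status = process_output.get("status_code")
    if status != 200:
        raise RuntimeError(f"process failed with status_code {status}")
    files = process_output.get("process_output_files") or []
    if not files:
        raise RuntimeError("no output files were returned")
    return files[0]


# illustrative result dict; the file path is made up for this example
example_output = {
    "status_code": 200,
    "process_output_files": ["./example-output.json"],
}
print(first_output_path(example_output))
```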
      

We can also load the output from file to see the pipeline output.

In [14]:
# load in process output from file
with open(process_output['process_output_files'][0], "r") as file:
   json_print(json.load(file))

[
  {
    "snippet": " That's the episode looking at the great country of Columbia.",
    "positive": 0.993,
    "negative": 0.007,
    "neutral": 0.0
  },
  {
    "snippet": "We looked at some really basic facts.",
    "positive": 0.252,
    "negative": 0.748,
    "neutral": 0.0
  },
  {
    "snippet": "It's name, a bit of its history, the type of people that live there, land size, and all that jazz.",
    "positive": 0.998,
    "negative": 0.002,
    "neutral": 0.0
  },
  {
    "snippet": "But in this video, we're going to go into a little bit more of a detailed look.",
    "positive": 0.992,
    "negative": 0.008,
    "neutral": 0.0
  },
  {
    "snippet": "Yo, what is going on guys?",
    "positive": 0.005,
    "negative": 0.995,
    "neutral": 0.0
  },
  {
    "snippet": "Welcome back to F2D facts.",
    "positive": 0.999,
    "negative": 0.001,
    "neutral": 0.0
  },
  {
    "snippet": "The channel where I look at people cultures and places.",
    "positive": 0.999,
    "negativ
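
Once loaded, the list of records is straightforward to post-process. As a rough sketch, the snippet below labels each sentence with its highest-scoring sentiment. The helper name `dominant_sentiment` is our own, and the two sample records are copied from the output shown in this document.

```python
def dominant_sentiment(record: dict) -> str:
    """Return the label ('positive', 'negative', or 'neutral') with the top score."""
    labels = ("positive", "negative", "neutral")
    return max(labels, key=lambda label: record[label])


# two records copied from the output shown in this document
records = [
    {
        "snippet": "We looked at some really basic facts.",
        "positive": 0.252,
        "negative": 0.748,
        "neutral": 0.0,
    },
    {
        "snippet": "Welcome back to F2D facts.",
        "positive": 0.999,
        "negative": 0.001,
        "neutral": 0.0,
    },
]

# print the dominant label next to each sentence
for record in records:
    print(f"{dominant_sentiment(record):>8}  {record['snippet']}")
```

The same function could be applied to the full list returned by `json.load(file)` to tally how many sentences fall under each label.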

## saving the pipeline config for future use

You can save the configuration of this pipeline using the `custom` object, and use it later directly without building it again in python.

In [11]:
# save your config for later use (that way you don't need to re-build in python)
custom.save(config_path='transcribe-sentiment-pipeline.yml')

See more about [saving and loading pipeline configuration files](LINNK GOES HERE).