# Challenge - Building a Semantic Index and Knowledge Base from Video Data

## 1. Introduction

This notebook walks through an end-to-end, GPU-enabled workflow that builds a semantic index and a knowledge base from video data, using tools from the Riva framework.

After completing this exercise, you will be able to use Riva to transcribe video data, build an index of the keywords and entities found in the videos, and construct a basic knowledge base from the provided data.


No prior familiarity with Riva is required. Because our aim is to go from raw video to a knowledge base, a detailed introduction to Riva is out of scope for this notebook; see [Introduction to Riva](../Introduction_to_Riva.ipynb) for more background.

### 1.1 Problem statement

We want to answer the following questions, using videos from the GTC 2021 special address on healthcare as our data source.

0: How is drug discovery traditionally done?

1: How many compounds are inferred in the molecular latent space learned?

2: What type of model was used to generate chemistry?

3: How many parameters does the world's largest clinical model have?

4: How many drug candidates can be simulated per year using the A100 MIG architecture?

5: How many new models are part of Clara Discovery?

6: Where does medical data flow in from?

7: How much data do hospitals generate each year?

8: What is NVIDIA Clara?

9: How many models does Clara have?

10: How much of the population can AI-driven diagnostic devices reach?

We will first use a short example video to answer two simple sample questions, and then use that workflow as a template for answering the questions in the problem statement.



### 1.2 Why Riva?

Riva lets us transcribe audio into text with automatic speech recognition (ASR), and then extract insights from those transcripts with NLP services that Riva also provides, such as Named Entity Recognition (NER) and Question Answering (QA).


### 1.3 References

Dataset sources:
- NVIDIA YouTube channel


## 2. Sample Exercise
Let us use the following video as a sample input:
Sample video (The Amazon Rainforest)

For convenience, we have converted the YouTube video into a WAV file that can be used in our pipeline.
The code used for this conversion is included in the cells below for reference.

[Sample audio](../data/amazon-rainforest.wav)

For this sample, we will build a list of the entities found in the transcript and answer the following two questions:

1: What is the Amazon Rainforest?

2: How many reptile species are found in the Amazon Rainforest?


In [None]:
## first, let us preview the video
## use IPython's built-in IFrame to play back YouTube
from IPython.display import IFrame

# Sample Video
# keyword arguments are appended to the URL as query parameters;
# IFrame already adds frameborder="0" and allowfullscreen to the <iframe> tag
IFrame("https://www.youtube.com/embed/M_9xIVfXA1w", width=560, height=315, rel=0, controls=0, showinfo=0)

In [None]:
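## code to convert from a YouTube URL to an audio file
## Warning! Do not use this for copyright violations
## this code is for reference only; the audio file is already provided with this notebook

## first, we need two tools:
## the youtube-dl Python package and the ffmpeg binary
## (ffmpeg is a system package, e.g. `apt-get install ffmpeg`; `pip install ffmpeg` does not provide the binary)

# !pip install youtube_dl

## let's make this available as a function: it takes in a URL and generates a WAV file

# from youtube_dl import YoutubeDL

# def getAudio(url):
#     audio_downloader = YoutubeDL({
#         'format': 'bestaudio',
#         'outtmpl': '%(title)s.%(ext)s',
#         # ffmpeg converts the downloaded stream to WAV, downmixed to mono at 16 kHz
#         'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'wav'}],
#         'postprocessor_args': ['-ac', '1', '-ar', '16000'],
#     })
#     try:
#         print('Downloading audio')
#         audio_downloader.extract_info(url)
#     except Exception:
#         print("Couldn't download the audio")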
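However the audio is downloaded, the ASR request later in this notebook declares `LINEAR_PCM` encoding and `audio_channel_count = 1`, so converting to mono 16-bit PCM WAV first is a reasonable precaution. The following is a minimal sketch that assumes the `ffmpeg` binary is on your `PATH`. `ffmpeg_to_wav_cmd`, `convert_to_wav`, and the 16 kHz target rate are our own assumptions, not values from the original workflow.

```python
# Hypothetical helpers: convert any downloaded audio (webm, m4a, ...) to
# mono, 16-bit PCM WAV with the ffmpeg command-line tool.
import shlex
import subprocess


def ffmpeg_to_wav_cmd(src, dst, sample_rate=16000):
    """Return the ffmpeg argument list for a mono 16-bit PCM WAV conversion."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", "1",               # downmix to one channel (mono)
        "-ar", str(sample_rate),  # resample to the target rate (assumed 16 kHz)
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM samples
        dst,
    ]


def convert_to_wav(src, dst, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, sample_rate), check=True)


# Print the command instead of running it:
print(shlex.join(ffmpeg_to_wav_cmd("talk.webm", "talk.wav")))
```

Building the argument list separately from running it keeps the command easy to inspect, and passing a list to `subprocess.run` avoids shell-quoting problems with video titles that contain spaces.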
In [None]:
## Let's import required basic libraries
import io
import librosa
from time import time
import numpy as np
import IPython.display as ipd
import grpc
import requests


In [None]:
## let's import required RIVA libraries
# ASR 
import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import riva_api.riva_audio_pb2 as ra

# NLP 

import riva_api.riva_nlp_pb2 as rnlp
import riva_api.riva_nlp_pb2_grpc as rnlp_srv



In [None]:
## Let's configure our services
## note: replace the placeholder below with the host and port of the Riva server you are connected to
## (for example, your DGX host name and the Riva gRPC port, which defaults to 50051)


channel = grpc.insecure_channel('YOUR DGX HOST HERE:RIVA PORT NUM')

##setup service layer objects

riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)
riva_nlp = rnlp_srv.RivaLanguageUnderstandingStub(channel)



In [None]:
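## let's preview the audio
## note: the audio file is provided with this notebook; adjust `path` if your copy is stored elsewhere

# read in an audio file from local disk
path = "../datasets/amazon-rainforest.wav"
audio, sr = librosa.load(path, sr=None)   # sr=None keeps the file's native sample rate
with io.open(path, 'rb') as fh:
    content = fh.read()                    # raw bytes, sent to Riva in the next cell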
ipd.Audio(path)
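The ASR request that follows declares `LINEAR_PCM` encoding and a single audio channel, so it can help to confirm the WAV header before sending the bytes. This is a minimal sketch using Python's standard `wave` module; `wav_properties` and `is_mono_pcm16` are hypothetical helper names, and treating 16-bit samples as the target is our assumption. Note that `wave` only reads integer PCM files and raises `wave.Error` for other formats.

```python
import wave


def wav_properties(path):
    """Return (channels, sample_width_bytes, frame_rate) read from a WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth(), wf.getframerate()


def is_mono_pcm16(path):
    """True if the file is mono with 2-byte (16-bit) samples."""
    channels, width, _ = wav_properties(path)
    return channels == 1 and width == 2


# Usage: inspect the sample file before sending it to Riva
# print(wav_properties("../datasets/amazon-rainforest.wav"))
```

If the file turns out to be stereo, either convert it to mono or set `audio_channel_count = 2` in the request.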

In [None]:
## now let's use Riva ASR to transcribe the audio
# Set up an offline/batch recognition request

req = rasr.RecognizeRequest()
req.audio = content                                   # raw bytes from previous cell
req.config.encoding = ra.AudioEncoding.LINEAR_PCM     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
req.config.sample_rate_hertz = sr                     # Audio will be resampled if necessary
req.config.language_code = "en-US"                    # Language model code
req.config.max_alternatives = 1                       # How many top-N hypotheses to return, 1 means return the best one
req.config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
req.config.audio_channel_count = 1                    # Mono channel, change to 2 for stereo

response = riva_asr.Recognize(req)
# longer audio comes back as several results (one per utterance),
# so join the best hypothesis of every result into a single transcript
asr_best_transcript = " ".join(r.alternatives[0].transcript.strip() for r in response.results)

print("ASR Transcript:", asr_best_transcript) ## we will use this in the next step

## lets inspect the response to ensure no anomalies

print("\n\nFull Response Message:")
print(response) 




In [None]:

## now, we have the raw transcript. Let us identify the named entities in this text
## first setup the nlp request object
req = rnlp.TokenClassRequest()

## now define the model
req.model.model_name = "riva_ner"     # If you have deployed a custom model with the domain_name 
                                        # parameter in ServiceMaker's `riva-build` command then you should use 
                                        # "riva_ner_<your_input_domain_name>" where <your_input_domain_name>
                                        # is the name you provided to the domain_name parameter.

## use the text from the previous part here            
req.text.append(asr_best_transcript)

resp = riva_nlp.ClassifyTokens(req)

##show us what entities are present
print("Named Entities:")
for result in resp.results[0].results:
    print(f"  {result.token} ({result.label[0].class_name})")
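The notebook's goal includes a semantic index, and the NER output above is natural raw material for one. The following sketch shows one possible structure (our assumption, not a Riva feature): it turns per-file `(token, label)` pairs into an inverted index from a normalized entity string to its labels, the files it appears in, and how often it was seen. With the cell above, the pairs for one transcript could be collected as `[(r.token, r.label[0].class_name) for r in resp.results[0].results]`. `build_entity_index` is a hypothetical name, and the example pairs below are hand-written, not real model output.

```python
from collections import defaultdict


def build_entity_index(tagged_by_source):
    """Build an inverted index: normalized entity -> labels, sources, count.

    tagged_by_source maps a source name (e.g. a WAV file) to a list of
    (token, label) pairs, such as those printed by the NER cell.
    """
    index = defaultdict(lambda: {"labels": set(), "sources": set(), "count": 0})
    for source, pairs in tagged_by_source.items():
        for token, label in pairs:
            key = " ".join(token.lower().split())  # case- and whitespace-insensitive key
            entry = index[key]
            entry["labels"].add(label)
            entry["sources"].add(source)
            entry["count"] += 1
    return dict(index)


# Hand-written example pairs (illustrative only)
example = {
    "gtc1.wav": [("Clara", "ORG"), ("NVIDIA", "ORG")],
    "gtc2.wav": [("clara", "ORG")],
}
idx = build_entity_index(example)
print(sorted(idx["clara"]["sources"]))  # ['gtc1.wav', 'gtc2.wav']
```

Lowercasing the key merges variants that ASR punctuation and casing produce, while the `sources` set records where to look when answering a question about that entity.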

Now that we have seen which entities are present, we can use the text to answer our questions.
Our sample is a short file whose transcript contains all the entities needed for our queries, so we can jump straight to the question answering phase.

In [None]:


##now define the request object
req = rnlp.NaturalQueryRequest()

## now lets make a list of our queries

queries = ['What is the Amazon Rainforest?', 'How many reptile species are found in the Amazon Rainforest?']

## setup answers/knowledge base object
answers = []

##now lets iterate through the loop and provide answers

for input_query in queries:
    
    req.query = input_query
    ## we have a small transcript so we will only use this one context.
    ## for larger amounts of data we need to split the text to narrow down the context
    req.context = asr_best_transcript
    resp = riva_nlp.NaturalQuery(req)

    print(f"Query: {input_query}")
    print(f"Answer: {resp.results[0].answer}")
    
    ##add the answers to the knowledge base
    answers.append(resp.results[0].answer)

## show us the combined answer list
print(answers)
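The comment in the cell above notes that larger transcripts must be split to narrow down the context. One simple approach, sketched here as an assumption rather than a Riva requirement, is a sliding window of words with overlap, so that an answer falling on a chunk boundary still appears whole in some chunk. `chunk_words` and its default window sizes are hypothetical; suitable sizes depend on the QA model's maximum input length.

```python
def chunk_words(text, window=100, stride=50):
    """Split text into overlapping word windows for use as QA contexts.

    Consecutive chunks share (window - stride) words, so any span no longer
    than that overlap appears whole in at least one chunk.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Usage with the QA request above (requires a running Riva server; `full_transcript`
# is a hypothetical variable, and reading `score` assumes your Riva version returns one):
# candidates = []
# for chunk in chunk_words(full_transcript, window=150, stride=100):
#     req.context = chunk
#     resp = riva_nlp.NaturalQuery(req)
#     candidates.append((resp.results[0].answer, resp.results[0].score))
```

Larger overlap makes boundary misses rarer at the cost of more QA calls per question.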



## 3. Challenge Exercise
We will use the following video as input:
Challenge Video (GTC-2021-healthcare)
Video URL: https://www.youtube.com/watch?v=AfC7-Iksl_M

For convenience, we have converted the YouTube video into WAV files that can be used in our pipeline.
The code for this conversion is in the sample exercise above; the resulting files are listed below.

For the exercise, find a list of named entities in each video (or in the entire talk) and answer the following questions:

0: How is drug discovery traditionally done?

1: How many compounds are inferred in the molecular latent space learned?

2: What type of model was used to generate chemistry?

3: How many parameters does the world's largest clinical model have?

4: How many drug candidates can be simulated per year using the A100 MIG architecture?

5: How many new models are part of Clara Discovery?

6: Where does medical data flow in from?

7: How much data do hospitals generate each year?

8: What is NVIDIA Clara?

9: How many models does Clara have?

10: How much of the population can AI-driven diagnostic devices reach?

In [None]:
## first, let us preview the video
## use IPython's built-in IFrame to play back YouTube
from IPython.display import IFrame

# GTC Healthcare Special Talk
# keyword arguments are appended to the URL as query parameters;
# IFrame already adds frameborder="0" and allowfullscreen to the <iframe> tag
IFrame("https://www.youtube.com/embed/AfC7-Iksl_M", width=560, height=315, rel=0, controls=0, showinfo=0)

The talk has been split into sections and the following WAV files are available in the data folder  
gtc1.wav  
gtc2.wav   
gtc3.wav    
gtc4.wav  
gtc5.wav  
gtc6.wav  
gtc7.wav  
gtc8.wav  
gtc9.wav  
gtc10.wav  
gtc11.wav  
gtc12.wav  
gtc13.wav  
gtc14.wav  
gtc15.wav  


In [15]:
##first, transcribe the files 
## using a function and a loop will be useful here
## TODO HERE
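One way to structure the transcription loop is to keep the per-file Riva call in its own function and let a generic loop collect the results. In this sketch, `transcribe_all`, `riva_transcribe`, and `gtc_paths` are hypothetical names, and the `../datasets/` folder is assumed to match the sample cell; adjust it to wherever the `gtc*.wav` files live. The commented Riva wrapper reuses only the request fields from the sample ASR cell and needs `import wave` plus a running Riva server.

```python
def transcribe_all(paths, transcribe):
    """Apply a per-file transcription function to each path, in order.

    `transcribe` is any callable path -> str. Files that fail are recorded
    as None so that one bad file does not stop the whole batch.
    """
    transcripts = {}
    for path in paths:
        try:
            transcripts[path] = transcribe(path)
        except Exception as exc:  # e.g. grpc.RpcError or a missing file
            print(f"Failed to transcribe {path}: {exc}")
            transcripts[path] = None
    return transcripts


# The 15 GTC segments (folder assumed to match the sample cell)
gtc_paths = [f"../datasets/gtc{i}.wav" for i in range(1, 16)]

# Hypothetical Riva wrapper, reusing the objects from the sample cells:
# def riva_transcribe(path):
#     with wave.open(path, "rb") as wf:
#         rate = wf.getframerate()
#     req = rasr.RecognizeRequest()
#     with open(path, "rb") as fh:
#         req.audio = fh.read()
#     req.config.encoding = ra.AudioEncoding.LINEAR_PCM
#     req.config.sample_rate_hertz = rate
#     req.config.language_code = "en-US"
#     req.config.max_alternatives = 1
#     req.config.enable_automatic_punctuation = True
#     req.config.audio_channel_count = 1
#     response = riva_asr.Recognize(req)
#     return " ".join(r.alternatives[0].transcript.strip() for r in response.results)
#
# transcripts = transcribe_all(gtc_paths, riva_transcribe)
```

Keying the result by file path preserves which segment each transcript came from, which the entity index and knowledge base can reuse later.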

In [16]:
## next, identify entities
## again, defining a function and a loop is useful
##TODO here

In [20]:
##next, match entities with questions
## some entities may not appear, may be misspelled and mispunctuated
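Because ASR output may misspell or mis-punctuate entity names, exact string matching between questions and entities can miss hits. The following is a sketch of fuzzy matching with the standard library's `difflib.get_close_matches`: single words and adjacent word pairs from each question are compared against the normalized entity list. `match_entities` and the 0.8 cutoff are assumptions to tune; they are not part of Riva.

```python
import difflib
import re


def match_entities(question, entities, cutoff=0.8):
    """Return entities that (fuzzily) appear in `question`, in question order.

    Both sides are lowercased and stripped of punctuation; each question word
    and each adjacent word pair is compared with the entities via difflib.
    """
    def norm(s):
        return " ".join(re.findall(r"[a-z0-9]+", s.lower()))

    words = norm(question).split()
    # candidate spans: single words plus adjacent word pairs
    spans = words + [" ".join(pair) for pair in zip(words, words[1:])]
    lookup = {norm(e): e for e in entities if norm(e)}
    matches = []
    for span in spans:
        for key in difflib.get_close_matches(span, list(lookup), n=3, cutoff=cutoff):
            if lookup[key] not in matches:
                matches.append(lookup[key])
    return matches


q = "how many drug candidates can be simulated per year using the a100 MIG architecture?"
print(match_entities(q, ["A100", "Clara", "GTC"]))  # ['A100']
```

A lower cutoff tolerates worse transcription errors but also produces more false matches between short, unrelated words.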

In [21]:
## next, run the QA pipeline for matches found for each question
## a loop or a nested loop is useful here

In [19]:
##finally, compile answers into knowledge base
## a dictionary structure, combined with lists is ideal
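A dictionary of lists fits the suggested knowledge-base structure: each question maps to every candidate answer, together with the file it came from. This is a hypothetical sketch (`add_to_knowledge_base`, `best_answers`); the optional `score` assumes that your Riva version returns a confidence score on each `NaturalQuery` result, and answers without a score fall back to the first candidate.

```python
def add_to_knowledge_base(kb, question, answer, source, score=None):
    """Record one QA result under its question; empty answers are skipped.

    kb maps question -> list of {"answer", "source", "score"} dicts, so each
    candidate answer stays traceable to the WAV segment it came from.
    """
    if not answer or not answer.strip():
        return kb
    kb.setdefault(question, []).append(
        {"answer": answer.strip(), "source": source, "score": score}
    )
    return kb


def best_answers(kb):
    """Return question -> highest-scoring answer (unscored answers rank lowest)."""
    return {
        question: max(
            candidates,
            key=lambda c: c["score"] if c["score"] is not None else float("-inf"),
        )["answer"]
        for question, candidates in kb.items()
    }
```

Keeping every candidate (instead of overwriting) makes it possible to inspect disagreements between segments before trusting the single best answer.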