## Overview

This notebook is a proof-of-concept implementation showing that automatic speech recognition (ASR) can be used to generate a transcript from a clip, and that the transcript can then be used to find the clip's source content. The proposed method consists of the following steps:
1. Audio to text conversion by ASR
2. Deduplication using the transcript
  * a) Full-text search using Elasticsearch (ES)
  * b) Semantic search using embeddings

If ASR works well, it is reasonable to assume that video clips from the same source produce similar text, so a simple text search should yield a high similarity score.
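
The intuition above can be checked without a search engine. The following is a minimal sketch, not part of the original notebook: it compares two texts by the Jaccard overlap of their word trigrams after light normalization. The helper names `shingles` and `jaccard` and the sample sentences are made up for illustration.

```python
import re

def shingles(text, n=3):
    """Return the set of lowercase word n-grams in `text`, ignoring punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of the word n-gram sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Made-up example: two ASR outputs for the same clip that differ in one word
# and in punctuation, plus one output for an unrelated clip.
reference = "the laws of mathematics are discovered, they are latent in the universe"
same_clip = "The laws of mathematics are discovered they are latent within the universe"
other_clip = "you could be the seventh best player in the whole world"

print(jaccard(reference, same_clip))   # most trigrams are shared
print(jaccard(reference, other_clip))  # no trigram is shared
```

A single substituted word removes only the trigrams that contain it, which is why short n-grams tolerate small ASR differences better than matching whole sentences.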


## Dataset

We use the [Lexicap: Lex Fridman Podcast Whisper captions](https://karpathy.ai/lexicap/) dataset created by Karpathy. This serves as the asset for our demo.

In [2]:
! wget https://karpathy.ai/lexicap/data.zip

--2024-01-18 19:30:02--  https://karpathy.ai/lexicap/data.zip
Resolving karpathy.ai (karpathy.ai)... 151.101.65.195, 151.101.1.195
Connecting to karpathy.ai (karpathy.ai)|151.101.65.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38234309 (36M) [application/zip]
Saving to: 'data.zip.1'


2024-01-18 19:30:02 (161 MB/s) - 'data.zip.1' saved [38234309/38234309]



In [3]:
!mkdir -p input && unzip \*.zip -d input && rm *.zip

Archive:  data.zip
   creating: input/vtt/
  inflating: input/vtt/episode_134_small.vtt  
  inflating: input/vtt/episode_139_large.vtt  
  inflating: input/vtt/episode_121_small.vtt  
  inflating: input/vtt/episode_064_small.vtt  
  inflating: input/vtt/episode_090_large.vtt  
  inflating: input/vtt/episode_132_small.vtt  
  inflating: input/vtt/episode_020_small.vtt  
  inflating: input/vtt/episode_105_small.vtt  
  inflating: input/vtt/episode_053_small.vtt  
  inflating: input/vtt/episode_161_large.vtt  
  inflating: input/vtt/episode_204_small.vtt  
  inflating: input/vtt/episode_012_large.vtt  
  inflating: input/vtt/episode_122_small.vtt  
  inflating: input/vtt/episode_233_small.vtt  
  inflating: input/vtt/episode_319_small.vtt  
  inflating: input/vtt/episode_257_small.vtt  
  inflating: input/vtt/episode_107_small.vtt  
  inflating: input/vtt/episode_135_small.vtt  
  inflating: input/vtt/episode_014_small.vtt  
  inflating: input/vtt/episode_089_small.vtt  
  inflating: inpu

In [18]:
WORKING_PATH = '/kaggle/working'
INPUT_PATH = f'{WORKING_PATH}/input'
VTT_PATH = f'{INPUT_PATH}/vtt'

In [5]:
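def collate_transcript(file_path):
    """Concatenate the caption text of a WebVTT file into a single string."""
    lines = []
    with open(file_path) as fid:
        for line in fid:
            # Skip the WEBVTT header, cue timing lines, and blank separators.
            if 'WEBVTT' in line or '-->' in line or line == '\n':
                continue
            # rstrip() keeps any leading whitespace, so joining with '' relies on
            # that whitespace to separate the words of consecutive caption lines.
            lines.append(line.rstrip())
    return ''.join(lines)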

# transcript = collate_transcript('/content/input/vtt/episode_001_large.vtt')
# transcript

In [6]:
from os import walk
from os.path import join

filenames = []
for (dirpath, dirnames, fnames) in walk(VTT_PATH):
    filenames.extend(fnames)
    break

filenames = [fname for fname in filenames if 'large' in fname]
filenames.sort()
file_paths = [join(VTT_PATH, fname) for fname in filenames]
file_paths[:10]

['/kaggle/working/input/vtt/episode_001_large.vtt',
 '/kaggle/working/input/vtt/episode_002_large.vtt',
 '/kaggle/working/input/vtt/episode_003_large.vtt',
 '/kaggle/working/input/vtt/episode_004_large.vtt',
 '/kaggle/working/input/vtt/episode_005_large.vtt',
 '/kaggle/working/input/vtt/episode_006_large.vtt',
 '/kaggle/working/input/vtt/episode_007_large.vtt',
 '/kaggle/working/input/vtt/episode_008_large.vtt',
 '/kaggle/working/input/vtt/episode_009_large.vtt',
 '/kaggle/working/input/vtt/episode_010_large.vtt']

In [211]:
fname2idx = {}
for i, fname in enumerate(filenames):
    fname2idx[fname] = i

In [7]:
transcripts = [collate_transcript(fname) for fname in file_paths]

## Download and transcribe videos

In this section, we download the audio of YouTube videos by video ID and transcribe every audio file with OpenAI's Whisper.
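
The download step runs one `yt-dlp` process per video, so it is a natural candidate for parallelism. The sketch below is one possible way to do that with a thread pool; it is not the notebook's implementation. The helpers `build_command` and `download_all`, the `runner` parameter, and the `max_workers` default are hypothetical; the `yt-dlp` flags are the ones used in the download cell of this notebook.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_command(video_id, audio_dir):
    """Build the yt-dlp command that extracts one video's audio as MP3."""
    output_path = f"{audio_dir}/{video_id}.mp3"
    # "--" ends option parsing, so ids that start with "-" are not read as flags.
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", output_path, "--", video_id]

def download_all(video_ids, audio_dir, max_workers=4, runner=subprocess.run):
    """Run one command per video id on a thread pool; return results in input order."""
    commands = [build_command(vid, audio_dir) for vid in video_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))

# Dry run: a stand-in runner returns the command instead of executing it.
for cmd in download_all(["opMZib2qqeM", "g5V-lC7pai8"], "input/audio", runner=lambda c: c):
    print(" ".join(cmd))
```

Threads are sufficient here because each worker only waits on an external process. Leaving `runner` at its default would call `subprocess.run` on each command and perform the actual downloads.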

In [9]:
# Install yt-dlp to download YouTube videos
!python -m pip install -U yt-dlp

Collecting urllib3<3,>=1.26.17 (from yt-dlp)
  Obtaining dependency information for urllib3<3,>=1.26.17 from https://files.pythonhosted.org/packages/96/94/c31f58c7a7f470d5665935262ebd7455c7e4c7782eb525658d3dbf4b9403/urllib3-2.1.0-py3-none-any.whl.metadata
  Using cached urllib3-2.1.0-py3-none-any.whl.metadata (6.4 kB)
Using cached urllib3-2.1.0-py3-none-any.whl (104 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.16
    Uninstalling urllib3-1.26.16:
      Successfully uninstalled urllib3-1.26.16
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.33.13 requires urllib3<2.1,>=1.25.4; python_version >= "3.10", but you have urllib3 2.1.0 which is incompatible.
google-auth 2.22.0 requires urllib3<2.0, but you have urllib3 2.1.0 which is incompatible.
kfp 2.0.1 requires google-cloud

In [238]:
# Larger test data
video_ids = [
    'opMZib2qqeM',
    'ccRbCkdjgpQ',
    'Y3VI643ZtZY',
    '1hB0pIrDtwY',
    'KrFr_-f9PgA',
    'g5V-lC7pai8',
    '6A-RM62y8_U',
    'wABigIrbOLk',
    'SQJo_iL_AHY',
    'c7V0A4aG-4U',
    'u_dxgcYDkec',
    'wxyjT4ik9jo',
    '6y4QO0crrCo',
    '6u27INGhmAI',
    'OlOwd8ss1AI'
]
titles = ['episode_325_large.vtt']*5 + ['episode_324_large.vtt']*5 + ['episode_323_large.vtt']*5

In [240]:
# Smaller test data
video_ids = [
    'opMZib2qqeM',
    'g5V-lC7pai8',
    'u_dxgcYDkec',
]
titles = ['episode_325_large.vtt', 'episode_324_large.vtt', 'episode_323_large.vtt']

In [244]:
vid2gt_title = {}
for vid, title in zip(video_ids, titles):
    vid2gt_title[vid] = title
    
print(vid2gt_title)

{'opMZib2qqeM': 'episode_325_large.vtt', 'g5V-lC7pai8': 'episode_324_large.vtt', 'u_dxgcYDkec': 'episode_323_large.vtt'}


In [11]:
AUDIO_PATH = f'{INPUT_PATH}/audio'
!mkdir -p $AUDIO_PATH

In [12]:
# TODO: add parallelism; this loop downloads one video at a time
for vid in video_ids:
  mp3_file = f'{AUDIO_PATH}/{vid}.mp3'
  !yt-dlp -x --audio-format mp3 -o $mp3_file -- $vid

[youtube] Extracting URL: opMZib2qqeM
[youtube] opMZib2qqeM: Downloading webpage
[youtube] opMZib2qqeM: Downloading ios player API JSON
[youtube] opMZib2qqeM: Downloading android player API JSON
[youtube] opMZib2qqeM: Downloading m3u8 information
[info] opMZib2qqeM: Downloading 1 format(s): 251
[download] Destination: /kaggle/working/input/audio/opMZib2qqeM.webm
[download] 100% of    2.79MiB in 00:00:00 at 13.15MiB/s
[ExtractAudio] Destination: /kaggle/working/input/audio/opMZib2qqeM.mp3
Deleting original file /kaggle/working/input/audio/opMZib2qqeM.webm (pass -k to keep)
[youtube] Extracting URL: g5V-lC7pai8
[youtube] g5V-lC7pai8: Downloading webpage
[youtube] g5V-lC7pai8: Downloading ios player API JSON
[youtube] g5V-lC7pai8: Downloading android player API JSON
[youtube] g5V-lC7pai8: Downloading m3u8 information
[info] g5V-lC7pai8: Downloading 1 format(s): 251
[download] Destination: /kaggle/working/input/audio/g5V-lC7pai8.webm
[download] 100%

In [13]:
!python -m pip install -U openai-whisper

Collecting openai-whisper
  Downloading openai-whisper-20231117.tar.gz (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 798.6/798.6 kB 10.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton<3,>=2.0.0 (from openai-whisper)
  Obtaining dependency information for triton<3,>=2.0.0 from https://files.pythonhosted.org/packages/95/05/ed974ce87fe8c8843855daa2136b3409ee1c126707ab54a8b72815c08b49/triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting tiktoken (from openai-whisper)
  Obtaining dependency information for tiktoken from https://files.pythonhosted.org/packages/bf/56/a8910841d1f501cf8affeb06a0335a518888505c60ec9f2a2a6393190e48

In [14]:
import whisper

MODEL_NAME = 'base' # tiny, base, small, medium, large
model = whisper.load_model(MODEL_NAME)

100%|███████████████████████████████████████| 139M/139M [00:03<00:00, 39.6MiB/s]


In [15]:
# Test whisper model
result = model.transcribe(f"{AUDIO_PATH}/{video_ids[0]}.mp3")
print(result["text"])



 So, as a fear to say, just like this idea that the laws of mathematics are discovered, they're latent within the fabric of the universe in that same way the laws of biology are kind of discovered. Yeah, I think that's absolutely, and it's probably not a popular view, but I think that's right on the money. Yeah. Well, I think that's a really deep idea. And embryogenesis is the process of revealing, of embodying, of manifesting these laws. You're not building the laws. Yeah. You're just creating the capacity to reveal. Yes. I think, again, not the standard view of molecular biology by any means, but I think that's right on the money. I'll give you a simple example. You know, some of our latest work with these xenobots, right? So what we've done is to take some skin cells off of an early frog embryo. And basically ask about their plasticity. If we give you a chance to sort of reboot your multicellularity in a different context, what would you do? Because what you might assume by the thin

In [17]:
vid2text = {}
for vid in video_ids:
  transcript = model.transcribe(f"{AUDIO_PATH}/{vid}.mp3")
  vid2text[vid] = transcript['text']



## Text Search

### Full-text search with Whoosh

In this section, we use the transcribed text to find the corresponding podcast. If we're successful, we should be able to find the correct podcast given snippets of the transcript.
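
The lookup used in this section can be pictured as a vote: every run of three consecutive words in the snippet is treated as a phrase, and each document containing that phrase gets one vote. The following pure-Python sketch illustrates the idea with a small in-memory index instead of Whoosh. The helpers `windows`, `build_index`, and `vote` and the two sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def windows(text, n=3):
    """Yield every run of n consecutive whitespace-separated words in `text`."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def build_index(docs, n=3):
    """Map each n-word window to the set of document titles that contain it."""
    index = defaultdict(set)
    for title, text in docs.items():
        for w in windows(text, n):
            index[w].add(title)
    return index

def vote(snippet, index, n=3):
    """Count, per title, how many windows of `snippet` occur in that document."""
    counts = Counter()
    for w in windows(snippet, n):
        for title in index.get(w, ()):
            counts[title] += 1
    return counts

# Made-up two-document corpus.
docs = {
    "doc_a": "the quick brown fox jumps over the lazy dog near the river",
    "doc_b": "a slow green turtle walks under the old wooden bridge today",
}
index = build_index(docs)
counts = vote("fox jumps over the lazy dog", index)
print(counts.most_common(1))  # [('doc_a', 4)]
```

The Whoosh cells below follow the same pattern, with quoted phrase queries taking the place of the dictionary lookup.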

In [19]:
!python -m pip install Whoosh

Collecting Whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.8/468.8 kB 8.9 MB/s eta 0:00:00
Installing collected packages: Whoosh
Successfully installed Whoosh-2.7.4


In [20]:
!mkdir -p $WORKING_PATH/indexdir

In [203]:
!rm -r $WORKING_PATH/indexdir/*

In [204]:
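from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

# Store the title so that hits can return it; keep positions in the content
# field (phrase=True) so that quoted phrase queries work.
schema = Schema(title=TEXT(stored=True), content=TEXT(phrase=True))
ix = create_in(f"{WORKING_PATH}/indexdir", schema)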

In [205]:
writer = ix.writer()
for fname, text in zip(filenames[-1:-4:-1], transcripts[-1:-4:-1]):
#   print(fname, text[:100])
  writer.add_document(title=fname, content=text)
writer.commit()

episode_325_large.vtt  turns out that if you train a planarian and then cut their heads off, the tail will regenerate a br
episode_324_large.vtt  you could be the seventh best player in the whole world, like literally seventh best player. But if
episode_323_large.vtt  Once this whole thing falls apart and we are climbing the kudzu vines that spiral up the Sears Towe


In [197]:
# Deprecated
from whoosh.query import FuzzyTerm

class MyFuzzyTerm(FuzzyTerm):
     def __init__(self, fieldname, text, boost=1.0, maxdist=3, prefixlength=1, constantscore=True):
         super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

In [224]:
# Test
from whoosh.qparser import QueryParser,FuzzyTermPlugin

# text = "So, as a fear to say, just like this idea that the laws of mathematics are discovered, they're latent within the fabric of the universe in that same way the laws of biology are kind of discovered."
# text = "There's a cold absurdity to the fact that you can play extremely well and still lose. I mean, actua~50"
text = "What about impressions? Is there similarity between that and acting? Is there some fundamental way in which you become the person? If you have a couple of the things, you can just fill in the blanks."

title2count = {}
parser = QueryParser("content", ix.schema)
with ix.searcher() as searcher:
    word_list = text.split(' ')
#     print(word_list)
    for i in range(len(word_list)-2):
#         print(f'word: {word}')
        query_text = ' '.join(word_list[i: i+3])
#         print(query_text)
        query = parser.parse('"%s"'%query_text)
#         print(query)
        results = searcher.search(query)
        if len(results) == 0:
            continue
#         print(f'{len(results)} results')
        for res in results:
#             print(res['title'], fname2idx[res['title']])
            title = res['title']
            if title in title2count:
                title2count[title] += 1
            else:
                title2count[title] = 1
print(title2count)
max(title2count, key=title2count.get)

{'episode_323_large.vtt': 31, 'episode_325_large.vtt': 14, 'episode_324_large.vtt': 15}


'episode_323_large.vtt'

In [245]:
from collections import Counter
from whoosh.qparser import QueryParser

MAX_LENGTH = 200  # number of leading characters of each transcript used as the query
NUM_WORDS = 3     # number of words in each sliding phrase window

parser = QueryParser("content", ix.schema)
with ix.searcher() as searcher:
    for vid, text in vid2text.items():
        title2count = Counter()
        word_list = text[:MAX_LENGTH].split()
        for i in range(len(word_list) - NUM_WORDS + 1):
            # Quote the window so that it is parsed as an exact phrase query.
            query_text = ' '.join(word_list[i:i + NUM_WORDS])
            query = parser.parse('"%s"' % query_text)
            for res in searcher.search(query):
                title2count[res['title']] += 1
        # Predict the title matched by the most windows (None if nothing matched).
        title = max(title2count, key=title2count.get) if title2count else None
        print(f'vid: {vid}, groundtruth: {vid2gt_title[vid]}, pred: {title}')

vid: opMZib2qqeM, groundtruth: episode_325_large.vtt, pred: episode_325_large.vtt
vid: g5V-lC7pai8, groundtruth: episode_324_large.vtt, pred: episode_324_large.vtt
vid: u_dxgcYDkec, groundtruth: episode_323_large.vtt, pred: episode_323_large.vtt


As shown above, with the help of the full-text search engine `Whoosh`, we are able to find the correct podcast for each of the three test videos from the transcript generated from its audio file.
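
To summarize such a run as a single number, the predictions can be scored against the ground truth. The snippet below is a small sketch; the helper `top1_accuracy` is a hypothetical name, and the dictionaries repeat the values printed above. Note that the index in this run holds only three episodes, so the result says little about behavior on a larger index.

```python
def top1_accuracy(predictions, ground_truth):
    """Fraction of video ids whose predicted title equals the ground-truth title."""
    if not ground_truth:
        return 0.0
    correct = sum(predictions.get(vid) == title for vid, title in ground_truth.items())
    return correct / len(ground_truth)

# Ground truth and predictions as printed by the cell above.
ground_truth = {
    'opMZib2qqeM': 'episode_325_large.vtt',
    'g5V-lC7pai8': 'episode_324_large.vtt',
    'u_dxgcYDkec': 'episode_323_large.vtt',
}
predictions = dict(ground_truth)  # all three predictions matched in this run
print(top1_accuracy(predictions, ground_truth))  # 1.0
```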

### Full-text Search with Elasticsearch

**[Section not finished]**

https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/elasticsearch.ipynb#scrollTo=YUj0878jPyz7
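
Since this section is unfinished, the following is only a sketch of how the sliding-window phrase search from the Whoosh section might carry over to Elasticsearch. It builds one `match_phrase` query body per window and does not contact a cluster. The helper `phrase_queries` and the field name `content` are assumptions, not part of the notebook.

```python
def phrase_queries(text, num_words=3, field="content"):
    """Build one Elasticsearch match_phrase query body per sliding word window."""
    words = text.split()
    return [
        {"match_phrase": {field: " ".join(words[i:i + num_words])}}
        for i in range(len(words) - num_words + 1)
    ]

queries = phrase_queries("the laws of biology are discovered")
print(len(queries))  # 4
print(queries[0])    # {'match_phrase': {'content': 'the laws of'}}
```

Each dictionary could then be passed as the `query` argument of `es.search`, and the returned hits tallied per title as in the Whoosh cells.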

In [66]:
!python -m pip install elasticsearch

Collecting elasticsearch
  Downloading elasticsearch-8.11.1-py3-none-any.whl (412 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 412.8/412.8 kB 2.1 MB/s eta 0:00:00
Collecting elastic-transport<9,>=8 (from elasticsearch)
  Downloading elastic_transport-8.11.0-py3-none-any.whl (59 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.8/59.8 kB 7.3 MB/s eta 0:00:00
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.11.0 elasticsearch-8.11.1


In [67]:
from elasticsearch import Elasticsearch

In [None]:
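# Initialize the Elasticsearch client. The 8.x client requires an explicit scheme;
# 'http' is an assumption here, so use 'https' if the cluster has TLS enabled.
es = Elasticsearch([{'host': 'localhost', 'port': 9200, 'scheme': 'http'}])
# Create an index, ignoring the 400 error returned when it already exists
es.options(ignore_status=400).indices.create(index='my_index')
# Index a document
document = {
    'title': 'Getting Started with Elasticsearch',
    'content': 'Elasticsearch is a powerful search engine.',
}
# The 8.x client no longer accepts doc_type; the document goes in `document`
es.index(index='my_index', id=1, document=document)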

In [None]:
# Search for documents
search_results = es.search(index='my_index', query={'match': {'content': 'powerful search engine'}})

In [None]:
# Print the results
for hit in search_results['hits']['hits']:
    print(f"Document ID: {hit['_id']}, Score: {hit['_score']}")

In [75]:
from datetime import datetime
from elasticsearch import Elasticsearch

In [None]:
es = Elasticsearch(hosts = [{"host":"localhost", "port":9200, "scheme": "https"}])

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
resp = es.index(index="test-index", id=1, document=doc)
print(resp['result'])

resp = es.get(index="test-index", id=1)
print(resp['_source'])

es.indices.refresh(index="test-index")

resp = es.search(index="test-index", query={"match_all": {}})
print("Got %d Hits:" % resp['hits']['total']['value'])
for hit in resp['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])