In this demo, we build a video-to-searchable-transcript pipeline using Pixeltable primitives and OpenAI Whisper.
Along the way, we show how Pixeltable makes it easy to build the pipeline, inspect the intermediate data,
add more data, and explore the results.

The pipeline has four steps:
1) Ingest video
2) Extract the corresponding audio
3) Transcribe the audio to text using OpenAI Whisper
4) Build a semantic index based on sentence_transformers text embeddings

Once this preprocessing pipeline is built, we show how to:
5) Search the extracted data at sentence granularity
6) Keep the pipeline's outputs up to date when new videos are added, so we can quickly explore how the pipeline behaves on new data and make new videos searchable almost immediately

In [1]:
%pip install git+https://github.com/ytdl-org/youtube-dl

Collecting git+https://github.com/ytdl-org/youtube-dl
  Cloning https://github.com/ytdl-org/youtube-dl to /private/var/folders/8v/d886z5j13dsctyjpw29t7y480000gn/T/pip-req-build-ljze0xi6
  Running command git clone --filter=blob:none --quiet https://github.com/ytdl-org/youtube-dl /private/var/folders/8v/d886z5j13dsctyjpw29t7y480000gn/T/pip-req-build-ljze0xi6


  Resolved https://github.com/ytdl-org/youtube-dl to commit a08f2b7e4567cdc50c0614ee0a4ffdff49b8b6e6


  Preparing metadata (setup.py) ... done

Note: you may need to restart the kernel to use updated packages.


In [2]:
%%bash
# check the right python is being used (same as kernel python)
# which python
# which youtube-dl
mkdir -p sample_videos
cd sample_videos
youtube-dl 'https://www.youtube.com/watch?v=YwWtDSponlc&ab_channel=CNBCTelevision'
youtube-dl 'https://www.youtube.com/watch?v=L9Tyb_ycRfU&ab_channel=CNBCTelevision'
youtube-dl 'https://www.youtube.com/watch?v=0wJqgHSfYi0&ab_channel=CNBCTelevision'

/Users/orm/mambaforge/envs/pixeltable_39/bin/python


/Users/orm/mambaforge/envs/pixeltable_39/bin/youtube-dl


/Users/orm/mambaforge/envs/pixeltable_39/bin/youtube-dl


[youtube] YwWtDSponlc: Downloading webpage


[download] Right now you want to be invested in companies that don't cater to the consumer, says Jim Cramer-YwWtDSponlc.mp4 has already been downloaded and merged


[youtube] L9Tyb_ycRfU: Downloading webpage


[download] Jim Cramer looks at how the Fed minutes spooked the markets today-L9Tyb_ycRfU.mp4 has already been downloaded and merged


[youtube] 0wJqgHSfYi0: Downloading webpage


[download] Snowflake CEO joins Jim Cramer after earnings report drives stock higher-0wJqgHSfYi0.mp4 has already been downloaded and merged


In [8]:
import pathlib
import pixeltable as pxt

In [13]:
pxt.create_dir('transcription_demo', ignore_errors=True)

In [14]:
pxt.drop_table('transcription_demo.sentence_view', ignore_errors=True)
pxt.drop_table('transcription_demo.video_table', ignore_errors=True)
video_table = pxt.create_table('transcription_demo.video_table', {'video': pxt.VideoType()})

Created table `video_table`.


In [15]:
# Collect absolute paths of the downloaded videos, then insert only the first one for now.
paths = [str(pathlib.Path(p).absolute()) for p in pathlib.Path('./sample_videos/').iterdir()]
video_table.insert([{'video': video_path} for video_path in paths[:1]])

Inserting rows into `video_table`: 1 rows [00:00, 673.89 rows/s]
Inserted 1 row with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])

In [16]:
from pixeltable.functions.video import get_metadata, extract_audio
from pixeltable.functions import openai

In [17]:
video_table.add_column(audio=extract_audio(video_table.video, format='mp3'))

Computing cells:   0%|                                                    | 0/1 [00:00<?, ? cells/s]

Computing cells: 100%|████████████████████████████████████████████| 1/1 [00:03<00:00,  3.74s/ cells]
Added 1 column value with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

In [18]:
video_table.show()

video,audio
,http://127.0.0.1:50473/Users/orm/.pixeltable/media/f8cb59561806456f8bfcac279d68c9f7/c2/c2ee/f8cb59561806456f8bfcac279d68c9f7_1_1_c2ee1b317f0641c4aaccc7c1c7f5f6fc.mp3


In [19]:
video_table.add_column(audio_meta=get_metadata(video_table.audio))

Computing cells: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 315.96 cells/s]
Added 1 column value with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

In [20]:
video_table.show()

video,audio,audio_meta
,http://127.0.0.1:50473/Users/orm/.pixeltable/media/f8cb59561806456f8bfcac279d68c9f7/c2/c2ee/f8cb59561806456f8bfcac279d68c9f7_1_1_c2ee1b317f0641c4aaccc7c1c7f5f6fc.mp3,"{'size': 8266796, 'streams': [{'type': 'audio', 'frames': 0, 'duration': 7290936576, 'metadata': {'encoder': 'Lavf'}, 'time_base': '1/14112000', 'codec_context': {'name': 'mp3float', 'profile': None, 'channels': 2, 'codec_tag': '\\x00\\x00\\x00\\x00'}, 'duration_seconds': 516.648}], 'bit_rate': 128006, 'metadata': {'encoder': 'Lavf60.3.100'}, 'bit_exact': False}"
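
As a quick sanity check on the metadata above: the stream's `duration` is expressed in ticks of `time_base`, so multiplying the two should reproduce `duration_seconds`. The sketch below uses only the values printed in the output; the helper name `stream_seconds` is ours and is not part of pixeltable.

```python
from fractions import Fraction


def stream_seconds(duration: int, time_base: str) -> float:
    """Convert a duration given in time_base ticks into seconds.

    Hypothetical helper for illustration; `time_base` is a string
    such as '1/14112000', as it appears in the metadata dict.
    """
    return float(duration * Fraction(time_base))


# Values copied from the audio_meta output above.
seconds = stream_seconds(7290936576, '1/14112000')
print(seconds)
```

The result should agree with the `duration_seconds` field, which is a cheap way to confirm that the extracted audio covers the whole video.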


In [21]:
video_table.add_column(transcription=openai.transcriptions(audio=video_table.audio, model='whisper-1'))

Computing cells: 100%|████████████████████████████████████████████| 1/1 [00:23<00:00, 23.43s/ cells]
Added 1 column value with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

In [24]:
video_table.show()

video,audio,audio_meta,transcription,transcription_text
,"const wavesurfer = WaveSurfer.create({  container: ""#waveform_699362"",  waveColor: '#4F4A85',  progressColor: '#383351',  url: 'http://127.0.0.1:50473/Users/orm/.pixeltable/media/f8cb59561806456f8bfcac279d68c9f7/c2/c2ee/f8cb59561806456f8bfcac279d68c9f7_1_1_c2ee1b317f0641c4aaccc7c1c7f5f6fc.mp3',  })","{'size': 8266796, 'streams': [{'type': 'audio', 'frames': 0, 'duration': 7290936576, 'metadata': {'encoder': 'Lavf'}, 'time_base': '1/14112000', 'codec_context': {'name': 'mp3float', 'profile': None, 'channels': 2, 'codec_tag': '\\x00\\x00\\x00\\x00'}, 'duration_seconds': 516.648}], 'bit_rate': 128006, 'metadata': {'encoder': 'Lavf60.3.100'}, 'bit_exact': False}",{'text': 'The Snowflake back on track after a couple of months in the wilderness. The last time we heard from this enterprise software data analytics companies back in February they put a strong quarter with a tepid four year forecast stock plunge from two hundred thirty down to the mid 100s. Since then while many other tech names have rebounded like crazy stuff is only traded back up to 163 as of today's close. But tonight these guys report tremendous core stuff like big expectations on every key line item for the quarter revenue product revenue operating income free cash flow. You name it. Take time as we gave a strong product revenue guidance for the current quarter and raise their full year product revenue forecast. They gave you a little less a lower margin number but we'll find out about that. So with the stock coming into the quarter cold these numbers were enough to send it higher. And if you are just the beginning let's check in with Sridhar Ramaswamy. He is the new CEO of Snowflake. We interviewed him months before we had a GCC. Find out more about the quarter where it's going. Mr. Ramaswamy welcome back to Bad Bunny. Great to be chatting with you Jim. OK so here this was a very impressive set of numbers. 
The one that really stood out was this 46 percent growth in what's known as remaining performance obligation. I regard that as the key indicator of the future. What's driving it. Jim I think all at all. There are two broad strokes to the quarter. One is that our financial performance was really really good. Our product revenue was up 34 percent. Remaining performance obligations as you talked about was up 46 percent. Some very huge deals. It's really an indication of how much our customers believe in us. Our free cash flow margins but also amazing. The other part of Q1 is really how our product pipeline especially in A.I. has been in overdrive. Our A.I. products are now generally available. Over 750 customers are developing on it sending applications to production. And I would say the enterprise is here right here at Snowflake. Well let's talk about enterprise A.I. because you gave a number of use cases and some real some customers. Everybody knows I'm going to pick one. People know because it's on their dining room table. Kraft Heinz. Why is it. Why does Kraft Heinz need Snowflake. Can you repeat the question. Why does Kraft Heinz need Snowflake. Well you know Kraft Heinz is and is an iconic brand but they have lots and lots of data. And so part of the magic that Snowflake brings to the table with its A.I. offerings is that you can analyze customer feedback data very easily using using language models and figure out which questions for example have automated responses as you can send which ones you should send to like an actual human. These are the kinds of applications that people are thinking and implementing with with Snowflake. And the beauty is we make it real easy out of the box and super efficient to get these done. Now you also made an acquisition. Some people said to me you know what I can use Snowflake but I have to observe. I have to interrogate my own data. I don't know. I mean I rent these guys. I have to bring it back. 
Tell me about what it will mean that you have true era A.I. observability now that you've bought this new company that I think is going to make it so that you guys are. I don't know how much you need Amazon Web Services once you do that. I don't know. You tell me. Well one small clarification. We signed a definitive agreement to acquire them. The actual acquisition we expect to happen soon enough. But as people are racing to develop applications you know things like observability becomes important because let's say you change the product. You still want to make sure that the applications working well or you want to try out a new model. It's all part of our mission to make A.I. reliable and change management which observability closely ties into is an important part of making A.I. reliable. That's why we acquired this this great team. But the general theme again is we make end to end A.I. easy to implement. Easy to maintain. Dramatically lower total cost of ownership. You don't have to run GPUs if you want to use A.I. with Snowflake. That's the stuff our customers. Let's talk about GPU because you've got your June 3rd to 6th data cloud summit. I remember watching a video of Jason Wong with your predecessor Mr. Slootman. Mr. Slootman was famously tough on price when it came to Jensen. What will the power be like this time. Well you know I've gotten to know Jensen really well over the past few months. We are super excited by the promise of accelerated computing. Language models are just the beginning. I think it's a powerful way to scale things. We collaborate with India on a number of fronts. Our foundation model Arctic was unsurprisingly done on top of Nvidia chips. We collaborate with them on on models. There's a lot to come. And Jensen's of course a visionary when it comes to A.I. We're going to be talking about all of this and many other new product announcements at our user conference. We're going to be exciting. I'm looking forward to seeing you. 
Well let's see what we can do. I do want to ask you about the margins. You know your revenues going very well. The margins a little bit of decline something I should be worried about. You know how much we care about margins in this business. Margins are really really important. You know I of course I work with Mike who is amazing at this. We are leaning ahead into investing with with A.I. Now these are modest size investments and I don't expect these numbers to like dramatically go up. And what already clearly showed is that you can get a lot done with a small motivated team and a small amount of compute. Arctic was done on two million dollars off of GPU compute. And of course the products are out in G.A. and we are driving it. We are taking it to market. We want customers to use it for us to make dollars. I think we are very much in the mode of driving revenue for our A.I. A.I. products and definitely hope to share more of that in the coming. Got it. Now Mike your CFO did mention at one point that growth moderated in April. But he said that was a normal component of the way that things are in your business. Why is that. Well the snowflake is a consumption model which means that we make money only when our customers consume. Now when there are holidays for example people don't run certain kinds of jobs as you know like Easter is usually in April. So there are seasonal variations like that. But the overall trend that we are seeing in the business be just you know the conversations the vibe that I have with the customers that I talk to is hugely positive. People are truly excited by snowflake as their data platform for data for collaboration and now applications. And you have customer after customer take multimillion you know multi year contracts with snowflake. It points to a bright future where the court is strong and you're pressing the gas really hard on new things like. All right. So do you still speak to Mr. Slipman. 
I only mentioned it was one of the few friends of the show where I just respect him greatly. So how's the communication. It's it's actually he is incredibly kind. I talk to him every other week. So we also chitchat on WhatsApp pretty often. Obviously he's the chairman of the board and spent 10 quality hours with him yesterday. I kind of tapped into his wisdom for how to create a great business. And he is going to stay you know my friend and snowflakes friend for the foreseeable future. And very much a part. Will you tell him we said hi and congratulations on a great quarter. That's Shridhar Ramaswamy snowflake CEO. Thank you sir. Great to see you. Great to see you. Thank you. Everybody's back. Coming up hit us with your best shot. An electrified fast fire lightning round is next. Don't miss a second of mad money. Follow at Jim Cramer on X. Have a question. Tweet Kramer hashtag mad mentions. Send Jim an email to mad money at CNBC dot com or give us a call at 1 800 7 4 3 CNBC. Miss something. Head to mad money dot CNBC dot com.'},The Snowflake back on track after a couple of months in the wilderness. The last time we heard from this enterprise software data analytics companies back in February they put a strong quarter with a tepid four year forecast stock plunge from two hundred thirty down to the mid 100s. Since then while many other tech names have rebounded like crazy stuff is only traded back up to 163 as of today's close. But tonight these guys report tremendous core stuff like big expectations on every key line item for the quarter revenue product revenue operating income free cash flow. You name it. Take time as we gave a strong product revenue guidance for the current quarter and raise their full year product revenue forecast. They gave you a little less a lower margin number but we'll find out about that. So with the stock coming into the quarter cold these numbers were enough to send it higher. 
And if you are just the beginning let's check in with Sridhar Ramaswamy. He is the new CEO of Snowflake. We interviewed him months before we had a GCC. Find out more about the quarter where it's going. Mr. Ramaswamy welcome back to Bad Bunny. Great to be chatting with you Jim. OK so here this was a very impressive set of numbers. The one that really stood out was this 46 percent growth in what's known as remaining performance obligation. I regard that as the key indicator of the future. What's driving it. Jim I think all at all. There are two broad strokes to the quarter. One is that our financial performance was really really good. Our product revenue was up 34 percent. Remaining performance obligations as you talked about was up 46 percent. Some very huge deals. It's really an indication of how much our customers believe in us. Our free cash flow margins but also amazing. The other part of Q1 is really how our product pipeline especially in A.I. has been in overdrive. Our A.I. products are now generally available. Over 750 customers are developing on it sending applications to production. And I would say the enterprise is here right here at Snowflake. Well let's talk about enterprise A.I. because you gave a number of use cases and some real some customers. Everybody knows I'm going to pick one. People know because it's on their dining room table. Kraft Heinz. Why is it. Why does Kraft Heinz need Snowflake. Can you repeat the question. Why does Kraft Heinz need Snowflake. Well you know Kraft Heinz is and is an iconic brand but they have lots and lots of data. And so part of the magic that Snowflake brings to the table with its A.I. offerings is that you can analyze customer feedback data very easily using using language models and figure out which questions for example have automated responses as you can send which ones you should send to like an actual human. These are the kinds of applications that people are thinking and implementing with with Snowflake. 
And the beauty is we make it real easy out of the box and super efficient to get these done. Now you also made an acquisition. Some people said to me you know what I can use Snowflake but I have to observe. I have to interrogate my own data. I don't know. I mean I rent these guys. I have to bring it back. Tell me about what it will mean that you have true era A.I. observability now that you've bought this new company that I think is going to make it so that you guys are. I don't know how much you need Amazon Web Services once you do that. I don't know. You tell me. Well one small clarification. We signed a definitive agreement to acquire them. The actual acquisition we expect to happen soon enough. But as people are racing to develop applications you know things like observability becomes important because let's say you change the product. You still want to make sure that the applications working well or you want to try out a new model. It's all part of our mission to make A.I. reliable and change management which observability closely ties into is an important part of making A.I. reliable. That's why we acquired this this great team. But the general theme again is we make end to end A.I. easy to implement. Easy to maintain. Dramatically lower total cost of ownership. You don't have to run GPUs if you want to use A.I. with Snowflake. That's the stuff our customers. Let's talk about GPU because you've got your June 3rd to 6th data cloud summit. I remember watching a video of Jason Wong with your predecessor Mr. Slootman. Mr. Slootman was famously tough on price when it came to Jensen. What will the power be like this time. Well you know I've gotten to know Jensen really well over the past few months. We are super excited by the promise of accelerated computing. Language models are just the beginning. I think it's a powerful way to scale things. We collaborate with India on a number of fronts. 
Our foundation model Arctic was unsurprisingly done on top of Nvidia chips. We collaborate with them on on models. There's a lot to come. And Jensen's of course a visionary when it comes to A.I. We're going to be talking about all of this and many other new product announcements at our user conference. We're going to be exciting. I'm looking forward to seeing you. Well let's see what we can do. I do want to ask you about the margins. You know your revenues going very well. The margins a little bit of decline something I should be worried about. You know how much we care about margins in this business. Margins are really really important. You know I of course I work with Mike who is amazing at this. We are leaning ahead into investing with with A.I. Now these are modest size investments and I don't expect these numbers to like dramatically go up. And what already clearly showed is that you can get a lot done with a small motivated team and a small amount of compute. Arctic was done on two million dollars off of GPU compute. And of course the products are out in G.A. and we are driving it. We are taking it to market. We want customers to use it for us to make dollars. I think we are very much in the mode of driving revenue for our A.I. A.I. products and definitely hope to share more of that in the coming. Got it. Now Mike your CFO did mention at one point that growth moderated in April. But he said that was a normal component of the way that things are in your business. Why is that. Well the snowflake is a consumption model which means that we make money only when our customers consume. Now when there are holidays for example people don't run certain kinds of jobs as you know like Easter is usually in April. So there are seasonal variations like that. But the overall trend that we are seeing in the business be just you know the conversations the vibe that I have with the customers that I talk to is hugely positive. 
People are truly excited by snowflake as their data platform for data for collaboration and now applications. And you have customer after customer take multimillion you know multi year contracts with snowflake. It points to a bright future where the court is strong and you're pressing the gas really hard on new things like. All right. So do you still speak to Mr. Slipman. I only mentioned it was one of the few friends of the show where I just respect him greatly. So how's the communication. It's it's actually he is incredibly kind. I talk to him every other week. So we also chitchat on WhatsApp pretty often. Obviously he's the chairman of the board and spent 10 quality hours with him yesterday. I kind of tapped into his wisdom for how to create a great business. And he is going to stay you know my friend and snowflakes friend for the foreseeable future. And very much a part. Will you tell him we said hi and congratulations on a great quarter. That's Shridhar Ramaswamy snowflake CEO. Thank you sir. Great to see you. Great to see you. Thank you. Everybody's back. Coming up hit us with your best shot. An electrified fast fire lightning round is next. Don't miss a second of mad money. Follow at Jim Cramer on X. Have a question. Tweet Kramer hashtag mad mentions. Send Jim an email to mad money at CNBC dot com or give us a call at 1 800 7 4 3 CNBC. Miss something. Head to mad money dot CNBC dot com.


In [27]:
video_table.add_column(transcription_text=video_table.transcription.text)

Computing cells: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 244.10 cells/s]
Added 1 column value with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

In [38]:
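# `embeddings` appears to be a local helper module that accompanies this notebook (not shown here).
# Reloading it picks up any edits made since it was first imported.
import importlib
import embeddings
importlib.reload(embeddings)
from embeddings import TextSplitter, e5_embed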

In [36]:
sentence_view = pxt.create_view('transcription_demo.sentence_view',
                                video_table,
                                iterator=TextSplitter.create(text=video_table.transcription_text))

Inserting rows into `sentence_view`: 131 rows [00:00, 12150.95 rows/s]
Created view `sentence_view` with 131 rows, 0 exceptions.


In [37]:
sentence_view.select(sentence_view.pos, sentence_view.text).where(sentence_view.pos <= 10).show()

pos,text
0,The Snowflake back on track after a couple of months in the wilderness.
1,The last time we heard from this enterprise software data analytics companies back in February they put a strong quarter with a tepid four year forecast stock plunge from two hundred thirty down to the mid 100s.
2,Since then while many other tech names have rebounded like crazy stuff is only traded back up to 163 as of today's close.
3,But tonight these guys report tremendous core stuff like big expectations on every key line item for the quarter revenue product revenue operating income free cash flow.
4,You name it.
5,Take time as we gave a strong product revenue guidance for the current quarter and raise their full year product revenue forecast.
6,They gave you a little less a lower margin number but we'll find out about that.
7,So with the stock coming into the quarter cold these numbers were enough to send it higher.
8,And if you are just the beginning let's check in with Sridhar Ramaswamy.
9,He is the new CEO of Snowflake.
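
`TextSplitter` comes from the helper module imported earlier, and its implementation is not shown in this notebook. To give a rough idea of what sentence-level splitting involves, here is a minimal regex-based sketch. The function `split_sentences` is hypothetical, and the real splitter may well use a proper sentence tokenizer; this naive version would, for example, wrongly break after abbreviations such as "Mr.".

```python
import re


def split_sentences(text: str) -> list:
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Returns one dict per sentence with the same `pos` and `text`
    fields that appear in the view output above.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [{'pos': i, 'text': s} for i, s in enumerate(p for p in parts if p)]


rows = split_sentences('You name it. He is the new CEO of Snowflake. What is driving it?')
for row in rows:
    print(row['pos'], row['text'])
```

Each input document yields several output rows, which is why the view contains 131 rows for a single video.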


In [39]:
sentence_view.add_embedding_index(col_name='text', text_embed=e5_embed)

Computing cells: 100%|████████████████████████████████████████| 131/131 [00:03<00:00, 39.08 cells/s]


In [42]:
similarity = sentence_view.text.similarity('you should buy NVIDIA')
sentence_view.select(sentence_view.text, similarity).order_by(similarity, asc=False).limit(20).collect()

text,col_1
To make you money.,0.835868
Now all this actual macro activity is vying for headlines with Nvidia.,0.824716
My wife wants to buy one.,0.818205
You still want to make sure that the applications working well or you want to try out a new model.,0.817813
Our foundation model Arctic was unsurprisingly done on top of Nvidia chips.,0.814672
Follow at Jim Cramer on X. Have a question.,0.812509
Follow at Jim Cramer on X. Have a question.,0.812509
Let's talk about GPU because you've got your June 3rd to 6th data cloud summit.,0.812478
Welcome to Mad Money.,0.812043
Head to mad money dot CNBC dot com.,0.810483
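
Conceptually, the query above embeds the search string and ranks every sentence by how close its embedding is to that query vector. The sketch below illustrates the ranking step only, assuming cosine similarity as the metric (the notebook does not state which metric is used). The three-dimensional vectors are made up for illustration; real embeddings come from `e5_embed` and have many more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy embeddings, not real model output.
toy_rows = [
    ('sentence A', [1.0, 0.0, 0.0]),
    ('sentence B', [0.6, 0.8, 0.0]),
    ('sentence C', [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.0, 0.0], toy_rows, k=2)
for text, score in results:
    print(text, round(score, 3))
```

An embedding index avoids scoring every row like this brute-force loop does, but the ordering it returns has the same meaning.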


In [41]:
video_table.insert([{'video': video_path} for video_path in paths[2:]])

Inserting rows into `video_table`: 1 rows [00:00, 120.76 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 5/5 [00:43<00:00,  8.69s/ cells]
Inserting rows into `sentence_view`: 240 rows [00:00, 446.46 rows/s]
Inserted 241 rows with 0 errors.


UpdateStatus(num_rows=241, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])
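
The insert above ran every computed column and refreshed the sentence view for the new row without any additional code. The toy class below is only a mental model of that behavior, not how pixeltable is implemented; `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """Toy in-memory table whose computed columns are filled in on insert."""

    def __init__(self):
        self.rows = []
        self.computed = {}  # column name -> function of the row dict

    def add_column(self, name, fn):
        """Register a computed column and backfill it for existing rows."""
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)

    def insert(self, new_rows):
        """Insert rows, evaluating computed columns in definition order."""
        for row in new_rows:
            row = dict(row)
            for name, fn in self.computed.items():
                row[name] = fn(row)
            self.rows.append(row)


table = ToyTable()
table.insert([{'video': 'a.mp4'}])
# Backfills the existing row, like add_column did earlier in the notebook.
table.add_column('audio', lambda r: r['video'].replace('.mp4', '.mp3'))
# A later insert gets the computed column automatically.
table.insert([{'video': 'b.mp4'}])
print(table.rows)
```

Because the column definitions live with the table rather than in a one-off script, new data goes through the same steps as old data.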

In [21]:
video_table.select(video_table.video, video_table.audio, video_table.audio_meta).show()

video,audio,audio_meta
,,"{'size': 8266796, 'streams': [{'type': 'audio', 'frames': 0, 'duration': 7290936576, 'metadata': {'encoder': 'Lavf'}, 'time_base': '1/14112000', 'codec_context': {'name': 'mp3float', 'profile': None, 'channels': 2, 'codec_tag': '\\x00\\x00\\x00\\x00'}, 'duration_seconds': 516.648}], 'bit_rate': 128006, 'metadata': {'encoder': 'Lavf60.3.100'}, 'bit_exact': False}"
,,"{'size': 9245228, 'streams': [{'type': 'audio', 'frames': 0, 'duration': 8153913600, 'metadata': {'encoder': 'Lavf'}, 'time_base': '1/14112000', 'codec_context': {'name': 'mp3float', 'profile': None, 'channels': 2, 'codec_tag': '\\x00\\x00\\x00\\x00'}, 'duration_seconds': 577.8}], 'bit_rate': 128005, 'metadata': {'encoder': 'Lavf60.3.100'}, 'bit_exact': False}"
,,"{'size': 10607276, 'streams': [{'type': 'audio', 'frames': 0, 'duration': 9355239936, 'metadata': {'encoder': 'Lavf'}, 'time_base': '1/14112000', 'codec_context': {'name': 'mp3float', 'profile': None, 'channels': 2, 'codec_tag': '\\x00\\x00\\x00\\x00'}, 'duration_seconds': 662.928}], 'bit_rate': 128005, 'metadata': {'encoder': 'Lavf60.3.100'}, 'bit_exact': False}"
