# Local Video Ingestion for VLM summarization

Building video summarization or question-answering (QA) applications on your local videos is a topic of high interest. Current large vision-language models (VLMs) can only use a finite set of frames to represent any given video: usually between 8 and 64 frames sampled from the input video, though the number varies by model. For accurate transcription of long-form videos, a chunking mechanism is therefore required to trim videos into smaller clips, so that the finite set of frames better represents the video passed to the VLM as visual input.
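To make the frame budget concrete, here is a minimal sketch of uniform frame sampling. It assumes the model picks evenly spaced frames, which is only one possible strategy; the helper names `uniform_frame_indices` and `seconds_between_samples` are hypothetical and not part of any library.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Split the video into num_samples equal segments and take each midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def seconds_between_samples(duration_s: float, num_samples: int) -> float:
    """Average gap between sampled frames, in seconds."""
    return duration_s / num_samples


# A 10-minute video at 30 fps represented by only 8 frames (illustrative numbers).
print(uniform_frame_indices(total_frames=600 * 30, num_samples=8)[:3])  # [1125, 3375, 5625]
print(seconds_between_samples(600, 8))  # 75.0 seconds between sampled frames
print(seconds_between_samples(10, 8))   # 1.25 seconds for a 10-second chunk
```

With the same 8-frame budget, a 10-second chunk is sampled roughly every 1.25 seconds instead of every 75 seconds, which is why chunking tends to give the VLM a more faithful view of each clip.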

Below we show how to go from an `.mp4` file to chunked clips that yield more accurate VLM video summaries, which can then be used in downstream tasks including RAG applications.


In [None]:
from langchain_community.document_loaders.video import VideoChunkLoader

### Load and chunk local video using the sliding window chunking scheme

Use `wget` to download the sample video file.

Then use `VideoChunkLoader()` to chunk the video systematically.

Let's take a look at the sample video first.

In [None]:
!wget -O sample_video.mp4 https://github.com/intel-iot-devkit/sample-videos/raw/master/store-aisle-detection.mp4

from IPython.display import Video

video_path = "sample_video.mp4"
Video(video_path, width=600, height=400)

--2025-03-01 14:23:45--  https://github.com/intel-iot-devkit/sample-videos/raw/master/store-aisle-detection.mp4
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/intel-iot-devkit/sample-videos/master/store-aisle-detection.mp4 [following]
--2025-03-01 14:23:46--  https://raw.githubusercontent.com/intel-iot-devkit/sample-videos/master/store-aisle-detection.mp4
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9214573 (8.8M) [application/octet-stream]
Saving to: ‘sample_video.mp4’


2025-03-01 14:23:46 (34.7 MB/s) - ‘sample_video.mp4’ saved [9214573/9214573]
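
If `wget` is not available in your environment, the same download can be sketched with Python's standard library. The helper name `download_if_missing` is hypothetical; it relies only on `urllib.request.urlretrieve` and skips the download when the file already exists.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists; return the local path."""
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


# Usage (performs a network request, so it is left commented out here):
# video_path = str(download_if_missing(
#     "https://github.com/intel-iot-devkit/sample-videos/raw/master/store-aisle-detection.mp4",
#     "sample_video.mp4",
# ))
```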



## Building a video summarization app from local sample video

Given `sample_video.mp4`, we can set up ingestion for video summarization. Below we show the two ingestion mechanisms provided by `VideoChunkLoader`.

### Example of "Sliding Window" with overlap implementation

To avoid losing context at the start and end of clips, the user can configure a sliding window technique with an overlap, so that consecutive chunks share some footage around each clip boundary.
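The window arithmetic can be sketched in plain Python. This is an illustration of the idea, not the loader's actual implementation: the function name `sliding_window_intervals` is hypothetical, the 30-second duration is made up, and clamping the last window to the end of the video is an assumption.

```python
from typing import List, Tuple


def sliding_window_intervals(
    total_duration: float, chunk_duration: float, chunk_overlap: float
) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, for overlapping windows."""
    if chunk_duration <= 0:
        raise ValueError("chunk_duration must be positive")
    if not 0 <= chunk_overlap < chunk_duration:
        raise ValueError("chunk_overlap must be in [0, chunk_duration)")
    # Each window starts (chunk_duration - chunk_overlap) seconds after the previous one.
    step = chunk_duration - chunk_overlap
    intervals = []
    start = 0
    while start < total_duration:
        # Assumption: the final window is clamped to the end of the video.
        end = min(start + chunk_duration, total_duration)
        intervals.append((start, end))
        start += step
    return intervals


print(sliding_window_intervals(total_duration=30, chunk_duration=10, chunk_overlap=2))
# [(0, 10), (8, 18), (16, 26), (24, 30)]
```

With `chunk_duration=10` and `chunk_overlap=2`, each window begins 8 seconds after the previous one, so the last 2 seconds of one chunk reappear as the first 2 seconds of the next.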

In [None]:
# Setup VideoChunkLoader object for ingesting video, using the sliding window technique
loader = VideoChunkLoader(
    video_path=video_path,
    chunking_mechanism="sliding_window",
    chunk_duration=10,
    chunk_overlap=2,
)

# Display the LangChain documents created by loading the sample video.
# The video chunks are saved in the 'video_chunks' directory.
for doc in loader.lazy_load():
    print(f"Chunk metadata: {doc.metadata}")
    print(f"Chunk content: {doc.page_content}")

# Play back the first chunk; it should have length 'chunk_duration'
Video("video_chunks/chunk_0.mp4", width=600, height=400)

Chunk metadata: {'chunk_id': 0, 'chunk_path': 'video_chunks/chunk_0.mp4', 'start_time': 0, 'end_time': 10, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 0s to 10s
Chunk metadata: {'chunk_id': 1, 'chunk_path': 'video_chunks/chunk_1.mp4', 'start_time': 8, 'end_time': 18, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 8s to 18s
Chunk metadata: {'chunk_id': 2, 'chunk_path': 'video_chunks/chunk_2.mp4', 'start_time': 16, 'end_time': 26, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 16s to 26s
Chunk metadata: {'chunk_id': 3, 'chunk_path': 'video_chunks/chunk_3.mp4', 'start_time': 24, 'end_time': 34, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 24s to 34s
Chunk metadata: {'chunk_id': 4, 'chunk_path': 'video_chunks/chunk_4.mp4', 'start_time': 32, 'end_time': 42, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 32s to 42s
Chunk metadata: {'chunk_id': 5, 'chunk_path': 'video_chunks/chunk_5.mp4', 'start_time': 40, 'end_ti

### Example of "Specific Intervals" implementation

Many video summarization workloads only require summaries of selected parts of a video. To support this, the user can pass predefined time intervals in the configuration so that only those portions of the video are chunked and then used in downstream summarization workloads.
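As a rough sketch of how such intervals map to clip boundaries, the helper below converts `{"start", "duration"}` dictionaries into `(start, end)` pairs. The function name `intervals_to_ranges` and its validation rules are assumptions for illustration, not the loader's own code.

```python
from typing import Dict, List, Tuple


def intervals_to_ranges(
    specific_intervals: List[Dict[str, float]]
) -> List[Tuple[float, float]]:
    """Convert {"start", "duration"} dicts into (start, end) times in seconds."""
    ranges = []
    for i, interval in enumerate(specific_intervals):
        start, duration = interval["start"], interval["duration"]
        # Assumption: negative starts and non-positive durations are rejected.
        if start < 0 or duration <= 0:
            raise ValueError(f"interval {i} has an invalid start or duration: {interval}")
        ranges.append((start, start + duration))
    return ranges


print(intervals_to_ranges([{"start": 10, "duration": 10}, {"start": 20, "duration": 8}]))
# [(10, 20), (20, 28)]
```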

In [None]:
# Set up a VideoChunkLoader object for ingesting video, using the specific intervals technique
loader = VideoChunkLoader(
    video_path=video_path,
    chunking_mechanism="specific_chunks",
    specific_intervals=[{"start": 10, "duration": 10}, {"start": 20, "duration": 8}],
)

# Display the LangChain documents created by loading the sample video.
# The video chunks are saved in the 'video_chunks' directory.
for doc in loader.lazy_load():
    print(f"Chunk metadata: {doc.metadata}")
    print(f"Chunk content: {doc.page_content}")

# Play back the first chunk; its length should match 'duration' of the first interval
Video("video_chunks/chunk_0.mp4", width=600, height=400)

Chunk metadata: {'chunk_id': 0, 'chunk_path': 'video_chunks/chunk_0.mp4', 'start_time': 10, 'end_time': 20, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 10s to 20s
Chunk metadata: {'chunk_id': 1, 'chunk_path': 'video_chunks/chunk_1.mp4', 'start_time': 20, 'end_time': 28, 'source': 'sample_video.mp4'}
Chunk content: Video chunk from 20s to 28s
