<a href="https://colab.research.google.com/github/prisar/ai_notebooks/blob/main/nb_095.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Perform the "Temporal-Aware Summarization" experiment as described, using the specified tools and evaluation criteria.

## Video processing and chunking

### Subtask:
Split the video into 2-5 minute chunks and extract audio from each chunk.


**Reasoning**:
The subtask requires video processing, which can be done using the moviepy library. I will install the library first.



In [1]:
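# Pin to the 1.x series: the cells below use moviepy.editor and subclip(), which MoviePy 2.x removed or renamed
%pip install "moviepy<2"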



**Reasoning**:
The moviepy library is installed. Now I need to load the video, determine its duration, calculate chunk timestamps, extract chunks and their audio, and save the audio files. I will combine these steps into a single code block.



In [2]:
from moviepy.editor import VideoFileClip
import os

# Define the video file path (replace with your video file)
video_path = '/tmp/interview_video.mp4'
output_dir = '/tmp/audio_chunks'

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Load the video clip
try:
    video_clip = VideoFileClip(video_path)
except Exception as e:
    print(f"Error loading video file: {e}")
    # Finish the task with failure if the video file cannot be loaded
    raise

# Get the total duration in seconds
total_duration = video_clip.duration
print(f"Total video duration: {total_duration} seconds")

# Define chunk length range in seconds (2 to 5 minutes)
min_chunk_length = 2 * 60
max_chunk_length = 5 * 60

# Calculate chunk timestamps
start_time = 0
timestamps = []

while start_time < total_duration:
    end_time = min(start_time + max_chunk_length, total_duration)
    remaining = total_duration - end_time
    # If the leftover tail would be shorter than the minimum, end this chunk
    # early so the final chunk is exactly min_chunk_length long.
    if 0 < remaining < min_chunk_length:
        end_time = total_duration - min_chunk_length

    timestamps.append((start_time, end_time))
    start_time = end_time

print(f"Calculated {len(timestamps)} chunks.")

# Extract audio from each chunk and save
for i, (start, end) in enumerate(timestamps):
    try:
        # Extract the subclip for the current chunk
        chunk_clip = video_clip.subclip(start, end)

        # Extract the audio stream
        audio_clip = chunk_clip.audio

        # Define the output audio file path
        audio_output_path = os.path.join(output_dir, f'audio_chunk_{i+1}.wav')

        # Write the audio to a WAV file
        audio_clip.write_audiofile(audio_output_path, codec='pcm_s16le')

        print(f"Extracted and saved audio chunk {i+1} ({start:.2f}s to {end:.2f}s) to {audio_output_path}")

        # Close the audio clip to release resources
        audio_clip.close()
        chunk_clip.close()

    except Exception as e:
        print(f"Error processing chunk {i+1}: {e}")
        # Continue to the next chunk even if one fails

# Close the main video clip
video_clip.close()

print("Audio extraction complete.")




Error loading video file: MoviePy error: the file /tmp/interview_video.mp4 could not be found!
Please check that you entered the correct path.


OSError: MoviePy error: the file /tmp/interview_video.mp4 could not be found!
Please check that you entered the correct path.
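The chunk boundaries depend only on the clip's duration, so that arithmetic can be checked without a video file. The following is a minimal sketch of one alternative scheme, splitting the duration into equal-length chunks; `even_chunk_bounds` is a hypothetical helper, not part of the notebook, and it assumes the maximum chunk length is at least twice the minimum (true for 2 and 5 minutes).

```python
import math

def even_chunk_bounds(total_duration, min_len=120.0, max_len=300.0):
    """Split [0, total_duration] into equal chunks no longer than max_len.

    Hypothetical helper for illustration; times are in seconds.
    """
    if max_len < 2 * min_len:
        # With max_len >= 2 * min_len, equal chunks can never fall below min_len
        # unless the whole recording is shorter than min_len.
        raise ValueError("max_len must be at least twice min_len")
    if total_duration <= 0:
        return []
    # Smallest number of chunks that keeps every chunk at or below max_len.
    n_chunks = max(1, math.ceil(total_duration / max_len))
    chunk_len = total_duration / n_chunks
    bounds = []
    for i in range(n_chunks):
        start = i * chunk_len
        # Pin the last end to total_duration to avoid floating-point drift.
        end = total_duration if i == n_chunks - 1 else (i + 1) * chunk_len
        bounds.append((start, end))
    return bounds

# An 11-minute recording (660 s) is split into three chunks of 220 s each.
for start, end in even_chunk_bounds(660):
    print(f"{start:.0f}s -> {end:.0f}s")
```

A recording shorter than the minimum still comes back as a single short chunk, since there is nothing to merge it with.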

**Reasoning**:
The previous attempt failed because the video file was not found. I need to use a valid video file path. I will use a dummy video file for now.



In [3]:
# Define a dummy video file path for demonstration purposes
video_path = '/tmp/dummy_interview_video.mp4'
output_dir = '/tmp/audio_chunks'

# Create a dummy video file (replace with your actual video file path)
# This is a placeholder and won't create a valid video.
# You need to replace this with the actual path to your video file.
try:
    with open(video_path, 'w') as f:
        f.write("This is a dummy file. Replace with a real video.")
    print(f"Created a dummy file at {video_path}. **Replace this with your actual video file.**")
except Exception as e:
    print(f"Could not create dummy file: {e}")

# The rest of the code from the previous attempt remains the same
from moviepy.editor import VideoFileClip
import os

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Load the video clip. Reset the name first so a clip left over from an
# earlier run cannot mask a failed load.
video_clip = None
try:
    video_clip = VideoFileClip(video_path)
except Exception as e:
    # Report the problem here; the check below stops the cell.
    print(f"Error loading video file: {e}")

# Without a loaded clip there is nothing to chunk or extract
if video_clip is None:
    print("Video clip could not be loaded. Cannot proceed with audio extraction.")
    raise ValueError("Video clip could not be loaded.")

# Get the total duration in seconds
total_duration = video_clip.duration
print(f"Total video duration: {total_duration} seconds")

# Define chunk length range in seconds (2 to 5 minutes)
min_chunk_length = 2 * 60
max_chunk_length = 5 * 60

# Calculate chunk timestamps
start_time = 0
timestamps = []

while start_time < total_duration:
    end_time = min(start_time + max_chunk_length, total_duration)
    remaining = total_duration - end_time
    # If the leftover tail would be shorter than the minimum, end this chunk
    # early so the final chunk is exactly min_chunk_length long.
    if 0 < remaining < min_chunk_length:
        end_time = total_duration - min_chunk_length

    timestamps.append((start_time, end_time))
    start_time = end_time

print(f"Calculated {len(timestamps)} chunks.")

# Extract audio from each chunk and save
for i, (start, end) in enumerate(timestamps):
    try:
        # Extract the subclip for the current chunk
        chunk_clip = video_clip.subclip(start, end)

        # Extract the audio stream
        audio_clip = chunk_clip.audio

        # Define the output audio file path
        audio_output_path = os.path.join(output_dir, f'audio_chunk_{i+1}.wav')

        # Write the audio to a WAV file
        audio_clip.write_audiofile(audio_output_path, codec='pcm_s16le')

        print(f"Extracted and saved audio chunk {i+1} ({start:.2f}s to {end:.2f}s) to {audio_output_path}")

        # Close the audio clip to release resources
        audio_clip.close()
        chunk_clip.close()

    except Exception as e:
        print(f"Error processing chunk {i+1}: {e}")
        # Continue to the next chunk even if one fails

# Close the main video clip
video_clip.close()

print("Audio extraction complete.")

Created a dummy file at /tmp/dummy_interview_video.mp4. **Replace this with your actual video file.**
Error loading video file: MoviePy error: failed to read the duration of file /tmp/dummy_interview_video.mp4.
Here are the file infos returned by ffmpeg:

ffmpeg version 7.0.2-static https://johnvansickle.com/ffmpeg/  Copyright (c) 2000-2024 the FFmpeg developers
  built with gcc 8 (Debian 8.3.0-6)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-debug --disable-ffplay --disable-indev=sndio --disable-outdev=sndio --cc=gcc --enable-fontconfig --enable-frei0r --enable-gnutls --enable-gmp --enable-libgme --enable-gray --enable-libaom --enable-libfribidi --enable-libass --enable-libvmaf --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librubberband --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libvorbis --enable-libopus --enable-libtheora --enable-libvidstab --enable-libvo-amrw

ValueError: Video clip could not be loaded.
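A text file with an `.mp4` extension reaches ffmpeg before it is rejected. A cheap pre-check can catch the most obvious cases earlier. The sketch below assumes the input is an MP4 container, whose files usually carry an `ftyp` box identifier at bytes 4 to 8; `looks_like_mp4` is a hypothetical helper, and the check is a heuristic rather than a validation of the video stream.

```python
import os
import tempfile

def looks_like_mp4(path):
    """Heuristic: the file exists, is not tiny, and has 'ftyp' at bytes 4-8."""
    if not os.path.isfile(path) or os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        header = f.read(12)
    return header[4:8] == b"ftyp"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same content as the placeholder written in the cell above.
    text_path = os.path.join(tmp_dir, "dummy.mp4")
    with open(text_path, "w") as f:
        f.write("This is a dummy file. Replace with a real video.")

    # A file that only imitates the first MP4 box; it is still not playable.
    fake_path = os.path.join(tmp_dir, "header_only.mp4")
    with open(fake_path, "wb") as f:
        f.write(b"\x00\x00\x00\x18ftypisom" + b"\x00" * 12)

    text_result = looks_like_mp4(text_path)
    header_result = looks_like_mp4(fake_path)

print(text_result, header_result)
```

The second file passes the check without being a usable video, which is why this only serves as an early filter in front of `VideoFileClip`, not a replacement for its error handling.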

## Transcription

### Subtask:
Transcribe each audio chunk using Whisper.


**Reasoning**:
Install the necessary libraries for audio transcription using Whisper.



In [4]:
%pip install -q transformers datasets soundfile librosa accelerate
%pip install -q git+https://github.com/openai/whisper.git


**Reasoning**:
List the audio files in the output directory, then transcribe each one with the Whisper model and store the text in a dictionary keyed by filename. The previous subtask produced no audio files, so this step is expected to find none and print a message saying so.



In [5]:
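import os
import re
import whisper

# Directory where audio chunks were saved
output_dir = '/tmp/audio_chunks'

def chunk_number(filename):
    # Numeric suffix of 'audio_chunk_<n>.wav', so chunk 10 sorts after chunk 2
    match = re.search(r'(\d+)\.wav$', filename)
    return int(match.group(1)) if match else 0

# List audio files in chunk order (os.listdir returns names in arbitrary order)
audio_files = [os.path.join(output_dir, f)
               for f in sorted(os.listdir(output_dir), key=chunk_number)
               if f.endswith('.wav')]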

# Initialize Whisper model
# Using a smaller model for demonstration; adjust as needed
model = whisper.load_model("tiny.en")

# Dictionary to store transcriptions
audio_transcriptions = {}

if not audio_files:
    print(f"No audio files found in {output_dir}. Transcription cannot be performed.")
else:
    # Transcribe each audio file
    for audio_file_path in audio_files:
        try:
            print(f"Transcribing: {audio_file_path}")
            result = model.transcribe(audio_file_path)
            audio_transcriptions[os.path.basename(audio_file_path)] = result["text"]
            print(f"Transcription complete for {os.path.basename(audio_file_path)}")
        except Exception as e:
            print(f"Error transcribing {audio_file_path}: {e}")

# Print the transcriptions (optional)
# print("\n--- Transcriptions ---")
# for filename, transcription in audio_transcriptions.items():
#     print(f"{filename}:\n{transcription}\n")




No audio files found in /tmp/audio_chunks. Transcription cannot be performed.
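Keeping only `result["text"]` discards the timing that a temporal-aware summary depends on. Whisper's `transcribe` result also carries a `segments` list with `start`, `end`, and `text` fields, timed relative to the start of each audio file. A minimal sketch of shifting those times onto the full video timeline follows; `to_absolute_segments` and `format_timestamp` are hypothetical helpers, and the segment data is made up for illustration.

```python
def to_absolute_segments(segments, chunk_start):
    """Shift chunk-relative segment times onto the full video timeline."""
    return [
        {
            "start": seg["start"] + chunk_start,
            "end": seg["end"] + chunk_start,
            "text": seg["text"].strip(),
        }
        for seg in segments
    ]

def format_timestamp(seconds):
    """Render seconds as MM:SS for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

# Stand-in for result["segments"] from a chunk that starts at 300 s.
fake_segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome back."},
    {"start": 4.2, "end": 9.0, "text": " Let's continue."},
]

absolute = to_absolute_segments(fake_segments, chunk_start=300.0)
for seg in absolute:
    print(f"[{format_timestamp(seg['start'])}] {seg['text']}")
```

The chunk start would come from the `timestamps` list built during audio extraction, matched to each file by its chunk number.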


## Vector encoding

### Subtask:
Encode the transcriptions into vector representations using a suitable model.


**Reasoning**:
Import the necessary libraries and a suitable pre-trained model and its tokenizer for generating vector embeddings.



In [6]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pre-trained sentence-embedding model and its tokenizer
# Using 'sentence-transformers/all-MiniLM-L6-v2' as an example
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# Named embedding_model so it does not overwrite the Whisper `model` loaded earlier
embedding_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Function to get the embedding for a single text
def get_embeddings(text):
    # Tokenize the text. truncation=True silently drops tokens beyond the
    # tokenizer's maximum length, so long transcripts are only partly encoded.
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = embedding_model(**encoded_input)

    # Mean pooling: average the token embeddings, using the attention mask
    # so that padding tokens do not contribute
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = encoded_input['attention_mask'].unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask

    return mean_embeddings.squeeze().tolist() # Return as a list

# Dictionary to store vector embeddings
audio_embeddings = {}

# Check if audio_transcriptions is empty or not available
if not audio_transcriptions:
    print("audio_transcriptions is empty or not available. Cannot generate embeddings.")
else:
    # Iterate through transcriptions and generate embeddings
    for filename, transcription in audio_transcriptions.items():
        if transcription and isinstance(transcription, str): # Ensure transcription is not empty and is a string
            try:
                print(f"Generating embedding for {filename}...")
                embedding = get_embeddings(transcription)
                audio_embeddings[filename] = embedding
                print(f"Embedding generated for {filename}.")
            except Exception as e:
                print(f"Error generating embedding for {filename}: {e}")
        else:
            print(f"Skipping {filename} due to empty or invalid transcription.")

# Print the generated embeddings (optional, might be large)
# print("\n--- Audio Embeddings ---")
# for filename, embedding in audio_embeddings.items():
#     print(f"{filename}: {embedding[:10]}...") # Print only first 10 elements to avoid flooding output


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




audio_transcriptions is empty or not available. Cannot generate embeddings.
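Because the tokenizer call uses `truncation=True`, a transcript longer than the model's input limit is cut off, and a transcript of several minutes of speech may well exceed it. One common workaround is to embed overlapping windows of the text and average the resulting vectors. The sketch below shows only the windowing and averaging; `split_into_windows` and `mean_vector` are hypothetical helpers, and the 200-word window is an assumed budget, not a value derived from the tokenizer.

```python
def split_into_windows(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (word counts only approximate tokens)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return windows

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# A 450-word stand-in transcript yields three windows with the defaults.
transcript = " ".join(f"w{i}" for i in range(450))
windows = split_into_windows(transcript)
print(len(windows), [len(w.split()) for w in windows])

# Toy 2-D vectors stand in for the per-window embeddings.
print(mean_vector([[1.0, 2.0], [3.0, 4.0]]))
```

In the notebook, each window would be passed to `get_embeddings` and the per-window vectors averaged into one embedding per chunk.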


## Memory-augmented LLM

### Subtask:
Feed the vector representations sequentially into a memory-augmented LLM (like Phi-3, LongLoRA) to generate a summary that maintains temporal coherence.


**Reasoning**:
Check whether the `audio_embeddings` dictionary is empty. If it is, print a message and stop. Otherwise the remaining steps would be: prepare the input for the LLM, instantiate the model, feed it the embeddings in order, then store and print the summary. Running a model such as Phi-3 or a LongLoRA fine-tune needs a GPU, the model weights, and supporting libraries that are not set up in this session, so the cell only checks for embeddings and, if any exist, reports that LLM inference was not run.



In [7]:
# Define the summary up front so later cells can test it on either branch
generated_summary = None

# Step 1: Check if audio_embeddings is empty
if not audio_embeddings:
    print("audio_embeddings is empty. Embeddings are not available and the summary cannot be generated.")
else:
    print(f"Found {len(audio_embeddings)} audio embeddings.")
    # Steps 2-6: Prepare input, instantiate LLM, feed input, generate summary, store, and print.
    # Acknowledging the limitation: Directly instantiating and running large memory-augmented
    # LLMs like Phi-3 or LongLoRA requires significant computational resources (GPUs)
    # and specific library setups not available in a standard notebook environment.
    # Therefore, these steps cannot be fully executed here.
    print("\nCannot directly instantiate and run a memory-augmented LLM (like Phi-3 or LongLoRA) in this environment.")
    print("This step would typically involve:")
    print("2. Preparing the vector representations as input for the LLM.")
    print("3. Loading or initializing the specific LLM.")
    print("4. Feeding the sequence of embeddings into the LLM's input mechanism.")
    print("5. Running the inference to generate the summary.")
    print("6. Storing the generated summary.")
    # Since the core task of using the LLM cannot be performed,
    # we cannot generate a summary.
    generated_summary = None # Or an empty string
    print("\nSummary generation skipped due to environment limitations.")


audio_embeddings is empty. Embeddings are not available and the summary cannot be generated.
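The sequential feeding described in the subtask can be approximated without a specialised model by carrying a running summary from one chunk to the next. The sketch below uses rolling summarization over the transcript text, which is a swapped-in technique and not the embedding-feeding approach the subtask names. `rolling_summary` is a hypothetical helper, `summarize_fn` stands for any callable that sends a prompt to an LLM and returns text, and the prompt wording is an assumption.

```python
def rolling_summary(transcripts, summarize_fn, max_memory_chars=1500):
    """Summarize chunks in order, passing the summary so far as memory.

    transcripts: list of (label, text) pairs in chronological order.
    summarize_fn: callable taking a prompt string and returning a string.
    """
    memory = ""
    for label, text in transcripts:
        prompt = (
            "Summary of the recording so far:\n"
            f"{memory or '(nothing yet)'}\n\n"
            f"Next segment ({label}):\n{text}\n\n"
            "Update the summary, keeping events in the order they occurred."
        )
        # Cap the memory so the prompt stays bounded as chunks accumulate.
        memory = summarize_fn(prompt)[:max_memory_chars]
    return memory

# Stub in place of a real LLM call, used only to show the control flow.
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"summary after {len(calls)} segment(s)"

final_summary = rolling_summary(
    [("00:00-05:00", "Intro."), ("05:00-10:00", "Main topic.")],
    fake_llm,
)
print(final_summary)
```

Each prompt contains the previous summary, so the model sees earlier content in compressed form while processing chunks strictly in time order.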


## Evaluation

### Subtask:
Evaluate the generated summary for faithfulness, temporal coherence, and semantic coverage.


**Reasoning**:
Define the evaluation criteria and then evaluate the generated summary if it exists. Since the previous step reported that no summary was generated, the evaluation will acknowledge this.



In [8]:
# Define evaluation criteria
evaluation_criteria = {
    "faithfulness": "Assess if the summary accurately reflects the facts and events presented in the original content without hallucination.",
    "temporal_coherence": "Evaluate if the sequence of events or topics in the summary follows a logical and coherent order as presented in the original content.",
    "semantic_coverage": "Determine if the summary includes all key information and main points from the original content."
}

print("Evaluation Criteria Defined:")
for criterion, description in evaluation_criteria.items():
    print(f"- {criterion}: {description}")

# Check if a summary was generated in the previous step
# In this run none was produced, so generated_summary is None or not defined at all.
if 'generated_summary' in locals() and generated_summary:
    print("\nEvaluating the generated summary...")
    print("Generated Summary:")
    print(generated_summary)

    # Perform a qualitative assessment based on the criteria
    # This would typically involve manual review or more sophisticated automated methods
    # Since we don't have a real summary or the original content here,
    # this assessment is illustrative.
    qualitative_assessment = {
        "faithfulness": "Cannot assess faithfulness without the original content.",
        "temporal_coherence": "Cannot assess temporal coherence without the original content or a real summary.",
        "semantic_coverage": "Cannot assess semantic coverage without the original content or a real summary."
    }

    print("\nQualitative Assessment:")
    for criterion, assessment in qualitative_assessment.items():
        print(f"- {criterion}: {assessment}")

else:
    print("\nNo summary was generated in the previous step.")
    print("Evaluation cannot be performed without a summary.")
    qualitative_assessment = "Evaluation not possible: No summary was generated in the previous step due to missing input data and environment limitations."
    print("\nQualitative Assessment:")
    print(qualitative_assessment)


Evaluation Criteria Defined:
- faithfulness: Assess if the summary accurately reflects the facts and events presented in the original content without hallucination.
- temporal_coherence: Evaluate if the sequence of events or topics in the summary follows a logical and coherent order as presented in the original content.
- semantic_coverage: Determine if the summary includes all key information and main points from the original content.

No summary was generated in the previous step.
Evaluation cannot be performed without a summary.

Qualitative Assessment:
Evaluation not possible: No summary was generated in the previous step due to missing input data and environment limitations.
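Once a summary exists, two of the three criteria can be given rough numeric proxies from embeddings. The sketch below aligns each summary sentence to its most similar source chunk, then scores temporal coherence as the fraction of sentence pairs that keep chunk order, and semantic coverage as the fraction of chunks that some sentence aligns to. All function names are hypothetical, the vectors are toy values, and these proxies are assumptions rather than established metrics. Faithfulness is not covered here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def align_to_chunks(sentence_vecs, chunk_vecs):
    """For each summary sentence, the index of the most similar source chunk."""
    return [
        max(range(len(chunk_vecs)), key=lambda j: cosine(s, chunk_vecs[j]))
        for s in sentence_vecs
    ]

def temporal_order_score(aligned):
    """Fraction of sentence pairs whose source chunks are in non-decreasing order."""
    pairs = [(i, j) for i in range(len(aligned)) for j in range(i + 1, len(aligned))]
    if not pairs:
        return 1.0
    in_order = sum(1 for i, j in pairs if aligned[i] <= aligned[j])
    return in_order / len(pairs)

def coverage_score(aligned, n_chunks):
    """Fraction of source chunks that at least one summary sentence aligns to."""
    return len(set(aligned)) / n_chunks if n_chunks else 0.0

# Toy vectors: three source chunks and two summary sentences.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_vecs = [[0.9, 0.1], [0.1, 0.9]]

aligned = align_to_chunks(sentence_vecs, chunk_vecs)
print(aligned, temporal_order_score(aligned), coverage_score(aligned, len(chunk_vecs)))
```

In the notebook, the chunk vectors would come from `audio_embeddings` in chunk order and the sentence vectors from `get_embeddings` applied to each summary sentence.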


## Summary:

### Data Analysis Key Findings

*   The temporal-aware summarization pipeline produced no output because no valid video was ever loaded: `/tmp/interview_video.mp4` did not exist, and the placeholder written in its place was a text file that ffmpeg could not read. Every later stage (audio extraction, transcription, vector encoding, LLM summarization) therefore had nothing to process.
*   The library installation steps (moviepy, transformers, Whisper) completed successfully.
*   The code for chunking, transcription, and vector encoding was written but never ran on real data, because the input each stage needs (video file, audio chunks, transcriptions) was missing.
*   The final evaluation step correctly identified that no summary was generated, thus making evaluation impossible.

### Insights or Next Steps

*   The primary next step is to ensure a valid video file is provided and accessible for the initial video processing step.
*   Given the environmental limitations noted for running large LLMs, consider using smaller models, exploring cloud-based LLM APIs, or setting up a local environment with sufficient computational resources for the summarization step.
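Until a real recording is available, the stages after audio extraction can at least be smoke-tested with a valid placeholder WAV file, which a renamed text file cannot provide. The sketch below writes a short sine tone using only the standard library; `write_tone_wav` is a hypothetical helper, and the 16 kHz mono format is an assumption. A tone contains no speech, so this only exercises file handling, not transcription quality.

```python
import math
import os
import struct
import tempfile
import wave

def write_tone_wav(path, seconds=2.0, freq_hz=440.0, sample_rate=16000):
    """Write a mono 16-bit PCM sine tone to path and return the path."""
    n_samples = int(seconds * sample_rate)
    frames = bytearray()
    for n in range(n_samples):
        # Scale to 30% of full range to leave headroom.
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        frames += struct.pack("<h", sample)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return path

with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = write_tone_wav(os.path.join(tmp_dir, "audio_chunk_1.wav"))
    # Read the file back to confirm it is a well-formed WAV.
    with wave.open(wav_path, "rb") as wav:
        n_frames = wav.getnframes()
        frame_rate = wav.getframerate()
        n_channels = wav.getnchannels()

print(n_frames, frame_rate, n_channels)
```

Writing such a file into the audio chunk directory lets the file listing and model loading paths run end to end before the real video is supplied.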
