<a href="https://colab.research.google.com/github/prisar/ai_notebooks/blob/main/nb_103.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Authentication and service account setup
from google.colab import auth
from google.auth import default
import os

# Authenticate with Google Cloud
auth.authenticate_user()

# Set project ID
os.environ['GOOGLE_CLOUD_PROJECT'] = 'mrc-quant-ml'


In [2]:

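# Install required packages, including those imported in later cells
# moviepy is pinned below 2.0 because `moviepy.editor` was removed in moviepy 2.0
!pip install -q google-genai google-cloud-aiplatform google-cloud-storage "moviepy<2" nest_asyncio sentence-transformers scikit-learn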

In [3]:
# Import and initialize
from google.genai import Client
from google.genai.types import Part, VideoMetadata, FileData
from google.cloud import storage
import asyncio
from concurrent.futures import ThreadPoolExecutor
import nest_asyncio
import time # Import time module for delays
import moviepy.editor as mp # Import moviepy for video duration


def summarize_video_chunk(video_uri: str, start_offset: str, end_offset: str, prompt: str = "Analyze this video and provide a summary."):
    """Summarizes a video chunk using the Gemini API."""
    client = Client(
        vertexai=True,
        project="mrc-quant-ml",
        location="us-central1",
    )

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[
            Part(
                video_metadata=VideoMetadata(
                    fps=1,
                    start_offset=start_offset,
                    end_offset=end_offset
                ),
                file_data=FileData(
                    file_uri=video_uri,
                    mime_type="video/mp4",
                ),
            ),
            prompt
        ],
    )
    return response.text

# Function to get video duration
async def get_video_duration(video_uri: str) -> int:
    """Gets the duration in whole seconds of a video at a GCS URI (returns 0 on failure)."""
    temp_file = None
    try:
        # Assuming the video is in a GCS bucket
        client = storage.Client()
        bucket_name, blob_name = video_uri.replace("gs://", "").split("/", 1)
        bucket = client.get_bucket(bucket_name)
        blob = bucket.blob(blob_name)
        # Download the video temporarily to get duration (consider optimizing this)
        temp_file = f"/tmp/{blob_name.split('/')[-1]}"
        blob.download_to_filename(temp_file)
        clip = mp.VideoFileClip(temp_file)
        try:
            duration = int(clip.duration)
        finally:
            clip.close() # Release the underlying ffmpeg reader
        return duration
    except Exception as e:
        print(f"Error getting video duration: {e}")
        return 0 # Return 0 or raise an error based on desired behavior
    finally:
        # Clean up the temporary file even if reading the clip failed
        if temp_file and os.path.exists(temp_file):
            os.remove(temp_file)

# Batch processing optimization
async def process_video_chunks_parallel(video_uri: str, chunk_duration_minutes: int = 30, max_workers: int = 4, delay_seconds: int = 1):
    """Summarizes fixed-length chunks of a video in parallel (30 minutes each by default), with each worker pausing for delay_seconds after its API call."""
    chunk_duration = chunk_duration_minutes * 60 # Convert minutes to seconds

    # Duration is hard-coded for this video; to look it up instead, use: await get_video_duration(video_uri)
    total_duration = 7302
    print(f"Total video duration: {total_duration} seconds")
    if total_duration == 0:
        print("Could not get video duration. Aborting processing.")
        return []

    chunks = [(i, min(i + chunk_duration, total_duration))
              for i in range(0, total_duration, chunk_duration)]

    def run_summarize_chunk(start, end):
        """Summarizes one chunk in a worker thread, then pauses before the worker takes its next chunk."""
        summary = summarize_video_chunk(video_uri, f"{start}s", f"{end}s")
        time.sleep(delay_seconds) # Add delay between calls
        return summary

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # We are already inside a coroutine, so use the loop that is running it
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(
                executor,
                run_summarize_chunk,
                start,
                end
            ) for start, end in chunks
        ]

        summaries = await asyncio.gather(*tasks)

    return summaries

# Example usage with error handling
video_uri = "gs://mrc-quant-ml-video-analysis/videoplayback.mp4"

# Example of how to use the parallel processing function

nest_asyncio.apply() # Apply this if running in Colab

try:
    # Add delay_seconds parameter to control delay
    all_summaries = asyncio.run(process_video_chunks_parallel(video_uri, chunk_duration_minutes=30, delay_seconds=5))
    for i, summary in enumerate(all_summaries):
        print(f"Summary for chunk {i+1}:\n{summary}\n")
except Exception as e:
    print(f"Error during parallel processing: {e}")


Total video duration: 7302 seconds
Summary for chunk 1:
Here is a summary of the video:

The presenter, hu_po, is discussing AI language learning models. He begins by doing a sound test and then introduces that the stream will cover the topic "Diffusion Beats Autoregressive in Data-Constrained Settings". 

Hu_po shares that the thumbnail for the video was generated by Open AI and he briefly comments on the AI mispelling “autoregressive” as “autoregressssive”. 

Hu_po brings up a research paper with the same name that came out July 21, 2025 from Carnegie Mellon University to give context to the topic he will discuss. 

The presenter emphasizes the core argument of the paper, which is that if you are in a data constrained environment, then diffusion models are the right way to go. 

Hu_po also states that the Welsh Labs video has a great explanation on machine learning and he provides a video clip explaining how AI works for generating images. He also talks about “threeblue1brown” and th
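
The chunk boundaries used by `process_video_chunks_parallel` can be checked in isolation. The sketch below repeats the same `range`-based split for the hard-coded 7302-second duration; `chunk_bounds` is a hypothetical helper name, not part of the notebook. It suggests why the run produces five summaries: four full 30-minute chunks plus a final 102-second remainder.

```python
def chunk_bounds(total_duration: int, chunk_duration: int) -> list[tuple[int, int]]:
    """Split [0, total_duration) into consecutive (start, end) ranges in seconds."""
    return [(start, min(start + chunk_duration, total_duration))
            for start in range(0, total_duration, chunk_duration)]

# 7302 seconds is the duration hard-coded in the notebook; 30 * 60 is the default chunk length
bounds = chunk_bounds(7302, 30 * 60)
for start, end in bounds:
    # The notebook passes these offsets to VideoMetadata as strings such as "1800s"
    print(f"{start}s -> {end}s")
```

The last range is clamped to the total duration, so a short final chunk is summarized on its own rather than dropped.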

# Task
Create a relevancy scoring engine that scores each chunk summary against a user query.

## Define relevancy scoring logic

### Subtask:
Determine how relevancy will be calculated. This could involve using techniques like TF-IDF, cosine similarity, or embedding-based similarity to compare the user query with each summary.


**Reasoning**:
Discuss the different approaches for calculating relevancy, considering their pros and cons for text summaries and user queries, and select the most appropriate method with justification.



In [4]:
# Discuss different approaches for calculating relevancy.

# 1. TF-IDF (Term Frequency-Inverse Document Frequency):
#    - Pros: Simple to understand and implement. Captures term importance within a document relative to a corpus.
#    - Cons: Doesn't capture semantic meaning or relationships between words, so a query and a summary that express the same idea in different words score low. Treats documents as bags of words. Vector size grows with the vocabulary.

# 2. Cosine Similarity on TF-IDF vectors:
#    - Pros: Extends TF-IDF by providing a measure of similarity between documents based on their TF-IDF representations.
#    - Cons: Still suffers from the limitations of TF-IDF regarding semantic understanding. High-dimensional and sparse vectors can be an issue.

# 3. Embedding-based Similarity (e.g., using Sentence Transformers):
#    - Pros: Captures semantic meaning and context of words and sentences by representing them as dense vectors in a continuous space. Can handle synonyms and related terms effectively. Generally performs better on tasks requiring semantic understanding.
#    - Cons: Requires a pre-trained model (which can be large). Computationally more expensive than TF-IDF for vector generation.

# Selecting the most appropriate method:
# Given that the data consists of text summaries and the goal is to score relevancy to a user query, understanding the semantic meaning of both the query and the summaries is crucial. TF-IDF and cosine similarity on TF-IDF vectors are limited in their ability to capture semantic relationships. Embedding-based similarity, particularly using models like Sentence Transformers, excels at generating semantically meaningful representations of text. Therefore, embedding-based similarity is the most appropriate method for this task as it will provide a more accurate measure of relevancy by considering the underlying meaning of the text.
print("Discussion of relevancy calculation methods completed.")

Discussion of relevancy calculation methods completed.
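
For comparison with the embedding approach selected above, the lexical baseline (options 1 and 2) can be sketched with the standard library alone. This is a simplified TF-IDF with a smoothed IDF term, not scikit-learn's implementation; `tokenize`, `tfidf_scores`, and the two sample texts are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_scores(query: str, docs: list[str]) -> list[float]:
    """Cosine similarity between the query and each doc over TF-IDF vectors."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of docs that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def vectorize(tokens: list[str]) -> dict[str, float]:
        counts = Counter(tokens)
        # Smoothed IDF keeps weights positive and handles query terms unseen in docs
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    q_vec = vectorize(tokenize(query))
    return [cosine(q_vec, vectorize(tokens)) for tokens in tokenized]

docs = [
    "Diffusion models outperform autoregressive models when data is limited.",
    "The presenter runs a sound test before the stream starts.",
]
scores = tfidf_scores("diffusion versus autoregressive models", docs)
print(scores)
```

A text that shares no tokens with the query scores exactly 0.0 under this scheme, even if it paraphrases the query, which is the limitation that motivates the embedding-based method.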


## Implement the scoring function

### Subtask:
Write a Python function that takes a user query and a list of summaries as input and returns a list of scores, one for each summary.


**Reasoning**:
Implement the function to calculate relevancy scores using embedding-based similarity as determined in the previous step.



In [5]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

def score_relevancy(user_query: str, summaries: list[str]) -> list[float]:
    """
    Calculates relevancy scores for a list of summaries based on a user query
    using embedding-based cosine similarity.

    Args:
        user_query: The user's query string.
        summaries: A list of summary strings.

    Returns:
        A list of cosine similarity scores, one for each summary.
    """
    # Encode the user query and each summary into embeddings
    query_embedding = model.encode(user_query)
    summary_embeddings = model.encode(summaries)

    # Calculate the cosine similarity between the query embedding and each summary embedding
    # cosine_similarity expects 2D inputs, so wrap the 1D query embedding in a list
    similarity_scores = cosine_similarity([query_embedding], summary_embeddings)

    # The result has shape (1, number of summaries); take its only row as a list of scores
    return similarity_scores[0].tolist()

# Example usage (optional, for testing)
# user_query = "diffusion models"
# summaries = all_summaries # Assuming all_summaries is available from previous execution
# relevancy_scores = score_relevancy(user_query, summaries)
# print(f"Relevancy scores for query '{user_query}': {relevancy_scores}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




## Apply the scoring function

### Subtask:
Use the implemented scoring function to score the existing summaries based on a user query.


**Reasoning**:
Define the user query and summaries, then call the scoring function to get the relevancy scores.



In [6]:
user_query = "What is the difference between diffusion and autoregressive models?"
summaries = all_summaries
relevancy_scores = score_relevancy(user_query, summaries)
print(f"Relevancy scores for query '{user_query}': {relevancy_scores}")


Relevancy scores for query 'What is the difference between diffusion and autoregressive models?': [0.4642210304737091, 0.32776379585266113, 0.5090007781982422, 0.5026441216468811, 0.32718920707702637]
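
Because the scores come back in chunk order, ranking the chunks is a sort over (chunk number, score) pairs. A minimal sketch, using the scores printed above rounded to four decimals; `rank_chunks` and `printed_scores` are hypothetical names, not part of the notebook.

```python
def rank_chunks(scores: list[float]) -> list[tuple[int, float]]:
    """Return (chunk_number, score) pairs, highest score first; chunk numbers start at 1."""
    numbered = enumerate(scores, start=1)
    return sorted(numbered, key=lambda pair: pair[1], reverse=True)

# Scores copied from the cell output above, rounded to four decimals
printed_scores = [0.4642, 0.3278, 0.5090, 0.5026, 0.3272]
ranking = rank_chunks(printed_scores)
for chunk_number, score in ranking:
    print(f"Chunk {chunk_number}: {score:.4f}")
```

With 30-minute chunks, the top-ranked chunk 3 corresponds to the 3600s to 5400s segment of the video.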


## Present the results

### Subtask:
Display the summaries along with their relevancy scores.


**Reasoning**:
Display the summaries and their corresponding relevancy scores in a readable format.



In [7]:
print(f"Displaying summaries and their relevancy scores for the query: '{user_query}'\n")

for i, (summary, score) in enumerate(zip(summaries, relevancy_scores)):
    print(f"--- Summary {i+1} ---")
    print(f"Score: {score:.4f}")
    print(f"Summary:\n{summary}\n")

Displaying summaries and their relevancy scores for the query: 'What is the difference between diffusion and autoregressive models?'

--- Summary 1 ---
Score: 0.4642
Summary:
Here is a summary of the video:

The presenter, hu_po, is discussing AI language learning models. He begins by doing a sound test and then introduces that the stream will cover the topic "Diffusion Beats Autoregressive in Data-Constrained Settings". 

Hu_po shares that the thumbnail for the video was generated by Open AI and he briefly comments on the AI mispelling “autoregressive” as “autoregressssive”. 

Hu_po brings up a research paper with the same name that came out July 21, 2025 from Carnegie Mellon University to give context to the topic he will discuss. 

The presenter emphasizes the core argument of the paper, which is that if you are in a data constrained environment, then diffusion models are the right way to go. 

Hu_po also states that the Welsh Labs video has a great explanation on machine learning a

## Summary:

### Data Analysis Key Findings

*   Embedding-based similarity was chosen as the most suitable method for calculating relevancy scores due to its ability to capture semantic meaning.
*   A Python function `score_relevancy` was successfully implemented using a pre-trained Sentence Transformer model and cosine similarity to calculate relevancy scores between a user query and a list of summaries.
*   The `score_relevancy` function was applied to score the existing summaries against the user query "What is the difference between diffusion and autoregressive models?". The five scores ranged from about 0.33 to 0.51, with chunks 3 and 4 scoring highest.
*   The calculated relevancy scores for each summary were successfully displayed alongside their corresponding summaries.

### Insights or Next Steps

*   The relevancy scoring engine is functional and provides a numerical measure of how relevant each summary is to the user's query. This can be used to rank or filter the summaries.
*   Consider exploring different pre-trained sentence transformer models or fine-tuning a model on a domain-specific dataset to potentially improve the accuracy of the relevancy scores.
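
The filtering mentioned in the next steps could look like the following sketch. The 0.4 cutoff is an arbitrary assumption for illustration, not a value from the notebook, and `filter_relevant` is a hypothetical helper. Placeholder strings stand in for the real chunk summaries.

```python
def filter_relevant(summaries: list[str], scores: list[float],
                    min_score: float = 0.4) -> list[tuple[float, str]]:
    """Keep (score, summary) pairs at or above min_score, best first."""
    if len(summaries) != len(scores):
        raise ValueError("summaries and scores must have the same length")
    kept = [(score, summary)
            for summary, score in zip(summaries, scores)
            if score >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

# Placeholder summaries; in the notebook these would be the entries of all_summaries
example_summaries = [f"summary of chunk {i}" for i in range(1, 6)]
# Scores from the notebook output, rounded to four decimals
example_scores = [0.4642, 0.3278, 0.5090, 0.5026, 0.3272]
relevant = filter_relevant(example_summaries, example_scores)
for score, summary in relevant:
    print(f"{score:.4f}  {summary}")
```

Cosine scores from different embedding models are not necessarily on comparable scales, so a cutoff like this would likely need retuning after swapping models.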
