<a href="https://colab.research.google.com/github/prisar/ai_notebooks/blob/main/nb_081.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
You are given parts of a video and a summary of each part. Write the LLM call logic that combines these and asks Gemini to find the relevant chunks, returning them in JSON format with a relevance score for each. Finally, take the top-scoring chunks and feed them to Gemini for answer generation.

## Represent video chunks

### Subtask:
I will define a data structure to represent the video chunks and their summaries. This will likely be a list of dictionaries, where each dictionary contains the chunk ID, the summary, and the video segment.


**Reasoning**:
The subtask is to create a Python list of dictionaries representing video chunks. This involves defining the structure and populating it with sample data.



In [1]:
video_chunks = [
    {
        'chunk_id': 'chunk_1',
        'summary': 'Introduction to data analysis and its importance.',
        'video_segment': 'Welcome to this course on data analysis. Data analysis is crucial for making informed decisions.'
    },
    {
        'chunk_id': 'chunk_2',
        'summary': 'Steps involved in data cleaning and preprocessing.',
        'video_segment': 'The first step in data analysis is data cleaning. This involves handling missing values and outliers.'
    },
    {
        'chunk_id': 'chunk_3',
        'summary': 'Exploring data visualization techniques.',
        'video_segment': 'Data visualization helps in understanding patterns in the data. We will look at various plotting techniques.'
    },
    {
        'chunk_id': 'chunk_4',
        'summary': 'Introduction to machine learning concepts.',
        'video_segment': 'After exploring the data, we can move to machine learning. This involves building predictive models.'
    }
]

print(video_chunks)

[{'chunk_id': 'chunk_1', 'summary': 'Introduction to data analysis and its importance.', 'video_segment': 'Welcome to this course on data analysis. Data analysis is crucial for making informed decisions.'}, {'chunk_id': 'chunk_2', 'summary': 'Steps involved in data cleaning and preprocessing.', 'video_segment': 'The first step in data analysis is data cleaning. This involves handling missing values and outliers.'}, {'chunk_id': 'chunk_3', 'summary': 'Exploring data visualization techniques.', 'video_segment': 'Data visualization helps in understanding patterns in the data. We will look at various plotting techniques.'}, {'chunk_id': 'chunk_4', 'summary': 'Introduction to machine learning concepts.', 'video_segment': 'After exploring the data, we can move to machine learning. This involves building predictive models.'}]


## Formulate the prompt

### Subtask:
I will create a prompt for the first Gemini call. This prompt will instruct Gemini to act as a relevance-scoring engine. It will take the user's question and the video chunk summaries as input and output a JSON object with relevance scores for each chunk.


**Reasoning**:
Define a string variable to hold the prompt for the first Gemini call, instructing it to act as a relevance scoring engine and output a JSON object with relevance scores for video chunk summaries based on a user question.



In [2]:
relevance_prompt = """
You are a relevance scoring engine. Your task is to assess the relevance of provided video chunk summaries to a user's question.

You will be given a user question and a list of video chunk summaries. For each chunk summary, you need to provide a relevance score.

Output your response as a JSON object. The keys of the JSON object should be the 'chunk_id' of each video chunk, and the values should be an integer relevance score between 0 and 10, where 10 is highly relevant and 0 is not relevant at all.

Here is the user question:
{user_question}

Here are the video chunk summaries:
{video_chunk_summaries}
"""
print(relevance_prompt)


You are a relevance scoring engine. Your task is to assess the relevance of provided video chunk summaries to a user's question.

You will be given a user question and a list of video chunk summaries. For each chunk summary, you need to provide a relevance score.

Output your response as a JSON object. The keys of the JSON object should be the 'chunk_id' of each video chunk, and the values should be an integer relevance score between 0 and 10, where 10 is highly relevant and 0 is not relevant at all.

Here is the user question:
{user_question}

Here are the video chunk summaries:
{video_chunk_summaries}
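
The two placeholders in the template still have to be filled before the prompt can be sent. The following is a minimal sketch of one way to do that, assuming each summary line is prefixed with its `chunk_id` so the model can key its JSON by ID as the prompt requests. The helper name `build_relevance_prompt`, the shortened `TEMPLATE`, and the sample question are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: fills a relevance template with a question and chunk summaries.
TEMPLATE = (
    "Here is the user question:\n{user_question}\n\n"
    "Here are the video chunk summaries:\n{video_chunk_summaries}\n"
)


def build_relevance_prompt(template, question, chunks):
    # One line per chunk in the form "chunk_id: summary", so the IDs are visible to the model.
    lines = [f"{c['chunk_id']}: {c['summary']}" for c in chunks]
    return template.format(
        user_question=question,
        video_chunk_summaries="\n".join(lines),
    )


sample_chunks = [
    {"chunk_id": "chunk_1", "summary": "Introduction to data analysis and its importance."},
    {"chunk_id": "chunk_2", "summary": "Steps involved in data cleaning and preprocessing."},
]
sample_question = "How do I clean a dataset?"  # assumed example question
formatted = build_relevance_prompt(TEMPLATE, sample_question, sample_chunks)
print(formatted)
```

The same helper should work with the full `relevance_prompt` string, since it uses the same two placeholder names.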



## First Gemini call (relevance scoring)

### Subtask:
Generate the Python code to make the first call to the Gemini API. This code will send the prompt and the data to Gemini and receive the JSON output with relevance scores.


**Reasoning**:
Generate the Python code to define the user question, prepare the video chunk summaries for the prompt, format the prompt, configure the Gemini API key, initialize the Gemini model, make the API call, and print the raw response.



In [5]:
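import json
import re
import getpass

import google.generativeai as genai

# NOTE: the question and the prompt formatting were missing from this cell.
# The question below is an assumed placeholder; replace it with your own.
user_question = "What are the steps involved in data analysis?"

# Pass only the summaries, one per line; this matches the keys in the recorded output below.
video_chunk_summaries = "\n".join(chunk['summary'] for chunk in video_chunks)

formatted_relevance_prompt = relevance_prompt.format(
    user_question=user_question,
    video_chunk_summaries=video_chunk_summaries
)

api_key = getpass.getpass("Enter your Google API Key: ")
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-1.5-flash-latest')

response = model.generate_content(formatted_relevance_prompt)

# The response may wrap the JSON in extra text, so extract the outermost {...} span.
json_match = re.search(r'\{.*\}', response.text, re.DOTALL)

if json_match:
    relevance_scores = json.loads(json_match.group(0))
    print(relevance_scores)
else:
    print("Could not find a valid JSON object in the response.")
    relevance_scores = None  # Later cells must handle this case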


Enter your Google API Key: ··········
{'Introduction to data analysis and its importance.': 2, 'Steps involved in data cleaning and preprocessing.': 8, 'Exploring data visualization techniques.': 7, 'Introduction to machine learning concepts.': 3}
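
Because the model's reply is free text, it can help to validate the parsed scores before using them. The sketch below is one possible approach, not the notebook's method: the function name `parse_relevance_scores` and the `valid_keys` parameter are hypothetical, and the choice to drop unknown keys and non-integer values and to clamp scores to the 0 to 10 range is an assumption.

```python
import json
import re


def parse_relevance_scores(text, valid_keys=None):
    """Extract a JSON object from free text and keep only well-formed scores."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    data = json.loads(match.group(0))

    scores = {}
    for key, value in data.items():
        if valid_keys is not None and key not in valid_keys:
            continue  # skip keys that do not correspond to a known chunk
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            continue  # skip non-integer scores
        scores[key] = max(0, min(10, value))  # clamp to the 0..10 range
    return scores


# Made-up response text that wraps the JSON in a Markdown fence.
raw = 'Sure! ```json\n{"chunk_1": 2, "chunk_2": 8, "chunk_9": 11, "chunk_3": "high"}\n```'
parsed = parse_relevance_scores(raw, valid_keys={"chunk_1", "chunk_2", "chunk_3"})
print(parsed)
```

Here `chunk_9` is dropped because it is not a known key and `chunk_3` is dropped because its value is not an integer.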


## Process the relevance scores

### Subtask:
Process the JSON output from the first Gemini call and select the top N most relevant chunks based on their scores.


**Reasoning**:
The goal is to select the top N most relevant chunks from the `relevance_scores` dictionary. This involves defining N, sorting the dictionary by values, and extracting the keys of the top N items. These steps can be combined into a single code block.



In [6]:
top_n = 2
sorted_scores = sorted(relevance_scores.items(), key=lambda item: item[1], reverse=True)
top_chunks = [item[0] for item in sorted_scores[:top_n]]
print(top_chunks)

['Steps involved in data cleaning and preprocessing.', 'Exploring data visualization techniques.']
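
The model keyed its scores by summary text in the recorded run, while the prompt asks for `chunk_id` keys, so a selection step that accepts either form may be more robust. This is a minimal sketch under that assumption; the function name `select_top_chunks` is hypothetical, and the chunk list is abbreviated to the fields it needs.

```python
def select_top_chunks(scores, chunks, top_n=2):
    """Return up to top_n chunk dicts, highest score first."""
    # Index chunks by both chunk_id and summary, since the model may key by either.
    lookup = {}
    for chunk in chunks:
        lookup[chunk["chunk_id"]] = chunk
        lookup[chunk["summary"]] = chunk

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = []
    for key, _score in ranked:
        if len(selected) >= top_n:
            break
        chunk = lookup.get(key)
        if chunk is not None and chunk not in selected:
            selected.append(chunk)
    return selected


chunks = [
    {"chunk_id": "chunk_1", "summary": "Introduction to data analysis and its importance."},
    {"chunk_id": "chunk_2", "summary": "Steps involved in data cleaning and preprocessing."},
    {"chunk_id": "chunk_3", "summary": "Exploring data visualization techniques."},
    {"chunk_id": "chunk_4", "summary": "Introduction to machine learning concepts."},
]
# Scores copied from the recorded output of the first Gemini call.
scores = {
    "Introduction to data analysis and its importance.": 2,
    "Steps involved in data cleaning and preprocessing.": 8,
    "Exploring data visualization techniques.": 7,
    "Introduction to machine learning concepts.": 3,
}
top_ids = [c["chunk_id"] for c in select_top_chunks(scores, chunks, top_n=2)]
print(top_ids)  # ['chunk_2', 'chunk_3']
```

Returning the chunk dictionaries rather than the keys means the next step can read `video_segment` directly without a second lookup.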


## Formulate the second prompt (answer generation)

### Subtask:
Create a second prompt for Gemini. This prompt will include the user's question and the full content of the top-ranked video chunks. It will instruct Gemini to generate a comprehensive answer based on the provided context.


**Reasoning**:
Define the string variable `answer_prompt` with the specified content and placeholders.



In [7]:
answer_prompt = """
You are an answer generator. Your task is to generate a comprehensive answer to a user's question based *only* on the provided video chunks content. Do not use any external knowledge.

Here is the user question:
{user_question}

Here is the content of the top video chunks:
{video_chunks_content}

Generate a comprehensive answer based *only* on the above content.
"""
print(answer_prompt)


You are an answer generator. Your task is to generate a comprehensive answer to a user's question based *only* on the provided video chunks content. Do not use any external knowledge.

Here is the user question:
{user_question}

Here is the content of the top video chunks:
{video_chunks_content}

Generate a comprehensive answer based *only* on the above content.



## Second Gemini call (answer generation)

### Subtask:
Make the second Gemini call to generate a comprehensive answer.


**Reasoning**:
I will now proceed with the final subtask of making the second Gemini call to generate a comprehensive answer. I will create the `top_chunks_content` by iterating through the `top_chunks` and finding the corresponding `video_segment` from the `video_chunks` list. Then, I will format the `answer_prompt` with the `user_question` and the created `top_chunks_content`. Finally, I will make the API call to Gemini and print the generated answer.



In [8]:
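# top_chunks holds summary strings, so look up each chunk by its summary
# and concatenate the matching transcript segments in ranked order.
top_chunks_content = ""
for chunk_summary in top_chunks:
    for chunk in video_chunks:
        if chunk['summary'] == chunk_summary:
            top_chunks_content += chunk['video_segment'] + "\n"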

formatted_answer_prompt = answer_prompt.format(
    user_question=user_question,
    video_chunks_content=top_chunks_content
)

answer_response = model.generate_content(formatted_answer_prompt)

print(answer_response.text)

Based on the provided video chunks, the steps involved in data analysis begin with data cleaning.  Data cleaning includes addressing missing values and outliers.  Following data cleaning, data visualization is used to understand patterns within the data, employing various plotting techniques.
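
The two calls can also be wrapped in a single function, which makes the flow easier to test without an API key. The sketch below is an illustration under stated assumptions rather than the notebook's implementation: `answer_question`, the `generate` parameter, the shortened prompts, and the `fake_generate` stub are all hypothetical. With the real model, `generate` could be `lambda p: model.generate_content(p).text`.

```python
import json
import re


def answer_question(question, chunks, generate, top_n=2):
    """Two-step pipeline: score summaries, then answer from the top segments.

    `generate` is any callable that maps a prompt string to response text.
    """
    summaries = "\n".join(f"{c['chunk_id']}: {c['summary']}" for c in chunks)
    scoring_prompt = (
        "Score each summary's relevance to the question from 0 to 10. "
        "Reply with a JSON object keyed by chunk_id.\n"
        f"Question: {question}\nSummaries:\n{summaries}"
    )
    match = re.search(r"\{.*\}", generate(scoring_prompt), re.DOTALL)
    scores = json.loads(match.group(0)) if match else {}

    # Keep only keys that name a known chunk, then rank by score.
    by_id = {c["chunk_id"]: c for c in chunks}
    ranked = sorted(
        (k for k in scores if k in by_id), key=lambda k: scores[k], reverse=True
    )
    context = "\n".join(by_id[k]["video_segment"] for k in ranked[:top_n])
    answer_prompt = (
        "Answer using only this content.\n"
        f"Question: {question}\nContent:\n{context}"
    )
    return generate(answer_prompt)


# Stub standing in for Gemini so the sketch runs offline.
def fake_generate(prompt):
    if prompt.startswith("Score"):
        return '{"chunk_1": 2, "chunk_2": 8}'
    return "ANSWER FROM: " + prompt.split("Content:\n", 1)[1]


demo_chunks = [
    {"chunk_id": "chunk_1", "summary": "Intro.", "video_segment": "Welcome."},
    {"chunk_id": "chunk_2", "summary": "Cleaning.", "video_segment": "Handle missing values."},
]
result = answer_question("How to clean data?", demo_chunks, fake_generate, top_n=1)
print(result)  # ANSWER FROM: Handle missing values.
```

Injecting `generate` keeps the pipeline logic separate from the API client, so different prompts or models can be swapped in without changing the selection code.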



## Summary:

### Data Analysis Key Findings
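* The relevance scoring prompt successfully instructed the Gemini model to return a JSON object with a relevance score for each video chunk, although the keys in the recorded output are the summary texts rather than the `chunk_id` values the prompt requested.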
* The top two most relevant chunks were identified as "Steps involved in data cleaning and preprocessing" and "Exploring data visualization techniques" based on the relevance scores provided by the Gemini model.
* The final answer generated by the Gemini model was a concise summary of the data analysis steps, beginning with data cleaning and followed by data visualization, as described in the top-ranked video chunks.

### Insights or Next Steps
* The relevance scoring and answer generation process can be further refined by experimenting with different prompt engineering techniques to improve the accuracy and relevance of the results.
