# Multimodal Analysis with Snowflake Cortex AI

This notebook demonstrates how to use Snowflake Cortex AI for multimodal analysis: image analysis with AI_COMPLETE and audio transcription with AI_TRANSCRIBE.

## Setup

First, let's import the required packages and set up our session:

In [None]:
# Import required packages
import pandas as pd
import json
from snowflake.snowpark.context import get_active_session

# Setup session
session = get_active_session()
session.use_schema("MULTIMODAL_ANALYSIS.MEDIA")

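Next, create tables that reference the image and audio files in our stages. The DIRECTORY table function used here requires the stages to have directory tables enabled:

In [None]: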
-- Create an image table that references the files in our stage
CREATE OR REPLACE TABLE MULTIMODAL_ANALYSIS.MEDIA.IMAGE_TABLE AS
  SELECT 
    RELATIVE_PATH AS image_path,
    TO_FILE('@MULTIMODAL_ANALYSIS.MEDIA.IMAGES', RELATIVE_PATH) AS img_file
  FROM DIRECTORY('@MULTIMODAL_ANALYSIS.MEDIA.IMAGES');

-- Create an audio table that references the files in our stage
CREATE OR REPLACE TABLE MULTIMODAL_ANALYSIS.MEDIA.AUDIO_TABLE AS
  SELECT 
    RELATIVE_PATH AS audio_path,
    TO_FILE('@MULTIMODAL_ANALYSIS.MEDIA.AUDIO', RELATIVE_PATH) AS audio_file
  FROM DIRECTORY('@MULTIMODAL_ANALYSIS.MEDIA.AUDIO');


## View Available Media Files

Let's see what images and audio files we have available:

In [None]:
# View available images
# collect() returns a list of Row objects (not a pandas DataFrame)
image_df = session.sql("SELECT * FROM IMAGE_TABLE").collect()
print(f"Found {len(image_df)} images in the stage")

# Display the first few images
for i, row in enumerate(image_df[:3]):
    print(f"Image {i+1}: {row['IMAGE_PATH']}")

# View available audio files
audio_df = session.sql("SELECT * FROM AUDIO_TABLE").collect()
print(f"\nFound {len(audio_df)} audio files in the stage")

# Display the first few audio files
for i, row in enumerate(audio_df[:3]):
    print(f"Audio {i+1}: {row['AUDIO_PATH']}")

## Image Analysis with AI_COMPLETE

Create a function to analyze images with AI_COMPLETE:

In [None]:
# Function to analyze an image with chosen AI model
def analyze_image(image_path, prompt_template, model='claude-3-5-sonnet'):
    # Escape single quotes in the prompt
    escaped_prompt = prompt_template.replace("'", "''")
    
    sql_query = f"""
    SELECT 
      AI_COMPLETE(
        '{model}', 
        '{escaped_prompt}',
         TO_FILE('@IMAGES', '{image_path}')
      )
    """
    result = session.sql(sql_query).collect()
    return result[0][0]
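
The function above escapes single quotes in the prompt only; the model name and image path are interpolated into the SQL text as-is. Here is a minimal sketch of a query builder that escapes every interpolated value. The helper names `sql_literal` and `build_image_query` are hypothetical (they are not part of Snowpark), and the escaping assumes Snowflake's single-quoted string rules, where a backslash starts an escape sequence and a single quote is doubled:

```python
def sql_literal(value: str) -> str:
    """Return value as a single-quoted SQL string literal (hypothetical helper)."""
    # Escape backslashes first, then double any single quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return f"'{escaped}'"


def build_image_query(model: str, prompt: str, image_path: str,
                      stage: str = "@IMAGES") -> str:
    """Build the same AI_COMPLETE statement that analyze_image sends."""
    return (
        "SELECT AI_COMPLETE("
        f"{sql_literal(model)}, "
        f"{sql_literal(prompt)}, "
        f"TO_FILE({sql_literal(stage)}, {sql_literal(image_path)})"
        ")"
    )


# The file name below is made up to show quote handling in a path.
query = build_image_query(
    "claude-3-5-sonnet",
    "What's in this image?",
    "photos/cat's toy.jpg",
)
print(query)
```

The resulting string could be passed to `session.sql(query).collect()` in place of the f-string in `analyze_image`.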

## Basic Image Analysis

Let's analyze our first image with a custom prompt:

In [None]:
# Example analysis with custom prompt
if len(image_df) > 0:
    image_path = image_df[0]['IMAGE_PATH']  # Get first image
    prompt = "Describe what's happening in this image in the style of a detective investigating a scene. Be brief but detailed."

    analysis = analyze_image(image_path, prompt)
    print(f"Analysis for {image_path}:")
    print(analysis)
else:
    print("No images found in the table.")

## Object Detection

Now let's identify objects in the image:

In [None]:
# Object detection
if len(image_df) > 0:
    image_path = image_df[0]['IMAGE_PATH']
    object_prompt = "List all visible objects in this image and count how many there are of each type."
    object_analysis = analyze_image(image_path, object_prompt)
    print("Objects detected:")
    print(object_analysis)
else:
    print("No images found in the table.")

## Text Extraction from Images

Let's extract any visible text from the image:

In [None]:
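# Text extraction
if len(image_df) > 0:
    # Use the fifth image when it exists; otherwise fall back to the first
    image_index = 4 if len(image_df) > 4 else 0
    image_path = image_df[image_index]['IMAGE_PATH']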
    text_prompt = "Extract and transcribe any visible text in this image. If no text is visible, respond with 'No text detected'."
    text_analysis = analyze_image(image_path, text_prompt)
    print("Text extracted:")
    print(text_analysis)
else:
    print("No images found in the table.")

## Compare Different Models

Let's compare Claude 4 Sonnet with Pixtral-large:

In [None]:
# Try with different models
if len(image_df) > 0:
    image_path = image_df[0]['IMAGE_PATH']
    prompt = "Describe what you see in this image. Be detailed but concise."
    
    # Claude analysis
    claude_analysis = analyze_image(image_path, prompt, model='claude-4-sonnet')
    print(f"Claude 4 Sonnet analysis for {image_path}:")
    print(claude_analysis)
    
    # Pixtral analysis
    pixtral_analysis = analyze_image(image_path, prompt, model='pixtral-large')
    print(f"\nPixtral-large analysis for {image_path}:")
    print(pixtral_analysis)
else:
    print("No images found in the table.")
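
To compare more than two models, the same prompt can be run in a loop. Here is a minimal sketch, assuming a function with the same signature as `analyze_image`. The helper `compare_models` is hypothetical, and `fake_analyze` is a stand-in so the example runs without a Snowflake session. It records failures per model instead of stopping, on the assumption that some models may not be available to every account:

```python
from typing import Callable, Dict, Iterable


def compare_models(
    analyze_fn: Callable[..., str],
    image_path: str,
    prompt: str,
    models: Iterable[str],
) -> Dict[str, str]:
    """Run one prompt against several models and collect the responses."""
    results = {}
    for model in models:
        try:
            results[model] = analyze_fn(image_path, prompt, model=model)
        except Exception as exc:
            # Keep going so one failing model does not hide the others.
            results[model] = f"ERROR: {exc}"
    return results


# Stand-in for analyze_image; "unavailable-model" is a made-up name.
def fake_analyze(image_path, prompt, model="claude-3-5-sonnet"):
    if model == "unavailable-model":
        raise ValueError("model not enabled")
    return f"[{model}] description of {image_path}"


comparison = compare_models(
    fake_analyze,
    "example.jpg",
    "Describe what you see in this image. Be detailed but concise.",
    ["claude-4-sonnet", "pixtral-large", "unavailable-model"],
)
for model, answer in comparison.items():
    print(f"{model}: {answer}")
```

In the notebook, passing `analyze_image` instead of `fake_analyze` would send each request to Cortex.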

## Audio Transcription with AI_TRANSCRIBE

Now let's work with audio files using AI_TRANSCRIBE:

In [None]:
# Function to transcribe audio with different modes
def transcribe_audio(audio_path, mode='text'):
    mode_options = {
        'text': None,  # No second parameter for text mode
        'word': "{'timestamp_granularity': 'word'}",
        'speaker': "{'timestamp_granularity': 'speaker'}"
    }
    
    # Fail early with a clear message instead of a bare KeyError
    if mode not in mode_options:
        raise ValueError(f"mode must be one of {list(mode_options)}, got {mode!r}")
    
    if mode_options[mode] is None:
        sql_query = f"""
        SELECT AI_TRANSCRIBE(
            TO_FILE('@AUDIO', '{audio_path}')
        ) as transcription_result
        """
    else:
        sql_query = f"""
        SELECT AI_TRANSCRIBE(
            TO_FILE('@AUDIO', '{audio_path}'),
            {mode_options[mode]}
        ) as transcription_result
        """
    
    result = session.sql(sql_query).collect()
    # The result arrives as a JSON string; parse it into a dict
    return json.loads(result[0]['TRANSCRIPTION_RESULT'])

## Basic Audio Transcription

Let's transcribe our first audio file:

In [None]:
# Basic text transcription
if len(audio_df) > 0:
    audio_path = audio_df[0]['AUDIO_PATH']
    text_transcription = transcribe_audio(audio_path, mode='text')
    
    print(f"Transcription for {audio_path}:")
    print(f"Duration: {text_transcription['audio_duration']} seconds")
    print(f"Text: {text_transcription['text']}")
else:
    print("No audio files found in the table.")

## Word-Level Timestamps

Let's get word-level timestamps for precise navigation:

In [None]:
# Word-level transcription
if len(audio_df) > 0:
    audio_path = audio_df[0]['AUDIO_PATH']
    word_transcription = transcribe_audio(audio_path, mode='word')
    
    print(f"Word-level transcription for {audio_path}:")
    print(f"Duration: {word_transcription['audio_duration']} seconds")
    print("First 10 words with timestamps:")
    
    for segment in word_transcription['segments'][:10]:
        print(f"  {segment['start']:.2f}s - {segment['end']:.2f}s: {segment['text']}")
else:
    print("No audio files found in the table.")
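
Word-level timestamps make it possible to pull out what was said in a given time range. Here is a minimal sketch, assuming each segment has the `start`, `end`, and `text` keys used in the cell above. The helper `words_between` and the sample segments are hypothetical, not real transcription output:

```python
def words_between(segments, start_s, end_s):
    """Return the text of segments that lie entirely within [start_s, end_s]."""
    return [
        seg["text"]
        for seg in segments
        if seg["start"] >= start_s and seg["end"] <= end_s
    ]


# Made-up segments in the same shape as word_transcription['segments'].
sample_segments = [
    {"start": 0.0, "end": 0.4, "text": "Hello"},
    {"start": 0.4, "end": 0.9, "text": "and"},
    {"start": 0.9, "end": 1.5, "text": "welcome"},
    {"start": 1.5, "end": 2.1, "text": "everyone"},
]

clip = words_between(sample_segments, 0.4, 1.5)
print(" ".join(clip))
```

With real output, `words_between(word_transcription['segments'], 30, 45)` would return the words spoken between 30 and 45 seconds.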

## Speaker Identification

Let's identify different speakers in the audio:

In [None]:
# Speaker identification
if len(audio_df) > 0:
    audio_path = audio_df[0]['AUDIO_PATH']
    speaker_transcription = transcribe_audio(audio_path, mode='speaker')
    
    print(f"Speaker-segmented transcription for {audio_path}:")
    print(f"Duration: {speaker_transcription['audio_duration']} seconds")
    
    # Group by speaker
    speakers = {}
    for segment in speaker_transcription['segments']:
        speaker = segment['speaker_label']
        if speaker not in speakers:
            speakers[speaker] = []
        speakers[speaker].append(segment['text'])
    
    for speaker, texts in speakers.items():
        print(f"\n{speaker}:")
        print(" ".join(texts[:3]))  # Show first 3 segments
else:
    print("No audio files found in the table.")
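
Speaker segments can also be used to estimate how long each speaker talked. Here is a minimal sketch, assuming each segment has `speaker_label`, `start`, and `end` keys as in the cell above. The helper `speaking_time`, the label format, and the sample segments are hypothetical:

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_label"]] += seg["end"] - seg["start"]
    return dict(totals)


# Made-up segments in the same shape as speaker_transcription['segments'].
sample_segments = [
    {"speaker_label": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hi there."},
    {"speaker_label": "SPEAKER_01", "start": 2.5, "end": 4.0, "text": "Hello."},
    {"speaker_label": "SPEAKER_00", "start": 4.0, "end": 7.5, "text": "Let's begin."},
]

totals = speaking_time(sample_segments)
for speaker, seconds in totals.items():
    print(f"{speaker}: {seconds:.1f}s")
```

Overlapping speech, if present in the segments, would be counted once for each speaker involved.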

## Conclusion

This notebook demonstrated how to use Snowflake Cortex AI for multimodal analysis: image analysis with AI_COMPLETE and audio transcription with AI_TRANSCRIBE.

Key takeaways:
- **AI_COMPLETE** analyzes staged images from a text prompt (scene description, object listing, text extraction) and lets you switch between models such as claude-4-sonnet and pixtral-large
- **AI_TRANSCRIBE** returns plain text by default and adds word-level timestamps or speaker labels through the timestamp_granularity option

Try different combinations of prompts, models, and transcription modes to find the best approach for your use case.