In [1]:
!pip install torch transformers sentence-transformers scikit-learn pandas opencv-python moviepy mediapipe


In [1]:
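import os
import cv2
import mediapipe as mp
from transformers import pipeline

try:
    from moviepy.editor import VideoFileClip  # moviepy 1.x
except ImportError:
    from moviepy import VideoFileClip  # moviepy 2.x removed the editor module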




This section initializes and loads the three core AI models that form the backbone of our multimodal system. Each model is responsible for a different modality: speech, vision, and language.
1.  **Whisper**: A state-of-the-art speech-to-text model from OpenAI for transcribing spoken words.
2.  **MediaPipe Hands**: A computer vision model from Google for detecting hand landmarks in real-time.
3.  **Zero-Shot Classifier**: A powerful NLP model (BART) that can classify text into predefined categories (intents) without being explicitly trained on them.
Both Hugging Face pipelines are created with `device=0` (the first CUDA GPU), which speeds up inference considerably. On a CPU-only machine, pass `device=-1` instead.
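As a minimal sketch, the device index could be chosen at runtime instead of being hard-coded, so the same notebook would also run on a CPU-only machine. The helper name `pick_pipeline_device` is hypothetical and not part of the original notebook; it assumes only `torch.cuda.is_available()` and the pipeline convention that `-1` selects the CPU.

```python
def pick_pipeline_device():
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU).

    Hypothetical helper: the integers follow the convention of the
    `device` argument of `transformers.pipeline`.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA backend to query.
        return -1
    return 0 if torch.cuda.is_available() else -1


DEVICE = pick_pipeline_device()
print(f"Using device index: {DEVICE}")
```

The resulting `DEVICE` value could then be passed as `device=DEVICE` to both `pipeline(...)` calls.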

In [2]:
# 1. Speech-to-Text Model (Whisper)
# Using a GPU (device=0) is highly recommended for Whisper
stt_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base.en", device=0)
print("--> Whisper Speech-to-Text model loaded.")

# 2. Hand Gesture Model (MediaPipe)
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.7)
mp_drawing = mp.solutions.drawing_utils
print("--> MediaPipe Hand Gesture model loaded.")

# 3. ZERO-SHOT TEXT-TO-INTENT NLP Model
# We replace our custom classifier with a powerful pre-trained model.
# facebook/bart-large-mnli is a popular choice for this task.
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
# Define our possible intents which will be the candidate labels
CANDIDATE_INTENTS = ["forward", "left", "right", "stop"]
print("--> Zero-Shot Intent NLP model loaded.")
print("\n" + "="*50 + "\nAll models are ready.\n" + "="*50)

Device set to use cuda:0


--> Whisper Speech-to-Text model loaded.
--> MediaPipe Hand Gesture model loaded.


Device set to use cuda:0


--> Zero-Shot Intent NLP model loaded.

All models are ready.


This function takes a string of text (the transcript from the audio) and uses the pre-trained zero-shot classification model to determine which of the `CANDIDATE_INTENTS` it most closely matches. It works "zero-shot," meaning the model was not specifically trained on our "forward," "left," "right," or "stop" commands but can generalize to understand them. The function only returns an intent if the model's confidence score exceeds a specified threshold, preventing uncertain classifications.

In [3]:
def get_intent_from_text_zero_shot(transcript, confidence_threshold=0.60):
    """
    Classifies a text command into an intent using a zero-shot model.
    """
    if not transcript:
        return None

    print(f"[NLP] Classifying text: '{transcript}'")

    # The model returns scores for all candidate labels, sorted from highest to lowest.
    results = zero_shot_classifier(transcript, CANDIDATE_INTENTS)

    best_intent = results['labels'][0]
    best_score = results['scores'][0]

    print(f"[NLP] Top classification: '{best_intent}' with confidence: {best_score:.2f}")

    # Only return the intent if the model is confident enough
    if best_score > confidence_threshold:
        print(f"[NLP] Confidence is above threshold. Intent is '{best_intent}'.")
        return best_intent
    else:
        print(f"[NLP] Confidence is below threshold. Intent is uncertain.")
        return None
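
To make the thresholding rule concrete, the sketch below applies it to two hand-written dictionaries laid out like the zero-shot pipeline's result (`sequence`, `labels`, and `scores`, sorted from highest to lowest score). The helper name `pick_intent` is hypothetical, and the lower-ranked scores are illustrative values, not model output.

```python
def pick_intent(result, confidence_threshold=0.60):
    """Return the top label if its score clears the threshold, else None."""
    best_intent = result["labels"][0]
    best_score = result["scores"][0]
    return best_intent if best_score > confidence_threshold else None


# Illustrative results in the pipeline's output layout (scores are made up).
confident = {
    "sequence": "can you go to the left?",
    "labels": ["left", "right", "forward", "stop"],
    "scores": [0.86, 0.06, 0.05, 0.03],
}
uncertain = {
    "sequence": "i am so glad to see you.",
    "labels": ["forward", "left", "right", "stop"],
    "scores": [0.49, 0.21, 0.17, 0.13],
}

print(pick_intent(confident))  # top score 0.86 is above 0.60
print(pick_intent(uncertain))  # top score 0.49 is below 0.60
```

In the pipeline's default single-label mode the scores are normalized across the candidate labels, so even an unrelated sentence has to distribute its probability over the four intents. A threshold alone may therefore let some off-topic utterances through.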

This function handles the audio modality. It takes the path to an audio file, uses the Whisper model to transcribe the speech into text, and passes the transcript to `get_intent_from_text_zero_shot` to determine the command intent. If the audio processing fails, the error is caught and the function returns `None`.

In [4]:
def get_intent_from_audio(audio_path):
    """
    Takes an audio file path, transcribes it, and classifies the intent using the zero-shot model.
    """
    try:
        print("\n[Audio] Transcribing speech to text...")
        transcription_result = stt_pipeline(audio_path)
        transcript = transcription_result['text'].strip().lower()

        # We now call our new zero-shot function
        return get_intent_from_text_zero_shot(transcript)

    except Exception as e:
        print(f"[Audio] Error processing audio: {e}")
        return None


This function handles the visual modality. It samples every fifth frame of a video file, uses MediaPipe to detect hand landmarks (the positions of the joints), and applies a set of geometric rules to recognize specific gestures: a fist with the thumb extended sideways ("left" or "right"), an open palm ("stop"), and a thumbs-up ("forward"). To make the detection robust, it counts the occurrences of each gesture across the sampled frames and returns the most frequent (dominant) gesture, provided that gesture is not "unknown" and was seen in more than two frames.

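A finger counts as curled when its tip is lower on the screen than its middle joint (the PIP joint). MediaPipe reports landmark coordinates normalized to the image, with y increasing downward, so a larger y value means lower on the screen. The condition therefore checks whether the four main fingers are bent downward, which assumes the hand is held roughly upright in the frame.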
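
The curl rule, and the horizontal thumb offset that the function below uses to tell left from right, can be sketched on plain coordinate pairs. `Point`, `is_curled`, and `thumb_direction` are hypothetical names introduced for illustration, with `Point` standing in for a MediaPipe landmark; the coordinates are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a MediaPipe landmark (normalized image coordinates).
Point = namedtuple("Point", ["x", "y"])


def is_curled(tip, pip):
    """A finger is curled when its tip sits below its PIP joint (larger y)."""
    return tip.y > pip.y


def thumb_direction(thumb_tip, wrist, margin=0.04):
    """Classify the thumb as left, right, or centered relative to the wrist."""
    if thumb_tip.x < wrist.x - margin:
        return "left"
    if thumb_tip.x > wrist.x + margin:
        return "right"
    return "center"


wrist = Point(0.50, 0.80)
print(is_curled(tip=Point(0.55, 0.62), pip=Point(0.55, 0.55)))  # tip below PIP
print(is_curled(tip=Point(0.55, 0.40), pip=Point(0.55, 0.55)))  # tip above PIP
print(thumb_direction(Point(0.30, 0.45), wrist))  # well left of the wrist
print(thumb_direction(Point(0.52, 0.45), wrist))  # inside the 0.04 margin
```

The margin keeps small horizontal jitter around the wrist from being read as a direction.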

In [16]:
def get_intent_from_video(video_path):
    """
    Analyzes a video for hand gestures using a prioritized check:
    1. Fist w/ Thumb (Left/Right)
    2. Open Palm (Stop)
    3. Thumbs Up (Forward)
    """
    print("\n[Video] Analyzing video for hand gestures...")
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened(): return None

    gesture_counts = {"left": 0, "right": 0, "forward": 0, "stop": 0, "unknown": 0}
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret: break

        if frame_count % 5 == 0:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = hands.process(frame_rgb)

            if results.multi_hand_landmarks:
                for hand_landmarks in results.multi_hand_landmarks:
                    # Collect key landmarks
                    thumb_tip = hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_TIP]
                    index_tip = hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
                    middle_tip = hand_landmarks.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_TIP]
                    ring_tip = hand_landmarks.landmark[mp_hands.HandLandmark.RING_FINGER_TIP]
                    pinky_tip = hand_landmarks.landmark[mp_hands.HandLandmark.PINKY_TIP]
                    wrist = hand_landmarks.landmark[mp_hands.HandLandmark.WRIST]

                    # PIP (middle) joints, used to tell whether each finger is curled
                    index_pip = hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_PIP]
                    middle_pip = hand_landmarks.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_PIP]
                    ring_pip = hand_landmarks.landmark[mp_hands.HandLandmark.RING_FINGER_PIP]
                    pinky_pip = hand_landmarks.landmark[mp_hands.HandLandmark.PINKY_PIP]

                    # A finger is curled when its tip is below its PIP joint (larger y)
                    # and extended when the tip is above it.
                    finger_pairs = [(index_tip, index_pip), (middle_tip, middle_pip),
                                    (ring_tip, ring_pip), (pinky_tip, pinky_pip)]
                    fingers_curled = all(tip.y > pip.y for tip, pip in finger_pairs)
                    fingers_open = all(tip.y < pip.y for tip, pip in finger_pairs)
                    thumb_above_wrist = thumb_tip.y < wrist.y

                    # PRIORITY 1: Fist with the thumb raised. The horizontal thumb
                    # offset separates left/right from a centered thumbs up (forward).
                    # The original fist and thumbs-up conditions were identical, so
                    # they are merged here without changing behavior.
                    if fingers_curled and thumb_above_wrist:
                        if thumb_tip.x < wrist.x - 0.04:
                            gesture_counts["left"] += 1
                        elif thumb_tip.x > wrist.x + 0.04:
                            gesture_counts["right"] += 1
                        else:
                            gesture_counts["forward"] += 1

                    # PRIORITY 2: Check for Stop (Open Palm)
                    elif fingers_open and thumb_above_wrist:
                        gesture_counts["stop"] += 1

                    # FALLBACK
                    else:
                        gesture_counts["unknown"] += 1

        frame_count += 1

    cap.release()

    if sum(gesture_counts.values()) > 0:
        dominant_gesture = max(gesture_counts, key=gesture_counts.get)
        if dominant_gesture != "unknown" and gesture_counts[dominant_gesture] > 2:
            print(f"[Video] Detected Gesture Counts: {gesture_counts}")
            print(f"[Video] Detected Intent: '{dominant_gesture}'")
            return dominant_gesture

    print("[Video] No definitive gesture detected.")
    return None
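
The voting step at the end of the function can be isolated as a small sketch. The helper name `dominant_gesture_from_counts` is hypothetical; it mirrors the rule above (the most frequent label wins, but only if it is not "unknown" and was counted more than twice). The first two count dictionaries are taken from the runs shown later in this notebook, and the third is made up to show an edge case.

```python
def dominant_gesture_from_counts(gesture_counts, min_count=3):
    """Return the most frequent gesture, or None if the vote is not decisive."""
    if sum(gesture_counts.values()) == 0:
        return None
    top = max(gesture_counts, key=gesture_counts.get)
    if top != "unknown" and gesture_counts[top] >= min_count:
        return top
    return None


run_a = {"left": 0, "right": 3, "forward": 0, "stop": 13, "unknown": 7}
run_b = {"left": 0, "right": 0, "forward": 3, "stop": 0, "unknown": 1}
# Made-up edge case: "unknown" outvotes a gesture that was seen five times.
edge = {"left": 0, "right": 0, "forward": 0, "stop": 5, "unknown": 7}

print(dominant_gesture_from_counts(run_a))
print(dominant_gesture_from_counts(run_b))
print(dominant_gesture_from_counts(edge))
```

Note the edge case: because "unknown" takes part in the vote, a video in which most sampled frames are unrecognized yields no intent even if one gesture appears several times.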

This is the core function that combines the entire multimodal analysis. It takes a video file path as input and performs the following steps:
1.  Extracts the audio from the video into a temporary file.
2.  Runs the audio processing pipeline to get an `audio_intent`.
3.  Runs the video gesture recognition pipeline to get a `video_intent`.
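4.  Fuses the two intents: a match is executed with high confidence, a mismatch is reported as a conflict and no action is taken, a single available intent is used on its own, and if neither modality yields an intent the command fails.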

In [17]:
def process_multimodal_command(video_path):
    """
    The main pipeline function with updated, more flexible decision logic.
    """
    print(f"\n{'='*20} PROCESSING NEW COMMAND: {video_path} {'='*20}")
    if not os.path.exists(video_path):
        print(f"Error: Video file not found at {video_path}")
        return

    # --- Step 1: Extract Audio & Get Intents ---
    temp_audio_path = "temp_audio.wav"
    try:
        with VideoFileClip(video_path) as video_clip:
            video_clip.audio.write_audiofile(temp_audio_path, logger=None)
        audio_intent = get_intent_from_audio(temp_audio_path)
    except Exception:
        audio_intent = None # Assume no audio if extraction fails
    finally:
        if os.path.exists(temp_audio_path): os.remove(temp_audio_path)

    video_intent = get_intent_from_video(video_path)

    # --- Step 2: NEW DECISION LOGIC ---
    print("\n[Fusion] Comparing intents...")
    print(f"[Fusion] Audio Intent: {audio_intent} | Video Intent: {video_intent}")

    # Case 1: High confidence match
    if audio_intent and video_intent and audio_intent == video_intent:
        print(f"\nHIGH CONFIDENCE: Intents match! Executing command: {audio_intent.upper()}")
        # Your robot action call, e.g., move_robot(audio_intent)

    # Case 2: Conflict
    elif audio_intent and video_intent and audio_intent != video_intent:
        print(f"\n CONFLICT: Audio detected '{audio_intent}' but Video detected '{video_intent}'. No action taken.")

    # Case 3: Audio only
    elif audio_intent and not video_intent:
        print(f"\n AUDIO ONLY: Proceeding with audio command: {audio_intent.upper()}")
        # Your robot action call, e.g., move_robot(audio_intent)

    # Case 4: Video only
    elif video_intent and not audio_intent:
        print(f"\n VIDEO ONLY: Proceeding with video command: {video_intent.upper()}")
        # Your robot action call, e.g., move_robot(video_intent)

    # Case 5: No intent detected
    else: # This covers the case where both are None
        print("\nFAILED: No clear audio or video intent was detected. Please try again.")
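
The five decision cases can also be expressed as a pure function, which makes the fusion rule easy to test without any video or model. This is a sketch only: the name `fuse_intents` and the status strings are hypothetical and do not appear in the original notebook.

```python
def fuse_intents(audio_intent, video_intent):
    """Combine the two modality intents into (status, command).

    `command` is None whenever no action should be taken.
    """
    if audio_intent and video_intent:
        if audio_intent == video_intent:
            return ("match", audio_intent)      # Case 1: both agree
        return ("conflict", None)               # Case 2: disagreement
    if audio_intent:
        return ("audio_only", audio_intent)     # Case 3: audio only
    if video_intent:
        return ("video_only", video_intent)     # Case 4: video only
    return ("none", None)                       # Case 5: nothing detected


for pair in [("stop", "stop"), ("forward", "right"),
             ("left", None), (None, "stop"), (None, None)]:
    print(pair, "->", fuse_intents(*pair))
```

Separating the decision from the printing would let a caller, such as a robot controller, act on the returned command directly.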


In [31]:
if __name__ == "__main__":

    test_videos = [
        "WIN_20250911_18_12_26_Pro.mp4","WIN_20250911_18_02_57_Pro.mp4","WIN_20250911_18_17_56_Pro.mp4",'WIN_20250911_17_41_49_Pro.mp4'
    ]

    for video_file in test_videos:
        process_multimodal_command(video_file)



[Audio] Transcribing speech to text...
[NLP] Classifying text: 'i am so glad to see you.'
[NLP] Top classification: 'forward' with confidence: 0.49
[NLP] Confidence is below threshold. Intent is uncertain.

[Video] Analyzing video for hand gestures...





[Video] Detected Gesture Counts: {'left': 0, 'right': 0, 'forward': 3, 'stop': 0, 'unknown': 1}
[Video] Detected Intent: 'forward'

[Fusion] Comparing intents...
[Fusion] Audio Intent: None | Video Intent: forward

 VIDEO ONLY: Proceeding with video command: FORWARD


[Audio] Transcribing speech to text...
[NLP] Classifying text: 'right. forward. stop.'
[NLP] Top classification: 'forward' with confidence: 0.46
[NLP] Confidence is below threshold. Intent is uncertain.

[Video] Analyzing video for hand gestures...





[Video] Detected Gesture Counts: {'left': 0, 'right': 3, 'forward': 0, 'stop': 13, 'unknown': 7}
[Video] Detected Intent: 'stop'

[Fusion] Comparing intents...
[Fusion] Audio Intent: None | Video Intent: stop

 VIDEO ONLY: Proceeding with video command: STOP


[Audio] Transcribing speech to text...
[NLP] Classifying text: 'excited.'
[NLP] Top classification: 'forward' with confidence: 0.82
[NLP] Confidence is above threshold. Intent is 'forward'.

[Video] Analyzing video for hand gestures...





[Video] Detected Gesture Counts: {'left': 0, 'right': 7, 'forward': 0, 'stop': 0, 'unknown': 0}
[Video] Detected Intent: 'right'

[Fusion] Comparing intents...
[Fusion] Audio Intent: forward | Video Intent: right

 CONFLICT: Audio detected 'forward' but Video detected 'right'. No action taken.


[Audio] Transcribing speech to text...
[NLP] Classifying text: 'if left, if left, if left, if left, if left, if left, if left, if left, if left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, left, 




[Video] Detected Gesture Counts: {'left': 8, 'right': 5, 'forward': 0, 'stop': 14, 'unknown': 4}
[Video] Detected Intent: 'stop'

[Fusion] Comparing intents...
[Fusion] Audio Intent: left | Video Intent: stop

 CONFLICT: Audio detected 'left' but Video detected 'stop'. No action taken.


In [25]:
if __name__ == "__main__":

    test_videos = [
        "WIN_20250911_17_25_17_Pro.mp4",
    ]

    for video_file in test_videos:
        process_multimodal_command(video_file)



[Audio] Transcribing speech to text...
[NLP] Classifying text: 'you're still calling.'
[NLP] Top classification: 'left' with confidence: 0.50
[NLP] Confidence is below threshold. Intent is uncertain.

[Video] Analyzing video for hand gestures...





[Video] Detected Gesture Counts: {'left': 0, 'right': 0, 'forward': 0, 'stop': 9, 'unknown': 0}
[Video] Detected Intent: 'stop'

[Fusion] Comparing intents...
[Fusion] Audio Intent: None | Video Intent: stop

 VIDEO ONLY: Proceeding with video command: STOP


In [None]:
if __name__ == "__main__":

    test_videos = [
        "/content/stop_merged.mp4",
    ]

    for video_file in test_videos:
        process_multimodal_command(video_file)



[Audio] Transcribing speech to text...





[NLP] Classifying text: 'that's enough. you can stop now. you can stop now.'
[NLP] Top classification: 'stop' with confidence: 0.63
[NLP] Confidence is above threshold. Intent is 'stop'.

[Video] Analyzing video for hand gestures...
[Video] Detected Gesture Counts: {'left': 0, 'right': 0, 'forward': 0, 'stop': 23, 'unknown': 0}
[Video] Detected Intent: 'stop'

[Fusion] Comparing intents...
[Fusion] Audio Intent: stop | Video Intent: stop

HIGH CONFIDENCE: Intents match! Executing command: STOP


In [None]:
if __name__ == "__main__":

    test_videos = [
        "/content/left me.mp4",
    ]

    for video_file in test_videos:
        process_multimodal_command(video_file)



[Audio] Transcribing speech to text...





[NLP] Classifying text: 'can you go to the left?'
[NLP] Top classification: 'left' with confidence: 0.86
[NLP] Confidence is above threshold. Intent is 'left'.

[Video] Analyzing video for hand gestures...
[Video] Detected Gesture Counts: {'left': 20, 'right': 0, 'forward': 0, 'stop': 0, 'unknown': 0}
[Video] Detected Intent: 'left'

[Fusion] Comparing intents...
[Fusion] Audio Intent: left | Video Intent: left

HIGH CONFIDENCE: Intents match! Executing command: LEFT
