# Moondream Vision Language Model - Sliding Window for Large Images

[![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/opengeos/geoai/blob/main/docs/examples/moondream_sliding_window.ipynb)

This notebook demonstrates the sliding window methods for processing large images with the [Moondream](https://moondream.ai) vision language model (VLM). The sliding window approach divides a large image into smaller overlapping tiles, processes each tile separately, and then combines the per-tile results.

## Why Sliding Window?

- **Better Performance on Large Images**: Moondream VLM processes smaller image tiles more effectively than very large images
- **Memory Efficiency**: Reduces memory requirements by processing one tile at a time
- **Better Detail Recognition**: Smaller tiles allow the model to focus on finer details
- **Overlap Handling**: Overlapping tiles prevent missing objects at tile boundaries
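
To make the tiling idea concrete, here is a minimal sketch of how overlapping tile windows could be laid out over an image. The helper names `tile_origins` and `sliding_windows` are hypothetical and are not part of geoai; the library's own tiling may treat image edges differently (for example by padding instead of shifting the last tile).

```python
def tile_origins(length: int, window_size: int, overlap: int) -> list[int]:
    """Start offsets of tiles along one axis so that [0, length) is covered."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    if length <= window_size:
        return [0]  # a single tile already covers the whole axis
    stride = window_size - overlap
    origins = list(range(0, length - window_size, stride))
    # Shift the final tile so it ends flush with the image edge.
    origins.append(length - window_size)
    return origins


def sliding_windows(width, height, window_size=512, overlap=64):
    """Yield (x_min, y_min, x_max, y_max) pixel boxes, row by row."""
    for y in tile_origins(height, window_size, overlap):
        for x in tile_origins(width, window_size, overlap):
            yield (x, y, min(x + window_size, width), min(y + window_size, height))


# Example: a 1200 x 900 pixel image with the default settings
windows = list(sliding_windows(1200, 900))
print(len(windows))  # 6 tiles (3 columns x 2 rows)
print(windows[0])    # (0, 0, 512, 512)
print(windows[-1])   # (688, 388, 1200, 900)
```

In this sketch the last tile in each row and column is shifted back to stay inside the image, so it can overlap its neighbor by more than `overlap` pixels.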

## Available Sliding Window Methods

1. `detect_sliding_window()` - Object detection with bounding boxes
2. `point_sliding_window()` - Point detection for object locations
3. `query_sliding_window()` - Visual question answering
4. `caption_sliding_window()` - Image captioning

## Install packages

Uncomment the following line to install the required packages.

In [None]:
# %pip install -U geoai-py

## Import libraries

In [None]:
import leafmap
from geoai import MoondreamGeo
import geoai

## Download sample data

We'll use a sample GeoTIFF (`parking_lot.tif`) that is large enough to benefit from sliding window processing.

In [None]:
# Download a sample large image
url = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/parking_lot.tif"
image_path = geoai.download_file(url)
image_path

## Visualize the image

Let's first visualize the sample image on an interactive map.

In [None]:
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
m

## Initialize the Moondream processor

Load the Moondream2 model. The first time you run this cell, the model is downloaded from Hugging Face (~3.7 GB).

In [None]:
processor = MoondreamGeo(
    model_name="vikhyatk/moondream2",
    revision="2025-06-21",
    device="cuda",  # Use "cpu" if you don't have a GPU
)

## 1. Object Detection with Sliding Window

Detect objects in large images using the sliding window approach. The method automatically:
- Divides the image into overlapping tiles
- Detects objects in each tile
- Applies Non-Maximum Suppression (NMS) to merge overlapping detections

### Key Parameters:
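- `window_size`: Size of each square tile, in pixels (default: 512)
- `overlap`: Overlap between adjacent tiles, in pixels (default: 64)
- `iou_threshold`: IoU threshold for NMS; detections overlapping a kept box by more than this are treated as duplicates (default: 0.5)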

In [None]:
# Detect cars using sliding window
result = processor.detect_sliding_window(
    image_path,
    "car",
    window_size=512,
    overlap=64,
    iou_threshold=0.5,
    output_path="cars_sliding_window.geojson",
)

print(f"Detected {len(result['objects'])} cars")
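
For intuition about the merging step, here is a minimal sketch of greedy IoU-based NMS in plain Python. It is an illustration of the general technique, not geoai's implementation. The `scores` list is an assumption for illustration: if the detections carry no confidence value, some other ranking (for example box area) would have to stand in for it.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes (e.g. the same car seen in two overlapping tiles)
# plus one separate box, all in full-image pixel coordinates.
boxes = [(100, 100, 200, 200), (105, 102, 205, 198), (400, 400, 450, 450)]
scores = [0.9, 0.8, 0.7]  # hypothetical ranking values

print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
print(nms(boxes, scores, iou_threshold=0.9))  # [0, 1, 2]
```

The first two boxes have an IoU of about 0.87, so the duplicate is removed at a threshold of 0.5 but survives at 0.9.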

### Visualize Detection Results

In [None]:
# View the GeoDataFrame
if "gdf" in result:
    display(result["gdf"].head())

In [None]:
# Visualize on map
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
if "gdf" in result:
    m.add_gdf(
        result["gdf"],
        layer_name="Detected Cars",
        style={"color": "red", "fillOpacity": 0.3},
    )
m

### Detect Buildings

In [None]:
# Detect buildings using sliding window
buildings = processor.detect_sliding_window(
    image_path,
    "building",
    window_size=512,
    overlap=64,
    output_path="buildings_sliding_window.geojson",
)

print(f"Detected {len(buildings['objects'])} buildings")

## 2. Point Detection with Sliding Window

Find specific object locations as points across large images.

In [None]:
# Find tree locations using sliding window
trees = processor.point_sliding_window(
    image_path,
    "tree",
    window_size=512,
    overlap=64,
    output_path="trees_sliding_window.geojson",
)

print(f"Found {len(trees['points'])} tree locations")
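
Points found inside a tile have to be mapped back into the coordinate system of the full image before they can be georeferenced. The sketch below assumes that each tile returns points as normalized `x`/`y` values in the range 0 to 1 relative to the tile; that format, the helper names, and the distance-based duplicate removal are assumptions for illustration and may not match what geoai does internally.

```python
def tile_point_to_image(point, window):
    """Map a tile-relative normalized point to full-image pixel coordinates."""
    x_min, y_min, x_max, y_max = window
    return (
        x_min + point["x"] * (x_max - x_min),
        y_min + point["y"] * (y_max - y_min),
    )


def dedupe_points(points, min_distance=10.0):
    """Drop points closer than min_distance pixels to an already kept point."""
    kept = []
    for x, y in points:
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_distance**2 for kx, ky in kept):
            kept.append((x, y))
    return kept


# A point at 25% / 50% of a tile whose top-left corner is at (448, 0)
window = (448, 0, 960, 512)
print(tile_point_to_image({"x": 0.25, "y": 0.5}, window))  # (576.0, 256.0)

# The same tree reported by two overlapping tiles, plus a separate one
points = [(576.0, 256.0), (578.0, 257.0), (900.0, 100.0)]
print(dedupe_points(points))  # [(576.0, 256.0), (900.0, 100.0)]
```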

In [None]:
# Visualize tree locations
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
if "gdf" in trees:
    m.add_gdf(trees["gdf"], layer_name="Trees", style={"color": "green", "radius": 3})
m

## 3. Visual Question Answering with Sliding Window

Query large images by processing them in tiles and combining answers.

### Combine Strategies:
- `concatenate`: Simply join all tile answers (faster)
- `summarize`: Use the model to create a coherent summary (better quality)
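
As a rough illustration of the `concatenate` idea, the sketch below joins per-tile answers into one string. The `tile_id` and `answer` keys follow the tile records used later in this notebook; the function name and the output format are hypothetical and may differ from what `query_sliding_window()` actually returns.

```python
def concatenate_answers(tile_answers):
    """Join non-empty tile answers, labelling each with its tile id."""
    parts = []
    for tile in tile_answers:
        answer = tile["answer"].strip()
        if answer:  # skip tiles where the model returned nothing
            parts.append(f"Tile {tile['tile_id']}: {answer}")
    return "\n".join(parts)


# Made-up tile answers, for illustration only
sample = [
    {"tile_id": 0, "answer": "Cars and a delivery van."},
    {"tile_id": 1, "answer": "   "},
    {"tile_id": 2, "answer": "Several parked cars."},
]
print(concatenate_answers(sample))
```

A `summarize`-style strategy would pass text like this back to the model with a request for a single coherent answer, which is why it costs an extra model call.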

In [None]:
# Query with concatenation
result = processor.query_sliding_window(
    "What types of vehicles are visible?",
    image_path,
    window_size=512,
    overlap=64,
    combine_strategy="concatenate",
)

print("Combined Answer:")
print(result["answer"])

In [None]:
# Query with summarization (requires an additional model call)
result = processor.query_sliding_window(
    "Describe the land use and features in this area.",
    image_path,
    window_size=512,
    overlap=64,
    combine_strategy="summarize",
)

print("Summary:")
print(result["answer"])

In [None]:
# View individual tile answers
print("\nIndividual Tile Answers:")
for tile in result["tile_answers"][:3]:  # Show first 3 tiles
    print(f"Tile {tile['tile_id']}: {tile['answer']}")

## 4. Image Captioning with Sliding Window

Generate comprehensive captions for large images by captioning tiles and combining them.

In [None]:
# Generate caption with concatenation
result = processor.caption_sliding_window(
    image_path,
    window_size=512,
    overlap=64,
    length="normal",
    combine_strategy="concatenate",
)

print("Combined Caption:")
print(result["caption"])

In [None]:
# Generate caption with summarization for better coherence
result = processor.caption_sliding_window(
    image_path, window_size=512, overlap=64, length="long", combine_strategy="summarize"
)

print("Summarized Caption:")
print(result["caption"])

## Using Convenience Functions

You can also use the convenience functions for one-off processing without creating a processor instance.

In [None]:
from geoai import moondream_detect_sliding_window

# Quick detection
result = moondream_detect_sliding_window(
    image_path,
    "parking space",
    window_size=512,
    overlap=64,
    model_name="vikhyatk/moondream2",
    revision="2025-06-21",
)

print(f"Detected {len(result['objects'])} parking spaces")

## Performance Tips

1. **Window Size**: 
   - Smaller windows (256-512): Better for small objects, more tiles to process
   - Larger windows (512-1024): Faster processing, may miss small objects

2. **Overlap**: 
   - Larger overlap (64-128): Better for objects at tile boundaries, slower
   - Smaller overlap (32-64): Faster, may miss objects at boundaries

3. **IoU Threshold** (for detection):
   - Higher (0.6-0.8): Suppresses only heavily overlapping boxes, so more detections are kept and duplicates may remain
   - Lower (0.3-0.5): Merges more aggressively, may drop distinct objects that sit close together

4. **Combine Strategy**:
   - `concatenate`: Faster, preserves all information
   - `summarize`: Better quality, requires extra model call
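
To see how window size and overlap drive the amount of work, the following sketch estimates the number of tiles for a given image size. It assumes a simple fixed-stride layout where the stride is `window_size - overlap` and one extra tile covers any remainder at the edge; the function names are hypothetical and the count produced by geoai may differ slightly.

```python
import math


def tiles_along_axis(length, window_size, overlap):
    """Number of tiles needed to cover one axis of the image."""
    if length <= window_size:
        return 1
    stride = window_size - overlap
    return math.ceil((length - window_size) / stride) + 1


def tile_count(width, height, window_size, overlap):
    """Total number of tiles, and therefore model calls, for one image."""
    return tiles_along_axis(width, window_size, overlap) * tiles_along_axis(
        height, window_size, overlap
    )


# Tile counts for a 4096 x 4096 pixel image
print(tile_count(4096, 4096, window_size=256, overlap=64))   # 441
print(tile_count(4096, 4096, window_size=512, overlap=64))   # 81
print(tile_count(4096, 4096, window_size=512, overlap=128))  # 121
print(tile_count(4096, 4096, window_size=1024, overlap=64))  # 25
```

Each tile costs one model call, so halving the window size or doubling the overlap can multiply the processing time.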

## Compare: Regular vs Sliding Window

Let's compare regular detection with sliding window detection.

In [None]:
# Regular detection (without sliding window)
regular_result = processor.detect(image_path, "car")
print(f"Regular detection: {len(regular_result['objects'])} cars")

# Sliding window detection
sliding_result = processor.detect_sliding_window(
    image_path, "car", window_size=512, overlap=64
)
print(f"Sliding window detection: {len(sliding_result['objects'])} cars")

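# Report a signed difference: sliding window detection can also return fewer objects
diff = len(sliding_result["objects"]) - len(regular_result["objects"])
print(f"\nDifference (sliding window minus regular): {diff:+d} detections")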

## Summary

This notebook demonstrated the sliding window methods for the Moondream VLM:

1. **Object Detection**: Process large images in tiles with NMS for merging
2. **Point Detection**: Find object locations across large images
3. **Query**: Answer questions about large images by querying tiles
4. **Caption**: Generate comprehensive captions by combining tile descriptions

The sliding window approach is particularly useful for:
- Very large satellite/aerial imagery
- High-resolution images where details matter
- Scenes with many small objects
- Memory-constrained environments

## Next Steps

- Try different window sizes and overlaps for your use case
- Experiment with both combine strategies for queries and captions
- Use georeferenced outputs with GIS tools
- Combine with other geoai tools for complete workflows