# Tutorial 1: Multi-threaded Fetching and Downloading

This notebook demonstrates how to use `landlens_db`'s multi-threaded capabilities to efficiently fetch and download Mapillary images.

## Prerequisites

You'll need:
- A Mapillary API token
- A `.env` file that defines `MLY_TOKEN` and `DOWNLOAD_DIR` (loaded below with `python-dotenv`)
- `landlens_db` installed

## Setup

First, let's import the necessary modules and load environment variables:

In [1]:
from landlens_db.handlers.cloud import Mapillary
from landlens_db.geoclasses.geoimageframe import GeoImageFrame
from dotenv import load_dotenv
import os
import time

load_dotenv()

MLY_TOKEN = os.environ.get("MLY_TOKEN")
DOWNLOAD_DIR = os.environ.get("DOWNLOAD_DIR")

# Create download directory if it doesn't exist
os.makedirs(DOWNLOAD_DIR, exist_ok=True)
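
If either variable is missing from `.env`, `os.environ.get` returns `None`, and `os.makedirs(None)` then fails with a `TypeError` that does not name the real problem. A minimal sketch of a stricter lookup follows. `require_env` is a hypothetical helper written for this tutorial, not part of `landlens_db`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; not part of landlens_db.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; define it in your .env file")
    return value


# Possible usage after load_dotenv():
# MLY_TOKEN = require_env("MLY_TOKEN")
# DOWNLOAD_DIR = require_env("DOWNLOAD_DIR")
```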

## Multi-threaded Fetching and Downloading

Both `fetch_within_bbox` and `download_images_to_local` accept a `max_workers` argument that sets how many threads run in parallel. The cell below times the fetch step with 1, 5, and 10 workers; downloads are timed later in the notebook:

In [2]:
# Initialize Mapillary connection
importer = Mapillary(MLY_TOKEN)

# Fetch a small sample of images from Shibuya area
bbox = [139.699, 35.658, 139.7, 35.659]  # Small area in Shibuya
fields = [
    'id',
    'captured_at',
    'compass_angle',
    'thumb_1024_url',
    'geometry'
]

print("Testing fetch_within_bbox speeds:")

for workers in [1, 5, 10]:
    start_time = time.time()

    # Fetch images with different worker counts
    gdf = importer.fetch_within_bbox(
        bbox,
        fields=fields,
        max_images=50,
        max_workers=workers  # Specify worker count for parallel fetching
    )

    duration = time.time() - start_time
    print(f"\nFetch results with {workers} worker{'s' if workers > 1 else ''}:")
    print(f"Time taken: {duration:.2f} seconds")
    print(f"Speed: {len(gdf)/duration:.2f} images/second")

Testing fetch_within_bbox speeds:
Fetching 1 tiles...
Reached maximum images (50), stopping tile fetching
Found 83737 total images
After removing duplicates: 83737 unique images
Limiting to 50 images for processing


Fetching metadata: 100%|██████████| 50/50 [00:24<00:00,  2.07it/s]



Fetch results with 1 worker:
Time taken: 27.44 seconds
Speed: 1.82 images/second
Fetching 1 tiles...
Reached maximum images (50), stopping tile fetching
Found 83737 total images
After removing duplicates: 83737 unique images
Limiting to 50 images for processing


Fetching metadata: 100%|██████████| 50/50 [00:03<00:00, 13.68it/s]



Fetch results with 5 workers:
Time taken: 7.11 seconds
Speed: 7.03 images/second
Fetching 1 tiles...
Reached maximum images (50), stopping tile fetching
Found 83737 total images
After removing duplicates: 83737 unique images
Limiting to 50 images for processing


Fetching metadata: 100%|██████████| 50/50 [00:02<00:00, 24.06it/s]


Fetch results with 10 workers:
Time taken: 5.16 seconds
Speed: 9.69 images/second
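
To see why extra workers help with network-bound work, the same measurement can be reproduced without any API calls. The sketch below is an illustration only: it does not show how `landlens_db` is implemented, and `fake_request` and `timed_run` are hypothetical names. Each simulated request sleeps, which releases the interpreter much as waiting on a real HTTP response would, so a thread pool can overlap the waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(item_id: int, latency: float = 0.02) -> int:
    """Stand-in for one network call: wait, then return the id."""
    time.sleep(latency)
    return item_id


def timed_run(n_items: int, max_workers: int) -> float:
    """Run n_items simulated requests on a thread pool; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_request, range(n_items)))
    assert len(results) == n_items
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in [1, 5, 10]:
        elapsed = timed_run(n_items=20, max_workers=workers)
        print(f"{workers:>2} workers: {elapsed:.2f} s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short intervals.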





### Prepare Images for Download

Convert to GeoImageFrame and set up proper filenames:

In [3]:

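# Wrap the fetched results in a GeoImageFrame
images = GeoImageFrame(gdf)

# Convert 'mly|123' to 'mly_123' for proper filename format
# regex=False makes the literal match on '|' explicit
images['filename'] = images['name'].str.replace('|', '_', regex=False)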

print(f"\nFound {len(images)} images")
print("\nSample of filenames:")
print(images['filename'].head())


Found 50 images

Sample of filenames:
0    mly_624212905601399
1    mly_306562965681688
2    mly_164604372260810
3    mly_180754317158144
4    mly_485201359357219
Name: filename, dtype: object
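
The replacement pattern deserves a note, because `|` is a regular-expression metacharacter. To my knowledge, the default of the `regex` argument of pandas' `Series.str.replace` changed from `True` to `False` in pandas 2.0, so it is worth checking the documentation for your installed version. The standard-library sketch below shows the difference between literal and regex replacement on a name in the `mly|<id>` form used above.

```python
import re

raw_name = "mly|624212905601399"

# Literal replacement: every '|' becomes '_'.
literal = raw_name.replace("|", "_")

# As a regular expression, a bare '|' is an alternation of two empty
# patterns, so it matches the empty string instead of the pipe character.
as_regex = re.sub("|", "_", raw_name)

# Escaping the pattern restores the literal meaning.
escaped = re.sub(re.escape("|"), "_", raw_name)

print(literal)
print(as_regex == literal)
print(escaped == literal)
```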


### Multi-threaded Download Performance

Test downloading with different numbers of workers. In the runs recorded in this notebook, multi-threading gave the following speedups over a single worker:

1. **Fetch Performance** (50 images):
   - Single thread: ~1.8 images/second (baseline)
   - 5 workers: ~7.0 images/second (3.9x faster)
   - 10 workers: ~9.7 images/second (5.3x faster)

2. **Download Performance** (100 images):
   - Single thread: ~10.2 images/second (baseline)
   - 5 workers: ~46.6 images/second (4.6x faster)
   - 10 workers: ~85.4 images/second (8.4x faster)

In [7]:
print("Testing download speeds:")

for workers in [1, 5, 10]:
    start_time = time.time()

    # Download images using converted filename column
    local_images = images.download_images_to_local(
        DOWNLOAD_DIR,
        filename_column='filename',  # Use the converted names
        max_workers=workers
    )

    duration = time.time() - start_time
    print(f"\nDownload results with {workers} worker{'s' if workers > 1 else ''}:")
    print(f"Time taken: {duration:.2f} seconds")
    print(f"Speed: {len(images)/duration:.2f} images/second")

    # Verify some downloaded files
    print("\nSample of downloaded files:")
    for file in sorted(os.listdir(DOWNLOAD_DIR))[:3]:
        print(f"- {file}")

Testing download speeds:


Downloading images: 100%|██████████| 100/100 [00:09<00:00, 10.16it/s]



Download results with 1 worker:
Time taken: 9.86 seconds
Speed: 10.15 images/second

Sample of downloaded files:
- mly_1003835734760959.jpg
- mly_1019237306186418.jpg
- mly_1026095268511169.jpg


Downloading images: 100%|██████████| 100/100 [00:02<00:00, 46.84it/s]



Download results with 5 workers:
Time taken: 2.15 seconds
Speed: 46.58 images/second

Sample of downloaded files:
- mly_1003835734760959.jpg
- mly_1019237306186418.jpg
- mly_1026095268511169.jpg


Downloading images: 100%|██████████| 100/100 [00:01<00:00, 86.44it/s]


Download results with 10 workers:
Time taken: 1.17 seconds
Speed: 85.40 images/second

Sample of downloaded files:
- mly_1003835734760959.jpg
- mly_1019237306186418.jpg
- mly_1026095268511169.jpg
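
The speedup factors follow directly from the wall-clock times printed above: divide the single-worker time by the time for each worker count. The short calculation below uses only the timings shown in this notebook's outputs. They come from single runs, so expect different figures on another network or machine.

```python
from typing import Dict

# Wall-clock seconds copied from the outputs shown in this notebook.
fetch_seconds = {1: 27.44, 5: 7.11, 10: 5.16}
download_seconds = {1: 9.86, 5: 2.15, 10: 1.17}


def speedups(seconds_by_workers: Dict[int, float]) -> Dict[int, float]:
    """Speedup of each run relative to the single-worker run."""
    baseline = seconds_by_workers[1]
    return {w: round(baseline / t, 1) for w, t in seconds_by_workers.items()}


print("fetch:", speedups(fetch_seconds))        # {1: 1.0, 5: 3.9, 10: 5.3}
print("download:", speedups(download_seconds))  # {1: 1.0, 5: 4.6, 10: 8.4}
```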





### Important Notes About Multi-threaded Operations

1. **Number of Workers**: 
   - Default is 10 workers for both operations
   - Fetch scaled modestly in the runs above: about 5.3x with 10 workers (27.44 s down to 5.16 s)
   - Download scaled further: about 8.4x with 10 workers (9.86 s down to 1.17 s)

2. **Error Handling**:
   - Built-in retry mechanism for both operations
   - Progress bars show real-time status
   - Failed operations are logged but don't stop the process

3. **Filename Handling**:
   - Always convert names: `images['filename'] = images['name'].str.replace('|', '_', regex=False)`
   - Use `filename_column='filename'` when downloading
   - Results in clean filenames like 'mly_123456789.jpg'
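
The retry and log-and-continue behaviour described in the notes above can be sketched with the standard library. This is not the `landlens_db` implementation, whose internals are not shown in this tutorial. `call_with_retries` and `run_all` are hypothetical names, and the retry count and backoff are assumed values.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("parallel_sketch")


def call_with_retries(task: Callable[[str], bytes], item: str,
                      retries: int = 3, backoff: float = 0.1) -> bytes:
    """Call task(item); on an exception, wait and retry up to `retries` times."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return task(item)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller record the failure
            time.sleep(backoff * attempt)  # wait a little longer each time
    raise AssertionError("unreachable")


def run_all(task: Callable[[str], bytes], items: Iterable[str],
            max_workers: int = 10, retries: int = 3,
            backoff: float = 0.1) -> Tuple[Dict[str, bytes], List[str]]:
    """Run task over items in parallel; return (results, failed_items)."""
    results: Dict[str, bytes] = {}
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(call_with_retries, task, item, retries, backoff): item
            for item in items
        }
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Log the failure and keep processing the remaining items.
                logger.warning("giving up on %s: %s", item, exc)
                failed.append(item)
    return results, failed
```

Returning the failed items alongside the results lets the caller decide whether to retry them later or drop them.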

### Example Usage

In [5]:
# Fetch with parallel processing
gdf = importer.fetch_within_bbox(
    bbox,
    fields=fields,
    max_images=100,
    max_workers=10  # Use optimal worker count
)

# Convert to GeoImageFrame and fix filenames
images = GeoImageFrame(gdf)
images['filename'] = images['name'].str.replace('|', '_', regex=False)

# Download with optimal settings
local_images = images.download_images_to_local(
    DOWNLOAD_DIR,
    filename_column='filename',
    max_workers=10
)

Fetching 1 tiles...
Reached maximum images (100), stopping tile fetching
Found 83737 total images
After removing duplicates: 83737 unique images
Limiting to 100 images for processing


Fetching metadata: 100%|██████████| 100/100 [00:04<00:00, 22.62it/s]
Downloading images: 100%|██████████| 100/100 [00:06<00:00, 15.44it/s]


### Best Practices

1. **Choose Worker Count Wisely**:
   - Start with 10 workers for downloads, the fastest setting in the runs above
   - Consider 5 workers for fetching (good balance)
   - Adjust based on your network bandwidth and system resources

2. **Manage Filenames**:
   - Always replace `|` with `_` in image names before downloading
   - Pass the converted column explicitly with `filename_column`
   - Verify that the expected files exist after downloads

3. **Monitor Performance**:
   - Watch progress bars
   - Check for failed operations
   - Adjust worker counts if needed
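
Verifying filenames after a download can be as simple as comparing the expected names with the directory listing. The sketch below assumes files are saved as `<filename>.jpg`, as in the listing printed earlier. `missing_downloads` is a hypothetical helper, not a `landlens_db` function.

```python
import os
import tempfile
from typing import Iterable, List


def missing_downloads(download_dir: str, filenames: Iterable[str],
                      ext: str = ".jpg") -> List[str]:
    """Return the names (without extension) that have no file in download_dir."""
    present = set(os.listdir(download_dir))
    return [name for name in filenames if f"{name}{ext}" not in present]


if __name__ == "__main__":
    # Self-contained demo using a temporary directory instead of DOWNLOAD_DIR.
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "mly_123.jpg"), "wb").close()
        print(missing_downloads(tmp, ["mly_123", "mly_456"]))  # ['mly_456']
```

In the notebook, a call such as `missing_downloads(DOWNLOAD_DIR, images['filename'])` would list any images worth retrying.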

### API Considerations

The implementation is designed to stay within Mapillary's rate limits:

- Tile requests: limited to 50,000 per day
- Entity API requests: limited to 60,000 per minute
- Search API requests: limited to 10,000 per minute

For details, see [https://www.mapillary.com/developer/api-documentation#rate-limits](https://www.mapillary.com/developer/api-documentation#rate-limits)
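
If you issue your own requests alongside the library, one way to stay under a per-minute cap is a client-side limiter shared by all threads. The sliding-window sketch below is an assumption about how such throttling could be done; this tutorial does not state how `landlens_db` enforces the limits, and `SlidingWindowLimiter` is a hypothetical class.

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._calls = deque()    # timestamps of recent calls, oldest first
        self._lock = threading.Lock()

    def acquire(self):
        """Block until one more call fits in the current window."""
        while True:
            with self._lock:
                now = self._clock()
                # Drop timestamps that have left the window.
                while self._calls and now - self._calls[0] >= self.period:
                    self._calls.popleft()
                if len(self._calls) < self.max_calls:
                    self._calls.append(now)
                    return
                # Time until the oldest call leaves the window.
                wait = self.period - (now - self._calls[0])
            # Sleep outside the lock so other threads are not blocked.
            self._sleep(wait)


# Possible usage: cap search requests at 10,000 per minute.
# search_limiter = SlidingWindowLimiter(max_calls=10_000, period=60.0)
# search_limiter.acquire()  # call before each request
```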