# Using Pexels API for Image Collection with fastai

This notebook demonstrates how to collect and organize image datasets using the Pexels API and fastai's utilities.

## Key Concepts:
1. **Image Collection**: We use the Pexels API in place of Bing Image Search to gather image URLs programmatically.

2. **fastai's download_images**: A utility function that:
   - Takes a list of URLs
   - Downloads them in parallel
   - Skips URLs that fail to download instead of stopping the whole run
   - Saves images to the specified folder under generated filenames

3. **Directory Organization**: We create a structured dataset by:
   - Making a main directory for our project
   - Creating subdirectories for each category
   - Saving images in their respective categories

4. **Data Verification**: We use fastai's verify_images to:
   - Find downloaded files that cannot be opened as images
   - Get them back as a list, which we then delete with Path.unlink
   - Leave a clean dataset that is ready for model training
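
The directory layout described in item 3 can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's download code: `make_category_dirs` is a hypothetical helper name, and a temporary directory stands in for the project folder.

```python
import tempfile
from pathlib import Path

def make_category_dirs(root, categories):
    """Create root/<category> for each category and return the created paths."""
    root = Path(root)
    dirs = []
    for name in categories:
        dest = root / name
        # parents=True also creates the root; exist_ok=True makes reruns safe
        dest.mkdir(parents=True, exist_ok=True)
        dirs.append(dest)
    return dirs

# Demo in a throwaway directory so nothing is written to the project folder
with tempfile.TemporaryDirectory() as tmp:
    created = make_category_dirs(Path(tmp) / 'bears', ['grizzly', 'black', 'teddy'])
    layout = sorted(p.name for p in created)
    all_exist = all(p.is_dir() for p in created)
```

Because the category name is also the folder name, the label of each image can later be read from its parent directory.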

## Process Flow:
1. Get image URLs from Pexels API
2. Create organized folder structure
3. Download images to appropriate folders
4. Verify and clean the dataset
5. Generate a summary report

The result is a dataset with one folder per category, ready for the training steps that follow.

Note: The original course used Bing's API. This notebook applies the same steps with Pexels. fastai's download_images only needs a list of URLs and verify_images only needs a list of files, so both work regardless of the image source.
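
To illustrate why the image source does not matter, the sketch below reduces a search response to a plain list of URLs. The `sample_response` payload is made up for illustration (its URLs are placeholders) and only mirrors the `photos[...]['src']['original']` fields that this notebook reads from the Pexels response. `extract_urls` is a hypothetical helper, not part of fastai or Pexels.

```python
def extract_urls(response_json, size='original'):
    """Return the image URL of the requested size for each photo in the response."""
    return [photo['src'][size] for photo in response_json.get('photos', [])]

# Made-up payload with the same nesting the notebook relies on
sample_response = {
    'photos': [
        {'src': {'original': 'https://example.com/a.jpeg'}},
        {'src': {'original': 'https://example.com/b.jpeg'}},
    ]
}
urls = extract_urls(sample_response)
```

Any other image source can be swapped in by writing a different extraction function that returns the same kind of list.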

Cell 1 - Setup and Imports

In [2]:
# Setup for Google Colab
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

# Import necessary libraries
import os
import requests
from fastbook import *
from fastai.vision.widgets import *
from PIL import Image

Cell 2 - Function Definition and API Key

In [3]:
def search_images_pexels(api_key, query, per_page=80):
    """Search Pexels and return an L of dicts, each with a 'contentUrl' key."""
    headers = {'Authorization': api_key}
    url = "https://api.pexels.com/v1/search"

    # Pexels returns at most 80 results per page
    params = {
        'query': query,
        'per_page': per_page
    }

    # A timeout keeps the cell from hanging if the API does not respond
    response = requests.get(url, headers=headers, params=params, timeout=10)
    if response.status_code != 200:
        raise Exception(f"Error fetching data: {response.status_code}")

    data = response.json()
    # Wrap each URL in a dict with a 'contentUrl' key to match Bing's result structure
    urls = L([{'contentUrl': photo['src']['original']} for photo in data['photos']])
    return urls

# Set your Pexels API key (never commit a real key to a notebook)
key = os.environ.get('PEXELS_API_KEY', 'XXX')  # Replace XXX with your key
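
Because an API key is a credential, it can be safer to fail loudly when the environment variable is missing than to fall back to a value stored in the notebook. A minimal sketch, assuming the key is exported as PEXELS_API_KEY; `load_api_key` is a hypothetical helper, not part of fastai or Pexels.

```python
import os

def load_api_key(var_name='PEXELS_API_KEY', environ=None):
    """Return the API key from the environment, or raise if it is missing or blank."""
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running this notebook")
    return key

# Demo with an explicit mapping so the example does not depend on the real environment
demo_key = load_api_key(environ={'PEXELS_API_KEY': 'demo-key'})
```

In the notebook itself, `key = load_api_key()` would read the real environment.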

Cell 3 - Download Images

In [4]:
# Define your categories
bear_types = 'grizzly','black','teddy'

# Create the main directory and download each category (skipped entirely if 'bears' already exists)
path = Path('bears')
if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = (path/o)
        dest.mkdir(exist_ok=True)
        results = search_images_pexels(key, f'{o} bear')
        download_images(dest, urls=results.attrgot('contentUrl'))

# List the files that were downloaded
fns = get_image_files(path)
fns

(#240) [Path('bears/black/0fa056c6-96be-4bb1-adb9-a98aabf36f24.jpeg'),Path('bears/black/6c11dd31-1939-4efd-b1c2-ff40072bada6.jpeg'),Path('bears/black/120912bd-d1b0-4e14-a950-80f95ec3fd56.jpeg'),Path('bears/black/3c1dbae9-9e32-4d15-95db-1d24f083e9d9.jpeg'),Path('bears/black/7865bf20-1fc7-4468-bcc8-68a1f8e58b7e.jpeg'),Path('bears/black/7f8bf0fb-8b2d-4bda-8f1f-6cf340d12b6f.jpeg'),Path('bears/black/df526cec-e6ce-4690-ace7-cc47ee51848e.jpeg'),Path('bears/black/ebb98d9d-d6f2-4d07-94c3-61c21f850976.jpeg'),Path('bears/black/285270e3-4b60-4a74-9cb9-ed820349a6f2.jpeg'),Path('bears/black/485ecf23-4407-4b99-aef5-f456884ae085.jpeg'),Path('bears/black/6bfb47f7-f9f9-4763-b1e0-438c0f0a68c2.jpeg'),Path('bears/black/af9689a9-914d-4d94-b42c-2e83fc0ef409.jpeg'),Path('bears/black/a0428795-c54d-4a60-94d3-351a2408027e.jpeg'),Path('bears/black/e8ff71f7-0704-42fc-a64c-aa211ea4c862.jpeg'),Path('bears/black/e8d4aef6-d251-4efb-aaa1-2d4e86686635.jpeg'),Path('bears/black/b472a0cc-797a-4359-9278-485bf20aa8e5.jpeg'),
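
Note that the cell above downloads only when the bears folder does not exist yet, so a run that was interrupted partway is not resumed. One possible refinement, sketched here with the standard library, is to check each category folder separately. `categories_to_download` is a hypothetical helper, and the set of image extensions it looks for is an assumption.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png'}  # assumed set of extensions

def categories_to_download(root, categories):
    """Return the categories whose folder is missing or contains no image files."""
    root = Path(root)
    pending = []
    for name in categories:
        folder = root / name
        has_images = folder.is_dir() and any(
            p.suffix.lower() in IMAGE_SUFFIXES for p in folder.iterdir()
        )
        if not has_images:
            pending.append(name)
    return pending

# Demo: one populated folder, one empty folder, one missing folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'bears'
    (root / 'grizzly').mkdir(parents=True)
    (root / 'grizzly' / 'example.jpeg').write_bytes(b'')  # placeholder file
    (root / 'black').mkdir()
    pending = categories_to_download(root, ['grizzly', 'black', 'teddy'])
```

The download loop could then iterate over the returned categories instead of all of `bear_types`.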

Cell 4 - Verify and Clean Images

In [5]:
# Check for corrupted images
failed = verify_images(fns)
failed

# Remove corrupted images
failed.map(Path.unlink);
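
Deleting with Path.unlink cannot be undone. If you would rather inspect the rejected files first, a variant is to move them into a separate folder. This is a sketch of that alternative, not what the cell above does: `quarantine` is a hypothetical helper, and the file created in the demo is a stand-in for a corrupted download.

```python
import shutil
import tempfile
from pathlib import Path

def quarantine(files, quarantine_dir):
    """Move each file into quarantine_dir and return the new paths."""
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in files:
        target = quarantine_dir / Path(f).name
        shutil.move(str(f), str(target))
        moved.append(target)
    return moved

# Demo with a fake "corrupted" file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / 'bears' / 'black' / 'broken.jpeg'
    bad.parent.mkdir(parents=True)
    bad.write_bytes(b'not an image')
    moved = quarantine([bad], Path(tmp) / 'rejected')
    original_gone = not bad.exists()
    moved_exists = moved[0].exists()
    moved_names = [p.name for p in moved]
```

Files keep their names when moved, so two files with the same name from different categories would collide. With the generated filenames shown in the download output this is unlikely.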

Cell 5 - Summary Report

In [6]:
# Get counts for each category
print("=== Download Summary ===")
for bear_type in bear_types:
    category_path = path/bear_type
    category_files = get_image_files(category_path)
    print(f"\n{bear_type.title()} Bears:")
    print(f"- Successfully downloaded: {len(category_files)} images")
    print(f"- Saved in: {category_path}")

print("\n=== Error Report ===")
if len(failed) > 0:
    print(f"Found and removed {len(failed)} corrupted images:")
    for f in failed:
        print(f"- {f.name}")
else:
    print("No corrupted images were found!")

print("\n=== Final Results ===")
final_files = get_image_files(path)
print(f"Total images across all categories: {len(final_files)}")
print(f"Average images per category: {len(final_files)/len(bear_types):.1f}")
print(f"\nImages are ready for use in: {path}")

=== Download Summary ===

Grizzly Bears:
- Successfully downloaded: 80 images
- Saved in: bears/grizzly

Black Bears:
- Successfully downloaded: 80 images
- Saved in: bears/black

Teddy Bears:
- Successfully downloaded: 80 images
- Saved in: bears/teddy

=== Error Report ===
No corrupted images were found!

=== Final Results ===
Total images across all categories: 240
Average images per category: 80.0

Images are ready for use in: bears
