# Data Acquisition

The image dataset used in this project was obtained from [MushroomObserver](https://mushroomobserver.org/articles/20), a website dedicated to helping people identify mushrooms. It provides access to more than 370,000 community-annotated images whose identifications carry a reasonable level of certainty. High-quality labels matter for machine learning models, because noisy labels can significantly reduce model performance.

## Download Dataset Catalogue

Before processing the image data, we download the dataset catalogue from Google Sheets. The catalogue lists image URLs, species names, and metadata from mushroom observations collected by the community, giving us a complete inventory of the available images on which to build the dataset.

### Output files

- `mo_catalogue.tsv`: Contains image URLs and species names


In [11]:
import os

import pandas as pd

from utils.dataset_acquisition import download_google_sheets_tsv

GOOGLE_SHEETS_URL = "https://docs.google.com/spreadsheets/d/1aQSmLlthx99pCt_IS6aHyhZdn_hUiv3EBLbf4h3Zg7s/edit?usp=sharing"
TSV_OUTPUT_PATH = os.path.join("data", "mo_catalogue.tsv")

# Download the catalogue only if it is not already cached locally
if not os.path.exists(TSV_OUTPUT_PATH):
    print("TSV file not found locally. Downloading from Google Sheets...")
    download_google_sheets_tsv(GOOGLE_SHEETS_URL, TSV_OUTPUT_PATH)
else:
    print("TSV file already exists: " + TSV_OUTPUT_PATH)

# Load the catalogue and report its dimensions
df = pd.read_csv(TSV_OUTPUT_PATH, sep="\t")
print(f"Dataset contains {len(df):,} rows and {len(df.columns)} columns")

TSV file already exists: data/mo_catalogue.tsv
Dataset contains 370,110 rows and 5 columns
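
The internals of `download_google_sheets_tsv` are not shown in this notebook. As a rough sketch of how such a download can work, a Google Sheets sharing link can be rewritten into the sheet's TSV export link. The helper name `sheets_tsv_export_url` and the default `gid=0` (the first tab) are assumptions made for illustration, and the approach only works if the sheet is readable by anyone with the link.

```python
import re


def sheets_tsv_export_url(share_url: str, gid: int = 0) -> str:
    """Turn a Google Sheets sharing link into a TSV export link.

    Hypothetical helper, not the project's download_google_sheets_tsv.
    `gid` selects the tab; 0 is assumed to be the first one.
    """
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"Not a Google Sheets URL: {share_url}")
    sheet_id = match.group(1)
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=tsv&gid={gid}"
    )


url = sheets_tsv_export_url(
    "https://docs.google.com/spreadsheets/d/"
    "1aQSmLlthx99pCt_IS6aHyhZdn_hUiv3EBLbf4h3Zg7s/edit?usp=sharing"
)
print(url)
```

The resulting link could then be fetched and written to disk, or passed directly to `pd.read_csv(url, sep="\t")`.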


## Explore Dataset Catalogue

Here we analyze the distribution of images across different species to identify which species have sufficient representation for neural network training. This exploration helps us understand the dataset composition and select species with adequate sample sizes.

A minimum threshold (250 images in the cell below) filters out species with insufficient data, so that every species we keep has enough examples for reliable model training and evaluation.
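
A minimal sketch of this kind of threshold filter is shown here on toy data. It is not the project's `count_species` implementation: the function name `species_above_threshold`, the column name `name`, and the inclusive comparison (`>=`) are all assumptions, and the counts are made up.

```python
import pandas as pd


def species_above_threshold(
    df: pd.DataFrame, min_images: int, column: str = "name"
) -> pd.Series:
    """Count rows per species and keep species with at least `min_images`.

    Hypothetical helper; the real catalogue's column names may differ.
    """
    counts = df[column].value_counts()  # sorted by count, descending
    return counts[counts >= min_images]


# Toy catalogue: one row per image, with made-up counts
toy = pd.DataFrame(
    {
        "name": ["Trametes versicolor"] * 3
        + ["Schizophyllum commune"] * 2
        + ["Boletus edulis"]
    }
)
selected = species_above_threshold(toy, min_images=2)
print(selected)
```

With a threshold of 2, the single-image species is dropped and the remaining species are listed from most to least frequent.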

### Output files

- `species_image_count_<threshold>.csv`: Contains species names and their corresponding image counts above the specified threshold

In [None]:
import os

import pandas as pd

from utils.dataset_acquisition import count_species

# Load the catalogue and count images per species, using a threshold of 250 images
TSV_OUTPUT_PATH = os.path.join("data", "mo_catalogue.tsv")
data_catalogue_df = pd.read_csv(TSV_OUTPUT_PATH, sep="\t")
count_species(data_catalogue_df, 250)

2025-07-18 09:23:34,150 - INFO - Processing species data...
2025-07-18 09:23:34,165 - INFO - Image list contains 10194 unique names
2025-07-18 09:23:34,167 - INFO - Image list contains 10193 unique species
2025-07-18 09:23:34,168 - INFO - Image list contains 287 unique species with more than 250 images
2025-07-18 09:23:34,169 - INFO - CSV file saved with name and counts: data/species_image_count_250.csv
2025-07-18 09:23:34,169 - INFO - Species with ≥250 images:
2025-07-18 09:23:34,170 - INFO -   - Trametes versicolor: 2018 images
2025-07-18 09:23:34,170 - INFO -   - Psilocybe pelliculosa: 1911 images
2025-07-18 09:23:34,170 - INFO -   - Psilocybe cyanescens: 1576 images
2025-07-18 09:23:34,171 - INFO -   - Ganoderma oregonense: 1544 images
2025-07-18 09:23:34,171 - INFO -   - Schizophyllum commune: 1395 images
2025-07-18 09:23:34,171 - INFO -   - Psilocybe ovoideocystidiata: 1353 images
2025-07-18 09:23:34,172 - INFO -   - Psilocybe zapotecorum: 1337 images
2025-07-18 09:23:34,172 - IN

## Download Image Dataset

After identifying species with sufficient representation, we proceed to download the actual images for the selected species. This process involves careful consideration of server resources and ethical web scraping practices.

### Ethical Data Scraping

Data scraping can place a significant burden on server infrastructure. MushroomObserver asks users to limit the number of requests per minute, so we implemented a rate limiting system that respects their guidelines and supports this valuable non-profit resource.

**Rate Limiting Implementation:**

- **Maximum 20 requests per minute** (1 request every 3 seconds on average)
- **Global rate limiting** across all concurrent operations  
- **Adaptive timing** based on server response times when available
- **Automatic backoff** for rate limit violations (HTTP 429)
- **Request queue management** to track requests over 60-second windows
- **Respectful delays** between requests to minimize server load

Together, these measures keep our request rate within the limit MushroomObserver asks for while still downloading efficiently, which supports the sustainability of this important community resource.
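
The core of such a limiter, tracking requests over a 60-second window and sharing one limit across concurrent workers, can be sketched as follows. This is an illustration under stated assumptions, not the code in `utils.dataset_acquisition`: the class name `SlidingWindowRateLimiter` and its parameters are hypothetical, and adaptive timing and HTTP 429 backoff are left out. The clock and sleep functions are injectable so the demo can run on simulated time.

```python
import threading
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span.

    Hypothetical sketch; not the project's actual implementation.
    """

    def __init__(self, max_requests=20, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._timestamps = deque()     # send times of recent requests
        self._lock = threading.Lock()  # one shared limit for all threads

    def acquire(self):
        """Block until a request may be sent, then record it."""
        with self._lock:
            while True:
                now = self._clock()
                # Forget requests that have left the window
                while (self._timestamps
                       and now - self._timestamps[0] >= self.window_seconds):
                    self._timestamps.popleft()
                if len(self._timestamps) < self.max_requests:
                    self._timestamps.append(now)
                    return
                # Window is full: wait until the oldest request expires
                self._sleep(self.window_seconds - (now - self._timestamps[0]))


class FakeClock:
    """Simulated time so the demo finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = SlidingWindowRateLimiter(clock=fake.time, sleep=fake.sleep)
for _ in range(21):
    limiter.acquire()  # in real use, call this before each HTTP request
print(f"Simulated seconds elapsed after 21 requests: {fake.now}")
```

In the demo the first 20 calls pass immediately, and the 21st waits until the oldest request is 60 seconds old. Holding the lock while sleeping is a deliberate simplification: other threads would have to wait for the window anyway.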

### Output files

- `data/images/<species_name>/`: Directory structure containing downloaded images organized by species

In [12]:
from utils.dataset_acquisition import scrape_species_from_list

# Relies on `os` and `data_catalogue_df` defined in the cells above
excluded_species_path = os.path.join("data", "excluded_species.csv")  # not used in this cell
species_list = ["Tylopilus felleus", "Imleria badia", "Boletus edulis"]

if scrape_species_from_list(data_catalogue_df, species_list):
    print("Data acquisition completed successfully.")
else:
    print("Data acquisition failed. Please check the logs for details.")

2025-07-18 09:28:06,526 - INFO - Starting to scrape 3 species from provided list
2025-07-18 09:28:06,528 - INFO - Rate limiting: Maximum 20 requests per minute (3 seconds average)
2025-07-18 09:28:06,589 - INFO - Starting download for Tylopilus felleus: all 601 available images
2025-07-18 09:28:06,607 - INFO - Species Tylopilus felleus: All 601 images already downloaded
2025-07-18 09:28:06,622 - INFO - Starting download for Imleria badia: all 217 available images
2025-07-18 09:28:06,637 - INFO - Species Imleria badia: All 217 images already downloaded
2025-07-18 09:28:06,652 - INFO - Starting download for Boletus edulis: all 593 available images
2025-07-18 09:28:06,668 - INFO - Species Boletus edulis: All 593 images already downloaded
Processing: Boletus edulis: 100%|██████████| 3/3 [00:00<00:00, 30.29species/s]
2025-07-18 09:28:06,669 - INFO - All species from list processed successfully.


Data acquisition completed successfully.
