# 📂 DATA COLLECTION

## 🎯 Objective

The goal of this step is to collect a high-quality dataset of sports images with corresponding captions for the Image Captioning task. We use two data sources:
- Google Images crawling: scraping sports images and generating captions with a GPT-4 vision model.
- UIT-ViIC dataset: a publicly available dataset containing Vietnamese captions for images.

## 🌐 Data Sources

**1️⃣ Google Images Crawling**

- Reason: Besides Pinterest, Google Images hosts a vast collection of sports-related images.
- Approach:
    - Use web scraping to download images for each sport keyword.
    - Generate captions using the **GPT-4o-mini API**.
- Challenges:
    - Ensuring high-quality, relevant captions from the model.
    - Handling duplicates and low-quality images.
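
One way to approach the duplicate-image challenge is to compare files by a hash of their contents. The sketch below is illustrative and not necessarily the cleaning step used in this project: `find_duplicate_images` is a hypothetical helper, and it only catches byte-identical files (resized or re-encoded copies would need a perceptual hash instead).

```python
import hashlib
from pathlib import Path


def find_duplicate_images(folder):
    """Return paths whose bytes match an earlier file in `folder` (hypothetical helper)."""
    seen = {}        # content hash -> first path seen with that content
    duplicates = []  # later files that repeat an already-seen hash
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates
```

Files are visited in sorted order, so the lowest-numbered copy is kept and later copies are reported.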

**2️⃣ UIT-ViIC Dataset**

- Reason: It provides a structured dataset with Vietnamese captions.
- Approach:
    - Load the dataset from the Hugging Face Hub.
- Challenges:
    - Ensuring consistency between the two datasets.
    - Handling different caption formats.
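
To make the consistency challenge concrete, here is a minimal sketch that maps records from both sources onto one schema. The field names (`image`, `caption`, `captions`, `source`) and the helper `normalize_record` are assumptions for illustration only; the actual UIT-ViIC fields may differ.

```python
def normalize_record(record, source):
    """Map a raw record to {"image", "captions", "source"} (assumed schema)."""
    # Accept either a single caption string or a list of captions.
    raw = record.get("captions", record.get("caption", []))
    if isinstance(raw, str):
        raw = [raw]
    # Collapse repeated whitespace and drop empty captions.
    captions = [" ".join(c.split()) for c in raw if c and c.strip()]
    return {"image": record["image"], "captions": captions, "source": source}
```

Keeping `captions` as a list in both cases lets one image carry a single generated caption or several human-written ones without branching downstream.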

## 📥 Data Collection Methods

### 1️⃣ Self-crawl from Google Images

#### Import libraries

In [87]:
# !pip install icrawler
from icrawler.builtin import GoogleImageCrawler
from icrawler.downloader import Downloader
import os

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

# Disable logging from icrawler
import logging
logging.getLogger('icrawler').setLevel(logging.CRITICAL)


#### Define functions/classes

In [88]:
import shutil

# Remove a folder of downloaded images (no error if it does not exist)
def remove_images(folder):
    shutil.rmtree(folder, ignore_errors=True)

def crawl_google_images(sport, num_images, cnt):
    # Search phrase aimed at real action photos of the sport
    keyword = sport + " action real life shot"
    save_dir = "../data/self_crawl/google_images"
    crawler = GoogleImageCrawler(storage={"root_dir": save_dir})
    # Offset file numbering by 100 per sport so each sport gets its own block of IDs
    crawler.crawl(keyword=keyword, max_num=num_images, file_idx_offset=100*cnt)
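
The `file_idx_offset=100*cnt` argument gives each sport its own block of file numbers. The sketch below shows the resulting names under the assumption that icrawler uses its default naming (offset plus a 1-based counter, zero-padded to six digits); `expected_stem` is a hypothetical helper, not part of icrawler.

```python
def expected_stem(cnt, k):
    """File stem for the k-th image (1-based) of the sport with index `cnt`."""
    # Assumption: the crawler names files as offset + k, zero-padded to 6 digits.
    return f"{100 * cnt + k:06d}"


# Sport 2, 30th image: the kind of ID the resume check in the main loop looks for.
print(expected_stem(2, 30))  # prints 000230
```

With at most 40 images per sport and blocks of 100, the ID ranges of different sports do not overlap.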

#### TEST

In [89]:
# Smoke test: fetch 10 archery images, numbered from offset 200
crawler = GoogleImageCrawler(storage={"root_dir": "../data/self_crawl/google_images"})
crawler.crawl(keyword="Archery action real life shot", max_num=10, file_idx_offset=100*2)
# Images from this test run are removed before the main crawl

#### MAIN

In [90]:
with open("../data/metadata/sports_cate.txt", "r", encoding="utf-8") as file:
    sports_list = [line.strip() for line in file.readlines()]
    
# Preview the first few sports
sports_list[:5]

['Soccer', 'Basketball', 'Tennis', 'Athletics', 'Swimming']

In [91]:
remove_images("../data/self_crawl/google_images")

In [93]:
for cnt, sport in enumerate(sports_list, start=1):
    # Resume support: skip this sport if its 30th image (ID cnt*100+30) already exists as .jpg or .png
    number = str(cnt * 100 + 30).zfill(6)
    base = f"../data/self_crawl/google_images/{number}"
    if os.path.exists(base + ".jpg") or os.path.exists(base + ".png"):
        continue
    print(f"Downloading images for {sport}...")
    crawl_google_images(sport, 40, cnt)
    print(f">> Finish for {sport}...")

print("🎉 Done")

Downloading images for Basketball...
>> Finish for Basketball...
Downloading images for Boxing...
>> Finish for Boxing...
Downloading images for Wrestling...
>> Finish for Wrestling...
Downloading images for Judo...
>> Finish for Judo...
Downloading images for Taekwondo...
>> Finish for Taekwondo...
Downloading images for Cycling...
>> Finish for Cycling...
Downloading images for Rowing...
>> Finish for Rowing...
Downloading images for Sailing...
>> Finish for Sailing...
Downloading images for Canoeing...
>> Finish for Canoeing...
Downloading images for Fencing...
>> Finish for Fencing...
Downloading images for Archery...
>> Finish for Archery...
Downloading images for Weightlifting...
>> Finish for Weightlifting...
Downloading images for Triathlon...
>> Finish for Triathlon...
Downloading images for Equestrian...
>> Finish for Equestrian...
Downloading images for Modern Pentathlon...
>> Finish for Modern Pentathlon...
Downloading images for Handball...
>> Finish for Handball...
Downlo