# Building Image and Video Datasets
In this notebook, we build the image and video datasets for a single-shot face classifier that matches faces in videos against single reference images. The subject matter is the Brazilian Jiu Jitsu No Gi Worlds Championships in December 2023.

The single reference images come from the professional podium photos of the winning athletes, as posted on the IBJJF Facebook page. The source videos are downloaded from the IBJJF's public YouTube channel.

We will use OpenCV's Haar Cascade classifier to detect faces and their locations using bounding boxes. 

Image & Video Datasets:
- Image dataset: podium photos from a major tournament for facial recognition
    - IBJJF Podium Pics from the 2023 No Gi Worlds Tournament https://www.facebook.com/media/set/?set=a.728657779287780&type=3
- Video dataset: Black Belt matches on IBJJF's YouTube channel
    - IBJJF YouTube Playlist https://www.youtube.com/watch?v=VZN9Di_Ou-c&list=PLndFOMjO-W278-AspLyh5IWGC7eNQmF4U

In [None]:
import os
from pytube import Playlist, YouTube
import cv2
import glob
from PIL import Image
from IPython.display import display


### Manually Download Images 
The podium photos are on Facebook. Since there are fewer than 20 images, we download them manually and skip the Facebook API process.

### Download Videos

In [None]:
# Path to save videos
save_path = r'your_path'

# URL of the YouTube playlist
playlist_url = 'https://www.youtube.com/watch?v=VZN9Di_Ou-c&list=PLndFOMjO-W278-AspLyh5IWGC7eNQmF4U'

# Use the PyTube library to extract the playlist
playlist = Playlist(playlist_url)

# Loop through each video in the playlist
for video in playlist.video_urls:
    try:
        # Use the PyTube library to extract the video
        yt = YouTube(video)
        # Get the highest resolution stream
        stream = yt.streams.get_highest_resolution()
        # Download the video
        stream.download(output_path=save_path)
        print(f"Downloaded: {yt.title}")
    except Exception as e:
        print(f"Error downloading {video}: {str(e)}")


In [None]:
# Simple command-line alternative for downloading an entire playlist
# pytube 'https://www.youtube.com/watch?v=VZN9Di_Ou-c&list=PLndFOMjO-W278-AspLyh5IWGC7eNQmF4U'

### Detect Faces

In [None]:
save_path = r'your_path'

# Create a list of image files and their paths 
image_files = glob.glob(save_path + '/*.jpg')
len(image_files)
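
The glob pattern above only matches files ending in `.jpg`, and `glob` does not guarantee an ordering, so an index like `image_files[1]` may point to a different photo on another machine. If the manually downloaded photos use mixed extensions, a sketch along these lines could collect them in a stable order. The helper name `list_images` and the extension set are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

# Assumed set of extensions; adjust to match the files actually downloaded
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def list_images(folder):
    """Return a sorted list of image file paths (as strings) directly inside `folder`."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling `list_images(save_path)` would then stand in for the `glob.glob` call, with the suffix check being case-insensitive.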

In [None]:
# Visually inspect the image
img = Image.open(image_files[1])
display(img)

Haar Cascade Classifier

In [None]:
# Load a single image to test the face detector
img = cv2.imread(image_files[1])
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Define the cascade classifier
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

# Detect faces; minNeighbors filters out weak detections (false positives),
# while a larger scaleFactor speeds up the search at the risk of missing faces
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5, minSize=(30, 30))

# Draw bounding boxes
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 4)
    
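# OpenCV loads images in BGR order; convert to RGB so PIL shows correct colors
pil_img = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))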
display(pil_img)
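
Each detection is an `(x, y, w, h)` box: the top-left corner plus width and height. When cropping a face for later use, it can help to pad the box slightly and clamp it to the image bounds. A minimal sketch, assuming a hypothetical helper `pad_box` and a 20% default margin (both illustrative choices, not from the original notebook):

```python
def pad_box(box, img_width, img_height, margin=0.2):
    """Expand an (x, y, w, h) box by `margin` on each side, clamped to the image.

    Returns corner coordinates (x1, y1, x2, y2) suitable for array slicing.
    """
    x, y, w, h = (int(v) for v in box)
    dx, dy = int(w * margin), int(h * margin)
    x1 = max(x - dx, 0)
    y1 = max(y - dy, 0)
    x2 = min(x + w + dx, img_width)
    y2 = min(y + h + dy, img_height)
    return x1, y1, x2, y2
```

With an image array `img` loaded by `cv2.imread`, `img.shape[:2]` gives `(height, width)`, and the padded crop would be `img[y1:y2, x1:x2]`.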

In [None]:
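def detect_faces(image):
    """Return the face bounding boxes (x, y, w, h) found in the image file at path `image`."""
    img = cv2.imread(image)
    # cv2.imread returns pixels in BGR order, so convert with COLOR_BGR2GRAY
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5, minSize=(30, 30))
    return faces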

# Map each image file name to its detected face bounding boxes
face_dict = {}

for i in image_files:
    face_roi = detect_faces(i)
    file_name = os.path.basename(i)
    face_dict[file_name] = face_roi

face_dict    

In [None]:
len(face_dict)
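
`len(face_dict)` only reports how many images were processed, not how many faces were found in each one. A small sketch that tallies detections per file and lists the images with none follows. The function name `summarize_detections` is hypothetical, and the example dictionary uses made-up boxes standing in for real detector output. It relies only on `len()`, so it should work for arrays of boxes as well as empty results.

```python
def summarize_detections(face_dict):
    """Return ({file_name: face_count}, [file names with no detected faces])."""
    counts = {name: len(boxes) for name, boxes in face_dict.items()}
    no_faces = sorted(name for name, n in counts.items() if n == 0)
    return counts, no_faces

# Made-up detections for illustration only
example = {"a.jpg": [(10, 20, 50, 50), (90, 25, 48, 48)], "b.jpg": ()}
counts, no_faces = summarize_detections(example)
print(counts)
print(no_faces)
```

Images that end up in `no_faces` are candidates for a manual check or for rerunning detection with looser parameters.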