<a href="https://colab.research.google.com/github/kareemhamedd/AI-Project/blob/main/AI_fINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

README (Final Summary – Data Acquisition & Preprocessing)

Celebrity Face Classification – Data Acquisition & Preprocessing Summary, by Kareem Hamed

All processed data (aligned and augmented) is available in the following Google Drive folder:
https://drive.google.com/drive/folders/1MehPFT7pIbAbpMOxwHM7b464OYZx6TuJ?usp=drive_link


This document provides a concise overview of all work completed for the data acquisition and preprocessing stage of the Celebrity Face Classification project. The goal was to prepare clean, consistent, and well-structured facial image data for downstream model training and evaluation.

1. Dataset Collection

Collected two datasets: LFW and VGGFace2 from Kaggle.

Extracted, organized, and prepared the raw data for preprocessing.

2. Data Cleaning & Face Processing

Applied face detection and alignment using MTCNN.

Standardized all images to a unified resolution (112×112).

Implemented error handling and skip logic to avoid pipeline interruptions.
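
The skip logic mentioned above could look roughly like the following sketch. It is an illustration only, not the project's actual code: `process_with_skips` and `process_one` are hypothetical names, and catching every `Exception` is an assumption about how failures were tolerated.

```python
from typing import Callable, Iterable, List, Tuple


def process_with_skips(
    paths: Iterable[str],
    process_one: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run process_one on every path and record failures instead of raising."""
    done: List[str] = []
    skipped: List[Tuple[str, str]] = []
    for path in paths:
        try:
            process_one(path)
            done.append(path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            skipped.append((path, repr(exc)))
    return done, skipped
```

Returning the skipped paths together with the error text makes it possible to review afterwards which images never reached the aligned output.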

3. Data Augmentation (Training Set Only)

Performed medium-level augmentation to increase data variability.

Applied transformations such as horizontal flipping, brightness adjustment, and random cropping.

Augmentation was intentionally excluded from the validation set to maintain evaluation integrity.
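
A minimal sketch of the "training set only" rule, assuming the split is identified by a plain string. The helper `augmented_copies` and the `split` argument are hypothetical and written here for illustration; they are not taken from the project's scripts.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def augmented_copies(
    item: T,
    split: str,
    ops: Sequence[Callable[[T], T]],
    num_aug: int = 2,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return num_aug augmented variants for training data, nothing otherwise."""
    if split != "train":
        # Validation data stays untouched so evaluation reflects real inputs.
        return []
    rng = rng or random.Random()
    # Each variant applies one randomly chosen operation to the original item.
    return [rng.choice(ops)(item) for _ in range(num_aug)]
```

Passing a seeded `random.Random` makes the choice of operations repeatable between runs.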

4. Training Set Preprocessing

Executed full preprocessing (alignment + augmentation) on all identities in the VGGFace2 training split.

Generated enhanced datasets suitable for robust model training.

Stored all processed outputs in Google Drive for team access.

5. Validation Set Preprocessing

Applied alignment only to ensure clean, realistic evaluation data.

Maintained consistency with the training preprocessing pipeline while excluding augmentation.

6. Final Delivery

Delivered fully processed training and validation datasets.

Provided all preprocessing scripts used in the pipeline for reproducibility.

Ensured that the final data is organized, aligned, augmented where needed, and ready for model development.

7. Datasets

LFW: https://www.kaggle.com/datasets/ashfaqsyed/labelled-faces-in-the-wild

VGGFace2: https://www.kaggle.com/datasets/hearfool/vggface2




Status:
# All preprocessing tasks are complete. The dataset is fully prepared for model training and validation.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


'/content/drive/MyDrive/celebrity_face_project'

In [None]:
import os, shutil

os.makedirs('/root/.kaggle', exist_ok=True)
shutil.move('/content/kaggle.json', '/root/.kaggle/kaggle.json')
os.chmod('/root/.kaggle/kaggle.json', 0o600)

!kaggle --version


Kaggle API 1.7.4.5


In [None]:
!kaggle datasets download -d ashfaqsyed/labelled-faces-in-the-wild -p /content

import zipfile, glob, os

LFW_RAW = "/content/lfw_raw"
os.makedirs(LFW_RAW, exist_ok=True)

for z in glob.glob('/content/*.zip'):
    if "labelled" in z.lower() or "lfw" in z.lower():
        print("Extracting:", z)
        with zipfile.ZipFile(z, 'r') as f:
            f.extractall(LFW_RAW)


Dataset URL: https://www.kaggle.com/datasets/ashfaqsyed/labelled-faces-in-the-wild
License(s): Community Data License Agreement - Permissive - Version 1.0
Downloading labelled-faces-in-the-wild.zip to /content
 99% 1.27G/1.28G [00:11<00:00, 258MB/s]
100% 1.28G/1.28G [00:11<00:00, 123MB/s]
Extracting: /content/labelled-faces-in-the-wild.zip


In [None]:
!kaggle datasets download -d hearfool/vggface2 -p /content

VGG_RAW = "/content/vgg_raw"
os.makedirs(VGG_RAW, exist_ok=True)

for z in glob.glob('/content/*.zip'):
    if "vgg" in z.lower():
        print("Extracting:", z)
        with zipfile.ZipFile(z, 'r') as f:
            f.extractall(VGG_RAW)


Dataset URL: https://www.kaggle.com/datasets/hearfool/vggface2
License(s): unknown
Downloading vggface2.zip to /content
100% 2.32G/2.32G [00:47<00:00, 15.7MB/s]
100% 2.32G/2.32G [00:47<00:00, 52.6MB/s]
Extracting: /content/vggface2.zip


In [None]:
import os

LFW_RAW = "/content/lfw_raw"

print("LFW Raw Path:", LFW_RAW)
print("Total classes (folders):", len(os.listdir(LFW_RAW)))
print(os.listdir(LFW_RAW)[:10])  # Show first few identities


LFW Raw Path: /content/lfw_raw
Total classes (folders): 9
['lfw-funneled.tgz', 'lfw_funneled_superpixels_fine.tgz', 'lfw_superpixels_fine.tgz', 'lfw-deepfunneled-sp.tgz', 'lfw.tgz', 'lfw-a.tgz', 'README.txt', 'lfw-bush.tgz', 'lfw-deepfunneled.tgz']


In [None]:
import tarfile
import os

LFW_RAW = "/content/lfw_raw"

for file in os.listdir(LFW_RAW):
    if file.endswith(".tgz"):
        file_path = os.path.join(LFW_RAW, file)
        print("Extracting:", file_path)

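        # filter="data" rejects unsafe archive members (absolute paths, links
        # pointing outside the target) and avoids the Python 3.12
        # DeprecationWarning raised when no filter is given.
        with tarfile.open(file_path, 'r:gz') as tar:
            tar.extractall(LFW_RAW, filter="data")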


Extracting: /content/lfw_raw/lfw-funneled.tgz


Extracting: /content/lfw_raw/lfw_funneled_superpixels_fine.tgz
Extracting: /content/lfw_raw/lfw_superpixels_fine.tgz
Extracting: /content/lfw_raw/lfw-deepfunneled-sp.tgz
Extracting: /content/lfw_raw/lfw.tgz
Extracting: /content/lfw_raw/lfw-a.tgz
Extracting: /content/lfw_raw/lfw-bush.tgz
Extracting: /content/lfw_raw/lfw-deepfunneled.tgz


In [None]:
import os

LFW_RAW = "/content/lfw_raw"

folders = [f for f in os.listdir(LFW_RAW) if os.path.isdir(os.path.join(LFW_RAW, f))]
print("Extracted folders:", len(folders))
print(folders[:10])


Extracted folders: 6
['lfw_funneled', 'lfw-deepfunneled-sp', 'lfw_funneled_superpixels_fine', 'lfw_superpixels_fine', 'lfw-deepfunneled', 'lfw']


In [None]:
import os

LFW_PATH = "/content/lfw_raw/lfw-deepfunneled"

ids = [f for f in os.listdir(LFW_PATH) if os.path.isdir(os.path.join(LFW_PATH, f))]
print("Number of identities:", len(ids))
print(ids[:20])


Number of identities: 5749
['Mitchell_Crooks', 'Mikhail_Kasyanov', 'Dennis_Miller', 'Jorge_Marquez-Ruarte', 'Elisha_Cuthbert', 'Shaul_Mofaz', 'Ray_Wasden', 'Dennis_Erickson', 'Luis_Guzman', 'Saied_Hadi_al_Mudarissi', 'Leland_Chapman', 'Krishna_Bhadur_Mahara', 'Mary_Frances_Seiter', 'Robert_Mueller', 'Placido_Domingo', 'Steven_Craig', 'Jane_Pauley', 'Christopher_Russell', 'Paul_Murphy', 'James_Butts']


In [None]:
import os

VGG_RAW = "/content/vgg_raw"

print("Folders in VGG_RAW:", len(os.listdir(VGG_RAW)))
print(os.listdir(VGG_RAW)[:20])


Folders in VGG_RAW: 2
['train', 'val']


# **code**

In [None]:
!pip install mtcnn opencv-python pillow tqdm


Collecting mtcnn
  Downloading mtcnn-1.0.0-py3-none-any.whl.metadata (5.8 kB)
Collecting lz4>=4.3.3 (from mtcnn)
  Downloading lz4-4.4.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Downloading mtcnn-1.0.0-py3-none-any.whl (1.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 38.6 MB/s eta 0:00:00
Downloading lz4-4.4.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 67.7 MB/s eta 0:00:00
Installing collected packages: lz4, mtcnn
Successfully installed lz4-4.4.5 mtcnn-1.0.0


In [None]:
import os

os.makedirs("/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing", exist_ok=True)

print("Folders created successfully!")


Folders created successfully!


In [None]:
%%writefile /content/drive/MyDrive/celebrity_face_project/scripts/preprocessing/aligner.py
import cv2
import numpy as np
from PIL import Image
from mtcnn.mtcnn import MTCNN

class FacePreprocessor:
    def __init__(self, size=(112,112)):
        self.detector = MTCNN()
        self.size = size

    def align_face(self, img: Image.Image):
        np_img = np.array(img)
        faces = self.detector.detect_faces(np_img)

        if len(faces) == 0:
            return img.resize(self.size)

        faces.sort(key=lambda x: x['box'][2] * x['box'][3], reverse=True)
        x,y,w,h = faces[0]['box']
        x,y = max(0,x), max(0,y)

        face = np_img[y:y+h, x:x+w]
        face = cv2.resize(face, self.size)
        return Image.fromarray(face)


Overwriting /content/drive/MyDrive/celebrity_face_project/scripts/preprocessing/aligner.py


In [None]:
%%writefile /content/drive/MyDrive/celebrity_face_project/scripts/preprocessing/augmentor.py
from PIL import Image, ImageEnhance
import random

class FaceAugmentor:
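    def random_flip(self, img):
        # Always mirrors horizontally; the randomness comes from augment()
        # choosing which operation to apply.
        return img.transpose(Image.FLIP_LEFT_RIGHT)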

    def brightness(self, img):
        enhancer = ImageEnhance.Brightness(img)
        return enhancer.enhance(random.uniform(0.7, 1.3))

    def random_crop(self, img):
        w, h = img.size
        crop_frac = random.uniform(0.85, 1.0)

        new_w, new_h = int(w * crop_frac), int(h * crop_frac)

        left = random.randint(0, w - new_w)
        top = random.randint(0, h - new_h)

        cropped = img.crop((left, top, left + new_w, top + new_h))
        return cropped.resize((w, h))

    def augment(self, img, num_aug=2):
        ops = [self.random_flip, self.brightness, self.random_crop]
        out = []
        for _ in range(num_aug):
            op = random.choice(ops)
            out.append(op(img))
        return out


Writing /content/drive/MyDrive/celebrity_face_project/scripts/preprocessing/augmentor.py
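
The arithmetic inside `random_crop` can be checked without any image library. The sketch below reproduces only the box computation; `random_crop_box` is a hypothetical helper written for illustration. For a 112×112 input and a crop fraction between 0.85 and 1.0, the crop side lies between 95 and 112 pixels, so the top-left corner is offset by at most 17 pixels.

```python
import random
from typing import Optional, Tuple


def random_crop_box(
    width: int,
    height: int,
    min_frac: float = 0.85,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a random crop inside the image."""
    rng = rng or random.Random()
    frac = rng.uniform(min_frac, 1.0)
    new_w, new_h = int(width * frac), int(height * frac)
    # randint is inclusive on both ends, so the crop can touch the far border.
    left = rng.randint(0, width - new_w)
    top = rng.randint(0, height - new_h)
    return left, top, left + new_w, top + new_h
```

Because the crop is resized back to the original size afterwards, the effect is a mild zoom of up to roughly 18 percent combined with a small shift.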


In [None]:
%%writefile /content/drive/MyDrive/celebrity_face_project/scripts/preprocessing/aligner.py
import cv2
import numpy as np
from PIL import Image
from mtcnn.mtcnn import MTCNN
import signal

# -------- TIMEOUT HANDLER --------
class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException

signal.signal(signal.SIGALRM, timeout_handler)

class FacePreprocessor:
    def __init__(self, size=(112,112)):
        self.detector = MTCNN()
        self.size = size

    def align_face(self, img: Image.Image):
        np_img = np.array(img)

        # 5-second timeout per image
        signal.alarm(5)
        try:
            faces = self.detector.detect_faces(np_img)
        except TimeoutException:
            print("[TIMEOUT] Face detection skipped")
            return img.resize(self.size)
        finally:
            signal.alarm(0)

        if len(faces) == 0:
            return img.resize(self.size)

        faces.sort(key=lambda x: x['box'][2] * x['box'][3], reverse=True)
        x, y, w, h = faces[0]['box']
        x, y = max(0, x), max(0, y)

        face = np_img[y:y+h, x:x+w]
        face = cv2.resize(face, self.size)

        return Image.fromarray(face)


Overwriting /content/drive/MyDrive/celebrity_face_project/scripts/preprocessing/aligner.py
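
`align_face` clamps `x` and `y` to zero but leaves `w` and `h` unchanged, so when the detector reports a box that starts outside the image, the crop is shifted rather than trimmed. A minimal sketch of clamping all four edges is shown below; `clamp_box` is a hypothetical helper and is not part of `aligner.py`.

```python
from typing import Optional, Tuple


def clamp_box(
    box: Tuple[int, int, int, int],
    width: int,
    height: int,
) -> Optional[Tuple[int, int, int, int]]:
    """Convert an (x, y, w, h) box to (x1, y1, x2, y2) clipped to the image.

    Returns None when the clipped box is empty.
    """
    x, y, w, h = box
    x1, y1 = max(0, x), max(0, y)
    # The right and bottom edges come from the original x and y, so a negative
    # origin shrinks the box instead of shifting it.
    x2, y2 = min(width, x + w), min(height, y + h)
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2
```

With this helper the crop would be `np_img[y1:y2, x1:x2]`, and a `None` result could fall back to resizing the whole image, as the existing code already does when no face is found.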


In [None]:
LFW_INPUT = "/content/lfw_raw/lfw-deepfunneled"

VGG_TRAIN = "/content/vgg_raw/train"
VGG_VAL = "/content/vgg_raw/val"

BASE = "/content/drive/MyDrive/celebrity_face_project"

ALIGNED = f"{BASE}/processed/aligned"
AUG = f"{BASE}/processed/augmented"

# Create output structure
import os

os.makedirs(f"{ALIGNED}/lfw", exist_ok=True)
os.makedirs(f"{AUG}/lfw", exist_ok=True)

os.makedirs(f"{ALIGNED}/vgg_train", exist_ok=True)
os.makedirs(f"{AUG}/vgg_train", exist_ok=True)

os.makedirs(f"{ALIGNED}/vgg_val", exist_ok=True)
os.makedirs(f"{AUG}/vgg_val", exist_ok=True)


In [None]:
import sys
print(sys.path)


['/content', '/env/python', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.12/dist-packages/IPython/extensions', '/root/.ipython']


In [None]:
import sys

sys.path.append("/content/drive/MyDrive/celebrity_face_project/scripts")
sys.path.append("/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing")

print("Paths added.")


Paths added.


In [None]:
from preprocessing.pipeline import PreprocessingPipeline


In [None]:
BASE = "/content/drive/MyDrive/celebrity_face_project"

# INPUT PATHS (raw data)
LFW_INPUT = "/content/lfw_raw/lfw-deepfunneled"
VGG_TRAIN = "/content/vgg_raw/train"
VGG_VAL = "/content/vgg_raw/val"

# OUTPUT ROOTS
ALIGNED = f"{BASE}/processed/aligned"
AUG = f"{BASE}/processed/augmented"

import os
os.makedirs(f"{ALIGNED}/lfw", exist_ok=True)
os.makedirs(f"{AUG}/lfw", exist_ok=True)
os.makedirs(f"{ALIGNED}/vgg_train", exist_ok=True)
os.makedirs(f"{AUG}/vgg_train", exist_ok=True)
os.makedirs(f"{ALIGNED}/vgg_val", exist_ok=True)
os.makedirs(f"{AUG}/vgg_val", exist_ok=True)


In [None]:
from pipeline import PreprocessingPipeline

pipeline = PreprocessingPipeline(img_size=(112,112), num_aug=2)


In [None]:
print(sys.path)


['/content', '/env/python', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.12/dist-packages/IPython/extensions', '/root/.ipython', '/content/drive/MyDrive/celebrity_face_project/scripts', '/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing', '/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing']


In [None]:
pipeline = PreprocessingPipeline(img_size=(112,112), num_aug=2)

pipeline.process_folder(
    input_dir="/content/lfw_raw/lfw-deepfunneled",
    aligned_dir="/content/drive/MyDrive/celebrity_face_project/processed/aligned/lfw",
    aug_dir="/content/drive/MyDrive/celebrity_face_project/processed/augmented/lfw"
)


Processing lfw-deepfunneled: 100%|██████████| 5749/5749 [56:50<00:00,  1.69it/s]
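
The source of `pipeline.py` does not appear in this notebook, so how `process_folder` walks the data is not documented here. Assuming it mirrors one sub-folder per identity from the input into the output directory (the progress bar counts 5749 items, which matches the number of LFW identity folders), the path handling could be sketched as follows. `iter_image_pairs` and `IMAGE_EXTS` are hypothetical names.

```python
import os
from typing import Iterator, Tuple

# Assumed set of image extensions; the real pipeline may accept others.
IMAGE_EXTS = (".jpg", ".jpeg", ".png")


def iter_image_pairs(input_dir: str, output_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (source_path, destination_path) for every image, keeping the identity folder."""
    for identity in sorted(os.listdir(input_dir)):
        src_folder = os.path.join(input_dir, identity)
        if not os.path.isdir(src_folder):
            continue  # skip stray files such as README.txt
        for name in sorted(os.listdir(src_folder)):
            if name.lower().endswith(IMAGE_EXTS):
                yield (
                    os.path.join(src_folder, name),
                    os.path.join(output_dir, identity, name),
                )
```

Keeping the identity folder name in the destination path preserves the class label for later training code that reads labels from directory names.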


In [None]:
pipeline.process_folder(
    input_dir="/content/vgg_raw/train",
    aligned_dir="/content/drive/MyDrive/celebrity_face_project/processed/aligned/vgg_train",
    aug_dir="/content/drive/MyDrive/celebrity_face_project/processed/augmented/vgg_train"
)

pipeline.process_folder(
    input_dir="/content/vgg_raw/val",
    aligned_dir="/content/drive/MyDrive/celebrity_face_project/processed/aligned/vgg_val",
    aug_dir="/content/drive/MyDrive/celebrity_face_project/processed/augmented/vgg_val"
)


Processing train:  21%|██▏       | 103/480 [2:32:32<10:09:36, 97.02s/it]

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!ls "/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing"


aligner.py  augmentor.py  pipeline.py  __pycache__


In [None]:
import sys
sys.path.append("/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing")


In [None]:
!pip install mtcnn


Collecting mtcnn
  Downloading mtcnn-1.0.0-py3-none-any.whl.metadata (5.8 kB)
Collecting lz4>=4.3.3 (from mtcnn)
  Downloading lz4-4.4.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Downloading mtcnn-1.0.0-py3-none-any.whl (1.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 21.5 MB/s eta 0:00:00
Downloading lz4-4.4.5-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 53.5 MB/s eta 0:00:00
Installing collected packages: lz4, mtcnn
Successfully installed lz4-4.4.5 mtcnn-1.0.0


In [None]:
!pip install opencv-python pillow tqdm




In [None]:
import sys
sys.path.append("/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing")

from pipeline import PreprocessingPipeline


In [None]:
pipeline = PreprocessingPipeline(img_size=(112,112), num_aug=2)


In [None]:
!ls "/content/drive/MyDrive"


 12edab40-71c0-11ef-a237-49738a978907.jpg.webp
 147505301_max.jpg
'Advanced Programming Language 2- Lecture 2 –  Java Basics.pdf'
'Advanced Programming Language 2- Lecture 3–  OOP Part 1.pdf'
'Advanced Programming Language 2- Lecture 4–  OOP Part 2.pdf'
'Advanced Programming Language 2- Lecture 5 –  OOP Part 3 .pdf'
'Advanced Programming Language 2- Lecture 6 –  Ecxeptions handling.pdf'
'Advanced Programming Language 2- Lecture 7 –  IO Files.pdf'
'AI assignment 2.pdf'
 celebrity_face_project
 Classroom
'cleaned Data'
'Colab Notebooks'
'data management '
'data management  (1)'
'English Assignment 3 (1).docx'
'kareem 3'
'kareem_hamed_fahmy_931230215 (1).java'
 kareem_hamed_fahmy_931230215.java
 KareemHamedResume.pdf
'Max and the lucky key'
'Screenshot (1803).png'
'Screenshot (1805).png'
'Screenshot (1806).png'
'Sheet2 931230215.pdf'
'sheet3 931230215.pdf'
'sheet4 931230215.pdf'
'sheet5 931230215.pdf'
 test
'test (1)'
'كريم حامد فهمى-93120215'


In [None]:
!pip install kaggle




In [None]:
!ls /root/.cache/kagglehub/datasets/hearfool/vggface2/1


ls: cannot access '/root/.cache/kagglehub/datasets/hearfool/vggface2/1': No such file or directory


In [None]:
from google.colab import files
_ = files.upload()  # assign the result so the notebook does not echo the file contents (the API key)


Saving kaggle.json to kaggle.json

In [None]:
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle datasets download -d hearfool/vggface2 -p /content


Dataset URL: https://www.kaggle.com/datasets/hearfool/vggface2
License(s): unknown
Downloading vggface2.zip to /content
100% 2.32G/2.32G [00:17<00:00, 179MB/s]
100% 2.32G/2.32G [00:17<00:00, 141MB/s]


In [None]:
!mkdir -p /content/vgg_raw


In [None]:
!unzip -q "/content/vggface2.zip" -d "/content/vgg_raw"


In [None]:
!ls /content/vgg_raw

train  val


In [None]:
pipeline.process_folder(
    input_dir="/content/vgg_raw/train",
    aligned_dir="/content/drive/MyDrive/celebrity_face_project/processed/aligned/vgg_train",
    aug_dir="/content/drive/MyDrive/celebrity_face_project/processed/augmented/vgg_train"
)


Processing train:  51%|█████     | 243/480 [10:23:44<7:33:35, 114.83s/it]
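
The progress bars show the training run stopping partway (103/480, then 243/480) before the session was restarted. The notebook does not show whether or how later runs skipped identities that were already finished. One possible approach, sketched under that caveat, is to compare file counts per identity folder; `pending_identities` is a hypothetical helper.

```python
import os
from typing import List


def pending_identities(input_dir: str, aligned_dir: str) -> List[str]:
    """List identity folders whose aligned output is missing or incomplete."""
    pending: List[str] = []
    for identity in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, identity)
        if not os.path.isdir(src):
            continue
        dst = os.path.join(aligned_dir, identity)
        n_src = len(os.listdir(src))
        n_dst = len(os.listdir(dst)) if os.path.isdir(dst) else 0
        # Fewer outputs than inputs means the identity was interrupted or never started.
        if n_dst < n_src:
            pending.append(identity)
    return pending
```

Comparing counts rather than checking only for the folder's existence avoids treating a half-written identity as complete. It does assume that every input image yields one aligned file.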

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!ls "/content/drive/MyDrive/celebrity_face_project/processed/aligned"


lfw  vgg_train	vgg_val


In [None]:
import sys
sys.path.append("/content/drive/MyDrive/celebrity_face_project/scripts")


In [None]:
from preprocessing.pipeline import PreprocessingPipeline



In [None]:
from google.colab import files
_ = files.upload()  # assign the result so the notebook does not echo the file contents (the API key)


Saving kaggle.json to kaggle.json

In [None]:
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle datasets download -d hearfool/vggface2 -p /content



Dataset URL: https://www.kaggle.com/datasets/hearfool/vggface2
License(s): unknown
Downloading vggface2.zip to /content
 99% 2.31G/2.32G [00:35<00:00, 24.1MB/s]
100% 2.32G/2.32G [00:36<00:00, 68.4MB/s]


In [None]:
!unzip -q "/content/vggface2.zip" -d "/content/vgg_raw"


In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
import sys
sys.path.append('/content/drive/MyDrive/celebrity_face_project/scripts')


In [7]:
from google.colab import files
_ = files.upload()  # assign the result so the notebook does not echo the file contents (the API key)


Saving kaggle.json to kaggle.json

In [8]:
!mkdir -p /root/.kaggle
!cp /content/kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json


In [9]:
!kaggle datasets list | head


ref                                                                title                                                     size  lastUpdated                 downloadCount  voteCount  usabilityRating  
-----------------------------------------------------------------  --------------------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
wardabilal/spotify-global-music-dataset-20092025                   Spotify Global Music Dataset (2009–2025)               1289021  2025-11-11 09:43:05.933000          13421        301  1.0              
ranaghulamnabi/shopping-behavior-and-preferences-study             Shopping Behavior & Preferences Study                    72157  2025-12-03 09:14:26.797000           1322         28  1.0              
rohiteng/amazon-sales-dataset                                      Amazon Sales Dataset                                   4037578  2025-11-23 14:29:37.973000           4846         67  1.0

In [10]:
!kaggle datasets download -d hearfool/vggface2 -p /content


Dataset URL: https://www.kaggle.com/datasets/hearfool/vggface2
License(s): unknown
Downloading vggface2.zip to /content
100% 2.32G/2.32G [00:32<00:00, 208MB/s]
100% 2.32G/2.32G [00:32<00:00, 77.3MB/s]


In [15]:
import cv2
import numpy as np
from PIL import Image
from mtcnn.mtcnn import MTCNN

class FacePreprocessor:
    def __init__(self, size=(112, 112)):
        self.detector = MTCNN()
        self.size = size

    def align_face(self, img: Image.Image):
        try:
            img_np = np.array(img)

            if img_np is None or img_np.size == 0:
                return img.resize(self.size)

            faces = self.detector.detect_faces(img_np)

            if len(faces) == 0:
                return img.resize(self.size)

            faces.sort(
                key=lambda x: x['box'][2] * x['box'][3],
                reverse=True
            )
            x, y, w, h = faces[0]['box']
            x, y = max(0, x), max(0, y)

            face = img_np[y:y+h, x:x+w]

            if face.size == 0:
                return img.resize(self.size)

            face = cv2.resize(face, self.size)
            return Image.fromarray(face)

        except Exception:
            return img.resize(self.size)


In [13]:
from mtcnn.mtcnn import MTCNN


In [17]:
import sys
sys.path.append('/content/drive/MyDrive/celebrity_face_project/scripts/preprocessing')


In [18]:
from pipeline import PreprocessingPipeline


In [22]:
!ls /content/vgg_raw/val | head


n000001
n000009
n000029
n000040
n000078
n000082
n000106
n000129
n000148
n000149


In [23]:
pipeline.process_folder(
    input_dir="/content/vgg_raw/val",
    aligned_dir="/content/drive/MyDrive/celebrity_face_project/processed/aligned/vgg_val",
    aug_dir="/content/drive/MyDrive/celebrity_face_project/processed/augmented/vgg_val"
)


Processing val: 100%|██████████| 60/60 [08:44<00:00,  8.74s/it]
