This notebook contains the pre-processing steps applied to the [CelebA dataset](https://www.kaggle.com/jessicali9530/celeba-dataset) from Kaggle to prepare it for a multi-label facial attribute classification model.

## Boiler Plate

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline


In [2]:
import pandas as pd
import numpy as np
import imutils
import glob
import cv2
import shutil
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 500)
tqdm_notebook.pandas()  # registers DataFrame.progress_apply

## Downloading the dataset

We are going to download the [CelebFaces Attributes (CelebA) dataset](https://www.kaggle.com/jessicali9530/celeba-dataset) from Kaggle. This dataset is well suited to training and testing models for face detection, particularly for recognizing facial attributes such as brown hair, smiling, or wearing glasses. The images cover large pose variations, background clutter, and diverse people, and they come with a large number of samples and rich annotations. The data was originally collected by researchers at MMLAB, The Chinese University of Hong Kong (specific reference in the Acknowledgment section).

**Content**

- 202,599 face images of various celebrities
- 10,177 unique identities, but names of identities are not given
- 40 binary attribute annotations per image
- 5 landmark locations per image

We are going to use the Kaggle command-line tool (`kaggle`) to download the data.

In [5]:
## !kaggle datasets download -d jessicali9530/celeba-dataset

Downloading celeba-dataset.zip to /home/ubuntu/code/Deep_learning_explorations/7_Facial_attributes_fastai_opencv
 98%|█████████████████████████████████████▎| 1.31G/1.33G [00:13<00:00, 98.5MB/s]
100%|███████████████████████████████████████| 1.33G/1.33G [00:13<00:00, 103MB/s]
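
The download leaves a single `celeba-dataset.zip` archive, while the later cells read from a `data/` folder. The notebook does not show the extraction step, so the following is only a minimal sketch of one way to do it, assuming the archive sits in the working directory and should be unpacked into `data/`. The helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="data"):
    """Unpack zip_path into dest_dir and return the number of archive members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return len(archive.namelist())


# Assumed usage (archive name taken from the download output above):
# extract_archive("celeba-dataset.zip", "data")
```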


## Extracting frontal face from images

The CelebA dataset contains face images taken from the side as well as from the front, with varying orientation and zoom. The first preprocessing step is to select the images in which a frontal face is visible and to isolate that part of the image for training the facial attribute model. We are going to use OpenCV Haar cascades to locate the face in each image and crop the image so that only the face region, plus a small margin, is kept.

In [6]:
## Loading Haar Cascade
## Taken from https://github.com/opencv/opencv/tree/master/data/haarcascades
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

In [7]:
def face_extractor(origin, destination, fc):
    ## Importing image using OpenCV (1 = load as 3-channel BGR)
    img = cv2.imread(origin, 1)

    ## cv2.imread returns None when the file cannot be read --> SKIP
    if img is None:
        return None

    ## Resizing to constant width
    img = imutils.resize(img, width=200)

    ## Finding actual size of image
    H, W, _ = img.shape

    ## Converting BGR to grayscale for detection
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    ## Detecting faces on the image (scaleFactor=1.2, minNeighbors=10)
    face_coord = fc.detectMultiScale(gray, 1.2, 10, minSize=(50, 50))

    ## If no face found --> SKIP
    if len(face_coord) == 0:
        return None

    ## If one or more faces are found, take the one with the largest area
    X, Y, w, h = max(face_coord, key=lambda box: box[2] * box[3])

    ## Crop with a 35% margin around the detected box, clamped to the image, and export
    img_cp = img[
            max(0, Y - int(0.35*h)): min(Y + int(1.35*h), H),
            max(0, X - int(w*0.35)): min(X + int(1.35*w), W)
        ].copy()

    cv2.imwrite(destination, img_cp)
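
The box arithmetic inside `face_extractor` can be checked without OpenCV. The sketch below is an illustration only: the helper names `largest_box` and `padded_bounds` and the sample boxes are made up, and the bottom and right edges are written as `y + h + int(margin * h)`, which is intended to match the `int(1.35*h)` form used in the function.

```python
def largest_box(boxes):
    """Return the (x, y, w, h) box with the largest area, or None if there is none."""
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])


def padded_bounds(box, img_w, img_h, margin=0.35):
    """Grow the box by `margin` of its size on every side, clamped to the image."""
    x, y, w, h = box
    top = max(0, y - int(margin * h))
    bottom = min(y + h + int(margin * h), img_h)
    left = max(0, x - int(margin * w))
    right = min(x + w + int(margin * w), img_w)
    return top, bottom, left, right


# Made-up detections on a 200 x 245 image: (x, y, width, height)
boxes = [(10, 10, 50, 50), (60, 80, 100, 100)]
box = largest_box(boxes)
print(box)
print(padded_bounds(box, img_w=200, img_h=245))
```

The returned tuple is ordered `(top, bottom, left, right)` so that it can be used directly as `img[top:bottom, left:right]`.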

In [19]:
## Defining destination path (cv2.imwrite does not create missing folders)
import os
path = 'data/faces/'
os.makedirs(path, exist_ok=True)

## Finding all the images in the folder
item_list = glob.glob('data/img_align_celeba/img_align_celeba/*.jpg')
print(len(item_list))

202599


In [20]:
## Will run for about 45 min
for org in tqdm_notebook(item_list):
    face_extractor(origin = org, destination = path+org.split('/')[-1], fc=face_cascade)

In [21]:
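## Finding all the images and separating them into training and validation
import os

## shutil.move does not create missing folders
os.makedirs(path+'training/', exist_ok=True)
os.makedirs(path+'validation/', exist_ok=True)

for idx in tqdm_notebook(range(1,202600)):
    ## Images 000001-182637 go to training, the rest to validation
    if idx <= 182637:
        destination = path+'training/'
    else:
        destination = path+'validation/'
    try:
        shutil.move(
            path+str(idx).zfill(6)+'.jpg', 
            destination+str(idx).zfill(6)+'.jpg'
        )
    except FileNotFoundError:
        ## No face was detected for this image, so there is no cropped file to move
        pass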
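
The split rule in the loop above depends only on the numeric part of the file name. As a rough sketch, it can be written as a small pure function and checked before any files are moved. The name `split_folder` is hypothetical, and the counts printed here are upper bounds, since images without a detected face never reach either folder.

```python
from collections import Counter


def split_folder(image_id, last_training_index=182637):
    """Map a file name such as '000123.jpg' to 'training' or 'validation'."""
    index = int(image_id.split(".")[0])
    return "training" if index <= last_training_index else "validation"


# Upper bound on the size of each split, before face filtering
counts = Counter(split_folder(f"{i:06d}.jpg") for i in range(1, 202600))
print(dict(counts))
```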

## Label Creation

In [22]:
## Combining all label attributes
label_df = pd.read_csv('data/list_attr_celeba.csv')
column_list = pd.Series(list(label_df.columns)[1:])  # attribute names (first column is image_id)

def label_generator(row):
    ## Space-separated names of the attributes whose value is 1 for this image
    return ' '.join(column_list[(row[column_list] == 1).values])

label_df['label'] = label_df.progress_apply(label_generator, axis=1)
label_df = label_df.loc[:,['image_id','label']]
label_df.to_csv('data/labels.csv')
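
To make the label format concrete, here is a small standard-library sketch of the same idea on made-up rows. It assumes, as the `==1` test in `label_generator` implies, that an attribute is present when its value is 1 and absent otherwise (here coded as -1). The toy CSV and the helper `row_to_tags` are illustrative only.

```python
import csv
import io

# Made-up rows in the same layout as the attribute file: image_id, then one column per attribute
toy_csv = """image_id,Bangs,Eyeglasses,Smiling
a.jpg,1,-1,1
b.jpg,-1,-1,-1
"""


def row_to_tags(row, attribute_names):
    """Join the names of all attributes whose value is 1 into one space-separated string."""
    return " ".join(name for name in attribute_names if int(row[name]) == 1)


reader = csv.DictReader(io.StringIO(toy_csv))
attribute_names = reader.fieldnames[1:]  # skip the image_id column
tags = {row["image_id"]: row_to_tags(row, attribute_names) for row in reader}
print(tags)
```

An image with no positive attribute ends up with an empty tag string, which may need special handling when the labels are loaded for training.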

In [46]:
## Attaching labels to the correct file names
item_list = glob.glob('data/faces/*/*.jpg')
item_df = pd.DataFrame({'image_name':pd.Series(item_list).apply(lambda x: '/'.join(x.split('/')[-2:]))})
item_df['image_id'] = item_df.image_name.apply(lambda x: x.split('/')[1])

In [48]:
## Creating final label set
label_df = pd.read_csv('data/labels.csv')
label_df = label_df.merge(item_df, on='image_id', how='inner')
label_df.rename(columns={'label':'tags'}, inplace=True)
label_df.loc[:,['image_name','tags']].to_csv('data/faces/labels.csv', index=False)
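
The inner merge keeps only the images that exist in both tables, so any image for which no cropped face file was written drops out of the final label set. A minimal standard-library sketch of that behaviour, using made-up labels and file names and a hypothetical helper `inner_join`:

```python
def inner_join(labels, kept_files):
    """Pair each kept file with its tags; files or labels without a match are dropped.

    labels: dict mapping image_id (e.g. '000001.jpg') to a tag string
    kept_files: relative paths such as 'training/000001.jpg'
    """
    joined = []
    for image_name in sorted(kept_files):
        image_id = image_name.split("/")[-1]
        if image_id in labels:
            joined.append((image_name, labels[image_id]))
    return joined


# Made-up example: 000002.jpg has a label but no cropped file, so it disappears
labels = {"000001.jpg": "Smiling", "000002.jpg": "Bangs", "000003.jpg": "Eyeglasses"}
kept_files = ["validation/000003.jpg", "training/000001.jpg"]
joined = inner_join(labels, kept_files)
print(joined)
```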

In [49]:
label_df.head()

| Unnamed: 0.1 | Unnamed: 0 | image_id | tags | image_name |
| --- | --- | --- | --- | --- |
| 0 | 0 | 000001.jpg | Arched_Eyebrows Attractive Brown_Hair Heavy_Ma... | training/000001.jpg |
| 1 | 1 | 000002.jpg | Bags_Under_Eyes Big_Nose Brown_Hair High_Cheek... | training/000002.jpg |
| 2 | 4 | 000005.jpg | Arched_Eyebrows Attractive Big_Lips Heavy_Make... | training/000005.jpg |
| 3 | 5 | 000006.jpg | Arched_Eyebrows Attractive Big_Lips Brown_Hair... | training/000006.jpg |
| 4 | 6 | 000007.jpg | 5_o_Clock_Shadow Attractive Bags_Under_Eyes Bi... | training/000007.jpg |
