# CE7454 2019 Project -- Group 3

**Add the full name here**

Please find all the models and data at [https://github.com/occia/ce7454-group3-project](https://github.com/occia/ce7454-group3-project)

In [4]:
import sys, os
import time
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import urllib.request as urllib
import scrapy
import json
import re
import matplotlib.pyplot as plt
import cv2 as cv
import argparse

from random import randint
from torchvision import transforms

ImportError: cannot import name 'etree' from 'lxml' (/usr/lib/python3/dist-packages/lxml/__init__.py)

## 1. Project Description

Here briefly introduces the background, and throw out the 2 project questions:
- How accurate the modern neural network models could be? (How to relate this with the Identity Acquisition?)
- What's the performance of age authentication (below/above 18) for current neural networks?

## 2. Data Preparation

### 2.1 Data Acquisition


For the project, we prepare 2 kinds of data:
- training data
- validation/testing data

#### Bechmark Data

Due to the requirement of the large amount of labeled data, we merged 3 existing labeled benchmark datasets as our training data, including [All-Age-Faces](https://github.com/JingchunCheng/All-Age-Faces-Dataset), [FGNET](https://yanweifu.github.io/FG_NET_data/index.html), [UTK Face](https://susanqq.github.io/UTKFace).

In total, the amount of labelled images from these 3 benchmarks is 38000, 32818 is used for training and 5182 is used for validation/testing.

<span style="color:red"> P: Should we mention that we selected the benchmark datasets for age classification? We can also insert the table from the papers about the bechmarks with their characteristics to justify our choice.</span>

#### Real-world Data

To validate the performance on the real-world data, we selected Instagram as a source for an additional dataset of unfiltered face images.
We assumed that we may retrieve the age from bio and the actual image from the picture profile.

##### Usernames Scraping 

As we intended to select the users from a particular country, we chose one influencer per country.
We considered the following countries:

European:
* Australia - [@eddiebthe3rd](https://www.instagram.com/eddiebthe3rd/)
* Canada - [@od_officiel](https://www.instagram.com/od_officiel/)
* Germany - [@kontrak](https://www.instagram.com/kontrak/)
* Russia - [@\_agentgirl\_](https://www.instagram.com/_agentgirl_/)

Asian:
* China - [@bingbing_fan](https://www.instagram.com/bingbing_fan/)
* Indonesia - [@jokowi](http://instagram.com/jokowi)
* India - [@narendramodi](http://instagram.com/narendramodi)

Middle Eastern:
* Iran - [@golfarahani](http://instagram.com/golfarahani)

African:
* Ethiopia - [@addisalem_getaneh](https://www.instagram.com/addisalem_getaneh/)
* Nigeria - [@iambisola](https://www.instagram.com/iambisola)

Hispanic:
* Brazil - [@carlinhosmaiaof](https://www.instagram.com/carlinhosmaiaof/)

To identify the necessary influencers we used websites such as [HypeAuditor](https://hypeauditor.com/top-instagram-all-australia/) and [Heepsy](https://www.heepsy.com/ranking/top-instagram-influencers-in-ethiopia) -- we have used it to get the information about the audience, as we were interested in the influencers with the audience which is at least 80% local.

Instagram does not allow scrapping and detects spiders, so we used a third-party application for Instagram called [Imgtagram](https://imgtagram.com/followers/justinbieber). It also allows us to retrieve the usernames in a most efficient way. We also attempted to retrieve usernames by means of visual testing, i.e. Selenium, that imitates user behavior, but it was much slower. 

To do so, we opened a web page of a particular user's followers and scrolled down the page till the number of users shown reaches 125k. We ran <span style="color:red"> the following JS script </span> in a developer's console of a browser:

<pre>
function scrapLinksAndScroll() {
  window.scrollTo(0, document.body.scrollHeight);
}

setInterval(scrapLinksAndScroll, 3); </pre>

##### Biography and Name Scraping
After we collected the usernames, we applied a library called scrapy that allows to scrap the webpage content based on html elements.
    scrapy allows us to do so in a multiprocessing way. The source code of a scraper looks as follows and requires a command <span style="color:red"> scrapy crawl -o country.json </span>
In this way, we write the collected data of users per country in a json file storing the information regarding their username, name, bio, country, and image URL.

We processed all the countries one by one as the data required careful validation.
We showcase the data scraping process on a small (30 users) subset of Canadian instagram users.

In [2]:
country = 'canada'

In [None]:
# cd scrapy/instascraper/instascraper/

class QuotesSpider(scrapy.Spider):
    country = 'canada'
    name = "profiles"
    file_path = '../../../data/test/usernames/%s.txt' %country
    with open(file_path) as f:
        start_urls = []
        for u in f.readlines():
            start_urls.append('https://imgtagram.com/u/' + u)

    def parse(self, response):
        country = 'canada'
        for quote in response.css('div.text-block'):
            yield {
                'username': quote.css('h3::text').get(),
                'name': quote.css('h1::text').get(),
                'bio': quote.css('p.descp::text').get(),
                'image': response.css('img.icon::attr(src)').get(),
                'country': country
            }

The resulting information is stored in a JSON file that looks like this:

In [89]:
profiles_path = './data/test/bio/%s.json' %country

with open(input_path, 'r') as file:
    user_profiles = json.load(file)
    retrieved_profiles = pd.DataFrame.from_dict(user_profiles)

retrieved_profiles

Unnamed: 0,username,name,bio,image,country
0,@x_.bellita._x,bella❤️,,https://scontent-cdg2-1.cdninstagram.com/vp/02...,canada
1,@_juliette_girard,Juliette Girard,,https://scontent-sin2-1.cdninstagram.com/vp/fa...,canada
2,@justine_marcoux10,Justine Marcoux,enjoy the little things🌞\n_13 y/o\n_🎿\n_,https://scontent-cdg2-1.cdninstagram.com/vp/fa...,canada
3,@poutine_myra,Jerami🥰,,https://scontent-cdg2-1.cdninstagram.com/vp/6f...,canada
4,@coraliebillette,Coralie :),Dancer💛\n,https://scontent-cdg2-1.cdninstagram.com/vp/8b...,canada
...,...,...,...,...,...
96,@enyalachance._,,,https://scontent-cdg2-1.cdninstagram.com/vp/07...,canada
97,@rraaphb,,,https://scontent-cdg2-1.cdninstagram.com/vp/eb...,canada
98,@audreyann.paquet,Audrey-Ann Paquet,27 ans . Rimouski 🌼 ...,https://scontent-cdg2-1.cdninstagram.com/vp/9d...,canada
99,@marie_pierjolin,Marie-Pier Jolin,Une Pinkie heureuse 🌻,https://scontent-cdg2-1.cdninstagram.com/vp/15...,canada


##### Filtering the bio

To identify names and bios that contain age, we have used a regular expression that looks for the numbers in the aforementioned fields that meet the following requirements:

* The previous symbol is not an alphanumeric character or an underscore, except for the case when the previous two symbols are represent a control sequence (\n, \t or \r)
* The number is either in range 1930-1999, or 2000-2019, or 10-99
* The number is not followed by a digit


When collecting our dataset, we also did manual checking to confirm the results.

In [96]:
import json 
import re
import pandas as pd

filtered_profiles = []

input_path = './data/test/bio/%s.json' %country
output_path = './data/test/bio/filtered/%s.json' %country

with open(input_path, 'r') as file:
    user_profiles = json.load(file)
    for user in user_profiles:
        if user['bio'] and user['bio'] != ' ':
            year_pattern = re.compile("(?:(?<!\w)|(?<=\\[ntr]))(19[3-9]\d|20[01]\d)(?!\d)")
            age_pattern = re.compile("(?:(?<!\w)|(?<=\\[ntr]))([1-9]\d)(?!\d)")
            birth_year_bio = year_pattern.match(user['bio'])
            str_name = str(user['name'])
            birth_year_name = year_pattern.match(str_name)
            age_bio = age_pattern.match(user['bio'])
            age_name = age_pattern.match(str_name)

            if birth_year_bio:
                year = birth_year_bio.group(1)
                age = 2019 - int(year)
                user['age'] = str(age)
                filtered_profiles.append(user)
            elif birth_year_name:
                year = birth_year_name.group(1)
                age = 2019 - int(year)
                user['age'] = str(age)
                filtered_profiles.append(user)
            elif age_bio:
                user['age'] = age_bio.group(1)
                filtered_profiles.append(user)
            elif age_name:
                user['age'] = age_name.group(1)
                filtered_profiles.append(user)
                
with open(output_path, 'w') as outfile:
    json.dump(filtered_profiles, outfile)
    
age_profiles = pd.DataFrame.from_dict(filtered_profiles)
age_profiles

Unnamed: 0,username,name,bio,image,country,age
0,@_frederiquem,Frédérique Marceau ♧,"19, St-Félicien, Québec 📍",https://scontent-cdg2-1.cdninstagram.com/vp/a8...,canada,19
1,@maude_montpetit,Maude 🌵,21 | 01.09.17 👼🏼💙 | 🐈🐈🐈🐈🐕🐕 |,https://scontent-cdg2-1.cdninstagram.com/vp/66...,canada,21
2,@daphhh.hamel,Daphh,18/05/19,https://scontent-cdg2-1.cdninstagram.com/vp/8c...,canada,18
3,@mariiepierp21,marie-pier picard,20 ans\n📚Cégep Garneau\n🦷Finissante en hygiène...,https://scontent-cdg2-1.cdninstagram.com/vp/fc...,canada,20
4,@camybr_,CAMILLE,23 | 🔒 | UQAC 🧠,https://scontent-cdg2-1.cdninstagram.com/vp/af...,canada,23
5,@1997kakou,Karel Soucy,22ans 💁‍♀️ ...,https://scontent-cdg2-1.cdninstagram.com/vp/65...,canada,22
6,@annesosimo,𝒜𝓃𝓃𝑒-𝒮𝑜,16yo | ☁️🥥⛓✉️🖇💭\n,https://scontent-cdg2-1.cdninstagram.com/vp/41...,canada,16
7,@alyson_cote,aly ♡,18 || csf\n🥰🌞😙✌,https://scontent-cdg2-1.cdninstagram.com/vp/0a...,canada,18
8,@tiffounne,Tiffany ✴,26 | B.Sc. Kinésiologie | M.Sc. Ergonomie,https://scontent-cdg2-1.cdninstagram.com/vp/47...,canada,26
9,@coralie.sav,Coralie Savard,19/08/02 ❤️ Sc:coralie_sav,https://scontent-cdg2-1.cdninstagram.com/vp/21...,canada,19


##### Scraping the profile pictures

For the collected users having a bio valid with respect to the regex described above, we then downloaded the profile pictures.

In [97]:
# image retrieval
def url_to_image(url):
    resp = urllib.urlopen(url)
    return resp

profiles_path = './data/test/bio/filtered/%s.json' %country
img_path = './data/test/images/%s/' %country
parsed_users = []

print('Parsing users of', country)

# profiles JSON parsing
with open(profiles_path, 'r') as file:
    user_profiles = json.load(file)
    for usr in user_profiles[0:]:
        url = usr['image']
        username = usr['username']
        image_path = img_path + username + '.jpg'
    # download the image URL
        if url != '':
            print ("Downloading image")
            image = url_to_image(url)
            f = open(image_path,'wb')
            f.write(image.read())
            f.close()

Parsing users of canada
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image
Downloading image


### 2.2 Data Exploration

// **add the age distribution graph for 38000**

// **add the age distribution graph for 1900+ instagram data**

In [None]:
# Age distribution for Instagram

image_nums = {}
image_nums_all = {}

for i in range(1,101):
    if i == 0:
        pass
    if i < 10:
        i = '0' + str(i)
        
    # Fix the path and variables
    data_dir = './data/processed/merged_raw/%s/' %i
    if os.path.exists(data_dir):
        os.listdir(data_dir)
        image_num = len([name for name in os.listdir(data_dir)])
        i = int(i)
        image_nums[i] = image_num
        image_nums_all[i] = image_num
    else: 
        i = int(i)
        image_nums_all[i] = 0
        
print(image_nums)
vals = image_nums
lists = sorted(image_nums_all.items()) 
x, y = zip(*lists)
plt.plot(x, y)
plt.show()

### 2.3 Data Validation

To perform the validation we used the opencv trained model for face detection. We ran the detection algorithm on all the collected images to eliminate the pictures which contain more or less than one face as shown below.
Apart from face detection, the network also cuts the faces from the picture.
We (after some more manual verification).

In [98]:
from face_utils import getFaces
from os import walk

from PIL import Image
import numpy as np


image_path = './data/test/images/%s' %country


for (dirpath, dirnames, filenames) in walk(image_path):
    for image in filenames:        
        username = image.split('.jpg')[0].split('/')
        name = username[len(username) - 1]        
        save_path = './data/test/images/%s/cut/%s.jpg' %(country,name)
        path = image_path + '/' + image
        print(path)
        faces = getFaces(path)
        if faces == 0:
            print('no face detected')
        elif len(faces) == 2:
            print('that is a couple')
        elif len(faces) == 1:
            print('there is one face') 
            cv.imwrite(save_path, faces[0])

./data/test/images/canada/@camybr_.jpg
there is one face
./data/test/images/canada/@mariiepierp21.jpg
there is one face
./data/test/images/canada/@maude_montpetit.jpg
there is one face
./data/test/images/canada/@coralie.sav.jpg
there is one face
./data/test/images/canada/@tiffounne.jpg
No face Detected, Checking next frame
no face detected
./data/test/images/canada/@audreyann.paquet.jpg
there is one face
./data/test/images/canada/@daphhh.hamel.jpg
No face Detected, Checking next frame
no face detected
./data/test/images/canada/@_frederiquem.jpg
that is a couple
./data/test/images/canada/@annesosimo.jpg
there is one face
./data/test/images/canada/@alyson_cote.jpg
there is one face
./data/test/images/canada/@1997kakou.jpg
there is one face
./data/test/images/canada/@lolobouu.jpg
that is a couple
./data/test/images/canada/@camybr_.jpg
there is one face
./data/test/images/canada/@mariiepierp21.jpg
there is one face
./data/test/images/canada/@maude_montpetit.jpg
there is one face
./data/tes

### 2.4 Data Preprocessing

// **this section should describe the bin-size splitting thing, section 3 will use that**

// **it should also contain Instagram data labelling**

In [99]:
# Countries indexes mapping

countries = {
    "australia": "0",
    "brazil": "1",
    "canada": "2",
    "china": "3",
    "ethiopia": "4",
    "nigeria": "4",
    "germany": "5",
    "india": "6",
    "indonesia": "7",
    "iran": "8",
    "russia": "9"
}

image_path = './data/test/images/%s/cut/' %country
data_path = './data/test/bio/filtered/%s.json' %country
result_path = './data/test/result'

with open(data_path, 'r') as file:
    user_profiles = json.load(file)
    for user in user_profiles:
        username = user['username']
        age = user['age']
        country_index = countries[country]
        
        image = image_path + username + '.jpg'
        
        if os.path.exists(image):
        # The new name is 'age_country_username'
            new_name = '%s/%s_%s_%s.jpg' %(result_path, age, country_index, username)
            print(new_name)
            os.rename(image, new_name)

./data/test/result/21_2_@maude_montpetit.jpg
./data/test/result/20_2_@mariiepierp21.jpg
./data/test/result/23_2_@camybr_.jpg
./data/test/result/22_2_@1997kakou.jpg
./data/test/result/16_2_@annesosimo.jpg
./data/test/result/18_2_@alyson_cote.jpg
./data/test/result/19_2_@coralie.sav.jpg
./data/test/result/27_2_@audreyann.paquet.jpg


## 3. Models and Training

In this section, we discuss the choosen models, the training configurations for each model, and the whole training pipeline. The outputs of this section are the saved trained weights for all models.

### 3.1 Model Selection

We targeted on 3 representative models in face recognition and age prediction, the MLP, VGG, and ResNet.

As there are many variants of these networks, the first thing is to determine which variants of these model are suitable for our project. 
We probed ResNet18, ResNet50, ResNet152 using parts of the training data (around 10,000) and found that the performance has no big difference. 
Thus we made the following selection:
- ResNet18, resnet with 18 layers
- VGG19_bn, vgg 19 layers with batch normalization
- MLP18, 18-layer mlp

The ResNet and VGG models can directly imported using the following statements:

In [None]:
from torchvision.models import resnet18
from torchvision.models import vgg19_bn

And the MLP model is implemented by ourself and you can find it in the `./src/neural_network/mlp.py` in the [project github](https://github.com/occia/ce7454-group3-project).

For demo usage, here is a smaller version MLP implementation.

In [None]:
# this class is for demo use
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, hidden_size3, hidden_size4, output_size):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size1),
            nn.ReLU(),
            nn.Linear(hidden_size1, hidden_size2),
            nn.ReLU(),
            nn.Linear(hidden_size2, hidden_size3),
            nn.ReLU(),
            nn.Linear(hidden_size3, hidden_size4),
            nn.ReLU(),
            nn.Linear(hidden_size4, output_size)
        )
        
    def forward(self, x):
        # convert tensor (128, 1, 28, 28) --> (128, 1*28*28)
        x = x.view(x.size(0), -1)
        x = self.layers(x)
        return x

### 3.2 Training Configuration

#### 3.2.1 Training Parameters Setup

We keep the following training configuration for all 3 choosen models:
- Learning Rate, the initial value of learning rate is set as `0.001`
- Optimizer, using **Adam** rather than **SGD**
- Criterion, using `torch.nn.CrossEntropyLoss()`
- Epoches, set to 50 as it balances the training time costs and the training consequence
- Batch size, set as 256
- Image pixels, set as `(3, 200, 200)`, 3 means 3 channels (a.k.a colors)

In [None]:
#
# training parameters setup for demo use
#

device= torch.device("cuda")
#device= torch.device("cpu")

channels = 3
img_pixels = (200,200)
lr = 0.001
num_epochs = 2
batch_size = 128

# loading dataset
def loading_dataset(train_dataset, test_dataset):
    transform = transforms.Compose([
        transforms.Resize(img_pixels),
        transforms.ToTensor()])

    img_data_train = torchvision.datasets.ImageFolder(root=train_dataset, transform=transform)
    data_loader_train = torch.utils.data.DataLoader(img_data_train, batch_size=batch_size,shuffle=True)

    img_data_val = torchvision.datasets.ImageFolder(root=test_dataset, transform=transform)
    data_loader_val = torch.utils.data.DataLoader(img_data_val, batch_size=batch_size,shuffle=True)

    dataloaders = {}
    dataloaders['train'] = data_loader_train
    dataloaders['val'] = data_loader_val
    
    return dataloaders

#### 3.2.2 Model Training WorkFlow

The workflow is based on the template teacher provided in the class, and is improved in some aspects.

Here lists the code.

In [None]:
#
# main training workflow
#
def train_model(model, dataloaders, criterion, optimizer, num_epochs=25):
    since = time.time()
    last = since
    time_elapsed = since

    val_acc_history = []

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    # Get model outputs and calculate loss
                    # Special case for inception because in training it has an auxiliary output. In train
                    #   mode we calculate the loss by summing the final output and the auxiliary output
                    #   but in testing we only consider the final output.
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)

                    _, preds = torch.max(outputs, 1)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)
            
            time_elapsed = time.time() - last
            last = time.time()
            
            print('{} Loss: {:.4f} Acc: {:.4f} Time: {:.0f}m {:.0f}s'.format(phase, epoch_loss, epoch_acc, time_elapsed // 60, time_elapsed % 60))

            # deep copy the modeltopk
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
            if phase == 'val':
                val_acc_history.append(epoch_acc)

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, val_acc_history

The weights of the networks are initialized randomly.
Also the images of the dataset are shuffled every time.
The key different parts of our implementation from the teacher's template are:
- we do train & validation for every epoch
- based on the validation result, we save the best epoch's weights, and return that instead of the one be trained longest

### 3.3 Training Pipeline

Till now, we know which model to train and how to train a model. To answer the questions we raised at the beginning, we need to train all the combinations of the selected models and the prepared datasets.

Thus, the next step is building the training pipeline for all training combinations.

In [None]:
# download demo dataset
#!wget -nc "https://somelink"
#!ls
#!tar xf ce7454_demo_dataset.tar.gz
#!ls dataset
#!mkdir -p ./saved_models

In [None]:
def training_and_save_model(net, num_epochs, model_save_name):
    net = net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer=torch.optim.Adam(net.parameters(), lr)
    net, _ = train_model(net, dataloaders, criterion, optimizer, num_epochs)
    torch.save(net.state_dict(), os.path.join("./saved_models/", model_save_name))

#
# whole training pipeline
#
print("[+] This training pipeline is for demo usage")
for binsize in [1, 6, 10]:
    classes = int((100 + binsize - 1) / binsize)
    
    dataloaders = loading_dataset("./dataset/demo_train_bin_%d" % (binsize), "./dataset/demo_test_bin_%d" % (binsize))
    
    for model in ["MLP", "ResNet", "VGG"]:
        print("[+] Training for %s with binsize %d dataset started" % (model, binsize))
        
        if model == "MLP":
            net = MLP(channels * img_pixels[0] * img_pixels[1], 512, 512, 512, 512, classes)
        elif model == "ResNet":
            net = resnet18(num_classes=classes)
            # comment this as this is a demo
            #continue
        else:
            net = vgg19_bn(num_classes=classes)
            # comment this as this is a demo
            #continue
        
        model_save_name = "%s_%s_demo_merged_train_bin%d" % (num_epochs, net.__class__.__name__, binsize)
        training_and_save_model(net, num_epochs, model_save_name)

        print("[+] Training for %s with binsize %d dataset done" % (model, binsize))

        del net

As shown in the pipeline code, we saved weights of the best epoch for all the models towards all the datasets.

## 4. Evaluation

### 4.1 Accuracy Comparison Among Models

### 4.2 Identity Acquision Accuracy Cross Ages

### 4.3 Age Authentication For 18
