# Overview

1. Import packages and assign bias attribute
2. Look at the distribution of one protected attribute
3. Run through all the attributes and assess the largest imbalance
4. Fix imbalance and check for bias
    - 4.1 Example of augmentation
    - 4.2 Which pictures to augment
    - 4.3 Augmenting the right pictures
5. Checking the number of channels
6. Training

## 1. Import packages and assign bias attribute

In [1]:
import pickle
import numpy as np
import pandas as pd
import random
import os
import matplotlib.pyplot as plt
import seaborn as sns
import copy
plt.style.use(['seaborn-whitegrid'])

In [2]:
train = pd.read_csv("Data/train.csv")

In [3]:
train_len = len(train)

In [4]:
attr = list(train.columns[1:41])

In [None]:
protected_attr = "Eyeglasses"
target = "Smiling"

## 2. Look at the distribution of one protected attribute

In [None]:
#.copy() so that adding a column below does not trigger a SettingWithCopyWarning
train = train[[target,protected_attr]].copy()

In [None]:
#This is only done for the groupby command in the below cell
train["fill"] = 9

In [None]:
dist = train.groupby([protected_attr,target]).count()

In [None]:
dist = dist.fill/train_len
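
The helper column `fill` exists only so that `count()` has something to count. As a hedged alternative, the same group shares can be obtained with `groupby(...).size()`. The sketch below uses a small made-up dataframe (the name `toy` and its values are hypothetical, not taken from `Data/train.csv`) and assumes 0/1 encoded labels.

```python
import pandas as pd

# Hypothetical toy data standing in for Data/train.csv (0/1 encoded labels)
toy = pd.DataFrame({
    "Eyeglasses": [0, 0, 0, 0, 0, 1, 1, 1],
    "Smiling":    [0, 0, 1, 1, 1, 0, 1, 1],
})

# size() counts the rows in each (attribute, target) group,
# so no dummy column is needed
counts = toy.groupby(["Eyeglasses", "Smiling"]).size()

# Dividing by the number of rows gives the share of each group
shares = counts / len(toy)
print(shares)
```

The resulting series has the same group order as `dist` above: (0, 0), (0, 1), (1, 0), (1, 1).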

In [None]:
labels = 'attr=0, target=0', 'attr=0, target=1', 'attr=1, target=0', 'attr=1, target=1'
sizes = list(dist)
#colors = ['blue', 'yellowgreen', 'lightcoral', 'lightskyblue']

# Plot
plt.pie(sizes, labels=labels,autopct='%1.1f%%')

plt.axis('equal')
plt.show()

In [None]:
dist

In [None]:
k = list(dist)

In [None]:
k

In [None]:
d = []

d.append([k[0]/k[1], k[2]/k[3]])

In [None]:
d

Build a dataframe that summarizes all the different imbalances in the data, in order to assess which attribute has the largest imbalance. Here one can use a ratio like the one on the poster.

## 3. Run through all attributes and assess the largest imbalance

In [None]:
train = pd.read_csv("Data/train.csv")
train_len = len(train)

In [None]:
attr = list(train.columns[1:41])
target = "Smiling"
attr.remove('Smiling')

In [None]:
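diff = []
for i in attr:
    protected_attr = i
    #.copy() so that adding the helper column does not trigger a SettingWithCopyWarning
    train_ = train[[target,protected_attr]].copy()
    train_["fill"] = 9
    #Share of the training set in each (attribute, target) group
    dist = train_.groupby([protected_attr,target]).count()
    dist = dist.fill/train_len
    k = list(dist)
    #Ratio of not smiling to smiling, for attr = 0 and for attr = 1
    diff.append([i, k[0]/k[1], k[2]/k[3]])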

In [None]:
diff = pd.DataFrame(diff)

In [None]:
diff.columns = ['Protected_attr', 'ratio attr 0', 'ratio attr 1']

In [None]:
diff.sort_values(['ratio attr 0', 'ratio attr 1'], ascending=[False, False])

In [None]:
diff.sort_values(['ratio attr 1'], ascending=[False]).head()

When looking at the two heads, we see that
- 3/10 are not part of our poster (wearing_hat, lipstick, no_beard)
- 4/10 have detected bias under our definition (goatee, pale_skin, mouth_slightly_open, high_cheekbones)
- 3/10 have no detected bias under our definition (mustache, sideburns, attractive)

## 4. Fix imbalance

We would like to balance the unbalanced training set. Looking at the attribute with the largest imbalance, `high_cheekbones` (where our metric also detected a bias), we see that `ratio_attr_0` is 4.8 and `ratio_attr_1` is 0.17.

The ratio formulas are, with $N(\cdot)$ denoting the number of training images in a group,
$$
ratio\_attr\_0 = \frac{N(attr = 0,\ target = 0)}{N(attr = 0,\ target = 1)}
$$
and
$$
ratio\_attr\_1 = \frac{N(attr = 1,\ target = 0)}{N(attr = 1,\ target = 1)}
$$

This means that
- out of all the people NOT having high cheekbones, 4.8 times as many were not smiling as smiling
- out of all the people HAVING high cheekbones, about 6 times as many were smiling as not smiling (1/0.17 ≈ 5.9)

and hence this particular training data is very unbalanced.
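
To make the two ratios concrete, here is a minimal sketch that computes them for one attribute. The function name `imbalance_ratios` and the toy data are hypothetical, and the sketch assumes 0/1 encoded columns in which all four (attribute, target) groups occur.

```python
import pandas as pd

def imbalance_ratios(df, protected_attr, target):
    """Return (ratio_attr_0, ratio_attr_1) for 0/1 encoded columns."""
    # Number of rows in each (attribute, target) group
    counts = df.groupby([protected_attr, target]).size()
    ratio_attr_0 = counts.loc[(0, 0)] / counts.loc[(0, 1)]
    ratio_attr_1 = counts.loc[(1, 0)] / counts.loc[(1, 1)]
    return ratio_attr_0, ratio_attr_1

# Hypothetical toy data: 6 people without and 6 people with the attribute
toy = pd.DataFrame({
    "High_Cheekbones": [0] * 6 + [1] * 6,
    "Smiling": [0, 0, 0, 0, 0, 1] + [0, 1, 1, 1, 1, 1],
})

r0, r1 = imbalance_ratios(toy, "High_Cheekbones", "Smiling")
print(r0, r1)
```

In the toy data, five of the six people without the attribute are not smiling, so `r0` is 5.0, while only one of the six people with the attribute is not smiling, so `r1` is 0.2. A balanced attribute would give a ratio close to 1 in both groups.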

Inspired by the following article, https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6, we have three basic approaches (more complex ones exist, of course):

1. Undersampling: randomly delete observations from the class that has sufficient observations, so that the two classes become comparable in size. This approach is simple to follow, but there is a high risk that the deleted data contains important information about the predictive class.
2. Oversampling: for the under-represented class, randomly increase the number of observations by adding copies of existing samples. This gives us a sufficient number of samples to work with, but it may lead to overfitting to the training data.
3. Synthetic sampling (SMOTE): synthetically manufacture observations of the under-represented class that are similar to the existing ones, using nearest neighbours. The problem is what to do when the class is extremely rare. For example, we may have only one picture of a rare species that we want to identify with an image classification algorithm.
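
The first two options can be sketched with plain pandas. This is only an illustration on a made-up dataframe (the names `toy`, `majority`, `minority`, `under` and `over` are hypothetical); it is not the procedure used on the images further down.

```python
import pandas as pd

# Hypothetical toy data: 8 smiling and 2 non-smiling pictures
toy = pd.DataFrame({
    "im_id": [f"{i:06d}.jpg" for i in range(1, 11)],
    "Smiling": [1] * 8 + [0] * 2,
})
majority = toy[toy.Smiling == 1]
minority = toy[toy.Smiling == 0]

# 1. Undersampling: keep only as many majority rows as there are minority rows
under = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# 2. Oversampling: draw minority rows with replacement until the classes match
over = pd.concat([majority,
                  minority.sample(n=len(majority), replace=True, random_state=0)])

print(under.Smiling.value_counts().to_dict())
print(over.Smiling.value_counts().to_dict())
```

Undersampling leaves 4 rows in total and throws 6 away, while oversampling ends with 16 rows, 6 of which are duplicates. That is the trade-off between losing information and risking overfitting described in the list above.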

I am using option 2 below, where the copies are the original picture plus two augmented versions of it.

In [None]:
import os
from PIL import Image
from PIL import ImageFilter

#### 4.1 Example of augmentation

In [None]:
im=Image.open('/Users/MartinJohnsen/Documents/Martin Johnsen/MMC/3. Semester/Deep Learning/Projects/Algorithmic fairness/Data/celebA_resize3/000001.jpg')

In [None]:
im

In [None]:
im=im.convert("RGB")
r,g,b=im.split()
r=r.convert("RGB")
g=g.convert("RGB")
b=b.convert("RGB")
#im_blur=im.filter(ImageFilter.GaussianBlur)
im_unsharp=im.filter(ImageFilter.UnsharpMask)

In [None]:
r

In [None]:
g

In [None]:
b

In [None]:
im_unsharp

In [None]:
#We choose to use only the im_unsharp and the b picture together with the normal picture,
# which means that we write 3 pictures every time we upsample 1 picture
pictures_upsampling = 3
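
The three variants can be wrapped in one small helper. The sketch below is an assumption about how one might package the steps above: the function name `augment` is hypothetical, and it runs on a synthetic single-colour image instead of a CelebA picture. Unlike the cells above, it returns the RGB-converted picture as the "normal" variant.

```python
from PIL import Image, ImageFilter

def augment(im):
    """Return the three variants used for upsampling, keyed by file prefix."""
    rgb = im.convert("RGB")
    # split() returns one single-band image per channel; keep only the blue one
    _, _, b = rgb.split()
    return {
        "im": rgb,                                     # the normal picture
        "b": b.convert("RGB"),                         # blue channel as a grey RGB image
        "un": rgb.filter(ImageFilter.UnsharpMask()),   # sharpened picture
    }

# Synthetic stand-in for a training picture
toy_im = Image.new("RGB", (64, 64), (200, 120, 40))
variants = augment(toy_im)
print({name: (v.mode, v.size) for name, v in variants.items()})
```

All three variants stay 3-channel RGB images of the same size, so they can be fed to the same model input as the original pictures.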

#### 4.2 Which pictures to augment

In this particular example, we want to upsample the number of smiling people without high cheekbones and the number of non-smiling people with high cheekbones.

In [None]:
#assessing the number of pictures that are different
train = pd.read_csv("Data/train.csv")
train_len = len(train)
attr = list(train.columns[1:41])
protected_attr = "High_Cheekbones"
target = "Smiling"

In [None]:
train.head()

In [None]:
#Defining train as only the target and the protected attribute
# (.copy() so that adding a column does not trigger a SettingWithCopyWarning)
train = train[[target,protected_attr]].copy()
#This is only done for the groupby command below
train["fill"] = 9
#Groupby and counting the number of occurrences
dist = train.groupby([protected_attr,target]).count()

In [None]:
labels = 'attr=0, target=0', 'attr=0, target=1', 'attr=1, target=0', 'attr=1, target=1'
sizes = list(dist.fill)
#colors = ['blue', 'yellowgreen', 'lightcoral', 'lightskyblue']

# Plot
plt.pie(sizes, labels=labels,autopct='%1.1f%%')

plt.axis('equal')
plt.show()

In [None]:
dist

In [None]:
#This is only training data
sum(dist.fill)

As seen from the table above, we could potentially also downsample; however, we would lose a lot of information. We need to upsample around 60,000 pictures of smiling people without high cheekbones (smiling_not_highcheekbones = `s_n_hc`), and around 50,000 pictures of non-smiling people with high cheekbones (nonsmiling_highcheekbones = `ns_hc`).
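
The number of extra pictures per group is the gap between the two target values within one attribute value. A minimal sketch with made-up counts follows; the numbers in `counts` are hypothetical and are not the ones from the table above, and 3 written pictures per sampled picture is assumed.

```python
# Hypothetical group counts, keyed by (attr, target)
counts = {(0, 0): 900, (0, 1): 300, (1, 0): 250, (1, 1): 700}
pictures_per_sample = 3  # assumed: each sampled picture is written 3 times

# Gap to close inside each attribute group
s_n_hc_gap = abs(counts[(0, 0)] - counts[(0, 1)])   # smiling, no high cheekbones
ns_hc_gap = abs(counts[(1, 1)] - counts[(1, 0)])    # not smiling, high cheekbones

# Number of source pictures to draw so that the written pictures close the gap
s_n_hc_draws = round(s_n_hc_gap / pictures_per_sample)
ns_hc_draws = round(ns_hc_gap / pictures_per_sample)

print(s_n_hc_gap, s_n_hc_draws)
print(ns_hc_gap, ns_hc_draws)
```

With these toy counts, 200 draws add 600 pictures to the (0, 1) group, bringing it to 900, the same size as the (0, 0) group.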

In [None]:
#Reading all the attributes for every single image and which partition it belongs to:
pp = pd.read_csv('Data/list_attr_celeba.txt', sep= " ")
part = pd.read_csv('Data/list_eval_partition.txt', sep= " ",header = None)

In [None]:
pp.head()

In [None]:
#naming columns in the partition dataset
part.columns = ['im_id','partition']

Filtering the training examples

In [None]:
pp = pp.merge(part, how = 'left', on = 'im_id')

In [None]:
pp.shape

In [None]:
pp = pp[pp.partition == 0]

In [None]:
#Now everything except the training partition (partition == 0) has been filtered out of the sample
pp.shape
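
The merge-then-filter step can be tried on a few made-up rows. In this sketch the frames `attrs` and `part_toy` and their values are hypothetical; the `validate` argument is an optional extra check that is not used in the cells above.

```python
import pandas as pd

# Hypothetical attribute rows in the -1/1 encoding of list_attr_celeba.txt
attrs = pd.DataFrame({
    "im_id": ["000001.jpg", "000002.jpg", "000003.jpg", "000004.jpg"],
    "Smiling": [1, -1, 1, -1],
    "High_Cheekbones": [-1, 1, 1, -1],
})
# Hypothetical partition rows (0 = training partition)
part_toy = pd.DataFrame({
    "im_id": ["000001.jpg", "000002.jpg", "000003.jpg", "000004.jpg"],
    "partition": [0, 0, 1, 2],
})

# validate raises an error if an im_id occurs more than once on either side
merged = attrs.merge(part_toy, how="left", on="im_id", validate="one_to_one")

# Keep the training partition only
train_only = merged[merged.partition == 0]
print(train_only)
```

After a left merge it is also worth checking that `merged.partition` contains no missing values, since a missing value would mean that an image has no partition entry.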

In [None]:
#AND it is equal to the amount of datapoints in the groupby table
sum(dist.fill)

Creating a list of the two different characteristics that we want to obtain

In [None]:
s_n_hc = pp[(pp[target] == 1) & (pp[protected_attr] == -1)].im_id

In [None]:
#We see that this number is the same as in the dist table above
len(s_n_hc)

In [None]:
ns_hc = pp[(pp[target] == -1) & (pp[protected_attr] == 1)].im_id

In [None]:
#We see that this number is the same as in the dist table above
len(ns_hc)

#### 4.3 Augmenting the right pictures

In [None]:
dist.fill

In [None]:
#Ensuring that we are subtracting the right numbers. It is always the absolute difference
# between the two target values within one attribute group that we want:
s_n_hc_count = abs(dist.fill[0][0]-dist.fill[0][1])

In [None]:
#Ensuring that we are subtracting the right numbers. It is always the absolute difference
# between the two target values within one attribute group that we want:
ns_hc_count = abs(dist.fill[1][1]-dist.fill[1][0])

In [None]:
print('We need to get',s_n_hc_count,'smiling non-HC people, and',ns_hc_count,'non-smiling HC people. \n')
print('As we can make', pictures_upsampling, 'different augmentations of every image, we need to randomly sample',\
     s_n_hc_count/pictures_upsampling,'smiling non-HC people, and',ns_hc_count/pictures_upsampling\
      ,'non-smiling HC people.')

In [None]:
upsampling = [list(s_n_hc) , list(ns_hc)]
s_n_hc_up = int(np.round(s_n_hc_count/pictures_upsampling))
ns_hc_up = int(np.round(ns_hc_count/pictures_upsampling))
range_ = [s_n_hc_up , ns_hc_up]

In [None]:
#Check that this is equal to the print statement above:
s_n_hc_up

In [None]:
#Check that this is equal to the print statement above:
ns_hc_up

In [None]:
img_root = '/Users/MartinJohnsen/Documents/Martin Johnsen/MMC/3. Semester/Deep Learning/Projects/Algorithmic fairness/Data/celebA_resize3/'
saveto_root = '/Users/MartinJohnsen/Documents/Martin Johnsen/MMC/3. Semester/Deep Learning/Projects/Algorithmic fairness/Data/celebA_resize3_aug_HC/'
train = pd.read_csv("Data/train.csv")

In [None]:
from tqdm import tqdm

In [None]:
traindf = copy.deepcopy(train)

In [None]:
#Writing pictures
new_rows = []
for i in range(2):
    augment = upsampling[i]
    rang = range_[i]
    for k in tqdm(range(rang)):
        #Draw a random image id (with replacement) from the group to upsample
        image = random.choice(augment)
        im=Image.open(img_root+image)
        im_=im.convert("RGB")
        r,g,b=im_.split()
        b=b.convert("RGB")
        im_unsharp=im_.filter(ImageFilter.UnsharpMask)

        #Saving image, b, and unsharp:
        im.save(saveto_root+'im_'+str(k)+"_"+image)
        b.save(saveto_root+'b_'+str(k)+"_"+image)
        im_unsharp.save(saveto_root+'un_'+str(k)+"_"+image)

        #One copy of the attribute row per saved picture, with the file name in column 0 (im_id)
        for prefix in ['im_', 'b_', 'un_']:
            row = train[train.im_id==image].copy()
            row.iloc[:,0] = prefix+str(k)+"_"+image
            new_rows.append(row)

#DataFrame.append was removed in pandas 2.0, so we concatenate once at the end
traindf = pd.concat([traindf]+new_rows, ignore_index=True)
traindf.to_csv('Data/train_augmented_'+protected_attr+'.csv')

In [None]:
len(traindf)

In [None]:
traindf.tail()

In [None]:
len(train)

In [None]:
traindf.tail()

In [None]:
traindf = pd.read_csv("Data/train_augmented_mouth.csv")

In [None]:
train = traindf[[target,protected_attr]]
#This is only done for the groupby command in the below cell
train["fill"] = 9
dist = train.groupby([protected_attr,target]).count()

In [None]:
dist

In [None]:
#len(train)

In [None]:
traindf.tail(30)

In [None]:
#Let's check how many pictures are in the folder:
import os
files = os.listdir(saveto_root)
print(len(files))

In [None]:
#while we would like to have this many:
sum(range_)*pictures_upsampling

In [None]:
#added also has this many pictures - these all need to be added to the dataframe, traindf
len(added)

In [None]:
#Saving pictures as pickle
with open("Data/pictures_augmented.txt", "wb") as fp:   #Pickling
    pickle.dump(added, fp)

In [None]:
#Reading pickle with list of photos that have been augmented
with open("Data/pictures_augmented.txt", "rb") as fp:   # Unpickling
    b = pickle.load(fp)

Since the method defined for the training of the models needs a dataframe in which it reads the `im_id` column in order to find each picture, we need to add all these pictures with their attributes to the dataframe.
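
A minimal sketch of this bookkeeping on a made-up dataframe: each written picture gets a copy of the source picture's attribute row, with only `im_id` changed. The helper name `augmented_rows` and the toy data are hypothetical; the prefixes follow the file names used when writing the pictures.

```python
import pandas as pd

def augmented_rows(df, image, k, prefixes=("im_", "b_", "un_")):
    """Copy the attribute row of `image` once per prefix and rename im_id."""
    source = df[df.im_id == image]
    rows = []
    for prefix in prefixes:
        row = source.copy()
        row["im_id"] = prefix + str(k) + "_" + image
        rows.append(row)
    return rows

# Hypothetical toy dataframe with two original pictures
toy_df = pd.DataFrame({
    "im_id": ["000001.jpg", "000002.jpg"],
    "Smiling": [1, 0],
    "High_Cheekbones": [0, 1],
})

# Collect the new rows first and concatenate once at the end
new_rows = augmented_rows(toy_df, "000002.jpg", k=0)
toy_df = pd.concat([toy_df] + new_rows, ignore_index=True)
print(toy_df)
```

Collecting the rows in a list and calling `pd.concat` once avoids growing the dataframe inside the loop, which gets slow for tens of thousands of pictures.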

In [None]:
for i, img in enumerate(added):
    #print(img)
    if i%10000==0:
        print(i)

In [None]:
new_rows = []
for img in added:
    #Look the original row up once; re-querying traindf after each append
    # would also match the copies that were just added
    orig = traindf[traindf.im_id==img]
    for prefix in ['un_', 'b_', 'g_', 'r_']:
        row = orig.copy()
        row.iloc[:,0] = str(k)+prefix+img
        new_rows.append(row)

#DataFrame.append was removed in pandas 2.0, so we concatenate once at the end
traindf = pd.concat([traindf]+new_rows)

In [None]:
len(traindf)-len(train)

In [None]:
traindf.tail(20)

We have sampled the images so that we have the correct amount of each class! 

## 5. Checking the number of channels

In [None]:
im=Image.open('/Users/MartinJohnsen/Documents/Martin Johnsen/MMC/3. Semester/Deep Learning/Projects/Algorithmic fairness/Data/celebA_resize3_aug/0r_057136.jpg')

In [None]:
#channels
len(im.mode)

In [None]:
im.getbands()
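
Checking a single picture does not guarantee that every file in the folder has three channels. As a hedged sketch, the check can be run over a whole folder; the function name `count_modes` is hypothetical, and the example runs on two synthetic pictures in a temporary folder instead of the augmented CelebA folder.

```python
import os
import tempfile
from collections import Counter
from PIL import Image

def count_modes(folder):
    """Count how many .jpg files in `folder` have each PIL mode."""
    modes = Counter()
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".jpg"):
            with Image.open(os.path.join(folder, name)) as im:
                modes[im.mode] += 1
    return modes

# Synthetic stand-in for the augmented folder: one RGB and one greyscale picture
with tempfile.TemporaryDirectory() as tmp:
    Image.new("RGB", (8, 8)).save(os.path.join(tmp, "a.jpg"))
    Image.new("L", (8, 8)).save(os.path.join(tmp, "b.jpg"))
    modes = count_modes(tmp)

print(modes)
```

Any mode other than `RGB` in the result points to pictures that would need `convert("RGB")` before training.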

## 6. Training

Happens on AWS!