<h1> Adversarial validation </h1>

Main purpose of adversarial validation is checking if training and test images come from same distribution.

**How does it work?**
1. Load training images and assign them class 0.
1. Load test images and assign them class 1.
1. Combine training and test images, shuffle them and split into new_train and new_validation (pay attention to distribution of classes in each new dataset)
1. Create simple model/neural network and train on new_train dataset
1. Check metric score (accuracy) for new_validation dataset. It should be similar to value for new_train.
1. High value means that data is from different distributions, low - the same (perfectly it should be about 50%).

**Problems**
- Images have different dimensions. -> Cropping central image with the smallest dimension from all images can be enough for this task.
- Some test images have 3 channels. -> Getting three channels can be enough for this task.

In [1]:
path_to_train_bw_files = '../data/colour_model/train/bw_images.csv'
path_to_train_colour_files = '../data/colour_model/train/colour_imgs.csv'
path_to_test_bw_files = '../data/colour_model/test/bw_images.csv'
path_to_test_colour_files = '../data/colour_model/test/colour_imgs.csv'

In [2]:
import pandas as pd
from sklearn.utils import shuffle

def create_dataframe(path_to_csv_1, path_to_csv_2, marker):
    """Create dataframe based on 2 csv files containg paths.

    Parameters
    ----------
    path_to_csv_1 : path to first csv
    path_to_csv_2 : path to second csv
    marker: how should be marked images in 'Marker' column

    Returns
    -------
    pd.DataFrame
            first datamframe, second dataframe
    """
    df_1 = pd.read_csv(path_to_csv_1)
    df_2 = pd.read_csv(path_to_csv_2)
    df = pd.concat([df_1, df_2])
    df['Marker'] = marker
    return df

def concat_and_split_df(df_1, df_2, split_factor):
    """Get concatenated and randomly splitted dataframes.

    Parameters
    ----------
    df_1 : first dataframe to concatenate
    df_2 : second dataframe to concatenate
    split_factor: how many percents should be in first dataframe

    Returns
    -------
    pd.DataFrame
            first dataframe, second dataframe
    """
    df = pd.concat([df_1, df_2])
    df = shuffle(df)
    elements_count = int(split_factor * len(df))
    splitted_1_df = df[:elements_count]
    splitted_2_df = df[elements_count:]
    splitted_1_df.reset_index(drop=True, inplace=True)
    splitted_2_df.reset_index(drop=True, inplace=True)
    return splitted_1_df, splitted_2_df


images_train_df = create_dataframe(path_to_train_bw_files, path_to_train_colour_files, 0.0)
images_test_df = create_dataframe(path_to_test_bw_files, path_to_test_colour_files, 1.0)

images_with_markers_train_df, images_with_markers_test_df = concat_and_split_df(
    images_train_df, images_test_df, 0.8)

In [3]:
from torch.utils.data import Dataset
from PIL import Image
 
class ImageWithMarkerDataset(Dataset):
    def __init__(self, paths_with_markers_df, transform=None):
        self.paths_with_markers_df = paths_with_markers_df
        self.transform = transform
 
    def __len__(self):
        return len(self.paths_with_markers_df)
 
    def __getitem__(self, idx):
        img = Image.open(self.paths_with_markers_df['Path'][idx]).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, torch.tensor(self.paths_with_markers_df['Marker'][idx])

In [4]:
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, CenterCrop, ToTensor
 
the_smallest_dimension = 160 #based on EDA
transforms = Compose([ToTensor()])
images_with_markers_dataset_train = ImageWithMarkerDataset(images_with_markers_train_df, transforms)
images_with_markers_dataset_test = ImageWithMarkerDataset(images_with_markers_test_df, transforms)
 
train_dataloader = DataLoader(images_with_markers_dataset_train, batch_size=1, shuffle=True)
test_dataloader = DataLoader(images_with_markers_dataset_test, batch_size=1, shuffle=False)

In [5]:
import torch.nn as nn
import torch.nn.functional as F
import torch
torch.manual_seed(1024)
 
class SimpleNeuralNetwork(nn.Module):
    def __init__(self):
        super(SimpleNeuralNetwork, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 64, kernel_size=3, stride=2)
        self.conv2 = torch.nn.Conv2d(64, 128, kernel_size=3, stride=2)
        self.conv3 = torch.nn.Conv2d(128, 256, kernel_size=2, stride=2)
 
        self.pool = torch.nn.MaxPool2d(kernel_size=3, stride=2)
        self.global_pool = torch.nn.AdaptiveAvgPool2d(1)
        self.fc = torch.nn.Linear(256, 1)
    
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
 
        x = F.relu(self.conv2(x))
        x = self.pool(x)
 
        x = F.relu(self.conv3(x))
        x = self.global_pool(x).squeeze()
 
        x = self.fc(x)
        return x
 

In [6]:
import torch.optim as optim
 
loss = torch.nn.BCEWithLogitsLoss()
neural_network = SimpleNeuralNetwork()
optimizer = optim.Adam(neural_network.parameters(), lr=0.01)
n_epochs = 10
correct_test = 0

neural_network.train()
for epoch in range(n_epochs):
    for img, target in train_dataloader:
        optimizer.zero_grad()
        output = neural_network(img)
        loss_value = loss(target, output)
        loss_value.backward()
        optimizer.step()

with torch.no_grad():
    neural_network.eval()
    for img, target in test_dataloader:
        output = neural_network(img)
        if output.item() == target.item():
            correct_test += 1
        
print('Correct values in test: ', float(correct_test)/ float(len(images_with_markers_dataset_test)))

Correct values in test:  0.0


As we can see, it is very hard for model to classify which image is from train and test dataset. 

`Observations:`
- Above statement means that images are from the same distribution. 