### Reimplementation of the study: <br> ***"DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models"* <br> by Zeyang Sha, Zheng Li, Ning Yu, Yang Zhang**

**Name**: *Laura Papi*

**Matricola**: *1760732*

# Project Description

The study cited above addresses the growing concern over the possible misuse of AI-generated images, and argues that tools are needed to detect these fake images and attribute them to their source.<br>
In particular, it points out the lack of research on the specific case of images generated from a text prompt.
<br>

<br>
This project proposes methods to answer two of the research questions (RQ) posed in the paper:

- **RQ1**. Detection of images generated by text-to-image generation models (see sections 1 and 2 under RQ1)

- **RQ2**. Attribution of fake images to the text-to-image model that generated them (see sections 1 and 2 under RQ2)

<br>
This notebook contains the instructions to reproduce the entire project from scratch, from the creation of the datasets to the design and training of the models.<br><br>

For quick examples of how to use the implemented models, see the __[tldrNotebook](tldr_notebook.ipynb)__, where the pre-built datasets and pre-trained weights can be downloaded.<br>

For further information, the complete code of this project can be found in the source directory of the public GitHub repository __[Source Code](https://github.com/parwal-lp/De-Fake_nn_final_project/tree/main/src)__.

Before proceeding to run the code in this Notebook, please read the instructions contained in the __[Readme](https://github.com/parwal-lp/De-Fake_nn_final_project/tree/main)__ on GitHub.

# How to reproduce this project from scratch

Declare the path variables to be used globally in this notebook:

In [2]:
# Replace these paths as described below
proj_dir = "/home/parwal/Documents/test/De-Fake_nn_final_project"    # set here the absolute path to the root of the current project (De-Fake)
clip_dir = "/home/parwal/Documents/GitHub/CLIP"    # set here the absolute path to the CLIP directory cloned from GitHub
ld_dir = "/home/parwal/Documents/GitHub/latent-diffusion"   # set here the absolute path to the LD directory cloned from GitHub
glide_dir = "/home/parwal/Documents/GitHub/glide-text2im"   # set here the absolute path to the GLIDE directory cloned from GitHub

SD_api_key = "<YOUR_STABLE_DIFFUSION_API_KEY>"  # set here your Stable Diffusion API key (with enough credit to generate at least 400 images); never commit a real key

# Do not change these paths, they are part of the implementation
SD_generated_temp_dir = "data/generated/SD+MSCOCO/"
GLIDE_generated_temp_dir = "data/generated/GLIDE+MSCOCO/"
LD_generated_temp_dir = ld_dir + "/outputs/txt2img-samples/"
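
Before running the cells that follow, it can help to confirm that the directories configured above actually exist. The helper below is a minimal sketch and is not part of the project's `src` package; `find_missing_dirs` is a hypothetical name introduced only for illustration.

```python
import os


def find_missing_dirs(paths):
    """Return the names of the entries in `paths` whose directory does not exist."""
    return [name for name, path in paths.items() if not os.path.isdir(path)]


# Example usage with the variables declared above (uncomment inside the notebook):
# missing = find_missing_dirs({"proj_dir": proj_dir, "clip_dir": clip_dir,
#                              "ld_dir": ld_dir, "glide_dir": glide_dir})
# if missing:
#     raise FileNotFoundError(f"Set these paths before continuing: {missing}")
```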

Import all the necessary libraries and functions:

In [3]:
# -- Declare all the imports needed in this notebook

# External libraries imports
import sys
import os
import torch
import torchvision

# References to other files of this project
# Functions for the management of data
from src.data_collector import fetchImagesFromMSCOCO
from src.dataset_generator import SD_generation, LD_generation, GLIDE_generation
from src.format_dataset import format_dataset_binaryclass, formatIntoTrainTest, format_dataset_multiclass
from src.encoder import get_multiclass_dataset_loader, get_dataset_loader
# Functions for building and training the models
from src.imageonly_detector.model import train_imageonly_detector, eval_imageonly_detector
from src.imageonly_attributor.model import train_imageonly_attributor, eval_imageonly_attributor
from src.hybrid_detector.hybrid_detector import TwoLayerPerceptron, train_hybrid_detector, eval_hybrid_detector
from src.hybrid_attributor.model import MultiClassTwoLayerPerceptron, train_hybrid_attributor




## RQ1. Detection of images generated by text-to-image generation models

The study proposes two detector models:

1. **Image-only detector**<br>binary classifier that decides whether an input image is fake or real.

2. **Hybrid detector**<br>binary classifier that is able to tell if an image is fake or real, based on the input image and its corresponding text prompt.


### 1. Image-only detector
This model is implemented as a two-layer perceptron, to be used for binary classification.

#### 1.1 Dataset
All the datasets consist of a set of N real images (labeled 1) and a set of N corresponding generated fake images (labeled 0).

Training (on a single dataset):
- real images fetched from MSCOCO (class 1)
- fake images generated by Stable Diffusion (SD) (class 0)

Evaluation (on 3 different datasets):
- real images always fetched from MSCOCO (class 1)
- fake images generated respectively by Stable Diffusion (SD), Latent Diffusion (LD) and GLIDE (class 0)

The data is structured as follows:

imageonly_detector_data/<br>
&emsp;&emsp;├── train/<br>
&emsp;&emsp;&emsp;&emsp;├── class_0/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *fake images generated by SD*<br>
&emsp;&emsp;&emsp;&emsp;├── class_1/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *real images fetched from MSCOCO*<br>
&emsp;&emsp;├── val/<br>
&emsp;&emsp;&emsp;&emsp;├── class_0/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *fake images generated by SD*<br>
&emsp;&emsp;&emsp;&emsp;├── class_1/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *real images fetched from MSCOCO*<br>
&emsp;&emsp;├── val_LD/<br>
&emsp;&emsp;&emsp;&emsp;├── class_0/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *fake images generated by LD*<br>
&emsp;&emsp;&emsp;&emsp;├── class_1/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *real images fetched from MSCOCO*<br>
&emsp;&emsp;├── val_GLIDE/<br>
&emsp;&emsp;&emsp;&emsp;├── ...<br>
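
Once the folders have been populated, the layout above can be checked by counting the images in each `split/class` folder. This is a sketch under the assumption that images are stored as `.jpg`, `.jpeg` or `.png` files directly inside each class folder; `count_images` is a hypothetical helper, not a function of this project.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(data_root):
    """Map 'split/class' to the number of image files found in that folder."""
    counts = {}
    root = Path(data_root)
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            n_images = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
            counts[f"{split_dir.name}/{class_dir.name}"] = n_images
    return counts


# Example usage (uncomment inside the notebook):
# print(count_images("data/imageonly_detector_data"))
```

Since each dataset pairs every real image with one fake image, `class_0` and `class_1` of the same split would be expected to report the same count.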


First we fetch the real images together with their captions, for all the datasets described above:

In [None]:
#SD+MSCOCO
# The dataset generated using SD will be divided into train and test later on; it is temporarily saved in the directory fetched/MSCOCO_for_SD
fetchImagesFromMSCOCO("data/fetched/MSCOCO_for_SD", "data/fetched/MSCOCO_for_SD", 100)

#LD+MSCOCO --------------------------------------------------------------------------
# real images and captions needed as input for the LD model, directly saved in the dataset folder with class 1 = real
fetchImagesFromMSCOCO("data/imageonly_detector_data/val_LD/class_1", "data/imageonly_detector_data/val_LD", 50)

#GLIDE+MSCOCO -----------------------------------------------------------------------
# real images and captions needed as input for the GLIDE model, directly saved in the dataset folder with class 1 = real
fetchImagesFromMSCOCO("data/imageonly_detector_data/val_GLIDE/class_1", "data/imageonly_detector_data/val_GLIDE", 50)

Use the previously fetched captions to generate the fake images.<br><br>
Note that SD is accessed through its API, while LD and GLIDE run as locally downloaded models.<br><br>
SD also applies a very strict filter against inappropriate text prompts, so it may refuse to process some prompts even when they are legitimate.<br>
The implemented methods handle this case: images that could not be generated are labeled as invalid and excluded from the datasets, together with their real counterparts.
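
The pairing rule described above (a real image is kept only if its fake counterpart was generated) can be sketched as a set intersection. The sketch assumes that a real image and its generated counterpart share the same file name and are saved as `.jpg`; `valid_pair_ids` is a hypothetical helper and not the project's actual implementation.

```python
from pathlib import Path


def valid_pair_ids(real_dir, fake_dir):
    """Return the sorted image ids (file stems) present in both folders."""
    real_ids = {p.stem for p in Path(real_dir).glob("*.jpg")}
    fake_ids = {p.stem for p in Path(fake_dir).glob("*.jpg")}
    # An id missing from fake_dir corresponds to a prompt that was not processed
    return sorted(real_ids & fake_ids)
```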

In [None]:
#SD+MSCOCO --------------------------------------------------------------------------
#use stable-diffusion API to generate 100 fake images from the 100 captions collected before
print("generating images using SD...")
SD_generation("data/fetched/MSCOCO_for_SD/mscoco_captions.csv", SD_generated_temp_dir, SD_api_key)
print("SD images generated successfully!")

#LD+MSCOCO --------------------------------------------------------------------------
#use Latent Diffusion model to generate 50 images starting from the captions fetched before
print("generating images using LD...")
LD_generation("data/imageonly_detector_data/val_LD/mscoco_captions.csv", ld_dir, proj_dir)
print("LD images generated successfully!")

#GLIDE+MSCOCO -----------------------------------------------------------------------
#use GLIDE model to generate 50 images starting from the captions fetched before
print("generating images using GLIDE...")
GLIDE_generation("data/imageonly_detector_data/val_GLIDE/mscoco_captions.csv", GLIDE_generated_temp_dir, glide_dir)
print("GLIDE images generated successfully!")

Move the collected and generated images from their respective folders into the dataset directory structured as described before.

In [None]:
#SD+MSCOCO
#this function generates a pair of datasets (train and val), starting from data from the Stable Diffusion generation
#the data generated from SD contains 100 images, this original dataset is split in half (50 for train, 50 for test)
formatIntoTrainTest("data/fetched/MSCOCO_for_SD", SD_generated_temp_dir, "data/imageonly_detector_data")
print("ok SD")

#LD+MSCOCO --------------------------------------------------------------------------
format_dataset_binaryclass(LD_generated_temp_dir, "data/imageonly_detector_data/val_LD")
print("ok LD")

#GLIDE+MSCOCO -----------------------------------------------------------------------
format_dataset_binaryclass(GLIDE_generated_temp_dir, "data/imageonly_detector_data/val_GLIDE")
print("ok GLIDE")
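
The half-and-half split mentioned in the comments of the cell above can be illustrated as follows. This is only a sketch: how `formatIntoTrainTest` actually selects the two halves is not shown in this notebook, and `split_in_half` is a hypothetical name.

```python
def split_in_half(ids):
    """Split a list of ids into two halves: (train_ids, val_ids)."""
    middle = len(ids) // 2
    return ids[:middle], ids[middle:]


train_ids, val_ids = split_in_half(list(range(100)))
print(len(train_ids), len(val_ids))
```

When some SD prompts are rejected, fewer than 100 pairs are available, so the two halves can end up smaller than 50 pairs or differ by one.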

#### 1.2 Model

In the next block we build and train the actual model, a two-layer perceptron for binary classification, with the following steps:
- **Build the model** starting from a pre-trained version of ResNet18<br><br>
- **Create Dataset and DataLoader** objects starting from the raw data fetched in 1.1 (jpg images)<br>Each item of this dataset is transformed in order to achieve better performance.<br><br>
- **Train the model** using a custom train function and the DataLoader obtained in the previous step
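
As an illustration of one of the transformations applied to each item, the snippet below reproduces in plain Python what channel-wise normalization does to a single RGB pixel: `(value - mean) / std`, with the standard ImageNet statistics that also appear in the training cell. It is a sketch for intuition only; `normalize_pixel` is a hypothetical helper, and the pixel values are assumed to be already scaled to [0, 1].

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std to one RGB pixel whose values are in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# A pixel equal to the channel means maps to zero in every channel
print(normalize_pixel([0.485, 0.456, 0.406]))
```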

In [9]:
# Build the model
print("Building the model...")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1')

# Build the datasets
print("Building the dataset...")
data_transforms = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

data_dir = 'data/imageonly_detector_data'
image_datasets = {x: torchvision.datasets.ImageFolder(os.path.join(data_dir, x), data_transforms) for x in ['train', 'val', 'val_LD', 'val_GLIDE']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4, shuffle=True, num_workers=4) for x in ['train', 'val', 'val_LD', 'val_GLIDE']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val', 'val_LD', 'val_GLIDE']}

# Train the model
print("Training starts")
trained_model = train_imageonly_detector(model, dataloaders, dataset_sizes, num_epochs=20)

# Load a model with the trained weights from the previous step, and evaluate it on test data
print("Evaluation starts")
print("loading model with trained weights...")
test_model = torchvision.models.resnet18(weights='IMAGENET1K_V1')
test_model.load_state_dict(torch.load('trained_models/imageonly_detector.pth'))
eval_imageonly_detector(test_model, dataloaders, dataset_sizes)

Building the model...
Building the dataset...
Training starts
Epoch 0/19
----------
train Loss: 3.1605 Acc: 0.5400
val Loss: 0.5465 Acc: 0.8061

Epoch 1/19
----------
train Loss: 0.4093 Acc: 0.8400
val Loss: 1.2111 Acc: 0.6633

Epoch 2/19
----------
train Loss: 0.3474 Acc: 0.8600
val Loss: 0.4639 Acc: 0.8061

Epoch 3/19
----------
train Loss: 0.5000 Acc: 0.8000
val Loss: 0.3040 Acc: 0.8469

Epoch 4/19
----------
train Loss: 0.3316 Acc: 0.8600
val Loss: 0.3483 Acc: 0.8776

Epoch 5/19
----------
train Loss: 0.2674 Acc: 0.9200
val Loss: 0.2744 Acc: 0.8776

Epoch 6/19
----------
train Loss: 0.2270 Acc: 0.9000
val Loss: 0.7405 Acc: 0.7755

Epoch 7/19
----------
train Loss: 0.3479 Acc: 0.8400
val Loss: 0.4503 Acc: 0.8163

Epoch 8/19
----------
train Loss: 0.2347 Acc: 0.9000
val Loss: 0.2777 Acc: 0.8673

Epoch 9/19
----------
train Loss: 0.5108 Acc: 0.8700
val Loss: 0.2971 Acc: 0.8776

Epoch 10/19
----------
train Loss: 0.4476 Acc: 0.8200
val Loss: 0.2558 Acc: 0.8980

Epoch 11/19
----------
t

### 2. Hybrid detector
For this problem we again implement a two-layer perceptron for binary classification, but in this case it will take as input not only the images but also their captions.

#### 2.1 Dataset

The data is first fetched and generated in the exact same way as the dataset for the image-only detector.<br>

In [None]:
# ------------------- COLLECT REAL IMAGES FROM MSCOCO -------------------- #

#SD+MSCOCO
print("fetching images for SD...")
fetchImagesFromMSCOCO("data/fetched/MSCOCO_for_SD", "data/hybrid_detector_data", 100)

#LD+MSCOCO --------------------------------------------------------------------------
print("fetching images for LD...")
fetchImagesFromMSCOCO("data/hybrid_detector_data/val_LD/class_1", "data/hybrid_detector_data/val_LD", 50)

#GLIDE+MSCOCO -----------------------------------------------------------------------
print("fetching images for GLIDE...")
fetchImagesFromMSCOCO("data/hybrid_detector_data/val_GLIDE/class_1", "data/hybrid_detector_data/val_GLIDE", 50)

In [None]:
# ------------------- GENERATE FAKE IMAGES USING SD, LD, GLIDE -------------------- #
#SD+MSCOCO --------------------------------------------------------------------------
print("generating images using SD...")
SD_generation("data/hybrid_detector_data/mscoco_captions.csv", SD_generated_temp_dir, SD_api_key)
print("SD images generated successfully!")

#LD+MSCOCO --------------------------------------------------------------------------
print("generating images using LD...")
LD_generation("data/hybrid_detector_data/val_LD/mscoco_captions.csv", ld_dir, proj_dir)
print("LD images generated successfully!")

#GLIDE+MSCOCO -----------------------------------------------------------------------
print("generating images using GLIDE...")
GLIDE_generation("data/hybrid_detector_data/val_GLIDE/mscoco_captions.csv", GLIDE_generated_temp_dir, glide_dir)
print("GLIDE images generated successfully!")

In [None]:
# ------------------- FORMAT THE DATA INTO THE STRUCTURE NEEDED FOR TRAINING/TESTING -------------------- #
#SD+MSCOCO --------------------------------------------------------------------------
#this function generates a pair of datasets (train and val), starting from data from the Stable Diffusion generation
#the data generated from SD contains 100 images, this original dataset is split in half (50 for train, 50 for test)
print("Building SD dataset...")
formatIntoTrainTest("data/fetched/MSCOCO_for_SD/", SD_generated_temp_dir, "data/hybrid_detector_data")
print("ok SD")

#LD+MSCOCO --------------------------------------------------------------------------
print("Building LD dataset...")
format_dataset_binaryclass(LD_generated_temp_dir, "data/hybrid_detector_data/val_LD")
print("ok LD")

#GLIDE+MSCOCO -----------------------------------------------------------------------
print("Building GLIDE dataset...")
format_dataset_binaryclass(GLIDE_generated_temp_dir, "data/hybrid_detector_data/val_GLIDE")
print("ok GLIDE")

#### 2.2 Model

In the next block we build and train the actual model, a two-layer perceptron for binary classification, with the following steps:
- **Build the model** using a custom implemented module.<br>A two-layer perceptron that outputs 0 (fake) or 1 (real) for each sample.<br><br>
- **Create Dataset and DataLoader** starting from the raw data fetched in 2.1 (jpg images and string captions).<br>Each item of this dataset consists of the encoding of an image concatenated with the encoding of its caption;<br>the encodings are generated using the CLIP model.<br><br>
- **Train the model** using a custom train function and the DataLoader from the previous step.
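
To make the input format concrete, the sketch below concatenates two stand-in embeddings and runs them through a tiny two-layer perceptron written in plain Python. Several details are assumptions made for illustration: each CLIP embedding is taken to have 512 dimensions (which is consistent with the 1024-dimensional input of `TwoLayerPerceptron(1024, 100, 1)`), and the ReLU and sigmoid activations are a common choice, not necessarily those of the project's module. The embeddings and weights are random placeholders.

```python
import math
import random


def concat_embeddings(image_emb, text_emb):
    """Concatenate an image embedding and a caption embedding into one vector."""
    return list(image_emb) + list(text_emb)


def two_layer_forward(x, w1, b1, w2, b2):
    """Forward pass: linear -> ReLU -> linear -> sigmoid, returning P(real)."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
image_emb = [rng.uniform(-1, 1) for _ in range(512)]  # stand-in for a CLIP image embedding
text_emb = [rng.uniform(-1, 1) for _ in range(512)]   # stand-in for a CLIP caption embedding
features = concat_embeddings(image_emb, text_emb)

# Randomly initialised weights with a 1024 -> 100 -> 1 shape
w1 = [[rng.uniform(-0.05, 0.05) for _ in range(1024)] for _ in range(100)]
b1 = [0.0] * 100
w2 = [rng.uniform(-0.05, 0.05) for _ in range(100)]
b2 = 0.0

prob_real = two_layer_forward(features, w1, b1, w2, b2)
print(len(features), prob_real)
```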

In [4]:
#Build the model
print('Building the model...')
hybrid_detector = TwoLayerPerceptron(1024, 100, 1)

#Build the dataset
print('Building the dataset...')
captions_file = "data/hybrid_detector_data/mscoco_captions.csv"
real_img_dir = "data/hybrid_detector_data/train/class_1"
fake_img_dir = "data/hybrid_detector_data/train/class_0"
train_data_loader = get_dataset_loader(captions_file, real_img_dir, fake_img_dir, clip_dir, proj_dir)

#Train the model
print('Training starts')
train_hybrid_detector(hybrid_detector, train_data_loader, 40, 0.0009)

Building the model...
Building the dataset...
Training starts
EPOCH:  1/40  - MEAN ACCURACY:  tensor(0.5600)  - MEAN LOSS:  tensor(0.6888)
EPOCH:  2/40  - MEAN ACCURACY:  tensor(0.5800)  - MEAN LOSS:  tensor(0.6864)
EPOCH:  3/40  - MEAN ACCURACY:  tensor(0.6000)  - MEAN LOSS:  tensor(0.6840)
EPOCH:  4/40  - MEAN ACCURACY:  tensor(0.6000)  - MEAN LOSS:  tensor(0.6817)
EPOCH:  5/40  - MEAN ACCURACY:  tensor(0.6300)  - MEAN LOSS:  tensor(0.6794)
EPOCH:  6/40  - MEAN ACCURACY:  tensor(0.6400)  - MEAN LOSS:  tensor(0.6771)
EPOCH:  7/40  - MEAN ACCURACY:  tensor(0.6600)  - MEAN LOSS:  tensor(0.6748)
EPOCH:  8/40  - MEAN ACCURACY:  tensor(0.6700)  - MEAN LOSS:  tensor(0.6727)
EPOCH:  9/40  - MEAN ACCURACY:  tensor(0.7000)  - MEAN LOSS:  tensor(0.6703)
EPOCH:  10/40  - MEAN ACCURACY:  tensor(0.7100)  - MEAN LOSS:  tensor(0.6682)
EPOCH:  11/40  - MEAN ACCURACY:  tensor(0.7400)  - MEAN LOSS:  tensor(0.6658)
EPOCH:  12/40  - MEAN ACCURACY:  tensor(0.7600)  - MEAN LOSS:  tensor(0.6634)
EPOCH:  13/

Now that the model is trained, we can evaluate it on some test datasets.<br>
In particular we will evaluate it on:<br>
- Stable Diffusion (SD): dataset generated by the same text-to-image generator used for the training dataset.
- GLIDE
- Latent Diffusion
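
For reference, the accuracy of a binary detector on such a test set can be computed as sketched below. The sketch assumes the model outputs a probability of the image being real (class 1) and that a threshold of 0.5 is used; `binary_accuracy` is a hypothetical helper, not the project's `eval_hybrid_detector`.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the 0/1 label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute the accuracy of an empty set")
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


# Three of the four predictions below match their labels
print(binary_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
```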

In [5]:
# Build the model with the weights we trained in the previous code block
print("loading model with trained weights...")
test_hybrid_detector = TwoLayerPerceptron(1024, 100, 1)
test_hybrid_detector.load_state_dict(torch.load('trained_models/hybrid_detector.pth'))

eval_dirs = {'SD': {
                'captions': "data/hybrid_detector_data/mscoco_captions.csv", 
                'real': "data/hybrid_detector_data/val/class_1", 
                'fake': "data/hybrid_detector_data/val/class_0"},
             'GLIDE': {
                 'captions': "data/hybrid_detector_data/val_GLIDE/mscoco_captions.csv",
                  'real': "data/hybrid_detector_data/val_GLIDE/class_1", 
                  'fake': "data/hybrid_detector_data/val_GLIDE/class_0"},
             'LD': {
                 'captions': "data/hybrid_detector_data/val_LD/mscoco_captions.csv", 
                 'real': "data/hybrid_detector_data/val_LD/class_1", 
                 'fake': "data/hybrid_detector_data/val_LD/class_0"}}

# Build the dataloaders and test the model on each of them
print("Evaluation starts")
for dataset_name in eval_dirs:
    eval_data_loader = get_dataset_loader(eval_dirs[dataset_name]['captions'], eval_dirs[dataset_name]['real'], eval_dirs[dataset_name]['fake'], clip_dir, proj_dir)
    loss, acc = eval_hybrid_detector(test_hybrid_detector, eval_data_loader)
    print(f'Evaluation on {dataset_name} --> Accuracy: {acc} - Loss: {loss}')

loading model with trained weights...
Evaluation starts
Evaluation on SD --> Accuracy: 0.8666667342185974 - Loss: 0.6129067540168762
Evaluation on GLIDE --> Accuracy: 0.7199999690055847 - Loss: 0.6579972505569458
Evaluation on LD --> Accuracy: 0.800000011920929 - Loss: 0.6495194435119629


## RQ2. Attribution of the fake images to their source model

The study proposes two attributor models:

1. **Image-only attributor**<br>multi-class classifier that assigns each input image to its source generation model, given the image only.

2. **Hybrid attributor**<br>multi-class classifier that assigns each input image to its source generation model, based on the input image and its corresponding text prompt.


### 1. Image-only attributor

In this section we will build and train a model that is able to assign an image to the model that generated it, given only that image.<br><br>
The classes that this model will be able to address are the following:
- real image -> class 0
- fake image generated by SD -> class 1
- fake image generated by LD -> class 2
- fake image generated by GLIDE -> class 3

#### 1.1 Dataset

We generate two datasets, one for training and one for evaluating the model.<br>
The steps needed to generate the two datasets are the same:
- fetch real images and their captions from MSCOCO (class 0)
- generate fake images with SD using the captions of the real images (class 1)
- generate fake images with LD using the captions of the real images (class 2)
- generate fake images with GLIDE using the captions of the real images (class 3)
- move the real and generated images into a dataset directory, with the following structure:

imageonly_attributor_data/<br>
&emsp;&emsp;├── train/<br>
&emsp;&emsp;&emsp;&emsp;├── class_real/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *all the images fetched from MSCOCO*<br>
&emsp;&emsp;&emsp;&emsp;├── class_SD/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *all the images generated by SD*<br>
&emsp;&emsp;&emsp;&emsp;├── class_GLIDE/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *all the images generated by GLIDE*<br>
&emsp;&emsp;&emsp;&emsp;├── class_LD/<br>
&emsp;&emsp;&emsp;&emsp;│   └── *all the images generated by LD*<br>
&emsp;&emsp;├── test/<br>
&emsp;&emsp;&emsp;&emsp;├── ...<br>
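
One detail worth keeping in mind with this layout: `torchvision.datasets.ImageFolder` assigns class indices following the sorted order of the folder names, which does not coincide with the real=0, SD=1, LD=2, GLIDE=3 numbering used in the text. The sketch below shows the two numberings side by side; `to_documented` is a hypothetical helper for converting between them, and whether the project's training code needs such a conversion is not established here.

```python
CLASS_FOLDERS = ["class_real", "class_SD", "class_LD", "class_GLIDE"]

# Indices as ImageFolder would assign them: sorted folder names, case-sensitive
imagefolder_index = {name: i for i, name in enumerate(sorted(CLASS_FOLDERS))}

# Indices as described in the text: real=0, SD=1, LD=2, GLIDE=3
documented_index = {name: i for i, name in enumerate(CLASS_FOLDERS)}


def to_documented(imagefolder_label):
    """Convert an ImageFolder label into the numbering used in the text."""
    folder_name = sorted(CLASS_FOLDERS)[imagefolder_label]
    return documented_index[folder_name]


print(imagefolder_index)
print(documented_index)
```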

Generate the TRAIN dataset first:

In [4]:
# fetch the images with their captions from MSCOCO (N=50)
fetchImagesFromMSCOCO("data/imageonly_attributor_data/train/class_real", "data/imageonly_attributor_data/train", 50)

# use the same 50 captions to generate images with SD
SD_generation("data/imageonly_attributor_data/train/mscoco_captions.csv", SD_generated_temp_dir, SD_api_key)

# use the same 50 captions to generate images with GLIDE
GLIDE_generation("data/imageonly_attributor_data/train/mscoco_captions.csv", GLIDE_generated_temp_dir, glide_dir)

# use the same 50 captions to generate images with LD
LD_generation("data/imageonly_attributor_data/train/mscoco_captions.csv", ld_dir, proj_dir)

# move the generated images to the dataset dir
format_dataset_multiclass(SD_generated_temp_dir, LD_generated_temp_dir, GLIDE_generated_temp_dir, "data/imageonly_attributor_data/train")

moving the LD images
moving the SD images
['84610.jpg', '331551.jpg', '434986.jpg', '89589.jpg', '165835.jpg', '264884.jpg', '471335.jpg', '391519.jpg', '117466.jpg', '29715.jpg', '564204.jpg', '27157.jpg']
moving the GLIDE images


Then generate the TEST dataset:

In [4]:
# Repeat the same procedure for the test dataset

# fetch the images with their captions from MSCOCO (N=50)
fetchImagesFromMSCOCO("data/imageonly_attributor_data/test/class_real", "data/imageonly_attributor_data/test", 50)

# use the same 50 captions to generate images with SD
SD_generation("data/imageonly_attributor_data/test/mscoco_captions.csv", SD_generated_temp_dir, SD_api_key)

# use the same 50 captions to generate images with GLIDE
GLIDE_generation("data/imageonly_attributor_data/test/mscoco_captions.csv", GLIDE_generated_temp_dir, glide_dir)

# use the same 50 captions to generate images with LD
LD_generation("data/imageonly_attributor_data/test/mscoco_captions.csv", ld_dir, proj_dir)

# move the generated images to the dataset dir
format_dataset_multiclass(SD_generated_temp_dir, LD_generated_temp_dir, GLIDE_generated_temp_dir, "data/imageonly_attributor_data/test")

moving the LD images
moving the SD images
['84610.jpg', '331551.jpg', '434986.jpg', '89589.jpg', '516774.jpg', '165835.jpg', '264884.jpg', '471335.jpg', '391519.jpg', '117466.jpg', '29715.jpg', '564204.jpg', '451012.jpg', '27157.jpg']
moving the GLIDE images


#### 1.2 Model

In the next block we build and train the actual model, a two-layer perceptron for multiclass classification, with the following steps:
- **Build the model** starting from a pre-trained version of ResNet18<br><br>
- **Create Dataset and DataLoader** objects starting from the raw data fetched in 1.1 (jpg images)<br>Each item of this dataset is transformed in order to obtain better generalization<br><br>
- **Train the model** using a custom train function and the DataLoader from the previous step

In [7]:
# Build the model
print("Building the model...")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1')

# Build the datasets
print("Building the dataset...")
data_transforms = {
    'train': torchvision.transforms.Compose([
        torchvision.transforms.RandomResizedCrop(224),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': torchvision.transforms.Compose([
        torchvision.transforms.Resize(256),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}

data_dir = 'data/imageonly_attributor_data'
image_datasets = {x: torchvision.datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'test']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4, shuffle=True, num_workers=4) for x in ['train', 'test']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'test']}

# Train the model
print("Training starts:")
trained_model = train_imageonly_attributor(model, dataloaders, dataset_sizes, num_epochs=30)

# Evaluate the model
print("Evaluation starts:")
print("loading model with trained weights...")
test_model = torchvision.models.resnet18(weights='IMAGENET1K_V1')
test_model.load_state_dict(torch.load('trained_models/imageonly_attributor.pth'))
eval_imageonly_attributor(test_model, dataloaders, dataset_sizes)


Building the model...
Building the dataset...
Training starts:
Epoch 0/29
----------
train Loss: 6.1191 Acc: 0.1960
test Loss: 4.4390 Acc: 0.3687
Epoch 1/29
----------
train Loss: 2.6933 Acc: 0.5477
test Loss: 2.2395 Acc: 0.5657
Epoch 2/29
----------
train Loss: 1.3284 Acc: 0.7538
test Loss: 1.4689 Acc: 0.6515
Epoch 3/29
----------
train Loss: 0.9205 Acc: 0.7789
test Loss: 1.0956 Acc: 0.7374
Epoch 4/29
----------
train Loss: 0.9387 Acc: 0.7789
test Loss: 1.8423 Acc: 0.6162
Epoch 5/29
----------
train Loss: 0.6712 Acc: 0.8291
test Loss: 1.1617 Acc: 0.6818
Epoch 6/29
----------
train Loss: 0.4897 Acc: 0.8543
test Loss: 1.4001 Acc: 0.6212
Epoch 7/29
----------
train Loss: 0.7176 Acc: 0.7940
test Loss: 1.6833 Acc: 0.6465
Epoch 8/29
----------
train Loss: 0.7309 Acc: 0.8442
test Loss: 1.0576 Acc: 0.7424
Epoch 9/29
----------
train Loss: 0.4961 Acc: 0.8291
test Loss: 0.8649 Acc: 0.7475
Epoch 10/29
----------
train Loss: 0.4871 Acc: 0.8241
test Loss: 0.8571 Acc: 0.7576
Epoch 11/29
----------


### 2. Hybrid attributor
In this section we will build and train a model similar to the model built in section 1.<br>
The difference is that instead of taking as input only the image, this model also considers its textual caption.

#### 2.1 Dataset

Train and test datasets are generated in the same way as in the image-only attributor case.<br>
For the dataset directory structure also refer to the previous section.

Generate the TRAIN dataset first:

In [5]:
# fetch the images with their captions from MSCOCO (N=50)
fetchImagesFromMSCOCO("data/hybrid_attributor_data/train/class_real", "data/hybrid_attributor_data/train", 50)

# use the same 50 captions to generate images with SD
SD_generation("data/hybrid_attributor_data/train/mscoco_captions.csv", SD_generated_temp_dir, SD_api_key)

# use the same 50 captions to generate images with GLIDE
GLIDE_generation("data/hybrid_attributor_data/train/mscoco_captions.csv", GLIDE_generated_temp_dir, glide_dir)

# use the same 50 captions to generate images with LD
LD_generation("data/hybrid_attributor_data/train/mscoco_captions.csv", ld_dir, proj_dir)

# move the generated images to the dataset dir
format_dataset_multiclass(SD_generated_temp_dir, LD_generated_temp_dir, GLIDE_generated_temp_dir, "data/hybrid_attributor_data/train")

moving the LD images
moving the SD images
['84610.jpg', '331551.jpg', '333946.jpg', '434986.jpg', '89589.jpg', '516774.jpg', '165835.jpg', '264884.jpg', '471335.jpg', '391519.jpg', '117466.jpg', '29715.jpg', '564204.jpg', '451012.jpg', '27157.jpg']
moving the GLIDE images


Then generate the TEST dataset:

In [None]:
# fetch the images with their captions from MSCOCO (N=50)
fetchImagesFromMSCOCO("data/hybrid_attributor_data/test/class_real", "data/hybrid_attributor_data/test", 50)

# use the same 50 captions to generate images with SD
SD_generation("data/hybrid_attributor_data/test/mscoco_captions.csv", SD_generated_temp_dir, SD_api_key)

# use the same 50 captions to generate images with GLIDE
GLIDE_generation("data/hybrid_attributor_data/test/mscoco_captions.csv", GLIDE_generated_temp_dir, glide_dir)

# use the same 50 captions to generate images with LD
LD_generation("data/hybrid_attributor_data/test/mscoco_captions.csv", ld_dir, proj_dir)

# move the generated images to the dataset dir
format_dataset_multiclass(SD_generated_temp_dir, LD_generated_temp_dir, GLIDE_generated_temp_dir, "data/hybrid_attributor_data/test")

#### 2.2 Model

In the next block we build and train the actual model, a two-layer perceptron for multiclass classification, with the following steps:
- **Build the model** using a custom implemented module<br>A two-layer perceptron that outputs the class predicted for each sample<br><br>
- **Create Dataset and DataLoader** objects starting from the raw data fetched in 2.1 (jpg images and string captions)<br>Each item of this dataset consists of the encoding of an image concatenated with the encoding of its caption;<br>the encodings are generated using the CLIP model<br><br>
- **Train the model** using a custom train function and the DataLoader from the previous step
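
The step from the four outputs of the model to a predicted class can be sketched with a softmax followed by an argmax. This is an illustration under assumptions: the class order real, SD, LD, GLIDE follows the numbering given earlier for the attributor, and the use of softmax with a cross-entropy loss is a standard choice that is assumed rather than confirmed for `MultiClassTwoLayerPerceptron`. Under that assumption, a model guessing uniformly over four classes has a loss of ln 4 ≈ 1.386, which is close to the first-epoch loss reported in the training log of this section.

```python
import math

CLASS_NAMES = ["real", "SD", "LD", "GLIDE"]  # assumed order: classes 0..3


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]


name, confidence = predict_class([0.1, 2.0, -1.0, 0.5])
print(name, confidence)

# Cross-entropy of a uniform guess over four classes
chance_loss = math.log(4)
```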

In [8]:
# Build the model
print('Building the model...')
hybrid_attributor = MultiClassTwoLayerPerceptron(1024, 100, 4)

# Build the dataset (each sample in the dataset is the encoding of an image concatenated to the encoding of its caption - encodings generated using the CLIP model)
print('Building the dataset...')
captions_file = "data/hybrid_attributor_data/train/mscoco_captions.csv"
dataset_dir = "data/hybrid_attributor_data/train"
classes = ["class_real", "class_SD", "class_LD", "class_GLIDE"]  # a list (not a set) keeps the order real, SD, LD, GLIDE deterministic

train_data_loader = get_multiclass_dataset_loader(captions_file, dataset_dir, classes, clip_dir, proj_dir)

# Train the model on the dataset just generated
print('Training starts:')
train_hybrid_attributor(hybrid_attributor, train_data_loader, 30, 0.005)

Building the model...
Building the dataset...
Training starts:
EPOCH:  1/30  - MEAN ACCURACY:  tensor(0.2833)  - MEAN LOSS:  tensor(1.3881)
EPOCH:  2/30  - MEAN ACCURACY:  tensor(0.4417)  - MEAN LOSS:  tensor(1.3570)
EPOCH:  3/30  - MEAN ACCURACY:  tensor(0.5217)  - MEAN LOSS:  tensor(1.3271)
EPOCH:  4/30  - MEAN ACCURACY:  tensor(0.6200)  - MEAN LOSS:  tensor(1.2942)
EPOCH:  5/30  - MEAN ACCURACY:  tensor(0.6300)  - MEAN LOSS:  tensor(1.2596)
EPOCH:  6/30  - MEAN ACCURACY:  tensor(0.7300)  - MEAN LOSS:  tensor(1.2201)
EPOCH:  7/30  - MEAN ACCURACY:  tensor(0.8017)  - MEAN LOSS:  tensor(1.1798)
EPOCH:  8/30  - MEAN ACCURACY:  tensor(0.8200)  - MEAN LOSS:  tensor(1.1288)
EPOCH:  9/30  - MEAN ACCURACY:  tensor(0.7967)  - MEAN LOSS:  tensor(1.0886)
EPOCH:  10/30  - MEAN ACCURACY:  tensor(0.8600)  - MEAN LOSS:  tensor(1.0332)
EPOCH:  11/30  - MEAN ACCURACY:  tensor(0.8867)  - MEAN LOSS:  tensor(0.9833)
EPOCH:  12/30  - MEAN ACCURACY:  tensor(0.8800)  - MEAN LOSS:  tensor(0.9266)
EPOCH:  13