# Materialist fashion dataset

## The dataset and task description

The materialist fashion dataset consists of 3 JSON files and the associated images. The JSON files for training and validation also contain the ground truth labels, while the test file contains only the image URLs. In general each image is associated with multiple labels, although some have only a single label.

The training set uses a total of 228 labels (1 to 228) and the validation set uses 225 (1 to 225). No names are associated with the labels. Because of this label discrepancy between the training and validation sets, the total number of labels for the final task is assumed to be 228. The test dataset consists of 39706 images, and the task consists of submitting a CSV file where each line contains the image id followed by the labels predicted for that image. The CSV file is submitted to Kaggle, which returns a score based on the accuracy.
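A minimal sketch of writing such a submission file is shown here. The header names `image_id` and `label_id`, the space-separated label format, and the helper name `write_submission` are assumptions made for illustration and should be checked against the competition's sample submission.

```python
import csv
import io


def write_submission(predictions, out_file):
    """Write one row per test image: the image id, then its predicted labels.

    `predictions` maps an image id to an iterable of label ids (1..228).
    Header names and the space-separated label format are assumptions.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["image_id", "label_id"])
    for image_id in sorted(predictions):
        labels = " ".join(str(label) for label in sorted(predictions[image_id]))
        writer.writerow([image_id, labels])


# Made-up predictions for three test images, written to an in-memory buffer.
buffer = io.StringIO()
write_submission({1: [66, 17, 105], 2: [3], 3: []}, buffer)
submission_text = buffer.getvalue()
```

In practice `out_file` would be a file opened with `open(path, "w", newline="")` rather than a `StringIO` buffer.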

## Data analysis

All images were downloaded.

The maximum image resolution is 600x600. Images are resized to 256x256 and then cropped to 224x224.

The JSON files are stored so that the labels can be read from them.

- Total train images: 1014544
- Total train labels: 228
- Total test images: 39706
- Total validation images: 9897
- Total validation labels: 225


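The explore script builds an inverse view of the data, in which each label is mapped to its associated image ids. There is a positive view (the images that carry the label) and a negative view (the images that do not).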

These positive and negative label views can be used to train 228 individual binary classifiers, each specialized in deciding whether an image fits the label it was trained for. This method was not implemented because of the large number of classifiers that would need to be maintained and finely tuned, and because frameworks already exist that implement it more efficiently.
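As an illustration of the label views, the sketch below builds both of them from a plain mapping of image id to label ids. The function name `build_label_views` and the input layout are hypothetical; the actual explore script may organize the data differently.

```python
from collections import defaultdict


def build_label_views(image_labels, num_labels=228):
    """Return (positive, negative) dicts mapping label id -> sorted image ids."""
    all_ids = set(image_labels)
    positive = defaultdict(set)
    for image_id, labels in image_labels.items():
        for label in labels:
            positive[label].add(image_id)

    label_range = range(1, num_labels + 1)
    # Positive view: images that carry the label.
    positive_view = {label: sorted(positive.get(label, set())) for label in label_range}
    # Negative view: every other image.
    negative_view = {
        label: sorted(all_ids - positive.get(label, set())) for label in label_range
    }
    return positive_view, negative_view


# Made-up annotations: image id -> label ids, with only 3 labels for brevity.
annotations = {10: [1, 3], 11: [3], 12: [2]}
positive_view, negative_view = build_label_views(annotations, num_labels=3)
```

Each entry of `positive_view` and `negative_view` is exactly the positive and negative sample list that one binary classifier would need.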

Another possible addition is to have a view of strictly negative labels where the image id is associated with a specific label id if and only if it does not exist as a combination of other labels in another image. But practical uses for this view were not found and so the view was abandoned.

## Preprocessing

In order to load the data for the machine learning algorithm some preprocessing needs to be done.

The images are resized to 224px by 224px. They are also normalized according to the recommendations for the resnet50 neural network model and transformed to a tensor representation (a 3D tensor of 3x224x224 dimensions, channels first). PyTorch and torchvision are used to load a resnet50 model pretrained on ImageNet, which is used to extract a good quality feature vector from each image. The feature vector has 1000 dimensions (1000x1).
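The normalization step can be sketched in plain NumPy. The mean and standard deviation below are the per-channel ImageNet statistics that torchvision documents for its pretrained models; the function name `to_normalized_tensor` and the random stand-in image are assumptions for illustration, and the real pipeline would more likely use torchvision transforms.

```python
import numpy as np

# Per-channel statistics documented for torchvision's ImageNet-pretrained models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_normalized_tensor(image_uint8):
    """Convert an HxWx3 uint8 RGB image into a normalized 3xHxW float32 array."""
    scaled = image_uint8.astype(np.float32) / 255.0        # pixel values in [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD    # broadcast over channels
    return np.transpose(normalized, (2, 0, 1))              # channels first


# Stand-in for an already resized 224x224 image (random pixels, illustration only).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
tensor = to_normalized_tensor(fake_image)
```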

The labels for each image are transformed into multihot encoded vectors of size 228 (228x1). This is similar to the one-hot encoding used for single-label classification, but here there is a 1 at the position of every ground truth label.

The image feature vectors and their multihot encoded label representations are saved as numpy arrays on disk and loaded when needed for training/validation/testing.
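A small sketch of the multihot encoding and of the round trip through disk is given below. The helper name `to_multihot`, the example labels, and the file name `train_labels.npy` are hypothetical.

```python
import os
import tempfile

import numpy as np

NUM_LABELS = 228


def to_multihot(label_ids, num_labels=NUM_LABELS):
    """Encode 1-based label ids as a 0/1 vector of length num_labels."""
    vector = np.zeros(num_labels, dtype=np.uint8)
    # Label k is stored at index k - 1 because the labels start at 1.
    vector[np.asarray(label_ids, dtype=np.int64) - 1] = 1
    return vector


# Made-up labels for three images (the last one has no labels).
labels_per_image = [[1, 5, 228], [17], []]
targets = np.stack([to_multihot(ids) for ids in labels_per_image])

# Save to disk and load again, as done for the real feature and label arrays.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_labels.npy")
    np.save(path, targets)
    restored = np.load(path)
```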

## Proposed approach

Two proposed approaches are discussed here.

### Machine learning approach

This approach uses the image feature vectors and the multihot encoded label vectors to fit a multilabel classifier to the data.

The first version tried uses a LinearSVC (SVM implementation) classifier plugged into a OneVsRest classifier. It ran out of memory when training on 50% of the dataset.

Because of the memory issues, another version uses the SGDClassifier, which also plugs into the OneVsRest classifier. This classifier can work on mini-batches and also trains much faster. The classifier will be fine tuned in order to enhance its prediction capabilities.

This method uses a pretrained neural network to extract the image features and then a multilabel classifier to fit on the training data and later predict the labels from the test data.
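The mini-batch idea can be sketched as follows. Note that this sketch does not use the OneVsRest wrapper: it keeps one `SGDClassifier` per label by hand and calls `partial_fit` on each, which is a simplified stand-in for the setup described above. The synthetic data, batch size, and number of epochs are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def train_in_batches(features, targets, batch_size=64, epochs=5, seed=0):
    """Fit one linear SGD classifier per label, feeding the data in mini-batches."""
    num_labels = targets.shape[1]
    models = [SGDClassifier(loss="hinge", random_state=seed) for _ in range(num_labels)]
    classes = np.array([0, 1])
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(features))
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            for label_index, model in enumerate(models):
                # Passing classes lets a batch without positives still be accepted.
                model.partial_fit(features[batch], targets[batch, label_index], classes=classes)
    return models


def predict_multihot(models, features):
    """Stack the per-label 0/1 predictions into one multihot matrix."""
    return np.column_stack([model.predict(features) for model in models])


# Synthetic stand-in data: 4 labels, each depending on one of 16 feature dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16)).astype(np.float32)
Y = (X[:, :4] > 0).astype(np.int64)
models = train_in_batches(X, Y)
predictions = predict_multihot(models, X)
```

With the real data, each mini-batch would be a slice of the saved feature and label arrays, so the full training set never has to sit in memory at once.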

### Neural network approach

The neural network approach is more end-to-end. It uses an already well established neural network architecture such as ResNet (50, 152) or VGG (16, 19), or a standard baseline model (with 2D conv layers, ReLU activations and final fully connected layers), where the final layer is a fully connected layer that maps to a 228-dimensional vector and has a sigmoid activation. Each label is given a probability between 0 and 1 signaling how likely it is that the image belongs to that label.

This method has some advantages over the machine learning approach. The NN model, being trained on the images of the dataset, will be more receptive to its specific features. The model learns to predict the labels directly from the feature vector (no separate classifier is needed). The neural network model can potentially outperform the machine learning approach if the hyperparameters are tuned properly, the model does not overfit, and it is balanced across the various labels. It also offers the possibility of tweaking the threshold above which a label is considered to describe the image.
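To make the threshold idea concrete, the sketch below turns raw per-label scores into label ids with a sigmoid and a cut-off. The logits are made up and use only 5 labels instead of 228, and the function name `decode_labels` is hypothetical.

```python
import numpy as np


def sigmoid(logits):
    """Map raw scores to independent per-label probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))


def decode_labels(logits, threshold=0.5):
    """Return, per image, the 1-based label ids whose probability exceeds threshold."""
    probabilities = sigmoid(np.asarray(logits, dtype=np.float64))
    return [(np.flatnonzero(row > threshold) + 1).tolist() for row in probabilities]


# Made-up logits for 2 images and 5 labels (the real output would have 228 columns).
logits = np.array([[2.0, -3.0, 0.2, -0.5, 4.0],
                   [-1.0, -2.0, -0.1, -4.0, -3.0]])
strict = decode_labels(logits, threshold=0.5)
lenient = decode_labels(logits, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 adds label 4 to the first image and label 3 to the second, which is the kind of trade-off between missed and spurious labels that tuning the threshold controls.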

However there are some drawbacks. Training is harder and longer. Finding the right hyperparameters is also a harder task. 

A partly end-to-end approach is also possible, where only the last layer of a pretrained neural network is retrained in order to map to the 228 labels (a fully connected layer with sigmoid activations). This approach is efficient both in training and in finding hyperparameters. It is less tied to the dataset, but it offers the best compromise between efficiency and accuracy.
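A rough PyTorch sketch of this partly end-to-end setup follows. The `backbone` here is a tiny hypothetical stand-in with a 1000-dimensional output, used so that the example runs without downloading weights; with torchvision's ResNet50 the same idea would mean freezing the network's parameters and replacing its final `fc` layer. The batch size, image size, and learning rate are arbitrary assumptions.

```python
import torch
from torch import nn

NUM_LABELS = 228
torch.manual_seed(0)

# Hypothetical stand-in for a pretrained network with a 1000-dim output.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000), nn.ReLU())
for parameter in backbone.parameters():
    parameter.requires_grad = False      # freeze the pretrained weights
backbone.eval()

# Only this new layer is trained; it maps the features to the 228 labels.
head = nn.Linear(1000, NUM_LABELS)
criterion = nn.BCEWithLogitsLoss()       # applies the sigmoid internally
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

# Made-up batch: 8 small images with random multihot targets.
images = torch.randn(8, 3, 32, 32)
targets = torch.randint(0, 2, (8, NUM_LABELS)).float()

backbone_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

with torch.no_grad():
    features = backbone(images)          # fixed feature extraction
loss = criterion(head(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At prediction time the sigmoid is applied explicitly to get probabilities.
probabilities = torch.sigmoid(head(features)).detach()
```

Because the loss already contains the sigmoid, the head outputs raw scores during training and the sigmoid is applied separately only when predicting.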

### Object detection

As a preprocessing step it makes sense to use object detection to extract only the regions proposed by the bounding boxes, and then to use those cropped images as inputs to the neural network that makes the labeling decision.

How else can object detection architectures fit with multilabel classification?

- Look only at the regions of interest (ROI) that have been detected.
- End-to-end? A neural net still extracts the image features, and some of the tricks from the RPN and ROI pooling/align are then used to add more abstraction between the feature outputs of the resnet50 network and the final layer of predicted labels.

### Spatial Transformer Network

On top of object detection, a spatial transformer network (STN) is used to better align the images.


## The practical results

1) Use ImageAI to extract detectable persons/objects and create a new image by concatenating those results

2) Take the images and use pytorch transformations to resize them to 3x224x224

3) Take the STN to align the images

4) Feed them through the pretrained ResNet50/ResNet101 network (Note: keep the weights fixed or not?)

5) Feed the output through the last 2 FC layers with a dropout layer between them.

6) Apply softmax over the 228-dimensional output and keep only the values above a threshold of 0.2 ~ 0.3 (Note: softmax is applied automatically inside some standard loss functions. Make sure not to apply it twice)
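The note in step 6 can be illustrated numerically. The sketch below uses made-up scores for a single image and shows two properties of softmax that matter for the thresholding: the outputs sum to 1, so at most 4 labels can exceed 0.2 at the same time, and applying softmax twice flattens the distribution so that no label passes the threshold.

```python
import numpy as np


def softmax(logits):
    """Softmax over a 1D array of scores."""
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=228)     # made-up scores for one image

once = softmax(logits)
twice = softmax(once)                        # softmax applied a second time by mistake

# 1-based label ids that clear the 0.2 threshold in each case.
selected_once = np.flatnonzero(once > 0.2) + 1
selected_twice = np.flatnonzero(twice > 0.2) + 1
```

After the second application every value is close to 1/228, so `selected_twice` is empty regardless of the input scores.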

Preprocessing can be done and experimented with on my local machine, but the network training needs to be done on an AWS instance, and the preprocessed images need to be transferred there as well.

The neural networks were hard to train and not very good at predicting. For some reason they kept preferring some labels all the time as opposed to the correct labels.

The classical machine learning approach gave the best results on the Kaggle evaluation so far. More data will be used to train better models and try to improve the 0.28 score (the best score on the Kaggle website so far is around 0.69-0.70).