# Dataset selection
Before selecting a model to identify the car and clothing brands present in the dataset, it's important to understand the available data in more detail. The choice of model for identifying these logos naturally depends on the size and quality of the input dataset. In this notebook I look specifically at the brand-by-brand breakdown for the car and clothing brands in the BelgaLogos set.

#### Data pre-processing
I have extracted all the unique brand labels from the dataset (37 in total) and tagged each one by hand as a 'car', 'clothing' or 'NA' (not applicable) brand. The brand-name -> brand-category dictionary is located in `load_data.py` and is used by the function `read_metadata`.
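
As a rough illustration of the tagging step, the sketch below shows one way such a mapping could be applied. It is not the code in `load_data.py`: the name `BRAND_CATEGORY`, the helper `categorise`, the row layout and the fallback to 'NA' for unlisted brands are all assumptions made for this example, and only a handful of the 37 labels are included.

```python
# Hypothetical sketch only: the real dictionary lives in load_data.py
# and covers all 37 brand labels. Names here are illustrative.
BRAND_CATEGORY = {
    "Citroen": "car",
    "Ferrari": "car",
    "Kia": "car",
    "Adidas": "clothing",
    "Nike": "clothing",
    "Puma": "clothing",
}


def categorise(brand: str) -> str:
    """Return the hand-assigned category, or 'NA' if the brand is not listed.

    Falling back to 'NA' is an assumption of this sketch; the original
    dictionary may list 'NA' brands explicitly instead.
    """
    return BRAND_CATEGORY.get(brand, "NA")


# Attach a category to each (hypothetical) annotation row.
rows = [{"brand": "Nike"}, {"brand": "Kia"}, {"brand": "UnlistedBrand"}]
for row in rows:
    row["category"] = categorise(row["brand"])
```

With a category attached to every row, selecting only the car or clothing annotations becomes a simple filter on that field.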


In [1]:
# Import full dataset and filter by acceptable bounding-box sizes
import load_data as ld
metadata = ld.read_metadata()
metadata = ld.filter_by_boundingbox(metadata, 10, 800)

In [2]:
# Select only clothing or car brands
car_metadata      = metadata[metadata.category == 'car']
clothing_metadata = metadata[metadata.category == 'clothing']

In [3]:
# Count the 'OK' and 'Junk' annotations per logo and display both summary tables
import util as ut
car_summary      = ut.metadata_count_summary(car_metadata)
clothing_summary = ut.metadata_count_summary(clothing_metadata)
ut.multi_table([car_summary, clothing_summary])

| Logo name | #OK | #Junk | Total |
|---|---|---|---|
| Citroen | 78 | 156 | 234 |
| Citroen-text | 196 | 129 | 325 |
| Ferrari | 76 | 131 | 207 |
| Kia | 136 | 85 | 221 |
| Mercedes | 84 | 187 | 271 |
| Peugeot | 6 | 1 | 7 |

| Logo name | #OK | #Junk | Total |
|---|---|---|---|
| Adidas | 143 | 790 | 933 |
| Adidas-text | 59 | 97 | 156 |
| Airness | 11 | 97 | 108 |
| Gucci | 2 | 2 | 4 |
| Nike | 231 | 1742 | 1973 |
| Puma | 154 | 601 | 755 |
| Puma-text | 27 | 41 | 68 |
| Reebok | 18 | 46 | 64 |
| Umbro | 149 | 466 | 615 |


Despite the reasonably large size of the total dataset, the number of high-quality annotated images for each logo is rather limited. If we restrict ourselves to the 'OK' images (those judged by the human assessors to be identifiable without context or prior information), then the largest per-brand set contains only 231 images (Nike). The model and methodology selected to solve the problem must therefore be able to handle a small dataset.
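
The figures quoted above can be checked directly from the '#OK' column of the two tables. The snippet below is a small sketch of that check; the counts are copied from the tables, while the variable names and the threshold `MIN_OK = 100` are arbitrary choices made for illustration, not values used elsewhere in this analysis.

```python
# '#OK' counts copied from the car and clothing summary tables.
ok_counts = {
    "Citroen": 78, "Citroen-text": 196, "Ferrari": 76,
    "Kia": 136, "Mercedes": 84, "Peugeot": 6,
    "Adidas": 143, "Adidas-text": 59, "Airness": 11,
    "Gucci": 2, "Nike": 231, "Puma": 154,
    "Puma-text": 27, "Reebok": 18, "Umbro": 149,
}

# Brand with the most 'OK' images.
largest = max(ok_counts, key=ok_counts.get)

# Illustrative threshold (an assumption): brands with fewer 'OK' images
# than this are listed as especially sparse.
MIN_OK = 100
sparse = sorted(brand for brand, n in ok_counts.items() if n < MIN_OK)

print(largest, ok_counts[largest])
print(len(sparse), "of", len(ok_counts), "logos have fewer than", MIN_OK, "OK images")
```

Even with this fairly low threshold, more than half of the logos fall short, which underlines how little clean data is available per brand.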