# Getting Started - Parsing and Validation

The BelgaLogos dataset consists of a large collection of images, along with a metadata file (`qset3_internal_and_local.gt`) that annotates each image with the brand logos identified in it, together with their locations and sizes (as bounding-box coordinates).

Let's start the analysis by parsing the data into a convenient format (a pandas DataFrame) and validating it against the information provided on the BelgaLogos website.

#### Preprocessing of data file

There are two malformed entries in the metadata file `qset3_internal_and_local.gt`.
Attempting to import it as a tab-delimited pandas DataFrame yields two rows with NaN entries:
 
 ```
 Peugeot_0007    Peugeot 07585235.jpg    logo   ...   NaN        NaN  NaN  NaN   NaN   NaN   NaN   NaN
 StellaArtois_0028       StellaArtois    0764798...   215        231  274  NaN   NaN   NaN   NaN   NaN
 ```

Unlike the rest of the data, these two records were space-delimited (they look like they may have been edited by hand).
I restored the tabs manually and keep the modified version of the file in the data folder of this repository.
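Records like these can also be found programmatically rather than by inspecting NaN rows. The snippet below is a minimal sketch, not the procedure used here: it assumes every well-formed record has the same number of tab-separated fields (`EXPECTED_FIELDS = 9` is an assumption, as are the helper names and the made-up sample lines), and the `retab` repair is only safe if no field contains internal whitespace.

```python
# Assumption: a well-formed record has this many tab-separated fields.
# Adjust to the real column count of the .gt file.
EXPECTED_FIELDS = 9


def find_malformed_lines(lines, expected_fields=EXPECTED_FIELDS):
    """Return (line_number, line) pairs whose tab-separated field count is off."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if stripped and len(stripped.split("\t")) != expected_fields:
            bad.append((number, stripped))
    return bad


def retab(line):
    """Split on any run of whitespace and re-join with single tabs.

    Only safe when no field contains spaces of its own.
    """
    return "\t".join(line.split())


# Synthetic sample lines (not taken from the dataset): the first is
# tab-delimited, the second is space-delimited like the two bad records.
sample = [
    "Brand_0001\tBrand\t00000001.jpg\tlogo\t1\t10\t20\t30\t40\n",
    "Brand_0002    Brand 00000002.jpg    logo 1 10 20 30 40\n",
]

for number, line in find_malformed_lines(sample):
    print(f"line {number}: {line!r} -> {retab(line)!r}")
```

Applied to the real file, the same check would be run over `open(path).readlines()` before handing the data to pandas.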

#### Parsing and validation

Let's perform a sanity check to ensure that the parsing has been completed correctly. My data loading and scraping routines are in `load_data.py`. First we build a pandas DataFrame around the BelgaLogos annotation file and compare the image counts against the data scraped from the BelgaLogos webpage via BeautifulSoup (as a quick check).

In [1]:
```python
# General imports
import load_data as ld
import pandas as pd
import numpy as np
```

In [2]:
```python
# Parse the annotations from the BelgaLogos metadata file and
# perform a count summary on them, brand-by-brand
import util as ut
md = ld.read_metadata()
parsed_counts = ut.metadata_count_summary(md)
```
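For readers without access to `util.py`, a brand-by-brand count summary of this kind can be sketched with `pd.crosstab`. This is an illustration under stated assumptions, not the actual `metadata_count_summary`: the column names `brand` and `ok`, the 1/0 encoding of OK/Junk, and the toy data are all hypothetical, and the sketch counts annotation rows, whereas the real helper may deduplicate per image.

```python
import pandas as pd

# Toy stand-in for the parsed metadata; column names and values are assumptions.
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B"],
    "ok":    [1, 0, 0, 1, 1],   # assumed encoding: 1 = 'OK', 0 = 'Junk'
})


def count_summary(df):
    """Count OK and Junk rows per brand and add a Total column."""
    counts = pd.crosstab(df["brand"], df["ok"])
    # Make sure both columns exist even if a flag never occurs.
    counts = counts.reindex(columns=[1, 0], fill_value=0)
    counts.columns = ["#OK", "#Junk"]
    counts["Total"] = counts["#OK"] + counts["#Junk"]
    counts.index.name = "Logo name"
    return counts


summary = count_summary(toy)
print(summary)
```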

In [3]:
```python
# Scrape the BelgaLogos website for the equivalent table
scraped_counts = ld.scrape_testdata()
```

### Comparison of parsed and scraped dataset statistics
The table below compares image counts from the parsed annotations (left) with those scraped from the website (right). The counts are split into images judged by human assessors as 'OK' and as 'Junk', with the total also shown. Note that the scraped table is not complete: the 'Bridgestone' Junk cell is incorrectly filled on the website. In addition, a few logo names are spelled differently on the website than in the source data, which needs to be corrected before the two tables can be compared automatically.

Things look OK by eye, but this should be tested automatically.

In [4]:
```python
from util import multi_table
multi_table([parsed_counts, scraped_counts])
```

| Logo name (parsed) | #OK | #Junk | Total | Logo name (scraped) | #OK | #Junk | Total |
|---|---|---|---|---|---|---|---|
| Adidas | 147 | 896.0 | 1043 | Adidas | 147 | 896 | 1043 |
| Adidas-text | 63 | 115.0 | 178 | Adidas-text | 63 | 115 | 178 |
| Airness | 11 | 109.0 | 120 | Airness | 11 | 109 | 120 |
| BFGoodrich | 86 | 222.0 | 308 | Base | 162 | 86 | 248 |
| Base | 162 | 86.0 | 248 | BFGoodrich | 86 | 222 | 308 |
| Bik | 65 | 205.0 | 270 | Bik | 65 | 205 | 270 |
| Bouigues | 14 | 18.0 | 32 | Bouygues | 14 | 18 | 32 |
| Bridgestone | 31 | 74.0 | 105 | Bridgestone | 31 | Junk | 105 |
| Bridgestone-text | 64 | 137.0 | 201 | Bridgestone-text | 64 | 74 | 201 |
| Carglass | 18 | 47.0 | 65 | Carglass | 18 | 47 | 65 |


#### TODO
1. Run an automated check between the parsed and scraped counts (this will require a name map).
2. Check why the two tables are not sorted by index in the same order (see BFGoodrich and Base).
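
Both items can be sketched on a few rows copied from the tables above. This is a rough illustration, not a finished test: `NAME_MAP` and `compare_counts` are hypothetical names, the map covers only the one spelling difference visible in these rows, and the explanation for item 2 assumes the parsed index is sorted with pandas' default ordering, which is case-sensitive and therefore places `BFGoodrich` before `Base`.

```python
import pandas as pd

# Hypothetical map from website spellings to annotation-file spellings.
NAME_MAP = {"Bouygues": "Bouigues"}

# A few rows copied from the parsed and scraped tables.
parsed = pd.DataFrame(
    {"#OK": [86, 162, 14, 31], "#Junk": [222, 86, 18, 74], "Total": [308, 248, 32, 105]},
    index=["BFGoodrich", "Base", "Bouigues", "Bridgestone"],
)
scraped = pd.DataFrame(
    {"#OK": [162, 86, 14, 31], "#Junk": [86, 222, 18, "Junk"], "Total": [248, 308, 32, 105]},
    index=["Base", "BFGoodrich", "Bouygues", "Bridgestone"],
)


def compare_counts(parsed, scraped, name_map):
    """Return the rows where parsed and scraped counts disagree or are missing."""
    scraped = scraped.rename(index=name_map)
    # Non-numeric cells (such as the stray 'Junk' string) become NaN.
    parsed = parsed.apply(pd.to_numeric, errors="coerce")
    scraped = scraped.apply(pd.to_numeric, errors="coerce")
    # Align on the union of names so unmatched brands show up as NaN rows.
    left, right = parsed.align(scraped, join="outer")
    differs = ~((left == right) | (left.isna() & right.isna()))
    bad_rows = differs.any(axis=1)
    return left[bad_rows].join(right[bad_rows], lsuffix="_parsed", rsuffix="_scraped")


report = compare_counts(parsed, scraped, NAME_MAP)
print(report)

# Item 2: default sorting is case-sensitive, so uppercase 'F' sorts before
# lowercase 'a'. A lowercase sort key gives a case-insensitive order instead.
default_order = parsed.sort_index()
case_insensitive = parsed.sort_index(key=lambda idx: idx.str.lower())
print(list(default_order.index))
print(list(case_insensitive.index))
```

On these rows the report flags only `Bridgestone`, whose scraped Junk cell is not a number. Rows that are flagged because a name is missing from one side indicate entries still to be added to the name map.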