# Getting Started - Parsing Validation

Let's start by getting the dataset metadata from BelgaLogos parsed.

##### Preprocessing
There are two malformed entries in the metadata file `qset3_internal_and_local.gt`.
Attempting to import it as a pandas dataframe yields two rows with NaN entries:
 
 ```
 Peugeot_0007    Peugeot 07585235.jpg    logo   ...   NaN        NaN  NaN  NaN   NaN   NaN   NaN   NaN
 StellaArtois_0028       StellaArtois    0764798...   215        231  274  NaN   NaN   NaN   NaN   NaN
 ```

Unlike the rest of the data, these records were not tab-delimited; they look as though they may have been edited by hand.
I restored the tabs manually and included the corrected version of the file in this repository.
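
Rows like these can also be located before loading the file by counting the tab-separated fields on each line. The following is a minimal sketch, not the routine in `load_data.py`: `find_malformed_lines` is a hypothetical helper, the sample rows are invented, and the expected field count of 9 is an assumption chosen for the sample only.

```python
import io


def find_malformed_lines(lines, expected_fields, sep="\t"):
    """Return (line_number, field_count, text) for lines with an unexpected field count.

    Hypothetical helper for illustration; it is not part of load_data.py.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text:
            continue  # skip blank lines
        n_fields = len(text.split(sep))
        if n_fields != expected_fields:
            bad.append((lineno, n_fields, text))
    return bad


# Invented sample rows (not taken from the real annotation file).
# The second row uses spaces instead of tabs between its first three fields.
sample = io.StringIO(
    "BrandA_0001\tBrandA\timg1.jpg\tlogo\t1\t10\t20\t30\t40\n"
    "BrandB_0002 BrandB img2.jpg\tlogo\t0\t11\t21\t31\t41\n"
)

# expected_fields=9 is an assumption that matches this sample only
malformed = find_malformed_lines(sample, expected_fields=9)
for lineno, n_fields, text in malformed:
    print(f"line {lineno}: found {n_fields} fields")
```

On the real file, one would pass the open file handle and the field count of a known-good row instead of the sample values.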

Let's run a sanity check to confirm that the file parses correctly. My data loading and scraping routines are in `load_data.py`. First we build a pandas DataFrame from the BelgaLogos annotation file, then compare its image counts against the table scraped from the BelgaLogos webpage (as a quick check).

In [1]:
import load_data as ld
import pandas as pd
import numpy as np
from IPython.display import display, HTML

# Formatting for tables
CSS = """
.output {
    flex-direction: row;
}
"""
HTML('<style>{}</style>'.format(CSS))

In [2]:
# Read the annotations file into a DataFrame
md = ld.read_metadata()

# Perform image counts (total, ok images and junk images)
total_images = md['brand'].value_counts()
ok_images    = md[md.ok]['brand'].value_counts()
junk_images  = md[~md.ok]['brand'].value_counts()

# Build a table of image counts from the parsed dataset
parsed_counts = pd.concat([ok_images, junk_images, total_images], axis=1, sort=True)
parsed_counts.columns = ['#OK', '#Junk', 'Total']
parsed_counts.index.name = 'Logo name'

In [3]:
# Scrape the BelgaLogos website for the equivalent table
scraped_counts = ld.scrape_testdata()

### Comparison of parsed and scraped dataset statistics
On the left, we have the image counts from the downloaded dataset (separated into 'OK' and 'Junk' annotations, along with the 'Total' count).
The table on the right is the equivalent scraped from the BelgaLogos website. Note that the scraped table is not complete: the 'Junk' entry for 'Bridgestone' is incorrectly filled. Furthermore, a few logo names are spelled differently in the two tables (for example 'Bouigues' versus 'Bouygues'), which needs to be corrected before the comparison can be automated.

Things look OK by eye, but this should be checked automatically.

#### TODO
Improve the formatting here if there is time, and add automated tests.
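
One possible shape for the automated test is sketched below. It is only an illustration under stated assumptions: `compare_counts` is a hypothetical helper (not part of `load_data.py`), the rename map covers just the one spelling difference visible in the output below, and the sample frames copy four rows from the two tables rather than using the full `parsed_counts` and `scraped_counts`.

```python
import pandas as pd


def compare_counts(parsed, scraped, renames=None):
    """Return (logo, column) pairs where the two count tables disagree.

    Hypothetical helper for illustration; it is not part of load_data.py.
    """
    # Harmonise logo-name spellings in the scraped table
    scraped = scraped.rename(index=renames or {})
    # Coerce every cell to a number; non-numeric cells become NaN
    parsed = parsed.apply(pd.to_numeric, errors="coerce")
    scraped = scraped.apply(pd.to_numeric, errors="coerce")
    # Align on the union of row and column labels
    parsed, scraped = parsed.align(scraped, join="outer")
    # NaN never compares equal, so missing or unparseable cells are flagged too
    mismatch = parsed.ne(scraped)
    return [
        (logo, col)
        for logo in mismatch.index
        for col in mismatch.columns
        if mismatch.at[logo, col]
    ]


cols = ["#OK", "#Junk", "Total"]
# Four rows copied from the parsed table
parsed_sample = pd.DataFrame(
    [[147, 896, 1043], [14, 18, 32], [31, 74, 105], [64, 137, 201]],
    index=["Adidas", "Bouigues", "Bridgestone", "Bridgestone-text"],
    columns=cols,
)
# The same four rows copied from the scraped table
scraped_sample = pd.DataFrame(
    [[147, 896, 1043], [14, 18, 32], [31, "Junk", 105], [64, 74, 201]],
    index=["Adidas", "Bouygues", "Bridgestone", "Bridgestone-text"],
    columns=cols,
)

# Assumed spelling fix: map the scraped name onto the parsed one
diffs = compare_counts(parsed_sample, scraped_sample, renames={"Bouygues": "Bouigues"})
print(diffs)
```

With these sample rows the check should flag only the '#Junk' cells of 'Bridgestone' and 'Bridgestone-text'. An empty list would mean the two tables agree.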

In [4]:
display(parsed_counts)
display(scraped_counts)

| Logo name | #OK | #Junk | Total |
|---|---|---|---|
| Adidas | 147 | 896.0 | 1043 |
| Adidas-text | 63 | 115.0 | 178 |
| Airness | 11 | 109.0 | 120 |
| BFGoodrich | 86 | 222.0 | 308 |
| Base | 162 | 86.0 | 248 |
| Bik | 65 | 205.0 | 270 |
| Bouigues | 14 | 18.0 | 32 |
| Bridgestone | 31 | 74.0 | 105 |
| Bridgestone-text | 64 | 137.0 | 201 |
| Carglass | 18 | 47.0 | 65 |


| Logo name | #OK | #Junk | Total |
|---|---|---|---|
| Adidas | 147 | 896 | 1043 |
| Adidas-text | 63 | 115 | 178 |
| Airness | 11 | 109 | 120 |
| Base | 162 | 86 | 248 |
| BFGoodrich | 86 | 222 | 308 |
| Bik | 65 | 205 | 270 |
| Bouygues | 14 | 18 | 32 |
| Bridgestone | 31 | Junk | 105 |
| Bridgestone-text | 64 | 74 | 201 |
| Carglass | 18 | 47 | 65 |
