# Home assignment - Tomato allergies - EDA

In [1]:
from collections import Counter
import json
import pandas as pd

## Data loading

In [2]:
mapping = pd.read_csv("../dataset/label_mapping.csv")

In [3]:
with open("../dataset/img_annotations.json") as f:
    annot = json.load(f)

## Data distribution

In [4]:
mapping[['tomato' in item.lower() for item in mapping.labelling_name_en]]

Unnamed: 0,labelling_id,labelling_name_fr,labelling_name_en
136,513535c5382eebf51dca54b46d570fe5_lab,Verre de Jus de tomate,Tomato juice
526,939030726152341c154ba28629341da6_lab,Tomates (coupées),Tomatoes
527,9f2c42629209f86b2d5fbe152eb54803_lab,Tomates cerises,Cherry tomatoes
528,4e884654d97603dedb7e3bd8991335d0_lab,Tomates (entières),Tomatoe whole
539,f3c12eecb7abc706b032ad9a387e6f01_lab,Tomate à la provençale,Stuffed Tomatoes
540,e306505b150e000f5e1b4f377ad688a0_lab,Tomate farcie,Stuffed tomatoes
640,5816e75b36aa2708f126fe22abeda6ed_lab,Raviolis sauce tomate,Ravioli with tomato sauce
655,4e2da86105869fc35c947c1b467a7f96_lab,Part de quiche provençale à la tomate,Quiche provence w. tomato
696,fb9547240ac8bb62713892d7e83e7ce2_lab,Sardines sauce tomate,Sardine tomato sauce
788,c262e42a627986076c07c4a194946a93_lab,Tomate Mozzarella (Salade Caprese plat complet),Tomato Mozzarella


Out of the 994 recognized labels, at least 10 are of interest to us: those that mention 'tomato' directly in their English name. This may not cover every ingredient or dish that includes tomatoes, but it already shows that we shouldn't restrict ourselves to `939030726152341c154ba28629341da6_lab` (the "Tomatoes" label): in the context of a tomato allergy, we want to track down derivatives as well, such as tomato sauce.

Let's save the IDs of all suspected tomato-related ingredients:

In [5]:
# Keep the IDs in a set so the membership tests on annotations are O(1).
tomatoes = set(mapping[['tomato' in item.lower() for item in mapping.labelling_name_en]]['labelling_id'])
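
Since the English-name filter may miss labels whose translation drops the word, one possible hedge is to match on both name columns. The sketch below assumes the stem `tomat` is a reasonable pattern for English and French names; the first toy row is copied from the output above, while the two `hypothetical_*` rows and the name `toy_mapping` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for label_mapping.csv. The first row comes from the output
# above; the two "hypothetical_*" rows are invented for this example.
toy_mapping = pd.DataFrame({
    "labelling_id": [
        "939030726152341c154ba28629341da6_lab",
        "hypothetical_a_lab",
        "hypothetical_b_lab",
    ],
    "labelling_name_fr": ["Tomates (coupées)", "Salade de tomates", "Pain"],
    "labelling_name_en": ["Tomatoes", "Mixed salad", "Bread"],
})

# "tomat" is a shared stem of English "tomato(es)" and French "tomate(s)".
# na=False treats missing names as non-matches instead of propagating NaN.
mask = (
    toy_mapping["labelling_name_en"].str.contains("tomat", case=False, na=False)
    | toy_mapping["labelling_name_fr"].str.contains("tomat", case=False, na=False)
)
toy_tomatoes = set(toy_mapping.loc[mask, "labelling_id"])
```

This still only catches labels that name the ingredient explicitly; dishes that contain tomato without saying so would need a manual review of the label list.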

Now let's consider our annotations. Their format is the following (for each image):

In [6]:
list(annot.items())[0]

('ec2f4cece94a8b249c97277951d71396.jpeg',
 [{'box': [32, 54, 525, 541],
   'id': '939030726152341c154ba28629341da6_lab',
   'is_background': False},
  {'box': [140, 209, 175, 125],
   'id': '807c6457c23082f3b0a260984df7f8c5_lab',
   'is_background': False},
  {'box': [365, 281, 93, 105],
   'id': '807c6457c23082f3b0a260984df7f8c5_lab',
   'is_background': False},
  {'box': [263, 109, 108, 109],
   'id': '807c6457c23082f3b0a260984df7f8c5_lab',
   'is_background': False},
  {'box': [0, 0, 599, 64], 'id': 'main_lab', 'is_background': True}])
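
To make the structure concrete, here is a minimal sketch that walks one image's entry (copied from the output above) and counts its non-background label IDs. The helper name `foreground_label_counts` is an assumption of this example, and the `box` values are left untouched because their layout is not documented here.

```python
from collections import Counter

# One image's entry, copied from the output above.
sample_annot = {
    "ec2f4cece94a8b249c97277951d71396.jpeg": [
        {"box": [32, 54, 525, 541],
         "id": "939030726152341c154ba28629341da6_lab",
         "is_background": False},
        {"box": [140, 209, 175, 125],
         "id": "807c6457c23082f3b0a260984df7f8c5_lab",
         "is_background": False},
        {"box": [365, 281, 93, 105],
         "id": "807c6457c23082f3b0a260984df7f8c5_lab",
         "is_background": False},
        {"box": [263, 109, 108, 109],
         "id": "807c6457c23082f3b0a260984df7f8c5_lab",
         "is_background": False},
        {"box": [0, 0, 599, 64], "id": "main_lab", "is_background": True},
    ]
}


def foreground_label_counts(annotations):
    """Count non-background label IDs for each image."""
    return {
        img: Counter(item["id"] for item in items if not item["is_background"])
        for img, items in annotations.items()
    }


counts = foreground_label_counts(sample_annot)
```

For this image, the sketch yields one `939030726152341c154ba28629341da6_lab` annotation and three `807c6457c23082f3b0a260984df7f8c5_lab` annotations, with the `main_lab` background entry skipped.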

A quick count finds 737 tomato-related annotations across all 3000 photos:

In [7]:
len([item['id'] for img, info in annot.items() for item in info if item['id'] in tomatoes])

737

A photo may show a dish with several instances and/or types of tomato-related ingredients.  
What matters most, especially if we reduce the problem to a classification task that collapses the tomato information into "is there *ANY* tomato in this photo?" (not how many or where), is the number of photos that contain one or more tomato-related ingredients.

We first group this information per image:

In [8]:
tomatoes_per_img = Counter([img for img, info in annot.items() for item in info if item['id'] in tomatoes])
tomatoes_per_img_df = pd.DataFrame.from_dict(tomatoes_per_img, orient='index')

In [9]:
tomatoes_per_img_df.head()

Unnamed: 0,0
ec2f4cece94a8b249c97277951d71396.jpeg,1
83590b62bcb71a17bf0fa8d5af941eb3.jpeg,2
4fc88151232cab1b03bfa0d47895a2cb.jpeg,1
b01a2555bee8b6a55d70a413d17d0779.jpeg,3
0f92168eaab8fd44a02b74ad0f0972a8.jpeg,1


In [10]:
tomatoes_per_img_df.describe()

Unnamed: 0,0
count,534.0
mean,1.38015
std,0.881145
min,1.0
25%,1.0
50%,1.0
75%,1.0
max,7.0


We ultimately find that there are 534 photos (out of 3000) that contain tomato-related ingredients.
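
If the problem is reduced to "any tomato or not", the same information can be expressed as one boolean per image, including the images without tomatoes. The following is a minimal sketch on toy data; `any_tomato_labels`, `toy_annot` and the file names `a.jpeg`, `b.jpeg`, `c.jpeg` are assumptions made for illustration.

```python
def any_tomato_labels(annotations, tomato_ids):
    """Map each image name to True if any of its annotations is tomato-related."""
    tomato_ids = set(tomato_ids)
    return {
        img: any(item["id"] in tomato_ids for item in items)
        for img, items in annotations.items()
    }


# Toy annotations in the same format as img_annotations.json (hypothetical files).
toy_annot = {
    "a.jpeg": [
        {"box": [0, 0, 10, 10],
         "id": "939030726152341c154ba28629341da6_lab",
         "is_background": False},
        {"box": [5, 5, 10, 10],
         "id": "939030726152341c154ba28629341da6_lab",
         "is_background": False},
    ],
    "b.jpeg": [
        {"box": [0, 0, 10, 10],
         "id": "807c6457c23082f3b0a260984df7f8c5_lab",
         "is_background": False},
    ],
    "c.jpeg": [],
}

labels = any_tomato_labels(toy_annot, ["939030726152341c154ba28629341da6_lab"])
n_positive = sum(labels.values())  # number of images with at least one tomato
```

Applied to the real `annot` and `tomatoes`, `n_positive` should match the 534 photos found above, and `len(labels)` should match the 3000 photos in the dataset.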

Tomato photos make up about 18% of the dataset (534 / 3000). This does not make them a "*rare*" event, but the positive class is small enough that we should carefully consider how to split the dataset so that the class imbalance is taken into account.
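
One common way to account for the imbalance is a stratified split, which keeps the share of tomato photos roughly equal in every subset. The sketch below is one possible implementation using only the standard library, not a method prescribed by the assignment; the function name `stratified_split`, the 20% test fraction and the synthetic image names are assumptions.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split image names into train/test, sampling each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for value in (True, False):
        # Sort first so the result depends only on the seed, not on dict order.
        group = sorted(img for img, lab in labels.items() if lab == value)
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Synthetic labels with the class sizes reported above: 534 positives out of 3000.
toy_labels = {f"img_{i}.jpeg": i < 534 for i in range(3000)}

train, test = stratified_split(toy_labels)
test_positive_rate = sum(toy_labels[img] for img in test) / len(test)
train_positive_rate = sum(toy_labels[img] for img in train) / len(train)
```

Both rates stay close to the overall 534 / 3000 share. scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument, if that dependency is acceptable.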