# Image Analysis

## Please don't run all cells!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
import math
from tqdm import tqdm

In [None]:
#load data
data = pd.read_csv("data/metadata/selection.csv")

In [None]:
data["Scientific Name"].unique()

## Create thumbnail files

The images are our principal data, which makes them very important to analyze. However, we can't use the same tools we would use for numerical or text data. Here even more than elsewhere, visualisation is considerably more powerful than other approaches. Typically, the images may differ by the type of take: some birds are photographed in flight, while others are photographed at rest on the ground. Unfortunately, the metadata doesn't contain any information about the type of take. Since the number of images available is relatively small, I decided to display all pictures as thumbnails in order to assess their characteristics and quality.

Printing thumbnails of the images on several rows and columns is a good means of getting a quick visual assessment of the pictures. On the other hand, printing all images in the notebook is not very efficient since it will overload the RAM. The first section of the present notebook is used to create figures with matplotlib and save them as `.jpg` files. Those files will then be used for visual assessment.

In order to avoid overloading RAM, clear the cell output after creating the file for the species of interest.
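The per-species cells in this section all repeat the same steps, so they could be wrapped in a single helper. The sketch below is only an illustration: the function name `save_thumbnail_grid` and the `row_height` parameter are hypothetical, and `row_height=6.0` merely approximates the hand-set figure heights (for example 100 inches for the 16 rows of Aquila chrysaetos). It assumes the same column names as `selection.csv` and closes each figure after saving so that its memory is released.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image


def save_thumbnail_grid(df, species, out_path, ncols=3, row_height=6.0):
    """Save a thumbnail grid for one species and return the number of images."""
    subset = df[df["Scientific Name"] == species].reset_index(drop=True)
    n = len(subset)
    if n == 0:
        raise ValueError(f"no observations for {species!r}")

    nrows = math.ceil(n / ncols)
    # squeeze=False keeps axes two-dimensional even when there is a single row
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols,
                             figsize=(20, row_height * nrows), squeeze=False)
    flat = axes.ravel()

    for i, row in subset.iterrows():
        # open the image from its storage path and convert it to an array
        with Image.open(row["storage"]) as pil_img:
            flat[i].imshow(np.asarray(pil_img))
        flat[i].set_title("ML : {} -- rating : {} -- index : {}".format(
            row["ML Catalog Number"], row["Average Community Rating"], i))

    for ax in flat[n:]:
        ax.axis("off")  # hide the unused cells of the last row

    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)  # release the figure instead of keeping it in memory
    return n


# Example call (paths as used in this notebook):
# save_thumbnail_grid(data, "Aquila chrysaetos",
#                     "data/visual_assessment/aquila_chrysaetos.jpg")
```

Because the figure is closed inside the function, nothing is displayed in the notebook and there is no cell output to clear afterwards.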

In [None]:
#create dataframe for "Aquila chrysaetos"
ac = data[data["Scientific Name"]=="Aquila chrysaetos"]

#get the number of observations
n = len(ac)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(100)

for i in tqdm(range(n)):
    pil_img = Image.open(ac.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = ac.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = ac.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe

plt.tight_layout()    
    
plt.savefig("data/visual_assessment/aquila_chrysaetos.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Ardea cinerea'
ar = data[data["Scientific Name"]=='Ardea cinerea']

#get the number of observations
n = len(ar)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(300)

for i in tqdm(range(n)):
    pil_img = Image.open(ar.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = ar.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = ar.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe

plt.tight_layout()
    
plt.savefig("data/visual_assessment/ardea_cinerea.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Buteo buteo'
bb = data[data["Scientific Name"]=='Buteo buteo']

#get the number of observations
n = len(bb)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(280)

for i in tqdm(range(n)):
    pil_img = Image.open(bb.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = bb.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = bb.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()

plt.savefig("data/visual_assessment/buteo_buteo.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Aythya fuligula'
af = data[data["Scientific Name"]=='Aythya fuligula']

#get the number of observations
n = len(af)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(360)

for i in tqdm(range(n)):
    pil_img = Image.open(af.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = af.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = af.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe

plt.tight_layout()    
    
plt.savefig("data/visual_assessment/aythya_fuligula.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Mergus merganser'
mm = data[data["Scientific Name"]=='Mergus merganser']

#get the number of observations
n = len(mm)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(380)

for i in tqdm(range(n)):
    pil_img = Image.open(mm.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = mm.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = mm.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe

plt.tight_layout()    
    
plt.savefig("data/visual_assessment/mergus_merganser.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Larus michahellis'
lm = data[data["Scientific Name"]=='Larus michahellis']

#get the number of observations
n = len(lm)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(320)

for i in tqdm(range(n)):
    pil_img = Image.open(lm.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = lm.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = lm.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()
    
plt.savefig("data/visual_assessment/larus_michahellis.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Cygnus olor'
co = data[data["Scientific Name"]=='Cygnus olor']

#get the number of observations
n = len(co)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(420)

for i in tqdm(range(n)):
    pil_img = Image.open(co.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = co.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = co.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()
    
plt.savefig("data/visual_assessment/cygnus_olor.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Fulica atra'
fa = data[data["Scientific Name"]=='Fulica atra']

#get the number of observations
n = len(fa)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(500)

for i in tqdm(range(n)):
    pil_img = Image.open(fa.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = fa.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = fa.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe

plt.tight_layout()

plt.savefig("data/visual_assessment/fulica_atra.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Milvus milvus'
mi = data[data["Scientific Name"]=='Milvus milvus']

#get the number of observations
n = len(mi)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(250)

for i in tqdm(range(n)):
    pil_img = Image.open(mi.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = mi.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = mi.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
     
plt.tight_layout()

plt.savefig("data/visual_assessment/milvus_milvus.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Pyrrhocorax graculus'
pg = data[data["Scientific Name"]=='Pyrrhocorax graculus']

#get the number of observations
n = len(pg)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(380)

for i in tqdm(range(n)):
    pil_img = Image.open(pg.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = pg.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = pg.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()

plt.savefig("data/visual_assessment/pyrrhocorax_graculus.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Turdus merula'
tm = data[data["Scientific Name"]=='Turdus merula']

#get the number of observations
n = len(tm)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(150)

for i in tqdm(range(n)):
    pil_img = Image.open(tm.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = tm.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = tm.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()

plt.savefig("data/visual_assessment/turdus_merula.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Podiceps cristatus'
pc = data[data["Scientific Name"]=='Podiceps cristatus']

#get the number of observations
n = len(pc)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(540)

for i in tqdm(range(n)):
    pil_img = Image.open(pc.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = pc.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = pc.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()

plt.savefig("data/visual_assessment/podiceps_cristatus.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Chroicocephalus ridibundus'
cr = data[data["Scientific Name"]=='Chroicocephalus ridibundus']

#get the number of observations
n = len(cr)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(540)

for i in tqdm(range(n)):
    pil_img = Image.open(cr.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = cr.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = cr.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
       
plt.tight_layout()

plt.savefig("data/visual_assessment/chroicocephalus_ridibundus.jpg") #save the figure as a jpg file

In [None]:
#create dataframe for 'Netta rufina'
nr = data[data["Scientific Name"]=='Netta rufina']

#get the number of observations
n = len(nr)

#plot the observations
fig,axes = plt.subplots(nrows=math.ceil(n/3), ncols=3)
fig.set_figwidth(20)
fig.set_figheight(500)

for i in tqdm(range(n)):
    pil_img = Image.open(nr.iloc[i,:]["storage"]) #for each observation, get the storage path and open a pillow image
    np_img = np.array(pil_img) #turn the image into an array
    axes.ravel()[i].imshow(np_img) #display the image at the right position in the grid
    ML = nr.iloc[i,:]["ML Catalog Number"] #get the ML number of the observation
    rating = nr.iloc[i,:]["Average Community Rating"] #get the average rating
    axes.ravel()[i].set_title("ML : {} -- rating : {} -- index : {}".format(ML,rating,i)) #show the ML number, the average rating and the index in the species dataframe
    
plt.tight_layout()

plt.savefig("data/visual_assessment/netta_rufina.jpg") #save the figure as a jpg file

## Visual assessment of the images

Based on the files created above, we can make some observations about the content of the images.

### 1) Aquila chrysaetos
- not enough observations (47)

### 2) Ardea cinerea
- relatively small number of observations (151)
- good general quality of images
- types of take:
    - birds mostly standing
    - a few birds flying
    - a few macros
    - a few birds in group
- no sexual dimorphism
- no look-alike

### 3) Aythya fuligula
- moderate number of observations (181)
- good general quality of images
- types of take:
    - birds mostly afloat
    - non-negligible number of birds in group
    - a few macros
- light sexual dimorphism:
    - the male's flanks are white whereas the female's are brown
    - silhouette is the same for both sexes
- look-alike:
    - very light look-alike of both sexes with Netta rufina since the silhouette is quite similar but the colors are different
    - very light look-alike of the male with Fulica atra since they are both mostly black but the silhouette is different

### 4) Buteo buteo

- relatively small number of images (139)
- good general quality of images
- types of take:
    - large number of birds flying
    - large number of birds standing
- plumage varies considerably from specimen to specimen!
- look-alike:
    - strong with Aquila chrysaetos since the size is difficult to assess on a picture
    - light with Milvus milvus but the shapes of the tail and the wings are different

### 5) Chroicocephalus ridibundus
- relatively large number of images (274)
- good general quality of images
- types of take:
    - large number of birds flying
    - large number of birds standing
    - moderate number of birds afloat
- no sexual dimorphism but plumage varies with the seasons!
- light look-alike with Larus michahellis since they are both white lake birds, but the silhouette is different

### 6) Cygnus olor
- relatively large number of images (213)
- quite good general quality of images (non-negligible share of birds in group)
- types of take:
    - birds mostly afloat
    - moderate number of birds flying
    - moderate number of birds in group
- no sexual dimorphism
- no look-alike

### 7) Fulica atra
- relatively large number of images (251)
- good general quality of images
- types of take:
    - birds mostly afloat
    - non-negligible number of immature specimens
- no sexual dimorphism
- very light look-alike with Aythya fuligula male since they are both mostly black but the silhouette is different

### 8) Larus michahellis
- relatively small number of images (162)
- good general quality of images
- types of take:
    - birds mostly flying
    - moderate number of birds afloat
    - moderate number of birds standing
    - moderate number of birds in group
- no sexual dimorphism but plumage varies with age
- light look-alike with Chroicocephalus ridibundus since they are both white lake birds, but the silhouette is different

### 9) Mergus merganser
- moderate number of pictures (189)
- good general quality of images
- types of take:
    - birds mostly afloat
    - small number of birds flying
- sexual dimorphism:
    - male has a dark green head, white flanks and black back
    - female has tan head and light grey body
    - silhouette is the same except that the female has a crest
- look-alike:
    - female: look-alike with Podiceps cristatus since the silhouette and the colors are similar

### 10) Milvus milvus
- relatively small number of images (128)
- quite good general quality of images (some birds are relatively distant)
- types of take:
    - birds mostly flying
    - small number of birds standing
- no sexual dimorphism
- light look-alike with Buteo buteo and Aquila chrysaetos but the shapes of the tail and the wings are different

### 11) Netta rufina
- relatively large number of images (251)
- good general quality of images
- types of take:
    - birds mostly afloat
    - small number of birds in group
- sexual dimorphism
    - male has a tan head, a red beak and a black breast
    - female has a dark grey and white head, a dark beak and a light grey body
    - silhouette is the same for both sexes
- look-alike:
    - very light look-alike of both sexes with Aythya fuligula since the silhouette is quite similar but the colors are different
    - light look-alike with Podiceps cristatus since the colors are rather similar but the silhouette is different

### 12) Podiceps cristatus
- relatively large number of images (273)
- good general quality of images
- types of take:
    - birds mostly afloat
    - large number of images containing multiple specimens
- no sexual dimorphism
- look-alike:
    - strong look-alike with Mergus merganser female since the silhouette and the colors are similar
    - light look-alike with Netta rufina since the colors are rather similar but the silhouette is different

### 13) Pyrrhocorax graculus
- moderate number of observations (190)
- good general quality of images
- types of take:
    - birds mostly standing
    - moderate number of birds flying
- no sexual dimorphism
- strong look-alike with Turdus merula

### 14) Turdus merula
- very small number of observations (75)
- good general quality of images
- types of take:
    - birds mostly standing
    - moderate share of macros
    - small number of birds flying
- sexual dimorphism: male is black whereas female is mottled brown
- strong look-alike of the male with Pyrrhocorax graculus; Turdus merula is a little stockier and has a golden ring around the eye

## Select classes

### First classification pool

Given the small number of observations, we must be humble and avoid setting unrealistic goals. A classification task with 5 classes that are relatively homogeneous within each class and relatively heterogeneous between them is a reasonable challenge. Based on the results of exploratory metadata analysis and the visual assessment of the pictures, I have drawn the following conclusions.

- Aquila chrysaetos and Turdus merula can be eliminated since we don't have enough observations.
- Mergus merganser has a pronounced sexual dimorphism. In addition, the female looks very much like Podiceps cristatus. As a result, female Mergus merganser are highly likely to be misclassified as Podiceps cristatus. For this reason, I rule out Mergus merganser.
- For Aythya fuligula, Cygnus olor, Fulica atra, Netta rufina, Podiceps cristatus, and to a lesser extent Larus michahellis and Chroicocephalus ridibundus, the background is mostly water. This may make the images very similar for the algorithm. Nevertheless, this effect can be reduced by cropping the images close to the bird.
- The same is true with respect to the blue sky for Buteo buteo, Milvus milvus, Pyrrhocorax graculus and to a lesser extent Larus michahellis and Chroicocephalus ridibundus.
- Species with a larger number of observations should be preferred; we can always drop observations if necessary.
- All species still have some images that are not appropriate for classification: immature specimens, very distant group of birds, etc. However, these images are few and it will be easy to spot them visually and drop them from the dataset.
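The observation-count criterion in the list above can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the counts are copied from the visual assessment, while the threshold `MIN_OBSERVATIONS = 100` is hypothetical (the analysis only states that 47 and 75 observations are not enough). Background and look-alike considerations still have to be weighed by hand.

```python
# Observation counts copied from the visual assessment above.
# In the notebook they could be obtained with
# data["Scientific Name"].value_counts().
counts = {
    "Aquila chrysaetos": 47,
    "Ardea cinerea": 151,
    "Aythya fuligula": 181,
    "Buteo buteo": 139,
    "Chroicocephalus ridibundus": 274,
    "Cygnus olor": 213,
    "Fulica atra": 251,
    "Larus michahellis": 162,
    "Mergus merganser": 189,
    "Milvus milvus": 128,
    "Netta rufina": 251,
    "Podiceps cristatus": 273,
    "Pyrrhocorax graculus": 190,
    "Turdus merula": 75,
}

MIN_OBSERVATIONS = 100           # hypothetical threshold, not from the analysis
EXCLUDED = {"Mergus merganser"}  # ruled out above (female look-alike)

# Keep species with enough observations, largest classes first.
candidates = sorted(
    (name for name, n in counts.items()
     if n >= MIN_OBSERVATIONS and name not in EXCLUDED),
    key=lambda name: counts[name],
    reverse=True,
)

for name in candidates:
    print(f"{name}: {counts[name]}")
```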

I selected the following species in a first classification pool:

1. Podiceps cristatus
2. Fulica atra
3. Chroicocephalus ridibundus
4. Cygnus olor
5. Pyrrhocorax graculus

### Second classification pool

In a second phase, I would like to increase the difficulty a little by introducing species that look like other species. Specifically, I want to replace Chroicocephalus ridibundus with Netta rufina, which is quite similar in color to Podiceps cristatus, and to replace Cygnus olor with Turdus merula, which is very similar to Pyrrhocorax graculus and has the same dominant color as Fulica atra.

In particular, it will be interesting to see whether I can set up the algorithm so that it gives sufficient weight to the metadata to separate similar-looking species. Moreover, Netta rufina has a quite strong sexual dimorphism, but the male and female are likely to be observed in the same places during the same period.

We also note that Turdus merula has far fewer observations than the other classes. It would be interesting to see what effect this has on the classification and which strategies we can find to handle potential problems.
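One common way to handle such an imbalance is to weight each class inversely to its frequency. The sketch below only illustrates that idea and is not a strategy this analysis has settled on. The counts are taken from the visual assessment, before any unsuitable image is dropped, and the "balanced" formula `total / (n_classes * count)` is an assumption borrowed from common practice.

```python
# Image counts for the second pool, copied from the visual assessment above.
counts = {
    "Podiceps cristatus": 273,
    "Fulica atra": 251,
    "Netta rufina": 251,
    "Turdus merula": 75,
    "Pyrrhocorax graculus": 190,
}

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" weighting: weight = total / (n_classes * count), so that every
# class contributes the same total weight regardless of its size.
class_weights = {name: total / (n_classes * n) for name, n in counts.items()}

# Print the classes from the most to the least heavily weighted.
for name, weight in sorted(class_weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {counts[name]:4d} images  weight {weight:.2f}")
```

With these counts Turdus merula receives the largest weight, so a misclassified Turdus merula image would cost more than a misclassified image from one of the larger classes.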

The second classification pool contains the following species:

1. Podiceps cristatus
2. Fulica atra
3. Netta rufina
4. Turdus merula
5. Pyrrhocorax graculus