## Deep Learning Project: Country Recognition from Images
## **Preprocessing**

#### Dataset: 
Google Landmarks Dataset v2:
https://github.com/cvdfoundation/google-landmark?tab=readme-ov-file


#### Project objective:
<p style="text-align: justify;">
    The goal of this project is to build a Deep Learning solution for image recognition. The input is an image of a place, and the model outputs the corresponding country along with the class membership probabilities.
</p>
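To make "country plus membership probabilities" concrete, here is a minimal sketch of how raw model scores are commonly turned into probabilities with a softmax. The country list and the scores are hypothetical values chosen for illustration. The notebook does not specify the model's output layer, so softmax is an assumption here.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and raw scores (illustrative values only)
countries = ["United Kingdom", "Italy", "Spain"]
scores = [2.0, 0.5, 0.1]

probabilities = softmax(scores)
# The predicted country is the one with the highest probability
best_index = probabilities.index(max(probabilities))
predicted_country = countries[best_index]
```
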


In [1]:
# Packages
import pandas as pd
import matplotlib.pyplot as plt


In [2]:
# Global variables
DATA_TRAIN_PATH = "data/train"

### Importing the data

In [9]:
train_df = pd.read_csv(f"{DATA_TRAIN_PATH}/train.csv")
train_df.head()

Unnamed: 0,id,url,landmark_id
0,6e158a47eb2ca3f6,https://upload.wikimedia.org/wikipedia/commons...,142820
1,202cd79556f30760,http://upload.wikimedia.org/wikipedia/commons/...,104169
2,3ad87684c99c06e1,http://upload.wikimedia.org/wikipedia/commons/...,37914
3,e7f70e9c61e66af3,https://upload.wikimedia.org/wikipedia/commons...,102140
4,4072182eddd0100e,https://upload.wikimedia.org/wikipedia/commons...,2474


In [10]:
category_to_location_df = pd.read_csv(f"{DATA_TRAIN_PATH}/category_to_location.csv")
# Rename 'id' to 'landmark_id' so the key has the same name as in train_df
category_to_location_df.rename(columns={'id': 'landmark_id'}, inplace=True)
category_to_location_df.head()

Unnamed: 0,landmark_id,category_name,name,lat,lon,city,state,country
0,0,Category:Happy_Valley_Racecourse,Natural Turf Soccer Pitch No. 5,22.2728,114.182,Hong Kong Island,Hong Kong,China
1,1,Category:Luitpoldpark_in_Munich,,48.171494,11.569674,Munich,Bavaria,Germany
2,3,"Category:Tweed_Heads,_New_South_Wales",Ukerebagh Nature Reserve,-28.1833,153.55,Tweed Heads,New South Wales,Australia
3,14,Category:Delacorte_Theater,Delacorte Theater,40.7801,-73.968767,New York,New York,United States
4,15,Category:Tremper_Mound_and_Earthworks,Tremper Mound,38.8013,-83.0106,,Ohio,United States


### Joining the data

We attach the location information (city, country, ...) to each image.

In [None]:
# Left join: keep every image, even when its landmark has no known location
train_df = train_df.merge(category_to_location_df, on='landmark_id', how='left')
train_df.head()

Unnamed: 0,id,url,landmark_id,category_name,name,lat,lon,city,state,country
0,6e158a47eb2ca3f6,https://upload.wikimedia.org/wikipedia/commons...,142820,,,,,,,
1,202cd79556f30760,http://upload.wikimedia.org/wikipedia/commons/...,104169,Category:Stirling_Castle,Stirling Castle,56.123889,-3.947778,Stirling,Scotland,United Kingdom
2,3ad87684c99c06e1,http://upload.wikimedia.org/wikipedia/commons/...,37914,,,,,,,
3,e7f70e9c61e66af3,https://upload.wikimedia.org/wikipedia/commons...,102140,,,,,,,
4,4072182eddd0100e,https://upload.wikimedia.org/wikipedia/commons...,2474,Category:River_Severn,Aylburton,51.685278,-2.543611,Forest of Dean,England,United Kingdom
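The left join above leaves `NaN` in the location columns for every landmark missing from the location table. The sketch below, run on toy data, shows one way to count those unmatched rows with the `indicator` and `validate` options of `DataFrame.merge`. The frames `images` and `locations` are hypothetical stand-ins for `train_df` and `category_to_location_df`, and the sketch assumes each `landmark_id` appears at most once in the location table.

```python
import pandas as pd

# Toy stand-ins for train_df and category_to_location_df (illustrative values only)
images = pd.DataFrame({
    "id": ["a", "b", "c"],
    "landmark_id": [104169, 37914, 2474],
})
locations = pd.DataFrame({
    "landmark_id": [104169, 2474],
    "country": ["United Kingdom", "United Kingdom"],
})

merged = images.merge(
    locations,
    on="landmark_id",
    how="left",
    validate="many_to_one",  # raises MergeError if landmark_id repeats in locations
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows tagged "left_only" are images whose landmark has no known location
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(merged["_merge"].value_counts())
```

The `validate` check is useful on a dataset of this size because a duplicated key in the location table would silently multiply image rows.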


In [12]:
print('Shape of the original dataset: ', train_df.shape)
# Keep only the images whose landmark was matched to a country
train_df.dropna(subset=['country'], inplace=True)
print('Shape of the dataset after keeping only recognized locations: ', train_df.shape)
train_df.head()

Shape of the original dataset:  (4132914, 10)
Shape of the dataset after keeping only recognized locations:  (1273626, 10)


Unnamed: 0,id,url,landmark_id,category_name,name,lat,lon,city,state,country
1,202cd79556f30760,http://upload.wikimedia.org/wikipedia/commons/...,104169,Category:Stirling_Castle,Stirling Castle,56.123889,-3.947778,Stirling,Scotland,United Kingdom
4,4072182eddd0100e,https://upload.wikimedia.org/wikipedia/commons...,2474,Category:River_Severn,Aylburton,51.685278,-2.543611,Forest of Dean,England,United Kingdom
7,16d8aa057cdd01b9,http://upload.wikimedia.org/wikipedia/commons/...,25719,Category:Duomo_(Monza),Monza Cathedral,45.58359,9.27567,Monza,Lombardy,Italy
12,88f3f71c2b71a6f9,https://upload.wikimedia.org/wikipedia/commons...,198623,"Category:Newark_Castle,_Nottinghamshire",,53.0775,-0.812415,Newark and Sherwood,England,United Kingdom
15,0851a257e5e872ef,https://upload.wikimedia.org/wikipedia/commons...,189446,Category:Castle_of_Peñíscola,Castillo de Peñiscola,40.3588,0.407926,Peníscola / Peñíscola,Valencian Community,Spain
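The filter above keeps 1,273,626 of 4,132,914 rows, roughly 30.8% of the images. The sketch below shows the same filtering step on a hypothetical toy frame, using the non-in-place form of `dropna` and computing the retained fraction. The final line repeats the arithmetic with the row counts reported in the output above.

```python
import pandas as pd

# Toy frame standing in for the merged train_df (illustrative values only)
toy_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "country": ["Italy", None, "Spain", None],
})

n_before = len(toy_df)
# Returns a new frame without the rows whose country is missing
filtered_df = toy_df.dropna(subset=["country"])
n_after = len(filtered_df)
retained_fraction = n_after / n_before

# Same calculation with the row counts printed by the notebook
notebook_fraction = 1273626 / 4132914
```
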


### Importing the data