## Deep Learning Project: Country Recognition from Images
## **Preprocessing**

#### Dataset:
Google Landmarks Dataset v2:
https://github.com/cvdfoundation/google-landmark?tab=readme-ov-file


#### Project objective:
<p style="text-align: justify;">
    The goal of this project is to build a Deep Learning solution for image recognition. The input is an image of a place, and the model outputs the corresponding country along with the class membership probabilities.
</p>


In [None]:
# Packages
import json
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
from collections import Counter

from preprocessing import Preprocessing

In [2]:
# Global variables
DATA_PATH = "data/train"
DATA_IMAGES_PATH = "data/train/images"
DATA_RESULTS = "data/results"

### Loading the data

In [3]:
train_df = pd.read_csv(f"{DATA_PATH}/train.csv")
train_df.head()

Unnamed: 0,id,url,landmark_id
0,6e158a47eb2ca3f6,https://upload.wikimedia.org/wikipedia/commons...,142820
1,202cd79556f30760,http://upload.wikimedia.org/wikipedia/commons/...,104169
2,3ad87684c99c06e1,http://upload.wikimedia.org/wikipedia/commons/...,37914
3,e7f70e9c61e66af3,https://upload.wikimedia.org/wikipedia/commons...,102140
4,4072182eddd0100e,https://upload.wikimedia.org/wikipedia/commons...,2474


In [4]:
category_to_location_df = pd.read_csv(f"{DATA_PATH}/category_to_location.csv")
category_to_location_df.rename(columns={'id': 'landmark_id'}, inplace=True)
category_to_location_df.head()

Unnamed: 0,landmark_id,category_name,name,lat,lon,city,state,country
0,0,Category:Happy_Valley_Racecourse,Natural Turf Soccer Pitch No. 5,22.2728,114.182,Hong Kong Island,Hong Kong,China
1,1,Category:Luitpoldpark_in_Munich,,48.171494,11.569674,Munich,Bavaria,Germany
2,3,"Category:Tweed_Heads,_New_South_Wales",Ukerebagh Nature Reserve,-28.1833,153.55,Tweed Heads,New South Wales,Australia
3,14,Category:Delacorte_Theater,Delacorte Theater,40.7801,-73.968767,New York,New York,United States
4,15,Category:Tremper_Mound_and_Earthworks,Tremper Mound,38.8013,-83.0106,,Ohio,United States


### Joining the data

We attach the location information (city, country, ...) to each image.

In [5]:
train_df = train_df.merge(category_to_location_df, on='landmark_id', how='left')
train_df.head()

Unnamed: 0,id,url,landmark_id,category_name,name,lat,lon,city,state,country
0,6e158a47eb2ca3f6,https://upload.wikimedia.org/wikipedia/commons...,142820,,,,,,,
1,202cd79556f30760,http://upload.wikimedia.org/wikipedia/commons/...,104169,Category:Stirling_Castle,Stirling Castle,56.123889,-3.947778,Stirling,Scotland,United Kingdom
2,3ad87684c99c06e1,http://upload.wikimedia.org/wikipedia/commons/...,37914,,,,,,,
3,e7f70e9c61e66af3,https://upload.wikimedia.org/wikipedia/commons...,102140,,,,,,,
4,4072182eddd0100e,https://upload.wikimedia.org/wikipedia/commons...,2474,Category:River_Severn,Aylburton,51.685278,-2.543611,Forest of Dean,England,United Kingdom
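Several rows in the joined table have empty location columns because their `landmark_id` has no entry in `category_to_location.csv`. One way to quantify this before dropping rows is the `indicator` option of `merge`. The sketch below uses two tiny made-up frames (the ids and countries are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy stand-ins for train_df and category_to_location_df (made-up values).
images = pd.DataFrame({"id": ["a", "b", "c", "d"], "landmark_id": [1, 2, 3, 2]})
locations = pd.DataFrame({"landmark_id": [2, 3], "country": ["Italy", "Spain"]})

# indicator=True adds a `_merge` column: "both" or "left_only" for a left join.
merged = images.merge(locations, on="landmark_id", how="left", indicator=True)

match_counts = merged["_merge"].value_counts()
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(match_counts)
print(f"images without a known location: {n_unmatched} / {len(merged)}")
```

Rows flagged `left_only` are exactly the ones whose `country` is missing after the join.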


In [6]:
print('shape of the base dataset: ', train_df.shape)
train_df.dropna(subset=['country'], inplace=True)
print('shape of the dataset keeping only landmarks with a known country: ', train_df.shape)
train_df.head()

shape of the base dataset:  (4132914, 10)
shape of the dataset keeping only landmarks with a known country:  (1273626, 10)


Unnamed: 0,id,url,landmark_id,category_name,name,lat,lon,city,state,country
1,202cd79556f30760,http://upload.wikimedia.org/wikipedia/commons/...,104169,Category:Stirling_Castle,Stirling Castle,56.123889,-3.947778,Stirling,Scotland,United Kingdom
4,4072182eddd0100e,https://upload.wikimedia.org/wikipedia/commons...,2474,Category:River_Severn,Aylburton,51.685278,-2.543611,Forest of Dean,England,United Kingdom
7,16d8aa057cdd01b9,http://upload.wikimedia.org/wikipedia/commons/...,25719,Category:Duomo_(Monza),Monza Cathedral,45.58359,9.27567,Monza,Lombardy,Italy
12,88f3f71c2b71a6f9,https://upload.wikimedia.org/wikipedia/commons...,198623,"Category:Newark_Castle,_Nottinghamshire",,53.0775,-0.812415,Newark and Sherwood,England,United Kingdom
15,0851a257e5e872ef,https://upload.wikimedia.org/wikipedia/commons...,189446,Category:Castle_of_Peñíscola,Castillo de Peñiscola,40.3588,0.407926,Peníscola / Peñíscola,Valencian Community,Spain


Retrieving the image paths

In [7]:
import importlib
import fetch_image
importlib.reload(fetch_image)

<module 'fetch_image' from 'c:\\Users\\lebre\\OneDrive\\Bureau\\Deep Learning\\projet_dl\\fetch_image.py'>

In [8]:

chemin_images_dict = fetch_image.fetch_images(train_df['id'], dossier_base=DATA_IMAGES_PATH)
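
`fetch_image.fetch_images` is a project-specific helper whose source is not shown here. Judging by the paths that appear in the dataframe further down (for example `data/train/images/0/0/c/00c08b162f34f53f.jpg`), images are stored under three nested folders named after the first three characters of the id. A minimal sketch of such a lookup, under that assumption, could look like this. `build_image_paths` is a hypothetical name, not the project's actual function.

```python
from pathlib import Path


def build_image_paths(image_ids, base_dir):
    """Map each image id to its path on disk, skipping ids with no file.

    Assumes the layout <base_dir>/<id[0]>/<id[1]>/<id[2]>/<id>.jpg.
    """
    base = Path(base_dir)
    found = {}
    for image_id in image_ids:
        candidate = base / image_id[0] / image_id[1] / image_id[2] / f"{image_id}.jpg"
        if candidate.is_file():
            # as_posix() keeps forward slashes, like the paths shown in the notebook
            found[image_id] = candidate.as_posix()
    return found
```

Ids missing from the returned dictionary become `NaN` after `Series.map`, which is what lets `dropna(subset=['image_path'])` filter them out.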

Adding the paths to the training dataframe

In [9]:
train_df['image_path'] = train_df['id'].map(chemin_images_dict)
print('shape of the dataset keeping only landmarks with a known country: ', train_df.shape)
train_df.dropna(subset=['image_path'], inplace=True)
print('shape of the dataset keeping only landmarks with a known country and an image found on disk: ', train_df.shape)
train_df.head()

shape of the dataset keeping only landmarks with a known country:  (1273626, 11)
shape of the dataset keeping only landmarks with a known country and an image found on disk:  (15264, 11)


Unnamed: 0,id,url,landmark_id,category_name,name,lat,lon,city,state,country,image_path
172,00c08b162f34f53f,https://upload.wikimedia.org/wikipedia/commons...,163404,Category:North_Norfolk_Railway,Weybourne Yard Frame,52.9345,1.1545,North Norfolk,England,United Kingdom,data/train/images/0/0/c/00c08b162f34f53f.jpg
682,0129308917af0393,https://upload.wikimedia.org/wikipedia/commons...,20823,"Category:Westmoreland_County,_Pennsylvania",Dellview Court,40.31,-79.47,Unity Township,Pennsylvania,United States,data/train/images/0/1/2/0129308917af0393.jpg
710,00e5d77c905d94a6,https://upload.wikimedia.org/wikipedia/commons...,26066,Category:Santuário_de_Fátima,Basílica de Nossa Senhora do Rosário de Fátima,39.632427,-8.671538,Fátima,,Portugal,data/train/images/0/0/e/00e5d77c905d94a6.jpg
1180,0270b8d88aca27c4,https://upload.wikimedia.org/wikipedia/commons...,181586,Category:HMCS_Haida_(G63),HMCS Haida,43.2753,-79.8554,Hamilton,Ontario,Canada,data/train/images/0/2/7/0270b8d88aca27c4.jpg
1262,001cd787f1e9a803,https://upload.wikimedia.org/wikipedia/commons...,61937,Category:South_Horizons,HK Electric Co. Ltd. Operational HQ,22.243364,114.147564,Hong Kong Island,Hong Kong,China,data/train/images/0/0/1/001cd787f1e9a803.jpg


Exporting the resulting dataframe

In [10]:
train_df.to_csv(f'{DATA_RESULTS}/train_final.csv', index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

### Converting the images to tensors

Loading the exported dataframe

In [None]:
train_df = pd.read_csv(f'{DATA_RESULTS}/train_final.csv')

Converting the images to PyTorch tensors

In [12]:
prepro = Preprocessing()  # instance used to convert images to PyTorch tensors

Test on a single image

In [13]:
prepro.image_to_tensor('data/train/images/0/2/7/0270b8d88aca27c4.jpg')

tensor([[[[ 1.4269,  1.4954,  1.6153,  ..., -1.4158, -1.3815, -1.4158],
          [ 1.5297,  1.5639,  1.6495,  ..., -1.4329, -1.4329, -1.4672],
          [ 1.7009,  1.7352,  1.7694,  ..., -1.4329, -1.4672, -1.5014],
          ...,
          [ 1.7523,  1.7523,  1.7523,  ..., -1.1932, -1.2274, -1.2274],
          [ 1.7694,  1.7694,  1.7865,  ..., -1.1760, -1.2959, -1.2445],
          [ 1.7180,  1.7523,  1.8037,  ..., -0.8507, -1.2274, -1.2788]],

         [[ 2.1310,  2.1134,  2.1660,  ..., -1.2304, -1.1954, -1.2304],
          [ 2.1310,  2.1485,  2.1660,  ..., -1.2479, -1.2479, -1.2829],
          [ 2.1660,  2.2010,  2.1835,  ..., -1.2479, -1.2829, -1.3179],
          ...,
          [ 1.9209,  1.9034,  1.9034,  ..., -1.2129, -1.2129, -1.1954],
          [ 1.9384,  1.9384,  1.9559,  ..., -1.2129, -1.2654, -1.1954],
          [ 1.8859,  1.9209,  1.9734,  ..., -0.8978, -1.2304, -1.2479]],

         [[ 2.6051,  2.5877,  2.6051,  ..., -0.6715, -0.6367, -0.6715],
          [ 2.6226,  2.6226,  

Stacking the tensors, targets and label mapping to build the PyTorch dataset

In [14]:
tensors, labels, label_mapping = prepro.store_image_tensors(train_df, image_path_column='image_path', label_column='country', batch_size=None)

# print('label_mapping : ', label_mapping)
# print('labels : ', labels)
# print('tensors : ', tensors)
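
`store_image_tensors` is also part of the local `Preprocessing` class. As a rough illustration of what the returned `labels` and `label_mapping` may contain, the sketch below encodes country names as integer class indices. The alphabetical ordering and the name-to-index direction of the mapping are assumptions.

```python
import torch

toy_countries = ["United Kingdom", "Italy", "United Kingdom", "Spain"]  # toy labels

# Assumed convention: sorted unique names -> consecutive integer ids.
toy_mapping = {name: index for index, name in enumerate(sorted(set(toy_countries)))}
toy_labels = torch.tensor([toy_mapping[name] for name in toy_countries], dtype=torch.long)

print(toy_mapping)
print(toy_labels)
```

Integer targets of dtype `long` are the format expected by `torch.nn.CrossEntropyLoss` for class-index labels.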


Saving the country labels as JSON

In [None]:
with open(f"{DATA_RESULTS}/country_labels.json", "w") as f:
    json.dump(label_mapping, f)
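
One detail to keep in mind when reloading this file: JSON object keys are always strings. If `label_mapping` is keyed by integer class index (an assumption, since its exact structure is not shown), the keys come back as strings and need converting. A small round-trip sketch with a made-up mapping:

```python
import json

index_to_country = {0: "France", 1: "Italy"}  # made-up mapping for illustration

encoded = json.dumps(index_to_country)
decoded = json.loads(encoded)  # integer keys have been turned into strings
restored = {int(key): value for key, value in decoded.items()}

print(decoded)
print(restored)
```

If the mapping goes from country name to index instead, no conversion is needed, since the keys are already strings.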

Creating the PyTorch dataset

In [19]:
dataset = TensorDataset(tensors, labels)
torch.save(dataset, f"{DATA_RESULTS}/train_subset.pt")  # save the dataset
# dataset = torch.load(f"{DATA_RESULTS}/train_subset.pt", weights_only=False)
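
Because every image is materialized as a tensor before saving, the size of `train_subset.pt` grows linearly with the number of images. A back-of-the-envelope estimate, assuming `float32` tensors (PyTorch's default floating-point dtype) and the 15264 images of shape `[1, 3, 224, 224]` reported in this notebook:

```python
n_images = 15264                      # rows left after filtering
values_per_image = 1 * 3 * 224 * 224  # elements in one image tensor
bytes_per_value = 4                   # float32, assumed

total_bytes = n_images * values_per_image * bytes_per_value
print(f"{total_bytes / 1e9:.2f} GB")
```

Under these assumptions the image tensors alone take roughly 9.2 GB, both in memory and on disk.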

In [20]:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

Let's inspect the first image, then the batches

In [21]:
image, label = dataset[0]
print(f"Image shape: {image.shape}")
print(f"Label: {label.item()}")

Image shape: torch.Size([1, 3, 224, 224])
Label: 80
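
Each stored image keeps a leading dimension of size 1, so batches drawn from the `DataLoader` have five dimensions (`[32, 1, 3, 224, 224]`), whereas 2D convolutional layers expect `[batch, channels, height, width]`. A minimal sketch of removing that extra dimension before building the dataset, using random tensors in place of the real images:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the stacked image tensors: 8 images, each with an extra dim of size 1
stacked = torch.randn(8, 1, 3, 224, 224)
targets = torch.randint(0, 5, (8,))

# squeeze(1) drops only dimension 1, and only because its size is 1
flat_dataset = TensorDataset(stacked.squeeze(1), targets)
loader = DataLoader(flat_dataset, batch_size=4, shuffle=True)

images, batch_targets = next(iter(loader))
print(images.shape)
```

The same `squeeze(1)` could instead be applied to each batch inside the training loop if the saved dataset should stay unchanged.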


In [22]:
for batch_images, batch_labels in dataloader:
    print(f"image batch shape: {batch_images.shape}")
    print(f"label batch shape: {batch_labels.shape}")
    print(f"labels in this batch: {batch_labels.tolist()}")
    print(f"label distribution: {Counter(batch_labels.tolist())}")

image batch shape: torch.Size([32, 1, 3, 224, 224])
label batch shape: torch.Size([32])
labels in this batch: [27, 27, 81, 33, 81, 81, 13, 27, 27, 57, 3, 13, 3, 81, 25, 81, 48, 25, 1, 81, 70, 80, 48, 77, 25, 57, 80, 29, 25, 81, 68, 80]
label distribution: Counter({81: 7, 27: 4, 25: 4, 80: 3, 13: 2, 57: 2, 3: 2, 48: 2, 33: 1, 1: 1, 70: 1, 77: 1, 29: 1, 68: 1})
image batch shape: torch.Size([32, 1, 3, 224, 224])
label batch shape: torch.Size([32])
labels in this batch: [43, 68, 27, 80, 15, 80, 34, 56, 29, 81, 27, 25, 33, 68, 81, 71, 27, 3, 19, 20, 68, 22, 6, 81, 81, 25, 9, 34, 81, 25, 25, 81]
label distribution: Counter({81: 6, 25: 4, 68: 3, 27: 3, 80: 2, 34: 2, 43: 1, 15: 1, 56: 1, 29: 1, 33: 1, 71: 1, 3: 1, 19: 1, 20: 1, 22: 1, 6: 1, 9: 1})
image batch shape: torch.Size([32, 1, 3, 224, 224])
label batch shape: torch.Size([32])
labels in this batch: [58, 71, 72, 19, 0, 50, 27, 31, 25, 13, 53, 33, 25, 13, 21, 27, 81, 80, 68, 80, 25, 34,
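
The loop above prints every batch in the loader. To look at a single batch, or to get the class distribution over the whole dataset rather than batch by batch, something like the following can be used. The toy labels are made up for the example.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset

toy_images = torch.zeros(10, 3, 8, 8)
toy_labels = torch.tensor([81, 81, 27, 25, 81, 80, 27, 25, 81, 3])
toy_loader = DataLoader(TensorDataset(toy_images, toy_labels), batch_size=4, shuffle=True)

# Inspect one batch only instead of iterating over the whole loader
first_images, first_labels = next(iter(toy_loader))
print(first_images.shape, first_labels.tolist())

# Class distribution over the full dataset, most frequent classes first
full_distribution = Counter(toy_labels.tolist())
print(full_distribution.most_common(3))
```

A global count of this kind makes class imbalance visible at a glance, which is harder to judge from shuffled batches of 32.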