# Preparation for the TF MDS

## Data preparation
### Creating a very small dataset (toy dataset) to ease model design

Dataset: https://www.yelp.com/dataset/documentation/main

Extract the photos from the RAR into /dataset/photos.

Requires running the 00_EDA notebook first.

In [1]:
import pandas as pd
from PIL import Image
import math
import numpy as np
import pickle
from tqdm.notebook import trange, tqdm
import torchvision.transforms as T
import matplotlib.pyplot as plt



In [2]:
SUBSET_SIZE = 100
PHOTO_SIZE = 224.0 # For use with a 224x224 transformer
FOLDER = 'photos_transformer' # output folder, inside /dataset

In [3]:
# load the list of photos with dimensions > 224 x 224
with open('checkpoints/df5.pkl', 'rb') as f:  # the file is closed once loaded
    photo_data = pickle.load(f)

We will create a sub-dataset with SUBSET_SIZE photos per label.

In [4]:
photo_data.head(2)

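| Unnamed: 0 | photo_id | label | x_dim | y_dim | z_channels | pixels | drink | food | inside | menu | outside |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | --0h6FMC0V8aMtKQylojEg | inside | 400.0 | 300.0 | 3.0 | 120000.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | --3JQ4MlO-jHT9xbo7liug | food | 400.0 | 400.0 | 3.0 | 160000.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |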
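
The `drink`, `food`, `inside`, `menu` and `outside` columns look like a one-hot encoding of `label`. How the 00_EDA notebook builds them is not shown here, so the following is only a minimal sketch of one way to derive such columns with `pd.get_dummies`. The two-row frame `toy` and the name `one_hot` are hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame standing in for photo_data.
toy = pd.DataFrame({"photo_id": ["a", "b"], "label": ["inside", "food"]})

# One float column per distinct label: 1.0 where the row has that label, else 0.0.
one_hot = pd.get_dummies(toy["label"], dtype=float)

# Attach the indicator columns next to the original ones.
toy = pd.concat([toy, one_hot], axis=1)
```

Only the labels present in the data get a column, so on this toy frame the result has `food` and `inside` but no `drink`, `menu` or `outside`.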


In [5]:
total_photos = photo_data.label.value_counts()
total_photos

food       106262
inside      55214
outside     18189
drink       15412
menu         1583
Name: label, dtype: int64

In [6]:
df_subset = pd.DataFrame([], columns=['photo_id',
                                        'label',
                                        'x_dim', 
                                        'y_dim', 
                                        'z_channels', 
                                        'pixels', 
                                        'drink',
                                        'food',
                                        'inside',
                                        'menu',
                                        'outside'])

In [7]:
for label in total_photos.index:
    if total_photos[label] > SUBSET_SIZE: # take sample of photos
        df_subset = pd.concat([df_subset,
                               photo_data.loc[photo_data.label == label].sample(n=SUBSET_SIZE)])
    else: # keep all photos
        df_subset = pd.concat([df_subset,
                               photo_data.loc[photo_data.label == label]])
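
The loop above calls `sample` without a seed, so it draws a different subset on every run. The following is a minimal sketch of the same per-label sampling made reproducible through `random_state`. The helper `sample_per_label`, its `seed` argument and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def sample_per_label(df, n, seed=0):
    """Take up to n rows per label; labels with fewer than n rows are kept whole."""
    parts = []
    for _, group in df.groupby("label"):
        # min() reproduces the "keep all photos" branch for small labels.
        parts.append(group.sample(n=min(len(group), n), random_state=seed))
    return pd.concat(parts)

# Hypothetical toy data: 6 food, 2 menu and 2 drink photos.
toy = pd.DataFrame({
    "photo_id": [f"id{i}" for i in range(10)],
    "label": ["food"] * 6 + ["menu"] * 2 + ["drink"] * 2,
})

subset = sample_per_label(toy, n=3)
counts = subset.label.value_counts().to_dict()
```

With a fixed seed, a second call on the same frame returns the same `photo_id` values, which makes the checkpoint easier to regenerate.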

In [8]:
len(df_subset)

500

In [9]:
df_subset.label.value_counts()

food       100
inside     100
outside    100
drink      100
menu       100
Name: label, dtype: int64

In [10]:
with open("checkpoints/df_subset.pkl", "wb") as f:  # the file is closed after writing
    pickle.dump(df_subset, f)

In [11]:
# checkpoint
with open("checkpoints/df_subset.pkl", "rb") as f:
    df_subset = pickle.load(f)

## Scaling, cropping and organizing the subset photos
The transformer [requires](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification#provide-them-as-folders) the images to be PNG files organized in folders named after their label.

In [12]:
# clean the output folders
# If the folders are empty this raises an error (ignore it in *that* case)
! rm dataset/photos_transformer/drink/*
! rm dataset/photos_transformer/food/*
! rm dataset/photos_transformer/inside/*
! rm dataset/photos_transformer/menu/*
! rm dataset/photos_transformer/outside/*

rm: cannot remove 'dataset/photos_transformer/drink/*': No such file or directory
rm: cannot remove 'dataset/photos_transformer/food/*': No such file or directory
rm: cannot remove 'dataset/photos_transformer/inside/*': No such file or directory
rm: cannot remove 'dataset/photos_transformer/menu/*': No such file or directory
rm: cannot remove 'dataset/photos_transformer/outside/*': No such file or directory
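
The `rm` calls fail when a folder is empty or missing. As an alternative, the following is a minimal sketch that uses `pathlib` to create each label folder if needed and then empty it, without raising in either case. The helper `reset_label_folders` and the temporary demo directory are hypothetical. In the notebook the root would presumably be `dataset/photos_transformer`.

```python
from pathlib import Path
import tempfile

LABELS = ["drink", "food", "inside", "menu", "outside"]

def reset_label_folders(root, labels):
    """Create one folder per label under root and delete any files already in it."""
    for label in labels:
        folder = Path(root) / label
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        for item in list(folder.iterdir()):
            if item.is_file():
                item.unlink()

# Demo on a temporary directory so that no real data is touched.
demo_root = Path(tempfile.mkdtemp()) / "photos_transformer"
reset_label_folders(demo_root, LABELS)                   # folders are created
(demo_root / "food" / "old.png").write_bytes(b"stale")   # simulate a leftover file
reset_label_folders(demo_root, LABELS)                   # leftover file is removed
remaining = sorted(p.name for p in demo_root.iterdir())
```

Creating the folders up front also means that a later `save` into `dataset/<FOLDER>/<label>/` does not depend on the folders having been made by hand.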


In [13]:
for img in tqdm(range(len(df_subset)), desc='Photos cropped', miniters=20):
    row = df_subset.iloc[img]  # look the row up once per photo
    im = Image.open('dataset/photos/' + row.photo_id + '.jpg')

    # resize smallest dimension to PHOTO_SIZE, keeping the aspect ratio
    if row.y_dim < row.x_dim: # narrow image
        width = int(PHOTO_SIZE)
        height = math.floor(PHOTO_SIZE * row.x_dim / row.y_dim)
    else: # wide (or square) image
        width = math.floor(PHOTO_SIZE * row.y_dim / row.x_dim)
        height = int(PHOTO_SIZE)

    resized = T.Resize((height, width))(im)
    # keep the central PHOTO_SIZE x PHOTO_SIZE square
    cropped = T.CenterCrop(size=int(PHOTO_SIZE))(resized)

    cropped.save('dataset/{}/{}/{}.png'.format(FOLDER, row.label, row.photo_id))

Photos cropped:   0%|          | 0/500 [00:00<?, ?it/s]
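
The size arithmetic in the loop can be checked in isolation. The following is a minimal sketch that copies the branch logic into a pure function and evaluates it on a few dimension pairs. The name `resized_dims` is hypothetical. Like the loop, it pairs `x_dim` with the height passed to `T.Resize`; whether `x_dim` really holds the image height depends on how 00_EDA recorded it, which is not shown here.

```python
import math

def resized_dims(x_dim, y_dim, target=224.0):
    """Return (height, width) so that the smaller side equals target.

    Mirrors the notebook loop: x_dim is scaled into height, y_dim into width.
    """
    if y_dim < x_dim:  # narrow image
        width = int(target)
        height = math.floor(target * x_dim / y_dim)
    else:  # wide or square image
        width = math.floor(target * y_dim / x_dim)
        height = int(target)
    return height, width

print(resized_dims(400.0, 300.0))  # (298, 224)
print(resized_dims(400.0, 400.0))  # (224, 224)
print(resized_dims(300.0, 400.0))  # (224, 298)
```

In every case the smaller side ends up at exactly 224, so the following center crop to 224 x 224 only trims the longer side.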