# Data cleaning

Images obtained from the following [link](https://www.dropbox.com/s/k2i3p7puuwl2g59/Haze1k.zip?e=1&dl=0).

In [None]:
!unzip "/content/Haze1k.zip"  # original dataset

## Remove corrupted images

In [None]:
import os
from PIL import Image

def validate_images(directory):
    corrupted_files = []

    # Walk through directory and sub-directories
    for dirpath, _, filenames in os.walk(directory):
        print(f"Scanning directory: {dirpath}")

        for image_file in filenames:
            # Check for common image extensions
            if image_file.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
                image_path = os.path.join(dirpath, image_file)
                try:
                    with Image.open(image_path) as img:
                        img.verify()
                except Exception as e:
                    corrupted_files.append(image_path)
                    print(f"Error with {image_path}: {e}")

    return corrupted_files



The script above was obtained from the following [link](https://drlee.io/identifying-corrupted-images-before-feeding-them-into-a-cnn-13397844ef3c).

In [None]:
directory = "/content/Haze1k"
corrupted_images = validate_images(directory)
if corrupted_images:
    print(f"Found {len(corrupted_images)} corrupted images.")
else:
    print("All images are valid!")

Scanning directory: /content/Haze1k
Scanning directory: /content/Haze1k/Haze1k_thin
Scanning directory: /content/Haze1k/Haze1k_thin/SAR
Scanning directory: /content/Haze1k/Haze1k_thin/SAR/train
Scanning directory: /content/Haze1k/Haze1k_thin/SAR/val
Scanning directory: /content/Haze1k/Haze1k_thin/SAR/test
Scanning directory: /content/Haze1k/Haze1k_thin/dataset
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/train
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/train/input
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/train/target
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/valid
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/valid/input
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/valid/target
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test/input
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test/target
Scanning directory: /content/Haze1k/Haze1k_

In [None]:
corrupted_images

['/content/Haze1k/Haze1k_moderate/dataset/train/target/265.png',
 '/content/Haze1k/Haze1k_moderate/dataset/train/target/271.png']

In [None]:
os.remove('/content/Haze1k/Haze1k_moderate/dataset/train/input/265.png')
os.remove('/content/Haze1k/Haze1k_moderate/dataset/train/target/265.png')
os.remove('/content/Haze1k/Haze1k_moderate/dataset/train/input/271.png')
os.remove('/content/Haze1k/Haze1k_moderate/dataset/train/target/271.png')
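
The four deletions above can also be derived from the `corrupted_images` list instead of being typed by hand. The following is only a sketch under one assumption: each image and its counterpart share the same filename in sibling `input` and `target` folders. The helper names `paired_paths` and `remove_pairs` are hypothetical and not part of the original notebook.

```python
import os

def paired_paths(image_path):
    """Return the (input, target) paths for an image found in either folder."""
    parent, filename = os.path.split(image_path)
    split_dir = os.path.dirname(parent)  # e.g. .../dataset/train
    return (os.path.join(split_dir, "input", filename),
            os.path.join(split_dir, "target", filename))

def remove_pairs(corrupted_paths):
    """Delete each corrupted image together with its counterpart."""
    removed = []
    for path in corrupted_paths:
        for candidate in paired_paths(path):
            # Only delete files that are actually present
            if os.path.exists(candidate):
                os.remove(candidate)
                removed.append(candidate)
    return removed
```

Calling `remove_pairs(corrupted_images)` would then remove both halves of every pair reported by `validate_images`.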

## Remove SAR images

These images are not used at any point.

In [None]:
import shutil
shutil.rmtree('/content/Haze1k/Haze1k_moderate/SAR')
shutil.rmtree('/content/Haze1k/Haze1k_thick/SAR')
shutil.rmtree('/content/Haze1k/Haze1k_thin/SAR')
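
As an alternative to listing each subset by hand, the same cleanup can be written as a loop. This is a minimal sketch that assumes every subset folder sits directly under the dataset root and holds its SAR images in a subfolder named `SAR`. The function name `remove_sar_dirs` is hypothetical.

```python
import os
import shutil

def remove_sar_dirs(root):
    """Delete the SAR folder of every subset directly under root."""
    removed = []
    for subset in sorted(os.listdir(root)):
        sar_dir = os.path.join(root, subset, "SAR")
        # Entries without a SAR folder (or plain files) are skipped
        if os.path.isdir(sar_dir):
            shutil.rmtree(sar_dir)
            removed.append(sar_dir)
    return removed
```

With the layout shown earlier, `remove_sar_dirs("/content/Haze1k")` should cover the three subsets in one call.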

## Rename images

Keep only the number in each image's filename. For example, an image named `215-input.png` becomes `215.png`.

In [None]:
import os

def validate_name(directory):
    for dirpath, _, filenames in os.walk(directory):
        print(f"Scanning directory: {dirpath}")

        for image_file in filenames:
            if image_file.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
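                # Split into stem and extension, e.g. ('215-input', '.png')
                stem, extension = os.path.splitext(image_file)
                digits = ''.join(ch for ch in stem if ch.isdigit())
                new_name = digits + extension

                # Skip names with no digits or that are already normalized
                if not digits or new_name == image_file:
                    continue
                os.rename(os.path.join(dirpath, image_file),
                          os.path.join(dirpath, new_name))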

In [None]:
directory = "/content/Haze1k"
validate_name(directory)

Scanning directory: /content/Haze1k
Scanning directory: /content/Haze1k/Haze1k_moderate
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/test
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/test/target
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/test/input
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/train
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/train/target
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/train/input
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/val
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/val/target
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/val/input
Scanning directory: /content/Haze1k/Haze1k_thin
Scanning directory: /content/Haze1k/Haze1k_thin/dataset
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test/ta
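
The renaming rule can be isolated in a small pure function, which makes it easy to check on a few names before touching any files. This is a sketch only: `numeric_name` is a hypothetical helper, and returning `None` for names without digits is an assumption about the desired behavior.

```python
import os

def numeric_name(filename):
    """Reduce a filename to its digits plus the original extension.

    Returns None when the stem contains no digits.
    """
    stem, extension = os.path.splitext(filename)
    digits = "".join(ch for ch in stem if ch.isdigit())
    return digits + extension if digits else None

print(numeric_name("215-input.png"))  # 215.png
```

Note that every digit in the stem is kept, so a name with several digit groups such as `1-2.png` would collapse to `12.png`.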

## Remove unpaired images

Images without a matching counterpart, whether in `input` or in `target`, are deleted.

In [None]:
for dirpath, _, filenames in os.walk(directory):
    print(f"Scanning directory: {dirpath}")
    if os.path.basename(dirpath) != 'input':
        continue

    # The matching 'target' folder sits next to each 'input' folder
    dirpath_target = os.path.join(os.path.dirname(dirpath), 'target')
    filenames_target = os.listdir(dirpath_target)

    # Delete targets that have no input
    for filename in filenames_target:
        if filename not in filenames:
            os.remove(os.path.join(dirpath_target, filename))
            print(f"Deleted: {os.path.join(dirpath_target, filename)}")

    # Delete inputs that have no target
    for filename in filenames:
        if filename not in filenames_target:
            os.remove(os.path.join(dirpath, filename))
            print(f"Deleted: {os.path.join(dirpath, filename)}")

Scanning directory: /content/Haze1k
Scanning directory: /content/Haze1k/Haze1k_moderate
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/test
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/test/target
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/test/input
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/train
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/train/target
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/train/input
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/val
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/val/target
Scanning directory: /content/Haze1k/Haze1k_moderate/dataset/val/input
Scanning directory: /content/Haze1k/Haze1k_thin
Scanning directory: /content/Haze1k/Haze1k_thin/dataset
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test
Scanning directory: /content/Haze1k/Haze1k_thin/dataset/test/ta
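
Before deleting anything, it may help to list the unpaired files first. The sketch below does the comparison with set differences and removes nothing. It assumes that paired images share the same filename, and the function name `unpaired_files` is hypothetical.

```python
import os

def unpaired_files(input_dir, target_dir):
    """Return (inputs without a target, targets without an input)."""
    inputs = set(os.listdir(input_dir))
    targets = set(os.listdir(target_dir))
    return sorted(inputs - targets), sorted(targets - inputs)
```

Running it on each `input`/`target` pair of folders gives a dry-run report that can be reviewed before the deletion cell is executed.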

## Compress into a `zip` file

This archive is used throughout the rest of the procedure.

In [None]:
!zip -r '/Haze1k.zip' '/content/Haze1k'

  adding: content/Haze1k/ (stored 0%)
  adding: content/Haze1k/.DS_Store (deflated 90%)
  adding: content/Haze1k/Haze1k_moderate/ (stored 0%)
  adding: content/Haze1k/Haze1k_moderate/.DS_Store (deflated 92%)
  adding: content/Haze1k/Haze1k_moderate/dataset/ (stored 0%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/ (stored 0%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/ (stored 0%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/27.png (deflated 1%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/25.png (deflated 1%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/22.png (deflated 1%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/43.png (deflated 1%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/4.png (deflated 1%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/20.png (deflated 1%)
  adding: content/Haze1k/Haze1k_moderate/dataset/test/target/44.png (deflated 1%)
  adding: conten
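
The log above shows that `.DS_Store` files end up in the archive. If that is not wanted, the archive can be built from Python instead. This is a sketch under stated assumptions, not the notebook's method: `zip_dataset` is a hypothetical helper, entries are stored relative to the parent of the dataset folder (`Haze1k/...` rather than `content/Haze1k/...`), empty folders are not stored, and the output path must lie outside the folder being archived.

```python
import os
import zipfile

def zip_dataset(source_dir, zip_path, skip=(".DS_Store",)):
    """Archive source_dir into zip_path, leaving out the filenames in skip."""
    base = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _, filenames in os.walk(source_dir):
            for filename in sorted(filenames):
                if filename in skip:
                    continue
                full_path = os.path.join(dirpath, filename)
                # Store the entry relative to the dataset's parent folder
                archive.write(full_path, os.path.relpath(full_path, base))
    return zip_path
```

For example, `zip_dataset("/content/Haze1k", "/Haze1k.zip")` would write the cleaned dataset without the macOS metadata files.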