<a href="https://colab.research.google.com/github/joanby/tensorflow2/blob/master/Collab%2010%20-%20Pre%20procesado%20de%20datos%20con%20TensorFlow%20Transform%20(TFT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Install the dependencies and set up the environment

In [1]:
!pip install -q --upgrade tensorflow-transform

In [None]:
import os
# Kill the kernel process (signal 9) so Colab restarts the runtime with the upgraded packages.
os.kill(os.getpid(), 9)

## Step 2: Import the project dependencies

In [10]:
import tempfile

import pandas as pd
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

## Step 3: Data preprocessing

### Load the pollution dataset

In [3]:
dataset = pd.read_csv("https://raw.githubusercontent.com/joanby/tensorflow2/master/datasets/pollution-small.csv")

In [4]:
dataset.head()

Unnamed: 0,Date,pm10,no2,so2,soot
0,1/1/2009,98.67,14.1,44.38,34.81
1,1/2/2009,52.33,14.1,29.75,33.06
2,1/3/2009,74.67,20.5,36.25,39.25
3,1/4/2009,72.0,17.3,46.44,34.38
4,1/5/2009,81.0,25.64,56.56,45.59


### Drop the date column

In [5]:
features = dataset.drop("Date", axis=1)

In [6]:
features.head()

Unnamed: 0,pm10,no2,so2,soot
0,98.67,14.1,44.38,34.81
1,52.33,14.1,29.75,33.06
2,74.67,20.5,36.25,39.25
3,72.0,17.3,46.44,34.38
4,81.0,25.64,56.56,45.59


### Convert the dataset from a DataFrame to a list of Python dictionaries

In [7]:
dict_features = list(features.to_dict("index").values())

In [8]:
dict_features[:2]

[{'no2': 14.1, 'pm10': 98.67, 'so2': 44.38, 'soot': 34.81},
 {'no2': 14.1, 'pm10': 52.33, 'so2': 29.75, 'soot': 33.06}]
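
To make the target shape concrete without pandas, here is a minimal sketch that builds the same kind of list of dictionaries with the standard library only. The inline CSV text reproduces just the first two rows shown above, and the names `SAMPLE_CSV` and `rows_to_feature_dicts` are illustrative assumptions, not part of the notebook.

```python
import csv
import io

# First two rows of the pollution dataset, copied from the output above.
SAMPLE_CSV = """Date,pm10,no2,so2,soot
1/1/2009,98.67,14.1,44.38,34.81
1/2/2009,52.33,14.1,29.75,33.06
"""


def rows_to_feature_dicts(csv_text, drop=("Date",)):
    """Return one {column: float} dictionary per CSV row, skipping `drop` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: float(value) for name, value in row.items() if name not in drop}
        for row in reader
    ]


sample_features = rows_to_feature_dicts(SAMPLE_CSV)
print(sample_features)
```

Each element is a plain dictionary keyed by column name, which is the in-memory format the rest of the notebook hands to TensorFlow Transform.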

### Define the dataset metadata

In [11]:
data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        "no2": tf.io.FixedLenFeature([], tf.float32),
        "so2": tf.io.FixedLenFeature([], tf.float32),
        "pm10": tf.io.FixedLenFeature([], tf.float32),
        "soot": tf.io.FixedLenFeature([], tf.float32),
    })
)

In [12]:
data_metadata

{'_schema': feature {
  name: "no2"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "pm10"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "so2"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "soot"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
}
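
The feature spec above declares four scalar `float32` features that must be present in every example. A quick pure-Python sanity check of that contract might look like the sketch below. `EXPECTED_FEATURES` and `check_rows` are hypothetical helpers written for illustration; they are not part of TensorFlow Transform.

```python
# Feature names taken from the feature spec defined above.
EXPECTED_FEATURES = ("no2", "so2", "pm10", "soot")


def check_rows(rows, expected=EXPECTED_FEATURES):
    """Raise ValueError if a row lacks a feature or holds a non-numeric value."""
    for index, row in enumerate(rows):
        missing = [name for name in expected if name not in row]
        if missing:
            raise ValueError(f"row {index} is missing {missing}")
        for name in expected:
            value = row[name]
            # bool is a subclass of int, so it is excluded explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"row {index}: {name!r} is not numeric")
    return True


sample_rows = [
    {"no2": 14.1, "pm10": 98.67, "so2": 44.38, "soot": 34.81},
    {"no2": 14.1, "pm10": 52.33, "so2": 29.75, "soot": 33.06},
]
print(check_rows(sample_rows))
```

Running a check like this before starting the pipeline tends to surface missing columns earlier than a failure inside Beam would.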

## Step 4: The preprocessing function

In [13]:
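def preprocessing_fn(inputs):
    """Center no2 and so2 on their means; rescale pm10 and soot to [0, 1]."""
    no2 = inputs["no2"]
    pm10 = inputs["pm10"]
    so2 = inputs["so2"]
    soot = inputs["soot"]

    # tft.mean is computed over the whole dataset, so these columns end up
    # centered on zero while their scale is left unchanged.
    no2_normalized = no2 - tft.mean(no2)
    so2_normalized = so2 - tft.mean(so2)

    # Min-max scaling: with default arguments both calls map onto [0, 1].
    pm10_normalized = tft.scale_to_0_1(pm10)
    soot_normalized = tft.scale_by_min_max(soot)

    return {
        "no2_normalized": no2_normalized,
        "so2_normalized": so2_normalized,
        "pm10_normalized": pm10_normalized,
        "soot_normalized": soot_normalized,
    }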
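
To build intuition for what the two kinds of transformation compute, here is a rough pure-Python sketch of mean centering and min-max scaling. It is not how TensorFlow Transform implements them; `center_on_mean` and `scale_min_max` are illustrative names, and the sketch assumes each column has at least two distinct values. It uses only the five rows shown by `head()`, so its numbers will differ from those produced over the full dataset.

```python
def center_on_mean(values):
    """Subtract the column mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def scale_min_max(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption of this sketch: a constant column is rejected outright.
        raise ValueError("need at least two distinct values")
    return [(v - low) / (high - low) for v in values]


# The five rows displayed by head() earlier in the notebook.
pm10 = [98.67, 52.33, 74.67, 72.0, 81.0]
no2 = [14.1, 14.1, 20.5, 17.3, 25.64]

pm10_scaled = scale_min_max(pm10)
no2_centered = center_on_mean(no2)
print(pm10_scaled)
print(no2_centered)
```

The key point is that both operations need a statistic of the entire column (mean, minimum, maximum) before any single row can be transformed, which is why the library separates an analysis pass from the transform pass.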

## Step 5: Putting it all together

TensorFlow Transform uses **Apache Beam** behind the scenes to run scalable data transformations. In this function we will use the direct runner.

Arguments for the runner:

    dict_features - Our dataset converted to a list of Python dictionaries.
    data_metadata - The metadata we created for our dataset.
    preprocessing_fn - The main preprocessing function. It is called to apply the preprocessing operations column by column.


The catch is that Apache Beam's syntax is somewhat unusual. It uses the `|` (*pipe*) operator to chain operations and apply transforms to our data.

```
result = data_to_pass | where_to_pass_the_data
```

In our case that would be something like:

**result**  -> `transformed_dataset, transform_fn`

**data_to_pass** -> `(dict_features, data_metadata)`

**where_to_pass_the_data** -> `tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)`

```
transformed_dataset, transform_fn = ((dict_features, data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

```
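
The pipe form works because Python lets the right-hand operand of `|` handle the operation through the `__ror__` method when the left-hand operand (here a plain tuple) does not support `|` itself. The toy sketch below illustrates only that mechanism; `ToyTransform` and `describe` are made-up names, and real Beam transforms do far more than call a function.

```python
class ToyTransform:
    """Stand-in for a Beam transform: wraps a function that is applied via `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, left):
        # Python calls this for `left | self` when `left` has no usable __or__,
        # which is the case for a tuple.
        return self.fn(left)


def describe(pair):
    """Unpack a (data, metadata) pair, mirroring the shape used in the notebook."""
    data, metadata = pair
    return f"{len(data)} rows with schema {metadata}"


result = ([{"no2": 14.1}, {"no2": 20.5}], "toy-schema") | ToyTransform(describe)
print(result)  # 2 rows with schema toy-schema
```

Read this way, `(dict_features, data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)` is just "hand this pair to that transform and return whatever it produces".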

If you want to learn more about this syntax, I recommend the following link:
https://beam.apache.org/documentation/programming-guide/#applying-transforms

LINKS:
> More about Apache Beam: https://beam.apache.org/

In [14]:
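def data_transform():
    # The Beam context needs a scratch directory for intermediate artifacts.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Analyze the full dataset, then apply preprocessing_fn to it.
        transformed_dataset, transform_fn = (
            (dict_features, data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )

    transformed_data, transformed_metadata = transformed_dataset

    # Print each raw row next to its transformed counterpart.
    for raw_row, transformed_row in zip(dict_features, transformed_data):
        print("Raw: ", raw_row)
        print("Transformed:", transformed_row)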

In [15]:
data_transform()

Raw:  {'pm10': 98.67, 'no2': 14.1, 'so2': 44.38, 'soot': 34.81}
Transformed: {'no2_normalized': -18.577978, 'pm10_normalized': 0.34071696, 'so2_normalized': 28.855408, 'soot_normalized': 0.2834235}
Raw:  {'pm10': 52.33, 'no2': 14.1, 'so2': 29.75, 'soot': 33.06}
Transformed: {'no2_normalized': -18.577978, 'pm10_normalized': 0.16963857, 'so2_normalized': 14.225407, 'soot_normalized': 0.26620758}
Raw:  {'pm10': 74.67, 'no2': 20.5, 'so2': 36.25, 'soot': 39.25}
Transformed: {'no2_normalized': -12.1779785, 'pm10_normalized': 0.25211355, 'so2_normalized': 20.725407, 'soot_normalized': 0.32710278}
Raw:  {'pm10': 72.0, 'no2': 17.3, 'so2': 46.44, 'soot': 34.38}
Transformed: {'no2_normalized': -15.377979, 'pm10_normalized': 0.24225645, 'so2_normalized': 30.915405, 'soot_normalized': 0.2791933}
Raw:  {'pm10': 81.0, 'no2': 25.64, 'so2': 56.56, 'soot': 45.59}
Transformed: {'no2_normalized': -7.037979, 'pm10_normalized': 0.2754827, 'so2_normalized': 41.035408, 'soot_normalized': 0.38947365}
Raw:  {'p