<a href="https://colab.research.google.com/github/pereisergio/Creating-Pipelines-for-Machine-Learning/blob/main/Pr%C3%A9_processamento_com_TensorFlow_Transform_(TFT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Installing the libraries

In [None]:
!pip install tensorflow-transform

## Step 2: Importing the libraries

In [2]:
import tempfile

import pandas as pd
import tensorflow as tf
import tensorflow_transform as tft
# Public Beam entry point; exposes Context and AnalyzeAndTransformDataset.
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Print the versions in use: tensorflow_transform first, then TensorFlow.
print(tft.__version__)
print(tf.__version__)

1.15.0
2.15.1


## Step 3: Preprocessing

### Loading the dataset

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/TensorFlow 2.0/assets/pollution-small.csv')

In [5]:
dataset.head()

|   | Date | pm10 | no2 | so2 | soot |
|---|---|---|---|---|---|
| 0 | 1/1/2009 | 98.67 | 14.1 | 44.38 | 34.81 |
| 1 | 1/2/2009 | 52.33 | 14.1 | 29.75 | 33.06 |
| 2 | 1/3/2009 | 74.67 | 20.5 | 36.25 | 39.25 |
| 3 | 1/4/2009 | 72.0 | 17.3 | 46.44 | 34.38 |
| 4 | 1/5/2009 | 81.0 | 25.64 | 56.56 | 45.59 |


In [6]:
features = dataset.drop('Date', axis=1)
features.head()

|   | pm10 | no2 | so2 | soot |
|---|---|---|---|---|
| 0 | 98.67 | 14.1 | 44.38 | 34.81 |
| 1 | 52.33 | 14.1 | 29.75 | 33.06 |
| 2 | 74.67 | 20.5 | 36.25 | 39.25 |
| 3 | 72.0 | 17.3 | 46.44 | 34.38 |
| 4 | 81.0 | 25.64 | 56.56 | 45.59 |


### Converting the dataset (DataFrame) to a list of dictionaries

In [7]:
dict_features = list(features.to_dict('index').values())
dict_features[:2]

[{'pm10': 98.67, 'no2': 14.1, 'so2': 44.38, 'soot': 34.81},
 {'pm10': 52.33, 'no2': 14.1, 'so2': 29.75, 'soot': 33.06}]
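
To make the shape of this conversion concrete, here is a minimal pure-Python sketch that needs no pandas. The `by_index` literal is a hand-written stand-in, assumed to mirror what `features.to_dict('index')` returns for the first two rows shown above; `by_index` and `records` are illustrative names only.

```python
# Hand-written stand-in for features.to_dict('index') on the first two rows:
# each row label maps to a {column: value} dictionary.
by_index = {
    0: {"pm10": 98.67, "no2": 14.1, "so2": 44.38, "soot": 34.81},
    1: {"pm10": 52.33, "no2": 14.1, "so2": 29.75, "soot": 33.06},
}

# Dropping the row labels leaves one dictionary per row, which is the
# list-of-instances layout handed to the Beam transform later on.
records = list(by_index.values())

print(records[0])
```

In pandas, `features.to_dict('records')` produces this list-of-dictionaries layout directly, without the intermediate index mapping.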

### Defining the metadata

In [20]:
# Define the dataset schema as a feature spec: one scalar float32 per column
schema = {
    'no2': tf.io.FixedLenFeature([], tf.float32),
    'so2': tf.io.FixedLenFeature([], tf.float32),
    'pm10': tf.io.FixedLenFeature([], tf.float32),
    'soot': tf.io.FixedLenFeature([], tf.float32),
}

# Build the dataset metadata from the schema
data_metadata = dataset_metadata.DatasetMetadata(schema_utils.schema_from_feature_spec(schema))

In [21]:
data_metadata

{'_schema': feature {
  name: "no2"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "pm10"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "so2"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "soot"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
, '_output_record_batches': True}

## Step 4: Preprocessing function

In [25]:
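def preprocessing_fn(inputs):
    """Transform each input column using statistics from the full dataset."""
    no2 = inputs["no2"]
    pm10 = inputs["pm10"]
    so2 = inputs["so2"]
    soot = inputs["soot"]

    # Mean-center: subtract the column mean computed over the whole dataset.
    no2_normalized = no2 - tft.mean(no2)
    so2_normalized = so2 - tft.mean(so2)

    # Min-max scale to [0, 1]; scale_by_min_max defaults to the same range.
    pm10_normalized = tft.scale_to_0_1(pm10)
    soot_normalized = tft.scale_by_min_max(soot)

    return {
        "no2_normalized": no2_normalized,
        "so2_normalized": so2_normalized,
        "pm10_normalized": pm10_normalized,
        "soot_normalized": soot_normalized
    }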
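
As a rough illustration of the arithmetic behind `preprocessing_fn`, the sketch below reproduces mean-centering and min-max scaling in plain Python. It assumes the statistics are computed over only the five rows shown in the `head()` output, so its numbers will not match a run over the full dataset. `center` and `min_max_scale` are hypothetical helper names, not TFT functions.

```python
def center(values):
    """Subtract the column mean, analogous to `x - tft.mean(x)`."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def min_max_scale(values):
    """Map values linearly onto [0, 1], analogous to `tft.scale_to_0_1(x)`.

    Assumes the values are not all equal (otherwise this divides by zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# First five rows of the dataset, taken from the head() output.
no2 = [14.1, 14.1, 20.5, 17.3, 25.64]
pm10 = [98.67, 52.33, 74.67, 72.0, 81.0]

no2_centered = center(no2)
pm10_scaled = min_max_scale(pm10)

print(no2_centered)
print(pm10_scaled)
```

The centered column sums to approximately zero, and the scaled column has its minimum at 0 and its maximum at 1. In TFT the mean, minimum and maximum come from a full pass over the data, which is why the function is run through an analyze step rather than applied row by row.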

## Step 5: Putting it all together

TensorFlow Transform uses **Apache Beam** as its backend to run the operations.

The following values are passed to the transform:

    dict_features - Our dataset, converted to a list of dictionaries
    data_metadata - The metadata we created earlier
    preprocessing_fn - The preprocessing function that applies the transformations column by column


Below is the syntax used by Apache Beam:

```
result = data_to_pass | where_to_pass_the_data
```

Each part of that expression maps to:

**result**  -> `transformed_dataset, transform_fn`

**data_to_pass** -> `(dict_features, data_metadata)`

**where_to_pass_the_data** -> `tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)`

```
transformed_dataset, transform_fn = ((dict_features, data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

```
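
The `|` in that expression is ordinary Python operator overloading: Beam transforms define the reflected-or method `__ror__`, so `data | transform` ends up calling a method on the transform. The toy class below is only a sketch of that mechanism; `ToyTransform` and `describe` are hypothetical names and are not part of Apache Beam or TFT.

```python
class ToyTransform:
    """Stand-in for a Beam transform: wraps a function applied via `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, left):
        # Python calls this for `left | self` when `left` (here a tuple)
        # does not implement the `|` operator for this operand itself.
        return self.fn(left)


def describe(inputs):
    """Unpack the (data, metadata) pair and summarize it."""
    data, metadata = inputs
    return len(data), metadata


result = ([{"no2": 14.1}, {"no2": 20.5}], "metadata") | ToyTransform(describe)
print(result)  # (2, 'metadata')
```

This is why a plain tuple such as `(dict_features, data_metadata)` can sit on the left-hand side of `|`: the work is done by the object on the right.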

More on this syntax:
https://beam.apache.org/documentation/programming-guide/#transforms

LINKS:
> More on Apache Beam: https://beam.apache.org/

In [26]:
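def data_transform():
    # temp_dir is where TFT writes its temporary analysis outputs.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Analyze the full dataset, then apply preprocessing_fn to it.
        transformed_dataset, transform_fn = (
            (dict_features, data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )

    transformed_data, transformed_metadata = transformed_dataset

    # Print each original row next to its transformed counterpart.
    for original, transformed in zip(dict_features, transformed_data):
        print("Initial: ", original)
        print("Transformed: ", transformed)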

In [27]:
data_transform()



Initial:  {'pm10': 98.67, 'no2': 14.1, 'so2': 44.38, 'soot': 34.81}
Transformed:  {'no2_normalized': -18.577978134155273, 'pm10_normalized': 0.3407169580459595, 'so2_normalized': 28.85540771484375, 'soot_normalized': 0.283423513174057}
Initial:  {'pm10': 52.33, 'no2': 14.1, 'so2': 29.75, 'soot': 33.06}
Transformed:  {'no2_normalized': -18.577978134155273, 'pm10_normalized': 0.16963857412338257, 'so2_normalized': 14.225407600402832, 'soot_normalized': 0.26620757579803467}
Initial:  {'pm10': 74.67, 'no2': 20.5, 'so2': 36.25, 'soot': 39.25}
Transformed:  {'no2_normalized': -12.177978515625, 'pm10_normalized': 0.25211358070373535, 'so2_normalized': 20.725406646728516, 'soot_normalized': 0.32710281014442444}
Initial:  {'pm10': 72.0, 'no2': 17.3, 'so2': 46.44, 'soot': 34.38}
Transformed:  {'no2_normalized': -15.377979278564453, 'pm10_normalized': 0.24225644767284393, 'so2_normalized': 30.9154052734375, 'soot_normalized': 0.27919331192970276}
Initial:  {'pm10': 81.0, 'no2': 25.64, 'so2': 56.5