# Preprocessing Step (Etapa do Pré-processamento)
## Table of Contents
* [Packages](#1)
* [Preprocessing Functions](#3.2)
    * [Get User Dataset](#3.2.1)
    * [DataFrame to tf.data.Dataset](#3.2.2)
* [Preprocessing](#6)

<a class="anchor" id="1"></a>
## Packages (Pacotes)
**[EN-US]**

Packages used in the system.
* [pandas](https://pandas.pydata.org/): is the main package for data manipulation;
* [numpy](https://numpy.org/): is the main package for scientific computing;
* [tensorflow](https://www.tensorflow.org/): framework that makes it easy to create ML models that can run in any environment;
* [scikit-learn](https://scikit-learn.org/stable/): open source machine learning library.

**[PT-BR]**

Pacotes utilizados no sistema.
* [pandas](https://pandas.pydata.org/): é o principal pacote para manipulação de dados;
* [numpy](https://numpy.org/): é o principal pacote para computação científica;
* [tensorflow](https://www.tensorflow.org/): framework que facilita a criação de modelos de machine learning que podem ser executados em qualquer ambiente;
* [scikit-learn](https://scikit-learn.org/stable/): biblioteca open-source de machine learning.

In [331]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

<a name="3.2"></a>
### Preprocessing Functions (Funções de Pré-Processamento)
**[EN-US]**

Utility functions for preprocessing. The `get_user_dataset` function returns the user's dataset: given a row vector with the desired features, it returns a dataset with those features copied across all examples, with as many examples as the dataset passed as reference. The `df_to_tfdataset` function converts a DataFrame or numpy array into a tensorflow **tf.data.Dataset** object, applies preprocessing and optimizes performance. The `L2_norm` class turns the calculation of the L2 norm into a layer.

**[PT-BR]**

Funções utilitárias para o pré-processamento. A função `get_user_dataset` retorna o dataset do usuário: dado um vetor de linha com as features desejadas, ela retorna um dataset com essas features copiadas em todos os exemplos, com a mesma quantidade de exemplos do dataset passado como referência. A função `df_to_tfdataset` converte um DataFrame ou matriz numpy em um objeto **tf.data.Dataset** do tensorflow, aplica os pré-processamentos e otimiza o desempenho. A classe `L2_norm` transforma o cálculo da norma L2 em uma layer.

<a name="3.2.1"></a>
#### Get User Dataset (Cria o Dataset do Usuário)
**[EN-US]**

Creates the user's dataset with the same number of examples as the dataset passed as input.

**[PT-BR]**

Cria o dataset do usuário com a mesma quantidade de exemplos que o dataset passado como input.

In [334]:
def get_user_dataset(user_vec, item_df):
    """
    [EN-US]
    Creates the User Dataset with the same number of examples as the dataset that was given as reference.
    
    [PT-BR]
    Cria o Dataset do usuário com a mesma quantidade de exemplos que o dataset que foi dado como referência.
    
    Arguments:
        user_vec -- Row vector with user features
                    (Vetor de linha com as features do usuário).
        item_df -- Reference Dataset to create User Dataset
                   (Dataset de referência para criar o Dataset do usuário).
    
    Return:
        user_df -- User dataset with the same dimensions as the reference dataset
                   (Dataset do usuário com as mesmas dimensões do dataset de referência).
    """
    # Creating the user matrix (Criando a matriz do usuário).
    user_df = np.tile(user_vec, (len(item_df), 1))
    features = list(item_df.columns)
    
    # Transforming the numpy array into a pandas DataFrame (Transformando o array numpy em um DataFrame pandas).
    user_df = pd.DataFrame(user_df, columns=features)
    return user_df
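
The tiling step can be illustrated on a tiny input. This is a minimal sketch, assuming a made-up reference dataset with two feature columns and a made-up user profile; the values are hypothetical and only show how `np.tile` repeats the row vector once per reference row.

```python
import numpy as np
import pandas as pd

# Hypothetical reference dataset with 3 rows and 2 feature columns.
item_df = pd.DataFrame({"danceability": [0.7, 0.6, 0.5], "energy": [0.9, 0.4, 0.8]})
# Hypothetical user profile as a 1 x 2 row vector.
user_vec = [[0.65, 0.70]]

# np.tile repeats the row len(item_df) times along axis 0 and once along axis 1.
user_matrix = np.tile(user_vec, (len(item_df), 1))
user_df = pd.DataFrame(user_matrix, columns=list(item_df.columns))

print(user_df.shape)  # (3, 2)
```

Every row of `user_df` is identical, so the user profile can be paired row by row with each item.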

<a name="3.2.2"></a>
#### DataFrame to tf.data.Dataset (DataFrame para um tf.data.Dataset)
**[EN-US]**

Transforms a pandas DataFrame or numpy array into a `tf.data.Dataset` object, and also applies preprocessing (caching, optional shuffling, batching and prefetching) to prepare the data and optimize performance.

**[PT-BR]**

Transforma um DataFrame do pandas ou matriz numpy em um objeto `tf.data.Dataset` e também aplica pré-processamentos (cache, embaralhamento opcional, batches e prefetch) para preparar os dados e otimizar o desempenho.

In [336]:
def df_to_tfdataset(X, y, shuffle_buffer=1000, batch_size=32, shuffle=True):
    """
    [EN-US]
    Transforms a dataframe or numpy array into a tf.data.Dataset object, applies preprocessing and optimizes performance.
    
    [PT-BR]
    Transforma um dataframe ou matriz numpy em um objeto tf.data.Dataset, aplica os pré-processamentos e otimiza o desempenho.
    
    Arguments:
        X -- DataFrame or numpy array with features (DataFrame ou matriz numpy com as features).
        y -- DataFrame or numpy array with labels (DataFrame ou matriz numpy com os labels).
        shuffle_buffer -- Size of the shuffle buffer: the buffer is filled with this many elements and each output element is drawn at random from it
                          (Tamanho do buffer de embaralhamento: o buffer é preenchido com essa quantidade de elementos e cada elemento de saída é sorteado aleatoriamente dele).
        batch_size -- Size of dataset mini-batches (Tamanho dos mini-batches do dataset).
        shuffle -- If True, the dataset will be shuffled, otherwise not (Caso seja True, o dataset será embaralhado, caso contrário, não).
    
    Return:
        dataset -- Preprocessed tf.data.Dataset (tf.data.Dataset pré-processado).
    """
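    # Concatenating the features X with the labels y as numpy arrays, so that both DataFrames and numpy arrays are accepted
    # Concatenando as features X com os labels y como arrays numpy, para aceitar tanto DataFrames quanto arrays numpy
    X_arr = np.asarray(X)
    y_arr = np.asarray(y).reshape(len(X_arr), -1)
    dataset = np.concatenate([X_arr, y_arr], axis=1)
    # Transforming the concatenated array into a tf.data.Dataset object
    # Transformando o array concatenado em um objeto tf.data.Dataset
    dataset = tf.data.Dataset.from_tensor_slices(dataset)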
    # Storing elements in memory (Armazenando elementos na memória)
    dataset = dataset.cache()
    if shuffle:
        # Shuffling the dataset (Embaralhando o dataset)
        dataset = dataset.shuffle(buffer_size=shuffle_buffer)
    # Applying the final preprocessing
    # Aplicando os pré-processamentos finais
    dataset = (
        dataset
        .map(lambda x: (x[:-1], x[-1])) # Separating features from labels into tuples (Separando as features dos labels em tuplas)
        .batch(batch_size) # Creating batches of this dataset (Criando batches desse dataset)
        .prefetch(buffer_size=tf.data.AUTOTUNE) # Preparing the next batches while the current one is consumed (Preparando os próximos batches enquanto o atual é consumido)
    )
    return dataset
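
The role of `shuffle_buffer` is easier to see in a simplified pure-Python model of a shuffle buffer. The sketch below is not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper introduced only for illustration: the buffer is filled first, then each output is drawn at random from the buffer and its slot is refilled with the next input.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in pseudo-random order using a fixed-size buffer (illustrative model)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and replace it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled)
print(unshuffled)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In this model a buffer of size 1 leaves the order unchanged, while a buffer at least as large as the dataset gives a full shuffle. That is why a buffer smaller than the dataset only shuffles locally.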

<a name="6"></a>
## Preprocessing (Pré-processamento)
**[EN-US]**

In data preprocessing, the dataset will be divided into training, validation and test sets. Each of them will be scaled to have a mean of 0 and a standard deviation of 1, using the `z-score` method.

**[PT-BR]**

No pré-processamento dos dados, o dataset será dividido entre o set de treino, validação e teste. Cada um deles será escalado para ter média 0 e desvio-padrão 1, utilizando o método `z-score`.

Reading the dataset from the `../data/transformed/` directory (Lendo o dataset do diretório `../data/transformed/`).

In [339]:
data = pd.read_csv('../data/transformed/data.csv')
data.head()

Unnamed: 0,id,name,artists,duration_ms,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,y,duration_min
0,1XNdu4czbxZ9tdVOhDtI4M,Cheia de Manias,Raça Negra,211000,51,0.707,0.614,7,-9.377,0,0.058,0.198,0.0,0.177,0.957,156.473,1,3.516667
1,1SaeGiWS1LQhfRc6ObQfvq,Que Se Chama Amor,Só Pra Contrariar,245200,0,0.647,0.528,7,-12.739,1,0.0409,0.411,0.0,0.0481,0.621,136.118,1,4.086667
2,6jBZxE0VZpmgZud4L5lo0u,Paparico,Molejo,220426,53,0.738,0.85,9,-7.278,1,0.0756,0.53,0.0,0.213,0.967,88.799,1,3.673767
3,43uHeQc9kdMzFAbxCTcgiI,Telegrama,Exaltasamba,219866,54,0.517,0.516,11,-12.441,1,0.0342,0.443,4e-06,0.236,0.601,151.06,1,3.664433
4,4tILvSk9XcdCBzjyzhsuze,É Tarde Demais,Raça Negra,214693,50,0.615,0.403,4,-11.648,1,0.0282,0.425,4e-06,0.321,0.454,136.377,1,3.578217


Defining the dataset with only the track ID, song and artist names (Definindo o dataset apenas com o ID das tracks, os nomes das músicas e dos artistas).

In [341]:
items = data.iloc[:, :3].copy()
items

Unnamed: 0,id,name,artists
0,1XNdu4czbxZ9tdVOhDtI4M,Cheia de Manias,Raça Negra
1,1SaeGiWS1LQhfRc6ObQfvq,Que Se Chama Amor,Só Pra Contrariar
2,6jBZxE0VZpmgZud4L5lo0u,Paparico,Molejo
3,43uHeQc9kdMzFAbxCTcgiI,Telegrama,Exaltasamba
4,4tILvSk9XcdCBzjyzhsuze,É Tarde Demais,Raça Negra
...,...,...,...
977,2QKYsuDCtjbzoBdaan7Ybd,explicit,MC Gomes
978,4sDFmfir0mmHFeVgbg2WgR,explicit,MC Neguinho do Kaxeta
979,3Z0kqYFQYF7KLli5gAMkf3,explicit,MC Cainho
980,4he5PH2421GeKAJ5a4Ftqu,explicit,MC Maneirinho


Saving the item set to the `../data/transformed/` directory (Salvando o set de itens no diretório `../data/transformed/`).

In [343]:
items.to_csv('../data/transformed/items.csv', index=False)

Creating a dataset with only numerical features (Criando um dataset apenas com as features numéricas).

In [345]:
X_num = data.drop(columns=['id', 'name', 'artists', 'y', 'duration_ms']).copy()
X_num.head()

Unnamed: 0,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_min
0,51,0.707,0.614,7,-9.377,0,0.058,0.198,0.0,0.177,0.957,156.473,3.516667
1,0,0.647,0.528,7,-12.739,1,0.0409,0.411,0.0,0.0481,0.621,136.118,4.086667
2,53,0.738,0.85,9,-7.278,1,0.0756,0.53,0.0,0.213,0.967,88.799,3.673767
3,54,0.517,0.516,11,-12.441,1,0.0342,0.443,4e-06,0.236,0.601,151.06,3.664433
4,50,0.615,0.403,4,-11.648,1,0.0282,0.425,4e-06,0.321,0.454,136.377,3.578217


**[EN-US]**

Creating the one-hot encoding of the `key` feature. The `key` feature represents the key the track is in: 0 = C, 1 = C♯/D♭, 2 = D and so on. If no key is detected, the value is -1. The values of this feature are therefore nominal, not ordinal: the classes have no intrinsic order and we do not establish a hierarchy between them. Because `drop_first=True` is used, the first category is dropped and acts as the baseline, leaving 11 indicator columns.

**[PT-BR]**

Criando o one-hot encoding da feature `key`. A feature `key` representa o tom em que a track está: 0 = C, 1 = C♯/D♭, 2 = D e assim por diante. Se nenhum tom for detectado, o valor será -1. Portanto, os valores dessa feature são nominais, não ordinais: as classes não têm uma ordem intrínseca e não estabelecemos uma hierarquia entre elas. Como `drop_first=True` é usado, a primeira categoria é removida e serve como referência, restando 11 colunas indicadoras.

In [347]:
key_oh = pd.get_dummies(X_num['key'], prefix='key', drop_first=True, dtype=np.int64)
key_oh

Unnamed: 0,key_1,key_2,key_3,key_4,key_5,key_6,key_7,key_8,key_9,key_10,key_11
0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
977,0,0,0,0,0,1,0,0,0,0,0
978,0,0,0,1,0,0,0,0,0,0,0
979,1,0,0,0,0,0,0,0,0,0,0
980,0,1,0,0,0,0,0,0,0,0,0
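
The effect of `drop_first=True` can be seen on a small example. This is a minimal sketch with made-up key values, not the notebook's data: the lowest category present becomes the implicit baseline, encoded as all zeros.

```python
import pandas as pd

# Hypothetical key values: 0 = C, 2 = D, 7 = G.
keys = pd.Series([0, 2, 7, 2], name="key")

full = pd.get_dummies(keys, prefix="key", dtype="int64")
reduced = pd.get_dummies(keys, prefix="key", drop_first=True, dtype="int64")

print(full.columns.tolist())     # ['key_0', 'key_2', 'key_7']
print(reduced.columns.tolist())  # ['key_2', 'key_7']
# The first row (key = 0) is the baseline: all remaining indicators are 0.
print(reduced.iloc[0].tolist())  # [0, 0]
```

Dropping one column removes the redundancy among the indicators, since the dropped category is implied when all others are 0.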


Concatenating the numeric dataset `X_num` with the one-hot encoding of the `key` feature (Concatenando o dataset numérico `X_num` com o one-hot encoding da feature `key`).

In [349]:
# Setting the numeric dataset without the feature key
# Definindo o dataset numérico sem a feature key
X_num_wkey = X_num.drop(columns=['key']).copy()
# Mapping the positional column indices to the column names of X_num_wkey and key_oh
# Mapeando os índices posicionais das colunas para os nomes das colunas de X_num_wkey e key_oh
columns_oh = {i: j for i, j in enumerate(X_num_wkey.columns.tolist() + key_oh.columns.tolist())}

# Concatenating the datasets
# Concatenando os datasets
X_num_oh = pd.concat(
    [X_num_wkey, key_oh],
    axis=1,
    ignore_index=True,
).rename(columns=columns_oh)
# Displaying the first 5 examples of the concatenated dataset
# Exibindo os 5 primeiros exemplos do dataset concatenado
X_num_oh.head()

Unnamed: 0,popularity,danceability,energy,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,key_2,key_3,key_4,key_5,key_6,key_7,key_8,key_9,key_10,key_11
0,51,0.707,0.614,-9.377,0,0.058,0.198,0.0,0.177,0.957,...,0,0,0,0,0,1,0,0,0,0
1,0,0.647,0.528,-12.739,1,0.0409,0.411,0.0,0.0481,0.621,...,0,0,0,0,0,1,0,0,0,0
2,53,0.738,0.85,-7.278,1,0.0756,0.53,0.0,0.213,0.967,...,0,0,0,0,0,0,0,1,0,0
3,54,0.517,0.516,-12.441,1,0.0342,0.443,4e-06,0.236,0.601,...,0,0,0,0,0,0,0,0,0,1
4,50,0.615,0.403,-11.648,1,0.0282,0.425,4e-06,0.321,0.454,...,0,0,1,0,0,0,0,0,0,0


Creating the user dataset given the average of each feature from `good_df` (Criando o dataset do usuário dada a média de cada feature do `good_df`).

In [351]:
good_df = pd.read_csv('../data/raw/df_good.csv')

# Creating the good dataset with only the numerical features
# Criando o dataset good apenas com as features numéricas
good_df_num = good_df.drop(columns=['id', 'name', 'artists', 'y', 'duration_ms']).copy()
# Calculating the average of each feature to create the user dataset
# Calculando a média de cada feature para criar o dataset do usuário
user_vec = [[i for i in good_df_num.mean(axis=0)]]
user_df = get_user_dataset(
    user_vec,
    X_num
)
user_df.head(1)

Unnamed: 0,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_min
0,40.916938,0.50802,0.722435,5.382736,-6.664011,0.672638,0.06679,0.20019,0.032501,0.239447,0.505641,125.323767,3.895687


In [352]:
print(f'User dataset size: {user_df.shape[0]}')

User dataset size: 982


**[EN-US]**

Exporting the preprocessed numeric dataset, without scaling, to the `../data/preprocessed/` directory. It will be used to adjust the final playlist dataset for the recommendations.

**[PT-BR]**

Exportando o dataset numérico pré-processado, sem o escalonamento, para o diretório `../data/preprocessed/`. Ele será usado para ajustar o dataset da playlist final para as recomendações.

In [354]:
X_num_oh.to_csv('../data/preprocessed/X_pre.csv', index=False)

Splitting the dataset between the training, validation and test sets (Dividindo o dataset entre o set de treino, validação e teste).

In [356]:
# Creating the column vector of the target label y so it is split together with the features
# Criando o vetor de coluna do target label y para ser dividido junto com as features
y = data['y'].to_numpy().reshape((-1, 1))

item_train, item_, y_train, y_ = train_test_split(X_num_oh, y, test_size=.4, random_state=42)
user_train, user_ = train_test_split(user_df, test_size=.4, random_state=42)
print(f'item_train.shape: {item_train.shape}\ny_train.shape: {y_train.shape}\n')
print(f'user_train.shape: {user_train.shape}\n')

item_cv, item_test, y_cv, y_test = train_test_split(item_, y_, test_size=.5, random_state=42)
user_cv, user_test = train_test_split(user_, test_size=.5, random_state=42)
print(f'item_cv.shape: {item_cv.shape}, item_test.shape: {item_test.shape}\ny_cv.shape: {y_cv.shape}, y_test.shape: {y_test.shape}\n')
print(f'user_cv.shape: {user_cv.shape}, user_test.shape: {user_test.shape}')

item_train.shape: (589, 23)
y_train.shape: (589, 1)

user_train.shape: (589, 13)

item_cv.shape: (196, 23), item_test.shape: (197, 23)
y_cv.shape: (196, 1), y_test.shape: (197, 1)

user_cv.shape: (196, 13), user_test.shape: (197, 13)
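
The item and user sets are split in separate calls, so their rows stay paired only because both calls see the same number of rows and the same `random_state`. The sketch below checks this on toy arrays; the data and the `+ 100` offset are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 10
items = np.arange(n).reshape(-1, 1)        # row i holds the value i
users = np.arange(n).reshape(-1, 1) + 100  # row i holds the value 100 + i
labels = np.arange(n)

# First split: 60% train, 40% held out.
item_train, item_rest, y_train, y_rest = train_test_split(
    items, labels, test_size=0.4, random_state=42
)
user_train, user_rest = train_test_split(users, test_size=0.4, random_state=42)

# Second split: the held-out 40% is halved into validation and test.
item_cv, item_test, y_cv, y_test = train_test_split(
    item_rest, y_rest, test_size=0.5, random_state=42
)
user_cv, user_test = train_test_split(user_rest, test_size=0.5, random_state=42)

print(len(item_train), len(item_cv), len(item_test))  # 6 2 2
# Same length and same seed give the same row selection in both calls.
print(np.array_equal(user_train - 100, item_train))   # True
```

Passing all arrays to a single `train_test_split` call would give the same pairing without relying on matching seeds.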


Applying z-score normalization to each dataset, so that they have a mean of 0 and a standard deviation of 1. This makes training converge faster (Aplicando a normalização z-score em cada dataset, para eles terem média 0 e desvio padrão 1. Isso faz com que o treinamento converja mais rápido).
$$X_{norm} = \frac{X - \mu}{\sigma}$$
* $\mu$: mean (média).
* $\sigma$: standard deviation (desvio-padrão).

We calculate the mean and standard deviation of the training set, and then apply the z-score to all datasets with the mean and standard deviation of the training set (Calculamos a média e desvio-padrão do training set, e então, aplicamos o z-score para todos os datasets com a média e o desvio-padrão do training set).

In [358]:
item_scaler = StandardScaler()
user_scaler = StandardScaler()

item_train_norm = item_scaler.fit_transform(item_train)
user_train_norm = user_scaler.fit_transform(user_train)

item_cv_norm = item_scaler.transform(item_cv)
user_cv_norm = user_scaler.transform(user_cv)

item_test_norm = item_scaler.transform(item_test)
user_test_norm = user_scaler.transform(user_test)
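
The same idea can be written by hand in numpy. This is a minimal sketch on made-up arrays, assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses as far as I know: the statistics come from the training set only and are reused for the other sets. The constant second column mimics the user dataset, whose rows are all identical, and the guard against a zero standard deviation is an assumption of this sketch.

```python
import numpy as np

train = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])  # second column is constant
test = np.array([[4.0, 5.0]])

# Statistics are computed on the training set only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)
# Constant columns have sigma == 0; use 1 instead to avoid dividing by zero.
sigma_safe = np.where(sigma == 0, 1.0, sigma)

train_norm = (train - mu) / sigma_safe
test_norm = (test - mu) / sigma_safe  # reuses the training mean and deviation

print(train_norm.mean(axis=0))
print(train_norm[:, 0].std())
print(test_norm)
```

After scaling, the non-constant training column has mean 0 and standard deviation 1, while the constant column becomes all zeros and carries no information.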

Defining the columns of each dataset (Definindo as colunas de cada dataset).

In [360]:
item_columns, user_columns = item_train.columns, user_train.columns
item_columns

Index(['popularity', 'danceability', 'energy', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_min', 'key_1', 'key_2', 'key_3', 'key_4',
       'key_5', 'key_6', 'key_7', 'key_8', 'key_9', 'key_10', 'key_11'],
      dtype='object')

Saving the preprocessed sets to the `../data/preprocessed/` directory (Salvando os sets pré-processados no diretório `../data/preprocessed/`).

In [362]:
# Item train dataset
pd.DataFrame(
    item_train_norm,
    columns=item_columns
).to_csv('../data/preprocessed/item_train_norm.csv', index=False)
# Train target y
pd.DataFrame(
    y_train,
    columns=['y']
).to_csv('../data/preprocessed/y_train.csv', index=False)
# User train dataset
pd.DataFrame(
    user_train_norm,
    columns=user_columns
).to_csv('../data/preprocessed/user_train_norm.csv', index=False)

# Item cv dataset
pd.DataFrame(
    item_cv_norm,
    columns=item_columns
).to_csv('../data/preprocessed/item_cv_norm.csv', index=False)
# Cv target y
pd.DataFrame(
    y_cv,
    columns=['y']
).to_csv('../data/preprocessed/y_cv.csv', index=False)
# User cv dataset
pd.DataFrame(
    user_cv_norm,
    columns=user_columns
).to_csv('../data/preprocessed/user_cv_norm.csv', index=False)

# Item test dataset
pd.DataFrame(
    item_test_norm,
    columns=item_columns
).to_csv('../data/preprocessed/item_test_norm.csv', index=False)
# Test target y
pd.DataFrame(
    y_test,
    columns=['y']
).to_csv('../data/preprocessed/y_test.csv', index=False)
# User test dataset
pd.DataFrame(
    user_test_norm,
    columns=user_columns
).to_csv('../data/preprocessed/user_test_norm.csv', index=False)