<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-Variables" data-toc-modified-id="Project-Variables-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Variables</a></span></li><li><span><a href="#Dataset-and-Pipelines" data-toc-modified-id="Dataset-and-Pipelines-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dataset and Pipelines</a></span></li><li><span><a href="#Building-the-Deep-NN" data-toc-modified-id="Building-the-Deep-NN-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Building the Deep NN</a></span><ul class="toc-item"><li><span><a href="#Network-Variables" data-toc-modified-id="Network-Variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Network Variables</a></span></li></ul></li></ul></div>

The goal of this notebook is building a sentiment classification model using plain TensorFlow. For making this task easier, we will use data prep steps (like pipelines and text functions) for preparing the data for feeding a neural network.

In [95]:
# Standard libraries
import os
import pandas as pd
from joblib import load

# Utilities
from utils.custom_transformers import import_data

# Modeling
from sklearn.model_selection import train_test_split
import tensorflow as tf

# Project Variables

In [58]:
# Variables for path definition
DATA_PATH = '../../data'
PIPE_PATH = '../../pipelines'

# Variables reading files
DATASET_FILENAME = 'olist_order_reviews_dataset.csv'
DATASET_COLS = ['review_comment_message', 'review_score']
FEATURES_COL = 'review_comment_message'
TARGET_COL = 'review_score'

TEXT_PIPE = 'text_prep_pipeline.pkl'

# Dataset and Pipelines

By now, let's read the raw data and apply the text prep pipeline already built on python training script on `dev/training` project folder. The goal is to give the raw text as input and transform this data into features using the vectorizer implemented on the pipeline (`TfIdfVectorizer`).

In [62]:
# Reading the data and dropping duplicates
df = pd.read_csv(os.path.join(DATA_PATH, DATASET_FILENAME), usecols=DATASET_COLS)
df.dropna(inplace=True)

# Splitting the data into train and test
X = df[FEATURES_COL].values
y = df[TARGET_COL].values

# Reading the pipeline
text_prep_pipe = load(os.path.join(PIPE_PATH, TEXT_PIPE))

# Applying it to training data
X_prep = text_prep_pipe.fit_transform(X)

# Splitting into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X_prep, y, test_size=.20, random_state=42)

# Results
for data, name in zip([X_train, X_test], ['X_train', 'X_test']):
    print('-' * 50)
    print(f'Shape of {name} data: {data.shape}')
print(f'\nSamples of y_train: {y_train[:5]}')
print(f'Samples of y_test: {y_test[:5]}')

--------------------------------------------------
Shape of X_train data: (33402, 650)
--------------------------------------------------
Shape of X_test data: (8351, 650)

Samples of y_train: [1 4 4 5 5]
Samples of y_test: [1 1 3 3 1]


# Building the Deep NN

After rading and preparing the data for feeding it into a neural network, let's retrieve some useful parameters for the network.

## Network Variables

In [99]:
# Retrieving data info
n_inputs = X_train.shape[1]
n_outputs = len(np.unique(y_train))

# Transforming the classes in one hot vectors
y_train_oh = pd.get_dummies(y_train).values

# Neural network structure
n_hidden1 = 300
n_hidden2 = 100

# Overview
print('-' * 40)
print(f'Neural network inputs: {n_inputs}')
print(f'Number of classes: {n_classes} - Sample: {y_train_oh[0]}')
print(f'Structure:')
for units in n_inputs, n_hidden1, n_hidden2, n_classes:
    print(f' - {units}', end='')
print()
print('-' * 40)

----------------------------------------
Neural network inputs: 650
Number of classes: 5 - Sample: [1 0 0 0 0]
Structure:
 - 650 - 300 - 100 - 5
----------------------------------------


In [100]:
# Função para reset do grafo
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [101]:
from datetime import datetime

In [102]:
# Variáveis para salvamento do modelo
now = datetime.now().strftime('%Y%m%d_%H%M%S')
root_logdir = 'tf_logs'
logdir = f'{root_logdir}/run_{now}'

# ----------------------------
# ---- CONSTRUCTION PHASE ----
# ----------------------------

# Definindo placeholders para os inputs
reset_graph()
with tf.name_scope('inputs'):
    X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
    y = tf.placeholder(tf.int32, shape=(None), name='y')
    
# Construindo as camadas da rede
with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name='hidden1')
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name='hidden2')
    logits = tf.layers.dense(hidden2, n_outputs, name='outputs')
    
# Definindo função custo
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
# Definindo otimizador
with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    training_op = optimizer.minimize(loss)
    
# Avaliando performance (acurácia)
with tf.name_scope('accuracy'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
    
"""# Avaliando performance (AUC)
with tf.name_scope('auc'):
    auc = tf.keras.metrics.AUC(y_proba, correct)"""
    
# Nós de inicialização e salvamento da rede
init = tf.global_variables_initializer()
saver = tf.train.Saver()

# Parâmetros para visualização no TensorBoard
loss_summary = tf.summary.scalar('loss', loss)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())



In [103]:
# Definindo função para leitura de dados em mini-batches
def fetch_batch(X, y, epoch, batch_index, batch_size):
    """
    Etapas:
        1. leitura do conjunto de dados em diferentes mini-batches
        
    Argumentos:
        epoch -- época do treinamento do algoritmo
        batch_index -- índice do mini-batch a ser lido do conjunto total
        batch_size -- tamanho do mini-batch em termos de número de registros
        
    Retorno:
        X_batch, y_batch -- conjuntos mini-batch de dados lidos a partir do conjunto total
    """
    
    # Retornando parâmetros
    m = X.shape[0]
    n_batches = m // batch_size
    
    # Definindo semente aleatória
    np.random.seed(epoch * n_batches + batch_index)
    
    # Indexando mini-batches do conjunto total
    indices = np.random.randint(m, size=batch_size)
    X_batch = X[indices]
    y_batch = y[indices]
    
    return X_batch, y_batch

In [105]:
# -------------------------
# ---- EXECUTION PHASE ----
# -------------------------

# Variáveis importantes para o treinamento
m_train = X_train.shape[0]
n_epochs = 50
batch_size = 128
n_batches = m_train // batch_size
costs = []

# Inicializando sessão
with tf.Session() as sess:
    
    # Inicializando variáveis globais
    init.run()
    
    # Iterando sobre as épocas de treino
    for epoch in range(n_epochs):
        # Iterando sobre cada mini-batch
        for batch in range(n_batches):
            X_batch, y_batch = fetch_batch(X_train, y_train, epoch, batch, batch_size)
            batch_feed_dict = {X: X_batch, y: y_batch}
            
            # Salvando status do modelo a cada T mini-batches
            if batch % 10 == 0:
                summary_loss_str = loss_summary.eval(feed_dict=batch_feed_dict)
                step = epoch * n_batches + batch
                file_writer.add_summary(summary_loss_str, step)
                
            # Inicializando treinamento com cada mini-batch
            sess.run(training_op, feed_dict=batch_feed_dict)
            
        # Métricas de performance a cada N épocas
        test_feed_dict = {X: X_test_prep, y: y_test}
        if epoch % 5 == 0:
            # Acurácia
            acc_train = accuracy.eval(feed_dict=batch_feed_dict)
            acc_test = accuracy.eval(feed_dict=test_feed_dict)
            print(f'Epoch: {epoch}, Train accuracy: {round(float(acc_train), 4)}, \
Test accuracy: {round(float(acc_test), 4)}')
            
            # AUC
            """train_proba = y_proba.eval(feed_dict=batch_feed_dict)
            class_indices = np.argmax(train_proba, axis=1)
            train_pred = np.array([[classes[class_idx]] for class_idx in class_indices], np.int32)
            tf.local_variables_initializer().run()
            auc_train = sess.run(auc(y_batch.reshape(-1, 1), train_pred))
            print(f'AUC: {auc_train}')"""
            
        # Custo do modelo
        cost = loss.eval(feed_dict=batch_feed_dict)
        costs.append(cost)
        
    # Finalizando FileWriter
    file_writer.close()
    
# Plotando custo
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(np.squeeze(costs), color='navy')
format_spines(ax, right_border=False)
ax.set_title('Neural Network Cost', color='dimgrey')
ax.set_xlabel('Epoch')
ax.set_ylabel('Cost')
plt.show()

ValueError: setting an array element with a sequence.