# Training a text classifier with TensorFlow Hub

This notebook trains a simple sentiment classifier with a reasonable baseline accuracy using TensorFlow Hub. The predictions are then analyzed to check that the model behaves sensibly and to propose improvements that could increase accuracy.

## Importing essential libraries

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

## Data

The task uses the reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). Each IMDB movie review carries a positivity rating from 1 to 10. The task is to label the reviews as **negative** or **positive**.

In [2]:
# Load all files from a directory into a DataFrame.
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)
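
The file names in the dataset follow the pattern `<id>_<rating>.txt`, which is what the regular expression in `load_directory_data` relies on. The following is a minimal sketch of that parsing step in isolation; the helper name `parse_review_filename` is hypothetical and not part of the notebook.

```python
import re

# Pattern for names such as "123_8.txt": review id, underscore, rating.
FILENAME_PATTERN = re.compile(r"(\d+)_(\d+)\.txt")


def parse_review_filename(file_name):
    """Return (review_id, rating) as integers, or None if the name does not match."""
    match = FILENAME_PATTERN.fullmatch(file_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_review_filename("123_8.txt"))
print(parse_review_filename("README"))
```

Unlike the notebook's function, this version converts the rating to an integer and returns `None` for unexpected file names instead of raising an error.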

In [3]:
# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)
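
The `sample(frac=1).reset_index(drop=True)` idiom used in `load_dataset` shuffles the rows and renumbers them. A small sketch on made-up data shows the effect; the two tiny frames and the fixed `random_state` are assumptions added only to make the example reproducible.

```python
import pandas as pd

# Two tiny stand-ins for the positive and negative review frames (made-up data).
pos_df = pd.DataFrame({"sentence": ["great", "loved it"], "polarity": 1})
neg_df = pd.DataFrame({"sentence": ["awful", "boring"], "polarity": 0})

# sample(frac=1) returns all rows in random order; reset_index(drop=True)
# discards the old row labels so the result is indexed 0..n-1 again.
# random_state is fixed here only to make the sketch reproducible.
shuffled = (
    pd.concat([pos_df, neg_df])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(shuffled)
```

Without the shuffle, all positive reviews would precede all negative ones, which matters for any input function that reads the frame in order.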

In [4]:
# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz", 
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
      extract=True)
  
  train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))
  
  return train_df, test_df

In [5]:
# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

train_df, test_df = download_and_load_datasets()
train_df.head()

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


| | sentence | sentiment | polarity |
|---|---|---|---|
| 0 | That is the only thing I can positive to say a... | 2 | 0 |
| 1 | I've often wondered just how much CASPER was m... | 8 | 1 |
| 2 | yeah cheap shot i know, but this movie is a gr... | 3 | 0 |
| 3 | Great fun. I went with 8 friends to a sneak pr... | 9 | 1 |
| 4 | This movie is about a Dysfunctinal Family but ... | 10 | 1 |


## Model

In [10]:
# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)

In [11]:
# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], shuffle=False)

In [12]:
# Prediction on the whole test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(
    test_df, test_df["polarity"], shuffle=False)

In [13]:
predict_train_input_fn

<function tensorflow.python.estimator.inputs.pandas_io.pandas_input_fn.<locals>.input_fn()>

In [14]:
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence", 
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")

## Estimator

For classification, we use a DNNClassifier (a dense neural network).
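
To make the shape of such a network concrete, here is a rough NumPy sketch of a forward pass through two hidden layers of 500 and 100 units on top of a 128-dimensional embedding. The weights are random, and the ReLU activation and single-logit sigmoid output are assumptions about how a binary classifier of this kind is typically built; none of the names below come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, n_out):
    """Apply one randomly initialised dense layer (illustrative weights only)."""
    w = rng.normal(scale=0.05, size=(x.shape[1], n_out))
    b = np.zeros(n_out)
    return x @ w + b


# A batch of 4 hypothetical sentence embeddings with 128 dimensions each.
embeddings = rng.normal(size=(4, 128))

h1 = np.maximum(dense(embeddings, 500), 0.0)   # hidden layer 1 + ReLU
h2 = np.maximum(dense(h1, 100), 0.0)           # hidden layer 2 + ReLU
logits = dense(h2, 1)                          # one logit per example
prob_positive = 1.0 / (1.0 + np.exp(-logits))  # sigmoid gives P(polarity == 1)

print(prob_positive.shape)
```

This only illustrates the data flow; the actual estimator also handles training, the loss and the embedding lookup.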

In [16]:
estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))


## Training

In [17]:
estimator.train(input_fn=train_input_fn, steps=1000)

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7f4c56bcf4a8>
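
Because the training input function uses `num_epochs=None`, the length of training is governed by `steps` alone. As a back-of-the-envelope check, assuming the default `batch_size=128` of `pandas_input_fn` and 25,000 reviews in the training split (neither value is stated in this notebook), 1000 steps correspond to roughly five passes over the data.

```python
# Assumed values: neither is stated explicitly in the notebook.
batch_size = 128            # assumed default of pandas_input_fn
num_train_examples = 25000  # assumed size of the IMDB training split
steps = 1000                # from the estimator.train call

examples_seen = steps * batch_size
epochs = examples_seen / num_train_examples

print(f"{examples_seen} examples seen, about {epochs:.2f} epochs")
```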

## Prediction

Evaluating the model's predictions on both the training set and the test set.

In [18]:
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))

Training set accuracy: 0.8029999732971191
Test set accuracy: 0.7940400242805481
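
The accuracy reported here is the fraction of reviews whose predicted class matches the true polarity. Below is a minimal sketch with made-up labels and predictions (the helper `accuracy` is hypothetical), followed by the gap between the two figures printed above.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up example: 4 of 5 predictions are correct.
toy_accuracy = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])

# Figures printed by the notebook, rounded to three decimals.
train_accuracy = 0.803
test_accuracy = 0.794
gap = train_accuracy - test_accuracy

print(toy_accuracy, round(gap, 3))
```

A gap of under one percentage point between training and test accuracy suggests that the model is not overfitting strongly after 1000 steps.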
