https://vgpena.github.io/classifying-tweets-with-keras-and-tensorflow/

The link above walks through an example that starts from tweets carrying a specific label (a sentiment, positive or negative) and does the following:

1. Builds a training set. Each complete tweet is converted into an array of a fixed size.
2. That array (X_train, of size N) has an associated label representing the sentiment (y_train).
3. Since every sentence is encoded with size N, the neural network's input has size N and its output has size 2 with softmax activation (because there are two classes).
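
As a minimal illustration of step 3, the sketch below applies softmax to two raw output scores in plain Python. The function name `softmax` and the example scores are assumptions made for illustration; they do not come from the linked post.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for the classes [negative, positive].
probs = softmax([0.3, 1.9])
print(probs)
```

The two values can be read as the probability of each class, which is why a two-unit softmax output suits a two-class problem.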

Task:

- Build a review classifier for the IMDB dataset in the data_exercise/ folder.

**Wherever an import uses "keras.x", replace it with "tensorflow.keras.x".**

### Import the necessary libraries

In [1]:
import io
import os
import re 
import shutil 
import string 
import tensorflow as tf 
import numpy as np
import pandas as pd 
import json

from datetime import datetime 
from tensorflow.keras import Model, Sequential 
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, Dropout
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow.keras.preprocessing.text as kpt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import model_from_json

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### Load the IMDb Dataset

In [2]:
dataset = pd.read_csv('data_exercise/IMDB Dataset/IMDB Dataset.csv', sep=',')
dataset.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
# Encode the sentiment column as integer labels (negative -> 0, positive -> 1)

le = LabelEncoder()
sentiment_le = le.fit_transform(dataset.sentiment)

dataset['sentiment_le'] = sentiment_le
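
To make the encoding concrete, here is a small plain-Python sketch of the idea behind `LabelEncoder`: classes are sorted and each one gets its position as an integer code. The helper `fit_label_mapping` and the sample list are hypothetical, used only for illustration.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical sample of the sentiment column.
sentiments = ['positive', 'positive', 'negative', 'positive']
mapping = fit_label_mapping(sentiments)
encoded = [mapping[s] for s in sentiments]
print(mapping, encoded)
```

Because "negative" sorts before "positive", index 0 stands for negative and index 1 for positive, which matches the `labels` list used at prediction time.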

In [4]:
## Split the data into features and target

X = dataset['review']
y = dataset['sentiment_le']

print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (50000,)
y shape: (50000,)


In [5]:
## Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

print('X_train shape:\t', X_train.shape)
print('X_test shape:\t', X_test.shape)
print('y_train shape:\t', y_train.shape)
print('y_test shape:\t', y_test.shape)

X_train shape:	 (40000,)
X_test shape:	 (10000,)
y_train shape:	 (40000,)
y_test shape:	 (10000,)
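
The shapes above follow from holding out 20% of 50,000 rows. The sketch below reproduces only the arithmetic and the idea of a seeded shuffle in plain Python; `split_indices` is a hypothetical helper and will not produce the same row assignment as `train_test_split`.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    # The first n_test shuffled indices form the test set, the rest train.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50000, 0.2, 1234)
print(len(train_idx), len(test_idx))
```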


In [6]:
# Custom standardization: lowercase the text, replace HTML break tags '<br />'
# with a space, and strip punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,'[%s]' % re.escape(string.punctuation), '')

# Preview only: the results are not assigned, so X_train and X_test keep the raw text.
custom_standardization(input_data=X_train)
custom_standardization(input_data=X_test)

<tf.Tensor: shape=(10000,), dtype=string, numpy=
array([b'although this film was made before dogme emerged as the predominant method of filmmaking and before digital triumphed over  strike that you get the point this 1991 masterpiece clearly anticipated those developments corin nemec is just outstanding as the neer do well author and narrator the pace is slow but elegantly so because the cinematography is so beautiful record it the next time its on tv because i guarantee youll never see a better nostalgia ripoff madefor tv movie directtovideo never felt so good',
       b'my grandmother took me and my sister out to see this movie when it came out in theaters back in 1998 and so we happily bought the tickets the popcorn and soda and walked right in to the theater and sat down to watch the movie when it was over the audience didnt applauded strongly i remember that i heard a few people say that they didnt like it at all i didnt like it i thought that it was rather stupid and not worth se
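
For a single string, the same three steps can be sketched with the standard library alone. The function `standardize` and the sample sentence are illustrative assumptions; the notebook itself uses the `tf.strings` version above.

```python
import re
import string

def standardize(text):
    """Lowercase, replace '<br />' with a space, then drop punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)

# Hypothetical review fragment.
sample = "Great movie!<br /><br />Loved it."
print(standardize(sample))
```

Each break tag becomes one space, so two consecutive tags leave a double space between words; the tokenizer's whitespace splitting makes that harmless.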

In [7]:
# Limit the vocabulary to the 1,000 most frequent words
max_words = 1000
# Not used by the bag-of-words pipeline below
sequence_length = 100

In [8]:
# Create a new tokenizer
tokenizer = Tokenizer(num_words=max_words)

# Fit the tokenizer on the training texts
tokenizer.fit_on_texts(X_train)

# word_index maps every word seen during fitting to an integer ID
dictionary = tokenizer.word_index
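
A rough sketch of how such a word index can be built: count word frequencies and number the words from most to least frequent, starting at 1. This is a simplified stand-in, assuming plain whitespace splitting; the real `Tokenizer` also filters punctuation. `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets ID 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # ID 0 is left unused, so numbering starts at 1.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(["the film the cast", "the film"])
print(word_index)
```

Note that in Keras the `num_words` limit does not shrink `word_index`; it is applied later, when sequences are converted to matrices.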

In [9]:
# Save the dictionary to a JSON file
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

In [10]:
def convert_text_to_index_array(text):
    # `text_to_word_sequence` lowercases the text, strips punctuation and
    # splits it into a list of words. It does not pad or truncate, so each
    # review yields a list of a different length.
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

In [11]:
# Create an empty list that will hold each review's word indices
allWordIndices = []

# for each review, change each token to its ID in the Tokenizer's word_index
for text in X_train:
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

In [12]:
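# After creating the list of index lists, pass it to an array. The reviews
# have different lengths, so dtype=object is needed to hold the ragged lists.
allWordIndices = np.asarray(allWordIndices, dtype=object)

# Last but not least, build a binary bag-of-words matrix of shape
# (n_reviews, max_words) out of the indexed reviews
X_train = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')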
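
What `mode='binary'` produces can be sketched in plain Python: one row per review, one column per word ID, with 1.0 where the word occurs at least once. `sequences_to_binary_matrix` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Build a multi-hot row of length num_words for each index sequence."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, sequence in zip(matrix, sequences):
        for index in sequence:
            # Words whose ID falls outside the vocabulary limit are dropped.
            if index < num_words:
                row[index] = 1.0
    return matrix

rows = sequences_to_binary_matrix([[1, 3, 3, 7], [2]], num_words=5)
print(rows)
```

Repeated words still yield a single 1.0, and word order is lost, which is why every review ends up as a vector of the same length regardless of how long the text is.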

In [13]:
y_train = tf.keras.utils.to_categorical(y_train, 2)
y_train

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       ...,
       [1., 0.],
       [1., 0.],
       [0., 1.]], dtype=float32)
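
The one-hot layout shown above can be reproduced with a short plain-Python sketch. `to_one_hot` is a hypothetical helper and the three labels are made up for the example.

```python
def to_one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label index."""
    return [[1.0 if column == label else 0.0 for column in range(num_classes)]
            for label in labels]

one_hot = to_one_hot([0, 1, 1], num_classes=2)
print(one_hot)
```

With this layout, column 0 corresponds to negative and column 1 to positive, matching the two softmax outputs of the model.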

## Now it's time to create the model

In [14]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

## Compile the network

In [15]:
model.compile(loss='categorical_crossentropy',
  optimizer='adam',
  metrics=['accuracy'])
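
For intuition about the loss chosen here, the following sketch computes categorical cross-entropy for a single sample in plain Python. The function and the `eps` clipping value are illustrative assumptions, not the Keras internals.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Return -sum(t * log(p)) for one sample, clipping p to avoid log(0)."""
    return -sum(t * math.log(min(max(p, eps), 1.0 - eps))
                for t, p in zip(y_true, y_pred))

# True class is "positive" (index 1) in both cases.
loss_good = categorical_crossentropy([0.0, 1.0], [0.1, 0.9])
loss_bad = categorical_crossentropy([0.0, 1.0], [0.9, 0.1])
print(loss_good, loss_bad)
```

A confident correct prediction gives a small loss, while a confident wrong one is penalised heavily.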

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               512512    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
Total params: 644,354
Trainable params: 644,354
Non-trainable params: 0
_________________________________________________________________
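
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size (max_words) followed by the three Dense layer widths.
layer_sizes = [1000, 512, 256, 2]
params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)
print(params, total)
```

Dropout layers add no parameters, so the three Dense layers account for the whole total.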


## Finally, train the model

X_train = np.array(X_train)
X_test = np.array(X_test)

y_train = np.array(y_train)
y_test = np.array(y_test)

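# No reshape is needed: X_train already has shape (n_samples, max_words),
# which matches the model's input_shape. X_test still holds raw text and must
# go through the same tokenizer before it can be used for evaluation.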

In [17]:
X_train[0]

array([0., 1., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1.,
       0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 0.,
       1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1., 0.,
       1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       1., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [18]:
model.fit(X_train, y_train,
  epochs=10,
  verbose=1,
  validation_split=0.1,
  shuffle=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x26134116a60>
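
With `validation_split=0.1`, part of the 40,000 training rows is held out for validation. According to the Keras documentation, the validation samples are taken from the end of the data before shuffling. The sketch below shows only the size arithmetic; `validation_split_sizes` is a hypothetical helper.

```python
import math

def validation_split_sizes(n_samples, validation_split):
    """Return (training rows, validation rows) for a given split fraction."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return split_at, n_samples - split_at

n_train, n_val = validation_split_sizes(40000, 0.1)
print(n_train, n_val)
```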

In [19]:
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)

model.save_weights('model.h5')

In [20]:
# for human-friendly printing
labels = ['negative', 'positive']

# this utility makes sure that all the words in your input
# are registered in the dictionary
# before trying to turn them into a matrix.
def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
        else:
            print("'%s' not in training corpus; ignoring." %(word))
    return wordIndices

# read in your saved model structure
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
# and create a model from that
model = model_from_json(loaded_model_json)
# and weight your nodes with your saved values
model.load_weights('model.h5')

In [None]:
# okay here's the interactive part
while True:
    sentence = input('Input a sentence to be evaluated, or Enter to quit: ')

    if len(sentence) == 0:
        break

    # format your input for the neural net
    testArr = convert_text_to_index_array(sentence)
    final_input = tokenizer.sequences_to_matrix([testArr], mode='binary')
    # predict which bucket your input belongs in
    pred = model.predict(final_input)
    # and print it for the humans
    print(sentence)
    print("%s sentiment; %f%% confidence" % (labels[np.argmax(pred)], pred[0][np.argmax(pred)] * 100))
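
The last line picks the most probable class and reports its probability as a percentage. A plain-Python sketch of that decoding step, using a made-up probability pair and a hypothetical `decode_prediction` helper:

```python
def decode_prediction(probabilities, labels):
    """Return the label with the highest probability and its percentage."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best] * 100

# Hypothetical softmax output for one sentence.
label, confidence = decode_prediction([0.2, 0.8], ['negative', 'positive'])
print("%s sentiment; %f%% confidence" % (label, confidence))
```

The "confidence" is simply the softmax probability of the winning class, so it should be read as a model score rather than a calibrated certainty.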