![image](https://docs.google.com/uc?export=download&id=1NUy1Q-abpoV9XYK9qT9t8Mdhj3ZVlveO)

<table align="center">
  <td>
    <a href="https://colab.research.google.com/github/jpcano1/MINE_4210_Analisis_con_Deep_Learning/blob/master/lab_4/practica_9/practica_9.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

## **Practice 9**

## **Sentiment Analysis**

## **Objectives**
- Introduction to text processing and sentiment analysis.
- Introduction to the Embedding layer.

## **Problem**
- The IMDB movie database contains a set of reviews covering many films. The task is to classify each review according to whether it rates the film in question positively or negatively.

In [1]:
!shred -u setup_colab_general.py
!wget -q "https://github.com/jpcano1/python_utils/raw/main/setup_colab_general.py" -O setup_colab_general.py
import setup_colab_general as setup_general
setup_general.setup_general()

shred: setup_colab_general.py: failed to open for writing: No such file or directory


  0%|          | 0/3 [00:00<?, ?KB/s]

General Functions Enabled Successfully


## **Importing the libraries needed for the lab**

In [21]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use("seaborn-deep")

import io
import os

import re
import shutil
import string
import tensorflow as tf
from tensorflow import keras

from utils import general as gen

from typing import Optional

In [3]:
data_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

In [10]:
%%shell
wget -q "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" -O data.tar.gz
tar -xzf data.tar.gz
rm data.tar.gz



In [11]:
dataset_dir = gen.create_and_verify("aclImdb")

In [31]:
train_dir = gen.create_and_verify(dataset_dir, "train")
test_dir = gen.create_and_verify(dataset_dir, "test")

In [None]:
remove_dir = gen.create_and_verify(train_dir, "unsup")
shutil.rmtree(remove_dir)

## **Analysis and Processing**
- At first glance, this is a text dataset for binary classification, where label 1 marks a positive review and label 0 a negative one.
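
As a rough illustration of where the 0/1 labels come from: `text_dataset_from_directory` infers one class per subdirectory and numbers the classes in alphanumeric order, so `neg` maps to 0 and `pos` maps to 1. The sketch below mimics that rule in plain Python; the helper name `infer_label_map` is hypothetical and not part of Keras.

```python
def infer_label_map(subdirs):
    """Hypothetical helper: map class directory names to integer labels in sorted order."""
    return {name: index for index, name in enumerate(sorted(subdirs))}


# The two class folders left under train/ once "unsup" is removed
label_map = infer_label_map(["pos", "neg"])
print(label_map)  # {'neg': 0, 'pos': 1}
```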

In [40]:
batch_size = 512

train_ds = keras.preprocessing.text_dataset_from_directory(
    train_dir, validation_split=0.2,
    subset="training", seed=1234, batch_size=batch_size,
)

val_ds = keras.preprocessing.text_dataset_from_directory(
    train_dir, validation_split=0.2,
    subset="validation", seed=1234, batch_size=batch_size,
)

test_ds = keras.preprocessing.text_dataset_from_directory(
    test_dir, batch_size=batch_size,
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [41]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        tf.print(label_batch[i].numpy(), text_batch.numpy()[i])

0 (b'This film features two of my favorite guilty pleasures. Sure, the effects ar'
 b'e laughable, the story confused, but just watching Hasselhoff in his Knight '
 b'Rider days is always fun. I especially like the old hotel they used to shoot'
 b' this in, it added to what little suspense was mustered. Give it a 3.')
1 (b'I guess when "Beat Street" made a national appearance, "Flashdance" came at '
 b'the same time. The problem with "Flashdance" is that there was only one brea'
 b'k dancing scene and the rest was jazz dance and ballet. That was one of the '
 b'reasons why "Beat Street" was better. The only movie that could rival "Beat '
 b'Street" seems to be "Footloose", because both movies focused on how dance ha'
 b'd been used by people to express their utmost feelings.<br /><br />The break'
 b'-dance scenes in "Beat Street" come just before the middle and at the end of'
 b' the flick. And I loved all of them. Almost all of the break tricks were fea'
 b'tured in the break jam scen

In [46]:
def performance(
    dataset: Optional[tf.data.Dataset],
    train: bool = True
) -> Optional[tf.data.Dataset]:
    """
    Function to boost dataset load performance
    :param dataset: The (already batched) dataset to be boosted
    :type dataset: Optional[tf.data.Dataset]
    :param train: Whether this is the training dataset
    :type train: bool
    :return: The dataset boosted
    :rtype: Optional[tf.data.Dataset]
    """
    if train:
        # Shuffle the batches through a fixed-size buffer, reshuffling every epoch
        dataset = dataset.shuffle(512, reshuffle_each_iteration=True)
        # Prepare upcoming batches in the background while the current one is consumed
        dataset = dataset.prefetch(tf.data.AUTOTUNE)
    # Repeat the dataset indefinitely; callers must bound iteration with a step count
    dataset = dataset.repeat()
    return dataset

In [42]:
TRAIN_SIZE = len(train_ds)
VAL_SIZE = len(val_ds)
TEST_SIZE = len(test_ds)
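
The three sizes above are numbers of batches, not numbers of reviews. As a quick sanity check, the arithmetic below reproduces them from the file counts printed by the loader, assuming the last partial batch is kept (the variable and helper names are illustrative only).

```python
import math

batch_size = 512
n_train_files = 25000      # files found under train/
n_test_files = 25000       # files found under test/
validation_split = 0.2

n_val = round(n_train_files * validation_split)   # 5000 reviews for validation
n_train = n_train_files - n_val                   # 20000 reviews for training


def n_batches(n_examples, batch_size):
    """Number of batches when the final, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)


print(n_batches(n_train, batch_size),
      n_batches(n_val, batch_size),
      n_batches(n_test_files, batch_size))  # 40 10 49
```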

In [47]:
train_ds = performance(train_ds)
val_ds = performance(val_ds, False)
test_ds = performance(test_ds, False)
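
Because `performance` ends with `repeat()`, the three datasets no longer have a natural end, which is why the step counts computed earlier are passed to `fit` and `evaluate` later on. A minimal plain-Python analogy, using `itertools.cycle` as a stand-in for `repeat()` and made-up batch names:

```python
from itertools import cycle, islice

batches = ["batch_0", "batch_1", "batch_2"]   # stand-in for a finite, batched dataset
repeated = cycle(batches)                      # analogous to dataset.repeat(): never ends

steps_per_epoch = len(batches)                 # must be given explicitly to stop an epoch

# Each "epoch" consumes exactly steps_per_epoch batches from the endless stream
epoch_1 = list(islice(repeated, steps_per_epoch))
epoch_2 = list(islice(repeated, steps_per_epoch))

print(epoch_1)  # ['batch_0', 'batch_1', 'batch_2']
print(epoch_2)  # ['batch_0', 'batch_1', 'batch_2']
```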

In [26]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(
        stripped_html,
        '[%s]' % re.escape(string.punctuation), ''
    )

# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

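# train_ds repeats indefinitely, so limit adapt() to a single pass over the training batches
vectorize_layer.adapt(train_ds.take(TRAIN_SIZE).map(lambda x, y: x))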
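
To make the vectorization step more concrete, here is a simplified plain-Python sketch of what the layer is configured to do: standardize the text, build a frequency-ranked vocabulary, and map each review to a fixed-length list of integers. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary words, as in `TextVectorization` with `output_mode="int"`. The helper names and the toy corpus are illustrative, and tie-breaking between equally frequent words may differ from Keras.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase, replace '<br />' with a space, and strip ASCII punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)


def build_vocab(texts, max_tokens):
    """Rank words by frequency; indices 0 (padding) and 1 (unknown) are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, sequence_length):
    """Map words to ids, then truncate or zero-pad to sequence_length."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = ["Great film!<br />Great cast.", "Bad film."]   # toy corpus
vocab = build_vocab(corpus, max_tokens=10)
encoded = vectorize("Great plot, bad film", vocab, sequence_length=6)
print(encoded)  # [2, 1, 5, 3, 0, 0]
```

In the toy output, `plot` never appeared in the corpus, so it receives the unknown index 1, and the two trailing zeros are padding.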

## **Modeling**
- Here we use the Embedding layer to process the integer sequences produced by text vectorization.

![image](https://miro.medium.com/max/700/1*HQmO2NNle730Uk-45ucPcQ.png)

In [27]:
embedding_dim = 16

model = keras.Sequential([
    vectorize_layer,
    keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
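
The first two layers after vectorization can be pictured as a table lookup followed by a column-wise mean. The sketch below reproduces that idea in plain Python with a tiny, randomly initialized table; the sizes, seed, and helper names are assumptions chosen for illustration, not the trained weights. Since the `Embedding` layer above does not enable masking, padding positions (id 0) are included in the average here as well.

```python
import random

vocab_size, embedding_dim = 6, 4   # toy sizes; the notebook uses 10000 and 16
random.seed(0)

# One trainable vector per token id (random stand-in for learned weights)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Embedding lookup: (sequence_length,) -> (sequence_length, embedding_dim)."""
    return [embedding_table[i] for i in token_ids]


def global_average_pool(vectors):
    """Mean over the sequence axis: (sequence_length, embedding_dim) -> (embedding_dim,)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


token_ids = [2, 1, 5, 3, 0, 0]          # a padded, vectorized review
embedded = embed(token_ids)
pooled = global_average_pool(embedded)

print(len(embedded), len(embedded[0]))  # 6 4
print(len(pooled))                      # 4
```

Pooling yields one fixed-size vector per review regardless of its length, which is what allows the `Dense` layers to follow.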

In [59]:
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        "acc", keras.metrics.Recall(), keras.metrics.Precision()]
)
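
For intuition about the `binary_crossentropy` loss, the following is a minimal plain-Python version of the formula, averaged over examples. The clipping constant `eps` is an assumption to avoid `log(0)`; this is a sketch of the definition, not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # keep probabilities away from 0 and 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Same labels, predictions that are confidently right vs. confidently wrong
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

The loss grows quickly when the model is confident and wrong, so a high loss can coexist with a reasonable accuracy.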

In [60]:
model.fit(
    train_ds, validation_data=val_ds, 
    epochs=15, steps_per_epoch=TRAIN_SIZE,
    validation_steps=VAL_SIZE,
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fb699b61d10>

In [61]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_1 (TextVe (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 161,089
Trainable params: 161,089
Non-trainable params: 0
__________________________________________________
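
The parameter counts in the summary can be checked by hand: the embedding holds one vector per vocabulary entry, and each dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic (the helper name `dense_params` is illustrative):

```python
vocab_size, embedding_dim = 10000, 16


def dense_params(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


params = {
    "embedding": vocab_size * embedding_dim,   # 160000
    "dense": dense_params(16, 32),             # 544
    "dense_1": dense_params(32, 16),           # 528
    "dense_2": dense_params(16, 1),            # 17
}
total = sum(params.values())
print(total)  # 161089
```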

## **Validation**

In [62]:
model.evaluate(test_ds, steps=TEST_SIZE)



[2.0090858936309814, 0.7613599896430969, 0.741599977016449, 0.7721139192581177]
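
`evaluate` returns the loss followed by the metrics in the order they were passed to `compile`, so the list above reads as loss, accuracy, recall, precision. The sketch below labels the values and derives an F1 score from the reported precision and recall; the dictionary keys are labels chosen here, not necessarily the names Keras assigns internally.

```python
metric_names = ["loss", "acc", "recall", "precision"]   # order used in compile()
values = [2.0090858936309814, 0.7613599896430969,
          0.741599977016449, 0.7721139192581177]

results = dict(zip(metric_names, values))

# F1 is the harmonic mean of precision and recall
p, r = results["precision"], results["recall"]
f1 = 2 * p * r / (p + r)

for name, value in results.items():
    print(f"{name}: {value:.4f}")
print(f"f1: {f1:.4f}")
```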

In [72]:
test = np.array([
    '''This was an awful short film that tries to be funny 
    in a dark way but wasn\'t funny at all. Say at a film 
    festival in Chicago. It really is what the title says and 
    I simply wasn\'t into it at all. The bad storytelling was 
    what did it in. If you re-wrote it and re-shot it, it "might" 
    work. This attempt fell in "the hole". Horrible filmmaking.''',
    '''I first saw it at 5am January 1, 2009, and after a day 
    i watched it again and i want to watch it again. Love everything 
    (well, almost, so 9 stars) about it. No color, beautiful naive 
    stories, funny gangsters, Anna, camera work, music. Well, 
    sometimes you just want to listen little bit longer and the music 
    just stops. But this is not a musical after all. I like Anna's 
    acting, this naive wannabe gangster girl, how she speaks, 
    holds the gun, everything makes me smile. No, it's not that 
    funny, though i have laughed a bit at some moments, it's 
    just so subtle. Excellent work by Samuel Benchetrit. Though 
    3d nouvelle seems weaker, but they are also gangsters, maybe 
    even worse, cause they are stealing ideas. And the last scene 
    is my favorite. Makes me feel so warm and.. romantic. Yes, 
    i would recommend this movie for the romantic souls with a taste 
    for such art-housish movies. And i don't agree with those 
    comparing it to Pulp Fiction. It's not about action and twisted 
    story, though all vignettes intersect. It's calm, and maybe too 
    slow movie for most of the people. It's about characters, their 
    feelings, very subtle. Anyway, probably this review won't be of 
    much help to anyone (my first), just wanted to express my 
    appreciation.<br /><br />SPOILER: This movie doesn't have a 
    Goofs section. Wonder, didn't anybody notice that hand in the 2 
    part when the kidnappers decided to go home? Looks like a part 
    of crew, hehe. I know i should better post this in forums, but 
    i don't agree with some policies here.''',
    '''jeez, this was immensely boring. the leading man 
    (Christian Schoyen) has got to be the worst actor i have ever 
    seen. and another thing, if the character in the movie moved to 
    America when he was ten or something and had been living here 
    for over 20 years, he would speak a lot better English than 
    what he pulls of here. or to say it in my own Language "Skikkelig 
    gebrokkent". But it is cool to see Norwegian dudes in a movie 
    made in Hollywood. it was just a damn shame they were talentless 
    hacks. The storyline itself is below mediocre. I have a suspicion 
    that Christian Schoyen did this movie just to live the dream, as 
    he clearly does in the film by humping one beautiful babe after 
    another.''',
])

In [73]:
model.predict(test)

array([[1.4060736e-04],
       [9.9995989e-01],
       [8.7552056e-05]], dtype=float32)
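
The sigmoid outputs are probabilities of the positive class. One way to turn them into labels is a 0.5 threshold, as in this small sketch (the threshold and the label strings are assumptions for illustration):

```python
# Probabilities copied from the prediction output above
probabilities = [[1.4060736e-04], [9.9995989e-01], [8.7552056e-05]]

threshold = 0.5
labels = ["positive" if row[0] >= threshold else "negative" for row in probabilities]
print(labels)  # ['negative', 'positive', 'negative']
```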