# Sentiment analysis on a dataset of movie "reviews"

We are going to build an AI agent that learns to tell whether a comment about a movie is positive or negative.

In [14]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [2]:
with open('./reviews.txt', 'r') as f:
    reviews = f.read()
with open('./labels.txt', 'r') as f:
    labels = f.read()

In [3]:
reviews[:2000]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is tu

## Data preprocessing

First we do some preprocessing (very common in any ML project): cleaning up the text, removing unknown words, adding tokens for punctuation marks, and so on. In this notebook the cleanup is limited to stripping punctuation and keeping the first 2000 reviews.

We then convert the words to a numeric vocabulary, and likewise the ratings: positive becomes 1 and negative becomes 0.
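The mapping from words and ratings to numbers can be sketched on a toy corpus. This is a minimal illustration rather than the notebook's exact code: the two sample sentences and the helper name `encode_multi_hot` are assumptions made for the example, and the vocabulary is sorted so that the indices are reproducible across runs (iterating over a plain `set` of strings can give a different order on each run).

```python
from string import punctuation

import numpy as np

# Toy data (hypothetical, not taken from reviews.txt)
toy_reviews = ["great movie, loved it!", "boring movie."]
toy_labels = ["positive", "negative"]

# Strip punctuation character by character
cleaned = ["".join(c for c in r if c not in punctuation) for r in toy_reviews]

# Index 0 is reserved for padding; sorted() makes the mapping deterministic
vocab = sorted({w for r in cleaned for w in r.split()})
vocab_to_int = {"<PAD>": 0}
vocab_to_int.update({w: i for i, w in enumerate(vocab, 1)})


def encode_multi_hot(review, vocab_to_int):
    """Return a binary vector with a 1 at the index of every word present."""
    vec = np.zeros(len(vocab_to_int))
    for word in review.split():
        vec[vocab_to_int[word]] = 1
    return vec


features = np.array([encode_multi_hot(r, vocab_to_int) for r in cleaned])
targets = np.array([1 if label == "positive" else 0 for label in toy_labels])

print(vocab_to_int)
print(features)
print(targets)
```

Each row of `features` has one column per vocabulary entry, so word order and word counts are discarded; only the presence of a word is kept.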

In [4]:
from string import punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
reviews = all_text.split('\n')[0:2000] #limit size for demo

all_text = ' '.join(reviews)
words = all_text.split()

In [5]:
all_text[:2000]

'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent m

In [6]:
words[:100]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me',
 'to',
 'believe',
 'that',
 'bromwell',
 'high',
 's',
 'satire',
 'is',
 'much',
 'closer',
 'to',
 'reality',
 'than',
 'is',
 'teachers',
 'the',
 'scramble',
 'to',
 'survive',
 'financially',
 'the',
 'insightful',
 'students',
 'who',
 'can',
 'see',
 'right',
 'through',
 'their',
 'pathetic',
 'teachers',
 'pomp',
 'the',
 'pettiness',
 'of',
 'the',
 'whole',
 'situation',
 'all',
 'remind',
 'me',
 'of',
 'the',
 'schools',
 'i',
 'knew',
 'and',
 'their',
 'students',
 'when',
 'i',
 'saw',
 'the',
 'episode',
 'in',
 'which',
 'a',
 'student',
 'repeatedly',
 'tried',
 'to',
 'burn',
 'down',
 'the',
 'school',
 'i',
 'immediately',
 'recalled',
 'at',
 'high']

In [7]:
# Convert words to numbers (index 0 is reserved for the <PAD> token)
vocab_to_int = dict()
int_to_vocab = dict()
for index, word in enumerate(set(words), 1):
    vocab_to_int[word] = index
    int_to_vocab[index] = word
vocab_to_int["<PAD>"] = 0
int_to_vocab[0] = "<PAD>"

vocab_size = len(vocab_to_int)

reviews_ints = []
reviews_one_hot = []
for review in reviews:
    # Encode the review as a list of vocabulary indices
    review_ints = [vocab_to_int[word] for word in review.split()]
    reviews_ints.append(review_ints)

    # Binary bag-of-words vector: 1 at every index whose word appears in the review
    one_hot = np.zeros(vocab_size)
    for int_word in review_ints:
        one_hot[int_word] = 1

    reviews_one_hot.append(one_hot)

In [8]:
# Convert the labels to numbers: positive -> 1, anything else -> 0
labels_list = labels.split('\n')[0:2000]
labels = [1 if label == 'positive' else 0 for label in labels_list]
labels = np.array(labels)

In [9]:
from collections import Counter
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 1853


The output shows no empty reviews in this 2000-review subset. An empty review would have to be removed, so the filter below is kept as a safeguard.

In [10]:
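# Filter out any review with 0 length, keeping features and labels aligned
non_zero_reviews = []
non_zero_one_hot = []
non_zero_labels = []

for index, review in enumerate(reviews_ints):
    if len(review) > 0:
        non_zero_reviews.append(review)
        non_zero_one_hot.append(reviews_one_hot[index])
        non_zero_labels.append(labels[index])

# dtype=object because the reviews have different lengths (ragged lists)
reviews_ints = np.array(non_zero_reviews, dtype=object)
reviews_one_hot = non_zero_one_hot
labels = np.array(non_zero_labels)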



## Training, Validation, Test



It is very common to split the dataset into training, validation, and test sets. Here we only make an 80/20 split into a training set and a validation set; no separate test set is held out.
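A full three-way split could look like the sketch below. The 80/10/10 fractions, the fixed seed, the shuffling step, and the helper name `train_val_test_split` are all assumptions for illustration; the notebook's own cell uses a plain 80/20 slice without shuffling.

```python
import numpy as np


def train_val_test_split(x, y, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle x and y together, then slice into train/validation/test parts."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(x))
    x, y = x[order], y[order]

    train_end = int(len(x) * train_frac)
    val_end = train_end + int(len(x) * val_frac)

    return (
        (x[:train_end], y[:train_end]),
        (x[train_end:val_end], y[train_end:val_end]),
        (x[val_end:], y[val_end:]),
    )


# Dummy features and labels standing in for reviews_one_hot and labels
demo_x = np.arange(40).reshape(20, 2)
demo_y = np.arange(20) % 2

train, val, test = train_val_test_split(demo_x, demo_y)
print(train[0].shape, val[0].shape, test[0].shape)
```

Shuffling with a single permutation keeps every feature row paired with its label, and the fixed seed makes the split repeatable.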

In [16]:
split_frac = 0.8

split_index  = int(len(reviews_one_hot)*split_frac)

train_x, val_x = np.array( reviews_one_hot[:split_index]), np.array(reviews_one_hot[split_index:])
train_y, val_y = labels[:split_index],labels[split_index:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape))

			Feature Shapes:
Train set: 		(1600, 22541) 
Validation set: 	(400, 22541)


## Training

Now we train a logistic regression classifier.
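For intuition, a fitted binary logistic regression scores an input as the sigmoid of a weighted sum of its features and predicts class 1 when that probability is above 0.5. The sketch below shows that computation with made-up parameters; `weights`, `bias`, and `predict_proba_positive` are hypothetical names for this example, not part of scikit-learn's API.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba_positive(x, weights, bias):
    """Probability of the positive class for each row of x."""
    return sigmoid(x @ weights + bias)


# Made-up parameters for a 3-word vocabulary (illustrative only)
weights = np.array([2.0, -3.0, 0.0])
bias = 0.0

# Two multi-hot rows: the first contains word 0, the second contains word 1
x = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

proba = predict_proba_positive(x, weights, bias)
predictions = (proba > 0.5).astype(int)
print(proba, predictions)
```

A word with a positive weight pushes the review toward class 1, a negative weight pushes it toward class 0, and a zero weight leaves the score unchanged.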

In [17]:
logistic_classifier = LogisticRegression()
logistic_classifier.fit(train_x,train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [22]:
train_prediction = logistic_classifier.predict(train_x)

In [24]:
print("Classification report for classifier %s:\n%s\n"
      % (logistic_classifier, metrics.classification_report(train_y, train_prediction)))

Classification report for classifier LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False):
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       800
          1       1.00      1.00      1.00       800

avg / total       1.00      1.00      1.00      1600




In [25]:
validation_prediction = logistic_classifier.predict(val_x)

In [26]:
print("Classification report for classifier %s:\n%s\n"
      % (logistic_classifier, metrics.classification_report(val_y, validation_prediction)))

Classification report for classifier LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False):
             precision    recall  f1-score   support

          0       0.84      0.84      0.84       200
          1       0.84      0.83      0.84       200

avg / total       0.84      0.84      0.84       400


