# Introduction to NLP

To illustrate sentiment analysis we will analyze a dataset of movie reviews, which contains 25,000 reviews in the training set and 25,000 reviews in the test set. For more information, see the following website:
http://ai.stanford.edu/~amaas/data/sentiment/

## Downloading the data

The following Linux commands download the data.




In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2021-11-27 02:48:12--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2021-11-27 02:48:15 (39.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Preparing the data

Next we consolidate the data, since the reviews for each sentiment are split across many individual files, as the following Linux commands show.

In [2]:
!apt-get install tree

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tree is already the newest version (1.7.0-5).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


In [3]:
!tree --filelimit=50  ../data

../data
├── aclImdb
│   ├── imdbEr.txt
│   ├── imdb.vocab
│   ├── README
│   ├── test
│   │   ├── labeledBow.feat
│   │   ├── neg [12500 entries exceeds filelimit, not opening dir]
│   │   ├── pos [12500 entries exceeds filelimit, not opening dir]
│   │   ├── urls_neg.txt
│   │   └── urls_pos.txt
│   └── train
│       ├── labeledBow.feat
│       ├── neg [12500 entries exceeds filelimit, not opening dir]
│       ├── pos [12500 entries exceeds filelimit, not opening dir]
│       ├── unsup [50000 entries exceeds filelimit, not opening dir]
│       ├── unsupBow.feat
│       ├── urls_neg.txt
│       ├── urls_pos.txt
│       └── urls_unsup.txt
└── aclImdb_v1.tar.gz

8 directories, 12 files


In [4]:
!ls ../data/aclImdb/train/pos/ | head 

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
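The file names listed above follow the dataset's `[id]_[rating].txt` convention, where the rating is the 1 to 10 score given by the reviewer; the dataset README states that negative reviews have a score of at most 4 and positive reviews a score of at least 7. A minimal sketch of how that information could be recovered from a path follows. The helper names `parse_review_filename` and `polarity_from_rating` are hypothetical and are not used elsewhere in this notebook.

```python
import os

def parse_review_filename(path):
    """Split a path such as '.../pos/10001_10.txt' into (review_id, rating)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)

def polarity_from_rating(rating):
    """Map a 1-10 rating to the binary label used in this notebook."""
    if rating >= 7:
        return 1  # positive review
    if rating <= 4:
        return 0  # negative review
    raise ValueError("ratings 5 and 6 are not part of the labeled set")

review_id, rating = parse_review_filename("../data/aclImdb/train/pos/10001_10.txt")
print(review_id, rating, polarity_from_rating(rating))
```

The notebook itself only uses the `pos`/`neg` folder name as the label, so the rating is extra information that this sketch merely exposes.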


In [5]:
!cat ../data/aclImdb/train/pos/0_9.txt

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

The following code stores the movie reviews and their polarities in dictionaries.

In [6]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [7]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))


IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


The following code processes the data again, consolidating the reviews and the labels into lists.

In [8]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return the unified training data, test data, training labels, and test labels
    return data_train, data_test, labels_train, labels_test


In [9]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))


IMDb reviews (combined): train = 25000, test = 25000
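The key property of the shuffle step is that reviews and labels are permuted with the same ordering, so each review keeps its own label. The sketch below illustrates that idea with the standard library only; `shuffle_together` is a hypothetical stand-in for `sklearn.utils.shuffle`, and the `seed` argument is an assumption added here for reproducibility (the notebook calls `shuffle` without a `random_state`, so its ordering changes between runs).

```python
import random

def shuffle_together(reviews, labels, seed=0):
    """Shuffle two parallel lists with one shared permutation."""
    order = list(range(len(reviews)))
    random.Random(seed).shuffle(order)  # seeded generator, so the result is repeatable
    return [reviews[i] for i in order], [labels[i] for i in order]

reviews = ["loved it", "hated it", "great cast", "awful plot"]
labels = [1, 0, 1, 0]

shuffled_reviews, shuffled_labels = shuffle_together(reviews, labels, seed=42)

# Each review must still be paired with its original label
original_pairs = dict(zip(reviews, labels))
print(all(original_pairs[r] == l for r, l in zip(shuffled_reviews, shuffled_labels)))
```

With scikit-learn, passing `random_state` to `shuffle` should give the same kind of repeatability.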


In [10]:
train_X[100]

"Let me being by saying the I followed watching this video by watching Saw and after Bleed, Saw looked like the all time greatest horror flick ever even though I thought it was only fairly good. Bleed is pretty bad. The best part is seeing the female cast nude. The gore is very fake looking and over-done. It has its funny parts but its extremely predictable and I didn't want to stay to see the horrible ending. If I could, I would ban these actors and actresses, the only reason being is that Debbie Rochon (Maddy) has been in over a hundred other videos and I've also seen two other members of the cast in equally or worse motion pictures. They should not allowed to continue this madness."

Before arriving at the final representation, known as bag-of-words, we perform one last preprocessing step: lowercasing the text and removing HTML line-break tags and punctuation. We use regular expressions to do this.

In [11]:
import re

# Raw strings avoid "invalid escape sequence" warnings in recent Python versions
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def review_to_words(review):
    words = REPLACE_NO_SPACE.sub("", review.lower())
    words = REPLACE_WITH_SPACE.sub(" ", words)
    return words

In [12]:
review_to_words(train_X[100])

'let me being by saying the i followed watching this video by watching saw and after bleed saw looked like the all time greatest horror flick ever even though i thought it was only fairly good bleed is pretty bad the best part is seeing the female cast nude the gore is very fake looking and over done it has its funny parts but its extremely predictable and i didnt want to stay to see the horrible ending if i could i would ban these actors and actresses the only reason being is that debbie rochon maddy has been in over a hundred other videos and ive also seen two other members of the cast in equally or worse motion pictures they should not allowed to continue this madness'
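To see the two substitution passes at work on a review that actually contains HTML, here is a small self-contained sketch. It assumes a character class as a compact equivalent of the punctuation alternation used above, and the sample strings are invented for illustration.

```python
import re

# Punctuation that is deleted outright (same characters as the notebook's pattern)
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Doubled <br /> tags, hyphens and slashes become a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def review_to_words(review):
    words = REPLACE_NO_SPACE.sub("", review.lower())
    return REPLACE_WITH_SPACE.sub(" ", words)

sample = "Great film!<br /><br />Well-acted, isn't it?"
print(review_to_words(sample))

# Caveat: only *doubled* <br /> tags are matched, so a single tag leaves residue
print(repr(review_to_words("a<br />b")))
```

The second call shows a limitation worth knowing about: a lone `<br />` is not matched as a tag, so only its slash is replaced and the angle brackets survive.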

In [13]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_web_app")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [14]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl
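The preprocessing helper above follows a load-or-compute pattern: try to read a pickle, and only do the expensive work (then write the pickle) when that fails. The sketch below isolates the pattern; `load_or_compute` and `expensive_cleaning` are hypothetical names, and a temporary directory is used instead of the notebook's `../cache` folder.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached result if the file exists, otherwise compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_cleaning():
    calls.append(1)  # record every time the slow path actually runs
    return ["cleaned review one", "cleaned review two"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    first = load_or_compute(path, expensive_cleaning)   # computes and writes
    second = load_or_compute(path, expensive_cleaning)  # reads from the cache

print(first == second, len(calls))
```

One consequence of this design is visible in the output above: when a cache file exists, the function returns the cached reviews and labels and ignores the arguments it was given, so a stale cache has to be deleted by hand after changing the preprocessing.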


## Building the bag-of-words representation

We use the bag-of-words representation, which gives the reviews a matrix structure: each row is a review, each column represents a word, and each entry records how many times that word occurs in the review (a 1 when the word appears once, 0 when it is absent).
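As a toy illustration of that matrix, the following sketch builds a bag-of-words by hand with the standard library. It is not `CountVectorizer`: `fit_vocabulary` and `transform` are hypothetical helpers that split on whitespace, whereas scikit-learn applies its own tokenization rules.

```python
from collections import Counter

def fit_vocabulary(docs, max_features=None):
    """Keep the most frequent words and assign each one a column index."""
    counts = Counter(word for doc in docs for word in doc.split())
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}

def transform(docs, vocabulary):
    """Turn each document into a row of word counts; unknown words are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            idx = vocabulary.get(word)
            if idx is not None:
                row[idx] += 1
        rows.append(row)
    return rows

train_docs = ["good movie good plot", "bad movie"]
vocab = fit_vocabulary(train_docs)
print(vocab)                                   # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(transform(train_docs, vocab))            # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(transform(["good unseen word"], vocab))  # [[0, 1, 0, 0]]
```

The last line mirrors what happens to the test set: the vocabulary is fitted on training documents only, and words that were never seen simply contribute nothing.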

In [15]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. 
# from sklearn.externals import joblib

# Import joblib package directly
import joblib
# joblib is an enhanced version of pickle that is more efficient for storing NumPy arrays

def extract_BoW_features(words_train, words_test, vocabulary_size=5000,
                         cache_dir=cache_dir, cache_file="bow_features.pkl"):
    """Extract Bag-of-Words for a given set of documents, already preprocessed into words."""
    
    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = joblib.load(f)
            print("Read features from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Fit a vectorizer to training documents and use it to transform them
        # NOTE: Each training document is a cleaned string; CountVectorizer's default
        #       analyzer splits it into word tokens
        vectorizer = CountVectorizer(max_features=vocabulary_size)
        features_train = vectorizer.fit_transform(words_train).toarray()

        # Apply the same vectorizer to transform the test documents (ignore unknown words)
        features_test = vectorizer.transform(words_test).toarray()
        
        # NOTE: .toarray() converts the sparse matrices above into dense NumPy arrays
        
        # Write to cache file for future runs (store vocabulary as well)
        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train=features_train, features_test=features_test,
                             vocabulary=vocabulary)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                joblib.dump(cache_data, f)
            print("Wrote features to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        features_train, features_test, vocabulary = (cache_data['features_train'],
                cache_data['features_test'], cache_data['vocabulary'])
    
    # Return both the extracted features as well as the vocabulary
    return features_train, features_test, vocabulary

In [16]:


# Extract Bag of Words features for both training and test datasets
train_X, test_X, vocabulary = extract_BoW_features(train_X, test_X)



Read features from cache file: bow_features.pkl


In [17]:
train_X.shape

(25000, 5000)

In [18]:
25000/5000

5.0

With 25,000 reviews and 5,000 vocabulary words, we have only 5 observations per word in our bag-of-words representation, so models with fewer parameters may have an advantage.

## Modeling the data

Before modeling we use pandas to store the data, which so far are held as a NumPy array (features) and a Python list (labels).

In [19]:
import pandas as pd

# Earlier we shuffled the training dataset so to make things simple we can just assign
# the first 10 000 reviews to the validation set and use the remaining reviews for training.
val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])


In [20]:
val_X = val_X.to_numpy()
train_X = train_X.to_numpy()

val_y = val_y.to_numpy()
train_y = train_y.to_numpy()


In [21]:
train_y = train_y.flatten()
val_y = val_y.flatten()

In [25]:
val_y.shape

(10000,)
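The hold-out split used above amounts to plain slicing: because the training data were shuffled earlier, the first rows can serve as the validation set. A minimal sketch with a hypothetical helper, `holdout_split`, shown on small lists (slicing behaves the same way on NumPy arrays, so the pandas round trip is a convenience rather than a requirement):

```python
def holdout_split(features, labels, n_val):
    """Use the first n_val rows for validation and the remainder for training."""
    return features[n_val:], labels[n_val:], features[:n_val], labels[:n_val]

toy_X = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]
toy_y = [1, 0, 1, 0, 1]

X_tr, y_tr, X_val, y_val = holdout_split(toy_X, toy_y, n_val=2)
print(len(X_tr), len(X_val))  # 3 2
print(y_val)                  # [1, 0]
```

In the notebook the same idea is applied with `n_val = 10000`, leaving 15,000 reviews for training.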

### Logistic Regression (no regularization)

In [36]:

from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score


In [37]:
from sklearn.linear_model import LogisticRegression
# Note: scikit-learn >= 1.2 expects penalty=None; the string 'none' was removed in 1.4
clf = LogisticRegression(random_state=42, max_iter=1000, penalty='none').fit(train_X, train_y)
probs = clf.predict_proba(val_X)
probs = probs[:,1].flatten()
auc = roc_auc_score(val_y, probs)
print('AUC: %.3f' % auc)
y_pred = clf.predict(val_X)
print('Accuracy %.3f' % accuracy_score(val_y, y_pred)) 

AUC: 0.872
Accuracy 0.829
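The two metrics printed above answer different questions. AUC measures how well the predicted probabilities rank positive reviews above negative ones, while accuracy depends on the 0.5 decision threshold, so a model can do better on one and worse on the other. The pure-Python sketch below makes the definitions concrete; `pairwise_auc` and `accuracy_at` are hypothetical helpers, not scikit-learn functions, and the scores are invented.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy_at(y_true, scores, threshold=0.5):
    """Share of correct predictions when scores above the threshold mean class 1."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.7, 0.6, 0.9]

print("AUC: %.3f" % pairwise_auc(y_true, scores))      # AUC: 0.833
print("Accuracy %.3f" % accuracy_at(y_true, scores))   # Accuracy 0.800
```

The quadratic loop is only meant for small examples; on 10,000 validation reviews `roc_auc_score` is the practical choice.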


### Linear Discriminant Analysis

In [38]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(train_X, train_y)

LinearDiscriminantAnalysis()

Linear discriminant analysis does not scale well here: it estimates $(K-1)\times(p+1)$ parameters and must invert the $p \times p$ variance-covariance matrix, which costs on the order of $O(p^3)$ operations. The following reference gives details on how the variance-covariance matrix is estimated. Another problem is that the normality assumption does not hold for these data.

https://stats.stackexchange.com/questions/90615/estimating-the-covariance-matrix-in-linear-discriminant-analysis
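To make the role of the covariance matrix explicit, recall the standard textbook form of the model. The notation here is an assumption of this note, not something defined earlier in the notebook: class means $\mu_k$, a covariance matrix $\Sigma$ shared by all classes, class priors $\pi_k$, and $N$ training observations. The discriminant function for class $k$ is

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$

and the shared covariance is estimated by pooling the within-class scatter:

$$\hat{\Sigma} = \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$

With $p = 5000$ predictors, $\hat{\Sigma}$ has $5000 \times 5000 = 25$ million entries, which is where the memory and inversion costs come from. For $K = 2$ classes, the difference $\delta_1(x) - \delta_0(x)$ is linear in $x$ with $p + 1$ coefficients, matching the $(K-1)\times(p+1)$ count.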

Next we compute the area under the ROC curve.

In [40]:
probs = clf.predict_proba(val_X)
probs = probs[:,1].flatten()
auc = roc_auc_score(val_y, probs)
print('AUC: %.3f' % auc)
y_pred = clf.predict(val_X)
print('Accuracy %.3f' % accuracy_score(val_y, y_pred)) 

AUC: 0.885
Accuracy 0.813


### Naive Bayes 

We use a variant of Naive Bayes called Bernoulli Naive Bayes. The following are good references.

https://scikit-learn.org/stable/modules/naive_bayes.html

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
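As a rough illustration of what the Bernoulli variant does, the sketch below implements the idea in plain Python: counts are first reduced to presence/absence, each class stores a smoothed probability that each word is present, and absent words also contribute to the score. The function names are hypothetical and the toy data are invented; the smoothing formula `(count + alpha) / (n_class + 2 * alpha)` is the usual Laplace form, assumed here rather than taken from the notebook.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(word present | class) for each class."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        probs = []
        for j in range(n_features):
            present = sum(1 for r in rows if r[j] > 0)  # binarize the counts
            probs.append((present + alpha) / (len(rows) + 2 * alpha))
        model[c] = (log_prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest log joint probability."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for xj, p in zip(x, probs):
            # Present words add log(p); absent words add log(1 - p)
            score += math.log(p) if xj > 0 else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Columns: counts of "good", "bad", "movie"
toy_X = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 0]]
toy_y = [1, 1, 0, 0]

model = fit_bernoulli_nb(toy_X, toy_y)
print(model[1][1])                 # [0.75, 0.25, 0.5]
print(predict(model, [1, 0, 1]))   # 1
print(predict(model, [0, 2, 0]))   # 0
```

scikit-learn's `BernoulliNB` performs the binarization itself (its `binarize` parameter defaults to `0.0`), which is why the count matrix can be passed to it unchanged.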

In [41]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(train_X, train_y)

BernoulliNB()

In [42]:
probs = clf.predict_proba(val_X)
probs = probs[:,1].flatten()
auc = roc_auc_score(val_y, probs)
print('AUC: %.3f' % auc)
y_pred = clf.predict(val_X)
print('Accuracy %.3f' % accuracy_score(val_y, y_pred)) 

AUC: 0.916
Accuracy 0.843


## Conclusions

* The best method turns out to be Bernoulli Naive Bayes, which assumes that the predictors are Boolean and therefore takes advantage of the peculiarities of this dataset, even though it has barely one parameter fewer than logistic regression.

* These methods could be pushed further by using another representation such as tf-idf, which may reduce the dimensionality of the predictor space.

* An advantage of Bernoulli Naive Bayes is that, having few parameters, it can lower the cost of deploying the model to production in the cloud, since it uses less memory.

