# Technical Test - Data Scientist - NLP Test 2

## Classification of Positive and Negative Reviews

#### How customers perceive the content offered by a company or platform matters greatly for improving its processes and content. This perception is not necessarily captured through surveys with closed-ended questions, so methods from Natural Language Processing (NLP) are needed. NLP has in turn become important to the development of AI.

#### This document analyzes reviews written by critics and fans about certain content; each review is labeled as positive or negative. The labels are polarized, and the goal is to identify when a review is positive or negative. This can be done with Sentiment Analysis, using Supervised Learning techniques, because the data are labeled.

#### 1. Load the packages required for the analysis

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

from sklearn.model_selection import train_test_split, cross_val_predict, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

#### Given how important modeling is in Data Science for a production team, the Pipeline of the model to be implemented must be specified. The Pipeline for this work is as follows:

#### 1. Splitting the reviews.
#### 2. Preprocessing the reviews (tokenization, normalization, stop-word removal, vectorization, tf-idf representation).
#### 3. Modeling the reviews
#### 4. Training and testing the model
#### 5. Evaluating the model
#### 6. Deploying the model
#### 7. Monitoring and maintenance
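The vectorization and modeling steps listed above can also be chained in a single scikit-learn `Pipeline` object, so that they are always fitted and applied together. The following is only a minimal sketch under stated assumptions: the toy reviews and the names `toy_texts`, `toy_labels` and `sentiment_pipeline` are hypothetical and are not part of this notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Toy reviews (hypothetical, only to make the sketch runnable); 1 = positive, 0 = negative
toy_texts = [
    "a wonderful and moving film",
    "a dull and boring movie",
    "wonderful acting and a moving story",
    "boring plot and dull acting",
]
toy_labels = [1, 0, 1, 0]

# Chain vectorization and classification so both are fitted together
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", stop_words="english")),
    ("clf", BernoulliNB(binarize=0.0)),
])
sentiment_pipeline.fit(toy_texts, toy_labels)

# New texts go through the same vectorizer before reaching the classifier
toy_predictions = sentiment_pipeline.predict(
    ["a moving and wonderful story", "dull and boring"]
)
print(toy_predictions)
```

One reason to prefer this design is that the vocabulary learned during `fit` is reused automatically at prediction time, which avoids refitting the vectorizer on new data by mistake.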

### 0. Data preparation

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/hec-gallego/nlp-prueba2/main/NLP%20prueba%202.csv')
df = df[['text', 'tag']]
df.head()

Unnamed: 0,text,tag
0,"in exotica everybody is watching , and what is...",pos
1,some of the gags are so carefully innocuous th...,neg
2,press junkets are a haven for control freaks .,neg
3,"then i realized he was , and i was watching it .",neg
4,uh huh .,neg


In [4]:
df['tag'].value_counts()

neg    25434
pos     3968
Name: tag, dtype: int64

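As the counts show, the label is imbalanced: there are roughly six negative reviews for each positive one. It is still possible to work with it. The next step is to binarize the label, as done in the following cell. Note that `np.where` assigns 0 to every value other than `pos`, so any row whose tag is missing or different from `neg` is also counted as negative; this would explain why the number of zeros after binarizing (25442) is larger than the number of `neg` tags (25434).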

In [5]:
import numpy as np
df['tag'] = np.where(df['tag']=='pos', 1, 0)
df['tag'].value_counts()

0    25442
1     3968
Name: tag, dtype: int64

### 1. Splitting the reviews

Splitting the reviews means setting aside one data set for training (Train) and another for testing (Test). Here a 70/30 split is used: 70% of the data for training and the remaining 30% for testing.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['tag'], test_size = 0.3, random_state = 1)

print(f'Training dimensions:{X_train.shape, y_train.shape}')
print(f'Test dimensions:{X_test.shape, y_test.shape}')
# The original index is kept here; it is reset when the train and test frames are built.

# It is also worth checking the label distribution
print(y_train.value_counts())
print(y_test.value_counts())

Training dimensions:((20587,), (20587,))
Test dimensions:((8823,), (8823,))
0    17784
1     2803
Name: tag, dtype: int64
0    7658
1    1165
Name: tag, dtype: int64
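Because the label is imbalanced, one option is to make the split stratified, so that both partitions keep the same share of positives. The sketch below is an illustration on a hypothetical toy frame (`toy`), not on the notebook's data; `stratify` is a standard parameter of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced label (20% positives)
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "tag": [1] * 20 + [0] * 80,
})

# stratify=toy["tag"] preserves the class proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"], toy["tag"], test_size=0.3, random_state=1, stratify=toy["tag"]
)

# Share of positives in each partition
print(y_tr.mean(), y_te.mean())
```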


### 2. Review preprocessing

In [7]:
train = pd.concat([X_train, y_train], axis = 1)
test = pd.concat([X_test, y_test], axis = 1)
train.reset_index(drop=True, inplace = True)
test.reset_index(drop=True, inplace = True)
train.head()

Unnamed: 0,text,tag
0,and singer sinead o'connor appears rather effe...,0
1,it's time to take cover .,0
2,the director's negative view on the catholic c...,0
3,uh huh .,0
4,he's covered in dirt and has a gun to her neck...,1


In [8]:
# Used to remove punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
# 2.1. Tokenization, ignoring punctuation marks
def remove_punctuation(text):
    punctuationfree="".join([i for i in str(text) if i not in string.punctuation])
    return punctuationfree

train['clean_text']= train['text'].apply(lambda x:remove_punctuation(x))
train['text_lower']= train['clean_text'].apply(lambda x: x.lower())

import re
def tokenization(text):
    tokens = re.split(r'\s+',str(text))
    return tokens

train['text_tokenied']= train['text_lower'].apply(lambda x: tokenization(x))

# 2.2. Normalization
# 2.2.1. Stop-word removal
# Stored as a set under a new name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    output= [i for i in text if i not in stop_words]
    return output

train['no_stopwords']= train['text_tokenied'].apply(lambda x:remove_stopwords(x))
# 2.2.2. Lemmatization (WordNetLemmatizer was already imported above)
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text

train['text_lemmatized']=train['no_stopwords'].apply(lambda x:lemmatizer(x))

train.head()

Unnamed: 0,text,tag,clean_text,text_lower,text_tokenied,no_stopwords,text_lemmatized
0,and singer sinead o'connor appears rather effe...,0,and singer sinead oconnor appears rather effec...,and singer sinead oconnor appears rather effec...,"[and, singer, sinead, oconnor, appears, rather...","[singer, sinead, oconnor, appears, rather, eff...","[singer, sinead, oconnor, appears, rather, eff..."
1,it's time to take cover .,0,its time to take cover,its time to take cover,"[its, time, to, take, cover, ]","[time, take, cover, ]","[time, take, cover, ]"
2,the director's negative view on the catholic c...,0,the directors negative view on the catholic ch...,the directors negative view on the catholic ch...,"[the, directors, negative, view, on, the, cath...","[directors, negative, view, catholic, church, ...","[director, negative, view, catholic, church, i..."
3,uh huh .,0,uh huh,uh huh,"[uh, huh, ]","[uh, huh, ]","[uh, huh, ]"
4,he's covered in dirt and has a gun to her neck...,1,hes covered in dirt and has a gun to her neck ...,hes covered in dirt and has a gun to her neck ...,"[hes, covered, in, dirt, and, has, a, gun, to,...","[hes, covered, dirt, gun, neck, little, crampe...","[he, covered, dirt, gun, neck, little, cramped..."
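The cleaning steps above can be condensed into one function. The sketch below is a simplified variant, not the notebook's exact code: it uses a small hypothetical stop-word set (`TOY_STOPWORDS`) instead of NLTK's English list, omits lemmatization so that it runs without downloads, and tokenizes with `str.split()`, which avoids the empty trailing token visible in the output above.

```python
import string

# Small illustrative stop-word set (an assumption; the notebook uses NLTK's English list)
TOY_STOPWORDS = {"the", "to", "and", "a", "in", "on", "is", "its"}

# Translation table that deletes every punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Strip punctuation, lowercase, tokenize on whitespace and drop stop words."""
    cleaned = str(text).translate(PUNCT_TABLE).lower()
    tokens = cleaned.split()  # split() with no argument never yields empty tokens
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

print(preprocess("it's time to take cover ."))
```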


In [44]:
# 2.3. Vectorization
countvectorizer = CountVectorizer(analyzer = 'word', stop_words = 'english')
tfidfvectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english')

# Build the document-term matrices
train_count = countvectorizer.fit_transform(train['text_lower'])
train_tfidf = tfidfvectorizer.fit_transform(train['text_lower'])

In [45]:
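# Retrieve the tokens
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on recent versions
count_tokens = countvectorizer.get_feature_names()
tfidf_tokens = tfidfvectorizer.get_feature_names()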

In [20]:
type(train_tfidf)

scipy.sparse.csr.csr_matrix

In [26]:
from scipy.sparse import csr_matrix

df_tfidfvect = pd.DataFrame.sparse.from_spmatrix(train_tfidf, columns = tfidf_tokens)

In [27]:
df_tfidfvect.head()

Unnamed: 0,00,000,000aweek,000foot,000paltry,007,00s,03,04,05,...,zooms,zoot,zorro,zucker,zuckerabrahamszucker,zuko,zulu,zwick,zwicks,zwigoffs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
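To make the weights in this matrix less opaque, the sketch below recomputes tf-idf by hand for a hypothetical three-document corpus. It assumes the default settings of `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the names `toy_docs`, `doc_freq` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical, already tokenized toy corpus
toy_docs = [["good", "film"], ["bad", "film"], ["good", "good", "story"]]
n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf(toy_docs[1])
print(weights)
```

In the second toy document, "bad" appears in fewer documents than "film", so it receives the larger weight.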


In [34]:
test['clean_text']= test['text'].apply(lambda x:remove_punctuation(x))
test['text_lower']= test['clean_text'].apply(lambda x: x.lower())

# Reuse the vectorizers fitted on the training set: calling fit_transform here
# would learn a different vocabulary and make the test features incompatible with the model
test_count = countvectorizer.transform(test['text_lower'])
test_tfidf = tfidfvectorizer.transform(test['text_lower'])

# The test matrix shares its columns with the training matrix
test_tfidfvect = pd.DataFrame.sparse.from_spmatrix(test_tfidf, columns = tfidf_tokens)
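A vectorizer is normally fitted on the training texts only and then applied to the test texts with `transform`, so that both matrices share the same columns. A minimal sketch with hypothetical toy texts (`toy_train`, `toy_test`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts
toy_train = ["a good film", "a bad film", "good story"]
toy_test = ["a good ending", "bad story"]

vec = TfidfVectorizer(analyzer="word", stop_words="english")
train_matrix = vec.fit_transform(toy_train)  # learns the vocabulary and idf weights
test_matrix = vec.transform(toy_test)        # reuses them; unseen words are ignored

# Same number of columns in both matrices
print(train_matrix.shape, test_matrix.shape)
```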

### 3. Modeling the reviews

#### Once the review text has been preprocessed, a classification model is fitted. The model used is a Bernoulli Naive Bayes classifier, which works well with sparse matrices. With `binarize=0.0`, every nonzero tf-idf weight is treated as 1, so the model only uses whether a word is present in a review.

In [36]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB(binarize=0.0)
bnb.fit(df_tfidfvect, train['tag'])
bnb.score(df_tfidfvect, train['tag'])

0.8874532471948317
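Since `BernoulliNB(binarize=0.0)` thresholds its input at zero, it should make no difference whether it receives raw counts or tf-idf weights: both have the same nonzero pattern. The sketch below checks this on hypothetical toy texts; it is an illustration, not a claim about the full data set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy texts and labels
toy_texts = ["good good film", "bad film", "good story", "bad bad story"]
toy_labels = [1, 0, 1, 0]

X_count = CountVectorizer().fit_transform(toy_texts)
X_tfidf = TfidfVectorizer().fit_transform(toy_texts)

# With binarize=0.0 every positive entry becomes 1, so both inputs look identical to the model
proba_count = BernoulliNB(binarize=0.0).fit(X_count, toy_labels).predict_proba(X_count)
proba_tfidf = BernoulliNB(binarize=0.0).fit(X_tfidf, toy_labels).predict_proba(X_tfidf)

print(np.allclose(proba_count, proba_tfidf))
```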

#### Next, the predictions of the fitted model are obtained.

In [37]:
bnb.predict(df_tfidfvect)

array([0, 0, 0, ..., 0, 0, 0])

#### The main goal of the modeling in this document is to identify when a review is positive or negative. To that end, the features (words) that contribute the most to the probability of a review being positive or negative are extracted.

In [46]:
neg_class_prob_sorted = bnb.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = bnb.feature_log_prob_[1, :].argsort()[::-1]

print(np.take(tfidfvectorizer.get_feature_names(), neg_class_prob_sorted[:10]))
print(np.take(tfidfvectorizer.get_feature_names(), pos_class_prob_sorted[:10]))

['film' 'movie' 'like' 'just' 'time' 'good' 'bad' 'character' 'story'
 'films']
['film' 'movie' 'like' 'just' 'films' 'good' 'story' 'life' 'time' 'way']
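Sorting each row of `feature_log_prob_` returns the most frequent words within each class, which is why both lists above largely overlap. One possible complement is to rank words by the difference between the two rows, which highlights words that separate the classes. The sketch below uses hypothetical stand-ins (`toy_tokens`, `toy_log_prob`) in place of `tfidf_tokens` and `bnb.feature_log_prob_`; the probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical stand-ins for tfidf_tokens and bnb.feature_log_prob_ (rows: class 0, class 1)
toy_tokens = np.array(["film", "movie", "bad", "life"])
toy_log_prob = np.log(np.array([
    [0.30, 0.25, 0.10, 0.02],  # P(word present | negative)
    [0.30, 0.24, 0.03, 0.06],  # P(word present | positive)
]))

# Positive values favor the positive class, negative values favor the negative class
log_ratio = toy_log_prob[1] - toy_log_prob[0]
order = np.argsort(log_ratio)

print("most negative-leaning:", toy_tokens[order[:2]])
print("most positive-leaning:", toy_tokens[order[::-1][:2]])
```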


In [47]:
from sklearn.model_selection import cross_val_score

bnb_clf_scores = cross_val_score(bnb, df_tfidfvect, train['tag'], cv=5)

print(bnb_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)"%(bnb_clf_scores.mean(), bnb_clf_scores.std()*2))

[0.84458475 0.8436134  0.84357542 0.84697595 0.84406121]
Accuracy: 0.84 (+/- 0.00)


In [49]:
bnb_clf_pred = cross_val_predict(bnb, df_tfidfvect, train['tag'], cv=5)
print(confusion_matrix(train['tag'], bnb_clf_pred))

[[17146   638]
 [ 2562   241]]
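With an imbalanced label, accuracy is easier to interpret next to per-class metrics and a majority-class baseline. The sketch below derives them from the confusion matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes
tn, fp = 17146, 638
fn, tp = 2562, 241

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
baseline = (tn + fp) / total     # accuracy of always predicting the negative class
precision_pos = tp / (tp + fp)   # share of predicted positives that are truly positive
recall_pos = tp / (tp + fn)      # share of true positives that the model recovers

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
print(f"precision(pos)={precision_pos:.3f} recall(pos)={recall_pos:.3f}")
```

On these numbers the cross-validated accuracy is slightly below the majority-class baseline, and the recall for the positive class is under 10%, so the positive class is rarely recovered.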


#### According to the results, the words with the highest weight in negative reviews refer mainly to films and their characters. In positive reviews, the top words also point to films, and to stories rated as good.
#### As for the model, cross-validation gave an accuracy of 84%.