# Programming 2
# Master's in Data Science
# Challenge 2
# Guillermo Ortiz Macías

In [1]:
# Python libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score
import torch



PyTorch was installed following the instructions at https://pytorch.org/get-started/locally/, using the pip command suggested by that page:

`pip install torch torchvision torchaudio`

The following packages were also installed with pip:

- transformers: used to load the BERT language model, specifically _bert-base-multilingual-uncased-sentiment_
- pandas
- numpy
In [2]:
# Load dataset
df_job_reviews = pd.read_csv("glassdoor_reviews.csv")
df_job_reviews.head()

Unnamed: 0,firm,date_review,job_title,current,location,overall_rating,work_life_balance,culture_values,diversity_inclusion,career_opp,comp_benefits,senior_mgmt,recommend,ceo_approv,outlook,headline,pros,cons
0,AFH-Wealth-Management,2015-04-05,,Current Employee,,2,4.0,3.0,,2.0,3.0,3.0,x,o,r,"Young colleagues, poor micro management",Very friendly and welcoming to new staff. Easy...,"Poor salaries, poor training and communication."
1,AFH-Wealth-Management,2015-12-11,Office Administrator,"Current Employee, more than 1 year","Bromsgrove, England, England",2,3.0,1.0,,2.0,1.0,4.0,x,o,r,"Excellent staff, poor salary","Friendly, helpful and hard-working colleagues",Poor salary which doesn't improve much with pr...
2,AFH-Wealth-Management,2016-01-28,Office Administrator,"Current Employee, less than 1 year","Bromsgrove, England, England",1,1.0,1.0,,1.0,1.0,1.0,x,o,x,"Low salary, bad micromanagement",Easy to get the job even without experience in...,"Very low salary, poor working conditions, very..."
3,AFH-Wealth-Management,2016-04-16,,Current Employee,,5,2.0,3.0,,2.0,2.0,3.0,x,o,r,Over promised under delivered,Nice staff to work with,No career progression and salary is poor
4,AFH-Wealth-Management,2016-04-23,Office Administrator,"Current Employee, more than 1 year","Bromsgrove, England, England",1,2.0,1.0,,2.0,1.0,1.0,x,o,x,client reporting admin,"Easy to get the job, Nice colleagues.","Abysmal pay, around minimum wage. No actual tr..."


The target variable is `recommend`, which takes the following values:

- v: positive recommendation of the company
- x: negative recommendation of the company
- o: no opinion.

I will drop the rows of the dataset that have no opinion.

In [3]:
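# Keep only rows with an explicit opinion; .copy() avoids pandas'
# SettingWithCopyWarning when a new column is added to the result later
df_job_reviews = df_job_reviews[df_job_reviews['recommend'] != 'o'].copy()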

The text part of the dataset is in the `headline`, `pros` and `cons` columns.

I will combine these 3 columns into a single one.

In [4]:
df_job_reviews['text'] = df_job_reviews['headline'] + " " + df_job_reviews['pros'] + " " + df_job_reviews['cons']

Now I will drop every column of the dataset except `recommend` and `text`.

In [5]:
df_job_reviews = df_job_reviews[['text', 'recommend']].copy()

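Drop null values. This also drops any review where `headline`, `pros` or `cons` was missing, because concatenating a string with a missing value yields a missing value.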

In [6]:
df_job_reviews = df_job_reviews.dropna()

In [7]:
# Reset the index
df_job_reviews = df_job_reviews.reset_index(drop=True)

In [8]:
df_job_reviews.head()

Unnamed: 0,text,recommend
0,"Young colleagues, poor micro management Very f...",x
1,"Excellent staff, poor salary Friendly, helpful...",x
2,"Low salary, bad micromanagement Easy to get th...",x
3,Over promised under delivered Nice staff to wo...,x
4,"client reporting admin Easy to get the job, Ni...",x


Convert the values of `recommend` to:

- 1 if the company is recommended (v)
- 0 if it is not (x)

In [9]:
df_job_reviews['recommend'] = np.where(df_job_reviews['recommend'] == 'x', 0,1)
df_job_reviews

Unnamed: 0,text,recommend
0,"Young colleagues, poor micro management Very f...",0
1,"Excellent staff, poor salary Friendly, helpful...",0
2,"Low salary, bad micromanagement Easy to get th...",0
3,Over promised under delivered Nice staff to wo...,0
4,"client reporting admin Easy to get the job, Ni...",0
...,...,...
603104,A great brand Family owned and a great brand. ...,1
603105,Awesome place to work It's a company with a cl...,1
603106,Just an awesome company to work for!!! Great c...,1
603107,not interested in growing their people loved b...,1


To classify the job reviews into those that recommend the job and those that do not,
I will use the BERT language model.

To use this model I installed the `transformers` library in my Python environment.

First I load the tokenizer of the pretrained model, and then the model itself.

In [10]:
# These model identifiers come from the Hugging Face site:
# https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
# It is a sentiment analysis model that works in 6 different languages,
# including English and Spanish, and returns the sentiment as a number from 1 to 5.
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")




In [11]:
text = "Este ha sido el peor trabajo en el que he estado"
tokens = tokenizer.encode(text, return_tensors="pt")

In [12]:
# The text has now been converted into its token representation
tokens

tensor([[  101, 10494, 10240, 12738, 10117, 89664, 15858, 10109, 10117, 10126,
         10191, 10714,   102]])

In [13]:
# And this can be converted back to plain text
tokenizer.decode(tokens[0])

'[CLS] este ha sido el peor trabajo en el que he estado [SEP]'

In [14]:
# Now we can ask the model to run sentiment analysis on the text
result = model(tokens)

In [15]:
# The important part of the result is the logits tensor. It has 5 elements, one raw score
# (not a probability) per rating: the first element is the worst sentiment, the fifth is the best.
result

SequenceClassifierOutput(loss=None, logits=tensor([[ 3.9755,  1.8709, -0.1890, -2.6711, -2.2595]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The result gives a score (logit) of 3.98 for rating 1 and -2.26 for rating 5, so
the model is telling us that the text "Este ha sido el peor trabajo en el que he estado"
("This has been the worst job I have ever had") expresses a very negative sentiment. It is a negative review.

In [16]:
torch.argmax(result.logits)

tensor(0)
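
The logits above are raw scores, so they are easier to read once converted into probabilities with a softmax. The following is a minimal sketch in plain Python that reuses the five values printed above; the helper name `softmax` is only illustrative. Inside the notebook, `torch.softmax(result.logits, dim=-1)` should give the same probabilities directly on the tensor.

```python
import math

# Logits copied from the output above (one raw score per rating, 1 to 5).
logits = [3.9755, 1.8709, -0.1890, -2.6711, -2.2595]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
for rating, p in enumerate(probs, start=1):
    print(f"rating {rating}: {p:.3f}")
```

The largest probability falls on rating 1, which agrees with the `argmax` result of index 0.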

In [17]:
text = "Fascinante trabajo, trabajar aquí es genial!"
tokens = tokenizer.encode(text, return_tensors="pt")
result = model(tokens)
result

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7409, -2.9552, -0.7080,  2.0021,  3.4833]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

For a good review, the largest logit (3.48) is in the fifth position. This indicates that the review
is very positive.

Now this model has to be used to run sentiment analysis on the job reviews dataset
from Kaggle.

In [18]:
def get_review_sentiment(review):
    tokens = tokenizer.encode(review, return_tensors="pt")
    # Inference only: no gradients are needed
    with torch.no_grad():
        result = model(tokens)
    # Index of the largest logit, from 0 (rating 1) to 4 (rating 5)
    sentiment = torch.argmax(result.logits).item()
    # If sentiment is 0, 1 or 2 return 0: negative sentiment
    if sentiment <= 2:
        return 0
    # Otherwise return 1: positive sentiment
    return 1
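
The thresholding rule used by `get_review_sentiment` can be checked without loading the model. This is a small sketch in plain Python applied to the two logit vectors shown earlier; `stars_to_recommend` is a hypothetical helper written for this illustration, not part of the notebook.

```python
def stars_to_recommend(logits):
    """Map five rating scores to a binary label (hypothetical helper)."""
    # Index of the largest score: 0 means rating 1, 4 means rating 5.
    best = max(range(len(logits)), key=lambda i: logits[i])
    # Ratings 1 to 3 (indices 0 to 2) count as "not recommended" (0);
    # ratings 4 and 5 (indices 3 and 4) count as "recommended" (1).
    return 0 if best <= 2 else 1


# Logits copied from the two example outputs above.
negative_example = [3.9755, 1.8709, -0.1890, -2.6711, -2.2595]
positive_example = [-2.7409, -2.9552, -0.7080, 2.0021, 3.4833]

print(stars_to_recommend(negative_example))
print(stars_to_recommend(positive_example))
```

Note that this rule treats the neutral rating 3 as negative. That is a design choice; moving the cut-off would change how borderline reviews are labeled.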

In [19]:
# Running the prediction on all 603109 rows of the dataset would take a very long time,
# so I will take a subset to test how well the model works.
sample_size = 200
df_sample = df_job_reviews.sample(sample_size)
df_sample

Unnamed: 0,text,recommend
452565,Telco Accounts Storage Sales Executive Global ...,1
545362,good company good company. Nice team. I enjoy ...,1
356569,Overall satisfied with my experience at Marrio...,1
347170,Vendor Manager for Shared Services Team Strong...,0
447251,"Applications Engineer Great place to learn, go...",1
...,...,...
53407,A good place to work Professional work environ...,1
235654,Hays Review Great training\r\nEveryone is supp...,1
265346,"Project Manager Good benefits, flexible work s...",1
50633,"An wonderful experience good work conditions, ...",1


In [20]:
predictions_recommend = []
len_df = len(df_sample)
for i, text in enumerate(df_sample['text'], start=1):
    print(f"Predicting {i} of {len_df}")
    # The model can process at most 512 tokens at a time. As a rough safeguard,
    # only the first 512 characters of each review are passed to it
    # (this is a character slice, not a token count).
    pred_rec = get_review_sentiment(text[:512])
    predictions_recommend.append(pred_rec)

Predicting 1 of 200
Predicting 2 of 200
Predicting 3 of 200
Predicting 4 of 200
Predicting 5 of 200
...

In [21]:
y_test = df_sample['recommend']
y_pred = predictions_recommend
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(precision)
print(accuracy)


[[ 46  14]
 [ 32 108]]
0.8852459016393442
0.77
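
To make the printed numbers easier to interpret, the metrics can be recomputed by hand from the confusion matrix. This is a minimal sketch in plain Python that assumes scikit-learn's layout (rows are true labels, columns are predicted labels); recall and the majority-class baseline are extra quantities added here for context and were not computed in the notebook.

```python
# Confusion matrix copied from the output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = [[46, 14],
      [32, 108]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

precision = tp / (tp + fp)    # of the reviews predicted 1, how many are truly 1
recall = tp / (tp + fn)       # of the reviews that are truly 1, how many were found
accuracy = (tp + tn) / total  # fraction of all predictions that are correct
baseline = (tp + fn) / total  # accuracy of always predicting 1 on this sample

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"accuracy  = {accuracy:.4f}")
print(f"baseline  = {baseline:.4f}")
```

Precision and accuracy match the values printed by scikit-learn. Since 140 of the 200 sampled reviews are positive, always predicting 1 would already reach 0.70 accuracy on this sample, so the 0.77 obtained by the model is best read against that baseline.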
