# Preprocessing and building the feature vector for the learning algorithm
*The Spanish Emotion Lexicon (SEL) word list is used, taking the following emotions into account:*

    - Joy (Alegría)
    - Anger (Enojo)
    - Fear (Miedo)
    - Sadness (Tristeza)


**Import the required libraries**

In [1]:
import xlrd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import openpyxl

**Read the SEL data**

In [2]:
# Note: xlrd 2.0 and later only read .xls files; opening this .xlsx workbook needs xlrd < 2.0
file_ = xlrd.open_workbook("constant_files/SEL.xlsx")

**Read the first sheet of the workbook**

In [3]:
emotions = file_.sheet_by_index(0)

**Create a list of the emotions that are not relevant to our algorithm**

*- These emotions are left out of the processing*

In [4]:
omit = ['Repulsión', 'Sorpresa']

**Build a dictionary with the emotions as keys and, for each one, a list of (word, PFA) tuples**

In [5]:
emotion_dic = {}
for i in range(1, emotions.nrows):  # row 0 is the header
    fila = emotions.row(i)
    word, pfa, emotion = fila[1].value, fila[2].value, fila[3].value

    # Skip the emotions that are not part of the problem
    if emotion in omit:
        continue

    entries = emotion_dic.setdefault(emotion, [])
    # Compare against the stored words, not against the (word, PFA) tuples
    if all(word != stored_word for stored_word, _ in entries):
        entries.append((word, pfa))
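
The grouping step can be checked without the workbook. The sketch below is only an illustration: the rows are hypothetical and merely mimic the column layout the cell above relies on (word in column 1, PFA in column 2, emotion in column 3), and `build_emotion_dic` is a helper name introduced here.

```python
# Hypothetical rows mimicking the SEL layout: (id, word, PFA, emotion)
rows = [
    (1, "feliz", 0.9, "Alegría"),
    (2, "asco", 0.8, "Repulsión"),
    (3, "feliz", 0.9, "Alegría"),  # repeated entry, should be stored once
    (4, "temor", 0.7, "Miedo"),
]
omit = {"Repulsión", "Sorpresa"}


def build_emotion_dic(rows, omit):
    """Group (word, PFA) pairs by emotion, skipping omitted emotions and repeated words."""
    emotion_dic = {}
    for _, word, pfa, emotion in rows:
        if emotion in omit:
            continue
        entries = emotion_dic.setdefault(emotion, [])
        if word not in (stored_word for stored_word, _ in entries):
            entries.append((word, pfa))
    return emotion_dic


toy_dic = build_emotion_dic(rows, omit)
print(toy_dic)
```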


**Create a list with the 4 training files, one for each of the 4 emotions of the problem**

In [6]:
training_files = ["anger.txt", "fear.txt", "joy.txt", "sadness.txt"]

**Read lines from each file until *100* distinct comments are collected, and build a list with all the raw comments**

In [7]:
num_commets = 100

comments = []
for file_selected in training_files:
    # Assumes each file holds at least num_commets distinct comments
    with open(f"entry_files/{file_selected}") as entry_file:
        # Discard the first line, which carries no useful data
        metadata = entry_file.readline()

        # Number of comments still to collect from this file
        limit = num_commets

        while limit > 0:
            line = entry_file.readline()
            # Keep only the comment text (second tab-separated field)
            comment = line.split("\t")[1]
            # Skip comments that were already collected
            if comment not in comments:
                comments.append(comment)
                limit -= 1
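
To try the reading logic without the training files, an in-memory file can stand in for one of them. This is a sketch under assumptions: the three-column layout (id, text, label) and the sample sentences are invented for illustration, and `read_comments` is a hypothetical helper, not part of the notebook.

```python
import io

# Hypothetical content: a header line, then id <TAB> comment <TAB> label
fake_file = io.StringIO(
    "id\ttext\temotion\n"
    "1\tQué rabia me da\tanger\n"
    "2\tQué rabia me da\tanger\n"
    "3\tNo lo soporto\tanger\n"
    "4\tEstoy furioso\tanger\n"
)


def read_comments(handle, limit, seen):
    """Return up to `limit` comments from `handle` that are not already in `seen`."""
    handle.readline()  # skip the header line
    collected = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # ignore empty or malformed lines
        comment = fields[1]
        if comment in seen or comment in collected:
            continue  # duplicates do not count towards the limit
        collected.append(comment)
        if len(collected) == limit:
            break
    return collected


sample = read_comments(fake_file, limit=2, seen=[])
print(sample)
```

Unlike the cell above, this variant simply stops at the end of the file when fewer than `limit` distinct comments are available.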


**Tokenize the comments and remove the stop words**

In [8]:
# Load the Spanish stop words once, outside the loop
stop_words = stopwords.words('spanish')

tokenized_comments = []
for comment in comments:
    tokens = word_tokenize(comment)
    # Keep purely alphabetic tokens, lowercased
    words = [word.lower() for word in tokens if word.isalpha()]
    tokenized_comments.append([w for w in words if w not in stop_words])

suma = 0
for commet in tokenized_comments:
    suma += len(commet)
print(suma)
# print(tokenized_comments)

2684
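
The effect of this step can be illustrated without downloading the NLTK resources. The sketch below is a simplified stand-in, not the notebook's method: it uses a regular expression instead of `word_tokenize`, and `STOP_WORDS` is a tiny hand-written set assumed for the example rather than NLTK's Spanish list.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's Spanish list
STOP_WORDS = {"el", "la", "de", "que", "me", "no", "y", "a", "en"}


def simple_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in stop_words]


tokens = simple_tokens("¡Qué rabia me da el tráfico de la ciudad!")
print(tokens)
```

Note that the accented "qué" survives because only the unaccented "que" is in the illustrative set.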


**A method is used to compute a threshold that reduces the set of SEL words, in this case *Zimmermann_Zysno***

In [9]:
T_en = 1
T_e_complement = 1
for key in emotion_dic.keys():
    # Mean PFA of the emotion and its complement
    suma = 0
    for pfa in emotion_dic[key]:
        suma += pfa[1]

    e = suma / len(emotion_dic[key])
    e_complement = 1 - e

    # Running products over all emotions
    T_en *= e
    T_e_complement *= e_complement

ganma = T_en / (T_en + T_e_complement)

mul1 = T_en ** (1 - ganma)
mul2 = T_e_complement ** ganma
# Note: this uses 1 - T_e_complement ** ganma; the textbook
# Zimmermann-Zysno operator raises (1 - T_e_complement) to ganma instead
pfa_umbral = mul1 * (1 - mul2)
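
A small numeric check makes the threshold easier to follow. The sketch below mirrors the arithmetic of the cell above on hypothetical mean PFA values; `zz_threshold` is a helper name introduced here for illustration.

```python
from math import prod


def zz_threshold(means):
    """Combine per-emotion mean PFA values the same way the cell above does."""
    t = prod(means)                    # product of the means
    t_c = prod(1 - m for m in means)   # product of the complements
    gamma = t / (t + t_c)
    return (t ** (1 - gamma)) * (1 - t_c ** gamma)


# Two hypothetical emotions whose mean PFA is 0.5 each:
# t = 0.25, t_c = 0.25, gamma = 0.5, so the result is 0.5 * (1 - 0.5)
threshold = zz_threshold([0.5, 0.5])
print(threshold)
```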


**Using the computed threshold, the SEL words whose PFA falls below it are discarded**

In [10]:
emotion_dic_umbral = {}
for key in emotion_dic.keys():
    emotion_dic_umbral[key] = []
    for pfa in emotion_dic[key]:
        if pfa[1] >= pfa_umbral:
            emotion_dic_umbral[key].append(pfa[0])

# print(emotion_dic_umbral)
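
As a quick illustration of the filter, the following sketch applies a threshold to a toy dictionary. The words, PFA values and the threshold of 0.5 are all hypothetical.

```python
# Hypothetical lexicon and threshold, for illustration only
toy_emotion_dic = {
    "Alegría": [("feliz", 0.9), ("contento", 0.3)],
    "Miedo": [("temor", 0.7)],
}
toy_umbral = 0.5

# Keep only the words whose PFA reaches the threshold
kept = {
    emotion: [word for word, pfa in pairs if pfa >= toy_umbral]
    for emotion, pairs in toy_emotion_dic.items()
}
print(kept)
```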

**Define the function that computes the occurrence**

In [11]:
def ocu(word, comments):
    """
    Count the comments (token lists) in which the word
    occurs
    """
    count = 0
    for comment in comments:
        if word in comment:
            count += 1

    return count

**Define the function that computes the co-occurrence**

In [12]:
def concu(word1, word2, comments):
    """
    Count the comments (token lists) in which both words
    occur together
    """
    count = 0
    for comment in comments:
        if word1 in comment and word2 in comment:
            count += 1

    return count
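
A toy corpus shows what the two counters return. This sketch restates both functions compactly so it runs on its own; the tokenized comments are invented for the example.

```python
def ocu(word, comments):
    """Number of comments (token lists) that contain `word`."""
    return sum(1 for comment in comments if word in comment)


def concu(word1, word2, comments):
    """Number of comments that contain both `word1` and `word2`."""
    return sum(1 for comment in comments if word1 in comment and word2 in comment)


# Hypothetical tokenized comments
toy_comments = [
    ["miedo", "oscuridad"],
    ["miedo", "noche"],
    ["alegría", "fiesta"],
]

n_miedo = ocu("miedo", toy_comments)                # appears in 2 comments
n_both = concu("miedo", "noche", toy_comments)      # together in 1 comment
ratio = n_both / (n_miedo * ocu("noche", toy_comments))
print(n_miedo, n_both, ratio)
```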

**All the SEL words are merged, and only those with an occurrence greater than 0 are stored in a list**
*This filters out the words with zero occurrence, which would zero the denominator of $concurrence(w, v) / (occurrence(w) \cdot occurrence(v))$*

In [13]:
all_words = []
for key in emotion_dic_umbral.keys():
    all_words += emotion_dic_umbral[key]

ocurrency_words = []
for w in all_words:
    if ocu(w, tokenized_comments) > 0:
        ocurrency_words.append(w)

# print(ocurrency_words)

**The occurrence of each word in the comments is computed and stored in the dictionary *words_ocu_dic*; its co-occurrence with the words of *ocurrency_words* is computed and stored in the dictionary *concur_dic***

In [14]:
words_ocu_dic = {}
concur_dic = {}
for comment in tokenized_comments:
    for w in comment:
        if w not in words_ocu_dic.keys():
            words_ocu_dic[w] = ocu(w, tokenized_comments)

        for w2 in ocurrency_words:
            key = f'{w}-{w2}'
            if key not in concur_dic.keys():
                concur_dic[key] = concu(w, w2, tokenized_comments)

**Compute the occurrence of the words in *ocurrency_words***

In [15]:
f_words_ocu_dic = {}
for w in ocurrency_words:
    if w not in f_words_ocu_dic.keys():
        f_words_ocu_dic[w] = ocu(w, tokenized_comments)

**The formula for $O(w_i, E_j)$ is applied to every comment, producing the dictionary *average_word_emotion*, whose keys are the comments and whose values hold each word of the comment with its $O(w_i, E_j)$ value for every emotion**

In [16]:
average_word_emotion = {}
# zip keeps each tokenized comment paired with its own original text,
# even when two comments tokenize to the same word list
for original_comment, comment in zip(comments, tokenized_comments):
    average_word_emotion[original_comment] = {}

    for word in comment:
        average_word_emotion[original_comment][word] = {}

        for emotion, emotion_words in emotion_dic_umbral.items():
            # |E_j|: size of the emotion's full (unfiltered) word list
            mod_E = len(emotion_dic[emotion])
            sum_ = 0

            for f_w in emotion_words:
                try:
                    sum_ += round((concur_dic[f'{word}-{f_w}'] /
                        (f_words_ocu_dic[f_w] * words_ocu_dic[word])), 6)

                except KeyError:
                    # f_w never occurs in the comments, so it adds nothing
                    pass

            average_word_emotion[original_comment][word][emotion] =\
                round((sum_ / mod_E), 6)

# print(average_word_emotion)
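
Read from the cell above, the score appears to be $O(w_i, E_j) = \frac{1}{|E_j|} \sum_{v} \frac{concurrence(w_i, v)}{occurrence(w_i) \cdot occurrence(v)}$, summed over the emotion words $v$ that occur in the comments. The sketch below works one value through on a toy corpus; the data and the helper `o_value` are hypothetical, and the intermediate rounding used in the notebook is left out.

```python
def ocu(word, comments):
    """Number of comments (token lists) that contain `word`."""
    return sum(1 for comment in comments if word in comment)


def concu(word1, word2, comments):
    """Number of comments that contain both words."""
    return sum(1 for comment in comments if word1 in comment and word2 in comment)


def o_value(word, emotion_words, lexicon_size, comments):
    """Average normalized co-occurrence of `word` with an emotion's words."""
    occ_w = ocu(word, comments)
    total = 0.0
    for v in emotion_words:
        occ_v = ocu(v, comments)
        if occ_w == 0 or occ_v == 0:
            continue  # words absent from the corpus contribute nothing
        total += concu(word, v, comments) / (occ_w * occ_v)
    return total / lexicon_size


# Hypothetical tokenized comments and emotion word list
toy_comments = [["miedo", "oscuridad"], ["miedo", "noche"], ["alegría", "fiesta"]]
fear_words = ["miedo", "pánico"]

# "noche" co-occurs once with "miedo": 1 / (1 * 2) = 0.5, divided by |E| = 2
score = o_value("noche", fear_words, lexicon_size=2, comments=toy_comments)
print(score)
```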

**The vector is built with the following steps:**
1. For each comment, take the emotion with the highest average $O(w_i, E_j)$
2. For that emotion, add each word with $O(w_i, E_j)$ > 0, without repeating words

In [17]:
vector = {}
for comment in average_word_emotion.keys():
    # Accumulate O(w_i, E_j) per emotion over the words of the comment
    comment_dic = {}
    for word in average_word_emotion[comment].keys():
        word_emo = average_word_emotion[comment][word]
        for emotion in word_emo.keys():
            o = word_emo[emotion]
            if emotion not in comment_dic.keys():
                comment_dic[emotion] = o
            else:
                comment_dic[emotion] += o

    # Pick the emotion with the highest average O value
    max_prom = 0
    max_emo = ''
    module_o = len(average_word_emotion[comment])
    for emot in comment_dic.keys():
        prom = comment_dic[emot] / module_o
        if prom > max_prom:
            max_emo = emot
            max_prom = prom

    for w in average_word_emotion[comment].keys():
        try:
            emotion = average_word_emotion[comment][w][max_emo]
            # Add each word only once; the first value found is kept
            if emotion > 0 and w not in vector:
                vector[w] = emotion
        except KeyError:
            # No emotion scored above 0 for this comment
            pass

# print(vector)
print(len(vector))

1021
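
The two steps can be followed on a single toy comment. This is a sketch under assumptions: the scores are invented, and `dominant_emotion` is a hypothetical helper that condenses the selection logic of the cell above.

```python
def dominant_emotion(word_scores):
    """Return the emotion with the highest average O value, or None if all are 0.

    `word_scores` maps each word of one comment to {emotion: O value}.
    """
    totals = {}
    for scores in word_scores.values():
        for emotion, value in scores.items():
            totals[emotion] = totals.get(emotion, 0.0) + value
    if not totals:
        return None
    n_words = len(word_scores)
    best = max(totals, key=lambda emotion: totals[emotion] / n_words)
    return best if totals[best] > 0 else None


# Hypothetical O values for the words of one comment
toy_scores = {
    "noche": {"Miedo": 0.25, "Alegría": 0.0},
    "fiesta": {"Miedo": 0.0, "Alegría": 0.125},
    "calle": {"Miedo": 0.0, "Alegría": 0.0},
}

best = dominant_emotion(toy_scores)
# Keep only the words that score above 0 for the dominant emotion
features = {w: s[best] for w, s in toy_scores.items() if s[best] > 0}
print(best, features)
```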


**Two training matrices are built: one with the boolean term-presence model (stored in *tf_idf*, although no TF-IDF weighting is applied) and one with the $O(w_i, E_j)$ value**

In [18]:
tf_idf = []
O_w = []
count = 1
for comment in tokenized_comments:
    fila = []
    fila1 = []
    for vector_word in vector.keys():
        if vector_word in comment:
            fila.append(1)
            fila1.append(vector[vector_word])
        else:
            fila.append(0)
            fila1.append(0)

    if count <= num_commets:
        fila.append('enfado')
        fila1.append('enfado')
    elif count > num_commets and count <= num_commets * 2:
        fila.append('miedo')
        fila1.append('miedo')
    elif count > num_commets * 2 and count <= num_commets * 3:
        fila.append('alegria')
        fila1.append('alegria')
    else:
        fila.append('tristeza')
        fila1.append('tristeza')

    count += 1

    tf_idf.append(fila)
    O_w.append(fila1)
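
The shape of both matrices is easier to see on a small case. In the sketch below the vector, comments and labels are hypothetical, and `build_rows` is a helper introduced for illustration; it takes an explicit label per comment instead of deriving it from the comment's position as the cell above does.

```python
def build_rows(tokenized_comments, labels, vector):
    """Build boolean and O-weighted rows, each ending with the comment's label."""
    words = list(vector)
    boolean_rows, weighted_rows = [], []
    for tokens, label in zip(tokenized_comments, labels):
        present = [w in tokens for w in words]
        boolean_rows.append([int(p) for p in present] + [label])
        weighted_rows.append(
            [vector[w] if p else 0 for w, p in zip(words, present)] + [label]
        )
    return boolean_rows, weighted_rows


# Hypothetical feature vector, comments and labels
toy_vector = {"noche": 0.25, "fiesta": 0.125}
toy_comments = [["miedo", "noche"], ["alegría", "fiesta"]]
toy_labels = ["miedo", "alegria"]

boolean_rows, weighted_rows = build_rows(toy_comments, toy_labels, toy_vector)
print(boolean_rows)
print(weighted_rows)
```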

**Finally, the matrices are saved to separate Excel workbooks that our learning model will use**

In [19]:
wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(list(vector.keys()) + ['emoción'])
for com in tf_idf:
    sheet.append(com)
wb.save('pre_pros_tf_idf.xlsx')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(list(vector.keys()) + ['emoción'])
for com in O_w:
    sheet.append(com)
wb.save('pre_pros_O_w.xlsx')