# *Embeddings*-based model for the master's thesis (TFM) **"Estimación automática de signos de depresión a partir de análises de texto"** (Automatic estimation of signs of depression from text analysis), Máster universitario en tecnoloxías de análise de datos masivos: Big Data, academic year 2019/2020

## Author: Manuel Ramón Varela López

- See the **Readme** for usage instructions.

### Preliminary steps

- **We start with the notebook configuration.**
    - The JSON file with the questions and answers of the BDI test.
    - The directory that holds the XML documents with the users' publications.
    - The file that contains the real results.

In [1]:
questions_file = "inquerito.json"
dir_corpus = 'corpus'
file_real_results = 'Depression Questionnaires_anon.txt'

- **We import the libraries**

In [2]:
%matplotlib inline
import json
import xml.etree.ElementTree as ET
import os
import pandas as pd
import math
import numpy as np
import shutil
import matplotlib.pyplot as plt
import gensim
import nltk
import gensim.downloader as api

- **Configuration**

In [3]:
pd.options.display.max_columns = None

### Start of the script

- **We load the questions.**
    - We load the BDI test questions from the file set at the start of the script.

In [4]:
with open(questions_file) as json_file:
    questions = json.load(json_file)
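
For reference, the later cells read three fields from each entry of this file: `question_number`, `question_text`, and an `answers` list whose items carry `answer_text`. The sketch below shows a file with that shape. The field names come from this notebook, but the question and answer wording is invented for illustration and is not the real BDI text.

```python
import json

# Hypothetical content using the field names read by this notebook.
# The wording is a placeholder, not the real questionnaire text.
sample_json = """
[
  {
    "question_number": 1,
    "question_text": "Example question",
    "answers": [
      {"answer_text": "lowest option"},
      {"answer_text": "highest option"}
    ]
  }
]
"""

sample_questions = json.loads(sample_json)
first = sample_questions[0]

# The notebook pairs each question with its last (highest) answer.
highest_answer = first["answers"][-1]["answer_text"]
print(first["question_number"], first["question_text"], "->", highest_answer)
```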

- **We list the XML files with the Reddit publications**.
    - We collect all the XML files (one per user) from the directory set at the start of the script.

In [5]:
corpus_files = [f for f in os.listdir(dir_corpus) if os.path.isfile(os.path.join(dir_corpus, f))]

- **We read the XML files**.
    - We parse those files and store all their information in a list of dicts, one per user.

In [6]:
data = []
for file in corpus_files:
    dataElement = {}
    writings = []
    tree = ET.parse(dir_corpus + os.path.sep + file)
    root = tree.getroot()
    for child in root:
        if child.tag == 'ID':
            dataElement['id']=child.text
        elif child.tag=='WRITING':
            writing = {}
            for wriIter in child:
                if wriIter.tag == 'TITLE':
                    writing['title']=wriIter.text
                elif wriIter.tag == 'DATE':
                    writing['date']=wriIter.text
                elif wriIter.tag == 'INFO':
                    writing['info']=wriIter.text
                elif wriIter.tag == 'TEXT':
                    writing['text']=wriIter.text
            writings.append(writing)
    dataElement['corpus'] = writings
    data.append(dataElement)
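
The loop above expects each user file to hold an `ID` element and several `WRITING` elements directly under the root, with `TITLE`, `DATE`, `INFO` and `TEXT` inside each `WRITING`. The following is a minimal sketch of such a file and a compact way to build the same dictionary. The root tag name `INDIVIDUAL` and every value are assumptions made for the example, and unlike the cell above this variant keeps any child tag it finds inside a `WRITING`.

```python
import xml.etree.ElementTree as ET

# Hypothetical user file; the root tag name and all values are assumptions.
sample_xml = """
<INDIVIDUAL>
  <ID>subject0000</ID>
  <WRITING>
    <TITLE>A title</TITLE>
    <DATE>2020-01-01 00:00:00</DATE>
    <INFO>reddit post</INFO>
    <TEXT>Some text</TEXT>
  </WRITING>
</INDIVIDUAL>
"""

root = ET.fromstring(sample_xml.strip())

# Map the XML tags to the lower-case keys used in the notebook.
user = {
    "id": root.findtext("ID"),
    "corpus": [
        {child.tag.lower(): child.text for child in writing}
        for writing in root.findall("WRITING")
    ],
}
print(user)
```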

- **We load the pre-trained model**.
    - If it is not available locally, it is downloaded.

In [7]:
# Pre-trained model used in this notebook
model = api.load("word2vec-google-news-300")

- **We process the corpus of each user**
    - We convert the text and the title of each user's publications into arrays of words.
    - We remove the stop words.

In [8]:
stop_words = nltk.corpus.stopwords.words('english')
for user in data:
    corpus = user.get("corpus")
    for document in corpus:
        # Tokenize the title and the text, dropping the stop words
        title = document.get("title")
        document['title'] = [word for word in gensim.utils.simple_preprocess(str(title)) if word not in stop_words]
        text = document.get("text")
        document['text'] = [word for word in gensim.utils.simple_preprocess(str(text)) if word not in stop_words]

- **We build the array of words whose similarity with the user's corpus is computed for each question**
    - The function receives a text with the words to compare.
    - It returns an array of words with the stop words removed.

In [9]:
def build_query(texto):
    q_aux = [word for word in gensim.utils.simple_preprocess(texto) if word not in stop_words]
    return q_aux

- **Similarity between two words**
    - We compute the similarity between two words.
    - Each word is represented by a vector.
    - The similarity is the cosine of the angle between the two word vectors.
    - If either word is not in the model, we return **None**.

In [10]:
def cosSim(word1,word2):
    if (word1 in model) and (word2 in model):
        return model.similarity(word1, word2)
    return None    
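
To make the measure concrete, the sketch below computes the cosine between plain Python vectors without any library. The three toy vectors are hypothetical and are not taken from the word2vec model. They only illustrate that vectors pointing in the same direction score close to 1 and orthogonal vectors score 0.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical, not taken from the model).
sad = [1.0, 2.0, 0.0]
unhappy = [2.0, 4.0, 0.0]   # same direction as `sad`
table = [0.0, 0.0, 5.0]     # orthogonal to `sad`

print(cosine(sad, unhappy))  # close to 1
print(cosine(sad, table))    # 0 for orthogonal vectors
```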

- **We create a DataFrame to store the results.**

In [11]:
# Column names: 'subject' plus one column per question number
d = ['subject']
for question in questions:
    d.append(question['question_number'])

# Create the DataFrame that will hold the scores
results = pd.DataFrame(columns=d)

- **We compute the similarities between each question (with its highest answer) and the user's corpus**
    - For each word of the question and its highest answer, we compute the similarity with every word in the user's corpus.
    - Once we have all the similarities, we compute their mean.

In [12]:
# Iterate over all users
for user in data:

    # User id
    subject = user['id']

    # Scores of this user, one entry per question
    subject_res = {'subject': subject}

    # Corpus of the user (list of publications)
    corpus = user.get("corpus")

    # Iterate over the questions
    for question in questions:

        # Question text and number
        question_text = question['question_text']
        question_number = question['question_number']

        # Take the last (highest) answer
        answers = question['answers']
        answer_text = answers[-1]['answer_text']

        # Words of the question plus its highest answer
        question_array = build_query(question_text + " " + answer_text)
        simCoseno = []
        for word_question in question_array:

            # Iterate over all publications of the user
            for coment in corpus:

                # Compare with every word of the title
                for word_title in coment.get('title'):
                    simCos = cosSim(word_question, word_title)
                    if simCos is not None:
                        simCoseno.append(simCos)

                # Compare with every word of the text
                for word_text in coment.get('text'):
                    simCos = cosSim(word_question, word_text)
                    if simCos is not None:
                        simCoseno.append(simCos)

        # Once all word-to-word similarities are computed, take their mean
        score = np.mean(simCoseno)
        subject_res[question_number] = score

    # Store the scores of this user
    # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
    results = pd.concat([results, pd.DataFrame([subject_res])], ignore_index=True)
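
The scoring rule of this cell can be reproduced on a small scale as follows. This is only a sketch: `toy_model` is a hypothetical table of 2-dimensional vectors standing in for the word2vec model, and the word lists are invented. It shows that pairs with an out-of-vocabulary word are skipped and that the score is the plain mean of the remaining pairwise similarities.

```python
import math
from statistics import mean

# Hypothetical 2-dimensional embeddings standing in for the word2vec model.
toy_model = {
    "sad": [1.0, 0.0],
    "cry": [1.0, 1.0],
    "car": [0.0, 1.0],
}

def toy_similarity(w1, w2):
    """Cosine similarity, or None when a word is out of vocabulary."""
    if w1 not in toy_model or w2 not in toy_model:
        return None
    u, v = toy_model[w1], toy_model[w2]
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = ["sad", "unknownword"]      # words of a question and its highest answer
user_words = ["cry", "car", "sad"]  # all title and text tokens of one user

sims = []
for q_word in query:
    for u_word in user_words:
        s = toy_similarity(q_word, u_word)
        if s is not None:           # skip pairs with unknown words
            sims.append(s)

score = mean(sims)
print(len(sims), score)
```

Because every question word is compared with every corpus word, the cost grows with the product of both lengths, and long user histories dominate the running time.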

- **We show the results obtained**
    - We set the "subject" column as the index of the DataFrame.
    - We show the DataFrame: for each user and question, it holds the similarity between the question (with its highest answer) and the user's corpus.

In [13]:
results=results.set_index('subject')
display(results)

| subject | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| subject1272 | 0.123829 | 0.149903 | 0.105086 | 0.139704 | 0.138584 | 0.129512 | 0.126316 | 0.158698 | 0.130883 | 0.196097 | 0.121096 | 0.144008 | 0.105345 | 0.133061 | 0.123906 | 0.119092 | 0.10938 | 0.108654 | 0.1313 | 0.127571 | 0.092662 |
| subject2341 | 0.146406 | 0.17873 | 0.120166 | 0.159173 | 0.160386 | 0.15499 | 0.140306 | 0.182701 | 0.146831 | 0.215363 | 0.133703 | 0.166968 | 0.126133 | 0.14696 | 0.146131 | 0.131883 | 0.123237 | 0.12686 | 0.152945 | 0.14794 | 0.102956 |
| subject2432 | 0.102199 | 0.129228 | 0.094563 | 0.123264 | 0.1158 | 0.109295 | 0.109802 | 0.137859 | 0.111035 | 0.160706 | 0.105799 | 0.127032 | 0.095615 | 0.114746 | 0.11439 | 0.108104 | 0.096012 | 0.102056 | 0.119013 | 0.113445 | 0.084712 |
| subject2827 | 0.141759 | 0.168608 | 0.110767 | 0.151573 | 0.152381 | 0.146216 | 0.137364 | 0.167711 | 0.141375 | 0.203344 | 0.129091 | 0.158669 | 0.11773 | 0.144899 | 0.134067 | 0.122546 | 0.115802 | 0.119676 | 0.145156 | 0.137741 | 0.097567 |
| subject2903 | 0.117341 | 0.138694 | 0.101064 | 0.131898 | 0.129908 | 0.121893 | 0.117312 | 0.146673 | 0.121442 | 0.17801 | 0.114367 | 0.134211 | 0.100758 | 0.120598 | 0.118365 | 0.116676 | 0.106787 | 0.109183 | 0.125266 | 0.121512 | 0.09214 |
| subject2961 | 0.129326 | 0.159532 | 0.108668 | 0.149071 | 0.14642 | 0.138598 | 0.132283 | 0.165022 | 0.130106 | 0.200446 | 0.128239 | 0.152781 | 0.113062 | 0.139438 | 0.132659 | 0.126441 | 0.124065 | 0.119276 | 0.141699 | 0.146002 | 0.097612 |
| subject3707 | 0.104241 | 0.129344 | 0.090085 | 0.124309 | 0.114606 | 0.106727 | 0.113798 | 0.13369 | 0.110146 | 0.160711 | 0.104801 | 0.127728 | 0.094113 | 0.108282 | 0.10762 | 0.111235 | 0.108175 | 0.10402 | 0.119456 | 0.116088 | 0.083262 |
| subject3993 | 0.087888 | 0.116647 | 0.093754 | 0.110869 | 0.103863 | 0.095136 | 0.099414 | 0.12144 | 0.099539 | 0.128368 | 0.094098 | 0.118292 | 0.090468 | 0.101813 | 0.108468 | 0.101115 | 0.091272 | 0.101445 | 0.110211 | 0.100376 | 0.085935 |
| subject4058 | 0.108808 | 0.133971 | 0.1017 | 0.114545 | 0.118338 | 0.111929 | 0.113932 | 0.140311 | 0.113817 | 0.150666 | 0.105555 | 0.124957 | 0.1004 | 0.115373 | 0.113227 | 0.103755 | 0.089249 | 0.101309 | 0.11578 | 0.108297 | 0.086323 |
| subject436 | 0.116136 | 0.146955 | 0.112326 | 0.13527 | 0.133168 | 0.12302 | 0.115288 | 0.151879 | 0.120863 | 0.179295 | 0.121776 | 0.138995 | 0.111516 | 0.119268 | 0.125115 | 0.129008 | 0.122958 | 0.11857 | 0.133688 | 0.13615 | 0.094358 |


- **Normalization**
    - We normalize each user's score for each question to a value between 0 and 1 (min-max over all users).
    - We then multiply it by the number of answers the question has, minus 1.
    - We take the floor to obtain an integer.
    - That value is the estimated answer to the question. It must lie between 0 and 3 for questions with 4 answers and between 0 and 6 for questions with 7 answers.

In [14]:
for question in questions:
    q = question['question_number']
    cats = len(question['answers'])

    # Min-max normalize, scale to the number of answers minus 1, and floor
    if (max(results[q])-min(results[q])) > 0:
        results[q] = np.floor((results[q]-min(results[q]))/(max(results[q])-min(results[q]))*(cats-1))
    results[q] = pd.to_numeric(results[q],downcast='integer')
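
As a worked example of this normalization, the sketch below applies the same formula to four hypothetical scores using only the standard library. The helper name `to_answer` and the score values are assumptions made for illustration. When all scores are equal the notebook leaves the column unchanged, whereas this sketch returns 0 in that case.

```python
import math

def to_answer(score, lo, hi, n_answers):
    """Min-max normalize `score`, scale to n_answers - 1, and floor."""
    if hi == lo:
        # Assumption of this sketch; the notebook leaves the column as is.
        return 0
    return math.floor((score - lo) / (hi - lo) * (n_answers - 1))

# Hypothetical similarity scores of four users for one question.
scores = [0.10, 0.13, 0.17, 0.20]
lo, hi = min(scores), max(scores)

four_options = [to_answer(s, lo, hi, 4) for s in scores]
seven_options = [to_answer(s, lo, hi, 7) for s in scores]
print(four_options)   # answers in 0..3
print(seven_options)  # answers in 0..6
```

Because of the floor, only the user with the maximum score reaches the highest answer, and the user with the minimum score always gets 0.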

- **We relabel the answers of the questions that have 7 answers**

In [15]:
for aux2 in [16,18]:
    results.loc[results[aux2] == 1,aux2] = '1a'
    results.loc[results[aux2] == 2,aux2] = '1b'
    results.loc[results[aux2] == 3,aux2] = '2a'
    results.loc[results[aux2] == 4,aux2] = '2b'
    results.loc[results[aux2] == 5,aux2] = '3a'
    results.loc[results[aux2] == 6,aux2] = '3b'
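
The relabeling above amounts to a fixed lookup from the numeric value 0 to 6 to the questionnaire label. A minimal sketch with a dictionary follows. The names `LABELS` and `label_to_points` are hypothetical, and `label_to_points` mirrors what the `transform` function of the evaluation section does when it maps a label back to its numeric part.

```python
# Numeric answer (0-6) to questionnaire label, as in the cell above.
# The value 0 is left untouched, exactly as the cell does.
LABELS = {0: 0, 1: "1a", 2: "1b", 3: "2a", 4: "2b", 5: "3a", 6: "3b"}

def label_to_points(label):
    """Numeric part of a label: '2b' gives 2, and 0 stays 0."""
    if isinstance(label, str):
        return int(label[0])
    return label

estimated = [0, 3, 6]                        # hypothetical floor outputs
labels = [LABELS[value] for value in estimated]
points = [label_to_points(label) for label in labels]
print(labels, points)
```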

- **We show each user's answers to each of the questions.**

In [16]:
display(results)

| subject | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| subject1272 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2a | 1 | 1b | 1 | 1 | 1 |
| subject2341 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3b | 2 | 3b | 3 | 3 | 3 |
| subject2432 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1a | 0 | 1a | 0 | 0 | 0 |
| subject2827 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2b | 2 | 2b | 2 | 2 | 2 |
| subject2903 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 2a | 1 | 1b | 1 | 1 | 1 |
| subject2961 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 1 | 2b | 3 | 2b | 2 | 2 | 2 |
| subject3707 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1a | 1 | 1a | 0 | 0 | 0 |
| subject3993 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1a | 0 | 0 | 0 |
| subject4058 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1a | 0 | 0 | 0 |
| subject436 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3a | 2 | 2b | 1 | 2 | 1 |


### Evaluation

- **We load the real data and store it in a DataFrame**

In [17]:
real_results = pd.read_csv(file_real_results,index_col=False,header=None,sep='\t')
real_results.columns = d
real_results=real_results.set_index('subject')

- **Functions needed to compute the results**

In [18]:
# This function is needed because some answers carry a letter (e.g. '1a')
def transform(answer):
    if not isinstance(answer, str):
        return answer
    elif len(answer) == 1:
        return int(answer)
    elif len(answer) == 2:
        return int(answer[0])

def dhcl_score(real,estimed):
    if (real <= 9) and (estimed<=9):
        return 1
    elif (real>29) and (estimed>29):
        return 1
    elif (real>9) and (real<=18) and (estimed>9) and (estimed<=18):
        return 1
    elif (real>18) and (real<=29) and (estimed>18) and (estimed<=29):
        return 1
    else:
        return 0

- **We compute the 4 evaluation measures described in the TFM**

In [19]:
mad = 3
score_array = []

# Iterate over all users
for identificador in real_results.index.values:
    hits = 0
    crs = 0
    scr_real = 0
    scr_stm = 0
    # Iterate over all questions
    for question in questions:
        q = question['question_number']
        real_a = real_results.loc[identificador,q]
        estimated_a = results.loc[identificador,q]

        # Count the questions answered correctly
        if real_a == estimated_a:
            hits = hits + 1

        # Measure how close the estimate is to the real answer
        crs_aux = (mad - abs(transform(real_a)-transform(estimated_a)))/mad
        crs = crs + crs_aux

        # Accumulate the depression scores
        scr_real = scr_real + transform(real_a)
        scr_stm = scr_stm + transform(estimated_a)

    # Compute the fraction of questions answered correctly
    hit_score_aux = hits / len(questions)
    cls_score_aux = crs / len(questions)
    dl = (63 - abs(scr_real - scr_stm))/63
    dhcl = dhcl_score(scr_real,scr_stm)
    score_array.append({'subject':identificador,
                        'hit rate score':hit_score_aux,
                        'closeness rate score':cls_score_aux,
                        'real score':scr_real,
                        'estimated_score':scr_stm,
                        'dl':dl,
                        'dchr':dhcl})


score = pd.DataFrame(score_array)
score = score.set_index('subject')
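
The four measures computed in this cell can be checked by hand on a small case. The sketch below recomputes them for one hypothetical user whose estimated answers differ from the real ones in two questions. The answer vectors and the helper `band` are assumptions made for the example, while the constants and formulas follow the cell above.

```python
MAD = 3           # maximum absolute difference for one answer
MAX_SCORE = 63    # 21 questions x 3 points

def band(total):
    """Index of the score band: 0-9, 10-18, 19-29, 30 or more."""
    if total <= 9:
        return 0
    if total <= 18:
        return 1
    if total <= 29:
        return 2
    return 3

# Hypothetical numeric answers of one user to the 21 questions.
real = [0] * 10 + [1] * 5 + [2] * 3 + [3] * 3
estimated = [0] * 9 + [1] + [1] * 5 + [2] * 3 + [3] * 2 + [0]

hits = sum(r == e for r, e in zip(real, estimated))
hit_rate = hits / len(real)
closeness = sum((MAD - abs(r - e)) / MAD for r, e in zip(real, estimated)) / len(real)
dl = (MAX_SCORE - abs(sum(real) - sum(estimated))) / MAX_SCORE
dchr = 1 if band(sum(real)) == band(sum(estimated)) else 0

print(hit_rate, closeness, dl, dchr)
```

In this example the totals are 20 and 18, so `dl` stays high while `dchr` is 0 because the two totals fall on different sides of the 18/19 boundary.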

- **We show the evaluation measures of the proposed model**

In [20]:
display(score[['hit rate score','closeness rate score','dl','dchr']].describe())

|  | hit rate score | closeness rate score | dl | dchr |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.409524 | 0.728571 | 0.796825 | 0.4 |
| std | 0.199893 | 0.135658 | 0.154634 | 0.502625 |
| min | 0.142857 | 0.460317 | 0.460317 | 0.0 |
| 25% | 0.238095 | 0.630952 | 0.690476 | 0.0 |
| 50% | 0.380952 | 0.761905 | 0.833333 | 0.0 |
| 75% | 0.52381 | 0.809524 | 0.90873 | 1.0 |
| max | 0.857143 | 0.984127 | 0.984127 | 1.0 |
