# Model built with *whoosh* for the master's thesis (TFM) **"Estimación automática de signos de depresión a partir de análises de texto."** of the Máster universitario en tecnoloxías de análise de datos masivos: Big Data, academic year 2019/2020

## Author: Manuel Ramón Varela López

- See the **Readme** for usage instructions.

### Preliminary steps

- **We start with the notebook configuration data.**
    - The JSON file with the questions and answers of the BDI test.
    - The directory that holds the XML documents with the users' posts.
    - The file that contains the real results.
    - The directory in which the index for each user will be created.

In [1]:
questions_file = "inquerito.json"
dir_corpus = 'corpus'
file_real_results = 'Depression Questionnaires_anon.txt'
dir_index = 'indexes'

- **Import the libraries**

In [2]:
%matplotlib inline
import json
import xml.etree.ElementTree as ET
from whoosh.fields import Schema, TEXT, KEYWORD
from whoosh import index
from whoosh.writing import BufferedWriter
from whoosh.qparser import QueryParser,OrGroup
from whoosh.query import Term
from whoosh.analysis import StopFilter, RegexTokenizer
from whoosh.lang.porter import stem
from whoosh.lang.morph_en import variations
from whoosh.lang.wordnet import Thesaurus
import os
import pandas as pd
import math
import numpy as np
import shutil
import matplotlib.pyplot as plt
from whoosh.scoring import TF_IDF
import nltk

- **Configuration**

In [3]:
pd.options.display.max_columns = None

### Start of the script

- **Load the questions.**
    - The BDI test questions are loaded from the file set at the start of the script.

In [4]:
with open(questions_file) as json_file:
    questions = json.load(json_file)

- **List the XML files with the Reddit posts.**
    - We collect every XML file (one per user) in the corpus directory set at the start of the script.

In [5]:
corpus_files = [f for f in os.listdir(dir_corpus) if os.path.isfile(os.path.join(dir_corpus, f))]

- **Read the XML files.**
    - We process these files and store all the information in a list with one dict per user.

In [6]:
data = []
for file in corpus_files:
    dataElement = {}
    writings = []
    tree = ET.parse(dir_corpus + os.path.sep + file)
    root = tree.getroot()
    for child in root:
        if child.tag == 'ID':
            dataElement['id']=child.text
        elif child.tag=='WRITING':
            writing = {}
            for wriIter in child:
                if wriIter.tag == 'TITLE':
                    writing['title']=wriIter.text
                elif wriIter.tag == 'DATE':
                    writing['date']=wriIter.text
                elif wriIter.tag == 'INFO':
                    writing['info']=wriIter.text
                elif wriIter.tag == 'TEXT':
                    writing['text']=wriIter.text
            writings.append(writing)
    dataElement['corpus'] = writings
    data.append(dataElement)
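
The loop above expects each user file to contain an `ID` element and a series of `WRITING` elements with `TITLE`, `DATE`, `INFO` and `TEXT` children. The following is a minimal sketch of the same parsing on an in-memory document. The root tag `INDIVIDUAL`, the sample contents and the helper name `parse_user` are assumptions made for illustration, not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus files may differ in root tag and content.
SAMPLE_XML = """<INDIVIDUAL>
  <ID>subject0000</ID>
  <WRITING>
    <TITLE>First post</TITLE>
    <DATE>2020-01-01 10:00:00</DATE>
    <INFO>reddit post</INFO>
    <TEXT>I could not sleep again.</TEXT>
  </WRITING>
  <WRITING>
    <TITLE></TITLE>
    <DATE>2020-01-02 11:30:00</DATE>
    <INFO>reddit comment</INFO>
    <TEXT>Thanks for the advice.</TEXT>
  </WRITING>
</INDIVIDUAL>"""

def parse_user(xml_text):
    """Return {'id': ..., 'corpus': [...]} for one user document."""
    root = ET.fromstring(xml_text)
    user = {'id': None, 'corpus': []}
    for child in root:
        if child.tag == 'ID':
            user['id'] = child.text
        elif child.tag == 'WRITING':
            # Empty elements yield None in ElementTree, so fall back to ''.
            writing = {field.tag.lower(): (field.text or '') for field in child}
            user['corpus'].append(writing)
    return user

sample_user = parse_user(SAMPLE_XML)
print(sample_user['id'], len(sample_user['corpus']))
```

The `or ''` fallback is a design choice of this sketch: a completely empty `TITLE` would otherwise be stored as `None`, and concatenating it with the post text later on would raise a `TypeError`.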

- **Create the schema that the documents in the index will follow**

In [7]:
schema = Schema(text=TEXT(stored=True),subject=KEYWORD)

- **Delete the folder that contains the indexes and create it again**

In [8]:
if os.path.exists(dir_index):
    shutil.rmtree(dir_index)

os.mkdir(dir_index)

- **Function that adds a user's posts to that user's index**

In [9]:
#Store the information in the index
def insert_data_user(ix,user):

    #Open the writer
    writer = BufferedWriter(ix, period=120, limit=10)

    #Iterate over all of the user's posts
    text = ''
    identificador = user['id']
    #Store each document in the index
    for doc in user['corpus']:
        text = doc['title'] + ' ' + doc['text']
        writer.add_document(text=text,subject=identificador)
    writer.close()

- **Function that creates an index for a user**
    - Creates an index for the user passed as a parameter.
    - Adds the user's posts to the index.
    - Returns the created index.

In [10]:
#Function that creates the index
def create_index(user):

    #If an index for this user already exists, delete it
    name_index = user['id']
    aux_dir = dir_index + os.sep + name_index
    if os.path.exists(aux_dir):
        shutil.rmtree(aux_dir)

    #Create the folder for the index
    os.mkdir(aux_dir)
    index.create_in(aux_dir, schema)

    #Open the index
    ix = index.open_dir(aux_dir)

    insert_data_user(ix,user)

    return ix

- **Get the tokenizer to split words and the stop-word filter, and load the synonyms.**

In [11]:
tokenizer = RegexTokenizer()
stopper = StopFilter()
#Load the synonyms into memory
f = open("utils/wn_s.pl")
thesaurus = Thesaurus.from_file(f)

- **Function that converts every letter of each word to lowercase.**

In [12]:
def LowercaseFilter(tokens):
    for t in tokens:
        t.text = t.text.lower()
        yield t

- **Function that receives an array of words and appends to it the stemmed form (without endings) of each word**

In [13]:
def delete_terminations(words):
    words_aux = []
    for w in words:
        aux = stem(w)
        if (aux not in words) and (aux not in words_aux):
            words_aux.append(aux)
    words = words + words_aux
    return words

- **Function that receives an array of words and appends to it the words derived from each of them.**

In [14]:
def add_terminations(words):
    words_aux = []
    for w in words:
        aux = variations(w)
        for aux2 in aux:
            if (aux2 not in words) and (aux2 not in words_aux):
                words_aux.append(aux2)
    words = words + words_aux
    return words

- **Function that turns an array of words into a single string containing all the words.**

In [15]:
def vector_to_string(words):
    text = ""
    for w in words:
        text = text + " " + w
    return text

- **Function that builds the query from a string**

In [16]:
def build_query(phrase):
    
    #Tokenize, lowercase and remove the stop words -> this is always done
    words = []
    for token in stopper(LowercaseFilter(tokenizer(phrase))):
        words.append(token.text)

    #Stem the words and append the stems to the list
    words = delete_terminations(words)

    #Append the morphological variations of each word
    words = add_terminations(words)

    return vector_to_string(words)
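
The first stages of `build_query` can be sketched with the standard library alone. In this sketch the token pattern, the stop-word set and `toy_stem` are illustrative stand-ins (assumptions), not the whoosh tokenizer, its stop-word list or the Porter stemmer. The helper `expand` mirrors the "append only new forms" logic shared by `delete_terminations` and `add_terminations`.

```python
import re

# Illustrative stand-ins only: not whoosh's tokenizer pattern or stop-word list.
TOKEN_RE = re.compile(r"\w+")
STOP_WORDS = {"i", "a", "am", "and", "of", "that", "the", "to"}

def tokenize(phrase):
    """Split into word tokens, lowercase them and drop stop words."""
    lowered = (m.group().lower() for m in TOKEN_RE.finditer(phrase))
    return [t for t in lowered if t not in STOP_WORDS]

def toy_stem(word):
    """Crude suffix stripper used here instead of the Porter stemmer (assumption)."""
    for suffix in ("ness", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def expand(words, transform):
    """Append transform(w) for each word, skipping forms already present."""
    extra = []
    for w in words:
        new = transform(w)
        if new not in words and new not in extra:
            extra.append(new)
    return words + extra

words = expand(tokenize("Sadness: I am so sad that I keep crying"), toy_stem)
query_text = " ".join(words)
print(query_text)
```

Because the query is later parsed with `OrGroup`, every extra form added here widens the match: a post is retrieved if it contains any one of the words.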

- **Returns the query text, built from the question text and its highest-scoring answer**

In [17]:
def get_query(question):
    answers = question['answers']
    answer = answers[len(answers)-1]
    answer_text = answer['answer_text']
    return question['question_text'] + ' ' + answer_text
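
As an illustration of the input that `get_query` expects, here is a minimal sketch with a hypothetical questionnaire entry. The key names come from the code above. The wording of the question and answers is illustrative, and the sketch assumes that answers are listed in increasing order of score, so that the last one is the highest.

```python
# Hypothetical entry; the real inquerito.json may use other texts and extra keys.
question = {
    "question_number": 1,
    "question_text": "Sadness",
    "answers": [
        {"answer_text": "I do not feel sad."},
        {"answer_text": "I feel sad much of the time."},
        {"answer_text": "I am sad all the time."},
        {"answer_text": "I am so sad or unhappy that I can't stand it."},
    ],
}

def get_query(question):
    # The last answer in the list is taken as the highest-scoring one.
    last_answer = question["answers"][-1]
    return question["question_text"] + " " + last_answer["answer_text"]

query_text = get_query(question)
print(query_text)
```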

- **Returns the number of documents for a query.**
    - Receives the index that holds the corpus.
    - Also receives the query.
    - Runs a disjunctive (OR) query against the index.
    - Returns the number of documents retrieved.

In [18]:
def get_puntuation(ix,query):
    with ix.searcher() as s:
        qp = QueryParser("text", schema=schema,group=OrGroup)
        #Note: "subject" is the global variable set by the main loop for the current user
        allow_q = Term("subject", subject)
        
        #Run a single search
        q = qp.parse(query)
        res = s.search(q,filter=allow_q,limit=None)
        numero_ducumentos = len(res)

        return numero_ducumentos

- **Create a DataFrame to store the results.**

In [19]:
#Iterate over all the questions
d = ['subject']
for question in questions:
    d.append(question['question_number'])

#Create the DataFrame for the measures
results = pd.DataFrame(columns=d)

- **Compute the fraction of retrieved documents for each user and question**

In [20]:
#Iterate over all users
for user in data:
    
    #Get the user id
    subject = user['id']
    
    #Get the number of posts of the user
    len_corpus = len(user['corpus'])
    
    #Create the index for the user and add all the posts
    ix = create_index(user)
    
    #Collect the measures for this user
    subject_res = {'subject':subject}
    
    #Iterate over the questions
    for question in questions:
        
        #Get the question number
        question_number = question['question_number']
        
        #Get the words of the query
        query_aux =  get_query(question)
        
        #Build the query
        query = build_query(query_aux)
        
        #Count the retrieved documents
        question_puntuation = get_puntuation(ix,query)
        
        #Compute the fraction of retrieved posts (0.0 when the user has no posts)
        score = question_puntuation / len_corpus if len_corpus > 0 else 0.0

        subject_res[question_number] = score
        
    #Store the scores for this user (DataFrame.append was removed in pandas 2.0)
    results = pd.concat([results, pd.DataFrame([subject_res])], ignore_index=True)
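
The value stored for each user and question is simply the share of that user's posts matched by the query. A minimal sketch of the arithmetic follows. The counts are hypothetical, and returning 0.0 for a user with no posts is an assumption of this sketch.

```python
def retrieval_fraction(n_retrieved, n_posts):
    """Fraction of a user's posts matched by a query; 0.0 for an empty corpus."""
    return n_retrieved / n_posts if n_posts > 0 else 0.0

# Hypothetical counts, chosen only to illustrate the arithmetic.
fraction = retrieval_fraction(27, 120)
empty = retrieval_fraction(0, 0)
print(fraction, empty)
```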

- **Show the results obtained**
    - The "subject" column is set as the index of the DataFrame.
    - The DataFrame is displayed: for each user and question it shows the similarity between the query (question text plus its highest-scoring answer) and the user's corpus, measured as the fraction of the user's posts retrieved.

In [21]:
results=results.set_index('subject')
display(results)

subject,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
subject1272,0.1,0.225,0.141667,0.108333,0.141667,0.125,0.033333,0.05,0.1,0.125,0.2,0.066667,0.05,0.066667,0.1,0.133333,0.141667,0.083333,0.033333,0.233333,0.008333
subject2341,0.364341,0.604651,0.550388,0.465116,0.48062,0.395349,0.046512,0.193798,0.310078,0.589147,0.596899,0.410853,0.27907,0.333333,0.581395,0.348837,0.426357,0.333333,0.147287,0.666667,0.062016
subject2432,0.216867,0.23494,0.174699,0.177711,0.093373,0.138554,0.01506,0.048193,0.14759,0.174699,0.313253,0.111446,0.108434,0.036145,0.156627,0.171687,0.192771,0.093373,0.03012,0.283133,0.018072
subject2827,0.346908,0.417798,0.358974,0.24736,0.342383,0.309201,0.043741,0.07994,0.199095,0.405732,0.515837,0.155354,0.177979,0.253394,0.238311,0.197587,0.229261,0.170437,0.049774,0.371041,0.058824
subject2903,0.405751,0.485623,0.392971,0.351438,0.258786,0.348243,0.022364,0.13738,0.300319,0.428115,0.517572,0.246006,0.172524,0.121406,0.277955,0.309904,0.386581,0.246006,0.095847,0.539936,0.051118
subject2961,0.477778,0.627778,0.5,0.438889,0.433333,0.433333,0.094444,0.205556,0.25,0.588889,0.555556,0.427778,0.183333,0.3,0.277778,0.4,0.338889,0.216667,0.1,0.494444,0.116667
subject3707,0.402153,0.631115,0.428571,0.563601,0.37182,0.385519,0.181018,0.340509,0.391389,0.562622,0.669276,0.456947,0.406067,0.290607,0.443249,0.507828,0.389432,0.272994,0.266145,0.578278,0.056751
subject3993,0.288079,0.435762,0.286755,0.355629,0.254967,0.216556,0.05298,0.133113,0.280132,0.311258,0.443046,0.26755,0.233113,0.075497,0.352318,0.311258,0.342384,0.25894,0.072185,0.46755,0.106623
subject4058,0.333658,0.200389,0.425097,0.22179,0.116732,0.293774,0.005837,0.087549,0.189689,0.253891,0.47179,0.13035,0.093385,0.02821,0.311284,0.180934,0.332685,0.114786,0.042802,0.4893,0.055447
subject436,0.517241,0.448276,0.448276,0.241379,0.275862,0.413793,0.0,0.172414,0.413793,0.551724,0.517241,0.172414,0.206897,0.206897,0.310345,0.172414,0.413793,0.172414,0.275862,0.551724,0.0


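- **Functions to transform the obtained results into selected answers**

In [22]:
#Questions whose answers are mapped to 4 levels (0-3)
preguntas4 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,19,20,21]
#Questions whose answers are mapped to 7 levels (0, 1a, 1b, 2a, 2b, 3a, 3b)
preguntas7 = [16,18]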

def p4(value):
    if(value>=1):
        return 3
    elif(value>=0.66):
        return 2
    elif(value>=0.33):
        return 1
    else:
        return 0

def p7(value):
    if(value>=1):
        return '3b'
    elif(value>=0.83):
        return '3a'
    elif(value>=0.67):
        return '2b'
    elif(value>=0.5):
        return '2a'
    elif(value>=0.33):
        return '1b'
    elif(value>=0.17):
        return '1a'
    else:
        return 0
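
To make the thresholds concrete, the sketch below re-declares `p4` and `p7` so that it runs on its own and applies them to a few fractions from the `subject2341` row shown earlier. The docstrings and the dictionary names are additions of this sketch.

```python
def p4(value):
    """Map a retrieval fraction to one of the 4 levels 0-3."""
    if value >= 1:
        return 3
    elif value >= 0.66:
        return 2
    elif value >= 0.33:
        return 1
    return 0

def p7(value):
    """Map a retrieval fraction to one of the 7 levels 0, 1a, ..., 3b."""
    if value >= 1:
        return '3b'
    elif value >= 0.83:
        return '3a'
    elif value >= 0.67:
        return '2b'
    elif value >= 0.5:
        return '2a'
    elif value >= 0.33:
        return '1b'
    elif value >= 0.17:
        return '1a'
    return 0

# Fractions taken from the subject2341 row of the results table.
examples_p4 = {1: 0.364341, 7: 0.046512, 20: 0.666667}
examples_p7 = {16: 0.348837, 18: 0.333333}

answers_p4 = {q: p4(v) for q, v in examples_p4.items()}
answers_p7 = {q: p7(v) for q, v in examples_p7.items()}
print(answers_p4)
print(answers_p7)
```

Note that the highest level (3 or '3b') is only reached when every post of the user matches the query, since it requires a fraction of 1.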

- **Transform the results obtained into filled-in answers.**

In [23]:
for i in preguntas4:
    results[i] = results[i].map(p4)
for i in preguntas7:
    results[i] = results[i].map(p7)

- **Show each user's answers to each of the questions.**

In [24]:
display(results)

subject,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
subject1272,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
subject2341,1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1b,1,1b,0,2,0
subject2432,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1a,0,0,0,0,0
subject2827,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1a,0,1a,0,1,0
subject2903,1,1,1,1,0,1,0,0,0,1,1,0,0,0,0,1a,1,1a,0,1,0
subject2961,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,1b,1,1a,0,1,0
subject3707,1,1,1,1,1,1,0,1,1,1,2,1,1,0,1,2a,1,1a,0,1,0
subject3993,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1a,1,1a,0,1,0
subject4058,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1a,1,0,0,1,0
subject436,1,1,1,0,0,1,0,0,1,1,1,0,0,0,0,1a,1,1a,0,1,0


### Evaluation

- **Load the real data and store it in a DataFrame**

In [25]:
real_results = pd.read_csv(file_real_results,index_col=False,header=None,sep='\t')
real_results.columns = d
real_results=real_results.set_index('subject')

- **Functions needed to compute the results**

In [26]:
#This function is needed because some answers include a letter (e.g. '1a')
def transform(answer):
    if(type(answer)!=str):
        return answer
    elif len(answer) == 1:
        return int(answer)
    elif len(answer) == 2:
        return int(answer[0])

def dhcl_score(real,estimed):
    if (real <= 9) and (estimed<=9):
        return 1
    elif (real>29) and (estimed>29):
        return 1
    elif (real>9) and (real<=18) and (estimed>9) and (estimed<=18):
        return 1
    elif (real>18) and (real<=29) and (estimed>18) and (estimed<=29):
        return 1
    else:
        return 0
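
The two helpers can be read as "strip the letter from an answer" and "check whether two total scores fall in the same band". The sketch below is an equivalent reformulation rather than the notebook's own code. The helper `category` is introduced here for illustration, and for integer totals its bands are 0-9, 10-18, 19-29 and 30 or more.

```python
def transform(answer):
    """Return the numeric part of an answer such as 2, '2' or '2a'."""
    if not isinstance(answer, str):
        return answer
    return int(answer[0])

def category(total):
    """Band of a total score, using the same thresholds as dhcl_score."""
    if total <= 9:
        return 0
    elif total <= 18:
        return 1
    elif total <= 29:
        return 2
    return 3

def dhcl_score(real, estimated):
    """1 when both totals fall in the same band, 0 otherwise."""
    return 1 if category(real) == category(estimated) else 0

same_band = dhcl_score(30, 63)       # both totals are above 29
different_band = dhcl_score(18, 19)  # 18 and 19 fall in adjacent bands
print(transform('1a'), same_band, different_band)
```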

- **Compute the 4 evaluation measures described in the thesis**

In [27]:
#Maximum absolute difference between two answers (answers range from 0 to 3)
mad = 3
score_array = []

#Iterate over all users
for identificador in real_results.index.values:
    hits = 0
    crs = 0
    scr_real = 0
    scr_stm = 0
    #Iterate over all questions
    for question in questions:
        q = question['question_number']
        real_a = real_results.loc[identificador,q]
        estimated_a = results.loc[identificador,q]

        #Count the questions answered correctly
        if real_a == estimated_a:
            hits = hits + 1

        #Measure how close the estimated answer is to the real one
        crs_aux = (mad - abs(transform(real_a)-transform(estimated_a)))/mad
        crs = crs + crs_aux

        #Accumulate the total depression scores
        scr_real = scr_real + transform(real_a)
        scr_stm = scr_stm + transform(estimated_a)

    #Compute the per-user measures: hit rate, closeness rate, dl and dchr
    hit_score_aux = hits / len(questions)
    cls_score_aux = crs / len(questions)
    dl = (63 - abs(scr_real - scr_stm))/63
    dhcl = dhcl_score(scr_real,scr_stm)
    score_array.append({'subject':identificador,
                        'hit rate score':hit_score_aux,
                        'closeness rate score':cls_score_aux,
                        'real score':scr_real,
                        'estimated_score':scr_stm,
                        'dl':dl,
                        'dchr':dhcl})


score = pd.DataFrame(score_array)
score = score.set_index('subject')
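
The per-user arithmetic of the loop above can be checked on a small case. The following sketch computes the hit rate, the closeness rate and `dl` for a hypothetical three-question excerpt. The function name `user_measures` and the sample answers are assumptions, and `dl` keeps the 63-point maximum used in the notebook.

```python
def transform(answer):
    """Return the numeric part of an answer such as 2 or '2a'."""
    return int(answer[0]) if isinstance(answer, str) else answer

def user_measures(real, estimated, mad=3, max_total=63):
    """Hit rate, closeness rate and dl for one user."""
    n = len(real)
    pairs = list(zip(real, estimated))
    # A hit requires identical answers, including the letter.
    hits = sum(1 for r, e in pairs if r == e)
    # Closeness compares only the numeric part of each answer.
    closeness = sum((mad - abs(transform(r) - transform(e))) / mad for r, e in pairs)
    total_real = sum(transform(r) for r in real)
    total_est = sum(transform(e) for e in estimated)
    return {
        'hit rate score': hits / n,
        'closeness rate score': closeness / n,
        'dl': (max_total - abs(total_real - total_est)) / max_total,
    }

# Hypothetical 3-question excerpt; the questionnaire in the notebook has 21 items.
measures = user_measures(real=[2, '1a', 0], estimated=[1, '1a', 3])
print(measures)
```

Here one answer out of three matches exactly, the per-question closeness values are 2/3, 1 and 0, and the totals differ by 2 points (3 versus 5), so `dl` is 61/63.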

- **Show the evaluation measures of the proposed model**

In [28]:
display(score[['hit rate score','closeness rate score','dl','dchr']].describe())

statistic,hit rate score,closeness rate score,dl,dchr
count,20.0,20.0,20.0,20.0
mean,0.340476,0.661905,0.731746,0.35
std,0.218641,0.194956,0.234526,0.48936
min,0.047619,0.253968,0.253968,0.0
25%,0.130952,0.492063,0.535714,0.0
50%,0.357143,0.722222,0.865079,0.0
75%,0.488095,0.825397,0.924603,1.0
max,0.761905,0.936508,1.0,1.0
