# **Developing a disease question router**
# **Processing the question and answer text**
### **By Nayare Montes Gavilán**

## **1. Introduction**
In this section we process the content of the questions and answers to produce a simpler input that can later be passed to the model.

## **2. Data loading**

In [2]:
import json

with open('../data_processed/train/train_1.json', 'r') as file:
    train_1 = json.load(file)

Original text:

In [76]:
original_question = train_1['NLM-QUESTION'][0]['MESSAGE']
original_question

'Literature on Cardiac amyloidosis.  Please let me know where I can get literature on Cardiac amyloidosis.  My uncle died yesterday from this disorder.  Since this is such a rare disorder, and to honor his memory, I would like to distribute literature at his funeral service.  I am a retired NIH employee, so I am familiar with the campus in case you have literature at NIH that I can come and pick up.  Thank you '

In [20]:
train_1['NLM-QUESTION'][0]['SUB-QUESTIONS']['SUB-QUESTION']['ANSWERS']['ANSWER']

['Cardiac amyloidosis is a disorder caused by deposits of an abnormal protein (amyloid) in the heart tissue. These deposits make it hard for the heart to work properly.',
 'The term "amyloidosis" refers not to a single disease but to a collection of diseases in which a protein-based infiltrate deposits in tissues as beta-pleated sheets. The subtype of the disease is determined by which protein is depositing; although dozens of subtypes have been described, most are incredibly rare or of trivial importance. This analysis will focus on the main systemic forms of amyloidosis, both of which frequently involve the heart.']

## **3. Transforming the train data**

Let's lowercase the text, strip leading and trailing whitespace, and remove punctuation symbols.

In [77]:
import re

new_question = original_question.lower().strip()

new_question = re.sub(r'[\.\?\!\,\:\;\"]', '', new_question)

new_question

'literature on cardiac amyloidosis  please let me know where i can get literature on cardiac amyloidosis  my uncle died yesterday from this disorder  since this is such a rare disorder and to honor his memory i would like to distribute literature at his funeral service  i am a retired nih employee so i am familiar with the campus in case you have literature at nih that i can come and pick up  thank you'
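
Note that `strip()` only trims the ends of the string, so the double spaces between sentences survive, as the output above shows. A minimal sketch of one way to collapse them, assuming single-space separation is what the model input should have (the helper name `normalize_text` is hypothetical and not part of this notebook):

```python
import re

# Same punctuation marks removed in the cell above
PUNCTUATION = re.compile(r'[.?!,:;"]')
# Any run of whitespace (spaces, tabs, newlines)
WHITESPACE = re.compile(r'\s+')

def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace runs to one space."""
    text = PUNCTUATION.sub('', text.lower())
    return WHITESPACE.sub(' ', text).strip()

sample = 'Literature on Cardiac amyloidosis.  Thank you '
print(normalize_text(sample))  # literature on cardiac amyloidosis thank you
```

Collapsing whitespace after removing punctuation also handles the gaps left behind when a symbol sat between two spaces.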

Now let's remove the stop words.


In [21]:
from nltk.corpus import stopwords

In [26]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/perseis/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [23]:
question_split=new_question.split(' ')

In [28]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in question_split if token.lower() not in stop_words]
new_question = ' '.join(filtered_tokens)
new_question

'literature cardiac amyloidosis  please let know get literature cardiac amyloidosis  uncle died yesterday disorder  since rare disorder honor memory would like distribute literature funeral service  retired nih employee familiar campus case literature nih come pick  thank'
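
Because the text is split with `split(' ')`, each double space produces an empty token, which is why the double spaces reappear after the join. A small sketch of the same filter using `split()` with no argument, which drops empty tokens. The `STOP_WORDS` set here is only a tiny illustrative subset standing in for `set(stopwords.words('english'))`, and `remove_stop_words` is a hypothetical helper name:

```python
# Illustrative stand-in for NLTK's English stop word list (assumption: tiny subset only)
STOP_WORDS = {'on', 'me', 'i', 'can', 'where', 'you'}

def remove_stop_words(text: str, stop_words: set) -> str:
    """Drop stop words; split() with no argument ignores repeated whitespace."""
    tokens = text.split()
    return ' '.join(token for token in tokens if token.lower() not in stop_words)

sample = 'literature on cardiac amyloidosis  please let me know where i can get literature'
print(remove_stop_words(sample, STOP_WORDS))
# literature cardiac amyloidosis please let know get literature
```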

We can create a function that receives the JSON file and processes the texts. We will not remove the stop words, since the future model may need them to understand the context.

In [90]:
def cleanText(qna:str):
    '''Receives a JSON file and cleans its questions and answers.
    -qna: path to the JSON file'''
    with open(qna, 'r') as file:
        questions = json.load(file)

    for question in questions.values():

        quest = question['MESSAGE']
        if quest is not None:
            quest = quest.lower().strip()
            quest = re.sub(r'[\.\?\!\,\:\;\"]', '', quest)
            question['MESSAGE'] = quest

        answer = question['ANSWER']
        if isinstance(answer, list):
            # Each pass rebinds `ans`, so only the last cleaned answer is kept;
            # an empty list on the first record leaves `ans` unassigned.
            for ans in answer:
                ans = ans.lower().strip()
                ans = re.sub(r'[\.\?\!\,\:\;\"]', '', ans)
        else:
            answer = answer.lower().strip()
            ans = re.sub(r'[\.\?\!\,\:\;\"]', '', answer)

        question['ANSWER'] = ans

    # Returns a (questions, None) tuple, because print() returns None
    return questions, print(questions)


In [80]:
with open('/home/perseis/qualentum/tfb/data_processed/train/train.json', 'r') as file:
    questions = json.load(file)

In [81]:
questions['Q0']['MESSAGE']

'Literature on Cardiac amyloidosis.  Please let me know where I can get literature on Cardiac amyloidosis.  My uncle died yesterday from this disorder.  Since this is such a rare disorder, and to honor his memory, I would like to distribute literature at his funeral service.  I am a retired NIH employee, so I am familiar with the campus in case you have literature at NIH that I can come and pick up.  Thank you '

In [91]:
cleanText('../data_processed/train/train.json')



({'Q0': {'SUBJECT': None,
   'MESSAGE': 'literature on cardiac amyloidosis  please let me know where i can get literature on cardiac amyloidosis  my uncle died yesterday from this disorder  since this is such a rare disorder and to honor his memory i would like to distribute literature at his funeral service  i am a retired nih employee so i am familiar with the campus in case you have literature at nih that i can come and pick up  thank you',
   'ANSWER': 'the term amyloidosis refers not to a single disease but to a collection of diseases in which a protein-based infiltrate deposits in tissues as beta-pleated sheets the subtype of the disease is determined by which protein is depositing although dozens of subtypes have been described most are incredibly rare or of trivial importance this analysis will focus on the main systemic forms of amyloidosis both of which frequently involve the heart',
   'FOCUS': 'cardiac amyloidosis',
   'TYPE': 'information'},
  'Q1': {'SUBJECT': 'treatment 

## **4. Transforming the test data**

For the test data the function needs a small change, since the key names are different.
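
One possible way to avoid maintaining two nearly identical functions is to pass the key names as a parameter. The sketch below works on an already loaded dictionary. The helper names `clean_string` and `clean_fields` are hypothetical, and so is the test-style key `'Original-Question'`, since the actual test keys are not shown in this section:

```python
import re

def clean_string(text: str) -> str:
    """Lowercase, trim the ends, and remove the punctuation marks used above."""
    return re.sub(r'[.?!,:;"]', '', text.lower().strip())

def clean_fields(records: dict, keys: tuple) -> dict:
    """Clean, in place, every string field named in `keys` for each record."""
    for record in records.values():
        for key in keys:
            value = record.get(key)  # missing keys and None values are skipped
            if isinstance(value, str):
                record[key] = clean_string(value)
    return records

# Toy records; the second key layout is an assumption for illustration only
train_like = {'Q0': {'MESSAGE': 'What is Asthma?', 'ANSWER': 'A lung disease.'}}
test_like = {'T0': {'Original-Question': 'What is Asthma?'}}

print(clean_fields(train_like, ('MESSAGE', 'ANSWER')))
print(clean_fields(test_like, ('Original-Question',)))
```

This sketch only handles string fields; list-valued answers would need their own branch.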

In [92]:
with open('../data_processed/test/test.json', 'r') as file:
    test = json.load(file)

In [93]:
cleanText('/home/perseis/qualentum/tfb/data_processed/test/test.json')

UnboundLocalError: cannot access local variable 'ans' where it is not associated with a value
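
The only path through `cleanText` that reaches `question['ANSWER'] = ans` without ever assigning `ans` is an `ANSWER` that is an empty list on the first record, so that is presumably what the test file contains. A minimal sketch of an answer-cleaning helper that tolerates that case, assuming we want to keep every answer in a list rather than only the last one (`clean_string` and `clean_answer` are hypothetical names, not part of the notebook):

```python
import re

def clean_string(text: str) -> str:
    """Lowercase, trim the ends, and remove the punctuation marks used above."""
    return re.sub(r'[.?!,:;"]', '', text.lower().strip())

def clean_answer(answer):
    """Return a cleaned copy of an answer that may be None, a str, or a list of str."""
    if answer is None:
        return None
    if isinstance(answer, list):
        # An empty list simply yields an empty list; no variable is left unassigned
        return [clean_string(item) for item in answer]
    return clean_string(answer)

print(clean_answer([]))                                    # []
print(clean_answer(['Amyloid deposits.', 'See "NIH".']))   # ['amyloid deposits', 'see nih']
print(clean_answer('Thank you!'))                          # thank you
```

Inside the loop, the assignment would then become `question['ANSWER'] = clean_answer(question['ANSWER'])`. Note that this changes the train output as well, because list answers stay lists.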