Authors: Orrana Lhaynher Veloso Sousa, David Pereira da Silva, Victor Eulalio Sousa Campelo, Romuere Rodrigues Veloso e Silva & Deborah Maria Vieira Magalhães
Abstract: The widespread adoption of medical document management has generated a large volume of unstructured data containing abbreviations, ambiguous terms, and typing errors. These factors make manual categorization expensive, time-consuming, and error-prone, so automatically classifying medical data into informative clinical categories can substantially reduce the cost of this task. In this context, this work evaluates an ensemble of classifiers that assigns clinical texts to one of three categories: prescriptions, clinical notes, and exam requests. We vectorize the text with a combination of n-gram + TF-IDF and BERTimbau. We then build the ensemble from Random Forest, Multilayer Perceptron, and Support Vector Machine classifiers and predict the final ensemble label through a voting approach. The results are promising: the ensemble reaches an accuracy of 0.99, a kappa of 0.99, and an F1-score of 0.99. Our approach allows automatic and accurate classification of clinical texts and achieves better categorization results than the individual approaches.
Keywords: Clinical data; Ensemble; Embeddings; Classification
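
The voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not say whether the vote is hard (majority of predicted labels) or soft (averaged probabilities), so this example assumes hard voting. The label strings, the sample predictions, the function names `majority_vote` and `ensemble_predict`, and the tie-break rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the label predicted by the most classifiers.

    Tie-break (an assumption, not from the paper): when several labels
    share the top count, the earliest vote among them wins.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    best = max(counts.values())
    # Scan in classifier order so that ties resolve deterministically.
    for label in votes:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: some label always has the top count")


def ensemble_predict(per_model_predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier label lists (aligned by document) into one list."""
    return [majority_vote(doc_votes) for doc_votes in zip(*per_model_predictions)]


# Hypothetical predictions from the three classifiers for four documents.
rf_preds = ["prescription", "clinical_note", "exam_request", "prescription"]
mlp_preds = ["prescription", "exam_request", "exam_request", "clinical_note"]
svm_preds = ["clinical_note", "clinical_note", "prescription", "exam_request"]

final_labels = ensemble_predict([rf_preds, mlp_preds, svm_preds])
print(final_labels)
```

With three classifiers and three classes, a document can receive three different votes, as the fourth document does here. In this sketch the first classifier in the list decides such cases, so the order of the models matters.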