pgfox/thesis_embedding_classification

Identification of Talking Points from Discussion Group Transcripts

Master of Science in Applied Information and Data Science

Lucerne University of Applied Sciences and Arts

Management Summary

This dissertation evaluates the effectiveness of different vector embedding techniques for multilabel text classification. The primary goal is to understand which embedding spaces are most effective when applied to real-world data.

This research uses data collected by IMPACT Initiatives, an organisation that gathers information on the needs and displacement patterns of crisis-affected populations. IMPACT’s data, primarily collected through discussion groups and interviews, is analysed to identify key talking points. These talking points form the foundation of IMPACT’s data products, which support humanitarian organisations in making informed crisis response decisions.

A range of models was used to generate embeddings, including Word2Vec, BERT, Sentence-BERT (sBERT), and Llama 3.1. Each embedding space was tested with three traditional classifiers: Logistic Regression, Random Forest, and Support Vector Classification.
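The evaluation grid described above (every embedding space paired with every classifier) can be sketched roughly as follows. This is an illustrative sketch with scikit-learn, not the thesis code. The synthetic features stand in for real embeddings, and the names `embedding_a` and `embedding_b`, the one-vs-rest wrapping, and the micro-averaged F1 metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for precomputed text embeddings and multilabel targets.
# Y is a binary indicator matrix: one column per talking point (assumed layout).
X, Y = make_multilabel_classification(
    n_samples=300, n_features=32, n_classes=4, random_state=0
)

# Hypothetical "embedding spaces": in practice each entry would hold vectors
# produced by a different model (Word2Vec, BERT, sBERT, Llama 3.1).
embeddings = {
    "embedding_a": X[:, :16],
    "embedding_b": X,
}

# The three classifier families named in the summary, each wrapped
# one-vs-rest so that one binary model is fitted per label.
classifiers = {
    "logistic_regression": lambda: OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "random_forest": lambda: OneVsRestClassifier(
        RandomForestClassifier(n_estimators=100, random_state=0)
    ),
    "svc": lambda: OneVsRestClassifier(SVC()),
}

# Split row indices once so every combination sees the same train/test rows.
train_idx, test_idx = train_test_split(
    np.arange(len(Y)), test_size=0.25, random_state=0
)

scores = {}
for emb_name, features in embeddings.items():
    for clf_name, make_clf in classifiers.items():
        clf = make_clf()
        clf.fit(features[train_idx], Y[train_idx])
        pred = clf.predict(features[test_idx])
        # Micro-averaged F1 is an assumed metric, chosen only for illustration.
        scores[(emb_name, clf_name)] = f1_score(
            Y[test_idx], pred, average="micro", zero_division=0
        )

for combo, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(combo, round(score, 3))
```

Sharing one train/test split across all combinations keeps the comparison between embedding spaces fair, since score differences then come from the features and classifier rather than from which rows happened to be held out.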

Of all the embedding–classifier combinations evaluated, a fine-tuned sBERT model paired with a Random Forest classifier achieved the highest classification performance. This combination is also capable of running efficiently on limited hardware, such as an IMPACT laptop in the field.

Overall, the findings suggest that lightweight, task-specific embedding models, when paired with lightweight classifiers, can deliver strong performance even in constrained computing environments.

Keywords: vector embeddings, multilabel classification, text classification, fine-tuning, Sentence-BERT
