"A new benchmark for NLP in Social Sciences: Evaluating the usefulness of pre-trained language models for classifying open-ended survey responses"

This repository complements our paper published at ICAART 2021.

We provide the data we used in the folders "clean_data/" and "traintest_data/". The first contains the cleaned and matched verbatim-code data for all 10 of our thematic data sets. The second contains the train/test splits we actually used, which were created with a method developed by [1] / [2] and implemented by [3]. The Python code we used is available under "src/".

The original, raw data comes from the "Open Ended Coding Project" which accompanied the 2008 ANES survey conducted by [4]. All credit for the initial collection of the open-ended data goes to ANES and the principal investigators: Jon Krosnick, Matt Berent and Arthur Lupia. The original raw data can be obtained via the ANES-website for the "Open Ended Coding Project" and the "2008 Time Series Study".

The creation of the cleaned data sets is described in our paper. For a quick overview of the method, consider the following:

To generate data sets usable for multi-label learning from the files distributed by the Open Ended Coding Project, we represent the codes assigned to each textual observation as a binary vector.

To map the numeric codes one-to-one to elements of the binary label vector, we took the total size of each code set from the codes documents enclosed with each data set. Using this information, we set the length of the binary mapping vectors equal to the cardinality of the code set. To generate k-hot encoded label vectors for each response, we designed a mapping dictionary for each code set that defines which code belongs to which element of the binary vector generated for a particular response.
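The mapping dictionary described above can be sketched roughly as follows. This is a minimal illustration, not the code from "src/": the function name `build_code_mapping` and the example codes are hypothetical, and sorting the codes is an assumption made only to keep the positions reproducible.

```python
def build_code_mapping(code_set):
    """Map each numeric code to one position in the binary label vector."""
    # Assumption: codes are sorted so that the assignment is reproducible.
    return {code: index for index, code in enumerate(sorted(set(code_set)))}


# Hypothetical code set; the real code sets come from the codes documents.
example_codes = [10, 20, 35, 97]
mapping = build_code_mapping(example_codes)

# The label vector length equals the cardinality of the code set.
vector_length = len(mapping)

print(mapping)        # {10: 0, 20: 1, 35: 2, 97: 3}
print(vector_length)  # 4
```

Because every code gets exactly one position and every position exactly one code, the mapping is one-to-one, which is what allows a label vector to be decoded back into numeric codes.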

To finally obtain the binary label vectors from the set of numeric codes associated with each observation, we designed a custom function that takes a mapping dictionary and the raw data row by row. The function returns a binary label vector of length q for each observation, where q is the cardinality of the code set and each vector element is 1 if the code mapped to this element was assigned to the response and 0 otherwise. All data sets are transformed using this function. For the later application of machine learning methods, we split the data into train and test sets (90/10) using an iterative stratification method for balancing the label distributions, developed by [1] / [2] and implemented by [3] in scikit-multilearn.
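As a rough sketch of the two steps in this paragraph, the snippet below encodes rows of numeric codes as k-hot vectors and then shows how a 90/10 split could be requested from scikit-multilearn. The names `encode_k_hot`, `encode_rows` and `split_90_10`, the example mapping and the example rows are hypothetical and are not taken from "src/". The split function assumes scikit-multilearn is installed and that `X` and `Y` are NumPy arrays or SciPy sparse matrices with one row per response.

```python
def encode_k_hot(assigned_codes, mapping):
    """Return a binary vector with a 1 at the position of every assigned code."""
    vector = [0] * len(mapping)
    for code in assigned_codes:
        # Raises KeyError if a code is not part of the code set.
        vector[mapping[code]] = 1
    return vector


def encode_rows(rows, mapping):
    """Encode the raw data row by row."""
    return [encode_k_hot(codes, mapping) for codes in rows]


def split_90_10(X, Y):
    """Iteratively stratified 90/10 split (requires scikit-multilearn)."""
    # Imported lazily so the encoding functions work without the package.
    from skmultilearn.model_selection import iterative_train_test_split

    X_train, Y_train, X_test, Y_test = iterative_train_test_split(
        X, Y, test_size=0.1
    )
    return X_train, Y_train, X_test, Y_test


# Hypothetical mapping and raw rows (codes assigned to three responses).
mapping = {10: 0, 20: 1, 35: 2, 97: 3}
rows = [[20], [10, 97], []]
label_vectors = encode_rows(rows, mapping)

print(label_vectors)  # [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
```

Raising an error for unknown codes is a design choice of this sketch: it surfaces mismatches between the raw data and the code set instead of silently dropping labels.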

Preliminary Leaderboard:

Below, we report the results previously achieved by [5], followed by the results obtained by the multi-label transfer-learning classification models we used. Columns 1 to 10 refer to the 10 data sets. All reported scores are the sample-based F1 described in [5].

Authors	1	2	3	4	5	6	7	8	9	10
Card/Smith (2015)	0.55	0.67	0.71	0.71	0.81	0.86	0.94	0.96	0.93	0.96
Our Baseline-model	0.44	0.51	0.57	0.54	0.68	0.88	0.92	0.95	0.90	0.91
Our BERT-model	0.00	0.02	0.44	0.35	0.41	0.79	0.94	0.95	0.91	0.93
Our RoBERTa-model	0.00	0.00	0.56	0.55	0.57	0.85	0.95	0.97	0.93	0.94
Our XLNet-model	0.00	0.00	0.54	0.58	0.55	0.86	0.96	0.98	0.91	0.92
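For readers unfamiliar with the metric, the following is a minimal sketch of a sample-based F1 under the common definition: for each response, F1 is twice the number of correctly predicted labels divided by the number of true labels plus the number of predicted labels, and the scores are averaged over all responses. The exact definition in [5] may differ in details. In particular, scoring a response with no true and no predicted labels as 0.0 is an assumption of this sketch, as are the function name `sample_f1` and the example data.

```python
def sample_f1(true_rows, pred_rows):
    """Average the per-response F1 over all responses."""
    scores = []
    for y_true, y_pred in zip(true_rows, pred_rows):
        # Number of labels that are both assigned and predicted.
        overlap = sum(t * p for t, p in zip(y_true, y_pred))
        total = sum(y_true) + sum(y_pred)
        # Assumption: a response with no true and no predicted labels scores 0.0.
        scores.append(2 * overlap / total if total else 0.0)
    return sum(scores) / len(scores)


# Hypothetical binary label vectors for two responses.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]

score = sample_f1(y_true, y_pred)
print(round(score, 4))  # 0.8333
```

Here the first response scores 2 * 1 / (2 + 1) = 0.667 and the second scores 1.0, giving an average of 0.833.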

Literature:

[1] Sechidis, K., Tsoumakas, G., & Vlahavas, I. (2011). On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, 145-158. http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf
[2] Szymański, P., & Kajdanowicz, T. (2017). A Network Perspective on Stratification of Multi-Label Data. Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 74:22-35. http://proceedings.mlr.press/v74/szyma%C5%84ski17a.html
[3] Szymański, P., & Kajdanowicz, T. (2017). A scikit-based Python environment for performing multi-label classification. ArXiv e-prints. https://arxiv.org/abs/1702.01460
[4] The American National Election Studies (ANES). ANES 2008 Time Series Study. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2015-11-10. https://doi.org/10.3886/ICPSR25383.v3
[5] Card, D., & Smith, N. (2015). Automated Coding of Open-Ended Survey Responses. Research Paper. https://www.ml.cmu.edu/research/dap-papers/DAP_Card.pdf
