<a href="https://colab.research.google.com/github/kalawinka/politics_classifier/blob/main/pollux_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Detecting scientific articles from the political science field using scientific abstracts


## Learning Objectives

By the end of this tutorial, you will be able to

- detect scientific articles from the political science domain using pretrained models and the transformers library


## Description
This tutorial shows how to detect texts from the political science domain using the abstracts of scientific articles. It describes the usage of two models.

### 1. English classification model to detect texts from the political science domain

The model is based on [SSCI-SciBERT](https://link.springer.com/article/10.1007/s11192-022-04602-4). It was fine-tuned using a dataset of 2,919 abstracts from scientific articles retrieved from the [BASE](https://www.base-search.net/) and [POLLUX](https://www.pollux-fid.de/) collections of scientific articles.

The BASE data were labelled as "politics" or "multi" according to the Dewey Decimal Classification (DDC). Data from several major political science journals in the POLLUX dataset were assigned to the "politics" class.

Accuracy: 0.9

Predicts 2 classes:

| class    | description              | precision | recall | f1-score | support |
|:---------|:-------------------------|:----------|:-------|:---------|:--------|
| multi    | other scientific domains | 0.911     | 0.899  | 0.905    | 513     |
| politics | political science        | 0.889     | 0.902  | 0.895    | 460     |

To learn more about the model, see the model card: https://huggingface.co/kalawinka/SSciBERT_politics
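
The figures in the table above can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard definitions (F1 as the harmonic mean of precision and recall, accuracy as the support-weighted mean of the per-class recall) and uses only the rounded values from the table, so small rounding differences are to be expected. The names `report` and `f1_score` are illustrative.

```python
# Per-class values copied from the table above: (precision, recall, support).
report = {
    "multi": (0.911, 0.899, 513),
    "politics": (0.889, 0.902, 460),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute F1 per class from the rounded precision and recall.
f1 = {name: f1_score(p, r) for name, (p, r, _) in report.items()}

# For single-label classification, accuracy equals the
# support-weighted mean of the per-class recall.
total = sum(support for _, _, support in report.values())
accuracy = sum(r * support for _, r, support in report.values()) / total

for name, value in f1.items():
    print(f"{name}: f1 = {value:.3f}")
print(f"accuracy = {accuracy:.3f}")
```

The recomputed values agree with the reported F1 scores and with the reported accuracy of 0.9 after rounding.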

### 2. Multilingual classification model to detect texts from the political science domain


This model is a multilingual version of our [SSciBERT_politics](https://huggingface.co/kalawinka/SSciBERT_politics). It is based on the [BERT multilingual base model (uncased)](http://arxiv.org/abs/1810.04805).

The model was fine-tuned using a dataset of 14,178 abstracts from scientific articles retrieved from the [BASE](https://www.base-search.net/)
and [POLLUX](https://www.pollux-fid.de/) collections of scientific articles.
Abstracts from scientific articles in 3 languages (English, German and French) were used for the training.
The BASE data were labelled as "politics" or "multi" according to the Dewey Decimal Classification (DDC).
Data from several major political science journals in the POLLUX dataset were assigned to the "politics" class.

Accuracy: 0.978

Predicts 2 classes in 3 languages (English, German and French):

| class    | description              | precision | recall | f1-score | support |
|----------|--------------------------|-----------|--------|----------|---------|
| politics |political science         | 0.975     | 0.978  | 0.976    | 2143    |
| multi    |other scientific domains  | 0.981     | 0.979  | 0.980    | 2583    |

Evaluation by class and language:

| class    | description              | language | precision | recall | f1-score | support |
|----------|--------------------------|----------|-----------|--------|----------|---------|
| politics | political science        | English  | 0.989     | 0.993  | 0.991    | 1212    |
| multi    | other scientific domains | English  | 0.992     | 0.989  | 0.991    | 1164    |
| politics | political science        | German   | 0.952     | 0.958  | 0.955    | 783     |
| multi    | other scientific domains | German   | 0.957     | 0.951  | 0.954    | 776     |
| politics | political science        | French   | 0.979     | 0.959  | 0.969    | 148     |
| multi    | other scientific domains | French   | 0.991     | 0.995  | 0.993    | 643     |

To learn more about the model, see the model card: https://huggingface.co/kalawinka/bert-base-ml-politics
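
The per-language rows and the overall rows describe the same test set, which can be seen by pooling the per-language recall with the support as weight. The sketch below is only a consistency check on the rounded table values; `by_language` and `pooled_recall` are illustrative names, not part of the model or its evaluation code.

```python
# (language, class) -> (recall, support), copied from the per-language table above.
by_language = {
    ("English", "politics"): (0.993, 1212),
    ("German", "politics"): (0.958, 783),
    ("French", "politics"): (0.959, 148),
    ("English", "multi"): (0.989, 1164),
    ("German", "multi"): (0.951, 776),
    ("French", "multi"): (0.995, 643),
}


def pooled_recall(label: str) -> tuple[float, int]:
    """Support-weighted recall and total support for one class across languages."""
    rows = [value for (_, cls), value in by_language.items() if cls == label]
    support = sum(s for _, s in rows)
    recall = sum(r * s for r, s in rows) / support
    return recall, support


for label in ("politics", "multi"):
    recall, support = pooled_recall(label)
    print(f"{label}: recall = {recall:.3f}, support = {support}")
```

The pooled values match the overall table (recall 0.978 with support 2143 for "politics", recall 0.979 with support 2583 for "multi"). The supports also show that French "politics" abstracts are the smallest group, with 148 examples.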

## Target Audience (Difficulty level)
- This tutorial is aimed at beginners with some knowledge of Python.

## Prerequisites
- Prior knowledge of Python programming is required for this tutorial
- Prior knowledge of Jupyter/IPython Notebook usage is required for this tutorial
    - To learn more see: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/
- Prior knowledge of Google Colab usage is required for this tutorial
    - To learn more see: https://colab.research.google.com/drive/16pBJQePbqkz3QFV54L4NIkOn1kwpuRrj

## Environment Setup

 - To run this tutorial, you need Python 3.11.4 or later.
 - The following cell installs the required packages.

In [None]:
# Install packages for Jupyter Notebook environment
!pip3 install transformers
!pip3 install datasets
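
Before continuing, it can help to confirm that the running interpreter meets the Python version named above. The following is a minimal sketch using only the standard library; `MINIMUM_PYTHON` and `meets_minimum` are illustrative names introduced here.

```python
import sys

# Minimum version stated in the Environment Setup section.
MINIMUM_PYTHON = (3, 11, 4)


def meets_minimum(version=None, minimum=MINIMUM_PYTHON) -> bool:
    """Compare a (major, minor, micro) version against the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:3]) >= tuple(minimum)


if not meets_minimum():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {found} is older than the version this tutorial was written for.")
```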


# How to use

In [None]:
#import libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

## 1. English classification model to detect texts from the political science domain

In [None]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained('kalawinka/SSciBERT_politics')
# load classification model
model = AutoModelForSequenceClassification.from_pretrained('kalawinka/SSciBERT_politics')
# initialize classification pipeline
# optional argument device: https://huggingface.co/docs/transformers/en/main_classes/pipelines#:~:text=device%20(int%20or%20str%20or%20torch.device)%20%E2%80%94%20Defines%20the%20device%20(e.g.%2C%20%22cpu%22%2C%20%22cuda%3A1%22%2C%20%22mps%22%2C%20or%20a%20GPU%20ordinal%20rank%20like%201)%20on%20which%20this%20pipeline%20will%20be%20allocated.
pipe = pipeline("text-classification", model=model, tokenizer = tokenizer, max_length=512, truncation=True)

### OR TRY
## 2. Multilingual classification model to detect texts from the political science domain

In [None]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained('kalawinka/bert-base-ml-politics')
# load classification model
model = AutoModelForSequenceClassification.from_pretrained('kalawinka/bert-base-ml-politics')
# initialize classification pipeline
# optional argument device: https://huggingface.co/docs/transformers/en/main_classes/pipelines#:~:text=device%20(int%20or%20str%20or%20torch.device)%20%E2%80%94%20Defines%20the%20device%20(e.g.%2C%20%22cpu%22%2C%20%22cuda%3A1%22%2C%20%22mps%22%2C%20or%20a%20GPU%20ordinal%20rank%20like%201)%20on%20which%20this%20pipeline%20will%20be%20allocated.
pipe = pipeline("text-classification", model=model, tokenizer = tokenizer, max_length=512, truncation=True)
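
Both pipelines above accept the optional `device` argument mentioned in the code comments. One possible way to choose a value is sketched below. `pick_device` is a hypothetical helper, not part of transformers; it assumes PyTorch is the backend and falls back to the CPU when PyTorch or an accelerator is not available.

```python
def pick_device() -> str:
    """Return a device string suitable for the pipeline's `device` argument."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so stay on the CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # The MPS backend (Apple Silicon) is missing in older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The result can then be passed when creating the pipeline, for example `pipeline("text-classification", model=model, tokenizer=tokenizer, device=pick_device(), max_length=512, truncation=True)`.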


## Try model (English or multilingual) with a single text
### Sample Input

In [None]:
# use classification pipeline with a single text
example = 'Germaparl is a collection of protocols of 72 years of parliamentary debates in the German Bundestag. Analysis of parliamentary debates can reveal the hidden programmatic-ideological positions of political parties. From the linguistic perspective, parliamentary debates can be analysed in terms of political discourse, sentiment and position-taking, and cross-cultural and gender differences. Our main focus is the analysis of interjections or ’Zwischenrufe’ and calls to order or ’Ordnungsrufe’. Calls to order are a good tool to analyse the negativity in politics. Moreover, we believe that both calls to order and interjections are able to reveal tendencies in the country’s political scene, collaboration patterns between political parties and the impact and productivity of a single politician or political party. To our best knowledge calls to order and interjections in the Germaparl corpus were not analysed by other researchers.'
out = pipe(example)
print(out)

[{'label': 'politics', 'score': 0.9995762705802917}]


### Output data

In [None]:
# access the individual fields of the model output
for el in out:
    # access classification label
    print(el.get('label'))
    # access score
    print(el.get('score'))

politics
0.9995762705802917
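
The score can be used to set aside predictions the model is less sure about. The sketch below works on the output shown above and needs no model download. The helper name `confident_label`, the threshold of 0.9, and the second, made-up prediction are illustrative assumptions, not recommendations from the model authors.

```python
# Output copied from the single-text example above.
out = [{'label': 'politics', 'score': 0.9995762705802917}]


def confident_label(prediction: dict, threshold: float = 0.9,
                    fallback: str = "uncertain") -> str:
    """Return the predicted label, or `fallback` if the score is below `threshold`."""
    if prediction["score"] >= threshold:
        return prediction["label"]
    return fallback


print(confident_label(out[0]))  # politics

# Hypothetical low-confidence prediction, for illustration only.
print(confident_label({"label": "multi", "score": 0.55}))  # uncertain
```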


## Apply model to corpus

In [None]:
# import libraries
from datasets import Dataset

### Sample Input

In [None]:
#create example corpus
data_dict = {
    'id': ['1', '2', '3', '4'],
    'text' : [
        'Germaparl is a collection of protocols of 72 years of parliamentary debates in the German Bundestag. Analysis of parliamentary debates can reveal the hidden programmatic-ideological positions of political parties. From the linguistic perspective, parliamentary debates can be analysed in terms of political discourse, sentiment and position-taking, and cross-cultural and gender differences. Our main focus is the analysis of interjections or ’Zwischenrufe’ and calls to order or ’Ordnungsrufe’. Calls to order are a good tool to analyse the negativity in politics. Moreover, we believe that both calls to order and interjections are able to reveal tendencies in the country’s political scene, collaboration patterns between political parties and the impact and productivity of a single politician or political party. To our best knowledge calls to order and interjections in the Germaparl corpus were not analysed by other researchers.',
        'Acknowledgments in scientific papers may give an insight into aspects of the scientific community, such as reward systems, collaboration patterns, and hidden research trends. The aim of the paper is to evaluate the performance of different embedding models for the task of automatic extraction and classification of acknowledged entities from the acknowledgment text in scientific papers. We trained and implemented a named entity recognition (NER) task using the Flair NLP framework. The training was conducted using three default Flair NER models with four differently-sized corpora and different versions of the Flair NLP framework. The Flair Embeddings model trained on the medium corpus with the latest FLAIR version showed the best accuracy of 0.79. Expanding the size of a training corpus from very small to medium size massively increased the accuracy of all training algorithms, but further expansion of the training corpus did not bring further improvement. Moreover, the performance of the model slightly deteriorated. Our model is able to recognize six entity types: funding agency, grant number, individuals, university, corporation, and miscellaneous. The model works more precisely for some entity types than for others; thus, individuals and grant numbers showed a very good F1-Score over 0.9. Most of the previous works on acknowledgment analysis were limited by the manual evaluation of data and therefore by the amount of processed data. This model can be applied for the comprehensive analysis of acknowledgment texts and may potentially make a great contribution to the field of automated acknowledgment analysis.',
        'Analysis of acknowledgments is particularly interesting as acknowledgments may give information not only about funding, but they are also able to reveal hidden contributions to authorship and the researcher’s collaboration patterns, context in which research was conducted, and specific aspects of the academic work. The focus of the present research is the analysis of a large sample of acknowledgement texts indexed in the Web of Science (WoS) Core Collection. Record types “article” and “review” from four different scientific domains, namely social sciences, economics, oceanography and computer science, published from 2014 to 2019 in a scientific journal in English were considered. Six types of acknowledged entities, i.e., funding agency, grant number, individuals, university, corporation and miscellaneous, were extracted from the acknowledgement texts using a Named Entity Recognition (NER) tagger and subsequently examined. A general analysis of the acknowledgement texts showed that indexing of funding information in WoS is incomplete. The analysis of the automatically extracted entities revealed differences and distinct patterns in the distribution of acknowledged entities of different types between different scientific domains. A strong association was found between acknowledged entity and scientific domain, and acknowledged entity and entity type. Only negligible correlation was found between the number of citations and the number of acknowledged entities. Generally, the number of words in the acknowledgement texts positively correlates with the number of acknowledged funding organizations, universities, individuals and miscellaneous entities. At the same time, acknowledgement texts with the larger number of sentences have more acknowledged individuals and miscellaneous categories.',
        'Purpose: The recent proliferation of preprints could be a way for researchers worldwide to increase the availability and visibility of their research findings. Against the background of rising publication costs caused by the increasing prevalence of article processing fees, the search for other ways to publish research results besides traditional journal publication may increase. This could be especially true for lower-income countries. Design/methodology/approach: Therefore, we are interested in the experiences and attitudes towards posting and using preprints in the Global South as opposed to the Global North. To explore whether motivations and concerns about posting preprints differ, we adopted a mixed-methods approach, combining a quantitative survey of researchers with focus group interviews. Findings: We found that respondents from the Global South were more likely to agree to adhere to policies and to emphasise that mandates could change publishing behaviour towards open access. They were also more likely to agree posting preprints has a positive impact. Respondents from the Global South and the Global North emphasised the importance of peer-reviewed research for career advancement. Originality: The study has identified a wide range of experiences with and attitudes towards posting preprints among researchers in the Global South and the Global North. To our knowledge, this has hardly been studied before, which is also because preprints only have emerged lately in many disciplines and countries.'
    ]
}
# convert the dictionary to a Hugging Face datasets.Dataset
# learn more about the dataset class: https://huggingface.co/docs/datasets/en/create_dataset
dataset = Dataset.from_dict(data_dict)

In [None]:
print(dataset)

Dataset({
    features: ['id', 'text'],
    num_rows: 4
})


In [None]:
# apply the classification pipeline to the whole corpus in batches;
# each row gets a {'label', 'score'} dictionary in the new column 'multi_clas'
encoded_dataset = dataset.map(
    lambda x: {"multi_clas": pipe(x['text'])},
    batched=True,
)

Map: 100%|██████████| 4/4 [00:01<00:00,  3.28 examples/s]
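
With `batched=True`, the function passed to `map` receives whole columns (a dictionary of lists) instead of a single row, and it must return one value per input row. The sketch below imitates that contract in plain Python so the shapes are easy to inspect. `fake_pipe` is a hypothetical stand-in for the real pipeline; its keyword rule, placeholder scores, and the two shortened texts are made up for illustration.

```python
# Hypothetical stand-in for the real pipeline: takes a list of texts and
# returns one {'label', 'score'} dictionary per text.
def fake_pipe(texts: list[str]) -> list[dict]:
    return [
        {"label": "politics" if "parliament" in text.lower() else "multi",
         "score": 1.0}  # placeholder score, not a model output
        for text in texts
    ]


# In batched mode the mapped function sees columns, not rows.
batch = {
    "id": ["1", "2"],
    "text": [
        "Parliamentary debates in the German Bundestag.",
        "Named entity recognition with the Flair NLP framework.",
    ],
}

new_columns = {"multi_clas": fake_pipe(batch["text"])}

# One result per input row, so the new column lines up with the existing ones.
assert len(new_columns["multi_clas"]) == len(batch["text"])

# Row view, similar to what iterating over the mapped dataset yields.
rows = [
    {"id": i, "text": t, "multi_clas": c}
    for i, t, c in zip(batch["id"], batch["text"], new_columns["multi_clas"])
]
print(rows[0]["multi_clas"]["label"])
```

This is why each row of the mapped dataset carries a dictionary in 'multi_clas' whose 'label' and 'score' can be read with `.get(...)`.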


### Output data

In [None]:
#access single classification result from the dataset
for el in encoded_dataset:
    #access text id
    print(el['id'])
    #access text
    print(el['text'])
    #access classification label
    print(el['multi_clas'].get('label'))
    #access score
    print(el['multi_clas'].get('score'))

1
Germaparl is a collection of protocols of 72 years of parliamentary debates in the German Bundestag. Analysis of parliamentary debates can reveal the hidden programmatic-ideological positions of political parties. From the linguistic perspective, parliamentary debates can be analysed in terms of political discourse, sentiment and position-taking, and cross-cultural and gender differences. Our main focus is the analysis of interjections or ’Zwischenrufe’ and calls to order or ’Ordnungsrufe’. Calls to order are a good tool to analyse the negativity in politics. Moreover, we believe that both calls to order and interjections are able to reveal tendencies in the country’s political scene, collaboration patterns between political parties and the impact and productivity of a single politician or political party. To our best knowledge calls to order and interjections in the Germaparl corpus were not analysed by other researchers.
politics
0.9995762705802917
2
Acknowledgments in scientific p

## References
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
- Shen, S., Liu, J., Lin, L., et al. (2023). SsciBERT: A pre-trained language model for social science texts. Scientometrics, 128, 1241–1263. https://doi.org/10.1007/s11192-022-04602-4
- https://huggingface.co/docs/transformers/index
- https://huggingface.co/kalawinka/SSciBERT_politics
- https://huggingface.co/kalawinka/bert-base-ml-politics
- https://www.base-search.net/
- https://www.pollux-fid.de/

## Contact details
Nina Smirnova \
Email: nina.smirnova@gesis.org \
Huggingface: https://huggingface.co/kalawinka \
Research interests: NLP, Machine Learning, Computational Linguistics, LLMs, Text Mining