<a href="https://colab.research.google.com/github/kalawinka/ner_acknowledgements/blob/main/ner_acknoweledgements.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Named Entities from scientific acknowledgements using the Flair NLP Framework

## Learning objectives
By the end of this tutorial, you will be able to
- run an NER task on a single acknowledgement text or a whole corpus using the Flair NLP framework
- extract different levels of information about a named entity

## Description
This tutorial explains how to extract named entities from scientific acknowledgements (written in English) using the Flair NLP framework and a pretrained NER model.

The model used in this tutorial was pretrained on a corpus of over 600 acknowledgement texts and can predict 6 tags.

|label|description|precision|recall|f1-score|support|
|:----|:----|:----|:----|:----|:----|
|GRNB|grant number|0.93|0.98|0.96|160|
|IND|person|0.98|0.98|0.98|295|
|FUND|funding organization|0.70|0.83|0.76|157|
|UNI|university|0.77|0.74|0.75|99|
|MISC|miscellaneous|0.65|0.65|0.65|82|
|COR|corporation|0.75|0.50|0.60|12|

F1-Score: 0.79
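
Each per-tag F1-score in the table is the harmonic mean of that tag's precision and recall. The sketch below recomputes the scores from the table values; the function name `f1_score` is chosen here for illustration. Because the table reports rounded precision and recall, a recomputed value can differ from the reported one in the last digit.

```python
# Precision, recall and reported F1 per tag, copied from the table above.
reported = {
    "GRNB": (0.93, 0.98, 0.96),
    "IND": (0.98, 0.98, 0.98),
    "FUND": (0.70, 0.83, 0.76),
    "UNI": (0.77, 0.74, 0.75),
    "MISC": (0.65, 0.65, 0.65),
    "COR": (0.75, 0.50, 0.60),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for tag, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{tag}: recomputed {f1:.2f}, reported {f1_reported:.2f}")
```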

To learn more about the model, see the model card: https://huggingface.co/kalawinka/flair-ner-acknowledgments

For detailed information about model pretraining and training parameters, see:
Smirnova, N., & Mayr, P. (2023). Embedding models for supervised automatic extraction and classification of named entities in scientific acknowledgements. Scientometrics. https://doi.org/10.1007/s11192-023-04806-2


## Target Audience (Difficulty level)
- This tutorial is aimed at beginners with some knowledge of Python

## Prerequisites
- Prior knowledge of Python programming is required for this tutorial

## Environment Setup

 - To run this tutorial, you need Python 3.11.4 or later
 - The following commands install the required packages

In [None]:
# Install packages for the Jupyter Notebook environment
!pip3 install flair
!pip3 install pandas
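
To confirm that the running interpreter meets the version stated above, a check along the following lines could be used. This is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical.

```python
import sys

# Minimum version stated in this tutorial.
MINIMUM_PYTHON = (3, 11, 4)


def meets_minimum(version, minimum=MINIMUM_PYTHON):
    """Return True if a (major, minor, micro) version is at least `minimum`."""
    return tuple(version[:3]) >= minimum


if meets_minimum(sys.version_info):
    print("Python version is sufficient.")
else:
    print("This tutorial expects Python", ".".join(map(str, MINIMUM_PYTHON)), "or later.")
```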


In [None]:
# import libraries
from flair.data import Sentence
from flair.nn import Classifier

In [None]:
# load the trained model
model = Classifier.load("kalawinka/flair-ner-acknowledgments")

## Try the model on a single sentence

In [None]:
# create example sentence
sentence = Sentence("This work was supported by State Key Lab of Ocean Engineering Shanghai Jiao Tong University and financially supported by China National Scientific and Technology Major Project (Grant No. 2016ZX05028-006-009)")

In [None]:
# run NER over sentence
model.predict(sentence)
# print the sentence with all annotations
print(sentence)

In [None]:
# print the recognized entities as spans
for entity in sentence.get_spans('ner'):
    print(entity)

In [None]:
# access individual attributes of each entity
for entity in sentence.get_spans('ner'):
    # access entity text
    print(entity.text)
    # access entity label
    print(entity.get_label("ner").value)
    # access confidence score
    print(entity.get_label("ner").score)
    # access entity start position
    print(entity.start_position)
    # access entity end position
    print(entity.end_position)
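
The start and end positions are character offsets into the original text. Assuming the end offset is exclusive (start plus the length of the entity text), slicing the input string with them recovers the entity text. The sketch below illustrates this with plain Python; the span is located by hand with `str.find` and is not model output.

```python
text = (
    "This work was supported by State Key Lab of Ocean Engineering "
    "Shanghai Jiao Tong University and financially supported by "
    "China National Scientific and Technology Major Project "
    "(Grant No. 2016ZX05028-006-009)"
)

# Hypothetical span, chosen by hand for illustration (not predicted by the model).
span_text = "2016ZX05028-006-009"
start_pos = text.find(span_text)
end_pos = start_pos + len(span_text)  # exclusive end offset (assumption)

# Slicing with the offsets returns the span text.
print(start_pos, end_pos, text[start_pos:end_pos])
```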

## Apply the model to a corpus

In [None]:
# import pandas library
import pandas as pd

In [None]:

# create an example corpus
data_dict = {
    'id': ['1', '2', '3'],
    'text': [
        'This work is funded by the German Federal Ministry of Education and Research (BMBF) project Open Access Effects – The Influence of Structural and Author-specific Factors on the Impact of OA (OASE), grant numbers 16PU17005A and 16PU17005B.',
        'The original work was funded by the German Center for Higher Education Research and Science Studies (DZHW) via the project ”Mining Acknowledgement Texts in Web of Science (MinAck)”17. Access to the WoS data was granted via the Competence Centre for Bibliometrics18. Data access was funded by BMBF (Federal Ministry of Education and Research, Germany) under grant number 01PQ17001. Nina Smirnova received funding from the German Research Foundation (DFG) via the project ”POLLUX”19. The present paper is an extended version of the paper ”Evaluation of Embedding Models for Automatic Extraction and Classification of Acknowledged Entities in Scientific Documents” (Smirnova and Mayr, 2022) presented at the 3rd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2022).',
        'This work was funded by German Centre for Higher Education Research and Science Studies (DZHW) via the project ”Mining Acknowledgement Texts in Web of Science (MinAck)”21. Nina Smirnova acknowledges support by Deutsche Forschungsgemeinschaft (DFG) under grant number MA 3964/7-2, the Fachinformationsdienst Politikwissenschaft – Pollux. Access to the WoS data was granted via the Competence Centre for Bibliometrics22. Data access was funded by BMBF (Federal Ministry of Education and Research, Germany) under grant number 01PQ17001.'
    ]
}
# convert the dictionary to a pandas DataFrame
corpus = pd.DataFrame.from_dict(data_dict)
corpus.head()

In [None]:
# function that applies the NER model to one text and collects information about each entity
# input parameters: acknowledgement text and NER model
def get_entity_info(text, model):
    sentence = Sentence(text)
    model.predict(sentence)
    # collect one dictionary per recognized entity
    result = []
    for entity in sentence.get_spans('ner'):
        label = entity.get_label("ner")
        # save entity text, label, confidence score and character offsets
        entity_info = {
            'entity': entity.text,
            'label': label.value,
            'confidence': label.score,
            'start_pos': entity.start_position,
            'end_pos': entity.end_position,
        }
        result.append(entity_info)
    return result
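
The list returned by `get_entity_info` can be post-processed with plain Python, for example to keep only confident predictions of certain tags. The following is a minimal sketch, assuming records with the keys `entity`, `label` and `confidence`; the helper `filter_entities`, the threshold of 0.8 and the sample records are made up for illustration and are not model output.

```python
def filter_entities(entities, labels=None, min_confidence=0.0):
    """Keep entities whose label is in `labels` (if given) and whose
    confidence is at least `min_confidence`."""
    return [
        entity
        for entity in entities
        if (labels is None or entity['label'] in labels)
        and entity['confidence'] >= min_confidence
    ]


# Hand-written sample records (illustrative values only).
sample = [
    {'entity': 'BMBF', 'label': 'FUND', 'confidence': 0.95, 'start_pos': 0, 'end_pos': 4},
    {'entity': '01PQ17001', 'label': 'GRNB', 'confidence': 0.99, 'start_pos': 30, 'end_pos': 39},
    {'entity': 'POLLUX', 'label': 'MISC', 'confidence': 0.55, 'start_pos': 50, 'end_pos': 56},
]

confident_funding = filter_entities(sample, labels={'FUND', 'GRNB'}, min_confidence=0.8)
print([entity['entity'] for entity in confident_funding])
```

A suitable threshold depends on the tag: the scores reported for the model differ considerably between tags, so a single cut-off may be too strict for some and too lenient for others.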


In [None]:
# apply the NER model to the whole corpus and store the results as a list of dictionaries in a DataFrame column
corpus['ner_results'] = corpus.apply(
    lambda row: get_entity_info(row['text'], model),
    axis=1
)
corpus.head()
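
Each cell of the `ner_results` column holds a list of dictionaries, which is awkward to filter or export. One option is a long format with one row per entity. The sketch below shows the idea with plain Python on hand-written records (illustrative values, not model output); the helper name `flatten_results` is hypothetical. With pandas, `DataFrame.explode` offers a comparable reshaping.

```python
from collections import Counter


def flatten_results(ids, results_per_doc):
    """Turn per-document entity lists into one flat row per entity,
    carrying the document id along."""
    rows = []
    for doc_id, entities in zip(ids, results_per_doc):
        for entity in entities:
            rows.append({'id': doc_id, **entity})
    return rows


# Hand-written stand-in for corpus['id'] and corpus['ner_results'].
ids = ['1', '2']
results_per_doc = [
    [{'entity': 'BMBF', 'label': 'FUND', 'confidence': 0.95}],
    [
        {'entity': 'DZHW', 'label': 'FUND', 'confidence': 0.90},
        {'entity': 'Nina Smirnova', 'label': 'IND', 'confidence': 0.99},
    ],
]

rows = flatten_results(ids, results_per_doc)
label_counts = Counter(row['label'] for row in rows)
print(len(rows), dict(label_counts))
```

With the real DataFrame, the call would be `flatten_results(corpus['id'], corpus['ner_results'])`.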

## References
- Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual String Embeddings for Sequence Labeling. COLING 2018, 27th International Conference on Computational Linguistics, 1638–1649.
- Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., & Vollgraf, R. (2019). FLAIR: An easy-to-use framework for state-of-the-art NLP. NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 54–59.
- Smirnova, N., & Mayr, P. (2023). Embedding models for supervised automatic extraction and classification of named entities in scientific acknowledgements. Scientometrics. https://doi.org/10.1007/s11192-023-04806-2
- https://huggingface.co/kalawinka/flair-ner-acknowledgments
- https://flairnlp.github.io/

## Contact details
Nina Smirnova \
Email: nina.smirnova@gesis.org \
Huggingface: https://huggingface.co/kalawinka \
Research interests: NLP, Machine Learning, Computational Linguistics, LLMs, Text Mining