# Academia–Practice Interaction Mapping Using NLP

**Notebook 02: NER data extraction**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** April 2025  

---
### Notebook Overview

This notebook applies two NER models to extract `ORG` entities from Polish-language research impact case studies:

- **Stanza** (Stanford NLP; neural pipeline with a pretrained Polish NER model)
- **Davlan/xlm-roberta-base-ner-hrl** (Transformer-based multilingual model from Hugging Face)

Extracted entities are saved to CSV for further analysis.

---

## Extract entities using the Stanza model (`pl`)

In [None]:
import os
import pandas as pd
from transformers import pipeline

In [None]:
# Load the data 

cleaned_impact_case_studies = pd.read_csv("../data/cleaned_impact_case_studies.csv")
cleaned_text_pl = cleaned_impact_case_studies["cleaned_text"]

In [None]:
import stanza

# Download the Polish model (only needed once, but safe to include)
stanza.download('pl')

# Initialize the pipeline
nlp_stanza_pl = stanza.Pipeline(lang='pl', processors='tokenize,ner')

In [None]:
# Function to extract organization entities using the Stanza Polish pipeline
def extract_org_stanza_pl(text):
    """
    Extracts organization entities from Polish-language text using the Stanza NER pipeline.

    Parameters:
        text (str): A string of text in Polish.

    Returns:
        list: Entity strings that the Polish model labels "orgName",
            its tag for organizations (reported as "ORG" in this notebook).
    """
    doc = nlp_stanza_pl(text)
    return [ent.text for ent in doc.ents if ent.type == "orgName"]
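
To see what the filter in `extract_org_stanza_pl` keeps without downloading the model, here is a minimal sketch that uses stand-in objects in place of a real Stanza document. The `SimpleNamespace` objects, the sample entity strings, the non-organization labels, and the helper name `filter_org_names` are hypothetical illustrations; only the `text` and `type` attributes and the `orgName` label mirror what the function above relies on.

```python
from types import SimpleNamespace


def filter_org_names(doc):
    """Return the text of every entity whose type is "orgName"."""
    return [ent.text for ent in doc.ents if ent.type == "orgName"]


# Stand-in for a Stanza document (assumption: no model is loaded here).
# Each fake entity exposes .text and .type, like the entities in doc.ents.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Uniwersytet Warszawski", type="orgName"),
    SimpleNamespace(text="Kraków", type="placeName"),
    SimpleNamespace(text="Jan Kowalski", type="persName"),
    SimpleNamespace(text="Ministerstwo Zdrowia", type="orgName"),
])

org_names = filter_org_names(fake_doc)
print(org_names)
```

Matching on the exact label string matters: a comparison against `"ORG"` would silently return an empty list for every text processed by this pipeline.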


In [None]:
# Apply Stanza_pl NER extraction to the dataset
cleaned_text_stanza_pl = cleaned_text_pl.apply(extract_org_stanza_pl)

# Save results in a DataFrame
df_stanza_pl = pd.DataFrame({"Text": cleaned_text_pl, "ORG_Entities_stanza": cleaned_text_stanza_pl})


In [None]:
# Add the 'Impact description identifier - POL-on 2.0 system uuid' column to df_stanza_pl
# Assumes the cleaned CSV loaded above retains this identifier column
df_stanza_pl["ICS_ID"] = cleaned_impact_case_studies["Impact description identifier - POL-on 2.0 system uuid"]

# Display the first few rows to confirm
print(df_stanza_pl.head())

In [None]:
total_entities_ner_stanza_pl = df_stanza_pl["ORG_Entities_stanza"].explode().notna().sum()
print(f"Total ORG entities extracted Stanza_pl: {total_entities_ner_stanza_pl}")
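
The counting step above relies on how `explode` treats empty lists. The following sketch, on a small made-up Series (the entity strings are placeholders, not project data), shows why `notna()` is needed and how a per-row count could be obtained alongside the total.

```python
import pandas as pd

# Toy column of entity lists, including one row with no entities
toy = pd.Series([["Org A", "Org B"], [], ["Org C"]])

# explode() gives one row per entity; the empty list becomes a single NaN row
exploded = toy.explode()

# notna() drops those NaN placeholders from the count
total = int(exploded.notna().sum())

# Number of entities found in each original row
per_row = toy.apply(len)

print(total)
print(per_row.tolist())
```

Without `notna()`, each text with no extracted organizations would add one to the total.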

In [None]:
# Define your target folder path 
OUTPUT_DIR = "../output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the CSV file in the specified folder
csv_file_path_pl = os.path.join(OUTPUT_DIR, "ner_stanza_pl.csv")
df_stanza_pl.to_csv(csv_file_path_pl, index=False)

print(f"CSV file saved at: {csv_file_path_pl}")
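
One point to keep in mind when the saved file is read back in a later notebook: CSV has no list type, so each list of entities is written as its string representation. A minimal sketch of restoring the lists, assuming a small made-up DataFrame and an in-memory buffer in place of the real file:

```python
import ast
import io

import pandas as pd

# Toy stand-in for df_stanza_pl (values are placeholders, not project data)
df = pd.DataFrame({
    "ICS_ID": ["a1", "b2"],
    "ORG_Entities_stanza": [["Org A", "Org B"], []],
})

# Write to an in-memory buffer instead of ../output for this illustration
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# After reloading, each cell is a string such as "['Org A', 'Org B']"
first_raw = reloaded.loc[0, "ORG_Entities_stanza"]

# ast.literal_eval safely parses the string back into a Python list
reloaded["ORG_Entities_stanza"] = reloaded["ORG_Entities_stanza"].apply(ast.literal_eval)
restored = reloaded["ORG_Entities_stanza"].tolist()

print(type(first_raw).__name__)
print(restored)
```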

## Extract entities using a multilingual NER model from Hugging Face

In [None]:
# Load the multilingual NER model from Hugging Face

ner_pipeline_xlm = pipeline(
    "ner",
    model="Davlan/xlm-roberta-base-ner-hrl",
    aggregation_strategy="simple"  # Merges tokens into entities
)

In [None]:
# Write a function to extract ORG entities

def extract_org_xlm(text):
    """
    Extracts 'ORG' entities from Polish-language text using the Davlan/xlm-roberta-base-ner-hrl NER model from Hugging Face.

    Parameters:
        text (str): A string of text in Polish.

    Returns:
        list: A list of 'ORG' entities extracted by the transformer-based model.
    """
    results = ner_pipeline_xlm(text)
    return [r['word'] for r in results if r['entity_group'] == 'ORG']
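
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries, one per merged entity, with keys such as `entity_group`, `score`, `word`, `start`, and `end`. The sketch below applies the same filter to hand-written results so it runs without loading the model. The sample entities, scores, and offsets are made up, and the helper `filter_orgs` with its `min_score` parameter is a hypothetical extension, not part of the notebook's method.

```python
def filter_orgs(results, min_score=0.0):
    """Keep the text of ORG entities whose score is at least min_score."""
    return [
        r["word"]
        for r in results
        if r["entity_group"] == "ORG" and r["score"] >= min_score
    ]


# Hand-written stand-in for pipeline output (assumption: illustrative values only)
mock_results = [
    {"entity_group": "ORG", "score": 0.98, "word": "Uniwersytet Jagielloński", "start": 0, "end": 24},
    {"entity_group": "LOC", "score": 0.95, "word": "Kraków", "start": 28, "end": 34},
    {"entity_group": "ORG", "score": 0.41, "word": "Centrum", "start": 40, "end": 47},
]

all_orgs = filter_orgs(mock_results)
confident_orgs = filter_orgs(mock_results, min_score=0.5)

print(all_orgs)
print(confident_orgs)
```

With the default `min_score=0.0` the helper behaves like `extract_org_xlm`; a higher threshold is one possible way to trade recall for precision if low-confidence fragments turn out to be noisy.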

In [None]:
# Apply the model and create a dataframe

df_davlan_pl = pd.DataFrame({
    "Text": cleaned_text_pl,
    "ORG_Entities_xlm": cleaned_text_pl.apply(extract_org_xlm)
})

In [None]:
# Add the 'Impact description identifier - POL-on 2.0 system uuid' column to df_davlan_pl
# Assumes the cleaned CSV loaded above retains this identifier column
df_davlan_pl["ICS_ID"] = cleaned_impact_case_studies["Impact description identifier - POL-on 2.0 system uuid"]

# Display the first few rows to confirm
print(df_davlan_pl.head())

In [None]:
total_entities_ner_davlan_pl = df_davlan_pl["ORG_Entities_xlm"].explode().notna().sum()
print(f"Total ORG entities extracted Davlan_pl: {total_entities_ner_davlan_pl}")

In [None]:
# Define your target folder path 
OUTPUT_DIR = "../output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the CSV file in the specified folder
csv_davlan_path_pl = os.path.join(OUTPUT_DIR, "ner_davlan_pl.csv")
df_davlan_pl.to_csv(csv_davlan_path_pl, index=False)

print(f"CSV file saved at: {csv_davlan_path_pl}")