# Academia–Practice Interaction Mapping Using NLP  
**Notebook 07: Apply Rules to Stanza-Only Entities**

**Author:** Kamila Lewandowska  
**Project Phase:** In Progress  
**Last Updated:** June 2025  

---

## Objective

Apply the rule-based classification logic developed on the manually annotated *common entities* dataset to the remaining *Stanza-only* entities. This extends the categorization of non-academic organizations beyond the initial overlap with the spaCy/transformer models.

---

## Workflow Summary

- Load raw Stanza output (`ner_stanza_pl.csv`) and filter out already classified entities  
- Preprocess the remaining Stanza-only entities:
  - Clean and normalize organization names  
  - Remove likely academic entities using a keyword-based filter  
  - Lemmatize names for more robust matching  
- Apply rule-based classification using pre-defined keyword lists (from Notebook 05)  
- Split into rule-matched and unmatched groups for further inspection  

---

## Rule-Based Categorization

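- A dictionary of lemmatized keywords was used to match organizations to one of 11 categories (10 keyword-based categories plus a fallback):
  - Government / Public Administration
  - NGO / Association / Foundation
  - Media / Publishing
  - Cultural Institution / Arts
  - Health / Hospitals / Medical
  - Religious Organization
  - Military / Defense / Security
  - International Organization / EU
  - Company / Business
  - Education (non-university)
  - Other / Unclear (fallback when no keyword matches)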
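
A minimal sketch of the matching idea, using a toy two-category dictionary. The stems are a small subset of the full dictionary defined later in this notebook, and the helper name `first_matching_category` is hypothetical:

```python
# Toy keyword dictionary; the stems are a small subset of the full dictionary.
toy_keywords = {
    "Government / Public Administration": ["urząd", "ministerstw"],
    "NGO / Association / Foundation": ["fundacj", "stowarzyszen"],
}

def first_matching_category(lemma, keyword_dict):
    """Return the first category whose keyword occurs in the lemma (hypothetical helper)."""
    lemma_lower = lemma.lower()
    for category, keywords in keyword_dict.items():
        if any(kw in lemma_lower for kw in keywords):
            return category
    return "Other / Unclear"

examples = ["Urząd Miasta", "fundacja batory", "Platforma"]
results = {name: first_matching_category(name, toy_keywords) for name in examples}
for name, category in results.items():
    print(f"{name!r} -> {category}")
```

Because the keywords are stems, an inflected or lemmatized form such as "fundacja" still contains the stem "fundacj".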

---

## Key Outcomes

- **Stanza-only non-academic entities after cleaning:** *12490* rows  
- **Classified by rules:** *5037* rows (*4055* unique entities)  
- **Remaining unmatched (“Other / Unclear”):** *7453* rows (*5615* unique entities)  



In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval
import re
import stanza
from collections import Counter

# Data preparation and preprocessing

## Create stanza_only dataset 

In [2]:
# Isolate Stanza-Only Entities

# Load full Stanza output
df_stanza_all = pd.read_csv("../output/ner_stanza_pl.csv")  
df_common = pd.read_csv("../output/common_org_entities.csv")

In [3]:
df_stanza_all.head(5)

Unnamed: 0,Text,ORG_Entities_stanza,ICS_ID
0,Badania skupiające się na szczegółowej analizi...,['Komitetu Nauk Weterynaryjnych i Rozrodu Zwie...,00153fbd-82f7-48c4-b5bd-e830bc390244
1,"Birdwatching, czyli obserwacje w terenie ptakó...","['Królewskie Towarzystwo Ochrony Ptaków', 'Fac...",002768f1-8b96-4e0f-bcc8-192eb0594e60
2,Efektywny transfer wiedzy jest podstawowym czy...,"['MŚP', 'MŚP', 'ETW']",00500483-f00c-4410-b6f7-8650a003125f
3,Ważnym obszarem działalności naukowej WSPiA je...,"['WSPiA', 'AP', 'WSPiA', '4 Zespoły', 'AP', 'A...",006e7fef-2083-426d-9c1b-1affd27b939e
4,Znaczna część europejskiego dziedzictwa archeo...,"['Interreg Central Europe', 'Archaeological He...",00901439-d91a-48e0-903a-26a4253c3a0c


In [4]:
# Check the type of data of the row with entities

type(df_stanza_all["ORG_Entities_stanza"].iloc[0])

str
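
Because the column holds the text representation of a list, each cell has to be parsed before the entities can be used. A small illustration with `ast.literal_eval`, using one value from the preview above (the malformed string is an invented example):

```python
from ast import literal_eval

raw = "['MŚP', 'MŚP', 'ETW']"   # stored as plain text in the CSV
parsed = literal_eval(raw)       # parsed into a real Python list

print(type(raw).__name__, type(parsed).__name__)
print(parsed)

# A truncated or otherwise malformed cell raises instead of returning a list.
error_name = None
try:
    literal_eval("['MŚP', ")
except (ValueError, SyntaxError) as err:
    error_name = type(err).__name__
print("Malformed cell raised:", error_name)
```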

In [5]:
# Extract unique ORG entities from df_stanza_all

# Flatten all (ICS_ID, ORG_Entity) pairs from Stanza

# List to hold flattened records

all_stanza_entities = []

# Loop through each row in stanza data
for _, row in df_stanza_all.iterrows():
    ics_id = row["ICS_ID"]
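    try:
        org_list = literal_eval(row["ORG_Entities_stanza"])
    except (ValueError, SyntaxError, TypeError):
        # Skip rows whose entity list cannot be parsed (e.g. missing values)
        continue
    for org in org_list:
        # Check the type before stripping so a non-string item cannot raise
        if isinstance(org, str) and org.strip():
            all_stanza_entities.append((ics_id, org.strip()))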

# Create dataframe
all_stanza_flat = pd.DataFrame(all_stanza_entities, columns=["ICS_ID", "ORG_Entity"])
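
The same flattening can also be expressed with `DataFrame.explode`. The sketch below runs on a toy frame and assumes every cell parses cleanly, so unlike the loop above it does not skip malformed rows:

```python
from ast import literal_eval
import pandas as pd

# Toy stand-in for df_stanza_all with stringified entity lists
toy = pd.DataFrame({
    "ICS_ID": ["a1", "b2"],
    "ORG_Entities_stanza": ["['MŚP', ' ETW ']", "['WSPiA']"],
})

flat = (
    toy.assign(ORG_Entity=toy["ORG_Entities_stanza"].apply(literal_eval))  # str -> list
       .explode("ORG_Entity")[["ICS_ID", "ORG_Entity"]]                    # one row per entity
       .assign(ORG_Entity=lambda d: d["ORG_Entity"].str.strip())           # trim whitespace
       .reset_index(drop=True)
)
print(flat)
```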

In [6]:
all_stanza_flat.head(20)

Unnamed: 0,ICS_ID,ORG_Entity
0,00153fbd-82f7-48c4-b5bd-e830bc390244,Komitetu Nauk Weterynaryjnych i Rozrodu Zwierz...
1,00153fbd-82f7-48c4-b5bd-e830bc390244,Rady Doradczej
2,00153fbd-82f7-48c4-b5bd-e830bc390244,Advisory Board
3,00153fbd-82f7-48c4-b5bd-e830bc390244,Medycyna Weterynaryjna
4,00153fbd-82f7-48c4-b5bd-e830bc390244,Journal of Applied Genetics
5,00153fbd-82f7-48c4-b5bd-e830bc390244,SPRINGER
6,00153fbd-82f7-48c4-b5bd-e830bc390244,Gene
7,00153fbd-82f7-48c4-b5bd-e830bc390244,ELSEVIER
8,00153fbd-82f7-48c4-b5bd-e830bc390244,Scientific Reports
9,00153fbd-82f7-48c4-b5bd-e830bc390244,Zespołu


In [7]:
# Filter out entities that were already processed (the 6111 “common” entities)

# Set of already classified entities
common_entities_set = set(df_common['ORG_Entity'].str.strip())

# Keep only stanza-only entries
df_stanza_only = all_stanza_flat[~all_stanza_flat["ORG_Entity"].isin(common_entities_set)].copy()

# Check number of unique entities and entries
print("Unique entities:", df_stanza_only["ORG_Entity"].nunique())
print("Total rows:", len(df_stanza_only))

Unique entities: 11877
Total rows: 15092
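
The two numbers differ because the filter keeps one row per occurrence, while `nunique()` counts each distinct name once. A toy illustration of the same `isin` filter (all values are invented for the example):

```python
import pandas as pd

flat = pd.DataFrame({
    "ICS_ID": ["a1", "a1", "b2"],
    "ORG_Entity": ["SPRINGER", "ELSEVIER", "SPRINGER"],
})
common = {"ELSEVIER"}  # entities that were already classified

# Keep rows whose entity is NOT in the common set (exact string comparison)
stanza_only = flat[~flat["ORG_Entity"].isin(common)]

n_unique = stanza_only["ORG_Entity"].nunique()
n_rows = len(stanza_only)
print("Unique entities:", n_unique)  # 1
print("Total rows:", n_rows)         # 2
```

Note that the comparison is on exact strings, so differently inflected or differently cased variants of a common entity are not filtered out.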


In [8]:
# Save the Stanza-only entities

df_stanza_only.to_csv("../output/stanza_only_entities.csv", index=False)

## Preprocess stanza_only: STEP 1 Basic cleaning

In [9]:
# Function for basic cleaning: normalize whitespace, strip stray punctuation, drop very short entries

def clean_entity(entity):
    if not isinstance(entity, str):       # Check if the input is a string; if not, return None
        return None
    entity = re.sub(r'\s+', ' ', entity).strip()    # Replace multiple whitespace characters with a single space and trim leading/trailing spaces
    entity = re.sub(r'^[^\w]+', '', entity)         # Remove any non-word characters from the beginning of the string
    entity = re.sub(r'[^\w.]+$', '', entity)        # Remove any non-word characters (excluding period) from the end of the string
    return entity if len(entity) >= 3 else None      # Return the cleaned entity if it has at least 3 characters; otherwise, return None
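
A few example inputs show what the function keeps and drops. The block repeats the function so that it runs on its own; the sample strings are invented:

```python
import re

def clean_entity(entity):
    if not isinstance(entity, str):
        return None
    entity = re.sub(r'\s+', ' ', entity).strip()
    entity = re.sub(r'^[^\w]+', '', entity)
    entity = re.sub(r'[^\w.]+$', '', entity)
    return entity if len(entity) >= 3 else None

samples = ['  "Gazeta   Wyborcza", ', 'Sp. z o.o.', '(UE)', 'AP', None]
cleaned = [clean_entity(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(repr(original), '->', repr(result))
# '  "Gazeta   Wyborcza", ' -> 'Gazeta Wyborcza'   (quotes and comma stripped)
# 'Sp. z o.o.'              -> 'Sp. z o.o.'        (trailing period is kept)
# '(UE)'                    -> None                (only 2 characters remain)
# 'AP'                      -> None                (shorter than 3 characters)
# None                      -> None                (not a string)
```

The length threshold means that two-letter acronyms are dropped at this stage.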

In [10]:
# Apply the clean_entity function

df_stanza_only["Cleaned_Entity"] = df_stanza_only["ORG_Entity"].apply(clean_entity)
stanza_only_cleaned = df_stanza_only.dropna(subset=["Cleaned_Entity"]).reset_index(drop=True)
print(f"After cleaning: {len(stanza_only_cleaned)} rows.")

After cleaning: 14893 rows.


## Preprocess stanza_only: STEP 2 Rule-based academic entity removal

In [11]:
# Load academic keyword list (copied from notebook 04)

academic_keywords = [
    'instytut', 'pan', 'university', 'nauk', 'uniwersytet', 'wydział', 'wydzial', 'department', 'badań', 'akadem', 
    'katedr', 'politechni', 'laboratorium', 'research', 'institut', 'fizy', 'matematy', 
    'architektur', 'pracown', 'pedagogi', 'filozof', 'medycyn', 'medyczn', 'medical', 'językoznawstwo', 
    'mickiewicz', 'biolog', 'studia', 'uczeln', 'kolegium', 'collegium', 'studium', 'colleg', 'universit', 
    'wyższ', 'journal', 'springer', 'doktor']

In [12]:
# Lowercase for rule-based match

stanza_only_cleaned["entity_lower"] = stanza_only_cleaned["Cleaned_Entity"].str.lower()

In [13]:
# Remove academic entities

stanza_only_cleaned["Is_Academic"] = stanza_only_cleaned["entity_lower"].apply(lambda x: any(kw in x for kw in academic_keywords))
stanza_only_non_academic = stanza_only_cleaned[~stanza_only_cleaned["Is_Academic"]].copy().reset_index(drop=True)
print(f"After removing academic entities: {len(stanza_only_non_academic)} rows.")

After removing academic entities: 12490 rows.
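
The filter above is a plain substring test, so a short keyword such as `pan` can also match inside unrelated words. The sketch below contrasts it with a word-start match, which is one possible refinement and not part of this notebook's pipeline. The sample names and the helper names are chosen for illustration only:

```python
import re

keywords = ["pan", "nauk", "uniwersytet"]  # small subset of academic_keywords

def is_academic_substring(name):
    """Substring test, as used in the notebook."""
    return any(kw in name.lower() for kw in keywords)

# Alternative: the keyword must start a word; stems may still be followed by endings.
word_start = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")")

def is_academic_word_start(name):
    """Hypothetical stricter variant based on a word-boundary regex."""
    return bool(word_start.search(name.lower()))

names = ["Instytut PAN", "Japan Tobacco", "Kompania Piwowarska"]
comparison = {n: (is_academic_substring(n), is_academic_word_start(n)) for n in names}
for name, (substring_hit, word_hit) in comparison.items():
    print(f"{name}: substring={substring_hit}, word_start={word_hit}")
```

A word-start match would still flag names that merely begin with a keyword, so it reduces false positives without eliminating them.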


# Rule-based Classification

In [14]:
# Download and initialize Stanza for Polish

stanza.download("pl")
nlp = stanza.Pipeline(lang="pl", processors="tokenize,mwt,lemma")

2025-06-25 08:58:59 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-06-25 08:58:59 INFO: Downloading default packages for language: pl (Polish) ...
2025-06-25 08:59:00 INFO: File exists: C:\Users\lewandowska\stanza_resources\pl\default.zip
2025-06-25 08:59:02 INFO: Finished downloading models and saved to C:\Users\lewandowska\stanza_resources
2025-06-25 08:59:02 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-06-25 08:59:02 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-06-25 08:59:02 INFO: Loading these models for language: pl (Polish):
| Processor | Package      |
----------------------------
| tokenize  | pdb          |
| mwt       | pdb          |
| lemma     | pdb_nocharlm |

2025-06-25 08:59:02 INFO: Using device: cpu
2025-06-25 08:59:02 INFO: Loading: tokenize
2025-06-25 08:59:05 INFO: Loading: mwt
2025-06-25 08:59:05 INFO: Loading: lemma
2025-06-25 08:59:08 INFO: Done loading processors!


In [15]:
# Function to lemmatize entities

def lemmatize_entity(text):
    doc = nlp(text)
    lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
    return " ".join(lemmas)

In [16]:
# Apply lemmatize entities function

stanza_only_non_academic["Lemma_Entity"] = stanza_only_non_academic["Cleaned_Entity"].apply(lemmatize_entity)
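
Since many names occur more than once, lemmatizing each distinct name only once and mapping the result back may save time. This is a sketch under that assumption. `fake_lemmatize` is a hypothetical stand-in that only lowercases; in the notebook its role would be played by `lemmatize_entity`:

```python
import pandas as pd

calls = []  # records how often the lemmatizer is invoked

def fake_lemmatize(text):
    """Hypothetical stand-in for lemmatize_entity (it only lowercases)."""
    calls.append(text)
    return text.lower()

df = pd.DataFrame({"Cleaned_Entity": ["Platforma", "POL-on", "Platforma", "Platforma"]})

# Lemmatize each distinct name once, then map the results back onto all rows
lemma_by_name = {name: fake_lemmatize(name) for name in df["Cleaned_Entity"].unique()}
df["Lemma_Entity"] = df["Cleaned_Entity"].map(lemma_by_name)

print(len(calls), "lemmatizer calls for", len(df), "rows")
```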

In [70]:
# Fix incorrect lemmatizations (from notebook 05)

lemma_fixes = {
    "do_tar": "DG",
    "sa .": "s.a.",
    "s.a .": "s.a.",
    "o.o": "o.o.",
    "z ograniczona_odpowiedzialność .": "z o.o.",
    "z ograniczona_odpowiedzialność": "z o.o.",
    "do_spraw .": "ds.",
    "te_jeta_bowy": "TV",
    "roktóry": "rp",
    "et_yg_pały": "ETW",
    "to_sysapipiporoby": "TVP",
    "lt_yp_dwownokinoć": "Ltd",
    "fektometr_pici": "FM",
    "Bspódniczoab": "BSP"
}

# Apply the correction

def apply_fixes(text):
    for wrong, correct in lemma_fixes.items():
        text = text.replace(wrong, correct)
    return text

stanza_only_non_academic["Lemma_Cleaned"] = stanza_only_non_academic["Lemma_Entity"].apply(apply_fixes)
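
Plain `str.replace` rewrites every occurrence, including ones that are already correct. As an illustration (not a claim about what happened in this dataset), the `"o.o"` rule appends a second period to a string that already ends in `o.o.`. A regex with a negative lookahead is one way to avoid this:

```python
import re

already_ok = "spółka z o.o."
needs_fix = "spółka z o.o"

# Plain replacement also touches the already correct form
doubled = already_ok.replace("o.o", "o.o.")
print(doubled)  # spółka z o.o..

# Replace "o.o" only when it is not already followed by a period
fix = re.compile(r"o\.o(?!\.)")
kept = fix.sub("o.o.", already_ok)
fixed = fix.sub("o.o.", needs_fix)
print(kept)   # spółka z o.o.
print(fixed)  # spółka z o.o.
```

For the keyword matching itself the doubled period is harmless, because `"z o.o."` is still a substring of `"z o.o.."`.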

In [79]:
# Rule-based keyword dictionary (from notebook 05, enhanced manually and stemmed after reviewing unmatched entities)

lemmatized_keywords_by_category_clean = {
       "Government / Public Administration": [
        "urząd", "ministerstw", "gmin", "powiat", "rada", "sejm", "senat",
        "wojewódz", "rp", "komisj", "samorząd", "narodow", "krajow", "izb",
        "państw", "miast", "regional", "marszałkow",
        "sąd", "inspektorat", "parlament", "rzeczpospolit", "agencj", "stołeczn", "ambasad",
        "rzecznik", "senack", "inspek", "konsul", "turyst", "delegatur",
        "punkt", "fundusz", "skarb", "rząd", "dyrekcj", "archiw", "sztab", "ris", "krrit"
    ],
     "NGO / Association / Foundation": [
        "fundacj", "stowarzyszen", "towarzystw", "zrzeszen", "federacj", "koalicj", "association", "związk", "obywatel",
         "wspólot", "pomoc", "nno"
    ],
   
    "Media / Publishing": [
        "radi", "tv", "gazet", "media", "wydawnictw", "czasopism", "pras", "tvp", "fm"
    ],
    "Cultural Institution / Arts": [
        "muze", "teatr", "galer", "festiwal", "filharmoni", "dom", "kultur", "sztuk", "artystycz",
        "twórcz", "książ", "muzycz", "museum", "koncert", "zamek", "królewsk", "pałac", "klub", "filmow",
        "heritage"
    ],
    "Health / Hospitals / Medical": [
        "zdrow", "klinik", "szpital", "lekarz", "medycz", "przychodni", "sanatori", "rehabilitacj",
        "hospicj", "chorob", "epidemi", "health", "uzdrowisk", "chory"
    ],
    "Religious Organization": [
        "kościół", "parafi", "diecezj", "episkopat", "zakon", "misja", "cerkiew", "salwatorian", "duchow", "biblijn", "kuria",
        "metropolit"
    ],
    "Military / Defense / Security": [
        "wojsk", "żandarmeri", "bezpiecz", "policj", "straż", "obron", "militar", "komendant", "lotnicz", "zbrojny", "bsp"
    ],
    "International Organization / EU": [
        "europejsk", "unia", "ue", "nato", "unesco", "oecd", "who", "międzynarod", "international", "european", "DG", "union", "dyrektoriat", "nations",
        "światow", "unijn", "eu-xfel"
    ],
    "Company / Business": [
       "s.a.", "s.a", "z o.o.", "hold", "firm", "grupa", "przedsiębiorstw", "msp", "csr",
        "technolog", "logistyk", "consulting", "solutions", "commerc", "industr", "group", "spółk", "Ltd", "stoczni", "kopaln", "huta"
    ],
       
    "Education (non-university)": [
        "liceum", "licea", "technikum", "szkoł", "podstawow", "przedszkol", "edukacj", "bibliotek", "szkolen", "nauczyciel", "podyplomow",
        "oświat", "kurator", "kuratori", "szkol"
    ],
    "Other / Unclear": []  # fallback category
}


In [80]:
# Define a function to match a lemmatized entity against category keyword lists

def match_entity_to_category(entity, keyword_dict):
    """
    Matches a lemmatized entity string to one of the predefined categories
    based on the presence of any lemmatized keywords.

    Parameters:
        entity (str): The lemmatized name of an organization.
        keyword_dict (dict): Dictionary mapping category names to keyword lists.

    Returns:
        str: The matched category name, or "Other / Unclear" if no match found.
    """
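    entity_lower = entity.lower()
    # Only the entity is lowercased, so keywords containing uppercase letters
    # (e.g. "DG", "Ltd") can never match.
    # Categories are checked in dictionary order; the first match wins.
    for category, keywords in keyword_dict.items():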
        if any(keyword in entity_lower for keyword in keywords):
            return category
    return "Other / Unclear"
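
Because the first matching category wins, an entity whose lemma contains keywords from several categories is assigned by dictionary order. To inspect such cases, a variant can return every matching category. This is a sketch: `all_matching_categories`, the toy dictionary and the sample strings are hypothetical:

```python
def all_matching_categories(entity, keyword_dict):
    """Return every category with at least one keyword found in the entity (hypothetical helper)."""
    entity_lower = entity.lower()
    return [
        category
        for category, keywords in keyword_dict.items()
        if any(keyword in entity_lower for keyword in keywords)
    ]

toy_keywords = {
    "Government / Public Administration": ["urząd", "miast"],
    "Cultural Institution / Arts": ["dom", "kultur"],
    "International Organization / EU": ["DG"],
}

ambiguous = all_matching_categories("dom kultura miasto", toy_keywords)
uppercase_keyword = all_matching_categories("DG REGIO", toy_keywords)
print(ambiguous)          # two categories; first-match logic would return only the first
print(uppercase_keyword)  # empty: the uppercase keyword "DG" cannot match lowercased text
```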

In [81]:
# Apply the match_entity_to_category function

stanza_only_non_academic["Matched_Category"] = stanza_only_non_academic["Lemma_Cleaned"].apply(lambda x: match_entity_to_category(x, lemmatized_keywords_by_category_clean))

In [82]:
# Split: rule-matched and unmatched

stanza_non_academic_rule_matched = stanza_only_non_academic[stanza_only_non_academic["Matched_Category"] != "Other / Unclear"].copy()
stanza_non_academic_unmatched = stanza_only_non_academic[stanza_only_non_academic["Matched_Category"] == "Other / Unclear"].copy()

print(f"Rule-matched: {len(stanza_non_academic_rule_matched)} rows.")
print(f"Unmatched: {len(stanza_non_academic_unmatched)} rows.")

Rule-matched: 5037 rows.
Unmatched: 7453 rows.


In [83]:
print(stanza_non_academic_rule_matched["ORG_Entity"].nunique())
print(stanza_non_academic_unmatched["ORG_Entity"].nunique())


4055
5615


In [84]:
# Inspect the data: check frequencies of categories

cat_freq = Counter(stanza_non_academic_rule_matched["Matched_Category"])
cat_freq

Counter({'Government / Public Administration': 1907,
         'Company / Business': 764,
         'Cultural Institution / Arts': 562,
         'NGO / Association / Foundation': 515,
         'International Organization / EU': 489,
         'Education (non-university)': 225,
         'Media / Publishing': 206,
         'Health / Hospitals / Medical': 189,
         'Military / Defense / Security': 113,
         'Religious Organization': 67})
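
To read the counts as shares of all rule-matched rows, the `Counter` can be normalized. A short sketch using the counts reported above:

```python
from collections import Counter

# Counts copied from the output above
cat_freq = Counter({
    "Government / Public Administration": 1907,
    "Company / Business": 764,
    "Cultural Institution / Arts": 562,
    "NGO / Association / Foundation": 515,
    "International Organization / EU": 489,
    "Education (non-university)": 225,
    "Media / Publishing": 206,
    "Health / Hospitals / Medical": 189,
    "Military / Defense / Security": 113,
    "Religious Organization": 67,
})

total = sum(cat_freq.values())  # 5037, the number of rule-matched rows
shares = {cat: round(100 * n / total, 1) for cat, n in cat_freq.most_common()}

for cat, pct in shares.items():
    print(f"{cat}: {pct}%")
```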

In [85]:
# Display entity names and lemmas for the most frequent unmatched entries 

unmatched_with_originals = (
    stanza_non_academic_unmatched
    .groupby(["Lemma_Cleaned", "ORG_Entity"])
    .size()
    .reset_index(name="count")
    .sort_values(by="count", ascending=False)
)

unmatched_with_originals.head(50)

Unnamed: 0,Lemma_Cleaned,ORG_Entity,count
4084,platforma,Platforma,17
4117,pol-on,POL-on,15
2216,eg_m,EFQM,14
4343,ptoce,PCN,14
423,JSA,JSA,14
88,Bosidała,BPS,14
3334,mPM,MPM,13
3946,ostyk_da,OSL,12
5051,twbym_podobne,TWŚ,12
882,SPEC,SPEC,11
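
Several of the most frequent unmatched entries are acronyms whose lemma is garbled (for example `EFQM` becomes `eg_m`). One possible mitigation, offered only as an assumption and not as part of this notebook, is to skip lemmatization for short all-uppercase names. The helper names and the length threshold are hypothetical, and `str.lower` stands in for the real lemmatizer:

```python
def looks_like_acronym(text, max_len=6):
    """Heuristic: a short, single-token, all-uppercase string (hypothetical helper)."""
    return text.isupper() and len(text) <= max_len and " " not in text

def lemmatize_unless_acronym(text, lemmatizer):
    """Return acronyms unchanged; pass everything else to the lemmatizer."""
    return text if looks_like_acronym(text) else lemmatizer(text)

stand_in_lemmatizer = str.lower  # placeholder for lemmatize_entity

names = ["EFQM", "TWŚ", "POL-on", "Platforma"]
processed = [lemmatize_unless_acronym(n, stand_in_lemmatizer) for n in names]
for name, result in zip(names, processed):
    print(name, "->", result)
```

Acronyms kept in uppercase would still be lowercased by `match_entity_to_category`, so keywords such as `tvp` or `bsp` would continue to match.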


In [87]:
# Export files

# Export full matched entities
stanza_non_academic_rule_matched.to_csv("../output/stanza_non_academic_rule_matched.csv", index=False)

# Export full unmatched entities
stanza_non_academic_unmatched.to_csv("../output/stanza_non_academic_unmatched.csv", index=False)

# Create annotation file with duplicates (one row per occurrence)
stanza_for_annotation = (
    stanza_non_academic_unmatched[["ICS_ID", "ORG_Entity"]]
    .copy()
)
stanza_for_annotation["Annotated_Category"] = ""

# Export to CSV
stanza_for_annotation.to_csv("../output/stanza_non_academic_for_annotation.csv", index=False)
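
If annotating every occurrence is too repetitive, one alternative is to annotate each distinct name once and merge the labels back onto the occurrences. A sketch on toy data; the frame contents and the `EXAMPLE_LABEL` value are placeholders:

```python
import pandas as pd

# Toy stand-in for stanza_non_academic_unmatched
unmatched = pd.DataFrame({
    "ICS_ID": ["a1", "b2", "c3"],
    "ORG_Entity": ["Platforma", "POL-on", "Platforma"],
})

# One row per distinct entity, most frequent first, with an empty label column
unique_for_annotation = (
    unmatched.groupby("ORG_Entity").size().reset_index(name="count")
    .sort_values("count", ascending=False)
    .assign(Annotated_Category="")
)

# Simulate a manual annotation step with a placeholder label
is_platforma = unique_for_annotation["ORG_Entity"] == "Platforma"
unique_for_annotation.loc[is_platforma, "Annotated_Category"] = "EXAMPLE_LABEL"

# Propagate the labels back to every occurrence
annotated = unmatched.merge(
    unique_for_annotation[["ORG_Entity", "Annotated_Category"]],
    on="ORG_Entity",
    how="left",
)
print(annotated)
```

This trades context for speed: a name that means different things in different documents would receive a single label.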
