# Academia–Practice Interaction Mapping Using NLP  
**Notebook 07: Apply Rules to Stanza-Only Entities**

**Author:** Kamila Lewandowska  
**Project Phase:** In Progress  
**Last Updated:** June 2025  

---

## Objective

Apply the rule-based classification logic developed on the manually annotated *common entities* dataset to the remaining *Stanza-only* entities. This helps extend the categorization of non-academic organizations beyond the initial overlap with spaCy/transformer models.

---

## Workflow Summary

- Load raw Stanza output (`ner_stanza_pl.csv`) and filter out already classified entities  
- Preprocess the remaining Stanza-only entities:
  - Clean and normalize organization names  
  - Remove likely academic entities using a keyword-based filter  
  - Lemmatize names for more robust matching  
- Apply rule-based classification using pre-defined keyword lists (from Notebook 05)  
- Split into rule-matched and unmatched groups for further inspection  

---

## Rule-Based Categorization

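- A dictionary of lemmatized keywords was used to match organizations to one of 11 categories (ten keyword-based categories plus a fallback):
  - Government / Public Administration
  - NGO / Association / Foundation
  - Media / Publishing
  - Cultural Institution / Arts
  - Health / Hospitals / Medical
  - Religious Organization
  - Military / Defense / Security
  - International Organization / EU
  - Company / Business
  - Education (non-university)
  - Other / Unclear (fallback when no keyword matches)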

---

## Key Outcomes

- **Stanza-only non-academic entities after cleaning:** *12490* rows  
- **Classified by rules:** *4549* rows  
- **Remaining unmatched (“Other / Unclear”):** *7941* rows  



In [3]:
import pandas as pd
import numpy as np
from ast import literal_eval
import re
import stanza
from collections import Counter

# Data preparation and preprocessing

## Create stanza_only dataset 

In [4]:
# Isolate Stanza-Only Entities

# Load full Stanza output
df_stanza_all = pd.read_csv("../output/ner_stanza_pl.csv")  
df_common = pd.read_csv("../output/common_org_entities.csv")

In [5]:
df_stanza_all.head(5)

Unnamed: 0,Text,ORG_Entities_stanza,ICS_ID
0,Badania skupiające się na szczegółowej analizi...,['Komitetu Nauk Weterynaryjnych i Rozrodu Zwie...,00153fbd-82f7-48c4-b5bd-e830bc390244
1,"Birdwatching, czyli obserwacje w terenie ptakó...","['Królewskie Towarzystwo Ochrony Ptaków', 'Fac...",002768f1-8b96-4e0f-bcc8-192eb0594e60
2,Efektywny transfer wiedzy jest podstawowym czy...,"['MŚP', 'MŚP', 'ETW']",00500483-f00c-4410-b6f7-8650a003125f
3,Ważnym obszarem działalności naukowej WSPiA je...,"['WSPiA', 'AP', 'WSPiA', '4 Zespoły', 'AP', 'A...",006e7fef-2083-426d-9c1b-1affd27b939e
4,Znaczna część europejskiego dziedzictwa archeo...,"['Interreg Central Europe', 'Archaeological He...",00901439-d91a-48e0-903a-26a4253c3a0c


In [6]:
# Check the data type of the values in the entity column

type(df_stanza_all["ORG_Entities_stanza"].iloc[0])

str
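
Since each entity list is stored as a string, it has to be parsed back into a Python list before it can be flattened. A minimal sketch of that step, using a hypothetical helper `parse_org_list` (not part of the notebook) and the stringified list from row 2 of the preview above:

```python
from ast import literal_eval

def parse_org_list(raw):
    """Parse a stringified list of entities; return [] if it cannot be parsed."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Missing values (NaN) and malformed strings end up here
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty strings, stripped of surrounding whitespace
    return [item.strip() for item in parsed if isinstance(item, str) and item.strip()]

example = "['MŚP', 'MŚP', 'ETW']"
orgs = parse_org_list(example)
print(type(example).__name__, "->", type(orgs).__name__, orgs)
```

Returning an empty list for unparseable input is one possible design choice; the notebook instead skips such rows inside the loop.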

In [7]:
# Flatten the Stanza output into (ICS_ID, ORG_Entity) pairs

# List to hold the flattened records
all_stanza_entities = []

# Loop through each row in the Stanza data
for _, row in df_stanza_all.iterrows():
    ics_id = row["ICS_ID"]
    try:
        # The entity lists are stored as strings, so parse them back into Python lists
        org_list = literal_eval(row["ORG_Entities_stanza"])
    except (ValueError, SyntaxError, TypeError):
        # Skip rows whose entity list is missing or cannot be parsed
        continue
    if not isinstance(org_list, list):
        continue
    for org in org_list:
        # Check the type before stripping so a non-string item is skipped on its own
        if isinstance(org, str) and org.strip():
            all_stanza_entities.append((ics_id, org.strip()))

# Create dataframe
all_stanza_flat = pd.DataFrame(all_stanza_entities, columns=["ICS_ID", "ORG_Entity"])

In [8]:
all_stanza_flat.head(20)

Unnamed: 0,ICS_ID,ORG_Entity
0,00153fbd-82f7-48c4-b5bd-e830bc390244,Komitetu Nauk Weterynaryjnych i Rozrodu Zwierz...
1,00153fbd-82f7-48c4-b5bd-e830bc390244,Rady Doradczej
2,00153fbd-82f7-48c4-b5bd-e830bc390244,Advisory Board
3,00153fbd-82f7-48c4-b5bd-e830bc390244,Medycyna Weterynaryjna
4,00153fbd-82f7-48c4-b5bd-e830bc390244,Journal of Applied Genetics
5,00153fbd-82f7-48c4-b5bd-e830bc390244,SPRINGER
6,00153fbd-82f7-48c4-b5bd-e830bc390244,Gene
7,00153fbd-82f7-48c4-b5bd-e830bc390244,ELSEVIER
8,00153fbd-82f7-48c4-b5bd-e830bc390244,Scientific Reports
9,00153fbd-82f7-48c4-b5bd-e830bc390244,Zespołu


In [9]:
# Filter out entities that were already processed (the 6111 “common”)

# Set of already classified entities
common_entities_set = set(df_common['ORG_Entity'].str.strip())

# Keep only stanza-only entries
df_stanza_only = all_stanza_flat[~all_stanza_flat["ORG_Entity"].isin(common_entities_set)].copy()

# Check number of unique entities and entries
print("Unique entities:", df_stanza_only["ORG_Entity"].nunique())
print("Total rows:", len(df_stanza_only))

Unique entities: 11877
Total rows: 15092
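
The filter above is an exact, case-sensitive string comparison, and the two counts differ because the same entity can appear in several documents (or several times in one). A small pure-Python sketch of the same logic; the document IDs and the contents of `common` are made up for illustration:

```python
# Illustrative (ICS_ID, ORG_Entity) pairs; not taken from the real data
pairs = [
    ("doc-1", "SPRINGER"),
    ("doc-1", "Rady Doradczej"),
    ("doc-2", "Rady Doradczej"),
    ("doc-2", "ELSEVIER"),
    ("doc-3", "Elsevier"),
]
common = {"ELSEVIER"}  # stands in for the already classified entities

# Keep only pairs whose entity string is not in the common set
stanza_only = [(doc_id, org) for doc_id, org in pairs if org not in common]
unique_entities = {org for _, org in stanza_only}

print("Unique entities:", len(unique_entities))
print("Total rows:", len(stanza_only))
```

Note that `"Elsevier"` survives the filter even though `"ELSEVIER"` is in the common set, because the comparison does not normalize case.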


In [10]:
# Save the Stanza-only entities to CSV

df_stanza_only.to_csv("../output/stanza_only_entities.csv", index=False)

## Preprocess stanza_only: STEP 1 Basic cleaning

In [11]:
# Function for basic cleaning: whitespace, stray punctuation, removing very short entries

def clean_entity(entity):
    if not isinstance(entity, str):       # Check if the input is a string; if not, return None
        return None
    entity = re.sub(r'\s+', ' ', entity).strip()    # Replace multiple whitespace characters with a single space and trim leading/trailing spaces
    entity = re.sub(r'^[^\w]+', '', entity)         # Remove any non-word characters from the beginning of the string
    entity = re.sub(r'[^\w.]+$', '', entity)        # Remove any non-word characters (excluding period) from the end of the string
    return entity if len(entity) >= 3 else None      # Return the cleaned entity if it has at least 3 characters; otherwise, return None

In [12]:
# Apply the clean_entity function

df_stanza_only["Cleaned_Entity"] = df_stanza_only["ORG_Entity"].apply(clean_entity)
stanza_only_cleaned = df_stanza_only.dropna(subset=["Cleaned_Entity"]).reset_index(drop=True)
print(f"After cleaning: {len(stanza_only_cleaned)} rows.")

After cleaning: 14893 rows.
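
To make the cleaning rules concrete, the sketch below repeats `clean_entity` and runs it on a few hand-made inputs (the sample strings are illustrative, not drawn from the dataset):

```python
import re

def clean_entity(entity):
    if not isinstance(entity, str):
        return None
    entity = re.sub(r'\s+', ' ', entity).strip()   # collapse whitespace
    entity = re.sub(r'^[^\w]+', '', entity)        # strip leading non-word characters
    entity = re.sub(r'[^\w.]+$', '', entity)       # strip trailing non-word characters, keep periods
    return entity if len(entity) >= 3 else None    # drop entries shorter than 3 characters

samples = ['  „Dolina   Lotnicza”  ', '(MŚP),', 'S.A.', 'AP', None]
cleaned = [clean_entity(s) for s in samples]
for raw, result in zip(samples, cleaned):
    print(repr(raw), "->", repr(result))
```

Quotation marks and brackets are stripped, the trailing period of `S.A.` is kept, and the two-letter `AP` is dropped by the length threshold.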


## Preprocess stanza_only: STEP 2 Rule-based academic entity removal

In [13]:
# Load academic keyword list (copied from notebook 04)

academic_keywords = [
    'instytut', 'pan', 'university', 'nauk', 'uniwersytet', 'wydział', 'wydzial', 'department', 'badań', 'akadem', 
    'katedr', 'politechni', 'laboratorium', 'research', 'institut', 'fizy', 'matematy', 
    'architektur', 'pracown', 'pedagogi', 'filozof', 'medycyn', 'medyczn', 'medical', 'językoznawstwo', 
    'mickiewicz', 'biolog', 'studia', 'uczeln', 'kolegium', 'collegium', 'studium', 'colleg', 'universit', 
    'wyższ', 'journal', 'springer', 'doktor']

In [14]:
# Lowercase for rule-based match

stanza_only_cleaned["entity_lower"] = stanza_only_cleaned["Cleaned_Entity"].str.lower()

In [15]:
# Remove academic entities

stanza_only_cleaned["Is_Academic"] = stanza_only_cleaned["entity_lower"].apply(lambda x: any(kw in x for kw in academic_keywords))
stanza_only_non_academic = stanza_only_cleaned[~stanza_only_cleaned["Is_Academic"]].copy().reset_index(drop=True)
print(f"After removing academic entities: {len(stanza_only_non_academic)} rows.")

After removing academic entities: 12490 rows.
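
Because the filter uses substring matching, short keywords such as `'pan'` also match inside longer, unrelated words. The sketch below illustrates the effect on a made-up name and shows one possible mitigation, a hypothetical `is_academic_token_aware` helper (not part of the notebook) that requires short keywords to match a whole token. The keyword list is a subset of the one above.

```python
import re

academic_keywords_subset = ["instytut", "pan", "uniwersytet", "akadem"]

def is_academic_substring(name, keywords):
    """Substring matching, as used in the notebook."""
    lowered = name.lower()
    return any(kw in lowered for kw in keywords)

def is_academic_token_aware(name, keywords, min_substring_len=4):
    """Hypothetical variant: keywords shorter than min_substring_len must match a whole token."""
    lowered = name.lower()
    tokens = set(re.findall(r"\w+", lowered))
    for kw in keywords:
        if len(kw) < min_substring_len:
            if kw in tokens:
                return True
        elif kw in lowered:
            return True
    return False

for name in ["Instytut Fizyki PAN", "Kompania Węglowa"]:
    print(name,
          is_academic_substring(name, academic_keywords_subset),
          is_academic_token_aware(name, academic_keywords_subset))
```

The threshold of four characters is an arbitrary assumption; whether such a change is worthwhile depends on how many non-academic entities the substring filter removes in practice.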


# Rule-based Classification

In [16]:
# Download and initialize Stanza for Polish

stanza.download("pl")
nlp = stanza.Pipeline(lang="pl", processors="tokenize,mwt,lemma")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-06-24 15:34:15 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-06-24 15:34:15 INFO: Downloading default packages for language: pl (Polish) ...
2025-06-24 15:34:16 INFO: File exists: C:\Users\lewandowska\stanza_resources\pl\default.zip
2025-06-24 15:34:18 INFO: Finished downloading models and saved to C:\Users\lewandowska\stanza_resources
2025-06-24 15:34:18 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-06-24 15:34:18 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-06-24 15:34:18 INFO: Loading these models for language: pl (Polish):
| Processor | Package      |
----------------------------
| tokenize  | pdb          |
| mwt       | pdb          |
| lemma     | pdb_nocharlm |

2025-06-24 15:34:18 INFO: Using device: cpu
2025-06-24 15:34:18 INFO: Loading: tokenize
2025-06-24 15:34:19 INFO: Loading: mwt
2025-06-24 15:34:19 INFO: Loading: lemma
2025-06-24 15:34:20 INFO: Done loading processors!


In [17]:
# Function to lemmatize entities

def lemmatize_entity(text):
    doc = nlp(text)
    lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
    return " ".join(lemmas)

In [18]:
# Apply lemmatize entities function

stanza_only_non_academic["Lemma_Entity"] = stanza_only_non_academic["Cleaned_Entity"].apply(lemmatize_entity)
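
The pipeline is applied to every row, although many entity names repeat. One possible optimization, sketched here with a hypothetical `lemmatize_unique` helper and a stand-in lemmatizer that only lowercases text (in the notebook the real `lemmatize_entity` would take its place), is to lemmatize each distinct name once and map the result back:

```python
def lemmatize_unique(names, lemmatize_fn):
    """Lemmatize each distinct name once and map the results back to every row."""
    cache = {name: lemmatize_fn(name) for name in set(names)}
    return [cache[name] for name in names]

calls = []

def fake_lemmatizer(text):
    # Stand-in for lemmatize_entity: records the call and lowercases the text
    calls.append(text)
    return text.lower()

names = ["Rady Doradczej", "SPRINGER", "Rady Doradczej"]
lemmas = lemmatize_unique(names, fake_lemmatizer)
print(lemmas)
print("Lemmatizer calls:", len(calls))
```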

In [19]:
# Fix incorrect lemmatizations (from notebook 05)

lemma_fixes = {
    "do_tar": "DG",
    "sa .": "s.a.",
    "s.a .": "s.a.",
    "o.o": "o.o.",
    "z ograniczona_odpowiedzialność .": "z o.o.",
    "z ograniczona_odpowiedzialność": "z o.o.",
    "do_spraw .": "ds.",
    "te_jeta_bowy": "TV",
    "roktóry": "rp",
    "et_yg_pały": "ETW",
    "to_sysapipiporoby": "TVP",
    "lt_yp_dwownokinoć": "Ltd",
    "fektometr_pici": "FM",
}

# Apply the correction

def apply_fixes(text):
    for wrong, correct in lemma_fixes.items():
        text = text.replace(wrong, correct)
    return text

stanza_only_non_academic["Lemma_Cleaned"] = stanza_only_non_academic["Lemma_Entity"].apply(apply_fixes)
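
The fixes are plain substring replacements applied one after another in dictionary order. A short sketch on hand-made lemma strings, using only a subset of `lemma_fixes` (the input strings are illustrative):

```python
lemma_fixes_subset = {
    "sa .": "s.a.",
    "o.o": "o.o.",
    "z ograniczona_odpowiedzialność .": "z o.o.",
}

def apply_fixes(text, fixes=lemma_fixes_subset):
    # Apply every replacement in insertion order
    for wrong, correct in fixes.items():
        text = text.replace(wrong, correct)
    return text

examples = [
    "bank sa .",
    "firma z o.o",
    "spółka z ograniczona_odpowiedzialność .",
    "firma z o.o.",
]
fixed = [apply_fixes(t) for t in examples]
for before, after in zip(examples, fixed):
    print(repr(before), "->", repr(after))
```

The last example shows a side effect: a string that already contains `o.o.` gains an extra period (`o.o..`). This does not affect the later substring match on `"z o.o."`, but it is worth keeping in mind when inspecting `Lemma_Cleaned`.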

In [106]:
# Rule-based keyword dictionary (from notebook 05, extended manually after reviewing unmatched entities)

lemmatized_keywords_by_category_clean = {
    "Government / Public Administration": [
        "urząd", "ministerstwo", "gmina", "powiat", "rada", "sejm", "senat",
        "województwo", "rp", "komisja", "samorząd", "narodowy", "krajowy", "izba",
        "państwowy", "miasto", "regionalny", "województwo", "wojewódzki", "marszałkowski",
        "sąd", "inspektorat", "parlament", "rzeczpospolita", "agencja", "państwowy", "stołeczny", "stołeczne", "ambasada",
        "rzecznik", "senacko", "inspekcja", "inspekcję", "inspektor", "konsula", "turystyczna", "turystyczny", "delegatura",
        "punkt", "fundusz", "skarb", "rząd", "dyrekcja", "archiwum"
    ],
    "NGO / Association / Foundation": [
        "fundacja", "fundacją", "stowarzyszenie", "towarzystwo", "zrzeszenie", "federacja", "koalicja", "association", "związek", "obywatelski",
        "wspólnota", "pomoc"
    ],

    "Media / Publishing": [
        "radio", "tv", "gazeta", "media", "wydawnictwo", "czasopismo", "prasa", "tvp", "fm"
    ],
    "Cultural Institution / Arts": [
        "muzeum", "teatr", "galeria", "festiwal", "filharmonia", "dom", "kultura", "sztuka", "artystyczny", "artystyczne",
        "twórczość", "książka", "muzyczny", "museum", "koncert", "zamek", "królewski", "pałac", "klub", "filmowy",
        "heritage"
    ],
    "Health / Hospitals / Medical": [
        "zdrowia", "zdrowie", "klinika", "szpital", "lekarz", "medyczny", "przychodnia", "sanatorium", "rehabilitacja", "zdrowiu", "zdrowotny",
        "hospicjum", "choroba", "epidemia", "health", "uzdrowisko", "sanatorium", "chory"
    ],
    "Religious Organization": [
        "kościół", "parafia", "diecezja", "episkopat", "zakon", "misja", "cerkiew", "salwatorianie", "salwatorian", "duchowa", "biblijny", "biblijnym", "kuria",
        "metropolita", "parafia", "parafię"
    ],
    "Military / Defense / Security": [
        "wojsko", "żandarmeria", "bezpieczeństwo", "policja", "straż", "obrona", "militaria", "komendant", "lotniczy", "lotniczej", "policję", "zbrojny"
    ],
    "International Organization / EU": [
        "europejski", "unia", "ue", "nato", "unesco", "oecd", "who", "międzynarodowy", "international", "european", "DG", "union", "dyrektoriat", "nations"
    ],
    "Company / Business": [
       "s.a.", "s.a", "z o.o.", "holding", "firma", "grupa", "przedsiębiorstwo",
        "technologia", "logistyka", "consulting", "solutions", "commerce", "industry", "group", "spółka", "Ltd", "stocznia", "kopalni", "firmą", "huta"
    ],
       
    "Education (non-university)": [
        "liceum", "technikum", "szkoła", "podstawowy", "przedszkole", "edukacja", "biblioteka", "bibliotek", "szkolenie", "nauczyciel", "podyplomowy",
        "oświatowy", "oświata", "kurator", "kuratorium", "szkolno", "szkolny", "przedszkolny"
    ],
    "Other / Unclear": []  # fallback category
}


In [107]:
# Define a function to match a lemmatized entity against category keyword lists

def match_entity_to_category(entity, keyword_dict):
    """
    Matches a lemmatized entity string to one of the predefined categories
    based on the presence of any lemmatized keywords.

    Parameters:
        entity (str): The lemmatized name of an organization.
        keyword_dict (dict): Dictionary mapping category names to keyword lists.

    Returns:
        str: The matched category name, or "Other / Unclear" if no match found.
    """
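    # The entity is lowercased but the keywords are compared as written,
    # so mixed-case keywords (e.g. "DG", "Ltd") cannot match here.
    entity_lower = entity.lower()
    # Categories are checked in dictionary order; the first one with a
    # keyword contained in the entity (substring match) wins.
    for category, keywords in keyword_dict.items():
        if any(keyword in entity_lower for keyword in keywords):
            return category
    return "Other / Unclear"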

In [108]:
# Apply the match_entity_to_category function

stanza_only_non_academic["Matched_Category"] = stanza_only_non_academic["Lemma_Cleaned"].apply(lambda x: match_entity_to_category(x, lemmatized_keywords_by_category_clean))
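
Since the matcher returns the first category whose keywords match, the order of the dictionary decides ties. The sketch below uses a two-category subset and an illustrative lemma string (assumed, not taken from the data) to show this, together with a hypothetical `match_all_categories` helper that lists every matching category for inspection:

```python
keywords_subset = {
    "Government / Public Administration": ["narodowy", "fundusz"],
    "Health / Hospitals / Medical": ["zdrowie", "szpital"],
}

def match_entity_to_category(entity, keyword_dict):
    """First matching category wins, as in the notebook."""
    entity_lower = entity.lower()
    for category, keywords in keyword_dict.items():
        if any(keyword in entity_lower for keyword in keywords):
            return category
    return "Other / Unclear"

def match_all_categories(entity, keyword_dict):
    """Hypothetical helper: return every category with at least one matching keyword."""
    entity_lower = entity.lower()
    return [
        category
        for category, keywords in keyword_dict.items()
        if any(keyword in entity_lower for keyword in keywords)
    ]

lemma = "narodowy fundusz zdrowie"  # illustrative lemmatized name
first = match_entity_to_category(lemma, keywords_subset)
all_hits = match_all_categories(lemma, keywords_subset)
print(first)
print(all_hits)
```

Listing all hits is only a diagnostic aid; it can help decide whether the category order in the dictionary should be changed.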

In [109]:
# Split: rule-matched and unmatched

stanza_non_academic_rule_matched = stanza_only_non_academic[stanza_only_non_academic["Matched_Category"] != "Other / Unclear"].copy()
stanza_non_academic_unmatched = stanza_only_non_academic[stanza_only_non_academic["Matched_Category"] == "Other / Unclear"].copy()

print(f"Rule-matched: {len(stanza_non_academic_rule_matched)} rows.")
print(f"Unmatched: {len(stanza_non_academic_unmatched)} rows.")

Rule-matched: 4549 rows.
Unmatched: 7941 rows.
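
As a quick consistency check on these counts, the two groups should add up to the 12490 non-academic rows, and the share of rows covered by the rules follows directly. A minimal sketch using the numbers printed above:

```python
total_rows = 12490      # rows after removing academic entities
matched_rows = 4549     # rule-matched rows
unmatched_rows = 7941   # rows left in "Other / Unclear"

# The two groups partition the non-academic rows
assert matched_rows + unmatched_rows == total_rows

coverage_pct = round(100 * matched_rows / total_rows, 1)
print(f"Rule coverage: {coverage_pct}% of rows")  # Rule coverage: 36.4% of rows
```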


In [110]:
# Number of unique original entity strings in each group (matched, then unmatched)
print(stanza_non_academic_rule_matched["ORG_Entity"].nunique())
print(stanza_non_academic_unmatched["ORG_Entity"].nunique())


3821
5849


In [111]:
# Inspect the data: check frequencies of categories

cat_freq = Counter(stanza_non_academic_rule_matched["Matched_Category"])
cat_freq

Counter({'Government / Public Administration': 1733,
         'Company / Business': 657,
         'Cultural Institution / Arts': 537,
         'NGO / Association / Foundation': 495,
         'International Organization / EU': 383,
         'Education (non-university)': 237,
         'Media / Publishing': 198,
         'Health / Hospitals / Medical': 174,
         'Military / Defense / Security': 76,
         'Religious Organization': 59})
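
To read these frequencies as shares of the rule-matched rows, the counts can be normalized by their total. A short sketch that re-enters the counter shown above:

```python
from collections import Counter

cat_freq = Counter({
    'Government / Public Administration': 1733,
    'Company / Business': 657,
    'Cultural Institution / Arts': 537,
    'NGO / Association / Foundation': 495,
    'International Organization / EU': 383,
    'Education (non-university)': 237,
    'Media / Publishing': 198,
    'Health / Hospitals / Medical': 174,
    'Military / Defense / Security': 76,
    'Religious Organization': 59,
})

total = sum(cat_freq.values())
shares = {category: round(100 * count / total, 1) for category, count in cat_freq.items()}

# Print categories from most to least frequent with their percentage share
for category, count in cat_freq.most_common():
    print(f"{category:<40}{count:>6}{shares[category]:>7.1f}%")
```

The total equals the 4549 rule-matched rows, and government and public administration alone accounts for roughly 38% of them.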

In [112]:
set(stanza_non_academic_unmatched["Lemma_Cleaned"])

{'belorus',
 'kuow',
 'prodrobota',
 'sgppl „ dolina lotnicza',
 'FAPPS',
 'on- 1',
 'nestle',
 'mrk_hpam',
 'mza',
 'info',
 'hnms',
 'dolar_ameryłsza',
 'Słttip',
 'Schneider Polska',
 'nike',
 'frontiers forówna Young minds',
 'müller',
 'sędzia2',
 'Zagroda podlaski',
 'glinojeck',
 'kafu',
 'newg lab pharma',
 'kuźnia pomysłów',
 'pRL-u',
 'krakowskie chorągwi ZHP',
 'zespół badanie',
 "l'oreal",
 'msatjo Manggha',
 'cogInfoCom',
 'stacja badawcza',
 'supramak',
 'shadowing',
 'kaszubski UL',
 'Ir_A',
 'uniwersytet w Edynburg',
 'sk-cz',
 'litgrid',
 'IX gremium ekspert turystyka',
 'kongres',
 'centrum informacja i promocja śródlądowy droga wodny w Bydgoszcz',
 'associação para',
 'tow . przyjaciel sulejówka',
 'apeva',
 'giorin',
 'angielski . science inny poland',
 'górnictwo i energetyka agś',
 'we',
 'strategia dunajska',
 'JSA',
 'thyssenkrupp',
 'gdańską',
 'K . chty',
 'social affairs and inclusion',
 'kaplast',
 'trace',
 'jada',
 'Z. wałęgi',
 'spebeI flow',
 'hed KMS',
