# Academia–Practice Interaction Mapping Using NLP

**Notebook 05: Entity Classification**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** April 2025  

---

### Notebook Overview

**Goal:** Develop a typology of non-academic organizations based on their names, using an inductive, grounded approach.

This notebook:
- Loads the cleaned list of non-academic organization names
- Randomly samples 1,000 entries for manual review and typology creation
- Prepares a CSV file for human annotation

---



## STEP 3: Categorize data

In [1]:
import pandas as pd
import random

# Load non-academic orgs
df_non_academic = pd.read_csv("../output/non_academic_org_entities.csv")
non_academic_entities = df_non_academic["ORG_Entity"].dropna().tolist()

In [2]:
# Establish categories of non-academic entities

"""
To classify non-academic entities identified in the impact case studies, I developed a typology of organization 
types based on a grounded, inductive review process. Specifically, I randomly selected a sample of approximately 1,000 
unique non-academic organization names from the full list of 4,735 deduplicated entities. This sample served as an 
exploratory base for developing an inductive typology. By manually reviewing the selected entries, I identified recurring 
organizational patterns and formulated a set of categories that reflect the functional diversity of non-academic 
stakeholders mentioned in the case studies.
"""





In [5]:
# Create a random sample of 1000 entities

# Set seed for reproducibility
random.seed(42)

# Randomly sample 1,000 entries from the non-academic entities list
sample_size = 1000
non_academic_sampled = random.sample(non_academic_entities, sample_size)

# Convert to DataFrame for review
non_academic_sampled_df = pd.DataFrame(non_academic_sampled, columns=["organization_name"])
non_academic_sampled_df.to_csv("../output/non_academic_sampled.csv", index=False, encoding="utf-8-sig")
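
A slightly more defensive variant of the sampling step is sketched below, using only the standard library. The helper name `sample_unique` and its parameters are hypothetical and not part of the notebook. It assumes that duplicates should be dropped before sampling and that the sample size should be capped at the number of available names.

```python
import random


def sample_unique(names, sample_size, seed=42):
    """Return a reproducible random sample of unique names (hypothetical helper)."""
    # dict.fromkeys drops duplicates while keeping first-seen order,
    # so the result does not depend on set ordering.
    unique_names = list(dict.fromkeys(names))
    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of unique names.
    k = min(sample_size, len(unique_names))
    # A local Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(unique_names, k)


# Toy input for illustration only; the real list comes from the CSV loaded above.
demo = ["Polskie Radio", "NFZ", "UNESCO", "NFZ", "Teatr Wielki"]
sampled = sample_unique(demo, sample_size=3)
print(sampled)
```

Calling the helper twice with the same seed returns the same sample, which is the property the fixed seed in the cell above is meant to provide.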


In [4]:
# Establish categories of non-academic entities

"""
1. Company / Business
Commercial enterprises, corporations, startups, and private firms (e.g., Kaufland, KGHM ZANAM, Voicelab, Photon).

2. Government / Public Administration
Includes ministries, central/local government agencies, parliament, and other state entities (e.g., Senat RP, Urząd Miasta, Ministerstwo Rozwoju).

3. NGO / Association / Foundation
Non-profit organizations, foundations, professional associations, and social initiatives (e.g., Fundacja La Strada, Polskie Towarzystwo Psychologiczne, Stowarzyszenie Wioska Gotów).

4. Media / Publishing
News outlets, broadcasters, publishers, and cultural magazines (e.g., Polskie Radio, TVP Info, Deutsche Welle, Gazeta Lubuska).

5. Cultural Institution / Arts
Museums, theatres, orchestras, festivals, galleries (e.g., Teatr Wielki, Muzeum Historii Polski, Galeria Arsenał).

6. Health / Hospitals / Medical
Clinics, hospitals, medical institutes, and health-related organizations (e.g., Centrum Zdrowia Szansa, NFZ, American Heart Association).

7. Religious Organization
Churches, dioceses, religious associations, and theological institutions (e.g., Kościół Katolicki, Episkopat Polski, Cerkiew).

8. Military / Defense / Security
Armed forces, police, defense industry, or military R&D (e.g., Wojsko Polskie, Żandarmeria Wojskowa, Lockheed Martin).

9. International Organization / EU
UN, EU, NATO, OECD, international consortia or partnerships (e.g., European Commission, UNESCO, OECD).

10. Education (non-university)
Includes schools, kindergartens, vocational schools, continuing education centers (e.g., Szkoła Podstawowa, Centrum Kształcenia Ustawicznego).

11. Other / Unclear
Anything that doesn’t clearly fall into the above categories or needs human validation.

"""


## **Rule-based categorization pipeline**

**Step 1:** Build keyword lists per category

**Step 2:** Lemmatize keyword lists

**Step 3:** Lemmatize entities

**Step 4:** Match each lemmatized entity against the lemmatized keyword lists
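
The matching step is not shown in this part of the notebook, so the following is only a minimal sketch of how Step 4 could work. The function name `classify_entity`, the first-match rule, and the toy keyword dictionary are assumptions made for illustration. The input is assumed to be a list of lemmas already produced for one entity in Step 3.

```python
def classify_entity(entity_lemmas, keywords_by_category, fallback="Other / Unclear"):
    """Return the first category whose keywords share a lemma with the entity.

    Assumption: categories are checked in dictionary order and the first
    match wins; the notebook's actual rule may differ.
    """
    lemma_set = {lemma.lower() for lemma in entity_lemmas}
    for category, keywords in keywords_by_category.items():
        # A non-empty intersection means at least one keyword matched.
        if lemma_set & set(keywords):
            return category
    return fallback


# Toy keyword lists for illustration only.
toy_keywords = {
    "NGO / Association / Foundation": ["fundacja", "stowarzyszenie"],
    "Media / Publishing": ["radio", "gazeta"],
    "Other / Unclear": [],
}

ngo = classify_entity(["fundacja", "la", "strada"], toy_keywords)
media = classify_entity(["polski", "radio"], toy_keywords)
other = classify_entity(["kaufland"], toy_keywords)
print(ngo, media, other, sep="\n")
```

With a first-match rule, the order of categories in the dictionary decides ties, so an entity that matches several categories is worth flagging for manual review.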

### **Step 1:** Build keyword lists per category

In [3]:
# Keyword lists by category

keywords_by_category = {
    "Company / Business": [
        "sp", "sa", "s.a.", "holding", "firma", "grupa", "przedsiębiorstwo",
        "technologie", "logistyka", "consulting", "solutions", "commerce", "industry"
    ],
    "Government / Public Administration": [
        "urząd", "ministerstwo", "gmina", "powiat", "rady", "sejm", "senat",
        "województwo", "rp", "komisji", "samorząd"
    ],
    "NGO / Association / Foundation": [
        "fundacja", "stowarzyszenie", "towarzystwo", "zrzeszenie", "federacja", "koalicja"
    ],
    "Media / Publishing": [
        "radio", "tv", "gazeta", "media", "wydawnictwo", "czasopismo", "prasa"
    ],
    "Cultural Institution / Arts": [
        "muzeum", "teatr", "galeria", "festiwal", "filharmonia", "dom", "kultury", "im"
    ],
    "Health / Hospitals / Medical": [
        "zdrowia", "klinika", "szpital", "lekarz", "medyczne", "przychodnia", "sanatorium"
    ],
    "Religious Organization": [
        "kościół", "parafia", "diecezja", "episkopat", "zakon", "misja", "cerkiew"
    ],
    "Military / Defense / Security": [
        "wojsko", "żandarmeria", "bezpieczeństwo", "policja", "straż", "obrona", "militaria"
    ],
    "International Organization / EU": [
        "europejski", "europejskiej", "unia", "unii", "ue", "nato", "unesco", "oecd", "who"
    ],
    "Education (non-university)": [
        "centrum", "liceum", "technikum", "szkoła", "podstawowa", "przedszkole", "edukacja"
    ],
    "Other / Unclear": []  # fallback category
}



### **Step 2:** Lemmatize Keyword Lists

In [4]:
import stanza

# 1. Download and initialize Stanza for Polish
stanza.download("pl")
nlp = stanza.Pipeline(lang="pl", processors="tokenize,mwt,lemma")




2025-05-12 15:09:09 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-05-12 15:09:09 INFO: Downloading default packages for language: pl (Polish) ...
2025-05-12 15:09:10 INFO: File exists: C:\Users\lewandowska\stanza_resources\pl\default.zip
2025-05-12 15:09:12 INFO: Finished downloading models and saved to C:\Users\lewandowska\stanza_resources
2025-05-12 15:09:12 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2025-05-12 15:09:12 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-05-12 15:09:12 INFO: Loading these models for language: pl (Polish):
| Processor | Package      |
----------------------------
| tokenize  | pdb          |
| mwt       | pdb          |
| lemma     | pdb_nocharlm |

2025-05-12 15:09:12 INFO: Using device: cpu
2025-05-12 15:09:12 INFO: Loading: tokenize
2025-05-12 15:09:14 INFO: Loading: mwt
2025-05-12 15:09:14 INFO: Loading: lemma
2025-05-12 15:09:15 INFO: Done loading processors!


In [5]:
# 2. Function to lemmatize a list of keywords
def lemmatize_keywords(keyword_list):
    lemmatized = []
    for kw in keyword_list:
        doc = nlp(kw.lower())
        for sent in doc.sentences:
            for word in sent.words:
                lemmatized.append(word.lemma)
    return list(set(lemmatized))  # remove duplicates

In [6]:
# 3. Lemmatize each category's keywords
lemmatized_keywords_by_category = {
    category: lemmatize_keywords(words)
    for category, words in keywords_by_category.items()
}

In [7]:
# 4. Print to review
for cat, lemmas in lemmatized_keywords_by_category.items():
    print(f"\n{cat}:\n", lemmas)


Company / Business:
 ['grupa', 'industry', 'sa', 'technologia', 'logistyka', 'przedsiębiorstwo', '.', 'consulting', 'solutions', 'commerke', 'holding', 'firma', 's.a']

Government / Public Administration:
 ['ministerstwo', 'rada', 'województwo', 'roktóry', 'komisja', 'samorząd', 'powiat', 'urząd', 'senat', 'pan', 'gmina', 'sejm']

NGO / Association / Foundation:
 ['towarzystwo', 'fundacja', 'federacja', 'stowarzyszenie', 'koalicja', 'zrzeszenie']

Media / Publishing:
 ['te_jeta_bowy', 'radio', 'wydawnictwo', 'prasa', 'gazeta', 'media', 'czasopismo']

Cultural Institution / Arts:
 ['galeria', 'festiwal', 'filharmonia', 'on', 'kultura', 'teatr', 'dom', 'muzeum']

Health / Hospitals / Medical:
 ['szpital', 'klinika', 'zdrowie', 'przychodnia', 'lekarz', 'medyczny', 'sanatorium']

Religious Organization:
 ['episkopat', 'cerkiew', 'diecezja', 'parafia', 'zakon', 'misja', 'kościół']

Military / Defense / Security:
 ['militaria', 'wojsko', 'żandarmeria', 'policja', 'obrona', 'straż', 'bezpiec

### **Step 2b (Manual Cleanup):** Clean Incorrect Lemmatizations

During the lemmatization of the category-specific keyword lists (Step 2), several inaccurate or malformed lemmas appeared, mostly for abbreviations and non-Polish words. These included:

- **Punctuation-based artifacts**  
  e.g., `"S.A."` → `"."`

- **Hallucinated or unrelated lemmas**  
  e.g., `"RP"` → `"roktóry"`, `"TV"` → `"te_jeta_bowy"`

- **Typographical errors introduced by lemmatization**  
  e.g., `"commerce"` → `"commerke"`

To ensure the integrity of the classification logic, all keyword–lemma pairs were manually reviewed. Problematic cases were either:

- **Corrected** by reintroducing the original keyword form
- **Excluded** entirely if deemed irrelevant or too noisy

A cleaned, curated set of lemmatized keywords was then produced for each category. These lists serve as high-confidence matching vocabularies used in downstream classification.

> All corrections were documented to maintain full transparency and reproducibility of the pipeline.
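
Part of this review can be pre-screened automatically. The sketch below flags `(original, lemma)` pairs that match a few simple heuristics. The function name `flag_suspicious_pairs` and the heuristics themselves are assumptions for illustration, not the procedure used in the notebook, and the sample pairs are copied from the lemmatizer output shown in this notebook.

```python
from collections import Counter


def flag_suspicious_pairs(pairs):
    """Flag (original, lemma) pairs that deserve manual review (heuristic sketch)."""
    # A keyword that appears more than once was split into several tokens.
    token_counts = Counter(original for original, _ in pairs)
    flagged = []
    for original, lemma in pairs:
        reasons = []
        if token_counts[original] > 1:
            reasons.append("keyword split into several tokens")
        if not any(ch.isalpha() for ch in lemma):
            reasons.append("lemma has no letters")
        if "_" in lemma:
            reasons.append("lemma contains an underscore")
        if lemma[:2] != original.lower()[:2]:
            reasons.append("lemma does not share the keyword's first two characters")
        if reasons:
            flagged.append((original, lemma, reasons))
    return flagged


pairs = [
    ("s.a.", "s.a"), ("s.a.", "."), ("rp", "roktóry"),
    ("tv", "te_jeta_bowy"), ("rady", "rada"), ("commerce", "commerke"),
]
flagged = flag_suspicious_pairs(pairs)
for original, lemma, reasons in flagged:
    print(original, "->", lemma, "|", "; ".join(reasons))
```

These heuristics do not catch every problem: `commerce` → `commerke` passes all four checks, so the screen can only narrow the manual review, not replace it.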


In [8]:
# Function to lemmatize each keyword while preserving the original form.
# Returns a list of (original, lemma) pairs for transparency and manual review.
# This allows tracking any unexpected or incorrect lemmatization results.

def lemmatize_keywords_with_originals(keyword_list):
    pairs = []
    for kw in keyword_list:
        doc = nlp(kw.lower())
        for sent in doc.sentences:
            for word in sent.words:
                pairs.append((kw, word.lemma))
    return pairs

In [11]:
# Apply lemmatization to each keyword in every category

lemmatized_pairs_by_category = {
    category: lemmatize_keywords_with_originals(keyword_list)
    for category, keyword_list in keywords_by_category.items()
}

lemmatized_pairs_by_category

{'Company / Business': [('sp', 'sa'),
  ('sa', 'sa'),
  ('s.a.', 's.a'),
  ('s.a.', '.'),
  ('holding', 'holding'),
  ('firma', 'firma'),
  ('grupa', 'grupa'),
  ('przedsiębiorstwo', 'przedsiębiorstwo'),
  ('technologie', 'technologia'),
  ('logistyka', 'logistyka'),
  ('consulting', 'consulting'),
  ('solutions', 'solutions'),
  ('commerce', 'commerke'),
  ('industry', 'industry')],
 'Government / Public Administration': [('urząd', 'urząd'),
  ('ministerstwo', 'ministerstwo'),
  ('gmina', 'gmina'),
  ('powiat', 'powiat'),
  ('rady', 'rada'),
  ('sejm', 'sejm'),
  ('senat', 'senat'),
  ('województwo', 'województwo'),
  ('pan', 'pan'),
  ('rp', 'roktóry'),
  ('komisji', 'komisja'),
  ('samorząd', 'samorząd')],
 'NGO / Association / Foundation': [('fundacja', 'fundacja'),
  ('stowarzyszenie', 'stowarzyszenie'),
  ('towarzystwo', 'towarzystwo'),
  ('zrzeszenie', 'zrzeszenie'),
  ('federacja', 'federacja'),
  ('koalicja', 'koalicja')],
 'Media / Publishing': [('radio', 'radio'),
  ('tv', '

In [14]:
# Cleaned keyword lists by category

lemmatized_keywords_by_category_clean = {
    "Company / Business": [
        "sp", "sa", "s.a.", "holding", "firma", "grupa", "przedsiębiorstwo",
        "technologia", "logistyka", "consulting", "solutions", "commerce", "industry"
    ],
    "Government / Public Administration": [
        "urząd", "ministerstwo", "gmina", "powiat", "rada", "sejm", "senat",
        "województwo", "rp", "komisja", "samorząd"
    ],
    
    "NGO / Association / Foundation": [
        "fundacja", "stowarzyszenie", "towarzystwo", "zrzeszenie", "federacja", "koalicja"
    ],
   
    "Media / Publishing": [
        "radio", "tv", "gazeta", "media", "wydawnictwo", "czasopismo", "prasa"
    ],
    "Cultural Institution / Arts": [
        "muzeum", "teatr", "galeria", "festiwal", "filharmonia", "dom", "kultura"
    ],
    "Health / Hospitals / Medical": [
        "zdrowie", "klinika", "szpital", "lekarz", "medyczny", "przychodnia", "sanatorium"
    ],
    "Religious Organization": [
        "kościół", "parafia", "diecezja", "episkopat", "zakon", "misja", "cerkiew"
    ],
    "Military / Defense / Security": [
        "wojsko", "żandarmeria", "bezpieczeństwo", "policja", "straż", "obrona", "militaria"
    ],
    "International Organization / EU": [
        "europejski", "unia", "ue", "nato", "unesco", "oecd", "who"
    ],
    "Education (non-university)": [
        "liceum", "technikum", "szkoła", "podstawowy", "przedszkole", "edukacja"
    ],
    "Other / Unclear": []  # fallback category
}
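
Because the classification is rule-based, a keyword that appears in more than one category makes the result depend on the order in which categories are checked. The sketch below shows one way to detect such overlaps. The function name `find_overlapping_keywords` and the toy dictionary are hypothetical; in practice the function would be applied to `lemmatized_keywords_by_category_clean`.

```python
from collections import defaultdict


def find_overlapping_keywords(keywords_by_category):
    """Map each keyword listed in more than one category to those categories."""
    seen = defaultdict(list)
    for category, keywords in keywords_by_category.items():
        # set() ensures a keyword repeated within one list is counted once.
        for keyword in set(keywords):
            seen[keyword].append(category)
    return {kw: cats for kw, cats in seen.items() if len(cats) > 1}


# Toy lists for illustration only; "centrum" is deliberately duplicated.
toy_keywords = {
    "Health / Hospitals / Medical": ["centrum", "szpital"],
    "Education (non-university)": ["centrum", "szkoła"],
    "Media / Publishing": ["radio"],
}

overlaps = find_overlapping_keywords(toy_keywords)
print(overlaps)
```

An empty result means every keyword belongs to exactly one category, so the order in which categories are checked cannot change the outcome of a single-keyword match.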
