# Academia–Practice Interaction Mapping Using NLP

**Notebook 05: Entity Classification**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** June 2025  

---

### Notebook Overview

**Goal:** Develop a typology of non-academic organizations based on their names, using an inductive, grounded approach.

This notebook:
- Loads the cleaned list of non-academic organization names
- Randomly samples 1,000 entries for manual review and typology creation
- Prepares a CSV file for human annotation

---



## STEP 3: Categorize data

In [2]:
import pandas as pd
import random

# Load non-academic orgs
df_non_academic = pd.read_csv("../output/non_academic_org_entities.csv")
non_academic_entities = df_non_academic["ORG_Entity"].dropna().tolist()
len(non_academic_entities)

4767

In [2]:
# Establish categories of non-academic entities

"""
To classify non-academic entities identified in the impact case studies, I developed a typology of organization 
types based on a grounded, inductive review process. Specifically, I randomly selected a sample of approximately 1,000 
unique non-academic organization names from the full list of 4,735 deduplicated entities. This sample served as an 
exploratory base for developing an inductive typology. By manually reviewing the selected entries, I identified recurring 
organizational patterns and formulated a set of categories that reflect the functional diversity of non-academic 
stakeholders mentioned in the case studies.
"""





In [3]:
# Create a random sample of 1000 entities

# Set seed for reproducibility
random.seed(42)

# Randomly sample 1,000 entries from the non-academic entities list
sample_size = 1000
non_academic_sampled = random.sample(non_academic_entities, sample_size)

# Convert to DataFrame for review
non_academic_sampled_df = pd.DataFrame(non_academic_sampled, columns=["organization_name"])
non_academic_sampled_df.to_csv("../output/non_academic_sampled.csv", index=False, encoding="utf-8-sig")
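
The description above refers to unique organization names, while `random.sample` draws from the list exactly as loaded. Below is a minimal sketch, assuming the list may still contain repeated names, of deduplicating before sampling. `sample_unique` is a hypothetical helper and the toy names are illustrative only.

```python
import random

def sample_unique(names, k, seed=42):
    """Draw up to k distinct names reproducibly (hypothetical helper)."""
    # dict.fromkeys removes duplicates while keeping first-seen order,
    # so the result does not depend on set iteration order.
    unique_names = list(dict.fromkeys(names))
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(unique_names, min(k, len(unique_names)))

toy_names = ["NFZ", "UNESCO", "NFZ", "Polskie Radio", "Senat RP", "UNESCO"]
picked = sample_unique(toy_names, k=3)
print(picked)
```

Using a local `random.Random(seed)` keeps the draw reproducible even if other cells call `random` in between.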


In [4]:
# Establish categories of non-academic entities

"""
1. Company / Business
Commercial enterprises, corporations, startups, and private firms (e.g., Kaufland, KGHM ZANAM, Voicelab, Photon).

2. Government / Public Administration
Includes ministries, central/local government agencies, parliament, and other state entities (e.g., Senat RP, Urząd Miasta, Ministerstwo Rozwoju).

3. NGO / Association / Foundation
Non-profit organizations, foundations, professional associations, and social initiatives (e.g., Fundacja La Strada, Polskie Towarzystwo Psychologiczne, Stowarzyszenie Wioska Gotów).

4. Media / Publishing
News outlets, broadcasters, publishers, and cultural magazines (e.g., Polskie Radio, TVP Info, Deutsche Welle, Gazeta Lubuska).

5. Cultural Institution / Arts
Museums, theatres, orchestras, festivals, galleries (e.g., Teatr Wielki, Muzeum Historii Polski, Galeria Arsenał).

6. Health / Hospitals / Medical
Clinics, hospitals, medical institutes, and health-related organizations (e.g., Centrum Zdrowia Szansa, NFZ, American Heart Association).

7. Religious Organization
Churches, dioceses, religious associations, and theological institutions (e.g., Kościół Katolicki, Episkopat Polski, Cerkiew).

8. Military / Defense / Security
Armed forces, police, defense industry, or military R&D (e.g., Wojsko Polskie, Żandarmeria Wojskowa, Lockheed Martin).

9. International Organization / EU
UN, EU, NATO, OECD, international consortia or partnerships (e.g., European Commission, UNESCO, OECD).

10. Education (non-university)
Includes schools, kindergartens, vocational schools, continuing education centers (e.g., Szkoła Podstawowa, Centrum Kształcenia Ustawicznego).

11. Other / Unclear
Anything that doesn’t clearly fall into the above categories or needs human validation.

"""


## **Rule-based categorization pipeline**

**Step 1:** Build keyword lists per category

**Step 2:** Lemmatize keyword lists

**Step 3:** Lemmatize entities

**Step 4:** Match: each lemmatized entity vs. lemmatized keyword lists

### **Step 1:** Build keyword lists per category

In [5]:
# Keywords list by category 

keywords_by_category = {
    "Company / Business": [
        "sp", "sa", "s.a.", "z o.o.", "o.o", "o.o.", "o.", "holding", "firma", "grupa", "przedsiębiorstwo",
        "technologie", "logistyka", "consulting", "solutions", "commerce", "industry", "zakład", "bank", "business",
        "ltd", "inc"
    ],
    "Government / Public Administration": [
        "urząd", "ministerstwo", "gmina", "powiat", "rady", "sejm", "senat",
        "województwo", "rp", "komisji", "samorząd", "pib", "komitet", "national", "służba"
    ],
    "NGO / Association / Foundation": [
        "fundacja", "stowarzyszenie", "towarzystwo", "zrzeszenie", "federacja", "koalicja", "fundacją"
    ],
    "Media / Publishing": [
        "radio", "tv", "gazeta", "media", "wydawnictwo", "czasopismo", "prasa"
    ],
    "Cultural Institution / Arts": [
        "muzeum", "teatr", "galeria", "festiwal", "filharmonia", "dom", "kultury", "im"
    ],
    "Health / Hospitals / Medical": [
        "zdrowia", "klinika", "szpital", "lekarz", "medyczne", "przychodnia", "sanatorium"
    ],
    "Religious Organization": [
        "kościół", "parafia", "diecezja", "episkopat", "zakon", "misja", "cerkiew"
    ],
    "Military / Defense / Security": [
        "wojsko", "żandarmeria", "bezpieczeństwo", "policja", "straż", "obrona", "militaria"
    ],
    "International Organization / EU": [
        "europejski", "europejskiej", "unia", "unii", "ue", "nato", "unesco", "oecd", "who", "europe"
    ],
    "Education (non-university)": [
        "centrum", "liceum", "technikum", "szkoła", "podstawowa", "przedszkole", "edukacja"
    ],
    "Other / Unclear": []  # fallback category
}



### **Step 2:** Lemmatize Keyword Lists

In [6]:
import stanza

# 1. Download and initialize Stanza for Polish
stanza.download("pl")
nlp = stanza.Pipeline(lang="pl", processors="tokenize,mwt,lemma")



Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-06-06 10:09:46 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-06-06 10:09:46 INFO: Downloading default packages for language: pl (Polish) ...
2025-06-06 10:09:47 INFO: File exists: C:\Users\lewandowska\stanza_resources\pl\default.zip
2025-06-06 10:09:49 INFO: Finished downloading models and saved to C:\Users\lewandowska\stanza_resources
2025-06-06 10:09:49 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-06-06 10:09:49 INFO: Downloaded file to C:\Users\lewandowska\stanza_resources\resources.json
2025-06-06 10:09:49 INFO: Loading these models for language: pl (Polish):
| Processor | Package      |
----------------------------
| tokenize  | pdb          |
| mwt       | pdb          |
| lemma     | pdb_nocharlm |

2025-06-06 10:09:49 INFO: Using device: cpu
2025-06-06 10:09:49 INFO: Loading: tokenize
2025-06-06 10:09:51 INFO: Loading: mwt
2025-06-06 10:09:51 INFO: Loading: lemma
2025-06-06 10:09:52 INFO: Done loading processors!


In [7]:
# 2. Function to lemmatize a list of keywords
def lemmatize_keywords(keyword_list):
    lemmatized = []
    for kw in keyword_list:
        doc = nlp(kw.lower())
        for sent in doc.sentences:
            for word in sent.words:
                lemmatized.append(word.lemma)
    return list(set(lemmatized))  # remove duplicates

In [8]:
# 3. Lemmatize each category's keywords
lemmatized_keywords_by_category = {
    category: lemmatize_keywords(words)
    for category, words in keywords_by_category.items()
}

In [9]:
# 4. Print to review
for cat, lemmas in lemmatized_keywords_by_category.items():
    print(f"\n{cat}:\n", lemmas)


Company / Business:
 ['consulting', 'lt_yp_dwownokinoć', 'technologia', 'industry', 's.a', 'bank', 'business', 'ograniczona_odpowiedzialność', 'commerke', 'sa', 'przedsiębiorstwo', 'grupa', 'o.', 'z', 'solutions', 'logistyka', 'holding', '.', 'inc', 'zakład', 'firma', 'o']

Government / Public Administration:
 ['powiat', 'ministerstwo', 'samorząd', 'gmina', 'urząd', 'rada', 'służba', 'komisja', 'komitet', 'roktóry', 'województwo', 'pib', 'sejm', 'senat', 'national']

NGO / Association / Foundation:
 ['zrzeszenie', 'stowarzyszenie', 'koalicja', 'towarzystwo', 'federacja', 'fundacja', 'fundacją']

Media / Publishing:
 ['media', 'gazeta', 'wydawnictwo', 'radio', 'czasopismo', 'prasa', 'te_jeta_bowy']

Cultural Institution / Arts:
 ['teatr', 'muzeum', 'kultura', 'on', 'galeria', 'festiwal', 'filharmonia', 'dom']

Health / Hospitals / Medical:
 ['klinika', 'przychodnia', 'sanatorium', 'medyczny', 'zdrowie', 'szpital', 'lekarz']

Religious Organization:
 ['parafia', 'diecezja', 'zakon', 'ko

### **Step 2b (Manual Cleanup):** Clean Incorrect Lemmatizations

During lemmatization of the category-specific keyword lists (Step 2), the lemmatizer produced several inaccurate or malformed lemmas, mostly for abbreviations and non-Polish words. These included:

- **Punctuation-based artifacts**  
  e.g., `"S.A."` → `"."`

- **Hallucinated or unrelated lemmas**  
  e.g., `"RP"` → `"roktóry"`, `"TV"` → `"te_jeta_bowy"`

- **Typographical errors introduced by lemmatization**  
  e.g., `"commerce"` → `"commerke"`

To ensure the integrity of the classification logic, all keyword–lemma pairs were manually reviewed. Problematic cases were either:

- **Corrected** by reintroducing the original keyword form
- **Excluded** entirely if deemed irrelevant or too noisy

A cleaned, curated set of lemmatized keywords was then produced for each category. These lists serve as high-confidence matching vocabularies used in downstream classification.

> All corrections were documented to maintain full transparency and reproducibility of the pipeline.
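
The review procedure described above can be sketched as a small override table applied to the `(original, lemma)` pairs. This is only an illustration: the pairs are copied from the review output in this notebook, while the `overrides` decisions and the `curate` helper are hypothetical and do not reproduce the actual curated lists.

```python
# (original keyword, lemma) pairs copied from the review output in this notebook
pairs = [
    ("s.a.", "s.a"), ("s.a.", "."),
    ("commerce", "commerke"),
    ("ltd", "lt_yp_dwownokinoć"),
    ("technologie", "technologia"),
    ("o.", "o"), ("o.", "."),
]

# Hypothetical review decisions: a string replaces all lemmas of that keyword,
# None excludes the keyword entirely.
overrides = {
    "s.a.": "s.a.",
    "commerce": "commerce",
    "ltd": "ltd",
    "o.": None,
}

def curate(pairs, overrides):
    """Return a sorted list of cleaned keywords after applying manual overrides."""
    cleaned = set()
    for original, lemma in pairs:
        if original in overrides:
            replacement = overrides[original]
            if replacement is not None:
                cleaned.add(replacement)
        else:
            cleaned.add(lemma)  # lemma accepted as produced
    return sorted(cleaned)

print(curate(pairs, overrides))
```

Keeping the decisions in a dictionary, rather than editing the lists by hand, leaves a record of every correction and exclusion.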


In [10]:
# Function to lemmatize each keyword while preserving the original form.
# Returns a list of (original, lemma) pairs for transparency and manual review.
# This allows tracking any unexpected or incorrect lemmatization results.

def lemmatize_keywords_with_originals(keyword_list):
    pairs = []
    for kw in keyword_list:
        doc = nlp(kw.lower())
        for sent in doc.sentences:
            for word in sent.words:
                pairs.append((kw, word.lemma))
    return pairs

In [11]:
# Apply lemmatization to each keyword in every category

lemmatized_pairs_by_category = {
    category: lemmatize_keywords_with_originals(keyword_list)
    for category, keyword_list in keywords_by_category.items()
}

lemmatized_pairs_by_category

{'Company / Business': [('sp', 'sa'),
  ('sa', 'sa'),
  ('s.a.', 's.a'),
  ('s.a.', '.'),
  ('z o.o.', 'z'),
  ('z o.o.', 'ograniczona_odpowiedzialność'),
  ('z o.o.', '.'),
  ('o.o', 'ograniczona_odpowiedzialność'),
  ('o.o.', 'o.'),
  ('o.o.', 'o'),
  ('o.o.', '.'),
  ('o.', 'o'),
  ('o.', '.'),
  ('holding', 'holding'),
  ('firma', 'firma'),
  ('grupa', 'grupa'),
  ('przedsiębiorstwo', 'przedsiębiorstwo'),
  ('technologie', 'technologia'),
  ('logistyka', 'logistyka'),
  ('consulting', 'consulting'),
  ('solutions', 'solutions'),
  ('commerce', 'commerke'),
  ('industry', 'industry'),
  ('zakład', 'zakład'),
  ('bank', 'bank'),
  ('business', 'business'),
  ('ltd', 'lt_yp_dwownokinoć'),
  ('inc', 'inc')],
 'Government / Public Administration': [('urząd', 'urząd'),
  ('ministerstwo', 'ministerstwo'),
  ('gmina', 'gmina'),
  ('powiat', 'powiat'),
  ('rady', 'rada'),
  ('sejm', 'sejm'),
  ('senat', 'senat'),
  ('województwo', 'województwo'),
  ('rp', 'roktóry'),
  ('komisji', 'komisja'

In [12]:
# Cleaned keywords list by category 

lemmatized_keywords_by_category_clean = {
    "Government / Public Administration": [
        "urząd", "ministerstwo", "gmina", "powiat", "rada", "sejm", "senat",
        "województwo", "rp", "komisja", "samorząd", "narodowy", "krajowy", "izba",
        "państwowy", "miasto", "regionalny", "województwo", "wojewódzki", "marszałkowski",
        "sąd", "inspektorat", "parlament", "rzeczpospolita"
    ],
    "NGO / Association / Foundation": [
        "fundacja", "stowarzyszenie", "towarzystwo", "zrzeszenie", "federacja", "koalicja", "association"
    ],
    "Media / Publishing": [
        "radio", "tv", "gazeta", "media", "wydawnictwo", "czasopismo", "prasa", "tvp", "fm"
    ],
    "Cultural Institution / Arts": [
        "muzeum", "teatr", "galeria", "festiwal", "filharmonia", "dom", "kultura", "sztuka"
    ],
    "Health / Hospitals / Medical": [
        "zdrowia", "zdrowie", "klinika", "szpital", "lekarz", "medyczny", "przychodnia", "sanatorium"
    ],
    "Religious Organization": [
        "kościół", "parafia", "diecezja", "episkopat", "zakon", "misja", "cerkiew"
    ],
    "Military / Defense / Security": [
        "wojsko", "żandarmeria", "bezpieczeństwo", "policja", "straż", "obrona", "militaria"
    ],
    "International Organization / EU": [
        "europejski", "unia", "ue", "nato", "unesco", "oecd", "who", "międzynarodowy", "international", "european", "DG"
    ],
    "Company / Business": [
        "s.a.", "s.a", "z o.o.", "holding", "firma", "grupa", "przedsiębiorstwo",
        "technologia", "logistyka", "consulting", "solutions", "commerce", "industry", "group", "spółka", "Ltd"
    ],
    "Education (non-university)": [
        "liceum", "technikum", "szkoła", "podstawowy", "przedszkole", "edukacja", "biblioteka"
    ],
    "Other / Unclear": []  # fallback category
}
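
A quick audit of a hand-curated dictionary like this one can catch small slips before matching. The sketch below uses the hypothetical helper `audit_keywords` to report keywords repeated within a category, keywords shared between categories, and keywords containing uppercase letters. The last check matters because the matcher in Step 4 lowercases the entity string before comparing. The `excerpt` is a shortened copy of the dictionary above, used only to keep the example small.

```python
from collections import Counter, defaultdict

def audit_keywords(keyword_dict):
    """Report within-category duplicates, cross-category overlaps,
    and keywords that contain uppercase letters."""
    within = {}
    owners = defaultdict(set)
    mixed_case = []
    for category, keywords in keyword_dict.items():
        counts = Counter(keywords)
        repeated = sorted(k for k, n in counts.items() if n > 1)
        if repeated:
            within[category] = repeated
        for kw in counts:
            owners[kw.lower()].add(category)
            if kw != kw.lower():
                mixed_case.append(kw)
    shared = {kw: sorted(cats) for kw, cats in owners.items() if len(cats) > 1}
    return within, shared, sorted(mixed_case)

# Shortened excerpt in the same shape as the cleaned dictionary
excerpt = {
    "Government / Public Administration": ["urząd", "województwo", "województwo", "rp"],
    "International Organization / EU": ["unia", "ue", "DG"],
    "Company / Business": ["s.a.", "grupa", "Ltd"],
}

within, shared, mixed_case = audit_keywords(excerpt)
print("Repeated within a category:", within)
print("Shared across categories:", shared)
print("Contain uppercase:", mixed_case)
```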


### **Step 3:** Lemmatize entities

In [13]:
df_non_academic.head()

Unnamed: 0,ORG_Entity
0,POPW
1,DG REGIO
2,RILEM
3,EFOE
4,Aiut


In [14]:
# Function to lemmatize a string

def lemmatize_text(text):
    doc = nlp(text.lower())
    return " ".join([word.lemma for sent in doc.sentences for word in sent.words])

In [15]:
# Apply to your entity column

df_non_academic["Lemma_Entity"] = df_non_academic["ORG_Entity"].dropna().apply(lemmatize_text)


In [16]:
# Preview result

df_non_academic.sample(50)[["ORG_Entity", "Lemma_Entity"]]

Unnamed: 0,ORG_Entity,Lemma_Entity
2970,Novasome,novasome
2164,Fundacją EFQM,fundacją efqm
975,High Level Experts Group,high level experts group
345,NFOŚ,nfoś
2416,IGC,incym_podoba
638,MNK,mok
625,Type Directors Club,ty directors club
807,In Your Pocket City Guides,inny your pocket cit guides
600,Rady Programowej FD,rada programowy stopiec
1056,Kole Dydaktyków MaFiI,koło dydaktyk mafia


### **Step 3b:** Clean incorrect lemmatizations

After lemmatizing organizational entity names, we observed that some outputs contained corrupted or malformed tokens. These "gremlins" often stemmed from:

- Incomplete or erroneous morphological parsing (e.g., `s.a .` instead of `s.a.`)
- Misinterpretation of abbreviations and legal suffixes (e.g., `ograniczona_odpowiedzialność` instead of `z o.o.`)
- Spurious lemmas generated for acronyms (e.g., `roktóry` from `RP`)

To ensure consistency and enable reliable downstream classification, we manually curated a dictionary of component-level fixes that map these corrupted forms back to their intended representation.

We applied these fixes **only** to tokens that:
- Appeared frequently (2+ times) across the dataset, **and**
- Represented standard or recognizable structures (e.g., `s.a.`, `TVP`, `z o.o.`)

Example corrections:
- `"s.a ."` → `"s.a."`
- `"ograniczona_odpowiedzialność"` → `"z o.o."`
- `"to_sysapipiporoby"` → `"TVP"`
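
The frequency criterion above can be sketched as follows. As a simplifying assumption, only whitespace-separated tokens that contain an underscore are treated as candidates; `frequent_suspicious_tokens` is a hypothetical helper, and the input strings are lemmatized names taken from outputs shown in this notebook.

```python
from collections import Counter

def frequent_suspicious_tokens(lemma_strings, min_count=2):
    """Count tokens containing '_' and keep those seen at least min_count times."""
    counts = Counter(
        token
        for lemma in lemma_strings
        for token in lemma.split()
        if "_" in token
    )
    return {token: n for token, n in counts.items() if n >= min_count}

# Lemmatized strings taken from outputs shown in this notebook
lemmas = [
    "iase sa . z ograniczona_odpowiedzialność .",
    "ca poland sa . z ograniczona_odpowiedzialność .",
    "komisja do_spraw . przeciwdziałać mobbing",
    "komisja do_spraw . zamówienie publiczny śzgip",
    "do_tar regio",
    "i__ymed uz",
]

print(frequent_suspicious_tokens(lemmas))
```

Tokens that occur only once in this small sample, such as `do_tar`, fall below the threshold and would be left for case-by-case review.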



In [17]:
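# Auto-flag "suspicious" outputs

def is_suspicious_lemma(lemma_str):
    # Flag lemmas that contain an underscore (typical of expanded or garbled abbreviations),
    # or short strings (fewer than 10 characters) that contain punctuation.
    has_underscore = "_" in lemma_str
    has_punctuation = any(char in lemma_str for char in [".", ",", "-", "/"])
    return has_underscore or (has_punctuation and len(lemma_str) < 10)

df_non_academic["Suspicious"] = df_non_academic["Lemma_Entity"].apply(is_suspicious_lemma)

# Review flagged cases

pd.set_option("display.max_rows", 100)  # Set max limit of displayed rows

df_non_academic[df_non_academic["Suspicious"]].head(20)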

Unnamed: 0,ORG_Entity,Lemma_Entity,Suspicious
1,DG REGIO,do_tar regio,True
31,Komisji ds. Przeciwdziałania Mobbingowi,komisja do_spraw . przeciwdziałać mobbing,True
44,IASE Sp. z o.o.,iase sa . z ograniczona_odpowiedzialność .,True
63,IFP UZ,i__ymed uz,True
67,Dr Green Sp. z o.o.,doktor green sa . z ograniczona_odpowiedzialno...,True
70,Komisji ds. Zamówień Publicznych ŚZGiP,komisja do_spraw . zamówienie publiczny śzgip,True
72,E-RIHS,e - rihs,True
80,CPP Poland sp. z o.o.,ca poland sa . z ograniczona_odpowiedzialność .,True
121,Krajowe Biuro ds. Przeciwdziałania Narkomanii,krajowy biuro do_spraw . przeciwdziałać narkom...,True
141,BPN,bo_,True


In [18]:
from collections import Counter
import re

# Tokenize all Lemma_Entity values and flatten them

all_lemma_tokens = []
for lemma in df_non_academic["Lemma_Entity"].dropna():
    tokens = re.findall(r"\b\w[\w\.-]*\b", lemma.lower())  # capture words, acronyms, dots, dashes
    all_lemma_tokens.extend(tokens)

# Count token frequencies
token_freq = Counter(all_lemma_tokens)

# Convert to DataFrame for viewing
df_common_tokens = pd.DataFrame(token_freq.items(), columns=["Token", "Frequency"]).sort_values(by="Frequency", ascending=False)
df_common_tokens.head(20)

Unnamed: 0,Token,Frequency
22,polski,246
215,i,163
82,sa,139
83,z,111
42,ministerstwo,105
84,ograniczona_odpowiedzialność,104
69,rada,99
73,centrum,99
158,europejski,90
11,s.a,86


In [19]:
# Build a cleaning rule library

lemma_fixes = {
    "do_tar": "DG",
    "sa .": "s.a.",
    "s.a .": "s.a.",
    "o.o": "o.o.",
    "z ograniczona_odpowiedzialność .": "z o.o.",
    "z ograniczona_odpowiedzialność": "z o.o.",
    "do_spraw .": "ds.",
    "te_jeta_bowy": "TV",
    "roktóry": "rp",
    "to_sysapipiporoby": "TVP",
    "lt_yp_dwownokinoć": "Ltd",
    "fektometr_pici": "FM",
}

    

In [20]:
# Apply the correction

def apply_fixes(text):
    for wrong, correct in lemma_fixes.items():
        text = text.replace(wrong, correct)
    return text

df_non_academic["Lemma_Cleaned"] = df_non_academic["Lemma_Entity"].apply(apply_fixes)
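
`apply_fixes` depends on the insertion order of `lemma_fixes`: the longer pattern `z ograniczona_odpowiedzialność .` is listed before its shorter prefix, so it is applied first. The sketch below makes that ordering explicit by sorting patterns by length. `apply_fixes_longest_first` is a hypothetical variant, `fixes_subset` is a deliberately reordered subset of the rule library, and the second input string is a toy example.

```python
def apply_fixes_longest_first(text, fixes):
    """Apply plain string replacements, longest pattern first."""
    for wrong in sorted(fixes, key=len, reverse=True):
        text = text.replace(wrong, fixes[wrong])
    return text

# Subset of the rule library, with the shorter pattern deliberately listed first
fixes_subset = {
    "z ograniczona_odpowiedzialność": "z o.o.",
    "z ograniczona_odpowiedzialność .": "z o.o.",
    "s.a .": "s.a.",
}

print(apply_fixes_longest_first("warbud s.a .", fixes_subset))
print(apply_fixes_longest_first("firma z ograniczona_odpowiedzialność .", fixes_subset))
```

With insertion-order replacement and this reordered subset, the second string would keep a stray ` .` after `z o.o.`, because the shorter pattern would consume the text before the longer one is tried.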

### **Step 4:** Match: lemmatized keyword lists vs. each lemmatized entity 

In this step, we compare each lemmatized entity against curated, lemmatized keyword lists representing 11 target categories (e.g., Company, Government, NGO, Media).

Each category includes a list of manually validated sector-related keywords (e.g., `"firma"`, `"fundacja"`, `"urząd"`, `"radio"`), all pre-lemmatized for consistency.

We match each entity by checking if it **contains any keyword from a given category**. The entity string is lowercased and each keyword is tested as a substring, so a keyword can match anywhere in a multi-word name.

Assignment rules:
- If a keyword matches, the entity is **assigned to that category**
- If several categories match, the first one in dictionary order is retained
- If no keyword matches, the entity is assigned to **"Other / Unclear"**

This rule-based matching acts as a **semi-automated pre-classification**, reducing the burden of full manual labeling while leveraging structured domain knowledge.

In [21]:
# Define a function to match a lemmatized entity against category keyword lists

def match_entity_to_category(entity, keyword_dict):
    """
    Matches a lemmatized entity string to one of the predefined categories
    based on the presence of any lemmatized keywords.

    Parameters:
        entity (str): The lemmatized name of an organization.
        keyword_dict (dict): Dictionary mapping category names to keyword lists.

    Returns:
        str: The matched category name, or "Other / Unclear" if no match found.
    """
    entity_lower = entity.lower()
    for category, keywords in keyword_dict.items():
        if any(keyword in entity_lower for keyword in keywords):
            return category
    return "Other / Unclear"
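
Two properties of the substring check are worth noting. The entity is lowercased but the keywords are not, so mixed-case keywords such as `"DG"` or `"Ltd"` cannot match. Short keywords such as `"who"` can also match inside longer words. The sketch below shows one possible alternative, not the method used in this notebook: it lowercases both sides and matches only on word boundaries. `match_entity_by_token` and `toy_keywords` are hypothetical, and `"whole foods"` is a toy input.

```python
import re

def match_entity_by_token(entity, keyword_dict, default="Other / Unclear"):
    """Return the first category whose keyword matches on word boundaries."""
    entity_lower = entity.lower()
    for category, keywords in keyword_dict.items():
        for keyword in keywords:
            # (?<!\w) and (?!\w) stop the keyword from matching inside a longer word
            pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
            if re.search(pattern, entity_lower):
                return category
    return default

toy_keywords = {
    "International Organization / EU": ["who", "DG"],
    "Company / Business": ["s.a.", "Ltd"],
}

print(match_entity_by_token("DG regio", toy_keywords))
print(match_entity_by_token("warbud s.a.", toy_keywords))
print(match_entity_by_token("whole foods", toy_keywords))
```

Switching to this variant would change the category counts reported below, so it is best treated as a follow-up experiment rather than a drop-in replacement.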

In [22]:
# Apply the match_entity_to_category function

df_non_academic["Matched_Category"] = df_non_academic["Lemma_Cleaned"].apply(lambda x: match_entity_to_category(x, lemmatized_keywords_by_category_clean))

In [23]:
df_non_academic.head(20)

Unnamed: 0,ORG_Entity,Lemma_Entity,Suspicious,Lemma_Cleaned,Matched_Category
0,POPW,popwo,False,popwo,Other / Unclear
1,DG REGIO,do_tar regio,True,DG regio,Other / Unclear
2,RILEM,rilem,False,rilem,Other / Unclear
3,EFOE,efoe,False,efoe,Other / Unclear
4,Aiut,aiut,False,aiut,Other / Unclear
5,Urzędem Miasta Sulejówek,urząd miasto sulejówek,False,urząd miasto sulejówek,Government / Public Administration
6,Grupa INCO S.A.,grupa inco s.a .,False,grupa inco s.a.,Company / Business
7,WARBUD S.A.,warbud s.a .,False,warbud s.a.,Company / Business
8,ZIGUL S.A.,ziguc s.a .,False,ziguc s.a.,Company / Business
9,Idealia,idealia,False,idealia,Other / Unclear


In [24]:
# Check frequencies of categories

category_freq = Counter(df_non_academic["Matched_Category"])
category_freq

Counter({'Other / Unclear': 3101,
         'Government / Public Administration': 683,
         'Company / Business': 274,
         'NGO / Association / Foundation': 189,
         'International Organization / EU': 125,
         'Media / Publishing': 114,
         'Cultural Institution / Arts': 107,
         'Education (non-university)': 63,
         'Military / Defense / Security': 54,
         'Religious Organization': 29,
         'Health / Hospitals / Medical': 28})
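
To put these counts in proportion, the short sketch below converts them to percentage shares. The counts are copied from the output above; the variable names `shares` and `matched_share` are introduced here for illustration.

```python
from collections import Counter

# Counts copied from the output above
category_freq = Counter({
    "Other / Unclear": 3101,
    "Government / Public Administration": 683,
    "Company / Business": 274,
    "NGO / Association / Foundation": 189,
    "International Organization / EU": 125,
    "Media / Publishing": 114,
    "Cultural Institution / Arts": 107,
    "Education (non-university)": 63,
    "Military / Defense / Security": 54,
    "Religious Organization": 29,
    "Health / Hospitals / Medical": 28,
})

total = sum(category_freq.values())
shares = {cat: round(100 * n / total, 1) for cat, n in category_freq.most_common()}
matched_share = round(100 - shares["Other / Unclear"], 1)

print(f"Total entities: {total}")
for cat, pct in shares.items():
    print(f"{pct:5.1f}%  {cat}")
print(f"Matched by rules: {matched_share}%")
```

About 65% of entities remain in "Other / Unclear" after rule-based matching, which is why the next cells inspect that group and export it for manual annotation.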

In [25]:
# Print "unclear" entities for review

df_unclear = df_non_academic[df_non_academic["Matched_Category"] == "Other / Unclear"]
unclear_list = df_unclear["Lemma_Cleaned"].tolist()
unclear_list

['popwo',
 'DG regio',
 'rilem',
 'efoe',
 'aiut',
 'idealia',
 'tmt’s',
 'advisory board',
 'sodpypcpypi',
 'bloomberga',
 'iwnirz',
 'igig upwr',
 'sbwpwp',
 'zero -ektag studio',
 'klastra obróbka metal',
 'wim',
 'inspekcja handlowy',
 'dctda',
 'zarząd morski port szczecin',
 'lietącpykła',
 'ihar - pib',
 'eurostat',
 'migrant info point',
 'kfiim',
 'eie',
 'ministry of development',
 'bohamet',
 'centrum kształcenie ustawiczny',
 'pa elektrownia tur',
 'gus',
 'garment',
 'związek sybirak',
 'jyu',
 'pwrul',
 'łom',
 'zespół ekspercki',
 'sysaba',
 'generalny dyrekcja służba więzienna',
 'res',
 'galois',
 'nissan',
 'cit abw',
 'optpinka',
 'i__ymed uz',
 'qno',
 'not',
 'google braini',
 'inity',
 'e - rihs',
 'ksieo',
 'ordo carmelitarum discalceatorum',
 'kghm',
 'polski związek łowiecki',
 'cep',
 'fundacją azja - pacyfik',
 'centrum rzeźba polski',
 'yhe sa',
 'bonimed',
 'umcs lublin',
 'scopus',
 'etorki scl .',
 'cnoty',
 'podprat',
 'comarch',
 'uniwersytet trzeci wie

In [26]:
# Inspect most frequent terms from “Other / Unclear” entities.

all_unclear_tokens = []
for entity in unclear_list:
    tokens = re.findall(r"\b\w[\w\.-]*\b", entity.lower())
    all_unclear_tokens.extend(tokens)

Counter(all_unclear_tokens).most_common(100)

[('polski', 87),
 ('centrum', 56),
 ('związek', 32),
 ('sa', 32),
 ('i', 32),
 ('zakład', 31),
 ('of', 29),
 ('agencja', 24),
 ('pib', 22),
 ('zespół', 22),
 ('komitet', 21),
 ('bank', 20),
 ('biuro', 16),
 ('rozwój', 15),
 ('ochrona', 15),
 ('instytut', 15),
 ('ośrodek', 14),
 ('national', 14),
 ('fundacją', 13),
 ('uniwersytet', 13),
 ('the', 13),
 ('organizacja', 13),
 ('społeczny', 11),
 ('technologies', 10),
 ('a', 10),
 ('środowisko', 10),
 ('rok', 10),
 ('departament', 10),
 ('pa', 9),
 ('dyrekcja', 9),
 ('służba', 9),
 ('dane', 9),
 ('uw', 9),
 ('europe', 9),
 ('ds', 9),
 ('and', 9),
 ('on', 9),
 ('inspekcja', 8),
 ('elektrownia', 8),
 ('fundusz', 8),
 ('forówna', 8),
 ('siła', 8),
 ('prawo', 8),
 ('uwr', 8),
 ('zarząd', 7),
 ('um', 7),
 ('sieć', 7),
 ('society', 7),
 ('global', 7),
 ('lek', 7),
 ('science', 7),
 ('is', 7),
 ('dok', 7),
 ('generalna', 7),
 ('uam', 7),
 ('studio', 6),
 ('generalny', 6),
 ('uz', 6),
 ('business', 6),
 ('gdański', 6),
 ('jewisho', 6),
 ('klub', 6)

In [27]:
# Extract entities for manual annotation and prepare the CSV

# Filter only "Other / Unclear" rows
entities_unclear = df_non_academic[df_non_academic["Matched_Category"] == "Other / Unclear"].copy()

# Add empty column for manual annotation
entities_unclear["Annotated_Category"] = ""

# Add column with entity id
entities_unclear["Entity_ID"] = entities_unclear.index

# Keep only relevant columns for annotation
columns_to_export = ["Entity_ID", "ORG_Entity", "Matched_Category", "Annotated_Category"]

# Export to CSV
entities_unclear[columns_to_export].to_csv("../output/unclear_entities_for_annotation.csv", index=False)
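
Once the exported file has been annotated, the manual labels need to be merged back. Because `Entity_ID` stores the original row index, one way to do this is sketched below with small in-memory stand-ins instead of the real files. The `Final_Category` column and the annotation values are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_non_academic (names and categories from the preview above)
df = pd.DataFrame(
    {
        "ORG_Entity": ["POPW", "DG REGIO", "WARBUD S.A."],
        "Matched_Category": ["Other / Unclear", "Other / Unclear", "Company / Business"],
    }
)

# Toy stand-in for the annotated CSV; the labels are illustrative only
annotated = pd.DataFrame(
    {
        "Entity_ID": [0, 1],
        "Annotated_Category": ["", "International Organization / EU"],
    }
)

# Map each Entity_ID (the original row index) to its manual label,
# ignoring rows the annotator left empty
labels = annotated.set_index("Entity_ID")["Annotated_Category"].dropna()
labels = labels[labels.str.strip() != ""]

# Manual label where available, otherwise the rule-based category
df["Final_Category"] = labels.reindex(df.index).fillna(df["Matched_Category"])
print(df)
```

Rows without a manual label keep their rule-based category, so the merge can be rerun as annotation progresses.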


In [28]:
# Export df_non_academic to CSV
df_non_academic.to_csv("../output/df_non_academic.csv")
