# Academia–Practice Interaction Mapping Using NLP

**Notebook 04: Entity Cleaning & Preprocessing**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** April 2025  

---

### Project Description

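This notebook cleans the common ORG entities, filters out academic institutions by keyword, and saves the remaining non-academic entities. It assumes the common ORG entities have already been extracted and saved to `../output/common_org_entities.csv` in the previous notebook.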

---

### Steps

1. Normalize whitespace and strip leading and trailing special characters
2. Tokenize and identify frequently occurring keywords
3. Filter out entities that match academic-related keywords
4. Save the remaining non-academic entities


---


# Cleaning and preprocessing the data: unique_common

## STEP 1: Basic cleaning 

In [1]:
import pandas as pd
import re
from collections import Counter

In [2]:
# Load common entities
df_common = pd.read_csv("../output/common_org_entities.csv")
unique_common = df_common["ORG_Entity"].dropna().tolist()

In [3]:

cleaned_entities = []

# Basic cleaning: normalize whitespace, strip stray punctuation, drop very short entries
for entity in unique_common:

    # Normalize whitespace: collapse each run of whitespace into a single space, then strip leading/trailing whitespace
    entity = re.sub(r'\s+', ' ', entity).strip()

    # Remove leading non-word characters
    entity = re.sub(r'^[^\w]+', '', entity)

    # Remove trailing non-word characters except periods (keeps abbreviations such as "S.A.")
    entity = re.sub(r'[^\w.]+$', '', entity)

    # Skip entries with fewer than 3 characters
    if len(entity) < 3:
        continue

    cleaned_entities.append(entity)

len(cleaned_entities)

6005
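
To see what the cleaning rules do to individual strings, the same three substitutions can be wrapped in a small helper and run on a few inputs. This is only a sketch: the helper name `clean_entity` and the sample strings are illustrative assumptions, not part of the notebook's data.

```python
import re

def clean_entity(entity):
    """Hypothetical helper applying the same three regex steps as the loop above."""
    # Collapse runs of whitespace and strip the ends
    entity = re.sub(r'\s+', ' ', entity).strip()
    # Drop leading non-word characters
    entity = re.sub(r'^[^\w]+', '', entity)
    # Drop trailing non-word characters, keeping periods
    entity = re.sub(r'[^\w.]+$', '', entity)
    return entity

# Made-up inputs chosen to exercise each rule
samples = ['  "WARBUD   S.A."  ', '(Teatr Polski),', '-- UE']

cleaned = [clean_entity(s) for s in samples]
# Same length threshold as in the notebook: keep entries with 3+ characters
kept = [c for c in cleaned if len(c) >= 3]

print(cleaned)  # ['WARBUD S.A.', 'Teatr Polski', 'UE']
print(kept)     # ['WARBUD S.A.', 'Teatr Polski']
```

Different raw strings can collapse to the same cleaned form, so the cleaned list may contain duplicates. If that matters downstream, `list(dict.fromkeys(kept))` would remove them while preserving order.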

In [4]:
print(cleaned_entities[0:101])

['POPW', 'DG REGIO', 'RILEM', 'EFOE', 'Aiut', 'Urzędem Miasta Sulejówek', 'Instytut Sztuk Wizualnych', 'Grupa INCO S.A.', 'WARBUD S.A.', 'ZIGUL S.A.', 'Uniwersytet Stanforda', 'Idealia', 'Biblioteki Jagiellońskiej', 'UEMS', 'TMT’s', 'Advisory Board', 'Polskiego Radia', 'SSRN', 'Bloomberga', 'IWNiRZ', 'IGiG UPWr', 'SBWPwP', 'Zero-G studio', 'Sekcji Nauk Biblijnych KUL', 'Klastra Obróbki Metali', 'Polskiego Stowarzyszenia Montessori', 'QNA Technology Spółka z.o.o.', 'Ministerstwa Rolnictwa', 'WIM', 'Inspekcja Handlowa', 'Teatr Polski', 'DCTDA', 'Instytutem Badawczym Visionary Analytics', 'Zarząd Morskich Portów Szczecin', 'LDL', 'Università Pontificia Salesiana', 'University of Notre Dame', 'Komisji ds. Przeciwdziałania Mobbingowi', 'Polska Akademia Nauk', 'IHAR-PIB', 'Eurostatu', 'Migrant Info Point', 'KFiIM', 'Narodowy Instytut Samorządu Terytorialnego', 'EIE', 'Ministry of Development', 'Rady Transportu Aglomeracyjnego', 'BOHAMET', 'Centrum Kształcenia Ustawicznego', 'PGE Elektrowni T

## STEP 2: Filter out academic entities

In [10]:
# Find candidate academic keywords by listing the most frequently occurring tokens

# Tokenize all entries into lowercase words
tokens = []

for entry in cleaned_entities:
    tokens.extend(re.findall(r'\b\w+\b', entry.lower()))

# Count most common tokens
token_counts = Counter(tokens)
common_tokens = token_counts.most_common(500)

# Save the 500 most common tokens to CSV for manual review
academic_candidate_keywords = pd.DataFrame(common_tokens, columns = ["token", "count"])
academic_candidate_keywords.to_csv("../output/tokens_review.csv", index=False)
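
As a small illustration of the tokenize-and-count step, the sketch below runs the same regex and `Counter` logic on three names taken from the preview output above. The variable name `sample_entities` is an assumption introduced for the example.

```python
import re
from collections import Counter

# Three entity names from the preview output, used here as a toy input
sample_entities = [
    'Polska Akademia Nauk',
    'Instytut Sztuk Wizualnych',
    'Sekcji Nauk Biblijnych KUL',
]

tokens = []
for entry in sample_entities:
    # \w is Unicode-aware for str patterns in Python 3, so Polish diacritics stay inside tokens
    tokens.extend(re.findall(r'\b\w+\b', entry.lower()))

token_counts = Counter(tokens)
print(token_counts.most_common(1))  # [('nauk', 2)]
```

Tokens that recur across many names, such as `nauk` here, are the ones worth checking by hand as candidate academic keywords.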


    

In [6]:
# Create a list of keyword stems (matched as substrings) indicating academic entities

academic_keywords = [
    'instytut', 'pan', 'university', 'nauk', 'uniwersytet', 'wydział', 'wydzial', 'department', 'badań', 'akadem', 
    'katedr', 'politechni', 'laboratorium', 'research', 'institut', 'fizy', 'matematy', 
    'architektur', 'pracown', 'pedagogi', 'filozof', 'medycyn', 'medyczn', 'medical', 'językoznawstwo', 
    'mickiewicz', 'biolog', 'studia', 'uczeln', 'kolegium', 'collegium', 'studium', 'colleg', 'universit', 
    'wyższ', 'journal', 'springer', 'doktor']

In [7]:
# Define function to return academic entities

def is_academic(entity, keywords):
    """
    Determine whether an entity likely represents an academic institution
    by checking its name for predefined keyword stems.

    Parameters:
        entity (str): The name of the organization or entity to check.
        keywords (list of str): Lowercase keyword stems associated with academic
                                institutions (e.g., "universit", "instytut", "akadem").

    Returns:
        bool: True if any keyword occurs as a substring of the entity name
              (case-insensitive), False otherwise.
    """
    entity_lower = entity.lower()
    return any(keyword in entity_lower for keyword in keywords)
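
Because `is_academic` matches keywords anywhere inside the name, short stems such as `'pan'` can also match unrelated words. The sketch below contrasts the substring check with a hypothetical token-prefix variant. The name `is_academic_by_token` and the sample entities are assumptions made for illustration, and this is not the filter the notebook applies.

```python
import re

def is_academic(entity, keywords):
    """Substring check, as defined in the notebook."""
    entity_lower = entity.lower()
    return any(keyword in entity_lower for keyword in keywords)

def is_academic_by_token(entity, keywords):
    """Hypothetical variant: a keyword must match the start of a token."""
    tokens = re.findall(r'\w+', entity.lower())
    return any(token.startswith(keyword) for token in tokens for keyword in keywords)

sample_keywords = ['pan', 'universit']
# Made-up test names: the first contains 'pan' only inside another word
sample_entities = ['Japan Tobacco', 'PAN', 'University of Notre Dame']

substring_flags = [is_academic(e, sample_keywords) for e in sample_entities]
token_flags = [is_academic_by_token(e, sample_keywords) for e in sample_entities]

print(substring_flags)  # [True, True, True]
print(token_flags)      # [False, True, True]
```

Prefix matching still flags any token that merely begins with a short stem, so acronyms like `'pan'` may be better served by exact-token comparison. Which trade-off fits depends on how the keyword list is meant to be used.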

In [8]:
# Apply function to cleaned_entities

non_academic_entities = [entity for entity in cleaned_entities 
                        if not is_academic(entity, academic_keywords)]
print(non_academic_entities[0:101])

['POPW', 'DG REGIO', 'RILEM', 'EFOE', 'Aiut', 'Urzędem Miasta Sulejówek', 'Grupa INCO S.A.', 'WARBUD S.A.', 'ZIGUL S.A.', 'Idealia', 'Biblioteki Jagiellońskiej', 'UEMS', 'TMT’s', 'Advisory Board', 'Polskiego Radia', 'SSRN', 'Bloomberga', 'IWNiRZ', 'IGiG UPWr', 'SBWPwP', 'Zero-G studio', 'Klastra Obróbki Metali', 'Polskiego Stowarzyszenia Montessori', 'QNA Technology Spółka z.o.o.', 'Ministerstwa Rolnictwa', 'WIM', 'Inspekcja Handlowa', 'Teatr Polski', 'DCTDA', 'Zarząd Morskich Portów Szczecin', 'LDL', 'Komisji ds. Przeciwdziałania Mobbingowi', 'IHAR-PIB', 'Eurostatu', 'Migrant Info Point', 'KFiIM', 'EIE', 'Ministry of Development', 'Rady Transportu Aglomeracyjnego', 'BOHAMET', 'Centrum Kształcenia Ustawicznego', 'PGE Elektrowni Turów', 'GUS', 'Garment', 'IASE Sp. z o.o.', 'Komisja Spraw Konstytucyjnych PE', 'Związek Sybiraków', 'JYU', 'PWRUL', 'ŁOM', 'Zespołu Eksperckiego', 'SYSABA', 'Generalnej Dyrekcji Służby Więziennej', 'RES', 'Teatr Stary', 'GALOIS', 'Nissan', 'Urzędu Żeglugi Śród

In [9]:
# Preview what you're removing (sanity checking the filter)

academic_entities = [name for name in cleaned_entities if is_academic(name, academic_keywords)]

print(len(academic_entities))

1238
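
A total count alone does not show which keywords drive the removals. One possible sanity check is to count matches per keyword, as in the sketch below. The helper `keyword_hits` and the sample lists are assumptions for illustration. On the real data it would be called with `cleaned_entities` and `academic_keywords`.

```python
from collections import Counter

def keyword_hits(entities, keywords):
    """Hypothetical helper: count how many entity names contain each keyword."""
    hits = Counter()
    for entity in entities:
        entity_lower = entity.lower()
        for keyword in keywords:
            # Same case-insensitive substring test as is_academic
            if keyword in entity_lower:
                hits[keyword] += 1
    return hits

# Toy inputs: three names from the preview output and four of the notebook's keywords
sample_entities = ['Polska Akademia Nauk', 'University of Notre Dame', 'Teatr Polski']
sample_keywords = ['nauk', 'akadem', 'universit', 'pan']

hits = keyword_hits(sample_entities, sample_keywords)
print(dict(hits))  # {'nauk': 1, 'akadem': 1, 'universit': 1}
```

Keywords with unexpectedly high counts are good candidates for manual inspection, since they may be matching non-academic names.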


In [11]:
# Save non-academic entities to CSV
df_non_academic = pd.DataFrame({"ORG_Entity": non_academic_entities})
df_non_academic.to_csv("../output/non_academic_org_entities.csv", index=False)

print("Saved non-academic entities to ../output/non_academic_org_entities.csv")

Saved non-academic entities to ../output/non_academic_org_entities.csv
