# Academia–Practice Interaction Mapping Using NLP

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** April 2025  

---

### Project Description

This notebook uses Named Entity Recognition (NER) to extract non-academic organizations from social impact case studies submitted by Polish universities.  
The goal is to identify and analyze the network of practice partners involved in academic research and its broader societal impact.

---

### File Overview

- Input data: `data/merged_impact_case_studies.csv`
- Output files: 
  - `output/ner_stanza_pl.csv`
  - `output/ner_davlan_pl.csv`
- NLP Models used:
  - [`Stanza`](https://github.com/stanfordnlp/stanza) (Stanford NLP) for Polish NER
  - [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl) model via Hugging Face


---


# Read and prepare data for NER extraction

In [17]:
import pandas as pd
import os
import re
from transformers import pipeline
from ast import literal_eval
from collections import Counter
import random
from pathlib import Path

In [23]:
# Read data from a CSV file with descriptions of the societal impact of research

ics = pd.read_csv("../data/merged_impact_case_studies.csv")

In [5]:
# Select list of columns to include for analysis
columns_to_include_pl = ['Impact description identifier - POL-on 2.0 system uuid', 'Identifier of the institution to which the impact description is assigned - POL-on 2.0 system uuid', 'Domain name', 'Discipline name', 'Title (Polish version)', 'Impact (Polish version)', 'The leading area of impact']  

# Filter to keep only the necessary columns and drop duplicates
ics_selected_columns_pl = ics[columns_to_include_pl].drop_duplicates()

# Reset index if necessary
ics_selected_columns_pl.reset_index(drop=True, inplace=True)



In [6]:
# Combine 'Impact (Polish version)' and 'Title (Polish version)' into a single text series
texts_pl = ics_selected_columns_pl['Impact (Polish version)'].fillna('') + " " + ics_selected_columns_pl['Title (Polish version)'].fillna('')

In [7]:
# Check the shape of the dataset
print(texts_pl.shape)

(2661,)


In [11]:
# Define text preprocessing

# Function to normalize text: replace each URL with its domain name
def preprocess_text(text):
    """
    Normalizes URLs in the input text by replacing full URLs with just the domain name.

    Parameters:
        text (str): A string of text potentially containing URLs.

    Returns:
        str: The input text with URLs simplified to their domain names.
    """
    text = re.sub(r'https?://(?:www\.)?([a-zA-Z0-9.-]+)(?:/[\w./-]*)?', r'\1', text)  # Normalize URLs
    return text
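
To see what the pattern does in practice, the sketch below runs the same regular expression on two made-up sentences (the URLs are illustrative, not taken from the dataset). It also shows one limitation: the optional path group does not cover query strings, so anything after `?` stays in the text. `normalize_urls` is a hypothetical name for a copy of `preprocess_text`.

```python
import re

# Same pattern as in preprocess_text
URL_PATTERN = r'https?://(?:www\.)?([a-zA-Z0-9.-]+)(?:/[\w./-]*)?'

def normalize_urls(text):
    """Replace each URL with its domain name."""
    return re.sub(URL_PATTERN, r'\1', text)

# Made-up examples
plain = normalize_urls("Raport: https://www.example.org/raporty/2021.pdf oraz http://uj.edu.pl")
with_query = normalize_urls("Zobacz https://example.org/page?id=3")

print(plain)       # Raport: example.org oraz uj.edu.pl
print(with_query)  # Zobacz example.org?id=3
```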

In [9]:
# Preprocess text
cleaned_text_pl = texts_pl.apply(preprocess_text)

In [10]:
# Check the descriptive statistics of the dataset

cleaned_text_pl_mean = cleaned_text_pl.apply(lambda x: len(str(x).split())).mean()
cleaned_text_pl_sum = cleaned_text_pl.apply(lambda x: len(str(x).split())).sum()

print(f"Average word number in cleaned_text: {cleaned_text_pl_mean}")
print(f"Total words in cleaned_text: {cleaned_text_pl_sum}")

Average word number in cleaned_text: 550.5678316422398
Total words in cleaned_text: 1465061


# Extract ORG Entities Using Different Models

## Extract entities using Stanza Model ('pl')

In [12]:
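# Load the Polish Stanza pipeline with its default processors, including NER.
# This step is not shown in the original notebook; it assumes the Polish models
# were downloaded beforehand with stanza.download("pl").
import stanza
nlp_stanza_pl = stanza.Pipeline(lang="pl")

# Function to extract organization entities using Stanza_pl
def extract_org_stanza_pl(text):
    """
    Extracts organization entities from Polish-language text using the Stanza NER pipeline.

    Parameters: 
        text (str): A string of text in Polish.

    Returns: 
        list: A list of organization names found in the input text
              (the Polish Stanza NER model labels them "orgName").
    """
    doc = nlp_stanza_pl(text)
    return [ent.text for ent in doc.ents if ent.type == "orgName"]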


In [None]:
# Apply Stanza_pl NER extraction to the dataset
cleaned_text_stanza_pl = cleaned_text_pl.apply(extract_org_stanza_pl)

# Save results in a DataFrame
df_stanza_pl = pd.DataFrame({"Text": cleaned_text_pl, "ORG_Entities_stanza": cleaned_text_stanza_pl})


In [None]:
# Add the 'Impact description identifier - POL-on 2.0 system uuid' column to df_stanza_pl
df_stanza_pl["ICS_ID"] = ics_selected_columns_pl["Impact description identifier - POL-on 2.0 system uuid"]

# Display the first few rows to confirm
print(df_stanza_pl.head())

In [None]:
total_entities_ner_stanza_pl = df_stanza_pl["ORG_Entities_stanza"].explode().notna().sum()
print(f"Total ORG entities extracted Stanza_pl: {total_entities_ner_stanza_pl}")

In [24]:
# Define your target folder path 
OUTPUT_DIR = "../output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the CSV file in the specified folder
csv_file_path_pl = os.path.join(OUTPUT_DIR, "ner_stanza_pl.csv")
df_stanza_pl.to_csv(csv_file_path_pl, index=False)

print(f"CSV file saved at: {csv_file_path_pl}")


## Extract entities using multilingual NER model from HuggingFace

In [None]:
# Load multilingual NER model from HuggingFace

ner_pipeline_xlm = pipeline(
    "ner",
    model="Davlan/xlm-roberta-base-ner-hrl",
    aggregation_strategy="simple"  # Merges tokens into entities
)

In [15]:
# Write a function to extract ORG entities

def extract_org_xlm(text):
    """
    Extracts 'ORG' entities from Polish-language text using the Davlan/XLM-RoBERTa Hugging Face NER model.

    Parameters:
        text (str): A string of text in Polish.

    Returns:
        list: A list of 'ORG' entities extracted by the transformer-based model.
    """
    results = ner_pipeline_xlm(text)
    return [r['word'] for r in results if r['entity_group'] == 'ORG']
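
XLM-RoBERTa-based models typically accept at most 512 subword tokens per input, while the descriptions here average about 550 words, so long texts may be only partly processed by the pipeline. One possible workaround, sketched below under that assumption, is to split each text into overlapping word windows, run NER on each window, and merge the results. `chunk_words`, `extract_org_chunked` and the window sizes are hypothetical placeholders to tune against the tokenizer; `ner_fn` stands in for `extract_org_xlm`.

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def extract_org_chunked(text, ner_fn, size=200, overlap=20):
    """Run ner_fn on every window and concatenate the entity lists."""
    entities = []
    for chunk in chunk_words(text, size, overlap):
        entities.extend(ner_fn(chunk))
    return entities

# Toy check with a stand-in "NER" that returns upper-case tokens
fake_ner = lambda chunk: [w for w in chunk.split() if w.isupper()]
demo_chunks = chunk_words("a UE c d NATO f g", size=3, overlap=1)
demo_entities = extract_org_chunked("a UE c d NATO f g", fake_ner, size=3, overlap=1)

print(demo_chunks)    # ['a UE c', 'c d NATO', 'NATO f g']
print(demo_entities)  # ['UE', 'NATO', 'NATO']
```

An entity that falls inside the overlap is returned twice, as `NATO` is here, so raw counts would be inflated; the per-row deduplication used later in the comparison would absorb these repeats.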

In [None]:
# Apply the model and create a dataframe

df_davlan_pl = pd.DataFrame({
    "Text": cleaned_text_pl,
    "ORG_Entities_xlm": cleaned_text_pl.apply(extract_org_xlm)
})

In [None]:
# Add the 'Impact description identifier - POL-on 2.0 system uuid' column to df_davlan_pl
df_davlan_pl["ICS_ID"] = ics_selected_columns_pl["Impact description identifier - POL-on 2.0 system uuid"]

# Display the first few rows to confirm
print(df_davlan_pl.head())

In [None]:
total_entities_ner_davlan_pl = df_davlan_pl["ORG_Entities_xlm"].explode().notna().sum()
print(f"Total ORG entities extracted Davlan_pl: {total_entities_ner_davlan_pl}")

In [None]:
# Define your target folder path 
OUTPUT_DIR = "../output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the CSV file in the specified folder
csv_davlan_path_pl = os.path.join(OUTPUT_DIR, "ner_davlan_pl.csv")
df_davlan_pl.to_csv(csv_davlan_path_pl, index=False)

print(f"CSV file saved at: {csv_davlan_path_pl}")

# EDA: Compare entities from both models

In [6]:
# Load the CSV files

stanza_pl = pd.read_csv("../output/ner_stanza_pl.csv")
davlan_pl = pd.read_csv("../output/ner_davlan_pl.csv")

In [5]:
# Create a dataframe for comparison 

ner_pl_comp = pd.merge(stanza_pl, davlan_pl, on = ["Text", "ICS_ID"], how="inner")


In [6]:
# Change the order of columns

ner_pl_comp = ner_pl_comp[['ICS_ID', 'Text', 'ORG_Entities_stanza', 'ORG_Entities_xlm']]

In [7]:
ner_pl_comp.columns

Index(['ICS_ID', 'Text', 'ORG_Entities_stanza', 'ORG_Entities_xlm'], dtype='object')

In [8]:
ner_pl_comp.head()

Unnamed: 0,ICS_ID,Text,ORG_Entities_stanza,ORG_Entities_xlm
0,00153fbd-82f7-48c4-b5bd-e830bc390244,Badania skupiające się na szczegółowej analizi...,['Komitetu Nauk Weterynaryjnych i Rozrodu Zwie...,['Komitetu Nauk Weterynaryjnych i Rozrodu Zwie...
1,002768f1-8b96-4e0f-bcc8-192eb0594e60,"Birdwatching, czyli obserwacje w terenie ptakó...","['Królewskie Towarzystwo Ochrony Ptaków', 'Fac...","['Królewskie Towarzystwo Ochrony Ptaków', 'Zak..."
2,00500483-f00c-4410-b6f7-8650a003125f,Efektywny transfer wiedzy jest podstawowym czy...,"['MŚP', 'MŚP', 'ETW']",[]
3,006e7fef-2083-426d-9c1b-1affd27b939e,Ważnym obszarem działalności naukowej WSPiA je...,"['WSPiA', 'AP', 'WSPiA', '4 Zespoły', 'AP', 'A...","['WSPiA', 'WSPiA']"
4,00901439-d91a-48e0-903a-26a4253c3a0c,Znaczna część europejskiego dziedzictwa archeo...,"['Interreg Central Europe', 'Archaeological He...","['Inter', 'Archaeological Heritage Office of S..."


In [9]:
# Checking the data type

print(type(ner_pl_comp.iloc[0, 2]))

<class 'str'>


In [10]:
# Converting columns with entity names into lists

ner_pl_comp[["ORG_Entities_stanza", "ORG_Entities_xlm"]] = ner_pl_comp[["ORG_Entities_stanza", "ORG_Entities_xlm"]].apply(
    lambda col: col.map(literal_eval))
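
The conversion is needed because `to_csv` writes each Python list as its string representation, and `read_csv` returns that string unchanged. A minimal illustration, using the Stanza value shown for row 2 in the preview above:

```python
from ast import literal_eval

raw = "['MŚP', 'MŚP', 'ETW']"   # what read_csv returns: a str
parsed = literal_eval(raw)       # a real list of str

print(type(raw).__name__, type(parsed).__name__)  # str list
print(parsed)                                     # ['MŚP', 'MŚP', 'ETW']
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file.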


In [11]:
# Create a column with entities detected by both models

ner_pl_comp["Common_entities"] = ner_pl_comp.apply(
    lambda row: list(set(row["ORG_Entities_stanza"]) & set(row["ORG_Entities_xlm"])), axis=1
                    )

In [12]:
ner_pl_comp.head()

Unnamed: 0,ICS_ID,Text,ORG_Entities_stanza,ORG_Entities_xlm,Common_entities
0,00153fbd-82f7-48c4-b5bd-e830bc390244,Badania skupiające się na szczegółowej analizi...,[Komitetu Nauk Weterynaryjnych i Rozrodu Zwier...,[Komitetu Nauk Weterynaryjnych i Rozrodu Zwier...,"[Rady Doradczej, Facebook, Scientific Reports,..."
1,002768f1-8b96-4e0f-bcc8-192eb0594e60,"Birdwatching, czyli obserwacje w terenie ptakó...","[Królewskie Towarzystwo Ochrony Ptaków, Facebo...","[Królewskie Towarzystwo Ochrony Ptaków, Zakład...","[Królewskie Towarzystwo Ochrony Ptaków, UAM]"
2,00500483-f00c-4410-b6f7-8650a003125f,Efektywny transfer wiedzy jest podstawowym czy...,"[MŚP, MŚP, ETW]",[],[]
3,006e7fef-2083-426d-9c1b-1affd27b939e,Ważnym obszarem działalności naukowej WSPiA je...,"[WSPiA, AP, WSPiA, 4 Zespoły, AP, AP, ZK, ZK, ...","[WSPiA, WSPiA]",[WSPiA]
4,00901439-d91a-48e0-903a-26a4253c3a0c,Znaczna część europejskiego dziedzictwa archeo...,"[Interreg Central Europe, Archaeological Herit...","[Inter, Archaeological Heritage Office of Saxo...","[Cultural Heritage Department, Archaeological ..."


In [13]:
# Create Stanza and Davlan columns with unique entities per row (entities can repeat across rows but not within the same row, i.e. the same case study)

ner_pl_comp["Stanza_unique_row"] = ner_pl_comp["ORG_Entities_stanza"].apply(lambda row: list(set(row)))
ner_pl_comp["Davlan_unique_row"] = ner_pl_comp["ORG_Entities_xlm"].apply(lambda row: list(set(row)))
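
`set` removes duplicates but does not keep the order in which entities appear in the text, which is why row 2 comes out as `[ETW, MŚP]` in the preview below. If the order of first mention matters for later analysis, `dict.fromkeys` is a possible alternative; the helper name `unique_in_order` is hypothetical.

```python
def unique_in_order(entities):
    """Drop duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(entities))

ordered = unique_in_order(['MŚP', 'MŚP', 'ETW'])
print(ordered)  # ['MŚP', 'ETW']
```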


In [14]:
ner_pl_comp.head()

Unnamed: 0,ICS_ID,Text,ORG_Entities_stanza,ORG_Entities_xlm,Common_entities,Stanza_unique_row,Davlan_unique_row
0,00153fbd-82f7-48c4-b5bd-e830bc390244,Badania skupiające się na szczegółowej analizi...,[Komitetu Nauk Weterynaryjnych i Rozrodu Zwier...,[Komitetu Nauk Weterynaryjnych i Rozrodu Zwier...,"[Rady Doradczej, Facebook, Scientific Reports,...","[Rady Doradczej, Zespołu, Uniwersytet, Faceboo...","[Rady Doradczej, Facebook, Scientific Reports,..."
1,002768f1-8b96-4e0f-bcc8-192eb0594e60,"Birdwatching, czyli obserwacje w terenie ptakó...","[Królewskie Towarzystwo Ochrony Ptaków, Facebo...","[Królewskie Towarzystwo Ochrony Ptaków, Zakład...","[Królewskie Towarzystwo Ochrony Ptaków, UAM]",[Komitetu Biologii Środowiskowej i Ewolucyjnej...,"[Królewskie Towarzystwo Ochrony Ptaków, „Grupa..."
2,00500483-f00c-4410-b6f7-8650a003125f,Efektywny transfer wiedzy jest podstawowym czy...,"[MŚP, MŚP, ETW]",[],[],"[ETW, MŚP]",[]
3,006e7fef-2083-426d-9c1b-1affd27b939e,Ważnym obszarem działalności naukowej WSPiA je...,"[WSPiA, AP, WSPiA, 4 Zespoły, AP, AP, ZK, ZK, ...","[WSPiA, WSPiA]",[WSPiA],"[4 Zespoły, WSPiA, AP, ZK]",[WSPiA]
4,00901439-d91a-48e0-903a-26a4253c3a0c,Znaczna część europejskiego dziedzictwa archeo...,"[Interreg Central Europe, Archaeological Herit...","[Inter, Archaeological Heritage Office of Saxo...","[Cultural Heritage Department, Archaeological ...","[Province of Trento, VR, Urzędu Miasta Pucka, ...","[Cultural Heritage Department, Inter, Archaeol..."


In [15]:
# Count entities in each column:
# stanza_all = all entities extracted by Stanza (with duplicates within and across rows)
# stanza_unique_row = Stanza entities with no duplicates within rows
# davlan_sum = all entities extracted by Davlan (with duplicates within and across rows)
# davlan_unique_row = Davlan entities with no duplicates within rows
# common_sum = entities extracted by both Stanza and Davlan (no duplicates within rows)

stanza_all = sum(ner_pl_comp["ORG_Entities_stanza"].apply(lambda row: len(row)))
stanza_unique_row = sum(ner_pl_comp["Stanza_unique_row"].apply(lambda row: len(row)))
davlan_sum = sum(ner_pl_comp["ORG_Entities_xlm"].apply(lambda row: len(row)))
davlan_unique_row = sum(ner_pl_comp["Davlan_unique_row"].apply(lambda row: len(row)))
common_sum = sum(ner_pl_comp["Common_entities"].apply(lambda row: len(row)))

print(f'Stanza all entities: {stanza_all}')
print(f'Stanza unique per row: {stanza_unique_row}')
print(f'Davlan all entities: {davlan_sum}')
print(f'Davlan unique per row: {davlan_unique_row}')
print(f'Common entities: {common_sum}')

Stanza all entities: 35642
Stanza unique per row: 25714
Davlan all entities: 17554
Davlan unique per row: 14080
Common entities: 8803


In [16]:
# Calculate the overlap between Stanza and Davlan as a Jaccard similarity percentage,
# pooled over the per-row unique entities:
# Jaccard similarity = (number of common elements) / (number of elements in the union of both sets)

overlap_percent = common_sum / (stanza_unique_row + davlan_unique_row - common_sum) * 100
print(overlap_percent)


28.405020812493948
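
The figure above is a pooled overlap: intersections and unions are summed over rows before dividing, and each row's union size equals |A| + |B| - |A ∩ B|. The sketch below spells out that computation on toy rows (loosely modelled on the preview, not the real data); `pooled_jaccard` is a hypothetical helper.

```python
def pooled_jaccard(rows_a, rows_b):
    """Sum per-row intersections and unions, then divide."""
    inter = 0
    union = 0
    for a, b in zip(rows_a, rows_b):
        set_a, set_b = set(a), set(b)
        inter += len(set_a & set_b)
        union += len(set_a | set_b)
    return inter / union if union else 0.0

# Toy rows: two case studies, entities per model
toy_stanza = [["MŚP", "MŚP", "ETW"], ["WSPiA", "AP", "ZK"]]
toy_davlan = [[], ["WSPiA", "WSPiA"]]

toy_overlap = pooled_jaccard(toy_stanza, toy_davlan)
print(toy_overlap)  # 0.2
```

Here the unique counts are 5 (Stanza) and 1 (Davlan) with 1 shared entity, so the formula gives 1 / (5 + 1 - 1) = 0.2, the same result as summing the unions directly.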


In [17]:
# Explore frequencies of entities

stanza_flat_list = [entity for row in ner_pl_comp["Stanza_unique_row"] for entity in row]
stanza_freq = Counter(stanza_flat_list)
print("Stanza:")
print(stanza_freq.most_common(50))

davlan_flat_list = [entity for row in ner_pl_comp["Davlan_unique_row"] for entity in row]
davlan_freq = Counter(davlan_flat_list)
print("Davlan:")
print(davlan_freq.most_common(50))

both_flat_list = [entity for row in ner_pl_comp["Common_entities"] for entity in row]
both_freq = Counter(both_flat_list)
print("Stanza & Davlan:")
print(both_freq.most_common(50))

Stanza:
[('UE', 341), ('Unii Europejskiej', 167), ('Komisji Europejskiej', 105), ('Instytutu', 102), ('NCBiR', 61), ('Instytut', 60), ('ONZ', 57), ('UJ', 51), ('UW', 47), ('UNESCO', 47), ('unijnych', 46), ('Zespołu', 45), ('NFZ', 44), ('Uczelni', 43), ('GUS', 42), ('WHO', 40), ('Instytucie', 39), ('Rady Ministrów', 38), ('Facebook', 37), ('UMK', 37), ('KE', 36), ('Rady', 36), ('Parlamentu Europejskiego', 34), ('NATO', 32), ('UAM', 31), ('Komisję Europejską', 31), ('NCN', 30), ('UWr', 29), ('AGH', 29), ('OZE', 29), ('MŚP', 27), ('Komisja Europejska', 27), ('Muzeum', 27), ('UP', 27), ('OECD', 26), ('TVP', 26), ('PW', 26), ('UŁ', 26), ('B+R', 26), ('Polskiego Radia', 25), ('Wydziału', 25), ('SGGW', 25), ('PG', 25), ('Polonii', 24), ('NGO', 24), ('Policji', 24), ('MRiRW', 23), ('europejskiej', 23), ('Ministerstwa Zdrowia', 23), ('EU', 23)]
Davlan:
[('', 190), ('UE', 171), ('Instytut', 100), ('Unii Europejskiej', 64), ('Komisji Europejskiej', 56), ('In', 51), ('UJ', 35), ('WHO', 34), ('UW',
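
The Davlan list is headed by an empty string, and fragments such as 'In' also appear, which suggests that some pieces returned by the pipeline are not complete names. A possible guard, sketched here on toy data, is to drop blank strings before counting; `count_entities` is a hypothetical helper.

```python
from collections import Counter

def count_entities(rows):
    """Count entities across rows, ignoring empty or whitespace-only strings."""
    return Counter(e.strip() for row in rows for e in row if e.strip())

toy_rows = [["UE", "", "Instytut"], ["UE", " "], []]
toy_counts = count_entities(toy_rows)
print(toy_counts.most_common())  # [('UE', 2), ('Instytut', 1)]
```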

In [18]:
# Create a list of unique entities per model

unique_stanza = list(set(stanza_flat_list))
unique_davlan = list(set(davlan_flat_list))
unique_common = list(set(both_flat_list))

unique_stanza_only = list(set(stanza_flat_list) - set(davlan_flat_list))
unique_davlan_only = list(set(davlan_flat_list) - set(stanza_flat_list))

unique_stanza_count = len(unique_stanza)
unique_davlan_count = len(unique_davlan)
unique_common_count = len(unique_common)
unique_stanza_only_count = len(unique_stanza_only)
unique_davlan_only_count = len(unique_davlan_only)

print(f'Unique entities Stanza: {unique_stanza_count}')
print(f'Unique entities Davlan: {unique_davlan_count}')
print(f'Unique common entities: {unique_common_count}')
print(f'Unique Stanza only: {unique_stanza_only_count}')
print(f'Unique Davlan only: {unique_davlan_only_count}')

Unique entities Stanza: 17988
Unique entities Davlan: 9751
Unique common entities: 6111
Unique Stanza only: 11720
Unique Davlan only: 3483


In [22]:
print(unique_stanza_only[0:101])

['JZW KOKS', 'Nowotarski & Weron', 'Polskiemu Radiu Pomorza i Kujaw', 'Royal College of Music', 'Arida/Zeszuta', 'Wojewodę K-P', 'Ośrodek IWRD', 'World Bank', 'PROTE Sp.', 'Akademia Kreatywnego Rozwoju', 'Zakładzie Asil Çelik', 'ISAM', 'Damovo', 'Innowatora Śląska', 'Montessori Europe Research Group', 'Polskiego Towarzystwa Filozoficznego', 'Euroregionu Karpackiego', 'FlavorActiv', 'SPOZ 1 w Lublinie', 'Harvard University Press', 'CFA Society Switzerland', 'Muzeum Zachęta w Warszawie', 'Prywatna Klinika VET-LAB Brudzew', 'NPBWP', 'Radzie Klimatycznej', 'Ltd.', 'Stowarzyszenie Inżynierów i Techników Pożarnictwa SITP', 'Spółka z o.o.', 'Instytut Pedagogiki', 'Szkole Doktorskiej NŚT UZ', 'GlobeCore', 'Zrzeszenia Kaszubsko-Pomorskiego', 'NMSG', 'Pełnomocnikiem Rządu ds. Partnerstwa Strategicznego Polski i Ukrainy', 'Radiu TOK FM', 'Stowarzyszeniem Związek Miast Polskich', 'JIABEL BILINGUAL EDUCATION CENTER', 'Edipresse Polska', 'Teatr Witkacego', 'muzeum Historii Akademii Ostrogskiej', '„B

# Cleaning and preprocessing the data: unique_common

## STEP 1: Basic cleaning 

In [25]:

cleaned_entities = []

# Basic cleaning: whitespace, removing redundant entries
for entity in unique_common:

    # Normalize whitespace (collapse runs of whitespace into a single space, strip leading/trailing whitespace)
    entity = re.sub(r'\s+', ' ', entity).strip()

    # Remove leading non-word characters
    entity = re.sub(r'^[^\w]+', '', entity)

    # Remove trailing punctuation except periods
    entity = re.sub(r'[^\w.]+$', '', entity)

    # Remove entries with fewer than 3 characters
    if len(entity) < 3:
        continue

    cleaned_entities.append(entity)

len(cleaned_entities)

6005
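
The three substitutions are easier to check when wrapped in a function and run on a few strings. The sketch below does that with made-up inputs; `clean_entity` is a hypothetical wrapper around the same regular expressions. It also shows that distinct raw strings can collapse to the same cleaned name, so a second deduplication after cleaning may be worth considering.

```python
import re

def clean_entity(entity, min_len=3):
    """Apply the basic cleaning steps; return None for entries that are too short."""
    entity = re.sub(r'\s+', ' ', entity).strip()   # normalize whitespace
    entity = re.sub(r'^[^\w]+', '', entity)        # drop leading non-word characters
    entity = re.sub(r'[^\w.]+$', '', entity)       # drop trailing punctuation except periods
    return entity if len(entity) >= min_len else None

# Made-up inputs
raw_names = ['„Grupa  Azoty”', 'Grupa Azoty', ' ZUGIL S.A. ', '(PAN),', 'In', '']
cleaned_names = [c for c in map(clean_entity, raw_names) if c is not None]
deduplicated = list(dict.fromkeys(cleaned_names))

print(cleaned_names)  # ['Grupa Azoty', 'Grupa Azoty', 'ZUGIL S.A.', 'PAN']
print(deduplicated)   # ['Grupa Azoty', 'ZUGIL S.A.', 'PAN']
```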

In [26]:
print(cleaned_entities[0:101])

['WNH', 'Organu Doradczego Zarządu PZSN', 'ICM', 'HeidelbergCement AG', 'The Washington Post', 'Grupy Badawczej Prawa Usług Cyfrowych', 'Talmex', 'Singer Instruments', 'INT UP', 'Centrum Studiów Ratzingera', 'ZUGIL S.A.', 'Polskiego Związku Łowickiego', 'Zakładu Energetycznego w Łomży', 'PPNT', 'Komitetu Standardów Rachunkowości', 'Uniwersytetu SWPS', 'Research & Development et Nokia Wroclaw', 'LAŚ', 'Kaufland', 'Car Cosmetics', 'Akademickim Chórem UMCS', 'Polskiego Towarzystwa Ginekologicznego', 'Polska Spółka Gazownictwa', 'GALOIS', 'Narodowego Funduszu Rewaloryzacji Zabytków Krakowa', 'GIK', 'Stanford University', 'ZG Związku Sybiraków', 'Komitecie Językoznawstwa PAN', 'Jewish Journal', 'Pracowni Badań nad Bezpieczeństwem Lokalnym', 'Głównym Inspektoratem Weterynarii', 'Krajowej Administracji Skarbowej', 'Dowództwach Strategicznych NATO', 'Ekombud', 'European Food Safety Authority', 'Wojsk Obrony Cyberprzestrzeni', 'Muzeum Narodowe', 'Żabka', 'Głównego Lekarza Weterynarii', 'Rossman

## STEP 2: Filter out academia entities

In [27]:
# Define academic-related keywords 

## Generate the most frequently occurring keywords

## Tokenize all entries into lowercase words
tokens = []

for entry in cleaned_entities:
    tokens.extend(re.findall(r'\b\w+\b', entry.lower()))

# Count most common tokens
token_counts = Counter(tokens)
common_tokens = token_counts.most_common(500)

# Save for manual review
academic_candidate_keywords = pd.DataFrame(common_tokens, columns = ["token", "count"])
academic_candidate_keywords.to_csv("tokens_review.csv", index=False)


    

In [28]:
# Create a list of stemmed keywords indicating academic entities

academic_keywords = [
    'instytut', 'pan', 'university', 'nauk', 'uniwersytet', 'wydział', 'wydzial', 'department', 'badań', 'akadem', 
    'katedr', 'politechni', 'laboratorium', 'research', 'institut', 'fizy', 'matematy', 
    'architektur', 'pracown', 'pedagogi', 'filozof', 'medycyn', 'medyczn', 'medical', 'językoznawstwo', 
    'mickiewicz', 'biolog', 'studia', 'uczeln', 'kolegium', 'collegium', 'studium', 'colleg', 'universit', 
    'wyższ', 'journal', 'springer', 'doktor']

In [16]:
# Define function to return academic entities

def is_academic(entity, keywords):
    """
    Determines whether a given entity likely represents an academic institution 
    by checking for the presence of predefined keywords.

    Parameters:
        entity (str): The name of the organization or entity to check.
        keywords (list of str): A list of lowercase keywords associated with academic institutions (academic_keywords)
                                (e.g., "university", "institute", "academy").

    Returns:
        bool: True if any keyword is found in the entity name (case-insensitive), False otherwise.
    """
    entity_lower = entity.lower()
    return any(keyword in entity_lower for keyword in keywords)
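
Because the check is a plain substring test, short keywords can match inside unrelated words: `'pan'` (for PAN, the Polish Academy of Sciences) also occurs in a name such as 'Japan Tobacco International'. One possible refinement, sketched below, matches stems only at the start of a token and treats acronyms as whole words. The function names, the split into `stems` and `whole_words`, and the first example name are assumptions, and prefix matching would miss stems that occur mid-word.

```python
import re

def is_academic_substring(entity, keywords):
    """Original behaviour: keyword anywhere in the lowercased name."""
    entity_lower = entity.lower()
    return any(keyword in entity_lower for keyword in keywords)

def is_academic_tokens(entity, stems, whole_words):
    """Variant: stems must start a token, acronyms must equal a token."""
    tokens = re.findall(r'\w+', entity.lower())
    return any(
        token in whole_words or any(token.startswith(stem) for stem in stems)
        for token in tokens
    )

stems = ['uniwersytet', 'akadem', 'instytut']
whole_words = {'pan'}

for name in ['Japan Tobacco International', 'Komitecie Językoznawstwa PAN', 'Uniwersytetu SWPS']:
    print(name,
          is_academic_substring(name, stems + ['pan']),
          is_academic_tokens(name, stems, whole_words))
# Japan Tobacco International True False
# Komitecie Językoznawstwa PAN True True
# Uniwersytetu SWPS True True
```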

In [32]:
# Apply function to cleaned_entities

non_academic_entities = [entity for entity in cleaned_entities 
                        if not is_academic(entity, academic_keywords)]
print(non_academic_entities[0:101])

['WNH', 'Organu Doradczego Zarządu PZSN', 'ICM', 'HeidelbergCement AG', 'The Washington Post', 'Grupy Badawczej Prawa Usług Cyfrowych', 'Talmex', 'Singer Instruments', 'INT UP', 'Centrum Studiów Ratzingera', 'ZUGIL S.A.', 'Polskiego Związku Łowickiego', 'Zakładu Energetycznego w Łomży', 'PPNT', 'Komitetu Standardów Rachunkowości', 'LAŚ', 'Kaufland', 'Car Cosmetics', 'Polskiego Towarzystwa Ginekologicznego', 'Polska Spółka Gazownictwa', 'GALOIS', 'Narodowego Funduszu Rewaloryzacji Zabytków Krakowa', 'GIK', 'ZG Związku Sybiraków', 'Głównym Inspektoratem Weterynarii', 'Krajowej Administracji Skarbowej', 'Dowództwach Strategicznych NATO', 'Ekombud', 'European Food Safety Authority', 'Wojsk Obrony Cyberprzestrzeni', 'Muzeum Narodowe', 'Żabka', 'Głównego Lekarza Weterynarii', 'Rossman', 'Modelowania i Projektowania Inteligentnych Technologii Asystujących', 'Ogólnopolskiego Porozumienia Związków Zawodowych', 'Kazachskiego Urzędu Statystycznego', 'BIOWET Puławy sp. z o.o.', 'ESPON', 'Centrum B

In [33]:
# Preview what you're removing (sanity checking the filter)

academic_entities = [name for name in cleaned_entities if is_academic(name, academic_keywords)]

print(len(academic_entities))

1238


## STEP 3: Categorize data

In [34]:
# Establish categories of non-academic entities

"""
To classify non-academic entities identified in the impact case studies, I developed a typology of organization 
types based on a grounded, inductive review process. Specifically, I randomly selected a sample of approximately 1,000 
unique non-academic organization names from the full list of 4,735 deduplicated entities. This sample served as an 
exploratory base for developing an inductive typology. By manually reviewing the selected entries, I identified recurring 
organizational patterns and formulated a set of categories that reflect the functional diversity of non-academic 
stakeholders mentioned in the case studies.
"""





In [35]:
# Create a random sample of 1000 entities

# Set seed for reproducibility
random.seed(42)

# Randomly sample 1,000 entries from the non-academic entities list
sample_size = 1000
non_academic_sampled = random.sample(non_academic_entities, sample_size)

# Convert to DataFrame for review
non_academic_sampled_df = pd.DataFrame(non_academic_sampled, columns=["organization_name"])
non_academic_sampled_df.to_csv("non_academic_sampled.csv", index=False, encoding="utf-8-sig")
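
`random.seed(42)` fixes the global generator, so the sample is reproducible only if the input list arrives in the same order each time. Since `non_academic_entities` derives from `set` operations, whose iteration order for strings can differ between Python sessions, sorting before sampling may be needed to get the same sample across runs. A minimal sketch, with the hypothetical helper `reproducible_sample`:

```python
import random

def reproducible_sample(items, k, seed=42):
    """Sample k items independently of the input order and of global random state."""
    rng = random.Random(seed)       # local generator, global state untouched
    pool = sorted(set(items))       # fixed order; also removes exact duplicates
    return rng.sample(pool, min(k, len(pool)))

names = ["Kaufland", "Żabka", "ESPON", "Talmex", "GALOIS"]
first = reproducible_sample(names, 3)
second = reproducible_sample(list(reversed(names)), 3)

print(first == second)  # True
```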


In [47]:
# Establish categories of non-academic entities

"""
1. Company / Business
Commercial enterprises, corporations, startups, and private firms (e.g., Kaufland, KGHM ZANAM, Voicelab, Photon).

2. Government / Public Administration
Includes ministries, central/local government agencies, parliament, and other state entities (e.g., Senat RP, Urząd Miasta, Ministerstwo Rozwoju).

3. NGO / Association / Foundation
Non-profit organizations, foundations, professional associations, and social initiatives (e.g., Fundacja La Strada, Polskie Towarzystwo Psychologiczne, Stowarzyszenie Wioska Gotów).

4. Media / Publishing
News outlets, broadcasters, publishers, and cultural magazines (e.g., Polskie Radio, TVP Info, Deutsche Welle, Gazeta Lubuska).

5. Cultural Institution / Arts
Museums, theatres, orchestras, festivals, galleries (e.g., Teatr Wielki, Muzeum Historii Polski, Galeria Arsenał).

6. Health / Hospitals / Medical
Clinics, hospitals, medical institutes, and health-related organizations (e.g., Centrum Zdrowia Szansa, NFZ, American Heart Association).

7. Religious Organization
Churches, dioceses, religious associations, and theological institutions (e.g., Kościół Katolicki, Episkopat Polski, Cerkiew).

8. Military / Defense / Security
Armed forces, police, defense industry, or military R&D (e.g., Wojsko Polskie, Żandarmeria Wojskowa, Lockheed Martin).

9. International Organization / EU
UN, EU, NATO, OECD, international consortia or partnerships (e.g., European Commission, UNESCO, OECD).

10. Education (non-university)
Includes schools, kindergartens, vocational schools, continuing education centers (e.g., Szkoła Podstawowa, Centrum Kształcenia Ustawicznego).

11. Other / Unclear
Anything that doesn’t clearly fall into the above categories or needs human validation.

"""
