# Academia–Practice Interaction Mapping Using NLP  
**Notebook 06 – Finalize Common Entity Categories**

**Author:** Kamila Lewandowska  
**Project Phase:** In Progress  
**Last Updated:** June 2025  

---

## Objective

Finalize the classification of non-academic organizations by incorporating manual annotations for entities initially labeled as “Other / Unclear.” The outcome is a definitive, human-validated dataset of categorized *common* non-academic entities, linked where possible to their original `ICS_ID`s.

---

## Workflow Summary

- Load manual annotations from `unclear_entities_annotated.xlsx`  
- Merge annotations with previously classified non-academic entities (`df_non_academic.csv`)  
- Update `Matched_Category` using `Annotated_Category` values  
- Export the finalized dataset to `df_non_academic_updated.csv`  
- Merge with original `ICS_ID` mappings from `common_entities_with_icsid.csv`  
- Export final matched entities to `common_classified_with_ids.csv`

---

## Key Outcomes

- **Total classified common entities:** 4,767  
- **Entities successfully matched with ICS_ID:** 4,722  
- **Unmatched entities:** 45 (~0.9%)  
  - Mostly due to punctuation/formatting discrepancies during preprocessing  
  - These remain classified but are excluded from ICS-linked analysis

---

## Output Files

- `df_non_academic_updated.csv` – Fully updated classification of common non-academic entities  
- `common_classified_with_ids.csv` – Classified common entities with matched `ICS_ID`s  


In [1]:
import pandas as pd
from collections import Counter

In [2]:
# Create a dataframe with updated matched categories based on manual annotation

# Load the necessary files
df_annotated = pd.read_excel("../output/unclear_entities_annotated.xlsx")
df_non_academic = pd.read_csv("../output/df_non_academic.csv", index_col=0)

# Index the annotations by Entity_ID so they align with the index of df_non_academic
df_annotated = df_annotated.set_index('Entity_ID')

# Create a copy of the original DataFrame
df_non_academic_updated = df_non_academic.copy()

# Replace values in Matched_Category for entities where annotations exist
df_non_academic_updated.loc[df_annotated.index, 'Matched_Category'] = df_annotated['Annotated_Category']

# Keep only the required columns
entities_categorized = df_non_academic_updated[['ORG_Entity', 'Matched_Category']]
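
The `.loc` assignment above only works cleanly if every `Entity_ID` in the annotation file also exists in the index of `df_non_academic`, and if no annotated ID is repeated. A minimal sketch of such a pre-check is shown below. It uses small made-up frames in place of the real files, and the name `df_updated` is illustrative only.

```python
import pandas as pd

# Toy stand-ins for the real files (all values are invented for illustration)
df_non_academic = pd.DataFrame(
    {
        "ORG_Entity": ["Org A", "Org B", "Org C"],
        "Matched_Category": ["Other / Unclear", "Company / Business", "Other / Unclear"],
    },
    index=pd.Index([1, 2, 3], name="Entity_ID"),
)
df_annotated = pd.DataFrame(
    {"Annotated_Category": ["Media / Publishing", "Religious Organization"]},
    index=pd.Index([1, 3], name="Entity_ID"),
)

# Annotated IDs absent from the base table would make the .loc assignment fail
missing_ids = df_annotated.index.difference(df_non_academic.index)
# Repeated IDs would make it unclear which annotation should win
has_duplicates = not df_annotated.index.is_unique

df_updated = df_non_academic.copy()
if missing_ids.empty and not has_duplicates:
    # The right-hand Series is aligned on Entity_ID, so only annotated rows change
    df_updated.loc[df_annotated.index, "Matched_Category"] = df_annotated["Annotated_Category"]

print(df_updated["Matched_Category"].tolist())
```

Rows without an annotation (here `Entity_ID` 2) keep their original category.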

In [3]:
# Sanity check: compare category distributions before (df_non_academic) and after (entities_categorized) applying the manual annotations

df_non_academic_entities_freq = Counter(df_non_academic["Matched_Category"])
print(df_non_academic_entities_freq)

annotated_entities_freq = Counter(entities_categorized["Matched_Category"])
print(annotated_entities_freq)

Counter({'Other / Unclear': 3101, 'Government / Public Administration': 683, 'Company / Business': 274, 'NGO / Association / Foundation': 189, 'International Organization / EU': 125, 'Media / Publishing': 114, 'Cultural Institution / Arts': 107, 'Education (non-university)': 63, 'Military / Defense / Security': 54, 'Religious Organization': 29, 'Health / Hospitals / Medical': 28})
Counter({'Company / Business': 1203, 'Government / Public Administration': 1082, 'Other / Unclear': 853, 'NGO / Association / Foundation': 386, 'International Organization / EU': 378, 'Media / Publishing': 225, 'Cultural Institution / Arts': 218, 'Health / Hospitals / Medical': 125, 'Education (non-university)': 119, 'Military / Defense / Security': 116, 'Religious Organization': 62})
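
The two printouts above can also be compared arithmetically. The sketch below copies the printed counts into two `Counter` objects, checks that the totals agree, and computes the net change per category. The variable names `before`, `after` and `net_change` are illustrative and are not part of the notebook.

```python
from collections import Counter

# Category counts copied from the two Counter printouts above
before = Counter({
    "Other / Unclear": 3101,
    "Government / Public Administration": 683,
    "Company / Business": 274,
    "NGO / Association / Foundation": 189,
    "International Organization / EU": 125,
    "Media / Publishing": 114,
    "Cultural Institution / Arts": 107,
    "Education (non-university)": 63,
    "Military / Defense / Security": 54,
    "Religious Organization": 29,
    "Health / Hospitals / Medical": 28,
})
after = Counter({
    "Company / Business": 1203,
    "Government / Public Administration": 1082,
    "Other / Unclear": 853,
    "NGO / Association / Foundation": 386,
    "International Organization / EU": 378,
    "Media / Publishing": 225,
    "Cultural Institution / Arts": 218,
    "Health / Hospitals / Medical": 125,
    "Education (non-university)": 119,
    "Military / Defense / Security": 116,
    "Religious Organization": 62,
})

# The update relabels rows; it should neither add nor drop any
assert sum(before.values()) == sum(after.values())

# Net change per category (positive means the category gained entities)
net_change = {category: after[category] - before[category] for category in before}

for category, delta in sorted(net_change.items(), key=lambda item: item[1]):
    print(f"{category}: {delta:+d}")
```

Both distributions sum to 4,767, and the only category that shrinks is "Other / Unclear", which goes from 3,101 to 853 entities.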


In [4]:
# Export df_non_academic_updated to CSV
# Note: index=False leaves the DataFrame index (used above for the Entity_ID join) out of the file

df_non_academic_updated.to_csv("../output/df_non_academic_updated.csv", index=False)

In [13]:
len(df_non_academic_updated)

4767

In [8]:
# Create a df with ICS_ID

# Load saved mapping
df_ics_entities = pd.read_csv("../output/common_entities_with_icsid.csv")

In [12]:
# Inner merge on ORG_Entity: keeps only names present in both tables
common_classified_with_ids = pd.merge(df_ics_entities, df_non_academic_updated, on="ORG_Entity", how="inner")

# Save final result
common_classified_with_ids.to_csv("../output/common_classified_with_ids.csv", index=False)

In [14]:
len(common_classified_with_ids)

6555

In [20]:
common_classified_with_ids["ORG_Entity"].nunique()

4722
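
The merged table has 6,555 rows but only 4,722 distinct names, which is what an inner merge produces when one name appears on several rows of the mapping file. The sketch below reproduces that pattern on invented data and uses the `indicator` and `validate` options of `pd.merge` to audit matched and unmatched names in one pass. The toy frames and the name `audit` are assumptions for illustration, not the notebook's actual data.

```python
import pandas as pd

# Toy data; names and IDs are invented for illustration
df_ics_entities = pd.DataFrame(
    {"ICS_ID": [101, 102, 103], "ORG_Entity": ["Org A", "Org A", "Org B"]}
)
df_classified = pd.DataFrame(
    {
        "ORG_Entity": ["Org A", "Org B", "Org C"],
        "Matched_Category": ["Company / Business", "Media / Publishing", "Other / Unclear"],
    }
)

# An outer merge keeps unmatched rows from both sides; `_merge` records each row's origin.
# validate="many_to_one" raises if a name occurs more than once in the classified table.
audit = pd.merge(
    df_ics_entities,
    df_classified,
    on="ORG_Entity",
    how="outer",
    indicator=True,
    validate="many_to_one",
)

matched = audit[audit["_merge"] == "both"]
n_rows = len(matched)                       # one row per (ICS_ID, name) pair
n_unique = matched["ORG_Entity"].nunique()  # distinct names that found a match
unmatched_names = sorted(audit.loc[audit["_merge"] == "right_only", "ORG_Entity"])

print(n_rows, n_unique, unmatched_names)
```

In this toy case "Org A" is linked to two IDs, so the matched rows outnumber the distinct matched names, and "Org C" surfaces as classified but unlinked.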

In [21]:
# Set difference: which classified ORG_Entities have no ICS_ID match
classified_names = set(df_non_academic_updated["ORG_Entity"])
ids_matched_names = set(df_ics_entities["ORG_Entity"])

unmatched = classified_names - ids_matched_names
print(len(unmatched))
print(sorted(unmatched))

45
['AUTOPART', 'Agencji Ochrony „Gwarant', 'Bertin Instruments', 'Biuro Programu „Niepodległa', 'DAK-POL', 'Elektrociepłownia „Zielona Góra', 'Erasmus', 'Falochron', 'FlexiOss', 'Fundacją "I am kids', 'Fundacją „Pełną Piersią', 'Gdańska Stocznia „REMONTOWA', 'Gdańskiej Stoczni „REMONTOWA', 'Ghetto Fighters', 'Glasspoint Krzemień', 'KK NSZZ „Solidarność', 'KWK „Wujek', 'Kinderkraft', 'Manreza', 'Nasz Bocian', 'Nasz Szczuczyn', 'OMEGA', 'Podlaski Oddział Stowarzyszenia „Wspólnota Polska', 'Polwet-Centrowet” Sp. z o.o.', 'Pracodawcy Lubelszczyzny „Lewiatan', 'Rozgłośni Polskiego Radia „Zachód', 'SE-K”Z', 'Saturn Lis Ceramika Spółka Jawna', 'Spółdzielnia Produkcyjna "Előre', 'Spółką „Destylacje Polskie', 'Stowarzyszenie „Amici del Villaggio', 'Stowarzyszenie „Czajnia', 'Stowarzyszenie „Wioska Gotów w Masłomęczu', 'Stowarzyszeniem „Czajnia', 'TRB”R', 'TechTransBalt', 'Towarystwo „Nadsannia', 'Van Storm', 'Wytwórnia Filmowa „Russkij put', 'X-deep', 'ZG „Sobieski', 'ZM „Jasiołka', 'Zabytek',

### Note on Missing Entity Matches

Out of 4,767 classified non-academic organization names, 4,722 were successfully matched back to their original `ICS_ID`s. The remaining 45 were not matched because cleaning changed their spelling slightly (e.g. removed quotation marks or punctuation), so the two name strings no longer match exactly.

These unmatched entities represent ~0.9% of the dataset and remain included in the classification, but without ICS_ID linkage.
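
One possible way to recover some of these links is to compare names after stripping punctuation and case on both sides. The sketch below is not part of the notebook's pipeline. The helper `normalize_name` is hypothetical, and the "original" spellings in `ics_names` are assumed for illustration, since the notebook does not show how these names appear in the mapping file.

```python
import re
import unicodedata

# Anything that is neither a word character nor whitespace (quotes, hyphens, dots, ...)
_PUNCT = re.compile(r"[^\w\s]")


def normalize_name(name: str) -> str:
    """Return a punctuation-free, case-folded key for loose name matching."""
    name = unicodedata.normalize("NFKC", name)
    name = _PUNCT.sub("", name)
    # Collapse runs of whitespace left behind by removed characters
    return " ".join(name.split()).casefold()


# Two names from the unmatched list above
unmatched = ["KWK „Wujek", "Stowarzyszenie „Czajnia"]

# Assumed spellings in the ICS mapping (hypothetical, for illustration only)
ics_names = ["KWK „Wujek”", "Stowarzyszenie „Czajnia”", "Erasmus+"]

# Look up each unmatched name by its normalized key
lookup = {normalize_name(n): n for n in ics_names}
recovered = {u: lookup.get(normalize_name(u)) for u in unmatched}

for cleaned, original in recovered.items():
    print(f"{cleaned!r} -> {original!r}")
```

Loose keys like this can merge names that are genuinely different (for example, names that differ only by a hyphen), so any recovered pairs would still need a manual check before being linked to an `ICS_ID`.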
