# Academia–Practice Interaction Mapping Using NLP

**Notebook 06: Entity Categorization (Final Update)**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** June 2025  

---

### Notebook Overview

**Goal:** Finalize the classification of non-academic organizations by incorporating manual annotations for previously unclear entities.

This notebook:

- Loads manual annotations for entities initially labeled as “Other / Unclear”
- Merges them with the full list of non-academic organizations
- Updates the Matched_Category values based on human-coded categories
- Produces a finalized dataset of categorized non-academic entities

---

In [1]:
import pandas as pd
from collections import Counter

In [2]:
# Create a dataframe with updated matched categories based on manual annotation

# Load the necessary files
df_annotated = pd.read_excel("../output/unclear_entities_annotated.xlsx")
df_non_academic = pd.read_csv("../output/df_non_academic.csv", index_col=0)

# Set Entity_ID as the index of df_annotated so it aligns with the index of df_non_academic
df_annotated = df_annotated.set_index('Entity_ID')

# Create a copy of the original DataFrame
df_non_academic_updated = df_non_academic.copy()

# Replace values in Matched_Category for entities where annotations exist
df_non_academic_updated.loc[df_annotated.index, 'Matched_Category'] = df_annotated['Annotated_Category']

# Keep only the required columns
entities_categorized = df_non_academic_updated[['ORG_Entity', 'Matched_Category']]
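
The direct `.loc` assignment in this cell assumes that every `Entity_ID` in the annotation file also exists in the index of `df_non_academic`. A minimal sketch of a more defensive variant is shown below. It uses small toy frames with hypothetical values (the entity names and IDs are not taken from the project data), and it assumes that `Entity_ID` is the index of both frames.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames; all values here are hypothetical
df_non_academic = pd.DataFrame(
    {
        "ORG_Entity": ["Acme Ltd", "City Hall", "Unknown Org"],
        "Matched_Category": [
            "Other / Unclear",
            "Government / Public Administration",
            "Other / Unclear",
        ],
    },
    index=pd.Index([1, 2, 3], name="Entity_ID"),
)
df_annotated = pd.DataFrame(
    {"Annotated_Category": ["Company / Business", "Media / Publishing"]},
    index=pd.Index([1, 99], name="Entity_ID"),  # ID 99 is absent from the main frame
)

# Restrict the update to IDs present in both frames
common_ids = df_non_academic.index.intersection(df_annotated.index)
# Collect annotated IDs that have no counterpart, so they can be inspected
missing_ids = df_annotated.index.difference(df_non_academic.index)

df_updated = df_non_academic.copy()
df_updated.loc[common_ids, "Matched_Category"] = df_annotated.loc[
    common_ids, "Annotated_Category"
]

print(f"Updated {len(common_ids)} entities; {len(missing_ids)} annotated IDs not found")
```

Reporting the unmatched IDs rather than silently dropping them makes it easier to spot a mismatch between the annotation file and the entity list.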

In [3]:
# Sanity check: compare the category distributions before (df_non_academic) and after (entities_categorized) the update

df_non_academic_entities_freq = Counter(df_non_academic["Matched_Category"])
print(df_non_academic_entities_freq)

annotated_entities_freq = Counter(entities_categorized["Matched_Category"])
print(annotated_entities_freq)

Counter({'Other / Unclear': 3101, 'Government / Public Administration': 683, 'Company / Business': 274, 'NGO / Association / Foundation': 189, 'International Organization / EU': 125, 'Media / Publishing': 114, 'Cultural Institution / Arts': 107, 'Education (non-university)': 63, 'Military / Defense / Security': 54, 'Religious Organization': 29, 'Health / Hospitals / Medical': 28})
Counter({'Company / Business': 1203, 'Government / Public Administration': 1082, 'Other / Unclear': 853, 'NGO / Association / Foundation': 386, 'International Organization / EU': 378, 'Media / Publishing': 225, 'Cultural Institution / Arts': 218, 'Health / Hospitals / Medical': 125, 'Education (non-university)': 119, 'Military / Defense / Security': 116, 'Religious Organization': 62})
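
The two printed distributions can also be compared programmatically. The sketch below copies the counts from the output above into two `Counter` objects (the names `before` and `after` are illustrative), checks that the total number of entities is unchanged, and prints the per-category change. Both distributions sum to 4,767 entities, and "Other / Unclear" falls from 3,101 to 853.

```python
from collections import Counter

# Counts copied from the printed output of the sanity-check cell
before = Counter({
    'Other / Unclear': 3101, 'Government / Public Administration': 683,
    'Company / Business': 274, 'NGO / Association / Foundation': 189,
    'International Organization / EU': 125, 'Media / Publishing': 114,
    'Cultural Institution / Arts': 107, 'Education (non-university)': 63,
    'Military / Defense / Security': 54, 'Religious Organization': 29,
    'Health / Hospitals / Medical': 28,
})
after = Counter({
    'Company / Business': 1203, 'Government / Public Administration': 1082,
    'Other / Unclear': 853, 'NGO / Association / Foundation': 386,
    'International Organization / EU': 378, 'Media / Publishing': 225,
    'Cultural Institution / Arts': 218, 'Health / Hospitals / Medical': 125,
    'Education (non-university)': 119, 'Military / Defense / Security': 116,
    'Religious Organization': 62,
})

# Re-categorization should move entities between categories, not add or drop rows
total_before = sum(before.values())
total_after = sum(after.values())
assert total_before == total_after, "Row count changed during the update"

# Per-category change, largest category (before the update) first
for category, count in before.most_common():
    delta = after[category] - count
    print(f"{category}: {count} -> {after[category]} ({delta:+d})")
```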


In [None]:
# Export df_non_academic_updated to CSV

df_non_academic_updated.to_csv(