# Academia–Practice Interaction Mapping Using NLP

**Notebook 07: Training the Text Classification Model**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** June 2025  

---

### Notebook Overview

**Goal:** Train and evaluate a text classification model to categorize non-academic organizations based on their names.

This notebook:

- Loads a manually annotated dataset of non-academic entities
- Prepares features using TF-IDF vectorization (unigrams and bigrams)
- Trains a logistic regression classifier with balanced class weights
- Evaluates model performance using precision, recall, and F1-score
- Saves the final model and label encoder for reuse on new entity data

---

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import joblib

# Read and prepare training data for ML

In [2]:
# Isolate Stanza-only entities

# Load the full Stanza NER output and the entities that were already classified
df_stanza_all = pd.read_csv("../output/ner_stanza_pl.csv")
df_common = pd.read_csv("../output/common_org_entities.csv")

In [3]:
df_stanza_all.head(5)

Unnamed: 0,Text,ORG_Entities_stanza,ICS_ID
0,Badania skupiające się na szczegółowej analizi...,['Komitetu Nauk Weterynaryjnych i Rozrodu Zwie...,00153fbd-82f7-48c4-b5bd-e830bc390244
1,"Birdwatching, czyli obserwacje w terenie ptakó...","['Królewskie Towarzystwo Ochrony Ptaków', 'Fac...",002768f1-8b96-4e0f-bcc8-192eb0594e60
2,Efektywny transfer wiedzy jest podstawowym czy...,"['MŚP', 'MŚP', 'ETW']",00500483-f00c-4410-b6f7-8650a003125f
3,Ważnym obszarem działalności naukowej WSPiA je...,"['WSPiA', 'AP', 'WSPiA', '4 Zespoły', 'AP', 'A...",006e7fef-2083-426d-9c1b-1affd27b939e
4,Znaczna część europejskiego dziedzictwa archeo...,"['Interreg Central Europe', 'Archaeological He...",00901439-d91a-48e0-903a-26a4253c3a0c


In [4]:
# Check the data type of the values in the entity column

type(df_stanza_all["ORG_Entities_stanza"].iloc[0])

str
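
Because each cell holds a string rather than a list, it has to be parsed before the entities can be iterated. The snippet below is a minimal sketch of that step on a single illustrative value, shaped like the rows shown above; `raw_cell` and `parsed` are names introduced here for illustration only.

```python
from ast import literal_eval

# Illustrative value shaped like one cell of the ORG_Entities_stanza column
raw_cell = "['MŚP', 'MŚP', 'ETW']"

# literal_eval only evaluates Python literals, so it is safer than eval()
parsed = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(parsed).__name__)
print(parsed)
```

Iterating over `raw_cell` directly would yield single characters, which is why the flattening cell below calls `literal_eval` first.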

In [6]:
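# Flatten the Stanza output into (ICS_ID, ORG_Entity) pairs

# List to hold the flattened records
all_stanza_entities = []

# Loop through each row of the Stanza data
for _, row in df_stanza_all.iterrows():
    ics_id = row["ICS_ID"]
    try:
        # The column stores stringified lists, so parse them back into Python lists
        org_list = literal_eval(row["ORG_Entities_stanza"])
    except (ValueError, SyntaxError, TypeError):
        # Skip rows whose entity list cannot be parsed
        continue
    if not isinstance(org_list, list):
        continue
    for org in org_list:
        # Check the type before stripping, so that one non-string item
        # does not discard the remaining entities of the row
        if isinstance(org, str) and org.strip():
            all_stanza_entities.append((ics_id, org.strip()))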

# Create dataframe
all_stanza_flat = pd.DataFrame(all_stanza_entities, columns=["ICS_ID", "ORG_Entity"])
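
The same flattening can also be written without an explicit row loop. The following is a sketch of one possible alternative using `DataFrame.explode`, run on a made-up toy frame; the helper `parse_orgs` and the frames `toy` and `flat` are hypothetical names, not part of the project code.

```python
from ast import literal_eval

import pandas as pd

# Toy frame with the same two columns as the Stanza output (values are made up)
toy = pd.DataFrame({
    "ICS_ID": ["a", "b", "c"],
    "ORG_Entities_stanza": ["['X ', 'Y']", "[]", "['broken"],
})


def parse_orgs(cell):
    """Return the parsed list of entities, or an empty list if parsing fails."""
    try:
        value = literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    return value if isinstance(value, list) else []


# One row per (ICS_ID, entity); empty lists become NaN and are dropped
flat = (
    toy.assign(ORG_Entity=toy["ORG_Entities_stanza"].map(parse_orgs))
       .explode("ORG_Entity", ignore_index=True)
       .dropna(subset=["ORG_Entity"])
)
flat["ORG_Entity"] = flat["ORG_Entity"].astype(str).str.strip()
flat = flat.loc[flat["ORG_Entity"] != "", ["ICS_ID", "ORG_Entity"]].reset_index(drop=True)

print(flat)
```

On a dataset of this size either approach should be fast enough; the loop version is arguably easier to debug row by row.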

In [7]:
all_stanza_flat.head(20)

Unnamed: 0,ICS_ID,ORG_Entity
0,00153fbd-82f7-48c4-b5bd-e830bc390244,Komitetu Nauk Weterynaryjnych i Rozrodu Zwierz...
1,00153fbd-82f7-48c4-b5bd-e830bc390244,Rady Doradczej
2,00153fbd-82f7-48c4-b5bd-e830bc390244,Advisory Board
3,00153fbd-82f7-48c4-b5bd-e830bc390244,Medycyna Weterynaryjna
4,00153fbd-82f7-48c4-b5bd-e830bc390244,Journal of Applied Genetics
5,00153fbd-82f7-48c4-b5bd-e830bc390244,SPRINGER
6,00153fbd-82f7-48c4-b5bd-e830bc390244,Gene
7,00153fbd-82f7-48c4-b5bd-e830bc390244,ELSEVIER
8,00153fbd-82f7-48c4-b5bd-e830bc390244,Scientific Reports
9,00153fbd-82f7-48c4-b5bd-e830bc390244,Zespołu


In [9]:
# Filter out entities that were already processed (the 6,111 "common" entities)

# Set of already classified entities
common_entities_set = set(df_common['ORG_Entity'].str.strip())

# Keep only stanza-only entries
df_stanza_only = all_stanza_flat[~all_stanza_flat["ORG_Entity"].isin(common_entities_set)].copy()

# Check number of unique entities and entries
print("Unique entities:", df_stanza_only["ORG_Entity"].nunique())
print("Total rows:", len(df_stanza_only))

Unique entities: 11877
Total rows: 15092
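
The gap between the two counts arises because the same entity can appear in several documents. As a small illustration of the filter and of the two counts, here is a sketch on made-up data; the frames `flat_toy`, `common_toy`, and `only_toy` are hypothetical.

```python
import pandas as pd

# Made-up flattened entities and a made-up "common" list
flat_toy = pd.DataFrame({
    "ICS_ID": ["a", "a", "b", "c"],
    "ORG_Entity": ["ELSEVIER", "Gene", "Gene", "MŚP"],
})
common_toy = pd.DataFrame({"ORG_Entity": [" ELSEVIER ", "MŚP"]})

# Same anti-join pattern as above: strip, build a set, keep rows not in it
common_set = set(common_toy["ORG_Entity"].str.strip())
only_toy = flat_toy[~flat_toy["ORG_Entity"].isin(common_set)].copy()

# Sanity check: nothing from the common set survives the filter
assert not set(only_toy["ORG_Entity"]) & common_set

print("Unique entities:", only_toy["ORG_Entity"].nunique())
print("Total rows:", len(only_toy))
```

Here "Gene" remains once as a unique entity but twice as a row, because it occurs under two different `ICS_ID` values.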


In [10]:
# Save the Stanza-only entities to CSV

df_stanza_only.to_csv("../output/stanza_only_entities.csv", index=False)

# Training a Text Classification Pipeline Using TF-IDF and Logistic Regression

In [9]:
# Load the labeled dataset

df_labeled = pd.read_csv("../output/df_non_academic_updated.csv", index_col=0)
df_labeled.columns

Index(['ORG_Entity', 'Lemma_Entity', 'Suspicious', 'Lemma_Cleaned',
       'Matched_Category'],
      dtype='object')

In [10]:
# Prepare features and labels

X = df_labeled['Lemma_Entity'].astype(str)
y = df_labeled['Matched_Category'].astype(str)

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, stratify=y_encoded, test_size=0.2, random_state=42)
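
The `stratify` argument is what keeps the class proportions the same in both splits, which matters when some categories are rare. The sketch below checks this on synthetic labels; the class names and counts are invented for illustration and do not reflect the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic, imbalanced labels: 60% A, 30% B, 10% C
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
names = [f"entity {i}" for i in range(100)]

enc = LabelEncoder()
y_enc = enc.fit_transform(labels)

X_tr, X_va, y_tr, y_va = train_test_split(
    names, y_enc, stratify=y_enc, test_size=0.2, random_state=42
)

# Share of each class in the training and validation splits
train_share = np.bincount(y_tr) / len(y_tr)
val_share = np.bincount(y_va) / len(y_va)

print("train:", train_share)
print("val:  ", val_share)
```

In the notebook itself, the same check could be run on `y_train` and `y_val`, with `label_encoder.classes_` supplying the category names.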

In [21]:
# Train a classifier

# Define a pipeline combining TF-IDF vectorization and Logistic Regression classifier
pipeline = Pipeline([
    # Step 1: Convert text into TF-IDF features (using unigrams and bigrams)
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=1)),

    # Step 2: Train a logistic regression model with balanced class weights
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate the model on the validation set and print accuracy
print("Validation accuracy:", pipeline.score(X_val, y_val))

Validation accuracy: 0.6257861635220126


In [22]:
# Print per-class precision, recall, and F1-score on the validation set

y_pred = pipeline.predict(X_val)
print(classification_report(y_val, y_pred, target_names=label_encoder.classes_))

                                    precision    recall  f1-score   support

                Company / Business       0.92      0.45      0.60       241
       Cultural Institution / Arts       0.63      0.50      0.56        44
        Education (non-university)       0.81      0.54      0.65        24
Government / Public Administration       0.90      0.67      0.77       216
      Health / Hospitals / Medical       0.79      0.44      0.56        25
   International Organization / EU       0.73      0.47      0.58        76
                Media / Publishing       0.64      0.64      0.64        45
     Military / Defense / Security       0.64      0.78      0.71        23
    NGO / Association / Foundation       0.86      0.65      0.74        77
                   Other / Unclear       0.37      0.92      0.53       171
            Religious Organization       1.00      0.67      0.80        12

                          accuracy                           0.63       954
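
The report shows how well each class is predicted but not which classes are confused with each other. A confusion matrix would show that. The sketch below builds one from hypothetical labels; the three class names are taken from the report, while `y_true_toy` and `y_pred_toy` are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Three of the category names from the report, with made-up labels
classes = ["Company / Business", "Media / Publishing", "Other / Unclear"]
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 2, 2, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
cm_df = pd.DataFrame(cm, index=classes, columns=classes)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"

print(cm_df)
```

In the notebook, the analogous call would presumably be `confusion_matrix(y_val, y_pred)` with `label_encoder.classes_` as row and column names. Reading down the "Other / Unclear" column would then show which categories are being absorbed into it.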
          

In [51]:
# Save the trained model and label encoder

joblib.dump(pipeline, "../output/final_org_classifier.joblib")
joblib.dump(label_encoder, "../output/org_label_encoder.joblib")

['../output/org_label_encoder.joblib']
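
For reuse, both files need to be loaded together, because the pipeline predicts integer codes and the label encoder maps them back to category names. The following is a self-contained sketch of that round trip. It trains a tiny stand-in model on made-up names and writes to a temporary directory, so the file names and training data are assumptions and not the project's artifacts.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in training set (made up); the notebook trains on df_labeled instead
names = ["ministerstwo zdrowie", "ministerstwo finanse", "fundacja batory", "fundacja nauka"]
cats = ["Government / Public Administration"] * 2 + ["NGO / Association / Foundation"] * 2

encoder = LabelEncoder()
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
]).fit(names, encoder.fit_transform(cats))

# Save and reload both artifacts, as the cell above does with ../output/ paths
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "clf.joblib")
    enc_path = os.path.join(tmp, "enc.joblib")
    joblib.dump(model, model_path)
    joblib.dump(encoder, enc_path)
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(enc_path)

# Predict integer codes, then map them back to category names
new_names = ["ministerstwo kultura", "fundacja rozwój"]
predicted = loaded_encoder.inverse_transform(loaded_model.predict(new_names))
print(list(predicted))
```

New entity names should go through the same preprocessing as the training text (here, the lemmatized form in `Lemma_Entity`) before being passed to `predict`.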

---

### Summary

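- Trained a logistic regression classifier on the 80% training split of 4,767 labeled non-academic entities, with 954 held out for validation
- Used TF-IDF vectorization with unigrams and bigrams
- Validation accuracy: **62.6%**
- Precision is 0.90 or higher for Company / Business, Government / Public Administration, and Religious Organization, but recall is below 0.70 for every class except Military / Defense / Security (0.78) and Other / Unclear (0.92)
- Other / Unclear has low precision (0.37), which indicates that many entities from other categories are being assigned to it
- Saved the model and label encoder for future use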
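
As a sanity check on the reported figures, overall accuracy should equal the support-weighted mean of the per-class recalls. The sketch below recomputes it from the rounded values in the classification report and compares it with a majority-class baseline. Because the recalls are rounded to two decimals, the result is only approximate.

```python
# (recall, support) per class, copied from the classification report above
report = {
    "Company / Business": (0.45, 241),
    "Cultural Institution / Arts": (0.50, 44),
    "Education (non-university)": (0.54, 24),
    "Government / Public Administration": (0.67, 216),
    "Health / Hospitals / Medical": (0.44, 25),
    "International Organization / EU": (0.47, 76),
    "Media / Publishing": (0.64, 45),
    "Military / Defense / Security": (0.78, 23),
    "NGO / Association / Foundation": (0.65, 77),
    "Other / Unclear": (0.92, 171),
    "Religious Organization": (0.67, 12),
}

total = sum(support for _, support in report.values())

# Accuracy = correctly classified / total = sum(recall_i * support_i) / total
weighted_recall = sum(recall * support for recall, support in report.values()) / total

# Accuracy of always predicting the largest validation class
majority_baseline = max(support for _, support in report.values()) / total

print(total, round(weighted_recall, 3), round(majority_baseline, 3))
```

The recomputed value agrees with the reported 62.6%. The baseline of roughly 25% gives a reference point for judging that figure.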

This model will now be applied to the remaining Stanza-only entities in the next step.