# Academia–Practice Interaction Mapping Using NLP

**Notebook 01: Data Preparation**

**Author:** Kamila Lewandowska  
**Project Status:** *In Progress*  
**Last Updated:** April 2025  

---
### Notebook Overview

This notebook prepares the raw dataset of social impact case studies, submitted by Polish universities, for Named Entity Recognition (NER).
It loads the data, applies basic preprocessing, and exports a cleaned version used in subsequent NLP tasks.

---

### Project Description

This project explores academia–practice collaboration through automated text analysis. Using NER models, it identifies non-academic organizations mentioned in impact narratives and builds an empirical foundation for analyzing the societal engagement of academic institutions.

---

### File Overview

- Input data: `data/merged_impact_case_studies.csv`
- Output data: `data/cleaned_impact_case_studies.csv` (used by `02_ner_extraction.ipynb`)
  
---


# Read and prepare data for NER extraction

In [1]:
# Import required libraries
import pandas as pd
import re
from pathlib import Path

In [2]:
# Read data from a CSV file with descriptions of the societal impact of research

ics = pd.read_csv("../data/merged_impact_case_studies.csv")

In [3]:
# Select list of columns to include for analysis
columns_to_include_pl = [
    'Impact description identifier - POL-on 2.0 system uuid',
    'Identifier of the institution to which the impact description is assigned - POL-on 2.0 system uuid',
    'Domain name', 'Discipline name',
    'Title (Polish version)', 'Impact (Polish version)',
    'The leading area of impact',
]

# Filter to keep only the necessary columns and drop duplicates
ics_selected_columns_pl = ics[columns_to_include_pl].drop_duplicates()

# Reset the index so it is contiguous after dropping duplicates
ics_selected_columns_pl.reset_index(drop=True, inplace=True)
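
The selection step above can be sketched with an explicit check for missing columns, which gives a clearer error than the default `KeyError` if the input file changes. This is a minimal illustration on a toy frame, not the notebook's own code: the helper name `select_columns` and the toy data are assumptions.

```python
import pandas as pd


def select_columns(df, columns):
    """Hypothetical helper: keep `columns`, drop duplicate rows, reset the index."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[columns].drop_duplicates().reset_index(drop=True)


# Toy data standing in for the real case-study file (assumption).
toy = pd.DataFrame({
    "Title (Polish version)": ["A", "A", "B"],
    "Impact (Polish version)": ["x", "x", "y"],
    "Extra": [1, 2, 3],
})

selected = select_columns(toy, ["Title (Polish version)", "Impact (Polish version)"])
print(selected.shape)
```

Because `Extra` is excluded before deduplication, the first two toy rows become identical and only one of them is kept.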



In [4]:
# Combine 'Impact (Polish version)' and 'Title (Polish version)' into a single text series
texts_pl = ics_selected_columns_pl['Impact (Polish version)'].fillna('') + " " + ics_selected_columns_pl['Title (Polish version)'].fillna('')
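
The `fillna('')` calls matter here: without them, a missing title or a missing impact text would turn the whole combined value into a missing value. The sketch below shows the difference on a two-row toy frame (the data is an assumption made for illustration).

```python
import pandas as pd

# Toy frame with one missing value in each column (assumption, not real data).
toy = pd.DataFrame({
    "Impact (Polish version)": ["Opis wpływu A", None],
    "Title (Polish version)": [None, "Tytuł B"],
})

# Without fillna, any missing part makes the whole concatenation missing.
without_fill = toy["Impact (Polish version)"] + " " + toy["Title (Polish version)"]

# With fillna, the available part of the text is preserved.
with_fill = (
    toy["Impact (Polish version)"].fillna("")
    + " "
    + toy["Title (Polish version)"].fillna("")
)

print(without_fill.isna().tolist())
print(with_fill.tolist())
```

Note that the separator is kept even when one side is empty, so values may carry a leading or trailing space; calling `.str.strip()` on the result would remove it if that matters downstream.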

In [5]:
# Check the shape of the dataset
print(texts_pl.shape)

(2661,)


In [6]:
# Define text preprocessing

# Function to normalize text: replace full URLs with their domain names
def preprocess_text(text):
    """
    Normalizes URLs in the input text by replacing full URLs with just the domain name.

    Parameters:
        text (str): A string of text potentially containing URLs.

    Returns:
        str: The input text with URLs simplified to their domain names.
    """
    text = re.sub(r'https?://(?:www\.)?([a-zA-Z0-9.-]+)(?:/[\w./-]*)?', r'\1', text)  # Normalize URLs
    return text
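
To make the behavior of the pattern concrete, the sketch below applies the same regular expression to a few made-up sentences. The sample strings and the name `URL_PATTERN` are assumptions for illustration; the pattern itself is copied from the function above.

```python
import re

# Same pattern as in preprocess_text; compiled once under a hypothetical name.
URL_PATTERN = re.compile(r'https?://(?:www\.)?([a-zA-Z0-9.-]+)(?:/[\w./-]*)?')


def preprocess_text(text):
    """Replace each URL with its domain name (scheme, 'www.' and path removed)."""
    return URL_PATTERN.sub(r'\1', text)


# Made-up sample sentences (assumption, not taken from the dataset).
samples = [
    "Raport dostępny na https://www.example.org/reports/2021.pdf oraz w bibliotece.",
    "Zobacz http://example.com",
    "Wyszukiwarka: https://example.org/search?q=ner",
]

results = [preprocess_text(s) for s in samples]
for r in results:
    print(r)
```

The third sample shows a limitation of this pattern: the path part does not match `?`, so a query string such as `?q=ner` stays in the text after the domain. Whether that matters depends on how often such URLs occur in the case studies.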

In [7]:
# Preprocess text
cleaned_text_pl = texts_pl.apply(preprocess_text)

In [8]:
# Add cleaned text to dataframe and inspect basic stats

ics_selected_columns_pl['cleaned_text'] = cleaned_text_pl

# Compute per-document word counts once and reuse them
word_counts = cleaned_text_pl.apply(lambda x: len(str(x).split()))
print(f"Average word count: {word_counts.mean():.2f}")
print(f"Total words: {word_counts.sum()}")

Average word count: 550.57
Total words: 1465061


In [9]:
# Save cleaned dataset for NER pipeline

ics_selected_columns_pl.to_csv("../data/cleaned_impact_case_studies.csv", index=False)
print("✅ Cleaned dataset saved to ../data/cleaned_impact_case_studies.csv")

✅ Cleaned dataset saved to ../data/cleaned_impact_case_studies.csv
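
Since the next notebook reads this file back, a quick round-trip check can confirm that the export preserves the cleaned text. The sketch below writes a toy frame to a temporary directory rather than to the project's `data/` folder; the toy data and the variable names are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset (assumption).
toy = pd.DataFrame({"cleaned_text": ["example.org oraz tekst", "drugi opis"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_impact_case_studies.csv"
    toy.to_csv(out_path, index=False)   # same call as in the cell above
    reloaded = pd.read_csv(out_path)    # what a downstream notebook would do

# The reloaded frame should have the same shape and the same text values.
roundtrip_ok = (
    reloaded.shape == toy.shape
    and reloaded["cleaned_text"].tolist() == toy["cleaned_text"].tolist()
)
print(roundtrip_ok)
```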
