## In this notebook:
__INPUT__: post-semantic search subset

__PROCESS__: (1) NER and (2) RE

__NER__
1. Split article_text into sentences
2. For each sentence, apply a quick pre-filter: carry forward only sentences containing any of event_lemmas or event_phrases
3. Extract WHAT HAPPENED: define custom pattern-matching for EVENT types via spaCy's EntityRuler component in the nlp pipeline, accounting for verb-form events in the event_lemmas object
4. Extract WHERE: look for VENUE and LOCATION using GLiNER
5. Output df with kept sentences, and NER-extracted event, venue, and location, keeping only sentences that have at least one of venue or location
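
The keep rule in step 5 can be sketched on plain dictionaries. This is only an illustration: the `rows` values and the `keep_row` helper are hypothetical, and the notebook itself applies the equivalent filter to a pandas DataFrame (which also requires the event column to be filled).

```python
# Hypothetical rows standing in for the post-NER output (made-up values).
rows = [
    {"sentence": "s1", "event": "meeting", "venue": "Town Hall", "location": None},
    {"sentence": "s2", "event": "meeting", "venue": None, "location": None},
    {"sentence": "s3", "event": None, "venue": None, "location": "Hull"},
]

def keep_row(row):
    """Keep a sentence only if it has an event AND at least one of venue/location."""
    has_event = row["event"] is not None
    has_place = row["venue"] is not None or row["location"] is not None
    return has_event and has_place

kept = [r["sentence"] for r in rows if keep_row(r)]
print(kept)
```

Only the first row survives: the second has no place, and the third has no event.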

__RE__:

6. For each row (sentence) in post-NER df, conduct dependency parsing using spaCy and convert to undirected graph using NetworkX
7. For each NER-extracted event, locate the event token or phrase (for single-word and multi-word events respectively) in the sentence, and take that token as the 'anchor point' from which syntactic distances are calculated
8. For each identified event in a row, find the syntactically closest venue and location using the find_closest_match_for_type() function, which calls nx's shortest_path_length() to measure the distance from the anchor to each candidate venue/location entity
9. Output df with added columns for matched_venue and matched_location
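
To make steps 6 to 8 concrete, here is a minimal standard-library sketch of "syntactic distance" as the number of edges between two tokens in an undirected dependency graph. The toy sentence and its edges are hand-written assumptions, not real spaCy parser output, and the breadth-first search stands in for NetworkX's shortest_path_length(); the helper names are hypothetical.

```python
from collections import deque

# Hand-written (head, child) dependency edges for the toy sentence
# "A meeting was held at the Town Hall in Birmingham".
# These edges are an assumption for illustration, not real parser output.
tokens = ["A", "meeting", "was", "held", "at", "the", "Town", "Hall", "in", "Birmingham"]
edges = [(1, 0), (3, 1), (3, 2), (3, 4), (4, 7), (7, 5), (7, 6), (7, 8), (8, 9)]

def build_undirected(edges):
    """Adjacency map that ignores edge direction, like an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def path_length(adj, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

adj = build_undirected(edges)
anchor = tokens.index("meeting")            # event anchor token
venue_head = tokens.index("Hall")           # head of the venue span "Town Hall"
location_head = tokens.index("Birmingham")  # single-token location
dist_venue = path_length(adj, anchor, venue_head)
dist_location = path_length(adj, anchor, location_head)
print(dist_venue, dist_location)
```

With these made-up edges the venue is 3 edges from the anchor and the location is 5, so a maximum path length of 4 would keep the venue and discard the location.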

## Imports

In [8]:
import pandas as pd
import numpy as np
import os
import gc # garbage....hehe

# NERcessities
import spacy
# !python -m spacy download en_core_web_lg
from collections import defaultdict
from spacy.pipeline import EntityRuler
# !pip3 install gliner
from gliner import GLiNER
import networkx as nx

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [5]:
# change directory if necessary
# os.chdir("../..")
os.getcwd()

'/home/jovyan/work'

## NLP models

In [9]:
nlp = spacy.load("en_core_web_lg")
nermodel = GLiNER.from_pretrained("EmergentMethods/gliner_medium_news-v2.1")




## Big Fat Event/Location Extraction Function-defining Cells

In [11]:
# ==============================================================================
# SETTING UP
# ==============================================================================

# Defining events to look out for (both noun and verb forms, matched by LEMMA)
# Used in step 2 (pre-filter) and step 3 (spaCy event extraction)
event_lemmas = [
    "meeting",        # noun
    "meet",           # verb
    "strike",         # noun/verb same lemma
    "protest",        # noun/verb same lemma
    "riot",           # noun/verb same lemma
    "demonstration",  # noun
    "demonstrate",    # verb
    "assembly",       # noun
    "assemble",       # verb
    "gathering",      # noun
    "gather",         # verb
    "lecture"         # noun/verb same lemma
]

# Multi-word phrase (not lemmatised)
event_phrases = ["public meeting"]

# Map noun and verb lemmas to a single event type so every form of an event keyword is identified as that type
# Used in step 3 (spaCy event extraction) and in event_loc_pairing() to match verbs to noun event types
nounverb_map = {
    "meet": "meeting",
    "meeting": "meeting",
    "assemble": "assembly",
    "assembly": "assembly",
    "gather": "gathering",
    "gathering": "gathering",
    "demonstrate": "demonstration",
    "demonstration": "demonstration",
    "riot": "riot",
    "strike": "strike",
    "protest": "protest",
    "lecture": "lecture"
}

gliner_labels = ['venue', 'location']

# ==============================================================================
# ENTITYRULER SETUP TO DETECT EVENTS
# ==============================================================================

# this is necessary because "EVENT" will not traditionally get my collective action stuff!

# Create and add the EntityRuler BEFORE the statistical NER
if "entity_ruler" not in nlp.pipe_names:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
else:
    ruler = nlp.get_pipe("entity_ruler")

event_patterns = []

# Add single-token patterns for all lemmas
for lemma in event_lemmas:
    event_patterns.append({"label": "EVENT", "pattern": [{"LEMMA": lemma}]})

# Add multi-word phrase patterns
for phrase in event_phrases:
    tokens = phrase.split()
    pattern = [{"LOWER": tok} if i < len(tokens)-1 else {"LEMMA": tokens[-1]} for i, tok in enumerate(tokens)]
    event_patterns.append({"label": "EVENT", "pattern": pattern})

# Add patterns to the ruler
ruler.add_patterns(event_patterns)
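
The multi-word pattern construction above is easiest to see in isolation. The sketch below repeats the same list comprehension in a stand-alone helper; the name `build_phrase_pattern` is hypothetical and spaCy is not needed to run it.

```python
def build_phrase_pattern(phrase):
    """Exact lowercase match on every token except the last, which is matched by lemma."""
    tokens = phrase.split()
    return [
        {"LOWER": tok} if i < len(tokens) - 1 else {"LEMMA": tok}
        for i, tok in enumerate(tokens)
    ]

pattern = build_phrase_pattern("public meeting")
print(pattern)  # [{'LOWER': 'public'}, {'LEMMA': 'meeting'}]
```

Matching the final token by lemma is presumably what allows inflected forms such as "public meetings" to match, assuming the lemmatizer reduces "meetings" to "meeting".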

In [20]:
# ==============================================================================
# NER BUILDING BLOCK FUNCTIONS
# ==============================================================================

# for step 2: FUNCTION TO CHECK IF SENTENCES CONTAIN EVENT KEYWORDS (LEMMA OR PHRASE)
# note: this is a 'cheap' command+f type pre-filtering step

def contains_event(sent, lemmas_to_check=event_lemmas, phrases_to_check=event_phrases):
    """
    Checks if a spaCy sentence Span contains event lemma or phrase.
    """
    text_lower = sent.text.lower()
    # Check for multi-word phrases first
    for phrase in phrases_to_check:
        if phrase.lower() in text_lower:
            return True
            
    # Then check for single-word lemmas
    for token in sent:
        if token.lemma_.lower() in lemmas_to_check:
            return True
            
    return False
    
# for step 3: SPACY FOR EVENTS (INCL. VERB -> NOUN MAPPING) 
def extract_events(input_text):
    """
    Uses SpaCy's custom rule-based EVENT extraction to identify events, both in their noun and verb forms.
    """
    doc = nlp(str(input_text))
    mapped_events = []

    for ent in doc.ents:
        if ent.label_ == 'EVENT':
            
            # for multi-word phrases such as "public meeting"
            if ent.text.lower() in event_phrases:
                mapped_events.append(ent.text.lower())
                continue  # don't also map the phrase's first token below
                
            # for the rest of the single-word event types
            lemma = ent[0].lemma_.lower()
            # Map to the canonical event label if available
            if lemma in nounverb_map:
                mapped_events.append(nounverb_map[lemma])

    extracted_events = ', '.join(mapped_events) if mapped_events else None
    return extracted_events

# for step 4: GLINER FOR LOCATIONS 
def extract_locations(input_text, labels=gliner_labels, gliner_confidence=0.5):
    """
    Extracts location entities (venue, location) using GLiNER.
    Deduplicates and returns them as comma-separated strings.
    NOTE: gliner_confidence threshold defaults to 0.5 but this can be overridden in the master process_articles_ner() function.
    """
    entities_by_label = defaultdict(list)
    output_dict = {}

    # Extract entities using GLiNER
    entities = nermodel.predict_entities(input_text, 
                                         labels, 
                                         threshold=gliner_confidence) 

    # Group entities by their label
    for entity in entities:
        entities_by_label[entity['label']].append(entity['text'])

    # Process each label for unique, comma-separated strings
    for label in labels:
        if label in entities_by_label and entities_by_label[label]:
            # Deduplicate while preserving case and order
            unique_entities = []
            seen_lower = set()
            for entity in entities_by_label[label]:
                if entity.lower() not in seen_lower:
                    unique_entities.append(entity)
                    seen_lower.add(entity.lower())
            
            output_dict[label] = ", ".join(unique_entities)
        else:
            output_dict[label] = None # Ensure the key exists, even if no entities were found
            
    return output_dict
    
# ==============================================================================
# NER MASTER FUNCTION
# ==============================================================================

def process_articles_ner(df, gliner_confidence=0.5):
    """
    1. Splits article_text for each row into sentences.
    2. Pre-filters each sentence with contains_event(), keeping sentences that contain words in event_lemmas or event_phrases.
    3. On the remaining sentences, extracts EVENT, VENUE and LOCATION by calling extract_events() and extract_locations().
    4. Keeps only sentences with EVENT and at least one of VENUE/LOCATION filled.
    5. Returns a df with columns: ['corpus_id', 'sentence', 'event', 'venue', 'location'].

    """
    results = []
    total_sentences = 0
    total_kept_sentences = 0
    
    for _, row in df.iterrows():
        corpus_id = row['corpus_id']
        article = row['article_text']
        
        # STEP 1: SPLIT INTO SENTENCES
        sentences = list(nlp(article).sents)
        total_sentences += len(sentences)

        
        # STEP 2: FILTER FOR SENTENCES CONTAINING KEYWORDS 
        filtered_sents = [s for s in sentences if contains_event(s, event_lemmas, event_phrases)]
        total_kept_sentences += len(filtered_sents)

        for sent in filtered_sents:
            sent_text = sent.text.strip()
            
            # STEP 3: Extract events (from spaCy EntityRuler)
            event_str = extract_events(sent_text)

            
            # STEP 4: Extract locations (from GLiNER)
            loc_dict = extract_locations(sent_text, 
                             labels=['venue', 'location'], # Pass both labels
                             gliner_confidence=gliner_confidence)
            
            # Append result
            results.append({
               "corpus_id": corpus_id,
                "sentence": sent_text,
                "event": event_str,
                "venue": loc_dict.get("venue"), # Get venue from the dictionary
                "location": loc_dict.get("location") # Get location from the dictionary
            })
    
    # Build the frame with explicit columns so an empty result still has them
    result_df = pd.DataFrame(results, columns=["corpus_id", "sentence", "event", "venue", "location"])
    
    # STEP 5: Keep only rows where event AND at least one of venue/location are filled 
    result_df_dropped = result_df[result_df['event'].notna() & 
                        (result_df['venue'].notna() | result_df['location'].notna())].copy()

    # === Print how many sentences were dropped for not containing event_lemma or event_phrase ===
    total_dropped = total_sentences - total_kept_sentences
    pct_sentences = total_dropped / total_sentences if total_sentences else 0.0  # avoid ZeroDivisionError on empty input
    print(f"\nTotal sentences dropped (no event keywords): {total_dropped} out of {total_sentences} ({pct_sentences:.2%})")
    
    # === Print how many rows were dropped because of a missing event or missing venue/location ===
    n_rows_dropped = len(result_df)-len(result_df_dropped)
    pct_rows = n_rows_dropped / len(result_df) if len(result_df) else 0.0  # same guard for an empty frame
    print(f"Rows dropped due to missing event/location: {n_rows_dropped} out of {len(result_df)} ({pct_rows:.2%})")
    
    return result_df_dropped
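
The deduplication inside extract_locations() can be tried without loading GLiNER. The sketch below isolates that logic in a hypothetical helper, `dedupe_preserving_order`, using made-up entity strings.

```python
def dedupe_preserving_order(entities):
    """Drop case-insensitive repeats, keeping the first spelling seen."""
    unique, seen_lower = [], set()
    for entity in entities:
        key = entity.lower()
        if key not in seen_lower:
            unique.append(entity)
            seen_lower.add(key)
    # Mirror the notebook's convention: None when nothing was found
    return ", ".join(unique) if unique else None

joined = dedupe_preserving_order(["HALIFAX", "Halifax", "Town Hall", "town hall"])
print(joined)  # HALIFAX, Town Hall
```

Because the result is a comma-separated string, an entity whose own text contains a comma would later be split into separate pieces when event_loc_pairing() splits on commas; returning a list instead would avoid that, at the cost of changing the output format.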

In [13]:
# ==============================================================================
# RE BUILDING BLOCK FUNCTIONS
# ==============================================================================

# for step 6: function to build dependency graph from sentence
def build_dependency_graph(doc):
    """Convert a spaCy Doc into an undirected graph where nodes are token indices."""
    edges = []
    for token in doc:
        for child in token.children:
            edges.append((token.i, child.i))
    return nx.Graph(edges)

# for steps 7-8: function to locate a phrase in the sentence (a multi-word event such as "public meeting", or a venue/location entity)
def match_span_indices(doc, phrase):
    """Finds token indices of the first case-insensitive match for a phrase (e.g. Chartist Lecture Room); returns None if there is no match."""
    phrase_tokens = phrase.lower().split()
    for i in range(len(doc) - len(phrase_tokens) + 1):
        window = [t.text.lower() for t in doc[i:i+len(phrase_tokens)]]
        if window == phrase_tokens:
            return list(range(i, i+len(phrase_tokens)))
    return None  # explicit: phrase not found


# for step 8: function to find best-candidate venue/location with the shortest syntactic path to the event token
def find_closest_match_for_type(event_head_token, entity_list, doc, graph):
    """
    Finds the entity from a given list that is syntactically closest to a given event token.

    Args:
        event_head_token (spacy.Token): The head token of the event entity.
        entity_list (list): A list of entity strings to search through (e.g., all venues).
        doc (spacy.Doc): The spaCy Doc object for the sentence.
        graph (nx.Graph): The dependency graph for the sentence.

    Returns:
        tuple: (best_entity_text, min_distance)
               - best_entity_text (str|None): The text of the closest entity found.
               - min_distance (float): The shortest dependency path length.
    """
    best_entity = None
    min_distance = float("inf")

    if not entity_list:
        return None, float("inf")

    for entity_text in entity_list:
        span_indices = match_span_indices(doc, entity_text)
        if not span_indices:
            continue

        # Get the syntactic head of the entity span to represent it in the graph
        entity_span = doc[span_indices[0]:span_indices[-1] + 1]
        entity_head_idx = entity_span.root.i

        try:
            # Calculate the shortest path from the event's head to the entity's head
            path_length = nx.shortest_path_length(graph, source=event_head_token.i, target=entity_head_idx)

            if path_length < min_distance:
                min_distance = path_length
                best_entity = entity_text
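        except (nx.NetworkXNoPath, nx.NodeNotFound):
            # No syntactic path exists between the event and this entity, or one of the
            # tokens has no dependency edges and so never became a node in the graph
            continue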
            
    return best_entity, min_distance

# ==============================================================================
# RE MASTER FUNCTION
# ==============================================================================
def event_loc_pairing(df, max_path_len=4):
    """
    For each event in a sentence, finds the closest 'venue' AND the closest 'location'
    independently, based on the shortest syntactic path.

    It creates one row per event, containing any matched venue and/or location that
    falls within the max_path_len threshold.
    
    Returns a new DataFrame with the paired results.
    """

    results = []

    # running over each row in df and...
    for _, row in df.iterrows():
        sentence_text = row["sentence"]

        # spacy parsing for dependency parsing
        doc = nlp(sentence_text)

        # defining the graph for the current sentence(doc)
        graph = build_dependency_graph(doc)

        # Create lists from comma-separated strings, handling potential None values
        events = [e.strip() for e in str(row["event"]).split(',') if e.strip()] if pd.notna(row["event"]) else []
        venues = [v.strip() for v in str(row["venue"]).split(',') if v.strip()] if pd.notna(row["venue"]) else []
        locations = [l.strip() for l in str(row["location"]).split(',') if l.strip()] if pd.notna(row["location"]) else []

        # If there are no locations/venues to match, skip this sentence. 
        # more of a second safety check since we've already dropped rows with no venue nor location in the NER step
        if not venues and not locations:
            continue
            
        # find syntactically closest venue and location for each event in events
        for event in events:
            # initialising
            event_head_token = None
            
            # MULTI-WORD phrase matching for "public meeting" 
            span_indices = match_span_indices(doc, event)
            if span_indices:
                event_head_token = doc[span_indices[0]:span_indices[-1] + 1].root # if found, this takes the syntactic 'anchor' of the phrase
            
            # SINGLE-WORD event types: for each sentence token, check if its lemma matches the event string
            else:
                for token in doc:
                    mapped_token_event = nounverb_map.get(token.lemma_.lower())
                    if mapped_token_event == event.lower():
                        event_head_token = token
                        break # Stop after finding the first match
            
            # If we still couldn't find any representation of the event, skip it.
            if event_head_token is None:
                continue

            ### NOTE: event_head_token is the "anchor point" from which syntactic distance is calculated in the following step.  

            # --- Find the closest VENUE independently ---
            best_venue, venue_dist = find_closest_match_for_type(event_head_token, venues, doc, graph)
            
            # --- Find the closest LOCATION independently ---
            best_location, loc_dist = find_closest_match_for_type(event_head_token, locations, doc, graph)

            # Apply the max_path_len filter to each match
            matched_venue = best_venue if venue_dist <= max_path_len else None
            matched_location = best_location if loc_dist <= max_path_len else None

            # Only create a row if at least one successful pairing was made
            if matched_venue or matched_location:
                row_data = row.to_dict()
                row_data["event"] = event  # The specific event for this row
                row_data["re_venue"] = matched_venue
                row_data["re_location"] = matched_location
            
                results.append(row_data)

    return pd.DataFrame(results)
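
One behaviour of match_span_indices() is worth checking on plain lists: the phrase is split on whitespace, while the sentence is split by the tokenizer. The sketch below uses a hypothetical `match_indices` helper and hand-written token lists (a real tokenizer's output may differ) to show what happens when the two disagree.

```python
def match_indices(tokens, phrase):
    """Return indices of the first case-insensitive match of phrase in tokens, else None."""
    phrase_tokens = phrase.lower().split()
    lowered = [t.lower() for t in tokens]
    for i in range(len(lowered) - len(phrase_tokens) + 1):
        if lowered[i:i + len(phrase_tokens)] == phrase_tokens:
            return list(range(i, i + len(phrase_tokens)))
    return None

# Hypothetical token lists; a real tokenizer's output may differ.
plain = ["held", "at", "the", "Chartist", "Lecture", "Room"]
hyphen_split = ["in", "John", "-", "street"]

found = match_indices(plain, "chartist lecture room")
missed = match_indices(hyphen_split, "John-street")
print(found, missed)
```

If the tokenizer separates hyphens or attached punctuation into their own tokens, an entity such as "John-street" is never located and is silently skipped. That may be one reason an extracted venue or location ends up unpaired, although this has not been verified against the data here.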

## Trying it out on first 5 rows of subset df

In [16]:
# please = subset.head(n=5)
# please

Unnamed: 0,corpus_id,score,date,source,article_text,year
0,178361,0.839721,1838-09-08,star,Preparatory to a Demonstration in favor of the...,1838
1,454650,0.839554,1842-02-26,star,The London Chartists are auxiously invited to ...,1842
2,798998,0.832106,1839-05-04,star,PUBLIC MEETING. - In this small village we had...,1839
3,709266,0.830941,1842-04-30,star,Islington.—A public open air meeting was held ...,1842
4,833213,0.828974,1841-10-02,star,"LEEDS.—On Sunday last, in the absence of Mr. M...",1841


In [17]:
# pls = process_articles_ner(please, gliner_confidence=0.7)
# pls

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Total sentences dropped (no event keywords): 28 out of 56 (50.00%)
Rows dropped due to missing event/location: 17 out of 28 (60.71%)


Unnamed: 0,corpus_id,sentence,event,venue,location
1,178361,"On Monday evening last, the 3rd instant, the W...","assembly, demonstration",,Hull
4,178361,The chairman then read a letter received on Su...,meeting,Old Palace Yard,Westminster
6,454650,This will be the most important meeting ever h...,meeting,,London
7,454650,F. O'Connor will address a general meeting of ...,meeting,Social Institution,"John-street, Tottenham-Court-road"
8,454650,MEN of BIRMINGHAM.—A meeting will be held in t...,meeting,Town Hall,BIRMINGHAM
12,798998,At the time appointed for the meeting taking p...,meeting,,"New Mills, Glos- sop, Hyde, Stockport"
13,798998,When they arrived at Marple Bridge they all pr...,meeting,,Marple Bridge
16,798998,The above meeting was addressed by Messrs. T. ...,meeting,,"Stalybridge, Hyde, Glossop"
18,709266,Islington.—A public open air meeting was held ...,meeting,,"Islington, Finsbury"
24,833213,HALIFAX.—O'Connor Demonstration.—The committee...,demonstration,,HALIFAX


In [18]:
# pls_postre = event_loc_pairing(pls, max_path_len=4)
# pls_postre

Unnamed: 0,corpus_id,sentence,event,venue,location,re_venue,re_location
0,178361,"On Monday evening last, the 3rd instant, the W...",assembly,,Hull,,Hull
1,178361,The chairman then read a letter received on Su...,meeting,Old Palace Yard,Westminster,Old Palace Yard,Westminster
2,454650,This will be the most important meeting ever h...,meeting,,London,,London
3,454650,F. O'Connor will address a general meeting of ...,meeting,Social Institution,"John-street, Tottenham-Court-road",Social Institution,
4,709266,Islington.—A public open air meeting was held ...,meeting,,"Islington, Finsbury",,Islington
5,833213,"The committee meet every Tuesday evening, at t...",meeting,Chartist Lecture Room,Swan Coppic,Chartist Lecture Room,Swan Coppic


## Saving test output to test mapping stage


In [None]:
# pls_postre.to_csv("data/postre_test.csv", index=False, encoding="utf-8-sig")

In [None]:
# check = subset.loc[subset['corpus_id'] == 798998]
# check['article_text'].values[0]

## Moment of truth af.... let's try on my whole semantic-searched subset (this was run as a script on CLI)

In [15]:
subset = pd.read_csv("data/ner_subset.csv")


Unnamed: 0,corpus_id,score,date,source,article_text,year
0,178361,0.839721,1838-09-08,star,Preparatory to a Demonstration in favor of the...,1838
1,454650,0.839554,1842-02-26,star,The London Chartists are auxiously invited to ...,1842
2,798998,0.832106,1839-05-04,star,PUBLIC MEETING. - In this small village we had...,1839
3,709266,0.830941,1842-04-30,star,Islington.—A public open air meeting was held ...,1842
4,833213,0.828974,1841-10-02,star,"LEEDS.—On Sunday last, in the absence of Mr. M...",1841


In [None]:
# running both NER and RE functions
ner_re_output = event_loc_pairing(process_articles_ner(subset, gliner_confidence=0.7))

# saving locally
ner_re_output.to_csv("data/ner_re_done_v2.csv", index=False, encoding="utf-8-sig")

## Script output:
Total sentences dropped (no event keywords): 459183 out of 602545 (76.21%)

Rows dropped due to missing event/location: 74172 out of 143362 (51.74%)
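
As a quick sanity check, the two lines of script output above can be reconciled with plain arithmetic. The variable names are illustrative; the figures are copied from the output.

```python
# Figures copied from the script output above.
total_sentences = 602545
dropped_no_keyword = 459183
rows_after_prefilter = 143362
dropped_no_event_or_place = 74172

# Sentences surviving the pre-filter should equal the rows entering the second filter.
assert total_sentences - dropped_no_keyword == rows_after_prefilter

pct_prefilter = f"{dropped_no_keyword / total_sentences:.2%}"
pct_second = f"{dropped_no_event_or_place / rows_after_prefilter:.2%}"
kept_rows = rows_after_prefilter - dropped_no_event_or_place
print(pct_prefilter, pct_second, kept_rows)
```

The percentages match the printed ones, and the difference gives the number of sentence rows passed on to the RE step, which can then drop further rows or emit several rows for a multi-event sentence.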