<a href="https://colab.research.google.com/github/m3wzz/very_fake/blob/main/Lia_QA_based_Information_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QA-based Information Extraction

Created by Sarah Oberbichler [ORCID](https://orcid.org/0000-0002-1031-2759)


QA-based Entity Extraction transforms entity recognition into a question-answering task. Instead of directly labeling words in text as entities, it asks specific questions like "What companies are mentioned?" or "Who are the people in this text?" The model then responds by extracting the relevant entities from the text as answers to these questions. This approach makes entity extraction more flexible and intuitive, as new entity types can be added simply by asking new questions, though it may be more computationally intensive than traditional sequence labeling methods.
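
To make the contrast with sequence labeling concrete, the sketch below shows the same entities in both forms: once as per-token BIO tags, and once as answers to one question per entity type. The sentence, the tags, and the helper `bio_to_answers` are invented for illustration and are not part of the course dataset or of any library.

```python
from typing import Dict, List

# Toy sentence, invented purely for illustration.
tokens = ["Anna", "Meier", "works", "for", "Acme", "in", "Vienna", "."]

# Sequence labeling: one BIO tag per token.
bio_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]

# QA-style extraction: one question per entity type.
questions = {
    "PER": "Who are the people in this text?",
    "ORG": "What companies are mentioned?",
    "LOC": "Which places are mentioned?",
}


def bio_to_answers(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group BIO-tagged tokens into entity strings, keyed by entity type."""
    answers: Dict[str, List[str]] = {}
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # Continue the open entity if the tag is a matching I- tag.
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
            continue
        # Otherwise close the open entity, if there is one.
        if current_tokens:
            answers.setdefault(current_type, []).append(" ".join(current_tokens))
        if tag.startswith("B-"):
            current_type, current_tokens = tag[2:], [token]
        else:
            current_type, current_tokens = None, []
    if current_tokens:
        answers.setdefault(current_type, []).append(" ".join(current_tokens))
    return answers


answers = bio_to_answers(tokens, bio_tags)
for entity_type, question in questions.items():
    print(f"{question} -> {answers.get(entity_type, [])}")
```

In the QA formulation, adding a new entity type means adding one more entry to `questions`, whereas a sequence labeler would need a new tag and new training data.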

### NuExtract Language Model

For the extraction, we use version 2 of the NuExtract model (`numind/NuExtract-2.0-4B` in the code below). NuExtract is trained on a private high-quality dataset for structured information extraction. It supports long documents and several languages (English, French, Spanish, German, Portuguese, and Italian).

## Importing the Dataset

We import a dataset that contains individual articles.

In [None]:
!git clone https://github.com/soberbichler/NLP-Course4Humanities_2025.github.io.git

In [None]:
import pandas as pd

# Load the Excel file from the cloned course repository
df = pd.read_excel('/content/NLP-Course4Humanities_2025.github.io/datasets/lügenpresse_context_window.xlsx')

# Preview the first rows
df.head()

In [None]:
df['context_small'][0]

## Defining a Template for Information Extraction

As an example, we extract information from the article passages in the `context_small` column, which mention the term "Lügenpresse" (lying press). We want the model to extract the places where the alleged lie originated, the press mentioned in relation to the lying press, what the press allegedly lied about, and the surrounding context.

In [None]:
# Define a template for Lügenpresse information extraction
import json
lügenpresse_template = json.dumps({
    "Lügenpresse": {
        "Places_Of_Origin_Of_The_Lie": "string",
        "Press_Mentioned_in_Relation_To_Lying_Press": "string",
        "What_They_Lied_About": "string",
        "Context": "string"
    }
}, indent=4)
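
Because the template is plain JSON, asking the model for an additional piece of information amounts to adding one more key. The sketch below illustrates this with a hypothetical helper `add_field` and a hypothetical extra field `Persons_Mentioned`; neither is part of the original template, and whether the model fills such a field well would need to be checked on real data.

```python
import json


def add_field(template_json: str, root: str, field: str, field_type: str = "string") -> str:
    """Return a copy of a JSON template with one extra field under `root`."""
    template = json.loads(template_json)
    template.setdefault(root, {})[field] = field_type
    return json.dumps(template, indent=4)


# Same structure as the template defined above.
base_template = json.dumps({
    "Lügenpresse": {
        "Places_Of_Origin_Of_The_Lie": "string",
        "Press_Mentioned_in_Relation_To_Lying_Press": "string",
        "What_They_Lied_About": "string",
        "Context": "string",
    }
}, indent=4)

# "Persons_Mentioned" is a hypothetical field used only for illustration.
extended_template = add_field(base_template, "Lügenpresse", "Persons_Mentioned")
print(extended_template)
```

The helper returns a new string and leaves `base_template` unchanged, so both versions can be compared on the same texts.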

## Running the Model

The code below runs the model with the extraction template. By default, the model returns its output as JSON. We add the extracted fields to our DataFrame and save it as an Excel file.
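
The step from raw JSON output to flat columns can be sketched without the model. The example below is a minimal illustration: the helper `flatten_extraction` and the two sample outputs are invented, and the choice to turn missing or null fields into empty strings and to join lists with commas is an assumption, not something the model guarantees.

```python
import json
from typing import Dict

FIELDS = [
    "Places_Of_Origin_Of_The_Lie",
    "Press_Mentioned_in_Relation_To_Lying_Press",
    "What_They_Lied_About",
    "Context",
]


def flatten_extraction(raw_output: str, root: str = "Lügenpresse") -> Dict[str, str]:
    """Turn one raw model output into a flat {column: value} record."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        parsed = {}
    inner = parsed.get(root) if isinstance(parsed, dict) else None
    if not isinstance(inner, dict):
        inner = {}
    # Missing or null fields become empty strings; lists are joined with commas.
    record = {}
    for field in FIELDS:
        value = inner.get(field)
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        record[field.lower()] = "" if value is None else str(value)
    return record


# Invented outputs for illustration only; real model output will differ.
sample_outputs = [
    '{"Lügenpresse": {"Places_Of_Origin_Of_The_Lie": "Paris, London", "Context": null}}',
    "not valid json",
]
records = [flatten_extraction(output) for output in sample_outputs]
print(records[0]["places_of_origin_of_the_lie"])
```

A list of such records can be passed to `pd.DataFrame(records)` to obtain one column per field.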

In [None]:
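# Work on a copy of the first 10 rows so that new columns can be added without a SettingWithCopyWarning
df = df[:10].copy()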

In [None]:
!pip install qwen-vl-utils
import json
import torch
import pandas as pd
from transformers import AutoProcessor, AutoModelForVision2Seq
from tqdm import tqdm
import logging

# Suppress transformer warnings
logging.getLogger("transformers").setLevel(logging.ERROR)

def process_all_vision_info(messages, examples=None):
    """Process vision information - returns None for text-only inputs"""
    from qwen_vl_utils import process_vision_info

    is_batch = messages and isinstance(messages[0], list)
    messages_batch = messages if is_batch else [messages]

    all_images = []
    for message_group in messages_batch:
        input_message_images = process_vision_info(message_group)[0] or []
        all_images.extend(input_message_images)

    return all_images if all_images else None

def predict_NuExtract(model, processor, texts, template):
    """Extract structured information using NuExtract 2.0"""
    outputs = []

    for text in texts:
        # Prepare messages
        messages = [{"role": "user", "content": text}]

        # Apply chat template
        formatted_text = processor.tokenizer.apply_chat_template(
            messages,
            template=template,
            tokenize=False,
            add_generation_prompt=True,
        )

        # Process inputs
        image_inputs = process_all_vision_info(messages)
        inputs = processor(
            text=[formatted_text],
            images=image_inputs,
            padding=True,
            return_tensors="pt",
        ).to(model.device)

        # Generate
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                do_sample=False,  # greedy decoding; no temperature is needed
                num_beams=1,
                max_new_tokens=2048,
            )

        # Decode
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

        outputs.extend(output_text)

    return outputs

# Load NuExtract 2.0 model
model_name = "numind/NuExtract-2.0-4B"
device = "cuda"

print("Loading NuExtract 2.0 model...")
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
    padding_side='left',
    use_fast=True
)

# Filter texts
df['context_small'] = df['context_small'].astype(str)
valid_texts = df['context_small'][df['context_small'].str.strip() != '']

print(f"Processing {len(valid_texts)} texts...")

# Extract information
all_predictions = []

for i in tqdm(range(len(valid_texts)), desc="Extracting"):
    text = valid_texts.iloc[i]
    try:
        prediction = predict_NuExtract(model, processor, [text], lügenpresse_template)
        all_predictions.extend(prediction)
    except Exception as e:
        print(f"Error on text {i}: {e}")
        all_predictions.append("{}")

    if i % 5 == 0:
        torch.cuda.empty_cache()

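# Store the raw model output next to each source text
df['lügenpresse_extraction'] = None
df.loc[valid_texts.index, 'lügenpresse_extraction'] = all_predictions

# Flatten JSON
def parse_lügenpresse_info(extraction):
    """Return the inner 'Lügenpresse' dict, or {} if the output is not usable JSON."""
    try:
        parsed = json.loads(extraction)
        info = parsed.get('Lügenpresse')
        return info if isinstance(info, dict) else {}
    except (json.JSONDecodeError, TypeError, AttributeError):
        return {}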

df['places_of_origin_of_the_lie'] = df['lügenpresse_extraction'].apply(lambda x: parse_lügenpresse_info(x).get('Places_Of_Origin_Of_The_Lie', ''))
df['press_mentioned_in_relation_to_lying_press'] = df['lügenpresse_extraction'].apply(lambda x: parse_lügenpresse_info(x).get('Press_Mentioned_in_Relation_To_Lying_Press', ''))
df['what_they_lied_about'] = df['lügenpresse_extraction'].apply(lambda x: parse_lügenpresse_info(x).get('What_They_Lied_About', ''))
df['context_extraction'] = df['lügenpresse_extraction'].apply(lambda x: parse_lügenpresse_info(x).get('Context', ''))

# Save results
df.to_excel('lügenpresse_extractions.xlsx', index=False)

print("Done!")
df

In [None]:
df.to_excel('results.xlsx')

## Visualization Example - Creating a Map with the Extracted Locations


We first use the geopy library to look up the extracted locations and add their coordinates (latitude and longitude) to the pandas DataFrame. The code defines a `GeocodingService` class that wraps the Nominatim geocoding API and adds rate limiting, retries with exponential backoff, and error handling.
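
The retry logic can be looked at in isolation. The following sketch shows exponential backoff around an arbitrary function; `call_with_backoff` and `flaky_lookup` are hypothetical names, the coordinates are made up, and the injected `sleep` function is an assumption that lets the example run without actually waiting or contacting Nominatim.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `func`, retrying on the given exceptions with delays of
    base_delay * 2**attempt seconds. Returns None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                return None
            sleep(base_delay * 2 ** attempt)
    return None


# Demo with a fake service that fails twice before succeeding.
calls = {"count": 0}
waits = []  # records the delays instead of sleeping


def flaky_lookup() -> Tuple[float, float]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return (48.2, 16.4)  # made-up coordinates for the demo


result = call_with_backoff(flaky_lookup, (TimeoutError,), sleep=waits.append)
print(result, waits)
```

Doubling the delay after each failure gives a temporarily overloaded service time to recover, while the cap on attempts keeps a single bad location from stalling the whole run.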

We further use the folium library to create an interactive map with markers for locations provided in a pandas DataFrame. Finally, the map is created and displayed, providing a visual representation of the geographic data.

Please note that extracted entities must be checked and verified by a human reader before they are used for further analysis, since the model will most likely have made some mistakes.
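
One way to prioritise this manual check is to flag extracted values that do not occur verbatim in the source text. The sketch below is a crude heuristic under that assumption, not a validation method: the helper `unsupported_values` and the example text are invented, and because a model may normalise spellings or translate names, flagged items are candidates for review rather than confirmed errors.

```python
from typing import List


def unsupported_values(source_text: str, extracted: str, sep: str = ",") -> List[str]:
    """Return extracted items that do not occur verbatim (case-insensitive) in the source text."""
    haystack = source_text.casefold()
    items = [item.strip() for item in extracted.split(sep) if item.strip()]
    return [item for item in items if item.casefold() not in haystack]


# Invented example text and extraction, for illustration only.
text = "The paper in Berlin was accused of spreading lies about the strike."
extracted_places = "Berlin, Vienna"
print(unsupported_values(text, extracted_places))
```

Applied row by row to `context_small` and an extracted column, a non-empty result marks rows that a human reader might look at first.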

In [None]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError
import pandas as pd
import time
from typing import List, Tuple, Optional
import random

class GeocodingService:
    def __init__(self, user_agent: str = None, timeout: int = 10, rate_limit: float = 1.1):
        """
        Initialize the geocoding service with proper configuration.

        Args:
            user_agent: Custom user agent string (default: generated)
            timeout: Timeout for requests in seconds
            rate_limit: Time to wait between requests in seconds
        """
        if user_agent is None:
            user_agent = f"python_geocoding_script_{random.randint(1000, 9999)}"

        self.geolocator = Nominatim(
            user_agent=user_agent,
            timeout=timeout
        )
        self.rate_limit = rate_limit
        self.last_request = 0

    def _rate_limit_wait(self):
        """Implement rate limiting between requests"""
        current_time = time.time()
        time_since_last = current_time - self.last_request
        if time_since_last < self.rate_limit:
            time.sleep(self.rate_limit - time_since_last)
        self.last_request = time.time()

    def geocode_location(self, location: str, max_retries: int = 3) -> Optional[Tuple[float, float]]:
        """
        Geocode a single location with retries.

        Args:
            location: Location string to geocode
            max_retries: Maximum number of retry attempts

        Returns:
            Tuple of (latitude, longitude) or None if geocoding fails
        """
        for attempt in range(max_retries):
            try:
                self._rate_limit_wait()
                location_data = self.geolocator.geocode(location)
                if location_data:
                    return (location_data.latitude, location_data.longitude)
                return None
            except (GeocoderTimedOut, GeocoderServiceError) as e:
                if attempt == max_retries - 1:
                    print(f"Failed to geocode '{location}' after {max_retries} attempts: {e}")
                    return None
                time.sleep(2 ** attempt)  # Exponential backoff
            except Exception as e:
                print(f"Error geocoding '{location}': {e}")
                return None
        return None

    def process_locations(self, locations: str) -> List[Optional[Tuple[float, float]]]:
        """
        Process a comma-separated string of locations.

        Args:
            locations: Comma-separated string of location names

        Returns:
            List of coordinate tuples or None for failed geocoding
        """
        if pd.isna(locations) or not locations:
            return []

        location_list = [loc.strip() for loc in locations.split(',')]
        return [self.geocode_location(loc) for loc in location_list]

def geolocate_places(df: pd.DataFrame,
                    places_column: str = 'locations',
                    user_agent: str = None) -> pd.DataFrame:
    """
    Add coordinates to a DataFrame based on location names.

    Args:
        df: Input DataFrame
        places_column: Name of the column containing comma-separated location strings
        user_agent: Custom user agent string

    Returns:
        DataFrame with added 'coordinates' column
    """
    geocoder = GeocodingService(user_agent=user_agent)

    # Create a copy to avoid modifying the original DataFrame
    result_df = df.copy()

    # Process locations
    result_df['coordinates'] = result_df[places_column].apply(geocoder.process_locations)

    return result_df

# Main execution for Lügenpresse analysis
if __name__ == "__main__":
    # Apply geocoding to the Lügenpresse DataFrame
    df_with_coords = geolocate_places(
        df,
        places_column='places_of_origin_of_the_lie',
        user_agent='lügenpresse_geocoding_service_v1.0'
    )

    # Update the original DataFrame with the new coordinates
    df['coordinates'] = df_with_coords['coordinates']

    # Display the results
    print("\nSample of geocoded locations:")
    print(df[['places_of_origin_of_the_lie', 'coordinates']].head())

    # Optional: Display some statistics
    total_rows = len(df)
    successful_geocodes = df['coordinates'].apply(lambda x: len([c for c in x if c is not None])).sum()
    failed_geocodes = df['coordinates'].apply(lambda x: len([c for c in x if c is None])).sum()

    print(f"\nGeocoding Statistics:")
    print(f"Total rows processed: {total_rows}")
    print(f"Successfully geocoded: {successful_geocodes}")
    print(f"Failed to geocode: {failed_geocodes}")

In [None]:
import folium
from folium import plugins
import pandas as pd
from typing import List, Tuple, Optional
from IPython.display import display

def create_location_map(df: pd.DataFrame,
                       coordinates_col: str = 'coordinates',
                       places_col: str = 'places_of_origin_of_the_lie',
                       title_col: Optional[str] = None) -> folium.Map:
    """
    Create an interactive map with individual markers for all Lügenpresse locations.

    Args:
        df: DataFrame containing coordinates and location names
        coordinates_col: Name of column containing coordinates
        places_col: Name of column containing location names
        title_col: Optional column name for additional marker information

    Returns:
        folium.Map object with all locations marked individually
    """
    # Initialize the map centered on Europe (where most references likely are)
    m = folium.Map(location=[50, 10], zoom_start=4)

    # Keep track of all valid coordinates for setting bounds
    all_coords = []

    # Process each row in the DataFrame
    for idx, row in df.iterrows():
        coordinates = row[coordinates_col]
        places = row[places_col].split(',') if pd.notna(row[places_col]) else []
        title = row[title_col] if title_col and pd.notna(row[title_col]) else None

        # Skip if no coordinates
        if not coordinates:
            continue

        # Add individual markers for each location
        for coord, place in zip(coordinates, places):
            if coord is not None:  # Skip None coordinates
                lat, lon = coord
                place_name = place.strip()

                # Create popup content
                popup_content = f"<b>{place_name}</b>"
                if title:
                    popup_content += f"<br>{title}"

                # Add marker directly to the map
                folium.Marker(
                    location=[lat, lon],
                    popup=folium.Popup(popup_content, max_width=300),
                    tooltip=place_name,
                    icon=folium.Icon(color='red', icon='info-sign')
                ).add_to(m)

                all_coords.append([lat, lon])

    # If we have coordinates, fit the map bounds to include all points
    if all_coords:
        m.fit_bounds(all_coords)

    return m

# Create and display the map
map_obj = create_location_map(df)
display(map_obj)