## Dataset Preparation

This code segment prepares the dataset. It loads the JSON file that defines the training, validation, and test splits, and it defines a generator for reading a large JSON file line by line so that the whole file does not have to be held in memory. The large dialogue file itself is not read in this cell.

### Detailed description:

1. **File path definitions**:
   - `SPLIT_FILE_PATH`: Path to the JSON file that contains the train/validation/test data split.
   - `NEWSDIALOG_FILE_PATH`: Path to the large JSON file containing dialogue data.

2. **Loading split data**:
   - The contents of the `train_val_test_split.json` file are loaded into the `split_data` variable using the `json` library.

3. **Function to load a large JSON file**:
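   - `load_large_json(file_path)` is a generator function that reads a file line by line and parses each line as a separate JSON object, allowing large datasets to be processed without loading the entire file into memory. It therefore expects one JSON object per line (JSON Lines). The function is only defined in this cell, not called.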

4. **Confirmation message**:
   - After the split file is loaded and the generator is defined, the script prints the message: "Files have been successfully loaded."
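
To illustrate the line-by-line approach described above, here is a minimal sketch that writes a tiny JSON Lines file to a temporary location and reads it back lazily. The helper name `iter_json_lines` and the sample records are hypothetical, and the sketch assumes one JSON object per line, which may not match how `news_dialogue.json` is actually stored.

```python
import json
import os
import tempfile


def iter_json_lines(file_path):
    """Yield one parsed JSON object per non-empty line (JSON Lines format)."""
    with open(file_path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


# Hypothetical records, written one JSON object per line
records = [{"id": "A-1", "utt": ["Hello."]}, {"id": "A-2", "utt": ["Goodbye."]}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    tmp_path = tmp.name

# Only one record is held in memory at a time while iterating
loaded_ids = [record["id"] for record in iter_json_lines(tmp_path)]
os.remove(tmp_path)
print(loaded_ids)
```

If the source file is a single JSON array rather than one object per line, this pattern does not apply and the file has to be parsed as a whole (or with a streaming parser).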

In [2]:
import json
import os
import random

# File paths
SPLIT_FILE_PATH = "/Users/kamiljaworski/Projects/NLP_pro/datasets/MediaSum/data/train_val_test_split.json"
NEWSDIALOG_FILE_PATH = "/Users/kamiljaworski/Projects/NLP_pro/datasets/MediaSum/data/news_dialogue.json"

# Load split data
with open(SPLIT_FILE_PATH, "r") as split_file:
    split_data = json.load(split_file)

# Function to load a large JSON file line by line
def load_large_json(file_path):
    """
    Generator function to load a large JSON file line by line.
    
    Args:
        file_path (str): Path to the JSON file.
        
    Yields:
        dict: Parsed JSON object for each line in the file.
    """
    with open(file_path, "r") as file:
        for line in file:
            yield json.loads(line)

print("Files have been successfully loaded.")

Files have been successfully loaded.


## Sampling Data from Splits

This code segment is responsible for sampling a specified number of data points from the combined `train`, `val`, and `test` datasets. It is useful when you need a randomized subset of the data for experimentation or evaluation purposes.

### Detailed description:

1. **Function `sample_train_ids`**:
   - Despite its name, it randomly selects a specified number of IDs from the combined `train`, `val`, and `test` splits, not only from `train`.
   - **Arguments**:
     - `split_data`: A dictionary containing the dataset splits.
     - `sample_size`: The number of samples to retrieve.
   - The function prints the length of the combined ID list (labeled "unique IDs" in the output, although the list is not deduplicated) and raises a `ValueError` if the requested sample size exceeds the number of available IDs.

2. **Combining IDs from all splits**:
   - The IDs from the `train`, `val`, and `test` sets are merged into a single list.

3. **Random sampling**:
   - The `random.sample` function is used to select the specified number of unique IDs without repetition.

4. **Global variables**:
   - `accumulated_ids`: Total number of IDs across all dataset splits.
   - `SAMPLE_SIZE`: The desired sample size, here set to the total number of IDs (i.e., using the full dataset).

5. **Output**:
   - After sampling, the script prints the number of selected samples.
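
The sampling steps above can be sketched on toy data as follows. This is a variant under stated assumptions, not the notebook's function: the name `sample_ids`, the `seed` parameter, and the explicit deduplication are additions made here for illustration.

```python
import random


def sample_ids(split_data, sample_size, seed=None):
    """Sample IDs from all splits, deduplicated and optionally reproducible."""
    combined = split_data["train"] + split_data["val"] + split_data["test"]
    unique_ids = list(dict.fromkeys(combined))  # drop duplicates, keep order
    if sample_size > len(unique_ids):
        raise ValueError(
            f"Sample size ({sample_size}) exceeds the number of "
            f"available IDs ({len(unique_ids)})."
        )
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(unique_ids, sample_size)


# Hypothetical split dictionary; "a" appears twice on purpose
toy_split = {"train": ["a", "b", "c"], "val": ["d"], "test": ["e", "a"]}
first = sample_ids(toy_split, 3, seed=0)
second = sample_ids(toy_split, 3, seed=0)
print(first == second)  # same seed, same sample
```

Passing a seed makes the selection repeatable across runs, which helps when the sampled subset feeds later experiments.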

In [3]:
def sample_train_ids(split_data, sample_size):
    """
    Selects a specified number of samples from the `train`, `val`, and `test` datasets.

    Args:
        split_data (dict): Dictionary containing split data.
        sample_size (int): Number of samples to select.

    Returns:
        list: List of selected IDs.
    """
    # Combine IDs from train, validation, and test splits
    train_ids = split_data["train"] + split_data["val"] + split_data["test"]
    
    # Print the number of combined IDs (the list is not deduplicated)
    print(f"Number of unique IDs: {len(train_ids)}")
    
    # Raise an error if the sample size exceeds the total number of IDs
    if sample_size > len(train_ids):
        raise ValueError(
            f"Sample size ({sample_size}) exceeds the number of available data points ({len(train_ids)})."
        )
    
    # Randomly sample the specified number of IDs
    return random.sample(train_ids, sample_size)

# Calculate the total number of IDs across all splits
accumulated_ids = len(split_data["train"] + split_data["val"] + split_data["test"])

# Define the sample size as the total number of IDs
SAMPLE_SIZE = accumulated_ids

# Select a sample of IDs
sampled_ids = sample_train_ids(split_data, SAMPLE_SIZE)

# Print the number of selected samples
print(f"Selected {len(sampled_ids)} samples.")

Number of unique IDs: 463596
Selected 463596 samples.


## Extracting Data by ID

This code segment extracts specific articles and their summaries based on a list of selected IDs. It processes the main JSON dataset and saves the extracted samples to a new file. This approach is useful when working with a smaller, targeted subset of the data.

### Detailed description:

1. **Function `extract_data_by_ids`**:
   - Extracts article data (ID, full text, and summary) from the original dataset based on a given list of IDs.
   - **Arguments**:
     - `news_file_path`: Path to the source JSON file containing all article data.
     - `ids_to_extract`: List of selected IDs to extract from the dataset.
   - The function:
     - Converts the list of IDs to a set for faster lookups.
     - Loads the entire JSON file into memory with `json.load`, which assumes the file fits in RAM.
     - Iterates over each article and checks if its ID is among the selected ones.
     - Extracts and formats the article's `utt` (utterances) into a single string of text, along with the summary.
     - Stops early once all desired IDs have been found.

2. **Extracting data**:
   - The function is called with `NEWSDIALOG_FILE_PATH` and `sampled_ids` as inputs.
   - Extracted samples are stored in `extracted_samples`.

3. **Saving to file**:
   - The extracted samples are saved to a new JSON file at `./datasets/extracted_samples.json` with proper formatting (`indent=4`).

4. **Output messages**:
   - Prints the number of extracted articles.
   - Confirms the location where the new file is saved.
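
As a sketch of the lookup-and-early-stop logic described above, the following example works on any iterable of article dictionaries instead of a file. The function name `extract_by_ids` and the toy articles are hypothetical. Unlike the notebook's function, it removes each ID from the set once it is found, so duplicates in the requested list cannot delay the early exit.

```python
def extract_by_ids(articles, ids_to_extract):
    """Collect id, joined text, and summary for the requested IDs."""
    wanted = set(ids_to_extract)  # set membership tests are O(1) on average
    extracted = []
    for article in articles:
        article_id = article.get("id")
        if article_id in wanted:
            extracted.append({
                "id": article_id,
                "text": " ".join(article.get("utt", [])),
                "summary": article.get("summary", ""),
            })
            wanted.discard(article_id)
            if not wanted:  # every requested ID has been found
                break
    return extracted


# Hypothetical articles in the same shape as the dataset entries
toy_articles = [
    {"id": "X-1", "utt": ["First.", "Second."], "summary": "S1"},
    {"id": "X-2", "utt": ["Other."], "summary": "S2"},
    {"id": "X-3", "utt": ["Third."], "summary": "S3"},
    {"id": "X-4", "utt": ["Never read."], "summary": "S4"},
]
article_iter = iter(toy_articles)
extracted = extract_by_ids(article_iter, ["X-3", "X-1", "X-1"])
unread = list(article_iter)  # articles left untouched after the early stop
print([item["id"] for item in extracted], len(unread))
```

Because the function accepts any iterable, it could in principle be combined with a lazy reader rather than a fully loaded list.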

In [4]:
import json

def extract_data_by_ids(news_file_path, ids_to_extract):
    """
    Extracts articles and summaries based on provided IDs.

    Args:
        news_file_path (str): Path to the news dialogue JSON file (news_dialogue.json).
        ids_to_extract (list): List of IDs to extract.

    Returns:
        list: A list of articles containing ID, text, and summary.
    """
    extracted_data = []
    ids_set = set(ids_to_extract)  # Use a set for faster lookups

    # Open and load the JSON file
    with open(news_file_path, "r") as file:
        news_data = json.load(file)  # Load the entire JSON file (list)

        for article in news_data:
            # Check if "id" exists and matches one of the selected IDs
            if "id" in article and article["id"] in ids_set:
                extracted_data.append({
                    "id": article["id"],
                    "text": " ".join(article.get("utt", [])),  # Join sentences
                    "summary": article.get("summary", "")  # Get the summary
                })
            # Stop once every requested ID has been found; compare against the
            # deduplicated set so repeated IDs in the input cannot block the early exit
            if len(extracted_data) == len(ids_set):
                break

    return extracted_data


# Extract data
extracted_samples = extract_data_by_ids(NEWSDIALOG_FILE_PATH, sampled_ids)
print(f"Extracted {len(extracted_samples)} articles.")

# Save extracted data to a JSON file
output_file = "./datasets/extracted_samples.json"
with open(output_file, "w") as file:
    json.dump(extracted_samples, file, indent=4)
print(f"Extracted data saved to file: {output_file}")

Extracted 463596 articles.
Extracted data saved to file: ./datasets/extracted_samples.json


## Displaying Random Sampled Examples

This code segment randomly selects and displays a few examples from the extracted dataset. It is useful for quickly verifying the quality and structure of the extracted data by visually inspecting a small subset.

### Detailed description:

1. **Random selection**:
   - Two examples are randomly selected from the `extracted_samples` list using `random.sample`.

2. **Output formatting**:
   - Each selected article is displayed with the following fields:
     - `Article ID`: The unique identifier of the article.
     - `Text`: The full concatenated article text.
     - `Summary`: The corresponding summary.
   - A horizontal separator line (`'-'*100`) is printed after each sample for clarity.

3. **Loop through samples**:
   - The `enumerate` function is used to iterate over the randomly selected samples and print their contents in a structured way.
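
Since full transcripts can run to thousands of characters, a shortened preview is sometimes easier to scan. The sketch below shows one possible way to do that; the helper `preview_samples`, its `max_chars` and `seed` parameters, and the toy data are all assumptions made for illustration.

```python
import random


def preview_samples(samples, k=2, max_chars=80, seed=None):
    """Return short one-line previews for k randomly chosen samples."""
    rng = random.Random(seed)
    chosen = rng.sample(samples, min(k, len(samples)))
    lines = []
    for sample in chosen:
        text = sample["text"]
        if len(text) > max_chars:
            text = text[:max_chars] + "..."  # mark the cut explicitly
        lines.append(
            f"Article ID: {sample['id']} | Text: {text} | "
            f"Summary: {sample['summary']}"
        )
    return lines


# Hypothetical extracted samples with deliberately long texts
toy_samples = [
    {"id": f"T-{n}", "text": "word " * 50, "summary": f"summary {n}"}
    for n in range(5)
]
preview_lines = preview_samples(toy_samples, k=2, max_chars=80, seed=1)
for line in preview_lines:
    print(line)
```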

In [None]:
# Select two random examples
random_samples = random.sample(extracted_samples, 2)

# Display the two random examples
for i, sample in enumerate(random_samples):
    print(f"Article ID: {sample['id']}")
    print(f"Text:\n{sample['text']}\n")
    print(f"Summary:\n{sample['summary']}\n{'-'*100}")

Article ID: CNN-407734
Text:
As the total number of coronavirus cases in the U.S. nears 5 million, we're now learning just how quickly the virus is spreading inside the U.S. federal prison system. CNN's Drew Griffin has our report. The Seagoville Federal Correctional Institution outside Dallas is a petri dish of coronavirus infection. I lost my smell. I lived in the restroom for like 12 days. In the 15-minute phone call allowed from the inside, inmate George Reagan explained how coronavirus swept through the facility in just a month. Everybody has it now. Seagoville has significantly more COVID-19 cases than any other federal prison in the U.S. More than 1,300 of the roughly 1,750 inmates there tested positive. That's 75 percent. George Reagan's wife, Tabitha, that says no visitors have been allowed at the prison for months, so she believes workers must have infected the inmates. This was 100 percent their fault. COVID was brought in by their people. The FCI Seagoville staff was not pr

## Counting Tokens with BART Tokenizer

This code segment defines a utility function to count the number of tokens in a given text using the BART tokenizer from Hugging Face's Transformers library. Token counting is essential in NLP tasks, especially for models that have a maximum input length constraint.

### Detailed description:

1. **Function `count_tokens`**:
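   - Initializes the BART tokenizer (`facebook/bart-base`). The tokenizer is loaded again on every call, which is fine for a one-off check but slow when counting many texts.
   - Encodes the input text and returns the number of tokens. Because `encode` adds special tokens by default, the count includes the start (`<s>`) and end (`</s>`) tokens.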
   - **Arguments**:
     - `text`: A string representing the input to be tokenized.
   - **Returns**:
     - An integer representing the number of tokens in the input text.

2. **Example usage**:
   - A sample news transcript is provided as `example_text`.
   - The function `count_tokens` is called on this text to calculate its token length.
   - The result is printed as: `"Number of tokens: X"`.

3. **Use case**:
   - Useful for determining whether a text fits within a model’s maximum token limit, or for understanding input size before fine-tuning or inference.
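
When many texts need to be counted, the tokenizer can be created once and reused. The sketch below shows that pattern with a pluggable encoder. The helper `make_token_counter` is hypothetical, and the demonstration uses whitespace splitting as a stand-in, so its counts are word counts and not BART token counts.

```python
def make_token_counter(encode):
    """Return a function that counts tokens using the given encode callable."""
    def count(text):
        return len(encode(text))
    return count


# Stand-in encoder: whitespace splitting (NOT the BART vocabulary)
count_words = make_token_counter(str.split)
n_tokens = count_words(
    "Members of the committee are choosing between five candidates."
)
fits_limit = n_tokens <= 1024  # compare against a model's input limit
print(n_tokens, fits_limit)

# With Transformers installed, the real tokenizer could be loaded once:
# from transformers import BartTokenizer
# tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
# count_bart = make_token_counter(tokenizer.encode)
```

Subword tokenizers such as BART's usually produce more tokens than there are words, so the stand-in should only be used to check the control flow.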

In [6]:
from transformers import BartTokenizer

def count_tokens(text):
    """
    Counts the number of tokens in the given text.

    Args:
        text (str): Input text.

    Returns:
        int: Number of tokens in the text.
    """
    # Initialize the BART tokenizer
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    
    # Tokenize the input text and return the token count
    tokens = tokenizer.encode(text)
    return len(tokens)

# Example usage
example_text = """The world will be finding out this morning just what city will host the 2008 Olympic Games. Members of the International Olympic Committee are choosing between five candidates. And if you're curious, in Las Vegas, odds makers have their money on Beijing. Well, CNN's Patrick Snell is covering the IOC meeting. He joins us now, live, from Moscow and that's where the announcement will be made in just a few hours. Hello there, what can you tell us? Linda, well welcome to the Russian capital. The very latest as the tension mounts here, the excitement we get all the more closer to who will win the right to stage the 2008 Summer Games. Let me bring you up to date what's happened so far today, Friday, in Moscow. We've had three presentations from the bid cities. The first of which was the Japanese bid from Osaka. Now, in this bid, the Japanese are claiming that they will provide the best sporting facilities and infrastructure in the world. And what's different -- what's innovative about this bid is that the facilities will be built across three uniquely designed manmade islands. So that's something that the Japanese are confident about. They're doubly more confident because, of course, they'll be staging next summer's football World Cup along with South Korea. So they're very, very confident that this is the bid that should catch the public imagination and hopefully will try and pull in, from their point of view, these 122 IOC votes that are up for grabs. Now what else do we have today? Well, we've had the bid from the French capital Paris. Yes, Oui Paris 2008. The French believe the Olympic Games should come to their city. Remember, they staged the 1998 football World Cup and won many plaudits for doing so, a wonderfully spectacular performance. They say their infrastructure is in place. They've got the Parisian Stadium, the Stade de France already built, infrastructure ready to go. 
And because they're promising games like Patrick Snell, thank you very much, from Moscow. TO ORDER A VIDEO OF THIS TRANSCRIPT, PLEASE CALL 800-CNN-NEWS OR USE OUR SECURE ONLINE ORDER FORM LOCATED AT www.fdch.com
"""
token_count = count_tokens(example_text)
print(f"Number of tokens: {token_count}")

Number of tokens: 452


## Filtering and Splitting Articles by Token Length

This code segment filters the dataset to remove overly long texts and very short summaries based on token counts using the BART tokenizer. It then splits the cleaned data into training, validation, and test sets, and saves them to separate JSON files.

### Detailed description:

1. **Tokenizer initialization**:
   - The BART tokenizer (`facebook/bart-base`) from Hugging Face is loaded to tokenize both article texts and summaries.

2. **Function `filter_articles_by_token_length`**:
   - Filters articles based on two criteria:
     - The number of tokens in the article text must be less than or equal to `max_tokens`.
     - The number of tokens in the summary must be greater than or equal to `min_summary_tokens`.
   - Uses `tokenizer(..., truncation=False)` to count tokens without truncating the input.
   - Returns a list of articles that meet both conditions.

3. **Function `split_and_save_filtered_data`**:
   - Loads the article dataset from the input file.
   - Applies the filtering function described above.
   - Randomly shuffles the filtered data.
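   - Splits the data into training, validation, and test subsets using the provided fractions. The training and validation sizes are computed with `int(total * pct)`; the test set receives all remaining samples, so `test_pct` is accepted as a parameter but not used in the calculation:
     - `train_pct`: 60% of data
     - `val_pct`: 30% of data
     - `test_pct`: 10% of data (nominal; the actual test size is the remainder)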
   - Creates the output directory if it doesn't exist.
   - Saves each data split (`train.json`, `val.json`, `test.json`) in the output folder with proper JSON formatting.

4. **Execution**:
   - The script runs the filtering and splitting logic with the following parameters:
     - `max_tokens = 1024`: maximum allowed token length for article text.
     - `min_summary_tokens = 40`: minimum token length required for summaries.
   - Filtered and split data is saved under `./datasets/splits_filtered_with_summary`.

5. **Output**:
   - Prints the number of articles that passed the filtering step.
   - Prints how many samples were saved to each of the training, validation, and test sets.
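
The filtering rule and the split arithmetic described above can be sketched without the tokenizer. The helper names `filter_by_length` and `split_counts` are hypothetical, and the whitespace-based counter is a stand-in for BART token counts. The final call uses the article count reported in this notebook's output.

```python
def filter_by_length(articles, count_tokens, max_tokens, min_summary_tokens):
    """Keep articles with short enough text and long enough summary."""
    return [
        article for article in articles
        if count_tokens(article["text"]) <= max_tokens
        and count_tokens(article["summary"]) >= min_summary_tokens
    ]


def split_counts(total, train_pct=0.6, val_pct=0.3):
    """Truncate train/val sizes; the test split takes the remainder."""
    train_count = int(total * train_pct)
    val_count = int(total * val_pct)
    return train_count, val_count, total - train_count - val_count


def count_words(text):
    """Stand-in counter: whitespace words, not BART tokens."""
    return len(text.split())


# Hypothetical articles: k2 has too much text, k3 has too short a summary
toy_articles = [
    {"id": "k1", "text": "one two three", "summary": "short summary here"},
    {"id": "k2", "text": "one two three four five", "summary": "short summary here"},
    {"id": "k3", "text": "one two", "summary": "tiny"},
]
kept = filter_by_length(toy_articles, count_words, max_tokens=4, min_summary_tokens=2)
print([article["id"] for article in kept])  # ['k1']

counts = split_counts(24692)
print(counts)  # (14815, 7407, 2470)
```

Giving the remainder to the test split guarantees that no sample is dropped by rounding, at the cost of the test fraction being slightly different from its nominal value.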

In [7]:
from transformers import BartTokenizer
import json
import os
import random

# Initialize BART tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def filter_articles_by_token_length(input_data, max_tokens=1024, min_summary_tokens=10):
    """
    Filters articles so that the text has a maximum number of tokens, 
    and the summary has a minimum number of tokens.

    Args:
        input_data (list): List of articles with "id", "text", and "summary" keys.
        max_tokens (int): Maximum number of tokens for the text.
        min_summary_tokens (int): Minimum number of tokens for the summary.

    Returns:
        list: List of articles meeting the conditions.
    """
    filtered_data = []
    for article in input_data:
        text_tokenized = tokenizer(article["text"], truncation=False)["input_ids"]
        summary_tokenized = tokenizer(article["summary"], truncation=False)["input_ids"]
        
        # Check conditions: text and summary lengths
        if len(text_tokenized) <= max_tokens and len(summary_tokenized) >= min_summary_tokens:
            filtered_data.append(article)
    return filtered_data

def split_and_save_filtered_data(input_file, output_folder, max_tokens=1024, min_summary_tokens=10, train_pct=0.6, val_pct=0.3, test_pct=0.1):
    """
    Filters articles, splits them into training, validation, and test sets, 
    and saves the results to JSON files.

    Args:
        input_file (str): Path to the JSON file with articles.
        output_folder (str): Folder where results will be saved.
        max_tokens (int): Maximum number of tokens for the text.
        min_summary_tokens (int): Minimum number of tokens for the summary.
        train_pct (float): Fraction of data for training (e.g., 0.6).
        val_pct (float): Fraction of data for validation (e.g., 0.3).
        test_pct (float): Nominal fraction for testing; not used directly,
            because the test set receives all remaining samples.

    Returns:
        None
    """
    # Load data from file
    with open(input_file, "r") as file:
        data = json.load(file)

    # Filter articles
    print("Filtering articles based on token counts...")
    filtered_data = filter_articles_by_token_length(
        data, max_tokens=max_tokens, min_summary_tokens=min_summary_tokens
    )
    print(f"Number of articles after filtering: {len(filtered_data)}")

    # Shuffle and split data
    random.shuffle(filtered_data)
    total_samples = len(filtered_data)
    train_count = int(total_samples * train_pct)
    val_count = int(total_samples * val_pct)
    train_data = filtered_data[:train_count]
    val_data = filtered_data[train_count:train_count + val_count]
    test_data = filtered_data[train_count + val_count:]

    # Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Save splits to JSON files
    with open(os.path.join(output_folder, "train.json"), "w") as train_file:
        json.dump(train_data, train_file, indent=4, ensure_ascii=False)
    with open(os.path.join(output_folder, "val.json"), "w") as val_file:
        json.dump(val_data, val_file, indent=4, ensure_ascii=False)
    with open(os.path.join(output_folder, "test.json"), "w") as test_file:
        json.dump(test_data, test_file, indent=4, ensure_ascii=False)

    print(f"Data saved in folder: {output_folder}")
    print(f"Sample counts: Training = {len(train_data)}, Validation = {len(val_data)}, Testing = {len(test_data)}")

# File paths
input_file = "./datasets/extracted_samples.json"  # File with previously extracted data
output_folder = "./datasets/splits_filtered_with_summary"  # Folder for results

# Filter and save data
split_and_save_filtered_data(
    input_file, output_folder, max_tokens=1024, min_summary_tokens=40, train_pct=0.6, val_pct=0.3, test_pct=0.1
)

Filtering articles based on token counts...
Number of articles after filtering: 24692
Data saved in folder: ./datasets/splits_filtered_with_summary
Sample counts: Training = 14815, Validation = 7407, Testing = 2470
