# PodcastIQ - Research Grade Data Preprocessing
## Zero-Upload Training Pipeline

This notebook prepares research datasets for PodcastIQ models. Rather than relying on manually uploaded files, it downloads conversational and Q&A datasets directly from Hugging Face.

**Datasets Used:**
1. **DialogSum**: For conversational summarization. All dialogues that mention a health keyword are kept, plus a subset of the remaining dialogues.
2. **SQuAD v2**: For extractive Q&A (answerable questions only).
3. **PubMedQA**: For research-grade medical Q&A.

In [None]:
# Install dependencies
!pip install datasets transformers pandas numpy tqdm scikit-learn



In [None]:
import os
import json
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
from sklearn.model_selection import train_test_split

## 1. Summarization Data (DialogSum)
DialogSum is a large-scale dialogue summarization dataset. Because it is built from multi-turn conversations, it matches podcast transcripts more closely than news-summarization datasets do.

In [None]:
print("📥 Downloading DialogSum and PubMedQA datasets...")
ds = load_dataset("knkarthick/dialogsum")
pm_qa = load_dataset("pubmed_qa", "pqa_labeled")  # Used in the Q&A section below

health_keywords = [
    "health", "fitness", "nutrition", "diet", "protein", "muscle", "workout", "exercise",
    "sleep", "brain", "vitamin", "mineral", "metabolism", "insulin", "heart", "longevity",
    "supplement", "body", "physiology", "biomechanics", "cardio", "hypertrophy", "nutrients"
]

def format_summarization(example):
    """Format a DialogSum example for PodcastIQ training, favoring health topics.

    Returns None for examples that are filtered out.
    """
    text = example['dialogue'].replace('\n', ' ')
    lowered = text.lower()
    # Substring match, so a keyword like "body" also matches "somebody"
    is_health = any(kw in lowered for kw in health_keywords)
    # Keep every health-related dialogue, plus the other dialogues whose
    # length is a multiple of 5 (a deterministic subset of roughly one in five)
    if not is_health and len(text) % 5 != 0:
        return None

    return {
        'input': text,
        'output': example['summary'],
        'length': 'medium',
        'source': 'dialogsum_health'  # Applied to every kept record
    }

def build_split(split):
    """Format one split and drop the filtered-out examples."""
    formatted = (format_summarization(ex) for ex in split)
    return [rec for rec in formatted if rec is not None]

train_sum = build_split(ds['train'])
val_sum = build_split(ds['validation'])
test_sum = build_split(ds['test'])

print(f"✅ Loaded {len(train_sum)} health-filtered training records.")

# Save the summarization splits
with open('train_summarization.json', 'w') as f:
    json.dump(train_sum, f, indent=2)
with open('val_summarization.json', 'w') as f:
    json.dump(val_sum, f, indent=2)
with open('test_summarization.json', 'w') as f:
    json.dump(test_sum, f, indent=2)

📥 Downloading DialogSum and PubMedQA datasets...


README.md: 0.00B [00:00, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

README.md: 0.00B [00:00, ?B/s]

pqa_labeled/train-00000-of-00001.parquet:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

✅ Loaded 3566 health-filtered training records.
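
The health filter in the cell above uses plain substring checks (`kw in text.lower()`), so a keyword such as `body` also matches `somebody`, and `heart` matches `sweetheart`. A minimal sketch of whole-word matching with a compiled regular expression follows. The names `health_pattern` and `is_health_related` are illustrative and not part of the notebook, and the keyword list is a shortened copy used only for the demo.

```python
import re

# Shortened copy of the notebook's keyword list, for illustration only.
health_keywords = ["health", "body", "heart", "sleep", "diet"]

# \b on both sides makes a keyword match only as a whole word;
# the optional "s" also accepts simple plurals such as "hearts".
health_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(kw) for kw in health_keywords) + r")s?\b",
    re.IGNORECASE,
)

def is_health_related(text: str) -> bool:
    """Return True if any keyword appears as a whole word in the text."""
    return health_pattern.search(text) is not None

sample = "Somebody left a sweetheart note."
# Substring check (as in the notebook): "body" is found inside "somebody".
print(any(kw in sample.lower() for kw in health_keywords))     # True
# Whole-word check: neither "body" nor "heart" stands alone.
print(is_health_related(sample))                                # False
print(is_health_related("I changed my diet to sleep better."))  # True
```

The trade-off is that whole-word matching no longer catches inflected forms such as "healthy" or "dieting", so those would have to be listed explicitly if they should count.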


## 2. Q&A Data (SQuAD v2)
SQuAD (Stanford Question Answering Dataset) is a widely used benchmark for extractive Q&A. Version 2 adds unanswerable questions; this notebook keeps only the answerable ones.
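
To make the answerable/unanswerable distinction concrete, here is a small sketch using two hand-written records in the shape of SQuAD v2 rows on Hugging Face. The field names (`question`, `context`, `answers.text`, `answers.answer_start`) match the dataset, but the contents are made up, and `to_qa_record` is a hypothetical helper that is not defined in this notebook.

```python
# Hand-written examples in the SQuAD v2 row shape (contents are invented).
answerable = {
    "question": "How long did the guest sleep?",
    "context": "The guest said she slept eight hours a night.",
    "answers": {"text": ["eight hours"], "answer_start": [25]},
}
unanswerable = {
    "question": "What did the guest eat for breakfast?",
    "context": "The guest said she slept eight hours a night.",
    # Unanswerable questions carry empty answer lists.
    "answers": {"text": [], "answer_start": []},
}

def to_qa_record(ex):
    """Return a flat QA record, or None when the question has no answer."""
    if not ex["answers"]["text"]:
        return None
    return {
        "question": ex["question"],
        "answer": ex["answers"]["text"][0],  # First reference answer
        "context": ex["context"],
    }

print(to_qa_record(answerable)["answer"])  # eight hours
print(to_qa_record(unanswerable))          # None
```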

In [None]:
print("📥 Processing SQuAD and PubMedQA...")
qa_ds = load_dataset("squad_v2")
# Already downloaded in the previous section, so this reloads from the local cache
pm_qa = load_dataset("pubmed_qa", "pqa_labeled")

qa_records = []
# Add SQuAD samples, skipping unanswerable questions (empty answer list)
for ex in tqdm(qa_ds['train'], desc="Formatting SQuAD"):
    if ex['answers']['text']:
        qa_records.append({
            'question': ex['question'],
            'answer': ex['answers']['text'][0],
            'context': ex['context']
        })
    # Cap the SQuAD portion at 8,000 records
    if len(qa_records) >= 8000:
        break

# Add PubMedQA samples (advanced health focus)
for ex in tqdm(pm_qa['train'], desc="Formatting PubMedQA"):
    qa_records.append({
        'question': ex['question'],
        'answer': ex['long_answer'],
        'context': ex['context']['contexts'][0]  # First abstract passage only
    })
    # Overall cap of 12,000 records
    if len(qa_records) >= 12000:
        break

# 80/20 train/test split over the combined records
train_qa, test_qa = train_test_split(qa_records, test_size=0.2, random_state=42)
print(f"✅ Prepared {len(train_qa)} QA pairs (incl. PubMedQA).")

with open('train_qa.json', 'w') as f:
    json.dump(train_qa, f, indent=2)
with open('test_qa.json', 'w') as f:
    json.dump(test_qa, f, indent=2)

📥 Processing SQuAD and PubMedQA...


README.md: 0.00B [00:00, ?B/s]

squad_v2/train-00000-of-00001.parquet:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

squad_v2/validation-00000-of-00001.parqu(…):   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Formatting SQuAD:   7%|▋         | 9440/130319 [00:00<00:07, 15428.63it/s]
Formatting PubMedQA: 100%|██████████| 1000/1000 [00:00<00:00, 9166.58it/s]

✅ Prepared 7200 QA pairs (incl. PubMedQA).
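
The combined list holds 8,000 SQuAD records and 1,000 PubMedQA records, and a plain random split does not guarantee that both files keep that ratio. One possible alternative is to split each source separately. The sketch below uses only the standard library and assumes each record carries a hypothetical `source` key, which the notebook's QA records do not currently have; `split_by_source` is likewise an illustrative name.

```python
import random
from collections import defaultdict

def split_by_source(records, test_size=0.2, seed=42):
    """Split each source group separately so train and test keep the same source ratio."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["source"]].append(rec)  # "source" is an assumed field

    rng = random.Random(seed)
    train, test = [], []
    for source in sorted(groups):
        items = list(groups[source])
        rng.shuffle(items)
        n_test = round(len(items) * test_size)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    # Shuffle again so the sources are interleaved within each split.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data with the same 8:1 source ratio as the notebook's QA records.
demo = [{"source": "squad_v2", "id": i} for i in range(80)]
demo += [{"source": "pubmed_qa", "id": i} for i in range(10)]

demo_train, demo_test = split_by_source(demo)
print(len(demo_train), len(demo_test))                         # 72 18
print(sum(r["source"] == "pubmed_qa" for r in demo_test))      # 2
```

scikit-learn's `train_test_split` also accepts a `stratify` argument, which would be the more direct route if a list of per-record source labels is passed alongside `qa_records`.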





## 3. Finalize and Download
Download the processed files to use in the Training Notebooks (02 and 03).

In [None]:
from google.colab import files

processed_files = [
    'train_summarization.json',
    'val_summarization.json',
    'test_summarization.json',
    'train_qa.json',
    'test_qa.json'
]

!zip -r processed_data.zip {' '.join(processed_files)}
files.download('processed_data.zip')

print("\n✅ SUCCESS!")
print("Download 'processed_data.zip' and use it in notebooks 02 and 03.")

  adding: train_summarization.json (deflated 70%)
  adding: val_summarization.json (deflated 68%)
  adding: test_summarization.json (deflated 86%)
  adding: train_qa.json (deflated 64%)
  adding: test_qa.json (deflated 64%)



✅ SUCCESS!
Download 'processed_data.zip' and use it in notebooks 02 and 03.
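
How notebooks 02 and 03 read the archive is not shown here. As a rough sketch, the JSON files could be read straight from the zip with the standard library, without extracting them first. `load_processed` is a hypothetical helper, and the default path assumes the archive sits in the working directory.

```python
import json
import zipfile

def load_processed(zip_path="processed_data.zip"):
    """Return {file name: parsed JSON records} for every .json member of the archive."""
    data = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as fh:
                    data[name] = json.load(fh)
    return data

# Example usage once the archive is available:
# data = load_processed("processed_data.zip")
# train_sum = data["train_summarization.json"]
# train_qa = data["train_qa.json"]
```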
