## Summary

✅ **Data Preprocessing Complete!**

**What was done:**
1. Loaded LIAR dataset (train, test, val)
2. Converted 6 labels → 3 labels:
   - `true` + `mostly-true` → **REAL**
   - `false` + `barely-true` + `pants-fire` → **FAKE**
   - `half-true` → **NOT_ENOUGH_INFO**
3. Cleaned text (removed URLs, special chars, etc.)
4. Created 4 CSV files in `./preprocessed_data/`:
   - `train_data.csv` - Training data
   - `test_data.csv` - Test data
   - `val_data.csv` - Validation data
   - `combined_data.csv` - All data combined

**Next Steps:**
- Notebook 03: Feature Extraction
- Notebook 04: Semantic Similarity Model
- Notebook 05: NLI Model Training 🚀

In [None]:
# Combine all datasets into one
combined_df = pd.concat([train_final, test_final, val_final], ignore_index=True)

# Add dataset type column (for future reference)
train_final_copy = train_final.copy()
train_final_copy['dataset_type'] = 'train'

test_final_copy = test_final.copy()
test_final_copy['dataset_type'] = 'test'

val_final_copy = val_final.copy()
val_final_copy['dataset_type'] = 'val'

combined_with_type = pd.concat([train_final_copy, test_final_copy, val_final_copy], ignore_index=True)

# Save combined dataset
combined_with_type.to_csv('./preprocessed_data/combined_data.csv', index=False)

print("✅ Combined dataset saved!")
print(f"Total samples: {len(combined_with_type)}")
print(f"\nFinal Distribution:")
print(combined_with_type['label'].value_counts())
print(f"\nDataset Type Distribution:")
print(combined_with_type['dataset_type'].value_counts())
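
The three copy-and-tag steps above can also be written as one expression with `DataFrame.assign`. This is a minimal sketch of an equivalent formulation, not the notebook's own code; the toy frames stand in for `train_final`, `test_final` and `val_final`, and `splits` is a hypothetical name.

```python
import pandas as pd

# Toy stand-ins for the notebook's train_final / test_final / val_final.
train_final = pd.DataFrame({'claim': ['a', 'b'], 'label': ['REAL', 'FAKE']})
test_final = pd.DataFrame({'claim': ['c'], 'label': ['FAKE']})
val_final = pd.DataFrame({'claim': ['d'], 'label': ['NOT_ENOUGH_INFO']})

# Hypothetical helper dict: split name -> frame.
splits = {'train': train_final, 'test': test_final, 'val': val_final}

# assign() returns a new frame, so the originals are left untouched.
combined_with_type = pd.concat(
    [df.assign(dataset_type=name) for name, df in splits.items()],
    ignore_index=True,
)

print(combined_with_type['dataset_type'].value_counts())
```

Because `assign` does not modify its input, the explicit `.copy()` calls are not needed in this variant.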

## Step 8: Create Combined Dataset (Optional)

In [None]:
# Display sample data by label
print("="*80)
print("SAMPLE DATA AFTER PREPROCESSING")
print("="*80)

for label in ['REAL', 'FAKE', 'NOT_ENOUGH_INFO']:
    print(f"\n📌 Label: {label}")
    print("-" * 80)

    sample = train_final[train_final['label'] == label].iloc[0]
    print(f"ID: {sample['claim_id']}")
    print(f"Claim: {sample['claim'][:100]}...")
    print(f"Speaker: {sample['speaker']}")
    print(f"Label: {sample['label']}")

## Step 7: Display Sample Data After Preprocessing

In [None]:
# Select relevant columns for ML modeling
columns_to_keep = ['claim_id', 'claim', 'label', 'speaker', 'context']

# Create final datasets with only needed columns
train_final = train_clean[columns_to_keep].copy()
test_final = test_clean[columns_to_keep].copy()
val_final = val_clean[columns_to_keep].copy()

print("Final datasets prepared:")
print(f"Train: {train_final.shape}")
print(f"Test: {test_final.shape}")
print(f"Val: {val_final.shape}")

# Create output directory
os.makedirs('./preprocessed_data', exist_ok=True)

# Save to CSV
train_final.to_csv('./preprocessed_data/train_data.csv', index=False)
test_final.to_csv('./preprocessed_data/test_data.csv', index=False)
val_final.to_csv('./preprocessed_data/val_data.csv', index=False)

print("\n✅ CSV files saved to ./preprocessed_data/")
print("  - train_data.csv")
print("  - test_data.csv")
print("  - val_data.csv")
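
A quick round-trip check can confirm that a saved CSV reads back with the same rows and text. The sketch below is illustrative only: it writes a toy frame to a temporary directory rather than to `./preprocessed_data/`, and the row values are made up. Passing `dtype=str` for the text columns is a precaution so that pandas does not infer a numeric type for them.

```python
import os
import tempfile

import pandas as pd

# Toy rows for illustration; not taken from the LIAR dataset.
toy = pd.DataFrame({
    'claim_id': ['1.json', '2.json'],
    'claim': ['taxes rose in 2010', '2010'],
    'label': ['FAKE', 'REAL'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_data.csv')
    toy.to_csv(path, index=False)
    # Read the text columns back as strings rather than letting pandas guess.
    reloaded = pd.read_csv(path, dtype={'claim_id': str, 'claim': str, 'label': str})

print(reloaded.shape)
print(reloaded['claim'].tolist())
```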

## Step 6: Select Relevant Columns & Save CSV Files

In [None]:
print("="*60)
print("LABEL DISTRIBUTION AFTER CONVERSION")
print("="*60)

for df_name, df in [('TRAIN', train_clean), ('TEST', test_clean), ('VAL', val_clean)]:
    print(f"\n{df_name} Set:")
    label_dist = df['label'].value_counts()
    print(label_dist)
    print(f"Total: {len(df)}")

    # Calculate percentages
    print("\nPercentages:")
    for label, count in label_dist.items():
        pct = (count / len(df)) * 100
        print(f"  {label}: {pct:.2f}%")
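
The percentage loop above can be replaced by `value_counts(normalize=True)`, which returns proportions directly. A minimal sketch on a toy frame (the four rows are made up; `summary` is a hypothetical name):

```python
import pandas as pd

# Toy labels for illustration only.
toy = pd.DataFrame({'label': ['REAL', 'FAKE', 'FAKE', 'NOT_ENOUGH_INFO']})

# Counts and percentages side by side, aligned on the label index.
summary = pd.DataFrame({
    'count': toy['label'].value_counts(),
    'percent': (toy['label'].value_counts(normalize=True) * 100).round(2),
})

print(summary)
```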

## Step 5: Verify Label Distribution After Conversion

In [None]:
# Preprocessing function
def preprocess_dataframe(df):
    """
    Apply all preprocessing steps to a dataframe
    """
    df_processed = df.copy()

    # Convert labels
    df_processed['label'] = df_processed['label'].apply(convert_labels)

    # Clean claim text
    print("Cleaning text...")
    df_processed['claim'] = df_processed['claim'].apply(clean_text)

    # Remove rows with empty claims
    initial_len = len(df_processed)
    df_processed = df_processed[df_processed['claim'].str.len() > 0]
    removed = initial_len - len(df_processed)

    if removed > 0:
        print(f"  Removed {removed} rows with empty claims")

    return df_processed

# Apply preprocessing to all datasets
print("\nProcessing TRAIN set...")
train_clean = preprocess_dataframe(train_df)

print("\nProcessing TEST set...")
test_clean = preprocess_dataframe(test_df)

print("\nProcessing VAL set...")
val_clean = preprocess_dataframe(val_df)

print("\n✅ Preprocessing complete!")
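
One side effect of the empty-claim filter is that the surviving rows keep their original index values, so the index has gaps. That does not affect the saved CSVs (they are written with `index=False`), but it can matter for later in-memory, label-based lookups. A small sketch of the effect and of `reset_index(drop=True)` as one possible remedy, using made-up rows:

```python
import pandas as pd

# Toy frame: two claims are empty after cleaning.
toy = pd.DataFrame({
    'claim': ['', 'taxes rose', ''],
    'label': ['FAKE', 'REAL', 'FAKE'],
})

kept = toy[toy['claim'].str.len() > 0]
index_before = kept.index.tolist()   # original row labels survive the filter

kept = kept.reset_index(drop=True)   # renumber from 0, discarding old labels
index_after = kept.index.tolist()

print(index_before, index_after)
```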

## Step 4: Apply Preprocessing

In [None]:
def clean_text(text):
    """
    Clean and normalize text:
    - Remove URLs
    - Remove HTML tags
    - Remove email addresses
    - Remove special characters (apostrophes are kept)
    - Convert to lowercase
    - Remove extra whitespace
    """
    if not isinstance(text, str):
        return ""

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove special characters but keep apostrophes
    text = re.sub(r'[^a-zA-Z0-9\s\']', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

# Test the cleaning function
test_claim = "Check this URL: http://example.com! It's <important> & should be cleaned."
print("Original:", test_claim)
print("Cleaned: ", clean_text(test_claim))
print("\n✅ Text cleaning function defined!")
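
A few edge cases are worth checking before applying the function to the whole dataset. The sketch below repeats a condensed copy of `clean_text` so it runs on its own; the example strings are made up. Note in particular that decimal points and percent signs are replaced by spaces, so a figure such as `3.5%` becomes `3 5`, which may matter for claims that hinge on numbers.

```python
import re

def clean_text(text):
    """Condensed copy of the notebook's clean_text, repeated so this runs alone."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # URLs
    text = re.sub(r'<.*?>', '', text)             # HTML tags
    text = re.sub(r'\S+@\S+', '', text)           # email addresses
    text = re.sub(r"[^a-zA-Z0-9\s']", ' ', text)  # keep letters, digits, apostrophes
    return ' '.join(text.lower().split())

# Made-up inputs covering numbers, an email address, and a non-string value.
examples = [
    "Taxes rose 3.5% in 2010, he said.",
    "Contact me@example.com now",
    float('nan'),
]
cleaned = [clean_text(x) for x in examples]

for raw, out in zip(examples, cleaned):
    print(repr(raw), '->', repr(out))
```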

## Step 3: Text Cleaning Function

In [None]:
# Label Mapping: 6 labels → 3 labels
label_mapping = {
    'true': 'REAL',
    'mostly-true': 'REAL',
    'half-true': 'NOT_ENOUGH_INFO',
    'barely-true': 'FAKE',
    'false': 'FAKE',
    'pants-fire': 'FAKE'
}

def convert_labels(label):
    """Convert 6-class labels to 3-class labels"""
    return label_mapping.get(label, 'NOT_ENOUGH_INFO')

# Test the mapping
print("Label Mapping:")
print("-" * 40)
for old_label, new_label in label_mapping.items():
    print(f"{old_label:20} → {new_label}")

print("\n✅ Label mapping defined!")
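
Because `convert_labels` falls back to `NOT_ENOUGH_INFO` for any value not in `label_mapping`, a misspelled or unexpected raw label is silently folded into that class. The sketch below lists such values before conversion; `find_unmapped_labels` is a hypothetical helper and the toy series is made up. The same check would also flag a column-order mismatch at load time: if the TSVs follow the layout described in the original LIAR README (ID, label, statement, subject, ...), which is an assumption about the files in `./data/`, the `label` column would not contain the six expected values.

```python
import pandas as pd

# Same mapping as in the cell above, repeated so the sketch runs on its own.
label_mapping = {
    'true': 'REAL',
    'mostly-true': 'REAL',
    'half-true': 'NOT_ENOUGH_INFO',
    'barely-true': 'FAKE',
    'false': 'FAKE',
    'pants-fire': 'FAKE',
}

def find_unmapped_labels(labels: pd.Series) -> set:
    """Return raw labels that are absent from label_mapping (hypothetical helper)."""
    return set(labels.dropna().unique()) - set(label_mapping)

# Toy data for illustration only; not taken from LIAR.
toy_labels = pd.Series(['true', 'false', 'pants-fire', 'Half-True', None])
unmapped = find_unmapped_labels(toy_labels)

print("Unmapped labels:", unmapped)
```

An empty set means every raw label is covered by the mapping; anything else is worth inspecting before running the conversion.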

## Step 2: Label Mapping Function

In [None]:
# Define column names
columns = ['claim_id', 'claim', 'label', 'speaker', 'speaker_job_title', 
           'state_info', 'party_affiliation', 'barely_true_counts', 
           'false_counts', 'half_true_counts', 'mostly_true_counts', 
           'pants_on_fire_counts', 'context']

# Load datasets
train_df = pd.read_csv('./data/train.tsv', sep='\t', header=None, names=columns, on_bad_lines='skip')
test_df = pd.read_csv('./data/test.tsv', sep='\t', header=None, names=columns, on_bad_lines='skip')
val_df = pd.read_csv('./data/val.tsv', sep='\t', header=None, names=columns, on_bad_lines='skip')

print("✅ Raw data loaded!")
print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
print(f"Val shape: {val_df.shape}")

## Step 1: Load Raw Data

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import re
import os
from collections import Counter

print("✅ All libraries imported successfully!")

# 02 - Data Preprocessing & Cleaning
## Label Conversion & Text Cleaning

This notebook preprocesses the LIAR dataset:
- Converts 6 labels → 3 labels (REAL, FAKE, NOT_ENOUGH_INFO)
- Cleans text (removes special chars, URLs, etc.)
- Creates train/val/test CSV files for modeling