# Common Crawl Data Exploration

This notebook explores how company information for Australian businesses is extracted from Common Crawl WET files, the plain-text extracts of crawled pages.

## Contents
1. Understanding WET File Format
2. Create Sample Data for Testing
3. Apply Data Cleaning
4. Data Quality Analysis
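
As background for the WET format, the sketch below parses records from an uncompressed WET stream using only the standard library. A WET file is a sequence of WARC records; page text lives in records whose `WARC-Type` is `conversion`, with the source URL in `WARC-Target-URI` and the body size in `Content-Length`. The helpers `make_record` and `iter_wet_records` are hypothetical names introduced for illustration and are not the project's `CommonCrawlParser`; the sample records are hand-written, not real crawl data.

```python
import io


def make_record(uri: str, text: str) -> bytes:
    """Build one hand-written WET-style record (illustrative, not real crawl data)."""
    body = text.encode("utf-8")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Records are separated by two CRLF pairs.
    return headers + body + b"\r\n\r\n"


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly Content-Length bytes so body text is never mistaken for headers.
        length = int(headers.get("Content-Length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


sample = io.BytesIO(
    make_record("https://www.acme.com.au/about", "Welcome to ACME Corporation.")
    + make_record("https://techcorp.com.au/", "TechCorp provides technology solutions.")
)
for headers, text in iter_wet_records(sample):
    if headers.get("WARC-Type") == "conversion":
        print(headers["WARC-Target-URI"], "->", text)
```

Real WET files are gzip-compressed, so in practice the stream would come from something like `gzip.open(path, "rb")` rather than `io.BytesIO`. The first record of a real file is usually a `warcinfo` record, which is why the loop filters on `WARC-Type`.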


In [None]:
# Import libraries
import sys
# Add the parent directory to the import path so the `src` package resolves
sys.path.insert(0, '..')

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Import our custom modules
from src.ingest.parse_commoncrawl import CommonCrawlParser
from src.ingest.download_commoncrawl import download_wet_partial

print("Libraries imported successfully!")


## 2. Create Sample Data for Testing

Downloading real Common Crawl WET files requires significant bandwidth, so this section builds a small, hand-written sample DataFrame to explore instead.


In [None]:
# Create sample Common Crawl data
sample_data = pd.DataFrame({
    'url': [
        'https://www.acme.com.au/about',
        'https://techcorp.com.au/',
        'https://greenenergy.com.au/services',
        'https://sydneyconsulting.com.au',
        'https://melbournefinance.com.au'
    ],
    'company_name': [
        'ACME Corporation Pty Ltd',
        'TechCorp Australia',
        'Green Energy Partners Pty Ltd',
        'Sydney Consulting Group',
        'Melbourne Financial Services Ltd'
    ],
    'industry': [
        'Manufacturing',
        'Technology',
        'Energy',
        'Consulting',
        'Finance'
    ],
    'raw_text': [
        'Welcome to ACME Corporation. We are a leading manufacturing company in Australia.',
        'TechCorp provides innovative technology solutions for businesses.',
        'Green Energy Partners is committed to sustainable energy solutions.',
        'Sydney Consulting offers expert business consulting services.',
        'Melbourne Financial provides comprehensive financial services.'
    ]
})

print(f"Sample data created: {len(sample_data)} records")
sample_data


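## 3. Apply Data Cleaning

The cell below runs `clean_commoncrawl_data` on the sample data and reports the columns it adds, then displays `normalized_name`, `block_key`, and `domain` alongside the original company name.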


In [None]:
# Apply cleaning transformations
from src.transform.clean_commoncrawl import clean_commoncrawl_data

cleaned_df = clean_commoncrawl_data(sample_data)
print(f"After cleaning: {len(cleaned_df)} records")
# Sort the set difference so the printed column order is stable between runs
print(f"\nNew columns added: {sorted(set(cleaned_df.columns) - set(sample_data.columns))}")
cleaned_df[['company_name', 'normalized_name', 'block_key', 'domain']]
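
To make the three derived columns concrete, here is a minimal sketch of how values like `domain`, `normalized_name`, and `block_key` could be computed. This is an assumption for illustration only: the actual logic lives in `clean_commoncrawl_data`, which is not shown here and may differ. The function names, the legal-suffix list, and the four-character blocking key are all hypothetical choices.

```python
import re
from urllib.parse import urlparse

# Hypothetical suffix list; the project's cleaner may strip a different set.
LEGAL_SUFFIXES = re.compile(r"\b(pty\s+ltd|pty|ltd|limited)\b")


def extract_domain(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation and legal suffixes, and collapse whitespace."""
    lowered = name.lower()
    lowered = re.sub(r"[^a-z0-9\s]", " ", lowered)
    lowered = LEGAL_SUFFIXES.sub(" ", lowered)
    return " ".join(lowered.split())


def block_key(normalized: str, length: int = 4) -> str:
    """Take the first few characters (spaces removed) as a coarse blocking key."""
    return normalized.replace(" ", "")[:length]


name = normalize_name("ACME Corporation Pty Ltd")
print(extract_domain("https://www.acme.com.au/about"))  # acme.com.au
print(name)                                             # acme corporation
print(block_key(name))                                  # acme
```

With pandas, the same functions could be applied column-wise, for example `sample_data['url'].map(extract_domain)`. A short prefix key like this keeps candidate pairs for later record matching small, at the cost of missing names that differ in their first few characters.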
