# 10-X 2024 Data Quality Assessment

**Purpose:** Evaluate the raw 10-x_2024.zip file to assess:
1. File format (HTML vs. cleaned text)
2. Table formatting quality
3. Comparison with existing cleaned data in `data/external/clean/`
4. Recommendations for usage in RAG pipeline

In [1]:
import zipfile
import re
import random
from bs4 import BeautifulSoup
import pandas as pd
from pathlib import Path

## 1. File Overview

In [None]:
# File paths
raw_zip = r'C:\Users\kabec\Documents\edgar_anomaly_detection\data\external\raw\10-x_2024.zip'
clean_zip = r'C:\Users\kabec\Documents\edgar_anomaly_detection\data\external\clean\10-X_C_2024.zip'

# Open both files
z_raw = zipfile.ZipFile(raw_zip)
z_clean = zipfile.ZipFile(clean_zip)

print(f"Raw zip file size: {Path(raw_zip).stat().st_size / 1e9:.2f} GB")
print(f"Clean zip file size: {Path(clean_zip).stat().st_size / 1e9:.2f} GB")
print(f"\nRaw files: {len(z_raw.namelist()):,}")
print(f"Clean files: {len(z_clean.namelist()):,}")

## 2. Sample File Comparison

In [3]:
# Get list of txt files (exclude directories)
raw_files = [f for f in z_raw.namelist() if f.endswith('.txt')]
clean_files = [f for f in z_clean.namelist() if f.endswith('.txt')]

print(f"Raw .txt files: {len(raw_files):,}")
print(f"Clean .txt files: {len(clean_files):,}")

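# Sample 5 random files present in both archives
# (unseeded, so the selection changes on every run)
clean_set = set(clean_files)  # set lookup avoids an O(n) list scan per file
sample_files = random.sample([f for f in raw_files if f in clean_set], 5)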
print(f"\nSampled files for comparison:")
for f in sample_files:
    print(f"  - {f}")

Raw .txt files: 26,014
Clean .txt files: 26,014

Sampled files for comparison:
  - 2024/QTR3/20240807_10-Q_edgar_data_1043186_0001437749-24-025209.txt
  - 2024/QTR2/20240401_10-K_edgar_data_1577310_0000950170-24-039217.txt
  - 2024/QTR2/20240423_10-Q_edgar_data_1408198_0001408198-24-000075.txt
  - 2024/QTR1/20240327_10-K_edgar_data_1895618_0001213900-24-026681.txt
  - 2024/QTR3/20240808_10-Q_edgar_data_849869_0000849869-24-000128.txt
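
Because the sampling cell above is unseeded, the five files listed will differ on a re-run. A minimal sketch of a reproducible variant follows; the helper name `sample_common`, the `seed` parameter, and the toy file names are assumptions for illustration, not part of the original notebook.

```python
import random

def sample_common(raw_names, clean_names, k=5, seed=0):
    """Return up to k names present in both lists, reproducibly."""
    # Set intersection avoids an O(n) list scan per membership test;
    # sorting makes the result independent of archive ordering.
    common = sorted(set(raw_names) & set(clean_names))
    # A private Random instance leaves the global RNG state untouched.
    rng = random.Random(seed)
    return rng.sample(common, min(k, len(common)))

# Toy names standing in for z_raw.namelist() / z_clean.namelist()
raw_names = [f"2024/QTR1/file_{i}.txt" for i in range(100)]
clean_names = [f"2024/QTR1/file_{i}.txt" for i in range(50, 150)]

picked = sample_common(raw_names, clean_names, k=5, seed=0)
print(picked)
```

Calling the helper twice with the same seed returns the same files, which makes the tables in later sections comparable across runs.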


## 3. HTML Content Analysis

In [4]:
def analyze_file(zip_file, filename):
    """Analyze a single file for HTML content and tables."""
    content = zip_file.read(filename).decode('utf-8', errors='ignore')
    
    # Basic stats
    stats = {
        'filename': filename.split('/')[-1],
        'size_chars': len(content),
        'html_tags': len(re.findall(r'<[^>]+>', content)),
    }
    
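    # Find the first literal <html>...</html> block only; later blocks and
    # <html ...> tags that carry attributes are not matched by this pattern
    html_match = re.search(r'<html>.*?</html>', content, re.DOTALL | re.IGNORECASE)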
    if html_match:
        html_content = html_match.group(0)
        soup = BeautifulSoup(html_content, 'html.parser')
        
        stats['tables'] = len(soup.find_all('table'))
        stats['has_html'] = True
    else:
        stats['tables'] = 0
        stats['has_html'] = False
    
    return stats

# Analyze sample files
results = []
for filename in sample_files:
    raw_stats = analyze_file(z_raw, filename)
    clean_stats = analyze_file(z_clean, filename)
    
    results.append({
        'file': raw_stats['filename'],
        'raw_size': raw_stats['size_chars'],
        'clean_size': clean_stats['size_chars'],
        'raw_html_tags': raw_stats['html_tags'],
        'clean_html_tags': clean_stats['html_tags'],
        'raw_tables': raw_stats['tables'],
        'clean_tables': clean_stats['tables'],
        'raw_has_html': raw_stats['has_html'],
        'clean_has_html': clean_stats['has_html']
    })

df = pd.DataFrame(results)
df

| | file | raw_size | clean_size | raw_html_tags | clean_html_tags | raw_tables | clean_tables | raw_has_html | clean_has_html |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20240807_10-Q_edgar_data_1043186_0001437749-24... | 5423966 | 77537 | 86334 | 39 | 12 | 0 | True | False |
| 1 | 20240401_10-K_edgar_data_1577310_0000950170-24... | 4116743 | 398378 | 65075 | 67 | 15 | 0 | True | False |
| 2 | 20240423_10-Q_edgar_data_1408198_0001408198-24... | 7402510 | 117241 | 124470 | 39 | 1 | 0 | True | False |
| 3 | 20240327_10-K_edgar_data_1895618_0001213900-24... | 5090888 | 343313 | 98264 | 47 | 0 | 0 | True | False |
| 4 | 20240808_10-Q_edgar_data_849869_0000849869-24-... | 5211667 | 77927 | 89355 | 43 | 1 | 0 | True | False |
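
The `analyze_file` helper inspects only the first literal `<html>...</html>` match, so the table counts above may undercount filings that bundle several documents. The raw text preview in section 5 begins at Exhibit 31.1, which hints that the first match was not the main filing body. The sketch below collects every HTML block, including openings that carry attributes. The names `find_html_blocks` and `count_tables` are hypothetical, the sample string is synthetic, and whether the 2024 filings actually use attributed `<html ...>` tags is an assumption that would need checking against the archive.

```python
import re

# Matches <html> with or without attributes (e.g. <html xmlns="...">)
HTML_BLOCK = re.compile(r'<html\b[^>]*>.*?</html>', re.DOTALL | re.IGNORECASE)

def find_html_blocks(content):
    """Return every <html>...</html> block in a filing, in document order."""
    return [m.group(0) for m in HTML_BLOCK.finditer(content)]

def count_tables(content):
    """Count <table> openings across all HTML blocks (regex approximation)."""
    return sum(
        len(re.findall(r'<table\b', block, re.IGNORECASE))
        for block in find_html_blocks(content)
    )

# Synthetic two-document filing used only for illustration
sample = (
    '<DOCUMENT><html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<table><tr><td>1</td></tr></table></body></html></DOCUMENT>'
    '<DOCUMENT><html><body><table></table><table></table></body></html></DOCUMENT>'
)

blocks = find_html_blocks(sample)
print(len(blocks), count_tables(sample))
```

Counting `<table` with a regex is cruder than parsing with BeautifulSoup, but it keeps the sketch dependency-free; the same loop could pass each block to `BeautifulSoup(block, 'html.parser')` instead.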


## 4. Table Quality Inspection

In [5]:
# Examine table structure in the first sampled file
# (12 tables in this run; the choice is not guaranteed to contain tables)
sample_with_tables = sample_files[0]

# Raw version
raw_content = z_raw.read(sample_with_tables).decode('utf-8', errors='ignore')
html_match = re.search(r'<html>.*?</html>', raw_content, re.DOTALL | re.IGNORECASE)

if html_match:
    soup = BeautifulSoup(html_match.group(0), 'html.parser')
    tables = soup.find_all('table')
    
    print(f"[OK] Found {len(tables)} tables in {sample_with_tables.split('/')[-1]}")
    print("\n=== First table structure (first 500 chars) ===")
    if tables:
        print(str(tables[0])[:500])
        
        # Try to extract table data
        print("\n=== Attempting to parse table to DataFrame ===")
        try:
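            # Read tables with pandas; wrap the markup in StringIO because
            # passing literal HTML strings to read_html is deprecated
            from io import StringIO
            dfs = pd.read_html(StringIO(str(tables[0])))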
            if dfs:
                print(f"[OK] Successfully parsed table with shape: {dfs[0].shape}")
                print("\nFirst few rows:")
                print(dfs[0].head())
        except Exception as e:
            print(f"[WARN] Could not parse table: {e}")
else:
    print("[WARN] No HTML content found in sample file")

[OK] Found 12 tables in 20240807_10-Q_edgar_data_1043186_0001437749-24-025209.txt

=== First table structure (first 500 chars) ===
<table border="0" cellpadding="0" cellspacing="0" style="width: 100%; text-indent: 0px;">
<tr style="vertical-align: top;">
<td style="width: 18pt;">
<p style='margin: 0pt; text-align: left; font-family: "Times New Roman", Times, serif; font-size: 10pt;'>1.</p>
</td>
<td style="width: auto;">
<p style='margin: 0pt; text-align: left; font-family: "Times New Roman", Times, serif; font-size: 10pt;'>I have reviewed this Quarterly Report on Form 10-Q of Stabilis Solutions, Inc.;</p>
</td>
</tr>
</tab

=== Attempting to parse table to DataFrame ===
[WARN] Could not parse table: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.
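
Since `pd.read_html` failed here for lack of `lxml`, the table above was never actually converted to rows. As a fallback that needs no extra install, the following sketch extracts cell text with the standard library's `html.parser`. The class name `TableRows` is hypothetical, the input is a simplified copy of the table printed above, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Collapse whitespace left over from <p> wrappers and newlines
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Simplified version of the first table shown in the output above
table_html = """<table><tr><td><p>1.</p></td>
<td><p>I have reviewed this Quarterly Report on Form 10-Q of Stabilis Solutions, Inc.;</p></td></tr></table>"""

parser = TableRows()
parser.feed(table_html)
parser.close()
rows = parser.rows
print(rows)
```

The resulting list of lists can be passed straight to `pd.DataFrame(rows)`. Installing `lxml` and re-running the cell remains the simpler route if the environment allows it.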



## 5. Text Quality: Raw vs. Clean

In [6]:
# Compare text extraction from both versions
sample_file = sample_files[0]

raw_content = z_raw.read(sample_file).decode('utf-8', errors='ignore')
clean_content = z_clean.read(sample_file).decode('utf-8', errors='ignore')

# Extract text from raw HTML
html_match = re.search(r'<html>.*?</html>', raw_content, re.DOTALL | re.IGNORECASE)
if html_match:
    soup = BeautifulSoup(html_match.group(0), 'html.parser')
    raw_text = soup.get_text(separator=' ', strip=True)
else:
    raw_text = raw_content

print("=== RAW VERSION (first 1000 chars) ===")
print(raw_text[:1000])
print("\n" + "="*50 + "\n")
print("=== CLEAN VERSION (first 1000 chars) ===")
print(clean_content[:1000])

=== RAW VERSION (first 1000 chars) ===
ex_680480.htm Exhibit 31.1 CERTIFICATIONS I, Westervelt T. Ballard, Jr., certify that: 1. I have reviewed this Quarterly Report on Form 10-Q of Stabilis Solutions, Inc.; 2. Based on my knowledge, this report does not contain any untrue statement of a material fact or omit to state a material fact necessary to make the statements made, in light of the circumstances under which such statements were made, not misleading with respect to the period covered by this report; 3. Based on my knowledge, the financial statements, and other financial information included in this report, fairly present in all material respects the financial condition, results of operations and cash flows of the registrant as of, and for, the periods presented in this report; 4. The registrant's other certifying officer(s) and I are responsible for establishing and maintaining disclosure controls and procedures (as defined in Exchange Act Rules 13a-15(e) and l5d-15(e)) and inter
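
The clean files still register 39 to 67 matches for the `<[^>]+>` pattern in section 3, even though none of them contains an `<html>` block. That pattern also matches any text that happens to sit between a `<` and a `>`, so a tally of tag names is a quick way to see what the matches really are. The sketch below is one way to do that; the names `TAG_NAME` and `tally_tag_names` and the toy string (including its `<Header>` and `<FileStats>` markers) are hypothetical and not taken from the archive.

```python
import re
from collections import Counter

# Stricter than <[^>]+>: requires a tag name right after '<' or '</'
TAG_NAME = re.compile(r'<\s*/?\s*([A-Za-z][\w:-]*)[^>]*>')

def tally_tag_names(text):
    """Count tag-like tokens by lowercase name."""
    return Counter(m.group(1).lower() for m in TAG_NAME.finditer(text))

# Hypothetical text: metadata markers, a numeric comparison, one real tag pair
toy = (
    "<Header>\n<FileStats>x</FileStats>\n</Header>\n"
    "Revenue < 10 and > 5 <p>text</p>"
)

loose = len(re.findall(r'<[^>]+>', toy))   # the notebook's pattern
strict = tally_tag_names(toy)              # name-based tally
print(loose, dict(strict))
```

In the toy string the loose pattern counts one more match than the strict one, because `< 10 and >` is treated as a tag. Running `tally_tag_names(clean_content)` on a real clean file would show whether its matches are leftover markup, metadata markers, or plain text.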

## 6. Summary Statistics

In [7]:
print("=== COMPARISON SUMMARY ===")
print(f"\nFile Count:")
print(f"  Raw:   {len(raw_files):,} files")
print(f"  Clean: {len(clean_files):,} files")

print(f"\nAverage File Size (from {len(results)} samples):")
print(f"  Raw:   {df['raw_size'].mean():,.0f} chars")
print(f"  Clean: {df['clean_size'].mean():,.0f} chars")
print(f"  Reduction: {(1 - df['clean_size'].mean() / df['raw_size'].mean()) * 100:.1f}%")

print(f"\nHTML Tags (from {len(results)} samples):")
print(f"  Raw:   {df['raw_html_tags'].mean():,.0f} tags/file")
print(f"  Clean: {df['clean_html_tags'].mean():,.0f} tags/file")

print(f"\nTables (from {len(results)} samples):")
print(f"  Raw:   {df['raw_tables'].mean():,.1f} tables/file")
print(f"  Clean: {df['clean_tables'].mean():,.1f} tables/file")
print(f"\n  Files with HTML:")
print(f"    Raw:   {df['raw_has_html'].sum()}/{len(df)} files")
print(f"    Clean: {df['clean_has_html'].sum()}/{len(df)} files")

=== COMPARISON SUMMARY ===

File Count:
  Raw:   26,014 files
  Clean: 26,014 files

Average File Size (from 5 samples):
  Raw:   5,449,155 chars
  Clean: 202,879 chars
  Reduction: 96.3%

HTML Tags (from 5 samples):
  Raw:   92,700 tags/file
  Clean: 47 tags/file

Tables (from 5 samples):
  Raw:   5.8 tables/file
  Clean: 0.0 tables/file

  Files with HTML:
    Raw:   5/5 files
    Clean: 0/5 files
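
The averages above come from only 5 of 26,014 files. A size comparison over the whole archive is cheap because each zip entry records its uncompressed size, so nothing has to be decompressed. The sketch below uses `ZipFile.infolist()` and `ZipInfo.file_size`; the helper names and the in-memory demo archives are assumptions for illustration. Note that it measures bytes, whereas the cell above measures decoded characters, so the two percentages can differ slightly.

```python
import io
import zipfile

def total_uncompressed_bytes(zf, suffix='.txt'):
    """Sum uncompressed entry sizes from the zip directory (no decompression)."""
    return sum(
        info.file_size for info in zf.infolist()
        if info.filename.endswith(suffix)
    )

def reduction_pct(raw_zf, clean_zf):
    """Percentage by which the clean archive's text is smaller than the raw one."""
    raw_total = total_uncompressed_bytes(raw_zf)
    clean_total = total_uncompressed_bytes(clean_zf)
    return (1 - clean_total / raw_total) * 100

def make_zip(files):
    """Build a small in-memory zip for the demo (stands in for z_raw / z_clean)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    buf.seek(0)
    return zipfile.ZipFile(buf)

raw_demo = make_zip({'a.txt': 'x' * 1000, 'b.txt': 'y' * 1000})
clean_demo = make_zip({'a.txt': 'x' * 50, 'b.txt': 'y' * 50})

pct = reduction_pct(raw_demo, clean_demo)
print(f"Reduction: {pct:.1f}%")
```

In the notebook, `reduction_pct(z_raw, z_clean)` would give an archive-wide figure to set beside the 96.3% sample estimate.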


## 7. Recommendations

Based on the analysis above:

### Key Findings
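1. **Raw files contain full HTML**: all 5 sampled raw files had an `<html>` block, averaging about 92,700 tags per file, with formatting and table markup preserved
2. **Clean files are pre-processed** - HTML stripped, exhibits removed, and about 96% smaller by character count in the sample
3. **Table preservation**: Raw files retain `<table>` markup (5.8 tables per file on average, counted in the first HTML block only); clean files contain no `<table>` elements, so tabular content survives only as plain text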

### Recommendations

**For RAG Pipeline:**
- **Use CLEANED data** (`data/external/clean/10-X_C_2024.zip`) for current RAG pipeline
  - Already processed and text-extracted
  - About 96% smaller by character count (96.3% in the 5-file sample)
  - Easier to chunk and embed
  - Notre Dame SRAF already did the heavy lifting (HTML parsing, exhibit removal)

**For Table-Specific Analysis (Future Work):**
- **Keep raw data** for specialized table extraction if needed
  - Financial statements tables
  - Structured data extraction
  - Custom HTML parsing

**Action Items:**
1. Keep `10-x_2024.zip` in `data/external/raw/`, the location this notebook reads it from (create the folder and move the file there if it is stored elsewhere)
2. Continue using cleaned data in `data/external/clean/` for RAPTOR pipeline
3. Delete this notebook after reviewing findings (per CLAUDE.md guidelines)