# WJEC Copyediting Results Analysis

## Overview

This notebook analyses the results from the various automated copyediting tools applied to the entire corpus of WJEC 'Made for Wales' GCSE documents. The documents were scraped directly from the WJEC website on November 8th 2025. I pushed them to GitHub at the same time so that there is an immutable record of the exact documents used for this analysis.

## Methodology

### Data Collection

1. **Scraping**: The documents were scraped from the WJEC website using a custom Python script that navigated to the relevant pages and downloaded the PDF files.
2. **Conversion**: The PDFs were converted to markdown using [Marker](https://github.com/datalab-to/marker), aided by `Gemini-2.5-Flash-Lite` to enhance text extraction quality. I also tried Docling and PDF2Markdown, but Marker produced the best results.
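
The scraping script itself is not reproduced in this notebook. As a rough illustration of the link-discovery part of step 1, the sketch below pulls PDF links out of a page's HTML using only the standard library. The names `PdfLinkParser` and `extract_pdf_links` are hypothetical, and the real script may work quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return de-duplicated PDF URLs in the order they appear (hypothetical helper)."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))
```

Each returned URL could then be downloaded and written to disk. Resolving relative links against the page URL keeps the output usable regardless of how the site writes its `href` values.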

### Copyediting Process

Once the documents had been converted, the proofreading tools were applied to the markdown in passes, with each pass building on the previous one.

#### 1. LanguageTool Spelling and Grammar Checker

LanguageTool is a free and open-source spelling and grammar checker. It was used to show that WJEC had not run even a basic spelling and grammar check on its documents.

Several passes were made.
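
How the checker was invoked is not shown in this notebook. One plausible route, sketched below, is LanguageTool's HTTP API (`/v2/check`), with each match flattened into a row suitable for a CSV report. The `en-GB` language code, the helper names, and the output column names are assumptions for illustration and do not necessarily match `language-check-report.csv`.

```python
import json
import urllib.parse
import urllib.request

# Public endpoint; a self-hosted LanguageTool server exposes the same path.
LT_URL = "https://api.languagetool.org/v2/check"


def check_text(text, language="en-GB", url=LT_URL):
    """POST text to a LanguageTool server and return the decoded JSON response."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode()
    with urllib.request.urlopen(url, data=data, timeout=30) as resp:
        return json.load(resp)


def flatten_matches(response):
    """Turn a LanguageTool response into flat rows (column names are assumed)."""
    rows = []
    for match in response.get("matches", []):
        rule = match.get("rule", {})
        rows.append({
            "rule_id": rule.get("id"),
            "issue_type": rule.get("issueType"),
            "message": match.get("message"),
            "offset": match.get("offset"),
            "length": match.get("length"),
            # Keep at most three suggestions to keep the report readable.
            "suggestions": [r["value"] for r in match.get("replacements", [])][:3],
        })
    return rows
```

Calling `flatten_matches(check_text(markdown_text))` for each document, and tagging the rows with the file and subject, would give a table of the same general shape as the report loaded below.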


In [None]:
# Imports and versions
import sys, platform, subprocess
import pandas as pd
import plotly
import plotly.express as px
import duckdb

print('Python:', sys.version.splitlines()[0])
print('Pandas:', pd.__version__)
print('Plotly:', plotly.__version__)
print('DuckDB:', duckdb.__version__)

In [None]:
# Paths to key CSVs
DATA_DOCS = 'document_stats.csv'
DATA_FILES = 'document_stats-files.csv'
DATA_LANG = 'Documents/language-check-report.csv'

# Load (these are small enough to load into memory in this repo)
docs = pd.read_csv(DATA_DOCS)
files = pd.read_csv(DATA_FILES)
lang = pd.read_csv(DATA_LANG)

docs.shape, files.shape, lang.shape

In [None]:
# Quick heads and basic stats
display(docs.head())
display(docs.describe(include='all'))

display(files.head())
display(files['Pages'].describe())

display(lang.head())
print('Language-check issues by type:')
print(lang['Type'].value_counts().head(10))

In [None]:
# Example plot: PDFs per subject (from document_stats.csv)
fig = px.bar(docs, x='Subject', y='PDFs', title='PDFs per subject')
fig.update_layout(xaxis={'categoryorder':'total descending'}, height=500)
fig.show()

In [None]:
# Distribution of pages per file (document_stats-files.csv)
fig = px.histogram(files, x='Pages', nbins=30, title='Distribution of pages per file')
fig.show()

In [None]:
# Join example: verify per-subject page totals
files_by_subject = files.groupby('Subject', as_index=False)['Pages'].sum().rename(columns={'Pages':'Pages_files'})
merged = pd.merge(docs, files_by_subject, on='Subject', how='left')
merged['Pages_diff'] = merged['Pages'] - merged['Pages_files']
display(merged[['Subject','Pages','Pages_files','Pages_diff']].sort_values('Pages_diff', key=abs, ascending=False).head(15))

In [None]:
# Top subjects by language-check issues (quick summary)
top_issues = lang.groupby('Subject').size().reset_index(name='n').sort_values('n', ascending=False).head(25)
fig = px.bar(top_issues, x='Subject', y='n', title='Top subjects by language-check issues')
fig.update_layout(xaxis={'categoryorder':'total descending'}, height=500)
fig.show()

In [None]:
# Save a small processed summary for quick reference
import os
os.makedirs('notebooks', exist_ok=True)
merged.to_csv('notebooks/processed_summary.csv', index=False)
print('Wrote notebooks/processed_summary.csv')

In [None]:
# Reproducibility: git commit and environment
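try:
    sha = subprocess.check_output(
        ['git', 'rev-parse', '--short', 'HEAD'], stderr=subprocess.DEVNULL
    ).decode().strip()
except (subprocess.CalledProcessError, OSError):
    # git is not installed, or this directory is not a git checkout
    sha = '<not available>'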
print('Git commit:', sha)
print('Platform:', platform.platform())

## Next steps

- Add a DuckDB-powered cell for SQL analytics without fully loading very large CSVs.
- Add more focused visualisations (issue types over time, heatmaps by rule ID, per-file issue density).
- Optionally add `nbval`- or `papermill`-based CI to execute this notebook and ensure it runs without errors.