Reference Extraction Tool

Pure JavaScript Reference Validator and Extractor

Overview

A pure JavaScript tool that extracts, validates, and verifies academic references from uploaded text files or pasted plain text. It offers editing of every parsed field, per-reference validation, and a dual-validation system for references that carry a DOI or arXiv ID.

Key Highlights

✅ All Fields Editable - Edit authors, title, year, venue, DOI, URL, arXiv ID - everything!

✅ Individual Validation - Validate one reference at a time, no need to run full batch

✅ Dual Validation - Academic references with DOI/arXiv get two-step verification

✅ Fake Reference Detection - Detects arXiv references citing wrong papers (title mismatch)

✅ Smart Text Cleanup - Auto-fixes PDF line-break hyphens and broken URLs

✅ 6 Citation Formats - APA, MLA, IEEE, ACM, arXiv, URL references

✅ Real-time Statistics - See validation progress and results instantly

✅ No Backend Required - Pure JavaScript, runs entirely in your browser

✅ Privacy First - No data storage, no tracking; extraction runs locally and only validation contacts external APIs

Quick Start

1. Open index.html in your web browser
2. Upload sample-references.txt or paste your own references
3. Click "Extract References" to parse the text
4. Edit any field directly in the displayed results (all fields are editable)
5. Click "✓ Validate This Reference" on individual references OR "Start Validation" to validate all
6. Review validation results and use verification links for manual checking

Core Features

1. Reference Extraction

Input Sources:
- Upload .txt files containing references
- Direct text input/paste functionality
Supported Citation Formats:
- APA (American Psychological Association)
- MLA (Modern Language Association)
- IEEE (Institute of Electrical and Electronics Engineers)
- ACM (Association for Computing Machinery)
- arXiv (preprint repository references with arXiv IDs)
- URL/Web References (websites, online resources with URLs)
Output:
- Display extracted references one by one
- Parse each reference into structured components:
  - Authors, year, title, venue for academic references
  - Source, title, URL, access date for web references
  - DOI (Digital Object Identifier) links when available
  - arXiv ID and URL for arXiv preprint references
- Automatic fixing of broken URLs and DOI links from line breaks
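
As an illustration of how identifiers might be pulled out of a single reference string, here is a minimal sketch. The function name `extractIdentifiers` and the regular expressions are assumptions made for this example; the patterns actually used in reference-tool.js may differ.

```javascript
// Hypothetical helper: pull a DOI, arXiv ID, and URL out of one reference string.
// The regular expressions are illustrative assumptions, not the tool's own patterns.
function extractIdentifiers(referenceText) {
  // DOIs start with "10.", a 4-9 digit registrant code, a slash, then a suffix.
  const doiMatch = referenceText.match(/\b10\.\d{4,9}\/[^\s"<>]+/);
  // New-style arXiv IDs look like 2101.00001, optionally followed by a version (v2).
  const arxivMatch = referenceText.match(/arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?/i);
  const urlMatch = referenceText.match(/https?:\/\/[^\s"<>]+/);

  // Trailing sentence punctuation is usually not part of the identifier.
  const trim = (s) => s.replace(/[.,;)\]]+$/, "");

  return {
    doi: doiMatch ? trim(doiMatch[0]) : null,
    arxivId: arxivMatch ? arxivMatch[1] : null,
    url: urlMatch ? trim(urlMatch[0]) : null,
  };
}

// Usage with a made-up placeholder reference:
const ids = extractIdentifiers(
  "Doe, J. (2020). An example title. Journal of Examples. https://doi.org/10.1234/example.5678."
);
console.log(ids.doi, ids.url);
```

Returning `null` for missing identifiers keeps the later choice of validation strategy a simple truthiness check.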

2. Reference Validation (User-Initiated)

Intelligent Validation Strategies:
- User activates validation by clicking "Start Validation" button
- Different strategies based on reference type:
Pure URL References:
- Single check: Verify URL accessibility only
- No academic database checking
Academic References WITHOUT DOI:
- Single check: Verify paper exists in Semantic Scholar
- Search by title, authors, year
Academic References WITH DOI:
- DUAL VALIDATION (two separate checks):
  1. Paper Check: Verify paper exists in Semantic Scholar
  2. DOI Check: Verify DOI URL is accessible
- Both checks must pass for "Valid" status
- Displays detailed results for each check
arXiv References:
- DUAL VALIDATION (two separate checks):
  1. Paper Check: Verify paper exists in Semantic Scholar
  2. arXiv Check: Verify arXiv ID exists using arXiv API
- Fake Reference Detection:
  - Compares reference title with arXiv paper title (70% similarity threshold)
  - Detects when reference cites "Paper A" but arXiv link points to "Paper B"
  - Shows "🚨 FAKE REFERENCE DETECTED" warning for significant mismatches
  - Displays title similarity percentage
- Both checks must pass for "Valid" status
- Displays detailed results for each check
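
The title comparison behind fake reference detection can be sketched as follows. The README states only the 70% threshold, not the similarity metric, so this example assumes a simple word-overlap (Jaccard) measure; the names `titleSimilarity` and `checkTitleMatch` are hypothetical.

```javascript
// Normalize a title into a set of lowercase word tokens.
// Note: this simple normalization drops non-ASCII letters.
function titleTokens(title) {
  return new Set(
    title.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean)
  );
}

// Jaccard similarity: shared tokens divided by all distinct tokens, in [0, 1].
// Assumption: the real tool may use a different metric.
function titleSimilarity(a, b) {
  const ta = titleTokens(a);
  const tb = titleTokens(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const token of ta) if (tb.has(token)) shared++;
  return shared / (ta.size + tb.size - shared);
}

const SIMILARITY_THRESHOLD = 0.7; // the 70% threshold stated above

// Compare the cited title with the title returned for the arXiv ID.
function checkTitleMatch(referenceTitle, arxivTitle) {
  const similarity = titleSimilarity(referenceTitle, arxivTitle);
  return {
    similarityPercent: Math.round(similarity * 100),
    mismatch: similarity < SIMILARITY_THRESHOLD,
  };
}
```

A word-overlap measure tolerates punctuation and capitalization differences, but it is sensitive to short titles, where a single differing word moves the score a lot.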
Validation Process:
- Progressive display (one by one as each completes)
- Flag suspicious or unverifiable references
- Automatic delays of 1.5-2 s between checks to avoid rate limiting or blocking
- Dual validation and URL checks use the longer 2 s delay
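
The strategy selection described in this section can be summarized in a short sketch. The reference shape (`type`, `doi`) and the function names are assumptions for illustration, and mapping any failed check to "Suspicious" is also an assumption, since the README says only that both checks must pass for "Valid".

```javascript
// Hypothetical reference shape: { type: "url" | "academic" | "arxiv", doi?: string }.
// Returns the list of checks to run for that reference.
function planChecks(ref) {
  if (ref.type === "url") return ["url"];               // URL accessibility only
  if (ref.type === "arxiv") return ["paper", "arxiv"];  // dual validation
  if (ref.doi) return ["paper", "doi"];                 // dual validation
  return ["paper"];                                     // Semantic Scholar only
}

// results maps a check name to true (passed) or false (failed).
// Every planned check must pass for "Valid".
function combineResults(checks, results) {
  return checks.every((name) => results[name] === true) ? "Valid" : "Suspicious";
}

// Usage: an academic reference with a DOI whose DOI link could not be reached.
const checks = planChecks({ type: "academic", doi: "10.1234/example" });
console.log(checks, combineResults(checks, { paper: true, doi: false }));
```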

3. Editable Fields & Individual Validation

All Fields Are Editable:
- Authors, Year, Title, Venue, DOI (for academic references)
- Source, Title, URL, Access Date (for URL references)
- arXiv ID, arXiv URL (for arXiv references)
- Edit directly in the displayed results - no need to re-extract
Individual Validation Buttons:
- Each reference has its own "✓ Validate This Reference" button
- Validate individual references without running full validation
- Button changes to "🔄 Re-validate This Reference" after first validation
- Make corrections and re-validate as many times as needed
- Statistics automatically recalculate after each validation
Benefits:
- Fix extraction errors immediately
- Test corrections before full validation
- Save time - only validate what needs checking
- Iterative improvement through multiple re-validations

4. Statistics Dashboard

Real-time Statistics Display:
- Total number of references extracted
- Valid references count (✓ green badge)
- Suspicious references count (⚠ yellow badge)
- Unverified references count (✗ red badge)
- Automatically recalculates after each individual validation
- Updates in real-time during full validation
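
A minimal sketch of how the dashboard counts could be recomputed after each validation is shown here. The `status` field and the `summarize` function are hypothetical, and counting not-yet-validated references as unverified is an assumption.

```javascript
// Recompute the dashboard counts from the current list of references.
// Each reference is assumed to carry a status of "Valid", "Suspicious", or "Unverified";
// a reference with no status yet is counted as unverified (an assumption).
function summarize(references) {
  const stats = { total: references.length, valid: 0, suspicious: 0, unverified: 0 };
  for (const ref of references) {
    if (ref.status === "Valid") stats.valid++;
    else if (ref.status === "Suspicious") stats.suspicious++;
    else stats.unverified++;
  }
  return stats;
}
```

Recomputing from the full list on every change is simpler than incrementing counters, and it stays correct when a reference is re-validated and its status changes.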

5. Manual Verification Support

For Academic References, Provide Links:
- Semantic Scholar search link
- Google Scholar search link
- General Google search link
For URL References, Provide Links:
- Direct link to visit the URL
- Google search for the URL (to check if it's indexed)
- Google search for the title
Flagging System:
- Mark references as: Valid, Suspicious, or Unverified
- Highlight potential issues (missing information, format errors, not found online)
- Color-coded status indicators for quick visual assessment
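
The manual verification links listed above can be generated from the parsed fields alone, with no network request. The sketch below uses the public search URL patterns of each site; the function names are hypothetical, and the exact URLs the tool builds may differ.

```javascript
// Build search links for an academic reference from its title.
function academicLinks(title) {
  const q = encodeURIComponent(title);
  return {
    semanticScholar: `https://www.semanticscholar.org/search?q=${q}`,
    googleScholar: `https://scholar.google.com/scholar?q=${q}`,
    google: `https://www.google.com/search?q=${q}`,
  };
}

// Build links for a URL reference: visit it, search for the URL, search for the title.
function urlLinks(url, title) {
  return {
    visit: url,
    googleForUrl: `https://www.google.com/search?q=${encodeURIComponent(url)}`,
    googleForTitle: `https://www.google.com/search?q=${encodeURIComponent(title)}`,
  };
}
```

Because these are ordinary links opened by the user, they are not subject to the CORS restrictions that affect automated checks.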

Technical Requirements

Environment

Pure JavaScript (no backend required)
Compatible with modern browsers
No external dependencies for core functionality (can use fetch API for validation)

API Integration

Semantic Scholar API:
- Endpoint: https://api.semanticscholar.org/
- Free tier available
- Search papers by title, author, DOI
Google Scholar:
- Generate search URLs (no official API)
- Manual verification option
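
For orientation, here is a sketch of how request URLs for the Semantic Scholar Academic Graph API might be built. The `/graph/v1/paper/search` and `/graph/v1/paper/DOI:...` routes are part of the public API; the helper names, the chosen `fields`, and the default `limit` are assumptions, and the README does not say which routes the tool calls.

```javascript
const S2_BASE = "https://api.semanticscholar.org/graph/v1";

// Search for papers by title text; the response lists candidate matches.
function paperSearchUrl(title, limit = 3) {
  const params = new URLSearchParams({
    query: title,
    fields: "title,year,authors",
    limit: String(limit),
  });
  return `${S2_BASE}/paper/search?${params}`;
}

// Look up a single paper by DOI using the "DOI:" identifier prefix.
// encodeURI keeps the slash inside the DOI intact while escaping spaces.
function paperByDoiUrl(doi) {
  return `${S2_BASE}/paper/DOI:${encodeURI(doi)}?fields=title,year,authors`;
}

// Usage (browser or Node 18+): fetch(paperSearchUrl("An example title")).then((r) => r.json())
```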

User Interface Components

Input Section:
- File upload button for .txt files
- Editable text area for direct input/paste
- Allow users to modify uploaded text to fix errors
Control Buttons:
- "Extract References" / "Re-run" button to process/reprocess text
- "Start Validation" button to activate automatic checking process
Output Section:
- Results display area showing references one by one
- Visual indicators for validation status (Valid/Suspicious/Unverified)
- Clickable links for manual verification
- Progressive display: validation results appear one by one as they complete

Key User Interactions

Two Validation Workflows:

Workflow A: Individual Validation (Recommended for corrections)

1. Upload/Input: Load .txt file or paste reference text
2. Extract: Click "Extract References" button
3. Review & Edit: Check extracted references, edit any incorrect fields (all fields are editable)
4. Validate Individually: Click "✓ Validate This Reference" on specific references
5. Iterate: Edit and re-validate as needed until satisfied

Workflow B: Full Validation (Recommended for initial check)

1. Upload/Input: Load .txt file or paste reference text
2. Edit Input Text (Optional): Modify text in editable text area to fix errors
3. Extract/Re-run: Click "Extract References" button (can re-run after editing)
4. Start Full Validation: Click "Start Validation" to check all references automatically
5. View Progressive Results: Validation results appear one by one as each completes
6. Edit & Re-validate: Fix errors in suspicious references using individual validation
7. Manual Verification: Click provided links for remaining suspicious references

Note: The tool is semi-automatic - extraction and validation are triggered by user actions, not automatic on upload.

Automatic Text Preprocessing

Before extracting references, the tool automatically fixes common formatting issues:

Line-Break Hyphen Removal

Problem: Text copied from PDFs often has broken words from line breaks
Examples:
- "digitiza- tion" → "digitization"
- "Manufactur- ing" → "Manufacturing"
- "Tool- path" → "Toolpath"
Solution: Automatically detects and removes line-break hyphens
Pattern: Removes hyphen + space/newline when followed by word continuation
Preserved: Legitimate hyphenated words (e.g., "state-of-the-art", "machine-learning")

This preprocessing happens automatically during extraction and requires no user action.
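
The hyphen removal described above can be approximated with a single regular expression. This is a minimal sketch under the stated pattern (hyphen, then whitespace, then a lowercase continuation); the function name is hypothetical and the tool's actual rule may be stricter.

```javascript
// Join words that were split by a line-break hyphen, e.g. "digitiza- tion".
// A hyphen is removed only when whitespace follows it and the next character
// is a lowercase letter, so "state-of-the-art" (no whitespace) is left alone.
function fixLineBreakHyphens(text) {
  return text.replace(/(\p{L})-\s+(\p{Ll})/gu, "$1$2");
}

console.log(fixLineBreakHyphens("Manufactur- ing and digitiza-\ntion")); // "Manufacturing and digitization"
```

One known weakness of this simple rule: a suspended hyphen such as "pre- and post-processing" would be joined incorrectly, so a production version would need an exception list or a dictionary check.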

Functional Workflow

Input Phase:
- User uploads .txt file OR pastes text into editable text area
- Text is displayed in the text area and can be edited
Editing Phase (Optional):
- User can modify the text to fix formatting errors, typos, or inconsistencies
- User can re-upload or paste new text at any time
Extraction Phase:
- User clicks "Extract References" button (or "Re-run" for reprocessing)
- Detect citation style for each reference (APA/MLA/IEEE/ACM/arXiv/URL)
- Extract individual references from text
- Parse each reference into components
- Display extracted references one by one
- Update existing reference items if re-running
Validation Phase (User-Initiated):
- User clicks "Start Validation" button to begin automatic checking
- For each reference (processed sequentially):
  - Query Semantic Scholar API with reference details (title, authors, year)
  - Display validation result immediately as it completes (progressive display)
  - If not found, flag as suspicious
  - Generate verification links for manual checking
- Results appear one by one, not all at once after completion
Manual Verification Phase:
- User can click on provided links for suspicious references:
  - Semantic Scholar search link
  - Google Scholar search link
  - General Google search link
- User can manually verify and make decisions about flagged references

Expected Output Format

For each reference, display:

Original reference text
Detected citation style
Parsed components (authors, title, year, venue)
Validation status (Valid/Suspicious/Unverified)
Manual verification links:
- [Check on Semantic Scholar]
- [Check on Google Scholar]
- [Search on Google]

Known Issues & Limitations

CORS Restrictions (Important!)

What is CORS? Cross-Origin Resource Sharing is a browser mechanism that lets a server declare which other origins may read its responses. Without that permission, the browser blocks a page's script from reading responses from external websites.
Impact on URL Validation:
- Most websites do not grant this permission, so direct URL checks from the browser are blocked
- Many URL references will show as "Suspicious" even if valid
- DOI URL checks may fail due to CORS blocking
- Workaround: Use the provided manual verification links
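
To show why a blocked check ends up as "Suspicious", here is a minimal sketch of a URL check. In the browser, `fetch` rejects with a `TypeError` both for network failures and for requests blocked by CORS, and a script cannot tell the two apart. The function `checkUrl` and its `fetchImpl` parameter (which lets a fake be injected for testing) are assumptions for this example.

```javascript
// Try to reach a URL and map the outcome to a validation status.
// fetchImpl defaults to the global fetch; pass a stub to test without a network.
async function checkUrl(url, fetchImpl = globalThis.fetch) {
  try {
    const response = await fetchImpl(url, { method: "HEAD" });
    return response.ok
      ? { status: "Valid", note: `HTTP ${response.status}` }
      : { status: "Suspicious", note: `HTTP ${response.status}` };
  } catch (error) {
    // CORS blocks and network errors both land here, so the URL may still be fine.
    return { status: "Suspicious", note: "Request blocked or failed; verify manually" };
  }
}
```

Since a rejection does not prove the URL is dead, pairing the "Suspicious" status with a manual "Visit URL" link is the practical fallback.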

Validation Coverage

Semantic Scholar Database:
- Not all papers are indexed (especially older publications)
- Regional journals may have limited coverage
- Valid papers may be flagged as suspicious if not in database
Citation Style Detection:
- Pattern-based detection may not be 100% accurate
- Complex or non-standard formats may be misidentified
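
Pattern-based style detection of the kind described above might look like the following sketch. All patterns here are illustrative assumptions, not the tool's own rules, and MLA and ACM are deliberately left out because telling them apart needs finer cues than a single regular expression.

```javascript
// Ordered heuristics: the first rule that matches wins.
// All patterns are illustrative assumptions.
const STYLE_RULES = [
  { style: "arXiv", test: (t) => /arXiv:\s*\d{4}\.\d{4,5}/i.test(t) },
  { style: "IEEE", test: (t) => /^\s*\[\d+\]/.test(t) },       // numbered entry: [1] ...
  { style: "APA", test: (t) => /\(\d{4}[a-z]?\)\./.test(t) },  // year in parentheses: (2020).
  { style: "URL", test: (t) => /https?:\/\//i.test(t) },
];

function detectStyle(referenceText) {
  const rule = STYLE_RULES.find((r) => r.test(referenceText));
  return rule ? rule.style : "Unknown";
}
```

The rule order matters: an APA reference that also contains a URL is classified as APA because that rule is tried first. This kind of ordering is one reason non-standard formats can be misidentified.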

API Rate Limits

Semantic Scholar API: ~100 requests per 5 minutes (free tier)
Delays built in to prevent rate limiting (1.5-2s between requests)
Large batches may take several minutes to complete
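
As a rough worked example of these figures, the sketch below treats the approximate limit of 100 requests per 5 minutes as exact, which is an assumption. Both function names are hypothetical.

```javascript
// Average spacing between requests needed to stay under a rate limit.
function minIntervalMs(maxRequests, windowSeconds) {
  return (windowSeconds * 1000) / maxRequests;
}

// Total time for a batch, given an assumed time per reference in seconds.
function estimateBatchSeconds(referenceCount, secondsPerReference) {
  return referenceCount * secondsPerReference;
}

console.log(minIntervalMs(100, 300));      // 3000 ms between requests on average
console.log(estimateBatchSeconds(60, 2));  // 120 s for 60 references at 2 s each
```

Under that assumption, a long run needs about one request every 3 seconds on average to stay under the limit. The 1.5-2 s delay alone may therefore not prevent a "Rate Limited" error on large batches, depending on how long each request takes.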

Extraction Limitations

Complex multi-line references may have parsing errors
Non-standard citation formats may extract incorrectly
Manual editing available to fix extraction errors

Browser Compatibility

Requires modern browser with JavaScript enabled
Best performance in Chrome, Firefox, Safari, Edge
Internet connection required for validation (extraction works offline)

Troubleshooting

Problem 1: No references extracted

Solution 1: Check that references follow standard citation formats (APA, MLA, IEEE, ACM, arXiv, URL)

Problem 2: Valid reference marked as suspicious

Solution 2: Use manual verification links - paper may not be in Semantic Scholar database

Problem 3: URL references always show as suspicious

Solution 3: This is expected due to CORS restrictions - click "Visit URL" link to manually verify

Problem 4: "Rate Limited" error during validation

Solution 4: Wait 5-10 minutes before continuing - free API tier has limits

Problem 5: Title or other fields show "Unknown"

Solution 5: Edit the fields directly in the results and click "Re-validate This Reference"

Problem 6: Extraction errors with PDF-copied text

Solution 6: The tool auto-fixes most issues (line-break hyphens), but you can manually edit the input text before extraction

Validation Methods Comparison

Full Validation vs Individual Validation

Feature	Full Validation	Individual Validation
Button Location	Top of page ("Start Validation")	Each reference ("✓ Validate This Reference")
Scope	All references at once	Single reference at a time
When to Use	Initial validation of all references	Fix specific suspicious references
Time	2-5 seconds × number of refs	2-5 seconds per reference
Best For	First-time validation	Testing corrections, selective validation
Statistics Update	After all complete	After each validation
Edits Applied	N/A (validates current data)	Read from edited input fields

Recommended Approach:

1. Use Full Validation for initial check
2. Identify suspicious references
3. Edit those references
4. Use Individual Validation to re-check corrections
5. Repeat as needed

Project Structure

Root Directory (Main Project Files)

Reference-checking/
├── README.md                 # Main documentation (you are here)
├── index.html                # Main application interface
├── reference-tool.js         # Core JavaScript application
├── sample-references.txt     # Mixed format references for general testing
├── documents/                # User-facing documentation (8 files)
└── testing/                  # Test files for feature validation (11 files)

📂 ./documents/ (8 files)

User documentation and test files mentioned in this README:

Test Files:

COMPREHENSIVE-TEST-REFERENCES.txt - All test cases in one file (16 references)

User Documentation:

USAGE-GUIDE.txt - Detailed step-by-step usage instructions
QUICK-USAGE-GUIDE.txt - Quick reference guide for all-fields-editable feature
EDIT-AND-REVALIDATE-FEATURE.txt - Editing and re-validation guide
ALL-FIELDS-EDITABLE-FEATURE.txt - Complete editing feature documentation

Technical Documentation:

DUAL-VALIDATION-SYSTEM.txt - How the dual validation system works
ARXIV-FEATURE-DOCUMENTATION.txt - arXiv reference support details
CURRENT-STATUS.txt - Current implementation status and active patterns

📂 ./testing/ (11 files)

Test files for validating various features:

test-references.txt - Mixed format references with line-break issues
test-broken-urls.txt - URL extraction with space issues
test-arxiv-references.txt - arXiv format detection and validation
test-fake-reference.txt - Fake reference detection examples
test-8-apa-references.txt - APA reference splitting validation
test-author-extraction.txt - Author name parsing edge cases
test-year-suffix.txt - Year suffix handling (2020a, 2020b)
test-multiline-references.txt - Multi-line reference handling
test-hyphenated-breaks.txt - Hyphen removal from line breaks
test-ieee-nested-quotes.txt - Quote character handling
test-references-incl-fake.txt - Combined test scenarios

Privacy & Security

Local Processing: All reference extraction and parsing happens in your browser
No Data Storage: No references or text are stored on any server
API Calls: Validation requires calls to:
- Semantic Scholar API (for paper verification)
- arXiv API (for arXiv reference verification)
- DOI/CrossRef (for DOI verification - may be blocked by CORS)
No Tracking: No analytics, cookies, or user tracking
Open Source: Pure JavaScript - inspect the code yourself

Future Enhancements (Optional)

Export validated references to BibTeX format
Batch validation mode with selection
Reference deduplication
Citation style conversion
Integration with DOI.org for additional validation
Auto-complete suggestions for venues and authors
Undo/redo functionality for edits
Save and load validation sessions

Development note

Initial tool: Janetyc and Claude Code, October 26, 2025
