janetyc/reference-checking-tool

Reference Extraction Tool

Pure JavaScript Reference Validator and Extractor

Overview

A pure JavaScript tool that extracts, validates, and verifies academic references from uploaded text files or pasted text. It offers fully editable fields, per-reference validation, and a dual-validation system for references that carry a DOI or arXiv ID.

Key Highlights

All Fields Editable - Edit authors, title, year, venue, DOI, URL, arXiv ID - everything!

Individual Validation - Validate one reference at a time, no need to run full batch

Dual Validation - Academic references with DOI/arXiv get two-step verification

Fake Reference Detection - Detects arXiv references citing wrong papers (title mismatch)

Smart Text Cleanup - Auto-fixes PDF line-break hyphens and broken URLs

6 Citation Formats - APA, MLA, IEEE, ACM, arXiv, URL references

Real-time Statistics - See validation progress and results instantly

No Backend Required - Pure JavaScript, runs entirely in your browser

Privacy First - No data storage, no tracking, all processing is local


Quick Start

  1. Open index.html in your web browser
  2. Upload sample-references.txt or paste your own references
  3. Click "Extract References" to parse the text
  4. Edit any field directly in the displayed results (all fields are editable)
  5. Click "✓ Validate This Reference" on individual references OR "Start Validation" to validate all
  6. Review validation results and use verification links for manual checking

Core Features

1. Reference Extraction

  • Input Sources:

    • Upload .txt files containing references
    • Direct text input/paste functionality
  • Supported Citation Formats:

    • APA (American Psychological Association)
    • MLA (Modern Language Association)
    • IEEE (Institute of Electrical and Electronics Engineers)
    • ACM (Association for Computing Machinery)
    • arXiv (preprint repository references with arXiv IDs)
    • URL/Web References (websites, online resources with URLs)
  • Output:

    • Display extracted references one by one
    • Parse each reference into structured components:
      • Authors, year, title, venue for academic references
      • Source, title, URL, access date for web references
      • DOI (Digital Object Identifier) links when available
      • arXiv ID and URL for arXiv preprint references
    • Automatic fixing of broken URLs and DOI links from line breaks
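
As an illustration of the identifier parsing described above, here is a minimal sketch of pulling a DOI and an arXiv ID out of one reference string. The function name `extractIdentifiers` and both regular expressions are assumptions made for this example, not the patterns used in reference-tool.js. The arXiv pattern covers only new-style IDs such as 1706.03762.

```javascript
// Hypothetical patterns for illustration; the tool's own patterns may differ.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/i;
const ARXIV_PATTERN = /arxiv(?:\.org\/abs\/|[:\s]\s*)(\d{4}\.\d{4,5})(?:v\d+)?/i;

function extractIdentifiers(referenceText) {
  const doiMatch = referenceText.match(DOI_PATTERN);
  const arxivMatch = referenceText.match(ARXIV_PATTERN);

  // Trailing punctuation normally belongs to the sentence, not to the DOI.
  const doi = doiMatch ? doiMatch[0].replace(/[.,;)\]]+$/, "") : null;
  // Group 1 is the ID without any version suffix such as "v5".
  const arxivId = arxivMatch ? arxivMatch[1] : null;

  return {
    doi,
    doiUrl: doi ? `https://doi.org/${doi}` : null,
    arxivId,
    arxivUrl: arxivId ? `https://arxiv.org/abs/${arxivId}` : null,
  };
}
```

Returning `null` for a missing identifier lets the caller decide which validation strategy applies to the reference.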

2. Reference Validation (User-Initiated)

  • Intelligent Validation Strategies:

    • User activates validation by clicking "Start Validation" button
    • Different strategies based on reference type:

    Pure URL References:

    • Single check: Verify URL accessibility only
    • No academic database checking

    Academic References WITHOUT DOI:

    • Single check: Verify paper exists in Semantic Scholar
    • Search by title, authors, year

    Academic References WITH DOI:

    • DUAL VALIDATION (two separate checks):
      1. Paper Check: Verify paper exists in Semantic Scholar
      2. DOI Check: Verify DOI URL is accessible
    • Both checks must pass for "Valid" status
    • Displays detailed results for each check

    arXiv References:

    • DUAL VALIDATION (two separate checks):
      1. Paper Check: Verify paper exists in Semantic Scholar
      2. arXiv Check: Verify arXiv ID exists using arXiv API
    • Fake Reference Detection:
      • Compares reference title with arXiv paper title (70% similarity threshold)
      • Detects when reference cites "Paper A" but arXiv link points to "Paper B"
      • Shows "🚨 FAKE REFERENCE DETECTED" warning for significant mismatches
      • Displays title similarity percentage
    • Both checks must pass for "Valid" status
    • Displays detailed results for each check
  • Validation Process:

    • Progressive display (one by one as each completes)
    • Flag suspicious or unverifiable references
    • Automatic delays between checks to prevent rate limiting/blocking (1.5-2s)
    • Longer delays (2s) for dual validation and URL checks
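
The strategy selection and the title comparison above can be sketched as follows. All names here are hypothetical. The similarity metric (word-set overlap, also called Jaccard similarity) is an assumption, since this README gives only the 70% threshold and not the formula. Checking the arXiv ID before the DOI when a reference has both is also a choice made for this example.

```javascript
const SIMILARITY_THRESHOLD = 0.7; // the 70% threshold stated above

// Pick the checks for a reference, following the four cases listed above.
function planChecks(ref) {
  if (ref.arxivId) return ["paper", "arxiv"]; // dual validation
  if (ref.doi) return ["paper", "doi"];       // dual validation
  if (ref.type === "url") return ["url"];     // accessibility only
  return ["paper"];                           // Semantic Scholar only
}

// Every planned check must pass for a "valid" status.
function combineChecks(results) {
  const allPassed = results.length > 0 && results.every((r) => r.passed);
  return allPassed ? "valid" : "suspicious";
}

// Lowercase the title and split it into words, ignoring punctuation.
function tokenize(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, " ")
    .trim()
    .split(" ")
    .filter(Boolean);
}

// Assumed metric: shared words divided by all distinct words (0 to 1).
function titleSimilarity(a, b) {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 || setB.size === 0) return 0;
  let shared = 0;
  for (const word of setA) if (setB.has(word)) shared++;
  return shared / (setA.size + setB.size - shared);
}

// True when the cited title and the arXiv title differ too much.
function looksLikeFakeReference(referenceTitle, arxivTitle) {
  return titleSimilarity(referenceTitle, arxivTitle) < SIMILARITY_THRESHOLD;
}
```

A word-set metric ignores word order and capitalization, so small punctuation differences between a citation and the arXiv record do not trigger the warning.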

3. Editable Fields & Individual Validation

  • All Fields Are Editable:

    • Authors, Year, Title, Venue, DOI (for academic references)
    • Source, Title, URL, Access Date (for URL references)
    • arXiv ID, arXiv URL (for arXiv references)
    • Edit directly in the displayed results - no need to re-extract
  • Individual Validation Buttons:

    • Each reference has its own "✓ Validate This Reference" button
    • Validate individual references without running full validation
    • Button changes to "🔄 Re-validate This Reference" after first validation
    • Make corrections and re-validate as many times as needed
    • Statistics automatically recalculate after each validation
  • Benefits:

    • Fix extraction errors immediately
    • Test corrections before full validation
    • Save time - only validate what needs checking
    • Iterative improvement through multiple re-validations

4. Statistics Dashboard

  • Real-time Statistics Display:
    • Total number of references extracted
    • Valid references count (✓ green badge)
    • Suspicious references count (⚠ yellow badge)
    • Unverified references count (✗ red badge)
    • Automatically recalculates after each individual validation
    • Updates in real-time during full validation
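
One way the dashboard counts could be recomputed after each validation is sketched below. The `status` field name and its string values are assumptions made for this example.

```javascript
// Hypothetical sketch: tally references by validation status.
function computeStatistics(references) {
  const stats = { total: references.length, valid: 0, suspicious: 0, unverified: 0 };
  for (const ref of references) {
    // References with no recognized status yet only count toward the total.
    if (ref.status === "valid") stats.valid++;
    else if (ref.status === "suspicious") stats.suspicious++;
    else if (ref.status === "unverified") stats.unverified++;
  }
  return stats;
}
```

Recomputing from the full list each time is simpler than incrementing counters, and it stays correct when a reference is re-validated and its status changes.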

5. Manual Verification Support

  • For Academic References, Provide Links:

    • Semantic Scholar search link
    • Google Scholar search link
    • General Google search link
  • For URL References, Provide Links:

    • Direct link to visit the URL
    • Google search for the URL (to check if it's indexed)
    • Google search for the title
  • Flagging System:

    • Mark references as: Valid, Suspicious, or Unverified
    • Highlight potential issues (missing information, format errors, not found online)
    • Color-coded status indicators for quick visual assessment
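
The manual verification links listed above are ordinary search URLs, so they can be built without any API. A minimal sketch follows. The function name `buildVerificationLinks` and the reference fields (`type`, `title`, `url`) are assumptions made for this example.

```javascript
// Hypothetical sketch: build manual-verification links for one reference.
function buildVerificationLinks(ref) {
  const google = (text) =>
    `https://www.google.com/search?q=${encodeURIComponent(text)}`;

  if (ref.type === "url") {
    return {
      visit: ref.url,                  // direct link to the page
      googleUrl: google(ref.url),      // is the URL indexed?
      googleTitle: google(ref.title),  // search for the title
    };
  }

  const query = encodeURIComponent(ref.title);
  return {
    semanticScholar: `https://www.semanticscholar.org/search?q=${query}`,
    googleScholar: `https://scholar.google.com/scholar?q=${query}`,
    google: google(ref.title),
  };
}
```

`encodeURIComponent` keeps titles with spaces, ampersands, or question marks from breaking the query string.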

Technical Requirements

Environment

  • Pure JavaScript (no backend required)
  • Compatible with modern browsers
  • No external dependencies for core functionality (can use fetch API for validation)

API Integration

  • Semantic Scholar API:

    • Look up papers by title, authors, and year
    • Free tier allows roughly 100 requests per 5 minutes

  • Google Scholar:

    • Generate search URLs (no official API)
    • Manual verification option

User Interface Components

  • Input Section:

    • File upload button for .txt files
    • Editable text area for direct input/paste
    • Allow users to modify uploaded text to fix errors
  • Control Buttons:

    • "Extract References" / "Re-run" button to process/reprocess text
    • "Start Validation" button to activate automatic checking process
  • Output Section:

    • Results display area showing references one by one
    • Visual indicators for validation status (Valid/Suspicious/Unverified)
    • Clickable links for manual verification
    • Progressive display: validation results appear one by one as they complete

Key User Interactions

Two Validation Workflows:

Workflow A: Individual Validation (Recommended for corrections)

  1. Upload/Input: Load .txt file or paste reference text
  2. Extract: Click "Extract References" button
  3. Review & Edit: Check extracted references, edit any incorrect fields (all fields are editable)
  4. Validate Individually: Click "✓ Validate This Reference" on specific references
  5. Iterate: Edit and re-validate as needed until satisfied

Workflow B: Full Validation (Recommended for initial check)

  1. Upload/Input: Load .txt file or paste reference text
  2. Edit Input Text (Optional): Modify text in editable text area to fix errors
  3. Extract/Re-run: Click "Extract References" button (can re-run after editing)
  4. Start Full Validation: Click "Start Validation" to check all references automatically
  5. View Progressive Results: Validation results appear one by one as each completes
  6. Edit & Re-validate: Fix errors in suspicious references using individual validation
  7. Manual Verification: Click provided links for remaining suspicious references

Note: The tool is semi-automatic - extraction and validation are triggered by user actions, not automatic on upload.


Automatic Text Preprocessing

Before extracting references, the tool automatically fixes common formatting issues:

Line-Break Hyphen Removal

  • Problem: Text copied from PDFs often has broken words from line breaks
  • Examples:
    • "digitiza- tion" → "digitization"
    • "Manufactur- ing" → "Manufacturing"
    • "Tool- path" → "Toolpath"
  • Solution: Automatically detects and removes line-break hyphens
  • Pattern: Removes hyphen + space/newline when followed by word continuation
  • Preserved: Legitimate hyphenated words (e.g., "state-of-the-art", "machine-learning")

This preprocessing happens automatically during extraction and requires no user action.
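
The hyphen rule above can be written as a single regular expression. This is a minimal sketch with a hypothetical function name, and the tool's own pattern may differ. It joins a word only when the hyphen is followed by whitespace and then a lowercase letter.

```javascript
// Hypothetical sketch of line-break hyphen removal.
function fixLineBreakHyphens(text) {
  // letter + "-" + spaces/newlines + lowercase letter => join the two halves
  return text.replace(/(\p{L})-\s+(\p{Ll})/gu, "$1$2");
}
```

Hyphenated words such as "state-of-the-art" have no whitespace after the hyphen, so they are left alone. A known limit of this simple pattern is suspended hyphens ("pre- and post-processing"), which it would join incorrectly.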


Functional Workflow

  1. Input Phase:

    • User uploads .txt file OR pastes text into editable text area
    • Text is displayed in the text area and can be edited
  2. Editing Phase (Optional):

    • User can modify the text to fix formatting errors, typos, or inconsistencies
    • User can re-upload or paste new text at any time
  3. Extraction Phase:

    • User clicks "Extract References" button (or "Re-run" for reprocessing)
    • Detect citation style for each reference (APA, MLA, IEEE, ACM, arXiv, or URL)
    • Extract individual references from text
    • Parse each reference into components
    • Display extracted references one by one
    • Update existing reference items if re-running
  4. Validation Phase (User-Initiated):

    • User clicks "Start Validation" button to begin automatic checking
    • For each reference (processed sequentially):
      • Query Semantic Scholar API with reference details (title, authors, year)
      • Display validation result immediately as it completes (progressive display)
      • If not found, flag as suspicious
      • Generate verification links for manual checking
    • Results appear one by one, not all at once after completion
  5. Manual Verification Phase:

    • User can click on provided links for suspicious references:
      • Semantic Scholar search link
      • Google Scholar search link
      • General Google search link
    • User can manually verify and make decisions about flagged references
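
The validation phase above (sequential processing, progressive display, a pause between requests) could be sketched as below. The names `validateSequentially`, `validateOne`, and `onResult` are hypothetical. Treating a thrown error as "suspicious" is an assumption, since the README only says that references not found are flagged that way.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Validate references one at a time and report each result as it completes.
async function validateSequentially(references, validateOne, onResult, delayMs = 1500) {
  const results = [];
  for (let i = 0; i < references.length; i++) {
    let result;
    try {
      result = await validateOne(references[i]);
    } catch (error) {
      // Assumption: a failed check is flagged rather than aborting the batch.
      result = { status: "suspicious", reason: String(error) };
    }
    results.push(result);
    onResult(i, result); // progressive display: update the UI immediately
    // Pause between requests, but not after the last one.
    if (i < references.length - 1) await sleep(delayMs);
  }
  return results;
}
```

Because each request is awaited before the next starts, results arrive in list order and the delay applies between every pair of requests.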

Expected Output Format

For each reference, display:

  • Original reference text
  • Detected citation style
  • Parsed components (authors, title, year, venue)
  • Validation status (Valid/Suspicious/Unverified)
  • Manual verification links:
    • [Check on Semantic Scholar]
    • [Check on Google Scholar]
    • [Search on Google]

Known Issues & Limitations

CORS Restrictions (Important!)

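  • What is CORS? Cross-Origin Resource Sharing (CORS) is the mechanism a server uses to tell browsers that pages from other origins may read its responses. Without that permission, the browser's same-origin policy stops a page's scripts from reading the response.
  • Impact on URL Validation:
    • Most websites do not grant this permission, so direct URL checks made from the browser fail
    • Many URL references will show as "Suspicious" even if valid
    • DOI URL checks may fail for the same reason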
    • Workaround: Use the provided manual verification links
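
To show why a blocked check ends up as "Suspicious", here is a minimal sketch of a browser-side URL check. The function name and the returned object shape are assumptions made for this example. The `fetchImpl` parameter exists only so the function can be exercised without a network.

```javascript
// Hypothetical sketch: check whether a URL is reachable from the browser.
async function checkUrlAccessible(url, fetchImpl = fetch) {
  try {
    const response = await fetchImpl(url, { method: "HEAD" });
    return response.ok
      ? { status: "valid", detail: `HTTP ${response.status}` }
      : { status: "suspicious", detail: `HTTP ${response.status}` };
  } catch (error) {
    // fetch rejects with a TypeError for network failures and CORS blocks
    // alike, so the script cannot tell a blocked site from a dead link.
    return {
      status: "suspicious",
      detail: "Blocked by CORS or unreachable; verify manually",
    };
  }
}
```

Since the two failure causes look the same to the script, the only reliable confirmation is opening the link in a new tab.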

Validation Coverage

  • Semantic Scholar Database:

    • Not all papers are indexed (especially older publications)
    • Regional journals may have limited coverage
    • Valid papers may be flagged as suspicious if not in database
  • Citation Style Detection:

    • Pattern-based detection may not be 100% accurate
    • Complex or non-standard formats may be misidentified

API Rate Limits

  • Semantic Scholar API: ~100 requests per 5 minutes (free tier)
  • Delays built in to prevent rate limiting (1.5-2s between requests)
  • Large batches may take several minutes to complete
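
Fixed delays reduce the chance of hitting the limit but cannot rule it out. The sketch below shows one way a client might react to an HTTP 429 (Too Many Requests) response. It is not taken from the tool. The function name, retry count, and delays are all assumptions.

```javascript
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical sketch: retry a request with growing delays after HTTP 429.
async function fetchWithBackoff(
  url,
  { fetchImpl = fetch, sleepImpl = wait, maxRetries = 3, baseDelayMs = 2000 } = {}
) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetchImpl(url);
    // Return on any non-429 response, or once the retries are used up.
    if (response.status !== 429 || attempt >= maxRetries) return response;
    // Double the wait after each rate-limited attempt: 2s, 4s, 8s, ...
    await sleepImpl(baseDelayMs * 2 ** attempt);
  }
}
```

If the final response is still 429, the caller receives it unchanged and can show the "Rate Limited" message described under Troubleshooting.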

Extraction Limitations

  • Complex multi-line references may have parsing errors
  • Non-standard citation formats may extract incorrectly
  • Manual editing available to fix extraction errors

Browser Compatibility

  • Requires modern browser with JavaScript enabled
  • Best performance in Chrome, Firefox, Safari, Edge
  • Internet connection required for validation (extraction works offline)

Troubleshooting

Problem 1: No references extracted

Solution 1: Check that references follow standard citation formats (APA, MLA, IEEE, ACM, arXiv, URL)

Problem 2: Valid reference marked as suspicious

Solution 2: Use manual verification links - paper may not be in Semantic Scholar database

Problem 3: URL references always show as suspicious

Solution 3: This is expected due to CORS restrictions - click "Visit URL" link to manually verify

Problem 4: "Rate Limited" error during validation

Solution 4: Wait 5-10 minutes before continuing - free API tier has limits

Problem 5: Title or other fields show "Unknown"

Solution 5: Edit the fields directly in the results and click "Re-validate This Reference"

Problem 6: Extraction errors with PDF-copied text

Solution 6: The tool auto-fixes most issues (line-break hyphens), but you can manually edit the input text before extraction


Validation Methods Comparison

Full Validation vs Individual Validation

| Feature | Full Validation | Individual Validation |
|---|---|---|
| Button Location | Top of page ("Start Validation") | Each reference ("✓ Validate This Reference") |
| Scope | All references at once | Single reference at a time |
| When to Use | Initial validation of all references | Fix specific suspicious references |
| Time | 2-5 seconds × number of refs | 2-5 seconds per reference |
| Best For | First-time validation | Testing corrections, selective validation |
| Statistics Update | After all complete | After each validation |
| Edits Applied | N/A (validates current data) | Read from edited input fields |

Recommended Approach:

  1. Use Full Validation for initial check
  2. Identify suspicious references
  3. Edit those references
  4. Use Individual Validation to re-check corrections
  5. Repeat as needed

Project Structure

Root Directory (Main Project Files)

Reference-checking/
├── README.md                 # Main documentation (you are here)
├── index.html                # Main application interface
├── reference-tool.js         # Core JavaScript application
├── sample-references.txt     # Mixed format references for general testing
├── documents/                # User-facing documentation (8 files)
└── testing/                  # Test files for feature validation (11 files)

📂 ./documents/ (8 files)

User documentation and test files mentioned in this README:

Test Files:

  • COMPREHENSIVE-TEST-REFERENCES.txt - All test cases in one file (16 references)

User Documentation:

  • USAGE-GUIDE.txt - Detailed step-by-step usage instructions
  • QUICK-USAGE-GUIDE.txt - Quick reference guide for all-fields-editable feature
  • EDIT-AND-REVALIDATE-FEATURE.txt - Editing and re-validation guide
  • ALL-FIELDS-EDITABLE-FEATURE.txt - Complete editing feature documentation

Technical Documentation:

  • DUAL-VALIDATION-SYSTEM.txt - How the dual validation system works
  • ARXIV-FEATURE-DOCUMENTATION.txt - arXiv reference support details
  • CURRENT-STATUS.txt - Current implementation status and active patterns

📂 ./testing/ (11 files)

Test files for validating various features:

  • test-references.txt - Mixed format references with line-break issues
  • test-broken-urls.txt - URL extraction with space issues
  • test-arxiv-references.txt - arXiv format detection and validation
  • test-fake-reference.txt - Fake reference detection examples
  • test-8-apa-references.txt - APA reference splitting validation
  • test-author-extraction.txt - Author name parsing edge cases
  • test-year-suffix.txt - Year suffix handling (2020a, 2020b)
  • test-multiline-references.txt - Multi-line reference handling
  • test-hyphenated-breaks.txt - Hyphen removal from line breaks
  • test-ieee-nested-quotes.txt - Quote character handling
  • test-references-incl-fake.txt - Combined test scenarios

Privacy & Security

  • Local Processing: All reference extraction and parsing happens in your browser
  • No Data Storage: No references or text are stored on any server
  • API Calls: Validation requires calls to:
    • Semantic Scholar API (for paper verification)
    • arXiv API (for arXiv reference verification)
    • DOI/CrossRef (for DOI verification - may be blocked by CORS)
  • No Tracking: No analytics, cookies, or user tracking
  • Open Source: Pure JavaScript - inspect the code yourself

Future Enhancements (Optional)

  • Export validated references to BibTeX format
  • Batch validation mode with selection
  • Reference deduplication
  • Citation style conversion
  • Integration with DOI.org for additional validation
  • Auto-complete suggestions for venues and authors
  • Undo/redo functionality for edits
  • Save and load validation sessions

Development note

  • Initial tool: Janetyc and Claude Code, October 26, 2025
