Dealing with documents and text - techniques, tools and tips

This repository contains resources related to dealing with documents as a journalist. Typical challenges include:

  • Scraping multiple documents published online
  • 'Batch downloading' multiple documents for offline analysis
  • Converting documents into searchable text
  • Dealing with scanned documents (OCR)
  • Searching across multiple documents
  • Identifying patterns in documents
  • Identifying entities in documents (people, places, organisations, dates and times)
  • Organising documents
  • Publishing documents
  • Matching names referred to in different ways in different documents

Tools for dealing with documents

Some useful tools to know about include:

  • Tabula (for extracting tables from PDFs)
  • OutWit Hub (for batch downloading documents - a feature only available in the Pro version)
  • Atom (text editor with project-wide regex search, useful for searching across multiple documents)
  • DocumentCloud (entity extraction and OCR)
  • Overview (identify links across documents, search and visualise)
  • Pinpoint "helps reporters quickly go through hundreds of thousands of documents by automatically identifying and organizing the most frequently mentioned people, organizations and locations. Instead of asking users to repeatedly hit “Ctrl+F,” the tool helps reporters use Google Search and Knowledge Graph, optical character recognition and speech-to-text technologies to search through scanned PDFs, images, handwritten notes, e-mails and audio files."
  • Datashare from the ICIJ for extracting text from documents and searching them. User Guide here
  • Open Semantic Search is "Free Software for your own Search Engine, Explorer for Discovery of large document collections, Media Monitoring, Text Analytics, Document Analysis & Text Mining platform"
  • Aleph is "A tool for indexing large amounts of both documents (PDF, Word, HTML) and structured (CSV, XLS, SQL) data for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets." Read the documentation on GitHub
  • The Archive Network - for creating personal and publishable archives; handles filetypes such as emails. Currently in closed beta.
  • Evernote (organisation, sharing, OCR and tagging)
  • Programming languages open up other ways of dealing with text. For example, you can search documents using the command line and regex (see the sketch after this list), convert PDFs in Python using the library pdftoxml, or use techniques like sentiment analysis and Natural Language Processing (NLP). The free ebook Text Mining With R explains some techniques.
  • CSV Match is one tool for fuzzy matching (a concept sketched under 'Useful concepts' below).
  • Pandoc is a command line tool for converting and combining documents across a wide range of formats.
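
As a minimal sketch of the programming-language approach mentioned above, the snippet below uses Python's built-in re and pathlib modules to run a regex search across a folder of documents. The folder name documents and the .txt files inside it are assumptions for illustration:

```python
import re
from pathlib import Path

# The pattern described under 'Useful concepts' below:
# two digits, a non-numeric character, then two more digits
pattern = re.compile(r"\d{2}\D\d{2}")

# 'documents' is a hypothetical folder of already-converted .txt files
for path in Path("documents").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    for match in pattern.finditer(text):
        print(f"{path.name}: {match.group()}")
```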

Other tools are bookmarked at https://pinboard.in/u:paulbradshaw/t:text+tools

Useful concepts

  • A diff compares two documents and shows the differences between them (sketched below)
  • Regex (regular expression) is a way of describing a pattern in a passage of text (e.g. two digits followed by a non-numeric character, followed by two more digits; see the search sketch in the tools section above)
  • Sentiment analysis attempts to gauge whether a word or passage of text is positive or negative (used in this story along with diff; sketched below)
  • Entity extraction attempts to identify and classify entities in a document, such as people, places, organisations, and times and dates (sketched below)
  • OCR (Optical Character Recognition) attempts to convert images and scanned documents into text that can be searched and analysed (sketched below). Google Images, for example, includes OCR so that you can search for images of specific licence plates, signs etc.
  • Fuzzy matching allows you to match pieces of text that are not exactly the same, e.g. names spelt or arranged slightly differently in different documents (sketched below)
  • Ngrams are "a contiguous sequence of n items from a given sample of text or speech". In other words, a group of words that occur together (e.g. "police investigated", "investigated a") rather than a single word. The 'n' means 'any number', but related terms like bigrams (two-word pairs) and trigrams (three-word sequences) specify the number of words involved (sketched below).
  • Topic modeling is a way of categorising or organising documents by shared features in the text. For example, you might have a collection of medical reports but need to know what they're about. Topic modeling might identify one cluster which tends to use one vocabulary (operation, incision, surgeon) and another which uses a different cluster of words (consultation, appointment, advised). This can help you to identify the group of documents you need to focus on (sketched below).
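
The minimal sketches below illustrate these concepts in Python; each is an illustration rather than a production workflow, and any file or folder names are assumptions. First, a diff using the built-in difflib module:

```python
import difflib

before = ["The minister said the policy was under review."]
after = ["The minister said the policy had been scrapped."]

# unified_diff produces output similar to the command line 'diff' tool
for line in difflib.unified_diff(before, after,
                                 fromfile="before.txt", tofile="after.txt",
                                 lineterm=""):
    print(line)
```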
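
Sentiment analysis can be sketched with NLTK's VADER scorer - one option among many, and not a tool named in the list above:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off download of the scoring lexicon
sia = SentimentIntensityAnalyzer()

# the 'compound' score runs from -1 (most negative) to +1 (most positive)
print(sia.polarity_scores("Inspectors praised the hospital's excellent care."))
print(sia.polarity_scores("Inspectors found serious and dangerous failings."))
```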
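
Entity extraction is what tools like DocumentCloud and Pinpoint do behind the scenes; in Python one option (again, an assumption rather than a tool from the list) is the spaCy library:

```python
import spacy

# assumes the small English model has been installed first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Paul Bradshaw spoke at Birmingham City University on 3 May.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # labels such as PERSON, ORG, GPE, DATE
```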
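
OCR can be sketched with pytesseract, a Python wrapper for the Tesseract engine (an assumption; tools like DocumentCloud handle this step for you):

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR engine to be installed separately

# 'scan.png' is a hypothetical scanned page
text = pytesseract.image_to_string(Image.open("scan.png"))
print(text)
```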
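
Fuzzy matching is what CSV Match (listed above) does for whole spreadsheets; the basic idea can be sketched with Python's built-in difflib:

```python
from difflib import SequenceMatcher

# the same person referred to in different ways in different documents
a = "Jon Smith"
b = "John Smyth"

# ratio() returns a similarity score between 0 and 1
score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
if score > 0.8:  # the threshold is a judgment call
    print(f"Possible match ({score:.2f}): {a} / {b}")
```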
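
Ngrams can be generated with nothing more than the standard library:

```python
from collections import Counter

text = "police investigated a report police investigated a complaint"
words = text.split()

# bigrams: every pair of consecutive words
bigrams = list(zip(words, words[1:]))
print(Counter(bigrams).most_common(2))
# [(('police', 'investigated'), 2), (('investigated', 'a'), 2)]
```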
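
Finally, topic modeling on the medical-reports example can be sketched with scikit-learn (an assumption; the four 'reports' here are toy stand-ins for real documents):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reports = [
    "operation incision surgeon theatre recovery",
    "consultation appointment advised review referral",
    "surgeon operation theatre anaesthetic incision",
    "appointment consultation referral advised clinic",
]

vectoriser = CountVectorizer()
counts = vectoriser.fit_transform(reports)

# ask the model for two clusters of co-occurring vocabulary
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

words = vectoriser.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [words[j] for j in topic.argsort()[-3:]]
    print(f"Topic {i}: {top_words}")
```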

Guides and tutorials

Sources of text to work with
