Dealing with documents and text - techniques, tools and tips

This repository contains resources related to dealing with documents as a journalist. Typical challenges include:

Scraping multiple documents published online
'Batch downloading' multiple documents for offline analysis
Converting documents into searchable text
Dealing with scanned documents (OCR)
Searching across multiple documents
Identifying patterns in documents
Identifying entities in documents (people, places, organisations, dates and times)
Organising documents
Publishing documents
Matching names referred to in different ways in different documents

Tools for dealing with documents

Some useful tools to know about include:

Tabula (for extracting tables from documents)
OutWit Hub (for batch downloading documents - only available in Pro version)
Atom (text editor - for using regex for searching across multiple documents)
DocumentCloud (entity extraction and OCR)
Overview (identify links across documents, search and visualise)
Pinpoint "helps reporters quickly go through hundreds of thousands of documents by automatically identifying and organizing the most frequently mentioned people, organizations and locations. Instead of asking users to repeatedly hit “Ctrl+F,” the tool helps reporters use Google Search and Knowledge Graph, optical character recognition and speech-to-text technologies to search through scanned PDFs, images, handwritten notes, e-mails and audio files."
Datashare from the ICIJ for extracting text from documents and searching them. User Guide here
Open Semantic Search is "Free Software for your own Search Engine, Explorer for Discovery of large document collections, Media Monitoring, Text Analytics, Document Analysis & Text Mining platform"
Aleph is "A tool for indexing large amounts of both documents (PDF, Word, HTML) and structured (CSV, XLS, SQL) data for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets." Read the documentation on GitHub
The Archive Network - for creating personal and publishable archives, handles filetypes such as emails etc. Currently in closed beta.
Evernote (organisation, sharing, OCR and tagging)
Programming languages open up other ways of dealing with text. For example, you can search documents using the command line and regex, convert PDFs in Python using the library pdftoxml, or use techniques like sentiment analysis and Natural Language Processing (NLP). The free ebook Text Mining With R explains some techniques.
CSV Match is one tool for fuzzy matching.
Pandoc is a command line tool for converting, combining and otherwise processing documents

Other tools are bookmarked at https://pinboard.in/u:paulbradshaw/t:text+tools
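
As a rough illustration of searching across multiple documents with regex from a programming language, here is a minimal Python sketch. It assumes the documents have already been converted to plain .txt files in one folder. The function name search_folder, the folder name and the example pattern are hypothetical, not part of any tool listed above.

```python
import re
from pathlib import Path


def search_folder(folder, pattern, glob="*.txt"):
    """Return (filename, line number, line) for every line that matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(folder).glob(glob)):
        # errors="replace" keeps going if a file has odd characters in it
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.name, number, line.strip()))
    return hits
```

For example, search_folder("transcripts", r"\bprime minister\b") would list each line in the (hypothetical) transcripts folder that mentions the phrase, along with the file it came from.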

Useful concepts

A diff compares documents and shows the differences
Regex (regular expression) is a way of describing the pattern in a passage of text (e.g. 2 digits followed by a non-numerical character, followed by two digits...)
Sentiment analysis attempts to gauge whether a word or passage of text is positive or negative (used in this story along with diff)
Entity extraction attempts to identify and classify entities in a document, such as people, places, organisations, and times and dates.
OCR (Optical Character Recognition) attempts to convert images and scanned documents into text that can be searched etc. Google Images, for example, includes OCR so that you can search for images of specific licence plates, signs etc.
Fuzzy matching allows you to match different text where they are not exactly the same, e.g. names spelt or arranged slightly differently in different documents.
Ngrams are "a contiguous sequence of n items from a given sample of text or speech". In other words, a group of words that occur together (e.g. "police investigated", "investigated a") rather than a single word. The 'n' means 'any number' but related terms like bigrams (two-word pairs) and trigrams (three-word strings) specify the number of words involved.
Topic modeling is a way of categorising or organising documents by shared features in the text. For example you might have a collection of medical reports but need to know what they're about. Topic modeling might identify one cluster which tends to use one vocabulary (operation, incision, surgeon) and another which uses a different cluster of words (consultation, appointment, advised). This can help you to identify the group of documents you need to focus on.
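
Four of the concepts above (regex, ngrams, diff and fuzzy matching) can be tried out using only Python's standard library. The sketch below is illustrative rather than definitive: the function names are hypothetical, the word-splitting rule in ngrams is a simplification, and the 0.6 cutoff for fuzzy matching is just the difflib default rather than a recommended value.

```python
import difflib
import re


def find_pattern(text):
    """Regex: two digits, one non-digit character, then two more digits."""
    return re.findall(r"\d{2}\D\d{2}", text)


def ngrams(text, n=2):
    """Ngrams: lowercase the text, split into words, return each run of n adjacent words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def diff_lines(old, new):
    """Diff: return only the lines removed (-) or added (+) between two versions."""
    changes = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in changes if line.startswith(("- ", "+ "))]


def best_match(name, candidates, cutoff=0.6):
    """Fuzzy matching: return the closest candidate, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, ngrams("Police investigated a burglary") returns ["police investigated", "investigated a", "a burglary"], and best_match("Jon Smith", ["John Smith", "Jane Smythe"]) returns "John Smith". Dedicated tools such as CSV Match will handle real name-matching jobs more carefully than this.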

Guides and tutorials

Sources of text to work with

Many political assemblies publish transcripts. In the UK, Hansard is the official record of all debates; it offers HTML downloads as well as an API. TheyWorkForYou also provides the same data through an API.
Most political speeches and statements are published by the Government or political parties. Examples include Council of Europe speeches, the European Parliament, the Parliament of India, and Gov.uk's news and communication
The Chilcot Inquiry published over 150 witness transcripts
Submissions to the Cairncross Review (98 documents) - download them from the call for submissions
Companies House publishes bulk data files of company accounts. These are in iXBRL format (.html file extension) or XBRL format (.xml file extension). The file names include the company number and filing date. Use the command line to navigate to the folder and create a spreadsheet of the filenames using ls > filenames.csv, then extract and filter by company number, filing date (=RIGHT(SUBSTITUTE(SUBSTITUTE(A2,".html",""),".xml",""),8)) and filetype.
There are some scraped IOPC recommendations in this folder
Upworthy shared a dataset of their story headlines and tests - read the research here
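
The Companies House steps above (list the filenames, then pull out company number, filing date and filetype) could also be done in Python. This is a sketch under an assumption: that each filename ends with the company number and an 8-digit date separated by underscores, as in the made-up example Prod223_0001_01234567_20200131.html. Only the "last 8 characters before the extension" step comes from the spreadsheet formula; the function names and the position of the company number are hypothetical, so check them against your own download.

```python
import csv
from pathlib import Path

FIELDS = ["filename", "company_number", "filing_date", "filetype"]


def parse_filename(filename):
    """Split one accounts filename into company number, filing date and filetype.

    Assumes names shaped like <prefix>_<company number>_<YYYYMMDD>.<ext>.
    """
    path = Path(filename)
    parts = path.stem.split("_")
    return {
        "filename": path.name,
        "company_number": parts[-2] if len(parts) >= 2 else "",
        "filing_date": path.stem[-8:],  # same idea as RIGHT(..., 8) in the formula
        "filetype": path.suffix.lstrip("."),
    }


def write_index(folder, out_csv="filenames.csv"):
    """List every .html and .xml file in folder and save the parsed fields as a CSV."""
    rows = [parse_filename(p.name) for p in sorted(Path(folder).iterdir())
            if p.suffix in (".html", ".xml")]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling write_index on the folder of downloaded accounts produces a filenames.csv that can be opened in a spreadsheet and filtered by any of the three columns.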


paulbradshaw/dealingwithdocuments
