How to contribute reports

Official Inquiries collects, preserves and presents key documents from official inquiries in a useful and usable form.

Create a GitHub account

Create a repository

GitHub provides its own instructions for creating a repository.
Naming your repository:
- When choosing a name for your repository, the convention we use is [name of inquiry body]-[date (yyyymmdd)]-[title].
- Each word should be separated with a dash “-” for easy reading.
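As a rough illustration of the naming convention, the sketch below builds a name from its three parts. `make_repo_name` and `slugify` are hypothetical helpers, the sample values are made up, and lowercasing the words is an assumption rather than a stated rule.

```shell
# slugify: lowercase the input (an assumption, not a stated rule), turn every run
# of non-alphanumeric characters into a single dash, and trim dashes at the ends.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# make_repo_name (hypothetical): [name of inquiry body]-[date (yyyymmdd)]-[title]
make_repo_name() {
  printf '%s-%s-%s\n' "$(slugify "$1")" "$2" "$(slugify "$3")"
}

# Illustrative values only.
make_repo_name "Example Committee" 20200131 "Report on Widgets"
# prints: example-committee-20200131-report-on-widgets
```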
How to structure your repository:
- Our repositories are structured as follows:
  - [repository]
    - archive
      - [archive of original source documents]
    - scripts
      - README.md (detailing any important scripts used to create the repository)
    - text
      - [text extracted from source documents]
    - README.md
    - datapackage.json
- To see an example, have a look at our repository on the Senate PSI report on the 2008 financial crisis.
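The layout above could be scaffolded with a few commands. This is only a sketch: `scaffold_repo` is a hypothetical helper, and the files it creates are empty placeholders that still need real content (an empty datapackage.json is not yet valid JSON).

```shell
# scaffold_repo (hypothetical): create the directory layout described above.
scaffold_repo() {
  repo=$1
  mkdir -p "$repo/archive" "$repo/scripts" "$repo/text"
  # Create empty placeholder files, leaving any existing files untouched.
  for f in "$repo/scripts/README.md" "$repo/README.md" "$repo/datapackage.json"; do
    [ -e "$f" ] || : > "$f"
  done
}

# Example: scaffold a throwaway repository in a temporary directory.
demo_dir=$(mktemp -d)
scaffold_repo "$demo_dir/example-repo"
ls "$demo_dir/example-repo"
```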

Using Git to sync your repository with GitHub

After creating the repository on GitHub, you will need to clone it to your computer. To do this, navigate to the directory where you want the clone to live and run `git clone [url of repository]`.
You can then make changes to the contents of the repository on your computer. When you create a new file, use the command `git add [file path]` to add it to the list of files you would like to commit.
When you are happy with your changes to the repository, use the command `git commit` to commit them. Don’t forget to add a brief note about the changes you are committing.
- The formatting convention we use for the note is [nature of note][size of changes][brief description].
- For example, `[archive][l] added archive of original reports` denotes a large change to the archive section: adding the original reports to it.
To sync your changes with the GitHub repository, use the command `git push` to push your changes from your local repository to the online version.
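A small sketch of the commit note convention follows. `format_commit_note` is a hypothetical helper, and the Git commands are shown as comments because they need a real clone and network access.

```shell
# format_commit_note (hypothetical): build a note of the form
# [nature of note][size of changes] brief description
format_commit_note() {
  printf '[%s][%s] %s\n' "$1" "$2" "$3"
}

format_commit_note archive l "added archive of original reports"
# prints: [archive][l] added archive of original reports

# A typical session might then look like this:
#   git clone [url of repository]
#   git add archive/
#   git commit -m "$(format_commit_note archive l "added archive of original reports")"
#   git push
```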
Storing large files:
- Sometimes, you will need to upload large files (i.e. larger than 50 MB) to GitHub. It is not advisable to upload such files as they are, so use Git Large File Storage (Git LFS) for them.
- GitHub provides its own instructions on how to use Git LFS.
- For example, after installing Git LFS, depending on your setup and preferences, run this command once per machine:
```
$ git lfs install
```
Now specify file patterns to store with Git LFS in .gitattributes:
```
$ cd to/repo
$ git init
$ git lfs track "*.pdf"
```
Commit your changes and push:
```
$ git add .gitattributes
$ git add example.pdf
$ git commit -m "add pdf files"
$ git push
```
You can confirm that Git LFS is managing your PDF file:
```
$ git lfs ls-files
example.pdf
```
Other important commands:
- `git status`: at any point during the process, you can use `git status` to check the status of your local changes.
- `git pull`: use `git pull` to pull changes made to the online repository and bring your local version up to date with them.

Writing a README.md

The purpose of the README.md is to give a basic outline of what the report contains.
The README.md should be written in markdown. A guide to markdown and its syntax can be found here.
Have a look at some of the README.md files in the existing official-inquiries repositories to get an idea of how to structure it. Generally, it should contain:
- An introductory paragraph, explaining what the report is.
- A summary of the report’s contents.
- Where you found the materials.
- What the license for the materials is.

Writing a datapackage.json

The datapackage file is a convenient means of storing data about the repository.
A guide to its use can be found here.
Have a look at the datapackage.json for the Senate PSI financial crisis report here to get an idea of the syntax and contents you will need.

Extracting text from a .pdf

Usually, you will have to extract text from a .pdf file of a report to get text for your text directory. To do this, we normally use PDFMiner.
Instructions on how to install and use PDFMiner can be found here.

In some cases, you may also want to try using Apache PDFBox.

To use PDFBox, download the .jar file and then run the command `java -jar pdfbox-app-2.y.z.jar ExtractText [OPTIONS] <inputfile> <outputfile>`, where y and z are the version numbers of the file you downloaded.

Raising issues on GitHub

You can raise any project issues on GitHub using the issues tab on the web interface. You can also help solve any issues listed there.

Finding inquiries

If you have found an interesting official inquiry, the procedure is to add it as a GitHub issue on the main Official Inquiries repository. The issue post should contain:

- The country of origin for the inquiry
- A synopsis of what the inquiry is
- Links to the original documents

We will focus our work on the inquiries that are likely to have the greatest public interest first. If and when an inquiry is chosen, a repository will be created for it, containing:

- A README detailing the inquiry with a summary of its contents, a list of the materials used and the license under which they are available to us
- An /archive/ directory containing all the original files
- A data package containing relevant information about the inquiry

As an example, look at the repository for the Levin/Coburn report into the Financial Crisis.

Contributing to existing repositories

Once we have the original files for an inquiry, we want to end up with:

- [preserved original file]: the archive file
- [preserved original text]: the file having been put through the text extraction tool and only the text extraction tool
- [“tidy” text]: the tidied text; this can be tidied in stages, either manually or with a script
- [markdown file]: the most up-to-date “tidy” text in markdown; this goes on the site

This allows us to tidy documents in stages. The actions, files and file locations for our workflow look like this:

Converting archive files to text

The simplest way to help is to take files in the /archive/ directory of a repository, convert them into .txt format, and place them in the /text/ directory. We currently do this using PDFMiner. Instructions on how to install and use PDFMiner can be found here.
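The conversion step might be scripted roughly as follows, assuming PDFMiner's `pdf2txt.py` command is installed and on your PATH. `text_path_for` and `convert_archive` are hypothetical helpers, not scripts that ship with any repository.

```shell
# text_path_for (hypothetical): map archive/report.pdf to text/report.txt
text_path_for() {
  printf 'text/%s.txt\n' "$(basename "$1" .pdf)"
}

# convert_archive (hypothetical): convert every PDF in archive/ using
# PDFMiner's pdf2txt.py. Run it from the root of the repository.
convert_archive() {
  for pdf in archive/*.pdf; do
    [ -e "$pdf" ] || continue   # the pattern matched no files
    mkdir -p text
    pdf2txt.py -o "$(text_path_for "$pdf")" "$pdf"
  done
}

text_path_for archive/example.pdf
# prints: text/example.txt
```

Calling `convert_archive` leaves the files in archive/ untouched and writes one .txt file per PDF into text/.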

Tidying up converted text files

Once a converted text file has been placed in the /text/ directory, the next step is to tidy the file up and fix issues that the conversion software did not pick up. The criteria for a tidy inquiry can be found just below.
Some inquiries may have scripts which can be run to automate part or all of the tidying process. The Iraq Inquiry repository, for example, has a script in /scripts/process.py that will format footnotes and bullet points and remove unnecessary paragraph space.

Turning tidy text files into markdown

The next step is to convert these text files to markdown so that they can be put up on the website.
A file with the .md extension should be added to the repository's /markdown/ directory.

Criteria for a tidy file

Each text file will vary in how it is untidy, but these are some things to look out for:

Page information

This is needed for Jekyll to read the markdown file as a webpage, so it is a basic prerequisite of a tidy file. The top of a tidy file should contain:

- The page information:
  - Which HTML layout file in the _layouts directory it uses (usually page)
  - The title of the page
  - edit: true
- The markdown table of contents. The text of the inquiry's original table of contents should be removed.

Once you're done, the top of the page should look something like:
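As a hedged sketch of what such a page top might contain, the helper below prints Jekyll front matter followed by a table of contents marker. `write_front_matter` is hypothetical, the title is a placeholder, and the `* TOC` / `{:toc}` marker assumes the site uses kramdown; check an existing markdown file in the repository for the exact form.

```shell
# write_front_matter (hypothetical): print the page information and a
# table of contents marker for the title given as the first argument.
write_front_matter() {
  cat <<EOF
---
layout: page
title: "$1"
edit: true
---

* TOC
{:toc}

EOF
}

# Placeholder title for illustration only.
write_front_matter "Example inquiry report"
```

The output could be redirected to a new file, with the tidy text appended after it.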

Headings

In order for markdown's table of contents to work, headings and subheadings should be formatted in markdown.
Headings should also be in standard casing, avoiding all-caps where the text conversion has produced this.

Bullet points

Bullet points should be formatted in markdown.

Page and paragraph numbers

Some inquiries will have section and paragraph numbers. We generally want to keep these.
Page numbers should be:
- Retained, in cases where there are no paragraph or section numbers (people will want some way of knowing where to look in our reports, e.g. “on paragraph/page x of the inquiry”).
  - Where they are retained, they should be clearly separate from the text, on their own line, ideally with a blank line before and after.
  - Putting the text “Page” before the number helps make it clear that the number is a page number, e.g. “Page 9” instead of just “9”.
- Removed, in cases where there are paragraph or section numbers (these alone should provide enough guidance for users, and too many page numbers are untidy).
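Where page numbers are retained, a filter along these lines could label them. It is a sketch only: `label_page_numbers` is a hypothetical helper, and it assumes that any line consisting solely of digits is a page number, which will not hold for every inquiry, so check the result by hand.

```shell
# label_page_numbers (hypothetical): rewrite a line that contains only digits
# as "Page N", with a blank line before and after it.
label_page_numbers() {
  awk '
    /^[[:space:]]*[0-9]+[[:space:]]*$/ {
      gsub(/[[:space:]]/, "")   # strip surrounding whitespace from the number
      print ""
      print "Page " $0
      print ""
      next
    }
    { print }
  '
}

printf 'end of a page\n9\nstart of the next page\n' | label_page_numbers
```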

Page titles

These should be removed, usually with a simple find and replace.
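If you prefer the command line to an editor's find and replace, one possible approach is to drop every line that exactly matches the repeated title. `remove_page_title` is a hypothetical helper and the title text is made up.

```shell
# remove_page_title (hypothetical): drop every line that is exactly equal to
# the given title. -F treats the title as a fixed string, -x matches whole
# lines only, and -v inverts the match.
remove_page_title() {
  grep -vxF -- "$1"
}

printf 'first page text\nEXAMPLE REPORT TITLE\nsecond page text\n' \
  | remove_page_title 'EXAMPLE REPORT TITLE'
```

Because only whole-line matches are removed, sentences that merely mention the title are left alone.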

Excess space

Some inquiries, when processed, may contain superfluous blank lines, usually where a page break occurred in the source file. These should be removed.
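One way to do this, sketched below, is to collapse every run of blank lines into a single blank line so that paragraph breaks survive. `squeeze_blank_lines` is a hypothetical helper; whether one blank line is the right amount to keep is a judgment call for each inquiry.

```shell
# squeeze_blank_lines (hypothetical): collapse each run of blank or
# whitespace-only lines into one empty line.
squeeze_blank_lines() {
  awk '
    NF     { print; blank = 0; next }   # a line with content: print it as-is
    !blank { print ""; blank = 1 }      # first blank line of a run: keep one
  '
}

printf 'end of page\n\n\n\nstart of page\n' | squeeze_blank_lines
```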

Text disorder

Unfortunately, for some inquiries (for example, this inquiry’s text), the extracted text does not come out in the correct order.
In this case, it can be worth trying another text extraction tool, if you are aware of one. If you have success like this, please let us know which tool you used.
