# Regular expressions in command line grep (and other command line utilities)

## Background

We've been collaboratively writing an article, and people have been jumping in and adding references into a shared editable draft. We now want to tidy these up, ensure everything is formatted neatly and use a reference manager to build the bibliography in the correct style.

This outline covers extracting the citations from the draft article text using regular expressions.

## Starting assumptions

You have access to:
- Article contents in a simple text file: `article.txt` (the sample here uses generated text and some citations, mostly in author year style)
- A command line environment with `grep` and other common command line utilities (`head`, `sort`, `tr`, `uniq` etc.).

## First look at the article contents

`head` is a utility that outputs the first part of a file or other input (the first 10 lines by default). `tail` is its counterpart, which outputs the last part.

Here, `head` is used to output the first part of the `article.txt` file:

In [1]:
! head article.txt

 Neque porro ipsum est.  Labore quaerat numquam porro sit (Jarman et al. 2018)     non.  Quaerat quaerat est eius quaerat est (Piet et al. 2019).  Porro magnam velit quiquia est (e.g. Colville & Pritchard 2019) modi (Li & Pritchard 2009).  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt (Wyse et al. 2018) consectetur quisquam (Wyse & Dickie 2017).  Neque porro ipsum est.  Labore quaerat numquam porro sit (Victor et al. 2015) non.  Quaerat quaerat est eius quaerat (Baker et al. 2017) est.  Porro magnam velit quiquia est (Nicholson et al. 2018) modi.  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt consectetur (e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019) quisquam. 
 Porro labore non aliquam adipisci (Hedrick et al. 2019; Lendemer et al. 2020) amet adipisci.  Dolorem est quaerat quaerat consectetur sit (Heberling & Isaac 2018) eius.  Consectetur consectetur dolorem quisquam ut sed (Berendsohn & Selt

This article snippet shows that citations are in the form (Author year), separated by semicolons when multiple citations are presented together, e.g.:
- (Lendemer et al. 2020)
- (Webster 2017; Lendemer et al. 2020)

## First attempt at finding citations in the text
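This uses the command line utility `grep` (*Global Regular Expression Print*).
The arguments passed to grep are:
- `-e "(.*)"` - `-e` introduces the pattern to search for, here `"(.*)"`. By default `grep` uses basic regular expression syntax, in which brackets are matched literally (extended syntax is selected with `-E`)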
- `--color=always` - colour the matches in the output
- `article.txt` - the file to process
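
A small sketch of how the pattern syntax affects brackets, using a made-up sample line (the `sample` variable and its contents are assumptions for illustration, not part of `article.txt`). In `grep`, `-e` introduces the pattern and the default syntax is basic regular expressions, in which `(` and `)` are ordinary characters; `-E` selects extended syntax, in which unescaped brackets form a group instead.

```shell
# Hypothetical sample line, not taken from article.txt.
sample='Lorem (Smith 2019) ipsum.'

# -o prints only the matched text, which makes the difference easy to see.
# Default (basic) syntax: brackets in the pattern match literal brackets.
basic=$(printf '%s\n' "$sample" | grep -o -e "(Smith 2019)")

# Extended syntax (-E): unescaped brackets only group, so they are not
# part of the matched text.
grouped=$(printf '%s\n' "$sample" | grep -o -E -e "(Smith 2019)")

# Extended syntax with escaped brackets matches them literally again.
escaped=$(printf '%s\n' "$sample" | grep -o -E -e "\(Smith 2019\)")

printf '%s\n' "$basic" "$grouped" "$escaped"
# prints:
# (Smith 2019)
# Smith 2019
# (Smith 2019)
```

The patterns in this outline rely on the default basic syntax, which is why the brackets need no escaping.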

In [2]:
! grep -e "(.*)" --color=always article.txt 

 Neque porro ipsum est.  Labore quaerat numquam porro sit [01;31m[K(Jarman et al. 2018)     non.  Quaerat quaerat est eius quaerat est (Piet et al. 2019).  Porro magnam velit quiquia est (e.g. Colville & Pritchard 2019) modi (Li & Pritchard 2009).  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt (Wyse et al. 2018) consectetur quisquam (Wyse & Dickie 2017).  Neque porro ipsum est.  Labore quaerat numquam porro sit (Victor et al. 2015) non.  Quaerat quaerat est eius quaerat (Baker et al. 2017) est.  Porro magnam velit quiquia est (Nicholson et al. 2018) modi.  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt consectetur (e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019)[m[K quisquam. 
 Porro labore non aliquam adipisci [01;31m[K(Hedrick et al. 2019; Lendemer et al. 2020) amet adipisci.  Dolorem est quaerat quaerat consectetur sit (Heberling & Isaac 2018) eius.  Consectetur consectetur dolorem quisqu

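In the output above, the terminal colour codes `[01;31m[K` and `[m[K` mark the start and end of each highlighted match. This shows that the `.*` part of the match is *greedy*: the complete pattern `(.*)` matches the first opening bracket on the line and then captures *as much as possible*, up to the final close bracket on the line.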
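
The greedy behaviour is easier to see on a short line. The following is a minimal sketch on a made-up sample (the `sample` text is an assumption for illustration), using `-o` to print only the matched text:

```shell
# Hypothetical one-line sample containing two separate citations.
sample='Lorem (Smith 2019) ipsum (Jones 2020) dolor.'

# The match starts at the first "(" and .* runs on to the last ")".
greedy=$(printf '%s\n' "$sample" | grep -o -e "(.*)")

printf '%s\n' "$greedy"
# prints: (Smith 2019) ipsum (Jones 2020)
```

Both citations and the text between them come back as a single match, which is the same effect seen on the full article lines.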

To make the matches more usable, we need to refine the pattern to capture just the citations. As each citation starts with a (usually uppercase) letter and ends with a year, this should be possible.
The pattern we're using is: `([A-Z].*[0-9])` - which says match in order:
- an open bracket `(`
- any uppercase character `[A-Z]`
- anything `.*`
- any digit `[0-9]`
- a close bracket `)`

In [3]:
! grep -e "([A-Z].*[0-9])" --color=always article.txt 

 Neque porro ipsum est.  Labore quaerat numquam porro sit [01;31m[K(Jarman et al. 2018)     non.  Quaerat quaerat est eius quaerat est (Piet et al. 2019).  Porro magnam velit quiquia est (e.g. Colville & Pritchard 2019) modi (Li & Pritchard 2009).  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt (Wyse et al. 2018) consectetur quisquam (Wyse & Dickie 2017).  Neque porro ipsum est.  Labore quaerat numquam porro sit (Victor et al. 2015) non.  Quaerat quaerat est eius quaerat (Baker et al. 2017) est.  Porro magnam velit quiquia est (Nicholson et al. 2018) modi.  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt consectetur (e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019)[m[K quisquam. 
 Porro labore non aliquam adipisci [01;31m[K(Hedrick et al. 2019; Lendemer et al. 2020) amet adipisci.  Dolorem est quaerat quaerat consectetur sit (Heberling & Isaac 2018) eius.  Consectetur consectetur dolorem quisqu

Unfortunately it's still greedy. Next we replace the "anything" part of the pattern (`.*`) with a negated character class to prevent reading past a close bracket. We also widen the first character class to accept lowercase letters, so that citations introduced by "e.g." are matched too:
`([A-Za-z][^)]*[0-9])`
Which says match in order:
- an open bracket `(`
- any letter, uppercase or lowercase `[A-Za-z]`
- anything that isn't a close bracket `[^)]*` - the `^` character at the start negates the contents of the `[]` and `*` says repeat any number of times
- any digit `[0-9]`
- a close bracket `)`
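
As a rough sketch of what the negated class changes, the two variants of the pattern can be compared on a made-up line (the `sample` text is an assumption; `LC_ALL=C` is added here only so that the letter ranges behave as plain ASCII ranges regardless of locale):

```shell
# Hypothetical sample: two citations and one bracketed aside.
sample='Lorem (Smith 2019) ipsum (e.g. Jones 2020) dolor (see text).'

# Uppercase-only first letter: misses the citation that starts with "e.g.".
upper_only=$(printf '%s\n' "$sample" | LC_ALL=C grep -o -e "([A-Z][^)]*[0-9])")

# Any letter first: [^)]* cannot cross a ")", so each match stays
# inside one pair of brackets.
matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -o -e "([A-Za-z][^)]*[0-9])")

printf '%s\n' "$upper_only"
# prints: (Smith 2019)
printf '%s\n' "$matches"
# prints:
# (Smith 2019)
# (e.g. Jones 2020)
```

The aside `(see text)` is skipped by both variants because it has no digit immediately before the close bracket.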

In [4]:
! grep -e "([A-Za-z][^)]*[0-9])" --color=always article.txt 

 Neque porro ipsum est.  Labore quaerat numquam porro sit [01;31m[K(Jarman et al. 2018)[m[K     non.  Quaerat quaerat est eius quaerat est [01;31m[K(Piet et al. 2019)[m[K.  Porro magnam velit quiquia est [01;31m[K(e.g. Colville & Pritchard 2019)[m[K modi [01;31m[K(Li & Pritchard 2009)[m[K.  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt [01;31m[K(Wyse et al. 2018)[m[K consectetur quisquam [01;31m[K(Wyse & Dickie 2017)[m[K.  Neque porro ipsum est.  Labore quaerat numquam porro sit [01;31m[K(Victor et al. 2015)[m[K non.  Quaerat quaerat est eius quaerat [01;31m[K(Baker et al. 2017)[m[K est.  Porro magnam velit quiquia est [01;31m[K(Nicholson et al. 2018)[m[K modi.  Quisquam labore sit dolore non amet numquam velit.  Quiquia ut neque est etincidunt consectetur [01;31m[K(e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019)[m[K quisquam. 
 Porro labore non aliquam adipisci [01;31m[K(Hedrick et al. 2019;

It looks like we have detected the citations in the text of the article draft. Given the citations can be detected, how can they be extracted? Command line `grep` has an option `--only-matching` to output only the matches that it finds (i.e. it drops the surrounding text): 

In [5]:
! grep -e "([A-Za-z][^)]*[0-9])" --only-matching article.txt 

(Jarman et al. 2018)
(Piet et al. 2019)
(e.g. Colville & Pritchard 2019)
(Li & Pritchard 2009)
(Wyse et al. 2018)
(Wyse & Dickie 2017)
(Victor et al. 2015)
(Baker et al. 2017)
(Nicholson et al. 2018)
(e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019)
(Hedrick et al. 2019; Lendemer et al. 2020)
(Heberling & Isaac 2018)
(Berendsohn & Seltmann 2010; Owens & Johnson 2019)
(Fritze et al. 2012)
(OECD, 2007)
(OECD, 2001)
(Heywood 2017)
(Brummitt 2001)
(Figure 8)
(Table 4)
(Meyer et al. 2016; Daru et al. 2018; Meineke et al. 2018; Nekola et al. 2019; Zizka et al. 2020)
(about 27% of the total, followed by the Caribbean, Central and South America, with 22% and Europe, with about 19%. Africa, Tropical Asia and the Pacific regions comprise the smallest proportion of digitized specimens currently available through GBIF (Fig. 7)
(GBIF 2020)
(Canteiro et al. 2019)
(Soltis 2017)
(GBIF Secretariat 2019)
(Schindel & Cook 2018; Nelson & Ellis 2019; Lendemer et al. 2020)
(Ryan 2013; Smith & 

This list includes some things which are not bibliographic citations, e.g. notes to self, or references to tables or figures. These can be removed by "piping" into a further `grep` command which inverts the match (using the `-v` flag), so that only the lines which do *not* match are output.
A pipe (`|`) is used to redirect the output of one command line utility to act as the input to another. This means that many small utility programs can be chained together into a pipeline.
The pattern used in the second grep specifies a number of alternatives, separated by `\|`.
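
Alternation with `\|` inside a basic pattern is a GNU extension rather than part of the POSIX basic syntax. A sketch of an equivalent filter that passes each alternative with its own `-e` option is shown here on made-up input (the `extracted` list is an assumption for illustration):

```shell
# Hypothetical extracted matches, one per line.
extracted='(Smith 2019)
(Table 4)
(Fig. 7)
(Jones 2020)'

# GNU-style alternation inside a single basic pattern.
gnu_style=$(printf '%s\n' "$extracted" | grep -v -e "(Tab\|(Fig")

# One -e per alternative: a line is dropped if it matches any of them.
multi_e=$(printf '%s\n' "$extracted" | grep -v -e "(Tab" -e "(Fig")

printf '%s\n' "$multi_e"
# prints:
# (Smith 2019)
# (Jones 2020)
```

With GNU `grep` both forms should select the same lines; the multiple `-e` form may be the safer choice where a different `grep` is installed.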

In [6]:
!grep -e "([A-Za-z][^)]*[0-9])" --only-matching article.txt | grep -v -e "(Tab\|(Fig\|(http\|(insert\|(about"

(Jarman et al. 2018)
(Piet et al. 2019)
(e.g. Colville & Pritchard 2019)
(Li & Pritchard 2009)
(Wyse et al. 2018)
(Wyse & Dickie 2017)
(Victor et al. 2015)
(Baker et al. 2017)
(Nicholson et al. 2018)
(e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019)
(Hedrick et al. 2019; Lendemer et al. 2020)
(Heberling & Isaac 2018)
(Berendsohn & Seltmann 2010; Owens & Johnson 2019)
(Fritze et al. 2012)
(OECD, 2007)
(OECD, 2001)
(Heywood 2017)
(Brummitt 2001)
(Meyer et al. 2016; Daru et al. 2018; Meineke et al. 2018; Nekola et al. 2019; Zizka et al. 2020)
(GBIF 2020)
(Canteiro et al. 2019)
(Soltis 2017)
(GBIF Secretariat 2019)
(Schindel & Cook 2018; Nelson & Ellis 2019; Lendemer et al. 2020)
(Ryan 2013; Smith & Figueiredo 2014)
(January 2020)
(Seberg, Droege et al. 2016)
(Hong et al. 1998, 1999)
(Mounce et al. 2017)
(Wyse & Dickie 2017)
(FAO 2019)
(Hay and Probert 2013)
(Li and Pritchard 2009)
(FAO, 2019)
(Overmann, 2015)
(WIPO, 2019)
(WFCC WCDM 2020)
(Blackwell 2011)
(Willis et al. 2018

A useful example of chaining is that the output of this command (the list of matches) can be piped into a utility (`wc`) to count the number of lines (using the `-l` switch):

In [7]:
!grep -e "([A-Za-z][^)]*[0-9])" --only-matching article.txt | grep -v -e "(Tab\|(Fig\|http\|insert\|about" | wc -l

55
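
As an alternative sketch, `grep` can do the counting itself: the `-c` option prints the number of selected lines instead of the lines. Because the earlier step already puts each match on its own line, adding `-c` to the filtering `grep` should give the same figure as piping into `wc -l`. The input below is made up for illustration.

```shell
# Hypothetical extracted matches, one per line.
extracted='(Smith 2019)
(Table 4)
(Jones 2020)'

# Count with wc -l (tr strips the padding some wc versions add).
with_wc=$(printf '%s\n' "$extracted" | grep -v -e "(Tab" | wc -l | tr -d ' ')

# Count with grep -c: the number of lines left after the -v filter.
with_c=$(printf '%s\n' "$extracted" | grep -v -c -e "(Tab")

printf '%s %s\n' "$with_wc" "$with_c"
# prints: 2 2
```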


It's probably worth saving this list of extracted references to an intermediate file, and working from there from now on. This is done with a redirection (`>`):

In [8]:
!grep -e "([A-Za-z][^)]*[0-9])" --only-matching article.txt | grep -v -e "(Tab\|(Fig\|http\|insert\|about" > refs.txt
!head refs.txt

(Jarman et al. 2018)
(Piet et al. 2019)
(e.g. Colville & Pritchard 2019)
(Li & Pritchard 2009)
(Wyse et al. 2018)
(Wyse & Dickie 2017)
(Victor et al. 2015)
(Baker et al. 2017)
(Nicholson et al. 2018)
(e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019)


The next tasks are to (1) remove the brackets surrounding each citation and (2) split multiple citations so that each appears on its own line.
The `tr` command (translate) can be used to translate or to delete characters (the `-d` switch):
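
A small sketch of both modes on a made-up line (the `line` variable is an assumption for illustration):

```shell
# Hypothetical bracketed citation group.
line='(Smith 2019; Jones 2020)'

# Delete mode: every character listed in the set is removed.
no_brackets=$(printf '%s\n' "$line" | tr -d '()')

# Translate mode: each ";" is replaced by ",".
commas=$(printf '%s\n' "$line" | tr ';' ',')

printf '%s\n' "$no_brackets" "$commas"
# prints:
# Smith 2019; Jones 2020
# (Smith 2019, Jones 2020)
```

The same deletion applied to `refs.txt`: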

In [9]:
!cat refs.txt | tr -d '()'

Jarman et al. 2018
Piet et al. 2019
e.g. Colville & Pritchard 2019
Li & Pritchard 2009
Wyse et al. 2018
Wyse & Dickie 2017
Victor et al. 2015
Baker et al. 2017
Nicholson et al. 2018
e.g. Mounce et al. 2017; Meyer et al. 2016, Ball-Damerow 2019
Hedrick et al. 2019; Lendemer et al. 2020
Heberling & Isaac 2018
Berendsohn & Seltmann 2010; Owens & Johnson 2019
Fritze et al. 2012
OECD, 2007
OECD, 2001
Heywood 2017
Brummitt 2001
Meyer et al. 2016; Daru et al. 2018; Meineke et al. 2018; Nekola et al. 2019; Zizka et al. 2020
GBIF 2020
Canteiro et al. 2019
Soltis 2017
GBIF Secretariat 2019
Schindel & Cook 2018; Nelson & Ellis 2019; Lendemer et al. 2020
Ryan 2013; Smith & Figueiredo 2014
January 2020
Seberg, Droege et al. 2016
Hong et al. 1998, 1999
Mounce et al. 2017
Wyse & Dickie 2017
FAO 2019
Hay and Probert 2013
Li and Pritchard 2009
FAO, 2019
Overmann, 2015
WIPO, 2019
WFCC WCDM 2020
Blackwell 2011
Willis et al. 2018
Overman & Smith, 2016
Ryan et al., 2019
Mounce et al. 2017
Thiers 2020
Lanjo

Unfortunately `tr` only works on single characters. If we want to translate multiple character strings (like the `; ` which separates multiple citations), we need to use `sed`, a stream editor. The command passed to `sed` can be broken down as follows:
- `s/oldstring/newstring/g` substitute (`s`) `oldstring` with `newstring` globally (`g`), i.e. for every occurrence on a line rather than only the first
The newstring value `\n` means newline.
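
Writing `\n` in the replacement works with GNU `sed`, but it may not be understood by every `sed` implementation (the BSD `sed` shipped with macOS is a common example). A possible fallback, sketched here on made-up input, is to let `tr` turn each `;` into a newline and then use `sed` only to strip the space left at the start of each new line:

```shell
# Hypothetical citation group with the brackets already removed.
refs='Smith 2019; Jones 2020; Brown et al. 2018'

# tr converts every ";" to a newline; sed removes one leading space.
split=$(printf '%s\n' "$refs" | tr ';' '\n' | sed 's/^ //')

printf '%s\n' "$split"
# prints:
# Smith 2019
# Jones 2020
# Brown et al. 2018
```

Note that this splits on every `;`, not only on `; `, which is an assumption that holds for this kind of citation list but might not hold elsewhere.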

In [10]:
!cat refs.txt | tr -d '()' | sed 's/; /\n/g'

Jarman et al. 2018
Piet et al. 2019
e.g. Colville & Pritchard 2019
Li & Pritchard 2009
Wyse et al. 2018
Wyse & Dickie 2017
Victor et al. 2015
Baker et al. 2017
Nicholson et al. 2018
e.g. Mounce et al. 2017
Meyer et al. 2016, Ball-Damerow 2019
Hedrick et al. 2019
Lendemer et al. 2020
Heberling & Isaac 2018
Berendsohn & Seltmann 2010
Owens & Johnson 2019
Fritze et al. 2012
OECD, 2007
OECD, 2001
Heywood 2017
Brummitt 2001
Meyer et al. 2016
Daru et al. 2018
Meineke et al. 2018
Nekola et al. 2019
Zizka et al. 2020
GBIF 2020
Canteiro et al. 2019
Soltis 2017
GBIF Secretariat 2019
Schindel & Cook 2018
Nelson & Ellis 2019
Lendemer et al. 2020
Ryan 2013
Smith & Figueiredo 2014
January 2020
Seberg, Droege et al. 2016
Hong et al. 1998, 1999
Mounce et al. 2017
Wyse & Dickie 2017
FAO 2019
Hay and Probert 2013
Li and Pritchard 2009
FAO, 2019
Overmann, 2015
WIPO, 2019
WFCC WCDM 2020
Blackwell 2011
Willis et al. 2018
Overman & Smith, 2016
Ryan et al., 2019
Mounce et al. 2017
Thiers 2020
Lanjouw & Stafl

## Ordering
The command above outputs matches in the order they were found in the input. If we want e.g. a unique list in alphabetical order, we can pipe into a couple more utilities: `sort` (to sort alphabetically) and `uniq` (to remove duplicates; it only removes adjacent duplicate lines, which is why the input is sorted first):
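
A minimal sketch on made-up input first (the `refs` list is an assumption for illustration). It shows that entries differing only in spacing count as distinct, that squeezing repeated spaces with `tr -s ' '` merges them, that `sort -u` is a shorthand for `sort | uniq`, and that `uniq -c` reports how often each entry occurs:

```shell
# Hypothetical citation list; the third entry has a doubled space.
refs='Smith 2019
Jones 2020
Smith  2019
Smith 2019'

# Plain sort | uniq keeps the doubled-space variant as a separate entry.
plain=$(printf '%s\n' "$refs" | sort | uniq)

# Squeeze runs of spaces first, then sort and de-duplicate in one step.
unique=$(printf '%s\n' "$refs" | tr -s ' ' | sort -u)

# Prefix each distinct entry with its number of occurrences.
counts=$(printf '%s\n' "$refs" | tr -s ' ' | sort | uniq -c)

printf '%s\n' "$unique"
# prints:
# Jones 2020
# Smith 2019
printf '%s\n' "$counts"
```

Applied to the full reference list, the basic `sort | uniq` pipeline is: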

In [11]:
!cat refs.txt | tr -d '()' | sed 's/; /\n/g' | sort | uniq

Baker et al. 2017
Bakker et al.  2019
Bakker et al. 2019
Berendsohn & Seltmann 2010
Besnard et al. 2019
Blackwell 2011
Brummitt 2001
Canteiro et al. 2019
Carine et al. 2018
Daru et al. 2018
Diaz et al. 2019
e.g. Colville & Pritchard 2019
e.g. Mounce et al. 2017
FAO 2019
FAO, 2019
Fritze et al. 2012
Funk 2018
GBIF 2020
GBIF Secretariat 2019
Harmon et al. 2019
Hay and Probert 2013
Heberling & Isaac 2018
Heberling et al. 2019
Hedrick et al. 2019
Heywood 2017
Hong et al. 1998, 1999
January 2020
Jarman et al. 2018
Lanjouw & Stafleu 1964
Lendemer et al. 2020
Li & Pritchard 2009
Li and Pritchard 2009
Meineke et al. 2018
Meyer et al. 2016
Meyer et al. 2016, Ball-Damerow 2019
Mounce et al. 2017
Muggaran et al. 2010
Nekola et al. 2019
Nelson & Ellis 2019
Nicholson et al. 2018
Nualart et al. 2017
OECD, 2001
OECD, 2007
Overman & Smith, 2016
Overmann, 2015
Owens & Johnson 2019
Piet et al. 2019
Ryan 2013
Ryan et al., 2019
Schindel & Cook 2018
Seberg, Droege et al. 2016
Smith & Figueiredo 2014
Soltis