In [None]:
# here we'll need to wget in the epsilon data from the Core repo

Before getting a particular year of data, let's have a quick look at which years are actually present. We'll use `cut` to return only column 6 from the CSV, `grep` to extract the four digits of the year, `sort` and `uniq` to remove duplicates, and then `paste` to lay the years out five to a row so they are easier to read.

In [132]:
!cut -d, -f6 linnean-society.csv | grep -Eo "[0-9][0-9][0-9][0-9]" | sort | uniq | paste - - - - -

1759	1781	1782	1783	1784
1785	1786	1787	1788	1789
1790	1791	1792	1793	1794
1795	1796	1797	1798	1799
1800	1801	1802	1803	1804
1805	1806	1807	1808	1809
1810	1811	1812	1813	1814
1815	1816	1817	1818	1819
1820	1821	1822	1823	1824
1825	1826	1827	1828	1829
1830	1831	1855	1857	1872
1873	1877			
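The same listing can be sketched in Python, which is handy if you want the years as a list for later cells. This is a minimal sketch rather than part of the original workflow: the function name `years_present` is made up for illustration, and the two sample rows are copied from the `grep` output shown further down so that the cell runs without the data file.

```python
import csv
import io
import re

# Two rows copied from the grep output shown later in this notebook.
SAMPLE = '''LINNEAN454,"Broussonet","Pierre Marie Auguste","Smith","Sir James Edward",1783-01-20,"20 Jan 1783","Montpellier, France","Edinburgh","GB-110/JES/COR/1/99, The Linnean Society of London","fre","","LINNEAN454.xml"
LINNEAN1703,"Hamilton","Francis","Smith","Sir James Edward",1783-11-16,"16 Nov 1783","Leny, Callander, Stirling","London","GB-110/JES/COR/2/118, The Linnean Society of London","eng","","LINNEAN1703.xml"
'''

def years_present(lines):
    """Return the sorted, de-duplicated four-digit years found in column 6."""
    years = set()
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # skip blank or short rows
        # Column 6 is index 5; collect every run of four digits, like grep -Eo.
        years.update(re.findall(r"[0-9]{4}", row[5]))
    return sorted(years)

print(years_present(io.StringIO(SAMPLE)))
# With the real file (assumed to be in the working directory):
# years_present(open("linnean-society.csv", newline=""))
```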


Enter the year you want from the list above inside the double quotes:

In [106]:
year = "1783"

First we'll use `grep` to get all the lines which contain this year; `grep` returns every line in a file that matches the pattern. Here the pattern requires a comma before the year and a dash after it, to try to exclude false positives:

In [107]:
!grep ",$year-" linnean-society.csv

LINNEAN454,"Broussonet","Pierre Marie Auguste","Smith","Sir James Edward",1783-01-20,"20 Jan 1783","Montpellier, France","Edinburgh","GB-110/JES/COR/1/99, The Linnean Society of London","fre","","LINNEAN454.xml"
LINNEAN1511,"Woodward","Thomas Jenkinson","Smith","Sir James Edward",1783-03-16,"16 Mar 1783","Bungay, Suffolk","Edinburgh","GB-110/JES/COR/18/7, The Linnean Society of London","eng","","LINNEAN1511.xml"
LINNEAN1603,"Smith","Sir James Edward","Smith","James",1783-03-06,"6 Mar 1783","Edinburgh","","GB-110/JES/COR/19/29, The Linnean Society of London","eng","","LINNEAN1603.xml"
LINNEAN1688,"Erskine","David Steuart","Smith","Sir James Edward",1783-01-01,"[1783]","Edinburgh","Edinburgh","GB-110/JES/COR/2/104, The Linnean Society of London","eng","","LINNEAN1688.xml"
LINNEAN1703,"Hamilton","Francis","Smith","Sir James Edward",1783-11-16,"16 Nov 1783","Leny, Callander, Stirling","London","GB-110/JES/COR/2/118, The Linnean Society of London","eng","","LINNEAN1703.xml"
LINNEAN1704
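The `grep` pattern matches `,1783-` anywhere on the line, so a stray match in another field is still possible. As a stricter alternative, the sketch below parses each row and tests only column 6. The function name `rows_for_year` is hypothetical, the first sample row is copied from the output above, and the second row is invented purely to show a false positive.

```python
import csv
import io

# Row 1 is copied from the output above. Row 2 is an INVENTED row (made-up ID,
# names and reference) whose date is 1784 but which contains ",1783-" elsewhere.
SAMPLE = '''LINNEAN454,"Broussonet","Pierre Marie Auguste","Smith","Sir James Edward",1783-01-20,"20 Jan 1783","Montpellier, France","Edinburgh","GB-110/JES/COR/1/99, The Linnean Society of London","fre","","LINNEAN454.xml"
EXAMPLE1,"Sender","A","Recipient","B",1784-05-01,"1 May 1784","","","ref,1783-x","eng","","EXAMPLE1.xml"
'''

def rows_for_year(lines, year):
    """Return parsed rows whose column 6 (index 5) starts with '<year>-'."""
    return [
        row for row in csv.reader(lines)
        if len(row) >= 6 and row[5].startswith(year + "-")
    ]

matches = rows_for_year(io.StringIO(SAMPLE), "1783")
print([row[0] for row in matches])
```

A plain `grep ",1783-"` would return both sample rows, whereas the column test keeps only the first.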

Now we want to cut out columns 2 and 4, which contain the surnames of the sender and recipient. By default the column separator for the `cut` command is a tab, so we set the delimiter to a comma with `-d,`:

In [108]:
!grep ",$year-" linnean-society.csv | cut -d, -f2,4

"Broussonet","Smith"
"Woodward","Smith"
"Smith","Smith"
"Erskine","Smith"
"Hamilton","Smith"
"Hamilton","Smith"
"Black","Smith"
"McGarroch","Smith"
"McGarroch","Smith"
"Pitchford","Smith"
"Repton","Smith"
"Smith","Smith"

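One caveat is that `cut -d,` splits on every comma, including those inside quoted values such as "Montpellier, France". In the rows shown here the quoted commas fall after column 4, so the result is unaffected, but a comma inside an earlier field would shift the columns. The sketch below uses Python's `csv` module, which respects the quotes. The function name `sender_recipient` is made up for illustration, and the row is copied from the `grep` output above.

```python
import csv
import io

# One row copied from the grep output above; two of its quoted fields contain commas.
ROW = 'LINNEAN454,"Broussonet","Pierre Marie Auguste","Smith","Sir James Edward",1783-01-20,"20 Jan 1783","Montpellier, France","Edinburgh","GB-110/JES/COR/1/99, The Linnean Society of London","fre","","LINNEAN454.xml"\n'

def sender_recipient(lines):
    """Yield (sender surname, recipient surname) taken from columns 2 and 4."""
    for row in csv.reader(lines):
        if len(row) >= 4:
            yield row[1], row[3]

pairs = list(sender_recipient(io.StringIO(ROW)))
print(pairs)

# Splitting on every comma (what cut does) yields more pieces than real fields.
naive_pieces = len(ROW.rstrip("\n").split(","))
parsed_fields = len(next(csv.reader([ROW])))
print(naive_pieces, parsed_fields)
```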

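To get the counts for edges we count the number of occurrences of each unique line. We `sort` first because `uniq` only collapses duplicate lines that are adjacent; the `-c` flag then prefixes each line with its count: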

In [109]:
!grep ",$year-" linnean-society.csv | cut -d, -f2,4 | sort | uniq -c

   1 "Black","Smith"
   1 "Broussonet","Smith"
   1 "Erskine","Smith"
   2 "Hamilton","Smith"
   2 "McGarroch","Smith"
   1 "Pitchford","Smith"
   1 "Repton","Smith"
   2 "Smith","Smith"
   1 "Woodward","Smith"
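The `sort | uniq -c` idiom corresponds to a frequency count, which in Python could be sketched with `collections.Counter`. The list below simply retypes the twelve sender/recipient pairs printed by the `cut` step above; the variable names are illustrative.

```python
from collections import Counter

# The twelve sender/recipient pairs printed by the cut step above.
pairs = [
    ("Broussonet", "Smith"), ("Woodward", "Smith"), ("Smith", "Smith"),
    ("Erskine", "Smith"), ("Hamilton", "Smith"), ("Hamilton", "Smith"),
    ("Black", "Smith"), ("McGarroch", "Smith"), ("McGarroch", "Smith"),
    ("Pitchford", "Smith"), ("Repton", "Smith"), ("Smith", "Smith"),
]

# Counter tallies identical pairs without needing them to be adjacent.
edge_counts = Counter(pairs)

# Print in sorted order, already in source,target,count form.
for (source, target), count in sorted(edge_counts.items()):
    print(f"{source},{target},{count}")
```

Unlike `uniq`, `Counter` does not need sorted input; the `sorted` call is only there to make the printed order predictable.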


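Finally we need to make this into CSV. For Gephi, we should also move the number to the end (so Gephi doesn't think it's an ID). We'll remove the `"` marks using `tr`, and then use a `perl` substitution to capture the count and the names and print them in the opposite order with a comma between them.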

In [125]:
!grep ",$year-" linnean-society.csv | cut -d, -f2,4 | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/'

Black,Smith,1
Broussonet,Smith,1
Erskine,Smith,1
Hamilton,Smith,2
McGarroch,Smith,2
Pitchford,Smith,1
Repton,Smith,1
Smith,Smith,2
Woodward,Smith,1


If this all looks good we can write it out to a file, adding as we do so the headings that Gephi (but not Flourish) requires for an edges table. If you get very few (or no) matches, you can try a different year by resetting the `year` variable in the cell towards the top of this notebook and re-running the cells below it.

In [126]:
!echo "source,target,count" > "$year-edges.csv"
!grep ",$year-" linnean-society.csv | cut -d, -f2,4 | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/' >> "$year-edges.csv"
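For comparison, here is a hedged sketch of writing the same kind of edges file from Python with `csv.writer`. The helper `write_edges` is hypothetical, the two counts are taken from the 1783 output above, and the file goes to a temporary directory so that nothing in the working directory is overwritten.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

def write_edges(edge_counts, path):
    """Write source,target,count rows under the header used in this notebook."""
    with open(path, "w", newline="") as handle:
        # Use "\n" line endings to match what the shell pipeline produces.
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["source", "target", "count"])
        for (source, target), count in sorted(edge_counts.items()):
            writer.writerow([source, target, count])

# Two of the 1783 edge counts shown above.
edge_counts = Counter({("Hamilton", "Smith"): 2, ("Black", "Smith"): 1})

# A temporary directory is an assumption made to keep the example side-effect free.
out_path = Path(tempfile.mkdtemp()) / "1783-edges.csv"
write_edges(edge_counts, out_path)
print(out_path.read_text())
```

A side benefit of `csv.writer` is that it quotes any value that itself contains a comma, which the `tr` and `perl` steps do not attempt.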

To get nodes with counts we follow a similar procedure, but combine the two columns into one. To stack one column under the other, we run the two `grep | cut` pipelines consecutively (separated by a semicolon) and wrap them in round brackets. The brackets group the commands so that their combined output is passed as a single stream to the `sort` and `uniq` that follow.

In [114]:
!(grep ",$year-" linnean-society.csv | cut -d, -f2; grep ",$year-" linnean-society.csv | cut -d, -f4) | sort | uniq -c

   1 "Black"
   1 "Broussonet"
   1 "Erskine"
   2 "Hamilton"
   2 "McGarroch"
   1 "Pitchford"
   1 "Repton"
  14 "Smith"
   1 "Woodward"
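The node totals can also be derived from the edge counts: each letter adds one to its sender and one to its recipient. The sketch below retypes the 1783 edge counts from earlier in the notebook; the variable names are illustrative.

```python
from collections import Counter

# Edge counts for 1783, as printed earlier in this notebook.
edge_counts = {
    ("Black", "Smith"): 1, ("Broussonet", "Smith"): 1, ("Erskine", "Smith"): 1,
    ("Hamilton", "Smith"): 2, ("McGarroch", "Smith"): 2, ("Pitchford", "Smith"): 1,
    ("Repton", "Smith"): 1, ("Smith", "Smith"): 2, ("Woodward", "Smith"): 1,
}

node_counts = Counter()
for (sender, recipient), count in edge_counts.items():
    node_counts[sender] += count     # appearances in column 2
    node_counts[recipient] += count  # appearances in column 4

for name, count in sorted(node_counts.items()):
    print(f"{name},{count}")
```

Smith's total of 14 is the 12 letters received plus the 2 sent, which agrees with the shell output above.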


And again we'll move the numbers to the end and insert a comma to create a valid CSV file.

In [127]:
!(grep ",$year-" linnean-society.csv | cut -d, -f2; grep ",$year-" linnean-society.csv | cut -d, -f4) | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/'

Black,1
Broussonet,1
Erskine,1
Hamilton,2
McGarroch,2
Pitchford,1
Repton,1
Smith,14
Woodward,1


If all looks good we can again write this to a file:

In [122]:
!(grep ",$year-" linnean-society.csv | cut -d, -f2; grep ",$year-" linnean-society.csv | cut -d, -f4) | sort | uniq -c | tr -d '"' | perl -pe 's/ +([0-9]+) +(.+$)/$2,$1/' > "$year-nodes.csv"