
Module 3

Some other discovery / research

I thought that if I could go through the text, find proper nouns, and look them up in Wikipedia, a hit would suggest they were (likely) significant. Then I could mark them up as XML.
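The markup half of that idea can be sketched without touching Wikipedia at all. This is only a rough sketch under assumptions: the function name `mark_up`, the `is_significant` callback, the `<name>` tag and the `known` set are all hypothetical, and the "proper noun" test is just a run of capitalised words. A real lookup would replace the fixed set.

```python
import re
from xml.sax.saxutils import escape

# Crude proper-noun guess (an assumption): one or more capitalised words in a row.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")

def mark_up(text, is_significant):
    """Wrap candidate names in <name> tags when the lookup approves them."""
    parts = []
    last = 0
    for m in CANDIDATE.finditer(text):
        # Escape the text between candidates so the result stays valid XML.
        parts.append(escape(text[last:m.start()]))
        candidate = m.group(0)
        if is_significant(candidate):
            parts.append("<name>" + escape(candidate) + "</name>")
        else:
            parts.append(escape(candidate))
        last = m.end()
    parts.append(escape(text[last:]))
    return "".join(parts)

# Hypothetical stand-in for the Wikipedia lookup: a fixed set of known titles.
known = {"Sam Houston", "New Orleans"}
print(mark_up("In 1842 Sam Houston wrote to New Orleans.", known.__contains__))
```

Passing the lookup in as a function keeps the markup step testable offline; the network call can be swapped in later.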

Below are some URLs from the reading I did...

Clean Wikipedia API

  import nltk
  from nltk import pos_tag, ne_chunk
  from nltk.tokenize import SpaceTokenizer
  # `sentence` is the string of text to scan for named entities
  tokenizer = SpaceTokenizer()
  toks = tokenizer.tokenize(sentence)
  pos = pos_tag(toks)
  chunked_nes = ne_chunk(pos)
  # keep only the subtrees, which ne_chunk uses to mark named entities
  nes = [' '.join(map(lambda x: x[0], ne.leaves())) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]

However, a much better way to do this exists: Stanford Named Entity Recognizer (NER)

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

I want to try to hook this up using Python rather than Java.

Linux on Windows 10

Thanks to this post, I can run Linux on my Windows 10 PC. grep and sed seem to work. Exciting.

OCR On Linux on Windows

Installed tesseract

sudo apt-get install tesseract-ocr

Objective: Install and Run OCRFeeder

I wanted to have more control over what areas of a document were processed by OCR. Command line tesseract did not provide this. OCRFeeder has a GUI that uses Tesseract, so it looked like a good bet.

More about OCRFeeder:

It looks like this:

and can produce this:

I installed OCRFeeder

I have to check, I think I did sudo apt-get install ocrfeeder (Fail: I should have kept notes when I started this idea.)

I needed to install an image library for Python to make this work. The image library was Pillow. To install it, I needed python-pip, so I installed that first:

   $ sudo apt-get install python-pip

Then I installed Pillow:

   $ pip install Pillow

I then had to make changes to these files to make OCRFeeder work with the image library.

I made this change:

   # import Image
   from PIL import Image


   $ sudo nano /usr/lib/python2.7/dist-packages/ocrfeeder/util/
   $ sudo nano /usr/lib/python2.7/dist-packages/ocrfeeder/util/
   $ sudo nano /usr/lib/python2.7/dist-packages/ocrfeeder/studio/
   $ sudo nano /usr/lib/python2.7/dist-packages/ocrfeeder/studio/
   $ sudo nano /usr/lib/python2.7/dist-packages/ocrfeeder/feeder/

I made these changes:

   # import Image
   from PIL import Image
   # import ImageDraw
   from PIL import ImageDraw


   $ sudo nano /usr/lib/python2.7/dist-packages/ocrfeeder/feeder/

Installed an X Server

On Windows, OCRFeeder would not launch its screen. I needed to install an X Server. (I'm not sure what exactly that is, but it worked.)

Then I got another copy of the Equity

sudo wget --no-parent


Reviewed this article, particularly page 53

Neelin, Lyndal Laurel, Carleton University. Theses and Dissertations. Canadian Studies, and ProQuest (Firm). 2012; 2013. The Importance of Being Shawville: The Role of Particularity in Community Resilience. ProQuest Dissertations Publishing.

Exercise 1

Registered for a dhbox on

** Thank you! It is very kind of you to be open to students like me **


       $ curl > texas.txt

I reviewed the contents of texas.txt

  1. Shout out to ascii art!:

       <!-- __ _ _ _ __| |_ (_)__ _____
           / _` | '_/ _| ' \| |\ V / -_)
           \__,_|_| \__|_||_|_| \_/\___| -->
  2. More to the point I see content like:

       David G. Burnet to Richard G. Dunlap, August 19, 1839 54

Edited texas.txt. Used CTRL+SHIFT+^ to mark text to delete, paged down and deleted it with CTRL+K. Saved file as texas2.txt

Performed: $ grep '\bto\b' texas.txt

Results: (last 2 lines)

  Wm. Henry Daingerfield to R. Sieveking, October 1, 1845 1582
  Wm. Henry Daingerfield to Ebenezer Allen, February 2, 1846 1582

Made a back up of my own raw file:

$ cp texas2.txt texasbak.txt

I want to operate on texas3.txt

$ cp texas2.txt texas3.txt

I tried this:

$ sed -r -i.bak 's/(.+\bto\b.+)/~(.+\bto\b.+)/g' texas3.txt

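This failed: sed treats the replacement side as text rather than as a pattern, so each matching line was overwritten instead of being prefixed with a tilde. I restored the file and tried again with a backreference: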



$ cp texas2.txt texas3.txt

$ sed -r -i.bak 's/(.+\bto\b.+)/~\1/g' texas3.txt

Got the expected result:

  ~Anson Jones to Charles H. Raymond, October 24, 1844 316

  ~Charles H. Raymond to Anson Jones, November 27, 1844 317
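The same backreference idea carries over to Python's `re` module, which I find easier to test than editing a file in place. A minimal sketch, assuming the lines are already in a list; `mark_letter` is a hypothetical helper name, and the second sample line is made up to stand in for a heading that should not be marked.

```python
import re

# Same pattern as the sed command: a whole line containing the word "to".
PATTERN = re.compile(r"(.+\bto\b.+)")

def mark_letter(line):
    """Prefix a line containing the word 'to' with a tilde."""
    # \1 is the backreference to the captured line, as in the working sed command.
    return PATTERN.sub(r"~\1", line)

lines = [
    "Anson Jones to Charles H. Raymond, October 24, 1844 316",
    "CORRESPONDENCE WITH THE UNITED STATES",  # hypothetical non-letter line
]
marked = [mark_letter(line) for line in lines]

# Keep only the marked lines, which is what grep '~' does in the next step.
index = [line for line in marked if "~" in line]
print(index)
```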

and I have the backup sed made too:

texas2.txt texas3.txt texas3.txt.bak texasbak.txt texas.txt

Made a copy to work on the next step, texas4.txt

$ cp texas3.txt texas4.txt

$ grep '~' texas4.txt > index.txt

$ nano index.txt

I got the expected result: only the lines with a tilde were redirected to index.txt

  ~Samuel A. Roberta to Joseph Eve, August 11, 1841 94
  ~Samuel A. Roberta to Joseph Eve, August 17, 1841 94
  ~Samuel A. Roberts to Barnard E. Bee, September 7, 1841 96
  ~Sam Houston to Joseph Eve, July 30, 1842 100
  ~Sam Houston to A. B. Roman, September 12, 1842 101

Make another copy

$ cp index.txt index2.txt

$ grep '[0-9]{4}' index2.txt


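This one did not return results for me:

$ grep '[0-9]{4}' index.txt

This one did:

$ grep '[0-2][0-9][0-9][0-9]' index.txt

Without -E, grep reads { and } as literal characters, so the first pattern needs either grep -E or \{4\}. sed worked with the same pattern because -r turns on extended regular expressions.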

$ sed -r -i.bak 's/(,)( [0-9]{4})(.+)/\2/g' index3.txt

Result: ~Sam Houston to A. B. Roman, September 12 1842

Step 4


$ sed -r -i.bak 's/(~)(.+)/\2/g' index4.txt

Result: Sam Houston to A. B. Roman, September 12 1842 (~ is gone)

Step 5

$ sed -r -i.bak 's/(.+)( to\b)(.+)/\1,\3/g' index5.txt

Result: Sam Houston, A. B. Roman, September 12 1842
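For reference, the three sed substitutions above can be chained in one small Python function. This is a sketch rather than what I actually ran: `clean_entry` is a hypothetical name, and the patterns are copied from the sed commands as written.

```python
import re

def clean_entry(line):
    """Apply the three substitutions from the sed steps to one index line."""
    # Drop the comma before the four-digit year and everything after the year.
    line = re.sub(r"(,)( [0-9]{4})(.+)", r"\2", line)
    # Step 4: remove the tilde marker.
    line = re.sub(r"(~)(.+)", r"\2", line)
    # Step 5: replace " to" between sender and recipient with a comma.
    line = re.sub(r"(.+)( to\b)(.+)", r"\1,\3", line)
    return line

print(clean_entry("~Sam Houston to A. B. Roman, September 12, 1842 101"))
```

Because each step works on a string and returns a string, the intermediate copies (index3.txt, index4.txt, index5.txt) are only needed if I want to inspect each stage.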

I had some problems with this grep .+,.+,.+

      78  nano index5.txt
      79  grep '.+,.+,.+' index5.txt
      80  grep '.+,' index5.txt
      81  grep ',' index5.txt
      82  grep '/.+,.+,.+\w+/g' index5.txt
      83  grep '.+,.+,.+\w' index5.txt
      84  grep '.+,.+,.+' index5.txt
      85  grep '(.+,.+,.+)' index5.txt
      86  grep '(.+)(,)' index5.txt
      87  grep '(,)' index5.txt
      88  grep ',' index5.txt
      89  grep '\(,\)' index5.txt
      90  grep '\.\+\(,\)' index5.txt
      91  grep '.+\(,\)' index5.txt
      92  grep '.\+\(,\)' index5.txt
      93  grep '\.\+\(,\)' index5.txt
      94  grep '.\+\,' index5.txt
      95  grep '.\+,\+,' index5.txt
      96  grep '.\+\,\+,' index5.txt
      97  grep '.\+\,\+\,' index5.txt
      98  grep '.\+\(,\)\+\,' index5.txt
      99  grep '.\+\(,\)' index5.txt
     100  grep '.\+\(\,\)' index5.txt
     101  grep '.\+\(\,\).\+' index5.txt
     102  grep '.\+\(\,\).\+\(\,\)' index5.txt
     103  grep '.\+\(\,\).\+\(\,\).\+\' index5.txt
     104  grep '.\+\(\,\).\+\(\,\).\+' index5.txt
     105  nano index5.txt
     106  grep '.\+\(\,\).\+\(\,\).\+' index5.txt
     107  nano index5.txt
     108  grep '.\+\(\,\).\+\(\,\).\+' index5.txt
     109  grep '.\+\(\,\).\+\(\,\)\.\+' index5.txt
     110  grep '.\+\(\,\)\.\+\(\,\)\.\+' index5.txt
     111  grep '\.\+\(\,\)\.\+\(\,\)\.\+' index5.txt
     112  grep '.\+\(\,\).\+\(\,\).\+' index5.txt
     113  grep '.+\(\,\).\+\(\,\).\+' index5.txt
     114  grep '.\+\(\,\).\+\(\,\).\+' index5.txt
     115  grep '.\+\,.\+\,.\+' index5.txt
     116  grep '.\+,.\+,.\+' index5.txt
     117  grep '.+,.+,.+' index5.txt
     118  grep '.\+,.\+,.\+' index5.txt

I went back and read the annotations and saw:

grep -E ".+,.+,.+," index.txt

That works, because -E turns on extended regular expressions, where + needs no backslash. But really, (Fail) I should have looked at the command rather than brute forcing it.
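The same check is short in Python, whose `re` syntax is close to what grep -E accepts, so + needs no escaping. A minimal sketch; `needs_review` is a hypothetical name, and the pattern is the one from the grep -E command, which matches lines holding three or more commas.

```python
import re

# Same pattern as the grep -E command: at least three commas on the line.
TOO_MANY_COMMAS = re.compile(r".+,.+,.+,")

def needs_review(line):
    """Return True when a line has more commas than sender, recipient, date allow."""
    return TOO_MANY_COMMAS.search(line) is not None

for entry in [
    "Sam Houston, A. B. Roman, September 12 1842",
    "Williams, Thurston, and Meggerson, Duff Green, January 1 1845",
]:
    print(needs_review(entry), entry)
```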

Made another backup

$ cp index5.txt index6.txt

I started editing in nano, but found this tedious. Also, I found I wanted to check older files given some years had been trimmed.

Here is the start of my corrections log. Two asterisks indicate that I needed to check these years in the original file.

   (corrected)Abner S. Lipscomb, James Hamilton and A. T. Bumley, AugUHt 15, **1840**
   (corrected)Sam, Houston, Joseph Eve, July 30 1842
   (corrected)Sam, Houston, A. B. Roman, September 12 1842
   (corrected)Sam, Houston, A. B. Roman, October 29 1842
   (corrected)Address of Tilghman A. Howard, Anson Jones, between August 2 and 6, **1844**
   (corrected)Eddy and Moss, the collector of the customs at New Orleans, April 23, **1844** 
   (corrected)A. J. Donelson, Secretary of State [Allen,. arf interim], December 10 1844
   (corrected)Williams, Thurston, and Meggerson, Duff Green, January 1 1845
   Barnard E. Bee, James Treat, April 28, 1»40 665 
   E. W. Moore, Muabeau B. Lamar, August 28, 184Q 695 

I wanted to use an editor on Windows. When I downloaded the file, it had Unix line endings, so the Windows editor showed no line breaks and it was difficult to read.

Installed a utility tofrodos

$ sudo apt-get install tofrodos

$ todos index6dos.txt

Results were more readable

   Sam Houston, J. Pinckney Henderson, December 31 1836
   James Webb, Alc6e La Branche, May 27 1839
   David G. Burnet, Richard G. Dunlap, June 3 1839
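The conversion itself is small enough to sketch in Python. This shows the general idea of turning Unix line endings (LF) into Windows ones (CRLF); it is not a description of how todos is implemented, and `to_dos` is a hypothetical name.

```python
import re

def to_dos(data):
    """Convert bare LF line endings to CRLF, leaving existing CRLF alone."""
    # The lookbehind skips any LF that already follows a CR.
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)

sample = b"Sam Houston, J. Pinckney Henderson, December 31 1836\n"
print(to_dos(sample))
```

Working on bytes rather than text avoids Python translating the line endings on its own when the file is read or written.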

Made corrections and saved as cleaned-correspondence.csv

Exercise 2

Downloaded and installed OpenRefine. It failed to run, so I downloaded and installed 64-bit Java. After that, OpenRefine ran.

NER Extension for OpenRefine

I see an NER Extension for OpenRefine. I will look at that later.

Launched OpenRefine. Opened cleaned-correspondence.txt from exercise 1. Named and created a new project.

Text facets worked, clustering worked. Excellent tool. (I could have used this so many times)

Cleaned up and merged data.

Changed column titles to source and target
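Outside OpenRefine, the same source and target table could be produced with Python's csv module. A rough sketch, assuming each cleaned line has the shape sender, recipient, date with no commas inside the names; `to_edge_list` is a hypothetical name.

```python
import csv
import io

def to_edge_list(csv_text):
    """Turn 'sender, recipient, date' lines into a source,target edge list."""
    # skipinitialspace drops the blank that follows each comma in the input.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source", "target"])
    for row in reader:
        if len(row) >= 2:  # skip blank or malformed lines
            writer.writerow([row[0].strip(), row[1].strip()])
    return out.getvalue()

cleaned = (
    "Sam Houston, A. B. Roman, September 12 1842\n"
    "David G. Burnet, Richard G. Dunlap, June 3 1839\n"
)
print(to_edge_list(cleaned))
```

This does none of the clustering or merging of name variants that OpenRefine handled, so it only helps once the names are already consistent.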

Put data into Gephi

Downloaded and installed Gephi

Imported the CSV and Gephi made a dot pattern