# Extracting text from PDFs

PDF files are effectively restricted versions of PostScript files, which you might have heard of. PDF files are actually programs in a very simple programming language and hence can display just about anything. Much of what you see inside a PDF file is text, however, and we can grab that text without the layout information using [pdf2txt.py](https://euske.github.io/pdfminer/). Install it with:
 
```bash
$ pip install pdfminer
```

Then run `pdf2txt.py` from the command line, which will write the extracted text to standard output. First download a sample PDF, such as [Dr_Maxwell_Glen_Berry.pdf](https://www.eisenhower.archives.gov/education/articles/Dr_Maxwell_Glen_Berry.pdf), which we can easily do from the command line using `curl` (which you might have to install):

In [1]:
! curl https://www.eisenhower.archives.gov/education/articles/Dr_Maxwell_Glen_Berry.pdf > /tmp/t.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  306k  100  306k    0     0  70690      0  0:00:04  0:00:04 --:--:-- 77046


That command downloads the file and, because of the redirection operator `>`, writes the output to `t.pdf` in the `/tmp` directory.
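
If `curl` is not available, the same download can be sketched in Python. This is a minimal sketch assuming Python 3 and only the standard library; the function name `download` is hypothetical and chosen for illustration.

```python
import shutil
import urllib.request

def download(url, dest):
    """Copy the bytes found at `url` into the file `dest`, like `curl url > dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Stream the response to disk instead of loading it all into memory.
        shutil.copyfileobj(response, out)

# Example call (needs network access, so it is left commented out):
# download("https://www.eisenhower.archives.gov/education/articles/Dr_Maxwell_Glen_Berry.pdf", "/tmp/t.pdf")
```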

Once we have the data, we can pass the filename to `pdf2txt.py` to extract the text:

In [3]:
! pdf2txt.py /tmp/t.pdf | head -10

World War II Remembered is a multi-year exhibition currently on display at the Eisenhower Presidential Museum.  The 
article  that  follows  is  a  special  feature  of  this  exhibition,  the  sixth  in  a  series  created  to  honor  and  educate  about  the 
generation  that  won  World  War  II.    Featured  are  the  stories  of  real  people  from  the  “World  War  II  Participants  and 
Contemporaries” collection, held and preserved in the archives of the Eisenhower Presidential Library.   
 

     Dr. Maxwell Glen Berry  
     Mission, Kansas 
     U.S. Army, Pacific Theater 
 


Now, redirect that text to a file using the bash `>` operator or the `-o` option on `pdf2txt.py`.

In [4]:
! pdf2txt.py /tmp/t.pdf > /tmp/t.txt
! pdf2txt.py -o /tmp/t.txt /tmp/t.pdf
! head -10 /tmp/t.txt

World War II Remembered is a multi-year exhibition currently on display at the Eisenhower Presidential Museum.  The 
article  that  follows  is  a  special  feature  of  this  exhibition,  the  sixth  in  a  series  created  to  honor  and  educate  about  the 
generation  that  won  World  War  II.    Featured  are  the  stories  of  real  people  from  the  “World  War  II  Participants  and 
Contemporaries” collection, held and preserved in the archives of the Eisenhower Presidential Library.   
 

     Dr. Maxwell Glen Berry  
     Mission, Kansas 
     U.S. Army, Pacific Theater 
 


Once you have text output, you can perform whatever analysis you'd like without having to worry about the data coming in PDF form. For example, you might want to run some analysis on financial documents but they are all in PDF. First, convert to text and then perform your analysis.
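
As a minimal sketch of that convert-then-analyze workflow, assuming Python 3 and that `pdf2txt.py` is on your `PATH`, the following builds one `pdf2txt.py -o` command per PDF in a directory. The function names and the directory layout are hypothetical and only for illustration.

```python
import subprocess
from pathlib import Path

def conversion_commands(pdf_dir, out_dir):
    """Build one pdf2txt.py command per PDF in pdf_dir (nothing is executed here)."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = Path(out_dir) / (pdf.stem + ".txt")
        # Same form as the command shown earlier: pdf2txt.py -o OUTPUT INPUT
        commands.append(["pdf2txt.py", "-o", str(txt), str(pdf)])
    return commands

def convert_all(pdf_dir, out_dir):
    """Run the conversions one after another; assumes pdf2txt.py is installed."""
    for command in conversion_commands(pdf_dir, out_dir):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from execution makes it easy to print and inspect the commands before running them.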

### Exercise

Read that text file with a Python script and split the document into a list of words. Print out the first 100 words. It should look like:

```
['World', 'War', 'II', 'Remembered', 'is', 'a', 'multi-year', 'exhibition', ... ]
```

In [12]:
with open('/tmp/t.txt') as f:
    print(f.read().split(' ')[:100])

['World', 'War', 'II', 'Remembered', 'is', 'a', 'multi-year', 'exhibition', 'currently', 'on', 'display', 'at', 'the', 'Eisenhower', 'Presidential', 'Museum.', '', 'The', '\narticle', '', 'that', '', 'follows', '', 'is', '', 'a', '', 'special', '', 'feature', '', 'of', '', 'this', '', 'exhibition,', '', 'the', '', 'sixth', '', 'in', '', 'a', '', 'series', '', 'created', '', 'to', '', 'honor', '', 'and', '', 'educate', '', 'about', '', 'the', '\ngeneration', '', 'that', '', 'won', '', 'World', '', 'War', '', 'II.', '', '', '', 'Featured', '', 'are', '', 'the', '', 'stories', '', 'of', '', 'real', '', 'people', '', 'from', '', 'the', '', '\xe2\x80\x9cWorld', '', 'War', '', 'II', '', 'Participants']
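
The empty strings and the `\n` prefixes in that list appear because `split(' ')` splits at every single space. A minimal sketch of one way to avoid them, assuming Python 3 and using a short made-up stand-in for the file contents, is to call `split()` with no argument, which treats any run of whitespace as one separator:

```python
# A short stand-in for the contents of /tmp/t.txt (not the real file).
sample = "Museum.  The \narticle  that  follows  is  a  special  feature"

# Splitting on a single space keeps empty strings and embedded newlines.
naive = sample.split(' ')

# With no argument, split() treats any run of whitespace as one separator.
words = sample.split()

print(naive[:5])   # ['Museum.', '', 'The', '\narticle', '']
print(words[:5])   # ['Museum.', 'The', 'article', 'that', 'follows']
```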


## Text processing from the command line

It's often the case that we can do a huge amount of cleanup on unstructured text before using Python to process it more formally. We can convert everything to lowercase, delete unwanted characters, squeeze repeated characters, reformat, etc. In this section you will do a number of exercises that get you used to processing files from the command line. If you'd like to dig further, you can see [this link](http://www.tldp.org/LDP/abs/html/textproc.html).

The operating system launches each command in a pipeline as a separate process, which means the commands can run on multiple processors. This gives us parallel processing without having to do any extra work. As soon as one stage produces some output, it passes that data to the next stage of the pipeline and continues to work on its remaining input. The next stage consumes the data in parallel. Consequently, processing text from the command line can be extremely efficient, often more so than doing the same work in a single Python process.
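
To make the "separate processes connected by pipes" idea concrete, here is a minimal sketch, assuming Python 3 on a Unix-like system with `tr` and `head` installed, of what the shell sets up for `tr -s ' ' < file | head -c N`. The function name `squeeze_spaces_head` is hypothetical.

```python
import subprocess

def squeeze_spaces_head(path, nbytes=150):
    """Roughly equivalent to: tr -s ' ' < path | head -c nbytes"""
    with open(path, "rb") as src:
        # Each Popen call starts a separate operating-system process.
        tr = subprocess.Popen(["tr", "-s", " "], stdin=src,
                              stdout=subprocess.PIPE)
        # head reads directly from tr's output pipe while tr is still running.
        head = subprocess.Popen(["head", "-c", str(nbytes)], stdin=tr.stdout,
                                stdout=subprocess.PIPE)
        # Close our copy of the pipe so tr is notified if head exits early.
        tr.stdout.close()
        output, _ = head.communicate()
        tr.wait()
    return output.decode("utf-8", errors="replace")
```

In the shell, the `|` operator does all of this wiring for you, which is why pipelines are so convenient.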

### Exercise

Using the `tr` (translate) command from the terminal, strip all of the newlines from the text file you created above (`/tmp/t.txt`).  Look at the manual page with this command:

```bash
$ man tr
```

You can pipe the output of `tr` to `head -c 150` to only print out the first 150 characters of the output.

In [6]:
! tr -d '\n' < /tmp/t.txt | head -c 150

World War II Remembered is a multi-year exhibition currently on display at the Eisenhower Presidential Museum.  The article  that  follows  is  a  spe

### Exercise

Reformat the text using `tr` and `fold`. The `fold` command wraps lines at 80 characters by default; use its `-s` option to make it break lines at spaces between words.

In [9]:
! tr -d '\n' < /tmp/t.txt | fold -s | head -10

World War II Remembered is a multi-year exhibition currently on display at the 
Eisenhower Presidential Museum.  The article  that  follows  is  a  special  
feature  of  this  exhibition,  the  sixth  in  a  series  created  to  honor  
and  educate  about  the generation  that  won  World  War  II.    Featured  
are  the  stories  of  real  people  from  the  “World  War  II  Participants  
and Contemporaries” collection, held and preserved in the archives of the 
Eisenhower Presidential Library.         Dr. Maxwell Glen Berry       Mission, 
Kansas      U.S. Army, Pacific Theater  “Every  adult  American  alive  that  
day  remembers where he heard the news of Pearl Harbor.  I heard it on my way 
home from Sunday rounds at St. Luke’s Hospital  in  Kansas  City.   The  
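
For comparison, a rough Python counterpart can be sketched with the standard `textwrap` module, assuming Python 3. It is not an exact replacement: unlike `fold -s`, `textwrap.wrap` drops the whitespace at each line break, so the lines will not match the output above character for character. The function name `wrap_text` is hypothetical.

```python
import textwrap

def wrap_text(text, width=80):
    """Delete newlines from `text`, then rewrap it at word boundaries."""
    joined = text.replace("\n", "")  # same effect as tr -d '\n'
    # wrap() returns a list of lines, each at most `width` characters long.
    return textwrap.wrap(joined, width=width)

# Example call, assuming /tmp/t.txt exists as created earlier:
# with open('/tmp/t.txt') as f:
#     print("\n".join(wrap_text(f.read())[:10]))
```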


### Exercise

It is sometimes useful to put a line number at the left edge of all lines. For example, you might want to create a unique ID number for each row of a CSV file. Pipe the output of the previous command to `nl` so that you get the line number on the left edge.

In [10]:
! tr -d '\n' < /tmp/t.txt | fold -s | nl | head -10

     1	World War II Remembered is a multi-year exhibition currently on display at the 
     2	Eisenhower Presidential Museum.  The article  that  follows  is  a  special  
     3	feature  of  this  exhibition,  the  sixth  in  a  series  created  to  honor  
     4	and  educate  about  the generation  that  won  World  War  II.    Featured  
     5	are  the  stories  of  real  people  from  the  “World  War  II  Participants  
     6	and Contemporaries” collection, held and preserved in the archives of the 
     7	Eisenhower Presidential Library.         Dr. Maxwell Glen Berry       Mission, 
     8	Kansas      U.S. Army, Pacific Theater  “Every  adult  American  alive  that  
     9	day  remembers where he heard the news of Pearl Harbor.  I heard it on my way 
    10	home from Sunday rounds at St. Luke’s Hospital  in  Kansas  City.   The  
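
The same numbering idea can be sketched in Python with `enumerate`, assuming Python 3. The format below mirrors the output shown above (a right-aligned number in six columns, then a tab); unlike `nl` with its default options, this sketch also numbers empty lines. The function name and the sample rows are made up for illustration.

```python
def number_lines(lines, start=1):
    """Prefix each line with a right-aligned, 6-character line number and a tab."""
    return ["%6d\t%s" % (number, line)
            for number, line in enumerate(lines, start)]

rows = ["name,city", "Berry,Mission"]  # made-up rows, for illustration only
for row in number_lines(rows):
    print(row)
```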


### Exercise

Convert the text to all lowercase using `tr`. Hint: `a-z` and `A-Z` are character ranges, like those used in [regular expressions](http://www.rexegg.com/regex-quickstart.html), that describe lowercase English characters and uppercase English characters.

In [8]:
! tr A-Z a-z < /tmp/t.txt | head -c 150

world war ii remembered is a multi-year exhibition currently on display at the eisenhower presidential museum.  the 
article  that  follows  is  a  sp
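
To see what `tr A-Z a-z` is doing, it can help to sketch the same character-by-character translation in Python, assuming Python 3. Each character in the first range is replaced by the character at the same position in the second range. The names `UPPER_TO_LOWER` and `tr_lower` are hypothetical.

```python
import string

# Map each uppercase ASCII letter to the lowercase letter in the same position,
# which is what `tr A-Z a-z` does with its two character ranges.
UPPER_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def tr_lower(text):
    """Translate A-Z to a-z character by character, leaving everything else alone."""
    return text.translate(UPPER_TO_LOWER)

print(tr_lower("World War II Remembered"))  # world war ii remembered
```

In everyday Python you would simply call `text.lower()`; the translation table is shown only because it matches how `tr` works.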