# Extracting text from PDFs

PDF files are effectively restricted versions of PostScript files, which you might have heard of. A PDF file is essentially a program in a very simple language that draws each page, and hence it can display just about anything. Much of what you see inside a PDF file is text, however, and we can grab that text without the layout information using [pdf2txt.py](https://euske.github.io/pdfminer/). Install it with:
 
```bash
$ pip install pdfminer
```

Then run `pdf2txt.py` from the command line; it writes the extracted text to standard output. First, download a sample PDF, such as [Dr_Maxwell_Glen_Berry.pdf](https://www.eisenhower.archives.gov/education/articles/Dr_Maxwell_Glen_Berry.pdf), which we can easily do from the command line using `curl` (which you might have to install):

In [2]:
! curl https://www.eisenhower.archives.gov/education/articles/Dr_Maxwell_Glen_Berry.pdf > /tmp/t.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  306k  100  306k    0     0   113k      0  0:00:02  0:00:02 --:--:--  114k


That command downloads the file and, because of the redirection operator `>`, the output is written to `t.pdf` in the `/tmp` directory.
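
If `curl` is not available, the same download can be sketched in Python. This is a minimal example that assumes Python 3 and uses only the standard library; the helper name `download` is hypothetical and not part of these notes.

```python
import shutil
from urllib.request import urlopen


def download(url, dest_path):
    """Copy the bytes found at `url` into `dest_path`, like `curl url > dest_path`."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        # Stream the response to disk in chunks instead of loading it all at once.
        shutil.copyfileobj(response, out)
    return dest_path


# Example call (requires network access):
# download("https://www.eisenhower.archives.gov/education/articles/Dr_Maxwell_Glen_Berry.pdf",
#          "/tmp/t.pdf")
```

Opening the destination in binary mode (`"wb"`) matters here because a PDF is not plain text.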

Once we have the data, we can pass the filename to `pdf2txt.py` to extract the text:

In [6]:
! pdf2txt.py /tmp/t.pdf | head -10

World War II Remembered is a multi-year exhibition currently on display at the Eisenhower Presidential Museum.  The 
article  that  follows  is  a  special  feature  of  this  exhibition,  the  sixth  in  a  series  created  to  honor  and  educate  about  the 
generation  that  won  World  War  II.    Featured  are  the  stories  of  real  people  from  the  “World  War  II  Participants  and 
Contemporaries” collection, held and preserved in the archives of the Eisenhower Presidential Library.   
 

     Dr. Maxwell Glen Berry  
     Mission, Kansas 
     U.S. Army, Pacific Theater 
 


Now, save that text to a file using either the bash `>` redirection operator or the `-o` option of `pdf2txt.py`. The first two commands below are equivalent, so you only need one of them.

In [8]:
! pdf2txt.py /tmp/t.pdf > /tmp/t.txt
! pdf2txt.py -o /tmp/t.txt /tmp/t.pdf
! head -10 /tmp/t.txt

World War II Remembered is a multi-year exhibition currently on display at the Eisenhower Presidential Museum.  The 
article  that  follows  is  a  special  feature  of  this  exhibition,  the  sixth  in  a  series  created  to  honor  and  educate  about  the 
generation  that  won  World  War  II.    Featured  are  the  stories  of  real  people  from  the  “World  War  II  Participants  and 
Contemporaries” collection, held and preserved in the archives of the Eisenhower Presidential Library.   
 

     Dr. Maxwell Glen Berry  
     Mission, Kansas 
     U.S. Army, Pacific Theater 
 


Once you have text output, you can perform whatever analysis you'd like without having to worry about the data coming in PDF form. For example, you might want to analyze a set of financial documents that are only available as PDFs. First convert them to text, and then perform your analysis.
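
For a whole folder of PDFs, the conversion step could be scripted along these lines. This is only a sketch: it assumes Python 3, that `pdf2txt.py` is on your `PATH`, and that the `-o` option behaves as shown above. The function names `pdf2txt_command` and `convert_directory` are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_path, txt_path):
    """Build the argument list for `pdf2txt.py -o txt_path pdf_path`."""
    return ["pdf2txt.py", "-o", str(txt_path), str(pdf_path)]


def convert_directory(pdf_dir, txt_dir, run=subprocess.run):
    """Convert every *.pdf in pdf_dir to a same-named .txt file in txt_dir."""
    txt_dir = Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        # check=True raises an exception if pdf2txt.py exits with an error.
        run(pdf2txt_command(pdf_path, txt_path), check=True)
        outputs.append(txt_path)
    return outputs
```

A call such as `convert_directory("/tmp/pdfs", "/tmp/txt")` would then leave one text file per PDF, ready for analysis. The `run` parameter exists only so the function can be tried out without actually invoking `pdf2txt.py`.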

### Exercise

Read that text file and split the document into a list of words. Print out the first 100 words. It should look like:

```
['World', 'War', 'II', 'Remembered', 'is', 'a', 'multi-year', 'exhibition', ... ]
```

Import the `Counter` class, which builds a histogram from a list of elements:

```python
from collections import Counter
```

Then, if `words` is your list of words, create and print a histogram:

```python
print(Counter(words))
```

The output starts like this:

```
Counter({'': 600, 'the': 52, 'of': 37, 'a': 31, 'and': 28, 'to': 26, '\n': 26, 'in': 25, 'his': 19, 'he': 17, 'was': 16, 'Max': 14, 'that': 11, 'were': 9, '.': 9, 'had': 9, 'for': 9, 'with': 9, 'at': 9, '\nand': 8, 'our': 8, 'I': 7, 'War': 6, 'on': 6, 'have': 5, 'as': 5, '\nin': 5, '1945,': 4, 'would': 4, 'two': 4, 'Maxwell': 4, 'World': 4, 'University': 4, 'Josephine': 4, ...
```
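
The `''`, `'\n'`, and `'\nand'` entries in that output are consistent with splitting the text on single spaces, which leaves empty strings between double spaces and keeps newlines attached to words. The toy example below (with made-up sample text, not the real document) contrasts `split(' ')` with the no-argument `split()`, which splits on any run of whitespace:

```python
from collections import Counter

# Made-up sample with a double space and a newline, as in the extracted text.
text = "the  war\nand the  peace"

naive = text.split(" ")  # splits at every single space, nothing else
clean = text.split()     # splits on any run of spaces, tabs, or newlines

print(Counter(naive))    # includes '' and 'war\nand' as "words"
print(Counter(clean))    # only real words remain
```

Whether you want the cleaner histogram depends on your analysis, but for word counts the no-argument form is usually the better fit.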

### Exercise

A useful way to visualize histograms or word frequency is with a *word cloud*.  Augment your previous code to include packages for word clouds and graphics:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
```

Next, take your histogram and get the top 50 words using the method `histo.most_common(...)`. This returns a list of `(word, count)` tuples, but we need a dictionary (`dict`), so convert it using `top = dict(...)`.
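
To see the shape of the data at each step, here is a small illustration on a made-up word list; it asks for the top 2 words rather than the top 50 to keep the result readable:

```python
from collections import Counter

# Toy word list standing in for the words read from the text file.
words = ["war", "the", "war", "of", "the", "war"]

histo = Counter(words)
top_pairs = histo.most_common(2)  # list of (word, count) tuples, most frequent first
top = dict(top_pairs)             # same data as a word -> count dictionary

print(top_pairs)
print(top)
```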

To get the actual word cloud to appear, add this code to the end of your program:

```python
wordcloud = WordCloud(width=1800,
                      height=1400,
                      max_words=500,
                      random_state=1,
                      relative_scaling=0.25)
wordcloud.fit_words(top)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
```

*Warning*: do not name your Python file `wordcloud.py`, because that is the name of the package we are importing!

When you run your program (passing the text file as an argument), you should see something like:

<img src="figures/wordcloud.png" style="width:400px">

If you get stuck, see [cloud.py](https://github.com/parrt/msan692/blob/master/notes/code/cloud.py).