# Extracting text from PDFs

[PDF](https://techterms.com/definition/pdf) stands for "Portable Document Format", and PDF files contain images, text, and page layout information. A PDF file is essentially a program in a very simple page-description language and, hence, can display just about anything. Much of what you see inside a PDF file is text, however, and we can grab that text without the layout information using [poppler](https://poppler.freedesktop.org/). (I used to use `pdfminer`, but it somehow no longer works on OS X.) Install poppler with:
 
```bash
brew install poppler
```

or, if it is already installed, bring it up to date with:

```bash
brew upgrade poppler
```

Then run `pdftotext` from the command line; it extracts the text and saves it in a text file. First download a sample PDF, such as [Tesla Model S](https://www.tesla.com/sites/default/files/tesla-model-s.pdf), which we can easily do from the command line using `curl` (which you might have to install):

In [1]:
! curl https://www.tesla.com/sites/default/files/tesla-model-s.pdf > /tmp/tsla.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9438k  100 9438k    0     0  30.3M      0 --:--:-- --:--:-- --:--:-- 30.3M


That command downloads the file and, because of the redirection operator `>`, the output gets written to `tsla.pdf` in the `/tmp` directory instead of being printed to the terminal.
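
If `curl` is not available, the same download can be done from Python. Here is a minimal sketch using only the standard library; the function name `download` is made up for illustration, and it assumes the server answers a plain request without special headers.

```python
import shutil
import urllib.request

def download(url, dest):
    """Fetch url and write the raw bytes to dest (hypothetical helper)."""
    # urlopen returns a file-like response; copy it to disk in chunks
    # rather than loading the whole PDF into memory at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download("https://www.tesla.com/sites/default/files/tesla-model-s.pdf", "/tmp/tsla.pdf")` would play the same role as the `curl` command, with the `dest` argument standing in for the `>` redirection.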

Once we have the data, we can pass the filename to `pdftotext` to extract the text:

In [2]:
! pdftotext /tmp/tsla.pdf # saves into /tmp/tsla.txt



(If `pdftotext` prints warnings here, don't worry about them.)
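
The same conversion can be driven from a Python script instead of a notebook `!` command. The sketch below assumes `pdftotext` is on your `PATH`; the helper names `text_path_for` and `pdf_to_text` are invented for this example. It passes the output filename explicitly, which matches what `pdftotext` does by default (same name, `.txt` suffix).

```python
import shutil
import subprocess
from pathlib import Path

def text_path_for(pdf_path):
    """Return the pdf path with its suffix swapped for .txt (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".txt")

def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the path of the text file."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler first")
    out_path = text_path_for(pdf_path)
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(["pdftotext", str(pdf_path), str(out_path)], check=True)
    return out_path
```

With the file downloaded earlier, `pdf_to_text("/tmp/tsla.pdf")` would write and return `/tmp/tsla.txt`.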

In [3]:
! head -10 /tmp/tsla.txt

Model S
Premium Electric Sedan

An evolution
in automobile
engineering
Tesla’s advanced electric powertrain
delivers exhilarating performance.
Unlike a gasoline internal combustion
engine with hundreds of moving


Once you have text output, you can perform whatever analysis you'd like without having to worry about the data coming in PDF form. For example, you might want to run some analysis on financial documents but they are all in PDF. First, convert to text and then perform your analysis.
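
As one illustration of the kind of analysis that becomes easy once the text is out of the PDF, here is a small sketch that counts the most frequent words. The function name `top_words` is hypothetical, and treating a word as a run of ASCII letters is a simplifying assumption (it splits "Tesla’s" into two tokens, for instance).

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most common lowercase words in text as (word, count) pairs."""
    # Simplifying assumption: a word is a run of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)
```

To try it on the converted brochure, read the file first: `top_words(open('/tmp/tsla.txt').read())`.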

### Exercise

Use the `curl` or `wget` command from the command line to download that PDF file. Then convert it to a text file somewhere on your disk.

### Exercise

Read that text file with a Python script and split the document into a list of words. Print out the first 100 words. It should look like:

```
['Model', 'S', 'Premium', 'Electric', 'Sedan', 'An', ...]
```

In [4]:
with open('/tmp/tsla.txt') as f:
    print(f.read().split()[:100])

['Model', 'S', 'Premium', 'Electric', 'Sedan', 'An', 'evolution', 'in', 'automobile', 'engineering', 'Tesla’s', 'advanced', 'electric', 'powertrain', 'delivers', 'exhilarating', 'performance.', 'Unlike', 'a', 'gasoline', 'internal', 'combustion', 'engine', 'with', 'hundreds', 'of', 'moving', 'parts,', 'Tesla', 'electric', 'motors', 'have', 'only', 'one', 'moving', 'piece:', 'the', 'rotor.', 'As', 'a', 'result,', 'Model', 'S', 'acceleration', 'is', 'instantaneous,', 'silent', 'and', 'smooth.', 'Step', 'on', 'the', 'accelerator', 'and', 'in', 'as', 'little', 'as', '3.1', 'seconds', 'Model', 'S', 'is', 'travelling', '60', 'miles', 'per', 'hour,', 'without', 'hesitation,', 'and', 'without', 'a', 'drop', 'of', 'gasoline.', 'Model', 'S', 'is', 'an', 'evolution', 'in', 'automobile', 'engineering.', 'All-Wheel', 'Drive', 'Dual', 'Motor', 'Rear', 'Wheel', 'Drive', 'All-Wheel', 'Drive', 'Dual', 'Motor', 'Dual', 'Motor', 'Model', 'S', 'is']
