# Tika

[Apache Tika](https://tika.apache.org/) is the best software in all of history when it comes to converting documents to text. It takes nearly *anything* (PDFs, Word documents, spreadsheets, even scanned images) and turns it into text. For scans and images, just make sure you've installed tesseract first.

## Installation, Part 1

### OS X

You can install Java using Homebrew. The cask used to be called `adoptopenjdk`, but the project has since been renamed and the current cask is `temurin`, so that's what we'll use.

```
# brew install --cask adoptopenjdk
brew install --cask temurin
```

### Windows

You'll need to [download Java](https://java.com/en/download/manual.jsp) and install it (pick the offline installer, fewer things to go wrong).
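Before moving on, it can help to confirm that the tools Tika leans on are actually reachable. This is a minimal sketch, assuming the executables are named `java` and `tesseract` and live on your PATH. The helper name `check_tools` is made up for this example.

```python
import shutil

def check_tools(names=("java", "tesseract")):
    """Map each executable name to its full path, or None if it isn't on PATH."""
    return {name: shutil.which(name) for name in names}

# Print a quick report; anything marked NOT FOUND still needs installing
for name, path in check_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

If `java` shows up as not found right after installing, opening a fresh terminal is usually worth a try before reinstalling anything.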

## Installation, Part 2

Now we'll install the [Python bindings](https://github.com/chrismattmann/tika-python) of the Tika library.

```
pip install tika
```

We don't need to download Tika itself because *the Python library does it for us*, whenever it can't find a copy it has already downloaded.
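If you're curious whether the Tika server jar is already on your machine, here's a rough sketch. It assumes the jar is saved as `tika-server.jar` in your temp folder (which is where the download log in the next section puts it) and that a `TIKA_PATH` environment variable, if set, points at a different folder. Both are assumptions about the library's behavior, and `cached_tika_jar` is a name invented for this example.

```python
import os
import tempfile
from pathlib import Path

def cached_tika_jar():
    """Guess where the downloaded Tika server jar lives.

    Assumption: the folder is TIKA_PATH if that environment variable is set,
    otherwise the system temp folder.
    """
    folder = os.environ.get("TIKA_PATH", tempfile.gettempdir())
    return Path(folder) / "tika-server.jar"

jar = cached_tika_jar()
print(jar, "exists" if jar.exists() else "not downloaded yet")
```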

## Using Tika

Let's see if we can get this to work!



```
import tika
from tika import parser
```

This log message means "hey, I'm downloading Tika":

```
2021-07-14 20:59:23,901 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /var/folders/jc/9v94cz997jg23m87vhvgv4840000gp/T/tika-server.jar.
```

```
parsed = parser.from_file('players/players.pdf')
parsed
```

```
{'metadata': {'Author': 'Jonathan Soma',
  'Content-Type': 'application/pdf',
  'Creation-Date': '2018-01-30T23:10:06Z',
  'Last-Modified': '2018-01-30T23:10:06Z',
  'Last-Save-Date': '2018-01-30T23:10:06Z',
  'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
   'org.apache.tika.parser.pdf.PDFParser'],
  'X-TIKA:content_handler': 'ToTextContentHandler',
  'X-TIKA:embedded_depth': '0',
  'X-TIKA:parse_time_millis': '21',
  'access_permission:assemble_document': 'true',
  'access_permission:can_modify': 'true',
  'access_permission:can_print': 'true',
  'access_permission:can_print_degraded': 'true',
  'access_permission:extract_content': 'true',
  'access_permission:extract_for_accessibility': 'true',
  'access_permission:fill_in_form': 'true',
  'access_permission:modify_annotations': 'true',
  'created': '2018-01-30T23:10:06Z',
  'creator': 'Jonathan Soma',
  'date': '2018-01-30T23:10:06Z',
  'dc:creator': 'Jonathan Soma',
  'dc:format': 'application/pdf; version=1.3',
  'dc:tit
```
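The result is a plain dictionary with a `metadata` part and a `content` part, so you can pick out just the bits you care about. Here's a small sketch of that. The `summarize` helper and the trimmed-down `example` dictionary are made up for illustration (the values are copied from the output above), so it runs without Tika.

```python
def summarize(parsed):
    """Pull a few fields out of a dictionary shaped like Tika's result."""
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""  # guard against a missing or empty content value
    return {
        "author": metadata.get("Author"),
        "type": metadata.get("Content-Type"),
        "created": metadata.get("Creation-Date"),
        "characters": len(content.strip()),
    }

# A hand-trimmed stand-in for the real result, not actual Tika output
example = {
    "metadata": {
        "Author": "Jonathan Soma",
        "Content-Type": "application/pdf",
        "Creation-Date": "2018-01-30T23:10:06Z",
    },
    "content": "\n\n1_Excel\n\nPlayer Pos Status Ht Wt DOB\n",
}
print(summarize(example))
```

Using `.get()` everywhere means a file with no author or no text gives you `None` or `0` instead of an error.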

```
content = parsed['content'].strip()
print(content)
```

```
1_Excel


Player Pos Status Ht Wt DOB 
Rhett Bomar Quarterback Active 6'2' 215 7/2/85
Joe Webb Quarterback Active 6'4' 220 11/14/86
Christian Ponder Quarterback Active 6'2' 229 2/25/88
Adrian Peterson Running Back Active 6'1' 217 3/21/85
Lorenzo Booker Running Back Active 5'10' 191 6/14/84
Ryan D'Imperio Running Back Active 6'3' 240 8/15/87
Jeff Dugan Running Back Active 6'4' 258 4/8/81
Toby Gerhart Running Back Active 6'1' 237 3/28/87
Greg Camarillo Wide Receiver Active 6'1' 190 4/18/82
Juaquin Iglesias Wide Receiver Active 6'0' 204 8/22/87
Freddie Brown Wide Receiver Active 6'4' 215 6/24/86
Jaymar Johnson Wide Receiver Active 6'0' 176 7/10/84
Emmanuel Arceneaux Wide Receiver Active 6'2' 210 9/17/87
Bernard Berrian Wide Receiver Active 6'1' 185 12/27/80
Percy Harvin Wide Receiver Active 5'11' 192 5/28/88
Sidney Rice Wide Receiver Active 6'4' 202 9/1/86
Visanthe Shiancoe Tight End Active 6'4' 250 6/18/80
Jim Kleinsasser Tight End Active 6'3' 272 1/31/77
Cullen Loeffler Center Active 6'
```
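Tika gives you text, not a table, so each row comes back as one space-separated line. One way to get columns back is a regular expression. This is only a sketch: it assumes every row follows the pattern seen above, and the `POSITIONS` list and `parse_rows` helper are invented for this example (the list only covers positions visible in the output so far).

```python
import re

# Positions seen in the sample output; extend this list for the full roster
POSITIONS = ["Quarterback", "Running Back", "Wide Receiver", "Tight End", "Center"]

# name, position, status, height, weight, date of birth
ROW = re.compile(
    r"^(?P<name>.+?) (?P<pos>" + "|".join(map(re.escape, POSITIONS)) + r") "
    r"(?P<status>\w+) (?P<ht>\d+'\d+') (?P<wt>\d+) (?P<dob>\d+/\d+/\d+)$"
)

def parse_rows(content):
    """Return one dictionary per line that looks like a player row."""
    rows = []
    for line in content.splitlines():
        match = ROW.match(line.strip())
        if match:  # headers, blank lines and cut-off rows are skipped
            rows.append(match.groupdict())
    return rows

sample = """Player Pos Status Ht Wt DOB
Rhett Bomar Quarterback Active 6'2' 215 7/2/85
Ryan D'Imperio Running Back Active 6'3' 240 8/15/87"""
rows = parse_rows(sample)
for row in rows:
    print(row)
```

Listing the positions explicitly is what lets the pattern tell where a name like "Ryan D'Imperio" ends and a two-word position like "Running Back" begins.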

```
# Try with a scanned image (this is where tesseract comes in)
parsed = parser.from_file('players/players-scan.jpg')
```

```
content = parsed['content'].strip()
print(content)
```

```
Player

Rhett Bomar
Joe Webb
Christian Ponder
Adrian Peterson
Lorenzo Booker
Ryan D'Imperio
Jeff Dugan

Toby Gerhart
Greg Camarillo
Juaquin Iglesias
Freddie Brown
Jaymar Johnson
Emmanuel Arceneaux
Bernard Berrian
Percy Harvin
Sidney Rice
Visanthe Shiancoe
Jim Kleinsasser
Cullen Loeffler
Jon Cooper
John Sullivan
Anthony Herrera
Steve Hutchinson
Seth Olsen
Chris DeGeare
Thomas Welch
Phil Loadholt
Bryant McKinnie
Patrick Brown
Ryan Cook
Chris Kluwe
Brian Robison
Kevin Williams
Ray Edwards
Jared Allen
Tremaine Johnson
Adrian Awasom
Letroy Guion
Jimmy Kennedy
Everson Griffen
Chad Greenway
E.J. Henderson
Heath Farwell
Kenny Onatolu
Jasper Brinkley
Erin Henderson
Madieu Williams
Chris Cook
Marcus Sherels
Asher Allen
Cedric Griffin

Pos

Quarterback
Quarterback
Quarterback
Running Back
Running Back
Running Back
Running Back
Running Back
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Tight End

Tight End

Center

Center

Center

G
```
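Notice that the scanned version comes back column by column instead of row by row: all the names first, then all the positions. If you want rows again, one rough approach is to group lines under the header they follow and zip the groups back together. This sketch assumes you know the header names ahead of time, and `split_columns` is a made-up helper, not part of Tika.

```python
def split_columns(content, headers=("Player", "Pos")):
    """Group OCR'd lines under the most recent column header seen."""
    columns = {}
    current = None
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # OCR leaves stray blank lines; skip them
        if line in headers:
            current = line
            columns[current] = []
        elif current is not None:
            columns[current].append(line)
    return columns

# A shortened stand-in for the OCR output above
sample = "Player\n\nRhett Bomar\nJoe Webb\n\nPos\n\nQuarterback\nQuarterback"
columns = split_columns(sample)
print(list(zip(columns["Player"], columns["Pos"])))
```

This only lines up if the OCR didn't drop or merge any cells, so it's worth comparing the length of each column before trusting the pairs.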

If you aren't working in English, you'll need to send request headers that name the language, using one of the codes listed by `tesseract --list-langs`.

If you want more languages, `brew install tesseract-lang` on OS X or `scoop install tesseract-languages` on Windows.
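Here's roughly what setting the language looks like. Treat it as a sketch under assumptions: it assumes your Tika server version reads the OCR language from an `X-Tika-OCRLanguage` header, and that the language codes you pass (like `deu` or `fra`) appear in `tesseract --list-langs` on your machine. The helper names `ocr_headers` and `parse_scan` are invented for this example.

```python
def ocr_headers(languages):
    """Build request headers naming one or more tesseract language codes."""
    # tesseract joins multiple languages with a plus sign, e.g. "deu+fra"
    return {"X-Tika-OCRLanguage": "+".join(languages)}

def parse_scan(path, languages=("eng",)):
    """Parse a scanned file, asking Tika to OCR it in the given languages."""
    # Imported here so ocr_headers can be tried out without Tika installed
    from tika import parser
    return parser.from_file(path, headers=ocr_headers(languages))

print(ocr_headers(["deu", "fra"]))
# parsed = parse_scan('players/players-scan.jpg', languages=["deu"])
```

If the output doesn't change after setting the header, check that the language pack is actually installed before blaming Tika.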