# Tika

[Apache Tika](https://tika.apache.org/) is one of the most capable tools around for converting documents to text. It handles a very wide range of formats, including PDFs, HTML, and images. For scanned documents and images, make sure you've installed Tesseract first, since Tika relies on it for OCR.

## Installation, Part 1: Java

### macOS

You can install Java using Homebrew. The `adoptopenjdk` cask has been discontinued, so install its successor, Eclipse Temurin, instead.

```
brew install --cask temurin
```

### Windows

You'll need to [download Java](https://java.com/en/download/manual.jsp) and install it. Pick the offline installer, since there are fewer things to go wrong.

## Installation, Part 2: Python bindings for Tika

Now we'll install the [Python bindings](https://github.com/chrismattmann/tika-python) of the Tika library.

```
pip install tika
```

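We don't need to download Tika itself because *the Python library does it for us*: the first time it runs (and again whenever its cached copy goes missing), it downloads the Tika server, and it starts that server in the background whenever one isn't already running.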

## Using Tika

Let's see if we can get this to work!

```python
import tika
from tika import parser

parsed = parser.from_file('......')
```
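`parser.from_file` hands back a plain dictionary. Here is a minimal sketch for peeking inside it, assuming the usual `status`, `content`, and `metadata` keys that tika-python returns. `describe_parsed` is a hypothetical helper name, not part of the library, and `example` is a hand-written stand-in for a real result.

```python
def describe_parsed(parsed):
    """Summarize the dictionary returned by tika's parser.from_file().

    `describe_parsed` is a hypothetical helper for illustration only.
    """
    # 'content' can be None when Tika found no text (e.g. OCR is unavailable)
    content = parsed.get("content") or ""
    metadata = parsed.get("metadata") or {}
    return {
        "status": parsed.get("status"),
        "characters": len(content.strip()),
        "content_type": metadata.get("Content-Type"),
    }


# Hand-written stand-in for a real result, so the sketch runs without a Tika server
example = {
    "status": 200,
    "content": "\n\n\nHello from Tika\n",
    "metadata": {"Content-Type": "application/pdf"},
}
print(describe_parsed(example))
```

With a real file, you would pass the `parsed` variable from the snippet above instead of `example`.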

### Converting PDFs to text

Let's try `players/players.pdf`
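
Text extracted from PDFs tends to arrive padded with long runs of blank lines. The sketch below is one way to tidy it up, not the only one. `tidy` and `pdf_to_text` are hypothetical helper names, and the Tika import sits inside the function so the cleanup step can be tried without Java installed.

```python
import re


def tidy(content):
    """Collapse runs of blank lines in extracted text."""
    if content is None:  # nothing was extracted
        return ""
    # Two or more newlines (possibly with whitespace between) become one blank line
    return re.sub(r"\n\s*\n+", "\n\n", content).strip()


def pdf_to_text(path):
    """Parse a PDF with Tika and return the tidied text."""
    from tika import parser  # needs Java and `pip install tika`

    parsed = parser.from_file(path)
    return tidy(parsed["content"])
```

Calling `pdf_to_text('players/players.pdf')` should give back the text of the PDF with the extra vertical whitespace removed.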

### Reading scanned documents

Let's try `players/players-scan.jpg`
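
Scanned images only produce text if Tesseract is installed, so it can help to check for it up front. This is a rough sketch that assumes Tika runs on the same machine as your Python code. `tesseract_available` and `read_scan` are hypothetical helper names.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is on the PATH."""
    return shutil.which("tesseract") is not None


def read_scan(path):
    """OCR an image through Tika, failing early if Tesseract is missing."""
    if not tesseract_available():
        raise RuntimeError("Tesseract not found; install it before parsing scans")
    from tika import parser  # needs Java and `pip install tika`

    parsed = parser.from_file(path)
    # 'content' is None when no text was recognized
    return parsed["content"] or ""
```

With Tesseract in place, `read_scan('players/players-scan.jpg')` should return whatever text the OCR step recognized.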

### Reading HTML

Let's try `new-york/nys-bill.html`

### Reading web pages

```python
import requests
response = requests.get('....')

parsed = parser.from_buffer(response.text)
```

Let's try `nytimes.com`
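
One thing to watch for: `requests.get` needs a full URL, and a bare domain like `nytimes.com` raises a `MissingSchema` error. A small sketch of one way around that follows. `with_scheme` and `page_to_text` are hypothetical helper names, and defaulting to `https` is an assumption.

```python
def with_scheme(url, default="https"):
    """Prefix a bare domain such as 'nytimes.com' with a scheme.

    Hypothetical helper; `default` is an assumption, not a requests setting.
    """
    if "://" in url:
        return url
    return f"{default}://{url}"


def page_to_text(url):
    """Download a page and hand its HTML to Tika."""
    import requests
    from tika import parser  # needs Java and `pip install tika`

    response = requests.get(with_scheme(url), timeout=30)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    parsed = parser.from_buffer(response.text)
    return parsed["content"] or ""
```

Usage would look like `page_to_text('nytimes.com')`.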

### Reading other languages

> If you aren't working in English, you'll need to tell Tika which OCR language to use by setting a header. The language codes come from `tesseract --list-langs`.

```python
headers = {
    "X-Tika-OCRLanguage": "chi_sim"
}

# results = parser.from_file('non-english/museums-scanned.jpg', headers=headers)
```

```python
headers = {
    "X-Tika-OCRLanguage": "grc"
}

# results = parser.from_file('non-english/greek.png', headers=headers)
```
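
If you aren't sure which language codes are installed, you can ask Tesseract from Python before building the header. The sketch below assumes `tesseract` is on your PATH. `parse_langs`, `installed_langs`, and `ocr_headers` are hypothetical helper names.

```python
import subprocess


def parse_langs(output):
    """Pull language codes out of the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." banner
    return [line for line in lines if line and not line.lower().startswith("list of")]


def installed_langs():
    """Run `tesseract --list-langs` and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr, so read both streams
    return parse_langs(result.stdout + "\n" + result.stderr)


def ocr_headers(lang, available):
    """Build the Tika header dict, refusing languages that are not installed."""
    if lang not in available:
        raise ValueError(f"Tesseract language {lang!r} is not installed")
    return {"X-Tika-OCRLanguage": lang}
```

For example, `ocr_headers('grc', installed_langs())` either returns the header dictionary used above or tells you the Greek language data is missing.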