# Tika

[Apache Tika](https://tika.apache.org/) is one of the best tools around for converting documents to text. It takes just about *anything* (PDFs, Office documents, HTML, images) and turns it into text. For images and scans it relies on Tesseract for OCR, so make sure you've installed tesseract first.

## Installation, Part 1

### OS X

You can install Java using Homebrew.

```
brew install --cask adoptopenjdk
```

### Windows

You'll need to [download Java](https://java.com/en/download/manual.jsp) and install it. Pick the offline installer: there are fewer things to go wrong.

## Installation, Part 2

Now we'll install the [Python bindings](https://github.com/chrismattmann/tika-python) of the Tika library.

```
pip install tika
```

We don't need to download Tika itself because *the Python library does it for us*: when it runs and can't find a copy of the Tika server, it downloads one and saves it to a temporary folder.
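If you'd rather control where that download ends up, tika-python's README describes a `TIKA_PATH` environment variable that sets the folder where the server jar is cached. Here is a minimal sketch, assuming the variable behaves as documented. The helper name `configure_tika_cache` and the example folder are made up for illustration, and the variable has to be set before `import tika` runs.

```python
import os
from pathlib import Path


def configure_tika_cache(cache_dir):
    """Point tika-python at a folder for its cached server jar.

    Hypothetical helper: it only sets the TIKA_PATH environment variable,
    which tika-python is documented to read when it is first imported.
    """
    cache_dir = Path(cache_dir).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    os.environ["TIKA_PATH"] = str(cache_dir)
    return cache_dir


# Call this before `import tika`; "~/tika-cache" is just an example location.
# configure_tika_cache("~/tika-cache")
```

Keeping the jar somewhere permanent may save a repeat download when the operating system clears out its temporary folder.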

## Using Tika

Let's see if we can get this to work!



In [1]:
import tika
from tika import parser

This log message means "hey, I'm downloading Tika":

```
2021-07-14 20:59:23,901 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /var/folders/jc/9v94cz997jg23m87vhvgv4840000gp/T/tika-server.jar.
```

In [7]:
parsed = parser.from_file('players/players.pdf')

In [11]:
parsed.keys()

dict_keys(['metadata', 'content', 'status'])

In [9]:
parsed['status']

200
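Since every result is a plain dictionary with those three keys, the checks are easy to wrap in a small helper. The sketch below is a convenience of my own rather than part of tika-python: `extract_text` is a made-up name, and it assumes that a missing or `None` value under `content` should be treated as empty text.

```python
def extract_text(parsed):
    """Return the stripped text from a Tika-style result dictionary.

    Raises ValueError when the status is not 200, and returns an empty
    string when there is no content to strip.
    """
    status = parsed.get("status")
    if status != 200:
        raise ValueError(f"Tika returned status {status}")
    content = parsed.get("content")
    return content.strip() if content else ""


# Example with a hand-written dictionary shaped like the result above.
sample = {"status": 200, "content": "\n\n1_Excel\n\n", "metadata": {}}
print(extract_text(sample))
```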

In [10]:
parsed['metadata']

{'Author': 'Jonathan Soma',
 'Content-Type': 'application/pdf',
 'Creation-Date': '2018-01-30T23:10:06Z',
 'Last-Modified': '2018-01-30T23:10:06Z',
 'Last-Save-Date': '2018-01-30T23:10:06Z',
 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
  'org.apache.tika.parser.pdf.PDFParser'],
 'X-TIKA:content_handler': 'ToTextContentHandler',
 'X-TIKA:embedded_depth': '0',
 'X-TIKA:parse_time_millis': '50',
 'access_permission:assemble_document': 'true',
 'access_permission:can_modify': 'true',
 'access_permission:can_print': 'true',
 'access_permission:can_print_degraded': 'true',
 'access_permission:extract_content': 'true',
 'access_permission:extract_for_accessibility': 'true',
 'access_permission:fill_in_form': 'true',
 'access_permission:modify_annotations': 'true',
 'created': '2018-01-30T23:10:06Z',
 'creator': 'Jonathan Soma',
 'date': '2018-01-30T23:10:06Z',
 'dc:creator': 'Jonathan Soma',
 'dc:format': 'application/pdf; version=1.3',
 'dc:title': '1_Excel',
 'dcterms:created': 

In [14]:
print(parsed['content'].strip())

1_Excel


Player Pos Status Ht Wt DOB 
Rhett Bomar Quarterback Active 6'2' 215 7/2/85
Joe Webb Quarterback Active 6'4' 220 11/14/86
Christian Ponder Quarterback Active 6'2' 229 2/25/88
Adrian Peterson Running Back Active 6'1' 217 3/21/85
Lorenzo Booker Running Back Active 5'10' 191 6/14/84
Ryan D'Imperio Running Back Active 6'3' 240 8/15/87
Jeff Dugan Running Back Active 6'4' 258 4/8/81
Toby Gerhart Running Back Active 6'1' 237 3/28/87
Greg Camarillo Wide Receiver Active 6'1' 190 4/18/82
Juaquin Iglesias Wide Receiver Active 6'0' 204 8/22/87
Freddie Brown Wide Receiver Active 6'4' 215 6/24/86
Jaymar Johnson Wide Receiver Active 6'0' 176 7/10/84
Emmanuel Arceneaux Wide Receiver Active 6'2' 210 9/17/87
Bernard Berrian Wide Receiver Active 6'1' 185 12/27/80
Percy Harvin Wide Receiver Active 5'11' 192 5/28/88
Sidney Rice Wide Receiver Active 6'4' 202 9/1/86
Visanthe Shiancoe Tight End Active 6'4' 250 6/18/80
Jim Kleinsasser Tight End Active 6'3' 272 1/31/77
Cullen Loeffler Center Active 6'
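The table comes back as flat lines of text, so the columns have to be pulled apart again. One way to do that is a regular expression, sketched below. It assumes every row looks like the ones printed above (name, position, status, height, weight, date of birth), and the `POSITIONS` list is copied from this output rather than from any official source. `parse_rows` is a made-up helper name.

```python
import re

# Positions seen in the output above; extend this list for the full roster.
POSITIONS = ["Quarterback", "Running Back", "Wide Receiver", "Tight End", "Center"]

ROW = re.compile(
    r"^(?P<player>.+?) "
    r"(?P<pos>" + "|".join(re.escape(p) for p in POSITIONS) + r") "
    r"(?P<status>\S+) "
    r"(?P<ht>\d+'\d+') "
    r"(?P<wt>\d+) "
    r"(?P<dob>\d{1,2}/\d{1,2}/\d{2})$"
)


def parse_rows(text):
    """Turn each line that matches the row pattern into a dictionary."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # the header, the title and incomplete lines are skipped
            rows.append(match.groupdict())
    return rows


sample = "Player Pos Status Ht Wt DOB \nRyan D'Imperio Running Back Active 6'3' 240 8/15/87"
print(parse_rows(sample))
```

Matching against a known list of positions is what lets the pattern tell where a player's name ends, since both names and positions can contain spaces.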

In [15]:
parsed = parser.from_file('players/players-scan.jpg')

In [16]:
parsed['metadata']

{'Blue Colorant': '(0.1492, 0.0632, 0.7446)',
 'Blue TRC': '0.0085908',
 'CMM Type': 'ADBE',
 'Caption Digest': '212 29 140 217 143 0 178 4 233 128 9 152 236 248 66 126',
 'Class': 'Display Device',
 'Color space': 'RGB',
 'Component 1': 'Y component: Quantization table 0, Sampling factors 2 horiz/2 vert',
 'Component 2': 'Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert',
 'Component 3': 'Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert',
 'Compression Type': 'Baseline',
 'Content-Type': 'image/jpeg',
 'Copyright': 'Copyright 2000 Adobe Systems Incorporated',
 'Data Precision': '8 bits',
 'Device manufacturer': 'none',
 'Exif IFD0:Resolution Unit': 'Inch',
 'Exif IFD0:X Resolution': '300 dots per inch',
 'Exif IFD0:Y Resolution': '300 dots per inch',
 'Exif SubIFD:Exif Image Height': '3300 pixels',
 'Exif SubIFD:Exif Image Width': '2550 pixels',
 'File Modified Date': 'Wed Jul 14 11:03:25 -04:00 2021',
 'File Name': 'apache-tika-696879367379945252

In [19]:
print(parsed['content'].strip())

Player

Rhett Bomar
Joe Webb
Christian Ponder
Adrian Peterson
Lorenzo Booker
Ryan D'Imperio
Jeff Dugan

Toby Gerhart
Greg Camarillo
Juaquin Iglesias
Freddie Brown
Jaymar Johnson
Emmanuel Arceneaux
Bernard Berrian
Percy Harvin
Sidney Rice
Visanthe Shiancoe
Jim Kleinsasser
Cullen Loeffler
Jon Cooper
John Sullivan
Anthony Herrera
Steve Hutchinson
Seth Olsen
Chris DeGeare
Thomas Welch
Phil Loadholt
Bryant McKinnie
Patrick Brown
Ryan Cook
Chris Kluwe
Brian Robison
Kevin Williams
Ray Edwards
Jared Allen
Tremaine Johnson
Adrian Awasom
Letroy Guion
Jimmy Kennedy
Everson Griffen
Chad Greenway
E.J. Henderson
Heath Farwell
Kenny Onatolu
Jasper Brinkley
Erin Henderson
Madieu Williams
Chris Cook
Marcus Sherels
Asher Allen
Cedric Griffin

Pos

Quarterback
Quarterback
Quarterback
Running Back
Running Back
Running Back
Running Back
Running Back
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Wide Receiver
Tight End

Tight End

Center

Center

Center

G
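Notice that the scanned version comes back one column at a time instead of row by row. A rough way to stitch rows back together is to group lines under the header they follow and then zip the columns. The sketch below assumes the header words are known in advance and never appear as cell values. `split_columns` is a hypothetical helper, and OCR mistakes or missing cells would throw the alignment off.

```python
def split_columns(text, headers):
    """Group non-blank lines under the most recent header line seen."""
    columns = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # OCR output is full of blank lines
        if line in headers:
            current = line
            columns[current] = []
        elif current is not None:
            columns[current].append(line)
    return columns


sample = "Player\n\nRhett Bomar\nJoe Webb\n\nPos\n\nQuarterback\nQuarterback\n"
columns = split_columns(sample, {"Player", "Pos"})
# Pair up the first player with the first position, and so on.
rows = list(zip(columns["Player"], columns["Pos"]))
print(rows)
```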

In [20]:
parsed = parser.from_file('new-york/nys-bill.html')
print(parsed['content'].strip())

<pre>
<STYLE><!--U  {color: Green}S  {color: RED} I  {color: DARKBLUE; background-color:yellow}
P.brk {page-break-before:always}--></STYLE>
<BASEFONT SIZE=3>
<PRE WIDTH="122">

<FONT SIZE=5><B>                STATE OF NEW YORK</B></FONT>
        ________________________________________________________________________

                                          7569

                               2019-2020 Regular Sessions

<FONT SIZE=5><B>                   IN ASSEMBLY</B></FONT>

                                       May 9, 2019
                                       ___________

        Introduced  by M. of A. GALEF -- read once and referred to the Committee
          on Corporations, Authorities and Commissions

        AN ACT to amend the public service law, in relation to the  transfer  or
          lease  of  closed  electric  generators; and in relation to payment of
          prevailing wages of affected employees of  the  Indian  Point  Nuclear
          Power Plant
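This particular result still has HTML tags and a style block mixed into the text. If that happens, one option is a second pass with Python's built-in `html.parser`. The sketch below is a generic tag stripper, not something Tika provides: `TagStripper` and `strip_tags` are made-up names, and dropping everything inside `style` and `script` elements is my assumption about what you'd want.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text while ignoring tags and style/script contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than zero while inside style/script

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the text of `markup` with tags and style rules removed."""
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    return "".join(stripper.parts)


sample = "<STYLE><!--U  {color: Green}--></STYLE>\n<FONT SIZE=5><B>STATE OF NEW YORK</B></FONT>"
print(strip_tags(sample).strip())
```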

        

In [25]:
import requests

response = requests.get('https://nytimes.com')


In [28]:
parsed = parser.from_buffer(response.text)
print(parsed['content'].strip())

The New York Times - Breaking News, US News, World News and Videos


    Continue reading the main story






SectionsSEARCH
Skip to contentSkip to site index

	U.S.
	International
	Canada
	Español
	中文


Log in


Today’s Paper








	
	
	World
	U.S.
	Politics
	N.Y.
	Business
	Opinion
	Tech
	Science
	Health
	Sports
	Arts
	Books
	Style
	Food
	Travel
	Magazine
	T Magazine
	Real Estate
	Video



	World
	U.S.
	Politics
	N.Y.
	Business
	Opinion
	Tech
	Science
	Health
	Sports
	Arts
	Books
	Style
	Food
	Travel
	Magazine
	T Magazine
	Real Estate
	Video




Europe Lays Out Vision for a Carbonless Future, but Big Obstacles Loom

	An ambitious blueprint to reduce emissions 55 percent by 2030 promises tough haggling among industry, 27 countries and the European Parliament.
	The E.U. hopes to set an example, invent new technologies that it can sell and provide global standards that can lead to a carbon-neutral economy.





A photovoltaic energy farm near the town of Bogatynia, Poland. Maciek Nab

If you aren't working in English, you'll need to set the `X-Tika-OCRLanguage` header to one of the language codes that `tesseract --list-langs` prints.

In [33]:
!tesseract --list-langs

List of available languages (163):
afr
amh
ara
asm
aze
aze_cyrl
bel
ben
bod
bos
bre
bul
cat
ceb
ces
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
chr
cos
cym
dan
deu
div
dzo
ell
eng
enm
epo
equ
est
eus
fao
fas
fil
fin
fra
frk
frm
fry
gla
gle
glg
grc
guj
hat
heb
hin
hrv
hun
hye
iku
ind
isl
ita
ita_old
jav
jpn
jpn_vert
kan
kat
kat_old
kaz
khm
kir
kmr
kor
kor_vert
lao
lat
lav
lit
ltz
mal
mar
mkd
mlt
mon
mri
msa
mya
nep
nld
nor
oci
ori
osd
pan
pol
por
pus
que
ron
rus
san
script/Arabic
script/Armenian
script/Bengali
script/Canadian_Aboriginal
script/Cherokee
script/Cyrillic
script/Devanagari
script/Ethiopic
script/Fraktur
script/Georgian
script/Greek
script/Gujarati
script/Gurmukhi
script/HanS
script/HanS_vert
script/HanT
script/HanT_vert
script/Hangul
script/Hangul_vert
script/Hebrew
script/Japanese
script/Japanese_vert
script/Kannada
script/Khmer
script/Lao
script/Latin
s
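A typo in the language code is easy to make, so it can help to check the code against that list before building the headers. The sketch below works on the text of the listing (paste it in or capture it however you like). `parse_langs` and `ocr_headers` are hypothetical helpers, and skipping the "List of available languages" banner is an assumption based on the output shown here.

```python
def parse_langs(listing):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in listing.splitlines()]
    # Skip blank lines and the "List of available languages (N):" banner.
    return {line for line in lines if line and not line.startswith("List of")}


def ocr_headers(lang, available):
    """Build the OCR language header, refusing codes that are not installed."""
    if lang not in available:
        raise ValueError(f"{lang!r} is not in the installed language list")
    return {"X-Tika-OCRLanguage": lang}


listing = "List of available languages (3):\nchi_sim\neng\ngrc\n"
available = parse_langs(listing)
print(ocr_headers("chi_sim", available))
```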

In [31]:
headers = {
    "X-Tika-OCRLanguage": "chi_sim"
}

results = parser.from_file('non-english/museums-scanned.jpg', headers=headers)
print(results['content'].strip())

附件 1

2015 年 度 全 国 博 物 馆 名 录

博物 馆 性 | 质量 等 | 是 否 免费

北京 市 〈151 家 )

故宫 博物 院 文物

人 民 革 命 军事 博物 馆 行业

人 | 文胸 是

 

 

城区 景山 前 街 4 号

 

 

 

 

 

 

 

 

 

 

城区 东 长 安 街 16 号

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

了

 

 

 

 

 

 

 

 

地 质 博 物 馆

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

目 农 业 博 物 馆

 

 

 

 

 

二

 

 

 

 

 

抗日 战争 纪念 馆 文物 区 宛 平 城内 街 101 号

 

 

 

 

 

 

 

 

 

 

 

北京 市 朝阳 区 北 展 东 路 5 号
城区 天 桥 南 大 街 126 号

区 复兴 门 外 大 街 16 号
房山 区 周口 店 大 街 1 号
昌平 区 小 汤山 5806 号

IN

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

首都 博物 馆 文物

店 北京 人 遗址 博物 馆 文物

 

 

 

 

 

 

 

 

 

 

 

二

 

 

 

 

 

 

 

中 国航 空 博物 锯

 

 

 

 

 

北京 天 文 馆 《北京 古 观 象 台 )

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

文物 区 东经 路 21 号

 

 

 

 

洗 区 学 院 路 42 号
西城 区 马连道 南 街 2 号 院 1 号 楼

 

R|R

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

R

 

 

 

 

 

 

 

 

 

 

 

 

 

 

子 监 街 13 一 15 号

平 区 十 三 陵 特 区 办 事 处 定 陵

演 区 五 塔 寺 24 
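OCR output like this is mostly whitespace-only lines. A small cleanup step makes it easier to read. The sketch below simply drops every line that is empty after stripping. `drop_blank_lines` is a made-up name, and it assumes you don't need the blank lines to tell table cells apart.

```python
def drop_blank_lines(text):
    """Strip each line and remove the ones that end up empty."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


sample = "附件 1\n\n \n\n城区 景山 前 街 4 号\n \n"
print(drop_blank_lines(sample))
```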

In [32]:
headers = {
    "X-Tika-OCRLanguage": "grc"
}

results = parser.from_file('non-english/greek.png', headers=headers)
print(results['content'].strip())

Η Αϑήνα (Αϑῆναι στὰ ἀρχαία ελληνικά και τὴν καϑαρεύουσα) εἶναι ἡ πρω-
τεύουσα τῆς Βλλάδας. Ἐπίσης είναι ἡ ἔδρα τῆς Περιφέρειας Αττικής. Βρίσχεται
στὴ Στερεά (Κεντρική) Ἑλλάδα και αποτελεί εύρωστο οιχονομιχό, πολιτιστικό χαι
διοικητικό κέντρο τῆς χώρας. Πήρε το ὀνομά τῆς από τὴ ϑεά Αϑηνά που ἦταν χαι
ἡ προστάτιδά της. Η Αϑήνα σήμερα εἰναι μία σύγχρονη πόλη αλλά χαι διάσημη,
χκαϑώς στὴν αρχαιότητα ἦταν πανίσχυρη πόλη-χράτος και σημαντικότατο χέντρο
πολιτισμού. ϑεωρείται ἡ ἱιστορικότερη πόλη τῆς Ευρώπης μαζί με τὴ Ρώμη. ἘΣ
ίναι γνωστή σε όλο τον κόσμο για τα ιστοριχά τῆς μνημεία που διασώϑγηραν,έστω
χκαι μερικώς, στο πέρασμα τῶν αἰώνων. Ἐπίνειο τῆς ἱιστορικής πόλης εἰναι το λι-
μάνι του Πειραιά. Πολιούχος τῆς Πόλης των Αϑηνών εἰναι ο Ἅγιος Διονύσιος ο
ἈΑρεοπαγίτης.


In [35]:
response = requests.get('https://daccess-ods.un.org/access.nsf/GetFileUndocs?Open&DS=S/RES/2585(2021)&Lang=E&Type=DOC')
response

<Response [200]>

In [39]:
parser.from_file('https://www.federalreserve.gov/monetarypolicy/files/BeigeBook_20210602.pdf')

2021-07-14 11:38:35,549 [MainThread  ] [INFO ]  Retrieving https://www.federalreserve.gov/monetarypolicy/files/BeigeBook_20210602.pdf to /var/folders/l0/h__2c37508b8pl19zp232ycr0000gn/T/monetarypolicy-files-beigebook_20210602.pdf.


{'metadata': {'Author': 'Federal Reserve',
  'Content-Type': 'application/pdf',
  'Creation-Date': '2021-06-01T15:48:36Z',
  'Last-Modified': '2021-06-01T17:31:02Z',
  'Last-Save-Date': '2021-06-01T17:31:02Z',
  'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
   'org.apache.tika.parser.pdf.PDFParser'],
  'X-TIKA:content_handler': 'ToTextContentHandler',
  'X-TIKA:embedded_depth': '0',
  'X-TIKA:parse_time_millis': '892',
  'access_permission:assemble_document': 'true',
  'access_permission:can_modify': 'true',
  'access_permission:can_print': 'true',
  'access_permission:can_print_degraded': 'true',
  'access_permission:extract_content': 'true',
  'access_permission:extract_for_accessibility': 'true',
  'access_permission:fill_in_form': 'true',
  'access_permission:modify_annotations': 'true',
  'created': '2021-06-01T15:48:36Z',
  'creator': 'Federal Reserve',
  'date': '2021-06-01T17:31:02Z',
  'dc:creator': 'Federal Reserve',
  'dc:format': 'application/pdf; version=1.7',
  