Author: Zoumana Keita

# Prerequisites

## Install and Import Tika

In [15]:
# !pip install tika  # Uncomment on the first run to install Tika (it also needs a Java runtime)
from tika import parser as p 
import requests
import tika

## Different ways to access your file

In [None]:
# 1. From internet
url = "the url to your file"
response = requests.get(url)
results = p.from_buffer(response.content)

# 2. From a local destination
file_path = "path to your file"
results = p.from_file(file_path)

In [6]:
import warnings

In [7]:
# Suppress warnings to keep the notebook output clean
warnings.filterwarnings('ignore')

## Let's Ride on the Shoulders of the Giant!
I will implement two helper functions to avoid repeating the file-loading code: one for files from the internet (1st case) and one for files at a given local path (2nd case).

In [18]:
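def get_data_from_web(url):
    """Download the file at `url` and parse its bytes with Tika."""
    response = requests.get(url)
    results = p.from_buffer(response.content)
    return results

def get_data_from_given_path(file_path):
    """Parse the local file at `file_path` with Tika."""
    results = p.from_file(file_path)
    return results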
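
Since the two helpers differ only in where the file comes from, a small check on the source string could choose between them. The sketch below is an illustration, not part of the original notebook: `is_url` is a hypothetical name, and it assumes that only `http` and `https` sources should go through `get_data_from_web`.

```python
from urllib.parse import urlparse


def is_url(source):
    """Return True when `source` looks like an http(s) URL (hypothetical helper)."""
    parsed = urlparse(source)
    # A usable web address needs both a supported scheme and a host name.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_url("https://en.wikipedia.org/wiki/Ivory_Coast"))
print(is_url("./data/PASSENGER DISCLOSURE AND ATTESTATION.pdf"))
```

With such a check, `get_data_from_web(source) if is_url(source) else get_data_from_given_path(source)` would pick the matching helper.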

### PDF 

#### a. From the web

In [50]:
# From the web
pdf_url = "https://www.bl.uk/learning/resources/pdf/makeanimpact/sw-transcripts.pdf"
results = get_data_from_web(pdf_url)

In [52]:
print(results.keys())

dict_keys(['metadata', 'content', 'status'])


In [54]:
print(results["status"])

200


`results` is a dictionary with the following keys: 
**['metadata', 'content', 'status']** 
We can then get the content of the file by using the **content** key.
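
Before reading `content`, it can be worth checking that the parse actually produced text. The sketch below is hedged: `extract_text` is a hypothetical helper, and it assumes that a failed or text-free parse shows up as a non-200 `status` or as a `content` value of `None`. It works on a plain dictionary, so it runs without a Tika server.

```python
def extract_text(results, default=""):
    """Return stripped text from a Tika-style results dict, or `default`."""
    # Assumption: a status other than 200 means the parse did not succeed.
    if results.get("status") != 200:
        return default
    content = results.get("content")
    # Assumption: content may be None when no text could be extracted.
    if content is None:
        return default
    return content.strip()


# Hand-written sample that mimics the structure of `results`.
sample = {"metadata": {}, "content": "\n\nBarack Obama: Words Matter\n", "status": 200}
print(extract_text(sample))
print(repr(extract_text({"metadata": {}, "content": None, "status": 200})))
```

Calling `.strip()` directly on `results["content"]` would raise an `AttributeError` if the value were `None`, which is the case this helper guards against.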

In [59]:
print("File Content: \n{}".format(results["content"].strip()))

File Content: 
Barack Obama: Words Matter


 
 

 
 

Key Speech Transcripts 
 
 
Barack Obama: Words Matter 
 
Don’t tell me words don’t matter. I have a dream – just words words. We hold 
these truths to be self evident that all men are created equal – just words. We 
have nothing to fear but fear itself – just words, just speeches.  
 
It’s true that speeches don’t solve all problems, but what is also true is that if 
we can’t inspire our country to believe again, then it doesn’t matter how many 
policies and plans we have, and that is why I’m running for president of the 
United States of America, and that’s why we just won 8 elections straight 
because the American people want to believe in change again. Don’t tell me 
words don’t matter! 
 
 
Martin Luther King: I Have a Dream 
 
…and so even though we face the difficulties of today and tomorrow, I still 
have a dream. It is a dream deeply rooted in the American dream. 

I have a dream that one day this nation will rise up and li

We can also get much more information about the file by using the **metadata** key, as shown below.

In [56]:
print("Metadata Info: \n{}".format(results["metadata"]))

Metadata Info: 
{'Author': 'Pete Pattisson', 'Company': 'Heritage freelancers', 'Content-Type': 'application/pdf', 'Creation-Date': '2009-06-03T09:15:19Z', 'Last-Modified': '2009-06-03T09:15:22Z', 'Last-Save-Date': '2009-06-03T09:15:22Z', 'SourceModified': 'D:20090603091516', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '79', 'access_permission:assemble_document': 'true', 'access_permission:can_modify': 'true', 'access_permission:can_print': 'true', 'access_permission:can_print_degraded': 'true', 'access_permission:extract_content': 'true', 'access_permission:extract_for_accessibility': 'true', 'access_permission:fill_in_form': 'true', 'access_permission:modify_annotations': 'true', 'created': '2009-06-03T09:15:19Z', 'creator': 'Pete Pattisson', 'date': '2009-06-03T09:15:22Z', 'dc:creator': 'Pete Pattisson', 'dc:format': 'applic
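
The full metadata dictionary is long, so it can help to print only a few fields. The following is a minimal sketch: `summarize_metadata` is a hypothetical helper, the field names are taken from the output above, and it assumes that a value may be either a string or a list of strings (as with `X-Parsed-By`).

```python
def summarize_metadata(metadata, keys=("Author", "Content-Type", "Creation-Date")):
    """Pick selected keys from a metadata dict, joining list values into one string."""
    summary = {}
    for key in keys:
        value = metadata.get(key)
        if value is None:
            continue  # Skip fields that this file does not provide.
        if isinstance(value, list):
            value = ", ".join(str(item) for item in value)
        summary[key] = value
    return summary


# Sample values copied from the metadata printed above.
sample_metadata = {
    "Author": "Pete Pattisson",
    "Content-Type": "application/pdf",
    "Creation-Date": "2009-06-03T09:15:19Z",
    "X-Parsed-By": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.pdf.PDFParser",
    ],
}
for key, value in summarize_metadata(sample_metadata).items():
    print(f"{key}: {value}")
```

In the notebook, the same call could be made on `results["metadata"]`; the available keys vary with the file type, so missing keys are skipped rather than treated as errors.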

#### b. From a local folder

In [60]:
# From existing folder
pdf_file_path = "./data/PASSENGER DISCLOSURE AND ATTESTATION.pdf"
results = get_data_from_given_path(pdf_file_path)
print(results["content"].strip())

ATTACHMENT A - PASSENGER DISCLOSURE AND ATTESTATION TO THE UNITED STATES OF AMERICA (print-only)


   
 

ATTACHMENT A 
 

PASSENGER DISCLOSURE AND ATTESTATION  
TO THE UNITED STATES OF AMERICA 

 
All airlines or other aircraft operators covered by the Order must provide the following 
disclosure to their passengers and collect the attestation prior to embarkation. 
 
 
AIRLINE AND AIRCRAFT OPERATOR DISCLOSURE REQUIREMENT: 
 
As required by United States federal law, all airlines or other aircraft operators must confirm either 
a negative COVID-19 test result or recovery from COVID-19 and clearance to travel and collect 
a passenger attestation on behalf of the U.S. Centers for Disease Control and Prevention (CDC) 
for certain passengers on aircraft departing from a foreign country and arriving in the United States.  
 
Each individual 2 years of age or older must provide a separate attestation. Unless otherwise 
permitted by law, a parent or other authorized individual should attest 

### Docx document 

In [61]:
docx_file_path = "./data/covid-19-ets2-sample-employee-choice-vaccination-policy.docx"
results = get_data_from_given_path(docx_file_path)
print(results["content"].strip())

OSHA COVID-19 Vaccination, Testing and Face Covering Policy Template


COVID-19 Vaccination, Testing and Face Covering Policy Template

The OSHA COVID-19 Emergency Temporary Standard (ETS) on Vaccination and Testing generally requires covered employers to establish, implement, and enforce a written mandatory vaccination policy (29 CFR 1910.501(d)(1)).  However, there is an exemption from that requirement for employers that establish, implement, and enforce a written policy allowing any employee not subject to a mandatory vaccination policy to either choose to be fully vaccinated against COVID-19 or provide proof of regular testing for COVID-19 and wear a face covering in lieu of vaccination (29 CFR 1910.501(d)(2)). Employers may use this template to develop a policy that provides employees the choice of COVID-19 vaccination or regular COVID-19 testing and face covering use. 
Employers using this template will need to customize areas marked with blue text and modify (change, add, or rem

### Image File

In [48]:
img_url = "https://i.stack.imgur.com/t3qWG.png"
results = get_data_from_web(img_url)
print(results["content"].strip())

Adobe, the Adobe logo, Acrobat, the Acrobat logo, Acrobat Capture, Adobe Garamond, Adobe
Intelligent Document Platform, Adobe PDF, Adobe Reader, Adobe Solutions Network, Aldus, Dis-
tiller, ePaper, Extreme, FrameMaker, Illustrator, InDesign, Minion, Myriad, PageMaker, Photo-
shop, Poetica, PostScript, and XMP are either registered trademarks or trademarks of Adobe
‘Systems Incorporated in the United States and/or other countries. Microsoft and Windows are
either registered trademarks or trademarks of Microsoft Corporation in the United States and/or
other countries. Apple, Mac, Macintosh, and Power Macintosh are trademarks of Apple Computer,
Inc,, registered in the United States and other countries. IBM is a registered trademark of IBM
Corporation in the United States. Sun is a trademark or registered trademark of Sun Microsys-
tems, Inc. in the United States and other countries. UNIX is a registered trademark of The Open
Group. SVG is a trademark of the World Wide Web Consortium; mark

### Web Page

In [49]:
page_url = "https://en.wikipedia.org/wiki/Ivory_Coast"
results = get_data_from_web(page_url)
print(results["content"].strip())

Ivory Coast - Wikipedia





	
	

	
	

	Ivory Coast

	
		From Wikipedia, the free encyclopedia

		

		

		
		

		Jump to navigation
		Jump to search
		Country in West Africa

This article is about the West African country. For other uses, see Ivory Coast (disambiguation).



Coordinates: 8°N 5°W﻿ / ﻿8°N 5°W﻿ / 8; -5


	Republic of Côte d'Ivoire
République de Côte d'Ivoire (French)

	
        
            

            Flag

        

        
            

             Coat of arms

        

    

	Motto: ‘Union – Discipline – Travail’ (French)
'Unity – Discipline – Work'
	Anthem: L'Abidjanaise
(English: "Song of Abidjan")




	
	Capital	Yamoussoukro (de jure)
Abidjan (de facto) 

6°51′N 5°18′W﻿ / ﻿6.850°N 5.300°W﻿ / 6.850; -5.300
	Largest city	Abidjan
	Official languages	French
	Vernacular
languages		Bété
	Dyula
	Baoulé
	Abron
	Agni
	Cebaara Senufo
	others



	Ethnic groups  (2018)
		41.1% Akan
	27.5% Dyula, Maninka
	17.6% Voltaiques / Gur
	11.0% Kru
	2.8% Othersa



	Religion  (2020