## This notebook extracts text from a PDF and identifies keywords using custom YAKE parameters

In [1]:
# imports
import fitz
import yake

In [3]:
# read the PDF
# example: https://www.milnerltd.com/wp-content/uploads/2014/02/10-Steps-to-Success-in-Mergers-Acquisitions.pdf
pdf = fitz.open("10-Steps-to-Success-in-Mergers-Acquisitions.pdf")

# loop to extract text from each page
text = ""
for page in pdf:
    text += page.get_text()
print(text)

2
This white paper reviews the literature on mergers 
from 1967 onwards. It covers both academic and 
consultancy sources and focuses on what makes 
mergers a success or failure. 
The literature often uses the term merger 
interchangeably with the post-acquisition integration 
process. In some cases, the literature also reports 
that merger management has much in common with 
JV management (Norburn and Schoenberg (1990)).
The emphasis in this white paper is on what makes 
mergers work (in the post-acquisition phase) 
and includes the relevant literature on mergers, 
acquisitions and JVs.
The success rate of acquiring and merging 
companies is between 40% and 50% measured over 
a range of criteria (Kitching (1974), Egon Zehnder 
(1987), Norburn and Schoenberg (1987), Bishop and 
Kay (1993)). The purpose of this review is to extract 
the key themes which are associated with successes 
or failures.
Several authors state that there is no merger manual 
with a set of do’s and dont’s (Ernst 
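
The extracted text above keeps the PDF's hard line breaks, so sentences are split mid-phrase. The sketch below is one possible way to tidy such text before keyword extraction; it is not part of the original notebook. The helper name `normalize_pdf_text` is hypothetical, and whether this cleanup changes the YAKE scores has not been tested here.

```python
import re


def normalize_pdf_text(raw: str) -> str:
    """Collapse hard line breaks inside paragraphs into single spaces.

    Hypothetical helper: blank lines are treated as paragraph boundaries,
    everything else (newlines, tabs, repeated spaces) becomes one space.
    """
    # Split on blank lines so paragraph boundaries survive the cleanup.
    paragraphs = re.split(r"\n\s*\n", raw)
    # Inside each paragraph, squeeze any run of whitespace to one space.
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    # Drop paragraphs that were only whitespace.
    return "\n\n".join(p for p in cleaned if p)


sample = (
    "This white paper reviews the literature on mergers \n"
    "from 1967 onwards. It covers both academic and \n"
    "consultancy sources"
)
print(normalize_pdf_text(sample))
```

If this helps, `text = normalize_pdf_text(text)` could be run before passing `text` to the extractor, and the two keyword lists compared.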

In [24]:
# baseline: extractor with default settings
# (this result is overwritten by the custom extractor below)
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)

# parameters
language = "en"
# maximum number of words in a keyword phrase
max_ngram_size = 2
# similarity limit for deduplication: a lower value removes more near-duplicate keywords (0.9 in the documentation example)
deduplication_threshold = 0.5
# other algos did not produce very different results in shorter texts
deduplication_algo = 'seqm'
# number of words it looks at before and after a term to gauge significance; a larger number gives more context
windowSize = 20
# number of keywords desired
numOfKeywords = 10

custom_kw_extractor = yake.KeywordExtractor(
    lan=language,
    n=max_ngram_size,
    dedupLim=deduplication_threshold,
    dedupFunc=deduplication_algo,
    windowsSize=windowSize,
    top=numOfKeywords,
    features=None,
)
keywords = custom_kw_extractor.extract_keywords(text)

# print (keyword, score) tuples; in YAKE a lower score means a more relevant keyword
for kw in keywords:
    print(kw)

('merger', 0.002784876821676186)
('company', 0.0049354948228752)
('management', 0.008319709444813847)
('business', 0.009208780200299423)
('success', 0.014273612829073794)
('People', 0.015270839294492326)
('merger integration', 0.01615463439763697)
('acquisitions', 0.017776616858400852)
('Norburn', 0.018693627031262885)
('strategic', 0.01928647567207639)
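
To make the effect of the deduplication threshold more concrete, here is a minimal sketch of threshold-based deduplication over a ranked list. It is an illustration only, not YAKE's internal code: the function name `dedupe_ranked` is hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for whatever similarity measure YAKE's `seqm` option actually uses.

```python
from difflib import SequenceMatcher


def dedupe_ranked(candidates, threshold):
    """Keep a candidate unless it is too similar to one already kept.

    Hypothetical helper. `candidates` must be ordered best-first, so the
    stronger of two near-duplicates is the one that survives.
    """
    kept = []
    for cand in candidates:
        # Similarity ratio is in [0, 1]; 1.0 means identical strings.
        too_similar = any(
            SequenceMatcher(None, cand.lower(), prev.lower()).ratio() > threshold
            for prev in kept
        )
        if not too_similar:
            kept.append(cand)
    return kept


ranked = ["merger", "mergers", "company"]
# A low threshold treats "mergers" as a duplicate of "merger".
print(dedupe_ranked(ranked, threshold=0.5))
# A high threshold lets the near-duplicate through.
print(dedupe_ranked(ranked, threshold=0.95))
```

Under these assumptions, lowering the threshold yields a more varied keyword list at the cost of discarding some closely related phrases.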
