<div style="background-color: #ffffff ; padding: 10px;">

**Goals**:
1. Extract text from PDF files.
2. Identify 5 keywords and key phrases, based on custom parameters.
3. Create and append a CSV file to store data from each PDF.

**Data:**
1. [10-Steps-to-Success-in-Mergers-Acquisitions.pdf](./10-Steps-to-Success-in-Mergers-Acquisitions.pdf)
2. [pwc-mergers-acquisitions.pdf](./pwc-mergers-acquisitions.pdf)
3. [The_Legal_500_Country_Comparative_Guides_United_States_Mergers_Aquisitions.pdf](./The_Legal_500_Country_Comparative_Guides_United_States_Mergers_Aquisitions.pdf)

**Skills**: file dialogs, PDF text extraction, keyword extraction (optimization), CSV file operations

**Technology**: Tkinter, Fitz, YAKE

**Result**: [pdf_keywords_master_list.csv](./pdf_keywords_master_list.csv)

In [49]:
# imports
import tkinter as tk
from tkinter import filedialog

import csv
import os
import pandas as pd

import fitz
import yake

In [50]:
# open PDF file locally (pop-up)
def select_pdf_file():
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.askopenfilename(
        filetypes=[("PDF files", "*.pdf")],
        title="Select a PDF file"
    )
    return file_path

pdf_path = select_pdf_file()
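
If the dialog is cancelled, `askopenfilename` typically returns an empty value, and the later `fitz.open(pdf_path)` call would then fail with a less obvious error. The sketch below shows one way to check the selection first. The helper name `require_pdf_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def require_pdf_path(file_path):
    """Return file_path as a Path, or raise if nothing usable was selected."""
    # an empty string (or empty tuple) means the dialog was cancelled
    if not file_path:
        raise ValueError("No file was selected.")
    path = Path(file_path)
    # the dialog filter already limits choices to *.pdf; this is a second check
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    return path
```

In the notebook this could wrap the call as `pdf_path = require_pdf_path(select_pdf_file())`, since `fitz.open` and `os.path.basename` both accept path-like objects in recent versions.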

text extraction

In [51]:
# example: https://www.milnerltd.com/wp-content/uploads/2014/02/10-Steps-to-Success-in-Mergers-Acquisitions.pdf
pdf = fitz.open(pdf_path)

# loop to extract text from each page
text = ""
for page in pdf:
    text += page.get_text()
print(text)

COUNTRY
COMPARATIVE
GUIDES 2024
The Legal 500
Country Comparative Guides
United States
MERGERS & ACQUISITIONS
Contributor
Cravath, Swaine & Moore LLP
Cravath,
Swaine
&
Moore
LLP
Richard Hall
Corporate Partner | rhall@cravath.com
Daniel J. Cerqueira
Partner | dcerqueira@cravath.com
This country-speciﬁc Q&A provides an overview of mergers & acquisitions laws and regulations applicable in United States.
For a full list of jurisdictional Q&As visit legal500.com/guides
Mergers & Acquisitions: United States
PDF Generated: 9-04-2024
2/10
© 2024 Legalease Ltd
UNITED STATES
MERGERS & ACQUISITIONS
 
1. What are the key rules/laws relevant to
M&A and who are the key regulatory
authorities?
In the U.S., both the federal government and state
governments regulate matters relevant to M&A.
At the federal level, M&A activity is subject to the federal
securities laws, principally the Securities Act of 1933 (the
Securities Act) and the Securities Exchange Act of 1934
(the Exchange Act). The Securities an
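
The printed text above keeps the PDF's hard line breaks and at least one ligature character (the "ﬁ" in "speciﬁc"). The sketch below is one possible cleanup step before keyword extraction. The helper `clean_extracted_text` is hypothetical, and whether it changes the keywords YAKE returns for these PDFs has not been checked here.

```python
import re
import unicodedata


def clean_extracted_text(raw_text):
    """Normalize ligatures and collapse whitespace in extracted PDF text."""
    # NFKC maps compatibility characters such as the "fi" ligature to plain letters
    normalized = unicodedata.normalize("NFKC", raw_text)
    # replace every run of spaces, tabs, and line breaks with a single space
    return re.sub(r"\s+", " ", normalized).strip()


sample = "This country-speci\ufb01c Q&A provides an overview of\nmergers & acquisitions laws"
print(clean_extracted_text(sample))
# This country-specific Q&A provides an overview of mergers & acquisitions laws
```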

keyword extraction

In [52]:
# baseline: extractor with default settings (these results are overwritten by the custom extractor below)
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)

# parameters
language = "en"
# maximum number of words in a key phrase
max_ngram_size = 2
# a lower threshold removes more near-duplicate keywords (0.9 in the documentation example)
deduplication_threshold = 0.15
# other algorithms did not produce very different results on shorter texts
deduplication_algo = 'seqm'
# number of words considered before and after a term to judge its significance; a larger number gives more context
windowSize = 1000
# number of keywords desired
numOfKeywords = 5

custom_kw_extractor = yake.KeywordExtractor(
    lan=language,
    n=max_ngram_size,
    dedupLim=deduplication_threshold,
    dedupFunc=deduplication_algo,
    windowsSize=windowSize,
    top=numOfKeywords,
    features=None,
)
keywords = custom_kw_extractor.extract_keywords(text)

# keep only the keyword text; _ discards the score in each (keyword, score) tuple
keyword_list = [kw for kw, _ in keywords]

# display the keywords
keyword_list

['target company', 'buyer', 'United States', 'Exchange Act', 'law']
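
To illustrate why a lower deduplication threshold yields more distinct keywords, the sketch below filters a candidate list by string similarity. It is an illustration only: it uses `difflib.SequenceMatcher` from the standard library as a stand-in, and the similarity function behind YAKE's `'seqm'` option may compute different values. The helper `drop_near_duplicates` and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(candidates, threshold):
    """Keep a candidate only if its similarity to every kept one is <= threshold."""
    kept = []
    for candidate in candidates:
        is_duplicate = any(
            SequenceMatcher(None, candidate.lower(), other.lower()).ratio() > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept


candidates = ["target company", "target companies", "buyer"]
# a high threshold tolerates similar phrases, so both "target ..." variants survive
print(drop_near_duplicates(candidates, threshold=0.9))
# a low threshold treats the second variant as a duplicate and drops it
print(drop_near_duplicates(candidates, threshold=0.15))
```

With only five keyword slots per PDF, a low threshold avoids spending several of them on variants of the same phrase.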

save to csv

In [53]:
# establish csv name
csv_name = "pdf_keywords_master_list.csv"

# create csv (if it doesn't already exist)
if not os.path.exists(csv_name):
    # newline="" prevents blank lines between rows on Windows
    with open(csv_name, "w", newline="") as csvfile:
        # create list for headers
        headers_list = ["file name", "keyword_1", "keyword_2", "keyword_3", "keyword_4", "keyword_5"]
        # instantiate csv writer
        writer = csv.writer(csvfile)
        # add headers to new csv file
        writer.writerow(headers_list)
    
# add row to csv file
# open file in append mode
with open(csv_name, "a", newline="") as csvfile:
    # instantiate csv writer
    writer = csv.writer(csvfile)
    # establish file name variable
    file_title = os.path.basename(pdf_path)
    # create row data
    row = [file_title] + keyword_list
    # write a row
    writer.writerow(row)
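
If an extractor returns fewer than five keywords for a short PDF, the row written above would have fewer cells than the header. The sketch below combines header creation and appending in one function and pads or truncates the keyword cells. The function `append_keyword_row`, the demo file name, and the sample keyword lists are hypothetical; the demo writes to a temporary directory rather than to the master list.

```python
import csv
import os
import tempfile


def append_keyword_row(csv_path, file_title, keywords, n_keywords=5):
    """Append one row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    # pad with empty strings or truncate so every row has exactly n_keywords cells
    cells = (list(keywords) + [""] * n_keywords)[:n_keywords]
    with open(csv_path, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        if needs_header:
            header = ["file name"] + [f"keyword_{i}" for i in range(1, n_keywords + 1)]
            writer.writerow(header)
        writer.writerow([file_title] + cells)


# demo in a temporary directory so the real master list is untouched
demo_path = os.path.join(tempfile.mkdtemp(), "demo_keywords.csv")
append_keyword_row(demo_path, "a.pdf", ["merger", "company"])
append_keyword_row(demo_path, "b.pdf", ["deal", "buyer", "law", "board", "target", "extra"])

with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```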


preview output csv file

In [54]:
new_csv_df = pd.read_csv(csv_name)

new_csv_df

| | file name | keyword_1 | keyword_2 | keyword_3 | keyword_4 | keyword_5 |
|---|---|---|---|---|---|---|
| 0 | 10-Steps-to-Success-in-Mergers-Acquisitions.pdf | merger | company | business | Pritchett | successful acquisition |
| 1 | pwc-mergers-acquisitions.pdf | Supervisory Board | management | deal | company | business |
| 2 | The_Legal_500_Country_Comparative_Guides_Unite... | target company | buyer | United States | Exchange Act | law |
