## TextRank for extracting keywords

### Step 0. Choose your working paper PDF

In [1]:
pdf_path = 'corpus/assembly_wp/a40/a40_wp_063_en.pdf'

### Step 1. Analyze PDF

In [2]:
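from collections import OrderedDict

import numpy as np

from utils import *  # project helper functions used in the cells below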
# Extract all the content
content_plumber = extract_pdf_plumber(pdf_path)
content_plumber

' \n  A40-WP/63 \nInternational Civil Aviation Organization EX/30 \n  3/7/19 \nWORKING PAPER   \n \nASSEMBLY — 40TH SESSION \n \nEXECUTIVE COMMITTEE \n \nAgenda Item 23: Technical Assistance Programme\n \nPROGRESS REPORT ON THE IMPLEMENTATION OF THE COMPREHENSIVE \nREGIONAL IMPLEMENTATION PLAN FOR AVIATION SECURITY AND FACILITATION \nIN AFRICA (AFI SECFAL PLAN) \n \n(Presented by the Council of ICAO) \n \nEXECUTIVE SUMMARY \nThis  paper  presents  the  progress  made  in  the  implementation  of  the  Comprehensive  Regional \nImplementation Plan for Aviation Security and Facilitation in Africa (AFI SECFAL Plan) and its Work \nProgramme since the 39th Assembly held in 2016. In addition, it supports Assembly Resolution A39-38 \ndesigned to promote the implementation of the AFI SECFAL Plan. \nAction: The Assembly is invited to: \na)  review  progress  on  the  implementation  of  the  Resolution  on  the  Comprehensive  Regional \nImplementation Plan for Aviation Security and Facilitatio

In [3]:
# Get the main body of the working paper
main_body = main_body_plumber(content_plumber)
main_body

'The threat of potential acts of unlawful interference against civil aviation continues to manifest itself due to the presence of active terrorist groups and events in the African region. The above- mentioned threats co-exist with the recorded increase in air transport passenger flows. Furthermore, the Universal Security Audit Programme Continuous Monitoring Approach (USAP-CMA) results indicate that progress has been achieved in the last three years in raising the level of effective implementation of ICAO Standards and Recommended Practices (SARPs) but more needs to be done to realize the targets established in the Global Aviation Security Plan (GASeP) and the associated regional plans. Increasingly complex current and emerging threats such as cyber security, landside security and insider threat, call for a coordinated and cohesive approach to mitigate the consequences. Furthermore, at the international level, and as part of the coordination with the United Nations Security Council (UN

In [4]:
# Segment into sentences
raw_sent = sent_segment(main_body)
raw_sent

['The threat of potential acts of unlawful interference against civil aviation continues to manifest itself due to the presence of active terrorist groups and events in the African region.',
 'The above- mentioned threats co-exist with the recorded increase in air transport passenger flows.',
 'Furthermore, the Universal Security Audit Programme Continuous Monitoring Approach (USAP-CMA) results indicate that progress has been achieved in the last three years in raising the level of effective implementation of ICAO Standards and Recommended Practices (SARPs) but more needs to be done to realize the targets established in the Global Aviation Security Plan (GASeP) and the associated regional plans.',
 'Increasingly complex current and emerging threats such as cyber security, landside security and insider threat, call for a coordinated and cohesive approach to mitigate the consequences.',
 'Furthermore, at the international level, and as part of the coordination with the United Nations Sec

In [5]:
# Preprocess sentences
sentences = preprocessing(raw_sent)
sentences

['threat potential acts unlawful interference civil aviation continues manifest due presence active terrorist groups events african region',
 'mentioned threats co exist recorded increase air transport passenger flows',
 'furthermore universal security audit programme continuous monitoring approach usap cma results indicate progress achieved last three years raising level effective implementation icao standards recommended practices sarps needs done realize targets established global aviation security plan gasep associated regional plans',
 'increasingly complex current emerging threats cyber security landside security insider threat call coordinated cohesive approach mitigate consequences',
 'furthermore international level part coordination united nations security council unsc global counter terrorism strategy icao leadership traveller identification recognized made significant contribution enhancing air transport facilitation using identification tools described icao traveller ident

### Step 2. Implement the TextRank algorithm

TextRank is inspired by PageRank. Each word is a node in a graph, an edge connects two words that co-occur within a sliding window, and the weight of each node is then computed iteratively, as in PageRank.

For example:

sentence = \[ $w_1$, $w_2$, $w_3$, ..., $w_n$ \]

window_size = k

windows:

\[ $w_1$, $w_2$, ..., $w_k$ \]

\[ $w_2$, $w_3$, ..., $w_{k+1}$ \]

\[ $w_3$, $w_4$, ..., $w_{k+2}$ \]

...

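Within the window \[ $w_1$, $w_2$, ..., $w_k$ \], the first word contributes the pairs:

($w_1$, $w_2$), ($w_1$, $w_3$), ..., ($w_1$, $w_k$)

Each subsequent word is paired in the same way with the $k-1$ words that follow it.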
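
The pairing rule can be sketched in a few lines. This is only an illustration: `window_pairs` is a hypothetical name, not the `get_token_pairs` helper from `utils` (whose source is not shown here), and it assumes each word is paired with the words that follow it inside the window.

```python
def window_pairs(words, window_size):
    """Pair each word with the (window_size - 1) words that follow it."""
    pairs = []
    for i, word in enumerate(words):
        # The slice stops early near the end of the sentence.
        for other in words[i + 1 : i + window_size]:
            pairs.append((word, other))
    return pairs


# First words of the first preprocessed sentence, window size 4.
words = "threat potential acts unlawful interference".split()
pairs = window_pairs(words, 4)
print(pairs)
```

With a window of 4, `threat` is paired with the next three words but not with `interference`, which is four positions away.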

In [6]:
# Prepare candidate words
vocab = get_vocab(sentences)
vocab

OrderedDict([('threat', 0),
             ('potential', 1),
             ('acts', 2),
             ('unlawful', 3),
             ('interference', 4),
             ('civil', 5),
             ('aviation', 6),
             ('continues', 7),
             ('manifest', 8),
             ('due', 9),
             ('presence', 10),
             ('active', 11),
             ('terrorist', 12),
             ('groups', 13),
             ('events', 14),
             ('african', 15),
             ('region', 16),
             ('mentioned', 17),
             ('threats', 18),
             ('co', 19),
             ('exist', 20),
             ('recorded', 21),
             ('increase', 22),
             ('air', 23),
             ('transport', 24),
             ('passenger', 25),
             ('flows', 26),
             ('furthermore', 27),
             ('universal', 28),
             ('security', 29),
             ('audit', 30),
             ('programme', 31),
             ('continuous', 32),
             (

In [7]:
# Build token_pairs from windows in sentences
token_pairs = get_token_pairs(4, sentences)
token_pairs

[('threat', 'potential'),
 ('threat', 'acts'),
 ('threat', 'unlawful'),
 ('potential', 'acts'),
 ('potential', 'unlawful'),
 ('potential', 'interference'),
 ('acts', 'unlawful'),
 ('acts', 'interference'),
 ('acts', 'civil'),
 ('unlawful', 'interference'),
 ('unlawful', 'civil'),
 ('unlawful', 'aviation'),
 ('interference', 'civil'),
 ('interference', 'aviation'),
 ('interference', 'continues'),
 ('civil', 'aviation'),
 ('civil', 'continues'),
 ('civil', 'manifest'),
 ('aviation', 'continues'),
 ('aviation', 'manifest'),
 ('aviation', 'due'),
 ('continues', 'manifest'),
 ('continues', 'due'),
 ('continues', 'presence'),
 ('manifest', 'due'),
 ('manifest', 'presence'),
 ('manifest', 'active'),
 ('due', 'presence'),
 ('due', 'active'),
 ('due', 'terrorist'),
 ('presence', 'active'),
 ('presence', 'terrorist'),
 ('presence', 'groups'),
 ('active', 'terrorist'),
 ('active', 'groups'),
 ('active', 'events'),
 ('terrorist', 'groups'),
 ('terrorist', 'events'),
 ('terrorist', 'african'),
 ('g

In [8]:
# Build the normalized adjacency matrix
g = get_matrix(vocab, token_pairs)
g

array([[0.        , 0.25      , 0.2       , ..., 0.        , 0.        ,
        0.        ],
       [0.11111111, 0.        , 0.2       , ..., 0.        , 0.        ,
        0.        ],
       [0.11111111, 0.25      , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.2       ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.16666667,
        0.        ]])
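
The values above are constant within each column (for example 0.25 throughout column 1), which suggests that every column is divided by its sum. A minimal sketch of such a matrix follows, assuming undirected edges of unit weight. `build_matrix` is a hypothetical stand-in and may differ from the actual `get_matrix` in `utils`.

```python
import numpy as np


def build_matrix(vocab, pairs):
    """Symmetric 0/1 adjacency matrix with each column scaled to sum to 1."""
    n = len(vocab)
    g = np.zeros((n, n), dtype=float)
    for w1, w2 in pairs:
        i, j = vocab[w1], vocab[w2]
        g[i, j] = 1.0
        g[j, i] = 1.0  # undirected edge (assumption)
    col_sums = g.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # leave columns of isolated words at zero
    return g / col_sums  # broadcasting divides column j by col_sums[j]


# Toy graph a - b - c (illustrative data, not from the working paper).
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_pairs = [("a", "b"), ("b", "c")]
toy_g = build_matrix(toy_vocab, toy_pairs)
print(toy_g)
```

Here `b` has two neighbours, so both non-zero entries in its column are 0.5, mirroring the pattern in the output above.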

In [9]:
# Initialize the weights (PageRank values)
pr = np.array([1] * len(vocab))
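
For reference, the update applied in the iteration cell, `pr = (1-d) + d * np.dot(g, pr)`, can be read as the TextRank score update written in matrix form. The notation is introduced here for illustration only: $S(V_i)$ is the weight of word $V_i$, $N(V_i)$ is its set of neighbours, and $w_{ji}$ is the edge weight between $V_j$ and $V_i$. It assumes `g` is column-normalized, as its printed values suggest.

$$S(V_i) = (1 - d) + d \sum_{V_j \in N(V_i)} \frac{w_{ji}}{\sum_{V_k \in N(V_j)} w_{jk}} \, S(V_j)$$

Under that assumption, entry $g_{ij}$ equals the fraction inside the sum, so one matrix-vector product updates every word at once.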

In [10]:
d = 0.85  # damping coefficient, usually 0.85
min_diff = 1e-5  # convergence threshold
steps = 10  # maximum number of iterations

# Iterate until the total weight stops changing or the step limit is reached
previous_pr = 0
for epoch in range(steps):
    pr = (1-d) + d * np.dot(g, pr)
    # print(pr)  # uncomment to inspect the weights at each epoch
    if abs(previous_pr - sum(pr)) < min_diff:
        break
    else:
        previous_pr = sum(pr)

# Get weight for each node
node_weight = dict()
for word, index in vocab.items():
    node_weight[word] = pr[index]

# Print the top 20 keywords
node_weight = OrderedDict(sorted(node_weight.items(), key=lambda t: t[1], reverse=True))
for i, (key, value) in enumerate(node_weight.items()):
    print(key + ' - ' + str(value))
    if i > 20-2:
        break

icao - 7.8905155873834
plan - 7.799049151589147
security - 6.968455612102345
aviation - 5.58589919884151
afi - 5.495016458005919
states - 5.104690875396009
african - 4.614649145735901
implementation - 4.191414455451165
regional - 4.093766763851658
secfal - 3.5822409228115837
facilitation - 3.100606425370974
international - 2.790432346121203
standards - 2.381242463887315
annex - 2.27522654530876
use - 2.2343933094805117
approach - 2.2328847801145018
programme - 2.1952750888523473
work - 2.184433106906085
state - 2.1308885514018936
committee - 2.1007648617087065
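
One property of the stopping rule is worth checking. If every column of `g` sums to 1 and `pr` starts at all ones, the update leaves `sum(pr)` unchanged, so a test on the sum can stop the loop before the individual weights have settled. Whether that applies here depends on whether any word in the matrix has no neighbours. The sketch below is a hypothetical variant, not the notebook's method: it compares the weight vectors element by element, and `rank_nodes` and the toy matrix are illustrative.

```python
import numpy as np


def rank_nodes(g, d=0.85, min_diff=1e-5, steps=100):
    """Power iteration that stops when no weight moves by min_diff or more."""
    pr = np.ones(g.shape[0])
    for _ in range(steps):
        new_pr = (1 - d) + d * g.dot(pr)
        converged = np.abs(new_pr - pr).max() < min_diff
        pr = new_pr
        if converged:
            break
    return pr


# Toy graph a - b - c with column-normalized weights (illustrative values).
toy_g = np.array([[0.0, 0.5, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
scores = rank_nodes(toy_g)
print(scores)  # the middle node "b" receives the largest weight
```

In this toy graph the total weight stays at 3 on every iteration while the individual weights keep moving, which is why the element-wise test needs a larger step limit than 10.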
