## Demo notebook for flashtext package

https://github.com/vi3k6i5/flashtext  

Reported to be 28x faster than a compiled regexp for 1k keywords and ~10k tokens per document: https://twitter.com/RadimRehurek/status/904989624589803520

It is pure Python too, so there is probably plenty of room for optimisation.
Some of the functionality (the fuzzy-match implementation) is missing from the release you get with 

**pip install flashtext**

The fuzzy-match code is in the GitHub source: https://github.com/vi3k6i5/flashtext/blob/master/flashtext/keyword.py (line 756) 

So, to get the newest version from GitHub instead, install it with  

**pip install git+https://github.com/vi3k6i5/flashtext.git#egg=flashtext**
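
A quick way to see which of the two installs is active is to check whether the fuzzy-match helper is present. This is only a sketch: `has_fuzzy_matching` is a name made up for this notebook, and it assumes the helper is still called `levensthein` (the spelling used in the cells further down).

```python
import importlib.util


def has_fuzzy_matching():
    """Return True/False if flashtext is installed, or None if it is not."""
    if importlib.util.find_spec("flashtext") is None:
        return None  # flashtext is not installed in this environment
    from flashtext import KeywordProcessor
    # The helper is spelled `levensthein` in the flashtext source used below.
    return hasattr(KeywordProcessor, "levensthein")


print(has_fuzzy_matching())
```

If this prints `False`, the fuzzy examples at the end of the notebook will most likely fail and the GitHub install is needed.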


In [1]:
from flashtext import KeywordProcessor
import pandas as pd

### Single corpus usage examples

In [2]:
"""
The data contains the raw text of each section of a document
"""
df = pd.read_csv("input/2020-07-16_Tier-2-5-sponsor-guidance_Jul-2020_v1.0_section.csv", index_col=0)
document = df.raw_text.iloc[1]
document

"This guidance is for organisations who want to apply for a sponsor licence to \nsponsor migrants under Tier 2 and/or Tier 5 of the points-based system. It \ntells you what we expect if you are a licence holder, the processes you must \nfollow when sponsoring a migrant and how to meet all of the duties and \nresponsibilities associated with being a licensed sponsor. The guidance is \nsubject to change and you should check the dates to make sure you have \nthe latest version. \n \nA new points-based immigration system will come into effect from 1 January \n2021. The future system will apply to both European Economic Area (EEA) \nnationals and non-EEA nationals. You should refer to Annex 9 of this \nguidance if you intend to apply for a licence to sponsor workers under the \nnew system.  \n \nSeparate guidance exists on GOV.UK for UK education providers who wish \nto apply for and hold a licence to sponsor international students to come to \nthe UK under Tier 4 to study. \n \nYou can fin

In [3]:
keyword_processor = KeywordProcessor(case_sensitive=True) # default is False

# keyword_processor.add_keyword(<unclean name>, <standardised name>)
#keyword_processor.add_keyword('Tier 2 and/or Tier 5')
keyword_processor.add_keyword('Tier 2')
keyword_processor.add_keyword('Tier 5')
keyword_processor.add_keyword('European Economic Area', 'EEA')
keyword_processor.add_keyword('nationals', 'non-EEA')
keyword_processor.add_keyword('Annex 9')
keyword_processor.add_keyword('Tier 4', 'Tier 4')

"""
With span_info=True, extract_keywords returns the keywords found in the document (corpus)
together with their start and end character positions
"""
keywords_found = keyword_processor.extract_keywords(document, span_info=True)
print(keywords_found)

[('Tier 2', 102, 108), ('Tier 5', 116, 122), ('EEA', 595, 617), ('non-EEA', 625, 634), ('non-EEA', 647, 656), ('Annex 9', 678, 685), ('Tier 4', 950, 956)]
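
The two numbers in each tuple are character offsets into the document, so slicing the text with them gives back what was actually matched. A small sketch using only the first sentence of the document and the first two tuples printed above (`matched_text` is a helper name invented here):

```python
# First sentence of the document, copied from the output of In [2].
sample = (
    "This guidance is for organisations who want to apply for a sponsor licence to \n"
    "sponsor migrants under Tier 2 and/or Tier 5 of the points-based system."
)

# First two tuples from the extract_keywords(..., span_info=True) output above.
found = [("Tier 2", 102, 108), ("Tier 5", 116, 122)]


def matched_text(text, spans):
    """Pair each clean name with the original text found at text[start:end]."""
    return [(clean, text[start:end]) for clean, start, end in spans]


print(matched_text(sample, found))
# [('Tier 2', 'Tier 2'), ('Tier 5', 'Tier 5')]
```

The clean name and the sliced text differ whenever a standardised name was given: the `'EEA'` span (595 to 617) is 22 characters long because it covers the original `'European Economic Area'`.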


In [4]:
"""
add_keyword method takes 2 arguments (keyword, clean_name)

keyword : (string)     Keyword that you want to identify
clean_name : (string)  Clean term for that keyword that you want to get back or use as a replacement.
                       If not provided, the keyword itself is used as the clean name.
                       The examples below pass a tuple (category, label) instead of a string.
"""
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Tier 2', ('Immigration status', 'Tier 2'))
keyword_processor.add_keyword('Tier 5', ('Immigration status', 'Tier 5'))
keyword_processor.add_keyword('non-EEA', ('Location', 'non-EEA'))
keywords_extracted = keyword_processor.extract_keywords(document, span_info=True)
keywords_extracted

[(('Immigration status', 'Tier 2'), 102, 108),
 (('Immigration status', 'Tier 5'), 116, 122),
 (('Location', 'non-EEA'), 639, 646)]
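
Because the clean name is a `(category, label)` tuple here, the result can be regrouped by category with plain Python. A sketch that works on the output above, copied in by hand so it runs on its own:

```python
from collections import defaultdict

# Output of the cell above, copied verbatim.
keywords_extracted = [
    (("Immigration status", "Tier 2"), 102, 108),
    (("Immigration status", "Tier 5"), 116, 122),
    (("Location", "non-EEA"), 639, 646),
]

# Group the (label, start, end) triples under their category.
by_category = defaultdict(list)
for (category, label), start, end in keywords_extracted:
    by_category[category].append((label, start, end))

print(dict(by_category))
```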

In [5]:
"""
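add_keywords_from_dict   Takes {clean_name: [keywords]}; extract_keywords then returns the key
                         (clean name) for every keyword that is matched
add_keywords_from_list   Takes a plain list of keywords; each keyword is its own clean name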
"""
keyword_processor = KeywordProcessor()

keyword_dict = {
    "immigration_status" : ["Tier", "Start-up", "Innovator", "Global Talent"],
    "UK_regions" : ["non-EEA", "England", "Wales", "Scotland", "Northern Ireland"]
}

keyword_processor.add_keywords_from_dict(keyword_dict)
print(len(keyword_processor))
print(keyword_processor.extract_keywords(document))

9
['immigration_status', 'immigration_status', 'UK_regions', 'immigration_status']
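
One way to read the result: the dict is effectively inverted, so every keyword in the lists points back to its key. The sketch below does that inversion in plain Python (it is not how flashtext stores things internally, which is a trie) and shows where the `9` comes from.

```python
keyword_dict = {
    "immigration_status": ["Tier", "Start-up", "Innovator", "Global Talent"],
    "UK_regions": ["non-EEA", "England", "Wales", "Scotland", "Northern Ireland"],
}

# Invert {clean_name: [keywords]} into {keyword: clean_name}.
keyword_to_clean = {
    keyword: clean_name
    for clean_name, keywords in keyword_dict.items()
    for keyword in keywords
}

print(len(keyword_to_clean))      # 4 + 5 keywords
print(keyword_to_clean["Wales"])  # the key that would be returned for 'Wales'
```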


### Single file usage examples

In [6]:
"""
The data contains the raw text of each page
"""
df = pd.read_csv("input/2020-07-16_Tier-2-5-sponsor-guidance_Jul-2020_v1.0_data.csv", index_col=0)

In [7]:
"""
Copied the list of keywords created by @Bernhard and @Joao
https://docs.google.com/spreadsheets/d/1ViWKwayEa-k5mdp2T9swBnQLq0L_21uITWRRK46AWCQ/edit#gid=0

Alternatively, use the one @Larisa shared
https://docs.google.com/spreadsheets/d/1E4RWT0MUCzU5UpVuVGGcjUOqYyIfMuhD8oyodtDcu_w/edit#gid=0
"""
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file('input/Tier-2-5-sponsor-guidance_Jul-2020_v1.0_labels.txt')
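
As far as I understand it, the labels file has one keyword per line, optionally written as `keyword=>clean name`. The sketch below mimics that parsing in plain Python to make the format concrete. The sample lines are made up for illustration (they are not the contents of the real labels file), and `parse_keyword_lines` is not a flashtext function.

```python
def parse_keyword_lines(lines):
    """Map each keyword to its clean name, assuming 'keyword=>clean name' or 'keyword'."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            keyword, clean_name = line.split("=>", 1)
            mapping[keyword.strip()] = clean_name.strip()
        else:
            keyword = line.strip()
            if keyword:  # skip blank lines
                mapping[keyword] = keyword
    return mapping


# Hypothetical file contents, for illustration only.
sample_lines = ["CoS=>certificates of sponsorship\n", "sponsor\n", "Home Office\n", "\n"]
print(parse_keyword_lines(sample_lines))
```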

In [9]:
"""
Loop over each page and extract the given keywords (stops after the first few pages)
"""
for i in range(len(df)):
    page_document = df.raw_text.iloc[i]
    print("\n+++++++++++++ Page number : %d +++++++++++++++++++++++++\n" % (i+1))
    keywords_extracted = keyword_processor.extract_keywords(page_document)
    print(keywords_extracted)
    print()
    print(set(keywords_extracted))
    if i > 2:
        break


+++++++++++++ Page number : 1 +++++++++++++++++++++++++

['tier', 'guidance', 'tier', 'guidance', 'tier', 'sponsor', 'licence', 'sponsor', 'tier', 'tier', 'licence', 'sponsor', 'guidance', 'sponsorship', 'Home Office', 'Home Office', 'sponsor', 'licence', 'sponsor']

{'sponsorship', 'guidance', 'tier', 'sponsor', 'licence', 'Home Office'}

+++++++++++++ Page number : 2 +++++++++++++++++++++++++

['guidance', 'guidance', 'tier', 'tier', 'sponsor', 'licence', 'certificates of sponsorship', 'guidance']

{'guidance', 'certificates of sponsorship', 'tier', 'sponsor', 'licence'}

+++++++++++++ Page number : 3 +++++++++++++++++++++++++

['tier', 'guidance', 'guidance', 'guidance', 'sponsor', 'guidance', 'licence', 'sponsorship', 'tier', 'tier', 'sponsorship', 'licence', 'licence', 'tier', 'tier', 'tier', 'tier', 'tier', 'tier', 'tier', 'tier', 'tier', 'tier', 'Government Authorised Exchange', 'tier', 'guidance']

{'sponsorship', 'guidance', 'tier', 'sponsor', 'licence', 'Government Authorise
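
The raw lists repeat a lot, and `set()` throws the frequencies away. A `Counter` keeps both. A sketch using the page 2 output above, copied in by hand:

```python
from collections import Counter

# Keywords extracted from page 2, copied from the output above.
page_keywords = [
    "guidance", "guidance", "tier", "tier", "sponsor", "licence",
    "certificates of sponsorship", "guidance",
]

counts = Counter(page_keywords)
print(counts.most_common(2))
# [('guidance', 3), ('tier', 2)]
```

In the loop above the same thing could be done with `Counter(keywords_extracted)` for each page.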

### Other uses

This part was not entirely clear to me, but it finds keywords by fuzzy matching (Levenshtein distance).

In [10]:
"""
Retrieve the nodes where there is a fuzzy match,
via levenshtein distance, and with respect to max_cost
Args:
    word (str): word to find a fuzzy match for
    max_cost (int): maximum levenshtein distance when performing the fuzzy match
    start_node (dict): Trie node from which the search is performed
Yields:
    node, cost, depth (tuple): A tuple containing the final node,
                              the cost (i.e the distance), and the depth in the trie
"""
keyword_processor = KeywordProcessor(case_sensitive=True)
keyword_processor.add_keyword('Marie', 'Mary')
next(keyword_processor.levensthein('Maria', max_cost=1))

({'_keyword_': 'Mary'}, 1, 5)

In [11]:
keyword_processor = KeywordProcessor(case_sensitive=True)
keyword_processor.add_keyword('Marie Blanc', 'Mary')
next(keyword_processor.levensthein('Mari', max_cost=1))

({' ': {'B': {'l': {'a': {'n': {'c': {'_keyword_': 'Mary'}}}}}}}, 1, 5)
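
The nested dict in that output looks like the rest of the keyword trie below the point where the match ended. The sketch below builds a toy trie the same way (nested dicts, one character per level, with a `'_keyword_'` entry at the end) to show where such a node comes from. It is a simplified illustration, not flashtext's own code; `build_trie` and `walk` are names invented here.

```python
def build_trie(keywords):
    """Build a character trie as nested dicts from {keyword: clean_name}."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for char in keyword:
            node = node.setdefault(char, {})
        node["_keyword_"] = clean_name  # marks the end of a keyword
    return root


def walk(trie, prefix):
    """Follow `prefix` character by character and return the node reached."""
    node = trie
    for char in prefix:
        node = node[char]
    return node


trie = build_trie({"Marie Blanc": "Mary"})
print(walk(trie, "Marie"))
# {' ': {'B': {'l': {'a': {'n': {'c': {'_keyword_': 'Mary'}}}}}}}
```

So for `'Mari'` with `max_cost=1`, the search seems to have spent its one edit on the missing `'e'` and stopped at depth 5, with `' Blanc'` still left to match.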

In [12]:
keyword_proc = KeywordProcessor()
keyword_proc.add_keyword('keyword')
keyword_proc.add_keyword('keyword with many words')
sentence = "This sentence contains a keywrd with many woords"
print(keyword_proc.extract_keywords(sentence, span_info=True, max_cost=2))
print(keyword_proc.extract_keywords(sentence, span_info=True, max_cost=1))

[('keyword with many words', 25, 48)]
[('keyword', 25, 31)]
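
To see why `max_cost` changes the result, it helps to compute the edit distances by hand. Below is a textbook dynamic-programming Levenshtein distance. It is a plain sketch for checking the numbers, not the trie-based search that flashtext runs.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Maria", "Marie"))     # 1
print(levenshtein("keywrd", "keyword"))  # 1
print(levenshtein("keywrd with many woords", "keyword with many words"))  # 2
```

The whole phrase needs two edits (a missing `o` and an extra `o`), so it only matches with `max_cost=2`. With `max_cost=1` there is only enough budget for `keywrd`, which is why the second call returns just `'keyword'`.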
