# keyword_extraction

Extract keywords (e.g. MeSH terms) from a text using a simple dictionary lookup approach.




### from raw data to lookup list
Download the MeSH terms as 'Datenbankfassung' (database version) from [here](https://www.dimdi.de/dynamic/de/klassifikationen/weitere-klassifikationen-und-standards/mesh/) and extract the archive. This source is the publisher of the German MeSH 2019. Only the English terms are used here.
    

In [1]:
import pandas as pd
MH  = pd.read_csv('MH.TXT',
                  sep=';',
                  quotechar='|',
                  encoding='latin1',
                  header=None,
                  names=['id','term_german','term','subheadings']
                 ).drop(columns=['term_german','subheadings'])

ETE = pd.read_csv('ETE.TXT',
                  sep=';',
                  quotechar='|',
                  encoding='latin1',
                  header=None,
                  names=['term','id','term_german']
                 ).drop(columns=['term_german'])[['id','term']]

lookuplist = pd.concat([MH,ETE]).reset_index(drop=True)
lookuplist.head()

| | id | term |
|---|---|---|
| 0 | D042842 | 11-beta-Hydroxysteroid Dehydrogenases |
| 1 | D043205 | 11-beta-Hydroxysteroid Dehydrogenase Type 1 |
| 2 | D043209 | 11-beta-Hydroxysteroid Dehydrogenase Type 2 |
| 3 | D015062 | 11-Hydroxycorticosteroids |
| 4 | D014013 | 1,2-Dihydroxybenzene-3,5-Disulfonic Acid Disod... |


You can use any DataFrame structured like this one for extracting keywords.

The column names `id` and `term` are mandatory.

Be aware that the values in the `term` column must be unique!
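Before building the dictionaries, it can help to check that a custom lookup list meets these requirements. The following is only a minimal sketch and not part of `keyword_extraction`: the helper name `check_lookuplist` and the two toy rows are assumptions made for illustration.

```python
import pandas as pd


def check_lookuplist(df: pd.DataFrame) -> pd.DataFrame:
    """Return df unchanged if it has 'id' and 'term' columns and unique terms."""
    # The column names 'id' and 'term' are mandatory.
    missing = {"id", "term"} - set(df.columns)
    if missing:
        raise ValueError(f"missing mandatory columns: {sorted(missing)}")
    # Every value in 'term' must occur exactly once.
    duplicated = df.loc[df["term"].duplicated(keep=False), "term"].unique()
    if len(duplicated) > 0:
        raise ValueError(f"duplicate terms: {list(duplicated)}")
    return df


# Toy lookup list (hypothetical example rows in the expected layout).
toy = pd.DataFrame({"id": ["D008288", "D010961"],
                    "term": ["Malaria", "Plasmodium"]})
checked = check_lookuplist(toy)
```

If duplicates are reported, one option is to remove them with `drop_duplicates(subset='term')` before creating the dictionaries.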

### create dictionary

In [2]:
from keyword_extraction import DictLU_Create_Dict

DCC = DictLU_Create_Dict(lookuplist)
dicts_lower = DCC.dicts_lower
dicts_upper = DCC.dicts_upper

list(dicts_lower[4].items())[:5]

[('3-pyridinecarboxylic acid, 1,4-dihydro-2,6-dimethyl-5-nitro-4-(2-(trifluoromethyl)phenyl)-, methyl ester',
  'D001498'),
 ('disorder of sex development, 46,xy', 'D058490'),
 ('adapalene, benzoyl peroxide drug combination', 'D000068801'),
 ('adaptive clinical trials as topic', 'D000075522'),
 ('adaptor protein complex alpha subunits', 'D033965')]

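The result is two lists of dictionaries, each grouped by the number of words in a term: the first dictionary holds one-word terms, the second holds two-word terms, and so on (so `dicts_lower[4]` above holds five-word terms).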

 - dicts_upper: terms written entirely in upper case (like 'WHO', 'HIV')
 - dicts_lower: all other (mixed-case) terms, stored in lower case
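
To make the structure concrete, here is a minimal sketch of how such length-grouped dictionaries could be built. It is not the implementation of `DictLU_Create_Dict`: the function name `build_length_dicts`, the whitespace-based word count, and the use of `str.isupper()` for the case split are all assumptions, and the id for 'HIV' is a placeholder.

```python
def build_length_dicts(pairs):
    """Group (term, id) pairs into per-word-count dictionaries.

    Returns (dicts_lower, dicts_upper); index i holds terms with i + 1 words.
    """
    dicts_lower, dicts_upper = [], []
    for term, term_id in pairs:
        n_words = len(term.split())
        if n_words == 0:
            continue  # ignore empty terms
        # All-uppercase terms keep their spelling; all others are lower-cased.
        if term.isupper():
            target, key = dicts_upper, term
        else:
            target, key = dicts_lower, term.lower()
        # Grow the list until a dictionary for this word count exists.
        while len(target) < n_words:
            target.append({})
        target[n_words - 1][key] = term_id
    return dicts_lower, dicts_upper


pairs = [
    ("Malaria", "D008288"),
    ("HIV", "X000001"),  # placeholder id, for illustration only
    ("Adaptive Clinical Trials as Topic", "D000075522"),
]
lower, upper = build_length_dicts(pairs)
```

Grouping by word count means a lookup only has to compare a candidate phrase against terms of the same length.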

It is best to save these dictionaries for later use.

In [3]:
import pickle

with open('MeSH_dict.p', 'wb') as handle:
    pickle.dump([dicts_lower,dicts_upper], handle)
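
In a later session the dictionaries can be read back with `pickle.load`, unpacking them in the same order in which they were dumped. The following is a minimal round-trip sketch: the toy dictionaries stand in for the real ones, and a temporary directory is used (an assumption for this example) so that no existing file is overwritten. Only unpickle files you trust.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real dictionary lists.
dicts_lower = [{"malaria": "D008288"}]
dicts_upper = [{}]

path = os.path.join(tempfile.mkdtemp(), "MeSH_dict.p")

# Save both lists in one file, as in the cell above.
with open(path, "wb") as handle:
    pickle.dump([dicts_lower, dicts_upper], handle)

# Load them again; the order matches the order used when dumping.
with open(path, "rb") as handle:
    loaded_lower, loaded_upper = pickle.load(handle)
```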

### extract keywords

Take a text or an abstract as an example, e.g. one result for the search term 'Malaria' from [Livivo](https://www.livivo.de).

In [4]:
text = 'Malaria is infectious diseases caused by Plasmodium parasite, which transmitted by Anopheles mosquitoes. Although the global burden of malaria has been decreasing in recent years, malaria remains one of the most important infectious diseases, from the point of view of its morbidity and mortality. Imported malaria is one of the major concerns at the evaluation of a febrile illness in a traveler returned from the endemic countries. The diagnosis and management of malaria cases requires much experience and knowledge. We review the epidemiology, pathogenesis, clinical features, diagnosis, prevention and treatment of malaria in Japan.'
from pprint import pprint
pprint(text)

('Malaria is infectious diseases caused by Plasmodium parasite, which '
 'transmitted by Anopheles mosquitoes. Although the global burden of malaria '
 'has been decreasing in recent years, malaria remains one of the most '
 'important infectious diseases, from the point of view of its morbidity and '
 'mortality. Imported malaria is one of the major concerns at the evaluation '
 'of a febrile illness in a traveler returned from the endemic countries. The '
 'diagnosis and management of malaria cases requires much experience and '
 'knowledge. We review the epidemiology, pathogenesis, clinical features, '
 'diagnosis, prevention and treatment of malaria in Japan.')


In [5]:
from keyword_extraction import DictLU_Extract_Exact

DEE = DictLU_Extract_Exact(dicts_upper, dicts_lower)

DEE.run(text)

DEE.result

{'D003933': {'count': 2, 'term': 'diagnosis', 'pos': [(438, 447), (581, 590)]},
 'D010961': {'count': 1, 'term': 'plasmodium', 'pos': [(41, 51)]},
 'D007564': {'count': 1, 'term': 'japan', 'pos': [(631, 636)]},
 'D008288': {'count': 6,
  'term': 'malaria',
  'pos': [(0, 7), (135, 142), (180, 187), (307, 314), (466, 473), (620, 627)]},
 'D009026': {'count': 1, 'term': 'mortality', 'pos': [(287, 296)]},
 'D000852': {'count': 1, 'term': 'anopheles', 'pos': [(83, 92)]},
 'D010271': {'count': 1, 'term': 'parasite', 'pos': [(52, 60)]},
 'D019359': {'count': 1, 'term': 'knowledge', 'pos': [(509, 518)]},
 'D004813': {'count': 1, 'term': 'epidemiology', 'pos': [(534, 546)]},
 'D016454': {'count': 1, 'term': 'review', 'pos': [(523, 529)]}}
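
The result maps each id to the matched term, its number of occurrences, and a list of `(start, end)` character offsets into the text, where `end` is exclusive (for example, `(0, 7)` covers 'Malaria'). The following sketch shows one way to work with a dictionary of that shape; the short text and the hand-written `result` are assumptions made to keep the example self-contained.

```python
text = "Malaria is caused by Plasmodium. Imported malaria is a concern."

# Hand-written result in the same shape as DEE.result above.
result = {
    "D010961": {"count": 1, "term": "plasmodium", "pos": [(21, 31)]},
    "D008288": {"count": 2, "term": "malaria", "pos": [(0, 7), (42, 49)]},
}

# Rank the ids by how often their term occurs, most frequent first.
ranked = sorted(result.items(), key=lambda item: item[1]["count"], reverse=True)
top_id, top_entry = ranked[0]

# Slice the original text with the offsets to recover the matched strings.
matches = [text[start:end] for start, end in top_entry["pos"]]
```

Slicing with the offsets returns the words as they appear in the text, including their original capitalization, while `term` holds the dictionary key.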

For a better visual overview, you can also generate a simple HTML string in which the found words are shown in red.

GitHub does not render this HTML correctly (the words are not shown in red), so please check the output locally.

In [6]:
DEE.create_html()
htmlstring = DEE.html

In [7]:
from IPython.display import display, HTML
display(HTML(htmlstring))
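
For readers who want to build a similar view from the `pos` offsets themselves, here is a minimal sketch. It is not the code behind `create_html()`: the function name `highlight`, the inline `style` attribute, and the handling of overlapping spans are assumptions.

```python
import html


def highlight(text, spans, color="red"):
    """Wrap each (start, end) span of text in a colored <span> element."""
    parts, cursor = [], 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap an already highlighted one
        # Escape the plain text before the match, then the match itself.
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<span style="color:{color}">{html.escape(text[start:end])}</span>'
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


snippet = highlight("Malaria in Japan", [(0, 7), (11, 16)])
```

The resulting string can be passed to `HTML(...)` in the same way as `htmlstring` above.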