# keyword_extraction

### Extracting keywords (e.g. MeSH terms) from a text using a simple dictionary lookup approach

The **keyword_extraction** library offers:
 * **DictLU_Create_Dict**: creates English, German and French dictionaries for the lookup
 * **DictLU_Extract_Exact**
   *  **.fast**: stops after the first occurrence of a term in the text  
   *  **.full**: finds all occurrences in the text, including their positions
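
To make the lookup idea concrete, here is a minimal sketch of an exact dictionary lookup over word n-grams. It is not the library's implementation: the names `iter_matches`, `find_all` and `find_first`, the whitespace tokenization, the punctuation handling and the toy ids are all assumptions made for illustration.

```python
import re

# Toy lookup tables: index i holds terms made of i+1 words (assumed layout, toy ids).
TOY_DICTS = [
    {"malaria": "T1", "anopheles": "T2"},
    {"infectious diseases": "T3"},
]

def iter_matches(text, dicts):
    """Yield (id, start, end) for every n-gram of the text found in a table."""
    toks = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    for n, table in enumerate(dicts, start=1):        # n = number of words
        for i in range(len(toks) - n + 1):
            start, end = toks[i][0], toks[i + n - 1][1]
            # lower-case the candidate and drop trailing punctuation only
            candidate = text[start:end].lower().rstrip(".,;:!?)")
            if candidate in table:
                yield table[candidate], start, start + len(candidate)

def find_all(text, dicts):
    """'full' style: every occurrence of every id, with character positions."""
    hits = {}
    for term_id, start, end in iter_matches(text, dicts):
        hits.setdefault(term_id, []).append((start, end))
    return hits

def find_first(text, dicts):
    """'fast' style: report each id once, without positions."""
    found = []
    for term_id, _, _ in iter_matches(text, dicts):
        if term_id not in found:
            found.append(term_id)
    return found
```

Calling `find_all(text, TOY_DICTS)` returns a dict of ids with their `(start, end)` spans, while `find_first` returns only the list of ids.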


In [301]:
import pandas as pd
import pickle
from pprint import pprint
import xmltodict
from IPython.core.display import display, HTML

# 1. From raw data to lookup list
First, get the raw MeSH term data:

* **English, German**: as 'Datenbankfassung' (database version, csv files) from [DIMDI](https://www.dimdi.de/dynamic/de/klassifikationen/weitere-klassifikationen-und-standards/mesh/)
* **French**: as XML from [Inserm](http://mesh.inserm.fr/FrenchMesh/)



    

## 1.1. English

In [84]:
## read in raw data
MH  = pd.read_csv('MH.TXT',
                  sep=';',
                  quotechar='|',
                  encoding='latin1',
                  header=None,
                  names=['id','term_german','term','subheadings']
                 ).drop(columns=['term_german','subheadings'])
print(f'loaded {len(MH)} main headings')

ETE = pd.read_csv('ETE.TXT',
                  sep=';',
                  quotechar='|',
                  encoding='latin1',
                  header=None,
                  names=['term','id','term_german']
                 ).drop(columns=['term_german'])[['id','term']]
print(f'loaded {len(ETE)} synonyms')

lookuplist = pd.concat([MH,ETE]).reset_index(drop=True)
print(f'    -> {len(lookuplist)} terms in total\n')

print(lookuplist.head(2))
print('\n####################\n')


# You can use any DataFrame structured like this one for extracting keywords.
# The column names 'id' and 'term' are mandatory.
# Be aware that the values in the "term" column must be unique!


# create dictionary
from keyword_extraction import DictLU_Create_Dict

DCC = DictLU_Create_Dict(lookuplist)
dicts_lower = DCC.dicts_lower
dicts_upper = DCC.dicts_upper


pprint(list(dicts_lower[3].items())[:5])


# What we get now is a list of dictionaries, grouped by the number of words per term:
# index 0 holds one-word terms, index 1 two-word terms, etc.

# - dicts_upper: only upper-case words (like 'WHO', 'HIV')
# - dicts_lower: mixed-case words

# Better save those dictionaries for later use.



with open('MeSH_dict_english.p', 'wb') as handle:
    pickle.dump([dicts_lower,dicts_upper], handle)

loaded 29351 main headings
loaded 214879 synonyms
    -> 244230 terms in total

        id                                         term
0  D042842        11-beta-Hydroxysteroid Dehydrogenases
1  D043205  11-beta-Hydroxysteroid Dehydrogenase Type 1

####################

[('11-beta-hydroxysteroid dehydrogenase type 1', 'D043205'),
 ('11-beta-hydroxysteroid dehydrogenase type 2', 'D043209'),
 ('1,2-dihydroxybenzene-3,5-disulfonic acid disodium salt', 'D014013'),
 ('carbon-13 magnetic resonance spectroscopy', 'D066241'),
 ('15-hydroxy-11 alpha,9 alpha-(epoxymethano)prosta-5,13-dienoic acid',
  'D019796')]
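
The cell above yields one dictionary per term length, split into an upper-case and a mixed-case family. A rough sketch of how such a structure could be built from `(id, term)` pairs follows. The function `build_dicts` and its rules (counting words by whitespace, using `str.isupper` to pick the upper-case family, lower-casing all other keys) are assumptions and not the behaviour of `DictLU_Create_Dict`.

```python
def build_dicts(pairs):
    """Group (id, term) pairs into per-word-count lookup tables.

    Returns (dicts_lower, dicts_upper); index i holds terms of i+1 words.
    """
    dicts_lower, dicts_upper = [], []
    for term_id, term in pairs:
        n_words = len(term.split())
        if n_words == 0:
            continue                          # skip empty terms
        # all-caps terms such as 'HIV' keep their case; everything else is lower-cased
        if term.isupper():
            family, key = dicts_upper, term
        else:
            family, key = dicts_lower, term.lower()
        while len(family) < n_words:          # grow the list up to the needed length
            family.append({})
        family[n_words - 1].setdefault(key, term_id)   # first id wins on duplicates
    return dicts_lower, dicts_upper

pairs = [
    ("D043205", "11-beta-Hydroxysteroid Dehydrogenase Type 1"),
    ("D000852", "Anopheles"),
    ("X000001", "HIV"),                       # made-up id, for illustration only
]
lower, upper = build_dicts(pairs)
```

Here the four-word term ends up in `lower[3]`, the single word in `lower[0]`, and the all-caps entry in `upper[0]`.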


## 1.2. German

In [399]:
## read in raw data
MH  = pd.read_csv('MH.TXT',
                  sep=';',
                  quotechar='|',
                  encoding='latin1',
                  header=None,
                  names=['id','term','term_english','subheadings']
                 ).drop(columns=['term_english','subheadings'])
print(f'loaded {len(MH)} main headings')

ETD = pd.read_csv('ETD.TXT',
                  sep=';',
                  quotechar='|',
                  encoding='latin1',
                  header=None,
                  names=['term','id','null']
                 )[['id','term']]
print(f'loaded {len(ETD)} synonyms')


lookuplist = pd.concat([MH,ETD]).reset_index(drop=True)
## manual correction
lookuplist.loc[lookuplist['id']=='D007060','term']='Id'


print(f'    -> {len(lookuplist)} terms in total\n')



print(lookuplist.head(2))
print('\n####################\n')

# create dictionary
from keyword_extraction import DictLU_Create_Dict

DCC = DictLU_Create_Dict(lookuplist)
dicts_lower = DCC.dicts_lower
dicts_upper = DCC.dicts_upper


pprint(list(dicts_lower[1].items())[:5])



with open('MeSH_dict_german.p', 'wb') as handle:
    pickle.dump([dicts_lower,dicts_upper], handle)



loaded 29351 main headings
loaded 68789 synonyms
    -> 98140 terms in total

        id                                        term
0  D042842       11-Beta-Hydroxysteroid-Dehydrogenasen
1  D043205  11-Beta-Hydroxysteroid-Dehydrogenase Typ 1

####################

[('1-sarcosin-8-isoleucin-angiotensin ii', 'D015059'),
 ('24,25-dihydroxyvitamin d3', 'D015650'),
 ('25-hydroxyvitamin d2', 'D015652'),
 ('2h-benzo(a)quinolizin-2-ol, '
  '2-ethyl-1,3,4,6,7,11b-hexahydro-3-isobutyl-9,10-dimethoxy',
  'D012369'),
 ('2-oxoisovalerat-dehydrogenase (acylierend)', 'D050645')]


## 1.3. French

In [307]:
# get raw data
with open('fredesc2019.xml','r') as xml_obj:
    my_dict = xmltodict.parse(xml_obj.read())   # the with-block closes the file on exit

In [308]:
## for debugging ONE item

x=my_dict['DescriptorRecordSet']['DescriptorRecord'][666]


res=[]

print("['DescriptorRecordSet']['DescriptorRecord']")
print(type(x))



procname=x['DescriptorName']['String'].split('[')[0]
print('                                      ->',x['DescriptorUI'], x['DescriptorName']['String'],'  -> ',procname)

res.append((x['DescriptorUI'],'fre',procname))

tt=x['ConceptList']['Concept']

# xmltodict returns a single child as a dict and repeated children as a list
if not isinstance(tt,list):
    tt=[tt]
    print('          changed to list:', type(tt))

for ii,y in enumerate(tt):
    print('    yy',ii,type(y))
    cc=y['TermList']['Term']
    print('        cc',type(cc))
    if not isinstance(cc,list):
        cc=[cc]
        print('          changed to list:', type(cc))

    for dd in cc:
        print('                                      ->',dd['TermUI'],dd['String'])
        res.append((x['DescriptorUI'],dd['TermUI'],dd['String']))

        
        
    

['DescriptorRecordSet']['DescriptorRecord']
<class 'collections.OrderedDict'>
                                      -> D000070996 Sous-famille G des transporteurs à cassette liant l'ATP[ATP Binding Cassette Transporter, Subfamily G]   ->  Sous-famille G des transporteurs à cassette liant l'ATP
          changed to list: <class 'list'>
    yy 0 <class 'collections.OrderedDict'>
        cc <class 'list'>
                                      -> T000939220 ATP Binding Cassette Transporter, Subfamily G
                                      -> T000892291 ABCG Proteins
                                      -> T000892292 ABCG Transporters
                                      -> T000894708 ATP Binding Cassette Transporter, Sub-Family G
                                      -> T000894708 ATP Binding Cassette Transporter, Sub Family G
                                      -> fre0141546 Sous-famille G des transporteurs à cassette liant l'ATP
                                      -> fre0141548 Pr

In [311]:

## ALL items


res=[]

print(f'''loaded {len(my_dict['DescriptorRecordSet']['DescriptorRecord'])} main headings''')

for x in my_dict['DescriptorRecordSet']['DescriptorRecord']:

    # main heading: keep the French name, drop the English name in brackets
    procname=x['DescriptorName']['String'].split('[')[0]
    res.append((x['DescriptorUI'],'fre',procname))

    # a single child comes back as a dict, so wrap it in a list
    tt=x['ConceptList']['Concept']
    if not isinstance(tt,list):
        tt=[tt]

    for y in tt:
        cc=y['TermList']['Term']
        if not isinstance(cc,list):
            cc=[cc]

        for dd in cc:
            res.append((x['DescriptorUI'],dd['TermUI'],dd['String']))

# keep only the French entries (main headings tagged 'fre' and TermUIs starting with 'fre')
res2 = [(x[0],x[2]) for x in res if x[1][:3]=='fre']


tmplookuplist = pd.DataFrame({
    'id':[x[0] for x in res2],
    'term':[x[1] for x in res2]
})
lookuplist=tmplookuplist.drop_duplicates(keep='first').reset_index(drop=True)
print(f'    -> {len(lookuplist)} terms in total\n')


# create dictionary
from keyword_extraction import DictLU_Create_Dict

DCC = DictLU_Create_Dict(lookuplist)
dicts_lower = DCC.dicts_lower
dicts_upper = DCC.dicts_upper


pprint(list(dicts_lower[3].items())[:5])



with open('MeSH_dict_french.p', 'wb') as handle:
    pickle.dump([dicts_lower,dicts_upper], handle)

loaded 29351 main headings
    -> 122830 terms in total

[("lésions traumatiques de l'abdomen", 'D000007'),
 ('nerf moteur oculaire externe', 'D000010'),
 ('abelson murine leukemia virus', 'D000011'),
 ('malformations dues aux médicaments', 'D000014'),
 ('malformations dues aux radiations', 'D000016')]
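
Both French cells repeat the same check, because `xmltodict` returns a single child element as a dict but repeated children as a list. A small helper can express this once. `as_list` and `iter_terms` are hypothetical names, and the record below is a hand-made stand-in shaped like the records printed above, not real MeSH data.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through unchanged."""
    return value if isinstance(value, list) else [value]

def iter_terms(record):
    """Yield (DescriptorUI, TermUI, String) for every term of one descriptor record."""
    for concept in as_list(record["ConceptList"]["Concept"]):
        for term in as_list(concept["TermList"]["Term"]):
            yield record["DescriptorUI"], term["TermUI"], term["String"]

# Hand-made record: one concept (a dict, not a list) holding two terms.
record = {
    "DescriptorUI": "D0000000",
    "ConceptList": {"Concept": {"TermList": {"Term": [
        {"TermUI": "T0000001", "String": "Example term"},
        {"TermUI": "fre0000001", "String": "Terme d'exemple"},
    ]}}},
}

# Keep only terms whose TermUI starts with 'fre', as in the cell above.
french = [(d, s) for d, ui, s in iter_terms(record) if ui.startswith("fre")]
```

As an alternative, `xmltodict.parse` accepts a `force_list` argument that can make selected elements always come back as lists, which would remove the need for the check.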


# 2. Examples

For a better visual overview, you can also generate a simple HTML view in which the matched words are shown in red.

GitHub does not render this HTML correctly (the matches are not shown in red), so please check the output locally.
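
As a rough idea of how such a view can be produced from match positions, the following sketch wraps each matched span in a red `<span>`. It assumes a result shaped like the `.full` output shown in this README (`{id: {'count': ..., 'pos': [(start, end), ...], 'term': ...}}` with end-exclusive positions). `highlight` is a hypothetical helper and not the library's `create_html`.

```python
import html

def highlight(text, result, color="red"):
    """Return text as HTML with every matched (start, end) span coloured."""
    spans = sorted({tuple(pos) for entry in result.values() for pos in entry["pos"]})
    parts, cursor = [], 0
    for start, end in spans:
        if start < cursor:                # skip spans overlapping an earlier match
            continue
        parts.append(html.escape(text[cursor:start]))
        parts.append(f'<span style="color:{color}">{html.escape(text[start:end])}</span>')
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)

text = "transmitted by Anopheles mosquitoes"
result = {"D000852": {"count": 1, "pos": [(15, 24)], "term": "anopheles"}}
page = highlight(text, result)
```

In a notebook, `display(HTML(page))` would then render the string.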


## 2.1. English 

In [388]:
text = 'Malaria is infectious diseases caused by Plasmodium parasite, which transmitted by Anopheles mosquitoes. Although the global burden of malaria has been decreasing in recent years, malaria remains one of the most important infectious diseases, from the point of view of its morbidity and mortality. Imported malaria is one of the major concerns at the evaluation of a febrile illness in a traveler returned from the endemic countries. The diagnosis and management of malaria cases requires much experience and knowledge. We review the epidemiology, pathogenesis, clinical features, diagnosis, prevention and treatment of malaria in Japan.'
pprint(text)
print('\n###  FAST  #############\n')

from keyword_extraction import DictLU_Extract_Exact

with open('MeSH_dict_english.p', 'rb') as handle:
    dicts_lower, dicts_upper = pickle.load(handle)

DEE=DictLU_Extract_Exact(dicts_upper,dicts_lower)

DEE.fast(text)

pprint(DEE.fast_ids)
            
DEE.full(text)
print('\n###  FULL  #############\n')


pprint(DEE.result)

DEE.create_html()   ## this works when using the full method, not when using the fast method
htmlstring = DEE.html

print('\n###  HTML  #############')
display(HTML(htmlstring))

('Malaria is infectious diseases caused by Plasmodium parasite, which '
 'transmitted by Anopheles mosquitoes. Although the global burden of malaria '
 'has been decreasing in recent years, malaria remains one of the most '
 'important infectious diseases, from the point of view of its morbidity and '
 'mortality. Imported malaria is one of the major concerns at the evaluation '
 'of a febrile illness in a traveler returned from the endemic countries. The '
 'diagnosis and management of malaria cases requires much experience and '
 'knowledge. We review the epidemiology, pathogenesis, clinical features, '
 'diagnosis, prevention and treatment of malaria in Japan.')

###  FAST  #############

['D003141',
 'D000852',
 'D003933',
 'D004813',
 'D007564',
 'D008288',
 'D009017',
 'D009026',
 'D010961',
 'D016454',
 'D019359',
 'D009033',
 'D010271',
 'D013812']

###  FULL  #############

{'D000852': {'count': 1, 'pos': [(83, 92)], 'term': 'anopheles'},
 'D003141': {'count': 2,
             

## 2.2. German

In [401]:
text='''Die vorliegende Arbeit beschäftigt sich mit dem Thema der Malaria, die nicht nur in Entwicklungsländern eine Rolle spielt, sondern mit der medizinisches Personal auch in Europa zu kämpfen hat. Es wird der Frage nachgegangen, welche geschichtlichen Hintergründe die Malaria hat und wie die Menschen damals dieses Krankheitsbild therapiert und der Krankheit vorgebeugt haben. Außerdem wird behandelt, welche prophylaktischen Möglichkeiten es gegen die Malaria gibt, wie wirksam diese sind und ob es schon einen wirksamen Impfstoff gegen diese Weltkrankheit gibt. Ein Schwerpunkt liegt auf den Tätigkeiten, mit denen Pflegepersonal bei der Pflege Malariaerkrankter konfrontiert ist. Um Antworten auf diese Fragen zu finden, wurden in einer umfangreichen Literaturrecherche Fachbücher, Fachzeitschriften, Forschungsartikel im Internet und fachspezifische englische Studien durchsucht und das Wesentliche herausgefiltert. Zusätzlich wurden drei halbstandardisierte Leitfaden-Interviews mit Menschen, die die Malaria persönlich erlebt haben, durchgeführt. Es wird offensichtlich, wie umfangreich das Thema der Malaria und ihrer Prophylaxe- und Therapiemöglichkeiten ist und wie viel noch zu erforschen wäre.'''
pprint(text)
print('\n###  FAST  #############\n')


from keyword_extraction import DictLU_Extract_Exact

with open('MeSH_dict_german.p', 'rb') as handle:
    dicts_lower, dicts_upper = pickle.load(handle)

DEE=DictLU_Extract_Exact(dicts_upper,dicts_lower)

DEE.fast(text)

pprint(DEE.fast_ids)
            
DEE.full(text)
print('\n###  FULL  #############\n')


pprint(DEE.result)

DEE.create_html()   ## this works when using the full method, not when using the fast method
htmlstring = DEE.html

print('\n###  HTML  #############')
display(HTML(htmlstring))

('Die vorliegende Arbeit beschäftigt sich mit dem Thema der Malaria, die nicht '
 'nur in Entwicklungsländern eine Rolle spielt, sondern mit der medizinisches '
 'Personal auch in Europa zu kämpfen hat. Es wird der Frage nachgegangen, '
 'welche geschichtlichen Hintergründe die Malaria hat und wie die Menschen '
 'damals dieses Krankheitsbild therapiert und der Krankheit vorgebeugt haben. '
 'Außerdem wird behandelt, welche prophylaktischen Möglichkeiten es gegen die '
 'Malaria gibt, wie wirksam diese sind und ob es schon einen wirksamen '
 'Impfstoff gegen diese Weltkrankheit gibt. Ein Schwerpunkt liegt auf den '
 'Tätigkeiten, mit denen Pflegepersonal bei der Pflege Malariaerkrankter '
 'konfrontiert ist. Um Antworten auf diese Fragen zu finden, wurden in einer '
 'umfangreichen Literaturrecherche Fachbücher, Fachzeitschriften, '
 'Forschungsartikel im Internet und fachspezifische englische Studien '
 'durchsucht und das Wesentliche herausgefiltert. Zusätzlich wurden drei '
 'halbst

In [384]:
[x for x in dicts_lower[0].items() if x[0][0:4] == 'impf']

[('impfablehnung', 'D000072758'),
 ('impfplan', 'D007115'),
 ('impfprogramme', 'D017589'),
 ('impfschutz-deckungsgrad', 'D000073887'),
 ('impfstoffwirksamkeit', 'D064166'),
 ('impfschutzabdeckung', 'D000073887'),
 ('impfschutzdeckungsgrad', 'D000073887'),
 ('impfstoffe', 'D014612'),
 ('impfstoffstabilitaet', 'D064166'),
 ('impfstoffstabilität', 'D064166'),
 ('impfung', 'D014611'),
 ('impfverweigerung', 'D000072758'),
 ('impfwissenschaften', 'D000078782')]

## 2.3. French

In [386]:
text='''Un essai clinique randomisé, placebo contrôlé a testé chez plus de 27 000 patients avec maladie cardiovasculaire sous statines l’ajout d’un deuxième traitement hypolipémiant par inhibition de la protéine PCSK9, l’évolocumab. Dans une analyse de sous-groupe de cette étude appelée FOURIER, les auteurs ont identifié 2034 patients qui avaient un LDL-cholestérol en dessous de 1,8 mmol/l (médiane 1,7 mmol/l) au début de l’étude. Dans ce groupe, 1030 patients ont reçu de l’évolocumab et ont atteint un taux médian de LDL-cholestérol à 0,5 mmol/l à un an. Le risque de récidive d’événement cardiovasculaire mortel ou non après deux ans était à 4,7 % dans le groupe évolocumab versus 6,8 % dans le groupe placebo. Par comparaison, chez les plus de 25 000 patients avec LDL-cholestérol au-dessus de 1,8 mmol/l (médiane 2,4 mmol/l) au début de l’étude, le risque de récidive cardiovasculaire était à 6,0 % dans le groupe évolocumab versus 7,4 % dans le groupe placebo. Il n’y avait pas de modification de l’effet du traitement hypolipémiant testé en fonction du taux de LDL-cholestérol au départ (p-value pour interaction 0,6) et il n’y avait pas de signal d’effets indésirables majeurs.'''
pprint(text)
print('\n###  FAST  #############\n')


from keyword_extraction import DictLU_Extract_Exact

with open('MeSH_dict_french.p', 'rb') as handle:
    dicts_lower, dicts_upper = pickle.load(handle)

DEE=DictLU_Extract_Exact(dicts_upper,dicts_lower)

DEE.fast(text)

pprint(DEE.fast_ids)
            
DEE.full(text)
print('\n###  FULL  #############\n')
pprint(DEE.result)

DEE.create_html()   ## this works when using the full method, not when using the fast method
htmlstring = DEE.html

print('\n###  HTML  #############')
display(HTML(htmlstring))

('Un essai clinique randomisé, placebo contrôlé a testé chez plus de 27 000 '
 'patients avec maladie cardiovasculaire sous statines l’ajout d’un deuxième '
 'traitement hypolipémiant par inhibition de la protéine PCSK9, l’évolocumab. '
 'Dans une analyse de sous-groupe de cette étude appelée FOURIER, les auteurs '
 'ont identifié 2034 patients qui avaient un LDL-cholestérol en dessous de 1,8 '
 'mmol/l (médiane 1,7 mmol/l) au début de l’étude. Dans ce groupe, 1030 '
 'patients ont reçu de l’évolocumab et ont atteint un taux médian de '
 'LDL-cholestérol à 0,5 mmol/l à un an. Le risque de récidive d’événement '
 'cardiovasculaire mortel ou non après deux ans était à 4,7 % dans le groupe '
 'évolocumab versus 6,8 % dans le groupe placebo. Par comparaison, chez les '
 'plus de 25 000 patients avec LDL-cholestérol au-dessus de 1,8 mmol/l '
 '(médiane 2,4 mmol/l) au début de l’étude, le risque de récidive '
 'cardiovasculaire était à 6,0 % dans le groupe évolocumab versus 7,4 % dans '
 'le

In [385]:
[x for x in dicts_lower[0].items() if x[0] == 'cholestérol']

[]

## Speedtest 1 CPU

In [9]:
from datetime import datetime

In [8]:
from keyword_extraction import DictLU_Extract_Exact
import pickle

with open('MeSH_dict.p', 'rb') as handle:
    dicts_lower, dicts_upper = pickle.load(handle)

DEE=DictLU_Extract_Exact(dicts_upper,dicts_lower)

text = 'Malaria is infectious diseases caused by Plasmodium parasite, which transmitted by Anopheles mosquitoes. Although the global burden of malaria has been decreasing in recent years, malaria remains one of the most important infectious diseases, from the point of view of its morbidity and mortality. Imported malaria is one of the major concerns at the evaluation of a febrile illness in a traveler returned from the endemic countries. The diagnosis and management of malaria cases requires much experience and knowledge. We review the epidemiology, pathogenesis, clinical features, diagnosis, prevention and treatment of malaria in Japan.'
print(text)

Malaria is infectious diseases caused by Plasmodium parasite, which transmitted by Anopheles mosquitoes. Although the global burden of malaria has been decreasing in recent years, malaria remains one of the most important infectious diseases, from the point of view of its morbidity and mortality. Imported malaria is one of the major concerns at the evaluation of a febrile illness in a traveler returned from the endemic countries. The diagnosis and management of malaria cases requires much experience and knowledge. We review the epidemiology, pathogenesis, clinical features, diagnosis, prevention and treatment of malaria in Japan.


In [17]:
a=datetime.now()
for i in range(1000):
    if i%100==0:
        print(i)
    DEE.fast(text)
b=datetime.now()
print('fast extraction',b-a)

a=datetime.now()
for i in range(1000):
    if i%100==0:
        print(i)
    DEE.full(text)
b=datetime.now()
print('full extraction',b-a)

0
100
200
300
400
500
600
700
800
900
fast extraction 0:02:34.371130
0
100
200
300
400
500
600
700
800
900
full extraction 0:02:39.806298
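
With 1000 calls each, the totals above work out to roughly 0.15 s per text for `fast` (154.4 s in total) and 0.16 s per text for `full` (159.8 s in total). For repeated measurements, a small helper based on `time.perf_counter` may be handy. `time_calls` is a hypothetical helper, and the lambda is only a stand-in workload so the sketch runs without the library. With the library you would pass `DEE.fast` or `DEE.full` instead.

```python
import time

def time_calls(func, arg, repeats=1000):
    """Call func(arg) `repeats` times; return (total_seconds, seconds_per_call)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    total = time.perf_counter() - start
    return total, total / repeats

# Stand-in workload (an assumption, not the library call).
total, per_call = time_calls(lambda t: t.lower().split(),
                             "Malaria is an infectious disease",
                             repeats=100)
print(f"total {total:.4f} s, {per_call * 1e6:.1f} microseconds per call")
```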


## Speedtest Multi CPU

In [None]:
import multiprocessing as mp

In [None]:
## Open questions: which is better?
#   - passing the DEE object along to the workers
#   - creating the DEE object anew each time
#
#   - a large or a small chunksize?
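
One possible setup for exploring these questions is sketched below: each worker process builds its extractor once through a pool initializer, and `chunksize` is exposed as a parameter so different values can be timed. `ToyExtractor` is a stand-in and not the `keyword_extraction` API. Whether a `DictLU_Extract_Exact` object can be created or shared this way is an assumption that would need checking.

```python
import multiprocessing as mp

_EXTRACTOR = None   # one extractor per worker process

class ToyExtractor:
    """Stand-in extractor (hypothetical); matches single words against a vocabulary."""
    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)

    def fast(self, text):
        words = {w.strip(".,").lower() for w in text.split()}
        return sorted(words & self.vocabulary)

def init_worker(vocabulary):
    """Runs once in each worker: build the extractor a single time."""
    global _EXTRACTOR
    _EXTRACTOR = ToyExtractor(vocabulary)

def extract_one(text):
    """Runs for every text: reuse the per-worker extractor."""
    return _EXTRACTOR.fast(text)

def run_parallel(texts, vocabulary, processes=2, chunksize=50):
    """Map texts over a pool; chunksize sets how many texts are sent per task."""
    with mp.Pool(processes, initializer=init_worker, initargs=(vocabulary,)) as pool:
        return pool.map(extract_one, texts, chunksize=chunksize)
```

`run_parallel` should be called from under an `if __name__ == '__main__':` guard. Timing it for a few `chunksize` values, and against a variant that builds the extractor inside `extract_one`, would give a first answer to the questions above.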
    