## Part-of-speech tagging and lemmatization with [Stanza NLP](https://stanfordnlp.github.io/stanza/) and MedLexSp

Workspace requirements:
- Python 3.7+
- Stanza NLP: tested with version 1.4.0

#### Import modules and tools

In [1]:
import stanza
import pickle

In [2]:
# Helper functions to change label names and format

def format_pos_name(POS_label_name,predicted_POS):

    ''' Given a part-of-speech tag name from the MedLexSp dictionary, return the corresponding Universal Dependencies tag used in spaCy / Stanza.
        If several tags are possible, the predicted part-of-speech is used to disambiguate.
        E.g. "ADJ;N" (tag name in MedLexSp) and "ADJ" (predicted tag) => output "ADJ".
        E.g. "N;NPR" -> "NOUN" or "PROPN", "ADJ;N" -> "ADJ" or "NOUN"
        The MedLexSp category "AFF" has no equivalent category in spaCy / Stanza; a KeyError is raised for any tag without a mapping.
    '''
    # keys are MedLexSp PoS codes, values are spaCy / Stanza labels
    POSFormat = {'ADJ': 'ADJ', 'ADV': 'ADV', 'N': 'NOUN', 'PREP': 'ADP', 'V': 'VERB', 'art': 'DET', 'NPR': 'PROPN'}

    if ((POS_label_name == "ADJ;ADV") and (predicted_POS == "ADV")):
        return POSFormat['ADV']
    elif ((POS_label_name == "ADJ;ADV") and (predicted_POS == "ADJ")):
        return POSFormat['ADJ']
    elif ((POS_label_name == "N;NPR") and (predicted_POS == "NOUN")):
        return POSFormat['N']
    elif ((POS_label_name == "N;NPR") and (predicted_POS == "PROPN")):
        return POSFormat['NPR']
    elif ((POS_label_name == "ADJ;N") and (predicted_POS == "ADJ")):
        return POSFormat['ADJ']
    elif ((POS_label_name == "ADJ;N") and (predicted_POS == "NOUN")):
        return POSFormat['N']
    else:
        return POSFormat[POS_label_name]


def get_pos_from_lexicon(word,predicted_POS,POSDict):

    ''' Get the part-of-speech category and lemma of a word from the MedLexSp lexicon.
        Returns a (POS, lemma) tuple, or None if the word is not in the lexicon or its tag cannot be mapped. '''

    TuplesList = POSDict.get(word.lower())
    if not TuplesList:
        return None
    # Look up the dictionary using the PoS tag, if several categories are possible: "curva": [('ADJ', 'curvo'), ('N', 'curva')]
    if len(TuplesList)>1:
        # Take the lemma according to PoS predicted by Stanza/spaCy
        for POS, lemma in TuplesList:
            try:
                if format_pos_name(POS,predicted_POS) == predicted_POS:
                    return predicted_POS, lemma
            except KeyError:
                # Tag without a mapping (e.g. "AFF") or not resolved by the prediction: try the next entry
                continue
    # Default value: first entry in the lexicon (PoS and lemma taken from the same tuple)
    POS, lemma = TuplesList[0]
    try:
        return format_pos_name(POS,predicted_POS), lemma
    except KeyError:
        return None
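
The tag conversion described in the `format_pos_name` docstring can also be written in a table-driven way. The following is only a minimal sketch, not the notebook's method: `map_pos_label` is a hypothetical name, and returning `None` (instead of raising `KeyError`) for unmapped or unresolved tags is an assumption made for illustration.

```python
# Mapping copied from the notebook: MedLexSp PoS codes -> Universal Dependencies labels
POS_FORMAT = {'ADJ': 'ADJ', 'ADV': 'ADV', 'N': 'NOUN', 'PREP': 'ADP',
              'V': 'VERB', 'art': 'DET', 'NPR': 'PROPN'}


def map_pos_label(label, predicted_pos):
    """Hypothetical variant of format_pos_name for any ';'-separated label.

    Returns None (an assumption of this sketch) when the label has no
    mapping or when the prediction does not resolve an ambiguous label.
    """
    codes = label.split(';')
    candidates = [POS_FORMAT[code] for code in codes if code in POS_FORMAT]
    if len(codes) == 1:
        # Unambiguous label: the prediction is not needed
        return candidates[0] if candidates else None
    # Ambiguous label: keep the prediction only if the lexicon allows it
    return predicted_pos if predicted_pos in candidates else None


print(map_pos_label('ADJ;N', 'ADJ'))
print(map_pos_label('N;NPR', 'PROPN'))
print(map_pos_label('AFF', 'NOUN'))
```

Splitting the label on `;` avoids one `elif` branch per tag combination, at the cost of accepting combinations that may never occur in MedLexSp.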


In [4]:
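# Load POS data from MedLexSp (the file is closed automatically after loading)
with open("MedLexSpPOS.pickle",'rb') as POSDataFile:
    POSData = pickle.load(POSDataFile)

# Show a sample of entries: word form -> list of (PoS, lemma) tuples
for k in list(POSData.items())[100:110]:
    print(k)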


('vientre', [('N', 'vientre')])
('vientres', [('N', 'vientre')])
('aa', [('N', 'aa'), ('NPR', 'aa'), ('N', 'aab')])
('retortijón', [('N', 'retortijón')])
('retortijones', [('N', 'retortijón')])
('abdominalgia', [('N', 'abdominalgia')])
('abdominalgias', [('N', 'abdominalgia')])
('dab', [('N', 'dab')])
('abetalipoproteinemia', [('N', 'abetalipoproteinemia')])
('abetalipoproteinémico', [('ADJ', 'abetalipoproteinémico')])
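
Each entry maps a word form to a list of `(PoS, lemma)` tuples, so a form such as "aa" can have several readings. As a rough sketch of how this structure could be inspected, the snippet below works on a small sample copied from the output above; the helper names `ambiguous_entries` and `forms_by_lemma` are hypothetical and not part of MedLexSp.

```python
from collections import defaultdict

# Sample copied from the entries printed above
sample = {
    'vientre': [('N', 'vientre')],
    'vientres': [('N', 'vientre')],
    'aa': [('N', 'aa'), ('NPR', 'aa'), ('N', 'aab')],
    'retortijones': [('N', 'retortijón')],
    'abetalipoproteinémico': [('ADJ', 'abetalipoproteinémico')],
}


def ambiguous_entries(lexicon):
    """Return the word forms that have more than one (PoS, lemma) reading."""
    return {form: readings for form, readings in lexicon.items()
            if len(readings) > 1}


def forms_by_lemma(lexicon):
    """Invert the lexicon: lemma -> set of word forms that point to it."""
    index = defaultdict(set)
    for form, readings in lexicon.items():
        for _, lemma in readings:
            index[lemma].add(form)
    return dict(index)


print(ambiguous_entries(sample))
print(sorted(forms_by_lemma(sample)['vientre']))
```

Only the ambiguous forms need the predicted tag for disambiguation; all other forms are resolved by the lexicon alone.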


### Part-of-speech tagging and lemmatization without MedLexSp

In [5]:
# Load Spanish model
nlp = stanza.Pipeline('es', processors='tokenize,pos,lemma', verbose=False, use_gpu=False, download_method=None)

In [12]:
# Sentence
text = "Se realizó un PSA total que fue normal. Se explora el páncreas."

In [11]:
# Use Stanza model to process sentence
doc = nlp(text)

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text,word.lemma,word.pos, sep="\t")

Se	se	PRON
realizó	realizar	VERB
un	uno	DET
PSA	PSA	PROPN
total	total	ADJ
que	que	PRON
fue	ser	AUX
normal	normal	ADJ
.	.	PUNCT
Se	se	PRON
explora	explorar	VERB
el	el	DET
páncreas	páncrea	NOUN


Note the PoS error in "PSA" (tagged PROPN, although it is a common noun) and the lemmatization error in "páncreas" (the lemma is "páncreas", not "páncrea").

### Part-of-speech tagging and lemmatization with MedLexSp

In [13]:
# Use model to process sentence
doc = nlp(text)

for sentence in doc.sentences:
    for word in sentence.words:
        pos = word.pos
        lemma = word.lemma
        # Get POS category and lemma from MedLexSp lexicon; if not available, keep the Stanza prediction
        lexicon_entry = get_pos_from_lexicon(word.text,pos,POSData)
        if lexicon_entry:
            pos, lemma = lexicon_entry
        print(word.text,lemma,pos, sep="\t")

Se	se	PRON
realizó	realizar	VERB
un	uno	DET
PSA	psa	NOUN
total	total	ADJ
que	que	PRON
fue	ser	AUX
normal	normal	ADJ
.	.	PUNCT
Se	se	PRON
explora	explorar	VERB
el	el	DET
páncreas	páncreas	NOUN
.	.	PUNCT


Note that "PSA" now receives the correct PoS (NOUN) and "páncreas" the correct lemma.
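
The override step can be separated from the Stanza pipeline so that it can be checked without loading a model. The sketch below is illustrative only: `TOY_LEXICON` contains assumed entries (the real contents of `MedLexSpPOS.pickle` for these words are not shown in this notebook), `lookup` is a simplified stand-in for `get_pos_from_lexicon` that handles only single tags (no `;`), and `apply_lexicon` is a hypothetical helper name.

```python
# Mapping copied from the notebook: MedLexSp PoS codes -> Universal Dependencies labels
POS_FORMAT = {'ADJ': 'ADJ', 'ADV': 'ADV', 'N': 'NOUN', 'PREP': 'ADP',
              'V': 'VERB', 'art': 'DET', 'NPR': 'PROPN'}

# Assumed entries, for illustration only
TOY_LEXICON = {'psa': [('N', 'psa')], 'páncreas': [('N', 'páncreas')]}


def lookup(word, predicted_pos, lexicon):
    """Simplified lookup: handles single MedLexSp tags only (no ';')."""
    readings = lexicon.get(word.lower())
    if not readings:
        return None
    # Prefer the reading that agrees with the predicted tag
    for code, lemma in readings:
        if POS_FORMAT.get(code) == predicted_pos:
            return predicted_pos, lemma
    # Otherwise fall back to the first reading
    code, lemma = readings[0]
    mapped = POS_FORMAT.get(code)
    return (mapped, lemma) if mapped else None


def apply_lexicon(triples, lexicon):
    """Override (text, lemma, pos) triples with lexicon data when available."""
    corrected = []
    for text, lemma, pos in triples:
        entry = lookup(text, pos, lexicon)
        if entry:
            pos, lemma = entry
        corrected.append((text, lemma, pos))
    return corrected


# Three of the Stanza predictions shown earlier, as (text, lemma, pos)
stanza_output = [("PSA", "PSA", "PROPN"), ("total", "total", "ADJ"),
                 ("páncreas", "páncrea", "NOUN")]

for before, after in zip(stanza_output, apply_lexicon(stanza_output, TOY_LEXICON)):
    if before != after:
        print(before, "->", after)
```

Words that are absent from the lexicon, such as "total" in this toy example, keep the Stanza prediction unchanged.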