Morphological Analysis Examples #1

PeterPirog · 2021-12-02T18:45:22Z

@pranaydeeps Would it be possible to add a simple code example showing how to do morphological analysis of a short Ancient Greek sentence?
Maybe some examples of:

POS tagging
cosine sentence similarity

Thank you very much for the model. I tried to train a smaller BERT model, but I didn't have enough GPU resources. I would like to use your model for New Testament analysis.

pranaydeeps · 2021-12-02T19:54:58Z

I think I have some Jupyter notebooks which I used for my experiments and analysis.
I will try to clean the code a bit and upload it whenever I can!
Meanwhile, if you have an urgent need for it, you can take a look at the FLAIR toolkit documentation,
since the Morphological Analysis model is trained using the FLAIR toolkit.

PeterPirog · 2021-12-02T20:19:39Z

@pranaydeeps Hi, thank you for the answer. This isn't urgent, so I will wait patiently and try to do something myself.
For now I use this code:

from transformers import AutoTokenizer, AutoModel
import torch


tokenizer = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")


# Sentences to compare:
sent = [
    " Ἀπὸ δὲ ἕκτης ὥρας σκότος ἐγένετο ἐπὶ πᾶσαν τὴν γῆν ἕως ὥρας ἐννάτης", # Mt 27.45 sentence 1
    " γενομένης δὲ ὥρας ἕκτης σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης", # Mk 15.33 sentence 2
    " Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν", # Mt 27.39 sentence 3
    " ἦν δὲ ὡσεὶ ὥρα ἕκτη Καὶ σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης" # Lk23.44 sentence 4
]
# cosine similarity, sentence 1 to sentence 2 = 0.94861794
# cosine similarity, sentence 1 to sentence 3 = 0.6118592
# cosine similarity, sentence 1 to sentence 4 = 0.9064161

# initialize dictionary: stores tokenized sentences
token = {'input_ids': [], 'attention_mask': []}
for sentence in sent:
    # encode each sentence, append to dictionary
    new_token = tokenizer.encode_plus(sentence, max_length=128,
                                      truncation=True, padding='max_length',
                                      return_tensors='pt')
    token['input_ids'].append(new_token['input_ids'][0])
    token['attention_mask'].append(new_token['attention_mask'][0])
# reformat list of tensors to single tensor
token['input_ids'] = torch.stack(token['input_ids'])
token['attention_mask'] = torch.stack(token['attention_mask'])

# Process tokens through model:
output = model(**token)
print(output.keys())

# The dense vector representations of text are contained within the outputs 'last_hidden_state' tensor
embeddings = output.last_hidden_state
print(embeddings)

# To mean-pool the token embeddings, we first expand the attention_mask tensor to the embedding shape:
att_mask = token['attention_mask']
att_mask.shape

#    output: torch.Size([4, 128])

mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

#    Output: torch.Size([4, 128, 768])

mask_embeddings = embeddings * mask
mask_embeddings.shape

#    Output: torch.Size([4, 128, 768])

# Then we sum the remaining (unmasked) embeddings along axis 1:
summed = torch.sum(mask_embeddings, 1)
summed.shape

#    Output: torch.Size([4, 768])

# Then sum the number of values that must be given attention in each position of the tensor:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

#    Output: torch.Size([4, 768])

mean_pooled = summed / summed_mask
print(mean_pooled)

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity of the first sentence to each of the others:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()
# calculate
similarity = cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:])

print(similarity)

but maybe there is a better way to do it. As far as I know, the biggest corpus for the language is the TLG (http://stephanus.tlg.uci.edu/), but unfortunately even the untagged texts aren't open source.
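
The masking and pooling steps in the code above can be checked on a toy case without downloading the model. What follows is a minimal sketch in plain Python, not the code used in this thread: `masked_mean` and `cosine` are hypothetical helper names, and the tiny 2-dimensional "embeddings" are made-up numbers chosen only to show that padded positions do not affect the pooled vector.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1 (real tokens)."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                summed[i] += value
    # Guard against division by zero, mirroring the clamp in the code above
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy data (made up): 3 positions per sentence, the last one is padding.
sent_a = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
sent_b = [[2.0, 1.0], [2.0, 1.0], [-5.0, 7.0]]
mask = [1, 1, 0]

pooled_a = masked_mean(sent_a, mask)  # padding vector [9.0, 9.0] is ignored
pooled_b = masked_mean(sent_b, mask)  # padding vector [-5.0, 7.0] is ignored
similarity = cosine(pooled_a, pooled_b)
print(pooled_a, pooled_b, similarity)
```

Both toy sentences pool to the same vector even though their padding positions differ, so their similarity comes out at approximately 1. Without the mask, the padding vectors would be averaged in and the two pooled vectors would no longer match.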

pranaydeeps · 2022-04-25T09:57:01Z

@PeterPirog Apologies for the delayed reply. You can use something like the following to get Morphological Analysis outputs from the pre-trained model, assuming your text is saved one sentence per line in the file "input_text_clean.txt":

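from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pre-trained morphological analysis tagger
tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')

# Read the input, one sentence per line
with open("../input_text_clean.txt", "r", encoding="utf-8") as testfile:
    test_list = testfile.readlines()

with open("../morph_analysis_outputs.txt", "w", encoding="utf-8") as outfile:
    for testitem in test_list:
        sentence = Sentence(testitem)
        tagger.predict(sentence)
        outputs = sentence.get_spans('pos')
        for output in outputs:
            # Each span is an object, so convert it to text before writing
            outfile.write(str(output) + "\n")
        # Blank line between sentences
        outfile.write("\n")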
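
The tagging loop above expects one sentence per line. Here is a minimal sketch of preparing such a file, assuming the sentences are already in a Python list. `write_sentences` is a hypothetical helper and not part of FLAIR. The file name follows the one mentioned in the reply, and the directory it is written to is an assumption.

```python
from pathlib import Path

def write_sentences(sentences, path):
    """Write one sentence per line, dropping blank entries and stray whitespace."""
    cleaned = [s.strip() for s in sentences if s.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned

sentences = [
    " Ἀπὸ δὲ ἕκτης ὥρας σκότος ἐγένετο ἐπὶ πᾶσαν τὴν γῆν ἕως ὥρας ἐννάτης",
    "",  # blank entries are skipped
    " Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν",
]
written = write_sentences(sentences, "input_text_clean.txt")

# Read the file back to confirm the one-sentence-per-line layout
lines = Path("input_text_clean.txt").read_text(encoding="utf-8").splitlines()
print(len(lines))
```

Stripping leading whitespace and skipping empty entries keeps the tagger from receiving blank lines as sentences.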

pranaydeeps self-assigned this Dec 2, 2021

pranaydeeps added the documentation label Dec 2, 2021
