Morphological Analysis Examples #1

Open
PeterPirog opened this issue Dec 2, 2021 · 3 comments
@PeterPirog

@pranaydeeps Would it be possible to add a simple code example showing how to do morphological analysis of a short Ancient Greek sentence?
Maybe some examples of:

  • POS tagging
  • cosine sentence similarity

Thank you very much for the model. I tried to train a smaller BERT model, but I didn't have enough GPU resources. I would like to use your model for New Testament analysis.

@pranaydeeps
Owner

pranaydeeps commented Dec 2, 2021

I think I have some Jupyter notebooks which I used for my experiments and analysis.
I will try to clean up the code a bit and upload it whenever I can!
Meanwhile, if you have an urgent need for it, you can take a look at the FLAIR toolkit documentation,
since the Morphological Analysis model is trained using the FLAIR toolkit.

@pranaydeeps pranaydeeps self-assigned this Dec 2, 2021
@pranaydeeps pranaydeeps added the documentation Improvements or additions to documentation label Dec 2, 2021
@PeterPirog
Author

PeterPirog commented Dec 2, 2021

@pranaydeeps Hi, thank you for the answer. This isn't urgent, so I will wait patiently and try to do something myself.
For now I use this code:

from transformers import AutoTokenizer, AutoModel
import torch


tokenizer = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")


# Tokenize the sentences:
sent = [
    " Ἀπὸ δὲ ἕκτης ὥρας σκότος ἐγένετο ἐπὶ πᾶσαν τὴν γῆν ἕως ὥρας ἐννάτης", # Mt 27.45 sentence 1
    " γενομένης δὲ ὥρας ἕκτης σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης", # Mk 15.33 sentence 2
    " Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν", # Mt 27.39 sentence 3
    " ἦν δὲ ὡσεὶ ὥρα ἕκτη Καὶ σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης" # Lk23.44 sentence 4
]
# Cosine similarities obtained with the code below:
# sentence 1 to sentence 2 = 0.94861794
# sentence 1 to sentence 3 = 0.6118592
# sentence 1 to sentence 4 = 0.9064161

# initialize dictionary: stores tokenized sentences
token = {'input_ids': [], 'attention_mask': []}
for sentence in sent:
    # encode each sentence, append to dictionary
    new_token = tokenizer.encode_plus(sentence, max_length=128,
                                      truncation=True, padding='max_length',
                                      return_tensors='pt')
    token['input_ids'].append(new_token['input_ids'][0])
    token['attention_mask'].append(new_token['attention_mask'][0])
# reformat list of tensors to single tensor
token['input_ids'] = torch.stack(token['input_ids'])
token['attention_mask'] = torch.stack(token['attention_mask'])

# Process tokens through model:
output = model(**token)
print(output.keys())

# The dense vector representations of text are contained within the outputs 'last_hidden_state' tensor
embeddings = output.last_hidden_state
print(embeddings)

# To mean-pool the embeddings, we first expand our attention_mask tensor to the shape of the embeddings:
att_mask = token['attention_mask']
att_mask.shape

#    output: torch.Size([4, 128])

mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

#    Output: torch.Size([4, 128, 768])

mask_embeddings = embeddings * mask
mask_embeddings.shape

#    Output: torch.Size([4, 128, 768])

# Then we sum the masked embeddings along axis 1:
summed = torch.sum(mask_embeddings, 1)
summed.shape

#    Output: torch.Size([4, 768])

# Then count the tokens that receive attention in each position of the tensor (clamped to avoid division by zero):
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

#    Output: torch.Size([4, 768])

mean_pooled = summed / summed_mask
print(mean_pooled)

from sklearn.metrics.pairwise import cosine_similarity

# Let's calculate the cosine similarity of sentence 0 (sentence 1 above) to each of the others:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()
# calculate
similarity = cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:])

print(similarity)

but maybe there is a better way to do it. As far as I know, the biggest language corpus is TLG (http://stephanus.tlg.uci.edu/), but unfortunately even the untagged texts aren't open source.
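To make the masking step in the code above easier to check by hand, here is a minimal, dependency-free sketch of masked mean pooling followed by cosine similarity. The numbers are made-up toy vectors, not real model outputs, and `masked_mean` and `cosine` are hypothetical helper names introduced only for this illustration. It compares the masked average with a naive average that also counts padding positions.

```python
import math


def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # avoid division by zero, like the clamp above
    return [t / count for t in total]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data (made up): 3 real tokens followed by 2 padding positions
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 0.0], [9.0, 0.0]]
mask = [1, 1, 1, 0, 0]

pooled = masked_mean(tokens, mask)                        # ignores padding
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # counts padding

print("masked mean:", pooled)
print("naive mean: ", naive)
print("cosine(masked, naive):", cosine(pooled, naive))
```

The two averages point in noticeably different directions, which is why the padded positions are zeroed out and excluded from the divisor before the similarity is computed.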

@pranaydeeps
Owner

pranaydeeps commented Apr 25, 2022

@PeterPirog apologies for the delayed reply. You can use something like the following to get Morphological Analysis outputs from the pre-trained model, if your text is saved line by line in the file "input_text_clean.txt":

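from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pre-trained morphological analysis model
tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')

# Read the input text, one sentence per line
with open("../input_text_clean.txt", "r", encoding="utf-8") as testfile:
    test_list = testfile.readlines()

with open("../morph_analysis_outputs.txt", "w", encoding="utf-8") as outfile:
    for testitem in test_list:
        sentence = Sentence(testitem.strip())
        tagger.predict(sentence)
        outputs = sentence.get_spans('pos')
        for output in outputs:
            # Span objects are not strings, so convert before writing
            outfile.write(str(output) + "\n")
        # Blank line between sentences
        outfile.write("\n")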
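The snippet above expects the text to be saved line by line. A minimal sketch of producing such a file from a list of sentences could look like the following. The helper name `write_line_per_sentence`, the whitespace normalisation, and the skipping of empty entries are assumptions made for this illustration and are not part of the model's documented requirements. The demo writes to a temporary directory rather than to the path used above.

```python
import os
import tempfile


def write_line_per_sentence(sentences, path):
    """Write one whitespace-normalised sentence per line; skip empty ones.

    Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            cleaned = " ".join(sentence.split())  # trim and collapse spaces
            if cleaned:
                handle.write(cleaned + "\n")
                written += 1
    return written


# Two of the verses from earlier in the thread, plus an empty entry
verses = [
    " Ἀπὸ δὲ ἕκτης ὥρας σκότος ἐγένετο ἐπὶ πᾶσαν τὴν γῆν ἕως ὥρας ἐννάτης",
    "",
    " Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν",
]

demo_path = os.path.join(tempfile.mkdtemp(), "input_text_clean.txt")
count = write_line_per_sentence(verses, demo_path)

with open(demo_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()

print(count, "sentences written to", demo_path)
```

Opening the file with an explicit `encoding="utf-8"` keeps the polytonic Greek characters intact regardless of the platform default.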
