Issue in Generate Tokenizer for CROHME dataset #81
Comments
Are you on 464e4fc or later? There was a parsing error before.
Okay, so I updated dataset.py as suggested in #464e4fc and it still won't generate the tokenizer for the CROHME dataset. This is the error:
Any idea how to fix this? It appears to be an error inside a Python library file.
The line in question (237) should appear at line 251: Line 251 in ae63a63
The error occurs if that line appears twice, so please make sure you have the correct version.
Okay, so I just have to cut line 237 and paste it at line 251, and that would solve the issue, right?
No, something went wrong when you updated the file.
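One way to check whether a file update left a statement in twice is to list every line number on which it occurs. The following is a minimal sketch, not part of the project: `find_line_occurrences` is a hypothetical helper, the path is a placeholder for your own checkout, and the statement to search for has to be copied from the linked commit because it is not quoted in this thread.

```python
from pathlib import Path


def find_line_occurrences(source: str, statement: str) -> list[int]:
    """Return the 1-based numbers of lines whose stripped text equals `statement`."""
    target = statement.strip()
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if line.strip() == target
    ]


if __name__ == "__main__":
    # Placeholders (assumptions): adjust the path, and paste the exact
    # statement from the linked commit in place of the empty string.
    path = Path("dataset/dataset.py")
    statement = ""
    if path.exists() and statement:
        hits = find_line_occurrences(path.read_text(encoding="utf-8"), statement)
        print(f"statement found on lines: {hits}")
```

If the result lists more than one line number, the statement is present twice and the file should be restored from the repository rather than patched by hand.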
Okay, so I updated this dataset.py file and trained the model on the CROHME dataset on a GPU in the Google Colab environment. For training I used the same tokenizer.json file as for the formula dataset. Once the training was done, I took some snips from the CROHME dataset using gui.py, but the model can't predict handwritten images. There is either a long processing time followed by an entirely different output, or an error. However, it has started recognizing a few of the symbols. Attaching screenshots for reference. Steps followed:
Q1 - What should I do now? How can I make it produce LaTeX for handwritten equations?
Q1: Looks like training was not done. What are the validation metrics? The BLEU score?
Hey Lukas, hope you're doing well. Apologies for the interruption, but I really do want to contribute something to this project, especially in the handwritten domain. Anyway, I have split the CROHME dataset into train and test sets in an 80:20 ratio as suggested. As for the training not being done, I executed the following command in the Google Colab environment (as suggested earlier for GPU):
I got this as a result (which, to my understanding, appears once the training is complete; do tell me if I'm wrong). So, what seems to be the issue? Was it really not trained? If not, what steps should I follow now? I followed the README previously to train the model. P.S.: The BLEU score evaluation is still running. I'll update once it's done. Thanks for your support, may you be blessed while you're at it!
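For reference, an 80:20 split like the one mentioned above can be sketched as follows. This is an illustration under assumptions, not the project's own tooling: `split_items` is a hypothetical helper and the data is a stand-in, not CROHME. If images and equations are paired, split the pairs together rather than splitting the two lists separately, so the pairing is preserved.

```python
import random


def split_items(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in (image name, equation) pairs; replace with your own data.
pairs = [(f"img_{i}.png", f"x_{i}") for i in range(10)]
train, test = split_items(pairs)
print(len(train), len(test))
```

Fixing the seed means the same samples land in the test set on every run, which keeps evaluation numbers comparable between training attempts.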
From the other issue I saw that the eval metrics are very bad.
python eval.py -d path/to/crohme_test.pkl -c path/to/latest/checkpoint --config path/to/config.yaml
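For intuition about what the BLEU score asked about above measures, here is a minimal sentence-level sketch: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference, uniform weights, and no smoothing, and it is not necessarily the implementation eval.py uses; the function names and the whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU: one reference, uniform weights, no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = r"\frac { a } { b }".split()
candidate = r"\frac { a } { c }".split()
print(bleu(reference, candidate))
```

A score near 1 means the predicted token sequence closely matches the ground truth, while a score near 0 means the model's output shares almost no token sequences with it.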
Hey buddy, first of all, great work here. I followed the instructions, and it works amazingly well for most of the equations. Now I want to train the model to predict LaTeX for handwritten equations. Following README.MD, I'm trying to generate the tokenizer for the CROHME dataset, for which I entered the following command:
python dataset/dataset.py --equations dataset/CROHME_math.txt --vocab-size 8000 --out tokenizer.json
I'm getting the following error:
How am I to solve this? I'm assuming that I'll first generate the tokenizer.json file and then train the model on CROHME Dataset. Once I'm done, I'll be able to input handwritten equations and get the corresponding LaTeX. Am I on the right track? Thanks!