Issue in Generate Tokenizer for CROHME dataset #81

Closed
Agha-Muqarib opened this issue Jan 22, 2022 · 9 comments

@Agha-Muqarib

Agha-Muqarib commented Jan 22, 2022

Hey Buddy, first of all, great work here. I followed the instructions, and it works amazingly well for most equations. Now I want to train the model on handwritten equations to predict LaTeX. Following README.md, I'm trying to generate the tokenizer for the CROHME dataset, for which I entered the following command:

python dataset/dataset.py --equations dataset/CROHME_math.txt --vocab-size 8000 --out tokenizer.json

I'm getting the following error:

Generate tokenizer
Traceback (most recent call last):
  File "C:\Users\Saad\OneDrive\Desktop\LaTeX-OCR-main\dataset\dataset.py", line 244, in <module>
    generate_tokenizer(args.equations, args.out, args.vocab_size)
  File "C:\Users\Saad\OneDrive\Desktop\LaTeX-OCR-main\dataset\dataset.py", line 228, in generate_tokenizer     
    trainer = BpeTrainer(special_tokens=["[PAD]", "[BOS]", "[EOS]"], vocab_size=vocab_size, show_progress=True)
TypeError: 'str' object cannot be interpreted as an integer

How can I solve this? I'm assuming that I'll first generate the tokenizer.json file and then train the model on the CROHME dataset. Once that's done, I'll be able to input handwritten equations and get the corresponding LaTeX. Am I on the right track? Thanks!

@lukas-blecher
Owner

lukas-blecher commented Jan 23, 2022

Are you on 464e4fc or later? There was a parsing error before.
Other than that you are correct. Please tell me if the training is working. I wasn't able to try it out myself yet.
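
For context on the TypeError above: argparse hands option values to the program as strings unless the argument declares a `type`, and a string `vocab_size` is the kind of value that `BpeTrainer` rejects in the traceback. The sketch below is not the code from commit 464e4fc, whose contents this thread does not show; `build_parser` and `vocab_type` are hypothetical names used only to contrast the two behaviours.

```python
import argparse


def build_parser(vocab_type=None):
    """Hypothetical helper: build a parser whose --vocab-size uses the given type.

    With vocab_type=None, argparse keeps the command-line value as a string.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--vocab-size', type=vocab_type, default=8000)
    return parser


argv = ['--vocab-size', '8000']

# Without a type, the value arrives as the string '8000'.
untyped = build_parser().parse_args(argv)

# With type=int, argparse converts the value before the program sees it.
typed = build_parser(int).parse_args(argv)

print(type(untyped.vocab_size).__name__, type(typed.vocab_size).__name__)
```

If a script cannot be updated, converting the value with `int(args.vocab_size)` before passing it on would have the same effect, assuming the argument is otherwise parsed correctly.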

@Agha-Muqarib
Author

Okay, so I updated dataset.py as suggested in commit 464e4fc, and it still won't generate the tokenizer for the CROHME dataset. This is the error:

Traceback (most recent call last):
  File "C:\Users\Saad\OneDrive\Desktop\LaTeX-OCR-main\dataset\dataset.py", line 237, in <module>
    parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1434, in add_argument
    return self._add_action(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1799, in _add_action
    self._optionals._add_action(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1636, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1448, in _add_action
    self._check_conflict(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1585, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1594, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument -i/--images: conflicting option strings: -i, --images

Any idea how to fix this? It appears to be an error in a Python library file.

@lukas-blecher
Owner

The line in question (237 in your traceback) should be at line 251:

parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')

The error occurs if that line appears twice in the file, so please make sure you have the correct version.
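
The duplicate-line explanation can be reproduced in isolation. This is a minimal sketch, not an excerpt from dataset.py: it registers the same `-i/--images` option twice on one parser and captures the resulting `argparse.ArgumentError`, which is raised by the standard library but caused by the calling script. The variable `conflict` is a name introduced here for illustration.

```python
import argparse

parser = argparse.ArgumentParser()

# First registration succeeds.
parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')

try:
    # Registering the same option strings again triggers the conflict check.
    parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')
    conflict = None
except argparse.ArgumentError as err:
    conflict = str(err)

print(conflict)
```

Because the exception originates in argparse.py, the traceback ends inside the Python installation even though the fix belongs in the script that calls `add_argument`.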

@Agha-Muqarib
Author

Okay, so I just have to cut line 237 and paste it at line 251, and that would solve the issue, right?

@lukas-blecher
Owner

No, something went wrong when you updated the file.
Copy all of this: https://raw.githubusercontent.com/lukas-blecher/LaTeX-OCR/main/dataset/dataset.py
And replace the entire file with the new content.

@Agha-Muqarib
Author

Agha-Muqarib commented Jan 24, 2022

Okay, so I updated the dataset.py file and trained the model on the CROHME dataset on a GPU in the Google Colab environment. For training, I used the same tokenizer.json file as for the formula dataset. Once the training was done, I took some snips from the CROHME dataset using gui.py, but the model can't predict handwritten images. There is either a long processing time followed by an entirely different output, or an error. However, it has started recognizing a few of the symbols. Attaching screenshots for reference.

[Screenshot: 271488874_342816557690886_5825415274987443482_n]
[Screenshot: 271739695_1323370631510385_3955840311309361722_n]

Steps Followed:

  • Updated the dataset.py file as mentioned in the previous comment.
  • Trained the model using the steps in README.md.
  • Used the CROHME dataset for training.
  • Ran the project using gui.py.
  • Took snips from the same images folder of the CROHME dataset to check whether LaTeX is predicted for handwritten equations.

Q1 - What should I do now? How can I make it predict LaTeX for handwritten equations?
Q2 - Is it really necessary to split the CROHME dataset into train and test sets, or can we still test manually on images from the dataset, as I did with the snips?

@lukas-blecher
Owner

Q1: Looks like training was not done. What are the validation metrics? The BLEU score?
Q2: Yes. At least a train/test split is necessary. Otherwise you have no way to tell whether the model is overfitting. And yeah, don't use the GUI for evaluation. There is an eval.py for that purpose, where you can get the metrics for the test set. Also, the GUI uses an extra model for resizing that never saw handwritten images, so there might be unexpected behavior.
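
As a rough illustration of the split recommended in Q2, the sketch below shuffles a list of samples with a fixed seed and holds out a fraction for testing. It is not part of the repository: `split_dataset`, the 0.2 default, and the placeholder `(image name, equation)` pairs are all assumptions, and in practice the split would have to be made before the dataset files for training and evaluation are generated.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=0):
    """Hypothetical helper: return (train, test) lists with no shared samples."""
    items = list(samples)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


# Placeholder data standing in for (image file, LaTeX equation) pairs.
samples = [(f'img_{i:04d}.png', f'x_{i}') for i in range(10)]
train, test = split_dataset(samples)
print(len(train), len(test))
```

Evaluating only on the held-out portion is what makes overfitting visible: a model that scores well on training images but poorly on test images has memorized rather than generalized.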

@Agha-Muqarib
Author

Hey Lukas, hope you're doing well. Apologies for the interruption, but I really do wanna contribute something to this project, especially in the handwritten domain.

Anyway, I have split the CROHME dataset into train and test sets in an 80:20 ratio, as suggested. As for the training not being done, I executed the following command in the Google Colab environment (as suggested earlier, for the GPU):

python3 "/content/drive/My Drive/LaTeX-OCR-main/train.py" --config "/content/drive/My Drive/LaTeX-OCR-main/settings/config.yaml"

I got this as a result (which, to my understanding, appears once the training is complete; do tell me if I'm wrong):

[Screenshot: WhatsApp Image 2022-01-25 at 2 40 25 AM]

So, what seems to be the issue? Was it really not trained? If not, what steps should I follow now? I followed the readme previously to train the model.

P.S: The BLEU score is still underway. I'll update it once it's done. Thanks for your support, may you be blessed while you're at it!

@lukas-blecher
Owner

From the other issue I saw that the eval metrics are very bad.
And from the image above I can see you trained for 50 epochs?
So maybe there is something wrong with the dataset. But then the train loss wouldn't be ~0.1.
How did you split the dataset? How did you call the evaluation? The results of the evaluation in #85 look a lot like you tried to evaluate my pretrained model on handwritten data. You need to specify where your newly trained model is located:

python eval.py -d path/to/crohme_test.pkl -c path/to/latest/checkpoint --config path/to/config.yaml
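
One way to avoid pointing eval.py at the wrong weights is to look up the most recently written checkpoint programmatically. The sketch below is only an assumption about how that could be done: `latest_checkpoint` and `eval_command` are hypothetical helpers, and the `*.pth` pattern is a guess at the checkpoint file extension, which this thread does not state. The `-d`, `-c`, and `--config` flags are the ones from the command above.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir, pattern='*.pth'):
    """Hypothetical helper: newest file under checkpoint_dir matching pattern.

    The '*.pth' default is an assumption; adjust it to the files train.py writes.
    """
    candidates = list(Path(checkpoint_dir).rglob(pattern))
    if not candidates:
        raise FileNotFoundError(f'no files matching {pattern} under {checkpoint_dir}')
    # Pick the file with the most recent modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def eval_command(test_pkl, checkpoint, config):
    """Build the eval.py invocation as an argument list (e.g. for subprocess.run)."""
    return ['python', 'eval.py', '-d', str(test_pkl), '-c', str(checkpoint), '--config', str(config)]
```

Printing the resolved checkpoint path before running the evaluation makes it easy to confirm that the newly trained model, not the pretrained one, is being scored.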
