Issue in Generate Tokenizer for CROHME dataset #81

Agha-Muqarib · 2022-01-22T23:00:06Z

Hey buddy, first of all, great work here. I followed the instructions, and it works amazingly well for most of the equations. Now I want to train the model on handwritten equations to predict LaTeX. Following README.MD, I'm trying to generate the tokenizer for the CROHME dataset, for which I entered the following command:

python dataset/dataset.py --equations dataset/CROHME_math.txt --vocab-size 8000 --out tokenizer.json

I'm getting the following error:

Generate tokenizer
Traceback (most recent call last):
  File "C:\Users\Saad\OneDrive\Desktop\LaTeX-OCR-main\dataset\dataset.py", line 244, in <module>
    generate_tokenizer(args.equations, args.out, args.vocab_size)
  File "C:\Users\Saad\OneDrive\Desktop\LaTeX-OCR-main\dataset\dataset.py", line 228, in generate_tokenizer     
    trainer = BpeTrainer(special_tokens=["[PAD]", "[BOS]", "[EOS]"], vocab_size=vocab_size, show_progress=True)
TypeError: 'str' object cannot be interpreted as an integer

How am I to solve this? I'm assuming that I'll first generate the tokenizer.json file and then train the model on CROHME Dataset. Once I'm done, I'll be able to input handwritten equations and get the corresponding LaTeX. Am I on the right track? Thanks!

lukas-blecher · 2022-01-23T11:48:37Z

Are you on 464e4fc or later? There was a parsing error before.
Other than that you are correct. Please tell me if the training is working. I wasn't able to try it out myself yet.
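
For anyone hitting the same TypeError: the traceback is consistent with --vocab-size reaching BpeTrainer as a string rather than an integer. The sketch below only illustrates that kind of argparse behavior; the helper name parse_vocab_size is hypothetical, and the actual change made in 464e4fc is not shown in this thread.

```python
import argparse


def parse_vocab_size(argv, typed):
    """Parse --vocab-size, with or without an explicit integer type."""
    parser = argparse.ArgumentParser()
    if typed:
        # type=int makes argparse convert the command-line string to an int.
        parser.add_argument('--vocab-size', type=int, default=8000)
    else:
        # Without type=..., values given on the command line stay strings.
        parser.add_argument('--vocab-size', default=8000)
    return parser.parse_args(argv).vocab_size


argv = ['--vocab-size', '8000']
untyped = parse_vocab_size(argv, typed=False)
typed = parse_vocab_size(argv, typed=True)
print(type(untyped).__name__, type(typed).__name__)
```

Note that the default of 8000 is already an int in both cases, so an untyped option only misbehaves when the value is passed explicitly on the command line, as it was in the command above.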

Agha-Muqarib · 2022-01-23T16:03:32Z

Okay, so I updated dataset.py as suggested in #464e4fc, and it still won't generate the tokenizer for the CROHME dataset. This is the error:

Traceback (most recent call last):
  File "C:\Users\Saad\OneDrive\Desktop\LaTeX-OCR-main\dataset\dataset.py", line 237, in <module>
    parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1434, in add_argument
    return self._add_action(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1799, in _add_action
    self._optionals._add_action(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1636, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1448, in _add_action
    self._check_conflict(action)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1585, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "C:\Users\Saad\AppData\Local\Programs\Python\Python39\lib\argparse.py", line 1594, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument -i/--images: conflicting option strings: -i, --images

Any idea how to fix this? It appears to be an error in a Python library file.

lukas-blecher · 2022-01-23T16:09:17Z

The line in question (237) should appear at line 251:

LaTeX-OCR/dataset/dataset.py, line 251 in ae63a63:

    parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')

The error occurs if the line in question appears twice. So please make sure you have the correct version.
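
A minimal sketch that reproduces this behavior with the standard library alone, assuming nothing about the rest of dataset.py: registering the same option strings twice on one parser raises argparse.ArgumentError.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')

try:
    # Registering the same option strings a second time is what triggers the conflict.
    parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')
    message = None
except argparse.ArgumentError as err:
    message = str(err)

print(message)
```

So the traceback points into argparse.py only because that is where the duplicate is detected; the duplicate line itself lives in the script that builds the parser.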

Agha-Muqarib · 2022-01-23T16:11:57Z

Okay so I'll just have to cut line number 237 and paste at 251 and that would solve the issue, right?

lukas-blecher · 2022-01-23T16:17:37Z

No, something went wrong when you updated the file.
Copy all of this: https://raw.githubusercontent.com/lukas-blecher/LaTeX-OCR/main/dataset/dataset.py
And replace the entire file with the new content.

Agha-Muqarib · 2022-01-24T19:08:58Z

Okay, so I updated this dataset.py file and trained the model on the CROHME dataset on a GPU in the Google Colab environment. For training, I used the same tokenizer.json file as for the formula dataset. Once the training was done, I decided to take some snips from the CROHME dataset using gui.py, but the issue is that the model can't predict handwritten images. There is either a long processing time with an entirely different output, or an error. However, it has started recognizing a few of the symbols. Attaching screenshots for reference.

Steps Followed:

Updated the dataset.py file as mentioned in the previous comment.
Trained the model using the steps mentioned in README.MD.
Used the CROHME dataset for training.
Ran the project using gui.py.
Took snips from the same images folder of the CROHME dataset to check whether LaTeX is being predicted for handwritten equations.

Q1 - What to do now? How can I make it detect LaTeX against handwritten equations?
Q2 - Is there a dire need to split CROHME dataset into train and test or can we still test it manually on the images in dataset like I did in the snips?

lukas-blecher · 2022-01-24T21:30:16Z

Q1: Looks like training was not done. What are the validation metrics? The BLEU score?
Q2: Yes. At least a train/test split is necessary. Otherwise you have no way to tell if the model is overfitting. And yeah, don't use the GUI for evaluation. There is an eval.py for that, where you can get the metrics for the test set. Also, in the GUI there is an extra model for resizing, and it never saw handwritten images, so there might be unexpected behavior.
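
As a rough illustration of such a split, here is a minimal sketch that shuffles sample indices with a fixed seed and holds out a fraction for testing. The function name split_indices, the 20% test fraction, and the seed are all assumptions for illustration; how the indices map to CROHME images and equations depends on how the dataset was prepared.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train, test) index lists that are disjoint and cover all samples."""
    indices = list(range(n_samples))
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

The important property is that no sample appears in both lists, so metrics on the test part say something about generalization rather than memorization.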

Agha-Muqarib · 2022-01-25T19:53:41Z

Hey Lukas, Hope you're doing good. Apologies for the interruption but I really do wanna contribute something to this project, especially in that handwritten domain.

Anyways, I have split the CROHME dataset into train:test in an 80:20 ratio as suggested. As for training not being done, I executed the following command in the Google Colab environment (as suggested earlier for GPU):

python3 "/content/drive/My Drive/LaTeX-OCR-main/train.py" --config "/content/drive/My Drive/LaTeX-OCR-main/settings/config.yaml"

I got this as a result (which, to my understanding, appears once the training is complete; do tell me if I'm wrong).

So, what seems to be the issue? Was it really not trained? If not, what steps should I follow now? I followed the readme previously to train the model.

P.S: The BLEU score is still underway. I'll update it once it's done. Thanks for your support, may you be blessed while you're at it!

lukas-blecher · 2022-01-25T22:00:46Z

From the other issue I saw that the eval metrics are very bad.
And from the image above I can see you trained for 50 epochs?
So maybe there is something wrong with the dataset. But then the train loss wouldn't be ~0.1.
How did you split the dataset? How did you call the evaluation? Because the results of the evaluation in #85 look a lot like you tried to evaluate my pretrained model on handwritten data. You need to specify where your newly trained model is placed:

python eval.py -d path/to/crohme_test.pkl -c path/to/latest/checkpoint --config path/to/config.yaml

lukas-blecher closed this as completed Jan 23, 2022

lukas-blecher mentioned this issue Jan 23, 2022

generate the cromhe tokenizer.json ,error,how to fix it ? #66

Closed
