Since this was an Industry-University Cooperation Project, the dataset is no longer accessible now that the project has ended :)
Note that this project was done as a team; I was mainly in charge of text preprocessing, the machine learning methods (XGBoost), ensembling, and result analysis.
- Cleaning dirty text
- Machine learning methods to classify math word problems by their equations (Korean + English + math symbols + numbers)
- Classification by fine-tuning the pretrained KoBERT and KoELECTRA models
- Ensembling
- Analysis of results: correct and incorrect predictions
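As an illustration of the cleaning step, here is a minimal sketch; the specific rules (replacing numbers with a placeholder token, keeping only Korean, English, and math symbols) are hypothetical examples, not the project's actual rules:

```python
import re

def clean_problem_text(text: str) -> str:
    """Normalize a math word problem for vectorization.

    Hypothetical cleaning rules for illustration:
    replace every number with a <NUM> placeholder, drop
    characters other than Korean, Latin letters, math
    symbols, and the placeholder, then collapse whitespace.
    """
    text = re.sub(r"\d+(\.\d+)?", "<NUM>", text)            # numbers -> placeholder
    text = re.sub(r"[^가-힣a-zA-Z<>+\-*/=(). ]", " ", text)  # keep Korean/English/math
    text = re.sub(r"\s+", " ", text).strip()                 # collapse whitespace
    return text

print(clean_problem_text("철수는 사과 3개와 배 12개를 샀다!!  3+12=?"))
```

In the actual comparison (ex1 ~ ex4 below), rules like these would be enabled incrementally to vary how aggressively the text is stripped.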
- Comparison of different text preprocessing schemes, each applied before CountVectorizer and then XGBoost
- From ex1 to ex4, the amount of text removal and substitution increases
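The comparison pipeline can be sketched as below. The toy texts and labels stand in for the real (no longer accessible) dataset, and scikit-learn's `GradientBoostingClassifier` stands in for the project's XGBoost classifier (`xgboost.XGBClassifier` is a drop-in replacement with the same `fit`/`predict` interface):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real dataset: preprocessed problem texts and
# equation-type labels (0 = addition, 1 = subtraction).
texts = ["add <NUM> and <NUM>", "subtract <NUM> from <NUM>",
         "sum of <NUM> <NUM>", "difference of <NUM> <NUM>"] * 5
labels = [0, 1, 0, 1] * 5

# Bag-of-words features fed into gradient-boosted trees.
clf = make_pipeline(CountVectorizer(), GradientBoostingClassifier(random_state=0))
clf.fit(texts, labels)
preds = clf.predict(texts)
print((preds == labels).mean())
```

To compare ex1 ~ ex4, the same pipeline would be refit on each preprocessed version of the corpus and the validation scores compared.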
- Comparison of different models with a fixed text preprocessing scheme
- The ensemble method aggregates the outputs of selected models with weights
- Two weighting strategies were compared:
- using all model outputs and training a small neural network to learn the best way to combine them
- using only the best-performing model of each method type, with the weights found through trial and error
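The trial-and-error weighting can be sketched as a scan over convex weights applied to each model's predicted class probabilities. The probabilities below are made up for illustration, not the project's actual model outputs:

```python
import numpy as np

# Toy predicted class probabilities from two models (4 samples, 3 classes).
proba_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3],
                    [0.1, 0.2, 0.7], [0.5, 0.4, 0.1]])
proba_b = np.array([[0.4, 0.4, 0.2], [0.1, 0.7, 0.2],
                    [0.3, 0.3, 0.4], [0.2, 0.6, 0.2]])
y_true = np.array([0, 1, 2, 0])

def ensemble_accuracy(w: float) -> float:
    """Accuracy of the weighted average w * A + (1 - w) * B."""
    blended = w * proba_a + (1 - w) * proba_b
    return float((blended.argmax(axis=1) == y_true).mean())

# Trial and error: scan candidate weights and keep the best one.
weights = np.linspace(0, 1, 11)
best_w = max(weights, key=ensemble_accuracy)
print(best_w, ensemble_accuracy(best_w))
```

The neural-network variant replaces this scan with a small model (e.g. one hidden layer) trained on the concatenated model outputs to predict the final class.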
Another method tested, not discussed here, was further pretraining a BERT model on the data at hand.
However, the dataset was too small, so it did not outperform any of the results above.
With enough data, this approach would be expected to produce the best results.