
Classification of a multilingual dataset using pre-trained models fine-tuned only on English training data. The model is trained on TPUs using PyTorch and the torch_xla library.


jhashekhar/multilingual-clf


multilingual-clf

Data

The data comes from the Kaggle competition Jigsaw Multilingual Toxic Comment Classification.

Workings

Refer to my Kaggle Notebook to see how everything fits together.

  • Use the PyTorch nightly build. PyTorch and torch_xla are often unstable.

  • The bert-multilingual-uncased model works without trouble. There are no SIGKILL or other memory issues.

  • The xlm-roberta-base model also works, with batch_size=8.

  • xlm-roberta-large is a lot trickier. It requires explicit garbage collection and creating the dataloader only once.

    • The model must be created only once and wrapped with the wrapper function provided by the torch_xla library.
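
The xlm-roberta-large notes above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes the wrapper in question is `xmp.MpModelWrapper`, that the dataset yields dicts with `input_ids`, `attention_mask` and `labels`, and that a single-logit head with a binary cross-entropy loss is used. The function names, learning rate and the reuse of `batch_size=8` are hypothetical choices made for the example.

```python
import gc

MODEL_NAME = "xlm-roberta-large"  # model named in the notes above
BATCH_SIZE = 8                    # reported for xlm-roberta-base; assumed here


def build_wrapped_model(model_name=MODEL_NAME):
    """Load the weights once in the parent process and wrap them for sharing."""
    # Heavy imports are local so this file can be imported without TPU libraries.
    import torch_xla.distributed.xla_multiprocessing as xmp
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1
    )
    # Assumption: this is the "wrapper function" the notes refer to.
    return xmp.MpModelWrapper(model)


def train_fn(index, wrapped_model, dataset, epochs=1):
    """Per-core entry point; `dataset` is assumed to yield dicts of tensors."""
    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()
    model = wrapped_model.to(device)  # moves the shared weights, no second load
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed values

    # Build the DataLoader once, outside the epoch loop.
    # Sharding across cores (e.g. a DistributedSampler) is omitted for brevity.
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=BATCH_SIZE, drop_last=True
    )

    for _ in range(epochs):
        model.train()
        device_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
        for batch in device_loader:
            optimizer.zero_grad()
            logits = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )[0]
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits.view(-1), batch["labels"].float()
            )
            loss.backward()
            xm.optimizer_step(optimizer)
        # Drop the per-epoch loader and collect garbage to free host memory.
        del device_loader
        gc.collect()


def main(dataset):
    """Create the model once, then fork one training process per TPU core."""
    import torch_xla.distributed.xla_multiprocessing as xmp

    wrapped_model = build_wrapped_model()
    xmp.spawn(train_fn, args=(wrapped_model, dataset), start_method="fork")
```

On a TPU host, calling `main(dataset)` would load the checkpoint a single time in the parent process. The `fork` start method lets the child processes share that copy instead of each loading its own, which is the likely source of the SIGKILL and memory issues mentioned above.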

Todo

  • Add multi-sample dropout
  • Add mixed precision training
