This project fine-tunes the multilingual BERT (bert-base-multilingual-cased) model for language classification. The dataset contains text samples labeled with their respective languages. The goal is to create a robust model capable of accurately predicting the language of a given text.
- Fine-tunes BERT for sequence classification
- Handles imbalanced datasets and missing labels (`NaN` is treated as its own category, since missing labels actually represent a language; see the sketch after this list)
- Supports training continuation from checkpoints
- Evaluates model performance using accuracy, precision, recall, and F1-score
- Applies the trained model to make predictions on new text samples
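For illustration, keeping `NaN` labels as their own class might look like the following minimal sketch (the placeholder name `"Unknown"` is an assumption, not taken from the project code):

```python
import pandas as pd

# Load the training data described below (ID, Text, Label columns).
df = pd.read_csv("train_submission.csv")

# Keep missing labels as an explicit category instead of dropping them,
# since in this dataset a missing label itself denotes a language.
# "Unknown" is an assumed placeholder name.
df["Label"] = df["Label"].fillna("Unknown")

# Inspect class balance, e.g. before choosing class weights or sampling.
print(df["Label"].value_counts())
```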
```
├── preprocessing.py          # Script for data preprocessing
├── fine_tuning.py            # Script for initial model training
├── continue.py               # Script for resuming training
├── apply.py                  # Script for making predictions with the trained model
├── train_submission.csv      # Training dataset (not included in repo)
├── test_without_labels.csv   # Test dataset (not included in repo)
└── README.md                 # Project documentation
```
```
pip install transformers torch pandas scikit-learn tqdm safetensors
```

Ensure `train_submission.csv` (the training data) is present in the project directory. The CSV should have the following format:

```csv
ID,Text,Label
1,"Hello, world!",English
2,"Bonjour le monde!",French
...
```
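As a quick sanity check that the file matches this schema (a minimal sketch):

```python
import pandas as pd

df = pd.read_csv("train_submission.csv")
assert list(df.columns) == ["ID", "Text", "Label"], f"Unexpected columns: {list(df.columns)}"
print(df.head())
print(f"{len(df)} rows, {df['Label'].nunique(dropna=False)} distinct labels")
```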
Before training, run the `preprocessing.py` script to clean and prepare the data:

```
python preprocessing.py
```

This script:
- Removes unwanted special characters
- Standardizes text formatting
- Correctly handles missing labels (`NaN`)
- Splits the data into training and validation sets (these steps are sketched below)
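`preprocessing.py` itself is the reference; the following is only a minimal sketch of these steps, where the exact cleaning regex, the `"Unknown"` placeholder, and the 10% validation split are assumptions:

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train_submission.csv")

def clean(text: str) -> str:
    """Drop control characters and collapse repeated whitespace."""
    text = re.sub(r"[\r\n\t]+", " ", str(text))
    return re.sub(r"\s+", " ", text).strip()

df["Text"] = df["Text"].map(clean)
df["Label"] = df["Label"].fillna("Unknown")  # keep missing labels as a class

# Stratified split preserves the label distribution in both sets
# (requires at least two examples per class).
train_df, val_df = train_test_split(df, test_size=0.1,
                                    stratify=df["Label"], random_state=42)
```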
Run the following command to fine-tune BERT from scratch:
```
python fine_tuning.py
```

This script:
- Loads the dataset and tokenizes the text
- Splits the data into training and validation sets
- Fine-tunes BERT for sequence classification
- Saves the trained model to `./language_classifier` (one possible implementation is sketched after this list)
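As one possible implementation of these steps, here is a minimal sketch using the Hugging Face `Trainer` API. The hyperparameters, the `"Unknown"` fill value, and the in-script split are assumptions; `fine_tuning.py` may differ in detail:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

df = pd.read_csv("train_submission.csv")
df["Label"] = df["Label"].fillna("Unknown")  # assumed NaN handling
labels = sorted(df["Label"].unique())
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

class TextDataset(Dataset):
    """Tokenizes all texts up front and serves (input, label) pairs."""
    def __init__(self, frame):
        self.enc = tokenizer(list(frame["Text"]), truncation=True,
                             padding=True, max_length=128, return_tensors="pt")
        self.labels = torch.tensor([label2id[l] for l in frame["Label"]])
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        return {**{k: v[i] for k, v in self.enc.items()}, "labels": self.labels[i]}

train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()}, label2id=label2id)

args = TrainingArguments(output_dir="./language_classifier",
                         num_train_epochs=3,            # assumed hyperparameters
                         per_device_train_batch_size=16,
                         save_strategy="epoch")         # checkpoint every epoch
trainer = Trainer(model=model, args=args,
                  train_dataset=TextDataset(train_df),
                  eval_dataset=TextDataset(val_df))
trainer.train()
trainer.save_model("./language_classifier")
tokenizer.save_pretrained("./language_classifier")
```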
If training was interrupted or needs additional epochs:
```
python continue.py
```

This script is needed because training runs on Metz DCE, which may disconnect mid-run. It:
- Loads the last saved checkpoint
- Resumes training from where it left off (see the sketch below)
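If training uses the `Trainer` API as in the fine-tuning sketch above, resuming can be as simple as the following (the checkpoint layout under `./language_classifier` is an assumption):

```python
# Reusing model, args, and TextDataset from the fine-tuning sketch above:
trainer = Trainer(model=model, args=args,
                  train_dataset=TextDataset(train_df),
                  eval_dataset=TextDataset(val_df))

# resume_from_checkpoint=True locates the newest checkpoint-* directory under
# output_dir and restores weights, optimizer state, and the global step.
trainer.train(resume_from_checkpoint=True)
```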
To classify text from `test_without_labels.csv`, run:

```
python apply.py
```

This script:
- Loads the trained model
- Tokenizes test data
- Predicts labels for each text entry
- Saves predictions in `test_predictions.csv` (see the sketch after this list)
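A minimal sketch of these steps (batch size and max length are assumptions; it also assumes the label mapping was saved in the model config):

```python
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("./language_classifier")
model = BertForSequenceClassification.from_pretrained("./language_classifier")
model.eval()

test_df = pd.read_csv("test_without_labels.csv")
predictions = []
with torch.no_grad():
    for start in range(0, len(test_df), 32):  # batched to bound memory use
        batch = tokenizer(test_df["Text"].iloc[start:start + 32].tolist(),
                          truncation=True, padding=True, max_length=128,
                          return_tensors="pt")
        ids = model(**batch).logits.argmax(dim=-1).tolist()
        predictions.extend(model.config.id2label[i] for i in ids)

test_df["Label"] = predictions
test_df.to_csv("test_predictions.csv", index=False)
```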
During training, performance metrics (accuracy, precision, recall, F1-score) are logged. The model automatically saves the best checkpoint based on validation accuracy.
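A sketch of how these metrics can be computed with scikit-learn (the macro averaging choice is an assumption; it weights all languages equally, which suits imbalanced classes):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Return the four metrics the training loop logs."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}
```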
The fine-tuned model and tokenizer are saved in `./language_classifier`. To use them later:
```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("./language_classifier")
model = BertForSequenceClassification.from_pretrained("./language_classifier")
model.eval()
```
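Continuing from the snippet above, a minimal prediction call (this assumes the label mapping was saved in the model config as `id2label`):

```python
text = "Bonjour tout le monde"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    pred_id = model(**inputs).logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # assumes id2label was saved with the model
```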
This project is open-source and available under the MIT License.