This project fine-tunes the multilingual BERT (bert-base-multilingual-cased) model for language classification. The dataset contains text samples labeled with their respective languages. The goal is to create a robust model capable of accurately predicting the language of a given text.
- Fine-tunes BERT for sequence classification
- Handles imbalanced datasets and missing labels (`NaN` is treated as its own category, since missing labels actually represent a language; see the sketch after this list)
- Supports training continuation from checkpoints
- Evaluates model performance using accuracy, precision, recall, and F1-score
- Applies the trained model to make predictions on new text samples
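For illustration, keeping `NaN` labels as their own class might look like the following minimal sketch (the placeholder name `"Unknown"` is an assumption, not taken from the project code):

```python
import pandas as pd

# Load the training data described below (ID, Text, Label columns).
df = pd.read_csv("train_submission.csv")

# Keep missing labels as an explicit category instead of dropping them,
# since in this dataset a missing label itself denotes a language.
# "Unknown" is an assumed placeholder name.
df["Label"] = df["Label"].fillna("Unknown")

# Inspect class balance, e.g. before choosing class weights or sampling.
print(df["Label"].value_counts())
```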
```
├── preprocessing.py          # Script for data preprocessing
├── fine_tuning.py            # Script for initial model training
├── continue.py               # Script for resuming training
├── apply.py                  # Script for making predictions with the trained model
├── train_submission.csv      # Training dataset (not included in repo)
├── test_without_labels.csv   # Test dataset (not included in repo)
└── README.md                 # Project documentation
```
```
pip install transformers torch pandas scikit-learn tqdm safetensors
```

Ensure `train_submission.csv` (the training data) is present in the project directory. The CSV should have the following format:

```csv
ID,Text,Label
1,"Hello, world!",English
2,"Bonjour le monde!",French
...
```
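As a quick sanity check that the file matches this schema (a minimal sketch):

```python
import pandas as pd

df = pd.read_csv("train_submission.csv")
assert list(df.columns) == ["ID", "Text", "Label"], f"Unexpected columns: {list(df.columns)}"
print(df.head())
print(f"{len(df)} rows, {df['Label'].nunique(dropna=False)} distinct labels")
```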
Before training, run the `preprocessing.py` script to clean and prepare the data:

```
python preprocessing.py
```

This script:
- Removes unwanted special characters
- Standardizes text formatting
- Correctly handles missing labels (`NaN`)
- Splits the data into training and validation sets (these steps are sketched below)
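`preprocessing.py` itself is the reference; the following is only a minimal sketch of these steps, where the exact cleaning regex, the `"Unknown"` placeholder, and the 10% validation split are assumptions:

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train_submission.csv")

def clean(text: str) -> str:
    """Drop control characters and collapse repeated whitespace."""
    text = re.sub(r"[\r\n\t]+", " ", str(text))
    return re.sub(r"\s+", " ", text).strip()

df["Text"] = df["Text"].map(clean)
df["Label"] = df["Label"].fillna("Unknown")  # keep missing labels as a class

# Stratified split preserves the label distribution in both sets
# (requires at least two examples per class).
train_df, val_df = train_test_split(df, test_size=0.1,
                                    stratify=df["Label"], random_state=42)
```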
Run the following command to fine-tune BERT from scratch:
```
python fine_tuning.py
```

This script:
- Loads the dataset and tokenizes the text
- Splits the data into training and validation sets
- Fine-tunes BERT for sequence classification
- Saves the trained model to `./language_classifier` (one possible implementation is sketched after this list)
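As one possible implementation of these steps, here is a minimal sketch using the Hugging Face `Trainer` API. The hyperparameters, the `"Unknown"` fill value, and the in-script split are assumptions; `fine_tuning.py` may differ in detail:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

df = pd.read_csv("train_submission.csv")
df["Label"] = df["Label"].fillna("Unknown")  # assumed NaN handling
labels = sorted(df["Label"].unique())
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

class TextDataset(Dataset):
    """Tokenizes all texts up front and serves (input, label) pairs."""
    def __init__(self, frame):
        self.enc = tokenizer(list(frame["Text"]), truncation=True,
                             padding=True, max_length=128, return_tensors="pt")
        self.labels = torch.tensor([label2id[l] for l in frame["Label"]])
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        return {**{k: v[i] for k, v in self.enc.items()}, "labels": self.labels[i]}

train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()}, label2id=label2id)

args = TrainingArguments(output_dir="./language_classifier",
                         num_train_epochs=3,            # assumed hyperparameters
                         per_device_train_batch_size=16,
                         save_strategy="epoch")         # checkpoint every epoch
trainer = Trainer(model=model, args=args,
                  train_dataset=TextDataset(train_df),
                  eval_dataset=TextDataset(val_df))
trainer.train()
trainer.save_model("./language_classifier")
tokenizer.save_pretrained("./language_classifier")
```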
If training was interrupted or needs additional epochs:
```
python continue.py
```

This script is needed because training runs on Metz DCE, which may disconnect mid-run. It:
- Loads the last saved checkpoint
- Resumes training from where it left off (see the sketch below)
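If training uses the `Trainer` API as in the fine-tuning sketch above, resuming can be as simple as the following (the checkpoint layout under `./language_classifier` is an assumption):

```python
# Reusing model, args, and TextDataset from the fine-tuning sketch above:
trainer = Trainer(model=model, args=args,
                  train_dataset=TextDataset(train_df),
                  eval_dataset=TextDataset(val_df))

# resume_from_checkpoint=True locates the newest checkpoint-* directory under
# output_dir and restores weights, optimizer state, and the global step.
trainer.train(resume_from_checkpoint=True)
```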
To classify text from `test_without_labels.csv`, run:

```
python apply.py
```

This script:
- Loads the trained model
- Tokenizes test data
- Predicts labels for each text entry
- Saves predictions in `test_predictions.csv` (see the sketch after this list)
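A minimal sketch of these steps (batch size and max length are assumptions; it also assumes the label mapping was saved in the model config):

```python
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("./language_classifier")
model = BertForSequenceClassification.from_pretrained("./language_classifier")
model.eval()

test_df = pd.read_csv("test_without_labels.csv")
predictions = []
with torch.no_grad():
    for start in range(0, len(test_df), 32):  # batched to bound memory use
        batch = tokenizer(test_df["Text"].iloc[start:start + 32].tolist(),
                          truncation=True, padding=True, max_length=128,
                          return_tensors="pt")
        ids = model(**batch).logits.argmax(dim=-1).tolist()
        predictions.extend(model.config.id2label[i] for i in ids)

test_df["Label"] = predictions
test_df.to_csv("test_predictions.csv", index=False)
```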
During training, performance metrics (accuracy, precision, recall, F1-score) are logged. The model automatically saves the best checkpoint based on validation accuracy.
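A sketch of how these metrics can be computed with scikit-learn (the macro averaging choice is an assumption; it weights all languages equally, which suits imbalanced classes):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Return the four metrics the training loop logs."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}
```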
The fine-tuned model and tokenizer are saved in `./language_classifier`. To use them later:
```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("./language_classifier")
model = BertForSequenceClassification.from_pretrained("./language_classifier")
model.eval()
```
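Continuing from the snippet above, a minimal prediction call (this assumes the label mapping was saved in the model config as `id2label`):

```python
text = "Bonjour tout le monde"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    pred_id = model(**inputs).logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # assumes id2label was saved with the model
```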
This project is open-source and available under the MIT License.