Fine-Tuning BERT for Language Classification

Overview

This project fine-tunes the multilingual BERT (bert-base-multilingual-cased) model for language classification. The dataset contains text samples labeled with their respective languages. The goal is to create a robust model capable of accurately predicting the language of a given text.

Features

  • Fine-tunes BERT for sequence classification
  • Handles imbalanced datasets and missing labels (NaN is kept as its own category, since those rows still represent a real language)
  • Supports training continuation from checkpoints
  • Evaluates model performance using accuracy, precision, recall, and F1-score
  • Applies the trained model to make predictions on new text samples

Repository Structure

├── preprocessing.py    # Script for data preprocessing
├── fine_tuning.py      # Script for initial model training
├── continue.py         # Script for resuming training
├── apply.py            # Script for making predictions with the trained model
├── train_submission.csv  # Training dataset (not included in repo)
├── test_without_labels.csv  # Test dataset (not included in repo)
├── README.md          # Project documentation

Setup

1. Install Dependencies

pip install transformers torch pandas scikit-learn tqdm safetensors

2. Prepare Data

Ensure train_submission.csv (training data) is present in the project directory. The CSV should have the following format:

ID,Text,Label
1,"Hello, world!",English
2,"Bonjour le monde!",French
...
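A minimal sketch of loading a file in this format with pandas, keeping NaN labels as their own class rather than dropping them (the inline sample here is hypothetical; the real file is train_submission.csv):

```python
import io

import pandas as pd

# Hypothetical sample mirroring the expected train_submission.csv layout
csv_text = """ID,Text,Label
1,"Hello, world!",English
2,"Bonjour le monde!",French
3,"Some unlabeled text",
"""

df = pd.read_csv(io.StringIO(csv_text))

# A missing label still denotes one language in this dataset,
# so NaN is turned into an explicit class instead of being dropped.
df["Label"] = df["Label"].fillna("NaN")
print(df["Label"].tolist())  # ['English', 'French', 'NaN']
```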

3. Data Preprocessing

Before training, run the preprocessing.py script to clean and prepare the data:

python preprocessing.py

This script:

  • Removes unwanted special characters
  • Standardizes text formatting
  • Correctly handles missing labels (NaN)
  • Splits data into training and validation sets
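The steps above can be sketched as follows with pandas and scikit-learn; the exact cleaning rules in preprocessing.py may differ, and the small DataFrame here is only illustrative:

```python
import re

import pandas as pd
from sklearn.model_selection import train_test_split


def clean_text(text: str) -> str:
    # Collapse newlines/tabs and repeated whitespace; keep all Unicode
    # letters, since the task is multilingual.
    text = re.sub(r"[\r\n\t]+", " ", str(text))
    return re.sub(r"\s+", " ", text).strip()


df = pd.DataFrame({
    "Text": ["Hello,\n world! ", "Bonjour   le monde!", "Hola mundo"],
    "Label": ["English", "French", None],
})
df["Text"] = df["Text"].map(clean_text)
df["Label"] = df["Label"].fillna("NaN")  # missing label kept as a class

# Hold out one sample here just for illustration
train_df, val_df = train_test_split(df, test_size=1, random_state=42)
```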

Training the Model

Initial Fine-Tuning

Run the following command to start fine-tuning BERT:

python fine_tuning.py

This script:

  • Loads the dataset and tokenizes the text
  • Splits the data into training and validation sets
  • Fine-tunes BERT for sequence classification
  • Saves the trained model to ./language_classifier
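The training wiring might look roughly like this with the Hugging Face Trainer API. The label set and hyperparameters below are illustrative assumptions, not the values used in fine_tuning.py, and `build_trainer` is a hypothetical helper expecting already-tokenized datasets:

```python
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical label set; the real one comes from train_submission.csv,
# with "NaN" kept as its own class.
labels = ["English", "French", "NaN"]
label2id = {lab: i for i, lab in enumerate(sorted(labels))}
id2label = {i: lab for lab, i in label2id.items()}


def build_trainer(train_ds, val_ds):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased",
        num_labels=len(label2id), id2label=id2label, label2id=label2id,
    )
    args = TrainingArguments(
        output_dir="./language_classifier",  # checkpoints land here
        num_train_epochs=3,
        per_device_train_batch_size=16,
        save_strategy="epoch",
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=val_ds)
```

Calling `build_trainer(...).train()` then runs the fine-tuning loop and writes checkpoints under ./language_classifier.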

Continue Training

If training was interrupted or needs additional epochs:

python continue.py

This script exists because training runs on Metz DCE, which may disconnect partway through. It:

  • Loads the last saved checkpoint
  • Resumes training from where it left off
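Finding the last checkpoint can be sketched in plain Python: the Hugging Face Trainer writes checkpoints as `output_dir/checkpoint-<step>`, so resuming amounts to locating the highest step. The helper name `latest_checkpoint` is hypothetical, not taken from continue.py:

```python
import os
import re


def latest_checkpoint(output_dir: str):
    # Trainer saves checkpoints as output_dir/checkpoint-<step>;
    # pick the one with the highest step number, if any exist.
    pattern = re.compile(r"checkpoint-(\d+)$")
    best, best_step = None, -1
    if os.path.isdir(output_dir):
        for name in os.listdir(output_dir):
            m = pattern.search(name)
            if m and int(m.group(1)) > best_step:
                best_step = int(m.group(1))
                best = os.path.join(output_dir, name)
    return best


# With the checkpoint in hand, resuming is a single call:
# trainer.train(resume_from_checkpoint=latest_checkpoint("./language_classifier"))
```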

Making Predictions

To classify text from test_without_labels.csv, run:

python apply.py

This script:

  • Loads the trained model
  • Tokenizes test data
  • Predicts labels for each text entry
  • Saves predictions in test_predictions.csv
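A condensed sketch of such a prediction loop is below. The function name `predict_labels`, the batch size, and the `max_length` value are assumptions for illustration, not values taken from apply.py:

```python
import pandas as pd
import torch
from transformers import BertForSequenceClassification, BertTokenizer


def predict_labels(csv_in="test_without_labels.csv",
                   csv_out="test_predictions.csv",
                   model_dir="./language_classifier",
                   batch_size=32):
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir).eval()

    df = pd.read_csv(csv_in)
    preds = []
    for start in range(0, len(df), batch_size):
        batch = df["Text"].iloc[start:start + batch_size].astype(str).tolist()
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=128, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        # Map predicted class ids back to language names
        preds += [model.config.id2label[i]
                  for i in logits.argmax(dim=-1).tolist()]

    df["Label"] = preds
    df.to_csv(csv_out, index=False)
```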

Model Evaluation

During training, performance metrics (accuracy, precision, recall, F1-score) are logged. The model automatically saves the best checkpoint based on validation accuracy.
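A typical way to log these metrics is a `compute_metrics` callback passed to the Hugging Face Trainer. This sketch uses scikit-learn with weighted averaging as one plausible choice; the averaging mode actually used during training may differ:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair as passed by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}
```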

Saving & Loading the Model

The fine-tuned model and tokenizer are saved in ./language_classifier. To use it later:

from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("./language_classifier")
model = BertForSequenceClassification.from_pretrained("./language_classifier")
model.eval()  # disable dropout for deterministic inference

# Classify a single text sample
inputs = tokenizer("Hello, world!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = model.config.id2label[logits.argmax(dim=-1).item()]

License

This project is open-source and available under the MIT License.
