# AutoMM for Text - Multilingual Problems

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/text_prediction/multilingual_text.ipynb)
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/text_prediction/multilingual_text.ipynb)



People around the world speak many languages. According to [SIL International](https://en.wikipedia.org/wiki/SIL_International)'s [Ethnologue: Languages of the World](https://en.wikipedia.org/wiki/Ethnologue),
there are more than **7,100** spoken and signed languages. Web data today is highly multilingual, and many
real-world problems involve text written in languages other than English.

In this tutorial, we show how `MultiModalPredictor` can help you build multilingual models. For demonstration,
we use the [Cross-Lingual Amazon Product Review Sentiment](https://webis.de/data/webis-cls-10.html) dataset, which
comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese.
We build sentiment classification models for the German subset of this dataset in two ways:

- Fine-tune a German BERT model
- Cross-lingual transfer from English to German



In [2]:
!pip install autogluon==0.8.0



In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

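# Sample data: replace with the actual Cross-Lingual Amazon Product Review Sentiment dataset.
# For demonstration, we create two tiny DataFrames (four rows each) for English and German.
# With test_size=0.2, each test split holds a single row, so the accuracies reported later are illustrative only.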
english_data = pd.DataFrame({
    'sentence': ['I love this product', 'This is terrible', 'Great quality', 'Not worth the money'],
    'label': [1, 0, 1, 0]  # 1 for positive, 0 for negative
})

german_data = pd.DataFrame({
    'sentence': ['Ich liebe dieses Produkt', 'Das ist schrecklich', 'Großartige Qualität', 'Nicht das Geld wert'],
    'label': [1, 0, 1, 0]  # 1 for positive, 0 for negative
})

# Split the German data for training and testing
train_data_german, test_data_german = train_test_split(german_data, test_size=0.2, random_state=42)

# Split the English data for cross-lingual transfer
train_data_english, test_data_english = train_test_split(english_data, test_size=0.2, random_state=42)
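The toy DataFrames above stand in for the real dataset. As a rough sketch of how a real split might be loaded, assume it has been downloaded as headerless, tab-separated files with a label column followed by a text column. The file layout, the `load_review_split` helper, and the example path are all assumptions for illustration and are not part of this tutorial's data.

```python
import csv

import pandas as pd


def load_review_split(path, label_col="label", text_col="sentence"):
    """Read a headerless, tab-separated file with one `label<TAB>text` row per review.

    Hypothetical helper: the file layout is an assumption. Adjust `sep`,
    `names`, and the column order to match the files you actually have.
    """
    df = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=[label_col, text_col],
        quoting=csv.QUOTE_NONE,  # treat quote characters in reviews as plain text
    )
    # Drop rows with a missing label or text so they cannot reach training.
    return df.dropna(subset=[label_col, text_col]).reset_index(drop=True)


# Example (hypothetical path):
# train_data_german = load_review_split("data/de_train.tsv")
```

The column names default to `label` and `sentence` so that the result can be passed to the cells below without further renaming.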

In [4]:
from autogluon.multimodal import MultiModalPredictor

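# Fine-tune a text model on the German data.
# Note: no checkpoint is specified here, so AutoMM falls back to its default text backbone
# rather than a German BERT; pass a checkpoint name via `hyperparameters` in `fit` to change that.
predictor_german = MultiModalPredictor(label='label', eval_metric='accuracy')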

# Train on the German data
predictor_german.fit(train_data_german, time_limit=600)

# Evaluate the model on the German test data
german_predictions = predictor_german.predict(test_data_german)
german_accuracy = predictor_german.evaluate(test_data_german)
print("German BERT Model Accuracy:", german_accuracy)

No path specified. Models will be saved in: "AutogluonModels/ag-20240925_223643/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1, 0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Detected data scarcity. Consider running using the preset 'few_shot_text_classification' for better performance.
INFO:lightning_fabric.utilities.seed:Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.0.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/content/AutogluonModels/ag-20240925_223643".

- Validation metric is "accuracy".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/AutogluonModels/ag-20240

config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

0 GPUs are detected, and 0 GPUs will be used.

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M 
1 | validation_metric | MulticlassAccuracy           | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

  rank_zero_warn(


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 1: 'val_accuracy' reached 1.00000 (best 1.00000), saving model to '/content/AutogluonModels/ag-20240925_223643/epoch=0-step=1.ckpt' as top 3
AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/content/AutogluonModels/ag-20240925_223643")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/AutogluonModels/ag-20240925_223643
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon




German BERT Model Accuracy: {'accuracy': 1.0}
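
The run above does not name a checkpoint, so it relies on AutoMM's default backbone. A minimal sketch of pinning a German BERT instead is shown here, assuming the Hugging Face checkpoint `bert-base-german-cased` and the AutoGluon 0.8 hyperparameter keys `model.hf_text.checkpoint_name` and `optimization.max_epochs`. Key names may differ in other AutoGluon versions, and `fit_german_bert` is a hypothetical helper.

```python
# Hyperparameter overrides (key names assumed for AutoGluon 0.8).
GERMAN_BERT_HYPERPARAMETERS = {
    "model.hf_text.checkpoint_name": "bert-base-german-cased",
    "optimization.max_epochs": 2,  # assumed small value to keep the demo short
}


def fit_german_bert(train_df, label="label", time_limit=600):
    """Fine-tune the assumed German BERT checkpoint on `train_df`."""
    # Imported lazily so the configuration can be inspected without AutoGluon installed.
    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label=label, eval_metric="accuracy")
    predictor.fit(
        train_df,
        hyperparameters=GERMAN_BERT_HYPERPARAMETERS,
        time_limit=time_limit,
    )
    return predictor


# Example:
# predictor_german_bert = fit_german_bert(train_data_german)
```

The checkpoint download and model summary printed during `fit` should then show the German checkpoint's files instead of the default backbone's.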


In [6]:
# Train the model on the English data
predictor_english = MultiModalPredictor(label='label', eval_metric='accuracy')

# Fit the model on English data
predictor_english.fit(train_data_english, time_limit=400)  # Train for at most 400 seconds (about 6.7 minutes)

# Evaluate the model on German test data (Cross-lingual transfer)
german_predictions_cross = predictor_english.predict(test_data_german)
german_accuracy_cross = predictor_english.evaluate(test_data_german)
print("Cross-lingual Transfer Model Accuracy:", german_accuracy_cross)

No path specified. Models will be saved in: "AutogluonModels/ag-20240925_223913/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1, 0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Detected data scarcity. Consider running using the preset 'few_shot_text_classification' for better performance.
INFO:lightning_fabric.utilities.seed:Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.0.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/content/AutogluonModels/ag-20240925_223913".

- Validation metric is "accuracy".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/AutogluonModels/ag-20240

Sanity Checking: 0it [00:00, ?it/s]

  rank_zero_warn(


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 1: 'val_accuracy' reached 0.00000 (best 0.00000), saving model to '/content/AutogluonModels/ag-20240925_223913/epoch=0-step=1.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 2: 'val_accuracy' reached 0.00000 (best 0.00000), saving model to '/content/AutogluonModels/ag-20240925_223913/epoch=1-step=2.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 3: 'val_accuracy' reached 0.00000 (best 0.00000), saving model to '/content/AutogluonModels/ag-20240925_223913/epoch=2-step=3.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 4: 'val_accuracy' was not in top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 5: 'val_accuracy' was not in top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 5, global step 6: 'val_accuracy' was not in top 3


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 6, global step 7: 'val_accuracy' reached 1.00000 (best 1.00000), saving model to '/content/AutogluonModels/ag-20240925_223913/epoch=6-step=7.ckpt' as top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/content/AutogluonModels/ag-20240925_223913")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/AutogluonModels/ag-20240925_223913
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon




Cross-lingual Transfer Model Accuracy: {'accuracy': 0.0}
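
The model above was fit without naming a checkpoint or preset. For cross-lingual transfer, a backbone pretrained on many languages is usually a better starting point than an English-centric one. The following is a minimal sketch, assuming AutoMM's `multilingual` preset is accepted by `fit` in this AutoGluon version and that `optimization.max_epochs` is the right key for 0.8. Both should be checked against the installed release, and `fit_multilingual` is a hypothetical helper.

```python
# Keyword arguments for `fit` (preset and key names assumed for AutoGluon 0.8).
MULTILINGUAL_FIT_KWARGS = {
    "presets": "multilingual",
    "hyperparameters": {"optimization.max_epochs": 2},  # assumed small value for a quick demo
}


def fit_multilingual(train_df, label="label", time_limit=400):
    """Train on `train_df` (for example English reviews) with a multilingual backbone."""
    # Imported lazily so the configuration can be inspected without AutoGluon installed.
    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label=label, eval_metric="accuracy")
    predictor.fit(train_df, time_limit=time_limit, **MULTILINGUAL_FIT_KWARGS)
    return predictor


# Example: train on English, then evaluate on German.
# predictor_multilingual = fit_multilingual(train_data_english)
# print(predictor_multilingual.evaluate(test_data_german))
```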


In [7]:
print(f"Accuracy of the fine-tuned German BERT: {german_accuracy}")
print(f"Accuracy of the cross-lingual transfer (trained on English): {german_accuracy_cross}")

Accuracy of the fine-tuned German BERT: {'accuracy': 1.0}
Accuracy of the cross-lingual transfer (trained on English): {'accuracy': 0.0}
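
These two numbers should be read with care, because each comes from a very small test split. A quick check of the split sizes is sketched below, reusing the same `train_test_split` settings on a four-row frame. The placeholder sentences are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Four placeholder rows, mirroring the size of the toy DataFrames.
toy = pd.DataFrame({"sentence": ["a", "b", "c", "d"], "label": [1, 0, 1, 0]})
train_part, test_part = train_test_split(toy, test_size=0.2, random_state=42)

# Four rows with test_size=0.2 leave three rows for training and one for testing.
print(len(train_part), len(test_part))  # prints: 3 1

# With a single test row, accuracy can only take the values 0.0 or 1.0.
possible_accuracies = sorted(
    {correct / len(test_part) for correct in range(len(test_part) + 1)}
)
print(possible_accuracies)  # prints: [0.0, 1.0]
```

On the real dataset, a larger held-out set would be needed before the comparison between the two approaches says anything about model quality.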
