# Text Classification using AutoTrain Advanced

In this notebook, we train a text classification model using AutoTrain Advanced.
You can replace the model with any Hugging Face Transformers-compatible model, and the dataset with any other properly formatted dataset.
For dataset formatting, see the [docs](https://huggingface.co/docs/autotrain/index).

In [1]:
from autotrain.params import TextClassificationParams
from autotrain.project import AutoTrainProject



In [14]:
import os

HF_USERNAME = "Lyreck"
HF_TOKEN = os.environ.get("HF_TOKEN") # create a token at https://huggingface.co/settings/tokens and export it as the HF_TOKEN environment variable
# Never hardcode a token in a notebook: store it in a secret or an environment variable
# Your token is required if push_to_hub is set to True or if you are accessing a gated model/dataset

In [15]:
import os
print(os.getcwd())

/Users/leo/Desktop/Ecole/SciencesPo/S1/Socio_dig_pub_space/4_et_9-group_work/pythonneries/sociology_scrapping/tiktok_finetuning


In [27]:
params = TextClassificationParams(
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
    data_path="data/", # local folder with the split files; to load from the Hugging Face Hub, pass a dataset ID such as "Lyreck/tiktok_brat_comments"
    text_column="text", # the column in the dataset that contains the text
    target_column="label", # the column in the dataset that contains the labels
    train_split="train",
    valid_split="validation",
    epochs=3,
    batch_size=8,
    max_seq_length=512,
    lr=1e-5,
    optimizer="adamw_torch",
    scheduler="linear",
    gradient_accumulation=1,
    # mixed_precision="fp16", # disabled: requires a GPU (no MPS available on this machine)
    project_name="finetune-tiktok-brat2",
    log="tensorboard",
    push_to_hub=True,
    username=HF_USERNAME,
    token=HF_TOKEN,
)
# tip: you can use `?TextClassificationParams` to see the full list of allowed parameters
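
Before building `params` for a local dataset, it can help to confirm that every row of a split file actually has the columns named in `text_column` and `target_column`. The following is a minimal sketch using only the standard library; the helper `missing_columns` and the two demo rows are hypothetical and not part of AutoTrain.

```python
import json
import tempfile
from pathlib import Path


def missing_columns(jsonl_path, required):
    """Return (line_number, missing_keys) pairs for rows lacking a required column."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            missing = [key for key in required if key not in row]
            if missing:
                problems.append((line_number, missing))
    return problems


# Hypothetical two-row file, written to a temporary folder for illustration only.
demo_dir = Path(tempfile.mkdtemp())
demo_file = demo_dir / "train.jsonl"
demo_file.write_text(
    json.dumps({"text": "love this song", "label": "positive"}) + "\n"
    + json.dumps({"text": "row without a label"}) + "\n",
    encoding="utf-8",
)

problems = missing_columns(demo_file, required=("text", "label"))
print(problems)  # [(2, ['label'])]
```

An empty list means every row carries both columns; otherwise each entry points at the offending line.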

In [29]:
?TextClassificationParams

Init signature:
TextClassificationParams(
    *,
    data_path: str = None,
    model: str = 'bert-base-uncased',
    lr: float = 5e-05,
    epochs: int = 3,
    max_seq_length: int = 128,
    batch_size: int = 8,
    warmup_ratio: float = 0.1,
    gradient_accumulation: int = 1

If your dataset is in CSV or JSONL format (JSONL is preferred) and is stored locally, make the following changes to `params`:

```python
params = TextClassificationParams(
    data_path="data/", # path to the folder containing train.jsonl / train.csv
    text_column="text", # name of the column in the CSV/JSONL file that contains the text
    train_split="train", # filename without extension
    valid_split="valid", # filename without extension
    # ... keep the remaining parameters as before
)
```
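
As an illustration of that local layout, the sketch below writes two small JSONL split files with `text` and `label` columns. The folder name `data_example/`, the helper `write_jsonl`, and the sample rows are assumptions made for the example; substitute your own folder and data.

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line (the JSONL format) and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


# Hypothetical rows; the column names match text_column / target_column.
train_rows = [
    {"text": "this album is amazing", "label": "positive"},
    {"text": "not my thing at all", "label": "negative"},
]
valid_rows = [
    {"text": "pretty good overall", "label": "positive"},
]

# File names (without extension) correspond to train_split / valid_split.
train_path = write_jsonl(train_rows, "data_example/train.jsonl")
valid_path = write_jsonl(valid_rows, "data_example/valid.jsonl")
print(train_path, valid_path)
```

With these files in place, `data_path="data_example/"`, `train_split="train"`, and `valid_split="valid"` would point at them.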

In [28]:
# this will train the model locally
project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

  train_df.loc[:, "autotrain_label"] = train_df["autotrain_label"].astype(str)
  valid_df.loc[:, "autotrain_label"] = valid_df["autotrain_label"].astype(str)
Casting the dataset: 100%|██████████| 13/13 [00:00<00:00, 2439.53 examples/s]
Casting the dataset: 100%|██████████| 10/10 [00:00<00:00, 2565.32 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 13/13 [00:00<00:00, 1915.14 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 10/10 [00:00<00:00, 1875.81 examples/s]

INFO     | 2024-11-08 23:14:23 | autotrain.backends.local:create:20 - Starting local training...
INFO     | 2024-11-08 23:14:23 | autotrain.commands:launch_command:523 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'no', '-m', 'autotrain.trainers.text_classification', '--training_config', 'finetune-tiktok-brat2/training_params.json']
INFO     | 2024-11-08 23:14:23 | autotrain.commands:launch_command:524 - {'data_path': 'finetune-tiktok-brat2/autotrain-data', 'model': 'cardiffnlp/twitter-xlm-roberta-base-sentiment', 'lr': 1e-05, 'epochs': 3, 'max_seq_length': 512, 'batch_size': 8, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'train_split': 'train', 'valid_split': 'validation', 'text


The following values were not passed to `accelerate launch` and had defaults used instead:
	`--dynamo_backend` was set to a value of `'no'`


INFO     | 2024-11-08 23:14:39 | __main__:train:50 - loading dataset from disk
INFO     | 2024-11-08 23:14:39 | __main__:train:70 - loading dataset from disk
INFO     | 2024-11-08 23:14:45 | __main__:train:143 - Logging steps: 1
INFO     | 2024-11-08 23:14:50 | autotrain.trainers.common:on_train_begin:386 - Starting to train...


  0%|          | 0/6 [00:00<?, ?it/s]

ERROR    | 2024-11-08 23:14:54 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/Users/leo/Desktop/Ecole/SciencesPo/S1/Socio_dig_pub_space/4_et_9-group_work/pythonneries/sociology_scrapping/tiktok_finetuning/autotrainenv/lib/python3.11/site-packages/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/Desktop/Ecole/SciencesPo/S1/Socio_dig_pub_space/4_et_9-group_work/pythonneries/sociology_scrapping/tiktok_finetuning/autotrainenv/lib/python3.11/site-packages/autotrain/trainers/text_classification/__main__.py", line 200, in train
    trainer.train()
  File "/Users/leo/Desktop/Ecole/SciencesPo/S1/Socio_dig_pub_space/4_et_9-group_work/pythonneries/sociology_scrapping/tiktok_finetuning/autotrainenv/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^


  0%|          | 0/6 [00:03<?, ?it/s]


21596

In [8]:
project.local

True
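
The progress bar total of 6 in the training log is consistent with the run's settings: 13 training examples, `batch_size=8`, `epochs=3`, and `gradient_accumulation=1`. The sketch below reproduces that arithmetic, assuming a single process and that the final partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not an AutoTrain or Transformers function.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs, gradient_accumulation=1):
    """Estimate the number of optimizer steps for a single-process run."""
    # Batches per epoch, counting the final partial batch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per `gradient_accumulation` batches, at least one per epoch.
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation, 1)
    return updates_per_epoch * epochs


steps = total_optimizer_steps(num_examples=13, batch_size=8, epochs=3)
print(steps)  # 6
```

With a dataset this small, each epoch is only two updates, so increasing `epochs` or adding data is the main way to get more training steps.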