# Text Classification with AG's News Topics - Ludwig CLI

*We recommend using a GPU runtime for this example. In the Colab menu bar, choose **Runtime** > **Change Runtime Type** and choose **GPU** under Hardware Accelerator.*

In this notebook, we will show how to use the Ludwig CLI to:


1.   [Download a Dataset](#scrollTo=Download_Dataset)
2.   [Train a Ludwig Model](#scrollTo=Train)
3.   [Evaluate the trained model](#scrollTo=Evaluate)
4.   [Visualize training and test metrics](#scrollTo=Visualize_Metrics)
5.   [Make predictions on New Data](#scrollTo=Make_Predictions_on_New_Data)

In [1]:
# Prerequisite: Installs the latest version of Ludwig in the Colab environment
!python -m pip install git+https://github.com/ludwig-ai/ludwig.git --quiet



# Download Dataset

We'll use AG's news topic classification dataset, a common benchmark for text classification. It is a subset of the full AG news corpus, constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples.
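As a quick sanity check on these figures, the totals follow directly from the per-class counts. This is a minimal sketch; the variable names are illustrative and not part of the dataset or Ludwig.

```python
# Per-class sample counts quoted in the dataset description.
NUM_CLASSES = 4
TRAIN_PER_CLASS = 30_000
TEST_PER_CLASS = 1_900

# Totals across all four classes.
train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

print(f"training samples: {train_total:,}")
print(f"testing samples:  {test_total:,}")
```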


This dataset contains four columns:

| column      | description                                                |
|-------------|------------------------------------------------------------|
| class_index | integer 1-4 corresponding to "world", "sports", "business", "sci_tech" respectively |
| class       | The topic label, one of "world", "sports", "business", "sci_tech" |
| title       | Title of the news article                                  |
| description | Description of the news article                            |
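The relationship between `class_index` and `class` in the table can be written down as a small lookup. This is only a sketch: `index_to_label` is a hypothetical helper for illustration, not something the dataset or Ludwig provides.

```python
# Mapping taken from the column table: class_index 1-4 -> topic label.
CLASS_NAMES = {1: "world", 2: "sports", 3: "business", 4: "sci_tech"}


def index_to_label(class_index: int) -> str:
    """Return the topic label for a 1-based class_index (hypothetical helper)."""
    try:
        return CLASS_NAMES[class_index]
    except KeyError:
        raise ValueError(f"class_index must be in 1-4, got {class_index!r}") from None


print([index_to_label(i) for i in (1, 2, 3, 4)])
```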


In [2]:
# Downloads the AG news dataset to the current working directory.
!ludwig datasets download agnews

NumExpr defaulting to 2 threads.
███████████████████████
█ █ █ █  ▜█ █ █ █ █   █
█ █ █ █ █ █ █ █ █ █ ███
█ █   █ █ █ █ █ █ █ ▌ █
█ █████ █ █ █ █ █ █ █ █
█     █  ▟█     █ █   █
███████████████████████
ludwig v0.5rc2 - Datasets download



# Train

## Define Ludwig config

The Ludwig config declares the machine learning task. It tells Ludwig what to predict, what columns to use as input, and optionally specifies the model type and hyperparameters.

Here, for simplicity, we'll try to predict **class** from **title**.

In [3]:
config_yaml = """
input_features:
  -
    name: title
    type: text
    encoder: parallel_cnn
output_features:
  -
    name: class
    type: category
trainer:
  epochs: 3
"""

# Writes config to a file called "config.yaml"
with open("config.yaml", "w") as f:
  f.write(config_yaml)
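To make the structure of the config easier to see, the same content can be written as a plain Python dictionary. The sketch below mirrors the YAML above; `summarize_config` is a hypothetical helper (not part of Ludwig) that simply lists which columns are inputs and which are outputs.

```python
# Same structure as the YAML config, expressed as a Python dict.
config = {
    "input_features": [
        {"name": "title", "type": "text", "encoder": "parallel_cnn"},
    ],
    "output_features": [
        {"name": "class", "type": "category"},
    ],
    "trainer": {"epochs": 3},
}


def summarize_config(cfg: dict) -> str:
    """Hypothetical helper: describe the inputs -> outputs of a config dict."""
    inputs = ", ".join(f"{f['name']} ({f['type']})" for f in cfg["input_features"])
    outputs = ", ".join(f"{f['name']} ({f['type']})" for f in cfg["output_features"])
    return f"{inputs} -> {outputs}"


print(summarize_config(config))
```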

## Create and train a model

In [4]:
# Trains the model. This cell might take a few minutes.
!ludwig train --dataset agnews.csv -c config.yaml

NumExpr defaulting to 2 threads.
import ray failed with exception: No module named 'ray'
ludwig v0.5rc2 - Train


╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤═══════════════════════════════════════════════════════════════════╕
│ Experiment name  │ experiment                                                        │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                               │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ Output directory │ /content/results/experiment_run                                   │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ ludwig_vers
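The output directory shown in the experiment description appears to combine the experiment name (`experiment`) and the model name (`run`), and the later cells read the trained model from `results/experiment_run/model`. The sketch below reproduces that path under this assumption; `model_dir` is a hypothetical helper, and the actual directory name may differ, for example if the command is run more than once.

```python
from pathlib import PurePosixPath


def model_dir(experiment_name="experiment", model_name="run", results_root="results"):
    """Hypothetical helper: build the model path used by the later cells.

    Assumes the layout <results_root>/<experiment_name>_<model_name>/model,
    inferred from the training output in this notebook.
    """
    return PurePosixPath(results_root) / f"{experiment_name}_{model_name}" / "model"


print(model_dir())
```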

# Evaluate

In [5]:
# Generates predictions and performance statistics for the test set.
!ludwig evaluate --model_path results/experiment_run/model \
                 --dataset agnews.csv \
                 --split test \
                 --output_directory test_results

NumExpr defaulting to 2 threads.
import ray failed with exception: No module named 'ray'
ludwig v0.5rc2 - Evaluate

Dataset path: agnews.csv
Model path: results/experiment_run/model

Loading metadata from: results/experiment_run/model/training_set_metadata.json
Evaluation: 100% 60/60 [00:01<00:00, 58.21it/s]

===== class =====
accuracy: 0.8778786659240723
hits_at_k: 0.98973548412323
loss: 0.35242071747779846
overall_stats: { 'avg_f1_score_macro': 0.8775816181946832,
  'avg_f1_score_micro': 0.8778786682458218,
  'avg_f1_score_weighted': 0.8775850917827788,
  'avg_precision_macro': 0.878247879497043,
  'avg_precision_micro': 0.8778786682458218,
  'avg_precision_weighted': 0.8778786682458218,
  'avg_recall_macro': 0.8778763060890774,
  'avg_recall_micro': 0.8778786682458218,
  'av
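The evaluation output reports both micro- and macro-averaged scores. As a rough illustration of how the two differ, the sketch below computes them from a confusion matrix in plain Python. It is not Ludwig's implementation, and the 2x2 matrix is made up for the example. For single-label classification, micro-averaged precision, recall, and F1 all reduce to plain accuracy, which is consistent with the `accuracy` and `avg_*_micro` values printed above agreeing closely.

```python
def per_class_prf(cm):
    """Per-class (precision, recall, F1) from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    stats = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, but wrong
        fn = sum(cm[c]) - tp                       # true c, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats.append((precision, recall, f1))
    return stats


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    stats = per_class_prf(cm)
    return sum(f1 for _, _, f1 in stats) / len(stats)


def micro_f1(cm):
    """Pool all decisions first; for single-label data this equals accuracy."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Made-up 2-class example (not taken from the AG News results).
toy_cm = [[1, 1],
          [0, 2]]
print("macro F1:", round(macro_f1(toy_cm), 4))
print("micro F1:", round(micro_f1(toy_cm), 4))
```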

# Visualize Metrics

In [6]:
# Visualizes confusion matrix, which gives an overview of classifier performance
# for each class.
!ludwig visualize --visualization confusion_matrix \
                  --ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
                  --test_statistics test_results/test_statistics.json \
                  --output_directory visualizations \
                  --file_format png


# If you run ludwig visualize locally, visualizations will automatically show in
# a window. Here in Colab, we can run the following code to load and display
# generated plots inline.
from IPython import display
import ipywidgets
from pathlib import Path

ipywidgets.HBox([
  ipywidgets.Image(value=Path("visualizations/confusion_matrix__class_top5.png").read_bytes()),
  ipywidgets.Image(value=Path("visualizations/confusion_matrix_entropy__class_top5.png").read_bytes()),
])

NumExpr defaulting to 2 threads.
import ray failed with exception: No module named 'ray'


*[Figure: confusion matrix and confusion matrix entropy plots for the `class` output feature]*
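For readers unfamiliar with confusion matrices, the sketch below shows one way to build one from true and predicted labels: each row is a true class, each column a predicted class, and each cell counts how often that pair occurred. The labels match this dataset, but the six example predictions are made up, and `confusion_matrix` here is a hypothetical helper rather than a Ludwig function.

```python
from collections import Counter

LABELS = ["world", "sports", "business", "sci_tech"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Made-up labels for illustration only.
y_true = ["world", "sports", "business", "sci_tech", "business", "world"]
y_pred = ["world", "sports", "sci_tech", "sci_tech", "business", "business"]

cm = confusion_matrix(y_true, y_pred, LABELS)
for label, row in zip(LABELS, cm):
    print(f"{label:>9}: {row}")
```

Off-diagonal cells are the misclassifications; in this toy data, one `business` headline is predicted as `sci_tech` and one `world` headline as `business`.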

In [7]:
# Visualizes learning curves, which show how performance metrics changed over
# time during training.
!ludwig visualize --visualization learning_curves \
                  --ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
                  --training_statistics results/experiment_run/training_statistics.json \
                  --file_format png \
                  --output_directory visualizations


# If you run ludwig visualize locally, visualizations will automatically show in
# a window. Here in Colab, we can run the following code to load and display
# generated plots inline.
from IPython import display
import ipywidgets
from pathlib import Path

ipywidgets.VBox([
  ipywidgets.HBox([
    ipywidgets.Image(value=Path("visualizations/learning_curves_combined_loss.png").read_bytes()),
    ipywidgets.Image(value=Path("visualizations/learning_curves_class_loss.png").read_bytes()),
  ]),
  ipywidgets.HBox([
    ipywidgets.Image(value=Path("visualizations/learning_curves_class_accuracy.png").read_bytes()),
    ipywidgets.Image(value=Path("visualizations/learning_curves_class_hits_at_k.png").read_bytes())
  ]),
])

NumExpr defaulting to 2 threads.
import ray failed with exception: No module named 'ray'


*[Figure: learning curves for combined loss, `class` loss, `class` accuracy, and `class` hits_at_k]*
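One of the plotted metrics, `hits_at_k`, counts a prediction as correct when the true class is among the k most probable classes. The sketch below illustrates the idea in plain Python. It is not Ludwig's implementation, the probabilities are made up, and k=3 is an assumption chosen for the example; the value Ludwig actually uses depends on the output feature's settings.

```python
def hits_at_k(probabilities, true_indices, k=3):
    """Fraction of samples whose true class is among the k most probable classes."""
    hits = 0
    for probs, true_idx in zip(probabilities, true_indices):
        # Indices of the classes sorted by descending probability.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(true_indices)


# Made-up probabilities over 4 classes for 3 samples.
probs = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.40, 0.10],
]
true = [0, 1, 3]

for k in (1, 3, 4):
    print(f"hits_at_{k}: {hits_at_k(probs, true, k=k):.3f}")
```

With k=1 the metric is the same as accuracy, which is why `hits_at_k` is always at least as high as `accuracy` for k greater than 1.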

# Make Predictions on New Data

Lastly, we'll show how to generate predictions for new data.

The following are some recent news headlines. Feel free to edit or add your own strings to `text_to_predict` to see how the newly trained model classifies them.

In [8]:
import pandas as pd

text_to_predict = pd.DataFrame({
  "title": [
    "Google may spur cloud cybersecurity M&A with $5.4B Mandiant buy",
    "Europe struggles to meet mounting needs of Ukraine's fleeing millions",
    "How the pandemic housing market spurred buyer's remorse across America",
  ]
})

text_to_predict.to_csv('text_to_predict.csv', index=False)

In [9]:
!ludwig predict --model_path results/experiment_run/model \
                --dataset text_to_predict.csv \
                --output_directory predictions

# Loads predictions
predictions = pd.read_parquet('predictions/predictions.parquet')
predictions

NumExpr defaulting to 2 threads.
import ray failed with exception: No module named 'ray'
ludwig v0.5rc2 - Predict

Dataset path: text_to_predict.csv
Model path: results/experiment_run/model

Loading metadata from: results/experiment_run/model/training_set_metadata.json
Prediction: 100% 1/1 [00:00<00:00, 133.56it/s]
Saved to: predictions


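|   | class_predictions | class_probabilities | class_probability | `class_probabilities_<UNK>` | class_probabilities_sci_tech | class_probabilities_sports | class_probabilities_world | class_probabilities_business |
|---|-------------------|---------------------|-------------------|-----------------------------|------------------------------|----------------------------|---------------------------|------------------------------|
| 0 | sci_tech | [2.0307293346899513e-10, 0.951603353023529, 8....] | 0.951603 | 2.030729e-10 | 0.951603 | 8.1e-05 | 0.001582 | 0.046734 |
| 1 | world | [6.485091574859325e-08, 0.007995870895683765, ...] | 0.983966 | 6.485092e-08 | 0.007996 | 0.004161 | 0.983966 | 0.003877 |
| 2 | business | [4.226141390972771e-05, 0.3360758125782013, 0....] | 0.536061 | 4.226141e-05 | 0.336076 | 0.004232 | 0.123588 | 0.536061 |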
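In the prediction output above, `class_predictions` is the class with the highest probability and `class_probability` is that highest value. The sketch below reproduces this from the per-class columns, using the rounded numbers copied from the three rows. `top_prediction` is a hypothetical helper, and the 0.6 threshold is an arbitrary value chosen only to show how a low-confidence prediction (such as the third headline) might be flagged.

```python
# Class order follows the class_probabilities_* columns in the output above.
CLASSES = ["<UNK>", "sci_tech", "sports", "world", "business"]

# Per-class probabilities copied (rounded) from the three prediction rows.
ROWS = [
    [2.030729e-10, 0.951603, 8.1e-05, 0.001582, 0.046734],
    [6.485092e-08, 0.007996, 0.004161, 0.983966, 0.003877],
    [4.226141e-05, 0.336076, 0.004232, 0.123588, 0.536061],
]

CONFIDENCE_THRESHOLD = 0.6  # arbitrary cutoff, for illustration only


def top_prediction(probs, classes=CLASSES):
    """Return (label, probability) of the most probable class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


for i, probs in enumerate(ROWS):
    label, p = top_prediction(probs)
    flag = " (low confidence)" if p < CONFIDENCE_THRESHOLD else ""
    print(f"{i}: {label} {p:.3f}{flag}")
```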
