# Multimodal Classification of Twitter Bots - Ludwig CLI

*We recommend using a GPU runtime for this example. In the Colab menu bar, choose **Runtime** > **Change Runtime Type** and choose **GPU** under Hardware Accelerator.*

In this notebook, we will show how to use the Ludwig CLI to:


1.   [Train a Ludwig Model](#scrollTo=Train)
2.   [Evaluate the trained model](#scrollTo=Evaluate)
3.   [Visualize training and test metrics](#scrollTo=Visualize_Metrics)


This example uses a dataset from Kaggle, so you'll need a [Kaggle account](https://www.kaggle.com/account/login) to download it.

# Upload Kaggle Credentials

Run the cell below and upload your `kaggle.json` file.

In [None]:
import os

from google.colab import files

uploaded_files = files.upload()

# Creates the .kaggle directory if it does not already exist.
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

# Write the contents of the uploaded file to ~/.kaggle/kaggle.json
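if "kaggle.json" in uploaded_files:
  with open(os.path.expanduser("~/.kaggle/kaggle.json"), "wb") as f:
    f.write(uploaded_files["kaggle.json"])
  # Restricts the file to the current user so others cannot read the API key.
  os.chmod(os.path.expanduser("~/.kaggle/kaggle.json"), 0o600)
else:
  print("kaggle.json was not uploaded; the dataset download below will fail.")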
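
If the download step later fails with an authentication error, it can help to inspect the credentials file first. The following is a minimal sketch, not part of the Kaggle tooling: `check_kaggle_credentials` is a hypothetical helper, and it assumes that a Kaggle API token file is a JSON object with `username` and `key` entries.

```python
import json
import os
import stat


def check_kaggle_credentials(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the credentials file (empty if none)."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    try:
        with open(path) as f:
            credentials = json.load(f)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(credentials, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the token file holds non-empty "username" and "key" entries.
    for field in ("username", "key"):
        if not credentials.get(field):
            problems.append(f"missing field: {field}")

    # Flags files that group members or other users can access.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions {oct(mode)} are too open; expected 0o600")
    return problems


print(check_kaggle_credentials() or "Credentials look fine.")
```

An empty list means the file exists, parses, contains both fields, and is readable only by the current user.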

# Download Dataset

We'll be using the [twitter human-bots dataset](https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data), which contains 37,438 rows, each corresponding to a Twitter user account. Each row contains 20 feature columns collected via the Twitter API. These features span multiple data modalities, including the account description (text) and the profile image.

The target column **account_type** has two unique values: **bot** or **human**. 25,013 user accounts were annotated as human; the remaining 12,425 are bots.
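
Because the classes are imbalanced, accuracy is easier to interpret against a majority-class baseline. The short calculation below uses only the two counts quoted above; the variable names are illustrative.

```python
# Class counts quoted in the dataset description.
class_counts = {"human": 25013, "bot": 12425}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = shares[majority_label]

print(f"total accounts: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Always predicting **human** would be right for roughly two thirds of the accounts, so a trained model should be judged by how far it exceeds that figure.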


This dataset contains 20 columns, but we'll only use these 16 (15 input + 1 target):

| column      | type | description                                         |
|-------------|------|-----------------------------------------------------|
| default_profile | binary | Boolean indicating whether the account has a default profile |
| default_profile_image | binary | Boolean indicating whether the account has a default profile image |
| description | text |  User account description                           |
| favourites_count | number | Total number of favourited tweets            |
| followers_count | number | Total number of followers                     |
| friends_count | number | Total number of friends                         |
| geo_enabled | binary | Boolean indicating whether the account has the geographic location enabled  |
| lang | category | Language of the account                                |
| location | category | Location of the account                            |
| profile_background_image_url | image | Profile background image url      |
| profile_image_url | image | Profile image url                            |
| statuses_count | number | Total number of tweets                         |
| verified | binary | Boolean indicating whether the account has been verified |
| average_tweets_per_day | number | Average tweets posted per day          |
| account_age_days | number | Account age measured in days                 |
| account_type   | binary | Account type, one of {bot, human}              |


In [None]:
# Downloads the dataset to the current working directory
!kaggle datasets download danieltreiman/twitter-human-bots-dataset

# Unzips the downloaded dataset, creates profile_images,
# profile_background_images, and twitter_human_bots_dataset.csv
!unzip -q -o twitter-human-bots-dataset.zip

In [None]:
# Previews a few rows of the dataset:
!head twitter_human_bots_dataset.csv

# Train

In [None]:
# Prerequisite: Installs the latest development version of Ludwig from GitHub
# in the Colab environment
!python -m pip install git+https://github.com/ludwig-ai/ludwig.git --quiet

## Define Ludwig config

The Ludwig config declares the machine learning task: which columns to use, their datatypes, and which columns to predict.

In [None]:
config_yaml = """
input_features:
  - name: default_profile
    type: binary
  - name: default_profile_image
    type: binary
  - name: description
    type: text
  - name: favourites_count
    type: number
  - name: followers_count
    type: number
  - name: friends_count
    type: number
  - name: geo_enabled
    type: binary
  - name: lang
    type: category
  - name: location
    type: category
  - name: profile_background_image_path
    type: category
  - name: profile_image_path
    type: image
    preprocessing:
      num_channels: 3
  - name: statuses_count
    type: number
  - name: verified
    type: binary
  - name: average_tweets_per_day
    type: number
  - name: account_age_days
    type: number
output_features:
  - name: account_type
    type: binary
"""

# Writes config to "config.yaml"
with open("config.yaml", "w") as f:
  f.write(config_yaml)
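
A feature name in the config that does not match a CSV column is an easy mistake to make, so a quick check before training can save a failed run. The sketch below is one possible way to do that with the standard library only. `feature_names` and `missing_columns` are hypothetical helpers, and the regex is a deliberate simplification that assumes each feature is declared on its own `- name: ...` line, as in the config above.

```python
import csv
import re


def feature_names(config_text):
    """Extract feature names from '- name: ...' lines of a Ludwig config."""
    # Simplification: a regex instead of a YAML parser, which is enough for
    # the flat feature lists used in this notebook.
    pattern = r"^[ \t]*-[ \t]*name:[ \t]*(\S+)[ \t]*$"
    return re.findall(pattern, config_text, flags=re.MULTILINE)


def missing_columns(csv_path, config_text):
    """Return config feature names that are absent from the CSV header."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])
    return [name for name in feature_names(config_text) if name not in header]


example_config = """
input_features:
  - name: verified
    type: binary
output_features:
  - name: account_type
    type: binary
"""
print(feature_names(example_config))
```

In this notebook, `missing_columns("twitter_human_bots_dataset.csv", config_yaml)` should return an empty list once the dataset has been downloaded.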

## Create and train a model

In [None]:
# Trains the model. This cell might take a few minutes.
!ludwig train --dataset twitter_human_bots_dataset.csv -c config.yaml

# Evaluate

In [None]:
# Generates predictions and performance statistics for the test set.
!ludwig evaluate --model_path results/experiment_run/model \
                 --dataset twitter_human_bots_dataset.csv \
                 --split test \
                 --output_directory results/experiment_run
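
The evaluation step writes its statistics as JSON, and it is often convenient to print only the scalar metrics. The sketch below assumes the file is a nested JSON object keyed by feature name and then by metric name; that layout, and the helper names `scalar_metrics` and `load_scalar_metrics`, are assumptions rather than a documented contract. The numbers in the demo dictionary are placeholders, not results from this model.

```python
import json


def scalar_metrics(stats, prefix=""):
    """Flatten nested statistics into {'feature.metric': value} for numeric leaves."""
    flat = {}
    for key, value in stats.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(scalar_metrics(value, prefix=f"{name}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
        # Lists (for example a confusion matrix) and other types are skipped.
    return flat


def load_scalar_metrics(path):
    """Read a statistics JSON file and return its flattened scalar metrics."""
    with open(path) as f:
        return scalar_metrics(json.load(f))


# Placeholder values for illustration only.
demo = {
    "account_type": {"accuracy": 0.5, "confusion_matrix": [[1, 2], [3, 4]]},
    "combined": {"loss": 1.0},
}
for name, value in sorted(scalar_metrics(demo).items()):
    print(f"{name}: {value}")
```

With the paths used in this notebook, the call would be `load_scalar_metrics("results/experiment_run/test_statistics.json")`.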

# Visualize Metrics

In [None]:
# Visualizes confusion matrix, which gives an overview of classifier performance
# for each class.
!ludwig visualize --visualization confusion_matrix \
                  --ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
                  --test_statistics results/experiment_run/test_statistics.json \
                  --output_directory visualizations \
                  --file_format png

# If you run ludwig visualize locally, visualizations will automatically show in
# a window. Here in Colab, we can run the following code to load and display
# generated plots inline.
from pathlib import Path

import ipywidgets

ipywidgets.HBox([
  ipywidgets.Image(value=Path("visualizations/confusion_matrix__account_type_top3.png").read_bytes()),
  ipywidgets.Image(value=Path("visualizations/confusion_matrix_entropy__account_type_top3.png").read_bytes()),
])

In [None]:
# Visualizes learning curves, which show how performance metrics changed over
# time during training.
!ludwig visualize --visualization learning_curves \
                  --training_statistics results/experiment_run/training_statistics.json \
                  --output_directory visualizations \
                  --file_format png


ipywidgets.HBox([
  ipywidgets.Image(value=Path("visualizations/learning_curves_account_type_loss.png").read_bytes()),
  ipywidgets.Image(value=Path("visualizations/learning_curves_account_type_accuracy.png").read_bytes()),
])
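
The cells above load plots by exact filename, which fails with a `FileNotFoundError` if a file is named differently than expected. As a hedged alternative, the sketch below looks up plots by filename prefix; `find_plots` is a hypothetical helper and simply assumes the plots are PNG files in a single directory.

```python
from pathlib import Path


def find_plots(directory, prefix):
    """Return PNG files in `directory` whose names start with `prefix`, sorted."""
    return sorted(Path(directory).glob(f"{prefix}*.png"))


# Prints an empty list if the directory or the plots do not exist yet.
print([p.name for p in find_plots("visualizations", "learning_curves")])
```

The result can be passed straight to the widgets used above, for example `ipywidgets.HBox([ipywidgets.Image(value=p.read_bytes()) for p in find_plots("visualizations", "learning_curves")])`.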