# **O-TextCat-O, O-TextCat-Z**

Notebook to train TextCat-SDG on OSDG data and test it on OSDG and ZORA data. Because the data is single-labelled, make sure that `exclusive_classes` is set to `true` in your config file.
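As a quick sanity check before training, the setting can be read back from the config text. The sketch below is only an illustration under stated assumptions: it uses Python's standard `configparser` rather than spaCy's own config loader, and it assumes the classifier component is named `textcat`, so that the setting sits in the `[components.textcat.model]` section. The inline config fragment and the helper `has_exclusive_classes` are hypothetical.

```python
import configparser

# Hypothetical fragment of a spaCy training config (assumed component name: "textcat").
SAMPLE_CONFIG = """
[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
"""


def has_exclusive_classes(config_text, section="components.textcat.model"):
    """Return True if `exclusive_classes` is set to true in the given section."""
    # interpolation=None keeps "${paths.train}"-style references from raising errors
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    value = parser.get(section, "exclusive_classes", fallback="false")
    return value.strip().lower() == "true"


print(has_exclusive_classes(SAMPLE_CONFIG))
```

To check a real file, the same helper could be called on the file's contents, for example `has_exclusive_classes(open(config_path).read())`.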

Pipeline:
- tokenizer `spacy.Tokenizer`
- classifier `TextCatBOW` (bag-of-words + logistic regression)

All other params left at default, including:
- max training steps = 20'000
- patience = 1600
- unlimited epochs
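
To illustrate how `patience` and the step limit interact, here is a simplified sketch of the stopping rule. It is not spaCy's actual training loop. It assumes the model is evaluated every 200 steps (the `eval_frequency` value is an assumption about the default config) and that training stops once no new best score has been seen for `patience` steps, or once `max_steps` is reached. The function `stopping_step` and the scores are made up for illustration.

```python
def stopping_step(scores, eval_frequency=200, patience=1600, max_steps=20000):
    """Return (step, reason) at which this simplified training loop would stop."""
    best_score, best_step = float("-inf"), 0
    step = 0
    for score in scores:
        step += eval_frequency  # one evaluation every `eval_frequency` steps
        if score > best_score:
            best_score, best_step = score, step
        # Stop if the best score is at least `patience` steps old
        if patience and step - best_step >= patience:
            return step, "patience"
        # Stop if the hard step limit has been reached
        if max_steps and step >= max_steps:
            return step, "max_steps"
    return step, "ran out of scores"


# Invented dev scores: improvement until step 600, then a plateau.
scores = [0.5, 0.6, 0.7] + [0.7] * 8
print(stopping_step(scores))  # (2200, 'patience')
```

With these made-up scores the best checkpoint is at step 600, so the run ends 1600 steps later, long before the 20,000-step limit.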

# Installation & Imports

In [None]:
%pip install -q "spacy==3.5.3"

In [None]:
import pandas as pd
import spacy
import shutil
import json
import locale

from spacy.cli.train import train as spacy_train
from google.colab import drive, files
from sklearn.metrics import classification_report

# For encoding issue with spaCy Doc files:
locale.getpreferredencoding = lambda: "UTF-8"
# Free up disk space (ignore_errors so the cell can be re-run safely):
shutil.rmtree("/content/sample_data", ignore_errors=True)

train_set = 'osdg'
model_name = 'textcat'

# Read data & set globals

### Enter your filepaths:

Set the paths to your Drive folder, or upload the following files to this session:
- train, dev, and test data in spaCy Doc files: `dataset-name_train.spacy`
- test data in TSV format: `dataset-name_test.tsv`
- model config file: `model-name_config.cfg`

In [None]:
# Run this cell to mount Google Drive:
drive.mount("/content/drive", force_remount=True)

In [None]:
# Edit paths as needed:
base_path = "/content/drive/YOUR_FOLDER/zora_classifier/data"
model_path = f"{base_path}/models/{model_name}"  # where config file is located
config_path = f"{model_path}/{model_name}_config.cfg"
train_path, dev_path = (f"{base_path}/spacy_docs/{train_set}_train.spacy",
                        f"{base_path}/spacy_docs/{train_set}_dev.spacy")
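
Before starting a long training run, it can help to confirm that the config and data files exist where the paths point. A minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the commented example assumes the path variables defined in the cell above.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the subset of the given paths that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Example usage with the notebook's variables (commented out so the sketch runs anywhere):
# missing = missing_files(config_path, train_path, dev_path)
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```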

# Train

In [None]:
spacy_train(config_path,
            output_path=train_set,
            overrides={"paths.train": train_path,
                       "paths.dev": dev_path})

Remove `model-last`:

In [None]:
shutil.rmtree(f"/content/{train_set}/model-last")
best_model = f"/content/{train_set}/model-best"

# Test & predict

In [None]:
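def get_test_df(file):
  """Read a tab-separated test set; `sdg` and `abstract` are returned as string columns."""
  test_df = pd.read_csv(file, sep='\t', keep_default_na=False,
                        index_col=0, encoding='utf-8')
  test_df = test_df.astype({'sdg': 'string',
                            'abstract': 'string'})

  # Drop metadata columns that are not needed for evaluation, if present:
  test_df = test_df.drop(columns=['faculty', 'year'], errors='ignore')

  return test_df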

In [None]:
def predict(test_df):
  X_test = test_df['abstract'].values
  y_test = test_df['sdg'].values

  nlp = spacy.load(best_model)
  print("Making predictions....")

  spacy_probs = [doc.cats for doc in nlp.pipe(X_test)]
  print("Done making predictions!")

  # For each item, select the label to which the model has assigned the highest probability:
  preds = []
  probs = []
  for label_probs_dict in spacy_probs:
    pred, prob = max(label_probs_dict.items(), key=lambda x: x[1])
    preds.append(pred)
    probs.append(prob)

  preds = pd.Series(preds)
  probs = pd.Series(probs)

  preds_df = pd.DataFrame({'abstract': X_test,
                         'label': y_test,
                         'prediction': preds,
                         'probability': probs})
  # Align original indices
  preds_df.index = test_df.index

  preds_df = preds_df.astype({'abstract': 'string',
                              'label': 'int',
                              'prediction': 'int',
                              'probability': 'float'})

  return preds_df
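
The label-selection step in `predict` can be tried in isolation. In the sketch below, the dictionary stands in for a `doc.cats` mapping; its scores are made up for illustration, and `top_label` is a hypothetical helper.

```python
def top_label(cats):
    """Return the (label, score) pair with the highest score in a doc.cats-style dict."""
    return max(cats.items(), key=lambda item: item[1])


# Made-up scores for a single document.
example_cats = {"1": 0.05, "3": 0.80, "13": 0.15}
label, score = top_label(example_cats)
print(label, score)  # 3 0.8
```

The label comes back as a string, which is why `predict` casts the `prediction` column to `int` afterwards.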

## OSDG

In [None]:
test_set = 'osdg'
test_cats = [str(i) for i in range(1, 17)]
output_file = f"{best_model}/test_{test_set}_eval.json"

# Edit paths as needed:
test_path = f"{base_path}/spacy_docs/{test_set}_test.spacy"
test_df = get_test_df(f"{base_path}/train_test/{test_set}_test.tsv")

**Syntax:**

`python3 -m spacy benchmark accuracy MODEL_PATH TEST_DATA_PATH --output OUTPUT_FILE`

- model path: `{best_model}/`
- test data (`.spacy` file): `{test_path}`
- output file: `{output_file}`

In [None]:
!python3 -m spacy benchmark accuracy \
  {best_model}/ {test_path} \
  --output {output_file}

Create and download predictions dataframe:

In [None]:
preds_df = predict(test_df)

preds_file = f"/content/{train_set}-{test_set}_preds.tsv"
preds_df.to_csv(preds_file, sep='\t', encoding='utf-8')
files.download(preds_file)

## ZORA

The OSDG data contains no examples of SDG 17, so we can't use spaCy's `benchmark accuracy` command to test on the ZORA data: it disregards SDG 17. Instead, we generate the predictions first and then calculate the evaluation metrics on them with `sklearn`.
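
To see why the label set matters, the sketch below computes per-label recall by hand in plain Python. It is a stand-in for the kind of figure `sklearn` reports, not its implementation. The tiny label lists are invented: the gold labels include SDG 17 while the predictions never do. `per_label_recall` is a hypothetical helper.

```python
from collections import Counter


def per_label_recall(y_true, y_pred, labels):
    """Recall per label; labels without gold examples get 0.0 (like zero_division=0)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {lab: (hits[lab] / support[lab] if support[lab] else 0.0) for lab in labels}


# Invented example: the one SDG 17 document is predicted as SDG 3.
y_true = [1, 1, 3, 17]
y_pred = [1, 1, 3, 3]

recall_without_17 = per_label_recall(y_true, y_pred, labels=[1, 3])
recall_with_17 = per_label_recall(y_true, y_pred, labels=[1, 3, 17])

# Macro-averaged recall looks perfect if SDG 17 is left out of the label set.
print(sum(recall_without_17.values()) / len(recall_without_17))
print(sum(recall_with_17.values()) / len(recall_with_17))
```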

In [None]:
test_set = 'zora'
output_file = f"{best_model}/test_{test_set}_eval.json"

In [None]:
# Edit path as needed:
test_df = get_test_df(f"{base_path}/train_test/{test_set}_test.tsv")
preds_df = predict(test_df)

preds_file = f"/content/{train_set}-{test_set}_preds.tsv"
preds_df.to_csv(preds_file, sep='\t', encoding='utf-8')
files.download(preds_file)

In [None]:
def generate_classification_report(y_true, y_pred, labels, as_dict):
  """
  Evaluate predictions with sklearn's classification report (by-label, macro, and weighted averages),
  return either as dict or preformatted text.
  """
  target_names = [f"sdg_{i}" if (i > 9) or (i == -1) else f"sdg_0{i}" for i in labels]

  if as_dict:
    report = classification_report(y_true, y_pred, labels=labels, target_names=target_names,
                                   zero_division=0.0, output_dict=True)
  else:
    report = classification_report(y_true, y_pred, labels=labels, target_names=target_names,
                                   digits=4, zero_division=0.0, output_dict=False)

  return report

In [None]:
y_true = preds_df['label']
y_pred = preds_df['prediction']
labels = list(range(1, 18))  # SDGs 1 to 17

report_dict = generate_classification_report(y_true, y_pred, labels, as_dict=True)
with open(output_file, 'w') as f:
    json.dump(report_dict, f)
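
The saved JSON can later be read back to pull out individual scores. The sketch below uses an invented miniature report with the same shape as the dictionary returned with `output_dict=True` (one entry per label plus averaged rows); the numbers are placeholders, not results, and `per_label_f1` is a hypothetical helper.

```python
import json

# Invented miniature report; real files contain one entry per SDG plus averaged rows.
report_json = json.dumps({
    "sdg_01": {"precision": 0.9, "recall": 0.8, "f1-score": 0.85, "support": 10},
    "sdg_17": {"precision": 0.0, "recall": 0.0, "f1-score": 0.0, "support": 5},
    "macro avg": {"precision": 0.45, "recall": 0.4, "f1-score": 0.425, "support": 15},
})


def per_label_f1(report):
    """Map each 'sdg_XX' entry to its F1 score, skipping the averaged rows."""
    return {name: row["f1-score"] for name, row in report.items()
            if name.startswith("sdg_")}


report = json.loads(report_json)
print(per_label_f1(report))
```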

In [None]:
report_text = generate_classification_report(y_true, y_pred, labels, as_dict=False)

print("\t\t --- ZORA TEST --- \n")
print(report_text)

# Download

Zip model folder and download.

**Syntax:** `zip -r /content/desired_filename.zip /content/folder_to_zip`

In [None]:
!zip -r /content/{train_set}.zip {best_model}

In [None]:
files.download(f"/content/{train_set}.zip")