# **Z-TextCat-Z**

Script to train and test TextCat-SDG models on ZORA data with n-gram sizes n = 1, 2, and 3. For single-label data, make sure that `exclusive_classes` is set to `True` in your config file.
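
Why `exclusive_classes` matters, as a minimal sketch in plain Python (not spaCy code; the raw scores are made up). As far as I understand `TextCatBOW`, exclusive classes are scored with a softmax, so the label probabilities compete and sum to 1, while non-exclusive classes get an independent sigmoid per label:

```python
import math

def softmax(scores):
    """Mutually exclusive labels: probabilities sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Independent labels: each probability is computed on its own."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

raw_scores = [2.0, 1.0, -1.0]  # hypothetical raw scores for three labels
exclusive = softmax(raw_scores)
independent = sigmoid(raw_scores)
print(round(sum(exclusive), 6))
print(round(sum(independent), 6))
```

With single-label data, the softmax variant is the one whose highest probability can be read as "the" predicted label.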

Pipeline:
- tokenizer `spacy.Tokenizer`
- classifier `TextCatBOW` (bag-of-words + logistic regression)
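
What the n-gram size changes, as a toy sketch in plain Python. It assumes, as I read the spaCy/thinc implementation, that an n-gram size of n includes all n-grams up to length n; the real model hashes token IDs into a sparse linear layer instead of storing tuples. `bag_of_ngrams` is a hypothetical helper, not part of spaCy:

```python
from collections import Counter

def bag_of_ngrams(tokens, max_n):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for size in range(1, max_n + 1):
        for start in range(len(tokens) - size + 1):
            counts[tuple(tokens[start:start + size])] += 1
    return counts

tokens = "clean water and clean energy".split()
for max_n in (1, 2, 3):
    # Number of distinct features grows with the n-gram size.
    print(max_n, len(bag_of_ngrams(tokens, max_n)))
```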

All other parameters are left at their defaults, including:
- max training steps = 20,000
- patience = 1600
- unlimited epochs
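
How these stopping criteria interact, as a simplified sketch (not spaCy's training loop). It assumes training stops once the best dev score is `patience` steps old or once `max_steps` is reached, whichever comes first. `stopping_step` and the score history are hypothetical:

```python
def stopping_step(scores_by_step, patience=1600, max_steps=20000):
    """Return the step at which training would stop.

    scores_by_step: list of (step, dev_score) pairs in evaluation order.
    """
    best_score, best_step = float("-inf"), 0
    for step, score in scores_by_step:
        if score > best_score:
            best_score, best_step = score, step
        if patience and step - best_step >= patience:
            return step  # no improvement for `patience` steps
        if max_steps and step >= max_steps:
            return step  # hard cap on training steps
    return scores_by_step[-1][0] if scores_by_step else 0

# Made-up dev scores, evaluated every 200 steps; the best score is at step 400.
history = [(s, 0.7 if s == 400 else 0.5) for s in range(200, 4001, 200)]
print(stopping_step(history))
```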

### Choose n-gram size:

In [None]:
n = 1
# n = 2
# n = 3

# Installation & Imports

In [None]:
%pip install -q "spacy==3.5.3"

In [None]:
import pandas as pd
import spacy
import shutil
import json
import locale

from spacy.cli.train import train as spacy_train
from google.colab import drive, files

# For encoding issue with spaCy Doc files:
locale.getpreferredencoding = lambda: "UTF-8"
# Free up disk space (no error if the folder is already gone):
shutil.rmtree("/content/sample_data", ignore_errors=True)

train_set = 'zora'
test_set = 'zora'
model_name = 'textcat'

# Read data & set globals

### Enter your filepaths:

Point to a Drive folder containing the following files, or upload them to this session:
- train, dev, and test data as spaCy Doc files: `dataset-name_train.spacy`, `dataset-name_dev.spacy`, `dataset-name_test.spacy`
- test data in TSV format: `dataset-name_test.tsv`
- model config file: `model-name_config.cfg`
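
An optional pre-flight check for these files, as a sketch. `missing_inputs` is a hypothetical helper; the subfolder names (`spacy_docs`, `train_test`, `models`) are taken from the paths used further down in this notebook:

```python
import tempfile
from pathlib import Path

def missing_inputs(base_dir, dataset, model_name):
    """Return the expected input files that are not present on disk."""
    base = Path(base_dir)
    expected = [base / "spacy_docs" / f"{dataset}_{split}.spacy"
                for split in ("train", "dev", "test")]
    expected.append(base / "train_test" / f"{dataset}_test.tsv")
    expected.append(base / "models" / model_name / f"{model_name}_config.cfg")
    return [path for path in expected if not path.is_file()]

# Demo on an empty temporary folder: all five files are reported missing.
with tempfile.TemporaryDirectory() as tmp:
    print(len(missing_inputs(tmp, "zora", "textcat")))
```

In the notebook itself, this could be called as `missing_inputs(base_path, train_set, model_name)` once `base_path` is defined.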

In [None]:
# Run this cell to mount Google Drive:
drive.mount("/content/drive", force_remount=True)

In [None]:
# Edit paths as needed:
base_path = "/content/drive/YOUR_FOLDER/zora_classifier/data"
model_path = f"{base_path}/models/{model_name}"  # where config file is located
config_path = f"{model_path}/{model_name}_config.cfg"
train_path, dev_path, test_path = (f"{base_path}/spacy_docs/{train_set}_train.spacy",
                                   f"{base_path}/spacy_docs/{train_set}_dev.spacy",
                                   f"{base_path}/spacy_docs/{test_set}_test.spacy")
model = f"{train_set}-n{n}"  # destination folder name for the model trained here

# Train

In [None]:
spacy_train(config_path,
            output_path=model,
            overrides={"paths.train": train_path,
                       "paths.dev": dev_path,
                       "components.textcat.model.ngram_size": n})

Remove `model-last` and keep the path to `model-best`:

In [None]:
shutil.rmtree(f"/content/{model}/model-last")
best_model = f"/content/{model}/model-best"

# Test

In [None]:
test_cats = [str(i) for i in range(1, 18)]
output_file = f"{best_model}/test_{test_set}_eval.json"

**Syntax:**

`python3 -m spacy benchmark accuracy`

model path, test.spacy filepath: `{best_model}/ {test_path}`

output filename: `{output_file}`
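
Once the command in the next cell has written the output file, its top-level scores can be inspected with the standard `json` module. A sketch: `scalar_metrics` is a hypothetical helper, and the key names in the sample data are placeholders, so check your own file for the actual keys:

```python
import json
import os
import tempfile

def scalar_metrics(path):
    """Load an evaluation JSON file and keep only its top-level numeric values."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {key: value for key, value in data.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)}

# Demo with made-up content; nested per-label scores are skipped.
sample = {"cats_score": 0.5, "speed": 1000.0, "cats_f_per_type": {"1": {"f": 0.5}}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    json.dump(sample, tmp)
metrics = scalar_metrics(tmp.name)
os.remove(tmp.name)
print(metrics)
```

In the notebook, the call would be `scalar_metrics(output_file)`.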

In [None]:
!python3 -m spacy benchmark accuracy \
  {best_model}/ {test_path} \
  --output {output_file}

Zip model folder and download.

Syntax: `/content/desired_filename.zip /content/folder_to_zip`
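
A pure-Python alternative to the `zip` command, sketched with `shutil.make_archive` from the standard library. `zip_folder` is a hypothetical helper and the demo folder is made up. Unlike `zip -r` with an absolute path, this stores entries relative to the folder's parent (for example `model-best/meta.json`):

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def zip_folder(folder, archive_stem):
    """Zip `folder` into `<archive_stem>.zip` and return the archive path."""
    folder = Path(folder)
    return shutil.make_archive(str(archive_stem), "zip",
                               root_dir=folder.parent, base_dir=folder.name)

# Demo on a temporary folder that stands in for a trained model.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "model-best"
    demo.mkdir()
    (demo / "meta.json").write_text("{}", encoding="utf-8")
    archive = zip_folder(demo, Path(tmp) / "zora-n1")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
print(names)
```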

In [None]:
!zip -r /content/{model}.zip {best_model}

In [None]:
files.download(f"/content/{model}.zip")

# Predict

Apply model to test data and get probabilities:

In [None]:
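def get_test_df(file):
  """Load the test TSV, cast the text columns to string, and drop unused metadata."""
  test_df = pd.read_csv(file, sep='\t', keep_default_na=False,
                        index_col=0, encoding='utf-8')
  test_df = test_df.astype({'sdg': 'string',
                            'abstract': 'string'})
  # These columns are not needed for prediction:
  test_df = test_df.drop(columns=['faculty', 'year'])

  return test_df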

In [None]:
# Edit `test_df` path:
test_df = get_test_df(f"{base_path}/train_test/{test_set}_test.tsv")

X_test = test_df['abstract'].values
y_test = test_df['sdg'].values

nlp = spacy.load(best_model)
print("Making predictions...")

spacy_probs = [doc.cats for doc in nlp.pipe(X_test)]
print("Done making predictions!")

# For each item, select the label to which the model has assigned the highest probability:
preds = []
probs = []
for label_probs_dict in spacy_probs:
    pred, prob = max(label_probs_dict.items(), key=lambda x: x[1])
    preds.append(pred)
    probs.append(prob)

preds = pd.Series(preds)
probs = pd.Series(probs)

Create and download predictions dataframe:

In [None]:
preds_df = pd.DataFrame({'abstract': X_test,
                         'label': y_test,
                         'prediction': preds,
                         'probability': probs})
# Align original indices
preds_df.index = test_df.index

preds_df = preds_df.astype({'abstract': 'string',
                            'label': 'int',
                            'prediction': 'int',
                            'probability': 'float'})

# Only add the n-gram suffix to the filename when n != 1:
suffix = f"-n{n}" if n != 1 else ""
preds_file = f"/content/{train_set}-{test_set}_preds{suffix}.tsv"
preds_df.to_csv(preds_file, sep='\t', encoding='utf-8')
files.download(preds_file)