# NEWSPAPER TITLE CLASSIFICATION BASED ON KNN, KMEANS AND DECISION TREE

For a general view of the project and its data structure, refer to the project's `Readme.md` and the general structure diagram below.

![General Structure](general_structure.png)

## 1. IMPORT LIBRARY

In [22]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# from chromadb.api.types import normalize_embeddings
# from langchain.evaluation import load_dataset  # unused: load_dataset is imported from `datasets` in Section 2
# from sentence_transformers import SentenceTransformer

In [23]:
import re
import nltk # use in case
from collections import Counter, defaultdict
from typing import Literal, List, Union, Dict, Any
from transformers import AutoTokenizer, AutoModel
import torch

In [24]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, silhouette_score

## 2. DATA FEATURE EXTRACTION

`Note: Refer to the Readme.md for the data source.`

In [25]:
from datasets import load_dataset
ds = load_dataset('UniverseTBD/arxiv-abstracts-large')

The dataset is a dictionary-like collection (`DatasetDict`) whose splits expose the feature fields shown below.

In [26]:
print(type(ds))
ds

<class 'datasets.dataset_dict.DatasetDict'>


DatasetDict({
    train: Dataset({
        features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed'],
        num_rows: 2292057
    })
})

> Find the path of the data files:

The data is downloaded and then stored under the following path (on Linux):
"/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/*.arrow"

When loading the dataset from these files, the `load_dataset` command should be removed so that the download is not triggered again.

In [27]:
# list down the path of data file
print(ds.cache_files)

{'train': [{'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00000-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00001-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00002-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00003-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00004-of-00007.arrow'}, {'fi

Collect the field names from the `train` split of the dataset dictionary.

In [28]:
train_ds = ds['train']
topic_features = train_ds.column_names
print(topic_features)

['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed']


Following the project guidelines, the `abstract` field is used as the input (features) and the `categories` field as the labels.

Each category value has two parts, a main category and a sub-category, for example `math.CA` or `cs.CG`. A paper can list several categories separated by spaces.

Only the primary category is needed for this project: take the first listed category and keep the part before the dot, in lowercase.

In [29]:
ds_splitted = train_ds.select_columns(['abstract', 'categories'])
project_df = ds_splitted.to_pandas()

In [30]:
# split each value into a list and keep the main part of the first entry: 'acc.phy math' => ['acc.phy', 'math'] => 'acc'
category = project_df['categories'].map(lambda x: x.split(' '))
category = category.map(lambda x: x[0].split('.')[0])

category_set = set(category)
print(f'Length of unique primary categories is {len(category_set)}')

Length of unique primary categories is 38


The requirement is to extract 1000-2000 samples whose primary category is one of the following:

`[astro-ph, cond-mat, cs, math, physics]`

In [31]:
use_categories_list = ['astro-ph', 'cond-mat', 'cs', 'math', 'physics']

# build a regex alternation with '|'
# ^(astro-ph|cond-mat|cs|math|physics)
pattern = '^(' + '|'.join(use_categories_list) + ')'
# filter the data
project_df_filtered = project_df[project_df['categories'].str.contains(pattern, case=False, na=False)]
# draw a random sample of 2000 rows
dataset_df = project_df_filtered.sample(n=2000, random_state=42)
dataset_df.reset_index(drop=True, inplace=True)
dataset_df.head(10)



| | abstract | categories |
|---|---|---|
| 0 | The factorially normalized Bernoulli polynom... | math.PR math.CO |
| 1 | We propose a simple uniform lower bound on t... | math.CA |
| 2 | We study truncated point schemes of connecte... | math.RA math.AG math.QA |
| 3 | Four-dimensional (4D) printing, a new techno... | cs.CG physics.app-ph |
| 4 | We show that the d-th secant variety of a pr... | math.AG math.NT |
| 5 | In recent years, badminton analytics has dra... | cs.AI |
| 6 | The influence that the kinematics of pitchin... | physics.flu-dyn |
| 7 | We address the problem of evaluating the tra... | physics.bio-ph cond-mat.stat-mech |
| 8 | In this paper, we derive the large deviation... | math.PR math-ph math.MP |
| 9 | The Three-Body Parameter(3BP)\n$a^{\scriptsc... | cond-mat.quant-gas physics.atm-clus physics.at... |
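The pattern above is a plain prefix match, so `math` also matches rows whose first category is `math-ph`. That is why a sixth label, `math-ph`, appears among the unique categories in Section 3. The following is a minimal sketch of a stricter filter, assuming a primary category name is always followed by a dot, whitespace, or the end of the string. The names `strict_pattern` and `sample` are illustrative and not part of the notebook.

```python
import pandas as pd

use_categories_list = ['astro-ph', 'cond-mat', 'cs', 'math', 'physics']

# Hypothetical sample rows, for illustration only
sample = pd.Series(['math.CA', 'math-ph math.MP', 'cs.AI', 'hep-th math.AG', 'astro-ph'])

# Pattern used in the notebook: a plain prefix match with a capturing group
prefix_pattern = '^(' + '|'.join(use_categories_list) + ')'

# Stricter variant (assumption): non-capturing group, and the name must be
# followed by '.', whitespace, or the end of the string
strict_pattern = r'^(?:' + '|'.join(use_categories_list) + r')(?=\.|\s|$)'

prefix_hits = sample[sample.str.contains(prefix_pattern, case=False, na=False)].tolist()
strict_hits = sample[sample.str.contains(strict_pattern, case=False, na=False)].tolist()

print(prefix_hits)  # includes the 'math-ph math.MP' row
print(strict_hits)  # excludes it
```

Using a non-capturing group `(?:...)` should also avoid the pandas warning about match groups that `str.contains` raises for the original pattern. Applying the stricter filter to the notebook would change the sampled rows and every result that follows.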


## 3. DATA PROCESSING

In [32]:
# Check one value of dataset
dataset_df.loc[1, 'abstract']

'  We propose a simple uniform lower bound on the spacings between the\nsuccessive zeros of the Laguerre polynomials $L_n^{(\\alpha)}$ for all\n$\\alpha>-1$. Our bound is sharp regarding the order of dependency on $n$ and\n$\\alpha$ in various ranges. In particular, we recover the orders given in\n\\cite{ahmed} for $\\alpha \\in (-1,1]$.\n'

Data processing applies the following steps to the raw values:
* Removes `\n` and whitespace characters at the beginning and end of the string.
* Removes special characters (anything that is not a letter, digit, underscore, or whitespace).
* Removes digits.
* Converts all letters to lowercase.
* Takes the label from the primary category (first part) of the categories field.

In [33]:
def abstract_preprocessing(text):
    """
    Preprocess an abstract: remove line breaks, special characters and digits
    :param text: the text to be preprocessed
    :return: text after preprocessing
    """
    # replace line breaks with spaces and trim both ends
    text = text.strip().replace('\n', ' ')
    # remove special characters (note: '_' is kept because \w matches it)
    text = re.sub(r'[^\w\s]', '', text)
    # remove digits
    text = re.sub(r'\d+', '', text)
    # collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # lower case
    text = text.lower()
    return text

def category_processing(text):
    """
    Preprocess categories: keep only the main part of the first category
    :param text: categories string to be processed, e.g. 'physics.flu-dyn'
    :return: primary category, e.g. 'physics'
    """
    text_splitted = text.replace('.',' ')
    text_category = text_splitted.split(' ')[0]
    return text_category

Test the processing functions

In [34]:
a_sample = dataset_df.loc[1, 'abstract']
c_sample = dataset_df.loc[6, 'categories']
print('before:\n', a_sample)
print('after: \n', abstract_preprocessing(a_sample))
print('before:\n', c_sample)
print('after: \n', category_processing(c_sample))

before:
   We propose a simple uniform lower bound on the spacings between the
successive zeros of the Laguerre polynomials $L_n^{(\alpha)}$ for all
$\alpha>-1$. Our bound is sharp regarding the order of dependency on $n$ and
$\alpha$ in various ranges. In particular, we recover the orders given in
\cite{ahmed} for $\alpha \in (-1,1]$.

after: 
 we propose a simple uniform lower bound on the spacings between the successive zeros of the laguerre polynomials l_nalpha for all alpha our bound is sharp regarding the order of dependency on n and alpha in various ranges in particular we recover the orders given in citeahmed for alpha in
before:
 physics.flu-dyn
after: 
 physics


The functions work as expected. Apply them to the whole dataset

In [35]:
dataset_df = dataset_df.assign(
    abstract = dataset_df['abstract'].apply(abstract_preprocessing),
    categories = dataset_df['categories'].apply(category_processing)
)

In [36]:
print(dataset_df.info())
dataset_df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   abstract    2000 non-null   object
 1   categories  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB
None


| | abstract | categories |
|---|---|---|
| 0 | the factorially normalized bernoulli polynomia... | math |
| 1 | we propose a simple uniform lower bound on the... | math |
| 2 | we study truncated point schemes of connected ... | math |
| 3 | fourdimensional d printing a new technology em... | cs |
| 4 | we show that the dth secant variety of a proje... | math |


In [37]:
# Check categories unique value
print(sorted(dataset_df['categories'].unique()))
sorted_catergories_list = sorted(dataset_df['categories'].unique())
sorted_catergories_list.index('math-ph')

['astro-ph', 'cond-mat', 'cs', 'math', 'math-ph', 'physics']


4

Convert the categories to integer labels (the index of each name in the sorted list above)

In [38]:
dataset_df['categories'] = dataset_df['categories'].map(lambda x: sorted_catergories_list.index(x))
dataset_df.head(5)

| | abstract | categories |
|---|---|---|
| 0 | the factorially normalized bernoulli polynomia... | 3 |
| 1 | we propose a simple uniform lower bound on the... | 3 |
| 2 | we study truncated point schemes of connected ... | 3 |
| 3 | fourdimensional d printing a new technology em... | 2 |
| 4 | we show that the dth secant variety of a proje... | 3 |
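The `list.index` lookup above works, but it rescans the list for every row and offers no direct way back from a number to a category name. As a sketch of one alternative, scikit-learn's `LabelEncoder` assigns integers in sorted order of the labels, so for these string labels it should give the same mapping. The `labels` list below is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
labels = ['math', 'cs', 'math-ph', 'astro-ph', 'physics', 'cond-mat', 'math']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))                   # sorted category names
print(encoded.tolist())                         # integer label per row
print(encoder.inverse_transform([4]).tolist())  # back from number to name
```

Keeping the fitted encoder around makes it easy to print readable class names in reports and confusion matrices later on.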


## 4. DATA EMBEDDING

Three text embedding methods are considered: `CountVectorizer()` (bag of words), `TfidfVectorizer()`, and a custom `EmbeddingVectorizer` class.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(dataset_df['abstract'], dataset_df['categories'], test_size=0.2, random_state=42)
print(f'Training sample: {len(X_train)}')
print(f'Test sample: {len(X_test)}')
print(f'Training labels: {len(y_train)}')
print(f'Test labels: {len(y_test)}')

Training sample: 1600
Test sample: 400
Training labels: 1600
Test labels: 400


#### Bag of Words (CountVectorizer): no normalization

In [40]:
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)
print(X_train_bow.shape)
print(X_test_bow.shape)
print(X_train_bow[0])

(1600, 18450)
(400, 18450)
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 102 stored elements and shape (1, 18450)>
  Coords	Values
  (0, 14782)	2
  (0, 9967)	1
  (0, 6916)	1
  (0, 1327)	2
  (0, 12956)	1
  (0, 16836)	8
  (0, 5218)	1
  (0, 16616)	8
  (0, 11265)	4
  (0, 11383)	9
  (0, 2535)	7
  (0, 7620)	5
  (0, 3661)	4
  (0, 1267)	1
  (0, 7449)	1
  (0, 1341)	1
  (0, 433)	1
  (0, 16618)	1
  (0, 6897)	1
  (0, 15868)	1
  (0, 562)	2
  (0, 7699)	2
  (0, 16611)	2
  (0, 9840)	1
  (0, 7976)	1
  :	:
  (0, 18078)	1
  (0, 5988)	1
  (0, 11185)	1
  (0, 15236)	1
  (0, 5800)	2
  (0, 5219)	1
  (0, 4021)	1
  (0, 9896)	1
  (0, 12855)	1
  (0, 5417)	1
  (0, 12083)	1
  (0, 18156)	2
  (0, 8721)	1
  (0, 4183)	1
  (0, 16303)	1
  (0, 13504)	1
  (0, 3662)	1
  (0, 14893)	1
  (0, 6563)	1
  (0, 13993)	1
  (0, 14180)	1
  (0, 10969)	1
  (0, 3919)	1
  (0, 17670)	1
  (0, 18159)	1


#### TF-IDF (TfidfVectorizer): L2 normalization by default

In [41]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)
print(X_train_tfidf[0])

(1600, 18450)
(400, 18450)
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 102 stored elements and shape (1, 18450)>
  Coords	Values
  (0, 14782)	0.09969270018514081
  (0, 9967)	0.04335014971731978
  (0, 6916)	0.03388746193404882
  (0, 1327)	0.0831834845247899
  (0, 12956)	0.043939168180104515
  (0, 16836)	0.12914015150150226
  (0, 5218)	0.05746113414142209
  (0, 16616)	0.11284799955964703
  (0, 11265)	0.1742724798821033
  (0, 11383)	0.12703332883853058
  (0, 2535)	0.4519794853757873
  (0, 7620)	0.07723101105628082
  (0, 3661)	0.24326615353579287
  (0, 1267)	0.07020992990846252
  (0, 7449)	0.07235470281307164
  (0, 1341)	0.07635736251478561
  (0, 433)	0.040551569206818947
  (0, 16618)	0.05631691573756624
  (0, 6897)	0.03526844413282702
  (0, 15868)	0.037494109609882406
  (0, 562)	0.04664977637875393
  (0, 7699)	0.13501708690010744
  (0, 16611)	0.037943998999576
  (0, 9840)	0.07124102993908804
  (0, 7976)	0.10128697791112083
  :	:
  (0, 18078)	0.06792390215335287
  (0, 598
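The difference between the two vectorizers can be checked directly on row norms. This is a small sketch on a made-up corpus (the `docs` list is illustrative): rows from `TfidfVectorizer` should have unit L2 norm with the default `norm='l2'`, while raw count rows grow with document length.

```python
import numpy as np
from scipy.sparse.linalg import norm as sparse_norm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus, for illustration only
docs = [
    'we study the zeros of laguerre polynomials',
    'we propose a deep learning model for badminton analytics',
    'galaxies galaxies galaxies and dark matter halos',
]

bow = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# L2 norm of each row (one row per document)
bow_norms = sparse_norm(bow, axis=1)
tfidf_norms = sparse_norm(tfidf, axis=1)

print(bow_norms)    # varies with document length and repeated words
print(tfidf_norms)  # all close to 1.0
```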

#### Build a user-defined embedding vectorizer class
The installed version of SentenceTransformer is not compatible with the Python version used here. The workaround is to use torch with the Hugging Face `transformers` library directly and to pool the token embeddings manually.
The embedding vectorizer in this case uses:
* Hugging Face embeddings
* CUDA, with automatic selection of GPU or CPU
* Batching, for long or complex inputs

In [42]:
class EmbeddingVectorizer:
    """
    Vectorizer built on Hugging Face transformers (default: intfloat/multilingual-e5-base).
    - mode='query'   -> prefix "query: "
    - mode='passage' -> prefix "passage: "
    - mode='raw'     -> keep raw text
    """
    def __init__(self,
                 model_name: str = 'intfloat/multilingual-e5-base',
                 normalize: bool = True,
                 device: Literal['auto', 'cpu', 'cuda'] = 'auto'):
        # specify the device
        if device == 'auto':
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = device

        print(f"[EmbeddingVectorizer] Using device: {self.device}")

        # load model/tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device)
        self.model.eval()  # inference only: disable dropout
        self.normalize = normalize

    def _pool_embeddings(self, model_output, attention_mask):
        # mean pooling, SBERT style: average token vectors over non-padding positions
        # (the attention mask comes from the tokenizer output, not from the model output)
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
        return sum_embeddings / sum_mask

    def _format_inputs(self, texts: List[str], mode: Literal['query', 'passage', 'raw'] = 'query'):
        if mode == 'query':
            return [f"query: {t}" for t in texts]
        elif mode == 'passage':
            return [f"passage: {t}" for t in texts]
        return texts

    def transform_to_numpy(self,
                           texts: List[str],
                           mode: Literal['query', 'passage', 'raw'] = 'query',
                           batch_size: int = 16):
        """
        Encode a list of texts -> numpy array [n_samples, embedding_dim]
        batch_size: number of texts encoded per batch, to make better use of the GPU
        """
        inputs = self._format_inputs(texts, mode)
        all_embeddings = []

        for i in range(0, len(inputs), batch_size):
            batch = inputs[i:i + batch_size]
            encoded = self.tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(self.device)
            with torch.no_grad():
                outputs = self.model(**encoded)
            embs = self._pool_embeddings(outputs, encoded['attention_mask'])

            if self.normalize:
                # L2-normalize each embedding (row) to unit length
                embs = torch.nn.functional.normalize(embs, p=2, dim=1)

            all_embeddings.append(embs.cpu())

        all_embeddings = torch.cat(all_embeddings, dim=0)
        return all_embeddings.numpy()
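The mean pooling step in `_pool_embeddings` averages the token vectors of each text while ignoring padding positions. The arithmetic can be illustrated without loading a model. Below is a NumPy sketch on a tiny hand-made batch; `masked_mean_pool` and the numbers are illustrative and only mirror the idea, not the torch implementation itself.

```python
import numpy as np

def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real tokens only (mask == 1)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # [batch, seq, 1]
    summed = (token_embeddings * mask).sum(axis=1)                   # [batch, dim]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # [batch, 1], avoids division by zero
    return summed / counts

# Two "sentences" of three token positions each, embedding dimension 2
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # the padded position does not affect the first row
```

Without the mask, the padding vector `[100, 100]` would dominate the first average, which is why the mask produced by the tokenizer has to be passed into the pooling step.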

Test the embedding class. The call is left commented out because the installed SentenceTransformer version is not compatible with the Python version used here.

In [43]:
# embedding_vectorizer = EmbeddingVectorizer()
# print(embedding_vectorizer.transform_to_numpy([a_sample]))

## 5. PRODUCTION MODEL
* KMeans Clustering
* KNN
* Decision Tree

A practical question: do embedded data (data embeddings) need to be normalized?
* If the embeddings are only used to compare similarity → use L2 normalization.
* If the embeddings are fed into a machine learning model (LR, SVM, KMeans, KNN) → scaling is usually recommended (StandardScaler or MinMaxScaler).
* If the embeddings were already normalized beforehand → no further step is needed.

`Embeddings that are usually already normalized`

These models usually return L2-normalized vectors (length = 1) or offer an option to turn normalization on:
* SentenceTransformers (when using normalize_embeddings=True)
* OpenAI Embeddings (e.g. text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large) → normalized by default.
* Google Universal Sentence Encoder (USE) → already normalized.
* Cohere Embeddings API → already normalized.
* FastText pre-trained vectors (often normalized by the application libraries that load them).
* Word2Vec/GloVe if you normalize them yourself after loading (the Gensim library can normalize via an option).
* 👉 These embeddings can be used with cosine similarity or dot product directly, without extra scaling.

`Embeddings that are not normalized (raw vectors)`

These models return raw vectors whose length (norm) varies from sentence to sentence or word to word:

* BERT / RoBERTa / DistilBERT raw embeddings (hidden states).
* GPT hidden states taken from the transformer layers.
* CLIP (raw text/image embeddings) → normalize before computing cosine similarity.
* E5 models (intfloat/e5-base, multilingual-e5-base) → not normalized by default, but the SentenceTransformers wrapper usually allows normalization to be turned on.
* Other transformer-based encoders (XLM-R, mBERT, LaBSE) when taken directly from HuggingFace without normalization.
* Custom CNN/RNN embeddings (when you train your own encoder).
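The reason L2-normalized embeddings need no further scaling for similarity search is that, for unit vectors, the dot product equals the cosine similarity, and the squared Euclidean distance is $2 - 2\cos\theta$. Distance-based models such as KNN therefore rank neighbours the same way under either measure. A quick numerical check on random vectors (the vectors are arbitrary and only serve as illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))  # two arbitrary 8-dimensional vectors

# Cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_unit = a_unit @ b_unit                     # dot product of unit vectors
sq_dist_unit = np.sum((a_unit - b_unit) ** 2)  # squared Euclidean distance

print(np.isclose(dot_unit, cosine))
print(np.isclose(sq_dist_unit, 2 - 2 * cosine))
```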

### 5.1 KMEANS
The silhouette score is used to choose the number of clusters.

In [44]:
def train_and_test_kmean(X_train, X_test, y_train, y_test, cluster_range: int=3):
    """
    Use KMeans to predict the categories. Candidate values of k run from 2 to
    cluster_range - 1, and the silhouette score on the training set is used to choose among them.
    :param X_train: the training set, vectorized as a sparse matrix
    :param X_test: the test set, vectorized as a sparse matrix
    :param y_train: the training labels, int values (not used for fitting)
    :param y_test: the test labels, int values
    :param cluster_range: exclusive upper bound of the k values to try, at least 3
    :return: predicted clusters y_pred, accuracy, report, chosen number of clusters
    """
    if cluster_range < 3:
        raise ValueError('cluster_range should be at least 3')

    # Find the optimum cluster number k
    silhouettes = {}
    for i in range(2, cluster_range):
        kmeans = KMeans(n_clusters=i, random_state=42).fit(X_train)
        silhouettes[i] = silhouette_score(X_train, kmeans.labels_)

    max_si_value_key = max(silhouettes, key=silhouettes.get)
    silhouette_final_value = max_si_value_key + 1 # use the k just above the one with the highest silhouette score

    kmeans = KMeans(n_clusters=silhouette_final_value, random_state=42)
    kmeans.fit(X_train)
    y_pred = kmeans.predict(X_test)
    # NOTE: cluster ids are arbitrary integers; they are compared with the labels as-is here
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    return y_pred, accuracy, report, silhouette_final_value
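The silhouette selection step can be tried in isolation on data where the answer is known. The sketch below uses synthetic blobs from `make_blobs` with three well-separated centers (an assumption made for illustration, not the notebook's data) and picks the k with the highest score directly, without the `+ 1` offset applied in the function above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (illustrative only)
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5,
                  random_state=42)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the true number of groups for this synthetic data
```

On high-dimensional sparse text features the scores are typically much flatter than on this toy data, so the chosen k is less clear-cut there.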

In [45]:
kmean_bow_labels, kmean_bow_accurracy, kmean_bow_report, bow_k = train_and_test_kmean(X_train_bow, X_test_bow, y_train, y_test, 10)
kmean_tfidf_labels, kmeans_tfidf_accuracy, kmean_tfidf_report, tfidf_k = train_and_test_kmean(X_train_tfidf, X_test_tfidf, y_train, y_test, 10)

print("Accuracies for K-Means:")
print(f"Bag of Words: {kmean_bow_accurracy:.4f} with {bow_k} clusters")
print(f"Tf-Idf: {kmeans_tfidf_accuracy:.4f} with {tfidf_k} clusters")



Accuracies for K-Means:
Bag of Words: 0.2800 with 3 clusters
Tf-Idf: 0.3750 with 3 clusters


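KMeans cluster ids are arbitrary integers, so comparing them with the class labels through `accuracy_score` only agrees by coincidence. One common workaround, sketched here as an assumption rather than as the notebook's method, is to map each cluster to the most frequent true label among its training members before scoring. `map_clusters_to_labels` and the small arrays are hypothetical.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Return {cluster_id: most frequent true label inside that cluster}."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    mapping = {}
    for c in np.unique(cluster_ids):
        values, counts = np.unique(true_labels[cluster_ids == c], return_counts=True)
        mapping[int(c)] = int(values[np.argmax(counts)])
    return mapping

# Hypothetical training assignments and labels
train_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
train_labels = [3, 3, 2, 0, 0, 5, 5, 3]
mapping = map_clusters_to_labels(train_clusters, train_labels)

# Hypothetical test assignments and labels
test_clusters = np.array([2, 0, 1, 1])
test_labels = np.array([5, 3, 0, 2])
mapped_pred = np.array([mapping[int(c)] for c in test_clusters])

raw_accuracy = float(np.mean(test_clusters == test_labels))   # ids compared as-is
mapped_accuracy = float(np.mean(mapped_pred == test_labels))  # after majority mapping
print(mapping, raw_accuracy, mapped_accuracy)
```

With fewer clusters than classes, as with the 3 clusters chosen above, some classes can never be predicted under this mapping, which caps the achievable accuracy.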


### 5.2 KNN

In [46]:
def train_and_test_knn(X_train, X_test, y_train, y_test, n_neighbors: int = 5):
    """
    Train a KNN classifier and predict the categories of the test set
    :param X_train: the training set, vectorized features
    :param X_test: the test set, vectorized features
    :param y_train: the training labels, int values
    :param y_test: the test labels, int values
    :param n_neighbors: number of neighbors used for voting
    :return: predicted labels y_pred, accuracy, classification report
    """
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)

    # Predict on the test set
    y_pred = knn.predict(X_test)

    # Calculate accuracy and classification report
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return y_pred, accuracy, report

In [47]:
knn_bow_labels, knn_bow_accuracy, knn_bow_report = train_and_test_knn(X_train_bow, X_test_bow, y_train, y_test, 5)
knn_tfidf_labels, knn_tfidf_accuracy, knn_tfidf_report = train_and_test_knn(X_train_tfidf, X_test_tfidf, y_train, y_test, 5)

print("Accuracies for KNN:")
print(f"Bag of Words: {knn_bow_accuracy:.4f} with 5 neighbors")
print(f"Tf-Idf: {knn_tfidf_accuracy:.4f} with 5 neighbors")

Accuracies for KNN:
Bag of Words: 0.3825 with 5 neighbors
Tf-Idf: 0.7875 with 5 neighbors
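One possible contributor to the gap between the two scores (an assumption, not verified on this dataset) is that `KNeighborsClassifier` uses Euclidean distance by default. On raw counts that distance is sensitive to document length, whereas L2-normalized TF-IDF rows all have the same length. The toy sketch below, with made-up two-term "documents", shows how a cosine metric can change the nearest neighbour.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical count vectors over two terms
X_train = np.array([[10.0, 1.0],   # long document, mostly term 0 -> class 0
                    [1.0, 2.0]])   # short document, mostly term 1 -> class 1
y_train = np.array([0, 1])
X_query = np.array([[2.0, 0.2]])   # short document with the same term mix as class 0

# Default metric (Euclidean): the short query sits closer to the short document
euclid = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Cosine metric: only the direction (term mix) matters, not the length
cosine = KNeighborsClassifier(n_neighbors=1, metric='cosine',
                              algorithm='brute').fit(X_train, y_train)

euclid_pred = int(euclid.predict(X_query)[0])
cosine_pred = int(cosine.predict(X_query)[0])
print(euclid_pred, cosine_pred)
```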


### 5.3 DECISION TREE

In [50]:
def train_and_test_decision_tree(X_train, X_test, y_train, y_test):
    """
    Train a decision tree (max_depth=5) and predict the categories of the test set
    :param X_train: the training set, vectorized features
    :param X_test: the test set, vectorized features
    :param y_train: the training labels, int values
    :param y_test: the test labels, int values
    :return: predicted labels y_pred, accuracy, classification report
    """
    dt = DecisionTreeClassifier(max_depth=5)
    dt.fit(X_train, y_train)
    # Predict on the test set
    y_pred = dt.predict(X_test)
    # Calculate accuracy and classification report
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return y_pred, accuracy, report

In [52]:
dt_bow_labels, dt_bow_accuracy, dt_bow_report = train_and_test_decision_tree(X_train_bow, X_test_bow, y_train, y_test)
dt_tfidf_labels, dt_tfidf_accuracy, dt_tfidf_report = train_and_test_decision_tree(X_train_tfidf, X_test_tfidf, y_train, y_test)

print("Accuracies for Decision Tree:")
print(f"Bag of Words: {dt_bow_accuracy:.4f}")
print(f"Tf-Idf: {dt_tfidf_accuracy:.4f}")

Accuracies for Decision Tree:
Bag of Words: 0.4800
Tf-Idf: 0.4800


