# Example setup for using your own data

Here we show an example of using your own data for classification.

We will use the 20 Newsgroups dataset, which is inherently multi-class, and one-hot encode its targets to recast it as a multi-label problem.

To run this file, we assume that an Ollama server is up and running on its default port (11434).
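
Before running the cells below, it can help to confirm that something is actually listening on that port. The following is a minimal sketch, assuming the server is local and answers plain HTTP at `http://localhost:11434`; the helper name `ollama_is_up` is ours and is not part of this repository or of Ollama.

```python
import http.client
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


# Assumed default address; adjust if your server runs elsewhere.
print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, start the server first; the model cells further down would otherwise fail when they try to query it.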

In [64]:
from sklearn.datasets import fetch_20newsgroups
# Keep only a subset of the categories for speed
cats = ["alt.atheism", "sci.med", "talk.politics.guns"]
to_remove = ("headers", "footers", "quotes")
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats, remove=to_remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats, remove=to_remove)
all_labels_list = newsgroups_test.target_names

In [65]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# One-hot encode the targets to recast the task as a "multi-label" problem.
ohe = OneHotEncoder()
y_train = ohe.fit_transform(newsgroups_train.target.reshape(-1,1)).toarray()
y_test = ohe.transform(newsgroups_test.target.reshape(-1,1)).toarray()

X_train = np.array(newsgroups_train['data'])
X_test = np.array(newsgroups_test['data'])

print(len(X_train), y_train.shape)

1620 (1620, 3)
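
To make the `(1620, 3)` shape concrete, here is a small pure-Python sketch of the same idea on made-up targets: each integer class becomes a row with a single 1 in the column of that class, and the original class is recovered by taking the position of the largest entry. The helper names `to_indicator` and `from_indicator` and the toy targets are illustrative only.

```python
def to_indicator(targets, num_classes):
    """Turn integer class ids into rows of 0/1 indicators, one column per class."""
    return [[1 if column == target else 0 for column in range(num_classes)]
            for target in targets]


def from_indicator(rows):
    """Recover one class id per row as the position of the largest entry."""
    return [row.index(max(row)) for row in rows]


# Toy targets (not taken from the dataset): four samples, three classes.
targets = [2, 0, 1, 2]
indicator = to_indicator(targets, num_classes=3)

for target, row in zip(targets, indicator):
    print(target, "->", row)

# The round trip gives back the original multi-class targets.
print(from_indicator(indicator) == targets)
```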


In [None]:
from model import LLM_NN

prompt = """
You are a helpful annotator for a newsgroup agency.
Given the context of a news file you have to categorize the file into
one of its related categories.
These are the contents of the file:
{sample_text}.

Select among these labels from other most similar files with their relevance: {similar_labels_freq}.
"""

model = LLM_NN(label_list=all_labels_list, prompt=prompt, threshold=0.1, top_k=11)
model.fit(X_train, y_train);
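
The prompt contains two placeholders, `{sample_text}` and `{similar_labels_freq}`. How `LLM_NN` fills them is not shown here, so the following is only a guess at the general idea: count the labels of the `top_k` most similar training samples, keep those whose relative frequency reaches `threshold`, and substitute the result with `str.format`. The function `label_relevance`, the neighbor labels, and the sample text are all invented for illustration.

```python
from collections import Counter


def label_relevance(neighbor_labels, threshold):
    """Relative frequency of each neighbor label, dropping labels below threshold."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return {
        label: round(count / total, 2)
        for label, count in counts.most_common()
        if count / total >= threshold
    }


# Shortened stand-in for the prompt above, with the same two placeholders.
template = (
    "These are the contents of the file:\n{sample_text}.\n\n"
    "Select among these labels: {similar_labels_freq}.\n"
)

# Hypothetical labels of the 11 nearest training samples (top_k=11).
neighbor_labels = ["sci.med"] * 7 + ["alt.atheism"] * 3 + ["talk.politics.guns"]

relevance = label_relevance(neighbor_labels, threshold=0.1)
filled = template.format(
    sample_text="Is there a known interaction between these two drugs?",
    similar_labels_freq=relevance,
)
print(filled)
```

In this toy case `talk.politics.guns` appears in only 1 of 11 neighbors (about 0.09), so a threshold of 0.1 removes it from the candidate labels.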

In [90]:
# Predict on only the first few test samples, for speed
num_to_check = 10
y_pred = model.predict(X_test[:num_to_check])

Predicting on test...: 100%|██████████| 10/10 [00:04<00:00,  2.02it/s]


In [91]:
from sklearn.metrics import classification_report
# Transform back to multi-class; classification_report expects (y_true, y_pred)
true_classes = ohe.inverse_transform(y_test[:num_to_check]).flatten()
pred_classes = ohe.inverse_transform(y_pred).flatten()
print(classification_report(true_classes, pred_classes))

              precision    recall  f1-score   support

           0       0.33      0.50      0.40         2
           1       1.00      1.00      1.00         4
           2       0.67      0.50      0.57         4

    accuracy                           0.70        10
   macro avg       0.67      0.67      0.66        10
weighted avg       0.73      0.70      0.71        10
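
For readers unfamiliar with the columns of this report, the sketch below computes per-class precision, recall, F1, and support by hand on a small made-up example. The function `per_class_metrics` and the toy labels are illustrative and unrelated to the predictions above.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # samples predicted as this class
    support = sum(t == label for t in y_true)     # samples that truly are this class
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / support if support else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1, support


# Toy labels: six samples, three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

for label in sorted(set(y_true_toy)):
    precision, recall, f1, support = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

Precision divides the correct predictions by everything predicted as the class, recall divides them by everything that truly belongs to the class, and support is the number of true samples of the class.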



In [92]:
print("Indicative per sample predictions \n ")
# y_pred only covers the first num_to_check samples, so slice X_test to match
y_pred_labels = model.predict_with_labels(X_test[:num_to_check], y_pred)
for sample_index, labels in enumerate(y_pred_labels[:num_to_check]):
    cur_test_labels = sorted(
        [
            all_labels_list[index_nonzero]
            for index_nonzero in y_test[sample_index].flatten().nonzero()[0]
        ]
    )
    print(
        f"(Sample {sample_index + 1}): {X_test[sample_index]}\nCorrect Labels:{cur_test_labels} \nPred Labels: {sorted(labels)}"
    )
    print("\n ################################## \n ")


Indicative per sample predictions 
 
(Sample 1): 
A great deal of documentation exists on exactly that phenomenon. Especially
regarding Vietnam and the Mai Lai (sp?) massacre

Not that I'm suggesting that they started it on purpose but even if they
now know that they accidentally started (or contributed to it) you can
be sure the initial reaction is to lie. Remember the Iranian airliner
which the US navy mistook for a fighter and shot down?
Correct Labels:['talk.politics.guns'] 
Pred Labels: ['talk.politics.guns']

 ################################## 
 
(Sample 2): 

True.


No more risk than smaller stashes unless the stash is somehow confined so
the heat from early ignitions could somehow bulk-heat the remainder.

Two  years ago this month my house and office burned.  In my office was my
reloading bench.  On the top shelf next to the wooden ceiling was 
about 100 lbs of smokeless powder, 5 lbs of black powder, several thousand
primers and a couple thousand loaded rounds, primarily in