# Scikit-LLM
- https://github.com/iryna-kondr/scikit-llm

## Environment

In [1]:
%%capture
!pip install scikit-llm

- https://platform.openai.com/account/api-keys
- https://platform.openai.com/account/org-settings

In [2]:
from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<API_KEY>")
SKLLMConfig.set_openai_org("<ORGANIZATION_ID>")

## Zero Shot GPTClassifier
- Classifies text without retraining a model.
- Scikit-LLM automatically queries the OpenAI API and transforms the response into a regular list of labels.
- Zero-shot accuracy depends heavily on how the labels are phrased: each label should be expressed in natural language and be descriptive and self-explanatory.
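
To illustrate why label wording matters, here is a minimal sketch of how candidate labels might be placed into a zero-shot prompt. The helper name `build_zero_shot_prompt` and the prompt wording are assumptions made for illustration; they are not Scikit-LLM's actual template.

```python
def build_zero_shot_prompt(text, labels):
    """Hypothetical helper: format one sample and its candidate labels as a prompt."""
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        "Classify the text into exactly one of the following categories: "
        f"{label_list}.\n"
        f"Text: {text}\n"
        "Category:"
    )


review = "The plot was dull and the acting was worse."

# Descriptive labels tell the model what each class means ...
descriptive_prompt = build_zero_shot_prompt(review, ["positive", "negative", "neutral"])

# ... whereas opaque codes give it nothing to reason about.
opaque_prompt = build_zero_shot_prompt(review, ["class_0", "class_1", "class_2"])

print(descriptive_prompt)
```

Because the label strings are the only description of the classes the model sees, a label such as `class_0` carries no usable meaning in this setting.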

In [24]:
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# load the demo classification dataset bundled with Scikit-LLM
X, y = get_classification_dataset()

In [25]:
print(len(X))
print(X[1])

30
The special effects in 'Star Battles: Nebula Conflict' were out of this world. I felt like I was actually in space. The storyline was incredibly engaging and left me wanting more. Excellent film.


In [26]:
print(len(y))
print(y)

30
['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']


In [27]:
# defining the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data
clf.fit(X, y)

In [28]:
# predicting the data
y_predict = clf.predict(X)

100%|██████████| 30/30 [00:18<00:00,  1.60it/s]


In [32]:
print(y_predict)

['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'negative', 'negative', 'neutral', 'neutral', 'neutral']


In [33]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_actual = y
labels = list(set(y))
print(labels)

# Calculating Accuracy
accuracy = accuracy_score(y_actual, y_predict)
print(f"Accuracy: {accuracy:.2f}")

# Calculating Precision
precision = precision_score(y_actual, y_predict, labels=labels, average='macro') # 'macro' calculates metrics for each label, and finds their unweighted mean.
print(f"Precision: {precision:.2f}")

# Calculating Recall
recall = recall_score(y_actual, y_predict, labels=labels, average='macro')
print(f"Recall: {recall:.2f}")

# Calculating F1 Score
f1 = f1_score(y_actual, y_predict, labels=labels, average='macro')
print(f"F1 Score: {f1:.2f}")

# Calculating Confusion Matrix
labels = ["positive", "negative", "neutral"]
conf_matrix = confusion_matrix(y_actual, y_predict, labels=labels)
print("\nConfusion Matrix:")
print(conf_matrix)

['positive', 'neutral', 'negative']
Accuracy: 0.90
Precision: 0.92
Recall: 0.90
F1 Score: 0.90

Confusion Matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  3  7]]
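
The scores above can be reproduced by hand from the confusion matrix. The sketch below uses plain Python instead of scikit-learn so that each step of the macro averaging is visible; the matrix values are copied from the output above.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
labels = ["positive", "negative", "neutral"]
conf_matrix = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 3, 7],
]

n = len(labels)
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n))

precisions, recalls, f1s = [], [], []
for i in range(n):
    tp = conf_matrix[i][i]
    predicted_as_i = sum(conf_matrix[r][i] for r in range(n))  # column sum
    actually_i = sum(conf_matrix[i])                           # row sum
    p = tp / predicted_as_i
    r = tp / actually_i
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

# Macro averaging: unweighted mean of the per-class scores.
accuracy = correct / total
macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {macro_precision:.2f}")
print(f"Recall: {macro_recall:.2f}")
print(f"F1 Score: {macro_f1:.2f}")
```

All three errors are neutral reviews predicted as negative, which lowers precision for the negative class (10/13) and recall for the neutral class (7/10).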


### Training without labeled data

In [37]:
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
y_predict = clf.predict(X)

100%|██████████| 30/30 [00:23<00:00,  1.29it/s]


In [38]:
print(y_predict)

['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'negative', 'negative', 'neutral', 'neutral', 'neutral']


## Few-Shot Text Classification
- In few-shot classification, the training samples are added to the prompt and passed to the model.
- The training set should therefore be small enough to fit into a single prompt (the recommendation is up to 10 samples per label).
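
The size constraint follows from how the prompt is assembled. Below is a minimal sketch, assuming a simple "Text / Category" layout; the helper `build_few_shot_prompt`, the layout, and the toy samples are hypothetical and are not Scikit-LLM's actual template.

```python
def build_few_shot_prompt(text, train_texts, train_labels):
    """Hypothetical helper: prepend every labeled training sample to the query."""
    candidate_labels = sorted(set(train_labels))
    examples = "\n\n".join(
        f"Text: {sample}\nCategory: {label}"
        for sample, label in zip(train_texts, train_labels)
    )
    return (
        f"Candidate categories: {', '.join(candidate_labels)}\n\n"
        f"{examples}\n\n"
        f"Text: {text}\nCategory:"
    )


# Toy training set invented for this example.
train_texts = [
    "Loved every minute of it.",
    "A complete waste of time.",
    "It was fine, nothing special.",
]
train_labels = ["positive", "negative", "neutral"]

prompt = build_few_shot_prompt("Best film of the year.", train_texts, train_labels)
print(prompt)
print("Prompt length in characters:", len(prompt))
```

Every training sample is repeated in the prompt for each prediction, so the prompt length grows linearly with the size of the training set.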

In [58]:
from skllm import FewShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = FewShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
y_predict = clf.predict(X)

100%|██████████| 30/30 [00:23<00:00,  1.27it/s]


In [59]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_actual = y
labels = list(set(y))
print(labels)

# Calculating Accuracy
accuracy = accuracy_score(y_actual, y_predict)
print(f"Accuracy: {accuracy:.2f}")

# Calculating Precision
precision = precision_score(y_actual, y_predict, labels=labels, average='macro')
print(f"Precision: {precision:.2f}")

# Calculating Recall
recall = recall_score(y_actual, y_predict, labels=labels, average='macro')
print(f"Recall: {recall:.2f}")

# Calculating F1 Score
f1 = f1_score(y_actual, y_predict, labels=labels, average='macro')
print(f"F1 Score: {f1:.2f}")

# Calculating Confusion Matrix
labels = ["positive", "negative", "neutral"]
conf_matrix = confusion_matrix(y_actual, y_predict, labels=labels)
print("\nConfusion Matrix:")
print(conf_matrix)

['positive', 'neutral', 'negative']
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
F1 Score: 0.97

Confusion Matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  1  9]]


## Dynamic Few-Shot Text Classification
- DynamicFewShotGPTClassifier dynamically selects N samples per class to include in the prompt. This allows the few-shot classifier to scale to datasets that are too large for the standard context window of LLMs.
- During fitting, the whole dataset is partitioned by class, vectorized, and stored.
- During inference, the annoy library is used for fast neighbor lookup, which allows including only the most similar examples in the prompt.
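
The fit and inference steps above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: brute-force cosine similarity stands in for the annoy index, the vectors are toy 2-D values rather than real embeddings, and the helper names `fit_index` and `select_examples` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_index(vectors, labels):
    """Fit time: partition (sample_id, vector) pairs by class and store them."""
    index = {}
    for sample_id, (vector, label) in enumerate(zip(vectors, labels)):
        index.setdefault(label, []).append((sample_id, vector))
    return index


def select_examples(index, query_vector, n_examples):
    """Inference time: pick the ids of the n most similar samples per class."""
    selected = {}
    for label, items in index.items():
        ranked = sorted(
            items,
            key=lambda item: cosine_similarity(item[1], query_vector),
            reverse=True,
        )
        selected[label] = [sample_id for sample_id, _ in ranked[:n_examples]]
    return selected


# Toy 2-D "embeddings"; real embeddings have far more dimensions.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
labels = ["positive", "positive", "negative", "negative", "negative"]

index = fit_index(vectors, labels)
chosen = select_examples(index, query_vector=[0.8, 0.2], n_examples=1)
print(chosen)
```

Only the selected samples would be added to the prompt, so the prompt size depends on `n_examples` and the number of classes rather than on the size of the training set.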

In [None]:
%%capture
!pip install "scikit-llm[annoy]"

In [None]:
from skllm import DynamicFewShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = DynamicFewShotGPTClassifier(n_examples=3)
clf.fit(X, y)
y_predict = clf.predict(X)

## Multi-Label Zero-Shot Text Classification

In [15]:
# importing Multi-Label zeroshot module and classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# load the demo multi-label dataset bundled with Scikit-LLM
X, y = get_multilabel_classification_dataset()

In [16]:
print(len(X))
print(X[1])

10
The delivery was super fast, but the product did not match the information provided on the website.


In [17]:
print(len(y))
print(y)

10
[['Quality', 'Packaging'], ['Delivery', 'Product Information'], ['Product Variety', 'Customer Support'], ['Price', 'User Experience'], ['Delivery', 'Packaging'], ['Customer Support', 'Return Policy'], ['Product Information', 'Return Policy'], ['Service', 'Delivery', 'Quality'], ['Price', 'Quality', 'User Experience'], ['Product Information', 'Delivery']]


In [18]:
# defining the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the model
clf.fit(X, y)

# making predictions
y_predict = clf.predict(X)

100%|██████████| 10/10 [00:07<00:00,  1.38it/s]


In [19]:
print(y_predict)

[['Quality', 'Packaging'], ['Delivery', 'Product Information'], ['Product Variety', 'Customer Support'], ['Price', 'User Experience'], ['Delivery', 'Packaging'], ['Customer Support', 'Return Policy'], ['Product Information', 'Return Policy'], ['Delivery', 'Quality'], ['Price', 'Quality', 'User Experience'], ['Product Information', 'Delivery']]


In [23]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix

# Convert your lists into a binary matrix format using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_bin = mlb.fit_transform(y)
y_predict_bin = mlb.transform(y_predict)


# Label-based metrics
precision = precision_score(y_bin, y_predict_bin, average='micro')
recall = recall_score(y_bin, y_predict_bin, average='micro')
f1 = f1_score(y_bin, y_predict_bin, average='micro')

print("Precision: ", precision)
print("Recall: ", recall)
print("F1-Score: ", f1)

# Exact Match Ratio
exact_match_ratio = accuracy_score(y_bin, y_predict_bin)

# Average Accuracy
average_accuracy = (y_bin == y_predict_bin).mean()

# Multi-label Confusion Matrix
confusion_matrices = multilabel_confusion_matrix(y_bin, y_predict_bin)

print("Exact Match Ratio: ", exact_match_ratio)
print("Average Accuracy: ", average_accuracy)
print("Confusion Matrices:")
for label_index, matrix in enumerate(confusion_matrices):
    print(f"Label: {mlb.classes_[label_index]}")
    print(matrix)

Precision:  1.0
Recall:  0.9545454545454546
F1-Score:  0.9767441860465117
Exact Match Ratio:  0.9
Average Accuracy:  0.99
Confusion Matrices:
Label: Customer Support
[[8 0]
 [0 2]]
Label: Delivery
[[6 0]
 [0 4]]
Label: Packaging
[[8 0]
 [0 2]]
Label: Price
[[8 0]
 [0 2]]
Label: Product Information
[[7 0]
 [0 3]]
Label: Product Variety
[[9 0]
 [0 1]]
Label: Quality
[[7 0]
 [0 3]]
Label: Return Policy
[[8 0]
 [0 2]]
Label: Service
[[9 0]
 [1 0]]
Label: User Experience
[[8 0]
 [0 2]]
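
The micro-averaged scores and the exact match ratio can be checked without binarizing, by comparing label sets sample by sample. The sketch below uses plain Python; the two lists are copied from the notebook outputs above.

```python
# Ground truth and predictions copied from the notebook outputs above.
y_true = [
    ["Quality", "Packaging"], ["Delivery", "Product Information"],
    ["Product Variety", "Customer Support"], ["Price", "User Experience"],
    ["Delivery", "Packaging"], ["Customer Support", "Return Policy"],
    ["Product Information", "Return Policy"], ["Service", "Delivery", "Quality"],
    ["Price", "Quality", "User Experience"], ["Product Information", "Delivery"],
]
y_pred = [
    ["Quality", "Packaging"], ["Delivery", "Product Information"],
    ["Product Variety", "Customer Support"], ["Price", "User Experience"],
    ["Delivery", "Packaging"], ["Customer Support", "Return Policy"],
    ["Product Information", "Return Policy"], ["Delivery", "Quality"],
    ["Price", "Quality", "User Experience"], ["Product Information", "Delivery"],
]

tp = fp = fn = 0
exact_matches = 0
for true_labels, pred_labels in zip(y_true, y_pred):
    true_set, pred_set = set(true_labels), set(pred_labels)
    tp += len(true_set & pred_set)   # labels predicted and correct
    fp += len(pred_set - true_set)   # labels predicted but wrong
    fn += len(true_set - pred_set)   # labels missed
    exact_matches += int(true_set == pred_set)

# Micro averaging pools the counts over all labels before dividing.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
exact_match_ratio = exact_matches / len(y_true)

print("TP, FP, FN:", tp, fp, fn)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Exact Match Ratio:", exact_match_ratio)
```

The single missed label is `Service` in the eighth sample, which accounts for the recall of 21/22 and for the one sample that is not an exact match.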


### What if you don't have labelled data (Multi Labels case)?


In [36]:
# getting classification dataset for prediction only
X, _ = get_multilabel_classification_dataset()

# Defining all the candidate labels that may be predicted
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

# creating the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the labels only
clf.fit(None, [candidate_labels])

# predicting the data
y_predict = clf.predict(X)
print(y_predict)

100%|██████████| 10/10 [00:08<00:00,  1.12it/s]

[['Quality'], ['Delivery'], ['Product Variety', 'Service'], ['Price', 'Service'], ['Delivery', 'Quality'], ['Service'], ['Quality', 'Service'], ['Service', 'Delivery', 'Quality'], ['Quality', 'Price'], ['Product Variety', 'Delivery']]





## Multi-Label Few-Shot Text Classification

In [None]:
from skllm.models.gpt.gpt_few_shot_clf import MultiLabelFewShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

X, y = get_multilabel_classification_dataset()

clf = MultiLabelFewShotGPTClassifier(max_labels=2, openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

## Text Vectorization and Classifier
- Text vectorization is a process of converting text into numbers so that machines can understand and analyze it more easily.

In [43]:
# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer
from skllm.datasets import get_classification_dataset

# load the demo classification dataset bundled with Scikit-LLM
X, y = get_classification_dataset()

# Creating an instance of the GPTVectorizer class and assigning it to the variable 'gpt_vectorizer'
gpt_vectorizer = GPTVectorizer()

# transforming the texts into embedding vectors
vectors = gpt_vectorizer.fit_transform(X)

100%|██████████| 30/30 [00:04<00:00,  6.54it/s]


In [44]:
print(len(X))
print(X[0])

30
I was absolutely blown away by the performances in 'Summer's End'. The acting was top-notch, and the plot had me gripped from start to finish. A truly captivating cinematic experience that I would highly recommend.


In [46]:
print(len(vectors))
print(len(vectors[0]))

30
1536


In [47]:
from sklearn.model_selection import train_test_split

# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Size of X_train:", len(X_train))
print("Size of X_test:", len(X_test))
print("Size of y_train:", len(y_train))
print("Size of y_test:", len(y_test))

Size of X_train: 24
Size of X_test: 6
Size of y_train: 24
Size of y_test: 6


In [48]:
# Importing the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Creating an instance of LabelEncoder class
le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps
clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' using the trained pipeline
y_predict = clf.predict(X_test)

100%|██████████| 24/24 [00:03<00:00,  6.40it/s]
100%|██████████| 6/6 [00:00<00:00,  7.44it/s]


In [53]:
y_test_encoded

array([1, 0, 1, 0, 2, 2])
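
To read the encoded test labels, it helps to know that scikit-learn's `LabelEncoder` assigns integer codes in sorted order of the distinct labels. The sketch below mimics that mapping in plain Python; in the notebook itself, `le.inverse_transform` would do the same decoding.

```python
# LabelEncoder assigns codes in sorted order of the distinct labels.
class_names = sorted({"positive", "negative", "neutral"})
code_to_label = dict(enumerate(class_names))

y_test_encoded = [1, 0, 1, 0, 2, 2]  # copied from the output above
decoded = [code_to_label[code] for code in y_test_encoded]

print(code_to_label)
print(decoded)
```

The rows and columns of a confusion matrix computed on the encoded labels follow this same order: negative, neutral, positive.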

In [57]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_actual = y_test_encoded
labels = list(set(y))
print(labels)

# Calculating Accuracy
accuracy = accuracy_score(y_actual, y_predict)
print(f"Accuracy: {accuracy:.2f}")

# Calculating Precision
precision = precision_score(y_actual, y_predict, average='macro')
print(f"Precision: {precision:.2f}")

# Calculating Recall
recall = recall_score(y_actual, y_predict, average='macro')
print(f"Recall: {recall:.2f}")

# Calculating F1 Score
f1 = f1_score(y_actual, y_predict, average='macro')
print(f"F1 Score: {f1:.2f}")

# Calculating Confusion Matrix
labels = list(le.classes_)  # matrix rows/columns follow the LabelEncoder order: negative, neutral, positive
conf_matrix = confusion_matrix(y_actual, y_predict)
print("\nConfusion Matrix:")
print(conf_matrix)

['positive', 'neutral', 'negative']
Accuracy: 0.50
Precision: 0.33
Recall: 0.50
F1 Score: 0.40

Confusion Matrix:
[[0 2 0]
 [0 1 1]
 [0 0 2]]


  _warn_prf(average, modifier, msg_start, len(result))
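
The `_warn_prf` line appears to be the tail of scikit-learn's undefined-metric warning. The sketch below, in plain Python with the matrix copied from the output above, shows where it likely comes from: class 0 is never predicted, so its precision is 0/0, which scikit-learn by default reports as 0.0 with a warning.

```python
# Confusion matrix copied from the output above; rows are true classes and
# columns are predicted classes, both in encoded order 0, 1, 2.
conf_matrix = [
    [0, 2, 0],
    [0, 1, 1],
    [0, 0, 2],
]

per_class_precision = []
for i in range(3):
    tp = conf_matrix[i][i]
    predicted_as_i = sum(row[i] for row in conf_matrix)  # column sum
    # With no predictions for a class, precision is 0/0; fall back to 0.0.
    per_class_precision.append(tp / predicted_as_i if predicted_as_i else 0.0)

macro_precision = sum(per_class_precision) / 3

print(per_class_precision)
print(f"Precision: {macro_precision:.2f}")
```

With only six test samples, each misclassification moves the scores substantially, so these figures say little about the approach in general.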


## Text Summarization
- It can be used in two ways: on its own, or as a preprocessing step before another task (for example, to reduce the size of the data).
- The max_words hyperparameter is a soft limit on the number of words in each generated summary. It is passed to the model in the prompt but is not otherwise enforced, so summaries may exceed it.
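
Because the limit is soft, it can be worth checking the generated summaries against it. A minimal sketch follows, counting whitespace-separated words; the helper `word_count` is hypothetical, and the two summaries are copied from the output shown later in this section.

```python
def word_count(text):
    """Count whitespace-separated tokens as words."""
    return len(text.split())


max_words = 15
summaries = [
    "OpenAI has released GPT-4, a powerful and versatile language model for complex tasks.",
    "John bought groceries in the morning and made a fruit salad for his guests in the evening.",
]

for summary in summaries:
    n_words = word_count(summary)
    status = "within" if n_words <= max_words else "over"
    print(f"{n_words:>2} words ({status} the limit): {summary}")

# Summaries that exceed the requested limit.
over_limit = [s for s in summaries if word_count(s) > max_words]
```

If a hard limit is required, summaries that end up in `over_limit` would need to be truncated or regenerated by the caller.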

In [60]:
# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function
X = get_summarization_dataset()

In [62]:
print(len(X))
print(X[0])

10
The AI research company, OpenAI, has launched a new language model called GPT-4. This model is the latest in a series of transformer-based AI systems designed to perform complex tasks, such as generating human-like text, translating languages, and answering questions. According to OpenAI, GPT-4 is even more powerful and versatile than its predecessors.


In [63]:
# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)

100%|██████████| 10/10 [00:09<00:00,  1.10it/s]


In [65]:
print(len(summaries))
summaries

10


array(['OpenAI has released GPT-4, a powerful and versatile language model for complex tasks.',
       'John bought groceries in the morning and made a fruit salad for his guests in the evening.',
       "NASA's first Mars rover, Sojourner, launched in 1996, greatly contributed to our understanding of the Red Planet.",
       'Regular exercise improves memory and cognitive function in older adults, recommends 30 minutes daily.',
       'The Eiffel Tower, completed in 1889, is a beloved symbol of Paris and French architecture.',
       'Microsoft announces new version of Windows with improved security and redesigned user interface.',
       'WHO declares global public health emergency due to unknown virus outbreak, urges nations to strengthen response systems.',
       "Paris, France will host the 2024 Olympics, marking the city's third time hosting the games.",
       "Apple's latest iPhone model features improved camera, faster processor, longer battery life, launching soon.",
       

## Text Translation

In [66]:
from skllm.preprocessing import GPTTranslator
from skllm.datasets import get_translation_dataset

X = get_translation_dataset()

In [68]:
print(len(X))
X

6


['Me encanta bailar salsa y bachata. Es una forma divertida de expresarme.',
 "J'ai passé mes dernières vacances en Grèce. Les plages étaient magnifiques.",
 'Ich habe gestern ein tolles Buch gelesen. Die Geschichte war fesselnd bis zum Ende.',
 'Gosto de cozinhar pratos tradicionais italianos. O espaguete à carbonara é um dos meus favoritos.',
 'Mám v plánu letos v létě vyrazit na výlet do Itálie. Doufám, že navštívím Řím a Benátky.',
 'Mijn favoriete hobby is fotograferen. Ik hou ervan om mooie momenten vast te leggen.']

In [69]:
t = GPTTranslator(openai_model="gpt-3.5-turbo", output_language="English")
translated_text = t.fit_transform(X)

100%|██████████| 6/6 [00:05<00:00,  1.19it/s]


In [71]:
translated_text

array(["I love to dance salsa and bachata. It's a fun way to express myself.",
       'I spent my last vacation in Greece. The beaches were beautiful.',
       'I read a great book yesterday. The story was captivating until the end.',
       'I enjoy cooking traditional Italian dishes. Spaghetti carbonara is one of my favorites.',
       'I plan to go on a trip to Italy this summer. I hope to visit Rome and Venice.',
       'My favorite hobby is photography. I love capturing beautiful moments.'],
      dtype=object)