<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/scikit_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SCIKIT-LLM

## Why this library?
- Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

- Familiar scikit-learn-style API: `.fit()`, `.fit_transform()`, and `.predict()`.

- Combine estimators from the scikit-llm library in a scikit-learn pipeline.

# SETUP

In [1]:
%%capture
!pip install scikit-llm watermark


In [2]:
%load_ext watermark
%watermark -a "Sudarshan Koirala" -vmp scikit-llm

Author: Sudarshan Koirala

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

scikit-llm: not installed

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.15.107+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



[OpenAI API keys](https://platform.openai.com/account/api-keys)  
[OpenAI Organization ID](https://platform.openai.com/account/org-settings)

In [31]:
# Import SKLLMConfig to configure the OpenAI API (key and organization ID)
from skllm.config import SKLLMConfig

OPENAI_API_KEY = "sk-***"
OPENAI_ORG_ID = "org-***"

# Set your OpenAI API key
SKLLMConfig.set_openai_key(OPENAI_API_KEY)

# Set your OpenAI organization
SKLLMConfig.set_openai_org(OPENAI_ORG_ID)
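
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative is shown below: it reads the credentials from environment variables and then passes them to the same two `SKLLMConfig` calls used above. The variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` and both helper functions are assumptions made for this example, not part of scikit-llm.

```python
import os


def load_openai_credentials(env=None):
    """Return (api_key, org_id) read from environment variables.

    The variable names are a convention assumed for this sketch,
    not something scikit-llm requires.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    org_id = env.get("OPENAI_ORG_ID")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return api_key, org_id


def configure_skllm(env=None):
    """Pass the credentials to scikit-llm using the calls shown above."""
    # Imported lazily so load_openai_credentials works without scikit-llm.
    from skllm.config import SKLLMConfig

    api_key, org_id = load_openai_credentials(env)
    SKLLMConfig.set_openai_key(api_key)
    if org_id:
        SKLLMConfig.set_openai_org(org_id)
```

Calling `configure_skllm()` once at the top of the notebook would then replace the cell above, provided the two variables are set in the runtime.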

# OPENAI
- Using OpenAI models is not free. The API cost for this notebook is small, but keep it in mind.

## Zero-Shot Text Classification
One of ChatGPT's powerful features is the ability to classify text without being re-trained. All it requires is a set of descriptive labels.

`ZeroShotGPTClassifier` lets you create such a model as a regular scikit-learn classifier.

### Training as a regular classifier

In [4]:
# Import the ZeroShotGPTClassifier class and the sample classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

In [5]:
# sentiment analysis dataset
# labels: positive, negative, neutral
X, y = get_classification_dataset()

In [6]:
len(X)

30

In [7]:
X

["I was absolutely blown away by the performances in 'Summer's End'. The acting was top-notch, and the plot had me gripped from start to finish. A truly captivating cinematic experience that I would highly recommend.",
 "The special effects in 'Star Battles: Nebula Conflict' were out of this world. I felt like I was actually in space. The storyline was incredibly engaging and left me wanting more. Excellent film.",
 "'The Lost Symphony' was a masterclass in character development and storytelling. The score was hauntingly beautiful and complimented the intense, emotional scenes perfectly. Kudos to the director and cast for creating such a masterpiece.",
 "I was pleasantly surprised by 'Love in the Time of Cholera'. The romantic storyline was heartwarming and the characters were incredibly realistic. The cinematography was also top-notch. A must-watch for all romance lovers.",
 "I went into 'Marble Street' with low expectations, but I was pleasantly surprised. The suspense was well-maint

In [8]:
y

['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral']

In [9]:
# Note: Python indexing starts at 0 and slice ends are exclusive
def training_data(data):
    subset_1 = data[:8]  # first 8 of samples 1-10 (positive)
    subset_2 = data[10:18]  # first 8 of samples 11-20 (negative)
    subset_3 = data[20:28]  # first 8 of samples 21-30 (neutral)

    combined_data = subset_1 + subset_2 + subset_3
    return combined_data

In [10]:
# Note: Python indexing starts at 0 and slice ends are exclusive
def testing_data(data):
    subset_1 = data[8:10]  # last 2 of samples 1-10 (positive)
    subset_2 = data[18:20]  # last 2 of samples 11-20 (negative)
    subset_3 = data[28:30]  # last 2 of samples 21-30 (neutral)

    combined_data = subset_1 + subset_2 + subset_3
    return combined_data
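
The two functions above hardcode the slice boundaries, so they only work for this exact layout of 10 samples per label. As a rough sketch, the same per-class hold-out can be written without fixed indices. The function name `split_per_class` and the toy data are made up for illustration, and the sketch assumes every label has more than `n_test_per_class` samples.

```python
from collections import defaultdict


def split_per_class(texts, labels, n_test_per_class=2):
    """Hold out the last `n_test_per_class` samples of every label for testing."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    X_train, y_train, X_test, y_test = [], [], [], []
    for label, items in by_label.items():  # dicts keep first-seen label order
        cut = len(items) - n_test_per_class
        X_train += items[:cut]
        y_train += [label] * cut
        X_test += items[cut:]
        y_test += [label] * (len(items) - cut)
    return X_train, X_test, y_train, y_test


# Toy data with the same layout as the dataset: 10 samples per label.
toy_y = ["positive"] * 10 + ["negative"] * 10 + ["neutral"] * 10
toy_X = [f"review {i}" for i in range(30)]
X_tr, X_te, y_tr, y_te = split_per_class(toy_X, toy_y)
print(len(X_tr), len(X_te))
print(y_te)
```

On data laid out like the sample dataset, this selects the same items as `training_data` and `testing_data`.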

In [11]:
X_train = training_data(X)
print(len(X_train))
X_train

24


["I was absolutely blown away by the performances in 'Summer's End'. The acting was top-notch, and the plot had me gripped from start to finish. A truly captivating cinematic experience that I would highly recommend.",
 "The special effects in 'Star Battles: Nebula Conflict' were out of this world. I felt like I was actually in space. The storyline was incredibly engaging and left me wanting more. Excellent film.",
 "'The Lost Symphony' was a masterclass in character development and storytelling. The score was hauntingly beautiful and complimented the intense, emotional scenes perfectly. Kudos to the director and cast for creating such a masterpiece.",
 "I was pleasantly surprised by 'Love in the Time of Cholera'. The romantic storyline was heartwarming and the characters were incredibly realistic. The cinematography was also top-notch. A must-watch for all romance lovers.",
 "I went into 'Marble Street' with low expectations, but I was pleasantly surprised. The suspense was well-maint

In [12]:
y_train = training_data(y)
print(len(y_train))
y_train

24


['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral']

In [13]:
X_test = testing_data(X)
print(len(X_test))
X_test

6


["The cinematography in 'Awakening' was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film.",
 "'Eternal Embers' was a cinematic delight. The storytelling was original and the performances were exceptional. The director's vision was truly brought to life on the big screen. A must-see for all movie lovers.",
 "The acting in 'Desert Mirage' was subpar, and the plot was boring. I found myself yawning multiple times throughout the movie. Save your time and skip this one.",
 "'Crimson Dawn' was a major letdown. The plot was cliched and the characters were flat. The special effects were also poorly executed. I wouldn't recommend it.",
 "'Chasing Shadows' was fairly average. The plot was not bad, and the performances were passable, but it lacked a certain spark. It was just okay.",
 "'Beneath the Surface' was pretty run-of-the-mill. The plot was decent, the performances were okay, but 

In [14]:
y_test = testing_data(y)
print(len(y_test))
y_test

6


['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']

In [32]:
# defining the openai model to use
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data
clf.fit(X_train, y_train)

In [34]:
%%time
# predicting the data
predicted_labels = clf.predict(X_test)

100%|██████████| 6/6 [00:11<00:00,  1.85s/it]

CPU times: user 77.6 ms, sys: 10.6 ms, total: 88.2 ms
Wall time: 11.1 s





In [35]:
for review, sentiment in zip(X_test, predicted_labels):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n\n")

Review: The cinematography in 'Awakening' was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film.
Predicted Sentiment: positive


Review: 'Eternal Embers' was a cinematic delight. The storytelling was original and the performances were exceptional. The director's vision was truly brought to life on the big screen. A must-see for all movie lovers.
Predicted Sentiment: positive


Review: The acting in 'Desert Mirage' was subpar, and the plot was boring. I found myself yawning multiple times throughout the movie. Save your time and skip this one.
Predicted Sentiment: negative


Review: 'Crimson Dawn' was a major letdown. The plot was cliched and the characters were flat. The special effects were also poorly executed. I wouldn't recommend it.
Predicted Sentiment: negative


Review: 'Chasing Shadows' was fairly average. The plot was not bad, and the performances were passable, but it

- Scikit-LLM will automatically query the OpenAI API and transform the response into a regular list of labels.
- Scikit-LLM will ensure that the obtained response contains a valid label. If this is not the case, a label will be selected randomly (label probabilities are proportional to label occurrences in the training set).
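
The fallback described in the second bullet can be pictured with a short sketch. This is not scikit-llm's own code: `pick_valid_label` is a hypothetical helper that only illustrates the idea of sampling a label with probability proportional to its frequency in the training set.

```python
import random
from collections import Counter


def pick_valid_label(response, training_labels, rng=random):
    """Return `response` if it is a known label, else sample one at random.

    The random fallback weights each label by how often it occurs in
    `training_labels`.
    """
    counts = Counter(training_labels)
    cleaned = response.strip()
    if cleaned in counts:
        return cleaned
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


train = ["positive"] * 8 + ["negative"] * 8 + ["neutral"] * 8
# A valid label passes through unchanged.
print(pick_valid_label(" negative ", train))
# An invalid response falls back to a randomly drawn label.
print(pick_valid_label("I am not sure", train, random.Random(0)))
```

With the balanced training set used here, each of the three labels would be drawn with equal probability.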

In [36]:
from sklearn.metrics import accuracy_score

In [37]:
print(f"Accuracy: {accuracy_score(y_test, predicted_labels):.2f}")

Accuracy: 1.00


### Training without labels (what if you don't have labelled data?)
- You don't even need labelled data to train the model.

In [38]:
# defining the model
clf_no_label = ZeroShotGPTClassifier()

# No training data, so pass only the candidate labels
clf_no_label.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels
predicted_labels_without_training_data = clf_no_label.predict(X_test)
predicted_labels_without_training_data

100%|██████████| 6/6 [00:10<00:00,  1.76s/it]


['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']

In [39]:
for review, sentiment in zip(X_test, predicted_labels_without_training_data):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n\n")

Review: The cinematography in 'Awakening' was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film.
Predicted Sentiment: positive


Review: 'Eternal Embers' was a cinematic delight. The storytelling was original and the performances were exceptional. The director's vision was truly brought to life on the big screen. A must-see for all movie lovers.
Predicted Sentiment: positive


Review: The acting in 'Desert Mirage' was subpar, and the plot was boring. I found myself yawning multiple times throughout the movie. Save your time and skip this one.
Predicted Sentiment: negative


Review: 'Crimson Dawn' was a major letdown. The plot was cliched and the characters were flat. The special effects were also poorly executed. I wouldn't recommend it.
Predicted Sentiment: negative


Review: 'Chasing Shadows' was fairly average. The plot was not bad, and the performances were passable, but it

In [40]:
print(f"Accuracy: {accuracy_score(y_test, predicted_labels_without_training_data):.2f}")

Accuracy: 1.00


- You can train a classifier without explicitly labelled data, simply by specifying the candidate labels.
- Each label has to be expressed in natural language and be descriptive and self-explanatory.
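
Real datasets often store terse codes rather than descriptive labels. One possible way to handle that, sketched below, is to translate the codes into descriptive labels before calling `fit(None, ...)` and translate the predictions back afterwards. The codes `pos`, `neg`, `neu` and the helper `to_codes` are hypothetical.

```python
# Hypothetical short codes, as they might appear in a database column.
CODE_TO_DESCRIPTION = {
    "pos": "positive",
    "neg": "negative",
    "neu": "neutral",
}
DESCRIPTION_TO_CODE = {desc: code for code, desc in CODE_TO_DESCRIPTION.items()}

# Candidate labels to hand to the classifier, e.g. clf.fit(None, candidate_labels).
candidate_labels = list(CODE_TO_DESCRIPTION.values())


def to_codes(predicted_descriptions):
    """Map the classifier's descriptive predictions back to the short codes."""
    return [DESCRIPTION_TO_CODE[label] for label in predicted_descriptions]


print(candidate_labels)
print(to_codes(["positive", "neutral"]))
```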

## Multi-Label Zero-shot Text Classification
<font color="orange">What if you have a multi-label case? Scikit-LLM also supports multi-label zero-shot text classification, and it too works with or without labelled data.</font>
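
The sketch below shows what the multi-label case might look like. It assumes the `MultiLabelZeroShotGPTClassifier` class and its `max_labels` argument as described in the scikit-llm README for this release, so check them against your installed version. The reviews and tags are made up, and the function is only defined here, not called, because it needs a configured API key.

```python
def fit_multilabel_zero_shot(candidate_labels, max_labels=3):
    """Build a multi-label zero-shot classifier from candidate labels only.

    Assumes the class and argument names from the scikit-llm README.
    """
    # Lazy import: requires scikit-llm and a configured OpenAI key.
    from skllm import MultiLabelZeroShotGPTClassifier

    clf = MultiLabelZeroShotGPTClassifier(max_labels=max_labels)
    # Without labelled data, the candidate labels are wrapped in a list
    # so that `y` still looks like one sample carrying several labels.
    clf.fit(None, [candidate_labels])
    return clf


# With labelled data, each sample carries a list of labels instead of one string.
X_multi = [
    "Great picture quality, but the delivery took three weeks.",
    "Support answered quickly and the price was fair.",
]
y_multi = [
    ["Quality", "Delivery"],
    ["Support", "Price"],
]
candidate_labels = sorted({label for labels in y_multi for label in labels})
print(candidate_labels)
```

Predictions would then come from `fit_multilabel_zero_shot(candidate_labels).predict(X_multi)`, with up to `max_labels` labels per review.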

## Text Vectorization (inputs to ML model)
- Scikit-LLM provides the `GPTVectorizer` class to convert input text into a fixed-dimensional vector representation.
- Each resulting vector is an array of floating-point numbers that represents the corresponding sentence.
- The class uses an OpenAI embedding model to convert text to GPT embeddings (defaults to `text-embedding-ada-002`).

In [42]:
from skllm.preprocessing import GPTVectorizer

X = [
    "Scikit-llm is fantastic library.",
    "You need to learn machine learning.",
    "Learn scikit learn for more information."
]

vectorizer = GPTVectorizer()

vectors = vectorizer.fit_transform(X)

print(vectors)

100%|██████████| 3/3 [00:01<00:00,  2.33it/s]

1536
[[-0.02586892  0.02396249  0.00046008 ... -0.01282313 -0.02209782
  -0.04567068]
 [-0.02056578  0.00380191  0.01963336 ...  0.00560766 -0.00964596
  -0.01989601]
 [-0.00847352  0.01418481  0.02235141 ... -0.00341943 -0.01489205
  -0.03827095]]





In [43]:
# embeddings dimensions of OpenAI text-embedding-ada-002
len(vectors[0])

1536
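
A common way to use such vectors is to compare sentences by cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for the 1536-dimensional embeddings, so it runs without an API call. `cosine_similarity` here is a small helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
v_ml = [0.2, 0.9, 0.1]
v_sklearn = [0.25, 0.85, 0.05]
v_other = [0.9, -0.1, 0.4]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(v_ml, v_sklearn) > cosine_similarity(v_ml, v_other))
```

With the notebook's output, the same helper could be applied as `cosine_similarity(vectors[0], vectors[1])`.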

## 😎 Let's use `scikit-llm` within a `scikit-learn` pipeline

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skllm.preprocessing import GPTVectorizer
from xgboost import XGBClassifier

In [46]:
# Encode labels
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

In [47]:
y_train_encoded, y_test_encoded

(array([2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1]),
 array([2, 2, 0, 0, 1, 1]))
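
The integer codes above follow from `LabelEncoder` sorting the distinct labels and using each label's position as its code. A plain-Python sketch of that mapping is given below. `encode_labels` is a hypothetical stand-in written for illustration, not the scikit-learn implementation.

```python
def encode_labels(labels):
    """Mimic LabelEncoder: sort the distinct labels, then map each to its index."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


classes, encoded = encode_labels(
    ["positive", "positive", "negative", "negative", "neutral", "neutral"]
)
print(classes)   # ['negative', 'neutral', 'positive']
print(encoded)   # [2, 2, 0, 0, 1, 1]
```

This is why `positive` becomes 2, `negative` becomes 0, and `neutral` becomes 1 in the arrays above.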

In [48]:
# Use a scikit-learn pipeline
steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)

clf.fit(X_train, y_train_encoded)

100%|██████████| 24/24 [00:26<00:00,  1.09s/it]


In [49]:
y_pred_encoded = clf.predict(X_test)

100%|██████████| 6/6 [00:05<00:00,  1.01it/s]


In [52]:
# Revert the encoded labels to actual labels
y_pred = le.inverse_transform(y_pred_encoded)

In [53]:
print(f"\nEncoded labels (train set): {y_train_encoded}\n")
print(f"Actual Labels (train set): {y_train}")

print(f"Actual labels (test set, encoded): {y_test_encoded}\n")

print("------------------\nEvaluate the performance of XGBoost Classifier:\n")
for test_review, actual_label, predicted_label in zip(X_test, y_test, y_pred):
    print(f"Review: {test_review}\nActual Label: {actual_label}\nPredicted Label: {predicted_label}\n")


Encoded labels (train set): [2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]

Actual Labels (train set): ['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']
Actual labels (test set, encoded): [2 2 0 0 1 1]

------------------
Evaluate the performance of XGBoost Classifier:

Review: The cinematography in 'Awakening' was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film.
Actual Label: positive
Predicted Label: neutral

Review: 'Eternal Embers' was a cinematic delight. The storytelling was original and the performances were exceptional. The director's vision was truly brought to life on the big screen. A must-see for all movie lovers.
Actual Label: positive
Pre

In [54]:
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Accuracy: 0.67
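
With only six test reviews, a single accuracy number hides which classes get confused. A small sketch of a per-pair breakdown follows. The `predicted` list is hypothetical and is NOT the notebook's real output. It was simply chosen to give four correct answers out of six.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Hypothetical predictions, NOT the notebook's real output.
actual = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
predicted = ["neutral", "positive", "negative", "negative", "neutral", "positive"]

counts = confusion_counts(actual, predicted)
accuracy = sum(n for (a, p), n in counts.items() if a == p) / len(actual)
for (a, p), n in sorted(counts.items()):
    print(f"actual={a:<8} predicted={p:<8} count={n}")
print(f"Accuracy: {accuracy:.2f}")
```

In the notebook, the same breakdown would come from `confusion_counts(y_test, y_pred)`.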


# GPT4ALL
- Same data, with one of the models from GPT4ALL.
- The model is downloaded automatically on the first run.
- You need to restart the runtime first.

In [1]:
%%capture
!pip install "scikit-llm[gpt4all]"

- To switch from OpenAI to a GPT4ALL model, simply provide a string of the format `gpt4all::<model_name>` as the `openai_model` argument.
- While the model runs completely locally, the estimator still treats it as an OpenAI endpoint and checks that an API key is present.
- You can provide any string as the key.
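
To make the `gpt4all::<model_name>` format concrete, here is a small sketch that splits such a string into its backend and model name. `parse_model_name` is a hypothetical helper for illustration only. It is not how scikit-llm parses the argument internally.

```python
def parse_model_name(openai_model):
    """Split a model string into (backend, model_name).

    Strings of the form `gpt4all::<model_name>` select the local backend;
    anything else is treated as an OpenAI model name.
    """
    prefix = "gpt4all::"
    if openai_model.startswith(prefix):
        return "gpt4all", openai_model[len(prefix):]
    return "openai", openai_model


print(parse_model_name("gpt4all::ggml-gpt4all-j-v1.3-groovy"))
print(parse_model_name("gpt-3.5-turbo"))
```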

In [2]:
from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("any string")
SKLLMConfig.set_openai_org("any string")

from skllm import ZeroShotGPTClassifier
clf_gpt4all = ZeroShotGPTClassifier(openai_model="gpt4all::ggml-gpt4all-j-v1.3-groovy")

### Go back and re-create the train and test data from above.

In [15]:
# fitting the data
clf_gpt4all.fit(X_train, y_train)

In [16]:
# predicting the labels
y_pred_gpt4all = clf_gpt4all.predict(X_test)

  0%|          | 0/6 [00:00<?, ?it/s]
  0%|          | 0.00/3.79G [00:00<?, ?iB/s]

Model downloaded at:  /root/.cache/gpt4all/ggml-gpt4all-j-v1.3-groovy.bin


 17%|█▋        | 1/6 [04:11<20:59, 251.87s/it]
 33%|███▎      | 2/6 [06:49<13:05, 196.50s/it]
 50%|█████     | 3/6 [09:53<09:31, 190.52s/it]
 67%|██████▋   | 4/6 [13:12<06:28, 194.19s/it]
 83%|████████▎ | 5/6 [17:03<03:27, 207.20s/it]
100%|██████████| 6/6 [21:24<00:00, 214.10s/it]

In [17]:
print("------------------\nEvaluate the performance of GPT4ALL model:\n")
for test_review, actual_label, predicted_label in zip(X_test, y_test, y_pred_gpt4all):
    print(f"Review: {test_review}\nActual Label: {actual_label}\nPredicted Label: {predicted_label}\n")

------------------
Evaluate the performance of GPT4ALL model:

Review: The cinematography in 'Awakening' was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film.
Actual Label: positive
Predicted Label: positive

Review: 'Eternal Embers' was a cinematic delight. The storytelling was original and the performances were exceptional. The director's vision was truly brought to life on the big screen. A must-see for all movie lovers.
Actual Label: positive
Predicted Label: positive

Review: The acting in 'Desert Mirage' was subpar, and the plot was boring. I found myself yawning multiple times throughout the movie. Save your time and skip this one.
Actual Label: negative
Predicted Label: negative

Review: 'Crimson Dawn' was a major letdown. The plot was cliched and the characters were flat. The special effects were also poorly executed. I wouldn't recommend it.
Actual Label: negative
Pr

In [22]:
from sklearn.metrics import accuracy_score

In [23]:
print(f"Accuracy: {accuracy_score(y_test, y_pred_gpt4all):.2f}")

Accuracy: 0.67


Now assume we have no labelled data and pass only the candidate labels.

In [18]:
# defining the model
clf_gpt4all_no_label = ZeroShotGPTClassifier(openai_model="gpt4all::ggml-gpt4all-j-v1.3-groovy")

# No training data, so pass only the candidate labels
clf_gpt4all_no_label.fit(None, ['positive', 'negative', 'neutral'])

In [19]:
%%time
# predicting the labels
y_pred_gpt4all_no_label = clf_gpt4all_no_label.predict(X_test)

 17%|█▋        | 1/6 [11:45<58:47, 705.47s/it]
 33%|███▎      | 2/6 [15:42<28:38, 429.71s/it]
 50%|█████     | 3/6 [19:59<17:32, 350.87s/it]
 67%|██████▋   | 4/6 [31:57<16:32, 496.08s/it]
 83%|████████▎ | 5/6 [36:15<06:50, 410.11s/it]
100%|██████████| 6/6 [40:49<00:00, 408.32s/it]


CPU times: user 1h 11min 29s, sys: 3.07 s, total: 1h 11min 32s
Wall time: 40min 49s





In [20]:
print("------------------\nEvaluate the performance of GPT4ALL model which has no labels:\n")
for test_review, actual_label, predicted_label in zip(X_test, y_test, y_pred_gpt4all_no_label):
    print(f"Review: {test_review}\nActual Label: {actual_label}\nPredicted Label: {predicted_label}\n")

------------------
Evaluate the performance of GPT4ALL model which has no labels:

Review: The cinematography in 'Awakening' was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film.
Actual Label: positive
Predicted Label: positive

Review: 'Eternal Embers' was a cinematic delight. The storytelling was original and the performances were exceptional. The director's vision was truly brought to life on the big screen. A must-see for all movie lovers.
Actual Label: positive
Predicted Label: positive

Review: The acting in 'Desert Mirage' was subpar, and the plot was boring. I found myself yawning multiple times throughout the movie. Save your time and skip this one.
Actual Label: negative
Predicted Label: negative

Review: 'Crimson Dawn' was a major letdown. The plot was cliched and the characters were flat. The special effects were also poorly executed. I wouldn't recommend it.
Actua

In [24]:
print(f"Accuracy: {accuracy_score(y_test, y_pred_gpt4all_no_label):.2f}")

Accuracy: 0.67
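
To put the run times recorded in this notebook side by side, the arithmetic can be sketched as follows. The totals are taken from the outputs above (11.1 s for OpenAI, 21:24 and 40:49 for the two GPT4ALL runs on 6 reviews). The helper `seconds` and the dictionary keys are names chosen for this example.

```python
def seconds(minutes, secs):
    """Convert a mm:ss progress-bar reading to seconds."""
    return minutes * 60 + secs


N_REVIEWS = 6
runs = {
    "OpenAI gpt-3.5-turbo": 11.1,                  # wall time reported by %%time
    "GPT4ALL, fitted with data": seconds(21, 24),  # includes the model download
    "GPT4ALL, labels only": seconds(40, 49),
}

baseline = runs["OpenAI gpt-3.5-turbo"]
for name, total in runs.items():
    print(f"{name:<26} {total / N_REVIEWS:7.1f} s/review  {total / baseline:6.1f}x")
```

These figures come from a 2-core Colab CPU runtime, so local hardware with more cores or a GPU may behave quite differently.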
