In [11]:
import pandas as pd
import os

## 1. Reading data from the file / Data Preparation

In [13]:
dataset_path = "subjects-questions.csv"

In [14]:
df = pd.read_csv(dataset_path, 
                 encoding = "ISO-8859-1",  
                 on_bad_lines = 'skip', 
                 engine='python', 
                 names = ['input', 'class', "extra"]
                )
print(df.shape)
df.head()

(122574, 3)


Unnamed: 0,input,class,extra
0,An anti-forest measure is\nA. Afforestation\nB...,Biology,
1,"Among the following organic acids, the acid pr...",Chemistry,
2,If the area of two similar triangles are equal...,Maths,
3,"In recent year, there has been a growing\nconc...",Biology,
4,Which of the following statement\nregarding tr...,Physics,


In [4]:
# Check the class distribution
df["class"].value_counts()

class
Physics                                                                                                                                 38435
Chemistry                                                                                                                               37764
Maths                                                                                                                                   33170
Biology                                                                                                                                 13122
 \mathbf{2} \boldsymbol{x}-\mathbf{5} \boldsymbol{y}=\mathbf{1} \)                                                                          2
 (ii) and (iii)                                                                                                                             1
 STATEMENT-2 is True STATEMENT-2 is a correct explanation for STATEMEN                                                                      1


In [5]:
# Keep only the rows labelled with one of the four valid classes
df = df[df["class"].isin(['Physics', 'Chemistry', 'Maths', 'Biology'])]
df["class"].value_counts()

class
Physics      38435
Chemistry    37764
Maths        33170
Biology      13122
Name: count, dtype: int64

The dataset is reasonably balanced. Biology is the minority class, with about a third as many records as Physics, but that should not be critical for us.

To speed up the calculations, let's use only 1,000 records from this dataset.

In [6]:
df = df.sample(frac=1)
df = df[:1000]
df["class"].value_counts()

class
Physics      334
Chemistry    313
Maths        240
Biology      113
Name: count, dtype: int64
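The shuffle above is unseeded, so the class counts will differ on every run. A minimal sketch of a reproducible alternative on a toy frame; the toy data and the seed value 42 are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the filtered dataset (hypothetical data).
toy = pd.DataFrame({
    "input": [f"question {i}" for i in range(100)],
    "class": ["Physics", "Chemistry", "Maths", "Biology"] * 25,
})

# n sets the sample size directly; random_state fixes the shuffle,
# so repeated runs return exactly the same rows.
sample_a = toy.sample(n=20, random_state=42)
sample_b = toy.sample(n=20, random_state=42)

print(sample_a.index.equals(sample_b.index))
print(sample_a["class"].value_counts())
```

With the real data this would read `df.sample(n=1000, random_state=...)`, replacing the shuffle-then-slice pair.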

In [7]:
df.head()

Unnamed: 0,input,class,extra
82963,A ray is reflected in turn by three plane mirr...,Physics,
70834,Numbers greater than 1000 but not greater than...,Maths,
114605,The ionisation energy of isotopes of an elemen...,Chemistry,
85906,Which of the following products obtained by de...,Chemistry,
88339,Find the maximum zener current for the\nzener ...,Physics,


## 2. Data Preprocessing

All preprocessing methods can be found in text_preprocessing.py.

I doubt that these preprocessing methods will help classification, because embedding models rely on the semantics of the full text, but let's test them.
These techniques can still be useful with ChatGPT, because they reduce the number of tokens and therefore the cost of calling the OpenAI API.

In [9]:
from text_preprocessing import TextProcessor

Collecting en-core-web-lg==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [10]:
processor = TextProcessor()

TextProcessor offers several techniques. Let's apply lowercasing, remove_punctuation, remove_stop_words, and stemming.

Afterwards, we will compare a model trained on the raw text with one trained on the processed text.
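text_preprocessing.py is not shown in this notebook, so here is a rough sketch of what such a pipeline could look like, using only the standard library. The stop-word list and the suffix-stripping stemmer are simplified stand-ins chosen for illustration; the real TextProcessor loads a spaCy model and very likely behaves differently.

```python
import string

# Tiny illustrative stop-word list (assumption; a real one is much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "by", "to", "and"}


def lowercasing(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def naive_stemming(text: str) -> str:
    # Crude suffix stripping for illustration only; not a real stemmer.
    def stem(word: str) -> str:
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    return " ".join(stem(word) for word in text.split())


def preprocess(text: str) -> str:
    # Apply the four steps in the same order as in the notebook.
    for step in (lowercasing, remove_punctuation, remove_stop_words, naive_stemming):
        text = step(text)
    return text


print(preprocess("A ray is reflected in turn by three plane mirrors."))
# ray reflect turn three plane mirror
```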

In [11]:
processor = TextProcessor()

df["input_processed"] = (df["input"]
                        .apply(processor.lowercasing)
                        .apply(processor.remove_punctuation)
                        .apply(processor.remove_stop_words)
                        .apply(processor.stemming))

In [12]:
df.head()

Unnamed: 0,input,class,extra,input_processed
82963,A ray is reflected in turn by three plane mirr...,Physics,,ray reflect turn three plane mirror mutual rig...
70834,Numbers greater than 1000 but not greater than...,Maths,,number greater 1000 greater 4000 form digit 01...
114605,The ionisation energy of isotopes of an elemen...,Chemistry,,ionis energi isotop element b differ c depend ...
85906,Which of the following products obtained by de...,Chemistry,,follow product obtain destruct distil coal pro...
88339,Find the maximum zener current for the\nzener ...,Physics,,find maximum zener current zener diod shown fi...


## 3. Apply Embeddings

encoding_data.py contains a class that encodes data in batches. Let's use it.
Batched processing avoids memory issues during encoding.

For encoding, I used the Sentence Transformers package with the model "baai/bge-large-en-v1.5". The reason for choosing this model is simple: in previous projects these embeddings showed great results, for some tasks even better than OpenAI embeddings.
I cannot guarantee that it is the ideal model, but it is a good starting point.

Let's encode both the input and input_processed fields to compare which approach works better.
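encoding_data.py is not included here, so the following is only a sketch of the batching idea, not the actual BatchEncoder. The helper names `iter_batches`, `encode_in_batches`, and `fake_encode` are hypothetical; the fake encoder exists so the example runs without downloading a model.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str]], Sequence[Vector]],
    batch_size: int = 64,
) -> List[Vector]:
    # Only one batch is passed to the encoder at a time,
    # which keeps peak memory bounded by the batch size.
    embeddings: List[Vector] = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(encode_fn(batch))
    return embeddings


def fake_encode(batch: Sequence[str]) -> List[Vector]:
    # Stand-in encoder: two toy features per text instead of a real embedding.
    return [[float(len(text)), float(text.count(" "))] for text in batch]


texts = [f"question number {i}" for i in range(130)]
vectors = encode_in_batches(texts, fake_encode, batch_size=64)
print(len(vectors), len(vectors[0]))
```

With the sentence-transformers package installed, `encode_fn` could be the `encode` method of a `SentenceTransformer` instance loaded with the model named above.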

In [13]:
from encoding_data import BatchEncoder



In [14]:
# Encoding initial input 
df = BatchEncoder(dataframe=df, 
                  column_to_encode='input', 
                  embedding_column_name = "embedding").process_data()

# Encoding processed input 
df = BatchEncoder(dataframe=df, 
                  column_to_encode='input_processed', 
                  embedding_column_name = "embedding_processed").process_data()

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: baai/bge-large-en-v1.5
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:root:Batch Encoder for column input initialized with batch size 64.
INFO:root:Creating dataset...
INFO:root:Dataset created.
INFO:root:Starting encoding...

In [15]:
df.head()

Unnamed: 0,input,class,extra,input_processed,embedding,embedding_processed
82963,A ray is reflected in turn by three plane mirr...,Physics,,ray reflect turn three plane mirror mutual rig...,"[-0.021937372162938118, 0.004053227137774229, ...","[-0.004239920526742935, 0.0012692904565483332,..."
70834,Numbers greater than 1000 but not greater than...,Maths,,number greater 1000 greater 4000 form digit 01...,"[0.03185784071683884, -0.01292299572378397, 0....","[0.015118478797376156, -0.019171688705682755, ..."
114605,The ionisation energy of isotopes of an elemen...,Chemistry,,ionis energi isotop element b differ c depend ...,"[-0.010259386152029037, 0.019595377147197723, ...","[0.02128755673766136, 0.02362244389951229, 0.0..."
85906,Which of the following products obtained by de...,Chemistry,,follow product obtain destruct distil coal pro...,"[0.0018703339155763388, -0.021671300753951073,...","[-0.026240870356559753, -0.04943910613656044, ..."
88339,Find the maximum zener current for the\nzener ...,Physics,,find maximum zener current zener diod shown fi...,"[-3.6853998608421534e-05, 0.000886078807525336...","[0.01002984307706356, -0.02718561887741089, 0...."


## 4. Classification with XGBoost

XGBoost was selected for classification because it is a strong general-purpose algorithm that performs well on most tasks.

Let's train and test XGBoost on the raw-input embeddings and on the processed-input embeddings, then compare the results.

Please note that we use random_state=1. This gives the same train/test split for the raw and the processed data, so differences in the results are not caused by different splits.
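text_classifier.py is not shown, so how XGBTextClassifier actually splits the data is an assumption. The sketch below uses only the standard library to illustrate why a fixed seed matters: both experiments end up with exactly the same test rows. The 25% test fraction is inferred from the reports (250 test records out of 1,000); the function name `split_indices` is hypothetical.

```python
import random
from typing import List, Tuple


def split_indices(n_rows: int, test_fraction: float = 0.25, seed: int = 1) -> Tuple[List[int], List[int]]:
    # Shuffle row positions with a seeded generator, then cut off the test part.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Same seed for both experiments: identical train and test rows.
train_raw, test_raw = split_indices(1000, seed=1)
train_processed, test_processed = split_indices(1000, seed=1)

print(test_raw == test_processed)
print(len(train_raw), len(test_raw))
```

Because the test rows are identical, any gap between the two classification reports comes from the embeddings rather than from the split.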

In [17]:
from text_classifier import XGBTextClassifier

In [20]:
# Train and evaluate classifier for raw data
clf = XGBTextClassifier(dataset=df, embedding_column='embedding', predict_column='class')
clf.prepare_data()
clf.train()
y_pred = clf.predict()
clf.evaluate(y_pred)

INFO:root:              precision    recall  f1-score   support

     Biology       0.95      0.70      0.81        27
   Chemistry       0.80      0.88      0.83        80
       Maths       0.93      0.95      0.94        56
     Physics       0.88      0.86      0.87        87

    accuracy                           0.87       250
   macro avg       0.89      0.85      0.86       250
weighted avg       0.87      0.87      0.87       250



In [21]:
# Train and evaluate classifier for processed data
clf = XGBTextClassifier(dataset=df, embedding_column='embedding_processed', predict_column='class')
clf.prepare_data()
clf.train()
y_pred = clf.predict()
clf.evaluate(y_pred)

INFO:root:              precision    recall  f1-score   support

     Biology       0.86      0.70      0.78        27
   Chemistry       0.80      0.80      0.80        80
       Maths       0.88      0.95      0.91        56
     Physics       0.85      0.86      0.86        87

    accuracy                           0.84       250
   macro avg       0.85      0.83      0.84       250
weighted avg       0.84      0.84      0.84       250



The results for raw data are slightly better (0.87 vs. 0.84 accuracy on 250 test records), which makes sense because the raw text keeps its full semantics.

This might change once we tune XGBoost and train the model on more data.

## 5. Classification with GPT models

GPT classification was done with prompt engineering, using LangChain as a wrapper around the openai package.
Please note that the prompt is very simple and contains no examples (it is zero-shot rather than few-shot).
The reasoning is that this task is simple for the GPT-4 model. Adding examples to the prompt could improve accuracy, but it would also increase the number of tokens we send to OpenAI and therefore the cost of the solution.
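The prompt in chat_gpt_classifier.py is not shown, so the wording below is hypothetical. It is only a sketch of what a zero-shot prompt and a defensive parser for the model's reply could look like; no API call is made.

```python
from typing import List, Optional

CLASSES = ["Physics", "Chemistry", "Maths", "Biology"]


def build_prompt(text: str, classes: List[str] = CLASSES) -> str:
    # Zero-shot: the prompt lists the allowed labels but gives no examples.
    options = ", ".join(classes)
    return (
        f"Classify the following question into exactly one of these subjects: {options}.\n"
        "Answer with the subject name only.\n\n"
        f"Question: {text}"
    )


def parse_label(response: str, classes: List[str] = CLASSES) -> Optional[str]:
    # Normalise whitespace, quotes, trailing dots, and case before matching.
    cleaned = response.strip().strip(".'\"").lower()
    for name in classes:
        if name.lower() == cleaned:
            return name
    return None  # The reply was not one of the allowed labels.


print(build_prompt("1+4 = 5"))
print(parse_label(" maths.\n"))
```

Returning `None` for unexpected replies makes it easy to count or retry them instead of silently storing an invalid class.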

In [1]:
from chat_gpt_classifier import GPTTextClassifier

In [5]:
os.environ["OPENAI_API_KEY"] = "sk-..."

In [6]:
GPTTextClassifier().classify("1+4 = 5")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Maths'

In [7]:
GPTTextClassifier().classify("The law of universal gravitation")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Physics'

In [8]:
GPTTextClassifier().classify("Potassium carbonate")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Chemistry'

In [9]:
GPTTextClassifier().classify("Head Disease")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Biology'

## 6. Summary

I did not calculate metrics for the ChatGPT approach, and XGBoost was trained without any parameter tuning or optimization.
I also do not have a full understanding of the use case.
So it is hard to conclude which approach is better for the current task.
Still, here are my thoughts:

**Classification with embeddings**

This approach is good when:
1. We have enough data for training.
2. Our classes do not change frequently (for example, we have these 4 classes and adding a new one is unlikely).
3. Our budget is limited.
4. For security reasons, we cannot send data to third-party APIs, for example because we hold sensitive customer data.
5. Response time is important. GPT adds some delay to every response because it is a third-party API.
6. The server load is very high. Under heavy load we would send many requests to OpenAI and could exceed the token limits, which slows the solution down. A custom model may be a better choice.

**Classification with Chat GPT**

This approach is good when:
1. We have limited time to propose and deploy a solution. This approach requires little more than prompt engineering and a simple API that calls OpenAI models.
2. Sending data to a third-party API raises no security issues, i.e. we have no sensitive customer data.
3. The server load is not high and response time is not important.
4. We have a good budget. Each call to OpenAI has a cost.
5. Classes change frequently. For example, we might need to add new classes. In that case, adding a class only requires changing the prompt.

## 7. Bonus part

If we do not have a class label for each record, using an LLM such as ChatGPT (as was done above) might be a good choice.
Let's apply this method to 5 records from the dataset.

In [18]:
# Take an explicit copy so the new column is not assigned to a slice of df
df_bonus = df[:5].copy()

def llm_classification(text):
    return GPTTextClassifier().classify(text)

df_bonus["class_llm"] = df_bonus["input"].apply(llm_classification)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [19]:
df_bonus.head()

Unnamed: 0,input,class,extra,class_llm
0,An anti-forest measure is\nA. Afforestation\nB...,Biology,,Biology
1,"Among the following organic acids, the acid pr...",Chemistry,,Chemistry
2,If the area of two similar triangles are equal...,Maths,,Maths
3,"In recent year, there has been a growing\nconc...",Biology,,Biology
4,Which of the following statement\nregarding tr...,Physics,,Physics
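Since no metrics were calculated for the ChatGPT approach, here is a small sketch of how the two label columns could be compared. The labels are copied from the five rows above; the helper names are hypothetical, and five records are far too few to draw conclusions from.

```python
from collections import Counter
from typing import List


def accuracy(true_labels: List[str], predicted_labels: List[str]) -> float:
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def disagreements(true_labels: List[str], predicted_labels: List[str]) -> Counter:
    # Count each (true, predicted) pair where the LLM disagreed with the label.
    return Counter((t, p) for t, p in zip(true_labels, predicted_labels) if t != p)


# Values of the "class" and "class_llm" columns shown above.
true_labels = ["Biology", "Chemistry", "Maths", "Biology", "Physics"]
llm_labels = ["Biology", "Chemistry", "Maths", "Biology", "Physics"]

print(accuracy(true_labels, llm_labels))       # 1.0
print(disagreements(true_labels, llm_labels))  # Counter()
```

Running the same comparison on the 250-record test split would make the GPT results directly comparable with the XGBoost reports.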


The current task is simple for the GPT-4 model, so I am not sure that verification would add much value.
But let's imagine a more complex task, for example one with 50 classes or with classes that are less obvious.

In this case:
1. I would use OpenAI function calling, which returns structured output and should make the returned result more reliable.
2. We might need to add a description of each class to the prompt, but that increases the number of tokens and the price per LLM request.
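As a rough sketch of the function-calling idea: the dictionary below follows the `tools` format of the OpenAI Chat Completions API as I understand it, with the allowed labels expressed as a JSON Schema `enum`. The function name `record_subject` and the field name `subject` are hypothetical, and no request is sent here.

```python
import json
from typing import List

CLASSES = ["Physics", "Chemistry", "Maths", "Biology"]

# Tool definition that would be passed in the `tools` list of a chat request.
classify_tool = {
    "type": "function",
    "function": {
        "name": "record_subject",  # hypothetical name
        "description": "Record the school subject that a question belongs to.",
        "parameters": {
            "type": "object",
            "properties": {
                "subject": {
                    "type": "string",
                    "enum": CLASSES,  # restricts the answer to known labels
                    "description": "The subject of the question.",
                },
            },
            "required": ["subject"],
        },
    },
}


def parse_tool_arguments(arguments_json: str, classes: List[str] = CLASSES) -> str:
    # Tool-call arguments arrive as a JSON string; validate them before use.
    arguments = json.loads(arguments_json)
    subject = arguments.get("subject")
    if subject not in classes:
        raise ValueError(f"unexpected subject: {subject!r}")
    return subject


print(parse_tool_arguments('{"subject": "Maths"}'))  # Maths
```

Validating on our side is still worthwhile, because the schema guides the model but an unexpected value should never reach the dataset unnoticed.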

If we do not have human validation, a few approaches would probably help:
1. An ensemble of models. Let's use not only OpenAI models but also, for example, Google Gemini, and then let the models vote. We could even try open-source models such as Mixtral.
2. Add a separate model that evaluates the results of the previous ones. We can design a prompt that responds with a score.
3. Calculate the cosine similarity between the predicted class and the input. I am not sure this approach will work, though, because our class is a single word.
A more advanced version is to describe each class in a couple of sentences, calculate the cosine similarity between the input and each class description, and return the class with the highest score.
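The last idea can be sketched with plain Python. The 3-dimensional vectors below are toy stand-ins for real embeddings of the class descriptions and of the input, and the function names are hypothetical; in practice the vectors would come from the same embedding model used earlier.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Similarity with a zero vector is undefined; treat it as 0.
    return dot / (norm_a * norm_b)


def closest_class(
    input_vector: List[float],
    class_vectors: Dict[str, List[float]],
) -> Tuple[str, Dict[str, float]]:
    # Score the input against every class description and keep the best match.
    scores = {name: cosine_similarity(input_vector, vector)
              for name, vector in class_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for embeddings of the class descriptions.
class_vectors = {
    "Physics": [0.9, 0.1, 0.0],
    "Chemistry": [0.1, 0.9, 0.0],
    "Maths": [0.0, 0.1, 0.9],
}
input_vector = [0.8, 0.2, 0.1]

best, scores = closest_class(input_vector, class_vectors)
print(best)  # Physics
```

The per-class scores can also serve as a confidence signal: if the best score is close to the second best, the record could be flagged for manual review.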