In [1]:
import pandas as pd
import os

## 1. Reading data from the file / Data Preparation

In [2]:
dataset_path = "subjects-questions.csv"

In [3]:
df = pd.read_csv(dataset_path, 
                 encoding = "ISO-8859-1",  
                 on_bad_lines = 'skip', 
                 engine='python', 
                 names = ['input', 'class', "extra"]
                )
print(df.shape)
df.head()

(122574, 3)


Unnamed: 0,input,class,extra
0,An anti-forest measure is\nA. Afforestation\nB...,Biology,
1,"Among the following organic acids, the acid pr...",Chemistry,
2,If the area of two similar triangles are equal...,Maths,
3,"In recent year, there has been a growing\nconc...",Biology,
4,Which of the following statement\nregarding tr...,Physics,


In [4]:
# Look at the class distribution
df["class"].value_counts()

class
Physics                                                                                                                                 38435
Chemistry                                                                                                                               37764
Maths                                                                                                                                   33170
Biology                                                                                                                                 13122
 \mathbf{2} \boldsymbol{x}-\mathbf{5} \boldsymbol{y}=\mathbf{1} \)                                                                          2
 (ii) and (iii)                                                                                                                             1
 STATEMENT-2 is True STATEMENT-2 is a correct explanation for STATEMEN                                                                      1


In [5]:
# Keep only rows whose label is one of the four expected classes
df = df[df["class"].isin(['Physics', 'Chemistry', 'Maths', 'Biology'])]
df["class"].value_counts()

class
Physics      38435
Chemistry    37764
Maths        33170
Biology      13122
Name: count, dtype: int64

The dataset is more or less balanced. Biology is the minority class, but the imbalance should not be critical for this task.
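To put numbers on the balance claim, the class shares can be computed directly from the counts printed above. This is a minimal sketch that uses only those four counts; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "Physics": 38435,
    "Chemistry": 37764,
    "Maths": 33170,
    "Biology": 13122,
}

total = sum(class_counts.values())

# Share of each class in the filtered dataset.
class_shares = {name: count / total for name, count in class_counts.items()}

for name, share in class_shares.items():
    print(f"{name:<10} {share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"largest/smallest ratio: {imbalance_ratio:.2f}")
```

Biology makes up roughly 11% of the rows, about a third of the size of the largest class, so per-class metrics for Biology are worth watching.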

For faster calculations, let's use only 1k records from this dataset.

In [6]:
df = df.sample(frac=1)
df = df[:1000]
df["class"].value_counts()

class
Chemistry    308
Physics      297
Maths        272
Biology      123
Name: count, dtype: int64
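The shuffle above is not seeded, so the 1k subset (and the counts shown) will differ between runs. Below is a minimal sketch of a reproducible alternative on a toy frame. The seed value 42 and the toy data are assumptions for illustration, not what this notebook used.

```python
import pandas as pd

# Toy stand-in for the filtered dataset (the real one has ~122k rows).
toy_df = pd.DataFrame({
    "input": [f"question {i}" for i in range(100)],
    "class": ["Physics", "Chemistry", "Maths", "Biology"] * 25,
})

SEED = 42  # assumed value; any fixed integer works

# Draw a fixed-size random sample; the same seed returns the same rows.
sample_a = toy_df.sample(n=10, random_state=SEED)
sample_b = toy_df.sample(n=10, random_state=SEED)

print(sample_a.index.equals(sample_b.index))
print(sample_a["class"].value_counts())
```

With a fixed `random_state`, rerunning the notebook would give the same subset, which makes later comparisons easier to repeat.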

In [7]:
df.head()

Unnamed: 0,input,class,extra
39894,Two cards are drawn simultaneously from a well...,Maths,
76309,In the leaves of which of the following are bu...,Biology,
98914,A chord of a circle of radius \( 12 \mathrm{cm...,Maths,
24853,"An organic compound A containing\n\( C, H \) a...",Chemistry,
53433,What are emulsions? What are their\ndifferent ...,Chemistry,


## 2. Data Preprocessing

All preprocessing methods can be found in text_preprocessing.py.

I doubt that these preprocessing methods help classification here, because embedding models rely on the semantics of the full text, which preprocessing partly removes. Let's test it anyway.
These techniques can still be useful with ChatGPT, since they reduce the number of tokens and therefore the cost of calling the OpenAI API.

In [8]:
from text_preprocessing import TextProcessor

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
processor = TextProcessor()

TextProcessor offers several techniques. Let's apply lowercasing, remove_punctuation, remove_stop_words, and stemming.

Afterwards we compare a model trained on the raw text with one trained on the processed text.

In [10]:
processor = TextProcessor()

df["input_processed"] = (
    df["input"]
    .apply(processor.lowercasing)
    .apply(processor.remove_punctuation)
    .apply(processor.remove_stop_words)
    .apply(processor.stemming)
)

In [11]:
df.head()

Unnamed: 0,input,class,extra,input_processed
39894,Two cards are drawn simultaneously from a well...,Maths,,two card drawn simultan wellshuffl deck 52 car...
76309,In the leaves of which of the following are bu...,Biology,,leav follow bulliform cell found sunflow b whe...
98914,A chord of a circle of radius \( 12 \mathrm{cm...,Maths,,chord circl radiu 12 mathrmcm subtend angl 60c...
24853,"An organic compound A containing\n\( C, H \) a...",Chemistry,,organ compound contain c h pleasant odour boil...
53433,What are emulsions? What are their\ndifferent ...,Chemistry,,emuls differ type give one exampl type
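The contents of text_preprocessing.py are not shown here, so the following is only a sketch of what the four steps could look like using the standard library. The stop-word list is a tiny illustrative one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, so its output will differ from the `input_processed` column above.

```python
import string

# Tiny illustrative stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "from", "which", "what", "their"}

# Suffixes stripped by the naive stemmer below (checked in this order).
SUFFIXES = ("ing", "ed", "es", "s")


def lowercasing(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def naive_stem(word: str) -> str:
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemming(text: str) -> str:
    return " ".join(naive_stem(w) for w in text.split())


def preprocess(text: str) -> str:
    # Apply the steps in the same order as the notebook cell above.
    for step in (lowercasing, remove_punctuation, remove_stop_words, stemming):
        text = step(text)
    return text


print(preprocess("What are emulsions? What are their different types?"))
```

The example shows how much of the original wording is lost: question words and inflections disappear, which is the main reason to expect embeddings of processed text to carry less meaning.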


## 3. Apply Embeddings

encoding_data.py contains a class that encodes data in batches; let's use it.
Batched processing avoids memory issues when encoding.

For encoding we use the Sentence Transformers package with the model "baai/bge-large-en-v1.5". The reason for selecting this model is simple: in previous projects these embeddings showed great results, for some tasks even better than OpenAI embeddings.
I cannot guarantee that it is the ideal model, but it is a good starting point.

Let's encode both the input and input_processed fields to compare which approach works better.
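The batching idea can be sketched without the real model. In the snippet below, `iter_batches`, `encode_in_batches`, and `fake_encode` are hypothetical names, not the API of encoding_data.py; `fake_encode` stands in for a real model call. The batch size of 64 matches the value reported in the encoder log.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Encode texts batch by batch so only one batch is processed at a time."""
    embeddings: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(encode_fn(batch))
    return embeddings


def fake_encode(batch: Sequence[str]) -> List[List[float]]:
    # Placeholder for a real model call: returns a 1-dimensional "embedding".
    return [[float(len(text))] for text in batch]


texts = [f"question {i}" for i in range(1000)]
vectors = encode_in_batches(texts, fake_encode, batch_size=64)

num_batches = sum(1 for _ in iter_batches(texts, 64))
print(len(vectors), num_batches)
```

Peak memory then depends on the batch size rather than on the dataset size, at the cost of a few more calls to the encoder.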

In [12]:
from encoding_data import BatchEncoder



In [13]:
# Encoding initial input 
df = BatchEncoder(dataframe=df, 
                  column_to_encode='input', 
                  embedding_column_name = "embedding").process_data()

# Encoding processed input 
df = BatchEncoder(dataframe=df, 
                  column_to_encode='input_processed', 
                  embedding_column_name = "embedding_processed").process_data()

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: baai/bge-large-en-v1.5
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:root:Batch Encoder for column input initialized with batch size 64.
INFO:root:Creating dataset...
INFO:root:Dataset created.
INFO:root:Starting encoding...

In [14]:
df.head()

Unnamed: 0,input,class,extra,input_processed,embedding,embedding_processed
39894,Two cards are drawn simultaneously from a well...,Maths,,two card drawn simultan wellshuffl deck 52 car...,"[-0.001785461325198412, 0.04096945747733116, 0...","[-0.0017594101373106241, 0.03013821318745613, ..."
76309,In the leaves of which of the following are bu...,Biology,,leav follow bulliform cell found sunflow b whe...,"[0.032961826771497726, -0.017931874841451645, ...","[0.031177779659628868, -0.006774567533284426, ..."
98914,A chord of a circle of radius \( 12 \mathrm{cm...,Maths,,chord circl radiu 12 mathrmcm subtend angl 60c...,"[-0.0130480220541358, -0.027748994529247284, 0...","[0.013170714490115643, -0.006082989741116762, ..."
24853,"An organic compound A containing\n\( C, H \) a...",Chemistry,,organ compound contain c h pleasant odour boil...,"[-0.042638301849365234, 0.004175940528512001, ...","[-0.006746021565049887, -0.011572941206395626,..."
53433,What are emulsions? What are their\ndifferent ...,Chemistry,,emuls differ type give one exampl type,"[0.01102669257670641, 0.023864492774009705, 0....","[-0.03406751528382301, 0.00356256659142673, 0...."


## 4. Classification with XGBoost

XGBoost was selected for classification; it is a strong default for many classification tasks.
The idea behind it is as follows: XGBoost is an advanced tree-based algorithm that uses an ensemble of trees. The trees are trained sequentially, each new tree correcting the errors of the previous ones, which improves accuracy. It also includes regularization, which helps reduce overfitting.

We train and test XGBoost on embeddings of both the raw and the processed input and compare the results.

Note that we use random_state=1. Fixing the seed gives the same train/test split for the raw and the processed data, which removes one source of randomness from the comparison.
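Why a fixed seed matters for the comparison can be shown with a small standard-library sketch. `split_indices` is a hypothetical helper, not the method used in text_classifier.py, and the 25% test share is inferred from the support of 250 records in the reports below.

```python
import random
from typing import List, Tuple


def split_indices(n_rows: int, test_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row positions with a fixed seed and cut off a test part."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = int(n_rows * test_fraction)
    return positions[n_test:], positions[:n_test]


# Same seed for both experiments: identical train/test rows.
train_raw, test_raw = split_indices(1000, 0.25, seed=1)
train_proc, test_proc = split_indices(1000, 0.25, seed=1)

print(test_raw == test_proc)
print(len(train_raw), len(test_raw))
```

Because both classifiers see exactly the same test rows, any difference in the scores comes from the features and the model, not from a lucky or unlucky split.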

In [15]:
from text_classifier import XGBTextClassifier

In [16]:
# Train and evaluate classifier for raw data
clf = XGBTextClassifier(dataset=df, embedding_column='embedding', predict_column='class')
clf.prepare_data()
clf.train()
y_pred = clf.predict()
clf.evaluate(y_pred)

INFO:root:              precision    recall  f1-score   support

     Biology       1.00      0.78      0.88        32
   Chemistry       0.83      0.85      0.84        61
       Maths       0.88      0.96      0.92        83
     Physics       0.87      0.84      0.86        74

    accuracy                           0.88       250
   macro avg       0.89      0.86      0.87       250
weighted avg       0.88      0.88      0.88       250



In [17]:
# Train and evaluate classifier for processed data
clf = XGBTextClassifier(dataset=df, embedding_column='embedding_processed', predict_column='class')
clf.prepare_data()
clf.train()
y_pred = clf.predict()
clf.evaluate(y_pred)

INFO:root:              precision    recall  f1-score   support

     Biology       0.89      0.75      0.81        32
   Chemistry       0.76      0.84      0.80        61
       Maths       0.87      0.94      0.90        83
     Physics       0.85      0.76      0.80        74

    accuracy                           0.84       250
   macro avg       0.84      0.82      0.83       250
weighted avg       0.84      0.84      0.83       250



The results for raw data are slightly better, which is logical because the raw text keeps its full semantics.

This might change once we tune XGBoost and train the model on more data.
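A rough way to judge how much weight the 0.88 vs 0.84 gap can carry is the binomial standard error of an accuracy measured on 250 test records. This is a back-of-the-envelope sketch: it takes the rounded accuracies from the two reports and treats the two measurements as independent, which is a simplification because both models were scored on the same test rows.

```python
import math

n_test = 250          # support of the test split in both reports
acc_raw = 0.88        # accuracy with raw-text embeddings
acc_processed = 0.84  # accuracy with preprocessed-text embeddings


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


se_raw = standard_error(acc_raw, n_test)
se_processed = standard_error(acc_processed, n_test)

# Difference in correctly classified test records.
extra_correct = round((acc_raw - acc_processed) * n_test)

print(f"raw:       {acc_raw:.2f} +/- {se_raw:.3f}")
print(f"processed: {acc_processed:.2f} +/- {se_processed:.3f}")
print(f"difference: {extra_correct} of {n_test} test records")
```

The gap corresponds to about 10 test records, and each accuracy carries an uncertainty of roughly 2 percentage points, so the result is suggestive rather than conclusive at this sample size.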

## 5. Classification with GPT models

GPT classification was done using prompt engineering, with LangChain as a wrapper for the OpenAI package.
Note that the prompt is very simple and contains no examples (that is, it is zero-shot rather than few-shot).
The reasoning is that this task is simple for the GPT-4 model. Adding examples to the prompt could improve the results, but it would also increase the number of tokens sent to OpenAI and therefore the cost of the solution.
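The actual prompt lives in chat_gpt_classifier.py and is not shown in this notebook. The sketch below only illustrates what a zero-shot prompt and a defensive parser for the reply could look like; the prompt wording and the function names `build_prompt` and `parse_label` are assumptions, and no API call is made.

```python
from typing import Optional, Sequence

CLASSES = ("Physics", "Chemistry", "Maths", "Biology")


def build_prompt(question: str, classes: Sequence[str] = CLASSES) -> str:
    """Build a zero-shot prompt: instructions plus the text, no examples."""
    options = ", ".join(classes)
    return (
        "Classify the following school question into exactly one subject.\n"
        f"Allowed subjects: {options}.\n"
        "Answer with the subject name only.\n\n"
        f"Question: {question}"
    )


def parse_label(response: str, classes: Sequence[str] = CLASSES) -> Optional[str]:
    """Map a raw model reply to a known class, or None if it does not match."""
    cleaned = response.strip().strip(".'\"").lower()
    for name in classes:
        if cleaned == name.lower():
            return name
    return None


print(build_prompt("The law of universal gravitation"))
print(parse_label(" physics. "))
print(parse_label("Geography"))
```

Returning `None` for unexpected replies keeps labels outside the four classes from silently entering the evaluation.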

In [None]:
from chat_gpt_classifier import GPTTextClassifier, GPTTextEvaluator

In [None]:
# Set OPENAI_API_KEY to get access to the GPT model
os.environ["OPENAI_API_KEY"] = "sk-..."

In [None]:
GPTTextClassifier().classify("1+4 = 5")

In [21]:
GPTTextClassifier().classify("The law of universal gravitation")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Physics'

In [22]:
GPTTextClassifier().classify("Potassium carbonate")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Chemistry'

In [23]:
GPTTextClassifier().classify("Head Disease")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'Biology'

Let's evaluate the GPT classifier to compare it with the previous results.
To reuse the same data preparation and evaluation steps, the evaluator class inherits from XGBTextClassifier.

In [24]:
clf = GPTTextEvaluator(dataset=df, embedding_column='embedding_processed', predict_column='class', input_column='input_processed')
clf.prepare_data()
y_pred = clf.predict()
clf.evaluate(y_pred)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

We see that the ChatGPT-based approach outperforms the XGBoost approach on our small data sample.

## 6. Summary

The current findings suggest that the GPT model approach yields superior results. However, it's important to note that the XGBoost model was trained without any parameter tuning or optimization, and we only used 1k records from the training dataset. Furthermore, our understanding of the use case is not comprehensive. Therefore, it's challenging to definitively determine which approach is more suitable for the task at hand. 

Here are my thoughts on the potential benefits of each approach:

**Embeddings-Based Classification**

This method could be beneficial when: 
1. There is a sufficient amount of training data.
2. The classes do not frequently change (for instance, if we only have four classes and the likelihood of adding a new class is low).
3. Budget constraints exist.
4. Data cannot be sent to third-party APIs due to security concerns, such as when dealing with sensitive customer information.
5. Response time is a critical factor. GPT adds latency because it is a third-party API.
6. The server is under heavy load. High load could lead to exceeding OpenAI token rate limits, slowing down the solution. In such cases, custom models could be a better option.
7. The model needs domain-specific knowledge that general models like GPT lack.

**ChatGPT-Based Classification**

This method could be beneficial when:
1. There is a need to quickly propose and deploy a solution. This approach primarily involves prompt engineering and creating a simple API that uses OpenAI models.
2. There are no potential security issues with sending data to a third-party API, i.e., there's no sensitive customer data involved.
3. The server load is manageable and response time is not a priority.
4. There's a generous budget, as each call to OpenAI incurs a cost.
5. Classes change frequently. If new classes need to be added regularly, prompt engineering could offer a simple solution, as adding a new class would only require changing the prompt.

## 7. Bonus part

If we do not have a class for each record, using an LLM such as ChatGPT (as done above) might be a good choice.
Let's apply this method to 10 records from the dataset.

In [26]:
df_bonus = df[:10].copy()  # copy to avoid SettingWithCopyWarning on assignment

def llm_classification(text):
    return GPTTextClassifier().classify(text)

df_bonus["class_llm"] = df_bonus["input_processed"].apply(llm_classification)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

In [30]:
df_bonus[['input', 'class', 'class_llm']].head(10)

Unnamed: 0,input,class,class_llm
39894,Two cards are drawn simultaneously from a well...,Maths,Maths
76309,In the leaves of which of the following are bu...,Biology,Biology
98914,A chord of a circle of radius \( 12 \mathrm{cm...,Maths,Maths
24853,"An organic compound A containing\n\( C, H \) a...",Chemistry,Chemistry
53433,What are emulsions? What are their\ndifferent ...,Chemistry,Chemistry
113274,Which of the following is an example of oxidat...,Chemistry,Chemistry
34655,Which of the following is not a characteristic...,Physics,Physics
27692,Find the complement of the angle :\n\( \frac{1...,Maths,Maths
51766,Give the main drawback of Rutherford's\nmodel.,Chemistry,Physics
58331,Heights of students of class \( X \) are given...,Maths,Maths
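The agreement between the LLM labels and the ground truth in the table above can be checked in a few lines. The pairs below are copied from that table in row order.

```python
# (ground truth, LLM prediction) pairs copied from the table above.
pairs = [
    ("Maths", "Maths"),
    ("Biology", "Biology"),
    ("Maths", "Maths"),
    ("Chemistry", "Chemistry"),
    ("Chemistry", "Chemistry"),
    ("Chemistry", "Chemistry"),
    ("Physics", "Physics"),
    ("Maths", "Maths"),
    ("Chemistry", "Physics"),
    ("Maths", "Maths"),
]

matches = sum(1 for truth, predicted in pairs if truth == predicted)
agreement = matches / len(pairs)

# Row positions where the LLM disagrees with the ground truth.
mismatches = [(i, t, p) for i, (t, p) in enumerate(pairs) if t != p]

print(f"agreement: {matches}/{len(pairs)} = {agreement:.0%}")
print(mismatches)
```

The single disagreement is the question about Rutherford's model, a topic that arguably sits between Chemistry and Physics, which is exactly the kind of edge case discussed next.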


Let's imagine that we do not have a ground truth class for each record and we need to somehow generate and verify it without human validation.

How to generate classes: 
1. I would use OpenAI function calling instead of plain prompt engineering, which should make the returned result more reliable.
2. We might need to add a description of each class to the prompt, but that increases the number of tokens and the price per request to the LLM.
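As a sketch of the function-calling idea, the tool definition below restricts the answer to the four classes through a JSON-schema `enum`. The function name `assign_subject`, the descriptions, and the helper `parse_tool_arguments` are hypothetical; the dictionary layout follows the tools format of the chat completions API as I understand it and should be checked against the current OpenAI documentation. No API call is made here.

```python
import json

CLASSES = ["Physics", "Chemistry", "Maths", "Biology"]

# Tool definition in the JSON-schema style used for function calling.
# The function name and descriptions are hypothetical.
classify_tool = {
    "type": "function",
    "function": {
        "name": "assign_subject",
        "description": "Assign a school question to one subject.",
        "parameters": {
            "type": "object",
            "properties": {
                "subject": {
                    "type": "string",
                    "enum": CLASSES,
                    "description": "The subject the question belongs to.",
                }
            },
            "required": ["subject"],
        },
    },
}


def parse_tool_arguments(arguments_json: str) -> str:
    """Validate the arguments string returned for the tool call."""
    arguments = json.loads(arguments_json)
    subject = arguments.get("subject")
    if subject not in CLASSES:
        raise ValueError(f"Unexpected subject: {subject!r}")
    return subject


# Simulated model output; a real call would return a string like this.
print(parse_tool_arguments('{"subject": "Chemistry"}'))
```

Validating the returned arguments on the client side is still useful, since a schema constrains the model but does not by itself guarantee a valid value.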

If we do not have human validation, there might be a couple of approaches that probably would help: 
1. Ensemble of models. Let's use not only OpenAI models but also, for example, Google Gemini, and then take a vote across the models. We could even try open-source models such as Mixtral.
2. Add a separate model that evaluates the results of the previous ones. We can design a prompt that responds with a score.
3. Calculate the cosine similarity between the predicted class and the input. I am not sure this approach will work, as our class is a single word. A more advanced variant is to describe each class in a couple of sentences, calculate the cosine similarity between the input and each class description, and return the class with the highest score.
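The description-based variant of the cosine similarity idea can be sketched as follows. The class descriptions are made up for illustration, and `embed` uses plain word counts as a stand-in for a real embedding model such as the one used earlier in this notebook, so the scores only demonstrate the mechanics.

```python
import math
from collections import Counter
from typing import Dict

# Hypothetical keyword-style class descriptions.
CLASS_DESCRIPTIONS = {
    "Physics": "motion force energy gravitation electricity waves",
    "Chemistry": "acid compound reaction organic molecule oxidation",
    "Maths": "triangle angle equation probability circle number",
    "Biology": "cell leaves plant disease organism forest",
}


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word counts.
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_classes(text: str) -> Dict[str, float]:
    """Cosine similarity between the input and every class description."""
    query = embed(text)
    return {
        name: cosine_similarity(query, embed(description))
        for name, description in CLASS_DESCRIPTIONS.items()
    }


scores = score_classes("an organic compound with acid reaction")
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

With real embeddings the scores would rarely be exactly zero, so the margin between the best and the second-best class could also serve as a confidence signal.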

In any case, it would be good to have human validation for edge cases: at least when two models predict different classes or when the model's score for the selected class is low.
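Voting combined with a review flag could look like the sketch below. The model names, the `majority_vote` helper, and the `min_agreement` threshold are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def majority_vote(
    predictions: Dict[str, str], min_agreement: float = 1.0
) -> Tuple[Optional[str], bool]:
    """Return (label, needs_review) for one record.

    predictions maps a model name to its predicted class. The record is
    flagged for human review when the share of models voting for the
    winning class is below min_agreement.
    """
    if not predictions:
        return None, True
    counts = Counter(predictions.values())
    label, votes = counts.most_common(1)[0]
    needs_review = votes / len(predictions) < min_agreement
    return label, needs_review


# Hypothetical model names and outputs for two records.
record_1 = {"model_a": "Maths", "model_b": "Maths", "model_c": "Maths"}
record_2 = {"model_a": "Chemistry", "model_b": "Physics", "model_c": "Chemistry"}

print(majority_vote(record_1))
print(majority_vote(record_2))
print(majority_vote(record_2, min_agreement=0.6))
```

With the default threshold, any disagreement sends the record to a human; lowering `min_agreement` trades review effort against the risk of accepting a wrong label.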