<a href="https://colab.research.google.com/github/parvardi/MathAI/blob/problem-classifier/Problem_Type_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Math Competition Problem Classifier
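A k-nearest-neighbor classifier that predicts the type (subject area) of a math competition problem, built with the following:
*   Hugging Face math competition dataset ``hendrycks/competition_math``
*   Facebook AI Similarity Search (FAISS) library
*   A SentenceTransformer embedding model (`all-MiniLM-L6-v2`)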

## Installation and Library Imports

In [16]:
!pip install datasets sentence-transformers faiss-cpu

import pandas as pd
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from sklearn.metrics import classification_report, accuracy_score



## Load the Dataset and the Model


In [17]:
dataset = load_dataset("hendrycks/competition_math")

train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

model = SentenceTransformer('all-MiniLM-L6-v2')

## Encoding the Problems and Building the FAISS Index



In [18]:

# Encode the training and test data
print("Encoding training data...")
train_embeddings = model.encode(train_df['problem'].tolist(), convert_to_tensor=False, show_progress_bar=True)
print("Encoding test data...")
test_embeddings = model.encode(test_df['problem'].tolist(), convert_to_tensor=False, show_progress_bar=True)

train_embeddings = np.array(train_embeddings).astype('float32')
test_embeddings = np.array(test_embeddings).astype('float32')

# L2-normalize the embeddings in place so that inner product equals cosine similarity
faiss.normalize_L2(train_embeddings)
faiss.normalize_L2(test_embeddings)

dimension = train_embeddings.shape[1]

# Build an exact (brute-force) inner-product index over the training embeddings
index = faiss.IndexFlatIP(dimension)
index.add(train_embeddings)
print(f"Number of vectors in the index: {index.ntotal}")

# Set the number of nearest neighbors
k = 5

Encoding training data...


Encoding test data...


Number of vectors in the index: 7500
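The normalization and `IndexFlatIP` steps in the cell above amount to an exact cosine-similarity search: once every row has unit length, the inner product of two rows is their cosine similarity. The following is a minimal NumPy sketch of that idea on toy data, not a replacement for FAISS. The helper names `normalize_rows` and `top_k_inner_product` are hypothetical and exist only for this illustration.

```python
import numpy as np

def normalize_rows(x):
    """Return a copy of x with every row scaled to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return x / np.clip(norms, 1e-12, None)

def top_k_inner_product(queries, database, k):
    """Brute-force search: the k highest inner products for each query row."""
    scores = queries @ database.T                   # shape (n_queries, n_database)
    order = np.argsort(-scores, axis=1)[:, :k]      # indices sorted by descending score
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order

# Toy data: three "training" vectors and one "test" vector.
database = normalize_rows(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]))
queries = normalize_rows(np.array([[1.0, 0.1]]))

scores, neighbor_ids = top_k_inner_product(queries, database, k=2)
print(neighbor_ids)  # row indices of the two most similar database vectors
print(scores)        # their cosine similarities, highest first
```

A brute-force scan like this costs one dot product per training vector for every query, which is manageable at the 7500 training problems indexed here.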


## Similarity Search and Accuracy Score Calculation
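The cell in this section labels each test problem with the most common type among its `k` nearest training problems. As a standalone illustration of that vote, here is a minimal sketch using only the standard library. The function name `majority_label` is hypothetical, and the alphabetical tie-break is an assumption chosen to mirror `Series.mode()[0]`, which takes the first of the modes that pandas returns in sorted order.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label; ties go to the alphabetically first label."""
    counts = Counter(labels)
    highest = max(counts.values())
    # Several labels may share the highest count; pick the smallest one.
    return min(label for label, count in counts.items() if count == highest)

# Five neighbors with a 2-2 tie between "Algebra" and "Geometry".
neighbors = ["Algebra", "Geometry", "Algebra", "Number Theory", "Geometry"]
print(majority_label(neighbors))
```

With an odd `k` and more than two classes, ties like the one above are still possible, so the tie-break rule does affect some predictions.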

In [19]:
print("Starting similarity search...")
distances, indices = index.search(test_embeddings, k)

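# Assign each test problem the most common label among its k nearest neighbors
def assign_label(neighbor_indices):
    # Look up the types of the neighboring training problems by row position
    neighbor_labels = train_df.iloc[neighbor_indices]['type']
    # mode() returns the most frequent labels in sorted order; take the first on ties
    return neighbor_labels.mode()[0]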

print("Assigning labels...")
test_df['predicted_type'] = [assign_label(idx) for idx in indices]

# Preview the first few predictions next to the true types
print(test_df[['problem', 'type', 'predicted_type']].head())

# Calculating accuracy
accuracy = accuracy_score(test_df['type'], test_df['predicted_type'])
print(f"Accuracy: {accuracy:.2f}")

print(classification_report(test_df['type'], test_df['predicted_type']))


Starting similarity search...
Assigning labels...
                                             problem     type  \
0  How many vertical asymptotes does the graph of...  Algebra   
1  What is the positive difference between $120\%...  Algebra   
2  Find $x$ such that $\lceil x \rceil + x = \dfr...  Algebra   
3                     Evaluate $i^5+i^{-25}+i^{45}$.  Algebra   
4            If $2^8=4^x$, what is the value of $x$?  Algebra   

         predicted_type  
0  Intermediate Algebra  
1               Algebra  
2               Algebra  
3               Algebra  
4               Algebra  
Accuracy: 0.73
                        precision    recall  f1-score   support

               Algebra       0.68      0.80      0.74      1187
Counting & Probability       0.72      0.78      0.75       474
              Geometry       0.67      0.77      0.72       479
  Intermediate Algebra       0.85      0.75      0.80       903
         Number Theory       0.74      0.75      0.75       540
   