# Glia - Text Classification

In [2]:
import pandas as pd
import numpy as np

## Read data

The dataset consists of 13.8k rows of chatbot messages with the associated label. The dataset is open-source and available on HuggingFace: https://huggingface.co/datasets/Bhuvaneshwari/intent_classification

In [3]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,text,label
0,listen to westbam alumb allergic on google music,PlayMusic
1,add step to me to the 50 clásicos playlist,AddToPlaylist
2,i give this current textbook a rating value of...,RateBook
3,play the song little robin redbreast,PlayMusic
4,please add iris dement to my playlist this is ...,AddToPlaylist


## Data preparation

I take a subset of the data to keep computation fast. I tried fine-tuning a pretrained Transformer model, but I ran out of memory. Instead, I sample up to 100 messages per label (12 labels, so at most 1200 messages) and use scikit-learn to classify each message into the correct intent.

In [4]:
# Collect the per-label samples in a list and concatenate once at the end
samples = []

for label, group in df.groupby('label'):
    if len(group) >= 100:
        # Draw 100 random messages for this label
        selected_samples = group.sample(n=100, random_state=42)
    else:
        # Fewer than 100 messages available: keep them all
        selected_samples = group

    samples.append(selected_samples)

mini_df = pd.concat(samples)

One label has fewer than 100 messages, so we end up with 1199 messages in total.

In [5]:
mini_df.shape

(1199, 2)

Map each intent to an integer id so we can train a classifier.

In [6]:
intents = list(set(mini_df['label']))

id2label = {idx:label for idx, label in enumerate(intents)}
label2id = {label:idx for idx, label in enumerate(intents)}

mini_df['label'] = mini_df['label'].map(label2id)

mini_df.head()

Unnamed: 0,text,label
12633,put corrina corrina onto my classical x list,9
6954,add kd lang to my deep focus playlist,9
825,add elkie brooks to happy birthday playlist,9
2732,add judge jules to instrumental study,9
4052,put beside you in my spotify orchestra cello p...,9
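
One caveat with the mapping above: iterating over a Python `set` of strings does not guarantee the same order in every session, so the id assigned to each intent can change between runs. A minimal sketch of a reproducible alternative, assuming it is acceptable to assign ids in alphabetical order (the label list is a small illustrative subset, and the resulting ids would differ from the ones shown in this notebook):

```python
# Illustrative subset of intent names (assumption: any iterable of label strings works)
labels = ['PlayMusic', 'AddToPlaylist', 'RateBook', 'PlayMusic', 'GetWeather']

# sorted() gives a deterministic order, so the ids are stable across runs
intents = sorted(set(labels))

id2label = {idx: label for idx, label in enumerate(intents)}
label2id = {label: idx for idx, label in enumerate(intents)}

print(id2label)
```

Both dictionaries are built from the same ordered list, so `label2id[id2label[i]] == i` holds for every id.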


## Split the data
I will do a stratified 70/30 train/test split, so that each intent keeps the same proportion in both sets.

In [7]:
from sklearn.model_selection import train_test_split

X = mini_df['text']
y = mini_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
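
To see what `stratify=y` does, here is a small sketch on synthetic data (12 made-up classes with 100 items each, standing in for the real messages). It counts how many items of each class land in the test set.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 12 classes, 100 items each
y_demo = [label for label in range(12) for _ in range(100)]
X_demo = [f"message {i}" for i in range(len(y_demo))]

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Number of test items per class
test_counts = Counter(yd_test)

print(len(Xd_train), len(Xd_test))
print(sorted(test_counts.items()))
```

With balanced classes, stratification gives every class the same share of the test set, which is why each intent has a support of 30 in the reports below.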

## Modeling
### Extract features from text

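Using scikit-learn, I first convert the text to token counts with `CountVectorizer`, then normalize the counts with `TfidfTransformer`. That way, we have numerical data to train a classifier. Note that the transformer below is created with `use_idf=False`, so it produces normalized term frequencies only, without the inverse-document-frequency weighting.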

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

X_train_count = vect.fit_transform(X_train)

X_train_count.shape

(839, 1259)

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

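# use_idf=False: normalized term frequencies only, no IDF weighting
transformer = TfidfTransformer(use_idf=False)

# fit_transform fits the transformer and transforms the training counts in one step
X_train_tfidf = transformer.fit_transform(X_train_count)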

X_train_tfidf.shape

(839, 1259)
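
The effect of the `use_idf` flag can be seen on a tiny example. The sketch below uses a made-up three-sentence corpus (not the real dataset) and compares the two settings of `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Tiny made-up corpus, for illustration only
docs = [
    "play the song",
    "add the song to my playlist",
    "rate the book",
]

counts = CountVectorizer().fit_transform(docs)

# use_idf=False: each row is the L2-normalized vector of raw term counts
tf = TfidfTransformer(use_idf=False).fit_transform(counts)

# use_idf=True (the default): counts are reweighted by inverse document
# frequency before normalization, so rarer words gain weight relative to
# words that appear in every document (here, "the")
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)

print(np.round(tf.toarray(), 2))
print(np.round(tfidf.toarray(), 2))
```

Both matrices have the same shape and unit-length rows; only the relative weight of each word within a row changes.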

### Baseline classifier

I will use a dummy classifier as a baseline model to evaluate more advanced classifiers.

In [10]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

dummy_clf = DummyClassifier(strategy='uniform', random_state=42)

dummy_clf.fit(X_train_tfidf, y_train)

# Evaluate the model
X_test_count = vect.transform(X_test)
X_test_tfidf = transformer.transform(X_test_count)    # Apply the TF-IDF transformation on X_test

label_idx = list(id2label.keys())   # Get the id of each label
label_names = list(label2id.keys())   # Get the name of each label

dummy_preds = dummy_clf.predict(X_test_tfidf)

dummy_clf_report = classification_report(y_test, dummy_preds, labels=label_idx, target_names=label_names)

print(dummy_clf_report)

                      precision    recall  f1-score   support

           excitment       0.14      0.17      0.15        30
          GetWeather       0.08      0.07      0.07        30
        Cancellation       0.03      0.03      0.03        30
         Affirmation       0.13      0.13      0.13        30
      BookRestaurant       0.16      0.13      0.15        30
           PlayMusic       0.12      0.10      0.11        30
SearchScreeningEvent       0.15      0.17      0.16        30
  SearchCreativeWork       0.06      0.07      0.06        30
        Book Meeting       0.07      0.07      0.07        30
       AddToPlaylist       0.05      0.03      0.04        30
            RateBook       0.11      0.10      0.11        30
           Greetings       0.10      0.13      0.11        30

            accuracy                           0.10       360
           macro avg       0.10      0.10      0.10       360
        weighted avg       0.10      0.10      0.10       360



As expected, the dummy classifier performs at chance level: its accuracy is 10%, close to the 1/12 ≈ 8% expected from uniform random guessing over 12 balanced classes. Now, let's test a naive Bayes classifier and an SVM (support vector machine) classifier.
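
As a sanity check on the baseline, the expected accuracy of uniform random guessing can be computed directly and confirmed with a quick simulation. This is a sketch on synthetic labels (12,000 made-up items, 12 balanced classes), not on the real test set.

```python
import random

n_classes = 12

# Uniform guessing is right with probability 1 / n_classes on balanced data
expected_accuracy = 1 / n_classes

# Quick simulation: guess uniformly at random on a balanced label set
rng = random.Random(42)
y_true_demo = [i % n_classes for i in range(12_000)]
y_guess_demo = [rng.randrange(n_classes) for _ in y_true_demo]

simulated_accuracy = sum(
    t == g for t, g in zip(y_true_demo, y_guess_demo)
) / len(y_true_demo)

print(round(expected_accuracy, 3), round(simulated_accuracy, 3))
```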

**Note that I report the accuracy because each intent is present in equal proportions. Otherwise, I would look at the weighted F1-Score**.
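
To illustrate why accuracy alone can mislead when classes are not balanced, here is a sketch on made-up labels (90 examples of one class, 10 of another) with a degenerate model that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: 90 examples of class 0, 10 of class 1
y_true_imb = [0] * 90 + [1] * 10

# A degenerate model that always predicts the majority class
y_pred_imb = [0] * 100

acc = accuracy_score(y_true_imb, y_pred_imb)

# zero_division=0 silences the warning for the class that is never predicted
weighted_f1 = f1_score(y_true_imb, y_pred_imb, average='weighted', zero_division=0)
macro_f1 = f1_score(y_true_imb, y_pred_imb, average='macro', zero_division=0)

print(f"accuracy={acc:.3f} weighted_f1={weighted_f1:.3f} macro_f1={macro_f1:.3f}")
```

The accuracy looks high even though the minority class is never predicted; the weighted F1-score is lower, and the macro F1-score, which weighs both classes equally, drops further still.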

### Naive Bayes classifier

In [12]:
from sklearn.naive_bayes import MultinomialNB

naive_clf = MultinomialNB()

naive_clf.fit(X_train_tfidf, y_train)

naive_preds = naive_clf.predict(X_test_tfidf)

naive_clf_report = classification_report(y_test, naive_preds, labels=label_idx, target_names=label_names)

print(naive_clf_report)

                      precision    recall  f1-score   support

           excitment       1.00      1.00      1.00        30
          GetWeather       0.97      1.00      0.98        30
        Cancellation       1.00      1.00      1.00        30
         Affirmation       1.00      1.00      1.00        30
      BookRestaurant       1.00      0.97      0.98        30
           PlayMusic       0.94      0.97      0.95        30
SearchScreeningEvent       0.93      0.87      0.90        30
  SearchCreativeWork       0.83      0.83      0.83        30
        Book Meeting       1.00      1.00      1.00        30
       AddToPlaylist       0.97      1.00      0.98        30
            RateBook       1.00      1.00      1.00        30
           Greetings       1.00      1.00      1.00        30

            accuracy                           0.97       360
           macro avg       0.97      0.97      0.97       360
        weighted avg       0.97      0.97      0.97       360



The performance is already much better: the naive Bayes classifier reaches an accuracy of 97%.

### SVM classifier

In [13]:
from sklearn.svm import SVC

sv_clf = SVC()

sv_clf.fit(X_train_tfidf, y_train)

sv_preds = sv_clf.predict(X_test_tfidf)

sv_clf_report = classification_report(y_test, sv_preds, labels=label_idx, target_names=label_names)

print(sv_clf_report)

                      precision    recall  f1-score   support

           excitment       1.00      1.00      1.00        30
          GetWeather       0.97      1.00      0.98        30
        Cancellation       1.00      1.00      1.00        30
         Affirmation       1.00      1.00      1.00        30
      BookRestaurant       1.00      0.97      0.98        30
           PlayMusic       0.90      0.93      0.92        30
SearchScreeningEvent       0.96      0.87      0.91        30
  SearchCreativeWork       0.74      0.83      0.78        30
        Book Meeting       1.00      0.97      0.98        30
       AddToPlaylist       1.00      1.00      1.00        30
            RateBook       0.97      0.93      0.95        30
           Greetings       1.00      1.00      1.00        30

            accuracy                           0.96       360
           macro avg       0.96      0.96      0.96       360
        weighted avg       0.96      0.96      0.96       360



The SVM classifier also performs well, but its accuracy is slightly lower than that of the naive Bayes model (96% vs 97%). Therefore, I select the naive Bayes model for this scenario, keeping in mind that a one-point gap on 360 test messages is small.
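
Because the gap between the two models is small, a single train/test split may not be enough to separate them reliably. One way to get a more stable comparison is k-fold cross-validation. The sketch below uses `cross_val_score` on a tiny made-up corpus; the texts, the two intents, and the 5-fold setting are all assumptions for illustration. On the real data, `texts` and `labels` would be replaced by `X` and `y`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy corpus: 10 messages for each of two intents
names = ['yesterday', 'imagine', 'hello', 'believe', 'thriller',
         'respect', 'vogue', 'creep', 'wonderwall', 'heroes']
texts = ([f"play the song {n}" for n in names]
         + [f"add {n} to my playlist" for n in names])
labels = [0] * len(names) + [1] * len(names)


def build(clf):
    # Same feature extraction as in the notebook, followed by the given classifier
    return make_pipeline(CountVectorizer(), TfidfTransformer(use_idf=False), clf)


results = {}
for model_name, clf in [('naive Bayes', MultinomialNB()), ('SVM', SVC())]:
    # One accuracy value per fold
    scores = cross_val_score(build(clf), texts, labels, cv=5, scoring='accuracy')
    results[model_name] = scores
    print(f"{model_name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Wrapping the vectorizer inside the pipeline matters here: the vocabulary is refit on each training fold, so no information from the held-out fold leaks into the features.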

## Train a pipeline and save the model

Now, I create a pipeline so that the model can classify new raw text directly and be deployed through a REST API.

In [15]:
from sklearn.pipeline import Pipeline

intent_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),  # note: default use_idf=True, unlike use_idf=False in the evaluation above
    ('clf', MultinomialNB())
])

intent_clf.fit(X_train, y_train)

### Save the model

In [16]:
from joblib import dump

dump(intent_clf, 'intent_clf.joblib')

['intent_clf.joblib']
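
For completeness, here is a sketch of how a serving process might load the saved pipeline and classify a new message. The training texts and the two-entry `id2label` mapping are toy stand-ins, the temporary directory replaces the real file location, and `classify` is a hypothetical helper, not something defined in this notebook.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the notebook's training data and id2label mapping
train_texts = ['play the song imagine', 'play the song hello',
               'add imagine to my playlist', 'add hello to my playlist']
train_ids = [0, 0, 1, 1]
toy_id2label = {0: 'PlayMusic', 1: 'AddToPlaylist'}

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
]).fit(train_texts, train_ids)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'intent_clf.joblib')
    dump(pipeline, path)
    loaded_clf = load(path)  # what a serving process would do at startup


def classify(text):
    """Hypothetical helper: return the intent name predicted for one message."""
    predicted_id = int(loaded_clf.predict([text])[0])
    return toy_id2label[predicted_id]


print(classify('play the song yesterday'))
```

Because the pipeline bundles the vectorizer with the classifier, the caller passes raw strings and never touches the feature extraction steps.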

In [17]:
id2label

{0: 'excitment',
 1: 'GetWeather',
 2: 'Cancellation',
 3: 'Affirmation',
 4: 'BookRestaurant',
 5: 'PlayMusic',
 6: 'SearchScreeningEvent',
 7: 'SearchCreativeWork',
 8: 'Book Meeting',
 9: 'AddToPlaylist',
 10: 'RateBook',
 11: 'Greetings'}
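
A service that loads `intent_clf.joblib` also needs this mapping to turn predicted ids back into intent names. A minimal sketch of persisting it as JSON, assuming a file name of `id2label.json` (hypothetical; the notebook itself does not save the mapping). JSON object keys are always strings, so the ids are converted back to `int` on load.

```python
import json
import os
import tempfile

# First three entries of the mapping shown above, for brevity
id2label_demo = {0: 'excitment', 1: 'GetWeather', 2: 'Cancellation'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'id2label.json')  # hypothetical file name

    with open(path, 'w', encoding='utf-8') as f:
        json.dump(id2label_demo, f)

    with open(path, encoding='utf-8') as f:
        raw = json.load(f)  # keys come back as strings, e.g. '0'

# Restore the integer keys that match the classifier's predictions
restored = {int(idx): label for idx, label in raw.items()}

print(restored)
```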