# Text Classification Project Overview

This project focuses on classifying text data into different categories using a range of machine learning and deep learning techniques. The project encompasses the following key components:

## Data Preprocessing

- The project begins by loading a JSONL dataset of text utterances.
- The text is then tokenized, lemmatized, and stripped of stopwords so that it is suitable for downstream natural language processing (NLP) and machine learning tasks.

## Word Embeddings

- Word embeddings are employed to represent text data in the form of continuous vectors.
- Both Word2Vec and FastText embeddings are applied to capture the semantic nuances within the text.

## Model Development and Training

- Various machine learning models are utilized for text classification, encompassing Multinomial Naive Bayes, Support Vector Classifier (SVC), Logistic Regression, Decision Tree Classifier, and Random Forest Classifier.
- Additionally, a deep learning model in the form of an LSTM (Long Short-Term Memory) neural network is implemented.

## Model Evaluation

- Each model is trained and assessed on the preprocessed text data.
- Performance metrics, including accuracy, precision, recall, and F1-score, are computed to gauge the effectiveness of the models.
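
As a worked example of how these metrics relate, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes F1 from the precision and recall values listed for two classes in the Multinomial Naive Bayes classification report later in this document. The helper name `f1_from_precision_recall` is illustrative and not part of the notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the Naive Bayes report in this document
calendar_f1 = f1_from_precision_recall(0.52, 1.00)
general_f1 = f1_from_precision_recall(0.75, 0.23)

print(f"calendar F1: {calendar_f1:.2f}")
print(f"general F1: {general_f1:.2f}")
```

After rounding, both values agree with the f1-score column of that report (0.68 for `calendar`, 0.35 for `general`). The `calendar` row shows why accuracy alone can mislead: perfect recall combined with low precision still gives a modest F1.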

## Ensemble Prediction

- The project incorporates an ensemble approach, combining predictions from multiple models to make a consensus prediction for a given input text.
- This ensemble method leverages the strengths of various models to enhance the overall classification accuracy.
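
One common way to form a consensus is majority voting. The sketch below assumes that strategy; the combination rule actually used by the project may differ, and the function name, model keys, and predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models.

    Ties are broken in favour of the label that appears first in `predictions`.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best_count = max(counts.values())
    for label in predictions:
        if counts[label] == best_count:
            return label


# Hypothetical per-model predictions for a single utterance
model_predictions = {"mnb": "calendar", "svc": "alarm", "lr": "alarm"}
consensus = majority_vote(list(model_predictions.values()))
print(consensus)
```

Voting on hard labels is the simplest option. Averaging predicted class probabilities is an alternative when every model exposes them.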

## Model Comparison

- A bar chart is generated to provide a visual comparison of the accuracies achieved by different models.
- This comparative analysis facilitates the evaluation and selection of the most suitable model for the text classification task.

This project explores text classification techniques ranging from traditional machine learning algorithms to deep learning models, and shows how ensemble methods can improve classification accuracy. It covers the complete pipeline: data preprocessing, model development, evaluation, and comparative analysis.


## Libraries

This code cell imports various libraries for different tasks in the notebook. Here's a quick overview of the libraries:

- **Standard Libraries**: Used for basic operations.
- **Data Manipulation**: Essential for data handling.
- **Natural Language Processing (NLP)**: Includes NLP and text processing tools.
- **Data Visualization**: For creating visualizations.
- **Machine Learning**: Required for machine learning tasks.
- **Deep Learning**: Frameworks for deep learning.
- **Text Processing**: For text feature extraction.
- **Stopwords**: Downloads stopwords for text preprocessing.
- **Initialize WordNet Lemmatizer**: Initializes a WordNet lemmatizer.

These libraries will be used throughout the notebook for various data analysis and modeling tasks.

In [1]:
# Standard Libraries
import json
import jsonlines
import re
import warnings
warnings.filterwarnings("ignore")

# Data Manipulation
import numpy as np
import pandas as pd

# Natural Language Processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from gensim.models.fasttext import FastText

# Data Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from tabulate import tabulate
import gradio as gr

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    confusion_matrix,
)

# Deep Learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.utils import to_categorical

# Text Processing
from sklearn.feature_extraction.text import CountVectorizer  # TfidfVectorizer is already imported above
from collections import Counter

# Stopwords
nltk.download("stopwords")

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hossein/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Loading and Preparing Data

In this code cell, data is loaded and prepared for further analysis. Here's a breakdown of what's happening:

- **Initialize empty lists**: Three empty lists (`utterances`, `scenarios`, and `classes`) are created to store data from the dataset.

- **Load the JSONL dataset**: The code opens a JSONL file named 'fa-IR.jsonl' and iterates through its lines. For each line, it loads a JSON record and checks if it contains the keys 'utt' (for utterance), 'scenario' (for scenario label), and 'partition' (for partition label). If these keys are present, the corresponding data is appended to the respective lists.

- **Create a DataFrame**: Finally, a pandas DataFrame (`df`) is created from the collected data, with columns 'utt' for utterances, 'label' for scenario labels, and 'partition' for partition labels.

This code sets the stage for data analysis and manipulation using the loaded dataset stored in the DataFrame.
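
The filtering logic described above can be sketched without touching the file system. This minimal example parses a few in-memory JSONL lines and keeps only complete records. The function name `parse_jsonl` and the third sample line are illustrative assumptions; the first two sample records reuse utterances shown elsewhere in this document.

```python
import json

REQUIRED_KEYS = ("utt", "scenario", "partition")


def parse_jsonl(lines):
    """Parse JSONL lines, keeping only records that contain every required key."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        if all(key in record for key in REQUIRED_KEYS):
            rows.append({
                "utt": record["utt"],
                "label": record["scenario"],
                "partition": record["partition"],
            })
    return rows


sample_lines = [
    '{"utt": "ساکت", "scenario": "audio", "partition": "test"}',
    '{"utt": "توقف", "scenario": "audio", "partition": "train"}',
    '{"utt": "incomplete record"}',  # missing keys, so it is dropped
]
rows = parse_jsonl(sample_lines)
print(len(rows))
```

Passing `rows` to `pd.DataFrame` would give the same three columns (`utt`, `label`, `partition`) as the notebook cell.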

In [28]:
utterances = []
scenarios = []
classes = []

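# Read as UTF-8 explicitly so the Persian text decodes the same way on every platform
with open('fa-IR.jsonl', 'r', encoding='utf-8') as file: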
    for line in file:
        record = json.loads(line)
        if 'utt' in record and 'scenario' in record and 'partition' in record:
            utterances.append(record['utt'])
            scenarios.append(record['scenario'])
            classes.append(record['partition'])

df = pd.DataFrame({'utt': utterances, 'label': scenarios, 'partition': classes})
df.head()

|  | utt | label | partition |
|---|---|---|---|
| 0 | این هفته ساعت پنج صبح بیدارم کن | alarm | test |
| 1 | مرا جمعه ساعت نه صبح بیدار کن | alarm | train |
| 2 | یک زنگ هشدار را برای دو ساعت دیگر تنظیم کن | alarm | train |
| 3 | ساکت | audio | test |
| 4 | الی ساکت شو | audio | train |


In [35]:
# Plot the count of unique labels with plotly_dark template
label_counts = df['label'].value_counts()
label_fig = px.bar(label_counts, x=label_counts.index, y=label_counts.values, labels={'x':'Label', 'y':'Count'}, title='Label Distribution')
label_fig.update_layout(template="plotly_dark")

# Plot the count of unique partitions with plotly_dark template
partition_counts = df['partition'].value_counts()
partition_fig = px.bar(partition_counts, x=partition_counts.index, y=partition_counts.values, labels={'x':'Partition', 'y':'Count'}, title='Partition Distribution')
partition_fig.update_layout(template="plotly_dark")

# Create a pie chart for label distribution
label_pie_fig = px.pie(label_counts, values=label_counts.values, names=label_counts.index, title='Label Distribution')
label_pie_fig.update_layout(template="plotly_dark")

# Create a pie chart for partition distribution
partition_pie_fig = px.pie(partition_counts, values=partition_counts.values, names=partition_counts.index, title='Partition Distribution')
partition_pie_fig.update_layout(template="plotly_dark")

# Display the plots
# label_fig.show()
# partition_fig.show()
# label_pie_fig.show()
# partition_pie_fig.show()


![plt](images/01.png)
![plt](images/02.png)
![plt](images/03.png)
![plt](images/04.png)

## Data Splitting

In this code cell, the dataset is split into training and testing subsets. Here's a brief explanation:

- **Create a train DataFrame**: A new DataFrame called `train_df` is created by selecting rows from the original DataFrame `df` where the 'partition' column has the value 'train'. This effectively isolates the training data.

- **Create a test DataFrame**: Similarly, another DataFrame called `test_df` is created by selecting rows from the original DataFrame `df` where the 'partition' column has the value 'test'. This separates the testing data from the dataset.

These DataFrames, `train_df` and `test_df`, now contain the training and testing subsets of the data, respectively. They can be used for model training and evaluation.
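
A minimal sketch of the same boolean-mask split on a toy DataFrame (all values are hypothetical). It also calls `.copy()` on each subset, a common precaution that makes the subsets independent of `df`, so that adding columns later does not trigger pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

df = pd.DataFrame({
    "utt": ["a", "bb", "ccc", "dddd"],
    "label": ["alarm", "alarm", "audio", "audio"],
    "partition": ["test", "train", "train", "other"],
})

# Boolean masks select the rows; .copy() makes each subset independent of df
train_df = df[df["partition"] == "train"].copy()
test_df = df[df["partition"] == "test"].copy()

# Adding a column to the copy leaves the original DataFrame untouched
train_df["n_chars"] = train_df["utt"].str.len()

print(train_df.index.tolist())
print("n_chars" in df.columns)
```

Rows whose partition is neither `'train'` nor `'test'` end up in neither subset, and the original row indices are preserved.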

In [3]:
# Create a train DataFrame where partition is 'train'
# (.copy() avoids SettingWithCopyWarning when new columns are added later)
train_df = df[df['partition'] == 'train'].copy()

# Create a test DataFrame where partition is 'test'
test_df = df[df['partition'] == 'test'].copy()

## Text Preprocessing

In this code cell, text preprocessing steps are applied to the training and testing datasets. The following steps are performed:

- **Define additional stopwords**: A list of additional stopwords specific to the Persian language is defined. These stopwords are words that are commonly removed from text during preprocessing because they often carry little meaningful information.

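- **Initialize the WordNet Lemmatizer**: The WordNet Lemmatizer from NLTK is initialized. Lemmatization reduces words to their base or dictionary form. WordNet is an English lexical database, so Persian tokens are typically returned unchanged, as the sample output below suggests.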

- **Define a preprocessing function**: A function called `preprocess_text` is defined. This function takes a text input, tokenizes it, lemmatizes the tokens, and removes stopwords from the text. The cleaned words are then joined back into a sentence.

- **Apply preprocessing to train and test data**: The `preprocess_text` function is applied to the 'utt' column of both the training (`train_df`) and testing (`test_df`) DataFrames. The results are stored in new columns named 'Processed_utt' for both DataFrames.

This preprocessing step is crucial for text data before it's used for machine learning or NLP tasks. It helps remove noise and reduce the dimensionality of the data.
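
The stopword-removal step can be illustrated in isolation. This sketch assumes plain whitespace splitting instead of NLTK's `word_tokenize`, skips lemmatization, and uses only a small subset of the stopword list. The function name `remove_stopwords` is hypothetical.

```python
# Small subset of the notebook's stopword list, for illustration only
STOP_WORDS = {"را", "کن", "شو", "است", "شده", "و"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Split on whitespace, drop stopwords, and join the rest back together."""
    tokens = text.split()
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


first = remove_stopwords("الی ساکت شو")
second = remove_stopwords("و تاریک شده است")
print(first)
print(second)
```

Storing the stopwords in a `set` keeps each membership test constant-time, which matters when the function is applied to every row of a DataFrame.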

In [4]:
# Define the stopwords
stop_words = [
    "و", "در", "به", "از", "كه", "مي", "اين", "است", "را", "با", "هاي",
    "براي", "آن", "يك", "شود", "شده", "خود", "ها", "كرد", "شد", "اي",
    "تا", "كند", "بر", "بود", "گفت", "نيز", "وي", "هم", "كنند", "دارد", "ما",
    "کن", "کرد", "کردن", "باش", "بود", "بودن", "شو", "شد", "شدن", "دار",
    "داشت", "داشتن", "خواه", "خواست", "خواستن", "گوی", "گفت", "گفتن",
    "گیر", "گرفت", "گرفتن", "آی", "آمد", "آمدن", "توان", "توانست",
    "توانستن", "یاب", "یافت", "یافتن", "آور", "آورد", "آوردن","دارم","هستند"]


# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenize the text
    words = word_tokenize(text)
    
    # Lemmatize the words and remove stopwords
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    # Join the cleaned words back into a sentence
    cleaned_text = ' '.join(words)
    
    return cleaned_text

train_df['Processed_utt'] = train_df['utt'].apply(preprocess_text)
test_df['Processed_utt'] = test_df['utt'].apply(preprocess_text)

In [5]:
train_df.head()

|  | utt | label | partition | Processed_utt |
|---|---|---|---|---|
| 1 | مرا جمعه ساعت نه صبح بیدار کن | alarm | train | مرا جمعه ساعت نه صبح بیدار |
| 2 | یک زنگ هشدار را برای دو ساعت دیگر تنظیم کن | alarm | train | یک زنگ هشدار برای دو ساعت دیگر تنظیم |
| 4 | الی ساکت شو | audio | train | الی ساکت |
| 5 | توقف | audio | train | توقف |
| 6 | برای ده ثانیه متوقف کن | audio | train | برای ده ثانیه متوقف |


## Word Embedding with Word2Vec

In this code cell, Word2Vec embeddings are generated for the processed text data. Here's an explanation of the steps:

- **Tokenize the processed text data**: The 'Processed_utt' column of both the training and testing DataFrames is tokenized using the `word_tokenize` function from NLTK. This step breaks down the text into individual words or tokens.

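- **Train Word2Vec models**: Two Word2Vec models (`word2vec_model` for training data and `word2vec_model_te` for testing data) are trained on the tokenized text. Each model learns an embedding space in which every word is represented as a vector. Because the two models are trained independently, their vector spaces are not aligned, so the training and testing vectors are not directly comparable; a classifier that consumes these features would normally need a single model trained on the training data and reused for the test data. You can adjust the model parameters (`vector_size`, `window`, `min_count`, `sg`) as needed for your task.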

- **Function to convert text to average Word2Vec vectors**: A function called `get_average_word2vec` is defined. It takes a list of tokens and a Word2Vec model as input and returns the average vector representation of the tokens. If a token is not found in the model (out-of-vocabulary word), it either generates a random vector or uses a zero vector, depending on the `generate_missing` parameter.

- **Apply the function to create Word2Vec vectors**: The `get_average_word2vec` function is applied to each row of tokenized text in the 'Processed_utt' column of both the training and testing DataFrames. The resulting Word2Vec vectors are stored in new columns named 'Word2Vec' for both DataFrames.

These Word2Vec embeddings capture semantic information about words in the text, making them suitable for various NLP tasks such as text classification or clustering.
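
The averaging step can be sketched with a plain dictionary standing in for the trained model's word vectors. The 2-dimensional embedding values and the function name `average_vector` are hypothetical; out-of-vocabulary tokens contribute a zero vector, which corresponds to `generate_missing=False` in the notebook's function.

```python
import numpy as np


def average_vector(tokens, embeddings, dim):
    """Average the vectors of `tokens`; unknown tokens contribute a zero vector."""
    if not tokens:
        return np.zeros(dim)
    vectors = [embeddings.get(token, np.zeros(dim)) for token in tokens]
    return np.mean(vectors, axis=0)


# Toy 2-dimensional embedding table (hypothetical values)
toy_embeddings = {
    "alarm": np.array([1.0, 0.0]),
    "set": np.array([0.0, 1.0]),
}

known = average_vector(["set", "alarm"], toy_embeddings, dim=2)
with_oov = average_vector(["alarm", "snooze"], toy_embeddings, dim=2)
empty = average_vector([], toy_embeddings, dim=2)
print(known, with_oov, empty)
```

Out-of-vocabulary tokens still count toward the denominator, so they pull the average toward zero rather than being ignored.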

In [6]:
# Tokenize the processed text data
tokenized_text = train_df['Processed_utt'].apply(word_tokenize)
tokenized_text_te = test_df['Processed_utt'].apply(word_tokenize)

# Train a Word2Vec model on your tokenized text (adjust parameters as needed)
word2vec_model = Word2Vec(tokenized_text, vector_size=100, window=5, min_count=1, sg=0)
word2vec_model_te = Word2Vec(tokenized_text_te, vector_size=100, window=5, min_count=1, sg=0)

# Function to convert text to the average Word2Vec vectors
def get_average_word2vec(tokens_list, vector, generate_missing=False, k=100):
    if len(tokens_list) < 1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

# Apply the function to create Word2Vec vectors for each utterance
train_df['Word2Vec'] = tokenized_text.apply(lambda x: get_average_word2vec(x, word2vec_model.wv))
test_df['Word2Vec'] = tokenized_text_te.apply(lambda x: get_average_word2vec(x, word2vec_model_te.wv))

## Encoding Categorical Labels

In this code cell, categorical labels are encoded into numerical values using a `LabelEncoder`. Here's how it works:

- **Initialize the LabelEncoder**: A single `LabelEncoder` instance (`sentiment_label_encoder`) is created. It maps each unique value of the 'label' column to an integer.

- **Encode the labels**: The encoder is applied to the 'label' column of both the training (`train_df`) and testing (`test_df`) DataFrames, assigning an integer to each scenario category (for example, `alarm` becomes 0 and `audio` becomes 1 in the output below).

The encoded labels are stored in new columns named 'label_Encoded' in both the training and testing DataFrames. These numerical labels can be used for training machine learning models, as many machine learning algorithms require numerical inputs for labels instead of categorical values.
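
A minimal sketch of label encoding on toy labels (hypothetical values). It fits the encoder once on the training labels and reuses the fitted mapping for the test labels, which guarantees that a category receives the same integer in both splits even when the test split contains fewer categories.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["audio", "alarm", "iot", "alarm"]
test_labels = ["iot", "alarm"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learn the mapping
test_encoded = encoder.transform(test_labels)        # reuse the same mapping

print(list(encoder.classes_))
print(train_encoded.tolist(), test_encoded.tolist())
print(list(encoder.inverse_transform(test_encoded)))
```

`LabelEncoder` assigns integers in sorted order of the class names, and `inverse_transform` recovers the original strings when predictions need to be reported.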

In [7]:
# Initialize the LabelEncoder for the scenario labels
sentiment_label_encoder = LabelEncoder()

# Fit the encoder on the training labels, then reuse the same mapping for the test labels
train_df['label_Encoded'] = sentiment_label_encoder.fit_transform(train_df['label'])
test_df['label_Encoded'] = sentiment_label_encoder.transform(test_df['label'])

In [8]:
train_df.head()

|  | utt | label | partition | Processed_utt | Word2Vec | label_Encoded |
|---|---|---|---|---|---|---|
| 1 | مرا جمعه ساعت نه صبح بیدار کن | alarm | train | مرا جمعه ساعت نه صبح بیدار | [-0.06767874, 0.6883688, -0.08782967, -0.07785... | 0 |
| 2 | یک زنگ هشدار را برای دو ساعت دیگر تنظیم کن | alarm | train | یک زنگ هشدار برای دو ساعت دیگر تنظیم | [-0.07333779, 0.87345713, -0.11739839, -0.0823... | 0 |
| 4 | الی ساکت شو | audio | train | الی ساکت | [-0.01489456, 0.10686286, -0.015576277, -0.012... | 1 |
| 5 | توقف | audio | train | توقف | [-0.02130556, 0.11986162, -0.018047756, -0.017... | 1 |
| 6 | برای ده ثانیه متوقف کن | audio | train | برای ده ثانیه متوقف | [-0.058250546, 0.60042155, -0.07817509, -0.076... | 1 |


In [9]:
test_df.head()

|  | utt | label | partition | Processed_utt | Word2Vec | label_Encoded |
|---|---|---|---|---|---|---|
| 0 | این هفته ساعت پنج صبح بیدارم کن | alarm | test | این هفته ساعت پنج صبح بیدارم | [-0.032603335, 0.13861217, 0.021972628, -0.037... | 0 |
| 3 | ساکت | audio | test | ساکت | [-0.0020866692, -0.008983905, 0.0027714646, -0... | 1 |
| 8 | صورتی همان چیزی است که نیاز داریم | iot | test | صورتی همان چیزی که نیاز داریم | [-0.020410202, 0.068406925, 0.014616796, -0.02... | 7 |
| 14 | و تاریک شده است | iot | test | تاریک | [0.00031434774, 0.0064995815, -0.009134296, -0... | 7 |
| 19 | علی چراغ‌های اتاق خواب را خاموش کن | iot | test | علی چراغ‌های اتاق خواب خاموش | [-0.013984281, 0.05558011, 0.0059327064, -0.01... | 7 |


## Text Vectorization with CountVectorizer and TF-IDF

In this code cell, text data is vectorized using two different methods: CountVectorizer and TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer. Here's how it's done:

- **Tokenize the text data**: The 'Processed_utt' column in both the training (`train_df`) and testing (`test_df`) DataFrames is tokenized using the `word_tokenize` function. This step splits the text into individual words or tokens.

- **Prepare target labels**: The target labels for both training and testing datasets are stored in 'y_train' and 'y_test' variables, respectively.

- **CountVectorizer**:
  - The `CountVectorizer` from scikit-learn is initialized in the `vectorizer` variable. This vectorizer converts text data into a matrix of word counts.
  - It is fitted on the 'Processed_utt' column of the training DataFrame and then used to transform the testing DataFrame, producing `X_train` and `X_test` as matrices of word counts.

- **TF-IDF Vectorizer**:
  - The `TfidfVectorizer` from scikit-learn is initialized in the `tfidf_vectorizer` variable. This vectorizer converts text data into a matrix of TF-IDF features.
  - It is fitted and applied in the same way. Because the results are assigned to the same `X_train` and `X_test` variables, they overwrite the count matrices, so the classifiers in the following sections are trained on TF-IDF features.

These text vectorization techniques transform text data into numerical format, making it suitable for machine learning models that require numerical input. CountVectorizer represents words as raw counts, while TF-IDF Vectorizer assigns weights to words based on their importance in the corpus.
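
The difference between the two representations is easier to see on a tiny corpus. The sketch below uses hypothetical English sentences and scikit-learn's default settings. It also shows that a word seen only at test time (`snooze`) is ignored, because the vocabulary is learned from the training texts alone.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_texts = ["set alarm", "stop alarm now"]
test_texts = ["alarm snooze"]  # "snooze" never appears in the training texts

count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(train_texts)
X_test_counts = count_vectorizer.transform(test_texts)

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts)

print(sorted(count_vectorizer.vocabulary_))
print(X_train_counts.toarray())
print(X_test_counts.toarray())
print(np.linalg.norm(X_train_tfidf.toarray(), axis=1))
```

With the defaults, each TF-IDF row is L2-normalized, so long and short utterances end up on the same scale, while count vectors grow with utterance length.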

In [10]:
train_df['tokenized_text'] = train_df['Processed_utt'].apply(word_tokenize)
test_df['tokenized_text'] = test_df['Processed_utt'].apply(word_tokenize)
y_train = train_df['label']
y_test = test_df['label']

In [11]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df['Processed_utt'])
X_test = vectorizer.transform(test_df['Processed_utt'])

In [12]:
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_df['Processed_utt'])
X_test = tfidf_vectorizer.transform(test_df['Processed_utt'])

## Text Classification with Multinomial Naive Bayes (MNB) Classifier

In this code cell, a Multinomial Naive Bayes (MNB) classifier is used for text classification. Here's a breakdown of the steps:

- **MNB Classifier Training and Prediction**:
  - An MNB classifier (`mnb`) is initialized.
  - The classifier is trained on the training data (`X_train` and `y_train`) using the `fit` method.
  - Predictions are made on the testing data (`X_test`) using the `predict` method.
  - Accuracy is calculated by comparing the predicted labels (`y_pred`) with the actual labels (`y_test`).

- **Display Results**:
  - The accuracy of the MNB classifier is printed as a fraction rounded to two decimal places (for example, `0.78`).
  - A classification report is generated, providing detailed evaluation metrics such as precision, recall, F1-score, and support for each class.

- **Hyperparameter Tuning with Grid Search**:
  - A parameter grid (`param_grid`) is defined, specifying different values for the alpha parameter of the MNB classifier.
  - A new MNB classifier (`multinomial_nb`) is created.
  - A `GridSearchCV` object (`grid_search`) is instantiated to perform grid search with 5-fold cross-validation and accuracy scoring.
  - The grid search is performed using the training data to find the best alpha value.
  - The best MNB classifier (`best_multinomial_nb`) and its corresponding best alpha value are extracted from the grid search results.

- **Training and Evaluation with Best Model**:
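  - The best MNB classifier is trained on the full training set using the best alpha value. `GridSearchCV` already refits the best estimator on the full training set by default (`refit=True`), so the explicit `fit` call repeats that step.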
  - The best model is evaluated on the test set, and its accuracy is printed.

This code demonstrates the process of training, evaluating, and tuning a Multinomial Naive Bayes classifier for text classification, along with reporting accuracy and detailed classification metrics.
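
As an alternative arrangement, the vectorizer and classifier can be wrapped in a scikit-learn `Pipeline` so that TF-IDF is refitted inside every cross-validation fold. This is a sketch of that swapped-in technique, not the notebook's code, which fits the vectorizer once before the search. The toy texts, the smaller alpha grid, and `cv=2` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): four utterances per class
texts = [
    "wake me up at five", "set an alarm for six",
    "wake me at nine", "set alarm now",
    "play some music", "play a song",
    "play jazz music", "play the next song",
]
labels = ["alarm"] * 4 + ["play"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step-prefixed names ("nb__alpha") route each value to the right pipeline step
param_grid = {"nb__alpha": [0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["play music"]))
```

Because the default `refit=True` retrains the best pipeline on all training texts, `search.predict` can be called directly on raw strings.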

In [13]:
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)

accuracy_nb = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy_nb:.2f}')
print(report)

Accuracy: 0.78
                precision    recall  f1-score   support

         alarm       0.98      0.59      0.74        96
         audio       1.00      0.53      0.69        62
      calendar       0.52      1.00      0.68       402
       cooking       0.96      0.38      0.54        72
      datetime       1.00      0.54      0.70       103
         email       0.83      0.94      0.88       271
       general       0.75      0.23      0.35       189
           iot       0.96      0.95      0.95       220
         lists       0.90      0.61      0.73       142
         music       1.00      0.36      0.53        81
          news       0.98      0.77      0.86       124
          play       0.81      0.99      0.89       387
            qa       0.75      0.86      0.81       288
recommendation       0.93      0.55      0.69        94
        social       1.00      0.56      0.72       106
      takeaway       1.00      0.61      0.76        57
     transport       0.93      0

In [14]:
# Define the parameter grid to search
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0]}

# Create the MultinomialNB classifier
multinomial_nb = MultinomialNB()

# Create a GridSearchCV object to perform grid search with cross-validation
grid_search = GridSearchCV(multinomial_nb, param_grid, cv=5, scoring='accuracy')

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

# Get the best estimator (classifier) and its corresponding hyperparameters
best_multinomial_nb = grid_search.best_estimator_
best_alpha = best_multinomial_nb.alpha

# Train the best model on the full training set
best_multinomial_nb.fit(X_train, y_train)

# Evaluate the best model on the test set
y_pred = best_multinomial_nb.predict(X_test)
accuracy_nb = accuracy_score(y_test, y_pred)

print(f'Best Alpha: {best_alpha}')
print(f'MNB Test Accuracy with Best Model: {accuracy_nb:.2f}')

Best Alpha: 0.1
MNB Test Accuracy with Best Model: 0.86


## Text Classification with Random Forest Classifier

In this code cell, a Random Forest Classifier is used for text classification. Here's a breakdown of the steps:

- **Hyperparameter Tuning with Grid Search**:
  - A parameter grid (`param_grid`) is defined, specifying different values for the 'n_estimators' (number of trees in the forest) and 'max_depth' (maximum depth of the tree) hyperparameters of the Random Forest Classifier.
  - A Random Forest Classifier (`rf`) is initialized.
  - A `GridSearchCV` object (`grid_search`) is instantiated to perform grid search with 5-fold cross-validation.
  - The grid search is performed using the training data to find the best combination of hyperparameters.

- **Training and Evaluation with Best Model**:
  - The best Random Forest Classifier (`best_rf`) obtained from the grid search is trained on the full training set using the best hyperparameters.
  - The best model is evaluated on the test set, and its accuracy is printed.

This code demonstrates the process of hyperparameter tuning, training, and evaluating a Random Forest Classifier for text classification, along with reporting accuracy.
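
Grid search is exhaustive, so its cost grows with the product of the grid sizes. The sketch below counts the model fits implied by the grid and fold count described above, using only the standard library. The variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
}
cv_folds = 5

# Every combination of hyperparameter values that an exhaustive grid search tries
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_cv_fits = len(combinations) * cv_folds

print(len(combinations))
print(total_cv_fits)
```

With 12 combinations and 5 folds, the search trains 60 forests, plus one final refit of the best configuration when `refit=True` (the `GridSearchCV` default).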

In [15]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],}

rf = RandomForestClassifier()
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_

# Train the best model on the full training set
best_rf.fit(X_train, y_train)

# Evaluate the best model on the test set
y_pred = best_rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred)

print(f'RF Test Accuracy with Best model: {accuracy_rf:.2f}')

RF Test Accuracy with Best model: 0.85


## Text Classification with Support Vector Classifier (SVC)

In this code cell, a Support Vector Classifier (SVC) is used for text classification. Here's a breakdown of the steps:

- **SVC Classifier Training and Prediction**:
  - An SVC classifier (`svm_classifier`) with a linear kernel is initialized.
  - The classifier is trained on the training data (`X_train` and `y_train`) using the `fit` method.
  - Predictions are made on the testing data (`X_test`) using the `predict` method.
  - Accuracy is calculated by comparing the predicted labels (`y_pred`) with the actual labels (`y_test`).

- **Display Results**:
  - The accuracy of the SVC classifier is printed as a fraction rounded to two decimal places (for example, `0.88`).
  - A classification report is generated, providing detailed evaluation metrics such as precision, recall, F1-score, and support for each class.

- **Hyperparameter Tuning with Grid Search**:
  - A parameter grid (`param_grid`) is defined, specifying different values for the 'C' (penalty parameter) and 'kernel' hyperparameters of the SVC classifier.
  - A new SVC classifier (`SVC()`) is created.
  - A `GridSearchCV` object (`grid_search`) is instantiated to perform grid search with 5-fold cross-validation.
  - The grid search is performed using the training data to find the best combination of hyperparameters.

- **Training and Evaluation with Best Model**:
  - The best SVC classifier (`best_svm_classifier`) obtained from the grid search is trained on the full training set.
  - The best model is evaluated on the test set, and its accuracy is printed.

This code demonstrates the process of training, evaluating, and tuning a Support Vector Classifier for text classification, along with reporting accuracy and detailed classification metrics.

In [16]:
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
y_pred = svm_classifier.predict(X_test)

accuracy_svc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy_svc:.2f}')
print(report)

Accuracy: 0.88
                precision    recall  f1-score   support

         alarm       0.95      0.95      0.95        96
         audio       0.92      0.87      0.89        62
      calendar       0.85      0.94      0.89       402
       cooking       0.89      0.78      0.83        72
      datetime       0.98      0.85      0.91       103
         email       0.96      0.95      0.96       271
       general       0.53      0.55      0.54       189
           iot       0.98      0.95      0.97       220
         lists       0.95      0.84      0.89       142
         music       0.91      0.77      0.83        81
          news       0.93      0.82      0.87       124
          play       0.94      0.96      0.95       387
            qa       0.71      0.88      0.79       288
recommendation       0.85      0.79      0.82        94
        social       0.94      0.85      0.89       106
      takeaway       0.98      0.81      0.88        57
     transport       0.95      0

In [17]:
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],}

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_svm_classifier = grid_search.best_estimator_

# Train the best model on the full training set
best_svm_classifier.fit(X_train, y_train)

# Evaluate the best model on the test set
y_pred = best_svm_classifier.predict(X_test)
accuracy_svc = accuracy_score(y_test, y_pred)

print(f'SVC Test Accuracy with Best Model: {accuracy_svc:.2f}')

SVC Test Accuracy with Best Model: 0.88


## Text Classification with Logistic Regression

In this code cell, Logistic Regression is used for text classification. Here's a breakdown of the steps:

- **Logistic Regression Classifier Training and Prediction**:
  - A Logistic Regression classifier (`lr_classifier`) is initialized with a maximum number of iterations (`max_iter`) set to 1000.
  - The classifier is trained on the training data (`X_train` and `y_train`) using the `fit` method.
  - Predictions are made on the testing data (`X_test`) using the `predict` method.
  - Accuracy is calculated by comparing the predicted labels (`y_pred`) with the actual labels (`y_test`).

- **Display Results**:
  - The accuracy of the Logistic Regression classifier is printed as a fraction rounded to two decimal places (for example, `0.87`).
  - A classification report is generated, providing detailed evaluation metrics such as precision, recall, F1-score, and support for each class.

- **Hyperparameter Tuning with Grid Search**:
  - A parameter grid (`param_grid`) is defined, specifying different values for the 'C' (inverse of regularization strength) hyperparameter of the Logistic Regression classifier.
  - A new Logistic Regression classifier (`lr_classifier`) is created.
  - A `GridSearchCV` object (`grid_search`) is instantiated to perform grid search with 5-fold cross-validation and accuracy scoring.
  - The grid search is performed using the training data to find the best value for the 'C' hyperparameter.

- **Training and Evaluation with Best Model**:
  - The best Logistic Regression classifier (`best_lr_classifier`) obtained from the grid search is trained on the full training set using the best 'C' value.
  - The best model is evaluated on the test set, and its accuracy is printed.

This code demonstrates the process of training, evaluating, and tuning a Logistic Regression classifier for text classification, along with reporting accuracy and detailed classification metrics.
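
Since `C` is the inverse of the regularization strength, smaller values shrink the learned coefficients more strongly. A minimal sketch on hypothetical one-dimensional data illustrates the effect; the two `C` values are arbitrary picks for contrast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data (hypothetical values)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

weak_reg = LogisticRegression(C=100, max_iter=1000).fit(X, y)     # weak penalty
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)  # strong penalty

weak_coef = abs(weak_reg.coef_[0, 0])
strong_coef = abs(strong_reg.coef_[0, 0])
print(weak_coef, strong_coef)
```

The strongly regularized model ends up with the smaller coefficient. A grid over several orders of magnitude of `C`, as in the tuning step described above, lets cross-validation pick the trade-off between fitting the training data and keeping the weights small.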

In [18]:
lr_classifier = LogisticRegression(max_iter=1000)
lr_classifier.fit(X_train, y_train)
y_pred = lr_classifier.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy_lr:.2f}')
print(report)

Accuracy: 0.87
                precision    recall  f1-score   support

         alarm       0.97      0.86      0.91        96
         audio       0.94      0.81      0.87        62
      calendar       0.84      0.95      0.89       402
       cooking       0.91      0.72      0.81        72
      datetime       0.96      0.83      0.89       103
         email       0.96      0.94      0.95       271
       general       0.59      0.49      0.54       189
           iot       0.97      0.96      0.96       220
         lists       0.94      0.85      0.89       142
         music       0.92      0.72      0.81        81
          news       0.93      0.77      0.85       124
          play       0.93      0.96      0.94       387
            qa       0.64      0.92      0.76       288
recommendation       0.87      0.78      0.82        94
        social       0.98      0.81      0.89       106
      takeaway       0.98      0.81      0.88        57
     transport       0.96      0

In [19]:
# Define the parameter grid to search
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Create the Logistic Regression classifier
lr_classifier = LogisticRegression(max_iter=1000)

# Create a GridSearchCV object to perform grid search with cross-validation
grid_search = GridSearchCV(lr_classifier, param_grid, cv=5, scoring='accuracy')

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

# Get the best estimator (classifier) and its corresponding hyperparameters
best_lr_classifier = grid_search.best_estimator_
best_C = best_lr_classifier.C

# GridSearchCV (refit=True by default) has already refit the best estimator on the
# full training set, so this extra fit only repeats that step
best_lr_classifier.fit(X_train, y_train)

# Evaluate the best model on the test set
y_pred = best_lr_classifier.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred)

print(f'Best C: {best_C}')
print(f'LR Test Accuracy with Best Model: {accuracy_lr:.2f}')

Best C: 10
LR Test Accuracy with Best Model: 0.88
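
The grid search over `C` can be reproduced in isolation with the minimal sketch below. It is not the project's code: the synthetic data from `make_classification` is an assumption standing in for the TF-IDF features, and the name `best_lr` is illustrative. It shows that `best_estimator_` is already fitted on the whole training set when `refit` is left at its default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features (assumption: not the project's data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                           cv=5, scoring='accuracy')  # refit=True is the default
grid_search.fit(X_train, y_train)

# best_estimator_ is already fitted on all of X_train with the best C
best_lr = grid_search.best_estimator_
print('Best params:', grid_search.best_params_)
print('Mean CV accuracy:', round(grid_search.best_score_, 3))
print('Test accuracy:', round(best_lr.score(X_test, y_test), 3))

# Predicting through grid_search delegates to the same refit estimator
same = np.array_equal(grid_search.predict(X_test), best_lr.predict(X_test))
print('Same predictions:', same)
```

Comparing `best_score_` (mean cross-validated accuracy) with the test accuracy gives a rough check on whether the chosen `C` generalizes beyond the folds it was selected on.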


## Text Classification with Decision Tree Classifier

In this code cell, a Decision Tree Classifier is used for text classification. Here's a breakdown of the steps:

- **Decision Tree Classifier Training and Prediction**:
  - A Decision Tree Classifier (`dt_classifier`) is initialized.
  - The classifier is trained on the training data (`X_train` and `y_train`) using the `fit` method.
  - Predictions are made on the testing data (`X_test`) using the `predict` method.
  - Accuracy is calculated by comparing the predicted labels (`y_pred`) with the actual labels (`y_test`).

- **Display Results**:
  - The accuracy of the Decision Tree Classifier is printed as a fraction between 0 and 1, rounded to two decimal places.
  - A classification report is generated, providing detailed evaluation metrics such as precision, recall, F1-score, and support for each class.

- **Hyperparameter Tuning with Grid Search**:
  - A parameter grid (`param_grid`) is defined, specifying different values for the 'max_depth', 'min_samples_split', and 'min_samples_leaf' hyperparameters of the Decision Tree Classifier.
  - A new Decision Tree Classifier (`DecisionTreeClassifier()`) is created.
  - A `GridSearchCV` object (`grid_search`) is instantiated to perform grid search with 5-fold cross-validation.
  - The grid search is performed using the training data to find the best combination of hyperparameters.

- **Training and Evaluation with Best Model**:
  - The best Decision Tree Classifier (`best_dt_classifier`) obtained from the grid search is trained on the full training set using the best hyperparameters.
  - The best model is evaluated on the test set, and its accuracy is printed.

This code demonstrates the process of training, evaluating, and tuning a Decision Tree Classifier for text classification, along with reporting accuracy and detailed classification metrics.

In [20]:
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy_dt:.2f}')
print(report)

Accuracy: 0.79
                precision    recall  f1-score   support

         alarm       0.80      0.82      0.81        96
         audio       0.79      0.81      0.80        62
      calendar       0.78      0.82      0.80       402
       cooking       0.78      0.74      0.76        72
      datetime       0.88      0.79      0.83       103
         email       0.89      0.90      0.89       271
       general       0.38      0.47      0.42       189
           iot       0.94      0.94      0.94       220
         lists       0.82      0.80      0.81       142
         music       0.80      0.69      0.74        81
          news       0.80      0.71      0.75       124
          play       0.91      0.89      0.90       387
            qa       0.69      0.74      0.72       288
recommendation       0.54      0.53      0.54        94
        social       0.88      0.75      0.81       106
      takeaway       0.88      0.74      0.80        57
     transport       0.85      0

In [21]:
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_dt_classifier = grid_search.best_estimator_

# GridSearchCV (refit=True by default) has already refit the best estimator on the
# full training set, so this extra fit only repeats that step
best_dt_classifier.fit(X_train, y_train)

# Evaluate the best model on the test set
y_pred = best_dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred)

print(f'DT Test Accuracy with Best Model: {accuracy_dt:.2f}')

DT Test Accuracy with Best Model: 0.79
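
The cell above reports only the final accuracy. The sketch below shows one way to inspect which hyperparameter combination the search selected. It uses the same parameter grid, but the synthetic data and the fixed `random_state` are assumptions added for illustration and are not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF features (assumption: not the project's data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# random_state makes the fitted trees reproducible (an addition, not in the original)
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid_search.fit(X, y)

results = grid_search.cv_results_
n_candidates = len(results['params'])  # 4 * 3 * 3 combinations
best_index = grid_search.best_index_

print('Candidates evaluated:', n_candidates)
print('Best params:', grid_search.best_params_)
print('Best mean CV accuracy:', round(results['mean_test_score'][best_index], 3))
```

If the best combination turns out to be the default settings (`max_depth=None`, `min_samples_split=2`, `min_samples_leaf=1`), the tuned model is configured the same way as the untuned one, which would be consistent with both reporting the same test accuracy.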


## Text Classification with FastText Word Embeddings

In this code cell, FastText word embeddings are used for text classification. Here's a breakdown of the steps:

- **Load FastText Word Embeddings Model**:
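  - A pre-trained FastText model (`fasttext_model`) is loaded from `cc.fa.300.bin` using gensim's `FastText.load_fasttext_format` function. The file contains 300-dimensional word vectors pre-trained on Persian text. Note that gensim 4.0 and later replace this method with `gensim.models.fasttext.load_facebook_model`.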

- **Text to Embedding Conversion**:
  - A function called `text_to_embedding` is defined. This function takes a text input, tokenizes it into words, and converts each word to its word vector using the FastText model.
  - It filters out words that are not in the FastText model's vocabulary and computes the average word vector for the entire text.
  - The result is an average word embedding for the input text.

- **Convert Text Data to Word Embeddings**:
  - The 'utt' column of both the training (`train_df`) and testing (`test_df`) DataFrames is processed to obtain FastText word embeddings. For each text in the DataFrame, the `text_to_embedding` function is applied to obtain a word embedding vector.
  - These word embedding vectors are stored in `X_train_embeddings` and `X_test_embeddings`, respectively.

- **Training with Word Embeddings**:
  - The Logistic Regression classifier (`lr_classifier`) is trained on the training data with FastText word embeddings (`X_train_embeddings`) instead of traditional text features.

- **Make Predictions and Evaluate**:
  - Predictions are made on the test set using the trained model with word embeddings.
  - Accuracy is calculated by comparing the predicted labels (`y_pred`) with the actual labels (`y_test`).

- **Display Results**:
  - The accuracy of the Logistic Regression classifier using FastText word embeddings is printed as a fraction between 0 and 1, rounded to two decimal places.

This code demonstrates the process of using pre-trained FastText word embeddings for text classification, allowing the model to leverage pre-trained word vectors to capture semantic information in the text data.

In [22]:
fasttext_model = FastText.load_fasttext_format('cc.fa.300.bin')

def text_to_embedding(text, model, vector_size):
    words = text.split()
    # Filter words that are in the FastText model's vocabulary
    valid_words = [word for word in words if word in model.wv]
    if valid_words:
        embeddings = [model.wv[word] for word in valid_words]
        average_embedding = np.mean(embeddings, axis=0)
    else:
        # If no valid words are found, create a zero vector
        average_embedding = np.zeros(vector_size)
    return average_embedding

X_train_embeddings = [text_to_embedding(text, fasttext_model, vector_size=300) for text in train_df['utt']]
X_test_embeddings = [text_to_embedding(text, fasttext_model, vector_size=300) for text in test_df['utt']]

# Train the model on the FastText word embeddings
lr_classifier.fit(X_train_embeddings, train_df['label'])

# Make predictions
y_pred = lr_classifier.predict(X_test_embeddings)

# Evaluate the model and decode the labels as needed

accuracy_lr_fsttext = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy_lr_fsttext:.2f}')
# print(report)

Accuracy: 0.80
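
The averaging step can be tried without downloading the FastText model. The following sketch swaps the model for a small dictionary of made-up 4-dimensional vectors (`toy_vectors` is hypothetical, as are the example texts) and stacks the results into the 2-D array layout that scikit-learn estimators expect.

```python
import numpy as np

# Hypothetical 4-dimensional vectors standing in for the 300-dimensional FastText model
toy_vectors = {
    'set': np.array([1.0, 0.0, 0.0, 0.0]),
    'alarm': np.array([0.0, 1.0, 0.0, 0.0]),
    'morning': np.array([0.0, 0.0, 1.0, 1.0]),
}

def text_to_embedding(text, vectors, vector_size):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[word] for word in text.split() if word in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

texts = ['set alarm', 'set morning alarm please', 'unknown words only']

# Stack into a 2-D array of shape (n_texts, vector_size)
X = np.vstack([text_to_embedding(t, toy_vectors, vector_size=4) for t in texts])

print(X.shape)  # (3, 4)
print(X)
```

Averaging discards word order, so two texts containing the same words in a different order get the same vector. The all-zero row for the last text shows the fallback used when no word is recognized.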


## Text Classification with LSTM Neural Network

In this code cell, a Long Short-Term Memory (LSTM) neural network is used for text classification. Here's a breakdown of the steps:

- **Data Preparation**:
  - Processed text data is extracted from the 'Processed_utt' column of both the training (`train_df`) and testing (`test_df`) DataFrames.
  - Target labels are stored in `y_train` and `y_test`.

- **Text Tokenization and Padding**:
  - A `Tokenizer` is used to tokenize the text data and convert it into sequences of integers.
  - The sequences are padded to a common maximum sequence length to ensure uniform input size for the neural network.

- **Label Encoding**:
  - A `LabelEncoder` is used to encode the target labels into numerical values.
  - The encoded labels are converted to one-hot encoding to be used for multi-class classification.

- **LSTM Model Construction**:
  - A Sequential neural network model is created.
  - It starts with an embedding layer to convert integer sequences into dense vectors.
  - An LSTM layer with 128 units is added to capture sequential information.
  - A dense layer with a softmax activation function is added to produce class probabilities.

- **Model Compilation**:
  - The model is compiled with categorical cross-entropy as the loss function and the Adam optimizer.
  - Accuracy is chosen as a metric to monitor during training.

- **Training**:
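  - The model is trained on the padded training data (`X_train_padded`) and one-hot encoded labels (`y_train_one_hot`) for 10 epochs with a batch size of 256. The test set is also passed as `validation_data`, so the per-epoch validation metrics are computed on the same data used for the final evaluation.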

- **Evaluation**:
  - The trained model is evaluated on the test data (`X_test_padded` and `y_test_one_hot`).
  - Test loss and accuracy are printed to assess the model's performance.

This code demonstrates the process of text classification using an LSTM neural network, including data preprocessing, model construction, training, and evaluation.

In [23]:
# Extract processed text data
X_train_text = train_df['Processed_utt']
X_test_text = test_df['Processed_utt']

# Target labels
y_train = train_df['label']
y_test = test_df['label']

# Tokenize and pad the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text)

X_train_seq = tokenizer.texts_to_sequences(X_train_text)
X_test_seq = tokenizer.texts_to_sequences(X_test_text)

max_sequence_length = max([len(seq) for seq in X_train_seq + X_test_seq])
X_train_padded = pad_sequences(X_train_seq, maxlen=max_sequence_length)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_sequence_length)

# Encode labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
num_classes = len(label_encoder.classes_)

# Build an LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=256, input_length=max_sequence_length))
model.add(LSTM(128))
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Convert labels to one-hot encoding
y_train_one_hot = to_categorical(y_train_encoded, num_classes=num_classes)
y_test_one_hot = to_categorical(y_test_encoded, num_classes=num_classes)

# Train the model
model.fit(X_train_padded, y_train_one_hot, validation_data=(X_test_padded, y_test_one_hot), epochs=10, batch_size=256)

# Evaluate the model
loss, accuracy_nn = model.evaluate(X_test_padded, y_test_one_hot)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy_nn}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 0.5008704662322998
Test Accuracy: 0.874579668045044
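
To make the padding and one-hot steps concrete, here is a NumPy-only sketch of what they produce. The helpers `pad_pre` and `one_hot` are hypothetical names written for this illustration. They are meant to mirror the default behaviour of Keras `pad_sequences` (zeros added on the left, truncation from the left) and of `to_categorical`, and they are not the library implementations.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad (or left-truncate) integer sequences with zeros."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep at most the last maxlen tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded

def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[labels]

# Made-up token ids, as a tokenizer might produce for three short texts
sequences = [[5, 2], [7, 1, 3, 9], []]
padded = pad_pre(sequences, maxlen=4)
print(padded)
# [[0 0 5 2]
#  [7 1 3 9]
#  [0 0 0 0]]

labels_one_hot = one_hot([2, 0, 1], num_classes=3)
print(labels_one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

Index 0 is reserved for padding, which is why the embedding layer in the cell above uses `len(tokenizer.word_index) + 1` as its input dimension.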


## Model Accuracy Comparison Bar Chart

In this code cell, a bar chart is created to visually compare the accuracies of different text classification models. Here's a breakdown of the steps:

- **Model Names**:
  - A list `model_names` is defined, containing the names of the models whose accuracies are compared.

- **Create a Bar Chart Trace**:
  - A bar chart trace is created using Plotly's `go.Bar` function. The x-axis represents the model names (`model_names`), and the y-axis represents the corresponding accuracies.
  - Each bar is color-coded differently to distinguish between models.

- **Layout Configuration**:
  - The layout for the plot is configured using Plotly's `go.Layout` function. It includes a title, axis labels, and a dark template for the plot.

- **Create a Figure and Add the Trace**:
  - A Plotly figure (`fig`) is created and initialized with the bar chart trace and layout.

- **Display the Interactive Plot**:
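  - The interactive plot can be displayed using `fig.show()`. The call is commented out in the cell below, and a static image of the chart is included instead.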

This code allows for a visual comparison of model accuracies in a bar chart, making it easier to assess and compare the performance of different text classification models.

In [24]:
# Model names
model_names = ['Random Forest', 'Neural Network', 'Naive Bayes', 'SVC', 'Decision Tree', 'Logistic Regression (FastText)']

# Create a bar chart trace for accuracies
trace = go.Bar(x=model_names, y=[accuracy_rf, accuracy_nn, accuracy_nb, accuracy_svc, accuracy_dt, accuracy_lr_fsttext],
               marker=dict(color=['blue', 'green', 'red', 'purple', 'orange', 'pink']))

# Create the layout for the plot
layout = go.Layout(
    title='Model Accuracy Comparison',
    xaxis=dict(title='Model'),
    yaxis=dict(title='Accuracy'),
    template='plotly_dark')


# Create a figure and add the trace
fig = go.Figure(data=[trace], layout=layout)

# Display the interactive plot
# fig.show()

![accuracy](images/1.png)

## Text Classification Ensemble Prediction

In this code cell, an ensemble prediction for the category of an input text is made using multiple text classification models. Here's a breakdown of the steps:

- **Input Text**:
  - An input text is provided in the variable `input_text`. You can replace this text with your own input text that you want to classify.

- **Text Preprocessing**:
  - The input text is preprocessed using a preprocessing function (`preprocess_text`). This function tokenizes the text, removes stopwords, and performs lemmatization.

- **Vectorization of Preprocessed Input**:
  - The preprocessed input text (`preprocessed_input`) is vectorized using the same vectorizer that was used during training. In this case, TF-IDF vectorization is used (`tfidf_vectorizer.transform`).

- **Prediction with Individual Models**:
  - Predictions are made for the input text using three different models: Logistic Regression (`best_lr_classifier`), Support Vector Classifier (`best_svm_classifier`), and Decision Tree Classifier (`best_dt_classifier`).

- **Ensemble Prediction with Majority Voting**:
  - The predicted labels from the three models are combined into a list (`all_predicted_labels`).
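  - Majority voting is performed to determine the final predicted label. The label that occurs most frequently among the three predictions is selected as the final category. If all three models disagree, `Counter.most_common` returns the label it encountered first, which here is the Logistic Regression prediction.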

- **Print Predicted Category**:
  - The final predicted category label is printed.

This code demonstrates how to use an ensemble of multiple classification models to make a consensus prediction for the category of an input text. Combining models can offset the errors of any single one, although the ensemble's accuracy is not measured in this notebook.

In [25]:
# Replace 'input_text' with your actual input text
input_text = "مذاکرات ایران و آمریکا"

# Preprocess the input text (you need to implement your preprocessing)
preprocessed_input = preprocess_text(input_text)

# Vectorize the preprocessed input text using the same vectorizer used during training
vectorized_input = tfidf_vectorizer.transform([preprocessed_input])  # Use your vectorizer (e.g., TF-IDF) here

# Predict the label with each tuned model
predicted_labels_lr = best_lr_classifier.predict(vectorized_input)
predicted_labels_svm = best_svm_classifier.predict(vectorized_input)
predicted_labels_dt = best_dt_classifier.predict(vectorized_input)

# Combine the predicted labels from different models into a single list
all_predicted_labels = [predicted_labels_lr[0], predicted_labels_svm[0], predicted_labels_dt[0]]

# Perform majority voting to determine the final label
final_label = Counter(all_predicted_labels).most_common(1)[0][0]

# Decode the final label back to its original category
# final_category = sentiment_label_encoder.inverse_transform([final_label])[0]

# Print the final category
print(f'Predicted Category: {final_label}')

Predicted Category: news
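
The voting rule and its tie behaviour can be checked in isolation. The sketch below is a minimal illustration using only the standard library. The helper `majority_vote` is a hypothetical name, and the label lists are made up rather than taken from the models.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label and whether a strict majority backed it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count > len(labels) / 2

# Two of three models agree: a genuine majority
print(majority_vote(['news', 'news', 'qa']))     # ('news', True)

# All three disagree: the first label in the list wins by insertion order
print(majority_vote(['news', 'qa', 'general']))  # ('news', False)
```

Because ties fall back to the first entry, the order of the list decides which model acts as the tie-breaker. Returning the majority flag alongside the label lets the caller detect a three-way split and handle it differently if needed.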


## Using the Text Classifier Interface

The ensemble classifier is exposed through a Gradio web interface. Follow these steps to predict the category of a given text:

1. **Launch the Interface**: Run the provided code to launch the interface. You can do this by executing the script or cell containing the code snippet.

2. **Enter Text**: In the interface, you will find an input field labeled "text." Enter the text you want to classify into this field.

3. **Get Predicted Category**: After entering the text, click the "Submit" button. The input is preprocessed, vectorized, and classified by majority vote over the trained models.

4. **View Predicted Category**: The predicted category will be displayed as the output of the interface. This represents the model's classification for the provided text.

Feel free to experiment with different texts to observe how the model classifies them. This user-friendly interface simplifies the process of obtaining predictions from the text classification model.

Remember that the accuracy of predictions depends on the quality of the training data and the performance of the underlying machine learning models used in the project.

In [26]:
# Predict the category of a text by majority vote over the three tuned models
def predict_labels(text):

    # Preprocess the input text (you need to implement your preprocessing)
    preprocessed_input = preprocess_text(text)

    # Vectorize the preprocessed input text using the same vectorizer used during training
    vectorized_input = tfidf_vectorizer.transform([preprocessed_input])  # Use your vectorizer (e.g., TF-IDF) here

    # Predict the label with each tuned model
    predicted_labels_lr = best_lr_classifier.predict(vectorized_input)
    predicted_labels_svm = best_svm_classifier.predict(vectorized_input)
    predicted_labels_dt = best_dt_classifier.predict(vectorized_input)

    # Combine the predicted labels from different models into a single list
    all_predicted_labels = [predicted_labels_lr[0], predicted_labels_svm[0], predicted_labels_dt[0]]

    # Perform majority voting to determine the final label
    final_label = Counter(all_predicted_labels).most_common(1)[0][0]

    return final_label

iface = gr.Interface(
    fn=predict_labels,
    inputs=["text"],
    outputs="text",
    title="Text Classifier",
    description="Enter a text and get the predicted category.",
)

iface.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




![gradio](images/2.png)