# Text Classification Using ML/DL on the 20 Newsgroups Dataset

In this notebook, we work with a publicly available dataset of documents from 20 different Usenet newsgroups. The task is to design, implement, and train two types of classifiers: traditional machine learning models (Naive Bayes and Logistic Regression) and a deep learning model (a Convolutional Neural Network, or CNN), each of which classifies the documents into their corresponding newsgroups.

The dataset can be found [here](http://qwone.com/~jason/20Newsgroups/).

## About the dataset

The `20news-bydate` version of the 20 Newsgroups dataset (sorted by date) contains 18846 documents. The MATLAB version, which ships the `train.data`, `test.data`, etc. files, contains 18774 documents because it was produced by a script that removes single-word and empty documents during the rainbow-to-MATLAB conversion.

Here, we will pre-process the data from scratch and create the `train.data`, `test.data`, etc.

First, let's download the `20news-bydate` dataset using the link provided above and load it in the following format:
```json
{
    "data": ["doc1", "doc2", .... ],
    "target": ["label_id1", "label_id2", ..... ],
    "target_names": ["label_name1", "label_name2", ..... ],
}
```

In [68]:
# This was obtained from the `.map` files

classes_to_idx = {
    'alt.atheism': 0,
    'comp.graphics': 1,
    'comp.os.ms-windows.misc': 2,
    'comp.sys.ibm.pc.hardware': 3,
    'comp.sys.mac.hardware': 4,
    'comp.windows.x': 5,
    'misc.forsale': 6,
    'rec.autos': 7,
    'rec.motorcycles': 8,
    'rec.sport.baseball': 9,
    'rec.sport.hockey': 10,
    'sci.crypt': 11,
    'sci.electronics': 12,
    'sci.med': 13,
    'sci.space': 14,
    'soc.religion.christian': 15,
    'talk.politics.guns': 16,
    'talk.politics.mideast': 17,
    'talk.politics.misc': 18,
    'talk.religion.misc': 19
}

In [69]:
import os

def create_dataset_dict(base_path):
    """
    Create a dictionary for a given folder (either 'train' or 'test')
    that contains 'data', 'target_names', and 'target'.
    """
    dataset = {
        'data': [],  # Will contain document contents
        'target_names': [],  # Will contain class labels
        'target': []  # Will contain corresponding label ids
    }

    # Traverse each folder (representing a class)
    for label_folder in os.listdir(base_path):
        label_folder_path = os.path.join(base_path, label_folder)
        if os.path.isdir(label_folder_path):
            # Add the label to the target_names
            dataset['target_names'].append(label_folder)
            
            # Get the label id from classes_to_idx
            label_id = classes_to_idx[label_folder]
            
            # Add the documents in this folder to the dataset
            for doc_file in os.listdir(label_folder_path):
                doc_file_path = os.path.join(label_folder_path, doc_file)
                if os.path.isfile(doc_file_path):
                    with open(doc_file_path, 'r', encoding='latin1') as f:
                        document = f.read()
                    dataset['data'].append(document)
                    dataset['target'].append(label_id)

    return dataset

# Set your paths to the train and test folders
train_folder = '20news-bydate/20news-bydate-train'
test_folder = '20news-bydate/20news-bydate-test'

# Create the trainset and testset
trainset = create_dataset_dict(train_folder)
testset = create_dataset_dict(test_folder)
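One thing to watch: `os.listdir` returns entries in arbitrary order, so a `target_names` list built in listing order is not guaranteed to line up with the ids in `classes_to_idx`. Looking up `target_names[label_id]`, as later cells and the per-class reports do, can then show the wrong newsgroup name. The sketch below shows one way to keep names and ids aligned; the directory listing and the `toy_` names are hypothetical and only a subset of the mapping is repeated.

```python
# A subset of the notebook's mapping, repeated so the sketch is self-contained.
toy_classes_to_idx = {"alt.atheism": 0, "comp.graphics": 1, "sci.space": 14}

# Hypothetical directory listing in an arbitrary order, as os.listdir may return it.
toy_listing = ["sci.space", "alt.atheism", "comp.graphics"]

# Order the names by their label id instead of by listing order.
toy_target_names = sorted(toy_listing, key=toy_classes_to_idx.__getitem__)

# An explicit id -> name lookup avoids depending on list position at all.
idx_to_class = {idx: name for name, idx in toy_classes_to_idx.items()}

print(toy_target_names)
print(idx_to_class[14])
```

With all 20 classes present, ordering by id makes position `i` in the list correspond to label id `i`.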

In [70]:
print(f"Document at index 10000 in trainset: {trainset['data'][10000]}")
print(f"First label in trainset: {trainset['target'][0]}")
print(f"First label name in trainset: {trainset['target_names'][0]}")

Document at index 10000 in trainset: From: news@cbnewsk.att.com
Subject: Re: Bible Unsuitable for New Christians
Organization: AT&T Bell Labs
Lines: 8

True.

Also read 2 Peter 3:16

Peter warns that the scriptures are often hard to understand by those who
are not learned on the subject.

Joe Moore

First label in trainset: 17
First label name in trainset: talk.politics.mideast


In [71]:
len(trainset['data']) + len(testset['data'])

18846

In [72]:
# Let's view random indices in the dataset

import random

samples = random.sample(range(len(trainset['data'])), 3)

for idx in samples:
    doc = trainset['data'][idx]
    label = trainset['target'][idx]
    
    print(doc)
    print(label)
    print(f'LABEL: {trainset["target_names"][label]} ----END----\n\n')


From: jake@bony1.bony.com (Jake Livni)
Subject: Re: Basil, opinions? (Re: Water on the brain)
Organization: The Department of Redundancy Department
Lines: 15

In article <1qmr5qINN5af@early-bird.think.com> shaig@Think.COM (Shai Guday) writes:

>The Litani river flows in a west-southwestern direction and indeed does
>not run through the buffer zone.  The Hasbani does flow into the Jordan
>but contrary to what our imaginative poster might write, there has been
>no increase in the inflow from this river that is not proportional to
>climatic changes in rainfall.

What did you have to go and bring THAT up for?  Now they're going to
say that Israel is stealing the RAIN, too....

-- 
Jake Livni  jake@bony1.bony.com           Ten years from now, George Bush will
American-Occupied New York                   have replaced Jimmy Carter as the
My opinions only - employer has no opinions.    standard of a failed President.

17
LABEL: soc.religion.christian ----END----


From: alan@apple.com (Alan M

Before training on the dataset, we first need to preprocess it to strip content that carries little signal for classification, such as header words, email addresses, and stop words.

In [73]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import download
from typing import List, Dict, Tuple
import numpy as np

download('punkt')  # for word_tokenize
download('stopwords')
download('wordnet')  # required by WordNetLemmatizer

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/prashanthjaganathan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/prashanthjaganathan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

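The function below preprocesses a document by:

- Converting it to lowercase.
- Removing email addresses and URLs.
- Removing stopwords (split on whitespace, before punctuation is stripped).
- Removing special characters and header words such as `from`, `subject`, and `lines`.
- Tokenizing the text with NLTK's `word_tokenize`.
- Lemmatizing each token as a verb and dropping purely numeric tokens.
- Discarding documents left with one word or none.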

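Lemmatization is used instead of stemming because it maps words to valid dictionary forms rather than truncated stems, which preserves meaning and keeps the vocabulary interpretable for text classification. Because every token is lemmatized as a verb (`pos="v"`), a few nouns are mapped oddly; in the sample output further down, `feed` becomes `fee`.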
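The cleaning steps listed above can be illustrated on a single sentence with only the standard library. This is a simplified sketch, not the notebook's function: the stop word list is a tiny made-up set, lemmatization is omitted, and punctuation is stripped before stop words are filtered so that a stop word followed by a comma is still caught.

```python
import re
from typing import List

EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")
URL_RE = re.compile(r"http\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")

# Hypothetical, tiny stop word list used only for this illustration.
TOY_STOP_WORDS = {"the", "is", "at", "a"}


def toy_clean(text: str) -> List[str]:
    """Lowercase, strip emails/URLs and punctuation, then drop stop words and numbers."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub("", text)
    return [w for w in text.split() if w not in TOY_STOP_WORDS and not w.isdigit()]


print(toy_clean("The feed is at http://example.com, mail joe@example.org by 1993!"))
# → ['feed', 'mail', 'by']
```

The order of the steps matters: filtering stop words on raw whitespace tokens leaves behind words that still carry attached punctuation.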

In [74]:
def preprocess_document(document: str):
    """Remove unnecessary words from the document"""

    # Convert to lower case
    document = document.lower()

    # Remove emails, URLs, etc.
    document = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '', document)
    document = re.sub(r'http\S+', '', document)

    # Remove stop words
    words = document.strip().split()
    stop_words = set(stopwords.words('english'))
    stop_words.update(['forward', 'reply'])  # Add words to stop_words set
    words = [word for word in words if word not in stop_words]
    document = ' '.join(words)

    # Remove special characters like -, <>, etc.
    document = re.sub(r'[^a-zA-Z0-9\s]', '', document)  # Remove non-alphanumeric characters except spaces

    # Remove words like From, Subject, Originator, Lines, Nntp-Posting-Host, Organization
    unwanted_words = ['from', 'subject', 'originator', 'lines', 'nntppostinghost', 'organization', 're']
    document = ' '.join([word for word in document.split() if word not in unwanted_words])

    # Tokenize the document (split into words)
    words = word_tokenize(document)

    # Perform lemmatization (treating every token as a verb) and drop purely numeric tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words if not word.isdigit()]

    # Remove single word and empty docs
    if len(lemmatized_words) <= 1:
        return None, None  # keep the two-value return shape so callers can always unpack it

    # Reconstruct the document
    cleaned_document = ' '.join(lemmatized_words)

    return lemmatized_words, cleaned_document

Let's test our preprocessor on one document:

In [75]:
document = """From: Mark.Prado@p2.f349.n109.z1.permanet.org (Mark Prado)
Subject: Sixty-two thousand (was Re: How many read sci.space?)
Lines: 32


Reply address: mark.prado@permanet.org

If anyone knows anyone else who would like to get sci.space,
but doesn't have an Internet feed (or has a cryptic Internet
feed), I would be willing to feed it to them.  I have a nice
offline message reader/editor, an automated modem "mailer"
program which will pick up mail bundles (quickly and easily),
and an INSTALL.EXE to set them up painlessly.  No charge for
the sci.space feed, though you have to dial Washington, D.C.
This is NOT a BBS -- it's a store & forward system for mail
bundles, with minimum connect times.  (I'm used to overseas
calls.)  (This is not an offer for a free feed for any other
particular newsgroups.)  Speeds of up to 14400 (v32bis) are
supported.  VIP's might be offered other free services, such
as Internet address and other functionality.

I get my feed from UUNET and run a 4-line hub.  I've been
hubbing for years -- I have an extremely reliable hub.

The software I provide runs under MS-DOS (and OS/2 and Windows
as a DOS box).  Other, compatible software packages exist for
the MacIntosh and Unix.

Any responses should be private and go to:  
mark.prado@permanet.org

(By the way, to all, my apologies for the public traffic on my
glib question.  I really didn't expect public replys.  But thanks
to Bill Higgins for the interesting statistics and the lead.)

 * Origin: PerManNet FTSC <=> Internet gateway (1:109/349.2)
"""

tokenized_doc, document = preprocess_document(document)
document

'mark prado sixtytwo thousand be many read scispace address anyone know anyone else would like get scispace internet fee or cryptic internet fee would will fee them nice offline message readereditor automate modem mailer program pick mail bundle quickly easily installexe set painlessly charge scispace fee though dial washington dc bbs store system mail bundle minimum connect time im use overseas call this offer free fee particular newsgroups speed v32bis support vips might offer free service internet address functionality get fee uunet run 4line hub hubbing years extremely reliable hub software provide run msdos and os2 windows do box other compatible software package exist macintosh unix responses private go to by way all apologies public traffic glib question really expect public reply thank bill higgins interest statistics lead origin permannet ftsc internet gateway'

## Let's preprocess the dataset

In [76]:
def preprocess_dataset(documents, labels):
    processed_labels = []
    processed_tokenized_docs = []
    processed_docs: List[str] = []
    for document, label in zip(documents, labels):
        tokenized_doc, document = preprocess_document(document)
        if document is not None:
            processed_docs.append(document)
            processed_labels.append(label)
            processed_tokenized_docs.append(tokenized_doc)

    return processed_tokenized_docs, processed_docs, processed_labels


In [77]:
trainset['tokenized_data'], trainset['data'], trainset['target'] = preprocess_dataset(trainset['data'], trainset['target'])
testset['tokenized_data'], testset['data'], testset['target'] = preprocess_dataset(testset['data'], testset['target'])

### Word Representations

Now that the dataset is pre-processed, let's convert the documents into numeric vector representations.

#### TF-IDF

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorized_train_corpus = tfidf_vectorizer.fit_transform(trainset['data'])
tfidf_vectorized_test_corpus = tfidf_vectorizer.transform(testset['data'])
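To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a made-up corpus. The formula follows scikit-learn's documented defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); the whitespace tokenization, the toy corpus, and the helper name `toy_tfidf` are simplifications introduced for illustration.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
toy_docs = [
    "space shuttle launch",
    "hockey game tonight",
    "space station launch window",
]


def toy_tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


toy_weights = toy_tfidf(toy_docs)
for doc, vec in zip(toy_docs, toy_weights):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first document, `shuttle` (which appears in one document) receives a higher weight than `space` (which appears in two). Note that the IDF statistics come from the training corpus only, which is why the notebook calls `fit_transform` on the training set and `transform` on the test set.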

#### Trained Word2Vec

In [79]:
from gensim.models.word2vec import Word2Vec

word2vec_vectorizer = Word2Vec(sentences=trainset['tokenized_data'], vector_size=300, window=10)

def get_document_vector(words, model):
    valid_words = [word for word in words if word in model.wv]
    
    if not valid_words:
        return np.zeros(model.vector_size)
    
    word_vectors = np.array([model.wv[word] for word in valid_words])
    document_vector = np.mean(word_vectors, axis=0)
    # document_vector = np.linalg.norm(word_vectors, axis=0)
    
    return document_vector


word2vec_vectorized_train_corpus = np.array([get_document_vector(doc, word2vec_vectorizer) for doc in trainset['tokenized_data']])
word2vec_vectorized_test_corpus = np.array([get_document_vector(doc, word2vec_vectorizer) for doc in testset['tokenized_data']])
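Averaging word vectors gives each document one fixed-length vector, but it also discards word order. The sketch below shows this with hypothetical 3-dimensional vectors and a pure-Python stand-in for `get_document_vector`; the numbers are invented for illustration.

```python
# Hypothetical 3-dimensional word vectors, invented for this illustration.
toy_word_vectors = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man": [2.0, 0.0, 0.0],
}


def toy_mean_vector(words, vectors, dim=3):
    """Average the vectors of known words; fall back to zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


a = toy_mean_vector(["dog", "bites", "man"], toy_word_vectors)
b = toy_mean_vector(["man", "bites", "dog"], toy_word_vectors)
print(a, b, a == b)
# → [1.0, 1.0, 1.0] [1.0, 1.0, 1.0] True
```

Two documents with the same words in a different order therefore get the same representation, which is worth keeping in mind when these vectors are later fed to a convolutional model.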

#### Pre-trained Word2Vec

In [80]:
import gensim
import numpy as np

model_path = 'GoogleNews-vectors-negative300.bin'

word2vec_word_embeddings = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
words = word2vec_word_embeddings.index_to_key
pretrained_word2vec_word_embeddings = {}
for word in words:
    pretrained_word2vec_word_embeddings[word] = word2vec_word_embeddings[word]

# Function to get document vectors using pretrained embeddings
def get_document_vector_with_pretrained_embeddings(words, pretrained_model):
    valid_words = [word for word in words if word in pretrained_model]
    
    if not valid_words:
        return np.zeros(len(pretrained_model['the']))  # Default to zero vector (length of the word vector)
    
    word_vectors = np.array([pretrained_model[word] for word in valid_words])
    document_vector = np.mean(word_vectors, axis=0)
    # document_vector = np.max(word_vectors, axis=0)
    # document_vector = np.linalg.norm(word_vectors, axis=0)
    # document_vector = np.sum(word_vectors, axis=0)

    return document_vector

pretrained_word2vec_vectorized_train_corpus = np.array([get_document_vector_with_pretrained_embeddings(doc, pretrained_word2vec_word_embeddings) for doc in trainset['tokenized_data']])
pretrained_word2vec_vectorized_test_corpus = np.array([get_document_vector_with_pretrained_embeddings(doc, pretrained_word2vec_word_embeddings) for doc in testset['tokenized_data']])

print(pretrained_word2vec_vectorized_train_corpus[:3])
print(pretrained_word2vec_vectorized_test_corpus.shape)

[[ 3.22478935e-02  2.51459666e-02  3.05642430e-02  8.23636875e-02
  -4.10373919e-02 -5.55739179e-02  3.24003361e-02 -7.62371346e-02
   9.44162011e-02  7.79554099e-02 -2.67080665e-02 -1.20082222e-01
  -7.26899058e-02  6.18105382e-02 -1.09070711e-01  7.62318373e-02
  -6.11698180e-02  9.55687612e-02  1.29045649e-02 -1.28017247e-01
   1.44207412e-02  1.46352900e-02  7.40464032e-02  3.07931262e-03
   2.36822339e-03 -3.83814834e-02 -3.44493836e-02  3.32995281e-02
   8.54092166e-02 -7.39862770e-02  4.70619574e-02 -7.33743310e-02
  -8.75527859e-02  1.66909676e-02 -7.54641965e-02 -2.76427045e-02
   8.26629326e-02  7.36553594e-02  3.39817666e-02  4.08144444e-02
   9.98479575e-02  7.26093352e-03  1.56373844e-01 -4.23527695e-02
   3.28020267e-02 -5.08261919e-02 -1.83327682e-02 -3.50073650e-02
  -7.87057206e-02  8.31426214e-03  2.51607061e-03  6.23758556e-03
   3.80729027e-02  9.29526892e-03 -6.60566315e-02 -1.84292104e-02
  -8.05564821e-02 -2.59923842e-02 -4.06723022e-02 -1.09897055e-01
  -6.17178

## Define and train the models

In [81]:
def train(model, X_train, Y_train):
    model.fit(X_train, Y_train)
    return model

#### 1. Naive Bayes Classifier

**NOTE:** We can't use word2vec embeddings with `MultinomialNB`: it expects non-negative features such as counts or TF-IDF weights, whereas word2vec vectors contain negative values.
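The reason negative values are a problem can be seen from the smoothed multinomial estimate, in which each feature's probability within a class is proportional to its summed value plus a smoothing constant. The sketch below uses made-up sums and a hypothetical helper, `toy_feature_log_probs`; it is not scikit-learn's implementation.

```python
import math


def toy_feature_log_probs(class_feature_sums, alpha=1.0):
    """Smoothed multinomial estimate: theta_i = (N_i + alpha) / (sum(N) + alpha * n)."""
    n_features = len(class_feature_sums)
    total = sum(class_feature_sums) + alpha * n_features
    return [math.log((value + alpha) / total) for value in class_feature_sums]


# Non-negative sums (counts or TF-IDF weights) give valid log-probabilities.
print(toy_feature_log_probs([3.0, 0.5, 1.5]))

# Summed embedding components can be negative, which makes the estimate undefined.
try:
    toy_feature_log_probs([3.0, -2.5, 1.5])
except ValueError as err:
    print("cannot take the log:", err)
```

With a negative sum, the estimated "probability" itself becomes negative, so its logarithm does not exist.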

In [82]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes_classifier_tfidf = MultinomialNB()
naive_bayes_classifier_tfidf = train(naive_bayes_classifier_tfidf, tfidf_vectorized_train_corpus, trainset['target'])

#### 2. Logistic Regression

In [83]:
from sklearn.linear_model import LogisticRegression

# using tf-idf
logistic_regression_clf_tfidf = LogisticRegression()
logistic_regression_clf_tfidf = train(logistic_regression_clf_tfidf, tfidf_vectorized_train_corpus, trainset['target'])

In [84]:
from sklearn.linear_model import LogisticRegression

# using word2vec
logistic_regression_clf_word2vec = LogisticRegression()
logistic_regression_clf_word2vec = train(logistic_regression_clf_word2vec, word2vec_vectorized_train_corpus, trainset['target'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [85]:
from sklearn.linear_model import LogisticRegression

# using pre-trained word2vec
logistic_regression_clf_pretrained_word2vec = LogisticRegression()
logistic_regression_clf_pretrained_word2vec = train(logistic_regression_clf_pretrained_word2vec, pretrained_word2vec_vectorized_train_corpus, trainset['target'])

#### 3. CNN Model

In [86]:
from keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Embedding, GlobalMaxPool1D, Dropout, BatchNormalization, Input, Reshape
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam, SGD

In [87]:
def create_cnn_model(input_shape, num_classes=20):
    """Create a CNN model for text classification with improvements"""
    model = Sequential()
    model.add(Input(shape=input_shape))
    model.add(Reshape((input_shape[0], 1)))


    # 4 convolutional blocks: filter counts decrease from 256 to 32, kernel sizes are 5, 5, 3, 3
    model.add(Conv1D(256, 5, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D())
    
    model.add(Conv1D(128, 5, padding='same', activation='relu'))  
    model.add(BatchNormalization())
    model.add(MaxPooling1D())
    
    model.add(Conv1D(64, 3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D())
    
    model.add(Conv1D(32, 3, padding='same', activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D())

    model.add(GlobalMaxPool1D())
    model.add(Flatten())
    
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.5))  # Dropout layer to prevent overfitting


    model.add(Dense(num_classes, activation='softmax'))  # Softmax for multi-class classification
    model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.01, momentum=0.9), metrics=['accuracy', Precision(), Recall()])
    
    return model

lr_scheduler = ReduceLROnPlateau(monitor='loss', factor=0.1, patience=3, verbose=1, min_lr=0.0001)
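The `lr_scheduler` defined above lowers the learning rate when the monitored loss stops improving. The following is a simplified pure-Python reading of that behaviour, not the Keras implementation: it ignores cooldown, and `min_delta=1e-4` is assumed to match the callback's default. The function name is hypothetical.

```python
def simulate_reduce_on_plateau(losses, lr=0.01, factor=0.1, patience=3,
                               min_lr=0.0001, min_delta=1e-4):
    """Return the learning rate after each epoch's check (simplified, no cooldown)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best - min_delta:
            # The loss improved by more than min_delta: reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Multiply the rate by `factor`, but never go below `min_lr`.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# A loss that keeps improving slightly never triggers a reduction.
print(simulate_reduce_on_plateau([2.9946, 2.9924, 2.9914, 2.9910, 2.9908]))

# A flat loss triggers one reduction after three epochs without improvement.
print(simulate_reduce_on_plateau([1.0, 1.0, 1.0, 1.0, 1.0]))
```

Under these assumptions, a loss that creeps down by a few ten-thousandths per epoch counts as improving, so the learning rate stays at its initial value.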

In [88]:
maxlen = max(tfidf_vectorized_test_corpus.getnnz(axis=1))
maxlen

2032

In [89]:
tfidf_vectorized_train_corpus_padded = pad_sequences(tfidf_vectorized_train_corpus.toarray(), maxlen)
tfidf_vectorized_test_corpus_padded = pad_sequences(tfidf_vectorized_test_corpus.toarray(), maxlen)

tfidf_vectorized_train_corpus_padded = tfidf_vectorized_train_corpus_padded.reshape(tfidf_vectorized_train_corpus_padded.shape[0], tfidf_vectorized_train_corpus_padded.shape[1], 1)
tfidf_vectorized_test_corpus_padded = tfidf_vectorized_test_corpus_padded.reshape(tfidf_vectorized_test_corpus_padded.shape[0], tfidf_vectorized_test_corpus_padded.shape[1], 1)



# Convert the target labels to one-hot encoding
trainset_target = to_categorical(trainset['target'], num_classes=20)
testset_target = to_categorical(testset['target'], num_classes=20)

In [90]:

tfidf_cnn = create_cnn_model(input_shape=(maxlen, ))

In [91]:
print(f"Train target shape: {trainset_target.shape}")
print(f"Test target shape: {testset_target.shape}")

Train target shape: (11314, 20)
Test target shape: (7532, 20)


In [92]:
tfidf_cnn.fit(tfidf_vectorized_train_corpus_padded, trainset_target, epochs=5, batch_size=128, verbose=2, callbacks=[lr_scheduler])

Epoch 1/5
89/89 - 89s - 997ms/step - accuracy: 0.0488 - loss: 2.9946 - precision_8: 0.0000e+00 - recall_8: 0.0000e+00 - learning_rate: 0.0100
Epoch 2/5
89/89 - 92s - 1s/step - accuracy: 0.0508 - loss: 2.9924 - precision_8: 0.0000e+00 - recall_8: 0.0000e+00 - learning_rate: 0.0100
Epoch 3/5
89/89 - 93s - 1s/step - accuracy: 0.0483 - loss: 2.9914 - precision_8: 0.0000e+00 - recall_8: 0.0000e+00 - learning_rate: 0.0100
Epoch 4/5
89/89 - 94s - 1s/step - accuracy: 0.0492 - loss: 2.9910 - precision_8: 0.0000e+00 - recall_8: 0.0000e+00 - learning_rate: 0.0100
Epoch 5/5
89/89 - 96s - 1s/step - accuracy: 0.0492 - loss: 2.9908 - precision_8: 0.0000e+00 - recall_8: 0.0000e+00 - learning_rate: 0.0100


<keras.src.callbacks.history.History at 0x310db9190>
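The log above stalls at a loss near 2.99 and an accuracy near 0.05, which is what a model that predicts all 20 classes with equal probability would produce. The short calculation below checks that arithmetic. It also illustrates one plausible contributor, offered as an assumption rather than a confirmed diagnosis: `pad_sequences` defaults to `dtype='int32'` and `truncating='pre'`, so TF-IDF weights below 1 would be cast to 0 and only the last `maxlen` vocabulary columns would be kept. The row of weights is made up.

```python
import math

num_classes = 20

# Cross-entropy of a uniform prediction over 20 classes, and the matching accuracy.
chance_loss = -math.log(1 / num_classes)
chance_accuracy = 1 / num_classes
print(f"chance loss: {chance_loss:.4f}, chance accuracy: {chance_accuracy:.2f}")

# Casting TF-IDF-sized weights to integers, as an int32 conversion would, zeroes them out.
toy_tfidf_row = [0.0, 0.12, 0.47, 0.0, 0.86]  # made-up weights in [0, 1)
as_int = [int(value) for value in toy_tfidf_row]
print(as_int)
```

The zero precision and recall are consistent with the same picture: those Keras metrics count a prediction as positive only above a threshold of 0.5 by default, and near-uniform softmax outputs of about 0.05 never reach it.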

#### CNN + Word2Vec

In [93]:
maxlen = 300

In [94]:
word2vec_cnn = create_cnn_model(input_shape=(maxlen, ))

In [95]:
print(word2vec_vectorized_train_corpus.shape)
print(f"Target labels shape: {np.array(trainset['target']).shape}")

(11314, 300)
Target labels shape: (11314,)


In [None]:
# word2vec_vectorized_train_corpus is already (num_samples, embedding_dim): one averaged
# 300-dimensional vector per document. This reshape therefore leaves the shape unchanged;
# the model's Reshape layer later adds the channel axis, so the CNN convolves over the
# 300 embedding dimensions rather than over a sequence of words.
word2vec_vectorized_train_corpus = word2vec_vectorized_train_corpus.reshape(
    word2vec_vectorized_train_corpus.shape[0],  # number of samples
    maxlen,  # word2vec embedding dimension (300)
)

print(word2vec_vectorized_train_corpus.shape)  # (num_samples, embedding_dim)


(11314, 300)


In [97]:
word2vec_cnn.fit(word2vec_vectorized_train_corpus, trainset_target, epochs=30, batch_size=128, verbose=2)

Epoch 1/30
89/89 - 14s - 154ms/step - accuracy: 0.2295 - loss: 2.4275 - precision_9: 0.5028 - recall_9: 0.0399
Epoch 2/30
89/89 - 13s - 148ms/step - accuracy: 0.3760 - loss: 1.8183 - precision_9: 0.6238 - recall_9: 0.1523
Epoch 3/30
89/89 - 13s - 149ms/step - accuracy: 0.4255 - loss: 1.6766 - precision_9: 0.6589 - recall_9: 0.2091
Epoch 4/30
89/89 - 14s - 155ms/step - accuracy: 0.4386 - loss: 1.6126 - precision_9: 0.6585 - recall_9: 0.2337
Epoch 5/30
89/89 - 15s - 164ms/step - accuracy: 0.4646 - loss: 1.5601 - precision_9: 0.6726 - recall_9: 0.2614
Epoch 6/30
89/89 - 14s - 161ms/step - accuracy: 0.4831 - loss: 1.5057 - precision_9: 0.6949 - recall_9: 0.2907
Epoch 7/30
89/89 - 14s - 160ms/step - accuracy: 0.4906 - loss: 1.4815 - precision_9: 0.6843 - recall_9: 0.3010
Epoch 8/30
89/89 - 14s - 163ms/step - accuracy: 0.5079 - loss: 1.4296 - precision_9: 0.7110 - recall_9: 0.3242
Epoch 9/30
89/89 - 15s - 165ms/step - accuracy: 0.5095 - loss: 1.4084 - precision_9: 0.7046 - recall_9: 0.3318
E

<keras.src.callbacks.history.History at 0x478b53210>

#### CNN + Pre-trained Word2Vec

In [98]:
pretrained_word2vec_cnn = create_cnn_model(input_shape=(maxlen, ))

pretrained_word2vec_vectorized_train_corpus = pretrained_word2vec_vectorized_train_corpus.reshape(
    pretrained_word2vec_vectorized_train_corpus.shape[0],  # number of samples
    maxlen,  # word2vec embedding dimension (e.g., 300)
)

print(pretrained_word2vec_vectorized_train_corpus.shape)  # (num_samples, embedding_dim)


(11314, 300)


In [99]:
pretrained_word2vec_cnn.fit(pretrained_word2vec_vectorized_train_corpus, trainset_target, epochs=30, batch_size=128, verbose=2)

Epoch 1/30
89/89 - 14s - 156ms/step - accuracy: 0.2293 - loss: 2.4873 - precision_10: 0.5676 - recall_10: 0.0390
Epoch 2/30
89/89 - 13s - 150ms/step - accuracy: 0.4147 - loss: 1.7158 - precision_10: 0.6850 - recall_10: 0.2022
Epoch 3/30
89/89 - 13s - 151ms/step - accuracy: 0.4915 - loss: 1.5000 - precision_10: 0.7225 - recall_10: 0.2929
Epoch 4/30
89/89 - 14s - 153ms/step - accuracy: 0.5289 - loss: 1.3790 - precision_10: 0.7381 - recall_10: 0.3552
Epoch 5/30
89/89 - 14s - 154ms/step - accuracy: 0.5589 - loss: 1.3021 - precision_10: 0.7580 - recall_10: 0.3923
Epoch 6/30
89/89 - 14s - 153ms/step - accuracy: 0.5802 - loss: 1.2365 - precision_10: 0.7668 - recall_10: 0.4229
Epoch 7/30
89/89 - 14s - 153ms/step - accuracy: 0.6036 - loss: 1.1815 - precision_10: 0.7680 - recall_10: 0.4520
Epoch 8/30
89/89 - 14s - 152ms/step - accuracy: 0.6194 - loss: 1.1506 - precision_10: 0.7851 - recall_10: 0.4739
Epoch 9/30
89/89 - 13s - 151ms/step - accuracy: 0.6307 - loss: 1.0959 - precision_10: 0.7857 - r

<keras.src.callbacks.history.History at 0x4344f2610>

## Let's evaluate our models

In [100]:
import pandas as pd
from sklearn.metrics import classification_report

def evauate_model(model, X_test, Y_test, target_names):
    """Generates a report with metrics like precision, recall, f1, support for each class"""
    y_pred = model.predict(X_test)
    report = classification_report(Y_test, y_pred, target_names=target_names, output_dict=True)
    report = pd.DataFrame(report).transpose()
    return report
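The per-class numbers in the report can be reproduced by hand from true and predicted labels. The sketch below does this for a single class with made-up labels; `toy_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def toy_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1, and support for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Support is the number of true instances of the class.
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": tp + fn}


# Made-up labels for illustration.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
metrics = toy_class_metrics(y_true, y_pred, label=0)
print(metrics)
```

For class 0, every prediction of that class is correct (precision 1.0), but one of its three true instances is missed (recall 2/3).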

#### 1. Evaluating Naive Bayes Classifier with TF-IDF Vectorizer

In [101]:
report = evauate_model(naive_bayes_classifier_tfidf, tfidf_vectorized_test_corpus, testset['target'], testset['target_names'])
report

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| talk.politics.mideast | 0.847368 | 0.504702 | 0.632613 | 319.0 |
| rec.autos | 0.786517 | 0.719794 | 0.751678 | 389.0 |
| comp.sys.mac.hardware | 0.781609 | 0.690355 | 0.733154 | 394.0 |
| alt.atheism | 0.689579 | 0.793367 | 0.737841 | 392.0 |
| rec.sport.baseball | 0.859375 | 0.857143 | 0.858257 | 385.0 |
| comp.os.ms-windows.misc | 0.893064 | 0.782278 | 0.834008 | 395.0 |
| rec.sport.hockey | 0.945763 | 0.715385 | 0.814599 | 390.0 |
| sci.crypt | 0.874704 | 0.934343 | 0.903541 | 396.0 |
| sci.med | 0.926829 | 0.954774 | 0.940594 | 398.0 |
| talk.politics.misc | 0.946835 | 0.942065 | 0.944444 | 397.0 |


#### 2. Evaluating Logistic Regression with TF-IDF Vectorizer

In [102]:
report = evauate_model(logistic_regression_clf_tfidf, tfidf_vectorized_test_corpus, testset['target'], testset['target_names'])
report

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| talk.politics.mideast | 0.735915 | 0.655172 | 0.693201 | 319.0 |
| rec.autos | 0.693182 | 0.784062 | 0.735826 | 389.0 |
| comp.sys.mac.hardware | 0.753247 | 0.736041 | 0.744544 | 394.0 |
| alt.atheism | 0.717391 | 0.757653 | 0.736973 | 392.0 |
| rec.sport.baseball | 0.86059 | 0.833766 | 0.846966 | 385.0 |
| comp.os.ms-windows.misc | 0.850144 | 0.746835 | 0.795148 | 395.0 |
| rec.sport.hockey | 0.768349 | 0.858974 | 0.811138 | 390.0 |
| sci.crypt | 0.911227 | 0.881313 | 0.896021 | 396.0 |
| sci.med | 0.955959 | 0.927136 | 0.941327 | 398.0 |
| talk.politics.misc | 0.924812 | 0.929471 | 0.927136 | 397.0 |


#### 3. Evaluating Logistic Regression with Word2Vec Vectorizer

In [103]:
report = evauate_model(logistic_regression_clf_word2vec, word2vec_vectorized_test_corpus, testset['target'], testset['target_names'])
report

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| talk.politics.mideast | 0.411594 | 0.445141 | 0.427711 | 319.0 |
| rec.autos | 0.507576 | 0.51671 | 0.512102 | 389.0 |
| comp.sys.mac.hardware | 0.523923 | 0.555838 | 0.539409 | 394.0 |
| alt.atheism | 0.507418 | 0.436224 | 0.469136 | 392.0 |
| rec.sport.baseball | 0.475177 | 0.348052 | 0.401799 | 385.0 |
| comp.os.ms-windows.misc | 0.597964 | 0.594937 | 0.596447 | 395.0 |
| rec.sport.hockey | 0.66416 | 0.679487 | 0.671736 | 390.0 |
| sci.crypt | 0.506234 | 0.512626 | 0.50941 | 396.0 |
| sci.med | 0.45082 | 0.552764 | 0.496614 | 398.0 |
| talk.politics.misc | 0.565217 | 0.556675 | 0.560914 | 397.0 |


#### 4. Evaluating Logistic Regression with Pre-trained Word2Vec Vectorizer

In [104]:
report = evauate_model(logistic_regression_clf_pretrained_word2vec, pretrained_word2vec_vectorized_test_corpus, testset['target'], testset['target_names'])
report

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| talk.politics.mideast | 0.489676 | 0.520376 | 0.504559 | 319.0 |
| rec.autos | 0.589327 | 0.652956 | 0.619512 | 389.0 |
| comp.sys.mac.hardware | 0.561743 | 0.588832 | 0.574969 | 394.0 |
| alt.atheism | 0.568063 | 0.553571 | 0.560724 | 392.0 |
| rec.sport.baseball | 0.596491 | 0.52987 | 0.56121 | 385.0 |
| comp.os.ms-windows.misc | 0.653846 | 0.64557 | 0.649682 | 395.0 |
| rec.sport.hockey | 0.738903 | 0.725641 | 0.732212 | 390.0 |
| sci.crypt | 0.792208 | 0.770202 | 0.78105 | 396.0 |
| sci.med | 0.781022 | 0.806533 | 0.793572 | 398.0 |
| talk.politics.misc | 0.855037 | 0.876574 | 0.865672 | 397.0 |


#### 5. Evaluating CNN with TF-IDF

In [105]:
test_loss, test_accuracy, test_precision, test_recall = tfidf_cnn.evaluate(tfidf_vectorized_test_corpus_padded, testset_target, verbose=1)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")
print(f"Test Precision: {test_precision}")
print(f"Test Recall: {test_recall}")

236/236 ━━━━━━━━━━━━━━━━━━━━ 16s 65ms/step - accuracy: 0.0593 - loss: 2.9857 - precision_8: 0.0000e+00 - recall_8: 0.0000e+00
Test Loss: 2.9903924465179443
Test Accuracy: 0.05297397822141647
Test Precision: 0.0
Test Recall: 0.0


#### 6. Evaluating CNN with Word2Vec

In [106]:
test_loss, test_accuracy, test_precision, test_recall = word2vec_cnn.evaluate(word2vec_vectorized_test_corpus, testset_target, verbose=1)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")
print(f"Test Precision: {test_precision}")
print(f"Test Recall: {test_recall}")

236/236 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.5328 - loss: 1.6312 - precision_9: 0.6419 - recall_9: 0.4516
Test Loss: 1.80009925365448
Test Accuracy: 0.4779607057571411
Test Precision: 0.5769230723381042
Test Recall: 0.4002920985221863


#### 7. Evaluating CNN with Pretrained Word2Vec

In [107]:
test_loss, test_accuracy, test_precision, test_recall = pretrained_word2vec_cnn.evaluate(pretrained_word2vec_vectorized_test_corpus, testset_target, verbose=1)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")
print(f"Test Precision: {test_precision}")
print(f"Test Recall: {test_recall}")

236/236 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.5639 - loss: 1.8119 - precision_10: 0.6400 - recall_10: 0.5250
Test Loss: 1.6807277202606201
Test Accuracy: 0.5841742157936096
Test Precision: 0.6551558971405029
Test Recall: 0.5440785884857178
