# **Text Classification using 20 Newsgroups Dataset**

Author: Mohamed Oussama NAJI

Date: March 13, 2024

## Introduction

In this notebook, we explore text classification on the 20 Newsgroups dataset, restricted to two categories (`alt.atheism` and `talk.religion.misc`). We compare four feature extraction techniques (CountVectorizer, TfidfVectorizer, Word2Vec, and Doc2Vec) with four classifiers (MultinomialNB, LogisticRegression, SVC, and DecisionTreeClassifier). A grid search with 5-fold cross-validation then finds the best combination of vectorizer and classifier.

## Table of Contents
1. Importing Libraries
2. Loading the Dataset
3. Custom Transformers
   - Word2VecTransformer
   - Doc2VecTransformer
4. Defining the Pipeline
5. Defining Grid Search Parameters
6. Executing Grid Search
7. Saving and Printing Results
8. Conclusion


## Importing Libraries <a id="importing-libraries"></a>

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.base import BaseEstimator, TransformerMixin


In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load 'punkt_tab' for word_tokenize

## Loading the Dataset <a id="loading-dataset"></a>


In [None]:
categories = ['alt.atheism', 'talk.religion.misc']
data = fetch_20newsgroups(subset='train', categories=categories)


## Custom Transformers <a id="custom-transformers"></a>


### Word2VecTransformer <a id="word2vec-transformer"></a>


In [None]:
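class Word2VecTransformer(BaseEstimator, TransformerMixin):
    """Represent each document as the mean of its Word2Vec word vectors."""

    def __init__(self, size=100, min_count=1):
        self.size = size
        self.min_count = min_count
        self.model = None

    def fit(self, X, y=None):
        # Train Word2Vec on the tokenized training documents
        sentences = [word_tokenize(doc) for doc in X]
        self.model = Word2Vec(sentences, vector_size=self.size, min_count=self.min_count)
        return self

    def transform(self, X):
        # Tokenize first: iterating over a raw string would yield characters, not words
        tokenized = [word_tokenize(doc) for doc in X]
        # Average the in-vocabulary word vectors; fall back to zeros if none are known
        return np.array([
            np.mean([self.model.wv[word] for word in words if word in self.model.wv]
                    or [np.zeros(self.size)], axis=0) for words in tokenized])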
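
The averaging step in `Word2VecTransformer` is meant to turn a list of word vectors into one fixed-length document vector. The sketch below isolates that step with NumPy only. The 3-dimensional `toy_vectors` and the helper `mean_pool` are hypothetical and stand in for a trained model's `wv` lookup.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration
toy_vectors = {
    "god": np.array([1.0, 0.0, 2.0]),
    "faith": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_known = mean_pool(["god", "and", "faith"], toy_vectors, dim)
doc_unknown = mean_pool(["xyzzy"], toy_vectors, dim)
print(doc_known)    # [2. 1. 1.]
print(doc_unknown)  # [0. 0. 0.]
```

Out-of-vocabulary tokens such as `"and"` are skipped, so they do not pull the average toward zero. Only a document with no known tokens maps to the zero vector.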


### Doc2VecTransformer <a id="doc2vec-transformer"></a>


In [None]:
class Doc2VecTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vector_size=100, min_count=1, epochs=40):
        self.vector_size = vector_size
        self.min_count = min_count
        self.epochs = epochs
        self.model = None

    def fit(self, X, y=None):
        tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(X)]
        self.model = Doc2Vec(tagged_data, vector_size=self.vector_size, min_count=self.min_count, epochs=self.epochs)
        return self

    def transform(self, X):
        # Lowercase as in fit() so inference sees the same token forms as training
        return np.array([self.model.infer_vector(word_tokenize(doc.lower())) for doc in X])


## Defining the Pipeline <a id="defining-pipeline"></a>


In [None]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),  # Placeholder vectorizer
    ('clf', LogisticRegression()),  # Placeholder classifier
])
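
The placeholder steps work because a `Pipeline` lets a whole step be replaced by name through `set_params`, which is the same mechanism a grid search relies on. A minimal sketch, using a separate hypothetical `demo_pipeline` so the notebook's `pipeline` is left untouched:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])

# Replace both steps by name, then set a nested parameter on the new vectorizer
demo_pipeline.set_params(vect=TfidfVectorizer(), clf=MultinomialNB())
demo_pipeline.set_params(vect__ngram_range=(1, 2))

vect_name = type(demo_pipeline.named_steps['vect']).__name__
clf_name = type(demo_pipeline.named_steps['clf']).__name__
print(vect_name, clf_name)  # TfidfVectorizer MultinomialNB
```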


## Defining Grid Search Parameters <a id="defining-grid-search-parameters"></a>



In [None]:
parameters = [
    {
        'vect': [CountVectorizer(), TfidfVectorizer()],
        'vect__ngram_range': [(1, 1), (1, 2)],  # Test both unigrams and bigrams
        'clf': [MultinomialNB(), LogisticRegression(), SVC(), DecisionTreeClassifier()],
    },
    {
        'vect': [Word2VecTransformer(), Doc2VecTransformer()],
        # MultinomialNB is excluded: it requires non-negative features, and embeddings can be negative
        'clf': [LogisticRegression(), SVC(), DecisionTreeClassifier()],
    }
]
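
To estimate the cost of the search before running it, the number of candidate configurations can be counted with `ParameterGrid`. In this sketch, strings stand in for the estimator objects because only the shape of the grid matters, and 5-fold cross-validation is assumed.

```python
from sklearn.model_selection import ParameterGrid

# Strings stand in for the estimator objects; only the grid shape matters here
grid = [
    {
        'vect': ['CountVectorizer', 'TfidfVectorizer'],
        'vect__ngram_range': [(1, 1), (1, 2)],
        'clf': ['MultinomialNB', 'LogisticRegression', 'SVC', 'DecisionTreeClassifier'],
    },
    {
        'vect': ['Word2VecTransformer', 'Doc2VecTransformer'],
        'clf': ['LogisticRegression', 'SVC', 'DecisionTreeClassifier'],
    },
]

n_candidates = len(ParameterGrid(grid))  # 2*2*4 from the first grid + 2*3 from the second
n_folds = 5
print(n_candidates)            # 22
print(n_candidates * n_folds)  # 110
```

With these counts, the search performs 110 cross-validation fits, plus one final refit of the best candidate when `refit` is left at its default.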


## Executing Grid Search <a id="executing-grid-search"></a>


In [None]:
grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
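
After fitting, `GridSearchCV` retrains the best candidate on all the data it was given (the default `refit=True`), so the search object can be used directly for prediction. The sketch below shows this on a tiny made-up corpus rather than on the newsgroups data. All names prefixed with `toy_` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: label 0 = space, label 1 = cooking
toy_docs = [
    "rocket launch orbit", "satellite orbit moon", "moon rocket engine",
    "launch satellite engine", "bake bread oven", "oven roast chicken",
    "bread butter recipe", "recipe roast butter",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
toy_grid = {
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'clf': [MultinomialNB(), LogisticRegression()],
}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_docs, toy_labels)

# best_estimator_ is the winning pipeline, refit on the full toy corpus
best = toy_search.best_estimator_
print(type(best.named_steps['vect']).__name__, type(best.named_steps['clf']).__name__)

toy_predictions = toy_search.predict(["rocket orbit", "bread oven"])
print(toy_predictions)  # [0 1]
```

On the real data, the same `predict` call could be applied to documents from `fetch_20newsgroups(subset='test', categories=categories)` to obtain a score on data the search never saw.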


## Saving and Printing Results <a id="saving-printing-results"></a>


In [None]:
def save_and_print_results(grid_search):
    results_df = pd.DataFrame(grid_search.cv_results_)
    selected_columns = ['rank_test_score', 'mean_test_score', 'std_test_score', 'param_vect', 'param_vect__ngram_range', 'param_clf']
    results_df = results_df[selected_columns].copy()
    results_df.columns = ['Rank', 'Mean Test Score', 'Std Test Score', 'Vectorizer', 'N-Gram Range', 'Classifier']

    tabular_data = results_df.to_string(index=False)

    best_params = grid_search.best_params_
    best_vect = best_params.get('vect', 'Vectorizer not specified')
    best_clf = best_params.get('clf', 'Classifier not specified')

    best_score = f"\nBest score: {grid_search.best_score_:.3f}"
    best_params_str = f"Best parameters set: Vectorizer: {best_vect}, Classifier: {best_clf}"
    tabular_data += best_score + '\n' + best_params_str

    with open('Oussama_Task0_Text_Classification.txt', 'w') as f:
        f.write(tabular_data)

    print(tabular_data)

save_and_print_results(grid_search)
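
The table is written in the order the candidates were evaluated, and the n-gram column is empty for the embedding-based rows, which have no `vect__ngram_range` parameter. A possible tidy-up is to sort by rank and label the missing values. The sketch below assumes missing entries appear as NaN in the DataFrame. The rows are made up for illustration and are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are not results from the notebook
toy_results = pd.DataFrame({
    'Rank': [2, 1, 3],
    'Mean Test Score': [0.80, 0.90, 0.70],
    'Vectorizer': ['CountVectorizer()', 'TfidfVectorizer()', 'Word2VecTransformer()'],
    'N-Gram Range': [(1, 1), (1, 2), np.nan],  # NaN: parameter not part of that grid
    'Classifier': ['MultinomialNB()', 'SVC()', 'SVC()'],
})

# Best candidate first
tidy = toy_results.sort_values('Rank').reset_index(drop=True)

# Keep existing n-gram ranges and mark the missing ones explicitly
has_ngram = tidy['N-Gram Range'].notna()
tidy['N-Gram Range'] = tidy['N-Gram Range'].where(has_ngram, 'n/a')

print(tidy.to_string(index=False))
```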


## Conclusion <a id="conclusion"></a>

In this notebook, we performed text classification on a subset of the 20 Newsgroups dataset using various feature extraction techniques and classifiers. We used CountVectorizer, TfidfVectorizer, Word2Vec, and Doc2Vec for feature extraction, and MultinomialNB, LogisticRegression, SVC, and DecisionTreeClassifier as classifiers.

We defined a pipeline with placeholders for the vectorizer and classifier, and then performed a grid search to find the best combination of parameters. The grid search results were saved to a text file and printed in a tabular format, including the best score and best parameter set.

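This analysis shows how a single pipeline and grid search can compare feature extraction techniques and classifiers under the same cross-validation protocol. Note that the search covered only the choice of vectorizer, the choice of classifier, and the n-gram range; the classifiers' own hyperparameters were left at their defaults, so tuning them may change the ranking.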

For further improvement, consider additional feature extraction techniques, preprocessing steps (e.g., stemming, lemmatization), and other classifiers or ensemble methods. You can also experiment with different subsets of the 20 Newsgroups dataset or apply this approach to other text classification tasks.