<a href="https://www.kaggle.com/code/jaymanvirk/matrix-factorization-nmf-vs-supervised-learning?scriptVersionId=145722092" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## BBC News Classification: Matrix Factorization vs Supervised Learning

**Jay Manvirk (Ivan Loginov)**<br/>University of Colorado, Boulder<br/>jay.manvirk@gmail.com

### Table of Contents

1. [Abstract](#abstract)
2. [Introduction](#introduction)
3. [Libraries and raw data](#libraries_data)
    - 3.1 [Libraries](#libraries)
    - 3.2 [Raw data](#raw_data)
4. [Exploratory Data Analysis](#eda)
    - 4.1 [Short Datasets Summary](#short_summary)
    - 4.2 [Number of Articles per Category](#articles_per_category)
    - 4.3 [Word Frequencies](#word_frequencies)
5. [Data Preprocessing](#data_preprocessing)
    - 5.1 [Text Cleaning](#text_cleaning)
    - 5.2 [TF-IDF Vectorization](#tfidf_vectorization)
6. [Data Modelling](#data_modelling)
    - 6.1 [Unsupervised Learning (NMF)](#ul_nmf)
    - 6.2 [Supervised Learning (SL)](#sl)
7. [Model Results Comparison between NMF and SL](#comparison)
    - 7.1 [Full Train Dataset](#full_train_dataset)
    - 7.2 [Samples of the Train Dataset](#samples_train_dataset)
8. [Submission Results](#submission_results)
9. [Conclusion](#conclusion)
10. [References](#references)

### 1. Abstract <a class="anchor" id="abstract"></a>

This study presents part of an analysis of the BBC News dataset, covering Exploratory Data Analysis (EDA) and preprocessing, followed by a performance comparison of Non-Negative Matrix Factorization (NMF) against several supervised learning (SL) algorithms. The dataset comprises article texts and their categories: business, sport, tech, politics and entertainment.

The results show that SL algorithms such as SVM and Random Forest (RF) scored better than NMF in terms of accuracy, but were clearly outperformed by NMF in terms of computational speed.

Additionally, NMF gave surprising results: on every sample size (50%, 20% and 10% of the train dataset) its test scores were on par with its train scores. The SVM and RF models, while more accurate than NMF across sample sizes, overfitted more noticeably as the data size decreased.

### 2. Introduction <a class="anchor" id="introduction"></a>

In this notebook we're going to look a bit closer at the model performance comparison between:
* Unsupervised Learning algorithm, concretely Non-Negative Matrix Factorization
* Supervised Learning algorithms: Logistic Regression, Random Forest and SVM

The choice of SL algorithms and model parameters is arbitrary and is mostly based on:
* the default values from the packages' documentation
* a small, arbitrary selection of the basic SL models most common in other NLP kernels on Kaggle

The notebook is divided into several sections that lead up to the final comparison of models.
* **EDA:**<br/>
    A couple of key visualizations to understand the distribution of the number of articles per category and the word frequencies. These help us decide whether to even out the number of articles per category and whether to remove stopwords and other characters that carry little information about an article's class.
* **Preprocess data for modelling:**<br/>
    Tokenization, lemmatization, stopwords removal. We're going to remove stopwords, convert texts into word-tokens and then convert those word-tokens to their base form. All of that is to improve the models' performance and reduce training time.
* **Unsupervised learning, NMF:**<br/>
    NMF training on the preprocessed text with several different parameters, to choose the best configuration for later comparison with SL. Mean accuracy scores are computed via 5-fold cross-validation.
* **Supervised learning:**<br/>
    Training of the aforementioned SL algorithms on the preprocessed data with different parameters, as in the previous section. Mean accuracy scores are computed via 5-fold cross-validation.
* **Comparison between NMF and SL:**<br/>
    The final comparison is presented as a table with the models' accuracy scores and runtimes. This section consists of 2 parts: results on 100% of the dataset and on selected samples of it, specifically 50%, 20% and 10%.

### 3. Libraries and raw data <a class="anchor" id="libraries_data"></a>

#### 3.1 Libraries <a class="anchor" id="libraries"></a>

In [1]:
# basics
import numpy as np
import itertools
import random
import os

# EDA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
import re
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Unsupervised learning
from sklearn.decomposition import NMF

# Supervised learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# helper functions
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

#### 3.2 Raw Data <a class="anchor" id="raw_data"></a>

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_data = pd.read_csv("/kaggle/input/learn-ai-bbc/BBC News Train.csv")
test_data = pd.read_csv("/kaggle/input/learn-ai-bbc/BBC News Test.csv")
sample_data = pd.read_csv("/kaggle/input/learn-ai-bbc/BBC News Sample Solution.csv")

### 4. Exploratory Data Analysis <a class="anchor" id="eda"></a>

#### 4.1 Short Datasets Summary <a class="anchor" id="short_summary"></a>

In [None]:
def print_short_summary(name, data):
    """
    Prints data head, shape and info.
    Args:
        name (str): name of dataset
        data (dataframe): dataset in a pd.DataFrame format
    """
    print(name)
    print('\n1. Data head:')
    print(data.head())
    print('\n2. Data shape: {}'.format(data.shape))
    print('\n3. Data info:')
    data.info()

In [None]:
print_short_summary('Train dataset', train_data)

In [None]:
print_short_summary('Test dataset',test_data)

In [None]:
print_short_summary('Sample dataset',sample_data)

#### 4.2 Number of Articles per Category <a class="anchor" id="articles_per_category"></a>

We can see that the number of articles per category doesn't vary much, so the training dataset is reasonably balanced. Otherwise, we would have needed either to even out the number of articles by downsampling the larger categories, or to choose models that are robust to class imbalance, together with more suitable metrics such as F1-score or AUC.
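To put a number on "doesn't vary much", one option is the ratio between the largest and the smallest class. The sketch below is only an illustration: `imbalance_ratio` and the toy labels are hypothetical and not part of the notebook; with the real data the labels would come from `train_data['Category']`.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Hypothetical toy labels, for illustration only
toy_labels = (['sport'] * 6 + ['business'] * 6 + ['tech'] * 5
              + ['politics'] * 5 + ['entertainment'] * 4)
counts, ratio = imbalance_ratio(toy_labels)
print(counts)
print('Imbalance ratio: {:.2f}'.format(ratio))
```

A ratio close to 1 means the classes are nearly even, in which case plain accuracy is a reasonable metric.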

In [None]:
# Plot histogram of number of articles per category
plt.figure(figsize=(16, 9))
sns.histplot(train_data, x = 'Category')
plt.title('Category count distribution')
plt.show()

#### 4.3 Word Frequencies <a class="anchor" id="word_frequencies"></a>

Here we're going to examine the texts more closely in terms of word importance. Prepositions, articles and conjunctions can be considered irrelevant in determining the category. From the text below we can also see that apostrophes have been replaced by spaces, leaving single letters untethered. These single letters can be discarded in model training as well, for the same reason of insignificance.
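The effect of the replaced apostrophes can be shown on a short string. This is a minimal sketch on a made-up sentence written in the style of the dataset (it is not taken from it); the regular expression `\b[a-z]\b` is one possible way to catch the stranded single letters.

```python
import re

# Hypothetical sentence imitating the dataset, where "company's" became "company s"
sample = "the company s shares fell after it said it wouldn t meet targets"

# \b[a-z]\b matches a single letter standing alone between word boundaries
single_letters = re.findall(r'\b[a-z]\b', sample)
# Remove the single letters, then collapse the leftover double spaces
cleaned = re.sub(r'\s+', ' ', re.sub(r'\b[a-z]\b', '', sample)).strip()

print(single_letters)
print(cleaned)
```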

In [None]:
# Print category and text as an example
print('Category: {}\n'.format(train_data['Category'][0]))
print('Text:\n{}'.format(train_data['Text'][0]))

In [None]:
# Plot top 25 words by frequency
w = train_data['Text'].str.split(expand=True).unstack().value_counts()
l = w[:25]/np.sum(w)*100
plt.figure(figsize=(16,9))
plt.bar(l.index, l.values)
plt.xlabel('Words')
plt.ylabel('Percentage of total word count (%)')
plt.title('Top 25 words by frequency')
plt.show()

### 5. Data Preprocessing <a class="anchor" id="data_preprocessing"></a>

#### 5.1 Text Cleaning <a class="anchor" id="text_cleaning"></a>

In [None]:
# Load English dataset of lemmas and stopwords
sp_process = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

In [None]:
def get_clean_text(text):
    """
    Returns lemmatized lower-case text without stopwords, single letters, digits and punctuation.
    Args:
        text (str): text of an article
    Returns:
        text (str): cleaned text
    """
    # Convert to lowercase
    text = text.lower()
    # Remove single letters, digits and punctuation
    # (the hyphen is escaped so that it is not read as a character range)
    pattern = r'\b[a-zA-Z]\b|\d+|[.,!?()\-:]'
    text = re.sub(pattern, '', text)
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Lemmatize the filtered tokens
    # (lemmatizing the unfiltered text would bring the stopwords back)
    doc = sp_process(' '.join(words))
    words = [token.lemma_ for token in doc]
    # Join the words back into a string
    text = ' '.join(words)
    
    return text


In [None]:
# Clean train and test datasets
npv = np.vectorize(get_clean_text)
train_clean_data = npv(train_data['Text'])
test_clean_data = npv(test_data['Text'])

In [None]:
# Print an example of the cleaned text
train_clean_data[0]

#### 5.2 TF-IDF Vectorization <a class="anchor" id="tfidf_vectorization"></a>

TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects how important a word is to an article within a collection of articles. This statistic is later used as the word weight in modelling. With these weights, TF-IDF is expected to improve the model's ability to find unique and relevant words within each article, and ultimately its accuracy.
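To make the weighting concrete, here is a minimal pure-Python sketch of the computation on three toy documents. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the sublinear term frequency, 1 + ln(tf), which to our understanding are what scikit-learn applies with `sublinear_tf = True` and default smoothing. The function `tfidf` and the toy documents are hypothetical, and the sketch leaves out `min_df`, stop words and bigrams.

```python
import math
from collections import Counter

def tfidf(docs, sublinear_tf=True):
    """Compute L2-normalised TF-IDF weights for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for doc in docs:
        row = {}
        for term, count in Counter(doc).items():
            tf_value = 1 + math.log(count) if sublinear_tf else count
            row[term] = tf_value * idf[term]
        # L2 normalisation so that every document vector has unit length
        norm = math.sqrt(sum(v * v for v in row.values()))
        weights.append({t: v / norm for t, v in row.items()})
    return weights

# Hypothetical toy corpus, for illustration only
toy_docs = [
    ['market', 'share', 'market'],
    ['match', 'goal', 'share'],
    ['market', 'goal'],
]
toy_weights = tfidf(toy_docs)
for row in toy_weights:
    print({t: round(v, 3) for t, v in row.items()})
```

In the second document, "match" appears in no other document, so it receives a larger weight than "goal" and "share", which each appear in two documents.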

In [None]:
# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(sublinear_tf = True
                             , min_df = 5
                             , stop_words = 'english'
                             , norm = 'l2'
                             , encoding = 'latin-1'
                             , ngram_range = (1,2)
                             )
tfidf_vect = vectorizer.fit(train_clean_data)
X_train = tfidf_vect.transform(train_clean_data)
X_test = tfidf_vect.transform(test_clean_data)

y_train = train_data['Category']
y_test = sample_data['Category']

### 6. Data Modelling <a class="anchor" id="data_modelling"></a>

Since we use GridSearchCV extensively in our modelling, there is no need to additionally split our train dataset into training and testing samples. Its built-in cross-validation already splits the training set into k parts, trains on (k-1) of them and tests on the remaining one, rotating through all k parts. We're going to use 5-fold cross-validation.
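The splitting logic can be sketched without any library. The helper `k_fold_indices` below is hypothetical and deliberately simplified: it produces plain, unshuffled folds, whereas GridSearchCV, as far as we know, uses stratified folds for classifiers so that each fold keeps roughly the same class proportions.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for plain, unshuffled k-fold splitting."""
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 12 hypothetical samples split into 5 folds
folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print('train size: {}, test: {}'.format(len(train_idx), test_idx))
```

Every sample is used for testing exactly once, and the reported mean test score is the average over the k held-out parts.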

In [None]:
def get_results_table(results):
    """
    Returns table of models' results sorted by accuracy and runtime
    Args:
        results (dataframe): table of models' results
    Returns:
        df (dataframe): pd.DataFrame with selected columns sorted by accuracy and runtime
    """
    columns = ['model','params','mean_fit_time','mean_score_time','mean_train_score', 'mean_test_score']
    df = results.copy()[columns]
    df['mean_runtime'] = df['mean_fit_time'] + df['mean_score_time']
    df = df[['model','params','mean_runtime','mean_train_score','mean_test_score']]
    df = df.sort_values(by=['mean_test_score','mean_runtime'], ascending=[False,True])
    
    return df

#### 6.1 Unsupervised Learning (NMF) <a class="anchor" id="ul_nmf"></a>

In this section we're going to train an unsupervised model, NMF, with different parameters.<br/>To avoid both overfitting and underfitting, we fit the vectorized features of the training dataset only and evaluate model performance on 100% of it.<br/>Additional experiments using only 50%, 20% and 10% of the training dataset are conducted in Section 7. Their purpose is to reveal which model needs less data to achieve similar results, in other words which one is more data-efficient.
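Because NMF never sees the labels, its components come in an arbitrary order, so component ids have to be matched to category names before accuracy can be computed. A minimal sketch of that matching, trying every assignment and keeping the best one, is shown here. The helper `best_label_mapping` and the toy data are hypothetical; with 5 categories there are 5! = 120 assignments to try, which is cheap.

```python
import itertools

def best_label_mapping(y_true, y_comp, labels):
    """Return (best_accuracy, mapping) over all assignments of component ids to labels."""
    best_acc, best_map = 0.0, None
    for perm in itertools.permutations(labels):
        # perm[i] is the label assigned to component i
        mapped = [perm[c] for c in y_comp]
        acc = sum(m == t for m, t in zip(mapped, y_true)) / len(y_true)
        if acc > best_acc:
            best_acc, best_map = acc, dict(enumerate(perm))
    return best_acc, best_map

# Hypothetical toy data: component ids as an argmax over the NMF output would give them
toy_true = ['sport', 'sport', 'tech', 'tech', 'business', 'business']
toy_comp = [2, 2, 0, 0, 1, 0]
acc, mapping = best_label_mapping(toy_true, toy_comp, ['business', 'sport', 'tech'])
print(acc, mapping)
```

Note that the mapping is chosen using the true labels, so the resulting score is an optimistic estimate of how well the components line up with the categories.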

In [None]:
def get_max_accuracy(y_true, y_pred):
    """
    Returns max accuracy among category permutations
    Args:
        y_true (ndarray): true labels of a dataset
        y_pred (ndarray): NMF transform output of shape (n_samples, n_components); the argmax of each row is used as the predicted component
    Returns:
        mx (float): maximum accuracy score among all category permutations
    """
    y_pred = np.argmax(y_pred, axis = 1)
    l = np.unique(y_true)
    t = y_true.values
    mx = 0
    for p in itertools.permutations(range(len(l))):
        c = np.array([l[p.index(x)] for x in y_pred])
        v = np.mean(c == t)
        mx = max(mx,v)
            
    return mx

In [None]:
class NMF_custom(BaseEstimator, ClassifierMixin):
    """
    Custom NMF class to use in conjunction with GridSearchCV
    , so that predict() function can be used in the cross validation
    Inherited classes:
        BaseEstimator, ClassifierMixin: necessary for the GridSearchCV
    """
    def __init__(self, n_components = 5, init = None, l1_ratio = 0, max_iter = 200):
        """
        Set self variables to the arguments for the fit() function
        Args:
            smaller set of the same arguments as in the docs of NMF
        """
        self.n_components = n_components
        self.init = init
        self.l1_ratio = l1_ratio
        self.max_iter = max_iter

    def fit(self, X, y=None):
        """
        Fit NMF to the X data using self variables
        Args:
            X (ndarray): data to fit
        Returns:
            self
        """
        self.nmf = NMF(n_components = self.n_components
                       , init = self.init
                       , l1_ratio = self.l1_ratio
                       , max_iter = self.max_iter)
        self.nmf.fit(X)
        return self

    def predict(self, X):
        """
        Transform X using the fitted model; the scorer turns the output into predictions
        Args:
            X (ndarray): data to predict from
        Returns:
            self.nmf.transform (ndarray): component weights per document, shape (n_samples, n_components)
        """
        return self.nmf.transform(X)


In [None]:
def get_table_grid_nmf(X_train, y_train):
    """
    Returns tuple (table, best_estimator)
    Args:
        X_train (ndarray): data to train on
        y_train (ndarray): training labels for GridSearchCV evaluation
    Returns:
        (table, grid.best_estimator_)
    """
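    # Number of unique categories in the labels passed in
    n_unq_cat = len(np.unique(y_train))
    # Parameter grid to use in GridSearchCV
    # Note: NMF_custom builds NMF with its default (zero) regularization strength,
    # so l1_ratio is not expected to change the factorization here
    param_grid = {
        'n_components': [n_unq_cat]
        ,'init': ['random', 'nndsvda']
        ,'l1_ratio': [0.0, 0.5, 1.0]
        ,'max_iter': [200, 400, 600]
    }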
    # Create GridSearchCV object with custom scoring function
    grid = GridSearchCV(estimator = NMF_custom()
                        , param_grid = param_grid
                        , scoring = make_scorer(get_max_accuracy)
                        , return_train_score = True
                        , cv = 5)
    # Fit X_train
    grid = grid.fit(X_train, y_train)
    # Add new key 'model' as a name of the trained model for later use in a table
    n = len(grid.cv_results_['params'])
    grid.cv_results_['model'] = ['NMF']*n
    # Convert dictionary of ndarray to dataframe
    table = pd.DataFrame(grid.cv_results_)
    # Get cleaned table and sorted by accuracy and runtime
    table = get_results_table(table)
    
    return (table, grid.best_estimator_)

In [None]:
results_nmf, best_nmf = get_table_grid_nmf(X_train, y_train)

In [None]:
# Top 10 NMF models
results_nmf[:10]

#### 6.2 Supervised Learning (SL) <a class="anchor" id="sl"></a>

In [None]:
# Create a list of dictionaries to use in the GridSearchCV within the loop
model_params = [
    {
        'model': LogisticRegression()
        ,'name': 'LogReg'
        ,'param_grid':
        {
            'penalty': ['l1', 'l2']
            ,'C': [0.001,  0.1]
            ,'solver': ['saga']
        }
    }
    ,{
        'model': RandomForestClassifier()
        ,'name': 'Random Forest'
        ,'param_grid':
        {
            'n_estimators': [50, 100]
            ,'max_depth': [None, 10]
        }
    }
    ,{
        'model': SVC()
        ,'name': 'SVM'
        ,'param_grid':
        {
            'C': [0.1, 1, 10]
            ,'kernel': ['rbf']
            ,'gamma': [0.1, 1, 10]
        }
    }
]

In [None]:
def get_table_grid_sl(model_params, X_train, y_train):
    """
    Returns all results from selected models and their parameters, plus the best estimator
    Args:
        model_params (list): list of dictionaries of models to train
        X_train (ndarray): data to train
        y_train (ndarray): labels to compute accuracy score
    Returns:
        (table, best_sl): list of dictionaries with results from training via GridSearchCV
                          and the estimator with the highest mean test score
    """
    table = []
    max_score = 0
    best_sl = None
    # Loop through all the models with specified parameters
    for mp in model_params:
        # Create GridSearchCV object with specified parameters
        grid = GridSearchCV(mp['model']
                            , mp['param_grid']
                            , return_train_score = True
                            , cv = 5)
        # Fit object to training data
        grid = grid.fit(X_train, y_train)
        # Add new key 'model' for later depiction in a table comparison
        n_params = len(grid.cv_results_['params'])
        grid.cv_results_['model'] = [mp['name']]*n_params
        # Append GridSearchCV results to the table list
        table.append(grid.cv_results_)
        # Keep the estimator with the highest mean test score so far
        if grid.best_score_ > max_score:
            max_score = grid.best_score_
            best_sl = grid.best_estimator_

    return (table, best_sl)

In [None]:
# Get SL models' results in a list of dictionaries
results_sl, best_sl = get_table_grid_sl(model_params, X_train, y_train)

In [None]:
def get_readable_table_sl(results_sl):
    """
    Returns readable dataframe with GridSearchCV results on every row
    Args:
        results_sl (list): list of GridSearchCV dictionaries with results
    Returns:
        results_sl (dataframe): cleaned dataframe with results on every row
    """
    df = pd.DataFrame(results_sl)
    df = df.explode(['model'
                     ,'params'
                     ,'mean_fit_time'
                     ,'mean_score_time'
                     ,'mean_train_score'
                     ,'mean_test_score']
                    , ignore_index=True)
    # Get cleaned dataframe that is sorted by accuracy and runtime
    results_sl = get_results_table(df)
    
    return results_sl

In [None]:
# Convert results_sl list to readable Dataframe
results_sl = get_readable_table_sl(results_sl)

In [None]:
# Top 10 Supervised models
results_sl[:10]

### 7. Model Results Comparison between NMF and SL <a class="anchor" id="comparison"></a>

This section consists of 2 parts: model comparison using the full dataset and using only a fraction of it, specifically 50%, 20% and 10%.<br/>
The first part presents the main results of this study and helps answer the question: does NMF perform better or worse than the selected SL models?<br/>
The second part compares SL models with NMF models on different portions of the train data. The main goal here is to reveal which model is the most data-efficient.

#### 7.1 Full Train Dataset <a class="anchor" id="full_train_dataset"></a>

In [None]:
# Concatenate two tables from NMF and SL results into one and sort in the same manner
results_total = pd.concat([results_nmf, results_sl], axis=0, ignore_index=True)
results_total = results_total.sort_values(by=['mean_test_score','mean_runtime']
                          , ascending=[False,True])

In [None]:
# Top 10 Model Ratings
results_total[:10]

#### 7.2 Samples of the Train Dataset <a class="anchor" id="samples_train_dataset"></a>

Given the set of arbitrary samples of the train data, 50%, 20% and 10%, we can observe rather interesting results. Every model except LogReg scored surprisingly well, losing less than 2% in accuracy while training on just half of the provided data. This is an important observation, since it offers an opportunity to increase computational speed with only a small drop in the accuracy score.
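Drawing such samples can be sketched as follows: each fraction is taken independently from the full index range, without replacement. The helper `sample_indices`, the dataset size and the seed are hypothetical and serve only as an illustration.

```python
import random

def sample_indices(n_total, fraction, seed=None):
    """Return a sorted random subset of row indices of size int(fraction * n_total)."""
    rng = random.Random(seed)
    n_sample = int(fraction * n_total)
    # Sampling without replacement from the whole index range
    return sorted(rng.sample(range(n_total), n_sample))

n_rows = 1000  # hypothetical dataset size, for illustration only
subsets = {frac: sample_indices(n_rows, frac, seed=42) for frac in [0.5, 0.2, 0.1]}
for frac, idx in subsets.items():
    print(frac, len(idx))
```

Computing every sample size from the full dataset size matters: applying the fractions one after another would shrink the data to 50%, then 10%, then 1% instead.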

Overfitting can also be observed here, as the mean train score is generally higher than the mean test score across the selected models. But what stands out the most are the NMF results. While overfitting in the SL models kept increasing as the sample size shrank, NMF test scores not only stayed at the same level as the train scores, but in one case were even higher. That might be because NMF doesn't use labels during training, which in this particular case gave it an advantage over the SL algorithms.
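One simple way to quantify the overfitting described here is the gap between the mean train score and the mean test score. The sketch below uses made-up numbers that are not results from this notebook; `overfitting_gap` is a hypothetical helper operating on rows shaped like those of the results table.

```python
def overfitting_gap(rows):
    """Return rows extended with 'gap' = mean_train_score - mean_test_score, largest gap first."""
    with_gap = [dict(r, gap=r['mean_train_score'] - r['mean_test_score']) for r in rows]
    return sorted(with_gap, key=lambda r: r['gap'], reverse=True)

# Made-up numbers for illustration only; they are not results from this notebook
toy_rows = [
    {'model': 'A', 'sample': 0.1, 'mean_train_score': 1.00, 'mean_test_score': 0.90},
    {'model': 'A', 'sample': 0.5, 'mean_train_score': 1.00, 'mean_test_score': 0.96},
    {'model': 'B', 'sample': 0.1, 'mean_train_score': 0.91, 'mean_test_score': 0.92},
]
ranked = overfitting_gap(toy_rows)
for r in ranked:
    print(r['model'], r['sample'], round(r['gap'], 2))
```

A gap near zero or below it, as for the hypothetical model B, corresponds to the behaviour reported for NMF, where test scores match or exceed train scores.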

In [None]:
def get_table_samples(samples):
    """
    Returns final table of SL and NMF training results by fraction of the data
    Args:
        samples (list): list of fractions to select from dataset
    Returns:
        table (dataframe): final training results table
    """
    n_total = X_train.shape[0]
    table = pd.DataFrame()
    for frac in samples:
        # Sample size is always computed from the full dataset size
        n = int(frac * n_total)
        # Randomly select n row indexes from the whole dataset, without replacement
        indexes = random.sample(range(n_total), n)
        # Select the matching rows and labels by position
        X_part, y_part = X_train[indexes], y_train.iloc[indexes]
        # Get tables of SL and NMF results
        results_sl, best_sl = get_table_grid_sl(model_params, X_part, y_part)
        results_nmf, best_nmf = get_table_grid_nmf(X_part, y_part)
        # Convert results_sl to readable dataframe
        results_sl = get_readable_table_sl(results_sl)
        # Create temporary table of results_sl and results_nmf
        temp_table = pd.concat([results_nmf, results_sl], axis=0, ignore_index=True)
        # Set column sample for clarification
        temp_table['sample'] = frac
        # Concatenate with final table
        table = pd.concat([temp_table, table], axis = 0, ignore_index = True)
    
    return table

In [None]:
# Get the table of training results in regard to selected sample of the dataset
samples = [0.5, 0.2, 0.1]

results_samples = get_table_samples(samples)

In [None]:
# Model Ratings based on the samples
results_samples.fillna(0, inplace = True)
temp_table = results_total.copy()
temp_table['sample'] = 1.0
temp_table['mean_train_score'] = temp_table['mean_train_score'].astype(float)
temp_table['mean_test_score'] = temp_table['mean_test_score'].astype(float)
result = pd.concat([temp_table, results_samples], axis=0, ignore_index=True)
result = result.groupby(['model', 'sample']).apply(lambda x: x.loc[x['mean_test_score'].idxmax()])
result = result.drop('sample', axis = 1)
result

### 8. Submission Results <a class="anchor" id="submission_results"></a>

In [None]:
# Predict test categories with the best SL model and build the submission table

y_pred = best_sl.predict(X_test)
results = test_data.copy()
results['Category'] = y_pred
results.drop('Text', axis = 1, inplace = True)

In [None]:
# Print results table
results

In [None]:
# Make submission
results.to_csv('submission.csv', index=False)

#### Submission's public score: 0.98231

### 9. Conclusion <a class="anchor" id="conclusion"></a>

The superior performance of supervised learning, specifically SVM and RF, suggests that in our situation, where labeled data is available and the task demands high predictive accuracy, these models are the more effective choice.

One possible explanation for this success is the obvious one: these models leverage the labeled data to learn patterns and relationships within it, enabling them to make more accurate predictions.

However, NMF, despite being at a disadvantage, still holds significant value. As can be seen in the table, its accuracy is slightly worse, but it took much less time to run than the leader, SVM. That makes NMF very attractive for unlabeled data, as it provides good performance. More interestingly, it can be considered for labeled data as well, if we are willing to trade a slight drop in accuracy for higher computational speed.

Additionally, it's important to point out that most of the models scored surprisingly well, with only a slight loss in accuracy, while training on just half of the provided data. NMF gave especially surprising results: on every sample size its test scores were on par with its train scores. This observation tells us that not only algorithm efficiency or feature engineering, but also a reduction of the sample size itself may lead to a good speed-accuracy trade-off.

### 10. References <a class="anchor" id="references"></a>

* Problem-solving with ML: automatic document classification<br/>
https://cloud.google.com/blog/products/ai-machine-learning/problem-solving-with-ml-automatic-document-classification
* Scikit-learn API Reference<br/>
https://scikit-learn.org/stable/modules/classes.html
* Basic EDA,Cleaning and GloVe<br/>
https://www.kaggle.com/code/shahules/basic-eda-cleaning-and-glove/notebook
* Approaching (Almost) Any NLP Problem on Kaggle<br/>
https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
* Spooky NLP and Topic Modelling tutorial<br/>
https://www.kaggle.com/code/arthurtok/spooky-nlp-and-topic-modelling-tutorial/notebook
* [Fake News] Easy NLP Text Classification!<br/>
https://www.kaggle.com/code/ohseokkim/fake-news-easy-nlp-text-classification