# Overview

Due to the length of the original notebook, I decided to create a separate submission for Non-Negative Matrix Factorization (NMF).

An **important distinction** to make is the difference between topic modeling and topic classification. Topic modeling takes unlabeled text and groups it by the topics the model infers it contains. Topic classification relies on labeled training data. Hence topic modeling generally uses unsupervised techniques such as Latent Dirichlet Allocation and Latent Semantic Analysis, while topic classification generally uses supervised learning algorithms such as support vector machines and naive Bayes. There are also deep learning architectures used for text classification, such as Convolutional Neural Networks and Recurrent Neural Networks. I will provide resources in a separate notebook for further information.

 

**Issue with NMF for classification**: Non-negative matrix factorization is often utilized for topic modeling and not topic classification. For topic modeling, we choose the number of topics the matrix is factorized into. For our purposes, we will choose 5 topics (since we have 5 different categories to classify articles into).
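In matrix terms, the setup above can be written as follows. This is a sketch of the standard NMF formulation; the symbols $X$, $W$, $H$, $n$, $m$ and $k$ are notation introduced here and do not appear in the notebook's code.

$$
X \approx W H, \qquad X \in \mathbb{R}_{\ge 0}^{n \times m},\quad W \in \mathbb{R}_{\ge 0}^{n \times k},\quad H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

Here $X$ is the document-term matrix ($n$ articles, $m$ vocabulary terms), $k = 5$ is the chosen number of topics, row $i$ of $W$ holds the topic weights of article $i$, and row $j$ of $H$ holds the term weights of topic $j$. Nothing in the factorization ties a topic index to a category name, which is why using it for classification needs an extra matching step.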

**Note:** This notebook is for learning purposes. It is **not** an attempt to achieve better results or runtime than the supervised models used in the other notebook. 

### Installing Dependencies

In [102]:
!pip install opendatasets # for downloading the dataset into the notebook with a Kaggle API key
!pip install nltk # installing NLTK, needed for preprocessing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Importing Modules

In [103]:
import opendatasets as od
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import re
import string
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import sklearn.metrics as metrics
import numpy as np
from sklearn.metrics import accuracy_score
from itertools import permutations

### Required Downloads from NLTK for preprocessing

In [104]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [105]:
od.download("https://www.kaggle.com/competitions/learn-ai-bbc/data")

Skipping, found downloaded files in "./learn-ai-bbc" (use force=True to force download)


In [106]:
test = pd.read_csv('/content/learn-ai-bbc/BBC News Test.csv', low_memory = False) 
train =  pd.read_csv('/content/learn-ai-bbc/BBC News Train.csv', low_memory = False)

## Condensed Preprocessing and Vectorizing

In [107]:
def text_preprocessing(df, col_text):
  stop_words = set(stopwords.words("english"))
  stop_words.add('said') # adding 'said' to the stop-word set
  lemmatizer = WordNetLemmatizer()
  def clean(text):
    tokens = word_tokenize(text) # tokenizing
    tokens = [word for word in tokens if word not in string.punctuation] # removing punctuation
    tokens = [word for word in tokens if word not in stop_words] # removing stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens] # lemmatizing
    return ' '.join(tokens) # rejoining words to allow for vectorization
  df[col_text] = df[col_text].apply(clean) # one column assignment instead of chained df[col][i] = ... writes
  return df

In [108]:
text_preprocessing(train, 'Text') #preprocessing text for model
train_vec = TfidfVectorizer().fit_transform(train['Text'].values) #vectorizing for model


## Modeling

In [109]:
nmf = NMF(n_components=5, 
                init='nndsvda', 
                solver = 'mu',
                beta_loss = 'kullback-leibler',
                l1_ratio = 0.5,
                random_state = 47)
transformed = nmf.fit_transform(train_vec)
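For reference, the loss selected by `beta_loss = 'kullback-leibler'` is the generalized Kullback-Leibler divergence between the TF-IDF matrix and its reconstruction. The notation below ($X$ for `train_vec`, $W$ for `transformed`, $H$ for `nmf.components_`) is introduced here for illustration:

$$
D_{KL}(X \,\|\, WH) = \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right)
$$

The `'mu'` solver minimizes this with multiplicative updates, which keep $W$ and $H$ non-negative. As far as I can tell, `l1_ratio` only matters when a regularization strength is also set, so with the default strength of zero it should have no effect here.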

In [110]:
predicted_topics = [np.argsort(each)[::-1][0] for each in transformed]

In [111]:
print(predicted_topics[0:30])

[4, 4, 4, 2, 4, 1, 0, 3, 4, 3, 1, 3, 4, 4, 0, 0, 3, 0, 0, 2, 0, 3, 0, 0, 2, 4, 2, 0, 1, 4]
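The list comprehension above picks, for each article, the index of its largest topic weight. A minimal sketch of the same step with `np.argmax`, using a small made-up weight matrix (`toy_weights` is hypothetical and stands in for `transformed`):

```python
import numpy as np

# Hypothetical document-topic weights: 3 articles, 4 topics.
toy_weights = np.array([
    [0.10, 0.70, 0.05, 0.15],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.30, 0.60],
])

# Notebook style: sort each row and take the index of the largest value.
via_argsort = [int(np.argsort(row)[::-1][0]) for row in toy_weights]

# Vectorized alternative: argmax along the topic axis.
via_argmax = np.argmax(toy_weights, axis=1).tolist()

print(via_argsort, via_argmax)
```

The two approaches agree whenever each row has a single largest value; rows with exact ties may resolve differently.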


In [140]:
def find_max_accuracy(df, true_val_col, pred_labels): # takes a dataframe, the name of the label column, and a list of topic indices (one per unique category)
  # note: accuracy is computed against the global `predicted_topics` list from the cell above
  label_perms = permutations(pred_labels,5) # creating all possible permutations of labels
  top_permutation = []
  top_accuracy = 0
  for perm in label_perms:
    true_vals = []
    for val in df[true_val_col].values:
      if val == 'business':
        true_vals.append(perm[0])
      elif val == 'entertainment':
        true_vals.append(perm[1])
      elif val == 'sport':
        true_vals.append(perm[2]) 
      elif val == 'tech':
        true_vals.append(perm[3]) 
      else:
        true_vals.append(perm[4])
    accuracy = metrics.accuracy_score(true_vals, predicted_topics)
    if accuracy >  top_accuracy:
      top_accuracy = accuracy
      top_permutation = perm 
  return top_permutation, top_accuracy
labels, accuracy = find_max_accuracy(train, 'Category', [0,1,2,3,4]) #extracting labels that give the model the highest accuracy
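Because NMF topic indices carry no category names, the function above tries every assignment of indices to categories (5! = 120 for five topics) and keeps the most accurate one. The sketch below shows the same idea on toy data with only the standard library. The function name `best_label_mapping` and the toy lists are hypothetical, it maps topic to category rather than category to topic, and it assumes there are no more distinct topics than categories.

```python
from itertools import permutations

def best_label_mapping(true_labels, predicted_topics):
    """Return (mapping, accuracy) for the topic-to-category assignment with the most matches."""
    categories = sorted(set(true_labels))
    topics = sorted(set(predicted_topics))
    best_mapping, best_accuracy = {}, -1.0
    # Try every way of assigning a distinct category to each topic index.
    for perm in permutations(categories, len(topics)):
        mapping = dict(zip(topics, perm))
        hits = sum(mapping[t] == y for t, y in zip(predicted_topics, true_labels))
        accuracy = hits / len(true_labels)
        if accuracy > best_accuracy:
            best_mapping, best_accuracy = mapping, accuracy
    return best_mapping, best_accuracy

# Toy data: five articles, three topics.
toy_true = ['sport', 'sport', 'tech', 'tech', 'business']
toy_pred = [2, 2, 0, 1, 1]
mapping, accuracy = best_label_mapping(toy_true, toy_pred)
print(mapping, accuracy)
```

The search is exhaustive, so its cost grows factorially with the number of topics; that is fine for five categories but would not scale to many more.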

In [114]:
print(f"Correct labels: {labels}\nModel Accuracy: {accuracy}")

Correct labels: (4, 3, 0, 2, 1)
Model Accuracy: 0.9395973154362416


## Evaluation


## Submission

In [115]:
text_preprocessing(test, 'Text') #preprocessing text for model
test_vec = TfidfVectorizer().fit_transform(test['Text'].values) #vectorizing for model
nmf = NMF(n_components=5, 
                init='nndsvda', 
                solver = 'mu',
                beta_loss = 'kullback-leibler',
                l1_ratio = 0.5,
                random_state = 47)
transformed_test = nmf.fit_transform(test_vec)
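The cell above fits a new `TfidfVectorizer` and a new `NMF` model on the test articles, so its vocabulary columns and topic indices are not guaranteed to match the ones behind `labels`. Below is a hedged sketch of the alternative: fit both on the training texts, then call `transform` on the test texts. The toy texts and the names `vectorizer`, `model`, `train_matrix` and `test_matrix` are made up for illustration and are not from this notebook. I have not rerun the submission with this change, so no accuracy is claimed for it.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the preprocessed train['Text'] and test['Text'] columns.
train_texts = [
    "stock market profit bank share",
    "bank profit share growth economy",
    "match goal team player win",
    "team player coach match season",
]
test_texts = ["player goal win season", "economy bank market growth"]

# Fit the vocabulary and IDF weights on the training texts only.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)

# Fit the topics on the training matrix only.
model = NMF(n_components=2, init="nndsvda", solver="mu",
            beta_loss="kullback-leibler", random_state=47)
train_weights = model.fit_transform(train_matrix)

# Reuse the fitted vectorizer and model: transform, not fit_transform.
test_matrix = vectorizer.transform(test_texts)
test_weights = model.transform(test_matrix)

# Topic indices now refer to the same topics for train and test.
train_topics = train_weights.argmax(axis=1)
test_topics = test_weights.argmax(axis=1)
print(train_topics, test_topics)
```

With this arrangement, a mapping from topic index to category found on the training set refers to the same topics when applied to the test set.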


In [132]:
final_sub_pred = []
pred_test = [np.argsort(each)[len(each)-1] for each in transformed_test] # create list by grabbing the index of the largest topic weight in each row
for val in pred_test: # converting numeric labels into categorical labels for submission
  if val == labels[0]:
    final_sub_pred.append('business')
  elif val == labels[1]:
    final_sub_pred.append('entertainment')
  elif val == labels[2]:
    final_sub_pred.append('sport') 
  elif val == labels[3]:
    final_sub_pred.append('tech') 
  else:
    final_sub_pred.append('politics')

In [117]:
submission = pd.DataFrame(list(zip(test['ArticleId'], final_sub_pred)),
               columns =['ArticleId', 'Category'])
submission.head(20)

Unnamed: 0,ArticleId,Category
0,1018,sport
1,1319,entertainment
2,1138,sport
3,459,tech
4,1020,sport
5,51,sport
6,2025,politics
7,1479,politics
8,27,business
9,397,tech


In [118]:
submission.to_csv('News_NMF_submission.csv', index=False)

## Conclusion

There are major issues with using Non-negative Matrix Factorization for this classification task. The model generalized poorly to new data: it achieved an accuracy of 0.42857 on the submission (test data), compared with roughly 0.94 on the training data. One likely contributor is that the test set was vectorized and factorized with a freshly fitted TfidfVectorizer and NMF model, so its topic indices are not guaranteed to line up with the label mapping found on the training data. The supervised models in the other notebook performed much better. In the future I could tune hyperparameters for a pipeline consisting of the preprocessor, vectorizer, and model, most likely through a grid search that uses accuracy to choose the best hyperparameters. However, the runtime would be prohibitively long, so I decided not to do this.