# Classify BBC Articles
Date: Aug 2023

source: https://www.kaggle.com/competitions/learn-ai-bbc/data?select=BBC+News+Train.csv

In [77]:
import polars as pl
import altair as alt
import numpy as np
from sklearn.decomposition import NMF

from sklearn.feature_extraction.text import TfidfVectorizer as tfidv
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from time import time

In [78]:
# Read the data
train_url = "https://raw.githubusercontent.com/nsxydis/bbc-project/main/BBC%20News%20Train.csv"
train = pl.read_csv(train_url)
test_url = "https://raw.githubusercontent.com/nsxydis/bbc-project/main/BBC%20News%20Test.csv"
test = pl.read_csv(test_url)

# Explore the data
In the sections below, we examine the train and test datasets using the pandas `info` method.

In [79]:
# Look at the data
print("Train Data:")
train.to_pandas().info()
print(train)

print("\n\nTest Data:")
test.to_pandas().info()

Train Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.1+ KB
shape: (1_490, 3)
┌───────────┬───────────────────────────────────┬───────────────┐
│ ArticleId ┆ Text                              ┆ Category      │
│ ---       ┆ ---                               ┆ ---           │
│ i64       ┆ str                               ┆ str           │
╞═══════════╪═══════════════════════════════════╪═══════════════╡
│ 1833      ┆ worldcom ex-boss launches defenc… ┆ business      │
│ 154       ┆ german business confidence slide… ┆ business      │
│ 1101      ┆ bbc poll indicates economic gloo… ┆ business      │
│ 1976      ┆ lifestyle  governs mobile choice… ┆ tech          │
│ 917       ┆ enron boss

# Data Cleaning Part 1
There are repeated entries in the Kaggle data -- the website notes there are 1440 unique records in the training data, but there are 1490 rows. That means we'll need to remove the duplicates.

In [80]:
# Note: Kaggle says there's 1440 unique records in the train dataframe, but there's 1490 rows

# Check for and remove duplicate records 
def duplicates(df, fields):
    originalLen = len(df)
    filtered = df.unique(subset = fields, keep = 'first')
    print(f"{originalLen - len(filtered)} duplicates found in the training data")
    return filtered

train = duplicates(train, ['Text', 'Category'])

50 duplicates found in the training data


# Data Cleaning Part 2
Common words and punctuation should be removed from the data to prevent their high frequencies from impacting the results.

In [81]:
def removeCommon(text):
    '''Removes common words & punctuation from text and returns the data'''
    # NOTE: Hyphens are allowed for hyphenated words
    common = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    filtered = ""
    for word in tokens:
        if word not in common and (word.replace('-', "").isalnum()):
            filtered += f" {word}"
    return filtered

train = train.with_columns(pl.col('Text').apply(lambda x: removeCommon(x)))
test = test.with_columns(pl.col('Text').apply(lambda x: removeCommon(x)))


In [82]:
# Explore the data via charts

# Selection
selection = alt.selection_point(encodings = ['x'])

title = ["Click a column to change the chart on the right", '', 'Distribution of articles']
# Breakdown of article types
articleTypes = alt.Chart(train.to_pandas(), title = title).mark_bar().encode(
    x = 'Category',
    y = 'count()',
    tooltip = ['Category', 'count()'],
    color = 'Category'
).add_params(selection)

# Breakdown of article lengths
articleLengths = alt.Chart(train.to_pandas(), title = "Article Lengths").mark_bar().encode(
    x = alt.X('x:Q', title = "Number of Characters", bin=True),  # length() counts characters, not words
    y = 'count()',
    color = 'Category',
    tooltip = ['Category', 'count()']
).transform_calculate(
    x = 'length(datum.Text)'
).transform_filter(
    selection
)
articleTypes | articleLengths


# Word Embedding
Text analysis requires words to be translated into data that algorithms can interpret more readily. First, the text must be tokenized, that is, split on whitespace and punctuation. Next, each distinct word is assigned an index and its occurrences in each document are counted. The counts are then normalized based on how often each token occurs across the documents analyzed. This process is known as vectorization.
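
As a rough illustration of the tokenize, index, and count steps, the sketch below runs scikit-learn's `CountVectorizer` on a toy corpus. The three sentences are made up for illustration and are not taken from the BBC data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not from the BBC dataset)
docs = [
    "the market fell",
    "the team won the match",
    "the market rallied",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)   # sparse matrix: one row per document
vocab = cv.vocabulary_            # maps each token to its column index

# Tokens listed in column order, followed by the per-document counts
print(sorted(vocab, key=vocab.get))
print(counts.toarray())
```

Each column corresponds to one token, so a word that appears twice in a document (here "the" in the second sentence) gets a count of 2 in that row.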

In large collections of text, certain common words (such as ‘and’, ‘if’, and ‘a’) occur very frequently but carry little meaning. To account for this, the inverse document frequency (IDF) weights each token by:

$\mathrm{idf}(t) = \log\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1$

Where t is the token, n is the number of documents, and df(t) is the number of documents containing the token. The token frequency (TF) is multiplied by the IDF in the TF-IDF method used by:
```
from sklearn.feature_extraction.text import TfidfVectorizer
```

Overall, this provides a way to compare and classify document texts against one another, as used in the remainder of this notebook. The IDF weighting also down-weights common words (sometimes called stop words) that the earlier data cleaning may have missed.
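
The IDF formula above can be checked against scikit-learn directly. This is a minimal sketch on a made-up corpus, assuming the default `TfidfVectorizer` settings (`smooth_idf=True`) and the natural logarithm; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical sentences, not from the BBC dataset)
docs = [
    "the market fell",
    "the team won the match",
    "the market rallied",
]

vec = TfidfVectorizer()          # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs)

n = len(docs)                               # number of documents
df = (X.toarray() > 0).sum(axis=0)          # documents containing each token
expected_idf = np.log((1 + n) / (1 + df)) + 1

# The manually computed IDF should match the fitted vectorizer
print(np.allclose(vec.idf_, expected_idf))
```

A token that appears in every document (here "the") gets the minimum IDF of 1, while rarer tokens receive larger weights.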

# Model Building Question 1
1) Think about this and answer: when you train the unsupervised model for matrix factorization, should you include texts (word features) from the test dataset or not as the input matrix? Why or why not?
    - It's possible that there are words in the test data that are not in the training data, so it can be beneficial to factorize using all of the available data.
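
To make the point about unseen words concrete, here is a small sketch comparing a vectorizer fitted on training text only with one fitted on training and test text together. The sentences are invented for illustration. Note that only the test text is used, never any test labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, not from the BBC dataset
train_docs = ["stocks fell sharply", "the striker scored twice"]
test_docs = ["the chipmaker unveiled a processor"]

train_only = TfidfVectorizer().fit(train_docs)
combined = TfidfVectorizer().fit(train_docs + test_docs)

# Tokens that only the combined vocabulary knows about
unseen = set(combined.vocabulary_) - set(train_only.vocabulary_)
print(sorted(unseen))

# With the train-only vocabulary, most of the test document is dropped
print(train_only.transform(test_docs).nnz, "non-zero feature(s) kept")
print(combined.transform(test_docs).nnz, "non-zero feature(s) kept")
```

With the train-only vocabulary the test document keeps a single feature ("the"), whereas the combined vocabulary retains every token of length two or more.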

In [83]:
# Word embedding - combine train and test words together
tfidf = tfidv()
words = train['Text'].to_list() + test['Text'].to_list()
we = tfidf.fit(words)
weTrain = tfidf.transform(train['Text'])

In [84]:
# Set up the training model
nCat = len(train['Category'].unique())
nmf = NMF(n_components = nCat, random_state = 42)

# Build the model
nmf.fit(weTrain)

# Add the predicted categories to our data
# Note: This is for a polars lambda function in this cell
def addCat(predCat, pred):
    return predCat[pred]

# Test the model
def score(model, embedded, data, associations = None, report = True, accuracy = True, con = False):
    matrix = model.transform(embedded)
    yp = [np.argmax(r) for r in matrix]

    # Add back to our training data
    yp = pl.DataFrame({'pred' : yp})
    yp = data.hstack(yp)

    # Associate prediction numbers with categories
    if associations is None:
        nCat = len(data['Category'].unique())
        predCat = {}
        for n in range(nCat):
            focus = yp.filter(pl.col('pred') == n)
            mode = focus['Category'].mode()
            if len(mode) > 1:
                print(f"Error: there are too many modes for predicted category {n}")
            predCat[n] = mode[0]
    else:
        predCat = associations

    yp = yp.with_columns(pl.col('pred').apply(lambda x: addCat(predCat, x)).alias("predCat"))

    if con:
        print(confusion_matrix(yp['Category'], yp['predCat']))

    if accuracy:
        # Report the accuracy
        acc = sum(yp['Category'] == yp['predCat']) / len(yp)
        if report:
            print(f"Training Accuracy: {round(acc*100, 2)}%")
        return yp, predCat, acc
    else:
        return yp, predCat

trained, associations, _ = score(nmf, weTrain, train)

Training Accuracy: 90.49%


# Model Building Question 2
2) Build a model using the matrix factorization method(s) and predict the train and test data labels. Choose any hyperparameter (e.g., number of word features) to begin with.
    - Model was built using the default settings, with the number of components set to 5 and random state set to 42.
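
The labelling approach used in the `score` function (take the strongest NMF component per document, then name each component after its most common true label) can be sketched on a toy corpus. The documents and labels below are made up, and `Counter` stands in for the polars `mode` call used in the notebook.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents and labels, not from the BBC dataset
docs = [
    "shares profit bank market",
    "market bank shares growth",
    "profit growth bank shares",
    "match goal team striker",
    "team striker goal league",
    "league match team goal",
]
labels = ["business"] * 3 + ["sport"] * 3

X = TfidfVectorizer().fit_transform(docs)
W = NMF(n_components=2, random_state=42).fit_transform(X)  # document-by-component weights

# Each document is assigned to its strongest component
component = W.argmax(axis=1)

# Name each component after the most common true label among its documents
component_to_label = {}
for c in np.unique(component):
    members = [labels[i] for i in np.flatnonzero(component == c)]
    component_to_label[int(c)] = Counter(members).most_common(1)[0][0]

predicted = [component_to_label[int(c)] for c in component]
print(component_to_label)
print(predicted)
```

Because the component-to-label mapping is learned from labelled training rows, the same mapping has to be reused when scoring unlabelled data, which is what the `associations` argument does.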

In [85]:
# Compare results against the test model
weTest = tfidf.transform(test['Text'])
ypTest, _ = score(nmf, weTest, test, associations, accuracy = False)

# Write the results to a csv file and submit
ypTest = ypTest.rename({'predCat' : 'Category'})
ypTest[['ArticleId', 'Category']].write_csv("Solution_unSup.csv")
print("Kaggle Accuracy: 91.292%")

Kaggle Accuracy: 91.292%


# Model Building Question 3
3) Measure the performances on predictions from both train and test datasets. You can use accuracy, confusion matrix, etc., to inspect the performance. You can get accuracy for the test data by submitting the result to Kaggle. 
    - Accuracy for the training model:  90.49%
    - Accuracy for the test model:      91.29%

In [86]:
# Test model with different parameters
init = [None, 'random', 'nndsvd', 'nndsvda', 'nndsvdar']
solver = ['cd', 'mu']
beta_loss = ['frobenius', 'kullback-leibler', 'itakura-saito']

rest = {
    'accuracy'  : [],
    'init'      : [],
    'solver'    : [],
    'beta_loss' : []
}
best = [0, None, None, None]

for i in init:
    for s in solver:
        for b in beta_loss:
            # Note what we're working on
            rest['init'].append(i)
            rest['solver'].append(s)
            rest['beta_loss'].append(b)
            # print(f"\ninit: {i}, solver: {s}, beta_loss: {b}")
            nmf = NMF(n_components=nCat, random_state = 42, init = i, solver = s, beta_loss = b)
            try:
                nmf.fit(weTrain)
                _, _, accuracy = score(nmf, weTrain, train, report = False)
                if accuracy > best[0]:
                    best = [accuracy, i, s, b]
                rest['accuracy'].append(accuracy)
            except Exception:  # incompatible parameter combinations raise an error
                rest['accuracy'].append(None)
                # print("Error!")
# print("Overall results")
rest = pl.DataFrame(rest)
pl.Config.set_tbl_rows(len(rest))
# print(rest.sort(by = 'accuracy', descending=True))
print(f"\n\nBest Results -- accuracy: {round(best[0]*100, 2)}%")
print(f"init: {best[1]}, solver: {best[2]}, beta_loss: {best[3]}")
        





Best Results -- accuracy: 92.36%
init: nndsvdar, solver: mu, beta_loss: kullback-leibler


# Model Building Question 4
See below for a table ordered by accuracy. Accuracy values of "NaN" mean the settings are incompatible. Other values of "None" represent the default settings for the NMF method.
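
As an example of an incompatible combination, scikit-learn's coordinate descent solver (`'cd'`) only supports the Frobenius loss. The sketch below, which uses random data rather than the notebook's TF-IDF matrix, shows the error being raised when the model is fitted.

```python
import numpy as np
from sklearn.decomposition import NMF

# Small random non-negative matrix (illustrative data only)
X = np.random.default_rng(0).random((6, 4))

try:
    NMF(n_components=2, solver="cd", beta_loss="kullback-leibler").fit(X)
    failed = False
except Exception as err:
    # scikit-learn rejects this solver / loss pairing at fit time
    failed = True
    print(f"{type(err).__name__}: {err}")

print("Combination rejected:", failed)
```

In the parameter sweep, combinations like this are caught by the `try`/`except` block and recorded without an accuracy value.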

In [87]:
from IPython.display import display, HTML
rest = rest.sort(by = 'accuracy', descending=True)
display(HTML(rest.to_pandas().to_html()))

|   | accuracy | init | solver | beta_loss |
|---|----------|------|--------|-----------|
| 0 | 0.923611 | nndsvdar | mu | kullback-leibler |
| 1 | 0.922917 | None | mu | kullback-leibler |
| 2 | 0.922917 | nndsvda | mu | kullback-leibler |
| 3 | 0.909028 | nndsvdar | mu | frobenius |
| 4 | 0.908333 | None | mu | frobenius |
| 5 | 0.908333 | nndsvda | mu | frobenius |
| 6 | 0.906944 | nndsvd | mu | kullback-leibler |
| 7 | 0.904861 | None | cd | frobenius |
| 8 | 0.904861 | nndsvd | cd | frobenius |
| 9 | 0.904861 | nndsvda | cd | frobenius |


# Additional Review of model performance
Next we'll compare the confusion matrices for the top three models (below). Since there is little difference between them, we will use the model parameters that gave the best accuracy.

In [88]:
def compare(init, solver, beta_loss):
    '''Prints a confusion matrix for the given parameters'''
    print(f"{init}, {solver}, {beta_loss}")
    nmf = NMF(n_components=nCat, random_state = 42, init = init, solver = solver, beta_loss = beta_loss)
    nmf.fit(weTrain)
    _, _ = score(nmf, weTrain, train, accuracy = False, report = False, con = True)
    print()


# Compare the best three models
for n in range(3):
    init = rest['init'][n]
    solver = rest['solver'][n]
    beta_loss = rest['beta_loss'][n]
    compare(init, solver, beta_loss)

nndsvdar, mu, kullback-leibler


[[275   1  10   6  43]
 [  0 243   2   7  11]
 [  3   2 255   5   1]
 [  1   2   0 339   0]
 [  0   4   3   9 218]]

None, mu, kullback-leibler
[[275   0   9   8  43]
 [  1 242   1   7  12]
 [  4   2 251   8   1]
 [  0   2   0 340   0]
 [  0   3   1   9 221]]

nndsvda, mu, kullback-leibler
[[275   0   9   8  43]
 [  1 242   1   7  12]
 [  4   2 251   8   1]
 [  0   2   0 340   0]
 [  0   3   1   9 221]]
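
To read these matrices more easily, overall accuracy and per-category recall can be derived from the counts. The sketch below copies the first matrix printed above (nndsvdar, mu, kullback-leibler). Rows are the true categories and columns the predicted ones, in the sorted label order used by `confusion_matrix`; rows are referred to by index here rather than by name.

```python
import numpy as np

# First confusion matrix from the output above (rows = true, columns = predicted)
cm = np.array([
    [275,   1,  10,   6,  43],
    [  0, 243,   2,   7,  11],
    [  3,   2, 255,   5,   1],
    [  1,   2,   0, 339,   0],
    [  0,   4,   3,   9, 218],
])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per-row share of correctly labelled articles

print(f"accuracy: {accuracy:.4f}")
for row, r in enumerate(recall):
    print(f"row {row}: recall {r:.3f}")
```

The accuracy works out to the 92.36% reported for this model. The first row has the lowest recall, mainly because 43 of its articles are assigned to the last column.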



# Model Building Question 5
5) Improve the model performance if you can. Some ideas include, but are not limited to: using different feature extraction methods, fitting models on different subsets of data, ensembling the model prediction results, etc.
    - One possible method is to apply lemmatization and stemming to the 'Text' data before passing it to our models. Lemmatization reduces variations of a word to a 'base' word (like 'run' from 'running'), and stemming strips suffixes from words, again helping to reach a 'base' word. Both techniques can help standardize the texts and make them easier to compare. See below for an implementation.
    - New vs. Old Training Accuracy:    91.60% vs. 90.49%
    - New vs. Old Test Accuracy:        92.79% vs. 91.29%

Note: This compares the default hyper-parameter settings for both groups.

In [100]:
# Try to modify the text to change performance
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

def process(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    
    # Initialize lemmatizer and stemmer
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    
    # Apply lemmatization and stemming to each word
    lwords = [lemmatizer.lemmatize(word) for word in words]
    swords = [stemmer.stem(word) for word in lwords]
    
    # Join the processed words back into a preprocessed text
    text = ' '.join(swords)
    
    return text

modTrain = train.with_columns(pl.col("Text").apply(lambda x: process(x)))
modTest = test.with_columns(pl.col("Text").apply(lambda x: process(x)))
mod = modTrain['Text'].to_list() + modTest['Text'].to_list()
mtfidf = tfidv()
mwe = mtfidf.fit(mod)
mweTrain = mtfidf.transform(modTrain['Text'])
mweTest = mtfidf.transform(modTest['Text'])

mnmf = NMF(n_components = nCat, random_state = 42)
mnmf.fit(mweTrain)

_, mAssociations, acc = score(mnmf, mweTrain, modTrain, report = False)
print(f"Training accuracy: {round(acc * 100, 2)}%")

mypTest, _ = score(mnmf, mweTest, modTest, mAssociations, accuracy = False, report = False)
mypTest = mypTest.rename({'predCat' : 'Category'})
mypTest[['ArticleId', 'Category']].write_csv("Solution_unSup_lem.csv")
print("Kaggle Accuracy: 92.789%")

Training accuracy: 91.6%
Kaggle Accuracy: 92.789%


# Compare with supervised learning
Next we'll use k-nearest-neighbors (KNN) to classify the data.

In [90]:
from sklearn.neighbors import KNeighborsClassifier as KNN

knn = KNN(n_neighbors=5)
knn.fit(weTrain, train['Category'])

# Test the training method
supTrain = knn.predict(weTrain).tolist()
supTrain = pl.DataFrame({'pred' : supTrain})
supTrain = train.hstack(supTrain)
acc = sum(supTrain['Category'] == supTrain['pred']) / len(supTrain)
print(f"Accuracy: {round(acc*100, 2)}%")

Accuracy: 95.76%


In [91]:
# Submit the test model
supTest = knn.predict(weTest).tolist()
supTest = pl.DataFrame({'pred' : supTest})
supTest = test.hstack(supTest)
supTest = supTest.rename({'pred' : 'Category'})
supTest[['ArticleId', 'Category']].write_csv("Solution_sup.csv")
print("Kaggle Accuracy = 93.469%")


Kaggle Accuracy = 93.469%


# Supervised training results - Part 1
Using the KNN method, we get the following results:
- training accuracy: 95.76%
- test accuracy: 93.47%

The KNN method gives reasonably good results on both the training and test sets, although it can be affected by the curse of dimensionality. The test accuracy is lower than the training accuracy, which could suggest overfitting.

We adjusted the training/test size to see the effect on performance. To avoid submitting to Kaggle multiple times, the train data was used for the train/test split. The results are in the table below. The NMF runtime was considerably lower than the KNN runtime (roughly 0.4 to 1.1 seconds versus 2.2 to 6.1 seconds), so NMF might be worth using if speed is a critical factor. Both models were most accurate at the largest training size, but the supervised model was far more accurate when the training sample was small (91.82% versus 60.11% at a 10% training size), so KNN might be the better choice when there isn't much data.

We'll evaluate a separate supervised model in the next section.

In [92]:
# Test the model with different test group sizes
results = {
    'size'  : [],
    'sup'   : [],
    'supt'  : [],
    'unsup' : [],
    'unsupt': []
}
def supVunsup(size, results):
    '''Compare the accuracy of unsupervised learning vs. supervised'''
    results['size'].append(size)

    subTrain, subTest = train_test_split(train, train_size = size, random_state=42)
    subWeTrain = tfidf.transform(subTrain['Text'])
    subWeTest = tfidf.transform(subTest['Text'])
    
    # Unsupervised fitting
    start = time()
    subNMF = NMF(n_components=5, random_state=42)
    subNMF.fit(subWeTrain)
    yp, predCat, _ = score(subNMF, subWeTrain, subTrain, report = False)

    ypTest, _, acc = score(subNMF, subWeTest, subTest, associations = predCat, accuracy = True, report = False)
    
    results['unsup'].append(round(acc*100, 2))
    results['unsupt'].append(round(time() - start, 2))
    
    # Supervised fitting
    start = time()
    knn = KNN(n_neighbors = 5)
    knn.fit(subWeTrain, subTrain['Category'])

    supTest = knn.predict(subWeTest).tolist()
    supTest = pl.DataFrame({'pred' : supTest})
    supTest = subTest.hstack(supTest)
    acc = sum(supTest['Category'] == supTest['pred']) / len(supTest)
    # print(f"Accuracy: {round(acc * 100, 2)}%")
    results['sup'].append(round(acc * 100, 2))
    results['supt'].append(round(time() - start, 2))

for size in [0.1, 0.2, 0.5, 0.75]:
    supVunsup(size, results)

results = pl.DataFrame(results)
rename = {
    'sup'   : 'Supervised Accuracy',
    'supt'  : 'Supervised Runtime (s)',
    'unsup' : 'Unsupervised Accuracy',
    'unsupt': 'Unsupervised Runtime (s)'
}
results = results.rename(rename)
display(HTML(results.to_pandas().to_html()))



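|   | size | Supervised Accuracy | Supervised Runtime (s) | Unsupervised Accuracy | Unsupervised Runtime (s) |
|---|------|---------------------|------------------------|-----------------------|--------------------------|
| 0 | 0.1  | 91.82 | 2.15 | 60.11 | 1.14 |
| 1 | 0.2  | 92.45 | 3.47 | 89.84 | 0.5  |
| 2 | 0.5  | 92.36 | 6.13 | 89.86 | 0.44 |
| 3 | 0.75 | 93.33 | 4.19 | 91.67 | 0.6  |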


# Supervised training results - Part 2
Next we'll look at the SVC method and compare results against the previous results.

In [93]:
# Testing with another model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Encode the target variable (Category)
encoder = LabelEncoder()
yEncode = encoder.fit_transform(train['Category'])

# Train a classifier
classifier = SVC()
classifier.fit(weTrain, yEncode)

# Predict on test data
predictions = classifier.predict(weTest)

# Decode the predicted labels
categories = encoder.inverse_transform(predictions)


In [94]:
# Test the training data
pred = classifier.predict(weTrain)
cat = encoder.inverse_transform(pred)
p = pl.DataFrame({'pred' : cat})
res = train.hstack(p)
s = sum(res['Category'] == res['pred']) / len(res)
print(f"SVC Training Accuracy: {round(s*100, 2)}%")

SVC Training Accuracy: 100.0%


In [95]:
# Test the test data
res = test.hstack(pl.DataFrame({'Category' : categories}))
res[['ArticleId', 'Category']].write_csv('svcPred.csv')
print("SVC Kaggle Results: 97.959%")

SVC Kaggle Results: 97.959%


# Overall Results
Here we compare the results for each of the models. Since not all of the models were optimized, we compare the default settings for each one. Note that the y-axis of the graph runs from 90 to 100 percent.

In [96]:
# Overall results
types = ['NMF', 'NMF', 'KNN', 'KNN', "SVC", "SVC"]
methods = ["NMF: Train", "NMF: Test", "KNN: Train", "KNN: Test", "SVC: Train", "SVC: Test"]
scores = [90.49, 91.29, 95.76, 93.47, 100.00, 97.96]
results = {
    'Type'   : [],
    'Method' : [],
    'Score'  : []
}
for n in range(len(methods)):
    results['Type'].append(types[n])
    results['Method'].append(methods[n]) 
    results['Score'].append(scores[n])  # append the number itself, not a one-element list

results = pl.DataFrame(results)

title = "Overall initial model results"
alt.Chart(results.to_pandas(), title = title).mark_bar(clip = True).encode(
    x = alt.X('Method', scale = alt.Scale(domain = methods)),
    y = alt.Y('Score:Q', title = 'Accuracy (Percent)', scale = alt.Scale(domain = [90, 100])),
    color = 'Type'
)

