# 1. BBC News Classification

Text documents are one of the richest sources of data for businesses.

We’ll use a public dataset from the BBC comprised of 2225 articles, each labeled under one of 5 categories: business, entertainment, politics, sport or tech.

The dataset is broken into 1490 records for training and 735 for testing. The goal will be to build a system that can accurately classify previously unseen news articles into the right category.

The competition is scored on accuracy.

# 2. Load Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import itertools
import math

from IPython.display import Image

from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error

# 3. Load Data

Get the train and test data, respectively.

In [2]:
train = pd.read_csv("../input/learn-ai-bbc/BBC News Train.csv")
test = pd.read_csv("../input/learn-ai-bbc/BBC News Test.csv")
train.info()

In [3]:
train

The training data has 1490 rows and 3 columns.

In [4]:
test.info()

In [5]:
test

The test data has 735 rows and 2 columns.

# 4. EDA

In [6]:
# Take labels and counts from the same grouped result so each bar gets the right category name
counts = train.groupby("Category").size()
cate = list(counts.index)
cnt = list(counts)

plt.bar(cate, cnt,color = ['gold', 'b', '#FF0000', 'green',"blue"], alpha = 0.3)
plt.xlabel("Category")
plt.ylabel("Count")
plt.title("Article categories")
plt.show()

**TF-IDF(Term Frequency-Inverse Document Frequency)**

TF-IDF weights each term by how often it appears in a document (term frequency), scaled down by how many documents contain it (inverse document frequency). It is used mainly to measure the similarity of documents, to rank search results, and to gauge how important a word is within a document.

Compared with a plain DTM (Document-Term Matrix) of raw counts, TF-IDF carries more information because it down-weights terms that appear in most documents. It does not always outperform a DTM, but in many cases it does.
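
To make the weighting concrete, here is a minimal by-hand sketch. It assumes the smoothed IDF form that scikit-learn's TfidfVectorizer uses by default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The three toy sentences and the whitespace tokenizer are made up for illustration and are not what the notebook runs.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, not from the BBC data).
docs = [
    "the market rose as the bank cut rates",
    "the team won the match",
    "the new phone has a faster chip",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Term frequency times IDF, then scale the vector to unit L2 length.
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf(tokenized[0])
print(round(idf("the"), 3), round(idf("bank"), 3))
```

A word such as "the" that occurs in every document gets the minimum IDF of 1, while a word found in a single document gets a larger one.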

In [7]:
trcorpus = train["Text"]
trvector = TfidfVectorizer()
trresponse = trvector.fit_transform(trcorpus).todense()
print(trresponse.shape)

tecorpus = list(test["Text"])
tevector = TfidfVectorizer()
teresponse = tevector.fit_transform(tecorpus).todense()
print(teresponse.shape)

In [8]:
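# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
trfeature_names = np.asarray(trvector.get_feature_names_out()).reshape(1, trresponse.shape[1])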
trfeature_TFIDF = np.asarray(trresponse.sum(axis = 0))
trIndex = np.argsort(trfeature_TFIDF)

nplot = 30
trplot_names = trfeature_names[:, trIndex[0, ::-1]][:, 50 : 50 + nplot].tolist()[0]
trplot_TFIDF = trfeature_TFIDF[:, trIndex[0, ::-1]][:, 50 : 50 + nplot].tolist()[0]

plt.figure(figsize=(25, 10))
plt.bar(trplot_names, trplot_TFIDF)
plt.xlabel("Words")
plt.ylabel("TF-IDF")
plt.xticks(rotation=90)
plt.show()

In [9]:
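# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tefeature_names = np.asarray(tevector.get_feature_names_out()).reshape(1, teresponse.shape[1])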
tefeature_TFIDF = np.asarray(teresponse.sum(axis = 0))
teIndex = np.argsort(tefeature_TFIDF)

nplot = 30
teplot_names = tefeature_names[:, teIndex[0, ::-1]][:, 50 : 50 + nplot].tolist()[0]
teplot_TFIDF = tefeature_TFIDF[:, teIndex[0, ::-1]][:, 50 : 50 + nplot].tolist()[0]

plt.figure(figsize=(25, 10))
plt.bar(teplot_names, teplot_TFIDF)
plt.xlabel("Words")
plt.ylabel("TF-IDF")
plt.xticks(rotation=90)
plt.show()

Adjust the nplot value and inspect the words with high TF-IDF weights to check whether generic words that were never filtered out (such as stop words) are among them.

**1) Think about this and answer: when you train the unsupervised model for matrix factorization, should you include texts (word features) from the test dataset or not as the input matrix? Why or why not?**

A supervised model must be fit on the training data only, because letting the test data influence the fit leaks information and inflates the test score. Unsupervised matrix factorization never sees labels, so from the model's point of view there is no real difference between test texts and train texts. No label feedback reaches the factorization, so including the test texts cannot overfit it to the test labels. We can therefore fit one model on both sets together and then read off the predicted category for each article individually.
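
The two ways of building the input matrix can be sketched as follows. This is an illustration under assumptions: the four short texts are invented, and the variable names (`combined`, `train_only`) do not appear in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts, made up for illustration.
train_texts = ["shares rose on the stock market", "the striker scored twice"]
test_texts = ["the market fell", "a late goal won the match"]

# Option A: one vectorizer fit on train + test together
# (the choice argued for above when the model is unsupervised).
combined = TfidfVectorizer().fit(train_texts + test_texts)
X_all = combined.transform(train_texts + test_texts)

# Option B: fit on train only, then reuse that vocabulary for test.
# Test-only words are dropped, but both matrices share the same columns.
train_only = TfidfVectorizer().fit(train_texts)
X_train = train_only.transform(train_texts)
X_test = train_only.transform(test_texts)

print(X_all.shape, X_train.shape, X_test.shape)
```

Option B is the usual pattern for a supervised model. Fitting separate vectorizers on train and test gives matrices whose columns do not line up, so that only suits exploratory plots.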

# 5. Create Models

**2) Build a model using the matrix factorization method(s) and predict the train and test data labels. Choose any hyperparameter (e.g., number of word features) to begin with.**

Model 1 uses NMF with init = 'nndsvd', beta_loss = "frobenius", and max_iter = 1000.

**3) Measure the performances on predictions from both train and test datasets. You can use accuracy, confusion matrix, etc., to inspect the performance. You can get accuracy for the test data by submitting the result to Kaggle.**

Several models were created and performance evaluations were performed. Please check the code below.

In [10]:
corpus = pd.concat([train["Text"], test["Text"]], copy = True, ignore_index = True).tolist()
vector = TfidfVectorizer()
response = vector.fit_transform(corpus)
response.shape

**Model 1**

In [11]:
model1 = NMF(n_components = 5, random_state = 123, init = 'nndsvd', beta_loss = "frobenius", max_iter= 1000)
W = model1.fit_transform(response)
H = model1.components_

In [12]:
train.iloc[0, :]

In [13]:
W.shape, H.shape

In [14]:
W[0, :]
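
Each row of W holds one article's weight on every topic, and each row of H holds one topic's weight on every word. A quick way to see what a topic represents is to list its highest-weighted words. The helper below is a hypothetical sketch (`top_words` is not part of the notebook or of scikit-learn), shown on a made-up 2 x 4 matrix.

```python
import numpy as np

def top_words(H, feature_names, n_top=3):
    """Return the n_top highest-weighted words for each row (topic) of H."""
    feature_names = np.asarray(feature_names)
    # Sort each row in descending order of weight and keep the first n_top columns.
    order = np.argsort(H, axis=1)[:, ::-1][:, :n_top]
    return [feature_names[row].tolist() for row in order]

# Toy factor matrix: 2 topics x 4 words (values are made up).
H_toy = np.array([[0.9, 0.1, 0.0, 0.5],
                  [0.0, 0.8, 0.7, 0.1]])
words = ["market", "match", "goal", "bank"]

print(top_words(H_toy, words, n_top=2))
# [['market', 'bank'], ['match', 'goal']]
```

On the real model this could be called as `top_words(H, vector.get_feature_names_out(), n_top=10)`, assuming scikit-learn 1.0 or later.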

In [15]:
def predictions(W):

    pred = np.zeros(shape = (W.shape[0]))
    n_rows = W.shape[0]
    n_cols = W.shape[1]
    
    for i in range(n_rows):
        result = (None, 0) 
        for j in range(n_cols):
            if W[i, j] > result[1]:
                result = (j, W[i, j])
        if result[0] is None:
            continue
        pred[i] = result[0]
 
    return(pred)

def catenames(df, pred):

    df_numpy = df.to_numpy()
    n = df_numpy.shape[0]
    labels = df_numpy[:, 2]
    cate = list(np.unique(labels))
    candi = list(itertools.permutations(cate))
    result = (None, float("inf"))
    
    for c in candi:
        n2= 0
        for i in range(n):
            if c[int(pred[i])] != labels[i]:
                n2 = n2 + 1
        if n2 < result[1]:
            result = (c, n2)
    
    return(result)

In [16]:
pred = predictions(W)
catename = catenames(train, pred)

print(catename[0], catename[1])
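
The two helpers above amount to a row-wise argmax followed by a search over every topic-to-category assignment. A compact sketch of the same idea on toy data is given below. The function name `best_label_mapping` and the toy arrays are assumptions for illustration, and the search assumes there are no more topics than categories.

```python
import itertools
import numpy as np

def best_label_mapping(topic_ids, labels):
    """Try every topic -> category assignment and keep the one with the fewest errors."""
    labels = np.asarray(labels)
    categories = sorted(set(labels.tolist()))
    best_perm, best_errors = None, float("inf")
    for perm in itertools.permutations(categories):
        # perm[k] is the category name assigned to topic k.
        mapped = np.asarray(perm)[topic_ids]
        errors = int((mapped != labels).sum())
        if errors < best_errors:
            best_perm, best_errors = perm, errors
    return best_perm, best_errors

# Toy document-topic weights: 4 articles x 2 topics (values are made up).
W_toy = np.array([[0.9, 0.1], [0.2, 0.7], [0.8, 0.3], [0.1, 0.6]])
topic_ids = W_toy.argmax(axis=1)  # strongest topic per article
labels = ["sport", "tech", "sport", "sport"]

print(best_label_mapping(topic_ids, labels))
# (('sport', 'tech'), 1)
```

With 5 categories there are only 120 permutations to test, so the brute-force search is cheap here, but it grows factorially with the number of categories.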

In [17]:
ca = list(train["Category"])
ca2 = list()
labeldic = {
    catename[0][0] : 0,
    catename[0][1] : 1,
    catename[0][2] : 2,
    catename[0][3] : 3, 
    catename[0][4] : 4
}
for c in ca:
    temp = labeldic[c]
    ca2.append(temp)
ca2 = np.asarray(ca2)
ca2

In [18]:
conf_matrix = confusion_matrix(ca2, pred[:ca2.shape[0]])
accuracy = accuracy_score(ca2, pred[:ca2.shape[0]])
recall = recall_score(ca2, pred[:ca2.shape[0]],average = None)
precision = precision_score(ca2, pred[:ca2.shape[0]], average = None)
RMSE = mean_squared_error(ca2, pred[:ca2.shape[0]])**0.5
print(conf_matrix)
print(accuracy)
print(catename[0])
print(recall)
print(precision) 
print(RMSE)

In [19]:
print(precision.mean())
print(recall.mean())
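
The accuracy, recall and precision printed above can all be read off the confusion matrix. The sketch below shows the arithmetic on a made-up 3-class matrix, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (toy counts).
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 1, 9]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # correct / number of true members of each class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / number of predictions of each class

print(accuracy, recall, precision)
```

Averaging the recall and precision arrays, as the cell above does, gives the macro-averaged scores.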

In [20]:
tepred = pred[ca2.shape[0]:].astype(int) 
make = {
    0 : catename[0][0],
    1 : catename[0][1],
    2 : catename[0][2],
    3 : catename[0][3], 
    4 : catename[0][4] 
}
pred_f = []
for p in tepred:
    temp = make[p]
    pred_f.append(temp)
    
testdict = {
    "ArticleId" : list(test["ArticleId"]),
    "Category" : pred_f
}

result1= pd.DataFrame(testdict)

In [21]:
Image("../input/images/model1.jpg")

**Model 2**

Model 2 uses NMF with init = 'nndsvda', solver = "mu", beta_loss = "kullback-leibler", and max_iter = 1000.
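
The two beta_loss settings measure the gap between the TF-IDF matrix V and its reconstruction WH differently. The sketch below computes both on a tiny made-up example. The function names and the `eps` guard against log(0) are assumptions added for illustration. scikit-learn evaluates these losses internally.

```python
import numpy as np

def frobenius_loss(V, WH):
    # Half the squared Frobenius norm of the reconstruction error.
    return 0.5 * np.sum((V - WH) ** 2)

def kl_loss(V, WH, eps=1e-12):
    # Generalized Kullback-Leibler divergence between V and WH.
    # eps is an illustrative guard so that zero entries do not hit log(0).
    V = np.asarray(V, dtype=float)
    WH = np.asarray(WH, dtype=float)
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

# Toy matrices (values are made up).
V = np.array([[1.0, 0.0], [2.0, 3.0]])
WH = np.array([[1.0, 0.5], [1.0, 3.0]])

print(frobenius_loss(V, WH), kl_loss(V, WH))
```

Both losses are zero when WH reproduces V exactly. The Frobenius loss penalizes squared differences, while the KL loss penalizes errors relative to the size of the entries.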

In [22]:
model2 = NMF(n_components = 5, random_state = 123, init = 'nndsvda', solver = "mu", beta_loss = "kullback-leibler", max_iter= 1000)
W = model2.fit_transform(response)
H = model2.components_

In [23]:
pred2 = predictions(W)
catename = catenames(train, pred2)

print(catename[0], catename[1])

In [24]:
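# Rebuild the numeric labels with model 2's topic-to-category mapping before scoring
labeldic = {name: idx for idx, name in enumerate(catename[0])}
ca2 = np.asarray([labeldic[c] for c in train["Category"]])
conf_matrix2 = confusion_matrix(ca2, pred2[:ca2.shape[0]])
accuracy = accuracy_score(ca2, pred2[:ca2.shape[0]])
recall = recall_score(ca2, pred2[:ca2.shape[0]],average = None)
precision = precision_score(ca2, pred2[:ca2.shape[0]], average = None)
RMSE = mean_squared_error(ca2, pred2[:ca2.shape[0]])**0.5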
print(conf_matrix2)
print(accuracy)
print(catename[0])
print(recall)
print(precision)
print(RMSE)

In [25]:
print(precision.mean())
print(recall.mean())

In [26]:
tepred = pred2[ca2.shape[0]:].astype(int) 
make = {
    0 : catename[0][0],
    1 : catename[0][1],
    2 : catename[0][2],
    3 : catename[0][3], 
    4 : catename[0][4] 
}
pred_f = []
for p in tepred:
    temp = make[p]
    pred_f.append(temp)
    
testdict = {
    "ArticleId" : list(test["ArticleId"]),
    "Category" : pred_f
}

result2= pd.DataFrame(testdict)

In [27]:
Image("../input/images/model2.jpg")

**4) Change hyperparameter(s) and record the results. We recommend including a summary table and/or graphs.**

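The hyperparameters were changed between Model 1 and Model 2, and each model's predictions were scored on the Kaggle test set. The bar chart below compares the two accuracies.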

In [28]:
model_n = ['Model1', 'Model2']
acc11 = [0.89115, 0.96326]
colors = ['y', 'dodgerblue']
plt.bar(model_n, acc11,color = colors)

plt.xlabel("Model")
plt.ylabel("Accuracy")
plt.show()

**5) Improve the model performance if you can- some ideas may include but are not limited to; using different feature extraction methods, fit models in different subsets of data, ensemble the model prediction results, etc.**

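Switching the initialization to 'nndsvda' and the loss to Kullback-Leibler with the multiplicative-update solver raised accuracy from 0.89115 (Model 1) to 0.96326 (Model 2), so Model 2 performs better.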

# 6. Supervised Learning

In [29]:
prop = 0.1
start = train.to_numpy().shape[0]
n_tr = math.floor(prop * start)
train2 = response[: n_tr, :]
test2 = response[start:, :]

trca = list(train["Category"])[:n_tr]
nca = []
for ca in trca:
    nca.append(labeldic[ca])
nca = np.asarray(nca)

logi_model = LogisticRegression().fit(train2, nca)
pred_logi = logi_model.predict(test2)

pred_new = []
for p in pred_logi:
    temp = make[p]
    pred_new.append(temp)
    
testdict = {
    "ArticleId" : list(test["ArticleId"]),
    "Category" : pred_new
}

ans1 = pd.DataFrame(testdict)

In [30]:
Image("../input/images/ans1_model3.jpg")

In [31]:
prop = 0.2
start = train.to_numpy().shape[0]
n_tr = math.floor(prop * start)
train2 = response[: n_tr, :]
test2 = response[start:, :]

trca = list(train["Category"])[:n_tr]
nca = []
for ca in trca:
    nca.append(labeldic[ca])
nca = np.asarray(nca)

logi_model = LogisticRegression().fit(train2, nca)
pred_logi = logi_model.predict(test2)

pred_new = []
for p in pred_logi:
    temp = make[p]
    pred_new.append(temp)
    
testdict = {
    "ArticleId" : list(test["ArticleId"]),
    "Category" : pred_new
}

ans2 = pd.DataFrame(testdict)

In [32]:
Image("../input/images/ans2_model4.jpg")

In [33]:
prop = 0.5
start = train.to_numpy().shape[0]
n_tr = math.floor(prop * start)
train2 = response[: n_tr, :]
test2 = response[start:, :]

trca = list(train["Category"])[:n_tr]
nca = []
for ca in trca:
    nca.append(labeldic[ca])
nca = np.asarray(nca)

logi_model = LogisticRegression().fit(train2, nca)
pred_logi = logi_model.predict(test2)

pred_new = []
for p in pred_logi:
    temp = make[p]
    pred_new.append(temp)
    
testdict = {
    "ArticleId" : list(test["ArticleId"]),
    "Category" : pred_new
}

ans3 = pd.DataFrame(testdict)

In [34]:
Image("../input/images/ans3_model5.jpg")

In [35]:
prop = 1.0
start = train.to_numpy().shape[0]
n_tr = math.floor(prop * start)
train2 = response[: n_tr, :]
test2 = response[start:, :]

trca = list(train["Category"])[:n_tr]
nca = []
for ca in trca:
    nca.append(labeldic[ca])
nca = np.asarray(nca)

logi_model = LogisticRegression().fit(train2, nca)
pred_logi = logi_model.predict(test2)

pred_new = []
for p in pred_logi:
    temp = make[p]
    pred_new.append(temp)
    
testdict = {
    "ArticleId" : list(test["ArticleId"]),
    "Category" : pred_new
}

ans4 = pd.DataFrame(testdict)

In [36]:
Image("../input/images/ans4_model6.jpg")
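
The four cells above differ only in the value of prop, so the experiment can be wrapped in one function. The sketch below is a hypothetical refactoring (`fit_on_fraction` is not in the notebook), run on synthetic two-class data so that it is self-contained. Like the cells above, it takes the first share of the training rows rather than a random sample.

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_on_fraction(X_train, y_train, X_test, prop):
    """Fit logistic regression on the first `prop` share of the training rows."""
    n_tr = math.floor(prop * X_train.shape[0])
    model = LogisticRegression().fit(X_train[:n_tr], y_train[:n_tr])
    return model.predict(X_test)

# Synthetic data, made up for illustration: classes alternate row by row,
# and class 1 is shifted by 3 in every feature.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 50)
X = rng.normal(size=(100, 4)) + 3.0 * y[:, None]

results = {}
for prop in (0.1, 0.2, 0.5, 1.0):
    pred = fit_on_fraction(X[:80], y[:80], X[80:], prop)
    results[prop] = float((pred == y[80:]).mean())
print(results)
```

In the notebook, the equivalent call would pass `response[:start]`, the numeric training labels, and `response[start:]`, assuming those variables are defined as in the cells above.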

# Result

**1) Pick and train a supervised learning method(s) and compare the results (train and test performance)**

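Supervised learning is the better choice when the goal is to maximize accuracy: logistic regression trained on all labels reached 0.97551, compared with 0.96326 for the best NMF model. See the graph below.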

**2) Discuss comparison with the unsupervised approach. You may try changing the train data size (e.g., Include only 10%, 20%, 50% of labels, and observe train/test performance changes). Which methods are data-efficient (require a smaller amount of data to achieve similar results)? What about overfitting?**

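The higher the proportion of labels used, the better the accuracy: it rises from 0.68027 with 10% of the labels to 0.97551 with all of them. The NMF model, which uses labels only to name its topics, reached 0.96326, a level that logistic regression matched only with about half of the labels (0.96598).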

TF-IDF is computed over the training and test texts together, so the features used by the supervised model have already seen the test set. This leakage may make the test accuracy slightly optimistic.
A good follow-up project would be to vary the number of NMF components and look for spikes in test accuracy, which could indicate genuine subtopics.

In [37]:
props = [0.1, 0.2, 0.5, 1]
acc = [0.68027, 0.85578, 0.96598, 0.97551]

plt.plot(props, acc)
plt.xlabel("Proportion of labels used")
plt.ylabel("Accuracy")
plt.show()

# 7. Result summary

Save the best-performing predictions (ans4, logistic regression trained on all labels) as the final submission file.

In [38]:
ans4.to_csv('submission2.csv', index=False)