# Naive Bayes classification of news articles

Xiaoyu Lu

The objective of this assignment is to use the Naive Bayes algorithm to build a classifier to automatically
categorize news articles into different topics. I used the Reuterâ€™s dataset (`Reuters-21578`), which include
thousands of news article items, each with its own topic label. The data are saved in 22 separate `*.sgm`
files. For this assignment, I am only focusing on the below topics:

> money, fx, crude, grain, trade, interest, wheat, ship, corn, oil, dlr, gas, oilseed, supply, sugar,gnp, coffee, veg, gold, soybean, bop, livestock, cp

The `pyspark` will be used for the entire project. We also use `sklearn` to compare the speed with `pyspark`.

In [1]:
from os import chdir, getcwd
from glob import glob
import pyspark
import numpy as np
from bs4 import BeautifulSoup
import pandas as pd
import time
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer, SnowballStemmer

In [101]:
path = getcwd()
chdir(path)

In [20]:
topic_list = ["money", "fx", "crude", "grain", "trade", "interest", "wheat", 
              "ship", "corn", "oil", "dlr", "gas", "oilseed", "supply", "sugar", 
              "gnp", "coffee", "veg", "gold", "soybean", "bop", "livestock", "cpi"]

Below are two functions we will use during this assignment

In [21]:
def if_topic_in(topic, topic_list = topic_list):
    """function to determine if each entry belongs to our topic list
    ---------------------------------------------
    
    :param topic: list of many topics of one article
    :param topic_list: list of pre-defined topics
    
    :returns: index of first element in the topic list that belongs to topic_list
    """
    try:
        ans = list(set(topic).intersection(topic_list))
    except:
        ans = ""
    
    return ans

In [22]:
def cleanbody(text):
    """function to clean text by removing punctuations, and numbers
    ---------------------------------------------
    
    :param text: a string
    
    :returns: string with punctuations and numbers removed
    """
    stopwords_set= set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    text = text.replace('\n',' ').lower().strip()
    text = re.sub("[^a-zA-Z]+", " ", text).split()
    text = ' '.join(stemmer.stem(i) for i in text)
    stemmed = ' '.join([word for word in text.split() if word not in stopwords_set])
    return(stemmed)

## 1. Pre-processing the Reuters' dataset (Reuters-21578)

For this project we are only interested in articles whose topic falls in the list of:

``["money", "fx", "crude", "grain", "trade", "interest", "wheat", "ship", "corn", "oil", "dlr", "gas", "oilseed", "supply", "sugar", "gnp", "coffee", "veg", "gold", "soybean", "bop", "livestock", "cpi"]``

Some articles are of multiple topics separated by dashlines (e.g. "money-fx"). In this case, we will create duplicate articles with each article corresponding to one topic. This will inevitably reduce the accuracy of our output, but we can live with it for this project.

We started with parsing through all the (`*.sgm`) files and create a list of all relevant articles as tuples (topic, body). The `BeautifulSoup` library is used for parsing.

In [107]:
f_list = glob('reuters21578/*.sgm')

In [108]:
doi_list = list()
for filename in f_list:
    print('Start parsing {0}...'.format(filename))
    file = open(filename, 'rb')
    soup = BeautifulSoup(file, 'html.parser')
    file.close()
    for topic_raw in soup.find_all('topics'):
        topic = topic_raw.get_text().split('-')
        topic = if_topic_in(topic)
        if len(topic) != 0:
            body = topic_raw.find_next('body').get_text()
            for t in topic:
                tb_tup = (t, body)
                doi_list.append(tb_tup)
    print('Done.')

Start parsing reuters21578/reut2-004.sgm...
Start parsing reuters21578/reut2-010.sgm...
Start parsing reuters21578/reut2-011.sgm...
Start parsing reuters21578/reut2-005.sgm...
Start parsing reuters21578/reut2-013.sgm...
Start parsing reuters21578/reut2-007.sgm...
Start parsing reuters21578/reut2-006.sgm...
Start parsing reuters21578/reut2-012.sgm...
Start parsing reuters21578/reut2-016.sgm...
Start parsing reuters21578/reut2-002.sgm...
Start parsing reuters21578/reut2-003.sgm...
Start parsing reuters21578/reut2-017.sgm...
Start parsing reuters21578/reut2-001.sgm...
Start parsing reuters21578/reut2-015.sgm...
Start parsing reuters21578/reut2-014.sgm...
Start parsing reuters21578/reut2-000.sgm...
Start parsing reuters21578/reut2-019.sgm...
Start parsing reuters21578/reut2-018.sgm...
Start parsing reuters21578/reut2-020.sgm...
Start parsing reuters21578/reut2-008.sgm...
Start parsing reuters21578/reut2-009.sgm...
Start parsing reuters21578/reut2-021.sgm...


In [109]:
data = pd.DataFrame(doi_list)
data.columns = (['topic', 'body'])
data['body'] = data['body'].apply(cleanbody)
print('A total number of {0} items were retrieved. Articles with multiple classes are recorded multiple times.'.format(len(data)))
data.head()

A total number of 3625 items were retrieved. Articles with multiple classes are recorded multiple times.


Unnamed: 0,topic,body
0,trade,hous trade lawmak took first vote measur desig...
1,trade,soviet first deputi prime minist vsevolod mura...
2,crude,venezuela suppli ecuador yet undetermin amount...
3,trade,britain today call japan increas foreign impor...
4,oil,british minist said propos european communiti ...


The final output is saved as a `pandas` DataFrame. A total number of 3625 items were retrieved. Articles with multiple topics are recorded multiple times.

For convenience, we will save the data as a `.txt` file. The first 10 entries are also saved separately as a `.txt` file

In [110]:
data.to_csv('training_test_data.txt', index=False)
data.loc[0:10].to_csv('top10.txt', index = False)

Let's print out the first 10 items. It can be observed that all the plurals, tense-related modifications, and stop words have been removed. 

In [9]:
data_top10

Unnamed: 0,topic,body
0,trade,hous trade lawmak took first vote measur design toughen u trade law held tomorrow difficult vote controversi plan protect american industri meet close session hous way mean trade subcommitte fail resolv one sensit issu bill whether forc major foreign trade partner sever cut trade surplus unit state subcommitte consid tone version democrat sponsor trade legisl aim open foreign market drop last year effort forc presid reagan retali quota tariff congression aid ask identifi said lawmak intend wrap propos tomorrow consid propos mandat retali without set specif trade penalti legisl face anoth hurdl full way mean committe next week befor full hous vote rep richard gephardt missouri democrat seek parti presidenti nomin said may offer amend call reduct trade surplus countri barrier import u good would moder version earlier plan forc mandatori ten per cent annual cut trade surplus unit state japan south korea taiwan west germani countri largest trade imbal criteria good amend set standard get trade deficit told report trade law chang becom part major congression administr effort turn around record u trade deficit billion dlrs last year open foreign market make u product competit hous speaker jame wright texa democrat said today expect full hous approv trade bill may reagan accept final congression bill expect whatev report way mean committe pass good bill effect bill told report comprehens trade bill includ work committe eas export control high technolog aid u worker displac foreign competit stimul research develop remov foreign trade barrier improv educ worker train lawmak agre first time u industri could charg foreign produc unfair competit deni basic worker right collect bargain safeti rule payment minimum wage appropri countri econom develop transfer u trade repres clayton yeutter power held reagan decid whether retali foreign violat fair trade rule whether injur industri deserv import relief agre make easier compani get temporari relief import competit agre industri provid plan becom competit administr announc support yeutter said yesterday cautious optimist democrat led hous come accept bill reuter
1,trade,soviet first deputi prime minist vsevolod murakhovski said end brief visit countri want boost joint busi franc reduct franc trade deficit soviet union depend french murakhovski also chairman state agro industri committe gosagroprom told news confer discuss varieti possibl deal french compani rhone poulenc pechiney imec declin put figur possibl contract said discuss plant protect process high sulphur gas rhone poulenc packag technolog agricultur product pechiney fruit veget juic process imec offici pechiney said agreement intent packag could sign soon could give ani detail two compani immedi avail comment ask whether foresaw reduct year franc trade shortfal billion franc first month billion whole murakhovski told reuter depend franc meet pari last januari french soviet foreign trade minist said commit increas effort reduc deficit estim time show french mln franc surplus decemb murakhovski said soviet union prepar talk anybodi interest propos offer latest technolog assur mutual advantag said soviet union mani task ahead would deal rapid propos consid interest encourag compani take advantag new law guarante interest foreign partner joint ventur said agreement yet finalis new joint ventur law said concret deal yet finalis result one billion dollar accord sign moscow last month french businessman jean baptist doumeng said doumeng interagra compani prepar propos examin soviet union doumeng last month said agreement exchang one billion dollar worth good murakhovski said agreement one intent design primarili renew increas soviet union food product capac reuter
2,crude,venezuela suppli ecuador yet undetermin amount crude oil help meet export commit serious affect last week earthquak energi mine minist arturo hernandez grisanti said gave detail deal said crude oil loan agreement made state oil compani petroleo de venezuela pdvsa ecuador cepe ecuador forc suspend oil export expect four month earthquak damag pipelin oil account per cent export incom hernandez speak report miraflor palac result talk ecuador deputi energi minist fernando santo alvit arriv last night volum lent ecuador would discount opec quota would affect venezuela said would august produc quota sell addit amount ecuador would repay us said elabor quota arrang say ecuador would notifi opec telex venezuela would lend certain amount mani day venezuela opec output quota current million barrel day ecuador set bpd reuter
3,trade,britain today call japan increas foreign import risk rise protection harm would bring trade nation british trade industri secretari paul channon said japan must heed report issu japanes govern advisori bodi decemb call faster domest demand help cut trade surplus restructur economi recognis strong yen brought problem japan domest economi told group japanes businessmen london short term difficulti allow deflect japan fundament reform necessari said domest issu japan import propens doe expand veri soon real risk protectionist lobbi particular u japan massiv surplus said may well succeed secur action govern would high injuri trade nation like japan u k channon said substanti growth volum trade japan britain amount billion sterl billion dlrs last year ad regrett much one direct japanes sell us billion sterl billion dlrs sold reuter
4,oil,british minist said propos european communiti tax veget oil fat would rais price fish chip pledg govern would fight lord belstead junior agricultur minist told hous lord tax would rais price raw materi use mani process food pct said revenu rais tax consum call propos repugn reuter
5,veg,british minist said propos european communiti tax veget oil fat would rais price fish chip pledg govern would fight lord belstead junior agricultur minist told hous lord tax would rais price raw materi use mani process food pct said revenu rais tax consum call propos repugn reuter
6,money,u treasuri secretari jame baker declin comment februari pari accord six major industri nation agre foster exchang rate stabil ask report speech befor nation fit foundat banquet ani currenc intervent level set pari baker repli never talk intervent baker also declin comment view foreign exchang market reaction accord reuter
7,fx,u treasuri secretari jame baker declin comment februari pari accord six major industri nation agre foster exchang rate stabil ask report speech befor nation fit foundat banquet ani currenc intervent level set pari baker repli never talk intervent baker also declin comment view foreign exchang market reaction accord reuter
8,crude,ecuador ask opec rais oil export quota barrel per day compens lost output due last week earthquak deputi energi minist fernando santo alvit said santo alvit arriv caraca last night discuss aid plan ecuador say organis petroleum export countri opec would approach addit output would relat plan discuss venezuela mexico lend ecuador crude repair pipelin damag quak earlier venezuelan energi mine minist aturo hernandez grisanti said countri would suppli unspecifi part ecuador export commit santo alvit told report hope first cargo barrel could leav maracaibo weekend suppli refineri near guayaquil ad ecuador also want make bpd ship caribbean destin mexico might suppli ecuador south korean market ecuador may unabl export oil five month due extens damag mile stretch pipelin link jungl oilfield pacif port balao reuter
9,crude,china close second round bid foreign firm offshor oil explor right china daili report quot spokesman china nation offshor oil corp cnooc say china sign eight contract foreign firm block pearl river mouth south yellow sea cover total area sq km second round bid began end onli one well far produc result lufeng km south east shenzhen output barrel day well drill group japanes compani spokesman ad cnooc readi enter contract offshor block befor third round bid began say would ad contract would bound restrict impos dure second round china sign oil contract agreement compani countri sinc offshor explor open foreign eleven contract termin oil discov foreign firm invest billion dlrs offshor china sinc reuter


## 2. TF-IDF transformation in `sklearn`

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk.tokenize import RegexpTokenizer

In [10]:
data = pd.read_csv('training_test_data.txt',index_col=None)

In [11]:
body_list = list(data['body'])
start = time.clock()
vectorizer = TfidfVectorizer()
vectorizer.fit(data['body'])
print ("sklearn TFIDF processing time: {0:.5f} s".format(time.clock() - start))

sklearn TFIDF processing time: 0.37128 s


## 3. TF-IDF transformation in `pyspark`

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import Column
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.ml.feature import HashingTF, IDF, OneHotEncoder, StringIndexer
from pyspark.ml.classification import NaiveBayes

path = getcwd()
chdir(path)

spark = SparkSession\
        .builder\
        .appName("NewsClassification")\
        .getOrCreate()

df = spark.read.csv("training_test_data.txt",header=True,inferSchema=True)

Below is a udf (user defined function) that split the body text. The lines directly beneath applies the funciton to the `body` column meanwhile change the type to `ArrayType`.

In [4]:
def text_split(text):
    """
    user-defined funtion to split the text
    """
    text = text.split()
    return text

In [5]:
clean_udf = udf(text_split, ArrayType(StringType()))
df = df.withColumn("body", clean_udf("body"))

In [6]:
#following section transforms the text using TFIDF
start = time.clock()
hashingTF = HashingTF(inputCol="body", outputCol="term_freq")
df = hashingTF.transform(df)
idf = IDF(inputCol="term_freq", outputCol="tfidf")
idfModel = idf.fit(df)
df = idfModel.transform(df)
print ("pyspark TFIDF processing time: {0:.5f} s".format(time.clock() - start))

pyspark TFIDF processing time: 0.01982 s


## 4. Building a Naive Bayes Classifier

The first step is to convert the topics (nominal) to a list of discrete integers

In [7]:
#Using the OneHotEncoder to convert the topics into discrete integers
stringIndexer = StringIndexer(inputCol="topic", outputCol="topicIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

The entire dataset will be split 3 ways into the training/test/cross-evaluation set, and 3 different split proportions (`50/40/10`, `60/30/10`, and `70/20/10`) were used. The Naive Bayes classfier was trained, and for each split condition our model will train 10 times to evaluate the sensitivity of the model.

A total number of 30 models will be trained, and their parameters and accuracy are stored as key-value pairs in a dictionary.

In [8]:
val_dict = dict()
train_test_cv_split_params = {'50/40/10': [0.5, 0.4, 0.1],
                               '60/30/10': [0.6, 0.3, 0.1], 
                               '70/20/10': [0.7, 0.2, 0.1]}

for split_param in train_test_cv_split_params.keys(): #run the model for each train/test/cv split
    for seed in np.arange(10): #run each model 10 times using different random seed
        train,test,cv = indexed.select("tfidf","topicIndex").randomSplit(train_test_cv_split_params[split_param],seed=seed)

        #Naive bayes
        nb = NaiveBayes(featuresCol="tfidf", labelCol="topicIndex", predictionCol="NB_pred",
                        probabilityCol="NB_prob", rawPredictionCol="NB_rawPred")
        nbModel = nb.fit(train)
        cv = nbModel.transform(cv)
        total = cv.count()
        correct = cv.where(test['topicIndex'] == cv['NB_pred']).count()
        accuracy = correct/total
        val_dict[(split_param, seed)] = accuracy

In [9]:
params = max(val_dict, key = val_dict.get)
print("The combination of parameters that produced the highest accuracy ({0:.2f}): train/test/cv split ratio: {1}, randomseed: {2}".format(max(val_dict.values()),params[0], params[1]))

The combination of parameters that produced the highest accuracy (0.52): train/test/cv split ratio: 70/20/10, randomseed: 5


In [13]:
def meancal(val_dict, split_param):
    l = list()
    for i in val_dict.keys():
        if i[0] == split_param:
            l.append(val_dict[i])
    lmean = np.mean(l)
    lstd = np.std(l)
    return (lmean, lstd)

In [16]:
print('The mean accuracy of the 30 models: {0:.3f}'.format(np.mean(list(val_dict.values()))))

for split_param in train_test_cv_split_params:
    mean_accuracy, std_accuracy = meancal(val_dict, split_param)
    print('The split condition {0} has a mean accuracy of {1:.3f}'.format(split_param, mean_accuracy))
    print('The st.d. of split condition {0} for 10 runs: {1:.3f}'.format(split_param, std_accuracy))

The mean accuracy of the 30 models: 0.491
The split condition 50/40/10 has a mean accuracy of 0.481
The st.d. of split condition 50/40/10 for 10 runs: 0.013
The split condition 60/30/10 has a mean accuracy of 0.495
The st.d. of split condition 60/30/10 for 10 runs: 0.013
The split condition 70/20/10 has a mean accuracy of 0.498
The st.d. of split condition 70/20/10 for 10 runs: 0.012


Generally, the accuracy of the model increases as we have a higher proportion of training data. In our case the highest performing model was produced with a split of `70/20/10` with a mean accuracy of 0.498. The `70/20/10` split also produces the lowest st.d., therefore it is less sensitive to the split of the data compared to the other two. It is possible that if we increase the size of the training data, we will increase the model accuracy.



For this assignment, I built a simple Naive Bayes classifier to classify articles based on the topics. As the
result suggests, when using the highest percent of data for training, the model produced the highest accuracy.
A possible future development might be to gather more data to see if the accuracy improves.

Only a limited number of topics were used for this assignment. It would be interesting to see what the
model performance will be when more topics are used. Also, we only used single topic for each article
entry, without considering multi-labeled articles.

Last, I trained the model based on the TFIDF result created using the cleaned text. I did not consider
the combination of multiple words and their influence to the model. `Ngram` provides such capability, and it
would be interesting to see if the model improves.