# Article Classification using Naive Bayes

Created by Patrick Steeves as part of Independent Study with Professor Kanungo<br>
George Washington University 12/23/2017

This project trains a Naive Bayes (NB) classifier on news article headlines to classify articles by topic.

In [1]:
%run Module.ipynb

### Import Data

The original data can be found at https://www.kaggle.com/uciml/news-aggregator-dataset <br>
The data contains roughly 400,000 headlines from news stories published in 2014, each assigned to one of four categories: health, business, science and technology, or entertainment.

Import our data from GitHub

In [2]:
titles = importData()

Let's take a look at the data below. As we can see in the CATEGORY column, titles have the following categories: <br>
b - business <br>
e - entertainment <br>
t - science and technology <br>
m - health <br>

In [3]:
titles.iloc[:5,:]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


Cleaning all headlines takes about 2 hours, so the next cell gives the user the option to pull pre-cleaned data directly from GitHub instead.

In [4]:
download = 0   # Set to 1 to download pre-cleaned titles

In [5]:
import urllib.request
import pandas as pd

if download:
    # Fetch the pre-cleaned headlines from GitHub and load them
    url = "https://github.com/psteeves/NLP-projects/raw/master/Naive%20Bayes%20Topic%20Classifier/Data/"
    filename3, headers3 = urllib.request.urlretrieve(url+'cleaned_data.csv', filename='cleaned_titles.csv')
    headlines = pd.read_csv('cleaned_titles.csv', encoding = 'latin1', keep_default_na = False)

else:
    # Clean the raw titles locally (slow), then drop the columns the classifier does not use
    headlines = cleanWords(titles, 'TITLE')
    headlines = headlines.drop(['URL','STORY','TIMESTAMP','HOSTNAME','ID'], axis=1)

Started cleaning headlines...
Done cleaning 6.6% of headlines
Done cleaning 14.9% of headlines
Done cleaning 23.1% of headlines
Done cleaning 31.2% of headlines
Done cleaning 39.4% of headlines
Done cleaning 47.9% of headlines
Done cleaning 56.3% of headlines
Done cleaning 64.6% of headlines
Done cleaning 73.0% of headlines
Done cleaning 81.6% of headlines
Done cleaning 89.9% of headlines
Done cleaning 98.3% of headlines
Took 122.1 minutes to clean titles


In [6]:
headlines.iloc[:3,:]

Unnamed: 0,TITLE,PUBLISHER,CATEGORY,STOPPED_WORDS
0,"Fed official says weak data caused by weather,...",Los Angeles Times,b,"fed,offici,say,weak,data,caus,weather,slow,taper"
1,Fed's Charles Plosser sees high bar for change...,Livemint,b,"fed,charl,plosser,see,high,bar,chang,pace,taper"
2,US open: Stocks fall after Fed official hints ...,IFA Magazine,b,"us,open,stock,fall,fed,offici,hint,acceler,taper"


### Train Classifier

Below we train our Naive Bayes classifier on 80% of our data and hold out the remaining 20% for testing. The docs for the classifier object can be found in the Module.ipynb notebook at<br> https://github.com/psteeves/NLP-projects/blob/master/Naive%20Bayes%20Topic%20Classifier/Scripts/Module.ipynb

In [7]:
classifier = NBClassifier(headlines['STOPPED_WORDS'], headlines['CATEGORY'], train_split = 0.8)

In [8]:
classifier.trainPDF()

Creating PDf for topic 1/4
Creating PDf for topic 2/4
Creating PDf for topic 3/4
Creating PDf for topic 4/4
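
The internals of `NBClassifier.trainPDF` are not shown in this notebook, so the following is a sketch of what a standard multinomial Naive Bayes training step might look like, not the author's implementation. It assumes comma-separated tokens as in `STOPPED_WORDS` and uses Laplace smoothing; the names `train_nb`, `score`, and `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Return (log_priors, log_likelihoods, log_unseen) for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)   # per-category word frequencies
    doc_counts = Counter(labels)         # number of documents per category
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split(","))
    vocab = {w for counts in word_counts.values() for w in counts}
    log_priors = {c: math.log(n / len(labels)) for c, n in doc_counts.items()}
    log_likelihoods, log_unseen = {}, {}
    for c, counts in word_counts.items():
        # Laplace smoothing: add alpha to every count so no word has zero probability
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihoods[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
        log_unseen[c] = math.log(alpha / total)  # fallback for words outside the vocabulary
    return log_priors, log_likelihoods, log_unseen

def score(words, model):
    """Sum the log prior and per-word log likelihoods for each category."""
    log_priors, log_likelihoods, log_unseen = model
    return {c: log_priors[c] + sum(log_likelihoods[c].get(w, log_unseen[c]) for w in words)
            for c in log_priors}

# Toy data, invented for illustration only
docs = ["fed,stock,taper", "stock,fall,fed", "star,film,sequel", "film,star,award"]
labels = ["b", "b", "e", "e"]
model = train_nb(docs, labels)
scores = score(["stock", "fed"], model)
print(max(scores, key=scores.get))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.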


In [9]:
classifier.classifyData()

Predicting train data
20.2% complete
40.7% complete
60.9% complete
81.2% complete
96.9% complete
Predicting test data
44.1% complete
98.8% complete


Check train and test accuracy

In [10]:
print(classifier.train_accuracy)
print(classifier.test_accuracy)

0.915767704929
0.909693757362
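
How `train_accuracy` and `test_accuracy` are computed inside the classifier is not shown; a plausible sketch is the fraction of headlines whose predicted category matches the labeled one. The helper names `accuracy` and `confusion_counts` and the toy labels below are hypothetical. Counting (true, predicted) pairs alongside the accuracy can show which categories get confused with each other.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Toy labels, invented for illustration only
true = ["b", "b", "t", "e", "m"]
pred = ["b", "e", "e", "e", "m"]
print(accuracy(true, pred))
print(confusion_counts(true, pred))
```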


Let's take a look at some examples of misclassifications:

In [24]:
indices = list(classifier.misclassified.index)[:15]
pd.concat([headlines.iloc[indices,[0,2,3]], classifier.misclassified.iloc[:15,2]], axis = 1)

Unnamed: 0,TITLE,CATEGORY,STOPPED_WORDS,PREDICTED
192,"Happy Birthday, Bull",b,"happi,birthday,bull",e
225,"Hackers Leak Mt. Gox Database, Reveal Blog of ...",b,"hacker,leak,mt,gox,databas,reveal,blog,former,ceo",t
283,Anonymous hackers uncover alleged proof of MtG...,b,"anonym,hacker,uncov,alleg,proof,mtgox,fraud,si...",t
391,American Airlines & JetBlue: Le Divorce,b,"american,airlin,jetblu,le,divorc",e
400,JetBlue airplanes at their gates at John F. Ke...,b,"jetblu,airplan,gate,john,f,kennedi,airport,new...",e
672,Paul Ryan calls for sanctions against Russian ...,b,"paul,ryan,call,sanction,russian,oligarch",e
840,China learns not to rescue its lame ducks,b,"china,learn,rescu,lame,duck",e
1114,Orion Death Stars Kill Off Planets Before They...,t,"orion,death,star,kill,planet,form",e
1115,'Death Stars' In Orion Are Masters Of Life; Al...,t,"death,star,orion,master,life,allow,new,cycl,pl...",e
1120,'Death Stars' Wreck Havoc in the Orion Nebula,t,"death,star,wreck,havoc,orion,nebula",e


Most of the misclassifications are easy to explain. Articles about "death stars" in the Orion Nebula were classified as entertainment, presumably because the same words could just as easily describe a story about the Star Wars universe. Articles about cryptocurrencies (here, the Mt. Gox hack) are labeled business but were predicted as science and technology; these stories probably touch on both topics.<br>
A logical next step in improving the classifier would be to add bigrams to the titles. That way, names like American Airlines and Paul Ryan could be recognized as word pairs that appear often in the business category.
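
As a rough sketch of the bigram idea, adjacent tokens in a `STOPPED_WORDS` string could be paired and appended to the original tokens before training. The function name `add_bigrams` and the underscore separator are assumptions made for illustration; the notebook does not implement this step.

```python
def add_bigrams(stopped_words):
    """Append adjacent-word pairs (bigrams) to a comma-separated token string."""
    tokens = stopped_words.split(",")
    # Pair each token with its successor, e.g. "paul" + "ryan" -> "paul_ryan"
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return ",".join(tokens + bigrams)

print(add_bigrams("paul,ryan,call,sanction"))
# paul,ryan,call,sanction,paul_ryan,ryan_call,call_sanction
```

Keeping the unigrams alongside the bigrams means the classifier can still fall back on single words when a pair was never seen in training.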

<br><br>Let's check whether the classifier also works on arbitrary titles that are not in the dataset:

In [12]:
print(classifier.predictCats('Apple posts higher returns this quarter'))

{'b': -41.82298628747276, 't': -45.110326962092124, 'm': -55.38170972703517, 'e': -52.790787340199586}


Great, business has the highest score. The values are negative because they appear to be log-probabilities, so the score closest to zero wins.
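
If the scores returned by `predictCats` are unnormalized log-probabilities (an assumption, since the method's internals are not shown), they can be turned into probabilities that sum to 1 with a softmax. The sketch below subtracts the maximum before exponentiating to avoid underflow; `normalize_log_scores` is a hypothetical helper, and the scores are copied from the output above.

```python
import math

def normalize_log_scores(log_scores):
    """Convert per-category log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the maximum keeps exp() from underflowing to zero
    exps = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}

scores = {'b': -41.82298628747276, 't': -45.110326962092124,
          'm': -55.38170972703517, 'e': -52.790787340199586}
probs = normalize_log_scores(scores)
print(max(probs, key=probs.get), probs)
```

Under that assumption, a gap of about 3.3 between business and technology translates into business taking the large majority of the probability mass.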

In [26]:
print(classifier.predictCats('Cruise stars in upcoming action sequel'))

{'b': -57.73306589610843, 't': -54.14482823990055, 'm': -59.56361398253921, 'e': -44.09201983404998}


Entertainment wins!