# Article Classification using Naive Bayes

Created by Patrick Steeves as part of Independent Study with Professor Kanungo<br>
George Washington University 12/23/2017

This project trains a Naive Bayes (NB) classifier on news headlines to classify articles by topic.

In [None]:
%run Module.ipynb

### Import Data

The original data can be found at https://www.kaggle.com/uciml/news-aggregator-dataset <br>
The data contains roughly 400,000 news headlines from 2014, each labeled with one of four categories: business, entertainment, science and technology, or health.

Load a copy of the data from GitHub:

In [3]:
titles = importData()

Let's take a look at the first few rows. The CATEGORY column holds one of four single-letter codes: <br>
b - business <br>
e - entertainment <br>
t - science and technology <br>
m - health <br>

In [4]:
titles.iloc[:5,:]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


Cleaning all of the headlines takes about two hours (125 minutes in the run below), so the next cells give the option to download pre-cleaned data directly from GitHub instead.
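
The cleaning step itself lives in `Module.ipynb` and is not shown here. As a rough illustration only, a minimal sketch of this kind of cleaning might look like the following. The function name `clean_headline` and the tiny `STOP_WORDS` set are hypothetical, and stemming is left out, so tokens come out unstemmed (`massive` rather than `massiv`).

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "by", "for", "of", "in", "on", "to", "after", "and", "is", "are"}

def clean_headline(title):
    """Lowercase a headline, keep alphabetic tokens, and drop stop words."""
    # Replace every run of non-letters with a space, so "Fed's" splits into "fed" and "s".
    tokens = re.sub(r"[^a-z]+", " ", title.lower()).split()
    # Drop stop words and leftover single letters such as the "s" from "Fed's".
    kept = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    # Join with commas to mirror the layout of the STOPPED_WORDS column.
    return ",".join(kept)

print(clean_headline("Massive hacking attacks revealed"))
```

Stripping digits this way would also explain tokens such as `th` (from "5th") in the cleaned output further down, although that is a guess about how `cleanWords` behaves.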

In [6]:
download = 0   # Set to 1 to download pre-cleaned titles

In [7]:
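if download:
    # Fetch and unpack the pre-cleaned headlines (`url` is assumed to come from Module.ipynb)
    urllib.request.urlretrieve(url + 'cleaned_titles.zip', filename='cleaned_titles.zip')
    with zipfile.ZipFile('cleaned_titles.zip', 'r') as zip_ref:
        zip_ref.extractall()
    headlines = pd.read_csv('cleaned_titles.csv', encoding='latin1', keep_default_na=False)
else:
    # Clean the raw titles, then keep only the columns used for classification.
    # drop() returns a new DataFrame, so the result must be assigned back.
    headlines = cleanWords(titles, 'TITLE')
    headlines = headlines.drop(['URL', 'STORY', 'TIMESTAMP', 'HOSTNAME', 'ID'], axis=1)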

Started cleaning headlines...
Done cleaning 6.1% of headlines
Done cleaning 14.2% of headlines
Done cleaning 22.4% of headlines
Done cleaning 30.9% of headlines
Done cleaning 38.8% of headlines
Done cleaning 46.7% of headlines
Done cleaning 55.0% of headlines
Done cleaning 63.3% of headlines
Done cleaning 71.5% of headlines
Done cleaning 79.7% of headlines
Done cleaning 87.9% of headlines
Done cleaning 96.1% of headlines
Took 125.1 minutes to clean titles


In [8]:
headlines.iloc[:3,:]

Unnamed: 0,TITLE,PUBLISHER,CATEGORY,STOPPED_WORDS
0,"Fed official says weak data caused by weather,...",Los Angeles Times,b,"fed,offici,say,weak,data,caus,weather,slow,taper"
1,Fed's Charles Plosser sees high bar for change...,Livemint,b,"fed,charl,plosser,see,high,bar,chang,pace,taper"
2,US open: Stocks fall after Fed official hints ...,IFA Magazine,b,"us,open,stock,fall,fed,offici,hint,acceler,taper"


### Train Classifier

In [12]:
classifier = NBClassifier(headlines['STOPPED_WORDS'], headlines['CATEGORY'], train_split = 0.8)

In [13]:
classifier.trainPDF()

Creating PDf for category 1/4
Creating PDf for category 2/4
Creating PDf for category 3/4
Creating PDf for category 4/4
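
The internals of `NBClassifier` are defined in `Module.ipynb`, so the following is only a sketch of what building one word distribution per category could look like, assuming a multinomial Naive Bayes model with add-one (Laplace) smoothing. The names `train_nb` and `score` and the `alpha` parameter are hypothetical. The three sample rows are cleaned headlines that appear elsewhere in this notebook.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log-priors and smoothed per-category word log-probabilities."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter(labels)         # category -> number of headlines
    for words, cat in zip(docs, labels):
        word_counts[cat].update(words.split(","))
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for cat, counts in word_counts.items():
        # Smoothed denominator: words seen in this category plus alpha per vocabulary word.
        total = sum(counts.values()) + alpha * len(vocab)
        model[cat] = {
            "prior": math.log(doc_counts[cat] / len(labels)),
            "words": {w: math.log((counts[w] + alpha) / total) for w in vocab},
            "unseen": math.log(alpha / total),  # fallback for words outside the vocabulary
        }
    return model

def score(model, words):
    """Return the log-score of each category for a comma-separated token string."""
    tokens = words.split(",")
    return {
        cat: m["prior"] + sum(m["words"].get(t, m["unseen"]) for t in tokens)
        for cat, m in model.items()
    }

docs = ["fed,offici,say,weak,data,caus,weather,slow,taper",
        "us,open,stock,fall,fed,offici,hint,acceler,taper",
        "orion,death,star,destroy,planet,even,form"]
labels = ["b", "b", "t"]
model = train_nb(docs, labels)
scores = score(model, "fed,taper")
print(max(scores, key=scores.get))
```

Working in log space turns the product of many small word probabilities into a sum, which avoids numeric underflow on longer headlines.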


In [14]:
classifier.classifyData()

Predicting train data
19.0% complete
39.1% complete
57.6% complete
77.4% complete
96.9% complete
Predicting test data
78.3% complete


Check the accuracy on the training set (first value) and on the held-out test set (second value):

In [15]:
print(classifier.train_accuracy)
print(classifier.test_accuracy)

0.909904575237
0.904140977488
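
The two numbers are about 91.0% on the training data and 90.4% on the test data. The gap is under one percentage point, which suggests the classifier is not overfitting much. Accuracy here is presumably the fraction of headlines whose predicted category matches the labeled one. A minimal sketch of that calculation, using a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels for illustration: four of the five predictions match.
actual = ["b", "b", "t", "e", "m"]
predicted = ["b", "e", "t", "e", "m"]
print(accuracy(actual, predicted))
```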


Let's look at some examples of misclassified headlines:

In [16]:
indices = list(classifier.misclassified.index)[:10]
# Take the same 10 rows from both frames so the concatenated table has no empty cells
pd.concat([headlines.iloc[indices,[0,2,3]], classifier.misclassified.iloc[:10,2]], axis = 1)

Unnamed: 0,TITLE,CATEGORY,STOPPED_WORDS,PREDICTED
34,10 Things You Need To Know Before The Opening ...,b,"thing,need,know,open,bell",e
49,Why eBay Spinning Off Paypal Makes Sense,b,"ebay,spin,paypal,make,sens",t
178,"Happy 5th birthday, bull market",b,"happi,th,birthday,bull,market",e
225,"Hackers Leak Mt. Gox Database, Reveal Blog of ...",b,"hacker,leak,mt,gox,databas,reveal,blog,former,ceo",t
280,Massive hacking attacks revealed,b,"massiv,hack,attack,reveal",t
428,"Get on the bus, Gus: US posts record public tr...",b,"get,bu,gu,us,post,record,public,transit,use",e
641,Rand Paul: I wouldn't let Putin get away with ...,b,"rand,paul,let,putin,get,away",e
655,Window on Westminster,b,"window,westminst",t
672,Paul Ryan calls for sanctions against Russian ...,b,"paul,ryan,call,sanction,russian,oligarch",e
1109,Orion Death Stars Destroy Planets Before They ...,t,"orion,death,star,destroy,planet,even,form",e
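
To see which confusions dominate, the (actual, predicted) pairs can be tallied. The sketch below does this by hand for the ten rows shown above. These are only the first ten misclassifications by index, and the dataset starts with business headlines, so the tally is not representative of the whole test set.

```python
from collections import Counter

# (actual, predicted) pairs copied from the ten rows shown above, in order.
pairs = [("b", "e"), ("b", "t"), ("b", "e"), ("b", "t"), ("b", "t"),
         ("b", "e"), ("b", "e"), ("b", "t"), ("b", "e"), ("t", "e")]

confusions = Counter(pairs)
for (actual, predicted), n in confusions.most_common():
    print(f"{actual} -> {predicted}: {n}")
```

Several of these headlines are ambiguous even to a human reader: the Mt. Gox and eBay/PayPal stories sit between business and technology.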


Check whether the classifier also works on made-up titles:

In [17]:
print(classifier.predictCats('Apple posts higher returns this quarter'))

{'b': -44.812632743367146, 't': -48.022133511231836, 'm': -58.28771720505842, 'e': -55.31794230064467}


Great: business (`b`) has the highest score. The values are negative and appear to be log-probabilities, so the one closest to zero wins.
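
The raw scores are hard to read as they stand. Assuming `predictCats` returns log-scores that share a common normalizing constant (this is an assumption; the method is defined in `Module.ipynb`), they can be turned into probabilities with a softmax. The helper name `normalize_log_scores` is hypothetical.

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized log-scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid numeric underflow.
    exps = {cat: math.exp(s - top) for cat, s in log_scores.items()}
    total = sum(exps.values())
    return {cat: v / total for cat, v in exps.items()}

# Scores copied from the predictCats output above.
scores = {'b': -44.812632743367146, 't': -48.022133511231836,
          'm': -58.28771720505842, 'e': -55.31794230064467}
probs = normalize_log_scores(scores)
for cat, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat}: {p:.4f}")
```

Under that assumption, business takes most of the probability mass and science and technology is a distant second.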

In [18]:
print(classifier.predictCats('Tom Cruise stars in upcoming action movie'))

{'b': -71.59521386580798, 't': -67.5631847361503, 'm': -74.84158060779703, 'e': -55.5692844385924}


Entertainment (`e`) wins, scoring about 12 higher than the runner-up, science and technology (`t`).