# Article Classification using Naive Bayes

Created by Patrick Steeves as part of Independent Study with Professor Kanungo<br>
George Washington University 12/23/2017

This project trains a Naive Bayes (NB) classifier on news article headlines to classify articles by topic.

### Import Data

The data can be found at https://www.kaggle.com/uciml/news-aggregator-dataset <br>
The data contains over 400,000 headlines from news stories in 2014, each in one of 4 categories: health, business, science and technology, or entertainment

Import necessary modules and define the base URL for downloads

In [1]:
import urllib.request
import zipfile
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import time
import re

url = "https://github.com/psteeves/NLP-projects/raw/master/Naive%20Bayes%20Topic%20Classifier/Data/"

The data was almost 100MB in size, so it was split into two files and zipped before being uploaded to GitHub. The lines below download and unzip the data files.

In [2]:
filename1, headers1 = urllib.request.urlretrieve(url+'news1.zip', filename='news1.zip')
filename2, headers2 = urllib.request.urlretrieve(url+'news2.zip', filename='news2.zip')

# Context managers close each archive automatically after extraction
with zipfile.ZipFile('news1.zip', 'r') as zip_ref:
    zip_ref.extractall()

with zipfile.ZipFile('news2.zip', 'r') as zip_ref:
    zip_ref.extractall()

In [3]:
news1 = pd.read_csv('news1.csv', encoding = 'latin1')
news2 = pd.read_csv('news2.csv', encoding = 'latin1')

frames = [news1, news2]
titles = pd.concat(frames).reset_index(drop=True)

Let's take a look at the data below. As we can see in the CATEGORY column, titles have the following categories: <br>
b - business <br>
e - entertainment <br>
t - science and technology <br>
m - health <br>

In [4]:
titles.iloc[:5,:]

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 3 | 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 4 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |


The function below cleans titles by dropping unnecessary columns, removing punctuation and stopwords, and stemming words

In [5]:
def cleanWords(df):
    headlines = df.drop(['URL','STORY','TIMESTAMP','HOSTNAME','ID'], axis=1)

    start = time.time()   # Time how long cleaning takes
    stopped_words = []   # List of titles converted to stopped words

    print("Started cleaning headlines...")
    stemmer = PorterStemmer()
    
    checkpoint = time.time()
    for idx, row in df['TITLE'].items():   # Series.iteritems() was removed in pandas 2.0
        cleaned_title = re.sub('[^a-zA-Z]+',' ', row).lower()    # Only keep alphabetical characters
        words = [stemmer.stem(word) for word in cleaned_title.split() if word not in stopwords.words('english')]   # Stem and filter stopwords
        stopped_words.append(','.join(words))   # Append cleaned words to list of cleaned titles
        
        if time.time() - checkpoint > 600:   # Update user on progress every 10min
            print("Done cleaning {:2.1f}% of headlines".format(100*idx/len(df)))
            checkpoint = time.time()

    headlines['STOPPED_WORDS'] = stopped_words   # Add cleaned column to dataframe

    headlines.to_csv("cleaned_titles.csv",index=False)

    print("Took {:4.1f} minutes to clean titles".format((time.time()-start)/60))
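The two-hour runtime reported below probably comes largely from calling `stopwords.words('english')` once per word, which appears to re-read the stopword list on every membership check. The following is a minimal sketch of the same per-title cleaning step with the stopwords held in a set that is built once. The name `clean_title`, the tiny `STOP_WORDS` set, and the identity `stem` default are stand-ins chosen for illustration so the snippet runs without NLTK data; in the notebook they would correspond to `set(stopwords.words('english'))` and `PorterStemmer().stem`.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')), built once
STOP_WORDS = {"the", "by", "for", "after", "a", "of"}

def clean_title(title, stop_words=STOP_WORDS, stem=lambda w: w):
    """Lowercase, keep letters only, drop stopwords, stem, and comma-join."""
    cleaned = re.sub('[^a-zA-Z]+', ' ', title).lower()
    words = [stem(w) for w in cleaned.split() if w not in stop_words]
    return ','.join(words)

print(clean_title("Stocks fall after Fed official hints"))
```

Set membership is a constant-time lookup, so the cost per title no longer depends on the length of the stopword list.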

Give the user the option of either cleaning the data with the function above (which writes a CSV of cleaned titles) or downloading the pre-cleaned titles from GitHub

In [6]:
download = 0   # Set to 1 to download pre-cleaned titles

In [7]:
if download:
    filename3, headers3 = urllib.request.urlretrieve(url+'cleaned_titles.zip', filename='cleaned_titles.zip')
    zip_ref = zipfile.ZipFile('cleaned_titles.zip', 'r')
    zip_ref.extractall()
    zip_ref.close()
    
else:
    cleanWords(titles)

headlines = pd.read_csv('cleaned_titles.csv', encoding = 'latin1', keep_default_na = False)

Started cleaning headlines...
Done cleaning 6.1% of headlines
Done cleaning 14.2% of headlines
Done cleaning 22.4% of headlines
Done cleaning 30.9% of headlines
Done cleaning 38.8% of headlines
Done cleaning 46.7% of headlines
Done cleaning 55.0% of headlines
Done cleaning 63.3% of headlines
Done cleaning 71.5% of headlines
Done cleaning 79.7% of headlines
Done cleaning 87.9% of headlines
Done cleaning 96.1% of headlines
Took 125.1 minutes to clean titles


In [8]:
headlines.iloc[:3,:]

| | TITLE | PUBLISHER | CATEGORY | STOPPED_WORDS |
|---|---|---|---|---|
| 0 | Fed official says weak data caused by weather,... | Los Angeles Times | b | fed,offici,say,weak,data,caus,weather,slow,taper |
| 1 | Fed's Charles Plosser sees high bar for change... | Livemint | b | fed,charl,plosser,see,high,bar,chang,pace,taper |
| 2 | US open: Stocks fall after Fed official hints ... | IFA Magazine | b | us,open,stock,fall,fed,offici,hint,acceler,taper |


### Build Classifier

Import additional packages needed for model

In [9]:
import numpy as np
from collections import Counter
import math

Given word counts per category, calculate the smoothed probability P(word | category)

In [10]:
def getProb(pdf, words_in_cat, cat, word, total_words):
    # pdf maps each word to its count within the category
    # words_in_cat is the total number of word occurrences in the category
    # total_words is the total number of word occurrences across all titles
    # (textbook additive smoothing uses the vocabulary size in this position instead)
    laplace_smooth = 2
    count = pdf.get(word, 0)   # A word never seen in the category has count 0
    return (count + laplace_smooth) / (words_in_cat + laplace_smooth*total_words)
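To make the smoothing step concrete, here is a small worked sketch of textbook additive (Laplace) smoothing on a toy category. The helper `smoothed_prob`, the toy counts, and the five-word vocabulary are hypothetical and only for illustration; the sketch uses a smoothing constant of 1 and the vocabulary size in the denominator, whereas `getProb` above uses a constant of 2 and the total token count.

```python
def smoothed_prob(count, tokens_in_cat, vocab_size, alpha=1):
    """Additive smoothing: (count + alpha) / (tokens_in_cat + alpha * vocab_size)."""
    return (count + alpha) / (tokens_in_cat + alpha * vocab_size)

# Hypothetical toy category: 10 tokens in total, vocabulary of 5 distinct words
counts = {"fed": 4, "stock": 3, "taper": 3}
vocab = ["fed", "stock", "taper", "movie", "virus"]
tokens_in_cat = sum(counts.values())

probs = {w: smoothed_prob(counts.get(w, 0), tokens_in_cat, len(vocab)) for w in vocab}

for word, p in probs.items():
    print(word, round(p, 3))
```

Words that never occur in the category ("movie", "virus") still get a small non-zero probability, so taking their logarithm later does not fail. With the vocabulary size in the denominator the values also sum to 1 over the vocabulary.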

<br>Build class for NB Classifier

In [11]:
class NBClassifier:
    """
    Naive Bayes Classifier. Given Pandas Series of article titles and categories, learns to perform topic classification
    """
    def __init__(self, titles, categories, train_split = 1):
        self.data = pd.concat([titles,categories], axis=1)              # Dataframe of cleaned titles and cats
        train_idx = np.random.rand(len(self.data)) < train_split        # Training data rows
        self.train_data = self.data.loc[train_idx,:].copy()
        self.test_data = self.data.loc[~train_idx,:].copy()
        
        all_words = []       # All words in titles, including duplicates
        for row in titles:
            all_words += row.split(',')

        self.total_words = len(all_words)
        self.word_count = dict(Counter(all_words))
        self.common_words = {w:c for w,c in self.word_count.items() if c > 5}  # Only keep words that appear more than 5 times
        self.unique_words = self.common_words.keys()
        self.categories = set(categories)
        self.pdf = {}               # Word counts over all categories, to be trained later
        self.words_per_cat = {}     # Number of words per category
        
        self.train_accuracy = None
        self.test_accuracy = None
        self.trained = False        # Indicator if classifier has already been trained
        self.misclassified = None   # Misclassified titles from test set, or from training set if no train/test split

        
    def trainPDF(self):
        """
        Update word count and total number of words in each category
        """
        i = 1
        for cat in self.categories:
            print("Creating PDf for category {}/{}".format(i,len(self.categories)))
            relevant_cat = self.train_data.loc[lambda df: df.iloc[:,1] == cat,:]
            self.words_per_cat[cat] = 0
            self.pdf[cat]={}
            for row in relevant_cat.iloc[:,0]:
                title_words = row.split(',')
                self.words_per_cat[cat] += len(title_words)     # Running total of words in this category
                for word in title_words:
                    if self.pdf[cat].get(word):
                        self.pdf[cat][word] += 1       # For every word in title, iteratively update word count for category
                    else:
                        self.pdf[cat][word] = 1        # If word has not been seen already in category, set count to 1
            i+=1
            
        self.trained = True

        
    def predictCats(self, title, already_stopped = False):
        """
        Given a title, predict category of article. The title is cleaned first unless already_stopped is True.
        Returns dictionary of log-likelihood scores per category (the highest score is the predicted category).
        """
        if already_stopped:   # Title is already cleaned and comma-joined
            words = title.split(',')
        else:
            stemmer = PorterStemmer()
            cleaned_title = re.sub('[^a-zA-Z]+',' ', title).lower()
            words = [stemmer.stem(word) for word in cleaned_title.split() if word not in stopwords.words('english')]
        preds = {}
        for cat in self.categories:
            preds[cat] = 0
            for word in words:
                # Add log P(word | category) for each word; no class prior term is included
                preds[cat] += math.log(getProb(self.pdf[cat], self.words_per_cat[cat], cat, word, self.total_words))

        return preds

    
    def classifyData(self):
        """
        Using trained Classifier, predict training and testing data.
        """
        if not self.trained:
            print("You must train the classifier first")
            return
        
        print("Predicting train data")
        self.train_data.loc[:,'PREDICTED'] = ''
        start_min = time.time()
        
        for idx, row in self.train_data.iterrows():
            tr_predictions = self.predictCats(row.iloc[0], already_stopped = True)
            self.train_data.loc[idx,'PREDICTED'] = max(tr_predictions, key = tr_predictions.get)  # Predicted cat is one with highest probability
            
            if time.time() - start_min > 120:     # Update user on progress every 2 minutes
                print("{:2.1f}% complete".format(100*idx/self.train_data.index[-1]))
                start_min = time.time()

        if len(self.test_data) > 0:
            self.test_data.loc[:,'PREDICTED'] = ''
            print("Predicting test data")
            start_min = time.time()
            for idx, row in self.test_data.iterrows():
                te_predictions = self.predictCats(row.iloc[0], already_stopped = True)
                self.test_data.loc[idx,'PREDICTED'] = max(te_predictions, key = te_predictions.get)

                if time.time() - start_min > 120:
                    print("{:2.1f}% complete".format(100*idx/self.test_data.index[-1]))
                    start_min = time.time()

        self.train_accuracy = sum(self.train_data.PREDICTED == self.train_data.CATEGORY) / len(self.train_data)
        if len(self.test_data) > 0:
            self.test_accuracy = sum(self.test_data.PREDICTED == self.test_data.CATEGORY) / len(self.test_data)
            self.misclassified = self.test_data.loc[self.test_data['PREDICTED'] != self.test_data.iloc[:,1]]
        else:
            self.misclassified = self.train_data.loc[self.train_data['PREDICTED'] != self.train_data.iloc[:,1]]
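The counting and scoring logic of the class can be illustrated on a handful of toy titles. The sketch below is not the class itself: the function names `train_counts` and `score`, the toy data, and the `use_prior` flag are assumptions made for illustration. It uses vocabulary-size smoothing and, optionally, the log class prior log P(category) that standard Naive Bayes adds and that `predictCats` leaves out.

```python
import math
from collections import Counter

def train_counts(titles, labels):
    """Per-category word counts, token totals, and title counts from comma-joined titles."""
    counts, totals, docs = {}, Counter(), Counter()
    for title, cat in zip(titles, labels):
        words = title.split(',')
        counts.setdefault(cat, Counter()).update(words)
        totals[cat] += len(words)
        docs[cat] += 1
    return counts, totals, docs

def score(title, counts, totals, docs, vocab_size, alpha=1, use_prior=True):
    """Log score per category: optional log prior plus summed smoothed log-likelihoods."""
    n_docs = sum(docs.values())
    scores = {}
    for cat in counts:
        s = math.log(docs[cat] / n_docs) if use_prior else 0.0
        for w in title.split(','):
            # Counter returns 0 for unseen words, so smoothing keeps the log finite
            s += math.log((counts[cat][w] + alpha) / (totals[cat] + alpha * vocab_size))
        scores[cat] = s
    return scores

# Hypothetical, already-cleaned toy titles
titles = ["fed,stock,taper", "stock,market,fall", "movi,star,award", "star,film,premier"]
labels = ["b", "b", "e", "e"]

counts, totals, docs = train_counts(titles, labels)
vocab_size = len({w for t in titles for w in t.split(',')})
scores = score("stock,market", counts, totals, docs, vocab_size)
print(max(scores, key=scores.get))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when a title has many words, which is the same reason the class works with `math.log`.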

### Train Classifier

In [12]:
classifier = NBClassifier(headlines['STOPPED_WORDS'], headlines['CATEGORY'], train_split = 0.8)

In [13]:
classifier.trainPDF()

Creating PDf for category 1/4
Creating PDf for category 2/4
Creating PDf for category 3/4
Creating PDf for category 4/4


In [14]:
classifier.classifyData()

Predicting train data
19.0% complete
39.1% complete
57.6% complete
77.4% complete
96.9% complete
Predicting test data
78.3% complete


Check train and test accuracy

In [15]:
print(classifier.train_accuracy)
print(classifier.test_accuracy)

0.909904575237
0.904140977488


Show examples of misclassifications

In [16]:
indices = list(classifier.misclassified.index)[:20]
pd.concat([headlines.iloc[indices,[0,2,3]], classifier.misclassified.iloc[:20,2]], axis = 1)

| | TITLE | CATEGORY | STOPPED_WORDS | PREDICTED |
|---|---|---|---|---|
| 34 | 10 Things You Need To Know Before The Opening ... | b | thing,need,know,open,bell | e |
| 49 | Why eBay Spinning Off Paypal Makes Sense | b | ebay,spin,paypal,make,sens | t |
| 178 | Happy 5th birthday, bull market | b | happi,th,birthday,bull,market | e |
| 225 | Hackers Leak Mt. Gox Database, Reveal Blog of ... | b | hacker,leak,mt,gox,databas,reveal,blog,former,ceo | t |
| 280 | Massive hacking attacks revealed | b | massiv,hack,attack,reveal | t |
| 428 | Get on the bus, Gus: US posts record public tr... | b | get,bu,gu,us,post,record,public,transit,use | e |
| 641 | Rand Paul: I wouldn't let Putin get away with ... | b | rand,paul,let,putin,get,away | e |
| 655 | Window on Westminster | b | window,westminst | t |
| 672 | Paul Ryan calls for sanctions against Russian ... | b | paul,ryan,call,sanction,russian,oligarch | e |
| 1109 | Orion Death Stars Destroy Planets Before They ... | t | orion,death,star,destroy,planet,even,form | e |


Check whether the classifier also works on an arbitrary title that is not in the dataset

In [17]:
print(classifier.predictCats('Apple posts higher returns this quarter'))

{'b': -44.812632743367146, 't': -48.022133511231836, 'm': -58.28771720505842, 'e': -55.31794230064467}


Great, business has the highest score (the values are log scores, so the least negative one wins)
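The log scores can be turned into values that sum to 1, which are easier to read as relative confidences. The sketch below does this with the usual subtract-the-maximum trick before exponentiating. The helper `normalize_log_scores` is a hypothetical name, and reading the result as a probability assumes all categories are equally likely a priori, since the classifier adds no class prior.

```python
import math

def normalize_log_scores(log_scores):
    """Rescale log scores into non-negative values that sum to 1."""
    top = max(log_scores.values())   # Subtracting the max avoids underflow in exp()
    exps = {cat: math.exp(s - top) for cat, s in log_scores.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}

# Log scores copied from the output above
scores = {'b': -44.812632743367146, 't': -48.022133511231836,
          'm': -58.28771720505842, 'e': -55.31794230064467}

probs = normalize_log_scores(scores)
print({cat: round(p, 3) for cat, p in probs.items()})
```

Business ends up with most of the mass, with science and technology a distant second.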

In [18]:
print(classifier.predictCats('Tom Cruise stars in upcoming action movie'))

{'b': -71.59521386580798, 't': -67.5631847361503, 'm': -74.84158060779703, 'e': -55.5692844385924}


Entertainment wins!