# Article Classification using Naive Bayes

Created by Patrick Steeves as part of Independent Study with Professor Kanungo<br>
George Washington University 12/23/2017

### Import Data

The data can be found at https://www.kaggle.com/uciml/news-aggregator-dataset <br>
The data contains about 400,000 news headlines from 2014, each labeled with one of four categories: health, business, science and technology, or entertainment.

Import the necessary modules, set the directory path, and import the titles data.

In [14]:
import zipfile
import pandas as pd
import os
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import time
from collections import Counter
import re
import math

path = "C:\\Users\\patri\\Documents\\GW\\2017Fall\\Independent Study\\Simple project\\"
titles = pd.read_csv(path+"Data\\news_headlines.csv")

Titles have the following categories: <br>
b - business <br>
e - entertainment <br>
t - science and technology <br>
m - health <br>

In [3]:
titles.iloc[:5,:]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207


Clean titles. Drop unnecessary columns, remove punctuation and stopwords, and stem words

In [8]:
def cleanWords(df, path = None):
    # Returns a copy of df without the unneeded columns and with a STOPPED_WORDS column added.
    # If path is specified, also create CSV file with cleaned titles
    headlines = df.drop(['URL','STORY','TIMESTAMP','HOSTNAME','ID'], axis=1)

    start = time.time()
    stopped_words = []
    i = 0
    print("Started cleaning headlines...")
    stemmer = PorterStemmer()
    stop_set = set(stopwords.words('english'))    # Build the stopword set once; reloading the list for every word is very slow
    for row in df['TITLE']:
        cleaned_title = re.sub('[^a-zA-Z]+',' ', row).lower()    # Only keep alphabetical characters
        words = [stemmer.stem(word) for word in cleaned_title.split() if word not in stop_set]   # Filter stopwords, then stem
        stopped_words.append(','.join(words))
        i+=1
        if i % 30000 == 0:
            print("Done cleaning {} headlines".format(i))   # Update user on progress

    headlines['STOPPED_WORDS'] = stopped_words
    if path:
        headlines.to_csv(path+"Data\\headline_words.csv",index=False)
    print("Took {:07.2f} seconds to clean headlines".format(time.time()-start))
    return headlines
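
As a quick illustration of what the cleaning step does to a single headline, the sketch below applies the same regular expression followed by a stopword filter. To stay self-contained it uses a tiny hand-picked stopword set (`TOY_STOPWORDS`, an assumption standing in for `stopwords.words('english')`) and skips stemming, so its output is not identical to what `cleanWords` produces.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption, far from complete)
TOY_STOPWORDS = {"by", "the", "for", "of", "to", "a", "in", "on"}

def clean_title_sketch(title, stop_set=TOY_STOPWORDS):
    """Lowercase a headline, keep only letters, drop stopwords, and join with commas."""
    letters_only = re.sub('[^a-zA-Z]+', ' ', title).lower()
    kept = [word for word in letters_only.split() if word not in stop_set]
    return ','.join(kept)

example = clean_title_sketch("3 Predictions for the New Week")
print(example)
```

With Porter stemming added, `predictions` would become `predict`, matching the `STOPPED_WORDS` format used in the rest of the notebook.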

If the cleaned headlines are not loaded yet, import them from the CSV of cleaned headlines. If that CSV has not been created yet, clean the headlines first.

In [6]:
if locals().get('headlines') is None:
    if "headline_words.csv" in os.listdir(path+"Data"):
        headlines = pd.read_csv(path + "Data\\headline_words.csv", encoding = "latin1", keep_default_na = False)
    else:
        headlines = cleanWords(titles, path)

In [7]:
headlines.iloc[:3,:]

Unnamed: 0,TITLE,PUBLISHER,CATEGORY,STOPPED_WORDS
0,"Fed official says weak data caused by weather,...",Los Angeles Times,b,"fed,offici,say,weak,data,caus,weather,slow,taper"
1,Fed's Charles Plosser sees high bar for change...,Livemint,b,"fed,charl,plosser,see,high,bar,chang,pace,taper"
2,US open: Stocks fall after Fed official hints ...,IFA Magazine,b,"us,open,stock,fall,fed,offici,hint,acceler,taper"


### Build Classifier

Given the PDF (dictionary of word counts) for a category, calculate P(word | cat), modified using Laplace smoothing

In [9]:
def getProb(pdf, words_in_cat, cat, word, total_words):
    # pdf is the dictionary of word counts for the category, words_in_cat is the number of words counted in that category's titles,
    # and total_words is the total number of words in all titles (used here as the smoothing term in the denominator)
    laplace_smooth = 1
    count = pdf.get(word, 0)    # If word is not in PDF of cat, its count is 0
    return (count + laplace_smooth) / (words_in_cat + laplace_smooth*total_words)
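
To make the smoothing concrete, here is a small self-contained sketch on invented counts. It follows the textbook add-one form, in which the denominator adds the vocabulary size rather than a total word count. The names `smoothed_prob`, `toy_counts`, and `vocab_size` are illustrative and not part of the notebook.

```python
def smoothed_prob(counts, word, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(word | category) from a dict of word counts."""
    total = sum(counts.values())          # Number of words counted in the category
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

# Invented counts for one category (not taken from the dataset)
toy_counts = {"fed": 3, "stock": 2, "taper": 1}
toy_vocab = ["fed", "stock", "taper", "bitcoin"]   # Assumed vocabulary across all categories

p_seen = smoothed_prob(toy_counts, "fed", len(toy_vocab))        # (3 + 1) / (6 + 4)
p_unseen = smoothed_prob(toy_counts, "bitcoin", len(toy_vocab))  # (0 + 1) / (6 + 4)
p_total = sum(smoothed_prob(toy_counts, w, len(toy_vocab)) for w in toy_vocab)
print(p_seen, p_unseen, p_total)
```

The unseen word gets a small nonzero probability instead of zero, which keeps `math.log` defined when scores are summed, and the estimates over the assumed vocabulary still sum to 1.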

<br>Build class for NB Classifier

In [12]:
class NBClassifier:
    """
    Naive Bayes Classifier. Given Pandas Series of titles and categories, trains PDF to perform classification on article categories.
    """
    def __init__(self, titles, categories, train_split = 1):
        self.data = pd.concat([titles,categories], axis=1)                               # Dataframe of cleaned titles and cats
        train_idx = np.random.rand(len(self.data)) < train_split                         # Training data rows
        self.train_data = self.data.loc[train_idx,:].copy()                              # Train data
        self.test_data = self.data.loc[~train_idx,:].copy()                              # Test data
        self.all_words = []
        self.pdf = {}
        self.words_per_cat = {}
        
        i = 0
        for row in titles:                                                               # List of all words used in titles
            self.all_words += row.split(',')
            i += 1
            if i % 50000 == 0:
                print("Done reading {} headlines".format(i))

        self.total_words = len(self.all_words)                                          # Number of total words used
        self.word_count = dict(Counter(self.all_words))                                 # Word count
        self.common_words = {w:c for w,c in self.word_count.items() if c > 4}           # Only keep words that appear at least 5 times
        self.common_words.pop('', None)                                                 # Drop the empty string (from empty titles) if present
        self.unique_words = self.common_words.keys()                                    # List of unique words
        self.categories = set(categories)                                               # Set possible categories
        self.train_accuracy = None
        self.test_accuracy = None
        self.trained = False                                                            # True if PDF has already been trained on data
        self.misclassified = None                                                       # Misclassified titles from test set

    def trainPDF(self):
        """
        Train PDF on data. Creates dictionary of categories, which each contains dictionary of word counts
        """
        i = 1
        for cat in self.categories:
            print("Creating PDf for category {}/{}".format(i,len(self.categories)))
            indexed = self.train_data.loc[lambda df: df.iloc[:,1] == cat,:]             # Subset of training titles corresponding to category
            self.words_per_cat[cat] = 0
            self.pdf[cat]={}
            for row in indexed.iloc[:,0]:
                title_words = row.split(',')
                self.words_per_cat[cat] += len(title_words)                             # Iteratively count number of words in titles for category
                for word in title_words:
                    if self.pdf[cat].get(word):
                        self.pdf[cat][word] += 1                                        # For every word in title, iteratively update word count for category
                    else:
                        self.pdf[cat][word] = 1
            i+=1
        self.trained = True                                                             # Set trained flag to true

    def predictCat(self, title, already_stopped = False):
        """
        Given a title, predict category of article. Assumes that title is not already cleaned unless already_stopped is True. Returns dictionary of summed log-probabilities per category (the highest value is the predicted category).
        """
        if already_stopped:
            words = title.split(',')
        else:
            stemmer = PorterStemmer()
            stop_set = set(stopwords.words('english'))                                  # Build stopword set once per call
            cleaned_title = re.sub('[^a-zA-Z]+',' ', title).lower()
            words = [stemmer.stem(word) for word in cleaned_title.split() if word not in stop_set]
        preds = {}
        for cat in self.categories:
            preds[cat] = 0
            for word in words:
                preds[cat] += math.log(getProb(self.pdf[cat], self.words_per_cat[cat], cat, word, self.total_words))

        return preds

    def runClassifier(self):
        """
        Using trained Classifier, predict training and testing data.
        """
        if not self.trained:
            print("You must train the classifier first")
            return
        self.train_data.loc[:,'PREDICTED'] = ''                                         # Add column for predicted category
        self.test_data.loc[:,'PREDICTED'] = ''
        print("Predicting train data")
        start_min = time.time()        
        for idx, row in self.train_data.iterrows():
            tr_predictions = self.predictCat(row.iloc[0], True)                         # Probability of each category for given title
            self.train_data.loc[idx,'PREDICTED'] = max(tr_predictions, key = tr_predictions.get)   # Predict title using max prob
            if time.time() - start_min > 30:
                print("{:2.1f}% Done with training data".format(100*idx/self.train_data.index[-1]))
                start_min = time.time()

        print("Predicting test data")
        start_min = time.time()
        for idx, row in self.test_data.iterrows():
            te_predictions = self.predictCat(row.iloc[0], True)
            self.test_data.loc[idx,'PREDICTED'] = max(te_predictions, key = te_predictions.get)
            if time.time() - start_min > 30:
                print("{:2.1f}% Done with testing data".format(100*idx/self.test_data.index[-1]))
                start_min = time.time()

        self.train_accuracy = sum(self.train_data.PREDICTED == self.train_data.CATEGORY) / len(self.train_data)   # Update training accuracy
        if len(self.test_data) > 0:
            self.test_accuracy = sum(self.test_data.PREDICTED == self.test_data.CATEGORY) / len(self.test_data)   # Update testing accuracy

        if len(self.test_data) > 0:                                                                               # Get misclassified titles
            self.misclassified = self.test_data.loc[self.test_data['PREDICTED'] != self.test_data.iloc[:,1]]
        else:
            self.misclassified = self.train_data.loc[self.train_data['PREDICTED'] != self.train_data.iloc[:,1]]
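
`predictCat` ranks categories by summed log-likelihoods only. Textbook Naive Bayes also adds the log prior of each category, which can matter when the categories are unbalanced. The sketch below shows that variant on invented counts; `score_with_prior` and all of the toy numbers are assumptions for illustration and do not reproduce the notebook's results.

```python
import math

def score_with_prior(words, word_counts, doc_counts, vocab_size, alpha=1.0):
    """Return log P(cat) + sum of log P(word | cat) for each category."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for cat, counts in word_counts.items():
        words_in_cat = sum(counts.values())
        score = math.log(doc_counts[cat] / total_docs)        # Log prior of the category
        for word in words:
            likelihood = (counts.get(word, 0) + alpha) / (words_in_cat + alpha * vocab_size)
            score += math.log(likelihood)                      # Smoothed log-likelihood of the word
        scores[cat] = score
    return scores

# Invented training summary: word counts and number of headlines per category
toy_word_counts = {
    "b": {"stock": 4, "fed": 3, "price": 3},
    "t": {"app": 5, "bitcoin": 3, "price": 2},
}
toy_doc_counts = {"b": 6, "t": 4}

toy_scores = score_with_prior(["fed", "price"], toy_word_counts, toy_doc_counts, vocab_size=5)
best = max(toy_scores, key=toy_scores.get)
print(toy_scores, best)
```

Working in log space avoids underflow from multiplying many small probabilities, which is the same reason `predictCat` sums `math.log` values.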

### Train Classifier

In [15]:
classifier = NBClassifier(headlines['STOPPED_WORDS'], headlines['CATEGORY'], 0.9)

Done reading 50000 headlines
Done reading 100000 headlines
Done reading 150000 headlines
Done reading 200000 headlines
Done reading 250000 headlines
Done reading 300000 headlines
Done reading 350000 headlines
Done reading 400000 headlines


In [17]:
classifier.trainPDF()

Creating PDf for category 1/4
Creating PDf for category 2/4
Creating PDf for category 3/4
Creating PDf for category 4/4


In [18]:
classifier.runClassifier()

Predicting train data
12.6% Done with training data
26.7% Done with training data
41.3% Done with training data
55.7% Done with training data
70.8% Done with training data
85.7% Done with training data
Predicting test data


Check train and test accuracy

In [20]:
print(classifier.train_accuracy)
print(classifier.test_accuracy)

0.916254471842
0.912415965399


Show examples of misclassifications

In [23]:
classifier.misclassified.iloc[:10,:]

Unnamed: 0,STOPPED_WORDS,CATEGORY,PREDICTED
167,"predict,new,week",b,e
655,"window,westminst",b,t
840,"china,learn,rescu,lame,duck",b,e
1117,"death,star,orion,wreak,havoc,planet,even,develop",t,e
1163,news,t,e
1176,"mt,gox,file,us,bankruptci",t,b
1186,"bitcoin,exchang,mt,gox,file,u,bankruptci,death...",t,b
1585,"presid,obama,introduc,cosmo,debut,episod",t,e
1849,"itun,festiv,app,readi,ahead,sxsw,concert,seri",t,e
1856,"ohio,ga,price,steadi,start,work,week",t,b


In [25]:
headlines.iloc[[167,655,840,1117,1163,1176,1186,1585,1849,1856],:]

Unnamed: 0,TITLE,PUBLISHER,CATEGORY,STOPPED_WORDS
167,3 Predictions for the New Week,Motley Fool,b,"predict,new,week"
655,Window on Westminster,Rudaw,b,"window,westminst"
840,China learns not to rescue its lame ducks,Director of Finance online,b,"china,learn,rescu,lame,duck"
1117,'Death stars' in Orion wreak havoc on planets ...,Science Recorder,t,"death,star,orion,wreak,havoc,planet,even,develop"
1163,What's News,Wall Street Journal,t,news
1176,Mt Gox files US bankruptcy,Stuff.co.nz,t,"mt,gox,file,us,bankruptci"
1186,Bitcoin Exchange Mt. Gox Files for U.S. Bankru...,Wired,t,"bitcoin,exchang,mt,gox,file,u,bankruptci,death..."
1585,President Obama Introduces 'Cosmos' Debut Episode,Newsmax.com,t,"presid,obama,introduc,cosmo,debut,episod"
1849,iTunes Festival apps ready ahead of SXSW conce...,iSource,t,"itun,festiv,app,readi,ahead,sxsw,concert,seri"
1856,Ohio gas prices steady to start work week,Huntington Herald Dispatch,t,"ohio,ga,price,steadi,start,work,week"
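
A tally of (actual, predicted) pairs can make patterns in the errors easier to see. The sketch below counts only the ten rows shown above, typed in by hand, so it says nothing about the full test set; the same tally could presumably be taken over all of `classifier.misclassified`.

```python
from collections import Counter

# (actual, predicted) pairs copied from the ten misclassified rows shown above
pairs = [("b", "e"), ("b", "t"), ("b", "e"), ("t", "e"), ("t", "e"),
         ("t", "b"), ("t", "b"), ("t", "e"), ("t", "e"), ("t", "b")]

confusion = Counter(pairs)
for (actual, predicted), n in confusion.most_common():
    print("actual={} predicted={} count={}".format(actual, predicted, n))
```

In this small sample, science and technology headlines mistaken for entertainment or business account for seven of the ten errors.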


Check whether the classifier also works on an arbitrary title that has not been cleaned

In [26]:
classifier.predictCat('Apple posts higher returns this quarter')

{'b': -41.36362907626666,
 'e': -52.543570547479376,
 'm': -54.80713540820604,
 't': -44.5483659972368}
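
The returned values are summed log-probabilities, so the least negative one (`'b'`, business) is the predicted category. If relative probabilities are wanted instead, one option is to exponentiate and normalize using the log-sum-exp trick. The helper name `normalize_log_scores` is hypothetical, and the result treats all categories as equally likely beforehand, since `predictCat` applies no class prior.

```python
import math

def normalize_log_scores(log_scores):
    """Convert a dict of log scores into probabilities that sum to 1."""
    top = max(log_scores.values())                       # Subtract the max for numerical stability
    exps = {cat: math.exp(s - top) for cat, s in log_scores.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}

# Scores copied from the output above
log_scores = {'b': -41.36362907626666, 'e': -52.543570547479376,
              'm': -54.80713540820604, 't': -44.5483659972368}

probs = normalize_log_scores(log_scores)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```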