The UCI ML News Aggregator Dataset contains headlines and categories for over 400k news articles. Let's see if we can accurately classify the news category based just on the headline.

We'll use a [Multinomial Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) model to classify the headlines. Multinomial Naive Bayes models are provided in Python by the [scikit-learn library](http://scikit-learn.org/stable/modules/naive_bayes.html).

In [3]:
# get some libraries that will be useful
import re
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data into training and test sets
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder


# grab the data
news = pd.read_csv("/home/hardik/PycharmProjects/DmNewsPopularity/datasets/uci-news-url.csv")

In [4]:
# let's take a look at our data
news.head()

Unnamed: 0,TITLE,CATEGORY
0,"Fed official says weak data caused by weather,...",b
1,Fed's Charles Plosser sees high bar for change...,b
2,US open: Stocks fall after Fed official hints ...,b
3,"Fed risks falling 'behind the curve', Charles ...",b
4,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,b


One thing we'll want to do is normalize the TITLE column a bit: remove punctuation and lowercase everything. This will give us a smaller set of words, which will decrease the size of our model, and ensure that words are treated the same even if they occur capitalized at the beginning of the headline or lowercase in the middle.

In [5]:
def normalize_text(s):
    s = s.lower()
    
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    
    return s

news['TEXT'] = [normalize_text(s) for s in news['TITLE']]

In [6]:
news.head()

Unnamed: 0,TITLE,CATEGORY,TEXT
0,"Fed official says weak data caused by weather,...",b,fed official says weak data caused by weather ...
1,Fed's Charles Plosser sees high bar for change...,b,fed's charles plosser sees high bar for change...
2,US open: Stocks fall after Fed official hints ...,b,us open stocks fall after fed official hints a...
3,"Fed risks falling 'behind the curve', Charles ...",b,fed risks falling behind the curve' charles pl...
4,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,b,fed's plosser nasty weather has curbed job growth


Okay, now let's get our data into a format that plays nicely with the classifier.

In [7]:
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])

encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(337935, 54636)
(337935,)
(84484, 54636)
(84484,)


So the x training matrix contains 337935 observations of 54636 features; this latter number is the number of unique words the vectorizer found in the entire collection of headlines. The y training vector contains the 337935 labels associated with each observation in the x training matrix.
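
To make the rows-by-words layout concrete, here is a minimal sketch on three made-up headlines (they are not taken from the dataset, and the `toy_` names are only for illustration). Each row is one headline and each column is one distinct word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three made-up headlines, not taken from the dataset
toy_headlines = [
    "fed holds rates steady",
    "fed raises rates",
    "new phone launch",
]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_headlines)

# rows = headlines, columns = distinct words in the vocabulary
print(toy_x.shape)

# vocabulary_ maps each word to its column index
toy_vocab = toy_vectorizer.vocabulary_
print(sorted(toy_vocab, key=toy_vocab.get))

# dense view of the counts (only sensible for tiny examples)
print(toy_x.toarray())
```

The real matrix is stored sparsely for the same reason this toy one is mostly zeros: any single headline uses only a handful of the words in the vocabulary.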

So we're ready to go. Let's make the classifier!

In [8]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

How well did it do?

In [17]:
# a few raw strings to classify (CountVectorizer lowercases and tokenizes them)
example = [
    "amazon-instant-video-browser",
    "astronaut-notre-dame-bcs",
    "jobs-contently",
    "obama-boombox-israeli-radio-station-ad",
    "Google to launch Android SDK for wearables in two weeks",
    "US jobs growth last month hit by weather:Fed President Charles Plosser",
    "Apple versus Samsung case goes to California jury - WBAY",
]
e = vectorizer.transform(example)
nb.predict(e)
# accuracy on the held-out test set:
#nb.score(x_test, y_test)

array([3, 3, 0, 3, 3, 0, 3])

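The array above holds the predicted category codes for the example strings. Running the commented-out nb.score(x_test, y_test) instead gives the accuracy on the test set: over 92%, just by using words as independent features. Nice!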
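
To measure accuracy rather than eyeball individual predictions, the model can be scored on headlines it has not seen. The sketch below does this on a tiny made-up corpus (the headlines, labels, and `toy_` names are assumptions for illustration only), split by hand so the result is reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# made-up headlines and labels, not taken from the dataset
train_texts = [
    "fed raises interest rates",
    "stocks fall after fed statement",
    "bank profits rise",
    "apple launches new phone",
    "google updates android software",
    "new tablet from samsung",
]
train_labels = ["b", "b", "b", "t", "t", "t"]
test_texts = ["fed cuts rates", "samsung launches android phone"]
test_labels = ["b", "t"]

toy_vectorizer = CountVectorizer()
toy_x_train = toy_vectorizer.fit_transform(train_texts)
# reuse the fitted vocabulary; words unseen in training are ignored
toy_x_test = toy_vectorizer.transform(test_texts)

toy_nb = MultinomialNB()
toy_nb.fit(toy_x_train, train_labels)

# score() returns the fraction of test headlines labeled correctly
accuracy = toy_nb.score(toy_x_test, test_labels)
print(accuracy)

# the same number computed explicitly from the predictions
predicted = toy_nb.predict(toy_x_test)
print(accuracy_score(test_labels, predicted))
```

In the notebook itself, the equivalent call is nb.score(x_test, y_test), using the split made earlier.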

If you feel like exploring what words are characteristic of each category, you can pull out the coefficients of the Naive Bayes classifier:

In [None]:
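# log P(word | category), one row per category
# (nb.coef_ was deprecated in scikit-learn 0.24 and later removed)
coefs = nb.feature_log_prob_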
print(coefs.shape)
print(coefs)

That's a matrix of the log probability of each word given each category. The usual way to find characteristic words for a category is to take those words with the largest log odds ratio per category, which is an exercise left to the reader.
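
As a starting point for that exercise, here is one possible sketch, not the only way to do it. It assumes a particular definition: the log odds of a word within a category minus the log odds of the same word averaged over the other categories. The corpus, the `toy_` names, and the `top_words` helper are all made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# made-up headlines and labels, not taken from the dataset
toy_texts = [
    "fed raises interest rates",
    "stocks fall after fed statement",
    "bank profits rise",
    "apple launches new phone",
    "google updates android software",
    "new tablet from samsung",
]
toy_labels = ["b", "b", "b", "t", "t", "t"]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_texts)
toy_nb = MultinomialNB().fit(toy_x, toy_labels)

# words ordered by column index, so words[i] names column i
toy_vocab = toy_vectorizer.vocabulary_
words = np.array(sorted(toy_vocab, key=toy_vocab.get))

# P(word | category), shape (n_categories, n_words)
probs = np.exp(toy_nb.feature_log_prob_)


def top_words(class_index, n=3):
    """Hypothetical helper: the n words with the largest log odds ratio."""
    p = probs[class_index]
    # average probability of each word in all the other categories
    q = np.delete(probs, class_index, axis=0).mean(axis=0)
    log_odds_ratio = np.log(p / (1 - p)) - np.log(q / (1 - q))
    top = np.argsort(log_odds_ratio)[::-1][:n]
    return words[top].tolist()


for i, label in enumerate(toy_nb.classes_):
    print(label, top_words(i))
```

On the real data the same idea applies with nb and vectorizer in place of the toy objects; very rare words can dominate such a ranking, so a minimum count threshold may be worth adding.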

The coefficients only give you log probabilities by index (which corresponds to whatever ordering the CountVectorizer decided on). To convert these indices back to words, you can pull out the vocabulary from the vectorizer:

In [None]:
def make_reverse_vocabulary(vectorizer):
    # vocabulary_ maps word -> column index; invert it to index -> word
    return {i: w for w, i in vectorizer.vocabulary_.items()}

And you can do something similar with the LabelEncoder to match the model's output classifications back to the dataset's categories. Again, this is left to the reader.
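
For a hint on the LabelEncoder side, a fitted encoder exposes its categories as classes_ and can map integer codes back with inverse_transform. The sketch below uses a separate toy encoder and assumes four category letters purely for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# category letters assumed for illustration only
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(["b", "t", "e", "m", "b"])

# classes_ is sorted; the position of each label is its integer code
print(toy_encoder.classes_)
print(toy_y)

# map integer predictions back to the original labels
toy_decoded = toy_encoder.inverse_transform([3, 0])
print(toy_decoded)
```

In the notebook, the same call on the real objects would be encoder.inverse_transform(nb.predict(e)).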

I hope this was helpful in learning how to classify text! Possible next steps: 
- figure out the most characteristic words for each news category (someone please do this and report back!)
- figure out how accurately we can classify particular news items (STORY in the dataset) given headline text