## Overview

I set out to **use only the title and category columns** to build the classifier. Because this news data was collected from the Internet, I focus on building the model from the keywords in each title. **Here is the deal...**

Optimizing blog posts for keywords is not about incorporating as many keywords into a post as possible. That actually hurts SEO, because search engines will treat it as keyword stuffing (i.e., repeating keywords as often as possible with the sole purpose of ranking higher in organic search). A good rule of thumb is **to focus** on one or two keywords per blog post, which keeps the post aimed at a single goal. You can use more than one keyword in a post, but keep its focus narrow enough that you can spend time actually optimizing for just one or two.

The body content matters too, but the title (i.e., headline) of a post is usually the first thing a search engine and a reader use to judge the relevance of the content, so including a keyword there is vital.
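
To make the keyword idea concrete, here is a minimal sketch of a word-count Naive Bayes classifier over titles. It is a simplified stand-in, not textblob's `NaiveBayesClassifier`: the toy titles, the `train` and `classify` helpers, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_titles):
    """Count words per category from (title, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter()               # category -> number of titles
    vocabulary = set()
    for title, category in labeled_titles:
        words = title.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify(title, model):
    """Return the category with the highest log posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        # Log prior: share of training titles in this category
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in title.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy titles, not taken from articles.json
toy_titles = [
    ("kpk periksa politikus", "news"),
    ("korban banjir bertambah", "news"),
    ("industri gaming amd", "tekno"),
    ("startup teknologi sukses", "tekno"),
]
model = train(toy_titles)
print(classify("politikus diperiksa kpk", model))
print(classify("bisnis startup gaming", model))
```

Each title word that was seen more often in one category pulls the score toward that category, which is why a short headline with one or two strong keywords can be enough to classify an article.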

In [1]:
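import subprocess
import sys

def install_lib(package):
    # pip.main() is no longer available in pip 10+; call pip through the current interpreter instead
    return subprocess.call([sys.executable, '-m', 'pip', 'install', package])

install_lib("textblob")
install_lib("nltk")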



0

In [2]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zef\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import pandas
import numpy
import textblob

from textblob.classifiers import NaiveBayesClassifier
from nltk.classify import PositiveNaiveBayesClassifier

# Load the articles and the stopword list
df = pandas.read_json("articles.json")
stopwords = pandas.read_csv("stopwords.txt")

In [4]:
df.head(5)

Unnamed: 0,category,text,title
0,Tekno,"Liputan6.com, Jakarta - Pertumbuhan startup te...",Kiat Sukses Berbisnis Teknologi
1,News,By Eri Komar Sinaga Mantan Kepala Biro Adminis...,KPK Periksa Politikus Demokrat Terkait Korupsi...
2,News,JAKARTA - Komisi Pemilihan Umum (KPU) DKI Jaka...,"Pendaftaran Ditutup, KPU Pastikan Pilgub DKI D..."
3,Tekno,ArenaLTE.com - Perkuat industri gaming yang se...,"Perkuat Industri Gaming, AMD Akuisisi Pengemba..."
4,News,"VIVA.co.id - Hingga Senin malam, 26 September ...","Bertambah, Korban Tewas Banjir Garut Jadi 34 O..."


In [5]:
df['title'] = df['title'].str.lower()
df['title'] = df['title'].str.split(" ")

df['text'] = df['text'].str.lower()
df['text'] = df['text'].str.split(" ")

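# Drop empty tokens from each row (filter() on the whole column would only drop empty rows)
df['text'] = df['text'].apply(lambda tokens: [t for t in tokens if t])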

In [6]:
import unicodedata

def to_ascii(token):
    # Byte strings (Python 2 str) are already encoded; leave them unchanged
    if isinstance(token, bytes):
        return token
    # Decompose accented characters, then drop anything outside ASCII
    return unicodedata.normalize('NFKD', token).encode('ascii', 'ignore').decode('ascii')

for column in ['title', 'text']:
    df[column] = df[column].apply(lambda tokens: [to_ascii(t) for t in tokens])

In [7]:
df.head(5)

Unnamed: 0,category,text,title
0,Tekno,"[liputan6.com,, jakarta, -, pertumbuhan, start...","[kiat, sukses, berbisnis, teknologi]"
1,News,"[by, eri, komar, sinaga, mantan, kepala, biro,...","[kpk, periksa, politikus, demokrat, terkait, k..."
2,News,"[jakarta, -, komisi, pemilihan, umum, (kpu), d...","[pendaftaran, ditutup,, kpu, pastikan, pilgub,..."
3,Tekno,"[arenalte.com, -, perkuat, industri, gaming, y...","[perkuat, industri, gaming,, amd, akuisisi, pe..."
4,News,"[viva.co.id, -, hingga, senin, malam,, 26, sep...","[bertambah,, korban, tewas, banjir, garut, jad..."


### Cleaning Data

In [8]:
import re

for column in ['title', 'text']:
    for row in range(0, len(df[column])):
      
        holder = []
        
        # Reduce data to the first 120 words (for performance)
        if column == 'text' and len(df[column][row]) >= 120:
            df[column][row] = df[column][row][:120]

        for element in range(0, len(df[column][row])):

            # Detect escape sequence & unicode

            # On head
            # Normal regex
            if re.match(r"^\\[abfnrtv].*$", df[column][row][element]) or re.match(r"^\\[\\\'\"?].*$", df[column][row][element]):
                print("%s %s %s" % (column, row, element))
                df[column][row][element] = df[column][row][element][2:]

            if re.match(r"^.*\\[abfnrtv]$", df[column][row][element]) or re.match(r"^.*\\[\\\'\"?]$", df[column][row][element]):
                df[column][row][element] = df[column][row][element][:-2]

            # Detect punctuation

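            # The hyphen is escaped: an unescaped "+-=" is a range that also matches digits,
            # which would cut the last digit off numeric tokens.
            # Both checks use `if` so a token such as "(kpu)" loses its leading and trailing mark.

            if re.match(r"^.*[`~!@#$%^&*()_+\-=\[\]{}|;':\",./<>?]$", df[column][row][element]):
                df[column][row][element] = df[column][row][element][:-1]

            if re.match(r"^[`~!@#$%^&*()_+\-=\[\]{}|;':\",./<>?].*$", df[column][row][element]):
                df[column][row][element] = df[column][row][element][1:]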


            # To be removed or not to be
            # Link & number

            if df[column][row][element].isdigit() == True or re.match(r"^.*\.co.*$", df[column][row][element]) or re.match(r"^.*\.id$", df[column][row][element]):
                holder.append(df[column][row][element])

        # Removal Process
        for l in holder:
            df[column][row].remove(l)

        # Filter empty
        df[column][row] = [token for token in df[column][row] if token]

In [9]:
stopwords_list = stopwords['stopwords'].tolist()

In [10]:
# Remove stopwords
for category in ['title', 'text']:
    for row in range(0, len(df[category])):
        # Keep tokens that are non-empty and not in the stopword list
        kept = [token for token in df[category][row] if token and token not in stopwords_list]

        # Join the remaining tokens back into a single string
        df[category][row] = " ".join(kept)
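
The cleaning steps above can also be sketched as one helper that works on a single string, which makes them easier to check in isolation. This is an illustrative rewrite under assumptions, not the notebook's exact logic: `clean_tokens`, the sample sentence, and the sample stopword set are hypothetical, and `str.strip` with `string.punctuation` removes all leading and trailing punctuation rather than one character per side.

```python
import re
import string

def clean_tokens(text, stopwords, max_tokens=None):
    """Lowercase, split, strip punctuation, and drop numbers, links and stopwords."""
    tokens = text.lower().split()
    if max_tokens is not None:
        # Optional cap, mirroring the 120-word limit used for article bodies
        tokens = tokens[:max_tokens]
    cleaned = []
    for token in tokens:
        token = token.strip(string.punctuation)
        # Link-like tokens contain ".co" or end with ".id"
        is_link = re.search(r"\.co|\.id$", token) is not None
        if not token or token.isdigit() or is_link or token in stopwords:
            continue
        cleaned.append(token)
    return cleaned

# Made-up sample sentence and stopword set
sample_stopwords = {"jadi"}
print(clean_tokens("VIVA.co.id - Korban banjir (Garut) jadi 34", sample_stopwords))
```

Passing the stopwords as a `set` keeps each membership test constant-time, which matters when the same check runs for every token of every article.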

#### End Cleaning Data
### Start Building Model

In [11]:
labeledDF_title = []
labeledDF_text = []

for i in range(0, len(df)):
    labeledDF_title.append((df['title'][i].lower(), df['category'][i].lower()))
    labeledDF_text.append((df['text'][i].lower(), df['category'][i].lower()))
    
labeledDF_title_train = labeledDF_title[:3200]
labeledDF_text_train = labeledDF_text[:3200]

labeledDF_title_test = labeledDF_title[3200:]
labeledDF_text_test = labeledDF_text[3200:]
    
print(len(labeledDF_title_train) + len(labeledDF_title_test))

4000
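
The slice at index 3200 assumes the rows of `articles.json` are not ordered by category or date. If that is uncertain, a shuffled split is a safer variant. A minimal sketch, assuming a fixed seed for reproducibility; `shuffled_split`, the seed, and the 0.8 ratio (3200 of 4000) are illustrative and not part of the notebook.

```python
import random

def shuffled_split(labeled, train_ratio=0.8, seed=42):
    """Shuffle a copy of the labeled pairs and split it into train and test lists."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for labeledDF_title
pairs = [("title %d" % i, "news" if i % 2 else "tekno") for i in range(10)]
train_pairs, test_pairs = shuffled_split(pairs)
print(len(train_pairs), len(test_pairs))
```

Because the original list is copied before shuffling, the row order in the source data stays untouched.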


In [12]:
cl = NaiveBayesClassifier(labeledDF_title_train)

#### End Building Model

### Testing and evaluating

In [13]:
only_title_accuracy = cl.accuracy(labeledDF_title_test)
print("Title only accuracy: %s" % (only_title_accuracy))

Title only accuracy: 0.99
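
The accuracy reported here is the fraction of test pairs whose predicted category equals the labeled one. A minimal sketch of that computation, assuming any callable that maps a text to a label; `accuracy`, `toy_classify`, and `toy_test` are hypothetical, and with the notebook's model one would pass `cl.classify` instead.

```python
def accuracy(classify, labeled_test):
    """Fraction of (text, label) pairs for which classify(text) equals label."""
    if not labeled_test:
        raise ValueError("labeled_test is empty")
    correct = sum(1 for text, label in labeled_test if classify(text) == label)
    return correct / float(len(labeled_test))

def toy_classify(text):
    # Toy rule for illustration only
    return "tekno" if "gaming" in text else "news"

toy_test = [
    ("industri gaming amd", "tekno"),
    ("kpk periksa politikus", "news"),
    ("startup teknologi", "tekno"),   # misclassified by the toy rule
    ("korban banjir", "news"),
]
print(accuracy(toy_classify, toy_test))  # 0.75
```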


In [14]:
# cl was trained on titles only; here it is scored on the article bodies
only_content_accuracy = cl.accuracy(labeledDF_text_test)
print("Content only accuracy: %s" % (only_content_accuracy))

Content only accuracy: 0.98


In [16]:
# Fallback in case Databricks is broken: average the two accuracies
average_accuracy = (only_title_accuracy + only_content_accuracy) / 2
print("Average accuracy: %.2f" % (average_accuracy))

Average accuracy: 0.98


In [None]:
# Score the title-trained classifier on the title and content test pairs combined
full_data_accuracy = cl.accuracy(labeledDF_title_test + labeledDF_text_test)
print("Full accuracy: %s" % (full_data_accuracy))

In [None]:
print("Title only accuracy: %s\nContent only accuracy: %s\nFull accuracy: %s" % (only_title_accuracy, only_content_accuracy, full_data_accuracy))