# Text classification for reuters21578

In this notebook, I describe the development of a naive text classification system for reuters21578.

As mentioned in the task description, text classification on reuters21578 requires quite a few choices during development. Let's look at the different topics and the corresponding news counts, which are shown as USABLE, i.e. TRAIN+TEST.



In [1]:
from data_utils import *

article_stats_template = create_stat_template()
article_stats = get_news_stats(article_stats_template)
display(article_stats[article_stats.Set =='USABLE'].sort_values(by='Count', ascending=False))

| | Topic | Set | Count |
|---|---|---|---|
| 143 | earn | USABLE | 3965 |
| 3 | acq | USABLE | 2369 |
| 295 | money-fx | USABLE | 719 |
| 183 | grain | USABLE | 583 |
| 115 | crude | USABLE | 580 |
| 507 | trade | USABLE | 487 |
| 223 | interest | USABLE | 480 |
| 435 | ship | USABLE | 287 |
| 523 | wheat | USABLE | 283 |
| 79 | corn | USABLE | 239 |



As you can see, a few topics have more than 100 usable news articles. We use the top 20 topics, as follows:



In [2]:
# Keep the 20 topics with the most usable articles
selected_topics_stats = article_stats[article_stats.Set =='USABLE'].sort_values(by='Count', ascending=False).head(20)
display(selected_topics_stats)

| | Topic | Set | Count |
|---|---|---|---|
| 143 | earn | USABLE | 3965 |
| 3 | acq | USABLE | 2369 |
| 295 | money-fx | USABLE | 719 |
| 183 | grain | USABLE | 583 |
| 115 | crude | USABLE | 580 |
| 507 | trade | USABLE | 487 |
| 223 | interest | USABLE | 480 |
| 435 | ship | USABLE | 287 |
| 523 | wheat | USABLE | 283 |
| 79 | corn | USABLE | 239 |



The next step is to define the feature extraction and classification methods. We have quite a few choices here! State-of-the-art text classification systems are mostly DL-based. Considering the time limit for this particular task, I decided to use an SVM classifier on top of TF-IDF features. I am using NLTK and sklearn for TF-IDF extraction and classification, respectively.
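To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It is not the code used in this notebook: the function name `tfidf_vectors` and the toy documents are made up for illustration, and it uses one common textbook variant (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency). Library implementations may differ in smoothing and normalization.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / doc_length, idf = log(N / df).
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy documents, made up for illustration only.
docs = [
    "profit rose in the quarter".split(),
    "wheat exports rose".split(),
    "the company reported profit".split(),
]
vectors = tfidf_vectors(docs)
```

In the second toy document, "wheat" occurs in only one document and therefore gets a higher weight than "rose", which occurs in two. A term that appears in every document gets a weight of zero under this variant.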


In [12]:
from IPython.display import HTML, display
import tabulate
def get_result_table(scores):
    """Build a list of rows (header first) from {topic: (precision, recall)}."""
    table = [['Topic', 'Precision', 'Recall']]
    for topic in scores:
        table.append([topic, scores[topic][0], scores[topic][1]])
    return table


In [7]:
from reuters21578 import Reuters
from tfidf_classifier import TFIDFClassifier

scores = dict()
selected_topics = [item[0] for item in selected_topics_stats.values.tolist()]

# Load the corpus, build the classifier from the TRAIN split and attach TF-IDF to the data set
data_set = Reuters()
data_set.load_data()
tfidf_classifier = TFIDFClassifier(data_set.get_all_train_data(set='TRAIN'))
data_set.add_tfidf(tfidf_classifier)

# Train and evaluate one classifier per selected topic
for topic in selected_topics:
    print("Training classifier for :  '{}'".format(topic))
    X_train, Y_train = data_set.get_data(topic, 'TRAIN')
    X_test, Y_test = data_set.get_data(topic, 'TEST')
    tfidf_classifier.train(X_train, Y_train)
    print("Testing '{}'".format(topic))
    scores[topic] = tfidf_classifier.get_precision_recall(X_test, Y_test)

display(get_result_table(scores))

processing file 000
processing file 001
processing file 002
processing file 003
processing file 004
processing file 005
processing file 006
processing file 007
processing file 008
processing file 009
processing file 010
processing file 011
processing file 012
processing file 013
processing file 014
processing file 015
processing file 016
processing file 017
processing file 018
processing file 019
processing file 020
processing file 021
Training classifier for :  'earn'
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Testing 'earn'
Training classifier for :  'acq'
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Testing 'acq'
Training classifier for :  'money-fx'
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Testing 'money-fx'
Training classifier for :  'grain'
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Testing 'grain'
Training classifier for :  'crude'
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Testing 'crude'
Training classifier for :  'tr

{'earn': (0.9871323529411765, 0.9871323529411765),
 'acq': (0.9540389972144847, 0.952712100139082),
 'money-fx': (0.569620253164557, 0.25),
 'grain': (0.4117647058823529, 0.14093959731543623),
 'crude': (0.8206896551724138, 0.6296296296296297),
 'trade': (0.7941176470588235, 0.4576271186440678),
 'interest': (0.8243243243243243, 0.45864661654135336),
 'ship': (0.6341463414634146, 0.29213483146067415),
 'wheat': (0.5, 0.1267605633802817),
 'corn': (0.5, 0.03571428571428571),
 'dlr': (0.0, 0.0),
 'money-supply': (0.8666666666666667, 0.7647058823529411),
 'oilseed': (nan, 0.0),
 'sugar': (0.8846153846153846, 0.6388888888888888),
 'coffee': (0.8571428571428571, 0.8571428571428571),
 'gnp': (0.43333333333333335, 0.37142857142857144),
 'gold': (0.7647058823529411, 0.43333333333333335),
 'veg-oil': (0.75, 0.08108108108108109),
 'soybean': (nan, 0.0),
 'nat-gas': (0.5, 0.3333333333333333)}

As you can see, the results degrade for topics with fewer news articles. Researchers have suggested many alternative feature extraction and classification approaches for this task, but I did not have enough time to go through the published work.
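The `(nan, 0.0)` pairs reported for `oilseed` and `soybean` are consistent with a classifier that never predicts the positive class on the test set: precision then becomes 0/0, while recall is 0. The internals of `get_precision_recall` are not shown here, so the sketch below is only an assumption about how such values can arise. The function name `precision_recall` and the labels are hypothetical.

```python
import math


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = article has the topic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # With no positive predictions, precision is 0/0; report it as nan.
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    return precision, recall


# Hypothetical labels: a rare topic that the classifier never predicts.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

For rare topics, a linear classifier trained on very few positive examples can easily end up in this regime, which would fit the pattern in the scores above.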

I would like to mention the following points about this implementation:
- This implementation assumes that all the data can be held in memory, which is feasible for such a small data set.
- The dataframe operations could be much more efficient and cleaner with a join. I ran into an error with the join that I couldn't fix, so I submitted this implementation instead.

In general, text classification tasks depend heavily on the amount of data. As the results show, recall is not consistent even for the top 10 topics.

Here are some simple alternatives that would be nice to try:
- Concatenating uni-gram and bi-gram TF-IDF in the feature vector. I expect that most of the bi-gram TF-IDF features should be filtered out.
- A CNN with small filter lengths (e.g. [1,2,3]) could be useful for the top 10 topics with the highest number of articles.
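The first alternative can be sketched in pure Python as follows. This is an illustration under assumptions, not part of the implementation above: the helper names `ngrams` and `build_vocabulary`, the `min_df` threshold and the toy documents are all hypothetical. The idea is to collect uni-grams and bi-grams together and keep only those that occur in at least `min_df` documents, which is one simple way to filter out most rare bi-grams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(tokenized_docs, min_df=2):
    """Keep uni-grams and bi-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(ngrams(tokens, 1) + ngrams(tokens, 2)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy documents, made up for illustration only.
docs = [
    "crude oil prices rose".split(),
    "crude oil output fell".split(),
    "wheat prices rose".split(),
]
vocabulary = build_vocabulary(docs, min_df=2)
```

On these toy documents, only the bi-grams "crude oil" and "prices rose" survive the filter, while bi-grams that appear in a single document are dropped. The surviving terms could then be weighted with TF-IDF in the same way as the uni-grams.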
