# Text classification for reuters21578

In this notebook, I describe the development of a naive text classification system for reuters21578.

As mentioned in the task description, text classification on reuters21578 involves quite a few choices during development. Let's look at the topics and their news counts in the `USABLE` set, i.e. `TRAIN + TEST`.


In [1]:
from reuters21578 import Reuters

dataset = Reuters()
article_stats = dataset.get_news_stats(mode='offline')
display(article_stats[article_stats.Set =='USABLE'].sort_values(by='Count', ascending=False))

Unnamed: 0,Topic,Set,Count
143,earn,USABLE,3965
3,acq,USABLE,2369
295,money-fx,USABLE,719
183,grain,USABLE,583
115,crude,USABLE,580
507,trade,USABLE,487
223,interest,USABLE,480
435,ship,USABLE,287
523,wheat,USABLE,283
79,corn,USABLE,239



As you can see, only a few topics have more than 100 usable news articles. We use the top 10 topics, selected as follows:



In [None]:
selected_topics_stats = article_stats[article_stats.Set =='USABLE'].sort_values(by='Count', ascending=False).head(10)
display(selected_topics_stats)


The next step is to define the feature extraction and classification methods. We have quite a few choices here! State-of-the-art text classification systems are mostly DL-based. Considering the time limit for this particular task and the limited number of news articles, I decided to use an SVM classifier on top of TF-IDF features. I am using NLTK for TF-IDF extraction and sklearn for classification.
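As a rough illustration of the TF-IDF plus SVM idea, here is a minimal sketch that assumes scikit-learn's `TfidfVectorizer` and `LinearSVC` as stand-ins. It is not the `TFIDFClassifier` used in this notebook (which computes TF-IDF with NLTK and whose internals are not shown here). The toy documents and labels are made up, and each topic is treated as a binary problem (1 = article belongs to the topic).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: label 1 = "earn", label 0 = not "earn".
train_docs = [
    "company reports higher quarterly profit and dividend",
    "net profit rose as quarterly earnings beat forecast",
    "dividend raised after strong earnings and profit",
    "firm agrees to acquire rival in merger deal",
    "takeover bid launched as merger talks continue",
    "shareholders approve acquisition and merger deal",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed a linear SVM; the pipeline keeps both steps together.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# Classify two unseen snippets.
predictions = model.predict(["profit and earnings rose", "merger deal approved"])
print(predictions)
```

Training one such binary model per topic corresponds to the per-topic loop used later in this notebook.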


In [None]:
from pandas import DataFrame

def show_result_table(scores):
    # Build one row per topic; DataFrame.append was removed in pandas 2.0,
    # so the table is created from a list of rows instead.
    rows = [{'Topic': topic, 'Precision': result[0], 'Recall': result[1]}
            for topic, result in scores.items()]
    display(DataFrame(rows, columns=['Topic', 'Precision', 'Recall']))

In [None]:
from tfidf_classifier import TFIDFClassifier
from reuters21578 import Reuters


# Select topic names by column name rather than by position
selected_topics = selected_topics_stats['Topic'].tolist()
data_set = Reuters()
data_set.load_data()
tfidf_classifier = TFIDFClassifier(data_set.get_all_train())
print("Calculating and adding TFIDF...")
data_set.add_tfidf(tfidf_classifier)

scores = dict()
for topic in selected_topics:
    print("Training classifier for :  '{}'".format(topic))
    X_train, Y_train = data_set.get_data(topic, 'TRAIN')
    X_test, Y_test = data_set.get_data(topic, 'TEST')
    tfidf_classifier.train(X_train, Y_train)
    print("Testing '{}'".format(topic))
    scores[topic]= tfidf_classifier.get_precision_recall(X_test, Y_test)

show_result_table(scores)

As you can see, the results degrade for topics with fewer news articles. Researchers have suggested many alternative feature extraction and classification methods for this task, but I did not have enough time to go through the published work.
For such a task, I would normally go through the data first to understand the distribution of words and what kind of data filtering would improve the results, before experimenting with classifiers.
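For reference, here is a minimal sketch of how per-topic precision and recall can be computed from binary labels. The helper name `precision_recall` and the example labels are hypothetical; this is not necessarily how `get_precision_recall` is implemented.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks the topic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is relevant.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 4 relevant articles, 3 predicted as relevant, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

With few positive test articles, a handful of misses moves recall a lot, which is one reason the smaller topics look less stable.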

I would like to mention the following points about this implementation:
- It is not good practice to upload data to GitHub along with the code. However, the file reut2-017.sgm has an encoding problem that must be fixed before it can be used here, so please use the Reuters data provided along with the code.
- This implementation assumes that all the data can be held in memory, which is feasible for such a small data set.
- The DataFrame operations could be more efficient and cleaner with a join. I ran into an error with the join that I could not fix quickly, so I submitted this implementation as is.
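To illustrate the join mentioned in the last point, here is a small sketch using `pandas.DataFrame.merge`. The two tables and their column names (`doc_id`, `split`, `text`, `topic`) are hypothetical and do not reflect the actual layout used by the `Reuters` class.

```python
import pandas as pd

# Hypothetical tables: one row per article, and one row per (article, topic) label.
articles = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "split": ["TRAIN", "TRAIN", "TEST"],
    "text": ["profit rose", "merger deal", "wheat exports fell"],
})
labels = pd.DataFrame({
    "doc_id": [1, 2, 3, 3],
    "topic": ["earn", "acq", "grain", "wheat"],
})

# A single merge attaches every topic label to its article;
# an article with two topics appears as two rows.
labelled = articles.merge(labels, on="doc_id", how="left")

# Select the training articles for one topic without an explicit Python loop.
train_earn = labelled[(labelled["split"] == "TRAIN") & (labelled["topic"] == "earn")]
print(train_earn)
```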

In general, text classification tasks depend heavily on the amount of data. As the results show, recall is not consistent even across the top 10 topics. Here are some simple variations that would be worth trying:

- Concatenating uni-gram and bi-gram TF-IDF in the feature vector. I expect that most of the bi-gram TF-IDF features would have to be filtered out.
- Higher-order n-gram sequences are easier to capture with a CNN, given a suitable filter length and number of filters. A CNN with small filter lengths (e.g. [1,2]) could be useful for the top 10 topics with the highest number of articles. See <a href="http://localhost:8889/notebooks/cnn-based-topic-classification.ipynb"> CNN-based implementation </a>
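The first variation could be sketched as follows, assuming scikit-learn's `TfidfVectorizer` rather than the NLTK-based extraction used in this notebook. Here `ngram_range=(1, 2)` produces uni-grams and bi-grams in one feature matrix, and `min_df=2` is one possible (assumed) way to filter out rare bi-grams. The toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus.
docs = [
    "crude oil prices rose",
    "crude oil output fell",
    "wheat exports rose",
    "wheat exports fell",
]

# Uni-grams and bi-grams together; drop any n-gram seen in fewer than 2 documents.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(docs)

# Only bi-grams that occur in at least two documents survive the filter.
kept_bigrams = sorted(term for term in vectorizer.vocabulary_ if " " in term)
print(X.shape, kept_bigrams)
```

On a real corpus the document-frequency threshold would need tuning, since most bi-grams occur only once or twice.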

