
This is an experiment on text classification using different supervised learning classifiers and their variants conducted on the Reuters-21578 dataset. The aim is to evaluate the best performance for each of the classifiers by properly tuning the parameters of each classifier so that the least error is recorded during the classification.


kanad-rep/Text-Classification



Dataset:

The original Reuters-21578 corpus contains 135 categories, and the categories overlap, i.e., a document may belong to several categories. Hence, we consider the ModApte version of Reuters, which contains 12902 documents in 90 categories and is divided into training and test sets. For this experiment, we are given the following 10 categories:

                  alum, barley, coffee, dmk, fuel, livestock, palm-oil, retail, soybean, veg-oil
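
As a rough illustration of restricting a multi-label corpus to the ten categories above, the sketch below filters a small collection of `(text, categories)` pairs. The sample documents and the `select_documents` helper are hypothetical and only stand in for the real corpus; how the experiment itself treats documents with several labels is not stated here.

```python
TARGET_CATEGORIES = {
    "alum", "barley", "coffee", "dmk", "fuel",
    "livestock", "palm-oil", "retail", "soybean", "veg-oil",
}


def select_documents(corpus):
    """Keep documents with at least one target category, dropping other labels."""
    selected = []
    for text, categories in corpus:
        kept = sorted(TARGET_CATEGORIES.intersection(categories))
        if kept:
            selected.append((text, kept))
    return selected


# Hypothetical sample; real Reuters documents may carry several categories.
corpus = [
    ("Palm oil output rose.", ["palm-oil", "veg-oil"]),
    ("Wheat shipments delayed.", ["wheat", "grain"]),
    ("Coffee quota talks resume.", ["coffee"]),
]

subset = select_documents(corpus)
# Documents that still belong to more than one of the ten categories.
multi_label = [text for text, kept in subset if len(kept) > 1]
```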

Procedure:

From the given training and test sets, we pre-process the data and apply text-cleaning methods for better classification. We remove stop words and punctuation from both the training and test sets. We then tokenize the text using the count vectorizer function and apply a stemming algorithm to map each token to its root word. Next, we create a term frequency matrix using the Tfidf vectorizer for the classifiers to work on.
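
The pre-processing steps above can be sketched with scikit-learn roughly as follows. This is a minimal sketch under several assumptions: the sample documents are made up, `crude_stem` is a toy suffix stripper standing in for a real stemming algorithm (such as Porter's), and `CountVectorizer` followed by `TfidfTransformer` is used as an equivalent of a single Tfidf vectorizer so that the counting and weighting steps stay visible.

```python
import string

from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfTransformer,
)


def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [crude_stem(t) for t in cleaned.split() if t not in ENGLISH_STOP_WORDS]


# Hypothetical stand-ins for the Reuters training and test documents.
train_docs = ["Coffee exports rose sharply.", "Palm oil prices are falling."]
test_docs = ["Coffee prices rising!"]

counter = CountVectorizer(analyzer=analyze)
tfidf = TfidfTransformer()

# Fit the vocabulary and IDF weights on the training set only, then reuse them.
X_train = tfidf.fit_transform(counter.fit_transform(train_docs))
X_test = tfidf.transform(counter.transform(test_docs))
```

Fitting on the training documents alone keeps test-set vocabulary out of the model, so terms seen only in the test set are simply ignored when it is transformed.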

We train a model for each classifier on the training set and search for the set of parameters that gives the best performance. We then use the model to classify the data points in the test set and evaluate its performance. Finally, we compare the predictions with the actual class labels of the test samples and draw conclusions from the comparison.
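
One possible way to organise the tuning and evaluation loop is a cross-validated grid search, sketched below with scikit-learn. The classifier (`MultinomialNB`), the parameter grid, the scoring metric, and the toy documents are all assumptions chosen for illustration; the experiment's actual classifiers and search procedure may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the Reuters training and test sets.
train_docs = [
    "coffee exports rose in brazil",
    "coffee growers expect a larger crop",
    "brazil coffee harvest delayed by rain",
    "coffee prices fell on export quotas",
    "aluminium smelter output increased",
    "alum producers cut smelter capacity",
    "aluminium stocks rose at the exchange",
    "smelter strike hits aluminium output",
]
train_labels = ["coffee"] * 4 + ["alum"] * 4
test_docs = ["coffee crop in brazil", "aluminium smelter capacity"]
test_labels = ["coffee", "alum"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Assumed grid: naive Bayes smoothing strength and the n-gram range.
param_grid = {
    "clf__alpha": [0.01, 0.1, 1.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

# Cross-validate every parameter combination on the training set only.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(train_docs, train_labels)

# The best configuration is refit on all training data before prediction.
predictions = search.predict(test_docs)
print("best parameters:", search.best_params_)
print("test accuracy:", accuracy_score(test_labels, predictions))
print(classification_report(test_labels, predictions, zero_division=0))
```

Swapping in another classifier only requires changing the `clf` step and the matching `clf__` entries of the grid, which makes it straightforward to repeat the same comparison across several classifiers.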
