kanad-rep / Text-Classification

This experiment compares several supervised learning classifiers and their variants for text classification on the Reuters-21578 dataset. The aim is to find the best performance of each classifier by tuning its parameters so that the classification error is minimized.

Dataset:

The original Reuters-21578 corpus contains 135 categories, and the categories overlap, i.e., a document may belong to several categories. Hence, we use the ModApte version of Reuters, which contains 12902 documents in 90 categories and is divided into training and test sets. For this experiment, we are given the following 10 categories:

                  alum, barley, coffee, dmk, fuel, livestock, palm-oil, retail, soybean, veg-oil

Procedure:

Starting from the given training and test sets, we pre-process and clean the text to improve classification. We remove stop words and punctuation from both the training and test sets. We then tokenize the text using the count vectorizer and apply a stemming algorithm to map each token to its root word. Next, we build a TF-IDF weighted term matrix using the Tfidf vectorizer for the classifiers to work on.
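The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the code from `Text_Classification.py`: it assumes scikit-learn's `TfidfVectorizer` and its built-in English stop-word list, the sample documents are made up, and `simple_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemming algorithm such as Porter's.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def simple_stem(token):
    """Crude suffix stripper (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(doc):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [simple_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


# Made-up documents standing in for the Reuters training and test sets.
train_docs = [
    "Coffee exports rose sharply, and prices increased.",
    "Fuel prices fell as the refinery expanded output.",
    "Soybean farmers exported more beans this season.",
]
test_docs = ["Palm oil exporting slowed."]

# Fit the vocabulary and IDF weights on the training set only,
# then reuse them to transform the test set.
vectorizer = TfidfVectorizer(analyzer=analyze)
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Fitting the vectorizer on the training set alone keeps information from the test documents out of the vocabulary and the IDF weights, so the later evaluation is not biased.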

For each classifier, we train a model on the training set and search for the set of parameters that gives the best performance. We then use the model to classify the data points in the test set and evaluate its performance. Finally, we compare the predictions with the actual class labels of the test samples and draw conclusions from the comparison.
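The tune-then-evaluate loop described above might look like the sketch below. It assumes scikit-learn's `GridSearchCV` with 3-fold cross-validation, and the two classifiers, their parameter grids, and the toy documents are illustrative choices only, not the settings used in the report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up documents standing in for two of the Reuters categories.
train_docs = [
    "coffee exports rose in brazil", "coffee prices fell this week",
    "brazil coffee harvest delayed", "coffee quota talks resume",
    "colombia raises coffee output", "coffee stocks remain high",
    "fuel oil prices increased", "heavy fuel demand is weak",
    "refinery cuts fuel output", "fuel tax raised again",
    "fuel imports fell sharply", "utilities burn more fuel",
]
train_labels = ["coffee"] * 6 + ["fuel"] * 6
test_docs = ["coffee growers expect higher prices", "fuel prices expected to climb"]
test_labels = ["coffee", "fuel"]

# Illustrative classifiers and parameter grids (assumed, not from the report).
candidates = {
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
}

results = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    # Cross-validate every parameter combination on the training set only.
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
    search.fit(train_docs, train_labels)
    # The best model is refit on the full training set before prediction.
    predictions = search.predict(test_docs)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": accuracy_score(test_labels, predictions),
    }

for name, summary in results.items():
    print(name, summary)
```

Here the error is one minus the accuracy, so choosing the parameters with the highest cross-validated accuracy corresponds to recording the least error.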
