A fasttext implementation based on Torch

This is a Torch implementation of fasttext, based on A. Joulin's paper Bag of Tricks for Efficient Text Classification.
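
The model in the paper represents a text as the average of its word (and n-gram) embeddings and feeds that average to a linear classifier. Below is a minimal sketch of that architecture using nn; the sizes mirror the defaults listed under Parameters, and the actual structure in fasttext.lua may differ:

require 'nn'

-- Average the embeddings of the input feature ids, then classify linearly.
local vocab_size, dim, n_classes = 10000, 10, 4   -- illustrative sizes

local model = nn.Sequential()
model:add(nn.LookupTable(vocab_size, dim))  -- one embedding per feature id
model:add(nn.Mean(1))                       -- average over the text's features
model:add(nn.Linear(dim, n_classes))
model:add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()

-- Toy forward/backward step: a text of five feature ids with label 2.
local input = torch.LongTensor{5, 17, 99, 3, 42}
local output = model:forward(input)
local loss = criterion:forward(output, 2)
model:zeroGradParameters()
model:backward(input, criterion:backward(output, 2))
model:updateParameters(0.5)                 -- plain SGD, matching -lr 0.5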

Author: Junwei Pan

Email: pandevirus@gmail.com

Requirements

This code is written in Lua and requires Torch. If you're on Ubuntu, installing Torch in your home directory may look something like:

$ curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch
$ ./install.sh      # and enter "yes" at the end to modify your bashrc
$ source ~/.bashrc

This code also requires the nn package:

$ luarocks install nn

Usage

First, download the text classification datasets described in Xiang Zhang's paper Character-level Convolutional Networks for Text Classification. We use the ag_news_csv dataset for training and evaluation.
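
Each line of the ag_news_csv files holds three quoted fields: the class index, the title, and the description. A sketch of pulling them apart in Lua (parse_line is an illustrative helper, not part of this repo, and it ignores the doubled-quote escapes that a few lines contain):

-- Split one CSV line of the form: "class","title","description"
local function parse_line(line)
  local class, title, desc = line:match('^"(%d+)","(.-)","(.*)"$')
  return tonumber(class), title, desc
end

local class, title, desc = parse_line('"3","Some title","Some description"')
-- class = 3, title = 'Some title', desc = 'Some description'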

Then run the following commands to train and evaluate the fasttext model:

$ th main.lua -corpus_train data/ag_news_csv/train.csv -corpus_test data/ag_news_csv/test.csv -dim 10 -minfreq 10 -stream 0 -epochs 5 -suffix 1 -n_classes 4 -n_gram 1 -decay 0 -lr 0.5

If the dataset is too large to fit in memory, try the parameter -stream 1, which streams examples from disk instead of loading them all up front (see the sketch below).
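
A sketch of the difference between the two modes (the loop bodies are placeholders, not the actual training code in main.lua):

-- -stream 0: read every example into memory once, then iterate over the table
local examples = {}
for line in io.lines('data/ag_news_csv/train.csv') do
  examples[#examples + 1] = line
end

-- -stream 1: re-read the file on every epoch, holding one line at a time
for epoch = 1, 5 do
  for line in io.lines('data/ag_news_csv/train.csv') do
    -- parse the line and take one SGD step, then discard it
  end
end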

With the configuration above, the trained model reaches an accuracy of 90.93% on the ag_news_csv dataset.

Parameters

-corpus_train: path of the training data

-corpus_test: path of the testing data

-minfreq: only words with frequency higher than this are used as features, default 10

-dim: the embedding dimension, default 10

-lr: learning rate, default 0.5

-min_lr: the minimum learning rate, default 0.001

-decay: whether to decay the learning rate: 1 for decay, 0 for no decay, default 0

-epochs: number of epochs to go through the training data, default 5

-stream: whether to stream the data: 1 for streaming, 0 to store all data in memory, default 0

-suffix: suffix appended to the saved model's name

-n_classes: number of classification categories

-n_gram: order of the n-gram features: 1 for unigrams, 2 for bigrams, 3 for trigrams, default 1 (see the sketch after this list)

-title: whether to use the title to generate features, default 1

-description: whether to use the description to generate features, default 1
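
As referenced in the -n_gram entry above, higher-order n-grams are added alongside the unigram features. A sketch of the idea (the helper name and the '_' delimiter are illustrative; see fasttext.lua for the actual feature generation):

-- Emit each word, plus joined bigrams/trigrams when n_gram is 2 or 3.
local function extract_features(words, n_gram)
  local feats = {}
  for i = 1, #words do
    feats[#feats + 1] = words[i]
    if n_gram >= 2 and i < #words then
      feats[#feats + 1] = words[i] .. '_' .. words[i + 1]
    end
    if n_gram >= 3 and i < #words - 1 then
      feats[#feats + 1] = words[i] .. '_' .. words[i + 1] .. '_' .. words[i + 2]
    end
  end
  return feats
end

-- extract_features({'wall', 'st', 'bears'}, 2)
-- returns {'wall', 'wall_st', 'st', 'st_bears', 'bears'}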

To be done

  1. Support the hashing trick for n-gram features (see the sketch below)
  2. Improve efficiency
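
The hashing trick from the paper keeps the n-gram feature space bounded by hashing each n-gram into one of a fixed number of buckets instead of growing a vocabulary. A sketch of what that could look like here (the hash function and bucket count are illustrative, not implemented in this repo):

local n_buckets = 2000000  -- fixed feature space for hashed n-grams

-- djb2-style string hash, reduced modulo the bucket count
local function hash_feature(s)
  local h = 5381
  for i = 1, #s do
    h = (h * 33 + s:byte(i)) % n_buckets
  end
  return h + 1  -- 1-based index into the embedding table
end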

Acknowledgements

This code is based on the word2vec_torch project, which extends Yoon Kim's word2vec_torch by implementing the Continuous Bag-of-Words model.
