A fasttext implementation based on Torch

This is a Torch implementation of fasttext, based on A. Joulin's paper Bag of Tricks for Efficient Text Classification.
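
The model in the paper represents a text as the average of its word (and n-gram) embeddings and feeds that average to a linear classifier. Below is a minimal sketch of that architecture using nn; the sizes mirror the defaults listed under Parameters, and the actual structure in fasttext.lua may differ:

require 'nn'

-- Average the embeddings of the input feature ids, then classify linearly.
local vocab_size, dim, n_classes = 10000, 10, 4   -- illustrative sizes

local model = nn.Sequential()
model:add(nn.LookupTable(vocab_size, dim))  -- one embedding per feature id
model:add(nn.Mean(1))                       -- average over the text's features
model:add(nn.Linear(dim, n_classes))
model:add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()

-- Toy forward/backward step: a text of five feature ids with label 2.
local input = torch.LongTensor{5, 17, 99, 3, 42}
local output = model:forward(input)
local loss = criterion:forward(output, 2)
model:zeroGradParameters()
model:backward(input, criterion:backward(output, 2))
model:updateParameters(0.5)                 -- plain SGD, matching -lr 0.5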

Author: Junwei Pan

Email: pandevirus@gmail.com

Requirements

This code is written in Lua and requires Torch. If you're on Ubuntu, installing Torch in your home directory may look something like:

$ curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch
$ ./install.sh      # and enter "yes" at the end to modify your bashrc
$ source ~/.bashrc

This code also requires the nn package:

$ luarocks install nn

Usage

First, download the text classification datasets described in Xiang Zhang's paper Character-level Convolutional Networks for Text Classification. We use the ag_news_csv dataset for training and evaluation.
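
Each line of the ag_news_csv files holds three quoted fields: the class index, the title, and the description. A sketch of pulling them apart in Lua (parse_line is an illustrative helper, not part of this repo, and it ignores the doubled-quote escapes that a few lines contain):

-- Split one CSV line of the form: "class","title","description"
local function parse_line(line)
  local class, title, desc = line:match('^"(%d+)","(.-)","(.*)"$')
  return tonumber(class), title, desc
end

local class, title, desc = parse_line('"3","Some title","Some description"')
-- class = 3, title = 'Some title', desc = 'Some description'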

Then run the following commands to train and evaluate the fasttext model:

$ th main.lua -corpus_train data/ag_news_csv/train.csv -corpus_test data/ag_news_csv/test.csv -dim 10 -minfreq 10 -stream 0 -epochs 5 -suffix 1 -n_classes 4 -n_gram 1 -decay 0 -lr 0.5

If the dataset is too large to fit in memory, try the parameter -stream 1, which streams examples from disk instead of loading them all up front (see the sketch below).
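
A sketch of the difference between the two modes (the loop bodies are placeholders, not the actual training code in main.lua):

-- -stream 0: read every example into memory once, then iterate over the table
local examples = {}
for line in io.lines('data/ag_news_csv/train.csv') do
  examples[#examples + 1] = line
end

-- -stream 1: re-read the file on every epoch, holding one line at a time
for epoch = 1, 5 do
  for line in io.lines('data/ag_news_csv/train.csv') do
    -- parse the line and take one SGD step, then discard it
  end
end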

With the configuration above, the trained model reaches an accuracy of 90.93% on the ag_news_csv dataset.

Parameters

-corpus_train: path of the training data

-corpus_test: path of the testing data

-minfreq: only words with frequency higher than this are used as features, default 10

-dim: the embedding dimension, default 10

-lr: learning rate, default 0.5

-min_lr: the minimum learning rate, default 0.001

-decay: whether to decay the learning rate: 1 for decay, 0 for no decay, default 0

-epochs: number of epochs to go through the training data, default 5

-stream: whether to stream the data: 1 for streaming, 0 to store all data in memory, default 0

-suffix: suffix appended to the saved model's name

-n_classes: number of classification categories

-n_gram: order of the n-gram features: 1 for unigrams, 2 for bigrams, 3 for trigrams, default 1 (see the sketch after this list)

-title: whether to use the title to generate features, default 1

-description: whether to use the description to generate features, default 1
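
As referenced in the -n_gram entry above, higher-order n-grams are added alongside the unigram features. A sketch of the idea (the helper name and the '_' delimiter are illustrative; see fasttext.lua for the actual feature generation):

-- Emit each word, plus joined bigrams/trigrams when n_gram is 2 or 3.
local function extract_features(words, n_gram)
  local feats = {}
  for i = 1, #words do
    feats[#feats + 1] = words[i]
    if n_gram >= 2 and i < #words then
      feats[#feats + 1] = words[i] .. '_' .. words[i + 1]
    end
    if n_gram >= 3 and i < #words - 1 then
      feats[#feats + 1] = words[i] .. '_' .. words[i + 1] .. '_' .. words[i + 2]
    end
  end
  return feats
end

-- extract_features({'wall', 'st', 'bears'}, 2)
-- returns {'wall', 'wall_st', 'st', 'st_bears', 'bears'}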

To be done

  1. Support the hashing trick for n-gram features (see the sketch below)
  2. Improve efficiency
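
The hashing trick from the paper keeps the n-gram feature space bounded by hashing each n-gram into one of a fixed number of buckets instead of growing a vocabulary. A sketch of what that could look like here (the hash function and bucket count are illustrative, not implemented in this repo):

local n_buckets = 2000000  -- fixed feature space for hashed n-grams

-- djb2-style string hash, reduced modulo the bucket count
local function hash_feature(s)
  local h = 5381
  for i = 1, #s do
    h = (h * 33 + s:byte(i)) % n_buckets
  end
  return h + 1  -- 1-based index into the embedding table
end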

Acknowledgements

This code is based on the word2vec_torch project, which extends Yoon Kim's word2vec_torch by implementing the Continuous Bag-of-Words model.
