text

Models, Datasets, Metrics and Utils for NLP.

Installation

...

Usage

Models

Supported models and model types

bert: {"bert-base-cased", "bert-base-uncased", "bert-large-cased", "bert-large-uncased", "bert-base-chinese"}

elmo: {"elmo-simplified-chinese", "elmo-traditional-chinese", "elmo-english"}

Load the pretrained model

# Load a pretrained model together with its tokenizer and config.
from flowtext.models import bert
model, tokenizer, bert_config = bert(pretrained=True, model_type='bert-base-uncased', checkpoint_path=None)

# You can also build a model directly from a config.
from flowtext.models import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
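
A mistyped model type string is an easy mistake to make, so it can help to validate the name before loading. The sketch below copies the supported names listed earlier into a lookup table. The names `SUPPORTED_MODEL_TYPES` and `check_model_type` are hypothetical helpers written for this illustration and are not part of flowtext.

```python
# Model families and type names copied from the supported list in this README.
# SUPPORTED_MODEL_TYPES and check_model_type are illustrative, not flowtext APIs.
SUPPORTED_MODEL_TYPES = {
    "bert": {
        "bert-base-cased",
        "bert-base-uncased",
        "bert-large-cased",
        "bert-large-uncased",
        "bert-base-chinese",
    },
    "elmo": {
        "elmo-simplified-chinese",
        "elmo-traditional-chinese",
        "elmo-english",
    },
}


def check_model_type(family, model_type):
    """Return model_type if it is listed for the family, else raise ValueError."""
    if family not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unknown model family: {family!r}")
    if model_type not in SUPPORTED_MODEL_TYPES[family]:
        choices = ", ".join(sorted(SUPPORTED_MODEL_TYPES[family]))
        raise ValueError(
            f"unsupported {family} model type {model_type!r}; expected one of: {choices}"
        )
    return model_type


if __name__ == "__main__":
    print(check_model_type("bert", "bert-base-uncased"))
```

The returned string could then be passed as the `model_type` argument shown above.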

Datasets

The dataset module currently contains:

Language modeling: [WikiText2, WikiText103, PennTreebank]

Machine translation: [IWSLT2016, IWSLT2017, Multi30k]

Sequence tagging (e.g. POS/NER): [UDPOS, CoNLL2000Chunking]

Question answering: [SQuAD1, SQuAD2]

Text classification: [AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB]

Load NLP-related datasets and build a DataLoader

from flowtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
# Fetch a single (label, line) sample.
next(train_iter)
# Or iterate with a for loop.
for (label, line) in train_iter:
    print(label, line)
# Or pass the dataset to a DataLoader.
from oneflow.utils.data import DataLoader
train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
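
To make the batching step concrete, here is a minimal pure-Python sketch of how an iterator of (label, line) pairs can be grouped into fixed-size batches. It does not use flowtext or OneFlow. The function `batch_pairs` and the sample data are made up for illustration, and the real DataLoader may collate samples differently.

```python
from itertools import islice


def batch_pairs(pairs, batch_size):
    """Yield (labels, lines) batches from an iterable of (label, line) pairs."""
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        labels, lines = zip(*chunk)
        yield list(labels), list(lines)


# Hypothetical samples standing in for a dataset iterator.
samples = [
    (3, "Stocks rally after earnings report"),
    (4, "New chip doubles battery life"),
    (2, "Home team wins in overtime"),
    (1, "Leaders meet for trade talks"),
    (3, "Oil prices slip on supply news"),
]

if __name__ == "__main__":
    for labels, lines in batch_pairs(samples, batch_size=2):
        print(labels, lines)
```

The last batch is smaller than `batch_size` when the number of samples is not a multiple of it.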

Metrics

The metrics module currently contains:

Bleu_score

Ngram_counter

NLP-related evaluation metrics

>>> from flowtext.data.metrics import bleu_score
>>> candidate_corpus = [['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Another', 'Sentence']]
>>> references_corpus = [[['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Completely', 'Different']], [['No', 'Match']]]
>>> bleu_score(candidate_corpus, references_corpus)
0.889139711856842
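
To show where a score like the one above comes from, the following is a minimal sketch of corpus-level BLEU with uniform weights over 1- to 4-grams, clipped n-gram counts, and a brevity penalty based on the closest reference length. It is a standalone illustration and assumes nothing about how flowtext implements `bleu_score`. The names `ngram_counts` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def corpus_bleu(candidates, references, max_n=4):
    """Corpus BLEU with uniform weights; returns 0.0 if any n-gram order has no match."""
    clipped = [0] * max_n  # matched n-grams, clipped by the reference counts
    total = [0] * max_n    # all candidate n-grams
    cand_len = 0
    ref_len = 0
    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Use the reference whose length is closest to the candidate length.
        ref_len += min((len(r) for r in refs), key=lambda n: abs(n - len(cand)))
        cand_counts = ngram_counts(cand, max_n)
        max_ref_counts = Counter()
        for ref in refs:
            max_ref_counts |= ngram_counts(ref, max_n)  # element-wise maximum
        for gram, count in (cand_counts & max_ref_counts).items():
            clipped[len(gram) - 1] += count
        for gram, count in cand_counts.items():
            total[len(gram) - 1] += count
    if min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity_penalty = math.exp(min(0.0, 1 - ref_len / cand_len))
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    candidate_corpus = [['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Another', 'Sentence']]
    references_corpus = [[['This', 'is', 'a', 'oneflow', 'bleu', 'test'], ['Completely', 'Different']], [['No', 'Match']]]
    print(round(corpus_bleu(candidate_corpus, references_corpus), 4))  # prints 0.8891
```

For this corpus the n-gram precisions are 6/8, 5/6, 3/4 ÷ 3/4 = 1 for trigrams (4/4) and 1 for 4-grams (3/3), with no brevity penalty, so the score is (0.75 × 0.8333...)^(1/4), about 0.8891.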

Utils

Load tokenizer

>>> from flowtext.data import get_tokenizer
>>> # The 'tokenizer' parameter also accepts spacy, moses, toktok, revtok, subword, and jieba.
>>> tokenizer = get_tokenizer(tokenizer="basic_english", language="en")
>>> tokens = tokenizer("Today is a good day!")
>>> tokens
['today', 'is', 'a', 'good', 'day', '!']
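
As a rough picture of what the output above reflects, the sketch below lowercases the input and separates punctuation from words using a single regular expression. It is an approximation written for illustration only. `basic_tokenize` is a hypothetical name, and the actual "basic_english" tokenizer may normalize text differently.

```python
import re


def basic_tokenize(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


if __name__ == "__main__":
    print(basic_tokenize("Today is a good day!"))
    # prints ['today', 'is', 'a', 'good', 'day', '!']
```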

Disclaimer on Datasets

flowtext.datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, we do not guarantee their quality or fairness, and we do not claim to hold a license for them. It is your responsibility to determine whether you have permission to use a dataset under that dataset's license.

If you are a dataset owner and want to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please contact us through GitHub issues.

License

OneFlow has a BSD-style license, as found in the LICENSE file.
