This is a Corpus File Generator

It extracts sentences for Mycroft's mimic-recording-studio (https://github.com/MycroftAI/mimic-recording-studio.git). The data is processed with lingua-franca (https://github.com/MycroftAI/lingua-franca.git), and all numbers are written out as words.

Installation

pip install -r requirements.txt

Test Files

For now, an English corpus, english_corpus.csv, is available in backend/prompt/. To use your own corpus, follow these steps.

Create a csv file in the same format as english_corpus.csv using tabs (\t) as the delimiter.
Add your corpus to the backend/prompt directory.
Change the CORPUS environment variable in docker-compose.yml to your corpus name.
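
The file format from the steps above can be sketched in Python. This is a minimal sketch, not the generator's own code: it assumes each row is a sentence, a tab, and the sentence's character count (the source only hints at this with "line length"), and the function name write_corpus is hypothetical.

```python
import io


def write_corpus(sentences, out):
    """Write one 'sentence<TAB>length' row per non-empty sentence.

    Assumption: the second column is the character count of the sentence.
    """
    count = 0
    for sentence in sentences:
        # Collapse whitespace so a stray tab cannot break the delimiter.
        text = " ".join(sentence.split())
        if not text:
            continue
        out.write(f"{text}\t{len(text)}\n")
        count += 1
    return count


buffer = io.StringIO()
rows = write_corpus(["Hello world.", "  ", "The quick\tbrown fox."], buffer)
print(buffer.getvalue(), end="")
```

To write a real file, pass an object opened with open("my_corpus.csv", "w", encoding="utf-8") instead of the in-memory buffer.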

Use the Generator

The file generator builds a corpus from Wikipedia sentences. Just call the command:

python3 corpus_file_gen.py

You are always asked for the wiki language (such as 'en').

If you only have a simple text file without line lengths and tabs, you can only check the file.

python3 corpus_file_gen.py --prepare_file 3 --file english_corpus.csv
If you have already started a file, the generator will expand it to 35K sentences.

python3 corpus_file_gen.py --prepare_file 1 --file english_corpus.csv
Use --help for help.
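
Expanding an already started file can be pictured roughly as follows. This is only a sketch under assumptions: the function name expand_corpus, the duplicate check, and the way the 35K target is applied are hypothetical and are not taken from corpus_file_gen.py.

```python
def expand_corpus(existing, candidates, target=35000):
    """Return the existing sentences plus unseen candidates, up to `target` rows."""
    result = list(existing)
    seen = set(result)
    for sentence in candidates:
        if len(result) >= target:
            break  # the corpus is already large enough
        if sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


started = ["First sentence.", "Second sentence."]
fresh = ["Second sentence.", "Third sentence.", "Fourth sentence."]
expanded = expand_corpus(started, fresh, target=3)
print(expanded)
```

Sentences already in the file are kept in their original order, so recordings made against earlier rows stay aligned with the corpus.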

This is a very simple generator. You should always check the file and delete false records. We are working on a solution to change numbers into words.

Operation

This is a crude tool for getting sentences from Mycroft translations, Mozilla Voice, and Wikipedia. It produces 35k sentences and uses Mycroft tools to handle the numbers in each sentence.
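
The number handling can be illustrated with a small Python sketch. It is not the generator's implementation: replace_numbers and spell_digits are hypothetical names, and spell_digits is a deliberately naive stand-in that reads a number digit by digit. A real converter, such as the number pronunciation offered by lingua-franca, could be passed in as to_words instead.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(number_text):
    """Stand-in converter (assumption): spell a number one digit at a time."""
    return " ".join(DIGIT_WORDS[int(digit)] for digit in number_text)


def replace_numbers(sentence, to_words=spell_digits):
    """Replace every run of ASCII digits in `sentence` with words."""
    return re.sub(r"[0-9]+", lambda match: to_words(match.group()), sentence)


spoken = replace_numbers("Room 42 opens at 9.")
print(spoken)
```

Keeping the converter as a parameter means the sentence-scanning step does not depend on any one number-to-words library.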

Contributions

by gras64
