GitHub - mirfan899/SpaCy3Urdu: Build Urdu SpaCy Model

Project setup

Run this command to set up the assets (the dataset from UD):

spacy project assets

It reads the project.yml file and downloads the data from the UD GitHub repository.

Download vectors

Download the fastText vectors:

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ur.300.vec.gz

Prune these vectors to reduce the model size. I'm currently using 100000 vectors for training the model.

mkdir vectors
python -m spacy init vectors ur cc.ur.300.vec.gz ./vectors --truncate 100000 --name ur_model.vectors
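
To make the effect of --truncate concrete, here is a minimal sketch of keeping only the first N rows of a fastText .vec file. It is not spaCy's implementation. It assumes the usual fastText text layout (a "count dimension" header line, then one word per line followed by its values) and that the words are listed most frequent first, so keeping the first rows keeps the most common words. The function name truncate_vec and the tiny sample are made up for illustration.

```python
import io


def truncate_vec(src, dst, keep):
    """Copy the header and the first `keep` vector rows from src to dst.

    src and dst are text file objects; the header line is rewritten so
    that the row count matches what was actually kept.
    """
    header = src.readline().split()
    total, dim = int(header[0]), int(header[1])
    kept = min(keep, total)
    dst.write(f"{kept} {dim}\n")
    for _ in range(kept):
        dst.write(src.readline())
    return kept, dim


# Tiny in-memory stand-in for cc.ur.300.vec (3 words, 2 dimensions).
sample = io.StringIO("3 2\nکا 0.1 0.2\nکے 0.3 0.4\nمیں 0.5 0.6\n")
out = io.StringIO()
print(truncate_vec(sample, out, keep=2))  # (2, 2)
```

For the real file you would open cc.ur.300.vec.gz with gzip.open(path, "rt", encoding="utf-8") instead of the in-memory sample.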

Preprocessing

Replace the Other tag with O to train a better model:

sed -i 's/Other/O/g' ner/100000.txt
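
Note that the sed command rewrites every occurrence of the string Other, including any that appear inside a token. If that matters for your data, a stricter variant could touch only the tag column. The sketch below assumes each line of the file is token<TAB>tag, which is a guess about ner/100000.txt. The function name relabel_other and the sample rows (including the Location tag) are hypothetical.

```python
def relabel_other(lines, old="Other", new="O"):
    """Replace `old` with `new` only when it is the entire last column."""
    result = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        # Lines without a tab (for example blank separators) are left alone.
        if len(columns) > 1 and columns[-1] == old:
            columns[-1] = new
        result.append("\t".join(columns))
    return result


# Hypothetical rows: the token "Other" in the third row is left untouched.
sample = ["Lahore\tLocation", "is\tOther", "Other\tOther", ""]
for row in relabel_other(sample):
    print(repr(row))
```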

Convert the TSV to JSON:

python tsv_to_json.py
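
The exact input and output of tsv_to_json.py are not described here, so the following is only a rough sketch of what such a conversion usually involves. It assumes token<TAB>tag lines with a blank line between sentences, and it invents a simple JSON layout with "tokens" and "tags" keys. The real script may use a different layout.

```python
import json


def tsv_to_sentences(lines):
    """Group token<TAB>tag lines into sentences split on blank lines."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append({"tokens": tokens, "tags": tags})
                tokens, tags = [], []
            continue
        token, tag = line.split("\t", 1)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append({"tokens": tokens, "tags": tags})
    return sentences


# Hypothetical sample with two sentences.
sample = ["لاہور\tLocation", "ہے\tO", "", "اقبال\tPerson"]
# ensure_ascii=False keeps the Urdu text readable in the JSON output.
print(json.dumps(tsv_to_sentences(sample), ensure_ascii=False, indent=2))
```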

Now convert the JSON to spaCy's pickle format:

python json_to_spacy.py -i ner/urdu_ner_dataset.json -o ner/urdu_ner_dataset.txt
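
spaCy NER training data is commonly expressed as a text plus character-offset entity spans, (text, {"entities": [(start, end, label)]}). What json_to_spacy.py actually writes is not shown here, so treat the following as an illustration of that idea only. It assumes tokens are joined with single spaces and that neighbouring tokens with the same tag belong to one entity. The function name tags_to_spans and the sample sentence are hypothetical.

```python
def tags_to_spans(tokens, tags, outside="O"):
    """Turn per-token tags into (text, {"entities": [(start, end, label)]})."""
    text = " ".join(tokens)
    entities = []
    offset = 0
    previous = outside
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag != outside:
            if tag == previous:
                # Same tag as the previous token: extend the current entity.
                entities[-1] = (entities[-1][0], end, tag)
            else:
                entities.append((start, end, tag))
        previous = tag
        offset = end + 1  # skip the joining space
    return text, {"entities": entities}


tokens = ["Allama", "Iqbal", "lived", "in", "Lahore"]
tags = ["Person", "Person", "O", "O", "Location"]
print(tags_to_spans(tokens, tags))
# ('Allama Iqbal lived in Lahore', {'entities': [(0, 12, 'Person'), (22, 28, 'Location')]})
```

One limitation of this simplification: two different entities of the same type that sit directly next to each other would be merged into one span, because plain tags without B-/I- prefixes cannot tell them apart.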

Now convert it to spaCy's binary .spacy format:

python json2spacy3.3.py

Train the model

Now run this command to train the tagger and parser for the Urdu language:

spacy project run all

It trains the tagger and parser model on the CPU. You can specify a GPU in the project.yml file.

Install the model

After training, you can install and use the model.

pip install ur_model-0.0.0.tar.gz

The script test.py shows how to use the model.

Spacy 3.3

Create config.cfg from base_config.cfg:

python -m spacy init fill-config base_config.cfg config.cfg

Train two models: one for the tagger, parser, etc., and a second for NER. To train the tagger and parser, run:

spacy project assets
spacy project run all

Train the NER model:

spacy train configs/config.cfg --output ./ner3 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy

Install both models:

pip install location/ur_model.xxxx.tar.gz
pip install location/ur_ner.xxxx.tar.gz

Now merge the two trained models:

python merge_tp2ner.py

Now uninstall these models:

pip uninstall ur_model
pip uninstall ur_ner

Now package the merged model:

# create the packages directory first if you get an error
spacy package ur_ner packages --name "ner" --version "0.0.0" --force

Now install it:

pip install location/ur_ner.xxx.tar.gz
