Urdu Word Segmentation using Conditional Random Fields (CRFs)

harisbinzia/Urdu-Word-Segmentation

Urdu Word Segmentation

This repository contains the code and dataset for Urdu word segmentation, as described in the paper Urdu Word Segmentation using Conditional Random Fields (CRFs).

Requirement(s)

The code is implemented in Python and requires scikit-learn and python-crfsuite.
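
For readers new to CRF-based segmentation, the sketch below shows one common way to frame the task as character-level sequence labeling. It is an illustration only: the B/I label scheme, the feature templates, and the helper names (`words_to_char_labels`, `char_features`, `labels_to_words`) are assumptions made for this example and are not taken from this repository or the paper.

```python
# Illustrative sketch only. The helper names, the B/I label scheme, and the
# feature templates are assumptions, not this repository's implementation.

def words_to_char_labels(words):
    """Flatten segmented words into characters and per-character labels.

    'B' marks the first character of a word, 'I' every other character.
    """
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def char_features(chars, i, window=2):
    """Build feature strings for position i from the character and its neighbours."""
    feats = ["bias", "char=" + chars[i]]
    if i == 0:
        feats.append("BOS")  # beginning of sequence
    if i == len(chars) - 1:
        feats.append("EOS")  # end of sequence
    for offset in range(1, window + 1):
        if i - offset >= 0:
            feats.append("char[-%d]=%s" % (offset, chars[i - offset]))
        if i + offset < len(chars):
            feats.append("char[+%d]=%s" % (offset, chars[i + offset]))
    return feats


def labels_to_words(chars, labels):
    """Rebuild words from characters and B/I labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)  # start a new word
        else:
            words[-1] += ch   # extend the current word
    return words


# Two example words, used here only to exercise the helpers.
words = ["اردو", "زبان"]
chars, labels = words_to_char_labels(words)
xseq = [char_features(chars, i) for i in range(len(chars))]
```

Sequences shaped like `xseq` (one list of feature strings per character) and `labels` (one label string per character) are the kind of input that python-crfsuite's `Trainer.append(xseq, yseq)` accepts, and `labels_to_words` shows how predicted labels could be turned back into words.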

Dataset

A manually annotated corpus of approximately 111,000 tokens is available for download.

Reference(s)

If you use this tool in your work, please cite the paper below.

Urdu Word Segmentation using Conditional Random Fields (CRFs)

@InProceedings{C18-1217,
  author = 	"Bin Zia, Haris
		and Raza, Agha Ali
		and Athar, Awais",
  title = 	"Urdu Word Segmentation using Conditional Random Fields (CRFs)",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"2562--2569",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1217"
}

License(s)

Copyright (c) 2018 CSaLT, ITU

Code licensed under the MIT License: http://opensource.org/licenses/MIT

Data licensed under CC-BY 4.0: https://creativecommons.org/licenses/by/4.0/
