English-Persian Tokenizer

Overview

The English-Persian Tokenizer is a simple Python program that classifies input strings as English words or Persian words. It uses a Deterministic Finite Automaton (DFA) to perform the classification, making it a handy tool for distinguishing English and Persian words within mixed text.

Features

  • Tokenizes input text into English and Persian words.
  • Uses a DFA for efficient classification.
  • Easily customizable for additional languages or character sets.
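The DFA-based classification can be sketched as follows. This is a minimal illustration of the idea, not the repository's actual `tokenizer.py`; the function names and the Unicode range used for Persian here are assumptions.

```python
# Illustrative sketch of DFA-style word classification.
# Assumption: Persian letters fall in the basic Arabic Unicode block.
PERSIAN_RANGE = range(0x0600, 0x0700)

def classify_word(word):
    """Classify a word as 'english', 'persian', or 'unknown' using a
    simple automaton: the first character fixes the language state,
    and any character from a different script moves to a reject state."""
    state = None  # start state: no language decided yet
    for ch in word:
        if ch.isascii() and ch.isalpha():
            lang = "english"
        elif ord(ch) in PERSIAN_RANGE:
            lang = "persian"
        else:
            return "unknown"      # character outside both alphabets: reject
        if state is None:
            state = lang          # first character fixes the language
        elif state != lang:
            return "unknown"      # mixed-script word: reject
    return state or "unknown"

def tokenize(text):
    """Split text on whitespace and tag each token with its language."""
    return [(w, classify_word(w)) for w in text.split()]
```

For example, `tokenize("hello سلام")` tags the first token as English and the second as Persian; a token mixing both scripts, or containing punctuation, lands in the reject state.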

Usage

  1. Clone or download this repository to your local machine.

  2. Ensure you have Python installed (Python 3 is recommended).

  3. Open a terminal and navigate to the repository's directory.

  4. Run the tokenizer by executing the tokenizer.py script, passing the text you want to classify as an argument.

    python tokenizer.py "Your input text here."
    

Thank you for using the English-Persian Tokenizer!

About

This project is a simple tokenizer for text processing that can tokenize both Persian and English words.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages