From d8a27df0ad86af8ce2a00d942859db6c6085f6fb Mon Sep 17 00:00:00 2001
From: Gaurav
Date: Sat, 14 Dec 2019 22:14:31 +0530
Subject: [PATCH 1/2] refactor readme

---
 docs/api_docs.md | 6 +++---
 docs/conf.py     | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/api_docs.md b/docs/api_docs.md
index e482f1d..78be4cf 100644
--- a/docs/api_docs.md
+++ b/docs/api_docs.md
@@ -236,7 +236,7 @@ Example:

 ### Contributing

-#### Add a new language support for iNLTK
+##### Add a new language support

 If you would like to add support for language of your own choice to iNLTK,
  please start with checking/raising a issue [here](https://github.com/goru001/inltk/issues)
@@ -244,14 +244,14 @@ If you would like to add support for language of your own choice to iNLTK,
 Please checkout the steps I'd [mentioned here for Telugu](https://github.com/goru001/inltk/issues/1)
 to begin with. They should be almost similar for other languages as well.

-#### Improving models/Using models for your own research
+##### Improving models/using models for your own research

 If you would like to take iNLTK's models and refine them with your own
 dataset or build your own custom models on top of it, please check out the
 repositories in the above table for the language of your choice. The repositories above
 contain links to datasets, pretrained models, classifiers and all of the code for that.

-#### Add new functionality
+##### Add new functionality

 If you wish for a particular functionality in iNLTK - Start by checking/raising a issue
 [here](https://github.com/goru001/inltk/issues)
diff --git a/docs/conf.py b/docs/conf.py
index d350431..cd3497e 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -23,9 +23,9 @@ author = 'Gaurav'


 # The short X.Y version
-version = '0.7'
+version = 'latest'
 # The full version, including alpha/beta/rc tags
-release = '0.7.2'
+release = 'latest'


 # -- General configuration ---------------------------------------------------

From 295d42c94d5887a8cade185ea16cc74a9303ac0c Mon Sep 17 00:00:00 2001
From: Gaurav
Date: Sat, 14 Dec 2019 22:16:22 +0530
Subject: [PATCH 2/2] refactor readme

---
 README.md | 212 +++++-------------------------------------------------
 1 file changed, 16 insertions(+), 196 deletions(-)

diff --git a/README.md b/README.md
index 64cc3a2..4ee36f1 100644
--- a/README.md
+++ b/README.md
@@ -7,22 +7,10 @@ that an application developer might need for Indic languages.

 ![Alt Text](inltk/static/inltk.gif)

-### Installation on Linux
+### Documentation

-```bash
-pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
-pip install inltk
-```
+Checkout detailed docs at https://inltk.readthedocs.io

-Note: Just make sure to pick the correct torch wheel url, according to the needed
-platform and python version, which you will find [here](https://pytorch.org/get-started/locally/#pip-1).
-
-iNLTK runs on CPU, as is the desired behaviour for most
-of the Deep Learning models in production.
-
-The first command above will install pytorch for cpu, which, as the name suggests, does not have cuda support.
-
-`Note: inltk is currently supported only on Linux and Windows 10 with Python >= 3.6`

 ### Supported languages

@@ -41,180 +29,6 @@ The first command above will install pytorch for cpu, which, as the name suggest
 | Tamil | ta |
 | Urdu | ur |

-### Usage
-
-**Setup the language**
-
-```bash
-from inltk.inltk import setup
-
-setup('') // if you wanted to use hindi, then setup('hi')
-```
-
-`Note: You need to run setup('') when you use a language
-for the FIRST TIME ONLY. This will download all the necessary models required
-to do inference for that language.`
-
-**Tokenize**
-
-```bash
-from inltk.inltk import tokenize
-
-tokenize(text ,'') // where text is string in
-```
-
-**Get Embedding Vectors**
-
-This returns an array of "Embedding vectors", containing 400 Dimensional representation for
-every token in the text.
-
-
-```
-from inltk.inltk import get_embedding_vectors
-
-vectors = get_embedding_vectors(text, '') // where text is string in
-
-Example:
-
->> vectors = get_embedding_vectors('भारत', 'hi')
->> vectors[0].shape
-(400,)
-
->> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
-[array([-0.894777, -0.140635, -0.030086, -0.669998, ..., 0.859898, 1.940608, 0.09252 , 1.043363], dtype=float32), array([ 0.290839, 1.459981, -0.582347, 0.27822 , ..., -0.736542, -0.259388, 0.086048, 0.736173], dtype=float32), array([ 0.069481, -0.069362, 0.17558 , -0.349333, ..., 0.390819, 0.117293, -0.194081, 2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131, 0.161678, ..., 0.048844, -1.090546, 0.154555, 0.925028], dtype=float32), array([ 0.219287, 0.759776, 0.695487, 1.097593, ..., 0.016115, -0.81602 , 0.333799, 1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479, 0.177357, ..., 0.729619, -0.161499, -0.270225, 2.083801], dtype=float32), array([-0.501414, 1.337661, -0.405563, 0.733806, ..., -0.182045, -1.413752, 0.163339, 0.907111], dtype=float32), array([ 0.185258, -0.429729, 0.060273, 0.232177, ..., -0.537831, -0.51664 , -0.249798, 1.872428], dtype=float32)]
->> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
->> len(vectors)
-8
-
-```
-
-Links to `Embedding visualization` on [Embedding projector](https://projector.tensorflow.org/) for all the supported languages are given in table below.
-
-**Predict Next 'n' words**
-
-```bash
-from inltk.inltk import predict_next_words
-
-predict_next_words(text , n, '')
-
-// text --> string in
-// n --> number of words you want to predict (integer)
-```
-
-`Note: You can also pass a fourth parameter, randomness, to predict_next_words.
-It has a default value of 0.8`
-
-**Identify language**
-
-Note: If you update the version of iNLTK, you need to run
-`reset_language_identifying_models` before identifying language.
-
-```bash
-from inltk.inltk import identify_language, reset_language_identifying_models
-
-reset_language_identifying_models() # only if you've updated iNLTK version
-identify_language(text)
-
-// text --> string in one of the supported languages
-
-Example:
-
->> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
-'sanskrit'
-
-```
-
-**Remove foreign languages**
-
-```bash
-from inltk.inltk import remove_foreign_languages
-
-remove_foreign_languages(text, '')
-
-// text --> string in one of the supported languages
-// --> code of that language whose words you want to retain
-
-Example:
-
->> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
-['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '', '▁', '', '▁', '', '▁', '', '▁', '', '▁', '', '▁', '', '▁', '', '▁', '', '▁पर', '▁प्रामाणिक', '▁और', '▁', '', ':', '', '▁उपयोग', ',', '▁परिवर्तन', '▁', '', ':', '']
-```
-
-Every word other than that of host language will become `` and `▁` signifies `space character`
-
-Checkout [this notebook](https://drive.google.com/file/d/0B3K0rqnCfC9pbVpSWk9Ndm5raGRCdjV6cGxVN1BGWFhTTlA0/view?usp=sharing)
- by [Amol Mahajan](https://www.linkedin.com/in/amolmahajan0804/) where he uses iNLTK to remove foreign characters from
- [iitb_en_hi_parallel corpus](http://www.cfilt.iitb.ac.in/iitb_parallel/iitb_corpus_download/)
-
-
- **Get Sentence Encoding**
-
-```
-from inltk.inltk import get_sentence_encoding
-
-get_sentence_encoding(text, '')
-
-Example:
-
->> encoding = get_sentence_encoding('मुझे अपने देश से', 'hi')
->> encoding.shape
-(400,)
-```
-
-`get_sentence_encoding` returns 400 dimensional encoding of the sentence from
-ULMFiT LM Encoder of `` trained in repositories linked below.
-
-
-**Get Sentence Similarity**
-
-```
-from inltk.inltk import get_sentence_similarity
-
-get_sentence_similarity(sentence1, sentence2, '', cmp = cos_sim)
-
-// sentence1, sentence2 are strings in ''
-// similarity of encodings is calculated by using cmp function whose default is cosine similarity
-
-Example:
-
->> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'मैंने कन्फेक्शनरी स्टोर्स पर सेब और संतरे की कीमतों की तुलना की', 'hi')
-0.126698300242424
-
->> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'यहां कोई तुलना नहीं है। आप सेब की तुलना संतरे से कर रहे हैं', 'hi')
-0.25467658042907715
-```
-
-`get_sentence_similarity` returns similarity between two sentences by calculating
-`cosine similarity` (default comparison function) between the encoding vectors of two
-sentences.
-
-
-**Get Similar Sentences**
-
-```
-from inltk.inltk import get_similar_sentences
-
-get_similar_sentences(sentence, no_of_variants, '')
-
-
-Example:
-
->> get_similar_sentences('मैं आज बहुत खुश हूं', 10, 'hi')
-['मैं आजकल बहुत खुश हूं',
- 'मैं आज काफ़ी खुश हूं',
- 'मैं आज काफी खुश हूं',
- 'मैं अब बहुत खुश हूं',
- 'मैं आज अत्यधिक खुश हूं',
- 'मैं अभी बहुत खुश हूं',
- 'मैं आज बहुत हाजिर हूं',
- 'मैं वर्तमान बहुत खुश हूं',
- 'मैं आज अत्यंत खुश हूं',
- 'मैं सदैव बहुत खुश हूं']
-
-```
-
-`get_similar_sentences` returns `list` of length `no_of_variants` which contains sentences which
- are similar to `sentence`
 #### Repositories containing models used in iNLTK

 | Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score | Embeddings visualization on [Embedding projector](https://projector.tensorflow.org/) |
@@ -234,7 +48,7 @@ Example:

 ### Contributing

-**Add a new language support for iNLTK**
+##### Add a new language support

 If you would like to add support for language of your own choice to iNLTK,
  please start with checking/raising a issue [here](https://github.com/goru001/inltk/issues)
@@ -242,19 +56,22 @@ If you would like to add support for language of your own choice to iNLTK,
 Please checkout the steps I'd [mentioned here for Telugu](https://github.com/goru001/inltk/issues/1)
 to begin with. They should be almost similar for other languages as well.

-**Improving models/Using models for your own research**
+##### Improving models/using models for your own research

 If you would like to take iNLTK's models and refine them with your own
 dataset or build your own custom models on top of it, please check out the
 repositories in the above table for the language of your choice. The repositories above
 contain links to datasets, pretrained models, classifiers and all of the code for that.

-**Add new functionality**
+##### Add new functionality

 If you wish for a particular functionality in iNLTK - Start by checking/raising a issue
 [here](https://github.com/goru001/inltk/issues)

-### What's next (and being worked upon)
+### What's next
+
+
+#### ..and being worked upon
 `Shout out if you want to help :)`

 * Add [Telugu](https://github.com/goru001/inltk/issues/1)
@@ -264,7 +81,7 @@ and [Maithili](https://github.com/goru001/inltk/issues/10) support
 * Add English to iNLTK


-### What's next - (and NOT being worked upon)
+#### ..and NOT being worked upon
 `Shout out if you want to lead :)`


@@ -272,14 +89,17 @@ and [Maithili](https://github.com/goru001/inltk/issues/10) support
 * [POS support](https://github.com/goru001/inltk/issues/13) in iNLTK
 * Add translations - to and from languages in iNLTK + English

-### Appreciation for iNLTK
+
+
+### iNLTK's Appreciation

 * [By Jeremy Howard on Twitter](https://twitter.com/jeremyphoward/status/1111318198891110402)
 * [By Vincent Boucher on LinkedIn](https://www.linkedin.com/feed/update/urn:li:activity:6517137647310241792/)
 * [By Kanimozhi](https://www.linkedin.com/feed/update/urn:li:activity:6517277916030701568),
 [By Soham](https://www.linkedin.com/feed/update/urn:li:activity:6513084638955696128),
 [By Imaad](https://www.linkedin.com/feed/update/urn:li:activity:6536258026687557632/) on LinkedIn
 * iNLTK was [trending on GitHub](https://github.motakasoft.com/trending/ranking/monthly/?d=2019-05-01&l=python&page=2) in May 2019
-* iNLTK has had [19,000+ Downloads](
+* iNLTK has had [20,000+ Downloads](
 https://console.cloud.google.com/bigquery?sq=375816891401:185fda81bdc64eb79b98c6b28c77a62a
-) till Nov 2019
+) on PyPi
+