telugu support #1

soumith · 2019-03-28T17:55:14Z

Hey, great repository.
I'd like to add Telugu support. If you have a framework I should follow to download Telugu Wikipedia and train on it, I'd love some instructions to get going.

goru001 · 2019-03-29T03:52:48Z

Thanks for the initiative!
I had a look at the Telugu Wikipedia homepage, and it looks like it does not index all of its pages alphabetically on the homepage the way some other languages do. I faced a similar issue with Marathi, so the notebooks I used to scrape Marathi Wikipedia will be quite useful. So,

1. Use this notebook to get all the Telugu Wikipedia articles' links. The notebook starts collecting article links from this page, then goes to the next page, collects from there, and moves on to the next. It keeps going as long as it can add new article links, and stops once a page yields none. You should be able to get all the Telugu article links just by changing the starting page to this.
2. Then use this notebook to scrape the articles corresponding to the URLs you saved in step 1. I don't think you will need to make any changes to this notebook, because article pages have the same structure irrespective of the language.
3. Once you have the Wikipedia articles dataset, you can use this notebook as a reference to train the LM. Training the LM requires tokenization; I've been using sentencepiece for that, and you can use the notebook here as a reference.

And that should be it. It might be worth scraping some Telugu news website for building a classification model as well on top of the LM. Let me know if I can help you with anything along the way!

soumith · 2019-03-29T03:56:05Z

Thanks a ton for the detailed pointers. @binga said he'd clean up what he already has over here: https://github.com/binga/fastai_notes/tree/master/experiments/notebooks/lang_models and send a PR to inltk ( reference ). I'll follow his lead and take up any tasks that he needs help on.

goru001 · 2019-03-29T03:58:56Z

Okay - That'd be great!

AnushaMotamarri · 2019-04-03T16:00:28Z

I have built a Telugu dataset containing 1,58,000 articles scraped from a newspaper website: https://github.com/AnushaMotamarri/Telugu-Newspaper-Article-Dataset . This dataset should be useful for classification. It is divided into 3 years, and the data under each year is further divided into several categories. Each file has the date and time, title, and content.
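For anyone who wants to use it for classification, here is a minimal sketch of turning a year/category folder tree into (text, label) pairs. It assumes the archive unpacks to `<year>/<category>/<file>` with UTF-8 text files, which may not match the actual layout; `load_labeled_articles` and the tiny demo tree (year, category, and file names) are hypothetical and only there to make the sketch runnable.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def load_labeled_articles(root: Path) -> List[Tuple[str, str]]:
    """Return (text, category) pairs from an assumed <year>/<category>/<file> tree."""
    pairs: List[Tuple[str, str]] = []
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for category_dir in sorted(p for p in year_dir.iterdir() if p.is_dir()):
            for article in sorted(p for p in category_dir.iterdir() if p.is_file()):
                # The category folder name is used as the classification label.
                text = article.read_text(encoding="utf-8")
                pairs.append((text, category_dir.name))
    return pairs


# Build a tiny throwaway tree to exercise the loader (made-up names and text).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2017" / "sports").mkdir(parents=True)
    (root / "2017" / "sports" / "a.txt").write_text("క్రికెట్ వార్త", encoding="utf-8")
    (root / "2018" / "business").mkdir(parents=True)
    (root / "2018" / "business" / "b.txt").write_text("మార్కెట్ వార్త", encoding="utf-8")
    demo_pairs = load_labeled_articles(root)

demo_labels = [label for _, label in demo_pairs]
print(demo_labels)
```

Each file is kept whole here; splitting out the date, title, and content would depend on how those fields are laid out inside the files.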

I have also built another dataset of around 26,000 files scraped from 300 novels: https://github.com/AnushaMotamarri/Telugu-Books-Dataset .

The datasets can be downloaded directly from https://drive.google.com/file/d/1IbqM335M7imzG-2ZV0d8-JbRqCnyAii3/view and https://drive.google.com/file/d/1MDiP-_S2RtAN7c9TLnKi8I2pxIgONIP0/view respectively.

Here is the tokenizer I built for Telugu: https://github.com/AnushaMotamarri/TeluguTokenizer

I am currently working on creating a lemmatizer for the Telugu language.
I would like to contribute.

goru001 · 2019-04-04T08:23:26Z

@AnushaMotamarri Thanks for reaching out! Would you like to contribute by building the language model? @binga will be contributing the LM to iNLTK, so to avoid duplicating effort, it'd be great if you could contribute Telugu NER or translation instead.

AnushaMotamarri · 2019-04-04T14:26:10Z

Yes.
Has any previous work been done on NER or translation for any other language in iNLTK? It would be great to have some standard references to get started with.

goru001 · 2019-04-05T06:08:54Z

No, I've only just started on it, so there's nothing in iNLTK yet.

AnushaMotamarri · 2019-04-05T18:38:32Z

OK, I will work on them.

Asrst · 2019-04-06T06:36:56Z

Hey, I would love to contribute. Can I please collaborate with you guys?

goru001 · 2019-04-18T07:48:36Z

@Asrst It will be worth tagging and asking @AnushaMotamarri or @binga if you can help them out with something, or elaborating on how you would like to contribute!

sainathadapa · 2019-04-18T07:50:49Z

I would like to contribute as well. @goru001 maybe a Gitter channel would help for easier/faster conversation here 🤔

goru001 · 2019-04-18T08:00:29Z

@sainathadapa Yes right! Here it is!

praveenc1 · 2019-04-22T23:50:30Z

Hi all, I just wanted to introduce myself and see if I can help with adding Telugu support. Please let me know if you have any initial thoughts on where I can contribute.

PS: I posted on the Gitter channel but wasn't sure if it was being monitored, so I'm posting here too.

goru001 · 2019-04-23T18:31:46Z

Hi @praveenc1, it will be worth tagging and asking @AnushaMotamarri or @binga if you can help them out with something. Otherwise, you can start with anything: NER, coreference resolution, etc. Almost everything is unexplored territory.

Threepointone4 · 2019-11-11T05:19:17Z

@goru001 After training on a new language, how do I integrate that model into iNLTK to get sentence vectors?

Adarshreddyash · 2019-12-16T07:07:51Z

I haven't found a great source of Telugu language data. We could build a collection by scraping data from Telugu webpages.

sakurusurya2000 · 2020-08-10T08:44:18Z

Hi, I can help you with a Telugu language source.

ShivaSankeerth · 2020-08-11T11:14:50Z

Hi all,
I would like to help you guys build this project. Can you please let me know where to get started and whom to reach out to?
Thanks.

hariperavali · 2020-10-10T17:02:21Z

Has any previous work been done on Tenglish (Telugu typed in English)? That is the kind of Telugu we use every day on WhatsApp and similar apps.

goru001 · 2020-10-12T05:36:30Z

With the latest release of iNLTK, i.e. v0.9, Telugu support has been added, thanks to @Shubhamjain27. Hence, closing this issue.

goru001 · 2020-10-12T05:38:23Z

@hariperavali Tenglish (Telugu+English) support is not there yet; code-mixed support has been added for Hinglish, Tanglish and Manglish in v0.9. Feel free to work on it and raise a PR if you want to.

KARUN014 · 2021-01-12T01:30:34Z

I need a Telugu SentiWordNet. I have tried many sites but could not find one. Please help me.

goru001 added the enhancement New feature or request label Mar 29, 2019

goru001 mentioned this issue Apr 2, 2019

Assamese Support #7

Open

goru001 closed this as completed Oct 12, 2020

This was referenced Nov 28, 2022

How can i contribute? #88

Open

Adding support for Bhojpuri, Magahi and Maithili #89

Open
