telugu support #1

Closed
soumith opened this issue Mar 28, 2019 · 22 comments
Labels
enhancement New feature or request

Comments

@soumith

soumith commented Mar 28, 2019

Hey, great repository.
I'd like to add Telugu support. If you have a framework I should follow to download Telugu Wikipedia and train on it, I'd love some instructions to get going.

@goru001
Owner

goru001 commented Mar 29, 2019

Thanks for the initiative!
I had a look at the Telugu Wikipedia homepage, and it looks like it does not index all of its pages alphabetically there, the way some other languages do. I faced a similar issue with Marathi, so the notebooks I used to scrape Marathi Wikipedia should be quite useful. So:

  1. Use this notebook to get the links to all the Telugu Wikipedia articles. It starts collecting article links from this page, then moves to the next page and collects from there, and so on. It keeps going for as long as each page yields new article links, then stops. You should be able to get all the Telugu article links just by changing the starting page to this.
  2. Then use this notebook to scrape the articles for the URLs you saved in step 1. I don't think you will need to change anything in this notebook, because article pages have the same structure irrespective of the language.
  3. Once you have the Wikipedia articles dataset, you can use this notebook as a reference for training the LM. Training the LM requires tokenization, for which I've been using sentencepiece; you can use the notebook here as a reference for that.
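The link-collection loop in step 1 can be sketched roughly as follows. This is not the notebook's actual code, only a minimal standard-library illustration of "collect links, follow the next-page link, stop when a page adds nothing new". The names `collect_article_links`, `is_article`, and `next_label` are hypothetical, and the label text of the next-page link on Telugu Wikipedia is something you would need to check yourself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_html(url):
    """Download a page and decode it as UTF-8."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def collect_article_links(start_url, is_article, next_label, fetch=fetch_html):
    """Walk a paginated index, returning article URLs in discovery order.

    is_article: predicate on an absolute URL (assumption: caller knows
                what an article URL looks like).
    next_label: text the "next page" link starts with (assumption).
    """
    seen, ordered, visited = set(), [], set()
    url = start_url
    while url and url not in visited:
        visited.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        added, next_url = 0, None
        for href, text in parser.links:
            full = urljoin(url, href)
            if text.startswith(next_label):
                next_url = full
            elif is_article(full) and full not in seen:
                seen.add(full)
                ordered.append(full)
                added += 1
        if added == 0:  # nothing new on this page: stop
            break
        url = next_url
    return ordered
```

Passing `fetch` as a parameter keeps the loop easy to try against saved HTML before pointing it at the live site.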

And that should be it. It might also be worth scraping a Telugu news website to build a classification model on top of the LM. Let me know if I can help you with anything along the way!
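For the tokenization part of step 3, a rough sketch of training a sentencepiece model on the scraped text might look like this. It is not taken from the referenced notebook. The file names, the vocabulary size, and the helper names `build_spm_args` and `train_tokenizer` are assumptions for illustration; the flags themselves are standard sentencepiece trainer options.

```python
def build_spm_args(corpus_path, model_prefix, vocab_size=25000, model_type="unigram"):
    """Build the flag string passed to the sentencepiece trainer.

    vocab_size and model_type are illustrative defaults, not values
    taken from the thread.
    """
    return (
        f"--input={corpus_path} "
        f"--model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} "
        f"--model_type={model_type} "
        "--character_coverage=1.0"  # keep every Telugu character
    )


def train_tokenizer(corpus_path, model_prefix, **kwargs):
    """Train a model on a one-sentence-per-line text file and load it."""
    # Third-party dependency, imported lazily: pip install sentencepiece
    import sentencepiece as spm

    spm.SentencePieceTrainer.Train(build_spm_args(corpus_path, model_prefix, **kwargs))
    sp = spm.SentencePieceProcessor()
    sp.Load(f"{model_prefix}.model")
    return sp
```

With a trained model, `sp.EncodeAsPieces(text)` returns the subword pieces for a sentence.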

@soumith
Author

soumith commented Mar 29, 2019

Thanks a ton for the detailed pointers. @binga said he'd clean up what he already has over here: https://github.com/binga/fastai_notes/tree/master/experiments/notebooks/lang_models and send a PR to inltk ( reference ). I'll follow his lead and take up any tasks that he needs help on.

@goru001
Owner

goru001 commented Mar 29, 2019

Okay - That'd be great!

@goru001 goru001 added the enhancement New feature or request label Mar 29, 2019
@AnushaMotamarri

AnushaMotamarri commented Apr 3, 2019

I built a Telugu dataset containing 1,58,000 articles scraped from a newspaper website: https://github.com/AnushaMotamarri/Telugu-Newspaper-Article-Dataset This dataset should be useful for classification. It is divided into 3 years, and the data under each year is further divided into several categories. Each file has the date and time, title, and content.
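To use a dataset organized this way for classification, one option is to treat the category folder as the label. The sketch below assumes an on-disk layout of `<root>/<year>/<category>/<file>`, which is only inferred from the description above and not verified against the repository; `index_articles` and `count_by_category` are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path


def index_articles(root):
    """Return (year, category, path) for every file three levels below root.

    Assumes <root>/<year>/<category>/<file>; adjust the glob if the real
    layout differs.
    """
    rows = []
    for path in sorted(Path(root).glob("*/*/*")):
        if path.is_file():
            year, category = path.parts[-3], path.parts[-2]
            rows.append((year, category, path))
    return rows


def count_by_category(rows):
    """Count articles per category, useful for spotting class imbalance."""
    return Counter(category for _, category, _ in rows)
```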

I also built another dataset of around 26,000 files scraped from 300 novels: https://github.com/AnushaMotamarri/Telugu-Books-Dataset

The datasets can be downloaded directly from https://drive.google.com/file/d/1IbqM335M7imzG-2ZV0d8-JbRqCnyAii3/view and https://drive.google.com/file/d/1MDiP-_S2RtAN7c9TLnKi8I2pxIgONIP0/view respectively.

Here is the tokenizer I built for Telugu: https://github.com/AnushaMotamarri/TeluguTokenizer

I am currently working on a lemmatizer for Telugu, and I would like to contribute.

@goru001
Owner

goru001 commented Apr 4, 2019

@AnushaMotamarri Thanks for reaching out! Would you like to contribute by building the language model? @binga will be contributing the LM to iNLTK, so to avoid duplicating effort, it'd be great if you could contribute Telugu NER or translation instead.

@AnushaMotamarri

Yes.
Has any previous work on NER or translation been done for another language in iNLTK? It would be great to have some standard references to get started with.

@goru001
Owner

goru001 commented Apr 5, 2019

No, I've only just started on it, so there is nothing in iNLTK yet.

@AnushaMotamarri

OK, I will work on them.

@Asrst

Asrst commented Apr 6, 2019

Hey, I would love to contribute my part. Could I collaborate with you all?

@goru001
Owner

goru001 commented Apr 18, 2019

@Asrst It will be worth tagging and asking @AnushaMotamarri or @binga if you can help them out with something, or elaborating on how you would like to contribute!

@sainathadapa

I would like to contribute as well. @goru001, maybe a Gitter channel would make conversation here easier and faster 🤔

@goru001
Owner

goru001 commented Apr 18, 2019

@sainathadapa Yes right! Here it is!

@praveenc1

Hi All, just wanted to introduce myself and see if I can help with something to add Telugu support. Please let me know if you have any initial thoughts on where I can contribute

PS: posted on the Gitter channel and wasn't sure if it was being monitored. So posting here.

@goru001
Owner

goru001 commented Apr 23, 2019

Hi @praveenc1, it will be worth tagging and asking @AnushaMotamarri or @binga if you can help them out with something. Or else you can start with anything (NER, coreference resolution, etc.); almost everything is unexplored territory.

@Threepointone4

@goru001 After training on a new language, how do I integrate that model into iNLTK to get sentence vectors?

@Adarshreddyash

I haven't found a great data source for Telugu. We should build a collection by scraping data from Telugu web pages.

@sakurusurya2000

Hi, I can help you with Telugu language sources.

@ShivaSankeerth

Hi all,
I would like to help build this project. Could you please let me know where to get started and whom to reach out to?
Thanks.

@hariperavali

Has any previous work been done on Tenglish (Telugu typed in English)? It is the form of Telugu we converse in every day on WhatsApp and similar apps.

@goru001
Owner

goru001 commented Oct 12, 2020

With the latest release of iNLTK, v0.9, Telugu support has been added, thanks to @Shubhamjain27. Hence, closing this issue.
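For readers arriving here for the sentence-vector question raised earlier in the thread, a minimal usage sketch follows. It assumes Telugu is exposed under the language code `te` and that `setup` and `get_sentence_encoding` are importable from `inltk.inltk` as in the iNLTK README; please check the README for your installed version. The wrapper names `setup_telugu` and `telugu_sentence_vector` are hypothetical.

```python
def setup_telugu():
    """Download the Telugu model files (run once per environment)."""
    # Third-party dependency, imported lazily: pip install inltk
    from inltk.inltk import setup

    setup("te")  # "te" is assumed to be the Telugu language code


def telugu_sentence_vector(text):
    """Return the sentence encoding iNLTK produces for a Telugu string."""
    from inltk.inltk import get_sentence_encoding

    return get_sentence_encoding(text, "te")
```

This covers using the released Telugu model only; it does not show how to plug a newly trained model into iNLTK.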

@goru001 goru001 closed this as completed Oct 12, 2020
@goru001
Owner

goru001 commented Oct 12, 2020

@hariperavali Tenglish (Telugu+English) support is not there yet. Code-mixed support has been added for Hinglish, Tanglish and Manglish in v0.9. Feel free to work on it and raise a PR if you want to.

@KARUN014

I need a Telugu SentiWordNet. I have tried many sites but could not find one. Please help me.
