GSOC 2014 Proposal
A user can upload their Anki deck and Tatoeba can use that data to generate a list of cards to add based on i+1 and other requirements. Export to Anki
Name: Jake Probst
IRC: tmjake (freenode)
Timezone: PDT (UTC-7)
Languages: English, learning Japanese
I intend to write a web application that will accept an Anki deck from a user and compare it against the sentence database to find sentences in which the user will encounter exactly one new word. The idea is based on the Input Hypothesis. It will use named entity recognition and stemming to find appropriate sentences. The user will be able to specify tags or certain words they want in the sentences. It will then compile these sentences into an .apkg (Anki's file format for importing decks) and send it to the user.
It will be separated into three parts: libiplusone, the iplusone web service, and the web-facing interface.
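To make the i+1 idea concrete, here is a minimal sketch of the selection step in Python. It is only an illustration under simplifying assumptions: the function names `tokenize` and `find_i_plus_one` are hypothetical, words are found with a plain letter-run regular expression, and there is no stemming or named entity recognition, so it would not work as-is for languages without spaces between words, such as Japanese.

```python
import re

# Assumption: a "word" is a run of Unicode letters. Real support would need
# per-language tokenizing, stemming, and named entity recognition.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(sentence):
    """Split a sentence into lowercase words (hypothetical helper)."""
    return [word.lower() for word in WORD_RE.findall(sentence)]


def find_i_plus_one(sentences, known_words):
    """Return (sentence, new_word) pairs with exactly one unknown word."""
    known = {word.lower() for word in known_words}
    matches = []
    for sentence in sentences:
        # A set, so a new word repeated in one sentence still counts once.
        unknown = {word for word in tokenize(sentence) if word not in known}
        if len(unknown) == 1:
            matches.append((sentence, next(iter(unknown))))
    return matches
```

Given the known words "the", "cat", "sat", "on", and "mat", the sentence "The cat sat on the sofa." would be selected with "sofa" as the new word, while a sentence with no new words or with several would be skipped.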
libiplusone will be written in C.
The iplusone web service will be in either C++ or Python, depending on which language ends up being used in the final website.
As web development is not my strong point, the link between the web service and the actual website is something I will need help with.
I'm not going to have support for every language in this by the time GSOC ends. I'll have to keep adding languages
as I learn how to deal with them. I figure I will be able to support all the Latin-script languages. I will need to read more
on language processing before I can say exactly how many I will be able to support.
libiplusone will be designed to make adding new languages easy with dynamically loaded plugins.
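The plugin idea can be pictured with a small Python sketch. This is not the planned C design: it replaces dynamically loaded libraries with a simple in-process registry, and the names `LANGUAGE_PLUGINS`, `register_language`, and `tokenize` are assumptions made up for illustration. The point is only that adding a language means registering one more function, without touching the core code.

```python
# Hypothetical registry mapping a language code to its tokenizer function.
LANGUAGE_PLUGINS = {}


def register_language(code):
    """Decorator that registers a tokenizer under a language code."""
    def decorator(func):
        LANGUAGE_PLUGINS[code] = func
        return func
    return decorator


@register_language("eng")
def tokenize_english(sentence):
    # Deliberately naive: lowercase and split on whitespace.
    return sentence.lower().split()


def tokenize(sentence, language):
    """Dispatch to the plugin registered for the given language."""
    try:
        plugin = LANGUAGE_PLUGINS[language]
    except KeyError:
        raise ValueError(f"no plugin registered for language {language!r}") from None
    return plugin(sentence)
```

In the C library the registry lookup would presumably be replaced by loading a shared object per language, but the calling code would look much the same.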
One problem that may come up is resource usage. In a test with a Python script I wrote, it takes about 20 seconds to parse an Anki deck
with 2500 cards and a list of 47367 sentences. I'm sure that if I use C and optimize it properly, I could get this number down,
but I imagine it will still be a few seconds of computing. Only about half a second of that is spent reading the Anki deck. So I could
have the client do the matching instead. The server sends sentences to the client, which checks whether each one matches. The client would
reply when a sentence matches, and after X matches the server would compile an .apkg and send it to the client.
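For reference, reading the deck is the cheap part. Below is a minimal Python sketch of it, assuming the legacy .apkg layout: a zip archive containing a SQLite file named `collection.anki2`, with a `notes` table whose `flds` column stores a note's fields separated by the 0x1f character. Newer Anki exports may use a different layout, and the function name `read_note_fields` is hypothetical.

```python
import sqlite3
import tempfile
import zipfile

# Assumption: fields within a note are joined by the ASCII unit separator.
FIELD_SEPARATOR = "\x1f"


def read_note_fields(apkg_path):
    """Return one list of field strings per note in a legacy .apkg file."""
    with tempfile.TemporaryDirectory() as tmp:
        # sqlite3 needs a real file, so unpack the collection first.
        with zipfile.ZipFile(apkg_path) as archive:
            db_path = archive.extract("collection.anki2", path=tmp)
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("SELECT flds FROM notes ORDER BY id").fetchall()
        finally:
            conn.close()
    return [row[0].split(FIELD_SEPARATOR) for row in rows]
```

The words a user already knows could then be collected by tokenizing whichever field holds the sentence or vocabulary item; which field that is depends on the deck.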
On the web front, it will show the user X sentences; they can choose which ones they want to add, and it will create an .apkg for them.
Before I start this, I will need to familiarize myself with Django and CppCMS. I will also need to read a lot about natural language processing.
Week 1: figure out structure of libiplusone, code .anki/.apkg format import/export
Week 2: start coding algorithm to handle i+1
Week 3: implement named entity recognition, stemming, and phrase detection
Week 4: continue implementing the above
Week 5: work on optimization
Week 6: figure out structure of iplusone
Week 7: code basic structure
Week 8: code in various options (clozes, specific tags/words, etc)
Week 9: finish up iplusone code
Week 10: connect iplusone and Tatoeba
Week 11: Buffer for unplanned problems
Week 12: Buffer for unplanned problems
If I finish early, I will add as many languages to libiplusone as I can.
If I finish late, PANIC.
I will be taking a class or two over the summer, but they shouldn't get in the way of development. Also, I will be gone the last week of May.
I have not used Tatoeba much, only to double-check certain bits of grammar. I decided to work with Tatoeba because when I was first starting to learn Japanese, this is the exact sort of thing I wanted to exist. My skills and interests are mainly in flashcards (I even wrote an SRS application for the Nintendo DS).
This is the only GSOC project I am applying for.