GSOC 2014 Proposal

Trang edited this page Jun 1, 2014 · 1 revision

Export to Anki

A user can upload their Anki deck and Tatoeba can use that data to generate a list of cards to add based on i+1 and other requirements.

Contact Info:
Name: Jake Probst
Email: jake.probst@gmail.com
Tatoeba: Jake
Skype: jakeprobst
IRC: tmjake (freenode)
Timezone: PDT (UTC-7)
Languages: English, learning Japanese

Project Details:
I intend to write a web application that accepts an Anki deck from a user and compares it against the sentence database to find sentences in which exactly one word is new to the user. The idea is based on the Input Hypothesis (the source of the term i+1). It will use named entity recognition and stemming to find appropriate sentences. The user will be able to specify tags or certain words they want in the sentences. It will then compile these sentences into an .apkg (Anki's file format for importing decks) and send it to the user.
It will be separated into three parts: libiplusone, the iplusone web service, and the web-facing interface.
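
The core i+1 check described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `tokenize`, `unknown_words`, and `i_plus_one` are hypothetical names, the tokenizer is a plain word split that only suits space-delimited languages, and stemming and named entity recognition are left out entirely.

```python
import re

# Hypothetical tokenizer: runs of letters, optionally with an inner apostrophe.
# No stemming or named entity recognition is attempted here.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def tokenize(text):
    """Split text into lowercase words."""
    return [word.lower() for word in WORD_RE.findall(text)]


def unknown_words(sentence, known):
    """Return the distinct words of `sentence` that are not in `known`."""
    return {word for word in tokenize(sentence) if word not in known}


def i_plus_one(sentences, known):
    """Yield (sentence, new_word) for sentences with exactly one unknown word."""
    for sentence in sentences:
        unknown = unknown_words(sentence, known)
        if len(unknown) == 1:
            yield sentence, next(iter(unknown))


# Made-up data standing in for the text of a user's deck and the sentence list.
deck_text = ["The cat sleeps.", "I like the dog."]
known = {word for card in deck_text for word in tokenize(card)}

corpus = [
    "The dog sleeps.",          # nothing new
    "The cat likes the bird.",  # two new words ("likes" is not stemmed to "like")
    "I like the bird.",         # exactly one new word
    "A bird sings loudly.",     # too many new words
]
results = list(i_plus_one(corpus, known))
```

The second sentence shows why stemming matters: without it, "likes" counts as a new word even though the deck already contains "like".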

libiplusone will be written in C.
The iplusone web service will be in either C++ or Python, depending on which language ends up being used in the final website.
As web development is not my strong point, the link between the web service and the actual website is something I will need help with.
Realistically, I'm not going to have support for every language by the time GSoC ends; I'll have to keep adding languages as I learn how to deal with them. I expect to be able to support all languages written in the Latin alphabet, but I will need to read more on language processing before I can say exactly how many I will be able to do.
libiplusone will be designed to make adding new languages easy with dynamically loaded plugins.
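
To give a rough idea of the plugin interface, here is a small Python sketch of a per-language registry. It is not how libiplusone itself would do it: the library is planned in C, where plugins would presumably be shared libraries loaded at run time. The names `PLUGINS`, `register`, and `tokenize`, and the use of three-letter language codes as keys, are assumptions made for the example.

```python
# Hypothetical registry mapping a language code to its tokenizer function.
PLUGINS = {}


def register(lang_code):
    """Decorator that registers a tokenizer for one language."""
    def decorator(func):
        PLUGINS[lang_code] = func
        return func
    return decorator


@register("eng")
def tokenize_english(text):
    # Placeholder logic: a real plugin would also handle punctuation,
    # stemming, and named entities.
    return text.lower().split()


def tokenize(lang_code, text):
    """Dispatch to the plugin for `lang_code`, failing clearly if none exists."""
    try:
        plugin = PLUGINS[lang_code]
    except KeyError:
        raise ValueError(f"no plugin registered for language {lang_code!r}")
    return plugin(text)
```

Adding a language then means writing one new function and registering it, without touching the code that calls `tokenize`.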

One problem that may come up is resource usage. In a test with a Python script I wrote, it takes about 20 seconds to process an Anki deck with 2,500 cards against a list of 47,367 sentences. Only about half a second of that is spent reading the Anki deck. I'm sure that if I use C and optimize it properly I could get this number down, but I imagine it will still take a few seconds of computing. One option is to cap the number of results (say, 20) and stop searching once the cap is reached. Another option might be to do the checking in JavaScript: the server sends sentences to the client, the client checks each one and reports back the ones that match, and after X matches the server compiles an .apkg and sends it to the client.
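
The result cap is easy to sketch with a lazy generator, so that the search stops as soon as enough matches are found instead of scanning the whole sentence list. The function names and the whitespace-only matching below are assumptions for illustration, not the actual script.

```python
from itertools import islice


def matching_sentences(sentences, known):
    """Lazily yield sentences containing exactly one word not in `known`."""
    for sentence in sentences:
        words = set(sentence.lower().split())
        if len(words - known) == 1:
            yield sentence


def first_matches(sentences, known, limit=20):
    """Return at most `limit` matches, stopping the scan once the cap is hit."""
    return list(islice(matching_sentences(sentences, known), limit))


# Made-up corpus that records how many sentences were actually read.
consumed = []


def corpus():
    for i in range(1000):
        consumed.append(i)
        yield f"known new{i}"


matches = first_matches(corpus(), {"known"}, limit=3)
```

Because both the corpus and the matcher are generators, only as many sentences are read as are needed to fill the cap. The worst case is still a full scan when the deck has few matches.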

On the web front, it will show the user X sentences; they can choose which ones they want to add, and it will create an .apkg for them.
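
Both ends of this flow touch the .apkg container: the upload is read from one and the selected sentences are written to one. Below is a minimal sketch of the reading half. It assumes the Anki 2.0 layout, in which an .apkg is a zip archive holding an SQLite file named `collection.anki2` whose `notes` table stores each note's fields in a `flds` column, joined by the `\x1f` character. Other Anki versions may lay the package out differently, and `read_note_fields` is a hypothetical helper name.

```python
import sqlite3
import tempfile
import zipfile

# Assumption: note fields are joined with the ASCII unit separator.
FIELD_SEPARATOR = "\x1f"


def read_note_fields(apkg_path):
    """Return one list of field strings per note in the package."""
    with tempfile.TemporaryDirectory() as tmp:
        # sqlite3 needs a real file, so extract the database first.
        with zipfile.ZipFile(apkg_path) as archive:
            db_path = archive.extract("collection.anki2", tmp)
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("SELECT flds FROM notes").fetchall()
        finally:
            conn.close()
    return [row[0].split(FIELD_SEPARATOR) for row in rows]
```

The known-word set would then be built from whichever field holds the text the user has studied, which depends on the note type.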

Schedule:
Before I start, I will need to familiarize myself with Django and CppCMS. I will also need to read a lot about natural language processing.
Week 1: figure out structure of libiplusone, code .anki/.apkg format import/export
Week 2: start coding algorithm to handle i+1
Week 3: implement named entity recognition, stemming, and phrase detection
Week 4: continue implementing the above
Week 5: work on optimization
Week 6: figure out structure of iplusone
Week 7: code basic structure
Week 8: code in various options (clozes, specific tags/words, etc.)
Week 9: finish up iplusone code
Week 10: connect iplusone and Tatoeba
Week 11: buffer for unplanned problems
Week 12: buffer for unplanned problems

If I finish early, I will add as many languages to libiplusone as I can.
If I finish late, PANIC.
I will be taking a class or two over the summer, but they shouldn't get in the way of development. Also, I will be away during the last week of May.

Personal Details:
I am comfortable with Python and C/C++, and okay with JavaScript. I have spent a decent amount of time messing with Anki internals and have made a couple of add-ons. I have a personal script that does an i+1 search on a text file. I have committed code to a few open source projects (Anki, uzbl, darktable). I have a GitHub account and a personal site.
I have not used Tatoeba much, only to double-check certain bits of grammar. I decided to work with Tatoeba because when I was first starting to learn Japanese, this is the exact sort of thing I wanted to exist. My skills and interests are mainly in flashcards (I even wrote an SRS application for the Nintendo DS).
This is the only GSoC project I am applying for.