Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More canonical DAG implementation #18

Closed
kmadathil opened this issue Jul 10, 2017 · 20 comments
Closed

More canonical DAG implementation #18

kmadathil opened this issue Jul 10, 2017 · 20 comments
Assignees

Comments

@kmadathil
Copy link
Owner

Please see the dag branch.

Opening an issue to discuss this implementation, which I will merge to master soon.

getSandhiSplits now returns a SanskritLexicalGraph object, with the splits represented in graph form. Calling findAllPaths on this object returns a flat list of splits.

This is faster than the earlier implementation if you include flattening (now called pathfinding). We can split and find all paths/flatten in astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH in about 30s. I'd like to get it down to less than a second if possible

$ python SanskritLexicalAnalyzer.py astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH --split --print-max 100
Parsing of XMLs started at 2017-07-10 12:20:07.216416
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 12:20:12.179545
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-10 12:20:14.445834
End split: 2017-07-10 12:20:14.480862
End pathfinding: 2017-07-10 12:20:52.277015

@avinashvarna
Copy link
Collaborator

Would it be better to use a standard graph library instead of rolling our own? I've used networkx (https://networkx.readthedocs.io) before, and it comes with a lot of algorithms, including some for what we need. E.g. https://networkx.readthedocs.io/en/stable/reference/algorithms.simple_paths.html (See shortest simple paths - this should correspond to fewest splits) and some DAG specific ones (https://networkx.readthedocs.io/en/stable/reference/algorithms.dag.html)
I believe these algorithms are written using DFS/BFS to avoid the overhead of recursion in Python, and should have better performance than the simple recursive implementation.

@avinashvarna
Copy link
Collaborator

I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds:

Parsing of XMLs started at 2017-07-10 20:33:07.631000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 20:33:13.823000
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start Split: 2017-07-10 20:33:18.199000
End DAG generation: 2017-07-10 20:33:18.254000
End pathfinding: 2017-07-10 20:34:13.241000
Finding k shortest paths using networkx: 2017-07-10 20:34:13.242000
End pathfinding: 2017-07-10 20:34:13.332000

@avinashvarna
Copy link
Collaborator

P.S. Let me know if you want me to clean it up. Didn't add too many comments since I wanted to get it implemented and share the results.

@drdhaval2785
Copy link

drdhaval2785 commented Jul 11, 2017 via email

@kmadathil
Copy link
Owner Author

@avinashvarna
My plan was to use a simple implementation now and move to a better one as needed. We will definitely need a graph library for Level 3, so we might as well integrate it now. Let me look at your branch.

@avinashvarna
Copy link
Collaborator

Sorry, I actually added it into your dag branch, since I had some in-progress work on my branch. Hope that's ok. You can revert the commit if you want.

@kmadathil
Copy link
Owner Author

No you didn't - you pushed it to a different branch called origin/dag by mistake. All's well :-)

@avinashvarna
Copy link
Collaborator

First time running git worktree add and I mess it up. Well, no harm done I guess :).

@drdhaval2785
Copy link

I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds:

Where is it? How to use?

@kmadathil
Copy link
Owner Author

@avinashvarna
What's the maximum number of paths youve tried for? I couldn't get it to terminate in a reasonable time with --print-max 10000

@drdhaval2785
please pull the branch called "origin/dag" (not dag, but origin/dag)

@kmadathil
Copy link
Owner Author

Comparison of runtimes (rough) - 1000 paths on astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH

BFS - dag-bfs branch: 10 seconds
DFS/Memoized - dag branch: 40 seconds
Networkx - dag-nx branch: 4 seconds
- origin/dag branch: 3 seconds

If you want to try this out, please note some branches use --print-max and some use --max-paths for the same thing. I'll eventually settle on --max-paths

@kmadathil
Copy link
Owner Author

Note that the DFS/Memoized dag branch takes 40 seconds for the entire set of paths. No other implementation seems to be able to go to 10000 paths.

@kmadathil
Copy link
Owner Author

@drdhaval2785 Please try the dag-nx branch.

@drdhaval2785
Copy link

No other implementation seems to be able to go to 10000 paths.

Excuse my lack of technical knowledge about paths and graphs.
I am just curious why so many paths?
There is some idea in my mind by which we can definitely come to a conclusion that the sandhi or samasa is not split at certain letters. In 42 letter, there were only 18 which qualified to be called a split point.

@avinashvarna
Copy link
Collaborator

Maybe it's just to understand the limits right now. We definitely have some work ahead of us to minimize the number of false splits. I've been modifying the way we encode the sandhi rules to minimize these splits.

@kmadathil It appears to be some problem with the different algorithm they use in shortest_simple_paths. Just as a test, I replaced shortest_simple_paths with all_simple_paths in dag-nx branch, and it finishes with the same number of paths, taking almost the same time as the dag branch on my laptop:
End DAG generation: 2017-07-11 09:06:28.385000
End pathfinding: 2017-07-11 09:07:27.163000
Found 8478668 paths
So we could switch to using all_simple_paths for now and sorting them ourselves, which is what dag branch does anyway.

I will post a question on networkx mailing list/stackoverflow to see if this is a bug or some other issue.

@kmadathil
Copy link
Owner Author

@drdhaval2785 Graphs are simple, you can figure out the basics in a day or less. :-)
https://www.codechef.com/wiki/tutorial-graph-theory-part-1

Why so many paths? Let's say we have a graph with an average of s possible splits each at n positions. The complexity of building the lexical graph is of the order of s.n, while the complexity of number of paths through it is s^n. This is why graph building is quick, while calculating the number of paths blows up with longer sentences. It's good practice when we're sitting on a problem with exponential complexity to know our limits, and possible fallback to a different method if we detect that we've exceeded them. That's why I'm worried about paths.
Right now, for astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH, we need approx 1000 shortest paths to ensure that we include the correct split. If I add the second half of the verse (haven't had the courage to yet), we'll probably need 10000.

@kmadathil
Copy link
Owner Author

@avinashvarna
That's a good catch. We could have both versions, and fallback to the all_simple_paths implementation depending on the number of paths requested.

@kmadathil
Copy link
Owner Author

dag-nx looks like the one that will be taken forward. I will close this after integrating with master. Awaiting the first round of integration issues with sandhi.py to be ironed out.

@kmadathil kmadathil self-assigned this Jul 12, 2017
@avinashvarna
Copy link
Collaborator

I just happened to pick dag-nx to start because you mentioned it used networkx. I haven't actually yet seen the bfs branch. If you think that dag-nx is indeed the right one to go forward, then we are good to go. Else, let's use a different one.

@kmadathil
Copy link
Owner Author

Merged dag-nx (with shortest_simple_paths and fallback all_simple_paths implementation) has been merged with master. Closing.
Any sandhi integration issues can be dealt with elsewhere

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants