More canonical DAG implementation #18

kmadathil · 2017-07-10T07:00:34Z

Please see the dag branch.

Opening an issue to discuss this implementation, which I will merge to master soon.

getSandhiSplits now returns a SanskritLexicalGraph object, with the splits represented in graph form. Calling findAllPaths on this object returns a flat list of splits.

This is faster than the earlier implementation if you include flattening (now called pathfinding). We can split and find all paths/flatten in astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH in about 30s. I'd like to get it down to less than a second if possible

$ python SanskritLexicalAnalyzer.py astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH --split --print-max 100
Parsing of XMLs started at 2017-07-10 12:20:07.216416
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 12:20:12.179545
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-10 12:20:14.445834
End split: 2017-07-10 12:20:14.480862
End pathfinding: 2017-07-10 12:20:52.277015

avinashvarna · 2017-07-10T22:49:05Z

Would it be better to use a standard graph library instead of rolling our own? I've used networkx (https://networkx.readthedocs.io) before, and it comes with a lot of algorithms, including some for what we need. E.g. https://networkx.readthedocs.io/en/stable/reference/algorithms.simple_paths.html (See shortest simple paths - this should correspond to fewest splits) and some DAG specific ones (https://networkx.readthedocs.io/en/stable/reference/algorithms.dag.html)
I believe these algorithms are written using DFS/BFS to avoid the overhead of recursion in Python, and should have better performance than the simple recursive implementation.

avinashvarna · 2017-07-11T03:45:04Z

I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds:

Parsing of XMLs started at 2017-07-10 20:33:07.631000
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 20:33:13.823000
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start Split: 2017-07-10 20:33:18.199000
End DAG generation: 2017-07-10 20:33:18.254000
End pathfinding: 2017-07-10 20:34:13.241000
Finding k shortest paths using networkx: 2017-07-10 20:34:13.242000
End pathfinding: 2017-07-10 20:34:13.332000

avinashvarna · 2017-07-11T03:50:29Z

P.S. Let me know if you want me to clean it up. Didn't add too many comments since I wanted to get it implemented and share the results.

drdhaval2785 · 2017-07-11T04:08:15Z

Seems like things are moving

kmadathil · 2017-07-11T05:14:53Z

@avinashvarna
My plan was to use a simple implementation now and move to a better one as needed. We will definitely need a graph library for Level 3, so we might as well integrate it now. Let me look at your branch.

avinashvarna · 2017-07-11T05:28:04Z

Sorry, I actually added it into your dag branch, since I had some in-progress work on my branch. Hope that's ok. You can revert the commit if you want.

kmadathil · 2017-07-11T05:33:51Z

No you didn't - you pushed it to a different branch called origin/dag by mistake. All's well :-)

avinashvarna · 2017-07-11T05:42:01Z

First time running git worktree add and I mess it up. Well, no harm done I guess :).

drdhaval2785 · 2017-07-11T09:00:58Z

I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds:

Where is it? How to use?

kmadathil · 2017-07-11T10:13:00Z

@avinashvarna
What's the maximum number of paths youve tried for? I couldn't get it to terminate in a reasonable time with --print-max 10000

@drdhaval2785
please pull the branch called "origin/dag" (not dag, but origin/dag)

kmadathil · 2017-07-11T10:35:17Z

Comparison of runtimes (rough) - 1000 paths on astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH

BFS - dag-bfs branch: 10 seconds
DFS/Memoized - dag branch: 40 seconds
Networkx - dag-nx branch: 4 seconds
- origin/dag branch: 3 seconds

If you want to try this out, please note some branches use --print-max and some use --max-paths for the same thing. I'll eventually settle on --max-paths

kmadathil · 2017-07-11T11:03:49Z

Note that the DFS/Memoized dag branch takes 40 seconds for the entire set of paths. No other implementation seems to be able to go to 10000 paths.

kmadathil · 2017-07-11T13:57:39Z

@drdhaval2785 Please try the dag-nx branch.

drdhaval2785 · 2017-07-11T15:47:55Z

No other implementation seems to be able to go to 10000 paths.

Excuse my lack of technical knowledge about paths and graphs.
I am just curious why so many paths?
There is some idea in my mind by which we can definitely come to a conclusion that the sandhi or samasa is not split at certain letters. In 42 letter, there were only 18 which qualified to be called a split point.

avinashvarna · 2017-07-11T16:15:07Z

Maybe it's just to understand the limits right now. We definitely have some work ahead of us to minimize the number of false splits. I've been modifying the way we encode the sandhi rules to minimize these splits.

@kmadathil It appears to be some problem with the different algorithm they use in shortest_simple_paths. Just as a test, I replaced shortest_simple_paths with all_simple_paths in dag-nx branch, and it finishes with the same number of paths, taking almost the same time as the dag branch on my laptop:
End DAG generation: 2017-07-11 09:06:28.385000
End pathfinding: 2017-07-11 09:07:27.163000
Found 8478668 paths
So we could switch to using all_simple_paths for now and sorting them ourselves, which is what dag branch does anyway.

I will post a question on networkx mailing list/stackoverflow to see if this is a bug or some other issue.

kmadathil · 2017-07-12T02:48:48Z

@drdhaval2785 Graphs are simple, you can figure out the basics in a day or less. :-)
https://www.codechef.com/wiki/tutorial-graph-theory-part-1

Why so many paths? Let's say we have a graph with an average of s possible splits each at n positions. The complexity of building the lexical graph is of the order of s.n, while the complexity of number of paths through it is s^n. This is why graph building is quick, while calculating the number of paths blows up with longer sentences. It's good practice when we're sitting on a problem with exponential complexity to know our limits, and possible fallback to a different method if we detect that we've exceeded them. That's why I'm worried about paths.
Right now, for astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH, we need approx 1000 shortest paths to ensure that we include the correct split. If I add the second half of the verse (haven't had the courage to yet), we'll probably need 10000.

kmadathil · 2017-07-12T02:50:22Z

@avinashvarna
That's a good catch. We could have both versions, and fallback to the all_simple_paths implementation depending on the number of paths requested.

kmadathil · 2017-07-12T09:53:18Z

dag-nx looks like the one that will be taken forward. I will close this after integrating with master. Awaiting the first round of integration issues with sandhi.py to be ironed out.

avinashvarna · 2017-07-12T16:10:15Z

I just happened to pick dag-nx to start because you mentioned it used networkx. I haven't actually yet seen the bfs branch. If you think that dag-nx is indeed the right one to go forward, then we are good to go. Else, let's use a different one.

kmadathil · 2017-07-14T05:42:37Z

Merged dag-nx (with shortest_simple_paths and fallback all_simple_paths implementation) has been merged with master. Closing.
Any sandhi integration issues can be dealt with elsewhere

kmadathil self-assigned this Jul 12, 2017

kmadathil closed this as completed Jul 14, 2017

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More canonical DAG implementation #18

More canonical DAG implementation #18

kmadathil commented Jul 10, 2017

avinashvarna commented Jul 10, 2017

avinashvarna commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

drdhaval2785 commented Jul 11, 2017 via email

kmadathil commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

kmadathil commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

drdhaval2785 commented Jul 11, 2017

kmadathil commented Jul 11, 2017

kmadathil commented Jul 11, 2017

kmadathil commented Jul 11, 2017

kmadathil commented Jul 11, 2017

drdhaval2785 commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

kmadathil commented Jul 12, 2017

kmadathil commented Jul 12, 2017

kmadathil commented Jul 12, 2017

avinashvarna commented Jul 12, 2017

kmadathil commented Jul 14, 2017

More canonical DAG implementation #18

More canonical DAG implementation #18

Comments

kmadathil commented Jul 10, 2017

avinashvarna commented Jul 10, 2017

avinashvarna commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

drdhaval2785 commented Jul 11, 2017 via email

kmadathil commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

kmadathil commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

drdhaval2785 commented Jul 11, 2017

kmadathil commented Jul 11, 2017

kmadathil commented Jul 11, 2017

kmadathil commented Jul 11, 2017

kmadathil commented Jul 11, 2017

drdhaval2785 commented Jul 11, 2017

avinashvarna commented Jul 11, 2017

kmadathil commented Jul 12, 2017

kmadathil commented Jul 12, 2017

kmadathil commented Jul 12, 2017

avinashvarna commented Jul 12, 2017

kmadathil commented Jul 14, 2017