-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More canonical DAG implementation #18
Comments
Would it be better to use a standard graph library instead of rolling our own? I've used networkx (https://networkx.readthedocs.io) before, and it comes with a lot of algorithms, including some for what we need. E.g. https://networkx.readthedocs.io/en/stable/reference/algorithms.simple_paths.html (See shortest simple paths - this should correspond to fewest splits) and some DAG specific ones (https://networkx.readthedocs.io/en/stable/reference/algorithms.dag.html) |
I hacked together a networkx based implementation as an example. Needs some cleaning, but the speedup is worth it. From a minute on my laptop to just about 0.1 seconds: Parsing of XMLs started at 2017-07-10 20:33:07.631000 |
P.S. Let me know if you want me to clean it up. Didn't add too many comments since I wanted to get it implemented and share the results. |
Seems like things are moving
|
@avinashvarna |
Sorry, I actually added it into your dag branch, since I had some in-progress work on my branch. Hope that's ok. You can revert the commit if you want. |
No you didn't - you pushed it to a different branch called origin/dag by mistake. All's well :-) |
First time running |
Where is it? How to use? |
@avinashvarna @drdhaval2785 |
Comparison of runtimes (rough) - 1000 paths on astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH BFS - dag-bfs branch: 10 seconds If you want to try this out, please note some branches use --print-max and some use --max-paths for the same thing. I'll eventually settle on --max-paths |
Note that the DFS/Memoized dag branch takes 40 seconds for the entire set of paths. No other implementation seems to be able to go to 10000 paths. |
@drdhaval2785 Please try the dag-nx branch. |
Excuse my lack of technical knowledge about paths and graphs. |
Maybe it's just to understand the limits right now. We definitely have some work ahead of us to minimize the number of false splits. I've been modifying the way we encode the sandhi rules to minimize these splits. @kmadathil It appears to be some problem with the different algorithm they use in shortest_simple_paths. Just as a test, I replaced shortest_simple_paths with all_simple_paths in dag-nx branch, and it finishes with the same number of paths, taking almost the same time as the dag branch on my laptop: I will post a question on networkx mailing list/stackoverflow to see if this is a bug or some other issue. |
@drdhaval2785 Graphs are simple, you can figure out the basics in a day or less. :-) Why so many paths? Let's say we have a graph with an average of s possible splits each at n positions. The complexity of building the lexical graph is of the order of s.n, while the complexity of number of paths through it is s^n. This is why graph building is quick, while calculating the number of paths blows up with longer sentences. It's good practice when we're sitting on a problem with exponential complexity to know our limits, and possible fallback to a different method if we detect that we've exceeded them. That's why I'm worried about paths. |
@avinashvarna |
dag-nx looks like the one that will be taken forward. I will close this after integrating with master. Awaiting the first round of integration issues with sandhi.py to be ironed out. |
I just happened to pick dag-nx to start because you mentioned it used networkx. I haven't actually yet seen the bfs branch. If you think that dag-nx is indeed the right one to go forward, then we are good to go. Else, let's use a different one. |
Merged dag-nx (with shortest_simple_paths and fallback all_simple_paths implementation) has been merged with master. Closing. |
Please see the dag branch.
Opening an issue to discuss this implementation, which I will merge to master soon.
getSandhiSplits now returns a SanskritLexicalGraph object, with the splits represented in graph form. Calling findAllPaths on this object returns a flat list of splits.
This is faster than the earlier implementation if you include flattening (now called pathfinding). We can split and find all paths/flatten in astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH in about 30s. I'd like to get it down to less than a second if possible
$ python SanskritLexicalAnalyzer.py astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH --split --print-max 100
Parsing of XMLs started at 2017-07-10 12:20:07.216416
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-10 12:20:12.179545
Input String: astyuttarasyAmdishidevatAtmAhimAlayonAmanagAdhirAjaH
Input String in SLP1: astyuttarasyAmdiSidevatAtmAhimAlayonAmanagADirAjaH
Start split: 2017-07-10 12:20:14.445834
End split: 2017-07-10 12:20:14.480862
End pathfinding: 2017-07-10 12:20:52.277015
The text was updated successfully, but these errors were encountered: