
Bug 1493527 - Optimize analysis merging #154

Merged: 6 commits into mozsearch:master from the optimize branch on Oct 1, 2018

Conversation

staktrace (Contributor)

No description provided.

@emilio (Contributor) left a comment:
Mostly nits, thanks for doing this!

@@ -138,7 +134,7 @@ pub fn format_code(jumps: &HashMap<String, Jump>, format: FormatAs,
     let items = d.iter().map(|item| {
         let mut obj = json::Object::new();
         obj.insert("pretty".to_string(), Json::String(item.pretty.clone()));
-        obj.insert("sym".to_string(), Json::String(item.sym.clone()));
+        obj.insert("sym".to_string(), Json::String(item.sym.clone().join(",")));
emilio:

I think the clone() before join() can go away.
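To illustrate the nit: `join` only borrows the slice, so cloning the `Vec` first is an unnecessary allocation. A minimal sketch (field names simplified from the diff):

```rust
fn main() {
    let sym: Vec<String> = vec!["a".to_string(), "b".to_string()];

    // Cloning before joining allocates a throwaway Vec...
    let with_clone = sym.clone().join(",");
    // ...while join() works fine on a borrow of the original.
    let without_clone = sym.join(",");

    assert_eq!(with_clone, without_clone);
    assert_eq!(without_clone, "a,b");
}
```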

@@ -13,6 +14,12 @@ pub struct Location {
     pub col_end: u32,
 }

+impl fmt::Display for Location {
+    fn fmt(&self, formatter: &mut fmt::Formatter<'_>) -> fmt::Result {
emilio:

nit: I think the <'_> can be omitted in these impls.
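The `<'_>` on `Formatter` is indeed elidable: lifetime elision fills it in. A compiling sketch (the `Location` fields other than `col_end` are assumptions, not copied from the real struct):

```rust
use std::fmt;

struct Location {
    lineno: u32,
    col_start: u32,
    col_end: u32,
}

impl fmt::Display for Location {
    // The anonymous lifetime on Formatter can be elided here.
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        write!(formatter, "{}:{}-{}", self.lineno, self.col_start, self.col_end)
    }
}

fn main() {
    let loc = Location { lineno: 12, col_start: 4, col_end: 9 };
    assert_eq!(loc.to_string(), "12:4-9");
}
```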

Err(_) => return vec![],
};
let reader = BufReader::new(&file);
read_analyses(vec![ filename ], filter)
emilio:

nit: I don't think spaces are put after [ / before ].

read_analyses(vec![ filename ], filter)
}

pub fn read_analyses<T>(filenames: Vec<&str>, filter: &Fn(&Object) -> Option<T>) -> Vec<WithLocation<Vec<T>>> {
emilio:

nit: Does this need to take a Vec? Or just &[&str] would suffice?
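The suggestion here is that a `&[&str]` parameter is more flexible than `Vec<&str>`: callers can pass a borrowed `Vec` or an array literal without giving up ownership. A sketch with the signature simplified (the real function also takes a filter and returns parsed analysis data):

```rust
// Taking a slice instead of a Vec avoids forcing callers to allocate
// or move a Vec just to call the function.
fn read_analyses(filenames: &[&str]) -> usize {
    filenames.len()
}

fn main() {
    let v = vec!["a.json", "b.json"];
    assert_eq!(read_analyses(&v), 2); // &Vec<&str> coerces to &[&str]
    assert_eq!(read_analyses(&["c.json"]), 1); // array literals work too
}
```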

/// Also asserts that the `pretty` and `no_crossref` fields are
/// the same because otherwise the merge doesn't really make sense.
pub fn merge(&mut self, mut other: AnalysisSource) {
assert!(self.pretty == other.pretty);
emilio:

nit: assert_eq! gives you a better error message.
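The difference: on failure, `assert!(a == b)` only prints the stringified condition, while `assert_eq!(a, b)` also prints both values, which is much more useful when a merge assertion trips. A minimal sketch:

```rust
fn main() {
    let pretty_a = "foo::bar";
    let pretty_b = "foo::bar";

    // On failure this prints just: assertion failed: pretty_a == pretty_b
    assert!(pretty_a == pretty_b);

    // On failure this also prints the left and right values,
    // so you can see *which* pretty names disagreed.
    assert_eq!(pretty_a, pretty_b);
}
```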

if let Some(tgt) = read_target(obj) {
let loc = parse_location(obj.get("loc").unwrap().as_string().unwrap());
let target_str = format!("{}", WithLocation{ data: tgt, loc: loc });
if !unique_targets.contains(&target_str) {
emilio:

nit: You could use the return value of insert, though I guess that means you'd need to clone it... Not sure what's better, probably fine to leave as-is, but I thought I'd point it out.
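For reference, `HashSet::insert` returns `true` only when the value was not already present, so it can replace a separate `contains()` check followed by an insertion:

```rust
use std::collections::HashSet;

fn main() {
    let mut unique_targets = HashSet::new();

    // First insertion: value was absent, insert() returns true.
    assert!(unique_targets.insert("target-line".to_string()));
    // Second insertion of the same value: returns false.
    assert!(!unique_targets.insert("target-line".to_string()));

    assert!(unique_targets.contains("target-line"));
}
```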

// hashset to deduplicate them.
if let Some(tgt) = read_target(obj) {
let loc = parse_location(obj.get("loc").unwrap().as_string().unwrap());
let target_str = format!("{}", WithLocation{ data: tgt, loc: loc });
emilio:

nit: space after WithLocation, and can avoid loc: loc and just use loc.
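The shorthand in question: when a local variable has the same name as a struct field, `loc: loc` can be written as just `loc`. A sketch with `WithLocation` simplified to owned fields:

```rust
struct WithLocation {
    data: String,
    loc: u32,
}

fn main() {
    let data = "tgt".to_string();
    let loc = 42;

    // Field init shorthand: `data` and `loc` instead of
    // `data: data` and `loc: loc`.
    let wl = WithLocation { data, loc };

    assert_eq!(wl.loc, 42);
    assert_eq!(wl.data, "tgt");
}
```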

if let Some(e) = last_entry {
loc_data.data.push(e);
}
print!("{}", loc_data);
emilio:

In retrospect, a new trait for this could've been better (maybe most of these impls could be written using serde annotations), but not a big deal in any case.

The analysis data structures are modified slightly to bring them more
in line with what the C++ indexer emits and the documented format of
the analysis data. This consists of a few changes.

The first is a trivial change that reorders some of the fields in the
"source" analysis struct to match that of the C++ indexer. This doesn't
actually matter but I'm OCD like that.

The next is to use a vec of strings to represent the comma-separated
list of symbols allowed in source analysis lines, rather than keeping
it as a single string. Making this change exposed a defect in the
output file formatter, where a token could get co-highlighted
improperly because the highlighting used the full list of symbols
instead of a single symbol; the change to format.rs fixes that by
picking a single symbol.
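The fix described above can be sketched as keying the highlighting off one element of the symbol list rather than the whole list; picking the first symbol is an assumption here, not necessarily what format.rs does:

```rust
fn main() {
    // A source line's symbols, now a Vec rather than one comma-joined string.
    let sym = vec!["sym_a".to_string(), "sym_b".to_string()];

    // Highlight using a single symbol instead of the joined list,
    // so unrelated tokens sharing only part of the list don't co-highlight.
    let highlight_key = sym.first().cloned().unwrap_or_default();

    assert_eq!(highlight_key, "sym_a");
}
```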

The final change is to sort and deduplicate the `syntax` and `sym`
fields in the source analysis files, since conceptually these
properties are sets rather than lists. Sorting them has the added
benefit of producing a normalized output format, which the next
patch takes advantage of.
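The sort-then-dedup step described above is the standard idiom for treating a `Vec` as a set, since `dedup` only removes adjacent duplicates:

```rust
fn main() {
    let mut syms = vec![
        "b_sym".to_string(),
        "a_sym".to_string(),
        "b_sym".to_string(),
    ];

    // Sort first so duplicates become adjacent, then dedup removes them.
    syms.sort();
    syms.dedup();

    assert_eq!(syms, vec!["a_sym".to_string(), "b_sym".to_string()]);
}
```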

This implements the Display trait on some structures so that
they can easily be JSON-ified, with the output format being
normalized to what the C++ indexer emits. "Normalized" means
given any set of equivalent input analysis lines, the output
should always be the same - ordering of properties in the
emitted JSON, ordering of tokens with property values that are
comma-separated, omission of optional properties, formatting of
numbers, etc. are all made consistent. This allows us to do
cheap deduplication of equivalent lines, as well as just
being easier to debug.
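The idea can be sketched as a `Display` impl that always emits properties in one fixed order, so equivalent records serialize to byte-identical lines; the struct and field names here are hypothetical simplifications, not the real analysis types:

```rust
use std::fmt;

// Hypothetical simplified analysis record.
struct Target {
    kind: String,
    pretty: String,
}

impl fmt::Display for Target {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Fixed property order and formatting: two equivalent records
        // always produce the same line, enabling cheap string-level dedup.
        write!(
            f,
            "{{\"target\":1,\"kind\":\"{}\",\"pretty\":\"{}\"}}",
            self.kind, self.pretty
        )
    }
}

fn main() {
    let t = Target { kind: "use".to_string(), pretty: "foo::bar".to_string() };
    assert_eq!(t.to_string(), "{\"target\":1,\"kind\":\"use\",\"pretty\":\"foo::bar\"}");
}
```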

This modifies the existing read_analysis function to take a list
of files instead of just a single file, and read the analysis data
from all of them. It renames the function to read_analyses to be
consistent with the change, and adds back a read_analysis function
with the old signature that delegates to the multiple-file version.

When viewing this patch, view the ignore-whitespace diff as most of
the patch is just a change in indentation.
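The delegation pattern described above, with the signatures heavily simplified (the real functions take a filter and return parsed `WithLocation` values):

```rust
// Multi-file reader: sketch only, real version parses JSON analysis lines.
fn read_analyses(filenames: &[&str]) -> Vec<String> {
    filenames.iter().map(|f| format!("parsed:{}", f)).collect()
}

// Single-file wrapper with the old signature, delegating to the
// multiple-file version.
fn read_analysis(filename: &str) -> Vec<String> {
    read_analyses(&[filename])
}

fn main() {
    assert_eq!(read_analysis("a.json"), vec!["parsed:a.json".to_string()]);
    assert_eq!(read_analyses(&["a.json", "b.json"]).len(), 2);
}
```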

This will be needed in the next patch, where we pass in a lambda
that needs to modify captured state.
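A `&Fn` callback cannot mutate its captures, so the change presumably moves to an `FnMut` bound; a minimal sketch of the difference (signatures simplified, not the real ones):

```rust
// Accepting &mut dyn FnMut (rather than &dyn Fn) lets the caller's
// closure mutate state it has captured.
fn read_analyses(filenames: &[&str], filter: &mut dyn FnMut(&str) -> bool) -> usize {
    let mut count = 0;
    for f in filenames {
        if filter(f) {
            count += 1;
        }
    }
    count
}

fn main() {
    let mut seen = 0;
    // The closure mutates `seen`, which a plain Fn bound would reject.
    let kept = read_analyses(&["a.json", "b.json"], &mut |_f| {
        seen += 1;
        true
    });
    assert_eq!(kept, 2);
    assert_eq!(seen, 2);
}
```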

This rewrites the merge-analyses.py script in Rust. There is
a minor semantic difference in the Rust version in that it
automatically deduplicates the target lines, while the Python
version does not. This makes it unnecessary for the calling script
to pipe the output through `| sort | uniq` to deduplicate that.
Logically this deduplication step belongs with the merging
tool since the duplicate lines should only appear because of
the need to handle multiple input files.

The output from the Rust version is also better normalized to
a consistent format (with respect to ordering of properties,
etc. within a given line). While the Python version preserved
the order of properties from the input files, the Rust version
takes it a step further and normalizes them to a known order.
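The dedup-while-emitting behavior that replaces the `| sort | uniq` pipeline can be sketched with a `HashSet` tracking lines already written (a simplification of what the merge tool does with target lines):

```rust
use std::collections::HashSet;

fn main() {
    // Duplicate target lines arise from merging multiple input files.
    let lines = ["t1", "t2", "t1", "t3"];

    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for line in &lines {
        // insert() returns false for lines already emitted, so each
        // unique line is output exactly once, in first-seen order.
        if seen.insert(*line) {
            out.push(*line);
        }
    }

    assert_eq!(out, vec!["t1", "t2", "t3"]);
}
```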
@staktrace (Author):

Thanks for the quick review! Addressed the nits, except I decided to leave the contains() check as-is since I feel it's slightly more readable.

@staktrace staktrace merged commit 8fc378a into mozsearch:master Oct 1, 2018
@staktrace staktrace deleted the optimize branch October 1, 2018 15:09