
Bug 1493527 - Optimize analysis merging #154

Merged: 6 commits into mozsearch:master from the optimize branch on Oct 1, 2018

Conversation

staktrace (Contributor)

No description provided.

@emilio (Contributor) left a comment:
Mostly nits, thanks for doing this!

@@ -138,7 +134,7 @@ pub fn format_code(jumps: &HashMap<String, Jump>, format: FormatAs,
     let items = d.iter().map(|item| {
         let mut obj = json::Object::new();
         obj.insert("pretty".to_string(), Json::String(item.pretty.clone()));
-        obj.insert("sym".to_string(), Json::String(item.sym.clone()));
+        obj.insert("sym".to_string(), Json::String(item.sym.clone().join(",")));
emilio:

I think the clone() before join() can go away.
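To illustrate the nit: `join` only borrows the slice, so cloning the `Vec` first is an unnecessary allocation. A minimal sketch (field names simplified from the diff):

```rust
fn main() {
    let sym: Vec<String> = vec!["a".to_string(), "b".to_string()];

    // Cloning before joining allocates a throwaway Vec...
    let with_clone = sym.clone().join(",");
    // ...while join() works fine on a borrow of the original.
    let without_clone = sym.join(",");

    assert_eq!(with_clone, without_clone);
    assert_eq!(without_clone, "a,b");
}
```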

@@ -13,6 +14,12 @@ pub struct Location {
     pub col_end: u32,
 }

+impl fmt::Display for Location {
+    fn fmt(&self, formatter: &mut fmt::Formatter<'_>) -> fmt::Result {
emilio:

nit: I think the <'_> can be omitted in these impls.
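The `<'_>` on `Formatter` is indeed elidable: lifetime elision fills it in. A compiling sketch (the `Location` fields other than `col_end` are assumptions, not copied from the real struct):

```rust
use std::fmt;

struct Location {
    lineno: u32,
    col_start: u32,
    col_end: u32,
}

impl fmt::Display for Location {
    // The anonymous lifetime on Formatter can be elided here.
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        write!(formatter, "{}:{}-{}", self.lineno, self.col_start, self.col_end)
    }
}

fn main() {
    let loc = Location { lineno: 12, col_start: 4, col_end: 9 };
    assert_eq!(loc.to_string(), "12:4-9");
}
```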

Err(_) => return vec![],
};
let reader = BufReader::new(&file);
read_analyses(vec![ filename ], filter)
emilio:

nit: I don't think spaces are put after [ / before ].

read_analyses(vec![ filename ], filter)
}

pub fn read_analyses<T>(filenames: Vec<&str>, filter: &Fn(&Object) -> Option<T>) -> Vec<WithLocation<Vec<T>>> {
emilio:

nit: Does this need to take a Vec? Or just &[&str] would suffice?
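The suggestion here is that a `&[&str]` parameter is more flexible than `Vec<&str>`: callers can pass a borrowed `Vec` or an array literal without giving up ownership. A sketch with the signature simplified (the real function also takes a filter and returns parsed analysis data):

```rust
// Taking a slice instead of a Vec avoids forcing callers to allocate
// or move a Vec just to call the function.
fn read_analyses(filenames: &[&str]) -> usize {
    filenames.len()
}

fn main() {
    let v = vec!["a.json", "b.json"];
    assert_eq!(read_analyses(&v), 2); // &Vec<&str> coerces to &[&str]
    assert_eq!(read_analyses(&["c.json"]), 1); // array literals work too
}
```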

/// Also asserts that the `pretty` and `no_crossref` fields are
/// the same because otherwise the merge doesn't really make sense.
pub fn merge(&mut self, mut other: AnalysisSource) {
assert!(self.pretty == other.pretty);
emilio:

nit: assert_eq! gives you a better error message.
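The difference: on failure, `assert!(a == b)` only prints the stringified condition, while `assert_eq!(a, b)` also prints both values, which is much more useful when a merge assertion trips. A minimal sketch:

```rust
fn main() {
    let pretty_a = "foo::bar";
    let pretty_b = "foo::bar";

    // On failure this prints just: assertion failed: pretty_a == pretty_b
    assert!(pretty_a == pretty_b);

    // On failure this also prints the left and right values,
    // so you can see *which* pretty names disagreed.
    assert_eq!(pretty_a, pretty_b);
}
```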

if let Some(tgt) = read_target(obj) {
let loc = parse_location(obj.get("loc").unwrap().as_string().unwrap());
let target_str = format!("{}", WithLocation{ data: tgt, loc: loc });
if !unique_targets.contains(&target_str) {
emilio:

nit: You could use the return value of insert, though I guess that means you'd need to clone it... Not sure what's better, probably fine to leave as-is, but I thought I'd point it out.
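For reference, `HashSet::insert` returns `true` only when the value was not already present, so it can replace a separate `contains()` check followed by an insertion:

```rust
use std::collections::HashSet;

fn main() {
    let mut unique_targets = HashSet::new();

    // First insertion: value was absent, insert() returns true.
    assert!(unique_targets.insert("target-line".to_string()));
    // Second insertion of the same value: returns false.
    assert!(!unique_targets.insert("target-line".to_string()));

    assert!(unique_targets.contains("target-line"));
}
```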

// hashset to deduplicate them.
if let Some(tgt) = read_target(obj) {
let loc = parse_location(obj.get("loc").unwrap().as_string().unwrap());
let target_str = format!("{}", WithLocation{ data: tgt, loc: loc });
emilio:

nit: space after WithLocation, and can avoid loc: loc and just use loc.
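The shorthand in question: when a local variable has the same name as a struct field, `loc: loc` can be written as just `loc`. A sketch with `WithLocation` simplified to owned fields:

```rust
struct WithLocation {
    data: String,
    loc: u32,
}

fn main() {
    let data = "tgt".to_string();
    let loc = 42;

    // Field init shorthand: `data` and `loc` instead of
    // `data: data` and `loc: loc`.
    let wl = WithLocation { data, loc };

    assert_eq!(wl.loc, 42);
    assert_eq!(wl.data, "tgt");
}
```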

if let Some(e) = last_entry {
loc_data.data.push(e);
}
print!("{}", loc_data);
emilio:

In retrospect, a new trait for this could've been better (maybe most of these impls could be written using serde annotations), but not a big deal in any case.

The analysis data structures are modified slightly to bring them more
in line with what the C++ indexer emits and the documented format of
the analysis data. This consists of a few changes.

The first is a trivial change that reorders some of the fields in the
"source" analysis struct to match that of the C++ indexer. This doesn't
actually matter but I'm OCD like that.

The next is to use a vec of strings to represent the comma-separated
list of symbols allowed in source analysis lines, rather than keeping
it as a single string. Making this change exposed a defect in the
output file formatter, where a token could get co-highlighted
improperly because the highlighting used the full list of symbols
instead of a single symbol; the change to format.rs fixes that by
picking a single symbol.
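The fix described above can be sketched as keying the highlighting off one element of the symbol list rather than the whole list; picking the first symbol is an assumption here, not necessarily what format.rs does:

```rust
fn main() {
    // A source line's symbols, now a Vec rather than one comma-joined string.
    let sym = vec!["sym_a".to_string(), "sym_b".to_string()];

    // Highlight using a single symbol instead of the joined list,
    // so unrelated tokens sharing only part of the list don't co-highlight.
    let highlight_key = sym.first().cloned().unwrap_or_default();

    assert_eq!(highlight_key, "sym_a");
}
```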

The final change is to sort and deduplicate the `syntax` and `sym`
fields in the source analysis files, since conceptually these
properties are sets rather than lists. Sorting them has the added
benefit of producing a normalized output format, which the next
patch takes advantage of.
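The sort-then-dedup step described above is the standard idiom for treating a `Vec` as a set, since `dedup` only removes adjacent duplicates:

```rust
fn main() {
    let mut syms = vec![
        "b_sym".to_string(),
        "a_sym".to_string(),
        "b_sym".to_string(),
    ];

    // Sort first so duplicates become adjacent, then dedup removes them.
    syms.sort();
    syms.dedup();

    assert_eq!(syms, vec!["a_sym".to_string(), "b_sym".to_string()]);
}
```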

This implements the Display trait on some structures so that
they can easily be JSON-ified, with the output format being
normalized to what the C++ indexer emits. "Normalized" means
given any set of equivalent input analysis lines, the output
should always be the same - ordering of properties in the
emitted JSON, ordering of tokens with property values that are
comma-separated, omission of optional properties, formatting of
numbers, etc. are all made consistent. This allows us to do
cheap deduplication of equivalent lines, as well as just
being easier to debug.
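The idea can be sketched as a `Display` impl that always emits properties in one fixed order, so equivalent records serialize to byte-identical lines; the struct and field names here are hypothetical simplifications, not the real analysis types:

```rust
use std::fmt;

// Hypothetical simplified analysis record.
struct Target {
    kind: String,
    pretty: String,
}

impl fmt::Display for Target {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Fixed property order and formatting: two equivalent records
        // always produce the same line, enabling cheap string-level dedup.
        write!(
            f,
            "{{\"target\":1,\"kind\":\"{}\",\"pretty\":\"{}\"}}",
            self.kind, self.pretty
        )
    }
}

fn main() {
    let t = Target { kind: "use".to_string(), pretty: "foo::bar".to_string() };
    assert_eq!(t.to_string(), "{\"target\":1,\"kind\":\"use\",\"pretty\":\"foo::bar\"}");
}
```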

This modifies the existing read_analysis function to take a list
of files instead of just a single file, and read the analysis data
from all of them. It renames the function to read_analyses to be
consistent with the change, and adds back a read_analysis function
with the old signature that delegates to the multiple-file version.

When viewing this patch, view the ignore-whitespace diff as most of
the patch is just a change in indentation.
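The delegation pattern described above, with the signatures heavily simplified (the real functions take a filter and return parsed `WithLocation` values):

```rust
// Multi-file reader: sketch only, real version parses JSON analysis lines.
fn read_analyses(filenames: &[&str]) -> Vec<String> {
    filenames.iter().map(|f| format!("parsed:{}", f)).collect()
}

// Single-file wrapper with the old signature, delegating to the
// multiple-file version.
fn read_analysis(filename: &str) -> Vec<String> {
    read_analyses(&[filename])
}

fn main() {
    assert_eq!(read_analysis("a.json"), vec!["parsed:a.json".to_string()]);
    assert_eq!(read_analyses(&["a.json", "b.json"]).len(), 2);
}
```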

This will be needed in the next patch, where we pass in a lambda
that needs to modify captured state.
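A `&Fn` callback cannot mutate its captures, so the change presumably moves to an `FnMut` bound; a minimal sketch of the difference (signatures simplified, not the real ones):

```rust
// Accepting &mut dyn FnMut (rather than &dyn Fn) lets the caller's
// closure mutate state it has captured.
fn read_analyses(filenames: &[&str], filter: &mut dyn FnMut(&str) -> bool) -> usize {
    let mut count = 0;
    for f in filenames {
        if filter(f) {
            count += 1;
        }
    }
    count
}

fn main() {
    let mut seen = 0;
    // The closure mutates `seen`, which a plain Fn bound would reject.
    let kept = read_analyses(&["a.json", "b.json"], &mut |_f| {
        seen += 1;
        true
    });
    assert_eq!(kept, 2);
    assert_eq!(seen, 2);
}
```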

This rewrites the merge-analyses.py script in Rust. There is
a minor semantic difference in the Rust version in that it
automatically deduplicates the target lines, while the Python
version does not. This makes it unnecessary for the calling script
to pipe the output through `| sort | uniq` to deduplicate that.
Logically this deduplication step belongs with the merging
tool since the duplicate lines should only appear because of
the need to handle multiple input files.

The output from the Rust version is also better normalized to
a consistent format (with respect to ordering of properties,
etc. within a given line). While the Python version preserved
the order of properties from the input files, the Rust version
takes it a step further and normalizes them to a known order.
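The dedup-while-emitting behavior that replaces the `| sort | uniq` pipeline can be sketched with a `HashSet` tracking lines already written (a simplification of what the merge tool does with target lines):

```rust
use std::collections::HashSet;

fn main() {
    // Duplicate target lines arise from merging multiple input files.
    let lines = ["t1", "t2", "t1", "t3"];

    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for line in &lines {
        // insert() returns false for lines already emitted, so each
        // unique line is output exactly once, in first-seen order.
        if seen.insert(*line) {
            out.push(*line);
        }
    }

    assert_eq!(out, vec!["t1", "t2", "t3"]);
}
```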
@staktrace (Author):

Thanks for the quick review! Addressed the nits, except I decided to leave the contains() check as-is since I feel it's slightly more readable.

@staktrace staktrace merged commit 8fc378a into mozsearch:master Oct 1, 2018
@staktrace staktrace deleted the optimize branch October 1, 2018 15:09