Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backfill the Spec11ThreatMatches table #714

Closed
wants to merge 10 commits into from

Conversation

leginachen
Copy link
Contributor

@leginachen leginachen commented Jul 24, 2020

Previously, the Spec11Pipeline stored ThreatMatches in JSON files in a GCS folder. The new Spec11ThreatMatches database is up and running now, and this command is used to backfill the table with the historical data stored in the JSON files (starting from January 31st, 2019).


This change is Reviewable

@leginachen leginachen changed the title Command for backfilling the Spec11ThreatMatches table Add the command to backfill the Spec11ThreatMatches table Jul 24, 2020
@leginachen leginachen changed the title Add the command to backfill the Spec11ThreatMatches table Backfill the Spec11ThreatMatches table Jul 24, 2020
Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 49 at r1 (raw file):

import org.json.JSONObject;

public class BackfillSpec11ThreatMatchCommand {

If you make it implement ConfirmingCommand you'll get the prompt and execute methods


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 51 at r1 (raw file):

public class BackfillSpec11ThreatMatchCommand {

  private final GcsUtils gcsUtils;

nit: move this down so it's close to the constructor -- keeps things together and keeps constants separate from the rest of the code


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 53 at r1 (raw file):

  private final GcsUtils gcsUtils;
  private static final FluentLogger logger = FluentLogger.forEnclosingClass();
  private static final YearMonth START_DATE = new YearMonth(2019, 01);

nit: technically these are months, not dates


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 54 at r1 (raw file):

  private static final FluentLogger logger = FluentLogger.forEnclosingClass();
  private static final YearMonth START_DATE = new YearMonth(2019, 01);
  private static final YearMonth END_DATE = new YearMonth(2020, 07);

We can probably just use the current month


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 69 at r1 (raw file):

      ImmutableList<Spec11ThreatMatch> threatMatches =
          getSpec11ThreatMatchesFromFile(spec11ReportFilename);
      for (Spec11ThreatMatch threatMatch : threatMatches) {

Probably worth looking in to the jpaTm() method that saves multiple entities (it could save us from having to open up a new transaction every time)

Potentially also look in to batching, so maybe saving like 50 entities at a time.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 94 at r1 (raw file):

              .setCheckDate(LocalDate.parse(date, ISODateTimeFormat.date()))
              .setRegistrarId(registrarId)
              .setDomainRepoId(getDomainRepoId(domainName, registrarId, new DateTime()))

Generally it's not a good idea to create the DateTime from scratch, since it could vary on how we run this, or just various timing shenanigans. Usually you use one of two options:

  1. Getting the transaction time from a database transaction that you're inside (this doesn't really apply here)
  2. Using a particular reference point. Here, this entry is in the past, right? So we should use the date as it was then.

core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 119 at r1 (raw file):

  // This method has not been tested yet, and there may be an off-by-one error
  private List<YearMonth> getMonthsList(YearMonth start, YearMonth end) {

Can we do something like

ImmutableList.Builder<String> builder = new ImmutableList.Builder();
YearMonth now = ???
YearMonth month = START_MONTH;
while (!month.isAfter(now)) {
  builder.add(String.format(REPORTING_FOLDER_FORMAT, month.toString())
  month = month.plusMonths(1);
}
return builder.build()

as a method to get all the folders that we want to list? Might be nice to avoid having to specify an end date


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 132 at r1 (raw file):

   */
  private String getDateFromFilename(String filename) {
    return filename.substring(22);

Better to use a regex (with a capturing group)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 154 at r1 (raw file):

  }

  private ImmutableList<GcsFilename> getSpec11ReportFilenames(ImmutableList<LocalDate> dateList)

It seems like we're converting to/from dates and strings a few times, probably more than we need to. We should only need to do things like

  • get a list of months in question
  • convert those months to file paths to list
  • list those paths, get all the full file paths of the spec11 files
  • for each path, get all the objects from that path, and use the date from the file name

core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 164 at r1 (raw file):

  }

  private String getDomainRepoId(String domainName, String registrarId, DateTime now) {

I don't think this should be here :)

Copy link
Collaborator

@sarahcaseybot sarahcaseybot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @gbrodman and @leginachen)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 109 at r1 (raw file):

      // Iterate from 1 to size() to skip the header at line 0.
      for (int i = 1; i < reportLines.size(); i++) {
        String date = getDateFromFilename(spec11ReportFilename.getBucketName());

I think in your current implementation you already had this date as a LocalDate at one point when you were constructing the list of GcsFilename. Rather than having to construct the date twice, you could use a map that maps a date to a filename, and then you could call this method with both the date and the filename. This may not apply if you end up reconfiguring how you get the GcsFilenames


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 111 at r1 (raw file):

        String date = getDateFromFilename(spec11ReportFilename.getBucketName());
        for (Spec11ThreatMatch threatMatch : createSpec11ThreatMatches(reportLines.get(i), date)) {
          builder.add(threatMatch);

you shouldnt have to do this in a for loop, you can just addAll the contents of the response from createSpec11ThreatMatches to the builder


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 136 at r1 (raw file):

  private ImmutableList<LocalDate> getDateList() throws IOException {
    List<YearMonth> bucketMonths = getMonthsList(START_DATE, END_DATE);

Rather than a separate method to get a list of months, why not just use a while loop. Something like...

YearMonth yearMonth = START_DATE;
while(yearMonth <= END_DATE){
....
yearMonth.plus(1);
}


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 154 at r1 (raw file):

Previously, gbrodman wrote…

It seems like we're converting to/from dates and strings a few times, probably more than we need to. We should only need to do things like

  • get a list of months in question
  • convert those months to file paths to list
  • list those paths, get all the full file paths of the spec11 files
  • for each path, get all the objects from that path, and use the date from the file name

I agree.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 158 at r1 (raw file):

    ImmutableList.Builder<GcsFilename> spec11ReportFilenames = new ImmutableList.Builder<>();
    for (LocalDate date : dateList) {
      GcsFilename gcsFilename = getGcsFilename(REPORTING_BUCKET, date);

the contents of getGcsFilename() can just be inlined

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @gbrodman and @leginachen)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 49 at r1 (raw file):

Previously, gbrodman wrote…

If you make it implement ConfirmingCommand you'll get the prompt and execute methods

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 54 at r1 (raw file):

Previously, gbrodman wrote…

We can probably just use the current month

Done.

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 12 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 69 at r1 (raw file):

Previously, gbrodman wrote…

Probably worth looking in to the jpaTm() method that saves multiple entities (it could save us from having to open up a new transaction every time)

Potentially also look in to batching, so maybe saving like 50 entities at a time.

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 94 at r1 (raw file):

Previously, gbrodman wrote…

Generally it's not a good idea to create the DateTime from scratch, since it could vary on how we run this, or just various timing shenanigans. Usually you use one of two options:

  1. Getting the transaction time from a database transaction that you're inside (this doesn't really apply here)
  2. Using a particular reference point. Here, this entry is in the past, right? So we should use the date as it was then.

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 109 at r1 (raw file):

Previously, sarahcaseybot wrote…

I think in your current implementation you already had this date as a LocalDate at one point when you were constructing the list of GcsFilename. Rather than having to construct the date twice, you could use a map that maps a date to a filename, and then you could call this method with both the date and the filename. This may not apply if you end up reconfiguring how you get the GcsFilenames

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 111 at r1 (raw file):

Previously, sarahcaseybot wrote…

you shouldnt have to do this in a for loop, you can just addAll the contents of the response from createSpec11ThreatMatches to the builder

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 119 at r1 (raw file):

Previously, gbrodman wrote…

Can we do something like

ImmutableList.Builder<String> builder = new ImmutableList.Builder();
YearMonth now = ???
YearMonth month = START_MONTH;
while (!month.isAfter(now)) {
  builder.add(String.format(REPORTING_FOLDER_FORMAT, month.toString())
  month = month.plusMonths(1);
}
return builder.build()

as a method to get all the folders that we want to list? Might be nice to avoid having to specify an end date

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 132 at r1 (raw file):

Previously, gbrodman wrote…

Better to use a regex (with a capturing group)

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 136 at r1 (raw file):

Previously, sarahcaseybot wrote…

Rather than a separate method to get a list of months, why not just use a while loop. Something like...

YearMonth yearMonth = START_DATE;
while(yearMonth <= END_DATE){
....
yearMonth.plus(1);
}

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 154 at r1 (raw file):

Previously, sarahcaseybot wrote…

I agree.

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 158 at r1 (raw file):

Previously, sarahcaseybot wrote…

the contents of getGcsFilename() can just be inlined

Done.

@lgtm-com
Copy link

lgtm-com bot commented Aug 4, 2020

This pull request introduces 1 alert when merging bee5919 into d7fb609 - view on LGTM.com

new alerts:

  • 1 for Potential input resource leak

Copy link
Collaborator

@sarahcaseybot sarahcaseybot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @gbrodman and @leginachen)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 54 at r1 (raw file):

Previously, leginachen (Legina Chen) wrote…

Done.

Does setting this to just new YearMonth() set it to the current month? If so, can you please add a comment that indicates that?


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 55 at r2 (raw file):

      Pattern.compile("SPEC11_MONTHLY_REPORT_(\\d{4}-\\d{2}-\\d{2})");
  private static final String REPORTING_FOLDER = "domain-registry-reporting/icann/spec11/";
  private HashMap<GcsFilename, LocalDate> filenamesToDates = new HashMap<>();

Typically we use ImmutableMap

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 55 at r2 (raw file):

Previously, sarahcaseybot wrote…

Typically we use ImmutableMap

Done.

@lgtm-com
Copy link

lgtm-com bot commented Aug 4, 2020

This pull request introduces 1 alert when merging b3fc3dd into d7fb609 - view on LGTM.com

new alerts:

  • 1 for Potential input resource leak

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 9 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 48 at r4 (raw file):

public class BackfillSpec11ThreatMatchCommand extends ConfirmingCommand {

We'll also want to override the prompt method -- this should print out a description of the actions that will be done in the execute() method in some way (perhaps just printing out the number of files we'd parse or something quick like that)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 60 at r4 (raw file):

  @Override
  protected String execute() throws IOException {
    ImmutableMap<GcsFilename, LocalDate> filenamesToDates =

possible nit: does this fit all on one line?


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 62 at r4 (raw file):

    ImmutableMap<GcsFilename, LocalDate> filenamesToDates =
        mapFilenamesToDates(START_MONTH, END_MONTH);
    for (GcsFilename spec11ReportFilename : filenamesToDates.keySet()) {

Instead of iterating over the key set, a common way of iterating through maps is the following:

filenamesToDates.entrySet().stream().map(entry -> getSpec11ThreatMatchesFromFile(entry.getKey(), entry.getValue()))
  .forEach(threatMatches -> jpaTm().transact(() -> jpaTm().saveNewOrUpdateAll(threatMatches)));

basically streaming the entry set, then mapping each <key, value> entry into the corresponding list, then calling persist on each of the lists


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 68 at r4 (raw file):

      jpaTm().transact(() -> jpaTm().saveNewOrUpdateAll(threatMatches));
    }
    // TODO: return a result description

Generally you want to return a string that describes a summary of what was done -- successes, failures, etc


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 77 at r4 (raw file):

    JSONArray threatMatchesArray = reportJSON.getJSONArray(Spec11Pipeline.THREAT_MATCHES_FIELD);
    ImmutableList.Builder<Spec11ThreatMatch> threatMatches = ImmutableList.builder();
    for (int i = 0; i < threatMatchesArray.length(); i++) {

super annoying that JSONArray implements Iterable, so we can't use standard iteration mechanisms.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 86 at r4 (raw file):

              .setCheckDate(date)
              .setRegistrarId(reportJSON.getString(Spec11Pipeline.REGISTRAR_CLIENT_ID_FIELD))
              .setDomainRepoId(getDomainRepoId(domainName, date.toDateTimeAtCurrentTime()))

Calling toDateTimeAtCurrentTime seems a bit variable to me. We don't know when the script was run during the day before so it seems better to just use the method that gets it at the start of the day -- it's not going to be right either, but if it's impossible to be right we should be consistent.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 97 at r4 (raw file):

    return LocalDate.parse(matcher.group(0), ISODateTimeFormat.date());
  }

A thought I had: what happens if we run this script multiple times? We'd get get duplicate entries, right? We might want to give some thought as to how to avoid this.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 106 at r4 (raw file):

  }

  private GcsFilename getGcsFilename(String reportingBucket, LocalDate localDate) {

this method can be inlined

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r4.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 54 at r1 (raw file):

Previously, sarahcaseybot wrote…

Does setting this to just new YearMonth() set it to the current month? If so, can you please add a comment that indicates that?

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 48 at r4 (raw file):

Previously, gbrodman wrote…

We'll also want to override the prompt method -- this should print out a description of the actions that will be done in the execute() method in some way (perhaps just printing out the number of files we'd parse or something quick like that)

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 60 at r4 (raw file):

Previously, gbrodman wrote…

possible nit: does this fit all on one line?

I moved this into prompt()


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 62 at r4 (raw file):

Previously, gbrodman wrote…

Instead of iterating over the key set, a common way of iterating through maps is the following:

filenamesToDates.entrySet().stream().map(entry -> getSpec11ThreatMatchesFromFile(entry.getKey(), entry.getValue()))
  .forEach(threatMatches -> jpaTm().transact(() -> jpaTm().saveNewOrUpdateAll(threatMatches)));

basically streaming the entry set, then mapping each <key, value> entry into the corresponding list, then calling persist on each of the lists

Could not stream because of the unhandled IOException


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 68 at r4 (raw file):

Previously, gbrodman wrote…

Generally you want to return a string that describes a summary of what was done -- successes, failures, etc

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 86 at r4 (raw file):

Previously, gbrodman wrote…

Calling toDateTimeAtCurrentTime seems a bit variable to me. We don't know when the script was run during the day before so it seems better to just use the method that gets it at the start of the day -- it's not going to be right either, but if it's impossible to be right we should be consistent.

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 106 at r4 (raw file):

Previously, gbrodman wrote…

this method can be inlined

Done.

@lgtm-com
Copy link

lgtm-com bot commented Aug 6, 2020

This pull request introduces 1 alert when merging fd59a8d into b2a78b5 - view on LGTM.com

new alerts:

  • 1 for Potential input resource leak

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 10 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 40 at r6 (raw file):

import java.util.regex.Pattern;
import javax.inject.Inject;
import jdk.nashorn.internal.ir.annotations.Immutable;

what's up with this import? I don't think it's supposed to be here


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 53 at r6 (raw file):

  private static final YearMonth START_MONTH = new YearMonth(2019, 01);
  private static final YearMonth END_MONTH =
      new YearMonth(); // Constructs a YearMonth with the current year and month

nit: we can fit all this on one line if we're more concise about the comment, e.g. YearMonth() defaults to current month


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 58 at r6 (raw file):

  private static final Pattern FILENAME_PATTERN =
      Pattern.compile("SPEC11_MONTHLY_REPORT_(\\d{4}-\\d{2}-\\d{2})");
  private static final String REPORTING_FOLDER = "domain-registry-reporting/icann/spec11/";

Technically we cannot rely on the bucket being domain-registry-reporting. See how we inject the argument in Spec11RegistrarThreatMatchesParser (and indeed we could potentially refactor this + that class to use the same code for file path creation)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 59 at r6 (raw file):

      Pattern.compile("SPEC11_MONTHLY_REPORT_(\\d{4}-\\d{2}-\\d{2})");
  private static final String REPORTING_FOLDER = "domain-registry-reporting/icann/spec11/";
  private static ImmutableMap<GcsFilename, LocalDate> filenamesToDates;

This shouldn't be static; it's specific to one run of the command


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 66 at r6 (raw file):

  protected String prompt() throws IOException {
    filenamesToDates = mapFilenamesToDates(START_MONTH, END_MONTH);
    return String.format("Parsing through %d files.", numFiles);

it's a code smell to be storing an unnecessary variable numFiles -- let's just use the size of the filenamesToDates map in some way


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 87 at r6 (raw file):

    } else {
      return String.format(
          "Successfully parsed through %d files. We failed to parse through the following files: %s",

much better -- a suggestion could be that if we have a ton of files this might look a bit messy, so perhaps we'd want to print them on multiple lines (so joining on \n, perhaps using Joiner.on('\n').join(someCollection)

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 3 files reviewed, 9 unresolved discussions (waiting on @gbrodman and @sarahcaseybot)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 97 at r4 (raw file):

Previously, gbrodman wrote…

A thought I had: what happens if we run this script multiple times? We'd get get duplicate entries, right? We might want to give some thought as to how to avoid this.

Addressing this with Spec11ThreatMatchDao


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 40 at r6 (raw file):

Previously, gbrodman wrote…

what's up with this import? I don't think it's supposed to be here

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 58 at r6 (raw file):

Previously, gbrodman wrote…

Technically we cannot rely on the bucket being domain-registry-reporting. See how we inject the argument in Spec11RegistrarThreatMatchesParser (and indeed we could potentially refactor this + that class to use the same code for file path creation)

Done. (I also moved the injects above the other variables to be able to reference the reporting bucket when creating the filepath)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 59 at r6 (raw file):

Previously, gbrodman wrote…

This shouldn't be static; it's specific to one run of the command

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 66 at r6 (raw file):

Previously, gbrodman wrote…

it's a code smell to be storing an unnecessary variable numFiles -- let's just use the size of the filenamesToDates map in some way

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 87 at r6 (raw file):

Previously, gbrodman wrote…

much better -- a suggestion could be that if we have a ton of files this might look a bit messy, so perhaps we'd want to print them on multiple lines (so joining on \n, perhaps using Joiner.on('\n').join(someCollection)

Done.

@lgtm-com
Copy link

lgtm-com bot commented Aug 10, 2020

This pull request introduces 1 alert when merging caba9a3 into 26dfdd5 - view on LGTM.com

new alerts:

  • 1 for Potential input resource leak

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 3 files reviewed, 6 unresolved discussions (waiting on @gbrodman, @leginachen, and @sarahcaseybot)


core/src/main/java/google/registry/tools/RegistryTool.java, line 36 at r8 (raw file):

          .put("ack_poll_messages", AckPollMessagesCommand.class)
          .put("backfill_registry_locks", BackfillRegistryLocksCommand.class)
          .put("backfill_spec11threatmatch", BackfillSpec11ThreatMatchCommand.class)

i'd say backfill_spec11_threat_matches, analogous to the backfill registry locks command


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 97 at r4 (raw file):

Previously, leginachen (Legina Chen) wrote…

Addressing this with Spec11ThreatMatchDao

Resolving for now, but we'll probably want to wait until that's merged in to finalize this.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 85 at r8 (raw file):

                spec11ReportFilename, filenamesToDates.get(spec11ReportFilename));
        jpaTm.transact(() -> jpaTm.saveNewOrUpdateAll(threatMatches));
      } catch (IOException e) {

Let's be a bit more generic with the exceptions we catch here, and we don't want to swallow the exception entirely. Let's use a logger, catch generic Exceptions, log the exception, and then add it to the failedFilesBuilder


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 137 at r8 (raw file):

    ImmutableList.Builder<Spec11ThreatMatch> threatMatches = ImmutableList.builder();
    try (InputStream in = gcsUtils.openInputStream(spec11ReportFilename)) {
      InputStreamReader streamReader = new InputStreamReader(in, UTF_8);

You can (and should) have multiple resources in the try-with-resources clause.

https://stackoverflow.com/questions/30553139/close-multiple-resources-with-autocloseable-try-with-resources

that should hopefully help


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 170 at r8 (raw file):

        mappedFilenamesToDates.put(
            new GcsFilename(
                "domain-registry-reporting", Spec11Pipeline.getSpec11ReportFilePath(fileDate)),

pretty sure this should be taken from the injected bucket

but if we already have the filename, I don't think we should need to call Spec11Pipeline.getSpec11ReportFilePath?

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 3 files reviewed, 6 unresolved discussions (waiting on @gbrodman and @sarahcaseybot)


core/src/main/java/google/registry/tools/RegistryTool.java, line 36 at r8 (raw file):

Previously, gbrodman wrote…

i'd say backfill_spec11_threat_matches, analogous to the backfill registry locks command

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 85 at r8 (raw file):

Previously, gbrodman wrote…

Let's be a bit more generic with the exceptions we catch here, and we don't want to swallow the exception entirely. Let's use a logger, catch generic Exceptions, log the exception, and then add it to the failedFilesBuilder

Done.


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 137 at r8 (raw file):

Previously, gbrodman wrote…

You can (and should) have multiple resources in the try-with-resources clause.

https://stackoverflow.com/questions/30553139/close-multiple-resources-with-autocloseable-try-with-resources

that should hopefully help

Didn't know about this, thanks! Let me know if that works


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 170 at r8 (raw file):

Previously, gbrodman wrote…

pretty sure this should be taken from the injected bucket

but if we already have the filename, I don't think we should need to call Spec11Pipeline.getSpec11ReportFilePath?

Good point! Thank you

Copy link
Collaborator

@sarahcaseybot sarahcaseybot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like one of your tests is failing presubmit checks because it is checking for the wrong filename. Please fix.

Reviewed 1 of 2 files at r7.
Reviewable status: 2 of 3 files reviewed, 5 unresolved discussions (waiting on @gbrodman)


core/src/main/java/google/registry/tools/RegistryTool.java, line 36 at r8 (raw file):

Previously, leginachen (Legina Chen) wrote…

Done.

I don' think this is done


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 85 at r8 (raw file):

Previously, leginachen (Legina Chen) wrote…

Done.

I don't think you did this

Copy link
Contributor Author

@leginachen leginachen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 3 files reviewed, 5 unresolved discussions (waiting on @gbrodman and @sarahcaseybot)


core/src/main/java/google/registry/tools/RegistryTool.java, line 36 at r8 (raw file):

Previously, sarahcaseybot wrote…

I don' think this is done

My bad, didn't git add this file


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 85 at r8 (raw file):

Previously, sarahcaseybot wrote…

I don't think you did this

this was done in my last commit (the code was pushed after the comments)

Copy link
Collaborator

@gbrodman gbrodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r9.
Reviewable status: 2 of 3 files reviewed, 3 unresolved discussions (waiting on @gbrodman and @leginachen)


core/src/main/java/google/registry/tools/javascrap/BackfillSpec11ThreatMatchCommand.java, line 149 at r9 (raw file):

        threatMatches.addAll(createSpec11ThreatMatches(reportLines.get(i), date));
      }
      Spec11ThreatMatchDao.deleteEntriesByDate(jpaTm, date);

This seems like an odd place to put this line of code. Let's put it closer to where we're iterating by date, conceptually

for date in dates:
  open transaction
  delete entries from this day
  get + save new entries from this day
  close transaction

that way each day is done transactionally, and you'll never half-finish one

@gbrodman
Copy link
Collaborator

gbrodman commented Dec 7, 2020

Closing in favor of #897

@gbrodman gbrodman closed this Dec 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants