
.NET Core based parser - 2x speedup #114

Merged: 57 commits, Sep 8, 2020

Conversation

@philipmat (Owner) commented Aug 31, 2020

This branch contains a .NET Core based parser which is at least 2x faster than the Python one.

File                                Record count   Python    C#
discogs_20200806_artists.xml.gz     7,046,615      6:22      2:35
discogs_20200806_labels.xml.gz      1,571,873      1:15      0:22
discogs_20200806_masters.xml.gz     1,734,371      3:56      1:57
discogs_20200806_releases.xml.gz    12,867,980     1:45:16   42:38

Note: timings are only comparable within the same row (same OS, machine, and input file); conditions were not held constant across different files.

This alternative parser:

  • exports compatible CSV files (with the exception of track_id, which has been made more resilient but requires a column change);
  • is self-contained: it does not require any program or module to be installed;
  • compresses output using gzip (.gz) rather than the Python parser's bzip2 (.bz2);
  • lacks, as of this PR, a few "nice" features of the main parser (API counts, etc.), but those can be built on quickly.

How to test:

  1. In the repo, visit Actions -> DotNet Build (direct link).
  2. Find and open the latest build. Its "Artifacts" section lists the builds available for multiple platforms.
  3. Download and unzip the build for the applicable platform. It contains two files: a discogs executable and a discogs.pdb support file.
  4. Place these two files in a local folder, then invoke the discogs executable, passing it the path to the Discogs files, for example: ./discogs /tmp/discogs/discogs_20200806_labels.xml.gz
  5. CSV files will be exported to the same folder as the Discogs file, /tmp/discogs in the above example.
  6. To compress the CSV files, pass the --gz argument.
  7. To perform a dry run, outputting only file counts, pass the --dry-run argument.

@ijabz, @berz - do you think you could take a swing at this?

I tested on:

@MuleaneEve commented Sep 1, 2020

After applying these two changes, this program went from taking 22 seconds for "labels" to taking 14 seconds.
"Masters" went from 1:21 to 57 seconds on my PC.

It is possible to make it even faster, but that will require replacing the de-serializer with a hand-written solution, which is a lot of code to write, so it may not be worth it.

@MuleaneEve commented Sep 1, 2020

Good news: I quickly hacked together a de-serializer for "labels".
This is a proof of concept that shows that it is possible to parse all labels in under 4 seconds ;)
The output CSV files are identical.

Just replace the current ParseStreamAsync() with this one:

public async Task ParseStreamAsync(Stream stream)
{
  int objectCount = 0;
  var settings = new XmlReaderSettings
  {
    ConformanceLevel = ConformanceLevel.Fragment,
    Async = true,
  };
  using XmlReader reader = XmlReader.Create(stream, settings);

  await reader.MoveToContentAsync();
  await reader.ReadAsync();
  while (!reader.EOF)
  {
    if (reader.Name == _typeName)
    {
      var lbl = new label();
      while (reader.Read())
      {
        if (reader.IsStartElement(_typeName))
          break;
        if (reader.IsStartElement("id"))
          lbl.id = reader.ReadElementContentAsString();
        if (reader.IsStartElement("name"))
          lbl.name = reader.ReadElementContentAsString();
        if (reader.IsStartElement("contactinfo"))
          lbl.contactinfo = reader.ReadElementContentAsString();
        if (reader.IsStartElement("profile"))
          lbl.profile = reader.ReadElementContentAsString();
        if (reader.IsStartElement("data_quality"))
          lbl.data_quality = reader.ReadElementContentAsString();
        if (reader.IsStartElement("parentLabel"))
          lbl.parentLabel = new parentLabel { id = reader.GetAttribute("id"), name = reader.ReadElementContentAsString() };
        if (reader.IsStartElement("images"))
        {
          var images = new List<image>();
          while (reader.Read() && reader.IsStartElement("image"))
          {
            var image = new image { type = reader.GetAttribute("type"), width = reader.GetAttribute("width"), height = reader.GetAttribute("height") };
            images.Add(image);
          }
          lbl.images = images.ToArray();
        }
        if (reader.IsStartElement("urls"))
        {
          reader.Read();
          var urls = new List<string>();
          while (reader.IsStartElement("url"))
          {
            var url = reader.ReadElementContentAsString();
            if (!string.IsNullOrWhiteSpace(url))
              urls.Add(url);
          }
          lbl.urls = urls.ToArray();
        }
      }
      if (lbl.id is null)
        continue;
      var obj = (T)(object)lbl;
      await _exporter.ExportAsync(obj);

      objectCount++;
      if (objectCount % _throttle == 0) OnSucessfulParse(null, new ParseEventArgs { ParseCount = objectCount });
    }
    else
    {
      await reader.ReadAsync();
    }
  }
  await _exporter.CompleteExportAsync(objectCount);
}

@philipmat (Owner, Author) commented Sep 2, 2020

@MuleaneEve - that was a fantastic improvement, thank you so much for it.
I've started to make some design changes and they're on the dotnet_parser_specific_xml_parsing branch.

Do you think you could take a look at the members parsing for artists - I can't quite nail the XML node parsing for it.

https://github.com/philipmat/discogs-xml2db/blob/dotnet_parser_specific_xml_parsing/alternatives/dotnet/discogs/DiscogsArtist.cs#L111

Edit: I actually think I figured it out in this commit, but it seems clumsy. Could you still please take a look and see if there's a more elegant way?

@MuleaneEve commented Sep 2, 2020

The way you are parsing members looks good. I wonder why the Discogs dump bothers mixing id elements in there...

I'm glad that you were able to integrate these improvements so quickly!
I will take a look later today.

@MuleaneEve commented

Some comments on the new branch:

  1. Would you please include artist.groups? I think that other people using your library (like me) may want that.

// public name[] groups {get;set;}

  2. label.contactinfo was turned into this:

private string contactinfo1;
public string Getcontactinfo()
{
    return contactinfo1;
}
public void Setcontactinfo(string value)
{
    contactinfo1 = value;
}
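For reference, the fix is presumably just a plain auto-property; a minimal sketch, borrowing the parser's lowercase class name:

```csharp
public class label
{
    // A plain auto-property instead of the generated
    // Getcontactinfo()/Setcontactinfo(string) pair.
    public string contactinfo { get; set; }
}
```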

  3. label needs to populate its sublabels:

Currently, we are "lucky" that the <sublabels /> element always comes at the end, so the parser skips it without losing any data.
But it is more correct to parse it explicitly.

Here is my implementation:

I renamed the parentLabel class into labelReference, removed SubId & SubName, then added at the end of Populate():

if (reader.IsStartElement("sublabels"))
{
  reader.Read();
  var sublabelList = new List<labelReference>();
  while (reader.IsStartElement("label"))
  {
    var label = new labelReference
    {
      id = reader.GetAttribute("id"),
      name = reader.ReadElementContentAsString()
    };
    sublabelList.Add(label);
  }
  sublabels = sublabelList.ToArray();
}

With that, we never need to check IsValid(), which was a sign that the parser was missing something. It should be impossible for the Discogs dump to contain invalid data.

@philipmat
Copy link
Owner Author

philipmat commented Sep 2, 2020

  1. I've created Export artist groups #116 for exporting artist groups. Please see if you're OK with my export proposal.
    The current parser doesn't do this either, so I want to achieve 100% parity before taking on new features; it makes it easier to compare files.
  2. Fixed contactinfo1. It was an overly eager vscode helper.
  3. As far as I can tell, the python parser doesn't export sublabels either. Same as point 1 and I've opened Export sublabels #117 for it.
    I'll add your code and keep it commented out. The reason it was using that silly trick with SubId and SubName is that I was trying to do it with the XmlSerializer, which doesn't like multiple classes having the same name. While the new parser doesn't have that limitation, I'd like to keep the structure until this becomes the main parser; then we can make some changes (the lowercase class and property names are highly offensive to my C# sensibilities).

It should be impossible for the Discogs dump to contain invalid data.

Famous last words... Joking aside, it's gotten better (the XML files didn't even use to be well-formed at one point - "Pepperidge Farms meme"); I'd still like to be a bit mindful.

@MuleaneEve commented
OK for the new features.
Still, you should detect <sublabels> and <groups> so you can skip them explicitly; the parser will be more correct that way (and a bit faster).

Regarding invalid data in general, I usually add a lot of validation code in Debug mode; that way, when I get a new data dump file, I validate it once without affecting performance in Release mode.
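That Debug-only validation pattern can be sketched with [Conditional("DEBUG")], so the checks compile out of Release builds entirely. The method and field names here are illustrative, not the parser's actual API:

```csharp
using System.Diagnostics;

public static class DumpValidation
{
    // Call sites of this method are removed by the compiler when the
    // DEBUG symbol is not defined, so Release-mode parsing pays no cost.
    [Conditional("DEBUG")]
    public static void ValidateLabel(string id, string name)
    {
        Debug.Assert(!string.IsNullOrEmpty(id), "label is missing its id");
        Debug.Assert(!string.IsNullOrEmpty(name), "label is missing its name");
    }
}
```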

@philipmat (Owner, Author) commented
@MuleaneEve - I've pushed the masters parser to the dotnet_parser_specific_xml_parsing branch.

In trying to make it work, I've noticed a few peculiarities of this new SAX-style parsing approach. To be honest, I'm completely new to the finer details of XML stream parsing, so these may be obvious "d'oh" moments to someone more experienced.

  1. The new parsing approach seems highly dependent on the order of nodes. Because ReadElementContentAsString moves the reading head, if the next node is not matched by the end of the while/if loop, it will be skipped.
    For example, if the code tests for "year" first and "title" second, but the nodes in the XML document are <title> first and then <year>, it will correctly populate this.title but miss this.year, since at the end of the loop it advances to the next element with while (reader.Read()).
  2. I would also like to make it more resilient against unknown nodes. A good deal of the inner loops are while (reader.IsStartElement("a") || reader.IsStartElement("b")) { ... }. If a node <c> gets introduced before <a>, the parsing loop/subloop ends without ever parsing <a> and <b>.
  3. The parsing also seems sensitive to whether there is any text/whitespace between nodes.
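One way to address both the ordering and the unknown-node issues is to switch on reader.Name inside a single loop and Skip() anything unrecognized. A sketch with illustrative element names, not the actual Discogs schema:

```csharp
using System.IO;
using System.Xml;

public static class ResilientParse
{
    // Reads <title> and <year> children of a <release> element in any
    // order, skipping elements it does not recognize.
    public static (string Title, string Year) ParseRelease(string xml)
    {
        string title = null, year = null;
        using var reader = XmlReader.Create(new StringReader(xml));
        reader.MoveToContent();  // position on <release>
        reader.Read();           // step into its children
        while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
        {
            if (reader.NodeType != XmlNodeType.Element) { reader.Read(); continue; }
            switch (reader.Name)
            {
                case "title": title = reader.ReadElementContentAsString(); break;
                case "year":  year  = reader.ReadElementContentAsString(); break;
                default:      reader.Skip(); break;  // unknown node: ignore it
            }
        }
        return (title, year);
    }
}
```

Because every branch (including the default) advances the reader, the loop cannot stall, and element order no longer matters.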

The speedup is amazing (artists get parsed in about 50 seconds on this machine versus almost 11 minutes for the Python version).

I need to become more comfortable with the SAX-style parsing and its idiosyncrasies and I'm hoping that meanwhile you can help me make that version more resilient.

I'm inclined to close this particular PR as soon as we get some feedback from people testing it, and to move the conversation to a new PR opened against the "even faster speedup" 😄 branch.

@MuleaneEve commented Sep 3, 2020

Looks good to me. The XmlReader-based implementation can be pushed at a later time when it is ready.

Regarding forward-only parsing, it is a lot easier when the elements always appear in the same order: the parsing code can match that order, and you won't even need the while loop. From what I can tell with artists and labels, the order is indeed deterministic. So, assuming that is also the case for masters and releases, you can update your code accordingly. It also becomes possible to notice when an unknown element is introduced; then you can decide whether to implement it or skip it.

For XML documents that are more free-form, the only choice is the while loop. Still, it is possible to make it more resilient.
As a practical example, take a look at this RSS parser from Microsoft: think of its switch on ElementType as a switch on reader.Name; in our case, the default case would handle the unknown elements.

Regarding whitespace, the trick is to replace reader.Read() with a more targeted solution, like:

public static bool MoveToElement(this XmlReader reader)
{
  reader.Read();
  while (reader.NodeType != XmlNodeType.Element && !reader.EOF)
    reader.Skip();
  return !reader.EOF;
}

This is similar to MoveToContent() but focused on elements.

I recommend reading the whole XmlReader documentation; there are indeed a lot of little details to be aware of.
For example, it is also important to be aware of IsEmptyElement when reading lists of items...
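The IsEmptyElement detail bites exactly in list loops like the <urls> one earlier: an empty <urls/> has no end tag, so stepping into it with Read() skips past it. A sketch of a guard, assuming the labels dump's <urls>/<url> shape:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class UrlListParse
{
    // Collects the non-blank <url> children of a <urls> element.
    public static List<string> ReadUrls(string xml)
    {
        var urls = new List<string>();
        using var reader = XmlReader.Create(new StringReader(xml));
        reader.MoveToContent();                  // positioned on <urls>
        if (reader.IsEmptyElement) return urls;  // <urls/>: nothing inside
        reader.Read();                           // step into the children
        while (reader.IsStartElement("url"))
        {
            var url = reader.ReadElementContentAsString();
            if (!string.IsNullOrWhiteSpace(url)) urls.Add(url);
        }
        return urls;
    }
}
```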

@philipmat philipmat closed this Sep 8, 2020
@philipmat philipmat reopened this Sep 8, 2020
@philipmat philipmat merged commit cd0929a into develop Sep 8, 2020
@philipmat philipmat deleted the dotnet_parser branch September 8, 2020 14:34

Successfully merging this pull request may close these issues.

Investigate the opportunities of a .NET-based parser