Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle 'moved' files #4

Closed
johnmhoran opened this issue Oct 28, 2017 · 21 comments
Closed

Handle 'moved' files #4

johnmhoran opened this issue Oct 28, 2017 · 21 comments

Comments

@johnmhoran
Copy link
Member

johnmhoran commented Oct 28, 2017

This will need some thinking, but we will want some way to tell if a file has been 'moved' between the new and old scans of some codebase.

This means the sha1 should be matching, but the path would not be. There are also cases where the same file could be present in multiple locations.

We may need to index the files by sha1, similar to what we did in determine_delta

@johnmhoran johnmhoran self-assigned this Oct 28, 2017
@johnmhoran
Copy link
Member Author

[@mjherzog comment] In most move cases, I would expect many lower-level path elements to be the same so we are really looking at path similarities from file name back up the tree.

@johnmhoran
Copy link
Member Author

[@MaJuRG comment] If a file has the same filename and sha1 but not same path, it should be considered moved. we will have to consider collisions in this case though.

@steven-esser
Copy link
Contributor

Moved files may be in either the Add or Removed buckets, so this would be a good place to start when thinking about a possible fix.

johnmhoran added a commit that referenced this issue Jan 8, 2018
  * Added new method to recategorize 'added' and 'removed' files as
    'moved'.
  * Modified related code to accommodate new category, including in
    'Delta.to_dict()', which changes the JSON output.
  * CSV output will be modified once we’ve settled on the column
    headings and data to display for the 'moved' files.
  * Fixed failing tests, added three new tests.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran
Copy link
Member Author

@MaJuRG I’ve deleted the modified code in DeltaCode.determine_delta() and created a new method, DeltaCode.determine_moved(), which uses a dictionary of removed files, keyed by their sha1 attribute, to identify and move matching files from the added and removed categories to a new moved category.

The DeltaCode constructor and the DeltaCode.get_stats() and DeltaCode.to_dict() methods now include the moved category, as does Delta.to_dict(). I’ve modified the latter to display attributes for the new_file/old_file to enable the user to distinguish which moved files would otherwise have been categorized as added or removed, but the resulting JSON display is a bit verbose. Here’s a sample excerpt from the JSON output for a file that would otherwise have been categorized as added:

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 0,
    "modified": 0,
    "moved": 3,
    "removed": 0,
    "unmodified": 7
  },
  "deltas": [
    {
      "category": "moved",
      "new_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
      "old_sha1": "",
      "new_path": "b/a4.py",
      "old_path": "",
      "new_name": "a4.py",
      "old_name": "",
      "new_type": "file",
      "old_type": "",
      "new_size": 200,
      "old_size": ""
    },
. . .

Perhaps a better approach would be a single new attribute denoting added/removed or new/old. I haven’t yet modified the CSV-generation code but will once we’ve settled on the column headings and data to display for the moved files.

In addition to fixing a group of failing tests, I’ve added three new tests. I also plan to add some CSV tests once we’ve settled on what to display for the moved files, and I’d appreciate your suggestions on additional tests I could add to thoroughly vet the determine_moved() method.

I’m available for an uberconf whenever it’s convenient for you.

@steven-esser
Copy link
Contributor

Ok, let me play around with this on my end.

@steven-esser
Copy link
Contributor

@johnmhoran After playing around with this, one of the first things I noticed is that we should have a single 'moved' object for an added/removed pair.

It looks like we have two separate moved objects being created at this time. For how we do output, etc this is not ideal. our 'moved' delta object should have both the new_ and old_ attributes filled out.

Hopefully that makes sense.

@johnmhoran
Copy link
Member Author

@MaJuRG That was my original goal, but after working with my 3 test codebases I realized that pairing an added and removed is not always straightforward. For example, suppose a.py is removed from the a directory in old codebase and added to the b and the c directories in the new codebase. Do we pair old a/a.py with new b/a.py or new c/a.py? And what do we do with the second added copy of a.py?

It was in light of this potential complexity that I chose to treat each added and removed file as a separate moved file, thinking that we could then display them to the user as potential pairs by their common sha1, name and size attributes. How would you suggest handling this scenario?

@steven-esser
Copy link
Contributor

We should probably just focus on the easiest cases for now, and pair/create moved objects for those.

This means, after we index, we should only look at places where len(added_index[sha1]) == 1 and len(removed_index[sha1]) == 1. Then we know that this is the only place on both sides that this file can exist, so it is a singular move.

Once we can reliable calculate moves for the most simplest of cases, we can think about how to handle these other scenarios.

@steven-esser
Copy link
Contributor

In other words, because we are indexing by sha1, there is the potential to have an arbitrary number of files that have the same sha1 value. Hence why it is not straightforward to handle moves, as you mentioned above.

So if we only look at the places where sha1 index corresponds to a single file, then we dont have to worry.

Yes, this will not catch all the possible moves, but it will get a good number of them.

@johnmhoran
Copy link
Member Author

I'll refactor accordingly.

@johnmhoran
Copy link
Member Author

@MaJuRG What data do you want to display in the CSV output for the moved deltas? Current CSV columns are Type of delta, Path, Name, Type and Size.

Here's a sample excerpt from the current JSON output for moved and unmodified Delta objects:

{
  "deltacode_version": "0.0.1.beta",
  "deltacode_stats": {
    "added": 0,
    "modified": 0,
    "moved": 1,
    "removed": 0,
    "unmodified": 7
  },
  "deltas": [
    {
      "category": "moved",
      "new_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
      "old_sha1": "6f71666c46446c29d3f45feef5419ae76fb86a5b",
      "new_path": "b/a4.py",
      "old_path": "a/a4.py",
      "new_name": "a4.py",
      "old_name": "a4.py",
      "new_type": "file",
      "old_type": "file",
      "new_size": 200,
      "old_size": 200
    },
    {
      "category": "unmodified",
      "path": "a/a3.py",
      "name": "a3.py",
      "type": "file",
      "size": 200
    },

@steven-esser
Copy link
Contributor

We can treat new_path as the 'Path' attribute for the output. All the other fields (name type size) can be taken from the new side.

Honestly, all we really need is the old_path attribute in our output for moved files i.e.:

deltas: [
  {
    'category': 'moved',
    'path': 'some/new/path/',
    'old_path': 'some/old/path',
    'name': 'name',
    'type': 'type',
    'size': 200,
  }
]

just to keep it consistent for now.

@johnmhoran
Copy link
Member Author

And corresponding columns for the CSV output?

@steven-esser
Copy link
Contributor

You can add it to the end for now, so it doesn't clutter up the majority of the results.

@johnmhoran
Copy link
Member Author

OK.

johnmhoran added a commit that referenced this issue Jan 11, 2018
  * For now, we ignore 'added'/'removed' matches comprising more than
    one 'removed' and one 'added' file.
  * Modified structure of 'moved' Delta object.
  * Modified CSV output generation.
  * Fixed failing tests, added new tests.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Jan 15, 2018
  * Convert utility function 'index_delta_files()' to DeltaCode method
    'index_deltas()'.
  * Index Delta objects rather than their respective File objects.
  * Create new test codebases and add new tests.
  * Modify existing tests to include new 'moved' Delta as necessary.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Jan 15, 2018
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
steven-esser pushed a commit that referenced this issue Jan 15, 2018
  * Convert utility function 'index_delta_files()' to DeltaCode method
    'index_deltas()'.
  * Index Delta objects rather than their respective File objects.
  * Create new test codebases and add new tests.
  * Modify existing tests to include new 'moved' Delta as necessary.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran
Copy link
Member Author

@MaJuRG FYI, working on the dummy Delta objects and tests for the revised check_moved() method, haven't yet figured out how to structure.

Have tried several approaches. For example, the following approach throws an error -- TypeError: unbound method check_moved() must be called with DeltaCode instance as first argument (got str instance instead):

    def test_DeltaCode_check_moved_no_sha1_match(self):
        added_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
        added_deltas = [
                OrderedDict([('category', 'added'), ('path', u'b/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
            ]
        removed_sha1 = '1234566c46446c29d3f45feef5419ae76fb86a5b'
        removed_deltas =  [
                OrderedDict([('category', 'removed'), ('path', u'a/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
            ]

        result = deltacode.DeltaCode.check_moved(added_sha1, added_deltas, removed_sha1, removed_deltas)

        print('\nresult = {}\n'.format(result))

I think I'm stumbling on how to handle the self variable in the check_moved() method.

@johnmhoran
Copy link
Member Author

@MaJuRG I've moved check_moved() from a DeltaCode method to a utility in utils.py but I'm still encountering errors with the structure of my dummy Delta objects. Current error:

. . .
>       if added_deltas[0].new_file.name == removed_deltas[0].old_file.name:
E       AttributeError: 'dict' object has no attribute 'new_file'

Current test structure (similar errors with the alternative constructions of added_deltas and removed_deltas):

    def test_DeltaCode_check_moved_1_sha1_match(self):
        added_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
        added_deltas = [
                # OrderedDict([('category', 'added'), ('path', u'b/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
                {
                'old_file': {'sha1': '', 'name': '', 'original_path': '', 'licenses': [], 'path': '', 'type': '', 'size': ''},
                'category': 'removed',
                'new_file': {'sha1': u'6f71666c46446c29d3f45feef5419ae76fb86a5b', 'name': u'a4.py', 'original_path': u'1_file_moved_old/a/a4.py', 'path': u'a/a4.py', 'type': u'file', 'size': 200}
                }
            ]
        removed_sha1 = '6f71666c46446c29d3f45feef5419ae76fb86a5b'
        removed_deltas =  [
                # OrderedDict([('category', 'removed'), ('path', u'a/a4.py'), ('name', u'a4.py'), ('type', u'file'), ('size', 200)])
                {
                'old_file': {'sha1': u'6f71666c46446c29d3f45feef5419ae76fb86a5b', 'name': u'a4.py', 'original_path': u'1_file_moved_old/a/a4.py', 'path': u'a/a4.py', 'type': u'file', 'size': 200},
                'category': 'removed',
                'new_file': {'sha1': '', 'name': '', 'original_path': '', 'licenses': [], 'path': '', 'type': '', 'size': ''}
                }
            ]

        result = utils.check_moved(added_sha1, added_deltas, removed_sha1, removed_deltas)

        print('\nresult = {}\n'.format(result))

@steven-esser
Copy link
Contributor

@johnmhoran You are using dicts instead of Delta objects. You have to create actual Delta objects in order to access .new_file or old_file

@steven-esser
Copy link
Contributor

This test for example creates a single delta object:

def test_Delta_create_object_modified(self):
        new = models.File({'path': 'path/modified.txt', 'sha1': 'a'})
        old = models.File({'path': 'path/modified.txt', 'sha1': 'b'})

        delta = deltacode.Delta(new, old, 'modified')

        assert delta.new_file.path == 'path/modified.txt'
        assert delta.new_file.sha1 == 'a'
        assert delta.old_file.path == 'path/modified.txt'
        assert delta.old_file.sha1 == 'b'
        assert delta.category == 'modified'

@johnmhoran
Copy link
Member Author

Thanks @MaJuRG -- I thought I needed to create a "hand-made" Delta object.

johnmhoran added a commit that referenced this issue Jan 17, 2018
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Jan 17, 2018
  * Converted 'check_moved()' from DeltaCode method to a utility.
  * Added four tests for 'check_moved()'.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@steven-esser
Copy link
Contributor

The basics of this have been implemented with the merge of #42. Closing.

steven-esser pushed a commit that referenced this issue Feb 17, 2021
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants