Skipping songs taking longer than fetching one #44

Arsanian · 2018-07-10T10:27:07Z

First of all, thanks for the nice program, seems to work well for the most part.
I'm trying to build a corpus of lyrics for a project at my university, so I try to fetch all the songs of the artists I want to incorporate.
Once the program fetched most of the songs, it seems to find many duplicates and attempts to skip, but skipping takes way longer than fetching a song.
Is there any way to speed up the skipping process?
Best regards.

johnwmillr · 2018-07-10T17:03:08Z

I'm glad you've found the project to be of some use!

When you say "skipping", are you referring to when the program prints Song already in Artist, not adding song.?

Just looking at the code very briefly, the approach I take now is to loop through the entire list of songs already in the artist object each time a new song is added, which probably isn't the best approach. But an obvious answer doesn't jump out to me yet why skipping duplicates would take longer than adding a new song.

How many songs are you adding to the artist object?

Arsanian · 2018-07-11T08:26:32Z

Thanks for the quick response!
I just fetch all songs of one artist by not giving a song limitation.
The terminal prints the following line:
SKIPPING "Party up - radio edit" -- already found in artist collection.
It seems to be skipping a whole lot of songs in the end, taking a pretty long time.
The artists I fetched had about 500 to 900 Songs.

johnwmillr · 2018-07-11T17:03:26Z

Ah, this is embarrassing, my search in SublimeText was case sensitive, and I missed the "SKIPPING" line...

I think the issue is how I'm checking for duplicate songs. I even already have a comment in the api.py function on line 115:

# This takes way too long! It's basically O(n^2), can I do better?

I'm checking every single song in the artist object for duplicates. I think it would speed the process up quite a bit if I searched the songs in reverse (starting with most recently added). I'll try to come up with a fix in the next day or two.

Arsanian · 2018-07-12T13:03:16Z

Ah okay :)
Maybe it would help if you stored the songs in a dict with the songname as a key instead of a list.
Might take a little longer to initiate, but would probably greatly increase the look up speed.
Would be a little bit of a bigger fix though.

npmccord · 2018-07-12T19:53:54Z

I added an additional issue, but I believe it fits alongside this one. Thank you for the excellent tool! Below are my thoughts on how to approach the issue.

1 def songsAreSame(s1, s2):
2 from difflib import SequenceMatcher as sm # For comparing similarity of lyrics
3 seqA = sm(None, s1.lyrics, s2['lyrics'])
4 seqB = sm(None, s2['lyrics'], s1.lyrics)
5 return seqA.ratio() > 0.5 or seqB.ratio() > 0.5

I'm curious as to the purpose of the second SM on line 4 (line 80 in artist.py), wouldn't this be one possible cause of the bottleneck occurring during the JSON writing (line 101 artist.py)? If the second SM is necessary, I believe using a permutation approach to lyric checks could reduce the time to write to file. that is mentioned in the comment above the line.

E.g - A temp list would be created and "Song A" would be compared with "B" and "C", then "A" would be removed from the temp list and "B" would be compared with only "C"

johnwmillr · 2018-07-13T17:12:31Z

Thank you for this thorough feedback, @npmccord!

If I remember correctly, I had to use seqA and seqB because I got different comparison results depending on the order of the song inputs (which seems suspicious).

I'm not sure if I understand your suggestion for a permutation approach. Currently, when attempting to add a new song to the Artist object, the code runs through each song already in the Artist object (songInArtist(), line 113 in artist.py) and compares their lyrics (seqA/seqB) with the new song.

I might be missing something obvious, but how does your permutation approach avoid the need to compare the new song with each preexisting song?

I thought of checking the songs in reverse:

# artist.py line 85
for song in lyrics_to_write['songs'][::-1]:

instead of

for song in lyrics_to_write['songs']:

because the recently added songs are more likely to be duplicates. This approach doesn't avoid looping through the entire song list if there aren't any duplicates though (which is the most common case).

johnwmillr · 2018-07-13T17:14:12Z

@Arsanian I like your suggestion of using a dictionary, but the trouble is we can't rely on exact matches in the song titles (or song lyrics) to catch a duplicate. The SequenceMatcher module compares lyrics similarity and rejects songs whose lyrics are too similar. So I don't think it'd work to just check if a key already exists in the dict.

npmccord · 2018-07-16T14:07:50Z

@johnwmillr I believe I understand what you are saying, and I just arrived at the existing solution in reverse. I definitely will do some digging of my own into the varying SM results as that does seem odd.

Do you think running a function similar to songsareSame but based on song titles prior to checking if the lyrics are the same would yield duplicates faster (with the exception of incidents similar to the beer and trucks article)?

I also looked into the sequence matcher differences and from my preliminary analysis, it doesn't appear to vary by a significant margin. A small sample of 200 songs had 2 instances of a difference greater than .08, but I am digging into larger sets to see there are additional edge cases to be considered.

johnwmillr · 2018-08-01T02:51:07Z

Hopefully PR #47 from @npmccord has sufficiently addressed the issue. Closing for now.

johnwmillr added bug enhancement labels Jul 16, 2018

johnwmillr closed this as completed Aug 1, 2018

johnwmillr mentioned this issue Aug 5, 2018

Speed issues #50

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skipping songs taking longer than fetching one #44

Skipping songs taking longer than fetching one #44

Arsanian commented Jul 10, 2018

johnwmillr commented Jul 10, 2018

Arsanian commented Jul 11, 2018

johnwmillr commented Jul 11, 2018

Arsanian commented Jul 12, 2018

npmccord commented Jul 12, 2018

johnwmillr commented Jul 13, 2018

johnwmillr commented Jul 13, 2018

npmccord commented Jul 16, 2018 •

edited

johnwmillr commented Aug 1, 2018

Skipping songs taking longer than fetching one #44

Skipping songs taking longer than fetching one #44

Comments

Arsanian commented Jul 10, 2018

johnwmillr commented Jul 10, 2018

Arsanian commented Jul 11, 2018

johnwmillr commented Jul 11, 2018

Arsanian commented Jul 12, 2018

npmccord commented Jul 12, 2018

johnwmillr commented Jul 13, 2018

johnwmillr commented Jul 13, 2018

npmccord commented Jul 16, 2018 • edited

johnwmillr commented Aug 1, 2018

npmccord commented Jul 16, 2018 •

edited