Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skipping songs taking longer than fetching one #44

Closed
Arsanian opened this issue Jul 10, 2018 · 9 comments
Closed

Skipping songs taking longer than fetching one #44

Arsanian opened this issue Jul 10, 2018 · 9 comments

Comments

@Arsanian
Copy link

First of all, thanks for the nice program, seems to work well for the most part.
I'm trying to build a corpus of lyrics for a project at my university, so I try to fetch all the songs of the artists I want to incorporate.
Once the program fetched most of the songs, it seems to find many duplicates and attempts to skip, but skipping takes way longer than fetching a song.
Is there any way to speed up the skipping process?
Best regards.

@johnwmillr
Copy link
Owner

I'm glad you've found the project to be of some use!

When you say "skipping", are you referring to when the program prints Song already in Artist, not adding song.?

Just looking at the code very briefly, the approach I take now is to loop through the entire list of songs already in the artist object each time a new song is added, which probably isn't the best approach. But an obvious answer doesn't jump out to me yet why skipping duplicates would take longer than adding a new song.

How many songs are you adding to the artist object?

@Arsanian
Copy link
Author

Thanks for the quick response!
I just fetch all songs of one artist by not giving a song limitation.
The terminal prints the following line:
SKIPPING "Party up - radio edit" -- already found in artist collection.
It seems to be skipping a whole lot of songs in the end, taking a pretty long time.
The artists I fetched had about 500 to 900 Songs.

@johnwmillr
Copy link
Owner

Ah, this is embarrassing, my search in SublimeText was case sensitive, and I missed the "SKIPPING" line...

I think the issue is how I'm checking for duplicate songs. I even already have a comment in the api.py function on line 115:

# This takes way too long! It's basically O(n^2), can I do better?

I'm checking every single song in the artist object for duplicates. I think it would speed the process up quite a bit if I searched the songs in reverse (starting with most recently added). I'll try to come up with a fix in the next day or two.

@Arsanian
Copy link
Author

Ah okay :)
Maybe it would help if you stored the songs in a dict with the songname as a key instead of a list.
Might take a little longer to initiate, but would probably greatly increase the look up speed.
Would be a little bit of a bigger fix though.

@npmccord
Copy link
Contributor

I added an additional issue, but I believe it fits alongside this one. Thank you for the excellent tool! Below are my thoughts on how to approach the issue.

1 def songsAreSame(s1, s2):
2 from difflib import SequenceMatcher as sm # For comparing similarity of lyrics
3 seqA = sm(None, s1.lyrics, s2['lyrics'])
4 seqB = sm(None, s2['lyrics'], s1.lyrics)
5 return seqA.ratio() > 0.5 or seqB.ratio() > 0.5

I'm curious as to the purpose of the second SM on line 4 (line 80 in artist.py), wouldn't this be one possible cause of the bottleneck occurring during the JSON writing (line 101 artist.py)? If the second SM is necessary, I believe using a permutation approach to lyric checks could reduce the time to write to file. that is mentioned in the comment above the line.

E.g - A temp list would be created and "Song A" would be compared with "B" and "C", then "A" would be removed from the temp list and "B" would be compared with only "C"

@johnwmillr
Copy link
Owner

Thank you for this thorough feedback, @npmccord!

If I remember correctly, I had to use seqA and seqB because I got different comparison results depending on the order of the song inputs (which seems suspicious).

I'm not sure if I understand your suggestion for a permutation approach. Currently, when attempting to add a new song to the Artist object, the code runs through each song already in the Artist object (songInArtist(), line 113 in artist.py) and compares their lyrics (seqA/seqB) with the new song.

I might be missing something obvious, but how does your permutation approach avoid the need to compare the new song with each preexisting song?

I thought of checking the songs in reverse:

# artist.py line 85
for song in lyrics_to_write['songs'][::-1]:

instead of

for song in lyrics_to_write['songs']:

because the recently added songs are more likely to be duplicates. This approach doesn't avoid looping through the entire song list if there aren't any duplicates though (which is the most common case).

@johnwmillr
Copy link
Owner

@Arsanian I like your suggestion of using a dictionary, but the trouble is we can't rely on exact matches in the song titles (or song lyrics) to catch a duplicate. The SequenceMatcher module compares lyrics similarity and rejects songs whose lyrics are too similar. So I don't think it'd work to just check if a key already exists in the dict.

@npmccord
Copy link
Contributor

npmccord commented Jul 16, 2018

@johnwmillr I believe I understand what you are saying, and I just arrived at the existing solution in reverse. I definitely will do some digging of my own into the varying SM results as that does seem odd.

Do you think running a function similar to songsareSame but based on song titles prior to checking if the lyrics are the same would yield duplicates faster (with the exception of incidents similar to the beer and trucks article)?

I also looked into the sequence matcher differences and from my preliminary analysis, it doesn't appear to vary by a significant margin. A small sample of 200 songs had 2 instances of a difference greater than .08, but I am digging into larger sets to see there are additional edge cases to be considered.

@johnwmillr
Copy link
Owner

Hopefully PR #47 from @npmccord has sufficiently addressed the issue. Closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants