New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skipping songs taking longer than fetching one #44
Comments
I'm glad you've found the project to be of some use! When you say "skipping", are you referring to when the program prints Just looking at the code very briefly, the approach I take now is to loop through the entire list of songs already in the How many songs are you adding to the |
Thanks for the quick response! |
Ah, this is embarrassing, my search in SublimeText was case sensitive, and I missed the "SKIPPING" line... I think the issue is how I'm checking for duplicate songs. I even already have a comment in the # This takes way too long! It's basically O(n^2), can I do better? I'm checking every single song in the |
Ah okay :) |
I added an additional issue, but I believe it fits alongside this one. Thank you for the excellent tool! Below are my thoughts on how to approach the issue. 1 def songsAreSame(s1, s2): I'm curious as to the purpose of the second SM on line 4 (line 80 in artist.py), wouldn't this be one possible cause of the bottleneck occurring during the JSON writing (line 101 artist.py)? If the second SM is necessary, I believe using a permutation approach to lyric checks could reduce the time to write to file. that is mentioned in the comment above the line. E.g - A temp list would be created and "Song A" would be compared with "B" and "C", then "A" would be removed from the temp list and "B" would be compared with only "C" |
Thank you for this thorough feedback, @npmccord! If I remember correctly, I had to use I'm not sure if I understand your suggestion for a permutation approach. Currently, when attempting to add a new song to the I might be missing something obvious, but how does your permutation approach avoid the need to compare the new song with each preexisting song? I thought of checking the songs in reverse: # artist.py line 85
for song in lyrics_to_write['songs'][::-1]: instead of for song in lyrics_to_write['songs']: because the recently added songs are more likely to be duplicates. This approach doesn't avoid looping through the entire song list if there aren't any duplicates though (which is the most common case). |
@Arsanian I like your suggestion of using a dictionary, but the trouble is we can't rely on exact matches in the song titles (or song lyrics) to catch a duplicate. The |
@johnwmillr I believe I understand what you are saying, and I just arrived at the existing solution in reverse. I definitely will do some digging of my own into the varying SM results as that does seem odd. Do you think running a function similar to songsareSame but based on song titles prior to checking if the lyrics are the same would yield duplicates faster (with the exception of incidents similar to the beer and trucks article)? I also looked into the sequence matcher differences and from my preliminary analysis, it doesn't appear to vary by a significant margin. A small sample of 200 songs had 2 instances of a difference greater than .08, but I am digging into larger sets to see there are additional edge cases to be considered. |
First of all, thanks for the nice program, seems to work well for the most part.
I'm trying to build a corpus of lyrics for a project at my university, so I try to fetch all the songs of the artists I want to incorporate.
Once the program fetched most of the songs, it seems to find many duplicates and attempts to skip, but skipping takes way longer than fetching a song.
Is there any way to speed up the skipping process?
Best regards.
The text was updated successfully, but these errors were encountered: