chartInfoSoup contents - album info index off by one #2

Closed
brycematsuda opened this issue Jul 6, 2014 · 7 comments · Fixed by #3

Comments

@brycematsuda
Collaborator

First off, thanks so much for doing this. As both a charts geek and a CS major, I've always wanted to present the Hot 100 in a simpler, quick, easy-to-read layout. I was disappointed to learn that the Billboard API had been shut down for a while now... and then I found this a few days ago and was immediately overjoyed.

Anyway, some background on the issue: I've implemented the basic API functionality in a personal web app of mine (which can be found here), where right now it displays the info for the top 10 entries. I also store the info in a SQLite database so the app doesn't have to re-download the same data every time I navigate to the page.

bild3billboard

A couple of hours ago though, while making some adjustments, all the albums for every entry suddenly became null and the compiler obviously wasn't happy about it. I thought it had something to do with my program, but just to check, I ran another unaltered copy of the API script in a separate folder, and all of the albums turned up null as well.

albumnone

I had a feeling that the page's code had changed somehow and the script was now grabbing the wrong thing, so I printed out the contents of chartInfoSoup, and here's what I got.

billboardsoupcontents

If you count the indexes, you can see the album info got pushed over by one by the <br> tag. I changed the index that the album string is read from, from 3 to 4, so lines 77-80 look like this:

if chartInfoSoup.contents[4].string:
    album = chartInfoSoup.contents[4].string.strip()
else:
    album = None

And it seemed to grab the album info normally again. I'm not opening a PR for now, since I'm kinda skeptical that the page will stay this way, but if it's still the same after a few days then I'll likely do so. For now I'm just keeping this as an issue to monitor.

@guoguo12
Owner

guoguo12 commented Jul 6, 2014

Thanks for your input! I'm really glad this project has been useful!

I wasn't able to duplicate the problem above on my computer.

untitled

As you might expect, this is very troubling. About a month ago, there was a pull request targeting a different issue that I also couldn't reproduce. It's like other people are getting different HTML pages from Billboard's servers than me, which shouldn't be happening.

Do me a favor, please—run this script and tell me what you get.

import json
import requests

url = 'http://www.billboard.com/charts/hot-100'
headers_current = {'User-Agent': 'billboard.py (https://github.com/guoguo12/billboard-charts)'}

req = requests.get(url, headers=headers_current)
# Response headers sent back by Billboard's servers
print(json.dumps(dict(req.headers), sort_keys=True, indent=4, separators=(',', ': ')))
# Request headers that were actually sent
print(json.dumps(dict(req.request.headers), sort_keys=True, indent=4, separators=(',', ': ')))

This script sends an HTTP GET request to Billboard's servers and prints the response and request headers to stdout in JSON format. What I got was this.

Not sure if this will help, but it's worth a shot. Let me know if you have any other ideas as to why this might be happening.

@brycematsuda
Collaborator Author

Here's my output: http://pastebin.com/4Gy49tbi

The only notable differences I see are the server ("server": "ECS (cpm/F9B6)") and cache hits ("x-cache-hits": "HIT (5)"), but other than that, the output is pretty much the same. I'm not very familiar with how HTTP requests work at the moment, so I'm not sure what might be happening. I tried fooling with user agents on this site, but both my browser and the Windows FF/Chrome browsers came back with pretty much the same info.

Also ran the script again this morning, and I'm still getting null for every album.
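
For what it's worth, a header comparison like the one above could be scripted instead of eyeballed. This is only a minimal sketch: `diff_headers` is a hypothetical helper, and apart from the two values quoted above, the sample headers are invented for the demo and are not real output from Billboard's servers.

```python
def diff_headers(mine, theirs):
    """Return {name: (my_value, their_value)} for headers that differ.

    Header names are compared case-insensitively.
    """
    a = {k.lower(): v for k, v in mine.items()}
    b = {k.lower(): v for k, v in theirs.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Demo data: "server" and "x-cache-hits" in `mine` are the values quoted
# in this thread; everything else is made up for illustration.
mine = {
    "server": "ECS (cpm/F9B6)",
    "x-cache-hits": "HIT (5)",
    "content-type": "text/html",
}
theirs = {
    "Server": "example-server",
    "X-Cache-Hits": "HIT (1)",
    "Content-Type": "text/html",
}

differences = diff_headers(mine, theirs)
for name, (my_value, their_value) in differences.items():
    print("%s: %r vs %r" % (name, my_value, their_value))
```

Only the headers whose values differ are reported, so identical ones such as the content type drop out of the comparison.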

@guoguo12
Owner

guoguo12 commented Jul 7, 2014

Hmm. Well, I'm not sure where to go from here. I'm not familiar with the intricacies of HTTP either, but I'm guessing the content might vary based on the client's IP address.

I can think of two possible options. We can put something like this in:

if chartInfoSoup.contents[3].string:
    album = chartInfoSoup.contents[3].string.strip()
elif chartInfoSoup.contents[4].string:
    album = chartInfoSoup.contents[4].string.strip()
else:
    album = None
# This might not work for songs without album names on my end.

Alternatively, we can rewrite the code to ignore the line breaks, maybe using regex. Let me know what you think is best.
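
As a rough illustration of the second option without regex, the <br> nodes could be filtered out before indexing. This is only a sketch: `drop_line_breaks`, `node_text`, and `FakeTag` are hypothetical names, the sample lists are made up rather than taken from Billboard's page, and it assumes the extra <br> is the only difference between the two page variants, which hasn't been confirmed.

```python
def drop_line_breaks(contents):
    """Return the nodes in `contents`, skipping any <br> elements."""
    return [node for node in contents if getattr(node, "name", None) != "br"]


def node_text(node):
    """Return the stripped text of a node, or None if it has none."""
    # A plain str has no .string attribute, so fall back to the node itself.
    text = getattr(node, "string", node)
    return text.strip() if text else None


class FakeTag(object):
    """Hypothetical stand-in for a parsed tag, used only in this demo."""

    def __init__(self, name, string=None):
        self.name = name
        self.string = string


# Two made-up variants of the same entry: one with the extra <br>, one without.
with_break = ["\n", FakeTag("a", "Artist"), "\n", FakeTag("br"), "Album\n"]
without_break = ["\n", FakeTag("a", "Artist"), "\n", "Album\n"]

album_a = node_text(drop_line_breaks(with_break)[3])
album_b = node_text(drop_line_breaks(without_break)[3])
print(album_a, album_b)
```

With the breaks removed, the same index is used for both variants, so no if/elif fallback is needed. The index itself would still have to be checked against the real page.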

@brycematsuda
Collaborator Author

I was leaning towards the first option to keep things simple for now. Also, a lot of people seem to think that parsing HTML with regex screams bloody murder, so maybe we should hold off on that, since the Billboard HTML code is pretty big.

It's been about 24 hours or so and I haven't run into any problems, so I'll put up a PR. If anything comes up we can reopen this.

brycematsuda added a commit to brycematsuda/billboard-charts that referenced this issue Jul 7, 2014
@brycematsuda brycematsuda mentioned this issue Jul 7, 2014
guoguo12 added a commit that referenced this issue Jul 7, 2014
@guoguo12
Owner

guoguo12 commented Jul 7, 2014

Merged. Thank you for your help!

I've given you full access to the repository. If there are any fixes or improvements you want to make in the future, feel free to do so.

@brycematsuda
Collaborator Author

Oh wow, I wasn't expecting that, thanks again!

To be honest, I think you've gotten the main stuff nailed down at the moment. The only other feature I was thinking about implementing with the data we can get is determining whether an entry rose or fell in the ranks from the previous week, or whether it's a new entry or re-entry. That should be pretty easy to do since we already have the necessary info to determine it.

@guoguo12
Owner

Actually, that's already sort of included. Each song has attributes lastPos and peakPos for last position and peak position on the chart. There's also a weeks attribute for number of weeks on chart.

image
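
For illustration, the rose/fell/new/re-entry classification could be derived from those attributes roughly like this. It's a sketch under assumptions not confirmed in this thread: that the current position is available as a number (called `rank` here), that lastPos is 0 when the song was not on last week's chart, and that weeks counts the current week. `describe_movement` is a hypothetical helper, not part of billboard.py.

```python
def describe_movement(rank, last_pos, weeks):
    """Classify an entry's movement relative to the previous week.

    Assumes last_pos == 0 means the song was not on last week's chart,
    and that a lower rank number is a better position.
    """
    if last_pos == 0:
        # Not charted last week: first week means new, otherwise it returned.
        return "new" if weeks <= 1 else "re-entry"
    if rank < last_pos:
        return "rose"
    if rank > last_pos:
        return "fell"
    return "steady"


# Made-up sample values, not real chart data.
print(describe_movement(rank=5, last_pos=8, weeks=3))
print(describe_movement(rank=40, last_pos=0, weeks=7))
```

If the real attributes mark a missing previous position differently (for example with None), the first check would need to change accordingly.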
