python 3 issues with working-with-web-pages #407

Closed
fredgibbs opened this issue Apr 4, 2017 · 6 comments

@fredgibbs
Contributor

Hi! I'm enjoying learning to work with text files and web pages using Programming Historian.

I installed Python 3.6 and have run into a couple major differences from how the directions are written. Once I power through the lessons and hopefully do very well in a job interview, I'd be happy to help update the directions with more detail.

For now, this is what I've run into on http://programminghistorian.org/lessons/working-with-web-pages :

  • import urllib2 does not work. StackOverflow kindly suggested the fix "from urllib.request import urlopen"
  • f = open('obo-t17800628-33.html', 'w') no longer works. Again, StackOverflow pointed out that it needs to be opened in binary: "f = open('obo-t17800628-33.html', 'wb')". I'd love to understand that better.

Thank you again for the site! I hope this is helpful info.

All the best,
Cathi

@ianmilligan1
Contributor

ianmilligan1 commented Apr 4, 2017

Here are a few workarounds. The lesson is indeed written for Python 2.7.

To import, use urllib.request as you've noted.

import urllib.request

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'

response = urllib.request.urlopen(url)
webContent = response.read()

print(webContent[0:300])

and

# save-webpage.py

import urllib.request

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'

response = urllib.request.urlopen(url)
webContent = response.read().decode('utf-8')

# webContent is a str after decode(), so text mode ('w') works here
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(webContent)
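
On the question of why 'wb' was needed: in Python 3, response.read() returns bytes, and a file opened in text mode ('w') only accepts str. The sketch below illustrates the two ways around that. It uses a hard-coded bytes value as a stand-in for response.read() so it runs without a network connection; the file names and the sample HTML are made up for illustration.

```python
import os
import tempfile

# Stand-in for response.read(): urlopen(...).read() returns bytes, not str.
raw = "<html><body>caf\u00e9</body></html>".encode("utf-8")

out_dir = tempfile.mkdtemp()
binary_path = os.path.join(out_dir, "page-binary.html")
text_path = os.path.join(out_dir, "page-text.html")

# Option 1: binary mode writes the bytes exactly as received.
with open(binary_path, "wb") as f:
    f.write(raw)

# Option 2: decode to str first, then write in text mode with an explicit encoding.
with open(text_path, "w", encoding="utf-8", newline="") as f:
    f.write(raw.decode("utf-8"))

# Writing bytes to a text-mode file is what fails in Python 3.
try:
    with open(os.path.join(out_dir, "page-broken.html"), "w") as f:
        f.write(raw)
except TypeError as err:
    print("text mode rejects bytes:", err)
```

Option 1 is the 'wb' fix from StackOverflow; Option 2 is what the decode('utf-8') call in save-webpage.py does. Both produce the same file as long as the page really is UTF-8.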

Putting it together:

import urllib.request

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'

response = urllib.request.urlopen(url)
HTML = response.read().decode('utf-8')

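# assumes stripTags() is already defined or imported (e.g. from the obo.py module built in the lesson series)
print(stripTags(HTML))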

and to get the word list, I did

import urllib.request
from bs4 import BeautifulSoup  # the "html5lib" parser used below must also be installed

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'

response = urllib.request.urlopen(url)
HTML = response.read().decode('utf-8')
clip = stripTags(HTML)
text = BeautifulSoup(clip,"html5lib").get_text().lower()

wordlist = text.split()

print(wordlist[0:120])
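
The snippets above rely on stripTags() and BeautifulSoup being available. For anyone who wants to try the word-list step without either, here is a minimal sketch that uses only the standard library's html.parser. The TextExtractor class and strip_tags() function are hypothetical stand-ins written for this example, not the lesson's stripTags(), and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "<html><body><p>Hello <b>Old Bailey</b> reader</p></body></html>"
wordlist = strip_tags(sample).lower().split()
print(wordlist)  # ['hello', 'old', 'bailey', 'reader']
```

Note that this simple version also keeps the contents of script and style elements, so it may return more than the visible text on a real page.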

@mdlincoln
Contributor

Is this closable now that we have the Python warnings?

@acrymble

We used to have comments back when we were on WordPress. It's a shame we've lost that, actually, because this could be the type of thing solved through a comment (I'm not suggesting we implement that now!). But yes, I suppose we can close it.

@lukifer195

lukifer195 commented Jul 30, 2019

from urllib.request import Request, urlopen
url = 'https://dict.laban.vn/find?type=1&query=ch%C3%A2n'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
resource = urlopen(req)
print(resource.read())
content = resource.read().decode(resource.headers.get_content_charset())

Can anyone help me decode the response from that link? I don't know what the cause is; when I tried another link, it worked.
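
The thread does not record which error was seen, so the following is only a guess at two likely causes. First, the snippet calls resource.read() twice, and the second call on the same response returns empty bytes because the body has already been consumed. Second, resource.headers.get_content_charset() returns None when the server does not declare a charset, and decode(None) raises a TypeError. A minimal sketch of a workaround, where decode_body() is a hypothetical helper and UTF-8 is an assumed fallback:

```python
def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset, or a fallback if none was declared."""
    return raw.decode(charset or fallback, errors="replace")


# Intended use with the snippet above (read the body only once):
#   raw = resource.read()
#   content = decode_body(raw, resource.headers.get_content_charset())

# Offline demonstration with a stand-in for the response body.
raw = "ch\u00e2n".encode("utf-8")
print(decode_body(raw, None))      # no charset declared: falls back to UTF-8
print(decode_body(raw, "utf-8"))   # charset declared by the server
```

If the server declares a charset that does not match the actual bytes, errors="replace" substitutes a placeholder character instead of raising an exception.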

@guanbuc

guanbuc commented Dec 5, 2019

Doesn't work.

@svmelton
Contributor

svmelton commented Dec 5, 2019

@guanbuc could you provide a bit more information about what isn't working for you?
