
err: print(parse_product(url)) #1

Open
Nemii1 opened this issue Jul 28, 2022 · 9 comments

Comments


Nemii1 commented Jul 28, 2022

hi, sorry for bothering you, i'm just starting with this scraper. everything works fine until i add

    urls = get_product_links(1)
    for url in urls:
        print(parse_product(url))

after that i get

    Traceback (most recent call last):
      File "/Users/scrap.py", line 29, in <module>
        print(parse_product(url))
      File "/Users/scrap.py", line 17, in parse_product
        r = s.get(url)
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
        return self.request("GET", url, **kwargs)
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/sessions.py", line 573, in request
        prep = self.prepare_request(req)
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/sessions.py", line 484, in prepare_request
        p.prepare(
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/models.py", line 368, in prepare
        self.prepare_url(url, params)

i know its easy for you but i really dont see it :(


Nemii1 commented Jul 29, 2022

well, i don't know why, but after a dumb repair it's working. i added

    urls = get_product_links(1)
    for url in urls:
        print(parse_product("http:" + url))

jhnwr (Owner) commented Jul 29, 2022

hi! did you leave out the http part of your initial url? that looks like the most likely cause, as the code runs fine as is


Nemii1 commented Jul 29, 2022

no, my url looks like

    url = "https://......"

but for some reason the output list of urls looks like

    //www....
    //www....
    //www....


jhnwr commented Jul 29, 2022

are you able to paste your code here or share the link?


Nemii1 commented Jul 29, 2022

    from requests_html import HTMLSession
    import csv

    s = HTMLSession()

    def get_product_links(page):
        url = f"https://www.ceskereality.cz/firmy/elektrikari-elektroinstalace/{page}"
        links = []
        r = s.get(url)
        products = r.html.find("div.k_vypisFirmy2 div.fleft.slast")
        for item in products:
            links.append(item.find("a", first=True).attrs["href"])
        return links

    def parse_product(url):
        r = s.get(url)
        nazev = r.html.find("div.mainTitle_i", first=True).text.strip()
        vizitka = r.html.find("div.k_base_adresa2", first=True).text.strip().replace("\n", ", ")

        product = {
            "nazev": nazev,
            "vizitka": vizitka,
        }
        return product

    urls = get_product_links(1)
    for url in urls:
        print(parse_product(url))


jhnwr commented Jul 29, 2022

thanks. if you check the "href" attribute of the links you are grabbing, they look like this:

    //www.ceskereality.cz/firmy/elektroinstalace-vd/

there is no scheme on these links - no "https://" - which is why it works when you add it in. either add it where you do now, or change it where the links are first collected, like:

    for item in products:
        links.append("https://" + item.find("a", first=True).attrs["href"][2:])
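A more general way to handle these scheme-relative ("//host/path") hrefs is the standard library's `urllib.parse.urljoin`, which also copes with root-relative and already-absolute links. A small sketch, assuming the listing page above as the base url:

```python
from urllib.parse import urljoin

# the page the links were scraped from (supplies the missing scheme)
BASE = "https://www.ceskereality.cz/firmy/elektrikari-elektroinstalace/1"

def absolutize(href):
    # urljoin fills in whatever the href is missing from BASE:
    # "//host/path" gets the scheme, "/path" gets scheme + host,
    # and a full "https://..." href is returned unchanged.
    return urljoin(BASE, href)

print(absolutize("//www.ceskereality.cz/firmy/elektroinstalace-vd/"))
```

This avoids the hard-coded `[2:]` slice, which would silently mangle any href that doesn't start with `//`.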


Nemii1 commented Jul 29, 2022

ok thanks, it works with that. but i still don't know why it grabs it without https. is it some security thing?


jhnwr commented Jul 29, 2022

because when you do this:

    links.append(item.find("a", first=True).attrs["href"])

it gets whatever is in the "href" attribute of that element. in this case it is exactly this:

    //www.ceskereality.cz/firmy/elektroinstalace-vd/

that is a protocol-relative (scheme-relative) url - a browser resolves it using the scheme of the current page, but requests needs an explicit scheme. it's not a security feature, just how this site writes its links. I've not seen that before.
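The truncated traceback at the top of the thread ends inside requests' `prepare_url`; with a scheme-less url like this, requests raises `MissingSchema` at exactly that point. A minimal reproduction - no network is needed, since the error fires while the request is being prepared, before any connection attempt:

```python
import requests

try:
    # a scheme-relative href passed straight to requests
    requests.get("//www.ceskereality.cz/firmy/elektroinstalace-vd/")
except requests.exceptions.MissingSchema as e:
    print(type(e).__name__)  # MissingSchema
```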


Nemii1 commented Jul 29, 2022

oh ok, thank you very much for the help. thanks to your YT videos i can learn this easily, but this href really confused me :D
