
err: print(parse_product(url)) #1

Open
Nemii1 opened this issue Jul 28, 2022 · 9 comments

Comments


Nemii1 commented Jul 28, 2022

hi, sorry for bothering you, i'm just starting with this scraper. everything works fine until i add

    urls = get_product_links(1)
    for url in urls:
        print(parse_product(url))

after that i get

    Traceback (most recent call last):
      File "/Users/scrap.py", line 29, in <module>
        print(parse_product(url))
      File "/Users/scrap.py", line 17, in parse_product
        r = s.get(url)
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
        return self.request("GET", url, **kwargs)
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/sessions.py", line 573, in request
        prep = self.prepare_request(req)
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/sessions.py", line 484, in prepare_request
        p.prepare(
      File "/Users/scrap/venv/lib/python3.8/site-packages/requests/models.py", line 368, in prepare
        self.prepare_url(url, params)

i know its easy for you but i really dont see it :(


Nemii1 commented Jul 29, 2022

well, i don't know why, but after a dumb repair it's working. i added

    urls = get_product_links(1)
    for url in urls:
        print(parse_product("http:" + url))

jhnwr (Owner) commented Jul 29, 2022

hi! did you leave out the http part of your initial url? that looks like the most likely cause, as the code runs fine as is


Nemii1 commented Jul 29, 2022

no, my url looks like

    url = "https://......"

but for some reason the output list of urls looks like

    //www....
    //www....
    //www....


jhnwr commented Jul 29, 2022

are you able to paste your code here or share the link?


Nemii1 commented Jul 29, 2022

    from requests_html import HTMLSession
    import csv

    s = HTMLSession()

    def get_product_links(page):
        url = f"https://www.ceskereality.cz/firmy/elektrikari-elektroinstalace/{page}"
        links = []
        r = s.get(url)
        products = r.html.find("div.k_vypisFirmy2 div.fleft.slast")
        for item in products:
            links.append(item.find("a", first=True).attrs["href"])
        return links

    def parse_product(url):
        r = s.get(url)
        nazev = r.html.find("div.mainTitle_i", first=True).text.strip()
        vizitka = r.html.find("div.k_base_adresa2", first=True).text.strip().replace("\n", ", ")

        product = {
            "nazev": nazev,
            "vizitka": vizitka,
        }
        return product

    urls = get_product_links(1)
    for url in urls:
        print(parse_product(url))


jhnwr commented Jul 29, 2022

thanks. if you check the "href" attribute of the links you are grabbing, they look like this:

    //www.ceskereality.cz/firmy/elektroinstalace-vd/

there is no scheme on these links - no "https://" - which is why it works when you add it in. either add it where you do now, or change it where the links are first collected, like:

    for item in products:
        links.append("https://" + item.find("a", first=True).attrs["href"][2:])
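A more general way to handle these scheme-relative ("//host/path") hrefs is the standard library's `urllib.parse.urljoin`, which also copes with root-relative and already-absolute links. A small sketch, assuming the listing page above as the base url:

```python
from urllib.parse import urljoin

# the page the links were scraped from (supplies the missing scheme)
BASE = "https://www.ceskereality.cz/firmy/elektrikari-elektroinstalace/1"

def absolutize(href):
    # urljoin fills in whatever the href is missing from BASE:
    # "//host/path" gets the scheme, "/path" gets scheme + host,
    # and a full "https://..." href is returned unchanged.
    return urljoin(BASE, href)

print(absolutize("//www.ceskereality.cz/firmy/elektroinstalace-vd/"))
```

This avoids the hard-coded `[2:]` slice, which would silently mangle any href that doesn't start with `//`.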


Nemii1 commented Jul 29, 2022

ok thanks, it works with that. but i still don't know why it grabs it without https. is it some security thing?


jhnwr commented Jul 29, 2022

because when you do this:

    links.append(item.find("a", first=True).attrs["href"])

it gets whatever is in the "href" attribute of that element. in this case it is exactly this:

    //www.ceskereality.cz/firmy/elektroinstalace-vd/

that is a protocol-relative (scheme-relative) url - a browser resolves it using the scheme of the current page, but requests needs an explicit scheme. it's not a security feature, just how this site writes its links. I've not seen that before.
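The truncated traceback at the top of the thread ends inside requests' `prepare_url`; with a scheme-less url like this, requests raises `MissingSchema` at exactly that point. A minimal reproduction - no network is needed, since the error fires while the request is being prepared, before any connection attempt:

```python
import requests

try:
    # a scheme-relative href passed straight to requests
    requests.get("//www.ceskereality.cz/firmy/elektroinstalace-vd/")
except requests.exceptions.MissingSchema as e:
    print(type(e).__name__)  # MissingSchema
```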


Nemii1 commented Jul 29, 2022

oh ok, thank you very much for the help. thanks to your YT videos i can learn this easily, but this href really confused me :D
