Problem writing to href.json #14

mathouthouthou · 2022-07-03T14:58:36Z

Hello, I have been trying to set up your bot, but I keep running into an issue when writing to the href.json file.

First I was getting a 405 status code on the results page on every crawl. I fixed this by randomizing the Scrapy user agent, as described here: https://stackoverflow.com/questions/67401114/how-can-i-use-random-useragent-everytitme-when-i-send-resquest
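For readers who hit the same 405, here is a minimal sketch of what randomizing the user agent could look like. It is not necessarily what the linked answer does: the `USER_AGENTS` pool and the `random_user_agent_settings` helper are hypothetical names, and only the `USER_AGENT` setting itself (visible in the log below) comes from Scrapy.

```python
import random

# Hypothetical pool of user-agent strings; replace with values of your choice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]


def random_user_agent_settings(rng=random):
    """Return a settings override with a randomly chosen USER_AGENT."""
    return {"USER_AGENT": rng.choice(USER_AGENTS)}


settings = random_user_agent_settings()
# Assumption: this dict would then be handed to Scrapy, for example via
# CrawlerProcess(settings=settings). That step is left out so the sketch
# runs without Scrapy installed.
print(settings["USER_AGENT"])
```

Picking the value once per run matches the log, where a single `USER_AGENT` shows up under the overridden settings. Rotating it per request would need a downloader middleware instead.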

Now I get a 200 on that request. However, nothing is written to the href.json file. This is the full trace of one iteration of the scraper:

artsytech-C02FX151MD6R:immobot artsyloaner$ python3 immo.py
2022-07-03 16:42:35 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: immobot)
2022-07-03 16:42:35 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform macOS-11.4-x86_64-i386-64bit
2022-07-03 16:42:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'immobot',
 'LOG_ENABLED': 'true',
 'NEWSPIDER_MODULE': 'immobot.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['immobot.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_8 like Mac OS X) '
               'AppleWebKit/532.2 (KHTML, like Gecko) CriOS/60.0.882.0 '
               'Mobile/75M265 Safari/532.2'}
2022-07-03 16:42:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-07-03 16:42:35 [scrapy.extensions.telnet] INFO: Telnet Password: 8c08fc5f87b827be
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-07-03 16:42:35 [scrapy.core.engine] INFO: Spider opened
2022-07-03 16:42:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-03 16:42:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-03 16:42:35 [filelock] DEBUG: Attempting to acquire lock 4400266544 on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Lock 4400266544 acquired on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Attempting to release lock 4400266544 on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Lock 4400266544 released on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.immobilienscout24.de/robots.txt> (referer: None)
2022-07-03 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=3.0-&price=-2000.0&livingspace=70.0-&exclusioncriteria=swapflat&pricetype=rentpermonth&geocodes=110000000702,110000000801,110000000202,110000000104,110000000401,110000000701&enteredFrom=result_list> (referer: None)
2022-07-03 16:42:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-03 16:42:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1301,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 17384,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.344543,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 7, 3, 14, 42, 36, 190857),
 'log_count/DEBUG': 7,
 'log_count/INFO': 10,
 'memusage/max': 65961984,
 'memusage/startup': 65957888,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 7, 3, 14, 42, 35, 846314)}
2022-07-03 16:42:36 [scrapy.core.engine] INFO: Spider closed (finished)
There was a problem with reading a json formatted object
Traceback (most recent call last):
  File "/Users/artsyloaner/Downloads/immo-master/immobot/immo.py", line 17, in <module>
    data = json.load(data_file)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Every time, I see an empty array in the scraped results. Do you know what might be the issue?

Thank you
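The `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` in the traceback above is what `json.load` raises when it is given empty input, which is consistent with href.json being empty after the crawl. A minimal sketch of a more tolerant loader follows; `load_hrefs` is a hypothetical helper, not part of the project, and it only hides the crash, not the reason the spider scraped nothing.

```python
import json
import os
import tempfile


def load_hrefs(path):
    """Load a JSON value from path; return [] if the file is missing, empty, or invalid."""
    try:
        with open(path, encoding="utf-8") as data_file:
            text = data_file.read()
    except FileNotFoundError:
        return []
    if not text.strip():
        # Empty input is what triggers "Expecting value: line 1 column 1 (char 0)".
        return []
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return []


# Demonstration with a temporary empty file standing in for href.json.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "href.json")
    open(empty_path, "w").close()
    print(load_hrefs(empty_path))  # prints []
```

With a guard like this the script would carry on with an empty list, so the missing scrape results still need to be investigated separately.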


nickirk · 2022-07-04T19:24:40Z

As you can see, this project was developed several years ago, so if immobilienscout24 has implemented countermeasures against bots, this could be the result. You may try wg-gesuchet.de.

If you really want to use it on immobilienscout24, you could start Scrapy on its own in interactive mode, scrape the page you are interested in, and see what it returns. Just follow Scrapy's beginner tutorial on how to use it: https://docs.scrapy.org/en/latest/intro/tutorial.html
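A similar check can be roughed out with the standard library alone: take the HTML the site returned and list the links it contains. This is only a sketch. The `LinkCollector` class, the `extract_hrefs` helper, and the sample markup are hypothetical, and the real page structure is not known from this thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical response body, standing in for what the site returned.
sample = '<html><body><a href="/expose/1">Flat</a><a>no link</a></body></html>'
print(extract_hrefs(sample))  # prints ['/expose/1']
```

If the saved response body yields no listing links even though the request returned 200, the page served to the bot probably differs from what a browser sees, which would fit the countermeasures guess above.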

fabikrah · 2022-08-17T22:12:55Z

@mathouthouthou did you manage to run the script correctly? I'm stuck at the same part

mathouthouthou · 2022-08-18T09:30:31Z

I gave up!


fabikrah · 2022-08-18T09:39:53Z

That's a pity. I found this project, https://github.com/orangecoding/fredy, which works like a charm with scrapingant and immoscout24. However, it only sends new flats matching your search criteria to a Telegram bot.

nickirk closed this as completed Oct 11, 2023
