Problem writing to href.json #14

Closed
mathouthouthou opened this issue Jul 3, 2022 · 4 comments

Comments

@mathouthouthou

Hello, I have been trying to set up your bot, but I keep running into a problem when writing to the href.json file.

At first, every crawl returned a 405 status code for the results page. I fixed this by randomizing the Scrapy user agent, as described here: https://stackoverflow.com/questions/67401114/how-can-i-use-random-useragent-everytitme-when-i-send-resquest
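For anyone hitting the same 405, here is a minimal sketch of what randomizing the user agent can look like as a Scrapy downloader middleware. The class name, the user-agent strings, and the `immobot.middlewares` module path in the comment are assumptions for illustration and are not taken from this project or from the linked answer.

```python
import random

# Example user-agent strings (illustrative values only; use ones you trust).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a user agent per request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        # Set the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy keep processing the request normally.
        return None


# To enable it, something like this would go in settings.py
# (the module path is an assumption about the project layout):
# DOWNLOADER_MIDDLEWARES = {
#     "immobot.middlewares.RandomUserAgentMiddleware": 400,
# }
```

The priority of 400 is chosen so that this runs before Scrapy's built-in `UserAgentMiddleware`, which only sets the header when it is not already present.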

Now I get a 200 on that request. However, nothing is written to the href.json file. This is the full trace of one run of the scraper:

artsytech-C02FX151MD6R:immobot artsyloaner$ python3 immo.py
2022-07-03 16:42:35 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: immobot)
2022-07-03 16:42:35 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform macOS-11.4-x86_64-i386-64bit
2022-07-03 16:42:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'immobot',
 'LOG_ENABLED': 'true',
 'NEWSPIDER_MODULE': 'immobot.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['immobot.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_8 like Mac OS X) '
               'AppleWebKit/532.2 (KHTML, like Gecko) CriOS/60.0.882.0 '
               'Mobile/75M265 Safari/532.2'}
2022-07-03 16:42:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-07-03 16:42:35 [scrapy.extensions.telnet] INFO: Telnet Password: 8c08fc5f87b827be
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-07-03 16:42:35 [scrapy.core.engine] INFO: Spider opened
2022-07-03 16:42:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-03 16:42:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-03 16:42:35 [filelock] DEBUG: Attempting to acquire lock 4400266544 on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Lock 4400266544 acquired on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Attempting to release lock 4400266544 on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Lock 4400266544 released on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.immobilienscout24.de/robots.txt> (referer: None)
2022-07-03 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=3.0-&price=-2000.0&livingspace=70.0-&exclusioncriteria=swapflat&pricetype=rentpermonth&geocodes=110000000702,110000000801,110000000202,110000000104,110000000401,110000000701&enteredFrom=result_list> (referer: None)
2022-07-03 16:42:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-03 16:42:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1301,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 17384,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.344543,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 7, 3, 14, 42, 36, 190857),
 'log_count/DEBUG': 7,
 'log_count/INFO': 10,
 'memusage/max': 65961984,
 'memusage/startup': 65957888,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 7, 3, 14, 42, 35, 846314)}
2022-07-03 16:42:36 [scrapy.core.engine] INFO: Spider closed (finished)
There was a problem with reading a json formatted object
Traceback (most recent call last):
  File "/Users/artsyloaner/Downloads/immo-master/immobot/immo.py", line 17, in <module>
    data = json.load(data_file)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Every time, the scraped results are an empty array. Do you know what might be the issue?
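The `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` in the trace is what `json.load` raises when the file is empty, which fits the crawl producing no items. As a minimal sketch (the helper name `load_results` is hypothetical and not part of immo.py), the read could be guarded so that an empty or missing file is treated as "no results" instead of crashing:

```python
import json
import os


def load_results(path):
    """Return the parsed JSON from `path`, or [] if there is nothing usable."""
    # A missing or zero-byte file means the crawl wrote no items.
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return []
    with open(path, encoding="utf-8") as data_file:
        try:
            return json.load(data_file)
        except json.JSONDecodeError:
            # Whitespace-only or truncated content: treat as no results.
            return []
```

This only hides the crash. It does not explain why the spider scraped nothing, so the empty result still needs to be investigated on the Scrapy side.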

Thank you

@nickirk
Owner

nickirk commented Jul 4, 2022

As you can see, this project was developed several years ago, so if immobilienscout24 has since implemented countermeasures against bots, this could be the result. You may try wg-gesuchet.de.

If you really want to use it on immobilienscout24, you could start Scrapy on its own in interactive mode, try to scrape the page you are interested in, and see what it returns. Just follow Scrapy's tutorial for beginners on how to use it: https://docs.scrapy.org/en/latest/intro/tutorial.html
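As a rough sketch of what "see what it returns" could look like, the helper below summarizes a response so that a real results page can be told apart from a short block or captcha page. The function name and the choice of fields are assumptions for illustration; it relies only on `response.status`, `response.text`, and `response.css(...).getall()`, which Scrapy's text responses provide.

```python
def summarize_response(response):
    """Collect a few facts about a Scrapy response for quick inspection."""
    # All link targets found on the page; a results page should have many.
    links = response.css("a::attr(href)").getall()
    return {
        "status": response.status,
        "body_chars": len(response.text),
        "link_count": len(links),
        "first_links": links[:5],
    }
```

Inside `scrapy shell "<url>"`, the fetched page is available as `response`, so the function can be pasted in and called as `summarize_response(response)`. A 200 status with a very small body or almost no links would suggest that the site served something other than the listing page.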

@fabikrah

@mathouthouthou did you manage to run the script correctly? I'm stuck at the same part

@mathouthouthou
Author

mathouthouthou commented Aug 18, 2022 via email

@fabikrah

That's a pity. I found this project, https://github.com/orangecoding/fredy, which works like a charm with scrapingant and immoscout24. But it only sends new flats matching the search criteria to a Telegram bot.

@nickirk nickirk closed this as completed Oct 11, 2023