Connections are too fragile #3

Closed
rgaudin opened this issue Oct 27, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@rgaudin
Member

rgaudin commented Oct 27, 2020

Here's the output of my last run:

INFO     ---------   url: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Book%3A_Introduction_to_Linear_Time-Invariant_Dynamic_Systems_for_Students_of_Engineering_(Hallauer)/03%3A_Mechanical_Units%2C_Low-Order_Mechanical_Systems%2C_and_Simple_Transient_Responses_of_First_Order_Systems/3.08%3A_Chapter_3_Homework
INFO     ----- Course Index title: 4: Frequency Response of First Order Systems, Transfer Functions, and General Method for Derivation of Frequency Response
INFO     -----    url: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Book%3A_Introduction_to_Linear_Time-Invariant_Dynamic_Systems_for_Students_of_Engineering_(Hallauer)/04%3A_Frequency_Response_of_First_Order_Systems%2C_Transfer_Functions%2C_and_General_Method_for_Derivation_of_Frequency_Response
INFO     ----- Course Index title: 4.1: Definition of Frequency Response
INFO     -----    url: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Book%3A_Introduction_to_Linear_Time-Invariant_Dynamic_Systems_for_Students_of_Engineering_(Hallauer)/04%3A_Frequency_Response_of_First_Order_Systems%2C_Transfer_Functions%2C_and_General_Method_for_Derivation_of_Frequency_Response/4.01%3A_Definition_of_Frequency_Response
INFO     --------- Chapter: 4.1: Definition of Frequency Response
INFO     ---------   url: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Book%3A_Introduction_to_Linear_Time-Invariant_Dynamic_Systems_for_Students_of_Engineering_(Hallauer)/04%3A_Frequency_Response_of_First_Order_Systems%2C_Transfer_Functions%2C_and_General_Method_for_Derivation_of_Frequency_Response/4.01%3A_Definition_of_Frequency_Response
INFO     ----- Course Index title: 4.2: Response of a First Order System to a Suddenly Applied Cosine
INFO     -----    url: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Book%3A_Introduction_to_Linear_Time-Invariant_Dynamic_Systems_for_Students_of_Engineering_(Hallauer)/04%3A_Frequency_Response_of_First_Order_Systems%2C_Transfer_Functions%2C_and_General_Method_for_Derivation_of_Frequency_Response/4.02%3A_Response_of_a_First_Order_System_to_a_Suddenly_Applied_Cosine
Error with connection ('HTTPSConnectionPool(host='eng.libretexts.org', port=443): Read timed out. (read timeout=5)'); about to perform retry 1 of 5.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 436, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 518, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/local/lib/python3.8/http/client.py", line 458, in read
    n = self.readinto(b)
  File "/usr/local/lib/python3.8/http/client.py", line 502, in readinto
    n = self.fp.readinto(b)
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 575, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 540, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./sushichef.py", line 1414, in <module>
    chef.main()
  File "/usr/local/lib/python3.8/site-packages/ricecooker/chefs.py", line 289, in main
    self.run(args, options)
  File "/usr/local/lib/python3.8/site-packages/ricecooker/chefs.py", line 279, in run
    self.pre_run(args, options)
  File "./sushichef.py", line 1310, in pre_run
    channel_tree = self.scrape(args, options)
  File "./sushichef.py", line 1390, in scrape
    for collection_node in collections.to_node():
  File "./sushichef.py", line 157, in to_node
    yield collection.to_node()
  File "./sushichef.py", line 190, in to_node
    topic.units()
  File "./sushichef.py", line 269, in units
    course_index.index(build_path(base_path + [chapter_link.text]))
  File "./sushichef.py", line 484, in index
    result = course_index.index(
  File "./sushichef.py", line 484, in index
    result = course_index.index(
  File "./sushichef.py", line 484, in index
    result = course_index.index(
  File "./sushichef.py", line 490, in index
    chapter = Chapter(course_link.text, course_link_href)
  File "./sushichef.py", line 638, in __init__
    self.soup = self.to_soup()
  File "./sushichef.py", line 587, in to_soup
    document = download(self.source_id)
  File "./sushichef.py", line 1259, in download
    document = downloader.read(source_id, loadjs=loadjs, session=sess)
  File "/usr/local/lib/python3.8/site-packages/ricecooker/utils/downloader.py", line 113, in read
    return response.content
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 829, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
INFO     Deleting chef temp files at /app/.ricecooker-temp

Looks like the connection handling might be too fragile and may benefit from a better retry/wait mechanism; or it might be a different issue that's poorly reported.
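A minimal sketch of what such a retry/wait mechanism could look like, assuming the download can be wrapped in a single callable. Note that in the traceback above the `ChunkedEncodingError` is raised while reading `response.content`, so the retry has to cover the body read and not only the initial request. The names `fetch_with_retry`, `fetch`, `base_delay` and `max_delay` are hypothetical and not part of ricecooker or `sushichef.py`. The default `retry_on=(OSError,)` covers `ConnectionResetError` and `TimeoutError`; as far as I know the `requests` exceptions also derive from `IOError`, but treat that as an assumption to verify.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=5, base_delay=1.0, max_delay=60.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), retrying on transient errors with exponential backoff.

    `fetch` is any callable that returns the fully read body or raises.
    """
    attempt = 0
    while True:
        try:
            return fetch(url)
        except retry_on:
            attempt += 1
            if attempt > retries:
                raise  # out of retries: surface the last error unchanged
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))


if __name__ == "__main__":
    # Stand-in for a real download: fails twice, then succeeds.
    calls = {"n": 0}

    def flaky(url):
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionResetError(104, "Connection reset by peer")
        return b"<html>ok</html>"

    print(fetch_with_retry(flaky, "https://example.org/", sleep=lambda s: None))
```

Growing the wait between attempts gives an overloaded server time to recover, and the small random jitter keeps several workers from retrying in lockstep.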

@rgaudin rgaudin added the bug Something isn't working label Oct 27, 2020
@kelson42 kelson42 pinned this issue Nov 2, 2020
@rgaudin
Member Author

rgaudin commented Nov 17, 2020

New, different connection robustness issue while trying to get the bio subject:

INFO     ---------   url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/45%3A_Population_and_Community_Ecology/45.7%3A_Behavioral_Biology_-_Proximate_and_Ultimate_Causes_of_Behavior
INFO     ----- Course Index title: 45.E: Population and Community Ecology (Exercises)
INFO     -----    url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/45%3A_Population_and_Community_Ecology/45.E%3A_Population_and_Community_Ecology_(Exercises)
INFO     --------- Chapter: 45.E: Population and Community Ecology (Exercises)
INFO     ---------   url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/45%3A_Population_and_Community_Ecology/45.E%3A_Population_and_Community_Ecology_(Exercises)
INFO     ----- Course Index title: 46: Ecosystems
INFO     -----    url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems
INFO     ----- Course Index title: 46.0: Prelude to Ecosystems
INFO     -----    url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.0%3A_Prelude_to_Ecosystems
INFO     --------- Chapter: 46.0: Prelude to Ecosystems
INFO     ---------   url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.0%3A_Prelude_to_Ecosystems
INFO     ----- Course Index title: 46.1: Ecology of Ecosystems
INFO     -----    url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.1%3A_Ecology_of_Ecosystems
INFO     --------- Chapter: 46.1: Ecology of Ecosystems
INFO     ---------   url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.1%3A_Ecology_of_Ecosystems
INFO     ----- Course Index title: 46.2: Energy Flow through Ecosystems
INFO     -----    url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.2%3A_Energy_Flow_through_Ecosystems
INFO     --------- Chapter: 46.2: Energy Flow through Ecosystems
INFO     ---------   url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.2%3A_Energy_Flow_through_Ecosystems
ERROR    HTTPSConnectionPool(host='bio.libretexts.org', port=443): Read timed out. (read timeout=20)
INFO     ----- Course Index title: 46.3: Biogeochemical Cycles
INFO     -----    url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.3%3A_Biogeochemical_Cycles
INFO     Error: 504 Server Error: Gateway Time-out for url: https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_General_Biology_(OpenStax)/8%3A_Ecology/46%3A_Ecosystems/46.3%3A_Biogeochemical_Cycles
ERROR    HTTPSConnectionPool(host='bio.libretexts.org', port=443): Read timed out. (read timeout=20)
INFO     Retrying
Traceback (most recent call last):
  File "./sushichef.py", line 1426, in <module>
    chef.main()
  File "/usr/local/lib/python3.8/site-packages/ricecooker/chefs.py", line 289, in main
    self.run(args, options)
  File "/usr/local/lib/python3.8/site-packages/ricecooker/chefs.py", line 279, in run
    self.pre_run(args, options)
  File "./sushichef.py", line 1321, in pre_run
    channel_tree = self.scrape(args, options)
  File "./sushichef.py", line 1401, in scrape
    for collection_node in collections.to_node():
  File "./sushichef.py", line 163, in to_node
    yield collection.to_node()
  File "./sushichef.py", line 196, in to_node
    topic.units()
  File "./sushichef.py", line 280, in units
    course_index.index(build_path(base_path + [hashed(chapter_link.text)]))
  File "./sushichef.py", line 495, in index
    result = course_index.index(
  File "./sushichef.py", line 495, in index
    result = course_index.index(
  File "./sushichef.py", line 495, in index
    result = course_index.index(
  [Previous line repeated 1 more time]
  File "./sushichef.py", line 425, in index
    retry += 1
UnboundLocalError: local variable 'retry' referenced before assignment
INFO     Deleting chef temp files at /app/.ricecooker-temp
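The final `UnboundLocalError` is a bug in the retry code itself rather than a network failure: `retry += 1` runs in a scope where `retry` was never given an initial value. The actual code around line 425 of `sushichef.py` is not shown here, so the following is only a hypothetical reconstruction of the pattern (`index_buggy`, `index_fixed`, `fetch` and `max_retries` are made-up names), with the counter initialised before the loop as one possible fix.

```python
def index_buggy(fetch, max_retries=3):
    while True:
        try:
            return fetch()
        except OSError:
            # `retry` is assigned in this function, so Python treats it as a
            # local variable; it was never initialised, so this line raises
            # UnboundLocalError instead of counting the retry.
            retry += 1
            if retry >= max_retries:
                raise


def index_fixed(fetch, max_retries=3):
    retry = 0  # initialise the counter before the first attempt
    while True:
        try:
            return fetch()
        except OSError:
            retry += 1
            if retry >= max_retries:
                raise  # give up and surface the last connection error
```

With the buggy variant the first failed request kills the run with an unrelated-looking error, which matches the log above: the read timeout is logged, "Retrying" is printed, and the crash follows immediately.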

@Miniland1333

In my similar applications, I've noticed this issue when sending lots of requests to LibreTexts. Would you be able to test whether slowing down the scraping (by adding pauses) and then distributing the workload across multiple nodes works better? This would help determine whether the issue is the fragility of quickly sending many requests from a single node, or a global error that occurs when the backend is overwhelmed.
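A rough sketch of the "adding pauses" half of that experiment, assuming a single-process scraper that calls a throttle before every request. The `Throttle` class and its parameters are hypothetical, not something the scraper already has, and the sketch does not address distributing work across nodes.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the rest
                # of the interval, then re-read the clock.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.2)
    for url in ["page-1", "page-2", "page-3"]:  # placeholder URLs
        throttle.wait()
        print("would fetch", url)
```

Running the same crawl with a few different `min_interval` values and comparing the number of timeouts and 504 responses would show whether request rate from one node is the trigger.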

@rgaudin
Member Author

rgaudin commented Apr 8, 2021

Outdated; fixed

@rgaudin rgaudin closed this as completed Apr 8, 2021
@rgaudin rgaudin unpinned this issue Apr 8, 2021

3 participants