Removing "context"-based web requests in favor of curl-only web requests #4319
file_get_contents() is simpler, but you are correct, it probably needs to go (I did get rid of many of them a while ago, but not all of them). I even put it in the docs that file_get_contents should not be used - except for actual files. |
I also replaced the context-based request used to get headers with a curl-based request, because the context-based request did not provide the required options, whereas curl did. I don't remember if I submitted that pull request or it was just a local modification of mine. My opinion may be too extreme, but it is that context-based requests should not be used even for single files; curl should be used instead, again for the uniformity of options and easier debugging. In my local copy I used a proxy server to cache replies from the various databases between bot runs, to avoid exhausting API limits in vain, and I found that all requests are easier to manage when they come through a single interface such as curl, because even the proxy settings and the way of accessing the proxy are completely different between context-based requests and curl-based requests. Can you please implement a function to return a curl handle, such as we discussed in #4316, and then I will use this wrapper function? Would you mind if I submit pull requests to replace context-based requests with curl-based requests using this wrapper function? I would also like to thank you separately for your cooperation and friendliness! I have participated in many open-source projects on GitHub. What I am proudest of is that I fixed a bug in OpenSSL key data generation on session negotiation, submitted a pull request, and the maintainers accepted it. I also participated in Ansible, the MySQL data connector, Pyre/Pysa checks, etc., but you are the most cooperative person I have ever encountered. |
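For illustration, here is a minimal sketch of what such a wrapper could look like. The function name, default options, and the environment-variable proxy lookup are assumptions for this example, not the actual design from #4316:

```php
<?php
// Hypothetical wrapper: one place to configure every outgoing request,
// so proxy settings, timeouts, and debugging options stay uniform.
function get_curl_handle(string $url, array $extra_options = []) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for the connection
        CURLOPT_TIMEOUT        => 30,    // overall request timeout in seconds
    ]);
    // Single point for routing everything through a caching proxy
    // (assumed, for this sketch, to come from an environment variable).
    $proxy = getenv('HTTP_PROXY');
    if (is_string($proxy) && $proxy !== '') {
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    // Per-call overrides, e.g. CURLOPT_NOBODY => true for a header-only request.
    curl_setopt_array($ch, $extra_options);
    return $ch;
}

// Example use: a header-only request that would previously have needed a
// stream context with entirely different option names.
$ch = get_curl_handle('https://example.org/', [CURLOPT_NOBODY => true]);
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);
```

With a wrapper like this, proxy use and any future advanced options only need to be added in one place.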
All file_get_contents are gone (other than the ones that actually open files in the test suite). |
@GlazerMann -- this is fantastic news, thank you! I see that the structures for Still, I've noticed that in some cases, when a site reports error code such as 403, the What is the benefit of keeping |
No benefit, other than, as you can see, "semi-dead" DOIs are a pain. Handling NULLs is the real problem. The current code "works", and I would prefer curl, as long as it also works. |
I am rewriting the code slowly right now |
Continuing forward. I have now added arrays of known "NULL" DOIs - one for "it's actually dead" and one for "it's actually alive". That should really speed up the bot. |
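A minimal sketch of the idea (the constant and function names are assumptions for this example, not the bot's actual code): a DOI already on either list skips the slow network check entirely.

```php
<?php
// Placeholder entries only; the real lists live in the bot's source.
const DOIS_ACTUALLY_DEAD  = ['10.0000/placeholder.dead'];   // resolver returns NULL and the DOI really is dead
const DOIS_ACTUALLY_ALIVE = ['10.0000/placeholder.alive'];  // resolver returns NULL but the DOI still works

function known_doi_state(string $doi): ?bool {
    if (in_array($doi, DOIS_ACTUALLY_DEAD, true))  return true;   // dead: no lookup needed
    if (in_array($doi, DOIS_ACTUALLY_ALIVE, true)) return false;  // alive: no lookup needed
    return null;  // unknown: fall back to the normal (slow) resolver query
}
```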
@GlazerMann -- thank you very much for your work!! This NULL processing is a matter of debate which we had before: I proposed considering NULLs that took longer than a timeout as FALSE, and you accepted that pull request; still, you then changed the logic back, and now NULLs take a lot of time to process. You wrote that you are working on dead DOIs. Can you please check whether the following DOI (jpet) is dead? You can try it on the Wikipedia page "Stimulant":
It gave me the following output but took a lot of time to progress. Here is the start of the output:
etc... |
Added to the list of NULL DOIs. I am now logging them and adding them to a list of good/bad. |
The re-tries are now completely removed. |
Wow! That is fantastic news! Thank you very much! You have made my day!
|
Also, I now have a huge (and growing) database in the code of DOIs that return NULL. They all get logged to the CodeCoverage logfile. This should also help a lot. |
I am now logging HDL nulls too. Should be interesting. |
If you get a bunch of NULLs during your runs, please post them here, so I can add them to the lists. Please remove duplicates of course. |
How can I set up logging to properly catch NULLs? |
The NULLs get logged to the CodeCoverage file |
Please see the lines from my CodeCoverage file. Please let me know if you find them useful.
|
Those are useful. On another note, I figured out where the out of memory errors were coming from and fixed that. |
Great news! Where did you experience the out-of-memory errors? With which usage patterns? I never experienced that. Can you please point me to the diff that fixes the out-of-memory errors, so I know for the future? |
Do you have any idea what kind of message this is:
|
The main source of the memory errors was the curl request to web.archive.org. I had to add some code to the curl requests to make sure they stop at a certain point. Multi-gigabyte archive files are a problem. I also unset some variables sooner. The S2 JSON is HUGE. |
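One possible way to cap a download with PHP curl, sketched here with an assumed size limit and a placeholder URL (the actual fix in the bot may differ): abort the transfer from the progress callback once too many bytes have arrived.

```php
<?php
const MAX_DOWNLOAD_BYTES = 50 * 1024 * 1024;  // illustrative 50 MB cap

$ch = curl_init('https://web.archive.org/web/2024/https://example.org/');  // placeholder URL
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER   => true,
    CURLOPT_FOLLOWLOCATION   => true,
    CURLOPT_NOPROGRESS       => false,  // must be false for the progress callback to run
    // Returning a non-zero value aborts the transfer (CURLE_ABORTED_BY_CALLBACK),
    // so a multi-gigabyte archive never ends up fully buffered in memory.
    CURLOPT_PROGRESSFUNCTION => function ($handle, $dl_total, $dl_now, $ul_total, $ul_now) {
        return ($dl_now > MAX_DOWNLOAD_BYTES) ? 1 : 0;
    },
]);
$body = curl_exec($ch);
curl_close($ch);
if ($body === false) {
    $body = '';  // oversized or failed download: treat as unusable
}
```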
Maybe explicitly freeing the curl handle early, like in one of my previous pull requests, is also beneficial for memory use? |
In one of my past pull requests, URL processing was done in stages: the curl handle was freed before starting to analyze the received data, i.e. before the JSON parse. I think we should apply a similar approach now. Thank you for pointing that out. |
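A minimal sketch of that staged approach (placeholder endpoint and variable names are assumptions; the real code queries the S2 and other APIs): the curl handle is released before the memory-heavy JSON decode, and the raw string right after it.

```php
<?php
$ch = curl_init('https://example.org/api/record.json');  // placeholder endpoint
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$raw = curl_exec($ch);

// Stage 1 is over: free the handle *before* any analysis of the response.
curl_close($ch);
unset($ch);

// Stage 2: decode, then drop the raw string so only the parsed data remains.
$data = ($raw === false) ? null : json_decode($raw, true);
unset($raw);
```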
@GlazerMann - please see the contents of my CodeCoverage file since I last sent it to you. Do you know what restore_italics means?
|
restore_italics is related to CrossRef possibly missing italics.
|
I've got some new NULLs during the last few days (I am sending the CodeCoverage file; there were duplicate lines, so I removed them and sorted the lines alphabetically; if you need the original file with duplicates, let me know):
|
I think that we should close this issue because 'Removing "context"-based web requests in favor of curl-only web requests' is resolved. As for the "CodeCoverage" issues, I can supply this information to you separately. |
Please find the lines of the CodeCoverage file that I got since I last sent it to you last week. If you wish, I can send you this data by email in the future; in that case, please send me a message at maxim AT masiutin DOT com. Which way is more convenient for you?
|
Please find the recent CodeCoverage file (with all current master changes applied) since last week:
|
Please find new lines that I got since the last message.
|
Got them |
All added to source code |
Thank you! Since you resolved the task, this issue can be closed. |
I found out that some of these NULL DOIs may be intermittent and may be resolved in the future, such as 10.14341/probl13328, which was once NULL but now works. |
Maybe I should periodically recheck the full list of DOIs that were once NULL. What do you think, @GlazerMann? |
I recheck the bad ones |
Please create new issues every time you upload NULLs. |
Hello, @GlazerMann (this is a copy of the message I've sent at https://en.wikipedia.org/w/index.php?title=User_talk%3AMaxim_Masiutin&diff=1215577072&oldid=1215539721 ). I have submitted two issues on GitHub: I split my CodeCoverage files into two parts, one part containing only NULLs and the other containing all the other lines; I sorted the lines in both parts and removed duplicate lines. I then submitted each part as a separate issue on GitHub. Therefore, you can, for example, process the NULLs first and let the other issues accumulate, so that you can handle them later when you have time. Please let me know whether this format of submitting my CodeCoverage is OK for you. Also, please let me know which frequency you would prefer. Would weekly be OK for you? |
What do you think about the following code cleanup: removing "context"-based web requests in favor of curl-only web requests? That would give a consistent interface that is easier to work with. Everything would be handled via curl, rather than half of the functions via curl and half via "context". The program would be easier to manage, debug, and develop, and easier to extend with new options, such as proxy use, advanced options (if needed), etc.