Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support of http_header property in site config #61

Merged
merged 6 commits into from
Nov 9, 2016

Conversation

Kdecherf
Copy link
Collaborator

@Kdecherf Kdecherf commented Nov 6, 2016

This PR adds the ability to graby to use user-agent provided in site configs instead of default_ua and user_agents from HttpClient.

http_header is extracted from site config and provided to HttpClient#fetch() and HttpClient#getUserAgent() to extract the user-agent, if any. In case of missing value we fall back to the original behavior: check host from user_agents or use default_ua.

There are some issues however, at this time:

  • default_ua is not used anymore as long as a http_header(user-agent) is currently provided in global.txt. A solution could be to ignore the user-agent provided by global, any thoughts?
  • HttpClient does not have access to SiteConfig, some tests began to fail because of the deletion of default user_agents array (e9f2981). I fixed them by providing a custom array with user_agents to the HttpClient instance in the test. Not sure if it's okay for you.

Fixes #49

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 98.523% when pulling 3f9b341 on Kdecherf:http-headers into 57e2757 on j0k3r:master.

@Kdecherf
Copy link
Collaborator Author

Kdecherf commented Nov 6, 2016

Well, nevermind for the first issue. It was a non-tracked manual change in vendor/j0k3r/graby-site-config/global.txt on my side. Anyway, I don't know if you are okay with this behavior: user_agents and default_ua are completely ignored if global.txt provides a http_header(user-agent), which is not the case actually (for now).

Copy link
Owner

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look interesting, thanks for that !
I'll soon test it

* @param bool $skipTypeVerification Avoid mime detection which means, force GET instead of potential HEAD
*
* @return array With keys effective_url, body & headers
*/
public function fetch($url, $skipTypeVerification = false)
public function fetch($url, $http_header = array(), $skipTypeVerification = false)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use httpHeader instead?
All variables are lowerCamelCase.

@@ -341,6 +345,8 @@ public function parseLines(array $lines)
} elseif ((substr($command, -1) == ')') && preg_match('!^([a-z0-9_]+)\((.*?)\)$!i', $command, $match) && $match[1] == 'replace_string') {
array_push($config->find_string, $match[2]);
array_push($config->replace_string, $val);
} elseif ((substr($command, -1) == ')') && preg_match('!^([a-z0-9_]+)\(([a-z0-9_-]+)\)$!i', $command, $match) && $match[1] == 'http_header' && in_array($match[2], array('user-agent'))) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of in_array($match[2], array('user-agent')) can you use a simple 'user-agent' === $match[2]?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used in_array() to make addition of new headers easier. But I can replace it with a strict check if you prefer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will let in_array() for #62

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 98.523% when pulling 51a6852 on Kdecherf:http-headers into 57e2757 on j0k3r:master.

'.wikipedia.org' => 'Mozilla/5.2',
'.fok.nl' => 'Googlebot/2.1',
'getpocket.com' => 'PHP/5.2',
),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove that, we should be sure that these values are in the config file for each domain.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, skip it, I've checked, they are all already defined in the site config 👍

* @param bool $skipTypeVerification Avoid mime detection which means, force GET instead of potential HEAD
*
* @return array With keys effective_url, body & headers
*/
public function fetch($url, $skipTypeVerification = false)
public function fetch($url, $httpHeader = array(), $skipTypeVerification = false)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move that new parameter as third one? If we inject it in the second arg, it'll require to release a new major version because the signature of this public method will change. I know, it'll require to put the false argument for $skipTypeVerification every where you'll have to inject $httpHeader but it'll avoid BC

{
$ua = $this->config['ua_browser'];

if (array_key_exists('user-agent', $httpHeader) && !empty($httpHeader['user-agent'])) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the array_key_exists(), empty() already perform the isset().

'.wikipedia.org' => 'Mozilla/5.2',
'.fok.nl' => 'Googlebot/2.1',
'getpocket.com' => 'PHP/5.2',
),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, skip it, I've checked, they are all already defined in the site config 👍

@j0k3r
Copy link
Owner

j0k3r commented Nov 7, 2016

I fixed them by providing a custom array with user_agents to the HttpClient instance in the test. Not sure if it's okay for you.

Yeah it's ok, you should have an other option to inject it in the fetch() but what you've done allow the fallback to the old behavior to still be tested

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
A static array is added to whitelist authorized headers during parsing.

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
These default options are no longer needed with the use of
http_header(user-agent)

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 98.532% when pulling b25abc6 on Kdecherf:http-headers into cef2fba on j0k3r:master.

Copy link
Owner

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@j0k3r j0k3r merged commit bdd9f7b into j0k3r:master Nov 9, 2016
@Kdecherf Kdecherf deleted the http-headers branch November 14, 2016 10:15
j0k3r added a commit that referenced this pull request Nov 14, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add ability to send HTTP header like user-agent or referer
3 participants