implemented fake-useragent package #49
Proxies can now be filtered by protocol with an optional parameter in the RequestProxy constructor. For this to work, the parsers have to set the protocol on each proxy object they build. Protocols are stored in the ProxyObject.Protocol enum. PremProxy: parsing was broken because the port numbers were 'encrypted' and no longer stored in the CSS. They are now obtained from a JavaScript file that holds a function with the key-port pairs.
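The protocol filter described above could look roughly like the sketch below. The members of Protocol, the Proxy tuple and the filter_by_protocol helper are assumptions made for illustration; the PR itself only states that protocols live in the ProxyObject.Protocol enum and that the RequestProxy constructor takes an optional filter parameter.

```python
from collections import namedtuple
from enum import Enum


class Protocol(Enum):
    # Assumed members; the real ProxyObject.Protocol enum may differ.
    HTTP = "http"
    HTTPS = "https"
    SOCKS4 = "socks4"
    SOCKS5 = "socks5"


# Hypothetical, simplified stand-in for the proxy objects the parsers build.
Proxy = namedtuple("Proxy", ["ip", "port", "protocol"])


def filter_by_protocol(proxies, protocol=None):
    """Return proxies matching protocol, or all of them when protocol is None."""
    if protocol is None:
        return list(proxies)
    return [proxy for proxy in proxies if proxy.protocol is protocol]


proxies = [
    Proxy("10.0.0.1", 8080, Protocol.HTTP),
    Proxy("10.0.0.2", 443, Protocol.HTTPS),
    Proxy("10.0.0.3", 1080, Protocol.SOCKS5),
]
https_only = filter_by_protocol(proxies, Protocol.HTTPS)
```

Keeping the filter optional means existing callers that pass no protocol still receive every parsed proxy.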
By default, the program now gets user agents from an online, up-to-date database with the help of the fake-useragent package. Reading user agents from a custom local file is still available as a parameter to the UserAgentManager class. Solves pgaref#28.
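A minimal sketch of the two modes, assuming a simplified class: UserAgentSource and its generator parameter are hypothetical names, and the generator callable stands in for the fake-useragent lookup so the example runs without network access.

```python
import os
import random
import tempfile


class UserAgentSource:
    """Hypothetical, simplified stand-in for UserAgentManager."""

    def __init__(self, file=None, generator=None):
        # generator: zero-argument callable returning a user agent string.
        if file is None and generator is None:
            raise ValueError("either file or generator is required")
        self.agent_file = file
        self.generator = generator
        self.useragents = self._load(file) if file else []

    @staticmethod
    def _load(path):
        # One user agent per line; blank lines are skipped.
        with open(path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]

    def get_random_user_agent(self):
        if self.agent_file:
            return random.choice(self.useragents)
        return self.generator()


# Usage: file mode with a throwaway file, then generator mode.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as handle:
    handle.write("UA-one\n\nUA-two\n")
file_source = UserAgentSource(file=handle.name)
generated_source = UserAgentSource(generator=lambda: "UA-generated")
os.remove(handle.name)
```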
| def __init__(self, agent_file=os.path.join(os.path.dirname(__file__), '../data/user_agents.txt')): | ||
| self.agent_file = agent_file | ||
| self.useragents = self.load_user_agents(self.agent_file) | ||
| def __init__(self, fallback=None, file=None): |
How is fallback used here exactly?
The user can specify an arbitrary user agent as a fallback. It would be used on the rare occasion when an error happens in fake-useragent while processing the request (a network error, etc.).
Or would it be better if we explicitly set a default fallback?
Hey @la55u
Just took a better look at the library, my suggestions:
- Let's avoid storing collected data in the OS temp dir (keep everything in memory) by using the cache=False argument
- Also, let's always use a fallback browser (could be the most popular one), as we don't want our library to fail for any reason
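The "never fail" part of this suggestion can be sketched independently of the library. In the sketch below, safe_user_agent, DEFAULT_FALLBACK and failing_provider are hypothetical names, and the provider callable stands in for whatever performs the fake-useragent lookup.

```python
# Assumed default; any widely used browser string would do.
DEFAULT_FALLBACK = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)


def safe_user_agent(provider, fallback=DEFAULT_FALLBACK):
    """Return provider(), or fallback if it raises or returns nothing."""
    try:
        user_agent = provider()
    except Exception:
        return fallback
    return user_agent or fallback


def failing_provider():
    # Simulates the network error case mentioned in the thread.
    raise ConnectionError("user agent database unreachable")


from_failure = safe_user_agent(failing_provider)
from_success = safe_user_agent(lambda: "UA-from-database")
```

In the PR itself the same effect is meant to come from the fallback argument passed to FakeUserAgent, together with cache=False as suggested above; whether both keywords are accepted depends on the installed fake-useragent version.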
| else: | ||
| logger.info('Using fake-useragent package for user agents.') | ||
| fb = fallback | ||
| self.fakeuseragent = FakeUserAgent(fallback=fb) |
| user_agent = random.choice(self.useragents) | ||
| return user_agent.decode('utf-8') | ||
| else: | ||
| return self.fakeuseragent.random |
LGTM - are we sure the fakeuseragent.random is in utf-8 format?
We had issues with that in the past
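One way to sidestep the question is to normalise whatever the source returns. This is a sketch with a hypothetical helper name, to_text; it assumes file-based entries arrive as bytes (as the .decode('utf-8') calls in the diff suggest) and that the package may already return str.

```python
def to_text(user_agent, encoding="utf-8"):
    """Return user_agent as str, decoding it first if it arrived as bytes."""
    if isinstance(user_agent, bytes):
        return user_agent.decode(encoding)
    return str(user_agent)


from_file = to_text(b"Mozilla/5.0 (X11; Linux x86_64)")
from_package = to_text("Mozilla/5.0 (X11; Linux x86_64)")
```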
| def get_first_user_agent(self): | ||
| return self.useragents[0].decode('utf-8') | ||
| return self.useragents[0].decode('utf-8') if self.agent_file else None |
Shall we print a WARNING message when this method is called with fakeuseragent?
Yes, we should, but I just noticed that logging doesn't work in this class. Can you help me figure out why?
Get rid of the logger.setLevel part; this is done in the caller/main class.
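The pattern being suggested can be sketched as follows: the library module creates a named logger and leaves its level unset, and the caller picks the handler and level once. The logger name and both function names are hypothetical; the PR uses logging.getLogger(__name__).

```python
import logging

# Library side: create a named logger and leave its level unset.
logger = logging.getLogger("useragent_sketch")


def warn_file_not_set():
    logger.warning("User-Agents file not set")


# Caller side: choose the handler and level in one place.
def configure_logging(level=logging.INFO):
    root = logging.getLogger()
    if not root.handlers:
        root.addHandler(logging.StreamHandler())
    root.setLevel(level)


configure_logging(logging.WARNING)
warn_file_not_set()
```

Because the module logger has no level of its own, it inherits the effective level from the root logger configured by the caller.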
| def get_last_user_agent(self): | ||
| return self.useragents[-1].decode('utf-8') | ||
| return self.useragents[-1].decode('utf-8') if self.agent_file else None |
| def get_len_user_agent(self): | ||
| return len(self.useragents) | ||
| return len(self.useragents) if self.agent_file else None |
| 'requests >= 2.18.4', | ||
| 'pyOpenSSL >= 17.5.0' | ||
| 'pyOpenSSL >= 17.5.0', | ||
| 'fake-useragent >= 0.1.10' |
| if self.agent_file: | ||
| return self.useragents[0].decode('utf-8') | ||
| else: | ||
| logger.warning('User-Agents file not set') |
Looks good!
We could make the message a bit more descriptive like:
"Fake useragent lib does not support operation blah - change to user-agent file?"
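A more descriptive warning along these lines might look like the sketch below. The message wording, the unsupported_message helper and the standalone function signature are assumptions; in the PR the method lives on the manager class and reads self.useragents and self.agent_file.

```python
import logging

logger = logging.getLogger("useragent_sketch")  # hypothetical logger name


def unsupported_message(operation):
    # Names the operation and tells the user how to enable it.
    return (
        "%s is only available when user agents are loaded from a file; "
        "pass a user agent file to enable it" % operation
    )


def get_first_user_agent(useragents, agent_file):
    """Return the first file-based user agent, or None with a warning."""
    if agent_file:
        return useragents[0].decode("utf-8")
    logger.warning(unsupported_message("get_first_user_agent"))
    return None


with_file = get_first_user_agent([b"UA-one", b"UA-two"], "user_agents.txt")
without_file = get_first_user_agent([], None)
```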
| from fake_useragent import FakeUserAgent | ||
| import logging | ||
| logger = logging.getLogger(__name__) |
Merged! Thanks again for the PR @la55u !!