
The process of generating random user agents is abnormally time-consuming. #25

Closed
xffxff opened this issue May 17, 2023 · 8 comments

@xffxff
Contributor

xffxff commented May 17, 2023


I added logs before and after random_user_agent() and found that its processing time even exceeded 10 seconds. Is this expected?


@neon-mmd
Owner

neon-mmd commented May 17, 2023

I added logs before and after random_user_agent() and found that its processing time even exceeded 10 seconds. Is this expected?

This was unexpected 🙂 After testing it on my side, it takes about 7 seconds for me, so it seems to depend on the performance of the system. One option we can try: change the caching option from false to true when generating the user agent, and let's see if it improves the speed 🙂.


The file that needs to be changed is located under src/search_results_handler/user_agent.rs.
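For reference, a minimal sketch of the suggested change (assuming the `fake_useragent` builder code shown later in this thread), flipping only the caching option from `false` to `true`:

```rust
use fake_useragent::{Browsers, UserAgentsBuilder};

pub fn random_user_agent() -> String {
    UserAgentsBuilder::new()
        .cache(true) // was `.cache(false)`: cache the scraped list under the dir below
        .dir("/tmp")
        .thread(1)
        .set_browsers(
            Browsers::new()
                .set_chrome()
                .set_safari()
                .set_edge()
                .set_firefox()
                .set_mozilla(),
        )
        .build()
        .random()
        .to_string()
}
```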

@neon-mmd
Owner

So after studying and digging deep into the crate itself, I found that it actually fetches data from an upstream website and scrapes it to get the required user agents, which is why it causes a delay. It also looks like the project has been abandoned: the last commit was about 5 years ago, which is a very long time for an open source repository.

Also, enabling the cache option did improve speed slightly, by 2-3 seconds. Still, I think a delay of around 5 seconds is acceptable, because some random delay between requests helps evade IP blocking. I could reduce the random delay I added in the code from 1-10 seconds to 1-5 seconds to improve speed. What do you say @xffxff?

Also, in the future we might need to either explore an alternative to this crate or implement our own 😄.

@xffxff
Contributor Author

xffxff commented May 19, 2023

Also, enabling the cache option did improve speed slightly, by 2-3 seconds. Still, I think a delay of around 5 seconds is acceptable, because some random delay between requests helps evade IP blocking. I could reduce the random delay I added in the code from 1-10 seconds to 1-5 seconds to improve speed.

@neon-mmd Hmm, do we have to insert a delay between different requests? This may conflict with our lightning-fast goal. Additionally, when there are many concurrent search requests, even with a delay there will still be a lot of requests hitting the engine at the same time.

@xffxff
Contributor Author

xffxff commented May 19, 2023

pub fn random_user_agent() -> String {
    UserAgentsBuilder::new()
        .cache(false)
        .dir("/tmp")
        .thread(1)
        .set_browsers(
            Browsers::new()
                .set_chrome()
                .set_safari()
                .set_edge()
                .set_firefox()
                .set_mozilla(),
        )
        .build()
        .random()
        .to_string()
}

I believe it is unnecessary to construct a new UserAgents object for every request, as all requests can share the same instance. Instead, you can call user_agents.random() to obtain a random user agent string for each request. Here is an example of how this could be implemented:

// Construct the UserAgents object once when the server starts
let user_agents = UserAgentsBuilder::new()
    .cache(false)
    .dir("/tmp")
    .thread(1)
    .set_browsers(
        Browsers::new()
            .set_chrome()
            .set_safari()
            .set_edge()
            .set_firefox()
            .set_mozilla(),
    )
    .build();
...

// Retrieve a random user agent string in aggregator.rs
let user_agent = user_agents.random().to_string();

@alamin655
Collaborator

alamin655 commented May 19, 2023

I think @xffxff is right. Here is my implementation using the lazy_static crate:

use fake_useragent::{Browsers, UserAgents, UserAgentsBuilder};
use lazy_static::lazy_static;

lazy_static! {
    static ref USER_AGENTS: UserAgents = {
        UserAgentsBuilder::new()
            .cache(false)
            .dir("/tmp")
            .thread(1)
            .set_browsers(
                Browsers::new()
                    .set_chrome()
                    .set_safari()
                    .set_edge()
                    .set_firefox()
                    .set_mozilla(),
            )
            .build()
    };
}

/// A function to generate a random user agent to improve privacy of the user.
///
/// # Returns
///
/// A randomly generated user agent string.
pub fn random_user_agent() -> String {
    USER_AGENTS.random().to_string()
}

@neon-mmd
Owner

@neon-mmd Hmm, do we have to insert a delay between different requests? This may conflict with our lightning-fast goal. Additionally, when there are many concurrent search requests, even with a delay there will still be a lot of requests hitting the engine at the same time.

No, actually we need it. If we do not add a random delay between requests, then especially for large-scale server use cases, where a server has thousands of users and creates a lot of traffic, the upstream search engines may effectively get DDoSed, which is not good, and they might ban the IP that caused it. But I can see one option: a config option like production_use which, when enabled, inserts a random delay after, say, every 4 concurrent requests, and when disabled either reduces the random delays or removes them completely. This would be very helpful for small-scale use, for example if you are hosting on your home server just for your own use. What do you say @xffxff @alamin655?
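A minimal sketch of how such a toggle could behave (the `production_use` name comes from the proposal above; the `request_delay` helper and its RNG parameter are hypothetical, added here only for illustration):

```rust
use std::time::Duration;

/// Hypothetical helper for the proposed `production_use` option:
/// when enabled, return a random 1-5 second delay to insert between
/// batches of outgoing requests; when disabled, return no delay.
/// The RNG is passed in as a closure so the logic stays testable.
fn request_delay(production_use: bool, rng: &mut impl FnMut() -> u64) -> Option<Duration> {
    if production_use {
        // Map the raw random value into the 1-5 second range.
        Some(Duration::from_secs(1 + rng() % 5))
    } else {
        None
    }
}
```

In the disabled case the caller would simply skip sleeping, which matches the "home server, own use" scenario described above.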

@neon-mmd
Owner

I think @xffxff is right. Here is my implementation using the lazy_static crate:

use fake_useragent::{Browsers, UserAgents, UserAgentsBuilder};
use lazy_static::lazy_static;

lazy_static! {
    static ref USER_AGENTS: UserAgents = {
        UserAgentsBuilder::new()
            .cache(false)
            .dir("/tmp")
            .thread(1)
            .set_browsers(
                Browsers::new()
                    .set_chrome()
                    .set_safari()
                    .set_edge()
                    .set_firefox()
                    .set_mozilla(),
            )
            .build()
    };
}

/// A function to generate a random user agent to improve privacy of the user.
///
/// # Returns
///
/// A randomly generated user agent string.
pub fn random_user_agent() -> String {
    USER_AGENTS.random().to_string()
}

This looks good 👍. But after doing some research to see whether there are better and faster implementations than this, I found that lazy_static seems to be a bit slow, and there is an even better and faster crate for the same purpose called once_cell. It has also been merged into std::lazy and is currently available as an experimental feature in the nightly version, so I think once_cell should be the way forward. What do you say @alamin655?
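For illustration, the lazy_static version above could be rewritten with once_cell roughly like this (a sketch, assuming the same `fake_useragent` builder API shown earlier in the thread):

```rust
use fake_useragent::{Browsers, UserAgents, UserAgentsBuilder};
use once_cell::sync::Lazy;

// Built lazily on first access, then shared by all requests.
static USER_AGENTS: Lazy<UserAgents> = Lazy::new(|| {
    UserAgentsBuilder::new()
        .cache(false)
        .dir("/tmp")
        .thread(1)
        .set_browsers(
            Browsers::new()
                .set_chrome()
                .set_safari()
                .set_edge()
                .set_firefox()
                .set_mozilla(),
        )
        .build()
});

/// A function to generate a random user agent to improve privacy of the user.
pub fn random_user_agent() -> String {
    USER_AGENTS.random().to_string()
}
```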

Here are some links to follow:

@xffxff
Contributor Author

xffxff commented May 21, 2023

No, actually we need it. If we do not add a random delay between requests, then especially for large-scale server use cases, where a server has thousands of users and creates a lot of traffic, the upstream search engines may effectively get DDoSed, which is not good, and they might ban the IP that caused it. But I can see one option: a config option like production_use which, when enabled, inserts a random delay after, say, every 4 concurrent requests, and when disabled either reduces the random delays or removes them completely. This would be very helpful for small-scale use, for example if you are hosting on your home server just for your own use.

@neon-mmd Thank you for the explanation! I think you are right, and having a config option like production_use would be helpful.

neon-mmd pushed a commit that referenced this issue Jul 9, 2023