
Could a base_url be added for ASCII2D? #115

Closed
Container-Zero opened this issue Mar 12, 2024 · 10 comments

Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

@Container-Zero

Something like:

google = GoogleSync(proxies=proxies, base_url=base_url)
resp = google.search(url=url)

GoogleSync can select a mirror source through a custom base_url; I hope ascii2d can do the same.
The goal is to point it at a self-hosted ascii2d reverse proxy running in a safe environment, so that Cloudflare's anti-bot detection is bypassed permanently, once and for all.
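
Hypothetically, the same call shape for ascii2d could look like this (a rough sketch only: Ascii2D does not take base_url today, and the proxy domain below is made up):

import asyncio
from PicImageSearch import Ascii2D

async def main():
    # base_url here is the requested, not-yet-existing parameter,
    # pointed at a self-hosted ascii2d reverse proxy.
    ascii2d = Ascii2D(base_url="https://ascii2d.my-proxy.example")
    resp = await ascii2d.search(url="https://example.com/image.jpg")
    print(resp.url)

asyncio.run(main())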

@Container-Zero Container-Zero changed the title from "Could ASCII2D add a base_url?" to "Could a base_url be added for ASCII2D?" Mar 12, 2024
@Container-Zero
Author

One more request:
I'd like the other engines to get this parameter as well; it would help work around network problems and enable load balancing.
For example, SauceNAO limits the number and rate of requests per IP per day (even when a token is used). By setting up multiple mirror sites, the load can be balanced across them, which makes something like an open public query API feasible.
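
For instance (purely illustrative; the mirror domains are made up and assume base_url support lands), a caller could rotate through mirrors in round-robin order:

from itertools import cycle

# Hypothetical self-hosted SauceNAO mirrors; each one sees only a share of the
# traffic, so the per-IP quota and rate limit apply per mirror instead of to one host.
SAUCENAO_MIRRORS = cycle([
    "https://saucenao-mirror-1.example",
    "https://saucenao-mirror-2.example",
    "https://saucenao-mirror-3.example",
])

def next_base_url() -> str:
    """Return the next mirror base_url in round-robin order."""
    return next(SAUCENAO_MIRRORS)

# Hypothetical usage once SauceNAO(base_url=...) is supported:
# saucenao = SauceNAO(api_key=token, base_url=next_base_url())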

@Container-Zero
Author

Container-Zero commented Mar 12, 2024

One more thing I should mention: my motivation for opening this issue is that I believe ASCII2D is currently being blocked by Cloudflare, but thinking it over, I can't fully confirm that. Below is the exact ASCII2D error I'm seeing on my side:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/main.py", line 83, in ascii2d
    resp = await ascii2d.search(url=url)
  File "/usr/local/lib/python3.9/dist-packages/PicImageSearch/ascii2d.py", line 65, in search
    return Ascii2DResponse(resp.text, resp.url)
  File "/usr/local/lib/python3.9/dist-packages/PicImageSearch/model/ascii2d.py", line 144, in __init__
    data = PyQuery(fromstring(resp_text, parser=utf8_parser))
  File "/usr/local/lib/python3.9/dist-packages/lxml/html/__init__.py", line 873, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.9/dist-packages/lxml/html/__init__.py", line 761, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

The symptoms aren't distinctive enough for me to confirm whether this is caused by a Cloudflare 403, but I tried the following code:

url="https://ascii2d.net/search/url/http://5b0988e595225.cdn.sohucs.com/images/20200109/74e33947a41248839725d6c8d54540e4.jpeg"
headers= {'User-Agent': 'PostmanRuntime/7.29.0'}
payload = {}
scraper = cloudscraper.create_scraper()
response1 = scraper.get(url, headers=headers, data = payload)
response2 = requests.request("GET", url, headers=headers, data = payload)

Both response1 and response2 currently return 403.

@kitUIN
Owner

kitUIN commented Mar 12, 2024

There is indeed a 403 at the moment.

@kitUIN kitUIN added bug Something isn't working enhancement New feature or request labels Mar 12, 2024
@wlt233

wlt233 commented Mar 19, 2024

So is the whole ascii2d interface unusable because of Cloudflare's WAF? Have you considered bringing in Selenium as a workaround?

@kitUIN
Owner

kitUIN commented Mar 20, 2024

Still thinking about a solution 🤔

@wlt233

wlt233 commented Mar 20, 2024

I did some testing; Cloudflare seems to check the TLS fingerprint. You could consider using curl_cffi to impersonate a browser request:

from curl_cffi import requests
r = requests.get("https://ascii2d.net/search/url/" + url, impersonate="chrome101")

ref: How to issue a web request to simulate browser (Namely the TLS handshake / client hello?)

@NekoAria
Collaborator

NekoAria commented Apr 10, 2024

> I did some testing; Cloudflare seems to check the TLS fingerprint. You could consider using curl_cffi to impersonate a browser request.

This approach would require an extra dependency and the corresponding refactoring, so I'm not planning to adopt it.
Selenium would be far too heavy; that's even less of an option.

That said, whether this gets triggered depends on the network environment.
I haven't run into it in a long time.

I can accept the idea of adding base_url to all modules.

@Container-Zero
Author

> This approach would require an extra dependency and the corresponding refactoring, so I'm not planning to adopt it. Selenium would be far too heavy; that's even less of an option. That said, whether this gets triggered depends on the network environment; I haven't run into it in a long time. I can accept the idea of adding base_url to all modules.

A small convention request: if base_url does get implemented, I hope every base_url ends up standardized as a bare domain with no route, e.g. https://www.baidu.com rather than https://www.baidu.com/route. Right now the Google one has to carry a search route; that's harmless, of course, but it would feel odd if the final implementation accepted inconsistently shaped base_url values.
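
For example (just a sketch of the convention, not library code), user input could be normalized down to scheme plus host before use:

from urllib.parse import urlparse

def normalize_base_url(base_url: str) -> str:
    """Reduce a user-supplied base_url to its bare origin, e.g. https://www.baidu.com."""
    parsed = urlparse(base_url)
    return f"{parsed.scheme}://{parsed.netloc}"

assert normalize_base_url("https://www.baidu.com/route") == "https://www.baidu.com"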

@NekoAria
Collaborator

> A small convention request: if base_url does get implemented, I hope every base_url ends up standardized as a bare domain with no route, e.g. https://www.baidu.com rather than https://www.baidu.com/route.

That's fine.

@Container-Zero
Author

Container-Zero commented Apr 10, 2024

> I did some testing; Cloudflare seems to check the TLS fingerprint. You could consider using curl_cffi to impersonate a browser request.

I gave it a try: integrating curl_cffi is pretty straightforward and is largely backward compatible with the usual request patterns. Only 4 lines of network.py need to change. I won't open a PR, so here is the code:

from collections import namedtuple
from types import TracebackType
from typing import Any, Dict, Optional, Type, Union

# from httpx import AsyncClient, QueryParams
from httpx import QueryParams
from curl_cffi.requests import AsyncSession as AsyncClient  # alias curl_cffi's AsyncSession as AsyncClient so existing code keeps working

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/99.0.4844.82 Safari/537.36"
    )
}
RESP = namedtuple("RESP", ["text", "url", "status_code"])


class Network:
    """Manages HTTP client for network operations.

    Attributes:
        internal: Indicates if the object manages its own client lifecycle.
        cookies: Dictionary of parsed cookies, provided in string format upon initialization.
        client: Instance of an HTTP client.
    """

    def __init__(
        self,
        internal: bool = False,
        proxies: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        cookies: Optional[str] = None,
        timeout: float = 30,
        verify_ssl: bool = True,
    ):
        """Initializes Network with configuration for HTTP requests.

        Args:
            internal: If True, Network manages its own HTTP client lifecycle.
            proxies: Proxy settings for the HTTP client.
            headers: Custom headers for the HTTP client.
            cookies: Cookies in string format for the HTTP client.
            timeout: Timeout duration for the HTTP client.
            verify_ssl: If True, verifies SSL certificates.
        """
        self.internal: bool = internal
        headers = {**DEFAULT_HEADERS, **headers} if headers else DEFAULT_HEADERS
        self.cookies: Dict[str, str] = {}
        if cookies:
            for line in cookies.split(";"):
                key, value = line.strip().split("=", 1)
                self.cookies[key] = value

        self.client: AsyncClient = AsyncClient(
            headers=headers,
            cookies=self.cookies,
            verify=verify_ssl,
            proxies=proxies,
            timeout=timeout,
            # follow_redirects=True,
            allow_redirects=True,  # requests-style name instead of httpx's follow_redirects
            impersonate="chrome120",  # impersonate a Chrome TLS/JA3 fingerprint
        )

    def start(self) -> AsyncClient:
        """Initializes and returns the HTTP client.

        Returns:
            AsyncClient: Initialized HTTP client for network operations.
        """
        return self.client

    async def close(self) -> None:
        """Closes the HTTP client session if managed internally."""
        # await self.client.aclose()
        await self.client.close()  # requests-style close() instead of httpx's aclose()

    async def __aenter__(self) -> AsyncClient:
        """Async context manager entry for initializing or returning the HTTP client.

        Returns:
            AsyncClient: The HTTP client instance.
        """
        return self.client

    async def __aexit__(
        self,
        exc_type: Optional[Type[BaseException]] = None,
        exc_val: Optional[BaseException] = None,
        exc_tb: Optional[TracebackType] = None,
    ) -> None:
        """Async context manager exit for closing the HTTP client if managed internally."""
        # await self.client.aclose()
        await self.client.close()  # requests-style close() instead of httpx's aclose()

# The rest of the code below does not need to change

But, much as I expected, changing the JA3 fingerprint still doesn't stop the 403s on my side. I've run into JA3-based blocking before, but ascii2d's protection seems trickier than expected (I also tried supplying a cf_clearance cookie, which didn't work either; maybe a stable workaround would need a headless browser...? Working around this black box is a real pain). The only thing I can confirm so far is that pointing base_url at a reverse-proxy site absolutely, 100% solves the 403 problem.
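
For reference, a quick way to exercise the patched Network would be something like this; passing client= into the engine follows the library's usual example pattern, so treat the exact keyword as an assumption:

import asyncio
from PicImageSearch import Ascii2D, Network

async def main():
    # Network() is now backed by curl_cffi's AsyncSession (see the patch above).
    async with Network() as client:
        ascii2d = Ascii2D(client=client)  # assumed keyword, per the library's examples
        resp = await ascii2d.search(url="https://example.com/image.jpg")
        print(resp.url)

asyncio.run(main())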
