encoding() conversion fails and the result is blank #34

Closed
ctfang opened this issue Apr 2, 2018 · 23 comments

Comments

ctfang commented Apr 2, 2018

// This fails (the result is blank):
->encoding('UTF-8','GB2312')

// This works when applied to each item of the result set:
echo iconv('GB2312', 'UTF-8', $item['title'])."\n";
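
The per-item workaround above can be generalized to a whole result set. The following is a minimal sketch, assuming the scraped rows are plain arrays of strings; `convertRows` is a hypothetical helper name, not part of QueryList, and it uses `mb_convert_encoding` rather than `iconv`.

```php
<?php
// Hypothetical helper (not part of QueryList): convert every string field
// of every scraped row from one encoding to another.
function convertRows(array $rows, string $from = 'GBK', string $to = 'UTF-8'): array
{
    return array_map(function (array $row) use ($from, $to) {
        return array_map(function ($value) use ($from, $to) {
            // Leave non-string values (null, int, ...) untouched.
            return is_string($value) ? mb_convert_encoding($value, $to, $from) : $value;
        }, $row);
    }, $rows);
}

// Sample data: "\xD6\xD0\xCE\xC4" is the GBK byte sequence for the two
// Chinese characters meaning "Chinese text".
$rows = [['title' => "\xD6\xD0\xCE\xC4", 'link' => '/a.html']];
$converted = convertRows($rows);
```

This only sidesteps the problem: the page is still parsed in its original encoding and the conversion happens afterwards, field by field.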

ctfang commented Apr 2, 2018

    $listmain  = $ql->encoding('UTF-8','GBK')->rules([
        'title' => array('dd>a', 'text'),
        'link' => array('dd>a', 'href')
    ])->query()->getData();

// Stepping into the library source, I can see the conversion succeeds, but $listmain is still empty
class EncodeService
{
    public static function convert(QueryList $ql, string $outputEncoding, string $inputEncoding = null)
    {
        $html = $ql->getHtml();
        $inputEncoding || $inputEncoding = self::detect($html);
        $html = iconv($inputEncoding, $outputEncoding, $html);
        dump($inputEncoding, $outputEncoding, $html);
        $ql->setHtml($html);
        return $ql;
    }
    // ... rest of the class not shown
}

wangyouw commented Jun 5, 2018

OP, did you find the cause? I'm hitting the same problem.

varphper commented Jul 9, 2018

Has this still not been resolved?

@luffyzhao

My workaround is:

$ql->find('meta[http-equiv="Content-Type"]')->attr('content', 'text/html; charset=utf-8');

qwqcode commented Jul 14, 2018

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    // The charset=* in the <meta/> tag must be replaced with utf-8, otherwise phpQuery cannot parse the tags
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html);
    
    return $html;
}

$html = handleGbkPage($html);
$ql = (new QueryList())->html($html);
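
A slightly more defensive variant of the same idea is sketched below: it reads the charset declared in the `<meta>` tag instead of assuming GBK, skips pages that already declare UTF-8, and then rewrites the declaration. `toUtf8Html` and its `$fallback` parameter are hypothetical names, not part of QueryList or phpQuery. Treating a declared GB2312 as GBK is a deliberate assumption here, since GBK is a superset of GB2312.

```php
<?php
// Hypothetical helper, not part of QueryList or phpQuery.
function toUtf8Html(string $html, string $fallback = 'GBK'): string
{
    // Read the charset declared in a <meta> tag, if there is one.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i';
    $declared = preg_match($pattern, $html, $m) ? strtoupper($m[1]) : $fallback;

    if ($declared === 'UTF-8' || $declared === 'UTF8') {
        return $html; // already UTF-8, nothing to convert
    }
    if ($declared === 'GB2312') {
        $declared = 'GBK'; // assumption: decode GB2312 pages as the GBK superset
    }

    // Convert the bytes first, then make the <meta> declaration agree with them.
    $html = mb_convert_encoding($html, 'UTF-8', $declared);
    return preg_replace('/(<meta[^>]+charset\s*=\s*["\']?)[\w-]+/i', '${1}utf-8', $html) ?? $html;
}

// Sample page: the <p> body is "\xD6\xD0\xCE\xC4", two GBK-encoded characters.
$gbkPage = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head>'
    . "<body><p>\xD6\xD0\xCE\xC4</p></body></html>";
$utf8Page = toUtf8Html($gbkPage);
```

The result can then be passed to `(new QueryList())->html($utf8Page)` as in the snippet above. A charset name that mbstring does not recognize is not handled in this sketch.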

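youngda commented Jul 27, 2018

Same problem here. I tried every method in the documentation and none of them worked, so I quietly wrote my own regex, and its output is fine. Fetching the page seems to work, but as soon as I use the rule matching, the text comes out garbled. The code from the commenter above didn't work for me either. If anyone solves this, please @ me. Thanks.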

qwqcode commented Jul 27, 2018

@youngda First convert GBK to UTF-8, then replace charset=* in the meta tag with utf-8. That solved it for me.

youngda commented Jul 30, 2018

@Zneiat That didn't work in my tests. If I output the fetched HTML directly it looks fine, but once I turn on rule matching the output is garbled.

@shanezhiu

The HTML page I'm scraping is already UTF-8, yet the Chinese text I get back from the text attribute is garbled. This feels like a bug in the library as a whole.

youngda commented Jul 31, 2018

@shanezhiu Agreed. It's also possible that we just haven't found the right approach and can't handle the library properly.

qwqcode commented Jul 31, 2018

@youngda Post your code and I'll take a look.

shanezhiu commented Jul 31, 2018

@Zneiat

public function handle_content()
{
		$data = $this->spider
			->rules([
				'title' => ['#activity-name','text']
			])
			->get("https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1")
			->encoding('UTF-8','GB2312')
			->query()
			->getData()
			->toArray();
		$title = array_pop($data)['title'];
		var_dump($title);exit;
}

@shanezhiu

@youngda It's most likely a bug. I'll go dig through the source.

qwqcode commented Jul 31, 2018

@shanezhiu Try this:

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended instead
$html = handleGbkPage($html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->rules([
    'title' => ['#activity-name','text']
])->query()->getData()->all();
var_dump($data);die();

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    // The charset=* in the <meta/> tag must be replaced with utf-8, otherwise phpQuery cannot parse the tags
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html);
    
    return $html;
}

qwqcode commented Jul 31, 2018

@shanezhiu https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1 XD That page is already UTF-8, so no conversion is needed.

@shanezhiu

@Zneiat Try removing the encoding call, print the title, and look at the result.

qwqcode commented Jul 31, 2018

@shanezhiu It seems we're not talking about the same problem... The issue I ran into is that after converting GBK to UTF-8 nothing is garbled, yet phpQuery still can't extract the content.

@shanezhiu

@Zneiat What I'm curious about is whether you actually ran the snippet you posted. When I run it, I get:

array (size=1)
  0 => 
    array (size=1)
      'title' => string '1603澶���孧H370��棫��硶纭��澶辫��������' (length=152)

This result is clearly wrong.

@shanezhiu

@Zneiat I think both of these are encoding problems.

qwqcode commented Jul 31, 2018

@shanezhiu Solved... You're scraping a WeChat Official Account article. The HTML starts with <!--headTrap<body></body><head></head><html></html>--> and ends with <!--tailTrap<body></body><head></head><html></html>-->, and these comments interfere with phpQuery.

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended instead

$html = str_replace(['<!--headTrap<body></body><head></head><html></html>-->', '<!--tailTrap<body></body><head></head><html></html>-->'], '', $html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->find('#activity-name')->text();
var_dump($data);
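
The `str_replace` call above matches the two trap comments byte for byte. If the content inside those comments ever varies, a pattern-based version may be more forgiving. This is only a sketch: `stripTrapComments` is a hypothetical helper name, and it assumes the comments always begin with `<!--headTrap` or `<!--tailTrap`, as they do in the page discussed here.

```php
<?php
// Hypothetical helper: remove the headTrap/tailTrap HTML comments
// before handing the markup to the parser.
function stripTrapComments(string $html): string
{
    // Non-greedy match up to the first "-->" after each trap marker;
    // the /s flag lets "." span line breaks.
    return preg_replace('/<!--(?:head|tail)Trap.*?-->/s', '', $html) ?? $html;
}

// Sample markup shaped like the page described above (no network access needed).
$sample = '<!--headTrap<body></body><head></head><html></html>-->'
    . '<html><body><h2 id="activity-name">Title</h2></body></html>'
    . '<!--tailTrap<body></body><head></head><html></html>-->';
$clean = stripTrapComments($sample);
```

`$clean` can then be loaded with `(new QueryList())->html($clean)` in place of the `str_replace` result.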

shanezhiu commented Jul 31, 2018

@Zneiat Thank you. Yes, that's the cause; I stepped through it and confirmed it. A maintainer may need to move these comments into a new issue for me.

qwqcode commented Jul 31, 2018

@shanezhiu Haha, you're welcome (/ω\)

youngda commented Jul 31, 2018

@Zneiat Thanks, that was exactly the problem. I clearly still have a lot to learn.
