encoding() conversion fails and the result is empty #34

ctfang · 2018-04-02T06:57:08Z

// This fails:
->encoding('UTF-8','GB2312')

This works, when applied to the result set afterwards:
echo iconv('GB2312', 'UTF-8', $item['title']) . "\n";
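
The post-extraction workaround above could be applied to every field of a result set, roughly as follows. This is only a sketch: convertRows is a hypothetical helper, not part of QueryList, and it assumes each row is a plain array whose string values are GBK/GB2312 bytes.

```php
<?php
// Hypothetical helper (not a QueryList API): converts every string field
// of every row from one encoding to another after extraction.
function convertRows(array $rows, string $from = 'GBK', string $to = 'UTF-8'): array
{
    return array_map(function (array $row) use ($from, $to) {
        return array_map(function ($value) use ($from, $to) {
            // Leave non-string values (nulls, numbers) untouched.
            return is_string($value) ? mb_convert_encoding($value, $to, $from) : $value;
        }, $row);
    }, $rows);
}
```

Calling convertRows($listmain) on an array of rows would then replace the per-field iconv() call; array keys such as 'title' and 'link' are preserved.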

ctfang · 2018-04-02T07:18:38Z

    $listmain  = $ql->encoding('UTF-8','GBK')->rules([
        'title' => array('dd>a', 'text'),
        'link' => array('dd>a', 'href')
    ])->query()->getData();

// Stepping into the library source, the conversion succeeds, but $listmain is still empty
class EncodeService
{
    public static function convert(QueryList $ql, string $outputEncoding, string $inputEncoding = null)
    {
        $html = $ql->getHtml();
        $inputEncoding || $inputEncoding = self::detect($html);
        $html = iconv($inputEncoding, $outputEncoding, $html);
        dump($inputEncoding, $outputEncoding, $html); // debug output
        $ql->setHtml($html);
        return $ql;
    }
}

wangyouw · 2018-06-05T08:15:36Z

Has the original poster found the cause? I'm hitting the same problem.

varphper · 2018-07-09T10:17:22Z

Has this still not been resolved?

luffyzhao · 2018-07-13T05:09:00Z

My workaround:

$ql->find('meta[http-equiv="Content-Type"]')->attr('content', 'text/html; charset=utf-8');

qwqcode · 2018-07-14T05:01:45Z

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // charset=* in <meta/> must be replaced with utf-8, otherwise phpQuery cannot parse the tags
    
    return $html;
}

$html = handleGbkPage($html);
$ql = (new QueryList())->html($html);

youngda · 2018-07-27T15:13:39Z

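Same problem here. I tried every method in the docs and none of them worked, so I quietly wrote my own regex and the output is fine. As far as I can tell the fetch itself is fine; the text only becomes garbled once I use the rule matching. The code from the poster above didn't work for me either. If anyone solves this, please @ me. Thanks.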

qwqcode · 2018-07-27T19:51:58Z

@youngda First convert GBK to UTF-8, then replace charset=* in the meta tag with utf-8. That solved it for me.
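
The two steps described here can be sketched as a slightly more tolerant variant of handleGbkPage. Treat it as an illustration under assumptions: normalizeGbkHtml is a hypothetical name, the default source encoding is a guess that must match the real page, and the extra cases (quoted values, spaces around "=", gb18030) are not taken from the thread.

```php
<?php
// Sketch only: normalizeGbkHtml is a hypothetical name, not a QueryList API.
function normalizeGbkHtml(string $html, string $sourceEncoding = 'GBK'): string
{
    // Step 1: convert the raw bytes to UTF-8.
    $html = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);

    // Step 2: rewrite the declared charset so it matches the converted bytes.
    // Covers charset=gbk, charset="gb2312" and charset = 'gb18030'.
    $result = preg_replace(
        '/(charset\s*=\s*["\']?)(?:gb2312|gbk|gb18030)/i',
        '${1}utf-8',
        $html
    );

    // preg_replace returns null on a regex error; fall back to the converted HTML.
    return $result ?? $html;
}
```

The returned string would then be passed to (new QueryList())->html(...) as in the earlier snippet.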

youngda · 2018-07-30T09:45:02Z

@Zneiat That didn't work in my tests. If I echo the fetched HTML directly it looks fine, but once I turn on rule matching the output is garbled.

shanezhiu · 2018-07-31T02:58:58Z

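The HTML page I'm scraping is already UTF-8, but the Chinese values I get through the 'text' attribute come out garbled. This looks like a bug in the library itself.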

youngda · 2018-07-31T03:47:37Z

@shanezhiu Agreed. It could also be that we just haven't found the right way to use it.

qwqcode · 2018-07-31T03:49:01Z

@youngda Post your code and I'll take a look.

shanezhiu · 2018-07-31T06:38:27Z

@Zneiat

public function handle_content()
{
		$data = $this->spider
			->rules([
				'title' => ['#activity-name','text']
			])
			->get("https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1")
			->encoding('UTF-8','GB2312')
			->query()
			->getData()
			->toArray();
		$title = array_pop($data)['title'];
		var_dump($title);exit;
}

shanezhiu · 2018-07-31T06:43:18Z

@youngda A bug seems more likely. I'll go dig through the source.

qwqcode · 2018-07-31T06:46:26Z

@shanezhiu Try:

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended
$html = handleGbkPage($html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->rules([
    'title' => ['#activity-name','text']
])->query()->getData()->all();
var_dump($data);die();

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // charset=* in <meta/> must be replaced with utf-8, otherwise phpQuery cannot parse the tags
    
    return $html;
}

qwqcode · 2018-07-31T06:51:20Z

@shanezhiu https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1 XD That page is already UTF-8, so no conversion is needed.

shanezhiu · 2018-07-31T06:52:41Z

@Zneiat Try removing the encoding() call, print the title, and see what you get.

qwqcode · 2018-07-31T06:57:10Z

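@shanezhiu We seem to be discussing different problems... The issue I hit is that after converting GBK to UTF-8 the text is not garbled, yet phpQuery still can't extract the content.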

shanezhiu · 2018-07-31T06:57:20Z

@Zneiat What I'm curious about is whether you actually ran the snippet you posted. When I run it, I get:

array (size=1)
  0 => 
    array (size=1)
      'title' => string '1603æ¾¶â��æ��é��å§H370æ¸�æ¿�æ£«é��ç�³ç¡¶çºî�¿î�»æ¾¶è¾«ä»�é�ªç�¸î��é��ç�·æ´�é��' (length=152)

This result is clearly wrong.

shanezhiu · 2018-07-31T06:59:56Z

@Zneiat I think both of these are encoding problems.

qwqcode · 2018-07-31T07:34:15Z

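@shanezhiu Solved... You're scraping a WeChat official account article. The <!--headTrap<body></body><head></head><html></html>--> comment at the start of the HTML and the <!--tailTrap<body></body><head></head><html></html>--> comment at the end interfere with phpQuery.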

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended

$html = str_replace(['<!--headTrap<body></body><head></head><html></html>-->', '<!--tailTrap<body></body><head></head><html></html>-->'], '', $html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->find('#activity-name')->text();
var_dump($data);
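
The str_replace() above matches the two comments byte for byte. A regex-based variant is sketched below in case the comment bodies vary. It rests on an assumption the thread does not confirm: that such comments always begin with headTrap or tailTrap. stripTrapComments is a hypothetical helper name.

```php
<?php
// Hypothetical helper: removes HTML comments that start with headTrap or tailTrap.
function stripTrapComments(string $html): string
{
    // Non-greedy match up to the first "-->" after each marker;
    // the "s" flag lets "." span newlines inside the comment.
    $result = preg_replace('/<!--(?:headTrap|tailTrap).*?-->/s', '', $html);

    // preg_replace returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

The cleaned string would then be loaded with (new QueryList())->html(...) exactly as in the snippet above.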

shanezhiu · 2018-07-31T08:21:39Z

@Zneiat Thank you. Yes, that's the cause; I stepped through it and confirmed it. An admin may need to move these comments to a new issue.

qwqcode · 2018-07-31T08:35:05Z

@shanezhiu Haha, no problem (/ω＼)

youngda · 2018-07-31T09:50:49Z

@Zneiat Thanks, that was exactly the problem. I clearly still have a lot to learn.

jae-jae added the help wanted label Oct 15, 2018

jae-jae closed this as completed Sep 27, 2020
