Wrong encoding is used for some URLs #42

Closed
TariqAbughofa opened this issue Dec 29, 2015 · 3 comments

Comments

@TariqAbughofa

This is the code I'm using:

var Curl = require('node-libcurl').Curl
var curl = new Curl()
curl.setOpt(Curl.option.URL, "http://www1.bloomingdales.com/shop/product/mac-powder-blush-wash-dry-collection?ID=1371879")
curl.setOpt(Curl.option.FOLLOWLOCATION, true)
curl.setOpt(Curl.option.COOKIEJAR, "cookie.jar")
curl.setOpt(Curl.option.COOKIEFILE, "cookie.jar")
curl.on('end', function(statusCode, body, headers) {
    console.log(body)
})
curl.perform()

Here is what I got in the title tag:

M�A�C Powder Blush, Wash & Dry Collection | Bloomingdale's

However, when I use curl:

curl -XGET -L http://www1.bloomingdales.com/shop/product/mac-powder-blush-wash-dry-collection?ID=1371879 -c cookie.jar

I get the title with the right encoding:

M·A·C Powder Blush, Wash & Dry Collection | Bloomingdale's

tested on: node-libcurl@0.5.1 and node-libcurl@0.7.0

system libcurl version: curl 7.29.0

@JCMais
Owner

JCMais commented Dec 29, 2015

The Curl wrapper module always uses UTF-8 for decoding. If you need a different encoding, you must either disable data parsing or skip the wrapper entirely and use the Easy constructor instead.

var Curl = require( 'node-libcurl' ).Curl,
    // a package to convert from given encoding,
    //  we are using iconv-lite here (https://github.com/ashtuchkin/iconv-lite)
    iconv = require( 'iconv-lite' ),
    curl = new Curl();

curl.setOpt( Curl.option.URL, 'http://www1.bloomingdales.com/shop/product/mac-powder-blush-wash-dry-collection?ID=1371879' );
curl.setOpt( Curl.option.FOLLOWLOCATION, true );
curl.setOpt( Curl.option.COOKIEJAR, "cookie.jar" )
curl.setOpt( Curl.option.COOKIEFILE, "cookie.jar" )

// Enable the NO_DATA_PARSING bitmask, this basically means that
// the data will be returned to the end callback without any conversion.
curl.enable( Curl.feature.NO_DATA_PARSING );

function errorCallback( error, errCode ) {

    console.log( error, errCode );
    this.close();
}

function endCallback( statusCode, body, headers ) {

    // Since we enabled NO_DATA_PARSING, body will be a Buffer object here, and not a string.
    // parse it using the decoder
    console.log( iconv.decode( body, 'ISO-8859-1' ).substring( 0, 2000 ) );
    this.close();
}

curl.on( 'error', errorCallback );
curl.on( 'end', endCallback );
curl.perform();
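
To illustrate why decoding as UTF-8 produces the `�` characters in the title, here is a minimal sketch that uses only Node's built-in Buffer, with no network request. The byte values are an assumption about what the server sends for "M·A·C": in ISO-8859-1 the middle dot is the single byte 0xB7, which is not a valid standalone byte in UTF-8.

```javascript
// Assumed bytes for "M·A·C" encoded as ISO-8859-1 (middle dot = 0xB7).
var body = Buffer.from([0x4d, 0xb7, 0x41, 0xb7, 0x43]);

// Decoding as UTF-8 replaces each invalid byte with U+FFFD (the "�" character).
var asUtf8 = body.toString('utf8');

// Node's built-in 'latin1' encoding is ISO-8859-1, so each byte maps to one character.
var asLatin1 = body.toString('latin1');

console.log(asUtf8);   // M\uFFFDA\uFFFDC
console.log(asLatin1); // M·A·C
```

For ISO-8859-1 specifically, `body.toString('latin1')` may be enough; a package such as iconv-lite is still needed for encodings that Node does not support natively.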

@JCMais
Owner

JCMais commented Dec 29, 2015

By the way, converting this automatically based on the charset in the headers would not work for the given link, because the server is sending the wrong charset: the header declares utf-8, but the body is actually encoded as ISO-8859-1.
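
One possible workaround for servers that declare the wrong charset is to trust a declared utf-8 only when the bytes really are well-formed UTF-8. The sketch below is not part of node-libcurl: `charsetFromContentType`, `isValidUtf8`, and `decodeBody` are hypothetical helper names, the caller is assumed to pass the raw Buffer (as returned with NO_DATA_PARSING enabled) plus the Content-Type value, and falling back to ISO-8859-1 for everything else is a simplifying assumption that will not suit every site.

```javascript
// Extract the charset parameter from a Content-Type header value, or null if absent.
function charsetFromContentType(contentType) {
    var match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : null;
}

// True when the buffer is well-formed UTF-8: decoding and re-encoding only
// reproduces the same bytes if nothing was replaced by U+FFFD along the way.
function isValidUtf8(buf) {
    return Buffer.from(buf.toString('utf8'), 'utf8').equals(buf);
}

// Hypothetical helper: use UTF-8 when it is declared (or nothing is declared)
// and the bytes agree; otherwise fall back to ISO-8859-1 (Node's 'latin1').
function decodeBody(buf, contentType) {
    var declared = charsetFromContentType(contentType);
    var claimsUtf8 = declared === null || declared === 'utf-8' || declared === 'utf8';
    if (claimsUtf8 && isValidUtf8(buf)) {
        return buf.toString('utf8');
    }
    return buf.toString('latin1');
}
```

With the mismatch described above, `decodeBody(Buffer.from([0x4d, 0xb7, 0x41]), 'text/html; charset=utf-8')` would return `'M·A'` rather than a string containing `�`, because the 0xB7 byte fails the UTF-8 check.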

@TariqAbughofa
Author

Nice, I didn't know that you could disable data parsing.
Yeah, I know many websites mess up the charset in the headers and the HTML. That's why I understand that the library should not depend on them to detect the encoding.
Thanks a lot.
