Extract article from non-english text #72

GoogleCodeExporter · 2016-03-22T03:15:46Z

I am trying to use boilerpipe to extract articles from URLs whose content is in a 
non-English language. However, it produces garbled Latin characters instead of the 
original script; see this example 
(http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.sandesh.com%2Farticle.aspx%3Fnewsid%3D2905443&extractor=ArticleExtractor&output=htmlFragment&extractImages=). 
I also saw this related issue 
(https://code.google.com/p/boilerpipe/issues/detail?id=16&q=non%20english).

I tried making some changes in the code.
1) Modified HTMLFetcher.java, appending the following lines before the end of the 
fetch method:
    byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line: convert the bytes to UTF-8
    cs = Charset.forName("UTF-8"); // set the charset to UTF-8

And/or 2) changed the code in my class to use the UTF-8 charset with an InputSource:
    URL url = new URL(urls);
    InputSource is = new InputSource();
    is.setEncoding("UTF-8");
    is.setByteStream(url.openStream());
    text = ArticleExtractor.INSTANCE.getText(is);
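
One way to rule out charset guessing altogether is to read the raw bytes yourself and decode them with an explicit charset before handing anything to the extractor. The following is only a minimal sketch using the standard library; the class name ExplicitDecode and the helper readAll are hypothetical, and it assumes the page really is served as UTF-8, which has not been verified for the test URL.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitDecode {
    // Hypothetical helper: reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset cs) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): a tiny UTF-8 encoded page containing Gujarati text
        // (written with Unicode escapes so the source file encoding does not matter).
        byte[] page = "<p>\u0AAE\u0AC1\u0A82\u0AAC\u0A88</p>".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(html);
        // Assumption: with boilerpipe on the classpath, the decoded string could then be
        // passed to the extractor, e.g. ArticleExtractor.INSTANCE.getText(html).
    }
}
```

Because the extractor then receives already-decoded characters, no later charset detection step can reinterpret the bytes.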

Still, I am not able to get the desired result.
Test URL: http://www.sandesh.com/article.aspx?newsid=2905443
Test text (in Gujarati): મુંબઈ, 30 
જાન્યુઆરી સલમાન ખાને 
ગુજરાતમાં આવીને નરેન્દ્ર 
મોદીના વખાણ શુ કર્યા તેની 
મુસીબતોમાં ખૂબ વધારો થઈ ગયો 
છે. સલમાન ખાન ફિલ્મ 'જય હો'ના 
પ્રમોશન માટે ઉત્તરાયણમાં 
અમદાવાદ આવ્યા હોવાથી અને તે 
સમયે તેણે નરેન્દ્ર મોદીના 
વખાણ કર્યા હોવાથી કોંગ્રેસ 
દ્વારા મુસ્લિમોને તેની ફિલ્મ 
'જય હો' ના જોવાની અરજી કરવામાં 
આવી હતી અને હવે મુસ્લિમ 
મૌલવીઓ દ્વારા તેના સામે ફતવો 
જાહેર કરી દેવામાં આવ્યો છે.

Test Result: àª®à«�àª�àª¬àª�, 30 
àª�àª¾àª¨à«�àª¯à«�àª�àª°à«� 
àª¸àª²àª®àª¾àª¨ àª�àª¾àª¨à«� 
àª�à«�àª�àª°àª¾àª¤àª®àª¾àª� 
àª�àªµà«�àª¨à«� àª¨àª°à«�àª¨à«�àª¦à«�àª° 
àª®à«�àª¦à«�àª¨àª¾ àªµàª�àª¾àª£ àª¶à«� 
àª�àª°à«�àª¯àª¾ àª¤à«�àª¨à«� 
àª®à«�àª¸à«�àª¬àª¤à«�àª®àª¾àª� àª�à«�àª¬ 
àªµàª§àª¾àª°à«� àª¥àª� àª�àª¯à«� àª�à«�. 
àª¸àª²àª®àª¾àª¨ àª�àª¾àª¨ 
àª«àª¿àª²à«�àª® 'àª�àª¯ àª¹à«�'àª¨àª¾ 
àªªà«�àª°àª®à«�àª¶àª¨ àª®àª¾àª�à«� 
àª�àª¤à«�àª¤àª°àª¾àª¯àª£àª®àª¾àª� 
àª�àª®àª¦àª¾àªµàª¾àª¦ àª�àªµà«�àª¯àª¾ 
àª¹à«�àªµàª¾àª¥à«� àª�àª¨à«� àª¤à«� 
àª¸àª®àª¯à«� àª¤à«�àª£à«� 
àª¨àª°à«�àª¨à«�àª¦à«�àª° 
àª®à«�àª¦à«�àª¨àª
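
The garbled result above looks consistent with UTF-8 bytes being decoded as a single-byte Latin charset: the first Gujarati letter is encoded in UTF-8 as the bytes E0 AA AE, which read as "àª®" in ISO-8859-1. The sketch below reproduces that effect with the standard library only, and also suggests why re-encoding to UTF-8 after decoding with a wrongly detected charset would not help. The class name TranscodeDemo and the helper transcodeToUtf8 are hypothetical, and the choice of ISO-8859-1 as the wrongly detected charset is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Mimics the change tried in the fetch method: decode with the detected
    // charset, then re-encode the resulting string as UTF-8.
    static byte[] transcodeToUtf8(byte[] data, Charset detected) {
        return new String(data, detected).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First word of the test text, written with Unicode escapes.
        String original = "\u0AAE\u0AC1\u0A82\u0AAC\u0A88";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Assumed failure mode: the bytes are treated as ISO-8859-1 before transcoding.
        String wrong = new String(
                transcodeToUtf8(wire, StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        // Correct detection: the bytes are treated as UTF-8, so transcoding is a no-op.
        String right = new String(
                transcodeToUtf8(wire, StandardCharsets.UTF_8), StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original)); // false: the damage is preserved
        System.out.println(right.equals(original)); // true
    }
}
```

If this is what happens, the fix has to make the first decode use the right charset; converting afterwards only carries the wrong characters forward.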

Original issue reported on code.google.com by ranjanba...@iblogee.com on 2 Feb 2014 at 12:44


GoogleCodeExporter · 2016-03-22T03:15:46Z

Hi,

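Save the page as an HTML file and use the lines below to extract the text as is.

// Assumes the file was saved as UTF-8; without an explicit charset the platform default is used.
Reader r = new InputStreamReader(new FileInputStream("D:/test1.htm"), "UTF-8");
String text = CommonExtractors.ARTICLE_EXTRACTOR.getText(r);
System.out.println("Text:" + text);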


Regards,

Vanaja Jayaraman

Original comment by vanaja.u...@gmail.com on 22 May 2014 at 12:01

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Mar 22, 2016
