Jsoup.parse() doesn't seem to load whole HTML content #287

Closed
xjaphx opened this Issue Jan 24, 2013 · 18 comments


@xjaphx
xjaphx commented Jan 24, 2013

I saved stackoverflow.com into a file, input.html, and loaded it like this:

File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8");
// first query
Elements resultsA = doc.select("h3 > a");
// second query
Elements resultsB = doc.select("div.nav li a");

"resultsA" has no element while "resultsB" contains 6 found elements.
Wondering by it, I extract the html content from "doc" variable; wow, it contains just part of HTML, where "resultsB" html content can be found, but not content for "resultsA".

I've tried parsing several URLs (even google.com), and they all behave the same way:
"Jsoup.parse()" doesn't return the whole HTML content.

@amferraz
Contributor

Could you please create a gist with the input.html file you're using?

@xjaphx
xjaphx commented Jan 25, 2013

Here you are: https://gist.github.com/4630682
Oh, one thing I forgot to mention: I'm testing this parsing on Android.

@jhy
Owner
jhy commented Jan 26, 2013

I've tried reproducing this, with no issues. h3 > a returns 89 hits, div.nav li a gives 5. The parse tree looks fine.

Can you show us doc.html() and compare that to the html() produced by fetching with Jsoup.connect(url).get()? I'm trying to identify whether the issue is with how you've fetched the content, saved it, or loaded the file.
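
Something like this would make the comparison concrete (just a sketch; the file name, URL, and selector are the ones from your report above):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;

public class CompareSources {
    public static void main(String[] args) throws IOException {
        // Parse the locally saved copy
        Document fromFile = Jsoup.parse(new File("input.html"), "UTF-8");

        // Fetch the same page over HTTP
        Document fromWeb = Jsoup.connect("http://stackoverflow.com/").get();

        // If these counts differ, the problem is in how the file was fetched or saved,
        // not in the parser itself
        System.out.println("file: h3 > a matches = " + fromFile.select("h3 > a").size());
        System.out.println("web:  h3 > a matches = " + fromWeb.select("h3 > a").size());
    }
}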

@xjaphx
xjaphx commented Jan 28, 2013

Hi John, thanks for the reply.
To identify the issue, I think it's better to start over from the beginning.

I created two projects: one Java, the other Android.
The parsing functions are the same, but the outputs are not.

I've created a sample repo; you might want to check it out:
https://github.com/xjaphx/JSoupSample

  • The selector syntax: ".summary h3 a"
  • Output:
    • Java project: 15 results found.
    • Android project: empty result.
@jhy
Owner
jhy commented Jan 28, 2013

OK. Thanks. The issue is that you are not specifying a user-agent when you fetch the URL, and so you are sending default user-agents from Java and from Android. And StackOverflow is sending you different HTML responses in return, one for desktops / crawlers (Java), and one for mobile agents (Android). The mobile version doesn't have anything that matches your selector.

I suggest using your browser's UA and setting it with http://jsoup.org/apidocs/org/jsoup/Connection.html#userAgent(java.lang.String)
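
For example (just a sketch; the UA string below is only an example, copy the one your own desktop browser sends):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchAsDesktop {
    public static void main(String[] args) throws IOException {
        // A desktop browser UA makes StackOverflow return the desktop HTML on Android too
        Document doc = Jsoup.connect("http://stackoverflow.com/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .timeout(10 * 1000) // 10 seconds
                .get();

        System.out.println(doc.select(".summary h3 a").size() + " results found");
    }
}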

Also you might like to use a debugging proxy like http://www.charlesproxy.com/ to watch HTTP traffic your apps are making.

Please give it a go and let me know your results.

@xjaphx
xjaphx commented Jan 28, 2013

Oh wow, that's right. I'd never thought of it.
After setting the User-Agent on the Jsoup connection before parsing, the responses match.
I've checked the documentation, but this isn't mentioned anywhere. It would be nice if you could add a note about this under Parser or Jsoup.connect().

Problem solved! Thanks John.

@jhy
Owner
jhy commented Jan 28, 2013

Cool -- glad we found it. Yep, I'll mention in the .connect() docs that it's a good idea to set the UA and a timeout. It might also be a good idea to create a default UA based on a desktop browser.

@jhy jhy closed this Jan 28, 2013
@RajatT
RajatT commented Mar 9, 2013

I want to show text and a URL in a ListView, parsed using jsoup. Please help me; I've tried a lot but haven't had success yet. Here is the link to my code on Stack Overflow:
http://stackoverflow.com/questions/15307970/listview-of-jsoup-parsed-data-in-android

@nikhilekbote

Hi. I am using Jsoup to parse a URL with the .connect(), timeout(), and userAgent() methods, but I am still not able to fetch the entire page; some tags are missing.

@cobr123
cobr123 commented Mar 29, 2016

try

Document doc = Jsoup.connect(url)
    .header("Accept-Encoding", "gzip, deflate")
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
    .maxBodySize(0)    // 0 = no limit; the default body-size cap can truncate large pages
    .timeout(600000)   // 10 minutes
    .get();

http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/

@inohtaf
inohtaf commented Apr 17, 2016

Hi. I have already implemented cobr123's suggestion, but it did not work. However, when I try to get the page on try.jsoup.org, it retrieves the complete HTML page. Do you have any suggestions?

@cobr123
cobr123 commented Apr 17, 2016

inohtaf, can you post the URL which did not work?

@inohtaf
inohtaf commented Apr 17, 2016

Hi cobr123, for example http://kbbi.web.id/mempelajari; I have experienced this on other sites too. On that site I would like to retrieve:
Element content = doc.select("div.content").first();
Element desc = content.select("div#desc").first();
Element descDetail = desc.select("div#d1").first();

Unfortunately, the content of "div#d1" cannot be found. But I am pretty sure it should work using Jsoup, since it can be retrieved perfectly on try.jsoup.org. Hoping for your suggestions :)

@cobr123
cobr123 commented Apr 17, 2016 edited

This works well:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        String url = "http://kbbi.web.id/mempelajari";
        Document doc = Jsoup.connect(url)
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .maxBodySize(0)
                .get();
        Element content = doc.select("div.content > div#desc").first();
        Element desc = content.select("div#desc").first();
        Element descDetail = desc.select("div#d1").first();
        System.out.println(descDetail);
    }
}

and prints:

<div id="d1"> 
 <div id="info"></div>
 <b>ajar</b> 
..
~100 lines of text
..
 <b>~ mikro</b> teknik pelatihan mengajar yang jumlah muridnya dibatasi, misalnya 5—10 orang; 
 <b>~ remedial</b> pengajaran yang diberikan khusus untuk memperbaiki kesulitan belajar yang dialami murid 
</div>

@inohtaf
inohtaf commented Apr 18, 2016

Hi cobr123. Yes, it works well. It seems I made a mistake in my code yesterday. Btw, thank you very much for your help :)

@gwidaz
gwidaz commented Nov 6, 2016
 .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .maxBodySize(0)
                .get();

This does not work; I have tested it with SoundCloud and it returns null or an error.

@cobr123
cobr123 commented Nov 7, 2016 edited

Can you give an example of a selector? SoundCloud loads additional page content dynamically as you scroll.
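
jsoup only parses the HTML the server returns; it does not run the JavaScript that fills in content while scrolling. A quick way to check what the initial response actually contains (a sketch, reusing the connection settings from above; the URL is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class DumpInitialHtml {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://soundcloud.com/")
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .maxBodySize(0)
                .get();

        // Dump what the server actually sent, before any JavaScript runs.
        // If the element you want is not in this output, no selector will find it.
        System.out.println(doc.html());
    }
}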

@gwidaz
gwidaz commented Nov 7, 2016

Well, I got it working, but only after messing around and saving the website to a txt file, since the output is absolutely different from what you see in Inspect Element... The HTML I received in the txt file is what I now use to grab data from the URL directly, which partly works for me :)
