I can't parse the generated HTML below using Jsoup. Why not? #751

Closed
sgobinathsr opened this issue Aug 22, 2016 · 11 comments

@sgobinathsr

I tried to parse the HTML content below using Jsoup 1.9.2, but I could not. It returns a blank object. If I change the condition from slno < 7155 to slno < 7150, it works fine.

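    <table>
        <thead>
            <tr>
                <th>SLNo</th>
                <th>Heading 1</th>
                <th>Heading 2</th>
                <th>Heading 3</th>
                <th>Heading 4</th>
            </tr>
        </thead>
        <tbody>
    <%
    for (int slno = 1; slno < 7155; slno++) {
    %>
            <tr>
                <td><%=slno%></td>
                <td>column-<%=slno%></td>
                <td>column-<%=slno%></td>
                <td>column-<%=slno%></td>
                <td>column-<%=slno%></td>
            </tr>
    <%
    }
    %>
        </tbody>
    </table>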
@Stephan972

>But I couldn't able to do.
What you're trying to parse is not actually HTML. It looks like JSP code.
I'd suggest using a more appropriate parser than Jsoup.

@sgobinathsr
Author

Yes, it is a JSP page.
After the JSP executes, I pass the HTML content it generates to the Jsoup HTML parser in Java. But it does not work when slno > 7155.
Does Jsoup not support this much data? Or is there a solution?

@Stephan972

Can you please post the HTML content generated by the JSP page?

@sgobinathsr
Author

sgobinathsr commented Aug 22, 2016

I am unable to post the entire HTML content here because it exceeds the maximum character count.

Sample content for slno < 100 is below. Imagine the same content for slno > 7155; the problem only arises then. For slno < 100 there are no issues.

    <tr>
        <td>1</td>
        <td>column-1</td>
        <td>column-1</td>
        <td>column-1</td>
        <td>column-1</td>
    </tr>
    
    <tr>
        <td>2</td>
        <td>column-2</td>
        <td>column-2</td>
        <td>column-2</td>
        <td>column-2</td>
    </tr>
    
    <tr>
        <td>3</td>
        <td>column-3</td>
        <td>column-3</td>
        <td>column-3</td>
        <td>column-3</td>
    </tr>
    
    <tr>
        <td>4</td>
        <td>column-4</td>
        <td>column-4</td>
        <td>column-4</td>
        <td>column-4</td>
    </tr>
    
    <tr>
        <td>5</td>
        <td>column-5</td>
        <td>column-5</td>
        <td>column-5</td>
        <td>column-5</td>
    </tr>
    
    <tr>
        <td>6</td>
        <td>column-6</td>
        <td>column-6</td>
        <td>column-6</td>
        <td>column-6</td>
    </tr>
    
    <tr>
        <td>7</td>
        <td>column-7</td>
        <td>column-7</td>
        <td>column-7</td>
        <td>column-7</td>
    </tr>
    
    <tr>
        <td>8</td>
        <td>column-8</td>
        <td>column-8</td>
        <td>column-8</td>
        <td>column-8</td>
    </tr>
    
    <tr>
        <td>9</td>
        <td>column-9</td>
        <td>column-9</td>
        <td>column-9</td>
        <td>column-9</td>
    </tr>
    
    <tr>
        <td>10</td>
        <td>column-10</td>
        <td>column-10</td>
        <td>column-10</td>
        <td>column-10</td>
    </tr>
    
    <tr>
        <td>11</td>
        <td>column-11</td>
        <td>column-11</td>
        <td>column-11</td>
        <td>column-11</td>
    </tr>
    
    <tr>
        <td>12</td>
        <td>column-12</td>
        <td>column-12</td>
        <td>column-12</td>
        <td>column-12</td>
    </tr>
    
    <tr>
        <td>13</td>
        <td>column-13</td>
        <td>column-13</td>
        <td>column-13</td>
        <td>column-13</td>
    </tr>
    
    <tr>
        <td>14</td>
        <td>column-14</td>
        <td>column-14</td>
        <td>column-14</td>
        <td>column-14</td>
    </tr>
    
    <tr>
        <td>15</td>
        <td>column-15</td>
        <td>column-15</td>
        <td>column-15</td>
        <td>column-15</td>
    </tr>
    
    <tr>
        <td>16</td>
        <td>column-16</td>
        <td>column-16</td>
        <td>column-16</td>
        <td>column-16</td>
    </tr>
    
    <tr>
        <td>17</td>
        <td>column-17</td>
        <td>column-17</td>
        <td>column-17</td>
        <td>column-17</td>
    </tr>
    
    <tr>
        <td>18</td>
        <td>column-18</td>
        <td>column-18</td>
        <td>column-18</td>
        <td>column-18</td>
    </tr>
    
    <tr>
        <td>19</td>
        <td>column-19</td>
        <td>column-19</td>
        <td>column-19</td>
        <td>column-19</td>
    </tr>
    
    <tr>
        <td>20</td>
        <td>column-20</td>
        <td>column-20</td>
        <td>column-20</td>
        <td>column-20</td>
    </tr>
    
    <tr>
        <td>21</td>
        <td>column-21</td>
        <td>column-21</td>
        <td>column-21</td>
        <td>column-21</td>
    </tr>
    
    <tr>
        <td>22</td>
        <td>column-22</td>
        <td>column-22</td>
        <td>column-22</td>
        <td>column-22</td>
    </tr>
    
    <tr>
        <td>23</td>
        <td>column-23</td>
        <td>column-23</td>
        <td>column-23</td>
        <td>column-23</td>
    </tr>
    
    <tr>
        <td>24</td>
        <td>column-24</td>
        <td>column-24</td>
        <td>column-24</td>
        <td>column-24</td>
    </tr>
    
    <tr>
        <td>25</td>
        <td>column-25</td>
        <td>column-25</td>
        <td>column-25</td>
        <td>column-25</td>
    </tr>
    
    <tr>
        <td>26</td>
        <td>column-26</td>
        <td>column-26</td>
        <td>column-26</td>
        <td>column-26</td>
    </tr>
    
    <tr>
        <td>27</td>
        <td>column-27</td>
        <td>column-27</td>
        <td>column-27</td>
        <td>column-27</td>
    </tr>
    
    <tr>
        <td>28</td>
        <td>column-28</td>
        <td>column-28</td>
        <td>column-28</td>
        <td>column-28</td>
    </tr>
    
    <tr>
        <td>29</td>
        <td>column-29</td>
        <td>column-29</td>
        <td>column-29</td>
        <td>column-29</td>
    </tr>
    
    <tr>
        <td>30</td>
        <td>column-30</td>
        <td>column-30</td>
        <td>column-30</td>
        <td>column-30</td>
    </tr>
    
    <tr>
        <td>31</td>
        <td>column-31</td>
        <td>column-31</td>
        <td>column-31</td>
        <td>column-31</td>
    </tr>
    
    <tr>
        <td>32</td>
        <td>column-32</td>
        <td>column-32</td>
        <td>column-32</td>
        <td>column-32</td>
    </tr>
    
    <tr>
        <td>33</td>
        <td>column-33</td>
        <td>column-33</td>
        <td>column-33</td>
        <td>column-33</td>
    </tr>
    
    <tr>
        <td>34</td>
        <td>column-34</td>
        <td>column-34</td>
        <td>column-34</td>
        <td>column-34</td>
    </tr>
    
    <tr>
        <td>35</td>
        <td>column-35</td>
        <td>column-35</td>
        <td>column-35</td>
        <td>column-35</td>
    </tr>
    
    <tr>
        <td>36</td>
        <td>column-36</td>
        <td>column-36</td>
        <td>column-36</td>
        <td>column-36</td>
    </tr>
    
    <tr>
        <td>37</td>
        <td>column-37</td>
        <td>column-37</td>
        <td>column-37</td>
        <td>column-37</td>
    </tr>
    
    <tr>
        <td>38</td>
        <td>column-38</td>
        <td>column-38</td>
        <td>column-38</td>
        <td>column-38</td>
    </tr>
    
    <tr>
        <td>39</td>
        <td>column-39</td>
        <td>column-39</td>
        <td>column-39</td>
        <td>column-39</td>
    </tr>
    
    <tr>
        <td>40</td>
        <td>column-40</td>
        <td>column-40</td>
        <td>column-40</td>
        <td>column-40</td>
    </tr>
    
    <tr>
        <td>41</td>
        <td>column-41</td>
        <td>column-41</td>
        <td>column-41</td>
        <td>column-41</td>
    </tr>
    
    <tr>
        <td>42</td>
        <td>column-42</td>
        <td>column-42</td>
        <td>column-42</td>
        <td>column-42</td>
    </tr>
    
    <tr>
        <td>43</td>
        <td>column-43</td>
        <td>column-43</td>
        <td>column-43</td>
        <td>column-43</td>
    </tr>
    
    <tr>
        <td>44</td>
        <td>column-44</td>
        <td>column-44</td>
        <td>column-44</td>
        <td>column-44</td>
    </tr>
    
    <tr>
        <td>45</td>
        <td>column-45</td>
        <td>column-45</td>
        <td>column-45</td>
        <td>column-45</td>
    </tr>
    
    <tr>
        <td>46</td>
        <td>column-46</td>
        <td>column-46</td>
        <td>column-46</td>
        <td>column-46</td>
    </tr>
    
    <tr>
        <td>47</td>
        <td>column-47</td>
        <td>column-47</td>
        <td>column-47</td>
        <td>column-47</td>
    </tr>
    
    <tr>
        <td>48</td>
        <td>column-48</td>
        <td>column-48</td>
        <td>column-48</td>
        <td>column-48</td>
    </tr>
    
    <tr>
        <td>49</td>
        <td>column-49</td>
        <td>column-49</td>
        <td>column-49</td>
        <td>column-49</td>
    </tr>
    
    <tr>
        <td>50</td>
        <td>column-50</td>
        <td>column-50</td>
        <td>column-50</td>
        <td>column-50</td>
    </tr>
    
    <tr>
        <td>51</td>
        <td>column-51</td>
        <td>column-51</td>
        <td>column-51</td>
        <td>column-51</td>
    </tr>
    
    <tr>
        <td>52</td>
        <td>column-52</td>
        <td>column-52</td>
        <td>column-52</td>
        <td>column-52</td>
    </tr>
    
    <tr>
        <td>53</td>
        <td>column-53</td>
        <td>column-53</td>
        <td>column-53</td>
        <td>column-53</td>
    </tr>
    
    <tr>
        <td>54</td>
        <td>column-54</td>
        <td>column-54</td>
        <td>column-54</td>
        <td>column-54</td>
    </tr>
    
    <tr>
        <td>55</td>
        <td>column-55</td>
        <td>column-55</td>
        <td>column-55</td>
        <td>column-55</td>
    </tr>
    
    <tr>
        <td>56</td>
        <td>column-56</td>
        <td>column-56</td>
        <td>column-56</td>
        <td>column-56</td>
    </tr>
    
    <tr>
        <td>57</td>
        <td>column-57</td>
        <td>column-57</td>
        <td>column-57</td>
        <td>column-57</td>
    </tr>
    
    <tr>
        <td>58</td>
        <td>column-58</td>
        <td>column-58</td>
        <td>column-58</td>
        <td>column-58</td>
    </tr>
    
    <tr>
        <td>59</td>
        <td>column-59</td>
        <td>column-59</td>
        <td>column-59</td>
        <td>column-59</td>
    </tr>
    
    <tr>
        <td>60</td>
        <td>column-60</td>
        <td>column-60</td>
        <td>column-60</td>
        <td>column-60</td>
    </tr>
    
    <tr>
        <td>61</td>
        <td>column-61</td>
        <td>column-61</td>
        <td>column-61</td>
        <td>column-61</td>
    </tr>
    
    <tr>
        <td>62</td>
        <td>column-62</td>
        <td>column-62</td>
        <td>column-62</td>
        <td>column-62</td>
    </tr>
    
    <tr>
        <td>63</td>
        <td>column-63</td>
        <td>column-63</td>
        <td>column-63</td>
        <td>column-63</td>
    </tr>
    
    <tr>
        <td>64</td>
        <td>column-64</td>
        <td>column-64</td>
        <td>column-64</td>
        <td>column-64</td>
    </tr>
    
    <tr>
        <td>65</td>
        <td>column-65</td>
        <td>column-65</td>
        <td>column-65</td>
        <td>column-65</td>
    </tr>
    
    <tr>
        <td>66</td>
        <td>column-66</td>
        <td>column-66</td>
        <td>column-66</td>
        <td>column-66</td>
    </tr>
    
    <tr>
        <td>67</td>
        <td>column-67</td>
        <td>column-67</td>
        <td>column-67</td>
        <td>column-67</td>
    </tr>
    
    <tr>
        <td>68</td>
        <td>column-68</td>
        <td>column-68</td>
        <td>column-68</td>
        <td>column-68</td>
    </tr>
    
    <tr>
        <td>69</td>
        <td>column-69</td>
        <td>column-69</td>
        <td>column-69</td>
        <td>column-69</td>
    </tr>
    
    <tr>
        <td>70</td>
        <td>column-70</td>
        <td>column-70</td>
        <td>column-70</td>
        <td>column-70</td>
    </tr>
    
    <tr>
        <td>71</td>
        <td>column-71</td>
        <td>column-71</td>
        <td>column-71</td>
        <td>column-71</td>
    </tr>
    
    <tr>
        <td>72</td>
        <td>column-72</td>
        <td>column-72</td>
        <td>column-72</td>
        <td>column-72</td>
    </tr>
    
    <tr>
        <td>73</td>
        <td>column-73</td>
        <td>column-73</td>
        <td>column-73</td>
        <td>column-73</td>
    </tr>
    
    <tr>
        <td>74</td>
        <td>column-74</td>
        <td>column-74</td>
        <td>column-74</td>
        <td>column-74</td>
    </tr>
    
    <tr>
        <td>75</td>
        <td>column-75</td>
        <td>column-75</td>
        <td>column-75</td>
        <td>column-75</td>
    </tr>
    
    <tr>
        <td>76</td>
        <td>column-76</td>
        <td>column-76</td>
        <td>column-76</td>
        <td>column-76</td>
    </tr>
    
    <tr>
        <td>77</td>
        <td>column-77</td>
        <td>column-77</td>
        <td>column-77</td>
        <td>column-77</td>
    </tr>
    
    <tr>
        <td>78</td>
        <td>column-78</td>
        <td>column-78</td>
        <td>column-78</td>
        <td>column-78</td>
    </tr>
    
    <tr>
        <td>79</td>
        <td>column-79</td>
        <td>column-79</td>
        <td>column-79</td>
        <td>column-79</td>
    </tr>
    
    <tr>
        <td>80</td>
        <td>column-80</td>
        <td>column-80</td>
        <td>column-80</td>
        <td>column-80</td>
    </tr>
    
    <tr>
        <td>81</td>
        <td>column-81</td>
        <td>column-81</td>
        <td>column-81</td>
        <td>column-81</td>
    </tr>
    
    <tr>
        <td>82</td>
        <td>column-82</td>
        <td>column-82</td>
        <td>column-82</td>
        <td>column-82</td>
    </tr>
    
    <tr>
        <td>83</td>
        <td>column-83</td>
        <td>column-83</td>
        <td>column-83</td>
        <td>column-83</td>
    </tr>
    
    <tr>
        <td>84</td>
        <td>column-84</td>
        <td>column-84</td>
        <td>column-84</td>
        <td>column-84</td>
    </tr>
    
    <tr>
        <td>85</td>
        <td>column-85</td>
        <td>column-85</td>
        <td>column-85</td>
        <td>column-85</td>
    </tr>
    
    <tr>
        <td>86</td>
        <td>column-86</td>
        <td>column-86</td>
        <td>column-86</td>
        <td>column-86</td>
    </tr>
    
    <tr>
        <td>87</td>
        <td>column-87</td>
        <td>column-87</td>
        <td>column-87</td>
        <td>column-87</td>
    </tr>
    
    <tr>
        <td>88</td>
        <td>column-88</td>
        <td>column-88</td>
        <td>column-88</td>
        <td>column-88</td>
    </tr>
    
    <tr>
        <td>89</td>
        <td>column-89</td>
        <td>column-89</td>
        <td>column-89</td>
        <td>column-89</td>
    </tr>
    
    <tr>
        <td>90</td>
        <td>column-90</td>
        <td>column-90</td>
        <td>column-90</td>
        <td>column-90</td>
    </tr>
    
    <tr>
        <td>91</td>
        <td>column-91</td>
        <td>column-91</td>
        <td>column-91</td>
        <td>column-91</td>
    </tr>
    
    <tr>
        <td>92</td>
        <td>column-92</td>
        <td>column-92</td>
        <td>column-92</td>
        <td>column-92</td>
    </tr>
    
    <tr>
        <td>93</td>
        <td>column-93</td>
        <td>column-93</td>
        <td>column-93</td>
        <td>column-93</td>
    </tr>
    
    <tr>
        <td>94</td>
        <td>column-94</td>
        <td>column-94</td>
        <td>column-94</td>
        <td>column-94</td>
    </tr>
    
    <tr>
        <td>95</td>
        <td>column-95</td>
        <td>column-95</td>
        <td>column-95</td>
        <td>column-95</td>
    </tr>
    
    <tr>
        <td>96</td>
        <td>column-96</td>
        <td>column-96</td>
        <td>column-96</td>
        <td>column-96</td>
    </tr>
    
    <tr>
        <td>97</td>
        <td>column-97</td>
        <td>column-97</td>
        <td>column-97</td>
        <td>column-97</td>
    </tr>
    
    <tr>
        <td>98</td>
        <td>column-98</td>
        <td>column-98</td>
        <td>column-98</td>
        <td>column-98</td>
    </tr>
    
    <tr>
        <td>99</td>
        <td>column-99</td>
        <td>column-99</td>
        <td>column-99</td>
        <td>column-99</td>
    </tr>
    
</tbody>
SLNo Heading 1 Heading 2 Heading 3 Heading 4

@Stephan972

Stephan972 commented Aug 22, 2016

Try to load your HTML code (slno > 7155) with the code below.

Document doc = Jsoup.connect("http://your-server.com/slno/gt/7155/").maxBodySize(0).get();

NOTE: maxBodySize(0) will remove the default Jsoup limit of 1MB.

What happens?
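
To get a feel for when a 1 MB body limit could matter, the size of the generated table can be estimated without any network access. The sketch below is plain Java with no Jsoup dependency. The class name `TableSizeEstimate` and the exact row layout (indentation, newlines) are assumptions modelled on the sample rows posted above, so the real page may be larger or smaller.

```java
import java.nio.charset.StandardCharsets;

public class TableSizeEstimate {

    /** Builds n table rows shaped like the posted sample (the whitespace is an assumption). */
    public static String rows(int n) {
        StringBuilder sb = new StringBuilder();
        for (int slno = 1; slno <= n; slno++) {
            sb.append("    <tr>\n");
            sb.append("        <td>").append(slno).append("</td>\n");
            for (int col = 0; col < 4; col++) {
                sb.append("        <td>column-").append(slno).append("</td>\n");
            }
            sb.append("    </tr>\n");
        }
        return sb.toString();
    }

    /** Size of the text in bytes once encoded as UTF-8. */
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int limit = 1024 * 1024; // 1 MB, the default limit mentioned in the note above
        for (int n : new int[] {99, 7149, 7154}) {
            int size = utf8Bytes(rows(n));
            System.out.println(n + " rows: " + size + " bytes, over 1 MB: " + (size > limit));
        }
    }
}
```

Under this assumed layout, 7154 rows already exceed 1 MB, so setting `maxBodySize(0)` is worth doing when fetching a page of this size with `Jsoup.connect`.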

@sgobinathsr
Author

Document doc = Jsoup.connect("http://your-server.com/slno/gt/7155/").maxBodySize(0).get();
It's working fine.

But in my case it won't help, because on my screen the user can hide some rows or columns, which is done with jQuery. I need to skip the hidden rows and columns while parsing, so I need to pass the HTML content only after the rows or columns have been hidden.

Is it possible to do this with Document doc = Jsoup.parse(htmlData); instead?

@Stephan972

Stephan972 commented Aug 23, 2016

Try this:

1- Load the data with Jsoup

Document doc = Jsoup.connect("http://your-server.com/slno/gt/7155/").maxBodySize(0).get();

2- Remove unwanted rows and/or columns

String myCssQuery = "tr.to.remove, td.to.remove";
doc.select(myCssQuery).remove();

3- ...

doc.select("table").html();

@sgobinathsr
Author

Now I will explain the flow.

#1. The content of http://your-server.com/slno/gt/7155/ is displayed to the end user.
Example:

SLNo Heading 1 Heading 2 Heading 3 Heading 4
1 column-1 column-1 column-1 column-1 hide
2 column-2 column-2 column-2 column-2 hide
3 column-3 column-3 column-3 column-3 hide
4 column-4 column-4 column-4 column-4 hide
5 column-5 column-5 column-5 column-5 hide

#2. The user may hide a few rows or columns. Suppose they hide row 3 and row 4:

SLNo Heading 1 Heading 2 Heading 3 Heading 4
1 column-1 column-1 column-1 column-1 hide
2 column-2 column-2 column-2 column-2 hide
5 column-5 column-5 column-5 column-5 hide

#3. Now I want to parse only the visible HTML above. In that case, it is not possible to read the HTML content from the URL http://your-server.com/slno/gt/7155/, because that page contains everything. Only the HTML currently shown in the browser has the "tr.to.remove, td.to.remove" elements hidden, so passing that HTML content to Jsoup as a string is the way.

@Stephan972

1- Get the visible HTML with jQuery
2- Make an ajax request with jQuery to send the visible HTML to your server
3- Let Jsoup deal with the visible HTML on the server (Document doc = Jsoup.parse(visibleHtml);)
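
For the last step, it can help to confirm that the markup actually reached the server before blaming the parser. The sketch below is plain Java with no Jsoup dependency. `VisibleHtmlReader`, `readAll`, and `isMissing` are hypothetical names, and the sketch assumes the request body is available as an `InputStream` containing UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class VisibleHtmlReader {

    /** Reads the whole stream as UTF-8 text (hypothetical helper, not part of Jsoup). */
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            // Keep reading until the stream reports end of input.
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when nothing usable arrived, which points to a problem upstream of the parser. */
    public static boolean isMissing(String html) {
        return html == null || html.trim().isEmpty();
    }
}
```

If `isMissing` reports true for large tables but not for small ones, the data is being lost before `Jsoup.parse(visibleHtml)` is ever called.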

@sgobinathsr
Author

Yes, I tried that, but it does not accept slno > 7155. If slno goes beyond 7155, it returns a blank object.

@sgobinathsr
Author

Sending the HTML content as multipart/form-data works fine.
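
The thread does not state why multipart/form-data succeeded where the earlier request failed. One possible explanation, an assumption that neither participant confirms, is a server-side cap on form-encoded POST bodies (Tomcat's `maxPostSize` connector attribute is one example of such a setting). Form encoding also inflates markup, since characters such as `<`, `>`, `/` and newlines each become three characters, whereas multipart sends the field unescaped. The sketch below measures that inflation in plain Java. `FormEncodingOverhead` is a hypothetical name, and `URLEncoder` only approximates what a browser or jQuery would send.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingOverhead {

    /** Length of the value as it would appear in an application/x-www-form-urlencoded body. */
    public static int formEncodedLength(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A single illustrative row; real markup will differ.
        String row = "<tr><td>1</td><td>column-1</td></tr>\n";
        System.out.println("raw length: " + row.length());
        System.out.println("form-encoded length: " + formEncodedLength(row));
    }
}
```

Comparing the two numbers for a full-size table shows how much closer a form-encoded request gets to any body-size limit than the same HTML sent as a multipart field.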
