### Disclaimer

Copyright and other regulations surrounding programmatic gathering of data vary by website and jurisdiction. It is your responsibility to check all applicable laws and terms of use of any website or service before attempting any web harvesting or similar activities.

# Website Content Scraping

Extracting content from websites is harder, and usually requires some knowledge of HTML and JavaScript.

Some toolsets make this easier.

# Example 1 (simple): Table Data

`pandas.read_html` is a killer tool for harvesting table data: it parses every `<table>` element on a page into a list of DataFrames.

<img src="assets/WikipediaTable-screenshot.png" style="height: 500px"/>

In [None]:
import pandas as pd
import requests

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

text = requests.get(url).text
text

In [None]:
tables = pd.read_html(url)

print("Found {} tables in {}".format(len(tables), url))

In [None]:
tables[2]
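
`read_html` returns one DataFrame per `<table>` element, so picking a table by position (as with `tables[2]` above) can break when the page layout changes. Below is a minimal sketch of an alternative, assuming a small made-up HTML snippet: `SAMPLE_HTML` and its contents are hypothetical and not taken from Wikipedia. The `match` argument keeps only tables whose text matches a string or regex. Like the cells above, it needs an HTML parser such as `lxml` installed.

```python
import io

import pandas as pd

# Hypothetical inline HTML standing in for a downloaded page (not real data).
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Code</th></tr>
  <tr><td>Alpha</td><td>AA</td></tr>
  <tr><td>Beta</td><td>BB</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Value</th></tr>
  <tr><td>2020</td><td>1.5</td></tr>
</table>
"""

# Wrapping the markup in StringIO lets read_html treat it as a file-like object.
sample_tables = pd.read_html(io.StringIO(SAMPLE_HTML))

# Keep only the tables that contain the text "Country".
country_tables = pd.read_html(io.StringIO(SAMPLE_HTML), match="Country")

print("all tables:", len(sample_tables))
print("matching tables:", len(country_tables))
print(country_tables[0])
```

Selecting by content rather than by index is usually a little more robust, though the match text itself can still change when the page is edited.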

# Example 2 (normal): Static HTML Content Extraction

Starting from https://www.jll.com.sg/, navigate to `Properties`, then `Office Properties`.

<img src="assets/PropertyInfo-screenshot.png" style="height:500px"/>

In [None]:
import requests
import lxml.html

property_url = 'https://property.jll.com.sg/office-lease/singapore/raffles-place-and-vicinity/18-robinson-sgp-p-000ff3'

html_text = requests.get(property_url).text
html_text

In [None]:
root = lxml.html.fromstring(html_text)
root

In [None]:
[e for e in root.iter() if e.get("class") and 'Property--Details-Table' in e.get("class")]

In [None]:
info_body_element = [e for e in root.iter() if e.get("class") and 'Property--Details-Table' in e.get("class")][0]
info_body_element

In [None]:
# e.text is None when a tag has no leading text, so fall back to '' before stripping
field_names = [(e.text or '').strip() for e in info_body_element.iter() if e.tag == 'strong']
field_names

In [None]:
# e.text is None when a tag has no leading text, so fall back to '' before stripping
field_values = [(e.text or '').strip() for e in info_body_element.iter() if e.tag == 'p']
field_values

In [None]:
len(field_names) == len(field_values)

In [None]:
dict(zip(field_names, field_values))

### Advanced Usage: XPath

In [None]:
root.xpath("//div[contains(@class, 'Property--Details-Table')]")

In [None]:
root.xpath("//div[contains(@class, 'Property--Details-Table')]//strong")
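
One caveat with `contains(@class, ...)` is that it does substring matching, so it also matches any class name that merely starts with or embeds the target. The sketch below illustrates this on a made-up snippet: `SAMPLE_PAGE` and the `Property--Details-Table-Legacy` class are hypothetical and do not come from the JLL site. It compares the loose query with a stricter, whitespace-delimited variant.

```python
import lxml.html

# Hypothetical markup: the second div's class only *starts with* the target name.
SAMPLE_PAGE = (
    "<html><body>"
    '<div class="Property--Details-Table wide">'
    "<strong> Floor size </strong><p> 1,000 sq ft </p></div>"
    '<div class="Property--Details-Table-Legacy">'
    "<strong>Old field</strong><p>Old value</p></div>"
    "</body></html>"
)

sample_root = lxml.html.fromstring(SAMPLE_PAGE)

# Substring match: both divs qualify.
loose_matches = sample_root.xpath(
    "//div[contains(@class, 'Property--Details-Table')]"
)

# Token match: pad the class list with spaces and look for the exact token.
strict_matches = sample_root.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), "
    "' Property--Details-Table ')]"
)

# Relative XPath (leading dot) searches only inside the matched element.
labels = [t.strip() for t in strict_matches[0].xpath(".//strong/text()")]

print(len(loose_matches), len(strict_matches), labels)
```

Whether the stricter form is needed depends on the page; for a site with only one matching class, the simpler `contains` query used above is enough.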

## Let's write a function to extract property information given a URL

In [None]:
import requests
import lxml.html

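def get_property_info_jll(url):
    """Return the detail fields of a JLL property page as a {name: value} dict."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    root = lxml.html.fromstring(response.text)

    # Field labels sit in <strong> tags and field values in <p> tags inside the details table
    table_xpath = "//div[contains(@class, 'Property--Details-Table')]"
    field_names = root.xpath(table_xpath + "//strong/text()")
    field_values = root.xpath(table_xpath + "//p/text()")

    return dict(zip([f.strip() for f in field_names], [v.strip() for v in field_values]))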

In [None]:
get_property_info_jll(property_url)

# Example 3 (hard): Dynamic Data 


Many websites, such as property listing sites, do not serve their data as static pages. Instead, the data reaches the browser either through dynamically rendered HTML or through JavaScript XHR (HTTP) calls made after the page loads.


<img src="assets/Craiglist-screenshot.png" style="height:500px" />

In [None]:
import requests
import lxml.html

cg_url = 'https://newyork.craigslist.org/d/vacation-rentals/search/vac'

headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
}

listing_text = requests.get(cg_url, headers=headers).text
listing_text

In [None]:
listing_root = lxml.html.fromstring(listing_text)
listing_urls = listing_root.xpath('//li[@class="result-row"]/a/@href')
listing_urls

In [None]:
detail_text = requests.get(listing_urls[0], headers=headers).text
detail_root = lxml.html.fromstring(detail_text)
detail_description = ''.join(detail_root.xpath("//section[@id='postingbody']/text()")).strip()
detail_description
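
The cells above cover the case where the listing HTML is rendered on the server. For the other case mentioned at the start of this example, where the page fetches its data through XHR calls, the response is often JSON rather than HTML and can be parsed without XPath. Here is a minimal sketch, assuming a made-up payload: `SAMPLE_XHR_BODY`, its `items`, `title`, and `price` keys, and the `extract_listings` helper are all hypothetical and do not describe any real site's API.

```python
import json

# Hypothetical response body, shaped like what an XHR endpoint might return.
SAMPLE_XHR_BODY = """
{
  "items": [
    {"title": "Studio near park", "price": 120},
    {"title": "Loft downtown", "price": 200}
  ]
}
"""


def extract_listings(body):
    """Parse a JSON response body into a list of (title, price) tuples."""
    payload = json.loads(body)
    # .get() with a default keeps the function usable when "items" is absent.
    return [(item["title"], item["price"]) for item in payload.get("items", [])]


listings = extract_listings(SAMPLE_XHR_BODY)
print(listings)
```

In practice the body would come from `requests.get(...).text` against an endpoint found in the browser's network tab, and the key names would need to be adapted to whatever that endpoint actually returns.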