# HTML and web scraping

Reference: 
- Advanced Business Analytics course by Stanislav Borysov, DTU Management.
- https://www.w3schools.com/html/default.asp
- https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf

Web crawling and scraping are a very flexible way to get content from the Internet. Essentially, a scraper imitates a user who visits different webpages and views their content. The only difference is that Internet companies usually love real users and hate scraping bots, so be prepared to be blocked. **In the worst case, you can get into serious trouble, so please always read the Terms & Conditions and follow the company's policy on automatic data collection (or ask them directly if you are not sure)!**

Additionally, you can always read the **robots.txt** file to find out what you are and are not allowed to do. In a nutshell, **robots.txt** is a text file that webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents. The basic format is:

User-agent: [user-agent name] 

Disallow: [URL string not to be crawled]


Below is a commented example of the robots.txt file of Buzzfeed (reference: https://moz.com/learn/seo/robotstxt )

![Robot.txt](https://moz-static.s3.amazonaws.com/learn/seo/Robots.txt-and-Robots-meta-directives/_large/Robots.txt.png?mtime=20170427090304)
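
The rules can also be checked programmatically. The following is a minimal sketch using `urllib.robotparser` from the Python standard library. The rules, the `my-scraper` user-agent name, and the `example.com` URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example runs offline.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) returns True if the rules allow crawling the URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/news/")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a real website, you would instead point the parser at the site's file with `parser.set_url(...)` followed by `parser.read()`, which downloads and parses it.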

## 1. HTML

Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser.

#### 1.1 Basic structure

- All HTML documents must start with a document type declaration: $<!DOCTYPE html>$.
- The HTML document itself begins with $<html>$ and ends with $</html>$.
- The visible part of the HTML document is between $<body>$ and $</body>$.

Let's see an example ( _%%HTML_ is only needed to run HTML code in Jupyter) 

In [1]:
%%HTML  

<!DOCTYPE html>
<html>
<body>

<h3>My First Heading</h3>
<p>My first paragraph.</p>

</body>
</html>

#### 1.2 Elements Tags

An HTML element usually consists of a start tag and an end tag, with the content inserted in between. The HTML element is everything from the start tag to the end tag. 

HTML tags are the hidden keywords within a web page that define how your web browser must format and display the content. Most tags have two parts, an opening and a closing part. For example, $<html>$ is the opening tag and $</html>$ is the closing tag. Note that the closing tag has the same text as the opening tag, but has an additional forward-slash ( / ) character, which can be read as the "end" or "close" character.

Common tags are:
- __Headings__  defined with $<h >$ and a number from 1 to 6. $<h1>$ defines the most important heading, $<h6>$ defines the least important.
- __Paragraphs__  defined with the $<p>$ tag.
- __Links__  defined with the $<a>$ tag. The link's destination is specified in the $href$ attribute.
- __Images__  defined with the $<img>$ tag. The source file $(src)$, alternative text $(alt)$, $width$, and $height$ are provided as attributes.
- __Buttons__  defined with the $<button>$ tag.
- __Lists__  defined with the $<ul>$ (unordered/bullet list) or the $<ol>$ (ordered/numbered list) tag, followed by $<li>$ tags (list items).

List of Elements Tag: https://www.w3schools.com/tags/ref_byfunc.asp

#### 1.2 Elements container

As "pure" containers, the $<div>$ and $<span>$ elements do not inherently represent anything. Instead, they are used to group content so that it can be easily styled using the class or id attributes, marked as being written in a different language (using the lang attribute), and so on.

Main container elements:
- The __$<div>$ element__ is often used as a container for other HTML elements.
- The __$<span>$ element__ is often used as a container for some text.

Keep in mind that HTML elements can be nested (elements can contain elements). For example, the $<body>$ element of an HTML document can contain paragraphs and headings (see the first example).

#### 1.3 Elements Attributes

Attributes provide additional information about an element. Attributes are always specified in the start tag. For example,
- $href$ specifies the link address.
- $src$ specifies the filename of the image source.
- $width$ and $height$ specify the width and height of the image.
- $alt$ specifies an alternative text to be used if an image cannot be displayed.
- $style$ specifies the styling of an element, like color, font, size etc.
- $lang$ declares the language of the document.
- $title$ adds extra information to an element, for example a $<p>$ element. The value of the title attribute is displayed as a tooltip when you mouse over the paragraph.
- $class$ is used to define equal styles for elements with the same class name.
- $id$ specifies a unique id for an HTML element (the value must be unique within the HTML document).

Complete list of attributes: https://www.w3schools.com/tags/ref_attributes.asp
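
Since attributes live in the start tag, a parser can hand them over together with the tag name. The sketch below uses `html.parser.HTMLParser` from the Python standard library to list every start tag with its attributes. The class name `TagLister` and the small HTML snippet are chosen here for illustration only.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect a (tag, attributes) pair for every start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples taken from the start tag
        self.found.append((tag, dict(attrs)))


html = """
<a href="https://www.w3schools.com">This is a link</a>
<img src="img_girl.jpg" width="50" height="60" alt="Girl with a jacket">
<p title="It really happens">This is the second paragraph.</p>
"""

lister = TagLister()
lister.feed(html)
for tag, attributes in lister.found:
    print(tag, attributes)
```

Each printed line shows one tag followed by a dictionary of its attributes, which is the same information a scraper reads when it asks for `href` or `src`.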

To get a better idea look at the example below.

In [2]:
%%HTML

<!DOCTYPE html>
<html lang="en-US">
<body>
<div style="background-image: url('img_girl.jpg');">

<a href="https://www.w3schools.com">This is a link</a>

<img src="img_girl.jpg" width="50" height="60" alt="Girl with a jacket">

<p style="color:red; font-family:courier;font-size:160%;">This is the first paragraph.</p>

<p title="It really happens">
This is the second paragraph.
</p>

</div>
</body>
</html>

Class example:

In [3]:
%%HTML

<!DOCTYPE html>
<html>
<head>
<style>
.cities {
  background-color: black;
  color: white;
  margin: 20px;
  padding: 20px;
}
</style>
</head>
<body>

<div class="cities">
  <h2>London</h2>
  <p>London is the capital of England.</p>
</div>

</body>
</html>

#### 1.4 Tables

An HTML table is defined with the $<table>$ tag.

Each table row is defined with the $<tr>$ tag. A table header is defined with the $<th>$ tag. By default, table headings are bold and centered. A table data/cell is defined with the $<td>$ tag.

In [4]:
%%HTML

<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table>
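
Because a table is just nested $<tr>$, $<th>$ and $<td>$ elements, its content can be turned into Python records by following those tags. The sketch below does this with `html.parser.HTMLParser` from the standard library. The class name `TableReader` is introduced here, and the code assumes a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []            # a new row starts
        elif tag in ("th", "td") and self._row is not None:
            self._in_cell = True      # a new cell starts
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


html = """
<table style="width:100%">
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
"""

reader = TableReader()
reader.feed(html)

# The first row holds the headers; the remaining rows hold the data.
header, *body = reader.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

All values are kept as strings, so a column such as `Age` would still need converting with `int(...)` before doing arithmetic on it.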

## 2. Web Scraping 
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

### 2.1 DTU News

In this exercise, we will implement a web scraper that gets news headlines together with the date and a short description from https://www.dtu.dk/english/news. We are going to use PyQuery; however, keep in mind that many other frameworks are available, such as Scrapy (https://scrapy.org/).

#### 2.1.1 Navigating the website

In this case, the webpage shows the 10 most recent news items by default; however, it is possible to get up to the 100 most recent items by modifying the URL parameters.

It is very important to get familiar with the website we are working with. We need to know what can and cannot be obtained, and we need to understand the URL structure.

In the example below we want the 200 most recent items. Since at most 100 items can be displayed at once, we construct 2 URLs: the first to get the 100 most recent items and the second to get the 101st to the 200th most recent items. In the query string, $fr$ is the index of the first item and $mr$ is the number of items to show.

In [5]:
import math

url_base = "https://www.dtu.dk/english/news"

urls = []
n_items = 200
start = 1  # index of the first news item on the page
items_to_show = 100  # The webpage cannot show more than 100 items at once
for i in range(math.ceil(n_items / items_to_show)):
    parameters = [
        "fr={}".format(start),
        "mr={}".format(items_to_show)
    ]
    url = url_base + "?" + "&".join(parameters)
    urls.append(url)
    start += items_to_show

print(urls)
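
The query string can also be assembled with `urllib.parse.urlencode` from the standard library, which joins the parameters and escapes special characters. The sketch below should produce the same two URLs as the loop above; the names `page_url` and `page_urls` are introduced here for illustration.

```python
from urllib.parse import urlencode

url_base = "https://www.dtu.dk/english/news"
n_items = 200
items_to_show = 100  # maximum number of items the page shows at once


def page_url(start):
    """Build the URL of the page whose first item has index `start`."""
    query = urlencode({"fr": start, "mr": items_to_show})
    return url_base + "?" + query


# Start indices 1, 101, ... until n_items are covered
page_urls = [page_url(start) for start in range(1, n_items + 1, items_to_show)]
print(page_urls)
```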

### 2.1.2 Scraping the webpage content

Let's retrieve the information we are interested in and save it in a file.

In [6]:
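import urllib.request

# Save each page to its own file; writing every page to the same
# file name would keep only the last page.
for page, url in enumerate(urls, start=1):
    contents = urllib.request.urlopen(url).read()
    with open("dtu_news_{}.html".format(page), "wb") as f:
        f.write(contents)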

Note that all styling and images are missing when you open a saved file, since we saved only the HTML content.

### 2.1.3. Extracting the data

Once the web content is loaded, we want to extract the information we need. To do that we can use regular expressions or an HTML parser (e.g., PyQuery, BeautifulSoup); the extracted content can then be saved, for example to a JSON file. The cell below requests the pages again; the commented-out lines show how to load a saved copy from a file instead.

To extract any kind of information from an HTML document, it is very important to understand the structure of the webpage. This can be done directly from your browser. Try:
- Chrome: right click -> inspect
- Safari, Firefox, Edge: right click -> inspect element

Once the webpage structure is clear, we create a PyQuery object that stores all the elements we are interested in. To do that, we inspect the webpage in our browser and pick a selector (typically a tag name and a class) that identifies each element. Note that the information is often stored in containers that have a particular class. Then we can iterate through the list of elements and retrieve the needed information by specifying the attribute name with $.attr()$, or with $.text()$ if the information is stored as text. If the information is stored in a sub-element, we can use the function $find$. To access a direct sub-element we can also use the ">" symbol (this is the selector syntax used by PyQuery and can be different in other frameworks).

In [7]:
# To work from a saved copy instead of requesting the page again:
# with open("dtu_news.html", "r", encoding='utf8') as f:
#     contents = f.read()

from pyquery import PyQuery

# Collect the news items from every URL built above

news = []
for url in urls:
    # Request the content of the website
    contents = urllib.request.urlopen(url).read()

    # Load the document
    pq = PyQuery(contents)

    # Get all the <div> elements with the class 'newsItem'.
    # We know that from inspecting the webpage.
    newsItems = pq('div.newsItem')

    # Loop over the news items one by one
    for i in range(len(newsItems)):
        # one news item
        item = newsItems.eq(i)
        # url: the href attribute of the link inside the heading
        item_url = item.find('h2>a').attr('href')
        # title: the text of the same link
        item_title = item.find('h2>a').text()
        # short description
        item_desc = item.find('p').text()
        # date
        item_date = item.find('span.date').text()
        # Store the fields of this item in a dictionary
        new_item = {
            'url': item_url,
            'title': item_title,
            'desc': item_desc,
            'date': item_date
        }

        print(new_item)
        news.append(new_item)

{'url': 'https://www.dtu.dk/english/news/Nyhed?id={0DB56848-A70C-4584-9486-D36BFA326369}', 'title': 'Students get huge workshop for prototyping and 3D printing', 'desc': 'The inauguration of a new prototype lab in Ballerup gives DTU students 600 m2 of state-of-the-art experimental facilities and a playground for product development and prototype...', 'date': '14 OCT'}
{'url': 'https://www.dtu.dk/english/news/Nyhed?id={DE4FFF77-4F09-4D41-89FA-6E2F4B977199}', 'title': 'DKK 144.5 million from the EU for health technology projects', 'desc': 'On 11 October, the European Research Council (ERC) awarded new ERC Synergy grants. DTU participates in two of these, which have each been granted more than DKK 70 million...', 'date': '11 OCT'}
{'url': 'https://www.dtu.dk/english/news/Nyhed?id={643EC12B-8773-4528-BFF4-917069373B7D}', 'title': 'DTU destroys building to check computer program', 'desc': 'Loading a two-storey building will clarify whether numerical models in computer programs ensure suffic

{'url': 'https://www.dtu.dk/english/news/Nyhed?id={4DEEF47A-4C5D-4FD3-90FC-9B06D69E5F67}', 'title': 'Artificial cooling of Arctic soil safeguards global genetic material', 'desc': 'Artificial cooling and creative engineering solutions are key to taking good care of the backup copies of the world’s plant-based genetic material, which are stored in...', 'date': '04 JUN'}
{'url': 'https://www.dtu.dk/english/news/Nyhed?id={E045F18E-6B68-48A7-8D35-EA16880CF181}', 'title': 'Associate professor Thomas Bolander receives H.C. Ørsted Medaljen', 'desc': 'The prestigious H.C. Ørsted Medalje is awarded to associate professor Thomas Bolander from DTU Compute for his excellent ability to disseminate research.', 'date': '29 MAY'}
{'url': 'https://www.dtu.dk/english/news/Nyhed?id={A2A7EBB9-32D0-48F5-ACA7-F78EEBEF3499}', 'title': 'Lars Christoffersen new Dean of Undergraduate Studies and Student Affairs', 'desc': 'Head of DTU Diplom has been appointed new Dean of Undergraduate Studies and Student Affair
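
A list of dictionaries like the one built above can be stored as a JSON file with the standard `json` module. The sketch below uses two made-up records with the same keys (`url`, `title`, `desc`, `date`) so that it runs without any network access; in the notebook you would pass the `news` list instead. The file name `dtu_news.json` is an arbitrary choice.

```python
import json
import os
import tempfile

# Made-up records with the same keys as the scraped items.
sample_news = [
    {"url": "https://example.com/news/1", "title": "First headline",
     "desc": "Short description of the first item.", "date": "14 OCT"},
    {"url": "https://example.com/news/2", "title": "Ørsted medal awarded",
     "desc": "Short description of the second item.", "date": "29 MAY"},
]

# The file is written to the system's temporary directory; any path works.
path = os.path.join(tempfile.gettempdir(), "dtu_news.json")

# ensure_ascii=False keeps characters such as "Ø" readable in the file.
with open(path, "w", encoding="utf8") as f:
    json.dump(sample_news, f, ensure_ascii=False, indent=2)

# Read the file back to check that nothing was lost.
with open(path, "r", encoding="utf8") as f:
    loaded = json.load(f)

print(loaded == sample_news)
```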

### 2.2 Pictures links MLSM

In this example, we want to get the picture links of the people in the MLSM group (http://mlsm.man.dtu.dk).

First, we specify the webpage URL and request the page content.

Second, we load the document in PyQuery.

Inspecting the page, we find that the picture link is stored in the $src$ attribute of image ($img$) elements placed directly inside a container with the classes "person-img" and "circle".

In [8]:
# specify the url
url = "http://mlsm.man.dtu.dk/people/"
# get the content
contents = urllib.request.urlopen(url).read()   
# load the document in PyQuery
pq = PyQuery(contents)
# Get all the "img" elements directly inside a container with the classes "person-img" and "circle"
images = pq('.person-img.circle > img')
# For each element, print the source link.
for x in images.items():
    print(x.attr('src'))

https://i2.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2018/11/pereira_francisco-e1545259489853.jpg?resize=103%2C103
https://i0.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2019/08/cami.png?resize=150%2C150
https://i0.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2018/11/Rodrigues_Filipe.jpeg?resize=150%2C150
https://i0.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2019/01/ava_-square_400x400.jpeg?resize=150%2C150
https://i1.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2018/12/Bojan-Kostic-e1547027539712.jpg?resize=150%2C150
https://i0.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2018/12/Petersen_Niklas.jpg?resize=150%2C150
https://i0.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2018/11/peled_inon.jpg?resize=150%2C150
https://i1.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2018/11/Servizi_Valentino2.jpg?resize=150%2C150
https://i2.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2019/08/sergio.png?resize=150%2C150
https://i2.wp.com/mlsm.man.dtu.dk/wp-content/uploads/2019/01/Daniele_picture-e1547036005444.jpg?resize=1

### 2.3 Weather

In [9]:
# Get today's temperature from Foreca
import urllib.request
from pyquery import PyQuery

url = "https://www.foreca.com/Denmark/Kongens_Lyngby"
contents = urllib.request.urlopen(url).read()
pq = PyQuery(contents)

# Here we can see multiple ways to get the same element

# 1. A span that is a direct child of "div.left"
x = pq("div.left > span.warm.txt-xxlarge")
print(x.text())

# 2. Select the container first, then search inside it
container = pq("div.left")
x = container.find("span.warm.txt-xxlarge")
print(x.text())

# 3. Tag name and classes only
x = pq("span.warm.txt-xxlarge")
print(x.text())

# 4. Classes only
x = pq(".warm.txt-xxlarge")
print(x.text())

## 3. How to not get blacklisted

When scraping, you may get blocked because website owners do not like scrapers much. Here are some strategies to avoid being blocked.

### 3.1 User agent

One thing is to set a user agent. A user agent is the software that acts on behalf of the user, and the User-Agent request header tells the server which web browser the user is using to visit the website. Many websites do not let you view the content if the user-agent header is not set.
This can easily be done using the **requests** package, which plays the same role as **urllib.request**.

Try first to request the HTML page from the H&M store webpage in the way we have seen before, and then with a user agent set.

In [10]:
url="https://www2.hm.com/da_dk/herre/produkt/hoodies-og-sweatshirts.html"

In [11]:
# Without a user agent
import urllib.error

try:
    contents = urllib.request.urlopen(url).read()
except urllib.error.HTTPError as e:
    # The server rejects the request; print the actual error
    print(e)

HTTP Error 403: Forbidden


In [12]:
import requests

headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    }

# User Agent
r = requests.get(url,headers=headers)

print(r.text)

<!DOCTYPE HTML>
<html lang="da" is-in-aem="false" class="no-js da-dk" ng-app="hmApp" >
    <head>
   
       

<script type="text/javascript" src="/dtagent_ICA23STVbpqr_7000100141019.js" data-dtconfig="agentUri=/dtagent_ICA23STVbpqr_7000100141019.js|rid=RID_1889581274|rpid=886243558|domain=hm.com|bandwidth=300_m|lastModification=1570150705405|lab=1|tp=500,50,0,1,10|reportUrl=dynaTraceMonitor|app=H&M Production Web"></script><script>
var hm_deviceType="desktop";
</script>
<!--[if lt IE 9]>
<link rel="stylesheet" href="/etc.clientlibs/settings/wcm/designs/hm/clientlibs/desktop/ie8.min.css" type="text/css">
<script type="text/javascript" src="/etc.clientlibs/settings/wcm/designs/hm/clientlibs/desktop/ie8.min.js"></script>
<![endif]-->
<script type="text/javascript" src="/etc.clientlibs/settings/wcm/designs/hm/clientlibs/shared/jquery.min.js"></script>
<script type="text/javascript" src="/etc.clientlibs/settings/wcm/designs/hm/clientlibs/shared/head.min.12.1.52.js"></script>

### 3.2 Delays

When you want to make several requests to the same website, a good way to mimic human behaviour is to wait between requests. You can do that using time.sleep() and wait for a random number of seconds.

In [13]:
import numpy as np
import time
delay=np.random.randint(0,10)
time.sleep(delay)
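
The same idea can be wrapped in a small helper that pauses between consecutive requests. This is only a sketch: the function name `fetch_politely`, its parameters, and the 1 to 5 second range are assumptions made here, not values recommended by any website. The `fetch` and `sleep` arguments are passed in so that the demo runs without downloading anything or actually waiting.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait a random number of seconds before every request but the first
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with stand-ins: a fake fetch function and a "sleep" that only records
# the requested waiting time instead of pausing.
waits = []
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: "content of " + url,
    sleep=waits.append,
)
print(pages)
print(len(waits))
```

In real use you would leave `sleep` at its default and pass a function that downloads the page, for example one built on `urllib.request.urlopen`.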

### 3.3 Other ways to not get blocked

There are many other tricks to avoid getting blocked. A general suggestion is to imitate human behaviour as much as possible:
- Delay your requests;
- Limit the amount of data downloaded at once;
- Do not follow the same crawling pattern.

It is also possible to make multiple requests from different IP addresses, which in a way means pretending to be different users. Sites can figure out the pattern of a certain IP or pool of IPs and simply block them, but it is possible to buy IPs and use them to make requests. Similarly, you can rotate the user agent. To do that you can use a list of user agents such as [Udger](https://udger.com/resources/ua-list).
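
Rotating the user agent can be sketched with the standard library alone: pick a random entry from a list and attach it to the request. The first string below is the user agent used earlier in this notebook; the other two are hypothetical placeholders, and `build_request` is a name introduced here. The request is only constructed, not sent.

```python
import random
import urllib.request

USER_AGENTS = [
    # User agent used earlier in this notebook
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    # Hypothetical placeholders; replace them with real user-agent strings
    "ExampleBrowser/1.0 (hypothetical)",
    "ExampleBrowser/2.0 (hypothetical)",
]


def build_request(url):
    """Return a Request for `url` with a randomly chosen user agent."""
    agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": agent})


req = build_request("https://example.com/")
# Request stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would send the request with the chosen header.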