<a href="https://colab.research.google.com/github/pakueng/practical-nlp-code/blob/master/Ch2/01_WebScraping_using_BeautifulSoup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this notebook we show how to scrape data from web pages using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python library.
<br><br>

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install numpy==1.19.5
!pip install beautifulsoup4==4.6.3

# ===========================

Collecting numpy==1.19.5
  Downloading numpy-1.19.5.zip (7.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.3/7.3 MB 23.2 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: numpy
  Building wheel for numpy (pyproject.toml) ... canceled
ERROR: Operation cancelled by user
Collecting beautifulsoup4==4.6.3
  Downloading beautifulsoup4-4.6.3-py3-none-any.whl.metadata (3.0 kB)
Downloading beautifulsoup4-4.6.3-py3-none-any.whl (90 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90.4/90.4 kB 2.5 MB/s eta 0:00:00
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.13.3
    Uninstalling beautifulsoup4-4.13.3:
      S

In [None]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch2/ch2-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch2-requirements.txt"

# ===========================

In [2]:
# making the necessary imports
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [9]:
# myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" # specify the url
# html = urlopen(myurl).read() # query the website so that it returns a html page
# soupified = BeautifulSoup(html, 'html.parser') # parse the html in the 'html' variable, and store it in Beautiful Soup format

In [5]:
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.error import URLError  # HTTPError is a subclass, so this covers both

myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" # URL of the page to scrape

# Some sites reject urllib's default user agent, so send a browser-like one
req = Request(myurl, headers={'User-Agent': 'Mozilla/5.0'})

try:
    with urlopen(req, timeout=30) as response:
        html = response.read() # request the page and read the returned HTML as bytes
except URLError as e:
    print(f"An error occurred: {e}")  # e.g. an HTTP 403/404 or a network failure
    # The cells below need `soupified`, so fix the problem (or retry) and re-run this cell
else:
    soupified = BeautifulSoup(html, 'html.parser') # parse the HTML and store it as a BeautifulSoup object
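
One way to handle a failed request is to retry it a few times before giving up. The sketch below illustrates that idea using only the standard library. The function name `fetch_html` and its parameters `retries`, `delay`, and `opener` are hypothetical and introduced here for illustration; the choice to stop immediately on most 4xx responses is an assumption, not something the notebook prescribes.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, retries=3, delay=1.0, opener=urlopen):
    """Return the page body as bytes, or None if every attempt fails.

    All names here are hypothetical; `opener` exists only so the
    function can be exercised without a real network connection.
    """
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for attempt in range(1, retries + 1):
        try:
            with opener(req) as response:
                return response.read()
        except HTTPError as e:
            # Assumption: client errors other than 429 (e.g. 403, 404)
            # are unlikely to succeed on a retry, so stop early.
            if 400 <= e.code < 500 and e.code != 429:
                print(f"Giving up on {url}: {e}")
                return None
            print(f"Attempt {attempt} failed: {e}")
        except URLError as e:
            # Network-level problems (DNS failure, refused connection, ...)
            print(f"Attempt {attempt} failed: {e}")
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    return None
```

With this helper, `html = fetch_html(myurl)` would return `None` on failure, so the caller can check for that before building the `BeautifulSoup` object.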

Because the parsed page (soupified) is large, we print only the first 2000 characters of its prettified HTML.

In [None]:
#pprint(soupified.prettify())      # for printing the full HTML structure of the webpage

In [6]:
pprint(soupified.prettify()[:2000]) # to get an idea of the html structure of the webpage

('<!DOCTYPE html>\n'
 '<html class="html__responsive " itemscope="" '
 'itemtype="https://schema.org/QAPage" lang="en">\n'
 ' <head>\n'
 '  <title>\n'
 '   datetime - How do I get the current time in Python? - Stack Overflow\n'
 '  </title>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" '
 'rel="shortcut icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="apple-touch-icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="image_src"/>\n'
 '  <link href="/opensearch.xml" rel="search" title="Stack Overflow" '
 'type="application/opensearchdescription+xml"/>\n'
 '  <link '
 'href="https://stackoverflow.com/questions/415511/how-do-i-get-the-current-time-in-python" '
 'rel="canonical">\n'
 '   <meta content="width=device-width, height=device-height, '
 'initial-scale=1.0, minimum-scale=1.0" name="vie

In [7]:
soupified.title # to get the title of the web page

<title>datetime - How do I get the current time in Python? - Stack Overflow</title>
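
`soupified.title` returns the whole `<title>` tag, markup included. To get just the text, `get_text()` can be called on the tag. The sketch below uses a small stand-in HTML string (`sample_html`, a hypothetical name) so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Stand-in HTML so the example does not depend on a live page
sample_html = (
    "<html><head><title>datetime - How do I get the current time in Python? "
    "- Stack Overflow</title></head><body></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title        # a Tag object; None if the page has no <title>
title_text = title_tag.get_text()    # only the text between the tags, as a str
print(title_text)
```

Because `.title` is `None` when a page has no `<title>` element, checking for that before calling `get_text()` is a reasonable precaution on arbitrary pages.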

In [8]:
question = soupified.find("div", {"class": "question"}) # find the first <div> whose class is "question"
questiontext = question.find("div", {"class": "s-prose js-post-body"}) # the post body inside it
print("Question: \n", questiontext.get_text().strip())

answer = soupified.find("div", {"class": "answer"}) # find() returns only the first match, i.e. the answer listed first on the page
answertext = answer.find("div", {"class": "s-prose js-post-body"})
print("Best answer: \n", answertext.get_text().strip())

Question: 
 How do I get the current time in Python?
Best answer: 
 Use datetime:
>>> import datetime
>>> now = datetime.datetime.now()
>>> now
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)
>>> print(now)
2009-01-06 15:08:24.789150

For just the clock time without the date:
>>> now.time()
datetime.time(15, 8, 24, 78915)
>>> print(now.time())
15:08:24.789150


To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the prefix datetime. from all of the above.
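
The cell above uses `find()`, which returns only the first matching tag. To collect every answer, `find_all()` or a CSS selector via `select()` can be used instead. The sketch below runs on hypothetical, heavily simplified markup (`sample_html`) that only mimics the class names used above; the real page structure is more complex and may change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names seen on the real page
sample_html = """
<div class="question"><div class="s-prose js-post-body"><p>How do I get the current time?</p></div></div>
<div class="answer"><div class="s-prose js-post-body"><p>Use datetime.</p></div></div>
<div class="answer"><div class="s-prose js-post-body"><p>Use time.strftime.</p></div></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match (or None); find_all() returns all matches in document order
first_answer = sample_soup.find("div", class_="answer")
all_answers = sample_soup.find_all("div", class_="answer")

# select() takes a CSS selector; .js-post-body matches even when other classes are present
answer_texts = [
    body.get_text().strip()
    for body in sample_soup.select("div.answer .js-post-body")
]
print(answer_texts)
```

Since `find()` returns `None` when nothing matches, code that scrapes live pages would typically check the result before calling `get_text()` on it.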


Beautiful Soup is one of many libraries for scraping web pages. Depending on your needs, you can choose among alternatives such as Scrapy and Selenium.