# Introduction

<div class="alert alert-block alert-warning">
<font color=black><br>

**What?** HTML Parsing and Cleanup with BeautifulSoup

<br></font>
</div>

# What is BeautifulSoup?

<div class="alert alert-block alert-info">
<font color=black><br>

- BeautifulSoup is one of many Python libraries used to scrape web pages. It parses HTML and XML documents into a tree that can be searched and navigated.
- **Alternatives** include Scrapy and Selenium.

<br></font>
</div>
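
As a minimal sketch of what the library does, the cell below parses a small, made-up HTML string (not taken from any real site) and pulls out the title, one paragraph, and the link targets. The variable names and the sample markup are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="https://example.com/a">First link</a>
    <a href="https://example.com/b">Second link</a>
  </body>
</html>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> tag.
page_title = soup.title.get_text()

# Text of the first <p> whose class list contains "intro" (nested tags are flattened).
intro_text = soup.find("p", class_="intro").get_text()

# The href attribute of every <a> tag, in document order.
links = [a["href"] for a in soup.find_all("a")]

print(page_title)
print(intro_text)
print(links)
```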

# Project's goal

<div class="alert alert-block alert-info">
<font color=black><br>

- You are building a forum search engine for programming questions. We’ve identified Stack Overflow as a source and decided to extract question and best-answer pairs from the website.
- How can we go through the text-extraction step in this case? If we observe the HTML markup of a typical Stack Overflow question page, we notice that questions and answers have special tags associated with them.

<br></font>
</div>

# Import modules

In [2]:
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Identify source/dataset

In [3]:
# Specify the URL of the question page
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"
# Query the website and read the raw HTML of the page (as bytes)
html = urlopen(myurl).read()
# Parse the HTML with Python's built-in parser and store the resulting BeautifulSoup tree
soupified = BeautifulSoup(html, 'html.parser')
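
Some sites reject requests that carry no browser-like `User-Agent` header, and a bare `urlopen` call has no timeout. The helper below is one possible, more defensive way to download a page. It is a sketch only: `fetch_html`, the header value, and the default timeout are assumptions introduced here, not part of the original notebook.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded text.

    Hypothetical helper: the User-Agent string below is an arbitrary
    placeholder, not a value any site is known to require.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, else fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage, assuming network access and the `myurl` variable defined above:
# html = fetch_html(myurl)
```

Network failures surface as `urllib.error.URLError` (or its subclass `HTTPError`), which the caller can catch and handle as needed.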

In [4]:
pprint(soupified.prettify()) 

('<!DOCTYPE html>\n'
 '<html class="html__responsive html__fixed-top-bar" itemscope="" '
 'itemtype="https://schema.org/QAPage">\n'
 ' <head>\n'
 '  <title>\n'
 '   datetime - How to get the current time in Python - Stack Overflow\n'
 '  </title>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" '
 'rel="shortcut icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="apple-touch-icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="image_src"/>\n'
 '  <link href="/opensearch.xml" rel="search" title="Stack Overflow" '
 'type="application/opensearchdescription+xml"/>\n'
 '  <link '
 'href="https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" '
 'rel="canonical">\n'
 '   <meta content="width=device-width, height=device-height, '
 'initial-scale=1.0, minimum-scale=1.0" name=

 "                 Jan 9 '13 at 5:50\n"
 '                </span>\n'
 '               </div>\n'
 '               <div class="user-gravatar32">\n'
 '                <a href="/users/1192220/parameterz">\n'
 '                 <div class="gravatar-wrapper-32">\n'
 '                  <img alt="" class="bar-sm" height="32" '
 'src="https://i.stack.imgur.com/t50AC.jpg?s=32&amp;g=1" width="32"/>\n'
 '                 </div>\n'
 '                </a>\n'
 '               </div>\n'
 '               <div class="user-details" itemprop="author" itemscope="" '
 'itemtype="http://schema.org/Person">\n'
 '                <a href="/users/1192220/parameterz">\n'
 '                 ParaMeterz\n'
 '                </a>\n'
 '                <span class="d-none" itemprop="name">\n'
 '                 ParaMeterz\n'
 '                </span>\n'
 '                <div class="-flair">\n'
 '                 <span class="reputation-score" dir="ltr" title="reputation '
 'score ">\n'
 '                  7,997\n'
 ' 

 '                  <span class="badgecount">\n'
 '                   55\n'
 '                  </span>\n'
 '                 </span>\n'
 '                 <span class="v-visible-sr">\n'
 '                  55 bronze badges\n'
 '                 </span>\n'
 '                </div>\n'
 '               </div>\n'
 '              </div>\n'
 '             </div>\n'
 '             <div class="post-signature grid--cell fl0">\n'
 '              <div class="user-info">\n'
 '               <div class="user-action-time">\n'
 '                answered\n'
 '                <span class="relativetime" title="2009-01-06 05:02:43Z">\n'
 "                 Jan 6 '09 at 5:02\n"
 '                </span>\n'
 '               </div>\n'
 '               <div class="user-gravatar32">\n'
 '                <a href="/users/27474/vijay-dev">\n'
 '                 <div class="gravatar-wrapper-32">\n'
 '                  <img alt="" class="bar-sm" height="32" '
 'src="https://www.gravatar.com/avatar/91160e88d86db632

 '            <p>\n'
 '             This is what I ended up going with:\n'
 '            </p>\n'
 '            <pre><code>&gt;&gt;&gt;from time import strftime\n'
 '&gt;&gt;&gt;strftime("%m/%d/%Y %H:%M")\n'
 '01/09/2015 13:11\n'
 '</code></pre>\n'
 '            <p>\n'
 '             Also, this table is a necessary reference for choosing the '
 'appropriate format codes to get the date formatted just the way you want it '
 '(from Python "datetime" documentation\n'
 '             <a '
 'href="https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior" '
 'rel="noreferrer">\n'
 '              here\n'
 '             </a>\n'
 '             ).\n'
 '            </p>\n'
 '            <p>\n'
 '             <img alt="strftime format code table" '
 'src="https://i.stack.imgur.com/i6Hg7.jpg"/>\n'
 '            </p>\n'
 '           </div>\n'
 '           <div class="mt24">\n'
 '            <div class="grid fw-wrap ai-start jc-end gs8 gsy">\n'
 '             <time datetime="2015-01-0

 'pytz.all_timezones\n'
 '</code></pre>\n'
 '           </div>\n'
 '           <div class="mt24">\n'
 '            <div class="grid fw-wrap ai-start jc-end gs8 gsy">\n'
 '             <time datetime="2019-10-18T13:09:33" itemprop="dateCreated">\n'
 '             </time>\n'
 '             <div class="grid--cell mr16" style="flex: 1 1 100px;">\n'
 '              <div class="js-post-menu pt2" data-post-id="58451609">\n'
 '               <div class="grid d-flex gs8 s-anchors s-anchors__muted '
 'fw-wrap">\n'
 '                <div class="grid--cell">\n'
 '                 <a class="js-share-link js-gps-track" '
 'data-controller="se-share-sheet" data-gps-track="post.click({ item: 2, priv: '
 '0, post_type: 2 })" data-s-popover-placement="bottom-start" '
 'data-se-share-sheet-license-name="CC BY-SA 4.0" '
 'data-se-share-sheet-license-url="https%3a%2f%2fcreativecommons.org%2flicenses%2fby-sa%2f4.0%2f" '
 'data-se-share-sheet-location="2" data-se-share-sheet-post-type="answer" '
 'data-se-sh

 '           <div class="js-voting-container grid jc-center fd-column '
 'ai-stretch gs4 fc-black-200" data-post-id="56458580">\n'
 '            <button aria-label="Up vote" aria-pressed="false" '
 'class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer" '
 'data-controller="s-tooltip" data-s-tooltip-placement="right" '
 'data-selected-classes="fc-theme-primary" title="This answer is useful">\n'
 '             <svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" '
 'height="36" viewbox="0 0 36 36" width="36">\n'
 '              <path d="M2 26h32L18 10 2 26z">\n'
 '              </path>\n'
 '             </svg>\n'
 '            </button>\n'
 '            <div class="js-vote-count grid--cell fc-black-500 fs-title grid '
 'fd-column ai-center" data-value="9" itemprop="upvoteCount">\n'
 '             9\n'
 '            </div>\n'
 '            <button aria-label="Down vote" aria-pressed="false" '
 'class="js-vote-down-btn grid--cell s-btn s-btn__unset c-pointer" '
 'data-control

 '         </a>\n'
 '        </li>\n'
 '        <li class="-item">\n'
 '         <a class="-link js-gps-track" data-gps-track="footer.click({ '
 'location: 2, link: 25 })" href="https://islam.stackexchange.com" '
 'title="Muslims, experts in Islam, and those interested in learning more '
 'about Islam">\n'
 '          Islam\n'
 '         </a>\n'
 '        </li>\n'
 '       </ul>\n'
 '      </div>\n'
 '      <div class="site-footer--col site-footer--category js-footer-col" '
 'data-name="Culture / Recreation">\n'
 '       <ul class="-list">\n'
 '        <li class="-item">\n'
 '         <a class="-link js-gps-track" data-gps-track="footer.click({ '
 'location: 2, link: 25 })" href="https://rus.stackexchange.com" '
 'title="лингвистов и энтузиастов русского языка">\n'
 '          Русский язык\n'
 '         </a>\n'
 '        </li>\n'
 '        <li class="-item">\n'
 '         <a class="-link js-gps-track" data-gps-track="footer.click({ '
 'location: 2, link: 25 })" href="https://russian.st

In [5]:
#to get an idea of the html structure of the webpage
pprint(soupified.prettify()[:2000])

('<!DOCTYPE html>\n'
 '<html class="html__responsive html__fixed-top-bar" itemscope="" '
 'itemtype="https://schema.org/QAPage">\n'
 ' <head>\n'
 '  <title>\n'
 '   datetime - How to get the current time in Python - Stack Overflow\n'
 '  </title>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" '
 'rel="shortcut icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="apple-touch-icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="image_src"/>\n'
 '  <link href="/opensearch.xml" rel="search" title="Stack Overflow" '
 'type="application/opensearchdescription+xml"/>\n'
 '  <link '
 'href="https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" '
 'rel="canonical">\n'
 '   <meta content="width=device-width, height=device-height, '
 'initial-scale=1.0, minimum-scale=1.0" name=

In [6]:
#to get the title of the web page
soupified.title 

<title>datetime - How to get the current time in Python - Stack Overflow</title>

# Parsing the HTML file

<div class="alert alert-block alert-info">
<font color=black><br>

- Here, we’re relying on our knowledge of the structure of an HTML document to extract what we want from it.

<br></font>
</div>
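
Relying on page structure has a practical catch. When a multi-class string such as `"s-prose js-post-body"` is passed to `find`, BeautifulSoup compares it against the exact value of the `class` attribute, so a page that lists the same classes in a different order no longer matches. The sketch below uses made-up markup that imitates the class names in this notebook and contrasts that lookup with a CSS selector, which does not depend on class order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as in this notebook, but in reversed order.
sample_html = """
<div class="question">
  <div class="js-post-body s-prose">How do I get the current time?</div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the class attribute: the order differs, so nothing is found.
exact = soup.find("div", {"class": "s-prose js-post-body"})

# CSS selector: requires both classes to be present, in any order,
# on a <div> nested inside <div class="question">.
by_selector = soup.select_one("div.question div.s-prose.js-post-body")

print(exact)
print(by_selector.get_text(strip=True))
```

Both `find` and `select_one` return `None` when nothing matches, so checking the result before calling `get_text()` avoids an `AttributeError` if the site changes its markup.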

In [14]:
#find the necessary tag and class which it belongs to
question = soupified.find("div", {"class": "question"}) 
questiontext = question.find("div", {"class": "s-prose js-post-body"})
print("Question: \n", questiontext.get_text().strip())

Question: 
 What is the module/method used to get the current time?


# Output

In [16]:
#find the necessary tag and class which it belongs to
answer = soupified.find("div", {"class": "answer"}) 
answertext = answer.find("div", {"class": "s-prose js-post-body"})

print("***********BEST ANSWER STARTS HERE*********")
print(answertext.get_text().strip())
print("***********BEST ANSWER ENDS HERE*********")

***********BEST ANSWER STARTS HERE*********
Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
***********BEST ANSWER ENDS HERE*********
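
The two extraction steps above can be folded into one function that returns a question and answer pair. This is a sketch under two assumptions: like the cells above, it treats the first `answer` block on the page as the best answer, and the helper name `extract_qa_pair` and the sample markup are hypothetical. It is exercised on an inline HTML string so that it runs without network access.

```python
from typing import Optional, Tuple

from bs4 import BeautifulSoup


def extract_qa_pair(html: str) -> Optional[Tuple[str, str]]:
    """Return (question_text, first_answer_text), or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns the first match in document order, or None.
    question_body = soup.select_one("div.question div.s-prose.js-post-body")
    answer_body = soup.select_one("div.answer div.s-prose.js-post-body")
    if question_body is None or answer_body is None:
        return None
    return question_body.get_text().strip(), answer_body.get_text().strip()


# Hypothetical markup that imitates the class names used in this notebook.
sample_html = """
<div class="question"><div class="s-prose js-post-body"><p>What is the module/method used to get the current time?</p></div></div>
<div class="answer"><div class="s-prose js-post-body"><p>Use datetime.datetime.now().</p></div></div>
<div class="answer"><div class="s-prose js-post-body"><p>Use time.strftime().</p></div></div>
"""

pair = extract_qa_pair(sample_html)
missing = extract_qa_pair("<html><body><p>No posts here</p></body></html>")

print(pair)
print(missing)
```

Returning `None` instead of raising keeps a crawl over many pages going when one page lacks a question or an answer.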


# References

<div class="alert alert-block alert-warning">
<font color=black><br>

- https://github.com/practical-nlp/practical-nlp/blob/master/Ch2/01_WebScraping_using_BeautifulSoup.ipynb
- Harshit Surana, Practical Natural Language Processing

<br></font>
</div>