* Overview of Web Scraping
* Getting Started with BeautifulSoup
* Overview of HTML
* Process HTML using BeautifulSoup
* Extract URLs from HTML
* Extract Data from Web Pages
* Use requests to read HTML Page
* Parse and Process Web Page using BeautifulSoup
* Exercise and Solution

* Overview of Web Scraping
  * BeautifulSoup
  * Scrapy

* Getting Started with BeautifulSoup

Run `pip install beautifulsoup4` to install BeautifulSoup, then restart the notebook kernel so that the newly installed package can be imported.
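One quick way to confirm that the installation worked is to import the package and print its version. This is only a sanity check; the version printed depends on what `pip` installed in your environment.

```python
# The PyPI package is named beautifulsoup4, but the importable module is bs4.
import bs4

# If this import succeeds, the package is available to the current kernel.
print(bs4.__version__)
```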

* Overview of HTML

```html
<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
        </tr>
    </tbody>
</table>
```

In [None]:
%%html

<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td>
        </tr>
    </tbody>
</table>

* Process HTML using BeautifulSoup

In [None]:
html_str = """<table>
    <tbody>
        <tr>
            <th>Details</th>
            <th>URL</th>
        </tr>
        <tr>
            <td>Video Content</td>
            <td><a href="https://www.youtube.com/itversityin">YouTube Channel</a>
            </td>
        </tr>
        <tr>
            <td>Reference Material</td>
            <td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a>
            </td>
        </tr>
    </tbody>
</table>"""

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())

In [None]:
soup.table.tbody.tr
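Attribute-style navigation such as `soup.table.tbody.tr` returns only the first matching tag at each step. The sketch below contrasts it with `find_all`, which returns every match. The names `demo_html` and `demo_soup` are made up for this illustration, and the HTML is a shortened version of the table above.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical version of the table used in this section.
demo_html = (
    "<table><tbody>"
    "<tr><th>Details</th><th>URL</th></tr>"
    "<tr><td>Video Content</td><td>YouTube Channel</td></tr>"
    "</tbody></table>"
)
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# Dot navigation stops at the first <tr>, which is the header row here.
first_row = demo_soup.table.tbody.tr

# find_all returns every <tr>, including the data row.
all_rows = demo_soup.find_all('tr')

# get_text() returns the text inside a tag without the markup.
header_cells = [th.get_text() for th in first_row.find_all('th')]
print(header_cells)
print(len(all_rows))
```

Use dot navigation when you only need the first occurrence of a tag, and `find_all` when you want to loop over all of them.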

* Extract URLs from HTML

In [None]:
soup.find_all('a')

In [None]:
for item in soup.find_all('a'):
    print(item['href'])

In [None]:
for item in soup.find_all('a'):
    print(item.text)

In [None]:
[(item.text, item['href']) for item in soup.find_all('a')]
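The same idea can be extended to pair each link with the description in the first column of its row. The following is a minimal sketch, assuming every data row has exactly two `td` cells with the link in the second one; `sample_html`, `sample_soup`, and `links_by_detail` are names introduced only for this example.

```python
from bs4 import BeautifulSoup

sample_html = """<table><tbody>
<tr><th>Details</th><th>URL</th></tr>
<tr><td>Video Content</td><td><a href="https://www.youtube.com/itversityin">YouTube Channel</a></td></tr>
<tr><td>Reference Material</td><td><a href="https://www.github.com/dgadiraju/itversity-books">GitHub Repository</a></td></tr>
</tbody></table>"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

links_by_detail = {}
for row in sample_soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) != 2:
        continue  # the header row contains <th> cells, so it has no <td>
    link = cells[1].find('a')
    # Skip rows without a link or without an href attribute.
    if link is not None and link.get('href'):
        links_by_detail[cells[0].get_text(strip=True)] = link['href']

print(links_by_detail)
```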

* Extract Data from Web Pages
  * URL - https://en.wikipedia.org/wiki/Python_(programming_language)
  * Use `requests` to get content from web page
  * Parse and process HTML content using BeautifulSoup

* Use requests to read HTML Page

In [22]:
import requests

In [23]:
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

In [24]:
html_content = requests.get(url).content
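The one-line call above does not set a timeout or check the HTTP status code, so a slow server or an error page can go unnoticed. The following is a hedged sketch of a slightly more defensive fetch. The function name `fetch_html`, the 10-second default, and the `User-Agent` value are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(page_url, timeout_seconds=10):
    """Return the raw bytes of a page, raising on network or HTTP errors."""
    # Placeholder User-Agent; replace it with a value that identifies your script.
    headers = {'User-Agent': 'web-scraping-tutorial/0.1 (example@example.com)'}
    response = requests.get(page_url, headers=headers, timeout=timeout_seconds)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.content
```

A call such as `fetch_html(url)` could then stand in for `requests.get(url).content`, with failures surfacing as exceptions derived from `requests.exceptions.RequestException`.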

* Parse and Process Web Page using BeautifulSoup

In [25]:
from bs4 import BeautifulSoup

In [27]:
soup = BeautifulSoup(html_content, 'html.parser')
# print(soup.prettify())

In [30]:
soup.find_all('a')

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>,
 <a href=

In [36]:
http_urls = []
for item in soup.find_all('a'):
    if item.get('href') and item.get('href').startswith('http'):
        http_urls.append(item['href'])
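Filtering on `startswith('http')` drops relative links such as `/wiki/Main_Page` and protocol-relative links such as `//en.wikipedia.org/...`. If those are also needed, one option is to resolve every `href` against the page URL with `urllib.parse.urljoin` from the standard library. The sketch below uses a few `href` values of the kinds seen in the output above; `base_url`, `sample_hrefs`, and `absolute_urls` are illustrative names.

```python
from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

# href values of the kinds seen in the page: relative, protocol-relative, absolute.
sample_hrefs = [
    '/wiki/Main_Page',
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
    'https://af.wikipedia.org/wiki/Python_(programmeertaal)',
]

# urljoin fills in whatever the href is missing (scheme, host) from base_url.
absolute_urls = [urljoin(base_url, href) for href in sample_hrefs]

for absolute_url in absolute_urls:
    print(absolute_url)
```

Absolute URLs pass through unchanged, so the same call can be applied to every link on the page before filtering.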

In [37]:
http_urls

['https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://af.wikipedia.org/wiki/Python_(programmeertaal)',
 'https://als.wikipedia.org/wiki/Python_(Programmiersprache)',
 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86_(%D9%84%D8%BA%D8%A9_%D8%A8%D8%B1%D9%85%D8%AC%D8%A9)',
 'https://an.wikipedia.org/wiki/Python',
 'https://as.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8',
 'https://ast.wikipedia.org/wiki/Python',
 'https://az.wikipedia.org/wiki/Python_(proqramla%C5%9Fd%C4%B1rma_dili)',
 'https://azb.wikipedia.org/wiki/%D9%BE%D8%A7%DB%8C%D8%AA%D9%88%D9%86',
 'https://ban.wikipedia.org/wiki/Python',
 'https://bn.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8_(%E0%A6%AA%E0%A7%8D%E0%A6%B0%E0%A7%8B%E0%A6%97%E0%A7%8D%E0%A6%B0%E0%A6%BE%E0%A6%AE%E0%A6%BF%E0%A6%82_%E0%A6%AD%E0%A6%BE%E0%A6%B7%E0%A6%BE)',
 'https://zh-min-nan.wikipedia.org/w

In [39]:
sorted(set(http_urls))

['http://archive.adaic.com/standards/83lrm/html/lrm-11-03.html#11.3',
 'http://boo.codehaus.org/Gotchas+for+Python+Users',
 'http://cdsweb.cern.ch/journal/CERNBulletin/2006/31/News%20Articles/974627?ln=en',
 'http://cobra-language.com/docs/acknowledgements/',
 'http://doc.pypy.org/en/latest/stackless.html',
 'http://download.tensorflow.org/paper/whitepaper2015.pdf',
 'http://ebeab.com/2014/01/21/python-culture/',
 'http://effbot.org/zone/call-by-object.htm',
 'http://gimp-win.sourceforge.net/faq.html',
 'http://id.worldcat.org/fast/1084736/',
 'http://mypy-lang.org/',
 'http://neopythonic.blogspot.be/2009/04/tail-recursion-elimination.html',
 'http://nondot.org/sabre',
 'http://nondot.org/sabre/',
 'http://nuitka.net/',
 'http://page.mi.fu-berlin.de/prechelt/Biblio/jccpprt_computer2000.pdf',
 'http://radio.weblogs.com/0112098/2003/08/29.html',
 'http://ring-lang.sourceforge.net/doc1.6/introduction.html#ring-and-other-languages',
 'http://shop.oreilly.com/product/9780596007973.do',
 'ht

* Exercise - Get Wiki Page URLs from [NFL Wiki Page](https://en.wikipedia.org/wiki/National_Football_League)
1. Read the entire HTML content of the [NFL Wiki Page](https://en.wikipedia.org/wiki/National_Football_League).
2. Get the URLs that start with **/wiki**.
3. Prefix each URL with **https://en.wikipedia.org** (e.g., https://en.wikipedia.org/wiki/Buffalo_Bills).
4. Return the unique URLs sorted in ascending order.

* Solution - Get Wiki Page URLs from [NFL Wiki Page](https://en.wikipedia.org/wiki/National_Football_League)

In [40]:
import requests

In [41]:
from bs4 import BeautifulSoup

In [42]:
url = 'https://en.wikipedia.org/wiki/National_Football_League'

In [45]:
html_content = requests.get(url).content

In [46]:
soup = BeautifulSoup(html_content, 'html.parser')

In [47]:
wiki_urls = []
for item in soup.find_all('a'):
    href = item.get('href')
    # href already starts with '/', so the prefix must not end with one
    if href and href.startswith('/wiki/'):
        wiki_urls.append(f"https://en.wikipedia.org{href}")

In [48]:
sorted(set(wiki_urls))

['https://en.wikipedia.org/wiki/1920_APFA_season',
 'https://en.wikipedia.org/wiki/1921_APFA_season',
 'https://en.wikipedia.org/wiki/1921_NFL_Championship_controversy',
 'https://en.wikipedia.org/wiki/1922_NFL_season',
 'https://en.wikipedia.org/wiki/1923_NFL_season',
 'https://en.wikipedia.org/wiki/1924_NFL_season',
 'https://en.wikipedia.org/wiki/1925_NFL_season',
 'https://en.wikipedia.org/wiki/1926_NFL_season',
 'https://en.wikipedia.org/wiki/1927_NFL_season',
 'https://en.wikipedia.org/wiki/1928_NFL_season',
 'https://en.wikipedia.org/wiki/1929_NFL_season',
 'https://en.wikipedia.org/wiki/1930_NFL_season',
 'https://en.wikipedia.org/wiki/1931_NFL_season',
 'https://en.wikipedia.org/wiki/1932_NFL_Playoff_Game',
 'https://en.wikipedia.org/wiki/1932_NFL_season',
 'https://en.wikipedia.org/wiki/1933_NFL_season',
 'https://en.wikipedia.org/wiki/1934_NFL_season',
 'https://en.wikipedia.org/wiki/1935_NFL_season',
 'https://en.wikipedia.org/wiki/1936_NFL_Draft',
 'http