### Extracting Email IDs from a Web Page

<p>In this tutorial we learn how to extract email IDs from a given web page.<br><br> 
    <h4>Step 1</h4>
    First, import the libraries the program needs.<br><br>
    <code>BeautifulSoup</code> : A Python library for extracting data from HTML and XML files.<br>
    <code>requests</code> : A library for sending HTTP requests from Python.<br>
    <code>urllib.parse</code> : A module with functions for splitting URLs into their component parts and building them back up.<br> 
    <code>collections</code> : A module of specialized container types; we use <code>deque</code> from it as the URL queue.<br>
    <code>re</code> : The module for working with regular expressions.
</p>

In [1]:
#import packages
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

<p>
<h4>Step 2</h4>
Choose the URL of the page from which email IDs should be extracted.
</p>

In [2]:
# a queue of urls to be crawled
new_urls = deque(['http://www.eshikshak.co.in/'])

<p>
<h4>Step 3</h4>
Each URL should be processed only once, so we keep track of the URLs that have already been crawled.
</p>

In [3]:
# a set of urls that we have already crawled
processed_urls = set()

<p>
<h4>Step 4</h4>
While crawling, we may find more than one email ID, so we collect them in a set.
</p>

In [4]:
# a set of crawled emails
emails = set()
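
The three containers from Steps 2 to 4 divide the work: the `deque` hands out URLs in first-in, first-out order, while the sets drop duplicates. A minimal offline sketch of that behaviour, using made-up `example.com` values that are assumptions for illustration and not part of the crawl:

```python
from collections import deque

# Hypothetical values, used only to illustrate the containers
queue = deque(["http://example.com/"])
seen = set()
found = set()

queue.append("http://example.com/about")  # a newly discovered link goes to the back
first = queue.popleft()                   # the oldest URL comes out first
seen.add(first)

# Adding the same address twice leaves a single entry in the set
found.add("info@example.com")
found.add("info@example.com")

print(first)
print(len(queue), len(found))
```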

<p>
<h4>Step 5</h4>
Now we can start crawling. We take URLs from the queue one at a time, record each one as processed, and fetch its page content. If a request fails, we skip to the next URL.
</p>

In [5]:
# process urls one by one until we exhaust the queue
while new_urls:
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors 
        continue

Processing http://www.eshikshak.co.in/


<p>
<h4>Step 6</h4>
Next, we extract the base parts of the current URL. They are needed to convert relative links found in the document into absolute ones:
</p>

In [6]:
# extract base url and path to resolve relative links
parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url
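
To see what these three values look like, here is a small sketch that applies the same expressions to a made-up URL (`http://example.com/blog/post.html` is an assumption for illustration, not a page from the crawl):

```python
from urllib.parse import urlsplit

url = "http://example.com/blog/post.html"  # hypothetical URL

parts = urlsplit(url)
# scheme plus host, used for links that start with "/"
base_url = "{0.scheme}://{0.netloc}".format(parts)
# everything up to and including the last "/", used for other relative links
path = url[:url.rfind('/') + 1] if '/' in parts.path else url

print(parts.path)  # /blog/post.html
print(base_url)    # http://example.com
print(path)        # http://example.com/blog/
```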

<p>
<h4>Step 7</h4>
Extract the email IDs from the page content and add them to the <code>emails</code> set.
</p>

In [7]:
# extract all email addresses and add them into the resulting set 
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
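
The pattern can be tried on a short string before pointing it at a real page. The sketch below uses made-up sample text (an assumption, not content from the crawled site). It shows that repeated addresses collapse into one set entry, and also that this simple pattern can pick up non-email strings such as retina image file names.

```python
import re

EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

# Hypothetical page fragment
sample = (
    'Contact info@example.com or sales.team@example.co.in. '
    '<a href="mailto:info@example.com">Mail</a> '
    '<img src="logo@2x.png">'
)

found = set(EMAIL_RE.findall(sample))
print(sorted(found))
# ['info@example.com', 'logo@2x.png', 'sales.team@example.co.in']
```

If false positives like `logo@2x.png` matter, the results would need an extra filtering step, for example discarding matches that end in a known image extension.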

<p>
<h4>Step 8</h4>
Once the current page is processed, it's time to look for links to other pages and add them to the URL queue (that's the magic of crawling). First, create a <code>BeautifulSoup</code> object for parsing the HTML page.
</p>

In [8]:
# create a BeautifulSoup object for the HTML document
# (naming the parser explicitly avoids bs4's "no parser specified" warning)
soup = BeautifulSoup(response.text, "html.parser")

<p>
<h4>Step 9</h4>
The <code>soup</code> object contains the page's HTML elements. Find all the <code>anchor</code> tags, read their <code>href</code> attributes, resolve relative links, and queue every URL that has not been enqueued or processed yet.
</p>

In [9]:
# find and process all the anchors in the document
for anchor in soup.find_all("a"):
    # extract link url from the anchor
    link = anchor.attrs.get("href", "")
    # resolve relative links
    if link.startswith('/'):
        link = base_url + link
    elif not link.startswith('http'):
        link = path + link
    # add the new url to the queue if it was not enqueued nor processed yet
    if link not in new_urls and link not in processed_urls:
        new_urls.append(link)
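
The string concatenation above treats every link that does not start with `/` or `http` as a relative path, which includes `mailto:` links, `#fragment` anchors, and anchors with no `href` at all. One possible alternative, sketched here as an assumption rather than as part of the original notebook, is to let `urllib.parse.urljoin` resolve the link and keep only `http`/`https` results. The helper name `resolve_link` and the sample URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped."""
    if not href:
        return None
    # urljoin handles "/x", "x.html", "../x" and already-absolute links
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    return absolute

page = "http://example.com/blog/post.html"  # hypothetical current page
print(resolve_link(page, "/contact"))                 # http://example.com/contact
print(resolve_link(page, "next.html"))                # http://example.com/blog/next.html
print(resolve_link(page, "mailto:info@example.com"))  # None
print(resolve_link(page, "#top"))                     # http://example.com/blog/post.html
```

A `#top` link resolves to the current page itself, which is already in `processed_urls`, so the existing membership check would skip it.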

<p>
<h4>Step 10</h4>
Print all the email IDs extracted from the given URL.
</p>

In [10]:
for email in emails:
    print(email)

prakashgkhaire@gmail.com
eshikshak.co.in@gmail.com
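
In the notebook, Steps 6 to 9 sit in separate cells, so they run once, for the last page fetched by the loop in Step 5. To follow the links that Step 9 adds to the queue, those steps would need to run for every page, inside that loop. The following is a minimal sketch of that structure, not the author's code. To stay runnable offline it makes several assumptions: pages come from a hypothetical in-memory dictionary `PAGES` instead of `requests.get`, links are pulled with the standard library's `html.parser` rather than BeautifulSoup, relative links are resolved with `urljoin`, and a `max_pages` limit is added that the original does not have.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

# Hypothetical site used in place of real HTTP responses
PAGES = {
    "http://example.com/": '<a href="/about.html">About</a> admin@example.com',
    "http://example.com/about.html": '<a href="/">Home</a> team@example.com',
}

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start_url, max_pages=10):
    new_urls = deque([start_url])
    processed_urls = set()
    emails = set()
    while new_urls and len(processed_urls) < max_pages:
        url = new_urls.popleft()
        processed_urls.add(url)
        html = PAGES.get(url)  # stands in for requests.get(url).text
        if html is None:
            continue           # ignore pages that cannot be fetched
        emails.update(EMAIL_RE.findall(html))
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)
            if link not in new_urls and link not in processed_urls:
                new_urls.append(link)
    return emails, processed_urls

emails, visited = crawl("http://example.com/")
print(sorted(emails))   # ['admin@example.com', 'team@example.com']
print(sorted(visited))  # ['http://example.com/', 'http://example.com/about.html']
```

The `max_pages` cap matters on a real site, where following every link without a limit can keep the queue growing for a very long time.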


<p>
<h2>About Author</h2>
Name : <b>Prakash G. Khaire</b><br>
    Bio  : <b>Software Consultant </b> :::: <b>Online Tutor </b> :::: <b>HOD & Assistant Professor</b><br>
           Contact Details :
    <mobile><b>+91 951 0446 143</b></mobile><br> 
    <a href="http://www.eshikshak.co.in">www.eshikshak.co.in</a>
</p>