In [3]:
import requests
import re

from bs4 import BeautifulSoup

#### This URL points to an article about how to build static websites

In [4]:
url = "https://mashable.com/2014/08/28/static-website-generators/"

resp = requests.get(url)

In [5]:
soup = BeautifulSoup(resp.text, 'lxml')

soup

<!DOCTYPE html>
<html lang="en">
<head>
<title>Mashable</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#00aeef" name="theme-color"/>
<meta content="Mashable.com" name="application-name"/>
<meta content="error_404" name="description"/>
<meta content="122071082108" property="fb:app_id"/>
<meta content="18807449704" property="fb:pages"/>
<meta content="Mashable" property="og:site_name"/>
<meta content="Mashable" property="og:title"/>
<meta content="https://mashable.com/2014/08/28/static-website-generators" property="og:url"/>
<meta content="error_404" property="og:description"/>
<meta content="https://helios-i.mashable.com/imagery/defaults/fallback-thumbnail.fill.size_1200x675.1.png" property="og:image"/>
<meta content="https://helios-i.mashable.com/imagery/defaults/fallback-thumbnail.fill.size_1200x675.1.png" property="og:image:secure_url"/>
<meta content="675" property="og:image:height"/>
<meta content="1200" property
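The parsed output above carries the description `error_404`, which suggests the server returned an error page rather than the article. A minimal sketch of a fetch helper that checks the HTTP status before parsing is shown below. The name `fetch_soup` and its `get` parameter are hypothetical (the parameter exists only so the function can be exercised without network access), and `html.parser` is used here instead of `lxml` to avoid an extra dependency.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(page_url, get=requests.get, parser="html.parser"):
    """Download page_url and parse it, raising on HTTP error statuses."""
    resp = get(page_url, timeout=10)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is not parsed as if it were the article.
    resp.raise_for_status()
    return BeautifulSoup(resp.text, parser)
```

Calling `fetch_soup(url)` in place of the two separate steps would surface a missing page as an exception instead of silently returning the error page's markup.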

### Extracting a list of links from a web page using BeautifulSoup

#### On any web page, links are created in "a" and "link" tags, and image links in "img" tags. An absolute link starts with http or https, so to extract only such links from the tags we filter the attribute values with a regular expression.

This code returns the "a" tags whose "href" attribute starts with "https" (plain "http" links are excluded), along with the complete content of each tag.

In [6]:
soup.find_all("a",  attrs = {'href': re.compile("^https")})

[<a aria-label="Home" class="flex items-center mr-8 w-full xl:w-auto" data-ga-action="navigation_logo" data-ga-click="" data-ga-element="navigation_logo" data-ga-item="logo" href="https://mashable.com">
 <div x-data="{animate: false, reverse: false}" x-init="setTimeout(() =&gt; animate = true, 1000)">
 <svg :class="{ 'animate': animate, 'animate-reverse': reverse }" class="inline-block -mb-3 w-40 h-11 fill-current hover:fill-secondary-100" id="mashable-wordmark-animated" shape-rendering="geometricPrecision" text-rendering="geometricPrecision" viewbox="0 0 2200 650" x-ref="wordmark" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><style><![CDATA[.animate #euSMf1FbiNs11_to{animation:euSMf1FbiNs11_to__to 1s linear 1 normal forwards}.animate-reverse #euSMf1FbiNs11_to{animation:euSMf1FbiNs11_to__back 1s linear 1 normal forwards}@keyframes  euSMf1FbiNs11_to__to{0%{transform:translate(2326.395841px,218.229656px)}45%{transform:translate(2326.395841px,218.229656px)

This code prints only the links: the "href" values of "a" tags that start with "https" (plain "http" links are excluded).

In [7]:
for link in soup.find_all("a",  attrs = {'href': re.compile("^https")}):
    print(link['href'])

https://mashable.com
https://mashable.com/tech
https://mashable.com/science
https://mashable.com/life
https://mashable.com/category/social-good
https://mashable.com/entertainment
https://mashable.com/deals
https://mashable.com/shopping
https://mashable.com/travel
https://mashable.com/category/apps-and-software
https://mashable.com/category/artificial-intelligence
https://mashable.com/category/cybersecurity
https://mashable.com/category/cryptocurrency
https://mashable.com/category/mobile
https://mashable.com/category/smart-home
https://mashable.com/category/social-media
https://mashable.com/category/tech-industry
https://mashable.com/category/transportation
https://mashable.com/tech
https://mashable.com/category/space
https://mashable.com/category/climate-change
https://mashable.com/category/environment
https://mashable.com/videos/category/science
https://mashable.com/science
https://mashable.com/category/digital-culture
https://mashable.com/category/family-parenting
https://mashable.co
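As an alternative to the regular expression, the same prefix filter can be written as a CSS attribute selector with `select()`. The sketch below runs on a small made-up HTML snippet (the `example.com` links are illustrative, not taken from the page) and uses `html.parser` so it needs no extra parser package.

```python
from bs4 import BeautifulSoup

sample_html = """
<a href="https://example.com/secure">secure</a>
<a href="http://example.com/plain">plain</a>
<a href="/relative/path">relative</a>
<a>no href</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# [href^="https"] is the CSS "attribute value starts with" selector.
https_links = [a["href"] for a in sample_soup.select('a[href^="https"]')]
print(https_links)
```

Only the first link is selected; the plain "http" link, the relative path, and the tag without an "href" are all skipped.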

This code prints only the links whose "href" value in the "a" tag starts with "http://".

In [8]:
for link in soup.find_all('a', attrs = {"href" : re.compile("^http://")}):
    print(link['href'])

http://instagram.com/Mashable


This code prints only the links whose "href" value in the "a" tag starts with either "http" or "https".

In [9]:
# "s?" makes the "s" optional, so both http:// and https:// match at the start
for link in soup.find_all('a', attrs = {"href" : re.compile("^https?://")}):
    print(link['href'])

https://mashable.com
https://mashable.com/tech
https://mashable.com/science
https://mashable.com/life
https://mashable.com/category/social-good
https://mashable.com/entertainment
https://mashable.com/deals
https://mashable.com/shopping
https://mashable.com/travel
https://mashable.com/category/apps-and-software
https://mashable.com/category/artificial-intelligence
https://mashable.com/category/cybersecurity
https://mashable.com/category/cryptocurrency
https://mashable.com/category/mobile
https://mashable.com/category/smart-home
https://mashable.com/category/social-media
https://mashable.com/category/tech-industry
https://mashable.com/category/transportation
https://mashable.com/tech
https://mashable.com/category/space
https://mashable.com/category/climate-change
https://mashable.com/category/environment
https://mashable.com/videos/category/science
https://mashable.com/science
https://mashable.com/category/digital-culture
https://mashable.com/category/family-parenting
https://mashable.co

The pattern can be simplified: "^http" alone matches "href" values in the "a" tag that start with either "http" or "https".

In [10]:
for link in soup.find_all('a', attrs = {"href" : re.compile("^http")}):
    print(link['href'])

https://mashable.com
https://mashable.com/tech
https://mashable.com/science
https://mashable.com/life
https://mashable.com/category/social-good
https://mashable.com/entertainment
https://mashable.com/deals
https://mashable.com/shopping
https://mashable.com/travel
https://mashable.com/category/apps-and-software
https://mashable.com/category/artificial-intelligence
https://mashable.com/category/cybersecurity
https://mashable.com/category/cryptocurrency
https://mashable.com/category/mobile
https://mashable.com/category/smart-home
https://mashable.com/category/social-media
https://mashable.com/category/tech-industry
https://mashable.com/category/transportation
https://mashable.com/tech
https://mashable.com/category/space
https://mashable.com/category/climate-change
https://mashable.com/category/environment
https://mashable.com/videos/category/science
https://mashable.com/science
https://mashable.com/category/digital-culture
https://mashable.com/category/family-parenting
https://mashable.co
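One detail worth checking when writing these patterns: `|` binds more loosely than `^`, so `^http|https` reads as "starts with http, or contains https anywhere". BeautifulSoup applies a compiled pattern with its `search()` method, so the second alternative can match in the middle of a value. The sketch below compares that pattern with `^https?://` on a few made-up hrefs (none of them come from the page).

```python
import re

hrefs = [
    "https://example.com/a",
    "http://example.com/b",
    "/share?u=https://example.com/c",
    "httpfoo",
]

loose = re.compile("^http|https")   # "starts with http" OR "contains https"
strict = re.compile("^https?://")   # starts with http:// or https://

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts all four values, including the relative share link and the non-URL `httpfoo`, while the strict pattern keeps only the two absolute links.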

This code prints the value of every "href" attribute found in "a" tags. Some values in the output are not absolute links (for example, relative paths), because no filter is applied to the attribute value.

In [11]:
for link in soup.find_all('a', href=True):
    print(link['href'])

https://mashable.com
/category/super-bowl
/series/ai-at-work
https://mashable.com/tech
https://mashable.com/science
https://mashable.com/life
https://mashable.com/category/social-good
https://mashable.com/entertainment
https://mashable.com/deals
https://mashable.com/shopping
https://mashable.com/travel
/category/super-bowl
/series/ai-at-work
https://mashable.com/category/apps-and-software
https://mashable.com/category/artificial-intelligence
https://mashable.com/category/cybersecurity
https://mashable.com/category/cryptocurrency
https://mashable.com/category/mobile
https://mashable.com/category/smart-home
https://mashable.com/category/social-media
https://mashable.com/category/tech-industry
https://mashable.com/category/transportation
https://mashable.com/tech
https://mashable.com/category/space
https://mashable.com/category/climate-change
https://mashable.com/category/environment
https://mashable.com/videos/category/science
https://mashable.com/science
https://mashable.com/category/di

The output above contains some incomplete (relative) links such as /category/super-bowl. Developers often write links in this short form; to make them complete, we need to combine them with the page URL.

In [12]:
from urllib.parse import urljoin

for link in soup.find_all('a', href=True):
    # urljoin resolves a relative href against the page URL
    # and returns an absolute href unchanged
    print(urljoin(url, link['href']))

https://mashable.com
https://mashable.com/category/super-bowl
https://mashable.com/series/ai-at-work
https://mashable.com/tech
https://mashable.com/science
https://mashable.com/life
https://mashable.com/category/social-good
https://mashable.com/entertainment
https://mashable.com/deals
https://mashable.com/shopping
https://mashable.com/travel
https://mashable.com/category/super-bowl
https://mashable.com/series/ai-at-work
https://mashable.com/category/apps-and-software
https://mashable.com/category/artificial-intelligence
https://mashable.com/category/cybersecurity
https://mashable.com/category/cryptocurrency
https://mashable.com/category/mobile
https://mashable.com/category/smart-home
https://mashable.com/category/social-media
https://mashable.com/category/tech-industry
https://mashable.com/category/transportation
https://mashable.com/tech
h
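A root-relative href such as /category/super-bowl is relative to the site root, not to the article path, so different kinds of relative links resolve differently. The sketch below shows how `urljoin` from the standard `urllib.parse` module handles four cases. The `related.html` and `cdn.example.com` entries are made up for illustration and do not appear on the page.

```python
from urllib.parse import urljoin

page_url = "https://mashable.com/2014/08/28/static-website-generators/"

examples = [
    "/category/super-bowl",       # root-relative: replaces the whole path
    "related.html",               # relative to the current directory
    "//cdn.example.com/app.js",   # scheme-relative: reuses only "https:"
    "https://mashable.com/tech",  # already absolute: returned unchanged
]

resolved = [urljoin(page_url, href) for href in examples]
for absolute in resolved:
    print(absolute)
```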

Here we extract links from two tags ("a" and "link") by passing them as a list, keeping only "href" values that start with "http" (which covers both "http" and "https").

In [13]:
for link in soup.find_all(["a", "link"], attrs = {'href': re.compile("^http")}):
    print(link['href'])

https://mashable.com/feeds/rss/all
https://use.typekit.net
https://g.mashable.com/mashable.js?url=https%3A%2F%2Fmashable.com%2F2014%2F08%2F28%2Fstatic-website-generators
https://cdn.ziffstatic.com/jst/zdconsent.js
https://cdn.static.zdbb.net/js/z0WVjCBSEeGLoxIxOQVEwQ.min.js
https://www.google-analytics.com/analytics.js
https://cdn.ziffstatic.com/pg/mashable.js
https://cdn.ziffstatic.com/pg/mashable.prebid.js
https://cdn.ziffstatic.com/pg/mashable.css
https://securepubads.g.doubleclick.net/tag/js/gpt.js
https://mashable.com
https://mashable.com/tech
https://mashable.com/science
https://mashable.com/life
https://mashable.com/category/social-good
https://mashable.com/entertainment
https://mashable.com/deals
https://mashable.com/shopping
https://mashable.com/travel
https://mashable.com/category/apps-and-software
https://mashable.com/category/artificial-intelligence
https://mashable.com/category/cybersecurity
https://mashable.com/category/cryptocurrency
https://mashable.com/category/mobile
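Navigation menus often repeat the same link, as the earlier outputs show with https://mashable.com/tech. A minimal sketch of removing duplicates while keeping the order in which links appear on the page is given below. It runs on a made-up snippet (the `example.com` links are illustrative) and relies on `dict.fromkeys`, which preserves insertion order.

```python
import re

from bs4 import BeautifulSoup

sample_html = """
<link href="https://example.com/style.css" rel="stylesheet"/>
<a href="https://example.com/tech">Tech</a>
<a href="https://example.com/science">Science</a>
<a href="https://example.com/tech">Tech again</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect href values from both tag types in document order.
found = [
    tag["href"]
    for tag in sample_soup.find_all(["a", "link"], href=re.compile("^http"))
]

# dict keys are unique and keep first-seen order, so this drops repeats.
unique_links = list(dict.fromkeys(found))
print(unique_links)
```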


Here we extract all image links by reading the "src" attribute of each "img" tag.

In [14]:
for img in soup.find_all('img'):
    print(img.get('src'))

/images/mashable-potato.png
/images/group-black-logo-purple.png
https://c.evidon.com/pub/icong1.png
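Two of the image paths above are relative, and `img.get('src')` returns `None` for an "img" tag that has no "src" at all. The sketch below handles both cases on a small snippet that mirrors two of the values above. The third tag, without a "src", is made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "https://mashable.com/2014/08/28/static-website-generators/"
sample_html = """
<img src="/images/mashable-potato.png"/>
<img alt="placeholder without a src"/>
<img src="https://c.evidon.com/pub/icong1.png"/>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# src=True keeps only <img> tags that actually carry a src attribute;
# urljoin then turns relative paths into absolute URLs.
image_urls = [
    urljoin(page_url, img["src"])
    for img in sample_soup.find_all("img", src=True)
]
print(image_urls)
```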
