# Web Scraping with _requests_ and _BeautifulSoup_

In this notebook we will practice web scraping and information extraction using the packages _requests_ and _BeautifulSoup_ and the world-renowned website [www.p-tech.org.uk](https://www.p-tech.org.uk).

To begin, if they are not already installed and up-to-date, we need to install the relevant packages.

In [17]:
!pip install --user requests
!pip install --user beautifulsoup4



Note: if you run into SSL certificate problems further down the notebook, you may need to update _requests_. You can do this with either of:

_!pip install --upgrade requests_

_!pip install --upgrade --user requests_ &nbsp;&nbsp;&nbsp; _# if you don't have admin privileges_

From here we can import our packages. We will also set up a _headers_ variable containing a User-Agent string, which makes the request identify itself to the website as coming from a normal web browser:

In [18]:
from bs4 import BeautifulSoup
from requests import get
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

From here we can specify our website and make a GET request (as per the lecture). We can print the start of the response to check we have been able to access the site content.

In [19]:
ptech_web = "http://ptechweb.s3-website.us-east-2.amazonaws.com"
response = get(ptech_web, headers=headers)

print(response.text[:500])

<!DOCTYPE html>
<!--[if IE 9 ]> <html lang="en-US" class="ie9 loading-site no-js"> <![endif]-->
<!--[if IE 8 ]> <html lang="en-US" class="ie8 loading-site no-js"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]><!--><html lang="en-US" class="loading-site no-js"> <!--<![endif]-->
<head>
<meta charset="UTF-8"/>
<link rel="profile" href="http://gmpg.org/xfn/11"/>
<link rel="pingback" href="xmlrpc.php.html"/>
<script>(function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documen
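
The request above assumes the site responds successfully. As a minimal sketch of a more defensive variant, the hypothetical helper `fetch_html` below (not part of the original notebook) adds a timeout and returns `None` when the request fails or the server answers with an error status:

```python
from requests import get
from requests.exceptions import RequestException


def fetch_html(url, headers=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper name used for illustration only.
    """
    try:
        response = get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except RequestException as err:
        # covers connection errors, timeouts, invalid URLs and HTTP errors
        print(f"Request to {url!r} failed: {err}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic happens
result = fetch_html("not-a-url")
print(result)
```

In the notebook, `fetch_html(ptech_web, headers=headers)` could stand in for the bare `get` call, with a check for `None` before parsing.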


Now that we have scraped some website content, let's see if we can find anything useful in it by parsing the content with _BeautifulSoup_. I mean, it's unlikely given the source material, but we will persevere. Let's start by extracting all the _h1_ tags (first-level headers):

In [20]:
ptech_soup = BeautifulSoup(response.text, 'html.parser')

ptech_h1 = ptech_soup.find_all('h1')
print(ptech_h1)

[<h1>Digital Marketing Consultancy</h1>, <h1>Paid Advertising Training</h1>, <h1>International Marketing</h1>, <h1>Google Analytics Training</h1>]


We can do something similar for _h4_ tags, but this time we will specify a particular CSS class ('thin-font') and extract only the text rather than the HTML tags:

In [21]:
ptech_tag = ptech_soup.find_all('h4', class_='thin-font')

for each in ptech_tag:
    print(each.get_text())

Need to understand your online marketing?
Google Adwords & Facebook Advertising
Are you looking to Export?
Learn how to reach new markets.
Get started with Analytics, understand the data
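
The same tag-plus-class lookup can also be written as a CSS selector. The sketch below runs on a small made-up HTML snippet (not the real page) and compares `find_all` with `select`, which BeautifulSoup also provides:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page is much larger
sample_html = """
<h1>Digital Marketing Consultancy</h1>
<h4 class="thin-font">Need to understand your online marketing?</h4>
<h4 class="thin-font uppercase">Google Adwords &amp; Facebook Advertising</h4>
<h4>Some other heading</h4>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Keyword-argument style, as used in the cell above
by_find_all = [tag.get_text(strip=True)
               for tag in soup.find_all('h4', class_='thin-font')]

# Equivalent CSS selector: h4 elements whose class list includes thin-font
by_select = [tag.get_text(strip=True)
             for tag in soup.select('h4.thin-font')]

print(by_find_all)
print(by_select)
```

Both approaches match a tag if _thin-font_ is any one of its classes, so the heading with two classes is included while the heading without a class is not.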


Whilst we can see that the site has a strange approach to capitalisation, everything does seem to have worked! Let's try something a bit more complicated.

Here we will loop over two web pages (the _suffixes_ list below) and extract the _h1_ headings, the _meta title_ and the _meta description_. As the website stores content inside [containers](https://www.w3schools.com/w3css/w3css_containers.asp), we will need to search inside these, again via a loop. Finally, note that we use the _time_ module to make the script wait between requests via _sleep()_. This is good practice when scraping websites, as a script can make a lot of requests very quickly and overload the website's server. We sleep for a random time just for fun. We also time the whole process using the notebook cell magic _%%time_.

In [22]:
%%time

from random import randint
from time import sleep

# dictionary that will hold the results for each page, keyed by suffix
title_dict = {}

uri = 'http://ptechweb.s3-website.us-east-2.amazonaws.com/about-us/'
suffixes = ['staff/james-pennington', 'clients'] # add the rest in here 

for suffix in suffixes: # loop through the suffixes list

    h1heading = [] # new list or empty the list to start again 
    metatitle = []
    metadesc = []
    ptech_webscrape = uri + suffix # concatenate the URL and the suffix in the current loop  
    r = get(ptech_webscrape, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    ptech_webpage = page_html.find_all('html')       
    
    if ptech_webpage != []:
        for container in ptech_webpage:
            
            # page title
            meta_title = page_html.find("meta", property="og:title")
            metatitle.append(meta_title)
            
            # meta description
            meta_desc = page_html.find("meta", property="og:description")
            metadesc.append(meta_desc)
            
            # H1
            h1name = container.find_all('h1')[0].text
            h1heading.append(h1name)
            
            
    else:
        continue

    title_dict[suffix] = h1heading, metatitle, metadesc # store a tuple of (h1 headings, meta titles, meta descriptions) for this suffix
    
    
    sleep(randint(1,3))

Wall time: 5.32 s
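
The loop above stores whole `<meta>` tags and will raise an `IndexError` on a page with no _h1_. As a rough sketch of an alternative, the hypothetical function `extract_page_info` below (run here on made-up HTML rather than the live site) keeps only the `content` attribute and returns `None` for anything missing:

```python
from bs4 import BeautifulSoup


def extract_page_info(html):
    """Return the first h1 text and the Open Graph title/description.

    extract_page_info is a hypothetical helper; missing elements
    come back as None instead of raising an error.
    """
    soup = BeautifulSoup(html, 'html.parser')

    def meta_content(prop):
        tag = soup.find('meta', property=prop)
        return tag.get('content') if tag is not None else None

    h1 = soup.find('h1')
    return {
        'h1': h1.get_text(strip=True) if h1 is not None else None,
        'meta_title': meta_content('og:title'),
        'meta_description': meta_content('og:description'),
    }


# Made-up HTML standing in for r.text from the loop above
sample_html = """
<html><head>
<meta content="Our Clients - ptech" property="og:title"/>
</head><body><h1>Our Clients</h1></body></html>
"""

info = extract_page_info(sample_html)
print(info)
```

Inside the loop, `title_dict[suffix] = extract_page_info(r.text)` would then give one plain dictionary per page.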


In [23]:
print(title_dict)

{'staff/james-pennington': (['James Pennington'], [<meta content="James Pennington - ptech" property="og:title"/>], [<meta content="James has been involved in IT consultancy to SMEâs for 18+ years and has worked with over a large number of businesses in that time to help them understand and implement IT effectively within their organisations. With both a practical and strategic view of IT across all areas of the business, James works with a [...]" property="og:description"/>]), 'clients': (['Our Clients'], [<meta content="Our Clients - ptech" property="og:title"/>], [<meta content="We have delivered our services to a wide range of clients over the past 10 years, including business start-ups and Global organisations. Â What ever the type of business you’re running, we are confident that we can provided the skills and support required to take your business to the next level. Our clients, whether they be [...]" property="og:description"/>])}
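
The stray `â` and `Â` characters in the output above look like UTF-8 text that was decoded with a single-byte encoding such as Latin-1. That is an assumption about this site rather than something the notebook confirms, but the effect is easy to reproduce offline:

```python
# Text as the page author presumably wrote it (right single quote, U+2019)
original = "SME\u2019s"

# Decoding the UTF-8 bytes as Latin-1 reproduces the kind of garbling seen above
garbled = original.encode('utf-8').decode('latin-1')
print(repr(garbled))

# Reversing the mistake recovers the original text
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)

# In the scraper itself, either of these (untested against the live site)
# should avoid the problem:
#   r.encoding = 'utf-8'                                 # set before reading r.text
#   page_html = BeautifulSoup(r.content, 'html.parser')  # pass bytes and let
#                                                        # BeautifulSoup detect the charset
```

Passing `r.content` works because BeautifulSoup can then read the `<meta charset="UTF-8"/>` declaration that appears at the top of the page.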


To finish off our work we will add our data to a _pandas_ DataFrame and export as a CSV.

In [24]:
cols = ['Title'] # columns in our output file

ptechcsv = pd.DataFrame({'Title': title_dict})[cols]
ptechcsv.to_csv('ptech_scrape.csv') # the name of our file
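
The export above places each page's whole tuple in a single _Title_ column. As one possible alternative, the sketch below builds a table with one row per page and one column per field. It uses a made-up `sample_dict` with plain strings; with the real `title_dict` the meta entries are tags, so their text would have to be read from the `content` attribute first. The file name in the final comment is hypothetical.

```python
import pandas as pd

# Made-up stand-in for title_dict, with plain strings instead of <meta> tags
sample_dict = {
    'staff/james-pennington': (['James Pennington'],
                               ['James Pennington - ptech'],
                               ['James has been involved in IT consultancy ...']),
    'clients': (['Our Clients'],
                ['Our Clients - ptech'],
                ['We have delivered our services ...']),
}

rows = []
for suffix, (h1s, meta_titles, meta_descs) in sample_dict.items():
    rows.append({
        'page': suffix,
        'h1': h1s[0] if h1s else None,
        'meta_title': meta_titles[0] if meta_titles else None,
        'meta_description': meta_descs[0] if meta_descs else None,
    })

tidy = pd.DataFrame(rows, columns=['page', 'h1', 'meta_title', 'meta_description'])
print(tidy)

# tidy.to_csv('ptech_scrape_tidy.csv', index=False)  # hypothetical file name
```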

That's it! We have scraped a website :)