# Web Scraping

### 1. Grabbing the title of a page

- Grab the title of a page
- Use **www.example.com**, a website made specifically to serve as an example domain

In [1]:
#import requests library
import requests


In [3]:
# Step 1: Use the requests library to grab the page
# Note: this may fail if a firewall is blocking Python/Jupyter
# Note: sometimes you need to run this twice if it fails the first time
res = requests.get('https://www.example.com/')


The result, **res**, is a requests.models.Response object that holds the information returned by the website. For example:

In [4]:
# check the type
type(res)


requests.models.Response
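
The notes above mention that the request can fail or may need a second run. A minimal sketch of one way to handle that is shown below, assuming a small helper that retries on connection problems and checks the status code. The names `fetch`, `attempts`, and `timeout` are illustrative and not part of the original notebook.

```python
import requests


def fetch(url, attempts=2, timeout=10):
    """Return the Response for url, retrying on connection problems.

    fetch, attempts, and timeout are illustrative names, not part of
    the original notebook. attempts must be at least 1.
    """
    last_error = None
    for _ in range(attempts):
        try:
            res = requests.get(url, timeout=timeout)
            # Raise requests.exceptions.HTTPError for 4xx/5xx responses
            res.raise_for_status()
            return res
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout) as err:
            # Possibly transient network problem: remember it and try again
            last_error = err
    # Every attempt failed, so surface the last network error
    raise last_error


# Usage (needs network access):
# res = fetch('https://www.example.com/')
# print(res.status_code)
```

Only connection errors and timeouts are retried here; an HTTP error status or a malformed URL is raised immediately, since running the same request again is unlikely to help.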

In [5]:
# see the text
res.text


'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

Use BeautifulSoup to analyze the extracted page. Technically, we could write our own script to look for items in the **res.text** string, but the BeautifulSoup library already has many built-in tools and methods for grabbing information from an HTML string.

In [7]:
#import beautiful soup lib (bs4)
import bs4


In [8]:
#use BeautifulSoup to analyze
soup = bs4.BeautifulSoup(res.text, "lxml")


In [10]:
#check your soup results
soup


<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

Use the **.select()** method to grab elements. We are looking for the 'title' tag, so we pass in 'title'. The method returns a list of every matching element.


In [11]:
# check the title
soup.select('title')


[<title>Example Domain</title>]

In [38]:
# store the list of matching title tags
title_tag = soup.select('title')
title_tag

[<title>Example Domain</title>]

In [39]:
#check first element
title_tag[0]


<title>Example Domain</title>

In [40]:
# check the type
type(title_tag[0])


bs4.element.Tag

In [43]:
# get the title
title = title_tag[0].getText()
title

'Example Domain'
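
The steps above (select, index, get the text) can also be written more compactly. The sketch below is one possible shortcut, not the notebook's method: it uses a short hard-coded HTML string in place of **res.text** so it runs without a network call, and it uses Python's built-in "html.parser" rather than "lxml" (both are assumptions made for this example).

```python
import bs4

# A small stand-in for res.text so the sketch runs without a network call
html = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example Domain</h1></body></html>"
)

# "html.parser" ships with Python; the notebook itself uses "lxml"
soup = bs4.BeautifulSoup(html, "html.parser")

# select_one() returns the first matching Tag (or None) instead of a list
title_text = soup.select_one("title").get_text()

# soup.title is a shortcut to the first <title> tag in the document
same_text = soup.title.get_text()

print(title_text)
print(same_text == title_text)
```

`select_one()` avoids the `[0]` indexing step, but it returns `None` when nothing matches, so real pages may need a check before calling `.get_text()`.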

### 2. Getting an Image from a Website

- Grab an image of a lantern from a Wikipedia article (Wikipedia content is freely licensed).

https://en.wikipedia.org/wiki/Lantern_Festival

In [44]:
#import the library that you will need 
import requests 
import bs4


In [45]:
# send request to get https://en.wikipedia.org/wiki/Lantern_Festival
# store to object res

res = requests.get("https://en.wikipedia.org/wiki/Lantern_Festival")

In [50]:
# parse the page HTML with BeautifulSoup
soup = bs4.BeautifulSoup(res.text, "lxml")
soup


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Lantern Festival - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"dc0c7deb-a9d7-4bba-950d-914168a70aec","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Lantern_Festival","wgTitle":"Lantern Festival","wgCurRevisionId":1072028790,"wgRevisionId":1072028790,"wgArticleId":463617,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Articles with short description","Short description is different from Wikidata","Articles containing Chinese-

In [58]:
# select all img tags and count them
img1_info = soup.select('img')
img1_info
len(img1_info)


27

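Inspect the page again in the browser. Selecting 'img' matches every image on the page, so narrow the search with the **thumbimage** CSS class instead.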

In [60]:
# select all tags with the .thumbimage class, store in img_info
# look at img_info
img_info = soup.select('.thumbimage')
img_info
len(img_info)


4

In [61]:
# check the length of img_info
# how many images?
len(img_info)


4
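
The difference between selecting by tag name and selecting by class can be seen on a much smaller page. The sketch below uses hypothetical markup (the file names and structure are made up for illustration) and the built-in "html.parser" so it runs without a network call.

```python
import bs4

# Hypothetical markup, loosely modeled on an article with thumbnails
html = """
<div>
  <img src="logo.png">
  <img class="thumbimage" src="a.jpg">
  <img class="thumbimage" src="b.jpg">
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

all_imgs = soup.select("img")               # every <img> tag
thumbs = soup.select(".thumbimage")         # any tag with class="thumbimage"
img_thumbs = soup.select("img.thumbimage")  # only <img> tags with that class

print(len(all_imgs), len(thumbs), len(img_thumbs))
```

A leading dot selects by class, a bare name selects by tag, and the two can be combined as in `img.thumbimage`.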

In [67]:
# get the first image from img1_info, the full list of img tags
lantern = img1_info[0]
lantern


<img alt="Lantern Festival in Taiwan at night 5.jpg" data-file-height="3888" data-file-width="5184" decoding="async" height="180" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Lantern_Festival_in_Taiwan_at_night_5.jpg/240px-Lantern_Festival_in_Taiwan_at_night_5.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Lantern_Festival_in_Taiwan_at_night_5.jpg/360px-Lantern_Festival_in_Taiwan_at_night_5.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Lantern_Festival_in_Taiwan_at_night_5.jpg/480px-Lantern_Festival_in_Taiwan_at_night_5.jpg 2x" width="240"/>

In [66]:
#check the type
type(lantern)


bs4.element.Tag

We can make dictionary-like calls for parts of the **Tag**. In this case, we are interested in the **src**, or "source", of the image, which should be its own .jpg or .png link:

In [68]:
#get the source only
lantern['src']


'//upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Lantern_Festival_in_Taiwan_at_night_5.jpg/240px-Lantern_Festival_in_Taiwan_at_night_5.jpg'
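
As a small illustration of the dictionary-like access described above, the sketch below parses a single hypothetical `img` tag (the host `upload.example.org` and the attribute values are made up) and reads its attributes in three ways.

```python
import bs4

# A single hypothetical <img> tag standing in for the lantern image
html = '<img alt="A lantern" src="//upload.example.org/lantern.jpg" width="240"/>'
tag = bs4.BeautifulSoup(html, "html.parser").img

src = tag["src"]          # like a dict: raises KeyError if the attribute is missing
title = tag.get("title")  # like dict.get(): returns None if the attribute is missing
attrs = tag.attrs         # all attributes as a plain dict

print(src)
print(title)
print(attrs["width"])
```

`.get()` is the safer choice when an attribute may be absent on some tags.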

Now that we have the actual src link, we can grab the image with requests.get() and read its bytes from the .content attribute.

Note that the src value starts with // and has no scheme, so we have to add https: in front of it. If we don't, requests will raise an error (with a fairly descriptive message).
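
Instead of typing the scheme by hand, the full link can be built from the page address. The sketch below uses `urljoin` from Python's standard library; the variable names `page_url`, `src`, and `image_url` are illustrative.

```python
from urllib.parse import urljoin

# The page the image was found on, and the src value taken from the tag
page_url = "https://en.wikipedia.org/wiki/Lantern_Festival"
src = (
    "//upload.wikimedia.org/wikipedia/commons/thumb/a/a9/"
    "Lantern_Festival_in_Taiwan_at_night_5.jpg/"
    "240px-Lantern_Festival_in_Taiwan_at_night_5.jpg"
)

# urljoin resolves src against page_url, reusing its https scheme
image_url = urljoin(page_url, src)
print(image_url)
```

The same call also resolves relative paths such as `/static/logo.png`, so it works for src values in other forms too.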

In [74]:
#request to get the image with the link directly, save to image_link
image_link = requests.get("https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Lantern_Festival_in_Taiwan_at_night_5.jpg/240px-Lantern_Festival_in_Taiwan_at_night_5.jpg")
image_link

<Response [200]>

In [75]:
# check the raw content (it's binary data, meaning we will need to use binary read/write methods for saving it)
image_link.content


b'\xff\xd8\xff\xdb\x00C\x00\x04\x03\x03\x04\x03\x03\x04\x04\x03\x04\x05\x04\x04\x05\x06\n\x07\x06\x06\x06\x06\r\t\n\x08\n\x0f\r\x10\x10\x0f\r\x0f\x0e\x11\x13\x18\x14\x11\x12\x17\x12\x0e\x0f\x15\x1c\x15\x17\x19\x19\x1b\x1b\x1b\x10\x14\x1d\x1f\x1d\x1a\x1f\x18\x1a\x1b\x1a\xff\xdb\x00C\x01\x04\x05\x05\x06\x05\x06\x0c\x07\x07\x0c\x1a\x11\x0f\x11\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\xff\xc0\x00\x11\x08\x00\xb4\x00\xf0\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1d\x00\x00\x01\x03\x05\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x05\x06\x07\x00\x01\x02\x04\t\x03\xff\xc4\x00@\x10\x00\x02\x01\x03\x04\x01\x02\x05\x02\x04\x05\x02\x04\x05\x05\x00\x01\x02\x03\x04\x05\x11\x00\x06\x12!\x07\x131\x08\x14"AQ2a\x15#q\x81BR\x91\xa1\xb1\x16$3b\xc1\xd1\t\x17Cr\x82%Sc\xe1\xf1\xff\xc4\x00\x1b\x01\x00\x02\x03\x01\x01\x01\x00\x00\x00\x00\

Write this to a file named "myNewImage.jpg". The 'wb' mode opens the file to "write binary" data.

In [78]:
# write to a file called "myNewImage.jpg" with 'wb' option
f = open('myNewImage.jpg', 'wb')


In [80]:
# write the content
f.write(image_link.content)


22697

In [82]:
# close the file, then check your folder.
f.close()
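
The open, write, and close steps above can also be combined using a `with` block, which closes the file automatically even if the write fails. In this sketch, `image_bytes` is a four-byte stand-in for `image_link.content`, and the file is written to the system temporary folder; both are assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

# Stand-in for image_link.content so the sketch runs without a network call
image_bytes = b"\xff\xd8\xff\xdb"

# Write into the temporary folder to avoid overwriting a real file
out_path = Path(tempfile.gettempdir()) / "myNewImage.jpg"

# 'wb' opens the file for writing binary data; the file closes on exit
with open(out_path, "wb") as f:
    n_written = f.write(image_bytes)

print(n_written)
```

As in the notebook, `write()` returns the number of bytes written.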
