<div>
<img src="https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg" width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and rewrite the code as needed.
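
Rule 2 can be enforced in code rather than left to discipline. The following is a minimal sketch, not part of the original lab: `Throttle` is a hypothetical helper name, and the one-second default simply mirrors the guideline above.

```python
import time


class Throttle:
    """Hypothetical helper: keeps at least `min_interval` seconds between calls."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=1.0)
# Call throttle.wait() immediately before every request, for example:
# for url in urls:
#     throttle.wait()
#     r = http.request('GET', url)
```

The first call returns immediately; each later call blocks only for whatever part of the interval has not already elapsed.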

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

From the result, note that the main text sits inside a few levels of nested HTML tags.

In [166]:
## Import Libraries

import re

import pandas as pd

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [190]:
# specify the url
quote_page = 'https://www.nike.com/sg/launch'

### Retrieve the page
- Requires an Internet connection

In [191]:
# query the website and return the html to the variable ‘page’
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 646221
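
A request without a timeout can hang indefinitely if the server stops responding. The sketch below shows one way to add timeouts and a small retry budget with `urllib3`. The function names `make_pool` and `fetch_page` are hypothetical, and the numeric values are assumptions chosen for illustration, not recommendations from the lab.

```python
from urllib3 import PoolManager
from urllib3.util import Retry, Timeout


def make_pool():
    """Build a pool with explicit timeouts and a small retry budget (assumed values)."""
    return PoolManager(
        timeout=Timeout(connect=5.0, read=15.0),     # seconds
        retries=Retry(total=3, backoff_factor=1.0),  # wait longer between attempts
    )


def fetch_page(http, url):
    """Return the page body as bytes, or None when the status is not 200."""
    r = http.request('GET', url)
    if r.status != 200:
        print('Request failed with status %s' % r.status)
        return None
    return r.data
```

With these in place, `page = fetch_page(make_pool(), quote_page)` would stand in for the cell above. Returning `None` on failure lets later cells check the result before parsing.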


### Convert the stream of bytes into a BeautifulSoup representation

In [192]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [193]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="" data-country="sg" lang="en-GB">
 <head>
  <script id="newrelic-browser-agent-script" type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{c.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(26),c={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(c.console=!0,o.indexOf("dev")!==-1&&(c.dev=!0),o.indexOf("nr_dev")!==-1&&(c.nrDev=!0))}catch(s){}c.nrDev&&i.on("internal-error",function(t){r(t.stack)}),c.dev&&i.on("fn-err",function(t,e,n){r(n.stack)}),c.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(c,function(t,e){return t}).join(", ")))},{}],2:[function(t,e,n){function r(t,e,n,r,c){try{l?l-=1:

### Check the HTML's Title

In [194]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title data-react-helmet="true">Nike SNKRS. Release Dates and Launch Calendar SG</title>:
Title text:Nike SNKRS. Release Dates and Launch Calendar SG:
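
Not every page has a `<title>`. When it is missing, `soup.title` is `None` and `soup.title.string` raises an `AttributeError`. A small guard avoids that. This is a sketch on made-up HTML; `page_title` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def page_title(soup):
    """Return the stripped title text, or None when the page has no <title>."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


with_title = BeautifulSoup('<html><head><title> Demo page </title></head></html>', 'html.parser')
without_title = BeautifulSoup('<html><body><p>No title here</p></body></html>', 'html.parser')

print(page_title(with_title))     # Demo page
print(page_title(without_title))  # None
```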


### Find the main content
- Check whether the relevant data can be isolated from the rest of the page

### Get some of the text
- Plain text without HTML tags

In [195]:
tag = 'div'
div = soup.find_all(tag)[0]
print('Type of the variable \'div\':', div.__class__.__name__)

Type of the variable 'div': Tag


In [196]:
# show the first 500 characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', div.text)[:500])
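
If this cell prints nothing, the first `<div>` on the page most likely contains no text. One option is to take the first `<div>` that has text. The sketch below uses made-up HTML to stand in for a real page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page: the first <div> is empty.
html = '''
<div id="placeholder"></div>
<div id="main"><p>First paragraph.</p><p>Second paragraph.</p></div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# Pick the first <div> whose text is non-empty after stripping whitespace.
first_with_text = next(
    (d for d in demo_soup.find_all('div') if d.get_text(strip=True)),
    None,
)

# get_text can join the text fragments with a separator and strip each one.
if first_with_text is not None:
    print(first_with_text.get_text(separator='\n', strip=True))
```

Passing `separator` and `strip=True` to `get_text` also removes the need for a regular expression to collapse blank lines.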





In [197]:
root = soup.find(id="root")
root

<div class="u-full-width u-full-height" id="root"><div class="u-full-width u-full-height"><div class="root-controller remove-outline" data-qa="root-controller"><span id="forcedOptimizely" style="display:none"></span><div class="main-layout" data-qa="category-experience"><div class="content-wrapper"><header><div class="d-sm-h d-lg-b"><section class="bg-white border-bottom-light-grey z10 top-nav"><div class="limit-max-width d-sm-h d-lg-flx"><div><a class="back-to-nike-link d-sm-ib va-sm-m pt2-sm pb2-sm prl7-sm d-sm-ib" href="https://www.nike.com/sg/en"><svg fill="#757575" height="10px" viewbox="0 0 185.4 300" width="12px"><path d="M160.4 300c-6.4 0-12.7-2.5-17.7-7.3L0 150 142.7 7.3c9.8-9.8 25.6-9.8 35.4 0 9.8 9.8 9.8 25.6 0 35.4L70.7 150 178 257.3c9.8 9.8 9.8 25.6 0 35.4-4.9 4.8-11.3 7.3-17.6 7.3z"></path></svg><span> <!-- -->Visit Nike.com<!-- --> </span></a></div><ul class="right-nav prl7-sm"><li class="member-nav-item d-sm-ib va-sm-m" data-qa="top-nav-user-menu"><button class="join-lo

In [199]:
content = root.find(class_='upcoming-section bg-white ncss-row prl2-md prl5-lg pb4-md pb6-lg')
content

<section class="upcoming-section bg-white ncss-row prl2-md prl5-lg pb4-md pb6-lg" data-qa="upcoming-section"><figure class="pb2-sm va-sm-t ncss-col-sm-12 ncss-col-md-6 ncss-col-lg-4 pb4-md prl0-sm prl2-md ncss-col-sm-6 ncss-col-lg-3 pb4-md prl2-md pl0-md pr1-md"><div class="product-card ncss-row mr0-sm ml0-sm" data-qa="product-card-0"><div class="ncss-col-sm-12 full"><a aria-label="Air Jordan 4 x UNION LA 'Desert Moss' Release Date" class="card-link d-sm-b" data-qa="product-card-link" href="/sg/launch/t/air-jordan-4-union-la-desert-moss?cp=32474922017_search_%7Csg%7CBrand%2BProduct%3ATM%2B-%2BGN%2B-%2BSNKRS%2B-%2BLaunch%2BCalendar%2B-%2BEN_EN%2B-%2BExact%7CGOOGLE%7Csnkrs"><div style="position:absolute;top:0;right:0;bottom:0;left:0"></div></a><figcaption class="ncss-row"><div class="ncss-col-sm-12 full"><div class="figcaption-content"><div class="copy-container ta-sm-c bg-white pt6-sm pb7-sm pb7-lg"><h3 class="headline-5">Air Jordan 4 x UNION LA<!-- --> </h3><h6 class="headline-3">Deser
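
Passing the full class string to `class_` only matches when the attribute value is exactly that string, in the same order, so the lookup breaks as soon as the site adds, removes or reorders a utility class. Matching on one stable class or on another attribute is usually less fragile. The sketch below runs on a made-up fragment that mimics the structure above.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure shown above.
html = (
    '<div id="root">'
    '<section class="upcoming-section bg-white ncss-row" data-qa="upcoming-section">'
    '<figure><a href="/sg/launch/t/example">Example</a></figure>'
    '</section></div>'
)
demo_soup = BeautifulSoup(html, 'html.parser')

# A multi-class string only matches the attribute value exactly, in the same order.
reordered = demo_soup.find(class_='bg-white upcoming-section ncss-row')
print(reordered)  # None

# A single class name matches regardless of the other classes on the element.
by_class = demo_soup.find('section', class_='upcoming-section')

# A CSS selector or another attribute reaches the same element.
by_css = demo_soup.select_one('section.upcoming-section')
by_attr = demo_soup.find('section', attrs={'data-qa': 'upcoming-section'})
```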

In [200]:
product = content.find_all('figure')

In [201]:
product_list = [t.find('a') for t in product]
product_href = [[t.get('aria-label'), 'https://www.nike.com%s'%t.get('href')] for t in product_list]

In [202]:
pd.DataFrame(product_href, columns=['product', 'url'])

Unnamed: 0,product,url
0,Air Jordan 4 x UNION LA 'Desert Moss' Release ...,https://www.nike.com/sg/launch/t/air-jordan-4-...
1,SNKRS Style: Pineapple Pack,https://www.nike.com/sg/launch/t/snkrs-style-p...
2,Women's Air Max Furyosa 'Silver and Black' Rel...,https://www.nike.com/sg/launch/t/womens-air-ma...
3,Women's Air Jordan 1 Low OG 'Neutral Grey' Rel...,https://www.nike.com/sg/launch/t/womens-air-jo...
4,Air Force 1 'Pineapple Cork' Release Date,https://www.nike.com/sg/launch/t/air-force-1-p...
5,Women's Air Force 1 'Pineapple Canvas' Release...,https://www.nike.com/sg/launch/t/womens-air-fo...
6,Air Max 96 II 'Beach' Release Date,https://www.nike.com/sg/launch/t/air-max-96-ii...
7,Air Max 96 II 'Dark Denim' Release Date,https://www.nike.com/sg/launch/t/air-max-96-ii...
8,Kickcheck 6.11,https://www.nike.com/sg/launch/t/kickcheck-6-1...
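
Prefixing every `href` with the domain works only while all links are root-relative. `urllib.parse.urljoin` handles both relative and absolute links, and `unquote` (already imported at the top of the notebook) decodes percent-escapes. A short sketch, using a shortened version of the first link above:

```python
from urllib.parse import unquote, urljoin

base_url = 'https://www.nike.com/sg/launch'

# A root-relative href is resolved against the site root.
relative_href = '/sg/launch/t/air-jordan-4-union-la-desert-moss'
print(urljoin(base_url, relative_href))

# An href that is already absolute is returned unchanged.
absolute_href = 'https://www.nike.com/sg/t/air-zoom-pegasus-38-running-shoe-Hmsj6Q/CW7356-400'
print(urljoin(base_url, absolute_href))

# unquote decodes percent-escapes such as %7C ('|') and %3A (':').
print(unquote('search_%7Csg%7CBrand%2BProduct%3ATM'))
```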


In [207]:
from bs4 import BeautifulSoup
import requests
  
# sample website
sample_website = 'https://www.nike.com/sg/w/new-mens-shoes-3n82yznik1zy7ok'
  
# call get method to request the page
page = requests.get(sample_website)
  
# parse the response body with BeautifulSoup's
# built-in HTML parser
nike_soup = BeautifulSoup(page.content, 'html.parser')
  

In [242]:
root = nike_soup.find(id='app-root')
root

<div id="app-root">
<!-- START dotcom-nav configuration: shared -->
<!-- Generated: c2542dc5-b3e3-479d-802f-adbce6624de2 @ 2021-06-14T18:10:05.709Z -->
<script id="gen-nav-shared">
        (function initDotcomNavShared() {
          var messages = {"hf-geomismatch-chooseLocation":"Choose Location","hf-geomismatch-confirm":"Confirm","hf-geomismatch-message":"Update your location to shop products available in {country}","hf-geomismatch-prompt":"We think you are in {country}. Update your location?","hf-geomismatch-title":"Confirm your Location","hf-geoselection-title":"Select your Location","hf-header-label-carticon":"Bag Items","hf-header-label-countryicon":"Selected Location","hf_cookie-policy_banner_ok":"OK","hf_cookie_label_cookieSettings":"Cookie Settings","hf_cookie_label_done":"Done","hf_cookie_label_functional":"Functional","hf_cookie_label_performance":"Performance","hf_cookie_label_privacyPolicy":"Privacy & Cookie Policy","hf_cookie_label_socialMedia":"Social Media and Advertisi

In [253]:
products_catalog = root.find(class_='product-grid__items css-yj4gxb css-r6is66 css-zndamd css-1u4idlj')
products_catalog

<div class="product-grid__items css-yj4gxb css-r6is66 css-zndamd css-1u4idlj"><div class="product-card css-1jijlv2 css-z5nr6i css-11ziap1 css-14d76vy css-dpr2cn product-grid__card" data-product-position="1"><div class="product-card__body" data-el-type="Card"><figure><a class="product-card__link-overlay" href="https://www.nike.com/sg/t/air-zoom-pegasus-38-running-shoe-Hmsj6Q/CW7356-400">Nike Air Zoom Pegasus 38</a><a aria-label="Nike Air Zoom Pegasus 38" class="product-card__img-link-overlay" data-el-type="Hero" href="https://www.nike.com/sg/t/air-zoom-pegasus-38-running-shoe-Hmsj6Q/CW7356-400"><div class="wall-image-loader css-1la3v4n"><div><noscript><img alt="Nike Air Zoom Pegasus 38 Men's Running Shoe" class="css-1fxh5tw product-card__hero-image" height="400" loading=

In [296]:
type(products_catalog)

bs4.element.Tag

In [254]:
products = products_catalog.find_all('figure')
products

[<figure><a class="product-card__link-overlay" href="https://www.nike.com/sg/t/air-zoom-pegasus-38-running-shoe-Hmsj6Q/CW7356-400">Nike Air Zoom Pegasus 38</a><a aria-label="Nike Air Zoom Pegasus 38" class="product-card__img-link-overlay" data-el-type="Hero" href="https://www.nike.com/sg/t/air-zoom-pegasus-38-running-shoe-Hmsj6Q/CW7356-400"><div class="wall-image-loader css-1la3v4n"><div><noscript><img alt="Nike Air Zoom Pegasus 38 Men's Running Shoe" class="css-1fxh5tw product-card__hero-image" height="400" loading="lazy" src="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/d199515a-a0d1-4880-aba4-f92ddcbe7695/air-zoom-pegasus-38-running-shoe-Hmsj6Q.png" width="400"/></noscript></div></div></a><div class="product-card__info for--product disable-anima

In [293]:
type(products)

bs4.element.ResultSet

In [255]:
products_2 = [t.find(class_='product-card__info for--product disable-animations') for t in products]
products_2

[<div class="product-card__info for--product disable-animations"><div class=""><div class="product-card__messaging accent--color">Just In</div><div class="product-card__titles"><div class="product-card__title" id="Nike Air Zoom Pegasus 38">Nike Air Zoom Pegasus 38</div><div class="product-card__subtitle">Men's Running Shoe</div></div></div><div class="product-card__count-wrapper false undefined"><div class="product-card__count-item"><button class="product-card__colorway-btn" type="button"><div aria-label="Available in 1 Color" class="product-card__product-count">1 Colour</div></button></div></div><div class="product-card__animation_wrapper"><div class="product-card__price-wrapper"><div class="product-card__price"><div class="product-price__wrapper css-cl9118"><div class="product-price css-11s12ax is--current-price" data-test="product-price">S$199</div></div></div></div></div></div>,
 <div class="product-card__info for--product disable-animations"><div class=""><div class="product-card_

In [275]:
products_2[0]

<div class="product-card__info for--product disable-animations"><div class=""><div class="product-card__messaging accent--color">Just In</div><div class="product-card__titles"><div class="product-card__title" id="Nike Air Zoom Pegasus 38">Nike Air Zoom Pegasus 38</div><div class="product-card__subtitle">Men's Running Shoe</div></div></div><div class="product-card__count-wrapper false undefined"><div class="product-card__count-item"><button class="product-card__colorway-btn" type="button"><div aria-label="Available in 1 Color" class="product-card__product-count">1 Colour</div></button></div></div><div class="product-card__animation_wrapper"><div class="product-card__price-wrapper"><div class="product-card__price"><div class="product-price__wrapper css-cl9118"><div class="product-price css-11s12ax is--current-price" data-test="product-price">S$199</div></div></div></div></div></div>

In [297]:
type(products_2[0])

bs4.element.Tag

In [274]:
products_2[0].find(class_='product-card__messaging accent--color').text

'Just In'

In [273]:
products_2[0].find(class_='product-card__title').text

'Nike Air Zoom Pegasus 38'

In [270]:
products_2[0].find(class_='product-card__subtitle').text

"Men's Running Shoe"

In [287]:
products_2[0].find(class_='product-card__product-count').text

'1 Colour'

In [289]:
products_2[0].find(class_='product-price css-11s12ax is--current-price').text

'S$199'
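
Two things make these lookups fragile. First, `find` returns `None` when nothing matches, so `.text` then raises an `AttributeError`. Second, class names such as `css-11s12ax` look auto-generated and may change between site builds. The sketch below works around both on a made-up card. `text_or_none` is a hypothetical helper name, and the selectors assume the markup shown above.

```python
from bs4 import BeautifulSoup


def text_or_none(card, selector):
    """Hypothetical helper: text of the first match for a CSS selector, else None."""
    match = card.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


# Made-up card reduced from the markup shown above; it has no messaging block.
html = (
    '<div class="product-card__info">'
    '<div class="product-card__title">Nike Air Zoom Pegasus 38</div>'
    '<div class="product-price css-11s12ax is--current-price" data-test="product-price">S$199</div>'
    '</div>'
)
card = BeautifulSoup(html, 'html.parser')

print(text_or_none(card, '.product-card__title'))              # Nike Air Zoom Pegasus 38
print(text_or_none(card, '.product-price.is--current-price'))  # S$199
print(text_or_none(card, '.product-card__messaging'))          # None
```

The selector `.product-price.is--current-price` requires both classes but ignores the generated one in between.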

In [292]:
type(products_2)

list

In [306]:
for t in products_2:
    if t is None:
        print(t)

None
None
None


In [313]:
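# drop the cards where no info block was found; building a new list
# avoids skipping items, which can happen when removing during iteration
products_2 = [t for t in products_2 if t is not None]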

In [314]:
for t in products_2:
    if t is None:
        print(t)
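
Removing elements from a list while looping over that same list can skip elements, so a cleanup pass may leave some `None` entries behind. The sketch below demonstrates the effect on made-up data and shows a filter that does not have the problem.

```python
# Made-up data: None stands for a card where find() matched nothing.
items = [None, None, 'a', None, 'b']

# Removing inside the loop shifts later elements left, so some are never visited.
mutated = list(items)
for item in mutated:
    if item is None:
        mutated.remove(item)
print(mutated)   # ['a', None, 'b']

# Building a new list visits every element exactly once.
filtered = [item for item in items if item is not None]
print(filtered)  # ['a', 'b']
```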

In [315]:
for t in products_2:
    print(type(t))

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


In [317]:
products_3 = [[t.find(class_='product-card__messaging accent--color').text,
               t.find(class_='product-card__title').text,
               t.find(class_='product-card__subtitle').text,
               t.find(class_='product-card__product-count').text,
               t.find(class_='product-price css-11s12ax is--current-price').text
              ]for t in products_2]
pd.DataFrame(products_3, columns=['message', 'title', 'subtitle', 'color', 'price'])

Unnamed: 0,message,title,subtitle,color,price
0,Just In,Nike Air Zoom Pegasus 38,Men's Running Shoe,1 Colour,S$199
1,Just In,Nike Precision 5,Basketball Shoe,1 Colour,S$99
2,Just In,Nike Legend Essential 2,Men's Training Shoe,1 Colour,S$89
3,Just In,Jordan One Take II PF,Basketball Shoe,1 Colour,S$159
4,Just In,Nike Air Force 1 '07 Craft,Men's Shoe,1 Colour,S$199
5,Just In,Nike SB BLZR Court,Skate Shoe,1 Colour,S$89
6,Just In,Nike Kepa Kai,Men's Flip Flop,1 Colour,S$49
7,Just In,Nike Free Metcon 4,Training Shoe,1 Colour,S$199
8,Just In,Nike Ebernon Mid,Men's Shoe,1 Colour,S$115
9,Just In,Nike Air Zoom Terra Kiger 7,Men's Trail Running Shoe,1 Colour,S$219
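
The `color` and `price` columns are still strings, so they cannot be sorted or averaged numerically. The sketch below shows one possible conversion. The helper names are hypothetical, and the patterns assume values shaped like those in the table above.

```python
import re


def parse_price(text):
    """Extract the numeric part of a price string such as 'S$199'; None if absent."""
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


def parse_colour_count(text):
    """Extract the leading integer from a string such as '1 Colour'; None if absent."""
    match = re.match(r'\s*(\d+)', text)
    return int(match.group(1)) if match else None


print(parse_price('S$199'))            # 199.0
print(parse_colour_count('1 Colour'))  # 1
```

If the table is stored in a variable, say `df`, the helpers can be applied per column, for example `df['price'].map(parse_price)`.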




---

© 2021 Institute of Data
