
# AIDM7330 Basic Programming for Data Science

## Scraping a single webpage

## Webscraping intro

To scrape content from a website, we first need to download its HTML. This can be done with the Python library **requests** (using its `.get` method).

To extract specific information from the page, we then use the scraping tool **BeautifulSoup4** (`import bs4`). BeautifulSoup works on a soup object, which we create from the HTML source code of the page.
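
The workflow described above can be sketched on a small hard-coded HTML string. The string is an assumption made for illustration (it stands in for a downloaded page), so the example runs without a network connection:

```python
import bs4 as bs

# A tiny hand-written page standing in for `requests.get(url).content`
# (hypothetical content, not taken from any real website).
html = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Create the soup object from the HTML source, using Python's built-in parser.
demo_soup = bs.BeautifulSoup(html, features="html.parser")

print(demo_soup.title.text)      # text inside the <title> tag
print(demo_soup.find("p").text)  # text inside the first <p> tag
```

The rest of this notebook follows the same two steps, only with a real page downloaded by requests.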

In [None]:
# Install required packages using pip package manager in the current Jupyter kernel
#import sys
#!{sys.executable} -m pip install bs4

In [1]:
import requests # requests is an HTTP library
# for sending GET, POST and other requests.

import bs4 as bs # BeautifulSoup4 is a Python library that simplifies scraping
# by pulling data out of HTML and XML code.

In [2]:
# stretch Jupyter code cells to fit the screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

## An example: The HKBU JOUR faculty page

### to fetch and to download the page

In [3]:
source = requests.get("http://www.jour.hkbu.edu.hk/faculty/")
# A GET request downloads the HTML of the webpage.
# Other request types include POST, PUT, ...
print(source)
# <Response [200]> means the page was downloaded successfully; 200 is the HTTP status code for "OK"

<Response [200]>


In [4]:
# Requests can fail: a URL that does not exist returns an error status
source2 = requests.get("http://www.jour.hkbu.edu.hk/faculty/asdfajgkhgkhfsdf")
print(source2)
# 404 is the HTTP status code for "Not Found"

<Response [404]>
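
To make the status codes above easier to read, the standard library's `http.HTTPStatus` can translate a numeric code into its official phrase. This is a minimal sketch; the helper name `describe_status` is an assumption introduced here for illustration:

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the library does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

With a real response, the numeric code is available as `source.status_code`, so `describe_status(source.status_code)` would describe the download above.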


In [5]:
print(source.content)
# This is the raw HTML content of the page (as bytes); it is hard to read in this form

b'    <!DOCTYPE html>\r\n<html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">\r\n<head>\r\n<meta charset="UTF-8">\r\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n<title>Faculty | HKBU | Department of Journalism</title>\r\n<meta name="description" content="Hong Kong Baptist University - Department of Journalism was the first of its kind in Hong Kong when it was founded in 1968." />\r\n<meta property="og:title" content="Faculty | HKBU | Department of Journalism" />\r\n<meta property="og:description" content="HKBU Journalism was the first of its kind in Hong Kong when it was founded in 1968." />\r\n<meta property="og:type" content="website">\r\n<meta property="og:url" content="http://jour.hkbu.edu.hk" />\r\n<meta property="og:image" content="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/fb-image.jpg" />\r\n<link rel="profile" href="http://gmpg.org/xfn/11">\r\n<link rel="icon" type

In [6]:
# Read source.content into BeautifulSoup
# BeautifulSoup can parse HTML code, which lets us extract specific information

soup = bs.BeautifulSoup(source.content, features='html.parser')
# we pass in the source content and choose a parser

# features specifies which parser to use;
# 'html.parser' is Python's built-in HTML parser, and other parsers (e.g. 'lxml') are available

In [7]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [8]:
print(soup)

 <!DOCTYPE html>

<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Faculty | HKBU | Department of Journalism</title>
<meta content="Hong Kong Baptist University - Department of Journalism was the first of its kind in Hong Kong when it was founded in 1968." name="description">
<meta content="Faculty | HKBU | Department of Journalism" property="og:title">
<meta content="HKBU Journalism was the first of its kind in Hong Kong when it was founded in 1968." property="og:description"/>
<meta content="website" property="og:type"/>
<meta content="http://jour.hkbu.edu.hk" property="og:url">
<meta content="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/fb-image.jpg" property="og:image"/>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<link href="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/

In [9]:
print(soup.prettify())  # make it more readable

<!DOCTYPE html>
<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Faculty | HKBU | Department of Journalism
  </title>
  <meta content="Hong Kong Baptist University - Department of Journalism was the first of its kind in Hong Kong when it was founded in 1968." name="description">
   <meta content="Faculty | HKBU | Department of Journalism" property="og:title">
    <meta content="HKBU Journalism was the first of its kind in Hong Kong when it was founded in 1968." property="og:description"/>
    <meta content="website" property="og:type"/>
    <meta content="http://jour.hkbu.edu.hk" property="og:url">
     <meta content="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/fb-image.jpg" property="og:image"/>
     <link href="http://gmpg.org/xfn/11" rel="profile"/>
     <link href="http://jour.hk

Above we printed the HTML code of the website, parsed into a BeautifulSoup object.

`<xxx> </xxx>` are the tags; for more info see:
https://www.w3schools.com/tags/ref_byfunc.asp

**class and id:** attributes used as hooks to style and identify elements in HTML

Full list of HTML tags: https://developer.mozilla.org/en-US/docs/Web/HTML/Element
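
As a rough illustration of how `class` and `id` act as hooks, the sketch below looks up elements by both. The HTML snippet and its names (`staff-list`, `name`, `note`) are made up for this example:

```python
import bs4 as bs

# Hypothetical snippet: one container with an id, three paragraphs with classes.
html = """
<div id="staff-list">
  <p class="name">Alice</p>
  <p class="name">Bob</p>
  <p class="note">Updated weekly</p>
</div>
"""
demo_soup = bs.BeautifulSoup(html, features="html.parser")

# An id should be unique on a page, so find() is enough.
container = demo_soup.find(id="staff-list")

# A class can be shared by many elements, so find_all() collects them.
# The keyword is class_ because class is a reserved word in Python.
names = [p.text for p in container.find_all("p", class_="name")]
print(names)
```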

### Extracting contents (understanding the HTML)
**Working with tags, attributes, `find`, and `find_all`**

1. extracting the title
2. extracting the body  
3. extracting the paragraphs
4. extracting the URLs

In [10]:
print(soup.title) # includes the tag; use soup.title.text to extract only the text
# The title of the website, useful for checking that this is the right webpage to work on

<title>Faculty | HKBU | Department of Journalism</title>


In [11]:
print(soup.body)

<body class="page-template page-template-page-templates page-template-page-faculty page-template-page-templatespage-faculty-php page page-id-26 parent-faculty group-blog" id="faculty">
<!-- SCROLL TOP BUTTON -->
<a class="scrollToTop" href="#">
<div class="btn-top">TOP</div>
</a>
<!-- END SCROLL TOP BUTTON -->
<header id="header">
<div class="top-bar">
<div class="container">
<nav class="navbar navbar-inverse">
<div class="container-fluid">
<div class="navbar-header">
<button class="navbar-toggle" data-target=".navbar-collapse" data-toggle="collapse" type="button">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand bu_logo" href="https://www.hkbu.edu.hk/eng/main/index.jsp" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/hkbu.png"/>
</a>
<a class="navbar-brand bu_logo_mobile" href="https://www.hkbu.edu.hk/eng/main/index.jsp" target="_blank">
<img src="http://jour.hkbu.e

In [12]:
# If we want to extract specific text
print(soup.find('h2')) # will only return first <h2> tag

<h2 class="slider-title">Faculty</h2>


In [13]:
print(soup.find('h2').text) # returns only the text inside the first <h2> tag

Faculty


In [14]:
# If we want to extract specific text
print(soup.find('p')) # will only return first <p> tag

<p <p="" class="slide-description">Our faculty members have rich experience in the news industry, media education and journalism research. We also invite a number of frontline reporters and senior media professionals as adjunct lecturers and professors.</p>


In [15]:
#print(soup.find('p').text)   # extracts the string within the <p> tag
print( soup.find('p').get_text()  )

Our faculty members have rich experience in the news industry, media education and journalism research. We also invite a number of frontline reporters and senior media professionals as adjunct lecturers and professors.


In [16]:
# If we want to extract all <p> tags
print(soup.find_all('p')) # returns list of all <p> tags
# Output Format [p,p,p,p]

[<p <p="" class="slide-description">Our faculty members have rich experience in the news industry, media education and journalism research. We also invite a number of frontline reporters and senior media professionals as adjunct lecturers and professors.</p>, <p>Professor of Practice<br/>
Department Head</p>, <p>A veteran journalist working at the BBC for over 24 years. Mr. Li was also the first native Chinese to be appointed the Head of Chinese at the BBC.</p>, <p>Associate Professor<br/>
Associate Head</p>, <p>Dr. Luqiu had been a television journalist for 20 years. She has covered many major international events ranging from the wars and also reported on Chinese political news and interviewed several Chinese leaders.</p>, <p>Raymond R. Wong Endowed Professor in Media Ethics<br/>
Professor of Practice</p>, <p>A veteran journalist with more than 30 years of experience, Professor CHAN King Cheung received his Bachelor degree in Arts, Master degree in Communication and MBA degree from t

In [17]:
# use a loop to print the text of every <p> tag (paragraph)

for p in soup.find_all('p'):
    print(p.text)

Our faculty members have rich experience in the news industry, media education and journalism research. We also invite a number of frontline reporters and senior media professionals as adjunct lecturers and professors.
Professor of Practice
Department Head
A veteran journalist working at the BBC for over 24 years. Mr. Li was also the first native Chinese to be appointed the Head of Chinese at the BBC.
Associate Professor
Associate Head
Dr. Luqiu had been a television journalist for 20 years. She has covered many major international events ranging from the wars and also reported on Chinese political news and interviewed several Chinese leaders.
Raymond R. Wong Endowed Professor in Media Ethics
Professor of Practice
A veteran journalist with more than 30 years of experience, Professor CHAN King Cheung received his Bachelor degree in Arts, Master degree in Communication and MBA degree from the Chinese University of Hong Kong. He was also a Fellow in Media Management of the Poynter Institu
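
The difference between `find` and `find_all` matters most when a tag is missing. A small sketch on a made-up snippet (not the faculty page):

```python
import bs4 as bs

# Hypothetical snippet with two <h2> tags and no <h3> tag.
html = "<body><h2>First</h2><h2>Second</h2></body>"
demo_soup = bs.BeautifulSoup(html, features="html.parser")

first_h2 = demo_soup.find("h2")         # a single Tag: the first match only
all_h2 = demo_soup.find_all("h2")       # a list-like ResultSet of every match
missing = demo_soup.find("h3")          # None, because nothing matches
missing_all = demo_soup.find_all("h3")  # an empty ResultSet, not None

print(first_h2.text)
print([h.text for h in all_h2])
print(missing, len(missing_all))
```

Because `find` returns `None` when nothing matches, calling `.text` on its result directly can raise an `AttributeError`; looping over `find_all` is safe even when the result is empty.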

In [18]:
# Extract the first navigation bar (<nav> tag)
navigation_bar = soup.find('nav')
print(navigation_bar)

<nav class="navbar navbar-inverse">
<div class="container-fluid">
<div class="navbar-header">
<button class="navbar-toggle" data-target=".navbar-collapse" data-toggle="collapse" type="button">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand bu_logo" href="https://www.hkbu.edu.hk/eng/main/index.jsp" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/hkbu.png"/>
</a>
<a class="navbar-brand bu_logo_mobile" href="https://www.hkbu.edu.hk/eng/main/index.jsp" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/logo_hkbu_mobile.png"/>
</a>
<a class="navbar-brand comf_logo" href="https://www.comm.hkbu.edu.hk/comd-www/english/front/index.htm" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/comm.png"/>
</a>
<a class="navbar-brand comf_logo_mobile" href="https://www.comm.hkbu.edu.hk/comd-www/englis

In [19]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">
# where url is the link target

print(soup.find('a')) # find the first link

#Internal to the webpage link: <a href="#top">
#Internal to the website link: <a href="/alumni.html">
#External link: <a href="http://www.google.com">

<a class="scrollToTop" href="#">
<div class="btn-top">TOP</div>
</a>


In [20]:
soup.find('a').get('href')
# to get the link from href attribute

'#'
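
Internal links such as `#` or `/alumni.html` are relative, so they need the page address before they can be requested. One way to sketch this is with `urljoin` from the standard library; the list of hrefs below reuses the three link styles from the comments above and is only illustrative:

```python
from urllib.parse import urljoin

base_url = "http://www.jour.hkbu.edu.hk/faculty/"  # the page the links were found on

# One example of each link style: same page, same website, external.
hrefs = ["#top", "/alumni.html", "http://www.google.com"]

# urljoin resolves relative links against the base and leaves absolute ones alone.
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```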

In [21]:
print(soup.find_all('a')) # find all the links - although we may not want to do this

[<a class="scrollToTop" href="#">
<div class="btn-top">TOP</div>
</a>, <a class="navbar-brand bu_logo" href="https://www.hkbu.edu.hk/eng/main/index.jsp" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/hkbu.png"/>
</a>, <a class="navbar-brand bu_logo_mobile" href="https://www.hkbu.edu.hk/eng/main/index.jsp" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/logo_hkbu_mobile.png"/>
</a>, <a class="navbar-brand comf_logo" href="https://www.comm.hkbu.edu.hk/comd-www/english/front/index.htm" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/comm.png"/>
</a>, <a class="navbar-brand comf_logo_mobile" href="https://www.comm.hkbu.edu.hk/comd-www/english/front/index.htm" target="_blank">
<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/COMF_2021-white_small_short.png"/>
</a>, <a class="navbar-brand jour_logo" href="https://jour.hkbu.edu.hk">
<im

In [22]:
links = soup.find_all('a')

In [23]:
# check the information about the URLs
links = soup.find_all('a')

for link in links:
    print("\nInfo about {}: ".format(link.text),
          link.get('href'))



Info about 
TOP
:  #

Info about 

:  https://www.hkbu.edu.hk/eng/main/index.jsp

Info about 

:  https://www.hkbu.edu.hk/eng/main/index.jsp

Info about 

:  https://www.comm.hkbu.edu.hk/comd-www/english/front/index.htm

Info about 

:  https://www.comm.hkbu.edu.hk/comd-www/english/front/index.htm

Info about 

:  https://jour.hkbu.edu.hk

Info about 

:  https://jour.hkbu.edu.hk

Info about Home:  https://jour.hkbu.edu.hk

Info about About:  https://jour.hkbu.edu.hk/about-us

Info about 
																Study Path
															
:  #drop-programmes

Info about 
																			Programme Structure
																		:  https://jour.hkbu.edu.hk/programmes/programme-structure

Info about 								Journalism
								:  https://jour.hkbu.edu.hk/programmes/journalism

Info about Data and Media Communication:  http://bu-dmc.hkbu.edu.hk/

Info about Postgraduate Studies:  http://comd.hkbu.edu.hk/masters/en/maijs.html

Info about Research:  https://jour.hkbu.edu.hk/research

Info about

### To locate the exact contents, we need an "Inspector"

- **Making your browser a development tool**
- **"Inspect element"**


In [24]:
print(soup.find_all('div', attrs={'class': 'col-lg-3 col-xs-6 col-sm-4 col-md-3'}))

[<div class="col-lg-3 col-xs-6 col-sm-4 col-md-3">
<a href="https://jour.hkbu.edu.hk/faculty-member/raymond-li/">
<div class="faculty-staff-wrap wow fadeIn animated" data-wow-delay="300ms" data-wow-duration="500ms">
<div class="featured-thumbnail"><img alt="Mr. Raymond Li" class="img-responsive wp-post-image" height="400" src="https://jour.hkbu.edu.hk/home/wp-content/uploads/2018/03/ppl_ray-li.jpg" width="400"/></div> <div class="overlay">
<div class="staff-inner">
<h4>
                                        Prof. Raymond Li                                    </h4>
<h5 class="hidden-xs">
<p>Professor of Practice<br/>
Department Head</p>
</h5>
<div class="visible-lg">
<p>A veteran journalist working at the BBC for over 24 years. Mr. Li was also the first native Chinese to be appointed the Head of Chinese at the BBC.</p>
</div>
</div>
</div>
</div>
</a>
</div>, <div class="col-lg-3 col-xs-6 col-sm-4 col-md-3">
<a href="https://jour.hkbu.edu.hk/faculty-member/luwei-rose-luqiu/">
<div cla

In [25]:
#print(soup.find_all('div', attrs={'class': 'col-lg-3 col-xs-6 col-sm-4 col-md-3'}))
print(soup.find_all('div', class_ = 'col-lg-3 col-xs-6 col-sm-4 col-md-3'))

[<div class="col-lg-3 col-xs-6 col-sm-4 col-md-3">
<a href="https://jour.hkbu.edu.hk/faculty-member/raymond-li/">
<div class="faculty-staff-wrap wow fadeIn animated" data-wow-delay="300ms" data-wow-duration="500ms">
<div class="featured-thumbnail"><img alt="Mr. Raymond Li" class="img-responsive wp-post-image" height="400" src="https://jour.hkbu.edu.hk/home/wp-content/uploads/2018/03/ppl_ray-li.jpg" width="400"/></div> <div class="overlay">
<div class="staff-inner">
<h4>
                                        Prof. Raymond Li                                    </h4>
<h5 class="hidden-xs">
<p>Professor of Practice<br/>
Department Head</p>
</h5>
<div class="visible-lg">
<p>A veteran journalist working at the BBC for over 24 years. Mr. Li was also the first native Chinese to be appointed the Head of Chinese at the BBC.</p>
</div>
</div>
</div>
</div>
</a>
</div>, <div class="col-lg-3 col-xs-6 col-sm-4 col-md-3">
<a href="https://jour.hkbu.edu.hk/faculty-member/luwei-rose-luqiu/">
<div cla

In [26]:
div_faculty = soup.find_all('div', class_ = 'col-lg-3 col-xs-6 col-sm-4 col-md-3')

In [27]:
print(type(div_faculty))

<class 'bs4.element.ResultSet'>


In [28]:
for link in div_faculty:
    print(link.find('a').attrs['href'])

# this fails if a <div> has no <a> tag or the <a> has no href: how can we avoid such errors?

https://jour.hkbu.edu.hk/faculty-member/raymond-li/
https://jour.hkbu.edu.hk/faculty-member/luwei-rose-luqiu/
https://jour.hkbu.edu.hk/faculty-member/prof-chan-king-cheung/
https://jour.hkbu.edu.hk/faculty-member/steve-guo/
https://jour.hkbu.edu.hk/faculty-member/cherian-george/
https://jour.hkbu.edu.hk/faculty-member/daya-thussu/
https://jour.hkbu.edu.hk/faculty-member/prof-kennth-paul-tan/
https://jour.hkbu.edu.hk/faculty-member/wang-xiangwei/
https://jour.hkbu.edu.hk/faculty-member/robin-ewing/
https://jour.hkbu.edu.hk/faculty-member/dr-brian-so/
https://jour.hkbu.edu.hk/faculty-member/mr-pun-wai-lam/
https://jour.hkbu.edu.hk/faculty-member/dr-bess-wang-yue/
https://jour.hkbu.edu.hk/faculty-member/xiaoyi-fu/
https://jour.hkbu.edu.hk/faculty-member/janet-lo/
https://jour.hkbu.edu.hk/faculty-member/dr-wang-dan/
https://jour.hkbu.edu.hk/faculty-member/dr-yu-wenting/
https://jour.hkbu.edu.hk/faculty-member/nick-zhang/
https://jour.hkbu.edu.hk/faculty-member/dr-sheng-zou/
https://jour.hk

In [29]:
for link in div_faculty:
    try:
        a = link.find('a').get('href')
        print(a)
    except AttributeError:
        # link.find('a') returned None: this <div> has no <a> tag, so skip it
        continue

https://jour.hkbu.edu.hk/faculty-member/raymond-li/
https://jour.hkbu.edu.hk/faculty-member/luwei-rose-luqiu/
https://jour.hkbu.edu.hk/faculty-member/prof-chan-king-cheung/
https://jour.hkbu.edu.hk/faculty-member/steve-guo/
https://jour.hkbu.edu.hk/faculty-member/cherian-george/
https://jour.hkbu.edu.hk/faculty-member/daya-thussu/
https://jour.hkbu.edu.hk/faculty-member/prof-kennth-paul-tan/
https://jour.hkbu.edu.hk/faculty-member/wang-xiangwei/
https://jour.hkbu.edu.hk/faculty-member/robin-ewing/
https://jour.hkbu.edu.hk/faculty-member/dr-brian-so/
https://jour.hkbu.edu.hk/faculty-member/mr-pun-wai-lam/
https://jour.hkbu.edu.hk/faculty-member/dr-bess-wang-yue/
https://jour.hkbu.edu.hk/faculty-member/xiaoyi-fu/
https://jour.hkbu.edu.hk/faculty-member/janet-lo/
https://jour.hkbu.edu.hk/faculty-member/dr-wang-dan/
https://jour.hkbu.edu.hk/faculty-member/dr-yu-wenting/
https://jour.hkbu.edu.hk/faculty-member/nick-zhang/
https://jour.hkbu.edu.hk/faculty-member/dr-sheng-zou/
https://jour.hk
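
Another way to avoid such errors is to check for missing elements explicitly instead of catching exceptions. The sketch below uses a made-up snippet (the `card` class and the `/member/...` paths are hypothetical) in which some blocks lack a link or an `href`:

```python
import bs4 as bs

# Hypothetical snippet: four cards, two of them incomplete.
html = """
<div class="card"><a href="/member/a/">A</a></div>
<div class="card"><span>No link here</span></div>
<div class="card"><a>Anchor without href</a></div>
<div class="card"><a href="/member/d/">D</a></div>
"""
demo_soup = bs.BeautifulSoup(html, features="html.parser")

found = []
for card in demo_soup.find_all("div", class_="card"):
    anchor = card.find("a")
    if anchor is None:        # no <a> tag inside this card
        continue
    href = anchor.get("href")  # .get returns None instead of raising KeyError
    if href is None:          # <a> tag without an href attribute
        continue
    found.append(href)

print(found)
```

Explicit checks make it clear which cases are being skipped, whereas a broad `except` can also hide unrelated bugs.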

In [30]:
# another way to find all the URLs of faculty members from the root page ("soup")

for link in soup.find_all('a'):
    a = link.get('href')
    if a and '/faculty-member/' in a:  # skip <a> tags without an href
        print(a)

https://jour.hkbu.edu.hk/faculty-member/raymond-li/
https://jour.hkbu.edu.hk/faculty-member/luwei-rose-luqiu/
https://jour.hkbu.edu.hk/faculty-member/prof-chan-king-cheung/
https://jour.hkbu.edu.hk/faculty-member/steve-guo/
https://jour.hkbu.edu.hk/faculty-member/cherian-george/
https://jour.hkbu.edu.hk/faculty-member/daya-thussu/
https://jour.hkbu.edu.hk/faculty-member/prof-kennth-paul-tan/
https://jour.hkbu.edu.hk/faculty-member/wang-xiangwei/
https://jour.hkbu.edu.hk/faculty-member/robin-ewing/
https://jour.hkbu.edu.hk/faculty-member/dr-brian-so/
https://jour.hkbu.edu.hk/faculty-member/mr-pun-wai-lam/
https://jour.hkbu.edu.hk/faculty-member/dr-bess-wang-yue/
https://jour.hkbu.edu.hk/faculty-member/xiaoyi-fu/
https://jour.hkbu.edu.hk/faculty-member/janet-lo/
https://jour.hkbu.edu.hk/faculty-member/dr-wang-dan/
https://jour.hkbu.edu.hk/faculty-member/dr-yu-wenting/
https://jour.hkbu.edu.hk/faculty-member/nick-zhang/
https://jour.hkbu.edu.hk/faculty-member/dr-sheng-zou/
https://jour.hk
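
The same filtering can also be expressed as a CSS selector with BeautifulSoup's `select` method. This is a sketch on a made-up snippet; the `example.org` addresses are placeholders, not real faculty pages:

```python
import bs4 as bs

# Hypothetical snippet: two member links, one other link, one <a> without href.
html = """
<a href="https://example.org/faculty-member/alice/">Alice</a>
<a href="https://example.org/about/">About</a>
<a>no href</a>
<a href="https://example.org/faculty-member/bob/">Bob</a>
"""
demo_soup = bs.BeautifulSoup(html, features="html.parser")

# a[href*="..."] matches <a> tags whose href attribute contains the given text,
# so tags without an href are skipped automatically.
member_links = [a["href"] for a in demo_soup.select('a[href*="/faculty-member/"]')]
print(member_links)
```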

## More extraction

In [31]:
print( soup.find_all(class_ = 'staff-inner'))

[<div class="staff-inner">
<h4>
                                        Prof. Raymond Li                                    </h4>
<h5 class="hidden-xs">
<p>Professor of Practice<br/>
Department Head</p>
</h5>
<div class="visible-lg">
<p>A veteran journalist working at the BBC for over 24 years. Mr. Li was also the first native Chinese to be appointed the Head of Chinese at the BBC.</p>
</div>
</div>, <div class="staff-inner">
<h4>
                                        Dr. Luwei Rose Luqiu                                    </h4>
<h5 class="hidden-xs">
<p>Associate Professor<br/>
Associate Head</p>
</h5>
<div class="visible-lg">
<p>Dr. Luqiu had been a television journalist for 20 years. She has covered many major international events ranging from the wars and also reported on Chinese political news and interviewed several Chinese leaders.</p>
</div>
</div>, <div class="staff-inner">
<h4>
                                        PROF. CHAN King Cheung                                   

In [32]:
div_staffinner = soup.find_all('div', class_ ='staff-inner')
# div_staffinner = soup.find_all(class_ = 'staff-inner')

for name_element in div_staffinner:
    try:
        name = name_element.find('h4').get_text().strip() # strip() removes the whitespace before and after the text
        print(name)
    except AttributeError:
        # no <h4> tag inside this element, so skip it
        continue

Prof. Raymond Li
Dr. Luwei Rose Luqiu
PROF. CHAN King Cheung
Prof. Steve Guo
Prof. Cherian George
Prof. Daya Thussu
Prof. Kenneth Paul TAN
Mr. Wang Xiangwei
Ms. Robin Ewing
DR. BRIAN SO
Mr. Pun Wai Lam
Dr. Bess WANG Yue
Dr. Xiaoyi Fu
Dr. Janet Lo
Dr. WANG DAN (Angela)
Dr. YU Wenting
Dr. Nick Yin ZHANG
DR. SHENG ZOU
Ms. CHIU Lai Yu Bonnie
Mr. Clemence Poon
Ms. Jenny Lam
MS. LAM WING YAN
DR.  ALISON LEUNG
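
`get_text` can also do the whitespace clean-up itself through its `strip` and `separator` arguments. The sketch below mimics the layout of one `staff-inner` block, with a made-up name in place of a real faculty member:

```python
import bs4 as bs

# Hypothetical block shaped like the staff-inner <div> elements printed above.
html = """
<div class="staff-inner">
<h4>
        Prof. Example Name        </h4>
<h5><p>Professor of Practice<br/>
Department Head</p></h5>
</div>
"""
demo_soup = bs.BeautifulSoup(html, features="html.parser")
card = demo_soup.find("div", class_="staff-inner")

# strip=True trims each piece of text and drops pieces that are only whitespace.
name = card.find("h4").get_text(strip=True)

# separator is placed between the pieces, here the two lines split by <br/>.
role = card.find("h5").get_text(separator=" | ", strip=True)

print(name)
print(role)
```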


## Scrape images

 - **please proceed with caution and always check the terms of use!!**

In [33]:
# As we can see there are several images
# Images are displayed with the <img> tag in HTML

# download the page again and create a new soup (this time with the lxml parser)

raw = requests.get('http://www.jour.hkbu.edu.hk/faculty/').content
soup = bs.BeautifulSoup(raw, features='lxml')

print(soup.find('img'))
# as we can see below the image urls
# are stored as the src inside the img tag

<img src="http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/hkbu.png"/>


In [34]:
# Collect the URLs of all JPEG images
img_urls = list() # create an empty list first, then append elements to it
for img in soup.find_all('img'):
    img_url = img.get('src')
    if img_url and ('.jpeg' in img_url or '.jpg' in img_url):  # skip <img> tags without a src
        print(img_url)
        img_urls.append(img_url)

http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/bg_people.jpg
http://jour.hkbu.edu.hk/home/wp-content/themes/journalism/images/bg_people2.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2018/03/ppl_ray-li.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2018/07/ppl_luwei-rose-400x400.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2016/09/ppl_steve-guo.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2018/06/Cherian-George-2018-400x400.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2019/07/Professor-Daya-Thussu-400x400.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2021/02/Kenneth-Paul-Tan-400x400.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2016/08/ppl_robin-ewing.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2023/09/Pun-Wai-Lam-Profile-Picture--400x400.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2023/02/Bess-Wang-400x400.jpg
https://jour.hkbu.edu.hk/home/wp-content/uploads/2020/01/photo-400x400.jpg
https://jour.hk

In [35]:
#print(img_urls)
print(type(img_urls))
print(len(img_urls))

<class 'list'>
19
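
Checking for `'.jpg'` anywhere in the URL works for this page, but it can misfire on other sites. A slightly stricter sketch looks only at the file extension of the URL path; the helper name `is_jpeg` and the `example.org` URLs are assumptions for illustration:

```python
import os
from urllib.parse import urlparse

def is_jpeg(url):
    """Return True if the path part of the URL ends in .jpg or .jpeg."""
    path = urlparse(url).path                 # drops the query string and fragment
    extension = os.path.splitext(path)[1].lower()
    return extension in {".jpg", ".jpeg"}

# Hypothetical URLs covering a few edge cases.
urls = [
    "https://example.org/images/photo.JPG",          # upper-case extension
    "https://example.org/images/logo.png",           # not a JPEG
    "https://example.org/images/pic.jpeg?size=400",  # query string after the name
    "https://example.org/archive.jpg.html",          # contains '.jpg' but is an HTML page
]
jpeg_urls = [u for u in urls if is_jpeg(u)]
print(jpeg_urls)
```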


Before we can download the images, we first need to connect to your Google Drive account.

For security reasons, you will be requested to grant permission to Google Colab to access your Google Drive. Please allow the operation by selecting the correct account where you store your Colab Notebooks.

In [36]:
from google.colab import drive
drivePath = '/content/drive' #please do not change
drive.mount(drivePath)

Mounted at /content/drive


Now we can select the correct directory in your Google Drive.

In [37]:
# path to figures directory
figuresPath = drivePath + '/MyDrive/Colab Notebooks/figures' + "/"

We will also check if the directory is present, creating a new one if needed.

In [38]:
import os, pathlib
if not(os.path.exists(figuresPath)):
  # os.mkdir(figuresPath) # Creates only the last directory if missing. Raises an error if it exists
  path = pathlib.Path(figuresPath)
  path.mkdir(parents=True, exist_ok=True) # Can create the folders in the path if missing. No error if path exists
  print('Path has been created')
else:
  print('The figures path you selected already exists')

Path has been created
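
Since `mkdir(parents=True, exist_ok=True)` does nothing when the folder already exists, it can be called repeatedly without an existence check. A small sketch in a temporary directory (used here as a stand-in for Google Drive, so it runs anywhere):

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Nested path whose parent folder does not exist yet.
    figures = pathlib.Path(tmp) / "Colab Notebooks" / "figures"

    figures.mkdir(parents=True, exist_ok=True)  # creates both missing folders
    figures.mkdir(parents=True, exist_ok=True)  # second call: no error, nothing changes

    created = figures.is_dir()

print(created)
```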


In [39]:
# To download and save files with Python we use
# shutil, a library for file operations.
# It is always good to add a time delay via the "time.sleep" function

import shutil

import time
from random import randint

for idx, img_url in enumerate(img_urls):
    # enumerate gives each image an integer index to use in its file name, see the tutorial: http://book.pythontips.com/en/latest/enumerate.html

    # time.sleep(5) # a fixed rate (delay 5 sec)
    time.sleep(randint(2,6))  # a random integer delay between 2 and 6 sec (inclusive)

    img_source = requests.get(img_url, stream=True)
    # we set stream = True to download/stream the content of the data

    with open(figuresPath + 'facultyProfilePic' + str(idx) + '.jpg', 'wb') as file:
        # open file connection in write binary mode, create file and write to it
        shutil.copyfileobj(img_source.raw, file)
        # save the raw file object, see the tutorial: https://docs.python.org/3/library/shutil.html

    del img_source # drop the reference to the response so its memory can be freed

SSLError: ignored
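
A download can fail for many reasons (an SSL problem as above, a timeout, a missing file). One possible way to keep a loop running is to wrap each download in a small helper that reports the failure and moves on. This is a sketch, not the notebook's original method; the function name `download_image` and the 10 second timeout are assumptions:

```python
import requests

def download_image(url, destination, timeout=10):
    """Try to save the file at `url` to `destination`; return True on success."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        # SSLError, ConnectionError, Timeout and HTTPError all derive from RequestException.
        print(f"Skipping {url}: {error}")
        return False

    with open(destination, "wb") as file:  # write binary mode, as for any image
        file.write(response.content)
    return True

# A malformed URL fails before any network traffic, so this call is safe to run offline.
ok = download_image("not-a-valid-url", "unused.jpg")
print(ok)
```

Inside the loop above, a call such as `download_image(img_url, figuresPath + 'facultyProfilePic' + str(idx) + '.jpg')` would skip a failing image instead of stopping at the first error.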

# Acknowledgements

- The code in this notebook is adapted from Dr. Xinzhi Zhang's teaching material and various other sources. All code is for educational purposes only and released under CC1.0.