## Notebook Shortcuts

### Command Mode
* `Esc` or clicking the blue bar on the left enters **command mode**

### Run Cells
* `Shift+Enter` to run cells

### Add and Delete Cells
* `A` adds a cell above
* `B` adds a cell below
* `D`, `D` (press twice) deletes cells

### Type Conversion
* `M` converts to markdown cells
* `Y` converts to code cells

### Copy and Paste Cells
* `X` cuts cells
* `C` copies cells
* `V` pastes cells

### Undo
* `Z` undoes cell manipulation (e.g., `X`, `V`)
* `Ctrl+Z` (Win) or `Cmd+Z` (Mac) undoes modification within each cell

Find out more at https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330.

## Scraping Static Webpages

### Two ways to install a package
- `pip install bs4` in Terminal
- `!pip install bs4` in Notebook

In [None]:
# You may also try "!pip3 ..." if this line does not work.
!pip install bs4
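Before installing, it can help to check whether a package is already available in the current environment. The sketch below uses only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python; the second name is deliberately nonsense.
has_json = is_installed("json")
has_fake = is_installed("surely_not_a_real_package_xyz")
print(has_json, has_fake)
```

Calling `is_installed("bs4")` in the notebook shows whether the `pip install` step is still needed.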

### Import Packages
- `import a`
- `import a.x`
- `import a as b`
- `from A import a`
- `from A import a as b`
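As a rough illustration of the five forms above, the sketch below applies each one to a standard-library module. The modules chosen here are arbitrary examples, not the ones used later in this notebook.

```python
import math                                    # import a
import os.path                                 # import a.x
import statistics as stats                     # import a as b
from collections import Counter                # from A import a
from urllib.parse import urlencode as encode   # from A import a as b

root = math.sqrt(16)                     # access through the module name
ext = os.path.splitext("page.html")[1]   # access through the submodule
avg = stats.mean([1, 2, 3])              # access through the alias
counts = Counter("aab")                  # the imported name is used directly
query = encode({"start": 51})            # the imported name is used via its alias
print(root, ext, avg, counts, query)
```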

In [1]:
# Scraping Method I
from requests import get

# Scraping Method II
import urllib.parse
import urllib.request

# Searching the Tree
import re # Regular Expression
from bs4 import BeautifulSoup as bs

# Dataframe
import pandas as pd

The example below partly comes from [Dr. Yongren SHI's](https://clas.uiowa.edu/sociology/people/yongren-shi) tutorial.

### Web Scraping

In [2]:
# Method I
# Specify url and userheader
url = "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc"
userHeader = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12"}

response = get(url, headers=userHeader).text

In [None]:
# Method II
# Specify url and userheader
url="https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc"
userHeader = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12"}

req = urllib.request.Request(url, headers=userHeader)

# open url and read web page
response = urllib.request.urlopen(req).read()
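A `Request` object can be inspected before anything is sent, which is a quick way to confirm that the header is attached. The sketch below makes no network call; the shortened User-Agent string is an illustrative placeholder, not a recommended value.

```python
import urllib.request

url = "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc"
user_header = {"User-Agent": "Mozilla/5.0"}  # shortened, illustrative value

req = urllib.request.Request(url, headers=user_header)

method = req.get_method()             # "GET" because no data is attached
agent = req.get_header("User-agent")  # Request stores header names capitalized
host = req.host                       # host part parsed from the URL
print(method, agent, host)
```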

In [3]:
# Parse the HTML with BeautifulSoup
soup=bs(response,"html.parser")
print(soup)


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);


    var ue_err_chan = 'jserr';
(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&c!==n&&c!==p||a.ec++;c&&c!=k||a.ecf++;b.pageURL

In [4]:
type(soup)

bs4.BeautifulSoup

## Extract Elements from HTML
### HTML Document
Most HTML tags come in pairs: an opening tag and a closing tag. For example:
* `<html>` and `</html>` define an HTML document
* `<head>`...`</head>` and `<body>`...`</body>` define the document's structure
* `<title>` and `</title>` define a title
* Pairs of `<h1>`, `<h2>`, `<h3>`, etc. define headings
* `<p>` and `</p>` define a paragraph

HTML <b><i>is</i></b> a tree!
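To make the tree idea concrete, the sketch below walks a small snippet with the standard library's `html.parser` and records each tag indented by its depth. The class name `TreePrinter` and the snippet are made up for illustration, and the depth counter assumes every tag is properly closed (no void tags such as `<br>`).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect tag names, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag returns to the parent's level


snippet = "<html><head><title>T</title></head><body><h1>H</h1><p>P</p></body></html>"
printer = TreePrinter()
printer.feed(snippet)
print("\n".join(printer.lines))
```

The printed outline shows `head` and `body` as children of `html`, with `title`, `h1`, and `p` one level below them.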

In [5]:
html_doc = """
<html>
<head>

<title>My First Title</title>

</head>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>
"""

In [6]:
html_doc

'\n<html>\n<head>\n\n<title>My First Title</title>\n\n</head>\n<body>\n\n<h1>My First Heading</h1>\n\n<p>My first paragraph.</p>\n\n</body>\n</html>\n'

In [8]:
sample_soup = bs(html_doc, 'html.parser')
sample_soup


<html>
<head>
<title>My First Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

In [9]:
sample_soup.body

<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>

In [10]:
sample_soup.p

<p>My first paragraph.</p>

In [11]:
sample_soup.p.text

'My first paragraph.'

In [12]:
# Regular expressions can also extract elements directly from the raw HTML string.
re.findall("<p>.*</p>", html_doc)

['<p>My first paragraph.</p>']

In [13]:
re.findall("(?<=<p>).*(?=</p>)", html_doc)

['My first paragraph.']
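The lookaround pattern works here because each line holds at most one `<p>` element. As a small sketch of where it breaks down, the made-up line below holds two paragraphs: the greedy `.*` swallows both, while the lazy `.*?` stops at the first closing tag.

```python
import re

html_line = "<p>First.</p><p>Second.</p>"  # hypothetical single-line input

# Greedy: .* extends to the last </p> on the line.
greedy = re.findall(r"(?<=<p>).*(?=</p>)", html_line)

# Lazy: .*? stops at the nearest </p>.
lazy = re.findall(r"(?<=<p>).*?(?=</p>)", html_line)

print(greedy)
print(lazy)
```

This fragility is one reason to prefer a parser such as BeautifulSoup for anything beyond simple cases.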

In [31]:
html_doc = """
<html>
<head>
<style>
.cities {
  background-color: black;
  color: white;
  padding: 20px;
}
</style>
</head>
<body>

<h2>The class Attribute</h2>
<p>Use CSS to style an element with the class name "cities":</p>

<div>
  <h2>New York</h2>
  <p>Some introduction.</p>
</div>

<div class="cities">
  <h2>London</h2>
  <p>London is the capital of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
  <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</div> 

</body>
</html>
"""

sample_soup = bs(html_doc, 'html.parser')
sample_soup


<html>
<head>
<style>
.cities {
  background-color: black;
  color: white;
  padding: 20px;
}
</style>
</head>
<body>
<h2>The class Attribute</h2>
<p>Use CSS to style an element with the class name "cities":</p>
<div>
<h2>New York</h2>
<p>Some introduction.</p>
</div>
<div class="cities">
<h2>London</h2>
<p>London is the capital of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</div>
</body>
</html>

In [32]:
sample_soup.div # only the first "div" is extracted

<div>
<h2>New York</h2>
<p>Some introduction.</p>
</div>

In [33]:
sample_soup.find_all("div") # extract all "div" elements

[<div>
 <h2>New York</h2>
 <p>Some introduction.</p>
 </div>, <div class="cities">
 <h2>London</h2>
 <p>London is the capital of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
 <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
 </div>]

In [34]:
# if the tag looks like: 
#   <tag attribute="name">
# use: 
#   soup.find_all("tag",{"attribute":"name"})
sample_soup.find_all("div",{"class":"cities"})

[<div class="cities">
 <h2>London</h2>
 <p>London is the capital of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
 <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
 </div>]

In [35]:
# Find a tag by its location (very uncommon)
sample_soup.find_all("div")[1]

<div class="cities">
<h2>London</h2>
<p>London is the capital of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</div>

### Intro to Regular Expression

In [14]:
text = '''蒋道理 djiang@connect.ust.hk +86 18293277420
蒋文明 wjiang@connect.ust.hk +86 18254392637
蒋武德 wdjiang@gmail.com +852 53928934
蒋诗斌 jxbshu@gmail.com +86 13923492836
蒋不蒋 jbj@qq.com +86 13754349472
蒋经国 jj@qq.com +886 952042486'''

In [15]:
re.findall("蒋", text)

['蒋', '蒋', '蒋', '蒋', '蒋', '蒋', '蒋']

In [16]:
re.findall("^蒋", text)

['蒋']

In [17]:
re.findall(r"^蒋\w{2}", text)

['蒋道理']

In [18]:
# text[3] is the single space after the first name, so "蒋$" finds nothing
re.findall("蒋$", text[3])

[]

In [20]:
re.findall(r"蒋[文武道]\w", text)

['蒋道理', '蒋文明', '蒋武德']

In [25]:
re.findall("1[38][1-9]", text)

['182', '182', '139', '137']

In [26]:
re.findall(r"1[38][1-9]\d\d\d\d\d\d\d\d", text)

['18293277420', '18254392637', '13923492836', '13754349472']

In [27]:
re.findall(r"1[38][1-9]\d{8}", text)

['18293277420', '18254392637', '13923492836', '13754349472']

In [28]:
re.findall(r"1[38][1-9]\d*", text)

['18293277420', '18254392637', '13923492836', '13754349472']

In [29]:
re.findall(r"\w+(?=@connect\.ust\.hk)", text)

['djiang', 'wjiang']

In [30]:
re.findall(r"(?<=蒋)\w+", text)

['道理', '文明', '武德', '诗斌', '不蒋', '经国']
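By default `^` and `$` match only at the start and end of the whole string, which is why `^蒋` returned a single match above. A minimal sketch of the per-line behavior, using `re.MULTILINE` on a made-up contact list (all names, addresses, and numbers are placeholders):

```python
import re

contacts = """Ann ann@example.com +86 13800000001
Bob bob@example.org +852 51234567
Amy amy@example.com +86 18700000002"""

# Without the flag, ^ anchors to the start of the string only.
first_only = re.findall(r"^A\w+", contacts)

# With re.MULTILINE, ^ and $ also anchor at every line break.
each_line = re.findall(r"^A\w+", contacts, flags=re.MULTILINE)
line_ends = re.findall(r"\d+$", contacts, flags=re.MULTILINE)

print(first_only, each_line, line_ends)
```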

### Movie Containers

In [36]:
# Use "Inspect" and "Select an Element" in your web browser (Chrome, Opera, Safari, etc.)
# Where is the element on the tree?
movie_containers = soup.find_all('div',{'class':'lister-item mode-advanced'})
print(movie_containers)

[<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0111161"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt0111161/?ref_=adv_li_i"> <img alt="The Shawshank Redemption" class="loadlate" data-tconst="tt0111161" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
<p class="text-muted">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">142 min</span>
<span class="ghost">|</s

In [37]:
type(movie_containers)

bs4.element.ResultSet

In [38]:
# Extracting an "x_container" is not necessary in most cases.
# But it helps to exclude confounding tags and attributes.
len(movie_containers)

50

### First Movie

In [39]:
# Extract the First Movie
first_movie = movie_containers[0]
print(first_movie)

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0111161"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt0111161/?ref_=adv_li_i"> <img alt="The Shawshank Redemption" class="loadlate" data-tconst="tt0111161" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
<p class="text-muted">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">142 min</span>
<span class="ghost">|</sp

In [42]:
# first movie's rating
movie_rating = float(first_movie.strong.text)
movie_rating

9.3

In [45]:
# first movie's title
movie_name = first_movie.h3.a.text
movie_name

'The Shawshank Redemption'

In [None]:
# "soup.tag" can be misleading:
#    if there are many spans, only the first span will be returned
first_movie.span 

In [46]:
# the year the movie was released
movie_year = first_movie.find('span',{'class':'lister-item-year text-muted unbold'}).text
movie_year

'(1994)'

In [48]:
# remove the parentheses
re.findall(r"\d{4}", movie_year)[0]

'1994'

In [49]:
# convert to integer
int(re.findall(r"\d{4}", movie_year)[0])

1994
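Indexing `[0]` raises an `IndexError` when no four-digit number is present. A slightly more defensive sketch uses `re.search` and returns `None` instead; the helper name `parse_year` and the second sample label are assumptions for illustration.

```python
import re


def parse_year(label):
    """Return the first four-digit number in label as an int, or None."""
    match = re.search(r"\d{4}", label)
    return int(match.group()) if match else None


plain = parse_year("(1994)")
prefixed = parse_year("(I) (2019)")  # hypothetical label with an extra prefix
missing = parse_year("")
print(plain, prefixed, missing)
```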

In [50]:
# movie metascore
movie_metascore = first_movie.find("span", {"class": "metascore favorable"})
movie_metascore = int(movie_metascore.text)
movie_metascore

81

In [51]:
# What's the problem?
first_movie = movie_containers[25]
movie_metascore = first_movie.find("span", {"class": "metascore favorable"})
movie_metascore = int(movie_metascore.text)
movie_metascore

AttributeError: 'NoneType' object has no attribute 'text'

In [52]:
# Be cautious!
# Some classes may be different across movies.
first_movie = movie_containers[25]
movie_metascore = first_movie.find("span", {"class": "metascore"})
movie_metascore = int(movie_metascore.text)
movie_metascore

59

In [53]:
# What's the problem?
first_movie = movie_containers[7]
movie_metascore = first_movie.find("span", {"class": "metascore"})
movie_metascore = int(movie_metascore.text)
movie_metascore

AttributeError: 'NoneType' object has no attribute 'text'

In [54]:
# Some movies have no metascore at all, so convert only when the tag was found.
first_movie = movie_containers[7]
movie_metascore = first_movie.find("span", {"class": "metascore"})
if movie_metascore is not None:
    movie_metascore = int(movie_metascore.text)
print(movie_metascore)

None
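The `None` check can be wrapped in a small helper so that every field is handled the same way. In the sketch below, `to_int_or_none` is a made-up name, and `SimpleNamespace` objects stand in for the tags that `find()` would return, so that the example runs without downloading a page.

```python
from types import SimpleNamespace


def to_int_or_none(tag):
    """Return int(tag.text), or None when the tag is missing or not numeric."""
    if tag is None:
        return None
    text = tag.text.strip()  # drop padding around the number
    return int(text) if text.isdigit() else None


found = SimpleNamespace(text="81        ")  # stand-in for a matched tag
odd = SimpleNamespace(text="N/A")           # stand-in for non-numeric text
scores = [to_int_or_none(found), to_int_or_none(None), to_int_or_none(odd)]
print(scores)
```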


## Iteration
### For Loop

In [None]:
for i in [1, 2, 3]:
    print(i ** 2)

In [None]:
empty_list = []
for i in [1, 2, 3]:
    empty_list.append(i ** 2)

empty_list

### Recall Four 1D Objects
List
- `[]`
- `list()`

Set
- `{1}`
- `set()`

Dictionary
- `{1:"a"}` where key is 1 and value is "a"
- `dict()`

Tuple
- `()`
- `tuple()`
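Two details of the literals above are easy to trip over: empty braces create a dictionary rather than a set, and a one-element tuple needs a trailing comma. A short sketch (variable names are illustrative):

```python
empty_dict = {}    # empty braces create a dict, not a set
empty_set = set()  # an empty set has no literal form
one_tuple = (1,)   # the trailing comma makes the tuple
just_int = (1)     # parentheses alone only group an expression

kinds = [type(x).__name__ for x in (empty_dict, empty_set, one_tuple, just_int)]
print(kinds)
```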

### List Comprehension
- `[output_function(item) for item in item_collection]`
- `[output_function(item) for item in item_collection if item_judgement]` with an `if` condition
- `[output_function(i) for item in item_collection for i in item]` for nested iteration
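The nested form reads left to right in the same order as the equivalent nested `for` loops. A minimal sketch with a made-up list of lists:

```python
rows = [[1, 2], [3, 4], [5]]

# Outer loop first (row), then inner loop (i), as in nested for statements.
flat_squares = [i ** 2 for row in rows for i in row]

# A condition can follow the last loop.
odd_squares = [i ** 2 for row in rows for i in row if i % 2 == 1]

print(flat_squares)
print(odd_squares)
```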

In [None]:
empty_list = []
for i in [1, 2, 3]: # for item in item_collection
    empty_list.append(i ** 2) # output_function(item)

empty_list

In [None]:
[i ** 2 for i in [1, 2, 3]]

In [None]:
[i ** 2 for i in {1, 2, 3}] # set as input

In [None]:
[i ** 2 for i in (1, 2, 3)] # tuple as input

In [None]:
# set comprehension
{i ** 2 for i in (1, 2, 3)}

In [None]:
[i ** 2 for i in list(range(10))] # range() generates a sequence of integers (starting from 0 by default)

In [None]:
[i ** 2 for i in list(range(1, 11))]

In [None]:
[i ** 2 for i in list(range(1, 11, 2))] # Step is 2

In [None]:
empty_list = []
input_list = list(range(1, 11))
for i in input_list: # for item in item_collection
    if i % 2 == 1: # item_judgement
        empty_list.append(i ** 2) # output_function(item)

empty_list

In [None]:
input_list = list(range(1, 11))
[i ** 2 for i in input_list if i % 2 == 1]

## All Movies in the First Page

In [None]:
# extract movie names
movie_names = [movie.h3.a.text for movie in movie_containers]
movie_names

In [None]:
# extract movie years
def extract_year(movie):
    year_str = movie.find('span',{'class':'lister-item-year text-muted unbold'}).text
    year_int = int(re.findall(r"\d{4}", year_str)[0])
    return year_int

movie_years = [extract_year(movie) for movie in movie_containers]
movie_years

In [None]:
# Extract Movie Metascores
def extract_metascore(movie):
    movie_metascore = movie.find("span", {"class": "metascore"})
    if movie_metascore is not None:
        movie_metascore = int(movie_metascore.text)
    return movie_metascore
    
movie_metascores = [extract_metascore(movie) for movie in movie_containers]
movie_metascores

In [None]:
# Use a dict to create a pd.DataFrame
movie_df = pd.DataFrame({"name":movie_names, "year":movie_years, "metascore":movie_metascores})
movie_df

In [None]:
type(movie_df)

In [None]:
# A Quick Visualization
import matplotlib.pyplot as plt
import numpy as np
movie_df["ranking"] = np.arange(1, len(movie_df) + 1) # rankings start at 1

plt.scatter("year", "metascore", c="black", data=movie_df)
plt.show()

### All Top 250 Movies

In [None]:
list(range(10))

In [None]:
list(range(1,10))

In [None]:
list(range(1,10,2))

In [None]:
start_numbers=list(range(1, 202, 50))
start_numbers

In [None]:
url1="https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start="
url2="&ref_=adv_nxt"
url1 + str(1) + url2

In [None]:
urls=[url1 + str(i) + url2 for i in start_numbers]
urls
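String concatenation works, but the query string can also be assembled from a dictionary with `urllib.parse.urlencode`, which keeps each parameter visible. This is only a sketch: the helper name `page_url` is made up, and `safe=","` is passed so the comma in `user_rating,desc` stays unescaped as in the URLs above.

```python
from urllib.parse import urlencode

base = "https://www.imdb.com/search/title/"


def page_url(start):
    """Build the search URL for the result page beginning at `start`."""
    params = {
        "groups": "top_250",
        "sort": "user_rating,desc",
        "start": start,
        "ref_": "adv_nxt",
    }
    return base + "?" + urlencode(params, safe=",")


page_urls = [page_url(start) for start in range(1, 202, 50)]
print(page_urls[1])
```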

## Practice
* Scrape all top 250 movies.
* Extract other movie characteristics (e.g., rating, director, stars).
* Find and scrape top 100 movies released in 2022.
* (Optional) Extract some elements (e.g., video features, bullet comments (danmaku), comments) from some videos on Bilibili.