# People's Bank of China: a Web Scraper

Changhao Li | 2021.12


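``GOAL`` of this programme: extract data on fines issued by the People's Bank of China (PBC), especially fines issued to payment companies. Because the sources are structured inconsistently, fully automatic extraction of this information is difficult.
Therefore, we adopt a semi-automatic approach: the scraper collects the links to the fine notices, and extracting the details of each notice is left for later.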

### 0. Limitations of extracting data from PDF files

It is difficult to read data from a PDF file, especially text written in Chinese characters. For instance, we try to extract the relevant information from a sample PDF with **camelot**:

In [1]:
# -*- coding: utf-8 -*-
from camelot import read_pdf
import re

def parse_pdf_camelot(link):
    tables = read_pdf(link, pages = '0', flavor = 'stream', table_area = ['']) # the stream flavor treats the whole page as one table by default
    print(tables)
    print(tables[0].data)
    print()
    #print(re.findall(r'\w+\s元|\w+\s万元', info))
    
parse_pdf_camelot('2021122718272373606.pdf')


<TableList n=1>
[['备注']]





This does not work: camelot returns a single table containing only one cell, '备注' ('Remarks').

Let me try another way to read the PDF file: the **PDFMiner3K** package.

In [2]:
### PDFMiner reading online PDF files

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams = laparams)
    
    PDFPage.get_pages(rsrcmgr, device, pdfFile)
    device.close()
    
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen('http://nanning.pbc.gov.cn/nanning/133346/133364/133371/4432873/2021122718272373606.pdf')
outputString = readPDF(pdfFile)
print('The program is running...')
print(outputString)
pdfFile.close()


The program is running...



Nothing is printed... Now let's try reading a local copy of the file and see whether that works...

In [19]:
### PDFMiner reading downloaded PDF files

#from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams = laparams)
    
    PDFPage.get_pages(rsrcmgr, device, pdfFile)
    device.close()
    
    content = retstr.getvalue()
    retstr.close()
    return content

outputString = readPDF('2021122718272373606.pdf')
print('The program is running...')
print(outputString)
pdfFile.close()


The program is running...



Nope. Again nothing is printed. Hopeless.

The solution: skip the difficult part. This is not a PhD project. That part is left to Dr Josiah Poon, Dr Caren Han and the USYD NLP Group; here's hoping for a breakthrough. Bless!

In the main part I will focus only on extracting the URLs of fine notices from the different PBC branch websites.
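
For reference, one possible reason both PDFMiner cells print nothing, judging from the code alone and not from a re-run: `PDFPage.get_pages` only returns a generator of pages, and the cells never iterate over it or pass the pages to an interpreter. The sketch below assumes the installed package is pdfminer.six (the imports used above match its module layout); `read_pdf_text`, `read_local_pdf` and `read_remote_pdf` are hypothetical helper names. Even with this change, a scanned PDF without a text layer would still come back empty, so this is no guarantee that the sample file becomes readable.

```python
from io import BytesIO, StringIO
from urllib.request import urlopen


def read_pdf_text(fp):
    """Return the text layer of a PDF read from a seekable binary file object."""
    # Assumption: pdfminer.six is installed. The imports sit inside the
    # function so that this sketch can be loaded without the package.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # get_pages() only yields page objects; each page has to be handed to
    # the interpreter before any text reaches the converter.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    device.close()
    return retstr.getvalue()


def read_local_pdf(path):
    """Open a downloaded PDF in binary mode and extract its text."""
    with open(path, 'rb') as fp:
        return read_pdf_text(fp)


def read_remote_pdf(url):
    """Download a PDF into memory first, because the parser needs to seek."""
    with urlopen(url) as response:
        return read_pdf_text(BytesIO(response.read()))
```

With the sample file from above, this would be called as `read_local_pdf('2021122718272373606.pdf')`.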

### 1. PBC... so many branches!

There are 35 PBC branch websites... 35! Every branch runs its own site. The sites differ in page structure, and each branch publishes its fine notices separately and in its own format, which makes web scraping considerably more complex.

For instance, the Shenzhen central sub-branch publishes fine information as Excel files (.xls), the Xi'an branch attaches Word documents (.doc), the Nanning central sub-branch uses PDF files (.pdf), and the Nanjing branch goes further and posts the notice as plain text on the web page itself (HTML). What a diversity...
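
Since the format differs by branch, a first step before any parsing could be to sort the collected links by file type. The following is a minimal sketch using only the standard library; `attachment_kind` and `KIND_BY_SUFFIX` are hypothetical names, and the `.xlsx`, `.docx` and `.htm` entries are assumptions added for completeness, not formats observed on the sites.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extension-to-format table for the four cases mentioned above.
KIND_BY_SUFFIX = {
    '.xls': 'excel',
    '.xlsx': 'excel',   # assumption: not seen in the source
    '.doc': 'word',
    '.docx': 'word',    # assumption: not seen in the source
    '.pdf': 'pdf',
    '.html': 'html',
    '.htm': 'html',     # assumption: not seen in the source
}


def attachment_kind(url):
    """Guess the format of a fine notice from the extension in its URL."""
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return KIND_BY_SUFFIX.get(suffix, 'unknown')


print(attachment_kind('http://nanning.pbc.gov.cn/nanning/133346/133364/133371/4432873/2021122718272373606.pdf'))
print(attachment_kind('http://nanjing.pbc.gov.cn/nanjing/117542/117560/117567/4437230/index.html'))
```

A link with no recognisable extension is reported as `'unknown'` so that it can be checked by hand.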

### 2. Branches (9)

##### 2.1 Guangzhou

In [4]:
# -*- coding = utf-8 -*-

### GUANGZHOU BRANCH
'''
INPUT: 
- name: name of the city
- year: year
- month: month
OUTPUT:
- a csv file (e.g. 'guangzhou-2021-12.csv')
'''

from datetime import datetime, date
from urllib import request, parse
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent
import csv
import re
import webbrowser

ua = UserAgent()

### INPUT!
name = 'guangzhou'
year = '2021'
month = '12'

### AUTO JUDGE DAY
if int(month) in {1,3,5,7,8,10,12}:
    day = '31'
elif int(month) in {4,6,9,11}:
    day = '30'
else:
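    # year is a string, so convert it before using % (otherwise February raises a TypeError)
    if (int(year) % 4 == 0 and int(year) % 100 != 0) or int(year) % 400 == 0: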
        day = '29'
    else:
        day = '28'
print(day)

def gzSpider(link):
    driver = webdriver.Chrome()
    time.sleep(1)
    driver.get(link)
    req = driver.page_source
    driver.quit()  # release the browser once the HTML has been captured
    # print(req)
    soup = BeautifulSoup(req, 'lxml')
    # print(soup.prettify())
    fram = soup.find("td", class_ = "content_right column")
    # print(fram.prettify())
    mylist = []
    finallist = []
    nonExist = False
    count = 0
    datelist = []
    linklist = []
    for item in fram.find_all("table", limit=1):
        # print(item)
        for temp in item.find_all("td", limit=1):
            ### FILTER ALL TIME OUT
            for inner in temp.find_all("td", width="100", class_="hei12jj", limit=10):
                # print(inner)
                d = datetime.strptime(inner.text, '%Y-%m-%d')
                ### INPUT DESIRED TIME HERE!
                if ((d >= datetime(int(year), int(month), 1)) 
                    & (d <= datetime(int(year), int(month), int(day)))):
                    print(d)
                    datelist.append(d)
                    count += 1
                else:
                    print("date is incorrect")
                    pass
            if (count == 0):
                break
            l = temp.select('a[href]', limit=count)
            for k in range(0,len(l)):
                print("http://guangzhou.pbc.gov.cn" + (l[k]['href']))
                w = "http://guangzhou.pbc.gov.cn" + (l[k]['href'])
                linklist.append(w)
    
    txt = '{n}-{y}-{m}.csv'
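    # newline='' avoids blank rows on Windows; utf-8-sig keeps the Chinese header readable in Excel
    f = open(txt.format(n = name, y = year, m = month), 'w', newline='', encoding='utf-8-sig')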
    writer = csv.writer(f)
    writer.writerow(['发布日期', '罚单链接'])
    
    for i in range(0, count):
        dlist = []
        dlist.append(datelist[i].date().strftime("%Y-%m-%d"))
        dlist.append(linklist[i])
        print(dlist)
        writer.writerow(dlist)
    
    f.close()
    
    #webbrowser.open('')

### INPUT GZ OFFICIAL WEBSITE INSIDE                
gzSpider('http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/index.html')

31
2021-12-27 00:00:00
2021-12-27 00:00:00
2021-12-24 00:00:00
2021-12-23 00:00:00
date is incorrect
date is incorrect
date is incorrect
date is incorrect
date is incorrect
date is incorrect
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433798/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433789/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433778/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433773/index.html
['2021-12-27', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433798/index.html']
['2021-12-27', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433789/index.html']
['2021-12-24', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433778/index.html']
['2021-12-23', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433773/index.html']
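
The "AUTO JUDGE DAY" block in the cell above works out the last day of the month by hand. As an alternative, the standard library's `calendar.monthrange` returns the same number and already accounts for leap years. This is a small sketch; `month_bounds` and `in_month` are hypothetical helper names.

```python
import calendar
from datetime import datetime


def month_bounds(year, month):
    """Return the first and last day of a month as datetime objects."""
    year, month = int(year), int(month)  # accepts '2021' and '12' as typed in the cells
    last_day = calendar.monthrange(year, month)[1]  # second item is the number of days
    return datetime(year, month, 1), datetime(year, month, last_day)


def in_month(d, year, month):
    """True if the datetime d falls inside the given month."""
    start, end = month_bounds(year, month)
    return start <= d <= end


print(month_bounds('2021', '12'))
print(in_month(datetime(2021, 12, 27), '2021', '12'))
```

Because the dates scraped from the listing carry no time of day, comparing against midnight on the last day is enough here.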


##### 2.2 Nanjing

In [3]:
# -*- coding = utf-8 -*-

'''
INPUT: 
- name: name of the city
- year: year
- month: month
OUTPUT:
- a csv file (e.g. 'guangzhou-2021-12.csv')
'''

from datetime import datetime, date
from urllib import request, parse
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent
import csv
import re
import webbrowser

ua = UserAgent()

name = 'nanjing'
year = '2021'
month = '12'

### AUTO JUDGE DAY
if int(month) in {1,3,5,7,8,10,12}:
    day = '31'
elif int(month) in {4,6,9,11}:
    day = '30'
else:
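    # year is a string, so convert it before using % (otherwise February raises a TypeError)
    if (int(year) % 4 == 0 and int(year) % 100 != 0) or int(year) % 400 == 0: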
        day = '29'
    else:
        day = '28'
print(day)
    

def njSpider(link):
    driver = webdriver.Chrome()
    time.sleep(1)
    driver.get(link)
    req = driver.page_source
    driver.quit()  # release the browser once the HTML has been captured
    # print(req)
    soup = BeautifulSoup(req, 'lxml')
    # print(soup.prettify())
    fram = soup.find("td", class_ = "content_right column")
    # print(fram.prettify())
    mylist = []
    finallist = []
    
    count = 0
    datelist = []
    linklist = []
    for item in fram.find_all("table", limit=1):
        # print(item)
        for temp in item.find_all("td", limit=1):
            ### FILTER ALL TIME OUT
            for inner in temp.find_all("td", width="100", class_="hei12jj", limit=10):
                # print(inner)
                d = datetime.strptime(inner.text, '%Y-%m-%d')
                ### INPUT DESIRED TIME HERE!
                if ((d >= datetime(int(year), int(month), 1)) 
                    & (d <= datetime(int(year), int(month), int(day)))):
                    print(d)
                    datelist.append(d)
                    count += 1
                else:
                    #return # delete this if exists!
                    pass
            if (count == 0):
                break
            l = temp.select('a[href]', limit=count)
            for k in range(0,len(l)):
                print("http://nanjing.pbc.gov.cn" + (l[k]['href']))
                w = "http://nanjing.pbc.gov.cn" + (l[k]['href'])
                linklist.append(w)
    
    txt = '{n}-{y}-{m}.csv'
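    # newline='' avoids blank rows on Windows; utf-8-sig keeps the Chinese header readable in Excel
    f = open(txt.format(n = name, y = year, m = month), 'w', newline='', encoding='utf-8-sig')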
    writer = csv.writer(f)
    writer.writerow(['发布日期', '罚单链接'])
    
    for i in range(0, count):
        dlist = []
        dlist.append(datelist[i].date().strftime("%Y-%m-%d"))
        dlist.append(linklist[i])
        print(dlist)
        writer.writerow(dlist)
    
    f.close()
    
    #webbrowser.open('')

### INPUT OFFICIAL WEBSITE INSIDE                
njSpider('http://nanjing.pbc.gov.cn/nanjing/117542/117560/117567/index.html')

31
2021-12-31 00:00:00
http://nanjing.pbc.gov.cn/nanjing/117542/117560/117567/4437230/index.html
['2021-12-31', 'http://nanjing.pbc.gov.cn/nanjing/117542/117560/117567/4437230/index.html']


Nanjing published only one fine notice in December 2021, so the result is as expected.
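
One caveat about the spiders in this section: dates are collected in one pass, and then the first `count` links in the listing are taken. That pairing is only right when the in-month rows sit at the top of the listing. A more defensive variant keeps each date together with its own link. The sketch below assumes the rows have already been read from the page as `(date_text, href)` tuples (how to do that depends on each branch's HTML and is not shown); `select_rows` is a hypothetical name and the second sample row is made up.

```python
from datetime import datetime
from urllib.parse import urljoin


def select_rows(rows, year, month, index_url):
    """Keep (date, absolute link) pairs published in the given month.

    rows is a list of (date_text, href) tuples, one per listing row.
    """
    selected = []
    for date_text, href in rows:
        d = datetime.strptime(date_text.strip(), '%Y-%m-%d')
        if d.year == int(year) and d.month == int(month):
            # urljoin resolves the site-relative href against the listing page
            selected.append((d.strftime('%Y-%m-%d'), urljoin(index_url, href)))
    return selected


rows = [
    ('2021-12-31', '/nanjing/117542/117560/117567/4437230/index.html'),
    ('2021-11-30', '/nanjing/117542/117560/117567/0000000/index.html'),  # hypothetical older row
]
index_url = 'http://nanjing.pbc.gov.cn/nanjing/117542/117560/117567/index.html'
print(select_rows(rows, '2021', '12', index_url))
```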

##### 2.3 Jinan

In [7]:
# -*- coding = utf-8 -*-

'''
INPUT: 
- name: name of the city
- year: year
- month: month
OUTPUT:
- a csv file (e.g. 'guangzhou-2021-12.csv')
'''

from datetime import datetime, date
from urllib import request, parse
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent
import csv
import re
import webbrowser

ua = UserAgent()

name = 'jinan'
year = '2021'
month = '12'

### AUTO JUDGE DAY
if int(month) in {1,3,5,7,8,10,12}:
    day = '31'
elif int(month) in {4,6,9,11}:
    day = '30'
else:
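    # year is a string, so convert it before using % (otherwise February raises a TypeError)
    if (int(year) % 4 == 0 and int(year) % 100 != 0) or int(year) % 400 == 0: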
        day = '29'
    else:
        day = '28'
print(day)
    

def jnSpider(link):
    driver = webdriver.Chrome()
    time.sleep(1)
    driver.get(link)
    req = driver.page_source
    driver.quit()  # release the browser once the HTML has been captured
    # print(req)
    soup = BeautifulSoup(req, 'lxml')
    # print(soup.prettify())
    fram = soup.find("td", class_ = "content_right column")
    # print(fram.prettify())
    mylist = []
    finallist = []
    
    count = 0
    datelist = []
    linklist = []
    for item in fram.find_all("table", limit=1):
        # print(item)
        for temp in item.find_all("td", limit=1):
            ### FILTER ALL TIME OUT
            for inner in temp.find_all("td", width="100", class_="hei12jj", limit=10):
                # print(inner)
                d = datetime.strptime(inner.text, '%Y-%m-%d')
                ### INPUT DESIRED TIME HERE!
                if ((d >= datetime(int(year), int(month), 1)) 
                    & (d <= datetime(int(year), int(month), int(day)))):
                    print(d)
                    datelist.append(d)
                    count += 1
                else:
                    #return # delete this if exists!
                    pass
            if (count == 0):
                print("Not a single fine found...")
                break
            l = temp.select('a[href]', limit=count)
            for k in range(0,len(l)):
                print("http://jinan.pbc.gov.cn" + (l[k]['href']))
                w = "http://jinan.pbc.gov.cn" + (l[k]['href'])
                linklist.append(w)
    
    txt = '{n}-{y}-{m}.csv'
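    # newline='' avoids blank rows on Windows; utf-8-sig keeps the Chinese header readable in Excel
    f = open(txt.format(n = name, y = year, m = month), 'w', newline='', encoding='utf-8-sig')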
    writer = csv.writer(f)
    writer.writerow(['发布日期', '罚单链接'])
    
    for i in range(0, count):
        dlist = []
        dlist.append(datelist[i].date().strftime("%Y-%m-%d"))
        dlist.append(linklist[i])
        print(dlist)
        writer.writerow(dlist)
    
    f.close()
    
    #webbrowser.open('')

### INPUT OFFICIAL WEBSITE INSIDE                              
jnSpider('http://jinan.pbc.gov.cn/jinan/120967/120985/120994/index.html')


31
2021-12-31 00:00:00
2021-12-31 00:00:00
2021-12-27 00:00:00
2021-12-27 00:00:00
2021-12-27 00:00:00
2021-12-22 00:00:00
2021-12-21 00:00:00
2021-12-20 00:00:00
2021-12-17 00:00:00
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4438577/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4436741/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4436712/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4431771/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4431386/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4430701/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4427013/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4423986/index.html
http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4423126/index.html
['2021-12-31', 'http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4438577/index.html']
['2021-12-31', 'http://jinan.pbc.gov.cn/jinan/120967/120985/120994/4436741/index.html']
['2021-12-2

##### 2.4 Xi'an

In [6]:
# -*- coding = utf-8 -*-

'''
INPUT: 
- name: name of the city
- year: year
- month: month
OUTPUT:
- a csv file (e.g. 'guangzhou-2021-12.csv')
'''

from datetime import datetime, date
from urllib import request, parse
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent
import csv
import re
import webbrowser

ua = UserAgent()

name = 'xian'
year = '2021'
month = '12'

### AUTO JUDGE DAY
if int(month) in {1,3,5,7,8,10,12}:
    day = '31'
elif int(month) in {4,6,9,11}:
    day = '30'
else:
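    # year is a string, so convert it before using % (otherwise February raises a TypeError)
    if (int(year) % 4 == 0 and int(year) % 100 != 0) or int(year) % 400 == 0: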
        day = '29'
    else:
        day = '28'
print(day)
    

def xaSpider(link):
    driver = webdriver.Chrome()
    time.sleep(1)
    driver.get(link)
    req = driver.page_source
    driver.quit()  # release the browser once the HTML has been captured
    # print(req)
    soup = BeautifulSoup(req, 'lxml')
    # print(soup.prettify())
    fram = soup.find("td", class_ = "content_right column")
    # print(fram.prettify())
    mylist = []
    finallist = []
    
    count = 0
    datelist = []
    linklist = []
    for item in fram.find_all("table", limit=1):
        # print(item)
        for temp in item.find_all("td", limit=1):
            ### FILTER ALL TIME OUT
            for inner in temp.find_all("td", width="100", class_="hei12jj", limit=10):
                # print(inner)
                d = datetime.strptime(inner.text, '%Y-%m-%d')
                ### INPUT DESIRED TIME HERE!
                if ((d >= datetime(int(year), int(month), 1)) 
                    & (d <= datetime(int(year), int(month), int(day)))):
                    print(d)
                    datelist.append(d)
                    count += 1
                else:
                    #return # delete this if exists!
                    pass
            if (count == 0):
                print("Not a single fine found...")
                break
            l = temp.select('a[href]', limit=count)
            for k in range(0,len(l)):
                print("http://xian.pbc.gov.cn" + (l[k]['href']))
                w = "http://xian.pbc.gov.cn" + (l[k]['href'])
                linklist.append(w)
    
    txt = '{n}-{y}-{m}.csv'
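    # newline='' avoids blank rows on Windows; utf-8-sig keeps the Chinese header readable in Excel
    f = open(txt.format(n = name, y = year, m = month), 'w', newline='', encoding='utf-8-sig')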
    writer = csv.writer(f)
    writer.writerow(['发布日期', '罚单链接'])
    
    for i in range(0, count):
        dlist = []
        dlist.append(datelist[i].date().strftime("%Y-%m-%d"))
        dlist.append(linklist[i])
        print(dlist)
        writer.writerow(dlist)
    
    f.close()
    
    #webbrowser.open('')

### INPUT OFFICIAL WEBSITE INSIDE                
xaSpider('http://xian.pbc.gov.cn/xian/129428/129449/129458/index.html')



31
Not a single fine found...


### 3. How to extract further information?

To be continued.
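
As a possible starting point, the commented-out regular expression in the camelot cell hints at one approach: once the text of a notice is available, look for amounts followed by 元 (yuan) or 万元 (ten thousand yuan). The sketch below is an illustration under assumptions: `find_amounts` is a hypothetical name, the sample sentence is made up, and amounts written with Chinese numerals (for example 伍万元) are not handled.

```python
import re
from decimal import Decimal

# Arabic-numeral amounts followed by 元 (yuan) or 万元 (ten thousand yuan).
# 万元 is listed first so that it is not cut short to 元.
AMOUNT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(万元|元)')


def find_amounts(text):
    """Return every amount found in text, converted to yuan."""
    amounts = []
    for number, unit in AMOUNT_RE.findall(text):
        factor = Decimal(10000) if unit == '万元' else Decimal(1)
        amounts.append(Decimal(number) * factor)
    return amounts


# Made-up sentence for illustration, not taken from a real notice.
sample = '警告，并处罚款 5 万元；对责任人罚款1.5万元，另罚3000元'
print(find_amounts(sample))
```

`Decimal` is used instead of `float` so that amounts such as 1.5万元 convert without rounding noise.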

In [18]:
# -*- coding = utf-8 -*-

'''
INPUT: 
- name: name of the city
- year: year
- month: month
OUTPUT:
- a csv file (e.g. 'guangzhou-2021-12.csv')
'''

from datetime import datetime
#from urllib import request, parse
from bs4 import BeautifulSoup
import time
#import pandas as pd
from selenium import webdriver
from fake_useragent import UserAgent
import csv

ua = UserAgent()

name = str(input("Please input city name (e.g. guangzhou): "))
year = str(input("Please indicate which year you wish to scrape: "))
month = str(input("Please indicate which month you wish to scrape: "))

def pbcSpider(link):
    driver = webdriver.Chrome()
    time.sleep(1)
    driver.get(link)
    req = driver.page_source
    driver.quit()  # release the browser once the HTML has been captured
    # print(req)
    soup = BeautifulSoup(req, 'lxml')
    # print(soup.prettify())
    fram = soup.find("td", class_ = "content_right column")
    # print(fram.prettify())
    
    count = 0
    datelist = []
    linklist = []
    for item in fram.find_all("table", limit=1):
        # print(item)
        for temp in item.find_all("td", limit=1):
            ### FILTER ALL TIME OUT
            for inner in temp.find_all("td", width="100", class_="hei12jj", limit=10):
                # print(inner)
                d = datetime.strptime(inner.text, '%Y-%m-%d')
                ### KEEP ONLY THE YEAR AND MONTH ENTERED ABOVE
                if d.year == int(year) and d.month == int(month):
                    print(d)
                    datelist.append(d)
                    count += 1
                else:
                    #return # delete this if exists!
                    pass
            if (count == 0):
                break
            l = temp.select('a[href]', limit=count)
            # the branches shown above all use the city name as the host name
            base = 'http://{}.pbc.gov.cn'.format(name)
            for k in range(0,len(l)):
                w = base + (l[k]['href'])
                print(w)
                linklist.append(w)
    
    txt = '{n}-{y}-{m}.csv'
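    # newline='' avoids blank rows on Windows; utf-8-sig keeps the Chinese header readable in Excel
    f = open(txt.format(n = name, y = year, m = month), 'w', newline='', encoding='utf-8-sig')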
    writer = csv.writer(f)
    writer.writerow(['发布日期', '罚单链接'])
    
    for i in range(0, count):
        dlist = []
        dlist.append(datelist[i].date().strftime("%Y-%m-%d"))
        dlist.append(linklist[i])
        print(dlist)
        writer.writerow(dlist)
    
    f.close()

### INPUT OFFICIAL WEBSITE INSIDE                
pbcSpider('http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/index.html')

Please input city name (e.g. guangzhou): guangzhou
Please indicate which year you wish to scrape: 2021
Please indicate which month you wish to scrape: 12
2021-12-27 00:00:00
2021-12-27 00:00:00
2021-12-24 00:00:00
2021-12-23 00:00:00
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433798/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433789/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433778/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433773/index.html
['2021-12-27', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433798/index.html']
['2021-12-27', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433789/index.html']
['2021-12-24', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433778/index.html']
['2021-12-23', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433773/index.html']
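
The cell above still receives the Guangzhou listing URL by hand, even though the city name is typed in. One way to tie the two together is a small lookup table of the listing pages used in this notebook. This is a sketch only: `BRANCH_INDEX` and `branch_urls` are hypothetical names, and only the four branches covered above are included.

```python
from urllib.parse import urlsplit

# Listing pages used in the cells above, keyed by the city name given as input.
BRANCH_INDEX = {
    'guangzhou': 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/index.html',
    'nanjing': 'http://nanjing.pbc.gov.cn/nanjing/117542/117560/117567/index.html',
    'jinan': 'http://jinan.pbc.gov.cn/jinan/120967/120985/120994/index.html',
    'xian': 'http://xian.pbc.gov.cn/xian/129428/129449/129458/index.html',
}


def branch_urls(name):
    """Return (index_url, base_url) for a branch, or raise for unknown names."""
    try:
        index_url = BRANCH_INDEX[name.strip().lower()]
    except KeyError:
        raise ValueError('no listing page recorded for branch: ' + name) from None
    parts = urlsplit(index_url)
    return index_url, '{}://{}'.format(parts.scheme, parts.netloc)


index_url, base_url = branch_urls('guangzhou')
print(index_url)
print(base_url)
```

The spider could then be started with `pbcSpider(index_url)`, and a mistyped city name fails early instead of producing an empty CSV.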
