# People's Bank of China: a Web Scraper

Changhao Li | 2021.12


``GOAL`` of this program: extract data on fines, especially those issued to payment companies. Because of structural obstacles, fully automatic extraction of this information is difficult.
Therefore, we adopt a semi-structured approach towards our goal.

### 0. Limitations of extracting data from PDF files

It is difficult to read data from a PDF file, especially text written in Chinese characters. For instance, we try to extract the relevant information from a sample PDF with **camelot**:

In [53]:
# -*- coding: utf-8 -*-
from camelot import read_pdf
import re

def parse_pdf_camelot(link):
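    # In stream mode, camelot treats the whole page as one table by default.
    # NOTE: camelot documents `pages` as 1-based and the area keyword as `table_areas`;
    # the arguments below are left exactly as they were run.
    tables = read_pdf(link, pages = '0', flavor = 'stream', table_area = [''])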
    print(tables)
    print(tables[0].data)
    print()
    #print(re.findall(r'\w+\s元|\w+\s万元', info))
    
parse_pdf_camelot('2021122718272373606.pdf')


<TableList n=1>
[['备注']]





This does not work: camelot recovers only a single cell, `备注` ("Remarks").
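Once text can be recovered from a notice, the commented-out regular expression in the camelot cell (`\w+\s元|\w+\s万元`) points at the next step: pulling the fine amounts out of the text. Here is a minimal sketch of that idea, assuming amounts are written with Arabic digits followed by `元` (yuan) or `万元` (ten thousand yuan). The function name `parse_fine_amounts` and the sample sentence are made up for illustration, and amounts written in Chinese numerals (e.g. 五万元) would not be matched.

```python
import re

# Digits (optionally with a decimal part), optional whitespace, then the unit.
AMOUNT_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*(万元|元)')

def parse_fine_amounts(text):
    """Return every amount found in `text`, converted to yuan."""
    amounts = []
    for number, unit in AMOUNT_PATTERN.findall(text):
        factor = 10000 if unit == '万元' else 1  # 万元 = ten thousand yuan
        amounts.append(float(number) * factor)
    return amounts

# Made-up sample sentence, not taken from a real penalty notice.
print(parse_fine_amounts('罚款 5 万元'))
```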

Let's try another way to read the PDF file: the **PDFMiner3K** package.

In [48]:
### PDFMiner reading online PDF files

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams = laparams)
    
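    # NOTE: with pdfminer.six, get_pages() is a generator; called like this and never
    # iterated, it processes no pages, so `content` below stays empty.
    PDFPage.get_pages(rsrcmgr, device, pdfFile)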
    device.close()
    
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen('http://nanning.pbc.gov.cn/nanning/133346/133364/133371/4432873/2021122718272373606.pdf')
outputString = readPDF(pdfFile)
print('The program is running...')
print(outputString)
pdfFile.close()


The program is running...



No text was extracted. Now let's try reading a local copy of the file and see whether that works.

In [49]:
### PDFMiner reading downloaded PDF files

#from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams = laparams)
    
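    # NOTE: with pdfminer.six, get_pages() is a generator; called like this and never
    # iterated, it processes no pages, so `content` below stays empty.
    PDFPage.get_pages(rsrcmgr, device, pdfFile)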
    device.close()
    
    content = retstr.getvalue()
    retstr.close()
    return content

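# Pass a binary file object rather than a path string, and close that same handle.
with open('2021122718272373606.pdf', 'rb') as pdfFile:
    outputString = readPDF(pdfFile)
print('The program is running...')
print(outputString)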


The program is running...



Nope, still nothing. Hopeless.
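For reference, here is a minimal sketch of how text extraction is usually wired up with pdfminer.six. That package is an assumption about what is installed, based on the module layout of the imports above. `PDFPage.get_pages` takes a seekable binary stream and yields pages, and a `PDFPageInterpreter` must process each one. The helper names `looks_like_pdf` and `extract_text` are hypothetical. Even with this wiring, a scanned PDF with no text layer would still come back empty, so this is not a claim that it succeeds on the sample file.

```python
from io import BytesIO, StringIO

def looks_like_pdf(data):
    """Return True if the downloaded bytes start with the PDF header."""
    return data.lstrip().startswith(b'%PDF-')

def extract_text(pdf_bytes):
    """Extract the text layer of a PDF held in memory (requires pdfminer.six)."""
    # Imported lazily so looks_like_pdf() works without pdfminer.six installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # BytesIO makes the data seekable; every yielded page is processed explicitly.
    for page in PDFPage.get_pages(BytesIO(pdf_bytes)):
        interpreter.process_page(page)
    device.close()
    return retstr.getvalue()

# Possible usage with a downloaded file:
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       data = response.read()
#   if looks_like_pdf(data):
#       print(extract_text(data))
```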

The solution: skip the difficult part. This is not a PhD project. We are waiting for a breakthrough from Dr Josiah Poon, Dr Caren Han and their USYD NLP Group. Bless!

In the main part I will focus only on extracting the URLs of fine notices from the different PBC branch websites.

### 1. PBC... so many branches!

There are 35 PBC websites... 35! Each branch runs its own website and publishes its own penalty notices. The site structures differ, and so do the formats of the notices, which makes web scraping extremely complex.

For instance, the PBC Shenzhen central sub-branch publishes its administrative penalty information in Excel format (.xls), the PBC Xi'an branch uses Word attachments (.doc), the PBC Nanning central sub-branch uses PDF files (.pdf), and the PBC Nanjing branch even publishes its notices as plain text on a web page (HTML)! What a diversity...
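Since each branch publishes in a different format, one way to route a notice to the right parser is to look at the file extension of its URL. Here is a minimal sketch under that assumption. The mapping `FORMAT_BY_SUFFIX` and the function `attachment_format` are hypothetical names, and a real site may serve attachments from URLs without a meaningful extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file suffix to the kind of parser a notice would need.
FORMAT_BY_SUFFIX = {
    '.xls': 'excel', '.xlsx': 'excel',
    '.doc': 'word', '.docx': 'word',
    '.pdf': 'pdf',
    '.html': 'html', '.htm': 'html',
}

def attachment_format(url):
    """Guess the notice format from the suffix of the URL path."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return FORMAT_BY_SUFFIX.get(suffix, 'unknown')

print(attachment_format(
    'http://nanning.pbc.gov.cn/nanning/133346/133364/133371/4432873/2021122718272373606.pdf'
))
```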

##### 1.1 Guangzhou

In [31]:
# -*- coding = utf-8 -*-

### GUANGZHOU BRANCH
'''
INPUT: 
- name: name of the city
- year: year
- month: month
OUTPUT:
- a csv file (e.g. 'guangzhou-2021-12.csv')
'''

from datetime import datetime, date
from urllib import request, parse
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent
import csv
import re
import webbrowser

ua = UserAgent()

name = 'guangzhou'
year = '2021'
month = '12'

def gzSpider(link):
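    driver = webdriver.Chrome()
    try:
        driver.get(link)
        time.sleep(1)  # give the page a moment to finish rendering
        req = driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails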
    # print(req)
    soup = BeautifulSoup(req, 'lxml')
    # print(soup.prettify())
    fram = soup.find("td", class_ = "content_right column")
    # print(fram.prettify())
    mylist = []
    finallist = []
    
    count = 0
    datelist = []
    linklist = []
    for item in fram.find_all("table", limit=1):
        # print(item)
        for temp in item.find_all("td", limit=1):
            ### KEEP ONLY THE ROWS PUBLISHED IN THE REQUESTED MONTH
            for inner in temp.find_all("td", width="100", class_="hei12jj", limit=10):
                # print(inner)
                d = datetime.strptime(inner.text.strip(), '%Y-%m-%d')
                ### The target period comes from the `year` and `month` variables above,
                ### so the first and last day of the month are included.
                if d.year == int(year) and d.month == int(month):
                    print(d)
                    datelist.append(d)
                    count += 1
            l = temp.select('a[href]', limit=count)
            for k in range(0,len(l)):
                print("http://guangzhou.pbc.gov.cn" + (l[k]['href']))
                w = "http://guangzhou.pbc.gov.cn" + (l[k]['href'])
                linklist.append(w)
    
    txt = '{n}-{y}-{m}.csv'
    # newline='' avoids blank rows on Windows; utf-8-sig lets Excel read the Chinese header.
    with open(txt.format(n = name, y = year, m = month), 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['发布日期', '罚单链接'])  # columns: publication date, notice link
    
        for i in range(0, count):
            dlist = [datelist[i].date().strftime("%Y-%m-%d"), linklist[i]]
            print(dlist)
            writer.writerow(dlist)
    
    #webbrowser.open('')

### INPUT GZ OFFICIAL WEBSITE INSIDE                
gzSpider('http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/index.html')

2021-12-27 00:00:00
2021-12-27 00:00:00
2021-12-24 00:00:00
2021-12-23 00:00:00
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433798/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433789/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433778/index.html
http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433773/index.html
['2021-12-27', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433798/index.html']
['2021-12-27', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433789/index.html']
['2021-12-24', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433778/index.html']
['2021-12-23', 'http://guangzhou.pbc.gov.cn/guangzhou/129142/129159/129166/4433773/index.html']
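
The core of the Guangzhou cell, keeping rows from one month and turning relative links into absolute URLs, can also be written as a small function that is testable without a browser. Here is a minimal sketch, assuming each listing row has already been reduced to a `(date_text, href)` pair. The function name `select_month` and the November row are hypothetical. Unlike the cell above, which pairs the first `count` links with the matching dates by position, this version keeps each date together with its own link.

```python
from datetime import datetime
from urllib.parse import urljoin

def select_month(rows, year, month, base_url):
    """Keep (date_text, href) rows published in the given month.

    Returns a list of (ISO date, absolute URL) tuples.
    """
    selected = []
    for date_text, href in rows:
        published = datetime.strptime(date_text.strip(), '%Y-%m-%d')
        if published.year == year and published.month == month:
            selected.append((published.date().isoformat(), urljoin(base_url, href)))
    return selected

rows = [
    ('2021-12-27', '/guangzhou/129142/129159/129166/4433798/index.html'),
    ('2021-11-30', '/guangzhou/129142/129159/129166/0000000/index.html'),  # made-up row
]
for row in select_month(rows, 2021, 12, 'http://guangzhou.pbc.gov.cn'):
    print(row)
```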
