Written in Python 3.7.6
Some Crawlers for Daily Data Collection, through Naive Approaches (without the Scrapy Framework)
- Simply clone/download the files in the repository.
- Execute `pip install -r requirements.txt` (or an equivalent command) to install/ensure all required modules/packages.
- Specify the paths and check the global variables.
- Run the code and have a cup of coffee while you wait for the execution to finish.
All data belong to the corresponding source sites.
The crawlers (sometimes together with crawled data) are provided for proper private use only.
Anyone who abuses them, in any form, is to blame and must shoulder the responsibility on his/her own.
IMPORTANT REMINDER
- Please refer to the corresponding data source sites for more detailed rules (regarding privacy, distribution, etc.).
- Please use the data wisely.
- Repository Folder Name: `./202004 SJTU Zhiyuan Namelist/`
- Functionality: It crawls the information of the students enrolled in Shanghai Jiaotong University (SJTU) Zhiyuan College. Contents include:
  - Name lists of students of all majors and years
  - Students' self descriptions
  - Students' profile photos
- Source: SJTU Zhiyuan College - Students
─┬─ OUTPUT_DES_ROOT <folder> make sure path exists
└─┬─ OUTPUT_DES_FOLDER <folder> root of crawled results
├─── &&&.xlsx <file> result file
└─── &&&.jpg <file> profiles, filename format "major year id name.jpg"
There are some global variables that you may want to adjust for customized settings and easier use (a configuration sketch follows this list):
- `ROOT`: Root URL of the source site. Please do not modify unless invalid.
- `NAME_LIST_URL`: URL of the source page. Please do not modify unless invalid.
- `OUTPUT_DES_ROOT`: Project-based workspace. Please make sure this path exists; all file operations are done under it.
- `TIME_STAMP`: Timestamp, used in path naming.
- `OUTPUT_DES_FOLDER`: Path (relative) where the results of a crawl are stored, labeled with the timestamp by default. All result-file operations are done under it.
- `SAVE_PAGE`: Mode selection, whether to save the web pages.
- `SLEEP_INTERVAL`: Sleep interval for executions to halt. For system use only; modifications NOT recommended.
- `DEBUG_MODE`: Mode selection, whether to show fewer debug logs.
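For reference, here is a minimal sketch of how these globals might be laid out at the top of the script. The variable names follow the list above, but every value shown (URLs, paths, flags) is a placeholder, not the repository's actual configuration:

```python
import time

# Placeholders -- keep the URLs shipped with the repository unless they become invalid.
ROOT = "https://example-source-site.example"         # root URL of the source site (hypothetical)
NAME_LIST_URL = ROOT + "/students/namelist"          # URL of the source page (hypothetical)

OUTPUT_DES_ROOT = "./output/"                        # workspace; this path must already exist
TIME_STAMP = time.strftime("%Y%m%d_%H%M%S")          # timestamp used in path naming
OUTPUT_DES_FOLDER = OUTPUT_DES_ROOT + TIME_STAMP + "/"  # per-crawl result folder

SAVE_PAGE = False       # whether to save the web pages
SLEEP_INTERVAL = 1      # sleep interval between requests; modifications not recommended
DEBUG_MODE = True       # whether to show fewer debug logs
```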
Generally speaking, the results are stored in a `.xlsx` file. They are fairly comprehensive.
- Repository Folder Name: `./202004 MCM_ICM Results/`
- Competitions: Mathematical Contest in Modeling, The Interdisciplinary Contest in Modeling (MCM/ICM for short)
- Functionality: It crawls the competition results (Years 2019 and 2020 tested) of MCM/ICM. Contents include:
  - The Crawler (in `Crawler.py`)
    - Certificate PDFs
  - The Parser (in `Parser.py`)
    - Winning teams list
    - Participants on each team
    - Advisor of each team
    - Prize types

  For more details, please refer to Possible Goals and Modes.
- Source: COMAP - Problems and Results
- Team Number Formats (years not complete; see the validation sketch after this list):

  | Year | Format | Format (RegExp) | Possible Minimum | Possible Maximum | Example |
  |------|--------|-----------------|------------------|------------------|---------|
  | 2020 | `20*****` | `^20\d{5}$` | 2000000 | 2099999 | 2004664 |
  | 2019 | `19*****` | `^19\d{5}$` | 1900000 | 1999999 | 1901362 |
- Sample Full Results (Years 2019 and 2020): in `./202004 MCM_ICM Results/MCM_ICM/`, as `2019 results.json` and `2020 results.json`.
- Special Notifications
  - Storage concerns for the Crawler: if you want to download all the certificates, please estimate the required device storage first. For instance, the 20951 certificates of Year 2020 take up 3.18 GB (about 160 KB per file on average).
  - Execution-resource concerns for the Crawler and the Parser: a large amount of resources (time, computation, Internet service, etc.) will be consumed during the process, and even more so for the Parser. Here are some of my execution times (hour:minute:second):
    - Year 2020, Crawler: 20:52:01 (20951 files)
    - Year 2020, Parser - Online Approach: 71:13:33 (20960 items)
    - Year 2019, Parser - Online Approach: 81:40:47 (25365 items)
- Possible Future Improvements
  - Efficiency: although great efforts have been made to improve performance while ensuring accuracy, network connection problems and the behavior of some modules still result in low efficiency.
  - PDF miner: `fitz` is used here to convert PDF files containing renderable text areas into image data before further steps. If the text could be parsed directly, a great amount of time would be saved.
  - Accuracy: frankly speaking, some of the participants' names are given in languages such as Chinese instead of English. Although `pytesseract` supports such languages, its accuracy is still a problem. As a result, non-English characters may not be parsed well.
  - During-execution cache design: currently, either the in-memory cache or file I/O burdens the device a lot.
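As an illustration of the team-number formats tabulated above, here is a small validation sketch. The regular expressions come straight from the table; the helper function and constant names are hypothetical and not part of the repository:

```python
import re

# Year -> pattern, taken from the "Team Number Formats" table above.
TEAM_NUMBER_PATTERNS = {
    2020: re.compile(r"^20\d{5}$"),
    2019: re.compile(r"^19\d{5}$"),
}

def is_valid_team_number(year: int, team_number: int) -> bool:
    """Return True if the team number matches the known format for that year."""
    pattern = TEAM_NUMBER_PATTERNS.get(year)
    return bool(pattern and pattern.match(str(team_number)))

# Examples from the table: 2004664 (Year 2020) and 1901362 (Year 2019).
assert is_valid_team_number(2020, 2004664)
assert is_valid_team_number(2019, 1901362)
assert not is_valid_team_number(2020, 1901362)
```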
─┬─ root <folder> working root, please make sure path exists
│ (assigned as OUTPUT_DES_ROOT in Crawler, PATH in Parser)
├─┬─ OUTPUT_DES_FOLDER <folder> [Crawler] root of crawled files
│ └─── LOG_FILE <file> [Crawler] working logs
│
├─── cache_2020... <folder> [Parser] cache folder, auto-created and deleted (if the run exits successfully)
├─── templates <folder> [Parser] cropping templates, files/path NOT recommended to be edited
├─── report_filename <file> [Parser] report of the parser execution (customizable)
├─── result_filename <file> [Parser] result JSON file (customizable)
└─── log file <file> [Parser] working logs (customizable)
The Crawler and the Parser were initially designed to work together. However, after further investigation and improvements, if only the information of the winning teams is required, the Parser alone will meet the needs fairly well. For a clearer explanation, a few possible goals are listed below:
- Only all the certificates: Crawler Only, use `Crawler.py` only.
- Only all the winners' info: two options:
  - Online Parser (strongly recommended)
    - saves disk storage, one-step execution, more vulnerable to errors
    - use `Parser.py` only
  - Local Parser
    - requires a large amount of disk storage, two-step execution, less vulnerable to errors
    - use `Crawler.py` and `Parser.py`
- All the certificates and the winners' info: Local Parser, use `Crawler.py` and `Parser.py`.
- Crawler Only: simply specify the required global variables and run the code.
- Local Parser: please follow these steps (see also the usage sketch after this list):
  - Execute `Crawler.py` (the same as Crawler Only).
  - Execute `Parser.py` (sample code block labeled "Local Parser"):
    - Instantiate class `PrizeParser` (e.g. as an object named `pt`):
      - Specify the folder of the crawled certificates as `files_path`.
      - [Optional] Specify a logger. If not specified, a class-level default logger will be used.
      - Specify extra `kwargs` settings.
    - Call method `pt.get_files_names()` to get the list of files to parse (e.g. as a list named `file_list`).
    - [Optional] Slice the file list for tests or other goals.
    - Call `pt.local_parser(file_list)` to run the parser.
    - After the execution has finished, take a look at the results.
- Online Parser: please follow these steps (see also the usage sketch after this list):
  - Execute `Crawler.py` (the same as Crawler Only).
  - Execute `Parser.py` (sample code block labeled "Online Parser"):
    - Instantiate class `PrizeParser` (e.g. as an object named `pt`):
      - [Optional] Specify a logger. If not specified, a class-level default logger will be used.
      - Specify extra `kwargs` settings.
        Advanced setting: while parsing, middle-step cache files can either be handled as data streams or be read from / written to disk, controlled by the kwarg `cache_img_stream`. Streaming is recommended for machines with high computational capability, but not for machines with high I/O performance.
    - Call method `pt.online_parser()` with a list of ints as the parameter, indicating the team numbers to parse.
    - After the execution has finished, take a look at the results.
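To tie the steps above together, here is a rough usage sketch for both approaches. It only relies on the class, methods, and keyword arguments described in this README (`PrizeParser`, `get_files_names()`, `local_parser()`, `online_parser()`); the import path, the workspace paths, and the team-number range are assumptions and may differ from the actual `Parser.py`:

```python
# Assumes PrizeParser is importable from Parser.py; adjust to the actual module layout.
from Parser import PrizeParser

# --- Local Parser: parse certificates already downloaded by Crawler.py ---
pt = PrizeParser(
    root="./MCM_ICM/",            # REQUIRED workspace (hypothetical path)
    files_path="files/",          # folder holding the crawled certificate PDFs
    logger=None,                  # fall back to the class-level default logger
)
file_list = pt.get_files_names()  # list of files to parse
file_list = file_list[:10]        # optional: slice for a quick test
pt.local_parser(file_list)        # run the local parser

# --- Online Parser: fetch and parse in one step ---
pt = PrizeParser(
    root="./MCM_ICM/",            # hypothetical path
    cache_img_stream=True,        # pass middle-step cache images as streams instead of files
)
pt.online_parser(list(range(2004000, 2004100)))  # hypothetical list of team numbers to parse
```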
There are also some global variables and keyword arguments that you may want to adjust for customized settings and easier use:
- In `Crawler.py`:
  - `OUTPUT_DES_ROOT`: Project-based workspace, also the path where all results and caches are stored. Please make sure this path exists; all file operations are done under it.
  - `TIME_STAMP`: Timestamp, used in path naming.
  - `OUTPUT_DES_FOLDER`: Path (relative) where the results of a crawl are stored, labeled with a timestamp by default.
  - `LOG_FILE`: Filename of the log file.
  - `DEBUG_MODE`: Mode selection, whether to show fewer debug logs.
  - `MAX_ATTEMPTS`: Maximum number of attempts when requesting certificates from the source site.
  - `MIN`: Lower bound (inclusive) of the team numbers to crawl.
  - `MAX`: Upper bound (inclusive) of the team numbers to crawl.
- In `Parser.py`, `kwargs` when instantiating class `PrizeParser`:
  - `root`: REQUIRED. Default workspace: where required files are stored, etc.
  - `files_path`: Local path (relative to `root`) where the PDF(s) are stored, default `files/`.
  - `templates_path`: Local path (relative to `root`) where the REQUIRED templates are stored, default `templates/`.
  - `logger`: Logger object, default `None`.
  - `delete_cache`: Whether to delete cache files after execution, default `True`.
  - `cache_img_stream`: Whether to use a stream to pass cache images, default `True`.
  - `report_filename`: Local path (relative to `root`) where the report file is stored, default `report/`.
  - `result_filename`: Local path (relative to `root`) where the result JSON file is stored, default `result.json`.
  - `_online_max_conti_err`: For the online parser only, maximum number of consecutive errors, default `1000`.
  - `_online_timeout`: For the online parser only, timeout in seconds, default `5`.
  - `_online_max_attempts`: For the online parser only, maximum number of failed attempts, default `2`.
Generally speaking, the results are stored in `.json` files. They are fairly comprehensive.
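Since the result files are plain JSON, they can be inspected with a few lines of Python. The JSON schema is not documented here, so the commented field names below are purely illustrative; check a real result file such as `2020 results.json` for the actual keys:

```python
import json

# Hypothetical path -- point this at one of the sample result files,
# e.g. "./202004 MCM_ICM Results/MCM_ICM/2020 results.json".
with open("2020 results.json", encoding="utf-8") as f:
    results = json.load(f)

print(f"{len(results)} parsed items")
# Illustrative only -- the real key names may differ:
# for item in results:
#     print(item.get("team_number"), item.get("prize"))
```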
- Repository Folder Name: `./202005 High School Rewards Crawler/`
- Functionality: It crawls the results of several national adolescent science and technology competitions. Contents include:
  - Winner name lists of each competition
  - Sample certificates
  - Detailed information (links, source, subject, etc.) of the crawled data
- Source: Children and Youth Science Center, China Association for Science and Technology
─┬─ FILE_ROOT <folder> make sure path exists
├─── FILE_CACHE_PATH <folder> to be deleted when successfully terminated
└─┬─ FILE_DES_ROOT <folder> root of crawled results
├─── FILE_DES_CERT <folder> stores the sample certificates
├─── FILE_DECL_NAME <file> declarations from the source
├─── FILE_RES_NAME <file> result file, in json format
├─── FILE_LOG_NAME <file> log file
└─── FILE_NL_SRC_NAME <file> "cache" like, contains all the sources of name lists
There are some global variables that you may want to adjust for customized settings and easier use (a path sketch follows this list):
- `URL_ROOT`: Source URL. Please do NOT modify unless invalid.
- `FILE_ROOT`: Project-based workspace, also the path where all results and caches are stored. Please make sure this path exists; all file operations are done under it.
- `FILE_DES_ROOT`: Path (relative) where the results of a crawl are stored, labeled with a timestamp by default.
- `FILE_DECL_NAME`: Filename of the file where the declarations on the source site are stored.
- `FILE_DES_CERT`: Path (relative) where the files related to the sample certificates are stored.
- `FILE_RES_NAME`: Filename of the file where the name-list results are stored.
- `FILE_LOG_NAME`: Filename of the file where logs are stored.
- `FILE_NL_SRC_NAME`: Filename of the file where the links and info of the name lists are stored.
- `FILE_CACHE_PATH`: Path (relative) of the cache folder. Modifications NOT recommended.
- `TARGET_SAMPLE_CERT`: Mode selection, whether to crawl the sample certificates.
- `LESS_CONSOLE_LOG`: Mode selection, whether to show fewer debug logs in the console.
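As a sketch of how these path-related globals map onto the output tree above, the snippet below composes the main output paths. Only the variable names and the nesting come from this README; every concrete value is hypothetical:

```python
import os
import time

FILE_ROOT = "./output/"                          # workspace; this path must already exist
FILE_DES_ROOT = time.strftime("%Y%m%d_%H%M%S")   # per-crawl result folder, timestamp-labeled
FILE_CACHE_PATH = "cache/"                       # cache folder, deleted on successful termination

FILE_DES_CERT = "certs/"                         # sample certificates (hypothetical name)
FILE_DECL_NAME = "declarations.txt"              # declarations from the source site (hypothetical name)
FILE_RES_NAME = "results.json"                   # name-list results (hypothetical name)
FILE_LOG_NAME = "run.log"                        # logs (hypothetical name)
FILE_NL_SRC_NAME = "namelist_sources.json"       # links/info of the name lists (hypothetical name)

# Mirrors the tree above: everything lives under FILE_ROOT,
# with the per-crawl results grouped under FILE_DES_ROOT.
result_file = os.path.join(FILE_ROOT, FILE_DES_ROOT, FILE_RES_NAME)
cert_folder = os.path.join(FILE_ROOT, FILE_DES_ROOT, FILE_DES_CERT)
log_file = os.path.join(FILE_ROOT, FILE_DES_ROOT, FILE_LOG_NAME)
```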
Generally speaking, the results are stored in `.json` files. They are fairly comprehensive.