
jluisfgarza/CENACE-Scraper


Python Scraping Tool

A web scraper that downloads CSV files from CENACE (Mexico). The files are processed and uploaded to a SQL server, where the data can be analyzed.

Site: CENACE

Description

  1. Files are downloaded by the PythonWebScrapper.py script.
  2. The daily scripts read the files from CSVdir and INSERT their contents into the SQL database.
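
The two steps above can be sketched end to end. This is a minimal illustration, not the repository's code: it uses the standard library's csv and sqlite3 modules in place of pandas and SQL Server so that it runs anywhere. The table name `pml`, the column names, and the helper `load_csv_dir` are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv_dir(csv_dir, conn, table="pml"):
    """Insert every row of every *.csv file in csv_dir into `table`.

    Assumes (hypothetically) that each file has the header: hour,node,price.
    Returns the number of rows inserted.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (hour INTEGER, node TEXT, price REAL)"
    )
    inserted = 0
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            rows = [
                (int(r["hour"]), r["node"], float(r["price"]))
                for r in csv.DictReader(handle)
            ]
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
        inserted += len(rows)
    conn.commit()
    return inserted


# Demo with a throwaway directory standing in for CSVdir.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.csv").write_text(
        "hour,node,price\n1,NODE_A,10.5\n2,NODE_A,11.0\n"
    )
    demo_conn = sqlite3.connect(":memory:")
    demo_count = load_csv_dir(tmp, demo_conn)
    demo_total = demo_conn.execute("SELECT SUM(price) FROM pml").fetchone()[0]
```

The real scripts parse the files with pandas; the read-then-insert flow is the same.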

Scripts

Pandas_PML_Daily.py and Pandas_PND_Daily.py: Parse the data downloaded from CENACE each day and load it into the SQL database.

Pandas_PML_Monthly.py and Pandas_PND_Monthly.py: INSERT historical CENACE data into the SQL database.

Installation

To use the tool, it is necessary to download and install:

Note:

  • To install Geckodriver on Windows, add geckodriver.exe to the system PATH

Currently Working Functions

  • Enter CENACE and download files to the specified directory
  • Validate downloads
  • Use the pandas library to parse CSV files in the specified directory
  • Create daily condensed CSV files for PML and PND
  • Validate data integrity as a DataFrame
  • Local DB Connection
  • Local DB INSERT and SELECT
  • PML monthly script
  • PND monthly script
  • PML daily script
  • PND daily script
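
The "Validate downloads" step in the list above might look like the following. This is a sketch under assumptions: the helper name `find_incomplete_downloads` is hypothetical, and it relies on Firefox writing in-progress downloads to temporary `.part` files. The repository's actual check may differ.

```python
import tempfile
from pathlib import Path


def find_incomplete_downloads(download_dir):
    """Return names of files that look unfinished or empty.

    A file is flagged if it is a Firefox ".part" temporary file
    or a ".csv" file of zero bytes.
    """
    problems = []
    for path in sorted(Path(download_dir).iterdir()):
        if not path.is_file():
            continue
        if path.suffix == ".part":
            problems.append(path.name)
        elif path.suffix.lower() == ".csv" and path.stat().st_size == 0:
            problems.append(path.name)
    return problems


# Demo with a throwaway directory standing in for the download folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.csv").write_text("a,b\n1,2\n")
    Path(tmp, "empty.csv").write_text("")
    Path(tmp, "pending.csv.part").write_text("a,b\n")
    flagged = find_incomplete_downloads(tmp)
```

An empty result would mean every file in the folder is complete and non-empty.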

Pending

  • Azure DB connection
  • Azure data upload
  • Initialize DB with past information using monthly scripts
  • Run every 24 hrs
  • Performance

Performance

Current performance with large amounts of data is slow: about 165,000 inserts per minute on a system with:

  • AMD A8-5550M APU, 2.10 GHz

    • CPU load while inserting historical data: ~40%-55%

    Work laptop with common programs and bloatware running (Lotus, IBM, CITRIX, McAfee, Atom Editor, Spyder, SQL Server Management Studio, etc.)

  • 8 GB RAM

    • Memory usage while inserting historical data: ~35%-45%

    SQL Server Windows NT 64-bit, Python, and Spyder were the main processes. The Python process tends to consume varying amounts of memory because file sizes vary.

  • 64-bit Windows 8

Further testing is needed.

Site: Tests

Based on the code logic, the performance bottleneck is the to_sql function in the pandas library.
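
One common way to reduce per-row insert overhead is to send rows in batches inside a single transaction. The sketch below illustrates the idea with the standard library's sqlite3 module rather than pandas and SQL Server, so it does not measure the to_sql bottleneck itself. The table name, row data, and helper names are hypothetical. In pandas, the `chunksize` and `method` arguments of `DataFrame.to_sql` control batching in a similar spirit.

```python
import sqlite3


def insert_row_by_row(conn, rows):
    """Insert one row per statement, committing after each row."""
    for row in rows:
        conn.execute("INSERT INTO prices VALUES (?, ?)", row)
        conn.commit()


def insert_batched(conn, rows, batch_size=1000):
    """Insert rows in batches with executemany, committing once at the end."""
    for start in range(0, len(rows), batch_size):
        conn.executemany(
            "INSERT INTO prices VALUES (?, ?)", rows[start:start + batch_size]
        )
    conn.commit()


def make_db():
    """Create an in-memory database with an empty, hypothetical prices table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (hour INTEGER, price REAL)")
    return conn


sample_rows = [(i % 24, float(i)) for i in range(5000)]

slow_conn = make_db()
insert_row_by_row(slow_conn, sample_rows)

fast_conn = make_db()
insert_batched(fast_conn, sample_rows)

slow_count = slow_conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
fast_count = fast_conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

Both functions store the same rows. Timing them with `time.perf_counter()` on real data would show whether batching helps for a given database and driver.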

Tree view of the download (CSVdir) and backup (PartCSVBackup) directories

  PML
    MDA
    MTR
  PND
    MDA
    MTR
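
The tree above could be created up front with a few lines of Python. This is a sketch: `create_tree` is a hypothetical helper, and the root passed to it stands in for either CSVdir or PartCSVBackup.

```python
import tempfile
from pathlib import Path

# Folder layout from the tree view: each top-level folder has MDA and MTR.
LAYOUT = {"PML": ("MDA", "MTR"), "PND": ("MDA", "MTR")}


def create_tree(root):
    """Create the PML/PND folders with MDA and MTR subfolders under root."""
    created = []
    for top, subfolders in LAYOUT.items():
        for sub in subfolders:
            folder = Path(root) / top / sub
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder.relative_to(root).as_posix())
    return created


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    tree = create_tree(tmp)
    all_exist = all((Path(tmp) / p).is_dir() for p in tree)
```

Because `mkdir` is called with `exist_ok=True`, running the helper again on an existing tree is harmless.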

Don't forget to change the download paths!!

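Example from the Webdriver_Downloader file (a raw string keeps backslashes such as the one in \test from being read as escape sequences):

profile.set_preference("browser.download.dir", r"C:\Users\e-jlfloresg\Desktop\Python-Requests-CENACE\SELENIUM\test downloads\PML\MTR")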
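
Hard-coding a Windows path in a regular string literal is error-prone because sequences such as `\t` are interpreted as escapes. As a sketch, the download preferences can be built from a path object instead. The helper `download_preferences` and the base path are hypothetical. `browser.download.folderList`, `browser.download.dir`, and `browser.helperApps.neverAsk.saveToDisk` are standard Firefox preference names, but whether the repository sets all of them is an assumption.

```python
from pathlib import PureWindowsPath


def download_preferences(base_dir, top_folder, subfolder):
    """Build Firefox download preferences for one target folder."""
    target = PureWindowsPath(base_dir) / top_folder / subfolder
    return {
        # 2 tells Firefox to use the custom directory given below.
        "browser.download.folderList": 2,
        "browser.download.dir": str(target),
        # Save CSV files without showing the download dialog.
        "browser.helperApps.neverAsk.saveToDisk": "text/csv",
    }


# A raw string keeps the backslashes literal; this base path is a placeholder.
prefs = download_preferences(r"C:\data\test downloads", "PML", "MTR")

# With Selenium, each entry would then be applied to the Firefox profile:
# for name, value in prefs.items():
#     profile.set_preference(name, value)
```

Changing the download location then means editing one base path rather than several string literals.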

