pavelchowdhury99/etl_for_web_scraping

Web scraping ETL pipeline in Python using multithreading

Objective

To create an ETL pipeline that scrapes a website, processes the scraped data and finally loads it into a database of choice.

Tools used

  • Python
    • requests, requests-html
    • beautifulsoup
    • concurrent.futures
    • sqlite3
    • pandas
    • re module
    • PyCharm IDE
  • Multithreading
  • OOP
  • SQL

Design Architecture

We create an ETL class that encapsulates the extract, transform and load logic for a single page. Once we have the list of all such pages, we scale up the ETL by creating one object of the ETL class per page and running the pipelines for all pages concurrently using multithreading.
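
The design above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the class name PageETL, the helper run_all, the rates table and the FAKE_PAGES data are all hypothetical, and the extract step reads from an in-memory dictionary instead of making a real HTTP request so that the sketch runs offline.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for real HTTP responses (e.g. requests.get(url).text),
# so the sketch runs without network access.
FAKE_PAGES = {
    "plaza/1": "<td>Car</td><td>45</td>",
    "plaza/2": "<td>Car</td><td>60</td>",
    "plaza/3": "<td>Car</td><td>75</td>",
}

# One lock serializes writes to the shared SQLite connection.
DB_LOCK = threading.Lock()


class PageETL:
    """Encapsulates extract, transform and load for a single page."""

    def __init__(self, page_id, conn):
        self.page_id = page_id
        self.conn = conn

    def extract(self):
        # Assumption: a real pipeline would fetch the page over HTTP here.
        return FAKE_PAGES[self.page_id]

    def transform(self, html):
        # Naive parsing for illustration only; real pages would call for
        # an HTML parser such as BeautifulSoup.
        cells = html.replace("</td>", "").split("<td>")[1:]
        return (self.page_id, cells[0], int(cells[1]))

    def load(self, row):
        with DB_LOCK:
            self.conn.execute("INSERT INTO rates VALUES (?, ?, ?)", row)
            self.conn.commit()

    def run(self):
        self.load(self.transform(self.extract()))
        return self.page_id


def run_all(page_ids, max_workers=4):
    """Run one PageETL per page in a thread pool and return the loaded rows."""
    # check_same_thread=False lets worker threads use this connection.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE rates (page TEXT, vehicle TEXT, fee INTEGER)")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all tasks to finish and re-raises any worker error.
        list(pool.map(lambda pid: PageETL(pid, conn).run(), page_ids))
    return conn.execute(
        "SELECT page, vehicle, fee FROM rates ORDER BY page"
    ).fetchall()
```

Calling run_all(sorted(FAKE_PAGES)) returns one row per page. Threads suit this workload because each pipeline spends most of its time waiting on network I/O; the lock around the insert is one simple way to keep concurrent writes to a single SQLite connection safe.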

Learnings from this exercise

  • Web scraping and website inspecting
  • Multithreading
  • ETL building
  • Creating Object-Oriented Programs
  • Creating partial functions with functools.partial
  • Storing data into RDBMS
  • Regex pattern building (Basics)
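
Two of the items above, partial functions and basic regex patterns, combine naturally with a thread pool. The sketch below is illustrative only: the pattern, the function extract_fee and the sample strings are assumptions made up for this example, not data or code taken from the project.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical pattern: a rupee amount such as "Rs. 45.00" or "Rs 1,250".
FEE_PATTERN = re.compile(r"Rs\.?\s*([\d,]+(?:\.\d+)?)")


def extract_fee(text, pattern, default=None):
    """Return the first amount matched by pattern in text, or default."""
    match = pattern.search(text)
    if match is None:
        return default
    return float(match.group(1).replace(",", ""))


# partial() fixes the pattern and default, leaving a one-argument function
# that can be handed directly to executor.map().
parse_fee = partial(extract_fee, pattern=FEE_PATTERN, default=0.0)

# Made-up sample strings for illustration.
samples = ["Car/Jeep/Van: Rs. 45.00", "Bus/Truck: Rs 1,250", "No fee listed"]

with ThreadPoolExecutor(max_workers=3) as pool:
    fees = list(pool.map(parse_fee, samples))

print(fees)  # [45.0, 1250.0, 0.0]
```

executor.map() passes exactly one item from the iterable to each call, so binding the remaining arguments with partial avoids writing a wrapper function or lambda for every configuration.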

References

  1. NHAI list of Toll Plazas.
  2. NHAI Rate Page for a Plaza.
  3. functools' official documentation
  4. concurrent module
  5. re module preliminary
  6. re module official documentation
  7. curl to code converter

Walkthrough

Webscraping ETL Pipeline in Python