Web scraping ETL pipeline in Python using multithreading

Objective

To create an ETL pipeline that scrapes a website, processes the scraped data, and finally loads it into a database of choice.

Tools used

Python
- requests, requests-html
- BeautifulSoup
- concurrent.futures
- sqlite3
- pandas
- re module
- PyCharm IDE
Multithreading
OOP
SQL

Design Architecture

We create an ETL class that encapsulates the extract, transform, and load logic for a single page. Once we have the list of all such pages, we scale up the ETL by creating one object of the ETL class per page and running the pipelines concurrently using multithreading.
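The design above can be sketched roughly as follows. This is a minimal illustration, not the code in etl.py: the class name `PageETL`, the `pages` table, the `fetch` callable, and the `fake_fetch` stand-in are all hypothetical. A real pipeline would presumably pass a function that downloads the page (for example with requests) instead of the stub.

```python
import re
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class PageETL:
    """Hypothetical ETL for a single page: extract -> transform -> load."""

    def __init__(self, page_id, fetch, conn, lock):
        self.page_id = page_id
        self.fetch = fetch  # assumed: callable that returns the page HTML as a str
        self.conn = conn    # SQLite connection shared by all threads
        self.lock = lock    # serializes writes on the shared connection

    def extract(self):
        return self.fetch(self.page_id)

    def transform(self, html):
        # Placeholder transformation: pull the <title> text out of the page.
        match = re.search(r"<title>(.*?)</title>", html, re.S)
        title = match.group(1).strip() if match else None
        return (self.page_id, title)

    def load(self, row):
        with self.lock:
            self.conn.execute(
                "INSERT OR REPLACE INTO pages (page_id, title) VALUES (?, ?)", row
            )
            self.conn.commit()

    def run(self):
        self.load(self.transform(self.extract()))
        return self.page_id


def run_all(page_ids, fetch, db_path=":memory:", max_workers=8):
    """Create one PageETL object per page and run them in a thread pool."""
    # check_same_thread=False lets worker threads use the same connection;
    # the lock above keeps their writes from interleaving.
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (page_id INTEGER PRIMARY KEY, title TEXT)"
    )
    lock = threading.Lock()
    jobs = [PageETL(pid, fetch, conn, lock) for pid in page_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces the iterator so any worker exception is raised here.
        list(pool.map(PageETL.run, jobs))
    return conn


def fake_fetch(page_id):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return f"<html><title> Plaza {page_id} </title></html>"


if __name__ == "__main__":
    conn = run_all([3, 1, 2], fake_fetch)
    for row in conn.execute("SELECT page_id, title FROM pages ORDER BY page_id"):
        print(row)
```

Threads suit this workload because each page spends most of its time waiting on the network. Sharing one SQLite connection behind a lock is only one option; opening a connection per thread would also work.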

Learnings from this exercise

Web scraping and website inspecting
Multithreading
ETL building
Creating Object-Oriented Programs
Creating partial functions with functools.partial
Storing data in an RDBMS
Regex pattern building (Basics)
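Two of the items above, partial functions and basic regex patterns, combine naturally with a thread pool, because `Executor.map` passes only one argument per item to the worker function. The sketch below is an illustration under assumptions: the pattern, the helper `extract_amounts`, and the sample strings are made up for this example and are not taken from the repository.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical pattern: an amount written as "Rs 65", "Rs. 65" or "Rs.65.50".
FEE_PATTERN = re.compile(r"Rs\.?\s*(\d+(?:\.\d+)?)")


def extract_amounts(text, pattern, cast):
    """Return every captured amount in text, converted with cast."""
    return [cast(value) for value in pattern.findall(text)]


# partial() fixes the pattern and cast arguments, leaving a one-argument
# function that can be handed directly to pool.map().
extract_fees = partial(extract_amounts, pattern=FEE_PATTERN, cast=float)

# Made-up sample strings standing in for scraped text.
samples = ["Car: Rs. 65, Bus: Rs 220.50", "No fee listed", "Truck: Rs.340"]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(extract_fees, samples))

print(results)
```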

Folders and files

.gitignore
README.md
etl.py
fetch_plaza_ids.py
main.py
nhai_info.db
nhai_scripts.js
requirements.txt
scratch_at_a_glace_page.py
scratch_individual_page.py

pavelchowdhury99/etl_for_web_scraping
