Scrapy_pollution

License: MIT

Here you can find my first web scraping project: scraping pollution data with Python and Scrapy.

⭐ Data analysis results:

  • Pollution level in PM2.5:

[plot: pollution level in PM2.5]

🔗 More details: https://github.com/lajobu/Scrapy_pollution/blob/master/Analysis.py
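
The chart above comes from Analysis.py. As a rough illustration of that step only, here is a minimal sketch with Pandas and Seaborn; the CSV path and the column names country and value are assumptions, not the repository's actual ones:

    # Minimal sketch of the analysis step, not the repository's exact code.
    # "Data/pollution.csv", "country" and "value" are hypothetical.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("Data/pollution.csv")
    # average PM2.5 measurement per country, highest first
    pm25 = df.groupby("country")["value"].mean().sort_values(ascending=False)

    sns.barplot(x=pm25.values, y=pm25.index)
    plt.xlabel("PM2.5")
    plt.tight_layout()
    plt.show()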

⭐ Details:

📍 Website: https://openaq.org/

📍 Code language: Python 3

📍 Scraper: Scrapy

📍 Libraries: NumPy, Pandas 🐼, Seaborn 📊, and Matplotlib

📍 Additional tools: Docker and scrapy-splash
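
scrapy-splash renders JavaScript-heavy pages through a Splash instance, which is typically started as a Docker container with $ docker run -p 8050:8050 scrapinghub/splash. Wiring it into a Scrapy project follows the standard configuration from the scrapy-splash documentation; the sketch below assumes Splash is running locally on the default port:

    # Standard scrapy-splash configuration for settings.py, as documented
    # by the scrapy-splash project; assumes a local Splash on port 8050.
    SPLASH_URL = "http://localhost:8050"

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    }

    SPIDER_MIDDLEWARES = {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    }

    DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"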

❓ What is web scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Source: 🔗 Wikipedia
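
In Scrapy, that automated process takes the form of a spider: a class that requests pages and pulls structured records out of the responses with CSS or XPath selectors. A minimal, self-contained sketch (the URL and selectors are illustrative, not the ones used in this project):

    # Illustrative spider, not one of this project's spiders.
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # yield one record per quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

It would be run with $ scrapy crawl example -o out.csv, the same pattern used in the user manual below.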

⭐ User manual:

☑️ 1) Spider to be run: link_country (a sketch of the full command sequence follows the steps)

☑️ 2) Spider to be run: pages

  • $ scrapy crawl pages -o Data/Links/pages.csv
  • It generates 🔗 pages.csv, script: 🔗 pages.py

☑️ 3) Spider to be run: pollution

☑️ 4) Python script to be run: Analysis.py
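
Putting the steps together, a full run would look roughly like this; the output paths for steps 1 and 3 are guesses modeled on step 2 and are not confirmed by the repository:

  • $ scrapy crawl link_country -o Data/Links/link_country.csv
  • $ scrapy crawl pages -o Data/Links/pages.csv
  • $ scrapy crawl pollution -o Data/pollution.csv
  • $ python3 Analysis.py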