GitHub - ivancho82/scraping-greenhouse: Web Scraping with selenium

Web Scraping with Selenium

Web scraping of job posting information with Selenium

Table of Contents

About The Project
- Built With
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
Contributing
License
Contact

About The Project

This project is a demo of the features Selenium provides for scraping information from the web.
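
As a rough illustration of the kind of Selenium usage this demo is about, the sketch below loads a page in Chrome and collects the text of every element matching a CSS selector. The function names, the Chrome flags, and the selector are assumptions made for illustration; they are not taken from scraping.py.

```python
from typing import List


def build_chrome_args(headless: bool = True) -> List[str]:
    """Return Chrome command-line flags (assumed values, not from the repo)."""
    args = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        args.append("--headless")
    return args


def scrape_texts(url: str, css_selector: str) -> List[str]:
    """Open `url` in Chrome and return the text of every matching element."""
    # Imported here so build_chrome_args can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    for arg in build_chrome_args():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        driver.quit()  # always release the browser process
```

A call such as `scrape_texts("https://example.com/jobs", "div.opening a")` would return the link texts on that (hypothetical) page; the selector has to be adapted to each recruitment site.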

What the project does:

Scrapes information from recruitment pages with Selenium.
Enriches the information with the Google Maps API.
Persists the information to a Postgres database.
Generates a file with the persisted information.
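
The enrichment step could look roughly like the sketch below. It assumes the Google Maps Geocoding API is the endpoint in use (the README only says "Google Maps API"), and `build_geocode_url` and `extract_lat_lng` are hypothetical helper names. The response shape follows the documented Geocoding JSON format.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

# Public Geocoding API endpoint; whether the project uses it is an assumption.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the request URL for a free-text address."""
    query = urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def extract_lat_lng(payload: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) of the first result, or None if nothing was found."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

The URL can be fetched with `urllib.request.urlopen` and decoded with `json.load`; keeping URL construction and response parsing separate makes both easy to check without network access.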

Built With

This project can be deployed in any environment that has Python 3.8; however, some libraries depend on additional OS components. It has been tested on Ubuntu 20.04.

python

Getting Started

The following steps are required to run the project on Ubuntu 20.04.

Prerequisites

python 3.8
postgres

Installation

Obtain an API key at https://console.cloud.google.com/apis/credentials to access the Google Maps API
Clone the repo

  git clone https://github.com/ivancho82/scraping-greenhouse.git

Install the system library dependencies

  sudo apt-get update
  sudo apt-get install -y curl unzip xvfb libxi6 libgconf-2-4
  sudo apt-get install libpq5
  sudo apt install ./chromedriver/google-chrome-stable_current_amd64.deb

Install the Python requirements

  pip install -r requirements.txt

Set the environment variables

  export KEY_GCP=GCP_API_KEY_GOOGLE_MAPS
  export PG_HOST=POSTGRES_HOST_URI_IP
  export PG_DATABASE=POSTGRES_DATABASE_NAME
  export PG_USER=POSTGRES_USER_WITH_DB_ACCESS
  export PG_PASSWORD=POSTGRES_USER_PASSWORD
  export PG_TABLE=POSTGRES_TABLE_NAME
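
One way to read these variables from Python is sketched below. `load_settings` and `REQUIRED_VARS` are hypothetical names, and failing fast on a missing variable is an assumption about desirable behavior, not a description of what connector.py does.

```python
import os
from typing import Dict, Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = (
    "KEY_GCP",
    "PG_HOST",
    "PG_DATABASE",
    "PG_USER",
    "PG_PASSWORD",
    "PG_TABLE",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, raising if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Checking all variables up front reports every missing name in one error instead of failing later, halfway through a scraping run.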

Configure the params.py file with the list of URLs to scrape

  self.pages = [
      "LIST",
      "PAGES",
      "TO SCRAP"
  ]
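
For context, the attribute above presumably lives inside a class constructor. The sketch below shows one possible shape; the class name `Params`, the example URLs, and the `invalid_pages` check are all hypothetical and may differ from the real params.py.

```python
from typing import List
from urllib.parse import urlparse


class Params:
    """Hypothetical container; the real class in params.py may differ."""

    def __init__(self) -> None:
        # Placeholder URLs; replace them with the job boards to scrape.
        self.pages: List[str] = [
            "https://example.com/careers/company-a",
            "https://example.com/careers/company-b",
        ]

    def invalid_pages(self) -> List[str]:
        """Return entries that are not absolute http(s) URLs."""
        bad = []
        for page in self.pages:
            parsed = urlparse(page)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                bad.append(page)
        return bad
```

Running `Params().invalid_pages()` before starting the browser is a cheap way to catch leftover placeholders such as "LIST" or "PAGES".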

Usage

The script generates a data.out file with the information persisted in the Postgres table

  python3 main.py

Roadmap

Add Changelog
Add docker compose
Add README in English

Contributing

Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Ivan Cuervo - @ivan_cuervo - ivan.cuervom@gmail.com

Project Link: https://github.com/ivancho82/scraping-greenhouse
