A microservice scraping GitHub repositories based on a specific topic

pcolt/playwright-scraper

NodeJS TypeScript Playwright MongoDB Redis ESLint Docker

GitHub's topics scraper with Playwright

A microservice that crawls and scrapes GitHub repositories tagged with a specific topic (e.g. climatechange). This project is part of my final project for the University of Helsinki's Full Stack Open course.

The service is subscribed to a Redis pub/sub message channel and starts a new scraping process whenever a message is received.
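
The subscribe-and-trigger behaviour described above might look roughly like the following sketch. The `Subscriber` interface, the `listenForScrapeRequests` function, and the idea that the message body carries the topic are all assumptions made for illustration. The interface is shaped after the `subscribe(channel, listener)` method of a node-redis v4 client, but the actual service may be wired differently.

```typescript
// Minimal shape of a pub/sub client (hypothetical, modelled on node-redis v4).
interface Subscriber {
  subscribe(channel: string, listener: (message: string) => void): Promise<void>;
}

// Subscribe to `channel` and start one scraping run for every message received.
// Assumption: the message body is the topic to scrape.
async function listenForScrapeRequests(
  subscriber: Subscriber,
  channel: string,
  scrape: (topic: string) => Promise<void>,
): Promise<void> {
  await subscriber.subscribe(channel, (message) => {
    // Log failures instead of letting a rejected promise crash the process.
    scrape(message).catch((error) => console.error("Scrape failed:", error));
  });
}
```

Passing the client and the scrape function in as parameters keeps the listener easy to exercise with a fake subscriber, without a running Redis instance.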

The microservice stores the results in a MongoDB Atlas database. The complete result is also written to local .json and .csv files.

For each repository found, the scraping process returns the following data:

  • owner
  • name
  • URL
  • number of stars
  • description
  • list of repository topics
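
As an illustration, the fields listed above could be modelled and written out as CSV along the following lines. The `ScrapedRepository` field names, the column order, and the `;` separator for topics are assumptions, not the project's actual schema.

```typescript
// One scraped repository; field names are assumptions based on the list above.
interface ScrapedRepository {
  owner: string;
  name: string;
  url: string;
  stars: number;
  description: string;
  topics: string[];
}

// Quote a CSV field when it contains a comma, a double quote, or a line break,
// doubling any embedded quotes (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Render repositories as CSV text with a header row; topics are joined with ";".
function toCsv(repos: ScrapedRepository[]): string {
  const header = "owner,name,url,stars,description,topics";
  const rows = repos.map((repo) =>
    [repo.owner, repo.name, repo.url, String(repo.stars), repo.description, repo.topics.join(";")]
      .map(csvField)
      .join(","),
  );
  return [header, ...rows].join("\n");
}
```

The same array can be serialized for the .json file with `JSON.stringify(repos, null, 2)`.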

Work hours

The approximate work hours spent developing the project are listed in workhours.md.

Installation

Run npm install

Configure secret/environment variables

  • In the root folder, create a .env file with the following keys:
MONGO_URL = 'mongodb+srv://fullstack:MONGODB_FULLSTACK_USER_PASSWORD@cluster0.ck2n2.mongodb.net/repos?retryWrites=true&w=majority'
REDIS_URL = 'redis://default:REDIS_DEFAULTUSER_PASSWORD@redis-12236.c300.eu-central-1-1.ec2.cloud.redislabs.com:12236'
  • Set the sensitive data as Fly.io secrets with the commands:
    fly secrets set MONGO_URL='mongodb+srv://fullstack:MONGODB_FULLSTACK_USER_PASSWORD@cluster0.ck2n2.mongodb.net/repos?retryWrites=true&w=majority'
    fly secrets set REDIS_URL='redis://default:REDIS_DEFAULTUSER_PASSWORD@redis-12236.c300.eu-central-1-1.ec2.cloud.redislabs.com:12236'
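
Because the service cannot work without both connection strings, it may be worth failing fast when one is missing. The following is a minimal sketch of such a check. `ServiceConfig` and `readConfig` are hypothetical names, and the project may read its configuration differently.

```typescript
// The two settings this README asks for (hypothetical container type).
interface ServiceConfig {
  mongoUrl: string;
  redisUrl: string;
}

// Read MONGO_URL and REDIS_URL from an environment-like object and throw a
// descriptive error if either one is missing or blank.
function readConfig(env: Record<string, string | undefined>): ServiceConfig {
  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value.trim() === "") {
      throw new Error(`Missing required environment variable ${key}`);
    }
    return value;
  };
  return { mongoUrl: required("MONGO_URL"), redisUrl: required("REDIS_URL") };
}
```

In the service this would typically be called once at startup as `readConfig(process.env)`, after the .env file has been loaded (for example with the dotenv package, if that is what the project uses).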

Usage

npm run build to compile the TypeScript .ts files located in /src
npm start to run the compiled files located in the ./build folder in dev mode
npm run dev to run the TypeScript files on the fly, reloading when something changes

Deploy to Fly.io

Check secrets: fly secrets list

Deploy to Fly: fly deploy or npm run deploy

Scale Fly app to 0 machines (stopped): fly scale count 0

Scale Fly app back to 1 machine: fly scale count 1

Show list of Fly apps currently deployed: fly apps list

Show logs from all machines (or filter by id with the -i flag): fly logs

Restart machine: fly machine restart

Docker

Fly.io uses the Docker image to deploy this microservice.
The same image can also be built and run directly for debugging.

Build Docker image: docker build . -t scraper

Run Docker image: docker run --env MONGO_URL='MONGO_URL_in_.ENV_FILE' --env REDIS_URL='REDIS_URL_in_.ENV_FILE' scraper

List all Docker containers: docker ps -a
Restart a container: docker restart [container-id]
Follow container logs: docker logs --follow [container-id]

See also: Docker best practices.

Git

Print a list of all commits to a .txt file (Docs)

git log --reverse --pretty=format:'| %as | 1 | %s |' > log.txt

Dependencies

MongoDB Atlas

Connect via web app

https://account.mongodb.com/

Redis Cloud

Connect via web app

https://app.redislabs.com/

Connect via terminal

Use the Connect button in the web app, which provides a command like this: redis-cli -u redis://default:REDIS_DEFAULTUSER_PASSWORD@redis-12236.c300.eu-central-1-1.ec2.cloud.redislabs.com:12236

Once you are connected, list the active pub/sub channels with: PUBSUB CHANNELS
