Open Edu Hub Search ETL

This backend contains everything for the ETL process (Scrapy, Postgres, Elasticsearch).

Step 1: Project Setup - Python (manual approach)

sudo apt install python3-dev python3-pip python3-venv libpq-dev -y
python3 -m venv .venv

source .venv/bin/activate (on Linux/Unix)

.venv\Scripts\activate.bat (on Windows)

pip3 install -r requirements.txt
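As an optional sanity check (not part of the official setup steps), you can confirm from inside the activated virtual environment that Scrapy was installed correctly:

python3 -c "import scrapy; print(scrapy.__version__)"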

Step 1 (alternative): Project Setup - Python (automated, via Poetry)

  • Step 1: Make sure that you have Poetry v1.5.0+ installed
  • Step 2: Open your terminal in the project root directory:
    • Step 2.1: (optional, strictly a personal preference) If you want your .venv to be created in the project root directory:
      • poetry config virtualenvs.in-project true
  • Step 3: Install dependencies (according to pyproject.toml) by running: poetry install

Step 2: Project Setup - required Docker Containers

If you have Docker installed, run docker-compose up to start the multi-container setup for Splash and the Playwright integration.

As a last step, set up your configuration variables by copying the .env.example file and modifying it where necessary:

cp converter/.env.example converter/.env
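For illustration only, the following sketch shows roughly how such a .env file can be consumed from Python via python-dotenv; the variable name EDU_SHARING_BASE_URL is a placeholder, and the project's actual loading mechanism and keys are defined by converter/.env.example and the converter code:

# illustrative sketch only -- the project may load its settings differently
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv("converter/.env")  # read key=value pairs into the process environment
# placeholder key name; check converter/.env.example for the keys that actually exist
base_url = os.getenv("EDU_SHARING_BASE_URL", "")
print("edu-sharing endpoint:", base_url or "<not configured>")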

Running crawlers

  • A crawler can be run with scrapy crawl <spider-name>.
    • (This assumes that an edu-sharing v6.0+ instance that can accept the data is configured in your .env settings.)
  • If a crawler has Scrapy spider contracts implemented, you can test them by running scrapy check <spider-name> (see the example below).
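To illustrate what such a contract looks like (a generic example, not taken from one of this project's spiders): a Scrapy contract is an annotation block inside a callback's docstring that scrapy check fetches and verifies:

# generic Scrapy contract example -- not one of this project's spiders
import scrapy

class ContractDemoSpider(scrapy.Spider):
    name = "contract_demo"

    def parse(self, response):
        """Extract the page title of the example page.

        @url https://example.org
        @returns items 1 1
        @returns requests 0 0
        @scrapes title
        """
        yield {"title": response.css("title::text").get()}

Running scrapy check contract_demo fetches the @url, executes parse, and asserts that exactly one item is returned, no further requests are scheduled, and the item contains a title field.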

Running crawlers via Docker

git clone https://github.com/openeduhub/oeh-search-etl
cd oeh-search-etl
cp converter/.env.example .env
# modify .env with your edu sharing instance
export CRAWLER=your_crawler_id_spider # e.g. wirlernenonline_spider
docker compose build scrapy
docker compose up

Building a Crawler

  • We use Scrapy as our framework. Please check out the Scrapy spider tutorial (https://docs.scrapy.org/en/latest/intro/tutorial.html).
  • To create a new spider, create a file at converter/spiders/<myname>_spider.py.
  • We recommend inheriting from the LomBase class in order to get out-of-the-box support for our metadata model.
  • You may also inherit from a more specific base class for crawling data: if your site provides LRMI metadata, LrmiBase is a good start; if your system provides an OAI interface, you may use OAIBase.
  • As a sample/template, please take a look at sample_spider.py; a minimal skeleton is also sketched after this list.
  • To learn more about the LOM standard we're using, you'll find useful information at https://en.wikipedia.org/wiki/Learning_object_metadata
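To make this concrete, here is a minimal skeleton for converter/spiders/<myname>_spider.py. It is only a sketch: the LomBase import path and the overridden attribute and method names (friendlyName, version, getId, getHash) are assumptions based on the conventions described above, so treat sample_spider.py as the authoritative template.

# converter/spiders/myname_spider.py -- minimal sketch; import path, attribute and
# method names are assumptions, see sample_spider.py for the real template
import scrapy

from converter.spiders.base_classes import LomBase  # assumed import path


class MynameSpider(scrapy.Spider, LomBase):
    name = "myname_spider"                            # used by "scrapy crawl myname_spider"
    friendlyName = "My Source"                        # human-readable source name (illustrative)
    start_urls = ["https://example.org/sitemap.xml"]  # placeholder entry point
    version = "0.1.0"                                 # bump to re-process already stored items

    def getId(self, response=None) -> str:
        # a stable, source-specific identifier for the crawled item
        return response.url

    def getHash(self, response=None) -> str:
        # change this value whenever the item or the crawler version changes
        return f"{self.version}_{response.url}"

    def parse(self, response):
        # delegate to LomBase, which assembles the LOM metadata item
        return LomBase.parse(self, response)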

Still have questions? Check out our GitHub Wiki!

If you need help getting started or setting up your work environment, please don't hesitate to visit our GitHub Wiki at https://github.com/openeduhub/oeh-search-etl/wiki
