Merge pull request #3 from mogita/feat/postgres
Feat: postgres
mogita committed Jan 3, 2022
2 parents 73c7203 + e5f891b commit fac0778
Showing 19 changed files with 698 additions and 172 deletions.
2 changes: 2 additions & 0 deletions .dockerignore
@@ -0,0 +1,2 @@
_pg
redis-data
22 changes: 22 additions & 0 deletions Dockerfile.migrate
@@ -0,0 +1,22 @@
FROM python:3.8-bullseye

# Comment out the following line if you don't need an alternative Debian package source
RUN sed -i "s@http://deb.debian.org@http://mirrors.aliyun.com@g" /etc/apt/sources.list && rm -Rf /var/lib/apt/lists/* && apt-get update
RUN apt-get update
RUN apt-get install -y git

WORKDIR /app

RUN wget https://github.com.cnpmjs.org/golang-migrate/migrate/releases/download/v4.15.1/migrate.linux-amd64.tar.gz -O migrate.linux-amd64.tar.gz
RUN tar -xvf migrate.linux-amd64.tar.gz -C /usr/local/bin
RUN chmod a+x /usr/local/bin/migrate

RUN git clone https://github.com.cnpmjs.org/vishnubob/wait-for-it.git /wait-for-it
RUN chmod a+x /wait-for-it/wait-for-it.sh

COPY migrate.sh /app/migrate.sh
RUN chmod a+x /app/migrate.sh

COPY migrations /app/migrations

CMD ["/wait-for-it/wait-for-it.sh", "douban-crawler-db:5432", "--", "/app/migrate.sh"]
56 changes: 48 additions & 8 deletions Readme.md
@@ -15,6 +15,29 @@ This crawler depends on the [proxy_pool](https://github.com/jhao104/proxy_pool)

## Steps

### Without the Proxy Pool

1. Create a `.env` file under the project root and add the following line:

```bash
WITHOUT_PROXY=yes
```

2. Build and start

```bash
# You can add `--no-cache` to always build a clean image
docker-compose build

# You can add `--force-recreate` if you want to recreate the container even when
# the configuration or the image hasn't changed.
docker-compose up -d
```

> Not using proxies might lead to 403 error responses from the source site.
### With the Proxy Pool

1. Free IPs just don't work most of the time. It's highly recommended that you choose a paid proxy provider and tweak the code under the `proxy_pool` directory to override the default behavior and suit your needs. Take Zhima (芝麻) HTTP Proxy for example: create a `.env` file and put the API endpoint into it:

```env
@@ -43,10 +66,11 @@ docker-compose up -d
## Prerequisites

- Python 3 with `pip`
- PostgreSQL
- Redis
- [proxy_pool](https://github.com/jhao104/proxy_pool)

> It's recommended to use Virtualenv or Anaconda to handle the environment.
> It might be more convenient to use Virtualenv or Anaconda to manage the environment, but this varies from case to case, so make sure you know what you're dealing with before going ahead.
## Steps

@@ -55,13 +79,21 @@ docker-compose up -d
Edit `.env` file to set the proper environment variables:

```bash
# Adding the following line will make the scripts show verbose logs
DEBUG=yes

# As I'm using Zhima HTTP Proxy I'll put the API here so proxy_pool/fetcher knows
# where to get new IPs to refresh the pool.
ZHIMA_PROXY_URL="https://..."

# Put the host name and port (if needed) here for the "proxy_pool" instance so this
# crawler knows where the pool is.
PROXY_POOL_HOST="https://localhost:5010"

# If you don't need the proxy pool at all, e.g. you want the scripts to make
# requests directly from your network, you can add the following line and
# skip to step 2
WITHOUT_PROXY=yes
```

2. Install dependencies.
@@ -70,20 +102,28 @@ PROXY_POOL_HOST="https://localhost:5010"
pip install -r requirements.txt
```

3. Run `get_tags` to fetch all the trending tags.
3. Migrate the database schema

First, install the `golang-migrate/migrate` tool to get the `migrate` command. Follow the installation guide here: [`migrate CLI`](https://github.com/golang-migrate/migrate/tree/master/cmd/migrate).

Then run the migration against your database (change the `user`, `pass`, and/or hostname and port accordingly):

```bash
# This will generate a file named tags.csv under the specified `output` directory
PROXY_POOL_HOST=https://<host-of-step-1>... python app.py get_tags -o /your-output-dir
migrate -database "postgres://user:pass@localhost:5432/crawler?sslmode=disable" -path migrations up
```

4. Run `crawl_books` to start crawling by the tags given in the `csv` file.
4. Run the scripts in the following sequence:

```bash
PROXY_POOL_HOST=https://<host-of-step-1>... python app.py crawl_books -i /some-where/tags.csv -o /your-output-dir
```
# First, fetch as many tags as possible
python app.py get_tags

# Second, iterate through tags and fetch the links to the books
python app.py get_book_links

> Certainly, you can create the tags.csv without using the `get_tags` script. Just make sure the tags you enter actually return data.
# Lastly, start crawling books from the links
python app.py crawl_books
```
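
For reference (not part of this commit), here is a minimal sketch of how a script could honor `WITHOUT_PROXY` and pick up a proxy per request from the pool configured in step 1. The `/get/` endpoint and its JSON shape are assumptions based on the upstream proxy_pool project, and `build_proxies` is a hypothetical helper:

```python
from os import environ as env

import requests
from dotenv import load_dotenv

load_dotenv()


def build_proxies():
    """Return a requests-style proxies dict, or None when WITHOUT_PROXY=yes."""
    if env.get("WITHOUT_PROXY") == "yes":
        return None
    pool_host = env.get("PROXY_POOL_HOST", "http://localhost:5010")
    # Assumption: proxy_pool exposes /get/ and answers with JSON like {"proxy": "ip:port"}
    proxy = requests.get(f"{pool_host}/get/", timeout=5).json().get("proxy")
    if not proxy:
        return None
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


resp = requests.get("https://book.douban.com/tag/", proxies=build_proxies(), timeout=10)
print(resp.status_code)
```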

# License

6 changes: 5 additions & 1 deletion app.py
@@ -1,15 +1,19 @@
from importlib import import_module
from os import environ as env
from dotenv import load_dotenv
import sys
import logging
import traceback

load_dotenv()
debug_mode = True if env.get("DEBUG") == "yes" else False

logging.basicConfig(
level=logging.INFO,
format="%(asctime)s.%(msecs)03d - %(levelname)s: %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
logging.root.setLevel(logging.INFO)
logging.root.setLevel(logging.DEBUG if debug_mode else logging.INFO)


def main():
37 changes: 34 additions & 3 deletions docker-compose.yml
@@ -6,15 +6,46 @@ services:
build:
context: "."
restart: "no"
depends_on:
- db
environment:
PROXY_POOL_HOST: http://douban-crawler-proxy-pool:5010
volumes:
- ./_data:/data
DB_HOST: douban-crawler-db
DB_USER: crawler
DB_PASS: crawler
DB_NAME: crawler
networks:
- douban-crawler
- douban-crawler-db

db:
container_name: douban-crawler-db
restart: unless-stopped
image: postgis/postgis:14-master
expose:
- 5432
volumes:
- ./_pg:/var/lib/postgresql/data
environment:
POSTGRES_USER: crawler
POSTGRES_PASSWORD: crawler
POSTGRES_DB: crawler
networks:
- douban-crawler-db

migration:
container_name: douban-crawler-migration
build:
dockerfile: Dockerfile.migrate
context: "."
depends_on:
- db
networks:
- douban-crawler-db

networks:
douban-crawler:
douban-crawler-db:

volumes:
_data:
_pg:
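
For reference (outside this diff), a minimal sketch of how the crawler container could turn the `DB_*` variables above into a database connection, using the SQLAlchemy and psycopg2 packages already listed in requirements.txt; the helper name and the hard-coded port are assumptions:

```python
from os import environ as env

from sqlalchemy import create_engine, text


def make_engine():
    """Build a SQLAlchemy engine from the DB_* variables set in docker-compose.yml."""
    user = env.get("DB_USER", "crawler")
    password = env.get("DB_PASS", "crawler")
    host = env.get("DB_HOST", "localhost")
    name = env.get("DB_NAME", "crawler")
    return create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:5432/{name}")


with make_engine().connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())
```
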
6 changes: 6 additions & 0 deletions migrate.sh
@@ -0,0 +1,6 @@
#!/bin/sh

echo "running migration..."

/usr/local/bin/migrate -database "postgres://crawler:crawler@douban-crawler-db:5432/crawler?sslmode=disable&search_path=public" -path /app/migrations up

1 change: 1 addition & 0 deletions migrations/000001_create_books_table.down.sql
@@ -0,0 +1 @@
DROP TABLE IF EXISTS books;
31 changes: 31 additions & 0 deletions migrations/000001_create_books_table.up.sql
@@ -0,0 +1,31 @@
CREATE TABLE IF NOT EXISTS books (
id bigserial PRIMARY KEY,
title text DEFAULT '',
subtitle text DEFAULT '',
author text DEFAULT '',
author_url text DEFAULT '',
author_intro text DEFAULT '',
publisher text DEFAULT '',
published_at timestamp without time zone DEFAULT NULL,
original_title text DEFAULT '',
translator text DEFAULT '',
producer text DEFAULT '',
series text DEFAULT '',
price text DEFAULT '',
isbn text DEFAULT '',
pages int DEFAULT 0,
bookbinding text DEFAULT '',
book_intro text DEFAULT '',
toc text DEFAULT '',
rating real DEFAULT 0.0,
rating_count int DEFAULT 0,
cover_img_url text DEFAULT '',
origin text DEFAULT '',
origin_id text DEFAULT '',
origin_url text UNIQUE DEFAULT '',
crawled boolean DEFAULT False,
created_at timestamp without time zone default (now() at time zone 'utc'),
updated_at timestamp without time zone default (now() at time zone 'utc'),
deleted_at timestamp without time zone default NULL
);
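
Since `origin_url` is UNIQUE, crawled records can presumably be written idempotently with an upsert. A hypothetical sketch (not part of this commit), assuming the SQLAlchemy setup shown earlier:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://crawler:crawler@localhost:5432/crawler")

book = {
    "title": "Example Book",
    "isbn": "9781234567890",
    "origin": "douban",
    "origin_id": "123456",
    "origin_url": "https://book.douban.com/subject/123456/",
}

# Insert a new row, or refresh the existing one keyed on the UNIQUE origin_url.
upsert = text("""
    INSERT INTO books (title, isbn, origin, origin_id, origin_url, crawled)
    VALUES (:title, :isbn, :origin, :origin_id, :origin_url, TRUE)
    ON CONFLICT (origin_url) DO UPDATE
    SET title = EXCLUDED.title,
        crawled = TRUE,
        updated_at = (now() at time zone 'utc')
""")

with engine.begin() as conn:
    conn.execute(upsert, book)
```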

1 change: 1 addition & 0 deletions migrations/000002_create_tags_table.down.sql
@@ -0,0 +1 @@
DROP TABLE IF EXISTS tags;
8 changes: 8 additions & 0 deletions migrations/000002_create_tags_table.up.sql
@@ -0,0 +1,8 @@
CREATE TABLE IF NOT EXISTS tags (
id bigserial PRIMARY KEY,
name text UNIQUE NOT NULL,
current_page bigint DEFAULT 0,
created_at timestamp without time zone default (now() at time zone 'utc'),
updated_at timestamp without time zone default (now() at time zone 'utc'),
deleted_at timestamp without time zone default NULL
);
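
The `current_page` column suggests the crawler can resume paging through a tag's listing across runs. A hypothetical sketch of that bookkeeping (the resume logic is an assumption):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://crawler:crawler@localhost:5432/crawler")

with engine.begin() as conn:
    tag = conn.execute(text(
        "SELECT id, name, current_page FROM tags WHERE deleted_at IS NULL ORDER BY id LIMIT 1"
    )).first()
    if tag:
        next_page = tag.current_page + 1
        # ... fetch page `next_page` of the listing for tag.name here ...
        conn.execute(
            text("UPDATE tags SET current_page = :page, "
                 "updated_at = (now() at time zone 'utc') WHERE id = :id"),
            {"page": next_page, "id": tag.id},
        )
```
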
7 changes: 3 additions & 4 deletions proxy_pool/setting.py
@@ -50,9 +50,8 @@

# ############# proxy validator #################
# 代理验证目标网站
HTTP_URL = "http://httpbin.org"

HTTPS_URL = "https://www.qq.com"
HTTP_URL = "http://www.baidu.com"
HTTPS_URL = "https://www.baidu.com"

# 代理验证时超时时间
VERIFY_TIMEOUT = 3
@@ -64,7 +63,7 @@
# MAX_FAIL_RATE = 0.1

# proxyCheck时代理数量少于POOL_SIZE_MIN触发抓取
POOL_SIZE_MIN = 10
POOL_SIZE_MIN = 5

# ############# scheduler config #################

7 changes: 7 additions & 0 deletions requirements.txt
@@ -3,13 +3,20 @@ bs4==0.0.1
certifi==2021.10.8
charset-normalizer==2.0.9
idna==3.3
importlib-metadata==4.10.0
importlib-resources==5.4.0
Mako==1.1.6
MarkupSafe==2.0.1
numpy==1.21.4
pandas==1.3.5
psycopg2==2.9.2
python-dateutil==2.8.2
python-dotenv==0.19.2
pytz==2021.3
requests==2.26.0
six==1.16.0
soupsieve==2.3.1
SQLAlchemy==1.4.28
urllib3==1.26.7
xpinyin==0.7.6
zipp==3.6.0
