Football RSS scraper/analyzer/website


README

TL;DR

This is an experimental system for learning about current natural language processing libraries, especially those for Japanese.

This repository contains the following scripts:

  1. Scrape football articles from RSS feeds with Scrapy (Python)
  2. Extract contents (main body text and primary image) from HTML
  3. Tokenize (using Janome for Japanese, NLTK for English)
  4. Calculate similarity (using gensim Doc2Vec)
  5. Website (using PHP)
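
To make steps 3 and 4 concrete, here is a minimal, standard-library-only sketch of "tokenize, then rank articles by similarity". It is not the repository's code: the regex tokenizer stands in for Janome/NLTK, bag-of-words cosine similarity stands in for gensim Doc2Vec, and the function names (`tokenize`, `cosine_similarity`, `most_similar`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Stand-in for step 3; the repository uses Janome (Japanese) and NLTK (English).
    """
    return re.findall(r"\w+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors.

    Stand-in for step 4; the repository uses gensim Doc2Vec instead.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, articles):
    """Return (title, score) pairs for all other articles, best match first.

    `articles` maps a title to its extracted body text.
    """
    target_tokens = tokenize(articles[target])
    scores = [
        (title, cosine_similarity(target_tokens, tokenize(body)))
        for title, body in articles.items()
        if title != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `most_similar(title, articles)` with a dict of extracted article bodies returns the other articles ordered by word overlap. Doc2Vec replaces the count vectors with learned document embeddings, but the ranking step has the same shape.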

The deployed site is https://the-football-spot.com/

Instance Configuration

Change Hostname

sudo hostname [instance alias in servers.json]

Install Linux Packages

sudo yum install git php71 php71-pdo php71-mysqlnd mysql-devel gcc gcc-c++ bzip2-devel readline-devel openssl-devel sqlite-devel mysql57 mysql57-devel libxml2-devel

mysql57 workaround

# work around http://mhag.hatenablog.com/entry/2017/10/25/145313
sudo vi /etc/ld.so.conf.d/mysql57-x86_64.conf
# change 56 -> 57
sudo ldconfig
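
If you prefer not to edit the file by hand, the "56 -> 57" change can be sketched as a small text substitution. This assumes the config file contains a library path with `mysql56` in it (for example `/usr/lib64/mysql56`); check the actual file contents first. The function name `switch_mysql_version` is hypothetical.

```python
import re


def switch_mysql_version(conf_text, old="56", new="57"):
    """Replace 'mysql<old>' with 'mysql<new>' in ld.so.conf-style text.

    Assumption: the file lists a path such as /usr/lib64/mysql56.
    Text without 'mysql<old>' is returned unchanged.
    """
    pattern = r"mysql" + re.escape(old) + r"\b"
    return re.sub(pattern, "mysql" + new, conf_text)
```

Writing the result back to `/etc/ld.so.conf.d/mysql57-x86_64.conf` requires root, and `sudo ldconfig` still has to be run afterwards.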

Install python

Install pyenv

curl -sL https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash

Then edit .bashrc

vi .bashrc

Paste the following

HISTTIMEFORMAT='%y/%m/%d %H:%M:%S '
HISTSIZE=100000

export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

CRAWLER_NOTIFY=0

Apply the settings

source .bashrc

pyenv is now available. Use it to install Python

pyenv install 3.5.4
pyenv global 3.5.4
pyenv rehash

Install pip and packages

curl -skL https://bootstrap.pypa.io/get-pip.py | python

pip install scrapy sqlalchemy slackweb python-dateutil feedparser mysqlclient extractcontent3 numpy Cython Pillow diskcache BeautifulSoup4 nltk
pip install dragnet
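
After installing, a quick way to confirm that everything is importable is to look each module up without importing it. The sketch below is not part of the repository. The module names are the usual import names for the packages listed above (for example `mysqlclient` is imported as `MySQLdb`, `Pillow` as `PIL`, `BeautifulSoup4` as `bs4`), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names for the pip packages installed above.
REQUIRED_MODULES = [
    "scrapy", "sqlalchemy", "slackweb", "dateutil", "feedparser",
    "MySQLdb", "extractcontent3", "numpy", "Cython", "PIL",
    "diskcache", "bs4", "nltk", "dragnet",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the pyenv-managed `python` so that it checks the same interpreter the crawler will use.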

Clone repository

Put the SSH deploy key registered with the repository (football_deploy) on the server.

chmod 600 ~/.ssh/football_deploy

Then edit the SSH config file

vi .ssh/config

as follows

Host github.com
  User git
  Port 22
  Hostname github.com
  IdentityFile ~/.ssh/football_deploy

Then set its permissions

chmod 600 .ssh/config
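
OpenSSH refuses private keys that other users can read, so it can help to verify both `chmod 600` steps before cloning. This is a minimal sketch, not part of the repository; `has_mode` and `check_ssh_files` are hypothetical helper names, and the paths are the ones used above.

```python
import os
import stat


def has_mode(path, expected=0o600):
    """Return True if the permission bits of `path` equal `expected`."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


def check_ssh_files(paths):
    """Return the paths that are missing or whose mode is not 600."""
    return [p for p in paths if not os.path.exists(p) or not has_mode(p)]


if __name__ == "__main__":
    ssh_files = [
        os.path.expanduser("~/.ssh/football_deploy"),
        os.path.expanduser("~/.ssh/config"),
    ]
    for path in check_ssh_files(ssh_files):
        print("fix permissions or create:", path)
```

An empty result means both files exist and are readable and writable by the owner only.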

Then create a directory for the repository and clone it.

sudo mkdir /var/repos
sudo chown ec2-user:ec2-user /var/repos
cd /var/repos/
git clone git@github.com:kent013/football.git
cd
ln -s /var/repos/football/ football

Modify DATABASE settings

cp football/football/settings-dist.py football/football/settings.py
vi football/football/settings.py

Create crawler log dir

sudo mkdir /var/log/crawler/
sudo chown ec2-user:ec2-user /var/log/crawler/

Test the settings by running the crawler

cd ~/football
scrapy crawl all