Football RSS scraper/analyzer/website



This is an experimental system for learning about current natural language processing libraries, especially those for Japanese.

This repository contains the following scripts:

  1. Scrape football articles from RSS feeds with Scrapy (Python)
  2. Extract content (main body text and primary image) from HTML
  3. Tokenize (using Janome for Japanese, NLTK for English)
  4. Calculate similarity (using gensim Doc2Vec)
  5. Website (using PHP)
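
To make the data flow of steps 3 and 4 concrete, here is a minimal sketch that uses standard-library stand-ins only: a regex tokenizer in place of Janome/NLTK, and bag-of-words cosine similarity in place of gensim Doc2Vec. It is not the code in this repository, and the function names `tokenize` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Stand-in for Janome (Japanese) / NLTK (English); a regex like this
    does not segment Japanese text correctly.
    """
    return re.findall(r"\w+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts.

    Stand-in for Doc2Vec: it compares word overlap, not learned vectors.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = tokenize("Arsenal beat Chelsea in the derby")
    b = tokenize("Chelsea lose the derby to Arsenal")
    c = tokenize("Transfer window opens next week")
    print(cosine_similarity(a, b))  # articles sharing words score higher
    print(cosine_similarity(a, c))  # articles sharing no words score 0.0
```

In the real pipeline the token lists would come from Janome or NLTK, and the similarity score from a trained Doc2Vec model rather than from raw word counts.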

The site deployed is

Instance Configuration

Change Hostname

sudo hostname [instance alias in servers.json]

Install Linux Packages

sudo yum install git php71 php71-pdo php71-mysqlnd mysql-devel gcc bzip2-devel readline-devel openssl-devel sqlite-devel mysql57 mysql57-devel gcc gcc-c++ libxml2-devel

mysql57 workaround

# work around
sudo vi /etc/
# change 56 -> 57
sudo ldconfig

Install python

Install pyenv

curl -sL | bash

Then edit .bashrc:

vi .bashrc

Paste the following lines:

HISTTIMEFORMAT='%y/%m/%d %H:%M:%S '

export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"


Apply the settings:

source .bashrc

pyenv is now available. Use it to install Python:

pyenv install 3.5.4
pyenv global 3.5.4
pyenv rehash
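
As a quick sanity check that the shell now picks up the pyenv-managed interpreter, a small script along these lines can be run with `python`. The helper name `python_at_least` is hypothetical, and the 3.5 minimum is taken from the version installed above.

```python
import sys


def python_at_least(minimum=(3, 5)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    # sys.executable should point inside ~/.pyenv if pyenv is active.
    print("Interpreter:", sys.executable)
    print("Version:", ".".join(map(str, sys.version_info[:3])))
    print("OK" if python_at_least() else "Older than 3.5; check `pyenv global`")
```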

Install PIP/packages

curl -skL | python

pip install scrapy sqlalchemy slackweb python-dateutil feedparser mysqlclient extractcontent3 numpy Cython Pillow diskcache BeautifulSoup4 nltk
pip install dragnet
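
To confirm the packages installed cleanly, the following sketch lists any module that the interpreter cannot find. The helper `missing_modules` is hypothetical. The import names are given from memory of each package and should be treated as assumptions. Janome and gensim are named in the script list but do not appear in the pip commands above, so add `janome` and `gensim` to the list if your setup uses them.

```python
import importlib.util

# Import names differ from some pip package names
# (python-dateutil -> dateutil, mysqlclient -> MySQLdb,
#  Pillow -> PIL, BeautifulSoup4 -> bs4).
REQUIRED_MODULES = [
    "scrapy", "sqlalchemy", "slackweb", "dateutil", "feedparser",
    "MySQLdb", "extractcontent3", "numpy", "Cython", "PIL",
    "diskcache", "bs4", "nltk", "dragnet",
]


def missing_modules(names):
    """Return the module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```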

Clone repository

Put the SSH key that is registered with the repository (football_deploy) on the server.

chmod 600 ~/.ssh/football_deploy

Then edit the SSH config file

vi .ssh/config

as follows

  User git
  Port 22
  IdentityFile ~/.ssh/football_deploy

Then restrict the permissions of the config file

chmod 600 .ssh/config
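
SSH may refuse to use a private key that other users can read, so it can be worth verifying the result of the two `chmod 600` commands. This is a minimal sketch; the helper name `is_private` is hypothetical, and the two paths are the ones used above.

```python
import os
import stat


def is_private(path):
    """Return True if group and others have no permission bits on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


if __name__ == "__main__":
    for name in ("~/.ssh/football_deploy", "~/.ssh/config"):
        path = os.path.expanduser(name)
        if not os.path.exists(path):
            print(name, "not found")
        else:
            print(name, "ok" if is_private(path) else "too open; run chmod 600")
```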

Then create a directory for the repository and clone it.

sudo mkdir /var/repos
sudo chown ec2-user:ec2-user /var/repos
cd /var/repos/
git clone
ln -s /var/repos/football/ ~/football

Modify DATABASE settings

cp football/football/ football/football/
vi football/football/
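
The settings file is not reproduced here, so its exact shape is unknown. Purely as an illustration, assuming the settings are a dict with the hypothetical keys `username`, `password`, `host`, `port`, and `database`, a SQLAlchemy-style MySQL URL for the `mysqlclient` driver could be assembled like this:

```python
from urllib.parse import quote_plus

# Hypothetical placeholder values; the real keys and values live in the
# repository's settings file, which is not reproduced here.
DATABASE = {
    "username": "football",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 3306,
    "database": "football",
}


def build_mysql_url(cfg):
    """Build a `mysql+mysqldb://` URL, percent-encoding the credentials."""
    template = "mysql+mysqldb://{user}:{pw}@{host}:{port}/{db}?charset=utf8mb4"
    return template.format(
        user=quote_plus(cfg["username"]),
        pw=quote_plus(cfg["password"]),
        host=cfg["host"],
        port=cfg["port"],
        db=cfg["database"],
    )


if __name__ == "__main__":
    print(build_mysql_url(DATABASE))
```

Percent-encoding the credentials matters when a password contains characters such as `@` or `/`, which would otherwise be read as URL delimiters. The resulting string is the kind of value `sqlalchemy.create_engine` accepts.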

Create crawler log dir

sudo mkdir /var/log/crawler/
sudo chown ec2-user:ec2-user /var/log/crawler/

Test the settings by running the crawler

cd ~/football
scrapy crawl all