This is an experimental system for learning about current natural language processing libraries, especially those for Japanese.
This repository contains the following scripts:
- Scrape football articles from RSS feeds with Scrapy (Python)
- Extract content (main body text and primary image) from HTML
- Tokenize (using Janome for Japanese, NLTK for English)
- Calculate similarity (using gensim Doc2Vec)
- Website (using PHP)
The deployed site is https://the-football-spot.com/
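To make the data flow between the stages listed above concrete, here is a minimal standard-library sketch of extract, tokenize, and compare. It is not the code in this repository: the `<p>`-based extractor stands in for the real content extraction, the regex tokenizer stands in for NLTK (Japanese text needs a morphological analyzer such as Janome), and bag-of-words cosine similarity stands in for gensim Doc2Vec. All function and class names are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect text inside <p> tags (crude stand-in for body extraction).

    Assumes well-formed HTML in which every <p> has a closing </p>.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <p> elements are currently open
        self.chunks = []    # text fragments found inside <p> elements

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_body(html):
    """Return the text of all <p> elements joined by single spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(c.strip() for c in parser.chunks if c.strip())


def tokenize(text):
    """Lowercased ASCII word tokens (stand-in for NLTK; not suitable for Japanese)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of token-count vectors (stand-in for Doc2Vec similarity)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def article_similarity(html_a, html_b):
    """Run both articles through extract -> tokenize -> compare."""
    return cosine_similarity(
        tokenize(extract_body(html_a)), tokenize(extract_body(html_b))
    )
```

Calling `article_similarity(html_a, html_b)` on two article pages returns a score between 0.0 (no shared words) and 1.0 (identical word counts). Unlike Doc2Vec, this stand-in only measures word overlap and knows nothing about meaning.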
sudo hostname [instance alias in servers.json]
sudo yum install git php71 php71-pdo php71-mysqlnd mysql-devel gcc gcc-c++ bzip2-devel readline-devel openssl-devel sqlite-devel mysql57 mysql57-devel libxml2-devel
# work around http://mhag.hatenablog.com/entry/2017/10/25/145313
sudo vi /etc/ld.so.conf.d/mysql57-x86_64.conf
# change 56 -> 57
sudo ldconfig
Install pyenv
curl -sL https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash
Then edit .bashrc:
vi .bashrc
Paste the following:
HISTTIMEFORMAT='%y/%m/%d %H:%M:%S '
HISTSIZE=100000
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
CRAWLER_NOTIFY=0
Apply the settings:
source .bashrc
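The .bashrc snippet above sets `CRAWLER_NOTIFY=0`. How the crawler consumes this value is not shown here; the following is a minimal sketch, assuming it is read from the process environment as an on/off switch for notifications, with `0` meaning off. The helper name `notifications_enabled` is hypothetical. Under that assumption, note that a plain `CRAWLER_NOTIFY=0` line is only a shell variable; it would need `export CRAWLER_NOTIFY=0` to be visible to child processes such as `scrapy`.

```python
import os


def notifications_enabled(environ=None):
    """Return True when CRAWLER_NOTIFY is set to something other than "" or "0".

    Hypothetical helper: assumes the flag is an environment variable and
    that "0" (or an unset/empty value) means notifications are disabled.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CRAWLER_NOTIFY", "0").strip()
    return raw not in ("", "0")
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to check in isolation, for example `notifications_enabled({"CRAWLER_NOTIFY": "1"})`.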
Now that pyenv is available, install Python:
pyenv install 3.5.4
pyenv global 3.5.4
pyenv rehash
curl -skL https://bootstrap.pypa.io/get-pip.py | python
pip install scrapy sqlalchemy slackweb python-dateutil feedparser mysqlclient extractcontent3 numpy Cython Pillow diskcache BeautifulSoup4 nltk
pip install dragnet
Copy the SSH key registered with the repository (football_deploy) to ~/.ssh/ on the server.
chmod 600 ~/.ssh/football_deploy
Then edit the SSH config file
vi .ssh/config
as follows:
Host github.com
User git
Port 22
Hostname github.com
IdentityFile ~/.ssh/football_deploy
Then set its permissions:
chmod 600 .ssh/config
Then create a directory for the repository and clone it.
sudo mkdir /var/repos
sudo chown ec2-user:ec2-user /var/repos
cd /var/repos/
git clone git@github.com:kent013/football.git
cd
ln -s /var/repos/football/ football
cp football/football/settings-dist.py football/football/settings.py
vi football/football/settings.py
sudo mkdir /var/log/crawler/
sudo chown ec2-user:ec2-user /var/log/crawler/
cd ~/football
scrapy crawl all