README

TL;DR

This is an experimental system for learning about current natural language processing libraries, especially those for Japanese.

This repository contains the following components:

Scrape football articles from RSSs with Scrapy (python)
Extract content (main body text and primary image) from HTML
Tokenize (using Janome for Japanese, NLTK for English)
Calculate similarity (using gensim Doc2Vec)
Website (using PHP)
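The tokenize and similarity steps can be pictured with a small stand-in. The sketch below is not the repository's code: it uses a whitespace tokenizer in place of Janome/NLTK and bag-of-words cosine similarity in place of gensim Doc2Vec, only to show the shape of the pipeline (tokens in, a similarity score out). The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Stand-in tokenizer: lowercase and split on whitespace.

    The real pipeline uses Janome (Japanese) or NLTK (English).
    """
    return text.lower().split()


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Two articles that share most of their vocabulary score close to 1, and articles with no words in common score 0.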

The deployed site is https://the-football-spot.com/

Instance Configuration

Change Hostname

sudo hostname [instance alias in servers.json]

Install Linux Packages

sudo yum install git php71 php71-pdo php71-mysqlnd mysql-devel gcc gcc-c++ bzip2-devel readline-devel openssl-devel sqlite-devel mysql57 mysql57-devel libxml2-devel

mysql57 workaround

# workaround described at http://mhag.hatenablog.com/entry/2017/10/25/145313
sudo vi /etc/ld.so.conf.d/mysql57-x86_64.conf
# change 56 to 57 in the library path
sudo ldconfig
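The manual edit above can also be scripted. This is a minimal sketch, assuming the file contains a library path with the string mysql56 that should read mysql57; the function names are hypothetical and the file contents are not confirmed by this README.

```python
def patch_ld_conf(text):
    """Point the loader config at the MySQL 5.7 library directory.

    Assumption: the path contains the substring "mysql56".
    """
    return text.replace("mysql56", "mysql57")


def patch_file(path):
    """Rewrite the file in place; return True if anything changed."""
    with open(path) as f:
        original = f.read()
    patched = patch_ld_conf(original)
    if patched != original:
        with open(path, "w") as f:
            f.write(patched)
    return patched != original
```

Calling patch_file("/etc/ld.so.conf.d/mysql57-x86_64.conf") needs root privileges, and sudo ldconfig still has to be run afterwards.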

Install python

Install pyenv

curl -sL https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash

Then edit .bashrc:

vi .bashrc

Paste the following:

HISTTIMEFORMAT='%y/%m/%d %H:%M:%S '
HISTSIZE=100000

export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

CRAWLER_NOTIFY=0

Apply the settings:

source .bashrc

Now that pyenv is available, install Python:

pyenv install 3.5.4
pyenv global 3.5.4
pyenv rehash

Install PIP/packages

curl -skL https://bootstrap.pypa.io/get-pip.py | python

pip install scrapy sqlalchemy slackweb python-dateutil feedparser mysqlclient extractcontent3 numpy Cython Pillow diskcache BeautifulSoup4 nltk
pip install dragnet

Clone repository

Copy the SSH key registered with the repository (football_deploy) to the server.

chmod 600 ~/.ssh/football_deploy

Then edit the SSH config file

vi .ssh/config

as follows:

Host github.com
  User git
  Port 22
  Hostname github.com
  IdentityFile ~/.ssh/football_deploy

Then set its permissions:

chmod 600 .ssh/config

Then create a directory for the repository and clone it.

sudo mkdir /var/repos
sudo chown ec2-user:ec2-user /var/repos
cd /var/repos/
git clone git@github.com:kent013/football.git
cd
ln -s /var/repos/football/ football

Modify DATABASE settings

cp football/football/settings-dist.py football/football/settings.py
vi football/football/settings.py
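The contents of settings-dist.py are not shown here, so the following is only a sketch of how database settings typically become an SQLAlchemy connection URL for the mysqlclient driver. The DATABASE dict, its keys, and its values are hypothetical placeholders; use whatever names settings-dist.py actually defines.

```python
from urllib.parse import quote_plus

# Hypothetical settings; the real keys live in settings-dist.py.
DATABASE = {
    "user": "football",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 3306,
    "name": "football",
}


def build_mysql_url(db):
    """Build an SQLAlchemy URL for the mysqlclient (MySQLdb) driver."""
    template = "mysql+mysqldb://{user}:{password}@{host}:{port}/{name}?charset=utf8mb4"
    return template.format(
        user=quote_plus(db["user"]),
        password=quote_plus(db["password"]),  # escape '@', '/', and similar
        host=db["host"],
        port=db["port"],
        name=db["name"],
    )
```

Escaping the password matters because characters such as @ or / would otherwise be read as part of the URL structure.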

Create crawler log dir

sudo mkdir /var/log/crawler/
sudo chown ec2-user:ec2-user /var/log/crawler/
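Before running the crawler, it can help to confirm that the log directory exists and is writable by the current user. A minimal check, using only the standard library (the function name is hypothetical):

```python
import os


def is_writable_dir(path):
    """Return True if path is an existing directory the current user can write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

For example, is_writable_dir("/var/log/crawler/") should return True for ec2-user once the two commands above have been run.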

Test settings through crawler

cd ~/football
scrapy crawl all
