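Our project crawls one week of news articles, then extracts the top keywords for each category (politics, economy, society) and summarizes the articles.
The top keywords are extracted by combining the LDA and TextRank algorithms.
Using Django, we visualize the total number of crawled news articles and a chart of the number of articles per keyword in each category.
Clicking a keyword in the chart lists the news articles that match it, along with the original URL and a summary of each.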
OS: Ubuntu 20.04 LTS
Python version==3.8.5
Django version==3.2.4
If you get a crontab error, check your OS first. crontab is not supported on Windows, so you must use a Linux OS to use crontab.
Our project may also not work with Python versions below Python 3. Please use Python 3 or later.
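The two requirements above can be checked before setup. The following is a minimal sketch, not part of the project; the function name `environment_problems` is hypothetical and only standard-library calls are used.

```python
import platform
import sys


def environment_problems(system=None, version=None):
    """Return a list of human-readable problems with the current environment.

    `system` and `version` default to the running interpreter's values;
    they are parameters only so the check can be exercised with other inputs.
    """
    system = system or platform.system()
    version = version or sys.version_info[:2]
    problems = []
    if system == "Windows":
        # crontab is a Unix facility, so scheduled jobs cannot be installed here.
        problems.append("crontab is not available on Windows; use a Linux OS")
    if version < (3, 0):
        problems.append("Python 3 or later is required")
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print(problem)
```

An empty result means neither of the two known problems was detected; it does not guarantee that every package in requirements.txt will install.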
We use Django 3.2.4 and MySQL to store and display data, BeautifulSoup4 and Goose3 for crawling, and gensim's LDA together with TextRank to extract keywords.
See requirements.txt for the packages to install.
Run the line below in your terminal:
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
You must create secrets.json and my_settings.py in the same directory as manage.py. In secrets.json, set SECRET_KEY, like this:
{
"SECRET_KEY" : "your secret key"
}
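How settings.py reads this file is not shown here, so the following is only a sketch of one common approach. The helper name `get_secret` and the idea of calling it from settings.py are assumptions, not the project's confirmed code.

```python
import json


def get_secret(name, path="secrets.json"):
    """Read one value from a JSON secrets file, failing loudly if it is missing."""
    with open(path, encoding="utf-8") as f:
        secrets = json.load(f)
    try:
        return secrets[name]
    except KeyError:
        # A clear message is easier to act on than a bare KeyError at startup.
        raise KeyError(f"Set {name!r} in {path}") from None


# Hypothetical use inside settings.py, assuming BASE_DIR is a pathlib.Path:
# SECRET_KEY = get_secret("SECRET_KEY", BASE_DIR / "secrets.json")
```

Keeping the key in a separate file means secrets.json can be excluded from version control while settings.py stays in the repository.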
In my_settings.py, define the database settings, like this:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',  # MySQL engine
        'NAME': 'oss',                         # database name
        'USER': 'root',                        # user name for the database connection
        'PASSWORD': 'PASSWORD',                # user password
        'HOST': '127.0.0.1',                   # database server address
        'PORT': '3306'                         # database server port
    }
}
After creating these two files, run the migrations:
$ python3 manage.py makemigrations
$ python3 manage.py migrate
If the migrations finish without errors, start the server:
$ python3 manage.py runserver
Now you can use the website, but it will not have any news data or keywords yet.
Our project uses django-crontab to run the crawling and keyword extraction jobs at fixed times every day: crawling at 00:00 and keyword extraction at 01:00. You can change these times by modifying CRONJOBS in settings.py.
CRONJOBS = [
('0 0 * * *', 'crawling.cron.article_crawling_job', '>> log file location'),
('0 1 * * *', 'keywords.cron.lda_job', '>> log file location'),
]
Change these entries to run the jobs whenever you want, but please do not open a pull request that changes the schedule.
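The schedule strings use the standard five-field cron syntax (minute, hour, day of month, month, day of week). As a sketch of how the times could be changed without hand-writing the expression, the hypothetical helper `daily_at` below builds one; it is not part of the project or of django-crontab.

```python
def daily_at(hour, minute=0):
    """Return a five-field cron expression that fires once a day at hour:minute."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    # Field order is: minute hour day-of-month month day-of-week.
    return f"{minute} {hour} * * *"


# Same schedule as the original entries: crawl at 00:00, extract keywords at 01:00.
CRONJOBS = [
    (daily_at(0), "crawling.cron.article_crawling_job", ">> log file location"),
    (daily_at(1), "keywords.cron.lda_job", ">> log file location"),
]
```

With django-crontab, edited jobs usually need to be registered again with `python3 manage.py crontab add` before the new times take effect.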
LDA and TextRank algorithms combined: README_keyword.md
TextRank algorithm implemented: README_summarize.md
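For readers new to TextRank, here is a rough, generic illustration of the idea for keyword ranking. It is not the project's implementation and does not include the LDA step; the function name, the window size, and the damping factor of 0.85 are assumptions chosen for the sketch. See the two README files above for what the project actually does.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a PageRank-style score over a co-occurrence graph."""
    # Link each word to the `window` words that follow it (undirected edges).
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Iteratively pass each word's score to its neighbors, split by degree.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = ["economy", "market", "economy", "bank", "economy", "rate"]
    print(textrank_keywords(sample, window=1))
```

Words that co-occur with many different words collect score from all of them, which is why frequently connected terms tend to rise to the top.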
If you want to contribute to our project, be sure to review the contribution guidelines. This project adheres to code_of_conduct. By participating, you are expected to read both documents.
We use GitHub issues to track requests and bugs and to improve the project. If you run into a problem with the project, please open a new issue.