A simple system for downloading and parsing Japanese news articles to aid in learning Japanese.

pattyler/article-parser

Article Parser - Japanese Article Downloader and Parser

Important: This project is a work in progress, and in the early stages of development. Many things, especially the front-end, have not been designed or mapped out yet. The first step here is setting up a reliable, extendable back-end system, after which the front-end can be designed.

General Information

About

A simple system for downloading and parsing Japanese news articles to aid in learning Japanese.

Article Parser retrieves articles from a list of sources and parses their content into a searchable database, allowing users to find real-world example sentences for the words they are studying. For example:

Word:     登る
Results:  1.  山梨県から登る道は1日にオープンしましたが、崩れた石で3450mより上に登ることができませんでした。
                                                                               ^^^^                                           
          2.  夏さんは14日、世界でいちばん高いエベレストの頂上に登ることに成功しました。
                                                             ^^^^
                 etc.  
                    .
                    .
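
The lookup shown above can be sketched in a few lines of Java. This is only an illustration of the idea, not the project's implementation: the class `SentenceSearch` and its methods are hypothetical names, and the sketch works on an in-memory string rather than the database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of article-parser.
class SentenceSearch {

    // Splits Japanese text into sentences on the full stop "。", keeping the delimiter.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '。') {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing full stop.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    // Returns every sentence that contains the word as a plain substring.
    static List<String> findExamples(String text, String word) {
        List<String> hits = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            if (sentence.contains(word)) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String article =
            "山梨県から登る道は1日にオープンしましたが、崩れた石で3450mより上に登ることができませんでした。"
            + "夏さんは14日、世界でいちばん高いエベレストの頂上に登ることに成功しました。";
        for (String sentence : findExamples(article, "登る")) {
            System.out.println(sentence);
        }
    }
}
```

A plain substring match only finds the dictionary form as written. Conjugated forms such as 登った or 登ります would need morphological analysis, which this sketch does not attempt.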

Motivation

This project came from wanting a system to find sample Japanese sentences, mixed with a need to get back up to scratch with Java and related frameworks.

Many students of Japanese (myself included) use the Anki flashcard application to aid vocabulary study. Like many others, I always put example sentences on my flashcards rather than single words. Instead of making my own sentences (which, if done every day, can be time-consuming), I thought it would be nice to find real-world example sentences online that match my interests, rather than the often mind-numbingly dull dictionary examples.

Use Cases

  • Use Case 1 (Beginner student)

Search NHK EasyNews for a word I wish to remember, such as 登る (noboru, "to climb"; see the example above).

NHK EasyNews is a simplified news site aimed at a younger audience, and as such uses simpler sentence structures, which is great for beginner learners looking for example sentences. Not every news article is interesting to everybody, so results could be further refined by topic or other semantic information (e.g. limited to business articles, or articles about hiking).

  • Use Case 2 (Any level)

Search an online blog for a word I wish to remember, or see in use.

Similar to the use case above, but rather than a news article I'd like to see more casual example sentences. Using the example of 登る, as above, I'd like to search a few mountain-climbing blogs to find example sentences written informally.

  • Use Case 3 (Intermediate - Advanced student)

Search Asahi Shinbun, NHK (full-site), and other newspaper sites for examples of grammar points, to see how they are used in various contexts.

In-depth news articles often contain a variety of grammar points, including some complex but important ones. I would like to see examples of a grammar point such as ○○ように (~youni, which can be used in a variety of ways) in different contexts, in order to aid my grammar study.

Information for Developers

Technologies Used

Built with:

Tested with:

For a full list of technologies, please consult the project POM files.

Project Structure

Important: This is the project structure as it stands during development of the back-end system. It will be expanded in the future to include other modules or services, such as the database-parsing service to find semantic information within the persisted articles, and the frontend-viewer for use by clients.

Overview

A multi-module Maven project, headed by the artifact article-parser:

|-- article-parser
|   |-- article-parser-backend
|   |-- article-parser-grabber
|   |-- article-parser-viewer

Module Description

article-parser

The parent project, containing dependencies and other information common across all modules.

article-parser-backend

Contains the backend, database-related code shared among the other modules, along with the SQL scripts for setting up the database.

article-parser-grabber

Services that retrieve data from websites, parsing them into models provided by article-parser-backend in order to be stored in the database. These services should be called periodically, such as from a cron job.

This module does not take care of parsing semantic meaning. It is only concerned with ensuring articles from the websites are downloaded and handed to article-parser-backend to be persisted in the database.
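
The division of labour described above might look roughly like the following sketch. Every name in it (`GrabberSketch`, `Article`, `ArticleSource`, `ArticleStore`, `grabAll`) is hypothetical and chosen for illustration; the real models and persistence code live in article-parser-backend and may be shaped quite differently. The stub source and the example.com URL stand in for a real website.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these types exist in the project under these names.
class GrabberSketch {

    // Stand-in for a model that would be provided by article-parser-backend.
    static final class Article {
        final String sourceName;
        final String url;
        final String body;

        Article(String sourceName, String url, String body) {
            this.sourceName = sourceName;
            this.url = url;
            this.body = body;
        }
    }

    // A website-specific service that downloads and parses articles.
    interface ArticleSource {
        List<Article> fetchLatest();
    }

    // Stand-in for the backend's persistence layer.
    interface ArticleStore {
        void save(Article article);
    }

    // Fetches from every source and hands each article to the store.
    // No semantic analysis happens here; that belongs to a later stage.
    static int grabAll(List<ArticleSource> sources, ArticleStore store) {
        int saved = 0;
        for (ArticleSource source : sources) {
            for (Article article : source.fetchLatest()) {
                store.save(article);
                saved++;
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        // A stub source returning one canned article instead of calling a website.
        ArticleSource stub = () -> Arrays.asList(
            new Article("stub-source", "https://example.com/articles/1", "山に登ることができました。"));

        // An in-memory list stands in for the database.
        List<Article> database = new ArrayList<>();
        int count = grabAll(Arrays.asList(stub), database::add);
        System.out.println("Saved " + count + " article(s)");
    }
}
```

Keeping the source and the store behind small interfaces is one way to let new websites be added without touching the persistence code, which matches the goal of an extendable back-end.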

article-parser-viewer

Not for production. For use during development, to check the database or retrieve any information you wish programmatically.

Setup

Requirements

Before following the setup instructions, please ensure you have the following:

Set up the environment

Currently, as the project is still in early development, getting a development environment set up is a bit messy, but can be achieved by running the following commands:

git clone https://github.com/pattyler/article-parser.git
cd article-parser
mkdir db
mkdir article-parser-backend/db
sudo mkdir /var/log/article-parser
sudo chmod 776 /var/log/article-parser
sudo chgrp $USER /var/log/article-parser
sqlite3 db/test.db
sqlite> .read article-parser-backend/src/main/resources/init-db.sql
sqlite> .quit
mvn package

The mkdir, chmod, and chgrp commands are to ensure the appropriate directory structure is set up. This process will be changed in the future.

The sqlite commands are to ensure the development database is initialised, ready to be used immediately by article-parser-grabber.

Run the application

To run the application, first make sure you are in the article-parser directory (the parent module). Then, run the following to populate the database with a few records and see the output:

cd article-parser-grabber/target
java -jar article-parser-grabber.jar
cd ../../article-parser-viewer/target
java -jar article-parser-viewer.jar -from 10
